# Assignment: Analyzing Wikipedia Pages

In this assignment, you will analyze a small fraction of [Wikipedia](https://en.wikipedia.org/wiki/Main_Page) pages. To work with Wikipedia pages, you will use the `wikipedia` module, which you may need to install first. How you install it depends on your Python and Jupyter setup. If you have administrator rights, you can use
``` 
conda install -c conda-forge wikipedia
```
if your installation uses `conda`, or
```
pip install wikipedia
```
which should work in any case. However, if you need to install it on a computer where you do not have administrator rights, use
```
pip install --user wikipedia
```
which installs the module in a subdirectory of your home directory. 

Official documentation for the `wikipedia` module can be found [here](https://wikipedia.readthedocs.io/en/latest/code.html).

In [124]:
import networkx as nx
import wikipedia

wikipedia.set_lang("en")    # we will search English version of Wikipedia

The `wikipedia` module lets us load Wikipedia pages and extract the links they contain. For example, let us download the page about the beetle [Bembidion ambiguum](https://en.wikipedia.org/wiki/Bembidion_ambiguum) and list all links from the page.

In [138]:
page = "Bembidion-ambiguum"
wiki = None    # stays None if the page cannot be loaded

try:
    wiki = wikipedia.page(page)
except wikipedia.DisambiguationError:
    print("Page title", page, "is ambiguous")
except wikipedia.PageError:
    print("No matching page for '{}' found, retrying with auto_suggest=False".format(page))
    try:
        wiki = wikipedia.page(page, auto_suggest=False)
        # report success only if the second attempt really worked
        print("auto_suggest=False helped!")
    except Exception:
        print("No matching page for '{}' found even with auto_suggest=False".format(page))
except Exception:
    print("Could not load page", page)

In [139]:
print("The Wikipedia page on '{}' has the title '{}' and contains {} links".format(page,wiki.title,len(wiki.links)))
for link in wiki.links:
    print(link,'|',link.title())

The Wikipedia page on 'Bembidion-ambiguum' has the title 'Bembidion ambiguum' and contains 20 links
Animal | Animal
Arthropod | Arthropod
Beetle | Beetle
Bembidion | Bembidion
Binomial nomenclature | Binomial Nomenclature
California | California
Doi (identifier) | Doi (Identifier)
Global Biodiversity Information Facility | Global Biodiversity Information Facility
Ground beetle | Ground Beetle
Insect | Insect
Mediterranean region | Mediterranean Region
PMC (identifier) | Pmc (Identifier)
PMID (identifier) | Pmid (Identifier)
Pierre François Marie Auguste Dejean | Pierre François Marie Auguste Dejean
Salt marsh | Salt Marsh
San Francisco Bay | San Francisco Bay
Taxonomy (biology) | Taxonomy (Biology)
Trechinae | Trechinae
Wikidata | Wikidata
Wikispecies | Wikispecies


Our goal is to analyze a notion that has a Wikipedia page and to find important notions related to it. We will do that by building and analyzing the so-called ego network around the given notion.

First, you should build the ego network, which is the subgraph of nodes that are close to the node representing the given notion. Then you will analyze the network.

Your task is to implement two functions and then use them to analyze given notions.

The first function 
```
getWikipediaNetwork(ego,depth=2)
```
should build a **directed** network whose nodes correspond to the titles of all Wikipedia pages at distance at most `depth` from the starting page `ego`. The distance from page *A* to page *B* is the minimal number of links that must be clicked to get from page *A* to page *B*. **Remark:** Your network should contain all nodes at distance at most `depth` from `ego`, but you should collect edges only for nodes at distance at most `depth - 1` (pages at distance exactly `depth` are not expanded).
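
The depth rule from the remark can be illustrated with a breadth-first search. The following is only a minimal sketch, not the required solution: the dictionary `toy_links` and the function name `bfs_edges` are hypothetical stand-ins for real Wikipedia downloads, and the result is returned as a plain node set and edge list rather than as a graph object.

```python
from collections import deque

def bfs_edges(links, ego, depth=2):
    """Return (nodes, edges) of the ego network around `ego`.

    `links` is a hypothetical stand-in for Wikipedia: a dict mapping
    a page title to the list of titles that the page links to.
    """
    distance = {ego: 0}      # shortest click distance from ego
    edges = []
    queue = deque([ego])
    while queue:
        page = queue.popleft()
        if distance[page] >= depth:
            continue         # pages at the boundary are not expanded
        for target in links.get(page, []):
            edges.append((page, target))
            if target not in distance:
                distance[target] = distance[page] + 1
                queue.append(target)
    return set(distance), edges

toy_links = {
    "A": ["B", "C"],
    "B": ["C", "D"],
    "C": ["A"],
    "D": ["E"],   # D is at distance 2, so this link is never followed
}
nodes, edges = bfs_edges(toy_links, "A", depth=2)
print(sorted(nodes))
print(edges)
```

In this toy example, page `E` does not enter the network because it is only reachable through `D`, which already lies at distance `depth`.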

Your implementation should omit several types of pages that create a dense structure close to almost every Wikipedia page. At a minimum, omit all pages whose titles end with the suffix "(identifier)" or "(Identifier)". The example output above contains such pages.
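
One possible way to recognize such titles is a case-insensitive suffix test. This is a small sketch under the assumption that checking the end of the title is enough; the helper name `is_omitted` is made up for illustration.

```python
def is_omitted(title):
    """True for titles ending in "(identifier)", in any letter case."""
    return title.strip().lower().endswith("(identifier)")

titles = ["Beetle", "Doi (identifier)", "Pmc (Identifier)", "Taxonomy (biology)"]
kept = [t for t in titles if not is_omitted(t)]
print(kept)
```

Other parenthesized suffixes, such as "(biology)", are kept because only the exact suffix is matched.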

In [140]:
def getWikipediaNetwork(ego,depth=2):
    # create a network - a directed graph of Wikipedia pages - around the
    # page with title ego, up to distance depth
    pass


The second function
```
getNetwork(name,depth=2,download=False)
```

should create the ego network (a directed network of type `nx.DiGraph`) of depth `depth` around the page with title `name`. If `download` is `True`, the network should be built by downloading pages from Wikipedia, and the built network should be stored in the file named

```
name+'.csv'
```
If `download` is `False`, the function checks whether the file `name+'.csv'` exists. If it does, the function assumes that
the file contains the edge list of the ego network around the notion given by the parameter `name`,
loads it, and returns it as an `nx.DiGraph`. Otherwise, it collects the network by calling
the function `getWikipediaNetwork(name,depth)`, stores it in the corresponding CSV file, and returns the network.
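
The caching step can be sketched with the standard `csv` module, which quotes titles that contain commas. The helper names `save_edges` and `load_edges`, the one-edge-per-row layout, and the file name `toy.csv` are assumptions made for this illustration; the assignment does not prescribe them.

```python
import csv
import os
import tempfile

def save_edges(edges, path):
    # one edge per row: source title, target title
    with open(path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(edges)

def load_edges(path):
    with open(path, newline="", encoding="utf-8") as f:
        return [(row[0], row[1]) for row in csv.reader(f) if row]

edges = [("Salt marsh", "San Francisco Bay"), ("Hello, World", "Beetle")]
path = os.path.join(tempfile.mkdtemp(), "toy.csv")

if not os.path.exists(path):     # nothing cached yet, so store the edges
    save_edges(edges, path)
restored = load_edges(path)
print(restored)
```

A list of edges like `restored` can be passed directly to `nx.DiGraph(...)` to obtain the directed graph.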

In [141]:
import os

def getNetwork(name,depth=2,download=False):
    # create a network around the Wikipedia page with title name. If download is False and a file
    # <name>.csv exists, the network is read from the file; otherwise it is built using the function
    # getWikipediaNetwork(name,depth)
    pass

Using the above functions (and possibly other suitable functions) implement the function

In [163]:
def analyzeEgo(G,ego,topNeighbours=25):
    # analyze the ego network G with the root node ego
    pass

that takes the ego network `G` around node `ego`, deletes all nodes of degree at most one from the network, and then finds the `topNeighbours` nodes with the highest **in-degree** in the network, excluding `ego`. The function should
* print the number of nodes and edges of the original network,
* print the number of nodes and edges of the network after truncation,
* print the values of at least two centrality measures for the `ego` node,
* draw the subgraph of `G` induced by the `topNeighbours` nodes with maximal in-degree together with the `ego` node, and
* return a list of pairs (node, in-degree) for the `topNeighbours` nodes with maximal in-degree.
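
The truncation and ranking steps can be illustrated on a toy edge list in plain Python. This sketch rests on several assumptions: nodes of degree at most one are removed in a single pass (not repeatedly), in-degrees are counted in the truncated network, and ties are broken alphabetically. The function name `top_in_degree` is hypothetical. With an `nx.DiGraph`, the same counts are available through `G.degree()` and `G.in_degree()`.

```python
from collections import Counter

def top_in_degree(edges, ego, top=3):
    edges = set(edges)                  # a DiGraph stores each edge once
    degree = Counter()                  # in-degree plus out-degree
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    keep = {n for n, d in degree.items() if d > 1}   # single-pass truncation
    in_degree = Counter({n: 0 for n in keep})
    for u, v in edges:
        if u in keep and v in keep:
            in_degree[v] += 1
    ranked = sorted(
        ((n, d) for n, d in in_degree.items() if n != ego),
        key=lambda pair: (-pair[1], pair[0]),        # high in-degree first
    )
    return ranked[:top]

toy_edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "A"),
             ("B", "D"), ("D", "C"), ("A", "E")]
result = top_in_degree(toy_edges, ego="A", top=2)
print(result)
```

Node `E` has degree one and is removed before ranking, and the ego `A` is excluded from the returned list.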

After that, use your functions to analyze three ego networks:
1. around `Network science`,
2. around `Complex network`, and
3. around your personal favorite Wikipedia page.

In each analysis, use and display the 20 nodes around the ego with maximal in-degree.

Are the networks comparable?

**Hints:**

* The ego network around `Bembidion-ambiguum` has approximately 8185 nodes and 9763 edges;
the degree of the node Bembidion-ambiguum is 17, and its truncated network has 1210 nodes and 2788 edges.
* The ego network around `Complex network` has approximately 13830 nodes and 26037 edges;
the degree of the node Complex network is 133, and its truncated network has 3504 nodes and 15711 edges.

Your values may differ slightly because Wikipedia keeps changing and some pages may be unavailable.

