# Homework 5 - Visit the Wikipedia hyperlinks graph!
In this assignment we analyze the Wikipedia hyperlink graph. In particular, given extra information about the categories to which each article belongs, we rank the articles according to some criteria. 

<div style="text-align:center"><img src="https://i.pinimg.com/originals/a7/5f/dc/a75fdcab110ae11f155ed96f428a86ae.png"/> </div>

## Research questions


**[RQ1]** Build the graph <img src="https://latex.codecogs.com/gif.latex?G=(V,&space;E)" title="G=(V, E)" /> where *V* is the set of articles and *E* the hyperlinks among them, and provide its basic information:
 
- Whether it is directed or not
- The number of nodes
- The number of edges 
- The average node degree. Is the graph dense?

###### Build the graph!

In [22]:
import json
import pandas as pd
from collections import defaultdict

In [65]:
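# Each row of the edge list is "<source>\t<target>".
with open('wiki-topcats-reduced.txt', 'r') as f:
    file = f.read().split('\n')
grafo = defaultdict(set)  # node -> set of nodes it links to
for row in file:
    link = row.split('\t')
    try:
        grafo[link[0]].add(link[1])
        # Make sure nodes with no outgoing links also appear as keys.
        if link[1] not in grafo:
            grafo[link[1]] = set()
    except IndexError:
        # Rows without a second field (e.g. a trailing empty line) add no edge.
        pass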

###### Find out if it's directed or not:

We check whether every node that node __62__ links to also has an edge back to node __62__.

In [66]:
print(all(["62" in grafo[edge] for edge in grafo['62']]))

False


The output `False` tells us that not every node pointed to by node __62__ has an edge back to node __62__. This counterexample proves that our graph is directed.
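The check above inspects a single node. As a minimal sketch, the same idea can be applied to a whole adjacency dictionary, stopping at the first edge that has no reverse edge. The helper `find_asymmetric_edge` and the toy graph are illustrative and not part of the original notebook.

```python
def find_asymmetric_edge(graph):
    """Return the first (u, v) such that u -> v exists but v -> u does not, or None."""
    for u, neighbours in graph.items():
        for v in neighbours:
            if u not in graph.get(v, ()):
                return (u, v)
    return None


# Toy adjacency dict in the same shape as `grafo`: node -> set of successors.
toy_graph = {"a": {"b"}, "b": {"a", "c"}, "c": set()}
witness = find_asymmetric_edge(toy_graph)
print(witness)
```

A non-`None` result is a witness that the graph is directed; `None` means every edge has a reverse edge.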

###### Get the number of nodes!

In [67]:
number_of_nodes=len(grafo)
number_of_nodes

461195

###### Get the number of edges!

In [68]:
number_of_edges= sum([len(grafo[node]) for node in grafo])
number_of_edges

2645247

###### Get the average node degree. Is the graph dense?

In graph theory, the degree (or valency) of a vertex of a graph is the number of edges incident to the vertex. The degree of a vertex $v$ is denoted $\deg(v)$.

In [6]:
avg_degree= 2*number_of_edges/number_of_nodes
avg_degree

11.47127353939223

As we can see, on average a node has 11-12 edges connected to it, counting both incoming and outgoing links.<br>
We calculated the average node degree with the following equation: $\frac{2 \cdot \vert E \vert}{\vert V \vert}$

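In mathematics, a dense graph is a graph in which the number of edges is close to the maximal number of edges. A directed graph with $\vert V \vert = 461195$ nodes can hold up to $\vert V \vert (\vert V \vert - 1) \approx 2.1 \cdot 10^{11}$ edges, while ours has about $2.6 \cdot 10^{6}$, so
we can conclude that the graph is quite sparse.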
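To put a number on the sparsity claim, here is a small sketch that computes the density of a directed graph as the ratio between the edges present and the maximum possible number of edges. The function name `directed_density` is ours, and self-loops are assumed not to count towards the maximum.

```python
def directed_density(num_nodes, num_edges):
    """Fraction of the possible directed edges (no self-loops) that are present."""
    return num_edges / (num_nodes * (num_nodes - 1))


# Node and edge counts taken from the cells above.
density = directed_density(461195, 2645247)
print(f"{density:.2e}")
```

A density this far below 1 means that only a tiny fraction of the possible links exists.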

## RQ2 
![wikis](https://image.slidesharecdn.com/exploringarticlenetworksonwikipediawithnodexl-150512201926-lva1-app6891/95/exploring-article-networks-on-wikipedia-with-nodexl-16-638.jpg?cb=1431630282)
Given a category $C_0 = \{article_1, article_2, ... \}$ as input, we want to rank all of the nodes in V according to the following criteria: <br>
The first category of the rank always corresponds to the input category **C_0**. The order of the remaining categories is given by:
<br>
<br>
<div style="text-align:center">
distance($C_0$, $C_i$) = median(ShortestPath($C_0$, $C_i$))
 </div>
<br>

The lower the distance from **C_0**, the higher the position of **C_i** in the rank. ShortestPath(**C_0**, **C_i**) is the set of all the shortest paths between the nodes of **C_0** and the nodes of **C_i**. 
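The distance definition can be sketched with a plain breadth-first search on an unweighted adjacency dictionary. This is only an illustration under our own assumptions: the names `bfs_distances` and `category_distance` are hypothetical, unreachable pairs are simply skipped, and the real `handler` module may work differently.

```python
from collections import deque
from statistics import median


def bfs_distances(graph, source):
    """Hop distance from source to every reachable node (unweighted BFS)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def category_distance(graph, c0, ci):
    """Median of the finite shortest-path lengths from nodes of c0 to nodes of ci."""
    lengths = []
    for source in c0:
        dist = bfs_distances(graph, source)
        lengths.extend(dist[target] for target in ci if target in dist)
    return median(lengths) if lengths else float("inf")


# Toy chain 1 -> 2 -> 3 -> 4 with two small categories.
toy_graph = {"1": {"2"}, "2": {"3"}, "3": {"4"}, "4": set()}
toy_distance = category_distance(toy_graph, c0={"1", "2"}, ci={"3", "4"})
print(toy_distance)
```

One BFS per source node yields the distances to all targets at once, which avoids recomputing a shortest path for every pair.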

First of all, we import our class, which contains everything we need to build the block ranking.

In [8]:
import handler

The constructor contains only a few instructions, which are needed for the initialization. <br>

The first line reads the input category from standard input. <br>

Next, we build the dictionary that maps each category to the articles that belong to it. In this step the only constraint is to use only the categories that contain more than __3500__ articles. Since many articles belong to multiple categories, we have to remove the duplicates according to a criterion that is explained later. <br> 

The last line loads the graph from an edge list. One problem we faced is that the header line __#FromNodeId  ToNodeId__ was missing from the edge list, which caused us some trouble at the very beginning. 


In [12]:
H = handler.Handler()

Windows_games


![mthreads](https://cdn-images-1.medium.com/max/2000/1*6DyyyyZqMPIaHhoVXPPxtg.jpeg)

To build the block ranking, the only thing to do is call the function __"scheduler"__, a multithreaded interface to the median calculator. It opens as many threads as it can and runs the function __"multithread_engine"__ in each one, with the name of the __C_i-th__ category as parameter (all the other variables it needs are inside the class). The results are stored in a JSON file with this format: __{'Name_of_i-th_category' : Score}__. The score is assigned via the following equation. <br> <br>

<div style="text-align:center"> $median(ShortestPath(C_0, C_i)) + \sqrt{Infs} \cdot \log(1+Infs)$ </div> <br>

* Infs is the number of missing paths from $C_0$ to $C_i$.
* The more missing paths there are, the higher the distance, and the lower the category $C_i$ ends up in the rank.
* We avoided inserting the infinite distances into the collection, so that they do not distort the median.
* If there are 0 missing paths, the second term of the equation is **0** and you end up with the plain median.
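The scoring rule can be written down in a few lines. This is a sketch under stated assumptions: the function name `penalised_score` is ours, the logarithm is taken as natural (the text does not state the base), missing paths are represented as `math.inf`, and a category with no finite path at all gets a median term of 0.

```python
import math
from statistics import median


def penalised_score(path_lengths):
    """Median of the finite lengths plus sqrt(infs) * log(1 + infs)."""
    finite = [d for d in path_lengths if d != math.inf]
    infs = len(path_lengths) - len(finite)  # number of missing paths
    base = median(finite) if finite else 0.0  # assumption: no finite path -> 0
    return base + math.sqrt(infs) * math.log(1 + infs)


no_missing = penalised_score([1, 2, 3])
four_missing = penalised_score([1, 2, 3] + [math.inf] * 4)
print(no_missing, four_missing)
```

With no missing paths the penalty vanishes and the score is the plain median; with four missing paths the same median is increased by $\sqrt{4} \cdot \log(5)$.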

<br>

Once you have all the results, the block ranking is obtained by merging the JSON files into a single one, sorted in ascending order of score. Since the computation is very expensive, we save it for future use.
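The scheduling and merging steps can be sketched with the standard library's thread pool. This is not the code of `handler.Handler.scheduler`: the function `rank_categories`, its parameters, and the toy scores are hypothetical, and the results are kept in memory instead of being written to JSON files. The `-1.0` placeholder for the input category mirrors the value that appears in the notebook's ranking output.

```python
from concurrent.futures import ThreadPoolExecutor


def rank_categories(categories, input_category, score_fn, max_workers=4):
    """Score every non-input category in parallel, then sort ascending by score."""
    others = [name for name in categories if name != input_category]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the order of `others`.
        scores = list(pool.map(score_fn, others))
    ranking = sorted(zip(others, scores), key=lambda pair: pair[1])
    # The input category always comes first, with a placeholder score.
    return [(input_category, -1.0)] + ranking


# Hypothetical precomputed scores standing in for the expensive median computation.
toy_scores = {"A": 3.0, "B": 1.5, "C": 2.0}
toy_ranking = rank_categories(["X", "A", "B", "C"], "X", lambda name: toy_scores[name])
print(toy_ranking)
```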

In [13]:
block_ranking = H.scheduler()

In [20]:
pd.DataFrame(block_ranking, columns=['Category', 'Distance']).head()

| | Category | Distance |
|---|---|---|
| 0 | Windows_games | -1.0 |
| 1 | American_films | 215.067515 |
| 2 | American_film_actors | 220.362446 |
| 3 | Members_of_the_United_Kingdom_Parliament_for_E... | 225.458473 |
| 4 | Indian_films | 232.414749 |


## Category cleaning and PageRank
![abba](https://www.andrearonzano.com/wp-content/uploads/2016/04/pagerank-esiste-ancora-andrea-ronzano-848x435.jpg)

As we said before, articles that belong to multiple categories must be kept in only one of them, according to the following rules: <br>

* If the article belongs to the input category, remove it from all other categories.

* Otherwise, the article is assigned to the category that is closest to the input category among those it belongs to.

### Before cleanup

We want to show that the cleanup is effective and produces the expected result.

So in the next chunk we show the map $< Category : Number ~ of ~ articles ~ in ~ that ~ category >$ 

In [40]:
cat2numarticles = {cat : len(H.categories[cat]) for cat in H.categories}
pd.DataFrame(list(cat2numarticles.items()),
                      columns=['Cat','Number of Articles']).head()

| | Cat | Number of Articles |
|---|---|---|
| 0 | English_footballers | 9237 |
| 1 | The_Football_League_players | 9467 |
| 2 | Association_football_forwards | 6959 |
| 3 | Association_football_goalkeepers | 3997 |
| 4 | Association_football_midfielders | 8270 |


We clean the categories with the following criterion: <br>


<div style="text-align:center"> $O  =  \bigcup_{i = 1}^{n} \bigcup_{j = i + 1}^{n} \left( C_i \cap C_j \right) $ </div> <br>
Where: <br>

* *n* is the number of categories
* $C_i$, $C_j$ are two arbitrary categories 
* O is the big set that contains all the articles that we have to remove

Once you have O, you simply intersect O with $C_i$ and remove the result from $C_i$.
The operation is applied following the order of the block ranking, so each article is kept in the category nearest to the input one. We've done this programmatically, so no multiple nested loops are needed to slow the computation down. Finally, the operation is in place, so the result does not need to be stored in another variable. 
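One way to obtain the same effect in a single pass is to walk the categories in block-ranking order and keep each article only in the first category that contains it. The sketch below is an illustration, not the code of `H.cat_builder()`: the function `clean_categories` and the toy data are hypothetical, and it returns a new dictionary instead of modifying the input in place.

```python
def clean_categories(categories, block_ranking):
    """Keep each article only in the first category (in ranking order) containing it."""
    assigned = set()  # articles already claimed by a closer category
    cleaned = {}
    for name in block_ranking:
        kept = [article for article in categories[name] if article not in assigned]
        cleaned[name] = kept
        assigned.update(kept)
    return cleaned


# Toy data: C0 is the input category, C1 is closer to it than C2.
toy_categories = {"C0": [1, 2, 3], "C1": [2, 4], "C2": [3, 4, 5]}
toy_cleaned = clean_categories(toy_categories, ["C0", "C1", "C2"])
print(toy_cleaned)
```

Because the set of assigned articles grows as the ranking is traversed, every article ends up in exactly one category and no pairwise intersections need to be computed.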

In [44]:
H.cat_builder()

To verify that all went well, we print the same frame as before with the updated map.

In [45]:
cat2numarticles_after = {cat : len(H.categories[cat]) for cat in H.categories}
pd.DataFrame(list(cat2numarticles_after.items()),
                      columns=['Cat','Number of Articles']).head()

| | Cat | Number of Articles |
|---|---|---|
| 0 | English_footballers | 932 |
| 1 | The_Football_League_players | 9437 |
| 2 | Association_football_forwards | 5139 |
| 3 | Association_football_goalkeepers | 184 |
| 4 | Association_football_midfielders | 183 |


As you can see, the number of articles inside each category dropped.
As another check, we compute the big set *"O"* again: if it's empty, we're fine!

In [46]:
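with open('dropthebayes.json', 'r') as f:
    block_ranking = [elem[0] for elem in json.load(f)]

# Accumulate the articles shared by any pair of distinct categories.
# bigset is created once, outside the loops, so that overlaps found for
# earlier categories are not discarded on the next iteration.
bigset = set()
visited = set()
for supcategory in block_ranking:
    visited.add(supcategory)
    for subcategory in block_ranking:
        if subcategory in visited:
            continue
        bigset |= set(H.categories[subcategory]).intersection(H.categories[supcategory])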

bigset

set()

Good!

What remains is to rank the articles according to the PageRank algorithm. <br>

The next chunk runs PageRank and stores the result in a dictionary $<Article : Score >$
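The body of `H.pagerank()` is not shown here, so as a reference point here is a textbook power-iteration PageRank on an adjacency dictionary. It is a generic sketch and not necessarily what the handler does: the function name, the damping factor of 0.85, the fixed number of iterations, and the toy graph are all our assumptions, and every link target is assumed to appear as a key of the dictionary.

```python
def pagerank(graph, damping=0.85, iterations=100):
    """Power-iteration PageRank on a dict mapping node -> set of successors."""
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by dangling nodes (no out-links) is spread uniformly.
        dangling = sum(rank[node] for node in nodes if not graph[node])
        new_rank = {node: (1.0 - damping) / n + damping * dangling / n for node in nodes}
        for node in nodes:
            successors = graph[node]
            if successors:
                share = damping * rank[node] / len(successors)
                for target in successors:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Toy graph: a cycle a -> b -> c -> a, plus d -> a.
toy_graph = {"a": {"b"}, "b": {"c"}, "c": {"a"}, "d": {"a"}}
toy_ranks = pagerank(toy_graph)
best = max(toy_ranks, key=toy_ranks.get)
print(best)
```

In this formulation the scores are probabilities that sum to 1, and a node scores higher when it is linked by many nodes or by highly ranked ones.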

In [52]:
ranks = H.pagerank()

The remaining part provides some visualizations: we want to show which articles are the "best" for the given input category.

In [95]:
topArticles = pd.DataFrame(sorted(ranks.items(), key = lambda lv: -lv[1]), columns=['Id','Score']).head()
topArticles

| | Id | Score |
|---|---|---|
| 0 | 1738167 | 55 |
| 1 | 1738168 | 49 |
| 2 | 1735504 | 38 |
| 3 | 1740913 | 37 |
| 4 | 1740964 | 34 |


We need a mapping between the article id and the article name, in order to merge the dataframes.

In [96]:
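# Map each article id to its name; each line of the file is "<id> <name>".
with open('wiki-topcats-page-names.txt', 'r') as f:
    id2name = {}
    for line in f:
        article_id, _, name = line.rstrip('\n').partition(' ')
        id2name[article_id] = name
id2article = pd.DataFrame(list(id2name.items()), columns=['Id','Article'])
id2article.head()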

| | Id | Article |
|---|---|---|
| 0 | 1736718 | Wario Blast: Featuring Bomberman! |
| 1 | 1736719 | List of Super Famicom and Super Nintendo sport... |
| 2 | 1736720 | International Superstar Soccer (video game) |
| 3 | 1736721 | Tecmo Super NBA Basketball |
| 4 | 1736722 | Hanna Barbera's Turbo Toons |


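An inner join between the two frames computed above attaches the article name to each score. Ids with no entry in the name mapping are dropped by the inner join, which is why the result has only four rows.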

In [97]:
topArticles.merge(id2article, on = 'Id', how = 'inner')

| | Id | Score | Article |
|---|---|---|---|
| 0 | 1738167 | 55 | Half-Life (video game) |
| 1 | 1738168 | 49 | Half-Life 2 |
| 2 | 1740913 | 37 | Myst |
| 3 | 1740964 | 34 | The Sims |


# Conclusions

According to the data we have, with *Windows_games* as the input category, Half-Life appears to be the top-ranked article.

![half](https://steamcdn-a.akamaihd.net/steam/apps/70/header.jpg?t=1530045175)