# Homework 5 - Visit the Wikipedia hyperlinks graph!
In this assignment we perform an analysis of the Wikipedia Hyperlink graph. In particular, given extra information about the categories to which an article belongs to, we are curious to rank the articles according to some criteria. 

<div style="text-align:center"><img src="https://i.pinimg.com/originals/a7/5f/dc/a75fdcab110ae11f155ed96f428a86ae.png"/> </div>

## Research questions


**[RQ1]** Build the graph <img src="https://latex.codecogs.com/gif.latex?G=(V,&space;E)" title="G=(V, E)" /> where *V* is the set of articles and *E* the hyperlinks among them, and provide its basic information:
 
- If it is direct or not
- The number of nodes
- The number of edges 
- The average node degree. Is the graph dense?

###### Build the graph!

In [1]:
from collections import defaultdict
import networkx as nx

In [2]:
file = open('wiki-topcats-reduced.txt','r').read().split('\n')
grafo = defaultdict(set)
for row in file:
        link=row.split('\t')
        try:
            grafo[link[0]].add(link[1])
            if link[1] not in grafo:
                grafo[link[1]] = set()
        except: 
            pass

###### Find out if it's directed or not:

We want to check if all the nodes that have edges coming form the node __62__ have an edge to the node __62__.

In [3]:
print(all(["62" in grafo[edge] for edge in grafo['62']]))

False


As you can see, the statement above tells us that not all the nodes that are pointed by the node __62__ have an edge to the node __62__ and this is the counterexample to proof that our graph is directed.

###### Get the number of nodes!

In [4]:
number_of_nodes=len(grafo)
number_of_nodes

461194

###### Get the number of edges!

In [5]:
number_of_edges= sum([len(grafo[node]) for node in grafo])
number_of_edges

2645247

###### Get the average node degree. Is the graph dense?

In graph theory, the degree (or valency) of a vertex of a graph is the number of edges incident to the vertex. The degree of a vertex $v$ is denoted $\deg(v)$.

In [6]:
avg_degree= 2*number_of_edges/number_of_nodes
avg_degree

11.471298412381774

As we see, the average node degree is slightly great than six.
In mathematics, a dense graph is a graph in which the number of edges is close to the maximal number of edges.
We can conclude that the graph is not dense. It is very sparse indeed.

## RQ2 
Given a category $C_0 = \{article_1, article_2, ... \}$ as input we want to rank all of the nodes in V according to the following criteria:

In [7]:
categories = defaultdict(list)
with open('wiki-topcats-categories.txt', 'r') as f:
    for row in f:
        splitted_row = row.split(' ')
        categories[splitted_row[0][9:-1]] = splitted_row[1:]

In [8]:
G = nx.read_edgelist('wiki-topcats-reduced.txt', nodetype=str, delimiter='\t', create_using=nx.DiGraph())

In [9]:
Gprime = G.subgraph(categories['Living_people'])

In [10]:
score={}
for edge in Gprime.edges():
    if edge[1] not in score:
        score[edge[1]]=1
    else:
        score[edge[1]]+=1

In [11]:
score['107']

7

In [12]:
score

{'581885': 17,
 '582079': 20,
 '588209': 3,
 '590304': 15,
 '22518': 35,
 '23565': 2,
 '23574': 5,
 '23593': 2,
 '23596': 2,
 '23619': 5,
 '23620': 2,
 '23642': 1,
 '23646': 2,
 '23647': 4,
 '23653': 1,
 '23663': 3,
 '23678': 3,
 '23688': 2,
 '23707': 5,
 '23731': 1,
 '23819': 33,
 '139287': 15,
 '572251': 175,
 '1179619': 115,
 '1250086': 13,
 '1383234': 399,
 '1403671': 91,
 '1541404': 1,
 '1591791': 3,
 '679817': 46,
 '692540': 24,
 '696108': 65,
 '696137': 3,
 '1021738': 15,
 '1061892': 103,
 '1172770': 5,
 '1172990': 7,
 '1174100': 17,
 '1175762': 48,
 '1178190': 23,
 '1178648': 57,
 '1179072': 52,
 '1179588': 146,
 '1181747': 235,
 '1184226': 196,
 '1184695': 212,
 '1186047': 42,
 '1191315': 10,
 '1222972': 15,
 '825998': 18,
 '1162063': 3,
 '1225015': 2,
 '1534776': 3,
 '761405': 4,
 '764469': 2,
 '385369': 9,
 '389449': 2,
 '418830': 4,
 '435793': 5,
 '442586': 189,
 '667148': 1,
 '668312': 14,
 '668380': 3,
 '668383': 18,
 '668409': 18,
 '668422': 33,
 '668515': 27,
 '668552':

In [13]:
# defing the score based on sum of weights of the in-edges 

LivingPscore = sorted(score.items(), key= lambda x: x[1], reverse=True)
LivingPscore

[('1400548', 3200),
 ('1400547', 2627),
 ('1400635', 2304),
 ('1180117', 977),
 ('1179311', 869),
 ('1400483', 809),
 ('539805', 771),
 ('1179931', 749),
 ('1178721', 718),
 ('1108354', 717),
 ('1400479', 715),
 ('1179873', 640),
 ('1061960', 617),
 ('1181772', 613),
 ('1400624', 566),
 ('539587', 545),
 ('1184448', 536),
 ('1179886', 521),
 ('1062053', 497),
 ('539786', 491),
 ('1400534', 481),
 ('1400542', 476),
 ('1179291', 458),
 ('1184029', 439),
 ('1184210', 430),
 ('1179622', 430),
 ('1174956', 429),
 ('1184073', 421),
 ('1179398', 419),
 ('1502533', 417),
 ('1061902', 415),
 ('1179838', 413),
 ('1179263', 407),
 ('1179691', 405),
 ('539770', 403),
 ('1184788', 401),
 ('1184044', 401),
 ('1383234', 399),
 ('1061920', 391),
 ('1184217', 387),
 ('813732', 386),
 ('1706002', 376),
 ('1179414', 362),
 ('1061971', 361),
 ('1179676', 354),
 ('1593337', 344),
 ('1179646', 344),
 ('1783682', 338),
 ('1184022', 337),
 ('1592499', 337),
 ('1022429', 336),
 ('1184864', 332),
 ('1184225', 3

In [14]:
# Extending the Graph to another node C1 

Gprime1 = G.subgraph(categories['Italian_nobility'])

In [15]:
score1={}
for edge in Gprime1.edges():
    if edge[1] not in score1:
        score1[edge[1]]=1
    else:
        score1[edge[1]]+=1

In [16]:
# defing the score based on sum of weights of the in-edges 

ItalianNscore = sorted(score1.items(), key= lambda x: x[1], reverse=True)
ItalianNscore

[('1765831', 5),
 ('1765847', 5),
 ('1765824', 4),
 ('1765845', 4),
 ('1765848', 4),
 ('1765837', 4),
 ('1765832', 3),
 ('942319', 3),
 ('1765823', 3),
 ('1345947', 3),
 ('1765846', 3),
 ('1345967', 3),
 ('1344730', 2),
 ('211867', 2),
 ('1765822', 2),
 ('1765839', 2),
 ('1346006', 2),
 ('1345981', 2),
 ('1345968', 2),
 ('1346001', 2),
 ('1346004', 2),
 ('1784323', 2),
 ('1765835', 2),
 ('1345946', 2),
 ('1345965', 2),
 ('1781830', 2),
 ('1346041', 2),
 ('1765834', 2),
 ('1346000', 2),
 ('211868', 2),
 ('1765819', 1),
 ('942300', 1),
 ('116239', 1),
 ('1345970', 1),
 ('1344684', 1),
 ('1346039', 1),
 ('1765991', 1),
 ('1344688', 1),
 ('1345975', 1),
 ('1765825', 1),
 ('1344732', 1),
 ('1784322', 1),
 ('1765820', 1),
 ('1765818', 1),
 ('1346003', 1),
 ('1765874', 1),
 ('177345', 1),
 ('1765817', 1),
 ('1765836', 1),
 ('211869', 1),
 ('1346040', 1)]