# Homework 5 - Visit the Wikipedia hyperlinks graph!
In this assignment we analyze the Wikipedia hyperlink graph. In particular, given extra information about the categories to which an article belongs, we want to rank the articles according to some criteria.

<div style="text-align:center"><img src="https://i.pinimg.com/originals/a7/5f/dc/a75fdcab110ae11f155ed96f428a86ae.png"/> </div>

## Research questions


**[RQ1]** Build the graph <img src="https://latex.codecogs.com/gif.latex?G=(V,&space;E)" title="G=(V, E)" /> where *V* is the set of articles and *E* the hyperlinks among them, and provide its basic information:
 
- Whether it is directed or not
- The number of nodes
- The number of edges 
- The average node degree. Is the graph dense?

###### Build the graph!

In [5]:
from collections import defaultdict, deque
from itertools import product

import networkx as nx

In [None]:
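# Adjacency dictionary: node -> set of nodes it links to
grafo = defaultdict(set)
with open('wiki-topcats-reduced.txt', 'r') as f:
    for row in f:
        link = row.rstrip('\n').split('\t')
        if len(link) < 2:
            continue  # skip blank or malformed lines
        grafo[link[0]].add(link[1])
        # make sure nodes without outgoing links are in the graph too
        if link[1] not in grafo:
            grafo[link[1]] = set()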

###### Find out if it's directed or not:

We want to check whether every node that node __62__ links to also has an edge back to node __62__.

In [None]:
print(all(["62" in grafo[edge] for edge in grafo['62']]))

As you can see, the check above tells us that not all the nodes pointed to by node __62__ have an edge back to node __62__. This counterexample proves that our graph is directed.
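The check above looks at a single node. As a rough sketch of the same idea applied to a whole graph, the snippet below lists the edges that have no reverse edge, using a small hand-made adjacency dictionary. The names `toy` and `non_reciprocal_edges` are illustrative and not part of the assignment code.

```python
def non_reciprocal_edges(adj):
    """Return the edges (u, v) for which the reverse edge (v, u) is missing."""
    return [(u, v) for u in adj for v in adj[u] if u not in adj.get(v, set())]

# Toy adjacency dictionary in the same shape as `grafo`: node -> set of successors
toy = {'a': {'b', 'c'}, 'b': {'a'}, 'c': set()}

missing = non_reciprocal_edges(toy)
print(missing)  # [('a', 'c')]: a non-empty list means the graph is directed
```

A single non-reciprocated edge is enough to conclude that the graph is directed.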

###### Get the number of nodes!

In [None]:
number_of_nodes=len(grafo)
number_of_nodes

###### Get the number of edges!

In [None]:
number_of_edges= sum([len(grafo[node]) for node in grafo])
number_of_edges

###### Get the average node degree. Is the graph dense?

In graph theory, the degree (or valency) of a vertex of a graph is the number of edges incident to the vertex. The degree of a vertex $v$ is denoted $\deg(v)$.

In [None]:
avg_degree= 2*number_of_edges/number_of_nodes
avg_degree

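As we can see, on average a node has 11-12 edges incident to it (counting both incoming and outgoing links).
In mathematics, a dense graph is a graph in which the number of edges is close to the maximal number of edges. Here each node is linked to only about a dozen of the many articles in the graph, so the number of edges is far from the maximum and
we can conclude that the graph is sparse, not dense.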
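To make the density argument concrete: a directed graph without self-loops can have at most $|V|(|V|-1)$ edges, so its density can be measured as $|E| / (|V|(|V|-1))$. Below is a minimal sketch of that formula; `directed_density` is an illustrative helper and the numbers are made up, not the real counts of this graph.

```python
def directed_density(n_nodes, n_edges):
    """Density of a directed graph without self-loops: |E| / (|V| * (|V| - 1))."""
    if n_nodes < 2:
        return 0.0
    return n_edges / (n_nodes * (n_nodes - 1))

# Hypothetical numbers, chosen only to illustrate the scale
example_density = directed_density(1000, 6000)
print(example_density)
```

With the notebook's variables this would be called as `directed_density(number_of_nodes, number_of_edges)`; a value far below 1 indicates a sparse graph.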

## RQ2 
Given a category $C_0 = \{article_1, article_2, ... \}$ as input we want to rank all of the nodes in V according to the following criteria:

In [6]:
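categories = defaultdict(list)
with open('wiki-topcats-categories.txt', 'r') as f:
    for row in f:
        splitted_row = row.split()  # split on whitespace, dropping the trailing newline
        # keep only the categories with more than 3500 articles
        if len(splitted_row[1:]) > 3500:
            # strip the prefix and trailing ';' (assuming names look like 'Category:Name;')
            categories[splitted_row[0][9:-1]] = splitted_row[1:]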

In [7]:
# Map every article to one of its categories (the last one listed wins)
inverted_index = {nodo: categoria for categoria in categories for nodo in categories[categoria]}

In [8]:
G = nx.read_edgelist('wiki-topcats-reduced.txt', nodetype=str, delimiter='\t', create_using=nx.DiGraph())

In [9]:
for node in G:
    if node in inverted_index:
        G.nodes[node]['cat'] = inverted_index[node]

In [10]:
C0 = 'English_footballers'
C1 = 'Association_football_midfielders'

In [11]:
cart_prod = product(categories[C0], categories[C1])

In [12]:
# graph is a networkx directed graph
# source is the source node represented as a string.
# target is the target node, also a string.
# Returns the list of nodes on a shortest path, or None if target is unreachable.

def shortest_path(graph, source, target):
    if source == target:
        return [source]
    visited = {source: None}  # maps each discovered node to its parent in the BFS tree
    to_visit = deque([source])
    while to_visit:
        visiting = to_visit.popleft()
        for vicino in graph.neighbors(visiting):
            if vicino in visited:
                continue  # already discovered: never enqueue a node twice
            visited[vicino] = visiting
            if vicino == target:
                # walk back through the parents to rebuild the path
                path = [target]
                while visited[path[-1]] is not None:
                    path.append(visited[path[-1]])
                return path[::-1]
            to_visit.append(vicino)
    return None

        
def spanning_tree(graph, source):
    # Breadth-first visit that yields, for every expanded node, a dict mapping
    # each newly discovered neighbor to its parent in the BFS tree.
    visited = {source: 'null'}
    to_visit = deque([source])
    while to_visit:
        visiting = to_visit.popleft()
        nuovi = {vicino: visiting for vicino in graph.neighbors(visiting) if vicino not in visited}
        yield nuovi
        visited.update(nuovi)
        to_visit.extend(nuovi)  # enqueue only the newly discovered nodes
    

In [13]:
path = shortest_path(G, '52', '107')

In [14]:
path

['52', '1163551', '1061284', '1061246', '1181401', '107']
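A path returned by a breadth-first search like the one above can be sanity-checked by verifying that every consecutive pair of nodes is actually an edge. The sketch below does this on a toy adjacency dictionary; `is_valid_path` and `toy` are illustrative names, not part of the notebook.

```python
def is_valid_path(adj, path):
    """True if every consecutive pair of nodes in `path` is an edge of `adj`."""
    return all(v in adj.get(u, set()) for u, v in zip(path, path[1:]))

# Toy directed graph: node -> set of successors
toy = {'1': {'2'}, '2': {'3'}, '3': set()}

ok = is_valid_path(toy, ['1', '2', '3'])
bad = is_valid_path(toy, ['1', '3'])
print(ok, bad)  # True False
```

With the networkx graph, the membership test could be replaced by `G.has_edge(u, v)`.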

In [None]:
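# Count the incoming edges of every node of Gprime.
# NOTE: Gprime is not defined in the cells above; it is assumed to be a networkx graph built beforehand.
score = defaultdict(int)
for _, target in Gprime.edges():
    score[target] += 1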
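The cell above counts, for each node, how many edges point to it (its in-degree). As a small sketch of the same counting with `collections.Counter`, here is a toy edge list standing in for `Gprime.edges()`; the edges are made up for illustration.

```python
from collections import Counter

# Toy edge list of (source, target) pairs, standing in for Gprime.edges()
edges = [('a', 'b'), ('c', 'b'), ('a', 'c')]

# Count how many times each node appears as the target of an edge
score = Counter(target for _, target in edges)
print(score.most_common())  # [('b', 2), ('c', 1)]
```

`most_common()` returns the nodes sorted by decreasing count, which is convenient when the score is used for ranking.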

In [51]:
source = '52'
dest = '107'

class Generator:
    # Wraps a generator and stores its return value in self.value
    def __init__(self, gen):
        self.gen = gen

    def __iter__(self):
        self.value = yield from self.gen


def wrapper(G, source, dest):
    # Yields the nodes of a shortest path walking from dest back to source;
    # the generator's return value is the path length (None if dest is unreachable).
    def depth(st, source, dest):
        parents = {}
        if dest != source:
            # merge the BFS levels into one child -> parent map until dest is found
            for level in st:
                parents.update(level)
                if dest in parents:
                    break
            else:
                return None  # dest is not reachable from source
        deep = 0
        yield dest
        while dest != source:
            dest = parents[dest]
            deep += 1
            yield dest
        return deep

    return Generator(depth(spanning_tree(G, source), source, dest))

In [52]:
v = wrapper(G, source, dest)

In [53]:
for i in v:
    print(i)
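The `Generator` class above relies on the fact that `yield from` evaluates to the return value of the generator it delegates to. The following self-contained sketch illustrates that mechanism with a toy generator; `countdown` and `Capture` are illustrative names, not part of the assignment code.

```python
def countdown(n):
    """Yield n, n-1, ..., 1 and return how many values were produced."""
    produced = 0
    while n > 0:
        yield n
        n -= 1
        produced += 1
    return produced


class Capture:
    """Same idea as the Generator class: store the wrapped generator's return value."""
    def __init__(self, gen):
        self.gen = gen

    def __iter__(self):
        # `yield from` forwards every item and evaluates to the return value
        self.value = yield from self.gen


c = Capture(countdown(3))
items = list(c)
print(items, c.value)  # [3, 2, 1] 3
```

The `value` attribute only exists once the iteration has finished, so it should be read after the loop.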