In [1]:
import networkx as nx
from collections import defaultdict

### [RQ1]
**Creating Graph**

In [2]:
DG = nx.DiGraph()

In [3]:
path = "C:/Users/Asia/"

# i.e. our_edges = {node1:[node2,node4,node5], ..., 1:[2, 3, 5], 2:[5], 3:[6] ....}
# our_edges = defaultdict(list)

with open(path + "wiki-topcats-reduced.txt", "r") as f:
    # create the graph: one directed edge per line, each with weight 1
    for line in f:
        article1, article2 = line.split()
        DG.add_edge(int(article1), int(article2), weight=1)
#         our_edges[int(article1)].append(int(article2))

In [40]:
list(DG.neighbors(52))

[401135, 1069112, 1163551]

**Basic Information**:
+ Whether the graph is directed or not - **Directed**
+ The number of nodes
+ The number of edges
+ The average node degree. Is the graph dense?

In [None]:
#TODO: Melis
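A minimal sketch of how the quantities listed above could be computed, assuming the graph is available as an iterable of `(source, target)` pairs. The function name `basic_info` and its return keys are illustrative, not part of the notebook. Nodes without any edge do not appear in an edge list, so they are not counted here.

```python
def basic_info(edges, directed=True):
    """Summarise a graph given as an iterable of (source, target) pairs.

    Hypothetical helper: isolated nodes are not visible in an edge list
    and are therefore ignored.
    """
    nodes = set()
    edge_set = set()
    for u, v in edges:
        nodes.update((u, v))
        # In the undirected case (u, v) and (v, u) are the same edge.
        edge_set.add((u, v) if directed else frozenset((u, v)))

    n, m = len(nodes), len(edge_set)
    # Directed: average in-degree equals average out-degree, m / n.
    # Undirected: every edge touches two nodes, so the average is 2m / n.
    avg_degree = (m / n if directed else 2 * m / n) if n else 0.0
    # Density: share of the possible edges (self-loops excluded) that exist.
    max_edges = n * (n - 1) if directed else n * (n - 1) // 2
    density = m / max_edges if max_edges else 0.0
    return {"nodes": n, "edges": m, "avg_degree": avg_degree, "density": density}


info = basic_info([(1, 2), (2, 3), (3, 1), (1, 3)])
print(info)
```

With the notebook's graph, `basic_info(DG.edges())` should work because `DG.edges()` yields `(source, target)` pairs; networkx also provides `DG.number_of_nodes()`, `DG.number_of_edges()` and `nx.density(DG)` directly. A density close to 0 means the graph is sparse.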

### [RQ2] 
Given a category $C_0 = \{article_1, article_2, \dots \}$ as input, we want to rank all the nodes in $V$ according to the block-ranking, where the blocks are the categories:
$$block_{RANKING} =\begin{bmatrix} C_0 \\ C_1 \\ \dots \\ C_c\\ \end{bmatrix}$$

Each category corresponds to a list of nodes.

The first category of the rank, $C_0$, always corresponds to the input category. The order of the remaining categories is given by:

$$distance(C_0, C_i) = median(ShortestPath(C_0, C_i))$$

The smaller the distance from $C_0$, the higher $C_i$ is placed in the ranking. $ShortestPath(C_0, C_i)$ is the set of all shortest paths between the nodes of $C_0$ and the nodes of $C_i$, and the median is taken over their lengths. The length of a path is the sum of the weights of its edges.
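Once the distances are known, the order of the blocks follows directly. The sketch below assumes the distances have already been computed and stored in a dictionary. The function name `block_order`, breaking ties by category index, and using `float("inf")` for categories that cannot be reached are all assumptions, not something the assignment text specifies.

```python
def block_order(input_category, distances):
    """Return the category indices in block-ranking order.

    distances: {category index: distance from the input category}.
    The input category always comes first. The others are sorted by
    increasing distance; ties are broken by index (an assumption).
    """
    others = [c for c in distances if c != input_category]
    others.sort(key=lambda c: (distances[c], c))
    return [input_category] + others


# Hypothetical distances from category 1 to a few other categories.
example = {3: 4.0, 4: 2.5, 29: float("inf"), 30: 2.5}
print(block_order(1, example))
```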

##### Reading the file with categories

In [4]:
with open(path + "wiki-topcats-categories.txt", "r") as f2:
    categories = {} # {category0 : [article1, article2, ...], ...., 5: [23, 45, 6]}
    for cat_indx, line in enumerate(f2.readlines()):
        categories[cat_indx] = list(map(int, line.split(";")[1].split()))

In [5]:
# So our C0 is categories[1]
C0 = categories[1]

#### Choosing categories which exist in our reduced graph:

In [6]:
# keep only the categories with at least one article in the reduced graph
selected_category_indx = []
for i in range(len(categories)):
    if any(article in DG for article in categories[i]):
        selected_category_indx.append(i)

In [7]:
len(selected_category_indx)

11985

In [8]:
selected_category_indx[:5]

[1, 3, 4, 29, 30]

#### Creating graph consisting of example categories

In [171]:
test_nodes = categories[1] + categories[3]

In [175]:
test_graph = DG.subgraph(test_nodes)

In [176]:
test_graph.nodes

NodeView((84354, 1194242, 537220, 89734, 85767, 1178634, 84108, 1043983, 350608, 1287952, 85268, 194583, 1058, 79139, 79143, 350247, 1525160, 826926, 1043638, 669757, 1260349, 1527616, 76871, 883273, 216650, 1047370, 499532, 604876, 827472, 1287891, 494420, 79069, 449761, 80237, 971629, 541169, 1445619, 540020, 919797, 538870, 1112567, 1011834, 1565695))

In [177]:
test_graph.edges

OutEdgeView([(540020, 538870), (538870, 540020)])

#### Task: Find $median(ShortestPath(C_0, C_1))$

In [None]:
# TODO: Dijkstra's algorithm - Gui
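One possible way to approach this task, sketched with the standard library only. It assumes the graph is given as a nested dictionary `{source: {target: weight}}`, and that pairs with no connecting path are simply left out of the median. The assignment text does not say how to treat unreachable pairs, so that choice is an assumption, as are the function names.

```python
import heapq
from statistics import median


def dijkstra(adjacency, source):
    """Shortest-path length from source to every reachable node."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry, a shorter path was already found
        for v, w in adjacency.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def category_distance(adjacency, cat_a, cat_b):
    """Median shortest-path length from the nodes of cat_a to those of cat_b.

    Assumption: unreachable pairs are skipped; if no pair is connected,
    the distance is infinite.
    """
    targets = set(cat_b)
    lengths = []
    for source in cat_a:
        dist = dijkstra(adjacency, source)
        lengths.extend(dist[t] for t in targets if t in dist and t != source)
    return median(lengths) if lengths else float("inf")


# Toy graph: 1 -> 2 -> 3 and 4 -> 3, all weights 1.
toy = {1: {2: 1}, 2: {3: 1}, 3: {}, 4: {3: 1}}
print(category_distance(toy, [1, 4], [2, 3]))
```

Running one Dijkstra search per node of $C_0$ keeps the cost proportional to the size of the input category rather than to the number of node pairs. networkx offers `nx.single_source_dijkstra_path_length(DG, source, weight="weight")` for the same per-source computation.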

In [None]:
# TODO: check the conditions from General Notes (after building the block-ranking vector?)

#### Officially our C0 is... categories[1]  :)
Now we have to check the points from the "General Notes".

In [None]:
selected_category_indx[0]

In [222]:
all_couples = [(i, j) for i in selected_category_indx for j in selected_category_indx if i < j]

In [221]:
# we have to check whether any two categories share articles
C0_indx = selected_category_indx[0]
C0 = categories[C0_indx]
ll = len(all_couples)

for ii, (c1, c2) in enumerate(all_couples):
    intersec = set(categories[c1]).intersection(categories[c2])
    if ii%20000==0: print("Done:.....{}%".format(round(ii/ll, 3)))
    if intersec:
        # compare with the index of C0 (category 0 is not among the selected ones)
        if c1 == C0_indx:
            # articles shared with C0 are removed from the other category
            categories[c2] = list(set(categories[c2])-intersec)
            print("--------------------")

Done:.....0.0%
Done:.....0.0%
Done:.....0.001%
Done:.....0.001%
Done:.....0.001%
Done:.....0.001%
Done:.....0.002%
Done:.....0.002%
Done:.....0.002%
Done:.....0.003%
Done:.....0.003%
Done:.....0.003%
Done:.....0.003%
Done:.....0.004%
Done:.....0.004%
Done:.....0.004%
Done:.....0.004%
Done:.....0.005%
Done:.....0.005%
Done:.....0.005%
Done:.....0.006%
Done:.....0.006%
Done:.....0.006%
Done:.....0.006%
Done:.....0.007%
Done:.....0.007%
Done:.....0.007%
Done:.....0.008%
Done:.....0.008%
Done:.....0.008%
Done:.....0.008%
Done:.....0.009%
Done:.....0.009%
Done:.....0.009%
Done:.....0.009%
Done:.....0.01%
Done:.....0.01%
Done:.....0.01%
Done:.....0.011%
Done:.....0.011%
Done:.....0.011%
Done:.....0.011%
Done:.....0.012%
Done:.....0.012%
Done:.....0.012%
Done:.....0.013%
Done:.....0.013%
Done:.....0.013%
Done:.....0.013%
Done:.....0.014%
Done:.....0.014%
Done:.....0.014%
Done:.....0.014%
Done:.....0.015%


KeyboardInterrupt: 

In [206]:
len(all_couples)

71814120
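The pairwise loop above appears intended to change a category only when it is paired with the input category. If that is the only overlap that has to be resolved (an assumption about the "General Notes", which are not reproduced here), a single pass over the categories is enough and the 71,814,120 pairs are not needed. The function name `remove_overlap_with_input` is illustrative.

```python
def remove_overlap_with_input(categories, input_index):
    """Drop from every other category the articles that also belong to
    the input category. Returns a new dict; the input is left unchanged.

    categories: {category index: list of article ids}
    """
    input_nodes = set(categories[input_index])
    cleaned = {}
    for idx, articles in categories.items():
        if idx == input_index:
            cleaned[idx] = list(articles)
        else:
            # keep the original order, skip articles already in the input category
            cleaned[idx] = [a for a in articles if a not in input_nodes]
    return cleaned


toy_categories = {1: [10, 11], 3: [11, 12], 4: [13]}
print(remove_overlap_with_input(toy_categories, 1))
```

Each article is looked up once in a set, so the cost grows with the total number of category memberships instead of with the number of category pairs.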

## Block Ranking Algorithm - Steps 1, 2, 3

### Step 1

In [10]:
induced_graph = DG.subgraph(C0)

In [11]:
induced_graph.edges

OutEdgeView([(540020, 538870), (538870, 540020)])

For each node, compute the sum of the weights of its in-edges.

$$score_{article_i} = \sum_{j \in \text{in-edges}(article_i)} w_j$$

In [21]:
def sum_weights_inedges(induced_graph):
    """Return {node: sum of the weights of its in-edges} for the given graph."""
    all_weights = defaultdict(int)
    # every edge (source, target) adds its weight to the score of the target
    for _, target, data in induced_graph.edges(data=True):
        all_weights[target] += data['weight']
    return all_weights

In [22]:
sum_weights_inedges(induced_graph)

defaultdict(int, {538870: 1, 540020: 1})

In [20]:
induced_graph.edges

OutEdgeView([(540020, 538870), (538870, 540020)])

### Step 2
Extend the graph to the nodes that belong to $C_1$, then compute the score of each article in $C_1$ as before.
**Note that the in-edges coming from the previous category, $C_0$, take as their weight the score of the node that sends the edge.**

In [27]:
C1 = categories[selected_category_indx[1]]
induced_graph2 = DG.subgraph(C1)
# calculate weights
sum_weights_inedges(induced_graph2)

defaultdict(int, {})

In [30]:
induced_graph2.nodes

NodeView((604876,))
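The subgraph induced by $C_1$ alone does not contain the edges arriving from $C_0$, so the note in Step 2 has no effect on it. Below is a rough sketch of one way to read Step 2, working on plain `(source, target, weight)` triples. The function name `block_scores` is hypothetical, and so is the interpretation that edges inside the current block keep their own weight while edges from outside both blocks are ignored.

```python
def block_scores(edges, previous_scores, block):
    """Score the nodes of the current block.

    edges:           iterable of (source, target, weight) triples
    previous_scores: {node: score} for the nodes of the previous block(s)
    block:           nodes of the current category
    """
    block = set(block)
    scores = {node: 0 for node in block}
    for source, target, weight in edges:
        if target not in block:
            continue
        if source in block:
            # edge inside the current block: use its own weight
            scores[target] += weight
        elif source in previous_scores:
            # edge from a previous block: use the score of the sender
            scores[target] += previous_scores[source]
    return scores


toy_edges = [(1, 2, 1), (2, 1, 1), (1, 3, 1), (2, 3, 1),
             (4, 3, 1), (3, 4, 1), (5, 4, 1)]
step1 = block_scores(toy_edges, {}, [1, 2])      # scores inside C0
step2 = block_scores(toy_edges, step1, [3, 4])   # C1, using the C0 scores
print(step1, step2)
```

With the notebook's graph, `DG.edges(data="weight")` yields triples in this form. Unlike `sum_weights_inedges`, this sketch also reports nodes whose score is 0.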