#### This notebook prepares the input data to be consumed by the Neural Network.

In [6]:
%load_ext autoreload
%autoreload 2

Run the cell below to extract an edge list from each .mtx file, which is the input format Node2Vec expects. The cell iterates over all .mtx files in `../data` and creates a corresponding .edgelist file for each one that does not already have one.

In [29]:
import os

data_files = os.listdir('../data')
for file in data_files:
    if file.endswith('.mtx'):
        file_name = file.replace('.mtx', '')
        file_edgelist = file_name + '.edgelist'
        if file_edgelist not in data_files:
            with open('../data/' + file) as f_mtx:
                lines = f_mtx.readlines()
            # Skip the two header lines (banner and size line); the rest are edges.
            # Separate handle names keep file_edgelist bound to the file name for the print.
            with open('../data/' + file_edgelist, 'w') as f_out:
                f_out.writelines(lines[2:])
            print(file_edgelist, 'created')
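
The extraction cell above assumes every .mtx file has exactly two header lines. As a more defensive sketch, the helper below (a hypothetical name, `mtx_to_edgelist_lines`, not part of the original notebook) assumes the Matrix Market coordinate layout: comment lines start with `%`, and the first remaining line is the size line.

```python
def mtx_to_edgelist_lines(mtx_lines):
    """Return only the edge lines of a Matrix Market coordinate file.

    Assumption: comment/banner lines start with '%', and the first
    non-comment line is the size line (rows, cols, entries).
    """
    # Drop comment lines and blank lines.
    body = [line for line in mtx_lines
            if line.strip() and not line.startswith('%')]
    # Skip the size line; every line after it describes one edge.
    return body[1:]


# Small in-memory sample with an extra comment line in the header.
sample = [
    '%%MatrixMarket matrix coordinate pattern symmetric\n',
    '% an extra comment line\n',
    '4 4 3\n',
    '2 1\n',
    '3 1\n',
    '4 2\n',
]
edges = mtx_to_edgelist_lines(sample)
print(edges)
```

With this helper, `writelines(lines[2:])` could become `writelines(mtx_to_edgelist_lines(lines))`, which would still work if a file carries more than one comment line.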

Now that the edge lists have been extracted from the .mtx files by the cell above, let's generate node embeddings (node2vec). For that I will use the code that the authors have shared on their [GitHub](https://github.com/aditya-grover/node2vec). But first we have to convert the scripts from Python 2 to Python 3 and replace "import node2vec" with "import node2vec3".

In [42]:
# ! 2to3 -w './node2vec'  # didn't work. No such file or directory error
# os.listdir('./node2vec')  # while this works!
# so converted main.py and node2vec.py using online translators

To generate the embeddings, use the parameters from the "shortest path distance" paper.

In [57]:
%%time
import os

# Create the output directory for the embeddings if it does not exist yet.
os.makedirs('../data/emb', exist_ok=True)
# ! python node2vec/main3.py --help
! python node2vec/main3.py --input ../data/socfb-OR.edgelist --output ../data/emb/socfb-OR.emd

Walk iteration:
1 / 10
2 / 10
3 / 10
4 / 10
5 / 10
6 / 10
7 / 10
8 / 10
9 / 10
10 / 10
Wall time: 18min 24s
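
To use the generated `.emd` file downstream, it has to be read back into memory. The sketch below assumes the converted `main3.py` still writes the word2vec text format used by the reference node2vec script: a header line with the node count and the dimension, then one line per node holding the node id followed by its vector. `read_embeddings` is a hypothetical helper, and integer node ids are an assumption based on the edge lists.

```python
import io


def read_embeddings(handle):
    """Parse word2vec-style text embeddings into {node_id: [floats]}.

    Assumption: the first line is '<num_nodes> <dim>' and every
    following line is '<node_id> <v1> ... <v_dim>'.
    """
    header = handle.readline().split()
    n_nodes, dim = int(header[0]), int(header[1])
    embeddings = {}
    for line in handle:
        parts = line.split()
        if not parts:
            continue  # ignore blank lines
        vector = [float(x) for x in parts[1:]]
        if len(vector) != dim:
            raise ValueError('unexpected vector length for node ' + parts[0])
        embeddings[int(parts[0])] = vector
    if len(embeddings) != n_nodes:
        raise ValueError('header node count does not match the file body')
    return embeddings


# In-memory sample standing in for a real .emd file.
sample = io.StringIO('2 3\n1 0.1 0.2 0.3\n2 -0.5 0.0 1.5\n')
emb = read_embeddings(sample)
print(emb)
```

For a real file, the same function could be called as `read_embeddings(open('../data/emb/socfb-OR.emd'))`, ideally inside a `with` block.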


In [1]:
from graph_proc import Graph
from logger import Logger

logger = Logger('../outputs/logs', 'log_')
graph = Graph('../data/socfb-American75.mtx', logger)
graph.calculate_distances(1)  # nodes 1187 and 5780 form an isolated cycle, which is why the search kept looping
graph.distance_map[1][0:50]
# 6370 reachable nodes at ~13 seconds per source node: 6370 * 13 s is about 23 hours for all sources (good job :-p )
# We only need ~100 source nodes, so roughly 22 minutes, but the BFS still needs optimizing.

100%|█████████▉| 6370/6386 [00:15<00:00, 408.14it/s]


array([0., 3., 3., 2., 3., 3., 2., 3., 2., 2., 2., 2., 2., 3., 3., 2., 3.,
       4., 2., 2., 2., 2., 2., 2., 3., 2., 2., 3., 2., 2., 3., 3., 2., 3.,
       2., 3., 3., 2., 3., 2., 2., 3., 2., 2., 3., 3., 3., 2., 3., 3.])
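
The `Graph` class is not shown in this notebook, so the following is only a minimal sketch of a queue-based BFS, not the implementation in `graph_proc`. It assumes the graph is given as an adjacency dict (a hypothetical representation). Each node is enqueued at most once, so the search terminates even when some components are cut off from the source; those nodes simply keep a distance of infinity.

```python
from collections import deque
import math


def bfs_distances(adjacency, source):
    """Hop distances from source to every node; unreachable nodes stay inf."""
    dist = {node: math.inf for node in adjacency}
    dist[source] = 0
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            # A node still at inf has not been visited yet.
            if dist[neighbor] == math.inf:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


# Toy graph: nodes 4 and 5 form a pair that is disconnected from node 0.
adjacency = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1], 4: [5], 5: [4]}
dist = bfs_distances(adjacency, 0)
print(dist)
```

The running time is linear in the number of nodes plus edges per source, so processing around 100 source nodes means 100 independent calls.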

The result is saved as a pickled dict for analysis. As the output below shows, the only nodes left out are the 16 that are disconnected from the source, and each of them forms an isolated pair with one other node.

In [2]:
import pickle
import numpy as np
from scipy import io

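mtx_path = '../data/socfb-American75.mtx'  # forward slashes avoid invalid escapes such as '\d'
mat_csr = io.mmread(mtx_path).tocsr()
with open('../outputs/distance_map_1588710068.714855.pickle', 'rb') as f:
    distance_map = pickle.load(f)
distances = np.array(distance_map[1])
hitlist = np.where(distances == np.inf)[0]  # nodes unreachable from source 1
for i in hitlist:
    # Print each unreachable node together with its neighbours.
    print(i, '--', np.where(mat_csr[i].toarray()[0] > 0)[0])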

1187 -- [5780]
1460 -- [1599]
1571 -- [5399]
1599 -- [1460]
1790 -- [5975]
3427 -- [4080]
4080 -- [3427]
4166 -- [5418]
4681 -- [4849]
4849 -- [4681]
4867 -- [5183]
5183 -- [4867]
5399 -- [1571]
5418 -- [4166]
5780 -- [1187]
5975 -- [1790]
