#### This notebook prepares the input data to be consumed by the Neural Network.

In [3]:
%load_ext autoreload
%autoreload 2

Run the cell below to extract an edge list from each .mtx file (the input format Node2Vec expects). It iterates over all the .mtx files in ../data and creates a corresponding .edgelist file for every one that does not have one yet.

In [29]:
import os

data_files = os.listdir('../data')
for file in data_files:
    if file.endswith('.mtx'):
        file_name = file.replace('.mtx', '')
        file_edgelist = file_name + '.edgelist'
        if file_edgelist not in data_files:
            with open('../data/' + file) as file_mtx:
                lines = file_mtx.readlines()
            # Skip the first two lines (assumed to be the MatrixMarket banner
            # and the size line) so that only the edge rows remain.
            with open('../data/' + file_edgelist, 'w') as out_file:
                out_file.writelines(lines[2:])
            # Print the file name; the handle uses a separate variable name.
            print(file_edgelist, 'created')
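
The cell above drops the first two lines of each file, which works when the .mtx file has exactly one banner line followed by the size line. As a hedged alternative, here is a minimal sketch that skips every `%` comment line and then the size line, so files with extra comment lines are handled too. The helper name `mtx_to_edgelist` and the demo file are made up for illustration and are not part of the original notebook.

```python
import os
import tempfile


def mtx_to_edgelist(mtx_path, edgelist_path):
    """Copy the edge rows of a coordinate-format .mtx file to an edge list.

    Skips all '%' comment lines and the first non-comment line (the size line).
    Returns the number of edge rows written.
    """
    n_edges = 0
    size_line_seen = False
    with open(mtx_path) as src, open(edgelist_path, 'w') as dst:
        for line in src:
            stripped = line.strip()
            if not stripped or stripped.startswith('%'):
                continue  # blank line or comment/banner line
            if not size_line_seen:
                size_line_seen = True  # "rows cols nnz" line, not an edge
                continue
            dst.write(stripped + '\n')
            n_edges += 1
    return n_edges


# Tiny demonstration on a hand-written file (hypothetical data).
demo_dir = tempfile.mkdtemp()
demo_mtx = os.path.join(demo_dir, 'demo.mtx')
demo_edgelist = os.path.join(demo_dir, 'demo.edgelist')
with open(demo_mtx, 'w') as f:
    f.write('%%MatrixMarket matrix coordinate pattern symmetric\n'
            '% an extra comment line\n'
            '3 3 2\n'
            '2 1\n'
            '3 2\n')
edges_written = mtx_to_edgelist(demo_mtx, demo_edgelist)
print(edges_written)
```

With the extra comment line present, slicing with `lines[2:]` would have written the size line as if it were an edge; the sketch avoids that.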

Now that we have extracted edge lists from the .mtx files using the cell above, let's generate node embeddings (node2vec). For that I will use the code that the authors have shared on their [Github](https://github.com/aditya-grover/node2vec). But first we have to convert the scripts from python2 to python3 and replace "import node2vec" with "import node2vec3".

In [42]:
# ! 2to3 -w './node2vec'  # didn't work. No such file or directory error
# os.listdir('./node2vec')  # while this works!
# so converted main.py and node2vec.py using online translators

To generate the embeddings, use the parameters from the "shortest path distance" paper.

In [57]:
%%time
import os

if not os.path.exists('../data/emb'):
    os.makedirs('../data/emb')
# ! python node2vec/main3.py --help
! python node2vec/main3.py --input ../data/socfb-OR.edgelist --output ../data/emb/socfb-OR.emd

Walk iteration:
1 / 10
2 / 10
3 / 10
4 / 10
5 / 10
6 / 10
7 / 10
8 / 10
9 / 10
10 / 10
Wall time: 18min 24s


The Graph class has a <i>naive</i> implementation of Dijkstra's Algorithm that calculates the distance of every node from a specified source node. It is slow, but since we only need to run it for the landmarks (number of landmarks << number of nodes), I will go ahead with it.

In [1]:
from graph_proc import Graph
from logger import Logger

logger = Logger('../outputs/logs', 'log_')
graph = Graph('../data/socfb-American75.mtx', logger)
save_path = graph.process_landmarks()

number of landmarks: 150
100%|██████████| 150/150 [37:05<00:00, 14.83s/it]
save path: ../outputs/distance_map_1588792161.8904061.pickle
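
Because the graphs here are unweighted, a breadth-first search gives the same single-source distances as Dijkstra's Algorithm. The sketch below is not the `Graph` class from `graph_proc` (its internals are not shown in this notebook); it is a minimal illustration that assumes the graph is available as a plain dict mapping each node to its neighbours. The function name `bfs_distances` and the toy graph are made up.

```python
from collections import deque
import math


def bfs_distances(adjacency, source):
    """Hop distance from `source` to every node; math.inf if unreachable.

    adjacency: dict mapping node -> iterable of neighbouring nodes.
    """
    dist = {node: math.inf for node in adjacency}
    dist[source] = 0
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if dist[neighbour] == math.inf:  # not visited yet
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


# Toy graph: a path 1-2-3 plus an isolated pair 4-5.
toy_graph = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4]}
toy_dist = bfs_distances(toy_graph, 1)
print(toy_dist)
```

Nodes 4 and 5 keep an infinite distance from source 1, which is the same kind of "isolated" node discussed in this notebook.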



The result is saved in a pickle file (a dict) for analysis. The only nodes left out are the isolated ones (disconnected from the source), each of which is connected to just one other node (isolated cycles). The cell below counts the landmarks that have unreachable nodes; the same set of isolated nodes is found for all of the landmarks, so we can ignore them.

In [2]:
import pickle
import numpy as np
from scipy import io

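save_path = '../outputs/distance_map_1588792161.8904061.pickle'
# Forward slashes work on every OS and avoid invalid escapes such as '\d' and '\s'.
mtx_path = '../data/socfb-American75.mtx'
mat_csr = io.mmread(mtx_path).tocsr()
with open(save_path, 'rb') as f:  # close the file once the pickle is loaded
    distance_map = pickle.load(f)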
keys = list(distance_map.keys())
count = 0
for key in keys:
    l = np.array(distance_map[key])
    hitlist = np.where(l==np.inf)[0]
    # print('Number of isolated keys for source-{} is {}'.format(key, len(hitlist)))
    if(len(hitlist) > 0):
        count += 1
    # for i in hitlist:
    #     print(i, '--', np.where(mat_csr[i].toarray()[0]>0)[0])
    # if(len(hitlist)>0):
    #     break
print('Number of sources for which any isolated nodes found are', count)

Number of sources for which any isolated nodes found are 150
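
The count above only says that every landmark has some unreachable nodes. To check the stronger claim that the unreachable set is identical for all landmarks, something like the following could be used. It is a sketch that assumes `distance_map` maps each landmark to a sequence of distances with `inf` for unreachable nodes; the helper name `unreachable_sets` and the toy data are hypothetical.

```python
import math


def unreachable_sets(distance_map):
    """Map each landmark to the frozenset of 0-based positions with infinite distance."""
    return {
        landmark: frozenset(i for i, d in enumerate(distances) if math.isinf(d))
        for landmark, distances in distance_map.items()
    }


# Toy stand-in for distance_map: landmark -> list of distances.
toy_distance_map = {
    1: [0, 1, 2, math.inf, math.inf],
    3: [2, 1, 0, math.inf, math.inf],
}
sets_per_landmark = unreachable_sets(toy_distance_map)
# A single distinct set means every landmark misses exactly the same nodes.
all_identical = len(set(sets_per_landmark.values())) == 1
print(all_identical, sorted(next(iter(sets_per_landmark.values()))))
```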


Now we have to read the distance map and embeddings to form training data.

In [3]:
import numpy as np
import sys

emd_path = '../data/emb/socfb-American75.emd'
emd_map = {}
with open(emd_path, 'r') as file:
    lines = file.readlines()
    for line in lines[1:]:
        temp = line.split(' ')
        # np.int and np.float were removed in NumPy 1.24; the builtins are equivalent here.
        emd_map[int(temp[0])] = np.array(temp[1:], dtype=float)
print('size of emd_map:', sys.getsizeof(emd_map)/1024/1024,'MB')
print('size of distance_map:', sys.getsizeof(distance_map)/1024/1024,'MB')

size of emd_map: 0.28133392333984375 MB
size of distance_map: 0.00447845458984375 MB
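
Note that `sys.getsizeof` on a dict reports only the size of the dict object itself, not of the arrays it refers to, so the numbers above understate the real memory use. A rough way to measure the array payload, shown on made-up data (the helper name `embedding_map_nbytes` is hypothetical), is to sum `nbytes` over the values:

```python
import sys
import numpy as np


def embedding_map_nbytes(emd_map):
    """Sum the raw buffer sizes (in bytes) of all arrays stored in the dict."""
    return sum(vec.nbytes for vec in emd_map.values())


# Toy stand-in: 1000 nodes with 128-dimensional float64 embeddings.
toy_emd_map = {node: np.zeros(128, dtype=np.float64) for node in range(1, 1001)}
shallow_bytes = sys.getsizeof(toy_emd_map)          # dict container only
payload_bytes = embedding_map_nbytes(toy_emd_map)   # 1000 * 128 * 8 bytes
print(shallow_bytes, payload_bytes)
```

This still ignores per-object overhead of each ndarray and of the integer keys, so treat it as a lower bound.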


In [20]:
from tqdm.auto import tqdm

dataset_path = '../data/datasets/socfb-American75.pickle'

emd_dist_pair = []
for landmark in tqdm(list(distance_map.keys())):
    node_distances = distance_map[landmark]
    emd_dist_pair.extend([((emd_map[node]+emd_map[landmark])/2, distance) for node, distance in enumerate(node_distances, 1) if node != landmark and distance > 1])

print('length of embedding-distance pairs', len(emd_dist_pair))

100%|██████████| 150/150 [00:03<00:00, 38.86it/s]
length of embedding-distance pairs 946804
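
The filter in the cell above keeps every pair with `distance > 1`, which also lets through unreachable nodes because `inf > 1` is true. Whether those are dropped elsewhere in the pipeline is not shown here. As a hedged variant, the sketch below builds the same averaged-embedding features but skips non-finite distances explicitly. The function name `build_pairs` and the toy data are made up; node ids are assumed to be 1-based positions in each distance list, as in the cell above.

```python
import numpy as np


def build_pairs(emd_map, distance_map):
    """Return (averaged embedding, distance) pairs for finite distances > 1."""
    pairs = []
    for landmark, node_distances in distance_map.items():
        # Node ids are assumed to start at 1, matching enumerate(..., 1) above.
        for node, distance in enumerate(node_distances, 1):
            if node == landmark or not np.isfinite(distance) or distance <= 1:
                continue
            feature = (emd_map[node] + emd_map[landmark]) / 2
            pairs.append((feature, distance))
    return pairs


# Toy data: 4 nodes with 2-dimensional embeddings, one landmark (node 1).
toy_emd = {
    1: np.array([0.0, 0.0]),
    2: np.array([2.0, 0.0]),
    3: np.array([4.0, 2.0]),
    4: np.array([9.0, 9.0]),
}
toy_distances = {1: [0, 1, 2, np.inf]}  # node 4 is unreachable from node 1
toy_pairs = build_pairs(toy_emd, toy_distances)
print(len(toy_pairs), toy_pairs[0])
```

Only node 3 survives in the toy example: node 1 is the landmark, node 2 is a direct neighbour, and node 4 is unreachable.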



In [55]:
import sys

x = np.zeros((len(emd_dist_pair), len(emd_dist_pair[0][0])))
y = np.zeros((len(emd_dist_pair),))

for i, tup in enumerate(tqdm(emd_dist_pair)):
    x[i] = tup[0]
    y[i] = tup[1]
print("Shape of x={} and y={}".format(x.shape, y.shape))
print('size of x={} MB and y={} MB'.format(sys.getsizeof(x)/1024/1024, sys.getsizeof(y)/1024/1024))

100%|██████████| 946804/946804 [00:01<00:00, 628284.07it/s]
Shape of x=(946804, 128) and y=(946804,)
size of x=924.6133880615234 MB and y=7.2236328125 MB



Since the data takes up a lot of space, let's convert x and y to smaller datatypes. If you are worried about precision loss, you can save the converted data in a separate ndarray (x1) and compute "np.mean(np.abs(x-x1))". For this data the result was very small (2.7954226433144966e-09), so I am ignoring it. In our case the graphs are unweighted, so the distances are always integers.

In [56]:
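x = x.astype('float32')
# Explicit 32-bit width: plain 'int' is 64-bit on many platforms and would not shrink y.
y = y.astype('int32')
# Report the sizes of the converted arrays themselves.
print('size of x={} MB and y={} MB'.format(sys.getsizeof(x)/1024/1024, sys.getsizeof(y)/1024/1024))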

size of x=462.30674743652344 MB and y=3.6118621826171875 MB
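
The precision check described above can be tried on synthetic data before touching the real arrays. This is a sketch with randomly generated values (the names `x_demo`, `x_demo32` and `y_demo` are made up), so the error it prints will differ from the figure quoted for the real dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
x_demo = rng.normal(size=(1000, 128))      # float64, like the original x
x_demo32 = x_demo.astype(np.float32)       # converted copy, like x1

# Mean absolute difference introduced by the float64 -> float32 conversion.
mean_abs_error = np.mean(np.abs(x_demo - x_demo32))
print(mean_abs_error)

# The float32 copy needs half the memory of the float64 original.
print(x_demo.nbytes, x_demo32.nbytes)

# Integer-valued distances survive the cast to int32 unchanged.
y_demo = np.array([2.0, 3.0, 4.0])
y_lossless = np.array_equal(y_demo, y_demo.astype(np.int32))
print(y_lossless)
```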


Now let's split the data into training, validation and test datasets.