#### This notebook prepares the input data to be consumed by the Neural Network.

In [3]:
%load_ext autoreload
%autoreload 2

Run the cell below to extract the edgelist from .mtx file (the way Node2Vec expects the input). We will iterate through all the mtx files and create a corresponding .edgelist file

In [29]:
import os

data_files = os.listdir('../data')
for file in data_files:
    if file.endswith('.mtx'):
        file_name = file.replace('.mtx', '')
        file_edgelist = file_name+'.edgelist'
        if not file_edgelist in data_files:
            lines = None
            with open('../data/'+file) as file_mtx:
                lines = file_mtx.readlines()
            with open('../data/'+file_edgelist, 'w') as file_edgelist:
                file_edgelist.writelines(lines[2:])
                print(file_edgelist, 'created')

Now that we have extracted edgelists from .mtx files using above cell, let's generate node embeddings (node2vec). For that I will use the code that the author's have shared on there [Github](https://github.com/aditya-grover/node2vec). But first we have to convert the script from python2 to python3 and replace "import node2vec" with "import node2vec3".

In [42]:
# ! 2to3 -w './node2vec'  # didn't work. No such file or directory error
# os.listdir('./node2vec')  # while this works!
# so converted main.py and node2vec.py using online translators

For generating embeddings use the parameters used in the "shortest path distance" paper.

In [57]:
%%time
import os

if not os.path.exists('../data/emb'):
    os.makedirs('../data/emb')
# ! python node2vec/main3.py --help
! python node2vec/main3.py --input ../data/socfb-OR.edgelist --output ../data/emb/socfb-OR.emd

Walk iteration:
1 / 10
2 / 10
3 / 10
4 / 10
5 / 10
6 / 10
7 / 10
8 / 10
9 / 10
10 / 10
Wall time: 18min 24s


The Graph class has a <i>naive</i> implementation of Dijkstra's Algorithm to calculate distance of all the nodes from a specified source node. It is slow but since we need to run it for landmarks (number of landmarks << number of nodes) only I will go ahead with this. 

In [1]:
from graph_proc import Graph
from logger import Logger

logger = Logger('../outputs/logs', 'log_')
graph = Graph('../data/socfb-American75.mtx', logger)
save_path = graph.process_landmarks()

0%|          | 0/150 [00:00<?, ?it/s]number of landmarks: 150
100%|██████████| 150/150 [37:05<00:00, 14.83s/it]save path: ../outputs/distance_map_1588792161.8904061.pickle



The result is saved in a pickle file (dict) to analyse. As you can see below only isolated (disconnected from source) nodes are left out, which form a cycle with another node (isolated cycles). Same set of isolated nodes are found for all of the landmarks. So we can ignore them.

In [2]:
import pickle
import numpy as np
from scipy import io

save_path = '../outputs/distance_map_1588792161.8904061.pickle'
mtx_path = '..\data\socfb-American75.mtx'
mat_csr = io.mmread(mtx_path).tocsr()
distance_map = pickle.load(open(save_path, 'rb'))
keys = list(distance_map.keys())
count = 0
for key in keys:
    l = np.array(distance_map[key])
    hitlist = np.where(l==np.inf)[0]
    # print('Number of isolated keys for source-{} is {}'.format(key, len(hitlist)))
    if(len(hitlist) > 0):
        count += 1
    # for i in hitlist:
    #     print(i, '--', np.where(mat_csr[i].toarray()[0]>0)[0])
    # if(len(hitlist)>0):
    #     break
print('Number of sources for which any isolated nodes found are', count)

Number of sources for which any isolated nodes found are 150


All cells before this had to be run once to process the graph and save results to save time. Now we have to read the distance map and embeddings to form training data.

In [38]:
import numpy as np
import sys
import pickle

save_path = '../outputs/distance_map_1588792161.8904061.pickle'
distance_map = pickle.load(open(save_path, 'rb'))
emd_path = '../data/emb/socfb-American75.emd'
emd_map = {}
with open(emd_path, 'r') as file:
    lines = file.readlines()
    for line in lines[1:]:
        temp = line.split(' ')
        emd_map[np.int(temp[0])] = np.array(temp[1:], dtype=np.float)
print('size of emd_map:', sys.getsizeof(emd_map)/1024/1024,'MB')
print('size of distance_map:', sys.getsizeof(distance_map)/1024/1024,'MB')

size of emd_map: 0.28133392333984375 MB
size of distance_map: 0.00447845458984375 MB


In [39]:
from tqdm.auto import tqdm

dataset_path = '../data/datasets/socfb-American75.pickle'

emd_dist_pair = []
for landmark in tqdm(list(distance_map.keys())):
    node_distances = distance_map[landmark]
    emd_dist_pair.extend([((emd_map[node]+emd_map[landmark])/2, distance) for node, distance in enumerate(node_distances, 1) if node != landmark and distance != np.inf])

print('length of embedding-distance pairs', len(emd_dist_pair))

100%|██████████| 150/150 [00:04<00:00, 36.83it/s]length of embedding-distance pairs 955350



In [40]:
import sys

x = np.zeros((len(emd_dist_pair), len(emd_dist_pair[0][0])))
y = np.zeros((len(emd_dist_pair),))

for i, tup in enumerate(tqdm(emd_dist_pair)):
    x[i] = tup[0]
    y[i] = tup[1]
print("Shape of x={} and y={}".format(x.shape, y.shape))
print('size of x={} MB and y={} MB'.format(sys.getsizeof(x)/1024/1024, sys.getsizeof(y)/1024/1024))

100%|██████████| 955350/955350 [00:02<00:00, 409864.66it/s]Shape of x=(955350, 128) and y=(955350,)
size of x=932.9590911865234 MB and y=7.2888336181640625 MB



In [41]:
np.min(y), np.max(y)

(1.0, 7.0)

In [42]:
num_neg_dist = 0
distances_ = []
for i, landmark in enumerate(distance_map.keys()):
    distances_.extend(distance_map[landmark])
print('number of negative distances', np.sum(np.array(distances_) < 0))

number of negative distances 0


Since the data takes up a lot of space, let's convert the datatype of x and y. In case you are worried about the precision loss, I think you can save the converted data into separate ndarray(x1), and try "np.mean(np.abs(x-x1))". For this data it was very small (2.7954226433144966e-09),so ignoring it. And in our case graphs are unweighted, so distance would be integer always.

In [43]:
x = x.astype('float32')
y = y.astype('int')
print('size of x={} MB and y={} MB'.format(sys.getsizeof(x)/1024/1024, sys.getsizeof(y)/1024/1024))

size of x=466.47959899902344 MB and y=3.6444625854492188 MB


Now let's split the data into training, validation and test datasets.

In [44]:
from sklearn.model_selection import train_test_split
import torch

seed_random = 9999
np.random.seed(seed_random)
torch.backends.cudnn.deterministic = True
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(seed_random)

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, train_size=0.75, random_state=seed_random)
x_train, x_cv, y_train, y_cv = train_test_split(x_train, y_train, test_size=0.2, train_size=0.8, random_state=seed_random)

print('shapes of train, validation, test data', x_train.shape, y_train.shape, x_cv.shape, y_cv.shape, x_test.shape, y_test.shape)

shapes of train, validation, test data (573209, 128) (573209,) (143303, 128) (143303,) (238838, 128) (238838,)


In [45]:
from sklearn.preprocessing import MinMaxScaler

# TODO try standardization and no normalization also and compare result

mm_scaler = MinMaxScaler(feature_range=(0, 1))
x_train = mm_scaler.fit_transform(x_train)
x_cv = mm_scaler.transform(x_cv)
x_test = mm_scaler.transform(x_test)

In [46]:
batch_size = 2400
input_size = x_train.shape[1]
hidden_units_1 = 256
hidden_units_2 = 100
output_size = 1

lr = 1e-2
epochs = 100

In [47]:
from torch.utils import data as torch_data

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print("device:", device)

trainset = torch_data.TensorDataset(torch.as_tensor(x_train, dtype=torch.float, device=device), torch.as_tensor(y_train, dtype=torch.float, device=device))
train_dl = torch_data.DataLoader(trainset, batch_size=batch_size, drop_last=True)

val_dl = torch_data.DataLoader(torch_data.TensorDataset(torch.as_tensor(x_cv, dtype=torch.float, device=device), torch.as_tensor(y_cv, dtype=torch.float, device=device)), batch_size=batch_size, drop_last=True)

test_dl = torch_data.DataLoader(torch_data.TensorDataset(torch.as_tensor(x_test, dtype=torch.float, device=device), torch.as_tensor(y_test, dtype=torch.float, device=device)), batch_size=batch_size, drop_last=True)

device: cuda:0


In [58]:
from torchsummary import summary

torch.manual_seed(9999)
def get_model():
    model = torch.nn.Sequential(
        torch.nn.Linear(input_size, hidden_units_1),
        torch.nn.ReLU(),
        torch.nn.Linear(hidden_units_1, hidden_units_2),
        torch.nn.ReLU(),
        torch.nn.Linear(hidden_units_2, output_size),
        torch.nn.ReLU(),
        # torch.nn.Softplus(),
    )
    model.to(device)
    return model

def poissonLoss(xbeta, y):
    """Custom loss function for Poisson model."""
    loss=torch.mean(torch.exp(xbeta)-y*xbeta)
    return loss

model = get_model()

print('model loaded into device=', next(model.parameters()).device)
summary(model, input_size=(128, ))

lr_reduce_patience = 10
lr_reduce_factor = 0.05

loss_fn = torch.nn.MSELoss(reduction='mean')
optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9, dampening=0, weight_decay=0, nesterov=True)
lr_sched = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=lr_reduce_factor, patience=lr_reduce_patience, verbose=True, threshold=0.0001, threshold_mode='rel', cooldown=0, min_lr=1e-8, eps=1e-08)

model loaded into device= cuda:0
----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
            Linear-1                  [-1, 256]          33,024
              ReLU-2                  [-1, 256]               0
            Linear-3                  [-1, 100]          25,700
              ReLU-4                  [-1, 100]               0
            Linear-5                    [-1, 1]             101
              ReLU-6                    [-1, 1]               0
Total params: 58,825
Trainable params: 58,825
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.00
Forward/backward pass size (MB): 0.01
Params size (MB): 0.22
Estimated Total Size (MB): 0.23
----------------------------------------------------------------


In [59]:
def evaluate(model, dl):
    model.eval()
    final_loss = 0.0
    count = 0
    with torch.no_grad():
        for data_cv in dl:
            inputs, dist_true = data_cv[0], data_cv[1]
            count += len(inputs)
            outputs = model(inputs)
            loss = loss_fn(outputs, dist_true)
            final_loss += loss.item()
    return final_loss/len(dl)

In [60]:
%%time
# %load_ext tensorboard

import time
import copy
from tqdm.auto import tqdm
from utils import *
from torch.utils.tensorboard import SummaryWriter
# from tensorboardX import SummaryWriter

last_loss = 0.0
min_val_loss = np.inf
patience_counter = 0
early_stop_patience = 20
best_model = None
train_losses = []
val_losses = []

output_path = '../outputs'
model_save_path = output_path+'/models'
tb_path = output_path+'/logs'

writer = SummaryWriter(log_dir=tb_path+'/run4_4l_noround_relu', comment='', purge_step=None, max_queue=1, flush_secs=20, filename_suffix='')

torch.backends.cudnn.benchmark = True

# epoch_bar = tqdm(range(epochs), ncols=12000)  # tqdm really slows down training! 
# with torch.autograd.detect_anomaly():
for epoch in range(epochs):  # loop over the dataset multiple times
    running_loss = 0.0
    stime = time.time()
    
    for i, data in enumerate(train_dl, 0):
        # get the inputs; data is a list of [inputs, dist_true]
        model.train()
        inputs, dist_true = data[0], data[1]
        # zero the parameter gradients
        optimizer.zero_grad()

        # forward + backward + optimize
        outputs = model(inputs)
        loss = loss_fn(outputs, dist_true)
        
        loss.backward()
        optimizer.step()

        running_loss += loss.item()
        last_loss = loss.item()

    val_loss = evaluate(model, val_dl)
    lr_sched.step(val_loss)
    if val_loss < min_val_loss:
        min_val_loss = val_loss
        patience_counter = 0
        best_model = copy.deepcopy(model)
        print(epoch,"> Best val_loss model saved:", round(val_loss, 3))
    else:
        patience_counter += 1
    train_loss = running_loss/len(train_dl)
    train_losses.append(train_loss)
    val_losses.append(val_loss)
    writer.add_scalar('loss/train', train_loss, epoch)
    writer.add_scalar('loss/val', val_loss, epoch)

    if patience_counter > early_stop_patience:
        print("Early stopping at epoch {}. current val_loss {}".format(epoch, val_loss))
        break

    if epoch % 5 == 0:    
        print("epoch:{} -> train_loss={},val_loss={} - {}".format(epoch, round(train_loss, 5),                round(val_loss, 5), seconds_to_minutes(time.time()-stime)))
    # epoch_bar.update(1)
    # epoch_bar.set_description(desc="train_loss={},val_loss={},running_loss={}".format(round(last_loss,3), round(val_loss,3), running_loss))

print('Finished Training')
best_model_path = model_save_path+'/model_'+str(time.time())+'.pt'
torch.save(best_model.state_dict(), best_model_path)

0 > Best val_loss model saved: 0.464
epoch:0 -> train_loss=0.54745,val_loss=0.46425 - 0.0 minutes 22.0 seconds
1 > Best val_loss model saved: 0.464
2 > Best val_loss model saved: 0.464
3 > Best val_loss model saved: 0.464
4 > Best val_loss model saved: 0.464
5 > Best val_loss model saved: 0.464
epoch:5 -> train_loss=0.46452,val_loss=0.46382 - 0.0 minutes 20.0 seconds
6 > Best val_loss model saved: 0.464
7 > Best val_loss model saved: 0.464
8 > Best val_loss model saved: 0.464
9 > Best val_loss model saved: 0.464
10 > Best val_loss model saved: 0.464
epoch:10 -> train_loss=0.46443,val_loss=0.46375 - 0.0 minutes 21.0 seconds
11 > Best val_loss model saved: 0.464
12 > Best val_loss model saved: 0.464
13 > Best val_loss model saved: 0.464
14 > Best val_loss model saved: 0.464
15 > Best val_loss model saved: 0.464
epoch:15 -> train_loss=0.46439,val_loss=0.46373 - 0.0 minutes 21.0 seconds
16 > Best val_loss model saved: 0.464
17 > Best val_loss model saved: 0.464
18 > Best val_loss model sav

In [61]:
def test(model, dl):
    model.eval()
    final_loss = 0.0
    count = 0
    y_hat = []
    with torch.no_grad():
        for data_cv in dl:
            inputs, dist_true = data_cv[0], data_cv[1]
            count += len(inputs)
            outputs = model(inputs)
            y_hat.extend(outputs.tolist())
            loss = loss_fn(outputs, dist_true)
            final_loss += loss.item()
    return final_loss/len(dl), y_hat

model.load_state_dict(torch.load(best_model_path))
test_loss, y_hat = test(model, test_dl)
y_hat[0:10], y_test[0:10]

([[2.736067771911621],
  [2.731607437133789],
  [2.742253065109253],
  [2.740025520324707],
  [2.7375662326812744],
  [2.7345962524414062],
  [2.741856575012207],
  [2.7314066886901855],
  [2.7373125553131104],
  [2.7439823150634766]],
 array([3, 2, 3, 2, 3, 2, 3, 3, 3, 2]))

#### Things to try
* Next try one cycle lr sched
* Distance values of 2 & 3 dominate the data. Best way to handle?