## Utilities

In this notebook we load the data, normalize it, one hot encode the classes, and compute accuracy. All of these functions can be called from the training notebook when needed. The Cora dataset has 7 classes and 2708 papers. Each paper belongs to exactly one class, and each paper cites or is cited by at least one other paper. The one hot encoded label matrix therefore has size 2708 X 7, with a one in the column of the class each paper belongs to. Each citation is represented by an edge in the adjacency matrix.

In [8]:
import numpy as np
import torch
import scipy.sparse as sp

In [9]:
path = "./cora/"
dataset = "cora"
idx_features = np.genfromtxt("{}{}.content".format(path, dataset), dtype=np.dtype(str))

Let's first define the one hot encoding function.

Here we first extract the distinct classes from the labels using a [set](https://docs.python.org/3.6/library/stdtypes.html#set), an iterable data structure that keeps only unique values. We then build a dictionary that maps each class to its one hot vector, which is one row of an identity matrix whose size is the number of classes, so class number `i` gets a one in position `i`. Nothing here is specific to Cora, so you can use the same function for a larger citation dataset as well. Finally, we map every label to the one hot vector stored for its class in the dictionary.

In [50]:
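def one_hot_encoded(labels):
    # Sort the unique classes so the class-to-column mapping is the same on every run
    classes = sorted(set(labels))
    # Row i of the identity matrix is the one hot vector for class i
    identity = np.identity(len(classes))
    classes_dict = {c: identity[i, :] for i, c in enumerate(classes)}

    # Look up the one hot vector of every label
    one_hot_labels = np.array(list(map(classes_dict.get, labels)), dtype=np.int32)

    return one_hot_labels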
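
To make the encoding concrete, here is a minimal sketch on four toy labels. The helper name `one_hot_sketch` and the toy labels are made up for this example. The helper sorts the classes so the column order is reproducible, and the last line shows how the one hot rows turn back into integer class indices with `np.where`.

```python
import numpy as np

def one_hot_sketch(labels):
    # Hypothetical stand-in for the encoding function, written for this example
    classes = sorted(set(labels))
    eye = np.identity(len(classes), dtype=np.int32)
    lookup = {c: eye[i] for i, c in enumerate(classes)}
    return np.array([lookup[label] for label in labels], dtype=np.int32)

# Toy labels (not read from the dataset files)
toy_labels = ["Theory", "Neural_Networks", "Theory", "Rule_Learning"]

one_hot = one_hot_sketch(toy_labels)
# Column index of the single 1 in each row, i.e. the integer class index
class_ids = np.where(one_hot)[1]

print(one_hot)
print(class_ids)
```

Each row contains exactly one 1, and two papers with the same label get the same row.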

### Load Data

This function loads the dataset and converts it into a form we can feed to the network. It outputs the adjacency matrix, the features, the labels, and the train, validation, and test indices. Before we start, note one thing: a sparse matrix and a normal (dense) matrix generally cannot be combined directly in the same operation. We have to either convert both of them into [sparse matrices](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.coo_matrix.html#scipy.sparse.coo_matrix) or convert the sparse matrix into a dense one using [to_dense](https://pytorch.org/docs/stable/sparse.html#torch.sparse.FloatTensor.to_dense) in PyTorch.

We will use [np.genfromtxt](https://docs.scipy.org/doc/numpy/reference/generated/numpy.genfromtxt.html), which reads text data into whatever data type you ask for. The dataset is in the form `<paper_id> <word_attributes>+ <class_label>`. We separate the paper id by selecting index 0 of each row, the features by selecting index 1 up to (but not including) the last, and the class by selecting the last index. We convert the classes to one hot encoding, but remember that the labels we return are not really one hot encoded: they are plain integer class indices, since the loss function we will be using accepts integers rather than one hot vectors. We also map each paper id to an integer from 0 to n-1, where n is the total number of papers.
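
The column slicing can be tried on a few in-memory rows. This is only a sketch: the three rows below are made-up toy data in the same `<paper_id> <word_attributes>+ <class_label>` layout, with just three word attributes each, and `io.StringIO` stands in for the real `.content` file.

```python
import io
import numpy as np

# Toy rows in the .content layout (hypothetical ids and attributes)
sample = io.StringIO(
    "101\t0\t1\t0\tTheory\n"
    "205\t1\t0\t0\tRule_Learning\n"
    "309\t0\t0\t1\tTheory\n"
)

# Read everything as strings, then slice the columns apart
rows = np.genfromtxt(sample, dtype=np.dtype(str))
paper_ids = np.array(rows[:, 0], dtype=np.int32)     # first column
features = np.array(rows[:, 1:-1], dtype=np.float32)  # middle columns
class_names = rows[:, -1]                             # last column

# Map each paper id to its row index, 0 to n-1
idx_map = {int(paper_id): i for i, paper_id in enumerate(paper_ids)}

print(paper_ids, features.shape, class_names, idx_map)
```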

Now we load the cites data using np.genfromtxt. Each row contains two paper ids: the cited paper first and the citing paper second. We build our adjacency matrix from this edge list. For context, the adjacency matrix is a sparse matrix with 1s wherever there is an edge between two nodes. Note that we treat the edges as undirected, so the direction in the file does not matter and the matrix is symmetric: if there is an edge a->b, there is also an edge b->a. The step that converts directed edges to undirected ones is difficult to understand at first, but it is neatly coded to keep only the larger of the two weights for each pair of nodes. I will not take credit for the explanation; it is explained [here](https://github.com/yao8839836/text_gcn/issues/17). Finally, we return the train, validation, and test indices as tensors, along with the features, the labels, and the sparse adjacency matrix converted to a sparse tensor.

All of these tensors are the inputs for training with the GCN class we made in the models notebook.

In [55]:
def load_data(path="./cora/", dataset="cora"):
    print("Loading {} dataset...".format(dataset))

    # Each row of the .content file is: paper id, word attributes, class label
    idx_features_labels = np.genfromtxt("{}{}.content".format(path, dataset), dtype=np.dtype(str))
    idx = np.array(idx_features_labels[:, 0], dtype=np.int32)
    features = sp.csr_matrix(idx_features_labels[:, 1:-1], dtype=np.float32)
    labels = one_hot_encoded(idx_features_labels[:, -1])

    # Map each paper id to its row index (0 to n-1)
    idx_map = {j: i for i, j in enumerate(idx)}

    # Each row of the .cites file is a pair of paper ids; translate them to row indices
    edges_unordered = np.genfromtxt("{}{}.cites".format(path, dataset), dtype=np.int32)
    edges = np.array(list(map(idx_map.get, edges_unordered.flatten()))).reshape(edges_unordered.shape)

    # Build the directed adjacency matrix, then symmetrize it by keeping the larger entry of each pair
    adj = sp.coo_matrix((np.ones(edges.shape[0]), (edges[:, 0], edges[:, 1])), shape=(labels.shape[0], labels.shape[0]), dtype=np.float32)
    adj = adj + adj.T.multiply(adj.T > adj) - adj.multiply(adj.T > adj)

    # Row-normalize the features and the adjacency matrix with self-loops added
    features = normalize(features)
    adj = normalize(adj + sp.eye(adj.shape[0]))

    # Fixed train / validation / test split by row index
    idx_train = range(140)
    idx_val = range(200, 500)
    idx_test = range(500, 1500)

    # Convert everything to PyTorch tensors; labels become integer class indices
    features = torch.FloatTensor(np.array(features.todense()))
    labels = torch.LongTensor(np.where(labels)[1])
    adj = sparse_mx_to_torch_sparse_tensor(adj)

    idx_train = torch.LongTensor(idx_train)
    idx_val = torch.LongTensor(idx_val)
    idx_test = torch.LongTensor(idx_test)

    return adj, features, labels, idx_train, idx_val, idx_test
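
The symmetrization line in `load_data` is easier to follow on a tiny graph. The sketch below uses a made-up edge list with three nodes: the edge 0->1 appears in one direction only, while 1->2 and 2->1 both appear. After the step, every edge is present in both directions and no weight is doubled.

```python
import numpy as np
import scipy.sparse as sp

# Toy directed edge list (hypothetical): 0->1, 1->2, 2->1
edges = np.array([[0, 1], [1, 2], [2, 1]])
n = 3

adj = sp.coo_matrix(
    (np.ones(edges.shape[0]), (edges[:, 0], edges[:, 1])),
    shape=(n, n),
    dtype=np.float32,
)

# Where the transposed entry is larger, add it and drop the smaller original entry,
# which leaves the larger of adj[i, j] and adj[j, i] in both positions
mask = adj.T > adj
adj_sym = adj + adj.T.multiply(mask) - adj.multiply(mask)

dense = adj_sym.toarray()
print(dense)
```

The result equals its own transpose, which is what an undirected graph requires.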

### Normalize 

This normalization step is more of a hyperparameter: it is the choice made by the author of the paper, and other papers normalize differently. The row normalization used here makes every nonzero row sum to 1. The explanation is fairly mathematical.

First, you need to be introduced to the Laplacian matrix, which is a matrix representation of a graph. We can normalize it either with the random walk normalization or by forming the symmetric normalized Laplacian.

Here the random walk normalization is used:

`rowsum` gives the degree of each row, i.e. of each paper, where the degree is the sum of all the edge weights of a node. Then we take the inverse of each degree. The inverse of a degree of 0 is infinite, so we set those entries to 0. Finally, we make a diagonal matrix out of the inverse degrees and multiply it by the given matrix.

The formal definition is $L^{rw} = D^{-1}L$, where $D$ is the degree matrix and $L = D - A$ is the Laplacian matrix, with $A$ the adjacency matrix. Here the author of the paper did not use $L$ but used $A$ directly, so the function computes $D^{-1}A$.

The symmetric normalized Laplacian is more widely used:

$L^{sym} = D^{-\frac{1}{2}}LD^{-\frac{1}{2}}$
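
For comparison, here is a sketch of the symmetric variant applied directly to the adjacency matrix, i.e. $D^{-\frac{1}{2}}AD^{-\frac{1}{2}}$. This notebook does not use it. The helper name `normalize_symmetric` and the toy graph are assumptions made for illustration.

```python
import numpy as np
import scipy.sparse as sp

def normalize_symmetric(matrix):
    # Hypothetical helper: scale rows and columns by the inverse square root of the degree
    degree = np.asarray(matrix.sum(axis=1)).flatten()
    with np.errstate(divide="ignore"):
        d_inv_sqrt = np.power(degree, -0.5)
    d_inv_sqrt[np.isinf(d_inv_sqrt)] = 0.0  # isolated nodes have degree 0
    d_mat = sp.diags(d_inv_sqrt)
    return d_mat.dot(matrix).dot(d_mat)

# Toy undirected graph: node 0 is connected to nodes 1 and 2
A = sp.csr_matrix(np.array([[0, 1, 1],
                            [1, 0, 0],
                            [1, 0, 0]], dtype=np.float32))
A_hat = A + sp.eye(3)  # add self-loops

S = normalize_symmetric(A_hat).toarray()
print(S)
```

Unlike the row normalization, this keeps a symmetric input symmetric, but the rows no longer sum to 1.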

In [56]:
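def normalize(matrix):
    # Degree of each row (sum of its entries)
    rowsum = np.array(matrix.sum(axis=1))
    # Inverse degree; rows that sum to 0 produce inf, which we reset to 0
    r_inv = np.power(rowsum, -1).flatten()
    r_inv[np.isinf(r_inv)] = 0
    # Multiply by the diagonal matrix of inverse degrees to scale each row
    r_mat_inv = sp.diags(r_inv)
    matrix = r_mat_inv.dot(matrix)

    return matrix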
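
A quick check on a toy matrix shows what the row normalization does. The function `row_normalize` below repeats the same steps under a different, hypothetical name so that the snippet runs on its own. The toy matrix includes an all-zero row to show the division-by-zero handling.

```python
import numpy as np
import scipy.sparse as sp

def row_normalize(matrix):
    # Same steps as the normalization above, repeated here for a standalone example
    rowsum = np.array(matrix.sum(axis=1), dtype=np.float64)
    with np.errstate(divide="ignore"):
        r_inv = np.power(rowsum, -1).flatten()
    r_inv[np.isinf(r_inv)] = 0
    return sp.diags(r_inv).dot(matrix)

# Toy matrix; the middle row is all zeros
toy = sp.csr_matrix(np.array([[1, 1, 2],
                              [0, 0, 0],
                              [0, 3, 1]], dtype=np.float32))

normalized = row_normalize(toy).toarray()
row_sums = normalized.sum(axis=1)

print(normalized)
print(row_sums)
```

Every nonzero row now sums to 1, and the all-zero row stays at zero instead of turning into inf or NaN.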

In [58]:
def sparse_mx_to_torch_sparse_tensor(sparse_matrix):
    # COO format stores the row indices, column indices and values of the nonzero entries
    sparse_matrix = sparse_matrix.tocoo().astype(np.float32)
    indices = torch.from_numpy(np.vstack((sparse_matrix.row, sparse_matrix.col)).astype(np.int64))
    values = torch.from_numpy(sparse_matrix.data)
    shape = torch.Size(sparse_matrix.shape)
    # torch.sparse_coo_tensor replaces the deprecated torch.sparse.FloatTensor constructor
    return torch.sparse_coo_tensor(indices, values, shape)
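
As a small usage sketch, the conversion can be checked on a 2 X 2 matrix. The helper name `to_torch_sparse` is hypothetical and builds the tensor with `torch.sparse_coo_tensor`. The example then multiplies the sparse tensor by a dense feature matrix with `torch.sparse.mm`, which is the kind of product a graph convolution layer needs.

```python
import numpy as np
import scipy.sparse as sp
import torch

def to_torch_sparse(sparse_matrix):
    # Hypothetical helper: SciPy sparse matrix -> PyTorch sparse COO tensor
    coo = sparse_matrix.tocoo().astype(np.float32)
    indices = torch.from_numpy(np.vstack((coo.row, coo.col)).astype(np.int64))
    values = torch.from_numpy(coo.data)
    return torch.sparse_coo_tensor(indices, values, torch.Size(coo.shape))

# Toy row-normalized adjacency matrix
adj = sp.csr_matrix(np.array([[0.5, 0.5],
                              [0.0, 1.0]], dtype=np.float32))
adj_t = to_torch_sparse(adj)

# Sparse times dense product
x = torch.ones(2, 3)
out = torch.sparse.mm(adj_t, x)

print(adj_t.to_dense())
print(out)
```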

In [59]:
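def accuracy(output, labels):
    # The predicted class is the index of the largest score in each row
    preds = output.max(1)[1].type_as(labels)
    # Fraction of predictions that match the labels
    correct = preds.eq(labels).double()
    correct = correct.sum()
    return correct / len(labels)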
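
Here is a short sketch of the accuracy computation on made-up scores for four papers and three classes. The function is repeated under the hypothetical name `accuracy_sketch` so the snippet is self-contained. Three of the four predictions match the labels.

```python
import torch

def accuracy_sketch(output, labels):
    # Same steps as the accuracy function above, repeated for a standalone example
    preds = output.max(1)[1].type_as(labels)
    correct = preds.eq(labels).double().sum()
    return correct / len(labels)

# Toy model scores (one row per paper, one column per class)
output = torch.tensor([[2.0, 0.1, 0.3],
                       [0.2, 1.5, 0.1],
                       [0.3, 0.2, 0.9],
                       [1.2, 0.4, 0.1]])
# Integer class indices, as returned by the data loader
labels = torch.tensor([0, 1, 2, 1])

acc = accuracy_sketch(output, labels)
print(acc.item())
```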