# Introductory Work on the Cora Dataset


## Purpose:

To perform a simple exercise in predicting the category of the records contained in a dataset. The exercise focuses on simple classification methods, applied to features produced by a graph embedding method (node2vec). See the original paper [here](https://arxiv.org/abs/1607.00653).



## Data Description:

**From the README:**

The Cora dataset consists of Machine Learning papers. These papers are classified into one of the following seven classes:

		Case_Based
		Genetic_Algorithms
		Neural_Networks
		Probabilistic_Methods
		Reinforcement_Learning
		Rule_Learning
		Theory

The papers were selected in a way such that in the final corpus every paper cites or is cited by at least one other paper. There are 2708 papers in the whole corpus.

After stemming and removing stopwords we were left with a vocabulary of size 1433 unique words. All words with document frequency less than 10 were removed.

In [1]:
# Assumes StellarGraph is already installed (on Google Colab it must be installed first)
import sys
import os

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import display, HTML

import stellargraph as sg
from stellargraph import StellarGraph
from stellargraph import datasets
from stellargraph.data import BiasedRandomWalk
from stellargraph.mapper import FullBatchNodeGenerator
from stellargraph.layer import GCN

from tensorflow.keras import layers, optimizers, losses, metrics, Model
from sklearn import preprocessing, model_selection
from sklearn.manifold import TSNE
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import accuracy_score
from gensim.models import Word2Vec

%matplotlib inline

## Data Analysis

The dataset is loaded as a StellarGraph object and a summary of its nodes and edges is printed below:

In [2]:
dataset = sg.datasets.Cora()
display(HTML(dataset.description))
G, node_subjects = dataset.load()
print(G.info())

StellarGraph: Undirected multigraph
 Nodes: 2708, Edges: 5429

 Node types:
  paper: [2708]
    Features: float32 vector, length 1433
    Edge types: paper-cites->paper

 Edge types:
    paper-cites->paper: [5429]
        Weights: all 1 (default)
        Features: none


Node2vec uses random walks to sample a neighborhood for each node in the graph. Let $c_i$ be the $i$th node in a random walk, where $c_0 = u$ is the initial node. The walk is then defined by the transition probability:

$$ P(c_i = x \mid c_{i-1} = v) = \begin{cases} \frac{\pi_{vx}}{Z} & \text{if } (v,x) \in E \\ 0 & \text{otherwise} \end{cases} $$

Where:

* $\pi_{vx}$ is the unnormalized transition probability between nodes $v$ and $x$.
* $Z$ is the normalizing constant.
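
For reference, the node2vec paper defines the unnormalized transition probability through a search bias that depends on the node $t$ visited just before $v$. The symbols $t$, $w_{vx}$ (the edge weight, which is 1 for every edge in this graph), $d_{tx}$ (the shortest-path distance between $t$ and $x$), and the return and in-out parameters $p$ and $q$ follow the paper's notation and are not defined elsewhere in this notebook:

$$ \pi_{vx} = \alpha_{pq}(t, x) \, w_{vx}, \qquad \alpha_{pq}(t, x) = \begin{cases} \frac{1}{p} & \text{if } d_{tx} = 0 \\ 1 & \text{if } d_{tx} = 1 \\ \frac{1}{q} & \text{if } d_{tx} = 2 \end{cases} $$

A small $p$ makes the walk more likely to step back to $t$, while a large $q$ keeps it close to $t$ rather than moving outward.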


The code below implements the second-order random walk on the graph. In this case, we set the maximum length of each random walk to 100 and run 10 walks per node.

This is implemented and run through the BiasedRandomWalk algorithm below, which outputs a list of lists. Each inner list contains up to 100 node IDs from one random walk.
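
To make the sampling procedure concrete, the following is a minimal plain-Python sketch of a second-order biased walk on an unweighted graph. It is not StellarGraph's implementation: the function names `biased_walk` and `run_walks`, the toy adjacency dictionary `toy_adj`, and the choice of a uniform first step are all assumptions made for illustration.

```python
import random


def biased_walk(adj, start, length, p, q, rng):
    """Sample one node2vec-style walk of at most `length` nodes from `start`."""
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        neighbours = sorted(adj[cur])
        if not neighbours:
            break  # dead end: the walk stops early
        if len(walk) == 1:
            # Assumption: the first step has no previous node, so pick uniformly.
            walk.append(rng.choice(neighbours))
            continue
        prev = walk[-2]
        weights = []
        for x in neighbours:
            if x == prev:
                weights.append(1.0 / p)  # step back to the previous node
            elif x in adj[prev]:
                weights.append(1.0)  # stay at distance 1 from the previous node
            else:
                weights.append(1.0 / q)  # move further away
        # random.choices normalizes the weights, which plays the role of Z.
        walk.append(rng.choices(neighbours, weights=weights, k=1)[0])
    return walk


def run_walks(adj, length, n, p, q, seed=0):
    """Run `n` walks from every node and return them as a list of lists."""
    rng = random.Random(seed)
    walks = []
    for node in adj:
        for _ in range(n):
            walks.append(biased_walk(adj, node, length, p, q, rng))
    return walks


# Hypothetical toy graph (undirected, unweighted), stored as adjacency sets.
toy_adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
toy_walks = run_walks(toy_adj, length=5, n=2, p=0.5, q=2.0)
print(len(toy_walks))  # 4 nodes * 2 walks per node = 8
```

As in the notebook, the result is a list with one entry per walk, so its length is the number of root nodes times the number of walks per node.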

In [None]:
rw = BiasedRandomWalk(G)
walks = rw.run(
    nodes=list(G.nodes()),  # root nodes
    length=100,  # maximum length of a random walk
    n=10,  # number of random walks per root node
    p=0.5,  # Defines (unnormalised) probability, 1/p, of returning to source node
    q=2.0,  # Defines (unnormalised) probability, 1/q, for moving away from source node
)
print("Number of random walks: {}".format(len(walks)))

As mentioned earlier, each random walk is a list of up to 100 node IDs representing a "neighborhood". The first twenty IDs of the second walk (`walks[1]`) are displayed below:

In [None]:
walks[1][0:20]

At this point, we generate an embedding for each node using the Word2vec algorithm, treating each walk as a sentence and each node ID as a word. The list comprehension below converts the IDs into text strings. The Word2vec model is then fit on the walks, producing one embedding vector per node.
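
As an illustration of what the skip-gram variant of Word2vec (`sg=1`) trains on, the sketch below lists the (center, context) pairs that a fixed window would extract from a single walk. The helper `skipgram_pairs` and the toy walk are hypothetical, and actual implementations such as gensim may shrink or subsample the window, so this is only a simplified picture.

```python
def skipgram_pairs(walk, window):
    """Return (center, context) pairs for every node in `walk`,
    pairing each node with its neighbours at most `window` steps away."""
    pairs = []
    for i, center in enumerate(walk):
        lo = max(0, i - window)
        hi = min(len(walk), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, walk[j]))
    return pairs


# Hypothetical walk of four node IDs, already converted to strings.
toy_walk = ["a", "b", "c", "d"]
pairs = skipgram_pairs(toy_walk, window=1)
print(pairs)
# [('a', 'b'), ('b', 'a'), ('b', 'c'), ('c', 'b'), ('c', 'd'), ('d', 'c')]
```

Nodes that often appear close together in walks end up in many shared pairs, which is what pushes their embeddings toward each other.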

In [None]:
str_walks = [[str(n) for n in walk] for walk in walks]
# These argument names follow gensim 3.x; gensim 4 renamed size -> vector_size,
# iter -> epochs and wv.index2word -> wv.index_to_key
model = Word2Vec(str_walks, size=128, window=5, min_count=0, sg=1, workers=2, iter=1)
# Retrieve node embeddings and corresponding subjects
node_ids = model.wv.index2word  # list of node IDs
node_embeddings = model.wv.vectors  # numpy.ndarray of shape (number of nodes, embedding dimensionality)
node_targets = node_subjects[[int(node_id) for node_id in node_ids]]

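A two-dimensional t-SNE plot of the embeddings is produced below, after a quick check of the length of the first walk and of the embedding dimensionality: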

In [None]:
len(str_walks[0])

In [None]:
len(node_embeddings[0])

In [None]:
# Apply t-SNE transformation on node embeddings
tsne = TSNE(n_components=2)
node_embeddings_2d = tsne.fit_transform(node_embeddings)
# draw the points
alpha = 0.7
label_map = {l: i for i, l in enumerate(np.unique(node_targets))}
node_colours = [label_map[target] for target in node_targets]

plt.figure(figsize=(10, 8))
plt.scatter(
    node_embeddings_2d[:, 0],
    node_embeddings_2d[:, 1],
    c=node_colours,
    cmap="jet",
    alpha=alpha,
)

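At this point, each node is represented by a vector of 128 features (its embedding) that summarizes its neighborhood in the graph. These vectors can be used as input features for a classifier, here a logistic regression that predicts the subject of each paper.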

In [None]:
node_targets

In [None]:
# X will hold the 128-dimensional input features
X = node_embeddings
# y holds the corresponding target values
y = np.array(node_targets)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.1, test_size=None)
print("Array shapes:\n X_train = {}\n y_train = {}\n X_test = {}\n y_test = {}".format(X_train.shape, y_train.shape, X_test.shape, y_test.shape))
clf = LogisticRegressionCV(
    Cs=10, cv=10, scoring="accuracy", verbose=False, multi_class="ovr", max_iter=300)
clf.fit(X_train, y_train)

In [None]:
y_pred = clf.predict(X_test)

In [None]:
accuracy_score(y_test, y_pred)