# Table of contents

* [Task 2: Node Classification in Graphs](#task2)
    * [Introduction](#introduction2)
    * [Import libraries](#libraries2)
    * [Data preprocessing](#dataprep)
    * [Text Embedding](#text)
    * [Node Embedding](#node)
    * [SVM Classifier Model (with node embedding)](#svc1)
    * [SVM Classifier Model (with node embedding + text embedding)](#svc2)
    * [Compare model and result conclusion](#conclusion2)


# Task 2: Node Classification in Graphs
<a id="task2"></a>

## Introduction
<a id="introduction2"></a>

In Task 2, I perform node classification on a citation network dataset. Each node is a paper, and each edge indicates a link between two papers. The title of each paper is also available. I first compute a text embedding of the titles and a node embedding of the graph. I then train two SVM classifiers: one that uses only the node embedding, and one that uses the node embedding together with the text embedding. Finally, I evaluate and compare the two models.


The steps in node classification in graphs are:
1. Get the nodes (papers) from docs.txt
2. Get the edges (between papers) from adjedges.txt
3. Add the nodes and edges to the graph
4. Compute the text embedding
5. Compute the node embedding
6. Combine the node embedding with the text embedding
7. Split the dataset into a training set and a test set (with seed=0)
8. Build two SVM classifiers: one with the node embedding only, and one with the node embedding + text embedding
9. Fit the models
10. Evaluate the results

## Import libraries
<a id="libraries2"></a>
In this part, I will import some libraries for the following task.

In [15]:
#!pip install node2vec
import time

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import spacy
from spacy.lang.en.stop_words import STOP_WORDS as en_stop
from sklearn.feature_extraction.text import TfidfVectorizer
import scipy as sp
import scipy.sparse.linalg as linalg
import scipy.cluster.hierarchy as hr
from scipy.spatial.distance import pdist, squareform
import sklearn.datasets as datasets
import sklearn.metrics as metrics
import sklearn.utils as utils
import sklearn.linear_model as linear_model
import sklearn.svm as svm
import sklearn.cluster as cluster
from sklearn.ensemble import AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score, confusion_matrix, matthews_corrcoef
from node2vec import Node2Vec


import re
import networkx as nx

import seaborn as sns
%matplotlib inline

## Data preprocessing
<a id="dataprep"></a>

The steps in data preprocessing are:
1. Get the nodes (paper) and title from docs.txt
2. Get the edges (between paper and paper) from adjedges.txt
3. Add the nodes and edges into graph




In [4]:
np.random.seed(0)
# get the node IDs and titles
with open('docs.txt') as f:
    content = f.readlines()

id_link={}
id_title={}
for i in content:
    x=re.match(r"(^\d*)", i)
    x1=re.match(r"^\d* (.*)",i)
    title=x1.group(1)
    book_id=x.group()
    id_link[book_id]=[]
    id_title[book_id]=title

# get the edges between nodes
with open('adjedges.txt') as f:
    content2 = f.readlines()
for i in content2:
    links=i.split()
    if len(links)>1:
        book_id=links[0]
        if book_id in id_link.keys():
            orig_link=id_link[book_id]
            in_id=[j for j in links[1:]]
            final_link=orig_link+in_id
            id_link[book_id]=final_link 

# get the label of nodes
with open('labels.txt') as f:
    content3 = f.readlines()
final_id=[]
final_label=[]
for i in content3:
    links=i.split()
    final_id.append(links[0])
    final_label.append(links[1])

subject = pd.DataFrame({'id': final_id, 'label':final_label})
subject = subject.set_index("id")
final_subject = subject["label"]


# Create an empty graph with no nodes and no edges
g = nx.Graph()

# add the nodes and edges to the graph
for key, value in id_link.items():
    g.add_node(key)
    for i in value:
        g.add_edge(key,i)
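
The parsing above can also be written without regular expressions. The sketch below is a minimal, standard-library-only illustration on made-up lines. The sample data and the helper names `parse_docs` and `parse_edges` are assumptions for illustration and are not part of the notebook; it also assumes that each line of docs.txt is an ID followed by a space and the title, as the regular expressions above imply.

```python
def parse_docs(lines):
    """Map each paper ID (first whitespace-separated token) to its title."""
    id_title = {}
    for line in lines:
        parts = line.strip().split(maxsplit=1)
        if not parts:
            continue  # skip blank lines
        paper_id = parts[0]
        title = parts[1] if len(parts) > 1 else ""
        id_title[paper_id] = title
    return id_title


def parse_edges(lines, known_ids):
    """Map each known paper ID to the list of IDs found on its adjacency line."""
    id_link = {paper_id: [] for paper_id in known_ids}
    for line in lines:
        tokens = line.split()
        # the first token is the source paper, the rest are its neighbours
        if len(tokens) > 1 and tokens[0] in id_link:
            id_link[tokens[0]].extend(tokens[1:])
    return id_link


# Hypothetical sample lines in the assumed file format
sample_docs = [
    "1 Graph kernels for learning\n",
    "2 Random walks on networks\n",
    "3 Support vector machines\n",
]
sample_edges = ["1 2 3\n", "2 1\n", "3\n"]

id_title = parse_docs(sample_docs)
id_link = parse_edges(sample_edges, id_title)
print(id_title["2"])
print(id_link["1"])
```

The resulting `id_link` dictionary has the same shape as the one the notebook passes to `g.add_node` and `g.add_edge`.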

## Text Embedding
<a id="text"></a>

In this step, I use TF-IDF to build a text embedding from the title of each paper.

In [9]:
trainDocs=list(id_title.values())

In [11]:
# define a tokenizer that lemmatizes each title with spaCy (nlp is loaded in the next cell)
class LemmaTokenizerSpacy(object):        
    def __call__(self,doc):
        trydoc = nlp(doc)
        return [token.lemma_ for token in trydoc]

In [17]:
# create the list of stop words
stopwords_list = list(en_stop)
add_stopwords=["'", '-PRON-', 'd', 'm', 'regard', 'use', 've','-pron-']
stopwords_list.extend(add_stopwords)

# Initialize spacy 'en' model, keeping only tagger component needed for lemmatization
nlp = spacy.load('en', disable=['parser', 'ner'])

# get the vectorizer
vectorizer=TfidfVectorizer(analyzer='word',input='content',
                           lowercase=True,
                           token_pattern='(?u)\\b\\w\\w+\\b',
                           stop_words=stopwords_list,
                           min_df=3,
                           ngram_range=(1,2),
                           tokenizer=LemmaTokenizerSpacy())

In [68]:
text_embedding_sparse=vectorizer.fit_transform(trainDocs)

In [69]:
text_embedding=text_embedding_sparse.toarray()
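
To make the weighting concrete, here is a minimal pure-Python sketch of TF-IDF on already tokenised titles. It uses the smoothed inverse document frequency that scikit-learn documents for `smooth_idf=True`, followed by L2 normalisation of each row. The function name `tfidf` and the toy titles are assumptions for illustration; the sketch leaves out the lemmatisation, stop words, bigrams and `min_df=3` filter that the vectorizer above applies.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return (vocabulary, rows) of L2-normalised tf-idf weights for tokenised docs."""
    n_docs = len(docs)
    vocab = sorted({tok for doc in docs for tok in doc})
    # document frequency: number of documents that contain each term
    df = Counter(tok for doc in docs for tok in set(doc))
    # smoothed idf: ln((1 + n) / (1 + df)) + 1
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}
    rows = []
    for doc in docs:
        counts = Counter(doc)
        raw = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0  # avoid dividing by zero
        rows.append([w / norm for w in raw])
    return vocab, rows


# Hypothetical, already tokenised titles
titles = [
    ["graph", "neural", "network"],
    ["graph", "kernel"],
    ["neural", "machine", "translation"],
]
vocab, rows = tfidf(titles)
print(vocab)
print([round(w, 3) for w in rows[1]])
```

In the second title, "kernel" appears in only one document while "graph" appears in two, so "kernel" receives the larger weight.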

## Node Embedding
<a id="node"></a>
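In this step, I use Node2Vec to embed the nodes. Node2Vec learns embeddings from biased random walks over the graph: at each step the walk can return to the node it just came from, with a likelihood controlled by the return parameter p, while the in-out parameter q controls how readily it moves further away from that node.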
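
The biased walk can be sketched in a few lines of standard-library Python. This is an illustration of the idea only, not the node2vec package's implementation: the function name `biased_walk`, the toy graph and the chosen values of `p` and `q` are assumptions. Each candidate next node gets an unnormalised weight of 1/p if it is the previous node, 1 if it is also a neighbour of the previous node, and 1/q otherwise. The notebook itself does not set `p` or `q`, so the library defaults apply there.

```python
import random


def biased_walk(adj, start, length, p=1.0, q=1.0, seed=0):
    """One node2vec-style second-order random walk over an adjacency dict of sets."""
    rng = random.Random(seed)
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        neighbours = sorted(adj[cur])  # sorted for reproducibility
        if not neighbours:
            break  # dead end: stop the walk early
        if len(walk) == 1:
            walk.append(rng.choice(neighbours))  # the first step is uniform
            continue
        prev = walk[-2]
        weights = []
        for nxt in neighbours:
            if nxt == prev:
                weights.append(1.0 / p)  # return to the previous node
            elif nxt in adj[prev]:
                weights.append(1.0)      # stay close to the previous node
            else:
                weights.append(1.0 / q)  # move further away
        walk.append(rng.choices(neighbours, weights=weights, k=1)[0])
    return walk


# Hypothetical toy graph (undirected, stored as symmetric adjacency sets)
adj = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b"},
    "d": {"b"},
}
walk = biased_walk(adj, "a", 10, p=0.25, q=4.0)
print(walk)
```

A small `p` makes the walk backtrack often and stay local, while a small `q` pushes it outward; the walks are then fed to a Word2Vec-style model as if they were sentences.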


In [20]:
from node2vec import Node2Vec
# pre-compute the probabilities and generate walks :
node2vec = Node2Vec(g, dimensions=64, walk_length=30, num_walks=150, workers=1)
# embed the nodes
model = node2vec.fit(window=10, min_count=1, batch_words=4)

model.save('node2vec.model')

Computing transition probabilities: 100%|██████████| 36928/36928 [00:08<00:00, 4162.30it/s]
Generating walks (CPU: 1): 100%|██████████| 150/150 [1:38:45<00:00, 39.50s/it]


In [29]:
#from gensim.models import Word2Vec
#model = Word2Vec.load("node2vec.model")

In [48]:
# all node IDs known to the model, in the model's internal order
node_ids = model.wv.index2word
# numpy.ndarray of shape (number of nodes, embedding dimensionality)
node_embeddings_orig = model.wv.vectors
# map each node ID to its row so the lookup is O(1) instead of list.index's O(n)
node_index = {node_id: idx for idx, node_id in enumerate(node_ids)}
# row index in node_ids for every labelled node
filter_indices = [node_index[i] for i in final_id]

# keep only the embeddings of the labelled nodes, in label order
node_embeddings = np.take(node_embeddings_orig, filter_indices, 0)

# IDs of the labelled nodes, in the same order as node_embeddings
final_node_ids = [node_ids[i] for i in filter_indices]

node_targets = final_subject[final_node_ids]

# combine the text embedding and node embedding together
text_node_embedding=np.concatenate((node_embeddings,text_embedding),axis=1)
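
`np.concatenate` pairs rows by position, so the step above relies on the TF-IDF rows (in docs.txt order) and the node embeddings (in labels.txt order) describing the same papers in the same order. Whether that holds depends on the input files, which are not shown here. A minimal sketch of an ID-keyed alternative follows, using plain lists; the function name `combine_by_id` and the toy vectors are assumptions for illustration.

```python
def combine_by_id(label_ids, node_vecs_by_id, text_ids, text_rows):
    """Concatenate node and text vectors per ID, in the order of label_ids."""
    # look up each text row by the ID of the document it was built from
    text_row_of = {doc_id: row for doc_id, row in zip(text_ids, text_rows)}
    combined = []
    for node_id in label_ids:
        combined.append(list(node_vecs_by_id[node_id]) + list(text_row_of[node_id]))
    return combined


# Hypothetical toy data: labels are listed in a different order than the documents
label_ids = ["3", "1", "2"]
node_vecs_by_id = {"1": [0.1, 0.2], "2": [0.3, 0.4], "3": [0.5, 0.6]}
text_ids = ["1", "2", "3"]  # order of the tf-idf rows
text_rows = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

features = combine_by_id(label_ids, node_vecs_by_id, text_ids, text_rows)
print(features[0])
```

In the notebook, the same idea would amount to reordering `text_embedding` by `final_id` before the concatenation, with `list(id_title.keys())` giving the order of its rows.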

## SVM Classifier Model (with node embedding)
<a id="svc1"></a>
In this step, I build the node classifier using a linear SVM model on the node embedding only.

In [91]:
# Set X as input features
X = node_embeddings
# Set y as corresponding target values
y = np.array(node_targets)

# Split the data into a training set and a test set.
# train_size=0.2 trains on 20% of the labelled nodes and tests on the remaining 80%.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.2, random_state=0)

In [92]:
from sklearn.svm import LinearSVC
from sklearn import svm
from sklearn.model_selection import GridSearchCV
svc=LinearSVC()
svc.fit(X_train, y_train)



LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
          intercept_scaling=1, loss='squared_hinge', max_iter=1000,
          multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
          verbose=0)

In [93]:
y_pred_LSVC = svc.predict(X_test)
confusion_matrix_LSVC=confusion_matrix(y_test, y_pred_LSVC)
recall_LSVC=recall_score(y_test, y_pred_LSVC,average='macro')
precision_LSVC=precision_score(y_test, y_pred_LSVC,average='macro')
f1score_LSVC=f1_score(y_test, y_pred_LSVC,average='macro')
accuracy_LSVC=accuracy_score(y_test, y_pred_LSVC)
matthews_LSVC = matthews_corrcoef(y_test, y_pred_LSVC) 
print(confusion_matrix_LSVC)
print('Accuracy: '+ str(accuracy_LSVC))
print('Macro Precision: '+ str(precision_LSVC))
print('Macro Recall: '+ str(recall_LSVC))
print('Macro F1 score:'+ str(f1score_LSVC))
print('MCC:'+ str(matthews_LSVC))

[[4031  155   54  384    2]
 [ 231 2476  156  274    4]
 [2800  281  223  285    0]
 [ 495  259   52 1366    1]
 [1155   81   13  198    0]]
Accuracy: 0.5405982905982906
Macro Precision: 0.44334765267987386
Macro Recall: 0.47008429371822613
Macro F1 score:0.41438456893266806
MCC:0.4208782507548116
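
The scores printed above can be recomputed from the confusion matrix alone, which makes it easier to see where they come from. The sketch below is a standard-library illustration rather than scikit-learn's own code; the function name `summarize` is an assumption. It uses the usual definitions: per-class precision and recall from the column and row sums, macro averages as unweighted means over classes (with a score of 0 for a class that has no correct predictions), and the multi-class form of the Matthews correlation coefficient.

```python
import math


def summarize(cm):
    """Accuracy, macro precision/recall/F1 and multi-class MCC from a confusion matrix."""
    k = len(cm)
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(k))
    actual = [sum(cm[i]) for i in range(k)]                          # row sums
    predicted = [sum(cm[i][j] for i in range(k)) for j in range(k)]  # column sums

    precision = [cm[i][i] / predicted[i] if predicted[i] else 0.0 for i in range(k)]
    recall = [cm[i][i] / actual[i] if actual[i] else 0.0 for i in range(k)]
    f1 = [2 * p * r / (p + r) if p + r else 0.0 for p, r in zip(precision, recall)]

    # multi-class MCC: covariance of actual and predicted labels, normalised
    cov = correct * total - sum(a * b for a, b in zip(actual, predicted))
    denom = math.sqrt((total ** 2 - sum(b * b for b in predicted))
                      * (total ** 2 - sum(a * a for a in actual)))
    return {
        "accuracy": correct / total,
        "macro_precision": sum(precision) / k,
        "macro_recall": sum(recall) / k,
        "macro_f1": sum(f1) / k,
        "mcc": cov / denom if denom else 0.0,
        "recall_per_class": recall,
    }


# Confusion matrix printed by the cell above (rows = actual, columns = predicted)
cm = [
    [4031, 155, 54, 384, 2],
    [231, 2476, 156, 274, 4],
    [2800, 281, 223, 285, 0],
    [495, 259, 52, 1366, 1],
    [1155, 81, 13, 198, 0],
]
result = summarize(cm)
for name in ("accuracy", "macro_precision", "macro_recall", "macro_f1", "mcc"):
    print(name, round(result[name], 4))
print("recall per class", [round(r, 3) for r in result["recall_per_class"]])
```

The per-class recall makes the weakness of this model visible: classes 2 and 4 are recovered far less often than the others, which pulls the macro averages well below the accuracy.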


## SVM Classifier Model (with node embedding + text embedding)
<a id="svc2"></a>
In this step, I build the node classifier using a linear SVM model on the node embedding combined with the text embedding.

In [84]:
# Set X as input features
X = text_node_embedding
# Set y as corresponding target values
y = np.array(node_targets)

# Split the data into a training set and a test set.
# train_size=0.2 trains on 20% of the labelled nodes and tests on the remaining 80%.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.2, random_state=0)

In [85]:
from sklearn.svm import LinearSVC
from sklearn import svm
from sklearn.model_selection import GridSearchCV
#rbf_svc = svm.SVC(kernel='rbf')
#rbf_svc.fit(X_train, y_train)
svc=LinearSVC()
svc.fit(X_train, y_train)



LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
          intercept_scaling=1, loss='squared_hinge', max_iter=1000,
          multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
          verbose=0)

In [87]:
#y_pred_LSVC = rbf_svc.predict(X_test)
y_pred_LSVC = svc.predict(X_test)
confusion_matrix_LSVC=confusion_matrix(y_test, y_pred_LSVC)
recall_LSVC=recall_score(y_test, y_pred_LSVC,average='macro')
precision_LSVC=precision_score(y_test, y_pred_LSVC,average='macro')
f1score_LSVC=f1_score(y_test, y_pred_LSVC,average='macro')
accuracy_LSVC=accuracy_score(y_test, y_pred_LSVC)
matthews_LSVC = matthews_corrcoef(y_test, y_pred_LSVC) 
print(confusion_matrix_LSVC)
print('Accuracy: '+ str(accuracy_LSVC))
print('Macro Precision: '+ str(precision_LSVC))
print('Macro Recall: '+ str(recall_LSVC))
print('Macro F1 score:'+ str(f1score_LSVC))
print('MCC:'+ str(matthews_LSVC))

[[3953   70  298  121  184]
 [  61 2818  134  112   16]
 [ 591  202 2604  139   53]
 [ 118  140  105 1798   12]
 [ 540   54  106   72  675]]
Accuracy: 0.7911324786324786
Macro Precision: 0.7862422202017484
Macro Recall: 0.7542289277999005
Macro F1 score:0.763716229268779
MCC:0.7289803631256759



## Compare model and result conclusion
<a id="conclusion2"></a>

SVM Classifier Model (with node embedding):

|              | Pred=0 | Pred=1 | Pred=2 | Pred=3 | Pred=4 |
| ------------ | ------ | ------ | ------ | ------ | ------ |
| **Actual=0** | 4031 | 155 | 54 | 384 | 2 |
| **Actual=1** | 231 | 2476 | 156 | 274 | 4 |
| **Actual=2** | 2800 | 281 | 223 | 285 | 0 |
| **Actual=3** | 495 | 259 | 52 | 1366 | 1 |
| **Actual=4** | 1155 | 81 | 13 | 198 | 0 |

Accuracy: 0.541

Macro Precision: 0.443

Macro Recall: 0.470

Macro F1 score: 0.414

MCC: 0.421

------------------------------------------

SVM Classifier Model (with node embedding + text embedding):

|              | Pred=0 | Pred=1 |Pred=2 |Pred=3 |Pred=4 |
| -----------  | ----------- | ----------- |----------- |----------- |----------- |
| **Actual=0**  | 3953     |   70  |298  |121  |184  |
| **Actual=1** | 61        |   2818 |134 |112 |16 |
| **Actual=2** | 591        |   202 |2604 |139 |53 |
| **Actual=3** | 118        |   140 |105 |1798 |12 |
| **Actual=4** | 540        |  54 |106 |72 |675 |

Accuracy: 0.791

Macro Precision: 0.786

Macro Recall: 0.754

Macro F1 score: 0.764

MCC: 0.729


------------
According to the results:
1. The SVM classifier **(with node embedding + text embedding)** has higher accuracy than the SVM classifier (with node embedding only).
2. It also has higher macro precision, recall, F1 score and MCC than the node-embedding-only model. This means the SVM classifier **(with node embedding + text embedding)** is the more suitable of the two in this case.
3. A likely reason for the weak performance of the SVM classifier **(with node embedding)** is that the citation network is extremely sparse, so the node embedding alone captures too few useful features. Its confusion matrix shows that most papers of classes 2 and 4 are predicted as class 0.
4. In the future, a Graph Neural Network could be tried to reach higher accuracy.