<a href="https://colab.research.google.com/github/kprokofi/graph_networks_analysis/blob/main/graph_lab_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Crafting features for the Cora dataset

In this laboratory work we explore tools for classifying graph nodes. We will not use a GCN or GAN to solve this problem; instead we will manually extract some useful features from the graph and try to beat the baseline!

In [None]:
!pip install spektral==0.6.1
!pip install node2vec

In [None]:
import numpy as np
import pandas as pd
import networkx as nx
from networkx.algorithms.shortest_paths.generic import shortest_path
import spektral
from sklearn.metrics import classification_report
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from node2vec import Node2Vec


In [None]:
adjancy, features, labels, train_mask, val_mask, test_mask = spektral.datasets.citation.load_data(dataset_name='cora')

Loading cora dataset
Pre-processing node features


In [None]:
features = features.todense().astype('float32')
adjancy = (adjancy.todense() + np.eye(adjancy.shape[0])).astype('float32')
features

matrix([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]], dtype=float32)

In [None]:
print(np.sum(train_mask))
print(np.sum(val_mask))
print(np.sum(test_mask))

140
500
1000


## Feature engineering

### Shortest path lengths between all pairs of nodes

In [None]:
G = nx.convert_matrix.from_numpy_matrix(adjancy)

In [None]:
paths = shortest_path(G)

In [None]:

pcc_longueurs=list(nx.all_pairs_shortest_path_length(G))

In [None]:
n = G.number_of_nodes()

In [None]:
length = dict(nx.all_pairs_shortest_path_length(G))

In [None]:
# distances[i, j] is the shortest-path length between i and j plus 1; 0 marks unreachable pairs
distances = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        if i == j:
            distances[i, j] = 1
        elif j not in length[i]:
            distances[i, j] = 0 # mark unreachable nodes with 0
        else:
            distances[i, j] = length[i][j] + 1

In [None]:
distances.shape

(2708, 2708)
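
For an unweighted graph, all-pairs shortest path lengths amount to one breadth-first search per node. The sketch below reproduces the encoding used above (0 for unreachable pairs, hop count plus one otherwise) on a small graph. The `toy_graph` adjacency dict and the helper names are hypothetical and not part of the notebook.

```python
from collections import deque

def bfs_lengths(adj, source):
    """Return {node: hop count} for every node reachable from source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def distance_features(adj):
    """Build the n x n matrix: 0 = unreachable, otherwise hops + 1."""
    nodes = sorted(adj)
    rows = []
    for i in nodes:
        reach = bfs_lengths(adj, i)
        rows.append([reach[j] + 1 if j in reach else 0 for j in nodes])
    return rows

# Hypothetical toy graph: a path 0-1-2 plus an isolated node 3.
toy_graph = {0: [1], 1: [0, 2], 2: [1], 3: []}
toy_distances = distance_features(toy_graph)
for row in toy_distances:
    print(row)
```

Each row of the result plays the role of one node's feature vector, just as each row of `distances` does for a Cora node.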

### Pagerank vector

In [None]:
pg = np.array(list(nx.pagerank(G, alpha=0.9).values()))
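
To make the PageRank feature less of a black box, here is a minimal power-iteration sketch with the same damping factor (`alpha=0.9`). It is an illustration under simplifying assumptions (an undirected adjacency dict, uniform redistribution for nodes without neighbors), not the networkx implementation. The `star` graph and the function name are hypothetical.

```python
def pagerank(adj, alpha=0.9, tol=1e-10, max_iter=1000):
    """Power iteration for PageRank on an adjacency dict {node: [neighbors]}."""
    nodes = list(adj)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(max_iter):
        # Rank held by nodes without neighbors is spread over all nodes.
        dangling = sum(rank[u] for u in nodes if not adj[u])
        new_rank = {u: (1.0 - alpha) / n + alpha * dangling / n for u in nodes}
        # Every node passes an equal share of its rank to each neighbor.
        for u in nodes:
            for v in adj[u]:
                new_rank[v] += alpha * rank[u] / len(adj[u])
        change = sum(abs(new_rank[u] - rank[u]) for u in nodes)
        rank = new_rank
        if change < tol:
            break
    return rank

# Hypothetical toy graph: a star with center 0 and leaves 1, 2, 3.
star = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
ranks = pagerank(star)
print(ranks)
```

The scores sum to 1, and the center of the star ends up with a higher score than any leaf.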

### Centrality measures

In [None]:
c_degree = np.array(list(nx.degree_centrality(G).values()))
c_eigenvector = np.array(list(nx.eigenvector_centrality(G).values()))
c_closeness = np.array(list(nx.closeness_centrality(G).values()))
c_betweenness = np.array(list(nx.betweenness_centrality(G).values()))

In [None]:
centrality = np.concatenate([pg.reshape(-1,1), c_eigenvector.reshape(-1,1), c_closeness.reshape(-1,1), c_betweenness.reshape(-1,1)], axis=1)

In [None]:
centrality.shape

(2708, 4)
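
Two of the centrality measures are simple enough to compute by hand, which helps when interpreting the columns. The sketch below assumes a connected graph without self-loops; networkx applies extra handling for disconnected graphs, which is omitted here. The `path` graph and the function names are hypothetical.

```python
from collections import deque

def hop_counts(adj, source):
    """Breadth-first search: hop count from source to each reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def degree_centrality(adj):
    """Fraction of the other n - 1 nodes that each node is connected to."""
    n = len(adj)
    return {u: len(adj[u]) / (n - 1) for u in adj}

def closeness_centrality(adj):
    """(n - 1) divided by the sum of distances; assumes a connected graph."""
    n = len(adj)
    result = {}
    for u in adj:
        total = sum(hop_counts(adj, u).values())
        result[u] = (n - 1) / total if total > 0 else 0.0
    return result

# Hypothetical toy graph: a path 0-1-2.
path = {0: [1], 1: [0, 2], 2: [1]}
toy_degree = degree_centrality(path)
toy_closeness = closeness_centrality(path)
print(toy_degree)
print(toy_closeness)
```

The middle node of the path scores highest on both measures, since it is adjacent to every other node.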

### Prepare train, test, and validation datasets

In [None]:
# train/val/test data
x_train_cora_features = features[train_mask]
x_train_manual = np.concatenate([distances[train_mask], centrality[train_mask]],axis=1)
x_test_cora_features = features[test_mask]
x_test_manual =  np.concatenate([distances[test_mask], centrality[test_mask]],axis=1)
x_val_cora_features = features[val_mask]
x_val_manual = np.concatenate([distances[val_mask], centrality[val_mask]],axis=1)
# train/val/test labels
y_train = labels[train_mask]
y_test = labels[test_mask]
y_val = labels[val_mask]

In [None]:
# concatenate the Cora features with the manual features
x_train = np.concatenate([x_train_cora_features, x_train_manual], axis=1)
x_test = np.concatenate([x_test_cora_features, x_test_manual], axis=1)
x_val = np.concatenate([x_val_cora_features, x_val_manual], axis=1)

In [None]:
# keep train + val together for cross-validation
X_all = np.concatenate([x_train, x_val], axis=0)
y_all = np.concatenate([y_train, y_val], axis=0)

### Random forest and gradient boosting models

First, let's validate a baseline model that uses only the features predefined in the dataset.

In [None]:
xg_baseline = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=5)
xg_baseline.fit(features[train_mask], np.argmax(y_train, axis=1)) # features from cora only

pred = xg_baseline.predict(features[test_mask])
report = classification_report(y_true=np.argmax(y_test, axis=1), y_pred=pred)
print(report)

              precision    recall  f1-score   support

           0       0.28      0.50      0.36       130
           1       0.55      0.67      0.61        91
           2       0.82      0.68      0.74       144
           3       0.71      0.47      0.57       319
           4       0.50      0.47      0.48       149
           5       0.48      0.50      0.49       103
           6       0.48      0.56      0.52        64

    accuracy                           0.53      1000
   macro avg       0.54      0.55      0.54      1000
weighted avg       0.59      0.53      0.54      1000



Keep in mind: the baseline reaches 53% accuracy on the test set.
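
The figures in a classification report come from per-class counts of true positives, false positives, and false negatives. The following sketch computes accuracy plus per-class precision, recall, and F1 by hand on made-up labels; the function name and the toy `y_true`/`y_pred` lists are hypothetical.

```python
def per_class_report(y_true, y_pred):
    """Return (accuracy, {label: (precision, recall, f1)})."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    report = {}
    for c in labels:
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[c] = (precision, recall, f1)
    return accuracy, report

# Made-up labels for three classes.
y_true_toy = [0, 0, 1, 1, 2, 2]
y_pred_toy = [0, 1, 1, 1, 2, 0]
toy_accuracy, toy_report = per_class_report(y_true_toy, y_pred_toy)
print(toy_accuracy)
for label, scores in toy_report.items():
    print(label, scores)
```

A class that is never predicted gets precision 0 here, which is the situation the MLP report later in this notebook runs into.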

Now let's add our manual features, then tune and validate a few algorithms.



In [None]:
# Random Forest
RT = RandomForestClassifier(max_depth=50, n_estimators=40, random_state=5)
RT.fit(x_train, np.argmax(y_train, axis=1))

pred = RT.predict(x_test)
report = classification_report(y_true=np.argmax(y_test, axis=1), y_pred=pred)
print(report)

              precision    recall  f1-score   support

           0       0.60      0.66      0.63       130
           1       0.73      0.78      0.76        91
           2       0.73      0.90      0.81       144
           3       0.88      0.43      0.58       319
           4       0.55      0.84      0.66       149
           5       0.64      0.68      0.66       103
           6       0.43      0.59      0.50        64

    accuracy                           0.66      1000
   macro avg       0.65      0.70      0.66      1000
weighted avg       0.71      0.66      0.65      1000



In [None]:
# Gradient Boosting
xg = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=5, random_state=5)
xg.fit(x_train, np.argmax(y_train, axis=1))

pred = xg.predict(x_test)
report = classification_report(y_true=np.argmax(y_test, axis=1), y_pred=pred)
print(report)

              precision    recall  f1-score   support

           0       0.73      0.60      0.66       130
           1       0.65      0.80      0.72        91
           2       0.84      0.85      0.84       144
           3       0.77      0.61      0.68       319
           4       0.59      0.50      0.54       149
           5       0.48      0.74      0.58       103
           6       0.45      0.67      0.54        64

    accuracy                           0.66      1000
   macro avg       0.64      0.68      0.65      1000
weighted avg       0.69      0.66      0.66      1000



Both models reach 66% test accuracy using our manual features together with the predefined Cora features. Let's now run cross-validation.

### Cross Validation Score

In [None]:
# WARNING: this cell takes about 10-15 minutes to run
scores = cross_val_score(xg, X_all, np.argmax(y_all, axis=1), cv=5)

In [None]:
np.mean(scores)

0.7796875

78% cross-validation accuracy for gradient boosting (scikit-learn's `GradientBoostingClassifier`, not the XGBoost library).

In [None]:
scores = cross_val_score(RT, X_all, np.argmax(y_all, axis=1), cv=5)
np.mean(scores)

0.725

72.5% for the random forest, which suggests worse generalization than gradient boosting.
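
To see what `cv=5` does with the 640 rows of `X_all` (140 train plus 500 validation), here is a simplified sketch of how five folds partition the row indices. It uses plain contiguous folds; for classifiers, scikit-learn additionally stratifies the folds by class, which is left out here. The function name is hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) for k contiguous, unshuffled folds."""
    # The first n_samples % k folds get one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(640, 5))
for train_idx, test_idx in folds:
    print(len(train_idx), len(test_idx))
```

Each row is held out exactly once, so the mean score averages five models that were each trained on 512 rows and scored on the remaining 128.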

### MLP network

In [None]:
mlp = MLPClassifier(hidden_layer_sizes=(x_train.shape[1], int(x_train.shape[1]/2), y_train.shape[1]),
                    activation='relu', solver='adam', alpha=1e-4, batch_size=140,
                    learning_rate='invscaling', learning_rate_init=0.001, verbose=True, max_iter=200)
mlp.fit(x_train, np.argmax(y_train, 1))

Iteration 1, loss = 3.83216524
Iteration 2, loss = 89.63385662
Iteration 3, loss = 1.87636549
Iteration 4, loss = 1.85924468
Iteration 5, loss = 1.84795299
Iteration 6, loss = 1.84039631
Iteration 7, loss = 1.83494166
Iteration 8, loss = 1.83146586
Iteration 9, loss = 1.82787417
Iteration 10, loss = 2.12665754
Iteration 11, loss = 1.82425181
Iteration 12, loss = 1.82195027
Iteration 13, loss = 1.82009964
Iteration 14, loss = 1.81857727
Iteration 15, loss = 1.81705284
Iteration 16, loss = 1.81610918
Iteration 17, loss = 1.81517063
Iteration 18, loss = 1.81376032
Iteration 19, loss = 1.81263194
Iteration 20, loss = 1.81160755
Iteration 21, loss = 1.81109891
Iteration 22, loss = 1.80983572
Iteration 23, loss = 1.80817850
Iteration 24, loss = 1.80600538
Iteration 25, loss = 1.80354912
Iteration 26, loss = 1.80391923
Iteration 27, loss = 1.80326635
Iteration 28, loss = 1.80298262
Iteration 29, loss = 1.80263013
Iteration 30, loss = 1.80208124
Iteration 31, loss = 1.80147623
Iteration 32, lo



MLPClassifier(activation='relu', alpha=0.0001, batch_size=140, beta_1=0.9,
              beta_2=0.999, early_stopping=False, epsilon=1e-08,
              hidden_layer_sizes=(4145, 2072, 7), learning_rate='invscaling',
              learning_rate_init=0.001, max_fun=15000, max_iter=200,
              momentum=0.9, n_iter_no_change=10, nesterovs_momentum=True,
              power_t=0.5, random_state=None, shuffle=True, solver='adam',
              tol=0.0001, validation_fraction=0.1, verbose=True,
              warm_start=False)

In [None]:
mlp_pred = mlp.predict(x_test)
report = classification_report(y_true=np.argmax(y_test, axis=1), y_pred=mlp_pred)
print(report)

              precision    recall  f1-score   support

           0       0.14      0.98      0.24       130
           1       0.00      0.00      0.00        91
           2       0.00      0.00      0.00       144
           3       0.00      0.00      0.00       319
           4       0.28      0.13      0.18       149
           5       0.50      0.01      0.02       103
           6       0.88      0.22      0.35        64

    accuracy                           0.16      1000
   macro avg       0.26      0.19      0.11      1000
weighted avg       0.17      0.16      0.08      1000





A poor result! We tried varying the hidden layer sizes, learning rate, and weight decay, but accuracy never exceeded 17%.
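
One thing not tried above is feature scaling. The inputs mix columns on very different scales (hop counts from the distance matrix next to centrality scores that are at most 1), and gradient-based training of an MLP is often sensitive to that. Whether it helps here is untested. The sketch only shows what per-column standardization does, with the statistics fitted on the training rows alone. All names and numbers are hypothetical.

```python
import math

def fit_standardizer(rows):
    """Per-column mean and standard deviation of a list of feature rows."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    stds = [math.sqrt(sum((x - m) ** 2 for x in c) / n) for c, m in zip(cols, means)]
    # Constant columns get std 1 to avoid division by zero.
    stds = [s if s > 0 else 1.0 for s in stds]
    return means, stds

def transform(rows, means, stds):
    """Subtract the column mean and divide by the column std."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]

# Made-up rows: a hop-count-like column next to a tiny centrality-like column.
train_rows = [[1.0, 0.001], [3.0, 0.003], [5.0, 0.002]]
test_rows = [[3.0, 0.002]]
means, stds = fit_standardizer(train_rows)
train_scaled = transform(train_rows, means, stds)
test_scaled = transform(test_rows, means, stds)
print(train_scaled)
print(test_scaled)
```

After the transform both columns have zero mean and unit variance on the training rows, so neither dominates the first layer simply because of its units.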

### Try node2vec

In [None]:
node2vec = Node2Vec(G, dimensions=100, walk_length=16, num_walks=100)
model = node2vec.fit(window=10, min_count=1)

Computing transition probabilities:   0%|          | 0/2708 [00:00<?, ?it/s]

Generating walks (CPU: 1): 100%|██████████| 100/100 [03:00<00:00,  1.81s/it]


In [None]:
node2vec_feature = np.empty((len(G.nodes()), 100))
for v in G.nodes():
  node2vec_feature[v] = model.wv[str(v)]
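
The "Generating walks" step above produces sequences of node IDs that are then fed to a Word2Vec-style model, which is why the embeddings are looked up by `str(v)`. The sketch below generates such walks in the simplest setting, assuming the bias parameters p and q are both 1 so that every step picks a neighbor uniformly at random. The function name and the toy graph are hypothetical, and this is not the node2vec package's own code.

```python
import random

def uniform_random_walks(adj, walk_length, num_walks, seed=0):
    """Generate num_walks walks of up to walk_length nodes from every node."""
    rng = random.Random(seed)
    walks = []
    for _ in range(num_walks):
        nodes = list(adj)
        rng.shuffle(nodes)
        for start in nodes:
            walk = [start]
            # Stop early only if the current node has no neighbors.
            while len(walk) < walk_length and adj[walk[-1]]:
                walk.append(rng.choice(adj[walk[-1]]))
            # Nodes become string tokens, like words in a sentence.
            walks.append([str(u) for u in walk])
    return walks

# Hypothetical toy graph with four nodes.
toy_graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
walks = uniform_random_walks(toy_graph, walk_length=5, num_walks=2)
print(len(walks))
print(walks[0])
```

With `walk_length=16` and `num_walks=100` on 2708 nodes, the same procedure yields 270,800 walks, which is in line with the walk generation taking a few minutes above.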

Generate the new train/val/test sets

In [None]:
x_train_w2v = np.concatenate([x_train, node2vec_feature[train_mask]], axis=1)
x_test_w2v = np.concatenate([x_test, node2vec_feature[test_mask]], axis=1)
x_val_w2v = np.concatenate([x_val, node2vec_feature[val_mask]], axis=1)

In [None]:
xg = GradientBoostingClassifier(n_estimators=130, learning_rate=0.07, max_depth=2, random_state=5)
xg.fit(x_train_w2v, np.argmax(y_train, axis=1))

pred = xg.predict(x_test_w2v)
report = classification_report(y_true=np.argmax(y_test, axis=1), y_pred=pred)
print(report)

              precision    recall  f1-score   support

           0       0.60      0.60      0.60       130
           1       0.74      0.78      0.76        91
           2       0.85      0.88      0.87       144
           3       0.84      0.60      0.70       319
           4       0.65      0.66      0.66       149
           5       0.53      0.74      0.62       103
           6       0.46      0.75      0.57        64

    accuracy                           0.69      1000
   macro avg       0.67      0.72      0.68      1000
weighted avg       0.72      0.69      0.69      1000



In [None]:
scores = cross_val_score(xg, X_all, np.argmax(y_all, axis=1), cv=5)
np.mean(scores)

0.7625

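A gain of 3 points in test accuracy (66% to 69%)! The CV score drops from 78% to about 76%, but note that this run still uses `X_all`, which was built before the node2vec features were added, so it only reflects the changed hyperparameters.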

In [None]:
RT = RandomForestClassifier(max_depth=20, n_estimators=40, random_state=5)
RT.fit(x_train_w2v, np.argmax(y_train, axis=1))

pred = RT.predict(x_test_w2v)
report = classification_report(y_true=np.argmax(y_test, axis=1), y_pred=pred)
print(report)

              precision    recall  f1-score   support

           0       0.65      0.72      0.69       130
           1       0.71      0.77      0.74        91
           2       0.73      0.88      0.80       144
           3       0.83      0.46      0.60       319
           4       0.59      0.76      0.66       149
           5       0.64      0.65      0.64       103
           6       0.39      0.66      0.49        64

    accuracy                           0.66      1000
   macro avg       0.65      0.70      0.66      1000
weighted avg       0.70      0.66      0.66      1000



In [None]:
scores = cross_val_score(RT, X_all, np.argmax(y_all, axis=1), cv=5)
np.mean(scores)

0.725

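No change for the random forest: 66% test accuracy and 72.5% CV accuracy. Note that this CV run also uses `X_all`, which does not include the node2vec features.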

## Conclusion

We got acquainted with methods for classifying nodes in a graph without using SOTA GCNs and GANs. We collected manual features, including the shortest-path matrix and centrality measures, and tried the node2vec method. We trained gradient boosting, random forest, and MLP neural network classifiers.

---
The results:

*   The random forest classifier achieved 66% test accuracy and 72.5% CV accuracy
*   Gradient boosting achieved 69% test accuracy (with node2vec features) and 78% CV accuracy (before the node2vec step)
*   The MLP achieved only 16-17% accuracy




