## DeepWalk (created from Randomwalk)

Random-walk node embeddings using `networkx` and the `karateclub` library.

Jay Urbain, PhD

11/11/2022, 3/11/2025

Load the karate club graph using [networkx](https://networkx.org/).

In [8]:
# tested with python 3.10, used conda environment

# !python -m pip install --upgrade pip
# !pip install karateclub --upgrade
# !pip install networkx 
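# !pip install numpy==1.22.0
# caution: the Matplotlib build used for plotting requires numpy>=1.23 (see the ImportError
# in the plotting cell), so either relax this pin or install an older Matplotlib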

#!pip install scikit-learn
#!pip install matplotlib
!pip install pandas



In [1]:
import networkx as nx
G = nx.karate_club_graph()  # load Zachary's karate club graph
print("Number of nodes (club members):", len(G.nodes))

Number of nodes (club members) 34


Plot the graph:

In [2]:
nx.draw_networkx(G)

ImportError: Matplotlib requires numpy>=1.23; you have 1.22.0

Each node represents a club member. An edge between two nodes means that the two members interact with each other.

Each node carries one of two group memberships in its `club` attribute: "Mr. Hi" or "Officer". These serve as the class labels.
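As a quick check on the graph structure described above, the following minimal sketch counts the edges and the members of each group. It assumes only the `club` node attribute that ships with `nx.karate_club_graph()`; the variable name `club_counts` is illustrative.

```python
from collections import Counter

import networkx as nx

G = nx.karate_club_graph()

# Each node carries a 'club' attribute naming the group it belongs to.
club_counts = Counter(nx.get_node_attributes(G, "club").values())

print("Edges (relationships):", G.number_of_edges())
for club, count in club_counts.items():
    print(f"{club}: {count} members")
```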

Plot the graph with labels:

In [None]:
# plot the graph with labels
labels = []
for i in G.nodes:
    # Mr. Hi or Officer
    club_names = G.nodes[i]['club']
    # Numerically encode club name
    labels.append(1 if club_names == "Officer" else 0)
#print('labels', labels)  

# can choose different layout
layout_pos = nx.spring_layout(G)
nx.draw_networkx(G, pos=layout_pos, node_color=labels, cmap='coolwarm')

Perform node embedding using the `DeepWalk` algorithm and the `karateclub` library.

Paper:   
[DeepWalk: Online Learning of Social Representations](https://arxiv.org/abs/1403.6652)

Karateclub library:    
https://karateclub.readthedocs.io/en/latest/notes/introduction.html

Karateclub DeepWalk reference:   
https://karateclub.readthedocs.io/en/latest/modules/root.html#karateclub.node_embedding.neighbourhood.deepwalk.DeepWalk
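DeepWalk, as described in the paper linked above, samples truncated uniform random walks from every node and feeds them to a skip-gram model as if they were sentences. The sketch below illustrates only the walk-sampling step on a toy adjacency dictionary. It is not `karateclub`'s implementation; the function name `generate_walks` and the toy graph are hypothetical, and the parameter names simply mirror the `walk_number` and `walk_length` arguments used in this notebook.

```python
import random


def generate_walks(adjacency, walk_number, walk_length, seed=0):
    """Sample `walk_number` uniform random walks of `walk_length` nodes from every node."""
    rng = random.Random(seed)
    walks = []
    for _ in range(walk_number):
        nodes = list(adjacency)
        rng.shuffle(nodes)  # visit start nodes in a random order on each pass
        for start in nodes:
            walk = [start]
            while len(walk) < walk_length:
                neighbors = adjacency[walk[-1]]
                if not neighbors:  # dead end: stop this walk early
                    break
                walk.append(rng.choice(neighbors))  # uniform choice among neighbors
            walks.append(walk)
    return walks


# Hypothetical toy graph: node -> list of neighbors (undirected)
toy_graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
walks = generate_walks(toy_graph, walk_number=2, walk_length=5)
print(len(walks), "walks; first walk:", walks[0])
```

Each walk would then play the role of a sentence whose "words" are node ids, which is what lets a word2vec-style model learn one vector per node.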

In [4]:
# Perform node embedding using the DeepWalk class in the karateclub library

from karateclub import DeepWalk, Node2Vec

# construct the DeepWalk model
Deepwalk_model = DeepWalk(walk_number=10, walk_length=80, dimensions=124)

# fit the model to the graph 
Deepwalk_model.fit(G)

# construct the Node2Vec model. p is the return parameter and q the in-out parameter;
# with q < p < 1, moving outward (weight 1/q) is favored over returning (weight 1/p)
Node2Vec_model = Node2Vec(walk_number=10, walk_length=80, p=0.6, q=0.4, dimensions=124)

# fit the model to the graph; pass the graph itself, because
# fit([G]) raises AttributeError: 'list' object has no attribute 'number_of_nodes'
Node2Vec_model.fit(G)
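To make the effect of the `p` and `q` arguments passed to `Node2Vec` concrete, here is a minimal sketch of the unnormalized second-order transition weights from the node2vec formulation: `1/p` for stepping back to the previous node, `1` for moving to a neighbor shared with the previous node, and `1/q` for moving farther away. The helper `node2vec_weights` and the toy graph are hypothetical illustrations, not part of `karateclub`.

```python
def node2vec_weights(adjacency, previous, current, p, q):
    """Unnormalized weights for the next step of a walk that moved previous -> current."""
    weights = {}
    for candidate in adjacency[current]:
        if candidate == previous:
            weights[candidate] = 1.0 / p  # return to the node we came from
        elif candidate in adjacency[previous]:
            weights[candidate] = 1.0      # stay at distance 1 from the previous node
        else:
            weights[candidate] = 1.0 / q  # move outward, away from the previous node
    return weights


# Hypothetical toy graph: node -> set of neighbors (undirected)
toy_graph = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1}, 3: {1}}
weights = node2vec_weights(toy_graph, previous=0, current=1, p=0.6, q=0.4)

total = sum(weights.values())
for node, weight in sorted(weights.items()):
    print(f"next node {node}: probability {weight / total:.3f}")
```

With `p=0.6` and `q=0.4`, the outward step gets the largest weight, followed by the return step, so larger `q` (not smaller) is what keeps a walk close to where it came from.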

In [3]:
# get the learned embeddings (run the model-fitting cell first; otherwise
# Deepwalk_model and Node2Vec_model are not defined and this cell raises a NameError)
embedding = Deepwalk_model.get_embedding()

node2Vec_embedding = Node2Vec_model.get_embedding()

How many nodes and how many features?
-- 34 nodes x 124 features

In [None]:
print('Embedding array shape (nodes x features)', embedding.shape)
print('Node2Vec embedding array shape (nodes x features)', node2Vec_embedding.shape)

Plot lower dimensional representations of the data.

Can use [t-SNE](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html) 
or [PCA](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html)
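The cells in this notebook use PCA. For comparison, the following is a minimal sketch of the t-SNE alternative. To stay self-contained it runs on random placeholder data with the same 34 x 124 shape as the embeddings rather than on the learned embedding, and the `perplexity` value is an assumption chosen because it must be smaller than the number of samples.

```python
import numpy as np
from sklearn.manifold import TSNE

# Placeholder data standing in for the (34 nodes x 124 features) embedding array.
rng = np.random.default_rng(0)
placeholder_embedding = rng.normal(size=(34, 124))

# perplexity must be smaller than the number of samples (34 here).
tsne_model = TSNE(n_components=2, perplexity=10, init="pca", random_state=0)
tsne_embedding = tsne_model.fit_transform(placeholder_embedding)

print("t-SNE output shape:", tsne_embedding.shape)
```

Unlike PCA, t-SNE has no separate `transform` step for new points, so it is best suited to visualization rather than as a preprocessing stage for the classifier.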
     

In [None]:
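# Low dimensional projection of the nodes x features arrays
from sklearn.decomposition import PCA

# fit a separate PCA model for each embedding so neither overwrites the other
PCA_model = PCA(n_components=2)
lowdimension_embedding = PCA_model.fit_transform(embedding)

node2vec_PCA_model = PCA(n_components=2)
node2vec_lowdimension_embedding = node2vec_PCA_model.fit_transform(node2Vec_embedding)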

The lower dimensional embedding should have shape (number of nodes x 2).

In [None]:
print('Low dimensional embedding shape (nodes x 2):', lowdimension_embedding.shape)

print('Low dimensional node2vec embedding shape (nodes x 2):', node2vec_lowdimension_embedding.shape)



Plot the 2-d representation:

In [None]:
import matplotlib.pyplot as plt

plt.scatter(lowdimension_embedding[:,0], lowdimension_embedding[:,1], c=labels, s=15, cmap='coolwarm')
plt.title('DeepWalk')
plt.show()

plt.scatter(node2vec_lowdimension_embedding[:,0], node2vec_lowdimension_embedding[:,1], c=labels, s=15, cmap='coolwarm')
plt.title('Node2Vec')
plt.show()


DeepWalk gives each node a 124-dimensional representation; PCA was used only to project it to two dimensions for plotting.

Now perform node classification on the full embeddings.

Create train and test data.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

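# stratify so both clubs appear in the small test set; fix the seed for reproducibility
x_train, x_test, y_train, y_test = train_test_split(
    embedding, labels, test_size=0.3, stratify=labels, random_state=0)
x_train, x_test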


Fit the model to the data, i.e., the learned representations, using logistic regression.

In [None]:
ml_model = LogisticRegression(random_state=0).fit(x_train, y_train)
# AUC is a ranking metric, so score the test nodes with class-1 probabilities
y_score = ml_model.predict_proba(x_test)[:, 1]
ml_auc = roc_auc_score(y_test, y_score)
print('AUC:', ml_auc)
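ROC AUC can be read as the probability that a randomly chosen positive example is scored above a randomly chosen negative one, with ties counting half. The pure-Python sketch below uses a hypothetical helper, `pairwise_auc`, and made-up scores to show why the metric is more informative when it is given continuous scores (such as `predict_proba` output) instead of thresholded class labels.

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up example: the scores rank both positives above both negatives,
# but every score is above the 0.5 threshold.
y_true = [0, 0, 1, 1]
probabilities = [0.55, 0.6, 0.7, 0.9]
hard_labels = [1 if prob >= 0.5 else 0 for prob in probabilities]

auc_from_scores = pairwise_auc(y_true, probabilities)
auc_from_labels = pairwise_auc(y_true, hard_labels)
print("AUC from probabilities:", auc_from_scores)
print("AUC from hard labels:", auc_from_labels)
```

Thresholding collapses the ranking into ties, so a perfectly ordered set of scores can look no better than chance.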


In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# same split settings, now on the Node2Vec embedding
x_train, x_test, y_train, y_test = train_test_split(
    node2Vec_embedding, labels, test_size=0.3, stratify=labels, random_state=0)
x_train, x_test

In [6]:
# run the previous cell first so that LogisticRegression and roc_auc_score are
# imported; otherwise this cell raises a NameError
ml_model = LogisticRegression(random_state=0).fit(x_train, y_train)
y_score = ml_model.predict_proba(x_test)[:, 1]
ml_auc = roc_auc_score(y_test, y_score)
print('AUC:', ml_auc)

Note: this is a relatively small and simple network.

Possible follow-up: random walk with restart.

https://medium.com/@chaitanya_bhatia/random-walk-with-restart-and-its-applications-f53d7c98cb9
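For orientation, random walk with restart can be sketched as a power iteration: at each step the walker returns to a seed node with some restart probability and otherwise moves to a uniformly chosen neighbor, and the resulting stationary scores measure proximity to the seed. The code below is a minimal sketch under those assumptions. The function name, the restart probability of 0.3, and the toy path graph are all illustrative, and every node is assumed to have at least one neighbor.

```python
def random_walk_with_restart(adjacency, seed_node, restart_prob=0.3, iterations=100):
    """Approximate stationary visiting probabilities of a walk that restarts at seed_node."""
    nodes = list(adjacency)
    scores = {node: 1.0 if node == seed_node else 0.0 for node in nodes}
    for _ in range(iterations):
        updated = {node: 0.0 for node in nodes}
        for node in nodes:
            # With probability (1 - restart_prob), spread this node's mass evenly to its neighbors.
            share = (1.0 - restart_prob) * scores[node] / len(adjacency[node])
            for neighbor in adjacency[node]:
                updated[neighbor] += share
        # With probability restart_prob, the walker jumps back to the seed node.
        updated[seed_node] += restart_prob
        scores = updated
    return scores


# Hypothetical toy path graph 0 - 1 - 2 - 3, with node 0 as the seed
path_graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
scores = random_walk_with_restart(path_graph, seed_node=0)
for node, score in scores.items():
    print(f"node {node}: {score:.3f}")
```

On this path graph the scores decrease with distance from the seed, which is the property that makes the method useful as a node-proximity measure.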