# Graph Machine Learning Introduction

by Alejandro Correa Bahnsen, Jaime D. Acevedo-Viloria & Luisa Roa

version 1.2, October 2022

In this notebook we will be doing a brief introduction to graph machine learning. The agenda is as follows:


1. Different types of graphs - Introduction to NetworkX
2. Creating Graph Based Features and enhancing ML models
3. Creating a Graph from own data using NetworkX
4. Transductive Learning vs. Inductive Learning
5. Graph Neural Networks - Introduction to DGL

With these basics you will be able to leverage graphs to enhance Machine Learning models and learn how to build Neural Networks specially crafted for graphs. We hope you like it!

In [None]:
#Install libraries
!pip install dgl==0.6.1
!pip install torch==1.9.0

## Types of Graphs - An Introduction to NetworkX

According to its creators, NetworkX is "a Python package for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks."

![](https://raw.githubusercontent.com/jdacevedo3010/graph-mahine-learning-workshop/master/images/networkx_description.png)

https://networkx.org/

We will be using version 2.8.7 of the package, the latest stable release at the time of writing, as referenced in the requirements.txt provided in the repo.

In [None]:
import networkx as nx
import numpy as np
import pandas as pd

Let's start by describing different properties graphs can have and what those mean for the graph in question. We will use NetworkX visual examples for each of them, and we will also describe real-world applications where you may find each type of graph.

We will be using different NetworkX generators to initialize graphs, just to highlight the many different ways we can do this. Make sure to check the documentation for more information: https://networkx.org/documentation/stable/reference/generators.html

### Directed & Undirected Graphs

This property refers to whether the edges connecting the nodes of the graph have an inherent direction.

In undirected graphs, edges indicate a two-way relationship, and as such they can be traversed from either node to the other. In directed graphs, edges indicate a one-way relationship, meaning that they can only be traversed in the specific direction of the edge.
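
As a minimal sketch of the difference, the cell below builds the same two edges once as an undirected graph and once as a directed graph. The node names are made up for illustration. A friendship network is a typical undirected example, while a follower network is a typical directed one.

```python
import networkx as nx

# The same two edges, interpreted with and without direction (made-up node names)
edges = [("ana", "ben"), ("ben", "carla")]

G_undirected = nx.Graph(edges)    # two-way relationships
G_directed = nx.DiGraph(edges)    # one-way relationships

# Undirected: the edge can be traversed from either endpoint
print(G_undirected.has_edge("ben", "ana"))        # True

# Directed: only the stated direction exists
print(G_directed.has_edge("ana", "ben"))          # True
print(G_directed.has_edge("ben", "ana"))          # False

# Reachability follows the direction of the edges
print(nx.has_path(G_directed, "ana", "carla"))    # True
print(nx.has_path(G_directed, "carla", "ana"))    # False
```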

## Creating Graph Based Features and enhancing ML models 

Now let's see how we can use these new features taken from the graph to enhance our ML models. First, let's import the information of the users in the graph from the nodes_features_workshop csv.

In [None]:
df = pd.read_csv('https://github.com/jdacevedo3010/graph-mahine-learning-workshop/raw/master/data/nodes_features_workshop.csv').set_index('USER_ID')
df.head()

Here we have a DataFrame with the users as the index. The columns contain the features that profile them:

1. Device Type: An encoding of the different devices in the dataset
2. Expected Value: A score that measures the value the client will bring
3. Sales: Total amount spent by the user

And a label that tells us whether the user is fraudulent or not (FRAUD column).

Let's use this information to train a couple of traditional Machine Learning models: Gradient Boosting Trees and Logistic Regression. Let's first create train and test masks. We will do this manually because we will later need the same split for the Graph Neural Networks. We will also be using torch tensors for the same reason.

In [None]:
import torch as th

def create_masks(df, seed=23, test_size=0.2):
    '''
    This function creates boolean tensors that indicate whether a user is in the train or test set
    '''
    #Seeded generator so that the split is reproducible
    rng = np.random.RandomState(seed)
    split_flag = rng.random_sample(df.shape[0])
    train_mask = th.tensor(split_flag <= (1 - test_size))
    test_mask = th.tensor(split_flag > (1 - test_size))
    return train_mask, test_mask

In [None]:
#Create binary masks
train_mask, test_mask = create_masks(df, 23, 0.3)

print(train_mask)

#Here we transform the boolean masks into tensors that hold the indices of the train and test users
train_nid = train_mask.nonzero().squeeze()
test_nid = test_mask.nonzero().squeeze()

print(train_nid)

Now, let's create the X and y DataFrames:

In [None]:
#Create X and Y dataframes
X = df.drop(['FRAUD'], axis=1)
y = df.drop(['DEVICE_TYPE','EXPECTED_VALUE','SALES'], axis=1)

print('The shape of the X DataFrame is: ',X.shape)
print('The shape of the y DataFrame is: ',y.shape)

In [None]:
#Transform the X and Y dataframes to tensors now as well
X = th.tensor(X.values).float()
y = th.tensor(y.values).type(th.LongTensor).squeeze_()

print(X.shape)
print(y.shape)

Let's create the functions to train the ML models, and a function that allows us to measure the performance of those models in terms of ROC Curve AUC, F1-Score, Precision and Recall:

In [None]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, f1_score, precision_score, recall_score

def get_gb_preds(X_train, y_train, X_test, seed=23):
    clf = GradientBoostingClassifier(random_state=seed)
    clf.fit(X_train,y_train)
    y_pred_probas = clf.predict_proba(X_test)
    return y_pred_probas

def get_lr_preds(X_train, y_train, X_test, seed=23):
    clf = LogisticRegression(random_state=seed)
    clf.fit(X_train,y_train)
    y_pred_probas = clf.predict_proba(X_test)
    return y_pred_probas

def get_results(y_pred_probas, y_test, threshold=0.5):
    pred_probas_1 = y_pred_probas[:,1]
    preds_1 = np.where(pred_probas_1>threshold,1,0)
    auc = roc_auc_score(y_test, pred_probas_1)
    f1 = f1_score(y_test,preds_1)
    prec = precision_score(y_test,preds_1)
    recall = recall_score(y_test,preds_1)
    return auc, f1, prec, recall

#### Logistic Regression Results

In [None]:
X[train_nid].shape

In [None]:
model = 'logistic-regression'
y_pred_probas = get_lr_preds(X[train_nid], y[train_nid], X[test_nid], seed=23)

auc, f1, prec, recall = get_results(y_pred_probas, y[test_nid], 0.5)
dict_results = {'Model':model, 'AUC':auc, 'F1 Score':f1, 'Precision':prec, 'Recall':recall}
#Start the results table with this first row (DataFrame.append was removed in pandas 2.0)
results_df = pd.DataFrame([dict_results], columns=['Model','AUC','F1 Score','Precision','Recall'])

results_df

#### GBoost results

In [None]:
model = 'GBoost'
y_pred_probas = get_gb_preds(X[train_nid], y[train_nid], X[test_nid], seed=23)

auc, f1, prec, recall = get_results(y_pred_probas, y[test_nid], 0.5)
dict_results = {'Model':model, 'AUC':auc, 'F1 Score':f1, 'Precision':prec, 'Recall':recall}
results_df = pd.concat([results_df, pd.DataFrame([dict_results])], ignore_index=True)

results_df

## Creating a Graph from own data using NetworkX

We will now see how to create a graph from data instead of randomly. For this we will have to import the CSV files in the data folder and process them for NetworkX.

First, let's import the relations CSV from the data folder:

In [None]:
import pandas as pd

edges_df = pd.read_csv('https://github.com/jdacevedo3010/graph-mahine-learning-workshop/raw/master/data/new_edges_workshop.csv')
edges_df.head()

In [None]:
edges_df.shape

Here you have a 2-column DataFrame that contains the undirected edges between distinct users. This is normally referred to as an "edge list" and it's a common way to create graphs. NetworkX also has a method to create a graph from a DataFrame of edges, so let's do just that:

In [None]:
G = nx.from_pandas_edgelist(edges_df,'~from','~to')
G.number_of_nodes()

Let's draw a portion of the graph to check it out:

In [None]:
G_draw = nx.from_pandas_edgelist(edges_df.head(100),'~from','~to')
nx.draw(G_draw)

Nice! We can now use this created graph to extract characteristics from it. 

Let's say we want to get the number of neighbors of each node in the graph:

In [None]:
degrees = dict(G.degree)

We can use the degree property in NetworkX to create a dictionary that holds the node IDs as the keys and the degree of each node as the value. There are also plenty of other measures we can extract from the NetworkX Graph object, like centrality or betweenness metrics.
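
As an illustrative sketch of those other measures, the cell below computes a few of them on a tiny path graph rather than on the workshop graph, so the values are easy to check by hand. The `toy_` names are only for this example.

```python
import networkx as nx

# Toy path graph: 0 - 1 - 2
G_toy = nx.path_graph(3)

toy_degree = dict(G_toy.degree)                      # number of neighbors of each node
toy_degree_c = nx.degree_centrality(G_toy)           # degree divided by (n - 1)
toy_betweenness = nx.betweenness_centrality(G_toy)   # share of shortest paths passing through a node
toy_closeness = nx.closeness_centrality(G_toy)       # inverse of the average distance to the other nodes

print(toy_degree)        # {0: 1, 1: 2, 2: 1}
print(toy_betweenness)   # {0: 0.0, 1: 1.0, 2: 0.0}
```

The middle node sits on the only shortest path between the two ends, which is why it gets the highest betweenness and closeness.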

More information about the NetworkX library can be found in this tutorial: https://networkx.org/nx-guides/content/tutorial.html. Or in the overall documentation guide of the package: https://networkx.org/nx-guides/index.html

#### And now let's enhance the features with some extracted from the graphs

Now we will use the previously generated dictionary of degrees as an additional feature to the DataFrame, along with the centrality measure PageRank.

PageRank was developed at Google and measures the importance of a node in a graph based on how well connected the node's neighbors are. More information can be found here: https://es.wikipedia.org/wiki/PageRank

Let's first calculate that for the previously generated graph:

In [None]:
pr = nx.pagerank(G,alpha=0.9)
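
To build some intuition for what this score captures, here is a simplified power-iteration sketch in plain Python. It is not the NetworkX implementation, which also handles dangling nodes, edge weights, and a convergence tolerance. It assumes an undirected graph given as a dictionary of neighbor lists in which every node has at least one neighbor. The `alpha` parameter plays the same role as the damping factor passed to `nx.pagerank` above.

```python
def pagerank_power_iteration(neighbors, alpha=0.9, iterations=100):
    """Simplified PageRank for an undirected graph given as {node: [neighbors]}.

    Assumes every node has at least one neighbor (no dangling nodes).
    """
    nodes = list(neighbors)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}  # start from a uniform score
    for _ in range(iterations):
        new_rank = {}
        for u in nodes:
            # Each neighbor v shares its current score equally among its own neighbors
            received = sum(rank[v] / len(neighbors[v]) for v in neighbors[u])
            # With probability (1 - alpha) the random walker jumps to a random node
            new_rank[u] = (1 - alpha) / n + alpha * received
        rank = new_rank
    return rank

# Star graph: the "hub" node is connected to three leaves (made-up example)
star = {"hub": ["a", "b", "c"], "a": ["hub"], "b": ["hub"], "c": ["hub"]}
scores = pagerank_power_iteration(star)
print(scores)
```

The scores sum to 1, and the hub ends up with the highest score because all three leaves pass their whole score to it.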


And now let's add both the degree and PageRank as features for the users:

In [None]:
#Map Degree and PageRank into the DataFrame
df_enhanced = df.copy()
df_enhanced['DEGREE'] = df.index.map(degrees)
df_enhanced['PAGERANK'] = df.index.map(pr)

df_enhanced.head()

#### Now let's add a personalized feature

In [None]:
device_score_dict = df_enhanced.DEVICE_TYPE.to_dict()

In [None]:
nx.set_node_attributes(G, device_score_dict, 'device_score')

In [None]:
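def average_attribute(G, u, attribute_name):
    #Mean of the given attribute over the neighbors of u (nodes built from an edge list always have at least one neighbor)
    return sum(G.nodes[v][attribute_name] for v in G.neighbors(u)) / G.degree(u)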

In [None]:
avg_device_score_dict = { u: average_attribute(G,u, 'device_score') for u in G.nodes }

df_enhanced['AVG_NEIGHBOR_DV_SCORE'] = df.index.map(avg_device_score_dict)

df_enhanced.head()
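
The cells above compute, for every user, the average DEVICE_TYPE value of its neighbors. As a small sketch of the same idea on a toy graph (the node IDs and scores below are made up, and `average_neighbor_attribute` is a hypothetical helper written only for this example):

```python
import networkx as nx

# Toy graph: user 1 is connected to users 2 and 3
G_toy = nx.Graph([(1, 2), (1, 3)])
nx.set_node_attributes(G_toy, {1: 0, 2: 4, 3: 2}, "device_score")  # made-up values

def average_neighbor_attribute(G, u, attribute_name):
    # Collect the attribute of every neighbor of u and average it
    neighbor_values = [G.nodes[v][attribute_name] for v in G.neighbors(u)]
    return sum(neighbor_values) / len(neighbor_values)

toy_avg = {u: average_neighbor_attribute(G_toy, u, "device_score") for u in G_toy.nodes}
print(toy_avg)  # {1: 3.0, 2: 0.0, 3: 0.0}
```

User 1 gets (4 + 2) / 2 = 3.0, while users 2 and 3 only see user 1 and therefore get 0.0.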

And finally, let's run the same models as before with these new features to see how the results compare with each other:

In [None]:
#Create X and Y dataframes
X_enhanced = df_enhanced.drop(['FRAUD'], axis=1).fillna(0)
y_enhanced = df_enhanced.drop(['DEVICE_TYPE','EXPECTED_VALUE','SALES','DEGREE','PAGERANK','AVG_NEIGHBOR_DV_SCORE'], axis=1)

print('The shape of the X DataFrame is: ',X_enhanced.shape)
print('The shape of the y DataFrame is: ',y_enhanced.shape)

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

X_enhanced[['DEGREE','PAGERANK','AVG_NEIGHBOR_DV_SCORE']] = scaler.fit_transform(X_enhanced[['DEGREE','PAGERANK','AVG_NEIGHBOR_DV_SCORE']])

In [None]:
#Transform the X and Y dataframes to tensors now as well
X_enhanced = th.tensor(X_enhanced.values).float()
y_enhanced = th.tensor(y_enhanced.values).type(th.LongTensor).squeeze_()

print(X_enhanced.shape)
print(y_enhanced.shape)

In [None]:
model = 'logistic-regression-enhanced'
y_pred_probas = get_lr_preds(X_enhanced[train_nid], y_enhanced[train_nid], X_enhanced[test_nid], seed=23)

auc, f1, prec, recall = get_results(y_pred_probas, y_enhanced[test_nid], 0.5)
dict_results = {'Model':model, 'AUC':auc, 'F1 Score':f1, 'Precision':prec, 'Recall':recall}
results_df = pd.concat([results_df, pd.DataFrame([dict_results])], ignore_index=True)

results_df

In [None]:
model = 'GBoost-enhanced'
y_pred_probas = get_gb_preds(X_enhanced[train_nid], y_enhanced[train_nid], X_enhanced[test_nid], seed=23)

auc, f1, prec, recall = get_results(y_pred_probas, y_enhanced[test_nid], 0.5)
dict_results = {'Model':model, 'AUC':auc, 'F1 Score':f1, 'Precision':prec, 'Recall':recall}
results_df = pd.concat([results_df, pd.DataFrame([dict_results])], ignore_index=True)

results_df

It looks like in this case the newly added graph features do not improve the performance of the Machine Learning models. This could be because the models already do a pretty good job with the non-graph features, and because the social graph does not provide useful information for fraud detection. That makes sense, given that fraudsters would probably try to hide their connections. Maybe another type of graph would be better suited to this task.

Given the rarity of these graph-based features, they carry vastly different information from the typically used features and as such can help the model better differentiate classes, which can lead to better performance. This conclusion is further developed in our previously published paper, Supporting Financial Inclusion with Graph Machine Learning and Super-App Alternative Data, where we show how graph-based features increase the AUC of credit risk models by up to 4-5 percentage points!

https://arxiv.org/abs/2102.09974


![](https://raw.githubusercontent.com/jdacevedo3010/graph-mahine-learning-workshop/master/images/paper_performance.png)

The figure above is taken from the paper. There, the authors show how models enhanced with graph-based features improve credit default prediction over models without graph-based features (Base in the figure).

## Transductive vs Inductive Learning

As shown in the paper, enhancing traditional ML pipelines with graph-based features can be highly beneficial for the performance of the model. The question then is: why don't we stop at this point? What's the point of developing highly complex algorithms such as Graph Neural Networks?

This serves us as an introduction to the next topic, Transductive & Inductive Learning.

Transductive Learning is when the model learns from the complete graph and, by hiding some labels, is tested on a masked subset of nodes. This means that when a set of unseen nodes is added, the model can't handle those new nodes and has to be retrained.

Inductive Learning is when the model doesn't need access to all nodes and edges of the graph for training. As such, we can train on one section of the graph and test on another, previously unseen section. This also means that the model can be trained on a graph, or on a snapshot of a graph at a given point in time, and then be used on another similar graph or on new nodes and edges added over time.

![](https://raw.githubusercontent.com/jdacevedo3010/graph-mahine-learning-workshop/master/images/transductive_inductive.png)
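
A minimal sketch of how the two settings differ in terms of the graph the model gets to see during training. The toy chain graph and the train/test node lists below are assumptions made only for this illustration.

```python
import networkx as nx

# Toy graph: a chain of five users 0 - 1 - 2 - 3 - 4
G_full = nx.path_graph(5)
train_nodes = [0, 1, 2]
test_nodes = [3, 4]

# Transductive setting: the whole graph structure is visible during training,
# only the labels of the test nodes are hidden
G_transductive_train = G_full

# Inductive setting: the test nodes and every edge touching them are removed
# from the training graph altogether
G_inductive_train = G_full.subgraph(train_nodes).copy()

print(G_transductive_train.number_of_edges())   # 4
print(G_inductive_train.number_of_edges())      # 2
print(3 in G_inductive_train)                   # False
```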

#### Now let's take a few minutes to think whether the previously built model can be used in a Transductive or Inductive setting 

Hint: Use the definitions of the Graph Based Features added to get to this answer:

PageRank: https://es.wikipedia.org/wiki/PageRank
Degree: https://en.wikipedia.org/wiki/Degree_(graph_theory)

Your answer goes here:






#### The inductive nature of Graph Neural Networks is the reason why they have gained so much popularity. With them we do not need expert knowledge to craft graph-based features, and we can train a model that can then be used on other similar graphs or on later versions of the graph where it was trained!

## Graph Neural Networks - Introduction to DGL 

Let's start by giving a brief introduction to DGL.

DGL, short for Deep Graph Library, is an easy-to-use library for building deep learning models on graphs (also known as Graph Neural Networks or GNNs). The package is framework agnostic and can use PyTorch, TensorFlow or Apache MXNet as its backend. It is also efficient and scalable, using fast and memory-efficient message passing primitives for training GNNs that can be scaled with GPU acceleration.

Here you can find more information on the package or educative examples where it's used: https://www.dgl.ai/

In [None]:
import dgl
import torch as th
import torch.nn as nn
import torch.nn.functional as F

As said in the presentation, there are many tasks that can be performed with GNNs:

1. Node Classification: Where we want to predict whether a node belongs to a certain class. Think of the previous task, where we wanted to determine whether users were fraudulent.
2. Link Prediction: Where we want to predict the likelihood of every pair of nodes being connected. Think of a friend recommender system on Facebook: the likelier a connection is, the more the platform will recommend that you add that person.
3. Graph Classification: Where we want to perform classification on a certain structure of nodes or on a whole graph. This can be used to detect families in a graph.

For the purpose of this workshop, and considering the time constraint, we will be reviewing Node Classification, performing the same classification task done with the previous traditional ML models.


#### Let's first remember a few things from the graph, do you think we are dealing with a heterogeneous or homogeneous graph?

This is quite relevant because it dictates what kind of GNN models we can use. For example, a Graph Convolutional Network (GCN) can only handle homogeneous graphs. If we have a heterogeneous graph, we have to use a variant called Relational Graph Convolutional Network (RGCN).

#### GCN Review

This model is one of the earliest widely adopted GNNs and is considered one of the most basic GNN variants. It was developed by Kipf and Welling and published in 2017 in this paper: https://arxiv.org/pdf/1609.02907.pdf.

The convolution concept used here is quite similar to that of the Convolutional Neural Networks (CNN) typically used in Computer Vision. The key difference is that a CNN is built for Euclidean structured data, and graphs are definitely not that.

This can be easily explained with an example. When we split an image into pixels for Computer Vision, we know that every pixel has the same number of pixels surrounding it and that they follow a specific order: it's a perfect grid structure. In graphs, not every node has the same number of neighbors and there is no intrinsic order among them. Those irregularities give graphs their non-Euclidean nature.

![](https://raw.githubusercontent.com/jdacevedo3010/graph-mahine-learning-workshop/master/images/convolutions.png)

source: https://arxiv.org/pdf/1901.00596.pdf

Therefore, most GNNs are modifications that take into account the non-Euclidean nature of graphs. Mostly, this is done via permutation-invariant aggregations (like the sum or average) of the neighbors of each node. This is commonly known as message passing.
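
A small plain-Python sketch of what permutation invariance means here. The feature values are made up, and the three aggregation helpers are hypothetical functions written only for this example. Sum and mean give the same result no matter how the neighbors are ordered, while concatenation does not.

```python
# Features of the three neighbors of some node, listed in two different orders
neighbor_feats = [[1.0, 0.0], [2.0, 5.0], [3.0, 1.0]]
shuffled_feats = [neighbor_feats[2], neighbor_feats[0], neighbor_feats[1]]

def aggregate_sum(feats):
    # Element-wise sum over the neighbors
    return [sum(column) for column in zip(*feats)]

def aggregate_mean(feats):
    # Element-wise average over the neighbors
    return [total / len(feats) for total in aggregate_sum(feats)]

def aggregate_concat(feats):
    # Concatenation depends on the order, so it is NOT permutation invariant
    return [value for feat in feats for value in feat]

print(aggregate_sum(neighbor_feats) == aggregate_sum(shuffled_feats))        # True
print(aggregate_mean(neighbor_feats) == aggregate_mean(shuffled_feats))      # True
print(aggregate_concat(neighbor_feats) == aggregate_concat(shuffled_feats))  # False
```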

In the case of GCN the formula for the forward of the layers is the following:

![](https://raw.githubusercontent.com/jdacevedo3010/graph-mahine-learning-workshop/master/images/gcn.png)

Here, through the matrix product between A and H, we transform each node's representation (its features or its previous-layer representation) into the sum of the neighboring nodes' features or representations. The result is then passed through a selected non-linearity represented by sigma (this can be a ReLU, Leaky ReLU, sigmoid, etc.). Note that this does not aggregate a node's own features unless every node has a self-loop, which is a common modification done to graphs for GNNs.

It's also important to notice that with each message passing layer of the GNN we aggregate neighbors that are one more hop away!
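
To make the description above concrete, here is a rough NumPy sketch of a single layer on a three-node path graph. The feature values and the weight matrix `W` are made up for illustration. Kipf and Welling's formulation also normalizes the adjacency matrix by node degree, which this sketch leaves out to keep the arithmetic easy to follow.

```python
import numpy as np

# Toy path graph 0 - 1 - 2 (undirected), as an adjacency matrix
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)

# One scalar feature per node (made-up values)
H = np.array([[1.0], [10.0], [100.0]])

# Without self-loops, A @ H sums only the neighbors' features
neighbor_sum = A @ H
print(neighbor_sum.ravel())   # [ 10. 101.  10.]

# Adding self-loops (A + I) also includes each node's own features
A_hat = A + np.eye(3)
with_self = A_hat @ H
print(with_self.ravel())      # [ 11. 111. 110.]

# One full layer: aggregate, apply a (hypothetical) weight matrix, then a ReLU
W = np.array([[0.5]])
H_next = np.maximum(A_hat @ H @ W, 0)
print(H_next.ravel())         # [ 5.5 55.5 55. ]

# Stacking two layers reaches two-hop neighbors: node 0 now receives from node 2
two_hop = A_hat @ A_hat
print(A_hat[0, 2], two_hop[0, 2])   # 0.0 1.0
```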


#### Now let's code that up!

Let's build a two-layer GCN that classifies our nodes on whether they are fraudsters or not! With DGL this is as easy as creating a GCN model class that uses the GraphConv layer.

In [None]:
#GCN Class
from dgl.nn import GraphConv

class GCN(nn.Module):
    def __init__(self, in_feats, h_feats, num_classes):
        super(GCN, self).__init__()
        #Here we go from our original number of features to the size of our hidden representations
        self.conv1 = GraphConv(in_feats, h_feats)
        #Here we go from the hidden representation to the dimension of the number of classes for the probabilities
        self.conv2 = GraphConv(h_feats, num_classes)

    def forward(self, g, in_feat):
        h = self.conv1(g, in_feat) #Here we apply the first convolution
        h = F.relu(h) #Here we apply the selected non-linearity. In this case a Relu
        h = self.conv2(g, h) #Here we apply the second convolution
        return h


Then we can simply create the model with the given dimensions. We will use the feature set without the graph-based features and a hidden state size of 16.

In [None]:
hidden_size = 16
num_classes = 2

# Create the model with given dimensions
model = GCN(X.shape[1], hidden_size, num_classes)

And now we will create the training function using PyTorch:

In [None]:
def train(g, features, labels, train_mask, test_mask, epochs, model):
    optimizer = th.optim.Adam(model.parameters(), lr=0.01) #Selected optimizer
    best_val_acc = 0
    best_test_acc = 0

    #Transform the boolean masks into node indices first, so the indices always refer to the full graph
    train_idx = train_mask.nonzero().squeeze()
    test_idx = test_mask.nonzero().squeeze()

    #Here we create the validation set with a portion (20%) of the train set
    n_val = len(train_idx) // 5
    val_idx = train_idx[:n_val]
    train_idx = train_idx[n_val:]

    for e in range(epochs):
        # Forward
        logits = model(g, features)

        # Compute prediction
        pred = logits.argmax(1)

        # Compute loss
        # Note that you should only compute the losses of the nodes in the training set.
        loss = F.cross_entropy(logits[train_idx], labels[train_idx])

        # Compute accuracy on training/validation/test
        train_acc = (pred[train_idx] == labels[train_idx]).float().mean()
        val_acc = (pred[val_idx] == labels[val_idx]).float().mean()
        test_acc = (pred[test_idx] == labels[test_idx]).float().mean()

        # Save the best validation accuracy and the corresponding test accuracy.
        if best_val_acc < val_acc:
            best_val_acc = val_acc
            best_test_acc = test_acc

        # Backward
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if e % 5 == 0:
            print('In epoch {}, loss: {:.3f}, val acc: {:.3f} (best {:.3f}), test acc: {:.3f} (best {:.3f})'.format(
                e, loss, val_acc, best_val_acc, test_acc, best_test_acc))

    #Check results on test set
    #The model outputs logits, so we apply a softmax to turn them into probabilities
    with th.no_grad():
        y_pred_probas = F.softmax(model(g, features), dim=1)[test_idx].numpy()
    model_name = 'GCN'

    auc, f1, prec, recall = get_results(y_pred_probas, labels[test_idx], 0.5)
    dict_results = {'Model':model_name, 'AUC':auc, 'F1 Score':f1, 'Precision':prec, 'Recall':recall}
    results_df = pd.DataFrame([dict_results], columns=['Model','AUC','F1 Score','Precision','Recall'])
    return results_df

After that, we have to create the DGL graph to feed into the GNN. It is not the same object as in NetworkX, but this is not a hard process either, as you shall see:

In [None]:
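#We start by using the same edges df but transforming the from and to columns into arrays
src = edges_df['~from'].to_numpy()
snk = edges_df['~to'].to_numpy()

#dgl.graph creates directed edges (src -> snk). Since our edges are undirected,
#we also add the reverse direction so that messages flow both ways
G_dgl = dgl.to_bidirected(dgl.graph((src,snk)))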

Additionally, DGL graph objects have some handy features, like a quick description of the whole graph:

In [None]:
G_dgl

That way we can quickly check the number of nodes or edges. Additionally, when dealing with heterogeneous graphs we can also see the different types of both nodes and edges. The ndata_schemes and edata_schemes entries list the attributes stored on the nodes and edges of the graph, respectively.

More information about what we can do with DGL graph objects can be found here: https://docs.dgl.ai/en/0.6.x/generated/dgl.graph.html

We also have to add self-loops to the graph so that the DGL model also takes each node's own information into account in the aggregation. Let's do that using the simple add_self_loop() function of DGL (https://docs.dgl.ai/en/0.6.x/generated/dgl.add_self_loop.html):

In [None]:
#Add one self-loop per node, as described above
G_dgl = dgl.add_self_loop(G_dgl)

In [None]:
y[train_mask]

In [None]:
epochs = 100
results_df = train(G_dgl, X, y, train_mask, test_mask, epochs, model)

In [None]:
results_df

As you can see, the results are not up to par with the other models, although this model hasn't been optimized at all in terms of hyperparameters, and tuning them could improve its performance. Additionally, these models are quite useful because they require neither the construction of new features nor expert input in the crafting of the model. This, along with the inductive nature of the model, allows us to use it on any similar graph!

We have also shown the effectiveness of GNNs (more specifically RGCN) on more complex datasets and graphs in our paper: https://arxiv.org/abs/2107.13673


#### As said before, there are many other GNN models. For this specific graph we can use more complex variants like GAT, GraphSAGE or GIN, which often generalize better on the task at hand and can lead to better results.

Other GNNs that we recommend trying on this same dataset by changing the model class object:

GAT adds an attention mechanism to the GCN model, so that neighbors with certain features weigh more in the classification task. More information can be found here: https://docs.dgl.ai/en/0.4.x/tutorials/models/1_gnn/9_gat.html

GraphSAGE is a general inductive framework that leverages node feature information (e.g., text attributes) to efficiently generate node embeddings for previously unseen data. More information can be found here: https://github.com/dmlc/dgl/tree/master/examples/pytorch/graphsage

GIN was designed to be as expressive as the Weisfeiler-Lehman graph isomorphism test, which is the upper bound for standard message passing GNNs. More information can be found here: https://github.com/dmlc/dgl/tree/master/examples/pytorch/gin

Now you can try to modify the model class object with whichever other model you prefer and compare the results with GCN!

In [None]:
#Your code goes here

We hope you liked this introduction to GNNs. This is a highly compelling subject that is currently one of the most researched in the world. Feel free to contact us with any questions you may have about this!