<h1>GNNs for Fraud Detection </h1>
This assessment is divided into two parts:

- [Week 1] In the first part, you'll discover how to import a CSV file and create a graph dataset. You'll have fun experimenting with graph query functions and learning to sample batches of sub-graphs for training. Make sure to complete this part before the end of week 1. This part has 3 sections, each worth 5 points.

- [Week 2] In the second part, you'll construct your very own GCN model and train it using the data you prepared in week 1. Get ready to watch your fraud detection system come to life! This part has 2 sections, each worth 5 points.

In [1]:
'''
Note: We will train our GNNs on a CPU runtime because the graph is very small and training time should be fairly low.
You can use a GPU if you wish, but make sure that you install the matching DGL version from https://www.dgl.ai/pages/start.html
The code below installs DGL for a CPU runtime.
'''

!pip install  dgl -f https://data.dgl.ai/wheels/repo.html
!pip install  dglgo -f https://data.dgl.ai/wheels-test/repo.html

Looking in links: https://data.dgl.ai/wheels/repo.html
Collecting dgl
  Downloading https://data.dgl.ai/wheels/dgl-1.0.2-cp310-cp310-win_amd64.whl (3.0 MB)
Installing collected packages: dgl
Successfully installed dgl-1.0.2
Looking in links: https://data.dgl.ai/wheels-test/repo.html
Collecting dglgo
  Downloading dglgo-0.0.2-py3-none-any.whl (63 kB)
Collecting ruamel.yaml>=0.17.20
  Downloading ruamel.yaml-0.17.21-py3-none-any.whl (109 kB)
Collecting typer>=0.4.0
  Downloading typer-0.7.0-py3-none-any.whl (38 kB)
Collecting ogb>=1.3.3
  Downloading ogb-1.3.6-py3-none-any.whl (78 kB)
Collecting rdkit-pypi
  Downloading rdkit_pypi-2022.9.5-cp310-cp310-win_amd64.whl (20.5 MB)


In [2]:
# You can safely ignore this warning if it appears: "DGL backend not selected or invalid.  Assuming PyTorch for now."
import torch
import dgl
import pandas as pd


DGL backend not selected or invalid.  Assuming PyTorch for now.


Setting the default backend to "pytorch". You can change it in the ~/.dgl/config.json file or export the DGLBACKEND environment variable.  Valid options are: pytorch, mxnet, tensorflow (all lowercase)


<h2>Week 1</h2>

<h3>Loading Your First Graph Dataset </h3>
You've been provided with two CSV files that contain an open-source fraud detection dataset created by Amazon.

<h4> Amazon Fraud Detection Dataset </h4>
The Amazon dataset covers product reviews in the Musical Instruments category. Users with over 80% helpful votes are labeled as benign entities, while those with fewer than 20% helpful votes are labeled as fraudulent entities. Detecting fraudulent users on the Amazon dataset is therefore a binary classification task. Each user has a 25-dim dense feature representation obtained by calculating certain statistical properties of the user's behavior. Features include properties like the entropy of the user's ratings, time entropy, the sentiment of the user's reviews, etc. You can learn more about the features from Table 1 in the paper: https://arxiv.org/pdf/2005.10150.pdf.
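
The labeling rule can be sketched in a few lines of Python. This is only an illustration: the helper name `label_user`, its arguments, the 0/1 encoding, and the choice to leave users between the two thresholds unlabeled are all assumptions. The provided files already contain the labels, so you do not need to compute them.

```python
from typing import Optional


def label_user(helpful_votes: int, total_votes: int) -> Optional[int]:
    """Hypothetical helper: 0 = benign, 1 = fraudulent, None = unlabeled."""
    if total_votes == 0:
        return None  # no votes, so the helpful ratio is undefined
    ratio = helpful_votes / total_votes
    if ratio > 0.8:
        return 0  # over 80% helpful votes: benign
    if ratio < 0.2:
        return 1  # fewer than 20% helpful votes: fraudulent
    return None  # in between: assumed to be left out of the binary task


print(label_user(9, 10))  # 0
print(label_user(1, 10))  # 1
print(label_user(5, 10))  # None
```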

The nodes in the graph are therefore users on the Amazon e-commerce platform, and each node carries the handcrafted features described above. The node information is available in the file below:
- node_information.csv: contains node_id as the first column and features 1-25 in the corresponding columns, the last column is the label of the user (benign, fraudulent)

To create a network of interconnected users and generate a graph, we link users who share similarities. The file provided contains connections between users exhibiting the top 5% mutual review text similarities (calculated using TF-IDF) among all users. In other words, users with high textual resemblances are connected, based on the assumption that this structure could reveal insights into the communication patterns among fraudulent users.

- edge_data.csv: contains 2 columns with source and destination node ids indicating an edge between the source and destination columns
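
To make the edge-list format concrete, here is a minimal sketch that reads a tiny in-memory CSV with Python's standard `csv` module (the assignment itself suggests Pandas). The column names `source` and `destination`, the sample rows, and the helper name `read_edge_list` are assumptions for illustration; the real header of edge_data.csv may differ.

```python
import csv
import io

# Hypothetical file contents; the real edge_data.csv may use other headers.
sample_csv = "source,destination\n1,4\n2,5\n3,0\n"


def read_edge_list(text: str):
    """Return two parallel lists of integer node ids (sources, destinations)."""
    src, dst = [], []
    for row in csv.DictReader(io.StringIO(text)):
        src.append(int(row["source"]))
        dst.append(int(row["destination"]))
    return src, dst


src, dst = read_edge_list(sample_csv)
edges = list(zip(src, dst))     # row i links src[i] -> dst[i]
num_nodes = max(src + dst) + 1  # assumes ids are consecutive integers from 0

print(edges)      # [(1, 4), (2, 5), (3, 0)]
print(num_nodes)  # 6
```

Each row becomes one directed edge, so the two columns map directly onto the source and destination tensors that the loading functions are expected to return.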

Please implement the functions marked with "NotImplementedError()" and then run the driver code to exercise the functions that you've implemented. To test an individual function, run the corresponding driver code.

NOTE: Kindly be aware that the open-source graph has been adapted to meet the requirements of this assignment. Please utilize the files provided to you and refrain from downloading files from external sources.

In [None]:

''' 
The dataset for this project is available in the shared directory: https://drive.google.com/drive/folders/1hiPnkO9VcQTMptBLvbC5WGfJS6gQtOaS?usp=sharing
Please create a copy in your own Google Drive and mount the path below
'''
from google.colab import drive
drive.mount('/content/drive/')


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
#import required packages
import torch
import dgl
import pandas as pd


In [None]:
#Section 1: Data Loading
def load_node_information_from_csv(path: str):
    '''Given a path to the node information csv file, create a tensor of node 
    features and corresponding labels. You can load using the Pandas library
    Args:
        path: path to a csv file
    Returns: 
        a tensor of node features of the shape (num_nodes, num_features) and a tensor of 
        node labels of the shape (num_nodes)
    '''
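    # YOUR CODE HERE
    node_info = pd.read_csv(path)
    node_labels = torch.from_numpy(node_info['label'].values)
    # Drop the label and, if present, the id column so that only the feature
    # columns remain (assumes the id column is named 'node_id')
    feature_columns = node_info.drop(columns=['label', 'node_id'], errors='ignore')
    node_features = torch.from_numpy(feature_columns.values)
    return node_features, node_labels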
    

def load_edges_from_csv(path: str):
    '''Given a path to a csv file, create a tuple of tensors, you can use the Pandas library
    Args:
        path: path to a csv file
    Returns: 
        src: a pytorch tensor of source node ids
        dst: a pytorch tensor of destination node ids
    '''
    # YOUR CODE HERE
    # make sure that the node ids have the required dtype, i.e. int64
    raise NotImplementedError()
    

def create_graph_from_tensors(src_tensor: torch.Tensor, dst_tensor: torch.Tensor):
    '''Given source and destination edge tensors (u, v), create a graph in which the
    i-th element of u is connected to the i-th element of v (a one-to-one mapping).
    Please refer to: https://docs.dgl.ai/en/1.0.x/generated/dgl.graph.html
    For example:
    u = torch.tensor([1, 2, 3]),
    v = torch.tensor([4, 5, 0])
    should create a graph with 6 nodes and 3 edges:
    1 -> 4, 2 -> 5, 3 -> 0
    Args:
        src_tensor: a tensor of source node ids
        dst_tensor: a tensor of destination node ids
    Returns: 
        a DGL graph
    '''
    # YOUR CODE HERE
    raise NotImplementedError()
    


def add_node_features_and_labels(graph: dgl.DGLGraph, node_features: torch.Tensor, node_labels: torch.Tensor):
    '''Given a graph and tensors of node features and labels, add the node features and labels to
    the graph object so that they can be accessed later directly from the graph object.
    **Name the features and labels as "features" and "labels" respectively**
    Please refer to: https://docs.dgl.ai/guide/graph-feature.html?highlight=features
    Args:
        graph: a DGL graph
        node_features: a tensor of node features of type float()
        node_labels: a tensor of node labels
    Returns:
        a DGL graph with node features with shape (num_nodes, num_features) and labels with shape (num_nodes)
    '''

    #**Name the features and labels as "features" and "labels" respectively**
    # YOUR CODE HERE
    raise NotImplementedError()



In [None]:
# Section 2: data exploration
# play around with your data
def get_num_nodes(graph: dgl.DGLGraph):
    '''Given a DGL graph, return the number of nodes
    please refer to: https://docs.dgl.ai/en/0.1.x/api/python/graph.html#querying-graph-structure
    Args:
        graph: a DGL graph
    Returns: 
        the number of nodes in the graph
    '''
    # YOUR CODE HERE
    return graph.num_nodes()


def check_if_edge_exists(graph: dgl.DGLGraph, u: int, v: int):
    '''Given a DGL graph and two nodes u and v, 
    return True if the edge (u,v) exists in the graph, False otherwise
    please refer to: https://docs.dgl.ai/en/0.1.x/api/python/graph.html#querying-graph-structure
    Args:
        graph: a DGL graph
        u: a node
        v: a node
    Returns: 
        True if the edge (u,v) exists in the graph, False otherwise
    '''
    # YOUR CODE HERE
    raise NotImplementedError()


def get_first_hop_neighbors(graph: dgl.DGLGraph, node: int):
    '''Given a DGL graph and a node, return the first hop neighbors of the node
       DO NOT USE DGL's built-in neighbor sampler. First hop neighbors are the nodes that are directly connected to the node
       please refer to: https://docs.dgl.ai/en/0.1.x/api/python/graph.html#querying-graph-structure
    Args:
        graph: a DGL graph
        node: a node
    Returns: 
        a list of first hop neighbors of the node
    '''
    # YOUR CODE HERE
    raise NotImplementedError()


def get_second_hop_neighbors(graph: dgl.DGLGraph, node: int):
    '''Given a DGL graph and a node, return the second hop neighbors of the node
       DO NOT USE DGL's built-in neighbor sampler. Second hop neighbors are the nodes that are connected to the first hop neighbors of the node
    Args:
        graph: a DGL graph
        node: a node
    Returns: 
        a tensor of second hop neighbors of the node
    '''
    # YOUR CODE HERE
    raise NotImplementedError()
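
For intuition about first- and second-hop neighbors, the following sketch uses a plain Python adjacency dictionary instead of DGL. The toy edge list and the helper names are made up for illustration. Excluding the start node and its first-hop neighbors from the second hop is one possible convention, not necessarily the one the driver code expects.

```python
from collections import defaultdict

# Toy directed edges (u -> v), invented for this example.
toy_edges = [(0, 1), (0, 2), (1, 3), (2, 3), (2, 4), (4, 0)]

adjacency = defaultdict(set)
for u, v in toy_edges:
    adjacency[u].add(v)


def first_hop(node):
    """Nodes reachable from `node` over exactly one edge."""
    return sorted(adjacency[node])


def second_hop(node):
    """Nodes reachable over two edges, excluding `node` and its first hop."""
    one_hop = adjacency[node]
    two_hop = set()
    for neighbor in one_hop:
        two_hop |= adjacency[neighbor]
    return sorted(two_hop - one_hop - {node})


print(first_hop(0))   # [1, 2]
print(second_hop(0))  # [3, 4]
```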
    


<h4>Data Sampling</h4>
As touched upon in the reading material, graphs are relational, which distinguishes them from datasets like images or text that have a fixed context window. Consequently, when sampling a node for training, we must also sample the neighbors we want to include for aggregation. We'll delve deeper into the GCN model next week, but for now, remember that graph neural networks learn from both node-specific information (i.e., node features) and structural information (a node's neighborhood). As a result, a data batch typically consists of a subgraph around each sampled node, with the neighborhood defined in a particular way. For example, we can treat a node's first- and second-hop neighbors as its neighborhood. Alternatively, we can use a fixed number of neighbors in each hop (chosen either randomly or through a ranking process), commonly referred to as the fan-out. So, when we say "sample a node's first-hop neighborhood with a fan-out of 5," we mean that we select up to 5 neighbors from the node's first hop. In this section, we use DGL's built-in neighbor sampler to obtain batches of node data.
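
As a rough illustration of fan-out, the sketch below samples at most a fixed number of neighbors per node in plain Python. It is not DGL's sampler; the adjacency dictionary, the function name `sample_neighbors`, and the seed are assumptions made for the example.

```python
import random

# Toy adjacency list (node -> neighbors), invented for this example.
neighbors = {
    0: [1, 2, 3, 4, 5, 6, 7, 8],
    1: [0, 2],
    2: [0, 1, 3],
}


def sample_neighbors(node, fanout, rng):
    """Uniformly sample up to `fanout` distinct neighbors of `node`."""
    candidates = neighbors.get(node, [])
    if len(candidates) <= fanout:
        return list(candidates)  # fewer neighbors than the fan-out: keep them all
    return rng.sample(candidates, fanout)


rng = random.Random(0)  # fixed seed so repeated runs give the same sample
sampled = sample_neighbors(0, 5, rng)
print(len(sampled))                 # 5
print(sample_neighbors(1, 5, rng))  # [0, 2]
```

With a fan-out list such as [15, 15], the same idea is applied once per hop: up to 15 neighbors of each seed node, then up to 15 neighbors of each of those.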


In [None]:
def create_data_sampler(fanout_list):
    '''create a DGL data sampler
    Args:
        fanout_list: the number of neighbors to sample at each hop; its length is the number of hops
    Returns: 
        a DGL data sampler of type NeighborSampler. 
        This sampler will sample neighborhood as specified by the fanout_list.
        read more about this sampler in the docs: 
        https://docs.dgl.ai/generated/dgl.dataloading.NeighborSampler.html
    '''
    # YOUR CODE HERE
    raise NotImplementedError()


def create_data_loaders(graph: dgl.DGLGraph, sampler, batch_size: int, train_ids: torch.Tensor, val_ids: torch.Tensor):
    '''Given a DGL graph, a sampler, a batch size, and the training and validation node ids,
    use the DGL data loader to create data loaders for the training and validation sets
    reference: https://docs.dgl.ai/generated/dgl.dataloading.DataLoader.html#dgl.dataloading.DataLoader
    Args:
        graph: a DGL graph
        sampler: a DGL data sampler
        batch_size: the size of the batch
        train_ids: a tensor of node ids in the training set
        val_ids: a tensor of node ids in the validation set
    Returns: 
        train and validation data loader objects
    '''
    # YOUR CODE HERE
    raise NotImplementedError()
    


In [None]:
#driver code for sections 1 and 2
'''
DO NOT CHANGE ANYTHING IN THE CODE BELOW, RUN IT TO TEST YOUR CODE CORRECTNESS
Just make sure that you've set the file paths below correctly
'''

src_edges, dst_edges = load_edges_from_csv(f'C:/Users/nimes/Downloads/Fraud_Detection_using_GNN/edge_data.csv')
graph = create_graph_from_tensors(src_edges, dst_edges)
num_nodes = get_num_nodes(graph)
print('Number of nodes in the graph: ', num_nodes)
edge_exists = check_if_edge_exists(graph, 0, 1)
print('Does the edge (0,1) exist in the graph? ', edge_exists)
first_hop_neighbors = get_first_hop_neighbors(graph, 0)
print('First hop neighbors of node 0: ', first_hop_neighbors)
second_hop_neighbors = get_second_hop_neighbors(graph, 0)
print('Second hop neighbors of node 0: ', second_hop_neighbors)
graph_features, labels = load_node_information_from_csv(f'C:/Users/nimes/Downloads/Fraud_Detection_using_GNN/node_information.csv')
graph = add_node_features_and_labels(graph, graph_features, labels)
print('Graph with node features: ', graph)
graph = dgl.add_self_loop(graph) #add self loops to prevent 0 degree nodes (DGL crashes when node-degree=0)


In [None]:
#driver code for section 3: we create a random 80:20 split of train and validation ids and use these ids to instantiate the data loaders
'''
DO NOT CHANGE ANYTHING IN THE CODE BELOW, RUN IT TO TEST YOUR CODE CORRECTNESS
'''


#create train and val masks
train_mask = torch.zeros(num_nodes, dtype=torch.bool)
val_mask = torch.zeros(num_nodes, dtype=torch.bool)

torch.manual_seed(0)
train_mask[torch.randperm(num_nodes)[:int(0.8*num_nodes)]] = True
val_mask = ~train_mask

#obtain respective ids
train_ids = torch.nonzero(train_mask, as_tuple=True)[0]
val_ids = torch.nonzero(val_mask, as_tuple=True)[0]

#create sampler and data loaders
sampler = create_data_sampler([15,15])
train_loader, val_loader = create_data_loaders(graph, sampler, 100, train_ids, val_ids)

for input_nodes, output_nodes, blocks in train_loader:
    print("Input nodes in the MFG (Message Flow Graph)")
    print(input_nodes)
    print("Output nodes in the MFG (Message Flow Graph)")
    print(output_nodes)
    print("Message Flow Graph used for training")
    print("Layer 1")
    print(blocks[0])
    print("Layer 2")
    print(blocks[1])
    break

<h2>END OF WEEK 1!</h2>

<h2>Week 2</h2>
This week, we'll use the data we've prepared to construct our very own GCN model, then train it and assess it on a validation dataset! Read the resources in the reading material to understand more about GCNs. The directions stay the same: complete all the unimplemented functions and watch your model come to life!

In [None]:
#section 4 (Model Building)
'''
create your first dgl gcn model with 2 layers
Remember that 2 layer gcn means that we're 
looking at the 1st hop and 2nd hop neighbors of the nodes in the batch
'''

import torch
import torch.nn as nn
import torch.nn.functional as F
class GCN(nn.Module):
    def __init__(self, in_feats, hidden_size, num_classes):
        super(GCN, self).__init__()
        '''
        define the first and second layer of the gcn model using dgl's GraphConv module
        read more here: https://docs.dgl.ai/generated/dgl.nn.pytorch.conv.GraphConv.html
        make sure to use the correct in_feats and out_feats for the layers
        '''

        self.conv1 = None #replace None with the definition of the first layer of the gcn model
        self.conv2 = None  #replace None with the definition of the second layer of the gcn model
        
    def forward(self, block, inputs):
        '''
        Implement the forward pass of the gcn model based on the layers defined in the __init__ function
        '''
        #remember that you need to pass respective layer information i.e., block[0] for layer 1 and block[1] for layer 2
        h = None #replace None with the first layer hidden representation of the nodes in the batch
        h = F.relu(h)
        h = None #replace None with the second layer hidden representation of the nodes in the batch
        return h
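
The idea that two layers reach the second hop can be checked with a small plain-Python sketch. It replaces GraphConv with a simple mean over each node and its neighbors, with no learned weights, so it only illustrates how information spreads. The path graph and the function name `mean_aggregate` are assumptions made for the example.

```python
# Undirected path graph 0 - 1 - 2 - 3, invented for this example.
path_neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}


def mean_aggregate(values):
    """One message-passing round: average each node with its neighbors."""
    updated = {}
    for node, nbrs in path_neighbors.items():
        group = [node] + nbrs  # include the node itself (a self-loop)
        updated[node] = sum(values[n] for n in group) / len(group)
    return updated


# Only node 2 carries a signal at the start.
signal = {0: 0.0, 1: 0.0, 2: 1.0, 3: 0.0}

after_one = mean_aggregate(signal)
after_two = mean_aggregate(after_one)

print(after_one[0])      # 0.0, node 2 is two hops away from node 0
print(after_two[0] > 0)  # True, the signal arrives after a second round
```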

In [None]:
#section 5 (write evaluate function, refer to the driver code below for hints)

def evaluate(model, val_loader, criterion):
    '''
    Implement the evaluation function and return the loss and accuracy. 
    The code should be very similar to the train function below, except that you need to compute metrics and not backprop loss

    Args:
        model: GCN Model
        val_loader: validation dataset loader
        criterion: loss criterion 
    Returns: 
        values of loss and accuracy
    '''
    #YOUR CODE HERE
    raise NotImplementedError()
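
Accuracy for a batch is the fraction of nodes whose highest-scoring class matches the label. The sketch below shows that computation on plain Python lists. The logits, the labels, the 0/1 class encoding, and the helper name `batch_accuracy` are made up; a real evaluate function would work on tensors from the validation loader instead.

```python
def batch_accuracy(logits, labels):
    """Fraction of rows whose arg-max class equals the true label."""
    correct = 0
    for scores, label in zip(logits, labels):
        predicted = max(range(len(scores)), key=lambda c: scores[c])
        correct += int(predicted == label)
    return correct / len(labels)


# Made-up scores for 4 nodes and 2 classes (assumed: 0 = benign, 1 = fraudulent).
example_logits = [[2.0, -1.0], [0.3, 0.9], [1.5, 0.2], [-0.5, 0.1]]
example_labels = [0, 1, 1, 1]

print(batch_accuracy(example_logits, example_labels))  # 0.75
```

When averaging over several batches, weight each batch by its size so that a smaller final batch does not skew the result.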

In [None]:
#driver code
#DO NOT CHANGE ANYTHING IN THE CODE BELOW, RUN IT TO TEST YOUR CODE CORRECTNESS


#train function, use this as a helper to complete the evaluate function above
def train(model, train_loader, optimizer, criterion):
    model.train()
    for input_nodes, output_nodes, blocks in train_loader:
        inputs = blocks[0].srcdata['features'].float()
        labels = blocks[1].dstdata['labels']
        logits = model(blocks, inputs)
        loss = criterion(logits, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

#initialize the model, optimizer, and criterion
in_feat_shape = graph.ndata['features'].shape[1]
hidden_size = 16
num_classes = 2
model = GCN(in_feat_shape, hidden_size, num_classes)
loss_func = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters())

#train the model for 50 epochs and evaluate every 5 epochs
for epoch in range(50):
    print(f'Running Epoch {epoch}')
    train(model, train_loader, optimizer, loss_func)
    if epoch % 5 == 0:
        loss, acc = evaluate(model, val_loader, loss_func)
        print('Epoch: {}, Loss: {:.4f}, Accuracy: {:.4f}'.format(epoch, loss, acc))