# GNNs for Automated Feature Engineering

## Manual Feature Engineering

Feature engineering is a pivotal process in machine learning, *involving the selection, creation, and transformation of input variables* to enhance model performance. It entails tasks such as feature selection to identify essential variables, creating new features to capture complex relationships, and handling data preprocessing tasks like encoding categorical variables or scaling values. *This process heavily relies on domain expertise and evolves over time to adapt to changing data dynamics, ultimately ensuring that machine learning models are equipped with the most relevant and informative features for accurate predictions and insights.*

**In the Mini Project on Graph ML, we performed manual feature engineering on the Stanford Facebook dataset in order to cluster the graph's nodes.**

## Automated Feature Engineering through Supervised Neural Networks

Neural networks excel not only in learning complex patterns but also in automating feature engineering. Rather than manually designing features, neural networks can process raw data and derive informative representations automatically. For instance, in image classification, Convolutional Neural Networks (CNNs) can learn essential image features such as edges, textures, and shapes directly from pixel values. Similarly, Recurrent Neural Networks (RNNs) are proficient at capturing sequential patterns, making them suitable for natural language processing tasks where they can extract meaningful features from text data. Neural networks, with their ability to discover intricate relationships and abstractions within data, have revolutionized feature engineering by streamlining the process and achieving state-of-the-art results across various domains.

**One common methodology for feature extraction using neural networks involves training a Neural Network Classifier on a labeled dataset, feeding the raw data into the trained classifier, and utilizing the outputs from intermediate layers as the extracted features. This approach formulates the problem as a supervised classification task.**
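The idea can be illustrated with a minimal, framework-free sketch. Everything here is an assumption made for illustration: the toy dataset (points labelled by whether they fall inside the unit circle), the single hidden layer of 8 `tanh` units, and the learning rate. A small classifier is trained on the labels, and the hidden-layer activations are then read out as the extracted features.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy labeled dataset (hypothetical): 200 points in 2-D, label 1 inside the unit circle.
X = rng.uniform(-2.0, 2.0, size=(200, 2))
y = (np.sum(X**2, axis=1) < 1.0).astype(float).reshape(-1, 1)

# One hidden layer (8 tanh units) followed by a single sigmoid output unit.
W1 = rng.normal(0.0, 0.5, size=(2, 8))
b1 = np.zeros(8)
W2 = rng.normal(0.0, 0.5, size=(8, 1))
b2 = np.zeros(1)

def hidden(X):
    """Intermediate-layer activations: these are the 'extracted features'."""
    return np.tanh(X @ W1 + b1)

def predict(X):
    """Classifier output: probability of the positive class."""
    return 1.0 / (1.0 + np.exp(-(hidden(X) @ W2 + b2)))

# Plain gradient descent on the mean binary cross-entropy.
lr = 0.5
for _ in range(500):
    H = hidden(X)
    p = 1.0 / (1.0 + np.exp(-(H @ W2 + b2)))
    grad_logits = (p - y) / len(X)        # dLoss/dlogits for sigmoid + BCE
    grad_W2 = H.T @ grad_logits
    grad_b2 = grad_logits.sum(axis=0)
    grad_H = grad_logits @ W2.T
    grad_pre = grad_H * (1.0 - H**2)      # derivative of tanh
    grad_W1 = X.T @ grad_pre
    grad_b1 = grad_pre.sum(axis=0)
    W1 -= lr * grad_W1
    b1 -= lr * grad_b1
    W2 -= lr * grad_W2
    b2 -= lr * grad_b2

# After training, discard the output layer and keep the hidden activations.
features = hidden(X)                      # shape (200, 8)
accuracy = np.mean((predict(X) > 0.5) == (y > 0.5))
print(features.shape, accuracy)
```

The array `features` could then be passed to any downstream model in place of the raw inputs.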

This process is reminiscent of the **AutoEncoder** architecture for feature extraction. *The difference is that an AutoEncoder needs no labels: its training target is the input itself, and in learning to reconstruct the input, the model builds an internal representation of the data that can be used as extracted features.*

## Graph Neural Networks for Automated Feature Engineering on Graph Data

Graph Neural Networks (GNNs) have emerged as a powerful tool for feature extraction in various domains, particularly where data is structured as graphs. GNNs excel in capturing complex relationships and patterns within graph-structured data, making them valuable for tasks like recommendation systems, social network analysis, and molecule property prediction. Instead of relying on manually crafted features, GNNs learn feature representations directly from the graph structure and node attributes. They propagate information between connected nodes through multiple layers, iteratively refining node embeddings.
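To make the propagation step concrete, here is a small NumPy sketch of one graph-convolution layer in the style of GCN (the layer type that `GCNConv` implements). The 4-node path graph, the one-hot node attributes, and the random weight matrices are all hypothetical; a real model would learn the weights during training.

```python
import numpy as np

# Toy undirected path graph 0 - 1 - 2 - 3 (hypothetical example).
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = np.eye(4)  # one-hot node attributes

def gcn_layer(A, H, W):
    """One GCN-style step: ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    A_hat = A + np.eye(A.shape[0])                 # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))  # degrees include the self-loop
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(A_norm @ H @ W, 0.0)

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))  # untrained weights, for illustration only
W2 = rng.normal(size=(3, 2))

H1 = gcn_layer(A, X, W1)   # each node mixes in its 1-hop neighbours
H2 = gcn_layer(A, H1, W2)  # a second layer reaches 2-hop neighbours
print(H2.shape)            # (4, 2): one 2-D embedding per node
```

Stacking layers is what lets information travel further across the graph: after two layers, the embedding of node 0 depends on node 2 even though they are not directly connected.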

**In this exercise, we will learn how to use a GNN for extracting node features, and later use the engineered node features for clustering, just like in the Mini ML Project.**

*This notebook is designed to guide you through this assignment.*

<i><font color='blue'>Some parts of the notebook are left as exercises for you; the corresponding prompts are marked in blue.</font></i>

# Dataset

We will use the same "Stanford Facebook" Dataset as used in the Mini ML Project. The Dataset contains links between several nodes (users) in a Social Network.

## Installing and importing the required libraries

In [None]:
!pip3 install torch_geometric igraph

In [None]:
from google.colab import drive
from torch_geometric.utils import from_networkx
import torch
import numpy as np
import torch_geometric.transforms as T
import networkx as nx

from sklearn.metrics import roc_auc_score
from torch_geometric.nn import GCNConv
from torch_geometric.utils import negative_sampling

## Mount Google Drive, and download the dataset

In [None]:
drive.mount('/content/gdrive', force_remount=True)

In [None]:
!wget -P /content/gdrive/MyDrive/gnn-data  https://snap.stanford.edu/data/facebook_combined.txt.gz
!cd /content/gdrive/MyDrive/gnn-data && gunzip facebook_combined.txt.gz

## Let's construct the Graph using the connections (Edge) list in the dataset

<font color='blue'>Write code to load the dataset from the txt file. Your code should finally return a torch_geometric Data Object</font>

In [None]:
# read the dataset from the txt file
# return the data as a torch_geometric Data object

The dataset contains no node features. It contains only a list of edges between nodes, each denoting a real connection in the social network.

But a GNN, just like any other Neural Network, needs some encoded inputs to work on.

Let's utilize node adjacency information as input features to the GNN.
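As a hint for what this means, the sketch below builds the adjacency matrix of a tiny hypothetical graph with plain NumPy. The edge list is made up, and the notebook itself expects a sparse matrix (note the later call to `adj_matrix.toarray()`), so treat this only as an illustration of the idea that row `i` of the matrix becomes the feature vector of node `i`.

```python
import numpy as np

# Hypothetical toy edge list: each row is a pair of node ids (u, v).
edges = np.array([[0, 1], [0, 2], [1, 2], [2, 3]])
num_nodes = int(edges.max()) + 1

adj = np.zeros((num_nodes, num_nodes), dtype=np.float32)
adj[edges[:, 0], edges[:, 1]] = 1.0
adj[edges[:, 1], edges[:, 0]] = 1.0  # undirected graph: mirror every edge

print(adj.shape)  # (4, 4)
print(adj[3])     # node 3 is connected only to node 2: [0. 0. 1. 0.]
```

With N nodes this gives every node an N-dimensional feature vector, which is simple but grows quadratically in memory with the size of the graph.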

<font color='blue'>Write code to generate the adjacency matrix of the graph</font>

In [None]:
# create adjacency matrix
# check its shape

Now add these features to the graph:

In [None]:
# Use each node's row of the (sparse) adjacency matrix as its dense feature vector
data.x = torch.tensor(adj_matrix.toarray(), dtype=torch.float32)

In [None]:
data.x

# Learning Node Embeddings through a Supervised GNN

**Objective of Node Embeddings:**

The primary goal of node embeddings is to represent node features (here, adjacency information) in a lower-dimensional space where nodes that are similar in the original graph are closer to each other, and dissimilar nodes are farther apart. This representation should capture the structural and relational information present in the graph.

**Detecting Edge Presence:**

One way to evaluate the quality of node embeddings and how well they capture relationships is by assessing their ability to predict the presence or absence of edges in the original graph.
If an edge exists between two nodes in the original graph, it's expected that their corresponding embeddings in the lower-dimensional space should be close to each other or have a high similarity.
Conversely, if there is no edge between two nodes in the original graph, their embeddings should be relatively farther apart or have a lower similarity.
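A small sketch of this evaluation, under stated assumptions: the 2-D embeddings and the lists of edge and non-edge pairs are hand-made for illustration, similarity is taken to be the dot product, and the AUC is computed directly as the fraction of (edge, non-edge) pairs that are ranked correctly. On real predictions, `roc_auc_score` from scikit-learn computes the same quantity.

```python
import numpy as np

# Hypothetical 2-D embeddings: nodes 0-2 form one group, nodes 3-4 another.
z = np.array([[1.0, 0.1],
              [0.9, 0.2],
              [1.1, 0.0],
              [-0.1, 1.0],
              [0.0, 0.9]])

pos_pairs = np.array([[0, 1], [1, 2], [3, 4]])  # edges present in the graph
neg_pairs = np.array([[0, 3], [1, 4], [2, 3]])  # node pairs with no edge

def score(z, pairs):
    """Dot-product similarity of each (u, v) pair of embeddings."""
    return np.sum(z[pairs[:, 0]] * z[pairs[:, 1]], axis=1)

pos, neg = score(z, pos_pairs), score(z, neg_pairs)

# AUC: fraction of (positive, negative) pairs ranked correctly; ties count half.
auc = np.mean((pos[:, None] > neg[None, :]) + 0.5 * (pos[:, None] == neg[None, :]))
print(pos, neg, auc)
```

Here every true edge scores higher than every non-edge, so the AUC is 1.0; embeddings that ignored the graph structure would score around 0.5.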

**Edge Presence in Embedding Space:**

The presence of an edge in the graph obtained from the node embeddings can be an indicator of how well the embeddings capture the original graph's structure. If the embeddings are effective, you would expect edges to exist in the embedding-based graph for node pairs with strong structural relationships in the original graph.

<u>Therefore, we will formulate this problem as a link prediction problem</u>.

We will use the adjacency features we just created to train a GNN that predicts links between the input nodes.

## Encoder-Decoder Architecture for Link Prediction

An encoder-decoder GNN is often used for link prediction problems.

In the context of edge detection or link prediction, you want to determine whether there should be an edge (connection) between two nodes or not.

The encoder-decoder architecture can be used to achieve this by using the learned node embeddings:

Encoder: The encoder part of the model takes the graph structure and the node features as input and encodes this information into node embeddings. Nodes with similar neighborhoods should end up close to each other in the embedding space.

Decoder: The decoder takes pairs of node embeddings (corresponding to two nodes in the graph) as input and produces a prediction of whether there should be an edge between those nodes or not. This prediction is based on the similarity captured in the embeddings.
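The decoder and its training signal can be sketched as follows. This is an illustrative assumption rather than the required solution: the embeddings `z` are random stand-ins for the encoder output, the decoder is a plain dot product, non-edges are obtained by random negative sampling, and the loss is binary cross-entropy on the resulting logits.

```python
import numpy as np

def decode(z, pairs):
    """Decoder: one logit per (u, v) pair, here the dot product of the two embeddings."""
    return np.sum(z[pairs[:, 0]] * z[pairs[:, 1]], axis=1)

def bce_with_logits(logits, labels):
    """Numerically stable mean binary cross-entropy on raw logits."""
    return np.mean(np.maximum(logits, 0.0) - logits * labels
                   + np.log1p(np.exp(-np.abs(logits))))

rng = np.random.default_rng(0)
num_nodes, dim = 6, 4
z = rng.normal(size=(num_nodes, dim))  # stand-in for the encoder output

pos_pairs = np.array([[0, 1], [1, 2], [3, 4], [4, 5]])  # observed edges, label 1
existing = {tuple(p) for p in pos_pairs.tolist()}

# Negative sampling: draw random node pairs that are not observed edges, label 0.
neg = []
while len(neg) < len(pos_pairs):
    u, v = (int(i) for i in rng.integers(0, num_nodes, size=2))
    if u != v and (u, v) not in existing and (v, u) not in existing:
        neg.append((u, v))
neg_pairs = np.array(neg)

pairs = np.vstack([pos_pairs, neg_pairs])
labels = np.concatenate([np.ones(len(pos_pairs)), np.zeros(len(neg_pairs))])

logits = decode(z, pairs)
loss = bce_with_logits(logits, labels)  # the quantity training would minimise
prob = 1.0 / (1.0 + np.exp(-logits))    # predicted edge probabilities
print(loss, prob.round(2))
```

During training, the gradient of this loss flows back through the decoder into the encoder, which is what pushes the embeddings of connected nodes together and those of unconnected nodes apart.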

Refer to example here for reference: https://github.com/pyg-team/pytorch_geometric/blob/master/examples/link_pred.py . Notice the usage of an encoder-decoder architecture.

<font color='blue'>Write code to split the Graph into Train, Validation and Test using T.Compose</font>

In [None]:
# use T.Compose transform to split into train, val and test data

### The GNN

<font color='blue'>Write the GNN for Link Prediction. </font>

Refer to the following example from torch_geometric:
https://github.com/pyg-team/pytorch_geometric/blob/master/examples/link_pred.py

In [None]:
# write GNN code here

<font color='blue'> Now write code to train the model. Experiment with various hyperparameters such as the output dimension and the number of epochs. Also experiment with different model architectures by increasing or decreasing the number of hidden layers and the number of neurons in each layer. </font>

In [None]:
# Train the GNN

# Validate and test on the respective datasets

<font color='blue'> Now use the model to generate embeddings on the entire dataset </font>

In [None]:
# Put the model in evaluation mode

# Encode the data to obtain the final hidden layer representations

<font color='blue'> Now try out various clustering algorithms on the node embeddings and find out the optimal number of clusters using elbow plot.
Calculate various metrics such as silhouette score, Calinski-Harabasz index, and Davies-Bouldin index for each cluster size.
Also plot each metric against the number of clusters.
</font>
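As a reminder of what the elbow plot measures, the sketch below runs a hand-written k-means on synthetic data and records the within-cluster sum of squares (inertia) for several values of k. The three 2-D blobs are a hypothetical stand-in for the GNN embeddings, and the k-means routine is a bare-bones version of Lloyd's algorithm written only for illustration. In the notebook you would more likely use scikit-learn's `KMeans` (which exposes `inertia_`) together with `silhouette_score`, `calinski_harabasz_score`, and `davies_bouldin_score` from `sklearn.metrics`.

```python
import numpy as np

def kmeans_inertia(X, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means; returns the within-cluster sum of squares (inertia)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = dist.argmin(axis=1)
        for j in range(k):
            members = X[assign == j]
            if len(members) > 0:  # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return dist.min(axis=1).sum()

# Stand-in for the GNN node embeddings: three well-separated 2-D blobs.
rng = np.random.default_rng(1)
blob_centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
emb = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in blob_centers])

# Inertia for k = 1..6; plotting these values against k gives the elbow plot.
inertias = {k: kmeans_inertia(emb, k) for k in range(1, 7)}
for k, value in inertias.items():
    print(k, round(float(value), 1))
```

The "elbow" is the value of k after which adding more clusters stops reducing the inertia by much. Because the result of k-means depends on its initialisation, it is worth repeating the run with several seeds before reading off the elbow.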

In [None]:
# Perform clustering using various algorithms on the node embeddings obtained from the GNN

# Evaluate clustering performance

In [None]:
# Elbow plot to find out optimal number of clusters

In [None]:
# Plot metrics vs number of clusters

# Summary

<font color='blue'> Summarize the steps taken and your observations in comparison to the manual feature engineering performed in the Mini ML Project on Graph ML.</font>


< Not adding any answer here, since the observations and steps will be specific to the Mini ML Project done by each learner >


<font color='blue'> Summarize the steps taken and your observations in comparison to the feature extraction done using AutoEncoders for a downstream classification task.</font>

< Not adding any answer here, since the observations and steps will be subjective to AutoEncoder assignment done by each learner>