# Link Prediction on the Amazon Co-Purchase Graph with GNNs

## Data and Problem Statement - Overview

In this project, I explore the use of Graph Neural Networks (GNNs) to model structural relationships in a real-world graph. I use the Amazon product co-purchasing network and ground-truth communities dataset from the SNAP repository (https://snap.stanford.edu/data/com-Amazon.html).

A summary of the data:

- This data is based on the "Customers Who Bought This Item Also Bought" feature of the Amazon website.
- Each node is an Amazon product.
- An undirected edge connects product i and product j if they are frequently co-purchased, so the edges encode product-product relationships.
- The graph also comes with ground-truth communities: each product category in the Amazon catalog defines one community. This lets us study structure at the community level as well as at the product level.


The core task that I want to focus on is **link prediction**:

- Given two products, can we predict whether they should be connected in the co-purchase graph?
- This is a classic problem in e-commerce recommendation systems.
- Most product pairs in this graph are unseen, meaning they have no edge between them. With many products spread across different categories / communities, the graph is sparse (as we will see).
- This approach allows us to evaluate how well structural graph patterns can guide suggestions for these unseen product pairs.
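
To make the task concrete, here is a minimal sketch of how link prediction examples could be built: observed edges become positive examples and sampled unconnected pairs become negative examples. The function name `make_link_prediction_examples`, the toy edge list, and the product IDs are all hypothetical and are not taken from the Amazon data or from the rest of this notebook.

```python
import random
from itertools import combinations

def make_link_prediction_examples(edges, num_negatives, seed=0):
    """Label observed edges 1 and randomly sampled non-edges 0."""
    rng = random.Random(seed)
    # Store each undirected edge once, as a sorted tuple.
    positives = sorted({tuple(sorted(e)) for e in edges})
    positive_set = set(positives)
    nodes = sorted({n for e in positives for n in e})
    # Candidate negatives: every node pair without an observed edge.
    # Enumerating all pairs is only feasible for a toy graph; on a large
    # graph one would draw random pairs and reject those that are edges.
    non_edges = [p for p in combinations(nodes, 2) if p not in positive_set]
    negatives = rng.sample(non_edges, num_negatives)
    pairs = positives + negatives
    labels = [1] * len(positives) + [0] * len(negatives)
    return pairs, labels

# Hypothetical toy co-purchase edges (made-up product IDs).
toy_edges = [(1, 2), (2, 3), (3, 1), (3, 4), (4, 5)]
pairs, labels = make_link_prediction_examples(toy_edges, num_negatives=5)
for pair, label in zip(pairs, labels):
    print(pair, label)
```

A model for this task then only needs to map a pair of products to a score, which can be compared against these 0/1 labels.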


Modern recommendation systems focus heavily on how users interact with products in the catalog, based on their view / add to cart / purchase patterns and other signals derived from their activity on the e-commerce platform. By looking at product-product relationships through a graph structure instead, we can gain insight into questions like the following:

- Are there specific dominant products or communities of products that are linked? How are they connected and how similar are they?
- How expressive are node embeddings that are learned from the co-purchase context alone?
- How representative is this data, or a model trained on it, for powering recommendation algorithms?


## Notebook Structure


1. **Exploring the Amazon co-purchase data**: Parse and visualize the Amazon co-purchase graph. Analyze statistics such as degree distributions, component sizes, clustering, and community structure.
2. **Task Framing**: Define the link prediction task as a binary classification problem.
3. **Model Development**: Train a GNN to learn node embeddings and use a scoring function to predict links.
4. **Evaluation & Insights**: Measure model performance and interpret results both quantitatively and qualitatively.
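
As a rough illustration of the Model Development step, the sketch below scores a candidate link as the sigmoid of the dot product of two node embeddings. This is one common choice of scoring function, assumed here for illustration; it is not necessarily the one used later in this notebook. The function name `link_probability` and the embedding values are hypothetical.

```python
import math

def link_probability(z_u, z_v):
    """Sigmoid of the dot product of two node embeddings."""
    score = sum(a * b for a, b in zip(z_u, z_v))
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical 3-dimensional embeddings; in practice these would be
# produced by the trained GNN rather than written by hand.
embeddings = {
    "product_a": [0.9, 0.1, 0.4],
    "product_b": [0.8, 0.2, 0.5],
    "product_c": [-0.7, 0.3, -0.6],
}

p_ab = link_probability(embeddings["product_a"], embeddings["product_b"])
p_ac = link_probability(embeddings["product_a"], embeddings["product_c"])
print(f"a-b: {p_ab:.3f}, a-c: {p_ac:.3f}")
```

Embeddings that point in similar directions give a probability above 0.5, and embeddings that point in opposite directions give one below 0.5.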

# Imports

In [2]:
import gzip
import json
import matplotlib.pyplot as plt
import networkx as nx
import numpy as np
import os
import pandas as pd
import seaborn as sns
import sys

from collections import Counter
from pprint import pprint
from scripts import utils as U

# Exploring the Amazon co-purchase dataset

### Loading the graph data & summarizing basic overall graph statistics


- Let us try to understand the structural properties of our co-purchase graph - to get an idea of how densely the products are connected and whether any global patterns emerge.

- Below, we compute and summarize basic statistics of the graph, including the number of nodes and edges, average degree, graph density, clustering coefficient, and details about connectivity. 

<!-- These statistics not only help validate data quality but also guide how we construct models—for instance, sparse graphs with strong clustering may benefit from localized message passing in GNNs. -->

<!-- The **largest connected component** is particularly important, as GNNs rely on connectivity for message propagation. We'll focus our modeling efforts on this component to ensure meaningful learning. -->


In [3]:
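def load_graph(file_path):
    """Load an undirected graph from a gzipped, whitespace-separated edge list."""
    edges = []
    with gzip.open(file_path, "rt") as f:
        for line in f:
            # Skip header/comment lines (prefixed with "#") and blank lines.
            if line.startswith("#") or not line.strip():
                continue
            source, target = map(int, line.split())
            edges.append((source, target))
    return nx.Graph(edges)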

def basic_graph_stats(G):
    """Calculate and return basic statistics about the graph G."""
    number_of_nodes = G.number_of_nodes()
    number_of_edges = G.number_of_edges()
    if number_of_nodes == 0:
        # Several statistics below are undefined for an empty graph.
        raise ValueError("Cannot compute statistics for an empty graph.")

    # Compute the connected components once and reuse them for both statistics.
    component_sizes = [len(c) for c in nx.connected_components(G)]

    stats = {
        "Number of nodes": number_of_nodes,
        "Number of edges": number_of_edges,
        # Each undirected edge adds one to the degree of both endpoints.
        "Average degree": 2 * number_of_edges / number_of_nodes,
        "Density": nx.density(G),
        "Average clustering coefficient": nx.average_clustering(G),
        "Number of connected components": len(component_sizes),
        "Size of largest component": max(component_sizes),
    }
    return stats

In [4]:
graph = load_graph(file_path="data/raw/com-amazon.ungraph.txt.gz")
print(f"Graph has {graph.number_of_nodes()} nodes and {graph.number_of_edges()} edges.")

stats = basic_graph_stats(graph)
print("\n\n\n*** OVERALL GRAPH STATISTICS ***\n")
print(json.dumps(stats, indent=4))

Graph has 334863 nodes and 925872 edges.



*** OVERALL GRAPH STATISTICS ***

{
    "Number of nodes": 334863,
    "Number of edges": 925872,
    "Average degree": 5.529855493141971,
    "Density": 1.6513834036534368e-05,
    "Average clustering coefficient": 0.3967463932788733,
    "Number of connected components": 1,
    "Size of largest component": 334863
}
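
As a quick sanity check, the average degree and density reported above can be reproduced from the node and edge counts alone. For an undirected graph with N nodes and E edges, the average degree is 2E / N and the density is 2E / (N(N - 1)). The variable names below are introduced only for this check.

```python
# Node and edge counts as reported in the output above.
num_nodes = 334863
num_edges = 925872

# Each undirected edge contributes to the degree of two nodes.
average_degree = 2 * num_edges / num_nodes

# Density: the fraction of all possible undirected node pairs that are connected.
density = 2 * num_edges / (num_nodes * (num_nodes - 1))

print(f"Average degree: {average_degree:.4f}")
print(f"Density: {density:.3e}")
```

A density on the order of 1e-05 means that only about 1 in 60,000 possible product pairs is connected, which is the sparsity mentioned in the problem statement. The graph also forms a single connected component, so the largest component is the whole graph.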
