### Cora dataset ideas:

- Feature type: Bag-of-Words (binary)
- Each node has a 1433-dimensional feature vector
- Each dimension corresponds to a unique word in the vocabulary
- Values are binary:
  - 1: the word appears in the paper
  - 0: the word does not appear
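
As a quick illustration of this encoding, here is a minimal sketch that builds a binary bag-of-words vector. The five-word vocabulary and the helper name `binary_bow` are made up for the example; Cora ships its 1433-dimensional vectors precomputed, so nothing like this is needed to load the dataset.

```python
def binary_bow(text, vocabulary):
    """Return a 0/1 vector: 1 if the vocabulary word occurs in the text."""
    tokens = set(text.lower().split())
    return [1 if word in tokens else 0 for word in vocabulary]

# Hypothetical 5-word vocabulary (Cora's real vocabulary has 1433 entries)
vocab = ["graph", "neural", "bayesian", "genetic", "reinforcement"]
vec = binary_bow("A neural approach to graph learning on graph data", vocab)
print(vec)  # [1, 1, 0, 0, 0]
```

Note that "graph" appears twice in the text but still maps to 1: the encoding records presence, not counts.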

### 1. Node Classification
   - Thousands of papers are published every week, so manual categorization doesn't scale.

  #### plan:
  - Apply shallow embeddings with Node2Vec (uses only the graph structure, no node features) to measure classification performance from citation links alone
  - Apply deep embeddings (GNNs) such as GCN and GAT, which also use node features, and perform classification
  - Analysis: compare the methods' accuracies, plot t-SNE projections of the embeddings colored by class (one per method), and discuss trade-offs (scalability, expressiveness, transductive vs. inductive)



### 2. Link Prediction
  - Authors often miss relevant prior work.
  - Citation recommendation during writing
  - Related-work discovery
  - Recommending papers that an author should read or cite based on the graph structure and/or content similarity.

  #### plan:
  - Split the edges into train/val/test sets and generate negative samples
  - Learn embeddings with a link prediction objective. Node2Vec embeddings trained on the full graph for node classification have already seen the held-out edges, so Node2Vec should be retrained on the training edges only; a GNN also needs to be retrained because the loss is different
  - Check scoring methods: for each pair (u, v) in the test set, compute a score from the two embeddings using a dot product, cosine similarity, or an MLP classifier
  - Evaluate the predictions
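
The first step of the plan (edge splitting plus negative sampling) can be sketched in pure Python as follows. This is only an illustration on a toy ring graph: the helper names `split_edges` and `sample_negatives`, the split fractions, and the toy graph are assumptions, and a real run would more likely rely on a library utility for this.

```python
import random

def split_edges(edges, val_frac=0.05, test_frac=0.10, seed=42):
    """Shuffle undirected edges and split them into train/val/test lists."""
    rng = random.Random(seed)
    edges = list(edges)
    rng.shuffle(edges)
    n_val = int(len(edges) * val_frac)
    n_test = int(len(edges) * test_frac)
    val = edges[:n_val]
    test = edges[n_val:n_val + n_test]
    train = edges[n_val + n_test:]
    return train, val, test

def sample_negatives(num_nodes, all_edges, k, seed=42):
    """Draw k distinct node pairs that are not edges anywhere in the graph."""
    rng = random.Random(seed)
    existing = {frozenset(e) for e in all_edges}
    negatives = set()
    while len(negatives) < k:
        u, v = rng.randrange(num_nodes), rng.randrange(num_nodes)
        pair = frozenset((u, v))
        if u != v and pair not in existing:
            negatives.add(pair)
    return [tuple(sorted(pair)) for pair in negatives]

# Toy graph: a ring of 20 nodes (stand-in for the citation graph)
toy_num_nodes = 20
toy_edges = [(i, (i + 1) % toy_num_nodes) for i in range(toy_num_nodes)]

train_edges, val_edges, test_edges = split_edges(toy_edges)
test_negatives = sample_negatives(toy_num_nodes, toy_edges, k=len(test_edges))
print(len(train_edges), len(val_edges), len(test_edges), len(test_negatives))
```

Negatives are checked against all edges, not just the training edges, so a held-out true edge can never be sampled as a negative.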



### Twitch dataset ideas:

There are 6 language communities: German (DE), English (ENGB), Spanish (ES), French (FR), Portuguese (PTBR), and Russian (RU). Each language community is disconnected from the others. Each node has the following features:

`id, days, mature, views, partner, new_id`

### 1. Node Classification
   - Automatically flagging streamers who use explicit language —> manual review doesn't scale across 32K+ streamers and multiple languages

  #### plan:
  - Approach 1: Per-language models (baseline)
    - Train and evaluate within each language separately. For each language graph:
      - Node2Vec → logistic regression (structure only, no features)
      - GCN / GAT (structure + features, end-to-end)
      - Stratified train/val/test split within that language
    - Compare: does graph structure actually help? Are some languages easier to classify than others? Does graph density (DE is dense, ENGB is sparse) affect GNN performance?
  - Approach 2: Transfer learning across languages
    - All languages share the same feature space, so we can train a GCN/GAT on one language and test it on a different language (zero-shot transfer)
    - For example, train on DE (the largest graph) → test on RU (a much smaller one)
    - This tests whether the relationship between gaming preferences, social connections, and explicit language use is language/culture-invariant
    - Compare transfer accuracy across all language pairs: which transfer well, and which don't?
    - Create a transfer matrix:
      - Diagonal = train and test on the same language (normal in-domain performance, the upper bound)
      - Off-diagonal = transfer performance (train on row, test on column)
      - For each cell in the matrix: accuracy, F1-score, AUC-ROC (since it's binary classification and `mature` may be imbalanced)
    - We analyze:
      - How much does transfer performance drop vs. in-domain? If off-diagonal scores are close to the diagonal, explicit language behavior follows universal social patterns across cultures. If there's a big drop, the relationship between features/structure and `mature` is culture-specific
      - Which language transfers best as a source?
      - Are some language pairs more similar? Use a heatmap
  - Per-language accuracy/F1/AUC-ROC comparison table
  - t-SNE of embeddings colored by (a) `mature` and (b) language: do language communities cluster separately even without explicit language features?
  - Feature-only baseline (no graph, just logistic regression on features): how much does the graph add?
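
The transfer matrix described above could be assembled along these lines. This is a sketch under stated assumptions: `transfer_matrix`, `transfer_gap`, `train_majority`, and `accuracy` are hypothetical helpers, the "model" is a majority-label stub standing in for a GCN/GAT, and the toy labels are invented.

```python
def transfer_matrix(splits, train_fn, eval_fn):
    """Train on each row language and evaluate on every column language."""
    langs = sorted(splits)
    matrix = {}
    for src in langs:
        model = train_fn(splits[src]["train"])
        matrix[src] = {tgt: eval_fn(model, splits[tgt]["test"]) for tgt in langs}
    return matrix

def transfer_gap(matrix):
    """In-domain score (diagonal) minus transfer score, per off-diagonal cell."""
    return {
        (src, tgt): matrix[src][src] - score
        for src, row in matrix.items()
        for tgt, score in row.items()
        if src != tgt
    }

# Toy stand-ins (assumptions): a "model" is just the majority label of the
# training set, and the metric is plain accuracy on the target test labels.
def train_majority(train_labels):
    return max(set(train_labels), key=train_labels.count)

def accuracy(model, test_labels):
    return sum(label == model for label in test_labels) / len(test_labels)

toy_splits = {
    "DE": {"train": [1, 1, 0], "test": [1, 0, 1, 1]},
    "RU": {"train": [0, 0, 1], "test": [0, 0, 1, 0]},
}
matrix = transfer_matrix(toy_splits, train_majority, accuracy)
gaps = transfer_gap(matrix)
print(matrix)
print(gaps)
```

Evaluating the diagonal on each language's held-out test split (rather than its training nodes) keeps the in-domain and transfer cells comparable.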




### 2. Link Prediction
  - Which streamers would be friends based on their gaming preferences and existing social ties?
  - Since there are no cross-language edges, link prediction only makes sense within each language. We should:
    - Split edges per language (e.g., 85/5/10 train/val/test)
    - Generate negative samples only from same-language node pairs (sampling random pairs across the full graph would be trivially easy since cross-language edges don't exist)
    - Ensure the training graph stays connected within each component after edge removal

  #### plan:
  - Node2Vec: embed on the training subgraph per language, score test edges via dot product / cosine similarity
  - GCN/GAT with link prediction head: train with binary cross-entropy on edge existence, separately from node classification since the loss and training signal are fundamentally different
  - Scoring: for each candidate pair (u, v), try dot product, cosine similarity, and Hadamard product → MLP
  - Evaluation
    - AUC-ROC and Average Precision per language
    - Compare: do denser networks (DE, FR) have better link prediction than sparser ones (ENGB, RU)? Denser graphs have more training signal but also more possible negatives.
    - Does including node features improve link prediction over structure-only methods? (i.e., are streamers friends because they like the same games, or for other reasons?)
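
The scoring functions and the AUC-ROC metric named in the plan are simple enough to sketch without any library. The embeddings, pairs, and function names below are invented for illustration; in practice the Hadamard vector would be fed to a trained MLP, which is not shown here.

```python
import math

def dot_score(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine_score(u, v):
    norm = math.sqrt(dot_score(u, u)) * math.sqrt(dot_score(v, v))
    return dot_score(u, v) / norm if norm > 0 else 0.0

def hadamard(u, v):
    """Element-wise product; used as the input vector to an MLP edge classifier."""
    return [a * b for a, b in zip(u, v)]

def auc_roc(pos_scores, neg_scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            wins += 1.0 if p > n else 0.5 if p == n else 0.0
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical 2-d embeddings for four streamers
emb = {0: [1.0, 0.0], 1: [0.9, 0.1], 2: [0.0, 1.0], 3: [-1.0, 0.2]}
pos_pairs = [(0, 1)]            # held-out true edges
neg_pairs = [(0, 2), (1, 3)]    # sampled non-edges

pos = [cosine_score(emb[u], emb[v]) for u, v in pos_pairs]
neg = [cosine_score(emb[u], emb[v]) for u, v in neg_pairs]
print(auc_roc(pos, neg))
```

The pairwise AUC loop is quadratic in the number of scored pairs, so it is only meant to show what the metric measures; a library implementation is the better choice on the full graphs.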


---
## Part 1: Cora Node Classification with Shallow Embeddings (Node2Vec)

**Goal:** Classify papers into 7 categories using **only the citation graph structure** (no BoW features).  
This establishes a structure-only baseline before adding node features with GNNs.

**Pipeline:**
1. Load and parse the raw Cora files (`cora.content` and `cora.cites`)
2. Build a PyTorch Geometric `Data` object
3. Train Node2Vec to learn 128-d embeddings from random walks on the citation graph
4. Train a Logistic Regression classifier on the learned embeddings
5. Evaluate with accuracy, per-class F1, and confusion matrix
6. Visualize embeddings with t-SNE colored by class

### 1.1 — Imports & Setup

In [10]:
# =============================================================================
# Imports
# =============================================================================
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import torch
from torch_geometric.data import Data
from torch_geometric.nn import Node2Vec     # Node2Vec implementation from PyG

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    f1_score,
)
from sklearn.manifold import TSNE

# Reproducibility
SEED = 42
torch.manual_seed(SEED)
np.random.seed(SEED)

# Use GPU if available: CUDA (NVIDIA) → CPU. The MPS (Apple Silicon) branch is left disabled.
if torch.cuda.is_available():
    device = torch.device('cuda')
# elif torch.backends.mps.is_available():
#     device = torch.device('mps')
else:
    device = torch.device('cpu')
print(f"Using device: {device}")

Using device: cpu


### 1.2 — Load and Parse the Cora Dataset (from raw files)

We load the data **manually** from the raw `.content` and `.cites` files  
so we have full control and visibility into what's happening.

- `cora.content`: each line is `<paper_id> <1433 binary features> <class_label>`
- `cora.cites`: each line is `<cited_paper_id> <citing_paper_id>`

In [11]:
# =============================================================================
# 1.2a  Load cora.content — node features and labels
# =============================================================================
DATA_DIR = "cora"  # relative path to the cora folder

# Read the content file: <paper_id> <1433 binary features> <class_label>
content_path = os.path.join(DATA_DIR, "cora.content")
content_data = []
with open(content_path, 'r') as f:
    for line in f:
        parts = line.strip().split('\t')
        content_data.append(parts)

# Parse into structured arrays
paper_ids   = [row[0] for row in content_data]        # string IDs of each paper
features    = np.array([row[1:-1] for row in content_data], dtype=np.float32)  # 1433-d BoW
labels_str  = [row[-1] for row in content_data]       # class label strings

print(f"Number of papers (nodes): {len(paper_ids)}")
print(f"Feature matrix shape:     {features.shape}")
print(f"Unique classes:           {sorted(set(labels_str))}")

Number of papers (nodes): 2708
Feature matrix shape:     (2708, 1433)
Unique classes:           ['Case_Based', 'Genetic_Algorithms', 'Neural_Networks', 'Probabilistic_Methods', 'Reinforcement_Learning', 'Rule_Learning', 'Theory']


In [12]:
# =============================================================================
# 1.2b  Encode labels as integers and create a mapping
# =============================================================================

# Create a sorted list of unique class names for consistent ordering
class_names = sorted(set(labels_str))
class_to_idx = {name: idx for idx, name in enumerate(class_names)}

# Convert string labels to integer labels
labels = np.array([class_to_idx[l] for l in labels_str])

print("Class mapping:")
for name, idx in class_to_idx.items():
    count = (labels == idx).sum()
    print(f"  {idx}: {name:30s} ({count} papers)")

Class mapping:
  0: Case_Based                     (298 papers)
  1: Genetic_Algorithms             (418 papers)
  2: Neural_Networks                (818 papers)
  3: Probabilistic_Methods          (426 papers)
  4: Reinforcement_Learning         (217 papers)
  5: Rule_Learning                  (180 papers)
  6: Theory                         (351 papers)


In [13]:
# =============================================================================
# 1.2c  Build a mapping from paper_id → contiguous node index (0..N-1)
# =============================================================================
# PyTorch Geometric expects nodes to be indexed 0, 1, ..., N-1
paper_id_to_idx = {pid: i for i, pid in enumerate(paper_ids)}

# =============================================================================
# 1.2d  Load cora.cites — citation edges
# =============================================================================
cites_path = os.path.join(DATA_DIR, "cora.cites")
edges = []
skipped = 0

with open(cites_path, 'r') as f:
    for line in f:
        parts = line.strip().split('\t')
        cited  = parts[0]   # paper being cited
        citing = parts[1]   # paper doing the citing
        
        # Some edges may reference papers not in the content file — skip them
        if cited in paper_id_to_idx and citing in paper_id_to_idx:
            src = paper_id_to_idx[citing]  # citing  → source
            dst = paper_id_to_idx[cited]   # cited   → destination
            edges.append([src, dst])
        else:
            skipped += 1

edge_index = torch.tensor(edges, dtype=torch.long).t().contiguous()  # shape [2, num_edges]

print(f"Number of edges loaded: {edge_index.shape[1]}")
print(f"Edges skipped (missing nodes): {skipped}")

Number of edges loaded: 5429
Edges skipped (missing nodes): 0


In [14]:
# =============================================================================
# 1.2e  Build the PyTorch Geometric Data object
# =============================================================================
# Even though Node2Vec won't use features, we store them in the Data object
# so the same object can be reused later for GCN/GAT experiments.

x = torch.tensor(features, dtype=torch.float)           # [2708, 1433]
y = torch.tensor(labels, dtype=torch.long)               # [2708]

# Make the graph undirected: if paper A cites paper B, we add edges A→B and B→A.
# This is standard for citation networks because information flows both ways
# (citing and being cited both indicate topical similarity).
from torch_geometric.utils import to_undirected
edge_index_undirected = to_undirected(edge_index)

data = Data(x=x, edge_index=edge_index_undirected, y=y)
data.num_classes = len(class_names)

print(f"\nGraph summary:")
print(f"  Nodes:               {data.num_nodes}")
print(f"  Edges (undirected):  {data.num_edges}")
print(f"  Node features dim:   {data.num_node_features}")
print(f"  Classes:             {data.num_classes}")
print(f"  Avg degree:          {data.num_edges / data.num_nodes:.1f}")


Graph summary:
  Nodes:               2708
  Edges (undirected):  10556
  Node features dim:   1433
  Classes:             7
  Avg degree:          3.9
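
Note that 10556 is slightly less than 2 × 5429 = 10858. `to_undirected` also merges duplicate entries, so the gap suggests that some node pairs occur more than once in `cora.cites` (for example as reciprocal or repeated citations) or that a few edges are self-loops. A minimal pure-Python sketch of the symmetrize-and-deduplicate step, on a made-up edge list with a hypothetical helper name:

```python
def symmetrize(edges):
    """Add the reverse of every edge and drop duplicates."""
    undirected = set()
    for src, dst in edges:
        undirected.add((src, dst))
        undirected.add((dst, src))
    return sorted(undirected)

# Made-up directed edges: 0→1 and 1→0 are reciprocal, 1→2 is listed twice
toy_edges = [(0, 1), (1, 0), (1, 2), (1, 2)]
toy_undirected = symmetrize(toy_edges)
print(len(toy_edges), "directed ->", len(toy_undirected), "undirected entries")
print(toy_undirected)  # [(0, 1), (1, 0), (1, 2), (2, 1)]
```

Four directed edges yield four entries rather than eight, which is the same effect seen in the Cora counts.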


### 1.3 — Train / Validation / Test Split

We use a stratified random split to ensure each class is represented proportionally  
in each subset. This is especially important because the class distribution is imbalanced.

In [15]:
# =============================================================================
# Create stratified train/val/test masks (60% / 20% / 20%)
# =============================================================================
from sklearn.model_selection import train_test_split

num_nodes = data.num_nodes
indices = np.arange(num_nodes)

# First split: 60% train, 40% temp (val + test)
train_idx, temp_idx = train_test_split(
    indices, test_size=0.4, stratify=labels, random_state=SEED
)

# Second split: 50% of temp → 20% val, 20% test
val_idx, test_idx = train_test_split(
    temp_idx, test_size=0.5, stratify=labels[temp_idx], random_state=SEED
)

# Create boolean masks for PyG compatibility
train_mask = torch.zeros(num_nodes, dtype=torch.bool)
val_mask   = torch.zeros(num_nodes, dtype=torch.bool)
test_mask  = torch.zeros(num_nodes, dtype=torch.bool)

train_mask[train_idx] = True
val_mask[val_idx]     = True
test_mask[test_idx]   = True

data.train_mask = train_mask
data.val_mask   = val_mask
data.test_mask  = test_mask

print(f"Train nodes: {train_mask.sum().item()} ({train_mask.sum().item()/num_nodes*100:.0f}%)")
print(f"Val   nodes: {val_mask.sum().item()} ({val_mask.sum().item()/num_nodes*100:.0f}%)")
print(f"Test  nodes: {test_mask.sum().item()} ({test_mask.sum().item()/num_nodes*100:.0f}%)")

Train nodes: 1624 (60%)
Val   nodes: 542 (20%)
Test  nodes: 542 (20%)
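
To make the idea of stratification concrete, here is a pure-Python sketch that splits each class separately. It is not what `train_test_split` does internally, only an illustration of the property being relied on; `stratified_split` and the toy labels are hypothetical.

```python
import random
from collections import defaultdict

def stratified_split(labels, fractions=(0.6, 0.2, 0.2), seed=42):
    """Split indices so each class is divided in (roughly) the given fractions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for members in by_class.values():
        rng.shuffle(members)
        n_train = round(len(members) * fractions[0])
        n_val = round(len(members) * fractions[1])
        train += members[:n_train]
        val += members[n_train:n_train + n_val]
        test += members[n_train + n_val:]  # remainder goes to the test set
    return train, val, test

# Toy imbalanced labels: 50 nodes of class 0, 10 nodes of class 1
toy_labels = [0] * 50 + [1] * 10
train, val, test = stratified_split(toy_labels)

# The three sets are disjoint and together cover every node
assert len(set(train) | set(val) | set(test)) == len(toy_labels)
assert len(train) + len(val) + len(test) == len(toy_labels)
print(len(train), len(val), len(test))  # 36 12 12
```

The same disjointness check can be applied to the notebook's boolean masks, for example by confirming that no node is `True` in more than one mask.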


### 1.4 — Train Node2Vec Embeddings

**Node2Vec** learns low-dimensional embeddings by performing biased random walks  
on the graph and then applying a Skip-Gram model (similar to Word2Vec on text).

Key parameters:
- `embedding_dim`: dimensionality of the learned embeddings (128)
- `walk_length`: number of steps in each random walk (80)
- `context_size`: window size for the Skip-Gram objective (10)
- `walks_per_node`: how many walks to start from each node (10)
- `p` (return parameter): controls the likelihood of immediately revisiting the previous node  
  - p < 1 → more likely to backtrack (keeps the walk local)  
  - p > 1 → less likely to backtrack (encourages exploration)
- `q` (in-out parameter): controls whether the walk explores outward or stays local  
  - q < 1 → biased toward nodes farther from the previous node (DFS-like)
  - q > 1 → biased toward nodes close to the previous node (BFS-like)

**Important:** Node2Vec uses **only the graph structure** (edges).  
It does NOT use the 1433-d BoW features — that's intentional. We want to see  
how much classification performance we can get from citations alone.
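
The effect of `p` and `q` is easiest to see in the walk itself. The sketch below is a simplified pure-Python version of a second-order biased walk, not PyG's implementation: a candidate next node gets weight 1/p if it is the previous node, 1 if it is also a neighbor of the previous node, and 1/q otherwise. The function name, the toy adjacency dict, and the convention that the length counts nodes in the walk are all assumptions made for this example.

```python
import random

def node2vec_walk(adj, start, num_nodes_in_walk, p=1.0, q=1.0, rng=None):
    """Biased random walk; `adj` maps each node to the set of its neighbors."""
    rng = rng or random.Random(0)
    walk = [start]
    while len(walk) < num_nodes_in_walk:
        cur = walk[-1]
        nbrs = sorted(adj[cur])
        if not nbrs:
            break  # dead end: isolated node
        if len(walk) == 1:
            walk.append(rng.choice(nbrs))  # first step is unbiased
            continue
        prev = walk[-2]
        weights = []
        for nxt in nbrs:
            if nxt == prev:
                weights.append(1.0 / p)   # return to the previous node
            elif nxt in adj[prev]:
                weights.append(1.0)       # stay at distance 1 from prev
            else:
                weights.append(1.0 / q)   # move outward, away from prev
        walk.append(rng.choices(nbrs, weights=weights, k=1)[0])
    return walk

# Toy undirected graph given as an adjacency dict
toy_adj = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1}, 3: {1, 4}, 4: {3}}
walk = node2vec_walk(toy_adj, start=0, num_nodes_in_walk=10, p=1.0, q=0.5)
print(walk)
```

With `p = q = 1` all three weights are equal, which is why the configuration used below reduces to an unbiased (DeepWalk-style) walk.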

In [16]:
# =============================================================================
# Node2Vec Model Configuration
# =============================================================================
# These hyperparameters follow the original Node2Vec paper (Grover & Leskovec, 2016)
# with slight adjustments for the Cora graph size.

EMBEDDING_DIM = 128    # Dimensionality of node embeddings
WALK_LENGTH   = 80     # Steps per random walk
CONTEXT_SIZE  = 10     # Skip-Gram window size
WALKS_PER_NODE = 10    # Number of random walks starting from each node
P = 1.0                # Return parameter (p=1, q=1 → DeepWalk-equivalent)
Q = 1.0                # In-out parameter
NUM_EPOCHS    = 100    # Training epochs for the Skip-Gram objective
BATCH_SIZE    = 256    # Batch size for stochastic gradient descent
LR            = 0.01   # Learning rate

# Initialize the Node2Vec model from PyTorch Geometric
# NOTE: This only takes edge_index — it does NOT use node features (x)
node2vec_model = Node2Vec(
    edge_index=data.edge_index,
    embedding_dim=EMBEDDING_DIM,
    walk_length=WALK_LENGTH,
    context_size=CONTEXT_SIZE,
    walks_per_node=WALKS_PER_NODE,
    p=P,
    q=Q,
    num_nodes=data.num_nodes,
    sparse=True,           # Use sparse gradients for efficiency
).to(device)

print(f"Node2Vec model initialized:")
print(f"  Embedding matrix shape: ({data.num_nodes}, {EMBEDDING_DIM})")
print(f"  Total trainable params: {sum(p.numel() for p in node2vec_model.parameters()):,}")

Node2Vec model initialized:
  Embedding matrix shape: (2708, 128)
  Total trainable params: 346,624


In [18]:
# =============================================================================
# Node2Vec Training Loop
# =============================================================================
# The training process:
# 1. Generate random walks from the graph (biased by p, q)
# 2. Treat walks as "sentences" and nodes as "words"
# 3. Apply Skip-Gram with negative sampling to learn embeddings
#    that place co-occurring nodes (in walks) close together

# DataLoader samples random walks and creates Skip-Gram training pairs
loader = node2vec_model.loader(batch_size=BATCH_SIZE, shuffle=True, num_workers=0)

# Using SparseAdam optimizer — efficient for sparse embedding gradients
optimizer = torch.optim.SparseAdam(node2vec_model.parameters(), lr=LR)

# Training
node2vec_model.train()
losses = []  # track loss for visualization

for epoch in range(1, NUM_EPOCHS + 1):
    total_loss = 0
    num_batches = 0
    
    for pos_rw, neg_rw in loader:
        optimizer.zero_grad()
        # pos_rw: positive random walk pairs (nodes that co-occur in walks)
        # neg_rw: negative samples (random node pairs, unlikely to be related)
        loss = node2vec_model.loss(pos_rw.to(device), neg_rw.to(device))
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
        num_batches += 1
    
    avg_loss = total_loss / num_batches
    losses.append(avg_loss)
    
    # Print progress every epoch
    print(f"Epoch {epoch:3d}/{NUM_EPOCHS} — Loss: {avg_loss:.4f}")

print("\nNode2Vec training complete!")

Epoch   1/100 — Loss: 7.6390


KeyboardInterrupt: 
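
The loss reported by the training loop is a Skip-Gram objective with negative sampling. As a rough sketch (not PyG's exact implementation), it rewards high dot products between a node and the nodes that co-occur with it in walks, and penalizes high dot products with randomly sampled nodes. The function names and the 2-d vectors below are invented for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sgns_loss(center, context_vecs, negative_vecs):
    """Mean -log sigmoid(center·context) plus mean -log(1 - sigmoid(center·negative))."""
    pos = -sum(math.log(sigmoid(dot(center, c))) for c in context_vecs) / len(context_vecs)
    neg = -sum(math.log(1.0 - sigmoid(dot(center, n))) for n in negative_vecs) / len(negative_vecs)
    return pos + neg

center = [1.0, 0.5]
# Context aligned with the center, negative pointing away: low loss
good = sgns_loss(center, context_vecs=[[1.0, 0.5]], negative_vecs=[[-1.0, -0.5]])
# The reverse arrangement: high loss
bad = sgns_loss(center, context_vecs=[[-1.0, -0.5]], negative_vecs=[[1.0, 0.5]])
# A zero center vector gives sigmoid(0) = 0.5 everywhere, i.e. 2 * ln(2)
neutral = sgns_loss([0.0, 0.0], [[0.3, 0.1]], [[0.2, -0.4]])
print(good, bad, neutral)
```

Minimizing this quantity therefore pulls co-occurring nodes together in embedding space and pushes random pairs apart.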

In [None]:
# =============================================================================
# Plot Training Loss
# =============================================================================
plt.figure(figsize=(8, 4))
# Use len(losses) so the plot also works if training was stopped early
plt.plot(range(1, len(losses) + 1), losses, color='#2196F3', linewidth=1.5)
plt.xlabel('Epoch')
plt.ylabel('Loss (Skip-Gram + Negative Sampling)')
plt.title('Node2Vec Training Loss on Cora')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

### 1.5 — Extract Embeddings

After training, each node has a learned 128-dimensional embedding vector.  
These embeddings encode the **structural role and neighborhood** of each paper  
in the citation graph — without ever seeing the paper's content.

In [None]:
# =============================================================================
# Extract the trained embeddings for all nodes
# =============================================================================
node2vec_model.eval()

# Get embeddings: shape [num_nodes, embedding_dim] = [2708, 128]
with torch.no_grad():
    embeddings = node2vec_model().detach().cpu().numpy()

print(f"Embedding matrix shape: {embeddings.shape}")
print(f"\nSample embedding (node 0, first 10 dims):")
print(f"  {embeddings[0, :10]}")

### 1.6 — Node Classification with Logistic Regression

Now we use the Node2Vec embeddings as input features to a **Logistic Regression** classifier.  
This is the standard evaluation protocol for shallow graph embeddings:

1. Learn embeddings (unsupervised, structure-only)
2. Use them as fixed features for a simple downstream classifier

The classifier never sees the original BoW features — only the 128-d structural embeddings.

In [None]:
# =============================================================================
# Prepare train / val / test features and labels from the embeddings
# =============================================================================
X_train = embeddings[train_mask.numpy()]
y_train = labels[train_mask.numpy()]

X_val = embeddings[val_mask.numpy()]
y_val = labels[val_mask.numpy()]

X_test = embeddings[test_mask.numpy()]
y_test = labels[test_mask.numpy()]

print(f"Train: {X_train.shape[0]} samples")
print(f"Val:   {X_val.shape[0]} samples")
print(f"Test:  {X_test.shape[0]} samples")

In [None]:
# =============================================================================
# Train Logistic Regression on Node2Vec embeddings
# =============================================================================
# We use L2 regularization (default) and increase max_iter for convergence.
# The classifier is simple by design: the embedding quality determines performance.
# The deprecated `multi_class` argument is omitted; with the lbfgs solver,
# scikit-learn fits a multinomial (softmax) model for multi-class targets by default.

clf = LogisticRegression(
    max_iter=2000,
    solver='lbfgs',              # multinomial model for the 7 classes
    random_state=SEED,
)
clf.fit(X_train, y_train)

# Predict on validation and test sets
y_val_pred  = clf.predict(X_val)
y_test_pred = clf.predict(X_test)

val_acc  = accuracy_score(y_val, y_val_pred)
test_acc = accuracy_score(y_test, y_test_pred)

print(f"\n{'='*50}")
print(f"  Node2Vec + Logistic Regression Results")
print(f"{'='*50}")
print(f"  Validation Accuracy: {val_acc:.4f} ({val_acc*100:.1f}%)")
print(f"  Test Accuracy:       {test_acc:.4f} ({test_acc*100:.1f}%)")
print(f"{'='*50}")

In [None]:
# =============================================================================
# Detailed Classification Report (per-class precision, recall, F1)
# =============================================================================
print("\nDetailed Classification Report (Test Set):")
print("=" * 65)
print(classification_report(
    y_test, y_test_pred,
    target_names=class_names,
    digits=3
))

# Also compute macro and weighted F1 for easy comparison later
macro_f1    = f1_score(y_test, y_test_pred, average='macro')
weighted_f1 = f1_score(y_test, y_test_pred, average='weighted')
print(f"Macro F1:    {macro_f1:.4f}")
print(f"Weighted F1: {weighted_f1:.4f}")
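
Macro and weighted F1 can differ noticeably when classes are imbalanced, as they are in Cora. The pure-Python sketch below recomputes both on a made-up example to show where the difference comes from; the helper names and toy labels are assumptions, and the notebook itself relies on scikit-learn's `f1_score`.

```python
def f1_per_class(y_true, y_pred, cls):
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0

def macro_and_weighted_f1(y_true, y_pred):
    classes = sorted(set(y_true))
    scores = {c: f1_per_class(y_true, y_pred, c) for c in classes}
    macro = sum(scores.values()) / len(classes)                  # every class counts equally
    weighted = sum(scores[c] * y_true.count(c) for c in classes) / len(y_true)  # weighted by support
    return macro, weighted

# Toy example: 8 samples of class 0 (all correct), 2 of class 1 (one missed)
toy_true = [0] * 8 + [1] * 2
toy_pred = [0] * 8 + [1, 0]
toy_macro, toy_weighted = macro_and_weighted_f1(toy_true, toy_pred)
print(round(toy_macro, 3), round(toy_weighted, 3))  # 0.804 0.886
```

Macro F1 is pulled down by the weak minority class, while weighted F1 is dominated by the majority class, so reporting both gives a fuller picture.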

### 1.7 — Confusion Matrix

In [None]:
# =============================================================================
# Confusion Matrix Heatmap
# =============================================================================
cm = confusion_matrix(y_test, y_test_pred)

plt.figure(figsize=(9, 7))
sns.heatmap(
    cm, annot=True, fmt='d', cmap='Blues',
    xticklabels=class_names,
    yticklabels=class_names,
)
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Node2Vec + Logistic Regression — Confusion Matrix (Test Set)')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

### 1.8 — t-SNE Visualization of Node2Vec Embeddings

t-SNE projects the 128-d embeddings down to 2D for visualization.  
If Node2Vec has learned meaningful structure, we should see papers of the same class  
forming clusters — even though the model **never saw the class labels or paper content**.

In [None]:
# =============================================================================
# t-SNE dimensionality reduction (128-d → 2-d)
# =============================================================================
print("Running t-SNE (this may take a moment)...")

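tsne = TSNE(
    n_components=2,
    perplexity=30,       # balance between local and global structure
    learning_rate='auto',
    # Iteration count is left at the default (1000); the keyword was renamed
    # from `n_iter` to `max_iter` in newer scikit-learn releases.
    random_state=SEED,
    init='pca',          # PCA initialization for stability
)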
embeddings_2d = tsne.fit_transform(embeddings)

print(f"t-SNE output shape: {embeddings_2d.shape}")

In [None]:
# =============================================================================
# Plot t-SNE embeddings colored by class
# =============================================================================
# Use seaborn's colorblind-friendly palette (one color per class)
palette = sns.color_palette('colorblind', n_colors=len(class_names))

plt.figure(figsize=(10, 8))

for idx, class_name in enumerate(class_names):
    mask = labels == idx
    plt.scatter(
        embeddings_2d[mask, 0],
        embeddings_2d[mask, 1],
        c=[palette[idx]],
        label=class_name,
        alpha=0.6,
        s=15,
        edgecolors='none',
    )

plt.legend(title='Paper Class', bbox_to_anchor=(1.05, 1), loc='upper left', markerscale=3)
plt.title('t-SNE of Node2Vec Embeddings (Cora — Structure Only)', fontsize=14)
plt.xlabel('t-SNE Dimension 1')
plt.ylabel('t-SNE Dimension 2')
plt.grid(True, alpha=0.2)
plt.tight_layout()
plt.show()

### 1.9 — Summary & Key Takeaways

**What we did:**
- Trained Node2Vec on the Cora citation graph using **only the graph structure** (edges)
- The model never saw the 1433-d BoW features or the class labels during embedding training
- Used the learned 128-d embeddings as input to a simple Logistic Regression classifier
<!-- 
**Key observations:**
- Node2Vec achieves reasonable accuracy using citations alone, demonstrating  
  that the citation network encodes meaningful information about paper topics
- Papers in the same research area tend to cite each other, so structural proximity  
  in the citation graph correlates with topical similarity
- The t-SNE plot should show some class clustering, confirming the embeddings  
  capture class-relevant structure -->

**Limitations (motivating GNNs next):**
- Node2Vec is **transductive**: it cannot generate embeddings for new/unseen nodes
- It completely ignores the rich BoW features (paper content)
- The embeddings are learned independently of the downstream task (unsupervised)
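
The transductive limitation can be illustrated with a toy contrast. In the sketch below, the lookup table stands in for Node2Vec's embedding matrix and the mean aggregation is a deliberately crude stand-in for a GNN layer; all names and numbers are invented for the example.

```python
# Transductive: a lookup table with one learned row per training-graph node
embedding_table = {0: [0.2, -0.1], 1: [0.4, 0.3], 2: [-0.5, 0.1]}

def lookup_embedding(node_id):
    return embedding_table.get(node_id)  # None for nodes never seen in training

# Inductive (toy stand-in for a GNN layer): average the feature vectors of a
# node and its neighbors, which works for any node that has features
def mean_aggregate(node_features, neighbor_features):
    vectors = [node_features] + neighbor_features
    return [sum(vals) / len(vectors) for vals in zip(*vectors)]

new_node_id = 3                      # a paper added after training
new_node_features = [1.0, 0.0]
new_neighbor_features = [[0.0, 1.0], [1.0, 1.0]]

print(lookup_embedding(new_node_id))  # None
print(mean_aggregate(new_node_features, new_neighbor_features))
```

A lookup table has no row for the new paper, whereas a function of features and neighbors can still produce a representation for it.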

**Next step:** Apply GCN and GAT (deep embeddings) that combine both graph structure  
AND node features, with end-to-end supervised training for classification.