# Exploration Notebook – Collaboration Network

This notebook performs an initial exploratory data analysis (EDA) of the collaboration network used in the project **“Link Prediction in Collaboration Networks using Graph Neural Networks”**.

It corresponds to **Sprint 1: Research Question, Data Collection and Network** in the Network Science course (PPGEC / UPE).

## 1. Setup and Imports

We assume the following directory structure:

```text
upe-ppgec-netsci-2025-1-projeto-icbvo/
├── data/
│   └── collaboration.edgelist.txt
├── gnn/
├── notebooks/
└── results/
```

The notebook is inside `notebooks/` and the edge list is located in `../data/collaboration.edgelist.txt`.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import networkx as nx
from pathlib import Path
import json

%matplotlib inline

DATA_PATH = Path("../data/collaboration.edgelist.txt")
RESULTS_DIR = Path("../results")
RESULTS_DIR.mkdir(parents=True, exist_ok=True)

DATA_PATH, RESULTS_DIR

## 2. Loading the Edge List

The file is expected to have **no header** and **two integer columns** (node identifiers), separated by whitespace (space or tab). In the sample below, the `u  v` line only labels the columns and is not part of the file:

```text
u  v
0  1680
0  6918
...
```

In [None]:
if not DATA_PATH.exists():
    raise FileNotFoundError(f"Edge list file not found: {DATA_PATH}")

df_edges = pd.read_csv(DATA_PATH, sep=r"\s+", header=None, names=["u", "v"])
df_edges.head()
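
As a sanity check on the expected format, the same `read_csv` call can be exercised on an in-memory sample. This is only a sketch: the sample text, the name `df_sample`, and the two validation checks are illustrative assumptions, not part of the project data.

```python
import io

import pandas as pd

# Hypothetical sample mixing spaces and tabs, mimicking the expected format.
sample = "0 1680\n0\t6918\n1   88\n"

df_sample = pd.read_csv(
    io.StringIO(sample), sep=r"\s+", header=None, names=["u", "v"]
)

# Both columns should contain no missing values and parse as integers.
no_missing = bool(df_sample.notna().all().all())
all_integer = all(pd.api.types.is_integer_dtype(t) for t in df_sample.dtypes)

print(df_sample.shape, no_missing, all_integer)
```

Running the same two checks on `df_edges` would flag a stray header row or a malformed line, since either would turn a column into a non-integer dtype.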

## 3. Basic Statistics

We compute:

- Number of edges
- Number of unique nodes
- Presence of self-loops
- Minimum and maximum node IDs

In [None]:
n_edges = len(df_edges)
nodes = set(df_edges["u"]) | set(df_edges["v"])
n_nodes = len(nodes)

has_self_loops = (df_edges["u"] == df_edges["v"]).any()
min_node = min(nodes)
max_node = max(nodes)

print(f"Number of edges: {n_edges}")
print(f"Number of unique nodes: {n_nodes}")
print(f"Any self-loops? {has_self_loops}")
print(f"Min node ID: {min_node}")
print(f"Max node ID: {max_node}")

## 4. Building the Graph with NetworkX

We treat the network as **undirected**, since collaborations are symmetric (if A collaborated with B, then B collaborated with A).

In [None]:
G = nx.from_pandas_edgelist(df_edges, source="u", target="v")
G

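We check the graph against the basic statistics computed before. The number of nodes should match. The number of edges matches only if the file lists each collaboration once: an undirected `nx.Graph` stores a single edge per node pair, so repeated rows and reversed pairs (`u v` and `v u`) are merged and the graph can have fewer edges than the file has rows.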

In [None]:
print(f"Graph number of nodes: {G.number_of_nodes()}")
print(f"Graph number of edges: {G.number_of_edges()}")
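
If the two edge counts differ, duplicated or reversed rows are a likely cause. The sketch below illustrates this on a small made-up edge list (`toy_edges` and `G_toy` are hypothetical names, not the project data): it sorts each pair so that both orientations map to the same undirected edge, then compares the de-duplicated row count with what NetworkX reports.

```python
import networkx as nx
import numpy as np
import pandas as pd

# Hypothetical toy edge list: (0, 1) appears three times, once reversed.
toy_edges = pd.DataFrame({"u": [0, 1, 0, 2], "v": [1, 0, 1, 3]})

# Canonical form: smaller ID first, so (1, 0) and (0, 1) become identical rows.
canonical = pd.DataFrame(
    {
        "u": np.minimum(toy_edges["u"], toy_edges["v"]),
        "v": np.maximum(toy_edges["u"], toy_edges["v"]),
    }
)
n_rows = len(toy_edges)
n_unique_pairs = len(canonical.drop_duplicates())

G_toy = nx.from_pandas_edgelist(toy_edges, source="u", target="v")

print(n_rows, n_unique_pairs, G_toy.number_of_edges())
```

In this toy case the file has four rows but only two distinct undirected edges, which is also the edge count of the graph.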

## 5. Degree Distribution

We compute the degree of each node and visualize the distribution.

First, we look at the raw histogram; then, we inspect a log–log version to better understand the heavy-tailed behavior.

In [None]:
degrees = np.array([d for _, d in G.degree()])

print(f"Average degree: {degrees.mean():.2f}")
print(f"Median degree: {np.median(degrees):.2f}")
print(f"Max degree: {degrees.max()}")

In [None]:
plt.figure(figsize=(8, 5))
plt.hist(degrees, bins=100, color="steelblue")
plt.title("Degree Distribution (Linear Scale)")
plt.xlabel("Degree")
plt.ylabel("Frequency")
plt.tight_layout()
plt.show()

In [None]:
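plt.figure(figsize=(8, 5))
# Logarithmically spaced bins keep bin widths uniform on the log-scaled x-axis.
# Every node comes from at least one edge, so all degrees are >= 1.
log_bins = np.logspace(0, np.log10(degrees.max()), 50)
plt.hist(degrees, bins=log_bins, color="darkorange")
plt.title("Degree Distribution (Log-Log Scale)")
plt.xlabel("Degree")
plt.ylabel("Frequency")
plt.xscale("log")
plt.yscale("log")
plt.tight_layout()
plt.show()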
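
Histograms of heavy-tailed data are sensitive to the choice of bins. One common alternative is the empirical complementary cumulative distribution (CCDF), P(K ≥ k), which needs no binning. A minimal sketch on a synthetic degree sequence follows; `toy_degrees` is made up for illustration and would be replaced by `degrees` in the notebook.

```python
import numpy as np

# Hypothetical degree sequence, for illustration only.
toy_degrees = np.array([1, 1, 1, 2, 2, 3, 5, 8])

# Sorted unique degree values and the number of nodes with each value.
values, counts = np.unique(toy_degrees, return_counts=True)

# P(K >= k): reverse cumulative sum of the counts, divided by the node count.
ccdf = counts[::-1].cumsum()[::-1] / toy_degrees.size

for k, p in zip(values, ccdf):
    print(k, p)
```

Plotting `values` against `ccdf` with `plt.loglog(values, ccdf, marker="o")` gives a curve that starts at 1 and decreases monotonically, which is often easier to read in the tail than a binned histogram.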

## 6. Connected Components

We inspect the number of connected components and the size of the largest connected component (LCC).

In [None]:
n_components = nx.number_connected_components(G)
lcc_nodes = max(nx.connected_components(G), key=len)
lcc_size = len(lcc_nodes)

print(f"Number of connected components: {n_components}")
print(f"Largest connected component size: {lcc_size}")
print(f"Fraction of nodes in LCC: {lcc_size / G.number_of_nodes():.4f}")
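
If a later step needs a connected graph (shortest-path measures, for example, are undefined between nodes in different components), the LCC can be extracted as its own graph. The sketch below uses a made-up graph; `G_toy` and `G_lcc` are hypothetical names, and whether the project restricts itself to the LCC is an assumption, not something stated in this notebook.

```python
import networkx as nx

# Hypothetical toy graph with three components of sizes 4, 2 and 1.
G_toy = nx.Graph()
G_toy.add_edges_from([(0, 1), (1, 2), (2, 3), (10, 11)])
G_toy.add_node(20)

# Component sizes, largest first.
sizes = sorted((len(c) for c in nx.connected_components(G_toy)), reverse=True)

# Independent copy of the subgraph induced by the largest component.
lcc = max(nx.connected_components(G_toy), key=len)
G_lcc = G_toy.subgraph(lcc).copy()

print(sizes, G_lcc.number_of_nodes(), G_lcc.number_of_edges())
```

The `.copy()` call matters because `subgraph` returns a read-only view tied to the original graph.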

## 7. Global Network Properties

We compute some basic global measures of the network:

- Density
- Average clustering coefficient
- Transitivity

In [None]:
density = nx.density(G)
avg_clustering = nx.average_clustering(G)
transitivity = nx.transitivity(G)

print(f"Density: {density:.6f}")
print(f"Average clustering coefficient: {avg_clustering:.6f}")
print(f"Transitivity: {transitivity:.6f}")
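
The three measures are easy to confuse, so a small worked example may help. For a simple undirected graph with n nodes and m edges, density is 2m / (n(n - 1)). Average clustering is the mean of the per-node clustering coefficients, while transitivity is three times the number of triangles divided by the number of connected triples, so the two generally differ. The graph below (`G_toy`, a triangle with one pendant node) is made up for illustration.

```python
import networkx as nx

# Hypothetical toy graph: triangle 0-1-2 plus a pendant node 3 attached to 0.
G_toy = nx.Graph([(0, 1), (1, 2), (0, 2), (0, 3)])

n = G_toy.number_of_nodes()
m = G_toy.number_of_edges()

# Density by hand: 2m / (n(n - 1)) = 8 / 12.
manual_density = 2 * m / (n * (n - 1))

# Local clustering: node 0 -> 1/3, nodes 1 and 2 -> 1, node 3 -> 0.
# Average clustering = (1/3 + 1 + 1 + 0) / 4 = 7/12.
avg_c = nx.average_clustering(G_toy)

# Transitivity: 3 * 1 triangle / 5 connected triples = 0.6.
trans = nx.transitivity(G_toy)

print(manual_density, nx.density(G_toy), avg_c, trans)
```

Average clustering weights every node equally, so many low-degree nodes can pull it away from transitivity, which weights each connected triple equally.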

## 8. Summary and Export

We summarize the main statistics in a Python dictionary and export them to a JSON file in the `../results/` directory. The file can be cited later in the Data / Network Description section of the research paper.

In [None]:
summary = {
    "n_nodes": int(G.number_of_nodes()),
    "n_edges": int(G.number_of_edges()),
    "has_self_loops": bool(has_self_loops),
    "min_node_id": int(min_node),
    "max_node_id": int(max_node),
    "avg_degree": float(degrees.mean()),
    "median_degree": float(np.median(degrees)),
    "max_degree": int(degrees.max()),
    "n_components": int(n_components),
    "largest_component_size": int(lcc_size),
    "largest_component_fraction": float(lcc_size / G.number_of_nodes()),
    "density": float(density),
    "average_clustering": float(avg_clustering),
    "transitivity": float(transitivity),
}

summary_path = RESULTS_DIR / "graph_summary.json"
with open(summary_path, "w") as f:
    json.dump(summary, f, indent=4)

print(f"Summary saved to: {summary_path}")
summary
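
To reuse the exported statistics elsewhere (for example, when filling in a table for the paper), the JSON file can be read back into a dictionary. The sketch below round-trips a made-up dictionary through a temporary directory; `toy_summary` and its values are hypothetical, and the real file would be `../results/graph_summary.json`.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical summary values, for illustration only.
toy_summary = {"n_nodes": 4, "n_edges": 4, "density": 2 / 3}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "graph_summary.json"
    # Write the summary, then read it back from disk.
    path.write_text(json.dumps(toy_summary, indent=4), encoding="utf-8")
    loaded = json.loads(path.read_text(encoding="utf-8"))

print(loaded == toy_summary)
```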