# 04 – Results and Visualization

This notebook aggregates the main results and visualizations from the project:

- Global graph summary (full network)
- Network properties of the largest connected component (LCC)
- Centrality measures (degree, betweenness, eigenvector)
- GNN link prediction performance (test AUC, test AP)
- Simple plots and a LaTeX table snippet for the research paper

It assumes the following files have been generated by previous scripts / notebooks:

- `../results/graph_summary.json` (from `01_exploration.ipynb`)
- `../results/network_properties_lcc.json` (from `02_network_properties.ipynb`)
- `../results/centrality_lcc.csv` (from `02_network_properties.ipynb`)
- `../results/linkpred_metrics.json` (from `train_link_prediction_gnn.py`)

This notebook is meant to support the **Results** and **Discussion** sections of the paper.

## 1. Setup and Imports

We set the project root to `..` and define the paths to the result files under the `../results/` directory. The files themselves are loaded in the next section.

In [None]:
import json
from pathlib import Path

import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import display  # explicit import; used by the table cells below

%matplotlib inline

PROJECT_ROOT = Path("..").resolve()
RESULTS_DIR = PROJECT_ROOT / "results"

GRAPH_SUMMARY_PATH = RESULTS_DIR / "graph_summary.json"
LCC_PROPS_PATH = RESULTS_DIR / "network_properties_lcc.json"
CENTRALITY_PATH = RESULTS_DIR / "centrality_lcc.csv"
METRICS_PATH = RESULTS_DIR / "linkpred_metrics.json"

PROJECT_ROOT, RESULTS_DIR

## 2. Load Summary Files

We load:

- Global graph summary
- LCC network properties
- Centrality table
- GNN link prediction metrics

If any file is missing, we print a warning and set the corresponding variable to `None`, so that later cells can skip it.

In [None]:
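def load_json_safe(path: Path):
    """Return the parsed JSON at `path`, or None (with a warning) if the file is missing."""
    if path.exists():
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)
    else:
        print(f"[Warning] JSON file not found: {path}")
        return None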


graph_summary = load_json_safe(GRAPH_SUMMARY_PATH)
lcc_props = load_json_safe(LCC_PROPS_PATH)

if CENTRALITY_PATH.exists():
    centrality_df = pd.read_csv(CENTRALITY_PATH)
else:
    print(f"[Warning] Centrality CSV not found: {CENTRALITY_PATH}")
    centrality_df = None

metrics = load_json_safe(METRICS_PATH)

graph_summary, lcc_props, (centrality_df.head() if centrality_df is not None else None), metrics
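Before running the rest of the notebook, it can help to see all missing inputs at once rather than one warning per loader. The sketch below is one way to do that; `find_missing` and `EXPECTED_FILES` are hypothetical names that are not part of the original notebook, and the demo runs on a temporary directory rather than on `../results/`.

```python
from pathlib import Path
import tempfile

# Hypothetical helper (not in the original notebook).
def find_missing(results_dir: Path, filenames: list) -> list:
    """Return the names in `filenames` that do not exist under `results_dir`."""
    return [name for name in filenames if not (results_dir / name).exists()]

# File names taken from the list at the top of this notebook.
EXPECTED_FILES = [
    "graph_summary.json",
    "network_properties_lcc.json",
    "centrality_lcc.csv",
    "linkpred_metrics.json",
]

# Demo on a throwaway directory that contains only one of the four files.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp)
    (demo_dir / "graph_summary.json").write_text("{}")
    missing = find_missing(demo_dir, EXPECTED_FILES)

print(missing)
```

In the notebook itself, the same call would be `find_missing(RESULTS_DIR, EXPECTED_FILES)`.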

## 3. Global Graph Summary (Full Network)

We organize the main statistics from `graph_summary.json` in a small table-like view for easier inspection.

Typical fields:

- `n_nodes`, `n_edges`
- `avg_degree`, `median_degree`, `max_degree`
- `density`, `average_clustering`, `transitivity`
- fraction of nodes in the largest component (if available)

In [None]:
if graph_summary is not None:
    graph_summary_df = pd.DataFrame.from_dict(graph_summary, orient="index", columns=["value"])
    display(graph_summary_df)
else:
    print("No global graph summary available.")
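As a quick sanity check on the summary, the average degree and density can be recomputed from the node and edge counts. The sketch below assumes an undirected simple graph (no self-loops or multi-edges), for which the average degree is 2m/n and the density is 2m/(n(n-1)). It also assumes the field names listed above; the toy numbers are made up for illustration and are not results from this project.

```python
import math

def expected_degree_stats(n_nodes: int, n_edges: int) -> dict:
    """Average degree and density implied by the counts (undirected simple graph)."""
    avg_degree = 2 * n_edges / n_nodes
    density = 2 * n_edges / (n_nodes * (n_nodes - 1))
    return {"avg_degree": avg_degree, "density": density}

# Toy summary with made-up values (assumption: same keys as graph_summary.json).
toy_summary = {"n_nodes": 5, "n_edges": 4, "avg_degree": 1.6, "density": 0.4}

implied = expected_degree_stats(toy_summary["n_nodes"], toy_summary["n_edges"])
consistent = all(
    math.isclose(toy_summary[key], value, rel_tol=1e-6)
    for key, value in implied.items()
)
print(implied, consistent)
```

If the stored values disagree with the implied ones, the graph may be directed or contain self-loops, or the summary may have been computed on a different version of the edge list.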

## 4. LCC (Largest Connected Component) Properties

We inspect `network_properties_lcc.json`, which typically includes:

- `lcc_n_nodes`, `lcc_n_edges`
- `lcc_fraction_of_total_nodes`
- `average_shortest_path_length`
- `diameter_exact_or_none`
- `degree_assortativity`

These are key values for describing the structure of the collaboration network in the paper.

In [None]:
if lcc_props is not None:
    lcc_props_df = pd.DataFrame.from_dict(lcc_props, orient="index", columns=["value"])
    display(lcc_props_df)
else:
    print("No LCC properties available.")
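To make the path-based fields concrete, the sketch below computes the average shortest path length and the diameter of a tiny connected graph with breadth-first search. It uses the common definition that averages distances over all ordered node pairs; whether `02_network_properties.ipynb` used exactly this definition (or an approximation for the diameter) is an assumption. The toy graph and the function names are illustrative only.

```python
from collections import deque

def bfs_distances(adj: dict, source) -> dict:
    """Hop distance from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def path_stats(adj: dict):
    """Average shortest path length and diameter of a connected undirected graph."""
    n = len(adj)
    total, diameter = 0, 0
    for source in adj:
        dist = bfs_distances(adj, source)
        total += sum(dist.values())          # distances from this source
        diameter = max(diameter, max(dist.values()))
    return total / (n * (n - 1)), diameter   # mean over ordered pairs

# Toy path graph 0 - 1 - 2 - 3 (made-up example).
toy_adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
toy_avg_spl, toy_diameter = path_stats(toy_adj)
print(toy_avg_spl, toy_diameter)
```

Running one BFS per node is quadratic in the worst case, which is why an exact diameter is sometimes skipped on large components.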

## 5. Centrality Distributions

If `centrality_lcc.csv` is available, we:

- Show the first few rows
- Plot histograms (log scale where appropriate) for:
  - Degree centrality
  - Betweenness centrality
  - Eigenvector centrality (if available)

These plots can help identify whether the network has a small core of highly central authors and a long tail of less central authors, which is typical of collaboration networks.

In [None]:
if centrality_df is not None:
    display(centrality_df.head())

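    # Only the count (y) axis is log-scaled below; the x-axis stays linear.
    # eps is a tiny offset that would keep zero values plottable if the
    # x-axis were also switched to log scale; on a linear axis it is negligible.
    eps = 1e-9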

    plt.figure(figsize=(8, 5))
    plt.hist(centrality_df["degree_centrality"], bins=100)
    plt.title("Degree Centrality Distribution (LCC)")
    plt.xlabel("Degree centrality")
    plt.ylabel("Frequency")
    plt.yscale("log")
    plt.tight_layout()
    plt.show()

    plt.figure(figsize=(8, 5))
    plt.hist(centrality_df["betweenness_centrality"] + eps, bins=100)
    plt.title("Betweenness Centrality Distribution (LCC)")
    plt.xlabel("Betweenness centrality")
    plt.ylabel("Frequency")
    plt.yscale("log")
    plt.tight_layout()
    plt.show()

    if "eigenvector_centrality" in centrality_df.columns:
        valid_eig = centrality_df["eigenvector_centrality"].dropna()
        if len(valid_eig) > 0:
            plt.figure(figsize=(8, 5))
            plt.hist(valid_eig + eps, bins=100)
            plt.title("Eigenvector Centrality Distribution (LCC)")
            plt.xlabel("Eigenvector centrality")
            plt.ylabel("Frequency")
            plt.yscale("log")
            plt.tight_layout()
            plt.show()
else:
    print("No centrality data available.")
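The histograms show the shape of each distribution. A single number can complement them by quantifying how concentrated centrality is in the core. The sketch below computes the share of the total held by the top fraction of nodes. `top_share` is a hypothetical helper, and the values are made up; in the notebook it could be applied to, for example, `centrality_df["degree_centrality"].tolist()`.

```python
def top_share(values, fraction=0.01):
    """Share of the total held by the top `fraction` of entries (at least one entry)."""
    ordered = sorted(values, reverse=True)
    k = max(1, int(len(ordered) * fraction))
    total = sum(ordered)
    return sum(ordered[:k]) / total if total > 0 else 0.0

# Made-up centrality values with one dominant node.
toy_values = [0.50, 0.20, 0.10, 0.05, 0.05, 0.04, 0.03, 0.01, 0.01, 0.01]
share = top_share(toy_values, fraction=0.1)
print(f"Top 10% of nodes hold {share:.1%} of the total")
```

A share far above the chosen fraction indicates a small, highly central core, which is the pattern described above.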

We can also list the **top 10 nodes** for each centrality measure. This is useful if we later want to identify specific authors or patterns in the collaboration structure (even if we do not map node IDs to names in this dataset).

In [None]:
if centrality_df is not None:
    def top_k(df, col, k=10):
        return df.sort_values(col, ascending=False).head(k)[["node", col]]

    print("Top 10 nodes by degree centrality:")
    display(top_k(centrality_df, "degree_centrality", k=10))

    print("Top 10 nodes by betweenness centrality:")
    display(top_k(centrality_df, "betweenness_centrality", k=10))

    if "eigenvector_centrality" in centrality_df.columns:
        print("Top 10 nodes by eigenvector centrality:")
        display(top_k(centrality_df, "eigenvector_centrality", k=10))
else:
    print("No centrality data available.")
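One way to compare the rankings is to measure how much the top-k sets overlap across centrality measures. The sketch below uses the Jaccard index of two top-k sets. The helper names and the toy scores are hypothetical; in the notebook, the inputs could be built with something like `dict(zip(centrality_df["node"], centrality_df["degree_centrality"]))`.

```python
def top_k_ids(scores: dict, k: int) -> set:
    """IDs of the k highest-scoring nodes."""
    return set(sorted(scores, key=scores.get, reverse=True)[:k])

def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by the size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Made-up scores for four nodes.
toy_degree = {"a": 0.9, "b": 0.7, "c": 0.4, "d": 0.1}
toy_betweenness = {"a": 0.5, "b": 0.0, "c": 0.3, "d": 0.2}

overlap = jaccard(top_k_ids(toy_degree, 2), top_k_ids(toy_betweenness, 2))
print(overlap)
```

A low overlap suggests the measures highlight different nodes, for instance well-connected authors versus authors who bridge groups.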

## 6. GNN Link Prediction Metrics

We show the final metrics from `linkpred_metrics.json` (loaded in Section 2) in a small table.

Typically, this includes fields such as:

- `test_auc`
- `test_ap`
- `encoder`
- `epochs`

These values will be directly used in the *Results* section of the paper.

In [None]:
if metrics is not None:
    metrics_df = pd.DataFrame.from_dict(metrics, orient="index", columns=["value"])
    display(metrics_df)
else:
    print("No GNN metrics available.")

If you want, you can format a short textual summary for the paper, such as:

> "Using a GCN-based link prediction model with 64-dimensional node embeddings, we obtained a test AUC of X.XXX and a test Average Precision (AP) of Y.YYY on the collaboration network."

You can adapt this later to the actual values (including the encoder and embedding size used in training) and to any baselines you choose to implement (e.g., Common Neighbors, Jaccard, Adamic–Adar, Preferential Attachment).
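For reference, the four baseline heuristics mentioned above can be scored for a candidate node pair from neighbor sets alone. The sketch below is a minimal pure-Python version on a made-up toy graph. It is not part of the project's training script, and the function names are hypothetical.

```python
import math

def neighbor_sets(edges):
    """Build an undirected adjacency map {node: set of neighbors}."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def heuristic_scores(adj, u, v):
    """Classic neighborhood-based link prediction scores for the pair (u, v)."""
    nu, nv = adj[u], adj[v]
    common = nu & nv
    union = nu | nv
    return {
        "common_neighbors": len(common),
        "jaccard": len(common) / len(union) if union else 0.0,
        # A common neighbor of two distinct nodes has degree >= 2, so log > 0.
        "adamic_adar": sum(1.0 / math.log(len(adj[w])) for w in common),
        "preferential_attachment": len(nu) * len(nv),
    }

# Made-up toy graph; (0, 3) is a non-edge we want to score.
toy_edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4)]
toy_neighbors = neighbor_sets(toy_edges)
scores = heuristic_scores(toy_neighbors, 0, 3)
print(scores)
```

To compare against the GNN, these scores would need to be computed on the same held-out positive and negative edges used for `test_auc` and `test_ap`, using only the training graph for the neighbor sets.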

## 7. LaTeX Table Snippet for the Paper

Here we generate a simple LaTeX table snippet based on the available metrics and network properties.

You can copy the output and paste it directly into your IEEE/ACM LaTeX document, adjusting labels/captions as needed.

The table will summarize:

- Number of nodes / edges (full graph)
- LCC properties (size, average shortest path length, assortativity)
- GNN performance (test AUC, test AP)

If some data is missing (e.g., a JSON file was not generated), we fill with `N/A`.

In [None]:
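def get_val(d, key, default="N/A"):
    # Fall back to the default when the file is missing, the key is absent,
    # or the stored value is null (e.g., a diameter that was not computed).
    if d is None:
        return default
    value = d.get(key, default)
    return default if value is None else value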

n_nodes = get_val(graph_summary, "n_nodes")
n_edges = get_val(graph_summary, "n_edges")
avg_degree = get_val(graph_summary, "avg_degree")
density = get_val(graph_summary, "density")

lcc_n_nodes = get_val(lcc_props, "lcc_n_nodes")
lcc_n_edges = get_val(lcc_props, "lcc_n_edges")
lcc_fraction = get_val(lcc_props, "lcc_fraction_of_total_nodes")
avg_spl = get_val(lcc_props, "average_shortest_path_length")
diameter = get_val(lcc_props, "diameter_exact_or_none")
assortativity = get_val(lcc_props, "degree_assortativity")

test_auc = get_val(metrics, "test_auc")
test_ap = get_val(metrics, "test_ap")
encoder_name = get_val(metrics, "encoder")
epochs = get_val(metrics, "epochs")

latex_table = f"""\\begin{{table}}[t]
\\centering
\\caption{{Summary of collaboration network and GNN link prediction performance.}}
\\label{{tab:network-gnn-summary}}
\\begin{{tabular}}{{ll}}
\\hline
\\textbf{{Property}} & \\textbf{{Value}} \\\\ 
\\hline
Number of nodes (full graph) & {n_nodes} \\\\ 
Number of edges (full graph) & {n_edges} \\\\ 
Average degree (full graph) & {avg_degree} \\\\ 
Density (full graph) & {density} \\\\ 
LCC size (nodes / edges) & {lcc_n_nodes} / {lcc_n_edges} \\\\ 
Fraction of nodes in LCC & {lcc_fraction} \\\\ 
Average shortest path length (LCC) & {avg_spl} \\\\ 
Diameter (LCC) & {diameter} \\\\ 
Degree assortativity (LCC) & {assortativity} \\\\ 
GNN encoder & {encoder_name} \\\\ 
Training epochs & {epochs} \\\\ 
Test AUC & {test_auc} \\\\ 
Test AP & {test_ap} \\\\ 
\\hline
\\end{{tabular}}
\\end{{table}}
"""

print(latex_table)
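The table above interpolates the raw values, so floats appear at full precision and any underscore in a string (such as an encoder name) may break LaTeX compilation. The sketch below shows one possible cell formatter. `fmt_latex` is a hypothetical helper, the three-decimal default is an arbitrary choice, and only underscores are escaped, which is an assumption about which special characters can occur.

```python
def fmt_latex(value, digits=3):
    """Format one table cell: round floats, map None to N/A, escape underscores."""
    if value is None:
        return "N/A"
    if isinstance(value, bool):
        return str(value)
    if isinstance(value, float):
        return f"{value:.{digits}f}"
    return str(value).replace("_", r"\_")

# Made-up example values, not results from this project.
examples = [0.912345, 5242, None, "GCN_v2", "N/A"]
formatted = [fmt_latex(v) for v in examples]
print(formatted)
```

In the f-string, each placeholder would then become, for example, `{fmt_latex(test_auc)}` instead of `{test_auc}`.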

You can now copy the LaTeX snippet above and paste it into your paper. Adjust the caption and label as needed.

This completes the **Results and Visualization** stage: the network statistics, centrality analysis, and GNN performance are now collected in a single, consistent view for the paper.