# PageRank

We will explore how the **PageRank algorithm** works and how it relates to real-world web traffic data.

The dataset you will use is derived from the **Wikipedia category 'Machine Learning'**. Each node corresponds to a Wikipedia article, and each directed edge represents a hyperlink from one page to another. A separate file contains the **page view statistics (traffic)** for each article. The data was downloaded using the Wikipedia API; for the curious, the scripts are in the `datagen` folder.

Your goals are:
1. Load and inspect the dataset (`nodes.csv`, `edges.csv`, and `traffic.csv`).
2. Build a directed graph using NetworkX.
3. Compute the PageRank vector using both the NetworkX implementation and the power iteration method.
4. Compare the results of both methods and visualize the graph.
5. Compare PageRank values with real page traffic and discuss their correlation.

In [None]:
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt
import numpy as np

# Load the data
nodes = pd.read_csv("nodes.csv")
edges = pd.read_csv("edges.csv")
traffic = pd.read_csv("traffic.csv")

print(f"Loaded {len(nodes)} nodes, {len(edges)} edges, and {len(traffic)} traffic entries")

### Build the Graph and Transition Matrix

We first create a **directed graph** representing the hyperlink structure between Wikipedia articles.
Each node corresponds to a page, and each directed edge represents a link from one page to another.

Then we construct the **transition matrix** $M$ such that $M_{ij}$ represents the probability of moving from page $j$ to page $i$ when following a link.
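
To make the definition concrete, here is a minimal sketch on a hypothetical three-page graph (the edge list `toy_edges` and the names `n_pages` and `M_toy` are made up for illustration and are not part of the dataset). It shows that each column of the matrix sums to 1 when every page has at least one outgoing link.

```python
import numpy as np

# Hypothetical toy graph: (source, target) pairs over pages 0..2.
toy_edges = [(0, 1), (0, 2), (1, 2), (2, 0)]
n_pages = 3

# Count the outgoing links of each page.
out_degree = np.zeros(n_pages)
for src, _ in toy_edges:
    out_degree[src] += 1

# M_toy[i, j] = probability of moving from page j to page i.
M_toy = np.zeros((n_pages, n_pages))
for src, dst in toy_edges:
    M_toy[dst, src] = 1.0 / out_degree[src]

print(M_toy)
print(M_toy.sum(axis=0))  # column sums
```

Page 0 links to two pages, so its column holds two entries of 0.5; pages 1 and 2 each have a single outgoing link, so their columns hold a single 1.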

In [None]:
graph = nx.DiGraph()
graph.add_nodes_from(nodes["node"])
graph.add_edges_from(edges[["source", "target"]].values)

N = len(nodes)
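# Map each node label to its row/column index in M
node_index = {node: i for i, node in enumerate(nodes["node"])}

# Build the column-stochastic matrix M: column u spreads probability
# uniformly over the pages that u links to
M = np.zeros((N, N))
for u, v in graph.edges():
    M[node_index[v], node_index[u]] = 1.0 / graph.out_degree(u)

# Dangling pages (no outgoing links) would leave a zero column, so M would not
# be stochastic. Treat them as linking to every page uniformly, which is also
# how nx.pagerank handles them by default.
for u in graph.nodes():
    if graph.out_degree(u) == 0:
        M[:, node_index[u]] = 1.0 / N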

print(f"Transition matrix M built with shape: {M.shape}")

### The Damping Factor and Power Iteration

The **damping factor** (usually denoted $d$) is the probability that a user follows a link on the current page rather than jumping to a random page.
A typical choice is $d = 0.85$: the user follows a link 85% of the time and jumps to a random page 15% of the time.

We will:
- Compute PageRank using NetworkX's built-in implementation.
- Compute PageRank manually using the **power iteration method** applied to the Google matrix
$$G = d M + \frac{1 - d}{N} \mathbf{1}_{N\times N},$$
where $\mathbf{1}_{N\times N}$ denotes the $N \times N$ matrix of ones.
- Compare the two results.
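
As an illustration of the power iteration step, the following sketch applies it to a hypothetical three-page transition matrix (`M_toy`, `G_toy`, and `p_toy` are illustrative names, and the matrix is made up rather than taken from the dataset). Because every column of `G_toy` sums to 1, the iterate stays a probability vector without explicit renormalization.

```python
import numpy as np

# Hypothetical column-stochastic matrix: column j holds the link
# probabilities out of page j.
M_toy = np.array([
    [0.0, 0.0, 1.0],
    [0.5, 0.0, 0.0],
    [0.5, 1.0, 0.0],
])
d = 0.85
n = M_toy.shape[0]

# Google matrix: follow a link with probability d, jump uniformly otherwise.
G_toy = d * M_toy + (1 - d) / n * np.ones((n, n))

# Start from the uniform distribution and apply G repeatedly.
p_toy = np.full(n, 1.0 / n)
for _ in range(200):
    p_new = G_toy @ p_toy
    done = np.abs(p_new - p_toy).sum() < 1e-12
    p_toy = p_new
    if done:
        break

print(p_toy, p_toy.sum())
```

The resulting vector is a fixed point of `G_toy`, that is, an eigenvector with eigenvalue 1. In this toy graph page 2 receives links from both other pages and ends up with the largest score.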

In [None]:
damping_factor = 0.85

# --- Compute PageRank using NetworkX ---
pagerank = nx.pagerank(graph, alpha=damping_factor)
pagerank_vec = np.array([pagerank[n] for n in nodes["node"]])
nx.set_node_attributes(graph, pagerank, "pagerank")

# --- Compute PageRank using power iteration ---
# SOLUTION-BEGIN
G_matrix = damping_factor * M + (1 - damping_factor) / N * np.ones((N, N))

# Start from the uniform distribution
p = np.ones(N) / N
tol = 1e-10
max_iter = 1000

for i in range(max_iter):
    p_next = G_matrix @ p
    p_next /= p_next.sum()  # keep p a probability vector (entries sum to 1)
    converged = np.linalg.norm(p_next - p, 2) < tol
    p = p_next  # always keep the latest iterate
    if converged:
        print(f"Converged after {i + 1} iterations")
        break

pagerank_powit = p / np.sum(p)
# SOLUTION-END

# Compare results
corr = np.corrcoef(pagerank_vec, pagerank_powit)[0, 1]
l1_diff = np.linalg.norm(pagerank_vec - pagerank_powit, 1)

print("\n--- PageRank Comparison ---")
print(f"Pearson correlation: {corr:.6f}")
print(f"L1 difference:       {l1_diff:.6e}")

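plt.plot(pagerank_vec, 'o-', label="NetworkX")
plt.plot(pagerank_powit, 'x-', label="Power iteration")
plt.xlabel("Node index")
plt.ylabel("PageRank")
plt.legend()
plt.show()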

### Visualize the Graph

We now visualize the graph using a spring layout. Node sizes are proportional to their PageRank values.

- Larger nodes correspond to more 'important' pages.
- The layout helps highlight clusters and central pages within the category.

In [None]:
plt.figure(figsize=(12, 10))

pos = nx.spring_layout(graph, k=0.5, seed=42)

# Draw nodes
nx.draw_networkx_nodes(
    graph,
    pos,
    node_size=pagerank_vec * 5000,
    node_color="skyblue",
    alpha=0.8,
    edgecolors="gray",
)

# Draw edges
nx.draw_networkx_edges(graph, pos, arrows=True, alpha=0.4)

# Draw labels for the most important pages only (top 10% by PageRank);
# optional, most useful for smaller graphs
label_threshold = np.quantile(pagerank_vec, 0.9)
nx.draw_networkx_labels(
    graph,
    pos,
    labels={n: n for n in nodes["node"] if pagerank[n] > label_threshold},
    font_color="k",
    font_size=10,
    bbox=dict(facecolor='white', edgecolor='black', boxstyle='round,pad=0.3')
)

plt.title("Wikipedia Machine Learning Graph\nNode size = PageRank", fontsize=14)
plt.axis("off")
plt.tight_layout()
plt.show()

### Compare PageRank with Wikipedia Traffic

Finally, we compare the computed PageRank values with **real page traffic** obtained from the Wikimedia API.
This allows us to see whether the theoretical importance (PageRank) corresponds to actual user visits.

- Add the PageRank value to the traffic DataFrame.
- Plot traffic vs. PageRank on a log-log scale.
- Compute and print their correlation.
- Print the 10 most important pages (highest PageRank score).
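
Since the scatter plot uses log-log axes, it may also be worth looking at the correlation of the log-transformed values. The sketch below uses made-up numbers (`toy_traffic` and `toy_pagerank` are hypothetical and not taken from the dataset) and assumes all values are strictly positive, so the logarithm is defined.

```python
import numpy as np

# Hypothetical values for five pages (not from the dataset).
toy_traffic = np.array([120.0, 950.0, 40.0, 15000.0, 3100.0])
toy_pagerank = np.array([0.004, 0.012, 0.002, 0.090, 0.015])

# Pearson correlation on the raw values; a few very large pages can
# dominate this number.
corr_raw = np.corrcoef(toy_traffic, toy_pagerank)[0, 1]

# Pearson correlation on log-transformed values, which corresponds to
# the linear trend visible on a log-log plot.
corr_log = np.corrcoef(np.log10(toy_traffic), np.log10(toy_pagerank))[0, 1]

print(f"raw: {corr_raw:.3f}, log-log: {corr_log:.3f}")
```

The two numbers answer slightly different questions, so reporting both can make the discussion of the scatter plot more precise.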

In [None]:
traffic["pagerank"] = traffic["node"].map(pagerank)

# SOLUTION-BEGIN
plt.figure(figsize=(7, 6))
plt.scatter(traffic.traffic, traffic.pagerank, alpha=0.7)
plt.xscale("log")
plt.yscale("log")
plt.xlabel("Wikipedia Traffic (log scale)")
plt.ylabel("PageRank (log scale)")
plt.title("Correlation between PageRank and Traffic")
plt.tight_layout()
plt.show()

corr_traffic = np.corrcoef(traffic.traffic, traffic.pagerank)[0, 1]
print(f"Correlation between traffic and pagerank: {corr_traffic:.3f}")

print("\nTop 10 pages by PageRank:")
print(traffic.nlargest(10, "pagerank")[["node", "pagerank", "traffic"]])
# SOLUTION-END