# SDSC3001 - Assignment 1

## Question 5

### Loading the Graph

In [1]:
def load_graph(file_path):
    edges = []
    with open(file_path, "r") as file:
        for line in file:
            if line.startswith("#"):
                continue  # Skip comment lines
            parts = line.strip().split()
            if len(parts) == 2:
                from_node, to_node = map(int, parts)
                edges.append((from_node, to_node))
    return edges

In [2]:
file_path = "com-dblp.txt"
graph_edges = load_graph(file_path)

In [3]:
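# Count the degree of every node: each undirected edge adds one to both endpoints.
node_degrees = {}
for from_node, to_node in graph_edges:
    node_degrees[from_node] = node_degrees.get(from_node, 0) + 1
    node_degrees[to_node] = node_degrees.get(to_node, 0) + 1

# The degrees sum to twice the number of edges, so dividing by the total
# turns them into a probability vector (the normalized degree vector).
total_degrees = sum(node_degrees.values())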

normalized_degrees = {
    node: degree / total_degrees for node, degree in node_degrees.items()
}
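
As a quick sanity check on the degree computation, the sketch below rebuilds a normalized degree vector for a small edge list using `collections.Counter`. The graph `toy_edges` is hypothetical and made up for illustration; it is not taken from the dataset. The check relies on the fact that every edge contributes to two degrees, so the degrees sum to twice the edge count and the normalized values sum to 1.

```python
from collections import Counter

# Hypothetical toy graph: a triangle 0-1-2 with a pendant node 3 attached to 2.
toy_edges = [(0, 1), (1, 2), (2, 0), (2, 3)]

# Each edge adds one to the degree of both endpoints.
toy_degrees = Counter()
for u, v in toy_edges:
    toy_degrees[u] += 1
    toy_degrees[v] += 1

toy_total = sum(toy_degrees.values())
toy_normalized = {node: deg / toy_total for node, deg in toy_degrees.items()}

print(toy_total == 2 * len(toy_edges))  # True
print(toy_normalized)  # {0: 0.25, 1: 0.25, 2: 0.375, 3: 0.125}
```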

### Simulate a random walk

In [4]:
import random


def simulate_random_walk(graph_edges, num_steps):
    neighbors = {}
    for from_node, to_node in graph_edges:
        if from_node not in neighbors:
            neighbors[from_node] = []
        if to_node not in neighbors:
            neighbors[to_node] = []
        neighbors[from_node].append(to_node)
        neighbors[to_node].append(from_node)

    current_node = random.choice(list(neighbors.keys()))
    visit_counts = {node: 0 for node in neighbors.keys()}

    for _ in range(num_steps):
        visit_counts[current_node] += 1
        current_node = random.choice(neighbors[current_node])

    return visit_counts


num_steps = 1000000  # Number of steps in the random walk
visit_counts = simulate_random_walk(graph_edges, num_steps)

In [5]:
def calculate_empirical_frequencies(visit_counts, num_steps):
    empirical_frequencies = {
        node: count / num_steps for node, count in visit_counts.items()
    }
    return empirical_frequencies


# Reuse num_steps and visit_counts from the previous cell instead of repeating the walk.
empirical_frequencies = calculate_empirical_frequencies(visit_counts, num_steps)

{0: 2.4e-05, 1: 4.4e-05, 2: 1.5e-05, 3: 2.4e-05, 4: 1.4e-05, 5: 1.8e-05, 6: 4.4e-05, 7: 2.5e-05, 8: 8e-06, 9: 2.6e-05, 10: 9e-06, 11: 2.5e-05, 12: 3e-06, 13: 8e-06, 14: 5e-06, 15: 2e-06, 16: 3e-06, 17: 0.0, 18: 0.0, 19: 2.1e-05, 20: 3e-05, 21: 2e-05, 22: 2.4e-05, 23: 1.1e-05, 24: 1.1e-05, 25: 1.4e-05, 26: 3.5e-05, 27: 1.6e-05, 28: 1.2e-05, 29: 4.3e-05, 30: 2.9e-05, 31: 1.3e-05, 32: 2e-06, 33: 2.9e-05, 34: 3.1e-05, 35: 5.4e-05, 36: 0.0, 37: 1.7e-05, 38: 2.6e-05, 39: 3.2e-05, 40: 1.7e-05, 41: 5e-06, 42: 5e-06, 43: 1.4e-05, 44: 8.7e-05, 45: 3e-06, 46: 2e-05, 47: 3.2e-05, 48: 7e-06, 49: 2e-06, 50: 2e-05, 51: 2.1e-05, 52: 1.8e-05, 53: 1.6e-05, 54: 8e-06, 55: 2e-05, 56: 4e-06, 57: 4e-06, 58: 3e-06, 59: 3e-06, 60: 3e-06, 61: 2.8e-05, 62: 2.1e-05, 63: 1.7e-05, 64: 5e-06, 65: 1e-06, 66: 1.6e-05, 67: 2.9e-05, 68: 3.5e-05, 69: 2e-06, 70: 1e-06, 71: 3e-06, 72: 1.7e-05, 73: 1.4e-05, 74: 6e-06, 75: 1.6e-05, 76: 7e-06, 77: 1.6e-05, 78: 3.4e-05, 79: 1.6e-05, 80: 4.1e-05, 81: 3e-06, 82: 2.6e-05, 83: 5e

In [6]:
import numpy as np


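def calculate_l1_distance(vector1, vector2):
    # Align the two dictionaries by node ID rather than by insertion order;
    # a node missing from one vector counts as 0 there.
    nodes = sorted(set(vector1) | set(vector2))
    values1 = np.array([vector1.get(node, 0.0) for node in nodes])
    values2 = np.array([vector2.get(node, 0.0) for node in nodes])
    return np.sum(np.abs(values1 - values2))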


# normalized_degrees (cell 3) and empirical_frequencies (cell 5) are already
# defined, so they are reused here rather than recomputed.

# Calculate the L1 distance between the normalized degree vector and the empirical frequency vector
l1_distance = calculate_l1_distance(normalized_degrees, empirical_frequencies)

print(f"L1 Distance: {l1_distance}")

L1 Distance: 0.5815182191822575


The $\ell_1$-distance between the normalized degree vector and the empirical frequency vector measures how closely the distribution of node visits observed during the random walk matches the theoretical distribution given by the normalized degrees. Because both vectors sum to 1, the distance ranges from 0 (identical distributions) to 2 (distributions with disjoint support), so the value of about 0.58 obtained above means the two vectors still differ substantially after 1,000,000 steps.

### Significance of the $\ell_1$-Distance

**Convergence to the stationary distribution**

- In the context of random walks on graphs, the stationary distribution describes the long-run fraction of time the walk spends at each node. For a connected, undirected graph it is proportional to the node degrees; non-bipartiteness additionally guarantees that the distribution of the walk's position at step $t$ converges to it.
- The normalized degree vector is this stationary distribution.
- The empirical frequency vector is the observed distribution of visits over the steps of one random walk.
- The $\ell_1$-distance quantifies the gap between the two, indicating how closely the walk's visit frequencies have approached the stationary distribution.
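
For reference, the claim that the stationary distribution is proportional to degree can be checked directly. The notation here is introduced for illustration and does not appear in the assignment text: $d(v)$ is the degree of node $v$, $m$ is the number of edges, and $P_{uv} = 1/d(u)$ is the probability of stepping from $u$ to a neighbor $v$. The candidate $\pi_v = d(v)/2m$ then satisfies $\pi P = \pi$:

$$
(\pi P)_v = \sum_{u \sim v} \pi_u P_{uv} = \sum_{u \sim v} \frac{d(u)}{2m} \cdot \frac{1}{d(u)} = \frac{d(v)}{2m} = \pi_v
$$

Here $u \sim v$ ranges over the neighbors of $v$ (assuming a simple graph, so there are exactly $d(v)$ of them), and $2m$ is the `total_degrees` normalizer. Writing $\hat{\pi}_v$ for the fraction of steps spent at $v$, the quantity computed in this notebook is $\lVert \hat{\pi} - \pi \rVert_1 = \sum_v |\hat{\pi}_v - \pi_v|$.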
**Measure of random walk quality**

- A smaller $\ell_1$-distance indicates that the visit frequencies closely approximate the stationary distribution, meaning the walk has effectively "mixed" and its samples are representative of the graph's structure.
- A larger $\ell_1$-distance suggests that the walk has not yet converged, either because it has not mixed or because it has taken too few steps relative to the number of nodes for the frequencies to be estimated accurately. In both cases more steps are needed.
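
The effect of walk length on the distance can be seen on a graph small enough to sample thoroughly. The sketch below is a minimal illustration, not the assignment's code: `random_walk_l1` and `toy_edges` are hypothetical names, the toy graph is made up, and the walk is seeded so that runs are repeatable.

```python
import random


def random_walk_l1(edges, num_steps, seed):
    """Return the L1 distance between visit frequencies and degree / (2m)."""
    rng = random.Random(seed)

    # Build an undirected adjacency list.
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, []).append(v)
        neighbors.setdefault(v, []).append(u)

    total_degree = sum(len(adj) for adj in neighbors.values())
    visits = dict.fromkeys(neighbors, 0)

    # Start at a random node and move to a uniformly chosen neighbor each step.
    current = rng.choice(sorted(neighbors))
    for _ in range(num_steps):
        visits[current] += 1
        current = rng.choice(neighbors[current])

    # Compare per-node visit frequency with normalized degree, keyed by node.
    return sum(
        abs(visits[node] / num_steps - len(adj) / total_degree)
        for node, adj in neighbors.items()
    )


# Hypothetical toy graph: a triangle 0-1-2 with a pendant node 3 attached to 2.
toy_edges = [(0, 1), (1, 2), (2, 0), (2, 3)]

for steps in (100, 10_000, 100_000):
    print(steps, random_walk_l1(toy_edges, steps, seed=0))
```

On a graph this small the printed distances should tend to shrink as the step count grows, although individual runs fluctuate.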
**Practical applications**

- In applications such as PageRank, recommendation systems, and network analysis, results computed from a random walk are only accurate once the walk has converged to the stationary distribution.
- The $\ell_1$-distance provides a way to check this convergence and to judge whether the walk has been run for enough steps.
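
One way to interpret a particular distance value, offered here as a heuristic rather than as part of the assignment, is to compare it with the distance obtained from the same number of independent draws from the degree distribution. If the two are of similar size, the remaining distance is mostly sampling noise from having few visits per node (the printed frequencies above already show nodes such as 17, 18, and 36 with no visits) rather than evidence of poor mixing. The helper `iid_baseline_l1` and the toy degree sequence are assumptions made for illustration.

```python
import numpy as np


def iid_baseline_l1(stationary, num_samples, seed=0):
    """L1 distance between `stationary` and the empirical frequencies of
    `num_samples` independent draws from it."""
    rng = np.random.default_rng(seed)
    stationary = np.asarray(stationary, dtype=float)
    draws = rng.choice(len(stationary), size=num_samples, p=stationary)
    counts = np.bincount(draws, minlength=len(stationary))
    return float(np.abs(counts / num_samples - stationary).sum())


# Hypothetical degree sequence for 1,000 nodes (degrees 1..10, repeated).
toy_degrees = np.tile(np.arange(1, 11), 100)
toy_stationary = toy_degrees / toy_degrees.sum()

# With about as many samples as nodes the baseline is far from zero;
# with many more samples per node it becomes small.
print(iid_baseline_l1(toy_stationary, num_samples=1_000))
print(iid_baseline_l1(toy_stationary, num_samples=1_000_000))
```

Under this heuristic, a walk whose distance is close to the independent-sample baseline for the same number of steps would need more steps per node, not a different starting point, to bring the distance down.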