###### Introduction to Network Analysis 2023/24 (xii)

## Random-walk sampling, network comparison

### I. Estimation by random-walk sampling

You are given five networks in Pajek format.

+ Java class dependency network ([java.net](http://lovro.fri.uni-lj.si/ina/nets/java.net))
+ *nec* overlay map of the Internet ([nec.net](http://lovro.fri.uni-lj.si/ina/nets/nec.net))
+ Sample of Facebook social network ([facebook.net](http://lovro.fri.uni-lj.si/ina/nets/facebook.net))
+ Enron e-mail communication network ([enron.net](http://lovro.fri.uni-lj.si/ina/nets/enron.net))
+ A small part of Google web graph ([www_google.net](http://lovro.fri.uni-lj.si/ina/nets/www_google.net))



1. **(code)** Represent the networks with simple undirected graphs and reduce them to their largest connected component.



In [None]:
from tqdm import tqdm
import networkx as nx
import utils


NET_NAMES = "cdn_java nec facebook enron www_google".split()

def LCC(G: nx.Graph) -> nx.Graph:
    lcc = max(nx.connected_components(G), key=len)
    return nx.convert_node_labels_to_integers(G.subgraph(lcc))

graphs = {name: LCC(nx.Graph(utils.read_pajek(name)))
          for name in tqdm(NET_NAMES, desc="reading networks")}

reading networks: 100%|██████████| 5/5 [01:25<00:00, 17.12s/it]


2. **(code)** Implement random-walk sampling and apply it to each network until you have sampled 15% of the nodes (counting repetitions). Let $s$ be the number of sampled nodes and $k_1,\dots,k_s$ their degree sequence. Estimate the average degree of the network $\langle k\rangle$ using the biased average $$\frac{\sum_ik_i}{s}$$ and the corrected estimate $$\frac{s}{\sum_ik_i^{-1}}.$$



In [None]:
import random
from typing import Tuple

def sample_degree_avg(G: nx.Graph, node_percent=15) -> Tuple[float, float]:
    """returns tuple (biased avg., corrected avg.)"""
    assert 0 < node_percent <= 100

    # random sampling optimization
    # (so we don't need to convert neighborhood generator to list every time)
    # relies on nodes being labelled 0..n-1 in order, as produced by LCC()
    adj_list = [list(G[i]) for i in G.nodes]

    # number of samples to draw, counting repeated visits
    s = max(1, round(len(G) * node_percent / 100))

    walker = random.randint(0, len(G) - 1)
    degree_sum = 0
    reciprocal_degree_sum = 0.0

    # each step contributes exactly one sample, so both sums have s terms
    for _ in range(s):
        degree_sum += G.degree(walker)
        reciprocal_degree_sum += 1 / G.degree(walker)
        walker = random.choice(adj_list[walker])

    return (degree_sum/s, s/reciprocal_degree_sum)


# TODO: for better results, collect the 15% from *several* runs
#       (e.g. run from 5 starting points until you have disjoint 3% samples)

for name, G in graphs.items():
    biased, corrected = sample_degree_avg(G)
    print(f"{name:<15}{biased:.3f}\t\t{corrected:.3f}")

| net   | $\langle k\rangle$   | biased  | corrected | k<sub>max</sub> |
|-------|---------|---------|------|-----------------|
| java  | 12.3    | 521.2   | 11.8 | 2166            |
| nec   | 9.42    | 1243.1  | 9.1  | 13346           |
| fb    | 25.77   | 89.6    | 28.1 | 1098            |
| enron | 7.05    | 169.0   | 7.7  | 1728            |
| www   | 10.03   | 165.9   | 9.6  | 6332            |
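
One way to reduce the dependence on a single starting point, along the lines of the TODO in the code above, is to pool samples from several independent walks. The following is only a minimal sketch under stated assumptions: it works on plain adjacency lists (for a networkx graph these could be built as `{v: list(G[v]) for v in G}`), the function name `pooled_walk_estimates`, its parameters and the toy graph are hypothetical, and the walks are not forced to be disjoint.

```python
import random
from typing import Dict, List, Tuple


def pooled_walk_estimates(adj: Dict[int, List[int]], total_samples: int,
                          n_walks: int = 5, seed: int = 0) -> Tuple[float, float]:
    """Pool degree samples from several independent random walks.

    Returns (biased average degree, corrected harmonic-mean estimate).
    Assumes every node has at least one neighbour.
    """
    if total_samples < 1 or n_walks < 1:
        raise ValueError("total_samples and n_walks must be positive")

    rng = random.Random(seed)  # seeded so that runs are reproducible
    nodes = list(adj)
    degree_sum = 0
    reciprocal_sum = 0.0

    for w in range(n_walks):
        # split the sample budget as evenly as possible between the walks
        budget = total_samples // n_walks + (1 if w < total_samples % n_walks else 0)
        walker = rng.choice(nodes)  # fresh uniformly random starting node
        for _ in range(budget):
            k = len(adj[walker])
            degree_sum += k
            reciprocal_sum += 1 / k
            walker = rng.choice(adj[walker])

    return degree_sum / total_samples, total_samples / reciprocal_sum


# Toy graph (hypothetical): hub 0 joined to 50 leaves that also form a ring.
toy_adj = {0: list(range(1, 51))}
for i in range(1, 51):
    toy_adj[i] = [0, (i - 2) % 50 + 1, i % 50 + 1]

true_avg = sum(len(nb) for nb in toy_adj.values()) / len(toy_adj)
biased, corrected = pooled_walk_estimates(toy_adj, total_samples=20_000)
print(f"true {true_avg:.2f}  biased {biased:.2f}  corrected {corrected:.2f}")
```

On this toy graph the hub makes the biased average overshoot, while the corrected estimate should stay near the true average degree. Whether pooling helps on the five networks above would need to be checked by repeating the experiment.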


3. **(discuss)** Compare both estimates to the true average degree $\langle k\rangle$.

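In the long run, a random walk on an undirected graph visits a node with probability proportional to its degree, so high-degree nodes are heavily over-represented in the sample.
That is why the biased estimate is far above the true $\langle k\rangle$ in every network, most dramatically where the hubs are very large (1243.1 vs. 9.42 for nec, where $k_{max}=13346$).
The corrected estimate weights each sample by $1/k_i$, which cancels this bias, and it lands close to the true value in all five networks (e.g. 11.8 vs. 12.3 for java, 28.1 vs. 25.77 for fb).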
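
The size of the bias can be made explicit with a short derivation. As a sketch, assume the walk has reached its stationary distribution $\pi_i = k_i/2m$ on a connected, non-bipartite undirected graph; the symbols $n$ (number of nodes), $m$ (number of edges) and $\langle k^2\rangle$ (mean squared degree) are introduced here and do not appear in the exercise text. Using $\langle k\rangle = 2m/n$:

$$\mathbb{E}_\pi[k] = \sum_i \frac{k_i}{2m}\,k_i = \frac{\langle k^2\rangle}{\langle k\rangle}, \qquad \mathbb{E}_\pi\!\left[k^{-1}\right] = \sum_i \frac{k_i}{2m}\cdot\frac{1}{k_i} = \frac{n}{2m} = \frac{1}{\langle k\rangle}.$$

Under these assumptions the biased average converges to $\langle k^2\rangle/\langle k\rangle \ge \langle k\rangle$, which grows with the degree variance, while $s/\sum_ik_i^{-1}$ converges to $\langle k\rangle$ itself.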