###### Introduction to Network Analysis 2023/24 (xii)

## Random-walk sampling, network comparison

### II. Sampling Facebook social network

You are given two large samples of the Facebook social network with around ten million nodes and links. Due to their size, the networks are available only in compressed edge list format.

+ 1st sample of Facebook network ([facebook_1.adj.zip](http://lovro.fri.uni-lj.si/ina/nets/facebook_1.adj.zip))
+ 2nd sample of Facebook network ([facebook_2.adj.zip](http://lovro.fri.uni-lj.si/ina/nets/facebook_2.adj.zip))



1. **(discuss)** The samples were generated by a uniform random node selection technique called _rejection sampling_ and by the breadth-first search approach called _snowball sampling_.



The first technique repeatedly starts a web crawl from a randomly drawn user ID and traverses that user's friendships; invalid IDs are *rejected*, hence *rejection sampling*.
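The idea can be illustrated on a toy graph. The sketch below is only an assumption about how such a crawl might look, not the procedure actually used to build the samples: the function name `rejection_sample`, the parameter `id_space_size`, and the choice to keep each accepted user's star of friendships are all hypothetical. It also assumes that node labels are integers inside `range(id_space_size)`.

```python
import random

import networkx as nx


def rejection_sample(G, id_space_size, n_users, seed=0):
    """Draw IDs uniformly from range(id_space_size) and reject invalid ones.

    Hypothetical helper: every accepted user contributes itself and its
    friendships (a star) to the sample.
    """
    if n_users > G.number_of_nodes():
        raise ValueError("cannot accept more users than there are valid IDs")
    rng = random.Random(seed)
    sample = nx.Graph()
    accepted = set()
    while len(accepted) < n_users:
        candidate = rng.randrange(id_space_size)
        # Reject IDs that do not belong to any user (or were already drawn).
        if candidate not in G or candidate in accepted:
            continue
        accepted.add(candidate)
        sample.add_node(candidate)
        sample.add_edges_from((candidate, v) for v in G.neighbors(candidate))
    return sample, accepted


# Toy network: only every third ID in range(150) is a valid user.
G = nx.Graph()
G.add_edges_from((3 * i, 3 * (i + 1)) for i in range(49))
sample, accepted = rejection_sample(G, id_space_size=150, n_users=5, seed=1)
```

Because every valid ID is equally likely to be drawn, the accepted users form a uniform random sample of nodes, regardless of their degree.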

Snowball sampling is a simple BFS, but ideally done from *several* starting points to get a good representative sample. In a network of this size, a single-source BFS would take up too much space before reaching a decent proportion of nodes anyway. Note that we can get a **disconnected** sample, even if the underlying network is connected.
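A minimal sketch of multi-source snowball sampling follows, assuming the sample is the subgraph induced by the visited nodes and that the crawl stops at a fixed node budget. The name `snowball_sample` and the parameter `max_nodes` are hypothetical. The toy example also shows how a connected network can yield a disconnected sample when the BFS fronts of different seeds never meet.

```python
from collections import deque

import networkx as nx


def snowball_sample(G, seeds, max_nodes):
    """Multi-source BFS that stops after visiting max_nodes nodes.

    Returns the subgraph of G induced by the visited nodes.
    """
    visited = set(seeds)
    queue = deque(seeds)
    while queue and len(visited) < max_nodes:
        u = queue.popleft()
        for v in G.neighbors(u):
            if len(visited) >= max_nodes:
                break  # node budget exhausted
            if v not in visited:
                visited.add(v)
                queue.append(v)
    return G.subgraph(visited).copy()


# A connected path 0-1-...-99, sampled from both ends with a small budget.
G = nx.path_graph(100)
S = snowball_sample(G, seeds=[0, 99], max_nodes=10)
n_components = nx.number_connected_components(S)
```

Here the two BFS fronts stay far apart, so `S` consists of two separate path segments even though `G` itself is connected.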

2. **(homework)** Try to figure out which network sample is which. Since these are still very tiny samples of the Facebook social network, the answer might not be immediately obvious from their structure.

```python
import utils
import networkx as nx

fb_samples = [
    utils.read_edgelist(f"facebook_{i}.adj", progress_bar=True)
    for i in [1, 2]
]
```

```
reading facebook_1: 100%|██████████| 12582911/12582911 [00:11<00:00, 1058785.53it/s]
reading facebook_2: 100%|██████████| 7839215/7839215 [00:06<00:00, 1131309.02it/s]
```


```python
for fb in fb_samples:
    utils.info(fb, clustering_sample=1_000_000)
```

```
       Graph | 'facebook_1'
       Nodes | 8,217,272 (iso=0)
       Edges | 12,582,912 (loop=0)
      Degree | 3.06 (max=518)
         LCC | 100.0% (n=67)
  Clustering | 0.0188

       Graph | 'facebook_2'
       Nodes | 7,698,354 (iso=0)
       Edges | 7,839,141 (loop=0)
      Degree | 2.04 (max=402)
         LCC | 94.1% (n=11,469)
  Clustering | 0.0015
```



Crawling stops at any user who has disabled friend-list crawling, hence the very low $\langle k\rangle$ and the disconnected samples.
The average degree of the full network should be around 200, but the samples give only 2 or 3, which suggests that most users do not allow crawling.
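As a quick sanity check, the reported average degrees follow from $\langle k\rangle = 2m/n$ applied to the node and edge counts printed above. The snippet only redoes this arithmetic; the dictionary `stats` is a hypothetical container for the numbers copied from the output.

```python
# (nodes n, edges m) copied from the utils.info output above
stats = {
    "facebook_1": (8_217_272, 12_582_912),
    "facebook_2": (7_698_354, 7_839_141),
}

# Average degree of an undirected graph: <k> = 2m / n
mean_degree = {name: 2 * m / n for name, (n, m) in stats.items()}

for name, k in mean_degree.items():
    print(f"{name}: <k> = {k:.2f}")
```

For comparison, a forest always has $\langle k\rangle < 2$, so a value of 2.04 means `facebook_2` contains only slightly more edges than a tree on the same nodes would.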

The `facebook_2` sample most likely comes from rejection sampling (*RS*): it has the lower $\langle k\rangle$ because of its many low-degree spokes (a center node with its neighbors). These isolated groups produce many more connected components than in `facebook_1`, whose LCC covers 100.0% of the nodes.
We also see significantly higher clustering in `facebook_1` (0.0188 versus 0.0015), as expected from BFS snowball sampling, whereas *RS* can yield spokes with zero clustering.
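The contrast can be reproduced on toy graphs. The sketch below is illustrative only and is not the analysis performed on the real samples: `sample_fingerprint` is a hypothetical helper, the disjoint stars stand in for *RS* spokes, and a radius-1 neighborhood in the karate club network stands in for a small snowball. It computes exact average clustering, which is feasible only on small graphs (the `utils.info` call above is given a `clustering_sample` argument instead).

```python
import networkx as nx


def sample_fingerprint(G):
    """Collect the statistics used above to tell the two samples apart."""
    n = G.number_of_nodes()
    leaves = sum(1 for _, k in G.degree() if k == 1)
    lcc = max(nx.connected_components(G), key=len)
    return {
        "mean_degree": 2 * G.number_of_edges() / n,
        "leaf_fraction": leaves / n,
        "components": nx.number_connected_components(G),
        "lcc_fraction": len(lcc) / n,
        "clustering": nx.average_clustering(G),
    }


# Stand-in for RS: four isolated spokes, each a center with five leaves.
spokes = nx.disjoint_union_all([nx.star_graph(5) for _ in range(4)])

# Stand-in for snowball sampling: node 0 and its neighbors, with all
# links among them (one BFS step from a single seed).
snowball = nx.ego_graph(nx.karate_club_graph(), 0, radius=1)

fp_spokes = sample_fingerprint(spokes)
fp_snowball = sample_fingerprint(snowball)
```

The spokes give several components, a large share of degree-1 nodes, and zero clustering, while the snowball stand-in is a single component with nonzero clustering, which matches the qualitative differences between `facebook_2` and `facebook_1`.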