# Network sampling

In [None]:
import numpy as np
import networkx as nx
import matplotlib.pyplot as plt

rng = np.random.default_rng()

## Introduction
Many network data sets are samples of an underlying network that we are actually interested in. That is, only some of the nodes and edges have been sampled, and we observe just part of the network. This can severely bias even the simplest network measures, so that the sampled network has quantitatively different properties from the underlying network.

In this exercise, we will implement three different sampling schemes and investigate the properties of the resulting networks.



### a) Generate a network with communities (1 pt)

We will first generate the original network from which we will randomly sample nodes and edges with three different schemes. We will use a network with some structure of interest: communities and a high level of clustering. We will then see how well these are retained in the sampled networks. 

Use the function `nx.relaxed_caveman_graph` to generate a network with 12 communities of 20 people each, in which each link is then randomly rewired with probability $0.1$. This is the original underlying network from which we obtain samples. Then visualize the network and answer the MyCourses quiz.
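To get a feel for what the generator produces, the minimal sketch below builds a tiny caveman graph with the rewiring probability set to zero, assuming the argument order `(l, k, p, seed)` of `nx.relaxed_caveman_graph`. The name `toy` and the sizes 3 and 4 are illustrative only and are not part of the exercise.

```python
import networkx as nx

# Toy example (illustrative sizes, not the exercise's 12 x 20 network):
# 3 cliques of 4 nodes each; with p=0.0 no edge is rewired.
toy = nx.relaxed_caveman_graph(3, 4, 0.0, seed=0)

n_nodes = toy.number_of_nodes()
n_edges = toy.number_of_edges()          # 3 cliques x 6 edges per 4-clique
n_components = nx.number_connected_components(toy)

print(n_nodes, n_edges, n_components)    # prints: 12 18 3
```

Without rewiring the cliques stay disconnected. A positive rewiring probability redirects some links to randomly chosen nodes, which is what ties the communities together.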

In [None]:
# Generate network
g = None
# YOUR CODE HERE
g = nx.relaxed_caveman_graph(12, 20, 0.1, seed=rng)
# Remove the self-loops that the relaxed caveman generator sometimes creates, as these can cause problems later.
g.remove_edges_from(nx.selfloop_edges(g))

In [None]:
fig_viz, ax_viz = plt.subplots(figsize=(8, 8))
pos = nx.spring_layout(g)  # computes the drawing coordinates for nodes; to be reused below.
nx.draw(g, ax=ax_viz, pos=pos, node_size=10, width=.1)

In [None]:
figure_fname = 'sampling_original_graph.pdf' # run this block if you want to save your figure
path='./' # replace with your own path
fig_viz.savefig(path+figure_fname)

### b) Three sampling schemes (3 pts)
Let us now implement the three sampling schemes. **Program** three functions that perform:
1. Bernoulli sampling of nodes: iterate over nodes and include each node in the sample with probability $p$. We observe an edge if and only if we have sampled both of its endpoints.
2. Bernoulli sampling of edges: iterate over edges and include each edge in the sample with probability $p$. We observe a node if and only if we have sampled at least one of its edges.
3. Star sampling: iterate over nodes and include each node in the sample with probability $p$, together with all its neighbours. A real-life example would be a data set obtained by crawling through the friendship lists of randomly selected users of a social networking website.

After you're done, use the code block below to sample the network using all three schemes with a sampling probability of $p=0.2$ and print a table of the number of nodes $N$, number of edges $E$, average degree $\langle k\rangle$, and average clustering coefficient $\langle c\rangle$ of the sampled networks.
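As a rough sanity check for the table, the expected sizes of each sample can be worked out from $p$ and the degree sequence. The sketch below is one way to do that, assuming a simple undirected graph without self-loops; the helper name `expected_sizes` and the toy path graph are illustrative and not part of the exercise.

```python
import networkx as nx

def expected_sizes(g, p):
    """Expected (nodes, edges) per scheme; assumes a simple graph without self-loops."""
    n, e = g.number_of_nodes(), g.number_of_edges()
    q = 1.0 - p  # probability that a given node or edge is NOT sampled
    degrees = [k for _, k in g.degree()]
    return {
        # node kept with prob. p; edge kept only if both endpoints are kept
        'node': (p * n, p**2 * e),
        # edge kept with prob. p; node seen if at least one of its k edges is kept
        'edge': (sum(1 - q**k for k in degrees), p * e),
        # edge seen if at least one endpoint is sampled;
        # node seen if it or at least one of its k neighbours is sampled
        'star': (sum(1 - q**(k + 1) for k in degrees), (1 - q**2) * e),
    }

toy = nx.path_graph(3)  # illustrative graph: 0 - 1 - 2
sizes = expected_sizes(toy, 0.5)
print(sizes)
```

Calling `expected_sizes(g, 0.2)` on the original network should give values close to the node and edge counts in the table, up to sampling noise. Average clustering has no equally simple closed form here, so it is left out.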


In [None]:
def sample_nodes(g, p):
    """
    Sample a graph via Bernoulli node sampling.
    For each node in g, sample it with probability p, and add edge (i, j) 
    only if both nodes i and j have been sampled.

    Parameters
    ----------------
    g: a networkx graph object
    p: sampling probability for each node
    """

    # Initialize empty network
    g_new = nx.Graph()

    # TODO: Write code for sampling. 
    # Iterate over nodes, and add to g_new with probability p. 
    # Add edges if both nodes in an edge have been observed.

    # YOUR CODE HERE
    for node in g.nodes():
        if rng.random() < p:
            g_new.add_node(node)

    for u, v in g.edges():
        if u in g_new and v in g_new:
            g_new.add_edge(u, v)

    return g_new

In [None]:
def sample_edges(g, p):
    """
    Sample a graph via Bernoulli edge sampling.
    For each edge in g, sample it with probability p. A node is observed
    only if at least one of its edges has been sampled.

    Parameters
    ----------------
    g: a networkx graph object
    p: sampling probability for each edge
    """

    # Initialize empty network
    g_new = nx.Graph()

    # TODO: Write code for sampling. 
    # Iterate over edges, and add to g_new with probability p.
    
    # YOUR CODE HERE
    for edge in g.edges():
        if rng.random() < p:
            g_new.add_edge(edge[0], edge[1])

    return g_new

In [None]:
def sample_stars(g, p):
    """
    Sample a graph via star sampling.
    We sample nodes with probability p, and observe all neighbors. 
    Returns a g_new network obtained via star sampling.
    
    Parameters
    ----------------
    g: a networkx graph object
    p: sampling probability of sampling a node (and observing its neighbors)
    """

    # Initialize empty network
    g_new = nx.Graph()

    # TODO: Write code for sampling. 
    # Iterate over nodes, and add to g_new with probability p.
    # If a node has been observed, then add all the node's neighbors to g_new as well.

    # YOUR CODE HERE
    for node in g.nodes():
        if rng.random() < p:
            # Add the node explicitly so that it is kept even if it has no neighbors
            g_new.add_node(node)

            # add_edge also adds each neighbor as a node of g_new
            for neighbor in g.neighbors(node):
                g_new.add_edge(node, neighbor)

    return g_new

In [None]:
# Auxiliary function, no need to touch this

def print_properties(g,scheme=' '):

    N=len(g)
    E=g.number_of_edges()
    k=2*E/N
    c=nx.average_clustering(g)

    printN='{:7d}'.format(N).rjust(8)
    printE='{:7d}'.format(E).rjust(8)
    printk='{:1.2f}'.format(k).rjust(8)
    printc='{:1.4f}'.format(c).rjust(7)

    printme=scheme.ljust(6)+'|'+printN+'|'+printE+'|'+printk+'|'+printc

    print(printme)


In [None]:
# First complete the sampling functions above and then run this to get a table of network characteristics

sampling_probability=0.2

g_nodes=sample_nodes(g,sampling_probability)
g_edges=sample_edges(g,sampling_probability)
g_stars=sample_stars(g,sampling_probability)

print('samp. |  nodes | edges  | avg deg| avg c')
print('-----------------------------------------')

print_properties(g,scheme='orig.')
print_properties(g_nodes,scheme='node')
print_properties(g_edges,scheme='edge')
print_properties(g_stars,scheme='star')


### c) Community structure in the sampled networks, visually (1 pt)

Next, visualize the three samples on top of the original network with the code below. Judging by eye, which two sampling methods best map out the underlying community structure? 

In [None]:
# Auxiliary function to be used below, no need to touch this.

def node_colors(g,g_new):
    '''Returns a list of colors, one per node of g: red if the node is also in g_new, gray otherwise'''

    colors=[]

    for node in g:
        if node in g_new:
            colors.append('red')
        else:
            colors.append('gray')

    return colors

In [None]:
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(3*8, 8))  # one panel per sampling scheme

for ax in (ax1, ax2, ax3):
    ax.axis('off')

nx.draw(g, ax=ax1, pos=pos,node_size=5, node_color='gray', width=.1)
nx.draw(g_nodes,ax=ax1,pos=pos,node_size=5,node_color='red',edge_color='red')
ax1.set_title('Sampling nodes')

nx.draw(g, ax=ax2, pos=pos,node_size=5, node_color='gray', width=.1)
nx.draw(g_edges,ax=ax2,pos=pos,node_size=5,node_color='red',edge_color='red')
ax2.set_title('Sampling edges')

nx.draw(g, ax=ax3, pos=pos,node_size=5, node_color='gray', width=.1)
nx.draw(g_stars,ax=ax3,pos=pos,node_size=5,node_color='red',edge_color='red')
ax3.set_title('Sampling stars');

In [None]:
figure_fname = 'sampling_three_schemes.pdf' # run this block if you want to save your figure
path='./' # replace with your own path
fig.savefig(path+figure_fname)

### d) Detecting community structure in the sampled networks (1 pt)

Let's now check what a community detection algorithm will detect in the samples. Use the label propagation algorithm `nx.community.label_propagation_communities` (see exercise round 5) to detect communities in the original network and all three sampled networks. Print the number of communities in the original and sampled networks. A far larger number of communities is typically detected in one of the samples than in the rest. Which one? 

(If the above isn't clear, run the code a few times and you'll spot it; there is some randomness at play, as always.)

_Note: your original network has 12 built-in communities, but the label propagation algorithm may detect fewer. If so, don't worry; take it as a lesson in the murky and messy nature of community detection: results may vary, even though they shouldn't!_
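One thing that may help when interpreting the counts: labels spread only along edges, so a detected community can never span two connected components. The number of connected components is therefore a lower bound on the number of communities. The minimal sketch below illustrates this on a hypothetical toy graph (two disjoint 5-cliques plus one isolated node); the names `toy`, `n_communities` and `n_components` are illustrative.

```python
import networkx as nx

# Toy graph (illustrative): two disjoint 5-cliques (nodes 0-4 and 5-9)
# plus one isolated node (node 10).
toy = nx.disjoint_union(nx.complete_graph(5), nx.complete_graph(5))
toy.add_node(10)

communities = list(nx.community.label_propagation_communities(toy))
n_communities = len(communities)
n_components = nx.number_connected_components(toy)

# Each clique collapses to a single label and the isolated node keeps its own.
print(n_communities, n_components)
```

Printing `nx.number_connected_components` next to each community count for the sampled networks shows how much of a large count is simply due to the sample being fragmented.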


In [None]:

original_communities=0 # Write code below that assigns to this variable the number of communities detected in the original network
node_communities=0 # etc...
edge_communities=0
star_communities=0

# YOUR CODE HERE
original_communities = len(list(nx.community.label_propagation_communities(g)))
node_communities = len(list(nx.community.label_propagation_communities(g_nodes)))
edge_communities = len(list(nx.community.label_propagation_communities(g_edges)))
star_communities = len(list(nx.community.label_propagation_communities(g_stars)))

print(f"Original: {original_communities} communities")
print(f"Nodes: {node_communities} communities")
print(f"Edges: {edge_communities} communities")
print(f"Star: {star_communities} communities")

