# Community detection

In this practical we will try out some tools to discover community structure in networks.

Some of the libraries are difficult to install on some systems, so it is best to open this notebook in Colab: [![Google Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/jgarciab/NetworkScience/blob/main/Practicals/day3a_community_detection/Community_detection_2024.ipynb)

To start, we need to install [graph-tool](https://graph-tool.skewed.de/) into the Colab environment.

In [None]:
%%capture
# graph-tool is not available in the standard Colab environment, so we must install it first.
# (%%capture is a cell magic and has to be the first line of the cell.)
!wget https://downloads.skewed.de/skewed-keyring/skewed-keyring_1.0_all_$(lsb_release -s -c).deb
!dpkg -i skewed-keyring_1.0_all_$(lsb_release -s -c).deb
!echo "deb [signed-by=/usr/share/keyrings/skewed-keyring.gpg] https://downloads.skewed.de/apt $(lsb_release -s -c) main" > /etc/apt/sources.list.d/skewed.list
!apt-get update
!apt-get install python3-graph-tool python3-matplotlib python3-cairo

# Colab uses a Python install that deviates from the system's! Bad colab! We need some workarounds.
!apt purge python3-cairo
!apt install libcairo2-dev pkg-config python3-dev
!pip install --force-reinstall pycairo
!pip install zstandard

In [None]:
# the following gives access to your google drive so that you can load and
# save files
from google.colab import drive
drive.mount('/drive')

In [None]:
# import libraries
import graph_tool.all as gt
from graph_tool import topology, inference, generation, stats, correlations
import matplotlib.pyplot as plt
import matplotlib as mpl
import numpy as np

## Detecting communities with modularity and the SBM


### Q1.

Calculate the modularity of the 9-node network and 2-group partition given below. Show your working.


![https://github.com/piratepeel/piratepeel.github.io/raw/master/teaching/network9nodes.png](https://github.com/piratepeel/piratepeel.github.io/raw/master/old/teaching/network9nodes.png)
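
For reference, the modularity of a partition can be written as $Q = \sum_c \left[ \frac{e_c}{m} - \left( \frac{d_c}{2m} \right)^2 \right]$, where $m$ is the total number of edges, $e_c$ is the number of edges inside group $c$ and $d_c$ is the sum of the degrees of the nodes in group $c$. The sketch below evaluates this formula on a hypothetical toy graph (two triangles joined by a single edge, not the network in the figure). The function name `modularity` and the toy data are illustrative assumptions; you could use the same helper to check your hand calculation.

```python
from collections import defaultdict

def modularity(edges, membership):
    """Modularity of an undirected simple graph given as an edge list.

    membership maps each node to its group label.
    """
    m = len(edges)
    internal = defaultdict(int)    # e_c: edges with both ends in group c
    degree_sum = defaultdict(int)  # d_c: total degree of the nodes in group c
    for u, v in edges:
        degree_sum[membership[u]] += 1
        degree_sum[membership[v]] += 1
        if membership[u] == membership[v]:
            internal[membership[u]] += 1
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
               for c in degree_sum)

# Hypothetical toy example: two triangles connected by the edge (2, 3)
toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
toy_groups = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
Q_toy = modularity(toy_edges, toy_groups)
print(Q_toy)  # 2 * (3/7 - (7/14)**2) = 5/14
```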

### Q2.

(a) The function `create_graph_k_cliques(k)` below generates a network with $k$ weakly connected cliques of size 10.

```python
def create_graph_k_cliques(k, clique_size=10):
  g = gt.Graph(directed=False)
  lastnode = 0
  for clique in range(k):
    for i in range(lastnode, lastnode + clique_size):
      for j in range(i+1, lastnode + clique_size):
        g.add_edge(i, j)
    lastnode += clique_size
    if clique < k - 1:
      g.add_edge(i, lastnode)
    else:
      g.add_edge(i, 0)
  return g
```

Use the function to generate a graph with $k=10$ cliques.
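
As a sanity check, for $k \geq 2$ the graph should have $10k$ nodes and $45k + k$ edges: 45 edges inside each clique plus one bridge per clique, with the last bridge closing the ring. The sketch below mirrors the loop structure of `create_graph_k_cliques` as a plain edge list, without graph-tool. `k_clique_edges` is a hypothetical helper introduced only for this check.

```python
def k_clique_edges(k, clique_size=10):
    """Edge list with the same structure as create_graph_k_cliques(k)."""
    n_nodes = k * clique_size
    edges = []
    for c in range(k):
        start = c * clique_size
        end = start + clique_size
        # all pairs inside the clique
        for i in range(start, end):
            for j in range(i + 1, end):
                edges.append((i, j))
        # bridge from the last node of this clique to the first node of the
        # next one; the final clique wraps around to node 0
        edges.append((end - 1, end % n_nodes))
    return edges

edges_10 = k_clique_edges(10)
nodes_10 = {v for e in edges_10 for v in e}
print(len(nodes_10), len(edges_10))  # 100 nodes, 460 edges
```

You can compare these counts with `g.num_vertices()` and `g.num_edges()` on the graph-tool graph.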



(b) Use the previous function to generate networks with $k \in \{2, 3, \ldots, 150\}$ cliques of size 10.

Use modularity maximisation to detect communities in each of the networks, and compare the number of communities it retrieves with the true number of cliques in the network.

What do you observe? Can you explain why this happens?

_Notes:_

In graph-tool you can find the maximum-modularity partition of a graph `g` as follows:
```python
state = gt.minimize_blockmodel_dl(g, state=inference.ModularityState)
modularity_g = state.modularity()
print('Maximised modularity:', modularity_g)
print('Number of groups:', state.get_B())
```
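
One way to reason about what you observe is to compare, by hand, the modularity of two candidate partitions of the ring of cliques: one group per clique, and one group per pair of adjacent cliques. The sketch below does this arithmetic exactly, using $Q = \sum_c \left[ e_c/m - (d_c/2m)^2 \right]$. It assumes even $k \geq 4$ and the edge counts produced by `create_graph_k_cliques` (45 edges per clique and one bridge between neighbouring cliques). The function names are illustrative, and this compares only two hand-picked partitions, not everything an optimiser might find.

```python
from fractions import Fraction

def q_one_group_per_clique(k, clique_size=10):
    e_in = clique_size * (clique_size - 1) // 2  # edges inside one clique
    m = k * (e_in + 1)                           # cliques plus one bridge each
    d = 2 * e_in + 2                             # degree sum of one clique
    return k * (Fraction(e_in, m) - Fraction(d, 2 * m) ** 2)

def q_one_group_per_pair(k, clique_size=10):
    e_in = clique_size * (clique_size - 1) // 2
    m = k * (e_in + 1)
    e_pair = 2 * e_in + 1                        # two cliques and their bridge
    d_pair = 2 * (2 * e_in + 2)                  # degree sum of the pair
    return (k // 2) * (Fraction(e_pair, m) - Fraction(d_pair, 2 * m) ** 2)

# smallest even k for which merging adjacent cliques gives a higher modularity
first_k = next(k for k in range(4, 151, 2)
               if q_one_group_per_pair(k) > q_one_group_per_clique(k))
print(first_k)  # 94
```

For cliques of size 10 the two expressions simplify to $45/46 - 1/k$ and $91/92 - 2/k$, so the merged partition scores higher once $k > 92$.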



### Q3. Community detection

#### (a) Network One

The first network we will generate is a simple two-parameter SBM (often referred to as the *planted partition* model), in which nodes in the same community connect with higher probability than nodes from different communities.

In [None]:
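def prob(a, b):
    # edges within a block are far more likely than edges between blocks
    if a == b:
        return 0.9
    else:
        return 0.01

# 1000 nodes, Poisson(10) degree sampler, 10 blocks assigned uniformly at random.
# bm is a vertex property map holding the planted block of each node.
g1, bm = generation.random_graph(1000, lambda: np.random.poisson(10),
                                 directed=False,
                                 model="blockmodel",
                                 block_membership=lambda: np.random.randint(10),
                                 edge_probs=prob)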


Now we can draw the graph with the force-directed layout and colour the nodes according to the community they belong to. For this network, it should be relatively easy to see the communities in this visualisation.

In [None]:
pos = gt.arf_layout(g1)
gt.graph_draw(g1, pos=pos, vertex_fill_color=bm)

Of course, when we encounter networks in the wild, we don't know a priori how the nodes are assigned to communities. We can use graph-tool to try to find the partition with the maximum modularity...

In [None]:
state = gt.minimize_blockmodel_dl(g1, state=inference.ModularityState)
modularity_g = state.modularity()
print('Maximised modularity:', modularity_g)
print('Number of groups:', state.get_B())
# reuse the layout positions computed in the previous cell
state.draw(pos=pos)

Maximising modularity seems to work well!

However, we can also use statistical inference with the SBM to recover the planted communities...

In [None]:
state = gt.minimize_blockmodel_dl(g1, multilevel_mcmc_args={'niter' : 5})
print('Number of groups:', state.get_nonempty_B())
state.draw(pos=pos)

Both methods seem to work well. So which should we use?

#### (b) Network Two

Let's try another example...

In [None]:
g2 = generation.random_graph(1000, lambda: np.random.poisson(10), directed=False)

state = gt.minimize_blockmodel_dl(g2, state=inference.ModularityState)
modularity_g2 = state.modularity()
print('Maximised modularity:', modularity_g2)
print('Number of groups:', state.get_B())
pos = gt.arf_layout(g2)
state.draw(pos=pos)

**What do you notice here? Is this what you would expect?**

**Are these communities meaningful?**

To explore this question further, we can perform a null hypothesis test in the same way that we did in the previous session.

In [None]:
g_rand = g2.copy()

n_samples = 1000
modularity_values = np.empty(n_samples)
for i in range(n_samples):
  generation.random_rewire(g_rand)
  state = gt.minimize_blockmodel_dl(g_rand, state=inference.ModularityState)
  modularity_values[i] = state.modularity()

In [None]:
plt.hist(modularity_values, bins=50)
plt.axvline(modularity_g2, color='black')
_ = plt.xlabel('p-value = {}'.format(sum(modularity_g2 < modularity_values) / n_samples))
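
The p-value shown on the x-axis is the fraction of rewired networks whose maximised modularity exceeds the observed one. Here is a minimal sketch of that calculation on hypothetical numbers (not results from this notebook). The helper name `empirical_p_value` is an assumption.

```python
def empirical_p_value(observed, null_values):
    """Fraction of null samples strictly greater than the observed value."""
    exceed = sum(1 for x in null_values if x > observed)
    return exceed / len(null_values)

# hypothetical modularity values from 10 rewired networks
null_q = [0.30, 0.31, 0.29, 0.32, 0.28, 0.33, 0.30, 0.31, 0.29, 0.30]
p_val = empirical_p_value(0.315, null_q)
print(p_val)  # 2 of the 10 null values exceed 0.315, so 0.2
```

A large p-value means the observed modularity is typical of the randomised networks, so it is not evidence of community structure on its own.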

**What do these results tell you?**

**What happens when you use the SBM to infer the communities?**

In [None]:
state = gt.minimize_blockmodel_dl(g2, multilevel_mcmc_args={'niter' : 5})
print('Number of groups:', state.get_nonempty_B())
state.draw(pos=pos)


#### (c) Network Three

Now let's try a slight modification of the previous network...

In [None]:
g3 = generation.random_graph(1000, lambda: np.random.poisson(10), directed=False)

clique_size = 25
for i in range(clique_size):
  for j in range(i+1, clique_size):
    g3.add_edge(i, j, False)


**What is the modification?**

Try using modularity maximisation to detect communities...

Are these results what you would expect?

Now try the null hypothesis test again...

**How can you interpret this result?**

**Now try using the SBM... What do you notice?**

### Q4. Degree-correction

We can think of the SBM as a collection of Erdős–Rényi (ER) random graphs. However, this means that each block has a homogeneous degree distribution.

To allow for heterogeneous degree distributions, we can modify the SBM by modulating the connection probabilities to capture a given degree sequence, in the same way as when we create the configuration model. We call this model the degree-corrected block model.

Here we will see how we can choose between the two models using the minimum description length.

In [None]:
g_polblogs = gt.collection.data['polblogs']
g_polblogs = topology.extract_largest_component(g_polblogs, prune=True)
gt.graph_draw(g_polblogs, vertex_fill_color=g_polblogs.vp['value'], pos=g_polblogs.vp['pos'])

In [None]:
state = gt.minimize_blockmodel_dl(g_polblogs, state_args={'deg_corr': False}, multilevel_mcmc_args={'niter' : 5, 'B_max' : 2})
print('Entropy:', state.entropy())
print('Number of groups', state.get_nonempty_B())
state.draw(pos=g_polblogs.vp['pos'])


In [None]:
state = gt.minimize_blockmodel_dl(g_polblogs, state_args={'deg_corr': True}, multilevel_mcmc_args={'niter' : 5, 'B_max' : 2})
print('Entropy:', state.entropy())
print('Number of groups', state.get_nonempty_B())
state.draw(pos=g_polblogs.vp['pos'])
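
The two entropies printed above are description lengths in nats, and the model with the smaller value gives the more compressed description of the data. The difference between the two values can be read as a log posterior odds ratio in favour of the preferred model. Here is a minimal sketch with hypothetical numbers (not outputs of the cells above). `compare_description_lengths` is an illustrative helper.

```python
import math

def compare_description_lengths(dl_a, dl_b):
    """Return which model ('a' or 'b') is preferred and the log odds in nats."""
    preferred = 'a' if dl_a < dl_b else 'b'
    log_odds = abs(dl_a - dl_b)
    return preferred, log_odds

# hypothetical entropies: a = without degree correction, b = with degree correction
dl_ndc, dl_dc = 12000.0, 11500.0
preferred, log_odds = compare_description_lengths(dl_ndc, dl_dc)
print('Preferred model:', preferred)
print('Log odds (nats):', log_odds)
print('Log10 odds:', log_odds / math.log(10))
```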

### Q5. Hierarchical communities

Here is an example using the network of face-to-face contacts in a high school...


In [None]:
!wget https://networks.skewed.de/net/sp_high_school/files/proximity.gt.zst
g = gt.load_graph("proximity.gt.zst")

In [None]:
# class labels are stored in the following vertex property
set(g.vp['class'])

In [None]:

# colour the nodes by the class they attend
subjects = list(set(g.vp['class']))
subjects.sort()
scmap = plt.get_cmap('tab20c')

subjectcolormap = dict(zip(subjects, scmap(list(range(len(subjects))))))

vertex_color = g.new_vertex_property("vector<float>")
for v in g.vertices():
  vertex_color[v] = subjectcolormap[g.vp['class'][v]]
g.vp.vertex_color = vertex_color

# plot the network
gt.graph_draw(g, vertex_fill_color=g.vp['vertex_color'])

# create legend
fig = plt.figure(figsize=(1,1))

for subject in subjects:
  plt.plot([1], [1], label=subject, marker='s', color=subjectcolormap[subject])

lg = plt.legend(fontsize='xx-large', markerscale=3, ncol=3)
out = plt.axis('off')

In [None]:
# fit hierarchical blockmodel
state = gt.minimize_nested_blockmodel_dl(g)


# plot the hierarchy
state.draw(vertex_fill_color=g.vp['vertex_color'])
state.print_summary()

# create legend
fig = plt.figure(figsize=(1,1))

for subject in subjects:
  plt.plot([1], [1], label=subject, marker='s', color=subjectcolormap[subject])

lg = plt.legend(fontsize='xx-large', markerscale=3, ncol=3)
out = plt.axis('off')

print('Entropy:', state.entropy())
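
To go beyond visual inspection, you can tabulate how the class labels are distributed over the inferred groups. The sketch below uses hypothetical label lists rather than the fitted model. With graph-tool, the lowest-level group of each vertex could presumably be read from `state.get_levels()[0].get_blocks()`, which is stated here as an assumption about how you would extract it. The `purity` helper is illustrative.

```python
from collections import Counter, defaultdict

def purity(blocks, classes):
    """Fraction of nodes whose class is the majority class of their block."""
    table = Counter(zip(blocks, classes))  # (block, class) -> count
    best = defaultdict(int)                # block -> size of its majority class
    for (b, c), n in table.items():
        best[b] = max(best[b], n)
    return sum(best.values()) / len(blocks)

# hypothetical inferred blocks and class labels for 8 nodes
blocks = [0, 0, 0, 1, 1, 2, 2, 2]
classes = ['A', 'A', 'B', 'B', 'B', 'C', 'C', 'C']
score = purity(blocks, classes)
print(score)  # (2 + 2 + 3) / 8 = 0.875
```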