# Stochastic block model

The stochastic block model is a random graph model in which every vertex is assigned to a group. Edges are placed between pairs of vertices independently of the placement of other edges. The probability that an edge is placed between two given vertices depends only on the groups they have been assigned to. This random graph model is often used to mimic social networks in which people are subdivided into groups, with a lot of communication between people within the same group but not much between people from different groups.

Mathematically, the model is defined through a vector $\vec{n} = (n_1, n_2, \ldots, n_r)$ and a matrix $$P = \begin{pmatrix}
p_{11} & \cdots & p_{1r} \\
   \vdots & \ddots & \vdots \\
   p_{r1} & \cdots & p_{rr}
\end{pmatrix}.$$
Here, $r$ is the number of groups, and the value of $n_k$ for fixed $k \in \{1, 2, \ldots, r\}$ tells you how many vertices are present in group $k$. Finally, the value of $p_{kl}$ for fixed $k, l \in \{1, 2, \ldots, r\}$ tells you the probability that an edge is placed between a vertex of group $k$ and a vertex of group $l$. Since the output graph is undirected, we need $p_{kl} = p_{lk}$ for all $k$ and $l$.
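As a quick sanity check on this definition, the expected number of edges follows directly from $\vec{n}$ and $P$. Group $k$ contains $\binom{n_k}{2}$ vertex pairs, each connected with probability $p_{kk}$, and two distinct groups $k < l$ share $n_k n_l$ pairs, each connected with probability $p_{kl}$, so $$\mathbb{E}[|E|] = \sum_{k=1}^{r} \binom{n_k}{2} p_{kk} + \sum_{k < l} n_k n_l p_{kl}.$$ The sketch below evaluates this formula. The helper name `expected_edges` and the example values are illustrative assumptions, not part of the model definition.

```python
from math import comb

def expected_edges(n, P):
    """Expected number of edges of an undirected SBM without self-loops.

    Hypothetical helper for illustration: n is the list of group sizes,
    P the symmetric matrix of edge probabilities.
    """
    r = len(n)
    total = 0.0
    for k in range(r):
        # Pairs inside group k.
        total += comb(n[k], 2) * P[k][k]
        # Pairs between group k and each later group l.
        for l in range(k + 1, r):
            total += n[k] * n[l] * P[k][l]
    return total

# Illustrative values (assumed for this example).
n = [50, 30, 10]
P = [[0.5, 0.1, 0.02], [0.1, 0.7, 0.1], [0.02, 0.1, 0.3]]
print(expected_edges(n, P))
```

For these values the formula gives $1120.5$ (up to floating-point rounding), which is a useful number to compare against the edge count of a generated instance.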

Algorithmically, we can create an instance of the stochastic block model by first generating an array $V = \{1, 2, \ldots, n_1 + n_2 + \ldots + n_r\}$ of vertices and an array $T$ of vertex groups. Element $T_i$ of the array $T$ contains the group of vertex $i \in V$. For simplicity, you can often let $T$ be given by $$T = \{\underbrace{1, 1, \ldots, 1}_{n_1 \text{ times}}, \underbrace{2, 2, \ldots, 2}_{n_2 \text{ times}}, \ldots, \underbrace{r, r, \ldots, r}_{n_r \text{ times}}\}.$$ After defining $V$ and $T$, we loop over all unordered pairs of distinct vertices $i, j \in V$ (for example, all pairs with $i < j$, so that every pair is visited exactly once) and check their groups. Suppose that the group of vertex $i$ is $k$ and the group of vertex $j$ is $l$. Then, we add the edge $\{i, j\}$ to the edge list $E$ with probability $p_{kl}$. Once we have looped over all pairs of vertices, we have a realisation of the stochastic block model.

**Exercise 1.** Create a function ``SBM(n, P)`` that inputs the vector $\vec{n}$ and matrix $P$ and outputs an instance of the stochastic block model. Make sure it outputs the vertex list $V$, the group list $T$ and the edge list $E$.

In [14]:
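import numpy as np

def SBM(n, P):
    """Generate one realisation of the stochastic block model."""
    # Vertices are labelled 0, 1, ..., sum(n) - 1.
    V = list(range(sum(n)))

    # T[i] is the group (1, ..., r) of vertex i; groups fill consecutive blocks.
    T = []
    for group, size in enumerate(n, start=1):
        T.extend([group] * size)

    # Visit every unordered pair {i, j} once and add the edge with probability p_kl.
    E = []
    for i in range(len(V)):
        for j in range(i + 1, len(V)):
            if np.random.rand() < P[T[i] - 1][T[j] - 1]:
                E.append([V[i], V[j]])

    return V, E, T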

#testing
n = [50, 30, 10]
P = [[0.5, 0.1, 0.02], [0.1, 0.7, 0.1], [0.02, 0.1, 0.3]]
for row in P:
    print(row)

V, E, T = SBM(n, P)
print(len(V))
print(len(E))
print(len(T))
print(T)
print(E)

[0.5, 0.1, 0.02]
[0.1, 0.7, 0.1]
[0.02, 0.1, 0.3]
90
1149
90
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]
[[0, 1], [0, 4], [0, 5], [0, 6], [0, 7], [0, 10], [0, 11], [0, 13], [0, 15], [0, 16], [0, 18], [0, 20], [0, 23], [0, 24], [0, 26], [0, 27], [0, 28], [0, 29], [0, 30], [0, 32], [0, 33], [0, 34], [0, 36], [0, 37], [0, 38], [0, 41], [0, 47], [0, 49], [0, 61], [0, 66], [0, 69], [0, 73], [1, 2], [1, 6], [1, 7], [1, 9], [1, 10], [1, 13], [1, 15], [1, 17], [1, 20], [1, 23], [1, 30], [1, 32], [1, 34], [1, 38], [1, 44], [1, 45], [1, 47], [1, 49], [1, 55], [1, 63], [1, 69], [1, 71], [1, 76], [2, 3], [2, 4], [2, 5], [2, 6], [2, 8], [2, 10], [2, 11], [2, 13], [2, 14], [2, 16], [2, 18], [2, 20], [2, 23], [2, 24], [2, 26], [2, 28], [2, 29], [2, 32], [2, 33], [2, 36], [2, 37
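
The double loop in ``SBM`` can also be replaced by array operations. The sketch below is one possible vectorised variant, assuming NumPy's ``default_rng`` generator for reproducible draws; the name `sbm_vectorised` and the `seed` parameter are illustrative additions. It returns the same three objects in the same order as ``SBM``, but as NumPy arrays.

```python
import numpy as np

def sbm_vectorised(n, P, seed=None):
    """Sketch of a vectorised SBM sampler (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    n = np.asarray(n)
    P = np.asarray(P, dtype=float)

    # Vertices 0, ..., sum(n) - 1.
    V = np.arange(n.sum())
    # Group labels 1, ..., r, repeated n_1, ..., n_r times.
    T = np.repeat(np.arange(1, len(n) + 1), n)

    # All unordered pairs i < j, and the edge probability of each pair.
    i, j = np.triu_indices(len(V), k=1)
    probs = P[T[i] - 1, T[j] - 1]

    # One uniform draw per pair; keep the pair when the draw falls below p_kl.
    keep = rng.random(len(i)) < probs
    E = np.column_stack((i[keep], j[keep]))
    return V, E, T

V, E, T = sbm_vectorised([50, 30, 10],
                         [[0.5, 0.1, 0.02], [0.1, 0.7, 0.1], [0.02, 0.1, 0.3]],
                         seed=0)
print(len(V), len(E), len(T))
```

This version stores one entry per vertex pair, so its memory use grows quadratically in the number of vertices; it trades memory for speed and is only suitable for moderately sized graphs.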

## The stochastic block model in NetworkX

In NetworkX the stochastic block model can be generated using the function ``stochastic_block_model``. It inputs the vector $\vec{n}$ and the matrix $P$ as lists (or ``np.array`` objects). Below you see some example code where NetworkX is used to generate an instance of the stochastic block model with three groups.

In [19]:
import networkx as nx
import numpy as np

#Generate the instance of SBM
n = [50, 30, 10]
P = [[0.5, 0.1, 0.02], [0.1, 0.7, 0.1], [0.02, 0.1, 0.3]]
G = nx.stochastic_block_model(n, P)

#Extract the vertex list, edge list, and the group allocation
V = np.array(G.nodes)
E = np.array(G.edges)
T = np.zeros_like(V)
#G.graph['partition'] is a list of vertex sets, one per group (groups are numbered from 0 here)
for group, partition in enumerate(G.graph['partition']):
    T[list(partition)] = group

**Exercise 2.** By default, the function ``stochastic_block_model`` will always assign the first $n_1$ vertices to the first group, the next $n_2$ vertices to the second group, et cetera. Sometimes, though, you might want to specify the groups beforehand. Create an implementation of the stochastic block model using NetworkX where you input the vertex-group array $T$ and the probability matrix $P$, but not the vector $\vec{n}$.
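
One possible approach, sketched under some assumptions, is to derive the group sizes from $T$ and pass the vertices to NetworkX ordered by group through the ``nodelist`` parameter of ``stochastic_block_model``. The function name `sbm_from_groups` is hypothetical, and the sketch assumes that groups are numbered $0, 1, \ldots, r - 1$ (as in the ``partition`` attribute) and that vertex $i$ is the $i$-th entry of $T$.

```python
import networkx as nx
import numpy as np

def sbm_from_groups(T, P, seed=None):
    """Sketch: SBM in which vertex i is placed in group T[i] (groups 0, ..., r-1)."""
    T = np.asarray(T)
    # Number of vertices per group; minlength keeps one entry per row of P.
    sizes = np.bincount(T, minlength=len(P)).tolist()
    # Vertices sorted by group, so the first sizes[0] entries are the
    # vertices of group 0, the next sizes[1] those of group 1, and so on.
    nodelist = np.argsort(T, kind="stable").tolist()
    return nx.stochastic_block_model(sizes, P, nodelist=nodelist, seed=seed)

# Illustrative group assignment (assumed for this example).
T = [2, 0, 1, 0, 0, 1, 2, 0]
P = [[0.5, 0.1, 0.02], [0.1, 0.7, 0.1], [0.02, 0.1, 0.3]]
G = sbm_from_groups(T, P, seed=1)
print(G.graph["partition"])
```

If $T$ contains a group label that has no row in $P$, the derived sizes and $P$ no longer have matching dimensions and NetworkX raises an exception, so it is worth validating $T$ against $P$ before calling the function.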


As in the Erdős–Rényi model, the implementation of ``stochastic_block_model`` in NetworkX might be relatively slow when the probabilities $p_{kl}$ in the matrix $P$ satisfy $$p_{kl} \approx \frac{C_{kl}}{n_1 + n_2 + \ldots + n_r},$$ for some fixed constant $C_{kl} > 0$ for all groups $k$ and $l$. To address this issue, ``stochastic_block_model`` has an optional parameter ``sparse`` that changes the algorithm used to generate the graph. Its default value is ``True``, which results in a faster algorithm when the probabilities $p_{kl}$ are small (of the order $1/(n_1 + n_2 + \ldots + n_r)$, as above), but a slower algorithm when they are large.
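
To see why a different algorithm can help when probabilities are small: flipping one coin per vertex pair takes time proportional to the number of pairs, however few edges result. A standard alternative draws the number of rejected pairs before the next accepted one from a geometric distribution and jumps ahead, so the work is proportional to the number of accepted pairs. The sketch below illustrates the idea on a plain range of pair indices. It is an assumption, not something established here, that the ``sparse`` option uses this exact scheme, and the helper `sample_by_skipping` is hypothetical.

```python
import math
import random

def sample_by_skipping(num_pairs, p, rng):
    """Return the indices in range(num_pairs) kept independently with probability p."""
    if p <= 0:
        return []
    if p >= 1:
        return list(range(num_pairs))

    kept = []
    log_q = math.log1p(-p)  # log(1 - p), accurate for small p
    index = -1
    while True:
        # The number of rejected indices before the next kept one is geometric.
        u = 1.0 - rng.random()  # uniform on (0, 1], avoids log(0)
        skip = int(math.log(u) / log_q)
        index += skip + 1
        if index >= num_pairs:
            return kept
        kept.append(index)

rng = random.Random(0)
kept = sample_by_skipping(100_000, 0.01, rng)
print(len(kept))
```

With $p$ of order $1/n$ and roughly $n^2/2$ pairs, this loop runs about $n$ times instead of $n^2/2$ times. When $p$ is large, almost every index is kept and the logarithm per kept index makes the scheme slower than a plain coin flip.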

**Exercise 3.** Compare the running time of the NetworkX implementation of the stochastic block model for increasing values of $n$ with the following two probability matrices: $$P_1 = \begin{pmatrix}0.3 & 0.6 \\0.6 & 0.4 \end{pmatrix}, \qquad \text{and} \qquad P_2 = \begin{pmatrix}0.15 / n & 0.3/n \\0.3/n & 0.2/n \end{pmatrix}. $$ Take $n_1 = n_2 = n/2$ (rounding the values up or down if needed), and consider both ``sparse=True`` and ``sparse=False``. For each value of $n$ you consider, measure the average time the code takes to run over multiple realisations of the model. Then, plot these times on a log-log scale for every combination of probability matrix and ``sparse`` setting. What do you see?

In [None]:
#Your answer goes here