# Estimating Parameters in Network Models

Throughout Chapter 5, we devoted a lot of attention to developing intuition for many of the network models that are essential to understanding random networks. Recall that the notation that we use for a random network (more specifically, a network-valued random variable), $\mathbf A$, does *not* refer to any network we could ever hope to see (or as we introduced in the previous chapter, *realize*) in the real world. This issue is extremely important in network machine learning, so we will try to drive it home one more time: no matter how much data we collect (unless we could get infinite data, which we *can't*), we can never hope to know the true distribution of $\mathbf A$ exactly. As network scientists, this leaves us with a bit of a problem: what, then, can we do to make useful claims about $\mathbf A$, if we can see neither $\mathbf A$ nor its distribution?

This is where statistics, particularly **estimation**, comes into play. At a very high level, estimation is a procedure to calculate properties of a random variable (or a set of random variables) using *only* the data we are given: finitely many (in network statistics, often just *one*) samples which we assume are *realizations* of the random variable we want to learn about. The properties of the random variable that we seek to learn about are called **estimands**. In the case of our network models, in particular, we will attempt to obtain reasonable estimates of the parameters (our *estimands*) associated with random networks.

Several key assumptions will be heavily used throughout the course of this chapter, which were developed in Chapter 5. In particular, the most common two properties we will leverage are:
1. Independence of edges: when working with independent-edge random network models, we will assume that edges in our random network are *independent*. This means that the probability of observing a particular realization of a random network is, in fact, the product of the probabilities of observing each edge in the random network. Notationally, what this means is that if $\mathbf A$ is a random network with $n$ nodes and edges $\mathbf a_{ij}$, and $A$ is a realization of that random network with edges $a_{ij}$, then:
\begin{align*}
    \mathbb P(\mathbf A = A) &= \mathbb P(\mathbf a_{11} = a_{11}, \mathbf a_{12} = a_{12}, ..., \mathbf a_{nn} = a_{nn}) \\
    &= \prod_{i, j} \mathbb P(\mathbf a_{ij} = a_{ij})
\end{align*}
In the special case where our networks are simple (undirected and loopless), this simplifies to:
\begin{align*}
    \mathbb P(\mathbf A = A) &= \prod_{i < j} \mathbb P(\mathbf a_{ij} = a_{ij})
\end{align*}
for any network realization $A$ which is simple. This is because if $\mathbf a_{ij} = a$, then we also know that $\mathbf a_{ji} = a$. Further, since $A$ is also simple, we know that $\mathbf a_{ii} = 0$; that is, no nodes have loops.
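
As a rough illustration of the product formula for simple networks, the sketch below multiplies the per-edge Bernoulli probabilities over the pairs $i < j$. The function name `simple_network_prob` and the matrix `P` of edge probabilities are hypothetical names introduced only for this example.

```python
import numpy as np

def simple_network_prob(A, P):
    """Probability of a simple network realization A when edge (i, j)
    exists independently with probability P[i, j] (illustrative helper)."""
    A = np.asarray(A, dtype=float)
    P = np.asarray(P, dtype=float)
    # only the pairs i < j matter for an undirected, loopless network
    iu = np.triu_indices_from(A, k=1)
    a, p = A[iu], P[iu]
    # each edge contributes p if present and (1 - p) if absent
    return np.prod(p**a * (1 - p)**(1 - a))

# a 3-node path: edges (1, 2) and (2, 3) are present, (1, 3) is absent
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]])
P = np.full((3, 3), 0.5)  # assumed: every edge has probability 0.5
prob = simple_network_prob(A, P)
print(prob)
```

With three node pairs that each have probability $0.5$, every simple realization on three nodes is equally likely under this assumed `P`.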

## The Coin Flip Example

Let's think about what exactly this means using an example that you are likely familiar with. I have a single coin, and I want to know the probability that a flip of that coin lands on heads. For the sake of argument, we will call this coin *fair*, which means that the true probability it lands on heads (or tails) is $0.5$. In this case, I would call the outcome of the $i^{th}$ coin flip the random variable $\mathbf x_i$, and it can produce realizations which take one of two possible values: a heads (an outcome of a $1$) or a tails (an outcome of a $0$). We will say that we see $10$ total coin flips. We will number these realizations as $x_i$, where $i$ goes from $1$ to $10$. To recap, the boldfaced $\mathbf x_i$ denotes the random variable, and the unbolded $x_i$ denotes the realization which we actually see. Our question of interest is: how do we estimate the probability of the coin landing on heads, $p$, if we know nothing about its true value other than the outcomes of the coin flips we got to observe?

Here, since $\mathbf x_i$ takes the value $1$ or $0$ each with probability $0.5$, we would say that $\mathbf x_i$ is a $Bernoulli(0.5)$ random variable. This means that the random variable $\mathbf x_i$ has the Bernoulli distribution, and the probability of a heads, $p$, is $0.5$. All $10$ of our $\mathbf x_i$ are called *identically distributed*, since they all have the same $Bernoulli(0.5)$ distribution.

We will also assume that the outcomes of the coin flips are mutually independent, which is explained in the terminology section.

For any one coin flip, the probability of observing the outcome $x_i$ is, by definition of the Bernoulli distribution:
\begin{align*}
    \mathbb P(\mathbf x_i = x_i) = p^{x_i} (1 - p)^{1 - x_i}
\end{align*}
If we saw $n$ total outcomes, the probability is, using the definition of mutual independence:
\begin{align*}
    \mathbb P(\mathbf x_1 = x_1, ..., \mathbf x_{n} = x_{n}; p) &= \prod_{i = 1}^{n}\mathbb P(\mathbf x_i = x_i) \\
    &= \prod_{i = 1}^n p^{x_i}(1 - p)^{1 - x_i} \\
    &= p^{\sum_{i = 1}^n x_i}(1 - p)^{n - \sum_{i = 1}^n x_i}
\end{align*}
What if we saw $10$ coin flips, and $6$ were heads? Can we take a "guess" at what $p$ might be? Intuitively your first reaction might be to say a good guess of $p$, which we will abbreviate $\hat p$, would be $0.6$, which is $6$ heads of $10$ outcomes. In many ways, this intuitive guess is spot on. However, in network machine learning, we like to be really specific about why, exactly, this guess makes sense. 

Looking at the above equation, one thing we can do is use the technique of **maximum likelihood estimation**. We call the function $\mathbb P(\mathbf x_1 = x_1, ..., \mathbf x_n = x_n; p)$ the *likelihood* of our sequence, for a given value of $p$. Note that we have added the term "$; p$" to our notation, which is simply to emphasize the dependence of the likelihood on the probability. So, what we *really* want to do is find the value that $p$ could take, which *maximizes* the likelihood. 

An easier problem, we often will find, is to instead maximize the *log likelihood* rather than the likelihood itself. This is because the log function is *monotone increasing*, which means that if $\mathbb P(\mathbf x_1 = x_1, ..., \mathbf x_n = x_n; p_1) < \mathbb P(\mathbf x_1 = x_1, ..., \mathbf x_n = x_n; p_2)$, then $\log\mathbb P(\mathbf x_1 = x_1, ..., \mathbf x_n = x_n; p_1) < \log \mathbb P(\mathbf x_1 = x_1, ..., \mathbf x_n = x_n; p_2)$ as well, for any choices $p_1$ and $p_2$. Without going too far into the weeds, the idea is that the $\log$ function does not change where the likelihood attains its maximum. The log likelihood of the above expression is:
\begin{align*}
\log \mathbb P(\mathbf x_1 = x_1, ..., \mathbf x_{n} = x_{n}; p) &= \log \left[p^{\sum_{i = 1}^n x_i}(1 - p)^{n - \sum_{i = 1}^n x_i}\right] \\
&= \sum_{i = 1}^n x_i \log(p) + \left(n - \sum_{i = 1}^n x_i\right)\log(1 - p)
\end{align*}
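
To make the monotonicity argument concrete, here is a minimal sketch that evaluates both the likelihood and the log likelihood on a grid of candidate values of $p$ for $6$ heads in $10$ flips. The particular sequence `x` and the grid `p_grid` are assumptions chosen for illustration; a grid search is only a rough numerical stand-in for the calculus that follows.

```python
import numpy as np

# an illustrative sequence of 10 flips with 6 heads (1) and 4 tails (0)
x = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0])
n, heads = x.size, x.sum()

# candidate values of p, avoiding 0 and 1 where the log is undefined
p_grid = np.linspace(0.01, 0.99, 99)

# likelihood and log likelihood from the expressions above
likelihood = p_grid**heads * (1 - p_grid)**(n - heads)
log_likelihood = heads * np.log(p_grid) + (n - heads) * np.log(1 - p_grid)

# both curves peak at the same candidate value
p_best_lik = p_grid[np.argmax(likelihood)]
p_best_loglik = p_grid[np.argmax(log_likelihood)]
print(p_best_lik, p_best_loglik)
```

Both searches select the same grid point, which is what we expect since taking the log preserves the ordering of the likelihood values.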

To maximize the log likelihood, we will use some basic calculus. Remembering from calculus $1$ and $2$, to find a maximal point of the log-likelihood function with respect to some variable $p$, our process looks like this:
1. Take the derivative of the log-likelihood with respect to $p$,
2. Set it equal to $0$ and solve for the critical point $p^*$,
3. Verify that the critical point $p^*$ is indeed an estimate of a maximum, $\hat p$. 

Proceeding using the result we derived above, and using the fact that $\frac{d}{du} \log(u) = \frac{1}{u}$ and that $\frac{d}{du} \log(1 - u) = -\frac{1}{1 - u}$:
\begin{align*}
\frac{d}{d p}\log \mathbb P(\mathbf x_1 = x_1, ..., \mathbf x_{n} = x_{n}; p) &= \frac{\sum_{i = 1}^n x_i}{p} - \frac{n - \sum_{i = 1}^n x_i}{1 - p} = 0 \\
\Rightarrow \frac{\sum_{i = 1}^n x_i}{p} &= \frac{n - \sum_{i = 1}^n x_i}{1 - p} \\
\Rightarrow (1 - p)\sum_{i = 1}^n x_i &= p\left(n - \sum_{i = 1}^n x_i\right) \\
\Rightarrow \sum_{i = 1}^n x_i - p\sum_{i = 1}^n x_i &= pn - p\sum_{i = 1}^n x_i \\
\Rightarrow p^* &= \frac{1}{n}\sum_{i = 1}^n x_i
\end{align*}
We use the notation $p^*$ here to denote that $p^*$ is a critical point of the function.

Finally, we must check that this is an estimate of a maximum, which we can do by taking the second derivative and checking that the second derivative is negative. We will omit this since it's a bit intricate and tangential from our argument, but if you work it through, you will find that the second derivative is indeed negative at $p^*$. This means that $p^*$ is indeed an estimate of a maximum, which we would denote by $\hat p$.

Using this result, we find that with $6$ heads in $10$ outcomes, we would obtain an estimate:
\begin{align*}
    \hat p &= \frac{6}{10} = 0.6
\end{align*}
which exactly aligns with our intuition.
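
The closed-form estimate can also be checked numerically. The sketch below computes $\hat p$ as the sample mean and evaluates the second derivative of the log likelihood, $-\frac{\sum_i x_i}{p^2} - \frac{n - \sum_i x_i}{(1 - p)^2}$, at that point. The helper names `bernoulli_mle` and `log_lik_second_derivative` and the sequence `flips` are hypothetical, introduced only for this example.

```python
import numpy as np

def bernoulli_mle(x):
    """MLE of p for i.i.d. Bernoulli outcomes: the fraction of ones."""
    x = np.asarray(x, dtype=float)
    return x.mean()

def log_lik_second_derivative(x, p):
    """Second derivative of the Bernoulli log likelihood with respect to p."""
    x = np.asarray(x, dtype=float)
    heads, n = x.sum(), x.size
    return -heads / p**2 - (n - heads) / (1 - p)**2

# an illustrative sequence with 6 heads in 10 flips
flips = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]
p_hat = bernoulli_mle(flips)
curvature = log_lik_second_derivative(flips, p_hat)
print(p_hat, curvature)
```

A negative value of `curvature` is consistent with the critical point being a maximum rather than a minimum.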

So, why do we need estimation tools, if in our example, our intuition gave us the answer a whole lot faster? Unfortunately, the particular scenario we described was one of the *simplest possible examples* in which a parameter requires estimation. As the scenario grows more complicated, and *especially* when we extend to network-valued data, figuring out good ways to estimate parameters is extremely difficult. For this reason, we will describe some tools which are very relevant to network machine learning to learn about network parameters.

We will review estimation techniques for several of the approaches we discussed in Chapter 5, for Single Network Models.

## Erdös-Rényi (ER)

Recall that the Erdös-Rényi (ER) network has a single parameter: the probability of each edge existing, which we termed $p$. Due to the simplicity of a random network which is ER, fortunately we can resort to the Maximum Likelihood technique we delved into in the coin example above, and it turns out we obtain a very similar result with some caveats. In Chapter 5, we explored the derivation for the probability of observing a realization $A$ of a given random network $\mathbf A$ which is ER, which is equivalent to the likelihood of $A$. Recall this was:

\begin{align*}
    \mathbb P_\theta(A) &= p^{m} \cdot (1 - p)^{\binom{n}{2} - m}
\end{align*}

where $m = \sum_{i < j} a_{ij}$ is the total number of edges in the observed network $A$. Our approach here parallels directly the approach for the coin; we begin by taking the log of the probability:

\begin{align*}
    \log \mathbb P_\theta(A) &= \log \left[p^{m} \cdot (1 - p)^{\binom{n}{2} - m}\right] \\
    &= m \log p + \left(\binom n 2 - m\right)\log (1 - p)
\end{align*}

Next, we take the derivative with respect to $p$, set equal to $0$, and we end up with:
\begin{align*}
\frac{d}{d p}\log \mathbb P_\theta(A) &= \frac{m}{p} - \frac{\binom n 2 - m}{1 - p} = 0 \\
\Rightarrow p^* &= \frac{m}{\binom n 2}
\end{align*}
We omitted several detailed steps because they mirror the derivation shown above for the coin. Checking the second derivative, which we omit since it is rather mathematically tedious, we see that the second derivative at $p^*$ is negative, so we have indeed found a maximum, which we denote by $\hat p$. This gives that the Maximum Likelihood Estimate (or the MLE, for short) of the probability $p$ for a random network $\mathbf A$ which is ER is:

\begin{align*}
    \hat p &= \frac{m}{\binom n 2}
\end{align*}
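
As a minimal sketch of this estimator, the code below counts the edges in the upper triangle of an adjacency matrix and divides by $\binom n 2$. The function name `er_mle` is hypothetical, and the network is sampled directly with NumPy (rather than with a network library) so that the example stays self-contained; the values of `n` and `p_true` are arbitrary choices.

```python
import numpy as np

def er_mle(A):
    """MLE of the edge probability for a simple network under the ER model."""
    A = np.asarray(A)
    n = A.shape[0]
    # m: number of edges, counting each pair i < j once
    m = np.triu(A, k=1).sum()
    # n choose 2 possible edges
    return m / (n * (n - 1) / 2)

# sample a simple ER network with assumed parameters
rng = np.random.default_rng(0)
n, p_true = 200, 0.3
upper = np.triu(rng.random((n, n)) < p_true, k=1).astype(int)
A = upper + upper.T  # symmetrize; the diagonal stays zero

p_hat = er_mle(A)
print(p_hat)
```

With $\binom{200}{2} = 19900$ possible edges, the estimate should typically land close to the value of `p_true` used for sampling.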

## *a priori* Stochastic Block Model

The *a priori* Stochastic Block Model also has a single parameter: the block matrix, $B$, whose entries $b_{kk'}$ denote the probabilities of edges existing between pairs of communities in the Stochastic Block Model. When we derived the probability for a realization $A$ of a random network $\mathbf A$ which could be characterized using the *a priori* Stochastic Block Model, we obtained that:
\begin{align*}
    \mathbb P_\theta(A) &= \prod_{k, k' \in [K]}b_{k'k}^{m_{k'k}} \cdot (1 - b_{k'k})^{n_{k'k} - m_{k'k}}
\end{align*}

where $n_{k'k} = \sum_{i < j}\mathbb 1_{\tau_i = k}\mathbb 1_{\tau_j = k'}$ was the number of possible edges between nodes in community $k$ and $k'$, and $m_{k'k} = \sum_{i < j}\mathbb 1_{\tau_i = k}\mathbb 1_{\tau_j = k'}a_{ij}$ was the number of edges in the realization $A$ between nodes within communities $k$ and $k'$. 

Noting that the log of the product is the sum of the logs, or that $\log \prod_i x_i = \sum_i \log x_i$, the log of the probability is:
\begin{align*}
    \log \mathbb P_\theta(A) &= \sum_{k, k' \in [K]} m_{k'k}\log b_{k'k} + \left(n_{k'k} - m_{k'k}\right)\log(1 - b_{k'k})
\end{align*}

Here we see concretely something that we mentioned briefly in the network models section: in a lot of ways, the probability (and consequently, the log probability) of a random network which is an *a priori* SBM behaves very similarly to that of a random network which is ER, with the caveat that the probability term $p$, the total number of possible edges $\binom n 2$, and the total number of edges $m$ have been replaced with the probability term $b_{k'k}$, the total number of possible edges $n_{k'k}$, and the total number of edges $m_{k'k}$ which *apply only to that particular pair of communities*. In this sense, the *a priori* SBM is kind of like a collection of communities of ER networks. Pretty neat, right? Well, it doesn't stop there. When we take the partial derivative of $\log \mathbb P_\theta(A)$ with respect to any of the probability terms $b_{l'l}$, we see an even more direct consequence of this observation:
\begin{align*}
    \frac{\partial }{\partial b_{l' l}}\log \mathbb P_\theta(A) &= \frac{\partial}{\partial b_{l'l}}\sum_{k, k' \in [K]} m_{k'k}\log b_{k'k} + \left(n_{k'k} - m_{k'k}\right)\log(1 - b_{k'k}) \\
    &= \sum_{k, k' \in [K]} \frac{\partial}{\partial b_{l'l}}\left[m_{k'k}\log b_{k'k} + \left(n_{k'k} - m_{k'k}\right)\log(1 - b_{k'k})\right]
\end{align*}
Now what? Notice that for any of the summands in which $k \neq l$ or $k' \neq l'$, the partial derivative with respect to $b_{l'l}$ is in fact exactly $0$! Why is this? Well, let's consider a pair $(k', k)$ which is different from $(l', l)$. Notice that:
\begin{align*}
\frac{\partial}{\partial b_{l'l}}\left[m_{k'k}\log b_{k'k} + \left(n_{k'k} - m_{k'k}\right)\log(1 - b_{k'k})\right] = 0
\end{align*}
which simply follows since the quantity to the right of the partial derivative is not a function of $b_{l'l}$ at all! Therefore:
\begin{align*}
    \frac{\partial }{\partial b_{l' l}}\log \mathbb P_\theta(A) &= 0 + \frac{\partial}{\partial b_{l'l}}\left[m_{l'l}\log b_{l'l} + \left(n_{l'l} - m_{l'l}\right)\log(1 - b_{l'l})\right] \\
    &= \frac{m_{l'l}}{b_{l'l}} - \frac{n_{l'l} - m_{l'l}}{1 - b_{l'l}} = 0 \\
\Rightarrow b_{l'l}^* &= \frac{m_{l'l}}{n_{l'l}}
\end{align*}

Like above, we omit the second derivative test, and conclude that the MLE of the block matrix $B$ for a random network $\mathbf A$ which is *a priori* SBM is the matrix $\hat B$ with entries:
\begin{align*}
    \hat b_{l'l} &= \frac{m_{l'l}}{n_{l'l}}
\end{align*}
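
A minimal sketch of this estimator is given below: for each pair of communities, it divides the number of observed edges by the number of possible edges. The function name `sbm_mle` is hypothetical. As an assumption of this sketch, off-diagonal blocks count every node pair between the two communities (regardless of node ordering), which gives the same ratio $m_{l'l} / n_{l'l}$ while keeping $\hat B$ symmetric. The network is sampled directly with NumPy to keep the example self-contained.

```python
import numpy as np

def sbm_mle(A, labels):
    """Estimate the block matrix of an a priori SBM from a simple network A
    and known community labels (illustrative helper)."""
    A = np.asarray(A)
    labels = np.asarray(labels)
    communities = np.unique(labels)
    K = communities.size
    B_hat = np.full((K, K), np.nan)
    for a, ka in enumerate(communities):
        for b, kb in enumerate(communities):
            rows, cols = labels == ka, labels == kb
            block = A[np.ix_(rows, cols)]
            if a == b:
                # within a community: pairs i < j only, no loops
                n_k = rows.sum()
                possible = n_k * (n_k - 1) / 2
                edges = np.triu(block, k=1).sum()
            else:
                # between communities: every cross pair counted once
                possible = rows.sum() * cols.sum()
                edges = block.sum()
            if possible > 0:
                B_hat[a, b] = edges / possible
    return B_hat

# sample a 2-community SBM with assumed parameters
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 25)
B = np.array([[0.8, 0.1], [0.1, 0.8]])
P = B[labels][:, labels]  # edge probability for every node pair
upper = np.triu(rng.random((50, 50)) < P, k=1).astype(int)
A = upper + upper.T

B_hat = sbm_mle(A, labels)
print(np.round(B_hat, 2))
```

Each entry of `B_hat` is computed exactly like the ER estimate, restricted to one pair of communities.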

## Singular Value Decomposition*

In the succeeding sections, we will begin to explore a set of techniques, known as *matrix decompositions*, in which a matrix is broken down into a product of simpler matrices. In our case, the matrix will be the adjacency matrix corresponding to a realized network (or networks), and our goal will be to identify factors which, when analyzed, allow us to perform inference about the realized network.

The key to the successive sections is a concept known as the **singular value decomposition**, or SVD. Note that for these and successive sections, we will present a simplified, and non-rigorous, review of the SVD and many results that are important for developing intuition around this decomposition. This description of the SVD has been modified to fit our purposes: particularly, the description we provide applies only to square matrices (such as the adjacency matrix of a network), and many of the properties we will cover will only apply to symmetric matrices (such as an adjacency matrix for a network which is undirected). For more details, or for explicit proofs, we would recommend a Linear Algebra textbook [Trefethen, LADR].

**Singular Value Decomposition** (SVD) of a square, symmetric matrix: Suppose that $A$ is a matrix which is square, has $n$ rows and columns, and is symmetric (that is, $A \in \mathbb R^{n \times n}$, and for any $i, j \in [n]$, $a_{ij} = a_{ji}$). Then there exists a matrix $U$ which is square and orthogonal (that is, $U \in \mathbb R^{n \times n}$ and $UU^T = U^T U = I$) and a matrix $\Sigma$ which is square, diagonal, and whose diagonal entries are real-valued, non-negative, and arranged in decreasing order (that is, $\Sigma \in \mathbb R^{n \times n}$, $\sigma_{ij} = 0$ for any $i \neq j$, and $\sigma_{1} \geq \sigma_2 \geq ... \geq \sigma_n \geq 0$), where:
\begin{align*}
    A &= U \Sigma U^T
\end{align*}
The factorization of $A$ into the product of the matrices $U$, $\Sigma$, and $U^T$ is known as a **singular value decomposition** of $A$. Further, the diagonal entries of $\Sigma$, $\sigma_{i}$, are known as the **singular values** of the matrix $A$. Finally, the columns of the matrix $U$, each of which is an $n$-dimensional unit vector $\vec u_i$, are known as the **singular vectors** of the matrix $A$. Symbolically, this looks something like this:
\begin{align*}
    A &= \begin{bmatrix}
    \uparrow & \uparrow &  & \uparrow \\
    \vec u_1 & \vec u_2 & ... & \vec u_n \\
    \downarrow & \downarrow &  & \downarrow
    \end{bmatrix}\begin{bmatrix}
    \sigma_1 & &  & \\
    & \sigma_2 &  & \\
    & & \ddots & \\
    & & & \sigma_n
    \end{bmatrix}\begin{bmatrix}
    \leftarrow & \vec u_1^T & \rightarrow \\
    \leftarrow & \vec u_2^T & \rightarrow \\
    & \vdots & \\
    \leftarrow & \vec u_n^T & \rightarrow \\
    \end{bmatrix}
\end{align*}
Let's illustrate the SVD using an example. We will sample a $50 \times 50$ realization of an adjacency matrix from a random network which is an *a priori* SBM, with $2$ communities. The first $25$ nodes will be in the first community, and the second $25$ nodes will be in the second community. The within-community probabilities will both be $0.8$, and the between-community probabilities will be $0.1$:

In [None]:
import numpy as np
import graspologic
from graspologic.simulations import sbm
from graphbook_code import heatmap, draw_multiplot, draw_layout_plot

# block matrix
B = np.array([[0.8, 0.1], [0.1, 0.8]])
# number of nodes in each community
n = [25, 25]

A = sbm(n=n, p=B, directed=False, loops=False)

In [None]:

plt = draw_multiplot(A, labels=["1" for i in range(0, 25)] + ["2" for i in range(0, 25)])

Now what happens when we take the SVD of $A$? We will first show that the singular vectors $U$ and the singular values of $\Sigma$ when appropriately arranged do, in fact, comprise $A$:

In [None]:
from scipy.linalg import svd

U, s, Ut = svd(A)
print(" A = USU^T: ", np.allclose(A, np.dot(U, np.dot(np.diag(s), Ut))))

What does $U$ look like?

In [None]:
heatmap(U, title="$U$")

In this book, we do not need to know how to solve for, nor prove the existence of, the SVD. We will take these two facts for granted. In the discussion below, we will briefly use the concept of **matrix rank**, which we define now.

**Matrix Rank**: The rank of a matrix $A$, denoted $rank(A)$, is the number of linearly independent rows (equivalently, the number of linearly independent columns) of $A$.

What does this mean? At a very high level, we can think of the matrix rank as telling us just how "simple" the matrix $A$ is. A matrix which is rank $1$ is very simple, in that every one of its rows (or columns) is a multiple of a single vector. On the other hand, a matrix which has "full rank", or a rank of $n$, is a bit more complex, in that no row (or column) can be expressed as a weighted sum of the other rows (or columns).
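
To make the two extremes concrete, here is a small sketch using `np.linalg.matrix_rank`. The vector `v` and the two matrices are arbitrary examples chosen for illustration.

```python
import numpy as np

# a rank-1 matrix: every row (and column) is a multiple of the vector v
v = np.array([1.0, 2.0, 3.0])
rank_one = np.outer(v, v)

# a full-rank matrix: no row is a weighted sum of the others
full_rank = np.eye(3)

r1 = np.linalg.matrix_rank(rank_one)
r3 = np.linalg.matrix_rank(full_rank)
print(r1, r3)
```

The outer product of a vector with itself is exactly the kind of rank-$1$ building block that appears in the results below.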

The next two results are critical to our understanding of the geometry of what is going on in the SVD:
1. The matrix $A$ can be expressed as the sum of the rank $1$ matrices $\sigma_i \vec u_i \vec u_i^T$; that is:
\begin{align*}
    A &= \sum_{i = 1}^n \sigma_i u_i u_i^T
\end{align*}
What does this result say? This result expresses that the matrix $A$ can be fully expressed as the sum of a bunch of rank-1 matrices $\vec u_i \vec u_i^T$, which are weighted by the corresponding singular value $\sigma_i$. Each matrix product is $n \times n$ because $\vec u_i \in \mathbb R^{n \times 1}$ is an $n$-dimensional column vector and $\vec u_i^T \in \mathbb R^{1 \times n}$ is an $n$-dimensional row vector. Symbolically, this looks like:
\begin{align*}
    A &= \sigma_1 \begin{bmatrix}\uparrow \\ \vec u_1 \\ \downarrow\end{bmatrix}\begin{bmatrix}\leftarrow & \vec u_1^T & \rightarrow \end{bmatrix} + 
    \sigma_2 \begin{bmatrix}\uparrow \\ \vec u_2 \\ \downarrow\end{bmatrix}\begin{bmatrix}\leftarrow & \vec u_2^T & \rightarrow \end{bmatrix} + 
    ... + 
    \sigma_n \begin{bmatrix}\uparrow \\ \vec u_n \\ \downarrow\end{bmatrix}\begin{bmatrix}\leftarrow & \vec u_n^T & \rightarrow \end{bmatrix}
\end{align*}
Let's take a look at what $\sigma_1 \vec u_1 \vec u_1^T$ looks like next:

In [None]:
# outer product of the first singular vector pair, scaled by the first singular value
heatmap(s[0] * np.outer(U[:, 0], Ut[0, :]), title=r"$\sigma_1 u_1 u_1^T$")

2. Let $A_k = \sum_{i = 1}^k \sigma_i u_i u_i^T$, where $k \leq n$. Then:
\begin{align*}
    ||A - A_k||_F &= \min_{\text{all rank $k$ square matrices }A' \in \mathbb R^{n \times n}} ||A - A'||_F
\end{align*}
This result will be the most important thing we need to understand about the SVD itself. In particular, what this result says is that if we were to look at the sum of $A$ expressed above, and *only* look at the sum of the first $k$ of those terms, that rank $k$ matrix $A_k$ is the most similar rank $k$ matrix to $A$, according to the Frobenius norm.

What happens when we choose $k = 1$? Well, we get the result above! Now that doesn't look too much like $A$. But what if we chose a few more vectors, like $k=2$?

In [None]:
k=2
heatmap(np.dot(U[:,0:k], np.dot(np.diag(s[0:k]), Ut[0:k, :])), title="$A_2$", vmin=0, vmax=1)

Well, it's not exact, but it looks pretty close, considering we only used two $n$-dimensional vectors instead of a full $n \times n$ matrix! Visually, it looks like we've done an extremely good job of capturing the modular structure that $A$ had, and in fact, it actually looks *more* obvious than with $A$ itself. This property, and the reason why two vectors captured two communities, will be key to the concept of the spectral embedding, which we will learn about in the next section.


How about with $k=10$?

In [None]:
k=10
heatmap(np.dot(U[:,0:k], np.dot(np.diag(s[0:k]), Ut[0:k, :])), title="$A_{10}$", vmin=0, vmax=1)

This looks very close to the plot of the adjacency matrix $A$ that we saw previously.
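
The best rank-$k$ approximation property stated in the second result can also be probed numerically. The sketch below builds $A_k$ from the first $k$ terms of the SVD and compares its Frobenius error against one arbitrary rank-$k$ competitor. The helper name `rank_k_approx`, the randomly sampled network, and the competitor matrix are all assumptions made for illustration; comparing against a single competitor is a sanity check, not a proof.

```python
import numpy as np

def rank_k_approx(U, s, Vt, k):
    """Sum of the first k rank-1 terms of the SVD."""
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# an illustrative symmetric, loopless adjacency matrix
rng = np.random.default_rng(0)
n, k = 30, 5
upper = np.triu(rng.random((n, n)) < 0.4, k=1).astype(float)
A = upper + upper.T

U, s, Vt = np.linalg.svd(A)
A_k = rank_k_approx(U, s, Vt, k)

# Frobenius error of the truncated SVD
err_svd = np.linalg.norm(A - A_k, "fro")
# the same error, computed from the discarded singular values
err_formula = np.sqrt(np.sum(s[k:] ** 2))

# one arbitrary competitor with rank at most k
G = rng.standard_normal((n, k))
competitor = G @ G.T
err_other = np.linalg.norm(A - competitor, "fro")

print(err_svd, err_formula, err_other)
```

The first two numbers should agree up to floating-point error, and the third should be at least as large as the first.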