# Why Use Statistical Models?

Consider a social network. As scientists, a common question we might ask is how we could describe the network in the simplest way possible. For instance, if we know nothing about the people, or how they might be connected in the social network, we might simply say that every pair of people has some probability of being friends. On the other hand, if we know that the people within the social network are groups of students from different schools, we might want to say that people from the same school have a higher probability of being friends than people from different schools. Choosing to characterize the network in one of these (or other) ways is called choosing an underlying statistical model.
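The two descriptions above can be written down as tables of pairwise friendship probabilities. The following is only a minimal sketch: the six people, their school labels, and the probability values (`p`, `p_same`, `p_diff`) are all hypothetical, chosen for illustration rather than taken from any data.

```python
import numpy as np

# Hypothetical example: six people, three from school 0 and three from school 1.
schools = np.array([0, 0, 0, 1, 1, 1])
n = len(schools)

# Description 1: we know nothing about the people, so every pair of people
# shares a single probability of being friends (assumed value).
p = 0.3
P_single = np.full((n, n), p)

# Description 2: pairs from the same school are more likely to be friends
# than pairs from different schools (assumed values).
p_same, p_diff = 0.6, 0.1
same_school = schools[:, None] == schools[None, :]
P_school = np.where(same_school, p_same, p_diff)

# Nobody is counted as their own friend.
np.fill_diagonal(P_single, 0.0)
np.fill_diagonal(P_school, 0.0)

print(P_school)
```

Entry `[i, j]` of each matrix is the probability that person `i` and person `j` are friends. The second description needs two numbers instead of one, which is the kind of added complexity we weigh against what we actually know about the network.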

The network we actually observe (for which we have vertices, edges, and perhaps network attributes) is *not* the true network; rather, we assume that the true network is one which we could *never* observe completely, as each time we look at the network, we will see it slightly differently. In our social network example above, for instance, a person in our network might have a slightly different group of friends depending on when we look at their friend circle. Stated another way, our observed network is merely a *realization* of an underlying **random network**. We describe our random network using sets of statistical assumptions, referred to as the **statistical network model**. In this book, we describe networks using an approach called *generative modelling*, which means that we use models which describe *how* the random network underlying our realization could have come about. That is, the statistical network model indicates what, exactly, is random, and how we can characterize its behavior statistically.

```{admonition} Comparing a Univariate Statistical Model to a Random Network Model
Here, we will show how the traditional framework for univariate statistical models extends to random network models. For a univariate statistical model, let's imagine we are tossing a coin 100 times, and we want to determine the probability of the coin landing on heads. Every time we toss the coin, we will get an outcome that we can see (did the coin land on heads, or did it land on tails?). Let's call the outcome of the coin toss (heads or tails) $x_i$, where $i$ just indicates the index (between $1$ and $100$) of the particular coin toss. To determine the probability of the coin landing on heads, we will assume that the outcome of the coin is random, and that each time we toss the coin, we are realizing a random variable $X_i$. It is important to emphasize (again!) that this $X_i$ is random: it does not take any particular value. If the coin toss $x_i$ was heads the first time, we might not get heads when we flip the coin again. Instead, we describe $X_i$ using a univariate statistical model. The univariate statistical model is a set of possible descriptions of $X_i$. For instance, we might think that it is a Bernoulli-distributed random variable, which means that if we knew the Bernoulli probability $p$, we would know that $X_i$ lands on heads with probability $p$. In this case, the Bernoulli distribution is the model for $X_i$.

In much the same way, let's think about our social network example again. We have the topology of a network, $\pmb a$, where the vertices are school students in a county, and an edge indicates whether a given pair of students are friends or not. Like each of the coin toss outcomes above, we get to see $\pmb a$, and know exactly what values $\pmb a$ takes. In the same way as above, we will assume that the network $\pmb a$ is a realization of a random **network** $\pmb A$. Like the random variables $X_i$ above, we can only describe $\pmb A$ using a statistical model, because it is a random quantity. If we looked again at who students were friends with, we might see that some people are friends who weren't before, and other people aren't friends who were friends before. This time, we will describe the random network $\pmb A$ using a random network model, which we will learn more about in the next few sections.

| | Coin-Toss | Social Network |
| --- | --- | --- |
| Observed Data | Outcome of a coin toss $x_i$ | Topology of a network $\pmb a$ |
| Random Variable | $X_i$, where $x_i$ is said to be a *realization* of $X_i$ | Random network $\pmb A$, where $\pmb a$ is a realization of $\pmb A$ |
| Statistical Model | the Bernoulli distribution | a Single Network Model |
```
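The comparison in the box above can be simulated directly. This is a minimal sketch under stated assumptions: the probabilities `p_heads` and `p_friends`, the number of students, and the helper `sample_network` are all hypothetical, and treating every pair of students as friends independently with the same probability is just one simple choice made for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Coin toss: each observed outcome x_i is a realization of a
# Bernoulli random variable X_i (1 = heads, 0 = tails).
p_heads = 0.5  # assumed value
x = rng.binomial(n=1, p=p_heads, size=100)
print("fraction of heads:", x.mean())


def sample_network(n, p, rng):
    """Sample one undirected network without self-loops, in which each
    pair of vertices is connected independently with probability p."""
    # Decide each pair once, using only the entries above the diagonal.
    upper = np.triu(rng.random((n, n)) < p, k=1)
    # Mirror the upper triangle so that the adjacency matrix is symmetric.
    return (upper | upper.T).astype(int)


# Social network: the observed adjacency matrix a is a realization of
# a random network A.
n_students = 30   # assumed value
p_friends = 0.2   # assumed value
a = sample_network(n_students, p_friends, rng)

# Looking at the network again gives a different realization of the same A.
a_again = sample_network(n_students, p_friends, rng)
n_changed = int(np.triu(a != a_again, k=1).sum())
print("pairs whose friendship status differs:", n_changed)
```

Just as a second round of coin tosses would not reproduce the first sequence of heads and tails, the second sampled network has a different set of friendships, even though both were generated under the same model.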

Before we get started, it is important to pay careful attention to the age-old aphorism attributed to George Box, a pioneering British statistician of the 20$^{th}$ century. George Box stated, "all models are wrong, but some are useful." In this sense, it is important to remember that the statistical model we select is, in practice, *never* the correct model (this holds for any aspect of statistics, not just network statistics). In the context of a network, this means that even if we have a model we think describes our network very well, it is *not* the case that the model we select actually describes the network precisely and correctly. Despite this, it is often valuable to use statistical models for the simple reason that assuming a stochastic process (that is, some *random* process) underlies our data allows us to convey *uncertainty*. Stated another way, even if we believe that the process underlying the network is not random at all, we can still extract value by using a statistical model. To understand the importance of uncertainty, consider the following scenarios:
1. Lack of information: In practice, we will almost never have all of the information about the underlying system that produced the network we observe. Uncertainty can be used in place of the information we did not get to observe. For instance, in our social network example, we might only know which school people are from. We might not know which classes people have taken or which grade they are in, even though we would expect these facts to affect whether a given pair of people have a higher chance of being friends.
2. We might think the network is deterministic, rather than stochastic: In the extreme case, we might think that if we had *all* of the information which underlies the structure of a network, we could determine exactly what realizations would look like with perfect accuracy (that is, the network is deterministic). Even if we knew exactly what realizations of the network might look like, this description, too, is not likely to be very valuable. If we were to condition on everything, our model would be extremely complex and would likely require a large amount of data. For instance, in our social network example, to know whether two people were friends with perfect accuracy, we might need to have intimate knowledge of the life of every single person in our network (Did they just have a fight with somebody and disconnect from that person? Did they just go to a school dance and meet someone new?).
3. We learn from uncertainty and simplicity: When we want to do statistical inference, it is rarely the case that we want to prioritize a complex, convoluted description of the data that matches the data near perfectly. Instead, we are usually interested in knowing how faithfully a much smaller, more generally applicable, set of variables might describe the network. This relates directly to the concept of the bias-variance tradeoff from machine learning, in which we prefer a model which is not too specific to the data we happened to observe (lower variance) while still describing the system effectively (lower bias).

Therefore, it is crucial to incorporate randomness and uncertainty to understand network data. In practice, we select a model which is appropriate from a family of candidate models on the basis of three primary factors:
1. Utility: The model of interest possesses the level of refinement or complexity that we need to answer our scientific question of interest,
2. Estimation: The data has the level of breadth to facilitate estimation of the parameters of the model of interest, and
3. Model Selection: The model is appropriate for the data we are given.

For the rest of this section, we will develop intuition for the first point. Later sections will cover estimation of parameters and model selection.