# Why Use Statistical Models?

Consider a social network. A common question we might ask as scientists is how to describe the network in the simplest way possible. For instance, if we know nothing about the people, or how they might be connected in the social network, we might simply say that each pair of people has some fixed probability of being friends. On the other hand, if we know that the people in the social network are students from several different schools, we might want to say that people from the same school have a higher probability of being friends than people from different schools. Choosing to characterize the network in one of these (or other) ways amounts to choosing an underlying statistical model.
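To make these two descriptions concrete, here is a minimal sketch that writes each one down as a matrix of pairwise friendship probabilities. The function names, the six hypothetical students, and the probability values are all assumptions chosen for illustration; they are not estimates from any real network.

```python
import numpy as np

def single_probability_model(n, p):
    """Every pair of distinct people is friends with the same probability p."""
    P = np.full((n, n), p, dtype=float)
    np.fill_diagonal(P, 0.0)  # nobody is counted as their own friend
    return P

def school_model(schools, p_same, p_diff):
    """Same-school pairs are friends with probability p_same, other pairs with p_diff."""
    schools = np.asarray(schools)
    same_school = schools[:, None] == schools[None, :]
    P = np.where(same_school, p_same, p_diff).astype(float)
    np.fill_diagonal(P, 0.0)
    return P

# Hypothetical example: six students, three from each of two schools.
schools = ["A", "A", "A", "B", "B", "B"]
P_simple = single_probability_model(len(schools), p=0.3)
P_school = school_model(schools, p_same=0.6, p_diff=0.1)

print(P_simple)
print(P_school)
```

The first model needs a single number to describe the whole network, while the second needs two numbers plus each person's school. That difference in complexity is the kind of trade-off that choosing a statistical model involves.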

The network we actually observe (for which we have vertices, edges, and perhaps network attributes) is *not* the true network; rather, we assume that the true network is one which we could *never* observe completely, as each time we look at the network, we will see it slightly differently. In our social network example above, for instance, a person in our network might have a slightly different group of friends depending on when we look at their friend circle. Stated another way, our observed network is merely a *realization* of an underlying **random network**.

If you are familiar with statistics, this is very similar to how we study random variables. Consider, for instance, studying the outcome of a coin flip. We might suppose the existence of a random variable $X$ which represents the random outcome of the coin flip. We never get to see $X$ itself (the random process that leads to the outcome of the coin flip); rather, we see realizations of $X$ (the outcomes of coin flips). These outcomes take values of $1$ (heads) with probability $p$, and $0$ (tails) with probability $1 - p$; that is, $X$ is a Bernoulli random variable with parameter $p$. Here, instead of realizations of $X$ being either $0$s or $1$s, the realizations of our random network are networks (with vertices, edges, and attributes).

As in univariate statistics, however, merely stating that a network is a realization of a random network says very little. When we described $X$ as Bernoulli, we made a very specific statement about $X$: any description of $X$ which assigns positive probability to values other than $0$ or $1$ is immediately an incorrect description of $X$. That is, by modelling $X$ as Bernoulli, we described a set of statistical assumptions (the Bernoulli distribution) which $X$ obeys. In the same way, we describe our random network using a set of statistical assumptions, referred to as the **statistical network model**.

In this book, we describe networks using an approach called *generative modelling*, which means that we use models which describe *how* the random network underlying our realization could have come about. That is, the statistical network model indicates what, exactly, is random, and how we can characterize its behavior statistically.
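The analogy between coin flips and networks can be sketched in code. The example below is only an illustration under a simple assumption that is ours, not a claim about any particular model in this book: every pair of people is friends independently with the same probability `p`. The helper `sample_network` and all parameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Realizations of a coin flip X: each draw is 1 (heads) with probability p, else 0 (tails).
p = 0.5
flips = rng.binomial(n=1, p=p, size=10)
print("coin flip realizations:", flips)

def sample_network(n, p, rng):
    """Draw one realization of a random network on n people.

    Assumption for illustration: each pair of distinct people is friends
    independently with probability p (one coin flip per pair).
    """
    coin_flips = rng.binomial(n=1, p=p, size=(n, n))
    upper = np.triu(coin_flips, k=1)  # keep one flip per unordered pair
    return upper + upper.T            # friendship is mutual, so symmetrize

# Two looks at the same random network generally give different observed networks.
A1 = sample_network(n=5, p=0.3, rng=rng)
A2 = sample_network(n=5, p=0.3, rng=rng)
print(A1)
print(A2)
print("identical realizations?", np.array_equal(A1, A2))
```

Just as we see outcomes of coin flips rather than $X$ itself, here we only ever see adjacency matrices such as `A1` and `A2`, never the random network that generated them.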

Before we get started, it is important to keep in mind the age-old aphorism attributed to George Box, a pioneering British statistician of the 20$^{th}$ century. George Box stated, "all models are wrong, but some are useful." In this sense, it is important to remember that the statistical model we select is, in practice, *never* the correct model (this holds for any aspect of statistics, not just network statistics). In the context of a network, this means that even if we have a model we think describes our network very well, it is *not* the case that the model we select actually describes the network precisely and correctly. Despite this, it is often valuable to use statistical models for the simple reason that assuming a stochastic process (that is, some *random* process) underlies our data allows us to convey *uncertainty*. Stated another way, even if we believe that the process underlying the network is not random at all, we can still extract value by using a statistical model. To understand the importance of uncertainty, consider the following scenarios:
1. Lack of information: In practice, we will almost never have all of the information about the underlying system that produced the network we observe. Uncertainty can be used in place of the information we did not get to observe. For instance, in our social network example, we might only know which school each person is from. We might not know things like which classes people have taken or which grade they are in, even though we would expect these facts to affect how likely a given pair of people is to be friends.
2. We might think the network is deterministic, rather than stochastic: In the extreme case, we might think that if we had *all* of the information which underlies the structure of a network, we could determine exactly what realizations would look like with perfect accuracy (that is, the network is deterministic). Even if we knew exactly what realizations of the network might look like, this description, too, is not likely to be very valuable. If we were to condition on everything, our model would be extremely complex and would likely require a large amount of data. For instance, in our social network example, to know whether two people were friends with perfect accuracy, we might need to have intimate knowledge of the life of every single person in our network (Did they just have a fight with somebody and disconnect from that person? Did they just go to a school dance and meet someone new?).
3. We learn from uncertainty and simplicity: When we want to do statistical inference, we typically want the *simplest* description possible which is still faithful to the observed data. Rather than conditioning on everything, we are usually interested in knowing how faithfully a much smaller, more generally applicable set of variables might describe the network. In our social network example, if we condition on a large, extremely specific set of covariates, we lose our ability to generalize things we learn about the network to new networks. This is because we would be conditioning on covariates which are unique to our network, or which are too specific for us to learn about properties of our underlying network. For example, in our social network data, it might be more favorable to learn how two students being in or not in the same class impacts their probability of being friends, rather than how a single student's texting habits impact their friend circle specifically. Stated another way, simpler models tend to show more *generalizability* within the context of a network than models which are extremely restrictive and specific.

Therefore, it is crucial to incorporate randomness and uncertainty to understand network data. In practice, we select a model which is appropriate from a family of candidate models on the basis of three primary factors:
1. Utility: The model of interest possesses the level of refinement or complexity that we need to answer our scientific question of interest,
2. Estimation: The data has the level of breadth to facilitate estimation of the parameters of the model of interest,
3. Model Selection: The model is appropriate for the data we are given.

For the rest of this section, we will develop intuition for the first point. Later sections will cover estimation of parameters and model selection.