# Why Use Statistical Models?

In network data science, we typically begin with a question of interest and a network that we use to answer it. For instance, we may want to know whether two people being from the same school increases their chances of being friends. The data we have is a social network from a social media site, in which the nodes represent students from the community, and the edges represent whether two people are friends on the site. Unfortunately, our social network dataset carries a great deal of uncertainty. Perhaps a pair of students are friends, but they never got around to adding each other on the site. Maybe two students had a fight and are no longer friends, but never bothered to delete one another as friends on the site. Other factors that we don't know about (sports, hobbies, special interests) might influence whether two people are friends. Our social network might not capture all of the students, and we might be missing a large portion of the community altogether. In these and many other ways, our social network is noisy, and to address our question of interest, we need procedures that account for this uncertainty.

To this end, we turn to *statistical modeling*. In statistical modeling, we explicitly specify our assumptions about the uncertainty in the data we use to address our question of interest. Stated another way, our observed network is assumed to be a *realization* of a governing **random network**. From now on, when we say the word **network** without the word random in front of it, we are referring to a *realization* of a random network.

In machine learning, we typically encounter situations in which we have $n$ observations in $d$ dimensions. Traditional statistical models include univariate statistical models (models for data with $1$ dimension) and multivariate statistical models (models for data with $d > 1$ dimensions), which can capture this traditional data representation. These models are well suited for discovering new insights about individual observations or collections of individual observations. Why do we need special statistical models for networks? Our realizations are not $n$ disparate observations in $d$ dimensions; a realization in network machine learning **is the full network itself**, consisting of nodes, edges, and potential network attributes. We seek to model a representation of the *entire* network so that we can convey insights about properties of the network. To address our question of interest above, we need to characterize how students relate to other students in the network, not describe individual students. To this end, we describe our random network using a set of statistical assumptions, referred to as the **statistical network model**. In this book, we typically describe networks using an approach called *generative modeling*, which means that we use models which describe *how* the random network governing our realization could have come about. Several models also use *discriminative modeling*, in which we have a target variable, and wish to model the probability of the target given an observed network.
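The contrast between the two data representations can be made concrete with a minimal sketch. It assumes a simple, undirected network stored as an adjacency matrix; the five students and their friendships are made up purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Traditional representation: n observations, each described by d features.
# Every row of X is one observation that can be studied on its own.
n, d = 5, 3
X = rng.normal(size=(n, d))

# Network representation: a single realization is the whole n x n adjacency
# matrix. The edge list below is hypothetical (illustration only).
edges = [(0, 1), (0, 2), (1, 2), (3, 4)]
A = np.zeros((n, n), dtype=int)
for i, j in edges:
    A[i, j] = 1  # students i and j are friends
    A[j, i] = 1  # friendship is undirected, so A is symmetric

print("Tabular data shape:", X.shape)
print("Adjacency matrix of the network:")
print(A)
```

A row of `X` is meaningful in isolation, whereas a row of `A` only describes how one student relates to all of the others, which is why the network as a whole is treated as the realization.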

```{admonition} Comparing a Univariate Statistical Model to a Random Network Model
Let's imagine we are tossing a coin, and we want to determine what the probability of the coin landing on heads is. Every time we toss the coin, we will get an outcome that we can see (did the coin land on heads, or did it land on tails?). We call the outcome of this coin toss $x$. To determine the probability of the coin landing on heads, we will assume that the outcome of the coin is random, and that each time we toss the coin, we are *realizing* a random variable $\mathbf{x}$. It is important to emphasize (again!) that $\mathbf{x}$ is random: it doesn't take any fixed value, whereas realizations $x$ take one of two possible values: heads or tails. If a coin toss was heads the first time (that is, a particular realization $x$ is heads), when we flip the coin again, we might not get a heads the second time (since the coin toss $\mathbf{x}$ is random until we realize it). We describe $\mathbf{x}$ using a *univariate statistical model*. For instance, we might think that $\mathbf{x}$ is a Bernoulli-distributed random variable, which means that if we knew the Bernoulli probability $p$, we would know that $\mathbf{x}$ has a $p$ chance of landing on heads. In this case, the Bernoulli distribution is the model for $\mathbf{x}$.

In much the same way, let's think about our social network example again. We have a simple network, $A$, where the nodes represent school students in a county, and the edges represent whether a given pair of students are friends on a social media site. Like each of the coin toss outcomes above, we get to see $A$, and know exactly what values $A$ takes. In the same way as above, we will assume that the network $A$ is a realization of a **random network** $\mathbf A$. Like the coin flip, we can describe $\mathbf A$ using a statistical model, because it is a random variable. The friends an individual has when we realize $\mathbf A$ are random, in the sense that a realization may connect them to people they aren't actually friends with, may omit some of their actual friends, or may connect them to arbitrary people they came into contact with in their lives. This level of uncertainty is captured in the randomness of $\mathbf A$. This time, we will describe $\mathbf A$ using a random network model, which we will learn more about in the next few sections.

Let's summarize what we learned above in a table to familiarize ourselves with the vocabulary:

| | Coin Toss | Social Network |
| --- | --- | --- |
| Observed Data | Outcome of a coin toss $x$ | Simple network $A$ |
| Random Variable | $\mathbf{x}$, where $x$ is a *realization* of $\mathbf{x}$ | Random network $\mathbf A$, where $A$ is a *realization* of $\mathbf A$ |
| Statistical Model | The Bernoulli distribution | A random network model |
```
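The analogy between the coin toss and the social network can be sketched in code. This is a minimal illustration, not a model prescribed by the text: it assumes, purely for simplicity, that every pair of nodes is connected independently with the same probability `p`. The helper names `sample_coin` and `sample_network` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_coin(p, rng):
    """Draw one realization x of a Bernoulli(p) random variable (1 = heads)."""
    return int(rng.binomial(1, p))

def sample_network(n, p, rng):
    """Draw one realization A of a random network on n nodes.

    Assumption (illustration only): each pair of nodes is connected
    independently with the same probability p.
    """
    # Sample every entry, keep the strict upper triangle (no self-loops),
    # then mirror it so the network is undirected.
    upper = np.triu(rng.binomial(1, p, size=(n, n)), k=1)
    return upper + upper.T

# Two realizations of the same random variable need not agree.
x_first, x_second = sample_coin(0.5, rng), sample_coin(0.5, rng)
print("Coin tosses:", x_first, x_second)

# Likewise, two realizations of the same random network generally differ.
A_first = sample_network(6, 0.3, rng)
A_second = sample_network(6, 0.3, rng)
print("Pairs on which the two networks disagree:",
      int(np.triu(A_first != A_second, k=1).sum()))
```

In both cases the random object (the coin, or the random network) is fixed, and only its realizations change from draw to draw.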

## Models aren't Right. Why do we Care?

It is important to keep in mind the age-old aphorism attributed to George Box, a pioneering British statistician of the 20th century: "all models are wrong, but some are useful." In this sense, the statistical model we select is, in practice, *never* the correct model (this holds for any aspect of machine learning, not just network machine learning). In the context of network science, this means that even if we have a model we think describes our network very well, the model does *not* describe the network precisely and correctly. Despite this, statistical models are often valuable for a simple reason: assuming that a stochastic process (that is, some *random* process) governs our data is what allows us to convey *uncertainty*. To understand the importance of leveraging uncertainty, consider the following scenarios:
1. Lack of information: In practice, we never have all of the information about the system that produced the network we observe, and uncertainty can be used in place of that information. For instance, in our social network example, we might only know which school each person is from, but there are many other attributes that would impact the friend circle of a given student. We might not know which classes people have taken or which grade they're in, but we would expect these facts to affect whether a given pair of people are friends. We can use uncertainty in our model to capture the fact that we don't know the classes or grades of the students.
2. We might think the network is deterministic, rather than stochastic: In the extreme case, we might think that if we had *all* of the information which governs the network, then we could determine exactly what realizations would look like with perfect accuracy. Even if we could, such a description isn't likely to be very valuable. If we were to develop a model on the basis of everything, our model would be extremely complex and would require a large amount of data. For instance, in our social network example, to know whether two people were friends with perfect accuracy, we might need to have intimate knowledge of every single person's life (Did they just have a fight with somebody and disconnect from that person? Did they just go to a school dance and meet someone new?).
3. We learn from uncertainty and simplicity: When we do statistical inference, it is rarely the case that we prioritize a complex, convoluted model that mirrors our data suspiciously closely. Instead, we are usually interested in knowing how faithfully a simpler, more generally applicable model might describe the network. This relates directly to the concept of the bias-variance tradeoff from machine learning, in which we prefer a model which isn't overly tailored to the particular data we observed (lower variance) but still describes the system effectively (lower bias).
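
The third scenario can be illustrated with a small simulation. This is a hedged sketch under stated assumptions: the "true" random network connects each pair of nodes independently with probability `p` (chosen only for illustration), the simple model has a single parameter, and the "memorizing" model is a hypothetical extreme that assigns one parameter per pair, set to whatever was observed.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_network(n, p, rng):
    """One realization of a random network in which each pair of nodes is
    connected independently with probability p (illustrative assumption)."""
    upper = np.triu(rng.binomial(1, p, size=(n, n)), k=1)
    return upper + upper.T

n, p = 50, 0.3
A_train = sample_network(n, p, rng)  # the network we observe
A_new = sample_network(n, p, rng)    # a fresh realization of the same random network

# Look at each unordered pair of nodes exactly once.
iu = np.triu_indices(n, k=1)
train, new = A_train[iu], A_new[iu]

# Simple model: a single edge probability estimated from the observed network.
p_hat = train.mean()
err_simple = np.mean((new - p_hat) ** 2)

# Memorizing model: one parameter per pair, equal to the observed edge.
# It reproduces A_train perfectly but carries its noise over to new data.
err_memorize = np.mean((new - train) ** 2)

print(f"Estimated edge probability: {p_hat:.3f}")
print(f"Squared error on a new realization, simple model:     {err_simple:.3f}")
print(f"Squared error on a new realization, memorizing model: {err_memorize:.3f}")
```

Under these assumptions, the model that mirrors the observed network exactly tends to describe a new realization worse than the one-parameter model, which is the sense in which simplicity can be more informative.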

Therefore, it is crucial to incorporate randomness and uncertainty to understand network data. In practice, we select a model which is appropriate from a family of candidate models on the basis of three primary factors:
1. Utility: The model of interest possesses the level of refinement or complexity that we need to answer our scientific question of interest,
2. Estimation: The data has the level of breadth to facilitate estimation of the parameters of the model of interest, and
3. Appropriateness: The model is appropriate for the data we are given.

For the rest of this section, we will develop intuition for the first point. Later sections will cover estimation of parameters and model selection.