## Generalized Linear Models

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from jupyprint import jupyprint, arraytex

[Recall the linear model...]

[Foundational statistical model in culture 1 language]

[Foundational machine learning algorithm in culture 2 language]

Where there are $k$ predictor variables:

$ \Large \hat{y}_i = b_0 + b_1 x_{1i} + \dots + b_k x_{ki} $
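As a minimal sketch of the formula above, the prediction for every observation can be computed in one step with a matrix product. The numbers and the names `X_demo`, `intercept_demo` and `slopes_demo` are made up for illustration, they are not part of this page's data:

```python
import numpy as np

# hypothetical data: 3 observations (rows), k = 2 predictors (columns)
X_demo = np.array([[1.0, 2.0],
                   [3.0, 0.5],
                   [2.0, 2.0]])

intercept_demo = 1.0                  # plays the role of b_0
slopes_demo = np.array([2.0, -1.0])   # play the roles of b_1 and b_2

# y_hat_i = b_0 + b_1 * x_1i + b_2 * x_2i, for every observation at once
y_hat_demo = intercept_demo + X_demo @ slopes_demo

print(y_hat_demo)
```

Each row of `X_demo` holds the predictor scores for one observation, so the matrix product does the "multiply by the slope, then add up" step for all $k$ predictors.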

Let's keep it simple for now and use just one predictor (but it all generalizes to multiple predictors also):

$ \Large \hat{y}_i = b_0 + b_1 x_{i} $

Another way of writing this, using the notation of conditional expectation, is:

$ \Large E(y_i|x_{i}) = b_0 + b_1 x_{i} $

In English this reads as:
- "the expected value of $y_i$, given the value of $x_i$ is equal to $b_0 + b_1 x_{i}$" 
- ... or ...
- "the expected value of $y_i$ is a linear function of $x_i$ with the slope $b_1$ and the intercept $b_0$".
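One way to get a feel for the "expected value" reading is a small simulation. The sketch below assumes (purely for illustration) normally distributed noise with a mean of zero, and uses made-up values for the intercept, the slope and the predictor score; none of these choices come from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical intercept, slope and a single predictor value
sim_b0, sim_b1 = 1.0, 2.0
x_value = 4.0

# simulate many outcome scores that all share the same predictor value;
# the noise (an assumption of this sketch) is normal with mean zero
noise = rng.normal(loc=0.0, scale=1.0, size=100_000)
y_sim = sim_b0 + sim_b1 * x_value + noise

# the average of the simulated outcomes should be close to b_0 + b_1 * x
print(y_sim.mean())
print(sim_b0 + sim_b1 * x_value)
```

The individual simulated scores scatter around the line, but their average sits very close to $b_0 + b_1 x_i$, which is what the conditional expectation describes.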

Let's represent "$E(y_i|x_{i})$" using the symbol "$\hat{\mu}_i$" (read as "mu-hat").

We can now write the linear regression model as:

$ \Large \hat{\mu}_i = b_0 + b_1 x_{i} $

Now let's write a really boring Python function, called $f()$:

In [None]:
def f(mu):
    # multiplying by 1 leaves the input unchanged
    result = mu * 1

    return result

<b>Question:</b> Just from looking at the function, which of these is true?

$ \Large \hat{\mu}_i \neq f(\hat{\mu}_i)$

or

$ \Large \hat{\mu}_i = f(\hat{\mu}_i)$




In [None]:
# our simulated predictor variable scores
x = np.array([4, 6, 9, 2, 6, 7, 3, 3, 6, 7])

b_0 = 1

b_1 = 2

In [None]:
mu = b_0 + b_1 * x

mu

In [None]:
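# raw strings (r"...") stop Python treating the LaTeX backslashes as escape sequences
jupyprint(r"$\Large \hat{\mu}_i = b_0 + b_1 x_{i} $")

In [None]:
# show the same equation with the actual intercept, slope and predictor scores
jupyprint(r"$\Large \hat{\mu}_i = " + f"{b_0} + {b_1} * {arraytex(np.atleast_2d(x).T)} $")

In [None]:
# show the predictions themselves on the left-hand side
jupyprint(rf"$\Large {arraytex(np.atleast_2d(mu).T)} = {b_0} + {b_1} * {arraytex(np.atleast_2d(x).T)} $")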

In [None]:
f(mu)

In [None]:
mu

In [None]:
# check that every element of the output equals the matching element of the input
np.all(f(mu) == mu)

[Let's call $f()$ the "identity function", because whatever input we give it, it returns the same value(s) as its output].

This is the *key thing to take away from this page*: **generalized linear models are just linear models where $f()$ is some other, more complex function than the identity function**.

$\Large f(\hat{\mu}_i) = b_0 + b_1 x_{i} $

$\Large f(\hat{\mu}_i) = b_0 + b_1 x_{1i} + b_2 x_{2i} + \dots + b_k x_{ki}$
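To make this concrete, here is a minimal sketch that swaps the identity function for the natural log. The function names `log_link` and `inverse_log_link`, and the intercept, slope and predictor values, are all made up for this illustration. Because $f(\hat{\mu}_i)$ equals the straight line, we get $\hat{\mu}_i$ itself by "undoing" $f()$ with its inverse:

```python
import numpy as np

def log_link(mu_values):
    # a non-identity choice of f(): the natural log
    return np.log(mu_values)

def inverse_log_link(linear_predictions):
    # undoes log_link: exp(log(mu)) gives back mu
    return np.exp(linear_predictions)

# hypothetical predictor scores, intercept and slope
x_demo = np.array([0.0, 1.0, 2.0, 3.0])
demo_b0, demo_b1 = 0.5, 0.3

# the right-hand side of the model: an ordinary straight line
line_demo = demo_b0 + demo_b1 * x_demo

# f(mu) = line, so mu = inverse of f applied to the line
mu_demo = inverse_log_link(line_demo)

print(mu_demo)            # the predictions: all positive, not evenly spaced
print(log_link(mu_demo))  # back on the straight-line scale
```

The predictions in `mu_demo` curve upwards, but after passing them through `log_link` they increase by the same amount (the slope) for every one-unit step in `x_demo`.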

This allows us to use the slope/intercept framework of linear regression (and its associated interpretations) with different types of outcome variable.

[E.g. linear regression is used for numeric outcome variables; it produces predictions which range from $-\infty$ to $\infty$]

[Some outcome variables do not cover this range, e.g. dummy-coded categorical variables which take values only of 0 or 1]

[We can use generalized linear models with these other types of variable]

[We just need a function, $f()$, which *maps* or *links* linear predictions on the $-\infty$ to $\infty$ scale to the actual scale of the outcome variable]

[One way of thinking of these models is that there is *some scale on which the model predictions form a straight line* (or plane/hyperplane, in the case of multiple predictors)].
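The sketch below illustrates this for a 0/1 outcome, using the logit function and its inverse. The function names and the example values on the straight-line scale are assumptions made for this illustration:

```python
import numpy as np

def logit(p):
    # maps values between 0 and 1 onto the -infinity to infinity scale
    return np.log(p / (1 - p))

def inverse_logit(z):
    # maps any value on the straight-line scale to a value between 0 and 1
    return 1 / (1 + np.exp(-z))

# hypothetical evenly spaced values on the straight-line scale
linear_scale = np.linspace(-10, 10, 5)

# the same values, mapped onto the scale of the outcome variable
probabilities = inverse_logit(linear_scale)

print(probabilities)         # every value lies strictly between 0 and 1
print(logit(probabilities))  # recovers the evenly spaced values
```

However extreme the values on the straight-line scale become, the mapped values never leave the interval between 0 and 1, which is what we want when predicting the probability of a dummy-coded outcome.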

| Model type          | Range of outcome values | Example outcome variable | Link name | Link function $f(\mu)$   |
|---------------------|-------------------------|--------------------------|-----------|--------------------------|
| Linear Regression   | (-∞, ∞)                 | Income                   | Identity  | $1 \times \mu$           |
| Logistic Regression | {0, 1}                  | Died/Survived            | Logit     | $\ln(\frac{\mu}{1 - \mu})$ |
| Poisson Regression  | {0, 1, 2, …}            | Number of children       | Log       | $\ln(\mu)$               |
| Gamma Regression    | (0, ∞)                  | Waiting times            | Inverse   | $1/\mu$                  |
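The link functions in the table can be written as short Python functions. The sketch below pairs each one with its inverse and checks that the inverse undoes the link. The dictionary names and the example values of $\mu$ are made up for illustration, chosen so that each value is valid for its link (e.g. between 0 and 1 for the logit):

```python
import numpy as np

# each entry: (link function, inverse of the link function)
link_pairs = {
    "identity": (lambda m: 1 * m, lambda z: z),
    "logit": (lambda m: np.log(m / (1 - m)), lambda z: 1 / (1 + np.exp(-z))),
    "log": (np.log, np.exp),
    "inverse": (lambda m: 1 / m, lambda z: 1 / z),
}

# hypothetical mean values that are valid for each link
example_means = {
    "identity": np.array([-2.0, 0.0, 3.5]),
    "logit": np.array([0.1, 0.5, 0.9]),
    "log": np.array([0.5, 2.0, 7.0]),
    "inverse": np.array([0.5, 2.0, 7.0]),
}

# apply each link, then its inverse, and check we get the original values back
round_trip_ok = {}
for name, (link, inverse_link) in link_pairs.items():
    means = example_means[name]
    round_trip_ok[name] = bool(np.allclose(inverse_link(link(means)), means))

print(round_trip_ok)
```

Going from $\mu$ to the straight-line scale and back again should return the values we started with, for every link in the table.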