# Generalized Linear Models 

refs:

* https://www.statsmodels.org/stable/glm.html
* https://stats.stackexchange.com/questions/29271/interpreting-residual-diagnostic-plots-for-glm-models

Generalized linear models (GLMs) can be used when the response is not linearly related to the predictors on its original scale, and they can model different types of response variable. Multiple linear regression is one of the models that can be fit as a GLM. The list and table below summarize the most common cases.

1. Multiple Linear Regression: Y is a numeric value
1. Logistic Regression: Y is a proportion or probability
1. Poisson Regression: Y is a count


| model               | distribution family | response variable | example                               | link function   |
|---------------------|---------------------|-------------------|---------------------------------------|-----------------|
| Poisson regression  | Poisson             | count             |                                       | log             |
| Logistic regression | Binomial            | probabilities     | seed will survive or not after 1 year | logit or probit |
| Linear regression   | Normal              | numeric           | crop yields                           | identity        |
| LogLinear           | Poisson             |                   |                                       |                 |
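
As a rough sketch of how the families in the table are selected in R, the block below simulates one response of each type and fits it with `glm()`. The data and the variable names (`x`, `y_num`, `y_bin`, `y_cnt`) are made up for illustration and are not taken from any dataset in these notes.

```r
set.seed(1)
n <- 200
x <- runif(n, 0, 2)

# Simulated responses (hypothetical data, one per family in the table)
y_num <- 1 + 2 * x + rnorm(n)                              # numeric
y_bin <- rbinom(n, size = 1, prob = plogis(-1 + 2 * x))    # binary
y_cnt <- rpois(n, lambda = exp(0.5 + 0.8 * x))             # count

fit_num <- glm(y_num ~ x, family = gaussian(link = "identity"))
fit_bin <- glm(y_bin ~ x, family = binomial(link = "logit"))
fit_cnt <- glm(y_cnt ~ x, family = poisson(link = "log"))

# Estimated coefficients are on the scale of each link function
round(rbind(gaussian = coef(fit_num),
            binomial = coef(fit_bin),
            poisson  = coef(fit_cnt)), 2)
```

Only the `family` argument changes between the three calls; the formula interface stays the same.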





## GLM assumptions, Pros and Cons



* Assumptions 
    1. The Ys are independent and all follow the same family of distributions
    1. Errors need to be independent but do NOT need to be normally distributed.

* Pros
    1. Does not assume a linear relationship between Y and the predictors; linearity is only required on the scale of the link function. We do not need to transform the response Y to have a normal distribution
    1. **No need for homogeneity of variance**
    1. Errors do not need to be normally distributed
    
* Cons

    1. Tends to require more data for good estimates, since it relies on **MLE** instead of **OLS**

## GLM components




* Random: 
    
    * Defines the response variable. This is done by selecting its probability distribution function
    
* Systematic: the predictors to be included: $X_1, X_2, X_1 X_2, X_3^2$, etc.
    * linear predictor: $X \beta$

* Link: connects the random and systematic components. It is the function $g$ that linearizes the model by relating the expected value of the response variable Y to the linear predictor: $g(E(Y)) = X \beta$.

   
   
For example: 

Linear Regression

* Random: 
    * $Y \sim N(\mu, \sigma^2)$
    
* Systematic: Xs

* Link: 
    * $E(Y) = \mu$
    * $g(E(Y)) = \mu = X \beta$ (identity link)

Logistic Regression

* Random: 
    * $Y \sim B(n,p)$
    
* Systematic: Xs

* Link: 
    * $E(Y/n) = \mu = p$ (the mean proportion of successes)
    * $g(\mu) = \log\left(\frac{\mu}{1 - \mu}\right) = X \beta$ (logit link)


## Linear Regression

For modelling unbounded numeric outcomes

Ex:
1. house price
2. freight market
3. **Crop yield and rainfall**

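The linear model assumes that, at each level of X, the response is normally distributed around a mean that depends on X:
* $Y|X \sim N(\mu(X), \sigma^2)$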



<img src="../images/OLSassumptions-1.png" height="400" width="400">



### Assumption

Models based on OLS. Remember the acronym **LINE**:

* L: there is a linear relationship between the mean of the response variable and the predictors
* I: the errors are independent
* N: the responses are normally distributed at each level of X
* E: the variance is equal (constant) across all levels of X
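
To connect OLS with the GLM framework, the following sketch fits the same simulated data with `lm()` and with `glm()` using the Normal family. The rainfall and yield values are invented for illustration.

```r
set.seed(2)
# Hypothetical data: crop yield as a linear function of rainfall
rainfall   <- runif(100, 20, 80)
crop_yield <- 3 + 0.05 * rainfall + rnorm(100, sd = 0.5)

fit_ols <- lm(crop_yield ~ rainfall)
fit_glm <- glm(crop_yield ~ rainfall, family = gaussian)

# For the Normal family with identity link, the MLE and OLS estimates coincide
all.equal(coef(fit_ols), coef(fit_glm))
```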


### Diagnostics

* Plot residuals: Look for patterns 

    1. Residuals vs fitted values of Y
    1. Residuals vs Xs (included or not included in the model)


<img src="../images/resid-plots.gif" height="400" width="400">


* Check outliers

    1. leverage plots
    1. Cook's distances
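
A minimal sketch of these diagnostics in base R, on simulated data with one planted outlier (the data are hypothetical). It uses `fitted()`, `residuals()`, `hatvalues()` and `cooks.distance()` from the `stats` package.

```r
set.seed(3)
x <- runif(50, 0, 10)
y <- 2 + 0.5 * x + rnorm(50)
y[50] <- y[50] + 8            # plant one outlier (hypothetical)
fit <- lm(y ~ x)

# Residuals vs fitted values: look for curvature or a funnel shape
plot(fitted(fit), residuals(fit), xlab = "Fitted", ylab = "Residuals")
abline(h = 0, lty = 2)

# Leverage and Cook's distance flag unusual or influential points
lev  <- hatvalues(fit)
cook <- cooks.distance(fit)
which.max(cook)               # index of the most influential observation
```

Calling `plot(fit)` on the fitted model produces a similar set of standard diagnostic plots in one step.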


## Logistic regression

For modelling binary outcomes

Ex:
1. email is spam or not
2. treatment is effective or not


### Binomial distribution

https://en.wikipedia.org/wiki/Binomial_distribution



* PMF: $Y \sim B(n,p)$
    * parameters
        * $n$ is the number of trials
        * $p$ is the probability of success
    * $k$ is the number of successes

$
f(k; n, p) = Pr(Y = \text{k successes in n trials}) = \frac{n!}{k!(n-k)!} p^k (1-p)^{n-k}
$


* Mean: $np$
* Variance: $npq$ where $q = 1-p$, so STD: $\sqrt{npq}$
    * the variance is largest at $p = 1/2$
    * the variance depends on the mean and the number of trials $n$

<img src="../images/Binomial_distribution_pdf.png" height="400" width="400">
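
As a quick numerical check of the formulas above, the sketch below evaluates the PMF by hand and compares it with R's built-in `dbinom()`. The values `n = 10` and `p = 0.3` are arbitrary.

```r
n <- 10; p <- 0.3; q <- 1 - p   # arbitrary example values
k <- 0:n

# PMF from the formula vs. R's built-in dbinom()
pmf <- choose(n, k) * p^k * q^(n - k)
all.equal(pmf, dbinom(k, size = n, prob = p))

mean_y <- sum(k * pmf)                 # equals n * p
var_y  <- sum((k - mean_y)^2 * pmf)    # equals n * p * q
c(mean = mean_y, variance = var_y, sd = sqrt(var_y))
```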

### Assumptions


1. The response variable is binary

1. There should be no outliers in the data. 

1. There should be no high intercorrelations (multicollinearity)
    among the predictors.  This can be assessed by a correlation
    matrix among the predictors.
1. Independence: the observations must be independent of one another


1. Linearity of the log of the odds, $log(\frac{p}{1-p})$, as a function of the predictors

1. By definition, the variance of a binomial random variable is $np(1-p)$, so that variability is highest when $p = 0.5$


**If the variance assumption is violated (overdispersion), we can use the quasibinomial family, similar to the quasipoisson family in Poisson regression.**

```r
glm(formula = cbind(YesVotes, NumVotes - YesVotes) ~ distance + 
    pctBlack + distance:pctBlack, family = quasibinomial, data = rrHale.df)

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)
(Intercept)        7.550902   4.585464   1.647    0.144
distance          -0.614005   0.412171  -1.490    0.180
pctBlack          -0.064731   0.065885  -0.982    0.359
distance:pctBlack  0.005367   0.006453   0.832    0.433

(Dispersion parameter for quasibinomial family taken to be 51.5967)

    Null deviance: 988.45  on 10  degrees of freedom
Residual deviance: 274.23  on  7  degrees of freedom

```

  * In the absence of overdispersion, we expect the dispersion parameter estimate to be 1.0. So if you fit the quasibinomial family and the dispersion parameter is close to 1, the binomial family is adequate; if it is much bigger than 1 (51.6 in the output above), keep the quasibinomial family.
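
The dispersion parameter can also be computed by hand as the Pearson chi-square statistic divided by the residual degrees of freedom. The sketch below illustrates this on simulated grouped data; the data-generating choices (30 groups, 50 trials each, extra noise on the logit scale) are assumptions made only to produce overdispersion.

```r
set.seed(4)
m      <- 30                       # number of groups (hypothetical)
trials <- rep(50, m)
x      <- runif(m)
# Overdispersed data: the success probability itself varies between groups
p_i    <- plogis(-0.5 + x + rnorm(m, sd = 0.8))
yes    <- rbinom(m, size = trials, prob = p_i)

fit_b  <- glm(cbind(yes, trials - yes) ~ x, family = binomial)
fit_qb <- glm(cbind(yes, trials - yes) ~ x, family = quasibinomial)

# Dispersion estimate: Pearson chi-square / residual degrees of freedom
phi <- sum(residuals(fit_b, type = "pearson")^2) / df.residual(fit_b)
c(manual = phi, quasibinomial = summary(fit_qb)$dispersion)

# Same coefficients; quasibinomial standard errors are inflated by sqrt(phi)
se_b  <- summary(fit_b)$coefficients[, "Std. Error"]
se_qb <- summary(fit_qb)$coefficients[, "Std. Error"]
cbind(binomial = se_b, quasibinomial = se_qb)
```

The point estimates are identical in both fits; only the standard errors, and therefore the p-values, change.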


### Diagnostics 

* Check p-values
* Check performance on a test dataset
* Plot residuals 
    * Residuals vs fitted values of Y
    * Residuals vs Xs
    
* Influence leverage plot

### Fit model

```r
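# Fit a logistic regression: binomial family with the default logit link
mylogit <- glm(y ~ x1 + x2, data = sim, family = "binomial")
summary(mylogit)      # coefficient table, deviances, AIC
mylogit$coefficients  # estimates on the log-odds scale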
```
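
The coefficients of a logistic regression are on the log-odds scale, so they are usually exponentiated or turned into predicted probabilities for interpretation. The sketch below assumes a simulated data frame `sim` with the same column names as the call above; the true coefficients used in the simulation are arbitrary.

```r
set.seed(5)
# Hypothetical data frame with the same column names as the call above
sim <- data.frame(x1 = rnorm(300), x2 = rnorm(300))
sim$y <- rbinom(300, size = 1,
                prob = plogis(-0.5 + 1.2 * sim$x1 - 0.7 * sim$x2))

mylogit <- glm(y ~ x1 + x2, data = sim, family = "binomial")

exp(coef(mylogit))            # odds ratios per unit increase in each predictor

# Predicted probability for a new observation
newobs <- data.frame(x1 = 1, x2 = 0)
p_new  <- predict(mylogit, newdata = newobs, type = "response")
p_new
# Same value from the inverse logit of the linear predictor
plogis(sum(coef(mylogit) * c(1, 1, 0)))
```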

## Poisson (Count data)

ref: 
* https://bookdown.org/roback/bookdown-bysh/ch-poissonreg.html


For modelling rates or numbers of events per unit of time or space/location

Ex:
1. number of emails received per day
2. number of people sharing a house


Similar to linear regression, in Poisson regression we assume that the response follows a Poisson distribution at each level of X

* $Y|X \sim Pois(\lambda(X))$


> Left: linear regression. Right: **Poisson regression**. **No need for constant variance** 

<img src="../images/OLSpois-1.png" height="400" width="400">


### Poisson distribution

https://en.wikipedia.org/wiki/Poisson_distribution


Expresses the probability of a given number of events occurring in a fixed interval of time or space if these events occur with a known **constant mean rate** and **independently of the time since the last event**.

* PMF: $Y \sim Pois(\lambda)$ 
    * parameters
        * $\lambda$ is the **rate**
    * $k$ is the number of occurrences

$
f(k;\lambda) = Pr(Y = \text{k events in the interval}) = \frac{\lambda^k e^{-\lambda}}{k!}
$


* Mean: $\lambda$
* STD: $\sigma = \sqrt {\lambda}$

> Suppose I am measuring the number of emails received per day and estimate a mean rate of 10. Then the PMF will have a peak at 10, but values slightly smaller or bigger are likely to happen as well. See the graph below:

<img src="../images/poisson_pmf.svg.png" height="400" width="400">
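
The email example can be reproduced numerically. The sketch below evaluates the PMF formula for a rate of 10 and compares it with R's `dpois()`; the upper limit of 60 on `k` is an arbitrary truncation point chosen because the probability mass beyond it is negligible.

```r
lambda <- 10     # mean rate from the example: 10 emails per day
k <- 0:60        # truncated range; the mass above 60 is negligible

# PMF from the formula vs. R's built-in dpois()
pmf <- lambda^k * exp(-lambda) / factorial(k)
all.equal(pmf, dpois(k, lambda))

round(dpois(8:12, lambda), 3)   # values near the mean are all fairly likely
c(mean = sum(k * pmf), variance = sum((k - lambda)^2 * pmf))
```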

### Assumptions



1. The response variable is a **count per unit of time or space**, described by a Poisson distribution

1. **Independence**: the observations must be independent of one another

1. **Linearity**: the log of the mean rate, log(λ), must be a linear function of x

1. **Mean=Variance**: by definition, the mean of a Poisson random variable must be equal to its variance
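
An informal way to look at the mean = variance assumption is to compare the sample mean and sample variance of the response within groups of the predictor. This is only a rough sketch on simulated data; the grouping variable `x` and the rate function are invented for illustration.

```r
set.seed(7)
x <- sample(1:4, 400, replace = TRUE)     # hypothetical grouping variable
y <- rpois(400, lambda = exp(0.3 * x))    # counts that really are Poisson

# Under a Poisson model the two columns should be roughly equal in each group
tab <- cbind(mean = tapply(y, x, mean), variance = tapply(y, x, var))
round(tab, 2)
```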


What to do if the **variance** is much bigger than the **mean** (**over-dispersion**)?

* Use the **quasipoisson** family, in which the variance is the mean multiplied by an estimated dispersion parameter

```r
glm(formula = cases ~ city + age.range, family = quasipoisson(link = "log"), 
    data = nonmel, offset = log(n))

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.5060  -0.4857   0.0164   0.3693   1.2476  

Coefficients:
               Estimate Std. Error t value      Pr(>|t|)    
(Intercept)     -5.4834     0.1117  -49.08 0.00000000038 ***
cityDallas       0.8039     0.0563   14.29 0.00000195327 ***
age.range15_24  -6.1742     0.4932  -12.52 0.00000478575 ***
age.range25_34  -3.5440     0.1805  -19.64 0.00000022172 ***
age.range35_44  -2.3268     0.1373  -16.94 0.00000061180 ***
age.range45_54  -1.5790     0.1227  -12.87 0.00000396332 ***
age.range55_64  -1.0869     0.1195   -9.10 0.00003983884 ***
age.range65_74  -0.5288     0.1170   -4.52        0.0027 ** 
age.range75_84  -0.1157     0.1195   -0.97        0.3656    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for quasipoisson family taken to be 1.161)

    Null deviance: 2789.6810  on 15  degrees of freedom
Residual deviance:    8.2585  on  7  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 4
```

* How to detect over-dispersion?

    * The glm summary with the quasipoisson family reports a value called the dispersion parameter.
    * In the absence of overdispersion, we expect the dispersion parameter estimate to be 1.0. So if you fit quasipoisson and the dispersion parameter is close to 1, you can use poisson; if it is much bigger than 1, you should use quasipoisson or the negative binomial distribution.


* Use the negative binomial distribution
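
The sketch below puts the two remedies side by side on simulated overdispersed counts. The data are hypothetical, and the negative binomial fit assumes the `MASS` package (which provides `glm.nb()`) is installed.

```r
set.seed(6)
x  <- runif(300)
mu <- exp(1 + x)
# Overdispersed counts drawn from a negative binomial (hypothetical data)
y  <- rnbinom(300, mu = mu, size = 1.5)

fit_p  <- glm(y ~ x, family = poisson)
fit_qp <- glm(y ~ x, family = quasipoisson)

summary(fit_qp)$dispersion     # well above 1 signals overdispersion

# Negative binomial alternative; assumes the MASS package is installed
if (requireNamespace("MASS", quietly = TRUE)) {
  fit_nb <- MASS::glm.nb(y ~ x)
  print(AIC(fit_p, fit_nb))    # both models have a likelihood, so AIC is comparable
}
```

The quasipoisson fit has no likelihood, which is why its summary reports `AIC: NA`; comparing models by AIC is only possible for the Poisson and negative binomial fits.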




### Diagnostics 

* Check p-values
* Check performance on a test dataset
* Plot residuals 
    * Residuals vs fitted values of Y
    * Residuals vs Xs
    
* Influence leverage plot

### Fit model



```r
glm(formula = total ~ age, family = poisson, data = fHH1)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)  1.5499422  0.0502754  30.829  < 2e-16 ***
age         -0.0047059  0.0009363  -5.026 5.01e-07 ***
---
(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 2362.5  on 1499  degrees of freedom
Residual deviance: 2337.1  on 1498  degrees of freedom
AIC: 6714

```
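
Because the Poisson model uses a log link, coefficients are easier to read after exponentiating. The sketch below does this arithmetic with the two estimates copied from the output above; the age of 40 is an arbitrary value chosen for illustration.

```r
# Coefficients copied from the summary output above
b0    <- 1.5499422
b_age <- -0.0047059

exp(b0)                   # expected value of `total` when age = 0 (an extrapolation)
exp(b_age)                # multiplicative change in the expected count per extra year
100 * (exp(b_age) - 1)    # the same effect expressed as a percentage change

# Expected value of `total` at a hypothetical age of 40
exp(b0 + b_age * 40)
```

With a fitted model object, the same quantities would come from `exp(coef(fit))` and `predict(fit, type = "response")`.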
