## Course: Building Statistical Models Using StatsModels
### Course Author: Janani Ravi

# Module 1: Exploring Statistical Properties Using StatsModels

```python
from inspect import getsource
import statistics
```

```python
statistics.mean([1, 2, 3])
```

```
2
```

```python
print(getsource(statistics.mean))
```

```
def mean(data):
    """Return the sample arithmetic mean of data.

    >>> mean([1, 2, 3, 4, 4])
    2.8

    >>> from fractions import Fraction as F
    >>> mean([F(3, 7), F(1, 21), F(5, 3), F(1, 3)])
    Fraction(13, 21)

    >>> from decimal import Decimal as D
    >>> mean([D("0.5"), D("0.75"), D("0.625"), D("0.375")])
    Decimal('0.5625')

    If ``data`` is empty, StatisticsError will be raised.
    """
    if iter(data) is data:
        data = list(data)
    n = len(data)
    if n < 1:
        raise StatisticsError('mean requires at least one data point')
    T, total, count = _sum(data)
    assert count == n
    return _convert(total/n, T)
```



It is important to consider removing skewness from the data before modeling.
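
As a minimal sketch of why this matters, the snippet below measures skewness before and after a log transform. The log-normal sample, the seed, and the `skewness` helper are assumptions made for illustration; they are not part of the course material.

```python
import numpy as np

def skewness(values):
    """Sample skewness: the mean cubed z-score (about 0 for symmetric data)."""
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std()
    return float(np.mean(z ** 3))

rng = np.random.default_rng(0)
# Log-normal data has a long right tail, so it is positively skewed.
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=5_000)
# Taking logs recovers the underlying, symmetric normal draws.
transformed = np.log(skewed)

print(f"skewness before: {skewness(skewed):.2f}")
print(f"skewness after:  {skewness(transformed):.2f}")
```

A log transform only applies to strictly positive data; other transforms may suit other shapes of skew.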

# Module 2: Building Linear Models Using StatsModels

## Regression
### Requirements for using regression
The residuals (errors) of the model should:
- have zero mean
- have constant variance
- be independent of each other
- be independent of x
- be normally distributed

## Heteroscedasticity: non-constant variance

- This can be a serious problem when building regression models
- Stock prices are heteroscedastic 
    - Trending up over time
    - Changing variance
    
### Detecting heteroscedasticity
- A scatter plot of the residuals has a fanning-out shape (like a sideways delta)
- An R^2 that looks too good to be true (> 80%) can be a warning sign

### Implications
- OLS coefficient estimates are still unbiased, but they are no longer the most efficient estimates
- However, the estimated standard errors of the regression parameters are now biased
- Confidence intervals may be worse than they appear
- Using regression for prediction could be risky

### Solutions
- Transform the data
    - use of log returns
    - for stock prices
        - use returns instead of price
        - or use log price
- Use different regression model
    - weighted least squares
    - generalized least squares
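
The "use returns or log price" transformation can be sketched as follows. The price series is hypothetical (a simulated geometric random walk), so the numbers only illustrate the mechanics.

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical price series: geometric random walk with a small upward drift.
log_steps = rng.normal(loc=0.001, scale=0.02, size=1_000)
prices = 100.0 * np.exp(np.cumsum(log_steps))

simple_returns = prices[1:] / prices[:-1] - 1.0   # returns instead of price
log_returns = np.diff(np.log(prices))             # differences of log price

half = len(log_returns) // 2
print("price std, 1st vs 2nd half:     ", prices[:half].std(), prices[half:].std())
print("log-return std, 1st vs 2nd half:", log_returns[:half].std(), log_returns[half:].std())
```

The two kinds of return are linked by `log(1 + simple_return) == log_return`, so they are nearly equal when returns are small.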
    
## Generalized Least Squares (GLS)
A technique for fitting a “better” regression line when the residuals of an OLS model exhibit heteroscedasticity or are correlated with one another

## Weighted Least Squares (WLS)
Weighted least squares (WLS) is a specialization of GLS regression
- OLS minimizes mean squared error (MSE)
- WLS minimizes weighted MSE
- **Typically very hard to use in real life, because finding the right weights is hard**
- Use cases
    - Data is heteroscedastic
    - Regression should concentrate on specific data points
    - Not all data points are equal
    - The linear regression is part of another, non-linear procedure
- Drawbacks
    - What weights to use? They need to be specified, which is the major drawback of WLS
    - Needs very precise weight estimates
    - Sensitive to outliers
    
**Transforming data is a more commonly used way of dealing with heteroscedasticity than using GLS or WLS**

## Generalized Linear Models
A flexible generalization of ordinary linear regression that allows for non-normal y-variables, **even categorical**

Elements of a GLM
- Probability distribution of Y
    - Normal
    - Binomial
    - Categorical
    - more
- Mean function
    - Relationship between regression parameters and mean of Y
- Link function
    - Transformation to make X-Y relationship linear

“It just works”: GLMs are a great way to fit linear models to binary or multinomial data without going deep into math

## Robust Linear Models
Modified regression algorithms that perform better than OLS in the presence of outliers (and also in cases of heteroscedasticity)
- Regression using OLS works well when the basic assumptions about the underlying data are true
- OLS regression is highly sensitive to outliers

Usually superior to OLS regression, but still not as popular:
- complex to understand
- multiple competing algorithms
- computationally intensive
- not supported in Excel and other popular tools

## Summary 
- Ordinary least squares regression makes many assumptions about data
- Generalized or weighted least squares for heteroscedasticity
- Generalized linear models for non-normal y-variables
- Robust linear models to cope with outliers

# Module 3: Exploring Time Series Data Using StatsModels

## Time Series
- A time series is a sequence of data taken at successive and usually equally spaced points in time.

- Time series models are especially vulnerable to problems of **non-stationarity**

## Stationarity 

### Non-stationary Data
- Mean changes over time
- Variance changes over time
- Autocorrelation changes over time
- Applying regression to non-stationary data yields a poor model
    - inflated R^2
    - Problems with heteroscedasticity

### Stationary Data
- Mean of time series does not change over time
- Variance of time series does not change over time (homoscedasticity)
- Autocorrelation does not change over time

### Detecting Stationarity
- Statistical tests exist to test for non-stationarity
- In practice, simple forms of non-stationarity can be found from plotting data
- More complex forms require statistical tests

#### Visualization
- Non-stationary
    - trending in a certain direction (stocks trend up)
    - periods of high and low volatility
    - varying autocorrelation (spread changes)
    
### Fixing non-stationary data
Make non-stationary data stationary:
- Convert to log differences
- Convert to returns

## Time Series forecasting

### Auto Regressive Model (AR(p))
#### Autoregression Formula
> $y_t = A + B \cdot y_{t-1}$
#### General Form of Linear Model
> $Y_t = c + \sum_{i=1}^{p} \phi_i X_{i,t} + \varepsilon_t$
#### General Form of Autoregressive Model
> $Y_t = c + \sum_{i=1}^{p} \phi_i Y_{t-i} + \varepsilon_t$

- Last p values of Y influence current value of Y
- p is the number of lags (the order of the model)
- **Future values of Y depend on past values of Y and on current value of white noise**

#### White noise error term: ε_t
Make the same assumptions about the white noise error term as for the residuals in linear regression models:
- zero-mean
- constant-variance
- normally distributed
- Independent and identically distributed (IID)

AR models have a **single** error term
- models that depend on previous error terms are moving average (MA)

### Moving Average Model (MA(q))
Moving average of the last q values of ε (error)

#### General Form of Moving Average Model
> $Y_t = \mu + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \dots + \theta_q \varepsilon_{t-q}$

- Value of Y depends on last q values of the white noise process
- q is the number of lags (the order of the model)
- **Future values of Y depend on current and past values of white noise alone**

### ARMA(p,q) Model
Combination of AR and MA models

#### General Form of ARMA Model
> $Y_t = \mu + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \dots + \theta_q \varepsilon_{t-q} + \phi_1 Y_{t-1} + \dots + \phi_p Y_{t-p}$

- p is the number of lags for the AR part
- q is the number of lags for the MA part
- **Future values of Y depend on past values of Y and on current and past values of white noise**


### Finding p,q for AR(p) and MA(q)
```
                    ---> Find q for MA(q) Model
                ---> ACF Plot
            ---> autocorrelation
        ---> correlation
start --
        ---> Partial correlation
            ---> partial autocorrelation 
                ---> PACF Plot
                     ---> Find p for AR(p) Model
```    

#### Correlation
The measure of the relationship between two items or variables

#### Autocorrelation
Measures the relationship between a variable’s current value and its past values
- ranges from -1 to 1
    - 1 == perfect positive correlation
    - -1 == perfect negative correlation
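
The two extremes can be seen on a toy series. The `lag_correlation` helper below is hypothetical: it computes the plain Pearson correlation between the series and a shifted copy, which differs slightly from the estimator used in ACF plots but conveys the same idea.

```python
import numpy as np

def lag_correlation(series, lag):
    """Pearson correlation between a series and itself shifted by `lag`."""
    series = np.asarray(series, dtype=float)
    return float(np.corrcoef(series[lag:], series[:-lag])[0, 1])

alternating = [1, -1, 1, -1, 1, -1, 1, -1]
# Lag 1: every value is the opposite of the previous one.
print("lag 1:", lag_correlation(alternating, 1))
# Lag 2: every value repeats two steps later.
print("lag 2:", lag_correlation(alternating, 2))
```

The lag-1 result sits at the negative extreme and the lag-2 result at the positive extreme.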

#### Partial Autocorrelation
Conceptually similar to autocorrelation; based on partial correlation of a series with lagged versions of itself

#### ACF and PACF Plots
- Lag 0 is always equal to 1
    - a time series is always perfectly correlated with itself
- The shaded boundary marks the bounds of statistical significance
    - use it to check how many of the lags are statistically significant
    
#### Finding correct q and p values 
- ACF plot of an MA(q) process cuts off after q lags
- PACF plot of an AR(p) process cuts off after p lags
- If the significance of the lags tapers off slowly, with no abrupt cut-off, then don't use a lag at all
    - **MIGHT BE DIFFERENT FOR P AND Q RESPECTIVELY**
    - This follows the **principle of parsimony**, AKA Occam's Razor
        - the simplest explanation is usually the right one

**Process (for both ACF and PACF separately)**
1. Plot ACF/PACF
2. Find the lag cut-off (if there is no good cut-off for BOTH ACF and PACF, then the data is probably non-stationary)
    a. lag cut-off for ACF == q max
    b. lag cut-off for PACF == p max
3. Test a few ARMA models
    a. p ranging from 0 to p max
    b. q ranging from 0 to q max
4. Select the model with the best score
    a. AIC
    b. BIC
    c. HQIC

#### Selecting a model 
##### AIC 
Akaike's Information Criterion
- Estimates the relative information lost by the model
- Lower score == less information lost
    - Less information lost == better model

##### BIC
Bayesian Information Criterion
- Similar to AIC, but with a heavier penalty on the number of parameters for all but very small samples
- Lower score == less information lost
    - Less information lost == better model
    
##### HQIC
Hannan-Quinn Information Criterion
- Again, lower score == less information lost
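
The three scores share one structure: a penalty for the number of parameters minus twice the log-likelihood. The sketch below writes out the standard textbook formulas; the two candidate models and their log-likelihoods are hypothetical numbers chosen for illustration.

```python
import math

def aic(log_likelihood, k):
    """Akaike: penalty of 2 per parameter."""
    return 2 * k - 2 * log_likelihood

def bic(log_likelihood, k, n):
    """Bayesian: penalty of ln(n) per parameter."""
    return k * math.log(n) - 2 * log_likelihood

def hqic(log_likelihood, k, n):
    """Hannan-Quinn: penalty of 2 * ln(ln(n)) per parameter."""
    return 2 * k * math.log(math.log(n)) - 2 * log_likelihood

n = 200                                  # hypothetical number of observations
candidates = {"simple (k=3)": (-520.0, 3), "complex (k=6)": (-518.0, 6)}
for name, (llf, k) in candidates.items():
    print(name, aic(llf, k), round(bic(llf, k, n), 2), round(hqic(llf, k, n), 2))
```

Here the complex model fits slightly better, yet all three criteria prefer the simple one because the gain in likelihood does not cover the extra parameters.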
