<table align="center">
   <td align="center"><a target="_blank" href="https://colab.research.google.com/github/ds5110/summer-2021/blob/master/08a-statsmodels.ipynb">
<img src="https://github.com/ds5110/summer-2021/raw/master/colab.png"  style="padding-bottom:5px;" />Run in Google Colab</a></td>
</table>

# 08a-statsmodels

Introduction to statsmodels

* [statsmodels introduction](https://www.statsmodels.org/stable/index.html) (v0.12) -- statsmodels.org
  * statsmodels v0.10 is installed in colab
* [statsmodels getting started](https://www.statsmodels.org/devel/gettingstarted.html) (v0.13) -- statsmodels.org
  * multivariate least squares with categorical inputs

In [1]:
import statsmodels.api as sm
import numpy as np

# Getting started with Python-style API and simulated dataset
nobs = 100
X = np.random.random((nobs, 2))
X = sm.add_constant(X)
beta = [1, .1, .5]
e = np.random.random(nobs)
y = np.dot(X, beta) + e

# Fit regression model
results = sm.OLS(y, X).fit()

# Inspect the results
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.396
Model:                            OLS   Adj. R-squared:                  0.383
Method:                 Least Squares   F-statistic:                     31.78
Date:                Tue, 13 Jul 2021   Prob (F-statistic):           2.43e-11
Time:                        17:21:17   Log-Likelihood:                -10.043
No. Observations:                 100   AIC:                             26.09
Df Residuals:                      97   BIC:                             33.90
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          1.4055      0.071     19.692      0.0

# R-style formulas

* The next cell performs multiple linear regression (two predictors) using R-style formulas
* It's the [introductory demo](https://www.statsmodels.org/stable/index.html) at statsmodels.org

In [2]:
# Getting started with statsmodels using R-style formulas
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

dat = sm.datasets.get_rdataset("Guerry", "HistData").data
results = smf.ols('Lottery ~ Literacy + np.log(Pop1831)', data=dat).fit()
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:                Lottery   R-squared:                       0.348
Model:                            OLS   Adj. R-squared:                  0.333
Method:                 Least Squares   F-statistic:                     22.20
Date:                Tue, 13 Jul 2021   Prob (F-statistic):           1.90e-08
Time:                        17:21:19   Log-Likelihood:                -379.82
No. Observations:                  86   AIC:                             765.6
Df Residuals:                      83   BIC:                             773.0
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
Intercept         246.4341     35.233     

# R-style demo adapted to Python-style API

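This cell produces results identical to the cell above. Only the label of the intercept term differs: `const` here versus `Intercept` in the formula version.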

In [3]:
# R-style formula: Lottery ~ Literacy + np.log(Pop1831)
X = dat[['Literacy', 'Pop1831']].copy()
X.iloc[:,1] = np.log(X.iloc[:,1])
X = sm.add_constant(X)
y = dat['Lottery']

# Fit regression model
results = sm.OLS(y, X).fit()

# Inspect the results
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:                Lottery   R-squared:                       0.348
Model:                            OLS   Adj. R-squared:                  0.333
Method:                 Least Squares   F-statistic:                     22.20
Date:                Tue, 13 Jul 2021   Prob (F-statistic):           1.90e-08
Time:                        17:21:19   Log-Likelihood:                -379.82
No. Observations:                  86   AIC:                             765.6
Df Residuals:                      83   BIC:                             773.0
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        246.4341     35.233      6.995      0.0

## Digression: a bug that doesn't throw errors!

Without `.copy()`, the code above generates a `SettingWithCopyWarning` (a view-vs-copy warning) that comes from Pandas.

* [Returning a view versus a copy](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy) -- pandas.pydata.org

Here's some pseudo-code that explains what's going on...
```
def do_something(df):
    foo = df[['bar', 'baz']]  # Is foo a view? A copy? Nobody knows!
    # ... many lines here ...
    # We don't know whether this will modify df or not!
    foo['quux'] = value
    return foo
```

The next few cells provide an explicit example.

In [4]:
import pandas as pd

# One way to get rid of the warning...DO NOT DO THIS!!!
pd.options.mode.chained_assignment = None  # default='warn'

In [5]:
# Test dataframe
dfb = pd.DataFrame({'a': ['one', 'one', 'two',
                          'three', 'two', 'one', 'six'],
                    'c': np.arange(7)})

In [6]:
# Compare the result from this line...
dfb[dfb['a'].str.startswith('o')]['c'] = 42

In [7]:
# ...to the result of this line
dfb['c'][dfb['a'].str.startswith('o')] = 42
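
The two cells above can be contrasted with a single `.loc` assignment, which is the form the Pandas documentation recommends. The following is a minimal sketch that rebuilds the same test dataframe. The variable `mask` and the use of `warnings.catch_warnings` are additions for illustration. The second chained form (`dfb['c'][mask] = 42`) is left out because whether it modifies `dfb` depends on the Pandas version and its copy-on-write setting.

```python
import warnings

import numpy as np
import pandas as pd

dfb = pd.DataFrame({'a': ['one', 'one', 'two',
                          'three', 'two', 'one', 'six'],
                    'c': np.arange(7)})
mask = dfb['a'].str.startswith('o')

# Chained indexing: dfb[mask] is a temporary copy, so the assignment
# is applied to that copy and then thrown away
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # silence the warning for this demo only
    dfb[mask]['c'] = 42
print(dfb['c'].tolist())  # [0, 1, 2, 3, 4, 5, 6]

# A single .loc call selects rows and column in one step and modifies dfb
dfb.loc[mask, 'c'] = 42
print(dfb['c'].tolist())  # [42, 42, 2, 3, 4, 42, 6]
```

The first assignment is the "bug that doesn't throw errors": with the warning silenced, it runs cleanly and changes nothing.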

# Advertising dataset

In [8]:
import pandas as pd
url = "https://www.statlearning.com/s/Advertising.csv"
 
df = pd.read_csv(url, index_col=0)
df

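|     | TV    | radio | newspaper | sales |
|-----|-------|-------|-----------|-------|
| 1   | 230.1 | 37.8  | 69.2      | 22.1  |
| 2   | 44.5  | 39.3  | 45.1      | 10.4  |
| 3   | 17.2  | 45.9  | 69.3      | 9.3   |
| 4   | 151.5 | 41.3  | 58.5      | 18.5  |
| 5   | 180.8 | 10.8  | 58.4      | 12.9  |
| ... | ...   | ...   | ...       | ...   |
| 196 | 38.2  | 3.7   | 13.8      | 7.6   |
| 197 | 94.2  | 4.9   | 8.1       | 9.7   |
| 198 | 177.0 | 9.3   | 6.4       | 12.8  |
| 199 | 283.6 | 42.0  | 66.2      | 25.5  |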


Compare the next cell to p68 of ISLR, 1st Edition.

In [9]:
# Advertising dataset: sales vs TV
X = df[['TV']].copy()
X = sm.add_constant(X)
y = df['sales'].copy()

# Fit regression model
results = sm.OLS(y, X).fit()

# Inspect the results
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:                  sales   R-squared:                       0.612
Model:                            OLS   Adj. R-squared:                  0.610
Method:                 Least Squares   F-statistic:                     312.1
Date:                Tue, 13 Jul 2021   Prob (F-statistic):           1.47e-42
Time:                        17:21:19   Log-Likelihood:                -519.05
No. Observations:                 200   AIC:                             1042.
Df Residuals:                     198   BIC:                             1049.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          7.0326      0.458     15.360      0.0

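The next cell computes the residual standard error (RSE), sqrt(RSS / (n - 2)), where RSS is the residual sum of squares and n - 2 is the residual degrees of freedom for a model with an intercept and one predictor. Compare to p69 of ISLR.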

In [10]:
print("Parameters:", results.params.to_dict())
print("Standard errors:", results.bse.to_dict())
rss = np.square(y - results.predict()).sum()
n = df.shape[0]
print("Residual Standard Error (RSE): {:.2f}".format(np.sqrt(rss / (n-2))))

Parameters: {'const': 7.032593549127698, 'TV': 0.047536640433019764}
Standard errors: {'const': 0.4578429402734786, 'TV': 0.0026906071877968716}
Residual Standard Error (RSE): 3.26
