<table align="center">
   <td align="center"><a target="_blank" href="https://colab.research.google.com/github/ds5110/summer-2021/blob/master/08a-statsmodels.ipynb">
<img src="https://github.com/ds5110/summer-2021/raw/master/colab.png"  style="padding-bottom:5px;" />Run in Google Colab</a></td>
</table>

# 08a-statsmodels

Introduction to statsmodels

* [statsmodels introduction](https://www.statsmodels.org/stable/index.html) (v0.12) -- statsmodels.org
  * statsmodels v0.10 is installed in colab
* [statsmodels getting started](https://www.statsmodels.org/devel/gettingstarted.html) (v0.13) -- statsmodels.org
  * multivariate least squares with categorical inputs

In [None]:
import statsmodels.api as sm
import numpy as np

# Getting started with Python-style API and simulated dataset
nobs = 100
X = np.random.random((nobs, 2))
X = sm.add_constant(X)
beta = [1, .1, .5]
e = np.random.random(nobs)
y = np.dot(X, beta) + e

# Fit regression model
results = sm.OLS(y, X).fit()

# Inspect the results
print(results.summary())
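
One detail of the simulation above is easy to miss: `np.random.random` draws the noise uniformly from [0, 1), which has mean 0.5 rather than 0, so the fitted intercept should land near 1.5 instead of the 1 in `beta`. The following is a minimal sketch of that effect, assuming a larger sample, a fixed seed, and plain NumPy least squares (`np.linalg.lstsq`) in place of `sm.OLS`. The names `rng`, `coef`, and `e_centered` are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed (an assumption, for reproducibility)
nobs = 100_000                   # larger than the 100 used above, to reduce sampling noise

# Same design as the cell above: a constant column plus two uniform predictors
X = np.column_stack([np.ones(nobs), rng.random((nobs, 2))])
beta = np.array([1.0, 0.1, 0.5])

# Uniform noise on [0, 1) has mean 0.5, which the intercept absorbs
e = rng.random(nobs)
y = X @ beta + e
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print("uniform noise: ", coef.round(2))

# Centering the noise recovers the intercept in beta
e_centered = e - 0.5
y_centered = X @ beta + e_centered
coef_centered, *_ = np.linalg.lstsq(X, y_centered, rcond=None)
print("centered noise:", coef_centered.round(2))
```

The slope estimates are unaffected by the shift; only the intercept moves.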

# R-style formulas

* The next cell performs linear regression with two predictors using R-style formulas
* It's the [introductory demo](https://www.statsmodels.org/stable/index.html) at statsmodels.org

In [None]:
# Getting started with statsmodels using R-style formulas
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

dat = sm.datasets.get_rdataset("Guerry", "HistData").data
results = smf.ols('Lottery ~ Literacy + np.log(Pop1831)', data=dat).fit()
print(results.summary())

# R-style demo adapted to Python-style API

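This cell produces the same fit as the cell above. Only the labels in the summary differ: the intercept is called `const`, and the log-transformed column keeps the name `Pop1831`.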

In [None]:
# R-style formula: Lottery ~ Literacy + np.log(Pop1831)
X = dat[['Literacy', 'Pop1831']].copy()
X.iloc[:,1] = np.log(X.iloc[:,1])
X = sm.add_constant(X)
y = dat['Lottery']

# Fit regression model
results = sm.OLS(y, X).fit()

# Inspect the results
print(results.summary())

## Digression: a bug that doesn't throw errors!

Without `.copy()`, the code above triggers a view-versus-copy warning from pandas (`SettingWithCopyWarning` in the pandas versions that emit it).

* [Returning a view versus a copy](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy) -- pandas.pydata.org

Here's some pseudo-code that explains what's going on...
```
def do_something(df):
    foo = df[['bar', 'baz']]  # Is foo a view? A copy? Nobody knows!
    # ... many lines here ...
    # We don't know whether this will modify df or not!
    foo['quux'] = value
    return foo
```

The next few cells provide an explicit example.

In [None]:
import pandas as pd

# One way to get rid of the warning...DO NOT DO THIS!!!
pd.options.mode.chained_assignment = None  # default='warn'

In [None]:
# Test dataframe
dfb = pd.DataFrame({'a': ['one', 'one', 'two',
                          'three', 'two', 'one', 'six'],
                    'c': np.arange(7)})

In [None]:
# Compare the result of this line (the assignment lands on a temporary copy of the selected rows)...
dfb[dfb['a'].str.startswith('o')]['c'] = 42

In [None]:
# ...to the result of this line
dfb['c'][dfb['a'].str.startswith('o')] = 42
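
For reference, the sketch below contrasts the first chained form with a single `.loc` assignment, which is the pattern the pandas indexing guide recommends. It rebuilds the test dataframe so that it runs on its own. The names `mask` and `before` are illustrative, and the warning filter is only there to keep the output quiet. The second chained form (`dfb['c'][mask] = 42`) is left out because whether it modifies `dfb` depends on the pandas version and its copy-on-write settings.

```python
import warnings
import numpy as np
import pandas as pd

dfb = pd.DataFrame({'a': ['one', 'one', 'two', 'three', 'two', 'one', 'six'],
                    'c': np.arange(7)})
mask = dfb['a'].str.startswith('o')
before = dfb['c'].tolist()

# Chained form: dfb[mask] is a temporary copy, so the assignment is lost
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    dfb[mask]['c'] = 42
unchanged = dfb['c'].tolist() == before
print("dfb unchanged after chained assignment:", unchanged)

# Single .loc call: rows and column are selected in one step on dfb itself
dfb.loc[mask, 'c'] = 42
print(dfb)
```

Selecting rows and column in one indexing operation leaves no intermediate object, so there is no ambiguity about what gets modified.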

# Advertising dataset

In [None]:
import pandas as pd
url = "https://www.statlearning.com/s/Advertising.csv"
 
df = pd.read_csv(url, index_col=0)
df

Compare the next cell to p. 68 of ISLR, 1st Edition.

In [None]:
# Advertising dataset: sales vs TV
X = df[['TV']].copy()
X = sm.add_constant(X)
y = df['sales'].copy()

# Fit regression model
results = sm.OLS(y, X).fit()

# Inspect the results
print(results.summary())

The next cell computes the residual standard error (RSE). Compare to p. 69 of ISLR.

In [None]:
print("Parameters:", results.params.to_dict())
print("Standard errors:", results.bse.to_dict())
rss = np.square(y - results.predict()).sum()
n = df.shape[0]
print("Residual Standard Error (RSE): {:.2f}".format(np.sqrt(rss / (n-2))))
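
The RSE printed above is the square root of RSS / (n - 2), where 2 is the number of estimated coefficients (intercept and slope). Here is a minimal sketch of the same calculation on simulated data, assuming NumPy only. The helper `residual_standard_error`, its `n_params` argument, and the simulated values are hypothetical and are not part of statsmodels or the Advertising dataset.

```python
import numpy as np

def residual_standard_error(y, y_hat, n_params):
    """Return sqrt(RSS / (n - n_params)); n_params counts the intercept too."""
    rss = np.square(np.asarray(y) - np.asarray(y_hat)).sum()
    return np.sqrt(rss / (len(y) - n_params))

# Simulated simple regression with a known noise standard deviation
rng = np.random.default_rng(1)
n, sigma = 200, 3.0
x = rng.uniform(0, 300, n)
y = 7.0 + 0.05 * x + rng.normal(0, sigma, n)

# Fit a straight line; np.polyfit returns the highest-degree coefficient first
slope, intercept = np.polyfit(x, y, 1)
rse = residual_standard_error(y, intercept + slope * x, n_params=2)
print("RSE: {:.2f} (noise standard deviation used in the simulation: {})".format(rse, sigma))
```

With a fitted statsmodels OLS result, `np.sqrt(results.mse_resid)` should give the same number as the manual calculation, since `mse_resid` divides the residual sum of squares by the residual degrees of freedom.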