### Importing required packages

In [1]:
import statsmodels.api as sm
import statsmodels.formula.api as smf
import pandas as pd
import numpy as np

In [2]:
df = sm.datasets.get_rdataset("Guerry", "HistData").data
df = df[['Lottery','Literacy','Wealth','Region']].dropna()
df.head()

   Lottery  Literacy  Wealth Region
0       41        37      73      E
1       38        51      22      N
2       66        13      61      C
3       80        46      76      E
4       79        69      83      E


In [3]:
model = smf.ols(formula='Lottery ~ Literacy + Wealth + Region', data=df)
result = model.fit()
print(result.summary())

                            OLS Regression Results                            
Dep. Variable:                Lottery   R-squared:                       0.338
Model:                            OLS   Adj. R-squared:                  0.287
Method:                 Least Squares   F-statistic:                     6.636
Date:                Sat, 22 Aug 2020   Prob (F-statistic):           1.07e-05
Time:                        08:17:56   Log-Likelihood:                -375.30
No. Observations:                  85   AIC:                             764.6
Df Residuals:                      78   BIC:                             781.7
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept      38.6517      9.456      4.087      

---
String columns are automatically treated as categorical, and the dummy variables for the regression are created automatically. In the above example, ***Region*** is a categorical variable with 5 categories. One category is automatically dropped and serves as the reference level.

We can explicitly mark a variable as categorical using the **`C()`** operator.

```Python 
model = smf.ols(formula='Lottery ~ Literacy + Wealth + C(Region)', data=df)
```
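
`C()` matters most when the categories are stored as numbers, because a numeric column would otherwise enter the model as a single continuous regressor. The following is a minimal sketch on synthetic data; the frame `toy` and its columns `y`, `x` and `group` are made up for illustration and are not part of the Guerry dataset. Without `C()` the integer-coded `group` gets one slope; with `C()` it is expanded into dummy columns, with the first level dropped as the reference.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data (hypothetical): `group` holds integer codes, not strings
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "y": rng.normal(size=60),
    "x": rng.normal(size=60),
    "group": np.tile([1, 2, 3], 20),
})

# Integer column used as-is: a single continuous regressor
numeric_fit = smf.ols("y ~ x + group", data=toy).fit()

# Same column wrapped in C(): one dummy per non-reference level
categorical_fit = smf.ols("y ~ x + C(group)", data=toy).fit()

print(sorted(numeric_fit.params.index))
print(sorted(categorical_fit.params.index))
```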
---

### Multiplicative interactions
“:” adds a new column to the design matrix containing the product of the two columns. “*” also includes the individual columns that were multiplied together. The `- 1` in the formulas below removes the intercept:

In [4]:
res1 = smf.ols(formula='Lottery ~ Literacy : Wealth - 1', data=df).fit()
res2 = smf.ols(formula='Lottery ~ Literacy * Wealth - 1', data=df).fit()

print(res1.params,'\n')
print(res2.params)

Literacy:Wealth    0.018176
dtype: float64 

Literacy           0.427386
Wealth             1.080987
Literacy:Wealth   -0.013609
dtype: float64
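
As a sanity check on the two operators, the sketch below uses synthetic data (the frame `toy` and the columns `y`, `a` and `b` are hypothetical) to confirm that `a * b` is shorthand for `a + b + a:b`, and that the `a:b` column of the design matrix is the elementwise product of `a` and `b`.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data (hypothetical columns)
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "a": rng.normal(size=100),
    "b": rng.normal(size=100),
})
toy["y"] = 1.0 + 2.0 * toy["a"] - toy["b"] + 0.5 * toy["a"] * toy["b"] + rng.normal(size=100)

star = smf.ols("y ~ a * b", data=toy).fit()
expanded = smf.ols("y ~ a + b + a:b", data=toy).fit()

# Both formulas should describe the same model
same_params = np.allclose(star.params.values, expanded.params.values)

# Pull the interaction column out of the design matrix and compare it to a * b
names = star.model.exog_names
interaction_col = star.model.exog[:, names.index("a:b")]
is_product = np.allclose(interaction_col, (toy["a"] * toy["b"]).values)

print(same_params, is_product)
```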


---
### Functions

We can apply vectorized functions to the variables in the model, or use custom Python functions:

In [5]:
smf.ols(formula='Lottery ~ np.log(Literacy)', data=df).fit().params

Intercept           115.609119
np.log(Literacy)    -20.393959
dtype: float64
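
To illustrate what a function call inside a formula does, the sketch below (synthetic data, hypothetical columns `x`, `y` and `log_x`) compares `np.log(x)` written in the formula with a log column computed beforehand. The two fits should agree; only the coefficient label differs.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data (hypothetical): y depends linearly on log(x)
rng = np.random.default_rng(1)
toy = pd.DataFrame({"x": rng.uniform(1, 100, size=50)})
toy["y"] = 3.0 - 2.0 * np.log(toy["x"]) + rng.normal(scale=0.1, size=50)

# Transformation applied inside the formula
in_formula = smf.ols("y ~ np.log(x)", data=toy).fit()

# Same transformation applied to the data first
toy["log_x"] = np.log(toy["x"])
precomputed = smf.ols("y ~ log_x", data=toy).fit()

same = np.allclose(in_formula.params.values, precomputed.params.values)
print(list(in_formula.params.index), list(precomputed.params.index), same)
```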

In [6]:
#custom function for lag variable
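def lag(variable, lag):
    # values are moved with shift(-lag), so lag=-1 returns the previous observation
    if lag == 0:
        return variable
    if isinstance(variable, pd.Series):
        return variable.shift(-lag)
    else:
        # convert array-likes to a Series so that shift() is available
        variable = pd.Series(variable)
        return variable.shift(-lag)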

#generating time series dataframe
x = [0]
np.random.seed(10)
for i in range(1000):
    x.append(0.88 + 0.65 * x[i] + np.random.randn())
df_x = pd.DataFrame({'x':x})


#estimating model using custom made python function
mod1 = smf.ols(formula='x ~ lag(x,-1)', data=df_x).fit()
print(mod1.params)

Intercept     0.834675
lag(x, -1)    0.662489
dtype: float64
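
The sign convention of `lag` above follows from how `Series.shift` moves values. The small sketch below (the series `s` is made up for illustration) shows that `shift(1)`, which is what `lag(x, -1)` calls, aligns each row with the previous observation and leaves a missing value in the first row.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40])

previous = s.shift(1)    # what lag(x, -1) evaluates to
following = s.shift(-1)  # what lag(x, 1) evaluates to

print(previous.tolist())   # [nan, 10.0, 20.0, 30.0]
print(following.tolist())  # [20.0, 30.0, 40.0, nan]
```

By default the formula interface drops rows that contain missing values, so the regression is fitted on one observation fewer than the length of the series.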
