### Fitting models using R-style formulas

statsmodels allows users to fit statistical models using R-style formulas. Internally, statsmodels uses the patsy package to convert formulas and data to the matrices that are used in model fitting. 
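
To make the conversion step concrete, here is a minimal sketch that calls patsy directly. The data frame `toy` and its columns `y`, `x`, and `g` are made up for illustration and are not part of the Guerry data used below.

```python
import pandas as pd
from patsy import dmatrices

# Small made-up data set (hypothetical values, for illustration only)
toy = pd.DataFrame({
    "y": [1.0, 2.0, 3.0, 4.0],
    "x": [0.5, 1.5, 2.5, 3.5],
    "g": ["a", "b", "a", "b"],
})

# dmatrices builds the response matrix and the design matrix from a formula
y, X = dmatrices("y ~ x + g", data=toy, return_type="dataframe")

print(list(X.columns))  # column names of the design matrix
print(X.shape)          # one row per observation
```

The design matrix gains an intercept column and a dummy column for the string variable `g` without either being requested explicitly.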

In [1]:
import statsmodels.formula.api as smf
import statsmodels.api as sm
import numpy as np
import pandas

In [2]:
df = sm.datasets.get_rdataset("Guerry", "HistData").data
df = df[['Lottery', 'Literacy', 'Wealth', 'Region']].dropna()
df.head()

   Lottery  Literacy  Wealth Region
0       41        37      73      E
1       38        51      22      N
2       66        13      61      C
3       80        46      76      E
4       79        69      83      E


In [3]:
mod = smf.ols(formula='Lottery ~ Literacy + Wealth + Region', data=df)
res = mod.fit()
print(res.summary())

                            OLS Regression Results                            
Dep. Variable:                Lottery   R-squared:                       0.338
Model:                            OLS   Adj. R-squared:                  0.287
Method:                 Least Squares   F-statistic:                     6.636
Date:                Thu, 04 Mar 2021   Prob (F-statistic):           1.07e-05
Time:                        16:14:20   Log-Likelihood:                -375.30
No. Observations:                  85   AIC:                             764.6
Df Residuals:                      78   BIC:                             781.7
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept      38.6517      9.456      4.087      

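### Categorical variables

Region holds strings, so patsy already treats it as categorical in the model above. Wrapping a variable in `C()` marks it as categorical explicitly, which gives the same fit here: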


In [4]:
res = smf.ols(formula='Lottery ~ Literacy + Wealth + C(Region)', data=df).fit()
print(res.summary())

                            OLS Regression Results                            
Dep. Variable:                Lottery   R-squared:                       0.338
Model:                            OLS   Adj. R-squared:                  0.287
Method:                 Least Squares   F-statistic:                     6.636
Date:                Thu, 04 Mar 2021   Prob (F-statistic):           1.07e-05
Time:                        16:14:32   Log-Likelihood:                -375.30
No. Observations:                  85   AIC:                             764.6
Df Residuals:                      78   BIC:                             781.7
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
Intercept         38.6517      9.456      4.
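
`C()` matters most when the categories are stored as integers, because patsy would otherwise treat the column as numeric. The sketch below uses a made-up data frame (`toy`, with hypothetical columns `group` and `y`) to compare the two readings of the same column.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Made-up data: an integer-coded group label and a random response
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "group": np.repeat([1, 2, 3], 10),
    "y": rng.normal(size=30),
})

# Without C(), the integer column enters as a single numeric slope
res_num = smf.ols("y ~ group", data=toy).fit()

# With C(), each level other than the reference gets its own dummy column
res_cat = smf.ols("y ~ C(group)", data=toy).fit()

print(res_num.params.index.tolist())
print(res_cat.params.index.tolist())
```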

### Remove columns
The `-` operator removes a term from the model. For example, `- 1` removes the intercept:

In [5]:
res = smf.ols(formula='Lottery ~ Literacy + Wealth + C(Region) - 1', data=df).fit()
# print(res.params)  # prints only the coefficients
print(res.summary())

                            OLS Regression Results                            
Dep. Variable:                Lottery   R-squared:                       0.338
Model:                            OLS   Adj. R-squared:                  0.287
Method:                 Least Squares   F-statistic:                     6.636
Date:                Thu, 04 Mar 2021   Prob (F-statistic):           1.07e-05
Time:                        16:14:37   Log-Likelihood:                -375.30
No. Observations:                  85   AIC:                             764.6
Df Residuals:                      78   BIC:                             781.7
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
C(Region)[C]    38.6517      9.456      4.087   
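
Dropping the intercept changes how the categorical variable is coded rather than what the model can fit: every level gets its own column. The following sketch checks this on made-up data (the frame `toy` and its columns `region`, `x`, and `y` are hypothetical).

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Made-up data with a three-level string category and one numeric predictor
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    "region": np.tile(["C", "E", "N"], 10),
    "x": rng.normal(size=30),
})
toy["y"] = 2.0 * toy["x"] + rng.normal(size=30)

with_int = smf.ols("y ~ x + C(region)", data=toy).fit()
no_int = smf.ols("y ~ x + C(region) - 1", data=toy).fit()

print(with_int.params.index.tolist())  # intercept plus dummies relative to a reference level
print(no_int.params.index.tolist())    # one column per level, no intercept

# Both parameterizations span the same column space, so the fits agree
same_fit = np.allclose(with_int.fittedvalues, no_int.fittedvalues)
print(same_fit)
```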

### Multiplicative interactions
- `:` adds a new column to the design matrix containing the product of the two columns.
- `*` adds the same product column and also includes the individual columns that were multiplied together.

In [6]:
res1 = smf.ols(formula='Lottery ~ Literacy : Wealth - 1', data=df).fit()
res2 = smf.ols(formula='Lottery ~ Literacy * Wealth - 1', data=df).fit()
print(res1.params, '\n')
print(res2.params)


Literacy:Wealth    0.018176
dtype: float64 

Literacy           0.427386
Wealth             1.080987
Literacy:Wealth   -0.013609
dtype: float64
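
The relationship between the two operators can be checked directly. This is a minimal sketch on made-up data (the frame `toy` and its columns `a`, `b`, and `y` are hypothetical): `a * b` should give the same fit as `a + b + a:b`, and the `a:b` column should equal the elementwise product of `a` and `b`.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Made-up numeric data
rng = np.random.default_rng(2)
toy = pd.DataFrame({
    "a": rng.normal(size=40),
    "b": rng.normal(size=40),
    "y": rng.normal(size=40),
})

res_star = smf.ols("y ~ a * b", data=toy).fit()
res_long = smf.ols("y ~ a + b + a:b", data=toy).fit()

# Pull the interaction column out of the fitted model's design matrix
names = res_star.model.exog_names
ab_column = res_star.model.exog[:, names.index("a:b")]

same_params = np.allclose(res_star.params.values, res_long.params.values)
is_product = np.allclose(ab_column, (toy["a"] * toy["b"]).values)

print(names)
print(same_params, is_product)
```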


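### Using functions!

Functions available in the calling namespace, such as `np.log`, can be applied to variables directly inside a formula.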

In [7]:
res = smf.ols(formula='Lottery ~ np.log(Literacy)', data=df).fit()
print(res.params)

Intercept           115.609119
np.log(Literacy)    -20.393959
dtype: float64
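
A transformation inside the formula should give the same fit as transforming the column beforehand. The sketch below checks that on made-up data (the frame `toy` and its columns `x`, `y`, and `log_x` are hypothetical), and also shows patsy's `I()` wrapper, which protects arithmetic such as `x**2` from being read as formula syntax.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Made-up data with a strictly positive predictor so the log is defined
rng = np.random.default_rng(3)
toy = pd.DataFrame({"x": rng.uniform(1.0, 10.0, size=50)})
toy["y"] = 3.0 * np.log(toy["x"]) + rng.normal(size=50)

# Transformation applied inside the formula
res_formula = smf.ols("y ~ np.log(x)", data=toy).fit()

# Same transformation applied by hand before fitting
toy["log_x"] = np.log(toy["x"])
res_manual = smf.ols("y ~ log_x", data=toy).fit()

same_params = np.allclose(res_formula.params.values, res_manual.params.values)
print(same_params)

# I() evaluates its contents as ordinary Python arithmetic
res_sq = smf.ols("y ~ x + I(x**2)", data=toy).fit()
print(res_sq.params.index.tolist())
```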
