<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Setup" data-toc-modified-id="Setup-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Setup</a></span></li><li><span><a href="#Key-intuition-of-regression-analysis" data-toc-modified-id="Key-intuition-of-regression-analysis-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Key intuition of regression analysis</a></span></li><li><span><a href="#Regression-Analysis-with-Statsmodels" data-toc-modified-id="Regression-Analysis-with-Statsmodels-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Regression Analysis with Statsmodels</a></span><ul class="toc-item"><li><span><a href="#Data-transformation/preparation" data-toc-modified-id="Data-transformation/preparation-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Data transformation/preparation</a></span></li><li><span><a href="#OLS" data-toc-modified-id="OLS-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>OLS</a></span></li><li><span><a href="#Robust-regression" data-toc-modified-id="Robust-regression-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>Robust regression</a></span></li><li><span><a href="#Discrete-choice-models" data-toc-modified-id="Discrete-choice-models-3.4"><span class="toc-item-num">3.4&nbsp;&nbsp;</span>Discrete choice models</a></span></li><li><span><a href="#Count-models" data-toc-modified-id="Count-models-3.5"><span class="toc-item-num">3.5&nbsp;&nbsp;</span>Count models</a></span></li></ul></li></ul></div>

# Setup

In [1]:
from IPython.display import Image, HTML

In [2]:
import pandas as pd
import numpy as np

# Key intuition of regression analysis

In [3]:
Image(url="http://www.mostlyharmlesseconometrics.com/wordpress/wp-content/uploads/2009/05/k8769-193x300.gif")

<div class="girk">
Angrist and Pischke's book, Mostly Harmless Econometrics, is a good place to start for becoming familiar with the conceptual and statistical aspects of regression analysis.</div><i class="fa fa-lightbulb-o "></i>

Ultimately, when we estimate a regression model we want to infer the parameters linking a series of independent variables $X$ to a dependent variable $y$:

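$y = \alpha + \beta_{1} * x_{1} + \beta_{2} * x_{2} + \dots + \beta_{k} * x_{k} + u$

Here $\alpha$ is the intercept, each $\beta_{j}$ is the slope on $x_{j}$, and $u$ is the error term.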
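
To make the equation concrete, here is a minimal sketch that simulates data from a known linear model and recovers the parameters by least squares. The sample size, the true coefficient values, and all variable names are assumptions chosen for illustration; they are not taken from the review data used later.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000

X = rng.normal(size=(n, 2))            # two regressors x_1 and x_2
u = rng.normal(scale=0.5, size=n)      # error term
# Assumed true parameters: alpha = 1, beta_1 = 2, beta_2 = -3
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1] + u

# A column of ones lets the first coefficient play the role of alpha.
X_design = np.column_stack([np.ones(n), X])

# Least squares chooses the coefficients that minimize the sum of squared residuals.
coef, *_ = np.linalg.lstsq(X_design, y, rcond=None)
alpha_hat, beta_hat = coef[0], coef[1:]
print(alpha_hat, beta_hat)
```

The estimates should land close to the assumed values, with the gap shrinking as `n` grows.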

# Regression Analysis with Statsmodels

In [4]:
import statsmodels.api as sm
import statsmodels.formula.api as smf

## Data transformation/preparation

In [5]:
DF = pd.read_csv("block_9_data.csv")

In [6]:
DF.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 151254 entries, 0 to 151253
Data columns (total 13 columns):
timestamp         151254 non-null object
asin              151254 non-null object
helpful           151254 non-null object
overall           151254 non-null float64
reviewText        151232 non-null object
reviewTime        151254 non-null object
reviewerID        151254 non-null object
reviewerName      149761 non-null object
summary           151254 non-null object
unixReviewTime    151254 non-null int64
review_year       151254 non-null int64
review_month      151254 non-null int64
review_day        151254 non-null int64
dtypes: float64(1), int64(4), object(8)
memory usage: 15.0+ MB


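The model to estimate relates each review's rating (the `overall` column) to the running number of reviews and the running average rating of the same product, both constructed below:

$ rating = \alpha + \beta_{1} * reviews + \beta_{2} * average + u $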

In [7]:
# weight: a constant 1 per review, summed below to count reviews
DF.loc[:, "weight"] = 1

In [8]:
# set index
DF.set_index(
    ["asin", "review_year", "review_month", "review_day"], inplace=True)

In [10]:
# reviews: running count of reviews per product (rows taken in file order)
DF.loc[:, "reviews"] = DF.groupby(level="asin")["weight"].cumsum()

In [12]:
# average: running mean rating per product, current review included
DF.loc[:, "average"] = (
    DF.groupby(level="asin")["overall"].cumsum() / DF["reviews"])

In [13]:
DF.head().T

asin,616719923X,616719923X,616719923X,616719923X,616719923X
review_year,2013,2014,2013,2013,2013
review_month,6,5,10,5,5
review_day,1,19,8,20,26
timestamp,2013-06-01,2014-05-19,2013-10-08,2013-05-20,2013-05-26
helpful,"[0, 0]","[0, 1]","[3, 4]","[0, 0]","[1, 2]"
overall,4,3,4,5,4
reviewText,Just another flavor of Kit Kat but the taste i...,I bought this on impulse and it comes from Jap...,Really good. Great gift for any fan of green t...,"I had never had it before, was curious to see ...",I've been looking forward to trying these afte...
reviewTime,"06 1, 2013","05 19, 2014","10 8, 2013","05 20, 2013","05 26, 2013"
reviewerID,A1VEELTKS8NLZB,A14R9XMZVJ6INB,A27IQHDZFQFNGG,A31QY5TASILE89,A2LWK003FFMCI5
reviewerName,Amazon Customer,amf0001,Caitlin,DebraDownSth,Diana X.
summary,Good Taste,"3.5 stars, sadly not as wonderful as I had hoped",Yum!,Unexpected flavor meld,"Not a very strong tea flavor, but still yummy ..."
unixReviewTime,1370044800,1400457600,1381190400,1369008000,1369526400
weight,1,1,1,1,1
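
The running count and running average follow the row order of the data, and the rows shown above are not in date order within a product. If time-ordered values are wanted, one option is to sort by product and date first. The sketch below does this on a small made-up frame; the `toy` data and its values are hypothetical, and whether sorting matters depends on the question being asked.

```python
import pandas as pd

# Hypothetical mini data set with the same column names as the review data.
toy = pd.DataFrame({
    "asin": ["A", "A", "A", "B", "B"],
    "timestamp": pd.to_datetime(
        ["2013-06-01", "2013-05-20", "2013-10-08", "2014-01-02", "2014-01-01"]),
    "overall": [4.0, 5.0, 3.0, 2.0, 4.0],
})

# Sort within product by date so cumulative values follow time.
toy = toy.sort_values(["asin", "timestamp"]).reset_index(drop=True)

grouped = toy.groupby("asin")["overall"]
toy["reviews"] = grouped.cumcount() + 1              # running number of reviews
toy["average"] = grouped.cumsum() / toy["reviews"]   # running mean, current review included
print(toy)
```

Note that this running mean includes the current review's own rating, as in the cells above.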


## OLS

In [14]:
# regress each rating on the running review count and running average
FML = "overall ~ reviews + average"
OLS = smf.ols(FML, data=DF).fit()

In [15]:
print(OLS.summary())

                            OLS Regression Results                            
Dep. Variable:                overall   R-squared:                       0.287
Model:                            OLS   Adj. R-squared:                  0.287
Method:                 Least Squares   F-statistic:                 3.040e+04
Date:                Fri, 27 Jul 2018   Prob (F-statistic):               0.00
Time:                        00:00:11   Log-Likelihood:            -2.0210e+05
No. Observations:              151254   AIC:                         4.042e+05
Df Residuals:                  151251   BIC:                         4.042e+05
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -0.0141      0.018     -0.773      0.4
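
Besides the printed summary, individual quantities can be read directly from the fitted results object. The sketch below fits the same formula on simulated data, since the review file is not reproduced here; the `toy` frame and the coefficient values used to generate it are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 500

# Simulated stand-in for the review data (hypothetical values).
toy = pd.DataFrame({
    "reviews": rng.integers(1, 200, size=n),
    "average": rng.uniform(1, 5, size=n),
})
toy["overall"] = 0.5 + 0.9 * toy["average"] + rng.normal(scale=0.8, size=n)

res = smf.ols("overall ~ reviews + average", data=toy).fit()

print(res.params)       # point estimates, indexed by term name
print(res.bse)          # standard errors
print(res.conf_int())   # 95% confidence intervals by default
print(res.rsquared)     # R-squared
```

Working with these attributes is convenient when estimates from several models need to be collected into one table.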

## Robust regression

In [19]:
DF.reset_index(inplace=True)

In [20]:
# same OLS specification, with standard errors clustered by product (asin)
FML = "overall ~ reviews + average"
ROBUST = smf.ols(FML, data=DF).fit(
    cov_type="cluster", cov_kwds={"groups": np.array(DF["asin"])})

In [21]:
print(ROBUST.summary())

                            OLS Regression Results                            
Dep. Variable:                overall   R-squared:                       0.287
Model:                            OLS   Adj. R-squared:                  0.287
Method:                 Least Squares   F-statistic:                 3.159e+04
Date:                Fri, 27 Jul 2018   Prob (F-statistic):               0.00
Time:                        00:12:05   Log-Likelihood:            -2.0210e+05
No. Observations:              151254   AIC:                         4.042e+05
Df Residuals:                  151251   BIC:                         4.042e+05
Df Model:                           2                                         
Covariance Type:              cluster                                         
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -0.0141      0.019     -0.731      0.4
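
Clustering changes the standard errors, not the coefficients, which is why the point estimates in the two summaries agree. The following sketch illustrates this on simulated data in which both the regressor and part of the error are shared within a group; the data-generating process and all names are assumptions made for the example.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n_groups, per_group = 50, 20
group = np.repeat(np.arange(n_groups), per_group)

x = rng.normal(size=n_groups)[group]       # regressor varies only across groups
shock = rng.normal(size=n_groups)[group]   # error component shared within a group
y = 1.0 + 0.5 * x + shock + rng.normal(scale=0.5, size=group.size)
toy = pd.DataFrame({"y": y, "x": x, "group": group})

plain = smf.ols("y ~ x", data=toy).fit()
clustered = smf.ols("y ~ x", data=toy).fit(
    cov_type="cluster", cov_kwds={"groups": toy["group"].to_numpy()})

print(np.allclose(plain.params, clustered.params))   # same coefficients
print(plain.bse["x"], clustered.bse["x"])            # different standard errors
```

In this setup the clustered standard error on `x` comes out larger than the default one, because observations within a group carry less independent information than the default formula assumes.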

## Discrete choice models

In [22]:
# very_low: 1 if the review gave the lowest rating (1 star), 0 otherwise
DF.loc[:, "very_low"] = (DF["overall"] == 1).astype(int)

In [26]:
# formula
FML = "very_low ~ reviews + average"
LOGIT = smf.logit(FML, data=DF).fit()

Optimization terminated successfully.
         Current function value: 0.134610
         Iterations 8


In [27]:
print(LOGIT.summary())

                           Logit Regression Results                           
Dep. Variable:               very_low   No. Observations:               151254
Model:                          Logit   Df Residuals:                   151251
Method:                           MLE   Df Model:                            2
Date:                Fri, 27 Jul 2018   Pseudo R-squ.:                  0.1701
Time:                        00:21:54   Log-Likelihood:                -20360.
converged:                       True   LL-Null:                       -24534.
                                        LLR p-value:                     0.000
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      3.3637      0.072     46.619      0.000       3.222       3.505
reviews       -0.0011      0.000     -5.586      0.000      -0.001      -0.001
average       -1.6581      0.019    -86.329      0.0
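
Logit coefficients are on the log-odds scale, so they are easier to read after a transformation. The sketch below uses the three coefficients reported in the summary above to compute an odds ratio and a predicted probability. The evaluation point (100 reviews, running average of 4.0) is an arbitrary choice for illustration.

```python
import math

# Coefficients copied from the Logit summary above.
intercept, b_reviews, b_average = 3.3637, -0.0011, -1.6581


def predicted_probability(reviews, average):
    """P(very_low = 1) implied by the fitted logit coefficients."""
    index = intercept + b_reviews * reviews + b_average * average
    return 1.0 / (1.0 + math.exp(-index))


# Multiplicative change in the odds of a 1-star review
# for a one-point increase in the running average.
odds_ratio_average = math.exp(b_average)

# Hypothetical evaluation point: 100 reviews, running average of 4.0.
p = predicted_probability(reviews=100, average=4.0)
print(odds_ratio_average, p)
```

An odds ratio below 1 means that a higher running average goes together with lower odds of a 1-star review, in line with the negative sign of the coefficient.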

## Count models

$ \log E[reviews \mid average] = \alpha + \beta_{1} * average $

In [28]:
# formula
FML = "reviews ~ average"
POISSON = smf.poisson(FML, data=DF).fit()
print(POISSON.summary())

Optimization terminated successfully.
         Current function value: 37.681120
         Iterations 6
                          Poisson Regression Results                          
Dep. Variable:                reviews   No. Observations:               151254
Model:                        Poisson   Df Residuals:                   151252
Method:                           MLE   Df Model:                            1
Date:                Fri, 27 Jul 2018   Pseudo R-squ.:                 0.06150
Time:                        00:29:29   Log-Likelihood:            -5.6994e+06
converged:                       True   LL-Null:                   -6.0729e+06
                                        LLR p-value:                     0.000
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      5.8619      0.002   2620.159      0.000       5.858       5.866
average       -0.5104      0
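
A Poisson regression is linear in the log of the expected count, so exponentiating a coefficient gives a multiplicative effect on the expected count. The sketch below applies this to the two point estimates reported above. The evaluation point (running average of 4.0) is a hypothetical choice.

```python
import math

# Point estimates copied from the Poisson summary above.
intercept, b_average = 5.8619, -0.5104

# Incidence rate ratio: factor by which the expected count changes
# when the running average rises by one point.
irr_average = math.exp(b_average)


def expected_count(average):
    """Expected running review count implied by the fitted coefficients."""
    return math.exp(intercept + b_average * average)


mu = expected_count(4.0)   # hypothetical evaluation point
print(irr_average, mu)
```

A ratio below 1 corresponds to the negative coefficient: a higher running average is associated with a lower expected count in this specification.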