# OLS - Wooldridge Computer Exercise
## Chapter 7, Exercise 14

## To add a heading:
- Insert a new cell
- Type or paste in the content
- Place a single pound sign (`#`) in front of the heading text
- Select "Markdown" as the cell type
- Press Shift+Enter to render the cell

## To add a sub-heading:
- Insert a new cell
- Type or paste in the content
- Place two pound signs (`##`) in front of the sub-heading text
- Select "Markdown" as the cell type
- Press Shift+Enter to render the cell

## To add new bulleted documentation:
- Insert a new cell
- Type or paste in the content
- Place a dash (`-`) in front of each bulleted item
- Select "Markdown" as the cell type
- Press Shift+Enter to render the cell

# References
- Wooldridge, J. M. (2016). Introductory econometrics: A modern approach (6th ed.). Mason, OH: South-Western, Cengage Learning.
- Residual Plots: https://medium.com/@emredjan/emulating-r-regression-plots-in-python-43741952c034
- Understanding residual plots: https://data.library.virginia.edu/diagnostic-plots/
- VIF: https://etav.github.io/python/vif_factor_python.html
- VIF: https://en.wikipedia.org/wiki/Variance_inflation_factor
- Extracting various values from regression results: https://www.statsmodels.org/dev/generated/statsmodels.regression.linear_model.RegressionResults.html

# Import libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import statsmodels
import statsmodels.api as sm
import statsmodels.stats.api as sms

from statsmodels.formula.api import ols
from statsmodels.compat import lzip

from statsmodels.graphics.gofplots import ProbPlot

#import pandas.tseries.api as sm
#from tseries.formula.apt import ols

from scipy.stats import ttest_ind, ttest_ind_from_stats
from scipy.special import stdtr
from scipy.stats import t
from math import sqrt

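# pretty matplotlib plots; the 'seaborn' style was renamed 'seaborn-v0_8' in matplotlib 3.6
plt.style.use('seaborn-v0_8' if 'seaborn-v0_8' in plt.style.available else 'seaborn')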

plt.rc('font', size=14)
plt.rc('figure', titlesize=18)
plt.rc('axes', labelsize=15)
plt.rc('axes', titlesize=18)

# Latex markup language 
from IPython.display import Latex

# Read data from CSV

In [2]:
%%time
#df = pd.read_csv(BytesIO(csv_as_bytes),sep='|',nrows=100000)
df1 = pd.read_csv('C://Users//mvrie//Downloads//firepit-master//Charity.csv',sep=',')
print(df1.head())

   OBS  RESPONSE  GIFT  RESPLASTMAIL  WEEKSLASTRESP  PROPRESPONSE  \
0    1         0     0             0     143.000000           0.3   
1    2         0     0             0      65.428571           0.3   
2    3         0     0             1      13.142857           0.3   
3    4         0     0             0     120.142857           0.3   
4    5         1    10             0     103.857143           0.2   

   MAILSPERYEAR  GIFTLASTRESP  AVERAGEGIFT  
0           2.5            10         10.0  
1           2.5            10         10.0  
2           2.5            10         10.0  
3           2.5            10         10.0  
4           2.5            10         10.0  
Wall time: 35 ms


In [3]:
df1['constant'] = 1

# Data Checks

- Columns

In [4]:
%%time
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4268 entries, 0 to 4267
Data columns (total 10 columns):
OBS              4268 non-null int64
RESPONSE         4268 non-null int64
GIFT             4268 non-null int64
RESPLASTMAIL     4268 non-null int64
WEEKSLASTRESP    4268 non-null float64
PROPRESPONSE     4268 non-null float64
MAILSPERYEAR     4268 non-null float64
GIFTLASTRESP     4268 non-null int64
AVERAGEGIFT      4268 non-null float64
constant         4268 non-null int64
dtypes: float64(4), int64(6)
memory usage: 333.6 KB
Wall time: 3.99 ms


# Job Helper Functions

In [5]:
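# HELPER FUNCTION: ODDS RATIOS
def oddsratios():
    """Print coefficients, confidence intervals, and odds ratios.

    Reads the notebook-level variable `result` (the most recently fitted
    Logit model), so call it only after a cell that assigns `result`.
    """
    params = result.params
    conf = result.conf_int()  # lower and upper 95% bounds per coefficient

    print("Logistic Regression Coefficients")
    print(params)
    print(" ")
    print("Logistic Regression Coefficient Confidence Intervals")
    print(conf)
    print(" ")

    # Coefficients are log-odds, so exponentiating gives odds ratios
    conf['OR'] = params
    conf.columns = ['2.5%', '97.5%', 'OR']
    print("Logistic Regression Odds Ratios w/Conf Intervals")
    print(np.exp(conf))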
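
A variant of this helper that takes the fitted result as an explicit argument and returns a table, rather than printing, can be easier to reuse across cells. The sketch below is an assumption, not part of the original notebook: the name `odds_ratio_table` and the stub class `_FakeResult` (with made-up numbers) are hypothetical, and the function relies only on the `.params` and `.conf_int()` members that statsmodels results expose.

```python
import numpy as np
import pandas as pd


def odds_ratio_table(res):
    """Return odds ratios with 95% bounds for a fitted logit-style result.

    `res` is any object exposing `.params` (a Series) and `.conf_int()`
    (a DataFrame with lower and upper columns), as statsmodels results do.
    """
    table = res.conf_int().copy()
    table.columns = ['2.5%', '97.5%']
    table['OR'] = res.params
    # Coefficients are log-odds, so exponentiating yields odds ratios
    return np.exp(table)


class _FakeResult:
    """Hypothetical stand-in for a fitted model, with made-up numbers."""
    params = pd.Series({'x1': 0.5, 'x2': -0.2})

    def conf_int(self):
        return pd.DataFrame({0: [0.3, -0.4], 1: [0.7, 0.0]},
                            index=['x1', 'x2'])


or_table = odds_ratio_table(_FakeResult())
print(or_table)
```

In the notebook, the same function could be called as `odds_ratio_table(result)` after any of the estimation cells.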
  

## Exercises i and ii:
### Estimate: $Response = \alpha + \beta_{1} RespLastMail + \beta_{2} AverageGift + \mu$

In [7]:
# create a clean data frame for the regression
modeldata = df1[['RESPONSE','constant'
#,'GIFT'   #THIS IS THE DOLLAR AMOUNT OF THE GIFT WHEN RESPONSE = 1
,'RESPLASTMAIL'
,'AVERAGEGIFT'
]].dropna() #subset the dataframe

# A leading '#' on a column line above marks a variable dropped from the model

# regressors: every column except the dependent variable (the first column)
train_cols = modeldata.columns[1:]
print(train_cols)

logit = sm.Logit(modeldata['RESPONSE'], modeldata[train_cols])

# fit the model
result = logit.fit()

print(result.summary2())

# Print coefficients, confidence intervals, and odds ratios for this specification
oddsratios()

Index(['constant', 'RESPLASTMAIL', 'AVERAGEGIFT'], dtype='object')
Optimization terminated successfully.
         Current function value: 0.618032
         Iterations 7
                          Results: Logit
Model:              Logit            Pseudo R-squared: 0.082      
Dependent Variable: RESPONSE         AIC:              5281.5199  
Date:               2020-03-08 13:56 BIC:              5300.5966  
No. Observations:   4268             Log-Likelihood:   -2637.8    
Df Model:           2                LL-Null:          -2872.3    
Df Residuals:       4265             LLR p-value:      1.3377e-102
Converged:          1.0000           Scale:            1.0000     
No. Iterations:     7.0000                                        
-------------------------------------------------------------------
               Coef.   Std.Err.     z      P>|z|    [0.025   0.975]
-------------------------------------------------------------------
constant      -0.9459    0.0496  -19.0511  0.0000 
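
The summary header can be cross-checked by hand. The sketch below recomputes McFadden's pseudo R-squared, $1 - \ln L / \ln L_{0}$, along with the AIC, $2k - 2\ln L$, and the BIC, $k \ln n - 2\ln L$, from the values printed above. The inputs are the rounded figures from the output, so agreement with the reported statistics is only approximate; the variable names are illustrative.

```python
from math import log

# Values copied from the summary output above (rounded as printed)
log_lik = -2637.8       # Log-Likelihood
log_lik_null = -2872.3  # LL-Null (intercept-only model)
n_obs = 4268            # No. Observations
n_params = 3            # constant, RESPLASTMAIL, AVERAGEGIFT

# McFadden's pseudo R-squared: 1 - lnL / lnL0
pseudo_r2 = 1 - log_lik / log_lik_null

# Information criteria: AIC = 2k - 2 lnL, BIC = k ln(n) - 2 lnL
aic = 2 * n_params - 2 * log_lik
bic = n_params * log(n_obs) - 2 * log_lik

print(f"pseudo R-squared: {pseudo_r2:.3f}")  # summary reports 0.082
print(f"AIC: {aic:.1f}")                     # summary reports 5281.5199
print(f"BIC: {bic:.1f}")                     # summary reports 5300.5966
```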

## Exercises iii and iv: Add propresp and reestimate
### Estimate: $Response = \alpha + \beta_{1} RespLastMail + \beta_{2} AverageGift + \beta_{3} PropResponse + \mu$

In [8]:
# create a clean data frame for the regression
modeldata = df1[['RESPONSE','constant'
#,'GIFT'   #THIS IS THE DOLLAR AMOUNT OF THE GIFT WHEN RESPONSE = 1
,'RESPLASTMAIL'
,'AVERAGEGIFT'
,'PROPRESPONSE'
]].dropna() #subset the dataframe

# A leading '#' on a column line above marks a variable dropped from the model

# regressors: every column except the dependent variable (the first column)
train_cols = modeldata.columns[1:]
print(train_cols)

logit = sm.Logit(modeldata['RESPONSE'], modeldata[train_cols])

# fit the model
result = logit.fit()

print(result.summary2())

# Print coefficients, confidence intervals, and odds ratios for this specification
oddsratios()

Index(['constant', 'RESPLASTMAIL', 'AVERAGEGIFT', 'PROPRESPONSE'], dtype='object')
Optimization terminated successfully.
         Current function value: 0.566549
         Iterations 6
                          Results: Logit
Model:              Logit            Pseudo R-squared: 0.158      
Dependent Variable: RESPONSE         AIC:              4844.0639  
Date:               2020-03-08 14:00 BIC:              4869.4995  
No. Observations:   4268             Log-Likelihood:   -2418.0    
Df Model:           3                LL-Null:          -2872.3    
Df Residuals:       4264             LLR p-value:      1.2059e-196
Converged:          1.0000           Scale:            1.0000     
No. Iterations:     6.0000                                        
-------------------------------------------------------------------
               Coef.   Std.Err.     z      P>|z|    [0.025   0.975]
-------------------------------------------------------------------
constant      -2.3808    0.0919  -
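
Because the specification from exercises i and ii is nested in this one, the contribution of `PROPRESPONSE` can also be judged with a likelihood-ratio test. The sketch below uses the rounded log-likelihoods from the two summaries (-2637.8 and -2418.0) and one restriction. For one degree of freedom the chi-square upper-tail probability reduces to `erfc(sqrt(LR / 2))`, which keeps the example within the standard library; `scipy.stats.chi2.sf` would cover the general case. The variable names are illustrative.

```python
from math import erfc, sqrt

ll_restricted = -2637.8    # model without PROPRESPONSE
ll_unrestricted = -2418.0  # model with PROPRESPONSE
df_diff = 1                # one added regressor

# Likelihood-ratio statistic: 2 * (lnL_unrestricted - lnL_restricted)
lr_stat = 2 * (ll_unrestricted - ll_restricted)

# Chi-square(1) upper tail: P(Z^2 > x) = erfc(sqrt(x / 2))
p_value = erfc(sqrt(lr_stat / 2))

print(f"LR statistic: {lr_stat:.1f}, df = {df_diff}, p-value = {p_value:.3g}")
```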

## Exercise v:
### Estimate: $Response = \alpha + \beta_{1} RespLastMail + \beta_{2} AverageGift  + \beta_{3} PropResponse + \beta_{4} MailsPerYear + \mu$

In [9]:
# create a clean data frame for the regression
modeldata = df1[['RESPONSE','constant'
#,'GIFT'   #THIS IS THE DOLLAR AMOUNT OF THE GIFT WHEN RESPONSE = 1
,'RESPLASTMAIL'
,'AVERAGEGIFT'
,'PROPRESPONSE'
,'MAILSPERYEAR'
]].dropna() #subset the dataframe

# A leading '#' on a column line above marks a variable dropped from the model

# regressors: every column except the dependent variable (the first column)
train_cols = modeldata.columns[1:]
print(train_cols)

logit = sm.Logit(modeldata['RESPONSE'], modeldata[train_cols])

# fit the model
result = logit.fit()

print(result.summary2())

# Print coefficients, confidence intervals, and odds ratios for this specification
oddsratios()

Index(['constant', 'RESPLASTMAIL', 'AVERAGEGIFT', 'PROPRESPONSE',
       'MAILSPERYEAR'],
      dtype='object')
Optimization terminated successfully.
         Current function value: 0.562280
         Iterations 7
                          Results: Logit
Model:              Logit            Pseudo R-squared: 0.165      
Dependent Variable: RESPONSE         AIC:              4809.6193  
Date:               2020-03-08 14:07 BIC:              4841.4138  
No. Observations:   4268             Log-Likelihood:   -2399.8    
Df Model:           4                LL-Null:          -2872.3    
Df Residuals:       4263             LLR p-value:      2.8920e-203
Converged:          1.0000           Scale:            1.0000     
No. Iterations:     7.0000                                        
-------------------------------------------------------------------
               Coef.   Std.Err.     z      P>|z|    [0.025   0.975]
-------------------------------------------------------------------
const