
# **ICPSR PDLA Assignment 5**
Author: Korman, James, G



# **Directions**


This assignment will use the National Longitudinal Survey (1968 to 1988) with data on 5,159 young women (14-26 years of age in 1968). The variables are

* ln_wage: ln(wage/GNP deflator) (DV)
* age: age in current year
* msp: 1 if married, spouse present
* ttl_exp: total work experience

## **Reading in the Data & Descriptive Statistics**

In [7]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 
import seaborn as sns

import scipy.stats as st
import statsmodels.api as sm
import statsmodels.graphics.tsaplots as tsap
from statsmodels.compat import lzip
from statsmodels.stats.diagnostic import het_white


In [8]:
dtafile = '/content/drive/MyDrive/Delaware/ICPSR_2022/Panel Data And Longitudinal Analysis/Assignments/Assignment 5/Copy of National Longitudinal Survey.dta'

df = pd.read_stata(dtafile)
df = df.set_index(['idcode', 'year'])
df

| idcode | year | age | msp | ttl_exp | ln_wage |
|---|---|---|---|---|---|
| 1 | 70 | 18.0 | 0.0 | 1.083333 | 1.451214 |
| 1 | 71 | 19.0 | 1.0 | 1.275641 | 1.028620 |
| 1 | 72 | 20.0 | 1.0 | 2.256410 | 1.589977 |
| 1 | 73 | 21.0 | 1.0 | 2.314102 | 1.780273 |
| 1 | 75 | 23.0 | 1.0 | 2.775641 | 1.777012 |
| ... | ... | ... | ... | ... | ... |
| 5159 | 80 | 35.0 | 0.0 | 5.000000 | 1.784807 |
| 5159 | 82 | 37.0 | 0.0 | 7.000000 | 1.871802 |
| 5159 | 83 | 38.0 | 0.0 | 8.076923 | 1.843853 |
| 5159 | 85 | 40.0 | 0.0 | 9.076923 | 1.799792 |
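The panel is unbalanced: in the preview, for example, idcode 1 has no row for year 74. A quick way to see this is to count observations per woman and check for missing values before modelling. The following is a minimal sketch that uses only the rows displayed in the preview, not the full file; the name `toy` is hypothetical.

```python
import pandas as pd

# Rows copied from the preview of the data (NOT the full data set).
rows = [
    (1, 70, 18.0, 0.0, 1.083333, 1.451214),
    (1, 71, 19.0, 1.0, 1.275641, 1.028620),
    (1, 72, 20.0, 1.0, 2.256410, 1.589977),
    (1, 73, 21.0, 1.0, 2.314102, 1.780273),
    (1, 75, 23.0, 1.0, 2.775641, 1.777012),
    (5159, 80, 35.0, 0.0, 5.000000, 1.784807),
    (5159, 82, 37.0, 0.0, 7.000000, 1.871802),
    (5159, 83, 38.0, 0.0, 8.076923, 1.843853),
    (5159, 85, 40.0, 0.0, 9.076923, 1.799792),
]
cols = ["idcode", "year", "age", "msp", "ttl_exp", "ln_wage"]
toy = pd.DataFrame(rows, columns=cols).set_index(["idcode", "year"])

# Number of survey waves observed for each woman
obs_per_person = toy.groupby(level="idcode").size()

# Count of missing values per column
n_missing = toy.isna().sum()

print(obs_per_person)
print(n_missing)
print(toy.describe())
```

Applied to the full `df`, the same three calls would give the panel's balance, the extent of missing data, and the usual summary statistics.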


# **1. Question 1**

Using OLS, estimate the following pooled model: 

$$\mathrm{ln\_wage}_{i,t} = a + b_1\,\mathrm{age}_{i,t} + b_2\,\mathrm{msp}_{i,t} + b_3\,\mathrm{ttl\_exp}_{i,t} + e_{i,t}$$

Report the effects. What do you find?

In [9]:
!pip install linearmodels
from linearmodels import PooledOLS

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [10]:

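# Dependent variable and regressors
endog = df['ln_wage']
exog = df[['age', 'msp', 'ttl_exp']]
# Add an intercept column
exog = sm.add_constant(exog)
# Pooled OLS with standard errors clustered by entity (idcode)
mod = PooledOLS(endog, exog)
res = mod.fit(cov_type='clustered', cluster_entity=True)
print(res.summary)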


  # put in order ctt
Inputs contain missing values. Dropping rows with missing observations.


                          PooledOLS Estimation Summary                          
Dep. Variable:                ln_wage   R-squared:                        0.1801
Estimator:                  PooledOLS   R-squared (Between):              0.2450
No. Observations:               28494   R-squared (Within):               0.1215
Date:                Tue, Aug 02 2022   R-squared (Overall):              0.1801
Time:                        20:28:40   Log-likelihood                -1.658e+04
Cov. Estimator:             Clustered                                           
                                        F-statistic:                      2086.0
Entities:                        4710   P-value                           0.0000
Avg Obs:                       6.0497   Distribution:                 F(3,28490)
Min Obs:                       1.0000                                           
Max Obs:                       15.000   F-statistic (robust):             652.35
                            

In the pooled OLS model above, with standard errors clustered by entity, age is a highly statistically significant predictor of log wage (p < .01), and its coefficient is negative: as age increases, log wages decrease. Being married with spouse present (msp), relative to not being married, is not statistically significant, so we cannot conclude that it affects log wages. Total work experience (ttl_exp), however, is statistically significant and positive: more experience is associated with higher earnings. Overall, the model explains about 18% of the variation in log wages.
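Because the dependent variable is a log wage, each coefficient reads as a proportional change: a one-unit increase in a regressor changes the wage by roughly 100·b percent, and by exactly 100·(exp(b) − 1) percent. The sketch below shows the conversion. The value `b_example` is hypothetical and is not taken from the regression output.

```python
import math


def pct_change(b: float) -> float:
    """Exact percent change in the wage level implied by a log-wage coefficient b."""
    # expm1(b) computes exp(b) - 1 accurately for small b
    return 100.0 * math.expm1(b)


b_example = 0.03  # hypothetical coefficient, NOT from the fitted model

approx = 100.0 * b_example      # the usual "100 * b percent" approximation
exact = pct_change(b_example)   # the exact conversion

print(f"approximate: {approx:.3f}%  exact: {exact:.3f}%")
```

For small coefficients the two numbers are close. The gap widens as |b| grows, which matters most for a dummy such as msp if its coefficient is large.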

# **2. Question 2** 

Using GLS, estimate the following random-effects model: 

$$\mathrm{ln\_wage}_{i,t} = a + b_1\,\mathrm{age}_{i,t} + b_2\,\mathrm{msp}_{i,t} + b_3\,\mathrm{ttl\_exp}_{i,t} + h_i + e_{i,t}$$

Report the effects. What do you find?

In [11]:
from linearmodels import RandomEffects
# Random-effects GLS on the same regressors, clustering standard errors by entity
mod = RandomEffects(endog, exog)
res = mod.fit(cov_type='clustered', cluster_entity=True)
print(res.summary)


Inputs contain missing values. Dropping rows with missing observations.


                        RandomEffects Estimation Summary                        
Dep. Variable:                ln_wage   R-squared:                        0.2788
Estimator:              RandomEffects   R-squared (Between):              0.2235
No. Observations:               28494   R-squared (Within):               0.1363
Date:                Tue, Aug 02 2022   R-squared (Overall):              0.1742
Time:                        20:28:40   Log-likelihood                   -5960.9
Cov. Estimator:             Clustered                                           
                                        F-statistic:                      3671.0
Entities:                        4710   P-value                           0.0000
Avg Obs:                       6.0497   Distribution:                 F(3,28490)
Min Obs:                       1.0000                                           
Max Obs:                       15.000   F-statistic (robust):             655.71
                            

In contrast to the pooled OLS model, the random-effects model estimated directly above performs slightly worse on overall fit: it explains about 17% of the variation in log wages (overall R-squared), compared with 18% for pooled OLS. However, the random-effects model is less prone to bias than pooled OLS, because it accounts for some of the unobserved heterogeneity across women by exploiting both within and between variance. It does not account for all of it, but provided the individual effects are uncorrelated with the regressors, it is unbiased or the bias is negligible. The pooled model, by contrast, does not control for any of the unobserved heterogeneity that could also explain the variation in y, which can leave its estimates biased.

Turning to the coefficients, age is once again negative and highly statistically significant: as age increases, log wages decrease. Being married (msp), relative to not being married, is not a statistically significant predictor of log wages. As in the pooled model, total experience is again highly statistically significant and positive, suggesting that earnings increase as one gains overall experience.
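One way to see how random effects uses "both within and between variance" is the quasi-demeaning weight θ = 1 − sqrt(σ²ₑ / (T·σ²ₕ + σ²ₑ)). GLS subtracts θ times each woman's mean from her observations, so θ = 0 reproduces pooled OLS and θ → 1 approaches the fixed-effects (within) estimator. The sketch below computes θ for hypothetical variance components. The names `sigma2_e` and `sigma2_h` and all numeric values are assumptions for illustration, not estimates from this data set.

```python
import math


def re_theta(sigma2_e: float, sigma2_h: float, n_periods: int) -> float:
    """Quasi-demeaning weight of the random-effects GLS transformation.

    sigma2_e  : variance of the idiosyncratic error e_{i,t}
    sigma2_h  : variance of the individual effect h_i
    n_periods : number of observations for the individual (T_i)
    """
    return 1.0 - math.sqrt(sigma2_e / (n_periods * sigma2_h + sigma2_e))


# Hypothetical variance components, chosen only to show the behaviour
no_heterogeneity = re_theta(sigma2_e=1.0, sigma2_h=0.0, n_periods=5)
moderate = re_theta(sigma2_e=1.0, sigma2_h=1.0, n_periods=3)
long_panel = re_theta(sigma2_e=1.0, sigma2_h=1.0, n_periods=10_000)

print(no_heterogeneity, moderate, long_panel)
```

In an unbalanced panel such as this one, T differs across women, so θ is individual-specific: women observed in more waves are transformed closer to full demeaning.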

# **3. Question 3**

Using OLS, estimate the following fixed-effects model:

$$\mathrm{ln\_wage}_{i,t} = a + b_1\,\mathrm{age}_{i,t} + b_2\,\mathrm{msp}_{i,t} + b_3\,\mathrm{ttl\_exp}_{i,t} + h_i + e_{i,t}$$

Report the effects. What do you find?

In [12]:
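# PanelOLS is the fixed-effects (within) estimator in linearmodels
from linearmodels import PanelOLS
# entity_effects absorbs h_i; time_effects adds year effects beyond the model stated in the question
mod = PanelOLS(endog, exog, entity_effects=True, time_effects=True)
# Standard errors clustered by entity, as in the previous models
res = mod.fit(cov_type='clustered', cluster_entity=True)
print(res.summary)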

Inputs contain missing values. Dropping rows with missing observations.


                          PanelOLS Estimation Summary                           
Dep. Variable:                ln_wage   R-squared:                        0.0424
Estimator:                   PanelOLS   R-squared (Between):              0.2333
No. Observations:               28494   R-squared (Within):               0.0100
Date:                Tue, Aug 02 2022   R-squared (Overall):              0.1345
Time:                        20:28:41   Log-likelihood                   -3201.8
Cov. Estimator:             Clustered                                           
                                        F-statistic:                      350.69
Entities:                        4710   P-value                           0.0000
Avg Obs:                       6.0497   Distribution:                 F(3,23767)
Min Obs:                       1.0000                                           
Max Obs:                       15.000   F-statistic (robust):             127.60
                            

The fixed-effects model estimated above explains only about 13% of the variation in log wages according to the overall R-squared (0.1345), and its within R-squared is only 0.0100. This is plausible, as it is the most restrictive model in the sense that it uses only within variance to derive the estimates. The model includes both time and entity effects to control for unobserved heterogeneity both across units and over time. In this model, age is not statistically significant, unlike in the pooled OLS and random-effects models, so we cannot conclude that age affects log wages in either direction. With both sets of effects included this is not surprising: within each woman, age rises almost one-for-one with the survey year, which leaves little independent variation to identify its coefficient. Being married (msp) is again not statistically significant. Meanwhile, total experience is again positive and highly statistically significant, suggesting that it is a very salient predictor of log wages.
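The phrase "only uses within variance" can be made concrete with the within transformation: subtract each individual's mean from her observations, then run OLS on the demeaned data. The sketch below uses a small synthetic panel, not the survey data, in which the individual effect is correlated with the regressor. The true slope of 2 and all other numbers are made up for illustration.

```python
import numpy as np

# Synthetic panel: 3 individuals observed 3 times each (NOT the survey data)
ids = np.repeat([0, 1, 2], 3)
h = np.array([0.0, 5.0, 10.0])[ids]       # individual effect h_i
x = h + np.tile([-1.0, 0.0, 1.0], 3)      # regressor correlated with h_i
y = 2.0 * x + 3.0 * h                     # true slope on x is 2; no noise


def slope(x, y):
    """OLS slope of y on x (with an intercept)."""
    xc, yc = x - x.mean(), y - y.mean()
    return float(xc @ yc / (xc @ xc))


def demean_by_group(v, ids):
    """Subtract each individual's own mean (the within transformation)."""
    group_means = np.bincount(ids, weights=v) / np.bincount(ids)
    return v - group_means[ids]


pooled_slope = slope(x, y)
within_slope = slope(demean_by_group(x, ids), demean_by_group(y, ids))

print(f"pooled: {pooled_slope:.4f}  within: {within_slope:.4f}")
```

Pooled OLS attributes part of the individual effect to `x` and overstates the slope, while the within estimator recovers it because demeaning removes `h` entirely. The price is that any regressor with little within-person variation becomes hard to estimate.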

# **4. Question 4**

How do your results from the pooled, FE and RE estimates/models compare?

Overall, the results from the pooled, FE, and RE models compare as expected. The pooled OLS model is the most flexible and explains the most variation in log wages, but its estimates are likely biased and therefore less trustworthy. The random-effects model is the happy medium among the three: it explains more of the variation in log wages than the pure FE model but less than pooled OLS. The fixed-effects model is the most restrictive but the most plausible for drawing direct inferences. It explains the least variation in log wages, but because both time and entity effects are included, its estimates are not contaminated by time-invariant unobserved heterogeneity. With the random-effects model we accept a little bias in exchange for letting the model use between-group variance, and we gain efficiency. This is evident in the standard errors, which are smaller for the random-effects model than for the FE model, indicating greater precision.

# **5. Question 5**

Do you think the RE estimator is or is not theoretically appropriate? Explain your answer.
Use the Hausman test to determine if the RE estimator is appropriate.

I believe the random-effects estimator is theoretically appropriate in this case. We can think of the individual effects as random draws from a larger population, which helps generalize our findings on the impact of the regressors on log wages. Due to current limitations of Python's statistical libraries, a built-in Hausman test appears to be available only for IV (2SLS) estimation, not for comparing random and fixed effects, so the test is not reported here.
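Even without a packaged routine, the Hausman statistic can be computed by hand from the two sets of estimates: H = (b_FE − b_RE)′ [V_FE − V_RE]⁻¹ (b_FE − b_RE), which is chi-squared with k degrees of freedom under the null that the random-effects estimator is consistent, where k is the number of compared coefficients. The following is a minimal sketch, not a definitive implementation. The function name `hausman` is hypothetical, and the coefficient vectors and covariance matrices are made-up numbers rather than results from this assignment.

```python
import numpy as np
from scipy import stats


def hausman(b_fe, b_re, v_fe, v_re):
    """Hausman statistic, degrees of freedom and p-value.

    b_fe, b_re : coefficients common to both models (no intercept)
    v_fe, v_re : the matching covariance matrices
    """
    diff = np.asarray(b_fe, dtype=float) - np.asarray(b_re, dtype=float)
    v_diff = np.asarray(v_fe, dtype=float) - np.asarray(v_re, dtype=float)
    # pinv guards against a singular covariance difference
    stat = float(diff @ np.linalg.pinv(v_diff) @ diff)
    dof = diff.size
    pval = float(stats.chi2.sf(stat, dof))
    return stat, dof, pval


# Hypothetical inputs for illustration only (NOT estimates from this data set)
b_fe = [1.0, 2.0]
b_re = [0.9, 2.1]
v_fe = np.diag([0.02, 0.02])
v_re = np.diag([0.01, 0.01])

stat, dof, pval = hausman(b_fe, b_re, v_fe, v_re)
print(f"H = {stat:.3f}, df = {dof}, p = {pval:.4f}")
```

With linearmodels, the inputs would presumably come from the `params` and `cov` attributes of the two fitted results, restricted to the slope coefficients both models estimate. Two caveats apply. The classical test assumes the random-effects estimator is efficient, so conventional rather than clustered covariances are the natural choice, and with clustered ones V_FE − V_RE may not be positive definite. The comparison is also cleanest against a fixed-effects model with entity effects only, since the random-effects model here has no time effects. A small p-value would argue against random effects.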