# Effect of education on earnings

### Observational Study
An observational study is one in which researchers observe subjects and measure variables of interest without intervening or assigning treatments.

### Project Idea: 
The project seeks to examine the causal effect of education on earnings using instrumental variables such as distance to college or lottery-based instruments. 

### Methodology: 
The project is framed as an observational study of the relationship between education and earnings, supplemented with instrumental variable analysis to address potential biases and estimate causal effects. Observational data on individuals' education levels and earnings would be collected without intervention from the researchers. Two-stage least squares (2SLS) regression is then used to estimate the causal effect while addressing endogeneity concerns.

### Instrumental Variable Analysis: 
To address endogeneity concerns inherent in observational studies (e.g., the possibility of education being correlated with unobserved factors that also affect earnings), instrumental variables such as distance to college or lottery-based instruments are used. These instruments are assumed to be correlated with education but to affect earnings only through education, which is what makes them suitable for estimating the causal effect of education on earnings. The estimation uses two-stage least squares (2SLS) regression, a standard technique in instrumental variable analysis.
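
The setup described above can be written compactly. This is the standard textbook formulation rather than anything specific to this project; the symbols $Y_i$ (earnings), $X_i$ (education), $Z_i$ (a single instrument), and the error terms $u_i$ and $v_i$ are notation introduced here for illustration.

```latex
\begin{aligned}
Y_i &= \beta_0 + \beta_1 X_i + u_i && \text{(structural equation)} \\
X_i &= \pi_0 + \pi_1 Z_i + v_i && \text{(first stage)} \\
\operatorname{Cov}(Z_i, X_i) &\neq 0 && \text{(relevance)} \\
\operatorname{Cov}(Z_i, u_i) &= 0 && \text{(exogeneity)} \\
\beta_1^{\mathrm{IV}} &= \frac{\operatorname{Cov}(Z_i, Y_i)}{\operatorname{Cov}(Z_i, X_i)} && \text{(IV estimand, one instrument)}
\end{aligned}
```

The ratio in the last line also shows why a weak instrument is a problem: when $\operatorname{Cov}(Z_i, X_i)$ is close to zero, small sampling errors in the numerator are greatly magnified.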

# Create data set

For this project, we'll create a synthetic dataset that simulates data on individuals' education levels, earnings, and instrumental variables such as distance to college or lottery-based instruments. We'll also introduce some confounding factors to make the analysis more realistic.

The code cell further below generates the synthetic dataset in Python.

In this synthetic dataset:

•	Education represents the number of years of education.

•	Earnings represent annual income, which is influenced by education levels: in the simulation, each additional year of education adds 1,000 to expected earnings.

•	Distance_to_College represents the distance in miles to the nearest college or educational institution.

•	Lottery_Win indicates whether an individual won the lottery, serving as a lottery-based instrument.

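•	We also include covariates such as Age, Gender, Ethnicity, and Parental_Income to make the dataset look more realistic. In this simple simulation they are drawn independently of Education and Earnings, so they do not actually confound the relationship.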

This synthetic dataset can be used to examine the causal effect of education on earnings using instrumental variables and implement two-stage least squares (2SLS) regression to address endogeneity concerns in the analysis.


In [5]:
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(42)

# Generate synthetic data for individuals
n_individuals = 1000

# Generate education levels (years of education)
education = np.random.randint(8, 20, n_individuals)

# Generate earnings (annual income)
earnings = 5000 + 1000 * education + np.random.normal(0, 2000, n_individuals)

# Generate instrumental variables
distance_to_college = np.random.uniform(1, 50, n_individuals)  # Distance in miles
lottery_win = np.random.choice([0, 1], n_individuals, p=[0.9, 0.1])  # 10% chance of winning the lottery

# Generate confounding factors
age = np.random.randint(20, 65, n_individuals)
gender = np.random.choice(['Male', 'Female'], n_individuals)
ethnicity = np.random.choice(['White', 'Black', 'Hispanic', 'Asian'], n_individuals)
parental_income = np.random.normal(50000, 20000, n_individuals)

# Create a DataFrame to store the synthetic data
data = pd.DataFrame({
    'Education': education,
    'Earnings': earnings,
    'Distance_to_College': distance_to_college,
    'Lottery_Win': lottery_win,
    'Age': age,
    'Gender': gender,
    'Ethnicity': ethnicity,
    'Parental_Income': parental_income
})

# Display the first few rows of the synthetic dataset
print(data.head())

# Save the synthetic dataset to a CSV file
data.to_csv('synthetic_data_education_earnings.csv', index=False)


   Education      Earnings  Distance_to_College  Lottery_Win  Age  Gender  \
0         14  21377.602015            49.116490            0   49    Male   
1         11  15917.383139            11.282258            0   60    Male   
2         18  23347.932658            41.431403            0   44    Male   
3         15  21815.743894            26.494532            0   33    Male   
4         12  16982.418887            19.299667            0   54  Female   

  Ethnicity  Parental_Income  
0  Hispanic     63447.184178  
1     Black     56394.678289  
2     Asian     27614.533695  
3  Hispanic     69612.167097  
4  Hispanic     28071.972710  
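
In the dataset generated above, Education is drawn independently of both instruments and of the other covariates, so the instruments are irrelevant by construction and there is no confounding for 2SLS to correct. A minimal sketch of a variant in which distance to college shifts education and an unobserved ability term drives both education and earnings is shown below. The `ability` variable, all coefficients, and the helper name `simulate_endogenous_data` are illustrative assumptions, not part of the original notebook; the sketch returns plain NumPy arrays rather than a DataFrame.

```python
import numpy as np


def simulate_endogenous_data(n=1000, seed=42):
    """Simulate data where education is endogenous and the instrument is relevant.

    Hypothetical data-generating process (all coefficients are assumptions):
      - ability is unobserved and raises both education and earnings
      - distance to college lowers education but has no direct effect on earnings
    """
    rng = np.random.default_rng(seed)

    ability = rng.normal(0, 1, n)        # unobserved confounder (not returned)
    distance = rng.uniform(1, 50, n)     # instrument, in miles

    # Education depends on the instrument and on ability
    education = 16 - 0.08 * distance + 1.5 * ability + rng.normal(0, 1, n)

    # Earnings depend on education and ability, but not directly on distance
    earnings = 5000 + 1000 * education + 3000 * ability + rng.normal(0, 2000, n)

    return education, earnings, distance


education_sim, earnings_sim, distance_sim = simulate_endogenous_data()

# Relevance check: the instrument is now correlated with education
corr_instrument_education = np.corrcoef(distance_sim, education_sim)[0, 1]
print(f"corr(distance, education) = {corr_instrument_education:.3f}")
```

With a process like this, a first-stage regression has something to detect, and ordinary least squares of earnings on education is biased because ability is omitted.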


# About two-stage least squares (2SLS) regression

Two-stage least squares (2SLS) regression is a statistical technique used to estimate the causal effect of an independent variable (the endogenous variable) on a dependent variable, while addressing potential endogeneity issues caused by omitted variables or measurement error.

Here's an overview of the 2SLS regression process:

### Assumptions of 2SLS:
1.	Exogeneity of Instruments: The instrumental variables used in the first stage are uncorrelated with the error term in the second stage regression equation.
2.	Relevance of Instruments: The instrumental variables should be correlated with the endogenous variable in the first stage regression.

### Steps in 2SLS Regression:

#### 1.	First Stage:
•	In the first stage, we regress the endogenous variable (variable of interest) on one or more instrumental variables.
•	The coefficients obtained from this regression are used to compute predicted values of the endogenous variable based on the instrumental variables.
•	This stage aims to estimate the relationship between the endogenous variable and the instrumental variables.

#### 2.	Predicted Values:
•	The predicted values of the endogenous variable obtained from the first stage regression are used as a proxy for the true values of the endogenous variable.
•	These predicted values are then included as regressors in the second stage regression.

#### 3.	Second Stage:
•	In the second stage, we regress the dependent variable on the predicted values of the endogenous variable obtained from the first stage regression, along with other relevant control variables.
•	This regression estimates the causal effect of the endogenous variable on the dependent variable while controlling for potential endogeneity issues.
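
The three steps above can be sketched with plain NumPy least squares. This is a minimal illustration on simulated data, not the notebook's own implementation; the helper name `two_stage_least_squares`, the simulated coefficients, and the unobserved `ability` variable are assumptions made for the example.

```python
import numpy as np


def two_stage_least_squares(y, x, z):
    """Estimate y = b0 + b1 * x by 2SLS with a single instrument z.

    Returns (b0, b1). Point estimates only; standard errors are not computed.
    """
    const = np.ones(len(y))

    # First stage: regress the endogenous regressor on the instrument
    Z = np.column_stack([const, z])
    pi_hat, *_ = np.linalg.lstsq(Z, x, rcond=None)

    # Predicted values of the endogenous regressor
    x_hat = Z @ pi_hat

    # Second stage: regress the outcome on the predicted values
    X_hat = np.column_stack([const, x_hat])
    beta_hat, *_ = np.linalg.lstsq(X_hat, y, rcond=None)
    return beta_hat[0], beta_hat[1]


# Simulated example with a known true effect of 1000 per year of education
rng = np.random.default_rng(0)
n = 5000
ability = rng.normal(0, 1, n)              # unobserved confounder
distance = rng.uniform(1, 50, n)           # instrument
education = 16 - 0.08 * distance + 1.5 * ability + rng.normal(0, 1, n)
earnings = 5000 + 1000 * education + 3000 * ability + rng.normal(0, 2000, n)

# Naive OLS slope is biased upward because ability is omitted
ols_slope = np.polyfit(education, earnings, 1)[0]
_, iv_slope = two_stage_least_squares(earnings, education, distance)
print(f"OLS slope: {ols_slope:.0f}, 2SLS slope: {iv_slope:.0f}")
```

Under these assumptions the OLS slope should land well above the true value of 1000, while the 2SLS slope should be close to it.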

### Interpretation of Results:
•	The coefficient of the predicted endogenous variable obtained from the second stage regression represents the estimated causal effect of the endogenous variable on the dependent variable.
•	The sign of the coefficient indicates the direction of the estimated effect, its magnitude indicates the size of the effect, and its statistical significance indicates how confidently the effect can be distinguished from zero.

### Advantages of 2SLS:
•	Addresses endogeneity: 2SLS regression helps to overcome biases arising from endogeneity, such as omitted variables or measurement error, by using instrumental variables.
•	Provides consistent estimates: Under the assumptions of exogeneity and relevance of instruments, 2SLS produces consistent estimates of causal effects.

### Limitations of 2SLS:
•	Instrument validity: The success of 2SLS relies on the validity of instrumental variables. If the instruments are only weakly correlated with the endogenous variable, the estimates can be badly biased and imprecise; if the instruments are themselves endogenous, the estimates are inconsistent.
•	Complexity: Implementing 2SLS regression requires careful consideration of instrument selection, model specification, and diagnostic tests to ensure the validity of results.

Overall, 2SLS regression is a valuable technique for estimating causal effects in situations where traditional regression methods may produce biased estimates due to endogeneity issues. It is commonly used in econometrics and social sciences to analyze causal relationships in observational data.


# Interpretation of two-stage least squares (2SLS) regression

Interpreting and expressing the results of a two-stage least squares (2SLS) regression to a more general business or product audience requires translating the technical findings into actionable insights that are relevant to their interests and goals. Here's how we might approach it:

### 1.	Interpretation of Results:

•	Causal Effect of Education on Earnings: We can explain that the estimated coefficient of education in the second-stage regression represents the average change in earnings associated with an additional year of education, while controlling for potential endogeneity issues.

•	Significance and Confidence: Emphasize whether the estimated coefficient of education is statistically significant (i.e., whether it differs from zero with a high degree of confidence). This indicates the reliability of the estimated effect.

•	Effect Size: Provide context by discussing the magnitude of the estimated effect. For example, if the coefficient of education is positive and significant, we can highlight the potential economic benefits of investing in education.

### 2.	Implications for Business or Product Strategy:

•	Investment in Education and Human Capital: Highlight the importance of investing in education and skill development for individuals and organizations. Emphasize the potential long-term benefits in terms of increased productivity, innovation, and competitiveness.

•	Talent Acquisition and Retention: Discuss how understanding the causal relationship between education and earnings can inform talent acquisition and retention strategies. Companies may prioritize hiring candidates with higher levels of education or provide educational opportunities for employees to enhance their skills and performance.

•	Policy and Social Impact: Consider broader implications for policymaking and social welfare. For example, policymakers may use the findings to design effective education policies aimed at promoting economic growth, reducing inequality, and improving social mobility.

### 3.	Follow-Up Activities:

•	Further Analysis and Validation: Conduct additional analyses to validate the findings and explore potential alternative explanations or mechanisms driving the observed relationships.

•	Longitudinal Studies: Consider longitudinal studies to examine the causal effects of education on earnings over time and assess the sustainability of the observed effects.

•	Survey or Qualitative Research: Supplement quantitative analysis with qualitative research methods, such as surveys or interviews, to gain deeper insights into individual experiences, motivations, and decision-making processes related to education and earnings.

•	Benchmarking and Comparison: Compare the results with industry benchmarks or similar studies to understand how education-earnings relationships vary across different contexts or populations.


By translating the technical results into meaningful insights and actionable recommendations, we can help business or product stakeholders make informed decisions and drive positive outcomes based on the findings of the 2SLS regression analysis.

# Coding two-stage least squares (2SLS) regression

To examine the causal effect of education on earnings using instrumental variables and implement two-stage least squares (2SLS) regression in Python, we'll use the statsmodels library, which provides functionality for estimating regression models and conducting statistical tests.
The code cells in the sections that follow perform the 2SLS regression in two explicit stages.

In this code:

•	We load the synthetic dataset containing variables such as Education, Earnings, Distance_to_College, and Lottery_Win.

•	We conduct a two-stage least squares (2SLS) regression to estimate the causal effect of Education on Earnings while addressing endogeneity concerns.

•	In the first stage, we regress Education on instrumental variables (Distance_to_College and Lottery_Win) to obtain predicted values of Education.

•	In the second stage, we regress Earnings on the predicted values of Education obtained from the first stage to estimate the causal effect of Education on Earnings.

The regression results are printed, including coefficients, standard errors, t-statistics, and p-values. These results can be interpreted to understand the relationship between education and earnings while accounting for potential endogeneity issues using instrumental variables.


# First Stage Regression

In [2]:
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Load the synthetic dataset
data = pd.read_csv('synthetic_data_education_earnings.csv')

# Step 1: Estimate the first-stage regression (Education as the endogenous variable, Instrumental Variables as predictors)
# Define the endogenous variable (Education) and instrumental variables (Distance_to_College and Lottery_Win)
endog = data['Education']
exog = data[['Distance_to_College', 'Lottery_Win']]

# Add constant term to the exogenous variables
exog = sm.add_constant(exog)

# Fit the first-stage regression model
first_stage_model = sm.OLS(endog, exog)
first_stage_results = first_stage_model.fit()

# Print the first-stage regression results
print("First-Stage Regression Results:")
print(first_stage_results.summary())




First-Stage Regression Results:
                            OLS Regression Results                            
Dep. Variable:              Education   R-squared:                       0.004
Model:                            OLS   Adj. R-squared:                  0.002
Method:                 Least Squares   F-statistic:                     2.131
Date:                Fri, 12 Apr 2024   Prob (F-statistic):              0.119
Time:                        11:57:47   Log-Likelihood:                -2676.4
No. Observations:                1000   AIC:                             5359.
Df Residuals:                     997   BIC:                             5373.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                          coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------
[output truncated]

# First-Stage Regression Results:

### Interpretation:

•	The first-stage regression examines the relationship between education (as the endogenous variable) and instrumental variables (Distance_to_College and Lottery_Win).

•	The adjusted R-squared value of 0.002 indicates that the instrumental variables explain only a small proportion of the variance in education levels.

•	The p-values for Distance_to_College (p = 0.539) and Lottery_Win (p = 0.049) suggest that Distance_to_College is not statistically significant, while Lottery_Win is marginally significant at the 5% level.

### Implications for Business or Product Strategy:

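•	The instruments are jointly weak: the first-stage F-statistic is 2.131 (p = 0.119), and Distance_to_College on its own is not statistically significant, so it is not a strong instrument for education.

•	The marginal significance of Lottery_Win should be read with caution. In this synthetic dataset Education is generated independently of both instruments, so any apparent association is a chance result rather than evidence that winning the lottery influences education levels.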

### Follow-Up Activities:

•	Further investigate the validity and strength of instrumental variables by exploring alternative instruments or conducting sensitivity analysis.

•	Consider collecting additional data or exploring alternative sources of variation to strengthen the instruments.

•	Conduct diagnostic tests, such as testing for instrument relevance and examining the strength of instruments, to assess the quality of the instruments.
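
One common relevance check is the first-stage F-statistic for the joint significance of the instruments; a widely used rule of thumb treats values below about 10 as a sign of weak instruments, and the F-statistic of 2.131 reported above is well under that. A minimal NumPy sketch of the calculation follows. The helper name `first_stage_f` and the simulated inputs (including the coefficient on distance) are assumptions for illustration.

```python
import numpy as np


def first_stage_f(x, instruments):
    """F-statistic for H0: all instrument coefficients are zero in the first stage.

    x           : (n,) endogenous regressor
    instruments : (n, k) matrix of instruments, without a constant column
    """
    x = np.asarray(x, dtype=float)
    Z = np.asarray(instruments, dtype=float)
    if Z.ndim == 1:
        Z = Z[:, None]
    n, k = Z.shape

    # Unrestricted model: constant plus instruments
    Z1 = np.column_stack([np.ones(n), Z])
    coef, *_ = np.linalg.lstsq(Z1, x, rcond=None)
    rss_unrestricted = np.sum((x - Z1 @ coef) ** 2)

    # Restricted model: constant only
    rss_restricted = np.sum((x - x.mean()) ** 2)

    numerator = (rss_restricted - rss_unrestricted) / k
    denominator = rss_unrestricted / (n - k - 1)
    return numerator / denominator


rng = np.random.default_rng(1)
n = 1000
distance = rng.uniform(1, 50, n)
lottery = rng.binomial(1, 0.1, n)
Z = np.column_stack([distance, lottery])

# Irrelevant instruments: education is drawn independently of them
education_irrelevant = rng.integers(8, 20, n).astype(float)
# Relevant instrument: education falls with distance (hypothetical coefficient)
education_relevant = 16 - 0.08 * distance + rng.normal(0, 2, n)

f_weak = first_stage_f(education_irrelevant, Z)
f_strong = first_stage_f(education_relevant, Z)
print(f"F (irrelevant instruments): {f_weak:.2f}")
print(f"F (relevant instrument):    {f_strong:.2f}")
```

The restricted-versus-unrestricted form used here is the same overall F-test that a regression summary reports when the instruments are the only regressors besides the constant.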


# Second Stage Regression

In [3]:
# Step 2: Obtain predicted values of Education from the first-stage regression
data['Predicted_Education'] = first_stage_results.predict()

# Step 3: Estimate the second-stage regression (Earnings as the outcome variable, Predicted_Education as predictor)
# Define the outcome variable (Earnings) and predictor variable (Predicted_Education)
endog = data['Earnings']
exog = data['Predicted_Education']

# Add constant term to the predictor variable
exog = sm.add_constant(exog)

# Fit the second-stage regression model
second_stage_model = sm.OLS(endog, exog)
second_stage_results = second_stage_model.fit()

# Print the second-stage regression results
print("\nSecond-Stage Regression Results:")
print(second_stage_results.summary())


Second-Stage Regression Results:
                            OLS Regression Results                            
Dep. Variable:               Earnings   R-squared:                       0.003
Model:                            OLS   Adj. R-squared:                  0.002
Method:                 Least Squares   F-statistic:                     2.990
Date:                Fri, 12 Apr 2024   Prob (F-statistic):             0.0841
Time:                        11:57:49   Log-Likelihood:                -9714.4
No. Observations:                1000   AIC:                         1.943e+04
Df Residuals:                     998   BIC:                         1.944e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                          coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------


# Second-Stage Regression Results:

### Interpretation:

•	The second-stage regression examines the causal effect of education (predicted from the first stage) on earnings.

•	The adjusted R-squared value of 0.002 indicates that the predicted education levels explain only a small proportion of the variance in earnings.

•	The coefficient of Predicted_Education is positive (953.6713) but not statistically significant (p = 0.084).
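
One caveat about the p-value above: when the second stage is run as a separate OLS regression on Predicted_Education, the reported standard errors are based on residuals computed from the predicted rather than the actual education values, so they are generally not the correct 2SLS standard errors. A minimal sketch of the usual correction, assuming homoskedastic errors, is shown below. The helper name `tsls_with_se`, the simulated data, and the unobserved `ability` variable are illustrative assumptions.

```python
import numpy as np


def tsls_with_se(y, X, Z):
    """2SLS coefficients with corrected and naive homoskedastic standard errors.

    y : (n,) outcome
    X : (n, p) regressors, including a constant and the endogenous variable
    Z : (n, q) instruments, including a constant, with q >= p
    """
    n, p = X.shape

    # First stage: project every column of X onto the instrument space
    first_stage_coef, *_ = np.linalg.lstsq(Z, X, rcond=None)
    X_hat = Z @ first_stage_coef

    # Second stage: regress y on the projected regressors
    beta, *_ = np.linalg.lstsq(X_hat, y, rcond=None)
    xtx_inv_diag = np.diag(np.linalg.inv(X_hat.T @ X_hat))

    # Corrected: residuals use the ACTUAL regressors X
    resid_correct = y - X @ beta
    se_correct = np.sqrt(xtx_inv_diag * (resid_correct @ resid_correct) / (n - p))

    # Naive: residuals use X_hat, as a manual second-stage OLS would
    resid_naive = y - X_hat @ beta
    se_naive = np.sqrt(xtx_inv_diag * (resid_naive @ resid_naive) / (n - p))

    return beta, se_correct, se_naive


rng = np.random.default_rng(2)
n = 5000
ability = rng.normal(0, 1, n)              # unobserved confounder
distance = rng.uniform(1, 50, n)           # instrument
education = 16 - 0.08 * distance + 1.5 * ability + rng.normal(0, 1, n)
earnings = 5000 + 1000 * education + 3000 * ability + rng.normal(0, 2000, n)

X = np.column_stack([np.ones(n), education])
Z = np.column_stack([np.ones(n), distance])
beta, se_correct, se_naive = tsls_with_se(earnings, X, Z)
print(f"Education effect: {beta[1]:.0f}")
print(f"Corrected SE: {se_correct[1]:.1f}, naive second-stage SE: {se_naive[1]:.1f}")
```

The point estimate is identical either way; only the standard errors, and therefore the t-statistics and p-values, differ. Dedicated IV routines in statistical packages typically apply this correction automatically.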

### Implications for Business or Product Strategy:

•	The lack of statistical significance suggests that the estimated causal effect of education on earnings is inconclusive in this analysis.

•	While the positive coefficient is consistent with a positive relationship between education and earnings, the lack of significance, together with the weak first stage (F-statistic of 2.131), means this estimate should not be relied on.

### Follow-Up Activities:

•	Explore alternative econometric methods or model specifications to address potential issues such as model misspecification or heteroscedasticity.

•	Consider collecting additional data or refining the analysis to improve the precision and reliability of the estimated causal effect.

•	Conduct sensitivity analysis to assess the robustness of the results to different modeling assumptions or specifications.

By following these steps, businesses or product teams can gain deeper insights into the relationship between education and earnings and make more informed decisions based on the results of the 2SLS regression analysis.
