# Effect of Job Training on Future Earnings

**Course:** Foundations of Data Science

**Group Members:** Nibish Tamrakar, Karthik Subramanium, Zubair Ali L

## 1. Introduction


### 1.1. The Question / Estimand

*What is the causal effect of participating in a job training program on future earnings, after adjusting for prior earnings?*

### 1.2. Data Description

*(Describe the dataset you are using.)*
* *Data Source: https://users.nber.org/~rdehejia/data/.nswdata2.html*
* *What do the rows and columns represent?*
* *Why is this dataset appropriate for your causal question?*

## 2. Causal Model

*(Describe your causal model. This section must include a **connected** Directed Acyclic Graph (DAG). Insert an image of your DAG here or create it with code.)*

### 2.1. Variables

*(Clearly label and describe your three variables (Treatment, Outcome, Confound). Do not use T,Y,Z. Use symbols that reflect the names of the variables from your dataset.)*

* **T (Treatment):** [Variable Name] - [Description]
* **Y (Outcome):** [Variable Name] - [Description]
* **Z (Confound):** [Variable Name] - [Description]

### 2.2. Assumed Causal Relationships

*(Clearly describe the assumed causal relationships between the variables as represented in your DAG.)*

## 3. Statistical Model

*(Describe the full statistical model you are using, preferably with statistical notation. DO NOT USE THIS EXACT MODEL DEFINITION IF IT IS NOT APPROPRIATE FOR YOUR PROJECT.)*

*(Example:)*  

$$ Y_i \sim \text{Normal}(\mu_i, \sigma) $$
$$ \mu_i = \alpha + \beta_T T_i + \beta_Z Z_i $$
$$ \alpha \sim \text{Normal}(0, 1) $$
$$ \beta_T \sim \text{Normal}(0, 1) $$
$$ \beta_Z \sim \text{Normal}(0, 1) $$
$$ \sigma \sim \text{Exponential}(1) $$

### 3.1. Justification of Priors

*(Justify the priors used in your model. This can be based on prior predictive simulation, information from outside the dataset, or both.)*
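A prior predictive simulation for the example priors in Section 3 could look roughly like the sketch below. It assumes the outcome and confound have been standardized and that treatment is coded 0/1; the predictor values `T_val` and `Z_val` and the 3-standard-deviation plausibility threshold are hypothetical choices for illustration, not part of the project specification.

```python
import numpy as np

rng = np.random.default_rng(0)
n_draws = 1000

# Draw parameters from the example priors (assumes standardized Y and Z).
alpha = rng.normal(0.0, 1.0, size=n_draws)
beta_T = rng.normal(0.0, 1.0, size=n_draws)
beta_Z = rng.normal(0.0, 1.0, size=n_draws)
sigma = rng.exponential(1.0, size=n_draws)  # Exponential with rate 1 has scale 1

# Hypothetical unit: treated, with the confound one SD above its mean.
T_val, Z_val = 1.0, 1.0

# Prior draws of the linear predictor and of the outcome itself.
mu = alpha + beta_T * T_val + beta_Z * Z_val
y_prior = rng.normal(mu, sigma)

# Share of simulated standardized outcomes within 3 SD of zero.
frac_plausible = np.mean(np.abs(y_prior) < 3.0)
print(f"Share of prior predictive draws within 3 SD: {frac_plausible:.2f}")
```

If a large share of the draws falls far outside the range a standardized outcome can plausibly take, that is a signal to tighten the priors before fitting.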

### 3.2. Justification of Outcome Distribution

*(Explain why your chosen distribution for the outcome variable (e.g., Normal, Bernoulli, Poisson) is reasonable given the observed data.)*

### 3.3. Handling the Confound

*(Explain how your statistical model properly handles the confounding variable to provide a valid causal estimate.)*
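As a sketch of why conditioning on the confound matters, take the linear example model from Section 3 (an assumption; the final model may differ) and suppose the noise in $Y_i$ is independent of $T_i$. If $Z_i$ were left out, the slope from regressing $Y_i$ on $T_i$ alone would converge to

$$ \frac{\text{Cov}(T_i, Y_i)}{\text{Var}(T_i)} = \beta_T + \beta_Z \frac{\text{Cov}(T_i, Z_i)}{\text{Var}(T_i)} $$

The second term is the bias from the open path $T \leftarrow Z \rightarrow Y$. It vanishes only when $\beta_Z = 0$ or when $T_i$ and $Z_i$ are uncorrelated. Including $Z_i$ in $\mu_i$ lets $\beta_T$ be estimated while holding $Z_i$ fixed, which closes that path under the assumed DAG.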

## 4. Model Validation on Simulated Data

*(Before analyzing the real data, you must validate your statistical model on simulated data where you pre-define the parameter values, construct a posterior approximation of your model using the simulated data, and evaluate how well your model estimates the pre-defined parameter values.)*

*(**Hint:** Use `arviz.plot_posterior()` with the `ref_val` parameter set to your fixed simulation parameter values to check how well your model estimates them.)*
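The first part of this validation, fixing parameter values and simulating data from them, might be sketched as follows. The sketch assumes the example linear model from Section 3; the parameter values, the sample size, and the logistic treatment assignment are hypothetical choices for illustration. The least-squares fit at the end is only a quick sanity check that the simulated effect is recoverable, not a replacement for fitting the PyMC model and inspecting its posterior.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 2000

# Fixed "true" parameter values (hypothetical, chosen for illustration).
alpha_sim, beta_T_sim, beta_Z_sim, sigma_sim = 0.3, 0.5, -0.4, 1.0

# The confound Z influences both treatment assignment and the outcome.
Z = rng.normal(0.0, 1.0, size=n)
p_treat = 1.0 / (1.0 + np.exp(-Z))   # higher Z makes treatment more likely
T = rng.binomial(1, p_treat)
Y = rng.normal(alpha_sim + beta_T_sim * T + beta_Z_sim * Z, sigma_sim)

# Quick least-squares check with an intercept, T, and Z as predictors.
X = np.column_stack([np.ones(n), T, Z])
coef, *_ = np.linalg.lstsq(X, Y, rcond=None)
alpha_hat, beta_T_hat, beta_Z_hat = coef

print(f"alpha:  true {alpha_sim:+.2f}, estimated {alpha_hat:+.2f}")
print(f"beta_T: true {beta_T_sim:+.2f}, estimated {beta_T_hat:+.2f}")
print(f"beta_Z: true {beta_Z_sim:+.2f}, estimated {beta_Z_hat:+.2f}")
```

The same `T`, `Z`, and `Y` arrays can then be passed to the PyMC model, with the fixed values supplied as `ref_val` when plotting the posterior.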

In [2]:
import pandas as pd

# Define input and output file paths
dta_file = 'DATA/nsw.dta'
csv_file = 'DATA/nsw.csv'

try:
    # Read the .dta file
    df = pd.read_stata(dta_file)

    # Convert to CSV
    df.to_csv(csv_file, index=False)

    print(f"Successfully converted '{dta_file}' to '{csv_file}'")

except FileNotFoundError:
    print(f"Error: The file '{dta_file}' was not found.")
except Exception as e:
    print(f"An error occurred: {e}")

Successfully converted 'DATA/nsw.dta' to 'DATA/nsw.csv'


In [None]:
# 1. Define fixed parameter values (e.g., alpha_sim = 0.3, beta_sim = 0.5, etc)

# 2. Simulate data based on your causal/statistical model using these fixed values

# 3. Run your computational model (using PyMC) on this simulated data

# 4. Check how well your model estimates the parameter values. Do the posterior estimates from the model capture the true values you defined?


*(Provide a brief discussion of the simulation results, confirming that your model can successfully recover the fixed parameters.)*

## 5. Data Preparation (Real Data)

*(Load the real dataset. Perform any necessary cleaning, scaling, or transformations.)*
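One common scaling step is to standardize continuous variables so that priors such as Normal(0, 1) are on a sensible scale. The helper below is a minimal sketch; `standardize` is an illustrative name, and the earnings values are made-up numbers rather than rows from the NSW data.

```python
import numpy as np

def standardize(values):
    """Center to mean 0 and scale to sample standard deviation 1."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std(ddof=1)

# Toy earnings in dollars (made-up numbers, not from the NSW data).
earnings = np.array([0.0, 1200.0, 3500.0, 8000.0, 15000.0])
z = standardize(earnings)

# Keep the original mean and SD so effects can be converted back to dollars.
earnings_mean, earnings_sd = earnings.mean(), earnings.std(ddof=1)
print(z.round(2))
```

Multiplying a coefficient on the standardized outcome by `earnings_sd` expresses it in the original units.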

In [None]:
# Load data 

# Perform any cleaning/scaling/transformations 

## 6. Posterior Model (Analysis on Real Data)

*(This section contains the code for your computational model and the analysis of its output.)*

### 6.1. Computational Model Definition and Sampling

*(Provide the code for your PyMC model. Ensure the code is well-organized and understandable.)*

In [None]:
# Define your statistical model in code (e.g., with pm.Model() as model: ...) and sample from the model


### 6.2. Model Diagnostics

*(Use built-in diagnostics to assess the quality of the posterior samples (e.g., trace plots, r-hat, effective sample size).)*
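To make the r-hat diagnostic less of a black box, the sketch below computes the classic split R-hat by hand: it compares the variance between half-chains with the variance within them. This is only an illustration of the idea. `split_rhat` is a hypothetical helper, and the value reported by ArviZ (for example through `arviz.summary` or `arviz.rhat`) uses a more robust rank-normalized variant, so the numbers will not match exactly.

```python
import numpy as np

def split_rhat(chains):
    """Classic split R-hat for draws of shape (n_chains, n_draws)."""
    chains = np.asarray(chains, dtype=float)
    half = chains.shape[1] // 2
    # Split each chain in two so drift within a chain shows up as
    # disagreement between half-chains.
    halves = np.concatenate([chains[:, :half], chains[:, half:2 * half]], axis=0)
    n = halves.shape[1]
    within = halves.var(axis=1, ddof=1).mean()        # mean within-chain variance
    between = n * halves.mean(axis=1).var(ddof=1)     # between-chain variance
    var_hat = (n - 1) / n * within + between / n      # pooled variance estimate
    return float(np.sqrt(var_hat / within))

rng = np.random.default_rng(1)
good = rng.normal(size=(4, 1000))          # four chains sampling the same target
bad = good + np.arange(4)[:, None]         # four chains stuck at different levels

print(f"well-mixed chains: {split_rhat(good):.3f}")
print(f"disagreeing chains: {split_rhat(bad):.3f}")
```

Values close to 1 indicate that the chains agree; values noticeably above 1 suggest the chains have not converged to the same distribution.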

In [None]:
# Show posterior summary

# Plot trace plots

# Check R-hat and ESS

*(Briefly discuss the diagnostics. Did the model converge? Are the samples of good quality?)*

## 7. Posterior Predictive Checks

*(Visually compare your model's posterior predictions to the observed data.)*

*(The plot(s) in this section should include:)*
* *The observed data*
* *The posterior mean*
* *The uncertainty of the posterior mean (e.g., 89% HDI)*
* *The uncertainty of posterior predictions (e.g., 89% HDI)*
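An 89% HDI is the narrowest interval that contains 89% of the samples. ArviZ provides `arviz.hdi` for this; the following is only a minimal NumPy sketch of the idea, assuming a unimodal set of samples. The function `hdi` here is an illustrative helper, not the library function.

```python
import numpy as np

def hdi(samples, prob=0.89):
    """Narrowest interval containing `prob` of the samples (unimodal case)."""
    x = np.sort(np.asarray(samples, dtype=float))
    n = x.size
    k = int(np.ceil(prob * n))              # number of samples inside the interval
    # Width of every candidate interval that spans k consecutive sorted samples.
    widths = x[k - 1:] - x[: n - k + 1]
    i = int(np.argmin(widths))
    return x[i], x[i + k - 1]

rng = np.random.default_rng(7)
draws = rng.normal(0.0, 1.0, size=20000)    # stand-in for posterior samples
lower, upper = hdi(draws, prob=0.89)
print(f"89% HDI: [{lower:.2f}, {upper:.2f}]")
```

Applied to samples of the posterior mean, this gives the uncertainty band for the mean; applied to posterior predictive samples, it gives the wider band for individual predictions.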

In [None]:
# Generate posterior predictive samples

# Create the posterior predictive check plot(s)


*(Discuss the results of the check(s). How well does your posterior approximation fit the observed data? Are there any notable discrepancies?)*

## 8. Discussion and Conclusion

### 8.1. Answering the Question

*(Discuss what was learned from the model.)*
* *What is the answer to the question you posed in the introduction?*
* *What is your estimate for the causal effect? Provide a plot showing the estimate of the distribution of the causal effect.*
* *Ensure your conclusions are supported by the evidence from your model results.*

### 8.2. Addressing the Confound

*(Explicitly address the confounding variable in your discussion.)*
* *What was the effect of the confound?*

## 9. Future Work

*(Use your current model results to guide future plans for expanding the analysis.)*

* *What are the limitations of your model?*
* *What other variables would you want to include to expand your inquiry?*
* *What other questions might you explore given what you have learned from your analysis?*

## 10. Group Member Contributions

*(List each section of the proposal and final write-up and state who worked on it.)*

* **Proposal:** Nibish Tamrakar, Karthik Subramanium, Zubair Ali L
* **Introduction:** Nibish Tamrakar
* **Causal Model:** Member Name 2
* **Statistical Model:** Member Name 3
* **Model Validation on Simulated Data:** Member Name 1, Member Name 2
* **Data Preparation:** Member Name 3
* **Posterior Model (Analysis on Real Data):** Member Name 1
* **Posterior Predictive Checks:** Member Name 2
* **Discussion and Conclusion:** Member Name 3
* **Future Work:** Member Name 1, Member Name 2, Member Name 3
