In [1]:
# Initialize Otter
import otter
grader = otter.Notebook("proj04.ipynb")

# Project 4: Econometrics

In this project we'll be taking a look at datasets related to college education and exploring questions revolving around the relationships between years of education and various factors.

In [2]:
import numpy as np
from datascience import *
import statsmodels.api as sm
import matplotlib.pyplot as plt
import warnings
warnings.simplefilter(action='ignore')
plt.style.use('seaborn-muted')
%matplotlib inline
plt.rcParams["figure.figsize"] = [10,7]

## Part 1: College Distance

We will begin by examining the relationship between years of schooling and a person's distance to the nearest college when in high school. The idea here is to see whether proximity to a college affects how much education a person receives.

The data for this section is from the paper *Democratization or Diversion? The Effect of Community Colleges on Educational Attainment* by Cecilia Rouse (1995).

To explore this problem, we will import a dataset called `college_distance.csv`, which contains several relevant features for a random sample of high school seniors interviewed in 1980 and re-interviewed in 1986.
The table contains the following columns:

- `yrsed`: Years of Education Completed$^1$
- `female`: Binary variable (1 = female, 0 otherwise)
- `black`: Binary variable (1 = black, 0 otherwise)
- `hispanic`: Binary variable (1 = Hispanic, 0 otherwise)
- `bytest`: Basic year composite test score. These are achievement tests given to high school seniors.
- `dadcoll`: Binary variable (1 = father is a college graduate, 0 otherwise)
- `momcoll`: Binary variable (1 = mother is a college graduate, 0 otherwise)
- `incomehi`: Binary variable (1 = family income > \$25,000 per year, 0 otherwise)
- `ownhome`: Binary variable (1 = family owns their home, 0 otherwise)
- `urban`: Binary variable (1 = high school in urban area, 0 otherwise)
- `cue80`: County unemployment rate in 1980 (%)
- `stwmfg80`: Average state hourly wage in manufacturing in 1980
- `dist`: Distance from 4-year college (in 10s of miles)
- `tuition`: Average state 4 year college tuition (in 1000s of dollars)

$^1$: Rouse computed years of education by assigning 12 years to all members of the senior class. Each additional year of secondary education counted as one year. Students with vocational degrees were assigned 13 years, AA degrees were assigned 14 years, BA degrees were assigned 16 years, those with some graduate education were assigned 17 years, and those with a graduate degree were assigned 18 years.


In [3]:
distance = Table.read_table('college_distance.csv')
distance

female,black,hispanic,bytest,dadcoll,momcoll,ownhome,urban,cue80,stwmfg80,dist,tuition,yrsed,incomehi
0,0,0,39.15,1,0,1,1,6.2,8.09,0.2,0.88915,12,1
1,0,0,48.87,0,0,1,1,6.2,8.09,0.2,0.88915,12,0
0,0,0,48.74,0,0,1,1,6.2,8.09,0.2,0.88915,12,0
0,1,0,40.4,0,0,1,1,6.2,8.09,0.2,0.88915,12,0
1,0,0,40.48,0,0,0,1,5.6,8.09,0.4,0.88915,13,0
0,0,0,54.71,0,0,1,1,5.6,8.09,0.4,0.88915,12,0
1,0,0,56.07,0,0,1,0,7.2,8.85,0.4,0.84988,13,0
1,0,0,54.85,0,0,1,0,7.2,8.85,0.4,0.84988,15,0
0,0,0,64.74,1,0,1,1,5.9,8.09,3.0,0.88915,13,0
1,0,0,56.06,0,0,1,1,5.9,8.09,3.0,0.88915,15,0


<!-- BEGIN QUESTION -->

**Question 1.1:** What do you expect the sign of the relationship between years of schooling and distance to nearest college to be? Provide a possible and brief explanation for the sign.

<!--
BEGIN QUESTION
name: q1_1
manual: true
-->

Negative. As the distance to the nearest college increases, attending becomes more costly (a longer commute or the need to relocate), so we expect years of schooling to decrease.

<!-- END QUESTION -->

**Question 1.2:** Consider the following single-variable regression:

$$\text{Years of Education} = \beta_1 \times \text{Distance from College} + \alpha$$ 

Fit the above regression of years of education `yrsed` onto distance to the nearest college `dist`. 

*Hint*: Make sure to always add a column of 1's.

<!--
BEGIN QUESTION
name: q1_2
-->

In [4]:
y_1_2 = distance.column("yrsed")
X_1_2 = distance.select("dist").to_df()
model_1_2 = sm.OLS(y_1_2, sm.add_constant(X_1_2))  
results_1_2 = model_1_2.fit()
results_1_2.summary()

Dep. Variable:,y,R-squared:,0.007
Model:,OLS,Adj. R-squared:,0.007
Method:,Least Squares,F-statistic:,28.48
Date:,"Thu, 29 Apr 2021",Prob (F-statistic):,1e-07
Time:,20:03:06,Log-Likelihood:,-7632.2
No. Observations:,3796,AIC:,15270.0
Df Residuals:,3794,BIC:,15280.0
Df Model:,1,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,13.9559,0.038,369.945,0.000,13.882,14.030
dist,-0.0734,0.014,-5.336,0.000,-0.100,-0.046

Omnibus:,7187.794,Durbin-Watson:,1.769
Prob(Omnibus):,0.0,Jarque-Bera (JB):,361.676
Skew:,0.41,Prob(JB):,2.9e-79
Kurtosis:,1.729,Cond. No.,3.73


In [5]:
grader.check("q1_2")
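
The hint about the column of 1's is worth a quick illustration. The following is a minimal sketch in plain Python on made-up numbers (the lists `x` and `y` are hypothetical, not taken from `college_distance.csv`). It compares the least-squares slope with an intercept to the slope when the line is forced through the origin, which is what `sm.OLS` fits when `sm.add_constant` is left out.

```python
# Hypothetical toy data: y = 14 - 0.5 * x exactly.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [14.0 - 0.5 * xi for xi in x]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# With an intercept (a column of 1's): the usual simple-regression formulas.
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

# Without an intercept: the fitted line is forced through (0, 0).
slope_no_const = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi ** 2 for xi in x)

print("with constant:   ", round(slope, 4), round(intercept, 4))
print("without constant:", round(slope_no_const, 4))
```

In this toy example the slope without a constant even has the wrong sign, because the line has to absorb the large intercept.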

**Question 1.3:** What is the estimated relationship between distance and years of schooling? Assign `slope_1_3` to the estimated slope (to at least 4 decimal places). Is this statistically significant? Assign `significant_1_3` to either `True` or `False`, corresponding to whether or not the slope is statistically significant.
<!--
BEGIN QUESTION
name: q1_3
-->

In [6]:
slope_1_3 = -0.0734
significant_1_3 = True

In [7]:
grader.check("q1_3")
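
One way to read significance off the summary table is sketched below. It assumes the conventional two-sided 5% level and the large-sample critical value of 1.96. The numbers are copied from the `dist` row of the Question 1.2 output, and the variable names are illustrative.

```python
# Values copied from the `dist` row of the summary table in Question 1.2.
coef = -0.0734
std_err = 0.014
ci_low, ci_high = -0.100, -0.046

# The t-statistic is recomputed from rounded values, so it differs
# slightly from the reported -5.336.
t_stat = coef / std_err

# 1.96 is the usual two-sided 5% critical value for a large sample.
significant_by_t = abs(t_stat) > 1.96
# Equivalent check: the 95% confidence interval excludes zero.
significant_by_ci = not (ci_low <= 0 <= ci_high)

print(round(t_stat, 2), significant_by_t, significant_by_ci)
```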

<!-- BEGIN QUESTION -->

**Question 1.4:** Interpret the slope coefficient on `dist` with the appropriate units. Does this value align with our intuition from 1.1? 
Interpret the y-intercept term. 
<!--
BEGIN QUESTION
name: q1_4
manual: true
-->

We estimate a slope of roughly -0.0734: a 10-mile increase in the distance to the nearest college is associated with an expected decrease of about 0.0734 years of education completed. This aligns with our intuition from 1.1 that living farther from a college goes with fewer years of education, i.e. a negative slope. The y-intercept of 13.9559 is the predicted years of education for a student whose high school is 0 miles from a college.

<!-- END QUESTION -->
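
To make the units concrete, here is a small sketch that turns the fitted line from Question 1.2 into predictions at a few distances. The helper `predicted_yrsed` is a hypothetical name introduced for illustration. The only assumption is that miles are divided by 10 before being multiplied by the slope, since `dist` is measured in tens of miles.

```python
# Coefficients from the Question 1.2 summary table.
const = 13.9559
slope = -0.0734  # change in years of education per 10 miles

def predicted_yrsed(miles):
    """Predicted years of education for a high school `miles` from a college."""
    dist = miles / 10  # `dist` is measured in tens of miles
    return const + slope * dist

for miles in (0, 10, 30):
    print(miles, "miles ->", round(predicted_yrsed(miles), 4))
```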

<!-- BEGIN QUESTION -->

**Question 1.5:** What could be a potential confounding variable in the above regression? Do you expect this confounding factor to overstate or understate our coefficient on `dist`? Why? Give clear reasoning for how the confounding variable affects both independent and dependent variables, like discussed in lecture.
<!--
BEGIN QUESTION
name: q1_5
manual: true
-->

There are several potential confounding variables in the above regression, including family income and whether the father or mother graduated from college. Take family income: higher-income families plausibly live closer to colleges (affecting the independent variable), and higher income also plausibly increases years of schooling directly (affecting the dependent variable). If we ignore income, part of its effect is attributed to distance, so I expect the short regression to overstate the magnitude of the negative coefficient on `dist`.

<!-- END QUESTION -->
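
The mechanics of omitted variable bias can be sketched on made-up data. In the example below, all numbers are hypothetical, including the assumed "true" coefficients. The outcome is built as an exact linear function of distance and a high-income indicator, and the sketch checks that the short-regression slope equals the true slope plus the omitted coefficient times the slope from regressing the omitted variable on the included one.

```python
def ols_slope(x, y):
    """Slope from a simple regression of y on x (with an intercept)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx

# Hypothetical data: high-income families (1) tend to live closer to a college.
dist = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
income_hi = [1, 1, 1, 0, 1, 0]

beta_dist, beta_income = -0.05, 0.8  # assumed "true" coefficients
yrsed = [13 + beta_dist * d + beta_income * z for d, z in zip(dist, income_hi)]

short_slope = ols_slope(dist, yrsed)        # income omitted
delta = ols_slope(dist, income_hi)          # how income moves with distance
implied = beta_dist + beta_income * delta   # omitted variable bias formula

print(round(short_slope, 4), round(implied, 4))
```

Because income falls with distance in this toy sample, the short-regression slope is more negative than the assumed true slope of -0.05.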

**Question 1.6:** Now consider the following longer regression:

$$\text{Years of Education} = \beta_1 \times \text{Distance from College} + \beta_2 \times \text{High Income Family} + \alpha$$ 

Fit a regression model for years of education `yrsed` onto both distance to the nearest college `dist`, and family income `incomehi`.
<!--
BEGIN QUESTION
name: q1_6
-->

In [8]:
y_1_6 = distance.column("yrsed")
X_1_6 = distance.select("dist", "incomehi").to_df()
model_1_6 = sm.OLS(y_1_6, sm.add_constant(X_1_6))  
results_1_6 = model_1_6.fit()
results_1_6.summary()

Dep. Variable:,y,R-squared:,0.052
Model:,OLS,Adj. R-squared:,0.051
Method:,Least Squares,F-statistic:,103.2
Date:,"Thu, 29 Apr 2021",Prob (F-statistic):,2.18e-44
Time:,20:03:28,Log-Likelihood:,-7545.8
No. Observations:,3796,AIC:,15100.0
Df Residuals:,3793,BIC:,15120.0
Df Model:,2,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,13.6874,0.042,325.508,0.000,13.605,13.770
dist,-0.0582,0.013,-4.315,0.000,-0.085,-0.032
incomehi,0.8463,0.064,13.293,0.000,0.722,0.971

Omnibus:,2060.901,Durbin-Watson:,1.838
Prob(Omnibus):,0.0,Jarque-Bera (JB):,307.997
Skew:,0.404,Prob(JB):,1.32e-67
Kurtosis:,1.862,Cond. No.,6.71


In [9]:
grader.check("q1_6")

**Question 1.7:** Now what is the estimated relationship between distance and years of schooling? Assign `slope_1_7` to the estimated slope (to at least 4 decimal places). Is this statistically significant? Assign `significant_1_7` to either `True` or `False`, corresponding to whether or not the slope is statistically significant.
<!--
BEGIN QUESTION
name: q1_7
-->

In [10]:
slope_1_7 = -0.0582
significant_1_7 = True

In [11]:
grader.check("q1_7")

<!-- BEGIN QUESTION -->

**Question 1.8:** How does this new slope for distance to college compare to the previous one? What does this say about family income?
<!--
BEGIN QUESTION
name: q1_8
manual: true
-->

Once we control for a person's family income (removing that source of bias), a 10-mile increase in distance to college is associated with an expected decrease of only 0.0582 years of education completed, compared with 0.0734 before. Furthermore, the 95% confidence interval for $\hat{\beta}_2$, the coefficient on family income, does not contain 0, which implies that family income has a statistically significant, positive association with years of education completed.

<!-- END QUESTION -->

**Question 1.9:** Now fit a linear regression with the additional regressors `bytest`, `female`, `black`, `hispanic`, `incomehi`, `ownhome`, `dadcoll`, `momcoll`, `cue80`, and `stwmfg80` (along with `dist`).

<!--
BEGIN QUESTION
name: q1_9
-->


In [12]:
y_1_9 = distance.column("yrsed")
X_1_9 = distance.select("dist", "incomehi","bytest", "female", "black", "hispanic","ownhome", "dadcoll", "momcoll", "cue80", "stwmfg80").to_df()
model_1_9 = sm.OLS(y_1_9, sm.add_constant(X_1_9))  
results_1_9 = model_1_9.fit()
results_1_9.summary()

Dep. Variable:,y,R-squared:,0.283
Model:,OLS,Adj. R-squared:,0.281
Method:,Least Squares,F-statistic:,135.7
Date:,"Thu, 29 Apr 2021",Prob (F-statistic):,1.92e-263
Time:,20:03:35,Log-Likelihood:,-7015.1
No. Observations:,3796,AIC:,14050.0
Df Residuals:,3784,BIC:,14130.0
Df Model:,11,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,8.8614,0.250,35.487,0.000,8.372,9.351
dist,-0.0308,0.012,-2.497,0.013,-0.055,-0.007
incomehi,0.3666,0.061,6.042,0.000,0.248,0.486
bytest,0.0924,0.003,29.187,0.000,0.086,0.099
female,0.1434,0.050,2.842,0.005,0.044,0.242
black,0.3538,0.071,4.967,0.000,0.214,0.493
hispanic,0.4024,0.074,5.418,0.000,0.257,0.548
ownhome,0.1456,0.067,2.185,0.029,0.015,0.276
dadcoll,0.5699,0.074,7.731,0.000,0.425,0.714

Omnibus:,116.663,Durbin-Watson:,1.928
Prob(Omnibus):,0.0,Jarque-Bera (JB):,98.499
Skew:,0.326,Prob(JB):,4.08e-22
Kurtosis:,2.554,Cond. No.,539.0


In [13]:
grader.check("q1_9")

<!-- BEGIN QUESTION -->

**Question 1.10:** Further compare the slope on distance to college in our most recent regression with the previous two regressions. What does this suggest about the collective group of variables we included in addition to `dist`, regarding the idea of omitted variable bias?
<!--
BEGIN QUESTION
name: q1_10
manual: true
-->

Once we control for the full group of additional variables (removing those sources of bias), a 10-mile increase in distance to college is associated with a decrease of only 0.0308 years of education, compared with 0.0582 in the previous regression and 0.0734 in the first. This suggests that omitting these variables biased the earlier regressions and caused the magnitude of the coefficient on `dist` to be overstated.

<!-- END QUESTION -->
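
A quick way to see how much the `dist` coefficient shrinks as controls are added is to express each estimate as a share of the first one. The sketch below only uses the three coefficients reported in the summary tables. The dictionary keys are illustrative labels.

```python
# Coefficients on `dist` from the three summary tables above.
slopes = {
    "dist only (1.2)": -0.0734,
    "dist + incomehi (1.6)": -0.0582,
    "all regressors (1.9)": -0.0308,
}

baseline = slopes["dist only (1.2)"]
shares = {}
for name, slope in slopes.items():
    # Share of the original magnitude that remains after adding controls.
    shares[name] = slope / baseline
    print(f"{name}: {slope:+.4f} ({shares[name]:.0%} of the short-regression slope)")
```

Roughly 79% of the original magnitude remains after adding `incomehi`, and roughly 42% remains with the full set of regressors.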

<!-- BEGIN QUESTION -->

**Question 1.11:** The value of the coefficient on `dadcoll` should be positive. What does this coefficient measure? Interpret this effect.
<!--
BEGIN QUESTION
name: q1_11
manual: true
-->

This coefficient measures whether, on average and holding the other regressors constant, having a father who graduated from college is associated with more or fewer years of education completed. In the regression from Question 1.9, having a father who is a college graduate is associated with 0.5699 more years of education completed.

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 1.12:** Explain why `cue80` and `stwmfg80` appear in the regression. Are the signs of their estimated coefficients what you would have believed? Explain.

<!--
BEGIN QUESTION
name: q1_12
manual: true
-->

These variables capture local labor market conditions. If the unemployment rate is high, more people may want to go to college because there are few job opportunities. If the average manufacturing wage is high, people may choose to work in manufacturing rather than pursue higher education. The signs of the estimated coefficients are what I would have expected: a one-percentage-point increase in the county unemployment rate is associated with roughly 0.02 more years of education, whereas a \$1 increase in the state hourly manufacturing wage is associated with roughly 0.05 fewer years of education.

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 1.13:** A policymaker who wants to increase the average years of schooling of the population sees your results and concludes that more colleges should be built such that people are closer to colleges. Do you agree with this proposal? Why or why not?

<!--
BEGIN QUESTION
name: q1_13
manual: true
-->

I disagree. Other factors, such as a mother or father having graduated from college or family income, appear to be associated with larger increases in average years of schooling than distance is, so building more colleges may not be the best solution.

<!-- END QUESTION -->

**Question 1.14:** Let's try and make a prediction. 

Bob is a white non-Hispanic male. His high school was 20 miles from the nearest college. His base-year composite test score `bytest` was 58. 
His family income in 1980 was \\$26,000,
and his family owned a home. His mother attended college, but his father did not.
The unemployment rate in his county was 7.5%, and the state average manufacturing
hourly wage was \\$9.75. Predict Bob’s years of completed schooling using the regressions
in 1.6 and 1.9. 

Assign `bob_schooling_1_9` to the predicted schooling using the model in 1.9 and assign `bob_schooling_1_6` to the predicted schooling using the model in 1.6.

<!--
BEGIN QUESTION
name: q1_14
-->


In [14]:
# Coefficients come from the summary tables in 1.9 (long model) and 1.6 (short model).
# `dist` is measured in tens of miles, so 20 miles enters as 2.
# Bob: incomehi = 1 (income above $25,000), ownhome = 1, momcoll = 1, dadcoll = 0,
# and female = black = hispanic = 0, so those terms drop out.
bob_schooling_1_9 = (-0.0308 * 2) + (0.3666) + (0.0924 * 58) + (0.1456) + (0.3792) + (0.0244 * 7.5) + (-0.0502 * 9.75) + 8.8614
bob_schooling_1_6 = (-0.0582 * 2) + (0.8463) + 13.6874
print("Bob's predicted years of schooling in the long model is:", round(bob_schooling_1_9, 3))
print("Bob's predicted years of schooling in the short model is:", round(bob_schooling_1_6, 3))

Bob's predicted years of schooling in the long model is: 14.744
Bob's predicted years of schooling in the short model is: 14.417


In [15]:
grader.check("q1_14")
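
The same prediction can be written as a sum of coefficient-times-value pairs, which makes it easier to see which regressors drop out. This is a sketch rather than the required solution format. It assumes that Bob's 20 miles is entered as `dist = 2` because the column is measured in tens of miles, and the dictionary names `coefs` and `bob` are illustrative.

```python
# Coefficients from the Question 1.9 regression. The values for `momcoll`,
# `cue80`, and `stwmfg80` are the ones used in the calculation above.
coefs = {
    "const": 8.8614, "dist": -0.0308, "incomehi": 0.3666, "bytest": 0.0924,
    "female": 0.1434, "black": 0.3538, "hispanic": 0.4024, "ownhome": 0.1456,
    "dadcoll": 0.5699, "momcoll": 0.3792, "cue80": 0.0244, "stwmfg80": -0.0502,
}

# Bob's characteristics. Assumption: 20 miles is entered as dist = 2,
# because `dist` is measured in tens of miles.
bob = {
    "const": 1, "dist": 2, "incomehi": 1, "bytest": 58,
    "female": 0, "black": 0, "hispanic": 0, "ownhome": 1,
    "dadcoll": 0, "momcoll": 1, "cue80": 7.5, "stwmfg80": 9.75,
}

# Dot product of coefficients and Bob's values.
prediction = sum(coefs[name] * bob[name] for name in coefs)
print(round(prediction, 3))
```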

## Part 2: Diploma Effect

Previously, we treated years of schooling as a continuous variable and did not pay much attention to the significance of particular years.
In reality, it seems natural to think that 16 years of schooling, which is how long it takes for most people to obtain a Bachelor's degree, is a more significant jump from 15 than a typical one-year increase in schooling would be.

In this next part we examine the question of whether or not a diploma makes a significant difference in a person's earnings. In other words, is there a significant difference between certain jumps in schooling (11 to 12, 15 to 16) that would indicate a benefit from a diploma in addition to the additional year of schooling?

Below you will see a table from Jaeger and Page's paper entitled *Degrees Matter: New Evidence on Sheepskin Effects in the Returns to Education* (1996). Let's take a minute to understand it.
First, each column corresponds to a different regression that they performed, with the title of the column denoting the demographic group included in that regression. So in the first column, they performed a regression only on the white men of their dataset.
- The outcome variable is log hourly wage.
- The years of schooling dummy variables bucket individuals into groups of 0-8, 9, 10, 11, 12, 13, 14, 15, 16, 17, and 18+ years of schooling. Every group has a dummy variable *except* 12 years of schooling, which leaves 10 dummy variables. Note that these buckets are *mutually exclusive and collectively exhaustive*.
- The diploma variables are dummy variables that represent the individual's highest diploma received. For example, if Alice received 16 years of schooling and holds a High School, Bachelor's, and Master's degree, only the dummy variable for 16 years of schooling and the dummy variable for Master's degree will be 1.

The regression Jaeger and Page conduct roughly looks as follows:

$$\text{Log hourly wage} = \sum_i \beta_i \times \text{Dummy variable for having } i \text{ years of education} + \sum_j \gamma_j \times \text{Dummy variable for having highest degree } j + \alpha$$

![title](jaeger_page.png)
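
The dummy-variable encoding described above can be sketched in a few lines of plain Python. The bucket labels follow the description in this section. The degree labels in `DEGREES` and the helper names are hypothetical, and the exact set of diploma categories in the paper's table may differ.

```python
# Bucket labels from the description above; 12 years is the omitted group.
YEAR_BUCKETS = ["0-8", "9", "10", "11", "13", "14", "15", "16", "17", "18+"]
# Hypothetical degree labels; the exact set in the paper's table may differ.
DEGREES = ["High School", "Associate's", "Bachelor's", "Master's", "Doctoral"]

def year_bucket(years):
    """Map years of schooling to its bucket label ("12" is the reference)."""
    if years <= 8:
        return "0-8"
    if years >= 18:
        return "18+"
    return str(years)

def encode(years, highest_degree):
    """Return the dummy variables that enter the regression for one person."""
    bucket = year_bucket(years)
    row = {f"years_{b}": int(b == bucket) for b in YEAR_BUCKETS}
    row.update({f"degree_{d}": int(d == highest_degree) for d in DEGREES})
    return row

alice = encode(16, "Master's")
print([name for name, value in alice.items() if value == 1])
```

For Alice only the 16-year dummy and the Master's dummy are switched on. A person with exactly 12 years of schooling would have every years-of-schooling dummy equal to 0.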

<!-- BEGIN QUESTION -->

**Question 2.1:** 
Why do you think Jaeger and Page estimate 4 different models separately for white men, white women, black men, and black women? 

<!--
BEGIN QUESTION
name: q2_1
manual: true
-->

They estimate the 4 models separately because gender and race could be confounding factors: both may affect years of schooling and wages for various reasons. Regressing on each group separately avoids overstating the coefficients through this kind of omitted variable bias, and it lets the returns to schooling and degrees differ across groups.

<!-- END QUESTION -->



From now on, to keep the rest of the project short, we will only examine the regression results of model 1, i.e. the regression conducted on white men.

<!-- BEGIN QUESTION -->

**Question 2.2:** Why might the effect in earnings of the 14th year of education be larger than that of the 15th?

<!--
BEGIN QUESTION
name: q2_2
manual: true
-->

14 years of education may correspond to completing a 2-year associate degree at a community college, whereas 15 years may correspond to starting but not finishing a bachelor's degree. In that case the person with 15 years has no additional degree, so their highest completed credential is still a high school diploma, while the person with 14 years may hold an associate degree.

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 2.3:** Notice that there is no dummy variable associated for individuals with 12 years of schooling. Why might this be? As a result of this exclusion, how do you interpret the coefficient on 14 years of schooling.

<!--
BEGIN QUESTION
name: q2_3
manual: true
-->

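There is no dummy variable for individuals with 12 years of schooling because the buckets are mutually exclusive and collectively exhaustive. Including a dummy for every bucket alongside the intercept would create perfect multicollinearity between the years of schooling dummies, so one group has to be left out, and 12 years serves as that reference group.

As a result, I interpret the coefficient on 14 years of schooling as the expected difference in log hourly wage between someone with 14 years of schooling and someone with 12 years, holding the highest degree received fixed.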

<!-- END QUESTION -->
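
The collinearity problem with a full set of dummies can be checked directly. The sketch below uses a hypothetical sample of years of schooling (`sample_years`) and shows that the dummies for all eleven buckets sum to 1 in every row, which duplicates the intercept column, while dropping the 12-year dummy removes that exact dependence.

```python
# Hypothetical sample of years of schooling.
sample_years = [8, 11, 12, 12, 14, 16, 18]

def bucket(years):
    """Map years of schooling to its bucket label."""
    if years <= 8:
        return "0-8"
    if years >= 18:
        return "18+"
    return str(years)

all_buckets = ["0-8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18+"]

# One dummy per bucket, including 12 years.
full_dummies = [[int(bucket(y) == b) for b in all_buckets] for y in sample_years]

# Every row sums to exactly 1, which reproduces the intercept column of 1's.
row_sums = [sum(row) for row in full_dummies]
print(row_sums)

# Dropping the 12-year dummy breaks that exact linear dependence.
reduced = [[d for b, d in zip(all_buckets, row) if b != "12"] for row in full_dummies]
reduced_sums = [sum(row) for row in reduced]
print(reduced_sums)
```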

<!-- BEGIN QUESTION -->

**Question 2.4:** Why is the coefficient on doctoral degrees less than that on high school degrees? 
Does this mean that high school graduates make more than PhD graduates? Why or why not?

<!--
BEGIN QUESTION
name: q2_4
manual: true
-->

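It does not mean that high school graduates make more than PhD graduates. The degree coefficients are estimated alongside the years of schooling dummies, so a PhD graduate's predicted log wage also includes the coefficient for their years of schooling, not just the doctoral coefficient. In addition, fewer people go on to earn doctorates, and they may take longer to reach high earnings than people who started working right after high school and gained experience during the years a doctoral program takes. Multicollinearity between the degree and years of schooling dummies may also make the individual coefficients ambiguous.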

<!-- END QUESTION -->
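
To see why a smaller degree coefficient need not mean lower pay, it can help to add up the pieces of the regression for two people. The numbers below are HYPOTHETICAL and chosen only to illustrate the arithmetic. They are not the values in Jaeger and Page's table.

```python
import math

# HYPOTHETICAL coefficients, chosen only to illustrate the arithmetic.
# They are not the values reported by Jaeger and Page.
years_coef = {"12": 0.0, "18+": 0.30}   # relative to 12 years of schooling
degree_coef = {"High School": 0.10, "Doctoral": 0.07}

# Predicted log-wage contribution = years dummy + highest-degree dummy.
hs_grad = years_coef["12"] + degree_coef["High School"]
phd_grad = years_coef["18+"] + degree_coef["Doctoral"]

# Difference in log points, then converted to a percentage difference.
gap_log_points = phd_grad - hs_grad
gap_percent = (math.exp(gap_log_points) - 1) * 100

print(round(gap_log_points, 2), round(gap_percent, 1))
```

Even with a doctoral coefficient below the high school coefficient, the PhD graduate's total is higher in this illustration because of the years of schooling term.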



## Conclusion
Very nice, you've finished Project 4! You've conducted a soup-to-nuts analysis that involved performing, comparing, and interpreting several regressions to examine sources of omitted variable bias.
You've also interpreted a jam-packed table from a noteworthy economics paper that fortified your intuition on ordinary least squares linear regression. We hope you enjoyed the project just as much as we did writing it :')

Congratulations on finishing your last project in Data 88!

---

To double-check your work, the cell below will rerun all of the autograder tests.

In [18]:
grader.check_all()

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export()