# Estimating Labor Market Returns to Education

**Note:** Most students taking this class are Duke MIDS students who have worked with `pandas` previously. As a result, these exercises are very light on basic pandas Series and DataFrame manipulations. If you are new to `pandas`, I would advise looking into some additional practice opportunities with `pandas`, as discussed in the [Advice for Non-MIDS Students](../not_a_mids_student.ipynb) page.

In this exercise, we're going to use data from the [American Community Survey (ACS)](https://usa.ipums.org/usa/acs.shtml) to study the relationship between educational attainment and wages. The ACS is a survey conducted by the United States Census Bureau to measure numerous features of the US population (it is not "The Census", which is a count of every person in the United States that takes place every 10 years). The data we will be working with includes about 100 variables from the 2017 ACS, and is a 10% sample of the people actually surveyed by the ACS. This data comes from [IPUMS](https://usa.ipums.org/usa/), which provides a very useful tool for getting subsets of major survey datasets, not just from the US, but [from government statistical agencies the world over](https://international.ipums.org/international-action/sample_details).

This is *real* data, meaning that you are getting the data exactly as IPUMS provides it.

Within this data is information on both the educational background and current earnings of a representative sample of Americans. We will now use this data to estimate the labor-market returns to graduating high school and college, and to learn something about the meaning of an educational degree. 

### Exercise 1

Data for these [exercises can be found here](https://github.com/nickeubank/MIDS_Data/tree/master/US_AmericanCommunitySurvey). First, download `US_ACS_2017_10pct_sample.dta`. 

### Exercise 2

Now import `US_ACS_2017_10pct_sample.dta` into a pandas DataFrame. This can be done with the command `pd.read_stata`, which reads files created by the program Stata (which uses the file suffix `.dta`).

In [1]:
import pandas as pd
import numpy as np
# Download the data
acs = pd.read_stata("https://github.com/nickeubank/MIDS_Data/raw/master/US_AmericanCommunitySurvey/US_ACS_2017_10pct_sample.dta")

### Exercise 3

It can be difficult to get an easy overview of all the variables included in a large DataFrame. First, try to see them all using the command:

```
acs.columns
```


In [2]:
acs.columns

Index(['year', 'datanum', 'serial', 'cbserial', 'numprec', 'subsamp', 'hhwt',
       'hhtype', 'cluster', 'adjust',
       ...
       'migcounty1', 'migmet131', 'vetdisab', 'diffrem', 'diffphys', 'diffmob',
       'diffcare', 'diffsens', 'diffeye', 'diffhear'],
      dtype='object', length=104)

As you will see, `pandas` truncates the output with `...` rather than printing all the variable names. To get everything printed out, we can loop over all the columns and print them one at a time with the command:

```
for c in acs.columns: print(c)
```

Try it. 

In [3]:
for c in acs.columns: print(c)

year
datanum
serial
cbserial
numprec
subsamp
hhwt
hhtype
cluster
adjust
cpi99
region
stateicp
statefip
countyicp
countyfip
metro
city
citypop
strata
gq
farm
ownershp
ownershpd
mortgage
mortgag2
mortamt1
mortamt2
respmode
pernum
cbpernum
perwt
slwt
famunit
sex
age
marst
birthyr
race
raced
hispan
hispand
bpl
bpld
citizen
yrnatur
yrimmig
language
languaged
speakeng
hcovany
hcovpriv
hinsemp
hinspur
hinstri
hcovpub
hinscaid
hinscare
hinsva
hinsihs
school
educ
educd
gradeatt
gradeattd
schltype
degfield
degfieldd
degfield2
degfield2d
empstat
empstatd
labforce
occ
ind
classwkr
classwkrd
looking
availble
inctot
ftotinc
incwage
incbus00
incss
incwelfr
incinvst
incretir
incsupp
incother
incearn
poverty
migrate1
migrate1d
migplac1
migcounty1
migmet131
vetdisab
diffrem
diffphys
diffmob
diffcare
diffsens
diffeye
diffhear


### Exercise 4

To understand the relationship between education and wages, we need variables for: 

- Age
- Income
- Education
- Employment status (is the person actually working)

These quantities of interest correspond to the following variables in our data: `age`, `inctot`, `educ`, and `empstat`. 

### Exercise 5

Once you have found the right variables, subset your DataFrame so you only have the variables you need. This will make the DataFrame easier to work with. 

In [4]:
acs = acs[['age', 'empstat', 'inctot', 'educ']]

In [5]:
acs.head()

| | age | empstat | inctot | educ |
|---|---|---|---|---|
| 0 | 4 | NaN | 9999999 | nursery school to grade 4 |
| 1 | 17 | employed | 6000 | grade 11 |
| 2 | 63 | employed | 6150 | 4 years of college |
| 3 | 66 | not in labor force | 14000 | grade 12 |
| 4 | 1 | NaN | 9999999 | n/a or no schooling |


### Exercise 6

Look at your data using the `.sample()` method. `.sample()` will give you a random subset of rows from a DataFrame, and is a very good way to eyeball your data. Try `.sample(10)` to get 10 random rows. Do you see any immediate problems? Write them down with your partner.

In [6]:
acs.sample(10)

| | age | empstat | inctot | educ |
|---|---|---|---|---|
| 51701 | 32 | not in labor force | 0 | grade 12 |
| 28927 | 39 | employed | 63400 | 5+ years of college |
| 221662 | 26 | not in labor force | 40240 | grade 12 |
| 155642 | 70 | not in labor force | 37800 | grade 12 |
| 125168 | 45 | employed | 35000 | grade 12 |
| 313335 | 94 | not in labor force | 13500 | 4 years of college |
| 112181 | 86 | not in labor force | 31200 | grade 11 |
| 249730 | 79 | not in labor force | 14900 | grade 12 |
| 226638 | 12 | NaN | 9999999 | grade 5, 6, 7, or 8 |
| 77383 | 61 | employed | 20000 | grade 12 |


### Exercise 7

One obvious problem is that many people seem to have incomes of $9,999,999. Moreover, people with those incomes seem to be very young children. 

What you are seeing is one method (a relatively old one) for representing missing data. In this case, 9999999 is a method of saying "this person is too young to work, so their income value is missing". 

So let's begin by dropping anyone who has `inctot` equal to 9999999. 

In [7]:
acs = acs.query('inctot != 9999999')
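
An alternative to dropping rows right away is to recode the sentinel as a true missing value first, so that pandas' missing-data tools can be used on it. The following is a minimal sketch on a small made-up DataFrame; the name `toy` and its values are hypothetical and are not taken from the ACS file.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data that mimics the ACS coding; not real survey values.
toy = pd.DataFrame(
    {
        "age": [4, 17, 63, 1],
        "inctot": [9999999, 6000, 6150, 9999999],
    }
)

# Recode the sentinel 9999999 as a proper missing value (NaN).
toy["inctot"] = toy["inctot"].replace(9999999, np.nan)

# Count how many incomes are missing, then drop those rows.
n_missing = toy["inctot"].isna().sum()
cleaned = toy.dropna(subset=["inctot"])

print(n_missing)
print(cleaned)
```

Recoding before dropping makes it easy to check how many observations are affected before they disappear from the data.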

### Exercise 8

OK, the other thing we want to do is only consider people who are employed. So subset the dataset for the people for whom `empstat` is equal to "employed". 

Note that our decision to only look at people who are employed impacts how we should interpret the relationship we estimate between education and income. Because we are only looking at employed people, we will be estimating the relationship between education and income *for people who are employed*. That means that if education affects the *likelihood* someone is employed, we won't capture that in this analysis.

In [8]:
acs = acs[acs['empstat'] == 'employed']
acs.sample(10)

| | age | empstat | inctot | educ |
|---|---|---|---|---|
| 175130 | 24 | employed | 50000 | 4 years of college |
| 286294 | 38 | employed | 42200 | 2 years of college |
| 225621 | 48 | employed | 117000 | 5+ years of college |
| 277512 | 44 | employed | 36900 | 1 year of college |
| 14380 | 67 | employed | 31350 | 4 years of college |
| 120951 | 46 | employed | 68000 | 4 years of college |
| 272275 | 49 | employed | 38000 | grade 12 |
| 170413 | 46 | employed | 295000 | 5+ years of college |
| 278062 | 65 | employed | 75000 | 5+ years of college |
| 166874 | 56 | employed | 43000 | grade 12 |


### Exercise 9

When we dropped anyone with `inctot == 9999999`, we assumed we'd dropped all the children in the data (since when we looked at the data, those values seemed to appear for children). But we never checked the age cutoff being used. Let's check now: what's the minimum age of people still in our current data?
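
One way to approach this check is sketched below. If `age` is already numeric in your DataFrame, `acs['age'].min()` is enough. The sketch covers the less convenient case, assumed here only for illustration, in which `age` was read in as labeled categories, some of which are text. The DataFrame `toy` and the label `"less than 1 year old"` are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: ages stored as category labels, one of them text.
toy = pd.DataFrame(
    {"age": pd.Categorical(["17", "63", "less than 1 year old", "24"])}
)

# Convert labels to numbers; labels that are not numeric become NaN.
age_numeric = pd.to_numeric(toy["age"].astype(str), errors="coerce")

# .min() skips NaN by default.
min_age = age_numeric.min()
print(min_age)
```

Note that `errors="coerce"` silently turns text labels into missing values, so it is worth looking at which labels were lost before trusting the result.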

### Exercise 10

Now let's turn to education. The `educ` variable seems to have a lot of discrete values. Let's see what values exist, and their distribution, using the `value_counts()` method. This is an *extremely* useful tool you'll use a lot! Try the following code (modified for the name of your dataset, of course):

```
acs['educ'].value_counts()
```

In [9]:
acs['educ'].value_counts()

grade 12                     47815
4 years of college           33174
1 year of college            22899
5+ years of college          20995
2 years of college           14077
grade 11                      2747
grade 5, 6, 7, or 8           2092
grade 10                      1910
n/a or no schooling           1291
grade 9                       1290
nursery school to grade 4      468
Name: educ, dtype: int64
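
Raw counts can be hard to compare at a glance. As a small sketch, `value_counts` also accepts `normalize=True`, which returns each value's share of the total instead of its count. The Series `educ` below is a hypothetical toy column, not the ACS data.

```python
import pandas as pd

# Hypothetical toy column, not the ACS data.
educ = pd.Series(["grade 12", "grade 12", "4 years of college", "grade 11"])

# normalize=True returns proportions instead of raw counts.
shares = educ.value_counts(normalize=True)
print(shares)
```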

### Exercise 11

There are a lot of values in here, so let's just check a couple. What is the average value of `inctot` for people whose highest grade level is "grade 12" (in the US, that is someone who has graduated high school)?

In [10]:
acs.loc[acs['educ']=='grade 12', 'inctot'].mean()

38957.76068179442

### Exercise 12

What is the average income of someone who graduated college ("4 years of college")? What does that suggest is the value of getting a college degree after graduating high school?
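
As a rough sketch of the comparison, the snippet below takes the two group averages reported in this notebook's outputs (the "grade 12" mean from Exercise 11 and the "4 years of college" mean from the Exercise 14 table) and computes the gap both as a difference and as a ratio. The variable names are illustrative, and the result is a descriptive gap between groups, not an estimate of a causal effect.

```python
# Averages copied from the outputs shown in this exercise set.
mean_grade_12 = 38957.760682
mean_college_4yr = 75485.052933

# Absolute and relative gap between the two groups.
gap = mean_college_4yr - mean_grade_12
ratio = mean_college_4yr / mean_grade_12

print(f"Difference: {gap:,.2f}")
print(f"Ratio: {ratio:.2f}")
```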

### Exercise 13

What is the average income for someone who has not finished high school? What does that suggest is the value of a high school diploma?
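
One possible way to compute this is to pool every `educ` category below grade 12 and take a single mean. The sketch below assumes that all six categories listed in `BELOW_HS` should count as "has not finished high school" (including "n/a or no schooling", which is a judgment call). The helper `mean_income_below_hs` and the DataFrame `toy` are hypothetical names used for illustration.

```python
import pandas as pd

# Category labels as they appear in the `educ` value counts above.
BELOW_HS = [
    "n/a or no schooling",
    "nursery school to grade 4",
    "grade 5, 6, 7, or 8",
    "grade 9",
    "grade 10",
    "grade 11",
]


def mean_income_below_hs(df):
    """Average `inctot` for rows whose `educ` is below grade 12."""
    below = df["educ"].isin(BELOW_HS)
    return df.loc[below, "inctot"].mean()


# Hypothetical toy data, not the ACS data.
toy = pd.DataFrame(
    {
        "educ": ["grade 9", "grade 11", "grade 12", "4 years of college"],
        "inctot": [20000, 30000, 40000, 80000],
    }
)

result = mean_income_below_hs(toy)
print(result)
```

Calling `mean_income_below_hs(acs)` on the filtered ACS DataFrame would give the pooled average, weighted by how many people fall in each category.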

### Exercise 14

Complete the following table:

- Average income for someone who has not finished high school: _________
- Average income for someone who only completed 9th grade: _________
- Average income for someone who only completed 10th grade: _________
- Average income for someone who only completed 11th grade: _________
- Average income for someone who finished high school (12th grade) but never started college: _________
- Average income for someone who completed 4 years of college (in the US, this means graduating college): _________

In [11]:
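# Select inctot explicitly: recent pandas versions raise an error when
# .mean() is applied to non-numeric columns such as empstat.
acs.groupby('educ')[['inctot']].mean()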

| educ | inctot |
|---|---|
| n/a or no schooling | 32276.878389 |
| nursery school to grade 4 | 27592.649573 |
| grade 5, 6, 7, or 8 | 30684.196941 |
| grade 9 | 27171.907752 |
| grade 10 | 23018.795812 |
| grade 11 | 21541.686931 |
| grade 12 | 38957.760682 |
| 1 year of college | 43123.872571 |
| 2 years of college | 48679.305392 |
| 4 years of college | 75485.052933 |


### Exercise 15

Why do you think there is no benefit from moving from grade 9 to grade 10, or grade 10 to grade 11, but there is a huge benefit to moving from grade 11 to graduating high school (grade 12)?

## Take-aways

Congratulations! You just discovered "the sheepskin effect": people with degrees tend to earn more than people who have the same education but don't have the certificate of completion.

In economics, this is viewed as evidence that the reason employers pay people with high school degrees more than those without a degree is *not* that they think those who graduated high school have learned specific, useful skills. If that were the case, we would expect earnings to rise with every year of high school, since in each year of high school we learn more. Instead, this suggests employers pay high school graduates more because they think *the kind of people* who can finish high school are the *kind of people* who are likely to succeed. Finishing high school, in other words, isn't about accumulating specific knowledge; it's about showing that you *are the kind of person* who can rise to the challenge of finishing high school, which suggests you are also the kind of person who can succeed as an employee.

(Obviously, this does not tell us whether that is an *accurate* inference, just that this seems to be how employers think.)

The high school degree, in other words, is a *signal* about the kind of person you are (an idea that earned [Michael Spence](https://en.wikipedia.org/wiki/Michael_Spence) a Nobel Prize in Economics). 
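
The pattern behind the sheepskin effect can be read directly off the averages in the Exercise 14 output. As a small sketch, the snippet below copies the four high school averages from that output and computes the change in average income from each grade to the next. The names `means` and `step` are illustrative.

```python
import pandas as pd

# Average incomes copied from the Exercise 14 output.
means = pd.Series(
    {
        "grade 9": 27171.907752,
        "grade 10": 23018.795812,
        "grade 11": 21541.686931,
        "grade 12": 38957.760682,
    }
)

# Change in average income relative to the previous grade.
step = means.diff().dropna()
print(step.round(2))
```

The steps from grade 9 to 10 and from grade 10 to 11 are negative, while the step into grade 12 is large and positive, which is the jump at the point where the diploma is awarded.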