# Week 2 Quiz: Manipulating DataFrames

**Note**: 

> This exercise has been written out in something called a Jupyter Notebook. We'll discuss Jupyter Notebooks in more detail later in this specialization (they are a very powerful tool for data science communication!), but for the time being, the notebook is just a convenient way for us to write out the exercise. You don't need to *do* anything with the notebook except read its contents; just write your Python code in a regular `.py` file.

**WARNING:**

> When asked to round your answers to a certain number of decimals, do *not* round any results until you've finished your computations and have your final answer! For example, if you were to calculate the average hourly wage for workers, and you did so by first calculating the average weekly salary of workers and the average hours worked per week, then divided the first number by the second, you should NOT round the average weekly salary of workers or the average hours worked per week. Rounding intermediate results can lead to compounding errors that cause problems for the autograder.
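
To make the warning concrete, here is a small sketch using made-up numbers (the salary and hours figures are hypothetical, not taken from the ACS data, and rounding to whole numbers is a deliberately exaggerated case). It compares rounding only the final answer with rounding the intermediate averages first.

```python
# Hypothetical averages, invented purely for illustration.
avg_weekly_salary = 987.654
avg_weekly_hours = 37.456

# Recommended: carry full precision and round only the final answer.
wage_rounded_at_end = round(avg_weekly_salary / avg_weekly_hours, 2)

# Not recommended: round the intermediate results, then divide.
wage_rounded_early = round(round(avg_weekly_salary) / round(avg_weekly_hours), 2)

print(wage_rounded_at_end, wage_rounded_early)
```

The two printed values differ, which is exactly the kind of discrepancy that can trip up an autograder.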

## Estimating Labor Market Returns to Education

In this exercise, we're going to use data from the [American Community Survey (ACS)](https://usa.ipums.org/usa/acs.shtml) to study the relationship between educational attainment and wages. The ACS is a survey conducted by the United States Census Bureau (though it is not "The Census," which is a counting of every person in the United States that takes place every 10 years) to measure numerous features of the US population. The data we will be working with includes about 100 variables from the 2017 ACS, and is a 10% sample of the ACS (which itself is a 1% sample of the US population, so we're working with about a 0.1% sample of the United States).
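
The sampling arithmetic in the paragraph above can be checked with a short sketch. It uses the approximate shares stated in the text and ignores survey weights, so treat the result as a rough order of magnitude only; the variable names are illustrative.

```python
from fractions import Fraction

acs_share_of_population = Fraction(1, 100)  # the ACS is roughly a 1% sample
extract_share_of_acs = Fraction(1, 10)      # this file is a 10% sample of the ACS

# A 10% sample of a 1% sample covers 0.1% of the population.
extract_share_of_population = acs_share_of_population * extract_share_of_acs
print(float(extract_share_of_population))

# Each row therefore stands in for roughly this many people.
people_per_row = 1 / extract_share_of_population
print(people_per_row)
```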

This data comes from [IPUMS](https://usa.ipums.org/usa/), which provides a very useful tool for getting subsets of major survey datasets, not just from the US, but [from government statistical agencies the world over](https://international.ipums.org/international-action/sample_details).

This is *real* data, meaning that you are being given the data exactly as IPUMS provides it. Documentation for all variables used in this data can be found [here](https://usa.ipums.org/usa-action/variables/group) (you can either search by variable name to find out what a variable in this data means, or search for a topic you care about to see whether a matching variable is in this data).

Within this data is information on both the educational background and current earnings of a representative sample of Americans. We will now use this data to estimate the labor-market returns to graduating high school and college, and to learn something about the meaning of an educational degree. 

## Exercises

### Exercise 1

Import `US_ACS_2017_10pct_sample.dta` into a pandas DataFrame. 

This can be done with the command `pd.read_stata("US_ACS_2017_10pct_sample.dta")`, which will read in files created in the program Stata (and which uses the file suffix `.dta`). This is a format commonly used by social scientists. We will discuss the range of tools for importing data built into pandas in more detail next week!

In [1]:
import pandas as pd
import numpy as np


# Load the data from the local Stata file
acs = pd.read_stata("data/US_ACS_2017_10pct_sample.dta")

## Getting to Know Your Data

When you get a new dataset like this, it's good to start by trying to get a feel for its contents and organization. Toy datasets you sometimes get in classes are often very small, and easy to look at, but this is a pretty large dataset, so you can't just open it up and get a good sense of it. Here are some ways to get to know your data. 

### Exercise 2

How many observations are in your data?

In [2]:
results = dict()
results["ex2_num_obs"] = len(acs)
print(f"There are {results['ex2_num_obs']:,} observations")

There are 319,004 observations


### Exercise 3

How many variables are in your data?

In [3]:
# Either:
results["ex3_num_vars"] = len(acs.columns)
print(f"There are {results['ex3_num_vars']} variables")

There are 104 variables


In [4]:
# Or you can do:

acs.shape[1]

104

### Exercise 4

 Let's see what variables are in this dataset. First, try to see them all using the command:


```python
acs.columns
```

In [5]:
acs.columns

Index(['year', 'datanum', 'serial', 'cbserial', 'numprec', 'subsamp', 'hhwt',
       'hhtype', 'cluster', 'adjust',
       ...
       'migcounty1', 'migmet131', 'vetdisab', 'diffrem', 'diffphys', 'diffmob',
       'diffcare', 'diffsens', 'diffeye', 'diffhear'],
      dtype='object', length=104)

As you will see, pandas truncates the output rather than printing every variable when there are this many in a dataset.

To get everything printed out, we can loop over all the columns and print them one at a time with the command:

```python
for c in acs.columns: print(c)
```

It's definitely a bit of a hack, but honestly a pretty useful one!
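
If you would rather not loop, two alternatives are sketched below on a toy DataFrame (the `var0`, `var1`, ... column names are hypothetical stand-ins for the real ACS variables). The first converts the column index to a plain list; the second temporarily lifts the pandas display limit via the `display.max_seq_items` option.

```python
import pandas as pd

# Toy stand-in for the ACS data: a wide frame with hypothetical column names.
toy = pd.DataFrame({f"var{i}": [0] for i in range(104)})

# Converting the Index to a plain list avoids the "..." truncation.
all_columns = toy.columns.tolist()
print(all_columns)

# Alternatively, lift the display limit only inside this block.
with pd.option_context("display.max_seq_items", None):
    print(toy.columns)
```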

### Exercise 5

That's a *lot* of variables, and definitely more than we need. In general, life is easier when working with these kinds of huge datasets if you can narrow down the number of variables a little. Because this exercise looks at the relationship between education and wages, we need variables for:

- Age
- Income
- Education
- Employment status (is the person actually working)

These quantities of interest correspond to the following variables in our data: `age`, `inctot`, `educ`, and `empstat`. 

Subset your data to just those variables. 

In [6]:
acs = acs[["age", "inctot", "educ", "empstat"]]
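
As an aside, `pd.read_stata` also accepts a `columns` argument, so the subsetting can happen at load time instead. The sketch below demonstrates this on a small made-up file written to a temporary directory (the values and the `toy.dta` file name are hypothetical); it is an alternative, not the approach this exercise asks for.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the ACS extract (values are made up).
toy = pd.DataFrame(
    {
        "age": [25, 40, 6],
        "inctot": [30000, 52000, 9999999],
        "educ": ["grade 12", "4 years of college", "nursery school to grade 4"],
        "hhwt": [100, 90, 110],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.dta")
    toy.to_stata(path, write_index=False)

    # Load only the columns we need instead of subsetting afterwards.
    subset = pd.read_stata(path, columns=["age", "inctot", "educ"])

print(subset.columns.tolist())
```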

### Exercise 6 

Now that we have a more manageable number of variables, it's often very useful to look at a handful of rows of your data. The easiest way to do this is probably the `.head()` method, which will show you the first five rows, or the `.tail()` method, which will show you the last five rows.

But to get a good sense of your data, it's often better to use the `.sample()` method, which returns a random set of rows. As the first and last rows are sometimes not representative, a random set of rows can be very helpful. Try looking at a random sample of 20 rows (note: you don't have to run `.sample()` 20 times to get 20 rows. Look at the `.sample()` help file if you're stuck).
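
A minimal sketch of the relevant arguments, run on a toy DataFrame with made-up values: `n` controls how many rows are drawn, and `random_state` (optional) makes the draw repeatable.

```python
import pandas as pd

# Toy stand-in for the ACS data (values are made up).
toy = pd.DataFrame({"age": range(100), "inctot": range(0, 100_000, 1_000)})

# n sets how many rows to draw; random_state makes the draw repeatable.
rows = toy.sample(n=20, random_state=42)
print(rows)
```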

In [7]:
acs.sample(20)

| | age | inctot | educ | empstat |
|---|---|---|---|---|
| 69396 | 66 | 9100 | grade 9 | not in labor force |
| 137595 | 33 | 76000 | grade 12 | employed |
| 228414 | 60 | 42500 | grade 12 | not in labor force |
| 97765 | 59 | 135000 | 4 years of college | employed |
| 126584 | 18 | 5000 | grade 12 | employed |
| 144794 | 42 | 103000 | 5+ years of college | employed |
| 224927 | 54 | 30000 | grade 12 | employed |
| 184296 | 64 | 340000 | 4 years of college | not in labor force |
| 254791 | 19 | 0 | 1 year of college | not in labor force |
| 125683 | 6 | 9999999 | nursery school to grade 4 | |


### Exercise 7

Do you see any immediate problems? What issues do you see?

> Uh, yup! People have incomes of 9 million?! And those people tend to be children?

### Exercise 8 

One problem is that many people seem to have incomes of $9,999,999. Moreover, people with those incomes seem to be very young children. 

What you are seeing is one method (a relatively old one) for representing missing data. In this case, the value 9999999 is being used as a **sentinel value** — a way to denote missing data that was used back in the day when there was no way to add a special data type for missing data. In this case, it identifies observations where the person is too young to work, so their income value is missing. 

So let's begin by dropping anyone who has `inctot` equal to 9999999.
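
To see why the sentinel matters, here is a small sketch on made-up rows. It shows how the sentinel inflates a naive mean, and contrasts dropping the affected rows (what this exercise asks for) with the alternative of marking the value as missing via `Series.mask`. The `SENTINEL` name and the toy values are illustrative.

```python
import pandas as pd

SENTINEL = 9_999_999  # the missing-data code used by `inctot`

# Toy stand-in for the ACS data (values are made up).
toy = pd.DataFrame({"age": [6, 33, 59], "inctot": [SENTINEL, 76_000, 135_000]})

# Without handling the sentinel, the mean is wildly inflated.
naive_mean = toy["inctot"].mean()

# Option 1 (the approach in this exercise): drop the sentinel rows.
dropped = toy[toy["inctot"] != SENTINEL]

# Option 2: keep the rows but mark the income as missing (NaN).
marked = toy.assign(inctot=toy["inctot"].mask(toy["inctot"] == SENTINEL))

print(naive_mean, dropped["inctot"].mean(), marked["inctot"].mean())
```

Both options give the same mean here because pandas skips missing values by default; they differ in whether the rows remain available for other columns.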

After dropping, how many observations do you have?

In [8]:
acs = acs[acs["inctot"] != 9_999_999]
results["ex8_updated_num_obs"] = len(acs)

print(f"After dropping there are {results['ex8_updated_num_obs']:,} observations")

After dropping there are 265,103 observations


### Exercise 9

OK, the other potential problem is that our data includes lots of people who are unemployed and people who are not in the labor force (this means they not only don't have a job, but also aren't looking for a job). For this analysis, we want to focus on the wages of people who are currently employed. So subset the dataset for the people for whom `empstat` is equal to "employed". 

Note that our decision to only look at people who are employed impacts how we should interpret the relationship we estimate between education and income. Because we are only looking at employed people, we will be estimating the relationship between education and income *for people who are employed*. That means that if education affects the *likelihood* someone is employed, we won't capture that in this analysis.
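
One way to check whether education is related to the likelihood of being employed is to compute the employment rate within each education group *before* subsetting. The sketch below does this on made-up rows, so the rates it prints say nothing about the real ACS data.

```python
import pandas as pd

# Toy stand-in for the ACS data (values are made up).
toy = pd.DataFrame(
    {
        "educ": ["grade 12"] * 4 + ["4 years of college"] * 4,
        "empstat": [
            "employed", "unemployed", "not in labor force", "employed",
            "employed", "employed", "employed", "not in labor force",
        ],
    }
)

# Share of each education group that is employed.
employment_rate = (toy["empstat"] == "employed").groupby(toy["educ"]).mean()
print(employment_rate)
```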

(You might also want to run `.sample()` after this just to make sure you were successful in your subsetting).

After this subsetting, how many observations do you have?

In [9]:
acs = acs[acs.empstat == "employed"]
acs.sample(10)

| | age | inctot | educ | empstat |
|---|---|---|---|---|
| 206672 | 41 | 21000 | 2 years of college | employed |
| 84485 | 36 | 76000 | 4 years of college | employed |
| 267055 | 61 | 8400 | grade 12 | employed |
| 134484 | 31 | 48000 | 5+ years of college | employed |
| 108191 | 30 | 5000 | grade 12 | employed |
| 217980 | 49 | 80000 | 5+ years of college | employed |
| 277779 | 52 | 95000 | 4 years of college | employed |
| 127634 | 67 | 21300 | grade 12 | employed |
| 20292 | 32 | 15000 | grade 12 | employed |
| 3140 | 28 | 56000 | grade 12 | employed |


In [10]:
results["ex9_updated_num_obs"] = len(acs)

print(
    "After subsetting for employed people "
    f"there are {results['ex9_updated_num_obs']:,} observations"
)

After subsetting for employed people there are 148,758 observations


### Exercise 10

Now let's turn to education. The `educ` variable seems to have a lot of discrete values. Let's see what values exist, and their distribution, using the `value_counts()` method. This is an *extremely* useful tool you'll use a lot! Try the following code (modified for the name of your dataset, of course):

```python
acs["educ"].value_counts()
```

In [11]:
acs["educ"].value_counts()

educ
grade 12                     47815
4 years of college           33174
1 year of college            22899
5+ years of college          20995
2 years of college           14077
grade 11                      2747
grade 5, 6, 7, or 8           2092
grade 10                      1910
n/a or no schooling           1291
grade 9                       1290
nursery school to grade 4      468
Name: count, dtype: int64
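
If you want shares rather than raw counts, `value_counts()` accepts `normalize=True`. A minimal sketch on a made-up Series:

```python
import pandas as pd

# Toy stand-in for the `educ` column (values are made up).
educ = pd.Series(["grade 12", "grade 12", "4 years of college", "grade 11"])

counts = educ.value_counts()                # raw counts, most common first
shares = educ.value_counts(normalize=True)  # proportions that sum to 1

print(counts)
print(shares)
```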

### Exercise 11

There are a lot of values in here, so let's just check a couple. What is the average value of `inctot` for people whose highest grade level is "grade 12" (in the US, that is someone who has graduated high school)? 

*Please round your answer to two decimal places* remembering to not round any intermediate results.

In [12]:
results["ex11_grade12_income"] = acs.loc[acs.educ == "grade 12", "inctot"].mean()

print(
    "The average income for an employed person \n"
    f"whose highest completed grade is Grade 12 is ${results['ex11_grade12_income']:,.2f}."
)

The average income for an employed person 
whose highest completed grade is Grade 12 is $38,957.76.


### Exercise 12

What is the average income of someone who has completed an undergraduate degree but not done any postgraduate education ("4 years of college")? 

*Please round your answer to two decimal places* remembering to not round any intermediate results.

In [13]:
results["ex12_college_income"] = acs.loc[
    acs.educ == "4 years of college", "inctot"
].mean()

print(
    f"The average income for an employed person \n"
    f"with an undergraduate degree but no  \n"
    f"postgraduate education is ${results['ex12_college_income']:,.2f}."
)

The average income for an employed person 
with an undergraduate degree but no  
postgraduate education is $75,485.05.


### Exercise 13

In percentage terms, how much does an employed college graduate earn as compared to someone who is only a high school graduate? In other words, divide the average income of an employed college graduate by the average income of an employed high school graduate.

Put your answer in percentage terms (so 100 implies they earn the same amount). *Please round your answer to one decimal place* remembering to not round any intermediate results.

In [14]:
results["ex12_college_income_pct"] = 100 * (
    results["ex12_college_income"] / results["ex11_grade12_income"]
)

print(
    f"The avg employed college graduate earns {results['ex12_college_income_pct']:.1f}%\n"
    "the salary of the average employed high school graduate."
)

The avg employed college graduate earns 193.8%
the salary of the average employed high school graduate.


In [15]:
# The flip would be

flipped = 100 * (results["ex11_grade12_income"]) / results["ex12_college_income"]

print(
    f"The avg employed high school graduate earns {flipped:.1f}%\n"
    "what the average college graduate earns."
)

The avg employed high school graduate earns 51.6%
what the average college graduate earns.
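
Note that there are two common ways to state this comparison, and they differ by exactly 100 points. The sketch below uses the averages as printed in Exercises 11 and 12, which are already rounded to cents, so it is an illustration only; a graded answer should use the unrounded means. The variable names are illustrative.

```python
# Average incomes as printed above (rounded to cents; illustration only).
college_income = 75_485.05
grade12_income = 38_957.76

# Ratio form: 100 means the two groups earn the same.
ratio_pct = 100 * college_income / grade12_income

# Difference form: 0 means the two groups earn the same.
premium_pct = 100 * (college_income - grade12_income) / grade12_income

print(f"{ratio_pct:.1f}% of, i.e. {premium_pct:.1f}% more than")
```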



### Exercise 14

What does that suggest is the value of getting a college degree after graduating high school?

> Getting a college education yields a *huge* wage premium.

### Exercise 15

What is the difference in the average income for somebody whose highest grade completed is Grade 11 and somebody whose highest grade completed is Grade 10? *Please round your answer to two decimal places.* Remember to not round any intermediate results.


In [19]:
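# Store the average income for each education level of interest,
# adding to the existing `results` dict rather than resetting it
for level in ["grade 9", "grade 10", "grade 11", "grade 12", "4 years of college"]:
    avg_income = acs.loc[acs["educ"] == level, "inctot"].mean()
    print(f"those who have finished {level} earn {avg_income:,.2f}")
    # Keys look like "grade_11" or "4_years_of_college"
    results[level.replace(" ", "_")] = avg_income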

print(
    f'The difference between Grade 11 and Grade 10 graduates is ${results["grade_11"]-results["grade_10"]:,.2f}'
)

those who have finished grade 9 earn 27,171.91
those who have finished grade 10 earn 23,018.80
those who have finished grade 11 earn 21,541.69
those who have finished grade 12 earn 38,957.76
those who have finished 4 years of college earn 75,485.05
The difference between Grade 11 and Grade 10 graduates is $-1,477.11
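
The same per-level averages can also be computed in one step with `groupby`. This is a sketch on made-up rows rather than the solution used above; with the real data, if `educ` is loaded as a categorical column, you may want to pass `observed=True` to `groupby`.

```python
import pandas as pd

# Toy stand-in for the employed subsample (values are made up).
toy = pd.DataFrame(
    {
        "educ": ["grade 10", "grade 10", "grade 11", "grade 11", "grade 12", "grade 12"],
        "inctot": [20_000, 26_000, 19_000, 24_000, 35_000, 43_000],
    }
)

# One mean per education level, computed in a single pass.
avg_by_educ = toy.groupby("educ")["inctot"].mean()
print(avg_by_educ)

# Differences between levels are then simple lookups.
diff_11_vs_10 = avg_by_educ["grade 11"] - avg_by_educ["grade 10"]
print(diff_11_vs_10)
```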


### Exercise 16

What is the difference in the average income for somebody whose highest grade completed is Grade 12 (has finished high school) and somebody whose highest grade completed is Grade 11?

*Please round your answer to two decimal places.* Remember to not round any intermediate results.

In [18]:
print(
    f'The difference between Grade 12 and Grade 11 graduates is ${results["grade_12"]-results["grade_11"]:,.2f}'
)

The difference between Grade 12 and Grade 11 graduates is $17,416.07


### Exercise 17

Moving from Grade 10 to Grade 11 and moving from Grade 11 to Grade 12 both constitute getting one additional year of high school education. Are these two increases in education associated with the same change in average earnings? Can you think of a reason why?
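
To compare the two steps side by side, one option is `Series.diff()`, which subtracts each value from the one after it. The sketch below uses the averages as printed in Exercise 15 (rounded to cents), so the differences match the printed answers only to within rounding.

```python
import pandas as pd

# Average incomes as printed in Exercise 15 (rounded to cents).
avg_income = pd.Series(
    [27_171.91, 23_018.80, 21_541.69, 38_957.76],
    index=["grade 9", "grade 10", "grade 11", "grade 12"],
)

# Change in average income relative to the previous grade level.
step_change = avg_income.diff()
print(step_change.round(2))
```

The first entry is missing because grade 9 has no earlier level to compare against.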


## Take-aways

Congratulations! You just discovered "the sheepskin effect": people with degrees tend to earn substantially more than people who have *almost* as much education, but don't have an actual degree.

In economics, this is viewed as evidence that the reason employers pay people with high school degrees more than those without a degree is *not* that they think those who graduated high school have learned specific, useful skills. If that were the case, we would expect employee earnings to rise with every year of high school, since in each year of high school we learn more.

Instead, this suggests employers pay high school graduates more because they think *the kind of people* who can finish high school are *the kind of people* who are likely to succeed at their jobs. Finishing high school, in other words, isn't about accumulating specific knowledge; it's about showing that you *are the kind of person* who can rise to the challenge of finishing high school, which suggests you are also the kind of person who can succeed as an employee.

(Obviously, this does not tell us whether that is an *accurate* inference, just that that seems to be how employers think.)

In other words, in the eyes of employers, a high school degree is a *signal* about the kind of person you are, not certification that you've learned a specific set of skills (an idea that earned [Michael Spence](https://en.wikipedia.org/wiki/Michael_Spence) a Nobel Prize in Economics). 

### Data Citation

The ACS data used in this exercise are a subsample of the IPUMS USA data available from [usa.ipums.org](https://usa.ipums.org).

Please cite use of the data as follows: Steven Ruggles, Sarah Flood, Sophia Foster, Ronald Goeken, Jose Pacas, Megan Schouweiler and Matthew Sobek. IPUMS USA: Version 11.0 [dataset]. Minneapolis, MN: IPUMS, 2021. https://doi.org/10.18128/D010.V11.0

These data are intended for this exercise only. Individuals analyzing the data for other purposes must submit a separate data extract request directly via IPUMS USA.

Individuals are not to redistribute the data without permission. Contact ipums@umn.edu for redistribution requests.