# Cleaning Data Exercises

In this exercise, we'll be returning to the American Community Survey data we used previously to measure racial income inequality in the United States. In today's exercise, we'll be using it to measure the returns to education and how those returns vary by race and gender.




## Gradescope Autograding

Please follow [all standard guidance](https://www.practicaldatascience.org/html/autograder_guidelines.html) for submitting this assignment to the Gradescope autograder, including storing your solutions in a dictionary called `results` and ensuring your notebook runs from the start to completion without any errors.

For this assignment, please name your file `exercise_cleaning.ipynb` before uploading.

You can check that you have answers for all questions in your `results` dictionary with this code:

```python
assert set(results.keys()) == {
    "ex5_age_young",
    "ex5_age_old",
    "ex7_avg_age",
    "ex8_avg_age",
    "ex9_num_college",
    "ex11_share_male_w_degrees",
    "ex11_share_female_w_degrees",
    "ex12_comparing",
}
```


### Submission Limits

Please remember that you are **only allowed three submissions to the autograder.** Your last submission (if you submit 3 or fewer times), or your third submission (if you submit more than 3 times) will determine your grade. Submissions that error out will **not** count against this total.

## Exercises

### Exercise 1

For these cleaning exercises, we'll return to the ACS data we've used before one last time. We'll be working with `US_ACS_2017_10pct_sample.dta`. Import the data (please load it from the URL so the autograder can run your notebook).

In [1]:
import pandas as pd
import numpy as np

pd.set_option("mode.copy_on_write", True)

# Download the data
acs = pd.read_stata(
    "https://github.com/nickeubank/MIDS_Data/blob/master"
    "/US_AmericanCommunitySurvey/US_ACS_2017_10pct_sample.dta?raw=True"
)

### Exercise 2

For our exercises today, we'll focus on `age`, `sex`, `educ` (education), and `inctot` (total income). Subset your data to those variables, and quickly look at a sample of 10 rows.

In [2]:
acs = acs[["age", "sex", "educ", "inctot"]].copy()

In [3]:
acs.sample(10)

| | age | sex | educ | inctot |
|---|---|---|---|---|
| 45400 | 43 | female | grade 12 | 8100 |
| 261107 | 74 | female | grade 12 | 8400 |
| 283744 | 6 | male | nursery school to grade 4 | 9999999 |
| 221953 | 18 | male | grade 12 | 5000 |
| 235877 | 45 | male | 5+ years of college | 841000 |
| 308767 | 28 | male | 4 years of college | 50000 |
| 257116 | 24 | female | 4 years of college | 20000 |
| 134982 | 56 | male | grade 12 | 60200 |
| 173144 | 9 | male | nursery school to grade 4 | 9999999 |
| 113986 | 58 | female | 4 years of college | 20000 |


### Exercise 3

As before, all the values of `9999999` have the potential to cause us real problems, so replace all the values of `inctot` that are `9999999` with `np.nan`. 

In [4]:
acs["inctot"] = acs["inctot"].replace(9999999, np.nan)
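
To see why the sentinel value matters, here is a minimal sketch on a hypothetical toy Series (the four incomes below are made up for illustration, not taken from the ACS data). It compares the mean with `9999999` left in against the mean after swapping it for `np.nan`, which `mean` skips by default.

```python
import numpy as np
import pandas as pd

# Hypothetical toy incomes; 9999999 stands in for the "missing" code
income = pd.Series([8100, 50000, 9999999, 20000])

# Left alone, the sentinel is averaged in as if it were a real income
naive_mean = income.mean()

# Replaced with NaN, it is skipped when computing the mean
cleaned = income.replace(9999999, np.nan)
clean_mean = cleaned.mean()

print(f"With sentinel:   {naive_mean:,.0f}")
print(f"Sentinel as NaN: {clean_mean:,.0f}")
```

A single sentinel among four values is enough to push the average into the millions, which is why the replacement has to happen before any income calculation.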

### Exercise 4

Attempt to calculate the average age of people in our data. What do you get? Why are you getting that error?

You *should* get an error in trying to answer this question, but **PLEASE LEAVE THE CODE THAT GENERATES THIS ERROR COMMENTED OUT SO YOUR NOTEBOOK WILL RUN IN THE AUTOGRADER**. 

Then talk about the error in a markdown cell.

In [5]:
# Code I'd run, but which I'm commenting out so this notebook runs:

# acs["age"].mean()

> I get a `TypeError`. Namely: `TypeError: 'Categorical' with dtype category does not support reduction 'mean'` It appears I'm getting the error because `age` is being stored as a Categorical rather than as a numeric type, and `mean` only works for numeric data.
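
The same error can be reproduced safely on a small scale. The sketch below uses a hypothetical three-value categorical Series (not the ACS column) and catches the exception so the cell still runs to completion.

```python
import pandas as pd

# Hypothetical toy column: numeric-looking labels plus one text label
age = pd.Series(["43", "74", "less than 1 year old"], dtype="category")

try:
    age.mean()
    raised = False
except TypeError as err:
    # Categoricals do not support numeric reductions like mean
    raised = True
    print(f"TypeError: {err}")

print(age.dtype)
```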

### Exercise 5

We want to be able to calculate things using age, so we need it to be a numeric type. Check the current type of `age`, and look at all the values of `age` to figure out why it's categorical and not numeric. You should find two problematic categories. Store the values of these categories in `"ex5_age_young"` and `"ex5_age_old"` (once you find them, it should be clear which is which).

In [6]:
results = dict()

# One way to find problems:

# Convert to string so we can use `str.isnumeric`
acs["age"] = acs["age"].astype("str")

# Use `str.isnumeric` to flag entries that aren't all digits
problems = acs.loc[~acs["age"].str.isnumeric(), "age"].unique()
problems

array(['less than 1 year old', '90 (90+ in 1980 and 1990)'], dtype=object)

In [7]:
# Or just poke around!

# I could run this code, but I'm commenting it out so this notebook isn't awful to look at:

# for i in acs.age.value_counts().index:
#     print(i)

In [8]:
results["ex5_age_young"] = problems[0]
results["ex5_age_old"] = problems[1]
print(
    f"The problematic values are '{results['ex5_age_young']}' and '{results['ex5_age_old']}'"
)

The problematic values are 'less than 1 year old' and '90 (90+ in 1980 and 1990)'
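
Another way to hunt for non-numeric entries, sketched here on hypothetical toy values that mirror the labels above, is `pd.to_numeric` with `errors="coerce"`: anything that cannot be parsed becomes `NaN`, and those rows point back to the offending labels. Note that this is slightly more permissive than `str.isnumeric` (it would also accept text like `"1.5"`), so treat it as a complement rather than a drop-in replacement.

```python
import pandas as pd

# Hypothetical toy values mirroring the labels found above
age = pd.Series(["43", "less than 1 year old", "6", "90 (90+ in 1980 and 1990)"])

# Anything that cannot be parsed as a number becomes NaN
as_number = pd.to_numeric(age, errors="coerce")

# The NaN positions identify the original problematic labels
problem_values = age[as_number.isna()].unique()
print(problem_values)
```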


### Exercise 6

In order to convert `age` into a numeric variable, we need to replace those problematic entries with values that `pandas` can later convert into numbers. Pick appropriate substitutions for the existing values and replace the current values. 

**Hint 1:** Categorical variables act like strings, so you might want to use string methods! 

**Hint 2:** Remember that characters like parentheses, pluses, asterisks, etc. are special in regular expressions, so when a pattern is treated as a regular expression (e.g., `regex=True`) you have to escape them if you want them to be interpreted literally!

**Hint 3:** Because the US Census has been conducted regularly for hundreds of years but exactly how the census has been conducted has occasionally changed, variables are sometimes coded in a way that might be interpreted in different ways for different census years. For example, hypothetically, one might write `90 (90+ in 1980 and 1990)` if the Censuses conducted in 1980 and 1990 used to top-code age at 90 (any values *over* 90 were just coded as 90), but more recent Censuses no longer top-coded age and recorded ages over 90 as the respondent's actual age. We're only working with more recent data, so anyone with a value of `90 (90+ in 1980 and 1990)` can safely be assumed to be 90 years old. People with an age less than 1 year old can be treated as being of age `0`.
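
As a rough illustration of Hint 2, the sketch below shows two ways to match the top-coded label literally on a hypothetical toy Series: letting `re.escape` add the backslashes, or passing `regex=False` so no escaping is needed at all. The variable names are made up for this example.

```python
import re

import pandas as pd

# Hypothetical toy values mirroring the two problematic labels
age = pd.Series(["43", "less than 1 year old", "90 (90+ in 1980 and 1990)"])
old_label = "90 (90+ in 1980 and 1990)"

# Option 1: let re.escape add the backslashes, then match as a regex
pattern = re.escape(old_label)
via_regex = age.str.replace(pattern, "90", regex=True)

# Option 2: skip regular expressions and match the literal text
via_literal = age.str.replace(old_label, "90", regex=False)

print(via_regex.tolist())
```

Both options give the same result here; the literal match is usually the simpler choice when no pattern matching is actually needed.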

In [9]:
acs["age"] = acs["age"].str.replace("less than 1 year old", "0", regex=True)
acs["age"] = acs["age"].str.replace(r"90 \(90\+ in 1980 and 1990\)", "90", regex=True)


In [10]:
# Check that worked
acs.loc[~acs["age"].str.isnumeric(), "age"].unique()

array([], dtype=object)

In [11]:
assert acs["age"].str.isnumeric().all()

### Exercise 7

Now convert age from a categorical to numeric. Calculate the average age among this group, and store it in `"ex7_avg_age"`.

In [12]:
acs["age"] = acs["age"].astype("int")
results["ex7_avg_age"] = acs["age"].mean()
print(f"The average age is {results['ex7_avg_age']:.2f}")

The average age is 41.30


### Exercise 8

Let's now filter out anyone in our data whose age is less than 18. Note that before we made `age` a numeric variable, we couldn't do this! Again, calculate the average age and this time store it in `"ex8_avg_age"`. 

Use this sample of people 18 and over for all subsequent exercises.
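
To illustrate why the numeric conversion matters for this filter, here is a small sketch with hypothetical toy ages stored as text. Text is compared character by character, so `>= "18"` keeps a 9-year-old and drops a 100-year-old; after converting to integers the filter behaves as intended.

```python
import pandas as pd

# Hypothetical toy ages stored as text
age_text = pd.Series(["9", "18", "45", "100"])

# Text comparison goes character by character, so "9" sorts after "18"
kept_as_text = age_text[age_text >= "18"].tolist()

# Numeric comparison after conversion behaves as intended
age_num = age_text.astype("int")
kept_as_number = age_num[age_num >= 18].tolist()

print(kept_as_text)
print(kept_as_number)
```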

In [13]:
acs = acs[acs["age"] >= 18].copy()
results["ex8_avg_age"] = acs["age"].mean()
print(f"The average age of those 18 and over is {results['ex8_avg_age']:.2f}")

The average age of those 18 and over is 49.76


### Exercise 9

Create an indicator variable for whether each person has *at least* a college Bachelor's degree called `college_degree`. Use this variable to calculate the number of people in the dataset with a college degree. You may assume that to get a college degree you need to complete at least 4 years of college. Save the result as `"ex9_num_college"`.

In [14]:
acs["educ"].value_counts(dropna=False)

educ
grade 12                     92576
4 years of college           47212
1 year of college            38746
5+ years of college          29801
2 years of college           20753
grade 5, 6, 7, or 8           5975
grade 11                      5816
grade 10                      4078
n/a or no schooling           3644
grade 9                       3145
nursery school to grade 4     1288
Name: count, dtype: int64

In [15]:
# Could do with two conditions and |
acs["college_degree"] = (acs["educ"] == "4 years of college") | (
    acs["educ"] == "5+ years of college"
)
acs["college_degree"].value_counts(dropna=False)

college_degree
False    176021
True      77013
Name: count, dtype: int64

In [16]:
# Or you could do with `.isin`
acs["college_degree"] = acs["educ"].isin(["4 years of college", "5+ years of college"])
acs["college_degree"].value_counts(dropna=False)

college_degree
False    176021
True      77013
Name: count, dtype: int64

In [17]:
# Check it worked
acs.loc[acs["college_degree"] == 1, "educ"].value_counts()

educ
4 years of college           47212
5+ years of college          29801
n/a or no schooling              0
nursery school to grade 4        0
grade 5, 6, 7, or 8              0
grade 9                          0
grade 10                         0
grade 11                         0
grade 12                         0
1 year of college                0
2 years of college               0
Name: count, dtype: int64

In [18]:
# Remember booleans are just special cases of 0 and 1, so
# the count that's true we can get with `sum`
results["ex9_num_college"] = acs["college_degree"].sum()
print(
    f"There are {results['ex9_num_college']:,.0f} people in our data with a college degree."
)

There are 77,013 people in our data with a college degree.


### Exercise 10

Let's examine the educational gender gap. Use `pd.crosstab` to create a cross-tabulation of `sex` and `college_degree`. `pd.crosstab` will give you the number of people who have each combination of `sex` and `college_degree` (so in this case, it will give us a 2x2 table with Male and Female as rows, and `college_degree` True and False as columns, or vice versa).

In [19]:
pd.crosstab(acs["college_degree"], acs["sex"])

| college_degree | male | female |
|---|---|---|
| False | 85821 | 90200 |
| True | 36181 | 40832 |


### Exercise 11

Counts are kind of hard to interpret. `pd.crosstab` can also normalize values to give shares. Look at the `pd.crosstab` help file to figure out how to normalize the values in the table. Normalize them so that you get the share of men with and without college degrees, and the share of women with and without college degrees.

Store the share (between 0 and 1) of men with college degrees in `"ex11_share_male_w_degrees"`, and the share of women with degrees in `"ex11_share_female_w_degrees"`.

In [20]:
tab = pd.crosstab(acs["college_degree"], acs["sex"], normalize="columns")
tab

| college_degree | male | female |
|---|---|---|
| False | 0.703439 | 0.688381 |
| True | 0.296561 | 0.311619 |


In [21]:
for g in ["male", "female"]:
    results[f"ex11_share_{g}_w_degrees"] = tab.loc[True, g]
    print(
        f"The share of {g}s with college degrees is "
        f"{results[f'ex11_share_{g}_w_degrees']:.4f}"
    )

The share of males with college degrees is 0.2966
The share of females with college degrees is 0.3116
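
The choice of `normalize` direction is easy to get backwards, so here is a minimal sketch on hypothetical toy data (six made-up people, not the ACS sample) contrasting `normalize="columns"` with `normalize="index"` for the same table layout.

```python
import pandas as pd

# Hypothetical toy data, not the ACS sample
toy = pd.DataFrame(
    {
        "sex": ["male", "male", "male", "male", "female", "female"],
        "college_degree": [True, False, False, False, True, False],
    }
)

# Each column (sex) sums to 1: share with and without a degree within each sex
by_sex = pd.crosstab(toy["college_degree"], toy["sex"], normalize="columns")

# Each row (degree status) sums to 1: sex composition of each degree group
by_degree = pd.crosstab(toy["college_degree"], toy["sex"], normalize="index")

print(by_sex)
print(by_degree)
```

With `college_degree` in the rows and `sex` in the columns, only `normalize="columns"` answers "what share of men (or women) have a degree?"; `normalize="index"` instead answers "what share of degree holders are men?".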


### Exercise 12

Now, let's recreate that table for people who are 40 and over and people under 40. Over time, what does this suggest about the absolute difference in the share of men and women earning college degrees? Has it gotten larger, stayed the same, or gotten smaller? Store your answer (either `"the absolute difference has increased"` or `"the absolute difference has decreased"`) in `"ex12_comparing"`.

In [22]:
older = acs[40 <= acs["age"]]
older_table = pd.crosstab(older["college_degree"], older["sex"], normalize="columns")
older_table

| college_degree | male | female |
|---|---|---|
| False | 0.682123 | 0.699144 |
| True | 0.317877 | 0.300856 |


In [23]:
older_gap = older_table.loc[True, "male"] - older_table.loc[True, "female"]
print(
    f"the gap between men and women's college attainment"
    f" for those 40 AND OVER is {older_gap:.2f}"
)

the gap between men and women's college attainment for those 40 AND OVER is 0.02


In [24]:
younger = acs[acs["age"] < 40]
younger_table = pd.crosstab(
    younger["college_degree"], younger["sex"], normalize="columns"
)
# Rounded for display
younger_table.round(3)

| college_degree | male | female |
|---|---|---|
| False | 0.743 | 0.666 |
| True | 0.257 | 0.334 |


In [25]:
younger_gap = younger_table.loc[True, "male"] - younger_table.loc[True, "female"]
print(
    f"the gap between men and women's college attainment"
    f" for those UNDER 40 is {younger_gap:.2f}"
)

the gap between men and women's college attainment for those UNDER 40 is -0.08


In [26]:
print(
    f"The male-female gap has gone from {older_gap:.2f} to {younger_gap:.2f}. "
    f"Thus the change in the absolute gap from older to younger cohorts is {np.abs(younger_gap) - np.abs(older_gap):.2f}."
)

The male-female gap has gone from 0.02 to -0.08. Thus the change in the absolute gap from older to younger cohorts is 0.06.


In [27]:
results["ex12_comparing"] = "the absolute difference has increased"
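
As an alternative to building two separate tables, the cohort comparison can be sketched in one pass by labeling each person's cohort and averaging the boolean indicator within each cohort and sex. The toy data and the `cohort` column below are hypothetical and only meant to show the mechanics.

```python
import pandas as pd

# Hypothetical toy data, not the ACS sample
toy = pd.DataFrame(
    {
        "age": [25, 30, 35, 38, 45, 50, 62, 70],
        "sex": ["male", "female", "male", "female"] * 2,
        "college_degree": [False, True, False, True, True, False, True, True],
    }
)

# Label each person's cohort, then average the boolean within cohort and sex
toy["cohort"] = toy["age"].ge(40).map({True: "40 and over", False: "under 40"})
shares = toy.groupby(["cohort", "sex"])["college_degree"].mean().unstack("sex")

# Male share minus female share, one value per cohort
gap = shares["male"] - shares["female"]
print(gap)
```

Because both cohorts come out of the same grouped calculation, there is less room for accidentally mixing up which table a number came from.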

### Exercise 13

In words, what is causing the change noted in Exercise 12? (That is, looking at the tables above, tell me a story about men's and women's college attainment.)

> While a larger proportion of men than women have a college degree in the older cohort, that relationship is flipped for younger Americans; among those under 40, a larger share of adult women have college degrees than men.
>
> Interestingly, this appears to be the result of *both* an increasing share of women getting college degrees *and* a decreasing share of men getting college degrees:

In [28]:
print(
    f"Among older Americans, {older_table.loc[True, 'male']:.1%} of \n"
    f"men have college degrees; among younger men, this number has \n"
    f"fallen to {younger_table.loc[True, 'male']:.1%}."
)

print(
    f"By contrast, among older Americans {older_table.loc[True, 'female']:.1%} of \n"
    f"women have college degrees; among younger women, this number has \n"
    f"risen to {younger_table.loc[True, 'female']:.1%}."
)

Among older Americans, 31.8% of 
men have college degrees; among younger men, this number has 
fallen to 25.7%.
By contrast, among older Americans 30.1% of 
women have college degrees; among younger women, this number has 
risen to 33.4%.


## Want More Practice?

Calculate the educational racial gap in the United States for White Americans, Black Americans, Hispanic Americans, and other groups. 

Note that to do these calculations, you'll have to deal with the fact that, unlike most Americans, the US Census Bureau treats "Hispanic" not as a racial category but as an ethnic one. As a result, the racial category "White" in `race` actually includes most Hispanic Americans. For this analysis, we wish to work with the mutually exclusive categories of "White, non-Hispanic", "White, Hispanic", "Black (Hispanic or non-Hispanic)", and a category for everyone else. 
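
One possible way to build those mutually exclusive groups is `np.select`, which assigns the label of the first matching condition. The sketch below runs on hypothetical toy data: the `hispan` column name and all the category labels are assumptions made for illustration, so check the actual values in the ACS file before adapting it.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: column names and labels are assumptions
toy = pd.DataFrame(
    {
        "race": ["white", "white", "black", "black", "chinese"],
        "hispan": ["not hispanic", "mexican", "not hispanic", "cuban", "not hispanic"],
    }
)

is_hispanic = toy["hispan"] != "not hispanic"
conditions = [
    (toy["race"] == "white") & ~is_hispanic,
    (toy["race"] == "white") & is_hispanic,
    toy["race"] == "black",
]
labels = ["White, non-Hispanic", "White, Hispanic", "Black (Hispanic or non-Hispanic)"]

# np.select takes the first matching condition; everything else gets the default
toy["race_group"] = np.select(conditions, labels, default="Other")
print(toy["race_group"].tolist())
```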

In [29]:
assert set(results.keys()) == {
    "ex5_age_young",
    "ex5_age_old",
    "ex7_avg_age",
    "ex8_avg_age",
    "ex9_num_college",
    "ex11_share_male_w_degrees",
    "ex11_share_female_w_degrees",
    "ex12_comparing",
}

In [30]:
sorted(list(results.keys()))

['ex11_share_female_w_degrees',
 'ex11_share_male_w_degrees',
 'ex12_comparing',
 'ex5_age_old',
 'ex5_age_young',
 'ex7_avg_age',
 'ex8_avg_age',
 'ex9_num_college']

In [31]:
len(results.keys())

8