<a href="https://colab.research.google.com/github/nzarama-kouadio/Nzarama_Kouadio_DE_Mini_Project9/blob/main/Mini_Project_9.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Exercise 1

Today, we will be using the ACS data we used during our first `pandas` exercise to examine the US income distribution, and how it varies by race. Note that because the US income distribution has a very small number of people with *extremely* high incomes, and the ACS is just a sample of Americans, the far right tail of the distribution will not be very well estimated. However, this data should suffice for helping to understand wealth inequality in the United States.

To begin, load the ACS Data we used in our first pandas exercise. That [data can be found here](https://github.com/nickeubank/MIDS_Data/tree/master/US_AmericanCommunitySurvey). We'll be working with `US_ACS_2017_10pct_sample.dta`.

In [None]:
import pandas as pd
import numpy as np

content = pd.read_stata(
    "https://github.com/nickeubank/MIDS_Data/raw/refs/heads/master/US_AmericanCommunitySurvey/US_ACS_2017_10pct_sample.dta?download="
)
content.sample(5)

Unnamed: 0,year,datanum,serial,cbserial,numprec,subsamp,hhwt,hhtype,cluster,adjust,...,migcounty1,migmet131,vetdisab,diffrem,diffphys,diffmob,diffcare,diffsens,diffeye,diffhear
135233,2017,1,1145178,2017001000000.0,2,67,53,married-couple family household,2017011000000.0,1.011189,...,0,not in identifiable area,,no cognitive difficulty,no ambulatory difficulty,no independent living difficulty,no,no vision or hearing difficulty,no,no
50440,2017,1,921491,2017001000000.0,3,49,11,married-couple family household,2017009000000.0,1.011189,...,0,not in identifiable area,,no cognitive difficulty,no ambulatory difficulty,no independent living difficulty,no,no vision or hearing difficulty,no,no
119138,2017,1,485384,2017001000000.0,3,29,157,married-couple family household,2017005000000.0,1.011189,...,167,not in identifiable area,,has cognitive difficulty,has ambulatory difficulty,no independent living difficulty,yes,has vision or hearing difficulty,yes,yes
216229,2017,1,1115486,2017001000000.0,2,64,60,"female householder, no husband present",2017011000000.0,1.011189,...,0,not in identifiable area,,no cognitive difficulty,no ambulatory difficulty,no independent living difficulty,no,no vision or hearing difficulty,no,no
222557,2017,1,684252,2017001000000.0,2,70,46,married-couple family household,2017007000000.0,1.011189,...,0,not in identifiable area,,no cognitive difficulty,no ambulatory difficulty,no independent living difficulty,no,no vision or hearing difficulty,no,no


### Exercise 2

Let's begin by calculating the mean US income from this data (recall that income is stored in the `inctot` variable). Store the answer in `results` under the key `"ex2_avg_income"`.

In [None]:
ex2_avg_income = content["inctot"].mean()
print(f"The average income of our sample is: {ex2_avg_income:.2f}")

The average income of our sample is: 1723646.27


### Exercise 3

Hmmm... That doesn't look right. The average American is definitely not earning that much a year! Let's look at the values of `inctot` using `value_counts()`. Do you see a problem?

Now use `value_counts()` with the argument `normalize=True` to see proportions of the sample that report each value instead of the count of people in each category. What percentage of our sample has an income of 9,999,999? Store that proportion (between 0 and 1) as `"ex3_share_making_9999999"`. What percentage has an income of 0? Store that proportion as `"ex3_share_making_zero"`.

(Recall `.value_counts()` returns a Series, so you can pull values out with our usual pandas tools.)
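As a small illustration of pulling a proportion out of the result, here is a sketch on made-up data. The `toy_income` values are hypothetical and are not taken from the ACS.

```python
import pandas as pd

# Hypothetical toy incomes; 9999999 plays the role of the sentinel value.
toy_income = pd.Series(
    [0, 0, 50000, 9999999, 9999999, 9999999, 30000, 0, 9999999, 20000]
)

# normalize=True returns shares (which sum to 1) instead of raw counts.
toy_shares = toy_income.value_counts(normalize=True)

# The result is a Series indexed by the distinct values,
# so .loc looks up the share for a particular value.
share_sentinel = toy_shares.loc[9999999]
share_zero = toy_shares.loc[0]

print(f"Share at 9999999: {share_sentinel:.2f}")
print(f"Share at 0: {share_zero:.2f}")
```

Because the index holds the income values themselves, `.loc[9999999]` is a label lookup, not a positional one.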

In [None]:
inctot_value_proportions = content["inctot"].value_counts(normalize=True)

ex3_share_making_9999999 = inctot_value_proportions.loc[9999999]
print(
    f"The proportion of our sample with an income of 9_999_999 is: {ex3_share_making_9999999:.2f}"
)

ex3_share_making_zero = inctot_value_proportions.loc[0]
print(
    f"The proportion of our sample with an income of 0 is: {ex3_share_making_zero:.2f}"
)

The proportion of our sample with an income of 9_999_999 is: 0.17
The proportion of our sample with an income of 0 is: 0.11


### Exercise 4

As we discussed before, the ACS uses a value of 9999999 to denote that income information is not available for someone. The problem with using this kind of "sentinel value" is that pandas doesn't understand that this is supposed to denote missing data, and so when it averages the variable, it doesn't know to ignore 9999999.

To help out `pandas`, use the `replace` command to replace all values of 9999999 with `np.nan`.
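To see the effect of a sentinel value on a mean, here is a minimal sketch on a hypothetical four-person Series (the numbers are invented for illustration only):

```python
import numpy as np
import pandas as pd

# Hypothetical incomes; the third person has the 9999999 sentinel.
toy_income = pd.Series([20000, 40000, 9999999, 60000])

# pandas treats the sentinel as a real income, which inflates the mean.
mean_with_sentinel = toy_income.mean()

# After replacing the sentinel with np.nan, .mean() skips that entry.
toy_clean = toy_income.replace(9999999, np.nan)
mean_without_sentinel = toy_clean.mean()

print(f"Mean with sentinel:    {mean_with_sentinel:.2f}")
print(f"Mean without sentinel: {mean_without_sentinel:.2f}")
```

Note that `replace` returns a new Series, so the result has to be assigned back to be kept.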

In [None]:
content["inctot"] = content["inctot"].replace(9999999, np.nan)
content["inctot"].sample(5)

Unnamed: 0,inctot
136473,
48522,85000.0
65743,56000.0
17488,65100.0
244631,23000.0


### Exercise 5

Now that we've properly labeled our missing data as `np.nan`, let's calculate the average US income once more. Store the answer in `results` under the key `"ex5_avg_income"`.

In [None]:
ex5_avg_income = content["inctot"].mean()
print(f"The mean income of our sample is: {ex5_avg_income:.2f}")

The mean income of our sample is: 40890.18


### Exercise 6

OK, now we've been able to get a reasonable average income number. As we can see, a major advantage of using `np.nan` is that `pandas` knows that `np.nan` observations should just be ignored when we are calculating means.

But it's not enough to just get rid of the people who had `inctot` values of 9999999. We also need to know why those values were missing. Suppose, for example, that the value of 9999999 was used for anyone who made more than 100,000 dollars: if we just dropped those people, then our estimate of average income wouldn't mean much, would it?
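The thought experiment above can be sketched with invented numbers. Everything here is hypothetical, including the 100,000 cutoff, which the ACS does not actually use this way:

```python
import pandas as pd

# Hypothetical "true" incomes for five people.
true_income = pd.Series([30_000, 50_000, 70_000, 150_000, 200_000])
true_mean = true_income.mean()

# Suppose incomes above 100,000 had been recorded as missing.
# .where keeps values where the condition holds and sets the rest to NaN.
observed_income = true_income.where(true_income <= 100_000)

# .mean() skips the NaN entries, so only the three lower incomes are averaged.
observed_mean = observed_income.mean()

print(f"True mean:     {true_mean:.2f}")
print(f"Observed mean: {observed_mean:.2f}")
```

In this toy case the observed mean is half the true mean, which is why it matters *who* is missing and not only how many.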

So let's make sure we understand *why* data is missing for some people. If you recall from our last exercise, it seemed to be the case that most of the people who had incomes of 9999999 were children. Let's make sure that's true by looking at the distribution of the variable `age` for people for whom `inctot` is missing (i.e. subset the data to people with `inctot` missing, then look at the values of `age` with `value_counts()`).

Then do the opposite: look at the distribution of the `age` variable for people for whom `inctot` is *not* missing.

Can you determine when 9999999 was being used? Is it ok we're excluding those people from our analysis?

Note: In this data, Python doesn't understand `age` is a number; it thinks it is a string because the original data has categories like "90 (90+ in 1980 and 1990)" and "less than 1 year old". So you can't just use `min()` or `max()`. We'll discuss converting string variables into numbers in a future class.
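To illustrate why `min()` and `max()` mislead on text, here is a plain-Python sketch with a few hypothetical age labels. It only shows the general string-comparison problem; the actual `age` column may be stored differently (for example as a categorical).

```python
# Hypothetical age labels stored as text rather than numbers.
ages_as_text = ["9", "10", "85", "less than 1 year old"]

# Strings are compared character by character, not by numeric value:
# "1" sorts before "8", which sorts before "9", and digits sort before letters.
text_min = min(ages_as_text)
text_max = max(ages_as_text)

print(f"min() on text: {text_min!r}")
print(f"max() on text: {text_max!r}")
```

Here `min()` picks `"10"` and `max()` picks the `"less than 1 year old"` label, neither of which is the youngest or oldest person.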

In [None]:
# Subset the data to people with `inctot` missing, then look at the values of `age`
missing = content.loc[content["inctot"].isna()]
age = missing["age"].value_counts()

# Then do the opposite: the distribution of `age` for people for whom `inctot` is *not* missing
not_missing = content.loc[content["inctot"].notna()]
age_not_missing = not_missing["age"].value_counts()

# Exclude the kids (ages 14 and under) from our dataframe
ages_to_exclude = [str(a) for a in range(1, 15)] + ["less than 1 year old"]
content = content.loc[~content["age"].isin(ages_to_exclude)]

# Check if it worked
content.age.value_counts()

age,count
60,4950
54,4821
59,4776
56,4776
58,4734
...,...
5,0
4,0
3,0
2,0


### Exercise 7

Great, so now we know why those people had missing data, and we're ok with excluding them.

But as we previously noted, there are also a lot of observations of zero income in our data, and it's not clear that everyone with zero income *should* be included in this average, since those may be people who are retired, or in school.

Let's limit our attention to people who are currently working by subsetting to only employed respondents. We can do this using `empstat`. Remember you can use `value_counts()` to see what values of `empstat` are in the data!
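A minimal sketch of this inspect-then-filter pattern on a hypothetical DataFrame follows. Only the `"employed"` label comes from this exercise; the other `empstat` labels and all incomes are made up.

```python
import pandas as pd

# Hypothetical toy data; labels other than "employed" are invented.
toy = pd.DataFrame(
    {
        "empstat": ["employed", "unemployed", "not in labor force", "employed"],
        "inctot": [52000, 0, 0, 38000],
    }
)

# First look at which values occur, and how often.
status_counts = toy["empstat"].value_counts()
print(status_counts)

# Then keep only the rows where the boolean mask is True.
toy_employed = toy.loc[toy["empstat"] == "employed"]
print(toy_employed)
```

Checking `value_counts()` first guards against filtering on a label that is spelled differently in the data.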

In [None]:
employed = content["empstat"].value_counts()

content = content.loc[content["empstat"] == "employed"]

print(content.sample(5))

        year  datanum   serial      cbserial          numprec subsamp  hhwt  \
307948  2017        1  1244945  2.017001e+12                2      79    62   
141755  2017        1   540931  2.017001e+12  1 person record      36    20   
45685   2017        1   692845  2.017001e+12                2      79    20   
17182   2017        1   461366  2.017001e+12                3      45    77   
262304  2017        1  1066261  2.017001e+12                3      86    32   

                                  hhtype       cluster    adjust  ...  \
307948    hhtype could not be determined  2.017012e+12  1.011189  ...   
141755  female householder, living alone  2.017005e+12  1.011189  ...   
45685     hhtype could not be determined  2.017007e+12  1.011189  ...   
17182    married-couple family household  2.017005e+12  1.011189  ...   
262304   married-couple family household  2.017011e+12  1.011189  ...   

        migcounty1                        migmet131  \
307948          85  dallas-fort

### Exercise 8

Now let's estimate the racial income gap in the United States. What is the average salary for employed Black Americans, and what is the average salary for employed White Americans? In percentage terms, how much more does the average White American make than the average Black American?

**Note:** these values are not quite accurate estimates. As we'll discuss in later lessons, to get completely accurate estimates from the ACS we have to take into account how people were selected to be interviewed. But you get pretty good estimates in most cases even without weights—your estimate of the racial wage gap without weights is within 5\% of the corrected value.

**Note:** This is actually an underestimate of the wage gap. The US Census treats Hispanic respondents as a sub-category of "White." While all ethnic distinctions are socially constructed, and so on some level these distinctions are all deeply problematic, this coding is inconsistent with what most Americans think of when they hear the term "White," a term *most* Americans think of as a category that is mutually exclusive of being Hispanic or Latino (categories which are also usually conflated in American popular discussion). With that in mind, most researchers working with US Census data split "White" into "White, Hispanic" and "White, Non-Hispanic" using `race` *and* `hispan`. But for the moment, just identify "White" respondents using the value in `race`.

Store your results in `results` under the keys `"ex8_avg_income_black"`, `"ex8_avg_income_white"`, and the percentage difference as `ex8_racial_difference`. Please note the wording above when calculating the percentage difference to ensure you get the reference category correct, and interpret your result as well.
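The choice of reference category changes the answer, as this sketch with invented round numbers shows. The helper `percent_more` is a hypothetical name introduced only for illustration.

```python
def percent_more(value, reference):
    """Return how much larger `value` is than `reference`, in percent."""
    return (value - reference) / reference * 100


# Hypothetical average incomes for two groups.
group_a_income = 150.0
group_b_income = 100.0

# "How much more does A make than B?" uses B as the reference (denominator).
a_relative_to_b = percent_more(group_a_income, group_b_income)

# Swapping the reference gives a different number for the same gap.
b_relative_to_a = percent_more(group_b_income, group_a_income)

print(f"A relative to B: {a_relative_to_b:.2f}%")
print(f"B relative to A: {b_relative_to_a:.2f}%")
```

With these numbers, A makes 50% more than B, while B makes about 33% less than A.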

In [None]:
content.race.value_counts()

white_people = content.loc[content["race"] == "white"]
ex8_avg_income_white = white_people["inctot"].mean()
print(f"The average income of White people is: {ex8_avg_income_white:.2f}")

black_people = content.loc[content["race"] == "black/african american/negro"]
ex8_avg_income_black = black_people["inctot"].mean()
print(f"The average income of Black people is: {ex8_avg_income_black:.2f}")

ex8_racial_difference = (
    (ex8_avg_income_white - ex8_avg_income_black) / ex8_avg_income_black
) * 100
print(
    f"The percentage difference in income between the 2 groups is: {ex8_racial_difference:.2f}%"
)

The average income of White people is: 60473.15
The average income of Black people is: 41747.95
The percentage difference in income between the 2 groups is: 44.85%


### Exercise 9


As noted above, these estimates are not actually *quite* correct because we aren't taking into account the fact that when the US Census decided who to survey, not all Americans had the same likelihood of being asked. The US American Community Survey is an example of a *weighted* survey (essentially, people from smaller subpopulations have a higher likelihood of being included to ensure enough individuals in the final survey to constitute a representative sample that can be used statistically).

To calculate a weighted average that takes into account these survey weights (in other words, a more accurate estimate of US incomes), you need to use the following formula:

$$weighted\_mean\_of\_x = \frac{\sum_i x_i * weight_i}{\sum_i weight_i}$$

(As you can see, when $weight_i$ is constant for all observations, this just simplifies to our normal formula for mean values. It is only when weights vary across individuals that weights must be explicitly addressed).

In this data, weights are stored in the variable `perwt`, which is the number of people for which each observation is a stand-in (the inverse of that observation's sampling probability).
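As a quick check of the formula, here is a sketch on three hypothetical values and weights. It computes the weighted mean by hand and compares it with NumPy's `np.average`, which accepts a `weights` argument.

```python
import numpy as np

# Hypothetical values and weights (not ACS data).
x = np.array([10.0, 20.0, 30.0])
weights = np.array([1.0, 1.0, 2.0])

# Direct application of the formula: sum(x_i * w_i) / sum(w_i).
weighted_by_hand = (x * weights).sum() / weights.sum()

# np.average with weights computes the same quantity.
weighted_by_numpy = np.average(x, weights=weights)

# With constant weights, the formula reduces to the ordinary mean.
constant_weights = np.ones_like(x)
constant_weight_mean = (x * constant_weights).sum() / constant_weights.sum()

print(f"Weighted mean (by hand): {weighted_by_hand}")
print(f"Weighted mean (NumPy):   {weighted_by_numpy}")
print(f"Constant-weight mean:    {constant_weight_mean} vs plain mean {x.mean()}")
```

The third observation counts twice, which pulls the weighted mean above the unweighted one.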

Using the formula, re-calculate the *weighted* average income for both populations and store them as `ex9_avg_income_white` and `ex9_avg_income_black`.

In [None]:
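def weighted_mean(dataset):
    """Weighted mean of `inctot`, using `perwt` as the survey weight."""
    # Use the vectorized pandas .sum() rather than Python's built-in sum()
    numerator = (dataset["inctot"] * dataset["perwt"]).sum()
    denominator = dataset["perwt"].sum()

    return numerator / denominator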


ex9_avg_income_white = weighted_mean(white_people)
ex9_avg_income_black = weighted_mean(black_people)

print(f"The average weighted income of White people is: {ex9_avg_income_white:.2f}")
print(f"The average weighted income of Black people is: {ex9_avg_income_black:.2f}")

The average weighted income of White people is: 58361.48
The average weighted income of Black people is: 40430.95


### Exercise 10

Now calculate the weighted average income gap between *non-Hispanic* White Americans and Black Americans. What percentage more do employed White non-Hispanic Americans earn than employed Black Americans? Store as `"ex10_wage_gap"`.
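Selecting on two columns at once can be sketched as follows, on a hypothetical DataFrame. The `"white"`, `"black/african american/negro"`, and `"not hispanic"` labels appear in this exercise; the `"mexican"` label and all incomes are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical toy data; the "mexican" label and the incomes are invented.
toy = pd.DataFrame(
    {
        "race": ["white", "white", "black/african american/negro", "white"],
        "hispan": ["not hispanic", "mexican", "not hispanic", "not hispanic"],
        "inctot": [60000, 45000, 40000, 70000],
    }
)

# Combine conditions with & and wrap each one in parentheses,
# because & binds more tightly than == in Python.
is_non_hispanic_white = (toy["race"] == "white") & (toy["hispan"] == "not hispanic")
toy_non_hispanic_white = toy.loc[is_non_hispanic_white]

print(toy_non_hispanic_white)
```

Filtering an already-subset DataFrame on the second column, as done with `white_people`, gives the same rows as the combined mask.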

In [None]:
non_hispanic_white_people = white_people.loc[white_people["hispan"] == "not hispanic"]

weighted_income_non_hispanic = weighted_mean(non_hispanic_white_people)

print(
    f"The average weighted income of non-Hispanic White people is: {weighted_income_non_hispanic:.2f}"
)

ex10_wage_gap = (
    (weighted_income_non_hispanic - ex9_avg_income_black) / ex9_avg_income_black
) * 100

print(f"The weighted wage gap relative to Black Americans is: {ex10_wage_gap:.2f}%")

The average weighted income of non-Hispanic White people is: 61669.29
The weighted wage gap relative to Black Americans is: 52.53%
