# Vector Exercises: Working With Subsets

## Income Inequality

In these exercises, we will return to the vector with estimates of the total income (from all sources) of a random sample of American households collected by the U.S. Census Bureau between 2015 and 2019 as part of the American Community Survey (ACS).

(As before, apologies to people who are not from the United States -- most of our users come from the US, so picking the United States seemed like the least bad option. However, if you are interested in completing these same exercises for your own country, head over to [IPUMS International](https://international.ipums.org/international/) to see if analogous income data has been made available by your country's Census Bureau. Simply click on the "Browse Data" button, then "Select Sample" in the top left to find the most recent data available for your country. Then see if you can find income data under the "Select Harmonized Variables" "PERSON" or "HOUSEHOLD" drop-down menus. Note that income data is hard to collect, so it's probably not available for most countries.)


### Gini Index

A standard measure of income inequality is the [Gini Index / Gini Coefficient](https://en.wikipedia.org/wiki/Gini_coefficient). The measure takes on a value of 0 when everyone in a population has the same income, and a value of 1 when all the income in the population accrues to a single person.

For discrete data, the Gini Index is defined as:

$$Gini\ Index = \frac{2 \sum_{i=1}^n i y_i}{n \sum_{i=1}^n y_i} -\frac{n+1}{n}$$

where $i$ is each household's rank when households are ordered from poorest to richest, $y_i$ is the income of the household with rank $i$, and $n$ is the number of households. We can calculate this from a vector of incomes in Python with the following function:

```python
import numpy as np


def gini(incomes):

    # Get number of observations
    n = len(incomes)

    # Sort incomes from poorest to richest and generate rankings i
    sorted_incomes = np.sort(incomes)
    ranks = np.arange(1, n + 1)

    # Top term of left part of equation
    top = 2 * (sorted_incomes * ranks).sum()

    # Bottom term of left part of equation
    bottom = np.sum(incomes) * n

    # Right part of equation
    correction = (n + 1) / n

    return top / bottom - correction
```
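As a quick sanity check of the formula, the sketch below applies it to two made-up income vectors. The vectors and the helper name `gini_check` are illustrative assumptions, not part of the exercise data. With equal incomes the index is 0; when one household out of $n$ holds all the income, the discrete formula gives $(n-1)/n$, which approaches 1 as the sample grows.

```python
import numpy as np


def gini_check(incomes):
    # Compact restatement of the discrete Gini formula above
    # (hypothetical helper, used only for this sanity check)
    y = np.sort(np.asarray(incomes, dtype=float))
    n = len(y)
    ranks = np.arange(1, n + 1)
    return 2 * (ranks * y).sum() / (n * y.sum()) - (n + 1) / n


# Made-up data: four households in each case
equal = np.array([50_000, 50_000, 50_000, 50_000])
one_earner = np.array([0, 0, 0, 100_000])

gini_equal = gini_check(equal)       # everyone earns the same
gini_one = gini_check(one_earner)    # one household earns everything

print(gini_equal)
print(gini_one)
```

With only four households, the second value is 3/4 rather than 1, which is why the "value of 1" case is best read as the limit for large populations.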

1. Using this function, calculate the Gini Index of income inequality for this household income data. As before, use the command `np.loadtxt("data/us_household_incomes.txt")` to load the vector of incomes, and make sure to assign the result of that command to a new variable. 
2. Go compare your estimate to that of [other countries here.](https://www.indexmundi.com/facts/indicators/SI.POV.GINI/rankings) (Note: in this table, estimated Gini values have been multiplied by 100.)
3. Congratulations! You have been hired by the President of the United States to advise them on their efforts to reduce income inequality. The first policy that the president has asked you to evaluate is whether income inequality would be decreased more by 





### Data Citation

The ACS data used in this exercise are a subsample of the IPUMS USA data available from [usa.ipums.org](https://usa.ipums.org).

Please cite use of the data as follows: Steven Ruggles, Sarah Flood, Sophia Foster, Ronald Goeken, Jose Pacas, Megan Schouweiler and Matthew Sobek. IPUMS USA: Version 11.0 [dataset]. Minneapolis, MN: IPUMS, 2021. https://doi.org/10.18128/D010.V11.0

These data are intended for this exercise only. Individuals analyzing the data for other purposes must submit a separate data extract request directly via IPUMS USA.

Individuals are not to redistribute the data without permission. Contact ipums@umn.edu for redistribution requests.

In our previous exercise, you created a vector called `my_favorite_numbers` with at least six of your favorite numbers. In this exercise we will work with that vector again (you will need to re-create it here), but instead of manipulating the entire vector as a single object, we will subset and manipulate its individual components.

0. Re-create your `my_favorite_numbers` vector with at least six of your favorite numbers.
1. Now let's make a vector of True/False values. Create a Boolean vector called `big` that is `True` wherever the corresponding entry of `my_favorite_numbers` is greater than `5`. If you look at `big`, do the values make sense?
2. Now use `big` to return only the values of `my_favorite_numbers` that are greater than 5.
3. Now, using the same logic, try to get all the values of `my_favorite_numbers` that are bigger than the average of `my_favorite_numbers`. (Hint: you'll need to use a function we've seen.)
4. Now, if you used more than one line to do number 3, try to do it in one line of code.
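
The pattern behind these steps is Boolean subsetting: a comparison produces a vector of True/False values, and indexing with that vector keeps only the entries where it is True. Here is a minimal sketch on a made-up vector (the names `temperatures`, `is_warm`, and `warm_days` and the values are illustrative assumptions, so this doesn't give away your answer):

```python
import numpy as np

# Hypothetical example data, not the vector from the exercise
temperatures = np.array([12, 25, 31, 8, 19, 27])

# A comparison returns one True/False value per entry
is_warm = temperatures > 20

# Indexing with the Boolean vector keeps only the True positions
warm_days = temperatures[is_warm]

print(is_warm)
print(warm_days)
```

The threshold on the right-hand side of the comparison doesn't have to be a literal number; any expression that evaluates to a number works.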

## Vector Exercise 2: The US Income Distribution

In these exercises, we will load and work with a vector that contains estimates of the total income (from all sources) of a random sample of American households collected by the U.S. Census Bureau between 2015 and 2019 as part of the American Community Survey (ACS).

(Apologies to people who are not from the United States -- most of our users come from the US, so picking the United States seemed like the least bad option. However, if you are interested in completing these same exercises for your own country, head over to [IPUMS International](https://international.ipums.org/international/) to see if analogous income data has been made available by your country's Census Bureau. Simply click on the "Browse Data" button, then "Select Sample" in the top left to find the most recent data available for your country. Then see if you can find income data under the "Select Harmonized Variables" "PERSON" or "HOUSEHOLD" drop-down menus. Note that income data is hard to collect, so it's probably not available for most countries.)

1. Estimate your own *household's* gross (pretax) income and write it down. (We will be working with household gross income, so if multiple people in your household work, add up their incomes.)
   - If you aren't an American, head over to [this OECD website](https://data.oecd.org/conversion/purchasing-power-parities-ppp.htm) to find the Purchasing Power Parity (PPP) exchange rate between your currency and the US dollar in 2019, and use it to convert your income to dollars. Note that this exchange rate may be quite different from the official exchange rate -- [PPP is a method of calculating exchange rates](https://en.wikipedia.org/wiki/Purchasing_power_parity) that is meant to take into account differences in cost-of-living and labor across countries to generate an exchange rate that accurately reflects buying power.
2. Now make a guess about what share of American households make more money than you.
3. We are now going to load a vector of total incomes for a representative sample of US households between 2015 and 2019. Use the command `np.loadtxt("data/us_household_incomes.txt")` to load the vector. Make sure to assign the result of that command to a new variable.
4. What was the mean (average) household income in the United States between 2015 and 2019?
5. What was the median household income in the United States during this period? (hint: if `np.mean()` returns the mean of a vector, you can probably guess how to get the median... :) ). The median income is the income of the household that earned more than 50% of American households, and earned less than 50% of American households.
   -  You will notice that the median household income is significantly less than the average -- that's because there are a small number of households in the US that earn a great deal of money and pull up the average.
6. Now let's see if your impression of where *your* household sat in the US income distribution was correct! The function `np.percentile(v, p)` returns the value below which `p` percent of the observations in the vector `v` fall. So for example, `np.percentile(incomes, 10)` will return the household income (if I named the vector that I got in question 3 `incomes`) that is greater than 10% of the values in `incomes` (and implicitly less than the income of the other ~90% of households). In question 2 you guessed the share of households that make *more* than you, so the corresponding percentile is 100 minus that guess. If you plug in that percentile, does the result look like your household's gross income?
   - Odds are that you guessed that you are somewhere near the middle of the US income distribution, but that you are not -- this is actually [a well-documented sociological phenomenon!](https://www.theatlantic.com/politics/archive/2013/08/why-americans-all-believe-they-are-middle-class/278240/).
7. Try out different percentiles until you find where *your* household fits into the US income distribution!
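
To make the percentile logic concrete, here is a small sketch on a made-up vector (the name `toy_incomes` and its values are assumptions for illustration, not ACS data). It shows how a single large value pulls the mean above the median, how `np.percentile` maps a percentage to a value, and how a Boolean comparison goes the other way, from a value to the share of observations below it.

```python
import numpy as np

# Hypothetical toy data: four modest incomes and one very large one
toy_incomes = np.array([20_000, 30_000, 40_000, 50_000, 1_000_000])

# The one large value pulls the mean far above the median
mean_income = np.mean(toy_incomes)
median_income = np.median(toy_incomes)

# Value below which 25% of observations fall
# (np.percentile interpolates linearly between data points by default)
p25 = np.percentile(toy_incomes, 25)

# Going the other way: share of observations below a given income
share_below = np.mean(toy_incomes < 45_000)

print(mean_income, median_income)
print(p25, share_below)
```

Taking the mean of a Boolean vector works because `True` counts as 1 and `False` as 0, so the mean is the fraction of entries that satisfy the comparison.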

### Data Citation

The ACS data used in this exercise are a subsample of the IPUMS USA data available from [usa.ipums.org](https://usa.ipums.org).

Please cite use of the data as follows: Steven Ruggles, Sarah Flood, Sophia Foster, Ronald Goeken, Jose Pacas, Megan Schouweiler and Matthew Sobek. IPUMS USA: Version 11.0 [dataset]. Minneapolis, MN: IPUMS, 2021. https://doi.org/10.18128/D010.V11.0

These data are intended for this exercise only. Individuals analyzing the data for other purposes must submit a separate data extract request directly via IPUMS USA.

Individuals are not to redistribute the data without permission. Contact ipums@umn.edu for redistribution requests.

## Vector Exercise 4: Family and Friends

Create a vector that represents the age of at least four different family members or friends. You can name it whatever you want.

1. What is the mean age of the people in your vector? Find out in two ways,
with and without using the `np.mean()` command.

2. How old is the youngest person in your vector? (Use a numpy command to find out.)

3. What is the age gap between the youngest person and the oldest person in your vector?
(Again use numpy to find out, and try to be as general as possible in the sense that
your code should work even if the elements in your vector, or their order, change.)

4. How many people in your vector are above age 25? (Again, try to make your code
work even in the case that your vector changes.)

5. Replace the age of the oldest person in your vector with the age of someone
else you know.

6. Create a new vector that indicates how old each person in your vector
will be in 10 years.

7. Create a new vector that indicates what year each person in your vector
will turn 100 years old.
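
Several of these questions rely on operations that act on a whole vector at once. The sketch below illustrates a few of them on a made-up vector of heights (the name `heights_cm` and its values are hypothetical, chosen so the example stays separate from your own answers). None of the lines hard-code a position or a length, so they keep working if the entries or their order change.

```python
import numpy as np

# Hypothetical example data, not the ages from the exercise
heights_cm = np.array([158, 172, 181, 165])

# Summaries that do not depend on the order of the entries
shortest = np.min(heights_cm)
height_range = np.max(heights_cm) - np.min(heights_cm)

# Summing a Boolean vector counts the True entries
n_tall = np.sum(heights_cm > 170)

# np.argmax gives the position of the largest entry,
# which can then be overwritten in place
heights_cm[np.argmax(heights_cm)] = 176

# Arithmetic applies to every entry at once
heights_in = heights_cm / 2.54

print(shortest, height_range, n_tall)
print(heights_cm)
```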
