# Vector Exercises: Working With Subsets

## Income Inequality

In these exercises, we will return to the vector with estimates of the total income (from all sources) of a random sample of American households collected by the U.S. Census Bureau between 2015 and 2019 as part of the American Community Survey (ACS).

(As before, apologies to people who are not from the United States -- most of our users come from the US, so picking the United States seemed like the least bad option. However, if you are interested in completing these same exercises for your own country, head over to [IPUMS International](https://international.ipums.org/international/) to see if analogous income data has been made available by your country's Census Bureau. Simply click on the "Browse Data" button, then "Select Sample" in the top left to find the most recent data available for your country. Then see if you can find income data under the "Select Harmonized Variables" "PERSON" or "HOUSEHOLD" drop-down menus. Note that income data is hard to collect, so it's probably not available for most countries.)


### Gini Index

A standard measure of income inequality is the [Gini Index / Gini Coefficient](https://en.wikipedia.org/wiki/Gini_coefficient). The measure takes on a value of 0 when everyone in a population has the same income, and a value of 1 when all the income in a population accrues to a single person.

For discrete data, the Gini Index is defined as:

$$Gini\ Index = \frac{2 \sum_{i=1}^n i y_i}{n \sum_{i=1}^n y_i} -\frac{n+1}{n}$$

where $n$ is the number of households, $i$ is each household's rank when households are ordered from poorest to richest, and $y_i$ is the income of the household with rank $i$. We can calculate this from a vector of incomes in Python with the following function:

```python
import numpy as np


def gini(incomes):

    # Get number of observations
    n = len(incomes)

    # Sort incomes from poorest to richest so position matches rank
    # (np.sort always sorts in ascending order)
    sorted_incomes = np.sort(incomes)

    # Generate rankings i = 1, ..., n
    ranks = np.arange(1, n + 1)

    # Top term of left part of equation
    top = 2 * (sorted_incomes * ranks).sum()

    # Bottom term of left part of equation
    bottom = np.sum(incomes) * n

    # Right part of equation
    correction = (n + 1) / n

    return top / bottom - correction
```
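Before applying the formula to real data, it can help to check it on tiny vectors where the answer is known by hand. The sketch below is only an illustration: `gini_check` is a hypothetical helper that restates the formula above, and both income vectors are invented. With four identical incomes the index is 0, and when one of four households holds all the income the formula gives $1 - 1/n = 0.75$, which approaches 1 only as $n$ grows.

```python
import numpy as np


def gini_check(incomes):
    """Gini Index from the discrete formula above (illustrative helper)."""
    sorted_incomes = np.sort(incomes)  # poorest to richest
    n = len(sorted_incomes)
    ranks = np.arange(1, n + 1)
    left = 2 * (sorted_incomes * ranks).sum() / (n * sorted_incomes.sum())
    return left - (n + 1) / n


# Invented toy data: perfect equality and maximal concentration
equal = np.array([50_000, 50_000, 50_000, 50_000])
concentrated = np.array([0, 0, 0, 200_000])

print(gini_check(equal))         # 0.0
print(gini_check(concentrated))  # 0.75, i.e. 1 - 1/4
```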

1. Using this function, calculate the Gini Index of income inequality for this household income data. As before, use the command `np.loadtxt("data/us_household_incomes.txt")` to load the vector of incomes, and make sure to assign the result of that command to a new variable. 
2. Go compare your estimate to that of [other countries here.](https://www.indexmundi.com/facts/indicators/SI.POV.GINI/rankings) (Note: in this table, estimated Gini values have been multiplied by 100.) How does the US compare to other countries? Is that what you expected? **Note:** The Gini Index of income is only one metric of inequality! Results would be very different if we were to calculate, for example, the ratio of the income of the top 0.1% of earners to the income of the lowest-earning 10% of the population, or if we calculated this metric using wealth instead of income!
3. Congratulations! You have been hired by the President of the United States to advise them on their efforts to reduce income inequality. The first policy that the president has asked you to evaluate is whether income inequality would be decreased more by giving every household that makes less than \$40,000 a check for \$5,000 or giving every household that makes less than \$30,000 a check for \$7,000. Modify the household incomes in our data to reflect these policies and calculate the resulting Gini Indices. Which is more effective?

**Note:** Vectors are mutable (like lists), so each time you want to modify the vector of incomes, first create a clean copy with the `.copy()` method (e.g. `experiment1 = income_vector.copy()`). 

4. Now the president would like to know whether income inequality can be reduced more through these transfers or by applying a tax of 5% to people making more than $500,000. Calculate the consequence of such a tax on the Gini Index. Would that policy be more or less effective than transfers to low earners?
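As a reminder of the mechanics these exercises rely on, here is a minimal sketch of modifying a subset of a vector on a copy. Everything in it is an assumption for illustration: the incomes are invented (not ACS data), and the thresholds and amounts are deliberately different from the ones in the exercises. The tax here is applied to the whole income of affected households; a marginal tax on only the portion above the threshold would be a different calculation.

```python
import numpy as np

# Invented toy incomes, for illustration only
toy_incomes = np.array([12_000.0, 35_000.0, 80_000.0, 600_000.0])

# Work on a copy so the original vector is left untouched
transfer = toy_incomes.copy()
# Boolean subsetting: add 1,000 only where income is below 20,000
transfer[transfer < 20_000] += 1_000

# A second, independent experiment starts from a fresh copy
taxed = toy_incomes.copy()
# Reduce incomes above 500,000 by 10%
taxed[taxed > 500_000] *= 0.9

print(toy_incomes)  # unchanged original
print(transfer)     # only the first entry has changed
print(taxed)        # only the last entry has changed
```

Because each experiment starts from `toy_incomes.copy()`, the two policies never contaminate each other or the original data.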

### Data Citation

The ACS data used in this exercise are a subsample of the IPUMS USA data available from [usa.ipums.org](https://usa.ipums.org).

Please cite use of the data as follows: Steven Ruggles, Sarah Flood, Sophia Foster, Ronald Goeken, Jose Pacas, Megan Schouweiler and Matthew Sobek. IPUMS USA: Version 11.0 [dataset]. Minneapolis, MN: IPUMS, 2021. https://doi.org/10.18128/D010.V11.0

These data are intended for this exercise only. Individuals analyzing the data for other purposes must submit a separate data extract request directly via IPUMS USA.

Individuals are not to redistribute the data without permission. Contact ipums@umn.edu for redistribution requests.