# Indices and Missing Values Exercises

One of the defining features of `pandas` is the use of indices for data alignment. Like many features in `pandas`, it can make life very easy, but if you aren't careful, it can also lead to problems. This is especially true because indices lead to behavior that is very different from what one sees in other languages and libraries (like `R`, `numpy`, and `julia`). So let's spend a little time practicing interacting with indices (and missing values)!

## Missing Values

### Exercise 1

Today, we will be using the ACS data we used during our first `pandas` exercise to examine the US income distribution, and how it varies by race. Note that because the US income distribution has a very small number of people with *extremely* high incomes, and the ACS is just a sample of Americans, the far right tail of the distribution will not be very well estimated. However, this data should suffice for helping to understand income inequality in the United States.

To begin, load the ACS Data we used in our first pandas exercise. That [data can be found here](https://github.com/nickeubank/MIDS_Data/tree/master/US_AmericanCommunitySurvey). We'll be working with `US_ACS_2017_10pct_sample.dta`. 
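If you haven't read a Stata `.dta` file in a while, the sketch below shows the round trip with `pd.read_stata`. It uses a tiny made-up table and a hypothetical file name (`toy_acs.dta`) rather than the real ACS file, so the values are assumptions for illustration only.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the ACS file: the column names are borrowed from the
# exercise, but the values are made up for illustration.
toy = pd.DataFrame({"age": [25, 40, 67], "inctot": [30000, 9999999, 0]})

# Write the toy data to a temporary .dta file, then read it back the same
# way you would read a downloaded copy of the ACS sample.
with tempfile.TemporaryDirectory() as tmpdir:
    toy_path = os.path.join(tmpdir, "toy_acs.dta")  # hypothetical file name
    toy.to_stata(toy_path, write_index=False)
    loaded = pd.read_stata(toy_path)

print(loaded.head())
```

For the exercise itself, you would point `pd.read_stata` at wherever you saved `US_ACS_2017_10pct_sample.dta`.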

### Exercise 2

Let's begin by calculating the mean US income from this data (recall that income is stored in the `inctot` variable).

### Exercise 3

Hmmm... That doesn't look right. The average American is definitely not earning 1.7 million dollars a year. Let's look at the values of `inctot` using `value_counts()`. Do you see a problem?

Now use `value_counts()` with the argument `normalize=True` to see proportions of the sample that report each value instead of the count of people in each category. What percentage of our sample has an income of 9,999,999? What percentage has an income of 0?
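As a reminder of how the two calls differ, here is a minimal sketch on a made-up Series (`toy_income` is a hypothetical name, and its values are not from the ACS):

```python
import pandas as pd

# Made-up incomes, including a sentinel-style value of 9999999.
toy_income = pd.Series([0, 0, 50000, 9999999, 9999999, 9999999, 9999999, 72000])

# Raw counts of each distinct value, most frequent first.
counts = toy_income.value_counts()

# The same tally expressed as shares of all observations (they sum to 1).
shares = toy_income.value_counts(normalize=True)

print(counts)
print(shares)
```

In this toy Series, half of the observations are 9999999 and a quarter are 0; multiply the shares by 100 to read them as percentages.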

### Exercise 4

As we discussed before, the ACS uses a value of 9999999 to denote that income information is not available for someone. The problem with using this kind of "sentinel value" is that pandas doesn't understand that this is supposed to denote missing data, and so when it averages the variable, it doesn't know to ignore 9999999. 

To help out `pandas`, use the `replace` method to replace all values of 9999999 with `np.nan` (remember to `import numpy as np` first).
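A minimal sketch of the pattern on made-up data (the toy values are assumptions, and only the column name is borrowed from the exercise):

```python
import numpy as np
import pandas as pd

# Toy column that uses 9999999 as a "missing" sentinel.
toy = pd.DataFrame({"inctot": [30000, 9999999, 0, 9999999]})

# replace() returns a new Series, so assign the result back to the column.
toy["inctot"] = toy["inctot"].replace(9999999, np.nan)

print(toy)
print(toy["inctot"].isna().sum())  # number of values now marked missing
```

Note that the column becomes floating point after the replacement, because `np.nan` is itself a float.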

### Exercise 5

Now that we've properly labeled our missing data as `np.nan`, let's calculate the average US income once more. 

### Exercise 6

OK, now that seems like a reasonable number. As we can see, a major advantage of using `np.nan` is that `pandas` knows that `np.nan` observations should just be ignored when we are calculating means. 
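To see that behavior in isolation, here is a small sketch with made-up numbers (both Series names are hypothetical):

```python
import numpy as np
import pandas as pd

# The same toy incomes, once with a sentinel and once with a true missing value.
with_sentinel = pd.Series([30000, 50000, 9999999])
with_nan = pd.Series([30000, 50000, np.nan])

# The sentinel is treated as a real income and drags the mean up
# to roughly 3.36 million.
print(with_sentinel.mean())

# NaN is skipped by default, so only the two observed incomes are averaged.
print(with_nan.mean())  # 40000.0

# Passing skipna=False turns that behavior off, and the result is NaN.
print(with_nan.mean(skipna=False))
```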

However, as we previously noted, there are a lot of observations of zero income in our data, and it's not clear that everyone with zero income *should* be included in this average, since those may be people who are retired, or in school.

Let's limit our attention to people who are currently working by subsetting on `empstat`. Remember you can use `value_counts()` to see what values of `empstat` are in the data!
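The general subsetting pattern looks something like the sketch below. The `empstat` labels used here are assumptions made up for the toy data, so check the real values in the ACS file before filtering on them.

```python
import pandas as pd

# Toy data: the empstat labels below are assumed for illustration and may
# not match the labels in the real ACS file.
toy = pd.DataFrame({
    "empstat": ["employed", "not in labor force", "employed", "unemployed"],
    "inctot": [42000.0, 0.0, 58000.0, 0.0],
})

# Inspect the categories first.
print(toy["empstat"].value_counts())

# Build a boolean mask and use it to keep only the matching rows.
working = toy[toy["empstat"] == "employed"]

print(working["inctot"].mean())  # 50000.0
```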

### Exercise 7

Now suppose that we want to find the minimum age of everyone who has an income reported. In other words, we want to *ignore* all the rows that have a missing value for `inctot`, then calculate the minimum value of age for those people. 
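One way to approach this, sketched on made-up data (the ages and incomes are assumptions for illustration), is to filter on `notna()` or to use `dropna` with the `subset` argument:

```python
import numpy as np
import pandas as pd

# Toy data with missing income for the youngest and oldest person.
toy = pd.DataFrame({
    "age": [3, 19, 45, 70],
    "inctot": [np.nan, 12000.0, 55000.0, np.nan],
})

# Option 1: boolean mask that keeps rows where inctot is not missing.
has_income = toy[toy["inctot"].notna()]
min_age = has_income["age"].min()

# Option 2: drop rows with a missing inctot, ignoring other columns.
min_age_alt = toy.dropna(subset=["inctot"])["age"].min()

print(min_age, min_age_alt)  # 19 19
```

Both options leave `toy` itself unchanged; they only build a filtered copy to compute on.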

## Index alignment

To illustrate how index alignment can sometimes lead to problems, let's consider the following example: suppose we're throwing a party, and we plan to give people prizes based on the order in which they arrive. The first person to arrive at the party will get 20 dollars, the second will get 10 dollars, and the third person doesn't get anything. 

To keep track of how many prizes everyone gets, we build a DataFrame with all the party attendees, their arrival order, and a column for keeping track of how much they've received in prizes. 

Then we can also build a Series with the prize amounts we plan to give people.

### Exercise 8

Use the code below to get started: 

```python
import pandas as pd
attendees = pd.DataFrame({'names': ["Jill", "Kumar", "Zaira"], 
                          'prizes': [0, 0, 0],
                          'arrival_order': [2, 1, 3]})
arrival_prizes = pd.Series([20, 10, 0])
```

### Exercise 9

Now let's sort our `attendees` list by `arrival_order` so that the first row is the person who arrived first, the second is the person who arrived second, etc. to match how we've organized `arrival_prizes`. 
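If you need a refresher on sorting, here is a minimal sketch using `sort_values` on an unrelated, made-up table (the `runner` and `finish_place` names are hypothetical):

```python
import pandas as pd

# Toy data unrelated to the party example.
toy = pd.DataFrame({"runner": ["Ana", "Bo", "Cy"], "finish_place": [3, 1, 2]})

# sort_values returns a new, sorted DataFrame; the original is left as-is,
# so assign the result to a name if you want to keep it.
sorted_toy = toy.sort_values("finish_place")

print(sorted_toy)
```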

### Exercise 10

Now let's "give" everyone their arrival prizes by adding arrival prizes to people's prize column: 

```python
attendees['prizes'] = attendees['prizes'] + arrival_prizes
```

### Exercise 11

Now let's look at the result. Does it look like what you expected? Do you know what went wrong?

After you've formulated your thoughts, continue to [Discussion](exercise_indices_missing_discussion.ipynb). 