# Box Plots and Five Number Summaries Using Pandas

In the `pandas` library, it is easy to create box plots and five-number summaries for each column of data.

In this notebook, we will import some data from a CSV file into a pandas dataframe.

The data come from the **2024 American Community Survey**, which is a wide-ranging survey of US residents conducted by the Census Bureau. You can read more about it on the [Census Bureau Website](https://www.census.gov/programs-surveys/acs.html).

In [None]:
# Preliminaries - import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Set display options and load data into a dataframe called acs
pd.set_option("display.precision", 2)
acs = pd.read_csv('/home/shared/ACS_2024.csv')

# We will use the following dictionaries to recode some of the data
st_dict = {1: 'Connecticut', 2: 'Maine', 3: 'Massachusetts', 4: 'New Hampshire', 5: 'Rhode Island', 6: 'Vermont', 
11: 'Delaware', 12: 'New Jersey', 13: 'New York', 14: 'Pennsylvania', 21: 'Illinois', 22: 'Indiana', 23: 'Michigan', 
24: 'Ohio', 25: 'Wisconsin', 31: 'Iowa', 32: 'Kansas', 33: 'Minnesota', 34: 'Missouri', 35: 'Nebraska', 36: 'North Dakota', 
37: 'South Dakota', 40: 'Virginia', 41: 'Alabama', 42: 'Arkansas', 43: 'Florida', 44: 'Georgia', 45: 'Louisiana', 
46: 'Mississippi', 47: 'North Carolina', 48: 'South Carolina', 49: 'Texas', 51: 'Kentucky', 52: 'Maryland', 53: 'Oklahoma', 
54: 'Tennessee', 56: 'West Virginia', 61: 'Arizona', 62: 'Colorado', 63: 'Idaho', 64: 'Montana', 65: 'Nevada', 66: 'New Mexico', 
67: 'Utah', 68: 'Wyoming', 71: 'California', 72: 'Oregon', 73: 'Washington', 81: 'Alaska', 82: 'Hawaii', 
83: 'Puerto Rico', 96: 'State groupings (1980 Urban/rural sample)', 97: 'Military/Mil. Reservations', 
98: 'District of Columbia', 99: 'State not identified'}

mf_dict = {1:"Male", 2:"Female", 9:"Missing/blank"}


In [None]:
# Recoding / cleaning
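# Drop survey bookkeeping columns that we do not need
acs.drop(columns=['SAMPLE', 'SERIAL', 'CBSERIAL', 'CLUSTER', 'HHWT', 'GQ', 'PERNUM', 'PERWT'], inplace=True)
# Keep only rows with INCTOT below 999998. The .copy() gives an independent
# dataframe, so adding columns below does not raise SettingWithCopyWarning.
acs = acs[acs['INCTOT'] < 999998].copy()
# Recode the numeric state and sex codes into readable labels
acs['State'] = acs['STATEICP'].replace(st_dict)
acs['Gender'] = acs['SEX'].replace(mf_dict)
acs.drop(columns=['STATEICP', 'SEX', 'YEAR'], inplace=True)
acs.rename(columns={'INCTOT':'Total Income', 'AGE':'Age'}, inplace=True)
# Put the remaining columns in a convenient order
new_cols = ['State', 'Gender', 'Age', 'Total Income']
acs = acs[new_cols]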
acs
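
To see what `.replace()` does with a dictionary, here is a minimal sketch on a made-up column of codes. The names `toy_codes` and `sex_labels` are hypothetical and are not part of the ACS data. Codes found in the dictionary are swapped for their labels, and any code missing from the dictionary is left unchanged. By contrast, `.map()` turns unmatched codes into `NaN`.

```python
import pandas as pd

# Made-up codes for illustration only; 5 is deliberately not in the dictionary
toy_codes = pd.Series([1, 2, 9, 5])
sex_labels = {1: "Male", 2: "Female", 9: "Missing/blank"}

# .replace() swaps matching codes and leaves unmatched codes as they are
replaced = toy_codes.replace(sex_labels)
print(replaced.tolist())

# .map() also swaps matching codes, but unmatched codes become NaN
mapped = toy_codes.map(sex_labels)
print(mapped.tolist())
```

Which method is preferable depends on whether unexpected codes should stay visible (`.replace()`) or be flagged as missing (`.map()`).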

Now that we have loaded and cleaned the data, we can start to analyze it. You will see there are 4 columns:

- State
- Gender
- Age
- Total Income

The first two columns are categorical data and the last two are numerical data.

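We can use the `.describe()` method to get the five-number summary (minimum, first quartile, median, third quartile and maximum) for the numerical data. The output also includes the count, mean and standard deviation of each column.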

In [None]:
acs.describe()
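
If we want only the five numbers themselves, one option is the `.quantile()` method. The sketch below uses a small made-up dataframe (`toy_ages` is hypothetical, not the ACS data) so that the values are easy to check by hand.

```python
import pandas as pd

# Made-up ages for illustration only
toy_ages = pd.DataFrame({"Age": [22, 35, 41, 58, 64]})

# Quantiles 0, 0.25, 0.5, 0.75 and 1 are the min, Q1, median, Q3 and max
five_num = toy_ages["Age"].quantile([0, 0.25, 0.5, 0.75, 1])
five_num.index = ["min", "25%", "50%", "75%", "max"]
print(five_num)

# The same five values appear in the .describe() output
summary = toy_ages["Age"].describe()
print(summary[["min", "25%", "50%", "75%", "max"]])
```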

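We can use **boolean indexing** to describe subsets of the dataset. The format for this is `dataframe[condition]`, where the condition is a test such as `dataframe['column'] == value` that is checked for every row. Only the rows where the condition is true are kept.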

In the example below, we use `.describe()` on just those rows where the state is 'Maine'.

In [None]:
# Summary statistics for Maine residents only
acs[acs['State']=='Maine'].describe()

In [None]:
# Summary statistics for Utah residents only
acs[acs['State']=='Utah'].describe()
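
To make the mechanics of boolean indexing more concrete, here is a minimal sketch on a made-up four-row dataframe (`toy_people` is hypothetical, not the ACS data). The comparison produces one `True` or `False` per row, and indexing with that result keeps only the `True` rows. Conditions can be combined with `&` (and) or `|` (or), as long as each condition is wrapped in parentheses.

```python
import pandas as pd

# Made-up rows for illustration only
toy_people = pd.DataFrame({
    "State": ["Maine", "Utah", "Maine", "Colorado"],
    "Age": [52, 29, 47, 38],
})

# The comparison gives a boolean Series with one entry per row
is_maine = toy_people["State"] == "Maine"
print(is_maine.tolist())  # [True, False, True, False]

# Indexing with the boolean Series keeps only the True rows
maine_only = toy_people[is_maine]
print(maine_only)

# Two conditions combined with &; note the parentheses around each one
maine_under_50 = toy_people[(toy_people["State"] == "Maine") & (toy_people["Age"] < 50)]
print(maine_under_50)
```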

## Using Box Plots

It looks like Maine residents are older than Utah residents on average. Let's compare the distribution of ages using box plots.

In [None]:
# Create a box plot comparing ages of Maine residents with Utah residents, based on sample.
ME = acs[acs['State']=='Maine']
UT = acs[acs['State']=='Utah']
plt.boxplot([ME['Age'], UT['Age']])
plt.title('Age Distribution in Maine vs Utah')
plt.xticks([1, 2], ['ME', 'UT'])
plt.show()
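
It can help to see how the parts of a box plot relate to the five-number summary. With matplotlib's default setting (`whis=1.5`), the box runs from the first quartile to the third quartile, the line inside the box is the median, and each whisker reaches to the most extreme data point that lies within 1.5 times the interquartile range (IQR) of the box. Points beyond that are drawn individually as outliers. The sketch below reproduces that rule by hand on made-up ages (`toy_age_values` is hypothetical, not the ACS data).

```python
import numpy as np

# Made-up ages for illustration only; 95 is an unusually large value
toy_age_values = np.array([18, 25, 31, 38, 44, 52, 95])

# Quartiles and interquartile range
q1, med, q3 = np.percentile(toy_age_values, [25, 50, 75])
iqr = q3 - q1

# Fences: points outside these limits count as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme data points inside the fences
inside = toy_age_values[(toy_age_values >= lower_fence) & (toy_age_values <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()
outliers = toy_age_values[(toy_age_values < lower_fence) | (toy_age_values > upper_fence)]

print("Q1, median, Q3:", q1, med, q3)
print("Whiskers:", whisker_low, whisker_high)
print("Outliers:", outliers)
```

In this toy sample the upper whisker stops at 52, not at the maximum of 95, because 95 lies above the upper fence.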

## Exercise 1

Create a subset of the dataframe for just Colorado using `CO = acs[acs['State']=='Colorado']`.

Use the `.describe()` method to see the summary statistics for Colorado.

Create a box plot to visualize the distribution of ages of Colorado residents. 

Create side-by-side box plots to compare ages of ME, UT and CO.

In [None]:
# Type your code here


In [None]:
# Type your code here


## Comparing Incomes

These data give incomes for all individuals in the survey (15 or older), whether they are at school, working or retired.

Let's look at the income distribution for Colorado.

In [None]:
CO = acs[acs['State']=='Colorado']
plt.boxplot(CO['Total Income'])
plt.show()

## Hiding Outliers

You'll notice there are a lot of outliers at the high end of the income scale. This is typical with income data. We can hide the outliers using the `showfliers=False` option.

In [None]:
plt.boxplot(CO['Total Income'], showfliers=False)
plt.xticks([1], ['Colorado'])
plt.title('Total Income of Colorado Residents (ACS 2024)')
plt.grid(axis='y')
plt.show()
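
Note that `showfliers=False` only changes what is drawn; the outlying values are still in the data. As a rough check, the sketch below uses `matplotlib.cbook.boxplot_stats`, a helper that computes box plot statistics, on a made-up list of incomes (`toy_incomes` is hypothetical, not the ACS data) to list the values that would be hidden.

```python
from matplotlib import cbook

# Made-up incomes for illustration only; 400000 is far above the rest
toy_incomes = [12000, 25000, 31000, 40000, 48000, 60000, 75000, 400000]

# boxplot_stats returns one dictionary of statistics per dataset
stats = cbook.boxplot_stats(toy_incomes)[0]

print("Median:", stats["med"])
print("Upper whisker ends at:", stats["whishi"])
print("Values hidden by showfliers=False:", stats["fliers"])
```

Because the hidden points no longer stretch the vertical axis, the box and whiskers become much easier to read.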

## Exercise 2

Create side-by-side box plots comparing incomes of Colorado, California and Mississippi. Use `showfliers=False` initially to see the plots without outliers. Then do another plot with outliers included to compare.

Start by creating dataframe subsets just for California and Mississippi, called `CA` and `MS` respectively.

Once you've done this, answer the questions that follow.

In [None]:
# Type your code here


In [None]:
# Type your code here


## Some questions

1. Which state of the 3 has the highest median total income?
2. Which state has the lowest median total income?
3. Based on these data, which state has the person with the highest total income?
4. Is it fair to say that the poorest 25% of residents in California (based on total income) earn less than the poorest 25% in Colorado?

Answer 1: 

Answer 2: 

Answer 3:

Answer 4: 

## Exercise 3

Suppose we wanted to compare the incomes of males and females, but only for working-age individuals, that is, excluding those aged 65 or older. Write the code to do this. Your code should result in side-by-side box plots showing the total income distribution for males and females.

In [None]:
# Type your code here
