# Case Study: American Community Survey Data

Let's apply these workflows to a real dataset: educational attainment data from the American Community Survey (ACS).
- <a href="https://data.census.gov/table/ACSST5Y2022.S1501?t=Education:Educational Attainment&g=050XX00US18141,18141$8600000&moe=false">Link to the ACS dataset</a>
  * *Note: I'm working with the 2022 5-year estimates, which I've renamed to `data.csv`.*

In [None]:
import pandas as pd # import pandas
df = pd.read_csv("data.csv") # load data
df # show output

A rich dataset, but not in particularly usable shape at the moment.


## Step #1: Building an Outline

A good starting point is sketching out how you want the data to look at the end of reshaping operations. We can start thinking of this as the data processing version of pseudocode.

For this dataset, we might want something like:

| Area | Gender | Variable | Value |
| --- | --- | --- | --- |
| *geographic unit* | *gender breakdown* | *specific field* | *amount or measure* |

Given the data's current shape, we will probably need to:
- Break apart the current column headers
- Turn the pieces of those headers into their own columns
- Probably relabel the columns to more user-friendly labels
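
One way to make the outline concrete is to hand-build a tiny `DataFrame` in the target shape before touching the real data. The sketch below is illustrative only: the ZIP code, labels, and numbers are made-up placeholders, not values from the ACS file.

```python
import pandas as pd

# Hypothetical rows in the target "long" layout: one measurement per row
target = pd.DataFrame(
    {
        "Area": ["00001", "00001"],             # geographic unit (placeholder ZIP code)
        "Gender": ["Male", "Female"],           # gender breakdown
        "Variable": ["Bachelor's degree"] * 2,  # specific field
        "Value": [18, 22],                      # amount or measure (placeholder numbers)
    }
)

print(target)
```

Having this target on hand makes it easier to check, at each later step, whether the reshaped data is getting closer to the goal.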

Let's dig in!

## Step #2: Splitting Column Headers

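Let's start by splitting the column labels into a hierarchical index (a `MultiIndex`).

Each column label packs several pieces of information into one string, separated by `!!`, so we can pass `!!` as the separator to the `.str.split()` string method.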
- [Click here](https://colab.research.google.com/drive/1Rd9KMoJ2AdtwVMNXwibakUbMGxz5Z6nM?usp=sharing#scrollTo=c1TKLsmiSS9N) for a Jupyter Notebook that provides a deep dive into regular expressions (regex) and string methods in Python.

In [None]:
df.columns = df.columns.str.split("!!", n=2, expand=True) # split column headers on "!!" (at most 2 splits) into a multi-level index
df # show output

Now we have a multi-index and can start to reshape this data.
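
To see what the split does in isolation, here is a minimal sketch on a few hand-written labels. The labels are hypothetical stand-ins in the same `!!`-separated style (the last one has an extra piece added purely to show the effect of `n=2`).

```python
import pandas as pd

# Hypothetical labels in the "area!!coverage!!type" style
labels = pd.Index(
    [
        "ZCTA5 00001!!Total!!Estimate",
        "ZCTA5 00001!!Male!!Estimate",
        "ZCTA5 00001!!Total!!Estimate!!Extra",  # made-up label with a fourth piece
    ]
)

# n=2 allows at most two splits, so every label yields at most three pieces;
# expand=True on an Index returns a MultiIndex instead of lists
split = labels.str.split("!!", n=2, expand=True)

print(split.nlevels)
print(split.tolist())
```

Because `n=2` stops after the second separator, anything past it stays together in the third level rather than creating a fourth.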

## Step #3: Transposing the DataFrame

Next, we can use `.transpose()` (or its shorthand, `.T`) to swap rows and columns.

In [None]:
df = df.T # transpose
df # show output

We're making progress!
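
As a small illustration of what transposing does to a multi-level header, the sketch below builds a toy frame with three-level columns and flips it. The labels and numbers are placeholders, not ACS values.

```python
import pandas as pd

# Toy frame with three-level column labels (hypothetical values)
cols = pd.MultiIndex.from_tuples(
    [
        ("ZCTA5 00001", "Total", "Estimate"),
        ("ZCTA5 00001", "Male", "Estimate"),
    ]
)
toy = pd.DataFrame([[100, 48], [40, 18], [25, 12]], columns=cols)

# Transposing moves the column MultiIndex onto the rows
flipped = toy.T

print(toy.shape, "->", flipped.shape)
print(flipped.index.nlevels)
```

After the flip, each former column becomes a row whose label still carries all three levels, which is what lets a later step turn those levels into regular columns.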

## Step #4: Reassign Header & Subset the Data

We need the first row of data to serve as column headers. We can do this by subsetting our dataframe.

In [None]:
header = df.iloc[0] # isolate first row to be new header
df = df[1:] # subset dataframe (everything past the first row)
df.columns = header # reassign headers
df # show output
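
The same header-promotion pattern can be tried on a toy frame. In this sketch the row labels and values are invented for illustration; only the three-line pattern mirrors the cell above.

```python
import pandas as pd

# Toy frame whose first row holds the labels we want as column headers
toy = pd.DataFrame(
    [
        ["Population 25 years and over", "Bachelor's degree"],
        [100, 40],
        [48, 18],
    ],
    index=["Label (Grouping)", "Total", "Male"],
)

header = toy.iloc[0]        # first row becomes the new header
body = toy.iloc[1:].copy()  # everything after the first row
body.columns = header       # reassign column labels

print(body)
```

Taking an explicit `.copy()` of the slice is a small precaution so that later assignments act on an independent frame rather than a view of the original.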

## Step #5: Reindexing

Right now, the area, coverage, and type are levels of the row multi-index. Resetting the index turns them into regular columns.

In [None]:
df = df.reset_index() # reset index
df # show output
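
A quick sketch of what `reset_index()` does with an unnamed row `MultiIndex`, using placeholder labels and numbers: each level becomes a column with a default name of the form `level_N`.

```python
import pandas as pd

# Toy frame with an unnamed three-level row index (hypothetical values)
idx = pd.MultiIndex.from_tuples(
    [
        ("ZCTA5 00001", "Total", "Estimate"),
        ("ZCTA5 00001", "Male", "Estimate"),
    ]
)
toy = pd.DataFrame({"Population 25 years and over": [100, 48]}, index=idx)

# Each index level becomes an ordinary column
flat = toy.reset_index()

print(flat.columns.tolist())
```

Those default `level_0`, `level_1`, `level_2` names are why a relabeling step usually follows.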

We might also want to relabel some columns at this point.

In [None]:
df.columns = ['area', 'coverage', 'type'] + list(df.columns[3:]) # relabel the first three columns (the former index levels)
df # show updated df
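
The cell above relabels columns by position. An alternative sketch is to relabel by name with `DataFrame.rename`, which assumes the columns still carry the default `level_0`, `level_1`, `level_2` names that `reset_index()` gives an unnamed multi-index. The toy values are placeholders.

```python
import pandas as pd

# Toy frame as it might look right after reset_index() (hypothetical values)
flat = pd.DataFrame(
    {
        "level_0": ["ZCTA5 00001", "ZCTA5 00001"],
        "level_1": ["Total", "Male"],
        "level_2": ["Estimate", "Estimate"],
        "Bachelor's degree": [40, 18],
    }
)

# Map old names to new ones; columns not listed are left unchanged
renamed = flat.rename(
    columns={"level_0": "area", "level_1": "coverage", "level_2": "type"}
)

print(renamed.columns.tolist())
```

Renaming by name does not depend on column order, at the cost of depending on the default names being present.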

## Step #6: Melting Variable Labels

Recall the structure we're aiming for:

| Area | Gender | Variable | Value |
| --- | --- | --- | --- |
| *geographic unit* | *gender breakdown* | *specific field* | *amount or measure* |

The variable labels are still spread across the column headers, so now is a good time to melt them into rows.

In [None]:
df = pd.melt(df, id_vars=['area', 'coverage', 'type'], var_name='variable', value_name='value') # melt variable columns into rows, naming the new columns explicitly
df # show output

In [None]:
df.head()
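
To make the wide-to-long change easier to see, here is a minimal melt on a two-row toy frame. The column names match the ones used above; the values are invented.

```python
import pandas as pd

# Toy "wide" frame: one column per variable (hypothetical values)
wide = pd.DataFrame(
    {
        "area": ["ZCTA5 00001", "ZCTA5 00001"],
        "coverage": ["Total", "Male"],
        "type": ["Estimate", "Estimate"],
        "Population 25 years and over": [100, 48],
        "Bachelor's degree": [40, 18],
    }
)

# id_vars stay as columns; every other column is stacked into variable/value pairs
long = pd.melt(
    wide,
    id_vars=["area", "coverage", "type"],
    var_name="variable",
    value_name="value",
)

print(long)
```

Two rows with two variable columns become four rows, one per combination of identifier and variable.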

## Step #7: Subsetting & Filtering

Last but not least, we can keep only the meaningful columns, remove rows with `NaN` values, and strip the `ZCTA5 ` prefix from `area` so it can be joined on zip code.

In [None]:
df = df[['area', 'coverage', 'variable', 'value']] # subset columns
df = df[df['value'].notnull()] # remove rows with NaN in value
df = df.reset_index(drop=True) # reset index
df['area'] = df['area'].str.replace("ZCTA5 ", "", regex=False) # strip the literal "ZCTA5 " prefix so area can be joined on zip code
df # show output
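
The same subsetting and filtering steps, sketched on a small toy frame with one missing value. The rows are placeholders chosen so the effect of each step is visible.

```python
import pandas as pd

# Toy long-format frame with one missing value (hypothetical rows)
long = pd.DataFrame(
    {
        "area": ["ZCTA5 00001", "ZCTA5 00001", "ZCTA5 00002"],
        "coverage": ["Total", "Male", "Total"],
        "type": ["Estimate", "Estimate", "Estimate"],
        "variable": ["Bachelor's degree"] * 3,
        "value": [40, None, 75],
    }
)

tidy = long[["area", "coverage", "variable", "value"]]           # keep meaningful columns
tidy = tidy[tidy["value"].notna()].reset_index(drop=True)         # drop NaN rows, renumber
tidy["area"] = tidy["area"].str.replace("ZCTA5 ", "", regex=False)  # strip literal prefix

print(tidy)
```

Passing `regex=False` treats `"ZCTA5 "` as plain text, so the result does not depend on how the pattern would be interpreted as a regular expression.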

## Wrap Up

Voilà! Here are all of those steps together:

In [None]:
import pandas as pd # import pandas
df = pd.read_csv("data.csv") # load data

df.columns = df.columns.str.split("!!", n=2, expand=True) # split column headers on "!!" (at most 2 splits) into a multi-level index
df = df.T # transpose

header = df.iloc[0] # isolate first row to be new header
df = df[1:] # subset dataframe (everything past the first row)
df.columns = header # reassign headers
df = df.reset_index() # reset index

df.columns = ['area', 'coverage', 'type'] + list(df.columns[3:]) # relabel the first three columns (the former index levels)

df = pd.melt(df, id_vars=['area', 'coverage', 'type'], var_name='variable', value_name='value') # melt variable columns into rows

df = df[['area', 'coverage', 'variable', 'value']] # subset columns
df = df[df['value'].notnull()] # remove rows with NaN in value
df = df.reset_index(drop=True) # reset index
df['area'] = df['area'].str.replace("ZCTA5 ", "", regex=False) # strip the literal "ZCTA5 " prefix so area can be joined on zip code

df # show output
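
The cleanup of the `area` column is aimed at joining on zip code. As a hedged sketch of what that join might look like, the example below merges a toy version of the tidy output with a hypothetical lookup table; the `zips` frame, its `zip` and `city` columns, and all values are assumptions made up for illustration.

```python
import pandas as pd

# Toy version of the reshaped output (hypothetical values)
tidy = pd.DataFrame(
    {
        "area": ["00001", "00002"],
        "coverage": ["Total", "Total"],
        "variable": ["Bachelor's degree", "Bachelor's degree"],
        "value": [40, 75],
    }
)

# Hypothetical lookup table keyed on zip code
zips = pd.DataFrame(
    {
        "zip": ["00001", "00002"],
        "city": ["Exampleville", "Sampletown"],
    }
)

# A left join keeps every row of tidy, even if a zip code has no match
joined = tidy.merge(zips, left_on="area", right_on="zip", how="left")

print(joined)
```

Both key columns are strings here; if one side stored zip codes as integers, leading zeros would be lost and the keys would not match.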