# Data Analysis with Python & pandas

## Learning Objectives

- Explore real-world data with Python
- Practice using the popular pandas library to work with a tabular dataset
- Perform basic data cleanup and manipulation
- Perform basic analysis & visualization of a dataset

### The dataset

- We'll be using a dataset from the US Bureau of Labor Statistics: the Consumer Price Index (CPI). The [CPI](https://www.bls.gov/cpi/) is used to measure inflation across a wide spectrum of goods and services.

- We'll be interacting with CPI data through a series of [text tables](https://download.bls.gov/pub/time.series/cu/), which are suitable for use with tools like Python.

- There are several files in this directory. Today we'll work with the file called `cu.data.0.Current`.

#### exercise
-----
Open the [Current file](https://download.bls.gov/pub/time.series/cu/cu.data.0.Current) in your browser and take a few moments to inspect the format. 



1.   How is it structured/organized?
2.   What additional information do you need to interpret this dataset?



### importing the pandas library

- The pandas library is a third-party, open-source Python library. It's modeled on functionality from the R language, particularly on R's `data.frame` object. The pandas library gives us a way to use R-style data frames in Python.

- Pandas is also highly optimized for working with large datasets, so its performance is generally much better than that of Python code you or I could write on our own.

- The pandas library is already installed in a Google Colab environment. But because it's an external library, we have to import pandas into our Python session before we can use it. 

In [None]:
import pandas as pd

### Loading the CPI data file

- When loading a data file, it's important to understand the format.

- The CPI `Current` file is plain text, but it's separated into columns. 

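- The generic term for this kind of file is **CSV** (comma-separated values), although the separator doesn't have to be a comma. In this file, the columns are separated by tabs and padded with spaces.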

- Pandas has a `read_csv` [function](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) we can use to load files of this type.

In [None]:
# All Urban Consumers (Current Series)
all_items_url = 'https://download.bls.gov/pub/time.series/cu/cu.data.0.Current'
all_items = pd.read_csv(all_items_url)

In [None]:
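# The columns are separated by whitespace rather than commas, so we pass a
# regular expression (as a raw string) that matches one or more whitespace characters
all_items = pd.read_csv(all_items_url, sep=r'\s+')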
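
To see why the separator matters, here is a minimal sketch that uses a small, made-up sample laid out like the `Current` file (the rows are illustrative only, not real CPI values). With the default comma separator, pandas finds no commas and puts each line into a single column; with a whitespace pattern, the fields are split as intended.

```python
import io
import pandas as pd

# Made-up sample: tab-delimited fields padded with spaces (not real CPI values)
sample_text = (
    "series_id        \tyear\tperiod\t       value\n"
    "CUSR0000SA0      \t2020\tM01\t     100.0\n"
    "CUSR0000SA0      \t2020\tM02\t     101.5\n"
)

# Default separator is a comma, so every line ends up in one column
one_column = pd.read_csv(io.StringIO(sample_text))

# A pattern matching runs of whitespace splits the fields and drops the padding
sample = pd.read_csv(io.StringIO(sample_text), sep=r'\s+')

print(one_column.shape)
print(sample.columns.tolist())
```

Comparing the two shapes is a quick way to check whether a file was parsed with the right separator.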

### DataFrame Basics: Rows & Columns

- Just as in a spreadsheet, a DataFrame allows us to work with rows and/or columns of data.

- In pandas, this is called **slicing**.

#### exercise
-------

Run the following cells and see if you can determine what each command is doing.


In [None]:
all_items['series_id']

In [None]:
all_items[['series_id', 'period', 'value']]

In [None]:
all_items.loc[0:10]

In [None]:
all_items.loc[99:500, ['series_id', 'year']]

In [None]:
all_items.year.max(), all_items.year.min()
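
Once you've tried the cells above, the following sketch may help confirm your guesses. It repeats the same kinds of selection on a tiny, made-up DataFrame (the name `toy` and its values are assumptions for illustration), where the results are easy to check by eye.

```python
import pandas as pd

# A tiny, made-up table for illustration
toy = pd.DataFrame({
    'series_id': ['A', 'A', 'B', 'B'],
    'year': [2019, 2020, 2019, 2020],
    'value': [1.0, 2.0, 3.0, 4.0],
})

one_col = toy['series_id']                    # single brackets: one column, as a Series
two_cols = toy[['series_id', 'value']]        # a list of names: several columns, as a DataFrame
rows = toy.loc[1:2]                           # .loc slices by label and includes both endpoints
rows_cols = toy.loc[1:2, ['year', 'value']]   # rows and columns in one step

print(type(one_col).__name__, two_cols.shape, len(rows), rows_cols.shape)
```

Note that `.loc[1:2]` returns two rows, unlike ordinary Python slicing, which would exclude the endpoint.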

### Filtering DataFrames

- In a spreadsheet, we often filter the rows by values in particular columns.

- We can do the same in pandas, but the syntax is actually much more flexible (as we'll see).

In [None]:
# To find rows for a particular year
all_items.loc[all_items.year == 2020]

In [None]:
# To find rows where the year is greater than a certain value
all_items.loc[all_items.year > 2000]

In [None]:
# To find rows where the series_id contains a certain item code
all_items.loc[all_items.series_id.str.contains('SEEB')]
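
Part of the flexibility mentioned above is that conditions can be combined. The sketch below uses a small, made-up DataFrame (not the real CPI table) to show how boolean conditions combine with `&` (and), `|` (or), and `~` (not).

```python
import pandas as pd

# Made-up rows for illustration
toy = pd.DataFrame({
    'series_id': ['CUUR0000SA0', 'CUUR0000SEEB', 'CUUR0100SEEB', 'CUUR0000SA0'],
    'year': [1999, 2005, 2021, 2021],
})

# Each condition is a Series of True/False values, one per row
is_recent = toy.year > 2000
has_code = toy.series_id.str.contains('SEEB')

both = toy.loc[is_recent & has_code]          # rows matching both conditions
either = toy.loc[is_recent | has_code]        # rows matching at least one
neither = toy.loc[~(is_recent | has_code)]    # rows matching none

print(len(both), len(either), len(neither))
```

When writing conditions inline, wrap each one in parentheses, for example `(toy.year > 2000) & (toy.year < 2020)`, because `&` binds more tightly than the comparison operators.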

### Manipulating Columns

- Before we get into our analysis, let's make the data easier to work with.

- The `series_id` column actually contains multiple pieces of information.

#### exercise
-----
Using the [documentation](https://download.bls.gov/pub/time.series/cu/cu.txt), can you figure out how to parse the `series_id` values into their components? 

In [None]:
#all_items['survey_code'] = all_items.series_id.str[:]
#all_items['seasonal_code'] = all_items.series_id.str[:]
#all_items['periodicity_code'] = all_items.series_id.str[:]
#all_items['area_code'] = all_items.series_id.str[:]
#all_items['item_code'] = all_items.series_id.str[:]
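
If you get stuck, here is one possible way to fill in the slices, shown on a made-up two-row DataFrame. It assumes a fixed-width layout of two characters for the survey, one for the seasonal code, one for the periodicity, four for the area, and the remainder for the item; check these positions against the documentation linked above before relying on them.

```python
import pandas as pd

# Two example series IDs in a made-up DataFrame
toy = pd.DataFrame({'series_id': ['CUUR0000SA0', 'CUSR0000SEEB']})

# Slice positions are an assumption; verify them against cu.txt
toy['survey_code'] = toy.series_id.str[:2]
toy['seasonal_code'] = toy.series_id.str[2:3]
toy['periodicity_code'] = toy.series_id.str[3:4]
toy['area_code'] = toy.series_id.str[4:8]
toy['item_code'] = toy.series_id.str[8:]

print(toy)
```

The `.str` accessor applies the slice to every value in the column, so no loop is needed.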

### Summarizing DataFrames

- The CPI is time-series data, consisting of values for various kinds of goods & services (items), across various geographic regions, at discrete moments in time.

- An initial question we might pose: are these data points distributed uniformly? In other words, are all items and regions represented across the entire time span? 

- To answer this question, we can use a powerful feature of pandas called `groupby`. 

- The `groupby` method allows you to perform the same operation across subsets of your data.

- A prime use for it is summarizing data by subset.

In [None]:
# What is the range of years?
all_items.year.unique()

In [None]:
# How many different item groups are represented?
all_items.item_code.unique().size

In [None]:
# How many observations are there per item code?
all_items.groupby('item_code').size()

In [None]:
# How many different items have data each year?
all_items.groupby('year').item_code.apply(lambda x: x.unique().size)
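
The two `groupby` patterns above can be checked by hand on a small, made-up DataFrame. This sketch also uses `nunique`, a pandas shortcut that should give the same counts as the `lambda` version.

```python
import pandas as pd

# Made-up observations for illustration
toy = pd.DataFrame({
    'year': [2020, 2020, 2020, 2021, 2021],
    'item_code': ['SA0', 'SA0', 'SEEB', 'SA0', 'SA0'],
})

# Number of rows in each item group
rows_per_item = toy.groupby('item_code').size()

# Number of distinct items within each year
items_per_year = toy.groupby('year').item_code.nunique()

print(rows_per_item)
print(items_per_year)
```

In both cases the result is a Series indexed by the grouping column, with one summary value per group.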

#### exercise
----
See if you can write some code to determine whether all the regions (area codes) are covered for all years (regardless of item).

### Filtering by Groups


- The `groupby` method can be used in conjunction with the `filter` method to filter our dataset by groups.

- The `filter` method applies a condition to each group and returns data only from groups where the condition evaluates to `True`.

- We can supply a condition to `filter` using the `lambda` syntax we saw above.

- We've seen that CPI data is not consistent for all item types across all years and regions/areas. How can we analyze data for only those items that have a CPI calculated for all years, 1997-2022?


In [None]:
# Determine how many unique years are in the dataset
num_years = all_items.year.unique().size

In [None]:
# Let's isolate a single area to make sure our data is consistent
# 0000 is the code for U.S. city average, so let's use that one
us_items = all_items.loc[all_items.area_code == '0000'].copy()

In [None]:
# to make sure our data is consistent, let's also limit by periodicity and seasonality
us_items = us_items.loc[(us_items.periodicity_code == 'R') & (us_items.seasonal_code == 'U')].copy()

In [None]:
# within each item group, determine if the number of unique years matches the number across the whole dataset
complete_items = us_items.groupby('item_code').filter(lambda x: x.year.unique().size == num_years)

In [None]:
# Now we have a slightly smaller set of items than we started with
complete_items.item_code.unique().size
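
To make the group filter easier to follow, here is the same idea on a made-up DataFrame with two items, one of which is missing a year. The names `toy` and `complete` are hypothetical.

```python
import pandas as pd

# 'SA0' has all three years; 'SEEB' is missing 2021
toy = pd.DataFrame({
    'item_code': ['SA0', 'SA0', 'SA0', 'SEEB', 'SEEB'],
    'year': [2020, 2021, 2022, 2020, 2022],
})

# Number of distinct years across the whole table
num_years = toy.year.nunique()

# Keep every row of a group only if that group covers all the years
complete = toy.groupby('item_code').filter(lambda g: g.year.nunique() == num_years)

print(complete)
```

The function passed to `filter` returns a single `True` or `False` per group, and the result keeps or drops the group's rows as a whole.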

### Filtering one DataFrame by another

Often we'll want to work with data that are spread across multiple tables. The pandas library includes functionality to make that easier.

- The CPI file we've been using has codes to represent items. 

- To understand which items are included, we can download an additional file provided by the Bureau of Labor Statistics, which maps each item code to its descriptions. 

- Since we've filtered our dataset to exclude certain items, we can filter this secondary dataset using elements from our `complete_items` DataFrame.

In [None]:
# We can load the item description table same as we did with the CPI table above
items_desc_url = 'https://download.bls.gov/pub/time.series/cu/cu.item'
items_desc = pd.read_csv(items_desc_url, sep='\t')

In [None]:
# We've seen how to access the unique item codes in our complete_items DataFrame
item_codes = complete_items.item_code.unique()

In [None]:
# Now we can use .loc together with the .isin method to filter the item descriptions to just those item codes
item_desc_keep = items_desc.loc[items_desc.item_code.isin(item_codes)]

In [None]:
# We can export this new, filtered table of item descriptions and review it in a spreadsheet
item_desc_keep.to_csv('filtered-item-table.csv', index=False)
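
Here is a minimal sketch of the same `isin` filter on two made-up tables. The item names are placeholders, not the real BLS descriptions.

```python
import pandas as pd

# Made-up data table and description table
data = pd.DataFrame({'item_code': ['SA0', 'SA0', 'SEEB'], 'value': [1.0, 2.0, 3.0]})
desc = pd.DataFrame({
    'item_code': ['SA0', 'SEEB', 'SAF'],
    'item_name': ['Placeholder A', 'Placeholder B', 'Placeholder C'],
})

# The codes that actually occur in the data table
codes = data.item_code.unique()

# Keep only the descriptions whose code occurs in the data
kept = desc.loc[desc.item_code.isin(codes)]

print(kept)
```

`isin` returns `True` for each row whose value appears in the supplied collection, so it works like a filter on a list of allowed values.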

#### exercise
-----

Pick a handful of items from the CSV file (spreadsheet) you downloaded to include in your analysis. This file represents only those items with CPI values for the complete time span (1997-2022) of this dataset. 

Create a Python list of your chosen item codes below.

In [None]:
#my_codes = []

In [None]:
# Now we can filter our complete_items DataFrame by our selected codes
my_items = complete_items.loc[complete_items.item_code.isin(my_codes)].copy()

### Working with time series data

- CPI data is calculated monthly. 

- Our dataset (`my_items`) indicates the date by the combination of two columns: the `year` and the `period`, which can be a month code (such as `M01`), an annual average code (`M13`), or a semiannual code (such as `S01`).

- Multiple time scales are present in this dataset.

- In order to do consistent analyses, we should pick a particular time scale and discard the other data points.

- In what follows, we'll isolate the month-level data and combine the month code and year columns to create a proper date element.

#### exercise
----
Can you describe -- logically, not in terms of Python syntax -- how we can combine the values in the `period` and `year` columns to create a date value?

In [None]:
# Remove annual average and semiannual values
my_items = my_items.loc[~my_items.period.isin(['M13', 'S01', 'S02', 'S03'])].copy()

In [None]:
# We create a datetime column in four steps
# 1.
# 2.
# 3.
# 4.
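
One possible way to fill in the four steps is sketched below on a made-up DataFrame. It is not necessarily the only approach. It assumes the remaining period codes all have the form `M01` through `M12`, and it stores the result in a column named `month`, the name the pivot step later uses as its index.

```python
import pandas as pd

# Made-up rows containing only monthly period codes
toy = pd.DataFrame({'year': [2020, 2020, 2021], 'period': ['M01', 'M12', 'M06']})

# 1. Strip the leading 'M' to get the two-digit month number
month_num = toy.period.str[1:]
# 2. Convert the integer year to a string
year_str = toy.year.astype(str)
# 3. Join the two pieces into a 'YYYY-MM' string
date_str = year_str + '-' + month_num
# 4. Parse the string into a datetime (the first day of each month)
toy['month'] = pd.to_datetime(date_str, format='%Y-%m')

print(toy)
```

Passing an explicit `format` avoids any ambiguity in how the strings are parsed.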


### Reshaping DataFrames

- Now that we've cleaned up our data sample and created a Python datetime column, we can analyze it as a timeseries.

- The CPI values in the `value` column don't reflect actual prices; rather, they are index numbers that express the price level relative to a reference base period.

- Generally, it's easier to compare the CPI between different kinds of items by looking at the _percentage change_ over time. 

- Pandas has a built-in method for calculating this metric. Let's use it to plot the percentage change for the items in our sample.

- We'll also change the shape of our DataFrame to make it easier to plot.

In [None]:
# We're interested in the percentage change in CPI for each item, so we need to use groupby again
# We'll reassign the result to a new column
my_items['pct_change'] = my_items.groupby('item_code').value.pct_change()

In [None]:
# To create a DataFrame with the item values side by side, we can use the .pivot method
my_items_pivot = my_items.pivot(index='month', columns='item_code', values='pct_change')
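
The sketch below walks through both steps on a made-up DataFrame with two items and two months, so the numbers can be checked by hand. It sorts by date first, on the assumption that `pct_change` should compare each row with the previous month of the same item.

```python
import pandas as pd

# Made-up index values for two hypothetical items, 'A' and 'B'
toy = pd.DataFrame({
    'month': pd.to_datetime(['2020-01-01', '2020-02-01', '2020-01-01', '2020-02-01']),
    'item_code': ['A', 'A', 'B', 'B'],
    'value': [100.0, 110.0, 200.0, 190.0],
})

# pct_change compares each row with the previous row in its group,
# so the rows need to be in date order within each item
toy = toy.sort_values(['item_code', 'month'])
toy['pct_change'] = toy.groupby('item_code')['value'].pct_change()

# One row per month, one column per item
wide = toy.pivot(index='month', columns='item_code', values='pct_change')

print(wide)
```

The first month of each item has no earlier value to compare with, so its percentage change is missing (`NaN`).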

### Visualizing DataFrames

- There are lots of options when it comes to data visualization with Python.

- A very commonly used library is called [matplotlib](https://matplotlib.org/). 

- pandas has some matplotlib functionality embedded into its DataFrame API.

In [None]:
# We can make a basic line graph comparing the rate of change in the CPI over time for our sample of items 
# just by calling the plot() method
my_items_pivot.plot()

### Customizing matplotlib plots

- Matplotlib plots are highly customizable.

- Let's look at a few of the settings that can make our plot more legible.

  - We can make our plot bigger with the `figsize` argument.
  - We can give our plot a title and change the units on the y-axis to show percentages out of 100 (rather than 1).
  - We can improve our legend by supplying the actual names of the items measured, rather than their codes.

In [None]:
# after pivoting our DataFrame, the item codes are now the names of the columns
my_items_pivot.columns

In [None]:
# We can use the column names to select the rows from our table of item descriptions that contain the names of these items
# To do that, we first need to make the item_code column the INDEX of the item_desc_keep DataFrame
item_desc_keep = item_desc_keep.set_index('item_code')

In [None]:
# Now we can filter the index of that DataFrame by the column names in our pivot DataFrame
# We're filtering by the names of the columns, not their values
my_item_desc = item_desc_keep.loc[my_items_pivot.columns]

In [None]:
# Finally, we can rename the columns on our pivot DataFrame, using the item_name column on the filtered
# DataFrame of item descriptions
my_items_pivot.columns = my_item_desc.item_name

In [None]:
# To change the units on the axis, we need access to something called a tick formatter
# This functionality lives in a separate matplotlib module, which we need to import
import matplotlib.ticker as mtick

In [None]:
# We set the figure size and assign the return value of plot to a new variable
ax = my_items_pivot.plot(figsize=(20, 6))
# Our ax variable now gives us access to special matplotlib formatting methods
# We can use set_title to set the plot title
ax.set_title('CPI percent change, 1997-2022')
# mtick.PercentFormatter(1.0) instructs matplotlib to format values as percentages, using the argument (1.0) as the denominator
# in calculating the percentage (so a value of 1.0 is displayed as 100%)
ax.yaxis.set_major_formatter(mtick.PercentFormatter(1.0))