# Data Analysis with Python and Pandas
## Michael Chambers - Nov. 26th, 2018
**Goal**: Exposure to Python and Pandas for Data Analysis  

![example image](figs/flu.png)

**Objectives**:  
- Overview of basic Python syntax and how to use Jupyter Notebooks (will be used in both lessons for the day)  
- Work through a basic workflow using a CDC dataset (load, clean, manipulate, visualize, save)  
- Exercise: apply the same workflow to a different question in the CDC dataset  

**Recommended Prior Reading Assignment**: [Intro to Pandas for Excel Super Users – Joan Wang](https://towardsdatascience.com/intro-to-pandas-for-excel-super-users-dac1b38f12b0)  

Additional Resources:
- [Download Anaconda](https://www.anaconda.com/download/#macos)
- [Python DataScience Handbook – Jake VanderPlas](https://github.com/jakevdp/PythonDataScienceHandbook)
- [Reproducible Data Analysis Videos – Jake VanderPlas](https://www.youtube.com/watch?v=_ZEWDGpM-vM&list=PLYCpMb24GpOC704uO9svUrihl-HY1tTJJ)  
- [19 Essential Snippets in Pandas – Jeff Delaney](https://jeffdelaney.me/blog/useful-snippets-in-pandas/)
- [Merge and Join DataFrames with Pandas in Python – Shane Lynn](https://www.shanelynn.ie/merge-join-dataframes-python-pandas-index-1/)

In [None]:
# show example image of data at top

# Python 🐍 
![xkcd](https://imgs.xkcd.com/comics/python.png)

In [None]:
import this

# Jupyter Notebooks 📓
---
* Incredible resource and tool for data science
* Offers a reproducible environment that tells a story
* It's also a great sandbox environment to learn Python

### A quick note about cell types in Jupyter

print('this is not a coding cell')

In [None]:
print('this is a coding cell')

In [None]:
# run a code cell: 2 + 2


In [None]:
# we can create a variable called "x"


In [None]:
# that variable refers to the object we set it to


In [None]:
# we can use functions to manipulate our objects
# "print()" is a built-in python function


### Supercharging Python with Packages
As soon as you start up a Python session you have access to [~70 built-in functions](https://docs.python.org/3/library/functions.html).  If we want to do something with our object that goes beyond the scope of the built-in functions, we have three options:  
1. Sit and cry
2. Define our own functions (that's a bit out of scope for this lesson)
3. Use packages!

**Packages** (or libraries) are collections of functions that other folks have already written.  This is a huge part of why Python is an incredible programming language: chances are someone has already written something that you can use.

Before we can start using functions provided by packages we first have to import the package.
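
As a minimal sketch of what importing looks like, the example below uses two packages that ship with Python (`math` and `statistics`), chosen here only for illustration. The abbreviation `stats` is an arbitrary alias, not a convention.

```python
# Import a package, then reach its functions with "package.function()".
import math

root = math.sqrt(16)
print(root)  # 4.0

# Import with an abbreviation (an "alias") to save typing later.
import statistics as stats

average = stats.mean([1, 2, 3, 4])
print(average)  # 2.5
```

The same two patterns (`import package` and `import package as alias`) apply to packages you install yourself, such as Pandas.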

# Pandas 🐼
---
### [Legend has it...](https://qz.com/1126615/the-story-of-the-most-important-tool-in-data-science/)

Pandas was developed by Wes McKinney (from Akron, Ohio!!!). A math guy from MIT who went into finance, he found that the problem with hedge fund management was dealing with the data (sourcing new data, merging it with the old, and cleaning it all up to optimize the input). He got bummed out with Excel and R but was smitten with Python, though he realized there was no robust package for data analysis. So he built Pandas in 2008 and released the project to the public in 2009.

Here's where it gets crazy: he left the world of finance to pursue a PhD in statistics at Duke, which put Pandas development on hold. During that period he realized that Python had the potential to explode as a statistical computing language but was still missing robust packages. So he dropped out to push Pandas to become a cornerstone of the Python scientific ecosystem.

And he put all his tips and tricks for Pandas into a book: [Python for Data Analysis](https://github.com/wesm/pydata-book)

### What is Pandas?

"[Pandas] enables people to analyze and work with data who are not expert computer scientists. You still have to write code, but it's making the code intuitive and accessible. It helps people move beyond just using Excel for data analysis." ~Wes

- The go-to data analysis library for Python
    - Import and wrangle your raw data
    - Manipulate and visualize
- Allows for mixed data types in the same table (each column can hold a different type)

The DataFrame is your friend!

- Two primary object types used in Pandas:
    - DataFrames - like an Excel spreadsheet
    - Series - like a single column in a spreadsheet
- DataFrames are the primary object used in Pandas
- Each DataFrame has:
    - columns: the variables being measured
    - rows: the observations being made
    - index: labels each row and maintains the order of the rows
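
To make the pieces above concrete, here is a small sketch that builds a DataFrame by hand. The column names and values are made up for illustration and are not taken from the CDC dataset.

```python
import pandas as pd

# Toy values, invented for this example only.
toy = pd.DataFrame({
    "REGION": ["A", "B", "C"],
    "WEEK": [1, 1, 1],
    "CASES": [10, 25, 7],
})

print(toy.shape)          # (number of rows, number of columns)
print(list(toy.columns))  # the variables being measured
print(list(toy.index))    # the row labels (0, 1, 2 by default)

# Selecting a single column returns a Series.
cases = toy["CASES"]
print(type(cases).__name__)
```

Each key in the dictionary becomes a column, and each position in the lists becomes a row.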

In [None]:
# how to import a package 
# note: "pd" is an abbreviation that allows us to access the functions provided in the package
# you will find that many popular python packages have conventional abbreviations when imported


### Our dataset...
We'll be examining some CDC influenza surveillance data from 2008-2018, which was sourced from the [CDC FluView website](https://gis.cdc.gov/grasp/fluview/fluportaldashboard.html).  What we're interested in are the influenza-like-illness cases per week ("ILITOTAL").

In [None]:
# import the data we'll be using
# note: the path to the data is relative to the Jupyter notebook location: ILINet.csv


In [None]:
# there are a few functions we can use to quickly get a feel for the dataset
# the "head()" function will show us the first 5 rows of our dataframe


### Some functions to try:
```
df.shape
df.info()
df.describe()
df.columns
df.head()
df.tail()
df.sample(10)
```
Note: some of these commands end with parentheses and others do not.  Those without parentheses are attributes of the object; those with parentheses are methods (functions attached to the object).
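
The difference can be sketched on a tiny hand-made DataFrame (the name `toy` and its values are illustrative assumptions, not part of the lesson data):

```python
import pandas as pd

toy = pd.DataFrame({"x": [1, 2, 3], "y": [4.0, 5.0, 6.0]})

dims = toy.shape     # attribute: a stored value, no parentheses
first = toy.head(2)  # method: a function call, parentheses required

print(dims)
print(first)

# Leaving the parentheses off a method gives back the method itself,
# not its result, which is a common source of confusing output.
print(callable(toy.head))
```

If a cell prints something like `<bound method ...>`, it usually means the parentheses were forgotten.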

In [None]:
# scratch space


In [None]:
# df.shape

In [None]:
# df.info()

In [None]:
# df.describe()

In [None]:
# df.columns

In [None]:
# df.head()

In [None]:
# df.tail()

In [None]:
# df.sample(10)

### Quick Exercise
- How many entries of data do we have (number of observations)?
- How many columns do we have (number of variables)?
- Any first impression on what we may have to clean in this dataset?

In [None]:
# scratch space


In [None]:
# df.shape

In [None]:
# len(df)

In [None]:
# len(df.columns)

In [None]:
# len(df.index)

### Basic Data Cleanup
For now let's narrow down this dataset to some simple components:
- REGION
- YEAR
- WEEK
- ILITOTAL

In [None]:
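# we'll select just those columns and create a new dataframe
# note: ".copy()" makes df1 independent of df, so later edits don't trigger a SettingWithCopyWarning
# df1 = df[['YEAR', 'WEEK','REGION','ILITOTAL']].copy()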

In [None]:
# let's check the summary of our new dataframe using the "info()" function
# df1.info()

Note that our "ILITOTAL" variable is an 'object' type, not an integer ('int64').  This means that there are some string values in this column that we should clean up.
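
A minimal sketch of why this happens, using made-up values where "X" stands in for a missing count: a column that mixes numbers and text is stored with the generic 'object' type. The sketch also shows `pd.to_numeric` with `errors="coerce"`, which is one alternative to the replace-then-convert steps used in this lesson.

```python
import pandas as pd

# Invented values for illustration; not rows from the CDC dataset.
mixed = pd.Series([12, "X", 30])
print(mixed.dtype)  # generic 'object' type because of the text value

# Alternative approach: turn anything non-numeric into NaN in one step.
numeric = pd.to_numeric(mixed, errors="coerce")
print(numeric.dtype)
print(numeric.isna().sum())  # how many values could not be converted
```

Once the column is numeric, functions such as `describe()` can compute summary statistics for it.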

In [None]:
# let's take a closer look at our new dataframe
# df1.head(10)

In [None]:
# note that in row 9 we have a value of "X" for 'ILITOTAL'
# let's replace all the "X" values in 'ILITOTAL' with a null placeholder
# to do this we're going to import another package called NumPy
# import numpy as np
# df1['ILITOTAL'] = df1['ILITOTAL'].replace('X', np.nan)

In [None]:
# and we'll check the top 10 lines to see if "X" was replaced
# df1.head(10)

In [None]:
# but what about the column type?
# df1.info()

In [None]:
# df1['ILITOTAL'] = df1['ILITOTAL'].astype('float')

In [None]:
# now let's recheck the info for our dataframe
# note: 'ILITOTAL' now has a float type ('float64')
# df1.info()

In [None]:
# let's try the "describe()" function again
# note: we now have summary stats for 'ILITOTAL'
# df1.describe()

In [None]:
# one more safety check: is this just for the 50 US states?
# df1['REGION'].nunique()

In [None]:
# view a unique listing of regions in the dataframe
# df1['REGION'].unique().tolist()

### Data Visualization
Now that we have simplified our dataset, let's make some figures!

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()  # apply seaborn's default style (the 'seaborn' style name is no longer available in recent matplotlib)
%matplotlib inline

In [None]:
# getting a quick glimpse with the plot() function
df1.plot()

In [None]:
# generate a histogram for the ILI case count
df1['ILITOTAL'].hist(bins=100)

In [None]:
# generate a boxplot for the ILI case count
df1.boxplot('ILITOTAL')

In [None]:
# saving a figure
# note: boxplot() returns a matplotlib Axes; we name it "ax" so we don't overwrite the "plt" import
ax = df1.boxplot('ILITOTAL')
ax.set_title('Range of ILI Cases per Week')
ax.set_ylabel('# of ILI Cases per Week')
fig = ax.get_figure()
fig.savefig('figs/fig1.png')

Seaborn is an alternative package that works very well with Pandas dataframes to generate figures, check out the Seaborn [example gallery](https://seaborn.pydata.org/examples/index.html) for some inspiration.

In [None]:
# making a seaborn lineplot (sns.lineplot())
# let's view the incidence of influenza cases over weeks for each state
sns.lineplot(x="WEEK", y="ILITOTAL",
             hue="REGION",
             data=df1)

In [None]:
# let's view the incidence of influenza cases over weeks for a single state
sns.lineplot(x="WEEK", y="ILITOTAL",
             hue="REGION",
             data=df1.query('REGION == "Maryland"'))

In [None]:
# let's view the incidence of influenza cases over weeks for a list of regions/states
extra_regions = ['District of Columbia','Maryland','Virginia']
sns.lineplot(x="WEEK", y="ILITOTAL",
             hue="REGION",
             data=df1.query(f'REGION == {extra_regions}'))

In [None]:
# saving a seaborn plot by keeping a reference to the returned Axes and using get_figure()
sns_plt = sns.lineplot(x="WEEK", y="ILITOTAL",
             hue="REGION",
             data=df1.query('REGION == "Maryland"'))
fig = sns_plt.get_figure()
fig.savefig('figs/fig2.png')

In [None]:
# bringing it all together to make the example image at the top
regions = ['Maryland','Virginia','District of Columbia']

sns_plt = sns.lineplot(x="WEEK", y="ILITOTAL",
             hue="REGION",
             data=df1.query(f'REGION == {regions}'))
sns_plt.set_title('Influenza-Like Cases from 2014-2018')
sns_plt.set_ylabel('Number of Influenza-Like Cases')
sns_plt.set_xlabel('Week Number')

fig = sns_plt.get_figure()
fig.savefig('figs/flu.png')

In [None]:
# example of a swarm plot using seaborn
sns.swarmplot(
    x="REGION", 
    y="ILITOTAL",
    data=df1.query(f'REGION == {regions}')
)

In [None]:
# example of a violin plot using seaborn
sns.violinplot(x="REGION", y="ILITOTAL",
               data=df1.query(f'REGION == {regions}'))

# Final Notes
![xkcd](https://imgs.xkcd.com/comics/is_it_worth_the_time_2x.png)