# Data Processing with Python and Pandas Part One


## Today's Topics

* What/Why Pandas
* Data Structures
* Loading Data
* Basic Data Manipulation

## What is Pandas 

* Pandas is a 3rd-party library for doing data analysis
* It is a foundational component of Python data science
* Developed by [Wes McKinney](http://wesmckinney.com/pages/about.html) while working in the finance industry, so it has some...warts
* Vanilla Python (what we did previously) can do many of the same things, but Pandas does them *faster* and usually in fewer lines of code
* To do this, it is built on top of another 3rd-party library called [NumPy](http://www.numpy.org/)
    * If you have TONS of numerical data you can use NumPy directly
* Pandas gives Python some R-like functionality (Dataframes)
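
To make the "fewer lines of code" point concrete, here is a small sketch that totals attendance per center twice, once in vanilla Python and once with Pandas. The records and the center names ("North", "South") are made up for illustration and are not from the real dataset.

```python
from collections import defaultdict

import pandas as pd

# Made-up records for illustration: (center name, attendance count)
records = [
    ("North", 10),
    ("South", 4),
    ("North", 6),
    ("South", 8),
    ("North", 2),
]

# Vanilla Python: accumulate the totals by hand
totals_plain = defaultdict(int)
for center, count in records:
    totals_plain[center] += count

# Pandas: build a Dataframe and let groupby do the bookkeeping
df = pd.DataFrame(records, columns=["center_name", "attendance_count"])
totals_pandas = df.groupby("center_name")["attendance_count"].sum()

print(dict(totals_plain))
print(totals_pandas)
```

Both versions give the same totals. The Pandas version pays off more as the questions get more complicated.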

## Why Pandas?

* Pandas provides a powerful set of data structures and functions for working with data.
* Once you learn these structures and functions (which takes time) you can begin to quickly ask questions and get answers from data.
* Pandas integrates nicely with other libraries in the Python data science ecosystem like:
    * [Jupyter Notebooks](http://jupyter.org/) - pretty display of Dataframes as HTML tables
    * [Matplotlib](https://matplotlib.org/) - Easy plotting from Dataframes
    * [Scikit Learn](http://scikit-learn.org/stable/) - Integrates with the machine learning API



In [None]:
import pandas as pd
%matplotlib inline

In [None]:

# load the CSV file
data = pd.read_csv("community-center-attendance.csv", index_col="date", parse_dates=True)

# drop the id column because we don't need it
data = data.drop(columns="_id")

# look at the first ten rows of the data
data.head(10)

In [None]:
# What does the data look like?
data.plot();

We can pivot the data so the center names are columns and each row is the number of people attending that community center per day. This is basically rotating the data.

In [None]:
# Use the pivot function to make column values into columns
data.pivot(columns="center_name", values="attendance_count").head()

That is a lot of NaN, and not the tasty garlicky kind either.
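
To see where the NaNs come from, here is a tiny made-up example (the center names and dates are invented, not from the real file). When a center has no row for a given date, the pivoted table has nothing to put in that cell, so Pandas fills it with NaN.

```python
import pandas as pd

# Made-up data: "South" has no entry for the second date
toy = pd.DataFrame(
    {
        "center_name": ["North", "South", "North"],
        "attendance_count": [10, 4, 6],
    },
    index=pd.to_datetime(["2024-01-01", "2024-01-01", "2024-01-02"]),
)
toy.index.name = "date"

# Each distinct center_name becomes its own column
toy_pivoted = toy.pivot(columns="center_name", values="attendance_count")
print(toy_pivoted)
```

The "South" column is NaN on the second date because there was no row to pivot into that cell.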

We might want to break this apart for each Community Center. We can start by inspecting the number of rows per center.

In [None]:
# count the number of rows per center and sort the list
data.groupby("center_name").count().sort_values(by=["attendance_count"], 
                                                ascending=False)

We can look at this visually too!

In [None]:
# plot the number of rows per center (not the total attendance)
data.groupby("center_name").count().sort_values(by=["attendance_count"], 
                                                ascending=False).plot(kind="bar");

There are a lot of community centers that don't have many rows because either 1) they are not very popular or 2) they don't report their daily attendance (more likely given how many NaNs we saw above).
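
One detail worth knowing when counting rows: on a groupby, `count()` counts the non-missing values in each column, while `size()` counts rows. A minimal sketch with made-up data (the values below are invented for illustration) shows the difference.

```python
import numpy as np
import pandas as pd

# Made-up data: one of the "North" rows has a missing attendance value
toy = pd.DataFrame(
    {
        "center_name": ["North", "North", "South"],
        "attendance_count": [10, np.nan, 4],
    }
)

# size() counts rows in each group, missing values included
rows_per_center = toy.groupby("center_name").size()

# count() counts only the non-missing values in the column
non_missing = toy.groupby("center_name")["attendance_count"].count()

print(rows_per_center)
print(non_missing)
```

If a column has no missing values, the two give the same numbers.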

What we will do is create a custom filter function and apply it to every group in the dataframe using the [groupby filter function](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.groupby.DataFrameGroupBy.filter.html). This is some gnarly stuff we are doing here. This isn't the plain old filter function; it is a special filter function (part of the groupby functionality) that requires you to write a function that is called once per group (here, once per community center) and returns True to keep that group's rows or False to drop them. In our case we will make a little function that takes a group and tests whether its number of rows is greater than a threshold value (in our case 1000).

In [None]:
# create a function we will use to perform a filtering operation on the data
# keep only the centers that have more than `threshold` rows
def filter_less_than(x, threshold):
    # x is the sub-dataframe for one center; len(x) is its number of rows
    return len(x) > threshold

# use the custom function to filter out rows
popular_centers = data.groupby("center_name").filter(filter_less_than, 
                                                     threshold=1000)
# look at what centers are in the data now
popular_centers.groupby("center_name").count()

Now we have a more meaty subset of the data to examine.
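
If the groupby filter still feels gnarly, here is the same idea on a tiny made-up dataframe. The function name `has_more_rows_than` and the data are hypothetical, chosen just for this sketch. The function gets called once per group and returns True or False, and the rows of every group that returned True are kept.

```python
import pandas as pd

# Made-up data: "North" has three rows, "South" has one
toy = pd.DataFrame(
    {
        "center_name": ["North", "North", "North", "South"],
        "attendance_count": [10, 6, 2, 4],
    }
)


def has_more_rows_than(group, threshold):
    # group is the sub-dataframe holding all the rows for one center
    return len(group) > threshold


# extra keyword arguments (threshold) are passed along to the function
kept = toy.groupby("center_name").filter(has_more_rows_than, threshold=2)
print(kept)
```

Only the "North" rows survive, because "South" has fewer than three rows.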

In [None]:
# look at the first 5 rows
popular_centers.head()

In [None]:
# plot the popular community centers
popular_centers.plot();

This isn't the most informative representation of the data. Perhaps we can reshape it to make it more useful.

In [None]:
# Use the pivot_table function to make center names into columns, using only the popular community centers
pivoted_data = popular_centers.pivot_table(columns="center_name",
                                           values="attendance_count", 
                                           index="date")
pivoted_data.head()

Still NaN-y, but not as bad. Now we can look at the attendance at the more popular community centers over time.

In [None]:
# plot the data
pivoted_data.plot(figsize=(10,10));

Still pretty messy. Let's look at the cumulative sum.

In [None]:
# compute the cumulative sum for every column and make a chart
pivoted_data.cumsum().plot(figsize=(10,10));

Looks like Brookline is the winner here, but attendance has tapered off in the past couple of years.
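
For a feel of what `cumsum()` does with our NaN-y columns, here is a small sketch on made-up numbers (the column names and values are invented). Each column gets its own running total, and by default a missing value stays NaN in the output while the running total carries on past it.

```python
import numpy as np
import pandas as pd

# Made-up daily attendance; "South" is missing the second day
daily = pd.DataFrame(
    {"North": [10, 6, 2], "South": [4, np.nan, 8]},
    index=pd.date_range("2024-01-01", periods=3, freq="D"),
)

# running total down each column
running = daily.cumsum()
print(running)
```

That gap-skipping behavior is why the cumulative chart has lines that keep climbing even where a center has missing days.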

In [None]:
# Resample and compute the monthly totals for the popular community centers
# (pandas 2.2 and later prefer the alias "ME" over "M" for month-end)
pivoted_data.resample("M").sum().plot(figsize=(10,10));

Looks like monthly is too messy, maybe by year?

In [None]:
# resample to yearly, compute the totals, and plot
# (pandas 2.2 and later prefer the alias "YE" over "Y" for year-end)
pivoted_data.resample("Y").sum().plot(figsize=(10,10));
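
Resampling is easier to follow on a handful of rows. The sketch below uses made-up daily numbers that straddle a month boundary and sums them per month. It uses the "MS" (month start) alias, which is an assumption made so the example runs the same way across Pandas versions; the only difference from "M" above is that each bucket is labeled with the first day of the month instead of the last.

```python
import pandas as pd

# Made-up daily attendance: Jan 30, Jan 31, Feb 1, Feb 2
daily = pd.DataFrame(
    {"North": [1, 2, 3, 4]},
    index=pd.date_range("2024-01-30", periods=4, freq="D"),
)

# group the rows into calendar months and add up each month
monthly = daily.resample("MS").sum()
print(monthly)
```

The two January days collapse into one row and the two February days into another.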