## Introduction to Pandas

[Pandas](http://pandas.pydata.org/) is the essential data analysis library for Python programmers. It provides fast and flexible data structures built on top of [numpy](http://www.numpy.org/).

It is well suited to handle "tabular" data (that might be found in a spreadsheet), time series data, or pretty much anything you care to put in a matrix with rows and named columns.

It contains two primary data structures, the `Series` (1-dimensional) and the `DataFrame` (2-dimensional) as well as a host of convenience methods for loading and plotting data.

The main thing that makes pandas pandas is that all data is *intrinsically aligned*. That means each data structure, whether `DataFrame` or `Series`, has something called an **Index** that links data values with labels. That link will always be there (unless you explicitly break or change it) and it's what allows pandas to quickly and efficiently "do the right thing" when working with data.

In [None]:
# The canonical way to import pandas:
import pandas as pd
import numpy as np
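To make the idea of intrinsic alignment concrete, here is a minimal sketch using two small made-up `Series` (the names `left` and `right` are illustrative only). Their labels appear in a different order, yet addition pairs values by label rather than by position.

```python
import pandas as pd

# Two Series with the same labels, listed in a different order.
left = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
right = pd.Series([10, 20, 30], index=['c', 'a', 'b'])

# Addition matches values by label, not by position:
# 'a' pairs 1 with 20, 'b' pairs 2 with 30, 'c' pairs 3 with 10.
total = left + right
print(total)
```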

## The Series Object

A `Series` is a one-dimensional array of indexed data.

In [None]:
data = pd.Series([0.1, 0.2, 0.3, 0.4])
data

The `Series` wraps a 1-d `ndarray` from numpy and an `Index` object.

In [None]:
data.values

In [None]:
type(data.values)

In [None]:
data.index

In [None]:
# This particular index type, the `RangeIndex`, lets us use the
# same square-bracket notation as a `ndarray` to access elements:
data[0]

In [None]:
data.values[0]

In [None]:
# or even a slice:
data[1:3]

We don't have to use this auto-generated list of integers as the index though. Index values can be specified manually and don't even have to be integers.

In [None]:
data = pd.Series([0.1, 0.2, 0.3, 0.4], index=['a', 'b', 'c', 'd'])
data

In [None]:
data.index

In [None]:
# Item access works just like before, with square brackets, 
# even though the index values are strings
data['a']

In [None]:
# Slices still work! But note that the last element is included this time.
# Label-based slices include both endpoints, unlike position-based slices.
data['a':'c']
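The difference between the two kinds of slice is easiest to see side by side. This sketch uses a throwaway `Series` named `letters` (an illustrative name) and the explicit `loc` and `iloc` indexers, which are covered in more detail later.

```python
import pandas as pd

letters = pd.Series([0.1, 0.2, 0.3, 0.4], index=['a', 'b', 'c', 'd'])

# Label-based slice: the stop label 'c' is included.
by_label = letters.loc['a':'c']

# Position-based slice: the stop position 2 is excluded, as in plain Python.
by_position = letters.iloc[0:2]

print(list(by_label.index))
print(list(by_position.index))
```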

In [None]:
# We could create a non-sequential integer index:
data = pd.Series([0.1, 0.2, 0.3, 0.4], index=[5, 8, 2, 1])
data

In [None]:
data.index

In [None]:
data.values[1]

In [None]:
# Why?
data[1]

Above we see the critical difference between numpy arrays, which are always ordered sequentially and have an implicit integer index, and `Series` objects, which have an index that maps *labels* to *values*.

`Series` are in fact a cross between a numpy array and a python dictionary. You can think of them as a dictionary with *typed* keys and *typed* values.

In [None]:
max_depths_dict = {
    'Erie': 64,
    'Huron': 229,
    'Michigan': 281,
    'Ontario': 244,
    'Superior': 406,
}
max_depths = pd.Series(max_depths_dict)
max_depths

In [None]:
# squint and it looks like a dictionary!
max_depths['Michigan']

In [None]:
max_depths_dict['Michigan']

In [None]:
# Notice the index in this case was constructed automatically
# from the dictionary keys.
max_depths.index

We can think of an `Index` as an *immutable*, one-dimensional array of labels.
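As a rough sketch of what immutability means in practice, the snippet below tries to overwrite one label of an `Index` and catches the resulting error. It also shows that an `Index` supports set-like operations such as `intersection`. The variable names are illustrative only.

```python
import pandas as pd

idx = pd.Index(['Erie', 'Huron', 'Michigan'])

# Assigning to an element of an Index is not allowed.
try:
    idx[0] = 'Ontario'
    mutated = True
except TypeError:
    mutated = False

# An Index also behaves a bit like a set of labels.
other = pd.Index(['Huron', 'Michigan', 'Superior'])
common = idx.intersection(other)

print(mutated)
print(list(common))
```

Immutability is what makes it safe for several `Series` or `DataFrame` objects to share one `Index`.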

## The DataFrame Object

Much like the `Series` is a one-dimensional array of indexed data, a `DataFrame` is a two-dimensional array of indexed data.

You can think of a `DataFrame` as a sequence of `Series` objects all sharing the same index.

In [None]:
avg_depths_dict = {
    'Erie': 19,
    'Huron': 59,
    'Michigan': 85,
    'Ontario': 86,
    'Superior': 149,
}

avg_depths = pd.Series(avg_depths_dict)

lakes = pd.DataFrame({'Max Depth (m)': max_depths, 'Avg Depth (m)': avg_depths})
lakes

In [None]:
# Just like the `Series`, a `DataFrame` has an `index` property
lakes.index

In [None]:
# and a `values` property that exposes the underlying `ndarray`
lakes.values

In [None]:
# And unlike the Series, the DataFrame has a `columns` property
lakes.columns

In [None]:
# We can get the shape of a dataframe, just like a numpy ndarray
lakes.shape

In [None]:
# We can do dictionary-style lookups into the dataframe by column name
# to get a single Series:
lakes['Max Depth (m)']

In [None]:
# To select more than one column put a list of column names inside the dictionary-style square brackets:
lakes[['Max Depth (m)','Avg Depth (m)']]

### Creating new columns

Once we have a `DataFrame`, creating new columns is done through simple assignment.

In [None]:
surface_area = pd.Series({
    'Superior': 82097,
    'Michigan': 57753,
    'Huron': 59565,
    'Erie': 25655,
    'Ontario': 19009,
})

lakes['Surface Area (sq km)'] = surface_area
lakes

Notice how the index values allowed pandas to "align" the new data with the existing data!
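Alignment also decides what happens when the new data does not cover every row. In this small sketch (a cut-down copy of the lakes data, with the illustrative names `depths` and `area`), the new `Series` is in a different order and has no entry for Huron, so pandas fills that row with `NaN`.

```python
import pandas as pd

depths = pd.DataFrame({'Max Depth (m)': [64, 229, 281]},
                      index=['Erie', 'Huron', 'Michigan'])

# Different order, and deliberately no entry for 'Huron'.
area = pd.Series({'Michigan': 57753, 'Erie': 25655})

# Values are matched to rows by label; unmatched rows become NaN.
depths['Surface Area (sq km)'] = area
print(depths)
```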

It's also possible to create new columns from existing columns. Say for example we wanted a column to track the difference between the avg depth and max depth. We'll call this the "depth delta".

In [None]:
lakes['Depth Delta'] = lakes['Max Depth (m)'] - lakes['Avg Depth (m)']
lakes

## Data Indexing and Selection

Now that we can load data into pandas objects, we need to be able to access it. Pandas offers a variety of methods for accessing the data we need.

First, both `Series` and `DataFrame` objects support dictionary-style access with square brackets. Think of index label values as dictionary keys:

In [None]:
# We saw this above -- access a series like a dictionary to get a single value.
avg_depths['Michigan']

In [None]:
# DataFrame dictionary-style access returns the Series with that column index label:
lakes['Avg Depth (m)']

Pandas also borrows array-style access from numpy. Namely, masking and "fancy indexing" work like in numpy.

In [None]:
# use a boolean mask to select just the items we want:
avg_depths[avg_depths > 60]

In [None]:
# fancy indexing with an array of index labels:
avg_depths[['Erie', 'Ontario']]

In [None]:
# There is a potential problem with non-sequential integer indexes:
data_implicit = pd.Series([100, 200, 300, 400])
data_explicit = pd.Series([100, 200, 300, 400], index=[4, 9, 8, 1])
print("data_implicit[1] = {}\ndata_explicit[1] = {}".format(
    data_implicit[1],
    data_explicit[1]
))

To handle this potential confusion between label-based and position-based access, and to make data access easier in general, pandas provides "indexers": `Series` and `DataFrame` attributes that expose different ways to access the data.

- `iloc`: always integer position-based
- `loc`: always label-based
- `ix`: primarily label-based, falling back to position-based. It was deprecated in pandas 0.20 and removed in pandas 1.0, so use `loc` or `iloc` instead.

In [None]:
data_implicit.iloc[1]

In [None]:
data_explicit.iloc[1]

In [None]:
# This raises a KeyError: the implicit index only contains the labels 0 through 3.
data_implicit.loc[4]

In [None]:
data_explicit.loc[4]

In [None]:
# We can use slices to select more than one value as well. Here, get all values after the first one:
data_implicit.iloc[1:]

In [None]:
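# `ix` used to allow mixing position-based and label-based access, but it was removed in pandas 1.0.
# To get the same result, look up the row label by position and pass it to `loc`:
lakes.loc[lakes.index[0], ['Avg Depth (m)', 'Max Depth (m)']]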

In [None]:
# Pop quiz! Let's get all rows of the lakes dataframe except the last one:
lakes.iloc[0:-1]

In [None]:
# What about the first two rows and first two columns only?
lakes.iloc[:2, :2]

In [None]:
lakes.loc['Erie']

`loc` accepts the following types of inputs:

- a single label (as above)
- a list or array of labels, e.g. `['a', 'b', 'c']`
- a slice object with labels, e.g. `'a':'c'` (note that contrary to usual python slices, both the start and the stop are included!)
- a boolean array
- a callable function with one argument (the calling `Series` or `DataFrame`) that returns valid output for indexing (one of the above)
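The input types listed above can be sketched on a small `Series`. The names here (`s`, `single`, `several` and so on) are made up for the illustration.

```python
import pandas as pd

s = pd.Series([0.1, 0.2, 0.3, 0.4], index=['a', 'b', 'c', 'd'])

single = s.loc['b']                  # a single label
several = s.loc[['a', 'd']]          # a list of labels
span = s.loc['b':'c']                # a label slice, both ends included
masked = s.loc[s > 0.25]             # a boolean array
called = s.loc[lambda x: x < 0.25]   # a callable that returns a boolean array

print(single)
print(list(several.index), list(span.index))
print(list(masked.index), list(called.index))
```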

`loc` and `iloc` also take an optional second argument that selects columns: column labels for `loc`, integer column positions for `iloc`:

In [None]:
lakes.loc[['Michigan', 'Superior'], ['Max Depth (m)']]

It is also possible to assign to the values at the locations you specify with the `iloc` and `loc` indexers! They aren't read-only.

In [None]:
df = pd.DataFrame(np.random.randint(0, 10, (3, 3)), columns=['A', 'B', 'C'])
df

In [None]:
# Assign the value 100 to rows 0 and 1 of column B.
# Remember with label-based access, which `loc` uses, the high end of the slice is *included*.
df.loc[:1, 'B'] = 100
df

### Examining Data

While you can manipulate and operate on your data in any way you can dream up, pandas does provide basic descriptive statistics and sorting functionality for you. I **highly** recommend reading the [Pandas documentation](https://pandas.pydata.org/pandas-docs/stable/api.html#computations-descriptive-stats) to see what methods are available and save yourself some work!

The `describe` method is very useful with numeric data:

In [None]:
lakes.describe()

In [None]:
# We can get the highest value for a given Series with `max`:
max_depths.max()

In [None]:
# But what if we wanted the top 2? `sort_values` is the answer:
max_depths.sort_values(ascending=False).head(2)

In [None]:
# This is so common that there is actually a shortcut for it:
max_depths.nlargest(2)

In [None]:
# Which naturally works on DataFrames as well:
lakes.nlargest(2, 'Avg Depth (m)')

## Loading Data

Pandas provides a bunch of functions for reading data from a variety of sources, including CSV, Excel files, SQL databases, HDF5, even your computer's clipboard! The always-comprehensive pandas documentation has more info here: [https://pandas.pydata.org/pandas-docs/stable/io.html](https://pandas.pydata.org/pandas-docs/stable/io.html).
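As a quick, self-contained sketch of how these readers behave, `read_csv` also accepts a file-like object, so a CSV held in a string can be parsed without touching the disk. The column names `lake` and `max_depth_m` are invented for this example.

```python
import io

import pandas as pd

# A tiny CSV held in memory; the column names are made up for illustration.
csv_text = """lake,max_depth_m
Erie,64
Superior,406
"""

# index_col tells pandas which column to use as the row index.
lakes_from_csv = pd.read_csv(io.StringIO(csv_text), index_col='lake')
print(lakes_from_csv)
```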

Let's read a local CSV dataset into a dataframe using the `read_csv` function.

In [None]:
df = pd.read_csv("data/Speed_Camera_Violations.csv")

This particular `DataFrame` contains speed camera violation data provided by the city of Chicago. This dataset is available at [https://catalog.data.gov/dataset/speed-camera-violations-997eb](https://catalog.data.gov/dataset/speed-camera-violations-997eb).

Let's start inspecting it by using the `head` method to look at the first ten rows.

In [None]:
df.head(10)

When data is loaded from an external source, pandas will try to guess the datatype for each column. Let's see how it did:

In [None]:
# `dtypes` returns a Series that maps each column name to its data type.
df.dtypes

## Data types

Much of pandas' functionality depends on the data types of the `Series` it's working with. For instance, we can get summary measures and do numpy-like vectorized operations on numeric types (`int64`, `float64`), or do date-based arithmetic with `datetime64` series.

Notice above that the data type of the `VIOLATION DATE` column is "object", which, just like in numpy, means it is a generic type that isn't very useful. Let's turn those date strings into actual date objects, which are much better to work with.

In [None]:
# given a Series, pd.to_datetime returns a new Series with the string dates parsed as actual dates.
# We can then directly assign that Series back to the original column in our dataframe and pandas' magical Index
# functionality will make it all line up properly.
df["VIOLATION DATE"] = pd.to_datetime(df["VIOLATION DATE"], format="%m/%d/%Y")

df["VIOLATION DATE"].head()
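To see what the conversion buys us, here is a small sketch on a few made-up date strings in the same `%m/%d/%Y` format. Once parsed, the `Series` exposes date parts through the `.dt` accessor and supports date arithmetic.

```python
import pandas as pd

# Made-up date strings, not taken from the dataset.
raw = pd.Series(['01/15/2015', '12/31/2015', '02/01/2016'])
dates = pd.to_datetime(raw, format='%m/%d/%Y')

# The .dt accessor exposes date components for every element at once.
years = dates.dt.year

# Subtracting two timestamps gives a Timedelta.
span_days = (dates.max() - dates.min()).days

print(dates.dtype)
print(list(years))
print(span_days)
```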

## Filtering

Now that we have a date column, we can do things like filter to only look at violations in 2015.

To do this, we'll create a "filter", essentially a boolean expression that works just like a mask or "fancy indexing" expression in numpy, and apply that filter to our dataframe to get just the rows we want.


In [None]:
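# Note the extra parentheses below. They are necessary when combining multiple comparisons
# into one boolean filter expression, because `&` binds more tightly than `>=` and `<`.
# We compare against `pd.Timestamp` values because recent pandas versions refuse to
# compare a datetime column with plain `datetime.date` objects.
date_filter = ((df["VIOLATION DATE"] >= pd.Timestamp("2015-01-01")) & (df["VIOLATION DATE"] < pd.Timestamp("2016-01-01")))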

# date_filter now contains a series of true/false values that we can use to extract just the values we are interested in
# by putting it in square brackets after the dataframe variable.
print(date_filter.head())
print()
print(date_filter.tail())

df_2015 = df[date_filter]

df_2015.head()

This kind of filtering works for any kind of data type, provided you take care to make sure pandas is using the right data types for your data!
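The same pattern applies to numeric columns. The sketch below rebuilds a small copy of the lakes data (named `lakes_demo` here to avoid clashing with earlier variables) and combines two filters with `&` (and), `|` (or) and `~` (not).

```python
import pandas as pd

lakes_demo = pd.DataFrame(
    {'Max Depth (m)': [64, 229, 281, 244, 406],
     'Avg Depth (m)': [19, 59, 85, 86, 149]},
    index=['Erie', 'Huron', 'Michigan', 'Ontario', 'Superior'])

# Each comparison produces a boolean Series aligned with the rows.
deep = lakes_demo['Max Depth (m)'] > 240
shallow_avg = lakes_demo['Avg Depth (m)'] < 100

both = lakes_demo[deep & shallow_avg]     # rows where both are True
either = lakes_demo[deep | shallow_avg]   # rows where at least one is True
not_deep = lakes_demo[~deep]              # rows where `deep` is False

print(list(both.index))
print(list(not_deep.index))
```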

You may have noticed that many of the rows in this dataframe are missing lat/lon data. Pandas uses the "NaN" placeholder for missing data and offers some methods for dealing with it.

Both `Series` and `DataFrame` objects have a `fillna` method that will replace missing data with a specified value.
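A minimal sketch of `fillna` on a made-up `Series` (the name `readings` is illustrative): it returns a new object and leaves the original untouched.

```python
import numpy as np
import pandas as pd

readings = pd.Series([1.5, np.nan, 3.0])

# Replace every missing value with 0.0; `readings` itself is not modified.
filled = readings.fillna(0.0)

print(list(filled))
print(readings.isna().sum())
```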

In this case, however, we may want to just drop those records that have missing data entirely:


In [None]:
df_no_nans = df.dropna(axis=0, how="any")
df_no_nans.head()
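Dropping every row that has any missing value can be stricter than intended. As a sketch, assuming we only care about missing coordinates, the `subset` parameter of `dropna` limits the check to the named columns. The frame `cams` and its column names are hypothetical and do not match the real dataset.

```python
import numpy as np
import pandas as pd

# A toy frame; the column names are hypothetical.
cams = pd.DataFrame({
    'violations': [10, 20, 30],
    'lat': [41.9, np.nan, 41.8],
    'note': [np.nan, 'ok', 'ok'],
})

# Drops any row with a missing value in any column.
strict = cams.dropna(axis=0, how='any')

# Drops only the rows where 'lat' is missing.
located = cams.dropna(subset=['lat'])

print(list(strict.index))
print(list(located.index))
```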