
# Introduction to Pandas

Based on a series of tutorials by Chris Fonnesbeck: 

https://github.com/fonnesbeck/statistical-analysis-python-tutorial

Pandas provides a useful wrapper for tabular data, with many utilities for restructuring and analyzing data.

http://pandas.pydata.org/pandas-docs/stable/

This book by Wes McKinney is a great starting point for getting further into Pandas:

[Python for Data Analysis](http://shop.oreilly.com/product/0636920023784.do)

pandas is well suited for:

- Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
- Ordered and unordered (not necessarily fixed-frequency) time series data.
- Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels
- Any other form of observational / statistical data sets. The data actually need not be labeled at all to be placed into a pandas data structure


Key features:
    
- Easy handling of **missing data**
- **Size mutability**: columns can be inserted and deleted from DataFrame and higher dimensional objects
- Automatic and explicit **data alignment**: objects can be explicitly aligned to a set of labels, or the data can be aligned automatically
- Powerful, flexible **group by functionality** to perform split-apply-combine operations on data sets
- Intelligent label-based **slicing, fancy indexing, and subsetting** of large data sets
- Intuitive **merging and joining** data sets
- Flexible **reshaping and pivoting** of data sets
- **Hierarchical labeling** of axes
- Robust **IO tools** for loading data from flat files, Excel files, databases, and HDF5
- **Time series functionality**: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging, etc.


Pandas uses NumPy arrays under the hood. NumPy provides a high-performance, multi-dimensional interface to blocks of memory that can be accessed efficiently. See the introduction to NumPy for background information.

In [None]:
%matplotlib inline
import pandas as pd
import numpy as np

# Set some Pandas options
pd.set_option('display.notebook_repr_html', False)
pd.set_option('display.max_columns', 30)
pd.set_option('display.max_rows', 20)

## Pandas Data Structures

### Series

A **Series** is a single vector of data (like a one-dimensional NumPy array) with an *index* that labels each element in the vector.

In [None]:
counts = pd.Series([632, 1638, 569, 115])
counts

If an index is not specified, a default sequence of integers is assigned as the index. A NumPy array comprises the values of the `Series`, while the index is a pandas `Index` object.

In [None]:
counts.values

In [None]:
counts.index

We can assign meaningful labels to the index, if they are available:

In [None]:
top_grantmakers_2014 = pd.Series([3921403840, 1171857588, 964514537, 964514537], 
    index=['Bill & Melinda Gates Foundation', 'Schwab Charitable Fund', 'Silicon Valley Community Foundation', 'Gordon and Betty Moore Foundation'])
print (top_grantmakers_2014)

These labels can be used to refer to the values in the `Series`, dictionary-style.

In [None]:
top_grantmakers_2014['Silicon Valley Community Foundation']

Boolean indexing can work as well:

In [None]:
[name.endswith('Fund') for name in top_grantmakers_2014.index]

In [None]:
top_grantmakers_2014[[name.endswith('Fund') for name in top_grantmakers_2014.index]]

Notice that the indexing operation preserved the association between the values and the corresponding indices.

We can still use positional indexing if we wish, through the `iloc` attribute.

In [None]:
print(top_grantmakers_2014.iloc[1])
print(top_grantmakers_2014.index[1])
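
Slices work in both modes, with one difference worth remembering. The following is a minimal sketch on a small hypothetical Series (`amounts` is not part of the data above): position-based slices with `iloc` exclude the end point, while label-based slices with `loc` include it.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
amounts = pd.Series([300, 100, 200], index=['a', 'b', 'c'])

# Position-based slice: positions 0 and 1 (the end point is excluded)
first_two = amounts.iloc[0:2]

# Label-based slice: labels 'a' through 'c' (the end point is included)
a_to_c = amounts.loc['a':'c']

print(first_two)
print(a_to_c)
```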

We can give both the array of values and the index meaningful labels themselves:

In [None]:
top_grantmakers_2014.name = 'Top Amounts'
top_grantmakers_2014.index.name = 'Grantmaker'
top_grantmakers_2014

NumPy's math functions and other operations can be applied to Series without losing the data structure.

In [None]:
np.log(top_grantmakers_2014)

We can also filter according to the values in the `Series`:

In [None]:
top_grantmakers_2014[top_grantmakers_2014>10**9]

A `Series` can be thought of as an ordered key-value store. In fact, we can create one from a `dict`:

In [None]:
top_gm_dict = {'Schwab Charitable Fund': 1171857588, 'Bill & Melinda Gates Foundation': 3921403840, 'Gordon and Betty Moore Foundation': 964514537, 'Silicon Valley Community Foundation': 964514537}
print(pd.Series(top_gm_dict))

Notice the order of the resulting `Series`: older versions of pandas sorted it by key, while recent versions keep the insertion order of the `dict`.

If we pass a custom index to `Series`, it will select the corresponding values from the dict, and treat indices without corresponding values as missing. Pandas uses the `NaN` (not a number) value to mark missing values. Graceful handling of missing and null data is a key feature of Pandas:

In [None]:
top_gm_series = pd.Series(top_gm_dict, index=['Unknown', 'Bill & Melinda Gates Foundation','Schwab Charitable Fund','Silicon Valley Community Foundation','Gordon and Betty Moore Foundation'])
print(top_gm_series)

In [None]:
top_gm_series.isnull()

Critically, the labels are used to **align data** when used in operations with other Series objects:

In [None]:
print(top_gm_series)
print(top_grantmakers_2014)
print(top_gm_series + top_grantmakers_2014)

Contrast this with NumPy arrays, where arrays of the same length combine values element-wise by position; adding two Series combines the values that share a label. Notice also that the missing values were propagated by addition.
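
The difference between positional and label-based addition can be sketched with two small hypothetical Series (`s1` and `s2` are illustrative names, not part of the data above). The last line uses the `add` method with `fill_value`, which treats a label missing from one side as 0 instead of producing `NaN`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: overlapping but differently ordered labels
s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['c', 'b', 'd'])

# NumPy adds by position: 1+10, 2+20, 3+30
positional = s1.values + s2.values

# pandas adds by label; labels found in only one Series give NaN
aligned = s1 + s2

# Treat a label missing from one side as 0 instead of NaN
filled = s1.add(s2, fill_value=0)

print(positional)
print(aligned)
print(filled)
```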

### DataFrame

Inevitably, we want to be able to store, view and manipulate data that is multivariate, where for every index there are multiple fields or columns of data, often of varying data type.

A `DataFrame` is a tabular data structure, encapsulating multiple series like columns in a spreadsheet. Data are stored internally as a 2-dimensional object, but the `DataFrame` allows us to represent and manipulate higher-dimensional data.

In [None]:
df = pd.DataFrame({  'gm_state':['CA', 'CA', 'NY', 'NY', 'NY'],
                     'amount':[569092056, 538673007, 506235384, 467353105, 443120415],
                     'recip_state':['NJ', 'NY', 'MA', 'CA', 'DC']})
print(df)

Notice the column order: older versions of pandas sorted the columns by name, while recent versions keep the order given in the `dict`. We can change the order by indexing the columns in the order we desire:

In [None]:
df[['gm_state','recip_state','amount']]

A `DataFrame` has a second index, representing the columns:

In [None]:
df.columns

If we wish to access columns, we can do so either by dict-like indexing or by attribute:

In [None]:
df['amount']

In [None]:
df.amount

Each column is a `Series`:

In [None]:
type(df.amount)

A `DataFrame`'s columns can also be accessed using a list of names, in which case the result is a `DataFrame`, even if it has only one column:

In [None]:
type(df[['amount']])

Notice this is different from `Series`, where dict-like indexing retrieved a particular element (row). If we want access to a row in a `DataFrame`, we index its `loc` attribute (the older `ix` indexer has been removed from pandas).

In [None]:
df.loc[3]

Alternatively, we can create a `DataFrame` with a dict of dicts:

In [None]:
df = pd.DataFrame({0: {'gm_state': 'CA', 'recip_state': 'NJ', 'amount': 569092056},
                    1: {'gm_state': 'CA', 'recip_state': 'NY', 'amount': 538673007},
                    2: {'gm_state': 'NY', 'recip_state': 'MA', 'amount': 506235384},
                    3: {'gm_state': 'NY', 'recip_state': 'CA', 'amount': 467353105},
                    4: {'gm_state': 'NY', 'recip_state': 'DC', 'amount': 443120415}})
print (df)

We probably want this transposed. Note that the `T` attribute works as in NumPy:

In [None]:
df = df.T
df

It's important to note that, as with NumPy, the Series returned when a DataFrame is indexed can be a **view** on the DataFrame rather than a copy of the data itself (whether it is depends on the pandas version and its copy-on-write setting). So you must be cautious when manipulating this data:

In [None]:
vals = df.amount
vals

*vals* is just a reference to df.amount, so if we change a value of an element in vals...

In [None]:
vals[4] = 0
vals

*df* is affected by the change:

In [None]:
df

On the other hand, if we use the `copy` method, a separate copy is created:

In [None]:
df = pd.DataFrame({0: {'gm_state': 'CA', 'recip_state': 'NJ', 'amount': 569092056},
                    1: {'gm_state': 'CA', 'recip_state': 'NY', 'amount': 538673007},
                    2: {'gm_state': 'NY', 'recip_state': 'MA', 'amount': 506235384},
                    3: {'gm_state': 'NY', 'recip_state': 'CA', 'amount': 467353105},
                    4: {'gm_state': 'NY', 'recip_state': 'DC', 'amount': 443120415}})
df = df.T
vals = df.amount.copy()
vals[4] = 0

print("df", df)
print("vals", vals)

We can create or modify columns by assignment:

In [None]:
df.loc[3, 'amount'] = 10000
df

As with NumPy arrays, an entire column can be set at once by assigning a single value.

In [None]:
df['gm_state'] = 'MN'
df

But note, we cannot use the attribute indexing method to add a new column:

In [None]:
df.new_column = 1
df

But we can by using dictionary-style notation:

In [None]:
df["new_column"] = "X"
df

Specifying a `Series` as a new column causes its values to be added according to the `DataFrame`'s index. Note that multiplying a list by a number replicates the list the given number of times:

In [None]:
print([1,2,3]*3)
new_series = pd.Series([2]*3 + [1]*2)
new_series

In [None]:
df['new_series'] = new_series
df

Other Python data structures (ones without an index) need to be the same length as the `DataFrame`:

In [None]:
df['month'] = ['Jan']*len(df)
df

We can use `del` to remove columns, in the same way `dict` entries can be removed:

In [None]:
del df['month']
del df['new_column']
del df['new_series']
df

We can extract the underlying data as a simple `ndarray` by accessing the `values` attribute:

In [None]:
df.values

## Importing data

Pandas provides a convenient set of functions for importing tabular data in a number of formats directly into a `DataFrame` object. These functions include a slew of options to perform type inference, indexing, parsing, iterating and cleaning automatically as data are imported.  For example:

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html#pandas.read_csv

There are several other data formats that can be imported into Python and converted into DataFrames, with the help of built-in or third-party libraries. These include Excel, JSON, XML, HDF5, relational and non-relational databases, and various web APIs.

Note that `head()` and `tail()` let you inspect the beginning or end of the data:

In [None]:
df = pd.read_csv("arts_funding_by_state.txt", delimiter="\t")
print(df.head())
print(df.tail())

Notice that `read_csv` automatically treated the first row in the file as a header row, and inferred the data types:

In [None]:
print(df.dtypes)

For a more useful index, we can specify the first column as a unique index to the data.

In [None]:
df = pd.read_csv("arts_funding_by_state.txt", delimiter="\t", index_col=['display_name'])
df.head()

If we only want to import a small number of rows from, say, a very large data file we can use `nrows`:

In [None]:
df_two_rows = pd.read_csv("arts_funding_by_state_with_missing_data.txt", delimiter="\t", index_col=['display_name'], nrows=2)
print(df_two_rows)

Most real-world data is incomplete, with values missing due to incomplete observation, data entry or transcription error, or other reasons. Pandas will automatically recognize and parse common missing data indicators, including `NA` and `NULL`.

Above, Pandas recognized `N/A` and `NULL` as missing data and parsed them as `NaN` (not a number), as the first few lines of this file show:

```
Alabama	None	N/A
Alaska	NULL	1953665
```

Unfortunately, the conventions for missing data are sometimes inconsistent. In this example, the file also uses the string `None` where there should have been a positive integer. We can specify additional missing-value markers with the `na_values` argument:
   

In [None]:
df_with_nulls = pd.read_csv("arts_funding_by_state_with_missing_data.txt", delimiter="\t", index_col=['display_name'],
                           na_values=['None'])
print(df_with_nulls.head())
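
The same argument handles any other marker. As a minimal sketch that runs without the data file, the snippet below parses a small hypothetical tab-separated string (the rows and the `grants` column are invented for illustration) in which `?` stands for a missing value.

```python
import io
import pandas as pd

# Hypothetical tab-separated data; "?" marks a missing value
raw = "display_name\tpopulation\tgrants\nAlpha\t100\t?\nBeta\tNA\t250\n"

# Without na_values, the "?" forces the grants column to be read as strings
naive = pd.read_csv(io.StringIO(raw), sep="\t", index_col="display_name")

# With na_values, "?" is parsed as NaN and the column becomes numeric
clean = pd.read_csv(io.StringIO(raw), sep="\t", index_col="display_name",
                    na_values=["?"])

print(naive.dtypes)
print(clean.dtypes)
print(clean)
```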

Missing values can be filled as desired, with fillna:

In [None]:
df_with_nulls.fillna(0).head()

Other fill strategies are available as well, such as back-filling each missing value from the next valid one:

In [None]:
df_with_nulls.bfill().head()

In [None]:
df_with_nulls.fillna(df_with_nulls.mean(numeric_only=True)).head()

Or dropped with dropna:

In [None]:
df_with_nulls.dropna().head()

In [None]:
df_with_nulls.dropna(how='all').head()
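
The difference between these calls is easier to see on a tiny example. This is a sketch on a hypothetical frame (`frame` and its values are invented for illustration); it also shows the `subset` argument, which restricts the check to the listed columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: r2 is partly missing, r3 is entirely missing
frame = pd.DataFrame({'a': [1.0, np.nan, np.nan],
                      'b': [2.0, 5.0, np.nan]},
                     index=['r1', 'r2', 'r3'])

any_missing = frame.dropna()              # default how='any': drops r2 and r3
all_missing = frame.dropna(how='all')     # drops only r3
subset_a = frame.dropna(subset=['a'])     # only looks at column 'a'

print(any_missing)
print(all_missing)
print(subset_a)
```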

### More indexing

The cross-section method `xs` (not a field) extracts a single column or row *by label* and returns it as a `Series`:

In [None]:
df.xs('Alabama')

In [None]:
large_states = df[df.population > 10000000]
large_states.head()     

In [None]:
arts_per_capita_2014 = df.arts_grants_2014 / df.population
arts_per_capita_2014

We can add this as a new column in the dataframe:

In [None]:
df["per_capita"] = df.arts_grants_2014 / df.population
df.head()

Operations can also be applied across columns. For example, suppose we wanted to express population as a percentage:

In [None]:
df["population_pct"] = (df.population / df.population.sum()) * 100
df["funding_pct"] = (df.arts_grants_2014 / df.arts_grants_2014.sum()) * 100
df.head()

We can also use *apply* to apply functions to each column or row of a `DataFrame`:

In [None]:
print(df.apply(np.min))
print()
print(df.apply(np.max))
print()
print(df.apply(np.mean))
print()
print(df.apply(np.median))

In [None]:
range_function = lambda x: x.max() - x.min()
df.apply(range_function)
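
By default `apply` passes each column to the function. A minimal sketch of the row-wise variant, using a small hypothetical frame (`scores` is invented for illustration) and the `axis=1` argument:

```python
import pandas as pd

# Hypothetical toy data
scores = pd.DataFrame({'x': [1, 5, 3], 'y': [4, 2, 9]},
                      index=['r1', 'r2', 'r3'])

value_range = lambda v: v.max() - v.min()

per_column = scores.apply(value_range)        # default axis=0: one result per column
per_row = scores.apply(value_range, axis=1)   # axis=1: one result per row

print(per_column)
print(per_row)
```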

## Sorting and Ranking

Pandas objects include methods for re-ordering data.

In [None]:
df.sort_index().head(3)

In [None]:
df.sort_index(ascending=False).head(3)

In [None]:
# Sorting by columns:
df.sort_index(axis=1, ascending=False).head(3)

We can also use `sort_values` to sort a `Series` by value, rather than by label.  Top 10 per-capita 2014 arts funding:

In [None]:
df.per_capita.sort_values(ascending=False).head(10)

For a `DataFrame`, we can sort according to the values of one or more columns using the `by` argument of `sort_values`.
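
A minimal sketch of the `by` argument, using a small hypothetical frame (`states`, `region` and `amount` are invented for illustration). Passing a list sorts by the first column and breaks ties with the next, and `ascending` may be given per column.

```python
import pandas as pd

# Hypothetical toy data
states = pd.DataFrame(
    {'region': ['West', 'East', 'West', 'East'],
     'amount': [30, 10, 20, 40]},
    index=['A', 'B', 'C', 'D'])

# Sort by a single column, largest first
by_amount = states.sort_values(by='amount', ascending=False)

# Sort by region (A to Z), then by amount (largest first) within each region
by_region_then_amount = states.sort_values(by=['region', 'amount'],
                                           ascending=[True, False])

print(by_amount)
print(by_region_then_amount)
```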

**Ranking** does not re-arrange data, but instead returns a `Series` of ranks, one for each value relative to the others in the Series.

In [None]:
df.per_capita.rank()

Calling the `DataFrame`'s `rank` method results in the ranks of all columns:

In [None]:
df.rank(ascending=False).head()

## Exploring data

Pandas makes it easy to slice, summarize and explore data. The `describe()` method is a good place to start:

In [None]:
df.describe()

Pandas has matplotlib plotting built in, so some simple plotting can also be useful for exploring the data:

In [None]:
df.population.hist()

In [None]:
df.arts_grants_2014.apply(np.log10).hist()

In [None]:
df.plot(kind="scatter", x="population", y="per_capita")
df_no_dc = df.drop(["District of Columbia"])
df_no_dc.plot(kind="scatter", x="population", y="per_capita")

You can quickly compute comparative statistics between columns as well, such as pairwise correlations:

In [None]:
df.corr()

## Writing Data to Files

As well as being able to read several data input formats, Pandas can also export data to a variety of storage formats. We will bring your attention to just a couple of these.

In [None]:
df.to_csv("arts_grants_extended.txt", sep="\t")

The `to_csv` method writes a `DataFrame` to a comma-separated values (csv) file. You can specify custom delimiters (via the `sep` argument), how missing values are written (via the `na_rep` argument), whether the index is written (via the `index` argument), and whether the header is included (via the `header` argument), among other options.
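
These options can be tried without touching the disk. The sketch below writes a small hypothetical frame (`frame` is invented for illustration) to an in-memory `io.StringIO` buffer, which stands in for a file.

```python
import io
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value
frame = pd.DataFrame({'a': [1.0, np.nan], 'b': ['x', 'y']},
                     index=['r1', 'r2'])

buffer = io.StringIO()   # in-memory stand-in for a file on disk

# Tab-separated, missing values written as "NULL", index column omitted
frame.to_csv(buffer, sep="\t", na_rep="NULL", index=False, header=True)

text = buffer.getvalue()
print(text)
```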

An efficient way of storing data to disk is in binary format. Pandas supports this using Python’s built-in pickle serialization. Pickle is handy for saving Python objects to files without having to write your own serialization code.

In [None]:
df.to_pickle("arts_funding.pkl")

The complement to `to_pickle` is the `read_pickle` function, which restores the pickle to a `DataFrame` or `Series`:

In [None]:
pd.read_pickle("arts_funding.pkl").head()

As Wes warns in his book, it is recommended that binary storage of data via pickle only be used as a temporary storage format, in situations where speed is relevant. This is because there is no guarantee that the pickle format will not change with future versions of Python.

Pandas has many additional capabilities - SQL-like joins; merging dataframes; groupby; concatenating multiple frames; pivoting rows and columns.
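
As a brief taste of two of these, here is a minimal sketch of a SQL-like join with `pd.merge` and of stacking frames with `pd.concat`. The two frames and their values are hypothetical, invented for illustration.

```python
import pandas as pd

# Hypothetical toy data
grants = pd.DataFrame({'state': ['CA', 'NY', 'CA'],
                       'amount': [100, 200, 50]})
populations = pd.DataFrame({'state': ['CA', 'NY'],
                            'population': [39, 20]})

# Left join: every grant row keeps its place and gains the matching population
merged = pd.merge(grants, populations, on='state', how='left')

# Concatenate two frames row-wise, renumbering the index from 0
stacked = pd.concat([grants, grants], ignore_index=True)

print(merged)
print(stacked)
```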

## Data aggregation and GroupBy operations

One of the most powerful features of Pandas is its **GroupBy** functionality. On occasion we may want to perform operations on *groups* of observations within a dataset. For example:

* **aggregation**, such as computing the sum or mean of each group, which involves applying a function to each group and returning the aggregated results
* **slicing** the DataFrame into groups and then doing something with the resulting slices (*e.g.* plotting)
* group-wise **transformation**, such as standardization/normalization

In [None]:
data = np.load("zips_and_revenues.npy")
data = data.T
dfz = pd.DataFrame(data,index=data[:,0], columns=["zip_code", "revenue"])
dfz.head()

In [None]:
revenue_grouped = dfz.groupby("zip_code")

This *grouped* dataset is hard to visualize, because displaying it does not show the groups themselves:

In [None]:
revenue_grouped

A common data analysis procedure is the **split-apply-combine** operation, which groups subsets of data together, applies a function to each of the groups, then recombines them into a new data table.

For example, we may want to aggregate our data with some function.

![split-apply-combine](http://f.cl.ly/items/0s0Z252j0X0c3k3P1M47/Screen%20Shot%202013-06-02%20at%203.04.04%20PM.png)

<div align="right">*(figure taken from "Python for Data Analysis", p.251)*</div>
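
The three steps can be spelled out by hand. This is a sketch on a small hypothetical frame (`toy` and its values are invented, loosely mirroring the zip code and revenue columns used here); it compares the manual loop with the equivalent one-line `groupby` call.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({'zip_code': [10001, 10001, 10002, 10002, 10002],
                    'revenue': [10.0, 30.0, 5.0, 10.0, 15.0]})

# Split into one sub-frame per zip code, apply the mean to each,
# then combine the results into a single Series
pieces = {}
for zip_code, group in toy.groupby('zip_code'):
    pieces[zip_code] = group['revenue'].mean()
manual = pd.Series(pieces)

# The same computation in one step
builtin = toy.groupby('zip_code')['revenue'].mean()

print(manual)
print(builtin)
```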

We can aggregate in Pandas using the `aggregate` (or `agg`, for short) method:

In [None]:
revenue_grouped.agg("mean").head()

Some aggregation functions are so common that Pandas has a convenience method for them, such as `mean`:

In [None]:
revenue_grouped.mean().head()

The `add_prefix` and `add_suffix` methods can be used to give the columns of the resulting table labels that reflect the transformation:

In [None]:
revenue_grouped.mean().add_suffix('_mean').head()

Alternatively, we can **transform** the data, using a function of our choice with the `transform` method:

In [None]:
normalize = lambda x: (x - x.mean())/x.std()

revenue_grouped.transform(normalize).head()

We can also collect the groups into a `dict` keyed by group label:

In [None]:
chunks = dict(list(revenue_grouped))
chunks[10003]
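
Building the whole `dict` is not always necessary. As a minimal sketch on a small hypothetical frame (`toy` is invented for illustration), `get_group` retrieves a single group, and iterating over the grouped object yields `(key, sub-frame)` pairs.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({'zip_code': [10001, 10001, 10002],
                    'revenue': [10.0, 30.0, 5.0]})
grouped = toy.groupby('zip_code')

# Retrieve one group by its key
one_group = grouped.get_group(10001)

# Iterate over (key, sub-frame) pairs, here counting the rows in each group
sizes = {key: len(group) for key, group in grouped}

print(one_group)
print(sizes)
```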

This just scratches the surface - see the Wes McKinney book for a more in-depth introduction.