# What is pandas?

pandas is one of the most popular open source data exploration libraries currently available. It gives its users the power to explore, query, transform, aggregate, and visualize **tabular** data. Tabular refers to data that is two-dimensional, consisting of rows and columns. Commonly, we refer to this organized structure of data as a **table**. pandas is the tool that we will use to analyze data in nearly every chapter of this book.

## pandas is built directly on numpy

numpy ('numerical Python') is the most popular third-party Python library for scientific computing and forms the foundation for dozens of others, including pandas. numpy's primary data structure is an n-dimensional array, which supports far more numerical operations than a Python list and performs them much faster.

By default, most of the data in pandas is stored in numpy arrays. That said, it isn't necessary to know much about numpy when learning pandas. You can think of pandas as a higher-level, easier-to-use interface for doing data analysis than numpy. It is a good idea to eventually learn numpy, but for most data analysis tasks, pandas will be the right tool.
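As a minimal sketch of this relationship, the example below builds a small DataFrame from made-up values (the column names `height` and `weight` are hypothetical) and retrieves the numpy array behind one of its columns with the `to_numpy` method.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not one of the datasets used in this chapter
df = pd.DataFrame({'height': [1.5, 1.8, 1.7],
                   'weight': [60.0, 82.5, 75.0]})

# to_numpy returns the values of the column as a numpy array
heights = df['height'].to_numpy()
print(type(heights))
print(heights.dtype)

# numpy functions can also be applied directly to pandas objects
root_weight = np.sqrt(df['weight'])
print(root_weight)
```

Applying a numpy function to a pandas column returns a pandas Series, so the row labels are kept.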

## pandas operates on tabular data

There are numerous formats for data, such as XML, JSON, CSV, Parquet, raw bytes, and many others. pandas can read many of these formats and always converts the data to tabular form. pandas is built specifically for analyzing this rectangular, deceptively simple kind of data. It is not a suitable library for handling data in more than two dimensions. Its focus is strictly on data that is one- or two-dimensional.

### The DataFrame and Series

The DataFrame and Series are the two primary pandas objects that we use throughout this book.
* **DataFrame** - A two-dimensional data structure that looks like any other rectangular table of data with rows and columns.
* **Series** - A single dimension of data. It is analogous to a single column of data or a one-dimensional array.
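
The two objects described above can be sketched with a small made-up table. The column names `name` and `score` are hypothetical. Selecting a single column from a DataFrame returns a Series.

```python
import pandas as pd

# Hypothetical toy data: a two-dimensional DataFrame with three rows and two columns
df = pd.DataFrame({'name': ['a', 'b', 'c'],
                   'score': [90, 85, 70]})

# Selecting one column returns a one-dimensional Series
col = df['score']

print(type(df).__name__, df.shape)
print(type(col).__name__, col.shape)
```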

## pandas examples

The rest of this chapter contains examples of common data analysis tasks with pandas. There are one or two examples from each of the following major areas of the library:
* Reading data
* Filtering data
* Aggregating methods
* Non-aggregating methods
* Aggregating within groups
* Cleaning data
* Joining data
* Time series analysis
* Visualization


## Reading data

Multiple datasets are used during the rest of this chapter. The `read_csv` function is able to read in data stored in plain text that is separated by a delimiter. By default, the delimiter is a comma. Below, we read in public bike usage data from the city of Chicago into a pandas DataFrame named `bikes`.

In [1]:
import pandas as pd

bikes = pd.read_csv('input/bikes.csv')
bikes.head(3)

Unnamed: 0,gender,starttime,stoptime,tripduration,from_station_name,start_capacity,to_station_name,end_capacity,temperature,wind_speed,events
0,Male,6/28/2013 19:01,6/28/2013 19:17,993,Lake Shore Dr & Monroe St,11,Michigan Ave & Oak St,15,73.9,12.7,mostlycloudy
1,Male,6/28/2013 22:53,6/28/2013 23:03,623,Clinton St & Washington Blvd,31,Wells St & Walton St,19,69.1,6.9,partlycloudy
2,Male,6/30/2013 14:43,6/30/2013 15:01,1040,Sheffield Ave & Kingsbury St,15,Dearborn St & Monroe St,23,73.0,16.1,mostlycloudy
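
The delimiter can be changed with the `sep` parameter of `read_csv`. The sketch below is self-contained: instead of a file on disk, it reads a short made-up string of semicolon-separated text through `io.StringIO`. The values are hypothetical and only mimic two columns of the bikes data.

```python
import io
import pandas as pd

# Hypothetical semicolon-delimited text standing in for a file on disk
text = "gender;tripduration\nMale;993\nFemale;623\n"

# sep tells read_csv which character separates the values
df = pd.read_csv(io.StringIO(text), sep=';')
print(df)
```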


## Filtering data

pandas can filter the rows of a DataFrame based on whether the values in that row meet a condition. For instance, we can select only the rides that had a `tripduration` greater than 5,000 (seconds).

### Single Condition

This example is a single condition that gets tested for each row. Only the rows that meet this condition are returned.

In [2]:
filt = bikes['tripduration'] > 5000
bikes[filt].head(3)

Unnamed: 0,gender,starttime,stoptime,tripduration,from_station_name,start_capacity,to_station_name,end_capacity,temperature,wind_speed,events
18,Male,7/9/2013 13:12,7/9/2013 14:42,5396,Canal St & Jackson Blvd,35,Millennium Park,35,79.0,13.8,cloudy
40,Female,7/14/2013 14:08,7/14/2013 15:53,6274,Wabash Ave & Roosevelt Rd,19,Lake Shore Dr & Monroe St,11,87.1,8.1,partlycloudy
77,Female,7/21/2013 11:35,7/21/2013 13:54,8299,State St & 19th St,15,Sheffield Ave & Kingsbury St,15,82.9,5.8,mostlycloudy


### Multiple Conditions

We can test multiple conditions for each row. The following example returns rides where the rider is female **and** the `tripduration` is greater than 5,000.

In [3]:
filt1 = bikes['tripduration'] > 5000
filt2 = bikes['gender'] == 'Female'
filt = filt1 & filt2
bikes[filt].head(3)

Unnamed: 0,gender,starttime,stoptime,tripduration,from_station_name,start_capacity,to_station_name,end_capacity,temperature,wind_speed,events
40,Female,7/14/2013 14:08,7/14/2013 15:53,6274,Wabash Ave & Roosevelt Rd,19,Lake Shore Dr & Monroe St,11,87.1,8.1,partlycloudy
77,Female,7/21/2013 11:35,7/21/2013 13:54,8299,State St & 19th St,15,Sheffield Ave & Kingsbury St,15,82.9,5.8,mostlycloudy
1954,Female,12/28/2013 11:37,12/28/2013 13:34,7050,LaSalle St & Washington St,15,Theater on the Lake,15,44.1,12.7,clear


The next example has multiple conditions but only requires that one of the conditions is true. It returns all the rows where either the rider is female **or** the `tripduration` is greater than 5,000.

In [4]:
filt = filt1 | filt2
bikes[filt].head(3)

Unnamed: 0,gender,starttime,stoptime,tripduration,from_station_name,start_capacity,to_station_name,end_capacity,temperature,wind_speed,events
9,Female,7/4/2013 15:00,7/4/2013 15:16,922,Lakeview Ave & Fullerton Pkwy,19,Racine Ave & Congress Pkwy,19,81.0,12.7,mostlycloudy
14,Female,7/6/2013 12:39,7/6/2013 12:49,610,Morgan St & Lake St,15,Aberdeen St & Jackson Blvd,15,82.0,5.8,mostlycloudy
18,Male,7/9/2013 13:12,7/9/2013 14:42,5396,Canal St & Jackson Blvd,35,Millennium Park,35,79.0,13.8,cloudy
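
The conditions can also be written inline without intermediate variables. The following sketch uses a made-up four-row DataFrame rather than the bikes data. When written inline, each comparison needs its own parentheses because `&` and `|` bind more tightly than `>` and `==` in Python.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the bikes dataset
rides = pd.DataFrame({'gender': ['Male', 'Female', 'Female', 'Male'],
                      'tripduration': [993, 6274, 610, 5396]})

# and: both conditions must be true for a row to be kept
both = rides[(rides['tripduration'] > 5000) & (rides['gender'] == 'Female')]

# or: at least one condition must be true for a row to be kept
either = rides[(rides['tripduration'] > 5000) | (rides['gender'] == 'Female')]

print(len(both), len(either))
```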


### Using the `query` method

The `query` method provides an alternative and often more readable way to filter data than the above. All three filtering examples from above may be duplicated with `query`. A string representing the condition is passed to the `query` method to filter the data.

In [5]:
bikes.query('tripduration > 5000').head(3)

Unnamed: 0,gender,starttime,stoptime,tripduration,from_station_name,start_capacity,to_station_name,end_capacity,temperature,wind_speed,events
18,Male,7/9/2013 13:12,7/9/2013 14:42,5396,Canal St & Jackson Blvd,35,Millennium Park,35,79.0,13.8,cloudy
40,Female,7/14/2013 14:08,7/14/2013 15:53,6274,Wabash Ave & Roosevelt Rd,19,Lake Shore Dr & Monroe St,11,87.1,8.1,partlycloudy
77,Female,7/21/2013 11:35,7/21/2013 13:54,8299,State St & 19th St,15,Sheffield Ave & Kingsbury St,15,82.9,5.8,mostlycloudy


In [6]:
bikes.query('tripduration > 5000 and gender=="Female"').head(3)

Unnamed: 0,gender,starttime,stoptime,tripduration,from_station_name,start_capacity,to_station_name,end_capacity,temperature,wind_speed,events
40,Female,7/14/2013 14:08,7/14/2013 15:53,6274,Wabash Ave & Roosevelt Rd,19,Lake Shore Dr & Monroe St,11,87.1,8.1,partlycloudy
77,Female,7/21/2013 11:35,7/21/2013 13:54,8299,State St & 19th St,15,Sheffield Ave & Kingsbury St,15,82.9,5.8,mostlycloudy
1954,Female,12/28/2013 11:37,12/28/2013 13:34,7050,LaSalle St & Washington St,15,Theater on the Lake,15,44.1,12.7,clear


In [7]:
bikes.query('tripduration > 5000 or gender=="Female"').head(3)

Unnamed: 0,gender,starttime,stoptime,tripduration,from_station_name,start_capacity,to_station_name,end_capacity,temperature,wind_speed,events
9,Female,7/4/2013 15:00,7/4/2013 15:16,922,Lakeview Ave & Fullerton Pkwy,19,Racine Ave & Congress Pkwy,19,81.0,12.7,mostlycloudy
14,Female,7/6/2013 12:39,7/6/2013 12:49,610,Morgan St & Lake St,15,Aberdeen St & Jackson Blvd,15,82.0,5.8,mostlycloudy
18,Male,7/9/2013 13:12,7/9/2013 14:42,5396,Canal St & Jackson Blvd,35,Millennium Park,35,79.0,13.8,cloudy
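
As a rough check that the two approaches agree, the sketch below filters a made-up DataFrame both ways and compares the results. It also shows that a Python variable (here the hypothetical `min_duration`) can be referenced inside the query string by prefixing it with `@`.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the bikes dataset
rides = pd.DataFrame({'gender': ['Male', 'Female', 'Female', 'Male'],
                      'tripduration': [993, 6274, 610, 5396]})

min_duration = 5000  # hypothetical threshold stored in a Python variable

# Boolean selection with explicit conditions
by_mask = rides[(rides['tripduration'] > min_duration) & (rides['gender'] == 'Female')]

# The same filter as a query string; @ refers to the Python variable
by_query = rides.query('tripduration > @min_duration and gender == "Female"')

print(by_mask.equals(by_query))
```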


## Aggregating methods

An **aggregation** summarizes a sequence of values with a **single** number. For example, `sum`, `mean`, `median`, `max`, and `min` are all aggregation methods. By default, calling these methods on a pandas DataFrame applies the aggregation to each column. Below, we use a dataset containing San Francisco employee compensation information. Only a subset of the columns is initially read into the DataFrame.

In [8]:
cols = ['salaries', 'overtime', 'other salaries', 'retirement', 'health and dental']
sf_emp = pd.read_csv('input/sf_employee_compensation.csv', usecols=cols)
sf_emp.head(3)

Unnamed: 0,salaries,overtime,other salaries,retirement,health and dental
0,71414.01,0.0,0.0,14038.58,12918.24
1,67941.06,0.0,0.0,13030.23,10047.52
2,116956.72,59975.43,19037.3,24796.44,15788.97


Calling the `mean` method returns the mean of each column.

In [9]:
sf_emp.mean()

salaries             63689.182149
overtime              4369.738196
other salaries        3909.401825
retirement           12131.039922
health and dental     9154.667163
dtype: float64

pandas allows you to aggregate rows as well. The `axis` parameter may be used to change the direction of the aggregation. Setting `axis=1` sums the values across the columns of each row, which returns the total compensation for each employee.

In [10]:
sf_emp.sum(axis=1).head(3)

0     98370.83
1     91018.81
2    236554.86
dtype: float64
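
The effect of the `axis` parameter is easier to see on a tiny table. The sketch below uses two made-up rows with hypothetical values. The default aggregation produces one value per column, while `axis=1` produces one value per row.

```python
import pandas as pd

# Hypothetical toy data with two of the column names from above
pay = pd.DataFrame({'salaries': [50000.0, 70000.0],
                    'overtime': [1000.0, 3000.0]})

# Default direction: one mean per column
col_means = pay.mean()

# axis=1: one sum per row
row_totals = pay.sum(axis=1)

print(col_means)
print(row_totals)
```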

## Non-aggregating methods

Some methods perform a calculation on the DataFrame without aggregating the data and usually preserve the shape of the DataFrame. For example, the `round` method rounds each number to a given decimal place. A negative number of decimals rounds to the left of the decimal point, so passing `-3` rounds each value in the DataFrame to the nearest thousand.

In [11]:
sf_emp.round(-3).head(3)

Unnamed: 0,salaries,overtime,other salaries,retirement,health and dental
0,71000.0,0.0,0.0,14000.0,13000.0
1,68000.0,0.0,0.0,13000.0,10000.0
2,117000.0,60000.0,19000.0,25000.0,16000.0
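
A quick way to confirm that a method is non-aggregating is to compare the shape before and after calling it. This is a minimal sketch on a two-row DataFrame whose values are copied from the output above.

```python
import pandas as pd

# Two rows of values taken from the output shown above
pay = pd.DataFrame({'salaries': [71414.01, 67941.06],
                    'overtime': [0.0, 59975.43]})

# round(-3) rounds to the nearest thousand and keeps every row and column
rounded = pay.round(-3)

print(rounded)
print(rounded.shape == pay.shape)
```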


## Aggregating within groups

Above, we performed aggregations on the entire DataFrame. We can instead perform aggregations within groups of the data. Below we use an insurance dataset.

In [12]:
ins = pd.read_csv('input/insurance.csv')
ins.head(3)


One of the simplest aggregations is the frequency of occurrence of all the unique values within a single column. This is performed below with the `value_counts` method.

### Frequency of unique values in a single column

Here, we count the occurrence of each individual `region`.

In [None]:
ins['region'].value_counts()
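
Since the behavior of `value_counts` does not depend on the dataset, it can be sketched on a short made-up Series. The region labels below are hypothetical. The result is sorted from the most frequent value to the least frequent.

```python
import pandas as pd

# Hypothetical region labels standing in for the `region` column
region = pd.Series(['southwest', 'southeast', 'southeast',
                    'northwest', 'southeast', 'northwest'], name='region')

# Count how many times each unique value occurs
counts = region.value_counts()
print(counts)
```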

### Single aggregation function

Let's say we wish to find the mean charges for each of the unique values in the `sex` column. The `groupby` method creates groups based on the given grouping column before applying the aggregation. In this example, we return the mean charges for each sex.

In [None]:
ins.groupby('sex').agg(mean_charges=('charges', 'mean')).round(-3)

### Multiple aggregation functions

pandas allows us to perform multiple aggregations at the same time. Below, we calculate the mean and max of the `charges` column as well as count the number of non-missing values.

In [None]:
ins.groupby('sex').agg(mean_charges=('charges', 'mean'),
                       max_charges=('charges', 'max'),
                       count_charges=('charges', 'count')).round(0)

### Multiple grouping columns

pandas allows us to form groups based on multiple columns. In the example below, each unique combination of `sex` and `region` forms a group. For each of these groups, the same aggregations as above are performed on the `charges` column.

In [None]:
ins.groupby(['sex', 'region']).agg(mean_charges=('charges', 'mean'),
                                   max_charges=('charges', 'max'),
                                   count_charges=('charges', 'count')).round(0)
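
The grouping logic can be traced by hand on a small made-up table. The sketch below uses five hypothetical rows with the same column names as the insurance data. Each keyword argument to `agg` names an output column and pairs it with the column to aggregate and the aggregation to apply.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the insurance dataset
toy = pd.DataFrame({
    'sex': ['female', 'female', 'male', 'male', 'male'],
    'region': ['north', 'south', 'north', 'north', 'south'],
    'charges': [100.0, 300.0, 200.0, 400.0, 600.0],
})

# One row in the result per unique (sex, region) combination
summary = toy.groupby(['sex', 'region']).agg(
    mean_charges=('charges', 'mean'),
    max_charges=('charges', 'max'),
    count_charges=('charges', 'count'),
)
print(summary)
```

The two male riders in the north have charges of 200 and 400, so that group has a mean of 300, a max of 400, and a count of 2.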

### Pivot Tables

We can reproduce the mean charges from above in a different shape with the `pivot_table` method. It groups and aggregates the same way as `groupby`, but places the unique values of one of the grouping columns as the new columns in the resulting DataFrame. Notice that pivot tables make for easier comparisons across groups.

In [None]:
pt = ins.pivot_table(index='sex', columns='region', 
                     values='charges', aggfunc='mean').round(0)
pt
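
To illustrate the relationship between the two methods, the sketch below computes the same group means on a made-up table with both `pivot_table` and `groupby`. The `unstack` call, which is not used elsewhere in this chapter, moves the `region` level of the grouped result into the columns.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the insurance dataset
toy = pd.DataFrame({
    'sex': ['female', 'female', 'male', 'male', 'male'],
    'region': ['north', 'south', 'north', 'north', 'south'],
    'charges': [100.0, 300.0, 200.0, 400.0, 600.0],
})

# Pivot table: sex in the rows, region in the columns, mean charges in the cells
pt_toy = toy.pivot_table(index='sex', columns='region',
                         values='charges', aggfunc='mean')

# Same numbers from groupby, reshaped so that region becomes the columns
gb_toy = toy.groupby(['sex', 'region'])['charges'].mean().unstack('region')

print(pt_toy)
print((pt_toy.to_numpy() == gb_toy.to_numpy()).all())
```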

### Styling DataFrames

pandas enables you to style DataFrames in various ways to provide emphasis on particular cells. Below, the maximum value of each column is highlighted, a comma is added to separate the digits, and decimals are removed.

In [None]:
pt.style.highlight_max().format(r'{:,.0f}')

## Cleaning data

Many datasets need to be cleaned before they can be analyzed. pandas provides many tools to prepare data for further analysis.

### Options in the `read_csv` function

Below, we read in a new dataset on plane crashes. Notice all the question marks. They represent missing values, but pandas will read them in as strings.

In [None]:
pc = pd.read_csv('input/planecrashinfo.csv')
pc.head(3)

The `read_csv` function has dozens of options to help read in messy data. One of the options allows you to convert a particular string to missing values. Notice that all of the question marks are now labeled as `NaN` (not a number).

In [None]:
pc = pd.read_csv('input/planecrashinfo.csv', na_values='?')
pc.head(3)
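
The effect of `na_values` can be sketched without the plane crash file. The example below reads a short made-up string twice, once with the default settings and once with question marks converted to missing values. The column names `flight` and `fatalities` are hypothetical.

```python
import io
import pandas as pd

# Hypothetical CSV text where a question mark marks a missing value
text = "flight,fatalities\nX1,1\nX2,?\n"

# Default: the column is read as strings because of the question mark
raw = pd.read_csv(io.StringIO(text))

# With na_values, the question mark becomes NaN and the column is numeric
clean = pd.read_csv(io.StringIO(text), na_values='?')

print(raw['fatalities'].tolist())
print(clean['fatalities'].tolist())
```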

### String manipulation

Often, there is data trapped within a string column that you will need to extract. The `aboard` column appears to have three distinct pieces of information: the total number of people on board, the number of passengers, and the number of crew.

In [None]:
aboard = pc['aboard']
aboard.head()

pandas has special functionality for manipulating strings. Below, we use a regular expression to extract the pertinent numbers from the `aboard` column.

In [None]:
aboard.str.extract(r'(\d+)?\D*(\d+)?\D*(\d+)?').head()
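
To see how the regular expression works, the sketch below applies it to a few made-up strings. The exact format of the strings is an assumption about what the `aboard` column looks like, and the output column names are hypothetical. Each pair of parentheses in the pattern captures one run of digits, and `\D*` skips over the non-digit characters in between.

```python
import pandas as pd

# Hypothetical strings; the format is an assumption about the `aboard` column
sample = pd.Series(['2   (passengers:1  crew:1)',
                    '44   (passengers:40  crew:4)',
                    None])

# Each capture group becomes one column of the result
parts = sample.str.extract(r'(\d+)?\D*(\d+)?\D*(\d+)?')
parts.columns = ['total', 'passengers', 'crew']  # hypothetical column names
print(parts)
```

The extracted values are still strings, so they would need to be converted to numbers before doing arithmetic with them.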

### Reshaping into tidy form

Occasionally, you will have several columns of data that all belong in a single column. Take a look at the DataFrame below on the average arrival delay of airlines at different airports. All of the columns with three-letter airport codes could be placed in the same column as they all contain the arrival delay which has the same units.

In [None]:
aad = pd.read_csv('input/average_arrival_delay.csv').head()
aad

The `melt` method stacks columns one on top of the other. Here, it places all of the three-letter airport code columns into a single column. The first two airports (ATL and DEN) are shown below in the new tidy DataFrame.

In [None]:
aad.melt(id_vars='airline', var_name='airport', value_name='delay').head(10)
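
The stacking performed by `melt` can be followed on a two-row table. The airline codes and delay values below are made up for illustration.

```python
import pandas as pd

# Hypothetical wide table: one column per airport
wide = pd.DataFrame({'airline': ['AA', 'DL'],
                     'ATL': [4.0, 0.0],
                     'DEN': [9.0, 7.0]})

# The airport columns are stacked into a single `delay` column,
# and their names are stored in a new `airport` column
tidy = wide.melt(id_vars='airline', var_name='airport', value_name='delay')
print(tidy)
```

The two rows and two airport columns of the wide table become four rows in the tidy table, one for each airline and airport pair.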

## Joining Data

pandas can join multiple DataFrames together by matching values in one or more columns. If you are familiar with SQL, then pandas performs joins in a similar fashion. Below, we make a connection to a database and read in two of its tables.

In [None]:
# from sqlalchemy import create_engine
# engine = create_engine('sqlite:///../data/databases/neurIPS.db')
# authors = pd.read_sql('Authors', engine)
# pa = pd.read_sql('PaperAuthors', engine)

Output the first three rows of each DataFrame.

In [None]:
# authors.head(3)

In [None]:
# pa.head(3)

We can now join these tables together using the `merge` method. The `AuthorId` column from the `pa` table is aligned with the `Id` column of the `authors` table.

In [None]:
# pa.merge(authors, how='left', left_on='AuthorId', right_on='Id').head(3)
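
Because the database may not be available, here is a minimal sketch of the same left join on two small made-up DataFrames. Only the `Id` and `AuthorId` column names come from the text above. The `Name` and `PaperId` columns and all of the values are hypothetical.

```python
import pandas as pd

# Hypothetical stand-ins for the two database tables
authors = pd.DataFrame({'Id': [1, 2],
                        'Name': ['Author A', 'Author B']})
pa = pd.DataFrame({'PaperId': [10, 10, 11],
                   'AuthorId': [1, 2, 3]})

# Left join: every row of `pa` is kept, matched on AuthorId == Id
joined = pa.merge(authors, how='left', left_on='AuthorId', right_on='Id')
print(joined)
```

The last row of `pa` has an `AuthorId` with no match in `authors`, so a left join keeps the row and fills the columns from `authors` with missing values.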

## Time Series Analysis

One of the original purposes of pandas was to do time series analysis. Below, we read in 20 years of Microsoft's closing stock price data.

In [None]:
msft = pd.read_csv('input/msft20.csv', parse_dates=['date'], index_col='date')
msft.head()

### Select a period of time

pandas allows us to easily select a period of time. Below, we select all of the trading data from February 27, 2017 through March 2, 2017.

In [None]:
msft['2017-02-27':'2017-03-02']
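
The same kind of selection can be sketched on synthetic data. The example below builds eight consecutive days of made-up values and selects a range of dates with `.loc`, which is an alternative spelling of the row slice shown above. Both endpoints are included in the result.

```python
import pandas as pd

# Hypothetical daily data from February 24 through March 3, 2017
idx = pd.date_range('2017-02-24', periods=8, freq='D')
prices = pd.DataFrame({'close': list(range(8))}, index=idx)

# Select February 27 through March 2, inclusive on both ends
window = prices.loc['2017-02-27':'2017-03-02']
print(window)
```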

### Group by time

We can group by some length of time. Here, we group together every month of trading data and return the average closing price of that month.

In [None]:
msft_mc = msft.resample('M').agg({'close':'mean'})
msft_mc.head(3)
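
Monthly grouping can be verified by hand on a few made-up rows that span two months. Note one difference from the cell above: this sketch uses the `'MS'` (month start) alias, so each group is labeled with the first day of its month rather than the last. The alias was chosen here only because it is accepted across pandas versions.

```python
import pandas as pd

# Hypothetical data: January 30, January 31, February 1, February 2 of 2020
idx = pd.date_range('2020-01-30', periods=4, freq='D')
prices = pd.DataFrame({'close': [10.0, 20.0, 30.0, 50.0]}, index=idx)

# Group the rows by calendar month and average the closing price
monthly = prices.resample('MS').agg({'close': 'mean'})
print(monthly)
```

The two January values average to 15 and the two February values average to 40.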