# Pandas

Pandas brings data frames to Python, similar to those in R, for easy data wrangling.

_Note_: Most of the examples here are from [Jake VanderPlas's GitHub repository](https://github.com/jakevdp/OsloWorkshop2014). Jake is one of the main contributors to SciPy and scikit-learn.

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

## Series

A Series is essentially a one-dimensional array with a label for each element. You can also think of it as an ordered dictionary that maps labels to values.
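
Both views can be seen side by side in a minimal sketch. The series `temps` and its city labels are made up for illustration and are not part of the original examples.

```python
import pandas as pd

# Hypothetical data: labels are city names, values are temperatures.
temps = pd.Series([21.5, 18.0, 25.2], index=['Oslo', 'Bergen', 'Madrid'])

# Array-like view: values are stored in a NumPy array and support
# vectorized arithmetic; the labels are carried along.
values = temps.values
warmer = temps + 1.0

# Dictionary-like view: look up by label, test membership, list the keys.
oslo = temps['Oslo']
has_bergen = 'Bergen' in temps    # membership is tested against the labels
labels = list(temps.keys())       # labels in their stored order
```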

In [2]:
s = pd.Series([0.1, 0.2, 0.3, 0.4])

In [3]:
s

0    0.1
1    0.2
2    0.3
3    0.4
dtype: float64

In [4]:
# Indices can be strings as well as numbers
s2 = pd.Series(np.arange(4), index=['a', 'b', 'c', 'd'])
s2

a    0
b    1
c    2
d    3
dtype: int64

In [5]:
pop_dict = {'California': 38332521,
            'Texas': 26448193,
            'New York': 19651127,
            'Florida': 19552860,
            'Illinois': 12882135}
populations = pd.Series(pop_dict)
populations

California    38332521
Florida       19552860
Illinois      12882135
New York      19651127
Texas         26448193
dtype: int64

In [6]:
# A Series supports lookup by label and slicing by label
print(populations['California'])
print(populations['California':'Illinois'])

38332521
California    38332521
Florida       19552860
Illinois      12882135
dtype: int64
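
Note that the label slice above includes its end label, unlike ordinary Python slices. The sketch below contrasts label-based and position-based slicing using the explicit `.loc` and `.iloc` accessors; the series `s` is a made-up example.

```python
import pandas as pd

# Hypothetical series with string labels.
s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])

by_label = s.loc['a':'c']      # label slice: the end label 'c' is included
by_position = s.iloc[0:2]      # positional slice: position 2 is excluded

print(list(by_label.index))     # ['a', 'b', 'c']
print(list(by_position.index))  # ['a', 'b']
```

In the `populations` output above, the index is in alphabetical order, which is why the slice from 'California' to 'Illinois' returns three rows. Recent pandas versions keep a dict's insertion order instead of sorting the keys, so the same slice may return different rows there.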


## Data Frames

A DataFrame is like a 2D table with row and column labels. IPython notebooks display data frames as formatted tables.

In [7]:
data = {'state': ['California', 'Texas', 'New York', 'Florida', 'Illinois'],
        'population': [38332521, 26448193, 19651127, 19552860, 12882135],
        'area':[423967, 695662, 141297, 170312, 149995]}
states = pd.DataFrame(data)
states

     area  population       state
0  423967    38332521  California
1  695662    26448193       Texas
2  141297    19651127    New York
3  170312    19552860     Florida
4  149995    12882135    Illinois


In [8]:
# Maybe one of the columns is a natural index
states = states.set_index('state')
states

              area  population
state
California  423967    38332521
Texas       695662    26448193
New York    141297    19651127
Florida     170312    19552860
Illinois    149995    12882135
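
Once a column has become the index, rows can be looked up by label. The following is a small sketch with a made-up frame `df`; the column names `name` and `score` are illustrative only.

```python
import pandas as pd

# Hypothetical frame with a column that makes a natural row label.
df = pd.DataFrame({'name': ['x', 'y', 'z'], 'score': [3, 9, 5]})

indexed = df.set_index('name')      # 'name' becomes the row labels
row = indexed.loc['y']              # one row, returned as a Series
cell = indexed.loc['z', 'score']    # a single value: row label, column label
restored = indexed.reset_index()    # moves the index back into a column
```

`set_index` returns a new frame rather than modifying the original, which is why the cell above reassigns the result to `states`.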


In [9]:
# Extracting a column returns a Series
states['area']

state
California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

You can add new columns to the data frame based on vector arithmetic of existing columns.

In [10]:
states['density'] = states['population'] / states['area']
states

              area  population     density
state
California  423967    38332521   90.413926
Texas       695662    26448193   38.018740
New York    141297    19651127  139.076746
Florida     170312    19552860  114.806121
Illinois    149995    12882135   85.883763
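
The density column above was added by assigning to a new column name, which modifies the frame in place. As an alternative sketch, `assign` builds a new frame and leaves the original untouched. The toy data and column names below are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data, unrelated to the state tables.
df = pd.DataFrame({'distance_km': [10.0, 42.0, 5.0],
                   'time_h': [1.0, 4.0, 0.5]},
                  index=['walk', 'run', 'jog'])

# Element-wise division matches rows by index label and yields a Series.
speed = df['distance_km'] / df['time_h']

# assign() returns a new frame with the extra column; df is unchanged.
with_speed = df.assign(speed_kmh=speed)
```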


In [11]:
# Rows can be filtered with a boolean mask, as with NumPy arrays
states[states['density'] > 100]

            area  population     density
state
New York  141297    19651127  139.076746
Florida   170312    19552860  114.806121
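
The filter above works because the comparison produces a boolean Series, which then selects the rows where it is True. A minimal sketch with made-up data shows the mask itself and how several conditions can be combined.

```python
import pandas as pd

# Hypothetical frame for illustration.
df = pd.DataFrame({'x': [1, 5, 8, 12], 'y': [3.0, 0.5, 7.0, 2.0]},
                  index=['p', 'q', 'r', 's'])

mask = df['x'] > 4        # boolean Series with one entry per row
big_x = df[mask]          # keeps only the rows where the mask is True

# Combine conditions with & (and), | (or), ~ (not).
# Each comparison needs its own parentheses.
both = df[(df['x'] > 4) & (df['y'] > 1.0)]
```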


In [12]:
# Sort by a particular column
states.sort_values(by='density', ascending=False)

              area  population     density
state
New York    141297    19651127  139.076746
Florida     170312    19552860  114.806121
California  423967    38332521   90.413926
Illinois    149995    12882135   85.883763
Texas       695662    26448193   38.018740


In [13]:
# Compute summary statistics for each column
states.describe()

                area  population     density
count            5.0         5.0         5.0
mean        316246.6  23373370.0   93.639859
std    242437.411951   9640386.0   37.672251
min         141297.0  12882140.0   38.018740
25%         149995.0  19552860.0   85.883763
50%         170312.0  19651130.0   90.413926
75%         423967.0  26448190.0  114.806121
max         695662.0  38332520.0  139.076746
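
When only one or two of these statistics are needed, they can be computed directly. The sketch below uses a made-up frame `df` to show per-column aggregates and a small custom summary.

```python
import pandas as pd

# Hypothetical numeric frame for illustration.
df = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0],
                   'b': [10.0, 20.0, 30.0, 40.0]})

col_means = df.mean()               # Series indexed by column name
summary = df.agg(['min', 'max'])    # DataFrame with one row per statistic
median_a = df['a'].median()         # one column reduced to a single number
```

Each aggregate works column by column, so `col_means['a']` here is the mean of column `a` alone.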
