<DIV ALIGN=CENTER>

# Introduction to Pandas
## Professor Robert J. Brunner
  
</DIV>  
-----
-----

## Introduction

An early criticism of the Python language from many in the data science
arena was its lack of useful data structures for performing data
analysis tasks. This stemmed in part from comparisons between the R
language and Python, since R has a built-in _DataFrame_ object that
greatly simplifies many data analysis tasks. This deficiency was
addressed in 2008 by Wes McKinney with the creation of [Pandas][1] (the
name was originally derived from the term _panel data_). To quote the
Pandas documentation:

>Python has long been great for data munging and preparation, but less
>so for data analysis and modeling. pandas helps fill this gap, enabling
>you to carry out your entire data analysis workflow in Python without
>having to switch to a more domain specific language like R.

Pandas introduces several new data structures like the `Series`,
`DataFrame`, and `Panel` that build on top of existing
tools like `numpy` to speed up data analysis tasks. Pandas also provides
efficient mechanisms for moving data between in-memory representations
and different data formats, including CSV and text files, JSON files, SQL
databases, HDF5 format files, and even Excel spreadsheets. In addition,
Pandas supports dealing with missing or incomplete data and
aggregating or grouping data.
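
As a minimal sketch of these capabilities, the following example reads a small CSV text from memory, counts missing values, groups the rows, and round-trips the data through JSON. The column names and values are made up for illustration and do not come from the data files used later in this Notebook.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV text; the empty field becomes a missing value (NaN).
csv_text = "state,airport,elevation\nTX,alpha,307\nTX,beta,\nIL,gamma,781\n"

# Read the CSV text into a DataFrame.
frame = pd.read_csv(io.StringIO(csv_text))

# Count the missing values in each column.
missing_counts = frame.isna().sum()

# Group rows by state and average the elevation; NaN values are skipped.
mean_elevation = frame.groupby("state")["elevation"].mean()

# Write the DataFrame to JSON text and read it back.
json_text = frame.to_json(orient="records")
round_trip = pd.read_json(io.StringIO(json_text))

print(missing_counts)
print(mean_elevation)
print(round_trip)
```

We wrap the text in `io.StringIO` so that the example runs without any files on disk; the same calls accept a file path instead.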

-----
[1]: http://pandas.pydata.org

## Brief introduction to Pandas

Before using Pandas, we must first import the Pandas library:

    import pandas as pd

Second, we simply create and use the appropriate Pandas data structure.
The two most important data structures for typical data science tasks
are the `Series` and the `DataFrame`:

1. `Series`: a one-dimensional labeled array that can hold any data type
such as integers, floating-point numbers, strings, or Python objects. A
`Series` has both a data column and a label column called the _index_.

2. `DataFrame`: a two-dimensional labeled data structure with columns
that can hold different data types, similar to a spreadsheet or
relational database table.
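
To make these two definitions concrete, here is a small sketch that builds one of each. The labels and values are invented purely for illustration.

```python
import pandas as pd

# A Series pairs each value with a label from its index.
prices = pd.Series([1.5, 2.25, 3.0], index=["apple", "pear", "plum"])

# A DataFrame holds columns of different types that share one index.
table = pd.DataFrame(
    {
        "count": [3, 5, 2],
        "organic": [True, False, True],
        "price": prices.values,
    },
    index=prices.index,
)

# Look up a single value by its label.
print(prices["pear"])

# Each column keeps its own data type.
print(table.dtypes)

# Each column of a DataFrame is itself a Series.
print(type(table["count"]))
```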

Pandas also provides a date/time data structure sometimes referred to as
a `TimeSeries` and a three-dimensional data structure known as a
`Panel`. Note that the `Panel` was deprecated and has been removed from
recent Pandas releases.

### `Series`

A `Series` is useful for holding data that should be accessible by using a
specific label. You create a `Series` by passing in an appropriate data
set along with an optional index:

    values = pd.Series(data, index=idx)

The index varies depending on the type of data passed in to create the
`Series`:

- if data is a NumPy array, the index should be the same length as the
data array. If no index is provided one will be created that enables
integer access that mirrors a traditional NumPy array (i.e., zero
indexed). 

- if data is a Python dictionary, `idx` can contain specific labels to
indicate which values in the dictionary should be used to create the
`Series`. If no index is specified, an index is created from the
dictionary keys (sorted in older Pandas releases; recent releases keep
the insertion order of the dictionary).

- if data is a scalar value, an index must be supplied. In this case, the
scalar value will be repeated to ensure that each label in the index has
a value in the `Series`.

These different options are demonstrated in the following code cells.

-----

In [1]:
import pandas as pd
import numpy as np

# We label the random values
s1 = pd.Series(np.random.rand(6), index=['q', 'w', 'e', 'r', 't', 'y'])

print(s1)

q    0.832269
w    0.631883
e    0.162581
r    0.034341
t    0.661563
y    0.062999
dtype: float64


In [2]:
d = {'q': 11, 'w': 21, 'e': 31, 'r': 41}

# We pick out the q, w, and r keys, but have an undefined y key.
s2 = pd.Series(d, index = ['q', 'w', 'r', 'y'])

print(s2)

q    11.0
w    21.0
r    41.0
y     NaN
dtype: float64


In [3]:
# We create a Series from an integer constant with explicit labels
s3 = pd.Series(42, index = ['q', 'w', 'e', 'r', 't', 'y'])

print(s3)

print('\nThe "e" value is ', s3['e'])

q    42
w    42
e    42
r    42
t    42
y    42
dtype: int64

The "e" value is  42


In [4]:
# We can slice like NumPy arrays

print(s1[:-2])

# We can also perform vectorized operations
print('\nSum Series:')
print(s1 + s3)
print('\nSeries operations:')
print(s2 * 5 - 1.2)

q    0.832269
w    0.631883
e    0.162581
r    0.034341
dtype: float64

Sum Series:
q    42.832269
w    42.631883
e    42.162581
r    42.034341
t    42.661563
y    42.062999
dtype: float64

Series operations:
q     53.8
w    103.8
r    203.8
y      NaN
dtype: float64
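
The sum of `s1` and `s3` above works cleanly because both share the same labels. As a brief sketch of what happens when the labels differ, the following example (with made-up values) shows that Pandas matches values by label rather than by position.

```python
import pandas as pd

left = pd.Series([1.0, 2.0, 3.0], index=["q", "w", "e"])
right = pd.Series([10.0, 20.0, 30.0], index=["w", "e", "r"])

# Values are matched by label, not by position.
total = left + right
print(total)

# Labels present on only one side produce NaN in the sum above;
# the add method with fill_value treats the missing side as 0 instead.
filled = left.add(right, fill_value=0)
print(filled)
```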


-----

### `DataFrame`

The second major data structure that Pandas provides is the `DataFrame`,
which is a two-dimensional array, where each column is effectively a
`Series` with a shared index. A DataFrame is a very powerful data
structure and provides a nice mapping for a number of different data
formats (and storage mechanisms). For example, you can easily read data
from a CSV file, a fixed-width format text file, a JSON file, an HTML
file, an HDF file, or a relational database into a Pandas `DataFrame`.
This is demonstrated in the next set of code cells, where we extract
data from files we created in the [Introduction to Data
Formats][df] Notebook.

-----
[df]: dataformats.ipynb

In [5]:
# Read data from CSV file, and display subset

dfa = pd.read_csv('/home/data_scientist/data/data.csv', delimiter='|', index_col='iata')

# We can grab the first five rows, and only extract the 'airport', 'city', and 'state' columns
print(dfa[['airport', 'city', 'state']].head(5))

                   airport              city state
iata                                              
00M               Thigpen        Bay Springs    MS
00R   Livingston Municipal        Livingston    TX
00V            Meadow Lake  Colorado Springs    CO
01G           Perry-Warsaw             Perry    NY
01J       Hilliard Airpark          Hilliard    FL


In [6]:
# Read data from our JSON file
dfb = pd.read_json('/home/data_scientist/data/data.json')

# Grab the last five rows
print(dfb.iloc[:, [0, 1, 5]].tail(5))

                       airport         city state
ZEF            Elkin Municipal        Elkin    NC
ZER  Schuylkill Cty/Joe Zerbey   Pottsville    PA
ZPH      Zephyrhills Municipal  Zephyrhills    FL
ZUN                 Black Rock         Zuni    NM
ZZV       Zanesville Municipal   Zanesville    OH


-----

In the previous code cells, we read data into a Pandas `DataFrame`,
first from a delimiter-separated value file and second from a JSON
file. In each code cell, we display data contained in the new
DataFrame, first by using the `head` method to display the first few
rows, and second by using the `tail` method to display the last few
rows. For the delimiter-separated value file, we explicitly specified
the delimiter, which is a vertical bar `|`; the default is to assume a
comma as the delimiter. We also explicitly specified that the `iata`
column should be used as the index column, which is how we can refer to
rows in the array.
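
The effect of the delimiter and index column options can be sketched without any files on disk. In the example below, the pipe-delimited text and its airport entries are hypothetical stand-ins for the real data file.

```python
import io

import pandas as pd

# Hypothetical pipe-delimited text standing in for a file on disk.
text = "iata|airport|state\nAAA|First Field|TX\nBBB|Second Field|IL\n"

# sep (or its alias, delimiter) overrides the default comma;
# index_col selects the column whose values label the rows.
airports = pd.read_csv(io.StringIO(text), sep="|", index_col="iata")

print(airports.head(1))

# Rows can now be looked up by their iata label.
print(airports.loc["BBB", "airport"])
```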

We also explicitly select columns for display in both code cells. In the
first code cell, we name the columns explicitly, passing a list of the
column names to the DataFrame. In the second code cell, we instead pass
a list of integer column positions. Selecting columns by position works
whether or not the columns have been assigned names, although recent
Pandas releases require the `iloc` indexer for this.
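
The two styles of column selection can be compared side by side. This is a minimal sketch on a made-up DataFrame, using the `iloc` indexer for positional selection.

```python
import pandas as pd

# Hypothetical data with the same kind of columns as the airport files.
frame = pd.DataFrame(
    {
        "airport": ["First Field", "Second Field"],
        "city": ["Firstville", "Secondtown"],
        "lat": [30.0, 41.5],
        "state": ["TX", "IL"],
    }
)

# Select columns by name with a list of labels.
by_name = frame[["airport", "state"]]

# Select the same columns by integer position with iloc.
by_position = frame.iloc[:, [0, 3]]

print(by_name)
print(by_position)
```

Both selections return the same two columns; selecting by name is usually easier to read, while selecting by position helps when the column names are unknown or unwieldy.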

Pandas includes a tremendous amount of functionality, especially for
the `DataFrame`; to learn more, view the [detailed documentation][pdd].
Several useful functions are demonstrated below, including
information summaries, slicing, and column operations on DataFrames.

-----

[pdd]: http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dataframe

In [7]:
# Let's look at the data types of each column
dfa.dtypes

airport     object
city        object
state       object
country     object
lat        float64
long       float64
dtype: object

In [8]:
# We can get a summary of numerical information in the dataframe

dfa.describe()

               lat         long
count  3376.000000  3376.000000
mean     40.036524   -98.621205
std       8.329559    22.869458
min       7.367222  -176.646031
25%      34.688427  -108.761121
50%      39.434449   -93.599425
75%      43.372612   -84.137519
max      71.285448   145.621384


In [9]:
# The data types are also inferred when reading the JSON data (note the different column order)
dfb.dtypes

airport     object
city        object
country     object
lat        float64
long       float64
state       object
dtype: object

In [10]:
# The describe method summarizes the numerical columns (lat and long) of dfb.

dfb.describe()

               lat         long
count  3376.000000  3376.000000
mean     40.036524   -98.621205
std       8.329559    22.869458
min       7.367222  -176.646031
25%      34.688427  -108.761121
50%      39.434449   -93.599425
75%      43.372612   -84.137519
max      71.285448   145.621384


In [11]:
# We can slice out rows using the indicated index for dfa

print(dfa.loc[['00V', '11R', '12C']])

                 airport              city state country        lat  \
iata                                                                  
00V          Meadow Lake  Colorado Springs    CO     USA  38.945749   
11R    Brenham Municipal           Brenham    TX     USA  30.219000   
12C   Rochelle Municipal          Rochelle    IL     USA  41.893001   

            long  
iata              
00V  -104.569893  
11R   -96.374278  
12C   -89.078290  


In [12]:
# We can slice out rows using the row index for dfb
print(dfb[100:105])

                airport      city country        lat       long state
11R   Brenham Municipal   Brenham     USA  30.219000 -96.374278    TX
12C  Rochelle Municipal  Rochelle     USA  41.893001 -89.078290    IL
12D     Tower Municipal     Tower     USA  47.818333 -92.291667    MN
12J   Brewton Municipal   Brewton     USA  31.051263 -87.067968    AL
12K  Superior Municipal  Superior     USA  40.046361 -98.060111    NE


In [13]:
# We can also select rows based on boolean tests on columns
print(dfa[(dfa.lat > 48) & (dfa.long < -170)])

       airport      city state country        lat        long
iata                                                         
ADK       Adak      Adak    AK     USA  51.877964 -176.646031
AKA       Atka      Atka    AK     USA  52.220348 -174.206350
GAM    Gambell   Gambell    AK     USA  63.766766 -171.732824
SNP   St. Paul  St. Paul    AK     USA  57.167333 -170.220444
SVA   Savoonga  Savoonga    AK     USA  63.686394 -170.492636
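
The three row selection styles used in the preceding cells can be summarized in one short sketch. The labels and coordinates below are made up so that the result of each selection is easy to verify by eye.

```python
import pandas as pd

points = pd.DataFrame(
    {"lat": [10.0, 50.0, 60.0], "long": [-100.0, -175.0, -90.0]},
    index=["a", "b", "c"],
)

# loc selects rows by index label.
by_label = points.loc[["a", "c"]]

# iloc selects rows by integer position (the end of the slice is excluded).
by_position = points.iloc[1:3]

# A boolean mask keeps the rows where the condition is True. Each
# comparison needs its own parentheses because & binds more tightly
# than > and <.
mask = (points["lat"] > 48) & (points["long"] < -170)
by_mask = points[mask]

print(by_label)
print(by_position)
print(by_mask)
```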


-----

We can also perform numerical operations on a `DataFrame`, just as was the
case with NumPy arrays. To demonstrate this, we create a numerical
DataFrame, apply different operations, and view the results.

-----

In [14]:
df = pd.DataFrame(np.random.randn(5, 6))

print(df)

          0         1         2         3         4         5
0  1.495289 -0.885577  0.722717 -0.321483 -0.937753 -0.434917
1  0.098376  0.116121 -1.171592 -0.182107 -0.581043  0.329274
2 -0.784655  0.135645 -0.107842 -0.425287 -1.116514 -2.077581
3 -0.396700  0.158741 -0.223637 -0.045535 -1.240279 -0.057504
4  0.549085 -0.975027 -0.843541 -0.512297  1.703240 -2.444427


In [15]:
# We can operate with basic scalar values

df + 2.5

          0         1         2         3         4         5
0  3.995289  1.614423  3.222717  2.178517  1.562247  2.065083
1  2.598376  2.616121  1.328408  2.317893  1.918957  2.829274
2  1.715345  2.635645  2.392158  2.074713  1.383486  0.422419
3  2.103300  2.658741  2.276363  2.454465  1.259721  2.442496
4  3.049085  1.524973  1.656459  1.987703  4.203240  0.055573


In [16]:
# And perform more complex scalar operations

-1.0 * df + 3.5

          0         1         2         3         4         5
0  2.004711  4.385577  2.777283  3.821483  4.437753  3.934917
1  3.401624  3.383879  4.671592  3.682107  4.081043  3.170726
2  4.284655  3.364355  3.607842  3.925287  4.616514  5.577581
3  3.896700  3.341259  3.723637  3.545535  4.740279  3.557504
4  2.950915  4.475027  4.343541  4.012297  1.796760  5.944427


In [17]:
# We can also apply vectorized functions

np.sin(df)

          0         1         2         3         4         5
0  0.997151 -0.774280  0.661425 -0.315974 -0.806231 -0.421335
1  0.098217  0.115860 -0.921371 -0.181102 -0.548896  0.323356
2 -0.706581  0.135230 -0.107633 -0.412582 -0.898576 -0.874309
3 -0.386377  0.158075 -0.221777 -0.045519 -0.945874 -0.057473
4  0.521907 -0.827717 -0.747002 -0.490181  0.991242 -0.642047


In [18]:
# We can transpose the dataframe

df.T

          0         1         2         3         4
0  1.495289  0.098376 -0.784655 -0.396700  0.549085
1 -0.885577  0.116121  0.135645  0.158741 -0.975027
2  0.722717 -1.171592 -0.107842 -0.223637 -0.843541
3 -0.321483 -0.182107 -0.425287 -0.045535 -0.512297
4 -0.937753 -0.581043 -1.116514 -1.240279  1.703240
5 -0.434917  0.329274 -2.077581 -0.057504 -2.444427
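
Beyond the element-wise operations shown in the cells above, we can also combine columns and reduce along an axis. The following is a small sketch with fixed, made-up values so that the results are easy to check.

```python
import pandas as pd

# A small frame with fixed values instead of random numbers.
frame = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})

# Arithmetic between columns is element-wise and aligned on the row index.
frame["total"] = frame["a"] + frame["b"]

# Reductions run down each column by default (axis=0).
column_means = frame.mean()

# With axis=1 the reduction runs across each row instead.
row_sums = frame[["a", "b"]].sum(axis=1)

print(frame)
print(column_means)
print(row_sums)
```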


-----

The above description merely scratches the surface of what you can do
with a Pandas `Series` or a `DataFrame`. The best way to learn how to
effectively use these data structures is to just do it!

-----

### Additional References

1. [Pandas Documentation][pdd]
2. A slightly dated Pandas [tutorial][pdt]

-----

[pdd]: http://pandas.pydata.org/pandas-docs/stable/index.html
[pdt]: http://www.gregreda.com/2013/10/26/intro-to-pandas-data-structures/