This is a Jupyter notebook. The notebook consists of a list of cells. This cell is a markdown cell, which contains text to document what is going on.

The cells below with In[] labels next to them are code cells. You can type Python code into a code cell and hit shift-enter to evaluate the code in the Python interpreter.

If you would like to follow along through this example you should evaluate each code cell in turn below.

To get started, we need to load the pandas module.

In [1]:
import pandas as pd

The pandas module includes two important data structures, the Series and the DataFrame.

A Series is a one-dimensional sequence of data items together with an index for those items. You can think of the index as a set of labels attached to the data items.

In [2]:
ser1 = pd.Series([5,6,4,5],index=['a','b','c','d'])

One convenient feature of the Jupyter notebook is that you can enter the name of any variable as the last line of a cell and hit shift-enter to evaluate the cell. The notebook then displays the contents of the variable.

In [3]:
ser1

a    5
b    6
c    4
d    5
dtype: int64
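
As a small sketch of how the labels work, the cell below looks up values by label and inspects the index and the data separately. It re-creates ser1 so that it runs on its own; the names score_b, subset, labels, and values are illustrative only.

```python
import pandas as pd

# Re-create the series from above so this cell is self-contained.
ser1 = pd.Series([5, 6, 4, 5], index=['a', 'b', 'c', 'd'])

# Look up a single value by its label.
score_b = ser1['b']

# Look up several values at once by passing a list of labels.
subset = ser1[['a', 'd']]

# The labels and the data can also be inspected separately.
labels = list(ser1.index)
values = ser1.to_numpy()

print(score_b)
print(subset)
print(labels, values)
```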

Here is a second data series. Note that this series has a slightly different set of labels.

You can think of these two series as scores for a set of tests administered to a group of subjects. In the first round of testing participants a, b, c, and d took part. In the second round the participants were b, c, d, and e.

In [4]:
ser2 = pd.Series([7,6,5,5],index=['b','c','d','e'])
ser2

b    7
c    6
d    5
e    5
dtype: int64

Next we demonstrate a vector operation on a pair of series. To combine two series we set up a mathematical expression involving the two series. In this example we want to compute the average score for each participant across the two rounds of testing.

In any operation involving two or more data series, pandas starts by forming the union of the label sets involved. This determines the label set for the result. Next, pandas aligns each series to that union, filling in entries for labels that appear in the union but not in that series' index. The original series are not modified; the alignment happens on temporary copies. For example, ser1 does not have a label 'e' in its index, so pandas treats ser1 as if it had an entry for 'e'. For numeric series like the ones we are working with, the value assigned to such a label is np.nan, the numpy 'Not a Number' value. In any arithmetic operation where one or more of the operands is np.nan, the result will also be np.nan.
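
The alignment step described above normally happens behind the scenes, but it can be made visible with the Series.align method. The following is a minimal sketch; the names left and right are chosen here for illustration.

```python
import pandas as pd

# Re-create the two series so this cell is self-contained.
ser1 = pd.Series([5, 6, 4, 5], index=['a', 'b', 'c', 'd'])
ser2 = pd.Series([7, 6, 5, 5], index=['b', 'c', 'd', 'e'])

# align() returns copies of both series reindexed to the union of
# their labels (an outer join is the default). Labels that were
# missing from a series are filled with NaN.
left, right = ser1.align(ser2)

print(left)   # ser1 extended with an 'e' entry
print(right)  # ser2 extended with an 'a' entry
```

Because NaN is a floating-point value, both aligned copies are converted from integers to floats.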

In [5]:
avg = (ser1+ser2)/2
avg

a    NaN
b    6.5
c    5.0
d    5.0
e    NaN
dtype: float64

As you can see, labels 'a' and 'e' end up with NaN values, since label 'a' did not appear in the second series and 'e' did not appear in the first.
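
If NaN is not the result you want for participants who missed a round, there are a couple of options. The sketch below shows two of them, under the assumption that a participant with only one score should either keep that score as the average or be dropped altogether. The names both, avg_available, and avg_both are illustrative.

```python
import pandas as pd

# Re-create the two series so this cell is self-contained.
ser1 = pd.Series([5, 6, 4, 5], index=['a', 'b', 'c', 'd'])
ser2 = pd.Series([7, 6, 5, 5], index=['b', 'c', 'd', 'e'])

# Option 1: place the series side by side as columns and average
# across each row. mean() skips NaN values by default, so 'a' and
# 'e' are averaged over the single score they have.
both = pd.concat([ser1, ser2], axis=1)
avg_available = both.mean(axis=1)

# Option 2: compute the average as before, then drop the
# participants who did not take part in both rounds.
avg_both = ((ser1 + ser2) / 2).dropna()

print(avg_available)
print(avg_both)
```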

The second major data structure in pandas is the DataFrame. A DataFrame is a two-dimensional data structure that uses an index to label its rows and a set of names to label its columns.

There are many ways to construct a DataFrame in pandas. One of the simplest is to build a DataFrame from a set of Series, giving each Series a column label. The index for the DataFrame will be the union of the Series indices.

To do this we pass a dictionary to the DataFrame constructor. The keys in the dictionary are the names of the columns we want and the values are the Series.

In [6]:
df1 = pd.DataFrame({'one':ser1,'two':ser2})
df1

   one  two
a  5.0  NaN
b  6.0  7.0
c  4.0  6.0
d  5.0  5.0
e  NaN  5.0
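
As one example of the other construction methods, the sketch below builds the same table from plain Python lists and an explicit index. With this approach the missing entries have to be written out by hand as np.nan. The name df_alt is illustrative.

```python
import numpy as np
import pandas as pd

# Re-create df1 from the two series so this cell is self-contained.
ser1 = pd.Series([5, 6, 4, 5], index=['a', 'b', 'c', 'd'])
ser2 = pd.Series([7, 6, 5, 5], index=['b', 'c', 'd', 'e'])
df1 = pd.DataFrame({'one': ser1, 'two': ser2})

# Build the same table from lists. Every list must have one entry
# per row label, so the gaps are filled in explicitly with np.nan.
df_alt = pd.DataFrame(
    {'one': [5, 6, 4, 5, np.nan], 'two': [np.nan, 7, 6, 5, 5]},
    index=['a', 'b', 'c', 'd', 'e'],
)

# equals() treats NaN values in matching positions as equal.
print(df_alt.equals(df1))
```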


You can select individual columns in a DataFrame using an array-like notation with the name of the column you want. This will return the Series that sits in that column.

In [7]:
df1['one']

a    5.0
b    6.0
c    4.0
d    5.0
e    NaN
Name: one, dtype: float64
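
Square brackets select columns. To select rows or individual cells by label, pandas provides the .loc indexer. A brief sketch of both, with illustrative variable names:

```python
import pandas as pd

# Re-create df1 so this cell is self-contained.
ser1 = pd.Series([5, 6, 4, 5], index=['a', 'b', 'c', 'd'])
ser2 = pd.Series([7, 6, 5, 5], index=['b', 'c', 'd', 'e'])
df1 = pd.DataFrame({'one': ser1, 'two': ser2})

# Passing a list of column names returns a DataFrame, not a Series.
both_cols = df1[['one', 'two']]

# .loc with a row label returns that row as a Series whose index
# is made up of the column names.
row_b = df1.loc['b']

# .loc with a row label and a column name returns a single value.
cell = df1.loc['b', 'two']

print(both_cols)
print(row_b)
print(cell)
```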

One of the more common tasks when working with a DataFrame is selecting a subset of the rows in the DataFrame. The key to doing this is the concept of a Boolean Series, which is a data series whose entries are True/False values.

One way to make a Boolean Series is to construct a boolean expression involving one of the columns. For example, the expression below compares the values in the 'one' column against the number 4. Any comparison against NaN evaluates to False, so the missing entry for 'e' produces False.

In [8]:
df1['one'] > 4

a     True
b     True
c    False
d     True
e    False
Name: one, dtype: bool
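
Comparisons are not the only source of Boolean Series. As a sketch, the cell below uses two Series methods that also return True/False values: isna, which flags missing entries, and isin, which tests membership in a list of values. The names missing and fives are illustrative.

```python
import pandas as pd

# Re-create df1 so this cell is self-contained.
ser1 = pd.Series([5, 6, 4, 5], index=['a', 'b', 'c', 'd'])
ser2 = pd.Series([7, 6, 5, 5], index=['b', 'c', 'd', 'e'])
df1 = pd.DataFrame({'one': ser1, 'two': ser2})

# True where the 'one' column has no value.
missing = df1['one'].isna()

# True where the 'one' column holds one of the listed values.
fives = df1['one'].isin([5])

print(missing)
print(fives)
```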

You can pass a Boolean Series to an array index expression. This will return a version of the DataFrame containing just the rows where the Boolean Series contains a True value.

In [9]:
df1[df1['one']>4]

   one  two
a  5.0  NaN
b  6.0  7.0
d  5.0  5.0
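
Boolean Series can also be combined before they are used to select rows. The sketch below uses the element-wise operators & (and) and ~ (not). Each comparison needs its own parentheses because these operators bind more tightly than > in Python. The variable names are illustrative.

```python
import pandas as pd

# Re-create df1 so this cell is self-contained.
ser1 = pd.Series([5, 6, 4, 5], index=['a', 'b', 'c', 'd'])
ser2 = pd.Series([7, 6, 5, 5], index=['b', 'c', 'd', 'e'])
df1 = pd.DataFrame({'one': ser1, 'two': ser2})

# Rows where both scores are above 4. Rows with a NaN in either
# column are excluded, since comparisons against NaN are False.
both_high = df1[(df1['one'] > 4) & (df1['two'] > 4)]

# Rows where the first score is NOT above 4. This includes row 'e',
# whose missing first score compared as False.
not_high = df1[~(df1['one'] > 4)]

# .loc accepts a Boolean Series for the rows and a column name,
# returning just that column for the selected rows.
two_for_high_one = df1.loc[df1['one'] > 4, 'two']

print(both_high)
print(not_high)
print(two_for_high_one)
```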
