# Getting Started with pandas


pandas will be the primary library of interest throughout much of the rest of the book.
It contains high-level data structures and manipulation tools designed to make data
analysis fast and easy in Python. pandas is built on top of NumPy and makes it easy to
use in NumPy-centric applications.

As a bit of background, I started building pandas in early 2008 during my tenure at
AQR, a quantitative investment management firm. At the time, I had a distinct set of
requirements that were not well-addressed by any single tool at my disposal:

* Data structures with labeled axes supporting automatic or explicit data alignment. This prevents common errors resulting from misaligned data and working with differently-indexed data coming from different sources.

* Integrated time series functionality.

* The same data structures handle both time series data and non-time series data.

* Arithmetic operations and reductions (like summing across an axis) that pass on the metadata (axis labels).

* Flexible handling of missing data.

* Merge and other relational operations found in popular databases (SQL-based, for example).

I wanted to be able to do all of these things in one place, preferably in a language well-suited
to general-purpose software development. Python was a good candidate language
for this, but at that time there was not an integrated set of data structures and
tools providing this functionality.

Over the last four years, pandas has matured into a quite large library capable of solving
a much broader set of data handling problems than I ever anticipated, but it has expanded
in its scope without compromising the simplicity and ease-of-use that I desired
from the very beginning. I hope that after reading this book, you will find it to be just
as much of an indispensable tool as I do.

Throughout the rest of the book, I use the following import conventions for pandas:

In [20]:
from pandas import Series, DataFrame
import pandas as pd
import numpy as np

Thus, whenever you see pd. in code, it’s referring to pandas. Series and DataFrame are
used so much that I find it easier to import them into the local namespace.

# Introduction to pandas Data Structures

To get started with pandas, you will need to get comfortable with its two workhorse
data structures: Series and DataFrame. While they are not a universal solution for every
problem, they provide a solid, easy-to-use basis for most applications.

## Series
A Series is a one-dimensional array-like object containing an array of data (of any
NumPy data type) and an associated array of data labels, called its index. The simplest
Series is formed from only an array of data:

In [2]:
obj = Series([4, 7, -5, 3])

In [3]:
obj

0    4
1    7
2   -5
3    3
dtype: int64

The string representation of a Series displayed interactively shows the index on the left
and the values on the right. Since we did not specify an index for the data, a default
one consisting of the integers 0 through N - 1 (where N is the length of the data) is
created. You can get the array representation and index object of the Series via its values
and index attributes, respectively:

In [4]:
obj.values

array([ 4,  7, -5,  3])

In [5]:
obj.index

RangeIndex(start=0, stop=4, step=1)

Often it will be desirable to create a Series with an index identifying each data point:

In [8]:
obj2 = Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

In [9]:
obj2

d    4
b    7
a   -5
c    3
dtype: int64

In [10]:
obj2.index

Index(['d', 'b', 'a', 'c'], dtype='object')

Compared with a regular NumPy array, you can use values in the index when selecting
single values or a set of values:

In [12]:
obj2['a']

-5

In [13]:
obj2['d']

4

In [15]:
obj2[['c', 'a', 'd']]

c    3
a   -5
d    4
dtype: int64
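
As a minimal sketch of label-based selection (the variable names `single` and `subset` are my own, chosen for illustration), the following shows that one label yields a scalar, a list of labels yields a new Series in the requested order, and assigning through a label updates the matching element:

```python
import pandas as pd

# Illustrative data; the labels are arbitrary.
s = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

single = s['a']              # one label -> a scalar value
subset = s[['c', 'a', 'd']]  # list of labels -> new Series in that order

# Assigning through a label updates the element stored under it.
s['d'] = 6

print(single)
print(subset)
print(s)
```

Note that `subset` was built before the assignment, so it still holds the old value for `'d'`.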

NumPy array operations, such as filtering with a boolean array, scalar multiplication,
or applying math functions, will preserve the index-value link:

In [16]:
obj2

d    4
b    7
a   -5
c    3
dtype: int64

In [17]:
obj2[obj2 > 0]

d    4
b    7
c    3
dtype: int64

In [18]:
obj2 * 2

d     8
b    14
a   -10
c     6
dtype: int64

In [21]:
np.exp(obj2)

d      54.598150
b    1096.633158
a       0.006738
c      20.085537
dtype: float64
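
To make the index-value link concrete, here is a small sketch (names such as `positive` and `rooted` are illustrative assumptions) that chains a boolean filter with a NumPy function and checks that the labels survive each step:

```python
import numpy as np
import pandas as pd

s = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

positive = s[s > 0]          # filtering keeps the labels of the surviving rows
doubled = s * 2              # scalar multiplication is applied elementwise
rooted = np.sqrt(positive)   # a NumPy ufunc returns a Series, not a bare array

print(positive.index.tolist())
print(doubled['a'])
print(type(rooted).__name__)
```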

Another way to think about a Series is as a fixed-length, ordered dict, as it is a mapping
of index values to data values. It can be substituted into many functions that expect a
dict:

In [22]:
'b' in obj2

True

In [23]:
'e' in obj2

False
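
The dict analogy extends a little further than membership tests. The sketch below is only an illustration with made-up variable names; it uses `get` for lookups with a fallback, `keys` for the labels, and `to_dict` to convert back to a plain dict:

```python
import pandas as pd

s = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

has_b = 'b' in s           # membership is tested against the index labels
fallback = s.get('e', 0)   # like dict.get: returns the default for a missing label
labels = list(s.keys())    # keys() returns the index
as_dict = s.to_dict()      # back to a plain Python dict

print(has_b, fallback, labels, as_dict)
```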

Should you have data contained in a Python dict, you can create a Series from it by
passing the dict:

In [24]:
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}

In [25]:
obj3 = Series(sdata)

In [27]:
obj3

Ohio      35000
Oregon    16000
Texas     71000
Utah       5000
dtype: int64

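When only passing a dict, the index in the resulting Series will have the dict’s keys in
sorted order (recent pandas releases keep the dict’s insertion order instead). You can
also pass an explicit index to control which labels appear and in what order: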

In [28]:
states = ['California', 'Ohia', 'Oregon', 'Texas']

In [29]:
obj4 = Series(sdata, index=states)

In [30]:
obj4

California        NaN
Ohia              NaN
Oregon        16000.0
Texas         71000.0
dtype: float64

In this case, the 2 values found in sdata (for 'Oregon' and 'Texas') were placed in the
appropriate locations, but since no values for 'California' or 'Ohia' were found (note
that 'Ohia' does not match the key 'Ohio'), they appear as NaN (not a number), which is
used in pandas to mark missing or NA values. I will use the terms “missing” or “NA”
to refer to missing data. The isnull and notnull functions in pandas should be used to
detect missing data:

In [31]:
pd.isnull(obj4)

California     True
Ohia           True
Oregon        False
Texas         False
dtype: bool

In [32]:
pd.notnull(obj4)

California    False
Ohia          False
Oregon         True
Texas          True
dtype: bool

Series also has these as instance methods:

In [33]:
obj4.isnull()

California     True
Ohia           True
Oregon        False
Texas         False
dtype: bool
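
A common follow-up is counting or dropping the missing entries. The following is a sketch with made-up data (the labels and the names `n_missing` and `present` are assumptions for illustration), not a full treatment of missing data:

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}

# 'California' has no entry in sdata, so it becomes NaN.
s = pd.Series(sdata, index=['California', 'Ohio', 'Oregon', 'Texas'])

n_missing = int(s.isnull().sum())   # True counts as 1 when summed
present = s[s.notnull()]            # keep only the non-missing entries

print(n_missing)
print(present)
```

Because NaN is a floating-point value, the integer data is converted to float64 once a missing entry is introduced.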

I discuss working with missing data in more detail later in this chapter.

A critical Series feature for many applications is that it automatically aligns differently-indexed
data in arithmetic operations:

In [34]:
obj3

Ohio      35000
Oregon    16000
Texas     71000
Utah       5000
dtype: int64

In [35]:
obj4

California        NaN
Ohia              NaN
Oregon        16000.0
Texas         71000.0
dtype: float64

In [36]:
obj3 + obj4

California         NaN
Ohia               NaN
Ohio               NaN
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64
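
If NaN for non-overlapping labels is not what you want, one option is the `add` method with `fill_value`. This is a sketch on illustrative data (the Series `a` and `b` are my own), showing plain `+` next to the filled variant:

```python
import pandas as pd

a = pd.Series({'Ohio': 35000, 'Oregon': 16000, 'Texas': 71000, 'Utah': 5000})
b = pd.Series({'Oregon': 16000.0, 'Texas': 71000.0, 'California': 1000.0})

# Plain addition: the result covers the union of labels,
# with NaN wherever only one side has a value.
total = a + b

# add() with fill_value treats a label missing on one side as 0.
filled = a.add(b, fill_value=0)

print(total)
print(filled)
```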

Data alignment features are addressed as a separate topic.
Both the Series object itself and its index have a name attribute, which integrates with
other key areas of pandas functionality:

In [37]:
obj4.name = 'population'

In [38]:
obj4.index.name = 'state'

In [39]:
obj4

state
California        NaN
Ohia              NaN
Oregon        16000.0
Texas         71000.0
Name: population, dtype: float64

A Series’s index can be altered in place by assignment:

In [40]:
obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']
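
One caveat worth sketching: the new labels must match the length of the Series. The snippet below (the variable `length_error` is purely illustrative) shows a successful assignment followed by one that pandas rejects:

```python
import pandas as pd

s = pd.Series([4, 7, -5, 3])
s.index = ['Bob', 'Steve', 'Jeff', 'Ryan']   # same length: accepted

try:
    s.index = ['only', 'three', 'labels']    # wrong length: rejected
    length_error = None
except ValueError as exc:
    length_error = exc

print(s)
print(length_error)
```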

## DataFrame

A DataFrame represents a tabular, spreadsheet-like data structure containing an ordered
collection of columns, each of which can be a different value type (numeric,
string, boolean, etc.). The DataFrame has both a row and column index; it can be
thought of as a dict of Series, all sharing the same index. Compared with other
such DataFrame-like structures you may have used before (like R’s data.frame), row-oriented
and column-oriented operations in DataFrame are treated roughly symmetrically.
Under the hood, the data is stored as one or more two-dimensional blocks rather
than a list, dict, or some other collection of one-dimensional arrays. The exact details
of DataFrame’s internals are far outside the scope of this book.

Note: While DataFrame stores the data internally in a two-dimensional format,
you can easily represent much higher-dimensional data in a tabular
format using hierarchical indexing, a subject of a later section and a key
ingredient in many of the more advanced data-handling features in pandas.

There are numerous ways to construct a DataFrame, though one of the most common
is from a dict of equal-length lists or NumPy arrays:

In [42]:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}

In [43]:
frame = DataFrame(data)

The resulting DataFrame will have its index assigned automatically as with Series, and
the columns are placed in sorted order (recent pandas releases keep the dict’s insertion
order instead):

In [44]:
frame

   pop   state  year
0  1.5    Ohio  2000
1  1.7    Ohio  2001
2  3.6    Ohio  2002
3  2.4  Nevada  2001
4  2.9  Nevada  2002


If you specify a sequence of columns, the DataFrame’s columns will be exactly what
you pass:

In [81]:
DataFrame(data, columns=['year', 'state', 'pop'])

   year   state  pop
0  2000    Ohio  1.5
1  2001    Ohio  1.7
2  2002    Ohio  3.6
3  2001  Nevada  2.4
4  2002  Nevada  2.9


As with Series, if you pass a column that isn’t contained in data, it will appear with NA
values in the result:

In [82]:
frame2 = DataFrame(data,
                   columns=['year', 'state', 'pop', 'debt'],
                   index=['one', 'two', 'three', 'four', 'five'])

In [83]:
frame2

       year   state  pop debt
one    2000    Ohio  1.5  NaN
two    2001    Ohio  1.7  NaN
three  2002    Ohio  3.6  NaN
four   2001  Nevada  2.4  NaN
five   2002  Nevada  2.9  NaN


In [84]:
frame2.columns

Index(['year', 'state', 'pop', 'debt'], dtype='object')

A column in a DataFrame can be retrieved as a Series either by dict-like notation or by
attribute:

In [85]:
frame2['state']

one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
Name: state, dtype: object

In [86]:
frame2.year

one      2000
two      2001
three    2002
four     2001
five     2002
Name: year, dtype: int64

Note that the returned Series have the same index as the DataFrame, and their name
attribute has been appropriately set.
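
One practical difference between the two notations, sketched here on a small illustrative frame: attribute access only works when the column name does not collide with an existing DataFrame attribute or method. A column named `'pop'` is such a case, since `DataFrame.pop` is a method, so dict-like notation is the safer choice there:

```python
import pandas as pd

frame = pd.DataFrame({'year': [2000, 2001], 'pop': [1.5, 1.7]})

year_by_attr = frame.year    # attribute access works for 'year'
pop_by_key = frame['pop']    # dict-like access always returns the column

# frame.pop is the DataFrame.pop method, not the 'pop' column.
pop_attr_is_method = callable(frame.pop)

print(year_by_attr.name, pop_by_key.name, pop_attr_is_method)
```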
Rows can also be retrieved by position or name by a couple of methods, such as the
loc indexing attribute (much more on this later). Older pandas versions also offered
ix for this purpose, but it has since been removed:

In [87]:
frame2.loc['three']

year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object
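
As a brief sketch of the two row-selection styles (the frame and variable names below are illustrative), `loc` selects a row by its label and `iloc` by its integer position; both return the row as a Series whose index is the column labels:

```python
import pandas as pd

frame = pd.DataFrame(
    {'year': [2000, 2001, 2002],
     'state': ['Ohio', 'Ohio', 'Nevada'],
     'pop': [1.5, 1.7, 2.4]},
    index=['one', 'two', 'three'],
)

by_label = frame.loc['three']   # select the row labeled 'three'
by_position = frame.iloc[2]     # select the third row by position

print(by_label)
print(by_label.equals(by_position))
```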

Columns can be modified by assignment. For example, the empty 'debt' column could
be assigned a scalar value or an array of values:

In [94]:
frame2['debt'] = 16.5

In [95]:
frame2

       year   state  pop  debt
one    2000    Ohio  1.5  16.5
two    2001    Ohio  1.7  16.5
three  2002    Ohio  3.6  16.5
four   2001  Nevada  2.4  16.5
five   2002  Nevada  2.9  16.5


In [96]:
frame2['debt'] = np.arange(5.)

In [97]:
frame2

       year   state  pop  debt
one    2000    Ohio  1.5   0.0
two    2001    Ohio  1.7   1.0
three  2002    Ohio  3.6   2.0
four   2001  Nevada  2.4   3.0
five   2002  Nevada  2.9   4.0


When assigning lists or arrays to a column, the value’s length must match the length
of the DataFrame. If you assign a Series, it will instead be conformed exactly to the
DataFrame’s index, inserting missing values in any holes:

In [99]:
val = Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])

In [100]:
frame2['debt'] = val

In [101]:
frame2

       year   state  pop  debt
one    2000    Ohio  1.5   NaN
two    2001    Ohio  1.7  -1.2
three  2002    Ohio  3.6   NaN
four   2001  Nevada  2.4  -1.5
five   2002  Nevada  2.9  -1.7
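
The contrast between the two assignment rules can be sketched as follows (illustrative data; the name `error` is my own). A Series is aligned on labels, so labels absent from the DataFrame are dropped and holes become NaN, whereas a list of the wrong length is rejected:

```python
import pandas as pd

frame = pd.DataFrame({'pop': [1.5, 1.7, 3.6]}, index=['one', 'two', 'three'])

# Series assignment aligns on labels: 'four' is not in the frame and is
# ignored; 'one' and 'three' have no value and become NaN.
frame['debt'] = pd.Series([-1.2, -1.7], index=['two', 'four'])

# A plain list must match the number of rows exactly.
try:
    frame['bad'] = [1, 2]
    error = None
except ValueError as exc:
    error = exc

print(frame)
print(error)
```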


Assigning a column that doesn’t exist will create a new column. The del keyword will
delete columns as with a dict:

In [105]:
frame2['eastern'] = frame2.state == 'Ohio'

In [106]:
frame2

       year   state  pop  debt  eastern
one    2000    Ohio  1.5   NaN     True
two    2001    Ohio  1.7  -1.2     True
three  2002    Ohio  3.6   NaN     True
four   2001  Nevada  2.4  -1.5    False
five   2002  Nevada  2.9  -1.7    False


In [107]:
# delete column
del frame2['eastern']

In [108]:
frame2.columns

Index(['year', 'state', 'pop', 'debt'], dtype='object')

Note:
The column returned when indexing a DataFrame is a view on the underlying
data, not a copy. Thus, any in-place modifications to the Series
will be reflected in the DataFrame. The column can be explicitly copied
using the Series’s copy method.
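
A short sketch of the explicit-copy approach follows (illustrative names). Whether a column selected without `copy` shares memory with the DataFrame may depend on the pandas version, so I only demonstrate the case that is unambiguous: changes to a copy never reach the original:

```python
import pandas as pd

frame = pd.DataFrame({'pop': [1.5, 1.7, 3.6]}, index=['one', 'two', 'three'])

# An explicit copy is independent of the DataFrame's data.
pop_copy = frame['pop'].copy()
pop_copy['one'] = 99.0

print(pop_copy['one'])          # the copy was modified
print(frame.loc['one', 'pop'])  # the DataFrame was not
```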

Another common form of data is a nested dict of dicts format:

In [111]:
pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}

If passed to DataFrame, it will interpret the outer dict keys as the columns and the inner
keys as the row indices:

In [112]:
frame3 = DataFrame(pop)

In [113]:
frame3

      Nevada  Ohio
2000     NaN   1.5
2001     2.4   1.7
2002     2.9   3.6


Of course you can always transpose the result:

In [114]:
frame3.T

        2000  2001  2002
Nevada   NaN   2.4   2.9
Ohio     1.5   1.7   3.6


The keys in the inner dicts are unioned and sorted to form the index in the result. This
isn’t true if an explicit index is specified:

In [115]:
DataFrame(pop, index=[2001, 2002, 2003])

      Nevada  Ohio
2001     2.4   1.7
2002     2.9   3.6
2003     NaN   NaN
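
The nested-dict behavior can be checked programmatically. This sketch reuses the same kind of data under illustrative variable names and verifies that the inner keys are unioned by default, while an explicit index selects exactly the rows requested:

```python
import pandas as pd

pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}

# Outer keys become columns; inner keys are unioned into the row index.
frame = pd.DataFrame(pop)

# An explicit index keeps only the requested rows; 2003 has no data at all.
subset = pd.DataFrame(pop, index=[2001, 2002, 2003])

print(frame)
print(subset)
```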


Dicts of Series are treated much in the same way:

In [116]:
pdata = {'Ohio': frame3['Ohio'][:-1], 'Nevada': frame3['Nevada'][:2]}

In [117]:
DataFrame(pdata)

      Nevada  Ohio
2000     NaN   1.5
2001     2.4   1.7


For a complete list of things you can pass the DataFrame constructor, see Table 5-1.
If a DataFrame’s index and columns have their name attributes set, these will also be
displayed:

In [118]:
frame3.index.name = 'year'
frame3.columns.name = 'state'

In [119]:
frame3

state  Nevada  Ohio
year
2000      NaN   1.5
2001      2.4   1.7
2002      2.9   3.6


Like Series, the values attribute returns the data contained in the DataFrame as a 2D
ndarray:

In [120]:
frame3.values

array([[ nan,  1.5],
       [ 2.4,  1.7],
       [ 2.9,  3.6]])

If the DataFrame’s columns are different dtypes, the dtype of the values array will be
chosen to accommodate all of the columns:

In [121]:
frame2.values

array([[2000, 'Ohio', 1.5, nan],
       [2001, 'Ohio', 1.7, -1.2],
       [2002, 'Ohio', 3.6, nan],
       [2001, 'Nevada', 2.4, -1.5],
       [2002, 'Nevada', 2.9, -1.7]], dtype=object)
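
To see how the common dtype is chosen, consider this sketch with two small illustrative frames: integer and float columns combine into a float array, while mixing numbers with strings falls back to the generic object dtype:

```python
import pandas as pd

numeric = pd.DataFrame({'a': [1, 2], 'b': [1.5, 2.5]})   # int and float columns
mixed = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']})     # int and string columns

print(numeric.values.dtype)   # a floating-point dtype holds both columns
print(mixed.values.dtype)     # object is needed to hold numbers and strings
print(numeric.values.shape)
```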

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html

Table 5-1. Possible data inputs to the DataFrame constructor

| Type | Notes |
| --- | --- |
| 2D ndarray | A matrix of data, passing optional row and column labels |
| dict of arrays, lists, or tuples | Each sequence becomes a column in the DataFrame. All sequences must be the same length. |
| NumPy structured/record array | Treated as the “dict of arrays” case |
| dict of Series | Each value becomes a column. Indexes from each Series are unioned together to form the result’s row index if no explicit index is passed. |
| dict of dicts | Each inner dict becomes a column. Keys are unioned to form the row index as in the “dict of Series” case. |
| list of dicts or Series | Each item becomes a row in the DataFrame. Union of dict keys or Series indexes become the DataFrame’s column labels |
| List of lists or tuples | Treated as the “2D ndarray” case |
| Another DataFrame | The DataFrame’s indexes are used unless different ones are passed |
| NumPy MaskedArray | Like the “2D ndarray” case except masked values become NA/missing in the DataFrame result |
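
Two of the inputs from the table, sketched on made-up data (all names here are illustrative): a 2D ndarray with explicit labels, and a list of dicts, where each dict becomes a row and the union of the keys becomes the columns:

```python
import numpy as np
import pandas as pd

# 2D ndarray with optional row and column labels.
from_array = pd.DataFrame(np.arange(6).reshape(2, 3),
                          index=['r1', 'r2'],
                          columns=['a', 'b', 'c'])

# List of dicts: each dict is a row; keys missing from a row become NaN.
from_records = pd.DataFrame([{'a': 1, 'b': 2}, {'a': 3, 'c': 4}])

print(from_array)
print(from_records)
```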

## Index Objects

pandas’s Index objects are responsible for holding the axis labels and other metadata
(like the axis name or names). Any array or other sequence of labels used when constructing
a Series or DataFrame is internally converted to an Index:

In [122]:
obj = Series(range(3), index=['a', 'b', 'c'])

In [123]:
index = obj.index

In [124]:
index

Index(['a', 'b', 'c'], dtype='object')

In [125]:
index[1:]

Index(['b', 'c'], dtype='object')

NOTE:
Index objects are immutable and thus can’t be modified by the user:

In [126]:
index[1] = 'd'

TypeError: Index does not support mutable operations

Immutability is important so that Index objects can be safely shared among data
structures:

In [127]:
index = pd.Index(np.arange(3))

In [128]:
obj2 = Series([1.5, -2.5, 0], index=index)

In [129]:
obj2.index is index

True
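
Since an Index cannot be changed in place, the usual approach is to derive a new one. The following sketch (illustrative names) catches the error from item assignment and then builds a modified Index with `delete` and `insert`, leaving the original untouched:

```python
import pandas as pd

index = pd.Index(['a', 'b', 'c'])

# In-place item assignment is rejected.
try:
    index[1] = 'd'
    mutated = True
except TypeError:
    mutated = False

# delete() and insert() each return a new Index instead of modifying this one.
new_index = index.delete(1).insert(1, 'd')

print(mutated)
print(index.tolist())
print(new_index.tolist())
```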

Table 5-2 has a list of built-in Index classes in the library. With some development
effort, Index can even be subclassed to implement specialized axis indexing functionality.

Many users will not need to know much about Index objects, but they’re
nonetheless an important part of pandas’s data model.

Table 5-2. Main Index objects in pandas

| Class | Description |
| --- | --- |
| Index | The most general Index object, representing axis labels in a NumPy array of Python objects. |
| Int64Index | Specialized Index for integer values. |
| MultiIndex | “Hierarchical” index object representing multiple levels of indexing on a single axis. Can be thought of as similar to an array of tuples. |
| DatetimeIndex | Stores nanosecond timestamps (represented using NumPy’s datetime64 dtype). |
| PeriodIndex | Specialized Index for Period data (timespans). |

In addition to being array-like, an Index also functions as a fixed-size set:

In [130]:
frame3

state  Nevada  Ohio
year
2000      NaN   1.5
2001      2.4   1.7
2002      2.9   3.6


In [132]:
'Ohio' in frame3.columns

True

In [133]:
2003 in frame3.index

False

Each Index has a number of methods and properties for set logic and answering other
common questions about the data it contains. These are summarized in Table 5-3.

Table 5-3. Index methods and properties

| Method | Description |
| --- | --- |
| append | Concatenate with additional Index objects, producing a new Index |
| difference | Compute set difference as an Index (named diff in early pandas versions) |
| intersection | Compute set intersection |
| union | Compute set union |
| isin | Compute boolean array indicating whether each value is contained in the passed collection |
| delete | Compute new Index with element at index i deleted |
| drop | Compute new Index by deleting passed values |
| insert | Compute new Index by inserting element at index i |
| is_monotonic | Returns True if each element is greater than or equal to the previous element |
| is_unique | Returns True if the Index has no duplicate values |
| unique | Compute the array of unique values in the Index |
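
A few of these set-style methods in action, as a sketch on two illustrative Index objects (`left` and `right` are my own names):

```python
import pandas as pd

left = pd.Index([2000, 2001, 2002])
right = pd.Index([2001, 2002, 2003])

common = left.intersection(right)   # labels present in both
combined = left.union(right)        # labels present in either
only_left = left.difference(right)  # labels in left but not in right
mask = left.isin([2001, 2003])      # boolean array, one entry per label

print(common.tolist())
print(combined.tolist())
print(only_left.tolist())
print(mask.tolist())
```

Each call returns a new object, in keeping with the immutability of Index.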