In [1]:
import pandas as pd
import numpy as np


# Introduction to pandas Data Structures

To get started with pandas, you will need to get comfortable with its two workhorse data structures: Series and DataFrame. While they are not a universal solution for every problem, they provide a solid, easy-to-use basis for most applications.


# Series

A Series is a one-dimensional array-like object containing a sequence of values (of similar types to NumPy types) and an associated array of data labels, called its index. The simplest Series is formed from only an array of data:


In [2]:
data = pd.Series(data=[1, 2, 3])
data


0    1
1    2
2    3
dtype: int64
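
Since no index was specified, a default one consisting of the integers 0 through N - 1 was created. As a minimal sketch of the alternative, the example below passes an explicit index; the labels 'a', 'b', and 'c' are hypothetical and chosen only for illustration:

```python
import pandas as pd

# Hypothetical labels, not part of the original example
labeled = pd.Series(data=[1, 2, 3], index=['a', 'b', 'c'])

# Select a single value by its label
second = labeled['b']

# Without an explicit index, the labels are the integers 0 through N - 1
default = pd.Series(data=[1, 2, 3])
default_labels = list(default.index)
```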

Should you have data contained in a Python dict, you can create a Series from it by passing the dict:


In [3]:
data = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
serie = pd.Series(data=data)
serie


Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

When you pass only a dict, the index of the resulting Series follows the order of the dict's keys, as the output above shows (older pandas versions sorted the keys instead). You can override this by passing an index with the labels in the order you want them to appear in the resulting Series:


In [4]:
states = ['California', 'Ohio', 'Oregon', 'Texas']
serie = pd.Series(data=data, index=states)
data, states, serie


({'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000},
 ['California', 'Ohio', 'Oregon', 'Texas'],
 California        NaN
 Ohio          35000.0
 Oregon        16000.0
 Texas         71000.0
 dtype: float64)

Here, the three values found in data were placed in the appropriate locations, but since no value for 'California' was found, it appears as NaN (not a number), which pandas uses to mark missing or NA values. Since 'Utah' was not included in states, it is excluded from the resulting object.

I will use the terms “missing” and “NA” interchangeably to refer to missing data. The isnull and notnull functions in pandas should be used to detect missing data:


In [5]:
res = pd.isnull(obj=serie)
res


California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool
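
The complementary check can be sketched the same way. The example below rebuilds the Series from above, applies pd.notnull, and also uses the isnull instance method that Series provides; the variable names present, missing, and n_missing are only illustrative:

```python
import pandas as pd

data = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
states = ['California', 'Ohio', 'Oregon', 'Texas']
serie = pd.Series(data=data, index=states)

# True where a value is present
present = pd.notnull(serie)

# Series also exposes the same check as an instance method
missing = serie.isnull()

# Summing a boolean Series counts its True entries
n_missing = int(missing.sum())
```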

A useful Series feature for many applications is that it automatically aligns by index label in arithmetic operations. If you have experience with databases, you can think about this as being similar to a join operation.


In [6]:
sdata_serie = pd.Series(data=data)
res = sdata_serie + serie
sdata_serie, serie, res


(Ohio      35000
 Texas     71000
 Oregon    16000
 Utah       5000
 dtype: int64,
 California        NaN
 Ohio          35000.0
 Oregon        16000.0
 Texas         71000.0
 dtype: float64,
 California         NaN
 Ohio           70000.0
 Oregon         32000.0
 Texas         142000.0
 Utah               NaN
 dtype: float64)
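
If the NaN produced for a label that exists in only one operand is not wanted, one option is the add method with fill_value. This is a sketch of that option, not the only way to handle the case. Note that a label missing from both operands (here 'California') still yields NaN:

```python
import pandas as pd

data = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
sdata_serie = pd.Series(data=data)
serie = pd.Series(data=data, index=['California', 'Ohio', 'Oregon', 'Texas'])

# A value missing from exactly one operand is replaced by 0 before adding
total = sdata_serie.add(serie, fill_value=0)

utah_total = total['Utah']              # only present in sdata_serie
california_total = total['California']  # missing on both sides
```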

You can get the array representation and index object of the Series via its values and index attributes, respectively:


In [7]:
sdata_serie.values, sdata_serie.index


(array([35000, 71000, 16000,  5000], dtype=int64),
 Index(['Ohio', 'Texas', 'Oregon', 'Utah'], dtype='object'))

Both the Series object itself and its index have a name attribute, which integrates with other key areas of pandas functionality:


In [8]:
sdata_serie.name = 'population'
sdata_serie.index.name = 'state'
sdata_serie


state
Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
Name: population, dtype: int64
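
One place where these names show up is when a Series is converted to a DataFrame. As a small sketch, reset_index turns the index into a regular column, and both names become column labels:

```python
import pandas as pd

sdata_serie = pd.Series(
    data={'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
)
sdata_serie.name = 'population'
sdata_serie.index.name = 'state'

# The index name and the Series name become the column labels
table = sdata_serie.reset_index()
columns = list(table.columns)
```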

# DataFrame

A DataFrame represents a rectangular table of data and contains an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.). The DataFrame has both a row and column index; it can be thought of as a dict of Series all sharing the same index. Under the hood, the data is stored as one or more two-dimensional blocks rather than a list, dict, or some other collection of one-dimensional arrays. The exact details of DataFrame’s internals are outside the scope of this book.


There are many ways to construct a DataFrame, though one of the most common is from a dict of equal-length lists or NumPy arrays:


In [9]:
data = {
    'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
    'year': [2000, 2001, 2002, 2001, 2002, 2003],
    'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]
}
frame = pd.DataFrame(data=data)
frame


Unnamed: 0,state,year,pop
0,Ohio,2000,1.5
1,Ohio,2001,1.7
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9
5,Nevada,2003,3.2


In [10]:
frame.head()


Unnamed: 0,state,year,pop
0,Ohio,2000,1.5
1,Ohio,2001,1.7
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9


If you specify a sequence of columns, the DataFrame’s columns will be arranged in that order:


In [11]:
frame = pd.DataFrame(data=data, columns=['year', 'state', 'pop'])
frame


Unnamed: 0,year,state,pop
0,2000,Ohio,1.5
1,2001,Ohio,1.7
2,2002,Ohio,3.6
3,2001,Nevada,2.4
4,2002,Nevada,2.9
5,2003,Nevada,3.2

If you pass a column that is not contained in the dict, it appears with missing values in the result. You can also pass an explicit index to label the rows:


In [12]:
frame = pd.DataFrame(
    data=data,
    columns=['year', 'state', 'pop', 'debt'],
    index=['zero', 'one', 'two', 'three', 'four', 'five']
)
frame


Unnamed: 0,year,state,pop,debt
zero,2000,Ohio,1.5,
one,2001,Ohio,1.7,
two,2002,Ohio,3.6,
three,2001,Nevada,2.4,
four,2002,Nevada,2.9,
five,2003,Nevada,3.2,


In [13]:
frame['state']


zero       Ohio
one        Ohio
two        Ohio
three    Nevada
four     Nevada
five     Nevada
Name: state, dtype: object

In [14]:
frame.loc['three']


year       2001
state    Nevada
pop         2.4
debt        NaN
Name: three, dtype: object

Columns can be modified by assignment. For example, the empty debt column can be assigned a scalar value or an array of values:

In [15]:
frame['debt'] = 16.5
frame


Unnamed: 0,year,state,pop,debt
zero,2000,Ohio,1.5,16.5
one,2001,Ohio,1.7,16.5
two,2002,Ohio,3.6,16.5
three,2001,Nevada,2.4,16.5
four,2002,Nevada,2.9,16.5
five,2003,Nevada,3.2,16.5


In [16]:
frame['debt'] = np.arange(start=.0, stop=6.)
frame


Unnamed: 0,year,state,pop,debt
zero,2000,Ohio,1.5,0.0
one,2001,Ohio,1.7,1.0
two,2002,Ohio,3.6,2.0
three,2001,Nevada,2.4,3.0
four,2002,Nevada,2.9,4.0
five,2003,Nevada,3.2,5.0


When you are assigning lists or arrays to a column, the value’s length must match the length of the DataFrame. If you assign a Series, its labels will be realigned exactly to the DataFrame’s index, inserting missing values in any holes:


In [17]:
frame['debt'] = pd.Series(
    data=[-1.2, -1.5, -1.7],
    index=['two', 'four', 'five']
)
frame


Unnamed: 0,year,state,pop,debt
zero,2000,Ohio,1.5,
one,2001,Ohio,1.7,
two,2002,Ohio,3.6,-1.2
three,2001,Nevada,2.4,
four,2002,Nevada,2.9,-1.5
five,2003,Nevada,3.2,-1.7
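
Both rules can be checked on a smaller frame. In this sketch the frame contents and the labels 'ten' and 'bad' are hypothetical: a list of the wrong length raises a ValueError, while a Series label that does not exist in the frame's index is simply left out:

```python
import pandas as pd

frame = pd.DataFrame(
    data={'pop': [1.5, 1.7, 3.6]},
    index=['zero', 'one', 'two']
)

# A list must match the frame's length (3 rows); 2 values raise ValueError
try:
    frame['bad'] = [1, 2]
    length_error = None
except ValueError as exc:
    length_error = str(exc)

# 'two' matches a row label; 'ten' is not in the frame and is dropped
frame['debt'] = pd.Series(data=[-1.2, 9.9], index=['two', 'ten'])
debt = frame['debt']
```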


Assigning a column that doesn’t exist will create a new column. The del keyword will delete columns as with a dict. As an example of del, I first add a new column of boolean values where the state column equals 'Ohio':


In [18]:
frame['eastern'] = frame.state == 'Ohio'
frame


Unnamed: 0,year,state,pop,debt,eastern
zero,2000,Ohio,1.5,,True
one,2001,Ohio,1.7,,True
two,2002,Ohio,3.6,-1.2,True
three,2001,Nevada,2.4,,False
four,2002,Nevada,2.9,-1.5,False
five,2003,Nevada,3.2,-1.7,False


The del keyword can then be used to remove this column:


In [19]:
del frame['eastern']
frame


Unnamed: 0,year,state,pop,debt
zero,2000,Ohio,1.5,
one,2001,Ohio,1.7,
two,2002,Ohio,3.6,-1.2
three,2001,Nevada,2.4,
four,2002,Nevada,2.9,-1.5
five,2003,Nevada,3.2,-1.7
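
Note that del modifies the DataFrame in place. When the original should stay untouched, the drop method is an alternative that returns a new object. A minimal sketch on a small hypothetical frame:

```python
import pandas as pd

frame = pd.DataFrame(data={'state': ['Ohio', 'Nevada'], 'pop': [1.5, 2.4]})
frame['eastern'] = frame['state'] == 'Ohio'

# drop returns a new DataFrame without the column
trimmed = frame.drop(columns=['eastern'])

# The original still has the column at this point
still_there = 'eastern' in frame.columns

# del removes the column from the original object itself
del frame['eastern']
gone_after_del = 'eastern' not in frame.columns
```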


Another common form of data is a nested dict of dicts:


In [21]:
pop = {
    'Nevada': {
        2001: 2.4,
        2002: 2.9
    },
    'Ohio': {
        2000: 1.5,
        2001: 1.7,
        2002: 3.6
    }
}
frame = pd.DataFrame(data=pop)
frame


Unnamed: 0,Nevada,Ohio
2001,2.4,1.7
2002,2.9,3.6
2000,,1.5


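The keys of the outer dict become the columns, and the keys of the inner dicts are combined to form the row index. As the output above shows, the combined keys are not necessarily sorted. If an explicit index is specified, it is used instead: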


In [22]:
frame = pd.DataFrame(data=pop, index=[2001, 2002, 2003])
frame


Unnamed: 0,Nevada,Ohio
2001,2.4,1.7
2002,2.9,3.6
2003,,
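
If the row labels are wanted in ascending order without listing them by hand, sort_index is one way to get there. The sketch below also shows transposing with the T attribute, which swaps rows and columns:

```python
import pandas as pd

pop = {
    'Nevada': {2001: 2.4, 2002: 2.9},
    'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}
}
frame = pd.DataFrame(data=pop)

# Return a copy with the row labels in ascending order
ordered = frame.sort_index()
years = list(ordered.index)

# Swap rows and columns: states become rows, years become columns
transposed = frame.T
```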


Dicts of Series are treated in much the same way:


In [25]:
frame = pd.DataFrame(
    data={
        'Ohio': frame['Ohio'][:-1],
        'Nevada': frame['Nevada'][:2]
    }
)
frame


Unnamed: 0,Ohio,Nevada
2001,1.7,2.4
2002,3.6,2.9


In [31]:
frame.index.name = 'year'
frame.columns.name = 'state'
frame


state,Ohio,Nevada
year,,
2001,1.7,2.4
2002,3.6,2.9


The following table lists the kinds of data you can pass to the DataFrame constructor:

| Type                             | Notes                                                                                                                                    |
| :------------------------------- | :--------------------------------------------------------------------------------------------------------------------------------------- |
| 2D ndarray                       | A matrix of data, passing optional row and column labels                                                                                 |
| dict of arrays, lists, or tuples | Each sequence becomes a column in the DataFrame; all sequences must be the same length                                                   |
| NumPy structured/record array    | Treated as the “dict of arrays” case                                                                                                     |
| dict of Series                   | Each value becomes a column; indexes from each Series are unioned together to form the result’s row index if no explicit index is passed |
| dict of dicts                    | Each inner dict becomes a column; keys are unioned to form the row index as in the “dict of Series” case                                 |
| List of dicts or Series          | Each item becomes a row in the DataFrame; union of dict keys or Series indexes become the DataFrame’s column labels                      |
| List of lists or tuples          | Treated as the “2D ndarray” case                                                                                                         |
| Another DataFrame                | The DataFrame’s indexes are used unless different ones are passed                                                                        |
| NumPy MaskedArray                | Like the “2D ndarray” case except masked values become NA/missing in the DataFrame result                                                |
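
Two of the cases from the table that were not demonstrated earlier can be sketched as follows. All labels and values here are hypothetical and serve only as illustration:

```python
import numpy as np
import pandas as pd

# 2D ndarray with optional row and column labels
arr = np.arange(6).reshape(2, 3)
from_array = pd.DataFrame(
    data=arr,
    index=['r0', 'r1'],
    columns=['a', 'b', 'c']
)

# List of dicts: each dict becomes a row, and the union of keys
# becomes the column labels; absent keys produce missing values
records = [{'a': 1, 'b': 2}, {'b': 3, 'c': 4}]
from_records = pd.DataFrame(data=records)
```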
