In [5]:
import pandas as pd


# Introduction to pandas Data Structures

To get started with pandas, you will need to get comfortable with its two workhorse data structures: Series and DataFrame. While they are not a universal solution for every problem, they provide a solid, easy-to-use basis for most applications.


# Series

A Series is a one-dimensional array-like object containing a sequence of values (of types similar to NumPy's) and an associated array of data labels, called its index. The simplest Series is formed from only an array of data:


In [6]:
s = pd.Series(data=[1, 2, 3])
s


0    1
1    2
2    3
dtype: int64
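Since no index was specified above, pandas fell back to the default integer labels 0 through N - 1. As a minimal sketch of the "associated array of data labels" idea, the same values can be given explicit labels through the index argument. The labels 'a', 'b', 'c' and the variable names are made up for this illustration:

```python
import pandas as pd

# Hypothetical labels, chosen only for this illustration
labeled = pd.Series(data=[1, 2, 3], index=['a', 'b', 'c'])

# Without an explicit index, pandas uses the integers 0..N-1
default = pd.Series(data=[1, 2, 3])

print(labeled['b'])          # select a single value by its label
print(list(default.index))   # the default integer labels
```

With explicit labels, a value is looked up by label rather than by position, which is what the examples that follow rely on.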

Should you have data contained in a Python dict, you can create a Series from it by passing the dict:


In [7]:
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
serie = pd.Series(data=sdata)
serie


Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

When you pass only a dict, the index of the resulting Series takes the dict's keys in insertion order, as the output above shows (older pandas versions sorted the keys instead). You can override this by passing the dict keys, through the index argument, in the order you want them to appear in the resulting Series:


In [8]:
states = ['California', 'Ohio', 'Oregon', 'Texas']
serie = pd.Series(data=sdata, index=states)
sdata, states, serie


({'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000},
 ['California', 'Ohio', 'Oregon', 'Texas'],
 California        NaN
 Ohio          35000.0
 Oregon        16000.0
 Texas         71000.0
 dtype: float64)

Here, three values found in sdata were placed in the appropriate locations, but since no value for 'California' was found, it appears as NaN (not a number), which pandas uses to mark missing or NA values. Because NaN is a floating-point value, the dtype of the result is float64 rather than int64. Since 'Utah' was not included in states, it is excluded from the resulting object.

I will use the terms “missing” and “NA” interchangeably to refer to missing data. The isnull and notnull functions in pandas (also available under the names isna and notna) should be used to detect missing data:


In [9]:
res = pd.isnull(obj=serie)
res


California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool
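For completeness, here is a short sketch of the complementary notnull function, together with the method form of isnull and a common follow-up step, dropna. The variable names present, n_missing and cleaned are illustrative only:

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
states = ['California', 'Ohio', 'Oregon', 'Texas']
serie = pd.Series(data=sdata, index=states)

present = pd.notnull(serie)        # element-wise inverse of pd.isnull
n_missing = serie.isnull().sum()   # method form; each True counts as 1
cleaned = serie.dropna()           # new Series without the NaN entries

print(present['California'], n_missing, list(cleaned.index))
```

Both the top-level functions and the Series methods return the same boolean mask, so which one to use is mostly a matter of style.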

A useful Series feature for many applications is that it automatically aligns by index label in arithmetic operations. If you have experience with databases, you can think about this as being similar to a join operation.


In [11]:
sdata_serie = pd.Series(data=sdata)
res = sdata_serie + serie
sdata_serie, serie, res


(Ohio      35000
 Texas     71000
 Oregon    16000
 Utah       5000
 dtype: int64,
 California        NaN
 Ohio          35000.0
 Oregon        16000.0
 Texas         71000.0
 dtype: float64,
 California         NaN
 Ohio           70000.0
 Oregon         32000.0
 Texas         142000.0
 Utah               NaN
 dtype: float64)
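In the result above, 'Utah' becomes NaN because it has no counterpart in serie. If that is not what you want, one option is the add method with a fill_value. The sketch below assumes that 0 is a sensible stand-in for a missing value, which depends on your data:

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
states = ['California', 'Ohio', 'Oregon', 'Texas']
sdata_serie = pd.Series(data=sdata)
serie = pd.Series(data=sdata, index=states)

# Plain + gives NaN wherever a label is missing on either side
plain = sdata_serie + serie

# fill_value=0 (an assumption for this example) substitutes 0 when a value
# is missing on one side only; if it is missing on both sides, the result
# is still NaN
filled = sdata_serie.add(serie, fill_value=0)

print(filled)
```

Here 'Utah' keeps its single available value, while 'California' stays NaN because neither operand has a value for it.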

You can get the array representation and index object of the Series via its values and index attributes, respectively:


In [13]:
sdata_serie.values, sdata_serie.index


(array([35000, 71000, 16000,  5000], dtype=int64),
 Index(['Ohio', 'Texas', 'Oregon', 'Utah'], dtype='object'))
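As a small aside, recent pandas versions also provide the to_numpy method for getting the values as a NumPy array, and an Index can be converted to a plain Python list with tolist. A minimal sketch, with arr and labels as illustrative names:

```python
import numpy as np
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
sdata_serie = pd.Series(data=sdata)

arr = sdata_serie.to_numpy()          # the values as a NumPy array
labels = sdata_serie.index.tolist()   # the index labels as a Python list

print(isinstance(arr, np.ndarray), labels)
```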

Both the Series object itself and its index have a name attribute, which integrates with other key areas of pandas functionality:


In [14]:
sdata_serie.name = 'population'
sdata_serie.index.name = 'state'
sdata_serie


state
Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
Name: population, dtype: int64
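One place where these names show up is when a Series is turned into a DataFrame. The sketch below uses reset_index, which moves the index into an ordinary column. The variable name frame is illustrative:

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
sdata_serie = pd.Series(data=sdata)
sdata_serie.name = 'population'
sdata_serie.index.name = 'state'

# The index name and the Series name become the column labels
frame = sdata_serie.reset_index()

print(list(frame.columns))
```

Without the two name assignments, the columns would get generic default labels instead.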

A Series’s index can be altered in place by assignment.
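A minimal sketch of such an assignment, using the small Series from the start of this section. The labels 'x', 'y', 'z' are hypothetical:

```python
import pandas as pd

s = pd.Series(data=[1, 2, 3])

# Hypothetical labels; the new index must have as many entries as the Series
s.index = ['x', 'y', 'z']

print(s['y'])
```

The values are untouched; only the labels attached to them change.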
