<h1 id="tocheading">Table of Contents and Notebook Setup</h1>
<div id="toc"></div>

In [1]:
%matplotlib inline

In [2]:
%%javascript
$.get('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')

To get started with pandas, one has to get comfortable with its two core data structures: <b>Series</b> and <b>DataFrame</b>.

# Series

## Introduction to Series

A Series is a one-dimensional array containing a sequence of values and an associated array of data labels (called its index). Another way to think of a Series is as a fixed-length, ordered dictionary, since it is simply a mapping of index values to data values.

The simplest series is formed from an array of data:

In [3]:
import pandas as pd
obj = pd.Series([2,3,4,5])
obj

0    2
1    3
2    4
3    5
dtype: int64
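The two halves of a Series can also be inspected separately. The following is a small sketch (the names `values` and `labels` are illustrative, not from the original notebook) that pulls out the data as a NumPy array and looks at the default index:

```python
import pandas as pd

obj = pd.Series([2, 3, 4, 5])

# The underlying data as a NumPy array.
values = obj.to_numpy()

# With no index given, pandas supplies a default integer index 0..N-1.
labels = obj.index

print(values)
print(list(labels))
```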

We can also choose our own specific indices.

In [4]:
obj2 = pd.Series([2,3,4,5], index=['a','b','c','d'])
obj2

a    2
b    3
c    4
d    5
dtype: int64

We can select values of given indices in any order:

In [5]:
obj2['c']

4

If we pass in a list of index labels instead of a single label, a Series is returned.

In [6]:
obj2[['c','b']]

c    4
b    3
dtype: int64

We can pass in a logical expression to filter through the Series and return a subset that satisfies the condition specified.

In [7]:
obj2[obj2>3]

c    4
d    5
dtype: int64
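It can help to look at the logical expression on its own. In this sketch (the names `mask` and `filtered` are illustrative), the comparison first produces a boolean Series, which is then used to select rows:

```python
import pandas as pd

obj2 = pd.Series([2, 3, 4, 5], index=['a', 'b', 'c', 'd'])

# The comparison is evaluated element by element and returns a boolean Series.
mask = obj2 > 3
print(mask)

# Indexing with the mask keeps only the entries where the mask is True.
filtered = obj2[mask]
print(filtered)
```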

Operations on Series are similar to operations on other datatypes like arrays.

In [8]:
obj2*obj2

a     4
b     9
c    16
d    25
dtype: int64
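The same holds for scalar arithmetic and NumPy functions. As a brief sketch (the names `doubled` and `roots` are illustrative), both operate element-wise and keep the index labels attached to the results:

```python
import numpy as np
import pandas as pd

obj2 = pd.Series([2, 3, 4, 5], index=['a', 'b', 'c', 'd'])

# Scalar arithmetic is applied to every element.
doubled = obj2 * 2

# NumPy functions return a Series that keeps the original index.
roots = np.sqrt(obj2 * obj2)

print(doubled)
print(roots)
```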

We can search for keys in a Series the same way we would search for a key in a dictionary.

In [9]:
'b' in obj2

True
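As with a dictionary, the `in` operator looks at the keys (the index labels), not at the stored values. A short sketch of the difference, with illustrative variable names:

```python
import pandas as pd

obj2 = pd.Series([2, 3, 4, 5], index=['a', 'b', 'c', 'd'])

# 'b' is an index label, so the membership test succeeds.
label_found = 'b' in obj2

# 4 is a stored value, not a label, so the same test fails.
value_as_label = 4 in obj2

# To search the values, test against the underlying array explicitly.
value_found = 4 in obj2.to_numpy()

print(label_found, value_as_label, value_found)
```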

## Creating Series, Declaring Key Order, and Adding Multiple Series Together

We can convert dictionaries to Series, since the two are closely related. When a dict is passed as the constructor argument, the Series keeps the dict's keys in insertion order (older pandas versions sorted them instead). One can override this by passing the keys in the order one wants them in the Series.

In [10]:
sdata = {'Victoria': 600000, 'Vancouver': 10000000, 'Geneva': 1000000, 'Paris': 20000000}
obj3 = pd.Series(sdata)
obj3

Victoria       600000
Vancouver    10000000
Geneva        1000000
Paris        20000000
dtype: int64

In [11]:
city_order = ['Victoria', 'Paris', 'California']
obj4 = pd.Series(sdata, city_order)
obj4

Victoria        600000.0
Paris         20000000.0
California           NaN
dtype: float64

Note that the keys are <i>not sorted</i> but follow the order given in "city_order", and that cities not listed in "city_order" are excluded from the resulting Series. In addition, since California is not a key in the dictionary "sdata", it has no value in the resulting Series (NaN stands for "not a number"). Because NaN is a floating-point value, the dtype changes from int64 to float64.

We can check what values in a Series are null using the following method.

In [12]:
pd.isnull(obj4)

Victoria      False
Paris         False
California     True
dtype: bool
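The same check is available as a method on the Series itself, together with its complement. A minimal sketch, with illustrative variable names:

```python
import pandas as pd

sdata = {'Victoria': 600000, 'Vancouver': 10000000, 'Geneva': 1000000, 'Paris': 20000000}
obj4 = pd.Series(sdata, index=['Victoria', 'Paris', 'California'])

# Method form of pd.isnull(obj4).
missing = obj4.isna()

# The complement: True where a value is present.
present = obj4.notna()

# Booleans sum as 0/1, which gives the number of missing entries.
n_missing = int(missing.sum())

print(missing)
print(n_missing)
```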

Let's add obj3 and obj4. Note that they contain different keys.

In [13]:
obj3+obj4

California           NaN
Geneva               NaN
Paris         40000000.0
Vancouver            NaN
Victoria       1200000.0
dtype: float64

If a key holds NaN in either Series, or is missing from one of them, the resulting sum is NaN. In other words, NaN + x = NaN for all x. The two Series are aligned on their index labels before the addition, which resembles an outer join in database terms.
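When NaN propagation is not wanted, the `add` method accepts a `fill_value`. The sketch below (the name `total` is illustrative) treats a value missing on one side as 0. A label that is missing on both sides, such as California here, still ends up as NaN:

```python
import pandas as pd

sdata = {'Victoria': 600000, 'Vancouver': 10000000, 'Geneva': 1000000, 'Paris': 20000000}
obj3 = pd.Series(sdata)
obj4 = pd.Series(sdata, index=['Victoria', 'Paris', 'California'])

# Substitute 0 for a value that is missing in only one of the two Series.
total = obj3.add(obj4, fill_value=0)
print(total)
```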

## Naming Keys and Values in Series

We can also give names to the Series itself and to its index:

In [14]:
obj4.name = 'population'
obj4.index.name = 'city'
obj4

city
Victoria        600000.0
Paris         20000000.0
California           NaN
Name: population, dtype: float64

# DataFrame

## Introduction to DataFrames

A <b>DataFrame</b> represents a rectangular table of data and contains an ordered collection of columns, each of which can have a different data type. The easiest way to create a DataFrame is from a dictionary of lists.

In [15]:
data ={'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
       'year': [2000,2001,2002,2001,2002,2003],
       'pop': [1.2, 1.7, 1.8, 1.4, 1.8, 1.9]}
frame = pd.DataFrame(data)
frame

    state  year  pop
0    Ohio  2000  1.2
1    Ohio  2001  1.7
2    Ohio  2002  1.8
3  Nevada  2001  1.4
4  Nevada  2002  1.8
5  Nevada  2003  1.9


For large DataFrames, use the head method to display only the first five rows.

In [16]:
frame.head()

    state  year  pop
0    Ohio  2000  1.2
1    Ohio  2001  1.7
2    Ohio  2002  1.8
3  Nevada  2001  1.4
4  Nevada  2002  1.8
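The number of rows can be passed explicitly, and `tail` does the same from the other end. A short sketch, with illustrative variable names:

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.2, 1.7, 1.8, 1.4, 1.8, 1.9]}
frame = pd.DataFrame(data)

# First two rows instead of the default five.
first_two = frame.head(2)

# Last two rows.
last_two = frame.tail(2)

print(first_two)
print(last_two)
```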


If we specify a sequence of columns, the dataframe will be created in that order.

In [17]:
pd.DataFrame(data, columns=['year', 'state', 'pop'])

   year   state  pop
0  2000    Ohio  1.2
1  2001    Ohio  1.7
2  2002    Ohio  1.8
3  2001  Nevada  1.4
4  2002  Nevada  1.8
5  2003  Nevada  1.9


If you pass in a column that isn't contained in the dict, it will appear with missing values. 

## Specifying Indices of DataFrame Upon Creation

We can specify indices of a dataframe as such.

In [18]:
frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                      index=['one', 'two', 'three', 'four', 'five', 'six'])
frame2

       year   state  pop  debt
one    2000    Ohio  1.2   NaN
two    2001    Ohio  1.7   NaN
three  2002    Ohio  1.8   NaN
four   2001  Nevada  1.4   NaN
five   2002  Nevada  1.8   NaN
six    2003  Nevada  1.9   NaN


## Retrieving Rows and Columns

Columns can be retrieved as a <i>Series</i> using either dict-like notation or attribute access:

In [19]:
frame2.year

one      2000
two      2001
three    2002
four     2001
five     2002
six      2003
Name: year, dtype: int64

In [20]:
frame2['state']

one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
six      Nevada
Name: state, dtype: object
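Attribute access only works when the column name does not collide with an existing DataFrame attribute. The 'pop' column used here is a good illustration, since `pop` is also a DataFrame method. A small sketch, using a shortened version of the data and illustrative variable names:

```python
import pandas as pd

data = {'state': ['Ohio', 'Nevada'], 'year': [2000, 2001], 'pop': [1.2, 1.4]}
frame = pd.DataFrame(data)

# 'year' does not clash with anything, so attribute access returns the column.
year_is_series = isinstance(frame.year, pd.Series)

# 'pop' is also the name of a DataFrame method, so frame.pop is that method.
pop_attr_is_method = callable(frame.pop)

# Bracket notation always returns the column.
pop_column = frame['pop']

print(year_is_series, pop_attr_is_method)
print(pop_column)
```

For this reason, bracket notation is the safer default when column names are not known in advance.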

Rows can be retrieved using the <i>loc</i> attribute.

In [21]:
frame2.loc['two']

year     2001
state    Ohio
pop       1.7
debt      NaN
Name: two, dtype: object
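Rows can also be selected by integer position rather than by label, using `iloc`. In this sketch (variable names are illustrative), the row labelled 'two' is the second row, so position 1 returns the same data:

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.2, 1.7, 1.8, 1.4, 1.8, 1.9]}
frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                      index=['one', 'two', 'three', 'four', 'five', 'six'])

# Label-based selection.
by_label = frame2.loc['two']

# Position-based selection (positions start at 0).
by_position = frame2.iloc[1]

print(by_label)
print(by_position)
```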

Columns can be modified by assignment. For example...

In [22]:
a = [4,6,3,7,8,9]
frame2.debt = a
frame2

       year   state  pop  debt
one    2000    Ohio  1.2     4
two    2001    Ohio  1.7     6
three  2002    Ohio  1.8     3
four   2001  Nevada  1.4     7
five   2002  Nevada  1.8     8
six    2003  Nevada  1.9     9
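Attribute assignment works here only because the debt column already exists. A column that does not exist yet has to be created with bracket notation. The sketch below adds a hypothetical 'eastern' column (not part of the original notebook) and then removes it again:

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.2, 1.7, 1.8, 1.4, 1.8, 1.9]}
frame2 = pd.DataFrame(data, index=['one', 'two', 'three', 'four', 'five', 'six'])

# Bracket notation creates the new column from a boolean comparison.
frame2['eastern'] = frame2['state'] == 'Ohio'
eastern_values = frame2['eastern'].tolist()
print(frame2)

# del removes the column again.
del frame2['eastern']
print(list(frame2.columns))
```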


We can also use a Series to assign values to the debt column. The Series is aligned to the DataFrame's index by label, and any row whose label is absent from the Series receives NaN.

In [23]:
a = pd.Series([10,14,15], index=['two','four','five'])
frame2.debt = a
frame2

       year   state  pop  debt
one    2000    Ohio  1.2   NaN
two    2001    Ohio  1.7  10.0
three  2002    Ohio  1.8   NaN
four   2001  Nevada  1.4  14.0
five   2002  Nevada  1.8  15.0
six    2003  Nevada  1.9   NaN


## Filtering Rows with NaN

We can remove rows with NaN as follows:

In [24]:
frame2 = frame2.dropna()
frame2

      year   state  pop  debt
two   2001    Ohio  1.7  10.0
four  2001  Nevada  1.4  14.0
five  2002  Nevada  1.8  15.0
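By default `dropna` removes every row that contains at least one NaN. The `how` and `subset` parameters make this less aggressive. The sketch below uses a small made-up frame (the data and variable names are illustrative only) to compare the three behaviours:

```python
import numpy as np
import pandas as pd

# Hypothetical data with NaN scattered across two columns.
frame = pd.DataFrame({'year': [2000, 2001, 2002],
                      'pop': [1.2, np.nan, 1.8],
                      'debt': [np.nan, np.nan, 15.0]})

# Default: drop rows with any NaN.
any_missing = frame.dropna()

# Drop only rows in which every value is NaN (none here, since year is always set).
all_missing = frame.dropna(how='all')

# Consider only the 'pop' column when deciding which rows to drop.
pop_present = frame.dropna(subset=['pop'])

print(any_missing)
print(pop_present)
```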


## Alternative Ways to Create a DataFrame

One can also create a DataFrame using a nested dict of dicts. The outer dict keys become the columns and the inner keys become the rows.

In [25]:
pop = {'Nevada': {2001: 23, 2002: 45}, 'Ohio': {2000: 1.5, 2001: 67}}
frame3 = pd.DataFrame(pop)
frame3

      Nevada  Ohio
2000     NaN   1.5
2001    23.0  67.0
2002    45.0   NaN
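If the opposite orientation is wanted, with the outer keys as rows and the inner keys as columns, the frame can be transposed. A minimal sketch (the name `transposed` is illustrative):

```python
import pandas as pd

pop = {'Nevada': {2001: 23, 2002: 45}, 'Ohio': {2000: 1.5, 2001: 67}}
frame3 = pd.DataFrame(pop)

# Transposing swaps the axes: states become row labels, years become columns.
transposed = frame3.T
print(transposed)
```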


Note that in all cases, DataFrames are created by calling pd.DataFrame on another Python type (in this case a dict of dicts). Once created, however, the DataFrame is a distinct type with its own operations.

# Index Objects

## Introduction to Index Objects

Index objects hold the axis labels. We can extract the index object of a Series as follows.

In [26]:
obj = pd.Series(range(3), index = ['a', 'b', 'c'])
obj

a    0
b    1
c    2
dtype: int64

In [27]:
index = obj.index
index

Index(['a', 'b', 'c'], dtype='object')

Index objects are <i>immutable</i>, so the user cannot modify them in place. This makes it safer to share index objects among data structures.
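To see the immutability in practice, the sketch below tries to overwrite one label. pandas raises a TypeError and the index is left unchanged (the flag `modified` is an illustrative name):

```python
import pandas as pd

index = pd.Series(range(3), index=['a', 'b', 'c']).index

try:
    # Item assignment is not supported on an Index.
    index[1] = 'd'
    modified = True
except TypeError as err:
    modified = False
    print(err)

print(list(index))
```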

In [28]:
labels = pd.Index(['a', 'b', 'c'])
labels

Index(['a', 'b', 'c'], dtype='object')

Now we use labels as the index for a new Series.

In [29]:
obj2 = pd.Series([0,1,2], labels)
obj2

a    0
b    1
c    2
dtype: int64

In [30]:
obj2.index is labels

True

So obj2 uses the Index 'labels' itself as its index, not a copy. The index attribute behaves like a reference: if we create a new Index object with the same contents as 'labels' and assign it as the index of obj2, we get the following:

In [31]:
obj2.index = pd.Index(['a', 'b', 'c'])
obj2.index is labels

False
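The `is` operator tests object identity, not content. To compare what two Index objects contain, the `equals` method can be used. A short sketch with two separately constructed indices (variable names are illustrative):

```python
import pandas as pd

labels = pd.Index(['a', 'b', 'c'])
copy_of_labels = pd.Index(['a', 'b', 'c'])

# Two distinct objects in memory, so identity is False.
same_object = labels is copy_of_labels

# Same labels in the same order, so equality of contents is True.
same_contents = labels.equals(copy_of_labels)

print(same_object, same_contents)
```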

This is what gives the notion of <i> sharing </i> an index between multiple objects; they all point to the same Index object. 

In [32]:
obj1 = pd.Series([1, 4, 9], labels)
obj2 = pd.Series([7, 6, 9], labels)
obj3 = pd.Series([7, 8, 9], labels)
obj2.index is obj3.index

True

All three objects above share the same index object.

The columns of a dataframe are also an index object.

In [33]:
frame3.columns

Index(['Nevada', 'Ohio'], dtype='object')

The textbook (page 136) lists a number of further Index methods, including operations such as appending to an index.
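As a brief sketch of a few such operations (the selection here is illustrative and not taken from the textbook's list), note that each call returns a new Index and leaves the original untouched, in keeping with immutability:

```python
import pandas as pd

labels = pd.Index(['a', 'b', 'c'])

# Concatenate with another Index, producing a new Index.
extended = labels.append(pd.Index(['d']))

# Set-like operations on the labels.
common = labels.intersection(pd.Index(['b', 'c', 'z']))
only_in_labels = labels.difference(pd.Index(['b']))

print(list(extended))
print(list(common))
print(list(only_in_labels))
print(list(labels))
```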