# Pandas Data Objects

In [2]:
import numpy as np
import pandas as pd

Understanding a language's data structures is the most important part of a good programming experience. A poor understanding of data structures leads to code that is inefficient and hard to read.

These notes walk through the Pandas data structures and also point to further reading for deeper knowledge. Prior knowledge of NumPy data structures is required.

The covered Pandas data structures are:
* Series
* DataFrame

## Series

A Series is a one-dimensional structure that can hold any data type (boolean, integer, float or even Python objects).
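As a small illustration of the data types mentioned above, the sketch below builds one Series per kind of content and prints the dtype pandas infers. The variable names are made up for this example, and the exact integer width can depend on the platform.

```python
import pandas as pd

# One Series per kind of content; pandas infers the dtype from the values.
flags = pd.Series([True, False, True])
counts = pd.Series([1, 2, 3])
ratios = pd.Series([0.5, 1.5, 2.5])
mixed = pd.Series([1, 'two', [3.0]])  # arbitrary Python objects

for label, series in [('flags', flags), ('counts', counts),
                      ('ratios', ratios), ('mixed', mixed)]:
    print('%s: %s' % (label, series.dtype))
```

A Series that mixes unrelated Python objects falls back to the generic object dtype, which is flexible but slower than the numeric dtypes.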

### Series Creation

We can create Series from:
* Python dictionaries
* NumPy ndarrays
* a scalar value

The passed **index is a list of axis labels**. How it is used depends on what the data is, which gives the cases described in this section. In every case, the index provides an effective way to access the data.
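To make the role of the index concrete, here is a minimal sketch comparing a Series built without an index to one built with explicit labels. The names values, default_labels and custom_labels are hypothetical and used only for this example.

```python
import numpy as np
import pandas as pd

values = np.array([3, 20, 5])

# No index passed: pandas numbers the entries 0, 1, 2.
default_labels = pd.Series(values)

# Index passed: each value gets the label at the same position.
custom_labels = pd.Series(values, index=['a', 'b', 'c'])

print(list(default_labels.index))
print(list(custom_labels.index))
print(custom_labels['b'])
```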

**From dictionary:**

In this case, if no index is provided, the index is taken from the dictionary keys and the data from the dictionary values.

In [4]:
d = {'a':5.,'b':5.,'c':5.}
s1 = pd.Series(d)
print(s1)
s1.index

a    5
b    5
c    5
dtype: float64


Index([u'a', u'b', u'c'], dtype='object')

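Otherwise, if an index is passed, only the values whose keys appear in the index are pulled out, and index labels with no matching key are assigned NaN. A NaN value means "not assigned". Because NaN is a float, the resulting Series has dtype float64 even though the dictionary holds integers. We will have to deal with these missing values later on.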

//TODO: talk about NaN

In [16]:
d = {'a':5,'b':5,'c':5}
i = ['x','y','a','b']
s1 = pd.Series(d, index = i)
print(s1)
print(s1.dtype)
s1.index

x   NaN
y   NaN
a     5
b     5
float64


Index([x, y, a, b], dtype=object)
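The NaN entries produced above can be located, dropped or replaced. The following is only a sketch of three common options, reusing the same dictionary and index under the hypothetical name prices. Which option is appropriate depends on the analysis.

```python
import pandas as pd

prices = pd.Series({'a': 5, 'b': 5, 'c': 5}, index=['x', 'y', 'a', 'b'])

# Boolean mask marking the labels that had no value in the dict.
missing = prices.isnull()
print(missing)

# Two common ways to handle the gaps: drop them or fill them in.
print(prices.dropna())
print(prices.fillna(0))
```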

**From NumPy array:**

In [23]:
s2 = pd.Series(np.array([3,20,5]),index=['a','b','c'])
print(s2)
print(s2.dtype)
s2.index

a     3
b    20
c     5
int32


Index([a, b, c], dtype=object)

**From scalar:**

In [18]:
s3 = pd.Series(5,index=['a','b','c'])
print(s3)
print(s3.dtype)
s3.index

a    5
b    5
c    5
int32


Index([a, b, c], dtype=object)

A Series can have a **name** attribute. When a Series is taken from a DataFrame column, its name is automatically set to the column name.

In [48]:
s3 = pd.Series(5,index=['a','b','c'], name = 'Series3')
s3.name

'Series3'
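The link between Series names and DataFrame column names works in both directions. The sketch below uses made-up data and hypothetical names (height, age, people) to show it, combining the Series with pd.concat.

```python
import pandas as pd

height = pd.Series([1.62, 1.80], index=['ana', 'bob'], name='height')
age = pd.Series([31, 45], index=['ana', 'bob'], name='age')

# Concatenating along axis=1 uses each Series' name as the column label.
people = pd.concat([height, age], axis=1)
print(list(people.columns))

# Pulling a column back out gives a Series named after the column.
print(people['age'].name)
```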

### Series Accessing

A Series can be accessed by position (numerical index), by a boolean list or by key (axis label). Accessing by position is like working with NumPy ndarrays, while accessing by key is like working with dictionaries.

**Position accessing**

In [24]:
s2

a     3
b    20
c     5

In [25]:
s2.iloc[1]  # explicit positional access; plain s2[1] on a labeled index is deprecated in recent pandas

20

In [26]:
s2[1:]

b    20
c     5

**Boolean list accessing**

In [27]:
s2[[True,True,False]]

a     3
b    20

In [28]:
s2[s2>4]

b    20
c     5

**Key accessing**

In [29]:
s2[['a','b']]

a     3
b    20

In [31]:
s2['a']

3

In [30]:
'a' in s2

True
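Position and key access can also be written with the explicit .iloc and .loc accessors, which removes any ambiguity about which kind of lookup is meant. This is a minimal sketch using the same values as s2 under the hypothetical name labeled.

```python
import numpy as np
import pandas as pd

labeled = pd.Series(np.array([3, 20, 5]), index=['a', 'b', 'c'])

print(labeled.iloc[1])       # by position
print(labeled.loc['b'])      # by label
print(labeled.loc['a':'b'])  # label slices include the end label
print(labeled.iloc[0:2])     # position slices exclude the end position
```

Note the different slice semantics: both slices above select the entries labeled a and b.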

Accessing a key that does not exist raises a KeyError exception.

In [33]:
try:
    s2['z']
except KeyError:
    print("Error handled")

Error handled


To avoid errors, we can use the Series get method, which returns a default value when the key is missing.

In [35]:
s2.get('x', np.nan)

nan
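As a further sketch of how get behaves, the example below compares an existing key, a missing key without a default, and a missing key with an explicit default. The Series lookup and the result names are hypothetical.

```python
import numpy as np
import pandas as pd

lookup = pd.Series(np.array([3, 20, 5]), index=['a', 'b', 'c'])

found = lookup.get('a')         # existing key: returns the stored value
absent = lookup.get('x')        # missing key, no default: returns None
fallback = lookup.get('x', -1)  # missing key with an explicit default

print(found)
print(absent)
print(fallback)
```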

### Series Operations

Vectorized operations can be applied to pandas Series, and most NumPy functions accept Series as arguments. The result of an operation between unaligned Series has the union of the indexes involved. If a label is not found in one Series or the other, the result for that label is marked as missing (NaN).

In [39]:
s2 + 23

a    26
b    43
c    28

Operations are performed index-wise.

In [45]:
s2 + s1

a     8
b    25
c   NaN
x   NaN
y   NaN

In [46]:
(s2 + s1).dropna()

a     8
b    25
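Dropping the unmatched labels is one option. Another is to treat the missing side as a neutral value through the add method and its fill_value parameter. The sketch below uses hypothetical Series named left and right.

```python
import pandas as pd

left = pd.Series([3, 20, 5], index=['a', 'b', 'c'])
right = pd.Series([5.0, 5.0], index=['a', 'b'])

plain = left + right                    # 'c' has no partner, so it becomes NaN
filled = left.add(right, fill_value=0)  # treat the missing side as 0

print(plain)
print(filled)
```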

In [41]:
s2 ** 3

a      27
b    8000
c     125

In [42]:
np.exp(s2)

a    2.008554e+01
b    4.851652e+08
c    1.484132e+02
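NumPy functions applied to a Series return a Series with the same index rather than a bare array. A minimal sketch with np.sqrt, using a hypothetical Series named squares:

```python
import numpy as np
import pandas as pd

squares = pd.Series(np.array([4, 9, 16]), index=['a', 'b', 'c'])

# The ufunc is applied element-wise and the labels are carried over.
roots = np.sqrt(squares)

print(type(roots).__name__)
print(list(roots.index))
print(roots['b'])
```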

### Examples with Series

Before covering how to load data from different sources, here are some examples that build Series from data we download and parse ourselves.

In [14]:
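import csv
import io
import urllib.request

url = 'https://raw.githubusercontent.com/jpatokal/openflights/master/data/airports.dat'
response = urllib.request.urlopen(url)
# csv.reader expects text, so decode the byte stream as it is read
data = list(csv.reader(io.TextIOWrapper(response, encoding='utf-8')))

# data is a list of lists: one list of fields per airport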

We can create a Series directly from the data variable, which is a list of lists.

In [16]:
pd.Series(data)

0     [1, Goroka, Goroka, Papua New Guinea, GKA, AYG...
1     [2, Madang, Madang, Papua New Guinea, MAG, AYM...
2     [3, Mount Hagen, Mount Hagen, Papua New Guinea...
3     [4, Nadzab, Nadzab, Papua New Guinea, LAE, AYN...
4     [5, Port Moresby Jacksons Intl, Port Moresby, ...
5     [6, Wewak Intl, Wewak, Papua New Guinea, WWK, ...
6     [7, Narsarsuaq, Narssarssuaq, Greenland, UAK, ...
7     [8, Nuuk, Godthaab, Greenland, GOH, BGGH, 64.1...
8     [9, Sondre Stromfjord, Sondrestrom, Greenland,...
9     [10, Thule Air Base, Thule, Greenland, THU, BG...
10    [11, Akureyri, Akureyri, Iceland, AEY, BIAR, 6...
11    [12, Egilsstadir, Egilsstadir, Iceland, EGS, B...
12    [13, Hornafjordur, Hofn, Iceland, HFN, BIHN, 6...
13    [14, Husavik, Husavik, Iceland, HZK, BIHU, 65....
14    [15, Isafjordur, Isafjordur, Iceland, IFJ, BII...
...
8092    [9527, Bus, Siem Reap, Cambodia, , SMRP, 13.36...
8093    [9528, Bus, Sihanoukville, Cambodia, , SNKV, 1...
8094    [9529, Bus, Kampot, Cambodia, , 

This is not a very useful Series object, as access to the items inside each list is not syntactically nice. However, let's put the country of every airport in a Series.

In [20]:
countries = pd.Series(np.array([airport[3] for airport in data]))
# this is a more interesting Series object
countries

0     Papua New Guinea
1     Papua New Guinea
2     Papua New Guinea
3     Papua New Guinea
4     Papua New Guinea
5     Papua New Guinea
6            Greenland
7            Greenland
8            Greenland
9            Greenland
10             Iceland
11             Iceland
12             Iceland
13             Iceland
14             Iceland
...
8092         Cambodia
8093         Cambodia
8094         Cambodia
8095         Cambodia
8096           Taiwan
8097        Australia
8098    United States
8099            Spain
8100           Canada
8101           Canada
8102           Canada
8103           Canada
8104           Canada
8105    United States
8106    United States
Length: 8107, dtype: object

In [21]:
print(countries.index)

Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, ...], dtype='int64')
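A Series of country names becomes more useful once it is summarized. The sketch below counts entries per country with value_counts. To stay self-contained it uses a handful of rows trimmed to their first four fields (the name sample_rows is hypothetical) instead of the downloaded data.

```python
import pandas as pd

# Hypothetical stand-in for the downloaded rows, trimmed to four fields.
sample_rows = [
    ['1', 'Goroka', 'Goroka', 'Papua New Guinea'],
    ['2', 'Madang', 'Madang', 'Papua New Guinea'],
    ['7', 'Narsarsuaq', 'Narssarssuaq', 'Greenland'],
    ['11', 'Akureyri', 'Akureyri', 'Iceland'],
    ['12', 'Egilsstadir', 'Egilsstadir', 'Iceland'],
    ['13', 'Hornafjordur', 'Hofn', 'Iceland'],
]
sample_countries = pd.Series([row[3] for row in sample_rows])

# Number of rows per country, most frequent first.
per_country = sample_countries.value_counts()
print(per_country)
```

The result is itself a Series, indexed by country name, so everything shown in the accessing section applies to it.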


## DataFrame

A DataFrame is a 2-dimensional **labeled** data structure with columns of different types. It can be seen as a spreadsheet whose columns are Series, or as a Python dictionary whose Series are accessed through labels.
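Both views can be checked on a small example. The sketch below uses made-up data under the hypothetical name table. Each column has its own dtype, and looking up a column by label returns a Series, much like a dictionary lookup.

```python
import pandas as pd

table = pd.DataFrame({
    'name': ['ana', 'bob'],
    'age': [31, 45],
    'height': [1.62, 1.80],
})

# Spreadsheet view: every column carries its own dtype.
print(table.dtypes)

# Dictionary view: columns act as keys and each value is a Series.
print('age' in table)
print(type(table['age']).__name__)
```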

### DataFrame Creation

We can create DataFrames from:
* Dict of 1D ndarrays, lists, dicts, or Series
* 2-D numpy.ndarray
* Structured or record ndarray
* A Series
* Another DataFrame
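The dict-based case is covered in detail in this section. For the other inputs in the list, the following sketch shows one plausible call each. All names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# 2-D ndarray: row and column labels are passed explicitly.
from_2d = pd.DataFrame(np.array([[1, 2], [3, 4]]),
                       index=['r1', 'r2'], columns=['x', 'y'])

# Structured ndarray: the field names become the column names.
records = np.array([(1, 2.5), (2, 3.5)],
                   dtype=[('id', 'i4'), ('score', 'f8')])
from_records = pd.DataFrame(records)

# Series: gives a single column named after the Series.
from_series = pd.DataFrame(pd.Series([10, 20], index=['a', 'b'], name='value'))

# Another DataFrame: copy=True requests an independent copy of the data.
from_frame = pd.DataFrame(from_2d, copy=True)

print(from_2d)
print(from_records)
print(from_series)
print(from_frame)
```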

**From dict of Series or dicts**

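The resulting **index** will be the **union** of the indexes of the various Series. If there are any nested dicts, these will first be converted to Series. If no columns are passed, the columns are taken from the dict keys: sorted in the pandas version used for these notes, while recent pandas versions keep the dict's insertion order instead.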



In [58]:
d = {'one': pd.Series([1,2,3],index=['a','b','c']),'two': pd.Series([1,2,3,4],index=['a','b','c','z']),'three':{'a':1}}
df = pd.DataFrame(d)
df

   one  three  two
a  1.0    1.0    1
b  2.0    NaN    2
c  3.0    NaN    3
z  NaN    NaN    4


In [56]:
pd.DataFrame(d, index=['d', 'b', 'a'])

   one  three  two
d  NaN    NaN  NaN
b  2.0    NaN  2.0
a  1.0    1.0  1.0


In [57]:
pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three','four'])

   two  three  four
d  NaN    NaN   NaN
b  2.0    NaN   NaN
a  1.0    1.0   NaN


The row and column labels can be accessed through the index and columns attributes, respectively:

In [59]:
df.index

Index([a, b, c, z], dtype=object)

In [60]:
df.columns

Index([one, three, two], dtype=object)