## Pandas Series

One of the nice things that Pandas provides for series is the *index*.  An index labels each element of a series, much like the row labels in a table, and lets you access the data by label as well as by position.  See below the various ways that an index can be used.  NOTE: the first cell is an example of the *describe()* function, which provides a quick summary of a loaded data set.


In [3]:
import pandas as pd

# Employment data in 2007 for 20 countries
employment = pd.Series([55.70000076,  51.40000153,  50.5, 75.69999695, 58.40000153,
                        40.09999847,  61.5,  57.09999847, 60.90000153,  66.59999847,
                        60.40000153,  68.09999847, 66.90000153, 53.40000153,48.59999847,
                        56.79999924, 71.59999847,  58.40000153,  70.40000153,  41.20000076], 
                index = ['Afghanistan', 'Albania', 'Algeria', 'Angola', 'Argentina',
                       'Armenia', 'Australia', 'Austria', 'Azerbaijan', 'Bahamas',
                       'Bahrain', 'Bangladesh', 'Barbados', 'Belarus', 'Belgium',
                       'Belize', 'Benin', 'Bhutan', 'Bolivia','Bosnia and Herzegovina'])

employment.describe()


count    20.000000
mean     58.685000
std       9.580862
min      40.099998
25%      52.900002
50%      58.400002
75%      66.674999
max      75.699997
dtype: float64

In [4]:
employment.index

Index([u'Afghanistan', u'Albania', u'Algeria', u'Angola', u'Argentina',
       u'Armenia', u'Australia', u'Austria', u'Azerbaijan', u'Bahamas',
       u'Bahrain', u'Bangladesh', u'Barbados', u'Belarus', u'Belgium',
       u'Belize', u'Benin', u'Bhutan', u'Bolivia', u'Bosnia and Herzegovina'],
      dtype='object')

In [5]:
employment.index[0]

'Afghanistan'

In [6]:
employment.iloc[0]  # access by position (plain employment[0] is deprecated in recent pandas versions)

55.700000760000002

In [7]:
employment.loc['Algeria']

50.5

In [9]:
employment.idxmax()

'Angola'
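
The difference between positional and label-based access can be made explicit with `.iloc` and `.loc`. Here is a minimal sketch, assuming a small three-country subset with rounded values (`sample` and the other variable names are illustrative, not from the cells above):

```python
import pandas as pd

# Illustrative subset of the employment data, values rounded for brevity
sample = pd.Series([55.7, 51.4, 50.5],
                   index=['Afghanistan', 'Albania', 'Algeria'])

by_position = sample.iloc[0]           # first element, whatever its label is
by_label = sample.loc['Afghanistan']   # element whose label is 'Afghanistan'
top_label = sample.idxmax()            # label of the largest value
top_value = sample.loc[top_label]      # the largest value itself

print(by_position, by_label, top_label, top_value)
```

Using `.iloc` and `.loc` rather than plain square brackets keeps the intent unambiguous, which matters most when the index itself contains integers.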

Several of the statistics reported by *describe()* are also available as individual methods on the series.

In [10]:
employment.max()

75.699996949999999

In [11]:
employment.min()

40.099998470000003

In [12]:
employment.std()

9.580861956589192

In [13]:
employment.count()

20
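
Because *describe()* itself returns a series, its entries can also be looked up by label. A small sketch with made-up numbers (`values` and `summary` are illustrative names):

```python
import pandas as pd

values = pd.Series([2.0, 4.0, 6.0, 8.0])  # made-up data
summary = values.describe()               # a Series indexed by statistic name

print(summary.index.tolist())
print(summary.loc['mean'], values.mean())  # same number, two ways to get it
print(summary.loc['max'], values.max())
```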

## Vectorized operations on Series with indexes

One of the nice things about using Pandas series is that vectorized operations between two series are matched up by index label, not by position in the series.  This means that the two series don't have to be kept in the same order.  Labels that appear in only one of the series produce NaN in the result.


In [18]:
# 1. Addition when indexes are the same
print("Example 1.")
s1 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
s2 = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])
print(s1 + s2)

# 2. Indexes have same elements in a different order
print("\nExample 2.")
s1 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
s2 = pd.Series([10, 20, 30, 40], index=['b', 'd', 'a', 'c'])
print(s1 + s2)

# 3. Indexes overlap, but do not have exactly the same elements
print("\nExample 3.")
s1 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
s2 = pd.Series([10, 20, 30, 40], index=['c', 'd', 'e', 'f'])
print(s1 + s2)

# 4. Indexes do not overlap
print("\nExample 4.")
s1 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
s2 = pd.Series([10, 20, 30, 40], index=['e', 'f', 'g', 'h'])
print(s1 + s2)

Example 1.
a    11
b    22
c    33
d    44
dtype: int64

Example 2.
a    31
b    12
c    43
d    24
dtype: int64

Example 3.
a     NaN
b     NaN
c    13.0
d    24.0
e     NaN
f     NaN
dtype: float64

Example 4.
a   NaN
b   NaN
c   NaN
d   NaN
e   NaN
f   NaN
g   NaN
h   NaN
dtype: float64
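
One way to see what index alignment does is `Series.align`, which reindexes both series to a common index before any arithmetic happens. A sketch, assuming the same `s1` and `s2` as in Example 3 (`left` and `right` are illustrative names):

```python
import pandas as pd

s1 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
s2 = pd.Series([10, 20, 30, 40], index=['c', 'd', 'e', 'f'])

# Default join='outer': both results share the union of the two indexes,
# with NaN wherever a label was missing from the original series
left, right = s1.align(s2)
print(left)
print(right)

# Adding the aligned series gives the same result as s1 + s2
print((left + right).equals(s1 + s2))
```

This also explains why the results in Examples 3 and 4 are `float64`: NaN is a floating-point value, so the integers are converted once missing labels appear.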


Example 3 is particularly useful.  The result's index is the union of the two indexes, with NaN wherever a label is missing from either series.  Dropping those NaN values leaves only the intersecting labels, which gives a sort of "inner join" logic.

In [20]:
# 3. Indexes overlap, but do not have exactly the same elements + dropping the non-overlap
print("\nExample 3.")
s1 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
s2 = pd.Series([10, 20, 30, 40], index=['c', 'd', 'e', 'f'])
s3 = s1 + s2
print(s3.dropna())


Example 3.
c    13.0
d    24.0
dtype: float64
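
One caveat with `dropna()` is that it would also discard labels whose original values were already NaN. A possible alternative, sketched here with the same two series, is to restrict both to their shared labels before adding (`common` and `inner` are illustrative names):

```python
import pandas as pd

s1 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
s2 = pd.Series([10, 20, 30, 40], index=['c', 'd', 'e', 'f'])

common = s1.index.intersection(s2.index)  # labels present in both series
inner = s1.loc[common] + s2.loc[common]   # no NaN is ever introduced
print(inner)
```

Since no missing values appear along the way, the result keeps its integer dtype instead of being converted to float.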


It's also possible to do an 'outer join' between two series, where labels found in only one series keep their values instead of becoming NaN.  First, the roundabout way I originally did it.

In [69]:
# HOW I FIRST DID IT

from collections import defaultdict

s1 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
s2 = pd.Series([10, 20, 30, 40], index=['c', 'd', 'e', 'f'])

# Pair each label with its value; list() lets the pairs be concatenated
zipped1 = list(zip(s1.index, s1))
zipped2 = list(zip(s2.index, s2))
z3 = zipped1 + zipped2

# Sum the values that share a label
foo = defaultdict(int)
for i, j in z3:
    foo[i] += j

# Sort by label so the result is ordered a..f
outer_join = pd.Series(dict(foo)).sort_index()
outer_join


a     1
b     2
c    13
d    24
e    30
f    40
dtype: int64

In [70]:
# A MUCH BETTER WAY TO DO IT

s1 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
s2 = pd.Series([10, 20, 30, 40], index=['c', 'd', 'e', 'f'])

s1.add(s2, fill_value=0)

a     1.0
b     2.0
c    13.0
d    24.0
e    30.0
f    40.0
dtype: float64
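
The same `fill_value` idea carries over to the other arithmetic methods, such as `sub()` and `mul()`. A brief sketch with the same two series; the choice of fill value is an assumption that depends on the operation (0 for subtraction, 1 for multiplication here):

```python
import pandas as pd

s1 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
s2 = pd.Series([10, 20, 30, 40], index=['c', 'd', 'e', 'f'])

difference = s1.sub(s2, fill_value=0)  # a missing label is treated as 0
product = s1.mul(s2, fill_value=1)     # a missing label is treated as 1
print(difference)
print(product)
```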

## Using the .apply() function

Pandas Series have a useful function named *apply()* that is similar to the Python *map()* function.  In both cases, every element of the list or series is passed through a function, and the results are returned as a new list (for *map()*) or a new series (for *apply()*).


In [84]:
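# Python map() example
foo = [2, 4, 6, 8]
times_2 = lambda n: n * 2
print(list(map(times_2, foo)))  # list() so the result prints as a list on Python 3 too

print("----")

# The equivalent with a Pandas series and apply()
foo = pd.Series([2, 4, 6, 8])
times_2 = lambda n: n * 2
print(foo.apply(times_2))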

[4, 8, 12, 16]
----
0     4
1     8
2    12
3    16
dtype: int64
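
For simple arithmetic like this, the vectorized form gives the same result without calling a Python function on every element. A short sketch comparing the two (`via_apply` and `via_vectorized` are illustrative names):

```python
import pandas as pd

foo = pd.Series([2, 4, 6, 8])

via_apply = foo.apply(lambda n: n * 2)  # calls the lambda once per element
via_vectorized = foo * 2                # one vectorized multiplication

print(via_apply.equals(via_vectorized))
```

*apply()* is most useful when the per-element logic cannot be expressed as a vectorized operation, as in the name example that follows.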


In [85]:
# Pandas apply() example

import pandas as pd

names = pd.Series(['Andre Agassi','Barry Bonds','Christopher Columbus','Daniel Defoe','Emilio Estevez',
                   'Fred Flintstone','Greta Garbo','Humbert Humbert','Ivan Ilych','James Joyce',
                   'Keira Knightley','Lois Lane','Mike Myers','Nick Nolte','Ozzy Osbourne','Pablo Picasso',
                   'Quirinus Quirrell','Rachael Ray','Susan Sarandon','Tina Turner','Ugueth Urbina',
                   'Vince Vaughn','Woodrow Wilson','Yoji Yamada','Zinedine Zidane'])

def reverse_names(names):
    '''Use Pandas .apply() to return a new series where: 
    "Firstname Lastname" is transformed to "Lastname, FirstName".
    '''
    def flip_name(name):
        first, last = name.split()
        new_name = "%s, %s" % (last, first)
        return new_name
    new_names = names.apply(flip_name)  
    return new_names

reverse_names(names)

0             Agassi, Andre
1              Bonds, Barry
2     Columbus, Christopher
3             Defoe, Daniel
4           Estevez, Emilio
5          Flintstone, Fred
6              Garbo, Greta
7          Humbert, Humbert
8               Ilych, Ivan
9              Joyce, James
10         Knightley, Keira
11               Lane, Lois
12              Myers, Mike
13              Nolte, Nick
14           Osbourne, Ozzy
15           Picasso, Pablo
16       Quirrell, Quirinus
17             Ray, Rachael
18          Sarandon, Susan
19             Turner, Tina
20           Urbina, Ugueth
21            Vaughn, Vince
22          Wilson, Woodrow
23             Yamada, Yoji
24         Zidane, Zinedine
dtype: object
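
Note that `flip_name` assumes every name has exactly two words; a name with more or fewer would raise a `ValueError` when unpacking. Here is a sketch of a more forgiving variant, assuming the last word is always the surname (`flip_name_safe` and the sample names are hypothetical, not part of the cell above):

```python
import pandas as pd

def flip_name_safe(name):
    """Hypothetical variant of flip_name: treat the last word as the surname."""
    parts = name.rsplit(None, 1)   # split once, from the right, on whitespace
    if len(parts) < 2:
        return name                # single-word names are left unchanged
    first, last = parts
    return "%s, %s" % (last, first)

# Made-up sample covering two-word, three-word, and one-word names
sample = pd.Series(['Andre Agassi', 'Edgar Allan Poe', 'Cher'])
flipped = sample.apply(flip_name_safe)
print(flipped.tolist())
```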