# Chapter 5: A Concise Introduction to pandas
> Pandas will be the primary library of interest throughout much of the rest of the book.
It contains high-level data structures and manipulation tools designed to make data
analysis fast and easy in Python. pandas is built on top of NumPy and makes it easy to
use in NumPy-centric applications.

Overview:
* Introduction to pandas: Data Structures
* Essential functionality
* Summarizing and computing descriptive statistics
* Handling missing data
* Hierarchical Indexing
* Other pandas topics

In [1]:
from pandas import Series, DataFrame

In [2]:
import pandas as pd
import numpy as np

# Introduction to pandas Data Structures
> To work well with pandas, we need to be comfortable with its two main data structures: **Series** and **DataFrame**.

## Series
> A Series is a one-dimensional array-like object containing an array of data (of any
NumPy data type)

In [3]:
obj = Series([4, 5, -7, 3])

In [4]:
obj

0    4
1    5
2   -7
3    3
dtype: int64

A Series exposes two attributes: **index** (the labels shown on the left) and **values** (the data shown on the right).

In [5]:
obj.index

RangeIndex(start=0, stop=4, step=1)

In [7]:
obj.values

array([ 4,  5, -7,  3])

We can also create a Series with an index that we define ourselves.

In [8]:
obj2 = Series([4, 7, -5, 3], index=['a', 'b', 'c', 'd'])
obj2

a    4
b    7
c   -5
d    3
dtype: int64

In [9]:
obj2.index

Index([u'a', u'b', u'c', u'd'], dtype='object')

As with a NumPy ndarray, we can select and set a single value through its index label.

In [10]:
obj2['a']

4

In [15]:
obj2['c'] = -10

In [16]:
obj2

a     4
b     7
c   -10
d     3
dtype: int64

In [17]:
obj[:2]

0    4
1    5
dtype: int64

NumPy array operations, such as filtering with a boolean array, scalar multiplication,
or applying math functions, will preserve the index-value link:

In [18]:
obj2[obj2 > 0]

a    4
b    7
d    3
dtype: int64

In [19]:
obj2 * 2

a     8
b    14
c   -20
d     6
dtype: int64

Like a slice of an ndarray, a slice of a Series is a view here: modifying the slice also changes the original data, as the outputs of the next two cells show.

In [28]:
import numpy as np
arr = np.array([1,2,3,4,5,6])
sub_arr = arr[:3]
sub_arr[:] = 1
arr

array([1, 1, 1, 4, 5, 6])

In [33]:
sub_series = obj2[:3]
sub_series['a'] = 10000
obj2

a    10000
b        7
c      -10
d        3
dtype: int64
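
Whether a slice shares memory with its parent can depend on the pandas version and its copy-on-write settings, so relying on the behaviour above is fragile. The following is a minimal sketch, assuming a small Series shaped like `obj2`, of taking an explicit copy so that edits never reach the original. The name `sub_copy` is introduced here for illustration only.

```python
import pandas as pd

obj2 = pd.Series([4, 7, -10, 3], index=['a', 'b', 'c', 'd'])

# .copy() returns an independent object, so edits stay local to the copy
sub_copy = obj2.iloc[:3].copy()
sub_copy['a'] = 10000

print(obj2['a'])      # the original value is unchanged
print(sub_copy['a'])  # only the copy holds the new value
```

Taking an explicit copy makes the intent clear to the reader regardless of how a given pandas release treats views.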

A Series can stand in for a dict in many contexts, for example in membership tests against its index:

In [34]:
'a' in obj2

True

In [35]:
'f' in obj2

False

We can create a Series by passing a dict. In the output below, the result is ordered by its index (the sorted dict keys).

In [36]:
sdata = {'Ohio': 35000, 'Texas': 70000, 'Conecticut':40000, 'New York': 100000}

In [38]:
obj3 = Series(sdata)
obj3

Conecticut     40000
New York      100000
Ohio           35000
Texas          70000
dtype: int64
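
The sorted order shown above reflects the pandas version used in this notebook. As far as I know, recent pandas releases on modern Python keep the dict's insertion order instead, so this is a hedged sketch of making the ordering explicit with `sort_index`. The dict contents below are made up for illustration.

```python
import pandas as pd

# Hypothetical data, deliberately not in alphabetical order
sdata = {'Texas': 70000, 'Ohio': 35000, 'Utah': 5000, 'Oregon': 16000}

obj3 = pd.Series(sdata)

# sort_index returns a new Series ordered by label, whatever the input order was
sorted_obj3 = obj3.sort_index()
print(sorted_obj3)
```

Calling `sort_index` explicitly gives the same result whichever ordering rule the installed version applies.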

We can also pass an explicit index together with the dict. Where an index label matches a dict key, the corresponding value is used; otherwise NaN is assigned.

In [44]:
states = ['California', 'New York', 'Ohio', 'Texas']
obj4 = Series(sdata, index=states)
obj4

California         NaN
New York      100000.0
Ohio           35000.0
Texas          70000.0
dtype: float64

**pandas** provides two functions for detecting missing data: **isnull** and **notnull**.

In [46]:
pd.isnull(obj4)

California     True
New York      False
Ohio          False
Texas         False
dtype: bool

In [47]:
pd.notnull(obj4)

California    False
New York       True
Ohio           True
Texas          True
dtype: bool

Series also has these as instance methods:

In [48]:
obj4.isnull()

California     True
New York      False
Ohio          False
Texas         False
dtype: bool

In [49]:
obj4.notnull()

California    False
New York       True
Ohio           True
Texas          True
dtype: bool

We will discuss handling missing data in more detail later in this chapter.

**Series automatically aligns differently indexed data in arithmetic operations:**

In [51]:
obj3 + obj4

California         NaN
Conecticut         NaN
New York      200000.0
Ohio           70000.0
Texas         140000.0
dtype: float64

**Series name attribute**
> Both the Series itself and its index have a name attribute.

In [55]:
obj4.name = 'population'
obj4.name

'population'

In [56]:
obj4.index.name = 'state'
obj4.index.name

'state'

## DataFrame

> A DataFrame represents a tabular, spreadsheet-like data structure containing an ordered
collection of columns, each of which can be a different value type (numeric,
string, boolean, etc.). The DataFrame has both a row and column index; it can be
thought of as a dict of Series (one for all sharing the same index). Compared with other
such DataFrame-like structures you may have used before (like R’s data.frame), row-oriented
and column-oriented operations in DataFrame are treated roughly symmetrically.
Under the hood, the data is stored as one or more two-dimensional blocks rather
than a list, dict, or some other collection of one-dimensional arrays.

* Create a DataFrame from a dict

In [60]:
data = {
    'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
    'year': [2000, 2001, 2002, 2000, 2001],
    'pop': [1.5, 1.7, 3.6, 2.4, 2.9]
}
frame = DataFrame(data)
frame

Unnamed: 0,pop,state,year
0,1.5,Ohio,2000
1,1.7,Ohio,2001
2,3.6,Ohio,2002
3,2.4,Nevada,2000
4,2.9,Nevada,2001


We can specify the order of the columns.

In [62]:
DataFrame(data, columns=['year', 'state', 'pop'])

Unnamed: 0,year,state,pop
0,2000,Ohio,1.5
1,2001,Ohio,1.7
2,2002,Ohio,3.6
3,2000,Nevada,2.4
4,2001,Nevada,2.9


As with Series, we can set the index of a DataFrame. If we pass a column that isn't contained in the data, it appears with NaN values.

In [66]:
frame2 = DataFrame(data, 
                   columns=['year', 'state', 'pop', 'debt'], 
                   index=['one', 'two','three', 'four', 'five']
                  )
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,
two,2001,Ohio,1.7,
three,2002,Ohio,3.6,
four,2000,Nevada,2.4,
five,2001,Nevada,2.9,


A column in a DataFrame can be retrieved as a Series either by dict-like notation or by
attribute:

In [67]:
frame2['state']

one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
Name: state, dtype: object

In [69]:
frame2.state  # attribute access; same result as frame2['state']

Rows can be retrieved by position or name with the **ix** attribute, which also returns a Series.

In [77]:
frame2.ix['one']

year     2000
state    Ohio
pop       1.5
debt      NaN
Name: one, dtype: object
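
The `ix` indexer used in this notebook was deprecated and later removed from pandas, so the cell above will not run on current releases. The sketch below, which rebuilds a frame like `frame2` without the `debt` column, shows the two explicit indexers that replace it: `loc` for labels and `iloc` for integer positions. The variable names are illustrative.

```python
import pandas as pd

data = {
    'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
    'year': [2000, 2001, 2002, 2000, 2001],
    'pop': [1.5, 1.7, 3.6, 2.4, 2.9],
}
frame2 = pd.DataFrame(
    data,
    columns=['year', 'state', 'pop'],
    index=['one', 'two', 'three', 'four', 'five'],
)

# Select a row by its label
row_by_label = frame2.loc['one']

# Select the same row by its integer position
row_by_position = frame2.iloc[0]

print(row_by_label)
```

Splitting label-based and position-based access into two indexers removes the ambiguity `ix` had when the index itself contained integers.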

Columns can be modified by assignment. If we assign a list or array to a column, its length must match the length of the DataFrame.

In [80]:
frame2.debt = np.arange(5)
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,0
two,2001,Ohio,1.7,1
three,2002,Ohio,3.6,2
four,2000,Nevada,2.4,3
five,2001,Nevada,2.9,4


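We can also assign a Series. It is aligned to the DataFrame's index, and labels missing from the Series are filled with NaN.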


In [82]:
val = Series([-1.2, 1.5, 1.7], index=['one', 'two', 'five'])
frame2['debt'] = val
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,-1.2
two,2001,Ohio,1.7,1.5
three,2002,Ohio,3.6,
four,2000,Nevada,2.4,
five,2001,Nevada,2.9,1.7


Assigning to a column that doesn't exist creates a new column.

In [83]:
frame2['isOhio'] = frame2['state'] == 'Ohio'
frame2

Unnamed: 0,year,state,pop,debt,isOhio
one,2000,Ohio,1.5,-1.2,True
two,2001,Ohio,1.7,1.5,True
three,2002,Ohio,3.6,,True
four,2000,Nevada,2.4,,False
five,2001,Nevada,2.9,1.7,False


The **del** keyword deletes a column.

In [84]:
del frame2['isOhio']
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,-1.2
two,2001,Ohio,1.7,1.5
three,2002,Ohio,3.6,
four,2000,Nevada,2.4,
five,2001,Nevada,2.9,1.7


Another common form of data is a nested dict. If it is passed to DataFrame, the outer dict keys are interpreted as the columns and the inner
keys as the row indices:

In [85]:
pop = {
    'Ohio': {
        2001: 2.4,
        2002: 2.9
    }, 
    'Nevada': {
        2000: 1.5,
        2001: 1.7,
        2002: 3.6
    }
}
pop

{'Nevada': {2000: 1.5, 2001: 1.7, 2002: 3.6}, 'Ohio': {2001: 2.4, 2002: 2.9}}

In [87]:
frame3 = DataFrame(pop)
frame3

Unnamed: 0,Nevada,Ohio
2000,1.5,
2001,1.7,2.4
2002,3.6,2.9


We can transpose the result: 

In [88]:
frame3.T

Unnamed: 0,2000,2001,2002
Nevada,1.5,1.7,3.6
Ohio,,2.4,2.9


We can set a name for the index and for the columns:

In [92]:
frame3.index.name = 'year'
frame3.columns.name = 'state'
frame3.T

year,2000,2001,2002
state,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Nevada,1.5,1.7,3.6
Ohio,,2.4,2.9


As with Series, we can access **values**, which returns the data as a 2D ndarray:

In [93]:
frame3.values

array([[ 1.5,  nan],
       [ 1.7,  2.4],
       [ 3.6,  2.9]])

## Index Objects

> Pandas’s Index objects are responsible for holding the axis labels and other metadata

In [96]:
obj = Series(np.arange(3), index=['a','b','c'])

In [97]:
obj

a    0
b    1
c    2
dtype: int64

In [98]:
index= obj.index
index

Index([u'a', u'b', u'c'], dtype='object')

Index objects are immutable and thus can’t be modified by the user:

In [102]:
index[1] = 'd'

TypeError: Index does not support mutable operations

# Essential Functionality

## Reindexing

> Reindexing creates a new Series or DataFrame with the data conformed to a new index.

In [110]:
obj = Series(np.random.randn(5), index=['a','b','c','d','e'])
obj

a   -0.801731
b    0.207178
c    0.545623
d   -1.025449
e    0.145221
dtype: float64

In [112]:
obj.reindex(['a','b','c','d','e','g'], fill_value=0)

a   -0.801731
b    0.207178
c    0.545623
d   -1.025449
e    0.145221
g    0.000000
dtype: float64

* **ffill**: Fill (or carry) values forward

In [114]:
obj3 = Series(['blue', 'purple', 'yellow'], index=[0,2,4])
obj3

0      blue
2    purple
4    yellow
dtype: object

In [115]:
obj3.reindex(range(6), method='ffill')

0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object

* **bfill**: Fill (or carry) values backward

In [118]:
obj3.reindex(range(6), method='bfill')

0      blue
1    purple
2    purple
3    yellow
4    yellow
5       NaN
dtype: object

With DataFrame, reindex can alter either the (row) index, columns, or both. When
passed just a sequence, the rows are reindexed in the result:

In [119]:
frame = DataFrame(np.arange(9).reshape((3,3)), index=['a', 'c', 'd'], columns=['Ohio', 'Texas', 'California'] )

In [120]:
frame

Unnamed: 0,Ohio,Texas,California
a,0,1,2
c,3,4,5
d,6,7,8


In [121]:
frame.reindex(['a','b','c','d'])

Unnamed: 0,Ohio,Texas,California
a,0.0,1.0,2.0
b,,,
c,3.0,4.0,5.0
d,6.0,7.0,8.0


The columns can be reindexed using the columns keyword:

In [124]:
states = ['Texas', 'Utah', 'California', 'Ohio']
frame.reindex(columns=states)

Unnamed: 0,Texas,Utah,California,Ohio
a,1,,2,0
c,4,,5,3
d,7,,8,6


As with Series, we can use **ffill** or **bfill** when reindexing a DataFrame:

In [126]:
frame.reindex(index=['a','b','c','d'], columns=states, method='ffill')

Unnamed: 0,Texas,Utah,California,Ohio
a,1,,2,0
b,1,,2,0
c,4,,5,3
d,7,,8,6


## Dropping entries from an axis

> Dropping one or more entries from an axis is easy if you have an index array or list
without those entries. As that can require a bit of munging and set logic, the drop
method will return a new object with the indicated value or values deleted from an axis:

In [3]:
obj = Series(np.arange(5), index=['a','b','c','d','e'])
obj

a    0
b    1
c    2
d    3
e    4
dtype: int64

In [4]:
new_obj = obj.drop('c')

In [5]:
new_obj

a    0
b    1
d    3
e    4
dtype: int64

In [7]:
obj.drop(['d','c'])

a    0
b    1
e    4
dtype: int64

With DataFrame, index values can be deleted from either axis:

In [9]:
data = DataFrame(
    np.arange(16).reshape(4,4),
    index=['Ohio', 'Colorado', 'Utah', 'New York'],
    columns = ['one', 'two', 'three', 'four']
)
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [10]:
data.index

Index([u'Ohio', u'Colorado', u'Utah', u'New York'], dtype='object')

In [12]:
data.columns

Index([u'one', u'two', u'three', u'four'], dtype='object')

In [16]:
data.drop(['Ohio', 'Utah'], axis=0)

Unnamed: 0,one,two,three,four
Colorado,4,5,6,7
New York,12,13,14,15


In [18]:
data.drop(['two', 'four'], axis=1)

Unnamed: 0,one,three
Ohio,0,2
Colorado,4,6
Utah,8,10
New York,12,14


## Indexing, selection, and filtering

> Series indexing (obj[...]) works analogously to NumPy array indexing, except you can
use the Series’s index values instead of only integers. Here are some examples of this:

In [19]:
obj = Series(np.arange(4), index=['a','b','c','d'])

In [23]:
obj[2]

2

In [24]:
obj[2:4]

c    2
d    3
dtype: int64

In [25]:
obj[['a','b','c']]

a    0
b    1
c    2
dtype: int64

In [26]:
obj[obj > 2]

d    3
dtype: int64

Slicing with labels behaves differently than normal Python slicing in that the endpoint
is inclusive:

In [27]:
obj['b':'c']

b    1
c    2
dtype: int64

Setting using these methods works just as you would expect:

In [28]:
obj['b':'c'] = 5

In [29]:
obj

a    0
b    5
c    5
d    3
dtype: int64
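
To make the inclusive-endpoint rule concrete, here is a small sketch comparing a label slice with a positional slice on the same Series. It uses the explicit `loc` and `iloc` indexers rather than plain `obj[...]`, which is a choice made for this example and not something the notebook does.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4), index=['a', 'b', 'c', 'd'])

# Label slice: both endpoints 'b' and 'c' are included
by_label = obj.loc['b':'c']

# Positional slice: the stop position 2 is excluded, as in plain Python
by_position = obj.iloc[1:2]

print(by_label)
print(by_position)
```

The label slice returns two rows while the positional slice returns one, even though both start at the same element.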

Indexing into a DataFrame retrieves one or more columns, either with a single value or with a sequence:

In [30]:
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [32]:
data['two']

Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int64

In [34]:
data[['three', 'one']]

Unnamed: 0,three,one
Ohio,2,0
Colorado,6,4
Utah,10,8
New York,14,12


DataFrame indexing has a few special cases.
* Selecting rows by slicing or with a boolean array:

In [36]:
data[:2]

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7


In [38]:
data[data['three'] > 5]

Unnamed: 0,one,two,three,four
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


This might seem inconsistent to some readers, but this syntax arose out of practicality
and nothing more. Another use case is in indexing with a boolean DataFrame, such as
one produced by a scalar comparison:

In [39]:
data < 5

Unnamed: 0,one,two,three,four
Ohio,True,True,True,True
Colorado,True,False,False,False
Utah,False,False,False,False
New York,False,False,False,False


In [42]:
data[data < 5] = 0
data

Unnamed: 0,one,two,three,four
Ohio,0,0,0,0
Colorado,0,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


For label-indexing on the rows of a DataFrame, there is the special indexing field **ix**. It lets you select a subset of the rows and columns from a DataFrame:

In [43]:
data.ix['Colorado', ['two' , 'three']]

two      5
three    6
Name: Colorado, dtype: int64

In [44]:
type(data.ix['Colorado', ['two' , 'three']])

pandas.core.series.Series

We can also use **ix** as a way to reindex:

In [45]:
data.ix[['Colorado', 'Utah'], [3,0,1]]

Unnamed: 0,four,one,two
Colorado,7,0,5
Utah,11,8,9


**ix** can also select a row by integer position:

In [46]:
data.ix[2]

one       8
two       9
three    10
four     11
Name: Utah, dtype: int64

Or it can select a column by its label:

In [56]:
data.ix[:,'two']

Ohio         0
Colorado     5
Utah         9
New York    13
Name: two, dtype: int64

pandas offers several other ways to select, index, and reindex data:

|Type | Notes|
|-----|------|
| **obj[val]** | Select a single column or sequence of columns from the DataFrame. Special-case conveniences: boolean array (filter rows), slice (slice rows), or boolean DataFrame (set values based on some criterion). |
| **obj.ix[val]** | Select a single row or subset of rows from the DataFrame. |
| **obj.ix[:,val]** | Select a single column or subset of columns. |
| **obj.ix[val1,val2]** | Select both rows and columns. |
| **reindex** method | Conform one or more axes to new indexes. |
| **xs** method | Select a single row or column as a Series by label. |
| **icol**, **irow** methods (superseded by **iloc**) | Select a single column or row, respectively, as a Series by integer location. |
| **get_value**, **set_value** methods | Select or set a single value by row and column label. |

In [68]:
new_index = ['New York', 'California', 'Ohio', 'Colorado', 'Los Angles']
data.reindex(new_index, fill_value='missing')

Unnamed: 0,one,two,three,four
New York,12,13,14,15
California,missing,missing,missing,missing
Ohio,0,0,0,0
Colorado,0,5,6,7
Los Angles,missing,missing,missing,missing


In [78]:
data.get_value('Ohio', 'one')

0

In [81]:
data.iloc[3]

one      12
two      13
three    14
four     15
Name: New York, dtype: int64
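
Several entries in the table above (`ix`, `get_value`, `set_value`) have been removed from current pandas releases. As a hedged sketch, the selections shown in this section can be written with `loc` for label-based row and column selection and with `at` and `iat` for single values. The frame is rebuilt from the original `np.arange(16)` data, before any values were set to zero.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    np.arange(16).reshape(4, 4),
    index=['Ohio', 'Colorado', 'Utah', 'New York'],
    columns=['one', 'two', 'three', 'four'],
)

# One row, two columns, selected by label (returns a Series)
subset = data.loc['Colorado', ['two', 'three']]

# Several rows with columns in a chosen order (returns a DataFrame)
reordered = data.loc[['Colorado', 'Utah'], ['four', 'one', 'two']]

# Single value by label, and the same value by integer position
value_by_label = data.at['Ohio', 'one']
value_by_position = data.iat[0, 0]

print(subset)
print(reordered)
```

`at` and `iat` only accept a single row and a single column, which makes them a natural fit for the scalar lookups that `get_value` used to handle.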

## Arithmetic and data alignment

One of the most important pandas features is the behavior of arithmetic between objects
with different indexes. When adding together objects, if any index pairs are not
the same, the respective index in the result will be the union of the index pairs. Let’s
look at a simple example:

In [84]:
s1 = Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s1

a    7.3
c   -2.5
d    3.4
e    1.5
dtype: float64

In [86]:
s2 = Series([-2.1, 1.4, -4.5, 5.6 , 7.6], index=['a', 'c', 'e', 'f', 'g'])
s2

a   -2.1
c    1.4
e   -4.5
f    5.6
g    7.6
dtype: float64

In [89]:
s1 + s2

a    5.2
c   -1.1
d    NaN
e   -3.0
f    NaN
g    NaN
dtype: float64

In the case of DataFrame, alignment is performed on both the rows and the columns:

In [3]:
df1 = DataFrame(
    np.arange(9).reshape(3,3),
    columns=list('bcd'),
    index=['Ohio', 'Texas', 'Colorado']
)
df1

Unnamed: 0,b,c,d
Ohio,0,1,2
Texas,3,4,5
Colorado,6,7,8


In [4]:
df2 = DataFrame(
    np.arange(12).reshape((4, 3)), 
    columns=list('bde'),
    index=['Utah', 'Ohio', 'Texas', 'Oregon']
)
df2

Unnamed: 0,b,d,e
Utah,0,1,2
Ohio,3,4,5
Texas,6,7,8
Oregon,9,10,11


In [5]:
df1 + df2

Unnamed: 0,b,c,d,e
Colorado,,,,
Ohio,3.0,,6.0,
Oregon,,,,
Texas,9.0,,12.0,
Utah,,,,


**Arithmetic methods with fill values**

In [8]:
df1.add(df2, fill_value=0)

Unnamed: 0,b,c,d,e
Colorado,6.0,7.0,8.0,
Ohio,3.0,1.0,6.0,5.0
Oregon,9.0,,10.0,11.0
Texas,9.0,4.0,12.0,8.0
Utah,0.0,,1.0,2.0


In [13]:
df1.add(df2, fill_value=2)

Unnamed: 0,b,c,d,e
Colorado,8.0,9.0,10.0,
Ohio,3.0,3.0,6.0,7.0
Oregon,11.0,,12.0,13.0
Texas,9.0,6.0,12.0,10.0
Utah,2.0,,3.0,4.0


pandas provides the following flexible arithmetic methods, each of which accepts a **fill_value**:

|Method | Description |
|-------|-------------|
| add | Method for addition|
| sub | Method for subtraction |
| div | Method for division |
| mul | Method for multiplication |

**Operations between DataFrame and Series**

As with **NumPy** arrays, arithmetic between DataFrame and Series is well-defined. First,
as a motivating example, consider the difference between a 2D array and one of its rows:

In [15]:
arr = np.arange(12).reshape(3,4)
arr

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

In [16]:
arr[0]

array([0, 1, 2, 3])

In [17]:
arr - arr[0]

array([[0, 0, 0, 0],
       [4, 4, 4, 4],
       [8, 8, 8, 8]])

This is an example of broadcasting: the first row is subtracted from every row of the array.

Operations between a **DataFrame** and a Series behave similarly:

In [20]:
frame = DataFrame(
    np.arange(12).reshape(4,3), 
    columns=['a', 'b', 'c'], 
    index=['Ohio', 'Texas', 'New York', 'Colorado']
)
frame

Unnamed: 0,a,b,c
Ohio,0,1,2
Texas,3,4,5
New York,6,7,8
Colorado,9,10,11


In [27]:
series = frame.ix[2]
series

a    6
b    7
c    8
Name: New York, dtype: int64

In [28]:
frame - series

Unnamed: 0,a,b,c
Ohio,-6,-6,-6
Texas,-3,-3,-3
New York,0,0,0
Colorado,3,3,3


If an index value is not found in either the DataFrame’s columns or the Series’s index,
the objects will be reindexed to form the union:

In [30]:
series2 = Series(range(3), index=list('bef'))
series2

b    0
e    1
f    2
dtype: int64

In [34]:
frame + series2

Unnamed: 0,a,b,c,e,f
Ohio,,1.0,,,
Texas,,4.0,,,
New York,,7.0,,,
Colorado,,10.0,,,
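
By default the Series index is matched against the DataFrame's columns and the operation is broadcast down the rows. To broadcast over the columns instead, matching on the row index, the arithmetic methods accept an `axis` argument. This sketch uses the same `frame` as above and subtracts column `b` from every column.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(12).reshape(4, 3),
    columns=['a', 'b', 'c'],
    index=['Ohio', 'Texas', 'New York', 'Colorado'],
)

column_b = frame['b']

# axis='index' matches the Series against the row labels,
# so column_b is subtracted from each column in turn
result = frame.sub(column_b, axis='index')

print(result)
```

Without the `axis` argument, `frame - column_b` would try to match the state names against the column labels `a`, `b`, `c` and produce only NaN.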


## Function application and mapping

NumPy ufuncs (element-wise array methods) work fine with pandas objects:

In [48]:
frame = DataFrame(
    np.random.randn(4, 3), 
    columns=list('bde'),
    index=['Utah', 'Ohio', 'Texas', 'Oregon']
)
np.abs(frame)

Unnamed: 0,b,d,e
Utah,0.162711,0.366653,0.060733
Ohio,0.059461,0.112147,0.996201
Texas,0.680617,1.885878,2.131485
Oregon,0.84085,0.041084,0.083186


**apply**:
Another frequent operation is applying a function on 1D arrays to each column or row.

In [41]:
f = lambda x: x.max() - x.min()

In [47]:
frame.apply(f)

b    2.382938
d    3.073747
e    0.982059
dtype: float64

In [43]:
frame.apply(f, axis=1)

Utah      1.714474
Ohio      2.250502
Texas     0.790177
Oregon    2.732197
dtype: float64

The function passed to **apply** can also return a Series, in which case the result is a DataFrame:

In [52]:
def f(x):
    return Series([x.min(), x.max()], index=['min', 'max'])
frame.apply(f, axis=0)

Unnamed: 0,b,d,e
min,-0.680617,-0.366653,-2.131485
max,0.84085,1.885878,0.083186


We can apply a function to each element of a DataFrame with **applymap**:

In [65]:
format = lambda x: '%.2f' % x
frame.applymap(format)

Unnamed: 0,b,d,e
Utah,0.16,-0.37,0.06
Ohio,-0.06,0.11,-1.0
Texas,-0.68,1.89,-2.13
Oregon,0.84,-0.04,0.08
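
The name `applymap` mirrors the Series method `map`, which applies a function element-wise to a single column. The sketch below uses two rows of values copied from the frame above and shows `Series.map` on one column, then the same formatter applied to every column by combining `apply` with `map`. The helper name `two_decimals` is introduced here for illustration.

```python
import pandas as pd

frame = pd.DataFrame(
    [[0.162711, -0.366653], [-0.059461, 0.112147]],
    columns=['b', 'd'],
    index=['Utah', 'Ohio'],
)

def two_decimals(x):
    """Format a float as a string with two decimal places."""
    return '%.2f' % x

# Element-wise on a single column
formatted_d = frame['d'].map(two_decimals)

# Element-wise on every column: apply hands each column to Series.map
formatted_all = frame.apply(lambda column: column.map(two_decimals))

print(formatted_d)
print(formatted_all)
```

Using a named function instead of a lambda bound to `format` also avoids shadowing Python's built-in `format`.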


## Sorting and ranking
* **sort_index**: Sort by row or column index.

In [68]:
obj = Series(np.random.randn(6), index=['a', 'b', 'r', 'c', 'd', 'e'])
obj.sort_index()

a   -2.139884
b    0.168129
c    2.369960
d   -2.137507
e    0.283322
r    0.613477
dtype: float64

With a DataFrame, we can sort by the index on either axis (0 or 1), and we can sort in descending order:

In [69]:
frame

Unnamed: 0,b,d,e
Utah,0.162711,-0.366653,0.060733
Ohio,-0.059461,0.112147,-0.996201
Texas,-0.680617,1.885878,-2.131485
Oregon,0.84085,-0.041084,0.083186


In [72]:
frame.sort_index(axis=0)

Unnamed: 0,b,d,e
Ohio,-0.059461,0.112147,-0.996201
Oregon,0.84085,-0.041084,0.083186
Texas,-0.680617,1.885878,-2.131485
Utah,0.162711,-0.366653,0.060733


In [73]:
frame.sort_index(axis=1, ascending=False)

Unnamed: 0,e,d,b
Utah,0.060733,-0.366653,0.162711
Ohio,-0.996201,0.112147,-0.059461
Texas,-2.131485,1.885878,-0.680617
Oregon,0.083186,-0.041084,0.84085


* **sort_values**: Sort a Series or DataFrame by its values.

With Series, we can use **sort_values()**

In [77]:
obj.sort_values()

a   -2.139884
d   -2.137507
b    0.168129
e    0.283322
r    0.613477
c    2.369960
dtype: float64

With a DataFrame, we use **sort_values()** and pass the column name (or a list of names) through the **by=** argument:

In [84]:
frame.sort_values(by='e')

Unnamed: 0,b,d,e
Texas,-0.680617,1.885878,-2.131485
Ohio,-0.059461,0.112147,-0.996201
Utah,0.162711,-0.366653,0.060733
Oregon,0.84085,-0.041084,0.083186


**Ranking is closely related to sorting.**

In [94]:
obj = Series([7, -5, 7, 4, 2, 0, 4])
obj

0    7
1   -5
2    7
3    4
4    2
5    0
6    4
dtype: int64

In [95]:
obj.rank()

0    6.5
1    1.0
2    6.5
3    4.5
4    3.0
5    2.0
6    4.5
dtype: float64

By default, tied values are each assigned the average of their ranks. With **method='first'**, ties are instead broken by the order in which the values appear in the data:

In [98]:
obj.rank(method='first')

0    6.0
1    1.0
2    7.0
3    4.0
4    3.0
5    2.0
6    5.0
dtype: float64

We can also rank in descending order:

In [100]:
obj.rank(method='max', ascending=False)


0    2.0
1    7.0
2    2.0
3    4.0
4    5.0
5    6.0
6    4.0
dtype: float64
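
To see how the tie-breaking rules differ, the following sketch ranks the same Series with several values of `method` and collects the results side by side. The column names of the `ranks` frame are simply the method names, chosen here for readability.

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

methods = ['average', 'min', 'max', 'first', 'dense']

# One column per tie-breaking method, all ranking the same values
ranks = pd.DataFrame({method: obj.rank(method=method) for method in methods})

print(ranks)
```

The two 7s share ranks 6 and 7: `average` gives both 6.5, `min` gives both 6, `max` gives both 7, `first` gives 6 and 7 in order of appearance, and `dense` gives both 5 because it never skips a rank between groups.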

A DataFrame can compute ranks over the rows or over the columns:

In [102]:
frame = DataFrame({
        'b': [4.3, 7, -3, 2], 
        'a': [0, 1, 0, 1],
        'c': [-2, 5, 8, -2.5]
    })
frame

Unnamed: 0,a,b,c
0,0,4.3,-2.0
1,1,7.0,5.0
2,0,-3.0,8.0
3,1,2.0,-2.5


In [107]:
frame.rank(axis=0, method='first')

Unnamed: 0,a,b,c
0,1.0,3.0,2.0
1,3.0,4.0,3.0
2,2.0,1.0,4.0
3,4.0,2.0,1.0


## Axis indexes with duplicate values

We can check whether the index values are unique with **is_unique**:

In [108]:
obj = Series(range(5), index=['a', 'a', 'b', 'b', 'c'])
obj

a    0
a    1
b    2
b    3
c    4
dtype: int64

In [110]:
obj.index.is_unique

False

Selecting a label that has multiple entries returns a Series, while selecting a label that has a single entry returns a scalar value:

In [111]:
obj['a']

a    0
a    1
dtype: int64

In [112]:
obj['c']

4

The same logic applies to indexing the rows of a DataFrame:

In [117]:
df = DataFrame(np.random.randn(4, 3), index=['a', 'a', 'b', 'c'])
df

Unnamed: 0,0,1,2
a,-0.396379,-0.18111,-0.366091
a,1.51443,0.05016,0.99557
b,-1.590776,0.126091,-0.363573
c,0.265425,0.072615,0.575085


In [115]:
df.ix['a']

Unnamed: 0,0,1,2
a,-0.396379,-0.18111,-0.366091
a,1.51443,0.05016,0.99557


In [118]:
df.ix['c']

0    0.265425
1    0.072615
2    0.575085
Name: c, dtype: float64

# Summarizing and Computing Descriptive Statistics

> Pandas objects are equipped with a set of common mathematical and statistical methods. Most of these fall into the category of reductions or summary statistics, methods
that extract a single value (like the sum or mean) from a Series or a Series of values from
the rows or columns of a DataFrame. Compared with the equivalent methods of vanilla
NumPy arrays, they are all built from the ground up to exclude missing data. Consider
a small DataFrame:

In [119]:
df = DataFrame(
    [[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
    index=['a', 'b', 'c', 'd'],
    columns=['one', 'two']
)
df

Unnamed: 0,one,two
a,1.4,
b,7.1,-4.5
c,,
d,0.75,-1.3


Calling DataFrame’s sum method returns a Series containing column sums:

In [122]:
df.sum()

one    9.25
two   -5.80
dtype: float64

Or passing axis=1:

In [121]:
df.sum(axis=1)

a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64

NA values are skipped by default, because the **skipna** option defaults to True. Setting it to False makes any NA value propagate into the result:

In [125]:
df.mean(axis=1, skipna=False)

a      NaN
b    1.300
c      NaN
d   -0.275
dtype: float64
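
Skipping NA values has one side effect worth knowing: the sum of a row that contains only NA values comes out as 0, as row `c` showed earlier. The sketch below reuses the same small frame and contrasts the default with `skipna=False` and with the `min_count` argument of `sum`, which requires a minimum number of valid values before a result is reported.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
    index=['a', 'b', 'c', 'd'],
    columns=['one', 'two'],
)

# Default: NA values are skipped, so the all-NA row 'c' sums to 0
row_sum = df.sum(axis=1)

# Require at least one valid value per row, otherwise report NaN
row_sum_strict = df.sum(axis=1, min_count=1)

# Any NA in a row makes that row's mean NaN
row_mean_no_skip = df.mean(axis=1, skipna=False)

print(row_sum)
print(row_sum_strict)
print(row_mean_no_skip)
```

`min_count=1` keeps the convenient NA-skipping for partially filled rows while still flagging rows that hold no data at all.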

**idxmin** and **idxmax** return the index labels at which the minimum and maximum values occur.

In [126]:
df.idxmin()

one    d
two    b
dtype: object

In [127]:
df.idxmax(axis=1)

a    one
b    one
c    NaN
d    one
dtype: object

Other methods, such as **cumsum**, are accumulations:

In [128]:
df.cumsum()

Unnamed: 0,one,two
a,1.4,
b,8.5,-4.5
c,,
d,9.25,-5.8


In [130]:
df.describe()

Unnamed: 0,one,two
count,3.0,2.0
mean,3.083333,-2.9
std,3.493685,2.262742
min,0.75,-4.5
25%,,
50%,,
75%,,
max,7.1,-1.3


## Correlation and Covariance

Some summary statistics, like correlation and covariance, are computed from pairs of
arguments. Let’s consider some DataFrames of stock prices and volumes obtained from
Yahoo! Finance:

In [21]:
import pandas.io.data as web
all_data = {}
for ticker in ['AAPL', 'IBM', 'MSFT', 'GOOG']:
    all_data[ticker] = web.get_data_yahoo(ticker, '1/1/1990', '1/1/2017')
price = DataFrame({tic: data['Adj Close']
                    for tic, data in all_data.iteritems()})
volume = DataFrame({tic: data['Volume']
                    for tic, data in all_data.iteritems()})

In [22]:
price

Unnamed: 0_level_0,AAPL,GOOG,IBM,MSFT
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1990-01-02,1.132075,,14.726001,0.418482
1990-01-03,1.139673,,14.857484,0.420840
1990-01-04,1.143471,,15.026532,0.433218
1990-01-05,1.147270,,14.988966,0.422608
1990-01-08,1.154868,,15.082882,0.429092
1990-01-09,1.143471,,14.932616,0.427913
1990-01-10,1.094086,,14.876267,0.416125
1990-01-11,1.048499,,15.007749,0.407873
1990-01-12,1.048499,,14.707218,0.406105
1990-01-15,1.040901,,14.744785,0.406105


Now compute percent changes of the prices:

In [23]:
returns = price.pct_change()

In [24]:
returns.tail()

Unnamed: 0_level_0,AAPL,GOOG,IBM,MSFT
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2016-12-23,0.001978,-0.001706,-0.002095,-0.004878
2016-12-27,0.006351,0.002076,0.002579,0.000632
2016-12-28,-0.004264,-0.008212,-0.005684,-0.004583
2016-12-29,-0.000257,-0.002879,0.002467,-0.001429
2016-12-30,-0.007796,-0.014014,-0.003661,-0.012083


The corr method of Series computes the correlation of the overlapping, non-NA,
aligned-by-index values in two Series. Relatedly, cov computes the covariance:

In [25]:
returns.MSFT.corr(returns.IBM)

0.42890751046268227

In [26]:
returns.MSFT.cov(returns.IBM)

0.00015457092448141116

DataFrame’s corr and cov methods, on the other hand, return a full correlation or
covariance matrix as a DataFrame, respectively:

In [27]:
returns.corr()

Unnamed: 0,AAPL,GOOG,IBM,MSFT
AAPL,1.0,0.450585,0.348782,0.362066
GOOG,0.450585,1.0,0.393845,0.453219
IBM,0.348782,0.393845,1.0,0.428908
MSFT,0.362066,0.453219,0.428908,1.0


In [28]:
returns.cov()

Unnamed: 0,AAPL,GOOG,IBM,MSFT
AAPL,0.000827,0.000192,0.000176,0.000214
GOOG,0.000192,0.000392,0.000105,0.00015
IBM,0.000176,0.000105,0.000309,0.000155
MSFT,0.000214,0.00015,0.000155,0.00042


**corrwith** computes pairwise correlations between a DataFrame’s columns or rows and another Series or DataFrame.
Passing a Series returns the correlation of each column with that Series:

In [29]:
returns.corrwith(returns.IBM)

AAPL    0.348782
GOOG    0.393845
IBM     1.000000
MSFT    0.428908
dtype: float64

Passing a DataFrame computes the correlations of matching column names. Here we compute the correlations of percent changes with volume:

In [32]:
volume

Unnamed: 0_level_0,AAPL,GOOG,IBM,MSFT
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1990-01-02,45799600,,7041600,53033600
1990-01-03,51998800,,9464000,113772800
1990-01-04,55378400,,9674800,125740800
1990-01-05,30828000,,7570000,69564800
1990-01-08,25393200,,4625200,58982400
1990-01-09,21534800,,7048000,70300800
1990-01-10,49929600,,5945600,103766400
1990-01-11,52763200,,5905600,95772800
1990-01-12,42974400,,5390800,148908800
1990-01-15,40434800,,4035600,62467200


In [31]:
returns.corrwith(volume)

AAPL    0.008559
GOOG    0.051874
IBM    -0.009769
MSFT   -0.000871
dtype: float64
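
The `pandas.io.data` module used for the download above is no longer part of pandas (the remote data readers moved to the separate pandas-datareader package), and the download needs network access. As a self-contained sketch of the same workflow, the example below builds synthetic random-walk prices and runs `pct_change`, `corr`, `cov`, and `corrwith` on them. The tickers `AAA`, `BBB`, `CCC` and all values are made up for illustration and are not market data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
dates = pd.date_range('2016-01-04', periods=250, freq='B')

# Synthetic prices: exponentiated cumulative sums of small random steps
steps = rng.normal(0, 0.01, size=(250, 3))
price = pd.DataFrame(
    100 * np.exp(steps.cumsum(axis=0)),
    index=dates,
    columns=['AAA', 'BBB', 'CCC'],
)

# Percent change from one business day to the next; the first row is NaN
returns = price.pct_change()

# Pairwise statistics between two columns
pair_corr = returns['AAA'].corr(returns['BBB'])
pair_cov = returns['AAA'].cov(returns['BBB'])

# Full matrices, and each column's correlation with one chosen column
corr_matrix = returns.corr()
corr_with_aaa = returns.corrwith(returns['AAA'])

print(corr_matrix)
print(corr_with_aaa)
```

The entries of `corr_matrix` agree with the pairwise `Series.corr` results, and the diagonal is 1 because each column is perfectly correlated with itself.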

## Unique Values, Value Counts, and Membership

Another class of related methods extracts information about the values contained in a
one-dimensional Series. To illustrate these, consider this example:

In [33]:
obj = Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])
obj

0    c
1    a
2    d
3    a
4    a
5    b
6    b
7    c
8    c
dtype: object

* **unique**: returns an array of the unique values in a Series:

In [34]:
obj.unique()

array(['c', 'a', 'd', 'b'], dtype=object)

* **value_counts**: computes a Series containing value frequencies:

In [35]:
obj.value_counts()

c    3
a    3
b    2
d    1
dtype: int64

* **isin**: is responsible for vectorized set membership and can be very useful in filtering a data set down to a subset of values in a Series or column in a DataFrame:

In [36]:
mask = obj.isin(['b', 'c'])
mask

0     True
1    False
2    False
3    False
4    False
5     True
6     True
7     True
8     True
dtype: bool

In [37]:
obj[mask]

0    c
5    b
6    b
7    c
8    c
dtype: object

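With a DataFrame, we can compute a histogram across several related columns by passing **pd.value_counts** to **apply**: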

In [46]:
data = DataFrame({
    'Qu1': [1, 3, 4, 3, 4],
    'Qu2': [2, 3, 1, 2, 3],
    'Qu3': [1, 5, 2, 4, 4]
    }, index=['a', 'b', 'c', 'd', 'e'])
data

Unnamed: 0,Qu1,Qu2,Qu3
a,1,2,1
b,3,3,5
c,4,1,2
d,3,2,4
e,4,3,4


In [49]:
result = data.apply(pd.value_counts, axis=0)
result

Unnamed: 0,Qu1,Qu2,Qu3
1,1.0,1.0,1.0
2,,2.0,1.0
3,2.0,2.0,
4,2.0,,2.0
5,,,1.0
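
The NaN entries in the result mark values that never occur in a column. Below is a hedged variant of the same computation that calls the `value_counts` method on each column (recent pandas releases warn that the top-level `pd.value_counts` function is deprecated) and replaces the missing counts with 0.

```python
import pandas as pd

data = pd.DataFrame(
    {
        'Qu1': [1, 3, 4, 3, 4],
        'Qu2': [2, 3, 1, 2, 3],
        'Qu3': [1, 5, 2, 4, 4],
    },
    index=['a', 'b', 'c', 'd', 'e'],
)

# Count values per column, then treat "never seen" as a count of 0
result = data.apply(lambda column: column.value_counts()).fillna(0)

print(result)
```

The row labels of `result` are the distinct values found across all columns, and each cell holds how often that value appears in the column.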


# Handling Missing Data
> Missing data is common in most data analysis applications. One of the goals in designing
pandas was to make working with missing data as painless as possible.

**What is missing data?**

pandas uses the floating-point value NaN (Not a Number) to represent missing data. Consider a Series that contains a NaN:

In [50]:
string_data = Series(['aardvark', 'artichoke', np.nan, 'avocado'])
string_data

0     aardvark
1    artichoke
2          NaN
3      avocado
dtype: object

In [51]:
string_data.isnull()

0    False
1    False
2     True
3    False
dtype: bool

The built-in Python **None** value is also treated as NA in object arrays:

In [52]:
string_data[0] = None
string_data.isnull()

0     True
1    False
2     True
3    False
dtype: bool

## Filtering Out Missing Data

* **dropna**: On a Series, it returns the Series with only the non-null data and index values:

In [53]:
from numpy import nan as NA

In [54]:
data = Series([1, NA, 2, NA, NA, 5])

In [55]:
data.dropna()

0    1.0
2    2.0
5    5.0
dtype: float64

* **notnull**: We can use a mask to filter.

In [56]:
data[data.notnull()]

0    1.0
2    2.0
5    5.0
dtype: float64

With DataFrame objects, these are a bit more complex. You may want to drop rows
or columns which are all NA or just those containing any NAs. dropna by default drops
any row containing a missing value:

In [57]:
data = DataFrame(
    [[1., 6.5, 3.], [1., NA, NA], [NA, NA, NA], [NA, 6.5, 3.]]
)
data

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


In [59]:
cleaned = data.dropna()
cleaned

Unnamed: 0,0,1,2
0,1.0,6.5,3.0


Passing **how='all'** will only drop rows that are all NA:

In [60]:
data.dropna(how='all')

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
3,,6.5,3.0


Dropping columns in the same way is only a matter of passing **axis=1** :

In [61]:
data[4] = NA

In [62]:
data

Unnamed: 0,0,1,2,4
0,1.0,6.5,3.0,
1,1.0,,,
2,,,,
3,,6.5,3.0,


In [64]:
data.dropna(axis=1, how='all')

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


* **thresh**: keep only the rows that contain at least a given number of non-NA values:

In [77]:
df = DataFrame(np.random.randn(9,3))
df

Unnamed: 0,0,1,2
0,0.080671,0.404659,-1.087925
1,0.049271,0.095474,0.856017
2,0.390515,-0.41889,0.780404
3,-1.362594,0.434078,-0.210431
4,0.144691,-1.225772,0.560421
5,0.013337,0.340196,-0.000784
6,-0.681978,0.150758,-0.449414
7,-0.708153,-0.252117,1.791062
8,0.088101,-3.042993,-0.591641


In [78]:
# .loc slices by label and includes the end label, so rows 0 through 4 are set
df.loc[:4, 1] = NA
df

Unnamed: 0,0,1,2
0,0.080671,,-1.087925
1,0.049271,,0.856017
2,0.390515,,0.780404
3,-1.362594,,-0.210431
4,0.144691,,0.560421
5,0.013337,0.340196,-0.000784
6,-0.681978,0.150758,-0.449414
7,-0.708153,-0.252117,1.791062
8,0.088101,-3.042993,-0.591641


In [79]:
df.loc[:2, 2] = NA
df

Unnamed: 0,0,1,2
0,0.080671,,
1,0.049271,,
2,0.390515,,
3,-1.362594,,-0.210431
4,0.144691,,0.560421
5,0.013337,0.340196,-0.000784
6,-0.681978,0.150758,-0.449414
7,-0.708153,-0.252117,1.791062
8,0.088101,-3.042993,-0.591641


In [87]:
df.dropna(thresh=3)

Unnamed: 0,0,1,2
5,0.013337,0.340196,-0.000784
6,-0.681978,0.150758,-0.449414
7,-0.708153,-0.252117,1.791062
8,0.088101,-3.042993,-0.591641
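Because the frame above is random, a deterministic sketch may make the meaning of `thresh` easier to check by hand. The column names `a`, `b`, `c` and the variable names are illustrative assumptions. The example counts non-NA values per row and shows that `dropna(thresh=n)` keeps exactly the rows whose count is at least `n`.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0],
    'b': [np.nan, np.nan, 5.0, 6.0],
    'c': [np.nan, 7.0, np.nan, 8.0],
})

# Number of non-NA values in each row.
counts = frame.notnull().sum(axis=1)

keep2 = frame.dropna(thresh=2)   # rows with at least 2 non-NA values
keep3 = frame.dropna(thresh=3)   # rows with at least 3 non-NA values

# thresh is equivalent to filtering on the per-row count.
same = keep2.equals(frame[counts >= 2])

print(counts.tolist(), keep2.index.tolist(), keep3.index.tolist(), same)
```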


## Filling in Missing Data

* **fillna**: replaces missing values with a constant:

In [89]:
df.fillna(0)

Unnamed: 0,0,1,2
0,0.080671,0.0,0.0
1,0.049271,0.0,0.0
2,0.390515,0.0,0.0
3,-1.362594,0.0,-0.210431
4,0.144691,0.0,0.560421
5,0.013337,0.340196,-0.000784
6,-0.681978,0.150758,-0.449414
7,-0.708153,-0.252117,1.791062
8,0.088101,-3.042993,-0.591641


Calling **fillna** with a dict lets you use a different fill value for each column:

In [91]:
df.fillna({1: 0, 2: -1})

Unnamed: 0,0,1,2
0,0.080671,0.0,-1.0
1,0.049271,0.0,-1.0
2,0.390515,0.0,-1.0
3,-1.362594,0.0,-0.210431
4,0.144691,0.0,0.560421
5,0.013337,0.340196,-0.000784
6,-0.681978,0.150758,-0.449414
7,-0.708153,-0.252117,1.791062
8,0.088101,-3.042993,-0.591641
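A per-column fill does not have to be a hand-written dict. As a hedged sketch, the example below fills each column with its own mean by passing the Series returned by `mean()` to `fillna`. The column names `x` and `y` are made up for illustration, and the last line checks that `fillna` returned new objects rather than changing the original.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({
    'x': [1.0, np.nan, 3.0],
    'y': [np.nan, 4.0, 8.0],
})

# A dict maps each column label to its own fill value.
by_dict = frame.fillna({'x': 0, 'y': -1})

# A Series indexed by column label works the same way; here, column means.
by_mean = frame.fillna(frame.mean())

# The original frame still has its two missing values.
still_missing = int(frame.isnull().sum().sum())

print(by_dict, by_mean, still_missing, sep='\n')
```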


**fillna** returns a new object by default, but the existing object can be modified in place by passing **inplace=True**:

In [92]:
df.fillna(0, inplace=True)
df

Unnamed: 0,0,1,2
0,0.080671,0.0,0.0
1,0.049271,0.0,0.0
2,0.390515,0.0,0.0
3,-1.362594,0.0,-0.210431
4,0.144691,0.0,0.560421
5,0.013337,0.340196,-0.000784
6,-0.681978,0.150758,-0.449414
7,-0.708153,-0.252117,1.791062
8,0.088101,-3.042993,-0.591641


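Missing values can also be filled by propagating the last valid observation forward (forward fill), optionally capped with **limit**: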

In [93]:
df = DataFrame(np.random.randn(6, 3))
df.loc[2:, 1] = NA
df.loc[4:, 2] = NA
df

Unnamed: 0,0,1,2
0,0.780441,-0.365936,0.783645
1,1.101566,0.286978,1.874623
2,0.342339,,-0.126176
3,-1.168216,,-2.28254
4,0.579346,,
5,-1.124763,,


In [94]:
# forward fill: propagate the last valid value down each column
df.ffill()

Unnamed: 0,0,1,2
0,0.780441,-0.365936,0.783645
1,1.101566,0.286978,1.874623
2,0.342339,0.286978,-0.126176
3,-1.168216,0.286978,-2.28254
4,0.579346,0.286978,-2.28254
5,-1.124763,0.286978,-2.28254


In [95]:
# fill at most 2 consecutive missing values in each gap
df.ffill(limit=2)

Unnamed: 0,0,1,2
0,0.780441,-0.365936,0.783645
1,1.101566,0.286978,1.874623
2,0.342339,0.286978,-0.126176
3,-1.168216,0.286978,-2.28254
4,0.579346,,-2.28254
5,-1.124763,,-2.28254
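The effect of `limit` is easier to see on a short, deterministic Series. This sketch uses illustrative names and also shows `bfill`, the backward-filling counterpart, which the notebook does not cover.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

full = s.ffill()             # carry 1.0 forward through the whole gap
limited = s.ffill(limit=2)   # fill only the first two entries of the gap
back = s.bfill()             # carry 5.0 backward instead

print(full.tolist())
print(limited.isnull().tolist())
print(back.tolist())
```

With `limit=2`, the third missing entry in the gap stays NaN, which matches the behaviour seen in the DataFrame output above.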


# Hierarchical Indexing

> Hierarchical indexing is an important feature of pandas enabling you to have multiple
(two or more) index levels on an axis. It provides a way to work with higher-dimensional
data in a lower-dimensional form.

In [97]:
data = Series(
    np.random.randn(10), 
    index = [['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd'], [1,2,3,1,2,3,1,2,2,3]]
)
data

a  1   -0.078725
   2   -0.977198
   3    0.039397
b  1   -0.172503
   2   -0.186141
   3   -0.146769
c  1   -0.953576
   2    0.258952
d  2   -0.075312
   3   -1.525014
dtype: float64

In [98]:
data.index

MultiIndex(levels=[[u'a', u'b', u'c', u'd'], [1, 2, 3]],
           labels=[[0, 0, 0, 1, 1, 1, 2, 2, 3, 3], [0, 1, 2, 0, 1, 2, 0, 1, 1, 2]])

With a hierarchically-indexed object, so-called partial indexing is possible, enabling
you to concisely select subsets of the data:

In [99]:
data['b']

1   -0.172503
2   -0.186141
3   -0.146769
dtype: float64

In [100]:
data['b':'c']

b  1   -0.172503
   2   -0.186141
   3   -0.146769
c  1   -0.953576
   2    0.258952
dtype: float64

In [101]:
data.loc[['b', 'd']]

b  1   -0.172503
   2   -0.186141
   3   -0.146769
d  2   -0.075312
   3   -1.525014
dtype: float64

Selection is also possible on the inner level:

In [103]:
data[:, 2]

a   -0.977198
b   -0.186141
c    0.258952
d   -0.075312
dtype: float64
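The selections above can be reproduced with fixed values so the results are checkable. This is a sketch using `np.arange(10)` instead of random data; the names `outer`, `inner`, `single`, and `inner_xs` are illustrative. It also shows `xs`, which selects on a named or numbered level explicitly.

```python
import numpy as np
import pandas as pd

data = pd.Series(
    np.arange(10),
    index=[['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd'],
           [1, 2, 3, 1, 2, 3, 1, 2, 2, 3]],
)

outer = data.loc['b']              # partial indexing on the outer level
inner = data.loc[:, 2]             # every outer key, inner label 2
single = data.loc[('c', 1)]        # one element addressed by a full key
inner_xs = data.xs(2, level=1)     # same inner-level selection via xs

print(outer.tolist(), inner.to_dict(), single)
```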

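* **unstack**: rearranges the data into a DataFrame by pivoting the inner index level into columns (**stack** is the inverse operation):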

In [105]:
data.unstack()

Unnamed: 0,1,2,3
a,-0.078725,-0.977198,0.039397
b,-0.172503,-0.186141,-0.146769
c,-0.953576,0.258952,
d,,-0.075312,-1.525014


In [106]:
data.unstack().stack()

a  1   -0.078725
   2   -0.977198
   3    0.039397
b  1   -0.172503
   2   -0.186141
   3   -0.146769
c  1   -0.953576
   2    0.258952
d  2   -0.075312
   3   -1.525014
dtype: float64
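A deterministic round trip makes the reshaping concrete. In this sketch (illustrative names, fixed values), `unstack` produces a 4 x 3 table with NaN in the two positions that have no matching index pair. An explicit `dropna()` is applied after `stack` because, as far as I am aware, whether `stack` itself discards those NaN placeholders depends on the pandas version.

```python
import numpy as np
import pandas as pd

data = pd.Series(
    np.arange(10),
    index=[['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd'],
           [1, 2, 3, 1, 2, 3, 1, 2, 2, 3]],
)

wide = data.unstack()                  # inner level becomes the columns
n_holes = int(wide.isnull().sum().sum())  # ('c', 3) and ('d', 1) are absent

# Back to long form; dropna() removes the placeholders if stack kept them.
long = wide.stack().dropna()

print(wide.shape, n_holes, len(long))
```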

With a DataFrame, both axes can have a hierarchical index:

In [108]:
frame = DataFrame(
    np.arange(12).reshape(4,3), 
    index=[['a','a','b','b'], [1,2,1,2]], 
    columns=[
        ['Ohio', 'Ohio', 'Colorado'],
        ['Green', 'Red', 'Green']
    ]
)
frame

Unnamed: 0_level_0,Unnamed: 1_level_0,Ohio,Ohio,Colorado
Unnamed: 0_level_1,Unnamed: 1_level_1,Green,Red,Green
a,1,0,1,2
a,2,3,4,5
b,1,6,7,8
b,2,9,10,11


The hierarchical levels can have names (as strings or any Python objects):

In [109]:
frame.index.names = ['key1', 'key2']

In [110]:
frame.columns.names = ['state', 'color']

In [111]:
frame

Unnamed: 0_level_0,state,Ohio,Ohio,Colorado
Unnamed: 0_level_1,color,Green,Red,Green
key1,key2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
a,1,0,1,2
a,2,3,4,5
b,1,6,7,8
b,2,9,10,11


With partial column indexing, we can similarly select groups of columns:

In [114]:
frame['Ohio']

Unnamed: 0_level_0,color,Green,Red
key1,key2,Unnamed: 2_level_1,Unnamed: 3_level_1
a,1,0,1
a,2,3,4
b,1,6,7
b,2,9,10
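The nested lists used above are converted to a `MultiIndex` behind the scenes. As a sketch, the same frame can be built with `pd.MultiIndex.from_arrays`, which lets the level names be set at construction time. The variable names `rows`, `cols`, `ohio`, and `ohio_red` are illustrative.

```python
import numpy as np
import pandas as pd

rows = pd.MultiIndex.from_arrays(
    [['a', 'a', 'b', 'b'], [1, 2, 1, 2]], names=['key1', 'key2']
)
cols = pd.MultiIndex.from_arrays(
    [['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']],
    names=['state', 'color'],
)
frame = pd.DataFrame(np.arange(12).reshape(4, 3), index=rows, columns=cols)

ohio = frame['Ohio']               # partial key: all columns under 'Ohio'
ohio_red = frame[('Ohio', 'Red')]  # full key: a single column as a Series

print(list(ohio.columns), ohio_red.tolist())
```

An index built this way can be reused for several frames that share the same structure.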


## Reordering and Sorting Levels

* **swaplevel**: takes two level numbers or names and
returns a new object with the levels interchanged

In [117]:
frame.swaplevel('key1', 'key2')

Unnamed: 0_level_0,state,Ohio,Ohio,Colorado
Unnamed: 0_level_1,color,Green,Red,Green
key2,key1,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
1,a,0,1,2
2,a,3,4,5
1,b,6,7,8
2,b,9,10,11


* **sort_index** with the **level** argument (called **sortlevel** in older pandas releases): sorts the data by the values in a single level

In [121]:
frame.sort_index(level=1)

Unnamed: 0_level_0,state,Ohio,Ohio,Colorado
Unnamed: 0_level_1,color,Green,Red,Green
key1,key2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
a,1,0,1,2
b,1,6,7,8
a,2,3,4,5
b,2,9,10,11


In [123]:
frame.swaplevel(1, 0).sort_index(level=0)

Unnamed: 0_level_0,state,Ohio,Ohio,Colorado
Unnamed: 0_level_1,color,Green,Red,Green
key2,key1,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
1,a,0,1,2
1,b,6,7,8
2,a,3,4,5
2,b,9,10,11
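Swapping levels only relabels the index; the rows stay in their original order until they are sorted. The sketch below checks this on a small frame with hypothetical column names `x`, `y`, `z`, using `is_monotonic_increasing` to test whether the index is in lexicographic order.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(12).reshape(4, 3),
    index=pd.MultiIndex.from_arrays(
        [['a', 'a', 'b', 'b'], [1, 2, 1, 2]], names=['key1', 'key2']
    ),
    columns=['x', 'y', 'z'],
)

swapped = frame.swaplevel('key1', 'key2')   # levels exchanged, rows unmoved
ordered = swapped.sort_index(level=0)       # now sorted by key2, then key1

print(swapped.index.tolist(), swapped.index.is_monotonic_increasing)
print(ordered.index.tolist(), ordered.index.is_monotonic_increasing)
```

Selection on a hierarchical index generally performs better when the index is sorted starting from the outermost level, so sorting after a swap is a common pairing.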


## Summary Statistics by Level

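* Aggregating by **level**: many descriptive and summary statistics can be computed per index level on either axis by grouping on that level:

In [125]:
frame.groupby(level='key2').sum()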

state,Ohio,Ohio,Colorado
color,Green,Red,Green
key2,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
1,6,8,10
2,12,14,16


In [126]:
# group the column level: transpose, sum per color, then transpose back
frame.T.groupby(level='color').sum().T

Unnamed: 0_level_0,color,Green,Red
key1,key2,Unnamed: 2_level_1,Unnamed: 3_level_1
a,1,2,1
a,2,8,4
b,1,14,7
b,2,20,10
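To make clear what a per-level sum computes, this sketch rebuilds the frame and compares a grouped sum with a manual one obtained by selecting a single `key2` value with `xs`. It relies on `groupby(level=...)`; the variable names are illustrative, and the transpose trick for column levels is one possible approach rather than the only one.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(12).reshape(4, 3),
    index=pd.MultiIndex.from_arrays(
        [['a', 'a', 'b', 'b'], [1, 2, 1, 2]], names=['key1', 'key2']
    ),
    columns=pd.MultiIndex.from_arrays(
        [['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']],
        names=['state', 'color'],
    ),
)

# Sum rows that share the same key2 value.
by_key2 = frame.groupby(level='key2').sum()

# Manual check: pick the rows with key2 == 1 and add them up.
manual = frame.xs(1, level='key2').sum()

# Sum columns that share the same color by grouping the transposed frame.
by_color = frame.T.groupby(level='color').sum().T

print(by_key2.values.tolist(), manual.tolist(), by_color.values.tolist())
```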
