# Chapter 5: A Concise Guide to pandas
> Pandas will be the primary library of interest throughout much of the rest of the book.
It contains high-level data structures and manipulation tools designed to make data
analysis fast and easy in Python. pandas is built on top of NumPy and makes it easy to
use in NumPy-centric applications.

Overview:
* Introduction to pandas: Data Structures
* Essential functionality
* Summarizing and computing descriptive statistics
* Handling missing data
* Hierarchical Indexing
* Other pandas topics

In [3]:
from pandas import Series, DataFrame

In [4]:
import pandas as pd
import numpy as np

# Introduction to pandas Data Structures
> To work well with pandas, we need to be comfortable with its two main data structures: **Series** and **DataFrame**.

## Series
> A Series is a one-dimensional array-like object containing an array of data (of any
NumPy data type) and an associated array of data labels, called its index.

In [5]:
obj = Series([4, 5, -7, 3])

In [6]:
obj

0    4
1    5
2   -7
3    3
dtype: int64

A Series exposes two attributes: **index** (the labels shown on the left) and **values** (the data shown on the right).

In [7]:
obj.index

RangeIndex(start=0, stop=4, step=1)

In [8]:
obj.values

array([ 4,  5, -7,  3], dtype=int64)

We can also create a Series with an index that we define ourselves:

In [9]:
obj2 = Series([4, 7, -5, 3], index=['a', 'b', 'c', 'd'])
obj2

a    4
b    7
c   -5
d    3
dtype: int64

In [10]:
obj2.index

Index([u'a', u'b', u'c', u'd'], dtype='object')

As with a NumPy ndarray, we can select and set single values through the index:

In [11]:
obj2['a']

4

In [12]:
obj2['c'] = -10

In [13]:
obj2

a     4
b     7
c   -10
d     3
dtype: int64

In [14]:
obj[:2]

0    4
1    5
dtype: int64

NumPy array operations, such as filtering with a boolean array, scalar multiplication,
or applying math functions, will preserve the index-value link:

In [15]:
obj2[obj2 > 0]

a    4
b    7
d    3
dtype: int64

In [16]:
obj2 * 2

a     8
b    14
c   -20
d     6
dtype: int64
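
The third case mentioned above, applying a math function, is not shown in the cells. A minimal sketch, using a made-up Series `s` with the same values as `obj2`, suggests that NumPy functions keep the labels attached as well:

```python
import numpy as np
import pandas as pd

# Illustrative Series (same values as obj2 above).
s = pd.Series([4, 7, -10, 3], index=['a', 'b', 'c', 'd'])

# A NumPy ufunc applied to a Series returns a Series with the same index.
absolute = np.abs(s)
print(absolute)

# Boolean filtering keeps the labels of the surviving elements.
positive = s[s > 0]
print(positive.index.tolist())  # ['a', 'b', 'd']
```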

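As with an ndarray, a slice of a Series is a view on the original data in the pandas version used here, so modifying the slice also modifies the original. (pandas releases with Copy-on-Write enabled no longer propagate such changes.)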

In [17]:
import numpy as np
arr = np.array([1,2,3,4,5,6])
sub_arr = arr[:3]
sub_arr[:] = 1
arr

array([1, 1, 1, 4, 5, 6])

In [18]:
sub_series = obj2[:3]
sub_series['a'] = 10000
obj2

a    10000
b        7
c      -10
d        3
dtype: int64

A Series can be substituted for a dict in many functions that expect one. For example, the `in` operator tests for index labels:

In [19]:
'a' in obj2

True

In [20]:
'f' in obj2

False

We can create a Series by passing a dict. In the output below, the index holds the dict's keys in sorted order:

In [21]:
sdata = {'Ohio': 35000, 'Texas': 70000, 'Conecticut':40000, 'New York': 100000}

In [22]:
obj3 = Series(sdata)
obj3

Conecticut     40000
New York      100000
Ohio           35000
Texas          70000
dtype: int64

We can also pass an explicit index along with the dict. Where an index label matches a dict key, the corresponding value is used; otherwise NaN is assigned.

In [23]:
states = ['California', 'New York', 'Ohio', 'Texas']
obj4 = Series(sdata, index=states)
obj4

California         NaN
New York      100000.0
Ohio           35000.0
Texas          70000.0
dtype: float64

pandas provides two functions to detect missing data: **isnull** and **notnull**.

In [24]:
pd.isnull(obj4)

California     True
New York      False
Ohio          False
Texas         False
dtype: bool

In [25]:
pd.notnull(obj4)

California    False
New York       True
Ohio           True
Texas          True
dtype: bool

Series also has the same functions as instance methods:


In [26]:
obj4.isnull()

California     True
New York      False
Ohio          False
Texas         False
dtype: bool

In [27]:
obj4.notnull()

California    False
New York       True
Ohio           True
Texas          True
dtype: bool

We will discuss handling missing data in more detail later in this chapter.

**Series automatically aligns differently indexed data in arithmetic operations:**

In [28]:
obj3 + obj4

California         NaN
Conecticut         NaN
New York      200000.0
Ohio           70000.0
Texas         140000.0
dtype: float64

**Series name attribute**
> Both the Series itself and its index have a name attribute.

In [29]:
obj4.name = 'population'
obj4.name

'population'

In [30]:
obj4.index.name = 'state'
obj4.index.name

'state'

## DataFrame

> A DataFrame represents a tabular, spreadsheet-like data structure containing an ordered
collection of columns, each of which can be a different value type (numeric, string,
boolean, etc.). The DataFrame has both a row and column index; it can be thought of as a
dict of Series (all sharing the same index). Compared with other such DataFrame-like
structures you may have used before (like R’s data.frame), row-oriented and
column-oriented operations in DataFrame are treated roughly symmetrically. Under the
hood, the data is stored as one or more two-dimensional blocks rather than a list, dict,
or some other collection of one-dimensional arrays.

* Create a DataFrame from a dict of equal-length lists:

In [31]:
data = {
    'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
    'year': [2000, 2001, 2002, 2000, 2001],
    'pop': [1.5, 1.7, 3.6, 2.4, 2.9]
}
frame = DataFrame(data)
frame

| | pop | state | year |
|---|-----|-------|------|
| 0 | 1.5 | Ohio | 2000 |
| 1 | 1.7 | Ohio | 2001 |
| 2 | 3.6 | Ohio | 2002 |
| 3 | 2.4 | Nevada | 2000 |
| 4 | 2.9 | Nevada | 2001 |


We can specify the order of the columns:

In [32]:
DataFrame(data, columns=['year', 'state', 'pop'])

| | year | state | pop |
|---|------|-------|-----|
| 0 | 2000 | Ohio | 1.5 |
| 1 | 2001 | Ohio | 1.7 |
| 2 | 2002 | Ohio | 3.6 |
| 3 | 2000 | Nevada | 2.4 |
| 4 | 2001 | Nevada | 2.9 |


As with Series, we can pass an index to a DataFrame. If we pass a column that isn't contained in the data, it appears with NaN values:

In [33]:
frame2 = DataFrame(data, 
                   columns=['year', 'state', 'pop', 'debt'], 
                   index=['one', 'two','three', 'four', 'five']
                  )
frame2

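| | year | state | pop | debt |
|---|------|-------|-----|------|
| one | 2000 | Ohio | 1.5 | NaN |
| two | 2001 | Ohio | 1.7 | NaN |
| three | 2002 | Ohio | 3.6 | NaN |
| four | 2000 | Nevada | 2.4 | NaN |
| five | 2001 | Nevada | 2.9 | NaN |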


A column in a DataFrame can be retrieved as a Series either by dict-like notation or by
attribute:

In [34]:
frame2['state']

one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
Name: state, dtype: object

In [35]:
frame2.state

Rows can be retrieved by position or by name with the **ix** indexing field. The result is again a Series:

In [36]:
frame2.ix['one']

year     2000
state    Ohio
pop       1.5
debt      NaN
Name: one, dtype: object
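
The **ix** indexer was deprecated and then removed in pandas 1.0. A minimal sketch of the same row lookup, assuming a recent pandas release, uses **loc** for labels and **iloc** for integer positions. The small frame below is made up for illustration:

```python
import pandas as pd

# Illustrative data; not the frame2 object from the cells above.
frame = pd.DataFrame(
    {'year': [2000, 2001, 2002],
     'state': ['Ohio', 'Ohio', 'Nevada'],
     'pop': [1.5, 1.7, 2.4]},
    index=['one', 'two', 'three'],
)

row_by_label = frame.loc['one']    # select a row by index label
row_by_position = frame.iloc[0]    # select a row by integer position

print(row_by_label)
print(row_by_label.equals(row_by_position))  # True
```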

Columns can be modified by assignment. If we assign a list or an array to a column, its length must match the length of the DataFrame:

In [37]:
frame2.debt = np.arange(5)
frame2

| | year | state | pop | debt |
|---|------|-------|-----|------|
| one | 2000 | Ohio | 1.5 | 0 |
| two | 2001 | Ohio | 1.7 | 1 |
| three | 2002 | Ohio | 3.6 | 2 |
| four | 2000 | Nevada | 2.4 | 3 |
| five | 2001 | Nevada | 2.9 | 4 |


If we assign a Series, its labels are aligned with the DataFrame's index, and NaN is inserted wherever a label is missing:


In [38]:
val = Series([-1.2, 1.5, 1.7], index=['one', 'two', 'five'])
frame2['debt'] = val
frame2

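| | year | state | pop | debt |
|---|------|-------|-----|------|
| one | 2000 | Ohio | 1.5 | -1.2 |
| two | 2001 | Ohio | 1.7 | 1.5 |
| three | 2002 | Ohio | 3.6 | NaN |
| four | 2000 | Nevada | 2.4 | NaN |
| five | 2001 | Nevada | 2.9 | 1.7 |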
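
The alignment works in both directions. As a small sketch with made-up data, labels in the assigned Series that do not exist in the DataFrame are ignored, and DataFrame rows without a matching label receive NaN:

```python
import numpy as np
import pandas as pd

# Illustrative frame with three labelled rows.
frame = pd.DataFrame({'pop': [1.5, 1.7, 3.6]}, index=['one', 'two', 'three'])

# 'one' and 'three' match; 'four' has no matching row and is dropped,
# while 'two' has no value in the Series and becomes NaN.
frame['debt'] = pd.Series([-1.2, 1.7, 9.9], index=['one', 'three', 'four'])
print(frame)
```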


Assigning to a column that doesn't exist creates a new column:

In [39]:
frame2['isOhio'] = frame2['state'] == 'Ohio'
frame2

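| | year | state | pop | debt | isOhio |
|---|------|-------|-----|------|--------|
| one | 2000 | Ohio | 1.5 | -1.2 | True |
| two | 2001 | Ohio | 1.7 | 1.5 | True |
| three | 2002 | Ohio | 3.6 | NaN | True |
| four | 2000 | Nevada | 2.4 | NaN | False |
| five | 2001 | Nevada | 2.9 | 1.7 | False |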


The **del** keyword deletes a column:

In [40]:
del frame2['isOhio']
frame2

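| | year | state | pop | debt |
|---|------|-------|-----|------|
| one | 2000 | Ohio | 1.5 | -1.2 |
| two | 2001 | Ohio | 1.7 | 1.5 |
| three | 2002 | Ohio | 3.6 | NaN |
| four | 2000 | Nevada | 2.4 | NaN |
| five | 2001 | Nevada | 2.9 | 1.7 |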


Another common form of data is a nested dict. If it is passed to DataFrame, pandas interprets the outer dict keys as the columns and the inner
keys as the row indices:

In [41]:
pop = {
    'Ohio': {
        2001: 2.4,
        2002: 2.9
    }, 
    'Nevada': {
        2000: 1.5,
        2001: 1.7,
        2002: 3.6
    }
}
pop

{'Nevada': {2000: 1.5, 2001: 1.7, 2002: 3.6}, 'Ohio': {2001: 2.4, 2002: 2.9}}

In [42]:
frame3 = DataFrame(pop)
frame3

| | Nevada | Ohio |
|---|--------|------|
| 2000 | 1.5 | NaN |
| 2001 | 1.7 | 2.4 |
| 2002 | 3.6 | 2.9 |


We can transpose the result: 

In [43]:
frame3.T

| | 2000 | 2001 | 2002 |
|---|------|------|------|
| Nevada | 1.5 | 1.7 | 3.6 |
| Ohio | NaN | 2.4 | 2.9 |


We can set the **name** attributes of the index and the columns:

In [44]:
frame3.index.name = 'year'
frame3.columns.name = 'state'
frame3.T

| year | 2000 | 2001 | 2002 |
|------|------|------|------|
| **state** | | | |
| Nevada | 1.5 | 1.7 | 3.6 |
| Ohio | NaN | 2.4 | 2.9 |


As with Series, the **values** attribute returns the data, here as a 2D ndarray:

In [45]:
frame3.values

array([[ 1.5,  nan],
       [ 1.7,  2.4],
       [ 3.6,  2.9]])

## Index Objects

> Pandas’s Index objects are responsible for holding the axis labels and other metadata

In [46]:
obj = Series(np.arange(3), index=['a','b','c'])

In [47]:
obj

a    0
b    1
c    2
dtype: int32

In [48]:
index = obj.index
index

Index([u'a', u'b', u'c'], dtype='object')

Index objects are immutable and thus can’t be modified by the user:

In [49]:
index[1] = 'd'

TypeError: Index does not support mutable operations
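
A short sketch of what immutability means in practice, using an illustrative `index` object: item assignment fails, while set-like operations such as **union** return a new Index and leave the original untouched.

```python
import pandas as pd

index = pd.Index(['a', 'b', 'c'])

# Item assignment is rejected because Index objects are immutable.
raised = False
try:
    index[1] = 'd'
except TypeError as exc:
    raised = True
    print('TypeError:', exc)

# Set-like operations build a new Index instead of changing the old one.
combined = index.union(pd.Index(['c', 'd']))
print(list(combined))   # ['a', 'b', 'c', 'd']
print(list(index))      # ['a', 'b', 'c']
print('b' in index)     # True
```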

# Essential Functionality

## Reindexing

> Reindexing creates a new Series or DataFrame whose data conforms to a new index.

In [None]:
obj = Series(np.random.randn(5), index=['a','b','c','d','e'])
obj

In [None]:
obj.reindex(['a','b','c','d','e','g'], fill_value=0)

* **ffill**: Fill (or carry) values forward

In [None]:
obj3 = Series(['blue', 'purple', 'yellow'], index=[0,2,4])
obj3

In [None]:
obj3.reindex(range(6), method='ffill')

* **bfill**: Fill (or carry) values backward

In [None]:
obj3.reindex(range(6), method='bfill')
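
Since the cells above were not executed, here is a small sketch of what the two fill methods are expected to produce. The name `colors` is illustrative and holds the same data as `obj3`:

```python
import pandas as pd

colors = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

# ffill copies the last known value forward into the new labels 1, 3, 5.
forward = colors.reindex(range(6), method='ffill')
# bfill copies the next known value backward; label 5 has no later value.
backward = colors.reindex(range(6), method='bfill')

print(forward.tolist())
# ['blue', 'blue', 'purple', 'purple', 'yellow', 'yellow']
print(backward.tolist()[:5])
# ['blue', 'purple', 'purple', 'yellow', 'yellow']
print(pd.isna(backward.loc[5]))  # True
```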

With DataFrame, reindex can alter either the (row) index, columns, or both. When
passed just a sequence, the rows are reindexed in the result:

In [None]:
frame = DataFrame(np.arange(9).reshape((3,3)), index=['a', 'c', 'd'], columns=['Ohio', 'Texas', 'California'] )

In [None]:
frame

In [None]:
frame.reindex(['a','b','c','d'])

The columns can be reindexed using the columns keyword:

In [None]:
states = ['Texas', 'Utah', 'California', 'Ohio']
frame.reindex(columns=states)

As with Series, we can **ffill** or **bfill** when reindexing a DataFrame:

In [None]:
frame.reindex(index=['a','b','c','d'], columns=states, method='ffill')

## Dropping entries from an axis

> Dropping one or more entries from an axis is easy if you have an index array or list
without those entries. As that can require a bit of munging and set logic, the drop
method will return a new object with the indicated value or values deleted from an axis:

In [None]:
obj = Series(np.arange(5), index=['a','b','c','d','e'])
obj

In [None]:
new_obj = obj.drop('c')

In [None]:
new_obj

In [None]:
obj.drop(['d','c'])

With DataFrame, index values can be deleted from either axis:

In [None]:
data = DataFrame(
    np.arange(16).reshape(4,4),
    index=['Ohio', 'Colorado', 'Utah', 'New York'],
    columns = ['one', 'two', 'three', 'four']
)
data

In [None]:
data.index

In [None]:
data.columns

In [None]:
data.drop(['Ohio', 'Utah'], axis=0)

In [None]:
data.drop(['two', 'four'], axis=1)
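
As a sketch of the same two calls, assuming pandas 0.21 or later, the `index=` and `columns=` keywords can replace the `axis` argument. The example also checks that **drop** returns a new object and leaves the original unchanged:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    np.arange(16).reshape(4, 4),
    index=['Ohio', 'Colorado', 'Utah', 'New York'],
    columns=['one', 'two', 'three', 'four'],
)

fewer_rows = data.drop(index=['Ohio', 'Utah'])     # same as axis=0
fewer_cols = data.drop(columns=['two', 'four'])    # same as axis=1

print(fewer_rows.index.tolist())    # ['Colorado', 'New York']
print(fewer_cols.columns.tolist())  # ['one', 'three']
print(data.shape)                   # (4, 4)
```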

## Indexing, selection, and filtering

> Series indexing ( obj[...] ) works analogously to NumPy array indexing, except you can
use the Series’s index values instead of only integers. Here are some examples:

In [None]:
obj = Series(np.arange(4), index=['a','b','c','d'])

In [None]:
obj[2]

In [None]:
obj[2:4]

In [None]:
obj[['a','b','c']]

In [None]:
obj[obj > 2]

Slicing with labels behaves differently than normal Python slicing in that the endpoint
is inclusive:

In [None]:
obj['b':'c']
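
To make the difference concrete, this sketch puts a label slice next to a positional slice on the same illustrative Series:

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4), index=['a', 'b', 'c', 'd'])

by_label = obj['b':'c']        # label slice: the endpoint 'c' is included
by_position = obj.iloc[1:2]    # positional slice: position 2 is excluded

print(by_label.tolist())       # [1, 2]
print(by_position.tolist())    # [1]
```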

Setting using these methods works just as you would expect:

In [None]:
obj['b':'c'] = 5

In [None]:
obj

Indexing into a DataFrame retrieves one or more columns, either with a single value or with a sequence:

In [None]:
data

In [None]:
data['two']

In [None]:
data[['three', 'one']]

Indexing a DataFrame has a few special cases.
* Selecting rows by slicing or with a boolean array:

In [None]:
data[:2]

In [None]:
data[data['three'] > 5]

This might seem inconsistent to some readers, but this syntax arose out of practicality
and nothing more. Another use case is in indexing with a boolean DataFrame, such as
one produced by a scalar comparison:

In [None]:
data < 5

In [None]:
data[data < 5] = 0
data

For label-indexing on the rows of a DataFrame, there is the special indexing field **ix**. It enables you to select a subset of the rows and columns from a DataFrame:

In [None]:
data.ix['Colorado', ['two' , 'three']]

In [None]:
type(data.ix['Colorado', ['two' , 'three']])

We can also use **ix** as a way to reindex:

In [None]:
data.ix[['Colorado', 'Utah'], [3,0,1]]

**ix** can also be used to select a row by integer position:

In [None]:
data.ix[2]

Or to select a column by label:

In [None]:
data.ix[:,'two']

The table below summarizes the ways to select, index, and reindex in pandas:

|Type | Notes|
|-----|------|
|**obj[val]** | Select a single column or a sequence of columns from the DataFrame. Special-case conveniences: boolean array (filter rows), slice (slice rows), or boolean DataFrame (set values based on some criterion). |
| **obj.ix[val]** | Select a single row or a subset of rows from the DataFrame. |
| **obj.ix[:,val]** | Select a single column or a subset of columns. |
| **obj.ix[val1,val2]** | Select both rows and columns. |
| **reindex** method | Conform one or more axes to new indexes.|
| **xs** method | Select single row or column as a Series by label.|
| **icol, irow** methods (replaced by **iloc**) | Select a single column or row, respectively, as a Series by integer location.|
| **get_value, set_value** methods | Select single value by row and column label. |

In [None]:
new_index = ['New York', 'California', 'Ohio', 'Colorado', 'Los Angles']
data.reindex(new_index, fill_value='missing')

In [None]:
data.get_value('Ohio', 'one')

In [None]:
data.iloc[3]
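
Several entries in the table (**ix**, **get_value**, **set_value**) are no longer available in pandas 1.0 and later. The following sketch, assuming a recent release, shows the label-based (**loc**, **at**) and position-based (**iloc**, **iat**) accessors that cover the same selections. The frame is rebuilt here so the example is self-contained:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    np.arange(16).reshape(4, 4),
    index=['Ohio', 'Colorado', 'Utah', 'New York'],
    columns=['one', 'two', 'three', 'four'],
)

subset = data.loc['Colorado', ['two', 'three']]  # rows and columns by label
third_row = data.iloc[2]                         # row by integer position
column = data.loc[:, 'two']                      # single column by label
cell_by_label = data.at['Ohio', 'one']           # scalar by label
cell_by_position = data.iat[3, 0]                # scalar by position

print(subset.tolist())      # [5, 6]
print(third_row.tolist())   # [8, 9, 10, 11]
print(cell_by_label, cell_by_position)  # 0 12
```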

## Arithmetic and data alignment

One of the most important pandas features is the behavior of arithmetic between objects
with different indexes. When adding together objects, if any index pairs are not
the same, the respective index in the result will be the union of the index pairs. Let’s
look at a simple example:

In [None]:
s1 = Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s1

In [None]:
s2 = Series([-2.1, 1.4, -4.5, 5.6 , 7.6], index=['a', 'c', 'e', 'f', 'g'])
s2

In [None]:
s1 + s2

In the case of DataFrame, alignment is performed on both the rows and the columns:

In [None]:
df1 = DataFrame(
    np.arange(9).reshape(3,3),
    columns=list('bcd'),
    index=['Ohio', 'Texas', 'Colorado']
)
df1

In [None]:
df2 = DataFrame(
    np.arange(12).reshape((4, 3)), 
    columns=list('bde'),
    index=['Utah', 'Ohio', 'Texas', 'Oregon']
)
df2

In [None]:
df1 + df2

**Arithmetic methods with fill values**

In [None]:
df1.add(df2, fill_value=0)

In [None]:
df1.add(df2, fill_value=2)
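
A compact sketch of the difference between `+` and **add** with `fill_value`, using two small made-up frames. Note that a cell missing from both operands stays NaN even with a fill value:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(4.).reshape(2, 2), columns=list('ab'), index=['x', 'y'])
df2 = pd.DataFrame(np.arange(6.).reshape(2, 3), columns=list('abc'), index=['y', 'z'])

plain = df1 + df2                     # NaN wherever either side is missing
filled = df1.add(df2, fill_value=0)   # a missing side is treated as 0

print(plain)
print(filled)
# Only ('x', 'c') is absent from both frames, so it remains NaN in `filled`.
```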

pandas provides the following flexible arithmetic methods:

|Method | Description |
|-------|-------------|
| add | Method for addition|
| sub | Method for subtraction |
| div | Method for division |
| mul | Method for multiplication |

**Operations between DataFrame and Series**

As with **NumPy** arrays, arithmetic between DataFrame and Series is well-defined. First,
as a motivating example, consider the difference between a 2D array and one of its rows:

In [None]:
arr = np.arange(12).reshape(3,4)
arr

In [None]:
arr[0]

In [None]:
arr - arr[0]

This is an example of broadcasting.

Operations between a **DataFrame** and a Series behave similarly:

In [None]:
frame = DataFrame(
    np.arange(12).reshape(4,3), 
    columns=['a', 'b', 'c'], 
    index=['Ohio', 'Texas', 'New York', 'Colorado']
)
frame

In [None]:
series = frame.ix[2]
series

In [None]:
frame - series

If an index value is not found in either the DataFrame’s columns or the Series’s index,
the objects will be reindexed to form the union:

In [None]:
series2 = Series(range(3), index=list('bef'))
series2

In [None]:
frame + series2
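
The two rules above can be sketched on a tiny made-up frame: the Series index is matched against the DataFrame's columns and the operation is broadcast down the rows, and labels found on only one side produce all-NaN columns.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(6.).reshape(3, 2),
    columns=['a', 'b'],
    index=['r0', 'r1', 'r2'],
)

# Subtracting the first row: matched on columns, broadcast down the rows.
diff = frame - frame.iloc[0]
print(diff)

# 'z' is not a column and 'a' is not in the Series, so both become NaN.
other = pd.Series([1.0, 1.0], index=['b', 'z'])
total = frame + other
print(total)
```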

## Function application and mapping

NumPy ufuncs (element-wise array methods) work fine with pandas objects:

In [None]:
frame = DataFrame(
    np.random.randn(4, 3), 
    columns=list('bde'),
    index=['Utah', 'Ohio', 'Texas', 'Oregon']
)
np.abs(frame)

**apply**:
Another frequent operation is applying a function on 1D arrays to each column or row.

In [None]:
f = lambda x: x.max() - x.min()

In [None]:
frame.apply(f)

In [None]:
frame.apply(f, axis=1)

The function passed to **apply** does not have to return a scalar value; it can also return a Series with multiple values:

In [None]:
def f(x):
    return Series([x.min(), x.max()], index=['min', 'max'])
frame.apply(f, axis=0)

We can apply a function to each element of a DataFrame by using **applymap**:

In [None]:
# Named fmt rather than format to avoid shadowing the built-in format()
fmt = lambda x: '%.2f' % x
frame.applymap(fmt)

## Sorting and ranking
* **sort_index**: Sort by row or column index.

In [None]:
obj = Series(np.random.randn(6), index=['a', 'b', 'r', 'c', 'd', 'e'])
obj.sort_index()

With a DataFrame, we can sort by the index on either axis (0 or 1), and we can sort in descending order:

In [None]:
frame

In [None]:
frame.sort_index(axis=0)

In [None]:
frame.sort_index(axis=1, ascending=False)

* **sort_values**: Sort a Series or DataFrame by its values.

With a Series, we call **sort_values()** directly:

In [None]:
obj.sort_values()

With a DataFrame, we call **sort_values()** and pass the column name(s) to sort on via the **by=** argument:

In [None]:
frame.sort_values(by='e')

**Ranking is closely related to sorting.**

In [None]:
obj = Series([7, -5, 7, 4, 2, 0, 4])
obj

In [None]:
obj.rank()

By default, **rank** breaks ties by assigning each group the mean rank. With **method='first'**, ties are instead ranked in the order in which they appear in the data:

In [None]:
obj.rank(method='first')

We can rank in descending order, too:

In [None]:
obj.rank(method='max', ascending=False)
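
Because the cells above have no recorded output, this sketch spells out what the three calls are expected to return for the same data:

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

average = obj.rank()                               # ties share the mean rank
first = obj.rank(method='first')                   # ties ranked by order of appearance
max_desc = obj.rank(method='max', ascending=False) # ties get the highest rank, descending

print(average.tolist())   # [6.5, 1.0, 6.5, 4.5, 3.0, 2.0, 4.5]
print(first.tolist())     # [6.0, 1.0, 7.0, 4.0, 3.0, 2.0, 5.0]
print(max_desc.tolist())  # [2.0, 7.0, 2.0, 4.0, 5.0, 6.0, 4.0]
```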


A DataFrame can compute ranks over the rows or over the columns:

In [None]:
frame = DataFrame({
        'b': [4.3, 7, -3, 2], 
        'a': [0, 1, 0, 1],
        'c': [-2, 5, 8, -2.5]
    })
frame

In [None]:
frame.rank(axis=0, method='first')

## Axis indexes with duplicate values

We can check whether the index values are unique by using **is_unique**:

In [None]:
obj = Series(range(5), index=['a', 'a', 'b', 'b', 'c'])
obj

In [None]:
obj.index.is_unique

Selecting a label that occurs multiple times returns a Series, whereas selecting a label that occurs only once returns a scalar value:

In [None]:
obj['a']

In [None]:
obj['c']

The same logic applies to a DataFrame:

In [None]:
df = DataFrame(np.random.randn(4, 3), index=['a', 'a', 'b', 'c'])
df

In [None]:
df.ix['a']

In [None]:
df.ix['c']

# Summarizing and Computing Descriptive Statistics

> Pandas objects are equipped with a set of common mathematical and statistical methods. Most of these fall into the category of reductions or summary statistics, methods
that extract a single value (like the sum or mean) from a Series or a Series of values from
the rows or columns of a DataFrame. Compared with the equivalent methods of vanilla
NumPy arrays, they are all built from the ground up to exclude missing data. Consider
a small DataFrame:

In [None]:
df = DataFrame(
    [[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
    index=['a', 'b', 'c', 'd'],
    columns=['one', 'two']
)
df

Calling DataFrame’s sum method returns a Series containing column sums:

In [None]:
df.sum()

Or passing axis=1:

In [None]:
df.sum(axis=1)

The **skipna** option defaults to True, so NA values are excluded. Setting it to False makes any NA value propagate into the result:

In [None]:
df.mean(axis=1, skipna=False)
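
A short sketch of the effect of **skipna** on the same small DataFrame. The variable names are illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
    index=['a', 'b', 'c', 'd'],
    columns=['one', 'two'],
)

totals = df.sum()                          # NA values are skipped by default
strict = df.mean(axis=1, skipna=False)     # any NA in a row gives NaN
largest = df.idxmax()                      # label of the maximum per column

print(totals)    # one: 9.25, two: -5.8
print(strict)    # a: NaN, b: 1.3, c: NaN, d: -0.275
print(largest)   # one: 'b', two: 'd'
```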

**idxmin** and **idxmax** return index value of minimum and maximum values.

In [None]:
df.idxmin()

In [None]:
df.idxmax(axis=1)

Other methods, such as **cumsum**, are accumulations:

In [None]:
df.cumsum()

In [None]:
df.describe()

## Correlation and Covariance

Some summary statistics, like correlation and covariance, are computed from pairs of
arguments. Let’s consider some DataFrames of stock prices and volumes obtained from
Yahoo! Finance:

In [None]:
# pandas.io.data was moved to the separate pandas-datareader package
from pandas_datareader import data as web
all_data = {}
for ticker in ['AAPL', 'IBM', 'MSFT', 'GOOG']:
    all_data[ticker] = web.get_data_yahoo(ticker, '1/1/1990', '1/1/2017')
# One column per ticker; .items() works on both Python 2 and Python 3
price = DataFrame({tic: data['Adj Close']
                    for tic, data in all_data.items()})
volume = DataFrame({tic: data['Volume']
                    for tic, data in all_data.items()})

In [None]:
price

Now compute percent changes of the prices:

In [None]:
returns = price.pct_change()

In [None]:
returns.tail()

The corr method of Series computes the correlation of the overlapping, non-NA,
aligned-by-index values in two Series. Relatedly, cov computes the covariance:

In [None]:
returns.MSFT.corr(returns.IBM)

In [None]:
returns.MSFT.cov(returns.IBM)

DataFrame’s corr and cov methods, on the other hand, return a full correlation or
covariance matrix as a DataFrame, respectively:

In [None]:
returns.corr()

In [None]:
returns.cov()

**corrwith**: computes pairwise correlations between
a DataFrame’s columns or rows and another Series or DataFrame:

In [None]:
returns.corrwith(returns.IBM)

Now we can compute correlations of percent changes with volume:

In [None]:
volume

In [None]:
returns.corrwith(volume)
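
The cells above depend on downloading data from Yahoo! Finance. As an offline sketch, the same methods can be tried on synthetic data. The column names `X`, `Y`, `Z` and the random numbers below are made up and do not represent stock returns:

```python
import numpy as np
import pandas as pd

# Synthetic "returns": Y is partly driven by X, Z is independent noise.
rng = np.random.default_rng(0)
base = rng.standard_normal(250)
returns = pd.DataFrame({
    'X': base,
    'Y': 0.5 * base + rng.standard_normal(250),
    'Z': rng.standard_normal(250),
})

pair = returns['X'].corr(returns['Y'])        # correlation of two Series
matrix = returns.corr()                       # full correlation matrix
against_x = returns.corrwith(returns['X'])    # each column against X

print(pair)
print(matrix)
print(against_x)
```

The entry `matrix.loc['X', 'Y']` is the same number as `pair`, and `against_x` reproduces the `X` column of the matrix.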

## Unique Values, Value Counts, and Membership

Another class of related methods extracts information about the values contained in a
one-dimensional Series. To illustrate these, consider this example:

In [None]:
obj = Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])
obj

* **unique**: returns an array of the unique values in a Series:

In [None]:
obj.unique()

* **value_counts**: computes a Series containing value frequencies:

In [None]:
obj.value_counts()

* **isin**: is responsible for vectorized set membership and can be very useful in filtering a data set down to a subset of values in a Series or column in a DataFrame:

In [None]:
mask = obj.isin(['b', 'c'])
mask

In [None]:
obj[mask]

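With a DataFrame, we can compute a histogram over several related columns by passing **pd.value_counts** to **apply**: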

In [None]:
data = DataFrame({
    'Qu1': [1, 3, 4, 3, 4],
    'Qu2': [2, 3, 1, 2, 3],
    'Qu3': [1, 5, 2, 4, 4]
    }, index=['a', 'b', 'c', 'd', 'e'])
data

In [None]:
result = data.apply(pd.value_counts, axis=0)
result
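
The top-level `pd.value_counts` function is deprecated in newer pandas releases. A sketch of the same histogram that calls the Series method inside a lambda, and replaces the resulting NaN entries with 0, could look like this:

```python
import pandas as pd

data = pd.DataFrame({
    'Qu1': [1, 3, 4, 3, 4],
    'Qu2': [2, 3, 1, 2, 3],
    'Qu3': [1, 5, 2, 4, 4],
})

# Count each value per column; values absent from a column become NaN,
# which fillna(0) turns into a count of zero.
result = data.apply(lambda col: col.value_counts()).fillna(0)
print(result)
```

The row labels of `result` are the distinct values found across all columns, and each cell holds how often that value occurs in the column.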

# Handling Missing Data
> Missing data is common in most data analysis applications. One of the goals in
designing pandas was to make working with missing data as painless as possible.

**What is missing data?**

pandas uses the floating-point value NaN (Not a Number) to represent missing data. Consider a Series that contains one:

In [None]:
string_data = Series(['aardvark', 'artichoke', np.nan, 'avocado'])
string_data

In [None]:
string_data.isnull()

The built-in Python **None** value is also treated as NA in object arrays:

In [None]:
string_data[0] = None
string_data.isnull()

## Filtering Out Missing Data

* **dropna**: On a Series, it returns the Series with only the non-null data and index values:

In [None]:
from numpy import nan as NA

In [None]:
data = Series([1, NA, 2, NA, NA, 5])

In [None]:
data.dropna()

* **notnull**: We can use a mask to filter.

In [None]:
data[data.notnull()]

With DataFrame objects, these are a bit more complex. You may want to drop rows
or columns which are all NA or just those containing any NAs. dropna by default drops
any row containing a missing value:

In [None]:
data = DataFrame(
    [[1., 6.5, 3.], [1., NA, NA], [NA, NA, NA], [NA, 6.5, 3.]]
)
data

In [None]:
cleaned = data.dropna()
cleaned

Passing **how='all'** will only drop rows that are all NA:

In [None]:
data.dropna(how='all')

Dropping columns in the same way is only a matter of passing **axis=1** :

In [None]:
data[4] = NA

In [None]:
data

In [None]:
data.dropna(axis=1, how='all')

* **thresh**: keep only the rows that contain at least the given number of non-NA values:

In [None]:
df = DataFrame(np.random.randn(9,3))
df

In [None]:
df.ix[:4, 1] = NA
df

In [None]:
df.ix[:2, 2] = NA
df

In [None]:
df.dropna(thresh=3)
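
The three variants of **dropna** can be compared on a small deterministic frame. This sketch reuses the four-row example from earlier in this section:

```python
import numpy as np
import pandas as pd

NA = np.nan
data = pd.DataFrame(
    [[1., 6.5, 3.], [1., NA, NA], [NA, NA, NA], [NA, 6.5, 3.]]
)

any_na = data.dropna()                 # drop rows with at least one NA
all_na = data.dropna(how='all')        # drop rows that are entirely NA
at_least_two = data.dropna(thresh=2)   # keep rows with >= 2 non-NA values

print(any_na.index.tolist())        # [0]
print(all_na.index.tolist())        # [0, 1, 3]
print(at_least_two.index.tolist())  # [0, 3]
```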

## Filling in Missing Data

* **fillna**: calling it with a constant replaces missing values with that value:

In [None]:
df.fillna(0)

Calling **fillna** with a dict lets us use a different fill value for each column:

In [None]:
df.fillna({1: 0, 2: -1})

**fillna** returns a new object, but we can modify the existing object in place with the **inplace** option:

In [None]:
df.fillna(0, inplace=True)
df

The same interpolation methods available for reindexing can be used with **fillna**:

In [None]:
df = DataFrame(np.random.randn(6, 3))
df.ix[2:, 1] = NA; df.ix[4:, 2] = NA
df

In [None]:
df.fillna(method='ffill')

In [None]:
df.fillna(method='ffill', limit=2)
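
In newer pandas releases the `method=` argument of **fillna** is deprecated in favor of the dedicated **ffill** and **bfill** methods. A minimal sketch on a made-up Series shows forward filling with and without a limit, plus filling with a computed value:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

carried = s.ffill()            # carry the last valid value forward
limited = s.ffill(limit=2)     # fill at most two consecutive NAs
by_mean = s.fillna(s.mean())   # fill with the mean of the non-NA values

print(carried.tolist())   # [1.0, 1.0, 1.0, 1.0, 5.0]
print(limited.tolist())   # [1.0, 1.0, 1.0, nan, 5.0]
print(by_mean.tolist())   # [1.0, 3.0, 3.0, 3.0, 5.0]
```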

# Hierarchical Indexing

> Hierarchical indexing is an important feature of pandas enabling you to have multiple
(two or more) index levels on an axis

In [None]:
data = Series(
    np.random.randn(10), 
    index = [['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd'], [1,2,3,1,2,3,1,2,2,3]]
)
data

In [None]:
data.index

With a hierarchically-indexed object, so-called partial indexing is possible, enabling
you to concisely select subsets of the data:

In [None]:
data['b']

In [None]:
data['b':'c']

In [None]:
data.ix[['b', 'd']]

We can even select from an inner level:

In [None]:
data[:, 2]

* **unstack**: rearranges the data into a DataFrame; the inverse operation is **stack**:

In [None]:
data.unstack()

In [None]:
data.unstack().stack()
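
A deterministic sketch of partial indexing and the **unstack**/**stack** round trip, using a small made-up Series instead of random numbers. The trailing `dropna()` is there because, depending on the pandas version, **stack** may keep the NaN placeholders that **unstack** introduced:

```python
import numpy as np
import pandas as pd

data = pd.Series(
    [1., 2., 3., 4.],
    index=[['a', 'a', 'b', 'b'], [1, 2, 1, 3]],
)

outer = data['a']         # partial indexing on the outer level
inner = data.loc[:, 1]    # selection on the inner level

wide = data.unstack()     # inner level becomes the columns 1, 2, 3
restored = wide.stack().dropna()

print(wide)
print(restored.tolist())  # [1.0, 2.0, 3.0, 4.0]
```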

With a DataFrame, both axes can have a hierarchical index:

In [None]:
frame = DataFrame(
    np.arange(12).reshape(4,3), 
    index=[['a','a','b','b'], [1,2,1,2]], 
    columns=[
        ['Ohio', 'Ohio', 'Colorado'],
        ['Green', 'Red', 'Green']
    ]
)
frame

The hierarchical levels can have names (as strings or any Python objects)

In [None]:
frame.index.names = ['key1', 'key2']

In [None]:
frame.columns.names = ['state', 'color']

In [None]:
frame

With partial column indexing, we can similarly select groups of columns:

In [None]:
frame['Ohio']

## Reordering and Sorting Levels

* **swaplevel**: takes two level numbers or names and
returns a new object with the levels interchanged

In [None]:
frame.swaplevel('key1', 'key2')

* **sortlevel**: sorts the data (stably) using only the values in a single
level

In [None]:
frame.sortlevel(1)

In [None]:
frame.swaplevel(1,0).sortlevel(0)
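
In recent pandas releases **sortlevel** is no longer available; sorting by a level is done with `sort_index(level=...)`. A sketch of the last cell under that assumption, with the frame rebuilt so the example runs on its own:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(12).reshape(4, 3),
    index=pd.MultiIndex.from_arrays(
        [['a', 'a', 'b', 'b'], [1, 2, 1, 2]], names=['key1', 'key2']),
    columns=pd.MultiIndex.from_arrays(
        [['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']],
        names=['state', 'color']),
)

# Swap the two row levels, then sort by the new outermost level.
swapped = frame.swaplevel('key1', 'key2').sort_index(level=0)

print(list(swapped.index))
# [(1, 'a'), (1, 'b'), (2, 'a'), (2, 'b')]
print(swapped.iloc[:, 0].tolist())  # [0, 6, 3, 9]
```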

## Summary Statistics by Level

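* **level** option: many descriptive and summary statistics on DataFrame and Series accept a **level** option, which specifies the level to aggregate by on a particular axis: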

In [None]:
frame.sum(level='key2')

In [None]:
frame.sum(level='color', axis=1)
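
The `level=` argument of **sum** was removed in pandas 2.0. A sketch of the same two aggregations using **groupby** on a level, assuming a recent release; the column-wise case goes through a transpose, which is one of several possible ways to write it:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(12).reshape(4, 3),
    index=pd.MultiIndex.from_arrays(
        [['a', 'a', 'b', 'b'], [1, 2, 1, 2]], names=['key1', 'key2']),
    columns=pd.MultiIndex.from_arrays(
        [['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']],
        names=['state', 'color']),
)

# Sum rows that share the same key2 value.
by_key2 = frame.groupby(level='key2').sum()

# Sum columns that share the same color: transpose, group, transpose back.
by_color = frame.T.groupby(level='color').sum().T

print(by_key2)
print(by_color)
```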

## Using a DataFrame’s Columns

It’s not unusual to want to use one or more columns from a DataFrame as the row
index; alternatively, you may wish to move the row index into the DataFrame’s
columns:

In [None]:
frame = DataFrame({
        'a': range(7), 'b': range(7, 0, -1),
        'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'],
        'd': [0, 1, 2, 0, 1, 2, 3]
    }
)
frame

* **set_index**: creates a new DataFrame using one or more of its columns as the index:

In [None]:
frame2 = frame.set_index(['c', 'd'], drop=True)
frame2

* **reset_index**: does the opposite of set_index; the hierarchical index levels are moved back into the columns:

In [None]:
frame2.reset_index()
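
A small sketch of the round trip on a shortened, made-up version of the frame, including the effect of `drop=False`, which keeps the index columns in the DataFrame as well:

```python
import pandas as pd

frame = pd.DataFrame({
    'a': range(4),
    'c': ['one', 'one', 'two', 'two'],
    'd': [0, 1, 0, 1],
})

indexed = frame.set_index(['c', 'd'])             # c and d leave the columns
kept = frame.set_index(['c', 'd'], drop=False)    # c and d stay as columns too
restored = indexed.reset_index()                  # levels move back to columns

print(indexed.columns.tolist())    # ['a']
print(kept.columns.tolist())       # ['a', 'c', 'd']
print(restored.columns.tolist())   # ['c', 'd', 'a']
print(indexed.loc[('two', 1), 'a'])  # 3
```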

# Other pandas Topics

## Integer Indexing

Working with pandas objects indexed by integers is something that often trips up new
users due to some differences with indexing semantics on built-in Python data like lists and tuples. For example, you would not expect the following code to generate an error:

In [2]:
ser = Series(np.arange(3))
ser[-1]

KeyError: -1

On the other hand, with a non-integer index, there is no potential for ambiguity:

In [50]:
ser2 = Series(np.arange(3.), index=['a', 'b', 'c'])
ser2[-1]

2.0

* **iloc** selects by integer position, regardless of the index labels, while **loc** selects by label. Both are indexed with square brackets:

In [56]:
ser3 = Series(range(3), index=[-5, 1, 3])
print(ser3.iloc[2])
print(ser3.loc[-5])

2
0


In [57]:
frame = DataFrame(np.arange(6).reshape(3, 2), index=[2, 0, 1])
frame    

| | 0 | 1 |
|---|---|---|
| 2 | 0 | 1 |
| 0 | 2 | 3 |
| 1 | 4 | 5 |


In [58]:
frame.iloc[0]

0    0
1    1
Name: 2, dtype: int32
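
To avoid the ambiguity with integer labels, it helps to always state the intent explicitly. The sketch below uses a made-up Series whose integer labels are deliberately out of order, so label-based and position-based access return different elements:

```python
import pandas as pd

ser = pd.Series([10, 20, 30], index=[2, 0, 1])

by_label = ser.loc[0]       # the element whose label is 0
by_position = ser.iloc[0]   # the first element, whatever its label
last = ser.iloc[-1]         # negative positions work with iloc

print(by_label)      # 20
print(by_position)   # 10
print(last)          # 30
```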

## Panel Data
> A Panel can be thought of as a three-dimensional analogue of DataFrame.

In [59]:
import pandas.io.data as web
pdata = pd.Panel(
    dict((stk, web.get_data_yahoo(stk, '1/1/2009', '6/1/2012'))            
    for stk in ['AAPL', 'GOOG', 'MSFT', 'DELL'])
)
pdata

The pandas.io.data module is moved to a separate package (pandas-datareader) and will be removed from pandas in a future version.
After installing the pandas-datareader package (https://github.com/pydata/pandas-datareader), you can change the import ``from pandas.io import data, wb`` to ``from pandas_datareader import data, wb``.


<class 'pandas.core.panel.Panel'>
Dimensions: 4 (items) x 868 (major_axis) x 6 (minor_axis)
Items axis: AAPL to MSFT
Major_axis axis: 2009-01-02 00:00:00 to 2012-06-01 00:00:00
Minor_axis axis: Open to Adj Close

In [61]:
pdata = pdata.swapaxes('items', 'minor')

In [62]:
pdata['Adj Close']

KeyError: 'Adj Close'

In [None]:
pdata.ix[:, '6/1/2012', :]

In [None]:
pdata.ix['Adj Close', '5/22/2012':, :]

In [None]:
stacked = pdata.ix[:, '5/30/2012':, :].to_frame()

In [None]:
stacked