# Pandas
***pandas*** is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language. It is already well on its way toward this goal.

Source: http://pandas.pydata.org/pandas-docs/stable/

# 10 Minutes to pandas
This is a short introduction to pandas, geared mainly toward new users. You can find more complex recipes in the Cookbook.
The original document is posted at http://pandas.pydata.org/pandas-docs/stable/10min.html

Customarily, we import as follows:

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Object Creation
See the [Data Structure Intro section](http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dsintro)

Creating a `Series` by passing a list of values, letting pandas create a default integer index:

In [3]:
s = pd.Series([1,3,5,np.nan,6,8])
s

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

Creating a `DataFrame` by passing a numpy array, with a datetime index and labeled columns:

In [4]:
dates = pd.date_range('20130101', periods=6)
dates

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [9]:
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=['Ann', 'Bob', 'Charly', 'Don'])
# Alternative column labels: columns=list('ABCD')
df

Unnamed: 0,Ann,Bob,Charly,Don
2013-01-01,-0.186518,-0.38583,-1.613007,-0.866188
2013-01-02,-0.869272,-0.667518,-0.594663,0.376043
2013-01-03,0.063678,1.746219,1.36621,-0.072918
2013-01-04,-0.170625,0.472639,0.261512,0.828992
2013-01-05,0.569475,-0.798349,0.968672,-0.286756
2013-01-06,1.468753,0.123569,0.316814,0.107775


In [11]:
df.Charly+df.Don

2013-01-01   -2.479195
2013-01-02   -0.218620
2013-01-03    1.293292
2013-01-04    1.090504
2013-01-05    0.681916
2013-01-06    0.424589
Freq: D, dtype: float64

Creating a `DataFrame` by passing a dict of objects that can be converted to series-like structures:

In [12]:
df2 = pd.DataFrame({ 'A' : 1.,
                     'B' : pd.Timestamp('20130102'),
                     'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
                     'D' : np.array([3] * 4,dtype='int32'),
                     'E' : pd.Categorical(["test","train","test","train"]),
                     'F' : 'foo' })

df2

Unnamed: 0,A,B,C,D,E,F
0,1.0,2013-01-02,1.0,3,test,foo
1,1.0,2013-01-02,1.0,3,train,foo
2,1.0,2013-01-02,1.0,3,test,foo
3,1.0,2013-01-02,1.0,3,train,foo


The columns of the resulting `DataFrame` have different `dtypes`:

In [13]:
df2.dtypes

A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object

If you’re using IPython, tab completion for column names (as well as public attributes) is automatically enabled. Here’s a subset of the attributes that will be completed:
<pre>
In [13]: df2.&lt;TAB&gt;
df2.A                  df2.boxplot
df2.abs                df2.C
df2.add                df2.clip
df2.add_prefix         df2.clip_lower
df2.add_suffix         df2.clip_upper
df2.align              df2.columns
df2.all                df2.combine
df2.any                df2.combineAdd
df2.append             df2.combine_first
df2.apply              df2.combineMult
df2.applymap           df2.compound
df2.as_blocks          df2.consolidate
df2.asfreq             df2.convert_objects
df2.as_matrix          df2.copy
df2.astype             df2.corr
df2.at                 df2.corrwith
df2.at_time            df2.count
df2.axes               df2.cov
df2.B                  df2.cummax
df2.between_time       df2.cummin
df2.bfill              df2.cumprod
df2.blocks             df2.cumsum
df2.bool               df2.D
</pre>
As you can see, the columns A, B, C, and D are automatically tab completed. E is there as well; the rest of the attributes have been truncated for brevity.

# Viewing Data
See the [Basics section](http://pandas.pydata.org/pandas-docs/stable/basics.html#basics)

See the top & bottom rows of the frame

In [14]:
df.head()

Unnamed: 0,Ann,Bob,Charly,Don
2013-01-01,-0.186518,-0.38583,-1.613007,-0.866188
2013-01-02,-0.869272,-0.667518,-0.594663,0.376043
2013-01-03,0.063678,1.746219,1.36621,-0.072918
2013-01-04,-0.170625,0.472639,0.261512,0.828992
2013-01-05,0.569475,-0.798349,0.968672,-0.286756


In [15]:
df.tail(3)

Unnamed: 0,Ann,Bob,Charly,Don
2013-01-04,-0.170625,0.472639,0.261512,0.828992
2013-01-05,0.569475,-0.798349,0.968672,-0.286756
2013-01-06,1.468753,0.123569,0.316814,0.107775


In [16]:
df.index

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [17]:
df.columns

Index(['Ann', 'Bob', 'Charly', 'Don'], dtype='object')

In [18]:
df.values

array([[-0.18651843, -0.38582988, -1.61300739, -0.86618789],
       [-0.86927187, -0.66751844, -0.59466295,  0.37604253],
       [ 0.06367836,  1.74621913,  1.36620993, -0.07291836],
       [-0.17062511,  0.47263927,  0.26151245,  0.82899167],
       [ 0.56947507, -0.79834941,  0.96867161, -0.28675557],
       [ 1.46875278,  0.12356911,  0.31681382,  0.10777479]])

In [19]:
df.describe()

Unnamed: 0,Ann,Bob,Charly,Don
count,6.0,6.0,6.0,6.0
mean,0.145915,0.081788,0.11759,0.014491
std,0.797167,0.946532,1.081302,0.579106
min,-0.869272,-0.798349,-1.613007,-0.866188
25%,-0.182545,-0.597096,-0.380619,-0.233296
50%,-0.053473,-0.13113,0.289163,0.017428
75%,0.443026,0.385372,0.805707,0.308976
max,1.468753,1.746219,1.36621,0.828992


In [21]:
df2.columns

Index(['A', 'B', 'C', 'D', 'E', 'F'], dtype='object')

In [22]:
df2.dtypes

A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object

In [20]:
df2.describe()

Unnamed: 0,A,C,D
count,4.0,4.0,4.0
mean,1.0,1.0,3.0
std,0.0,0.0,0.0
min,1.0,1.0,3.0
25%,1.0,1.0,3.0
50%,1.0,1.0,3.0
75%,1.0,1.0,3.0
max,1.0,1.0,3.0


Transposing your data

In [23]:
df.T

Unnamed: 0,2013-01-01 00:00:00,2013-01-02 00:00:00,2013-01-03 00:00:00,2013-01-04 00:00:00,2013-01-05 00:00:00,2013-01-06 00:00:00
Ann,-0.186518,-0.869272,0.063678,-0.170625,0.569475,1.468753
Bob,-0.38583,-0.667518,1.746219,0.472639,-0.798349,0.123569
Charly,-1.613007,-0.594663,1.36621,0.261512,0.968672,0.316814
Don,-0.866188,0.376043,-0.072918,0.828992,-0.286756,0.107775


Sorting by an axis

In [24]:
df.sort_index(axis=1, ascending=False)

Unnamed: 0,Don,Charly,Bob,Ann
2013-01-01,-0.866188,-1.613007,-0.38583,-0.186518
2013-01-02,0.376043,-0.594663,-0.667518,-0.869272
2013-01-03,-0.072918,1.36621,1.746219,0.063678
2013-01-04,0.828992,0.261512,0.472639,-0.170625
2013-01-05,-0.286756,0.968672,-0.798349,0.569475
2013-01-06,0.107775,0.316814,0.123569,1.468753


Sorting by values

In [26]:
df.sort_values(by='Bob', ascending=True)

Unnamed: 0,Ann,Bob,Charly,Don
2013-01-05,0.569475,-0.798349,0.968672,-0.286756
2013-01-02,-0.869272,-0.667518,-0.594663,0.376043
2013-01-01,-0.186518,-0.38583,-1.613007,-0.866188
2013-01-06,1.468753,0.123569,0.316814,0.107775
2013-01-04,-0.170625,0.472639,0.261512,0.828992
2013-01-03,0.063678,1.746219,1.36621,-0.072918


# Selection
***Note:*** While standard Python / NumPy expressions for selecting and setting are intuitive and come in handy for interactive work, for production code we recommend the optimized pandas data access methods `.at`, `.iat`, `.loc`, and `.iloc`. The older `.ix` indexer was deprecated in pandas 0.20 and removed in 1.0, so it should not be used in new code.

See the indexing documentation [Indexing and Selecting Data](http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing) and [MultiIndex / Advanced Indexing](http://pandas.pydata.org/pandas-docs/stable/advanced.html#advanced)
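As a minimal sketch of how the four accessors relate, the example below uses a small hypothetical frame (the names `frame`, `x`, and `y` are illustrative and not part of the tutorial data). `.loc` and `.at` work with labels, while `.iloc` and `.iat` work with integer positions.

```python
import pandas as pd

# Hypothetical three-row frame with a string index (not the tutorial's df).
frame = pd.DataFrame(
    {"x": [10, 20, 30], "y": [1.5, 2.5, 3.5]},
    index=["a", "b", "c"],
)

row_by_label = frame.loc["b"]        # label-based: the row whose index label is "b"
row_by_position = frame.iloc[1]      # position-based: the second row
cell_by_label = frame.at["b", "x"]   # fast scalar access by label
cell_by_position = frame.iat[1, 0]   # fast scalar access by position

print(row_by_label.equals(row_by_position))
print(cell_by_label, cell_by_position)
```

Both row lookups return the same `Series`, and both scalar lookups return the same cell; which accessor to use depends on whether you know the label or the position.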

## Getting

Selecting a single column, which yields a `Series`, equivalent to `df.Ann`:

In [28]:
df['Ann']

2013-01-01   -0.186518
2013-01-02   -0.869272
2013-01-03    0.063678
2013-01-04   -0.170625
2013-01-05    0.569475
2013-01-06    1.468753
Freq: D, Name: Ann, dtype: float64

Selecting via [], which slices the rows.

In [30]:
df[1:3]

Unnamed: 0,Ann,Bob,Charly,Don
2013-01-02,-0.869272,-0.667518,-0.594663,0.376043
2013-01-03,0.063678,1.746219,1.36621,-0.072918


In [31]:
df['20130102':'20130104']

Unnamed: 0,Ann,Bob,Charly,Don
2013-01-02,-0.869272,-0.667518,-0.594663,0.376043
2013-01-03,0.063678,1.746219,1.36621,-0.072918
2013-01-04,-0.170625,0.472639,0.261512,0.828992


## Selection by Label

See more in [Selection by Label](http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-label)

For getting a cross section using a label

In [32]:
dates[0]

Timestamp('2013-01-01 00:00:00', offset='D')

In [None]:
df.loc[dates[0]]

Selecting on a multi-axis by label

In [34]:
df.loc[:,['Ann','Bob']]

Unnamed: 0,Ann,Bob
2013-01-01,-0.186518,-0.38583
2013-01-02,-0.869272,-0.667518
2013-01-03,0.063678,1.746219
2013-01-04,-0.170625,0.472639
2013-01-05,0.569475,-0.798349
2013-01-06,1.468753,0.123569


Showing label slicing, both endpoints are *included*

In [None]:
df.loc['20130102':'20130104',['Ann','Bob']]
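To make the endpoint rule concrete, here is a small sketch on a hypothetical single-column frame (`small` and `value` are illustrative names). A label slice with `.loc` includes both endpoints, whereas a positional slice with `.iloc` follows the usual Python convention and excludes the stop position.

```python
import pandas as pd

idx = pd.date_range("20130101", periods=6)
small = pd.DataFrame({"value": range(6)}, index=idx)

by_label = small.loc["20130102":"20130104"]   # both endpoints included
by_position = small.iloc[1:3]                 # stop position excluded

print(len(by_label))
print(len(by_position))
```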

Reduction in the dimensions of the returned object

In [None]:
df.loc['20130102',['Ann','Bob']]

For getting a scalar value

In [None]:
df.loc[dates[0],'Ann']

For getting fast access to a scalar (equiv to the prior method)

In [None]:
df.at[dates[0],'Ann']

## Selection by Position

See more in [Selection by Position](http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-integer)

Select via the position of the passed integers

In [None]:
df.iloc[3]

By integer slices, acting similar to numpy/python

In [None]:
df.iloc[3:5,0:2]

By lists of integer position locations, similar to the numpy/python style

In [None]:
df.iloc[[1,2,4],[0,2]]

For slicing rows explicitly

In [35]:
df.iloc[1:3,:]

Unnamed: 0,Ann,Bob,Charly,Don
2013-01-02,-0.869272,-0.667518,-0.594663,0.376043
2013-01-03,0.063678,1.746219,1.36621,-0.072918


For slicing columns explicitly

In [None]:
df.iloc[:,1:3]

For getting a value explicitly

In [None]:
df.iloc[1,1]

For getting fast access to a scalar (equiv to the prior method)

In [None]:
df.iat[1,1]

## Boolean Indexing

Using a single column’s values to select data.

In [43]:
flt = (df.Ann >= 0.5) & (df.Ann < 1.5) 

In [49]:
df[flt]

Unnamed: 0,Ann,Bob,Charly,Don
2013-01-05,0.569475,-0.798349,0.968672,-0.286756
2013-01-06,1.468753,0.123569,0.316814,0.107775


In [47]:
df[(df.Ann >= 0.5) & (df.Ann < 1.5)]

Unnamed: 0,Ann,Bob,Charly,Don
2013-01-05,0.569475,-0.798349,0.968672,-0.286756
2013-01-06,1.468753,0.123569,0.316814,0.107775


Selecting values from a DataFrame where a boolean condition is met (entries that fail the condition become `NaN`):

In [None]:
df[df > 0]
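As a small illustration on hypothetical data (the frame and its column names `p` and `q` are made up for this sketch), indexing with a boolean frame keeps the shape and masks failing entries with `NaN`. `DataFrame.where` is the explicit spelling of the same operation and also accepts a replacement value.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"p": [1.0, -2.0], "q": [-3.0, 4.0]})

masked = frame[frame > 0]              # entries failing the condition become NaN
same = frame.where(frame > 0)          # explicit spelling of the same operation
filled = frame.where(frame > 0, 0.0)   # supply a replacement instead of NaN

print(masked)
print(filled)
```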

Using the `isin()` method for filtering:

In [50]:
df2 = df.copy()
df2['E'] = ['one', 'one','two','three','four','three']
df2

Unnamed: 0,Ann,Bob,Charly,Don,E
2013-01-01,-0.186518,-0.38583,-1.613007,-0.866188,one
2013-01-02,-0.869272,-0.667518,-0.594663,0.376043,one
2013-01-03,0.063678,1.746219,1.36621,-0.072918,two
2013-01-04,-0.170625,0.472639,0.261512,0.828992,three
2013-01-05,0.569475,-0.798349,0.968672,-0.286756,four
2013-01-06,1.468753,0.123569,0.316814,0.107775,three


In [51]:
df2[df2['E'].isin(['two','four'])]

Unnamed: 0,Ann,Bob,Charly,Don,E
2013-01-03,0.063678,1.746219,1.36621,-0.072918,two
2013-01-05,0.569475,-0.798349,0.968672,-0.286756,four


## Setting

Setting a new column automatically aligns the data by the indexes

In [52]:
s1 = pd.Series([1,2,3,4,5,6], index=pd.date_range('20130102', periods=6))
s1

2013-01-02    1
2013-01-03    2
2013-01-04    3
2013-01-05    4
2013-01-06    5
2013-01-07    6
Freq: D, dtype: int64

In [54]:
df['F'] = s1
df['G'] = df['Ann']-df['Bob']
df

Unnamed: 0,Ann,Bob,Charly,Don,F,G
2013-01-01,-0.186518,-0.38583,-1.613007,-0.866188,,0.199311
2013-01-02,-0.869272,-0.667518,-0.594663,0.376043,1.0,-0.201753
2013-01-03,0.063678,1.746219,1.36621,-0.072918,2.0,-1.682541
2013-01-04,-0.170625,0.472639,0.261512,0.828992,3.0,-0.643264
2013-01-05,0.569475,-0.798349,0.968672,-0.286756,4.0,1.367824
2013-01-06,1.468753,0.123569,0.316814,0.107775,5.0,1.345184


In [None]:
df.G

Setting values by label

In [56]:
df.at[dates[0],'Ann'] = 17.6
df

Unnamed: 0,Ann,Bob,Charly,Don,F,G,A
2013-01-01,17.6,-0.38583,-1.613007,-0.866188,,0.199311,0.0
2013-01-02,-0.869272,-0.667518,-0.594663,0.376043,1.0,-0.201753,
2013-01-03,0.063678,1.746219,1.36621,-0.072918,2.0,-1.682541,
2013-01-04,-0.170625,0.472639,0.261512,0.828992,3.0,-0.643264,
2013-01-05,0.569475,-0.798349,0.968672,-0.286756,4.0,1.367824,
2013-01-06,1.468753,0.123569,0.316814,0.107775,5.0,1.345184,


Setting values by position

In [59]:
df.iat[5,2] = 349
df

Unnamed: 0,Ann,Bob,Charly,Don,F,G,A
2013-01-01,17.6,349.0,-1.613007,-0.866188,,0.199311,0.0
2013-01-02,-0.869272,-0.667518,-0.594663,0.376043,1.0,-0.201753,
2013-01-03,0.063678,1.746219,1.36621,-0.072918,2.0,-1.682541,
2013-01-04,-0.170625,0.472639,0.261512,0.828992,3.0,-0.643264,
2013-01-05,0.569475,-0.798349,0.968672,-0.286756,4.0,1.367824,
2013-01-06,1.468753,0.123569,349.0,0.107775,5.0,1.345184,


Setting by assigning with a numpy array

In [None]:
df.loc[:,'D'] = np.array([5] * len(df))

The result of the prior setting operations

In [None]:
df

A `where` operation with setting.

In [None]:
df2 = df.copy()
df2[df2 > 0] = -df2
df2

# Missing Data
pandas primarily uses the value `np.nan` to represent missing data. It is by default not included in computations.
See the [Missing Data section](http://pandas.pydata.org/pandas-docs/stable/missing_data.html#missing-data)
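A brief sketch of what "not included in computations" means in practice, using a hypothetical three-element `Series` named `values`: reductions skip `NaN` by default, and `skipna=False` makes the missing value propagate instead.

```python
import numpy as np
import pandas as pd

values = pd.Series([1.0, np.nan, 3.0])

mean_default = values.mean()              # NaN skipped: (1 + 3) / 2
mean_strict = values.mean(skipna=False)   # NaN propagates into the result
count = values.count()                    # number of non-missing entries

print(mean_default, mean_strict, count)
```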

Reindexing allows you to change/add/delete the index on a specified axis. This returns a copy of the data.

In [60]:
df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])
df1.loc[dates[0]:dates[1],'E'] = 1
df1

Unnamed: 0,Ann,Bob,Charly,Don,F,G,A,E
2013-01-01,17.6,349.0,-1.613007,-0.866188,,0.199311,0.0,1.0
2013-01-02,-0.869272,-0.667518,-0.594663,0.376043,1.0,-0.201753,,1.0
2013-01-03,0.063678,1.746219,1.36621,-0.072918,2.0,-1.682541,,
2013-01-04,-0.170625,0.472639,0.261512,0.828992,3.0,-0.643264,,


To drop any rows that have missing data.

In [61]:
df1.dropna(how='any')

Unnamed: 0,Ann,Bob,Charly,Don,F,G,A,E


Filling missing data

In [62]:
df1

Unnamed: 0,Ann,Bob,Charly,Don,F,G,A,E
2013-01-01,17.6,349.0,-1.613007,-0.866188,,0.199311,0.0,1.0
2013-01-02,-0.869272,-0.667518,-0.594663,0.376043,1.0,-0.201753,,1.0
2013-01-03,0.063678,1.746219,1.36621,-0.072918,2.0,-1.682541,,
2013-01-04,-0.170625,0.472639,0.261512,0.828992,3.0,-0.643264,,


In [63]:
df1.fillna(value=5)

Unnamed: 0,Ann,Bob,Charly,Don,F,G,A,E
2013-01-01,17.6,349.0,-1.613007,-0.866188,5.0,0.199311,0.0,1.0
2013-01-02,-0.869272,-0.667518,-0.594663,0.376043,1.0,-0.201753,5.0,1.0
2013-01-03,0.063678,1.746219,1.36621,-0.072918,2.0,-1.682541,5.0,5.0
2013-01-04,-0.170625,0.472639,0.261512,0.828992,3.0,-0.643264,5.0,5.0


To get the boolean mask where values are `nan`

In [66]:
pd.isnull(df1)

Unnamed: 0,Ann,Bob,Charly,Don,F,G,A,E
2013-01-01,False,False,False,False,True,False,False,False
2013-01-02,False,False,False,False,False,False,True,False
2013-01-03,False,False,False,False,False,False,True,True
2013-01-04,False,False,False,False,False,False,True,True


# Operations
See the [Basic section on Binary Ops](http://pandas.pydata.org/pandas-docs/stable/basics.html#basics-binop)

## Stats

Operations in general exclude missing data.

Performing a descriptive statistic


In [72]:
df.mean(0)

Ann        3.110335
Bob       58.312760
Charly    58.231454
Don        0.014491
F          3.000000
G          0.064127
A          0.000000
dtype: float64

Same operation on the other axis

In [71]:
df.mean(1)

2013-01-01    60.720019
2013-01-02    -0.159527
2013-01-03     0.570108
2013-01-04     0.624876
2013-01-05     0.970144
2013-01-06    59.507547
Freq: D, dtype: float64

Operating with objects that have different dimensionality and need alignment. In addition, pandas automatically broadcasts along the specified dimension.

In [73]:
s = pd.Series([1,3,5,np.nan,6,8], index=dates).shift(2)
s

2013-01-01    NaN
2013-01-02    NaN
2013-01-03    1.0
2013-01-04    3.0
2013-01-05    5.0
2013-01-06    NaN
Freq: D, dtype: float64

In [74]:
df.sub(s, axis='index')

Unnamed: 0,Ann,Bob,Charly,Don,F,G,A
2013-01-01,,,,,,,
2013-01-02,,,,,,,
2013-01-03,-0.936322,0.746219,0.36621,-1.072918,1.0,-2.682541,
2013-01-04,-3.170625,-2.527361,-2.738488,-2.171008,0.0,-3.643264,
2013-01-05,-4.430525,-5.798349,-4.031328,-5.286756,-1.0,-3.632176,
2013-01-06,,,,,,,


## Apply

Applying functions to the data

In [77]:
df

Unnamed: 0,Ann,Bob,Charly,Don,F,G,A
2013-01-01,17.6,349.0,-1.613007,-0.866188,,0.199311,0.0
2013-01-02,-0.869272,-0.667518,-0.594663,0.376043,1.0,-0.201753,
2013-01-03,0.063678,1.746219,1.36621,-0.072918,2.0,-1.682541,
2013-01-04,-0.170625,0.472639,0.261512,0.828992,3.0,-0.643264,
2013-01-05,0.569475,-0.798349,0.968672,-0.286756,4.0,1.367824,
2013-01-06,1.468753,0.123569,349.0,0.107775,5.0,1.345184,


In [75]:
df.apply(np.cumsum)

Unnamed: 0,Ann,Bob,Charly,Don,F,G,A
2013-01-01,17.6,349.0,-1.613007,-0.866188,,0.199311,0.0
2013-01-02,16.730728,348.332482,-2.20767,-0.490145,1.0,-0.002442,
2013-01-03,16.794406,350.078701,-0.84146,-0.563064,3.0,-1.684983,
2013-01-04,16.623781,350.55134,-0.579948,0.265928,6.0,-2.328247,
2013-01-05,17.193256,349.752991,0.388724,-0.020828,10.0,-0.960423,
2013-01-06,18.662009,349.87656,349.388724,0.086947,15.0,0.384761,


In [80]:
df

Unnamed: 0,Ann,Bob,Charly,Don,F,G,A
2013-01-01,17.6,349.0,-1.613007,-0.866188,,0.199311,0.0
2013-01-02,-0.869272,-0.667518,-0.594663,0.376043,1.0,-0.201753,
2013-01-03,0.063678,1.746219,1.36621,-0.072918,2.0,-1.682541,
2013-01-04,-0.170625,0.472639,0.261512,0.828992,3.0,-0.643264,
2013-01-05,0.569475,-0.798349,0.968672,-0.286756,4.0,1.367824,
2013-01-06,1.468753,0.123569,349.0,0.107775,5.0,1.345184,


In [83]:
df.max() ##df.apply(max)

Ann        17.600000
Bob       349.000000
Charly    349.000000
Don         0.828992
F           5.000000
G           1.367824
A           0.000000
dtype: float64

In [85]:
df.Ann.max()-df.Ann.min()

18.469271869437367

In [78]:
df.apply(lambda x: x.max() - x.min())

Ann        18.469272
Bob       349.798349
Charly    350.613007
Don         1.695180
F           4.000000
G           3.050365
A           0.000000
dtype: float64

## Histogramming

See more at [Histogramming and Discretization](http://pandas.pydata.org/pandas-docs/stable/basics.html#basics-discretization)

In [79]:
s = pd.Series(np.random.randint(0, 7, size=10))
s

0    6
1    6
2    2
3    1
4    5
5    2
6    1
7    2
8    6
9    6
dtype: int64

In [86]:
s.value_counts()

6    4
2    3
1    2
5    1
dtype: int64

## String Methods

Series is equipped with a set of string processing methods in the `str` attribute that make it easy to operate on each element of the array, as in the code snippet below. Note that pattern-matching in `str` generally uses [regular expressions](https://docs.python.org/2/library/re.html) by default (and in some cases always uses them). See more at [Vectorized String Methods](http://pandas.pydata.org/pandas-docs/stable/text.html#text-string-methods).

In [None]:
s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
s.str.lower()
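The following sketch, on a hypothetical `Series` named `words`, shows the regular-expression default in action. `str.contains` treats its pattern as a regex unless `regex=False` is passed, and the `na` argument controls what is returned for missing entries.

```python
import numpy as np
import pandas as pd

words = pd.Series(["A", "Aaba", "Baca", np.nan, "dog"])

lowered = words.str.lower()                              # missing entries stay missing
starts_with_a = words.str.contains(r"^[Aa]", na=False)   # regex; missing counts as False
literal_dot = pd.Series(["a.b", "axb"]).str.contains(".", regex=False)  # literal "."

print(starts_with_a.tolist())
print(literal_dot.tolist())
```

Without `regex=False`, the pattern `"."` would match any character and both strings in the last example would match.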

# Merge
pandas provides various facilities for easily combining Series and DataFrame objects, with various kinds of set logic for the indexes and relational algebra functionality in the case of join / merge-type operations.

See the [Merging section](http://pandas.pydata.org/pandas-docs/stable/merging.html#merging)

## Concat

Concatenating pandas objects together with `concat()`:

In [None]:
df = pd.DataFrame(np.random.randn(10, 4))
df

In [None]:
# break it into pieces
pieces = [df[:3], df[3:7], df[7:]]
pd.concat(pieces)
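As a deterministic sketch of the same idea (the frame below is hypothetical and uses `np.arange` instead of random data), concatenating the pieces in order rebuilds the original frame, and `ignore_index=True` discards the original labels in favor of a fresh integer index.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(20).reshape(10, 2), columns=["a", "b"])
pieces = [frame[:3], frame[3:7], frame[7:]]

rebuilt = pd.concat(pieces)                               # original index labels kept
renumbered = pd.concat(pieces[::-1], ignore_index=True)   # reversed order, new 0..9 index

print(rebuilt.equals(frame))
print(renumbered.head(3))
```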

## Join

SQL style merges. See the [Database style joining](http://pandas.pydata.org/pandas-docs/stable/merging.html#merging-join)

In [95]:
left = pd.DataFrame({'key': ['foo', 'boo', 'foo'], 'lval': [1, 2, 3]})
#right = pd.DataFrame({'key': ['boo', 'foo', 'foo'], 'rval': [4, 5, 6]})
right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [5, 6]})

In [96]:
left

Unnamed: 0,key,lval
0,foo,1
1,boo,2
2,foo,3


In [97]:
right

Unnamed: 0,key,rval
0,foo,5
1,foo,6


In [99]:
pd.merge(left, right, on='key', how='left')

Unnamed: 0,key,lval,rval
0,foo,1,5.0
1,foo,1,6.0
2,boo,2,
3,foo,3,5.0
4,foo,3,6.0


## Append

Append rows to a DataFrame. See the [Appending](http://pandas.pydata.org/pandas-docs/stable/merging.html#merging-concatenation) section.

In [None]:
df = pd.DataFrame(np.random.randn(8, 4), columns=['A','B','C','D'])
df

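In [None]:
s = df.iloc[3]
# DataFrame.append was removed in pandas 2.0; pd.concat is the supported way to add rows
pd.concat([df, s.to_frame().T], ignore_index=True)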

# Grouping
By “group by” we are referring to a process involving one or more of the following steps
- Splitting the data into groups based on some criteria
- Applying a function to each group independently
- Combining the results into a data structure

See the [Grouping section](http://pandas.pydata.org/pandas-docs/stable/groupby.html#groupby)
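The three steps listed above can be sketched by hand on a tiny hypothetical frame (the column names `key` and `val` are illustrative). Iterating over a `groupby` object exposes the split step, and the one-line aggregation performs all three steps at once.

```python
import pandas as pd

frame = pd.DataFrame({"key": ["foo", "bar", "foo", "bar"],
                      "val": [1, 2, 3, 4]})

# Split and apply by hand: iterate over the groups and sum each one.
manual = {key: group["val"].sum() for key, group in frame.groupby("key")}

# Split, apply, and combine in a single call.
combined = frame.groupby("key")["val"].sum()

print(manual)
print(combined)
```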

In [None]:
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
                          'foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three',
                          'two', 'two', 'one', 'three'],
                   'C' : np.random.randn(8),
                   'D' : np.random.randn(8)})

df

Grouping and then applying the `sum()` function to the resulting groups.

In [None]:
df.groupby('A').sum()

Grouping by multiple columns forms a hierarchical index, to which we then apply the function.

In [None]:
df.groupby(['A','B']).sum()

# Reshaping
See the sections on [Hierarchical Indexing](http://pandas.pydata.org/pandas-docs/stable/advanced.html#advanced-hierarchical) and [Reshaping](http://pandas.pydata.org/pandas-docs/stable/reshaping.html#reshaping-stacking).

## Stack

In [None]:
tuples = list(zip(*[['bar', 'bar', 'baz', 'baz',
                     'foo', 'foo', 'qux', 'qux'],
                    ['one', 'two', 'one', 'two',
                     'one', 'two', 'one', 'two']]))

index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])

df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
df2 = df[:4]
df2

The `stack()` method “compresses” a level in the DataFrame’s columns.

In [None]:
stacked = df2.stack()
stacked

With a “stacked” DataFrame or Series (having a `MultiIndex` as the index), the inverse operation of `stack()` is `unstack()`, which by default unstacks the ***last level***:

In [None]:
stacked.unstack()

In [None]:
stacked.unstack(1)

In [None]:
stacked.unstack(0)
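A deterministic sketch of the round trip, using a hypothetical four-row frame with fixed values instead of random data: `stack()` moves the column labels into the innermost index level, `unstack()` moves the last level back, and `unstack(1)` moves the `second` level to the columns instead.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [("bar", "one"), ("bar", "two"), ("baz", "one"), ("baz", "two")],
    names=["first", "second"],
)
frame = pd.DataFrame({"A": [1.0, 2.0, 3.0, 4.0],
                      "B": [5.0, 6.0, 7.0, 8.0]}, index=index)

stacked = frame.stack()          # columns become the innermost index level
restored = stacked.unstack()     # last level goes back to the columns
by_second = stacked.unstack(1)   # move the "second" level to the columns instead

print(stacked)
print(by_second)
```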

## Pivot Tables

See the section on [Pivot Tables](http://pandas.pydata.org/pandas-docs/stable/reshaping.html#reshaping-pivot).

In [None]:
df = pd.DataFrame({'A' : ['one', 'one', 'two', 'three'] * 3,
                   'B' : ['A', 'B', 'C'] * 4,
                   'C' : ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
                   'D' : np.random.randn(12),
                   'E' : np.random.randn(12)})

df

We can produce pivot tables from this data very easily:

In [None]:
pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])
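To show what the pivot table computes, here is a sketch on a small hypothetical frame with fixed values. By default `pivot_table` averages the `values` column for each index/column combination, which is roughly a `groupby` followed by `mean()` and `unstack()`. Combinations that never occur come out as `NaN`.

```python
import pandas as pd

frame = pd.DataFrame({
    "A": ["one", "one", "two", "two"],
    "C": ["foo", "bar", "foo", "foo"],
    "D": [1.0, 2.0, 3.0, 5.0],
})

pivoted = pd.pivot_table(frame, values="D", index="A", columns="C")

# Roughly the same result spelled out: group, average, move "C" to the columns.
manual = frame.groupby(["A", "C"])["D"].mean().unstack("C")

print(pivoted)
```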

# Time Series
pandas has simple, powerful, and efficient functionality for performing resampling operations during frequency conversion (e.g., converting secondly data into 5-minutely data). This is extremely common in, but not limited to, financial applications.
See the [Time Series section](http://pandas.pydata.org/pandas-docs/stable/timeseries.html#timeseries)

In [None]:
rng = pd.date_range('1/1/2012', periods=100, freq='S')
ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
ts.resample('5Min').sum()

2012-01-01    25083
Freq: 5T, dtype: int64
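Because the example above uses random data, a deterministic sketch may make the binning easier to follow. It assumes a hypothetical series with a constant reading of 1 per second for ten minutes, so each 5-minute bin should sum to 300. The frequency is passed as a `pd.Timedelta` here only to avoid depending on a particular spelling of the "second" alias.

```python
import pandas as pd

# Hypothetical data: one reading of 1 per second for ten minutes.
rng = pd.date_range("2012-01-01", periods=600, freq=pd.Timedelta(seconds=1))
ts = pd.Series(1, index=rng)

five_min_sum = ts.resample("5min").sum()   # two bins of 300 seconds each

print(five_min_sum)
```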

Time zone representation

In [None]:
rng = pd.date_range('3/6/2012 00:00', periods=5, freq='D')
ts = pd.Series(np.random.randn(len(rng)), rng)
ts

In [None]:
ts_utc = ts.tz_localize('UTC')
ts_utc

Convert to another time zone

In [None]:
ts_utc.tz_convert('US/Eastern')
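The difference between the two calls can be sketched with a single hypothetical timestamp: `tz_localize` attaches a time zone to naive timestamps without changing the wall-clock time, while `tz_convert` keeps the instant fixed and changes the wall-clock representation.

```python
import pandas as pd

naive = pd.Series([0], index=pd.DatetimeIndex(["2012-03-06 00:00"]))

utc = naive.tz_localize("UTC")            # attach a zone; wall-clock time unchanged
eastern = utc.tz_convert("US/Eastern")    # same instant, different wall-clock time

print(utc.index[0])
print(eastern.index[0])
```

Midnight UTC on 6 March 2012 falls before that year's daylight-saving switch in the US, so the converted timestamp is 19:00 on 5 March with a -05:00 offset.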

Converting between time span representations

In [None]:
rng = pd.date_range('1/1/2012', periods=5, freq='M')
ts = pd.Series(np.random.randn(len(rng)), index=rng)
ts

In [None]:
ps = ts.to_period()
ps

In [None]:
ps.to_timestamp()

Converting between period and timestamp enables some convenient arithmetic functions to be used. In the following example, we convert a quarterly frequency with year ending in November to 9am of the end of the month following the quarter end:

In [None]:
prng = pd.period_range('1990Q1', '2000Q4', freq='Q-NOV')
ts = pd.Series(np.random.randn(len(prng)), prng)
ts.index = (prng.asfreq('M', 'e') + 1).asfreq('H', 's') + 9
ts.head()

# Categoricals
Since version 0.15, pandas can include categorical data in a DataFrame. For full docs, see the [categorical introduction](http://pandas.pydata.org/pandas-docs/stable/categorical.html#categorical) and the [API documentation](http://pandas.pydata.org/pandas-docs/stable/api.html#api-categorical).

In [None]:
df = pd.DataFrame({"id":[1,2,3,4,5,6], "raw_grade":['a', 'b', 'b', 'a', 'a', 'e']})

Convert the raw grades to a categorical data type.

In [None]:
df["grade"] = df["raw_grade"].astype("category")
df["grade"]

0    a
1    b
2    b
3    a
4    a
5    e
Name: grade, dtype: category
Categories (3, object): [a, b, e]

Rename the categories to more meaningful names with `rename_categories()` (assigning directly to `Series.cat.categories` is no longer supported as of pandas 2.0):

In [None]:
df["grade"] = df["grade"].cat.rename_categories(["very good", "good", "very bad"])

Reorder the categories and simultaneously add the missing categories (methods under `Series.cat` return a new Series by default).

In [None]:
df["grade"] = df["grade"].cat.set_categories(["very bad", "bad", "medium", "good", "very good"])
df["grade"]

Sorting is per order in the categories, not lexical order.

In [None]:
df.sort_values(by="grade")

   id raw_grade      grade
5   6         e   very bad
1   2         b       good
2   3         b       good
0   1         a  very good
3   4         a  very good
4   5         a  very good

Grouping by a categorical column with `observed=False` also shows empty categories.

In [None]:
df.groupby("grade", observed=False).size()
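The categorical steps in this section can be condensed into one sketch on a standalone hypothetical `Series` named `grades`, using the same grade letters as above. It uses `rename_categories` for the renaming step and `value_counts` to show that unused categories are still tracked.

```python
import pandas as pd

grades = pd.Series(["a", "b", "b", "a", "a", "e"]).astype("category")
grades = grades.cat.rename_categories(["very good", "good", "very bad"])
grades = grades.cat.set_categories(
    ["very bad", "bad", "medium", "good", "very good"]
)

ordered = grades.sort_values()              # follows category order, not alphabetical
counts = grades.value_counts(sort=False)    # unused categories appear with count 0

print(ordered.tolist())
print(counts)
```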

# Plotting
[Plotting docs](http://pandas.pydata.org/pandas-docs/stable/visualization.html#visualization).

In [None]:
ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
ts = ts.cumsum()

In [None]:
%matplotlib inline
ts.plot()

On DataFrame, `plot()` is a convenience to plot all of the columns with labels:

In [None]:
df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index,
                  columns=['A', 'B', 'C', 'D'])

df = df.cumsum()

%matplotlib inline
plt.figure(); df.plot(); plt.legend(loc='best')

# Getting Data In/Out
## CSV

[Writing to a csv file](http://pandas.pydata.org/pandas-docs/stable/io.html#io-store-in-csv)

In [None]:
df.to_csv('foo.csv')

[Reading from a csv file](http://pandas.pydata.org/pandas-docs/stable/io.html#io-read-csv-table)

In [None]:
pd.read_csv('foo.csv')
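One detail worth a sketch: `to_csv` writes the index as an unnamed first column, so a plain `read_csv` brings it back as a regular column called `Unnamed: 0`. Passing `index_col=0` restores it as the index. The example below uses an in-memory `io.StringIO` buffer and a hypothetical two-row frame instead of a file on disk.

```python
import io
import pandas as pd

frame = pd.DataFrame({"x": [1, 2]}, index=["a", "b"])

buffer = io.StringIO()
frame.to_csv(buffer)                          # index written as an unnamed first column

buffer.seek(0)
default_read = pd.read_csv(buffer)            # index returns as column "Unnamed: 0"

buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)   # first column becomes the index again

print(default_read.columns.tolist())
print(restored)
```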

## HDF5

Reading and writing to [HDFStores](http://pandas.pydata.org/pandas-docs/stable/io.html#io-hdf5)

Writing to an HDF5 Store

In [None]:
## df.to_hdf('foo.h5','df')

Reading from an HDF5 Store

In [None]:
## pd.read_hdf('foo.h5','df')

## Excel

Reading and writing to [MS Excel](http://pandas.pydata.org/pandas-docs/stable/io.html#io-excel)

Writing to an Excel file

In [None]:
df.to_excel('foo.xlsx', sheet_name='Sheet1')

Reading from an Excel file

In [None]:
pd.read_excel('foo.xlsx', 'Sheet1', index_col=None, na_values=['NA'])

# Gotchas
If you are trying an operation and you see an exception like:
<pre>    
&gt;&gt;&gt; if pd.Series([False, True, False]):
    print("I was true")
Traceback
    ...

ValueError: The truth value of an array is ambiguous. Use a.empty, a.any() or a.all().
</pre>

See [Comparisons](http://pandas.pydata.org/pandas-docs/stable/basics.html#basics-compare) for an explanation and what to do.
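A minimal sketch of the usual ways around this error, using a hypothetical boolean `Series` named `flags`: state explicitly whether any element, all elements, or the emptiness of the object should decide the condition.

```python
import pandas as pd

flags = pd.Series([False, True, False])

try:
    if flags:                  # ambiguous: which element should decide?
        pass
    raised = False
except ValueError:
    raised = True

any_true = flags.any()     # True if at least one element is True
all_true = flags.all()     # True only if every element is True
is_empty = flags.empty     # True if the Series has no elements

print(raised, any_true, all_true, is_empty)
```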

See [Gotchas](http://pandas.pydata.org/pandas-docs/stable/gotchas.html#gotchas) as well.