# Reshaping and Pivoting

There are a number of fundamental operations for rearranging tabular data. These are
alternately referred to as reshape or pivot operations.

In [1]:
import json
import sys

import numpy as np
import pandas as pd
from pandas import DataFrame, Series

## Reshaping with Hierarchical Indexing

Hierarchical indexing provides a consistent way to rearrange data in a DataFrame.

There are two primary actions:
* stack: this “rotates” or pivots from the columns in the data to the rows
* unstack: this pivots from the rows into the columns

I’ll illustrate these operations through a series of examples. Consider a small DataFrame
with string arrays as row and column indexes:

In [2]:
data = DataFrame(np.arange(6).reshape((2, 3)),
                 index=pd.Index(['Ohio', 'Colorado'], name='state'),
                 columns=pd.Index(['one', 'two', 'three'], name='number'))

In [3]:
data

number    one  two  three
state
Ohio        0    1      2
Colorado    3    4      5


Using the stack method on this data pivots the columns into the rows, producing a
Series:

In [12]:
result = data.stack()

In [13]:
result

state     number
Ohio      one       0
          two       1
          three     2
Colorado  one       3
          two       4
          three     5
dtype: int64

From a hierarchically indexed Series, you can rearrange the data back into a DataFrame
with unstack:

In [14]:
result.unstack()

number    one  two  three
state
Ohio        0    1      2
Colorado    3    4      5


By default, the innermost level is unstacked (the same is true of stack). You can unstack a
different level by passing a level number or name:

In [17]:
result.unstack(0)

state   Ohio  Colorado
number
one        0         3
two        1         4
three      2         5


In [18]:
result.unstack('state')

state   Ohio  Colorado
number
one        0         3
two        1         4
three      2         5
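
As a quick check of the two points above, the following sketch rebuilds the small state/number frame from this section and confirms that a level can be selected by position or by name with the same result, and that unstack undoes stack. The names by_position, by_name, and round_trip are introduced here only for illustration.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(6).reshape((2, 3)),
                    index=pd.Index(['Ohio', 'Colorado'], name='state'),
                    columns=pd.Index(['one', 'two', 'three'], name='number'))

result = data.stack()  # Series indexed by (state, number)

# Level 0 and the level named 'state' refer to the same level.
by_position = result.unstack(0)
by_name = result.unstack('state')
print(by_position.equals(by_name))

# Unstacking the innermost level (the default) rebuilds the original frame.
# Reindexing makes the comparison independent of label order.
round_trip = result.unstack().reindex(index=data.index, columns=data.columns)
print(round_trip.equals(data))
```

Selecting by name is usually the more readable choice, since it keeps working if the order of the index levels changes.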


Unstacking can introduce missing data if not every value in the level appears in each of
the subgroups:

In [19]:
s1 = Series([0, 1, 2, 3], index=['a', 'b', 'c', 'd'])

In [20]:
s2 = Series([4, 5, 6], index=['c', 'd', 'e'])

In [21]:
data2 = pd.concat([s1, s2], keys=['one','two'])

In [22]:
data2

one  a    0
     b    1
     c    2
     d    3
two  c    4
     d    5
     e    6
dtype: int64

In [24]:
data2.unstack()

       a    b    c    d    e
one  0.0  1.0  2.0  3.0  NaN
two  NaN  NaN  4.0  5.0  6.0


Stacking filters out missing data by default, so the operation is easily invertible:

In [25]:
data2.unstack().stack()

one  a    0.0
     b    1.0
     c    2.0
     d    3.0
two  c    4.0
     d    5.0
     e    6.0
dtype: float64

In [27]:
data2.unstack().stack(dropna=False)

one  a    0.0
     b    1.0
     c    2.0
     d    3.0
     e    NaN
two  a    NaN
     b    NaN
     c    4.0
     d    5.0
     e    6.0
dtype: float64

When unstacking in a DataFrame, the level unstacked becomes the lowest level in the
result:

In [30]:
df = DataFrame({'left': result, 'right': result + 5},
               columns=pd.Index(['left', 'right'], name='side'))

In [31]:
df

side             left  right
state    number
Ohio     one        0      5
         two        1      6
         three      2      7
Colorado one        3      8
         two        4      9
         three      5     10


In [32]:
df.unstack('state')

side   left          right
state  Ohio Colorado  Ohio Colorado
number
one       0        3     5        8
two       1        4     6        9
three     2        5     7       10


In [33]:
df.unstack('state').stack('side')

state         Ohio  Colorado
number side
one    left      0         3
       right     5         8
two    left      1         4
       right     6         9
three  left      2         5
       right     7        10
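
To see where each level ends up, it can help to inspect the level names directly rather than reading them off the printed table. The sketch below rebuilds the same frame and sets the column name after construction, which is a small departure from the cell above chosen so the example does not depend on how the constructor handles a named Index. The names wide, long_form, col_levels, and row_levels are illustrative.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(6).reshape((2, 3)),
                    index=pd.Index(['Ohio', 'Colorado'], name='state'),
                    columns=pd.Index(['one', 'two', 'three'], name='number'))
result = data.stack()

df = pd.DataFrame({'left': result, 'right': result + 5})
df.columns.name = 'side'

# 'state' moves from the rows to the lowest (innermost) column level.
wide = df.unstack('state')
col_levels = list(wide.columns.names)
print(col_levels)

# Stacking 'side' moves that column level to the innermost row level.
long_form = wide.stack('side')
row_levels = list(long_form.index.names)
print(row_levels)
```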


## Pivoting “long” to “wide” Format

A common way to store multiple time series in databases and CSV files is the so-called
long or stacked format:



In [49]:
ldata = pd.read_csv('ldata.csv', delimiter=',',
                    names=['date', 'item', 'value'])

In [50]:
ldata.head()

                  date     item     value
0  1959-03-31 00:00:00  realgdp  2710.349
1  1959-03-31 00:00:00     infl     0.000
2  1959-03-31 00:00:00    unemp     5.800
3  1959-06-30 00:00:00  realgdp  2778.801
4  1959-06-30 00:00:00     infl     2.340


In [51]:
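# Keyword arguments are used because recent pandas versions no longer accept these positionally
pivoted = ldata.pivot(index='date', columns='item', values='value')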

In [52]:
pivoted.head()

item                 infl   realgdp  unemp
date
1959-03-31 00:00:00  0.00  2710.349    5.8
1959-06-30 00:00:00  2.34  2778.801    5.1
1959-09-30 00:00:00  2.74  2775.488    5.3
1959-12-31 00:00:00   NaN  2785.204    NaN
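
Since ldata.csv is not reproduced here, the following sketch rebuilds a few long-format rows by hand (the numbers are copied from the output above, with the dates shortened to plain date strings) and pivots them with keyword arguments. Treat the inline frame as a stand-in for the real file, not as its actual contents.

```python
import pandas as pd

# Stand-in for the first rows of ldata.csv; one row per (date, item) pair.
ldata = pd.DataFrame({
    'date': ['1959-03-31'] * 3 + ['1959-06-30'] * 3,
    'item': ['realgdp', 'infl', 'unemp'] * 2,
    'value': [2710.349, 0.0, 5.8, 2778.801, 2.34, 5.1],
})

# index -> row labels, columns -> column labels, values -> cell contents.
pivoted = ldata.pivot(index='date', columns='item', values='value')
print(pivoted)

# Each distinct item is now its own column, looked up by date.
print(pivoted.loc['1959-06-30', 'infl'])
```

Note that pivot requires each (index, columns) pair to be unique; duplicated pairs raise an error rather than being aggregated.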


The first two arguments, index and columns, name the columns to be used as the row and
column index; the optional values argument names the column that fills the DataFrame.
Suppose you had two value columns that you wanted to reshape simultaneously:

In [47]:
ldata['value2'] = np.random.randn(len(ldata))

In [48]:
ldata

                  date     item     value    value2
0  1959-03-31 00:00:00  realgdp  2710.349  2.080434
1  1959-03-31 00:00:00     infl     0.000 -1.472674
2  1959-03-31 00:00:00    unemp     5.800  0.432504
3  1959-06-30 00:00:00  realgdp  2778.801  0.921968
4  1959-06-30 00:00:00     infl     2.340  0.067922
5  1959-06-30 00:00:00    unemp     5.100 -0.174717
6  1959-09-30 00:00:00  realgdp  2775.488  0.275734
7  1959-09-30 00:00:00     infl     2.740  1.642971
8  1959-09-30 00:00:00    unemp     5.300  0.995244
9  1959-12-31 00:00:00  realgdp  2785.204 -0.817940


By omitting the values argument, you obtain a DataFrame with hierarchical columns:

In [53]:
pivoted = ldata.pivot(index='date', columns='item')

In [54]:
pivoted[:5]

                    value
item                 infl   realgdp unemp
date
1959-03-31 00:00:00  0.00  2710.349   5.8
1959-06-30 00:00:00  2.34  2778.801   5.1
1959-09-30 00:00:00  2.74  2775.488   5.3
1959-12-31 00:00:00   NaN  2785.204   NaN


In [55]:
pivoted['value'][:5]

item                 infl   realgdp  unemp
date
1959-03-31 00:00:00  0.00  2710.349    5.8
1959-06-30 00:00:00  2.34  2778.801    5.1
1959-09-30 00:00:00  2.74  2775.488    5.3
1959-12-31 00:00:00   NaN  2785.204    NaN
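
With more than one value column, the hierarchical columns carry one top-level entry per value column. The sketch below illustrates this on a hand-built stand-in for ldata. The value2 numbers are made up and fixed (instead of random) so the result is reproducible, and the names top_level and value2_wide are illustrative.

```python
import pandas as pd

# Stand-in long-format data; value2 is invented for this example.
ldata = pd.DataFrame({
    'date': ['1959-03-31'] * 3 + ['1959-06-30'] * 3,
    'item': ['realgdp', 'infl', 'unemp'] * 2,
    'value': [2710.349, 0.0, 5.8, 2778.801, 2.34, 5.1],
    'value2': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

# No values argument: every remaining column is pivoted.
pivoted = ldata.pivot(index='date', columns='item')

# The outer column level holds the original value column names.
top_level = list(pivoted.columns.get_level_values(0).unique())
print(top_level)

# Selecting one outer label returns an ordinary wide frame for that column.
value2_wide = pivoted['value2']
print(value2_wide)
```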


Note that pivot is just a shortcut for creating a hierarchical index using set_index and
reshaping with unstack:

In [56]:
unstacked = ldata.set_index(['date', 'item']).unstack('item')

In [57]:
unstacked

                    value
item                 infl   realgdp unemp
date
1959-03-31 00:00:00  0.00  2710.349   5.8
1959-06-30 00:00:00  2.34  2778.801   5.1
1959-09-30 00:00:00  2.74  2775.488   5.3
1959-12-31 00:00:00   NaN  2785.204   NaN
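
The equivalence can be verified directly. This sketch again uses a small hand-built stand-in for ldata (values copied from the outputs above) and compares the two routes; via_pivot and via_unstack are illustrative names.

```python
import pandas as pd

# Stand-in long-format data; one row per (date, item) pair.
ldata = pd.DataFrame({
    'date': ['1959-03-31'] * 3 + ['1959-06-30'] * 3,
    'item': ['realgdp', 'infl', 'unemp'] * 2,
    'value': [2710.349, 0.0, 5.8, 2778.801, 2.34, 5.1],
})

# Route 1: pivot without a values argument.
via_pivot = ldata.pivot(index='date', columns='item')

# Route 2: build the hierarchical index by hand, then unstack one level.
via_unstack = ldata.set_index(['date', 'item']).unstack('item')

print(via_pivot.equals(via_unstack))
```

The set_index/unstack route is the more flexible of the two, since it extends naturally to more than two index levels.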
