<h1 style="font-size:3rem; color: sienna;">Data Wrangling: Join, Combine, and Reshape</h1>

## Part 3: Reshaping and Pivoting

There are a number of basic operations for rearranging tabular data. These are referred to interchangeably as reshape or pivot operations.

# Table of Contents

- 3.1  **[Reshaping with Hierarchical Indexing](#Reshaping)**
   
- 3.2  **[Pivoting “Long” to “Wide” Format](#Pivoting)**

- 3.3  **[Pivoting “Wide” to “Long” Format](#Wide_to_Long)**

In [1]:
import pandas as pd
import numpy as np

<a id="Reshaping"></a>
## 3.1 Reshaping with Hierarchical Indexing

Hierarchical indexing provides a consistent way to rearrange data in a DataFrame. There are two primary actions:

**stack**
- This “rotates” or pivots from the columns in the data to the rows

**unstack**
- This pivots from the rows into the columns

I’ll illustrate these operations through a series of examples. Consider a small DataFrame with string arrays as row and column indexes:

In [2]:
data = pd.DataFrame(np.arange(6).reshape((2, 3)),
                    index=pd.Index(['Ohio', 'Colorado'], name='state'),
                    columns=pd.Index(['one', 'two', 'three'],
                    name='number'))

In [3]:
data

number,one,two,three
state,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Ohio,0,1,2
Colorado,3,4,5


Using the `stack` method on this data pivots the columns into the rows, producing a Series:

In [4]:
result = data.stack()

In [5]:
result

state     number
Ohio      one       0
          two       1
          three     2
Colorado  one       3
          two       4
          three     5
dtype: int64

From a hierarchically indexed Series, you can rearrange the data back into a DataFrame with `unstack`:

In [6]:
result.unstack()

number,one,two,three
state,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Ohio,0,1,2
Colorado,3,4,5
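
As a minimal sketch of the round trip described above, the following rebuilds the same small frame, stacks it, and unstacks it again. The variable names `stacked` and `restored` are illustrative only.

```python
import numpy as np
import pandas as pd

# Same small frame as in the text: named row and column indexes.
data = pd.DataFrame(
    np.arange(6).reshape((2, 3)),
    index=pd.Index(['Ohio', 'Colorado'], name='state'),
    columns=pd.Index(['one', 'two', 'three'], name='number'),
)

stacked = data.stack()        # column labels move into an inner index level
restored = stacked.unstack()  # the inner level moves back to the columns

print(list(stacked.index.names))  # names of the two index levels
print(restored.equals(data))      # whether the round trip recovered the frame
```

Because both indexes are named, the level names survive the round trip, which makes it easier to refer to levels by name later.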


By default the innermost level is unstacked (same with `stack`). You can unstack a different level by passing a level number or name:

In [7]:
result.unstack(0)

state,Ohio,Colorado
number,Unnamed: 1_level_1,Unnamed: 2_level_1
one,0,3
two,1,4
three,2,5


In [8]:
result.unstack('state')

state,Ohio,Colorado
number,Unnamed: 1_level_1,Unnamed: 2_level_1
one,0,3
two,1,4
three,2,5
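
The two calls above address the same level in different ways. A small check, assuming the same frame as before, can confirm that the level number and the level name give identical results:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    np.arange(6).reshape((2, 3)),
    index=pd.Index(['Ohio', 'Colorado'], name='state'),
    columns=pd.Index(['one', 'two', 'three'], name='number'),
)
result = data.stack()

by_position = result.unstack(0)    # level 0 is the outer level, 'state'
by_name = result.unstack('state')  # the same level, addressed by name

# Raises AssertionError if the two frames differ in any way.
pd.testing.assert_frame_equal(by_position, by_name)
print(by_name.index.name, by_name.columns.name)
```

Referring to levels by name tends to be more robust than using positions when the order of levels may change.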


Unstacking might introduce missing data if all of the values in the level aren’t found in each of the subgroups:


In [9]:
s1 = pd.Series([0, 1, 2, 3], index=['a', 'b', 'c', 'd'])

In [10]:
s2 = pd.Series([4, 5, 6], index=['c', 'd', 'e'])

In [11]:
s1

a    0
b    1
c    2
d    3
dtype: int64

In [12]:
s2

c    4
d    5
e    6
dtype: int64

In [13]:
data2 = pd.concat([s1, s2], keys=['one', 'two'])

In [14]:
data2

one  a    0
     b    1
     c    2
     d    3
two  c    4
     d    5
     e    6
dtype: int64

In [15]:
data2.unstack()

Unnamed: 0,a,b,c,d,e
one,0.0,1.0,2.0,3.0,
two,,,4.0,5.0,6.0


Stacking filters out missing data by default, so the operation is more easily invertible:

In [16]:
data2.unstack()

Unnamed: 0,a,b,c,d,e
one,0.0,1.0,2.0,3.0,
two,,,4.0,5.0,6.0


In [17]:
data2.unstack().stack()

one  a    0.0
     b    1.0
     c    2.0
     d    3.0
two  c    4.0
     d    5.0
     e    6.0
dtype: float64

In [18]:
data2.unstack().stack(dropna=False)

one  a    0.0
     b    1.0
     c    2.0
     d    3.0
     e    NaN
two  a    NaN
     b    NaN
     c    4.0
     d    5.0
     e    6.0
dtype: float64
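
If missing combinations should not appear as `NaN`, `unstack` also accepts a `fill_value` argument. The sketch below assumes that `0` is an acceptable placeholder, which is not always true for real measurements.

```python
import pandas as pd

s1 = pd.Series([0, 1, 2, 3], index=['a', 'b', 'c', 'd'])
s2 = pd.Series([4, 5, 6], index=['c', 'd', 'e'])
data2 = pd.concat([s1, s2], keys=['one', 'two'])

with_nan = data2.unstack()            # absent (key, label) pairs become NaN
filled = data2.unstack(fill_value=0)  # absent pairs become 0 (an assumption)

print(int(with_nan.isna().sum().sum()))  # count of missing cells
print(int(filled.isna().sum().sum()))
```

Filling with a placeholder hides the difference between "not observed" and "observed as zero", so it is worth deciding deliberately.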

When you unstack in a DataFrame, the level unstacked becomes the lowest level in the result:

In [19]:
df = pd.DataFrame({'left': result, 'right': result + 5},
                 columns=pd.Index(['left', 'right'], name='side'))

In [20]:
df

Unnamed: 0_level_0,side,left,right
state,number,Unnamed: 2_level_1,Unnamed: 3_level_1
Ohio,one,0,5
Ohio,two,1,6
Ohio,three,2,7
Colorado,one,3,8
Colorado,two,4,9
Colorado,three,5,10


In [22]:
df.unstack('state')

side,left,left,right,right
state,Ohio,Colorado,Ohio,Colorado
number,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
one,0,3,5,8
two,1,4,6,9
three,2,5,7,10


When calling `stack`, we can indicate the name of the axis to stack:

In [23]:
df.unstack('state').stack('side')

Unnamed: 0_level_0,state,Ohio,Colorado
number,side,Unnamed: 2_level_1,Unnamed: 3_level_1
one,left,0,3
one,right,5,8
two,left,1,4
two,right,6,9
three,left,2,5
three,right,7,10
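
To make the movement of levels easier to follow, this sketch repeats the two steps above and prints the level names at each stage. The names `wide` and `tall` are illustrative only.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    np.arange(6).reshape((2, 3)),
    index=pd.Index(['Ohio', 'Colorado'], name='state'),
    columns=pd.Index(['one', 'two', 'three'], name='number'),
)
result = data.stack()
df = pd.DataFrame({'left': result, 'right': result + 5},
                  columns=pd.Index(['left', 'right'], name='side'))

wide = df.unstack('state')  # 'state' becomes the innermost column level
tall = wide.stack('side')   # 'side' moves from the columns into the rows

print(list(wide.columns.names))
print(list(tall.index.names))
```

Depending on the installed pandas version, `stack` on hierarchical columns may emit a warning about upcoming behavior changes; the level names shown here should be the same either way.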


<a id="Pivoting"></a>
## 3.2 Pivoting “Long” to “Wide” Format

A common way to store multiple time series in databases and CSV files is the so-called *long* or *stacked* format. Let’s load some example data and do a small amount of time series wrangling and other data cleaning:

In [24]:
data = pd.read_csv('macrodata.csv')

In [29]:
data.head()

Unnamed: 0,year,quarter,realgdp,realcons,realinv,realgovt,realdpi,cpi,m1,tbilrate,unemp,pop,infl,realint
0,1959.0,1.0,2710.349,1707.4,286.898,470.045,1886.9,28.98,139.7,2.82,5.8,177.146,0.0,0.0
1,1959.0,2.0,2778.801,1733.7,310.859,481.301,1919.7,29.15,141.7,3.08,5.1,177.83,2.34,0.74
2,1959.0,3.0,2775.488,1751.8,289.226,491.26,1916.4,29.35,140.5,3.82,5.3,178.657,2.74,1.09
3,1959.0,4.0,2785.204,1753.7,299.356,484.052,1931.3,29.37,140.0,4.33,5.6,179.386,0.27,4.06
4,1960.0,1.0,2847.699,1770.5,331.722,462.199,1955.5,29.54,139.6,3.5,5.2,180.007,2.31,1.19


In [30]:
periods = pd.PeriodIndex(year=data.year, quarter=data.quarter,
                         name='date')


# PeriodIndex is an immutable array of ordinal values, each indicating a regular period in time.
# Here it combines the year and quarter columns into a single quarterly period type.

In [31]:
periods

PeriodIndex(['1959Q1', '1959Q2', '1959Q3', '1959Q4', '1960Q1', '1960Q2',
             '1960Q3', '1960Q4', '1961Q1', '1961Q2',
             ...
             '2007Q2', '2007Q3', '2007Q4', '2008Q1', '2008Q2', '2008Q3',
             '2008Q4', '2009Q1', '2009Q2', '2009Q3'],
            dtype='period[Q-DEC]', name='date', length=203, freq='Q-DEC')
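
The keyword form `PeriodIndex(year=..., quarter=...)` may emit a deprecation warning in newer pandas releases. As a version-tolerant sketch, quarterly periods can also be built from string labels such as `'1959Q1'`. The small `frame` below is a hypothetical stand-in for the `year` and `quarter` columns of the macro data.

```python
import pandas as pd

# Hypothetical stand-in for the year and quarter columns of the macro data.
frame = pd.DataFrame({'year': [1959.0, 1959.0, 1959.0, 1959.0, 1960.0],
                      'quarter': [1.0, 2.0, 3.0, 4.0, 1.0]})

# Build labels such as '1959Q1' and let pandas parse them as quarterly periods.
labels = (frame['year'].astype(int).astype(str) + 'Q'
          + frame['quarter'].astype(int).astype(str))
periods = pd.PeriodIndex(list(labels), freq='Q', name='date')

# Convert each period to the timestamp at its end, then drop the time of day.
quarter_ends = periods.to_timestamp(freq='D', how='end').normalize()

print(periods[0], quarter_ends[0].date())
```

Calling `normalize()` is optional; it replaces the `23:59:59.999999999` end-of-day stamps with midnight of the same date, which can be easier to read.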

In [32]:
columns = pd.Index(['realgdp', 'infl', 'unemp'], name='item')

In [33]:
columns

Index(['realgdp', 'infl', 'unemp'], dtype='object', name='item')

In [34]:
data = data.reindex(columns=columns)

In [35]:
data

item,realgdp,infl,unemp
0,2710.349,0.00,5.8
1,2778.801,2.34,5.1
2,2775.488,2.74,5.3
3,2785.204,0.27,5.6
4,2847.699,2.31,5.2
...,...,...,...
198,13324.600,-3.16,6.0
199,13141.920,-8.79,6.9
200,12925.410,0.94,8.1
201,12901.504,3.37,9.2


In [36]:
data.index = periods.to_timestamp('D', 'end')

In [37]:
data.index

DatetimeIndex(['1959-03-31 23:59:59.999999999',
               '1959-06-30 23:59:59.999999999',
               '1959-09-30 23:59:59.999999999',
               '1959-12-31 23:59:59.999999999',
               '1960-03-31 23:59:59.999999999',
               '1960-06-30 23:59:59.999999999',
               '1960-09-30 23:59:59.999999999',
               '1960-12-31 23:59:59.999999999',
               '1961-03-31 23:59:59.999999999',
               '1961-06-30 23:59:59.999999999',
               ...
               '2007-06-30 23:59:59.999999999',
               '2007-09-30 23:59:59.999999999',
               '2007-12-31 23:59:59.999999999',
               '2008-03-31 23:59:59.999999999',
               '2008-06-30 23:59:59.999999999',
               '2008-09-30 23:59:59.999999999',
               '2008-12-31 23:59:59.999999999',
               '2009-03-31 23:59:59.999999999',
               '2009-06-30 23:59:59.999999999',
               '2009-09-30 23:59:59.999999999'],
              dtype=

In [38]:
ldata = data.stack().reset_index().rename(columns={0: 'value'})

In [39]:
#Now, ldata looks like:

ldata[:10]

Unnamed: 0,date,item,value
0,1959-03-31 23:59:59.999999999,realgdp,2710.349
1,1959-03-31 23:59:59.999999999,infl,0.0
2,1959-03-31 23:59:59.999999999,unemp,5.8
3,1959-06-30 23:59:59.999999999,realgdp,2778.801
4,1959-06-30 23:59:59.999999999,infl,2.34
5,1959-06-30 23:59:59.999999999,unemp,5.1
6,1959-09-30 23:59:59.999999999,realgdp,2775.488
7,1959-09-30 23:59:59.999999999,infl,2.74
8,1959-09-30 23:59:59.999999999,unemp,5.3
9,1959-12-31 23:59:59.999999999,realgdp,2785.204


This is the so-called long format for multiple time series, or other observational data with two or more keys (here, our keys are date and item). Each row in the table represents a single observation.
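
The chain used to build `ldata` can be followed on a much smaller table. The sketch below uses a hypothetical two-row wide frame (the values are copied from the first rows shown above) and applies the same `stack`, `reset_index`, and `rename` steps.

```python
import pandas as pd

# Hypothetical wide table: one row per date, one column per item.
wide = pd.DataFrame(
    {'realgdp': [2710.349, 2778.801], 'infl': [0.00, 2.34]},
    index=pd.DatetimeIndex(['1959-03-31', '1959-06-30'], name='date'),
)
wide.columns.name = 'item'

# stack() yields one value per (date, item) pair; reset_index() turns both
# keys into ordinary columns, and the unnamed value column 0 is renamed.
long_form = wide.stack().reset_index().rename(columns={0: 'value'})

print(long_form)
```

Each of the four rows in `long_form` is a single observation identified by its `date` and `item` keys.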


You might prefer to have a DataFrame containing one column per distinct `item` value, indexed by the timestamps in the `date` column. DataFrame’s `pivot` method performs exactly this transformation:

In [40]:
pivoted = ldata.pivot(index='date', columns='item', values='value')

In [41]:
pivoted

item,infl,realgdp,unemp
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1959-03-31 23:59:59.999999999,0.00,2710.349,5.8
1959-06-30 23:59:59.999999999,2.34,2778.801,5.1
1959-09-30 23:59:59.999999999,2.74,2775.488,5.3
1959-12-31 23:59:59.999999999,0.27,2785.204,5.6
1960-03-31 23:59:59.999999999,2.31,2847.699,5.2
...,...,...,...
2008-09-30 23:59:59.999999999,-3.16,13324.600,6.0
2008-12-31 23:59:59.999999999,-8.79,13141.920,6.9
2009-03-31 23:59:59.999999999,0.94,12925.410,8.1
2009-06-30 23:59:59.999999999,3.37,12901.504,9.2
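
A small self-contained sketch of the same transformation follows. The four-row `ldata` here is a hypothetical stand-in for the full long-format table, and keyword arguments are used to make the role of each column explicit.

```python
import pandas as pd

# Hypothetical long-format table with two keys ('date' and 'item').
ldata = pd.DataFrame({
    'date': pd.to_datetime(['1959-03-31'] * 2 + ['1959-06-30'] * 2),
    'item': ['realgdp', 'infl', 'realgdp', 'infl'],
    'value': [2710.349, 0.00, 2778.801, 2.34],
})

# One row per date, one column per distinct item.
pivoted = ldata.pivot(index='date', columns='item', values='value')

print(pivoted)
```

Note that `pivot` expects each (`date`, `item`) pair to appear at most once; duplicated pairs raise a `ValueError`.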


The first two arguments, `index` and `columns`, name the columns to use as the row and column index, respectively; the optional third argument, `values`, names the column that fills the DataFrame. Suppose you had two value columns that you wanted to reshape simultaneously:

In [42]:
ldata['value2'] = np.random.randn(len(ldata))

In [43]:
ldata[:10]

Unnamed: 0,date,item,value,value2
0,1959-03-31 23:59:59.999999999,realgdp,2710.349,0.946858
1,1959-03-31 23:59:59.999999999,infl,0.0,-0.051433
2,1959-03-31 23:59:59.999999999,unemp,5.8,0.809224
3,1959-06-30 23:59:59.999999999,realgdp,2778.801,-0.105027
4,1959-06-30 23:59:59.999999999,infl,2.34,0.557531
5,1959-06-30 23:59:59.999999999,unemp,5.1,0.51679
6,1959-09-30 23:59:59.999999999,realgdp,2775.488,-0.404708
7,1959-09-30 23:59:59.999999999,infl,2.74,-1.533275
8,1959-09-30 23:59:59.999999999,unemp,5.3,-0.787165
9,1959-12-31 23:59:59.999999999,realgdp,2785.204,-0.66268



By omitting the `values` argument, you obtain a DataFrame with hierarchical columns:

In [44]:
pivoted = ldata.pivot(index='date', columns='item')

In [45]:
pivoted[:5]

Unnamed: 0_level_0,value,value,value,value2,value2,value2
item,infl,realgdp,unemp,infl,realgdp,unemp
date,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
1959-03-31 23:59:59.999999999,0.0,2710.349,5.8,-0.051433,0.946858,0.809224
1959-06-30 23:59:59.999999999,2.34,2778.801,5.1,0.557531,-0.105027,0.51679
1959-09-30 23:59:59.999999999,2.74,2775.488,5.3,-1.533275,-0.404708,-0.787165
1959-12-31 23:59:59.999999999,0.27,2785.204,5.6,1.497059,-0.66268,-1.213747
1960-03-31 23:59:59.999999999,2.31,2847.699,5.2,0.724466,1.101081,-0.630961


In [46]:
pivoted['value'][:5]

item,infl,realgdp,unemp
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1959-03-31 23:59:59.999999999,0.0,2710.349,5.8
1959-06-30 23:59:59.999999999,2.34,2778.801,5.1
1959-09-30 23:59:59.999999999,2.74,2775.488,5.3
1959-12-31 23:59:59.999999999,0.27,2785.204,5.6
1960-03-31 23:59:59.999999999,2.31,2847.699,5.2


Note that `pivot` is equivalent to creating a hierarchical index using `set_index` followed by a call to `unstack`:

In [47]:
unstacked = ldata.set_index(['date', 'item']).unstack('item')

In [48]:
unstacked[:7]

Unnamed: 0_level_0,value,value,value,value2,value2,value2
item,infl,realgdp,unemp,infl,realgdp,unemp
date,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
1959-03-31 23:59:59.999999999,0.0,2710.349,5.8,-0.051433,0.946858,0.809224
1959-06-30 23:59:59.999999999,2.34,2778.801,5.1,0.557531,-0.105027,0.51679
1959-09-30 23:59:59.999999999,2.74,2775.488,5.3,-1.533275,-0.404708,-0.787165
1959-12-31 23:59:59.999999999,0.27,2785.204,5.6,1.497059,-0.66268,-1.213747
1960-03-31 23:59:59.999999999,2.31,2847.699,5.2,0.724466,1.101081,-0.630961
1960-06-30 23:59:59.999999999,0.14,2834.39,5.2,-0.777284,1.044261,-1.084689
1960-09-30 23:59:59.999999999,2.7,2839.022,5.6,-1.668734,1.183427,-0.507985
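
The equivalence stated above can be checked directly on a small table. In this sketch the four-row `ldata` is hypothetical, and `value2` uses a deterministic range instead of random numbers so the result is reproducible.

```python
import numpy as np
import pandas as pd

# Hypothetical long-format table with two value columns.
ldata = pd.DataFrame({
    'date': pd.to_datetime(['1959-03-31'] * 2 + ['1959-06-30'] * 2),
    'item': ['realgdp', 'infl', 'realgdp', 'infl'],
    'value': [2710.349, 0.00, 2778.801, 2.34],
})
ldata['value2'] = np.arange(len(ldata), dtype=float)  # stand-in for randn

via_pivot = ldata.pivot(index='date', columns='item')
via_unstack = ldata.set_index(['date', 'item']).unstack('item')

# Raises AssertionError if the two results differ.
pd.testing.assert_frame_equal(via_pivot, via_unstack)
print(via_pivot.columns.tolist())
```

Both routes produce hierarchical columns, with the value column name as the outer level and `item` as the inner level.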


<a id="Wide_to_Long"></a>
## 3.3 Pivoting “Wide” to “Long” Format

An inverse operation to pivot for DataFrames is `pandas.melt`. Rather than transforming one column into many in a new DataFrame, it merges multiple columns into one, producing a DataFrame that is longer than the input. Let’s look at an example:

In [25]:
df = pd.DataFrame({'key': ['foo', 'bar', 'baz'],
                  'A': [1, 2, 3],
                  'B': [4, 5, 6],
                  'C': [7, 8, 9]})

In [26]:
df


Unnamed: 0,key,A,B,C
0,foo,1,4,7
1,bar,2,5,8
2,baz,3,6,9


The 'key' column may be a group indicator, and the other columns are data values. When using `pandas.melt`, we must indicate which columns (if any) are group indicators. Let’s use `'key'` as the only group indicator here:

In [27]:
melted = pd.melt(df, ['key'])

In [30]:
melted

Unnamed: 0,key,variable,value
0,foo,A,1
1,bar,A,2
2,baz,A,3
3,foo,B,4
4,bar,B,5
5,baz,B,6
6,foo,C,7
7,bar,C,8
8,baz,C,9


Using `pivot`, we can reshape back to the original layout:

In [40]:
# Keyword arguments are used here; newer pandas versions do not accept
# positional arguments such as melted.pivot('key', 'variable', 'value').
reshaped = melted.pivot(columns='variable', values='value', index='key')


In [41]:
reshaped

variable,A,B,C
key,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
bar,2,5,8
baz,3,6,9
foo,1,4,7


Since the result of `pivot` creates an index from the column used as the row labels, we may want to use `reset_index` to move the data back into a column:

In [42]:
reshaped.reset_index()

variable,key,A,B,C
0,bar,2,5,8
1,baz,3,6,9
2,foo,1,4,7
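
After `reset_index`, the columns still carry the leftover axis name `variable`, visible in the header above. A minimal sketch of the full round trip, clearing that label at the end, might look like this (the name `restored` is illustrative):

```python
import pandas as pd

df = pd.DataFrame({'key': ['foo', 'bar', 'baz'],
                   'A': [1, 2, 3],
                   'B': [4, 5, 6],
                   'C': [7, 8, 9]})

melted = pd.melt(df, id_vars=['key'])
restored = (melted.pivot(index='key', columns='variable', values='value')
                  .reset_index())
restored.columns.name = None  # drop the leftover 'variable' axis label

print(restored)
```

The rows come back sorted by `key` rather than in their original order, so the result matches `df` only up to row ordering.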


You can also specify a subset of columns to use as value columns:

In [44]:
pd.melt(df, id_vars=['key'], value_vars=['A', 'B'])

Unnamed: 0,key,variable,value
0,foo,A,1
1,bar,A,2
2,baz,A,3
3,foo,B,4
4,bar,B,5
5,baz,B,6


`pandas.melt` can be used without any group identifiers, too:

In [45]:
pd.melt(df, value_vars=['A', 'B', 'C'])

Unnamed: 0,variable,value
0,A,1
1,A,2
2,A,3
3,B,4
4,B,5
5,B,6
6,C,7
7,C,8
8,C,9


In [46]:
pd.melt(df, value_vars=['key', 'A', 'B'])

Unnamed: 0,variable,value
0,key,foo
1,key,bar
2,key,baz
3,A,1
4,A,2
5,A,3
6,B,4
7,B,5
8,B,6


Now that you have some pandas basics for data import, cleaning, and reorganization under your belt, we are ready to move on to data visualization with matplotlib. We will return to pandas later in the course when we discuss more advanced analytics.