<h1 align='center'>8.3 Reshaping and Pivoting

<b>Reshaping with Hierarchical Indexing

Hierarchical indexing provides a consistent way to rearrange data in a DataFrame.

There are two primary actions:

        stack
            This “rotates” or pivots from the columns in the data to the rows
        unstack
            This pivots from the rows into the columns

In [3]:
import pandas as pd
import numpy as np
data = pd.DataFrame(np.arange(6).reshape((2, 3)),
                   index=pd.Index(['Ohio', 'Colorado'], name='state'),
                   columns=pd.Index(['one', 'two', 'three'],
                    name='number'))

In [4]:
data

number    one  two  three
state
Ohio        0    1      2
Colorado    3    4      5


In [7]:
result = data.stack()
result

state     number
Ohio      one       0
          two       1
          three     2
Colorado  one       3
          two       4
          three     5
dtype: int32

By default, the innermost level is unstacked (the same is true of stack). You can unstack a different level by passing a level number or name:

In [8]:
result.unstack()

number    one  two  three
state
Ohio        0    1      2
Colorado    3    4      5


In [9]:
result.unstack(level=0)

state   Ohio  Colorado
number
one        0         3
two        1         4
three      2         5
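
A level can be selected by name as well as by position. The following minimal sketch rebuilds the small state/number frame from above and checks that both spellings give the same result; the variable names `by_number` and `by_name` are illustrative only.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(6).reshape((2, 3)),
                    index=pd.Index(['Ohio', 'Colorado'], name='state'),
                    columns=pd.Index(['one', 'two', 'three'], name='number'))
result = data.stack()

by_number = result.unstack(level=0)   # outer level, selected by position
by_name = result.unstack('state')     # the same level, selected by name

print(by_number.equals(by_name))
```

Selecting by name tends to be more robust when the order of the index levels might change.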


Unstacking might introduce missing data if all of the values in the level aren’t found in each of the subgroups:

In [11]:
s1 = pd.Series([0, 1, 2, 3], index=['a', 'b', 'c', 'd'])
s2 = pd.Series([4, 5, 6], index=['c', 'd', 'e'])
data2 = pd.concat([s1, s2], keys=['one', 'two'])
data2

one  a    0
     b    1
     c    2
     d    3
two  c    4
     d    5
     e    6
dtype: int64

In [13]:
data2.unstack()

       a    b    c    d    e
one  0.0  1.0  2.0  3.0  NaN
two  NaN  NaN  4.0  5.0  6.0


Stacking filters out missing data by default, so the operation is more easily invertible:

In [14]:
data2.unstack().stack()

one  a    0.0
     b    1.0
     c    2.0
     d    3.0
two  c    4.0
     d    5.0
     e    6.0
dtype: float64

In [15]:
data2.unstack().stack(dropna=False)

one  a    0.0
     b    1.0
     c    2.0
     d    3.0
     e    NaN
two  a    NaN
     b    NaN
     c    4.0
     d    5.0
     e    6.0
dtype: float64
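
The default handling of missing values in stack appears to differ between pandas releases, and the dropna argument may not be accepted in every version. A sketch that does not depend on that default, assuming the same `data2` as above, drops the missing values explicitly after stacking (the names `wide` and `roundtrip` are illustrative):

```python
import pandas as pd

s1 = pd.Series([0, 1, 2, 3], index=['a', 'b', 'c', 'd'])
s2 = pd.Series([4, 5, 6], index=['c', 'd', 'e'])
data2 = pd.concat([s1, s2], keys=['one', 'two'])

wide = data2.unstack()               # 2 x 5 table, NaN where a label is absent
roundtrip = wide.stack().dropna()    # explicit dropna, independent of stack's default

print(len(data2), len(roundtrip))
```

The values come back as floats because the intermediate wide table had to hold NaN.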

In [19]:
df = pd.DataFrame({'left': result, 'right': result + 5},
                  columns=pd.Index(['left', 'right'], name='side'))
df

side             left  right
state    number
Ohio     one        0      5
         two        1      6
         three      2      7
Colorado one        3      8
         two        4      9
         three      5     10


In [21]:
df.unstack()

side       left                 right
number      one    two  three    one    two  three
state
Ohio          0      1      2      5      6      7
Colorado      3      4      5      8      9     10


In [18]:
df.unstack('state')

side        left               right
state       Ohio  Colorado      Ohio  Colorado
number
one            0         3         5         8
two            1         4         6         9
three          2         5         7        10


When you unstack in a DataFrame, the level unstacked becomes the lowest level in the result.
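
One way to see this is to inspect the names of the column levels after unstacking. The sketch below builds an equivalent frame directly (rather than from `result`), so the construction is an assumption made only to keep the example self-contained:

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product(
    [['Ohio', 'Colorado'], ['one', 'two', 'three']],
    names=['state', 'number'])
df = pd.DataFrame({'left': np.arange(6), 'right': np.arange(6) + 5},
                  index=index).rename_axis(columns='side')

by_state = df.unstack('state')     # 'state' moves from the rows to the columns
by_number = df.unstack('number')   # 'number' moves from the rows to the columns

# In both cases the unstacked level is placed beneath the existing 'side' level
print(list(by_state.columns.names))
print(list(by_number.columns.names))
```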

When calling stack, we can indicate the name of the axis to stack:

In [22]:
df.unstack('state').stack('side')

state         Colorado  Ohio
number side
one    left          3     0
       right         8     5
two    left          4     1
       right         9     6
three  left          5     2
       right        10     7


<b>Pivoting “Long” to “Wide” Format

A common way to store multiple time series in databases and CSV is in so-called long or stacked format.

In [26]:
data=pd.read_csv(r"D:\macrodata.csv")
data.head()

     year  quarter   realgdp  realcons  realinv  realgovt  realdpi    cpi     m1  tbilrate  unemp      pop  infl  realint
0  1959.0      1.0  2710.349    1707.4  286.898   470.045   1886.9  28.98  139.7      2.82    5.8  177.146   0.0      0.0
1  1959.0      2.0  2778.801    1733.7  310.859   481.301   1919.7  29.15  141.7      3.08    5.1   177.83  2.34     0.74
2  1959.0      3.0  2775.488    1751.8  289.226    491.26   1916.4  29.35  140.5      3.82    5.3  178.657  2.74     1.09
3  1959.0      4.0  2785.204    1753.7  299.356   484.052   1931.3  29.37  140.0      4.33    5.6  179.386  0.27     4.06
4  1960.0      1.0  2847.699    1770.5  331.722   462.199   1955.5  29.54  139.6       3.5    5.2  180.007  2.31     1.19


In [28]:
periods = pd.PeriodIndex(year=data.year, quarter=data.quarter,
                         name='date')
periods

PeriodIndex(['1959Q1', '1959Q2', '1959Q3', '1959Q4', '1960Q1', '1960Q2',
             '1960Q3', '1960Q4', '1961Q1', '1961Q2',
             ...
             '2007Q2', '2007Q3', '2007Q4', '2008Q1', '2008Q2', '2008Q3',
             '2008Q4', '2009Q1', '2009Q2', '2009Q3'],
            dtype='period[Q-DEC]', name='date', length=203, freq='Q-DEC')

In [30]:
columns = pd.Index(['realgdp', 'infl', 'unemp'], name='item')
columns

Index(['realgdp', 'infl', 'unemp'], dtype='object', name='item')

In [32]:
data1 = data.reindex(columns=columns)
data1.head()

item   realgdp  infl  unemp
0     2710.349   0.0    5.8
1     2778.801  2.34    5.1
2     2775.488  2.74    5.3
3     2785.204  0.27    5.6
4     2847.699  2.31    5.2


In [36]:
data1.index = periods.to_timestamp('D', 'end')
data1.head()

item                            realgdp  infl  unemp
date
1959-03-31 23:59:59.999999999  2710.349   0.0    5.8
1959-06-30 23:59:59.999999999  2778.801  2.34    5.1
1959-09-30 23:59:59.999999999  2775.488  2.74    5.3
1959-12-31 23:59:59.999999999  2785.204  0.27    5.6
1960-03-31 23:59:59.999999999  2847.699  2.31    5.2


In [40]:
ldata = data1.stack().reset_index().rename(columns={0: 'value'})
ldata.head()

                            date     item     value
0  1959-03-31 23:59:59.999999999  realgdp  2710.349
1  1959-03-31 23:59:59.999999999     infl       0.0
2  1959-03-31 23:59:59.999999999    unemp       5.8
3  1959-06-30 23:59:59.999999999  realgdp  2778.801
4  1959-06-30 23:59:59.999999999     infl      2.34


This is the so-called long format for multiple time series, or other observational data with two or more keys (here, our keys are date and item). Each row in the table represents a single observation.

Data is frequently stored this way in relational databases like MySQL, as a fixed schema (column names and data types) allows the number of distinct values in the item column to change as data is added to the table.

In the previous example, date and item would usually be the primary keys (in relational database parlance), offering both relational integrity and easier joins. In some cases, the data may be more difficult to work with in this format; you might prefer to have a DataFrame containing one column per distinct item value indexed by timestamps in the date column.

DataFrame’s pivot method performs exactly this transformation:



In [43]:
pivoted = ldata.pivot(index='date', columns='item', values='value')
pivoted.head()

item                           infl   realgdp  unemp
date
1959-03-31 23:59:59.999999999   0.0  2710.349    5.8
1959-06-30 23:59:59.999999999  2.34  2778.801    5.1
1959-09-30 23:59:59.999999999  2.74  2775.488    5.3
1959-12-31 23:59:59.999999999  0.27  2785.204    5.6
1960-03-31 23:59:59.999999999  2.31  2847.699    5.2


The first two values passed (index and columns) are the columns to be used as the row and column index, respectively, followed by an optional value column (values) to fill the DataFrame.
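
Passing these arguments by keyword makes each role explicit, and recent pandas releases appear to require it. A minimal sketch on a small, made-up long-format table (the data and the names `long_df` and `wide_df` are hypothetical, not taken from macrodata.csv):

```python
import pandas as pd

# Hypothetical long-format table: one row per (date, item) observation
long_df = pd.DataFrame({
    'date': ['2000Q1', '2000Q1', '2000Q2', '2000Q2'],
    'item': ['infl', 'unemp', 'infl', 'unemp'],
    'value': [2.5, 4.0, 2.7, 3.9],
})

# index -> row labels, columns -> new column labels, values -> cell contents
wide_df = long_df.pivot(index='date', columns='item', values='value')
print(wide_df)
```

Each distinct value of item becomes its own column, and each distinct date becomes one row.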


Suppose you had two value columns that you wanted to reshape simultaneously:

In [45]:
ldata['value2'] = np.random.randn(len(ldata))
ldata.head()

                            date     item     value     value2
0  1959-03-31 23:59:59.999999999  realgdp  2710.349  -0.302819
1  1959-03-31 23:59:59.999999999     infl       0.0  -0.712186
2  1959-03-31 23:59:59.999999999    unemp       5.8  -0.818666
3  1959-06-30 23:59:59.999999999  realgdp  2778.801  -0.450042
4  1959-06-30 23:59:59.999999999     infl      2.34  -1.733989


In [47]:
pivoted = ldata.pivot(index='date', columns='item')
pivoted.head()

                               value                      value2
item                            infl   realgdp  unemp       infl    realgdp      unemp
date
1959-03-31 23:59:59.999999999    0.0  2710.349    5.8  -0.712186  -0.302819  -0.818666
1959-06-30 23:59:59.999999999   2.34  2778.801    5.1  -1.733989  -0.450042   0.097393
1959-09-30 23:59:59.999999999   2.74  2775.488    5.3   0.866501   0.069469   0.340655
1959-12-31 23:59:59.999999999   0.27  2785.204    5.6   0.814089   0.128743  -0.922541
1960-03-31 23:59:59.999999999   2.31  2847.699    5.2   0.049107   0.222105  -1.787127


By omitting the last argument, you obtain a DataFrame with hierarchical columns.

Note that pivot is equivalent to creating a hierarchical index using set_index followed by a call to unstack:

In [49]:
unstacked = ldata.set_index(['date', 'item']).unstack('item')
unstacked.head()

                               value                      value2
item                            infl   realgdp  unemp       infl    realgdp      unemp
date
1959-03-31 23:59:59.999999999    0.0  2710.349    5.8  -0.712186  -0.302819  -0.818666
1959-06-30 23:59:59.999999999   2.34  2778.801    5.1  -1.733989  -0.450042   0.097393
1959-09-30 23:59:59.999999999   2.74  2775.488    5.3   0.866501   0.069469   0.340655
1959-12-31 23:59:59.999999999   0.27  2785.204    5.6   0.814089   0.128743  -0.922541
1960-03-31 23:59:59.999999999   2.31  2847.699    5.2   0.049107   0.222105  -1.787127
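
The equivalence can be checked programmatically. This is a minimal sketch on a small, made-up table with two value columns (the data and the name `long_df` are hypothetical), comparing the two results with `pd.testing.assert_frame_equal`:

```python
import pandas as pd

# Hypothetical long-format table with two value columns
long_df = pd.DataFrame({
    'date': ['2000Q1', '2000Q1', '2000Q2', '2000Q2'],
    'item': ['infl', 'unemp', 'infl', 'unemp'],
    'value': [2.5, 4.0, 2.7, 3.9],
    'value2': [0.1, 0.2, 0.3, 0.4],
})

pivoted = long_df.pivot(index='date', columns='item')
unstacked = long_df.set_index(['date', 'item']).unstack('item')

# Raises an AssertionError if the two frames differ
pd.testing.assert_frame_equal(pivoted, unstacked)
```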


<b>Pivoting “Wide” to “Long” Format

An inverse operation to pivot for DataFrames is pandas.melt. Rather than transforming one column into many in a new DataFrame, it merges multiple columns into one, producing a DataFrame that is longer than the input.

In [52]:
df = pd.DataFrame({'key': ['foo', 'bar', 'baz'],
                   'A': [1, 2, 3],
                   'B': [4, 5, 6],
                   'C': [7, 8, 9]})

df

   key  A  B  C
0  foo  1  4  7
1  bar  2  5  8
2  baz  3  6  9


The 'key' column may be a group indicator, and the other columns are data values. When using pandas.melt, we must indicate which columns (if any) are group indicators. Let’s use 'key' as the only group indicator here:

In [54]:
melted = pd.melt(df, ['key'])
melted

   key  variable  value
0  foo         A      1
1  bar         A      2
2  baz         A      3
3  foo         B      4
4  bar         B      5
5  baz         B      6
6  foo         C      7
7  bar         C      8
8  baz         C      9


In [58]:
reshaped = melted.pivot(index='key', columns='variable', values='value')
reshaped

variable  A  B  C
key
bar       2  5  8
baz       3  6  9
foo       1  4  7


Since the result of pivot creates an index from the column used as the row labels, we may want to use reset_index to move the data back into a column:

In [57]:
reshaped.reset_index()

variable  key  A  B  C
0         bar  2  5  8
1         baz  3  6  9
2         foo  1  4  7
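
Putting the pieces together, a melt followed by pivot and reset_index should recover the original table up to row order. The sketch below also clears the leftover 'variable' label on the columns; the names `restored` and `original_sorted` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'key': ['foo', 'bar', 'baz'],
                   'A': [1, 2, 3],
                   'B': [4, 5, 6],
                   'C': [7, 8, 9]})

melted = pd.melt(df, id_vars=['key'])
restored = (melted.pivot(index='key', columns='variable', values='value')
                  .reset_index()
                  .rename_axis(None, axis=1))   # drop the leftover 'variable' label

# pivot sorts the rows by key, so sort the original the same way before comparing
original_sorted = df.sort_values('key').reset_index(drop=True)
print(restored.equals(original_sorted))
```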


You can also specify a subset of columns to use as value columns:

In [59]:
pd.melt(df, id_vars=['key'], value_vars=['A', 'B'])

   key  variable  value
0  foo         A      1
1  bar         A      2
2  baz         A      3
3  foo         B      4
4  bar         B      5
5  baz         B      6


pandas.melt can be used without any group identifiers, too:

In [62]:
df

   key  A  B  C
0  foo  1  4  7
1  bar  2  5  8
2  baz  3  6  9


In [61]:
pd.melt(df, value_vars=['A', 'B', 'C','key'])

Unnamed: 0,variable,value
0,A,1
1,A,2
2,A,3
3,B,4
4,B,5
5,B,6
6,C,7
7,C,8
8,C,9
9,key,foo
