# Chapter 8. Data Wrangling: Join, Combine, and Reshape
<a id='index'></a>
## Table of Contents
- [8.1 Hierarchical Indexing](#81)
    - [8.1.1 Reordering and Sorting Levels](#811)
    - [8.1.2 Summary Statistics by Level](#812)
    - [8.1.3 Indexing with a DataFrame's Columns](#813)
- [8.2 Combining and Merging Datasets](#82)
    - [8.2.1 Database-Style DataFrame Joins](#821)
    - [8.2.2 Merging on Index](#822)
    - [8.2.3 Concatenating Along an Axis](#823)
    - [8.2.4 Combining Data with Overlap](#824)
- [8.3 Reshaping and Pivoting](#83)
    - [8.3.1 Reshaping with Hierarchical Indexing](#831)
    - [8.3.2 Pivoting “Long” to “Wide” Format](#832)
    - [8.3.3 Pivoting "Wide" to "Long" Format](#833)

## 8.1 Hierarchical Indexing
<a id='81'></a>

In [1]:
import numpy as np
import pandas as pd

In [2]:
# Series with multi-indexes
data = pd.Series(np.random.randn(9), index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'], [1,2,3,1,3,1,2,2,3]])
data

a  1   -0.236161
   2    0.855534
   3   -0.066375
b  1   -0.329522
   3    0.627017
c  1   -1.298140
   2    0.373509
d  2   -0.094909
   3    1.583782
dtype: float64

In [3]:
# What you’re seeing is a prettified view of a Series with a MultiIndex as its index. The
# “gaps” in the index display mean “use the label directly above”:
data.index

MultiIndex(levels=[['a', 'b', 'c', 'd'], [1, 2, 3]],
           labels=[[0, 0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 2, 0, 2, 0, 1, 1, 2]])

In [4]:
# Partial indexing on the outer level selects every entry under that label:
data['b']

1   -0.329522
3    0.627017
dtype: float64

In [5]:
data['b':'c']

b  1   -0.329522
   3    0.627017
c  1   -1.298140
   2    0.373509
dtype: float64

In [6]:
data.loc[['b', 'd']]

b  1   -0.329522
   3    0.627017
d  2   -0.094909
   3    1.583782
dtype: float64

In [7]:
# Selection is even possible from an “inner” level:
data.loc[:, 2]

a    0.855534
c    0.373509
d   -0.094909
dtype: float64
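
The inner-level selection can also be written with `xs`, which names the level explicitly. The following is a small sketch rather than part of the original notebook: it assumes predictable values (`np.arange` in place of the random numbers) so that the result is easy to check.

```python
import numpy as np
import pandas as pd

# Same index layout as `data`, but with predictable values 0.0 .. 8.0
# (an assumption made for this sketch; the notebook uses random numbers)
data = pd.Series(np.arange(9.0),
                 index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
                        [1, 2, 3, 1, 3, 1, 2, 2, 3]])

inner = data.loc[:, 2]          # every outer label, inner label 2
via_xs = data.xs(2, level=1)    # same selection, level given by position

print(inner)
print(via_xs)
```

Both forms drop the selected level, leaving a Series indexed by the outer labels that actually contain the inner label `2`.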

In [8]:
# You can rearrange the data into a DataFrame using its unstack method:
data.unstack()

Unnamed: 0,1,2,3
a,-0.236161,0.855534,-0.066375
b,-0.329522,,0.627017
c,-1.29814,0.373509,
d,,-0.094909,1.583782


In [9]:
# The inverse operation of unstack is stack:
data.unstack().stack()

a  1   -0.236161
   2    0.855534
   3   -0.066375
b  1   -0.329522
   3    0.627017
c  1   -1.298140
   2    0.373509
d  2   -0.094909
   3    1.583782
dtype: float64

In [10]:
# With a DataFrame, either axis can have a hierarchical index
frame = pd.DataFrame(np.arange(12).reshape((4, 3)), 
                     index=[['a','a','b','b'],
                            ['1','2','1','2']], 
                     columns=[['Ohio', 'Ohio', 'Colorado'],
                              ['Green', 'Red', 'Green']])
frame

Unnamed: 0_level_0,Unnamed: 1_level_0,Ohio,Ohio,Colorado
Unnamed: 0_level_1,Unnamed: 1_level_1,Green,Red,Green
a,1,0,1,2
a,2,3,4,5
b,1,6,7,8
b,2,9,10,11


In [11]:
# The hierarchical levels can have names (as strings or any Python objects). 
# If so, these will show up in the console output:
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']

frame

Unnamed: 0_level_0,state,Ohio,Ohio,Colorado
Unnamed: 0_level_1,color,Green,Red,Green
key1,key2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
a,1,0,1,2
a,2,3,4,5
b,1,6,7,8
b,2,9,10,11


In [12]:
# With partial column indexing you can similarly select groups of columns:
frame['Ohio']

Unnamed: 0_level_0,color,Green,Red
key1,key2,Unnamed: 2_level_1,Unnamed: 3_level_1
a,1,0,1
a,2,3,4
b,1,6,7
b,2,9,10


A MultiIndex can be created by itself and then reused; the columns in the preceding DataFrame with level names could be created like this:
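
The corresponding cell is missing from these notes. A minimal sketch of what it could look like, assuming `pd.MultiIndex.from_arrays` is the intended constructor and reusing the level values and names from the DataFrame shown earlier:

```python
import pandas as pd

# Build the column index on its own: one list per level, plus the level names
columns = pd.MultiIndex.from_arrays(
    [['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']],
    names=['state', 'color'])

print(columns)
```

The resulting object can be passed as `columns=` (or `index=`) to any number of DataFrames.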

<hr>

### 8.1.1 Reordering and Sorting Levels
<a id='811'></a>

In [14]:
# swaplevel takes two level numbers or names and returns a new object with the levels 
# interchanged (but the data is otherwise unaltered):
frame.swaplevel('key1', 'key2')

Unnamed: 0_level_0,state,Ohio,Ohio,Colorado
Unnamed: 0_level_1,color,Green,Red,Green
key2,key1,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
1,a,0,1,2
2,a,3,4,5
1,b,6,7,8
2,b,9,10,11


In [18]:
# sort_index, on the other hand, sorts the data using only the values in a single level.
frame.sort_index(level=0)

Unnamed: 0_level_0,state,Ohio,Ohio,Colorado
Unnamed: 0_level_1,color,Green,Red,Green
key1,key2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
a,1,0,1,2
a,2,3,4,5
b,1,6,7,8
b,2,9,10,11


In [20]:
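# When swapping levels, it is common to also call sort_index so that the result
# is sorted by the new outer level:
frame.swaplevel(0, 1).sort_index(level=0)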

Unnamed: 0_level_0,state,Ohio,Ohio,Colorado
Unnamed: 0_level_1,color,Green,Red,Green
key2,key1,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
1,a,0,1,2
1,b,6,7,8
2,a,3,4,5
2,b,9,10,11


### 8.1.2 Summary Statistics by Level
<a id='812'></a>

In [21]:
# Sum the rows that share a label in the 'key2' index level:
frame.sum(level='key2')

state,Ohio,Ohio,Colorado
color,Green,Red,Green
key2,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
1,6,8,10
2,12,14,16


In [25]:
# The same works across columns: sum the columns that share a 'color' label:
frame.sum(level='color', axis=1)

Unnamed: 0_level_0,color,Green,Red
key1,key2,Unnamed: 2_level_1,Unnamed: 3_level_1
a,1,2,1
a,2,8,4
b,1,14,7
b,2,20,10


In [26]:
frame.sum(level='color', axis=1).sum(level='key2')

color,Green,Red
key2,Unnamed: 1_level_1,Unnamed: 2_level_1
1,16,8
2,28,14
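
In recent pandas releases the `level` argument of `sum` is no longer available (according to the release notes it was deprecated in 1.3 and removed in 2.0). The sketch below expresses the same aggregations with `groupby`, which should also run on older versions. The transpose used for the column axis is one possible spelling, chosen here as an assumption, not the only option.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=[['a', 'a', 'b', 'b'], ['1', '2', '1', '2']],
                     columns=[['Ohio', 'Ohio', 'Colorado'],
                              ['Green', 'Red', 'Green']])
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']

# Row-wise: group the rows by the 'key2' index level and sum each group
by_key2 = frame.groupby(level='key2').sum()

# Column-wise: transpose, group by the 'color' level, sum, transpose back
by_color = frame.T.groupby(level='color').sum().T

print(by_key2)
print(by_color)
```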


### 8.1.3 Indexing with a DataFrame's Columns
<a id='813'></a>

In [31]:
frame = pd.DataFrame({'a': range(7), 
                      'b': range(7, 0, -1), 
                      'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'],
                      'd': [0, 1, 2, 0, 1, 2, 3]})
frame

Unnamed: 0,a,b,c,d
0,0,7,one,0
1,1,6,one,1
2,2,5,one,2
3,3,4,two,0
4,4,3,two,1
5,5,2,two,2
6,6,1,two,3


In [40]:
frame.index.names = ['No.']
frame

Unnamed: 0_level_0,a,b,c,d
No.,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
0,0,7,one,0
1,1,6,one,1
2,2,5,one,2
3,3,4,two,0
4,4,3,two,1
5,5,2,two,2
6,6,1,two,3


In [42]:
# DataFrame’s set_index function will create a new DataFrame using one or more of its columns as the index:
frame2 = frame.set_index(['c', 'd'])
frame2

Unnamed: 0_level_0,Unnamed: 1_level_0,a,b
c,d,Unnamed: 2_level_1,Unnamed: 3_level_1
one,0,0,7
one,1,1,6
one,2,2,5
two,0,3,4
two,1,4,3
two,2,5,2
two,3,6,1


In [43]:
# By default the columns are removed from the DataFrame; pass drop=False to keep them:
frame.set_index(['c', 'd'], drop=False)

Unnamed: 0_level_0,Unnamed: 1_level_0,a,b,c,d
c,d,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
one,0,0,7,one,0
one,1,1,6,one,1
one,2,2,5,one,2
two,0,3,4,two,0
two,1,4,3,two,1
two,2,5,2,two,2
two,3,6,1,two,3


In [47]:
# reset_index does the opposite of set_index: the index levels are moved back into columns:
frame2.reset_index()

Unnamed: 0,c,d,a,b
0,one,0,0,7
1,one,1,1,6
2,one,2,2,5
3,two,0,3,4
4,two,1,4,3
5,two,2,5,2
6,two,3,6,1


<hr>

## 8.2 Combining and Merging Datasets
<a id='82'></a>
- ***pandas.merge*** connects rows in DataFrames based on one or more keys. This will be familiar to users of SQL or other relational databases, as it implements database join operations.
- ***pandas.concat*** concatenates or “stacks” together objects along an axis.
- The ***combine_first*** instance method enables splicing together overlapping data to fill in missing values in one object with values from another.

### 8.2.1 Database-Style DataFrame Joins
<a id='821'></a>

In [26]:
df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                    'data1': range(7)})
df2 = pd.DataFrame({'key': ['a', 'b', 'd'],
                    'data2': range(3)})

In [27]:
df1

Unnamed: 0,data1,key
0,0,b
1,1,b
2,2,a
3,3,c
4,4,a
5,5,a
6,6,b


In [28]:
df2

Unnamed: 0,data2,key
0,0,a
1,1,b
2,2,d


In [31]:
# Merge or join operations combine datasets by linking rows using one or more keys.
# This is an example of a many-to-one join; the data in df1 has multiple rows labeled a and b, 
# whereas df2 has only one row for each value in the key column. Calling merge with these objects, we obtain:
pd.merge(df1, df2)

Unnamed: 0,data1,key,data2
0,0,b,1
1,1,b,1
2,6,b,1
3,2,a,0
4,4,a,0
5,5,a,0


In [32]:
# If no join column is given, merge uses the overlapping column names as the keys.
# It's good practice to specify the key explicitly, though:
pd.merge(df1, df2, on='key')

Unnamed: 0,data1,key,data2
0,0,b,1
1,1,b,1
2,6,b,1
3,2,a,0
4,4,a,0
5,5,a,0


In [35]:
# If the column names are different in each object, you can specify them separately:
df3 = pd.DataFrame({'lkey': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                    'data1': range(7)})
df4 = pd.DataFrame({'rkey': ['a', 'b', 'd'],
                    'data2': range(3)})

pd.merge(df3, df4, left_on='lkey', right_on='rkey')

Unnamed: 0,data1,lkey,data2,rkey
0,0,b,1,b
1,1,b,1,b
2,6,b,1,b
3,2,a,0,a
4,4,a,0,a
5,5,a,0,a


> You may notice that the 'c' and 'd' values and associated data are missing from the result. By default, merge does an ***'inner'*** join, so the keys in the result are the intersection of the keys found in both tables. The other possible options are ***'left'***, ***'right'***, and ***'outer'***.

In [36]:
pd.merge(df1, df2, how='outer')

Unnamed: 0,data1,key,data2
0,0.0,b,1.0
1,1.0,b,1.0
2,6.0,b,1.0
3,2.0,a,0.0
4,4.0,a,0.0
5,5.0,a,0.0
6,3.0,c,
7,,d,2.0
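
To see how the `how` option changes which keys survive, the following sketch merges the same two frames with each join type and collects the distinct keys of every result. The helper name `keys_by_how` is introduced only for this illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                    'data1': range(7)})
df2 = pd.DataFrame({'key': ['a', 'b', 'd'],
                    'data2': range(3)})

# Distinct keys present in the result of each join type
keys_by_how = {
    how: sorted(pd.merge(df1, df2, on='key', how=how)['key'].unique())
    for how in ['inner', 'left', 'right', 'outer']
}

for how, keys in keys_by_how.items():
    print(how, keys)
```

An inner join keeps the intersection of the keys, a left or right join keeps every key from that side, and an outer join keeps the union.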


In [38]:
# Many-to-many merges have well-defined, though not necessarily intuitive, behavior. Here’s an example:
df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
                    'data1': range(6)})
df2 = pd.DataFrame({'key': ['a', 'b', 'a', 'b', 'd'],
                    'data2': range(5)})

In [39]:
pd.merge(df1, df2, on='key', how='left')

Unnamed: 0,data1,key,data2
0,0,b,1.0
1,0,b,3.0
2,1,b,1.0
3,1,b,3.0
4,2,a,0.0
5,2,a,2.0
6,3,c,
7,4,a,0.0
8,4,a,2.0
9,5,b,1.0
10,5,b,3.0


In [40]:
# Many-to-many joins form the Cartesian product of the matching rows. Since there were three 'b'
# rows in the left DataFrame and two in the right one, there are six 'b' rows in the result.
# The join method only affects the distinct key values appearing in the result:
pd.merge(df1, df2, how='inner')

Unnamed: 0,data1,key,data2
0,0,b,1
1,0,b,3
2,1,b,1
3,1,b,3
4,5,b,1
5,5,b,3
6,2,a,0
7,2,a,2
8,4,a,0
9,4,a,2
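
The row count of a many-to-many merge can be predicted from the key counts on each side. This sketch, whose variable names are made up for the illustration, multiplies the per-key counts and compares the total with the length of the inner join:

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
                    'data1': range(6)})
df2 = pd.DataFrame({'key': ['a', 'b', 'a', 'b', 'd'],
                    'data2': range(5)})

# Occurrences of each key on either side
left_counts = df1['key'].value_counts()
right_counts = df2['key'].value_counts()

# Keys present on only one side align to NaN and are dropped, as in an inner join
rows_per_key = (left_counts * right_counts).dropna().astype(int)

inner = pd.merge(df1, df2, on='key', how='inner')

print(rows_per_key.to_dict())
print(len(inner))
```

Each key contributes (rows on the left) × (rows on the right) rows, which is why duplicated keys on both sides can inflate a result quickly.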


In [41]:
# To merge with multiple keys, pass a list of column names:
left = pd.DataFrame({'key1': ['foo', 'foo', 'bar'],
                     'key2': ['one', 'two', 'one'],
                     'lval': [1, 2, 3]})
right = pd.DataFrame({'key1': ['foo', 'foo', 'bar', 'bar'],
                      'key2': ['one', 'one', 'one', 'two'],
                      'rval': [4, 5, 6, 7]})

pd.merge(left, right, on=['key1', 'key2'], how='outer')

Unnamed: 0,key1,key2,lval,rval
0,foo,one,1.0,4.0
1,foo,one,1.0,5.0
2,foo,two,2.0,
3,bar,one,3.0,6.0
4,bar,two,,7.0


> A last issue to consider in merge operations is the treatment of overlapping column names. While you can address the overlap manually (see the earlier section on renaming axis labels), merge has a suffixes option for specifying strings to append to overlapping names in the left and right DataFrame objects:

In [42]:
pd.merge(left, right, on='key1')

Unnamed: 0,key1,key2_x,lval,key2_y,rval
0,foo,one,1,one,4
1,foo,one,1,one,5
2,foo,two,2,one,4
3,foo,two,2,one,5
4,bar,one,3,one,6
5,bar,one,3,two,7


In [47]:
# Pass suffixes to choose the strings appended to the overlapping column names:
pd.merge(left, right, on='key1', suffixes=['_leftx', '_righty'])

Unnamed: 0,key1,key2_leftx,lval,key2_righty,rval
0,foo,one,1,one,4
1,foo,one,1,one,5
2,foo,two,2,one,4
3,foo,two,2,one,5
4,bar,one,3,one,6
5,bar,one,3,two,7


### 8.2.2 Merging on Index
<a id='822'></a>
In some cases, the merge key(s) in a DataFrame will be found in its index. In this case, you can pass left_index=True or right_index=True (or both) to indicate that the index should be used as the merge key:

In [48]:
left1 = pd.DataFrame({'key': ['a', 'b', 'a', 'a', 'b', 'c'],
                      'value': range(6)})
left1

Unnamed: 0,key,value
0,a,0
1,b,1
2,a,2
3,a,3
4,b,4
5,c,5


In [49]:
right1 = pd.DataFrame({'group_val': [3.5, 7]}, index=['a', 'b'])
right1

Unnamed: 0,group_val
a,3.5
b,7.0


In [50]:
pd.merge(left1, right1, left_on='key', right_index=True)

Unnamed: 0,key,value,group_val
0,a,0,3.5
2,a,2,3.5
3,a,3,3.5
1,b,1,7.0
4,b,4,7.0


In [52]:
# Since the default merge method is to intersect the join keys, 
# you can instead form the union of them with an outer join:
pd.merge(left1, right1, left_on='key', right_index=True, how='outer')

Unnamed: 0,key,value,group_val
0,a,0,3.5
2,a,2,3.5
3,a,3,3.5
1,b,1,7.0
4,b,4,7.0
5,c,5,


In [53]:
lefth = pd.DataFrame({'key1': ['Ohio', 'Ohio', 'Ohio',
                               'Nevada', 'Nevada'],
                      'key2': [2000, 2001, 2002, 2001, 2002],
                      'data': np.arange(5.)})
lefth

Unnamed: 0,data,key1,key2
0,0.0,Ohio,2000
1,1.0,Ohio,2001
2,2.0,Ohio,2002
3,3.0,Nevada,2001
4,4.0,Nevada,2002


In [55]:
righth = pd.DataFrame(np.arange(12).reshape((6, 2)),
                      index=[['Nevada', 'Nevada', 'Ohio', 'Ohio',
                              'Ohio', 'Ohio'],
                             [2001, 2000, 2000, 2000, 2001, 2002]],
                      columns=['event1', 'event2'])
righth

Unnamed: 0,Unnamed: 1,event1,event2
Nevada,2001,0,1
Nevada,2000,2,3
Ohio,2000,4,5
Ohio,2000,6,7
Ohio,2001,8,9
Ohio,2002,10,11


In [58]:
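# With hierarchically indexed data, you have to indicate multiple columns to merge on as a list.
# Note that the duplicated index value ('Ohio', 2000) in righth yields two rows in the result;
# pass how='outer' to also keep the keys that appear on only one side: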
pd.merge(lefth, righth, left_on=['key1', 'key2'], right_index=True)

Unnamed: 0,data,key1,key2,event1,event2
0,0.0,Ohio,2000,4,5
0,0.0,Ohio,2000,6,7
1,1.0,Ohio,2001,8,9
2,2.0,Ohio,2002,10,11
3,3.0,Nevada,2001,0,1


In [59]:
# Using the indexes of both sides of the merge is also possible:
left2 = pd.DataFrame([[1., 2.], [3., 4.], [5., 6.]],
                     index=['a', 'c', 'e'],
                     columns=['Ohio', 'Nevada'])
right2 = pd.DataFrame([[7., 8.], [9., 10.], [11., 12.], [13, 14]],
                      index=['b', 'c', 'd', 'e'],
                      columns=['Missouri', 'Alabama'])

pd.merge(left2, right2, how='outer', left_index=True, right_index=True)

Unnamed: 0,Ohio,Nevada,Missouri,Alabama
a,1.0,2.0,,
b,,,7.0,8.0
c,3.0,4.0,9.0,10.0
d,,,11.0,12.0
e,5.0,6.0,13.0,14.0


In [61]:
# DataFrame has a convenient join instance method for merging by index.
left2.join(right2, how='outer')

Unnamed: 0,Ohio,Nevada,Missouri,Alabama
a,1.0,2.0,,
b,,,7.0,8.0
c,3.0,4.0,9.0,10.0
d,,,11.0,12.0
e,5.0,6.0,13.0,14.0


In [62]:
# join can also match the index of the passed DataFrame against a column of the calling DataFrame:
left1.join(right1, on='key')

Unnamed: 0,key,value,group_val
0,a,0,3.5
1,b,1,7.0
2,a,2,3.5
3,a,3,3.5
4,b,4,7.0
5,c,5,
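
Unlike `merge`, `join` performs a left join by default, which is why the unmatched `'c'` row is kept. The sketch below places the `join` call next to what is assumed to be the equivalent `merge` call so the two results can be compared:

```python
import pandas as pd

left1 = pd.DataFrame({'key': ['a', 'b', 'a', 'a', 'b', 'c'],
                      'value': range(6)})
right1 = pd.DataFrame({'group_val': [3.5, 7]}, index=['a', 'b'])

# join: left join on the 'key' column against right1's index
joined = left1.join(right1, on='key')

# merge spelled out with the same keys and an explicit left join
merged = pd.merge(left1, right1, left_on='key', right_index=True, how='left')

print(joined)
print(merged)
```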


In [64]:
# Lastly, for simple index-on-index merges, you can pass a list of DataFrames to 
# join as an alternative to using the more general concat function described in the next section:
another = pd.DataFrame([[7., 8.], [9., 10.], [11., 12.], [16., 17.]],
                       index=['a', 'c', 'e', 'f'],
                       columns=['New York', 'Oregon'])
left2.join([right2, another], how='outer')

Unnamed: 0,Ohio,Nevada,Missouri,Alabama,New York,Oregon
a,1.0,2.0,,,7.0,8.0
b,,,7.0,8.0,,
c,3.0,4.0,9.0,10.0,9.0,10.0
d,,,11.0,12.0,,
e,5.0,6.0,13.0,14.0,11.0,12.0
f,,,,,16.0,17.0


### 8.2.3 Concatenating Along an Axis
<a id='823'></a>

In [67]:
# NumPy's concatenate function joins arrays along an axis (axis=0, the rows, by default):
arr = np.arange(12).reshape((3, 4))
np.concatenate([arr, arr])

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

In [68]:
np.concatenate([arr, arr], axis=1)

array([[ 0,  1,  2,  3,  0,  1,  2,  3],
       [ 4,  5,  6,  7,  4,  5,  6,  7],
       [ 8,  9, 10, 11,  8,  9, 10, 11]])

In [70]:
# The concat function in pandas does the same for labeled objects such as Series and DataFrame.
# Suppose we have three Series with no index overlap:
s1 = pd.Series([0, 1], index=['a', 'b'])
s2 = pd.Series([2, 3, 4], index=['c', 'd', 'e'])
s3 = pd.Series([5, 6], index=['f', 'g'])

# Calling concat with these objects in a list glues together the values and indexes:
pd.concat([s1, s2, s3])

a    0
b    1
c    2
d    3
e    4
f    5
g    6
dtype: int64

In [72]:
# By default concat works along axis=0, producing another Series. If you pass axis=1,
# the result will instead be a DataFrame (axis=1 is the columns):
pd.concat([s1, s2, s3], axis=1)

Unnamed: 0,0,1,2
a,0.0,,
b,1.0,,
c,,2.0,
d,,3.0,
e,,4.0,
f,,,5.0
g,,,6.0


In [77]:
# In the previous case there was no overlap on the other axis, which as you can see is the
# sorted union (the 'outer' join) of the indexes. Here the indexes partly overlap:
s4 = pd.concat([s1, s3])
pd.concat([s1, s4], axis=1)

Unnamed: 0,0,1
a,0.0,0
b,1.0,1
f,,5
g,,6


In [79]:
# You can instead intersect them by passing join='inner':
pd.concat([s1, s4], axis=1, join='inner')

Unnamed: 0,0,1
a,0,0
b,1,1


In [81]:
# In this last example, the 'f' and 'g' labels disappeared because of the join='inner' option.
# To choose the labels used on the other axis yourself, reindex the result
# (older pandas versions accepted a join_axes argument for this; it was removed in pandas 1.0):
pd.concat([s1, s4], axis=1).reindex(['a', 'c', 'b', 'e'])

Unnamed: 0,0,1
a,0.0,0.0
c,,
b,1.0,1.0
e,,


In [82]:
# A potential issue is that the concatenated pieces are not identifiable in the result. 
# Suppose instead you wanted to create a hierarchical index on the concatenation axis. 
# To do this, use the keys argument:
result = pd.concat([s1, s1, s3], keys=['one', 'two', 'three'])
result

one    a    0
       b    1
two    a    0
       b    1
three  f    5
       g    6
dtype: int64

In [83]:
# unstack pivots the inner index level into columns, giving a wide table:
result.unstack()

Unnamed: 0,a,b,f,g
one,0.0,1.0,,
two,0.0,1.0,,
three,,,5.0,6.0


In [86]:
# When combining Series along axis=1, the keys become the DataFrame column headers:
pd.concat([s1, s2, s3], keys=['one', 'two', 'three'], axis=1)

Unnamed: 0,one,two,three
a,0.0,,
b,1.0,,
c,,2.0,
d,,3.0,
e,,4.0,
f,,,5.0
g,,,6.0


In [87]:
# The same logic extends to DataFrame objects:
df1 = pd.DataFrame(np.arange(6).reshape(3, 2), index=['a', 'b', 'c'],
                   columns=['one', 'two'])
df1

Unnamed: 0,one,two
a,0,1
b,2,3
c,4,5


In [88]:
df2 = pd.DataFrame(5 + np.arange(4).reshape(2, 2), index=['a', 'c'],
                   columns=['three', 'four'])
df2

Unnamed: 0,three,four
a,5,6
c,7,8


In [91]:
pd.concat([df1, df2], axis=1, keys=['level1', 'level2'])

Unnamed: 0_level_0,level1,level1,level2,level2
Unnamed: 0_level_1,one,two,three,four
a,0,1,5.0,6.0
b,2,3,,
c,4,5,7.0,8.0


In [94]:
# Equivalent: if you pass a dict of objects instead of a list, the dict's keys are used as the keys:
pd.concat({'level1': df1, 'level2': df2}, axis=1)

Unnamed: 0_level_0,level1,level1,level2,level2
Unnamed: 0_level_1,one,two,three,four
a,0,1,5.0,6.0
b,2,3,,
c,4,5,7.0,8.0


In [93]:
# Along axis=0 (the default), the keys form the outer level of the row index instead:
pd.concat([df1, df2], keys=['level1', 'level2'])

Unnamed: 0,Unnamed: 1,four,one,three,two
level1,a,,0.0,,1.0
level1,b,,2.0,,3.0
level1,c,,4.0,,5.0
level2,a,6.0,,5.0,
level2,c,8.0,,7.0,


In [95]:
# The names argument labels the levels of the resulting hierarchical index:
pd.concat([df1, df2], axis=1, keys=['level1', 'level2'], names=['upper', 'lower'])

upper,level1,level1,level2,level2
lower,one,two,three,four
a,0,1,5.0,6.0
b,2,3,,
c,4,5,7.0,8.0


In [97]:
# A last consideration concerns DataFrames in which the row index does not contain any relevant data:
df1 = pd.DataFrame(np.random.randn(3, 4), columns=['a', 'b', 'c', 'd'])
df1

Unnamed: 0,a,b,c,d
0,0.033507,0.863772,0.112738,-1.304616
1,0.334212,0.056916,0.146768,0.159085
2,-1.089473,0.109103,0.323687,-1.727164


In [98]:
df2 = pd.DataFrame(np.random.randn(2, 3), columns=['b', 'd', 'a'])
df2

Unnamed: 0,b,d,a
0,-1.046414,1.935099,0.114939
1,0.324846,-0.941661,-0.951848


In [99]:
# In this case, you can pass ignore_index=True:
pd.concat([df1, df2], ignore_index=True)

Unnamed: 0,a,b,c,d
0,0.033507,0.863772,0.112738,-1.304616
1,0.334212,0.056916,0.146768,0.159085
2,-1.089473,0.109103,0.323687,-1.727164
3,0.114939,-1.046414,,1.935099
4,-0.951848,0.324846,,-0.941661


### 8.2.4 Combining Data with Overlap
<a id='824'></a>

In [103]:
a = pd.Series([np.nan, 2.5, np.nan, 3.5, 4.5, np.nan], index=['f', 'e', 'd', 'c', 'b', 'a'])
b = pd.Series(np.arange(len(a), dtype=np.float64), index=['f', 'e', 'd', 'c', 'b', 'a'])

# Set the last value to missing (iloc makes the positional assignment explicit):
b.iloc[-1] = np.nan

In [104]:
a

f    NaN
e    2.5
d    NaN
c    3.5
b    4.5
a    NaN
dtype: float64

In [105]:
b

f    0.0
e    1.0
d    2.0
c    3.0
b    4.0
a    NaN
dtype: float64

In [106]:
# np.where acts as an array-oriented if-else: take b where a is null, otherwise a:
np.where(pd.isnull(a), b, a)

array([ 0. ,  2.5,  2. ,  3.5,  4.5,  nan])

In [108]:
# Series has a combine_first method, which performs the equivalent of 
# this operation along with pandas’s usual data alignment logic:
b[:-2].combine_first(a[2:])

a    NaN
b    4.5
c    3.0
d    2.0
e    1.0
f    0.0
dtype: float64
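
The difference from `np.where` is that `combine_first` aligns on labels rather than positions. The sketch below reuses `a` and `b` and then patches `a` from a re-sorted copy of `b` (the name `b_shuffled` exists only for this illustration) to show that the result does not depend on row order:

```python
import numpy as np
import pandas as pd

a = pd.Series([np.nan, 2.5, np.nan, 3.5, 4.5, np.nan],
              index=['f', 'e', 'd', 'c', 'b', 'a'])
b = pd.Series(np.arange(len(a), dtype=np.float64),
              index=['f', 'e', 'd', 'c', 'b', 'a'])
b.iloc[-1] = np.nan

# Fill the holes in `a` with the values carrying the same label in `b`
patched = a.combine_first(b)

# Same data in a different row order; alignment by label gives the same answer
b_shuffled = b.sort_index()
patched_shuffled = a.combine_first(b_shuffled)

print(patched)
print(patched_shuffled)
```

A positional `np.where(pd.isnull(a), b_shuffled, a)` would pair up the wrong values here, because it ignores the index entirely.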

In [110]:
# With DataFrames, combine_first does the same thing column by column, so you can think of it 
# as “patching” missing data in the calling object with data from the object you pass:
df1 = pd.DataFrame({'a': [1., np.nan, 5., np.nan],
                    'b': [np.nan, 2., np.nan, 6.],
                    'c': range(2, 18, 4)})
df1

Unnamed: 0,a,b,c
0,1.0,,2
1,,2.0,6
2,5.0,,10
3,,6.0,14


In [111]:
df2 = pd.DataFrame({'a': [5., 4., np.nan, 3., 7.],
                    'b': [np.nan, 3., 4., 6., 8.]})
df2

Unnamed: 0,a,b
0,5.0,
1,4.0,3.0
2,,4.0
3,3.0,6.0
4,7.0,8.0


In [112]:
df1.combine_first(df2)

Unnamed: 0,a,b,c
0,1.0,,2.0
1,4.0,2.0,6.0
2,5.0,4.0,10.0
3,3.0,6.0,14.0
4,7.0,8.0,


<hr>

## 8.3 Reshaping and Pivoting
<a id='83'></a>

### 8.3.1 Reshaping with Hierarchical Indexing
<a id='831'></a>
Hierarchical indexing provides a consistent way to rearrange data in a DataFrame. There are two primary actions:
* **stack**
    * This “rotates” or pivots from the columns in the data to the rows
* **unstack**
    * This pivots from the rows into the columns

In [130]:
data = pd.DataFrame(np.arange(6).reshape((2, 3)), 
                    index=pd.Index(['Ohio', 'Colorado'], name='state'),
                    columns=pd.Index(['one', 'two', 'three'], name='number'))
data

number,one,two,three
state,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Ohio,0,1,2
Colorado,3,4,5


In [115]:
# Using the stack method on this data pivots the columns into the rows, producing a Series:
data.stack()

state     number
Ohio      one       0
          two       1
          three     2
Colorado  one       3
          two       4
          three     5
dtype: int64

In [116]:
# From a hierarchically indexed Series, you can rearrange the data back into a DataFrame with unstack:
data.stack().unstack()

number,one,two,three
state,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Ohio,0,1,2
Colorado,3,4,5


In [120]:
# By default the innermost level is unstacked (same with stack). 
# You can unstack a different level by passing a level number or name:

data.stack().unstack('state')
# = data.stack().unstack(0)

state,Ohio,Colorado
number,Unnamed: 1_level_1,Unnamed: 2_level_1
one,0,3
two,1,4
three,2,5


In [122]:
# Unstacking might introduce missing data if all of the values in the level aren’t found in each of the subgroups:
s1 = pd.Series([0, 1, 2, 3], index=['a', 'b', 'c', 'd'])
s2 = pd.Series([4, 5, 6], index=['c', 'd', 'e'])
data2 = pd.concat([s1, s2], keys=['one', 'two'])
data2

one  a    0
     b    1
     c    2
     d    3
two  c    4
     d    5
     e    6
dtype: int64

In [123]:
data2.unstack()

Unnamed: 0,a,b,c,d,e
one,0.0,1.0,2.0,3.0,
two,,,4.0,5.0,6.0


In [124]:
data2.unstack().stack()

one  a    0.0
     b    1.0
     c    2.0
     d    3.0
two  c    4.0
     d    5.0
     e    6.0
dtype: float64

In [125]:
# By default stack filters out missing data; pass dropna=False to keep the NaN entries:
data2.unstack().stack(dropna=False)

one  a    0.0
     b    1.0
     c    2.0
     d    3.0
     e    NaN
two  a    NaN
     b    NaN
     c    4.0
     d    5.0
     e    6.0
dtype: float64

In [127]:
# When you unstack in a DataFrame, the level unstacked becomes the lowest level in the result:
df = pd.DataFrame({'left': data.stack(), 'right': data.stack() + 5}, 
                  columns=pd.Index(['left', 'right'], name='side'))

df

Unnamed: 0_level_0,side,left,right
state,number,Unnamed: 2_level_1,Unnamed: 3_level_1
Ohio,one,0,5
Ohio,two,1,6
Ohio,three,2,7
Colorado,one,3,8
Colorado,two,4,9
Colorado,three,5,10


In [128]:
df.unstack('state')

side,left,left,right,right
state,Ohio,Colorado,Ohio,Colorado
number,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
one,0,3,5,8
two,1,4,6,9
three,2,5,7,10


In [129]:
# When calling stack, we can indicate the name of the axis to stack:
df.unstack('state').stack('side')

Unnamed: 0_level_0,state,Colorado,Ohio
number,side,Unnamed: 2_level_1,Unnamed: 3_level_1
one,left,3,0
one,right,8,5
two,left,4,1
two,right,9,6
three,left,5,2
three,right,10,7


### 8.3.2 Pivoting “Long” to “Wide” Format
<a id='832'></a>

In [146]:
data = pd.read_csv('examples/macrodata.csv')
data.head()

Unnamed: 0,year,quarter,realgdp,realcons,realinv,realgovt,realdpi,cpi,m1,tbilrate,unemp,pop,infl,realint
0,1959.0,1.0,2710.349,1707.4,286.898,470.045,1886.9,28.98,139.7,2.82,5.8,177.146,0.0,0.0
1,1959.0,2.0,2778.801,1733.7,310.859,481.301,1919.7,29.15,141.7,3.08,5.1,177.83,2.34,0.74
2,1959.0,3.0,2775.488,1751.8,289.226,491.26,1916.4,29.35,140.5,3.82,5.3,178.657,2.74,1.09
3,1959.0,4.0,2785.204,1753.7,299.356,484.052,1931.3,29.37,140.0,4.33,5.6,179.386,0.27,4.06
4,1960.0,1.0,2847.699,1770.5,331.722,462.199,1955.5,29.54,139.6,3.5,5.2,180.007,2.31,1.19


In [147]:
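# Combine the year and quarter columns into a quarterly PeriodIndex, keep three of the
# columns, and index the rows by the timestamp at the end of each quarter:
periods = pd.PeriodIndex(year=data.year, quarter=data.quarter, name='date')
columns = pd.Index(['realgdp', 'infl', 'unemp'], name='item')
data = data.reindex(columns=columns)
data.index = periods.to_timestamp('D', 'end')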

data.head()

item,realgdp,infl,unemp
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1959-03-31,2710.349,0.0,5.8
1959-06-30,2778.801,2.34,5.1
1959-09-30,2775.488,2.74,5.3
1959-12-31,2785.204,0.27,5.6
1960-03-31,2847.699,2.31,5.2


In [163]:
# Build the long format step by step: stack the columns into the rows,
# move the index levels into columns, then name the value column:
#   data.stack()
#   data.stack().reset_index()
#   data.stack().reset_index().rename(columns={0: 'value'})
#
# This is the so-called long format for multiple time series, or other observational
# data with two or more keys (here, our keys are date and item). Each row in the table
# represents a single observation.
ldata = data.stack().reset_index().rename(columns={0: 'value'})
ldata.head()

Unnamed: 0,date,item,value
0,1959-03-31,realgdp,2710.349
1,1959-03-31,infl,0.0
2,1959-03-31,unemp,5.8
3,1959-06-30,realgdp,2778.801
4,1959-06-30,infl,2.34


In [164]:
# In some cases, the data may be more difficult to work with in this format; 
# you might prefer to have a DataFrame containing one column per distinct item 
# value indexed by timestamps in the date column. 
# DataFrame’s pivot method performs exactly this transformation:

# The index and columns arguments name the columns to be used as the row and column index,
# and the optional values argument names the column that fills the DataFrame.
pivoted = ldata.pivot(index='date', columns='item', values='value')
pivoted.head()

item,infl,realgdp,unemp
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1959-03-31,0.0,2710.349,5.8
1959-06-30,2.34,2778.801,5.1
1959-09-30,2.74,2775.488,5.3
1959-12-31,0.27,2785.204,5.6
1960-03-31,2.31,2847.699,5.2


In [170]:
# Suppose you had two value columns that you wanted to reshape simultaneously:
ldata['value2'] = np.random.randn(len(ldata))
ldata[:10]

Unnamed: 0,date,item,value,value2
0,1959-03-31,realgdp,2710.349,-0.00745
1,1959-03-31,infl,0.0,-1.137752
2,1959-03-31,unemp,5.8,-0.583975
3,1959-06-30,realgdp,2778.801,0.704402
4,1959-06-30,infl,2.34,-0.162753
5,1959-06-30,unemp,5.1,-0.318483
6,1959-09-30,realgdp,2775.488,0.517937
7,1959-09-30,infl,2.74,1.565753
8,1959-09-30,unemp,5.3,1.325127
9,1959-12-31,realgdp,2785.204,-0.023468


In [171]:
# By omitting the values argument, you obtain a DataFrame with hierarchical columns:
pivoted = ldata.pivot(index='date', columns='item')
pivoted.head()

Unnamed: 0_level_0,value,value,value,value2,value2,value2
item,infl,realgdp,unemp,infl,realgdp,unemp
date,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
1959-03-31,0.0,2710.349,5.8,-1.137752,-0.00745,-0.583975
1959-06-30,2.34,2778.801,5.1,-0.162753,0.704402,-0.318483
1959-09-30,2.74,2775.488,5.3,1.565753,0.517937,1.325127
1959-12-31,0.27,2785.204,5.6,-0.5823,-0.023468,-0.274981
1960-03-31,2.31,2847.699,5.2,-1.124776,1.473199,-0.481643


In [172]:
pivoted['value'][:5]

item,infl,realgdp,unemp
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1959-03-31,0.0,2710.349,5.8
1959-06-30,2.34,2778.801,5.1
1959-09-30,2.74,2775.488,5.3
1959-12-31,0.27,2785.204,5.6
1960-03-31,2.31,2847.699,5.2


In [174]:
# Note that pivot is equivalent to creating a hierarchical index using set_index followed by a call to unstack:
unstacked = ldata.set_index(['date', 'item']).unstack('item')
unstacked.head()

Unnamed: 0_level_0,value,value,value,value2,value2,value2
item,infl,realgdp,unemp,infl,realgdp,unemp
date,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
1959-03-31,0.0,2710.349,5.8,-1.137752,-0.00745,-0.583975
1959-06-30,2.34,2778.801,5.1,-0.162753,0.704402,-0.318483
1959-09-30,2.74,2775.488,5.3,1.565753,0.517937,1.325127
1959-12-31,0.27,2785.204,5.6,-0.5823,-0.023468,-0.274981
1960-03-31,2.31,2847.699,5.2,-1.124776,1.473199,-0.481643
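
Because the cells above depend on `examples/macrodata.csv`, here is a self-contained sketch of the same equivalence on a hypothetical four-row miniature of `ldata`. The values are copied from the first two quarters shown earlier; the frame itself is an assumption made for the illustration.

```python
import pandas as pd

# Hypothetical miniature of ldata: two dates, two items
ldata = pd.DataFrame({
    'date': pd.to_datetime(['1959-03-31', '1959-03-31',
                            '1959-06-30', '1959-06-30']),
    'item': ['realgdp', 'infl', 'realgdp', 'infl'],
    'value': [2710.349, 0.0, 2778.801, 2.34],
})

# pivot: one row per date, one column per item
pivoted = ldata.pivot(index='date', columns='item', values='value')

# The same reshape spelled out as set_index followed by unstack
unstacked = ldata.set_index(['date', 'item'])['value'].unstack('item')

print(pivoted)
print(unstacked)
```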


### 8.3.3 Pivoting "Wide" to "Long" Format
<a id='833'></a>
An inverse operation to ***pivot*** for DataFrames is ***pandas.melt***. Rather than transforming one column into many in a new DataFrame, it merges multiple columns into one, producing a DataFrame that is longer than the input. Let’s look at an example:

In [175]:
df = pd.DataFrame({'key': ['foo', 'bar', 'baz'],
                   'A': [1,2,3],
                   'B': [4,5,6],
                   'C': [7,8,9]})
df

Unnamed: 0,A,B,C,key
0,1,4,7,foo
1,2,5,8,bar
2,3,6,9,baz


In [191]:
# The 'key' column may be a group indicator, and the other columns are data values. 
# When using pandas.melt, we must indicate which columns (if any) are group indicators. 
# Let’s use 'key' as the only group indicator here:
melted = pd.melt(df, ['key'])
melted

Unnamed: 0,key,variable,value
0,foo,A,1
1,bar,A,2
2,baz,A,3
3,foo,B,4
4,bar,B,5
5,baz,B,6
6,foo,C,7
7,bar,C,8
8,baz,C,9


In [180]:
# Using pivot, we can reshape the melted data back to the original layout:
reshaped = melted.pivot(index='key', columns='variable', values='value')
reshaped

variable,A,B,C
key,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
bar,2,5,8
baz,3,6,9
foo,1,4,7


In [188]:
# pivot turns the 'key' column into the row index; reset_index moves it back into a column:
x = reshaped.reset_index()
x.loc[:, ['A', 'B', 'C', 'key']]

variable,A,B,C,key
0,2,5,8,bar
1,3,6,9,baz
2,1,4,7,foo


In [189]:
# You can also specify a subset of columns to use as value columns. In this example we omit 'C':
pd.melt(df, id_vars=['key'], value_vars=['A', 'B'])

Unnamed: 0,key,variable,value
0,foo,A,1
1,bar,A,2
2,baz,A,3
3,foo,B,4
4,bar,B,5
5,baz,B,6
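
The default output column names `variable` and `value` can be replaced through the `var_name` and `value_name` arguments of `pandas.melt`. A short sketch, in which the names `'measure'` and `'reading'` are arbitrary choices made for the illustration:

```python
import pandas as pd

df = pd.DataFrame({'key': ['foo', 'bar', 'baz'],
                   'A': [1, 2, 3],
                   'B': [4, 5, 6],
                   'C': [7, 8, 9]})

# Melt columns A and B only, with custom names for the two new columns
melted = pd.melt(df, id_vars=['key'], value_vars=['A', 'B'],
                 var_name='measure', value_name='reading')

print(melted)
```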


In [195]:
# melt can be used without any group identifiers, too:
pd.melt(df, value_vars=['A', 'B', 'C'])

Unnamed: 0,variable,value
0,A,1
1,A,2
2,A,3
3,B,4
4,B,5
5,B,6
6,C,7
7,C,8
8,C,9


In [196]:
# Any column, including 'key', can be treated as a value column:
pd.melt(df, value_vars=['key', 'A', 'B'])

Unnamed: 0,variable,value
0,key,foo
1,key,bar
2,key,baz
3,A,1
4,A,2
5,A,3
6,B,4
7,B,5
8,B,6


<hr>

[Back to top](#index)