# Chapter 8. Data Wrangling: Join, Combine, and Reshape
<a id='index'></a>
## Table of Contents
- [8.1 Hierarchical Indexing](#81)
    - [8.1.1 Reordering and Sorting Levels](#811)
    - [8.1.2 Summary Statistics by Level](#812)
    - [8.1.3 Indexing with a DataFrame's Columns](#813)
- [8.2 Combining and Merging Datasets](#82)
    - [8.2.1 Database-Style DataFrame Joins](#821)

## 8.1 Hierarchical Indexing
<a id='81'></a>

In [1]:
import numpy as np
import pandas as pd

In [2]:
# A Series with a two-level (hierarchical) index: one list of labels per level
data = pd.Series(np.random.randn(9),
                 index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
                        [1, 2, 3, 1, 3, 1, 2, 2, 3]])
data

a  1   -0.236161
   2    0.855534
   3   -0.066375
b  1   -0.329522
   3    0.627017
c  1   -1.298140
   2    0.373509
d  2   -0.094909
   3    1.583782
dtype: float64

In [3]:
# What you’re seeing is a prettified view of a Series with a MultiIndex as its index. The
# “gaps” in the index display mean “use the label directly above”:
data.index

MultiIndex(levels=[['a', 'b', 'c', 'd'], [1, 2, 3]],
           labels=[[0, 0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 2, 0, 2, 0, 1, 1, 2]])
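The repr above stores each level's unique labels once (`levels`) plus one integer position per row into those labels. A minimal sketch of how the two pieces fit together, assuming pandas 0.24 or later, where the `labels` attribute shown above is exposed as `codes`; the variable names are illustrative:

```python
import pandas as pd

# Same index structure as `data` above, built without random values.
idx = pd.MultiIndex.from_arrays(
    [['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
     [1, 2, 3, 1, 3, 1, 2, 2, 3]])

outer_levels = idx.levels[0].tolist()   # unique labels of the outer level
inner_levels = idx.levels[1].tolist()   # unique labels of the inner level
outer_codes = idx.codes[0].tolist()     # one position into outer_levels per row
inner_codes = idx.codes[1].tolist()     # one position into inner_levels per row

# Rebuild the per-row label tuples from levels + codes.
rebuilt = [(outer_levels[i], inner_levels[j])
           for i, j in zip(outer_codes, inner_codes)]
```

Here `rebuilt` holds the same `(outer, inner)` tuples that iterating over `idx` yields, which is why the display can leave "gaps" for repeated outer labels without losing information.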

In [4]:
# Partial indexing on the outer level selects a subset of the data:
data['b']

1   -0.329522
3    0.627017
dtype: float64

In [5]:
data['b':'c']

b  1   -0.329522
   3    0.627017
c  1   -1.298140
   2    0.373509
dtype: float64

In [6]:
data.loc[['b', 'd']]

b  1   -0.329522
   3    0.627017
d  2   -0.094909
   3    1.583782
dtype: float64

In [7]:
# Selection is even possible from an “inner” level:
data.loc[:, 2]

a    0.855534
c    0.373509
d   -0.094909
dtype: float64

In [8]:
# You can rearrange the data into a DataFrame using its unstack method:
data.unstack()

          1         2         3
a -0.236161  0.855534 -0.066375
b -0.329522       NaN  0.627017
c -1.298140  0.373509       NaN
d       NaN -0.094909  1.583782


In [9]:
# The inverse operation of unstack is stack:
data.unstack().stack()

a  1   -0.236161
   2    0.855534
   3   -0.066375
b  1   -0.329522
   3    0.627017
c  1   -1.298140
   2    0.373509
d  2   -0.094909
   3    1.583782
dtype: float64

In [10]:
# With a DataFrame, either axis can have a hierarchical index
frame = pd.DataFrame(np.arange(12).reshape((4, 3)), 
                     index=[['a','a','b','b'],
                            ['1','2','1','2']], 
                     columns=[['Ohio', 'Ohio', 'Colorado'],
                              ['Green', 'Red', 'Green']])
frame

     Ohio     Colorado
    Green Red    Green
a 1     0   1        2
  2     3   4        5
b 1     6   7        8
  2     9  10       11


In [11]:
# The hierarchical levels can have names (as strings or any Python objects). 
# If so, these will show up in the console output:
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']

frame

state      Ohio     Colorado
color     Green Red    Green
key1 key2
a    1        0   1        2
     2        3   4        5
b    1        6   7        8
     2        9  10       11


In [12]:
# With partial column indexing you can similarly select groups of columns:
frame['Ohio']

color      Green  Red
key1 key2
a    1         0    1
     2         3    4
b    1         6    7
     2         9   10


A MultiIndex can be created by itself and then reused; the columns in the preceding DataFrame with level names could be created like this:
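One way to do this, sketched under the assumption that `pd.MultiIndex.from_arrays` is the constructor meant here; the variable names `columns` and `reused` are illustrative:

```python
import numpy as np
import pandas as pd

# Build the column index on its own: one list of labels per level,
# plus a name for each level.
columns = pd.MultiIndex.from_arrays(
    [['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']],
    names=['state', 'color'])

# Reuse the same MultiIndex as the columns of a new DataFrame.
reused = pd.DataFrame(np.arange(6).reshape((2, 3)), columns=columns)
```

Because the index is an ordinary object, the same `columns` value can be passed to any number of DataFrames that share this layout.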

<hr>

### 8.1.1 Reordering and Sorting Levels
<a id='811'></a>

In [14]:
# swaplevel takes two level numbers or names and returns a new object with the levels 
# interchanged (but the data is otherwise unaltered):
frame.swaplevel('key1', 'key2')

state      Ohio     Colorado
color     Green Red    Green
key2 key1
1    a        0   1        2
2    a        3   4        5
1    b        6   7        8
2    b        9  10       11


In [18]:
# sort_index, on the other hand, sorts the rows by the labels in the given level;
# with the default sort_remaining=True, ties are then ordered by the remaining levels.
frame.sort_index(level=0)

state      Ohio     Colorado
color     Green Red    Green
key1 key2
a    1        0   1        2
     2        3   4        5
b    1        6   7        8
     2        9  10       11


In [20]:
# Swap the two row levels, then sort so that rows are grouped by key2:
frame.swaplevel(0, 1).sort_index(level=0)

state      Ohio     Colorado
color     Green Red    Green
key2 key1
1    a        0   1        2
     b        6   7        8
2    a        3   4        5
     b        9  10       11


### 8.1.2 Summary Statistics by Level
<a id='812'></a>

In [21]:
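# Sum the rows that share a 'key2' label. The level argument of DataFrame.sum
# was removed in pandas 2.0; groupby(level=...) gives the same result.
frame.groupby(level='key2').sum()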

state  Ohio     Colorado
color Green Red    Green
key2
1         6   8       10
2        12  14       16


In [25]:
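# Sum the columns that share a 'color' label by grouping the transposed frame
# (works on current pandas, where sum(level=..., axis=1) is no longer available).
frame.T.groupby(level='color').sum().T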

color      Green  Red
key1 key2
a    1         2    1
     2         8    4
b    1        14    7
     2        20   10


In [26]:
# Aggregate by 'color' across columns first, then by 'key2' across rows.
frame.T.groupby(level='color').sum().T.groupby(level='key2').sum()

color  Green  Red
key2
1         16    8
2         28   14


### 8.1.3 Indexing with a DataFrame's Columns
<a id='813'></a>

In [31]:
frame = pd.DataFrame({'a': range(7), 
                      'b': range(7, 0, -1), 
                      'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'],
                      'd': [0, 1, 2, 0, 1, 2, 3]})
frame

   a  b    c  d
0  0  7  one  0
1  1  6  one  1
2  2  5  one  2
3  3  4  two  0
4  4  3  two  1
5  5  2  two  2
6  6  1  two  3


In [40]:
frame.index.names = ['No.']
frame

     a  b    c  d
No.
0    0  7  one  0
1    1  6  one  1
2    2  5  one  2
3    3  4  two  0
4    4  3  two  1
5    5  2  two  2
6    6  1  two  3


In [42]:
# DataFrame’s set_index function will create a new DataFrame using one or more of its columns as the index:
frame2 = frame.set_index(['c', 'd'])
frame2

       a  b
c   d
one 0  0  7
    1  1  6
    2  2  5
two 0  3  4
    1  4  3
    2  5  2
    3  6  1


In [43]:
# By default the index columns are removed from the DataFrame; pass drop=False to keep them:
frame.set_index(['c', 'd'], drop=False)

       a  b    c  d
c   d
one 0  0  7  one  0
    1  1  6  one  1
    2  2  5  one  2
two 0  3  4  two  0
    1  4  3  two  1
    2  5  2  two  2
    3  6  1  two  3


In [47]:
# reset_index does the opposite of set_index: the index levels are moved back into columns.
frame2.reset_index()

     c  d  a  b
0  one  0  0  7
1  one  1  1  6
2  one  2  2  5
3  two  0  3  4
4  two  1  4  3
5  two  2  5  2
6  two  3  6  1
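
The round trip through `set_index` and `reset_index` can be checked directly. This is a small sketch that rebuilds the same `frame` as above; the names `indexed`, `restored`, and `same_values` are illustrative:

```python
import pandas as pd

frame = pd.DataFrame({'a': range(7),
                      'b': range(7, 0, -1),
                      'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'],
                      'd': [0, 1, 2, 0, 1, 2, 3]})

indexed = frame.set_index(['c', 'd'])   # 'c' and 'd' become index levels
restored = indexed.reset_index()        # the levels move back into columns

# The former index levels now come first in the column order.
restored_columns = restored.columns.tolist()

# After putting the columns back in their original order, the data match.
same_values = restored[['a', 'b', 'c', 'd']].equals(frame)
```

The values survive the round trip; only the column order changes, since `reset_index` places the former index levels at the front.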


## 8.2 Combining and Merging Datasets
<a id='82'></a>
- ***pandas.merge*** connects rows in DataFrames based on one or more keys. This will be familiar to users of SQL or other relational databases, as it implements database join operations.
- ***pandas.concat*** concatenates or “stacks” together objects along an axis.
- The ***combine_first*** instance method enables splicing together overlapping data to fill in missing values in one object with values from another.
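
The three operations listed above can be sketched on small made-up inputs. The frames, Series, and column names below (`left`, `right`, `lval`, `rval`, and so on) are illustrative assumptions, not data from this chapter:

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({'key': ['a', 'b', 'c'], 'lval': [1, 2, 3]})
right = pd.DataFrame({'key': ['a', 'b', 'd'], 'rval': [4, 5, 6]})

# merge: an inner join on 'key' by default, so only keys present
# in both frames ('a' and 'b') appear in the result.
merged = pd.merge(left, right, on='key')

# concat: stack two Series end to end along axis 0.
s1 = pd.Series([0, 1], index=['a', 'b'])
s2 = pd.Series([2, 3], index=['c', 'd'])
stacked = pd.concat([s1, s2])

# combine_first: keep values from `primary`, filling its missing
# entries with the aligned values from `backup`.
primary = pd.Series([1.0, np.nan, 3.0], index=['x', 'y', 'z'])
backup = pd.Series([10.0, 20.0, 30.0], index=['x', 'y', 'z'])
patched = primary.combine_first(backup)
```

Passing `how='outer'` to `pd.merge` would instead keep keys that appear in either frame, with missing values where one side has no match.
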
### 8.2.1 Database-Style DataFrame Joins
<a id='821'></a>

<hr>

[Back to top](#index)