# Chapter 7: Data Wrangling (Clean, Transform, Merge, Reshape)
> Much of the programming work in data analysis and modeling is spent on data
preparation: loading, cleaning, transforming, and rearranging. Sometimes the way that data
is stored in files or databases is not the way you need it for a data processing application.
Many people choose to do ad hoc processing of data from one form to another using
a general-purpose programming language, like Python, Perl, R, or Java, or UNIX text processing
tools like sed or awk. Fortunately, pandas along with the Python standard library
provide you with a high-level, flexible, and high-performance set of core manipulations
and algorithms to enable you to wrangle data into the right form without much trouble.
If you identify a type of data manipulation that isn’t anywhere in this book or elsewhere
in the pandas library, feel free to suggest it on the mailing list or GitHub site. Indeed,
much of the design and implementation of pandas has been driven by the needs of
real-world applications.

**Overview**:
* Combining and Merging Data Sets
* Reshaping and Pivoting
* Data Transformation
* String Manipulation
* Example: USDA Food Database

# Combining and Merging Data Sets

* **pandas.merge** connects rows in DataFrames based on one or more keys. This will
be familiar to users of SQL or other relational databases, as it implements database
join operations.
* **pandas.concat** glues or stacks together objects along an axis.
* **combine_first**, an instance method, splices together overlapping data to fill
in missing values in one object with values from another.
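
As a quick orientation, the three tools in the list above can be sketched side by side. This is a minimal illustration on made-up data; the names `left`, `right`, `s1`, and `s2` are placeholders chosen for this example only.

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({'key': ['a', 'b', 'c'], 'lval': [1, 2, 3]})
right = pd.DataFrame({'key': ['a', 'b', 'd'], 'rval': [10, 20, 40]})

# merge: database-style join on the shared 'key' column (inner join by default)
merged = pd.merge(left, right, on='key')

# concat: stack the two frames along the row axis; columns missing on one
# side are filled with NaN
stacked = pd.concat([left, right], axis=0, ignore_index=True)

# combine_first: fill the missing values of s1 with the values of s2
# found at the same index labels
s1 = pd.Series([1.0, np.nan, 3.0], index=['x', 'y', 'z'])
s2 = pd.Series([10.0, 20.0, 30.0], index=['x', 'y', 'z'])
patched = s1.combine_first(s2)

print(merged)
print(stacked)
print(patched)
```

Only `merge` matches rows by key value; `concat` aligns on axis labels, and `combine_first` aligns on the index and only fills gaps.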

## Database-style DataFrame Merges

In [2]:
import pandas as pd
from pandas import DataFrame, Series
import numpy as np

### Case 1: Merge many-to-one: one DataFrame has multiple rows per key value, the other has one row per key value

In [7]:
df1 = DataFrame({
        'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
        'data1': np.arange(7)
    })
df1

Unnamed: 0,data1,key
0,0,b
1,1,b
2,2,a
3,3,c
4,4,a
5,5,a
6,6,b


In [10]:
df2 = DataFrame({
        'key': ['a', 'b', 'd'], 
        'data2': np.arange(3)
    })
df2

Unnamed: 0,data2,key
0,0,a
1,1,b
2,2,d


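If you do not say which column to join on, **pd.merge** uses the column names that the two DataFrames have in common as the join keys (here, `key`). Only rows whose key value appears in both DataFrames are kept.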

In [11]:
pd.merge(df1, df2)

Unnamed: 0,data1,key,data2
0,0,b,1
1,1,b,1
2,6,b,1
3,2,a,0
4,4,a,0
5,5,a,0


It is good practice to specify the join column explicitly with the **on** option:

In [15]:
pd.merge(df1, df2, on='key')

Unnamed: 0,data1,key,data2
0,0,b,1
1,1,b,1
2,6,b,1
3,2,a,0
4,4,a,0
5,5,a,0
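
Since this case is meant to be many-to-one, it can be useful to have pandas check that assumption. The sketch below assumes a pandas version that supports the `validate` argument of `pd.merge` (0.21 or later); it is not part of the original notebook.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                    'data1': np.arange(7)})
df2 = pd.DataFrame({'key': ['a', 'b', 'd'],
                    'data2': np.arange(3)})

# Passes: every key appears at most once in df2 (the "one" side).
checked = pd.merge(df1, df2, on='key', validate='many_to_one')

# Fails: with the arguments swapped, the right side (df1) has repeated keys.
# pandas raises MergeError, which is a subclass of ValueError.
try:
    pd.merge(df2, df1, on='key', validate='many_to_one')
    raised = False
except ValueError:
    raised = True

print(len(checked), raised)
```

The check costs little and turns an accidental many-to-many join, which silently multiplies rows, into an immediate error.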


If the column names are different in each object, you can specify them separately by using **left_on** and **right_on** options:

In [16]:
df3 = DataFrame({
        'lkey': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
        'data1': np.arange(7)
    })
df3

Unnamed: 0,data1,lkey
0,0,b
1,1,b
2,2,a
3,3,c
4,4,a
5,5,a
6,6,b


In [17]:
df4 = DataFrame({
        'rkey': ['a', 'b', 'd'], 
        'data2': np.arange(3)
    })
df4

Unnamed: 0,data2,rkey
0,0,a
1,1,b
2,2,d


In [18]:
pd.merge(df3, df4, left_on='lkey', right_on='rkey')

Unnamed: 0,data1,lkey,data2,rkey
0,0,b,1,b
1,1,b,1,b
2,6,b,1,b
3,2,a,0,a
4,4,a,0,a
5,5,a,0,a
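
With **left_on** and **right_on**, both key columns are kept in the result even though they hold the same values. One possible way to tidy this up is sketched below; the names `merged` and `tidy` are placeholders for this example, and `drop(columns=...)` assumes pandas 0.21 or later.

```python
import numpy as np
import pandas as pd

df3 = pd.DataFrame({'lkey': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                    'data1': np.arange(7)})
df4 = pd.DataFrame({'rkey': ['a', 'b', 'd'],
                    'data2': np.arange(3)})

merged = pd.merge(df3, df4, left_on='lkey', right_on='rkey')

# After an inner join the two key columns are identical row by row,
# so one of them can be dropped and the other renamed.
same_keys = bool((merged['lkey'] == merged['rkey']).all())
tidy = merged.drop(columns='rkey').rename(columns={'lkey': 'key'})

print(same_keys)
print(tidy)
```

Note that this shortcut is only safe for inner joins: in an outer join one of the two key columns can be NaN for unmatched rows.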


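By default, **pd.merge** does an **inner** join: the keys in the result are the intersection of the keys in the two DataFrames. Rows whose key is missing from the other DataFrame are dropped, which is why `'c'` (only in df1) and `'d'` (only in df2) do not appear above.

If we don't want to lose any data, we can change the **how** option of **pd.merge** from the default **'inner'** to **'outer'**, which takes the union of the keys and fills the missing values with NaN. The other possible values are **'left'** and **'right'**.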

In [21]:
pd.merge(df1, df2, how='outer')

Unnamed: 0,data1,key,data2
0,0.0,b,1.0
1,1.0,b,1.0
2,6.0,b,1.0
3,2.0,a,0.0
4,4.0,a,0.0
5,5.0,a,0.0
6,3.0,c,
7,,d,2.0
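
To see how the choice of **how** changes the result, the following sketch lists the key values that survive under each join type, and uses the `indicator=True` option of `pd.merge` to label where each row of the outer join came from. The names `keys_by_how` and `origin_counts` are made up for this example.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                    'data1': np.arange(7)})
df2 = pd.DataFrame({'key': ['a', 'b', 'd'],
                    'data2': np.arange(3)})

# Key values present in the result for each join type
keys_by_how = {
    how: sorted(pd.merge(df1, df2, on='key', how=how)['key'].unique())
    for how in ('inner', 'left', 'right', 'outer')
}

# indicator=True adds a '_merge' column with the values
# 'both', 'left_only', or 'right_only'
outer = pd.merge(df1, df2, on='key', how='outer', indicator=True)
origin_counts = {
    side: int((outer['_merge'] == side).sum())
    for side in ('both', 'left_only', 'right_only')
}

print(keys_by_how)
print(origin_counts)
```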


### Case 2: Merge many-to-many

In [23]:
df1 = DataFrame({
        'key': ['b', 'b', 'a', 'c', 'a', 'b'],
        'data1': range(6)
    })
df1

Unnamed: 0,data1,key
0,0,b
1,1,b
2,2,a
3,3,c
4,4,a
5,5,b


In [24]:
df2 = DataFrame({
        'key': ['a', 'b', 'a', 'b', 'd'],
        'data2': range(5)
    })
df2

Unnamed: 0,data2,key
0,0,a
1,1,b
2,2,a
3,3,b
4,4,d


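Many-to-many joins form the Cartesian product of the matching rows: each `'b'` row of df1 is paired with every `'b'` row of df2. With **how='left'**, every row of df1 is kept, including `'c'`, which has no match in df2: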

In [25]:
pd.merge(df1, df2, on='key', how='left')

Unnamed: 0,data1,key,data2
0,0,b,1.0
1,0,b,3.0
2,1,b,1.0
3,1,b,3.0
4,2,a,0.0
5,2,a,2.0
6,3,c,
7,4,a,0.0
8,4,a,2.0
9,5,b,1.0
10,5,b,3.0


In [26]:
pd.merge(df1, df2, on='key', how='inner')

Unnamed: 0,data1,key,data2
0,0,b,1
1,0,b,3
2,1,b,1
3,1,b,3
4,5,b,1
5,5,b,3
6,2,a,0
7,2,a,2
8,4,a,0
9,4,a,2
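
The number of rows produced by a many-to-many join can be predicted from the key counts on each side: for every shared key, the result has (occurrences on the left) × (occurrences on the right) rows. A small check of this on the data above; `expected` is a name introduced only for this sketch.

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
                    'data1': range(6)})
df2 = pd.DataFrame({'key': ['a', 'b', 'a', 'b', 'd'],
                    'data2': range(5)})

left_counts = df1['key'].value_counts()    # b: 3, a: 2, c: 1
right_counts = df2['key'].value_counts()   # a: 2, b: 2, d: 1

# The product aligns on key; keys present on only one side become NaN
# and are dropped, which mirrors what an inner join does.
expected = (left_counts * right_counts).dropna().astype(int)

inner = pd.merge(df1, df2, on='key', how='inner')
actual = inner['key'].value_counts()

print(expected.to_dict())   # rows per key predicted from the counts
print(actual.to_dict())     # rows per key in the merged result
```

Comparing row counts before and after a merge in this way is a cheap guard against unintended row multiplication.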


To merge on multiple keys, pass a list of column names to **on**:

In [27]:
left = DataFrame({
        'key1': ['foo', 'foo', 'bar'],
        'key2': ['one', 'two', 'one'],
        'lval': [1, 2, 3]
    })
left

Unnamed: 0,key1,key2,lval
0,foo,one,1
1,foo,two,2
2,bar,one,3


In [28]:
right = DataFrame({
        'key1': ['foo', 'foo', 'bar', 'bar'],
        'key2': ['one', 'one', 'one', 'two'],
        'rval': [4, 5, 6, 7]
    })
right

Unnamed: 0,key1,key2,rval
0,foo,one,4
1,foo,one,5
2,bar,one,6
3,bar,two,7


In [31]:
pd.merge(left, right, on=['key1', 'key2'], how='outer')

Unnamed: 0,key1,key2,lval,rval
0,foo,one,1.0,4.0
1,foo,one,1.0,5.0
2,foo,two,2.0,
3,bar,one,3.0,6.0
4,bar,two,,7.0


To determine which key combinations will appear in the result depending on the choice
of merge method, think of the multiple keys as forming an array of tuples to be used
as a single join key (even though it’s not actually implemented that way).
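
The "array of tuples" mental model can be checked directly with Python sets. The sketch below builds the set of `(key1, key2)` tuples on each side and compares ordinary set operations with the key combinations that each join type returns. The helper `result_keys` is hypothetical, written only for this illustration.

```python
import pandas as pd

left = pd.DataFrame({'key1': ['foo', 'foo', 'bar'],
                     'key2': ['one', 'two', 'one'],
                     'lval': [1, 2, 3]})
right = pd.DataFrame({'key1': ['foo', 'foo', 'bar', 'bar'],
                      'key2': ['one', 'one', 'one', 'two'],
                      'rval': [4, 5, 6, 7]})

# Treat each (key1, key2) pair as a single composite key
left_keys = set(zip(left['key1'], left['key2']))
right_keys = set(zip(right['key1'], right['key2']))

def result_keys(how):
    """Return the set of (key1, key2) pairs present after merging."""
    merged = pd.merge(left, right, on=['key1', 'key2'], how=how)
    return set(zip(merged['key1'], merged['key2']))

print(result_keys('inner') == left_keys & right_keys)   # intersection
print(result_keys('outer') == left_keys | right_keys)   # union
print(result_keys('left') == left_keys)
print(result_keys('right') == right_keys)
```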

A last issue to consider in merge operations is the treatment of overlapping column
names. While you can address the overlap manually (see the later section on renaming
axis labels), merge has a suffixes option for specifying strings to append to overlapping
names in the left and right DataFrame objects:

In [33]:
pd.merge(left, right, on='key1')

Unnamed: 0,key1,key2_x,lval,key2_y,rval
0,foo,one,1,one,4
1,foo,one,1,one,5
2,foo,two,2,one,4
3,foo,two,2,one,5
4,bar,one,3,one,6
5,bar,one,3,two,7


In [35]:
pd.merge(left, right, on='key1', suffixes=('_left', '_right'))

Unnamed: 0,key1,key2_left,lval,key2_right,rval
0,foo,one,1,one,4
1,foo,one,1,one,5
2,foo,two,2,one,4
3,foo,two,2,one,5
4,bar,one,3,one,6
5,bar,one,3,two,7
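
Note that suffixes are applied only to overlapping columns that are not join keys (`key2` here, but not `key1`). For comparison, the manual alternative mentioned above is to rename the overlapping column on each side before merging. In this sketch the names `left_key2` and `right_key2` are arbitrary choices.

```python
import pandas as pd

left = pd.DataFrame({'key1': ['foo', 'foo', 'bar'],
                     'key2': ['one', 'two', 'one'],
                     'lval': [1, 2, 3]})
right = pd.DataFrame({'key1': ['foo', 'foo', 'bar', 'bar'],
                      'key2': ['one', 'one', 'one', 'two'],
                      'rval': [4, 5, 6, 7]})

# Option 1: let merge disambiguate the overlapping 'key2' column
with_suffixes = pd.merge(left, right, on='key1', suffixes=('_left', '_right'))

# Option 2: rename the overlapping column on each side first
renamed = pd.merge(left.rename(columns={'key2': 'left_key2'}),
                   right.rename(columns={'key2': 'right_key2'}),
                   on='key1')

print(list(with_suffixes.columns))
print(list(renamed.columns))
```

Both approaches give the same rows; renaming is more verbose but lets each column get an independent, descriptive name.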


## Merging on Index