# Data Wrangling: Clean, Transform, Merge, Reshape

Much of the programming work in data analysis and modeling is spent on data preparation:
loading, cleaning, transforming, and rearranging. Sometimes the way that data
is stored in files or databases is not the way you need it for a data processing application.
Many people choose to do ad hoc processing of data from one form to another using
a general-purpose programming language, like Python, Perl, R, or Java, or UNIX text processing
tools like sed or awk. Fortunately, pandas, along with the Python standard library, provides
you with a high-level, flexible, and high-performance set of core manipulations
and algorithms that let you wrangle data into the right form without much trouble.
If you identify a type of data manipulation that isn’t anywhere in this book or elsewhere
in the pandas library, feel free to suggest it on the mailing list or GitHub site. Indeed,
much of the design and implementation of pandas has been driven by the needs of real-world
applications.

## Combining and Merging Data Sets

Data contained in pandas objects can be combined together in a number of built-in
ways:
* pandas.merge connects rows in DataFrames based on one or more keys. This will be familiar to users of SQL or other relational databases, as it implements database join operations.

* pandas.concat glues or stacks together objects along an axis.

* The combine_first instance method splices together overlapping data, filling in missing values in one object with values from another.

I will address each of these and give a number of examples. They’ll be utilized in examples
throughout the rest of the book.
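
As a quick orientation, the three tools can be placed side by side. This is only a minimal sketch: the frames `left` and `right`, the Series `s_primary` and `s_backup`, and all of their values are made up for illustration and are not used elsewhere in the chapter.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, invented for this sketch
left = pd.DataFrame({'key': ['a', 'b', 'c'], 'lval': [1, 2, 3]})
right = pd.DataFrame({'key': ['a', 'b', 'd'], 'rval': [10, 20, 40]})

# merge: database-style join on the shared 'key' column (inner join by default)
merged = pd.merge(left, right, on='key')

# concat: stack the two frames along the row axis, renumbering the index
stacked = pd.concat([left, right], ignore_index=True)

# combine_first: fill the holes in one Series with values from another
s_primary = pd.Series([1.0, np.nan, 3.0], index=['x', 'y', 'z'])
s_backup = pd.Series([9.0, 2.0, 9.0], index=['x', 'y', 'z'])
patched = s_primary.combine_first(s_backup)

print(merged)
print(stacked)
print(patched)
```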

### Database-style DataFrame Merges

Merge or join operations combine data sets by linking rows using one or more keys.
These operations are central to relational databases. The merge function in pandas is
the main entry point for using these algorithms on your data.
Let’s start with a simple example:

In [2]:
from pandas import DataFrame, Series

import pandas as pd

import sys

import numpy as np

import json

In [3]:
df1 = DataFrame({'key' : ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data1':range(7)})

In [7]:
df2 = DataFrame({'key' : ['a', 'b', 'd'], 'data2':range(3)})

In [8]:
df1

Unnamed: 0,data1,key
0,0,b
1,1,b
2,2,a
3,3,c
4,4,a
5,5,a
6,6,b


In [9]:
df2

Unnamed: 0,data2,key
0,0,a
1,1,b
2,2,d


This is an example of a many-to-one merge situation; the data in df1 has multiple rows
labeled a and b, whereas df2 has only one row for each value in the key column. Calling
merge with these objects we obtain:

In [10]:
pd.merge(df1, df2)

Unnamed: 0,data1,key,data2
0,0,b,1
1,1,b,1
2,6,b,1
3,2,a,0
4,4,a,0
5,5,a,0


Note that I didn’t specify which column to join on. If not specified, merge uses the
overlapping column names as the keys. It’s a good practice to specify explicitly, though:

In [11]:
pd.merge(df1, df2, on='key')

Unnamed: 0,data1,key,data2
0,0,b,1
1,1,b,1
2,6,b,1
3,2,a,0
4,4,a,0
5,5,a,0


If the column names are different in each object, you can specify them separately:

In [15]:
df3 = DataFrame({'lkey' : ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data1':range(7)})

In [16]:
df4 = DataFrame({'rkey' : ['a', 'b', 'd'], 'data2':range(3)})

In [17]:
pd.merge(df3, df4, left_on='lkey', right_on='rkey')

Unnamed: 0,data1,lkey,data2,rkey
0,0,b,1,b
1,1,b,1,b
2,6,b,1,b
3,2,a,0,a
4,4,a,0,a
5,5,a,0,a


You probably noticed that the 'c' and 'd' values and associated data are missing from
the result. By default merge does an 'inner' join; the keys in the result are the intersection
of the keys found in both tables. Other possible options are 'left', 'right', and 'outer'. The outer join takes the
union of the keys, combining the effect of applying both left and right joins:

In [18]:
pd.merge(df1, df2, how='outer')

Unnamed: 0,data1,key,data2
0,0.0,b,1.0
1,1.0,b,1.0
2,6.0,b,1.0
3,2.0,a,0.0
4,4.0,a,0.0
5,5.0,a,0.0
6,3.0,c,
7,,d,2.0
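
To see how the join types differ, it can help to run them against the same df1 and df2 as above. The sketch below also uses merge's indicator option, which is not covered in the text; it adds a `_merge` column recording whether each row came from the left frame only, the right frame only, or both.

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data1': range(7)})
df2 = pd.DataFrame({'key': ['a', 'b', 'd'], 'data2': range(3)})

# A left join keeps every key from df1 ('c' survives, 'd' does not)
left_join = pd.merge(df1, df2, on='key', how='left')

# A right join keeps every key from df2 ('d' survives, 'c' does not)
right_join = pd.merge(df1, df2, on='key', how='right')

# indicator=True adds a '_merge' column: 'left_only', 'right_only' or 'both'
outer = pd.merge(df1, df2, on='key', how='outer', indicator=True)

# Rows that found no partner on the other side
unmatched = outer[outer['_merge'] != 'both']
print(unmatched)
```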


Many-to-many merges have well-defined though not necessarily intuitive behavior.
Here’s an example:

In [19]:
df1 = DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'], 'data1': range(6)})
df2 = DataFrame({'key': ['a', 'b', 'a', 'b', 'd'],'data2': range(5)})

In [23]:
df1

Unnamed: 0,data1,key
0,0,b
1,1,b
2,2,a
3,3,c
4,4,a
5,5,b


In [24]:
df2

Unnamed: 0,data2,key
0,0,a
1,1,b
2,2,a
3,3,b
4,4,d


In [25]:
pd.merge(df1, df2, on='key', how='left')

Unnamed: 0,data1,key,data2
0,0,b,1.0
1,0,b,3.0
2,1,b,1.0
3,1,b,3.0
4,2,a,0.0
5,2,a,2.0
6,3,c,
7,4,a,0.0
8,4,a,2.0
9,5,b,1.0
10,5,b,3.0


Many-to-many joins form the Cartesian product of the rows. Since there were 3 'b'
rows in the left DataFrame and 2 in the right one, there are 6 'b' rows in the result.
The join method only affects the distinct key values appearing in the result:

In [26]:
pd.merge(df1, df2, how='inner')

Unnamed: 0,data1,key,data2
0,0,b,1
1,0,b,3
2,1,b,1
3,1,b,3
4,5,b,1
5,5,b,3
6,2,a,0
7,2,a,2
8,4,a,0
9,4,a,2
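
If a many-to-many merge would be a surprise, one possible safeguard is merge's validate option, which is not discussed in the text. The following sketch reuses the df1 and df2 from this example: it first counts the rows per key in the inner join, then shows validate='many_to_one' rejecting the merge because the keys in df2 are not unique.

```python
import pandas as pd
from pandas.errors import MergeError

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'], 'data1': range(6)})
df2 = pd.DataFrame({'key': ['a', 'b', 'a', 'b', 'd'], 'data2': range(5)})

# Row counts per key multiply: 3 'b' rows x 2 'b' rows, 2 'a' rows x 2 'a' rows
inner = pd.merge(df1, df2, on='key', how='inner')
counts = inner['key'].value_counts()
print(counts)

# validate='many_to_one' asserts that the right-hand keys are unique
try:
    pd.merge(df1, df2, on='key', validate='many_to_one')
    raised = False
except MergeError as exc:
    raised = True
    print('merge rejected:', exc)
```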


To merge with multiple keys, pass a list of column names:

In [27]:
left = DataFrame({
                'key1': ['foo', 'foo', 'bar']
                ,'key2': ['one', 'two', 'one']
                 })

In [28]:
right = DataFrame({
                'key1': ['foo', 'foo', 'bar', 'bar']
                ,'key2': ['one', 'one', 'one', 'two']
                ,'rval': [4, 5, 6, 7]
                 })

In [29]:
pd.merge(left, right, on=['key1', 'key2'], how='outer')

Unnamed: 0,key1,key2,rval
0,foo,one,4.0
1,foo,one,5.0
2,foo,two,
3,bar,one,6.0
4,bar,two,7.0


To determine which key combinations will appear in the result depending on the choice
of merge method, think of the multiple keys as forming an array of tuples to be used
as a single join key (even though it’s not actually implemented that way).
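
The tuple mental model can be checked directly. In this sketch the sets `left_keys` and `right_keys` are helper names introduced only for illustration; they are built with plain Python and compared against what merge returns. As the text says, this is a way of reasoning about the result, not a description of how pandas implements it.

```python
import pandas as pd

left = pd.DataFrame({'key1': ['foo', 'foo', 'bar'],
                     'key2': ['one', 'two', 'one']})
right = pd.DataFrame({'key1': ['foo', 'foo', 'bar', 'bar'],
                      'key2': ['one', 'one', 'one', 'two'],
                      'rval': [4, 5, 6, 7]})

# Treat each (key1, key2) pair as a single composite key
left_keys = set(zip(left['key1'], left['key2']))
right_keys = set(zip(right['key1'], right['key2']))

inner = pd.merge(left, right, on=['key1', 'key2'], how='inner')
outer = pd.merge(left, right, on=['key1', 'key2'], how='outer')

# Key combinations actually present in each result
inner_result_keys = set(zip(inner['key1'], inner['key2']))
outer_result_keys = set(zip(outer['key1'], outer['key2']))

print(inner_result_keys == (left_keys & right_keys))  # intersection
print(outer_result_keys == (left_keys | right_keys))  # union
```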

NOTE:
When joining columns-on-columns, the indexes on the passed DataFrame objects are discarded.

A last issue to consider in merge operations is the treatment of overlapping column
names. While you can address the overlap manually (see the later section on renaming
axis labels), merge has a suffixes option for specifying strings to append to overlapping
names in the left and right DataFrame objects:

In [30]:
pd.merge(left, right, on='key1')

Unnamed: 0,key1,key2_x,key2_y,rval
0,foo,one,one,4
1,foo,one,one,5
2,foo,two,one,4
3,foo,two,one,5
4,bar,one,one,6
5,bar,one,two,7


In [31]:
pd.merge(left, right, on='key1', suffixes=('_left', '_right'))

Unnamed: 0,key1,key2_left,key2_right,rval
0,foo,one,one,4
1,foo,one,one,5
2,foo,two,one,4
3,foo,two,one,5
4,bar,one,one,6
5,bar,one,two,7
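
For comparison, here is a sketch of the manual route mentioned above: renaming the overlapping column before merging so that no suffix is needed. The new column name `right_key2` is an arbitrary choice made for this illustration.

```python
import pandas as pd

left = pd.DataFrame({'key1': ['foo', 'foo', 'bar'],
                     'key2': ['one', 'two', 'one']})
right = pd.DataFrame({'key1': ['foo', 'foo', 'bar', 'bar'],
                      'key2': ['one', 'one', 'one', 'two'],
                      'rval': [4, 5, 6, 7]})

# Rename the clashing column on one side, then merge on key1 only
right_renamed = right.rename(columns={'key2': 'right_key2'})
merged = pd.merge(left, right_renamed, on='key1')
print(merged)
```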


See Table 7-1 for an argument reference on merge. Joining on index is the subject of the next section.

Table 7-1. merge function arguments

| Argument | Description |
| --- | --- |
| left | DataFrame to be merged on the left side |
| right | DataFrame to be merged on the right side |
| how | One of 'inner', 'outer', 'left' or 'right'. 'inner' by default |
| on | Column names to join on. Must be found in both DataFrame objects. If not specified and no other join keys given, will use the intersection of the column names in left and right as the join keys |
| left_on | Columns in left DataFrame to use as join keys |
| right_on | Analogous to left_on for right DataFrame |
| left_index | Use row index in left as its join key (or keys, if a MultiIndex) |
| right_index | Analogous to left_index |
| sort | Sort merged data lexicographically by join keys; False by default in current pandas releases, which matches the unsorted outputs shown above. Leaving it disabled gives better performance in some cases on large datasets |
| suffixes | Tuple of string values to append to column names in case of overlap; defaults to ('_x', '_y'). For example, if 'data' is in both DataFrame objects, it would appear as 'data_x' and 'data_y' in the result |
| copy | If False, avoid copying data into resulting data structure in some exceptional cases. By default always copies |

### Merging on Index

In some cases, the merge key or keys in a DataFrame will be found in its index. In this
case, you can pass left_index=True or right_index=True (or both) to indicate that the
index should be used as the merge key:

In [32]:
left1 = DataFrame({'key': ['a', 'b', 'a', 'a', 'b', 'c'],'value': range(6)})

In [33]:
right1 = DataFrame({'group_val': [3.5, 7]}, index=['a', 'b'])

In [34]:
left1

Unnamed: 0,key,value
0,a,0
1,b,1
2,a,2
3,a,3
4,b,4
5,c,5


In [35]:
right1

Unnamed: 0,group_val
a,3.5
b,7.0


In [36]:
pd.merge(left1, right1, left_on='key', right_index=True)

Unnamed: 0,key,value,group_val
0,a,0,3.5
2,a,2,3.5
3,a,3,3.5
1,b,1,7.0
4,b,4,7.0


Since the default merge method is to intersect the join keys, you can instead form the
union of them with an outer join:

In [37]:
pd.merge(left1, right1, left_on='key', right_index=True, how='outer')

Unnamed: 0,key,value,group_val
0,a,0,3.5
2,a,2,3.5
3,a,3,3.5
1,b,1,7.0
4,b,4,7.0
5,c,5,


With hierarchically-indexed data, things are a bit more complicated:

In [38]:
lefth = DataFrame({'key1': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada']
                   ,'key2': [2000, 2001, 2002, 2001, 2002]
                   ,'data': np.arange(5.)})

In [39]:
righth = DataFrame(np.arange(12).reshape((6, 2)),
                    index=[['Nevada', 'Nevada', 'Ohio', 'Ohio', 'Ohio', 'Ohio'],
                            [2001, 2000, 2000, 2000, 2001, 2002]], columns=['event1', 'event2'])

In [40]:
lefth

Unnamed: 0,data,key1,key2
0,0.0,Ohio,2000
1,1.0,Ohio,2001
2,2.0,Ohio,2002
3,3.0,Nevada,2001
4,4.0,Nevada,2002


In [41]:
righth

Unnamed: 0,Unnamed: 1,event1,event2
Nevada,2001,0,1
Nevada,2000,2,3
Ohio,2000,4,5
Ohio,2000,6,7
Ohio,2001,8,9
Ohio,2002,10,11


In this case, you have to indicate multiple columns to merge on as a list (pay attention
to the handling of duplicate index values):

In [42]:
pd.merge(lefth, righth, left_on=['key1', 'key2'], right_index=True)

Unnamed: 0,data,key1,key2,event1,event2
0,0.0,Ohio,2000,4,5
0,0.0,Ohio,2000,6,7
1,1.0,Ohio,2001,8,9
2,2.0,Ohio,2002,10,11
3,3.0,Nevada,2001,0,1


In [43]:
pd.merge(lefth, righth, left_on=['key1', 'key2'], right_index=True, how='outer')

Unnamed: 0,data,key1,key2,event1,event2
0,0.0,Ohio,2000,4.0,5.0
0,0.0,Ohio,2000,6.0,7.0
1,1.0,Ohio,2001,8.0,9.0
2,2.0,Ohio,2002,10.0,11.0
3,3.0,Nevada,2001,0.0,1.0
4,4.0,Nevada,2002,,
4,,Nevada,2000,2.0,3.0


Using the indexes of both sides of the merge is also not an issue:

In [44]:
left2 = DataFrame([[1., 2.], [3., 4.], [5., 6.]], index=['a', 'c', 'e'], columns=['Ohio', 'Nevada'])

In [45]:
right2 = DataFrame([[7., 8.], [9., 10.], [11., 12.], [13, 14]], index=['b', 'c', 'd', 'e'], columns=['Missouri', 'Alabama'])

In [46]:
left2

Unnamed: 0,Ohio,Nevada
a,1.0,2.0
c,3.0,4.0
e,5.0,6.0


In [47]:
right2

Unnamed: 0,Missouri,Alabama
b,7.0,8.0
c,9.0,10.0
d,11.0,12.0
e,13.0,14.0


In [48]:
pd.merge(left2, right2, how='outer', left_index=True, right_index=True)

Unnamed: 0,Ohio,Nevada,Missouri,Alabama
a,1.0,2.0,,
b,,,7.0,8.0
c,3.0,4.0,9.0,10.0
d,,,11.0,12.0
e,5.0,6.0,13.0,14.0


DataFrame has a more convenient join instance method for merging by index. It can also be
used to combine many DataFrame objects having the same or similar indexes
but non-overlapping columns. In the prior example, we could have written:

In [50]:
left2.join(right2, how='outer')

Unnamed: 0,Ohio,Nevada,Missouri,Alabama
a,1.0,2.0,,
b,,,7.0,8.0
c,3.0,4.0,9.0,10.0
d,,,11.0,12.0
e,5.0,6.0,13.0,14.0


In part for legacy reasons (much earlier versions of pandas), DataFrame’s join method
performs a left join on the join keys by default. It also supports joining the index of the passed
DataFrame on one of the columns of the calling DataFrame:

In [51]:
left1.join(right1, on='key')

Unnamed: 0,key,value,group_val
0,a,0,3.5
1,b,1,7.0
2,a,2,3.5
3,a,3,3.5
4,b,4,7.0
5,c,5,
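
One way to read join with on= is as shorthand for a left merge against the other frame's index. The sketch below, using the same left1 and right1, compares the two spellings; it is offered as an illustration of the relationship rather than a statement about pandas internals.

```python
import pandas as pd

left1 = pd.DataFrame({'key': ['a', 'b', 'a', 'a', 'b', 'c'], 'value': range(6)})
right1 = pd.DataFrame({'group_val': [3.5, 7]}, index=['a', 'b'])

# join: match left1's 'key' column against right1's index (left join by default)
joined = left1.join(right1, on='key')

# The same operation written out with merge
merged = pd.merge(left1, right1, left_on='key', right_index=True, how='left')

print(joined.equals(merged))
```

Because both are left joins, the row for key 'c' is kept and its group_val is missing.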


Lastly, for simple index-on-index merges, you can pass a list of DataFrames to join as
an alternative to using the more general concat function described below:

In [60]:
another = DataFrame([[7., 8.], [9., 10.], [11., 12.], [16., 17.]], 
                    index=['a', 'c', 'e', 'f'], columns=['New York', 'Oregon'])

In [59]:
left2.join([right2, another])

Unnamed: 0,Ohio,Nevada,Missouri,Alabama,New York,Oregon
a,1.0,2.0,,,7.0,8.0
c,3.0,4.0,9.0,10.0,9.0,10.0
e,5.0,6.0,13.0,14.0,11.0,12.0


In [61]:
left2.join([right2, another], how='outer')

Unnamed: 0,Ohio,Nevada,Missouri,Alabama,New York,Oregon
a,1.0,2.0,,,7.0,8.0
b,,,7.0,8.0,,
c,3.0,4.0,9.0,10.0,9.0,10.0
d,,,11.0,12.0,,
e,5.0,6.0,13.0,14.0,11.0,12.0
f,,,,,16.0,17.0


In [56]:
right2

Unnamed: 0,Missouri,Alabama
b,7.0,8.0
c,9.0,10.0
d,11.0,12.0
e,13.0,14.0


In [57]:
another

Unnamed: 0,New York,Oregon
a,7.0,8.0
c,9.0,10.0
e,11.0,12.0
f,16.0,17.0


### Concatenating Along an Axis
Another kind of data combination operation is alternatively referred to as concatenation,
binding, or stacking. NumPy has a concatenate function for doing this with raw
NumPy arrays:

In [62]:
arr = np.arange(12).reshape((3, 4))

In [63]:
arr

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

In [64]:
np.concatenate([arr, arr], axis=1)

array([[ 0,  1,  2,  3,  0,  1,  2,  3],
       [ 4,  5,  6,  7,  4,  5,  6,  7],
       [ 8,  9, 10, 11,  8,  9, 10, 11]])

In the context of pandas objects such as Series and DataFrame, having labeled axes
enables you to further generalize array concatenation. In particular, you have a number
of additional things to think about:
* If the objects are indexed differently on the other axes, should the collection of axes be unioned or intersected?
* Do the groups need to be identifiable in the resulting object?
* Does the concatenation axis matter at all?

The concat function in pandas provides a consistent way to address each of these concerns.
I’ll give a number of examples to illustrate how it works. Suppose we have three
Series with no index overlap:

In [65]:
s1 = Series([0, 1], index=['a', 'b'])
s2 = Series([2, 3, 4], index=['c', 'd', 'e'])
s3 = Series([5, 6], index=['f', 'g'])

Calling concat with these objects in a list glues together the values and indexes:

In [66]:
pd.concat([s1, s2, s3])

a    0
b    1
c    2
d    3
e    4
f    5
g    6
dtype: int64

In [67]:
s1

a    0
b    1
dtype: int64

By default concat works along axis=0, producing another Series. If you pass axis=1, the
result will instead be a DataFrame (axis=1 is the columns):

In [68]:
pd.concat([s1, s2, s3], axis=1)

Unnamed: 0,0,1,2
a,0.0,,
b,1.0,,
c,,2.0,
d,,3.0,
e,,4.0,
f,,,5.0
g,,,6.0


In this case there is no overlap on the other axis, which as you can see is the sorted
union (the 'outer' join) of the indexes. You can instead intersect them by passing
join='inner':

In [69]:
s4 = pd.concat([s1 * 5, s3])

In [70]:
pd.concat([s1, s4], axis=1)

Unnamed: 0,0,1
a,0.0,0
b,1.0,5
f,,5
g,,6


In [71]:
pd.concat([s1, s4], axis=1, join='inner')

Unnamed: 0,0,1
a,0,0
b,1,5


You can even specify the axes to be used on the other axes with join_axes:

In [72]:
pd.concat([s1, s4], axis=1, join_axes=[['a', 'c', 'b', 'e']])

Unnamed: 0,0,1
a,0.0,0.0
c,,
b,1.0,5.0
e,,
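
Note that join_axes was deprecated in pandas 0.25 and removed in 1.0, so the call above fails on recent releases. A minimal sketch of one replacement, assuming a modern pandas, is to concatenate first and then reindex the result to the labels you want:

```python
import pandas as pd

s1 = pd.Series([0, 1], index=['a', 'b'])
s3 = pd.Series([5, 6], index=['f', 'g'])
s4 = pd.concat([s1 * 5, s3])

# Concatenate along the columns, then select and order the row labels explicitly.
# Labels absent from the concatenated result ('c', 'e') become all-NaN rows.
result = pd.concat([s1, s4], axis=1).reindex(['a', 'c', 'b', 'e'])
print(result)
```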


One issue is that the concatenated pieces are not identifiable in the result. Suppose
instead you wanted to create a hierarchical index on the concatenation axis. To do this,
use the keys argument:

In [73]:
result = pd.concat([s1, s1, s3], keys=['one', 'two', 'three'])

In [74]:
result

one    a    0
       b    1
two    a    0
       b    1
three  f    5
       g    6
dtype: int64

In [75]:
# Much more on the unstack function later
result.unstack()

Unnamed: 0,a,b,f,g
one,0.0,1.0,,
two,0.0,1.0,,
three,,,5.0,6.0
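
A practical consequence of passing keys is that each original piece can be pulled back out by its label. A short sketch with the same Series:

```python
import pandas as pd

s1 = pd.Series([0, 1], index=['a', 'b'])
s3 = pd.Series([5, 6], index=['f', 'g'])

result = pd.concat([s1, s1, s3], keys=['one', 'two', 'three'])

# The outer level of the hierarchical index holds the keys,
# so selecting on it returns the corresponding piece
piece = result.loc['three']
print(piece)
```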


In the case of combining Series along axis=1, the keys become the DataFrame column
headers:

In [76]:
pd.concat([s1, s2, s3], axis=1, keys=['one', 'two', 'three'])

Unnamed: 0,one,two,three
a,0.0,,
b,1.0,,
c,,2.0,
d,,3.0,
e,,4.0,
f,,,5.0
g,,,6.0


The same logic extends to DataFrame objects:

In [77]:
df1 = DataFrame(np.arange(6).reshape(3, 2), index=['a', 'b', 'c'], columns=['one', 'two'])

In [78]:
df2 = DataFrame(5 + np.arange(4).reshape(2, 2), index=['a', 'c'], columns=['three', 'four'])

In [79]:
pd.concat([df1, df2], axis=1, keys=['level1', 'level2'])

Unnamed: 0_level_0,level1,level1,level2,level2
Unnamed: 0_level_1,one,two,three,four
a,0,1,5.0,6.0
b,2,3,,
c,4,5,7.0,8.0


If you pass a dict of objects instead of a list, the dict’s keys will be used for the keys
option:

In [80]:
pd.concat({'level1': df1, 'level2': df2}, axis=1)

Unnamed: 0_level_0,level1,level1,level2,level2
Unnamed: 0_level_1,one,two,three,four
a,0,1,5.0,6.0
b,2,3,,
c,4,5,7.0,8.0


There are a couple of additional arguments governing how the hierarchical index is
created (see Table 7-2):

In [82]:
pd.concat([df1, df2], axis=1, keys=['level1', 'level2'], names=['upper', 'lower'])

upper,level1,level1,level2,level2
lower,one,two,three,four
a,0,1,5.0,6.0
b,2,3,,
c,4,5,7.0,8.0


A last consideration concerns DataFrames in which the row index is not meaningful in the context of the analysis:

In [83]:
df1 = DataFrame(np.random.randn(3, 4), columns=['a', 'b', 'c', 'd'])

In [84]:
df2 = DataFrame(np.random.randn(2, 3), columns=['b', 'd', 'a'])

In [85]:
df1

Unnamed: 0,a,b,c,d
0,-0.654771,0.576284,-0.027951,1.126414
1,0.313798,0.65661,1.131592,0.266183
2,-0.625898,1.236147,-0.537745,1.468091


In [87]:
df2

Unnamed: 0,b,d,a
0,-0.428264,1.84225,-0.633368
1,-0.456348,-0.931057,0.211579


In this case, you can pass ignore_index=True:

In [88]:
pd.concat([df1, df2], ignore_index=True)

Unnamed: 0,a,b,c,d
0,-0.654771,0.576284,-0.027951,1.126414
1,0.313798,0.65661,1.131592,0.266183
2,-0.625898,1.236147,-0.537745,1.468091
3,-0.633368,-0.428264,,1.84225
4,0.211579,-0.456348,,-0.931057


Table 7-2. concat function arguments

| Argument | Description |
| --- | --- |
| objs | List or dict of pandas objects to be concatenated. The only required argument |
| axis | Axis to concatenate along; defaults to 0 |
| join | One of 'inner', 'outer', defaulting to 'outer'; whether to intersect (inner) or union (outer) the indexes along the other axes |
| join_axes | Specific indexes to use for the other n-1 axes instead of performing union/intersection logic (deprecated in pandas 0.25 and removed in 1.0) |
| keys | Values to associate with objects being concatenated, forming a hierarchical index along the concatenation axis. Can either be a list or array of arbitrary values, an array of tuples, or a list of arrays (if multiple level arrays passed in levels) |
| levels | Specific indexes to use as hierarchical index level or levels if keys passed |
| names | Names for created hierarchical levels if keys and / or levels passed |
| verify_integrity | Check new axis in concatenated object for duplicates and raise exception if so. By default (False) allows duplicates |
| ignore_index | Do not preserve indexes along concatenation axis, instead producing a new range(total_length) index |
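
Of the arguments in Table 7-2, verify_integrity is the only one not exercised in the examples above. A minimal sketch of its behavior, using a Series concatenated with itself so that the labels collide:

```python
import pandas as pd

s1 = pd.Series([0, 1], index=['a', 'b'])

# By default, duplicate labels on the concatenation axis are allowed
with_dupes = pd.concat([s1, s1])
print(with_dupes.index.is_unique)

# verify_integrity=True raises instead of silently producing duplicates
try:
    pd.concat([s1, s1], verify_integrity=True)
    raised = False
except ValueError as exc:
    raised = True
    print('concat rejected:', exc)

# Adding keys makes the labels unique again, so the check passes
keyed = pd.concat([s1, s1], keys=['first', 'second'], verify_integrity=True)
print(keyed)
```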

### Combining Data with Overlap

Another data combination situation can’t be expressed as either a merge or concatenation
operation. You may have two datasets whose indexes overlap in full or part. As
a motivating example, consider NumPy’s where function, which expresses a vectorized
if-else:

In [89]:
a = Series([np.nan, 2.5, np.nan, 3.5, 4.5, np.nan], index=['f', 'e', 'd', 'c', 'b', 'a'])

In [90]:
b = Series(np.arange(len(a), dtype=np.float64), index=['f', 'e', 'd', 'c', 'b', 'a'])

In [91]:
# Set the last element by position; the index holds string labels
b.iloc[-1] = np.nan

In [92]:
a

f    NaN
e    2.5
d    NaN
c    3.5
b    4.5
a    NaN
dtype: float64

In [93]:
b

f    0.0
e    1.0
d    2.0
c    3.0
b    4.0
a    NaN
dtype: float64

In [94]:
np.where(pd.isnull(a), b, a)

array([ 0. ,  2.5,  2. ,  3.5,  4.5,  nan])

Series has a combine_first method, which performs the equivalent of this operation plus data alignment:

In [95]:
b[:-2].combine_first(a[2:])

a    NaN
b    4.5
c    3.0
d    2.0
e    1.0
f    0.0
dtype: float64
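
When the two Series share the same index, combine_first and the np.where expression should agree. The sketch below checks this with the a and b defined above; `via_where` and `via_combine` are names introduced just for the comparison.

```python
import numpy as np
import pandas as pd

labels = ['f', 'e', 'd', 'c', 'b', 'a']
a = pd.Series([np.nan, 2.5, np.nan, 3.5, 4.5, np.nan], index=labels)
b = pd.Series(np.arange(len(a), dtype=np.float64), index=labels)
b.iloc[-1] = np.nan

# Vectorized if-else: take b where a is missing, otherwise a
via_where = pd.Series(np.where(pd.isnull(a), b, a), index=a.index)

# combine_first: a's values, patched with b's where a is missing
via_combine = a.combine_first(b)

# Reindex to a's label order before comparing, in case the result was reordered
print(via_combine.reindex(a.index).equals(via_where))
```

The difference shows up when the indexes do not match, as in the b[:-2] and a[2:] call above: np.where works purely by position, while combine_first aligns on labels first.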

With DataFrames, combine_first naturally does the same thing column by column, so
you can think of it as “patching” missing data in the calling object with data from the
object you pass:

In [96]:
b[:-2]

f    0.0
e    1.0
d    2.0
c    3.0
dtype: float64

In [97]:
a[2:]

d    NaN
c    3.5
b    4.5
a    NaN
dtype: float64

In [100]:
df1 = DataFrame({'a': [1., np.nan, 5., np.nan],
        'b': [np.nan, 2., np.nan, 6.],
        'c': range(2, 18, 4)})

In [101]:
df2 = DataFrame({'a': [5., 4., np.nan, 3., 7.],
        'b': [np.nan, 3., 4., 6., 8.]})

In [102]:
df1.combine_first(df2)

Unnamed: 0,a,b,c
0,1.0,,2.0
1,4.0,2.0,6.0
2,5.0,4.0,10.0
3,3.0,6.0,14.0
4,7.0,8.0,


In [103]:
df1

Unnamed: 0,a,b,c
0,1.0,,2
1,,2.0,6
2,5.0,,10
3,,6.0,14


In [104]:
df2

Unnamed: 0,a,b
0,5.0,
1,4.0,3.0
2,,4.0
3,3.0,6.0
4,7.0,8.0
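
To make the “patching” idea concrete, the DataFrame result can be approximated by hand: align both frames to the union of their row and column labels, then fill the gaps in the first from the second. This is a sketch of the idea under that assumption, not a description of how combine_first is implemented; `rows`, `cols`, and `manual` are names invented for the illustration.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'a': [1., np.nan, 5., np.nan],
                    'b': [np.nan, 2., np.nan, 6.],
                    'c': range(2, 18, 4)})
df2 = pd.DataFrame({'a': [5., 4., np.nan, 3., 7.],
                    'b': [np.nan, 3., 4., 6., 8.]})

patched = df1.combine_first(df2)

# Manual version: align on the union of labels, then fill missing values
rows = df1.index.union(df2.index)
cols = df1.columns.union(df2.columns)
manual = (df1.reindex(index=rows, columns=cols)
             .fillna(df2.reindex(index=rows, columns=cols)))

# Compare values, treating NaN in the same position as equal
same = np.allclose(patched[cols].to_numpy(dtype=float),
                   manual[cols].to_numpy(dtype=float),
                   equal_nan=True)
print(same)
```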
