# Week 7. Data Wrangling: Join, Combine, and Reshape

In many applications, data may be spread across a number of files or databases, or be arranged in a form that is not easy to analyze. Before analysis, such data must first be combined, joined, and rearranged.

In [5]:
import numpy as np
import pandas as pd
pd.options.display.max_rows = 10
np.random.seed(12345)
import matplotlib.pyplot as plt
plt.rc('figure', figsize=(10, 6))
np.set_printoptions(precision=4, suppress=True)

---

## 7.3 Hierarchical Indexing

Hierarchical indexing is an important feature of pandas that enables you to have multiple (two or more) index levels on an axis. 
 * It provides a way for you to work with higher dimensional data in a lower dimensional form. 
 
Let's create a Series with a **list of lists (or arrays)** as the index:

In [6]:
data = pd.Series(np.random.randn(9),
                 index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
                        [1, 2, 3, 1, 3, 1, 2, 2, 3]])
data

a  1   -0.204708
   2    0.478943
   3   -0.519439
b  1   -0.555730
   3    1.965781
c  1    1.393406
   2    0.092908
d  2    0.281746
   3    0.769023
dtype: float64

In [4]:
data.index

MultiIndex([('a', 1),
            ('a', 2),
            ('a', 3),
            ('b', 1),
            ('b', 3),
            ('c', 1),
            ('c', 2),
            ('d', 2),
            ('d', 3)],
           )
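The levels of a ```MultiIndex``` can be inspected directly. The following is a minimal sketch using a small hand-made Series; the name ```s``` and its values are illustrative and are not part of the lecture data.

```python
import pandas as pd

# Illustrative Series with a two-level index (not the lecture's `data`)
s = pd.Series([10, 20, 30, 40],
              index=[['a', 'a', 'b', 'b'], [1, 2, 1, 3]])

print(s.index.nlevels)               # number of index levels
print(s.index.get_level_values(0))   # outer label of every row
print(s.index.get_level_values(1))   # inner label of every row
print(s.index.levels)                # unique labels stored for each level
```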

**Partial indexing for a hierarchically indexed object**

In [None]:
data['b']

In [None]:
data['b':'c']

In [None]:
data.loc[['b', 'd']]

Selection is even possible from an **“inner”** level.

Suppose that you want to extract all the rows whose inner index values equal 2:

In [None]:
data.loc[:, 2]
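To make the two kinds of selection easier to compare, here is a minimal sketch on a small Series with fixed values (the name ```s``` is illustrative). It also shows ```xs```, which should give the same inner-level selection as ```loc[:, 2]```.

```python
import pandas as pd

# Illustrative Series with fixed values so the selections are easy to read
s = pd.Series([10, 20, 30, 40, 50],
              index=[['a', 'a', 'b', 'b', 'c'], [1, 2, 1, 2, 2]])

outer = s['b']                # outer label: the inner level becomes the index
inner = s.loc[:, 2]           # inner label 2 across every outer label
inner_xs = s.xs(2, level=1)   # the same selection written as a cross-section

print(outer)
print(inner)
```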

Hierarchical indexing plays an important role in reshaping data and group-based operations like forming a **pivot table**. 

In [None]:
data.unstack()   # transform one-dimensional Series into two-dimensional DataFrame

In [None]:
# The inverse operation of unstack is stack:
data.unstack().stack()
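A small sketch of the round trip, assuming a hand-made Series ```s```. Label combinations that do not exist in ```s``` appear as NaN after ```unstack```. The explicit ```dropna()``` and ```sort_index()``` calls are a precaution, since how ```stack``` treats missing values and row order has differed between pandas versions.

```python
import pandas as pd

# Illustrative Series: the combination ('b', 2) is deliberately missing
s = pd.Series([1.0, 2.0, 3.0],
              index=[['a', 'a', 'b'], [1, 2, 1]])

wide = s.unstack()   # rows: outer level, columns: inner level, ('b', 2) is NaN
round_trip = wide.stack().dropna().sort_index()   # back to a two-level Series

print(wide)
print(round_trip)
```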

With a DataFrame, either axis can have a hierarchical index:

In [None]:
frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                     columns=[['Ohio', 'Ohio', 'Colorado'],
                              ['Green', 'Red', 'Green']])
frame

The hierarchical levels can have names (as strings or any Python objects):

In [None]:
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']
frame

With partial column indexing you can similarly select groups of columns:

In [None]:
frame['Ohio']

A ```MultiIndex``` can be created by itself and then reused; the columns in the preceding DataFrame with level names could be created like this:

In [None]:
pd.MultiIndex.from_arrays([['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']],
                          names=['state', 'color'])
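As a sketch of the reuse mentioned above, the same ```MultiIndex``` object can label the columns of several frames. The names ```cols```, ```df_a```, and ```df_b``` are illustrative.

```python
import numpy as np
import pandas as pd

cols = pd.MultiIndex.from_arrays(
    [['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']],
    names=['state', 'color'])

# The same MultiIndex labels the columns of two different frames
df_a = pd.DataFrame(np.arange(6).reshape(2, 3), columns=cols)
df_b = pd.DataFrame(np.ones((2, 3)), columns=cols)

ohio = df_a['Ohio']                # DataFrame keeping the remaining 'color' level
ohio_red = df_a[('Ohio', 'Red')]   # a full tuple selects a single column
```

Partial indexing with only the outer label returns a DataFrame, while a full tuple returns a Series.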

### 7.3.1 Reordering and Sorting Levels

* At times you will need to rearrange the order of the levels on an axis or sort the data by the values in one specific level. <br>
<br>
* The ```swaplevel``` method takes two level numbers or names and returns a new object with the levels interchanged (***but the data is otherwise unaltered***). 

In [None]:
frame

In [None]:
frame.swaplevel('key1', 'key2')

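```sort_index``` with the ```level``` argument sorts the data by the values **in a single level**; by default the remaining levels are then used to break ties (```sort_remaining=True```).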

In [None]:
frame.sort_index(level=1)

In [None]:
frame.sort_index(level=0)
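The two operations are often combined: swap the levels, then sort so that the new outer level is in order. A minimal sketch, assuming a small illustrative frame ```df``` with a single ```value``` column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'value': np.arange(4)},
                  index=pd.MultiIndex.from_arrays(
                      [['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                      names=['key1', 'key2']))

swapped = df.swaplevel('key1', 'key2')   # levels exchanged, row order unchanged
ordered = swapped.sort_index(level=0)    # sorted by key2, which is now the outer level

print(swapped)
print(ordered)
```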

### 7.3.3 Indexing with a DataFrame's columns

In [None]:
frame = pd.DataFrame({'a': range(7), 'b': range(7, 0, -1),
                      'c': ['one', 'one', 'one', 'two', 'two',
                            'two', 'two'],
                      'd': [0, 1, 2, 0, 1, 2, 3]})
frame

DataFrame’s ```set_index``` method will create a new DataFrame using one or more of its columns as the index:

In [None]:
frame2 = frame.set_index(['c', 'd'])
frame2

By default the columns are removed from the DataFrame, though you can leave them in:

In [None]:
frame.set_index(['c', 'd'], drop=False)

```reset_index``` does the opposite of ```set_index```: the hierarchical index levels are moved into the columns:

In [None]:
frame2.reset_index()
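The following sketch puts the three calls side by side on a small illustrative frame ```df```. Note that ```reset_index``` places the former index levels in front of the remaining columns, so the column order can differ from the original.

```python
import pandas as pd

df = pd.DataFrame({'a': [0, 1, 2],
                   'c': ['one', 'one', 'two'],
                   'd': [0, 1, 0]})

indexed = df.set_index(['c', 'd'])            # 'c' and 'd' leave the columns
kept = df.set_index(['c', 'd'], drop=False)   # 'c' and 'd' also stay as columns
restored = indexed.reset_index()              # index levels return as columns

print(list(indexed.columns))
print(list(kept.columns))
print(list(restored.columns))
```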

---

## 7.4 Combining and Merging Datasets

Data contained in pandas objects can be combined together in a number of ways:
* ```pandas.merge```: connects rows in DataFrames based on one or more keys. 
* ```pandas.concat```: concatenates or “stacks” together objects along an axis.
* The ```combine_first``` method enables splicing together overlapping data to fill in missing values in one object with values from another.

### 7.4.1 Database-Style DataFrame Joins

```Merge``` or ```join``` operations combine datasets by linking rows using one or more keys. These operations are central to relational databases (e.g., SQL-based ones).

In [None]:
df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                    'data1': range(7)})
df1

In [None]:
df2 = pd.DataFrame({'key': ['a', 'b', 'd'],
                    'data2': range(3)})
df2

This is an example of a **many-to-one** join: The data in ```df1``` has multiple rows labeled a and b, whereas ```df2``` has only one row for each value in the ```key``` column.

In [None]:
pd.merge(df1, df2)

Note that I didn’t specify which column to join on. If that information is not specified, merge uses the overlapping column names as the keys. **It’s a good practice to specify explicitly, though:**

In [None]:
pd.merge(df1, df2, on='key')  

If the column names are different in each object, you can specify them separately:

In [None]:
df3 = pd.DataFrame({'lkey': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                    'data1': range(7)})
df4 = pd.DataFrame({'rkey': ['a', 'b', 'd'],
                    'data2': range(3)})
pd.merge(df3, df4, left_on='lkey', right_on='rkey')

* By default merge does an ```'inner'``` join; the keys in the result are the intersection, or the common set found in both tables. <br>
<br>
* Other possible options are ```'left'```, ```'right'```, and ```'outer'```. 
  * The outer join takes the union of the keys, combining the effect of applying both left and right joins.

In [None]:
pd.merge(df1, df2, how='outer')
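To compare the four join types, the sketch below merges two small illustrative frames (```left``` and ```right``` are not the lecture's ```df1```/```df2```) and uses the ```indicator=True``` option of ```pd.merge```, which adds a ```_merge``` column recording where each row came from.

```python
import pandas as pd

left = pd.DataFrame({'key': ['a', 'b', 'c'], 'lval': [1, 2, 3]})
right = pd.DataFrame({'key': ['b', 'c', 'd'], 'rval': [4, 5, 6]})

# One merge per join type; _merge is 'left_only', 'right_only', or 'both'
results = {how: pd.merge(left, right, on='key', how=how, indicator=True)
           for how in ['inner', 'left', 'right', 'outer']}

for how, merged in results.items():
    print(how, sorted(merged['key']), merged['_merge'].tolist())
```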

**Many-to-many merges:**

In [None]:
df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
                    'data1': range(6)})
df1

In [None]:
df2 = pd.DataFrame({'key': ['a', 'b', 'a', 'b', 'd'],
                    'data2': range(5)})
df2

In [None]:
pd.merge(df1, df2, on='key', how='left')

Many-to-many joins form the Cartesian product of the rows that share a key. Since there were three 'b' rows in the left DataFrame and two in the right one, there are six 'b' rows in the result.

In [None]:
pd.merge(df1, df2, how='inner')
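Because a many-to-many merge can silently multiply rows, it can help to count them or to ask pandas to check the key relationship. The sketch below uses small illustrative frames and the ```validate``` option of ```pd.merge```, which raises ```pandas.errors.MergeError``` when the keys are not unique on the side declared as "one".

```python
import pandas as pd
from pandas.errors import MergeError

left = pd.DataFrame({'key': ['b', 'b', 'a', 'b'], 'data1': range(4)})
right = pd.DataFrame({'key': ['a', 'b', 'b'], 'data2': range(3)})

merged = pd.merge(left, right, on='key')
n_b = int((merged['key'] == 'b').sum())   # 3 left rows x 2 right rows

# Declare the expected relationship; duplicate right-hand keys violate it
try:
    pd.merge(left, right, on='key', validate='many_to_one')
    right_keys_unique = True
except MergeError:
    right_keys_unique = False

print(n_b, len(merged), right_keys_unique)
```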

To merge with **multiple keys**, pass a list of column names:

In [None]:
left = pd.DataFrame({'key1': ['foo', 'foo', 'bar'],
                     'key2': ['one', 'two', 'one'],
                     'lval': [1, 2, 3]})
right = pd.DataFrame({'key1': ['foo', 'foo', 'bar', 'bar'],
                      'key2': ['one', 'one', 'one', 'two'],
                      'rval': [4, 5, 6, 7]})
pd.merge(left, right, on=['key1', 'key2'], how='outer')

A last issue to consider in merge operations is the treatment of overlapping column names. While you can address the overlap manually, merge has a ```suffixes``` option for specifying strings to append to overlapping names in the left and right DataFrame objects:

In [None]:
pd.merge(left, right, on='key1')

In [None]:
pd.merge(left, right, on='key1', suffixes=('_left', '_right'))

### 7.4.2 Merging on Index

In some cases, the merge key(s) in a DataFrame will be found in its index. In this case, you can pass ```left_index=True``` or ```right_index=True``` (or both) to indicate that the index should be used as the merge key.

In [None]:
left1 = pd.DataFrame({'key': ['a', 'b', 'a', 'a', 'b', 'c'],
                      'value': range(6)})
left1

In [None]:
right1 = pd.DataFrame({'group_val': [3.5, 7]}, index=['a', 'b'])
right1

In [None]:
pd.merge(left1, right1, left_on='key', right_index=True)

In [None]:
pd.merge(left1, right1, left_on='key', right_index=True, how='outer')

With hierarchically indexed data, things are more complicated, as joining on index is implicitly a **multiple-key merge**:

In [None]:
lefth = pd.DataFrame({'key1': ['Ohio', 'Ohio', 'Ohio',
                               'Nevada', 'Nevada'],
                      'key2': [2000, 2001, 2002, 2001, 2002],
                      'data': np.arange(5.)})
lefth

In [None]:
righth = pd.DataFrame(np.arange(12).reshape((6, 2)),
                      index=[['Nevada', 'Nevada', 'Ohio', 'Ohio',
                              'Ohio', 'Ohio'],
                             [2001, 2000, 2000, 2000, 2001, 2002]],
                      columns=['event1', 'event2'])
righth

In [None]:
pd.merge(lefth, righth, left_on=['key1', 'key2'], right_index=True)

In [None]:
pd.merge(lefth, righth, left_on=['key1', 'key2'],
         right_index=True, how='outer')
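One way to see why this is a multiple-key merge is to move the index levels into ordinary columns and merge on those instead. The sketch below uses smaller illustrative frames; the helper ```normalized``` is hypothetical and only exists to compare the two results independently of row and column order.

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({'key1': ['Ohio', 'Ohio', 'Nevada'],
                     'key2': [2000, 2001, 2001],
                     'data': np.arange(3.)})
right = pd.DataFrame(np.arange(8).reshape((4, 2)),
                     index=[['Nevada', 'Ohio', 'Ohio', 'Ohio'],
                            [2001, 2000, 2000, 2001]],
                     columns=['event1', 'event2'])

# Merge two left columns against the two levels of the right index
on_index = pd.merge(left, right, left_on=['key1', 'key2'], right_index=True)

# Same join after turning the index levels into columns named like the keys
right_cols = right.rename_axis(['key1', 'key2']).reset_index()
on_columns = pd.merge(left, right_cols, on=['key1', 'key2'])

def normalized(df):
    """Fix column and row order and drop the index, so only content is compared."""
    df = df[sorted(df.columns)]
    return df.sort_values(list(df.columns)).reset_index(drop=True)

same = normalized(on_index).equals(normalized(on_columns))
print(same)
```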

In [None]:
left2 = pd.DataFrame([[1., 2.], [3., 4.], [5., 6.]],
                     index=['a', 'c', 'e'],
                     columns=['Ohio', 'Nevada'])
left2

In [None]:
right2 = pd.DataFrame([[7., 8.], [9., 10.], [11., 12.], [13, 14]],
                      index=['b', 'c', 'd', 'e'],
                      columns=['Missouri', 'Alabama'])
right2

In [None]:
pd.merge(left2, right2, how='outer', left_index=True, right_index=True)

DataFrame has a convenient ```join``` method for merging by index.
* In part for legacy reasons (i.e., much earlier versions of pandas), DataFrame’s ```join``` method performs a left join on the join keys, exactly preserving the left frame’s row index.

In [None]:
left2.join(right2, how='outer')

In [None]:
left1.join(right1, on='key')   # match left1's 'key' column against right1's index
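The relationship between ```join``` and ```merge``` can be checked directly. In this sketch (with illustrative frames ```left``` and ```right```), ```join``` with ```on='key'``` is compared with the corresponding left merge against the right frame's index.

```python
import pandas as pd

left = pd.DataFrame({'key': ['a', 'b', 'a', 'c'], 'value': range(4)})
right = pd.DataFrame({'group_val': [3.5, 7.0]}, index=['a', 'b'])

joined = left.join(right, on='key')   # left join by default
merged = pd.merge(left, right, left_on='key', right_index=True, how='left')

print(joined)
print(joined.equals(merged))
```

The row for key ```'c'``` has no match in ```right```, so its ```group_val``` is NaN, and the left frame's row index is kept as is.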

Lastly, for simple index-on-index merges, you can pass **a list of DataFrames** to ```join``` as an alternative to using the more general ```concat``` function described later:

In [None]:
another = pd.DataFrame([[7., 8.], [9., 10.], [11., 12.], [16., 17.]],
                       index=['a', 'c', 'e', 'f'],
                       columns=['New York', 'Oregon'])
another

In [None]:
left2.join([right2, another])

In [None]:
left2.join([right2, another], how='outer')

### 7.4.3 Concatenating Along an Axis

Another kind of data combination operation is referred to interchangeably as concatenation, binding, or stacking. NumPy’s ```concatenate``` function can do this with NumPy arrays:

In [7]:
arr = np.arange(12).reshape((3, 4))
arr

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

In [8]:
np.concatenate([arr, arr], axis=1)

array([[ 0,  1,  2,  3,  0,  1,  2,  3],
       [ 4,  5,  6,  7,  4,  5,  6,  7],
       [ 8,  9, 10, 11,  8,  9, 10, 11]])

In the context of pandas objects such as Series and DataFrame, having labeled axes enables you to further generalize array concatenation with the ```pandas.concat``` function.

In [9]:
# Suppose we have three Series with no index overlap:
s1 = pd.Series([0, 1], index=['a', 'b'])
s2 = pd.Series([2, 3, 4], index=['c', 'd', 'e'])
s3 = pd.Series([5, 6], index=['f', 'g'])

In [10]:
pd.concat([s1, s2, s3])

a    0
b    1
c    2
d    3
e    4
f    5
g    6
dtype: int64

* By default concat works along ```axis=0```, producing another Series.  <br>
<br>
* If you pass ```axis=1```, the result will instead be a DataFrame (```axis=1``` is the columns):

In [None]:
pd.concat([s1, s2, s3], axis=1)

In this case there is no overlap on the other axis, which as you can see is the sorted union (the 'outer' join) of the indexes. You can instead intersect them by passing ```join='inner'```.

In [None]:
s4 = pd.concat([s1, s3])
s4

In [None]:
pd.concat([s1, s4], axis=1)

In [None]:
pd.concat([s1, s4], axis=1, join='inner')  # 'f' and 'g' labels disappeared because of join='inner' option.

Suppose instead you wanted to create a hierarchical index on the concatenation axis. To do this, use the ```keys``` argument:

In [11]:
s1

a    0
b    1
dtype: int64

In [12]:
s2

c    2
d    3
e    4
dtype: int64

In [13]:
s3

f    5
g    6
dtype: int64

In [14]:
result = pd.concat([s1, s1, s3], keys=['one', 'two', 'three'])   
result   # create a hierarchical index on the concatenation axis

one    a    0
       b    1
two    a    0
       b    1
three  f    5
       g    6
dtype: int64

In [15]:
result.unstack()

         a    b    f    g
one    0.0  1.0  NaN  NaN
two    0.0  1.0  NaN  NaN
three  NaN  NaN  5.0  6.0


In the case of combining Series along ```axis=1```, the ```keys``` become the DataFrame **column headers**:

In [16]:
pd.concat([s1, s2, s3], axis=1, keys=['one', 'two', 'three'])

   one  two  three
a  0.0  NaN    NaN
b  1.0  NaN    NaN
c  NaN  2.0    NaN
d  NaN  3.0    NaN
e  NaN  4.0    NaN
f  NaN  NaN    5.0
g  NaN  NaN    6.0


In [17]:
df1 = pd.DataFrame(np.arange(6).reshape(3, 2), index=['a', 'b', 'c'],
                   columns=['one', 'two'])
df1

   one  two
a    0    1
b    2    3
c    4    5


In [18]:
df2 = pd.DataFrame(5 + np.arange(4).reshape(2, 2), index=['a', 'c'],
                   columns=['three', 'four'])
df2

   three  four
a      5     6
c      7     8

In [19]:
pd.concat([df1, df2], axis=1, keys=['level1', 'level2'])

  level1     level2
     one two  three four
a      0   1    5.0  6.0
b      2   3    NaN  NaN
c      4   5    7.0  8.0


If you pass a dict of objects instead of a list, the dict’s keys will be used for the ```keys``` option:

In [20]:
pd.concat({'level1': df1, 'level2': df2}, axis=1)

  level1     level2
     one two  three four
a      0   1    5.0  6.0
b      2   3    NaN  NaN
c      4   5    7.0  8.0


In [21]:
pd.concat([df1, df2], axis=1, keys=['level1', 'level2'],
          names=['upper', 'lower'])   # we can name the created axis levels with the names argument

upper level1     level2
lower    one two  three four
a          0   1    5.0  6.0
b          2   3    NaN  NaN
c          4   5    7.0  8.0


A last consideration concerns DataFrames in which the row index does not contain any relevant data:

In [22]:
df1 = pd.DataFrame(np.random.randn(3, 4), columns=['a', 'b', 'c', 'd'])
df1

          a         b         c         d
0  1.246435  1.007189 -1.296221  0.274992
1  0.228913  1.352917  0.886429 -2.001637
2 -0.371843  1.669025 -0.438570 -0.539741


In [23]:
df2 = pd.DataFrame(np.random.randn(2, 3), columns=['b', 'd', 'a'])
df2

          b         d         a
0  0.476985  3.248944 -1.021228
1 -0.577087  0.124121  0.302614


In [24]:
pd.concat([df1, df2], ignore_index=True)

          a         b         c         d
0  1.246435  1.007189 -1.296221  0.274992
1  0.228913  1.352917  0.886429 -2.001637
2 -0.371843  1.669025 -0.438570 -0.539741
3 -1.021228  0.476985       NaN  3.248944
4  0.302614 -0.577087       NaN  0.124121


### ```concat``` function arguments

* ```objs``` List or dict of pandas objects to be concatenated; this is the only required argument
* ```axis``` Axis to concatenate along; defaults to 0 (along rows)
* ```join``` Either ```'inner'``` or ```'outer'``` (```'outer'``` by default); whether to intersect (inner) or union (outer) the indexes along the other axes
* ```keys``` Values to associate with objects being concatenated, forming a hierarchical index along the concatenation axis; can be a list or array of arbitrary values, an array of tuples, or a list of arrays (if multiple-level arrays are passed in ```levels```)
* ```levels``` Specific indexes to use as hierarchical index level or levels if keys passed
* ```names``` Names for created hierarchical levels if ```keys``` and/or ```levels``` passed
* ```ignore_index``` Do not preserve indexes along concatenation ```axis```, instead producing a new ```range(total_length)``` index
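The arguments listed above can be tried out on two small Series with partially overlapping labels. This is a minimal sketch; the Series and the names ```'source'``` and ```'label'``` are illustrative.

```python
import pandas as pd

s1 = pd.Series([0, 1], index=['a', 'b'])
s2 = pd.Series([2, 3, 4], index=['b', 'c', 'd'])

outer = pd.concat([s1, s2], axis=1)                 # union of the row labels
inner = pd.concat([s1, s2], axis=1, join='inner')   # intersection of the row labels
keyed = pd.concat([s1, s2], keys=['x', 'y'],
                  names=['source', 'label'])        # named hierarchical index
flat = pd.concat([s1, s2], ignore_index=True)       # labels replaced by 0..n-1

print(outer)
print(inner)
print(keyed)
print(flat)
```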

### 7.4.4 Combining Data with Overlap

In [25]:
a = pd.Series([np.nan, 2.5, np.nan, 3.5, 4.5, np.nan],
              index=['f', 'e', 'd', 'c', 'b', 'a'])
a

f    NaN
e    2.5
d    NaN
c    3.5
b    4.5
a    NaN
dtype: float64

In [26]:
b = pd.Series(np.arange(len(a), dtype=np.float64),
              index=['f', 'e', 'd', 'c', 'b', 'a'])
b.iloc[-1] = np.nan   # positional assignment; b[-1] is ambiguous with a label-based index
b

f    0.0
e    1.0
d    2.0
c    3.0
b    4.0
a    NaN
dtype: float64

In [27]:
np.where(pd.isnull(a), b, a)   # performs the array-oriented equivalent of an if-else expression

array([0. , 2.5, 2. , 3.5, 4.5, nan])

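Series has a ```combine_first``` method, which performs the equivalent of this operation along with label alignment. The caller's values take priority: the ```np.where``` expression above corresponds to ```a.combine_first(b)```, while the cell below calls the method on ```b```, so ```a``` only fills the gaps in ```b```: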

In [28]:
b.combine_first(a)

f    0.0
e    1.0
d    2.0
c    3.0
b    4.0
a    NaN
dtype: float64
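The order of the two objects matters, because the calling object's values are kept wherever they are present. A minimal sketch with two short illustrative Series (not the lecture's ```a``` and ```b```, although the names are reused):

```python
import numpy as np
import pandas as pd

a = pd.Series([np.nan, 2.5, np.nan], index=['x', 'y', 'z'])
b = pd.Series([0.0, 1.0, np.nan], index=['x', 'y', 'z'])

a_first = a.combine_first(b)           # keep a, fill its gaps from b
b_first = b.combine_first(a)           # keep b, fill its gaps from a
via_where = np.where(a.isna(), b, a)   # same rule as a_first, as a plain array

print(a_first)
print(b_first)
```

Where both objects are missing a value (label ```'z'``` here), the result stays NaN.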

In [33]:
b.combine_first(a).sort_index()

a    NaN
b    4.0
c    3.0
d    2.0
e    1.0
f    0.0
dtype: float64

In [29]:
b[:-2].combine_first(a[2:])   # the indexes differ, so the result uses their sorted union

a    NaN
b    4.5
c    3.0
d    2.0
e    1.0
f    0.0
dtype: float64

With DataFrames, ```combine_first``` does the same thing column by column, so you can think of it as “patching” missing data in the calling object with data from the object you pass:

In [None]:
df1 = pd.DataFrame({'a': [1., np.nan, 5., np.nan],
                    'b': [np.nan, 2., np.nan, 6.],
                    'c': range(2, 18, 4)})
df1

In [None]:
df2 = pd.DataFrame({'a': [5., 4., np.nan, 3., 7.],
                    'b': [np.nan, 3., 4., 6., 8.]})
df2

In [None]:
df1.combine_first(df2)
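The "patching" behavior also extends to rows and columns that exist in only one of the two frames. A minimal sketch, assuming two small illustrative frames named ```target``` and ```patch```:

```python
import numpy as np
import pandas as pd

target = pd.DataFrame({'a': [1.0, np.nan],
                       'b': [np.nan, 2.0]}, index=[0, 1])
patch = pd.DataFrame({'a': [9.0, 8.0, 7.0],
                      'c': [5.0, 6.0, 7.0]}, index=[0, 1, 2])

patched = target.combine_first(patch)

# 'a' keeps 1.0 from target and takes 8.0 and 7.0 from patch;
# 'b' has no counterpart in patch, so its gaps remain;
# 'c' and row 2 exist only in patch and are carried over.
print(patched)
```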

---

## END