<h1 style="font-size:3rem; color: sienna;">Data Wrangling: Join, Combine, and Reshape</h1>

In many applications, data may be spread across a number of files or databases or be arranged in a form that is not easy to analyze. This chapter focuses on tools to help combine, join, and rearrange data.

First, we will discuss the concept of *hierarchical indexing* in pandas, which is used extensively in some of these operations. We then dig into the particular data manipulations.

# Table of Contents

- 1.1  **[Hierarchical Indexing](#Hierarchical_Indexing)**
   
- 1.2  **[Reordering and Sorting Levels](#Reordering)**

- 1.3  **[Summary Statistics by Level](#Summary_Statistics_by_Level)**

- 1.4  **[Indexing with a DataFrame’s columns](#Indexing)**

<a id="Hierarchical_Indexing"></a>
## 1.1 Hierarchical Indexing

*Hierarchical indexing* is an important feature of pandas that enables you to have multiple (two or more) index levels on an axis. Somewhat abstractly, it provides a way for you to work with higher dimensional data in a lower dimensional form. Let’s start with a simple example; create a `Series` with a list of lists (or arrays) as the index:

In [1]:
import pandas as pd
import numpy as np

In [2]:
data = pd.Series(np.random.randn(9),
                  index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'], 
                  [1,2,3,1,3,1,2,2,3]])

In [3]:
data

a  1    2.309037
   2    0.229874
   3    1.051791
b  1    0.630351
   3    0.459164
c  1    0.266681
   2   -1.565938
d  2    0.665968
   3   -0.151836
dtype: float64

What you’re seeing is a prettified view of a Series with a MultiIndex as its index. The “gaps” in the index display mean “use the label directly above”:

In [4]:
data.index

MultiIndex([('a', 1),
            ('a', 2),
            ('a', 3),
            ('b', 1),
            ('b', 3),
            ('c', 1),
            ('c', 2),
            ('d', 2),
            ('d', 3)],
           )
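A MultiIndex can also be inspected level by level. The following is a minimal sketch, assuming a stand-in Series named `example` with the same index structure as `data` but placeholder values (both the name and the values are illustrative, not part of the notebook above):

```python
import numpy as np
import pandas as pd

# Same index structure as `data`; the values are arbitrary placeholders
example = pd.Series(np.arange(9.0),
                    index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
                           [1, 2, 3, 1, 3, 1, 2, 2, 3]])

idx = example.index
n_levels = idx.nlevels                          # number of index levels
outer_labels = list(idx.levels[0])              # unique labels of the outer level
inner_labels = list(idx.levels[1])              # unique labels of the inner level
outer_per_row = list(idx.get_level_values(0))   # outer label for every row

print(n_levels, outer_labels, inner_labels)
print(outer_per_row)
```

`levels` holds the unique labels of each level, while `get_level_values` expands one level back to a label per row, which is what the "gaps" in the display stand for.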

With a hierarchically indexed object, so-called *partial* indexing is possible, enabling you to concisely select subsets of the data:

In [5]:
data['b']

1    0.630351
3    0.459164
dtype: float64

In [6]:
data['b':'c']

b  1    0.630351
   3    0.459164
c  1    0.266681
   2   -1.565938
dtype: float64

In [7]:
data.loc[['b', 'd']]

b  1    0.630351
   3    0.459164
d  2    0.665968
   3   -0.151836
dtype: float64

In [11]:
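# Slice both levels at once: outer labels 'a' through 'c', inner labels 1 through 2
# (label slices include both endpoints)
data.loc['a':'c', 1:2]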

a  1    2.309037
   2    0.229874
b  1    0.630351
c  1    0.266681
   2   -1.565938
dtype: float64

Selection is even possible from an “inner” level:

In [13]:
data.loc[:, 2]

a    0.229874
c   -1.565938
d    0.665968
dtype: float64
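Inner-level selection can be written in more than one way. The sketch below assumes a stand-in Series named `example` with placeholder values (an illustration, not the random `data` above) and compares `loc` with the `xs` cross-section method and a boolean mask built from `get_level_values`:

```python
import numpy as np
import pandas as pd

# Stand-in for `data`: same index, placeholder values 0.0 .. 8.0
example = pd.Series(np.arange(9.0),
                    index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
                           [1, 2, 3, 1, 3, 1, 2, 2, 3]])

via_loc = example.loc[:, 2]            # inner label 2, inner level dropped
via_xs = example.xs(2, level=1)        # same selection as a cross-section
# A boolean mask keeps both index levels in the result
via_mask = example[example.index.get_level_values(1) == 2]

print(via_xs)
print(via_mask)
```

The first two forms drop the selected level from the result; the mask form keeps the full two-level index, which can be convenient when the result is combined with other hierarchically indexed data.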

Hierarchical indexing plays an important role in reshaping data and group-based operations like forming a pivot table. For example, you could rearrange the data into a DataFrame using its `unstack` method:

In [15]:
data.unstack()

          1         2         3
a  2.309037  0.229874  1.051791
b  0.630351       NaN  0.459164
c  0.266681 -1.565938       NaN
d       NaN  0.665968 -0.151836


The inverse operation of `unstack` is `stack`:

In [16]:
data.unstack().stack()

a  1    2.309037
   2    0.229874
   3    1.051791
b  1    0.630351
   3    0.459164
c  1    0.266681
   2   -1.565938
d  2    0.665968
   3   -0.151836
dtype: float64
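`unstack` moves the innermost level to the columns by default, but it accepts a `level` argument and a `fill_value` for label combinations that do not exist. A minimal sketch, assuming a stand-in Series named `example` with placeholder values in place of the random `data`:

```python
import numpy as np
import pandas as pd

# Stand-in for `data`: same index, placeholder values 0.0 .. 8.0
example = pd.Series(np.arange(9.0),
                    index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
                           [1, 2, 3, 1, 3, 1, 2, 2, 3]])

wide_inner = example.unstack()                # inner level becomes the columns
wide_outer = example.unstack(level=0)         # outer level becomes the columns
filled = example.unstack(fill_value=0.0)      # missing combinations filled with 0.0

print(wide_outer)
print(filled)
```

Three of the twelve possible (outer, inner) combinations are absent from the index, so `wide_inner` contains three missing values, while `filled` contains none.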

With a DataFrame, either axis can have a hierarchical index: 

In [17]:
frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                     columns=[['Ohio', 'Ohio', 'Colorado'],
                              ['Green', 'Red', 'Green']])

In [18]:
frame

     Ohio     Colorado
    Green Red    Green
a 1     0   1        2
  2     3   4        5
b 1     6   7        8
  2     9  10       11


The hierarchical levels can have names (as strings or any Python objects). If so, these will show up in the console output:

In [19]:
frame.index.names = ['key1', 'key2']

In [20]:
frame.columns.names = ['state', 'color']

In [21]:
frame

state      Ohio     Colorado
color     Green Red    Green
key1 key2
a    1        0   1        2
     2        3   4        5
b    1        6   7        8
     2        9  10       11
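Level names can also be supplied when the index is built rather than assigned afterwards. The sketch below uses `pd.MultiIndex.from_arrays` with a `names` argument; the variable names `row_index`, `col_index`, and `named` are illustrative choices, not part of the notebook:

```python
import numpy as np
import pandas as pd

# Build each MultiIndex once, with its level names attached
row_index = pd.MultiIndex.from_arrays(
    [['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
    names=['key1', 'key2'])
col_index = pd.MultiIndex.from_arrays(
    [['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']],
    names=['state', 'color'])

named = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=row_index, columns=col_index)
print(named)
```

A MultiIndex created this way is an ordinary object that can be reused for several DataFrames that share the same layout.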


With partial column indexing you can similarly select groups of columns:

In [33]:
frame['Ohio']

color      Green  Red
key1 key2
a    1         0    1
     2         3    4
b    1         6    7
     2         9   10


In [42]:
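# Full (row, column) label tuples select a single value;
# partial labels select blocks instead, e.g. frame.loc[['a'], 'Ohio']
frame.loc[('a', 2), ('Ohio', 'Green')]

np.int64(3)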

<a id="Reordering"></a>
## 1.2 Reordering and Sorting Levels

At times you will need to rearrange the order of the levels on an axis or sort the data by the values in one specific level. The `swaplevel` method takes two level numbers or names and returns a new object with the levels interchanged (the data is otherwise unaltered):

In [23]:
frame.swaplevel('key1', 'key2')

state      Ohio     Colorado
color     Green Red    Green
key2 key1
1    a        0   1        2
2    a        3   4        5
1    b        6   7        8
2    b        9  10       11
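`swaplevel` exchanges exactly two levels. When an index has three or more levels, `reorder_levels` takes the complete new order instead. A minimal sketch, assuming a hypothetical three-level DataFrame (`frame3`, `key3`, and the `value` column are invented for illustration):

```python
import pandas as pd

# Hypothetical three-level row index
index3 = pd.MultiIndex.from_tuples(
    [('a', 1, 'x'), ('a', 2, 'y'), ('b', 1, 'x'), ('b', 2, 'y')],
    names=['key1', 'key2', 'key3'])
frame3 = pd.DataFrame({'value': range(4)}, index=index3)

# Give the full new level order by name (level numbers also work)
reordered = frame3.reorder_levels(['key3', 'key1', 'key2'])
print(reordered)
```

As with `swaplevel`, only the index labels are rearranged; the rows stay in their original order until the result is sorted.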


`sort_index`, on the other hand, sorts the data using only the values in a single level. When swapping levels, it’s not uncommon to also use `sort_index` so that the result is lexicographically sorted by the indicated level:

In [24]:
# Use sort_index() to sort on the values in the specified index level(s)
frame.sort_index(level=1)

state      Ohio     Colorado
color     Green Red    Green
key1 key2
a    1        0   1        2
b    1        6   7        8
a    2        3   4        5
b    2        9  10       11


In [25]:
frame.sort_index(level=0) 

state      Ohio     Colorado
color     Green Red    Green
key1 key2
a    1        0   1        2
     2        3   4        5
b    1        6   7        8
     2        9  10       11


In [26]:
# another example
arrays = [np.array(['xx', 'xx', 'ff', 'ff',
                    'bb', 'bb', 'br', 'br']),
          np.array(['two', 'one', 'two', 'one',
                    'two', 'one', 'two', 'one'])]

In [27]:
s = pd.Series([2, 3, 4, 5, 6, 7, 8, 9], index=arrays)

In [28]:
s

xx  two    2
    one    3
ff  two    4
    one    5
bb  two    6
    one    7
br  two    8
    one    9
dtype: int64

In [29]:

s.sort_index(level=0)

bb  one    7
    two    6
br  one    9
    two    8
ff  one    5
    two    4
xx  one    3
    two    2
dtype: int64

In [30]:
frame.swaplevel(0, 1)

state      Ohio     Colorado
color     Green Red    Green
key2 key1
1    a        0   1        2
2    a        3   4        5
1    b        6   7        8
2    b        9  10       11


In [31]:
frame.swaplevel(0, 1).sort_index(level=0) 

state      Ohio     Colorado
color     Green Red    Green
key2 key1
1    a        0   1        2
     b        6   7        8
2    a        3   4        5
     b        9  10       11
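Levels can be referred to by name as well as by number, and the index itself reports whether it is sorted. The following sketch rebuilds the same `frame` so that it runs on its own; the variable names `swapped` and `by_key2` are illustrative:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                     columns=[['Ohio', 'Ohio', 'Colorado'],
                              ['Green', 'Red', 'Green']])
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']

swapped = frame.swaplevel('key1', 'key2')      # levels exchanged, rows not moved
by_key2 = swapped.sort_index(level='key2')     # sort on the new outer level

# is_monotonic_increasing tells whether the index labels are in sorted order
print(swapped.index.is_monotonic_increasing)
print(by_key2.index.is_monotonic_increasing)
```

Swapping alone leaves the index unsorted, which is why `swaplevel` and `sort_index` are so often chained.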


<a id="Summary_Statistics_by_Level"></a>
## 1.3 Summary Statistics by Level

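Many descriptive and summary statistics on DataFrame and Series once accepted a `level` option specifying the level to aggregate by on a particular axis. That argument was deprecated in pandas 1.3 and removed in pandas 2.0, so with the version used here you group by the level with `groupby(level=...)` and then aggregate. Consider the above DataFrame; we can aggregate by level on either the rows or columns like so: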

In [43]:
frame

state      Ohio     Colorado
color     Green Red    Green
key1 key2
a    1        0   1        2
     2        3   4        5
b    1        6   7        8
     2        9  10       11


In [54]:
!pip show pandas

Name: pandas
Version: 2.2.3
Summary: Powerful data structures for data analysis, time series, and statistics
Home-page: https://pandas.pydata.org
Author: 
Author-email: The Pandas Development Team <pandas-dev@python.org>
License: BSD 3-Clause License


In [55]:
import pandas as pd
import numpy as np

frame = pd.DataFrame(
    np.arange(12).reshape((4, 3)),
    index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
    columns=[['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']]
)

# Adding names to levels for clarity (optional)
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']

# Group by the outer level ('key1') and sum, collapsing the 'key2' level
result = frame.groupby(level=0).sum()
print(result)


state  Ohio     Colorado
color Green Red    Green
key1                    
a         3   5        7
b        15  17       19


In [56]:
# Older spelling: frame.sum(level='key1'); the level argument was removed in pandas 2.0
frame.groupby(level='key1').sum()


state  Ohio     Colorado
color Green Red    Green
key1
a         3   5        7
b        15  17       19


In [58]:
# groupby(..., axis=1) is deprecated as of pandas 2.1; transposing, grouping
# on the rows, and transposing back gives the same result
frame.T.groupby(level='color').sum().T


color      Green  Red
key1 key2
a    1         2    1
     2         8    4
b    1        14    7
     2        20   10
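The same `groupby(level=...)` pattern works with other aggregations and other levels. A minimal sketch that rebuilds `frame` so it runs on its own (the result names `mean_by_key2` and `sum_by_state` are illustrative):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                     columns=[['Ohio', 'Ohio', 'Colorado'],
                              ['Green', 'Red', 'Green']])
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']

# Row-wise: average the rows that share the same inner key
mean_by_key2 = frame.groupby(level='key2').mean()

# Column-wise: transpose, group on the 'state' level, transpose back
sum_by_state = frame.T.groupby(level='state').sum().T

print(mean_by_key2)
print(sum_by_state)
```

Grouping sorts the group keys by default, so the columns of `sum_by_state` come out as Colorado then Ohio rather than in their original order.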


<a id="Indexing"></a>
## 1.4 Indexing with a DataFrame’s columns


It’s not unusual to want to use one or more columns from a DataFrame as the row index; alternatively, you may wish to move the row index into the DataFrame’s columns. Here’s an example DataFrame:

In [59]:
frame = pd.DataFrame({'a': range(7), 'b': range(7, 0, -1),
                      'c': ['one', 'one', 'one', 'two', 'two',
                            'two', 'two'],
                      'd': [0, 1, 2, 0, 1, 2, 3]})


In [63]:
frame

   a  b    c  d
0  0  7  one  0
1  1  6  one  1
2  2  5  one  2
3  3  4  two  0
4  4  3  two  1
5  5  2  two  2
6  6  1  two  3


DataFrame’s `set_index` method will create a new DataFrame using one or more of its columns as the index:

In [64]:
frame2 = frame.set_index(['c', 'd'])

In [65]:
frame2

       a  b
c   d
one 0  0  7
    1  1  6
    2  2  5
two 0  3  4
    1  4  3
    2  5  2
    3  6  1


By default the columns are removed from the DataFrame, though you can leave them in:

In [66]:
frame.set_index(['c', 'd'], drop=False)

       a  b    c  d
c   d
one 0  0  7  one  0
    1  1  6  one  1
    2  2  5  one  2
two 0  3  4  two  0
    1  4  3  two  1
    2  5  2  two  2
    3  6  1  two  3


`reset_index`, on the other hand, does the opposite of `set_index`; the hierarchical index levels are moved into the columns:


In [68]:
frame2.reset_index()

     c  d  a  b
0  one  0  0  7
1  one  1  1  6
2  one  2  2  5
3  two  0  3  4
4  two  1  4  3
5  two  2  5  2
6  two  3  6  1
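`reset_index` can also move just one level back into the columns through its `level` argument. The sketch below rebuilds `frame` and `frame2` so it runs on its own; `only_d` and `round_trip` are illustrative names:

```python
import pandas as pd

frame = pd.DataFrame({'a': range(7), 'b': range(7, 0, -1),
                      'c': ['one', 'one', 'one', 'two', 'two',
                            'two', 'two'],
                      'd': [0, 1, 2, 0, 1, 2, 3]})
frame2 = frame.set_index(['c', 'd'])

# Move only the 'd' level back to the columns; 'c' stays as the index
only_d = frame2.reset_index(level='d')

# A full reset restores the original data, with the former index columns first
round_trip = frame2.reset_index()[['a', 'b', 'c', 'd']]

print(only_d)
```

Restored levels are inserted at the front of the columns, which is why the round trip reselects the columns in their original order.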


In [69]:
# Setting names for the multi-index of frame2
frame2.index.set_names(['index_1', 'index_2'], inplace=True)

# Setting a name for the columns of frame2
frame2.columns.name = "Profile"

frame2


Profile          a  b
index_1 index_2
one     0        0  7
        1        1  6
        2        2  5
two     0        3  4
        1        4  3
        2        5  2
        3        6  1
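The cell above renames the axes in place. As a hedged alternative sketch, `rename_axis` returns a renamed copy and leaves the original untouched; `frame2` is rebuilt here so the block runs on its own, and `renamed` is an illustrative name:

```python
import pandas as pd

frame = pd.DataFrame({'a': range(7), 'b': range(7, 0, -1),
                      'c': ['one', 'one', 'one', 'two', 'two',
                            'two', 'two'],
                      'd': [0, 1, 2, 0, 1, 2, 3]})
frame2 = frame.set_index(['c', 'd'])

# Returns a new DataFrame; frame2 keeps its original level names
renamed = frame2.rename_axis(index=['index_1', 'index_2'], columns='Profile')

print(renamed.index.names, renamed.columns.name)
print(frame2.index.names, frame2.columns.name)
```

Returning a copy makes the rename easy to chain with other methods, at the cost of holding two objects if the original is still referenced.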
