<h1><center>DATA WRANGLING</center></h1>

## Hierarchical Indexing

Hierarchical indexing is an important feature of pandas that enables you to have
multiple (two or more) index levels on an axis.

In [1]:
import pandas as pd
import numpy as np

In [6]:
data = pd.Series(np.random.randn(9),index=[['a','a','a','b','b','c','c','d','d'],[1,2,3,1,3,1,2,2,3]])

In [9]:
data

a  1   -1.500312
   2    0.984660
   3    0.304341
b  1    0.296940
   3   -1.425450
c  1   -1.177166
   2   -0.324394
d  2    0.974205
   3    0.617651
dtype: float64

In [10]:
data.index

MultiIndex([('a', 1),
            ('a', 2),
            ('a', 3),
            ('b', 1),
            ('b', 3),
            ('c', 1),
            ('c', 2),
            ('d', 2),
            ('d', 3)],
           )

In [11]:
data['b']

1    0.29694
3   -1.42545
dtype: float64

In [12]:
data['b':'c']

b  1    0.296940
   3   -1.425450
c  1   -1.177166
   2   -0.324394
dtype: float64

In [13]:
data['c':'d']

c  1   -1.177166
   2   -0.324394
d  2    0.974205
   3    0.617651
dtype: float64

In [15]:
data.loc[['b','d']]

b  1    0.296940
   3   -1.425450
d  2    0.974205
   3    0.617651
dtype: float64

In [16]:
data.loc[:,2]

a    0.984660
c   -0.324394
d    0.974205
dtype: float64
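
The selections above can also be written with an explicit tuple or with `xs`. The following is a minimal sketch on a small Series with fixed values; the name `s` and its values are illustrative and are not the random data shown above.

```python
import pandas as pd

# Illustrative two-level Series with fixed values (not the random data above)
s = pd.Series(
    [10, 20, 30, 40, 50],
    index=[['a', 'a', 'b', 'b', 'c'], [1, 2, 1, 3, 2]],
)

# A full (outer, inner) key returns a single value
value = s.loc[('b', 3)]

# Selecting on the inner level only: both spellings pick the same rows
inner_loc = s.loc[:, 2]
inner_xs = s.xs(2, level=1)

print(value)
print(inner_loc)
```

`xs` names the level explicitly, which can be easier to read than positional slicing once an index has more than two levels.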

In [17]:
# Rearrange the data into a DataFrame using its unstack method:

data.unstack()   

Unnamed: 0,1,2,3
a,-1.500312,0.98466,0.304341
b,0.29694,,-1.42545
c,-1.177166,-0.324394,
d,,0.974205,0.617651


In [19]:
# The inverse operation of unstack is stack:

data.unstack().stack()

a  1   -1.500312
   2    0.984660
   3    0.304341
b  1    0.296940
   3   -1.425450
c  1   -1.177166
   2   -0.324394
d  2    0.974205
   3    0.617651
dtype: float64

In [132]:
# With a DataFrame, either axis can have a hierarchical index:

frame = pd.DataFrame(np.arange(12).reshape((4, 3)), index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
        columns=[['Ohio', 'Ohio', 'Colorado'],['Green', 'Red', 'Green']])


In [133]:
frame

Unnamed: 0_level_0,Unnamed: 1_level_0,Ohio,Ohio,Colorado
Unnamed: 0_level_1,Unnamed: 1_level_1,Green,Red,Green
a,1,0,1,2
a,2,3,4,5
b,1,6,7,8
b,2,9,10,11


In [134]:
frame.index.names=['Key1','Key2']

In [135]:
frame

Unnamed: 0_level_0,Unnamed: 1_level_0,Ohio,Ohio,Colorado
Unnamed: 0_level_1,Unnamed: 1_level_1,Green,Red,Green
Key1,Key2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
a,1,0,1,2
a,2,3,4,5
b,1,6,7,8
b,2,9,10,11


In [136]:
frame.columns.names = ['state','color']

In [137]:
frame

Unnamed: 0_level_0,state,Ohio,Ohio,Colorado
Unnamed: 0_level_1,color,Green,Red,Green
Key1,Key2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
a,1,0,1,2
a,2,3,4,5
b,1,6,7,8
b,2,9,10,11


In [31]:
# With partial column indexing you can similarly select groups of columns:

frame['Ohio']

Unnamed: 0_level_0,color,Green,Red
Key1,Key2,Unnamed: 2_level_1,Unnamed: 3_level_1
a,1,0,1
a,2,3,4
b,1,6,7
b,2,9,10


In [37]:
# A MultiIndex can be created by itself and then reused; the columns in the preceding
# DataFrame with level names could be created like this:

from pandas import MultiIndex

MIndex = MultiIndex.from_arrays([['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']],names=['state', 'color'])

In [38]:
frame = pd.DataFrame(np.arange(12).reshape((4, 3)), index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
        columns = MIndex)

In [39]:
frame

Unnamed: 0_level_0,state,Ohio,Ohio,Colorado
Unnamed: 0_level_1,color,Green,Red,Green
a,1,0,1,2
a,2,3,4,5
b,1,6,7,8
b,2,9,10,11
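
Besides `from_arrays`, pandas offers other MultiIndex constructors. The sketch below builds the same column index from tuples and, for comparison, a full grid with `from_product`; the variable names are illustrative.

```python
import pandas as pd

# Same column index as above, built from (state, color) tuples
from_tuples = pd.MultiIndex.from_tuples(
    [('Ohio', 'Green'), ('Ohio', 'Red'), ('Colorado', 'Green')],
    names=['state', 'color'],
)

# Equivalent construction from parallel arrays, one per level
from_arrays = pd.MultiIndex.from_arrays(
    [['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']],
    names=['state', 'color'],
)

# from_product builds every combination of the inputs, so it has
# 2 * 2 = 4 entries, including ('Colorado', 'Red')
full_grid = pd.MultiIndex.from_product(
    [['Ohio', 'Colorado'], ['Green', 'Red']],
    names=['state', 'color'],
)

print(from_tuples.equals(from_arrays))
print(len(full_grid))
```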


## Reordering and Sorting Levels

At times you will need to rearrange the order of the levels on an axis or sort the data
by the values in one specific level. The swaplevel method takes two level numbers or
names and returns a new object with the levels interchanged (the data is otherwise
unaltered):

In [47]:
frame.swaplevel('Key1','Key2')

Unnamed: 0_level_0,Unnamed: 1_level_0,Ohio,Ohio,Colorado
Unnamed: 0_level_1,Unnamed: 1_level_1,Green,Red,Green
Key2,Key1,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
1,a,0,1,2
2,a,3,4,5
1,b,6,7,8
2,b,9,10,11


In [52]:
frame.sort_index(level=1)

Unnamed: 0_level_0,Unnamed: 1_level_0,Ohio,Ohio,Colorado
Unnamed: 0_level_1,Unnamed: 1_level_1,Green,Red,Green
Key1,Key2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
a,1,0,1,2
b,1,6,7,8
a,2,3,4,5
b,2,9,10,11


In [53]:
frame.swaplevel('Key1','Key2').sort_index(level=0)

Unnamed: 0_level_0,Unnamed: 1_level_0,Ohio,Ohio,Colorado
Unnamed: 0_level_1,Unnamed: 1_level_1,Green,Red,Green
Key2,Key1,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
1,a,0,1,2
1,b,6,7,8
2,a,3,4,5
2,b,9,10,11


In [139]:
frame.swaplevel(1, 0).sort_index(level=0)

Unnamed: 0_level_0,state,Ohio,Ohio,Colorado
Unnamed: 0_level_1,color,Green,Red,Green
Key2,Key1,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
1,a,0,1,2
1,b,6,7,8
2,a,3,4,5
2,b,9,10,11


## Summary Statistics by Level

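Many descriptive and summary statistics on DataFrame and Series have a level
option in which you can specify the level you want to aggregate by on a particular
axis. Note that this option was deprecated and then removed in pandas 2.0, where
groupby(level=...) followed by the aggregation gives the same result. Consider the
DataFrame above; we can aggregate by level on either the rows or columns like so: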


In [58]:
frame.sum(level='Key1')

Unnamed: 0_level_0,Ohio,Ohio,Colorado
Unnamed: 0_level_1,Green,Red,Green
Key1,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
a,3,5,7
b,15,17,19


In [62]:
frame

Unnamed: 0_level_0,Unnamed: 1_level_0,Ohio,Ohio,Colorado
Unnamed: 0_level_1,Unnamed: 1_level_1,Green,Red,Green
Key1,Key2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
a,1,0,1,2
a,2,3,4,5
b,1,6,7,8
b,2,9,10,11


In [64]:
frame.columns.names = ['state','color']

In [65]:
frame

Unnamed: 0_level_0,state,Ohio,Ohio,Colorado
Unnamed: 0_level_1,color,Green,Red,Green
Key1,Key2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
a,1,0,1,2
a,2,3,4,5
b,1,6,7,8
b,2,9,10,11


In [74]:
frame.swaplevel('Key1','Key2')

Unnamed: 0_level_0,state,Ohio,Ohio,Colorado
Unnamed: 0_level_1,color,Green,Red,Green
Key2,Key1,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
1,a,0,1,2
2,a,3,4,5
1,b,6,7,8
2,b,9,10,11


In [73]:
frame.sum(level="Key1",axis=0)

state,Ohio,Ohio,Colorado
color,Green,Red,Green
Key1,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
a,3,5,7
b,15,17,19


In [75]:
frame.sum(level="Key2",axis=0)

state,Ohio,Ohio,Colorado
color,Green,Red,Green
Key2,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
1,6,8,10
2,12,14,16
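
In pandas 2.0 and later the `level` argument to `sum` is no longer available, so the calls above need a `groupby`. The following is a sketch of the equivalent spelling, rebuilding the same frame so the block runs on its own; the transpose trick for column levels is one possible approach, not the only one.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(12).reshape((4, 3)),
    index=pd.MultiIndex.from_arrays(
        [['a', 'a', 'b', 'b'], [1, 2, 1, 2]], names=['Key1', 'Key2']
    ),
    columns=pd.MultiIndex.from_arrays(
        [['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']],
        names=['state', 'color'],
    ),
)

# Row-wise: sum over the rows that share a Key1 label
by_key1 = frame.groupby(level='Key1').sum()

# Column-wise: transpose, group on the 'color' level, transpose back
by_color = frame.T.groupby(level='color').sum().T

print(by_key1)
print(by_color)
```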


## Indexing with a DataFrame’s Columns

In [89]:
frame = pd.DataFrame({'a': range(7), 'b': range(7, 0, -1),'c': ['one', 'one', 'one', 'two', 'two','two', 'two'],
        'd': [0, 1, 2, 0, 1, 2, 3]})

In [77]:
frame

Unnamed: 0,a,b,c,d
0,0,7,one,0
1,1,6,one,1
2,2,5,one,2
3,3,4,two,0
4,4,3,two,1
5,5,2,two,2
6,6,1,two,3


In [90]:
frame2=frame.set_index(['c','d'])

In [91]:
frame2

Unnamed: 0_level_0,Unnamed: 1_level_0,a,b
c,d,Unnamed: 2_level_1,Unnamed: 3_level_1
one,0,0,7
one,1,1,6
one,2,2,5
two,0,3,4
two,1,4,3
two,2,5,2
two,3,6,1


By default the columns are removed from the DataFrame, though you can leave them
in:

In [80]:
frame2=frame.set_index(['c','d'],drop=False)

In [83]:
frame2

Unnamed: 0_level_0,Unnamed: 1_level_0,a,b,c,d
c,d,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
one,0,0,7,one,0
one,1,1,6,one,1
one,2,2,5,one,2
two,0,3,4,two,0
two,1,4,3,two,1
two,2,5,2,two,2
two,3,6,1,two,3


reset_index, on the other hand, does the opposite of set_index; the hierarchical
index levels are moved into the columns:

In [97]:
frame2.reset_index()

Unnamed: 0,c,d,a,b
0,one,0,0,7
1,one,1,1,6
2,one,2,2,5
3,two,0,3,4
4,two,1,4,3
5,two,2,5,2
6,two,3,6,1
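
The round trip between `set_index` and `reset_index` can be checked directly. This is a minimal sketch on a smaller, illustrative frame.

```python
import pandas as pd

frame = pd.DataFrame({
    'a': range(4),
    'c': ['one', 'one', 'two', 'two'],
    'd': [0, 1, 0, 1],
})

indexed = frame.set_index(['c', 'd'])  # columns c and d move into the index
restored = indexed.reset_index()       # index levels move back into columns

# The former index levels come first after reset_index, so restore the
# original column order before comparing
round_trip_ok = restored[frame.columns].equals(frame)

print(list(restored.columns))
print(round_trip_ok)
```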


## Combining and Merging Datasets

Data contained in pandas objects can be combined together in a number of ways:
- pandas.merge connects rows in DataFrames based on one or more keys. This will be familiar to users of SQL or other relational databases, as it implements database join operations.
- pandas.concat concatenates or “stacks” together objects along an axis.
- The combine_first instance method enables splicing together overlapping data to fill in missing values in one object with values from another.

The how argument of merge selects the join type:

- 'inner' Use only the key combinations observed in both tables
- 'left' Use all key combinations found in the left table
- 'right' Use all key combinations found in the right table
- 'outer' Use all key combinations observed in both tables together

### merge function arguments

- left - DataFrame to be merged on the left side.
- right - DataFrame to be merged on the right side.
- how - One of 'inner', 'outer', 'left', or 'right'; defaults to 'inner'.
- on - Column names to join on. Must be found in both DataFrame objects. If not specified and no other join keys
given, will use the intersection of the column names in left and right as the join keys.
- left_on - Columns in left DataFrame to use as join keys.
- right_on - Analogous to left_on for right DataFrame.
- left_index - Use row index in left as its join key (or keys, if a MultiIndex).
- right_index - Analogous to left_index.
- sort - Sort merged data lexicographically by join keys; False by default in pandas.merge (leaving it disabled can
give better performance on large datasets).
- suffixes - Tuple of string values to append to column names in case of overlap; defaults to ('_x', '_y') (e.g., if
'data' in both DataFrame objects, would appear as 'data_x' and 'data_y' in result).
- copy - If False, avoid copying data into resulting data structure in some exceptional cases; by default always
copies.
- indicator - Adds a special column _merge that indicates the source of each row; values will be 'left_only',
'right_only', or 'both' based on the origin of the joined data in each row.


In [99]:
df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],'data1': range(7)})

In [100]:
df1

Unnamed: 0,key,data1
0,b,0
1,b,1
2,a,2
3,c,3
4,a,4
5,a,5
6,b,6


In [102]:
df2 = pd.DataFrame({'key': ['a', 'b', 'd'],'data2': range(3)})

In [103]:
df2

Unnamed: 0,key,data2
0,a,0
1,b,1
2,d,2


In [104]:
pd.merge(df1,df2)

Unnamed: 0,key,data1,data2
0,b,0,1
1,b,1,1
2,b,6,1
3,a,2,0
4,a,4,0
5,a,5,0


Note that I didn’t specify which column to join on. If that information is not
specified, merge uses the overlapping column names as the keys. It’s a good practice to
specify explicitly, though:

In [105]:
pd.merge(df1, df2, on='key')

Unnamed: 0,key,data1,data2
0,b,0,1
1,b,1,1
2,b,6,1
3,a,2,0
4,a,4,0
5,a,5,0


If the column names are different in each object, you can specify them separately:

In [106]:
df3 = pd.DataFrame({'lkey': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],'data1': range(7)})


In [107]:
df4 = pd.DataFrame({'rkey': ['a', 'b', 'd'],'data2': range(3)})


In [109]:
df3

Unnamed: 0,lkey,data1
0,b,0
1,b,1
2,a,2
3,c,3
4,a,4
5,a,5
6,b,6


In [110]:
df4

Unnamed: 0,rkey,data2
0,a,0
1,b,1
2,d,2


In [111]:
pd.merge(df3, df4, left_on='lkey', right_on='rkey')

Unnamed: 0,lkey,data1,rkey,data2
0,b,0,b,1
1,b,1,b,1
2,b,6,b,1
3,a,2,a,0
4,a,4,a,0
5,a,5,a,0


In [112]:
pd.merge(df1, df2, how='outer')

Unnamed: 0,key,data1,data2
0,b,0.0,1.0
1,b,1.0,1.0
2,b,6.0,1.0
3,a,2.0,0.0
4,a,4.0,0.0
5,a,5.0,0.0
6,c,3.0,
7,d,,2.0
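
The `indicator` option listed among the merge arguments shows which side each row of an outer join came from. Here is a small sketch; the frames `left_df` and `right_df` and their values are illustrative.

```python
import pandas as pd

left_df = pd.DataFrame({'key': ['a', 'b', 'c'], 'data1': [1, 2, 3]})
right_df = pd.DataFrame({'key': ['a', 'b', 'd'], 'data2': [10, 20, 30]})

merged = pd.merge(left_df, right_df, on='key', how='outer', indicator=True)

# Map each key to the side(s) it was found on
origin = dict(zip(merged['key'], merged['_merge'].astype(str)))

print(origin)
```

Keys present in only one table are marked `left_only` or `right_only`, which makes unmatched rows easy to filter after the merge.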


Many-to-many merges have well-defined, though not necessarily intuitive, behavior.
Here’s an example:

In [114]:
df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],'data1': range(6)})

In [115]:
df2 = pd.DataFrame({'key': ['a', 'b', 'a', 'b', 'd'],'data2': range(5)})


In [116]:
df1

Unnamed: 0,key,data1
0,b,0
1,b,1
2,a,2
3,c,3
4,a,4
5,b,5


In [117]:
df2

Unnamed: 0,key,data2
0,a,0
1,b,1
2,a,2
3,b,3
4,d,4


In [118]:
pd.merge(df1,df2, on='key',how='left')

Unnamed: 0,key,data1,data2
0,b,0,1.0
1,b,0,3.0
2,b,1,1.0
3,b,1,3.0
4,a,2,0.0
5,a,2,2.0
6,c,3,
7,a,4,0.0
8,a,4,2.0
9,b,5,1.0
10,b,5,3.0


Many-to-many joins form the Cartesian product of the rows. Since there were three
'b' rows in the left DataFrame and two in the right one, there are six 'b' rows in the
result. The join method only affects the distinct key values appearing in the result:


In [119]:
pd.merge(df1, df2, how='inner')

Unnamed: 0,key,data1,data2
0,b,0,1
1,b,0,3
2,b,1,1
3,b,1,3
4,b,5,1
5,b,5,3
6,a,2,0
7,a,2,2
8,a,4,0
9,a,4,2
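
The row counts in a many-to-many join follow from multiplying the per-key counts on each side. The sketch below reproduces that arithmetic for the two frames above; the helper names `left_counts`, `right_counts`, and `expected` are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'], 'data1': range(6)})
df2 = pd.DataFrame({'key': ['a', 'b', 'a', 'b', 'd'], 'data2': range(5)})

left_counts = df1['key'].value_counts()
right_counts = df2['key'].value_counts()

# For an inner join, each shared key contributes
# (rows on the left) * (rows on the right) result rows;
# keys missing from either side become NaN here and drop out
expected = (left_counts * right_counts).dropna().astype(int)

actual = pd.merge(df1, df2, on='key', how='inner')['key'].value_counts()

print(expected.to_dict())
print(actual.to_dict())
```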


To merge with multiple keys, pass a list of column names:

In [120]:
left = pd.DataFrame({'key1': ['foo', 'foo', 'bar'],'key2': ['one', 'two', 'one'],'lval': [1, 2, 3]})

In [121]:
left

Unnamed: 0,key1,key2,lval
0,foo,one,1
1,foo,two,2
2,bar,one,3


In [122]:
right = pd.DataFrame({'key1': ['foo', 'foo', 'bar', 'bar'],'key2': ['one', 'one', 'one', 'two'],'rval': [4, 5, 6, 7]})


In [123]:
right

Unnamed: 0,key1,key2,rval
0,foo,one,4
1,foo,one,5
2,bar,one,6
3,bar,two,7


In [128]:
pd.merge(left, right, on=['key1', 'key2'], how='left')

Unnamed: 0,key1,key2,lval,rval
0,foo,one,1,4.0
1,foo,one,1,5.0
2,foo,two,2,
3,bar,one,3,6.0


A last issue to consider in merge operations is the treatment of overlapping column
names. While you can address the overlap manually (see the earlier section on
renaming axis labels), merge has a suffixes option for specifying strings to append
to overlapping names in the left and right DataFrame objects:


In [129]:
pd.merge(left, right, on='key1')

Unnamed: 0,key1,key2_x,lval,key2_y,rval
0,foo,one,1,one,4
1,foo,one,1,one,5
2,foo,two,2,one,4
3,foo,two,2,one,5
4,bar,one,3,one,6
5,bar,one,3,two,7


In [130]:
pd.merge(left, right, on='key1', suffixes=('_left', '_right'))

Unnamed: 0,key1,key2_left,lval,key2_right,rval
0,foo,one,1,one,4
1,foo,one,1,one,5
2,foo,two,2,one,4
3,foo,two,2,one,5
4,bar,one,3,one,6
5,bar,one,3,two,7


### Merging on Index

In some cases, the merge key(s) in a DataFrame will be found in its index. In this
case, you can pass left_index=True or right_index=True (or both) to indicate that
the index should be used as the merge key:

In [21]:
import pandas as pd
import numpy as np
left1 = pd.DataFrame({'Key':['a','b','c','d','e'],'value':range(5)})

In [33]:
left1.set_index('Key',inplace=True)

In [34]:
left1

Unnamed: 0_level_0,value
Key,Unnamed: 1_level_1
a,0
b,1
c,2
d,3
e,4


In [35]:
right1 = pd.DataFrame({'key':['a','b'],'group_val': [3.5, 7]})


In [36]:
right1

Unnamed: 0,key,group_val
0,a,3.5
1,b,7.0


In [47]:
pd.merge(left1, right1,right_on='key',left_index = True,how='inner')


Unnamed: 0,value,key,group_val
0,0,a,3.5
1,1,b,7.0


Since the default merge method is to intersect the join keys, you can instead form the
union of them with an outer join:

In [49]:
pd.merge(left1, right1, right_on='key', left_index=True, how='outer')

Unnamed: 0,value,key,group_val
0.0,0,a,3.5
1.0,1,b,7.0
,2,c,
,3,d,
,4,e,


In [5]:
lefth = pd.DataFrame({'key1': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
                      'key2': [2000, 2001, 2002, 2001, 2002],
                      'data': np.arange(5.)})


In [6]:
lefth

Unnamed: 0,key1,key2,data
0,Ohio,2000,0.0
1,Ohio,2001,1.0
2,Ohio,2002,2.0
3,Nevada,2001,3.0
4,Nevada,2002,4.0


In [7]:
righth = pd.DataFrame(np.arange(12).reshape((6, 2)),
                      index=[['Nevada', 'Nevada', 'Ohio', 'Ohio', 'Ohio', 'Ohio'],
                             [2001, 2000, 2000, 2000, 2001, 2002]],
                      columns=['event1', 'event2'])


In [8]:
righth

Unnamed: 0,Unnamed: 1,event1,event2
Nevada,2001,0,1
Nevada,2000,2,3
Ohio,2000,4,5
Ohio,2000,6,7
Ohio,2001,8,9
Ohio,2002,10,11


In [9]:
pd.merge(lefth, righth, left_on=['key1', 'key2'], right_index=True)

Unnamed: 0,key1,key2,data,event1,event2
0,Ohio,2000,0.0,4,5
0,Ohio,2000,0.0,6,7
1,Ohio,2001,1.0,8,9
2,Ohio,2002,2.0,10,11
3,Nevada,2001,3.0,0,1


In [50]:
pd.merge(lefth, righth, left_on=['key1', 'key2'], right_index=True, how='outer')


Unnamed: 0,key1,key2,data,event1,event2
0,Ohio,2000,0.0,4.0,5.0
0,Ohio,2000,0.0,6.0,7.0
1,Ohio,2001,1.0,8.0,9.0
2,Ohio,2002,2.0,10.0,11.0
3,Nevada,2001,3.0,0.0,1.0
4,Nevada,2002,4.0,,
4,Nevada,2000,,2.0,3.0


Using the indexes of both sides of the merge is also possible:


In [12]:
left2 = pd.DataFrame([[1., 2.], [3., 4.], [5., 6.]],index=['a', 'c', 'e'],columns=['Ohio', 'Nevada'])


In [14]:
right2 = pd.DataFrame([[7., 8.], [9., 10.], [11., 12.], [13, 14]],index=['b', 'c', 'd', 'e'],columns=['Missouri', 'Alabama'])

In [15]:
left2

Unnamed: 0,Ohio,Nevada
a,1.0,2.0
c,3.0,4.0
e,5.0,6.0


In [16]:
right2

Unnamed: 0,Missouri,Alabama
b,7.0,8.0
c,9.0,10.0
d,11.0,12.0
e,13.0,14.0


In [54]:
pd.merge(left2, right2, how='inner',left_index=True,right_index=True)

# left_index=True and right_index=True merge on the row indexes of both tables

Unnamed: 0,Ohio,Nevada,Missouri,Alabama
c,3.0,4.0,9.0,10.0
e,5.0,6.0,13.0,14.0


In [55]:
pd.merge(left2, right2, how='outer',left_index=True,right_index=True)

Unnamed: 0,Ohio,Nevada,Missouri,Alabama
a,1.0,2.0,,
b,,,7.0,8.0
c,3.0,4.0,9.0,10.0
d,,,11.0,12.0
e,5.0,6.0,13.0,14.0


DataFrame has a convenient join instance method for merging by index. It can also be
used to combine many DataFrame objects having the same or similar indexes but
non-overlapping columns. In the prior example, we could have written:

In [56]:
left2.join(right2, how='outer')

Unnamed: 0,Ohio,Nevada,Missouri,Alabama
a,1.0,2.0,,
b,,,7.0,8.0
c,3.0,4.0,9.0,10.0
d,,,11.0,12.0
e,5.0,6.0,13.0,14.0


In part for legacy reasons (i.e., much earlier versions of pandas), DataFrame’s join
method performs a left join on the join keys, exactly preserving the left frame’s row
index. It also supports joining the index of the passed DataFrame on one of the
columns of the calling DataFrame:

In [68]:
left1

Unnamed: 0_level_0,value
Key,Unnamed: 1_level_1
a,0
b,1
c,2
d,3
e,4


In [60]:
right1.columns=['Key','Group_val']

In [61]:
right1

Unnamed: 0,Key,Group_val
0,a,3.5
1,b,7.0


In [69]:
right1.join(left1,on='Key')

Unnamed: 0,Key,Group_val,value
0,a,3.5,0
1,b,7.0,1


In [70]:
another = pd.DataFrame([[7., 8.], [9., 10.], [11., 12.], [16., 17.]],index=['a', 'c', 'e', 'f'],columns=['New York', 'Oregon'])


In [71]:
another

Unnamed: 0,New York,Oregon
a,7.0,8.0
c,9.0,10.0
e,11.0,12.0
f,16.0,17.0


In [73]:
left2

Unnamed: 0,Ohio,Nevada
a,1.0,2.0
c,3.0,4.0
e,5.0,6.0


In [74]:
right2

Unnamed: 0,Missouri,Alabama
b,7.0,8.0
c,9.0,10.0
d,11.0,12.0
e,13.0,14.0


In [72]:
left2.join([right2, another])

Unnamed: 0,Ohio,Nevada,Missouri,Alabama,New York,Oregon
a,1.0,2.0,,,7.0,8.0
c,3.0,4.0,9.0,10.0,9.0,10.0
e,5.0,6.0,13.0,14.0,11.0,12.0


In [75]:
left2.join([right2, another], how='outer')

Unnamed: 0,Ohio,Nevada,Missouri,Alabama,New York,Oregon
a,1.0,2.0,,,7.0,8.0
c,3.0,4.0,9.0,10.0,9.0,10.0
e,5.0,6.0,13.0,14.0,11.0,12.0
b,,,7.0,8.0,,
d,,,11.0,12.0,,
f,,,,,16.0,17.0


### Concatenating Along an Axis

Another kind of data combination operation is referred to interchangeably as
concatenation, binding, or stacking. NumPy’s concatenate function can do this with
NumPy arrays:


In [76]:
arr = np.arange(12).reshape((3, 4))
arr

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

In [79]:
np.concatenate([arr, arr], axis=1)

array([[ 0,  1,  2,  3,  0,  1,  2,  3],
       [ 4,  5,  6,  7,  4,  5,  6,  7],
       [ 8,  9, 10, 11,  8,  9, 10, 11]])

In [80]:
np.concatenate([arr, arr], axis=0)

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

### concat function arguments

The concat function in pandas provides a consistent way to combine objects along an
axis. Its arguments are listed below; the examples that follow start with three
Series with no index overlap.

- **objs** - List or dict of pandas objects to be concatenated; this is the only required argument
- **axis** - Axis to concatenate along; defaults to 0 (along rows)
- **join** - Either 'inner' or 'outer' ('outer' by default); whether to intersect (inner) or union (outer) the indexes along the other axes
- **join_axes** - Specific indexes to use for the other n–1 axes instead of performing union/intersection logic (removed in pandas 1.0; reindex the result instead)
- **keys** - Values to associate with objects being concatenated, forming a hierarchical index along the concatenation axis; can be a list or array of arbitrary values, an array of tuples, or a list of arrays (if multiple-level arrays are passed in levels)
- **levels** - Specific indexes to use as hierarchical index level or levels if keys passed
- **names** - Names for created hierarchical levels if keys and/or levels passed
- **verify_integrity** - Check new axis in concatenated object for duplicates and raise an exception if any are found; by default (False)
allows duplicates
- **ignore_index** - Do not preserve indexes along concatenation axis, instead producing a new range(total_length) index


In [81]:
s1 = pd.Series([0, 1], index=['a', 'b'])

In [82]:
s2 = pd.Series([2, 3, 4], index=['c', 'd', 'e'])

In [84]:
s3 = pd.Series([5, 6], index=['f', 'g'])

In [85]:
pd.concat([s1, s2, s3])

a    0
b    1
c    2
d    3
e    4
f    5
g    6
dtype: int64

By default concat works along axis=0, producing another Series. If you pass axis=1,
the result will instead be a DataFrame (axis=1 is the columns):

In [88]:
pd.concat([s1, s2, s3], axis=1,join="outer")

Unnamed: 0,0,1,2
a,0.0,,
b,1.0,,
c,,2.0,
d,,3.0,
e,,4.0,
f,,,5.0
g,,,6.0


In [89]:
s4 = pd.concat([s1, s3])

In [90]:
s4

a    0
b    1
f    5
g    6
dtype: int64

In [94]:
pd.concat([s1, s4], axis=1)

Unnamed: 0,0,1
a,0.0,0
b,1.0,1
f,,5
g,,6


In [95]:
pd.concat([s1, s4], axis=1, join='inner')

Unnamed: 0,0,1
a,0,0
b,1,1


To identify the concatenated pieces in the result, pass keys to create a hierarchical index on the concatenation axis:

In [108]:
result= pd.concat([s1,s1,s2],axis=0,keys=['one','two','three'])

In [109]:
result

one    a    0
       b    1
two    a    0
       b    1
three  c    2
       d    3
       e    4
dtype: int64

In [110]:
result.unstack()

Unnamed: 0,a,b,c,d,e
one,0.0,1.0,,,
two,0.0,1.0,,,
three,,,2.0,3.0,4.0


In the case of combining Series along axis=1, the keys become the DataFrame
column headers:

In [112]:
pd.concat([s1, s2, s3], axis=1, keys=['one', 'two', 'three'])

Unnamed: 0,one,two,three
a,0.0,,
b,1.0,,
c,,2.0,
d,,3.0,
e,,4.0,
f,,,5.0
g,,,6.0


The same logic extends to DataFrame objects:

In [114]:
df1 = pd.DataFrame(np.arange(6).reshape(3, 2), index=['a', 'b', 'c'],
                   columns=['one', 'two'])

In [115]:
df1

Unnamed: 0,one,two
a,0,1
b,2,3
c,4,5


In [117]:
df2 = pd.DataFrame(5 + np.arange(4).reshape(2, 2), index=['a', 'c'],
                   columns=['three', 'four'])


In [118]:
df2

Unnamed: 0,three,four
a,5,6
c,7,8


In [119]:
pd.concat([df1, df2], axis=0, keys=['level1', 'level2'])

Unnamed: 0,Unnamed: 1,one,two,three,four
level1,a,0.0,1.0,,
level1,b,2.0,3.0,,
level1,c,4.0,5.0,,
level2,a,,,5.0,6.0
level2,c,,,7.0,8.0


In [120]:
pd.concat([df1, df2], axis=1, keys=['level1', 'level2'])

Unnamed: 0_level_0,level1,level1,level2,level2
Unnamed: 0_level_1,one,two,three,four
a,0,1,5.0,6.0
b,2,3,,
c,4,5,7.0,8.0


If you pass a dict of objects instead of a list, the dict’s keys will be used for the keys
option:

In [121]:
pd.concat({'level1': df1, 'level2': df2}, axis=1)

Unnamed: 0_level_0,level1,level1,level2,level2
Unnamed: 0_level_1,one,two,three,four
a,0,1,5.0,6.0
b,2,3,,
c,4,5,7.0,8.0


In [122]:
pd.concat([df1, df2], axis=1, keys=['level1', 'level2'],
          names=['upper', 'lower'])

upper,level1,level1,level2,level2
lower,one,two,three,four
a,0,1,5.0,6.0
b,2,3,,
c,4,5,7.0,8.0


In [123]:
df1 = pd.DataFrame(np.random.randn(3, 4), columns=['a', 'b', 'c', 'd'])

In [124]:
df1

Unnamed: 0,a,b,c,d
0,-1.355375,-0.04208,0.413574,-1.313131
1,-0.277643,0.503902,0.441448,-0.174268
2,-0.673647,-0.336619,0.703966,0.630479


In [125]:
df2 = pd.DataFrame(np.random.randn(2, 3), columns=['b', 'd', 'a'])

In [126]:
df2

Unnamed: 0,b,d,a
0,0.637043,-0.182044,1.142839
1,-0.856315,-0.373455,2.074


In [129]:
pd.concat([df1, df2], ignore_index=True)

Unnamed: 0,a,b,c,d
0,-1.355375,-0.04208,0.413574,-1.313131
1,-0.277643,0.503902,0.441448,-0.174268
2,-0.673647,-0.336619,0.703966,0.630479
3,1.142839,0.637043,,-0.182044
4,2.074,-0.856315,,-0.373455
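
The `verify_integrity` and `ignore_index` arguments are two ways of dealing with overlapping labels on the concatenation axis. The following is a minimal sketch with two small, illustrative Series that share the label `'b'`.

```python
import pandas as pd

s1 = pd.Series([0, 1], index=['a', 'b'])
s2 = pd.Series([2, 3], index=['b', 'c'])  # 'b' overlaps with s1

# Default behavior: duplicate labels are allowed in the result
stacked = pd.concat([s1, s2])
has_duplicates = stacked.index.has_duplicates

# verify_integrity=True raises ValueError when labels would be duplicated
try:
    pd.concat([s1, s2], verify_integrity=True)
    raised = False
except ValueError:
    raised = True

# ignore_index=True sidesteps the issue by discarding the labels
renumbered = pd.concat([s1, s2], ignore_index=True)

print(has_duplicates, raised, list(renumbered.index))
```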


### Combining Data with Overlap

There is another data combination situation that can’t be expressed as either a merge
or concatenation operation. You may have two datasets whose indexes overlap in full
or part. As a motivating example, consider NumPy’s where function, which performs
the array-oriented equivalent of an if-else expression:


In [130]:
a = pd.Series([np.nan, 2.5, np.nan, 3.5, 4.5, np.nan],index=['f', 'e', 'd', 'c', 'b', 'a'])

In [131]:
a

f    NaN
e    2.5
d    NaN
c    3.5
b    4.5
a    NaN
dtype: float64

In [132]:
b = pd.Series(np.arange(len(a), dtype=np.float64),index=['f', 'e', 'd', 'c', 'b', 'a'])

In [134]:
b.iloc[-1] = np.nan  # positional assignment; b[-1] relies on deprecated integer fallback for a string index

In [135]:
b

f    0.0
e    1.0
d    2.0
c    3.0
b    4.0
a    NaN
dtype: float64

In [147]:
np.where(pd.isnull(a),b,a)

array([0. , 2.5, 2. , 3.5, 4.5, nan])

In [149]:
pd.isnull(a)

f     True
e    False
d     True
c    False
b    False
a     True
dtype: bool

Series has a combine_first method, which performs the equivalent of this operation
along with pandas’s usual data alignment logic:

In [153]:
b[0:3].combine_first(a[0:4])

c    3.5
d    2.0
e    1.0
f    0.0
dtype: float64
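
When both Series share exactly the same index, `combine_first` should agree with the `np.where` expression shown earlier. This sketch checks that on the full Series; `patched`, `via_where`, and `matches` are illustrative names.

```python
import numpy as np
import pandas as pd

idx = ['f', 'e', 'd', 'c', 'b', 'a']
a = pd.Series([np.nan, 2.5, np.nan, 3.5, 4.5, np.nan], index=idx)
b = pd.Series([0., 1., 2., 3., 4., np.nan], index=idx)

# Take values from a where present, otherwise fall back to b
patched = a.combine_first(b)

# The NumPy version works positionally and returns a plain array
via_where = np.where(pd.isnull(a), b, a)

# Reindex to the original label order before the positional comparison
matches = np.allclose(patched.reindex(idx).to_numpy(), via_where, equal_nan=True)

print(patched)
print(matches)
```

Unlike `np.where`, `combine_first` aligns on labels first, so it also works when the two indexes only partly overlap, as in the slice example above.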

With DataFrames, combine_first does the same thing column by column, so you
can think of it as “patching” missing data in the calling object with data from the
object you pass:

In [155]:
df1 = pd.DataFrame({'a': [1., np.nan, 5., np.nan],'b': [np.nan, 2., np.nan, 6.],'c': range(2, 18, 4)})

In [156]:
df2 = pd.DataFrame({'a': [5., 4., np.nan, 3., 7.],'b': [np.nan, 3., 4., 6., 8.]})


In [157]:
df1

Unnamed: 0,a,b,c
0,1.0,,2
1,,2.0,6
2,5.0,,10
3,,6.0,14


In [158]:
df2

Unnamed: 0,a,b
0,5.0,
1,4.0,3.0
2,,4.0
3,3.0,6.0
4,7.0,8.0


In [159]:
df1.combine_first(df2)

Unnamed: 0,a,b,c
0,1.0,,2.0
1,4.0,2.0,6.0
2,5.0,4.0,10.0
3,3.0,6.0,14.0
4,7.0,8.0,


## Reshaping and Pivoting

There are a number of basic operations for rearranging tabular data. These are
alternatingly referred to as reshape or pivot operations.

### Reshaping with Hierarchical Indexing

Hierarchical indexing provides a consistent way to rearrange data in a DataFrame.

There are two primary actions:

**stack** - This “rotates” or pivots from the columns in the data to the rows

**unstack** - This pivots from the rows into the columns

In [160]:
data = pd.DataFrame(np.arange(6).reshape((2, 3)),
                    index=pd.Index(['Ohio', 'Colorado'], name='state'),
                    columns=pd.Index(['one', 'two', 'three'],
                                     name='number'))


In [161]:
data

number,one,two,three
state,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Ohio,0,1,2
Colorado,3,4,5


In [162]:
result = data.stack()

In [163]:
result

state     number
Ohio      one       0
          two       1
          three     2
Colorado  one       3
          two       4
          three     5
dtype: int32

In [164]:
result.unstack()

number,one,two,three
state,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Ohio,0,1,2
Colorado,3,4,5


By default the innermost level is unstacked (same with stack). You can unstack a
different level by passing a level number or name:

In [165]:
result.unstack(0)

state,Ohio,Colorado
number,Unnamed: 1_level_1,Unnamed: 2_level_1
one,0,3
two,1,4
three,2,5


In [167]:
result.unstack('number')

number,one,two,three
state,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Ohio,0,1,2
Colorado,3,4,5


In [168]:
result.unstack('state')

state,Ohio,Colorado
number,Unnamed: 1_level_1,Unnamed: 2_level_1
one,0,3
two,1,4
three,2,5
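
Level numbers and level names are interchangeable in `unstack`, as the outputs above suggest. A short sketch that checks this on the same small frame, rebuilt here so the block is self-contained:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    np.arange(6).reshape((2, 3)),
    index=pd.Index(['Ohio', 'Colorado'], name='state'),
    columns=pd.Index(['one', 'two', 'three'], name='number'),
)

stacked = data.stack()  # columns become the inner index level

# Level 0 is 'state' and level 1 is 'number', so these pairs should agree
by_position = stacked.unstack(0)
by_name = stacked.unstack('state')

print(by_position.equals(by_name))
print(by_name.shape)
```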


In [169]:
s1 = pd.Series([0, 1, 2, 3], index=['a', 'b', 'c', 'd'])

In [170]:
s2 = pd.Series([4, 5, 6], index=['c', 'd', 'e'])

In [171]:
data2 = pd.concat([s1, s2], keys=['one', 'two'])

In [172]:
data2

one  a    0
     b    1
     c    2
     d    3
two  c    4
     d    5
     e    6
dtype: int64

In [178]:
data2.unstack()

Unnamed: 0,a,b,c,d,e
one,0.0,1.0,2.0,3.0,
two,,,4.0,5.0,6.0


In [179]:
data2.unstack(0)

Unnamed: 0,one,two
a,0.0,
b,1.0,
c,2.0,4.0
d,3.0,5.0
e,,6.0


In [180]:
data2.unstack().stack()

one  a    0.0
     b    1.0
     c    2.0
     d    3.0
two  c    4.0
     d    5.0
     e    6.0
dtype: float64

In [181]:
data2.unstack().stack(dropna=False)

one  a    0.0
     b    1.0
     c    2.0
     d    3.0
     e    NaN
two  a    NaN
     b    NaN
     c    4.0
     d    5.0
     e    6.0
dtype: float64

When you unstack in a DataFrame, the level unstacked becomes the lowest level in
the result:

In [182]:
df = pd.DataFrame({'left': result, 'right': result + 5},
                  columns=pd.Index(['left', 'right'], name='side'))

In [183]:
df

Unnamed: 0_level_0,side,left,right
state,number,Unnamed: 2_level_1,Unnamed: 3_level_1
Ohio,one,0,5
Ohio,two,1,6
Ohio,three,2,7
Colorado,one,3,8
Colorado,two,4,9
Colorado,three,5,10


In [184]:
df.unstack('state')

side,left,left,right,right
state,Ohio,Colorado,Ohio,Colorado
number,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
one,0,3,5,8
two,1,4,6,9
three,2,5,7,10


In [185]:
df.unstack('state').stack('side')

Unnamed: 0_level_0,state,Colorado,Ohio
number,side,Unnamed: 2_level_1,Unnamed: 3_level_1
one,left,3,0
one,right,8,5
two,left,4,1
two,right,9,6
three,left,5,2
three,right,10,7


### Pivoting “Long” to “Wide” Format

A common way to store multiple time series in databases and CSV is in so-called long
or stacked format. Let’s load some example data and do a small amount of time series
wrangling and other data cleaning:

In [188]:
data = pd.read_csv('./Dataset/macrodata.csv')

In [189]:
data.head()

Unnamed: 0,year,quarter,realgdp,realcons,realinv,realgovt,realdpi,cpi,m1,tbilrate,unemp,pop,infl,realint
0,1959.0,1.0,2710.349,1707.4,286.898,470.045,1886.9,28.98,139.7,2.82,5.8,177.146,0.0,0.0
1,1959.0,2.0,2778.801,1733.7,310.859,481.301,1919.7,29.15,141.7,3.08,5.1,177.83,2.34,0.74
2,1959.0,3.0,2775.488,1751.8,289.226,491.26,1916.4,29.35,140.5,3.82,5.3,178.657,2.74,1.09
3,1959.0,4.0,2785.204,1753.7,299.356,484.052,1931.3,29.37,140.0,4.33,5.6,179.386,0.27,4.06
4,1960.0,1.0,2847.699,1770.5,331.722,462.199,1955.5,29.54,139.6,3.5,5.2,180.007,2.31,1.19


In [190]:
periods = pd.PeriodIndex(year=data.year, quarter=data.quarter,
                         name='date')

In [191]:
periods

PeriodIndex(['1959Q1', '1959Q2', '1959Q3', '1959Q4', '1960Q1', '1960Q2',
             '1960Q3', '1960Q4', '1961Q1', '1961Q2',
             ...
             '2007Q2', '2007Q3', '2007Q4', '2008Q1', '2008Q2', '2008Q3',
             '2008Q4', '2009Q1', '2009Q2', '2009Q3'],
            dtype='period[Q-DEC]', name='date', length=203, freq='Q-DEC')

In [192]:
columns = pd.Index(['realgdp', 'infl', 'unemp'], name='item')

In [193]:
columns

Index(['realgdp', 'infl', 'unemp'], dtype='object', name='item')

In [194]:
data = data.reindex(columns=columns)

In [195]:
data

item,realgdp,infl,unemp
0,2710.349,0.00,5.8
1,2778.801,2.34,5.1
2,2775.488,2.74,5.3
3,2785.204,0.27,5.6
4,2847.699,2.31,5.2
...,...,...,...
198,13324.600,-3.16,6.0
199,13141.920,-8.79,6.9
200,12925.410,0.94,8.1
201,12901.504,3.37,9.2


In [196]:
data.index = periods.to_timestamp('D', 'end')

In [197]:
data

item,realgdp,infl,unemp
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1959-03-31 23:59:59.999999999,2710.349,0.00,5.8
1959-06-30 23:59:59.999999999,2778.801,2.34,5.1
1959-09-30 23:59:59.999999999,2775.488,2.74,5.3
1959-12-31 23:59:59.999999999,2785.204,0.27,5.6
1960-03-31 23:59:59.999999999,2847.699,2.31,5.2
...,...,...,...
2008-09-30 23:59:59.999999999,13324.600,-3.16,6.0
2008-12-31 23:59:59.999999999,13141.920,-8.79,6.9
2009-03-31 23:59:59.999999999,12925.410,0.94,8.1
2009-06-30 23:59:59.999999999,12901.504,3.37,9.2


In [198]:
ldata = data.stack().reset_index().rename(columns={0: 'value'})

In [199]:
ldata

Unnamed: 0,date,item,value
0,1959-03-31 23:59:59.999999999,realgdp,2710.349
1,1959-03-31 23:59:59.999999999,infl,0.000
2,1959-03-31 23:59:59.999999999,unemp,5.800
3,1959-06-30 23:59:59.999999999,realgdp,2778.801
4,1959-06-30 23:59:59.999999999,infl,2.340
...,...,...,...
604,2009-06-30 23:59:59.999999999,infl,3.370
605,2009-06-30 23:59:59.999999999,unemp,9.200
606,2009-09-30 23:59:59.999999999,realgdp,12990.341
607,2009-09-30 23:59:59.999999999,infl,3.560


In [200]:
ldata[:10]

Unnamed: 0,date,item,value
0,1959-03-31 23:59:59.999999999,realgdp,2710.349
1,1959-03-31 23:59:59.999999999,infl,0.0
2,1959-03-31 23:59:59.999999999,unemp,5.8
3,1959-06-30 23:59:59.999999999,realgdp,2778.801
4,1959-06-30 23:59:59.999999999,infl,2.34
5,1959-06-30 23:59:59.999999999,unemp,5.1
6,1959-09-30 23:59:59.999999999,realgdp,2775.488
7,1959-09-30 23:59:59.999999999,infl,2.74
8,1959-09-30 23:59:59.999999999,unemp,5.3
9,1959-12-31 23:59:59.999999999,realgdp,2785.204
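
The wide-to-long step used to build `ldata` can be tried on a toy frame. This is a sketch under assumptions: the two-row frame and its string labels are made up for illustration, and naming the stacked Series with `rename` before `reset_index` is one possible variant of the `rename(columns={0: 'value'})` call used in the notebook.

```python
import pandas as pd

# Toy wide-format frame (assumed values and labels, not the full dataset).
wide = pd.DataFrame(
    {'realgdp': [2710.349, 2778.801], 'infl': [0.00, 2.34]},
    index=pd.Index(['1959Q1', '1959Q2'], name='date'),
)
wide.columns.name = 'item'

# stack() turns the column labels into an inner index level,
# rename() names the resulting Series, and reset_index() moves
# both index levels back into ordinary columns.
long_df = wide.stack().rename('value').reset_index()

print(long_df)
```

Each row of the result holds one (date, item) observation, which is the layout `pivot` expects as input.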


In [201]:
pivoted = ldata.pivot(index='date', columns='item', values='value')

In [202]:
pivoted

item,infl,realgdp,unemp
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1959-03-31 23:59:59.999999999,0.00,2710.349,5.8
1959-06-30 23:59:59.999999999,2.34,2778.801,5.1
1959-09-30 23:59:59.999999999,2.74,2775.488,5.3
1959-12-31 23:59:59.999999999,0.27,2785.204,5.6
1960-03-31 23:59:59.999999999,2.31,2847.699,5.2
...,...,...,...
2008-09-30 23:59:59.999999999,-3.16,13324.600,6.0
2008-12-31 23:59:59.999999999,-8.79,13141.920,6.9
2009-03-31 23:59:59.999999999,0.94,12925.410,8.1
2009-06-30 23:59:59.999999999,3.37,12901.504,9.2


The first two arguments, index and columns, name the columns to use as the row
index and the column index, respectively. The optional third argument, values, names
the value column that fills the DataFrame. Suppose you had two value columns that
you wanted to reshape simultaneously:

In [204]:
ldata['value2'] = np.random.randn(len(ldata))

In [205]:
ldata[:10]

Unnamed: 0,date,item,value,value2
0,1959-03-31 23:59:59.999999999,realgdp,2710.349,0.10107
1,1959-03-31 23:59:59.999999999,infl,0.0,-0.775337
2,1959-03-31 23:59:59.999999999,unemp,5.8,-0.289894
3,1959-06-30 23:59:59.999999999,realgdp,2778.801,0.236553
4,1959-06-30 23:59:59.999999999,infl,2.34,-1.522471
5,1959-06-30 23:59:59.999999999,unemp,5.1,0.322256
6,1959-09-30 23:59:59.999999999,realgdp,2775.488,0.642608
7,1959-09-30 23:59:59.999999999,infl,2.74,1.293816
8,1959-09-30 23:59:59.999999999,unemp,5.3,-0.076443
9,1959-12-31 23:59:59.999999999,realgdp,2785.204,1.62303


In [206]:
pivoted = ldata.pivot(index='date', columns='item')

In [207]:
pivoted

Unnamed: 0_level_0,value,value,value,value2,value2,value2
item,infl,realgdp,unemp,infl,realgdp,unemp
date,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
1959-03-31 23:59:59.999999999,0.00,2710.349,5.8,-0.775337,0.101070,-0.289894
1959-06-30 23:59:59.999999999,2.34,2778.801,5.1,-1.522471,0.236553,0.322256
1959-09-30 23:59:59.999999999,2.74,2775.488,5.3,1.293816,0.642608,-0.076443
1959-12-31 23:59:59.999999999,0.27,2785.204,5.6,-1.269019,1.623030,-1.246025
1960-03-31 23:59:59.999999999,2.31,2847.699,5.2,1.442736,1.090278,-1.222467
...,...,...,...,...,...,...
2008-09-30 23:59:59.999999999,-3.16,13324.600,6.0,0.974920,1.217666,0.265989
2008-12-31 23:59:59.999999999,-8.79,13141.920,6.9,0.673879,-3.020404,-3.041866
2009-03-31 23:59:59.999999999,0.94,12925.410,8.1,1.708662,0.287558,-0.911421
2009-06-30 23:59:59.999999999,3.37,12901.504,9.2,-0.713630,-0.931730,-0.321302


In [208]:
pivoted[:5]

Unnamed: 0_level_0,value,value,value,value2,value2,value2
item,infl,realgdp,unemp,infl,realgdp,unemp
date,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
1959-03-31 23:59:59.999999999,0.0,2710.349,5.8,-0.775337,0.10107,-0.289894
1959-06-30 23:59:59.999999999,2.34,2778.801,5.1,-1.522471,0.236553,0.322256
1959-09-30 23:59:59.999999999,2.74,2775.488,5.3,1.293816,0.642608,-0.076443
1959-12-31 23:59:59.999999999,0.27,2785.204,5.6,-1.269019,1.62303,-1.246025
1960-03-31 23:59:59.999999999,2.31,2847.699,5.2,1.442736,1.090278,-1.222467


In [209]:
pivoted['value'][:5]

item,infl,realgdp,unemp
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1959-03-31 23:59:59.999999999,0.0,2710.349,5.8
1959-06-30 23:59:59.999999999,2.34,2778.801,5.1
1959-09-30 23:59:59.999999999,2.74,2775.488,5.3
1959-12-31 23:59:59.999999999,0.27,2785.204,5.6
1960-03-31 23:59:59.999999999,2.31,2847.699,5.2
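
The effect of leaving out the value column can be seen on a small frame. The following is a minimal sketch with made-up data (the labels `q1`, `q2` and all numbers are assumptions, not values from the macro dataset):

```python
import pandas as pd

# Toy long-format data (assumed values).
long_df = pd.DataFrame({
    'date': ['q1', 'q1', 'q2', 'q2'],
    'item': ['infl', 'unemp', 'infl', 'unemp'],
    'value': [0.0, 5.8, 2.34, 5.1],
    'value2': [10, 20, 30, 40],
})

# With values= given, the result has a single level of columns.
single = long_df.pivot(index='date', columns='item', values='value')

# Without values=, every remaining column is pivoted and the
# columns become hierarchical: (value column, item).
both = long_df.pivot(index='date', columns='item')

print(single)
print(both.columns.tolist())
```

Selecting `both['value']` then recovers the same table as `single`, which mirrors the `pivoted['value'][:5]` cell.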


In [210]:
# The same reshape written as set_index followed by unstack:

unstacked = ldata.set_index(['date', 'item']).unstack('item')

In [212]:
unstacked[:7]

Unnamed: 0_level_0,value,value,value,value2,value2,value2
item,infl,realgdp,unemp,infl,realgdp,unemp
date,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
1959-03-31 23:59:59.999999999,0.0,2710.349,5.8,-0.775337,0.10107,-0.289894
1959-06-30 23:59:59.999999999,2.34,2778.801,5.1,-1.522471,0.236553,0.322256
1959-09-30 23:59:59.999999999,2.74,2775.488,5.3,1.293816,0.642608,-0.076443
1959-12-31 23:59:59.999999999,0.27,2785.204,5.6,-1.269019,1.62303,-1.246025
1960-03-31 23:59:59.999999999,2.31,2847.699,5.2,1.442736,1.090278,-1.222467
1960-06-30 23:59:59.999999999,0.14,2834.39,5.2,-0.378595,2.462551,-0.111058
1960-09-30 23:59:59.999999999,2.7,2839.022,5.6,0.932123,-0.36154,0.378706
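
The `unstacked` table looks the same as the earlier `pivoted` one. A small check on toy data (assumed values, not the macro dataset) suggests the two spellings give the same result when each (date, item) pair occurs only once:

```python
import pandas as pd

# Toy long-format data (assumed values).
long_df = pd.DataFrame({
    'date': ['q1', 'q1', 'q2', 'q2'],
    'item': ['infl', 'unemp', 'infl', 'unemp'],
    'value': [0.0, 5.8, 2.34, 5.1],
    'value2': [10, 20, 30, 40],
})

# Reshape once with pivot and once with set_index + unstack.
via_pivot = long_df.pivot(index='date', columns='item')
via_unstack = long_df.set_index(['date', 'item']).unstack('item')

# Raises an AssertionError if the two frames differ in any way.
pd.testing.assert_frame_equal(via_pivot, via_unstack)
print(via_unstack)
```

The `set_index` form makes the intermediate hierarchical index explicit, which can be handy when more index levels are involved.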


### Pivoting “Wide” to “Long” Format

An inverse operation to pivot for DataFrames is pandas.melt. Rather than
transforming one column into many in a new DataFrame, it merges multiple columns into
one, producing a DataFrame that is longer than the input. Let’s look at an example:


In [215]:
df = pd.DataFrame({'key': ['foo', 'bar', 'baz'],'A': [1, 2, 3],'B': [4, 5, 6],'C': [7, 8, 9]})

In [216]:
df

Unnamed: 0,key,A,B,C
0,foo,1,4,7
1,bar,2,5,8
2,baz,3,6,9


The 'key' column may be a group indicator, and the other columns are data values.
When using pandas.melt, we must indicate which columns (if any) are group
indicators. Let’s use 'key' as the only group indicator here:

In [217]:
melted = pd.melt(df, ['key'])

In [218]:
melted

Unnamed: 0,key,variable,value
0,foo,A,1
1,bar,A,2
2,baz,A,3
3,foo,B,4
4,bar,B,5
5,baz,B,6
6,foo,C,7
7,bar,C,8
8,baz,C,9


In [219]:
# pivot reshapes the melted data back to the original layout:

reshaped = melted.pivot(index='key', columns='variable', values='value')

In [220]:
reshaped

variable,A,B,C
key,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
bar,2,5,8
baz,3,6,9
foo,1,4,7


In [221]:
# pivot moved 'key' into the index; reset_index moves it back into a column:

reshaped.reset_index()

variable,key,A,B,C
0,bar,2,5,8
1,baz,3,6,9
2,foo,1,4,7
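
The melt and pivot round trip can be checked against the original frame. This is a sketch only: clearing the leftover `variable` name on the columns axis and sorting the original by 'key' are extra steps assumed here so that the two frames line up.

```python
import pandas as pd

df = pd.DataFrame({'key': ['foo', 'bar', 'baz'],
                   'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})

# Wide -> long -> wide again.
melted = pd.melt(df, id_vars=['key'])
restored = melted.pivot(index='key', columns='variable', values='value').reset_index()

# pivot leaves 'variable' as the name of the columns axis; drop it.
restored.columns.name = None

# pivot sorts the row labels, so compare against the original sorted by 'key'.
expected = df.sort_values('key').reset_index(drop=True)

print(restored)
print(expected)
```

Apart from row order and the axis name, the round trip reproduces the input.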


In [222]:
# A subset of columns can be used as value columns:

pd.melt(df, id_vars=['key'], value_vars=['A', 'B'])

Unnamed: 0,key,variable,value
0,foo,A,1
1,bar,A,2
2,baz,A,3
3,foo,B,4
4,bar,B,5
5,baz,B,6


In [223]:
# melt can also be used without any group indicators:

pd.melt(df, value_vars=['A', 'B', 'C'])

Unnamed: 0,variable,value
0,A,1
1,A,2
2,A,3
3,B,4
4,B,5
5,B,6
6,C,7
7,C,8
8,C,9


In [224]:
# Without id_vars, 'key' can itself be melted; the value column then mixes strings and integers:

pd.melt(df, value_vars=['key', 'A', 'B'])

Unnamed: 0,variable,value
0,key,foo
1,key,bar
2,key,baz
3,A,1
4,A,2
5,A,3
6,B,4
7,B,5
8,B,6
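
The default output column names `variable` and `value` can be replaced through melt's `var_name` and `value_name` parameters. A minimal sketch, where the names `measure` and `reading` are purely illustrative:

```python
import pandas as pd

df = pd.DataFrame({'key': ['foo', 'bar', 'baz'],
                   'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})

# var_name and value_name rename the two columns that melt creates.
melted = pd.melt(df, id_vars=['key'], value_vars=['A', 'B'],
                 var_name='measure', value_name='reading')

print(melted)
```

Descriptive names make a later pivot easier to read, since they are passed on as the `columns` and `values` arguments.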
