# ECS784 - Lab 3 part 2
# PANDAS (continued)

In this notebook we will look at:

   * Merging dataFrames; 
   * Concatenating dataFrames;
   * Pivoting dataFrames;
   * Removing/deleting data. 

In [1]:
# Importing the necessary libraries and set up

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

pd.set_option('display.max_columns', 50) # Show at most 50 columns when displaying a dataFrame
# Make matplotlib graphs appear inline and be stored within the notebook
%matplotlib inline

In [2]:
# The dataFrame below has two keys/column names, each of which has five values
frame1 = pd.DataFrame( {'id':['ball','pencil','pen','mug','ashtray'], 'price': [12.33,11.44,33.21,13.23,33.62]})

In [3]:
frame2 = pd.DataFrame({'id':['pencil','pencil','ball','pen'], 'color': ['white','red','red','black']})

In [4]:
frame1

Unnamed: 0,id,price
0,ball,12.33
1,pencil,11.44
2,pen,33.21
3,mug,13.23
4,ashtray,33.62


In [5]:
frame2

Unnamed: 0,id,color
0,pencil,white
1,pencil,red
2,ball,red
3,pen,black


In [6]:
# Merging the two dataFrames

merged_frame = pd.merge(frame1,frame2)

In [7]:
merged_frame # Note the merged data contains rows corresponding to IDs present in both dataFrames

Unnamed: 0,id,price,color
0,ball,12.33,red
1,pencil,11.44,white
2,pencil,11.44,red
3,pen,33.21,black


In [8]:
# Example 2: When more than one column name matches in the two dataFrames

frame1 = pd.DataFrame({'id':['ball','pencil','pen','mug','ashtray'],'color': ['white','red','red','black','green'],
'brand': ['OMG','ABC','ABC','POD','POD']})

In [9]:
frame1

Unnamed: 0,id,color,brand
0,ball,white,OMG
1,pencil,red,ABC
2,pen,red,ABC
3,mug,black,POD
4,ashtray,green,POD


In [10]:
frame2 = pd.DataFrame({'id':['pencil','pencil','ball','pen'],'brand': ['OMG','POD','ABC','POD']})

In [11]:
frame2

Unnamed: 0,id,brand
0,pencil,OMG
1,pencil,POD
2,ball,ABC
3,pen,POD


In [12]:
result=pd.merge(frame1,frame2)

In [13]:
result

Unnamed: 0,id,color,brand


In [14]:
# Both columns in frame2 ('id' and 'brand') are also present in frame1, so by default merge() uses both of them as merge keys.
# The result is an empty dataFrame because no row in frame1 has the same ('id', 'brand') pair as a row in frame2.

In [15]:
result.columns # This returns all the columns in the dataFrame

Index(['id', 'color', 'brand'], dtype='object')
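
The empty result above follows from how `pd.merge` picks its keys when `on` is omitted: it joins on every column name the two dataFrames share. The sketch below rebuilds the two frames under the hypothetical names `left` and `right` (not used elsewhere in this notebook) and checks that the default merge matches an explicit merge on the shared columns.

```python
import pandas as pd

# Hypothetical copies of frame1 and frame2 from Example 2
left = pd.DataFrame({'id': ['ball', 'pencil', 'pen', 'mug', 'ashtray'],
                     'color': ['white', 'red', 'red', 'black', 'green'],
                     'brand': ['OMG', 'ABC', 'ABC', 'POD', 'POD']})
right = pd.DataFrame({'id': ['pencil', 'pencil', 'ball', 'pen'],
                      'brand': ['OMG', 'POD', 'ABC', 'POD']})

# Column names present in both frames; these act as the default merge keys
common_keys = [col for col in left.columns if col in right.columns]

default_merge = pd.merge(left, right)                    # no 'on' given
explicit_merge = pd.merge(left, right, on=common_keys)   # same keys, stated explicitly
```

Both merges are empty here because no ('id', 'brand') pair occurs in both frames.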

### Use ‘on’ parameter to explicitly state the column on which merging should be based

In [16]:
# Merge the two dataFrames on the basis of column 'id'

pd.merge(frame1,frame2,on='id') # 'brand' appears in both dataFrames but is not a merge key, so we get 'brand_x' (from frame1) and 'brand_y' (from frame2)

Unnamed: 0,id,color,brand_x,brand_y
0,ball,white,OMG,ABC
1,pencil,red,ABC,OMG
2,pencil,red,ABC,POD
3,pen,red,ABC,POD
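
The `_x` and `_y` endings are only the defaults. As a minimal sketch, the `suffixes` parameter of `pd.merge` can give the overlapping columns more descriptive names. The frame names `left` and `right` and the suffix strings are illustrative choices, not part of the lab.

```python
import pandas as pd

# Hypothetical copies of frame1 and frame2 from Example 2
left = pd.DataFrame({'id': ['ball', 'pencil', 'pen', 'mug', 'ashtray'],
                     'color': ['white', 'red', 'red', 'black', 'green'],
                     'brand': ['OMG', 'ABC', 'ABC', 'POD', 'POD']})
right = pd.DataFrame({'id': ['pencil', 'pencil', 'ball', 'pen'],
                      'brand': ['OMG', 'POD', 'ABC', 'POD']})

# Replace the default '_x'/'_y' suffixes on the overlapping 'brand' column
renamed = pd.merge(left, right, on='id', suffixes=('_frame1', '_frame2'))
```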


In [17]:
# Merge the two dataFrames on the basis of 'brand' column

pd.merge(frame1,frame2,on='brand') # Similarly, 'id' is now a non-key column in both dataFrames, so we get 'id_x' and 'id_y'

Unnamed: 0,id_x,color,brand,id_y
0,ball,white,OMG,pencil
1,pencil,red,ABC,ball
2,pen,red,ABC,ball
3,mug,black,POD,pencil
4,mug,black,POD,pen
5,ashtray,green,POD,pencil
6,ashtray,green,POD,pen
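
The seven rows above can be explained by counting: for each brand, every matching row in one dataFrame is paired with every matching row in the other. The sketch below checks that arithmetic, again using hypothetical copies named `left` and `right`.

```python
import pandas as pd

# Hypothetical copies of frame1 and frame2 from Example 2
left = pd.DataFrame({'id': ['ball', 'pencil', 'pen', 'mug', 'ashtray'],
                     'color': ['white', 'red', 'red', 'black', 'green'],
                     'brand': ['OMG', 'ABC', 'ABC', 'POD', 'POD']})
right = pd.DataFrame({'id': ['pencil', 'pencil', 'ball', 'pen'],
                      'brand': ['OMG', 'POD', 'ABC', 'POD']})

# Number of rows per brand in each frame
left_counts = left['brand'].value_counts()
right_counts = right['brand'].value_counts()

# Each brand contributes (rows in left) x (rows in right) rows to the merge
expected_rows = int((left_counts * right_counts).sum())

merged_on_brand = pd.merge(left, right, on='brand')
```

Here OMG contributes 1 x 1 row, ABC 2 x 1 rows and POD 2 x 2 rows.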



## Using the 'how' parameter to specify type of merging

In [18]:
frame1

Unnamed: 0,id,color,brand
0,ball,white,OMG
1,pencil,red,ABC
2,pen,red,ABC
3,mug,black,POD
4,ashtray,green,POD


In [19]:
frame2

Unnamed: 0,id,brand
0,pencil,OMG
1,pencil,POD
2,ball,ABC
3,pen,POD


In [20]:
# Perform outer join on frame1 and frame2

pd.merge(frame1,frame2,on='id',how='outer') # how='outer' takes the union, rather than the intersection, of the merge keys.
# This means that all rows, from both dataFrames, are included in the merged dataFrame, with missing values indicated as 'NaN'.
# Set how='inner' and observe that rows with no match in the other dataFrame are dropped (this is the default operation).

Unnamed: 0,id,color,brand_x,brand_y
0,ashtray,green,POD,
1,ball,white,OMG,ABC
2,mug,black,POD,
3,pen,red,ABC,POD
4,pencil,red,ABC,OMG
5,pencil,red,ABC,POD
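
To see where each row of an outer join came from, `pd.merge` accepts `indicator=True`, which adds a `_merge` column with the values `left_only`, `right_only` or `both`. A minimal sketch, assuming hypothetical copies of the two frames named `left` and `right`:

```python
import pandas as pd

# Hypothetical copies of frame1 and frame2 from Example 2
left = pd.DataFrame({'id': ['ball', 'pencil', 'pen', 'mug', 'ashtray'],
                     'color': ['white', 'red', 'red', 'black', 'green'],
                     'brand': ['OMG', 'ABC', 'ABC', 'POD', 'POD']})
right = pd.DataFrame({'id': ['pencil', 'pencil', 'ball', 'pen'],
                      'brand': ['OMG', 'POD', 'ABC', 'POD']})

# indicator=True records the origin of every row in a '_merge' column
outer = pd.merge(left, right, on='id', how='outer', indicator=True)

# IDs that exist only in the left frame (these rows have NaN in 'brand_y')
left_only_ids = set(outer.loc[outer['_merge'] == 'left_only', 'id'])
```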


In [21]:
# Performing merging with reference to a specific dataFrame:

pd.merge(frame1,frame2,on='id',how='left')   
# 'left' takes all rows from frame1 (the left dataFrame parameter) and any rows from frame2 that match frame1

Unnamed: 0,id,color,brand_x,brand_y
0,ball,white,OMG,ABC
1,pencil,red,ABC,OMG
2,pencil,red,ABC,POD
3,pen,red,ABC,POD
4,mug,black,POD,
5,ashtray,green,POD,


In [22]:
# Performing merging with reference to a specific dataFrame:

pd.merge(frame1,frame2,on='id',how='right') 
# 'right' takes all rows from frame2 (the right dataFrame parameter) and any rows from frame1 that match frame2

Unnamed: 0,id,color,brand_x,brand_y
0,pencil,red,ABC,OMG
1,pencil,red,ABC,POD
2,ball,white,OMG,ABC
3,pen,red,ABC,POD
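
As a quick comparison of the four `how` options, the sketch below counts the rows each one returns for a merge on 'id'. The names `left`, `right` and `row_counts` are illustrative only.

```python
import pandas as pd

# Hypothetical copies of frame1 and frame2 from Example 2
left = pd.DataFrame({'id': ['ball', 'pencil', 'pen', 'mug', 'ashtray'],
                     'color': ['white', 'red', 'red', 'black', 'green'],
                     'brand': ['OMG', 'ABC', 'ABC', 'POD', 'POD']})
right = pd.DataFrame({'id': ['pencil', 'pencil', 'ball', 'pen'],
                      'brand': ['OMG', 'POD', 'ABC', 'POD']})

# Number of rows returned by each join type when merging on 'id'
row_counts = {how: len(pd.merge(left, right, on='id', how=how))
              for how in ('inner', 'outer', 'left', 'right')}
```

Because every 'id' in the right frame also appears in the left frame, 'inner' and 'right' return the same number of rows, as do 'outer' and 'left'.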


### To merge on multiple keys, pass a list to the 'on' parameter

In [23]:
frame1   # let's print the contents of frame1 again

Unnamed: 0,id,color,brand
0,ball,white,OMG
1,pencil,red,ABC
2,pen,red,ABC
3,mug,black,POD
4,ashtray,green,POD


In [24]:
frame2   # let's print the contents of frame2 again

Unnamed: 0,id,brand
0,pencil,OMG
1,pencil,POD
2,ball,ABC
3,pen,POD


In [25]:
pd.merge(frame1,frame2,on=['id','brand'],how='outer')  # The result now includes all rows from both dataFrames

Unnamed: 0,id,color,brand
0,ashtray,green,POD
1,ball,,ABC
2,ball,white,OMG
3,mug,black,POD
4,pen,red,ABC
5,pen,,POD
6,pencil,red,ABC
7,pencil,,OMG
8,pencil,,POD


# Concatenating Series


In [26]:
# The Pandas concat() function

In [27]:
# Let's first apply this method on Series
# Creating two Series containing four random values each
ser1 = pd.Series(np.random.rand(4), index=[1,2,3,4])
ser2 = pd.Series(np.random.rand(4), index=[5,6,7,8])

In [28]:
ser1

1    0.635845
2    0.348446
3    0.806241
4    0.844691
dtype: float64

In [29]:
ser2

5    0.565951
6    0.373020
7    0.786822
8    0.449127
dtype: float64

In [30]:
# concatenate the two series
ser3 = pd.concat([ser1,ser2]) # To renumber the index from 0 instead of keeping the original labels, add the parameter ignore_index=True

In [31]:
ser3

1    0.635845
2    0.348446
3    0.806241
4    0.844691
5    0.565951
6    0.373020
7    0.786822
8    0.449127
dtype: float64
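
The effect of `ignore_index=True` can be sketched as follows. The names `s1` and `s2` are hypothetical stand-ins for `ser1` and `ser2`, and the values are random, so only the index labels are of interest.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for ser1 and ser2
s1 = pd.Series(np.random.rand(4), index=[1, 2, 3, 4])
s2 = pd.Series(np.random.rand(4), index=[5, 6, 7, 8])

kept = pd.concat([s1, s2])                           # keeps the labels 1..8
renumbered = pd.concat([s1, s2], ignore_index=True)  # relabels the rows 0..7
```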

   * By default, the concat() function uses axis = 0 and, when given Series, returns a Series object.
   * If we set axis = 1, then the result will be a DataFrame.

In [32]:
ser3 = pd.concat([ser1,ser2],axis=1)   # setting axis=1 returns a dataFrame
# Every row of the dataFrame contains a missing value (NaN) in one of the two columns because no index from ser1 matches an index from ser2

In [33]:
ser3

Unnamed: 0,0,1
1,0.635845,
2,0.348446,
3,0.806241,
4,0.844691,
5,,0.565951
6,,0.37302
7,,0.786822
8,,0.449127


The concatenation returns an outer join by default. We can change it to 'inner':

In [34]:
pd.concat([ser1,ser2],axis=1,join='inner') # Inner join (intersection) returns no matched cases, since no indexes match

Unnamed: 0,0,1
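
The inner join is empty only because the two Series share no index labels. As a sketch, assume two hypothetical Series, `a` and `b`, whose indexes overlap on the labels 3 and 4:

```python
import numpy as np
import pandas as pd

# Hypothetical Series with partially overlapping indexes (labels 3 and 4 are shared)
a = pd.Series(np.random.rand(4), index=[1, 2, 3, 4])
b = pd.Series(np.random.rand(4), index=[3, 4, 5, 6])

outer = pd.concat([a, b], axis=1)                # union of the labels, with NaN where missing
inner = pd.concat([a, b], axis=1, join='inner')  # only the labels present in both
```

With overlapping labels the inner join keeps one row per shared label, while the outer join keeps all six labels.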


# Concatenating dataFrames

In [35]:
# We now create two 3x3 dataFrames, each containing nine random values
frame1 = pd.DataFrame(np.random.rand(9).reshape(3,3), index=[1,2,3], columns=['A','B','C'])
frame2 = pd.DataFrame(np.random.rand(9).reshape(3,3), index=[4,5,6], columns=['A','B','C'])

In [36]:
frame1

Unnamed: 0,A,B,C
1,0.178477,0.918225,0.668928
2,0.758696,0.112519,0.151538
3,0.008968,0.631922,0.329342


In [37]:
frame2

Unnamed: 0,A,B,C
4,0.960597,0.943332,0.072883
5,0.01891,0.797814,0.694305
6,0.475249,0.344593,0.985131


In [38]:
pd.concat([frame1, frame2])   # By default, axis=0, so concatenation is performed vertically

Unnamed: 0,A,B,C
1,0.178477,0.918225,0.668928
2,0.758696,0.112519,0.151538
3,0.008968,0.631922,0.329342
4,0.960597,0.943332,0.072883
5,0.01891,0.797814,0.694305
6,0.475249,0.344593,0.985131


In [39]:
# Set axis=1 to perform horizontal concatenation - but note the missing values, which arise because no indexes match
pd.concat([frame1, frame2], axis=1)

Unnamed: 0,A,B,C,A,B,C
1,0.178477,0.918225,0.668928,,,
2,0.758696,0.112519,0.151538,,,
3,0.008968,0.631922,0.329342,,,
4,,,,0.960597,0.943332,0.072883
5,,,,0.01891,0.797814,0.694305
6,,,,0.475249,0.344593,0.985131
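
For contrast, horizontal concatenation produces no missing values when the two dataFrames share the same index. The sketch below uses hypothetical frames `a` and `b` with fixed integer values and distinct column names, so the result is deterministic.

```python
import numpy as np
import pandas as pd

# Hypothetical frames that share the index labels 1, 2 and 3
a = pd.DataFrame(np.arange(9).reshape(3, 3), index=[1, 2, 3], columns=['A', 'B', 'C'])
b = pd.DataFrame(np.arange(9, 18).reshape(3, 3), index=[1, 2, 3], columns=['D', 'E', 'F'])

# Rows are aligned on the shared index, so every cell is filled
side_by_side = pd.concat([a, b], axis=1)
```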


# Pivoting with Hierarchical Indexing

In the context of pivoting there are two basic operations:
  * Stacking: rotates or pivots the data structure, converting columns to rows.
  * Unstacking: converts rows into columns.

This is similar to pivoting tables in Excel.

In [40]:
# Create a 3x3 dataFrame with the specified indexes and column headings
# The values in the dataFrame will be the integers 0 to 8 (nine values), filled row by row
frame1 = pd.DataFrame(np.arange(9).reshape(3,3), index=['white','black','red'],columns=['ball','pen','pencil'])

In [41]:
frame1

Unnamed: 0,ball,pen,pencil
white,0,1,2
black,3,4,5
red,6,7,8


In [42]:
# Using the stack() function pivots the columns into rows, thus converting the dataFrame into a hierarchical series:

ser5 = frame1.stack()

In [43]:
ser5

white  ball      0
       pen       1
       pencil    2
black  ball      3
       pen       4
       pencil    5
red    ball      6
       pen       7
       pencil    8
dtype: int64

In [44]:
# The unstack() function reverses the operation

ser5.unstack()   # produces the original structure
                

Unnamed: 0,ball,pen,pencil
white,0,1,2
black,3,4,5
red,6,7,8
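
By default `unstack()` moves the innermost index level into the columns. As a minimal sketch, passing `level=0` moves the outer level (the colours) into the columns instead, which gives the same values as the transposed dataFrame. The names `frame`, `stacked` and `by_color` are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical copy of the 3x3 frame used above
frame = pd.DataFrame(np.arange(9).reshape(3, 3),
                     index=['white', 'black', 'red'],
                     columns=['ball', 'pen', 'pencil'])

stacked = frame.stack()               # hierarchical Series indexed by (colour, item)
by_color = stacked.unstack(level=0)   # colours become the columns, items the rows
```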


## Removing columns and rows

In [45]:
# Redefine a dataFrame
frame1 = pd.DataFrame(np.arange(9).reshape(3,3), index=['white','black','red'],columns=['ball','pen','pencil'])

In [46]:
frame1

Unnamed: 0,ball,pen,pencil
white,0,1,2
black,3,4,5
red,6,7,8


In [47]:
# To remove a column in place, simply use the del statement

del frame1['ball']

In [48]:
frame1

Unnamed: 0,pen,pencil
white,1,2
black,4,5
red,7,8


In [49]:
# Reinitialise the dataFrame
frame1 = pd.DataFrame(np.arange(9).reshape(3,3), index=['white','black','red'],columns=['ball','pen','pencil'])

In [50]:
frame1

Unnamed: 0,ball,pen,pencil
white,0,1,2
black,3,4,5
red,6,7,8


In [51]:
# Rows (index labels) can be removed with the drop() method, which returns a new dataFrame and leaves frame1 unchanged

frame1.drop('white')
# To drop a column you have to use axis=1, e.g., frame1.drop('ball',axis=1)

Unnamed: 0,ball,pen,pencil
black,3,4,5
red,6,7,8
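
Unlike `del`, `drop()` does not modify the dataFrame it is called on. The sketch below shows this on a hypothetical copy named `frame`, and also uses the `columns=` keyword, which is an alternative spelling of `axis=1`.

```python
import numpy as np
import pandas as pd

# Hypothetical copy of the 3x3 frame used above
frame = pd.DataFrame(np.arange(9).reshape(3, 3),
                     index=['white', 'black', 'red'],
                     columns=['ball', 'pen', 'pencil'])

without_white = frame.drop('white')        # new dataFrame without the 'white' row
without_ball = frame.drop(columns='ball')  # same result as frame.drop('ball', axis=1)

# 'frame' itself still contains every row and column
```

To keep the change, assign the result back, for example `frame = frame.drop('white')`.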
