## 4.6 Merging and exporting data

## Contents

01. Create data to experiment on
02. Concatenate dataframes
03. Append data
04. Merge data

In [1]:
# Import libraries

import pandas as pd
import numpy as np
import os

## 01. Create data to experiment on

In [2]:
# Define a dictionary containing January 2020 data 

data1 = {'customer_id':['6732', '767', '890', '635'], 
        'month':['Jan-20', 'Jan-20', 'Jan-20', 'Jan-20'], 
        'purchased_meat':[0, 13, 3, 4], 
        'purchased_alcohol':[1, 2, 10, 0],
        'purchased_snacks': [10, 5, 1, 7]} 

In [3]:
# Define a dictionary containing February 2020 data 

data2 = {'customer_id':['6732', '767', '890', '635'], 
        'month':['Feb-20', 'Feb-20', 'Feb-20', 'Feb-20'], 
        'purchased_meat':[0, 10, 5, 3], 
        'purchased_alcohol':[2, 4, 14, 0],
        'purchased_snacks': [15, 3, 2, 6]} 

In [4]:
# Convert each dictionary into a DataFrame
# Give both dataframes the same explicit index; pd.concat() uses the index to align rows when combining side by side (axis = 1)

df = pd.DataFrame(data1,index=[0, 1, 2, 3])
df_1 = pd.DataFrame(data2,index=[0, 1, 2, 3])

In [5]:
df

Unnamed: 0,customer_id,month,purchased_meat,purchased_alcohol,purchased_snacks
0,6732,Jan-20,0,1,10
1,767,Jan-20,13,2,5
2,890,Jan-20,3,10,1
3,635,Jan-20,4,0,7


In [6]:
df_1

Unnamed: 0,customer_id,month,purchased_meat,purchased_alcohol,purchased_snacks
0,6732,Feb-20,0,2,15
1,767,Feb-20,10,4,3
2,890,Feb-20,5,14,2
3,635,Feb-20,3,0,6


## 02. Concatenate dataframes

Concatenation is a good choice for combining data sets that share the same structure: the same columns when stacking rows, or the same number of rows when placing data sets side by side. Say, for example, that you have two data sets with five columns each that both carry the same information, only with different values. The pd.concat() function will let you stack these data sets either on top of one another or side by side.

In [7]:
# Create a list that contains our dataframes

frames = [df, df_1]

In [8]:
# Check the output

frames

[  customer_id   month  purchased_meat  purchased_alcohol  purchased_snacks
 0        6732  Jan-20               0                  1                10
 1         767  Jan-20              13                  2                 5
 2         890  Jan-20               3                 10                 1
 3         635  Jan-20               4                  0                 7,
   customer_id   month  purchased_meat  purchased_alcohol  purchased_snacks
 0        6732  Feb-20               0                  2                15
 1         767  Feb-20              10                  4                 3
 2         890  Feb-20               5                 14                 2
 3         635  Feb-20               3                  0                 6]

In [9]:
# Check the data type to confirm that frames is a list

type(frames)

list

In [10]:
# Concatenate the dataframes using the default options (axis = 0). This stacks the dataframes on top of one another.

df_concat = pd.concat(frames)

In [11]:
# Check the output

df_concat

Unnamed: 0,customer_id,month,purchased_meat,purchased_alcohol,purchased_snacks
0,6732,Jan-20,0,1,10
1,767,Jan-20,13,2,5
2,890,Jan-20,3,10,1
3,635,Jan-20,4,0,7
0,6732,Feb-20,0,2,15
1,767,Feb-20,10,4,3
2,890,Feb-20,5,14,2
3,635,Feb-20,3,0,6
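
Notice that the stacked output repeats the index labels 0 to 3. A minimal sketch of one way to get a fresh index, assuming two small stand-in dataframes (the names `jan` and `feb` are hypothetical and not part of the notebook above):

```python
import pandas as pd

# Hypothetical stand-ins for the January and February dataframes
jan = pd.DataFrame({'customer_id': ['6732', '767'],
                    'month': ['Jan-20', 'Jan-20'],
                    'purchased_meat': [0, 13]})
feb = pd.DataFrame({'customer_id': ['6732', '767'],
                    'month': ['Feb-20', 'Feb-20'],
                    'purchased_meat': [0, 10]})

# Default behaviour keeps each dataframe's own index, so labels repeat
stacked = pd.concat([jan, feb])
print(stacked.index.tolist())

# ignore_index=True discards the old labels and renumbers the rows from 0
stacked_reset = pd.concat([jan, feb], ignore_index=True)
print(stacked_reset.index.tolist())
```

Whether to reset the index depends on what you do next: repeated labels are harmless for viewing the data, but they can make label-based selection with .loc return more than one row.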


In [12]:
# Concatenate the dataframes using axis = 1 to create a wide-format output.
# This places the data sets side by side.

df_concat = pd.concat(frames, axis = 1)

In [13]:
# Check the output

df_concat

Unnamed: 0,customer_id,month,purchased_meat,purchased_alcohol,purchased_snacks,customer_id.1,month.1,purchased_meat.1,purchased_alcohol.1,purchased_snacks.1
0,6732,Jan-20,0,1,10,6732,Feb-20,0,2,15
1,767,Jan-20,13,2,5,767,Feb-20,10,4,3
2,890,Jan-20,3,10,1,890,Feb-20,5,14,2
3,635,Jan-20,4,0,7,635,Feb-20,3,0,6
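
With axis = 1, both dataframes contribute columns with the same names, which makes it hard to tell the months apart. One possible way to label them is the keys argument of pd.concat(), sketched here with hypothetical stand-in dataframes (`jan`, `feb`) and hypothetical labels:

```python
import pandas as pd

# Hypothetical stand-ins for the January and February dataframes
jan = pd.DataFrame({'customer_id': ['6732', '767'], 'purchased_meat': [0, 13]})
feb = pd.DataFrame({'customer_id': ['6732', '767'], 'purchased_meat': [0, 10]})

# keys adds an outer column level, so each block of columns is labelled
wide = pd.concat([jan, feb], axis=1, keys=['jan', 'feb'])
print(wide.columns.tolist())

# A single column is then selected with a (label, column) tuple
feb_meat = wide[('feb', 'purchased_meat')]
print(feb_meat.tolist())
```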


## 03. Append data

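Appending data is a straightforward approach for adding rows to an existing dataframe with the same number of columns. The append() function works the same as pd.concat() when using its default settings (axis = 0). This means the resulting output will be in long format. Note that DataFrame.append() was deprecated in pandas 1.4 and removed in pandas 2.0, so the append cells in this section only run on older versions.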

Below, the function df.append() is being used to append the df_1 dataframe onto the df dataframe. The dataframe upon which you want to append another dataframe is included before the dot (df), while the dataframe you want to append onto another dataframe is included in the parentheses (df_1).

In [14]:
df_appended = df.append(df_1)



In [15]:
df_appended

Unnamed: 0,customer_id,month,purchased_meat,purchased_alcohol,purchased_snacks
0,6732,Jan-20,0,1,10
1,767,Jan-20,13,2,5
2,890,Jan-20,3,10,1
3,635,Jan-20,4,0,7
0,6732,Feb-20,0,2,15
1,767,Feb-20,10,4,3
2,890,Feb-20,5,14,2
3,635,Feb-20,3,0,6
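
On pandas 2.0 and later, DataFrame.append() is no longer available. A minimal sketch of the same long-format result using pd.concat(), with hypothetical stand-in dataframes (`df_jan`, `df_feb`) in place of df and df_1:

```python
import pandas as pd

# Hypothetical stand-ins for df and df_1
df_jan = pd.DataFrame({'customer_id': ['6732', '767'],
                       'month': ['Jan-20', 'Jan-20'],
                       'purchased_meat': [0, 13]})
df_feb = pd.DataFrame({'customer_id': ['6732', '767'],
                       'month': ['Feb-20', 'Feb-20'],
                       'purchased_meat': [0, 10]})

# Stacks df_feb underneath df_jan, as df_jan.append(df_feb) did on older versions
df_appended_alt = pd.concat([df_jan, df_feb])
print(df_appended_alt)
```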


In [16]:
# Create data with different columns from df

data3 = {'customer_id':['6732', '767', '890', '635'], 
        'month':['Jan-20', 'Jan-20', 'Jan-20', 'Jan-20'], 
        'days_purchased_on':[0, 13, 3, 4]} 

In [17]:
# Convert to dataframe

df_2 = pd.DataFrame(data3,index=[0, 1, 2, 3])

In [18]:
df_2

Unnamed: 0,customer_id,month,days_purchased_on
0,6732,Jan-20,0
1,767,Jan-20,13
2,890,Jan-20,3
3,635,Jan-20,4


In [19]:
# Create a new dataset combining df and df_2

df_append_test = df.append(df_2)



In [20]:
# pandas does not raise an error when the dataframes have different columns. It keeps every column and fills the values that are missing with NaN.

df_append_test

Unnamed: 0,customer_id,month,purchased_meat,purchased_alcohol,purchased_snacks,days_purchased_on
0,6732,Jan-20,0.0,1.0,10.0,
1,767,Jan-20,13.0,2.0,5.0,
2,890,Jan-20,3.0,10.0,1.0,
3,635,Jan-20,4.0,0.0,7.0,
0,6732,Jan-20,,,,0.0
1,767,Jan-20,,,,13.0
2,890,Jan-20,,,,3.0
3,635,Jan-20,,,,4.0
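
If the NaN-filled columns are not wanted, one option is to keep only the columns that both dataframes share. The sketch below assumes hypothetical stand-in dataframes (`purchases`, `days`) and uses the join argument of pd.concat():

```python
import pandas as pd

# Hypothetical stand-ins for df and df_2: two shared columns, one unique column each
purchases = pd.DataFrame({'customer_id': ['6732', '767'],
                          'month': ['Jan-20', 'Jan-20'],
                          'purchased_meat': [0, 13]})
days = pd.DataFrame({'customer_id': ['6732', '767'],
                     'month': ['Jan-20', 'Jan-20'],
                     'days_purchased_on': [0, 13]})

# Default join='outer': the union of columns, with NaN where a value is missing
outer = pd.concat([purchases, days])
print(outer)

# join='inner': only the columns present in every dataframe
inner = pd.concat([purchases, days], join='inner')
print(inner)
```

The NaN values also explain why the integer columns appear as floats in the output: NaN is a floating-point value, so pandas converts the column.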


## 04. Merge data

The best use cases for the df.merge() function are those where the dataframes you want to combine don’t match in shape—different from concatenate and append. In these cases, you’ll need a key or some kind of common identifier column that brings the two (or more) data sets together.

The df.merge() function takes an important argument for determining how data will be combined: the on argument. It designates the common identifier column (or columns) on which to merge the data.

There may be times where the dataframes you’re provided with have a common column, but that column has a different name in each dataframe (for instance, “cust_id” in the first and “customer_id” in the second). In this scenario, you can rename one of the columns before executing a merge, or name each key separately with the left_on and right_on arguments.
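
Both ways of handling differently named key columns can be sketched as follows. The dataframes `left` and `right` are hypothetical; renaming uses DataFrame.rename(), and the alternative uses the left_on and right_on arguments of merge():

```python
import pandas as pd

# Hypothetical dataframes whose key columns have different names
left = pd.DataFrame({'cust_id': ['6732', '767'], 'purchased_meat': [0, 13]})
right = pd.DataFrame({'customer_id': ['6732', '767'], 'days_purchased_on': [0, 13]})

# Option 1: rename the key so both dataframes share the same column name
renamed = left.rename(columns={'cust_id': 'customer_id'})
merged_renamed = renamed.merge(right, on='customer_id')
print(merged_renamed.columns.tolist())

# Option 2: name each key separately; both key columns are kept in the result
merged_keys = left.merge(right, left_on='cust_id', right_on='customer_id')
print(merged_keys.columns.tolist())
```

Renaming first gives a cleaner result with a single key column, which is usually easier to work with afterwards.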

In [21]:
# Merge df and df_2 using customer_id as a key 

df_merged = df.merge(df_2, on = ['customer_id'])

In [22]:
df_merged

Unnamed: 0,customer_id,month_x,purchased_meat,purchased_alcohol,purchased_snacks,month_y,days_purchased_on
0,6732,Jan-20,0,1,10,Jan-20,0
1,767,Jan-20,13,2,5,Jan-20,13
2,890,Jan-20,3,10,1,Jan-20,3
3,635,Jan-20,4,0,7,Jan-20,4


You may notice there are two new columns: “month_x” and “month_y.” This is a result of the “month” column existing in both dataframes. Because you didn’t specify it as a key, as you did with the “customer_id” column, it’s duplicated in the final dataframe. Here’s what the final dataframe would look like if you did specify month as a key:



In [23]:
# Merge df and df_2 using customer_id and month as keys

df_merged = df.merge(df_2, on = ['customer_id', 'month'])

In [24]:
df_merged

Unnamed: 0,customer_id,month,purchased_meat,purchased_alcohol,purchased_snacks,days_purchased_on
0,6732,Jan-20,0,1,10,0
1,767,Jan-20,13,2,5,13
2,890,Jan-20,3,10,1,3
3,635,Jan-20,4,0,7,4
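
If you do want to merge on customer_id alone and keep both month columns, the default “_x” and “_y” endings can be replaced with more descriptive ones. A minimal sketch, assuming hypothetical stand-in dataframes (`purchases`, `days`) and hypothetical suffix names:

```python
import pandas as pd

# Hypothetical stand-ins for df and df_2
purchases = pd.DataFrame({'customer_id': ['6732', '767'],
                          'month': ['Jan-20', 'Jan-20'],
                          'purchased_meat': [0, 13]})
days = pd.DataFrame({'customer_id': ['6732', '767'],
                     'month': ['Jan-20', 'Jan-20'],
                     'days_purchased_on': [0, 13]})

# suffixes replaces the default ('_x', '_y') on overlapping non-key columns
merged = purchases.merge(days, on='customer_id',
                         suffixes=('_purchases', '_days'))
print(merged.columns.tolist())
```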


A quick and easy way to check for a full match is via the indicator = True argument. This argument creates a new column, “_merge”, that reports on the specifics of the merge. A value of both means the key (or keys) you specified exist in both dataframes, while a value of left_only or right_only indicates that the key only exists in either the left or right dataframe. Note that the default inner join keeps only matching rows, so every row it returns is flagged both; unmatched keys only show up with a left, right, or outer join.

In [25]:
# Merge df and df_2 using customer_id and month as keys, and add a merge flag

df_merged = df.merge(df_2, on = ['customer_id', 'month'], indicator = True)

In [26]:
df_merged

Unnamed: 0,customer_id,month,purchased_meat,purchased_alcohol,purchased_snacks,days_purchased_on,_merge
0,6732,Jan-20,0,1,10,0,both
1,767,Jan-20,13,2,5,13,both
2,890,Jan-20,3,10,1,3,both
3,635,Jan-20,4,0,7,4,both


In [27]:
# This is how you can get counts on how the data merged. 

df_merged['_merge'].value_counts()

both          4
left_only     0
right_only    0
Name: _merge, dtype: int64
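
In the example data every key matches, so only both appears. To see the other two flags, here is a small sketch with hypothetical dataframes (`left`, `right`) whose customer IDs only partly overlap, merged with an outer join so unmatched rows are kept:

```python
import pandas as pd

# Hypothetical dataframes: '6732' exists only on the left, '635' only on the right
left = pd.DataFrame({'customer_id': ['6732', '767', '890'],
                     'purchased_meat': [0, 13, 3]})
right = pd.DataFrame({'customer_id': ['767', '890', '635'],
                      'days_purchased_on': [13, 3, 4]})

# how='outer' keeps unmatched rows from both sides; indicator=True flags their origin
check = left.merge(right, on='customer_id', how='outer', indicator=True)
print(check)

# Count how many rows fall into each merge category
counts = check['_merge'].value_counts()
print(counts)
```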

In [28]:
# If you just want to test a merge without saving it to a new dataframe
# (or overwriting your current dataframe), you can do so via the following code:


pd.merge(df,df_2, on = ['customer_id', 'month'], indicator = True)

Unnamed: 0,customer_id,month,purchased_meat,purchased_alcohol,purchased_snacks,days_purchased_on,_merge
0,6732,Jan-20,0,1,10,0,both
1,767,Jan-20,13,2,5,13,both
2,890,Jan-20,3,10,1,3,both
3,635,Jan-20,4,0,7,4,both


There’s one more argument you should be aware of when using the df.merge() function: the how argument. This argument specifies, as the name implies, how you want the dataframes to be merged, and it can take the values left, right, inner (the default), or outer.

In [29]:
# The function below would merge the df_2 dataframe with the df dataframe using an inner join.

#df.merge(df_2, on = ['customer_id', 'month'], how = 'inner')
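
Because df and df_2 contain exactly the same keys, all four join types would return the same rows here. The difference becomes visible when the keys only partly overlap, as in this sketch with hypothetical dataframes (`left`, `right`):

```python
import pandas as pd

# Hypothetical dataframes: two shared customer IDs, plus one unique ID on each side
left = pd.DataFrame({'customer_id': ['6732', '767', '890'],
                     'purchased_meat': [0, 13, 3]})
right = pd.DataFrame({'customer_id': ['767', '890', '635'],
                      'days_purchased_on': [13, 3, 4]})

# Run the same merge with each join type and record the number of rows returned
row_counts = {}
for how in ['inner', 'left', 'right', 'outer']:
    result = left.merge(right, on='customer_id', how=how)
    row_counts[how] = len(result)

print(row_counts)
```

An inner join keeps only keys found in both dataframes, left and right joins keep every key from one side, and an outer join keeps every key from both sides, filling the gaps with NaN.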