- Data from different sources often has to be merged before it can be analyzed.
- In the example below, sales volume data (by date and product code)  
  and sales price data (by product code) are stored separately.
- To calculate and analyze sales by product,  
  we merge the sales price information into the sales volume data.

In [1]:
import pandas as pd
import numpy as np

In [2]:
df_volume = pd.read_excel("data/sample_data_sales_volume.xlsx")
print(df_volume.shape)
df_volume

(9, 3)


Unnamed: 0,date,code,sales_volume
0,2023-07-01,AB123,5
1,2023-07-01,AB124,3
2,2023-07-01,AB125,15
3,2023-07-01,AB126,6
4,2023-07-01,AB127,7
5,2023-07-01,AB128,19
6,2023-07-01,AB129,5
7,2023-07-01,AB130,6
8,2023-07-01,AB131,8


In [3]:
df_price = pd.read_excel("data/sample_data_sales_price.xlsx")
print(df_price.shape)
df_price

(12, 2)


Unnamed: 0,code,sales_price
0,AB123,1000
1,AB124,1200
2,AB125,1100
3,AB126,1150
4,AB127,1500
5,AB128,1600
6,AB129,1000
7,AB130,1700
8,AB131,1200
9,AB132,1500


- The base table for the merge is "df_volume", which holds the sales date and quantity information,  
  and we merge the sales price data into this data frame.
- The merge key is the product code (the 'code' column), which identifies the product in both data frames.
- With how='left', every row of "df_volume" is kept, and the matching price is attached to it.

In [5]:
df_sales = pd.merge(df_volume, df_price, on='code', how='left')

print(df_volume.shape)
print(df_price.shape)
print(df_sales.shape)
df_sales

(9, 3)
(12, 2)
(9, 4)


Unnamed: 0,date,code,sales_volume,sales_price
0,2023-07-01,AB123,5,1000
1,2023-07-01,AB124,3,1200
2,2023-07-01,AB125,15,1100
3,2023-07-01,AB126,6,1150
4,2023-07-01,AB127,7,1500
5,2023-07-01,AB128,19,1600
6,2023-07-01,AB129,5,1000
7,2023-07-01,AB130,6,1700
8,2023-07-01,AB131,8,1200
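
- A left merge like the one above keeps the row count of the left table only if each 'code' appears once in the price table.  
  The sketch below shows one way to check this, assuming small stand-in frames built from a few rows of the tables above  
  rather than the Excel files. The `validate` argument of `pd.merge` raises an error if the assumption does not hold.

```python
import pandas as pd

# Stand-in frames: a few rows copied from the tables above (not the full Excel data)
df_volume = pd.DataFrame({
    "date": pd.to_datetime(["2023-07-01"] * 3),
    "code": ["AB123", "AB124", "AB125"],
    "sales_volume": [5, 3, 15],
})
df_price = pd.DataFrame({
    "code": ["AB123", "AB124", "AB125", "AB132"],
    "sales_price": [1000, 1200, 1100, 1500],
})

# Each product code should appear only once in the price table
print(df_price["code"].is_unique)

# validate="many_to_one" makes pandas raise MergeError
# if 'code' is duplicated in the right-hand table
df_sales = pd.merge(df_volume, df_price, on="code", how="left",
                    validate="many_to_one")

# The merged frame has as many rows as df_volume
print(df_sales.shape)
```

- If the price table did contain duplicate codes, an unchecked merge would silently duplicate the sales rows, so a check of this kind can be worth the extra line.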


- After merging the data,  
  the sales amount (the 'sales' column) is calculated by multiplying the sales volume by the sales price.

In [6]:
df_sales['sales'] = df_sales['sales_volume']*df_sales['sales_price']

print(df_sales.shape)
df_sales

(9, 5)


Unnamed: 0,date,code,sales_volume,sales_price,sales
0,2023-07-01,AB123,5,1000,5000
1,2023-07-01,AB124,3,1200,3600
2,2023-07-01,AB125,15,1100,16500
3,2023-07-01,AB126,6,1150,6900
4,2023-07-01,AB127,7,1500,10500
5,2023-07-01,AB128,19,1600,30400
6,2023-07-01,AB129,5,1000,5000
7,2023-07-01,AB130,6,1700,10200
8,2023-07-01,AB131,8,1200,9600
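
- In the data above every sold product has a price, so the 'sales' column is complete.  
  The sketch below illustrates what happens otherwise, assuming a hypothetical product code "ZZ999" that is not in the price table:  
  the left merge leaves its price as NaN, and the multiplication propagates the NaN into 'sales'.

```python
import pandas as pd

df_volume = pd.DataFrame({
    # "ZZ999" is a hypothetical code with no entry in the price table
    "code": ["AB123", "AB124", "ZZ999"],
    "sales_volume": [5, 3, 2],
})
df_price = pd.DataFrame({
    "code": ["AB123", "AB124"],
    "sales_price": [1000, 1200],
})

df_sales = pd.merge(df_volume, df_price, on="code", how="left")
df_sales["sales"] = df_sales["sales_volume"] * df_sales["sales_price"]

# Rows whose code had no match carry NaN in sales_price and in sales
missing_codes = df_sales.loc[df_sales["sales_price"].isna(), "code"].tolist()
print(missing_codes)

# sum() skips NaN by default, so the total covers only the priced rows
total_sales = df_sales["sales"].sum()
print(total_sales)
```

- Listing the unmatched codes before summing helps avoid a total that quietly leaves out unpriced products.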


- If you merge with the options "how='outer', indicator=True",  
  you get more detailed information about where each row came from.
- The '_merge' column, which is created automatically when 'indicator' is set to 'True',  
  indicates the source data frame of each row ('left_only', 'right_only', or 'both').
- As shown below, the products with codes 'AB132', 'AB133', and 'AB134'  
  exist only in the 'df_price' data frame,  
  which means that no sales were recorded for these products on July 1, 2023.

In [7]:
df_sales_1 = pd.merge(df_volume, df_price, on='code', how='outer', indicator=True)

print(df_sales_1.shape)
df_sales_1

(12, 5)


Unnamed: 0,date,code,sales_volume,sales_price,_merge
0,2023-07-01,AB123,5.0,1000,both
1,2023-07-01,AB124,3.0,1200,both
2,2023-07-01,AB125,15.0,1100,both
3,2023-07-01,AB126,6.0,1150,both
4,2023-07-01,AB127,7.0,1500,both
5,2023-07-01,AB128,19.0,1600,both
6,2023-07-01,AB129,5.0,1000,both
7,2023-07-01,AB130,6.0,1700,both
8,2023-07-01,AB131,8.0,1200,both
9,NaT,AB132,,1500,right_only
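
- The '_merge' column can also be used as a filter.  
  The sketch below, which assumes small stand-in frames built from a few rows of the tables above,  
  lists the products that appear in the price table but have no sales record.

```python
import pandas as pd

# Stand-in frames: a few rows copied from the tables above (not the full Excel data)
df_volume = pd.DataFrame({
    "date": pd.to_datetime(["2023-07-01", "2023-07-01"]),
    "code": ["AB130", "AB131"],
    "sales_volume": [6, 8],
})
df_price = pd.DataFrame({
    "code": ["AB130", "AB131", "AB132"],
    "sales_price": [1700, 1200, 1500],
})

df_all = pd.merge(df_volume, df_price, on="code", how="outer", indicator=True)

# Products that exist only in the price table (no sales row)
no_sales = df_all.loc[df_all["_merge"] == "right_only", "code"].tolist()
print(no_sales)

# Number of rows per source: left_only, right_only, both
print(df_all["_merge"].value_counts())
```

- Note that 'sales_volume' is displayed as 5.0, 3.0, and so on after the outer merge: the unmatched rows contain NaN, which makes pandas store the column as floating point.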
