# **Guided Lab 343.4.2 - How to Merge Pandas DataFrames by Multiple Columns**

## **Lab Objective:**

This lab focuses on merging Pandas DataFrames using the pd.merge() function, specifically on merging by multiple columns. You will learn how to merge DataFrames when column names are the same and when they are different, and how to handle duplicate keys during merging. The lab covers various scenarios with practical examples and explanations.

**Key Concepts**

- **pd.merge() Function:** Understanding the syntax and usage of the pd.merge() function for merging DataFrames.
- **Merging by Multiple Columns:** Specifying multiple columns as keys for merging using the on parameter.
- **Handling Different Column Names:** Merging DataFrames with different column names using the left_on and right_on parameters.
- **Duplicate Keys:** Identifying and handling duplicate keys during merging using the validate parameter.

## **Learning Objectives:**

By completing this lab, you will gain hands-on experience in merging Pandas DataFrames by multiple columns, which is a crucial skill for data manipulation and analysis.

### **Introduction:**

- The **pandas.merge()** function merges two DataFrames based on common columns or indexes. It resembles SQL’s JOIN operation and gives you control over how the DataFrames are combined.
- With merge() you can merge by columns, by index, or on multiple columns, and you can choose among different join types. By default, it merges on all columns common to both DataFrames and performs an inner join.

- The syntax for the pandas.merge() function is:


```
pd.merge(
    left,
    right,
    how="inner",
    on=None,
    left_on=None,
    right_on=None,
    left_index=False,
    right_index=False,
    sort=False,
    suffixes=("_x", "_y"),
    copy=True,
    indicator=False,
    validate=None,
)
```
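
As a rough illustration of the how parameter in the signature above, the sketch below counts the rows produced by each join type. The two small DataFrames (a and b) and their column names are made up for this illustration and are not part of the lab data.

```python
import pandas as pd

# Hypothetical frames used only to compare join types
a = pd.DataFrame({"key": ["x", "y", "z"], "left_val": [1, 2, 3]})
b = pd.DataFrame({"key": ["y", "z", "w"], "right_val": [20, 30, 40]})

# Merge once per join type and record how many rows come back
row_counts = {
    how: len(pd.merge(a, b, on="key", how=how))
    for how in ["inner", "left", "right", "outer"]
}
print(row_counts)
# {'inner': 2, 'left': 3, 'right': 3, 'outer': 4}
```

Only the keys "y" and "z" appear in both frames, so the inner join keeps two rows, while the outer join keeps all four distinct keys.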



**Let’s create two DataFrames to use in the examples that follow.**

In [1]:
import pandas as pd

In [2]:
# Create Pandas DataFrame
left_df = pd.DataFrame({'Courses': ["Spark","PySpark","Python","pandas","Java"],
                    'Fee' : [20000,25000,30000,24000,40000],
                    'Duration':['30day','40days','60days','55days','50days']})

right_df = pd.DataFrame({'Courses': ["Java","PySpark","Python","pandas","Hyperion","html"],
                    'Fee': [20000,25000,30000,24000,40000,4000],
                    'Percentage':['10%','20%','25%','20%','10%','50%']})

print("First DataFrame:\n", left_df)
print("Second DataFrame:\n", right_df)

First DataFrame:
    Courses    Fee Duration
0    Spark  20000    30day
1  PySpark  25000   40days
2   Python  30000   60days
3   pandas  24000   55days
4     Java  40000   50days
Second DataFrame:
     Courses    Fee Percentage
0      Java  20000        10%
1   PySpark  25000        20%
2    Python  30000        25%
3    pandas  24000        20%
4  Hyperion  40000        10%
5      html   4000        50%


## **Example: Default merge without specifying a key column**

You can pass two DataFrames to pandas.merge() without naming a key. The function finds every column common to both DataFrames, uses those columns as the join keys, and keeps a single copy of each in the result. Here, left_df and right_df are merged and the result is assigned to merged_df.

By default, merge() joins on all columns that are present in both DataFrames and performs an inner join. The columns Courses and Fee are common to both DataFrames, so only rows that match on both columns are kept.

In [3]:
merged_df = pd.merge(left_df,right_df)
merged_df

   Courses    Fee Duration Percentage
0  PySpark  25000   40days        20%
1   Python  30000   60days        25%
2   pandas  24000   55days        20%


In [4]:
merged_df.shape

(3, 4)
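
To see which columns the default merge picks as keys, one option is to compute the shared column names yourself and pass them to on explicitly. This is only a sketch; the names common_cols, implicit, and explicit are introduced here for illustration.

```python
import pandas as pd

left_df = pd.DataFrame({'Courses': ["Spark", "PySpark", "Python", "pandas", "Java"],
                        'Fee': [20000, 25000, 30000, 24000, 40000],
                        'Duration': ['30day', '40days', '60days', '55days', '50days']})
right_df = pd.DataFrame({'Courses': ["Java", "PySpark", "Python", "pandas", "Hyperion", "html"],
                         'Fee': [20000, 25000, 30000, 24000, 40000, 4000],
                         'Percentage': ['10%', '20%', '25%', '20%', '10%', '50%']})

# Columns that exist in both DataFrames, in left_df's order
common_cols = [col for col in left_df.columns if col in right_df.columns]
print(common_cols)  # ['Courses', 'Fee']

# The default merge and the explicit merge should give the same result
implicit = pd.merge(left_df, right_df)
explicit = pd.merge(left_df, right_df, on=common_cols)
print(implicit.equals(explicit))  # True
```

Spelling out the keys with on is often easier to read and guards against a new shared column silently changing the join.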

## **Example: Pandas Merge DataFrames Based on a Single Column**



If you want to merge DataFrames based on a single key column, you can simply pass the column name as a string to the on parameter. For example:

In [6]:
result = pd.merge(left_df, right_df, on="Courses")
result

   Courses  Fee_x Duration  Fee_y Percentage
0  PySpark  25000   40days  25000        20%
1   Python  30000   60days  30000        25%
2   pandas  24000   55days  24000        20%
3     Java  40000   50days  20000        10%


In [7]:
result.shape

(4, 5)
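
Because Fee exists in both DataFrames but is not a join key here, pandas keeps both copies and appends the default suffixes _x and _y. The sketch below shows one way to choose more descriptive suffixes through the suffixes parameter; the suffix strings "_left" and "_right" are arbitrary choices made for this illustration.

```python
import pandas as pd

left_df = pd.DataFrame({'Courses': ["Spark", "PySpark", "Python", "pandas", "Java"],
                        'Fee': [20000, 25000, 30000, 24000, 40000],
                        'Duration': ['30day', '40days', '60days', '55days', '50days']})
right_df = pd.DataFrame({'Courses': ["Java", "PySpark", "Python", "pandas", "Hyperion", "html"],
                         'Fee': [20000, 25000, 30000, 24000, 40000, 4000],
                         'Percentage': ['10%', '20%', '25%', '20%', '10%', '50%']})

# Same single-key merge, but with custom suffixes for the overlapping Fee column
result = pd.merge(left_df, right_df, on="Courses", suffixes=("_left", "_right"))
print(list(result.columns))
# ['Courses', 'Fee_left', 'Duration', 'Fee_right', 'Percentage']
```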

## **Example: Use pandas.merge() on Multiple Columns**

You can also explicitly specify the columns to join on by using the on parameter of the merge() function. To merge on multiple columns, pass a list of column names.

In [8]:

# Use pandas.merge() on multiple columns
df_result = pd.merge(left_df,right_df, on=['Courses','Fee'])
print("After merging the DataFrames:\n", df_result)

After merging the DataFrames:
    Courses    Fee Duration Percentage
0  PySpark  25000   40days        20%
1   Python  30000   60days        25%
2   pandas  24000   55days        20%


In [9]:
df_result.shape

(3, 4)
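
The inner join on two keys silently drops every row that does not match on both Courses and Fee. As a sketch of how to inspect those rows, the example below runs an outer join with indicator=True, which adds a _merge column showing where each row came from. The names outer and counts are introduced here for illustration.

```python
import pandas as pd

left_df = pd.DataFrame({'Courses': ["Spark", "PySpark", "Python", "pandas", "Java"],
                        'Fee': [20000, 25000, 30000, 24000, 40000],
                        'Duration': ['30day', '40days', '60days', '55days', '50days']})
right_df = pd.DataFrame({'Courses': ["Java", "PySpark", "Python", "pandas", "Hyperion", "html"],
                         'Fee': [20000, 25000, 30000, 24000, 40000, 4000],
                         'Percentage': ['10%', '20%', '25%', '20%', '10%', '50%']})

# Outer join keeps unmatched rows; _merge is 'both', 'left_only', or 'right_only'
outer = pd.merge(left_df, right_df, on=['Courses', 'Fee'], how='outer', indicator=True)

# Count rows per origin
counts = outer['_merge'].value_counts().to_dict()
print(counts)

# Rows that exist on only one side
print(outer[outer['_merge'] != 'both'])
```

Java appears in both DataFrames but with different fees (40000 and 20000), so it shows up once as left_only and once as right_only rather than as a match.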

## **Example: Use pandas.merge() When Column Names Are Different**
When the join columns have different names in the left and right DataFrames, use the left_on and right_on parameters.

Set left_on to the name of the column in the left DataFrame and right_on to the name of the corresponding column in the right DataFrame.

Both parameters also accept a list of column names when you want to merge on multiple columns. The two lists are paired by position, so they must have the same length. In the first example below, both lists name the same columns, so the result matches a left join with on=['Courses','Fee'].

In [None]:
result = pd.merge(left_df, right_df, how='left', left_on=['Courses','Fee'], right_on = ['Courses','Fee'])
print("After merging the DataFrames:\n", result)

After merging the DataFrames:
    Courses    Fee Duration Percentage
0    Spark  20000    30day        NaN
1  PySpark  25000   40days        20%
2   Python  30000   60days        25%
3   pandas  24000   55days        20%
4     Java  40000   50days        NaN


In [12]:
print(left_df)
print(right_df)

result_2 = pd.merge(left_df, right_df, how='left', left_on=['Courses','Duration'], right_on = ['Courses','Percentage'])
print("After merging the DataFrames:\n", result_2)

   Courses    Fee Duration
0    Spark  20000    30day
1  PySpark  25000   40days
2   Python  30000   60days
3   pandas  24000   55days
4     Java  40000   50days
    Courses    Fee Percentage
0      Java  20000        10%
1   PySpark  25000        20%
2    Python  30000        25%
3    pandas  24000        20%
4  Hyperion  40000        10%
5      html   4000        50%
After merging the DataFrames:
    Courses  Fee_x Duration  Fee_y Percentage
0    Spark  20000    30day    NaN        NaN
1  PySpark  25000   40days    NaN        NaN
2   Python  30000   60days    NaN        NaN
3   pandas  24000   55days    NaN        NaN
4     Java  40000   50days    NaN        NaN


In [None]:
result.shape

(5, 4)
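
In the examples above, the key columns happen to share names on both sides. The sketch below shows the case left_on and right_on are really meant for: the right DataFrame uses different key names. The renamed columns CourseName and Price are hypothetical and exist only for this illustration.

```python
import pandas as pd

left_df = pd.DataFrame({'Courses': ["Spark", "PySpark", "Python", "pandas", "Java"],
                        'Fee': [20000, 25000, 30000, 24000, 40000],
                        'Duration': ['30day', '40days', '60days', '55days', '50days']})

# Hypothetical right-hand table whose key columns are named differently
right_renamed = pd.DataFrame({'CourseName': ["Java", "PySpark", "Python", "pandas", "Hyperion", "html"],
                              'Price': [20000, 25000, 30000, 24000, 40000, 4000],
                              'Percentage': ['10%', '20%', '25%', '20%', '10%', '50%']})

# Pair Courses with CourseName and Fee with Price, by position in the lists
result = pd.merge(left_df, right_renamed, how='left',
                  left_on=['Courses', 'Fee'], right_on=['CourseName', 'Price'])
print(list(result.columns))
# ['Courses', 'Fee', 'Duration', 'CourseName', 'Price', 'Percentage']

# Both sets of key columns are kept; drop the right-hand copies if they are redundant
tidy = result.drop(columns=['CourseName', 'Price'])
print(tidy)
```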

## **Example: Checking for duplicate keys**

- We can use the validate argument to automatically check whether there are unexpected duplicates in the merge keys. Key uniqueness is checked before the merge operation, which should protect against memory overflows caused by an unintended many-to-many join. Checking key uniqueness is also a good way to ensure that the data is structured as expected.

- In the following example, there are duplicate values of B in the right DataFrame. Because this is not a one-to-one merge, as required by the validate argument, an exception will be raised.

In [None]:
left = pd.DataFrame({"A": [1, 2], "B": [1, 2]})

right = pd.DataFrame({"A": [4, 5, 6], "B": [2, 2, 2]})

result = pd.merge(left, right, on="B", how="inner", validate="one_to_one")

MergeError: Merge keys are not unique in right dataset; not a one-to-one merge

If we are aware of the duplicates in the right DataFrame but want to ensure there are no duplicates in the left DataFrame, we can use the **validate='one_to_many'** argument instead, which will not raise an exception.

In [None]:
result = pd.merge(left, right, on="B", how="inner", validate="one_to_many")
print(result)

   A_x  B  A_y
0    2  2    4
1    2  2    5
2    2  2    6
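
As a sketch of how the validate checks above might be used in practice, the example below first inspects each key column for duplicates and then catches the exception from a failed one-to-one merge instead of letting it stop the program. The flag names left_has_dupes, right_has_dupes, and error_raised are introduced here for illustration.

```python
import pandas as pd

left = pd.DataFrame({"A": [1, 2], "B": [1, 2]})
right = pd.DataFrame({"A": [4, 5, 6], "B": [2, 2, 2]})

# Check each side's key column for repeated values before merging
left_has_dupes = bool(left["B"].duplicated().any())
right_has_dupes = bool(right["B"].duplicated().any())
print(left_has_dupes, right_has_dupes)  # False True

# validate="one_to_one" raises MergeError because B repeats in the right DataFrame
try:
    pd.merge(left, right, on="B", how="inner", validate="one_to_one")
    error_raised = False
except pd.errors.MergeError as exc:
    error_raised = True
    print("Merge rejected:", exc)
```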


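## **Example: Merging sales transactions with product details**

Consider a scenario where you have information about sales transactions and product details, and you want to combine the two datasets. Only 'ProductID' appears in both of them ('StoreID' exists only in the sales data), so the merge uses 'ProductID' as the key.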

In [13]:
# Create a DataFrame with sales transactions
sales_data = {
    'TransactionID': [1, 2, 3, 4, 5],
    'ProductID': [101, 102, 103, 101, 105],
    'StoreID': [1, 2, 1, 3, 2],
    'Quantity': [5, 3, 2, 4, 1],
    'Amount': [500.00, 300.00, 200.00, 400.00, 150.00]
}

df_sales = pd.DataFrame(sales_data)

# Create a DataFrame with product details
products_data = {
    'ProductID': [101, 102, 103, 104, 105],
    'ProductName': ['Laptop', 'Headphones', 'Smartphone', 'Tablet', 'Monitor'],
    'Category': ['Electronics', 'Electronics', 'Electronics', 'Electronics', 'Electronics']
}

df_products = pd.DataFrame(products_data)

# Merge the DataFrames on 'ProductID', the only key column present in both
df_combined = pd.merge(df_sales, df_products, on='ProductID', how='left')

# Display the combined DataFrame
print("Combined DataFrame:")
df_combined

Combined DataFrame:


   TransactionID  ProductID  StoreID  Quantity  Amount ProductName     Category
0              1        101        1         5   500.0      Laptop  Electronics
1              2        102        2         3   300.0  Headphones  Electronics
2              3        103        1         2   200.0  Smartphone  Electronics
3              4        101        3         4   400.0      Laptop  Electronics
4              5        105        2         1   150.0     Monitor  Electronics
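
The product table has no StoreID column, so the merge above can only use ProductID. To sketch a merge on both ProductID and StoreID, the example below assumes a second lookup table keyed by product and store. The df_store_prices DataFrame, its UnitPrice column, and its values are hypothetical and are not part of the lab data.

```python
import pandas as pd

df_sales = pd.DataFrame({
    'TransactionID': [1, 2, 3, 4, 5],
    'ProductID': [101, 102, 103, 101, 105],
    'StoreID': [1, 2, 1, 3, 2],
    'Quantity': [5, 3, 2, 4, 1],
})

# Hypothetical per-store price list: one row per (ProductID, StoreID) pair
df_store_prices = pd.DataFrame({
    'ProductID': [101, 101, 102, 103],
    'StoreID': [1, 3, 2, 1],
    'UnitPrice': [100.0, 105.0, 100.0, 100.0],
})

# Left join on the composite key; many_to_one checks that each
# (ProductID, StoreID) pair is unique in the price list
df_priced = pd.merge(df_sales, df_store_prices,
                     on=['ProductID', 'StoreID'],
                     how='left', validate='many_to_one')
print(df_priced)
```

Transaction 5 (ProductID 105 in StoreID 2) has no entry in the price list, so its UnitPrice is NaN after the left join.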
