# **7. Merging, Joining & Concatenation**

## **2. `pd.merge()` – SQL-style merging**

In [1]:
import pandas as pd

`pd.merge()` is indispensable in data science workflows, where you frequently combine data from **multiple sources** using one or more **common keys**, much like SQL joins.

## ✅ 1. What `pd.merge()` Does and When to Use It

### 🔹 What It Does:

`pd.merge()` allows you to combine two pandas `DataFrames` based on **one or more common columns or indexes** using **SQL-style join operations** like:

* Inner join
* Left join
* Right join
* Outer join

🔗 It performs **relational joins**, enabling you to combine datasets that share some form of relationship (primary key–foreign key style).
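
As a minimal sketch of joining on more than one common column, the example below merges two small, made-up tables on a composite key. The `sales` and `targets` DataFrames and their column names are assumptions chosen for illustration only.

```python
import pandas as pd

# Hypothetical data: sales and targets keyed by (region, year).
sales = pd.DataFrame({
    'region': ['East', 'East', 'West'],
    'year': [2023, 2024, 2023],
    'revenue': [100, 120, 90],
})
targets = pd.DataFrame({
    'region': ['East', 'West', 'West'],
    'year': [2024, 2023, 2024],
    'target': [110, 95, 105],
})

# Passing a list to `on` matches rows only when every key column agrees.
merged = pd.merge(sales, targets, on=['region', 'year'], how='inner')
print(merged)
```

Only the (East, 2024) and (West, 2023) pairs appear in both tables, so the inner join keeps those two rows.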


### 🔹 When to Use:

* When two datasets share **common identifiers** (e.g., customer ID, product code, user email).
* When you want **fine control** over how datasets are matched and merged.
* When performing **multi-table joins** like SQL.


## 🧾 2. Syntax and Core Parameters

```python
pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None,
         left_index=False, right_index=False, sort=False,
         suffixes=('_x', '_y'), indicator=False, validate=None)
```

### 🔹 Core Parameters:

| Parameter                   | Description                                             |
| --------------------------- | ------------------------------------------------------- |
| `left`, `right`             | DataFrames to merge                                     |
| `how`                       | Type of join: `'inner'`, `'left'`, `'right'`, `'outer'` |
| `on`                        | Column(s) to join on (must be present in both)          |
| `left_on`, `right_on`       | Use when joining on different column names              |
| `left_index`, `right_index` | Join on index instead of columns                        |
| `suffixes`                  | Add suffixes for overlapping column names               |
| `indicator`                 | Adds a column to show merge source                      |
| `validate`                  | Check for 1:1, 1:m, m:1, or m:m relationships           |
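
The `validate` parameter is easiest to understand by watching it fail. The sketch below uses made-up `customers` and `orders` tables (assumptions for illustration) in which one key is duplicated on the left side, so a one-to-one check should be rejected while a many-to-one check passes.

```python
import pandas as pd

# Hypothetical data: cust_id 2 appears twice on the left side.
customers = pd.DataFrame({'cust_id': [1, 2, 2], 'name': ['Alice', 'Bob', 'Bobby']})
orders = pd.DataFrame({'cust_id': [1, 2], 'order_id': [101, 102]})

# 'one_to_one' requires unique keys on both sides, so this merge is rejected.
try:
    pd.merge(customers, orders, on='cust_id', validate='one_to_one')
    validation_failed = False
except pd.errors.MergeError as exc:
    validation_failed = True
    print(f"Validation failed: {exc}")

# 'many_to_one' only requires unique keys on the right, so this merge succeeds.
result = pd.merge(customers, orders, on='cust_id', validate='many_to_one')
print(result)
```

Failing early like this is usually preferable to discovering duplicated rows further down the pipeline.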


## 🧠 3. Different Methods & Techniques

Let’s explore the most used join types and methods with `pd.merge()`:

---

### 🔸 A. Inner Join (default) – Intersection

```python
pd.merge(df1, df2, on='id', how='inner')
```

* Keeps only rows with keys present in **both** DataFrames.

---

### 🔸 B. Left Join – Left-preserving

```python
pd.merge(df1, df2, on='id', how='left')
```

* Keeps **all rows from the left** DataFrame and adds matching values from the right; unmatched rows get NaN.

---

### 🔸 C. Right Join – Right-preserving

```python
pd.merge(df1, df2, on='id', how='right')
```

* Keeps **all rows from the right** DataFrame and adds matching values from the left; unmatched rows get NaN.

---

### 🔸 D. Outer Join – Union

```python
pd.merge(df1, df2, on='id', how='outer')
```

* Keeps **all rows** from both, fills unmatched with NaN.
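
To compare the four join types side by side, the sketch below runs each one on the same pair of small DataFrames. The contents of `df1` and `df2` are assumptions made for illustration; only `id` values 2 and 3 appear in both.

```python
import pandas as pd

# Hypothetical data: ids 2 and 3 overlap, 1 is left-only, 4 is right-only.
df1 = pd.DataFrame({'id': [1, 2, 3], 'left_val': ['a', 'b', 'c']})
df2 = pd.DataFrame({'id': [2, 3, 4], 'right_val': ['x', 'y', 'z']})

row_counts = {}
for how in ['inner', 'left', 'right', 'outer']:
    merged = pd.merge(df1, df2, on='id', how=how)
    row_counts[how] = len(merged)
    # Show which ids survive each join type.
    print(how, sorted(merged['id'].tolist()))
```

The inner join keeps the two shared ids, the left and right joins keep three rows each, and the outer join keeps all four ids.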

---

### 🔸 E. Merging on Different Column Names

```python
pd.merge(df1, df2, left_on='cust_id', right_on='id')
```

---

### 🔸 F. Merging on Index

```python
pd.merge(df1, df2, left_index=True, right_index=True)
```

---

### 🔸 G. Adding Merge Indicator

```python
pd.merge(df1, df2, on='id', how='outer', indicator=True)
```

Adds a `_merge` column showing where each row originated: `'left_only'`, `'right_only'`, or `'both'`.
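
One common use of the indicator column is an anti-join: finding keys that exist in one table but not the other. The sketch below assumes two tiny DataFrames invented for illustration.

```python
import pandas as pd

# Hypothetical data: id 1 exists only on the left, id 4 only on the right.
df1 = pd.DataFrame({'id': [1, 2, 3]})
df2 = pd.DataFrame({'id': [2, 3, 4]})

merged = pd.merge(df1, df2, on='id', how='outer', indicator=True)
print(merged)

# Anti-join: keep rows whose key appears only in the left DataFrame.
left_only = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')
print(left_only)
```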

---

### 🔸 H. Handling Overlapping Columns

```python
pd.merge(df1, df2, on='id', suffixes=('_left', '_right'))
```
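
The effect of `suffixes` is clearest when both DataFrames carry a non-key column with the same name. The sketch below assumes two made-up tables that each have a `score` column and compares the default suffixes with custom ones.

```python
import pandas as pd

# Hypothetical data: both tables have a 'score' column besides the key.
df1 = pd.DataFrame({'id': [1, 2], 'score': [85, 90]})
df2 = pd.DataFrame({'id': [1, 2], 'score': [70, 95]})

# Default behavior: overlapping columns get '_x' and '_y' appended.
default = pd.merge(df1, df2, on='id')
print(list(default.columns))

# Custom suffixes make the origin of each column explicit.
custom = pd.merge(df1, df2, on='id', suffixes=('_left', '_right'))
print(list(custom.columns))
```

Neither column is lost in either case; the suffixes only control how the two copies are named.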


## ⚠️ 4. Common Pitfalls and Best Practices

| Pitfall                                      | What to Do                                                    |
| -------------------------------------------- | ------------------------------------------------------------- |
| ❌ Joining on wrong column(s) or no match     | ✅ Double-check keys using `.columns` and `.unique()`          |
| ❌ Unexpected duplicates in output            | ✅ Use `validate='one_to_one'`, `validate='one_to_many'`, etc. |
| ❌ Overlapping column names get vague `_x`/`_y` suffixes | ✅ Pass descriptive `suffixes=('_left', '_right')` to disambiguate |
| ❌ Missing matches due to datatype mismatch   | ✅ Use `.astype()` to ensure matching column types             |
| ❌ Using the wrong join type (e.g., left vs inner) | ✅ Decide which table should drive the merge before choosing `how` |
| ❌ Memory issues on large merges              | ✅ Consider filtering, indexing, or chunking data before merge |


## 🧪 5. Examples on Real/Pseudo Data

### Example 1: **Inner Join on a Common Key**

In [3]:
customers = pd.DataFrame({
    'cust_id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Charlie']
})

orders = pd.DataFrame({
    'cust_id': [2, 3, 4],
    'order_id': [201, 202, 203]
})

display(customers)
display(orders)

|     | cust_id | name    |
| --- | ------- | ------- |
| 0   | 1       | Alice   |
| 1   | 2       | Bob     |
| 2   | 3       | Charlie |


|     | cust_id | order_id |
| --- | ------- | -------- |
| 0   | 2       | 201      |
| 1   | 3       | 202      |
| 2   | 4       | 203      |


In [5]:
pd.merge(left=customers, right=orders, on='cust_id', how='inner')

|     | cust_id | name    | order_id |
| --- | ------- | ------- | -------- |
| 0   | 2       | Bob     | 201      |
| 1   | 3       | Charlie | 202      |


### Example 2: **Left Join with Missing Match**

In [6]:
pd.merge(left=customers, right=orders, on='cust_id', how='left')

|     | cust_id | name    | order_id |
| --- | ------- | ------- | -------- |
| 0   | 1       | Alice   | NaN      |
| 1   | 2       | Bob     | 201.0    |
| 2   | 3       | Charlie | 202.0    |


### Example 3: **Join on Different Column Names**

In [7]:
df1 = pd.DataFrame({'uid': [1, 2], 'score': [85, 90]})
df2 = pd.DataFrame({'user_id': [2, 3], 'grade': ['A', 'B']})

display(df1)
display(df2)

|     | uid | score |
| --- | --- | ----- |
| 0   | 1   | 85    |
| 1   | 2   | 90    |


|     | user_id | grade |
| --- | ------- | ----- |
| 0   | 2       | A     |
| 1   | 3       | B     |


In [8]:
pd.merge(left=df1, right=df2, left_on='uid', right_on='user_id', how='inner')

|     | uid | score | user_id | grade |
| --- | --- | ----- | ------- | ----- |
| 0   | 2   | 90    | 2       | A     |


### Example 4: **Merge with Index**

In [9]:
df1 = pd.DataFrame({'val1': [100, 200]}, index=['a', 'b'])
df2 = pd.DataFrame({'val2': [300, 400]}, index=['a', 'c'])

display(df1, df2)

|     | val1 |
| --- | ---- |
| a   | 100  |
| b   | 200  |


|     | val2 |
| --- | ---- |
| a   | 300  |
| c   | 400  |


In [12]:
pd.merge(df1, df2, left_index=True, right_index=True, how='outer')

|     | val1  | val2  |
| --- | ----- | ----- |
| a   | 100.0 | 300.0 |
| b   | 200.0 | NaN   |
| c   | NaN   | 400.0 |


## 🌍 6. Real-World Use Cases

| Scenario                                          | Description                                |
| ------------------------------------------------- | ------------------------------------------ |
| 🧾 **Join customer profiles with transactions**   | Match on customer ID                       |
| 📦 **Product catalog and inventory levels**       | Match on product SKU                       |
| 🕵️‍♂️ **User logs and demographics**             | Match on user ID or email                  |
| 📅 **Time series from different sources**         | Merge by timestamp or datetime index       |
| 📈 **Model prediction results with ground truth** | Merge on record ID for evaluation          |
| 📊 **Sales and marketing data**                   | Match campaign ID, region, or channel info |
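
As one illustration of these scenarios, the sketch below merges model predictions with ground truth on a record ID and computes accuracy. The `predictions` and `ground_truth` tables, their column names, and their values are all assumptions invented for this example.

```python
import pandas as pd

# Hypothetical data: one prediction and one true label per record.
predictions = pd.DataFrame({'record_id': [101, 102, 103], 'predicted': [1, 0, 1]})
ground_truth = pd.DataFrame({'record_id': [101, 102, 103], 'actual': [1, 1, 1]})

# validate='one_to_one' guards against duplicated record IDs inflating the result.
evaluation = pd.merge(predictions, ground_truth, on='record_id', validate='one_to_one')

# Fraction of records where the prediction equals the true label.
accuracy = (evaluation['predicted'] == evaluation['actual']).mean()
print(evaluation)
print(f"accuracy = {accuracy:.2f}")
```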


## 📌 Summary Table

| Feature      | Summary                                                    |
| ------------ | ---------------------------------------------------------- |
| Purpose      | SQL-style join for relational data                         |
| Join Types   | `inner`, `left`, `right`, `outer`                          |
| Match On     | Columns or index                                           |
| Key Params   | `on`, `left_on`, `right_on`, `suffixes`, `indicator`       |
| When to Use  | When datasets share common keys/IDs                        |
| Alternatives | `concat()` (for stacking) or `join()` (simpler index join) |
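
To make the relationship with `join()` concrete, the sketch below checks that an index-on-index `pd.merge()` and `DataFrame.join()` give the same result for a simple outer join. It reuses the small `val1`/`val2` frames from the index example and is meant as an illustration, not a general equivalence proof.

```python
import pandas as pd

df1 = pd.DataFrame({'val1': [100, 200]}, index=['a', 'b'])
df2 = pd.DataFrame({'val2': [300, 400]}, index=['a', 'c'])

# Index-on-index merge, spelled out explicitly.
via_merge = pd.merge(df1, df2, left_index=True, right_index=True, how='outer')

# The same operation through the shorter join() interface.
via_join = df1.join(df2, how='outer')

print(via_merge.equals(via_join))
```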

