# **7. Merging, Joining & Concatenation**

## 1. `pd.concat()` – Concatenation of DataFrames

In [2]:
import pandas as pd

`pd.concat()` is a powerful pandas tool for working with **multiple datasets** that you want to **stack or combine**, either **vertically** (stacking rows) or **horizontally** (adding columns).

## ✅ 1. What `pd.concat()` Does and When to Use It

### 🔹 What It Does:

`pd.concat()` **combines multiple pandas objects** (like `DataFrames` or `Series`) **along a particular axis (rows or columns)**.

* Think of it as **stacking** or **gluing** datasets together.
* It’s **not SQL-like** (unlike `merge()`): it **doesn't require key columns**. Inputs are aligned on their **labels** along the other axis (column names when stacking rows, index labels when adding columns).

### 🔹 When to Use:

* You want to:

  * Stack data **vertically** (e.g., monthly sales reports).
  * Append features side-by-side (horizontally).
  * Combine datasets **without complex key logic**.
* Common in **ETL pipelines** where data comes in chunks (e.g., log files, time-based data).


## 🧾 2. Syntax and Core Parameters

```python
pd.concat(objs, axis=0, join='outer', ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, sort=False)
```

### 🔹 Core Parameters:

| Parameter          | Description                                                                                      |
| ------------------ | ------------------------------------------------------------------------------------------------ |
| `objs`             | A list/tuple (or dict) of pandas objects to concatenate                                          |
| `axis`             | 0 → row-wise (vertical), 1 → column-wise (horizontal)                                            |
| `join`             | `'outer'` (union, default) or `'inner'` (intersection) of labels on the other axis               |
| `ignore_index`     | If `True`, discard labels along the concatenation axis and number the result 0 to n-1            |
| `keys`             | Labels, one per input, that become the outermost level of a hierarchical index                   |
| `verify_integrity` | If `True`, raise `ValueError` when the concatenated axis contains duplicate labels               |
| `sort`             | If `True`, sort the non-concatenation axis when the inputs are not already aligned               |
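To make the `join` and `sort` parameters concrete, here is a minimal sketch. The frames `left` and `right` and their column names are made up for illustration.

```python
import pandas as pd

# Hypothetical frames with partially overlapping columns.
left = pd.DataFrame({"b": [1, 2], "a": [3, 4]})
right = pd.DataFrame({"a": [5, 6], "c": [7, 8]})

# join='outer' (default): union of columns, gaps filled with NaN.
outer = pd.concat([left, right], join="outer", ignore_index=True)

# join='inner': only columns present in every input survive.
inner = pd.concat([left, right], join="inner", ignore_index=True)

# sort=True orders the non-concatenation axis (here, the column labels).
outer_sorted = pd.concat([left, right], sort=True, ignore_index=True)

print(outer)
print(inner)
print(outer_sorted)
```

The outer result has four rows and three columns, with `NaN` wherever an input lacked a column; the inner result keeps only column `a`.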


## 🧠 3. Different Methods & Techniques Using `pd.concat()`

### 🔸 A. Vertical Concatenation (Stacking Rows)

```python
result = pd.concat([df1, df2], axis=0)
```

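* Each input keeps its original index labels, so the result can contain duplicate labels.
* Columns are aligned by name; a column missing from one input is filled with `NaN`.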
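A small sketch of both behaviours, using hypothetical frames `week1` and `week2` where only the second has a `Returns` column:

```python
import pandas as pd

# Hypothetical weekly reports; only week2 tracks returns.
week1 = pd.DataFrame({"Product": ["A", "B"], "Sales": [100, 150]})
week2 = pd.DataFrame({"Product": ["A", "B"], "Sales": [200, 120], "Returns": [5, 3]})

stacked = pd.concat([week1, week2], axis=0)

# Original labels survive, so 0 and 1 each appear twice.
print(stacked.index.tolist())

# Because label 0 is duplicated, .loc[0] selects two rows, not one.
rows_labeled_zero = stacked.loc[0]
print(rows_labeled_zero)

# 'Returns' is NaN for the rows that came from week1.
print(stacked["Returns"].isna().sum())
```

Duplicate labels are easy to overlook, which is why `ignore_index=True` is often the safer choice when the original index carries no meaning.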

### 🔸 B. Horizontal Concatenation (Adding Columns)

```python
result = pd.concat([df1, df2], axis=1)
```

* Rows are aligned on index labels, not on position.
* Works like a side-by-side append when the indexes match; labels missing from one input yield `NaN`.
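The sketch below shows what happens when the indexes do not match, and one possible way to pair rows by position instead. The frames and their index values are invented for the example.

```python
import pandas as pd

# Hypothetical frames whose indexes only partly overlap.
people = pd.DataFrame({"Name": ["Alice", "Bob"]}, index=[0, 1])
marks = pd.DataFrame({"Score": [88, 92]}, index=[1, 2])

# Alignment is by label: the result has labels 0, 1 and 2,
# and only label 1 has values from both inputs.
by_label = pd.concat([people, marks], axis=1)
print(by_label)

# If the rows should be paired by position, drop the old labels first.
by_position = pd.concat(
    [people.reset_index(drop=True), marks.reset_index(drop=True)],
    axis=1,
)
print(by_position)
```

Resetting the index is only appropriate when row order is known to correspond across inputs; otherwise a key-based `merge()` is the safer tool.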

### 🔸 C. Ignoring Index

```python
pd.concat([df1, df2], ignore_index=True)
```

* Discards the original index labels and assigns a fresh `RangeIndex` (0 to n-1).

### 🔸 D. Inner Join on Columns/Index

```python
pd.concat([df1, df2], join='inner', axis=0)
```

* With `axis=0`, only columns present in every input are kept; with `axis=1`, only index labels present in every input are kept.
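The snippet above covers the row-wise case. As a complementary sketch, here is an inner join along `axis=1`, where the intersection is taken over index labels. The `prices` and `stock` frames are hypothetical.

```python
import pandas as pd

# Hypothetical product tables indexed by product code.
prices = pd.DataFrame({"Price": [10.0, 12.5, 8.0]}, index=["A", "B", "C"])
stock = pd.DataFrame({"Stock": [40, 15]}, index=["B", "C"])

# Only index labels found in both inputs ("B" and "C") are retained.
common_rows = pd.concat([prices, stock], axis=1, join="inner")
print(common_rows)
```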

### 🔸 E. Using `keys` to Create MultiIndex

```python
pd.concat([df1, df2], keys=['Q1', 'Q2'])
```

* Useful to identify source dataset in stacked output.
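Once the keys are in place, they can be used to pull one source back out or be turned into an ordinary column. A minimal sketch, where the level names `Quarter` and `Row` are chosen for illustration:

```python
import pandas as pd

q1 = pd.DataFrame({"Sales": [100, 200]})
q2 = pd.DataFrame({"Sales": [150, 250]})

# names= labels the index levels created by keys= (names are assumptions).
combined = pd.concat([q1, q2], keys=["Q1", "Q2"], names=["Quarter", "Row"])

# Select every row that came from the second input.
q2_only = combined.loc["Q2"]
print(q2_only)

# Move the source key out of the index into a regular column.
flat = combined.reset_index(level="Quarter")
print(flat)
```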

### 🔸 F. Verifying Index Integrity

```python
pd.concat([df1, df2], verify_integrity=True)
```

* Checks the concatenated axis for duplicate labels and raises `ValueError` if any are found.
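A short sketch of the check in action. Both hypothetical frames use the default index (0, 1), so a plain vertical concat would produce duplicate labels:

```python
import pandas as pd

# Both frames carry the default labels 0 and 1.
first = pd.DataFrame({"x": [1, 2]})
second = pd.DataFrame({"x": [3, 4]})

try:
    pd.concat([first, second], verify_integrity=True)
    raised = False
except ValueError as exc:
    raised = True
    print(f"concat refused: {exc}")

# Renumbering with ignore_index=True avoids the overlap entirely.
safe = pd.concat([first, second], ignore_index=True)
print(safe.index.is_unique)
```

The check adds some overhead, so it is probably best reserved for cases where unique labels are a real requirement.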


## ⚠️ 4. Common Pitfalls and Best Practices

| Pitfall                                           | Recommendation                                                |
| ------------------------------------------------- | ------------------------------------------------------------- |
| Misaligned indices causing unexpected NaNs        | Always inspect index alignment before horizontal concat       |
| Not resetting index after vertical concat         | Use `ignore_index=True` if original index is meaningless      |
| Using `concat()` when merge/join is better suited | Only use `concat()` when you don't need key-based logic       |
| Duplicate column names after horizontal concat    | `concat()` keeps both copies; rename columns or pass `keys` first |
| Memory issues with very large datasets            | Concatenate in batches or use Dask for out-of-core processing |
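The column-name pitfall is easy to reproduce. In this sketch (frame and column names are illustrative), horizontal concat keeps both copies of each shared column, and `keys` is one possible way to keep them distinguishable:

```python
import pandas as pd

# Hypothetical frames that share both column names.
left = pd.DataFrame({"id": [1, 2], "value": [10, 20]})
right = pd.DataFrame({"id": [1, 2], "value": [30, 40]})

# Nothing is overwritten: the result simply has repeated column names.
wide = pd.concat([left, right], axis=1)
print(wide.columns.tolist())
has_dupes = bool(wide.columns.duplicated().any())

# One remedy: label each source so the columns form a MultiIndex.
labeled = pd.concat([left, right], axis=1, keys=["left", "right"])
print(labeled[("right", "value")].tolist())
```

Renaming columns before concatenating, or using `merge()` with `suffixes`, are alternatives when a flat column index is preferred.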


## 🧪 5. Examples on Real/Pseudo Data

### ✅ Vertical Concatenation – Stacking Monthly Sales

In [3]:
jan = pd.DataFrame({'Product': ['A', 'B'], 'Sales': [100, 150]})
feb = pd.DataFrame({'Product': ['A', 'B'], 'Sales': [200, 120]})

display(jan)
display(feb)

|     | Product | Sales |
| --- | ------- | ----- |
| 0   | A       | 100   |
| 1   | B       | 150   |


|     | Product | Sales |
| --- | ------- | ----- |
| 0   | A       | 200   |
| 1   | B       | 120   |


In [7]:
pd.concat([jan, feb], axis=0)

|     | Product | Sales |
| --- | ------- | ----- |
| 0   | A       | 100   |
| 1   | B       | 150   |
| 0   | A       | 200   |
| 1   | B       | 120   |


In [6]:
pd.concat([jan, feb], axis=0, ignore_index=True)

|     | Product | Sales |
| --- | ------- | ----- |
| 0   | A       | 100   |
| 1   | B       | 150   |
| 2   | A       | 200   |
| 3   | B       | 120   |


### ✅ Horizontal Concatenation – Adding Features

In [8]:
info = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})
score = pd.DataFrame({'Score': [88, 92]})

display(info)
display(score)

|     | Name  | Age |
| --- | ----- | --- |
| 0   | Alice | 25  |
| 1   | Bob   | 30  |


|     | Score |
| --- | ----- |
| 0   | 88    |
| 1   | 92    |


In [11]:
pd.concat([info, score], axis=1)

|     | Name  | Age | Score |
| --- | ----- | --- | ----- |
| 0   | Alice | 25  | 88    |
| 1   | Bob   | 30  | 92    |


### ✅ MultiIndex with `keys`

In [12]:
q1 = pd.DataFrame({'Sales': [100, 200]})
q2 = pd.DataFrame({'Sales': [150, 250]})

display(q1)
display(q2)

|     | Sales |
| --- | ----- |
| 0   | 100   |
| 1   | 200   |


|     | Sales |
| --- | ----- |
| 0   | 150   |
| 1   | 250   |


In [13]:
pd.concat([q1, q2], keys=['Q1', 'Q2'])

|     |     | Sales |
| --- | --- | ----- |
| Q1  | 0   | 100   |
| Q1  | 1   | 200   |
| Q2  | 0   | 150   |
| Q2  | 1   | 250   |


## 🌍 6. Real-World Use Cases

| Use Case                    | Description                                             |
| --------------------------- | ------------------------------------------------------- |
| 🔁 **Batch Data Ingestion** | Stack monthly CSV files into one DataFrame              |
| 🧾 **Log Aggregation**      | Combine daily/weekly log files                          |
| 📈 **Feature Engineering**  | Add newly computed columns like scores, predictions     |
| 🧪 **Data Enrichment**      | Concatenate customer profile data from multiple sources |
| 📚 **ETL Pipelines**        | Combine raw data at different stages of cleaning        |
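As a rough sketch of the batch-ingestion case, the example below reads several monthly CSV sources and concatenates them once at the end. The in-memory strings stand in for real files, and the `Month` column is an assumption added so each row can be traced to its source.

```python
import io

import pandas as pd

# Stand-ins for monthly CSV files; in practice these would be file paths.
monthly_csv = {
    "2024-01": "Product,Sales\nA,100\nB,150\n",
    "2024-02": "Product,Sales\nA,200\nB,120\n",
}

frames = []
for month, text in monthly_csv.items():
    frame = pd.read_csv(io.StringIO(text))
    frame["Month"] = month  # tag each row with its (hypothetical) source month
    frames.append(frame)

# Collect first, concatenate once: calling concat inside the loop
# would copy the accumulated data on every iteration.
all_sales = pd.concat(frames, ignore_index=True)
print(all_sales)
```

Building a list and calling `pd.concat()` a single time is generally cheaper than growing a DataFrame piece by piece.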


## 📌 Summary

| Concept      | Details                                               |
| ------------ | ----------------------------------------------------- |
| Method       | `pd.concat()`                                         |
| Use Cases    | Combine multiple DataFrames (vertically/horizontally) |
| Key Controls | `axis`, `ignore_index`, `join`, `keys`                |
| Use When     | No need for key-based logic, simple appending         |
| Alternatives | Use `merge()`/`join()` when working with keys         |

