# **7. Merging, Joining & Concatenation**

## **6. Bonus Concepts: `pd.merge_asof()` & `pd.merge_ordered()`**

In [1]:
import pandas as pd

## 🔹 1. `pd.merge_asof()` – Time-Aware Merge (Nearest Key Merge)

### 1. **What it does and when to use it**

* Matches each row of the left DataFrame to the **closest key** in the right DataFrame instead of requiring an exact match.
* Best suited for **time series data**, such as **log syncing, financial data**, or **sensor measurements**.
* Always behaves like a **left join** (every left row is kept) and requires both DataFrames to be **sorted by the `on` key**.


### 2. **Syntax and key parameters**

```python
pd.merge_asof(left, right, on='timestamp', by='group_col', direction='backward')
```

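| Parameter             | Description                                                                    |
| --------------------- | ------------------------------------------------------------------------------ |
| `on`                  | Column to join on; must be numeric or datetime-like and sorted in both frames  |
| `by`                  | Optional column(s) that must match exactly before the nearest-key search       |
| `direction`           | `'backward'` (default), `'forward'`, or `'nearest'`                            |
| `tolerance`           | Max allowed gap between keys (e.g., `pd.Timedelta('10ms')`)                    |
| `allow_exact_matches` | If `True` (default), equal keys can match; if `False`, equal keys are skipped  |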
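
The parameters above are easier to compare side by side. The following is a minimal sketch on made-up integer keys (the frames `left` and `right` and their column names are hypothetical, not part of the examples in this section); with an integer `on` column, `tolerance` is given as an integer as well.

```python
import pandas as pd

# Hypothetical data: three lookups against three reference rows, both sorted by "t".
left = pd.DataFrame({"t": [1, 5, 10], "left_val": ["a", "b", "c"]})
right = pd.DataFrame({"t": [2, 5, 7], "right_val": [20, 50, 70]})

# Last right row at or before each left key (the default direction).
backward = pd.merge_asof(left, right, on="t", direction="backward")

# First right row at or after each left key.
forward = pd.merge_asof(left, right, on="t", direction="forward")

# Whichever right row is closest in either direction.
nearest = pd.merge_asof(left, right, on="t", direction="nearest")

# Backward search, but discard matches more than 2 units away.
within_2 = pd.merge_asof(left, right, on="t", tolerance=2)

# Backward search that skips right rows with an equal key.
strict = pd.merge_asof(left, right, on="t", allow_exact_matches=False)

for name, frame in [("backward", backward), ("forward", forward),
                    ("nearest", nearest), ("within_2", within_2),
                    ("strict", strict)]:
    print(name, frame["right_val"].tolist())
```

Rows with no qualifying match get `NaN` in the right-hand columns, which is why every left row still appears in each result.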


### 3. **Example**

In [2]:
# Machine event timestamps
events = pd.DataFrame({
    'timestamp': pd.to_datetime(['2023-01-01 09:00:00', '2023-01-01 09:15:00']),
    'event': ['start', 'stop']
})

# Sensor readings
sensors = pd.DataFrame({
    'timestamp': pd.to_datetime(['2023-01-01 08:59:00', '2023-01-01 09:10:00']),
    'temperature': [22.5, 23.0]
})


display(events, sensors)

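|   | timestamp           | event |
| - | ------------------- | ----- |
| 0 | 2023-01-01 09:00:00 | start |
| 1 | 2023-01-01 09:15:00 | stop  |

|   | timestamp           | temperature |
| - | ------------------- | ----------- |
| 0 | 2023-01-01 08:59:00 | 22.5        |
| 1 | 2023-01-01 09:10:00 | 23.0        |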


In [3]:
# Most recent sensor reading at or before each event
pd.merge_asof(events.sort_values('timestamp'),
              sensors.sort_values('timestamp'),
              on='timestamp',
              direction='backward')

|   | timestamp           | event | temperature |
| - | ------------------- | ----- | ----------- |
| 0 | 2023-01-01 09:00:00 | start | 22.5        |
| 1 | 2023-01-01 09:15:00 | stop  | 23.0        |


### 4. **Common Pitfalls**

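* Forgetting to sort both DataFrames by the `on` key; unsorted keys raise a `ValueError`
* Using a key that is not numeric or datetime-like (for example, strings or categoricals), which cannot be ranked by distance
* Misunderstanding `direction`: `'backward'` takes the last right row at or before the left key, `'forward'` the first at or after it, and `'nearest'` whichever is closer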
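
The sorting pitfall is the one most likely to show up in practice. As a rough illustration, the sketch below reuses the event and sensor values from the example above but deliberately stores the events out of order, then shows the error and the fix.

```python
import pandas as pd

# Events deliberately stored out of chronological order.
events = pd.DataFrame({
    "timestamp": pd.to_datetime(["2023-01-01 09:15:00", "2023-01-01 09:00:00"]),
    "event": ["stop", "start"],
})
sensors = pd.DataFrame({
    "timestamp": pd.to_datetime(["2023-01-01 08:59:00", "2023-01-01 09:10:00"]),
    "temperature": [22.5, 23.0],
})

try:
    pd.merge_asof(events, sensors, on="timestamp")
    unsorted_failed = False
except ValueError as exc:
    # pandas refuses to guess an ordering for the left keys.
    unsorted_failed = True
    print(f"merge_asof rejected unsorted input: {exc}")

# Sorting the offending frame by the key resolves the error.
aligned = pd.merge_asof(events.sort_values("timestamp"), sensors, on="timestamp")
print(aligned)
```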


### 5. **Real-World Use Cases**

* Matching logs from one system with another (e.g., app logs and API logs)
* Aligning stock trades with quotes (common in financial data)
* Matching sensor values just before a production event
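
The trade and quote use case typically combines `by` with `tolerance`. Here is a small sketch with invented tickers, times, and prices (all values and column names are hypothetical): each trade picks up the latest quote for the same ticker, provided that quote is at most 50 ms old.

```python
import pandas as pd

# Hypothetical quotes and trades, each already sorted by "time".
quotes = pd.DataFrame({
    "time": pd.to_datetime([
        "2023-01-02 09:30:00.000",
        "2023-01-02 09:30:00.010",
        "2023-01-02 09:30:00.020",
    ]),
    "ticker": ["AAA", "BBB", "AAA"],
    "bid": [10.00, 20.00, 10.05],
})
trades = pd.DataFrame({
    "time": pd.to_datetime([
        "2023-01-02 09:30:00.015",
        "2023-01-02 09:30:00.025",
        "2023-01-02 09:30:00.200",
    ]),
    "ticker": ["BBB", "AAA", "BBB"],
    "price": [20.01, 10.06, 20.10],
})

# Match on ticker exactly, then take the latest quote within 50 ms.
matched = pd.merge_asof(
    trades,
    quotes,
    on="time",
    by="ticker",
    tolerance=pd.Timedelta("50ms"),
)
print(matched)
```

The last trade has no quote within the tolerance window, so its `bid` is left as `NaN` rather than borrowing a stale value.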


## 🔹 2. `pd.merge_ordered()` – Ordered Merge with Fill Logic

### 1. **What it does and when to use it**

* Merges DataFrames and returns rows **in key order** (designed for **chronological merging**), performing an **outer join** by default.
* Useful when merging datasets that include **forecast data** or **events interleaved in time**.

### 2. **Syntax and key parameters**

```python
pd.merge_ordered(left, right, on='timestamp', fill_method='ffill')
```

| Parameter     | Description                                                            |
| ------------- | ---------------------------------------------------------------------- |
| `on`          | Merge column (usually datetime)                                        |
| `how`         | `'outer'` (default), `'inner'`, `'left'`, or `'right'`                 |
| `fill_method` | `'ffill'` to forward fill missing values, or `None` (default) to leave them |
| `suffixes`    | Customize suffix for overlapping column names                          |
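
To see what the parameters change, the sketch below runs the same merge three ways on a tiny made-up dataset (integer `day` keys are used here only to keep the output short; the frame and column names are illustrative).

```python
import pandas as pd

forecast = pd.DataFrame({"day": [1, 3], "forecast": [100, 110]})
actual = pd.DataFrame({"day": [2, 3], "actual": [98, 108]})

# Default: outer join in key order, gaps left as NaN.
outer_raw = pd.merge_ordered(forecast, actual, on="day")

# Same join, but gaps are forward filled from the previous row.
outer_filled = pd.merge_ordered(forecast, actual, on="day", fill_method="ffill")

# Only keys present in both frames.
inner = pd.merge_ordered(forecast, actual, on="day", how="inner")

print(outer_raw, outer_filled, inner, sep="\n\n")
```

Forward filling cannot fill a gap in the first row, because there is no earlier value to carry forward.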

### 3. **Example**

In [4]:
df1 = pd.DataFrame({
    'timestamp': pd.to_datetime(['2023-01-01', '2023-01-03']),
    'forecast': [100, 110]
})

df2 = pd.DataFrame({
    'timestamp': pd.to_datetime(['2023-01-02', '2023-01-03']),
    'actual': [98, 108]
})

display(df1, df2)

|   | timestamp  | forecast |
| - | ---------- | -------- |
| 0 | 2023-01-01 | 100      |
| 1 | 2023-01-03 | 110      |

|   | timestamp  | actual |
| - | ---------- | ------ |
| 0 | 2023-01-02 | 98     |
| 1 | 2023-01-03 | 108    |


In [5]:
pd.merge_ordered(df1, df2, on='timestamp', fill_method='ffill')

|   | timestamp  | forecast | actual |
| - | ---------- | -------- | ------ |
| 0 | 2023-01-01 | 100      | NaN    |
| 1 | 2023-01-02 | 100      | 98.0   |
| 2 | 2023-01-03 | 110      | 108.0  |


### 4. **Common Pitfalls**

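* Using `merge` instead of `merge_ordered` and expecting fill logic (`pd.merge` has no `fill_method` parameter)
* Failing to manage overlapping columns or suffixes
* Expecting backward fill from `fill_method`; only `'ffill'` is accepted, and gaps before the first value stay `NaN`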
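
Two of these pitfalls can be handled explicitly. The sketch below assumes two hypothetical metric frames that share a `value` column: `suffixes` keeps the overlapping columns apart, and a backward fill is applied afterwards with `DataFrame.bfill()`, since `merge_ordered` itself does not offer one.

```python
import pandas as pd

# Hypothetical metrics that both use the column name "value".
cpu = pd.DataFrame({"minute": [1, 3], "value": [0.40, 0.55]})
mem = pd.DataFrame({"minute": [2, 3], "value": [0.70, 0.72]})

# Explicit suffixes make the origin of each column obvious.
merged = pd.merge_ordered(cpu, mem, on="minute", suffixes=("_cpu", "_mem"))

# Backward fill is a separate step on the merged result.
back_filled = merged.bfill()

print(merged)
print(back_filled)
```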


### 5. **Real-World Use Cases**

* Merging forecast and actual values over time
* Filling gaps in time series data from multiple sources
* Combining irregular time series data (e.g., system metrics + events)

## 🔹 3. Best Practices for Large Dataset Merges

* **Index-based joins** (when possible) are often faster than column-based merges.
* Always ensure **key columns are clean and of the same type** (`int`, `str`, `datetime`).
* Use **`merge(..., validate='one_to_one')`** to raise a `MergeError` when keys contain unexpected duplicates.
* Consider **categoricals or indexes** for repeated keys, which can reduce memory use and speed up merges.
* For large datasets, **consider chunking** or **using Dask/Polars** for out-of-core merges.
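
A short sketch of the first three practices, using invented `orders` and `customers` frames (all names and values are hypothetical): the key dtypes are aligned first, `validate` is used to surface duplicate keys, and the same lookup is then expressed as an index-based join.

```python
import pandas as pd
from pandas.errors import MergeError

# Hypothetical frames: the order keys arrive as strings, the customer keys as ints.
orders = pd.DataFrame({"customer_id": ["1", "2", "2"], "amount": [10.0, 20.0, 5.0]})
customers = pd.DataFrame({"customer_id": [1, 2], "name": ["Ana", "Ben"]})

# Align key dtypes before merging.
orders["customer_id"] = orders["customer_id"].astype("int64")

try:
    pd.merge(orders, customers, on="customer_id", validate="one_to_one")
    duplicates_detected = False
except MergeError as exc:
    # customer_id 2 appears twice in orders, so this is not one-to-one.
    duplicates_detected = True
    print(f"validation failed: {exc}")

# The relationship is really many orders to one customer.
joined = pd.merge(orders, customers, on="customer_id", validate="many_to_one")

# Equivalent lookup against an index on the right-hand side.
via_index = orders.join(customers.set_index("customer_id"), on="customer_id")

print(joined)
```

Whether the index-based version is actually faster depends on the data, so it is worth timing both on a representative sample.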

## ✅ Summary Table: Merge Function Comparison

| Function             | Best Use Case                            | Supports Join Types      | Special Behavior              |
| -------------------- | ---------------------------------------- | ------------------------ | ----------------------------- |
| `pd.concat()`        | Combine along axis                       | Outer/inner via `join`   | Simple appending or stacking  |
| `pd.merge()`         | SQL-style joins                          | Yes (inner, outer, etc.) | General-purpose               |
| `df.join()`          | Index-based join                         | Yes                      | Simple syntax for index joins |
| `pd.merge_asof()`    | Time-based nearest merge                 | Left only                | Works only on sorted data     |
| `pd.merge_ordered()` | Time-based ordered merge with fill logic | Outer (default)          | Preserves chronological order |

