# Advanced Pipeline Operations

Beyond basic filtering and assignment, Flow provides advanced operations for manipulating, combining, and analyzing structured data at scale.

These tools help you:

- Sort by xG or timestamp  
- Join datasets (e.g. match metadata + events)  
- Eliminate duplicates  
- Sample subsets for debugging  
- Combine multiple flows  

---

## 🔃 Sorting and Ordering

### `.sort_by()` – Sort Records

Sort records by one or more fields:

```python
from flow import Flow, where_equals
from pprint import pprint

sorted_events = Flow(events).sort_by("timestamp")
```

Sort shots by `shot_xg` in descending order:

```python
shots = (
    Flow(events)
    .filter(where_equals("type_name", "Shot"))
    .sort_by("shot_xg", ascending=False)
)

pprint(shots.head(1))
```

Sort by multiple fields:

```python
Flow(events).sort_by(["team_name", "type_name"], ascending=False)
```

> 💡 Sorting is not a streaming operation: it loads the full flow into memory before the first record is returned.
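
To illustrate why a sort has to materialize its input, here is a minimal plain-Python sketch. It is not Flow's implementation; `sort_records` is a hypothetical helper and the sample records are made up for the example.

```python
from typing import Any, Dict, Iterable, List

Record = Dict[str, Any]


def sort_records(
    records: Iterable[Record], keys: List[str], ascending: bool = True
) -> List[Record]:
    """Hypothetical helper: sort dict records by several keys at once."""
    # sorted() has to consume the whole iterable before it can return the
    # first item, which is why a sort cannot stream record by record.
    return sorted(
        records,
        key=lambda record: tuple(record[k] for k in keys),
        reverse=not ascending,
    )


# A generator stands in for a lazy stream of events (made-up sample data).
events = (
    {"team_name": team, "type_name": event_type}
    for team, event_type in [("City", "Pass"), ("Arsenal", "Shot"), ("City", "Shot")]
)

ranked = sort_records(events, ["team_name", "type_name"], ascending=False)
```

This sketch assumes every record contains every sort key; a record with a missing key would raise `KeyError` here.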

## 📏 Limiting Results

Use `.limit(n)` or `.head(n)` to take the first N records:

```python
top_5 = Flow(events).limit(5)
```
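
The idea of taking the first N records can be sketched lazily in plain Python with `itertools.islice`. Whether Flow's `.limit()` works this way internally is an assumption; `take` and `numbered_events` are hypothetical names used only for illustration.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator

Record = Dict[str, Any]


def take(records: Iterable[Record], n: int) -> Iterator[Record]:
    """Hypothetical helper: lazily yield at most n records."""
    return islice(records, n)


def numbered_events() -> Iterator[Record]:
    """An endless stream of made-up events, to show that nothing extra is read."""
    event_id = 0
    while True:
        event_id += 1
        yield {"event_id": event_id}


# Only three records are ever pulled from the endless stream.
first_three = list(take(numbered_events(), 3))
```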

## 🎯 Sampling

### `.sample_n()` – Random N Records

```python
sample = Flow(events).sample_n(3, seed=42)
```
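
One common way to draw a fixed-size random sample from a stream of unknown length is reservoir sampling. The sketch below shows that technique in plain Python; it is an illustration, not a claim about how `.sample_n()` is implemented, and `reservoir_sample` is a hypothetical helper.

```python
import random
from typing import Any, Dict, Iterable, List, Optional

Record = Dict[str, Any]


def reservoir_sample(
    records: Iterable[Record], n: int, seed: Optional[int] = None
) -> List[Record]:
    """Hypothetical helper: pick n records uniformly at random in one pass."""
    rng = random.Random(seed)  # a private RNG makes the result reproducible
    reservoir: List[Record] = []
    for i, record in enumerate(records):
        if i < n:
            # Fill the reservoir with the first n records.
            reservoir.append(record)
        else:
            # Replace a random slot with probability n / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < n:
                reservoir[j] = record
    return reservoir


events = [{"event_id": i} for i in range(100)]  # made-up sample data
sample_a = reservoir_sample(iter(events), 3, seed=42)
sample_b = reservoir_sample(iter(events), 3, seed=42)
```

Passing the same seed twice yields the same sample, which is what makes a seeded sample useful for repeatable debugging.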

### `.sample_fraction(p)` – Fractional Sampling

```python
sample = Flow(events).sample_fraction(0.2, seed=1)  # each row is kept with probability 0.2, so the sample size varies
```
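
A per-row probability like this is usually called Bernoulli sampling. The following plain-Python sketch shows the idea under that assumption; `bernoulli_sample` is a hypothetical helper, not part of Flow.

```python
import random
from typing import Any, Dict, Iterable, Iterator, Optional

Record = Dict[str, Any]


def bernoulli_sample(
    records: Iterable[Record], p: float, seed: Optional[int] = None
) -> Iterator[Record]:
    """Hypothetical helper: keep each record independently with probability p."""
    rng = random.Random(seed)
    for record in records:
        # rng.random() is uniform on [0, 1), so this test passes with probability p.
        if rng.random() < p:
            yield record


events = [{"event_id": i} for i in range(1000)]  # made-up sample data
sample = list(bernoulli_sample(events, 0.2, seed=1))
```

Because each record is decided independently, the result holds roughly 20% of the input rather than an exact count. The approach streams, so it never needs the full dataset in memory.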

## 🤝 Joining Datasets

Use `.join()` to merge two flows by one or more fields:

```python
events_flow = Flow(events)
matches_flow = Flow(matches)

joined = events_flow.join(
    matches_flow,
    on="match_id",            # Key to match on
    how="left",               # "left" or "inner"
    suffix="_match"           # For conflicting field names
)

print(joined.head(1))
```

### ⚠️ Notes on `.join()`

- The right-hand Flow is fully materialized in memory.
- Only "left" and "inner" joins are supported for now.
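
The first note follows from how a hash join typically works: one side is indexed by key in memory and the other side is streamed past it. Below is a minimal plain-Python sketch of that technique, assuming a single join key. `hash_join` is a hypothetical helper, the sample records are made up, and Flow's own implementation may differ.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, Iterator, List

Record = Dict[str, Any]


def hash_join(
    left: Iterable[Record],
    right: Iterable[Record],
    on: str,
    how: str = "left",
    suffix: str = "_right",
) -> Iterator[Record]:
    """Hypothetical helper: left or inner join of dict records on one key."""
    if how not in ("left", "inner"):
        raise ValueError(f"unsupported join type: {how!r}")

    # Materialize the right side as an index from key value to records.
    index: Dict[Any, List[Record]] = defaultdict(list)
    for row in right:
        if on in row:
            index[row[on]].append(row)

    # Stream the left side and look each record up in the index.
    for row in left:
        matches = index.get(row[on], []) if on in row else []
        if not matches:
            if how == "left":
                yield dict(row)  # unmatched left records survive a left join
            continue
        for match in matches:
            merged = dict(row)
            for field, value in match.items():
                if field == on:
                    continue
                # Conflicting field names from the right side get the suffix.
                merged[field + suffix if field in row else field] = value
            yield merged


events = [
    {"event_id": 1, "match_id": 123, "name": "Pass"},
    {"event_id": 2, "match_id": 999, "name": "Shot"},
]
matches = [{"match_id": 123, "name": "City v Arsenal", "match_date": "2023-10-08"}]

left_joined = list(hash_join(events, matches, on="match_id", how="left", suffix="_match"))
inner_joined = list(hash_join(events, matches, on="match_id", how="inner", suffix="_match"))
```

In this sketch the left join keeps the unmatched event unchanged, while the inner join drops it.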

## ➕ Combining Flows

Use `.concat()` to merge multiple flows:

```python
combined = flow1.concat(flow2, flow3)
```
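
Conceptually, concatenation reads each source to the end before moving on to the next. A plain-Python sketch of that behavior, using `itertools.chain` and a hypothetical `concat_records` helper, might look like this (Flow's internals are not shown in this guide, so treat it as an analogy):

```python
from itertools import chain
from typing import Any, Dict, Iterable, Iterator

Record = Dict[str, Any]


def concat_records(*streams: Iterable[Record]) -> Iterator[Record]:
    """Hypothetical helper: yield all records from each stream, in order."""
    return chain.from_iterable(streams)


# Two lazy sources of made-up events.
first_half = ({"event_id": i} for i in (1, 2))
second_half = ({"event_id": i} for i in (3,))

combined = list(concat_records(first_half, second_half))
```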

## 🚫 Handling Duplicates

### `.distinct()` – Drop Duplicates

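Call `.distinct()` with no arguments to drop records that are identical in every field, or pass field names to treat records as duplicates whenever those fields match: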

```python
unique_events = Flow(events).distinct()

deduped = Flow(events).distinct("player_name", "type_name", keep="first")
```

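Options for `keep`:

- `"first"` (default): keep the first record seen for each key
- `"last"`: keep the last record seen for each key
- `False`: drop every record whose key appears more than once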
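
The three `keep` modes can be sketched in plain Python as follows. This is an illustration of the semantics described above, assuming flat records with hashable values; `distinct_records` is a hypothetical helper and the sample data is made up.

```python
from collections import Counter
from typing import Any, Dict, Iterable, List, Tuple, Union

Record = Dict[str, Any]


def distinct_records(
    records: Iterable[Record], *fields: str, keep: Union[str, bool] = "first"
) -> List[Record]:
    """Hypothetical helper: deduplicate records by the given fields."""
    rows = list(records)  # "last" and False need to see the whole input

    def key_of(row: Record) -> Tuple:
        # With no fields given, every field takes part in the comparison.
        names = fields or tuple(sorted(row))
        return tuple((name, row.get(name)) for name in names)

    if keep is False:
        # Drop every record whose key occurs more than once.
        counts = Counter(key_of(row) for row in rows)
        return [row for row in rows if counts[key_of(row)] == 1]
    if keep not in ("first", "last"):
        raise ValueError(f"unsupported keep value: {keep!r}")

    # For "last", scan backwards so the final occurrence wins, then restore order.
    ordered = rows if keep == "first" else list(reversed(rows))
    seen, kept = set(), []
    for row in ordered:
        key = key_of(row)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept if keep == "first" else list(reversed(kept))


events = [
    {"event_id": 1, "player_name": "Player A", "type_name": "Pass"},
    {"event_id": 2, "player_name": "Player A", "type_name": "Pass"},
    {"event_id": 3, "player_name": "Player B", "type_name": "Shot"},
]

first = distinct_records(events, "player_name", "type_name", keep="first")
last = distinct_records(events, "player_name", "type_name", keep="last")
neither = distinct_records(events, "player_name", "type_name", keep=False)
```

The `seen` set grows with the number of unique keys, so memory use depends on how many distinct key combinations the data contains.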

## 🧾 Extracting Unique Field Values

### `.distinct("field")` for unique values

```python
unique_players = Flow(events).distinct("player_name")
```

For combinations:

```python
unique = Flow(events).distinct("team_name", "type_name")
```

> 💡 `.distinct()` tracks every key combination it has seen, so memory use grows with the number of unique keys. Take care on large datasets with high cardinality.

## 🧪 Example: Join Events with Match Info

```python
# Use new names so the raw `events` and `matches` inputs are not overwritten
events_flow = Flow(events)
matches_flow = Flow(matches)

enriched = events_flow.join(matches_flow, on="match_id", how="left")
pprint(enriched.head(1))
```

The first enriched record looks something like this:

```python
{
    'event_id': 1,
    'match_id': 123,
    'type_name': 'Pass',
    'player_name': 'Kevin De Bruyne',
    'team_name': 'Manchester City',
    'competition_name': 'Premier League',   # from match metadata
    'match_date': '2023-10-08'
}
```

## 🧠 Summary

Flow’s advanced operations let you:

- Sort and rank streams
- Sample intelligently
- Merge datasets using joins
- Deduplicate messy input
- Combine multiple sources

These tools are built for working with real-world, irregular JSON records, not just clean flat tables.

## 📥 Next: Saving and Exporting Data

In the next guide, we’ll look at writing flows to disk using `.to_jsonl()`, `.to_json()`, and `.to_pandas()` for final output or reporting.