# Best Practices, Performance, & Troubleshooting

**Flow** is designed to help you write clear, efficient pipelines - especially for semi-structured, JSON-like data. This chapter covers key patterns and tips to help you work confidently with large datasets, avoid common pitfalls, and get predictable performance.

## Understand Flow’s Single-Pass Nature
Flow operates as a lazy, streaming pipeline. Each operation defines a transformation, but nothing runs until the flow is consumed:

```python
flow = Flow(data).filter(...).assign(...)
```

At this point, nothing has run yet.

Processing begins, and the stream is **consumed**, only when you call `.collect()`, `.to_pandas()`, `.to_jsonl()`, or iterate over the flow.
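
The same behavior can be seen with an ordinary Python generator. The sketch below is an analogy only, not Flow's implementation; `is_shot`, `events`, and `calls` are names made up for the illustration.

```python
# Plain-Python analogy (not Flow itself): a generator is lazy and single-pass.
calls = []

def is_shot(record):
    calls.append(record["id"])  # note each time the predicate actually runs
    return record["type_name"] == "Shot"

events = [
    {"id": 1, "type_name": "Shot"},
    {"id": 2, "type_name": "Pass"},
    {"id": 3, "type_name": "Shot"},
]

# Defining the pipeline does no work yet.
pipeline = (r for r in events if is_shot(r))
calls_before = list(calls)

# Consuming it runs the predicate once per record.
shots = list(pipeline)
calls_after = list(calls)

# A second pass finds the generator already exhausted.
second_pass = list(pipeline)
```

Here `calls_before` is empty because the predicate has not run yet, and `second_pass` is empty because the first pass used up the generator.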

## Want to Reuse a Flow? Use `.materialize()`

Most `Flow` operations are destructive: once they run, the underlying stream has been consumed and cannot be read again.

If you need to inspect or use a `Flow` multiple times, call `.materialize()`:

```python
cleaned = Flow.from_jsonl("events.jsonl").filter(...).assign(...)

forked = cleaned.materialize()

summary = forked.group_by("team").summary(...)
shots = forked.filter(lambda r: r["type_name"] == "Shot")
```

✅ This reads the stream into a list-backed `Flow`, so the result is safe to reuse.
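
To see why a list-backed flow can be forked, here is a minimal sketch built around a hypothetical `MiniFlow` class. It is a stand-in written for this example and makes no claim about how `Flow` is implemented internally; `read_events` is likewise invented to play the role of a one-shot source.

```python
class MiniFlow:
    """Hypothetical stand-in for Flow, for illustration only."""

    def __init__(self, records):
        self._records = records  # any iterable: list, generator, file reader

    def filter(self, predicate):
        # Lazy: wraps the source in a generator, no records are read yet.
        return MiniFlow(r for r in self._records if predicate(r))

    def materialize(self):
        # Drain the stream once into a list; a list can be iterated repeatedly.
        return MiniFlow(list(self._records))

    def collect(self):
        return list(self._records)


def read_events():
    # Stands in for a one-shot source such as a file being streamed.
    yield {"team": "A", "type_name": "Shot"}
    yield {"team": "B", "type_name": "Pass"}
    yield {"team": "A", "type_name": "Pass"}


# Materialized: both branches see every record.
forked = MiniFlow(read_events()).materialize()
shots = forked.filter(lambda r: r["type_name"] == "Shot").collect()
passes = forked.filter(lambda r: r["type_name"] == "Pass").collect()

# Not materialized: the second read finds the generator exhausted.
streamed = MiniFlow(read_events())
first_run = streamed.collect()
second_run = streamed.collect()
```

The trade-off is memory: the materialized copy holds every record at once, so it is usually worth filtering down to what you need before materializing.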

## Avoid Internal Buffering Confusion

- All flows are single-pass
- All inspection methods (`.first()`, `.keys()`, `.last()`, `len()`, etc.) consume the flow
- If you need to reuse a flow, call `.materialize()` before inspecting it

Example:

```python
flow = Flow.from_jsonl("match.jsonl")

flow.first()  # ← Consumes the stream!
flow.collect()  # ← Now returns nothing (already consumed)

# Correct usage
flow = Flow.from_jsonl("match.jsonl").materialize()
flow.first()  # Safe
flow.collect()  # Still safe
```

## Know Which Operations Consume the Stream

The following consume the entire flow and return a result:

| Operation                   | Description           |
| --------------------------- | --------------------- |
| `.collect()`                | List of records       |
| `.to_pandas()`              | Pandas DataFrame      |
| `.to_json()`, `.to_jsonl()` | JSON serialization    |
| `.group_by()`, `.summary()` | Aggregations          |
| `.sort()`, `.row_number()`  | Sorting, ranking      |
| `.take_last()`, `.last()`   | Back-of-stream access |
| `len(flow)`                 | Count of records      |
| `.keys()`                   | Union of fields       |
| `.describe()`               | Stats via pandas      |

💡 If you're calling any of these, the original stream is gone unless you `.materialize()` beforehand.
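
These operations share one property: the answer depends on records that may appear anywhere in the stream, so the whole input has to be read. A plain-Python sketch of the idea (an analogy, not Flow's code; `stream` is a made-up source):

```python
from collections import deque

def stream():
    # Hypothetical one-shot source of records.
    for minute in (12, 3, 45, 27):
        yield {"minute": minute}

# Each of these must read to the end of its input before it can answer.
count = sum(1 for _ in stream())
last_two = list(deque(stream(), maxlen=2))
ordered = sorted(stream(), key=lambda r: r["minute"])
all_keys = set().union(*(r.keys() for r in stream()))

# On a shared iterator, nothing is left once one of them has run.
source = stream()
count_again = sum(1 for _ in source)
leftover = list(source)
```

Each line above calls `stream()` afresh, which only works because the example source can be recreated cheaply. With a single shared iterator, as in the last three lines, the first full read leaves nothing behind.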

## Recommended Patterns

| Situation                        | Recommendation                                |
| -------------------------------- | --------------------------------------------- |
| One-off pipeline                 | `.filter(...).assign(...).collect()`          |
| Reuse across steps               | `.materialize()` once early                   |
| Branching into multiple analyses | `.materialize()` then fork                    |
| Large files                      | Use `.from_jsonl()` and avoid `.sort()` early |
| Exploring schema                 | `.head(n).collect()` for shallow preview      |
| Debugging steps                  | Use `.pipe()` or `.head()` to inspect         |
| Avoiding surprises               | Always materialize before reuse               |
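
The "large files" advice comes down to ordering: stream and filter first, then sort only what remains. The sketch below uses only the standard library and an in-memory `io.StringIO` in place of a real file; `read_jsonl` is a hypothetical helper for this example, not part of Flow.

```python
import io
import json

# Simulated JSONL file; a real one would come from open(path).
jsonl = io.StringIO(
    '{"minute": 90, "type_name": "Pass"}\n'
    '{"minute": 12, "type_name": "Shot"}\n'
    '{"minute": 55, "type_name": "Pass"}\n'
    '{"minute": 7, "type_name": "Shot"}\n'
)

def read_jsonl(handle):
    # Yield one parsed record at a time instead of loading the whole file.
    for line in handle:
        if line.strip():
            yield json.loads(line)

# Filter while streaming, then sort only the records that survived.
shots = (r for r in read_jsonl(jsonl) if r["type_name"] == "Shot")
sorted_shots = sorted(shots, key=lambda r: r["minute"])
```

Sorting has to buffer its entire input, so here it holds two records rather than four. On a large file the same ordering can be the difference between fitting in memory and not.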

## Debugging Pipelines

### Break Up Long Chains

Split pipelines into small, readable steps:

```python
step1 = flow.filter(...)
print(step1.head(3).collect())  # Inspect safely
step2 = step1.assign(...)
```

### Use `.pipe()` for Debug Hooks

```python
def peek(flow):
    print(flow.head(3).collect())
    return flow

Flow(events).filter(...).pipe(peek).assign(...)
```

Great for inspecting intermediate state without disrupting your pipeline logic.
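
When the underlying source is a one-shot iterator, a debug hook has to hand back everything it read. One way to do that in plain Python is sketched below; this is an illustration of the idea, not how Flow's `.pipe()` or `.head()` work, and this standalone `peek` and `numbers` are invented for the example.

```python
from itertools import chain, islice

def peek(records, n=3):
    """Return (preview, stream), where stream still yields every record."""
    iterator = iter(records)
    preview = list(islice(iterator, n))
    # Put the previewed records back in front of the untouched remainder.
    return preview, chain(preview, iterator)

def numbers():
    # Hypothetical one-shot source.
    yield from range(1, 6)

preview, restored = peek(numbers(), n=2)
everything = list(restored)
```

Only the first `n` records are buffered, so the preview stays cheap even on a long stream.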

## When Not to Use Flow

`Flow` is built for in-memory processing of structured/semi-structured records. Consider alternatives if:

- You need true SQL-style joins across many large tables
- Your data is too large to fit in memory, even after grouping
- You need low-latency stream processing
- You’re working with tabular, relational data with a well-defined schema

## Summary

- ✅ Use `.materialize()` to safely reuse or inspect data
- ⚠️ Remember: most inspection methods consume the stream
- 🧰 Use `.pipe()` and `.head()` for debugging
- 📦 Prefer `.from_jsonl()` for scalable streaming input
- 🔁 Fork or branch flows explicitly using `.materialize()` or `.fork()`

Flow is designed to make pipelines simple, readable, and safe - as long as you’re mindful of when your data is being consumed.

