# Best Practices, Performance, & Troubleshooting

This chapter provides tips for working effectively with `Flow`, optimizing performance, and avoiding common mistakes. `Flow` is built to be flexible and safe by default, but understanding its lazy behavior and internal memory handling can help you write cleaner and more efficient pipelines.

## You Can Reuse Any Flow

Unlike regular Python iterators, which are exhausted after a single pass, a `Flow` can be reused as many times as you like. You can:

- Call `.collect()` multiple times
- Loop through it more than once
- Peek at `.first()` or `.keys()` safely
- Chain further operations even after materialization

```python
flow.first()   # buffers 1 record
flow.head(100) # buffers another 100
flow.collect() # buffers everything
```

Reuse works because `Flow` buffers records internally, as the comments above show. If you reuse the same `Flow` repeatedly, especially on large datasets, this buffered memory can add up.
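The trade-off between reuse and buffering can be sketched in plain Python. This is only an illustration of the general idea: `ReplayableStream` is a hypothetical class written for this example, it keeps one shared buffer for simplicity, and it is not how `Flow` is actually implemented.

```python
from itertools import islice


class ReplayableStream:
    """Hypothetical stand-in for a reusable pipeline (not Flow's real code)."""

    def __init__(self, source):
        self._source = iter(source)  # the underlying iterator, read at most once
        self._buffer = []            # every record pulled so far is kept here

    def __iter__(self):
        # Replay buffered records first, then keep pulling from the source.
        index = 0
        while True:
            if index < len(self._buffer):
                yield self._buffer[index]
                index += 1
                continue
            try:
                record = next(self._source)
            except StopIteration:
                return
            self._buffer.append(record)

    def head(self, n):
        return list(islice(self, n))

    def buffered(self):
        return len(self._buffer)


# A plain generator is exhausted after one pass.
plain = (n * n for n in range(5))
first_pass = list(plain)
second_pass = list(plain)  # empty: nothing is left to read

# The buffering wrapper can be read again, at the cost of holding records.
stream = ReplayableStream(n * n for n in range(5))
peek = stream.head(2)                    # pulls and buffers 2 records
buffered_after_peek = stream.buffered()
full = list(stream)                      # buffers the remaining records
again = list(stream)                     # served entirely from the buffer
```

The second pass over the plain generator yields nothing, while the wrapper replays the same records each time. The price is that every record read so far stays in memory.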

## Use `.cache()` for Predictability

To avoid memory surprises and improve performance when reusing data, use `.cache()`:

```python
flow = Flow.from_jsonl("match_events.jsonl").cache()
```

Calling `.cache()`:

- Loads everything once
- Stores it in memory explicitly
- Makes subsequent operations faster and repeatable

Use `.cache()` if:

- You're branching into multiple sub-pipelines
- You're debugging
- You want to avoid surprises from internal buffering

Think of `.cache()` as a "save-point" partway through your pipeline: the steps before it run once, and everything after it reuses the stored result instead of running those steps again.
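The save-point idea can be illustrated with ordinary generators and a list. This is a plain-Python sketch with made-up records and hypothetical helper names (`parse`, `upstream`). It does not use `Flow`, and whether `Flow` itself re-runs upstream steps without `.cache()` depends on its internal buffering.

```python
import json

raw_lines = [
    '{"player": "A", "xg": 0.5}',
    '{"player": "B", "xg": 0.25}',
    '{"player": "A", "xg": 0.25}',
]

calls = {"parse": 0}  # counts how often the upstream step runs


def parse(line):
    calls["parse"] += 1
    return json.loads(line)


def upstream():
    # Lazy generator: nothing is parsed until it is iterated.
    return (parse(line) for line in raw_lines)


# No save-point: each branch re-runs the upstream step from scratch.
shots_by_a = [r for r in upstream() if r["player"] == "A"]
total_xg = sum(r["xg"] for r in upstream())
calls_without_save_point = calls["parse"]

# Save-point: run the upstream step once, then branch from the stored list.
calls["parse"] = 0
saved = list(upstream())
shots_by_a_saved = [r for r in saved if r["player"] == "A"]
total_xg_saved = sum(r["xg"] for r in saved)
calls_with_save_point = calls["parse"]
```

Both versions produce the same results, but the version with the stored list parses each line once rather than once per branch.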

## Know Which Operations Are Memory-Heavy

The following operations materialize the full `Flow` into memory:

- `.collect()`
- `.to_pandas()` / `.to_json()` / `.to_jsonl()`
- `.group_by()` and `.summary()`
- `.sort()`, `.row_number()`, `.sample()`, `.take_last()`
- `.last()`
- `.len()`
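Why some operations are heavier than others can be seen with plain generators: taking the first few records stops early, while sorting cannot return anything until it has read every record. The sketch below uses only the standard library and a hypothetical `counted` helper with made-up records. It illustrates the general principle rather than `Flow`'s internals.

```python
from itertools import islice


def counted(records, counter):
    # Yield records one by one, tracking how many were actually pulled.
    for record in records:
        counter["pulled"] += 1
        yield record


events = [{"minute": m} for m in (67, 12, 90, 3, 45)]

# Streaming: taking the first two records reads only two.
head_counter = {"pulled": 0}
first_two = list(islice(counted(events, head_counter), 2))

# Materializing: sorting has to read every record before it can return.
sort_counter = {"pulled": 0}
by_minute = sorted(counted(events, sort_counter), key=lambda e: e["minute"])
```

After running this, `head_counter` shows two records pulled and `sort_counter` shows all five, which is the difference between a streaming step and a materializing one.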

## Recommended Patterns


| Situation                   | Recommendation                                   |
|-----------------------------|--------------------------------------------------|
| One-off pipeline            | Just use `.filter(...).assign(...).collect()`    |
| Reuse across multiple steps | Use `.cache()` once early                        |
| Working with large files    | Use `.from_jsonl()` + streaming methods          |
| Schema exploration          | Use `.first()`, `.keys(limit=10)`                |
| Testing intermediate steps  | Use `.limit()`, `.head()`, `.pipe()`             |
| Avoiding memory surprises   | Use `.cache()` before `.collect()` or `.sort()`  |

## Debugging Pipelines

### Break Long Chains

Split your pipeline into named steps so you can inspect each one before moving on:

```python
step1 = flow.filter(...)
print(step1.head().collect())  # inspect the first few records after filtering
step2 = step1.assign(...)
```

### Use `.pipe()` to Inject Debug Prints

```python
def peek(flow):
    # Print a small sample, then return the flow unchanged so the chain continues.
    print(flow.head(3).collect())
    return flow

flow.filter(...).pipe(peek).assign(...)
```
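To show why this pattern leaves the pipeline untouched, here is a minimal sketch of the usual `pipe` convention: hand the object to a function and return whatever the function returns. `MiniPipeline` is a hypothetical, list-backed class written for this example. It is an assumption about the general pattern, not `Flow`'s implementation, and the version of `peek` here takes an extra `label` argument purely for illustration.

```python
class MiniPipeline:
    """Hypothetical list-backed pipeline, used only to illustrate pipe()."""

    def __init__(self, records):
        self._records = list(records)

    def filter(self, predicate):
        return MiniPipeline(r for r in self._records if predicate(r))

    def head(self, n=5):
        return MiniPipeline(self._records[:n])

    def collect(self):
        return list(self._records)

    def pipe(self, func, *args, **kwargs):
        # Hand the pipeline to func and return whatever func returns.
        return func(self, *args, **kwargs)


seen = []


def peek(pipeline, label):
    # Record a small sample, then pass the pipeline through unchanged.
    seen.append((label, pipeline.head(2).collect()))
    return pipeline


events = [
    {"type": "shot", "minute": 12},
    {"type": "pass", "minute": 13},
    {"type": "shot", "minute": 40},
]

result = (
    MiniPipeline(events)
    .pipe(peek, "raw")
    .filter(lambda r: r["type"] == "shot")
    .pipe(peek, "shots")
    .collect()
)
```

Because `peek` returns the same object it received, it can be added to or removed from a chain without changing the final result.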

## When Not to Use Flow

`Flow` is well suited to working with JSON-like records in Python, but other tools may be a better fit if:

- You need SQL joins across many large tables
- You want indexed storage or transactional queries
- Your dataset is too large to sort or group in memory
- You need true real-time stream processing

## Summary

`Flow` is designed to be safe, reusable, and lazy by default. You can:

- Chain and reuse pipelines without worrying about exhaustion
- Safely explore and inspect data at any point
- Cache data for repeatable performance

Understanding how `Flow` balances laziness with internal buffering will help you stay efficient and in control when processing football (or any) event data.