# Introduction

## Flow: a lazy, streaming data pipeline

In football analytics, much of the data (events, matches, player tracking) is delivered as nested JSON records. These files can be large, noisy, and deeply nested, but we often want to perform familiar operations on them: filtering, selecting fields, computing aggregates, or joining on player or match IDs.

In many tasks, you want to run a **lightweight data pipeline** over streams of JSON/dict records without loading everything into memory.

That’s where **Flow** comes in.

## Understanding How `Flow` Works

Think of a `Flow` as a smart, efficient assembly line for your data records. You start with a stream of raw data (like a list of game events from a JSON file), and then you tell `Flow` what steps to perform on each record as it moves along the line.

### Planning Your Work (Lazy Evaluation & Chaining)

When you use methods like `.filter(...)`, `.select(...)`, or `.assign(...)` on a `Flow`, you're not immediately processing all your data. Instead, you're building a plan or a recipe. Each step you add creates a new `Flow` object that remembers all the previous steps plus the new one you just added. This is called "chaining."

**Example:**

```python
# Let's say raw_events is a list of dictionaries from a JSON file
# This line just sets up the starting point
data_plan = Flow(raw_events)

# This adds a "filter" step to the plan. No data is processed yet.
filtered_plan = data_plan.filter(lambda event: event.get("type_name") == "Shot")

# This adds a "select" step. Still just a plan!
final_plan = filtered_plan.select("player_name", "shot_outcome_name")
```

At this stage, `final_plan` knows it needs to: 1) read `raw_events`, 2) filter for shots, 3) select the player and outcome fields. But no actual filtering or selecting has happened. This is "lazy evaluation": work is delayed until you ask for results.
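
To make the idea of a plan concrete, here is a minimal sketch of how a lazy, chainable pipeline could be built in plain Python. `MiniFlow` is a hypothetical toy class written for this illustration; it is not the real `Flow` implementation, and the sample records are made up. The `calls` list records when the filter predicate actually runs.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, List, Tuple

Record = Dict[str, Any]
Step = Callable[[Iterator[Record]], Iterator[Record]]


class MiniFlow:
    """Toy stand-in for a lazy pipeline. Hypothetical; NOT the real Flow class."""

    def __init__(self, source: Iterable[Record], steps: Tuple[Step, ...] = ()):
        self._source = source
        self._steps = steps  # the "plan": one function per pipeline step

    def _chain(self, step: Step) -> "MiniFlow":
        # Return a NEW object that remembers the old steps plus the new one.
        return MiniFlow(self._source, self._steps + (step,))

    def filter(self, predicate: Callable[[Record], bool]) -> "MiniFlow":
        return self._chain(lambda records: (r for r in records if predicate(r)))

    def select(self, *fields: str) -> "MiniFlow":
        return self._chain(
            lambda records: ({f: r.get(f) for f in fields} for r in records)
        )

    def __iter__(self) -> Iterator[Record]:
        # Only here are the planned steps wired together and run.
        records = iter(self._source)
        for step in self._steps:
            records = step(records)
        return records

    def collect(self) -> List[Record]:
        return list(self)


calls = []  # tracks every record the predicate has seen


def is_shot(event: Record) -> bool:
    calls.append(event.get("type_name"))
    return event.get("type_name") == "Shot"


# Made-up sample records for the illustration.
raw_events = [
    {"type_name": "Pass", "player_name": "A", "shot_outcome_name": None},
    {"type_name": "Shot", "player_name": "B", "shot_outcome_name": "Goal"},
]

plan = MiniFlow(raw_events).filter(is_shot).select("player_name", "shot_outcome_name")
calls_before = len(calls)  # 0: building the plan did not touch any record

result = plan.collect()    # the predicate runs only now
print(calls_before, result)
```

Each method returns a new object rather than mutating the old one, which is what lets a partially built plan be extended in different directions without side effects.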

### Doing the Work (Materialization)

The actual data processing – running records through your planned assembly line – happens only when you ask for the final results. This is often called "materialization."

Common ways to get results:

- `.collect()`: Turns the final planned output into a Python list of dictionaries.
- `.to_pandas()`: Converts the output into a pandas DataFrame.
- `.to_jsonl("output.jsonl")`: Writes each resulting record to a file.
- Looping: `for record in final_plan: print(record)`

**Example (continuing from above):**

```python
shot_data_list = final_plan.collect()
# NOW the work happens:
# 1. Records from raw_events are read one by one.
# 2. Each record is checked if it's a "Shot".
# 3. For shots, only "player_name" and "shot_outcome_name" are kept.
# 4. These processed records are collected into shot_data_list.
```

### How Flow Works: Lazy, One-Way by Default (but You Can Cache)

`Flow` is designed for lazy, memory-efficient processing of potentially large datasets. It does not load or transform data until you ask for results (for example, by calling `.collect()` or `.to_pandas()`).

### Lazy and One-Way by Default

By default, a `Flow` consumes its underlying data as it processes it — much like a generator. This means:

- Most operations (`.filter()`, `.assign()`, `.select()`) just build the pipeline without executing it.
- When you finally collect or iterate the flow, it runs the full pipeline from scratch.

```python
flow = Flow(data)
filtered = flow.filter(...)
# nothing happens yet
result = filtered.collect()  # Now it runs everything
```
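
The generator analogy can be seen with plain Python, independent of `Flow`. The short sketch below uses made-up records and a bare generator expression with no buffering at all, so it shows the raw one-way behavior that a streaming pipeline has to manage.

```python
# Made-up sample records for the illustration.
events = [{"type_name": "Shot"}, {"type_name": "Pass"}, {"type_name": "Shot"}]

# A generator expression is lazy and one-way: nothing is filtered yet.
shots = (e for e in events if e["type_name"] == "Shot")

first_pass = list(shots)   # runs the filter and consumes the generator
second_pass = list(shots)  # nothing left: the generator is exhausted

print(len(first_pass), len(second_pass))  # prints: 2 0
```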

### Behind the Scenes: `tee()` for Safety

To minimize surprises, most operations that read data, such as `.collect()`, `.__iter__()`, and `.first()`, internally use `itertools.tee()`. This splits the internal stream so:

- You can call `.collect()` multiple times without losing data.
- You can inspect the first record and still iterate the rest.

```python
flow = Flow(data)
print(flow.first())      # peeks safely
print(flow.collect())    # still has all data
```

But this safety comes at a cost: `itertools.tee()` has to keep in memory every record that one copy of the stream has read and another has not, so the more you inspect a flow, the more data may be buffered.
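
As a rough illustration of the mechanism, the sketch below uses `itertools.tee()` from the standard library to peek at the first record of a one-way stream without losing it. The helper `peek_first` is a hypothetical name chosen for this example, and the code does not claim to mirror how `Flow` wires this up internally.

```python
import itertools


def peek_first(stream):
    """Return (first_record, fresh_stream) without dropping any records.

    Hypothetical helper for illustration only.
    """
    peek, rest = itertools.tee(stream)
    # Advancing `peek` makes tee buffer that record so `rest` can still yield it.
    return next(peek, None), rest


stream = ({"id": i} for i in range(3))  # a one-way generator

first, stream = peek_first(stream)
remaining = list(stream)

print(first)           # {'id': 0}
print(len(remaining))  # 3: the peeked record is still in the stream
```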

### If You Want a “Save Point”: Use `.cache()`

If you plan to re-use a flow many times and don’t want to re-run or re-buffer:

```python
cached_flow = flow.cache()
```

- This materializes and stores all records in memory.
- Subsequent operations are fast and repeatable.
- Great for debugging or branching off multiple analyses.
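
The save-point idea can be sketched without `Flow` at all: materialize a one-way stream into a list once, then run several analyses against that list. The generator `load_records` and its data are made up for this example, and the `runs` list exists only to show that the source is read a single time.

```python
runs = []  # records each time the source actually produces a record


def load_records():
    # Stand-in for an expensive, one-way source (hypothetical data).
    for match_id in range(3):
        runs.append(match_id)
        yield {"match_id": match_id, "goals": match_id * 2}


cached = list(load_records())  # the "save point": everything is now in memory

# Two independent analyses reuse the same cached records.
total_goals = sum(r["goals"] for r in cached)
high_scoring = [r for r in cached if r["goals"] >= 2]

print(total_goals, len(high_scoring), runs)  # prints: 6 2 [0, 1, 2]
```

The trade-off is the one listed above: repeatable and fast, but every record now lives in memory at once.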

### Best Practices

- Use `Flow(data).filter(...).collect()` for one-time pipelines.
- Use `.cache()` if you plan to re-use the flow or need stable results.
- Avoid reusing non-cached flows in multiple places unless you understand the buffering behavior.
- If in doubt, re-create the Flow from your source data.

## Why This Approach?

The lazy streaming model used by `Flow` offers a set of practical advantages:

- Scales to large files: Many football datasets (like event or tracking data) are too big to load fully into memory. `Flow` lets you filter, transform, and summarize data record-by-record without exhausting your RAM.
- Fast startup: Since no data is loaded until needed, your pipeline can be constructed instantly — even for gigabyte-scale files.
- Modular and composable: Each step is clean and isolated. You can build up pipelines gradually and reuse common steps without side effects.
- Debuggable and inspectable: You can safely peek at `flow.first()`, `.head()`, or call `.collect()` repeatedly thanks to internal buffering, without breaking the stream.
- Easy to branch: Want to try a few different filters or summaries on the same base data? Use `.cache()` once and explore all you want.
- Familiar feel: `Flow` is conceptually similar to chaining operations in pandas or SQL, but with streaming, nested JSON, and performance in mind.
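
To show what record-by-record processing looks like in practice, here is a small standard-library sketch that streams JSON Lines input and counts shots while holding only one record in memory at a time. The helper `read_jsonl` and the sample data are assumptions made for this example; an in-memory `io.StringIO` stands in for a real file so the snippet is self-contained.

```python
import io
import json


def read_jsonl(lines):
    """Yield one parsed record per non-empty line (hypothetical helper)."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# Stand-in for open("events.jsonl"); the contents are made up.
fake_file = io.StringIO(
    '{"type_name": "Shot", "player_name": "A"}\n'
    '{"type_name": "Pass", "player_name": "B"}\n'
    '{"type_name": "Shot", "player_name": "C"}\n'
)

# Each record is parsed, tested, and discarded before the next one is read.
shot_count = sum(
    1 for event in read_jsonl(fake_file) if event.get("type_name") == "Shot"
)

print(shot_count)  # prints: 2
```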

This model keeps your analysis efficient, testable, and expressive - especially valuable when wrangling real-world football data that is often messy and nested.

