# Introduction to Flow

## Flow: a Lazy, Streaming Data Pipeline

In football analytics, much of the data - events, matches, tracking - is delivered as nested JSON records. These files can be large and complex, but the operations we want to perform are often familiar: filtering, selecting fields, computing aggregates, or joining on player and match IDs.

**Flow** is built for exactly this: lightweight, chainable pipelines that process streaming JSON-like `dict` records efficiently, without needing to load everything into memory upfront.


## How Flow Works

A `Flow` is like a conveyor belt of records. You define what should happen - `.filter()`, `.assign()`, `.group_by()` - and nothing is processed until you explicitly ask for results.

### Lazy and Chainable

Flow pipelines are lazy. You can freely chain transformations without triggering any computation:

```python
flow = Flow(events)
shots = flow.filter(lambda r: r.get("type_name") == "Shot")
result = shots.select("player_name", "shot_xg")
```

No computation happens yet. This is just a recipe.
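The same laziness can be seen with plain Python generator expressions. The sketch below is not Flow's implementation; it is a stand-in using only the standard language, with made-up `events` records and a hypothetical `is_shot` predicate that logs each call, so you can see that nothing runs until the pipeline is consumed.

```python
calls = []  # records every time the predicate actually runs


def is_shot(record):
    calls.append(record["id"])
    return record.get("type_name") == "Shot"


# Made-up records for illustration only.
events = [
    {"id": 1, "type_name": "Shot", "player_name": "A", "shot_xg": 0.1},
    {"id": 2, "type_name": "Pass", "player_name": "B"},
    {"id": 3, "type_name": "Shot", "player_name": "C", "shot_xg": 0.4},
]

# Building the pipeline: these generator expressions describe the work
# but do not call is_shot() yet.
shots = (r for r in events if is_shot(r))
selected = (
    {"player_name": r["player_name"], "shot_xg": r["shot_xg"]} for r in shots
)

calls_before = list(calls)  # snapshot taken before anything is consumed

rows = list(selected)       # consuming the pipeline triggers the work
calls_after = list(calls)   # snapshot taken after consumption
```

Here `calls_before` is empty, while `calls_after` contains every record id, because the predicate only ran once `list(selected)` pulled records through the chain.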

## Materialization: When Work Happens

`Flow` is single-pass. Once a pipeline is consumed - via iteration, conversion, or saving - it cannot be reused.

Materialization happens when you:

- Call `.collect()` - returns a list of all records
- Call `.to_pandas()` - builds a DataFrame
- Iterate with `for r in flow` - streams records one at a time
- Save to disk with `.to_jsonl()`, `.to_json_files()`, etc.
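To make "single-pass" concrete, here is a minimal analogy using a plain Python generator rather than Flow itself. The records are made up; the point is only that a second pass over a consumed stream yields nothing.

```python
# Made-up records for illustration only.
events = [
    {"type_name": "Shot"},
    {"type_name": "Pass"},
    {"type_name": "Shot"},
]

shots = (r for r in events if r["type_name"] == "Shot")

first_pass = list(shots)   # consumes the generator
second_pass = list(shots)  # the generator is already exhausted
```

`first_pass` holds the two shot records, while `second_pass` is an empty list. No error is raised, which is why accidental reuse of a consumed stream can be easy to miss.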

## Need to Reuse the Data? Use `.materialize()`

If you want to inspect or reuse data across multiple steps, use `.materialize()`:

```python
flow = Flow.from_folder("data/json").filter(...).assign(...)

# Create a reusable, in-memory Flow
data = flow.materialize()

# Now it's safe to reuse
summary = data.group_by("team").summary(...)
df = data.filter(...).to_pandas()
```

This consumes the original stream and gives you a new `Flow` backed by a list - reusable and safe for repeated access.
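Conceptually, materializing amounts to draining the stream into a list and wrapping that list again. The sketch below uses a hypothetical `MiniFlow` class written for this page; it is an assumption about the general idea, not Flow's actual code, and the records are made up.

```python
class MiniFlow:
    """Hypothetical stand-in for Flow, for illustration only."""

    def __init__(self, records):
        self._records = records  # any iterable: list, generator, iterator

    def filter(self, predicate):
        # Lazy: wraps the source in a generator, nothing runs yet.
        return MiniFlow(r for r in self._records if predicate(r))

    def collect(self):
        # Consume the source and return a plain list.
        return list(self._records)

    def materialize(self):
        # Drain the stream once, then wrap the resulting list in a new
        # MiniFlow. A list can be iterated any number of times.
        return MiniFlow(list(self._records))


# Made-up records for illustration only.
events = [
    {"team": "Home", "type_name": "Shot"},
    {"team": "Away", "type_name": "Shot"},
    {"team": "Home", "type_name": "Pass"},
]

stream = MiniFlow(iter(events)).filter(lambda r: r["type_name"] == "Shot")
data = stream.materialize()

# The list-backed MiniFlow can feed several independent pipelines.
home = data.filter(lambda r: r["team"] == "Home").collect()
away = data.filter(lambda r: r["team"] == "Away").collect()

# The original stream was consumed by materialize().
leftover = stream.collect()
```

The trade-off is memory: the materialized copy holds every record at once, so it makes sense to filter and select before materializing rather than after.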

## Just Want the Results? Use `.collect()`

If you just want the records as a list of dicts:

```python
records = Flow.from_folder("data/json").filter(...).collect()
```

This is the simplest way to get results. The returned list can be reused like any Python list, but the pipeline that produced it has been consumed.

## ⚠️ Inspection Methods Consume the Stream

Because `Flow` is single-pass, even methods like `.first()`, `.head()`, and `.keys()` will consume the underlying records.

```python
flow = Flow(events)
print(flow.first())  # consumes records from the stream, so `flow` is no longer safe to reuse
```

To safely inspect, materialize the stream first using `.materialize()`:

```python
flow = Flow(events).materialize()
print(flow.first())    # safe
print(flow.head(3))    # safe
print(flow.keys())     # safe
```
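The reason peeking is unsafe on a stream can be illustrated with a plain Python iterator. This is an analogy only, not Flow's internals, and the records are made up: taking the first record removes it from what remains, whereas a list can be inspected freely.

```python
# Made-up records for illustration only.
records = [{"id": 1}, {"id": 2}, {"id": 3}]

# Stream-like source: peeking pulls a record out of the iterator.
stream = iter(records)
peeked = next(stream)
remaining = list(stream)  # the peeked record is no longer included

# List-backed source: inspecting does not change what later code sees.
materialized = list(records)
first = materialized[0]
still_complete = list(materialized)
```

After the peek, `remaining` contains only the second and third records, while `still_complete` contains all three.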

## Working with Nested JSON

Football data is often deeply nested. `Flow` helps you work with it naturally:

```python
flow.select("player.name")     # directly select nested field
flow = flow.flatten()          # flatten all nested fields
```

Dot notation reaches into nested dicts, and `.flatten()` collapses nested fields into a single level, so you can focus on the operations you need.
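For intuition, dotted-path access and flattening can be sketched in a few lines of plain Python. The helpers `get_path` and `flatten` below are hypothetical and written for this page; in particular, joining flattened keys with `"."` is an assumption, and Flow's own output format may differ.

```python
def get_path(record, path, default=None):
    """Follow a dot-separated path such as "player.name" through nested dicts."""
    current = record
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


def flatten(record, parent_key="", sep="."):
    """Collapse nested dicts into one level, joining keys with `sep`."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


# Made-up record for illustration only.
event = {"id": 7, "player": {"name": "Player A", "position": {"name": "RW"}}}

name = get_path(event, "player.name")    # value found via the nested path
missing = get_path(event, "player.age")  # path absent, falls back to default
flat_event = flatten(event)
```

With this record, `flat_event` has the keys `id`, `player.name`, and `player.position.name`, each mapped to its original value.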

## Why Flow?

**Flow** is built for the messy, structured nature of real-world football data:

- ✅ Streams large JSON files without loading everything at once
- ✅ Chains transformations fluently and readably
- ✅ Plays well with nested dicts
- ✅ Avoids pandas until you need it
- ✅ Keeps pipelines composable and inspectable

## Best Practices

- Use `.materialize()` if you plan to inspect or reuse data more than once
- Avoid using `.first()`, `.head()`, or `.keys()` on unmaterialized streams - they consume the Flow
- Think of `Flow` as a one-shot pipeline - just like Python's generators

## Summary

Once you understand `Flow`'s model, you can:

- Read files with `Flow.from_jsonl()` or `Flow.from_folder()`
- Transform and select fields with `.filter()`, `.assign()`, `.select()`
- Join datasets with `.join()`
- Group and summarize with `.group_by().summary()`
- Output data with `.to_jsonl()`, `.to_pandas()`, etc.
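To show how these stages fit together without assuming any Flow signatures, here is a rough standard-library equivalent of a read, filter, group, and summarize pipeline. The helpers `read_jsonl` and `total_by` are hypothetical, and the records are made up; an in-memory `io.StringIO` stands in for a JSONL file on disk.

```python
import io
import json
from collections import defaultdict


def read_jsonl(handle):
    """Yield one dict per non-empty line, without loading the whole file."""
    for line in handle:
        line = line.strip()
        if line:
            yield json.loads(line)


def total_by(records, key, field):
    """Group records by `key` and sum `field` within each group."""
    totals = defaultdict(float)
    for record in records:
        totals[record[key]] += record[field]
    return dict(totals)


# In-memory stand-in for a JSONL file (made-up records).
raw = io.StringIO(
    '{"team": "Home", "type_name": "Shot", "shot_xg": 0.25}\n'
    '{"team": "Away", "type_name": "Pass"}\n'
    '{"team": "Home", "type_name": "Shot", "shot_xg": 0.5}\n'
    '{"team": "Away", "type_name": "Shot", "shot_xg": 0.125}\n'
)

# Lazy stages: reading and filtering happen only as total_by() pulls records.
shots = (r for r in read_jsonl(raw) if r.get("type_name") == "Shot")
xg_by_team = total_by(shots, "team", "shot_xg")
```

Only one record is held in memory at a time until the aggregation step, which keeps a single running total per team.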

`Flow` lets you write expressive, powerful data pipelines that are fast, safe, and memory-conscious.