# Introduction

## Flow: a lazy, streaming data pipeline

In football analytics, much of the data - events, matches, player tracking - is delivered as nested JSON records. These files can be large, noisy, and full of structure, but we often want to perform familiar operations: filtering, selecting fields, computing aggregates, or joining on player or match IDs.

Flow is designed for exactly this: lightweight pipelines that operate on record streams without loading everything into memory. It’s built from the ground up for JSON-like dicts, meaning you can start processing immediately without flattening or reshaping your data.


## How Flow Works

A `Flow` is like a smart conveyor belt for your records. You tell it what operations to perform - like `.filter()`, `.select()`, `.assign()` - and it remembers the plan without doing anything right away.

### Lazy, Chainable Operations

When you build a Flow pipeline, you're chaining together a sequence of lazy transformations.

```python
flow = Flow(events)
shots = flow.filter(lambda r: r.get("type_name") == "Shot")
result = shots.select("player_name", "shot_xg")
```

At this point, nothing has actually run. It’s only when you call a materializing method like `.collect()` or `.to_pandas()` that data is processed.
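To make deferred execution concrete, here is a minimal sketch built on plain Python generators rather than on Flow itself. The `LazyPipeline` class, its methods, and the sample `events` are hypothetical stand-ins written for illustration; they are not Flow's actual implementation.

```python
class LazyPipeline:
    """Hypothetical stand-in for illustration; not Flow's implementation."""

    def __init__(self, records, steps=()):
        self._records = records
        self._steps = tuple(steps)

    def filter(self, predicate):
        # Only record the step; no record is touched yet.
        return LazyPipeline(self._records, self._steps + (("filter", predicate),))

    def select(self, *fields):
        return LazyPipeline(self._records, self._steps + (("select", fields),))

    def __iter__(self):
        # The recorded plan runs here, one record at a time.
        for record in self._records:
            for kind, arg in self._steps:
                if kind == "filter" and not arg(record):
                    break  # drop this record
                if kind == "select":
                    record = {field: record.get(field) for field in arg}
            else:
                yield record

    def collect(self):
        return list(self)


events = [
    {"type_name": "Shot", "player_name": "A", "shot_xg": 0.3},
    {"type_name": "Pass", "player_name": "B"},
]
calls = []


def is_shot(record):
    calls.append(record)  # track when the predicate actually runs
    return record.get("type_name") == "Shot"


pipeline = LazyPipeline(events).filter(is_shot).select("player_name", "shot_xg")
calls_before = len(calls)    # the predicate has not run yet
result = pipeline.collect()  # now the plan is executed
```

Building the pipeline only stores the steps; the predicate is first called inside `collect()`, which is the behaviour the text above describes.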

## Materialization: When Work Happens

`Flow` evaluates your pipeline only when you ask for results - this is called materialization:

- `.collect()` → returns a list of records
- `.to_pandas()` → builds a DataFrame
- `.to_jsonl("out.jsonl")` → writes records to disk
- `for r in flow` → begins processing the stream

Until you do one of these, Flow remains idle and memory-efficient.
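Writing to disk is a useful case to picture, because records can be written one at a time without ever holding the full dataset. The sketch below uses only the standard library; `shots_only` and `write_jsonl` are hypothetical helpers, and an in-memory `io.StringIO` stands in for a real file. It illustrates the idea and makes no claim about how `.to_jsonl()` is implemented.

```python
import io
import json


def shots_only(records):
    # A generator: the body does not run until something iterates over it.
    for record in records:
        if record.get("type_name") == "Shot":
            yield record


def write_jsonl(records, handle):
    # Consuming the stream is what triggers the work, one record at a time.
    count = 0
    for record in records:
        handle.write(json.dumps(record) + "\n")
        count += 1
    return count


events = [
    {"type_name": "Shot", "player_name": "A", "shot_xg": 0.3},
    {"type_name": "Pass", "player_name": "B"},
]

stream = shots_only(iter(events))  # nothing has been processed yet
buffer = io.StringIO()             # stands in for a file on disk
written = write_jsonl(stream, buffer)
lines = buffer.getvalue().splitlines()
```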

## Automatically Safe to Reuse

Even though `Flow` is built like a generator, it’s safe to reuse. You can:

- Call `.first()`, `.head()`, or `.collect()` multiple times
- Loop over the same Flow again
- Chain additional steps later

`Flow` automatically preserves any records it’s seen before, so you never lose data by accident.

```python
flow = Flow(data)

print(flow.first())     # safe peek
print(flow.head(3))     # still safe
print(flow.collect())   # all data still there
```

This buffering happens automatically behind the scenes using a Python technique called “teeing” (but you don’t need to know that to use it).
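For the curious, the standard library exposes teeing as `itertools.tee`. The sketch below shows how it lets you peek at a one-shot stream without losing records. Whether Flow uses `itertools.tee` in exactly this way is an assumption, and `record_stream` is a hypothetical source.

```python
import itertools


def record_stream():
    # A one-shot generator: once consumed, its records are gone.
    for i in range(5):
        yield {"event_id": i}


# tee() returns independent iterators that share one buffered source.
source, peek = itertools.tee(record_stream())

first = next(peek)         # consume from the peek branch only
everything = list(source)  # the other branch still sees every record
```

The buffer that `tee` maintains is what makes the peek safe, and it is also why reuse costs memory: records read by one branch are held until the other branch has read them too.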

## But Be Aware of Memory Usage

Because `Flow` preserves previously accessed records, repeated reuse will buffer more data in memory over time.

If you plan to inspect or reuse the data repeatedly, it's best to cache it explicitly:

```python
flow = Flow.from_jsonl("data.jsonl").cache()
```

This loads everything once and makes future operations fast and predictable.
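Conceptually, explicit caching trades memory for predictability: the source is read once and later passes reuse the stored records. A minimal sketch of that idea with a plain list follows. `read_records` is a hypothetical source, and this is not a description of how `.cache()` is implemented.

```python
produced = {"count": 0}


def read_records():
    # Hypothetical source that counts how many records it has produced.
    for i in range(3):
        produced["count"] += 1
        yield {"match_id": i}


cached = list(read_records())  # one pass over the source; all records now in memory

# Later passes iterate the list and never touch the source again.
first_pass = [record["match_id"] for record in cached]
second_pass = [record["match_id"] for record in cached]
```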

## Working with Nested JSON

Football data is often deeply nested. `Flow` helps you work with it naturally:

```python
flow.select("player.name")     # directly select nested field
flow = flow.flatten()          # flatten all nested fields
```

`.select()` reaches into nested fields using dotted paths, while `.flatten()` expands all nested fields into top-level ones, so you can focus on the operations you need.
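If you want a feel for what dotted-path access and flattening involve, here is a small sketch on plain dicts. `get_path` and `flatten` are hypothetical helpers, and the dot-separated key names in the flattened output are an assumption for illustration; Flow's own output may be named differently.

```python
def get_path(record, path, default=None):
    # Walk a dotted path such as "player.name" through nested dicts.
    current = record
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


def flatten(record, parent="", sep="."):
    # Recursively lift nested dict fields to the top level.
    flat = {}
    for key, value in record.items():
        name = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat


event = {"player": {"name": "A", "position": {"id": 9}}, "minute": 12}

name = get_path(event, "player.name")
missing = get_path(event, "team.name")  # falls back to the default
flat = flatten(event)
```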

## Why This Matters

`Flow`'s design offers practical advantages for sports analytics:

- Scales to large files (record-by-record processing)
- Starts fast (no up-front loading)
- Plays well with JSON and dicts
- Safe and repeatable by default
- Easy to debug, branch, and explore

## Next Steps

Once you understand `Flow`'s model, you can:

- Read files with `Flow.from_jsonl()` or `Flow.from_folder()`
- Transform and select fields with `.filter()`, `.assign()`, `.select()`
- Join datasets with `.join()`
- Group and summarize with `.group_by().summary()`
- Output data with `.to_jsonl()`, `.to_pandas()`, etc.

`Flow` lets you write expressive, powerful pipelines that are fast, safe, and memory-conscious, which makes it well suited to event-level football data.