# Introduction

## Matchflow: a lazy, streaming data pipeline

In football analytics, much of the data (events, matches, player tracking) is delivered as nested JSON records. These files can be large, noisy, and deeply nested, but we often want to perform familiar operations on them: filtering, selecting fields, computing aggregates, or joining on player or match IDs.


In many tasks, you want to run a **lightweight data pipeline** over streams of JSON/dict records without loading everything into memory.

That’s where **Flow** comes in.

## Understanding How `Flow` Works

Think of a `Flow` as a smart, efficient assembly line for your data records. You start with a stream of raw data (like a list of game events from a JSON file), and then you tell `Flow` what steps to perform on each record as it moves along the line.

### Planning Your Work (Lazy Evaluation & Chaining)

When you use methods like `.filter(...)`, `.select(...)`, or `.assign(...)` on a `Flow`, you're not immediately processing all your data. Instead, you're building a plan or a recipe. Each step you add creates a new `Flow` object that remembers all the previous steps plus the new one you just added. This is called "chaining."

**Example:**

```python
# Let's say raw_events is a list of dictionaries from a StatsBomb file
# This line just sets up the starting point
data_plan = Flow(raw_events)

# This adds a "filter" step to the plan. No data is processed yet.
filtered_plan = data_plan.filter(lambda event: event.get("type_name") == "Shot")

# This adds a "select" step. Still just a plan!
final_plan = filtered_plan.select("player_name", "shot_outcome_name")
```

At this stage, `final_plan` knows it needs to: 1) get `raw_events`, 2) filter for shots, 3) select the player and outcome fields. But no actual filtering or selecting has happened. This is "lazy evaluation": work is delayed until you actually ask for a result.
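
To make the idea of a stored plan concrete, here is a minimal sketch of how chaining and lazy evaluation can be implemented. `MiniFlow` is a hypothetical stand-in written for this illustration; it is not the library's `Flow` class and says nothing about how `Flow` is built internally. The `calls` list is only there to show when the filter function really runs.

```python
# Hypothetical stand-in for illustration only; NOT the library's Flow class.
class MiniFlow:
    def __init__(self, source, steps=()):
        self._source = source        # any iterable of dict records
        self._steps = tuple(steps)   # the "plan": functions applied in order

    def filter(self, predicate):
        # Return a NEW object whose plan has one more step; nothing runs yet.
        def step(records):
            return (r for r in records if predicate(r))
        return MiniFlow(self._source, self._steps + (step,))

    def select(self, *fields):
        def step(records):
            return ({f: r.get(f) for f in fields} for r in records)
        return MiniFlow(self._source, self._steps + (step,))

    def collect(self):
        # Materialization: only now are records pulled through every step.
        records = iter(self._source)
        for step in self._steps:
            records = step(records)
        return list(records)


raw_events = [
    {"type_name": "Shot", "player_name": "A", "shot_outcome_name": "Goal"},
    {"type_name": "Pass", "player_name": "B"},
    {"type_name": "Shot", "player_name": "C", "shot_outcome_name": "Saved"},
]

calls = []  # filled in every time the predicate actually runs


def is_shot(event):
    calls.append(event["player_name"])
    return event.get("type_name") == "Shot"


plan = MiniFlow(raw_events).filter(is_shot).select("player_name", "shot_outcome_name")
calls_before_collect = len(calls)  # still 0: the plan has not been executed
shots = plan.collect()             # the predicate runs here, once per record
```

Building `plan` never calls `is_shot`; the three calls happen inside `collect()`.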

### Doing the Work (Materialization)

The actual data processing – running records through your planned assembly line – happens only when you ask for the final results. This is often called "materialization."

Common ways to get results:

- `.collect()`: Turns the final planned output into a Python list of dictionaries.
- `.to_pandas()`: Converts the output into a pandas DataFrame.
- `.to_jsonl("output.jsonl")`: Writes each resulting record to a file.
- Looping: `for record in final_plan: print(record)`

**Example (continuing from above):**

```python
shot_data_list = final_plan.collect()
# NOW the work happens:
# 1. Records from raw_events are read one by one.
# 2. Each record is checked if it's a "Shot".
# 3. For shots, only "player_name" and "shot_outcome_name" are kept.
# 4. These processed records are collected into shot_data_list.
```
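
If it helps to see materialization without the library, the sketch below builds the same kind of pipeline from plain generator expressions and then materializes it twice: once into a list, and once as JSON Lines text (one JSON object per line). It uses only the standard library; the sample records are made up, and the `io.StringIO` buffer stands in for an output file. It illustrates the idea and is not how `.collect()` or `.to_jsonl()` are implemented.

```python
import io
import json

# Made-up sample records for illustration.
raw_events = [
    {"type_name": "Shot", "player_name": "A", "shot_outcome_name": "Goal"},
    {"type_name": "Pass", "player_name": "B"},
    {"type_name": "Shot", "player_name": "C", "shot_outcome_name": "Saved"},
]

# A lazy pipeline built from generator expressions: nothing has run yet.
shots = (e for e in raw_events if e.get("type_name") == "Shot")
selected = (
    {
        "player_name": e.get("player_name"),
        "shot_outcome_name": e.get("shot_outcome_name"),
    }
    for e in shots
)

# Materialize into a list: this pulls every record through both steps.
shot_list = list(selected)

# Materialize as JSON Lines: one JSON object per line.
buffer = io.StringIO()  # stands in for an open file handle
for record in shot_list:
    buffer.write(json.dumps(record) + "\n")
jsonl_text = buffer.getvalue()
```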

### Important: The "One-Way Street" Nature of `Flow`

`Flow` is designed to be efficient with potentially large streams of data. It generally processes data in a single pass. When you use an operation like `flow.limit(5)`, it doesn't just give you the first 5 items and leave the original flow untouched at the beginning.

- Internally, Flow uses a clever trick (`itertools.tee`) that's like putting a Y-splitter in a pipe.
- When you do `limited_flow = original_flow.limit(5)`:
    - `limited_flow` gets one branch of the "pipe" that will provide the first 5 items.
    - `original_flow` updates itself to use the other branch, which will effectively start after those first 5 items if `limited_flow` is actually used (e.g., by calling `.collect()` on it).
- **Think of it like this:** If you have a stack of 100 pages and you say `my_flow.limit(10).get_pages()`, you get the first 10 pages. The `my_flow` stack itself now effectively starts at page 11 for any subsequent operations on `my_flow`.
- **What this means for you:** If you want to perform two completely independent analyses starting from the very beginning of your original data, you should create a fresh `Flow` from your source data for each analysis:

```python
# Good: Two independent analyses
analysis1_result = Flow(raw_events).filter(some_condition).collect()
analysis2_result = Flow(raw_events).limit(10).collect() # Starts fresh from raw_events

# Potentially confusing: Second operation acts on a progressed Flow
# original_flow = Flow(raw_events)
# analysis1_result = original_flow.filter(some_condition).collect()
# # If analysis1 consumed many items, analysis2 might get fewer than 10 or different items!
# analysis2_result = original_flow.limit(10).collect()
```
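
The same single-pass behaviour can be seen with an ordinary Python iterator. The sketch below uses only `itertools.islice` and a made-up list of records; it does not reproduce `Flow`'s internals, but it shows why reusing a partly consumed stream gives different items than starting again from the source.

```python
from itertools import islice

# Made-up records with ids 1..20.
raw_events = [{"id": i} for i in range(1, 21)]

stream = iter(raw_events)             # a single-pass iterator over the source
first_five = list(islice(stream, 5))  # consumes ids 1..5
next_three = list(islice(stream, 3))  # continues with ids 6..8, not 1..3

# Starting again from the source gives a fresh pass from the beginning.
fresh_three = list(islice(iter(raw_events), 3))  # ids 1..3
```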

### Strengths of this Approach

- Memory Efficient: Excellent for large data files (like detailed event data) because it processes records one by one (or in small chunks) instead of loading everything into memory at once.
- Clear & Readable Pipelines: Chaining operations (`.filter().select().assign()...`) makes your data transformation logic easy to follow.
- Flexible: You can combine many different operations to clean, reshape, and analyze your data.
- Lazy: If you only need the first few records (e.g., `flow.head(5).collect()`) or to check if data exists (`flow.first()`), it avoids processing the entire dataset.
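
The last point is easy to check with plain Python. In this sketch, `read_events` is a hypothetical generator that pretends to stream a million records and counts how many it has actually produced. Taking the first five shots with `itertools.islice` stops the whole pipeline early. This shows the general benefit of lazy streams, not a measurement of `Flow` itself.

```python
from itertools import islice

counter = {"read": 0}  # how many source records were actually produced


def read_events(n):
    # Hypothetical source: yields records one at a time, like reading a file.
    for i in range(n):
        counter["read"] += 1
        yield {"id": i, "type_name": "Shot" if i % 2 == 0 else "Pass"}


shots = (e for e in read_events(1_000_000) if e["type_name"] == "Shot")
first_five_shots = list(islice(shots, 5))

# Only 9 source records were read (ids 0..8), not one million.
records_read = counter["read"]
```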

### Limitations & Things to Keep in Mind

- Single-Pass for a Given Flow Instance: As explained above, once part of a `Flow` is consumed by an operation that returns a result (or by iterating it), that `Flow` instance has "moved forward." For truly independent operations on the full original dataset, start a new `Flow` from your source.
- Some Operations Consume Everything: Certain methods need to see all the data to work, for example `.sort()`, `.summary()`, `.last()`, or `len(flow)`. After these, the `Flow` instance you called them on will be "exhausted" (its internal iterator will be empty). The methods themselves will return a new `Flow` (or data structure) with the result.
- Not a Database Replacement: While powerful for data transformations, it's not designed for complex relational queries, persistent storage, or indexed lookups on massive datasets where a proper database would be more suitable.
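
To see why operations like sorting or counting leave nothing behind, here is a small sketch with plain generators and built-ins rather than `Flow` itself. The records are made up. Sorting and counting both have to read the stream to the end, so a second pass over the same generator finds it empty.

```python
# Made-up records; a generator can only be read once.
events = ({"minute": m} for m in [67, 12, 90, 45])

# Sorting must see every record before it can return the first one.
by_minute = sorted(events, key=lambda e: e["minute"])
leftover_after_sort = list(events)  # empty: the generator is exhausted

# Counting has the same effect on a fresh stream.
more_events = ({"minute": m} for m in [5, 30])
count = sum(1 for _ in more_events)
leftover_after_count = list(more_events)  # empty again
```

Keeping the sorted result (here `by_minute`) is what lets you continue working, which matches the advice above: use the returned object, or start a new `Flow` from the source.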

