# Processing Multiple Files in Parallel


If your dataset is split across many files - one per match, gameweek, or team - you can use `folder_flow()` to apply a **Flow** pipeline to each file in parallel. The work is spread across multiple CPU cores, so a large collection of files is processed faster than it would be one file at a time.

## What `folder_flow` Does

- Scans a folder for `.json` or `.jsonl` files
- Loads each file into a Flow
- Applies your function (`flow_fn`) to each one
- Optionally:
  - Writes transformed records to a per-file output folder
  - Or combines all records and runs a final reduction step

### Example 1: Extract Shots and Save One File per Match

```python
from penaltyblog.matchflow import folder_flow, Flow

def extract_shots(flow: Flow) -> Flow:
    return (
        flow
        .filter(lambda r: r.get("type_name") == "Shot")
        .assign(is_goal=lambda r: r.get("shot_outcome") == "Goal")
        .select("match_id", "player_name", "shot_xg", "is_goal")
    )

folder_flow(
    input_folder="matches/",
    flow_fn=extract_shots,
    output_folder="processed_matches/"
)
```

This will:

- Load each file in `matches/`
- Extract only the shots
- Save the transformed records to `processed_matches/`, one output file per input file, under the original filename

### Example 2: Merge All Files and Summarize Total Goals

```python
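from penaltyblog.matchflow import folder_flow, Flow

# Reuses extract_shots() from Example 1, which adds the is_goal field.
def count_goals(flow: Flow) -> Flow:
    return flow.filter(lambda r: r.get("is_goal")).summary(total_goals="count")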

summary = folder_flow(
    input_folder="matches/",
    flow_fn=extract_shots,
    reduce_fn=count_goals
)

print(summary.collect())
```

This will:

- Runs `extract_shots()` on every file
- Merges all results into a single Flow
- Applies `count_goals()` as a final summary step
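To make the merge-then-reduce idea concrete, here is a rough plain-Python analogue. It does not use the Flow API: `merge_and_reduce` and `count_goal_records` are hypothetical helpers, and the records are made-up illustrative data.

```python
from itertools import chain
from typing import Callable, Iterable


def merge_and_reduce(
    per_file_results: Iterable[list[dict]],
    reduce_fn: Callable[[list[dict]], dict],
) -> dict:
    """Concatenate the per-file record lists, then run one final step."""
    merged = list(chain.from_iterable(per_file_results))
    return reduce_fn(merged)


def count_goal_records(records: list[dict]) -> dict:
    """Hypothetical reduction: count records flagged as goals."""
    return {"total_goals": sum(1 for r in records if r.get("is_goal"))}


# Illustrative per-file outputs, as if each list came from one match file.
per_file = [
    [{"player_name": "A", "is_goal": True}, {"player_name": "B", "is_goal": False}],
    [{"player_name": "C", "is_goal": True}],
]

summary = merge_and_reduce(per_file, count_goal_records)
print(summary)  # {'total_goals': 2}
```

The reduction only ever sees the merged records, so it does not need to know how many files they came from.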

## Parameters

| Parameter       | Description                                            |
| --------------- | ------------------------------------------------------ |
| `input_folder`  | Folder with `.json` or `.jsonl` files                  |
| `flow_fn`       | Your pipeline for a single file (must return a `Flow`) |
| `output_folder` | If set, saves output per file; no results are returned |
| `reduce_fn`     | Optional; run after merging results from all files     |
| `n_jobs`        | Number of worker processes (default: all CPU cores)    |
| `encoding`      | File encoding (default: `"utf-8"`)                     |

## Output File Example

```text
processed_matches/
├── match1.json
├── match2.json
└── match3.json
```

Output file format depends on the input:

- `.jsonl` input → writes `.jsonl` file
- `.json` input → writes `.json` array

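Because the output format follows the input extension, keep all input files on a single extension if you want every output file in the same format.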
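The extension rule can be sketched in a few lines of standard-library Python. This is an assumption about the general pattern, not the library's actual writer; `write_records` is a hypothetical helper.

```python
import json
from pathlib import Path


def write_records(records: list[dict], out_path: Path, encoding: str = "utf-8") -> None:
    """Write JSON Lines for .jsonl paths, otherwise a single JSON array."""
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("w", encoding=encoding) as fh:
        if out_path.suffix == ".jsonl":
            # One JSON object per line.
            for record in records:
                fh.write(json.dumps(record) + "\n")
        else:
            # The whole list as one JSON array.
            json.dump(records, fh)
```

Calling `write_records(rows, Path("processed_matches/match1.jsonl"))` would produce one line per record, while a `.json` path would produce a single array.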

## Tips & Gotchas

- ✅ `flow_fn` and `reduce_fn` must be picklable - use regular named functions defined at module level (not lambdas or closures).
- 🧪 Use `n_jobs=1` during development to debug more easily.
- 📦 Each file is processed independently - no shared state.
- ❌ You can't rely on non-picklable objects, such as open file handles or class-bound methods, inside `flow_fn`.
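If you are unsure whether a function can be sent to a worker process, a quick check with the standard `pickle` module can help. The sketch below assumes the workers receive functions via pickling, as Python's multiprocessing tools do; `is_picklable`, `keep_shots`, and `make_filter` are hypothetical names for illustration.

```python
import pickle


def is_picklable(obj: object) -> bool:
    """Return True if pickle.dumps accepts obj."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, AttributeError, TypeError):
        return False
    return True


def keep_shots(record: dict) -> bool:
    """Module-level named function: pickled by reference to its name."""
    return record.get("type_name") == "Shot"


# A lambda has no importable name, so pickle cannot look it up again.
keep_shots_lambda = lambda record: record.get("type_name") == "Shot"


def make_filter(event_type: str):
    """Returns a closure, which pickle also rejects as a local object."""
    def keep(record: dict) -> bool:
        return record.get("type_name") == event_type
    return keep
```

Running `is_picklable()` on a candidate function with `n_jobs=1` still in place is a cheap way to catch these problems before switching on parallel workers.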

## Summary

Use `folder_flow()` when:

- You want to scale your analysis across hundreds of files
- You need fast, parallelized processing
- You want to either:
  - Save per-file results
  - Merge them into a final summary

It’s perfect for event data, lineups, match summaries, or anything split into multiple files.