# Processing Multiple Files in Parallel

If you have one data file per match or per day, `folder_flow()` applies a Flow pipeline to every file in parallel, which speeds up workflows on multi-core machines.

## What `folder_flow` Does

- Scans a folder for `.json` or `.jsonl` files
- Loads each file into a Flow
- Applies your function (`flow_fn`) to each one
- Optionally:
  - Writes output files (one per input)
  - Or merges the results and reduces them with a final step
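To make the steps above concrete, here is a rough standard-library sketch of the same scan, load, apply, and merge pattern. It is not how `folder_flow` is implemented; `load_records`, `keep_shots`, `process_file`, `run_folder`, and `demo` are hypothetical names used only for illustration, and plain lists of dicts stand in for Flow objects.

```python
import json
import tempfile
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path


def load_records(path):
    """Read a .json file (a list of records) or a .jsonl file (one record per line)."""
    text = Path(path).read_text(encoding="utf-8")
    if Path(path).suffix == ".jsonl":
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    data = json.loads(text)
    return data if isinstance(data, list) else [data]


def keep_shots(records):
    """Hypothetical per-file step, standing in for flow_fn."""
    return [r for r in records if r.get("type_name") == "Shot"]


def process_file(path):
    """Load one file and apply the per-file step; this is what a worker runs."""
    return keep_shots(load_records(path))


def run_folder(input_folder, n_jobs=1):
    """Scan a folder, process every file, and merge the per-file results."""
    paths = sorted(
        p for p in Path(input_folder).iterdir() if p.suffix in {".json", ".jsonl"}
    )
    if n_jobs == 1:
        # In-process loop: slower, but tracebacks are easier to read.
        per_file = [process_file(p) for p in paths]
    else:
        # One task per file, spread across worker processes.
        with ProcessPoolExecutor(max_workers=n_jobs) as pool:
            per_file = list(pool.map(process_file, paths))
    # Merge step: flatten the per-file lists into one list of records.
    return [record for records in per_file for record in records]


def demo():
    """Build a tiny throwaway folder and run the sketch over it."""
    events = [
        {"type_name": "Shot", "match_id": 1},
        {"type_name": "Pass", "match_id": 1},
    ]
    with tempfile.TemporaryDirectory() as folder:
        Path(folder, "match_1.json").write_text(json.dumps(events), encoding="utf-8")
        Path(folder, "match_2.jsonl").write_text(
            "\n".join(json.dumps(e) for e in events), encoding="utf-8"
        )
        return run_folder(folder, n_jobs=1)
```

Calling `demo()` writes one `.json` and one `.jsonl` file to a temporary folder and returns the merged shot records from both. When running with `n_jobs` greater than 1 as a script, the call should sit under an `if __name__ == "__main__":` guard so worker processes can import the module safely.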

### Example: Extract Shots and Save One File per Match

```python
from penaltyblog.matchflow import folder_flow, Flow

def extract_shots(flow: Flow) -> Flow:
    return (
        flow
        .filter(lambda r: r.get("type_name") == "Shot")
        .assign(is_goal=lambda r: r.get("shot_outcome") == "Goal")
        .select("match_id", "player_name", "shot_xg", "is_goal")
    )

folder_flow(
    input_folder="matches/",
    flow_fn=extract_shots,
    output_folder="processed_matches/"
)
```

Each match file is read, transformed, and saved to `processed_matches/`, with the work spread across all your CPU cores by default.

### Example: Merge All Files and Get Total Goals

```python
# Reuses extract_shots and the imports from the previous example.
def count_goals(flow: Flow) -> Flow:
    return flow.filter(lambda r: r.get("is_goal")).summary(total_goals="count")

summary = folder_flow(
    input_folder="matches/",
    flow_fn=extract_shots,
    reduce_fn=count_goals
)

print(summary.collect())
```

Here `extract_shots` runs on each file, the per-file results are merged, and `count_goals` reduces the merged Flow to a single summary.

## Guidelines for `folder_flow`

| Parameter       | Description                                          |
| --------------- | ---------------------------------------------------- |
| `input_folder`  | Folder with your `.json` or `.jsonl` files           |
| `flow_fn`       | Your pipeline for a single file (must return a Flow) |
| `output_folder` | If given, saves output per file and returns nothing  |
| `reduce_fn`     | Optional final step if combining all results         |
| `n_jobs`        | Number of worker processes (default: all cores)      |

### Tips & Gotchas

- `flow_fn` and `reduce_fn` must be picklable → define them as regular named functions at module level rather than as lambdas
- Set `n_jobs=1` to debug errors more easily
- Each file is processed independently (no shared state)
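
The picklability rule can be checked up front with the standard `pickle` module. The sketch below assumes functions are shipped to worker processes with standard pickling; `is_goal`, `is_goal_lambda`, and `can_pickle` are hypothetical names for illustration.

```python
import pickle


def is_goal(record):
    """A regular module-level function: pickled by reference to its name."""
    return record.get("shot_outcome") == "Goal"


# The same logic as a lambda: it has no importable name, so pickling fails.
is_goal_lambda = lambda record: record.get("shot_outcome") == "Goal"


def can_pickle(obj):
    """Return True if obj can be serialized with pickle.dumps, else False."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False
```

Running `can_pickle(my_fn)` on a function before passing it as `flow_fn` or `reduce_fn` is a cheap way to catch this problem early. A named function that uses lambdas internally, like `extract_shots` above, is still pickled by its own name, so the function object itself passes this check.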

## Summary

Use `folder_flow()` to:

- Scale up your analysis across many files
- Split your work into smaller, parallel pieces
- Save outputs or combine them flexibly

Perfect for batched data like match events or lineups.