# Processing Multiple Files in Parallel


Often, your data isn't in a single monolithic file but spread across multiple files - perhaps one file per match, per player, or per day. Processing these files sequentially can be time-consuming. `Flow` provides a utility function, `folder_flow`, designed to process a directory of data files in parallel, applying your `Flow` transformations efficiently.

## Core Idea:

`folder_flow` works by:

- Identifying all relevant data files (e.g., `.json` or `.jsonl`) in a specified input folder.
- Handing each file to a separate worker process (up to a specified number of parallel jobs).
- In each process:
    - Loading the data file into a `Flow`.
    - Applying a user-provided function (`flow_fn`) containing `Flow` pipeline logic to this `Flow`.
- Then doing one of the following with the processed results from each file:
    - Writing them to a corresponding new file in an output folder.
    - Collecting and merging them into a single `Flow`.

Optionally, if results are merged, a final "reduce" function (`reduce_fn`) can be applied to this combined `Flow`.
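The steps above can be sketched with the standard library alone. This is only an illustration of the pattern, not `folder_flow`'s actual implementation: it uses plain lists of dicts in place of `Flow`, and the names `load_records`, `process_one`, `process_folder`, and `keep_shots` are hypothetical helpers invented for this example.

```python
import json
from concurrent.futures import ProcessPoolExecutor
from functools import partial
from pathlib import Path
from typing import Callable, Optional

Records = list[dict]  # stand-in for a Flow in this sketch


def load_records(path: Path, encoding: str = "utf-8") -> Records:
    """Read a .jsonl file line by line, or a .json file as a list of records."""
    text = path.read_text(encoding=encoding)
    if path.suffix == ".jsonl":
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    data = json.loads(text)
    return data if isinstance(data, list) else [data]


def process_one(path: Path, per_file_fn: Callable[[Records], Records]) -> Records:
    """Worker step: load one file and apply the per-file function to it."""
    return per_file_fn(load_records(path))


def process_folder(
    input_folder,
    per_file_fn: Callable[[Records], Records],
    reduce_fn: Optional[Callable[[Records], Records]] = None,
    n_jobs: int = 1,
    file_exts: tuple = (".json", ".jsonl"),
) -> Records:
    """Map per_file_fn over every matching file, merge, then optionally reduce."""
    paths = sorted(p for p in Path(input_folder).iterdir() if p.suffix in file_exts)
    worker = partial(process_one, per_file_fn=per_file_fn)
    if n_jobs == 1:
        # Sequential path: everything runs in the current process.
        per_file = [worker(p) for p in paths]
    else:
        # Parallel path: per_file_fn must be pickleable to reach the workers.
        with ProcessPoolExecutor(max_workers=n_jobs) as pool:
            per_file = list(pool.map(worker, paths))
    merged = [rec for records in per_file for rec in records]
    return reduce_fn(merged) if reduce_fn else merged


def keep_shots(records: Records) -> Records:
    """Example per-file function: keep only shot events."""
    return [r for r in records if r.get("type_name") == "Shot"]
```

Calling `process_folder("match_data", keep_shots)` would return the merged shot records from every file. When passing `n_jobs` greater than 1 from a script, it is safest to make the call under an `if __name__ == "__main__":` guard, since some platforms start worker processes by re-importing the main module.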

## Using folder_flow

Here's the typical signature:

```python
# Import with: from penaltyblog.matchflow import folder_flow

def folder_flow(
    input_folder: Union[str, Path],
    flow_fn: Callable[["Flow"], "Flow"],
    output_folder: Optional[Union[str, Path]] = None,
    reduce_fn: Optional[Callable[["Flow"], "Flow"]] = None,
    n_jobs: Optional[int] = None,
    encoding: str = "utf-8",
    file_exts: tuple[str, ...] = (".json", ".jsonl"),
) -> Optional["Flow"]:
    ...
```

## Parameters:

- `input_folder`: Path to the directory containing your data files.
- `flow_fn`: Your processing logic. This is a Python function that:
    - Must accept a `Flow` object (containing data from one input file) as its first argument.
    - Must return a `Flow` object representing the processed data for that file.
    - Must be "pickleable" (meaning Python's `pickle` module can serialize it). This usually means it should be a named function defined at the top level of a module. Complex closures or instance methods can sometimes cause issues (this is a built-in limitation of Python rather than of `Flow`). `folder_flow` will try to check this and raise an error if the function is not pickleable.
- `output_folder` (optional): If provided, the processed results from each input file are saved as a new file (with the same name as the input file) in this directory. The format (`.json` or `.jsonl`) matches the input. If `output_folder` is specified, `folder_flow` returns `None`.
- `reduce_fn` (optional): If `output_folder` is not specified, the results from all processed files are collected and merged into a single `Flow`. `reduce_fn` is then applied to this combined `Flow` for final aggregations or transformations.
    - Like `flow_fn`, it must accept a `Flow`, return a `Flow`, and be pickleable.
- `n_jobs` (optional): The number of parallel worker processes to use. Defaults to the number of CPU cores on your machine. Set to 1 for sequential processing (useful for debugging).
- `encoding` (optional): File encoding to use for reading and writing. Defaults to `"utf-8"`.
- `file_exts` (optional): A tuple of file extensions that identifies which files in `input_folder` should be processed. Defaults to `(".json", ".jsonl")`.
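To see what the pickleability requirement means in practice, you can test a candidate function with the standard `pickle` module before handing it to `folder_flow`. The sketch below is a rough pre-flight check under the assumption that a successful `pickle.dumps` is a good proxy; it is not the check `folder_flow` itself performs, and `is_pickleable`, `keep_shots`, and `make_filter` are hypothetical names used only for illustration.

```python
import pickle
from typing import Any, Callable


def is_pickleable(obj: Any) -> bool:
    """Return True if the standard pickle module can serialize obj."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, AttributeError, TypeError):
        return False
    return True


def keep_shots(record: dict) -> bool:
    """A named, module-level function: pickled by reference to its name."""
    return record.get("type_name") == "Shot"


def make_filter(event_type: str) -> Callable[[dict], bool]:
    """Returns a nested function (a closure), which pickle cannot look up by name."""

    def _filter(record: dict) -> bool:
        return record.get("type_name") == event_type

    return _filter


print("named function:", is_pickleable(keep_shots))
print("lambda:        ", is_pickleable(lambda r: r))
print("closure:       ", is_pickleable(make_filter("Shot")))
```

Run as a script, the named function should pass while the lambda and the closure fail. Note that lambdas used inside the body of a named function are not a problem: `pickle` stores only a reference to the outer function's qualified name, not its code.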


## Return Value:

- If `output_folder` is specified: Returns `None`.
- If `output_folder` is not specified: Returns a `Flow` object containing the combined (and optionally reduced) results from all processed files.

## Example: Processing Multiple Match Event Files

Let's say you have a folder `match_data/` containing several JSON event files (e.g., `match1.json`, `match2.json`, ...). You want to extract all shots, add a field indicating if the shot was a goal, and then:

- Save each processed match's shots to a new file.
- Alternatively, combine all shots from all matches and get a final count.

### Define your flow_fn:

This function will operate on the `Flow` created from a single match file.

```python
from penaltyblog.matchflow import Flow, folder_flow

# Make sure this function is defined at the top level of your script or in an importable module.
def extract_and_mark_shots(flow: Flow) -> Flow:
    """Processes a single match's events: filters for shots and marks goals."""
    return (
        flow
        .filter(lambda r: r.get("type_name") == "Shot") 
        .assign(is_goal=lambda r: r.get("shot_outcome") == "Goal") 
        .select("match_id", "player_name", "shot_xg", "is_goal", "period", "timestamp")
    )
```

### Using `folder_flow` to write individual processed files:

```python
input_dir = "path/to/your/match_data/"
output_dir = "path/to/your/processed_shots_per_match/"

# This will process all .json files in input_dir using extract_and_mark_shots,
# save each result to output_dir, and run using all available CPU cores.
folder_flow(
    input_folder=input_dir,
    flow_fn=extract_and_mark_shots,
    output_folder=output_dir,
    file_exts=(".json",) # Assuming files are .json
)
print(f"Processed shot files saved to {output_dir}")
```

After this runs, `output_dir` will contain files like `match1.json`, `match2.json`, etc., each containing only the processed shot data for that match.

### Using `folder_flow` to combine results and apply a reduce_fn:

Now, let's get a combined `Flow` of all shots from all matches and then count the total number of goals.

```python
def count_total_goals(all_shots_flow: Flow) -> Flow:
    """Takes a flow of all shots and returns a summary flow with total goals."""
    return (
        all_shots_flow
        .filter(lambda r: r.get("is_goal") is True)
        .summary(total_goals_scored="count")
    )

input_dir = "path/to/your/match_data/"

# No output_folder means results are merged
all_processed_shots_flow = folder_flow(
    input_folder=input_dir,
    flow_fn=extract_and_mark_shots, # Same flow_fn as before
    reduce_fn=count_total_goals,    # Apply our new reduce function
    file_exts=(".json",)
)

if all_processed_shots_flow is not None:
    summary_data = all_processed_shots_flow.collect()
    # summary_data would be like: [{"total_goals_scored": X}]
    print(f"Summary from all matches: {summary_data}")
else:
    print("folder_flow did not return a Flow (should not happen if output_folder is None and no errors).")
```

## Important Considerations for Parallel Processing:

- Pickleability: As mentioned, `flow_fn` and `reduce_fn` must be pickleable. This is a requirement of Python's multiprocessing library. If you get errors related to pickling, ensure your functions are defined at the top level of a module or are simple enough not to rely on unpickleable closures.
- No Shared State (Almost): Each worker process gets its own copy of the `flow_fn` and operates on its assigned file independently. Avoid trying to modify global variables or shared state directly from within `flow_fn` across processes, as it may not work as expected. The design of `folder_flow` (returning data or writing to distinct files) handles the results.
- Resource-Intensive `flow_fn`: If your `flow_fn` itself is very memory-intensive (e.g., it materializes a large part of a single file's `Flow` internally), running many such jobs in parallel could still strain system memory. In that case, consider setting `n_jobs` to a smaller number.
- Debugging: Parallel processing can sometimes make debugging harder. If you encounter issues, try running `folder_flow` with `n_jobs=1`. This will run the processing sequentially in the main process, making stack traces easier to understand.
- File Loading: `folder_flow` attempts to detect whether to load files as JSON Lines (.jsonl) or standard JSON (.json) based on the file extension.

`folder_flow` provides a powerful way to scale your `Flow`-based analyses across many files by leveraging multi-core processors. By defining a clear processing function for a single file (`flow_fn`) and optionally a reduction function (`reduce_fn`), you can significantly speed up your workflows for common batch processing tasks. 
