# Working with Files: Input & Output

A key part of any data processing workflow is reading data from various sources and writing out your results. `Flow` provides several convenient methods for handling common file formats, especially JSON-based ones typical in sports analytics.

## Loading Data into Flow

`Flow` offers class methods (which you call like `Flow.from_something(...)`) to create a `Flow` instance directly from data sources.

### From In-Memory Python Objects (`.from_records()`)

If you already have your data as a Python list of dictionaries (or just a single dictionary, or any iterable of dictionaries), you can easily create a `Flow`:

```python
# List of dictionaries
my_data = [
    {"id": 1, "value": "A"},
    {"id": 2, "value": "B"}
]
flow_from_list = Flow.from_records(my_data)

# Single dictionary
single_record = {"id": 3, "value": "C"}
flow_from_dict = Flow.from_records(single_record) # Creates a Flow with one record

# From a generator function
def my_generator():
    for i in range(3):
        yield {"id": i, "generated_value": i*10}

flow_from_gen = Flow.from_records(my_generator())

# You can also use the Flow constructor directly for iterables:
flow_from_constructor = Flow(my_data)
flow_from_constructor_gen = Flow(my_generator())
```

`.from_records()` is flexible and is often the starting point if your data is already in Python.
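To make the accepted input shapes concrete, here is a minimal sketch of how a single dictionary, a list, or a generator could be normalized into one stream of records. `iter_records` is a hypothetical helper written for illustration; it is not part of `Flow`'s API, and `Flow` may handle these cases differently internally.

```python
from collections.abc import Iterable, Iterator, Mapping


def iter_records(data) -> Iterator[dict]:
    """Yield records from a single dict or any iterable of dicts (illustrative helper)."""
    if isinstance(data, Mapping):
        # A lone dictionary is treated as a one-record stream.
        yield dict(data)
        return
    if not isinstance(data, Iterable):
        raise TypeError(
            f"expected a dict or an iterable of dicts, got {type(data).__name__}"
        )
    for record in data:
        yield record


single = list(iter_records({"id": 3, "value": "C"}))
many = list(iter_records({"id": i} for i in range(3)))
```

Checking for a mapping first matters because a dictionary is itself iterable: looping over it directly would yield its keys rather than the record.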

### From JSON Lines Files (`.from_jsonl()`)

JSON Lines (`.jsonl`) is a convenient format where each line in the file is a separate, valid JSON object. This format is excellent for streaming large datasets.

```python
# Assuming 'match_events.jsonl' exists and each line is a JSON event dictionary
events_flow = Flow.from_jsonl("path/to/your/match_events.jsonl")

# You can also specify encoding if it's not UTF-8
events_flow_custom_encoding = Flow.from_jsonl("data.jsonl", encoding="latin-1")

# Now you can chain operations:
shots_flow = events_flow.filter(lambda r: r.get("type_name") == "Shot")
shots_data = shots_flow.collect()
```

`Flow` will read the file line by line, parsing each line as a JSON record and yielding it into the stream. This is memory-efficient as the whole file isn't loaded at once.
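The line-by-line behaviour described above can be sketched with a plain generator and the standard library. This is an illustration of the general technique, not `Flow`'s actual implementation; `read_jsonl` is a hypothetical name, and skipping blank lines is an assumption of this sketch.

```python
import json
import os
import tempfile
from typing import Iterator


def read_jsonl(path: str, encoding: str = "utf-8") -> Iterator[dict]:
    """Yield one parsed record per non-empty line (illustrative helper)."""
    with open(path, encoding=encoding) as handle:
        for line in handle:
            line = line.strip()
            if line:  # skipping blank lines is an assumption of this sketch
                yield json.loads(line)


# Write a small sample file so the sketch runs on its own.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "match_events.jsonl")
    with open(sample_path, "w", encoding="utf-8") as handle:
        handle.write('{"type_name": "Shot", "minute": 12}\n')
        handle.write('{"type_name": "Pass", "minute": 13}\n')

    # Records are pulled from disk only as the loop advances.
    shot_minutes = [
        r["minute"] for r in read_jsonl(sample_path) if r["type_name"] == "Shot"
    ]
```

Because the generator holds only one line at a time, memory use stays roughly constant regardless of file size.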

### From a Single JSON File (`.from_file()`)

If your data is in a single JSON file that contains either a JSON array (a list of objects) or a single JSON object, you can use `.from_file()`:

```python
# Case 1: File contains a JSON array: [{"id":1,...}, {"id":2,...}]
flow_from_array_file = Flow.from_file("path/to/data_array.json")

# Case 2: File contains a single JSON object: {"match_id":123, "data":{...}}
# This will result in a Flow with a single record.
flow_from_object_file = Flow.from_file("path/to/single_object.json")
```

Note: `.from_file()` reads the entire file content into memory to parse the JSON structure, so it's best for files that comfortably fit in memory. For very large arrays of JSON objects, `.from_jsonl()` is preferred if you can use that format.
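As a rough illustration of the two cases, the sketch below parses a whole JSON document at once and turns either top-level shape into a list of records. `records_from_json_text` is a hypothetical helper, not a `Flow` method, and rejecting other top-level types is an assumption made here.

```python
import json


def records_from_json_text(text: str) -> list:
    """Parse JSON text and return a list of records (illustrative helper)."""
    parsed = json.loads(text)  # the whole document is parsed in one go
    if isinstance(parsed, list):
        return parsed  # Case 1: a JSON array of records
    if isinstance(parsed, dict):
        return [parsed]  # Case 2: a single object becomes one record
    raise ValueError("expected a JSON array or a JSON object at the top level")


from_array = records_from_json_text('[{"id": 1}, {"id": 2}]')
from_object = records_from_json_text('{"match_id": 123, "data": {"home": "A"}}')
```

Unlike the JSON Lines case, nothing can be yielded until the entire document has been parsed, which is why file size matters here.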

### From a Folder of JSON Files (`.from_folder()`)

If you have multiple JSON files in a single folder, and each file contains either a JSON array of records or a single JSON record, `.from_folder()` can load them all into a single `Flow`.

```python
# Assuming 'data_folder/' contains 'file1.json', 'file2.json', etc.
# Each .json file can be a list of records or a single record.
combined_flow = Flow.from_folder("path/to/your/data_folder")

# Non-JSON files in the folder are skipped.
# The records from all JSON files are streamed together.
```

This method iterates through the files in the folder, reads each one, and yields its records.

### From a Glob Pattern (`.from_glob()`)

For more flexible file matching, including searching in subdirectories, you can use `.from_glob()`:

```python
# Load all JSON files in 'data_folder' and its subfolders
all_json_flow = Flow.from_glob("path/to/your/data_folder/**/*.json")

# Load all event files from a competition season
season_events_flow = Flow.from_glob("competitions/super_league/season_2023/events_*.json")
```

`.from_glob()` uses Python's `glob` module to find matching file paths and then processes each file in the same way as `.from_folder()`.
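For readers unfamiliar with recursive patterns, here is a small standard-library sketch of the same idea. `records_from_glob` is a hypothetical helper, and how `Flow` calls `glob` internally is not shown in this guide; the sketch only demonstrates how `**` behaves with `glob.glob`.

```python
import glob
import json
import os
import tempfile
from typing import Iterator


def records_from_glob(pattern: str) -> Iterator[dict]:
    """Yield records from every file matching the pattern (illustrative helper)."""
    # recursive=True is what lets '**' match nested directories in glob.glob;
    # sorting just makes the order deterministic for this sketch.
    for path in sorted(glob.glob(pattern, recursive=True)):
        with open(path, encoding="utf-8") as handle:
            parsed = json.load(handle)
        # A file may hold a list of records or a single record.
        if isinstance(parsed, list):
            yield from parsed
        else:
            yield parsed


with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "season_2023"))
    with open(os.path.join(tmp, "top.json"), "w", encoding="utf-8") as handle:
        json.dump({"id": 1}, handle)
    nested = os.path.join(tmp, "season_2023", "events_1.json")
    with open(nested, "w", encoding="utf-8") as handle:
        json.dump([{"id": 2}, {"id": 3}], handle)

    pattern = os.path.join(tmp, "**", "*.json")
    loaded_ids = sorted(r["id"] for r in records_from_glob(pattern))
```

In the standard library, `**` matches zero or more directory levels, so the pattern picks up both the top-level file and the nested one.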

### Special Loaders: StatsBomb Open Data (`.statsbomb.from_github_file()`)

`Flow` includes a convenient helper to directly load StatsBomb open data files (events, matches, lineups) from their GitHub repository:

```python
# Load event data for StatsBomb match_id 266516
match_id = 266516
events_flow = Flow.statsbomb.from_github_file(match_id=match_id, type="events")

# Load lineup data for the same match
lineups_flow = Flow.statsbomb.from_github_file(match_id=match_id, type="lineups")

# Process as usual
shots = events_flow.filter(lambda r: r.get("type_name") == "Shot").collect()
```

This method handles fetching the data from the URL and parsing the JSON.

## Saving Data from `Flow`

Once you've processed your data, `Flow` provides methods to save the results. These methods typically consume (materialize) the `Flow` as they write the data out, then return the `Flow` instance itself so that further chaining remains possible, although saving is often the last step.

### To Individual JSON Files (`.to_json_files()`)

This method writes each record in your `Flow` to a separate JSON file within a specified folder.

```python
shots_flow = events_flow.filter(lambda r: r.get("type_name") == "Shot")

# Create a folder 'output_shots' and save each shot event as a separate JSON file.
# Files will be named 'record_1.json', 'record_2.json', etc. by default.
shots_flow.to_json_files("output_shots/")

# You can also name files based on a field in the record.
# Here, processed_flow stands for any Flow whose records have a 'player_name' field.
processed_flow.to_json_files("output_by_player/", by="player_name")
# This would create files like 'Kevin_De_Bruyne.json', 'Erling_Haaland.json'.
# Note: Duplicate names from the 'by' field will overwrite previous files.
```

The output folder will be created if it doesn't exist.
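The overwrite caveat for field-based names is easy to miss, so here is a standard-library sketch that reproduces it. `write_records_by_field` is a hypothetical helper, not `Flow`'s implementation, and replacing spaces with underscores is an assumption inferred from the example file names above.

```python
import json
import os
import tempfile


def write_records_by_field(records, folder: str, by: str) -> None:
    """Write each record to '<folder>/<value of by>.json' (illustrative helper)."""
    os.makedirs(folder, exist_ok=True)  # create the output folder if needed
    for record in records:
        # Replacing spaces with underscores is an assumption based on the
        # example file names; real sanitization rules may differ.
        name = str(record[by]).replace(" ", "_")
        path = os.path.join(folder, f"{name}.json")
        with open(path, "w", encoding="utf-8") as handle:
            json.dump(record, handle)


shots = [
    {"player_name": "Kevin De Bruyne", "minute": 10},
    {"player_name": "Erling Haaland", "minute": 25},
    {"player_name": "Kevin De Bruyne", "minute": 70},  # same name as the first record
]

with tempfile.TemporaryDirectory() as tmp:
    out_dir = os.path.join(tmp, "output_by_player")
    write_records_by_field(shots, out_dir, by="player_name")
    written_files = sorted(os.listdir(out_dir))
    kept_path = os.path.join(out_dir, "Kevin_De_Bruyne.json")
    with open(kept_path, encoding="utf-8") as handle:
        surviving_record = json.load(handle)
```

Three records go in but only two files come out, and the later record for the repeated name is the one that survives. If every record must be kept, a field with unique values (or the default numbered names) is the safer choice.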

### To a Single JSON Lines File (`.to_jsonl()`)

Writes all records in the `Flow` to a single `.jsonl` file, with each record as a JSON string on a new line. This is often a good format for later reloading or for use with other stream-processing tools.

```python
processed_flow = events_flow.filter(...).assign(...)
processed_flow.to_jsonl("path/to/output/processed_events.jsonl")
```
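To show what the resulting file looks like on disk, the following sketch writes records as JSON Lines with the standard library and reads them back. `write_jsonl` is a hypothetical helper for illustration only; it is not how `Flow` is implemented.

```python
import json
import os
import tempfile


def write_jsonl(records, path: str) -> int:
    """Write one JSON object per line and return the number of records written."""
    count = 0
    with open(path, "w", encoding="utf-8") as handle:
        for record in records:  # works with lists and generators alike
            handle.write(json.dumps(record) + "\n")
            count += 1
    return count


events = ({"id": i, "type_name": "Shot"} for i in range(3))

with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "processed_events.jsonl")
    written = write_jsonl(events, out_path)
    # Reload the file line by line to confirm the round trip.
    with open(out_path, encoding="utf-8") as handle:
        reloaded = [json.loads(line) for line in handle]
```

Because each record is written as soon as it is produced, this format never needs the full dataset in memory at once.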

### To a Single JSON File (as an array) (`.to_json_single()`)

Saves all records from the `Flow` into a single JSON file, formatted as a JSON array (a list of objects).

```python
summary_flow = events_flow.group_by("team_name").aggregate(total_shots="count")
summary_flow.to_json_single("path/to/output/team_summary.json", indent=4) # indent for pretty printing
```

This method collects all records into a list in memory before writing, so be mindful of memory usage for very large flows.
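The memory note follows from the format itself: a JSON array has to be serialized as one value. A minimal standard-library sketch, using a hypothetical `to_json_array_text` helper rather than `Flow`'s own code:

```python
import json


def to_json_array_text(records, indent=None) -> str:
    """Serialize all records as a single JSON array (illustrative helper)."""
    materialized = list(records)  # every record is held in memory at this point
    return json.dumps(materialized, indent=indent)


summary = ({"team_name": name, "total_shots": shots}
           for name, shots in [("Home", 14), ("Away", 9)])
array_text = to_json_array_text(summary, indent=4)
round_trip = json.loads(array_text)
```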

### To a Pandas DataFrame (`.to_pandas()`)

A very common step is to convert your processed `Flow` into a pandas DataFrame for more advanced analysis, visualization, or to save in other formats (like CSV) via pandas.

```python
final_flow = events_flow.filter(...).select(...)
df = final_flow.to_pandas()

# Now you can use pandas capabilities:
print(df.head())
df.to_csv("output_data.csv", index=False)
```

This also collects all records into memory to construct the DataFrame.

Effectively loading your source data and saving your valuable results is critical. `Flow` aims to make these common IO tasks straightforward, especially for JSON-centric workflows. Next, we'll explore more advanced operations for manipulating and combining streams.
