# Working with Files: Input & Output

A key part of any data workflow is loading data from various sources and saving your results. `Flow` provides convenient methods for working with common formats, especially JSON and JSON Lines.


## Loading Data into Flow

`Flow` offers class methods (which you call like `Flow.from_something(...)`) to create a `Flow` instance directly from data sources.

### From In-Memory Python Objects (`.from_records()`)

If you already have your data as a Python list of dictionaries (or just a single dictionary, or any iterable of dictionaries), you can easily create a `Flow`:

```python
# List of dictionaries
my_data = [
    {"id": 1, "value": "A"},
    {"id": 2, "value": "B"}
]
flow_from_list = Flow.from_records(my_data)

# Single dictionary
single_record = {"id": 3, "value": "C"}
flow_from_dict = Flow.from_records(single_record) # Creates a Flow with one record

# From a generator function
def my_generator():
    for i in range(3):
        yield {"id": i, "generated_value": i*10}

flow_from_gen = Flow.from_records(my_generator())

# You can also use the Flow constructor directly for iterables:
flow_from_constructor = Flow(my_data)
flow_from_constructor_gen = Flow(my_generator())
```

`.from_records()` is flexible and is often the starting point if your data is already in Python.
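
To make the accepted input shapes concrete, here is a minimal plain-Python sketch of how a single dictionary, a list, and a generator can all be normalised into one stream of records. The helper `iter_records` is hypothetical and is not part of `Flow`; it only illustrates the idea and makes no claim about how `.from_records()` is implemented.

```python
from collections.abc import Iterable, Iterator, Mapping


def iter_records(data) -> Iterator[dict]:
    """Yield records from a single dict or any iterable of dicts (hypothetical helper)."""
    if isinstance(data, Mapping):
        # A lone dictionary is treated as a stream containing one record.
        yield dict(data)
        return
    if not isinstance(data, Iterable):
        raise TypeError("expected a dict or an iterable of dicts")
    for record in data:
        yield record


def numbers():
    for i in range(3):
        yield {"id": i, "generated_value": i * 10}


print(list(iter_records({"id": 3, "value": "C"})))
print(list(iter_records(numbers())))
```

Checking for a mapping first matters, because a dictionary is itself iterable and would otherwise be walked key by key.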

### From JSON Lines Files (`.from_jsonl()`)

[JSON Lines](https://jsonlines.org/) (`.jsonl`) is a convenient format where each line in the file is a separate, valid JSON object. This format is excellent for streaming large datasets.

```python
# Assuming 'match_events.jsonl' exists and each line is a JSON event dictionary
events_flow = Flow.from_jsonl("path/to/your/match_events.jsonl")

# You can also specify encoding if it's not UTF-8
events_flow_custom_encoding = Flow.from_jsonl("data.jsonl", encoding="latin-1")

# Now you can chain operations:
shots_flow = events_flow.filter(lambda r: r.get("type_name") == "Shot")
shots_data = shots_flow.collect()
```

`Flow` will read the file line by line, parsing each line as a JSON record and yielding it into the stream. This is memory-efficient as the whole file isn't loaded at once.
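
The line-by-line behaviour described above can be sketched with the standard library alone. This is an illustration of the technique under stated assumptions, not `Flow`'s actual code: `iter_jsonl` is a hypothetical helper, and skipping blank lines is a choice made for this sketch.

```python
import json
import os
import tempfile
from typing import Iterator


def iter_jsonl(path: str, encoding: str = "utf-8") -> Iterator[dict]:
    """Lazily yield one parsed record per non-empty line (hypothetical helper)."""
    with open(path, encoding=encoding) as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines
                yield json.loads(line)


# Build a small throwaway file so the sketch can run on its own.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "match_events.jsonl")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write('{"event_id": 1, "type_name": "Pass"}\n')
        handle.write('{"event_id": 2, "type_name": "Shot"}\n')

    shots = [r for r in iter_jsonl(path) if r.get("type_name") == "Shot"]

print(shots)
```

Because the function is a generator, only one line is held in memory at a time, regardless of the file size.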

### From a Single JSON File (`.from_file()`)

If your data is in a single JSON file that contains either a JSON array (a list of objects) or a single JSON object, you can use `.from_file()`:

```python
# Case 1: File contains a JSON array: [{"id":1,...}, {"id":2,...}]
flow_from_array_file = Flow.from_file("path/to/data_array.json")

# Case 2: File contains a single JSON object: {"match_id":123, "data":{...}}
# This will result in a Flow with a single record.
flow_from_object_file = Flow.from_file("path/to/single_object.json")
```

Note: `.from_file()` reads the entire file content into memory to parse the JSON structure, so it's best for files that comfortably fit in memory. For very large arrays of JSON objects, `.from_jsonl()` is preferred if you can use that format.
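
As a rough illustration of the two cases, the following standard-library sketch parses a whole file and normalises the result into a list of records. The helper `load_json_records` is hypothetical; raising an error for other top-level types is an assumption of this sketch, not documented `Flow` behaviour.

```python
import json
import os
import tempfile


def load_json_records(path: str) -> list:
    """Read a whole JSON file and return its content as a list of records (hypothetical helper)."""
    with open(path, encoding="utf-8") as handle:
        payload = json.load(handle)  # the entire document is parsed in one go
    if isinstance(payload, list):
        return payload
    if isinstance(payload, dict):
        return [payload]  # a single object becomes a one-record list
    raise ValueError("expected a JSON array or a JSON object at the top level")


with tempfile.TemporaryDirectory() as tmp:
    array_path = os.path.join(tmp, "data_array.json")
    object_path = os.path.join(tmp, "single_object.json")
    with open(array_path, "w", encoding="utf-8") as handle:
        json.dump([{"id": 1}, {"id": 2}], handle)
    with open(object_path, "w", encoding="utf-8") as handle:
        json.dump({"match_id": 123, "data": {}}, handle)

    from_array = load_json_records(array_path)
    from_object = load_json_records(object_path)

print(len(from_array), len(from_object))
```

A JSON array has to be parsed as a whole before its first element is available to `json.load`, which is why this approach cannot stream the way a JSON Lines reader can.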

### From a Folder of JSON Files (`.from_folder()`)

If you have multiple JSON files in a single folder, and each file contains either a JSON array of records or a single JSON record, `.from_folder()` can load them all into a single `Flow`.

```python
# Assuming 'data_folder/' contains 'file1.json', 'file2.json', etc.
# Each .json file can be a list of records or a single record.
combined_flow = Flow.from_folder("path/to/your/data_folder")

# Non-JSON files in the folder are skipped.
# The records from all JSON files are streamed together.
```

This method iterates through the files in the folder, reads each one, and yields its records.

### From a Glob Pattern (`.from_glob()`)

For more flexible file matching, including searching in subdirectories, you can use `.from_glob()`:

```python
# Load all JSON files in 'data_folder' and its subfolders
all_json_flow = Flow.from_glob("path/to/your/data_folder/**/*.json")

# Load all event files from a competition season
season_events_flow = Flow.from_glob("competitions/super_league/season_2023/events_*.json")
```

`.from_glob()` uses Python's `glob` module to find matching file paths and then processes each file in the same way as `.from_folder()`.

💡 Files that don't contain valid JSON, or whose top-level value is not a list or dict, are skipped silently.
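
The matching and skipping behaviour can be sketched in plain Python as follows. `iter_glob_records` is a hypothetical helper, and the use of `recursive=True` is an assumption about how `**` patterns would need to be handled with the standard `glob` module; `Flow` may do this differently.

```python
import glob
import json
import os
import tempfile
from typing import Iterator


def iter_glob_records(pattern: str) -> Iterator[dict]:
    """Yield records from every matching JSON file, skipping unusable ones (hypothetical helper)."""
    # recursive=True is what lets "**" descend into subdirectories.
    for path in sorted(glob.glob(pattern, recursive=True)):
        try:
            with open(path, encoding="utf-8") as handle:
                payload = json.load(handle)
        except (OSError, ValueError):
            continue  # unreadable or invalid JSON: skip silently
        if isinstance(payload, list):
            yield from payload
        elif isinstance(payload, dict):
            yield payload
        # any other top-level type (number, string, ...) is ignored


with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "season_2023"))
    with open(os.path.join(tmp, "a.json"), "w", encoding="utf-8") as handle:
        json.dump([{"id": 1}, {"id": 2}], handle)
    with open(os.path.join(tmp, "season_2023", "b.json"), "w", encoding="utf-8") as handle:
        json.dump({"id": 3}, handle)
    with open(os.path.join(tmp, "broken.json"), "w", encoding="utf-8") as handle:
        handle.write("{not valid json")

    records = list(iter_glob_records(os.path.join(tmp, "**", "*.json")))

print(records)
```

Silent skipping keeps a large batch load from failing on one bad file, at the cost of hiding data problems, so it is worth checking record counts after loading.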

### Special Loaders: StatsBomb Open Data (`.statsbomb.from_github_file()`)

`Flow` includes a convenient helper to directly load StatsBomb open data files. This loader supports the "events", "lineups", "matches" and similar types found in the StatsBomb Open Data [GitHub repo](https://github.com/statsbomb/open-data/).

```python
# Load event data for StatsBomb match_id 266516
match_id = 266516
events_flow = Flow.statsbomb.from_github_file(match_id=match_id, type="events")

# Load lineup data for the same match
lineups_flow = Flow.statsbomb.from_github_file(match_id=match_id, type="lineups")

# Process as usual
shots = events_flow.filter(lambda r: r.get("type_name") == "Shot").collect()
```

This method handles fetching the data from the URL and parsing the JSON.
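
For readers curious about what such a fetch involves, here is a hedged standard-library sketch. It assumes the open-data repository stores per-match files under `data/events/` and `data/lineups/`, named by match ID; the helpers `statsbomb_url` and `fetch_records` are hypothetical and do not describe how `Flow` builds its requests.

```python
import json
from urllib.request import urlopen

# Assumed layout of the StatsBomb open-data repository on GitHub.
BASE_URL = "https://raw.githubusercontent.com/statsbomb/open-data/master/data"


def statsbomb_url(match_id: int, file_type: str = "events") -> str:
    """Build the raw-file URL for a per-match file (hypothetical helper)."""
    if file_type not in {"events", "lineups"}:
        raise ValueError("this sketch only covers per-match files")
    return f"{BASE_URL}/{file_type}/{match_id}.json"


def fetch_records(url: str) -> list:
    """Download a JSON array and return it as a list (needs network access)."""
    with urlopen(url) as response:
        return json.load(response)


print(statsbomb_url(266516, "events"))
```

`fetch_records` is defined but not called here, so the sketch runs without a network connection.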


## Saving Data from a Flow

Once you've transformed your data, you can save it using one of Flow’s output methods.

💡 Output methods will materialize the `Flow`.

### `.to_json_files()`: One File Per Record

Writes each record to a separate `.json` file in a folder.

```python
# Create one file per record
Flow(...).to_json_files("out/", by="event_id")
# → out/1234.json
# → out/1235.json
```

Name files by a field value:

```python
flow.to_json_files("output_by_player/", by="player_name")
# → output_by_player/Kevin De Bruyne.json
# → output_by_player/Erling Haaland.json
# → output_by_player/Bukayo Saka.json
```

💡 If multiple records share the same name, later ones may overwrite earlier ones.
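
The overwrite risk is easy to reproduce with a plain-Python sketch of the one-file-per-record idea. `write_json_files` is a hypothetical helper written for this illustration, and the shot counts are made-up sample values; the sketch does not claim to match `Flow`'s implementation.

```python
import json
import os
import tempfile


def write_json_files(records, folder: str, by: str) -> None:
    """Write each record to <folder>/<value of `by`>.json (hypothetical helper)."""
    os.makedirs(folder, exist_ok=True)
    for record in records:
        path = os.path.join(folder, f"{record[by]}.json")
        # Mode "w" truncates, so a repeated name replaces the earlier file.
        with open(path, "w", encoding="utf-8") as handle:
            json.dump(record, handle)


records = [
    {"player_name": "Bukayo Saka", "shots": 2},
    {"player_name": "Erling Haaland", "shots": 5},
    {"player_name": "Bukayo Saka", "shots": 4},  # same name as the first record
]

with tempfile.TemporaryDirectory() as tmp:
    write_json_files(records, tmp, by="player_name")
    written = sorted(os.listdir(tmp))
    with open(os.path.join(tmp, "Bukayo Saka.json"), encoding="utf-8") as handle:
        saka = json.load(handle)

print(written)
print(saka)
```

Three records produce only two files here, and the surviving Saka file holds the later record. Naming files by a unique field such as `event_id` avoids the collision.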

### `.to_jsonl()`: Save as JSON Lines

Write all records into one file, one record per line.

```python
flow.to_jsonl("processed/events.jsonl")
# processed/events.jsonl containing:
# {"event_id": 1, "type_name": "Pass", ...}
# {"event_id": 2, "type_name": "Shot", ...}
```

💡 Great for reloading later with `.from_jsonl()`. 
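
A round trip through the format can be sketched with the standard library. `write_jsonl` is a hypothetical helper used only to show the file layout; it is not `Flow`'s writer.

```python
import json
import os
import tempfile


def write_jsonl(records, path: str) -> None:
    """Write one compact JSON document per line (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")


events = [
    {"event_id": 1, "type_name": "Pass"},
    {"event_id": 2, "type_name": "Shot"},
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "events.jsonl")
    write_jsonl(events, path)
    with open(path, encoding="utf-8") as handle:
        reloaded = [json.loads(line) for line in handle]

print(reloaded == events)
```

Since each record is written independently, the writer never needs to hold more than one record at a time.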

### `.to_json_single()`: Save as JSON Array

Write all records into a single `.json` file as an array.

```python
summary = Flow(...).group_by(...).summary(...)
summary.to_json_single("summary.json", indent=4)
# summary.json containing:
# [
#   {"event_id": 1, "type_name": "Pass", ...},
#   {"event_id": 2, "type_name": "Shot", ...}
# ]
```

💡 Collects all records into memory before writing.
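
One plausible reason for collecting first, sketched with the standard library: Python's `json` module cannot serialise a generator directly, so a lazy stream has to be turned into a list before it can be written as an array. Whether `Flow` does exactly this internally is an assumption.

```python
import json


def events():
    yield {"event_id": 1, "type_name": "Pass"}
    yield {"event_id": 2, "type_name": "Shot"}


try:
    json.dumps(events())  # a generator cannot be serialised directly
    generator_ok = True
except TypeError:
    generator_ok = False

# Materialising the stream into a list first makes it serialisable as an array.
text = json.dumps(list(events()), indent=4)
print(generator_ok)
print(text)
```

For outputs too large to hold in memory, the JSON Lines format avoids this step entirely.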

### `.to_pandas()`: Convert to a pandas DataFrame

Handy when moving into pandas for further analysis or export.

```python
df = Flow(...).filter(...).select(...).to_pandas()
df.to_csv("cleaned.csv", index=False)
```

## Summary

| Source Format        | Use Method              | Streaming? | Notes                        |
| -------------------- | ----------------------- | ---------- | ---------------------------- |
| Python dict/list/gen | `.from_records()`       | ✅          | Very flexible starting point |
| JSON Lines file      | `.from_jsonl()`         | ✅          | Best for large datasets      |
| Single JSON file     | `.from_file()`          | ❌          | Loads entire file            |
| Folder of JSON files | `.from_folder()`        | ✅          | Files hold lists or records  |
| Glob pattern         | `.from_glob()`          | ✅          | Flexible file matching       |
| StatsBomb GitHub     | `.statsbomb.from_...()` | ✅          | Download open data directly  |

`Flow` makes it easy to load, process, and save JSON-centric data, so you can focus on analysis rather than file wrangling.

## What’s Next?

Next, we’ll look at how to inspect your data stream and understand its contents.
