# Utility, Inspection, & Interoperability

Beyond transforming data, `Flow` provides several utility methods to help you inspect your data stream, understand its contents, and integrate `Flow` pipelines with other Python libraries and custom functions.

## Inspecting Your Flow

These methods help you peek into your `Flow` without necessarily processing the entire stream (though some, like `last()` or `len()`, do need to consume it).

### Getting the First Record (`.first()`)

`first()` returns the first record (a dictionary) from the `Flow` without consuming it, so the record is still available to subsequent operations. If the `Flow` is empty, it returns `None`.

```python
first_event = events_flow.first()
```

This is useful for quickly checking the structure of your records or grabbing a sample value.
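One way a non-consuming peek can work on a plain Python iterator is to read the first item and then chain it back in front of the rest. The sketch below illustrates the idea only and is not `Flow`'s actual implementation; `peek_first` is a hypothetical helper name.

```python
import itertools


def peek_first(records):
    """Return (first_record_or_None, iterator that still yields every record)."""
    iterator = iter(records)
    try:
        first = next(iterator)
    except StopIteration:
        # Empty stream: there is nothing to put back.
        return None, iter(())
    # Put the peeked record back in front of the remaining records.
    return first, itertools.chain([first], iterator)


events = iter([{"id": 1, "type_name": "Pass"}, {"id": 2, "type_name": "Shot"}])
first_event, events = peek_first(events)
remaining = list(events)  # still contains both records
```

The caller has to keep using the returned iterator rather than the original one, which is why a library would normally hide this bookkeeping inside the object.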

### Getting the Last Record (`.last()`)

`last()` returns the very last record (dictionary) from the `Flow`.

**Important**: To find the last record, this method must iterate through and consume the entire `Flow`. The `Flow` instance it's called on will be exhausted afterwards. If the `Flow` is empty, it returns `None`.

```python
last_event_in_match = events_flow.last() # events_flow is now exhausted
if last_event_in_match is not None:
    print(f"The last event timestamp was: {last_event_in_match.get('timestamp')}")
```
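To see why finding the last record exhausts a stream, consider this sketch on a plain Python iterator. It is not `Flow`'s implementation; `last_record` is a hypothetical helper that keeps only the most recent item while reading everything.

```python
from collections import deque


def last_record(records):
    """Return the final record of an iterable, or None if it is empty."""
    # A deque with maxlen=1 discards older items as new ones arrive.
    tail = deque(records, maxlen=1)
    return tail[0] if tail else None


stream = iter([{"timestamp": "00:01"}, {"timestamp": "45:00"}, {"timestamp": "93:10"}])
last_event = last_record(stream)
leftover = list(stream)  # nothing is left: the iterator was read to the end
```

Memory use stays constant, but every record has to be read once, so the source iterator cannot be reused afterwards.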

### Checking if a Flow is Empty (`.is_empty()`)

`is_empty()` returns `True` if the `Flow` contains no records and `False` otherwise. It checks for a first record and, when the `Flow` is not empty, attempts to leave that record available to subsequent operations.
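A dedicated emptiness check is safer than testing the truthiness of a peeked record, because an empty dictionary `{}` is falsy in Python even though it is a real record. The following sketch shows the idea on a plain iterator using a sentinel object; the function name and return shape are illustrative assumptions, not `Flow`'s API.

```python
import itertools

_MISSING = object()  # sentinel that can never be a real record


def is_empty(records):
    """Return (is_empty, iterator that still yields every record)."""
    iterator = iter(records)
    head = next(iterator, _MISSING)
    if head is _MISSING:
        return True, iter(())
    # Re-attach the record we looked at so nothing is lost.
    return False, itertools.chain([head], iterator)


no_records, _ = is_empty(iter([]))
blank_first, restored = is_empty(iter([{}, {"id": 2}]))
restored_records = list(restored)
```

Here `blank_first` is `False` even though the first record is `{}`, which a plain `if record:` test would have misread as "no data".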

### Discovering Record Keys (`.keys()`)

`keys(limit=None)` returns a set of all unique field names (keys) found across the records in the `Flow`.

- By default (`limit=None`), it inspects all records, which will consume the `Flow`.
- If `limit` is set to an integer (e.g., `limit=100`), it reads only the first `limit` records from the underlying stream and leaves the rest untouched. The method attempts to keep the inspected records available as well, so a later full iteration of the `Flow` still sees them.

```python
# Get all keys from the first 10 records without consuming the whole flow for other uses
sample_keys = events_flow.keys(limit=10)
print(f"Keys found in the first 10 records: {sample_keys}")

# Get all keys from all records (consumes events_flow)
all_possible_keys = events_flow.keys()
print(f"All keys found in the dataset: {all_possible_keys}")
```

This is very useful for understanding the schema of your data, especially if it varies between records.
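The sketch below shows one way to discover keys across records with differing schemas while preserving the stream, using `itertools.tee` and `itertools.islice` on a plain iterator. It is an illustration under stated assumptions, not `Flow`'s implementation; `discover_keys` is a hypothetical helper.

```python
import itertools


def discover_keys(records, limit=None):
    """Return (set of keys seen, iterator that still yields every record)."""
    # tee() gives two independent views of the same underlying stream.
    inspect, keep = itertools.tee(records)
    if limit is not None:
        inspect = itertools.islice(inspect, limit)
    keys = set()
    for record in inspect:
        keys.update(record.keys())
    return keys, keep


records = iter([
    {"id": 1, "type_name": "Pass"},
    {"id": 2, "type_name": "Shot", "shot_xg": 0.31},
    {"id": 3, "type_name": "Foul", "card": "Yellow"},
])
sample_keys, records = discover_keys(records, limit=2)
all_keys, records = discover_keys(records)
still_available = list(records)
```

Note that `tee` buffers every inspected record in memory until the second view reads it, so an unlimited scan of a very large stream gives up the memory benefit of streaming.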

### Getting the Length of a Flow (`len()`)

You can use the built-in `len(my_flow)` function to count the number of records in a `Flow`.

**Important**: To count all records, `len()` must iterate through and consume the entire `Flow`. The `Flow` instance will be exhausted afterwards.

```python
number_of_events = len(events_flow)  # events_flow is now exhausted
print(f"Total number of events: {number_of_events}")
```
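The same trade-off exists for any one-shot Python iterator: counting means reading to the end. The sketch below uses a plain generator rather than a `Flow` (the helper name is hypothetical) and shows that materializing into a list first lets you both count and reuse the data.

```python
def count_records(records):
    """Count items by reading the iterable to the end."""
    return sum(1 for _ in records)


stream = (record for record in [{"id": 1}, {"id": 2}, {"id": 3}])
total = count_records(stream)        # reads all three records
second_count = count_records(stream)  # the generator is already exhausted

# If the data fits in memory, materialize once and reuse the list.
materialized = list(record for record in [{"id": 1}, {"id": 2}, {"id": 3}])
reusable_total = len(materialized)
```

With a `Flow`, the analogous pattern would be to call `.collect()` once and take `len()` of the resulting list when you need both the count and the records.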

### String Representation (`__repr__()`)

When you print a `Flow` object or inspect it in an interactive session, its `__repr__` method provides a summary:

```python
print(events_flow)
# Output might look like:
# <Penaltyblog Flow | n≈? | sample=[{'id': 1, ...}, {'id': 2, ...}, {'id': 3, ...}]>
# Or if len() was called before:
# <Penaltyblog Flow | n≈1500 | sample=[{'id': 1, ...}, {'id': 2, ...}, {'id': 3, ...}]>
```


## Materializing the Flow

### Collecting All Records (`.collect()`)

We've used this throughout: `my_flow.collect()` processes the entire planned pipeline and returns a Python list of all the resulting dictionaries. This is the primary way to get all your processed data into memory if needed.

This, by definition, consumes the `Flow`.

```python
processed_records_list = events_flow.filter(...).assign(...).collect()
```
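Assuming the pipeline is evaluated lazily, as "planned pipeline" suggests, no work happens until the results are materialized. The sketch below demonstrates that behaviour with a plain generator expression instead of a `Flow`; the `calls` log exists only to make the timing visible.

```python
calls = []  # records which ids the predicate has been run on


def is_shot(record):
    calls.append(record["id"])
    return record["type_name"] == "Shot"


events = [
    {"id": 1, "type_name": "Pass"},
    {"id": 2, "type_name": "Shot"},
    {"id": 3, "type_name": "Shot"},
]

# Building the pipeline only wires the steps together; nothing runs yet.
pipeline = ({**r, "is_shot": True} for r in events if is_shot(r))
calls_before = list(calls)

# Materializing into a list runs every step for every record.
collected = list(pipeline)
calls_after = list(calls)

# A second pass finds the generator already exhausted.
second_pass = list(pipeline)
```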

### Extending Flow with Custom Logic (`.pipe()`)

The `.pipe(func, *args, **kwargs)` method allows you to insert your own custom functions (or functions from other libraries) into a `Flow` chain. The `Flow` object itself is passed as the first argument to your function, followed by any additional `*args` and `**kwargs` you provide.

Your function should typically return a `Flow` object if you want to continue chaining `Flow` methods, but it can return anything (e.g., a pandas DataFrame, a plot, a custom summary).
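The calling convention described above can be sketched with a small stand-in class. `MiniFlow` is hypothetical and exists only to show the mechanics of `pipe`; it is not how `Flow` is implemented.

```python
class MiniFlow:
    """Hypothetical stand-in for Flow, used only to illustrate pipe()."""

    def __init__(self, records):
        self._records = list(records)

    def filter(self, predicate):
        return MiniFlow(r for r in self._records if predicate(r))

    def collect(self):
        return list(self._records)

    def pipe(self, func, *args, **kwargs):
        # The object itself becomes the first positional argument.
        return func(self, *args, **kwargs)


def keep_high_xg(flow, min_xg=0.25):
    return flow.filter(lambda r: r.get("shot_xg", 0) >= min_xg)


def count_records(flow):
    return len(flow.collect())


shots = MiniFlow([{"shot_xg": 0.05}, {"shot_xg": 0.4}, {"shot_xg": 0.7}])

# Returning a MiniFlow keeps the chain going.
high_xg = shots.pipe(keep_high_xg, min_xg=0.3)

# Returning something else (here an int) ends the chain.
n_high_xg = shots.pipe(keep_high_xg, min_xg=0.3).pipe(count_records)
```

The return type of the piped function decides what can follow: another chained method if it returns a flow-like object, or ordinary Python code if it returns anything else.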

#### Example: A custom function to filter and assign in one step

```python
def process_shots_custom(flow_object, min_xg=0.25):
    return (
        flow_object
        .filter(lambda r: r.get("type_name") == "Shot" and r.get("shot_xg", 0) >= min_xg)
        .assign(is_high_xg=True)
    )

high_xg_shots_flow = Flow(events).pipe(process_shots_custom, min_xg=0.25)

# Now you can continue chaining on high_xg_shots_flow or collect it:
result = high_xg_shots_flow.select("player_name", "shot_xg", "is_high_xg").collect()
```

#### Example: Piping to a pandas function after converting

```python
def calculate_team_summary_with_pandas(flow_object):
    df = flow_object.to_pandas()
    # Now use pandas for more complex grouping/aggregation if needed
    return df.groupby("team_name")["shot_xg"].agg(["sum", "mean", "count"])

team_xg_summary_df = (
    events_flow.filter(lambda r: r.get("type_name") == "Shot")
    .pipe(calculate_team_summary_with_pandas)
)
print(team_xg_summary_df)
```

`.pipe()` lets you integrate bespoke logic or functionality from other libraries into your `Flow` pipeline without breaking the method chain.


## Interoperability with Pandas

Pandas is a cornerstone of data analysis in Python. `Flow` provides easy ways to move data to and from pandas DataFrames.

### Converting to a Pandas DataFrame (`.to_pandas()`)

As seen in the previous section on IO, `.to_pandas()` consumes the `Flow` and returns a pandas DataFrame.

```python
df_events = Flow(events).filter(lambda r: r.get("period") == 1).to_pandas()
print(df_events.info())
```

### Generating Descriptive Statistics (`.describe()`)

`.describe(percentiles=..., include=..., exclude=...)` consumes the `Flow`, converts it to a pandas DataFrame internally, and then calls the DataFrame's `.describe()` method, returning a pandas DataFrame with summary statistics.

```python
# Assuming shots_flow contains only shot events
numeric_shot_stats = (
    shots_flow.select("shot_statsbomb_xg", "shot_end_location_x", "shot_end_location_y")
    .describe()
)
print(numeric_shot_stats)

# Describe specific columns or dtypes
all_event_description = events_flow.describe(include="object")  # Describe object/string columns
```

This is a quick way to get an overview of your data's distribution, similar to `pandas.DataFrame.describe()`.

These utility and interoperability methods let you understand your data at various stages, customize processing with your own logic, and integrate with the broader Python data science ecosystem, especially pandas.