# Utility, Inspection, & Interoperability

Beyond transforming data, `Flow` provides several utility methods to help you inspect your data stream, understand its contents, and integrate `Flow` pipelines with other Python libraries and custom functions.

## Inspecting Your Flow

The methods in this section let you peek into the stream, check its schema, count records, and debug a pipeline. Most of them avoid consuming your data; the exceptions are called out where they apply.

Under the hood, many of these methods use `itertools.tee()` to preserve the internal state of the stream so that even destructive-looking operations like `.first()` or `.collect()` won't discard records unless explicitly intended.
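The buffering idea can be illustrated with a small standard-library sketch. This is not `Flow`'s actual implementation; `peek_first` is a hypothetical helper that only shows how `itertools.tee()` lets one copy of an iterator be advanced while a second copy still yields every record.

```python
from itertools import tee


def peek_first(records):
    """Return (first record or None, an iterator that still yields everything).

    Hypothetical helper for illustration; not part of Flow.
    """
    probe, preserved = tee(records)  # two independent iterators over one source
    first = next(probe, None)        # advance only the probe copy
    return first, preserved


stream = iter([{"id": 1}, {"id": 2}, {"id": 3}])
first, stream = peek_first(stream)

# The preserved copy starts from the beginning, so nothing was lost.
remaining = list(stream)
```

The cost is memory: `tee()` buffers whatever one copy has consumed until the other copy catches up.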

## Peek at the First Record: `.first()`

Returns the first record in the stream without consuming it. If the `Flow` is empty, returns `None`.

```python
first_event = events_flow.first()
```

This is great for sampling structure or validating input without altering the pipeline.

## Grab the Final Record: `.last()`

Returns the last record in the stream. Since this requires scanning the full iterator, it will consume the flow. Use `.cache()` if you need to call `.last()` and still reuse the data.

```python
last_event = events_flow.last()
```
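To see why finding the last record consumes a one-shot stream, here is a minimal sketch using only the standard library. `last_record` is a hypothetical helper, not the method `Flow` uses internally; it assumes the records arrive through a plain iterator.

```python
from collections import deque


def last_record(records):
    """Scan the whole iterable and keep only the final item (None if empty)."""
    tail = deque(records, maxlen=1)  # a length-1 deque retains just the last item seen
    return tail[0] if tail else None


stream = iter([{"id": 1}, {"id": 2}, {"id": 3}])
last = last_record(stream)

# The iterator has been walked to the end, so nothing is left to read.
leftover = list(stream)
```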

## Is There Anything in Here? `.is_empty()`

Checks whether the stream is empty. Uses `tee()` to avoid losing the first record if present.

```python
if events_flow.is_empty():
    print("No events found.")
```

## Discover All Field Names: `.keys(limit=None)`

Returns the union of keys across records, optionally limited to the first `limit` records. Uses `tee()` internally to preserve stream state.

```python
events_flow.keys(limit=10)  # union of keys from the first 10 records
events_flow.keys()          # union of keys across all records
```

This is useful for understanding schema drift in semi-structured JSON.
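As a rough sketch of the idea, the union of keys can be computed over a `tee()` copy so the original records stay available. `union_keys` is a hypothetical helper and its return type (an ordered list plus the preserved iterator) is an assumption made for illustration, not the signature of `Flow.keys()`.

```python
from itertools import islice, tee


def union_keys(records, limit=None):
    """Collect field names from the first `limit` records (all if None)."""
    probe, preserved = tee(records)
    seen = {}  # a dict keeps keys in first-seen order
    for record in islice(probe, limit):
        for key in record:
            seen.setdefault(key, None)
    return list(seen), preserved


events = [
    {"type_name": "Pass", "player_name": "A"},
    {"type_name": "Shot", "shot_xg": 0.3},
    {"type_name": "Foul", "card": "yellow"},
]

sampled_keys, stream = union_keys(iter(events), limit=2)
all_keys, stream = union_keys(stream)
```

Comparing the sampled keys against the full set is one way to spot fields that only appear later in the data.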

## Count the Records: `len(flow)`

Calling `len(flow)` materializes the stream and counts every record. This consumes the `Flow`, so use it only when needed. To count safely and reuse the data afterwards, call `.cache()` first.

```python
flow = Flow(data).cache()
print(len(flow))  # Safe and repeatable
```
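The difference between counting a one-shot stream and counting cached data can be shown with plain Python. This is a sketch under the assumption that an uncached stream behaves like a generator; `count_records` is a hypothetical helper, not part of `Flow`.

```python
def count_records(records):
    """Count by iterating; a generator is exhausted afterwards."""
    return sum(1 for _ in records)


data = [{"id": 1}, {"id": 2}, {"id": 3}]

stream = (record for record in data)
first_count = count_records(stream)   # walks the whole generator
second_count = count_records(stream)  # nothing is left to count

cached = list(record for record in data)    # materialize once
cached_counts = (len(cached), len(cached))  # counting a list is repeatable
```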

## A Helpful Summary: `repr(flow)`

Printing a Flow shows a preview with sample records and an estimated count:

```python
print(flow)
# <Penaltyblog Flow | n≈? | sample=[..., ...]>
```

If `len()` has already been called, or the flow has been cached with `flow.cache()`, the preview shows the exact count instead of an estimate.

## Materializing the Flow

## Materialize All Records: `.collect()`

The most common way to materialize your data is by calling `.collect()`, which runs the full pipeline and returns a list of dictionaries.

```python
records = events_flow.filter(...).assign(...).collect()
```

Behind the scenes, `.collect()` uses `itertools.tee()` to ensure the original `Flow` remains usable even after materialization. So you can safely call `.collect()` multiple times if needed - at the cost of some internal buffering.

This is ideal when you're ready to work with all records at once, or when handing data off to another tool (e.g., pandas).

## Insert Custom Logic Anywhere: `.pipe()`

Use `.pipe()` to insert your own logic into a `Flow` chain - this is perfect for custom filtering, integration with other libraries, or building reusable transformations.

```python
def process_shots(flow, min_xg=0.25):
    return (
        flow
        .filter(lambda r: r.get("type_name") == "Shot" and r.get("shot_xg", 0) >= min_xg)
        .assign(is_high_xg=True)
    )

# Chain your custom step using .pipe
high_xg = Flow(events).pipe(process_shots, min_xg=0.3)
results = high_xg.select("player_name", "shot_xg", "is_high_xg").collect()
```

`.pipe()` is chainable - your function should return a `Flow` if you want to keep chaining. But it can also return any result (like a DataFrame, plot, or summary) as a final output.
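Conceptually, a pipe method just calls your function with the flow as its first argument. The sketch below uses `MiniFlow`, a hypothetical stand-in class written for this example, to show both cases: a step that returns a flow and a step that returns a final value. It is an assumption about the pattern, not `Flow`'s source code.

```python
class MiniFlow:
    """Hypothetical stand-in for Flow, just enough to demonstrate pipe()."""

    def __init__(self, records):
        self._records = records

    def filter(self, predicate):
        # Lazily keep only the records that satisfy the predicate.
        return MiniFlow(r for r in self._records if predicate(r))

    def collect(self):
        return list(self._records)

    def pipe(self, func, *args, **kwargs):
        # Hand the flow itself to func, forwarding any extra arguments.
        return func(self, *args, **kwargs)


def only_shots(flow, min_xg=0.0):
    # Returns a MiniFlow, so chaining can continue.
    return flow.filter(
        lambda r: r.get("type_name") == "Shot" and r.get("shot_xg", 0) >= min_xg
    )


def how_many(flow):
    # Returns a plain int, so this step ends the chain.
    return len(flow.collect())


events = [
    {"type_name": "Shot", "shot_xg": 0.4},
    {"type_name": "Pass"},
    {"type_name": "Shot", "shot_xg": 0.1},
]

shots = MiniFlow(events).pipe(only_shots, min_xg=0.3)
n_shots = MiniFlow(events).pipe(only_shots).pipe(how_many)
```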

## Using `.pipe()` for External Libraries (e.g. pandas)

You can use `.pipe()` to bridge Flow with pandas or other tools:

```python
def team_summary(flow):
    df = flow.to_pandas()
    return df.groupby("team_name")["shot_xg"].agg(["sum", "mean", "count"])

summary_df = (
    Flow(events)
    .filter(lambda r: r.get("type_name") == "Shot")
    .pipe(team_summary)
)
```

This pattern keeps your pipeline modular and composable.

## Interoperability with pandas

Flow integrates cleanly with pandas - great for advanced analysis, plotting, or export.

## Convert to a DataFrame: `.to_pandas()`

Use `.to_pandas()` to convert a `Flow` into a `pandas.DataFrame`. This materializes the full stream.

```python
df = Flow(events).filter(lambda r: r.get("period") == 1).to_pandas()
print(df.head())
```

## Quick Summary Stats: `.describe()`

This method materializes the data into a DataFrame and returns the result of `DataFrame.describe()`.

```python
# Parentheses are required for a method chain that spans several lines
(
    Flow(events)
    .select("shot_xg", "location_x", "location_y")
    .describe()
)
```

You can customize the output by passing pandas-style arguments:

```python
# Show object/string field summaries
Flow(events).describe(include="object")

# Change percentiles
Flow(events).describe(percentiles=(0.1, 0.5, 0.9))
```

These utilities make it easy to blend the simplicity of `Flow` with the power of pandas, and to inspect or export your data as needed - without losing the benefits of lazy, composable pipelines.