# Utility, Inspection, & Interoperability

Beyond transforming data, **Flow** provides several tools to help you inspect, debug, and connect your pipelines to other libraries like pandas - all while preserving the lazy, composable nature of your workflow.


## Inspecting Your Flow

`Flow` includes methods for peeking into the stream, checking fields, or verifying structure. Note: most of these methods consume the stream, meaning you cannot reuse the same flow afterward unless you’ve explicitly materialized it.
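To make "consuming the stream" concrete, here is a minimal sketch that uses a plain Python generator as a stand-in for a lazy stream. It does not use `Flow` itself: the `stream` helper and the sample `events` are illustrative assumptions, and `Flow`'s internals may differ.

```python
# A generator stands in for a lazy, one-shot stream (not the real Flow class).
events = [{"type_name": "Pass"}, {"type_name": "Shot"}]

def stream(records):
    """Yield records one at a time, like a lazy pipeline source."""
    for record in records:
        yield record

lazy = stream(events)
first_pass = list(lazy)    # reads every record
second_pass = list(lazy)   # the generator is already exhausted

print(len(first_pass))     # 2
print(len(second_pass))    # 0

# Holding the records in a list gives a repeatable copy.
snapshot = list(stream(events))
print(len(snapshot))       # 2
print(len(snapshot))       # 2, a list can be read as often as needed
```

The second pass finds nothing because the records were handed out once and never stored. Materializing is the same idea as the `snapshot` list: pay the memory cost once in exchange for repeatable reads.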

### `.first()`: Peek at the First Record

Returns the first record in the stream. If the stream is empty, returns `None`.

```python
first_event = events_flow.first()
```

⚠️ This materializes the full stream via `.collect()`, so `events_flow` is empty afterward. Use `.head(1).collect()` instead to preview the first record without reading the rest of the stream.

### `.last()`: Get the Final Record

```python
last_event = events_flow.last()
```

Returns the last record in the stream. Like `.first()`, this calls `.collect()` and consumes the entire stream.

### `.head(n)`: Safe Preview

Get the first `n` records without consuming the full stream.

```python
Flow(data).head(3).collect()
```

✅ Best choice for inspecting safely - consumes only the first `n` records.
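As a rough illustration of why a bounded preview is cheap, the sketch below uses `itertools.islice` on a generator and records how many items the source was actually asked to produce. The `source` generator and the `pulled` log are hypothetical, and this is only an assumption about how a lazy `head` might behave, not `Flow`'s actual implementation.

```python
from itertools import islice

pulled = []  # every record the source was actually asked to produce

def source():
    """Hypothetical lazy source that logs each record it yields."""
    for i in range(1_000):
        pulled.append(i)
        yield {"event_id": i}

# Take the first three records and stop; the other 997 are never generated.
preview = list(islice(source(), 3))

print(preview)      # [{'event_id': 0}, {'event_id': 1}, {'event_id': 2}]
print(len(pulled))  # 3
```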

### `.is_empty()`: Check for Records

```python
if events_flow.is_empty():
    print("No events found.")
```

⚠️ This materializes the full stream via `.collect()`.

### `.keys(limit=None)`: Discover Field Names

Returns the union of all keys found in up to `limit` records. With the default `limit=None`, every record is scanned.

```python
events_flow.keys(limit=10)  # union of keys from the first 10 records
events_flow.keys()          # union of keys from every record
```

Helpful for exploring semi-structured or nested JSON.

⚠️ This also materializes the stream - use `.head(n).collect()` if you only need a preview.
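The "union of keys" idea can be sketched in plain Python as follows. The helper `union_of_keys` and the sample records are hypothetical and exist only for illustration. The sketch returns a `set`, which may not match the return type of `.keys()`.

```python
from itertools import islice

def union_of_keys(records, limit=None):
    """Collect every field name seen in the first `limit` records (all if None)."""
    sample = records if limit is None else islice(records, limit)
    keys = set()
    for record in sample:
        keys.update(record.keys())
    return keys

events = [
    {"type_name": "Pass", "player_name": "A"},
    {"type_name": "Shot", "shot_xg": 0.31},
    {"type_name": "Foul", "card": "Yellow"},
]

print(sorted(union_of_keys(events, limit=2)))  # ['player_name', 'shot_xg', 'type_name']
print(sorted(union_of_keys(events)))           # ['card', 'player_name', 'shot_xg', 'type_name']
```

A small `limit` can miss fields that only appear in later records (here, `card`), which is the trade-off between a quick look and a complete schema.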

### `.len()`: Count the Records

Counting internally calls `.collect()`, so the flow is consumed. If you need to count the records and keep working with them, call `.materialize()` first:

```python
flow = Flow(data)
materialized = flow.materialize()
print(len(materialized))           # safe
print(materialized.first())        # safe
```

## Materializing the Flow

### `.collect()`: Get All Records as a List

```python
records = events_flow.filter(...).assign(...).collect()
```

- Consumes the stream.
- Returns a list of dicts.

### `.materialize()`: Create a Reusable Flow

```python
f2 = events_flow.materialize()
```

- Also consumes the stream.
- Returns a new `Flow` backed by a list - safe for reuse and inspection.

## Forking a Flow: `.fork()`

Sometimes you want to branch a `Flow` into two different pipelines - for example, to compute two different summaries from the same source. `.fork()` gives you two independent one-shot `Flows` from the same stream.

```python
flow1, flow2 = flow.fork()

summary1 = flow1.filter(...).summary(...)
summary2 = flow2.filter(...).summary(...)
```

💡 This does not materialize the stream into memory. It uses internal teeing, so each branch can only be used once. If you need a reusable copy, use `.materialize()` instead.
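For intuition about one-shot branches, here is a small sketch built on the standard library's `itertools.tee`. Whether `.fork()` is implemented with `tee` is an assumption on our part, and the `shots` generator is hypothetical sample data.

```python
from itertools import tee

def shots():
    """Hypothetical one-shot source of shot records."""
    yield {"team_name": "Home", "shot_xg": 0.10}
    yield {"team_name": "Away", "shot_xg": 0.45}
    yield {"team_name": "Home", "shot_xg": 0.30}

# Split one stream into two independent iterators.
branch_a, branch_b = tee(shots())

total_xg = sum(r["shot_xg"] for r in branch_a)
home_shots = sum(1 for r in branch_b if r["team_name"] == "Home")

print(round(total_xg, 2))  # 0.85
print(home_shots)          # 2
print(list(branch_a))      # [] because each branch can be read only once
```

One caveat of `tee` itself: it buffers every record that one branch has read and the other has not. If one branch runs to completion before the other starts, as in this sketch, the buffer grows to hold the whole stream, so the memory savings are largest when both branches advance at a similar pace.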

### When to Use `.fork()`

- You want to branch without collecting everything into memory
- You know each fork will be used exactly once
- You want to avoid `.materialize()` for memory reasons

### When Not to Use It

- You plan to reuse or inspect the flows multiple times → use `.materialize()`
- You want to store or debug intermediate steps → `.collect()` or `.pipe()` is better

## Custom Logic and Pipelines

### `.pipe()`: Insert Your Own Function

Plug in custom functions to extend or encapsulate logic:

```python
def process_shots(flow, min_xg=0.25):
    return (
        flow
        .filter(lambda r: r.get("type_name") == "Shot" and r.get("shot_xg", 0) >= min_xg)
        .assign(is_high_xg=True)
    )

# Chain your custom step using .pipe
high_xg = Flow(events).pipe(process_shots, min_xg=0.3)
results = high_xg.select("player_name", "shot_xg", "is_high_xg").collect()
```

💡 If your function returns a `Flow`, you can keep chaining. If it returns something else (e.g., a DataFrame), the chain ends there and you get that object back.
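The pipe pattern is simple enough to sketch in a few lines. `MiniFlow` below is a hypothetical stand-in written for this example, not the real `Flow` class. It assumes `pipe` passes the object as the first argument and returns whatever the function returns, which is how `pandas.DataFrame.pipe` behaves.

```python
class MiniFlow:
    """Hypothetical stand-in for Flow, only to show how a pipe-style hook can work."""

    def __init__(self, records):
        self.records = records

    def pipe(self, func, *args, **kwargs):
        # Pass this object as the first argument and return func's result as-is.
        return func(self, *args, **kwargs)

def only_shots(flow, min_xg=0.0):
    """Returns a MiniFlow, so the chain can continue."""
    kept = [r for r in flow.records if r.get("shot_xg", 0) >= min_xg]
    return MiniFlow(kept)

def count_records(flow):
    """Returns an int, so the chain ends here."""
    return len(flow.records)

events = [{"shot_xg": 0.05}, {"shot_xg": 0.40}, {"shot_xg": 0.62}]

n_high = MiniFlow(events).pipe(only_shots, min_xg=0.3).pipe(count_records)
print(n_high)  # 2
```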

### Use `.pipe()` with External Tools

You can bridge into pandas (or anything else) via `.pipe()`:

```python
def team_summary(flow):
    df = flow.to_pandas()
    return df.groupby("team_name")["shot_xg"].agg(["sum", "mean", "count"])

summary_df = (
    Flow(events)
    .filter(lambda r: r.get("type_name") == "Shot")
    .pipe(team_summary)
)
```

## Interoperability with pandas

### `.to_pandas()`: Convert to DataFrame

```python
df = Flow(events).filter(lambda r: r.get("period") == 1).to_pandas()
print(df.head())
```

- Materializes the `Flow`
- Returns a pandas DataFrame

### `.describe()`: Summary Stats

Converts the full `Flow` to a DataFrame and returns the result of `pandas.DataFrame.describe()`. Like `.to_pandas()`, this materializes the `Flow`.

```python
(
    Flow(events)
    .select("shot_xg", "location_x", "location_y")
    .describe()
)
```

Supports typical pandas `.describe()` args:

```python
# Show object/string field summaries
Flow(events).describe(include="object")

# Change percentiles
Flow(events).describe(percentiles=(0.1, 0.5, 0.9))
```

## Summary

These utility methods let you:

- ✅ Inspect flows (`.first()`, `.keys()`, `.head()`)
- ✅ Materialize your data for reuse (`.collect()`, `.materialize()`)
- ✅ Branch with `.fork()` and plug in custom logic with `.pipe()`
- ✅ Export to pandas with `.to_pandas()` and `.describe()`

⚠️ Many inspection methods consume the flow. If you need to reuse the data, use `.materialize()` to create a repeatable copy.

## What’s Next?

Next, we’ll look at best practices for writing clean, maintainable `Flow` pipelines and working efficiently with large datasets.