# Summaries: Grouping and Aggregating Data

After cleaning and transforming individual records, a common next step in data analysis is to **summarize** information across different groups.

For example:
- Count the number of shots per team
- Compute the average xG per player
- Sum the number of passes in different zones of the pitch

`Flow` supports these “group by” operations through two methods: `.group_by(...)` and `.summary(...)`.

## The Two-Step Pattern

1. **Group** your data with `.group_by(...)`
2. **Aggregate** each group with `.summary(...)`

These two steps work together to turn raw event-level data into rich summary tables.

## Grouping Records: `.group_by(...)`

Use `.group_by(...)` to define the key(s) you want to group by. This method takes one or more field names and returns a `FlowGroup`, an object that holds the records split into sub-groups, one per distinct combination of key values.



```python
from pprint import pprint
from penaltyblog.matchflow import Flow

sample_records = [
    {
        "event_id": 1,
        "match_id": 123,
        "period": 1,
        "timestamp": "00:01:30.500",
        "type_name": "Pass",
        "player_name": "Kevin De Bruyne",
        "location": [60.1, 40.3],
        "pass_recipient_name": "Erling Haaland",
        "pass_outcome_name": "Complete",
        "team_name": "Manchester City",
    },
    {
        "event_id": 2,
        "type_name": "Shot",
        "player_name": "Erling Haaland",
        "shot_xg": 0.05,
        "shot_outcome_name": "Goal",
        "team_name": "Manchester City",
    },
    {
        "event_id": 3,
        "type_name": "Duel",
        "player_name": "Rodri",
        "duel_type_name": "Tackle",
        "duel_outcome_name": "Won",
        "team_name": "Manchester City",
    },
    {
        "event_id": 4,
        "type_name": "Pass",
        "player_name": "Kevin De Bruyne",
        "pass_outcome_name": "Incomplete",
        "team_name": "Manchester City",
    },
    {
        "event_id": 5,
        "type_name": "Shot",
        "player_name": "Bukayo Saka",
        "shot_xg": 0.01,
        "shot_outcome_name": "Post",
        "team_name": "Arsenal",
    },
]

events_flow = Flow(sample_records)

grouped_by_team = events_flow.group_by("team_name")

grouped_by_team
```

```
<Penaltyblog Flow Group | n_groups=2 | sample_keys=[('Manchester City',), ('Arsenal',)]>
```

💡 `.group_by()` materializes the entire flow: every record is read before the groups are formed.
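To build intuition for what grouping produces, here is a minimal plain-Python sketch of the idea. It is not how `Flow` is implemented; `group_records` is a hypothetical helper written for this illustration, and treating a missing field as `None` is an assumption of the sketch. The tuple keys mirror the `sample_keys` shown in the `FlowGroup` output above.

```python
from collections import defaultdict


def group_records(records, *keys):
    """Hypothetical helper: bucket records by a tuple of their key values."""
    groups = defaultdict(list)
    for record in records:
        # Assumption of this sketch: a missing field is treated as None.
        group_key = tuple(record.get(k) for k in keys)
        groups[group_key].append(record)
    return dict(groups)


events = [
    {"event_id": 1, "team_name": "Manchester City"},
    {"event_id": 2, "team_name": "Manchester City"},
    {"event_id": 5, "team_name": "Arsenal"},
]

groups = group_records(events, "team_name")
print(list(groups))  # [('Manchester City',), ('Arsenal',)]
print(len(groups[("Manchester City",)]))  # 2
```

Every record has to be seen before any group is known to be complete, which is why grouping cannot be done lazily.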

## Grouping by Multiple Fields

You can group by more than one key:

```python
grouped = Flow(sample_records).group_by("period", "type_name")
```

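On event data where every record has a `period`, this creates groups such as `(1, "Pass")` and `(1, "Shot")`. In `sample_records`, only the first record has a `period` field.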
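The composite keys can be previewed in plain Python before grouping. This is only a sketch of the idea: the small `events` list is made up for illustration, and mapping a missing field to `None` comes from `dict.get`, not from any documented `Flow` behaviour.

```python
events = [
    {"period": 1, "type_name": "Pass"},
    {"period": 1, "type_name": "Shot"},
    {"period": 2, "type_name": "Pass"},
    {"type_name": "Shot"},  # no "period" field at all
]

fields = ("period", "type_name")

# dict.fromkeys drops duplicate keys while keeping first-seen order.
composite_keys = list(
    dict.fromkeys(tuple(event.get(f) for f in fields) for event in events)
)

print(composite_keys)
# [(1, 'Pass'), (1, 'Shot'), (2, 'Pass'), (None, 'Shot')]
```

Checking the keys this way is a cheap method for spotting records that lack one of the grouping fields.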

## Aggregating within Groups: `FlowGroup.summary()`

Once you have a `FlowGroup`, call `.summary(...)` to aggregate each group.

You pass keyword arguments:

- The **key** is the name of the new field
- The **value** defines how to compute it

You can use:

- A tuple: `("field", "agg")`, e.g. `("shot_xg", "sum")`
- A string: `"count"` (the number of records in the group)
- A custom function or lambda
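To make the three forms concrete, here is a minimal plain-Python sketch of how such specs could be resolved against one group of records. `apply_spec` is a hypothetical helper written for this illustration, not part of `Flow`, and skipping records that lack the field is an assumption of the sketch.

```python
def apply_spec(records, spec):
    """Hypothetical helper: resolve one aggregation spec against a group."""
    if callable(spec):
        # Custom function or lambda: it receives the whole group.
        return spec(records)
    if spec == "count":
        # Bare string "count": number of records in the group.
        return len(records)
    # Otherwise a ("field", "agg") tuple.
    field, agg = spec
    # Assumption of this sketch: records without the field are skipped.
    values = [r[field] for r in records if r.get(field) is not None]
    reducers = {"sum": sum, "min": min, "max": max}
    return reducers[agg](values)


group = [{"shot_xg": 0.25}, {"shot_xg": 0.5}, {"type_name": "Pass"}]

specs = {
    "total_xg": ("shot_xg", "sum"),
    "n_events": "count",
    "has_pass": lambda rs: any(r.get("type_name") == "Pass" for r in rs),
}

summary = {name: apply_spec(group, spec) for name, spec in specs.items()}
print(summary)  # {'total_xg': 0.75, 'n_events': 3, 'has_pass': True}
```

Each keyword name becomes a field in the output record, which matches how the keyword arguments are described above.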

### Example: Sum xG per team


```python
results = (
    Flow(sample_records)
    .group_by("team_name")
    .summary(
        total_xg=("shot_xg", "sum"),
    )
)
pprint(results.collect())
```

```
[{'team_name': 'Manchester City', 'total_xg': np.float64(0.05)},
 {'team_name': 'Arsenal', 'total_xg': np.float64(0.01)}]
```


### Example: Sum xG and count shots per player

```python
# First, filter for shots to make aggregation simpler
results = (
    Flow(sample_records)
    .filter(lambda r: r.get("type_name") == "Shot")
    .group_by("player_name")
    .summary(
        total_xg=("shot_xg", "sum"),
        number_of_shots="count",
    )
)

pprint(results.collect())
```

```
[{'number_of_shots': 1,
  'player_name': 'Erling Haaland',
  'total_xg': np.float64(0.05)},
 {'number_of_shots': 1,
  'player_name': 'Bukayo Saka',
  'total_xg': np.float64(0.01)}]
```


## Built-in Aggregation Functions

You can use the following aggregation strings:

- **count** - number of records
- **sum**, **mean**, **min**, **max**, **median**, **std**, **var**
- **first**, **last** - first or last value of a field
- **any**, **all** - logical reductions
- **nunique** - number of unique values
- **prod** - product of values
- **mode** - most common value
- **range** - difference between the maximum and minimum values
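For reference, the less familiar names in this list correspond to the following standard-library computations. This is a plain-Python sketch of what the names conventionally mean, not the library's implementation, and the `passes_per_match` values are made up. `std` and `var` are left out because the text above does not say whether the sample or population formula is used.

```python
import math
import statistics

passes_per_match = [3, 5, 3, 8]

reference = {
    "median": statistics.median(passes_per_match),  # middle of sorted values
    "mode": statistics.mode(passes_per_match),  # most common value
    "nunique": len(set(passes_per_match)),  # number of distinct values
    "prod": math.prod(passes_per_match),  # product of all values
    "range": max(passes_per_match) - min(passes_per_match),
}

print(reference)
# {'median': 4.0, 'mode': 3, 'nunique': 3, 'prod': 360, 'range': 5}
```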

### Custom Aggregations

You can pass your own function to `.summary()`. The function receives the records in a group and returns a single value for that group.

### Example: Shots on Target Percentage

```python
player_shots_grouped = (
    Flow(sample_records)
    .filter(lambda r: r.get("type_name") == "Shot")
    .group_by("player_name")
)


def calculate_sot_percentage(records_in_group):
    on_target_outcomes = ["Goal", "Saved"]
    shots_on_target = 0
    total_shots = 0
    for record in records_in_group:
        if record.get("shot_outcome_name"):  # Ensure the field exists
            total_shots += 1
            if record.get("shot_outcome_name") in on_target_outcomes:
                shots_on_target += 1
    return (shots_on_target / total_shots) * 100 if total_shots > 0 else 0


player_sot_percentage_flow = player_shots_grouped.summary(
    sot_percentage=calculate_sot_percentage
)

pprint(player_sot_percentage_flow.head().collect())
```

```
[{'player_name': 'Erling Haaland', 'sot_percentage': 100.0},
 {'player_name': 'Bukayo Saka', 'sot_percentage': 0.0}]
```


## Aggregating Without Grouping: `Flow.summary()`

You can call `.summary()` directly on a `Flow` if you just want overall stats.

```python
shots_flow = (
    Flow(sample_records)
    .filter(lambda r: r.get("type_name") == "Shot")
    .summary(
        total_match_xg=("shot_xg", "sum"),
        total_match_shots="count",
        avg_xg=("shot_xg", "mean"),
    )
    .assign(
        avg_xg=lambda e: round(e.get("avg_xg"), 2),
        total_match_xg=lambda e: round(e.get("total_match_xg"), 2),
    )
)

pprint(shots_flow.collect())
```

```
[{'avg_xg': np.float64(0.03),
  'total_match_shots': 2,
  'total_match_xg': np.float64(0.06)}]
```


💡 Like `.group_by()`, `.summary()` materializes the entire flow.
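As a sanity check on the numbers above, the same overall statistics can be reproduced in plain Python. This is only a sketch of the arithmetic, using the two `shot_xg` values from `sample_records`; it does not go through `Flow` at all.

```python
# The two shots from sample_records, reduced to the field being aggregated.
shots = [{"shot_xg": 0.05}, {"shot_xg": 0.01}]

xg_values = [shot["shot_xg"] for shot in shots]

overall = {
    "total_match_xg": round(sum(xg_values), 2),
    "total_match_shots": len(shots),
    "avg_xg": round(sum(xg_values) / len(xg_values), 2),
}

# An ungrouped summary yields a single record, shown here as a one-item list.
print([overall])
# [{'total_match_xg': 0.06, 'total_match_shots': 2, 'avg_xg': 0.03}]
```

Rounding matters here because summing floats such as `0.05 + 0.01` does not always give an exact decimal result.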

## Summary

- Use `.group_by()` to group records by field(s)
- Use `.summary()` to aggregate each group (or the whole flow)
- You can use built-in aggregates or custom functions
- Combine `.filter()`, `.group_by()`, and `.summary()` to answer most analytical questions

## What’s Next?

Next, we’ll look at joining datasets - like linking events with player metadata - using `.join()`. 
