# Summaries: Grouping and Aggregating Data

After cleaning and transforming individual records, a common next step in data analysis is to summarize information across different groups. For instance, you might want to count the total shots per team, find the average xG (Expected Goals) per player, or sum the number of passes made in different zones of the pitch. `Flow` provides powerful tools for these "group by" operations.

The core idea is a two-step process:

- **Group**: Define which field(s) you want to group your records by using `.group_by()`.
- **Aggregate**: Specify what calculations (aggregations like sum, mean, count) you want to perform on each group using `.summary()`.

## Grouping Records: `.group_by()`

The `.group_by(*keys)` method takes one or more field names as arguments. It processes the entire stream of records and collects them into distinct groups based on the unique combinations of values in the specified key field(s).

This method doesn't return a regular `Flow` object. Instead, it returns a special `FlowGroup` object. Think of a `FlowGroup` as a collection of "sub-Flows," where each sub-Flow contains all the records belonging to one specific group.
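
To make the grouping step concrete, here is a minimal plain-Python sketch of the idea. It is not how `Flow` implements `.group_by()`: the helper `group_records` is a hypothetical name used only for illustration, and keying each group by a tuple of its key values is an assumption.

```python
from collections import defaultdict


def group_records(records, *keys):
    """Collect records into lists keyed by a tuple of their key-field values.

    Hypothetical helper for illustration only; not part of the Flow API.
    """
    groups = defaultdict(list)
    for record in records:
        # Assumption: a missing field is treated as None, so every record
        # still lands in some group.
        group_key = tuple(record.get(key) for key in keys)
        groups[group_key].append(record)
    return dict(groups)


events = [
    {"team_name": "Manchester City", "type_name": "Pass"},
    {"team_name": "Manchester City", "type_name": "Shot"},
    {"team_name": "Arsenal", "type_name": "Shot"},
]

by_team = group_records(events, "team_name")
by_team_and_type = group_records(events, "team_name", "type_name")

for group_key, members in by_team.items():
    print(group_key, len(members))
```

Passing more than one key simply produces longer tuples, so each distinct combination of values becomes its own group.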

### Example: Grouping events by team name


In [1]:
from pprint import pprint

from penaltyblog.matchflow import Flow

# Sample records that events_flow might contain:
sample_records = [
    {
        "event_id": 1,
        "match_id": 123,
        "period": 1,
        "timestamp": "00:01:30.500",
        "type_name": "Pass",
        "player_name": "Kevin De Bruyne",
        "location": [60.1, 40.3],
        "pass_recipient_name": "Erling Haaland",
        "pass_outcome_name": "Complete",
        "team_name": "Manchester City",
    },
    {
        "event_id": 2,
        "match_id": 123,
        "period": 1,
        "timestamp": "00:01:32.100",
        "type_name": "Shot",
        "player_name": "Erling Haaland",
        "location": [85.5, 50.2],
        "shot_xg": 0.05,
        "shot_outcome_name": "Goal",
        "team_name": "Manchester City",
    },
    {
        "event_id": 3,
        "match_id": 123,
        "period": 1,
        "timestamp": "00:02:05.000",
        "type_name": "Duel",
        "player_name": "Rodri",
        "duel_type_name": "Tackle",
        "duel_outcome_name": "Won",
        "team_name": "Manchester City",
    },
    {
        "event_id": 4,
        "match_id": 123,
        "period": 1,
        "timestamp": "00:02:10.000",
        "type_name": "Pass",
        "player_name": "Kevin De Bruyne",
        "location": [70.0, 25.0],
        "pass_recipient_name": "Jack Grealish",
        "pass_outcome_name": "Incomplete",
        "team_name": "Manchester City",
    },
    {
        "event_id": 5,
        "match_id": 123,
        "period": 1,
        "timestamp": "00:03:00.000",
        "type_name": "Shot",
        "player_name": "Bukayo Saka",
        "location": [75.0, 60.0],
        "shot_xg": 0.01,
        "shot_outcome_name": "Post",
        "team_name": "Arsenal",
    },
]

pprint(sample_records[0])

{'event_id': 1,
 'location': [60.1, 40.3],
 'match_id': 123,
 'pass_outcome_name': 'Complete',
 'pass_recipient_name': 'Erling Haaland',
 'period': 1,
 'player_name': 'Kevin De Bruyne',
 'team_name': 'Manchester City',
 'timestamp': '00:01:30.500',
 'type_name': 'Pass'}


In [2]:
events_flow = Flow(sample_records)

# This step consumes events_flow and creates groups.
# grouped_by_team is a FlowGroup object, not a Flow.
grouped_by_team = events_flow.group_by("team_name")

# At this point, grouped_by_team internally holds something like:
# {
#   ("Team A",): [record1_teamA, record2_teamA, ...],
#   ("Team B",): [record1_teamB, record2_teamB, ...],
#   ...
# }
# where each list is essentially a mini-stream of records for that team.

*Important Note*: `.group_by()` needs to see all the records to form the groups, so it materializes (consumes) the `Flow` it's called on. The original `events_flow` is exhausted after this operation.
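
The same "consumed once" behavior can be seen with an ordinary Python generator. The sketch below uses no `Flow` code at all; `stream_events` is a hypothetical stand-in for any one-shot stream of records.

```python
def stream_events():
    """Yield events one at a time, like a one-shot stream (hypothetical)."""
    yield {"team_name": "Manchester City"}
    yield {"team_name": "Arsenal"}


stream = stream_events()

# Grouping has to read every record, so a full pass drains the stream.
first_pass = list(stream)

# A second pass over the same generator finds nothing left to read.
second_pass = list(stream)

print(len(first_pass), len(second_pass))
```

This is why the examples in this section build a fresh `Flow(sample_records)` before each grouping.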

### Example: Grouping by multiple keys: period and event type

In [3]:
events_flow = Flow(sample_records)

grouped_by_period_and_type = events_flow.group_by("period", "type_name")

# With the sample records this creates groups for (1, "Pass"), (1, "Shot") and (1, "Duel").
# A full match would also produce keys such as (2, "Pass").

## Aggregating Data within Groups: `FlowGroup.summary()`

Once you have a `FlowGroup` object (from `.group_by()`), you can perform aggregations on each group using its `.summary(**aggregates)` method.

The `.summary()` method takes keyword arguments where:

- The key is the name of the new summary field you want to create in your output.
- The value specifies how to calculate that summary. This can be:
    - A tuple `(field_to_aggregate, agg_function_string)`, e.g. `("shot_xg", "sum")` to sum the xG.
    - A custom lambda or named function that takes the list of records in a group and returns a single aggregated value.
    - A string naming a built-in aggregate function that needs neither arguments nor a field name, e.g. `"count"`.

The result of `FlowGroup.summary()` is a new `Flow` object, where each record represents one group and contains the grouping keys along with the newly calculated aggregate fields.
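
The three kinds of value can be pictured as a small dispatch over each keyword argument. The following is a plain-Python sketch under stated assumptions, not the library's implementation: `summarize_group` is a hypothetical helper, it supports only `"count"`, `"sum"` and `"mean"`, and skipping records that lack the field is an assumption.

```python
def summarize_group(records, **aggregates):
    """Apply each aggregate spec to one group's records.

    Hypothetical sketch; supports only "count", "sum" and "mean".
    """
    reducers = {
        "sum": sum,
        "mean": lambda values: sum(values) / len(values) if values else None,
    }
    summary = {}
    for name, spec in aggregates.items():
        if callable(spec):
            # Custom function: receives the group's full list of records.
            summary[name] = spec(records)
        elif isinstance(spec, tuple):
            # (field, function name): reduce that field's values.
            # Assumption: records without the field are skipped.
            field, func_name = spec
            values = [r[field] for r in records if r.get(field) is not None]
            summary[name] = reducers[func_name](values)
        elif spec == "count":
            # Bare string: an aggregate that needs no field.
            summary[name] = len(records)
        else:
            raise ValueError(f"Unsupported aggregate spec: {spec!r}")
    return summary


shots = [
    {"player_name": "Erling Haaland", "shot_xg": 0.05, "shot_outcome_name": "Goal"},
    {"player_name": "Bukayo Saka", "shot_xg": 0.01, "shot_outcome_name": "Post"},
]

result = summarize_group(
    shots,
    total_xg=("shot_xg", "sum"),
    number_of_shots="count",
    goals=lambda rs: sum(r["shot_outcome_name"] == "Goal" for r in rs),
)
print(result)
```

Each keyword name becomes a field in the output record, which matches how the summary fields are named in the examples that follow.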

### Example: Summing xG per team


In [4]:
results = (
    Flow(sample_records)
    .group_by("team_name")
    .summary(
        total_xg=("shot_xg", "sum"),
    )
)
pprint(results.collect())

[{'team_name': 'Manchester City', 'total_xg': np.float64(0.05)},
 {'team_name': 'Arsenal', 'total_xg': np.float64(0.01)}]


### Example: Summing xG and counting shots per player

In [5]:
# First, filter for shots to make aggregation simpler
results = (
    Flow(sample_records)
    .filter(lambda r: r.get("type_name") == "Shot")
    .group_by("player_name")
    .summary(
        total_xg=("shot_xg", "sum"),
        number_of_shots="count",
    )
)

pprint(results.collect())

[{'number_of_shots': 1,
  'player_name': 'Erling Haaland',
  'total_xg': np.float64(0.05)},
 {'number_of_shots': 1,
  'player_name': 'Bukayo Saka',
  'total_xg': np.float64(0.01)}]


## Built-in Aggregation Functions

Grouped `Flow`s support many common aggregations, including:

- "count": Number of records in the group.
- "sum": Sum of values for a specified field.
- "mean" or "avg": Average of values for a specified field.
- "min": Minimum value for a specified field.
- "max": Maximum value for a specified field.
- "median": Median value for a specified field.
- "std": Standard deviation for a specified field.
- "var": Variance for a specified field.
- "first": The value of a specified field from the first record in the group (original order).
- "last": The value of a specified field from the last record in the group (original order).
- "any": True if any item in the group is True, otherwise False.
- "all": True if all items in the group are True, otherwise False.
- "nunique": Number of unique values for a specified field.
- "prod": Product of values for a specified field.
- "mode": The most common value for a specified field.
- "range": The range of values for a specified field.
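
For readers less familiar with a few of these names, the sketch below computes plain-Python equivalents with the standard library. It only illustrates what the statistics mean. It assumes "range" means maximum minus minimum, and it does not cover how the library breaks ties for "mode" or whether "std" and "var" use the sample or population formula.

```python
import math
import statistics

# Illustrative xG values for one hypothetical group of shots.
xg_values = [0.05, 0.01, 0.05, 0.30]

plain_equivalents = {
    "nunique": len(set(xg_values)),            # distinct values
    "prod": math.prod(xg_values),              # product of all values
    "mode": statistics.mode(xg_values),        # most common value
    "range": max(xg_values) - min(xg_values),  # assumed: max minus min
    "median": statistics.median(xg_values),    # middle of the sorted values
}

for name, value in plain_equivalents.items():
    print(name, value)
```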

You can also create your own custom aggregation functions and lambdas.

### Example: Using a custom function for aggregation

In [6]:
player_shots_grouped = (
    Flow(sample_records)
    .filter(lambda r: r.get("type_name") == "Shot")
    .group_by("player_name")
)

def calculate_sot_percentage(records_in_group):
    on_target_outcomes = ["Goal", "Saved"]
    shots_on_target = 0
    total_shots = 0
    for record in records_in_group:
        if record.get("shot_outcome_name"):  # Ensure the field exists
            total_shots += 1
            if record.get("shot_outcome_name") in on_target_outcomes:
                shots_on_target += 1
    return (shots_on_target / total_shots) * 100 if total_shots > 0 else 0


player_sot_percentage_flow = player_shots_grouped.summary(
    sot_percentage=calculate_sot_percentage
)

pprint(player_sot_percentage_flow.head().collect())

[{'player_name': 'Erling Haaland', 'sot_percentage': 100.0},
 {'player_name': 'Bukayo Saka', 'sot_percentage': 0.0}]


## Summarizing the Entire Stream: `.summary()`

If you don't need to group by any specific field but want to calculate aggregates over the entire `Flow`, you can use the `.summary()` method directly on a `Flow` object without grouping first.

This is useful for getting overall statistics, like the total number of passes in a match or the average xG of all shots.

### Example: Getting total xG and total shots for the whole match

In [14]:
shots_flow = (
    Flow(sample_records)
    .filter(lambda r: r.get("type_name") == "Shot")
    .summary(
        total_match_xg=("shot_xg", "sum"),
        total_match_shots="count",
        avg_xg=("shot_xg", "mean"),
    )
)

pprint(shots_flow.collect())

[{'avg_xg': np.float64(0.030000000000000002),
  'total_match_shots': 2,
  'total_match_xg': np.float64(0.060000000000000005)}]


Like `.group_by()`, `.summary()` also materializes the `Flow` it's called on because it needs to process all records to compute the summary statistics. The result is a new `Flow` containing a single record with the summary statistics.
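
Conceptually, a whole-stream summary treats every record as a member of one single group. The sketch below reproduces that idea in plain Python with the standard library only. It is an illustration rather than the library's code, and the record list is a trimmed, hypothetical version of the sample data.

```python
from statistics import fmean

records = [
    {"type_name": "Pass"},
    {"type_name": "Shot", "shot_xg": 0.05},
    {"type_name": "Shot", "shot_xg": 0.01},
]

# Filter first, then reduce the remaining records as one group.
shots = [r for r in records if r.get("type_name") == "Shot"]
xg_values = [r["shot_xg"] for r in shots]

match_summary = {
    "total_match_shots": len(shots),
    "total_match_xg": sum(xg_values),
    "avg_xg": fmean(xg_values) if xg_values else None,
}

# One summary record, wrapped in a list like a collected single-record Flow.
print([match_summary])
```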

Grouping and aggregating are fundamental to extracting insights from event data. By combining `.filter()`, `.group_by()`, and `.summary()`, you can create powerful summaries tailored to your analytical questions. 
