# Basic Pipelines: Transforming Your Data

Once you have your data loaded into a `Flow`, the next step is usually to clean, reshape, and enrich it. `Flow` provides several methods that operate on each record individually as it streams through your pipeline. Remember that these operations are "lazy": they define the work to be done, but the work only happens when you ask for the final results (e.g., with `.collect()`).
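
To make the idea of laziness concrete, here is a minimal sketch using plain Python generators. It is not how `Flow` is implemented; `lazy_map` and the `calls` list are hypothetical names used only to show that defining a step does no work until the results are consumed.

```python
from typing import Callable, Iterable, Iterator, List

calls: List[int] = []  # records when work is actually performed


def lazy_map(records: Iterable[dict], fn: Callable[[dict], dict]) -> Iterator[dict]:
    """Yield transformed records one at a time, only when asked for them."""
    for record in records:
        calls.append(record["event_id"])
        yield fn(record)


events = [{"event_id": 1}, {"event_id": 2}]

# Defining the step does no work yet: the generator body has not started.
pipeline = lazy_map(events, lambda r: {**r, "seen": True})
work_done_before = list(calls)

# Consuming the pipeline (the analogue of `.collect()`) triggers the work.
results = list(pipeline)

print(work_done_before)  # []
print(calls)             # [1, 2]
print(results)           # [{'event_id': 1, 'seen': True}, {'event_id': 2, 'seen': True}]
```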


In [None]:
from pprint import pprint

from penaltyblog.matchflow import Flow

# Sample event records, similar to what an events feed might contain:
sample_records = [
    {
        "event_id": 1,
        "match_id": 123,
        "period": 1,
        "timestamp": "00:01:30.500",
        "type_name": "Pass",
        "player_name": "Kevin De Bruyne",
        "location": [60.1, 40.3],
        "pass_recipient_name": "Erling Haaland",
        "pass_outcome_name": "Complete",
    },
    {
        "event_id": 2,
        "match_id": 123,
        "period": 1,
        "timestamp": "00:01:32.100",
        "type_name": "Shot",
        "player_name": "Erling Haaland",
        "location": [85.5, 50.2],
        "shot_xg": 0.05,
        "shot_outcome_name": "Goal",
    },
    {
        "event_id": 3,
        "match_id": 123,
        "period": 1,
        "timestamp": "00:02:05.000",
        "type_name": "Duel",
        "player_name": "Rodri",
        "duel_type_name": "Tackle",
        "duel_outcome_name": "Won",
    },
    {
        "event_id": 4,
        "match_id": 123,
        "period": 1,
        "timestamp": "00:02:10.000",
        "type_name": "Pass",
        "player_name": "Kevin De Bruyne",
        "location": [70.0, 25.0],
        "pass_recipient_name": "Jack Grealish",
        "pass_outcome_name": "Incomplete",
    },
    {
        "event_id": 5,
        "match_id": 123,
        "period": 1,
        "timestamp": "00:03:00.000",
        "type_name": "Shot",
        "player_name": "Bukayo Saka",
        "location": [75.0, 60.0],
        "shot_xg": 0.01,
        "shot_outcome_name": "Saved",
    },
]

pprint(sample_records[0])

{'event_id': 1,
 'location': [60.1, 40.3],
 'match_id': 123,
 'pass_outcome_name': 'Complete',
 'pass_recipient_name': 'Erling Haaland',
 'period': 1,
 'player_name': 'Kevin De Bruyne',
 'timestamp': '00:01:30.500',
 'type_name': 'Pass'}


## Selecting Fields: `.select()`
 
Sometimes, you don't need all the fields in your records. The `.select()` method lets you keep only the fields you're interested in. You provide the field names as arguments, and the flow will yield records containing only those fields.
 
### Example: Getting Player Names and Locations


In [2]:
player_locations_flow = Flow(sample_records).select(
    "player_name", "location"
)

pprint(player_locations_flow.head(1).collect())

[{'location': [60.1, 40.3], 'player_name': 'Kevin De Bruyne'}]


### Accessing Nested Fields

You can also access nested fields using dot notation, as shown in the example below.

In [5]:
example = {"a": {"b": {"c": 1}}}
flow = Flow([example]).select("a.b.c")

pprint(flow.head(1).collect())


[{'c': 1}]
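
As a rough mental model of what a dotted path means, the sketch below walks nested dictionaries one key at a time in plain Python. `get_path` is a hypothetical helper, not part of `Flow`, and returning a default for a missing path is a choice made for this sketch rather than documented library behaviour.

```python
from typing import Any


def get_path(record: dict, path: str, default: Any = None) -> Any:
    """Walk nested dicts following a dot-separated path (hypothetical helper)."""
    current: Any = record
    for part in path.split("."):
        # Stop as soon as the path cannot be followed any further.
        if not isinstance(current, dict) or part not in current:
            return default
        current = current[part]
    return current


example = {"a": {"b": {"c": 1}}}
found = get_path(example, "a.b.c")
missing = get_path(example, "a.x.c")

print(found)    # 1
print(missing)  # None
```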


### Selecting Nested vs. Dotted Keys

By default, `Flow.select("a.b.c")` will only look up `record["a"]["b"]["c"]`. 

If your data actually contains keys with dots in them - for example:

```python
{"player.info": {"name.full": "Jane Doe"}}
```

Then you have two easy options to access the data:

1. Flatten the records:

```python
(
    Flow(records)
    .flatten()
    .select("player.info.name.full")
    .collect()
)
```

After `.flatten()`, nested dicts become true dotted keys:

```python
# before flatten
{"player.info": {"name.full": "Jane Doe"}}
# after flatten
{"player.info.name.full": "Jane Doe"}
```

2. Rename the fields:

```python
(
    Flow(records)
    .rename(**{"player.info": "player_info"})  
    .assign(name_full=lambda r: r["player_info"].get("name.full")) 
    .select("player_info", "name_full")
    .collect()
)
```

Both approaches keep `.select()` fast and predictable, while still giving you full control over your field names.
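
To illustrate the before/after shape of the flattening option, here is a plain-Python sketch of collapsing nested dictionaries into dotted keys. `flatten_record` is a hypothetical helper written for this illustration; it is not the library's `.flatten()` implementation and may differ from it in edge cases such as empty dicts or lists.

```python
def flatten_record(record: dict, parent: str = "", sep: str = ".") -> dict:
    """Collapse nested dicts into one level with joined keys (hypothetical helper)."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            # Recurse into nested dicts, carrying the joined key as a prefix.
            flat.update(flatten_record(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


record = {"player.info": {"name.full": "Jane Doe"}}
flattened = flatten_record(record)

print(flattened)  # {'player.info.name.full': 'Jane Doe'}
```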


## Filtering Records: `.filter()`

Often, you'll want to work with a subset of your data. The `.filter()` method keeps only the records that meet a specific condition. You provide a function (often a lambda, though any callable works) that takes a record (a dictionary) and returns `True` to keep it or `False` to discard it.

### Example: Getting only Shot Events


In [7]:
shots_flow = Flow(sample_records).filter(lambda event: event.get("type_name") == "Shot")

# To see the result:
shot_records = shots_flow.collect()
for shot in shot_records:
    print(
        f"{shot.get('player_name')} had a shot with outcome: {shot.get('shot_outcome_name')}"
    )

Erling Haaland had a shot with outcome: Goal
Bukayo Saka had a shot with outcome: Saved


### Example: Getting Goals by a Specific Player

In [8]:
haaland_goals_flow = Flow(sample_records).filter(
    lambda event: event.get("type_name") == "Shot"
    and event.get("shot_outcome_name") == "Goal"
    and event.get("player_name") == "Erling Haaland"
)

pprint(haaland_goals_flow.collect())

[{'event_id': 2,
  'location': [85.5, 50.2],
  'match_id': 123,
  'period': 1,
  'player_name': 'Erling Haaland',
  'shot_outcome_name': 'Goal',
  'shot_xg': 0.05,
  'timestamp': '00:01:32.100',
  'type_name': 'Shot'}]
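
Since the predicate does not have to be a lambda, a named function can make longer conditions easier to read and test. The sketch below defines one such predicate (`is_completed_pass` is a name made up for this example) and checks it with a plain list comprehension, so it runs without the library.

```python
def is_completed_pass(event: dict) -> bool:
    """Return True for pass events whose outcome is 'Complete'."""
    return (
        event.get("type_name") == "Pass"
        and event.get("pass_outcome_name") == "Complete"
    )


events = [
    {"event_id": 1, "type_name": "Pass", "pass_outcome_name": "Complete"},
    {"event_id": 2, "type_name": "Shot", "shot_outcome_name": "Goal"},
    {"event_id": 4, "type_name": "Pass", "pass_outcome_name": "Incomplete"},
]

# Plain-Python check of the predicate; with a Flow it would be passed
# as `.filter(is_completed_pass)` in place of the lambda.
kept_ids = [e["event_id"] for e in events if is_completed_pass(e)]

print(kept_ids)  # [1]
```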


## Assigning New Fields or Modifying Existing Ones: `.assign()`

The `.assign()` method lets you add new fields (columns) to your records or change the values of existing ones. You provide keyword arguments where the key is the field name, and the value is a function that takes the record and returns the new value for that field.


### Example: Adding a "half" field based on "period"

In [4]:
half_flow = Flow(sample_records).assign(
    half=lambda event: "First" if event.get("period") == 1 else "Second"
)

pprint(half_flow.head(1).collect())

[{'event_id': 1,
  'half': 'First',
  'location': [60.1, 40.3],
  'match_id': 123,
  'pass_outcome_name': 'Complete',
  'pass_recipient_name': 'Erling Haaland',
  'period': 1,
  'player_name': 'Kevin De Bruyne',
  'timestamp': '00:01:30.500',
  'type_name': 'Pass'}]


### Example: Converting player name to uppercase (modifying existing)

In [5]:
uppercase_players_flow = Flow(sample_records).assign(
    player_name=lambda event: event.get("player_name", "").upper()
)

pprint(uppercase_players_flow.head(1).collect())

[{'event_id': 1,
  'location': [60.1, 40.3],
  'match_id': 123,
  'pass_outcome_name': 'Complete',
  'pass_recipient_name': 'Erling Haaland',
  'period': 1,
  'player_name': 'KEVIN DE BRUYNE',
  'timestamp': '00:01:30.500',
  'type_name': 'Pass'}]
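
The function you hand to `.assign()` can also be a named helper. As a sketch, the hypothetical `timestamp_to_seconds` below converts a timestamp into seconds, assuming the `HH:MM:SS.mmm` format used in the sample records.

```python
def timestamp_to_seconds(timestamp: str) -> float:
    """Convert an 'HH:MM:SS.mmm' string into seconds as a float."""
    hours, minutes, seconds = timestamp.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


first = timestamp_to_seconds("00:01:30.500")
second = timestamp_to_seconds("00:02:05.000")

print(first)   # 90.5
print(second)  # 125.0
```

Following the keyword pattern shown above, this could be wired in as `.assign(seconds=lambda event: timestamp_to_seconds(event["timestamp"]))`.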


## Renaming Fields: `.rename()`

Use `.rename()` to rename one or more fields in your records. This is especially useful when preparing data for analysis or flattening nested keys into more manageable names.

You pass key-value pairs where the key is the current field name and the value is the new name:

In [None]:
renamed = (
    Flow(sample_records)
    .rename(match_id="id", type_name="event_type")
    .select("id", "event_type")
)

pprint(renamed.collect()[0])

{'event_type': 'Pass', 'id': 123}


## Expanding Lists: `.explode()`

If a field contains a list, `.explode()` turns each element of that list into its own record. The other fields are copied unchanged into every new record.

In [10]:
example_record = {
    "event_id": 30,
    "involved_players": ["Player X", "Player Y"],
    "roles": ["Passer", "Receiver"],
}

involved_roles_flow = Flow([example_record]).explode(key="involved_players")

pprint(involved_roles_flow.collect())

[{'event_id': 30,
  'involved_players': 'Player X',
  'roles': ['Passer', 'Receiver']},
 {'event_id': 30,
  'involved_players': 'Player Y',
  'roles': ['Passer', 'Receiver']}]


### Expanding Multiple Lists Together: `.explode_multi()`

If you have multiple fields that are lists of the same length and you want to "unzip" them together into new records, `.explode_multi()` is useful.


In [11]:
example_record = {
    "event_id": 30,
    "involved_players": ["Player X", "Player Y"],
    "roles": ["Passer", "Receiver"],
}

involved_roles_flow = Flow([example_record]).explode_multi(
    keys=["involved_players", "roles"]
)

pprint(involved_roles_flow.collect())

[{'event_id': 30, 'involved_players': 'Player X', 'roles': 'Passer'},
 {'event_id': 30, 'involved_players': 'Player Y', 'roles': 'Receiver'}]
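
Conceptually, this "unzipping" pairs up the elements at each position across the listed fields. The plain-Python sketch below shows the idea with `zip`; `explode_together` is a hypothetical helper, and raising an error on mismatched lengths is a choice made for the sketch, not a statement about how the library handles that case.

```python
from typing import Dict, List


def explode_together(record: Dict, keys: List[str]) -> List[Dict]:
    """Emit one record per position across several equal-length list fields."""
    lengths = {len(record[key]) for key in keys}
    if len(lengths) != 1:
        raise ValueError("all listed fields must have the same length")
    rows = []
    for values in zip(*(record[key] for key in keys)):
        # Copy the original record, then overwrite the list fields
        # with the single values found at this position.
        rows.append({**record, **dict(zip(keys, values))})
    return rows


example_record = {
    "event_id": 30,
    "involved_players": ["Player X", "Player Y"],
    "roles": ["Passer", "Receiver"],
}
rows = explode_together(example_record, ["involved_players", "roles"])

for row in rows:
    print(row)
```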


### Splitting an Array Field into Multiple New Fields: `.split_array()`

If a field contains a list/array of a fixed (or maximum) expected length, and you want to assign each element to its own new named field, use `.split_array()`.

In [None]:
split_location_flow = (
    Flow(sample_records)
    .split_array(key="location", into=["start_x", "start_y"])
    .head(1)
)

pprint(split_location_flow.collect())

[{'event_id': 1,
  'location': [60.1, 40.3],
  'match_id': 123,
  'pass_outcome_name': 'Complete',
  'pass_recipient_name': 'Erling Haaland',
  'period': 1,
  'player_name': 'Kevin De Bruyne',
  'start_x': 60.1,
  'start_y': 40.3,
  'timestamp': '00:01:30.500',
  'type_name': 'Pass'}]


If the array is shorter than the `into` list, remaining new fields get `None`. If longer, extra array elements are ignored.
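
The padding and truncation rule can be written out in a few lines of plain Python. This is only a sketch of the rule as described above, using a hypothetical `split_list` helper rather than the library's own code.

```python
from typing import Any, Dict, List


def split_list(values: List[Any], into: List[str]) -> Dict[str, Any]:
    """Map list elements onto new field names, padding with None and dropping extras."""
    trimmed = list(values[: len(into)])                      # ignore extra elements
    padded = trimmed + [None] * (len(into) - len(trimmed))   # fill missing ones
    return dict(zip(into, padded))


exact = split_list([60.1, 40.3], ["start_x", "start_y"])
short = split_list([60.1], ["start_x", "start_y"])
long_ = split_list([60.1, 40.3, 0.0], ["start_x", "start_y"])

print(exact)  # {'start_x': 60.1, 'start_y': 40.3}
print(short)  # {'start_x': 60.1, 'start_y': None}
print(long_)  # {'start_x': 60.1, 'start_y': 40.3}
```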

These basic pipeline methods form the building blocks for most of your data preparation tasks. By chaining them together, you can create powerful and readable transformations. 

In the next chapter, we'll look at how to group your data and calculate aggregate statistics.