# Why Nested Data Isn’t a Problem - It’s the Point

In football (soccer) analytics, the default approach to working with data is to flatten it into tables. Whether it's passing networks, xG chains, or player event logs, we often reach for `pandas` or SQL to bring structure to chaos. But what if we're flattening too early - or even unnecessarily?

The `Flow` engine in `penaltyblog` takes a different path. Instead of reducing everything to a rigid table, it treats nested JSON as a first-class citizen: a feature, not a problem.

This article explores why embracing nested data opens up powerful new workflows, especially for clubs and analysts working with real-world, messy, event-based football data.

## 🌎 The Nature of Football Data

Football data is inherently nested:

- A "pass" might contain a start and end location, a pressure flag, and a list of tags.
- A "shot" might include multiple qualifiers, an assist type, and a freeze frame of defenders.
- A "match" contains players, teams, events, substitutions, and metadata - all deeply structured.

When we flatten this data:

- We lose structure.
- We increase the risk of key collisions (e.g. `player.name` and `team.name` both collapsing to `name`).
- We make it harder to model and reason about what’s actually happening.

Flattening too early leads to brittle pipelines and constant cleanup.
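The collision risk is easy to reproduce in plain Python. The sketch below uses an invented event record and a deliberately naive, hypothetical `naive_flatten` helper that keeps only leaf key names. Neither comes from `penaltyblog`; they only illustrate how structure can be lost.

```python
# Illustrative event; field names and values are invented for this example.
event = {
    "type": "Pass",
    "player": {"id": 7, "name": "Player A"},
    "team": {"id": 1, "name": "Team X"},
}


def naive_flatten(record):
    """Flatten one level, keeping only leaf key names (deliberately naive)."""
    flat = {}
    for key, value in record.items():
        if isinstance(value, dict):
            flat.update(value)  # later keys silently overwrite earlier ones
        else:
            flat[key] = value
    return flat


flat_event = naive_flatten(event)
print(flat_event)  # {'type': 'Pass', 'id': 1, 'name': 'Team X'}
```

The player's `id` and `name` are gone, overwritten by the team's, and nothing raised an error along the way.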

## 🌬️ Why Nested Data is a Feature

### 1. It reflects the real world

Nested structures mirror the natural hierarchy of football:

- Matches contain events
- Events have players, contexts, and outcomes
- Actions have tags, timestamps, and spatial data

Keeping this structure lets you work with the game as it’s played, not just as rows in a table.

### 2. It's schema-flexible

Different providers (Opta, StatsBomb, Wyscout) use different formats. Trying to flatten these into a single table leads to endless exceptions.

A pipeline that embraces nesting can adapt:

```python
flow.select("player.name", "location.x", "location.y")
```

This works without caring whether `player` is a dict or a flat field, which gives `Flow` the ability to **ingest, normalize, and transform** without overfitting to a single provider's schema.
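To make the idea concrete, here is a minimal sketch of a dotted-path lookup that accepts either layout. The `get_path` helper and the sample records are hypothetical and written for this article; this is not how `Flow` is implemented internally, only one way such behaviour could work.

```python
def get_path(record, path, default=None):
    """Look up a dotted path in nested dicts, or as a flat dotted key."""
    if path in record:  # flat layout, e.g. {"player.name": ...}
        return record[path]
    current = record
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return default
        current = current[part]
    return current


# The same information in two layouts (values invented for illustration).
nested = {"player": {"name": "Player A"}, "location": {"x": 60.0, "y": 40.0}}
flat = {"player.name": "Player A", "location.x": 60.0, "location.y": 40.0}

fields = ("player.name", "location.x", "location.y")
nested_row = {f: get_path(nested, f) for f in fields}
flat_row = {f: get_path(flat, f) for f in fields}
print(nested_row == flat_row)  # True
```

The calling code asks for the same field names in both cases, so provider differences stay inside the lookup rather than leaking into every analysis script.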

### 3. It's analysis-friendly

Flattening forces premature decisions:

- Do I include all tags or just the first?
- How do I encode location - tuple, string, x/y?
- What if a freeze-frame includes 10 defenders?

Keeping the data nested lets you:

- Loop through freeze frames when needed
- Extract only meaningful tags
- Plot raw coordinates without munging

**You defer decisions until they actually matter.**
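As a sketch of what deferring looks like, the example below keeps a shot's freeze frame nested and only summarises it when a specific question comes up. The record layout, the values, and the `defenders_within` helper are all invented for illustration and do not follow any particular provider's schema.

```python
import math

# Illustrative shot record; structure and values are invented for this example.
shot = {
    "type": "Shot",
    "location": {"x": 108.0, "y": 40.0},
    "tags": ["open_play", "right_foot"],
    "freeze_frame": [
        {"teammate": False, "location": {"x": 112.0, "y": 38.0}},
        {"teammate": False, "location": {"x": 116.0, "y": 42.0}},
        {"teammate": True, "location": {"x": 104.0, "y": 45.0}},
    ],
}


def defenders_within(record, radius):
    """Count opposing players within `radius` units of the shot location."""
    sx, sy = record["location"]["x"], record["location"]["y"]
    count = 0
    for player in record.get("freeze_frame", []):
        if player["teammate"]:
            continue  # only opponents count as defenders
        px, py = player["location"]["x"], player["location"]["y"]
        if math.hypot(px - sx, py - sy) <= radius:
            count += 1
    return count


close_defenders = defenders_within(shot, radius=5.0)
print(close_defenders)  # 1
```

Because the freeze frame was never squeezed into fixed columns, changing the radius or switching to a different measure of pressure is a one-line change rather than a re-export.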

## 🌟 Flow: A Query Engine for Nested JSON

Instead of flattening:

- Just point `Flow` at your folder of JSON files
- Chain transformations lazily (`filter`, `assign`, `group_by`)
- Select nested fields naturally

Output to dashboards, notebooks, or summaries without needing pandas.

```python
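# Assumes `Flow` has been imported from `penaltyblog` and that `model`
# is a fitted model object defined elsewhere.
flow = (
    Flow.from_folder("data/events/")  # point Flow at a folder of JSON files
    .filter(lambda r: r["type"] == "Shot")  # keep shot events only
    .assign(xT=lambda r: model.predict(r))  # add a model-derived value
    .select("player.name", "xT", "location")  # pick nested fields by dotted path
    .to_json("shots.json")  # write the result to disk
)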
```

This turns your raw event data into a **queryable, schema-aware** stream, not a rigid table.
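For readers curious about what "lazy" chaining means in practice, the following is a rough plain-Python sketch using generators. The `where` and `assign` functions, the sample events, and the 120-unit pitch length are assumptions made for this illustration; they are not `Flow`'s API or internals.

```python
def where(records, predicate):
    """Lazily keep only the records matching the predicate."""
    return (r for r in records if predicate(r))


def assign(records, **columns):
    """Lazily add computed fields without mutating the input records."""
    for r in records:
        yield {**r, **{name: fn(r) for name, fn in columns.items()}}


# Invented events; x runs from 0 to an assumed pitch length of 120.
events = [
    {"type": "Pass", "location": {"x": 50.0, "y": 30.0}},
    {"type": "Shot", "location": {"x": 108.0, "y": 40.0}},
    {"type": "Shot", "location": {"x": 96.0, "y": 52.0}},
]

# Nothing is evaluated yet: this only builds a chain of generators.
pipeline = assign(
    where(iter(events), lambda r: r["type"] == "Shot"),
    dist_to_goal_line=lambda r: 120.0 - r["location"]["x"],
)

shots = list(pipeline)  # evaluation happens here, one record at a time
print([s["dist_to_goal_line"] for s in shots])  # [12.0, 24.0]
```

Each record passes through the whole chain before the next one is read, so the full dataset never has to sit in memory as a table.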

## 📊 When Flattening Still Helps

Flattening isn't evil; it's just not always the right first step. Use it when:

- You’re building reports or exports for BI tools
- You’ve standardized your schema
- You need fast vectorized ops (e.g. model training)

Even then, with `Flow` you can defer flattening until the end:

```python
flow.filter(...).flatten().to_pandas()
```
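When flattening does happen at the end, prefixing each key with its parent path avoids the collisions described earlier. The `flatten` function below is a hypothetical standalone sketch, not the implementation behind `Flow`'s `flatten()`, and the sample record is invented.

```python
def flatten(record, parent_key="", sep="."):
    """Recursively flatten nested dicts into dotted keys; lists are left as-is."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


row = flatten(
    {
        "type": "Shot",
        "player": {"name": "Player A"},
        "team": {"name": "Team X"},
        "location": {"x": 108.0, "y": 40.0},
    }
)
print(sorted(row))
# ['location.x', 'location.y', 'player.name', 'team.name', 'type']
```

A list of rows in this shape can be passed straight to `pandas.DataFrame(rows)` once tabular operations are actually needed.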

## 🚀 Final Thought: Let the Structure Work For You

Football data is complex because the game is complex. Embracing nested formats doesn’t just make life easier; it makes your analysis more robust, expressive, and future-proof.

With `Flow`, you don’t fight the data. You query it, shape it, and explore it - on your terms, in its native structure.

Welcome to the JSON lake era.