# Module 04: More on Tidy Data



This notebook aims to convince you that **tidy data** is the way to go.

 

This notebook demonstrates:



1. How to convert *wide* to **tidy** format.

2. How **filtering, grouping, and scaling** are more complicated in **wide** format than in **tidy** format.

3. How **visualization** in Altair is also simpler with **tidy** data.



---

## 1. Imports and Setup

In [1]:
import pandas as pd

## 2. Creating the Example Dataset (Wide Format)



We start with a **wide** DataFrame: each region (`North`, `South`, `East`, `West`) is in its own column.

In [2]:
wide_df = pd.DataFrame({
    "Year":  [2020, 2021, 2022, 2023],
    "North": [100, 150, 200, 250],
    "South": [ 90, 130, 170, 220],
    "East":  [ 80, 120, 160, 210],
    "West":  [ 70, 110, 150, 200]
})

display(wide_df)


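|   | Year | North | South | East | West |
|:--|:-----|:------|:------|:-----|:-----|
| 0 | 2020 | 100   | 90    | 80   | 70   |
| 1 | 2021 | 150   | 130   | 120  | 110  |
| 2 | 2022 | 200   | 170   | 160  | 150  |
| 3 | 2023 | 250   | 220   | 210  | 200  |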


## 3. Converting Wide to Tidy



We use `pd.melt()` to **unpivot** the data so that each row represents a `(Year, Region, Sales)` combination.

In [3]:
tidy_df = wide_df.melt(
    id_vars=["Year"], 
    var_name="Region", 
    value_name="Sales"
)

display(tidy_df)


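|   | Year | Region | Sales |
|:--|:-----|:-------|:------|
| 0 | 2020 | North  | 100   |
| 1 | 2021 | North  | 150   |
| 2 | 2022 | North  | 200   |
| 3 | 2023 | North  | 250   |
| 4 | 2020 | South  | 90    |
| 5 | 2021 | South  | 130   |
| 6 | 2022 | South  | 170   |
| 7 | 2023 | South  | 220   |
| 8 | 2020 | East   | 80    |
| 9 | 2021 | East   | 120   |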
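
As a sanity check on the conversion, the sketch below rebuilds the same example data, confirms that melting yields one row per `(Year, Region)` pair, and pivots the result back to wide format. The names `n_rows`, `round_trip`, and `round_trip_matches` are introduced here for illustration only and are not used elsewhere in the notebook.

```python
import pandas as pd

# Same example data as above, repeated so the snippet runs on its own.
wide_df = pd.DataFrame({
    "Year":  [2020, 2021, 2022, 2023],
    "North": [100, 150, 200, 250],
    "South": [ 90, 130, 170, 220],
    "East":  [ 80, 120, 160, 210],
    "West":  [ 70, 110, 150, 200],
})

tidy_df = wide_df.melt(id_vars=["Year"], var_name="Region", value_name="Sales")

# One row per (Year, Region) pair: 4 years x 4 regions.
n_rows = len(tidy_df)

# pivot() reverses melt(): the values of "Region" become columns again.
round_trip = (
    tidy_df.pivot(index="Year", columns="Region", values="Sales")
    .reset_index()
    .rename_axis(columns=None)          # drop the leftover "Region" axis label
    .reindex(columns=wide_df.columns)   # pivot sorts columns; restore the original order
)

# True if no information was lost by going wide -> tidy -> wide.
round_trip_matches = round_trip.equals(wide_df)
print(n_rows, round_trip_matches)
```

If the round trip matches, the tidy table carries exactly the same information as the wide one, just arranged differently.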


# 4. Filtering & Grouping



We'll do two demonstrations:

1. **Filtering** data for a specific region and year range.

2. **Scaling** each region's values by its mean sales.



---

## 4.1 Filtering for the East region where `Year > 2021`

### 4.1.1 Wide Format (Complicated)



Since there's **no direct "Region" column**, we must manually pick the column (`East`) and rename it:

In [4]:
east_after_2021_wide = wide_df.loc[wide_df["Year"] > 2021, ["Year", "East"]]
east_after_2021_wide = east_after_2021_wide.rename(columns={"East": "Sales"})
display(east_after_2021_wide)


|   | Year | Sales |
|:--|:-----|:------|
| 2 | 2022 | 160   |
| 3 | 2023 | 210   |


### 4.1.2 Tidy Format (Easy)



Just use `query("Region == 'East' and Year > 2021")`:

In [5]:
east_after_2021_tidy = tidy_df.query("Region == 'East' and Year > 2021")

display(east_after_2021_tidy)


|    | Year | Region | Sales |
|:---|:-----|:-------|:------|
| 10 | 2022 | East   | 160   |
| 11 | 2023 | East   | 210   |
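
The same pattern extends to several regions at once, which in wide format would mean selecting and renaming one column per region. The following is a minimal sketch using a boolean mask with `isin`; the list `regions_of_interest` and the variable `subset` are hypothetical names chosen for this example.

```python
import pandas as pd

# Same example data as above, repeated so the snippet runs on its own.
wide_df = pd.DataFrame({
    "Year":  [2020, 2021, 2022, 2023],
    "North": [100, 150, 200, 250],
    "South": [ 90, 130, 170, 220],
    "East":  [ 80, 120, 160, 210],
    "West":  [ 70, 110, 150, 200],
})
tidy_df = wide_df.melt(id_vars=["Year"], var_name="Region", value_name="Sales")

# Hypothetical selection: any list of region names works here.
regions_of_interest = ["East", "West"]

# Combine the region condition and the year condition in one boolean mask.
mask = tidy_df["Region"].isin(regions_of_interest) & (tidy_df["Year"] > 2021)
subset = tidy_df[mask]
print(subset)
```

Changing the selection only means editing the list; the filtering expression itself stays the same.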


## 4.2 Grouping & Scaling by the Mean



**Task**: Divide each region's sales by that region's **mean** sales.



### 4.2.1 Wide Format



Compute the mean of each column and then **manually** scale:

In [6]:
mean_north = wide_df["North"].mean()
mean_south = wide_df["South"].mean()
mean_east  = wide_df["East"].mean()
mean_west  = wide_df["West"].mean()

scaled_wide = wide_df.copy()
scaled_wide["North"] = scaled_wide["North"] / mean_north
scaled_wide["South"] = scaled_wide["South"] / mean_south
scaled_wide["East"]  = scaled_wide["East"]  / mean_east
scaled_wide["West"]  = scaled_wide["West"]  / mean_west

# A for loop could be used to avoid repetition, but it's still cumbersome

display(scaled_wide)


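|   | Year | North    | South    | East     | West     |
|:--|:-----|:---------|:---------|:---------|:---------|
| 0 | 2020 | 0.571429 | 0.590164 | 0.561404 | 0.528302 |
| 1 | 2021 | 0.857143 | 0.852459 | 0.842105 | 0.830189 |
| 2 | 2022 | 1.142857 | 1.114754 | 1.122807 | 1.132075 |
| 3 | 2023 | 1.428571 | 1.442623 | 1.473684 | 1.509434 |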
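
The loop mentioned in the code comment might look like the sketch below. It removes the repeated lines, but the list of region columns (here called `region_cols`, a name introduced for this example) still has to be written out and kept up to date by hand.

```python
import pandas as pd

# Same example data as above, repeated so the snippet runs on its own.
wide_df = pd.DataFrame({
    "Year":  [2020, 2021, 2022, 2023],
    "North": [100, 150, 200, 250],
    "South": [ 90, 130, 170, 220],
    "East":  [ 80, 120, 160, 210],
    "West":  [ 70, 110, 150, 200],
})

# The region columns must be listed explicitly so that "Year" is not scaled.
region_cols = ["North", "South", "East", "West"]

scaled_wide = wide_df.copy()
for col in region_cols:
    # Divide each region's column by that column's own mean.
    scaled_wide[col] = wide_df[col] / wide_df[col].mean()

print(scaled_wide)
```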


### 4.2.2 Tidy Format



A single `.groupby("Region")` and `.transform()` handles **all** regions:

In [7]:
scaled_tidy = tidy_df.copy()
scaled_tidy["Sales_Scaled"] = scaled_tidy.groupby("Region")["Sales"] \
                                         .transform(lambda x: x / x.mean())

display(scaled_tidy)


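|   | Year | Region | Sales | Sales_Scaled |
|:--|:-----|:-------|:------|:-------------|
| 0 | 2020 | North  | 100   | 0.571429     |
| 1 | 2021 | North  | 150   | 0.857143     |
| 2 | 2022 | North  | 200   | 1.142857     |
| 3 | 2023 | North  | 250   | 1.428571     |
| 4 | 2020 | South  | 90    | 0.590164     |
| 5 | 2021 | South  | 130   | 0.852459     |
| 6 | 2022 | South  | 170   | 1.114754     |
| 7 | 2023 | South  | 220   | 1.442623     |
| 8 | 2020 | East   | 80    | 0.561404     |
| 9 | 2021 | East   | 120   | 0.842105     |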
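
An equivalent way to write the tidy version, shown here only as an alternative sketch, is to let `transform("mean")` broadcast each region's mean back onto its rows and then divide the columns directly. A quick check that every region's scaled values average to 1 confirms the scaling did what was intended. The name `means_after_scaling` is introduced for this example.

```python
import pandas as pd

# Same example data as above, repeated so the snippet runs on its own.
wide_df = pd.DataFrame({
    "Year":  [2020, 2021, 2022, 2023],
    "North": [100, 150, 200, 250],
    "South": [ 90, 130, 170, 220],
    "East":  [ 80, 120, 160, 210],
    "West":  [ 70, 110, 150, 200],
})
tidy_df = wide_df.melt(id_vars=["Year"], var_name="Region", value_name="Sales")

scaled_tidy = tidy_df.copy()

# transform("mean") returns a Series aligned with the original rows,
# where each row holds the mean of its own region.
region_mean = scaled_tidy.groupby("Region")["Sales"].transform("mean")
scaled_tidy["Sales_Scaled"] = scaled_tidy["Sales"] / region_mean

# After dividing by the group mean, each region should average to 1.
means_after_scaling = scaled_tidy.groupby("Region")["Sales_Scaled"].mean()
print(means_after_scaling)
```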


# 5. Visualization in Altair



We'll create a simple line plot of **Sales** over **Year** for each region.

You may need to install Altair for this part of the tutorial:
```
pip install altair
```

In [8]:
import altair as alt

## 5.1 Wide Format



### Option A: Layer each region's line separately

In [9]:
chart_north = alt.Chart(wide_df).mark_line(stroke='blue').encode(
    x="Year:O",
    y="North:Q"
).properties(title="North")

chart_south = alt.Chart(wide_df).mark_line(stroke='red').encode(
    x="Year:O",
    y="South:Q"
).properties(title="South")

chart_east = alt.Chart(wide_df).mark_line(stroke='green').encode(
    x="Year:O",
    y="East:Q"
).properties(title="East")

chart_west = alt.Chart(wide_df).mark_line(stroke='orange').encode(
    x="Year:O",
    y="West:Q"
).properties(title="West")

# compose the charts
chart_wide_layered = alt.layer(chart_north, chart_south, chart_east, chart_west).properties(
    width=400,
    height=300
)

# Adding the legends would be a bit more work

chart_wide_layered


- We had to **manually** define a chart for each column (region).

- Adding or removing a region requires editing the chart code.

- Legends and other customizations would be more complex.



### Option B: Use `transform_fold` to pivot columns inside Altair



This is effectively using Altair to do what `melt()` does, but you must list **all** columns:

In [10]:
chart_wide_pivot = (
    alt.Chart(wide_df)
    .transform_fold(  # the Altair counterpart of pd.melt()
        fold=["North","South","East","West"],  # must list every region
        as_=["Region","Sales"]
    )
    .mark_line()
    .encode(
        x="Year:O",
        y="Sales:Q",
        color="Region:N"
    )
    .properties(
        width=400,
        height=300
    )
)

chart_wide_pivot


## 5.2 Tidy Format (So Much Simpler)



A single chart definition is enough, with no need to list the regions. Altair **automatically** creates one line per region, colored by `Region`.

In [12]:
chart_tidy = alt.Chart(tidy_df).mark_line().encode(
    x="Year:O",
    y="Sales:Q",
    color="Region:N"
).properties(
    width=400,
    height=300
).interactive()

chart_tidy


# 6. Final Comparison


| **Action**               | **Wide Format**                                                       | **Tidy Format**                                                  |
|:-------------------------|:----------------------------------------------------------------------|:-----------------------------------------------------------------|
| **Filtering**            | Must pick & rename columns, no direct 'Region' column                 | Simple `query("Region=='X'")`                                    |
| **Grouping or Scaling**  | Repeat operations for each column (or do complex loops)               | Single `.groupby("Region")` call                                 |
| **Adding New Categories**| Must add **new columns** and update code                              | Rows just **grow**, existing code continues to work              |
| **Visualization**        | Either layer each column manually or use `transform_fold`             | One-liner with `color="Region:N"`                                |
| **Code Complexity**      | High (lots of repeated steps, listing columns)                        | Low (concise, flexible)                                          |
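
To illustrate the "Adding New Categories" row, the sketch below appends a hypothetical fifth region to the tidy data and reruns the group-wise scaling without changing it. The region name `Central` and its sales figures are made up for this example and are not part of the dataset used in the notebook.

```python
import pandas as pd

# Same example data as above, repeated so the snippet runs on its own.
wide_df = pd.DataFrame({
    "Year":  [2020, 2021, 2022, 2023],
    "North": [100, 150, 200, 250],
    "South": [ 90, 130, 170, 220],
    "East":  [ 80, 120, 160, 210],
    "West":  [ 70, 110, 150, 200],
})
tidy_df = wide_df.melt(id_vars=["Year"], var_name="Region", value_name="Sales")

# Hypothetical new region with invented values, for illustration only.
new_rows = pd.DataFrame({
    "Year":   [2020, 2021, 2022, 2023],
    "Region": "Central",
    "Sales":  [60, 100, 140, 180],
})

# In tidy format a new category is just more rows.
extended = pd.concat([tidy_df, new_rows], ignore_index=True)

# The scaling code does not mention any region by name, so it needs no changes.
extended["Sales_Scaled"] = (
    extended["Sales"] / extended.groupby("Region")["Sales"].transform("mean")
)
print(extended.tail(4))
```

In wide format the same change would add a `Central` column, and every place that lists the region columns would need to be updated.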


**Takeaway**: 

- **Tidy format** is recommended for most data science tasks because it avoids repetitive code, makes grouping and filtering straightforward, and integrates smoothly with visualization libraries such as Altair and seaborn.

- **Wide format** can be okay for quick tasks or certain machine learning APIs, but it often becomes cumbersome when you need to filter, group, scale, or plot multiple categories.



**Whenever possible, convert to tidy** to save time and headaches!