# Principles of Data Visualization & Finishing plotnine

This notebook wraps up the **Grammar of Graphics** in plotnine and introduces core **principles of effective data visualization**.  
We will practice: **statistical transformations**, **layer-specific mappings**, **themes**, and **critique of bad plots**.

| Section | Topic |
|---------|-------|
| 1 | Setup & Warm-Up |
| 2 | Statistical Transformations (`stat`) |
| 3 | Layer-Specific Aesthetic Mappings |
| 4 | Themes — Polish the Look |
| 5 | 🔬 Principles of Data Visualization — Critique Lab |
| 6 | Open Exploration — Diamonds |
| Appendix | Quick Reference |


---
## 1. Setup & Warm-Up

In [None]:
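import numpy as np
import pandas as pd
import matplotlib.pyplot as plt    # used for the matplotlib example in Critique 3
from plotnine import *             # star import keeps plot code short in a teaching notebook
from plotnine.data import *        # bundled datasets such as midwest and diamonds
import warnings
warnings.filterwarnings('ignore')  # silences ALL warnings; remove this line when debugging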


### 🔬 Warm-Up — Grammar Recap

Without peeking at previous notebooks, write **one plotnine plot** using the `midwest` dataset that:
- Uses `percollege` (x) and `percprof` (y)
- Colors points by `state`
- Adds facets by `state`
- Adds a smooth trend line per state


In [None]:
# Your recall plot here


<details>
<summary>💡 One approach — click to peek</summary>
<br>

```python
(ggplot(midwest, aes('percollege', 'percprof', color='state'))
 + geom_point(alpha=0.5)
 + stat_smooth()
 + facet_wrap('state')
 + ggtitle('College vs Prof by State')
 + theme_bw()
).draw()
```

</details>

---
## 2. Statistical Transformations (`stat`)

### 📋 Board Reference

| Stat | What it does | Common pairing |
|------|-------------|----------------|
| `stat_smooth()` | Fits a trend line (LOESS or linear) | `geom_point()` |
| `stat_bin(bins=N)` | Bins continuous data into groups | `geom_bar()` or histogram |
| `stat_count()` | Counts rows per category | `geom_bar()` on discrete x |
| `geom_bar(stat='summary', fun_y=np.mean)` | Bar heights = group mean | discrete x, numeric y |

**Key idea:** Every geom has a *default stat*; `stat_*` functions let you be explicit or swap them.


In [None]:
# stat_smooth — draws a trend line per group
(ggplot(midwest, aes('percollege', 'percprof', color='state'))
 + geom_point(alpha=0.4)
 + stat_smooth()
).draw()


In [None]:
# stat_bin — histogram with explicit bins
(ggplot(midwest, aes('percollege'))
 + stat_bin(geom='bar', bins=20)
).draw()


In [None]:
# geom_bar with stat='summary' — bar height = group mean
(ggplot(midwest, aes(x='state', y='poptotal'))
 + geom_bar(stat='summary', fun_y=np.mean)
).draw()


In [None]:
# Verify: what does the bar height represent?
midwest.groupby('state')['poptotal'].mean()


---
### 🔬 Explore 1 — Stat Layers

1. **Trend lines by group**: Plot `perchsd` (% high school diploma) vs `percollege`.  
   Color by `state`. Add `stat_smooth()` — do all states follow the same trend?

2. **Manual histogram**: Use `stat_bin(geom='bar', bins=30)` to plot the distribution of `percbelowpoverty`.  
   Then swap `bins` to 5 and 50 — how does the choice affect interpretation?

3. **Bar of means**: Make a bar plot showing mean `percprof` per `state` using `stat='summary'`.  
   Which state has the highest average % professionals?

4. **Bonus**: Layer `geom_point()` *under* `stat_smooth()` on the same plot. Order matters — try both ways.


In [None]:
# 1. perchsd vs percollege with smooth by state


In [None]:
# 2. percbelowpoverty histogram — try bins=5, 30, 50


In [None]:
# 3. bar of mean percprof per state


In [None]:
# 4. Bonus: layer order experiment


<details>
<summary>💡 One approach — click to peek</summary>
<br>

*Ordering geom layers matters! Later layers are drawn on top.*

```python
# 1. Trend lines by state
(ggplot(midwest, aes('percollege', 'perchsd', color='state'))
 + geom_point(alpha=0.3)
 + stat_smooth()
 + ggtitle('High School vs College Graduation by State')
).draw()

# 2. Bins comparison
for n in [5, 30, 50]:
    (ggplot(midwest, aes('percbelowpoverty'))
     + stat_bin(geom='bar', bins=n)
     + ggtitle(f'Poverty distribution — {n} bins')
    ).draw()

# 3. Mean percprof per state
(ggplot(midwest, aes(x='state', y='percprof'))
 + geom_bar(stat='summary', fun_y=np.mean, fill='steelblue')
 + ggtitle('Mean % Professionals by State')
).draw()

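# 4. Layer order: points drawn first sit under the smooth
(ggplot(midwest, aes('percollege', 'percprof'))
 + geom_point(color='gray', alpha=0.4)   # under
 + stat_smooth(color='red')
).draw()

# Reversed: points drawn last cover the smooth
(ggplot(midwest, aes('percollege', 'percprof'))
 + stat_smooth(color='red')
 + geom_point(color='gray', alpha=0.4)   # over
).draw()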
```

</details>

---
## 3. Layer-Specific Aesthetic Mappings

### 📋 Board Reference

```
ggplot(df, aes(...))   ← global: applies to ALL layers
  + geom_point(aes(...))  ← local: overrides/adds for this layer only
  + stat_smooth()          ← inherits global aes unless overridden
```

| Pattern | Effect |
|---------|--------|
| `aes(color='state')` at top level | EVERY layer colored by state |
| `geom_point(aes(color='state'))` | Only points are colored; smooth is one line |
| Mix both | Fine-grained control per layer |


In [None]:
# Top-level color → smooth also splits by state (5 lines)
(ggplot(midwest, aes('percollege', 'percprof', color='state'))
 + geom_point(alpha=0.4)
 + stat_smooth()
).draw()


In [None]:
# Layer-specific color → ONE overall smooth line
(ggplot(midwest, aes('percollege', 'percprof'))
 + geom_point(aes(color='state'), alpha=0.4)
 + stat_smooth(color='black', size=1)
).draw()


---
### 🔬 Explore 2 — Layer Mappings

1. Make a scatter of `percollege` vs `percbelowpoverty`.  
   Color the **points** by `state`, but add **one overall** trend line (black, `stat_smooth`).

2. Now flip it: put `color='state'` at the top level and add `stat_smooth()`.  
   How many trend lines appear? Why?

3. **Facets + layer maps**: Add `facet_wrap('state')` to your plot from #1.  
   Move the color to the top level. Does the smooth still split?

4. **Bonus**: Use `geom_point(aes(size='poptotal'))` layered with a *color* at the global level.  
   What does the plot communicate?


In [None]:
# 1. Points by state, one smooth


In [None]:
# 2. Color at top level — how many smooth lines?


In [None]:
# 3. Facet + layer color experiment


In [None]:
# 4. Bonus: size aesthetic mapping


<details>
<summary>💡 One approach — click to peek</summary>
<br>

```python
# 1. Points colored by state, ONE smooth
(ggplot(midwest, aes('percollege', 'percbelowpoverty'))
 + geom_point(aes(color='state'), alpha=0.5)
 + stat_smooth(color='black')
 + ggtitle('College Education vs Poverty Rate')
).draw()

# 2. Color at top → 5 smooth lines
(ggplot(midwest, aes('percollege', 'percbelowpoverty', color='state'))
 + geom_point(alpha=0.5)
 + stat_smooth()                           # one per group
).draw()

# 3. Facet + layer color
(ggplot(midwest, aes('percollege', 'percbelowpoverty', color='state'))
 + geom_point(alpha=0.4)
 + stat_smooth()
 + facet_wrap('state')
).draw()

# 4. Size + color
(ggplot(midwest, aes('percollege', 'percbelowpoverty', color='state'))
 + geom_point(aes(size='poptotal'), alpha=0.4)
 + ggtitle('College vs Poverty, sized by population')
).draw()
```

</details>

---
## 4. Themes — Polish the Look

### 📋 Board Reference

| Built-in theme | Style |
|----------------|-------|
| `theme_gray()` | Default ggplot gray background |
| `theme_bw()` | White background, light gray grid, dark panel border |
| `theme_classic()` | Clean, minimal axes |
| `theme_minimal()` | No background, light grid |
| `theme_void()` | No axes, no grid |

**Fine-grained `theme()` options:**
```python
theme(
    axis_text_x  = element_text(angle=45, hjust=1),
    axis_title_x = element_text(size=18),
    plot_title   = element_text(size=20, face='bold'),
    legend_position = 'none'        # or 'top', 'bottom', 'left', 'right'
)
```


In [None]:
# Built-in theme + fine-grained tweaks
(ggplot(midwest, aes(x='percollege'))
 + geom_histogram(bins=25, fill='steelblue', color='white')
 + ggtitle('Distribution of College Graduation Rate')
 + theme_bw()
 + theme(
     axis_text_x  = element_text(angle=45, hjust=1),
     axis_title_x = element_text(size=16),
     axis_title_y = element_text(size=16),
     plot_title   = element_text(size=20)
 )
).draw()


---
### 🔬 Explore 3 — Themes

1. Take your favorite plot from **Explore 1 or 2** and apply **three different built-in themes**.  
   Which one do you prefer and why?

2. Make a bar chart of `state` vs mean `percprof` (from Explore 1).  
   Add `theme(axis_text_x = element_text(angle=45, hjust=1))` — why is this useful here?

3. Make a scatter plot with:
   - `theme_classic()`  
   - `ggtitle` of your choice  
   - A **legend hidden** (`legend_position='none'`)  
   - Point `size` scaled by `poptotal`

4. **Bonus**: Customize the background with `theme(plot_background=element_rect(fill='lightblue'))`, then try other fill colors.


In [None]:
# 1. Three theme comparison


In [None]:
# 2. Bar chart + rotated x labels


In [None]:
# 3. theme_classic + hidden legend


In [None]:
# 4. Bonus: custom background


<details>
<summary>💡 One approach — click to peek</summary>
<br>

```python
# 1. Theme comparison
base = (ggplot(midwest, aes('percollege', 'percprof', color='state'))
        + geom_point(alpha=0.5)
        + ggtitle('percollege vs percprof'))

for t in [theme_gray(), theme_bw(), theme_classic()]:
    (base + t).draw()

# 2. Rotated labels on bar chart
(ggplot(midwest, aes(x='state', y='percprof'))
 + geom_bar(stat='summary', fun_y=np.mean, fill='coral')
 + theme_bw()
 + theme(axis_text_x = element_text(angle=45, hjust=1))
 + ggtitle('Mean % Professionals by State')
).draw()

# 3. Classic theme, no legend, size mapping
(ggplot(midwest, aes('percollege', 'percbelowpoverty', color='state'))
 + geom_point(aes(size='poptotal'), alpha=0.5)
 + theme_classic()
 + theme(legend_position='none')
 + ggtitle('College vs Poverty (sized by population)')
).draw()

# 4. Bonus: custom background
(ggplot(midwest, aes('percollege', 'percprof', color='state'))
 + geom_point()
 + theme(plot_background = element_rect(fill='lightyellow'))
).draw()
```

</details>

---
## 5. 🔬 Principles of Data Visualization — Critique Lab

### 📋 Board Reference — What makes a good plot?

| Principle | Guideline |
|-----------|-----------|
| **Truth** | Don't distort scales or truncate axes misleadingly |
| **Clarity** | One clear message; avoid chartjunk |
| **Accessibility** | Use colorblind-friendly palettes; label clearly |
| **Comparability** | Use consistent axes when comparing panels |
| **Right geom** | Use line plots for continuous trends; bar charts for counts/categories |
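
To put a number on the **Truth** row, here is a small worked calculation of how a truncated value axis exaggerates a difference between two bars. The values 98 and 100 and the helper `visual_ratio` are made up for illustration.

```python
def visual_ratio(a, b, axis_min=0.0):
    """Ratio of the drawn heights of bars b and a when the axis starts at axis_min."""
    return (b - axis_min) / (a - axis_min)

# Axis starts at zero: bar heights are proportional to the values.
honest = visual_ratio(98, 100)

# Axis starts at 97: the drawn heights are 1 and 3 units.
truncated = visual_ratio(98, 100, axis_min=97)

print(f"zero-based axis: {honest:.2f}x, truncated axis: {truncated:.2f}x")
```

A 2% difference in the data is drawn as a threefold difference in bar height once the axis starts at 97.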

The plots below have **deliberate problems**. Identify the issue and fix each one.


### 🔬 Critique 1 — A confusing scatter plot

What is wrong with this plot? Fix it.


In [None]:
# BAD PLOT — run to see the problem, then fix it below
from plotnine.data import diamonds
(ggplot(diamonds, aes(x='x', y='y'))
 + geom_line()
).draw()


**What's wrong?** *(double-click to edit)*

*Your answer here*

In [None]:
# Your fixed version


<details>
<summary>💡 One approach — click to peek</summary>
<br>

*The column names 'x' and 'y' refer to physical dimensions of the diamond, not plot axes!*

```python
# Problems: (1) x and y are ambiguous column names — not clear they mean diamond dimensions
# (2) geom_line is wrong for this data — no meaningful x→y sequence
# (3) some extreme outliers (errors in data) are not filtered
# Fix:
diamonds_clean = diamonds[(diamonds['x'] > 0) & (diamonds['y'] > 0) & (diamonds['y'] < 20)]
(ggplot(diamonds_clean, aes(x='x', y='y'))
 + geom_point(alpha=0.05, color='steelblue')
 + labs(title='Diamond Length vs Width', x='Length (mm)', y='Width (mm)')
 + theme_bw()
).draw()
```

</details>

### 🔬 Critique 2 — Color overload bar chart

What is wrong with this plot? Fix it.


In [None]:
# BAD PLOT
avg_price = diamonds.groupby('clarity').price.mean().reset_index()
(ggplot(avg_price, aes(x='clarity', y='price', fill='clarity'))
 + geom_bar(stat='identity', color='r')
 + geom_text(label=avg_price.clarity)
 + theme_classic()
).draw()


**What's wrong?** *(double-click to edit)*

*Your answer here*

In [None]:
# Your fixed version


<details>
<summary>💡 One approach — click to peek</summary>
<br>

*Redundant encoding (same info in x AND fill) adds noise, not info. Pick one.*

```python
# Problems: (1) redundant color (x and fill both encode clarity — unnecessary)
# (2) red border is distracting
# (3) text labels repeat x-axis info — cluttered
# Fix:
avg_price = diamonds.groupby('clarity').price.mean().reset_index()
(ggplot(avg_price, aes(x='clarity', y='price'))
 + geom_bar(stat='identity', fill='steelblue')
 + labs(title='Average Diamond Price by Clarity',
        x='Clarity Grade', y='Mean Price (USD)')
 + theme_bw()
 + theme(axis_text_x=element_text(angle=30, hjust=1))
).draw()
```

</details>

### 🔬 Critique 3 — Unhelpful grouping

What is wrong with this matplotlib plot? Rewrite it as a plotnine plot.


In [None]:
# BAD PLOT
ideal  = diamonds[diamonds.cut == 'Ideal']
prem   = diamonds[diamonds.cut == 'Premium']
good   = diamonds[diamonds.cut == 'Good']
vgood  = diamonds[diamonds.cut == 'Very Good']
fair   = diamonds[diamonds.cut == 'Fair']

plt.plot('carat', 'price', 'r.', data=ideal)
plt.plot('carat', 'price', 'm.', data=prem)
plt.plot('carat', 'price', 'y.', data=good)
plt.plot('carat', 'price', 'w.', data=vgood)
plt.plot('carat', 'price', 'k.', data=fair)
plt.show()


**What's wrong?** *(double-click to edit)*

*Your answer here*

In [None]:
# Your improved plotnine version


<details>
<summary>💡 One approach — click to peek</summary>
<br>

*Grammar of Graphics lets plotnine handle grouping automatically — no manual subsetting needed!*

```python
# Problems: (1) white points ('w.') are invisible and yellow ones nearly so on a white background
# (2) no legend, so colors cannot be matched to cuts
# (3) no axis labels or title
# (4) 54k overlapping points — need alpha
# Fix with plotnine:
(ggplot(diamonds, aes('carat', 'price', color='cut'))
 + geom_point(alpha=0.1, size=0.5)
 + labs(title='Carat vs Price by Cut Quality',
        x='Carat', y='Price (USD)', color='Cut')
 + scale_color_brewer(type='qual', palette='Set1')
 + theme_bw()
).draw()
```

</details>

---
## 6. Open Exploration — Full plotnine Grammar

Use the `diamonds` dataset for creative, full-grammar plots.


In [None]:
diamonds.head(3)

**Prompt 1 — Stats meet scales**: Create a histogram of `price` faceted by `cut`.  
Use `scale_x_log10()` to tame the skewed distribution.  
Does the log scale reveal differences that are hidden on the raw scale?


In [None]:
# Prompt 1


<details>
<summary>💡 One approach — click to peek</summary>
<br>

```python
(ggplot(diamonds, aes('price'))
 + geom_histogram(bins=40, fill='steelblue', color='white')
 + scale_x_log10()
 + facet_wrap('cut', ncol=2)
 + labs(title='Price Distribution by Cut (log scale)', x='Price (USD, log scale)')
 + theme_bw()
).draw()
```

</details>
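
Why a log scale tames skew in Prompt 1: on a log10 axis, equal ratios become equal distances. The toy numbers below are hypothetical prices chosen so the arithmetic is easy to follow.

```python
import numpy as np

prices = np.array([100, 1_000, 10_000])    # each price is 10x the previous one

raw_gaps = np.diff(prices)                 # spacing on a linear axis
log_gaps = np.diff(np.log10(prices))       # spacing on a log10 axis

print(raw_gaps)   # gaps grow with the values
print(log_gaps)   # gaps are constant: one unit per factor of 10
```

On the raw scale the expensive tail stretches the axis and squeezes most diamonds into the first few bins; on the log scale each factor of 10 gets the same width.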

**Prompt 2 — Layer craft**: Plot `carat` (x) vs `price` (y).  
- Color **points** by `cut` (layer-specific)  
- Add **one overall** smooth trend line (black)  
- Make points semi-transparent (alpha=0.1)  
- Apply your favorite built-in theme


In [None]:
# Prompt 2


<details>
<summary>💡 One approach — click to peek</summary>
<br>

```python
(ggplot(diamonds, aes('carat', 'price'))
 + geom_point(aes(color='cut'), alpha=0.1, size=0.5)
 + stat_smooth(color='black', size=1.5)
 + scale_color_brewer(type='qual', palette='Dark2')
 + labs(title='Carat vs Price with Overall Trend',
        x='Carat', y='Price (USD)', color='Cut')
 + theme_classic()
).draw()
```

</details>

**My interpretation:** *(double-click to edit)*

*What story does this plot tell?*


---
## Appendix — plotnine Quick Reference

```python
(ggplot(df, aes(x='col', y='col2', color='cat'))  # data + aesthetics
 + geom_point(alpha=0.5)            # geometry
 + stat_smooth()                    # statistical transformation
 + facet_wrap('col')                # facets
 + scale_x_log10()                  # scale
 + theme_bw()                       # theme preset
 + theme(axis_text_x = element_text(angle=45, hjust=1))  # fine-tune
 + ggtitle('Title') + xlab('X') + ylab('Y')
).draw()
```

**Common stats:**  
`stat_smooth()` · `stat_bin(bins=N)` · `stat_count()` · `geom_bar(stat='summary', fun_y=np.mean)`

**Built-in themes:**  
`theme_gray()` · `theme_bw()` · `theme_classic()` · `theme_minimal()` · `theme_void()`

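**Color palettes:**  
`scale_color_brewer(type='qual', palette='Dark2')` · `scale_color_brewer(type='qual', palette='Set1')` (qualitative; check colorblind safety before use, since `Set1` pairs red with green)  
`scale_fill_brewer(...)` · `scale_color_manual(values=['#E69F00','#56B4E9',...])` (hex codes from the colorblind-friendly Okabe-Ito palette)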
