# Grammar of Graphics, continued

Last class you learned the core grammar: `data + aes + geom + facet + stat`. Today we complete the picture with **scales** and **themes**, practice **layer-specific mappings**, and do open-ended exploration on a real dataset.

**Today's sections:**
1. Setup & Warm-Up
2. Scales — control how data maps to visual values
3. More Geoms — expanding your toolkit
4. Stats — summaries on top of raw data
5. Layer-Specific Mappings — different rules for different layers
6. Themes — polishing the final look
7. Open Exploration — diamonds

> **Installation note:** If you get import errors:
> ```
> conda activate data271
> python -m pip install plotnine
> ```
> Restart the kernel and try again.

---
## 1. Setup & Warm-Up

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from plotnine import *
from plotnine.data import midwest, diamonds
import warnings
warnings.filterwarnings('ignore')

df = midwest

### EDA
1. Inspect dataset
2. Missing data
3. Data types
4. Summary statistics

In [None]:
df.sample(5)

In [None]:
df.info()
df.describe()

In [None]:
df.select_dtypes('number').columns

In [None]:
df.select_dtypes(object).columns

### Review: build a new column `poverty_z`

Can you add a new feature called `poverty_z`, the **z-score** of the `percbelowpoverty` column?

Recall the equation for the **z-score**, where $\bar{x}$ is the column mean and $\sigma$ is its standard deviation:

$$ z_i = \frac{x_i - \bar{x}}{\sigma}$$

We can use broadcasting to do math on entire columns at once! Here are the methods you will need:
- `df.col_name.mean()`
- `df.col_name.std()`

#### Answer

In [None]:
df['poverty_z'] = (df.percbelowpoverty - df.percbelowpoverty.mean()) / df.percbelowpoverty.std()

df[['county', 'state', 'percbelowpoverty', 'poverty_z']].head(8)
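
A quick way to check a z-score column is to confirm that its mean is about 0 and its standard deviation is about 1. The sketch below does this on a toy DataFrame (the values are made up and stand in for `df.percbelowpoverty`). Note that pandas' `.std()` computes the sample standard deviation (`ddof=1`).

```python
import pandas as pd

# Toy stand-in for df.percbelowpoverty; the values are made up for illustration
toy = pd.DataFrame({'percbelowpoverty': [5.0, 10.0, 12.5, 20.0, 30.0]})

col = toy['percbelowpoverty']
toy['poverty_z'] = (col - col.mean()) / col.std()   # .std() is the sample std (ddof=1)

z_mean = toy['poverty_z'].mean()   # should be ~0, up to floating-point error
z_std = toy['poverty_z'].std()     # should be ~1
print(f'mean={z_mean:.3f}, std={z_std:.3f}')
```

The same two checks work on the real `df['poverty_z']` column.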

---
## 2. Scales — Control the Mapping

### 📋 Board Reference

Scales control how data values translate to visual values (colors, sizes, axis ranges). Every aesthetic has a corresponding scale.

| Scale function | What it controls |
|---|---|
| `scale_color_cmap_d('Set1')` | Discrete color palette (from matplotlib) |
| `scale_color_cmap('Set1')` | Continuous color palette (from matplotlib) |
| `scale_color_manual(['r','b','g','m','c'])` | Fully custom color list |
| `scale_color_brewer(type='div', palette='RdBu')` | Discrete diverging palette |
| `scale_color_distiller(type='div', palette='RdBu')` | Continuous diverging palette |
| `scale_fill_cmap_d('Pastel1')` | Discrete fill palette |
| `scale_x_continuous(limits=(a,b))` | X-axis range |
| `scale_y_continuous(limits=(a,b))` | Y-axis range |
| `scale_x_log10()` | Log-transform the x-axis |

> **Browse palettes:** [matplotlib colormap reference](https://matplotlib.org/stable/users/explain/colors/colormaps.html#qualitative)

![image.png](attachment:cb3fdd0a-6b7c-4c48-9b09-1d21476ea4e3.png)
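
To make "scales translate data values to visual values" concrete, here is a conceptual sketch using matplotlib directly. It is not plotnine's internal code, and `value_to_rgb` is a hypothetical helper written for this illustration. The idea: rescale a value into the range [0, 1] using the limits, then look up the color at that position in the colormap.

```python
from matplotlib import colormaps
from matplotlib.colors import Normalize

cmap = colormaps['RdBu']             # diverging colormap: red -> near-white -> blue
norm = Normalize(vmin=-2, vmax=2)    # plays the role of limits=(-2, 2)

def value_to_rgb(value):
    """Map a data value to (r, g, b): rescale to [0, 1], then look up the colormap."""
    r, g, b, _alpha = cmap(norm(value))
    return (r, g, b)

# Low end, midpoint, and high end of the limits
low, mid, high = value_to_rgb(-2), value_to_rgb(0), value_to_rgb(2)
print(low, mid, high)
```

Here `low` comes out reddish, `mid` is close to white, and `high` is bluish, which is the behavior you want from a diverging palette centered on 0.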

In [None]:
# Default colors
(ggplot(df, aes(x='percollege', y='percprof', color='state'))
 + geom_point(alpha=0.6)
).draw()

In [None]:
# Change color palette + restrict axis ranges
(ggplot(df, aes(x='percollege', y='percprof', color='state'))
 + geom_point(alpha=0.6)
 + scale_color_cmap_d('Set1')
 + scale_x_continuous(limits=(5, 50))
 + scale_y_continuous(limits=(0, 25))
).draw()

### Diverging color scales

- continuous features: `scale_color_distiller(type='div', palette='RdBu')`
- discrete features: `scale_color_brewer(type='div', palette='RdBu')`

Now let's apply these color scales to our newly made feature, `poverty_z`. A diverging palette suits a z-score because the values spread in both directions around 0.

In [None]:
# Continuous color scale: map poverty_z through a matplotlib colormap
(ggplot(data = df)
 + aes(x='percollege', y='percbelowpoverty', color='poverty_z')
 + geom_point(size=2, alpha=0.6)
 + scale_color_cmap('Set1')
 + labs(title='Poverty Z-Score\ncmap = "Set1"',
        color='Poverty z-score')
)

In [None]:
# Diverging palette for poverty_z, with color limits symmetric around 0
(ggplot(df, aes(x='percollege', y='percbelowpoverty', color='poverty_z'))
 + geom_point(size=2, alpha=0.8)

 # limits=(a, b) sets the range of values the colors span
 #   direction=-1 reverses the palette order: red>blue becomes blue>red
 + scale_color_distiller(type='div', palette='RdBu', limits=(-2, 2), direction=-1)
 + labs(title='Poverty Z-Score \nscale_color_distiller(), RdBu diverging palette',
        color='Poverty z-score')
 + theme_gray()

 # change the theme so the white points stick out more
 #+ theme(panel_background=element_rect(fill='lightgray'))
).draw()
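
Why `limits=(-2, 2)`? Symmetric limits keep the palette's neutral midpoint at z = 0. The sketch below uses made-up data standing in for `df.poverty_z`. It computes a symmetric bound that covers every value, and the share of values that a fixed ±2 range leaves outside. In plotnine's default behavior, as we understand it, values outside the limits are treated as missing and drawn in the scale's NA color, so it is worth knowing how many points that affects.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
toy_z = pd.Series(rng.normal(size=500))   # made-up stand-in for df.poverty_z

# Smallest whole-number bound that covers every value, mirrored around 0
bound = float(np.ceil(toy_z.abs().max()))
symmetric_limits = (-bound, bound)

# Fraction of values that fall outside limits=(-2, 2)
share_outside_2 = float((toy_z.abs() > 2).mean())
print(symmetric_limits, round(share_outside_2, 3))
```

Passing `limits=symmetric_limits` keeps every point colored. The tighter `(-2, 2)` range gives more contrast in the middle of the distribution, at the cost of the extremes.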

---
### Try It 2: Scales

1. Make the `percollege` vs `percprof` scatter and try **three different color palettes**: `'Set2'`, `'Accent'`, and `'Dark2'`. Which do you prefer and why?

2. Now plot `poptotal` (x) vs `percbelowpoverty` (y). The x-axis will be heavily skewed. Add `scale_x_log10()` — what does a log scale reveal that a linear scale hides?

3. Use `scale_color_manual()` to set your own colors, one per state. Try to pick a colorblind-friendly set.

![image.png](https://matplotlib.org/stable/_images/sphx_glr_colormaps_006.png)
<img src="https://r-graph-gallery.com/38-rcolorbrewers-palettes_files/figure-html/thecode-1.png" alt="brewer" width="600"/>

If you want to play with diverging color palettes, [check it out here](https://r-graph-gallery.com/38-rcolorbrewers-palettes.html)

## [>drop ur code here!<](https://hackmd.io/@WcHbReNISVuZAGZxE0Nnrw/H1P5yqyKWg/edit)

In [None]:
# 1. Try three color palettes


In [None]:
# 2. poptotal vs percbelowpoverty + scale_x_log10()



In [None]:
# 3. scale_color_manual() with your own colors



<details>
<summary>💡 One approach — click to peek</summary>
<br>

*The Wong colorblind-safe palette: #E69F00, #56B4E9, #009E73, #F0E442, #0072B2*

```python
# 1.
from IPython.display import display
for palette in ['Set2', 'Accent', 'Dark2']:
    p = (ggplot(df, aes(x='percollege', y='percprof', color='state'))
         + geom_point(alpha=0.6)
         + scale_color_cmap_d(palette)
         + ggtitle(palette)
        ).draw()
    display(p)

# 2.
(ggplot(df, aes(x='poptotal', y='percbelowpoverty', color='state'))
 + geom_point(alpha=0.6)
 + scale_x_log10()
 + labs(x='poptotal (log scale)')
).draw()

# 3. Colorblind-friendly manual colors, one per state
(ggplot(df, aes(x='percollege', y='percprof', color='state'))
 + geom_point(alpha=0.7)
 + scale_color_manual(['#E69F00','#56B4E9','#009E73','#F0E442','#0072B2'])
).draw()
```

</details>

There is much more you can explore with color!
![colors](https://plotnine.org/guide/aesthetic-specification_files/figure-html/cell-3-output-1.png)

---
## 3. More Geoms!

| `geom_*` | Needs in `aes()` | What you get |
|---|---|---|
| `geom_bar()` | `x=` | Bar chart — **auto counts** rows |
| `geom_col()` | `x=`, `y=` | Bar chart — **you supply** the heights |
| `geom_histogram()` | `x=` | Histogram (bins a continuous variable) |
| `geom_boxplot()` | `x=`, `y=` | Box-and-whisker per group |
| `geom_violin()` | `x=`, `y=` | Distribution shape per group |
| `geom_density()` | `x=` | Smooth density curve |

**When to use `geom_bar` vs `geom_col`:**
- Raw data (one row per observation) → `geom_bar()` (it counts for you)
- Already-aggregated data (one row per group) → `geom_col()`

In [None]:
# geom_bar — counts rows automatically
print(df.state.value_counts())

(ggplot(df, aes(x='state'))
 + geom_bar(fill='steelblue')
 + labs(title='geom_bar(): automatically count')
).draw()

In [None]:
# geom_col — when you already have the summary values

# calculate the mean percentages per state
#   we use .reset_index() to get a column named `state`
#   rather than the index being named `state`
mean_college = df.groupby('state')['percollege'].mean().reset_index()   # < this is a DataFrame!
mean_college.columns = ['state', 'mean_percollege']
print(mean_college)


(ggplot(mean_college, aes(x='state', y='mean_percollege', fill='state'))
 + geom_col()
 + labs(title='geom_col(): calculated summary values')
).draw()

In [None]:
# geom_density — smooth distribution curve, good for overlapping groups

(ggplot(df, aes(x='percollege', color='state', fill='state'))
 + geom_density(alpha=0.3)
 + labs(title='geom_density()')
).draw()

---
### Try it 3: Geoms!

1. Compute the mean `percbelowpoverty` per state (use `groupby`), then plot it as a `geom_col()`. Map `state` to `fill`. Sort the bars from highest to lowest poverty using `scale_x_discrete(limits=[...])` with your sorted state list.

2. Overlay `geom_boxplot()` AND `geom_jitter()` on the same plot: `percollege` by `state`. The raw points show individual counties; the box shows the summary. Set `alpha=0.3` on the jitter so it doesn't overwhelm.

3. Use `geom_density()` to compare the distribution of `percollege` across states. Which state has the most unusual distribution shape?

## [>drop ur code here!<](https://hackmd.io/@WcHbReNISVuZAGZxE0Nnrw/H1P5yqyKWg/edit)


In [None]:
# 1. geom_col — mean percbelowpoverty per state, sorted


In [None]:
# 2. geom_boxplot + geom_jitter layered together
# y=`percollege` by x=`state`



In [None]:
# 3. geom_density of percollege by state


<details>
<summary>💡 One approach — click to peek</summary>
<br>

*Layering geom_boxplot + geom_jitter together is a great EDA trick — you see both the summary and the raw data at once.*

```python
# 1.
poverty_mean = df.groupby('state')['percbelowpoverty'].mean().reset_index()
poverty_mean.columns = ['state','mean_poverty']
state_order = poverty_mean.sort_values('mean_poverty', ascending=False)['state'].tolist()

(ggplot(poverty_mean, aes(x='state', y='mean_poverty', fill='state'))
 + geom_col()
 + scale_x_discrete(limits=state_order)
).draw()

# 2.
(ggplot(df, aes(x='state', y='percollege', fill='state'))
 + geom_boxplot()
 + geom_jitter(alpha=0.3, width=0.2)
).draw()

# 3.
(ggplot(df, aes(x='percollege', color='state', fill='state'))
 + geom_density(alpha=0.3)
).draw()
```

</details>

---
## 4. Stats: Summaries Overlaid on Raw Data

| `stat_*` | What it computes | Common use |
|---|---|---|
| `stat_smooth()` | LOESS smoothing curve + confidence band | Scatter plots |
| `stat_smooth(method='lm')` | Linear regression line | Scatter plots |
| `stat_bin(bins=n)` | Bin counts for continuous variable | With `geom_bar` |
| `stat_count()` | Row counts per category | Default for `geom_bar` |
| `geom_bar(stat='summary', fun_y=fn)` | Any aggregation per group | Bar charts |

In [None]:
# stat_smooth(method='lm'): one linear fit per group when color is mapped
(ggplot(df, aes(x='percollege', y='percprof', color='state'))
 + geom_point(alpha=0.4)
 + stat_smooth(method='lm', se=False)
).draw()
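
What does `method='lm'` actually draw? A least-squares straight line through the points. The sketch below reproduces that idea with `np.polyfit` on made-up data. It illustrates the concept and is not the code plotnine runs internally.

```python
import numpy as np

# Toy data, made up for illustration: y is roughly 2*x + 1 plus a little noise
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.0])

slope, intercept = np.polyfit(x, y, deg=1)   # least-squares straight line
y_hat = slope * x + intercept                # fitted values along the line
residuals = y - y_hat                        # vertical distances from the line
print(f'slope={slope:.2f}, intercept={intercept:.2f}')
```

When `color='state'` is mapped at the top level, this fit is repeated once per state, each time using only that state's rows.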

In [None]:
# geom_bar with stat='summary' — bars showing group means

# Verify: this is what the plot is computing
meanpoptotal_bystate = df.groupby('state')['poptotal'].mean()
print(meanpoptotal_bystate)

(ggplot(df, aes(x='state', y='poptotal'))
 + geom_bar(stat='summary', fun_y=np.mean)
).draw()

---
### Try it 4: Stat Layers

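1. Make a scatter of `perchsd` (% high school degree) vs `percollege`. Add both a `stat_smooth()` (LOESS) and a `stat_smooth(method='lm')` line. Color them differently by putting `color=` inside each stat call's `aes()`. Because `aes()` strings are evaluated as expressions, a constant label needs its own inner quotes, e.g. `aes(color='"LOESS"')`.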

2. Use `geom_bar(stat='summary', fun_y=np.mean)` to plot the **mean** `percollege` per state. Then add `geom_bar(stat='summary', fun_y=np.median)` in a second call — do mean and median tell different stories?

3. **Interpretation question:** when would you prefer `stat_smooth(method='lm')` over the default LOESS smooth? Write your answer in the cell below.

## [>drop ur code here!<](https://hackmd.io/@WcHbReNISVuZAGZxE0Nnrw/H1P5yqyKWg/edit)

In [None]:
# 1. perchsd vs percollege + two smooth lines


In [None]:
# 2. Bar chart of mean vs median percollege per state



**Your answer to Q3:** *(double-click to edit)*

<details>
<summary>💡 One approach — click to peek</summary>
<br>

```python
# 1. Two smooth lines with different colors
(ggplot(df, aes(x='perchsd', y='percollege'))
 + geom_point(alpha=0.3)
 # inner double quotes make the label a literal string rather than a column name
 + stat_smooth(aes(color='"LOESS"'), se=True)
 + stat_smooth(aes(color='"Linear"'), method='lm', se=False)
).draw()

# 2. Mean and median as separate plots; the titles tell them apart
(ggplot(df, aes(x='state', y='percollege'))
 + geom_bar(stat='summary', fun_y=np.mean, fill='steelblue', alpha=0.7)
 + ggtitle('Mean percollege by state')
).draw()

(ggplot(df, aes(x='state', y='percollege'))
 + geom_bar(stat='summary', fun_y=np.median, fill='coral', alpha=0.7)
 + ggtitle('Median percollege by state')
).draw()
```

</details>

---
## 5. Layer-Specific Mappings

Mappings in `aes()` at the **top level** apply to all layers. Mappings inside a specific `geom_*(aes(...))` apply **only to that layer**.

| Placement | Scope |
|---|---|
| `ggplot(df, aes(color='state'))` | All layers inherit `color='state'` |
| `geom_point(aes(color='state'))` | Only points get colored by state |

This lets you have, for example, colored points with a **single** overall smooth line:

In [None]:
# Top-level color → smooth line also splits by state (one line per state)
(ggplot(df, aes(x='percollege', y='percprof', color='state'))
 + geom_point(alpha=0.5)
 + stat_smooth(se=False)
).draw()

In [None]:
# Layer-specific color → points colored by state, ONE overall smooth line
(ggplot(df, aes(x='percollege', y='percprof'))
 + geom_point(aes(color='state'), alpha=0.5)
 + stat_smooth(se=False, color='black')
).draw()
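
The choice between one pooled line and one line per group can change the story, not just the look. The sketch below uses made-up data with two groups that trend in opposite directions. `lm_slope` is a hypothetical helper that returns the least-squares slope, standing in for what a linear smooth would fit.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration: two groups with opposite trends
toy = pd.DataFrame({
    'x':     [1, 2, 3, 4, 1, 2, 3, 4],
    'y':     [1, 2, 3, 4, 8, 7, 6, 5],
    'group': ['a'] * 4 + ['b'] * 4,
})

def lm_slope(d):
    """Slope of the least-squares line through d['x'] and d['y']."""
    return np.polyfit(d['x'], d['y'], deg=1)[0]

# One fit over all rows: like a smooth layer with no color mapping
pooled_slope = lm_slope(toy)

# One fit per group: like a smooth layer that inherits color='group'
group_slopes = {name: lm_slope(d) for name, d in toy.groupby('group')}
print(pooled_slope, group_slopes)
```

In this toy case the pooled line is flat while each group has a clear trend, so where you place the `color` mapping decides which of those patterns the reader sees.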

---
### Try it MORE 5: aes layer mappings

1. Make a scatter of `percollege` vs `percbelowpoverty` where:
   - Points are colored by **`state`** (layer-specific)
   - A **single** linear smooth line (no grouping, `color='black'`)

2. Now flip it: what happens if you move `color='state'` to the top-level `aes()`? How many smooth lines do you get? Is that more or less useful?

3. **Challenge:** can you make a plot with:
   - Jittered points of `percollege` by `state`, colored by `state`
   - Box plots of `percollege`, filled by `factor(inmetro)` (metro vs. rural)
   - Both on the same axes

   *Tip: one geom gets `aes(color='state')`, the other gets `aes(fill='factor(inmetro)')`*


## [>drop ur code here!<](https://hackmd.io/@WcHbReNISVuZAGZxE0Nnrw/H1P5yqyKWg/edit)

In [None]:
# 1. Points colored by state, one black smooth line


In [None]:
# 2. Same but color at top level — how many smooth lines?


In [None]:
# 3. Challenge: two geoms with different aesthetic mappings




<details>
<summary>💡 One approach — click to peek</summary>
<br>

*Layer-specific mappings are powerful — they let you show group patterns in one layer while keeping an overall summary in another.*

```python
# 1.
(ggplot(df, aes(x='percollege', y='percbelowpoverty'))
 + geom_point(aes(color='state'), alpha=0.5)
 + stat_smooth(method='lm', color='black', se=False)
).draw()

# 2. — five smooth lines, one per state
(ggplot(df, aes(x='percollege', y='percbelowpoverty', color='state'))
 + geom_point(alpha=0.5)
 + stat_smooth(method='lm', se=False)
).draw()

# 3. Challenge
(ggplot(df, aes(x='state', y='percollege'))
 + geom_boxplot(aes(fill='factor(inmetro)'), alpha=0.6)
 + geom_jitter(aes(color='state'), alpha=0.3, width=0.2)
).draw()
```

</details>

---
## 6. Themes — Polishing the Final Look

### 📋 Board Reference

Themes control non-data visual elements: fonts, backgrounds, grid lines, tick labels.

**Built-in theme presets:**

| Function | Style |
|---|---|
| `theme_gray()` | Default gray background |
| `theme_bw()` | Black & white, clean |
| `theme_classic()` | No grid, minimal |
| `theme_minimal()` | Minimal grid, no background |
| `theme_void()` | No axes, no grid — maps use this |

**Fine-grained control with `theme(...)`:**

| Argument | Controls |
|---|---|
| `figure_size=(w,h)` | Figure dimensions |
| `axis_text_x=element_text(angle=45, hjust=1)` | Rotate x tick labels |
| `axis_title_x=element_text(size=14)` | Axis label font size |
| `plot_title=element_text(size=16, face='bold')` | Title styling |
| `legend_position='bottom'` | Legend location |

In [None]:
# Built-in theme + fine-grained tweaks + ggtitle
(ggplot(df, aes(x='percollege'))
 + geom_histogram(bins=25, fill='steelblue', color='white')
 + ggtitle('Distribution of College-Educated Adults')
 + theme_bw()
 + theme(axis_text_x  = element_text(angle=45, hjust=1),
         axis_title_x = element_text(size=13),
         plot_title   = element_text(size=15))
).draw()

---
### Tryyy meee 6: Theme Your Plot

Take ANY plot you made earlier today and polish it as if you were putting it in a report:

1. Add a title with `ggtitle('...')`
2. Apply a built-in theme (try at least 2 and pick your favorite)
3. Adjust at least one `theme()` argument — font size, label angle, legend position, etc.
4. Change the color palette with `scale_color_cmap_d()` or `scale_fill_cmap_d()`

Share your best plot with the class!

## [>drop ur code here!<](https://hackmd.io/@WcHbReNISVuZAGZxE0Nnrw/H1P5yqyKWg/edit)

In [None]:
# Your polished plot here


<details>
<summary>💡 One approach — click to peek</summary>
<br>

*There's no single right answer here — focus on making your plot readable and visually clean.*

```python
# Example — polished version of the faceted scatter
(ggplot(df, aes(x='percollege', y='percprof', color='state'))
 + geom_point(alpha=0.5)
 + stat_smooth(method='lm', se=False)
 + facet_wrap('state', nrow=1)
 + scale_color_cmap_d('Set2')
 + ggtitle('College vs Professional Rate by Midwest State')
 + theme_bw()
 + theme(figure_size=(13, 4),
         plot_title=element_text(size=14),
         axis_text_x=element_text(angle=30, hjust=1))
).draw()
```

</details>

---
## 7. 🔬 Open Exploration — Diamonds

Use the full grammar on a richer dataset. The `diamonds` dataset has ~54,000 diamonds.

| Column | Description |
|---|---|
| `price` | Price in USD |
| `carat` | Weight |
| `cut` | Fair < Good < Very Good < Premium < Ideal |
| `color` | D (best) → J (worst) |
| `clarity` | I1 (worst) → IF (best) |
| `depth`, `table` | Physical proportions |
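
The `<` in the `cut` row matters: an ordered categorical column carries its ranking with it, and plotnine generally follows that category order on axes, legends, and facets. The sketch below shows the idea on a made-up column. We assume the real `diamonds` columns are already stored this way, which you can check with `diamonds['cut'].cat.categories`.

```python
import pandas as pd

# Toy column, made up for illustration, using the cut grades from the table above
cut_levels = ['Fair', 'Good', 'Very Good', 'Premium', 'Ideal']
toy = pd.DataFrame({'cut': ['Ideal', 'Fair', 'Premium', 'Good', 'Fair']})

# Declare the order explicitly; plain strings would sort alphabetically instead
toy['cut'] = pd.Categorical(toy['cut'], categories=cut_levels, ordered=True)

sorted_cuts = toy['cut'].sort_values().tolist()        # sorted by grade, not by letter
best_cut = toy['cut'].max()                            # highest grade present
at_least_premium = int((toy['cut'] >= 'Premium').sum())  # comparisons respect the order
print(sorted_cuts, best_cut, at_least_premium)
```

The same pattern is handy for your own data whenever a category has a natural order that differs from alphabetical.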

## [>drop ur code here!<](https://hackmd.io/@WcHbReNISVuZAGZxE0Nnrw/H1P5yqyKWg/edit)

In [None]:
diamonds.head(3)

**Prompt 1 — Distributions & Scales:**

Make a histogram of `price` faceted by `cut`. Then try `scale_x_log10()` on the x-axis — does the log scale reveal something the linear scale hides? Polish the plot with a theme and title.

In [None]:
# Prompt 1


<details>
<summary>💡 One approach — click to peek</summary>
<br>

```python
(ggplot(diamonds, aes(x='price', fill='cut'))
 + geom_histogram(bins=30, color='white')
 + scale_x_log10()
 + facet_wrap('cut')
 + ggtitle('Diamond Price Distribution by Cut (log scale)')
 + theme_minimal()
).draw()
```

</details>

**Prompt 2 — Layering:**

Plot `carat` (x) vs `price` (y). Use layer-specific mappings so that:
- Points are colored by `cut` (in `geom_point`'s `aes()`)
- A **single** linear smooth line runs through all the data

Then try the opposite: move `color='cut'` to the top-level `aes()`. How does the smooth line change?

In [None]:
# Prompt 2


<details>
<summary>💡 One approach — click to peek</summary>
<br>

```python
# Layer-specific color — one overall smooth line
(ggplot(diamonds, aes(x='carat', y='price'))
 + geom_point(aes(color='cut'), alpha=0.15)
 + stat_smooth(method='lm', color='black', se=True)
).draw()

# Top-level color — one smooth line per cut
(ggplot(diamonds, aes(x='carat', y='price', color='cut'))
 + geom_point(alpha=0.1)
 + stat_smooth(method='lm', se=False)
).draw()
```

</details>

**Prompt 3 — Full Grammar:**

Build one plot that uses **all 7 layers**:
data, aesthetics, geometry, facets, stats, scales, theme.

Write a 2-sentence interpretation below your plot: what story does it tell, and what would you investigate next?

In [None]:
# Prompt 3 — full grammar plot


**My interpretation:** *(double-click to edit)*

<details>
<summary>💡 One approach — click to peek</summary>
<br>

*Try your own story — the best plots come from your own curiosity!*

```python
# One example — there are many good answers
(ggplot(diamonds, aes(x='carat', y='price', color='cut'))
 + geom_point(alpha=0.15)                          # 3. geometry
 + stat_smooth(method='lm', se=False)              # 5. stats
 + facet_wrap('clarity', nrow=2)                   # 4. facets
 + scale_color_cmap_d('Set1')                      # 6. scales
 + ggtitle('Carat vs Price by Clarity & Cut')
 + theme_bw()                                      # 7. theme
 + theme(figure_size=(14, 7),
         plot_title=element_text(size=14))
).draw()
# Story: within each clarity grade, carat drives price up strongly.
# Better-clarity diamonds (VVS1, IF) command steeper price-per-carat premiums.
```

</details>

---
## Appendix — Full Grammar Quick Reference

```python
(ggplot(df, aes(x='col1', y='col2'))     # 1. data + 2. aesthetics
 + geom_point(alpha=0.5)                  # 3. geometry
 + facet_wrap('cat_col', nrow=2)          # 4. facets
 + stat_smooth(method='lm', se=False)     # 5. stats
 + scale_color_cmap_d('Set2')             # 6. scales
 + ggtitle('My Plot Title')               #    title
 + theme_bw()                             # 7. theme preset
 + theme(figure_size=(10, 5),             #    fine-grained theme
         axis_text_x=element_text(angle=45, hjust=1))
).draw()
```

### Layer-specific mapping pattern
```python
# Color applies only to points, not to the smooth line
(ggplot(df, aes(x='col1', y='col2'))
 + geom_point(aes(color='cat_col'), alpha=0.5)
 + stat_smooth(color='black', se=False)
).draw()
```