# Grammar of Graphics with plotnine

We have been building plotting skills with Matplotlib. Today we switch to **plotnine**, a Python library that implements the *Grammar of Graphics* — the same system behind R's `ggplot2`. The grammar gives you a principled, layered vocabulary for building any plot.

**Sections:**
1. Setup & Data
2. Basics — `ggplot` + `aes` + `geom`
3. Aesthetic Mappings (`aes`)
4. Scales
5. Geometric Objects (`geom`)
6. Facets
7. Statistical Transformations (`stat`)
8. Layer-Specific Mappings
9. Themes
10. Activity

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from plotnine import *
from plotnine.data import *
import warnings 
warnings.filterwarnings('ignore') 

**NOTE:** If the cell above raises an error, open a terminal and run the following:
```bash
pip install plotnine
pip install matplotlib==3.8.3
```

Then come back to this notebook and try again. (You might have to restart your kernel.)

In [None]:
df = midwest
df

---
## 2. Basics — `ggplot` + `aes` + `geom`

Every plotnine figure starts with three things:
- **`ggplot(data, aes(...))`** — binds a DataFrame and declares which columns map to which visual channels
- **`+ geom_*()`** — adds a geometric layer (points, bars, lines, …)
- **`.draw()`** — renders the figure (like `plt.show()`)

| Component | Purpose | Example |
|---|---|---|
| `ggplot(df, aes(...))` | Bind data + declare mappings | `ggplot(df, aes(x='col1', y='col2'))` |
| `geom_point()` | Scatter plot layer | `+ geom_point()` |
| `geom_bar()` | Bar chart layer | `+ geom_bar()` |
| `geom_histogram()` | Histogram layer | `+ geom_histogram(bins=20)` |
| `.draw()` | Render the figure | `.draw()` |

In [None]:
# Visualize the relationship between the percent who went to college, and the percent who got a professional degree
# With matplotlib
plt.scatter('percollege','percprof',data = df)
plt.xlabel('percollege')
plt.ylabel('percprof')
plt.show()

The full plotnine reference documentation, including examples, is [here](https://plotnine.org/reference/).

In [None]:
# Do the same with a plotnine ggplot
(ggplot(df, aes(x='percollege',y='percprof'))
+geom_point())

In [None]:
# to make the output pretty, use .draw() (kind of like .show() in matplotlib)
(ggplot(df, aes(x='percollege',y='percprof'))
+geom_point()).draw()

### ✏️ Check-in 1 — ggplot Basics

Using the `df` (midwest) DataFrame:

1. Write a plotnine scatter plot of `percollege` (x) vs `percbelowpoverty` (y). Remember to call `.draw()`.
2. In a separate plot, use `geom_histogram()` to show the distribution of `percbelowpoverty` with 15 bins.

In [None]:
# 1. Scatter: percollege vs percbelowpoverty


In [None]:
# 2. Histogram of percbelowpoverty


#### Answer

In [None]:
(ggplot(df, aes(x='percollege', y='percbelowpoverty'))
+geom_point()).draw()

In [None]:
(ggplot(df, aes(x='percbelowpoverty'))
+geom_histogram(bins=15)).draw()

---
## 3. Aesthetic Mappings (`aes`)
Does the relationship vary by state? Map a variable to `color` inside `aes()` — plotnine handles the legend automatically.

In [None]:
# To do this with matplotlib, first list the states in the data
df.state.unique()

In [None]:
# Separate our data
IL = df[df.state == 'IL']
IN = df[df.state == 'IN']
MI = df[df.state == 'MI']
OH = df[df.state == 'OH']
WI = df[df.state == 'WI']

In [None]:
# Create our scatter plots
plt.scatter('percollege','percprof',data = IL,label='IL')
plt.scatter('percollege','percprof',data = IN,label='IN')
plt.scatter('percollege','percprof',data = MI,label='MI')
plt.scatter('percollege','percprof',data = OH,label='OH')
plt.scatter('percollege','percprof',data = WI,label='WI')
plt.xlabel('percollege')
plt.ylabel('percprof')
plt.legend()
plt.show()

In [None]:
# do the same thing with a plotnine ggplot
(ggplot(df, aes('percollege','percprof',color = 'state'))
+geom_point()).draw()

---
## 4. Scales
Maybe we don't like the default color scale. Scales let you control how data values map to visual properties (colors, sizes, axis ranges).

| Scale function | What it controls |
|---|---|
| `scale_color_cmap_d('Set1')` | Discrete color palette from matplotlib |
| `scale_color_manual([...])` | Fully custom color list |
| `scale_x_continuous(limits=(a,b))` | X-axis range |
| `scale_y_continuous(limits=(a,b))` | Y-axis range |

In [None]:
# Change the color scale to a matplotlib colormap
(ggplot(df, aes('percollege','percprof',color = 'state'))
+geom_point()
+scale_color_cmap_d('Set1')).draw() 

In [None]:
# Or you can change the color scale manually
(ggplot(df, aes('percollege','percprof',color = 'state'))
+geom_point()
+scale_color_manual(['r','b','g','m','c'])).draw()

In [None]:
# Other scales can be changed too
(ggplot(df, aes('percollege','percprof',color = 'state'))
+geom_point()
+scale_color_cmap_d('Set1')
+scale_x_continuous(limits = (0,60))
+scale_y_continuous(limits = (-10,30))
).draw()

### ✏️ Check-in 2 — Aesthetic Mappings & Scales

1. Make a scatter plot of `percollege` vs `percprof`, mapping `state` to **both** `color` and `shape`.
2. Change the color palette to `'tab10'` using `scale_color_cmap_d()`.
3. Limit the x-axis to `(10, 50)` using `scale_x_continuous()`.

In [None]:
# 1. Color + shape mapped to state


In [None]:
# 2 & 3. Add scale_color_cmap_d and scale_x_continuous


#### Hint

Chain each layer with `+`. For shape, try `aes(color='state', shape='state')` inside `geom_point()`.

#### Answer

In [None]:
(ggplot(df, aes('percollege','percprof'))
+geom_point(aes(color='state', shape='state'))).draw()

In [None]:
(ggplot(df, aes('percollege','percprof'))
+geom_point(aes(color='state', shape='state'))
+scale_color_cmap_d('tab10')
+scale_x_continuous(limits=(10,50))).draw()

---
## 5. Geometric Objects (`geom`)
What if we wanted to visualize the number of counties in each state? We swap in a different `geom`.

In [None]:
# with matplotlib 
counties_per_state = df.state.value_counts()
plt.bar(counties_per_state.index, counties_per_state.values)
plt.show()

In [None]:
# with ggplot
(ggplot(df, aes(x='state'))
+geom_bar()).draw()

In [None]:
# geom_bar uses the "count" statistical transformation by default; here it is spelled out
(ggplot(df, aes(x='state'))
+geom_bar(stat='count')).draw()

In [None]:
# To adjust the order of the bars, we adjust the x-axis scale
(ggplot(df, aes(x='state'))
+geom_bar()
+scale_x_discrete(limits = list(df.state.value_counts().index))).draw()

In [None]:
# Histograms (use geom_bar with a binning statistical transformation)
(ggplot(df, aes(x='percollege'))
+geom_bar(stat='bin', bins=20)).draw()

In [None]:
# Histograms (another way)
(ggplot(df, aes(x='percollege'))
+geom_histogram(bins=20)).draw()

---
## 6. Facets
In the plots above, points occasionally fell on top of each other. Facets split the visualization into a grid of small multiples — one panel per group.

| Function | Description |
|---|---|
| `facet_wrap('col')` | Wrap panels into rows/columns automatically |
| `facet_wrap('col', nrow=1)` | Force all panels into one row |
| `facet_grid(('row_col','col_col'))` | Explicit row × column grid |

In [None]:
# with matplotlib
fig, ax = plt.subplots(1,5,figsize = (12,3))
ax[0].scatter('percollege','percprof',data = IL)
ax[0].set_title('IL')
ax[1].scatter('percollege','percprof',data = IN)
ax[1].set_title('IN')
ax[2].scatter('percollege','percprof',data = MI)
ax[2].set_title('MI')
ax[3].scatter('percollege','percprof',data = OH)
ax[3].set_title('OH')
ax[4].scatter('percollege','percprof',data = WI)
ax[4].set_title('WI')
ax[0].set_xlabel('percollege')
ax[0].set_ylabel('percprof')
plt.show()

In [None]:
# with a plotnine ggplot
(ggplot(df,aes('percollege','percprof',color = 'state'))
+geom_point()
+facet_wrap('state')).draw()

In [None]:
# adjust the number of rows in the facet layout
(ggplot(df,aes('percollege','percprof',color = 'state'))
+geom_point()
+facet_wrap('state',nrow=1)).draw()

In [None]:
# change figure size
(ggplot(df,aes('percollege','percprof',color = 'state'))
+geom_point()
+facet_wrap('state',nrow=1)
+theme(figure_size=(10,3))).draw()

In [None]:
# facet by more than one variable
(ggplot(df,aes('percollege','percprof',color = 'state'))
+geom_point()
+facet_grid(('category','state'))
+theme(figure_size=(8,16))).draw()

### ✏️ Check-in 3 — Facets

1. Make a faceted scatter plot of `percollege` vs `percprof`, one panel per `state`, arranged in **2 rows**. Color points by state.
2. Set the figure size to `(10, 6)` using `theme(figure_size=(10,6))`.

In [None]:
# 1. Faceted scatter, 2 rows


In [None]:
# 2. Add figure size theme


#### Answer

In [None]:
(ggplot(df, aes('percollege','percprof', color='state'))
+geom_point()
+facet_wrap('state', nrow=2)).draw()

In [None]:
(ggplot(df, aes('percollege','percprof', color='state'))
+geom_point()
+facet_wrap('state', nrow=2)
+theme(figure_size=(10,6))).draw()

---
## 7. Statistical Transformations (`stat`)
`stat_*` functions compute a summary (smoothing line, bin counts, group means) and overlay it on the plot.

In [None]:
# add statistical transformations
(ggplot(df,aes('percollege','percprof',color = 'state'))
+geom_point()
+facet_wrap('state')
+stat_smooth()).draw()

In [None]:
# Using stats with bars
(ggplot(df,aes('percollege'))
+stat_bin(geom = 'bar',bins = 20)).draw()

In [None]:
# Using stat_summary to draw a bar at the mean of poptotal for each state
(ggplot(df,aes(x='state',y='poptotal'))
+stat_summary(geom="bar", fun_data="mean_se")).draw()

In [None]:
# check what the above plot is doing 
df.groupby('state')['poptotal'].mean()

---
## 8. Layer-Specific Mappings
You can apply aesthetic mappings to individual layers instead of the whole plot — useful when different geoms need different encodings.

In [None]:
# Use different aesthetics for different parts of graphic
(ggplot(df,aes('percollege','percprof'))
+geom_point(aes(color = 'state'))
+facet_wrap('state')
+stat_smooth()).draw()

---
## 9. Themes
Themes control non-data elements: font sizes, tick-label angles, background, etc. They don't change *what* is plotted, only *how it looks*.

In [None]:
(ggplot(df, aes(x='percollege'))
+geom_histogram()
+ ggtitle('Distribution of College Graduates')
+theme(axis_text_x  = element_text(angle = 45, hjust = 1),
      axis_title_x = element_text(size = 18),
      axis_title_y = element_text(size = 18),
      plot_title = element_text(size = 20))).draw()

---
## 10. Activity

The `plotnine` module includes a dataset called `diamonds` — prices and attributes of ~54,000 diamonds.

1. Create a grid of histograms showing the distribution of prices, faceted by `cut` and `clarity`. Adjust the number of bins as needed.

2. Plot the number of carats vs the price. Think about how you might gain additional insights with additional aesthetic mappings, faceting, statistical transformations, etc.