# Principles of Data Visualization

A plot is only as good as the design decisions behind it. Today we finish the Grammar of Graphics (`stat`, layer mappings, themes) and then apply *principles of data visualization* to critique and improve real plots.

**Sections:**
1. Setup
2. Statistical Transformations (`stat`)
3. Layer-Specific Mappings
4. Themes
5. Principles of Data Visualization
6. Activity — Critique & Improve

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from plotnine import *
from plotnine.data import *
import warnings 
warnings.filterwarnings('ignore') 

**NOTE:** If the cell above raises errors, run the following commands in a terminal:
```bash
pip install plotnine
pip install matplotlib==3.8.3
```

Then come back to this notebook and try again. (You might have to restart your kernel.)

---
## 2. Statistical Transformations (`stat`)

`stat_*` functions transform your raw data before plotting — computing summaries like bin counts, smoothing lines, or group means.

| `stat` function | What it computes | Paired geom |
|---|---|---|
| `stat_smooth()` | Smoothing curve (LOESS/linear) | `geom_smooth()`, usually layered over `geom_point()` |
| `stat_bin(bins=n)` | Bin counts for a continuous variable | `geom_histogram()` / `geom_bar()` |
| `stat_count()` | Count rows per category | `geom_bar()` |
| `geom_bar(stat='summary', fun_y=fn)` | Apply any aggregate function | — |
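
To make the table concrete, the sketch below reproduces by hand roughly what `stat_count()`, `stat_bin()`, and a `'summary'` stat compute, using only pandas and NumPy. The `toy` data frame and its values are made up for illustration, and `np.histogram` is only a stand-in: plotnine chooses its own bin edges, so its counts can differ for the same number of bins.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not the midwest dataset)
toy = pd.DataFrame({
    "state": ["IL", "IL", "IN", "OH", "OH", "OH"],
    "percollege": [10.0, 12.0, 18.0, 25.0, 27.0, 40.0],
})

# Like stat_count(): number of rows per category
count_per_state = toy["state"].value_counts().sort_index()

# Like stat_bin(bins=3): counts of a continuous variable in equal-width bins
bin_counts, bin_edges = np.histogram(toy["percollege"], bins=3)

# Like stat='summary' with fun_y=np.mean: one aggregate value per category
mean_per_state = toy.groupby("state")["percollege"].mean()

print(count_per_state.to_dict())
print(bin_counts.tolist(), bin_edges.tolist())
print(mean_per_state.to_dict())
```

In each case the stat turns the raw rows into a smaller summary table, and the geom then draws that summary rather than the original data.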

In [None]:
# add statistical transformations
(ggplot(midwest,aes('percollege','percprof',color = 'state'))
+geom_point()
+facet_wrap('state')
+stat_smooth()).draw()

In [None]:
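# Using stats with bars (continuous): bin percollege into 20 bins
# Pass the stat to geom_bar; adding stat_bin() as a separate term would draw a second layer
(ggplot(midwest,aes('percollege'))
 +geom_bar(stat = 'bin', bins = 20)).draw()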

In [None]:
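# Using stats with bars (discrete): count rows per state
# 'count' is geom_bar's default stat; adding stat_count() as well would draw the bars twice
(ggplot(midwest,aes('state'))
 +geom_bar(stat = 'count')).draw()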

In [None]:
# Stats are paired with specific geoms, so both can be specified in a single call
(ggplot(midwest,aes('percollege'))
+stat_bin(geom = 'bar',bins = 20)).draw()

In [None]:
# Using stats with bars (summary): mean poptotal per state
(ggplot(midwest,aes(x='state',y='poptotal'))
 + geom_bar(stat='summary', fun_y=np.mean)).draw()

In [None]:
# check what the above plot is doing 
midwest.groupby('state')['poptotal'].mean()

### ✏️ Check-in 1 — Statistical Transformations

Using the `midwest` dataset:

1. Use `stat_smooth()` together with `geom_point()` to plot `percollege` vs `percbelowpoverty` with a smoothing line. Color points by `state`.
2. Make a bar chart of **mean** `percbelowpoverty` per state using `geom_bar(stat='summary', fun_y=np.mean)`.

In [None]:
# 1. Scatter + smooth line, colored by state


In [None]:
# 2. Bar chart of mean percbelowpoverty per state


#### Answer

In [None]:
(ggplot(midwest, aes('percollege','percbelowpoverty', color='state'))
+geom_point()
+stat_smooth()).draw()

In [None]:
(ggplot(midwest, aes(x='state', y='percbelowpoverty'))
+geom_bar(stat='summary', fun_y=np.mean)).draw()

---
## 3. Layer-Specific Mappings

Mappings inside `aes()` at the top level apply to **all layers**. Mappings inside a specific `geom_*(aes(...))` apply to **that layer only** — useful when you want a smooth line for all data but colored points per group.
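
The effect of where `color` is mapped can be sketched numerically. A top-level mapping makes every layer work per group, while a point-only mapping leaves the smoother with the pooled data. The sketch below uses a straight-line fit from `np.polyfit` as a simple stand-in for the smoother (an assumption; `stat_smooth()` may use a different method), and the `toy` data frame and the helper `fit_slope` are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two groups with different slopes
toy = pd.DataFrame({
    "group": ["A"] * 4 + ["B"] * 4,
    "x": [0.0, 1.0, 2.0, 3.0, 0.0, 1.0, 2.0, 3.0],
    "y": [0.0, 1.0, 2.0, 3.0, 0.0, 3.0, 6.0, 9.0],
})

def fit_slope(df):
    """Slope of a least-squares line through the (x, y) points of df."""
    slope, _intercept = np.polyfit(df["x"], df["y"], deg=1)
    return float(slope)

# Color mapped on the points only: the smoother sees all rows, so there is one fit
pooled_slope = fit_slope(toy)

# Color mapped at the top level: the smoother is grouped, so there is one fit per group
slopes_by_group = {name: fit_slope(part) for name, part in toy.groupby("group")}

print(round(pooled_slope, 3))
print({name: round(slope, 3) for name, slope in slopes_by_group.items()})
```

The pooled fit averages over both groups and hides the fact that they follow different trends, which is the trade-off to weigh when choosing between a single line and one line per group.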

In [None]:
# Color mapped at the top level: applies to the points and the smooth line (one line per state)
(ggplot(midwest,aes('percollege','percprof', color = 'state'))
+geom_point()
+stat_smooth()).draw()

In [None]:
# Color mapped in geom_point only: colored points, one overall smooth line
(ggplot(midwest,aes('percollege','percprof'))
+geom_point(aes(color = 'state'))
+stat_smooth()).draw()

In [None]:
# Top-level color with facets: points and smooth line share the state color in each panel
(ggplot(midwest,aes('percollege','percprof', color = 'state'))
+geom_point()
+facet_wrap('state')
+stat_smooth()).draw()

In [None]:
# Color on the points only, with facets: each panel still gets its own smooth line, in the default color
(ggplot(midwest,aes('percollege','percprof'))
+geom_point(aes(color = 'state'))
+facet_wrap('state')
+stat_smooth()).draw()

### ✏️ Check-in 2 — Layer Mappings

1. Create a plot of `percollege` vs `percprof` where:
   - **Points** are colored by `state` (mapping in `geom_point`)
   - A **single** overall smooth line is drawn (no color grouping on `stat_smooth`)
2. Now flip it: apply `color='state'` to the **top-level** `aes()` so both the points and the smooth line are colored per state. What is the difference?

In [None]:
# 1. Color on geom_point only, one smooth line


In [None]:
# 2. Color on top-level aes, smooth lines per state


#### Hint

For Q1: put `aes(color='state')` **inside** `geom_point()`, not in the top-level `ggplot()` call.

For Q2: move `color='state'` into the top-level `aes()`.

#### Answer

In [None]:
(ggplot(midwest, aes('percollege','percprof'))
+geom_point(aes(color='state'))
+stat_smooth()).draw()

In [None]:
(ggplot(midwest, aes('percollege','percprof', color='state'))
+geom_point()
+stat_smooth()).draw()

---
## 4. Themes

Themes control the non-data visual elements: font sizes, axis text angles, background, etc. They don't change *what* is plotted — only *how it looks*.

| Argument | Controls |
|---|---|
| `axis_text_x=element_text(angle=45, hjust=1)` | Rotate x-axis tick labels |
| `axis_title_x=element_text(size=18)` | X-axis label font size |
| `plot_title=element_text(size=20)` | Title font size |
| `figure_size=(w, h)` | Overall figure dimensions |

In [None]:
(ggplot(midwest, aes(x='percollege'))
+geom_histogram()
+ggtitle('Distribution of College Graduates')
+theme(axis_text_x  = element_text(angle = 45, hjust = 1),
      axis_title_x = element_text(size = 18),
      axis_title_y = element_text(size = 18),
      plot_title = element_text(size = 20))).draw()

---
## 5. Principles of Data Visualization

Good visualizations are **truthful**, **functional**, **beautiful**, and **insightful**. Some common pitfalls to watch for:

| Principle | What to avoid |
|---|---|
| Represent data accurately | Truncated y-axes, 3D effects that distort perception |
| Choose the right chart type | Line plots for unordered categories, pie charts with many slices |
| Avoid overplotting | Too many overlapping points — use `alpha`, facets, or jitter |
| Use color purposefully | Redundant color that adds no information; non-colorblind-friendly palettes |
| Minimize chartjunk | Unnecessary gridlines, borders, decorations that don't encode data |
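
The first row can be checked with simple arithmetic: when bars do not start at zero, the ratio of their drawn heights no longer matches the ratio of the values. The numbers and the helper name `drawn_height_ratio` below are made up for illustration.

```python
def drawn_height_ratio(a, b, baseline=0.0):
    """Ratio of the drawn bar heights for values b and a above the axis baseline."""
    return (b - baseline) / (a - baseline)

value_a, value_b = 50.0, 55.0  # value_b is 10% larger than value_a

# Axis starting at zero: drawn heights are proportional to the values
honest = drawn_height_ratio(value_a, value_b, baseline=0.0)

# Axis truncated at 48: only the part above 48 is drawn
truncated = drawn_height_ratio(value_a, value_b, baseline=48.0)

print(f"axis from 0:  bar b is {honest:.2f}x as tall as bar a")
print(f"axis from 48: bar b is {truncated:.2f}x as tall as bar a")
```

A 10% difference in the data is drawn as a bar 3.5 times as tall once the axis starts at 48, which is why truncated axes on bar charts tend to mislead.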

---
## 6. Activity — Critique & Improve

The `plotnine` module includes a `diamonds` dataset — prices and attributes of ~54,000 diamonds.

In [None]:
diamonds.head()

1. Use plotnine to recreate this figure.

![Diamonds Plot](diamonds_plot.png)

In [None]:
# Your code here


2. According to the principles of data visualization, what is wrong with the graph below? Adjust the ggplot so that it aligns with the principles of data visualization.  

In [None]:
(ggplot(diamonds, aes(x='x',y='y'))
       +geom_line()).draw()

3. According to the principles of data visualization, what is wrong with the graph below? Adjust the matplotlib graph, or create a ggplot so that it aligns with the principles of data visualization.  

In [None]:
ideal = diamonds[diamonds.cut == 'Ideal']
prem = diamonds[diamonds.cut == 'Premium']
good = diamonds[diamonds.cut == 'Good']
vgood = diamonds[diamonds.cut == 'Very Good']
fair = diamonds[diamonds.cut == 'Fair']

plt.plot('carat','price','r.',data = ideal)
plt.plot('carat','price','m.',data = prem)
plt.plot('carat','price','y.',data = good)
plt.plot('carat','price','w.',data = vgood)
plt.plot('carat','price','k.',data = fair)
plt.show()

4. According to the principles of data visualization, what is wrong with the graph below? Adjust the ggplot so that it aligns with the principles of data visualization.  

In [None]:
avg_price = diamonds.groupby('clarity').price.mean().reset_index()
(ggplot(avg_price,aes(x='clarity',y='price',fill = 'clarity')) 
 + geom_bar(stat='identity',color='r')
 + geom_text(label=avg_price.clarity)
 + theme_classic()).draw()