# Principles of Data Visualization
If you want to type along with me, use [this notebook](https://humboldt.cloudbank.2i2c.cloud/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Fbethanyj0%2Fdata271_sp25&branch=main&urlpath=tree%2Fdata271_sp25%2Flectures%2Fdata271_lec19_live.ipynb) instead. 
If you don't want to type and want to follow along just by executing the cells, stay in this notebook.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from plotnine import *
from plotnine.data import *
import warnings 
warnings.filterwarnings('ignore') 

**NOTE:** If you get errors when you run the cell above, go to the terminal and run the following commands:
```bash
pip install plotnine
pip install matplotlib==3.8.3
```

Then come back to this notebook and try again. (You might have to restart your kernel.)

## Statistical transformations (stat)

In [None]:
# add statistical transformations
(ggplot(midwest,aes('percollege','percprof',color = 'state'))
+geom_point()
+facet_wrap('state')
+stat_smooth()).draw()

In [None]:
# Using stats with bars (continuous)
(ggplot(midwest,aes('percollege'))
 +geom_bar()
 +stat_bin(bins = 20)).draw()

In [None]:
# Using stats with bars (discrete)
(ggplot(midwest,aes('state'))
 +geom_bar()
 +stat_count()).draw()

In [None]:
# Since stats get paired with specific geoms, can place them together
(ggplot(midwest,aes('percollege'))
+stat_bin(geom = 'bar',bins = 20)).draw()

In [None]:
# Using stats with bars
(ggplot(midwest,aes(x='state',y='poptotal'))
 + geom_bar(stat='summary', fun_y=np.mean)).draw()

In [None]:
# check what the above plot is doing 
midwest.groupby('state')['poptotal'].mean()

## Layer-specific mappings

In [None]:
# Use different aesthetics for different parts of graphic
(ggplot(midwest,aes('percollege','percprof', color = 'state'))
+geom_point()
+stat_smooth()).draw()

In [None]:
# Map color only in the point layer, so stat_smooth fits one curve to all points
(ggplot(midwest,aes('percollege','percprof'))
+geom_point(aes(color = 'state'))
+stat_smooth()).draw()

In [None]:
# Use different aesthetics for different parts of graphic
(ggplot(midwest,aes('percollege','percprof', color = 'state'))
+geom_point()
+facet_wrap('state')
+stat_smooth()).draw()

In [None]:
# Use different aesthetics for different parts of graphic
(ggplot(midwest,aes('percollege','percprof'))
+geom_point(aes(color = 'state'))
+facet_wrap('state')
+stat_smooth()).draw()

## Themes
There are several things we can adjust about the figure that don't fall into the specific grammar of graphics components. These often fall into the `theme` category. For example, the graph below shows how we can adjust the angle of the x-tick labels and the font sizes of the axis titles and plot title. See the plotnine `theme` documentation for other options.

In [None]:
(ggplot(midwest, aes(x='percollege'))
+geom_histogram()
+ggtitle('Distribution of College Graduates')
+theme(axis_text_x  = element_text(angle = 45, hjust = 1),
      axis_title_x = element_text(size = 18),
      axis_title_y = element_text(size = 18),
      plot_title = element_text(size = 20))).draw()

## Activity

The `plotnine` module includes a dataset called `diamonds`, which contains the prices and other attributes of almost 54,000 diamonds.

In [None]:
diamonds.head()

1. Use plotnine to recreate this figure.

![Diamonds Plot](diamonds_plot.png)

In [None]:
# Your code here


2. According to the principles of data visualization, what is wrong with the graph below? Adjust the ggplot so that it aligns with the principles of data visualization.  

In [None]:
(ggplot(diamonds, aes(x='x',y='y'))
       +geom_line()).draw()

3. According to the principles of data visualization, what is wrong with the graph below? Adjust the matplotlib graph, or create a ggplot so that it aligns with the principles of data visualization.  

In [None]:
ideal = diamonds[diamonds.cut == 'Ideal']
prem = diamonds[diamonds.cut == 'Premium']
good = diamonds[diamonds.cut == 'Good']
vgood = diamonds[diamonds.cut == 'Very Good']
fair = diamonds[diamonds.cut == 'Fair']

plt.plot('carat','price','r.',data = ideal)
plt.plot('carat','price','m.',data = prem)
plt.plot('carat','price','y.',data = good)
plt.plot('carat','price','w.',data = vgood)
plt.plot('carat','price','k.',data = fair)
plt.show()

4. According to the principles of data visualization, what is wrong with the graph below? Adjust the ggplot so that it aligns with the principles of data visualization.  

In [None]:
avg_price = diamonds.groupby('clarity').price.mean().reset_index()
(ggplot(avg_price,aes(x='clarity',y='price',fill = 'clarity')) 
 + geom_bar(stat='identity',color='r')
 + geom_text(label=avg_price.clarity)
 + theme_classic()).draw()