# Week 4 Discussion


## Making static plots

## Plotnine

We will use the plotnine package, an implementation of ggplot2 for Python. Unlike packages we've seen so far, plotnine is not included with Anaconda. To install the package:

* On Windows, run `conda install -c conda-forge plotnine` in an Anaconda Prompt (find it in the Start menu)
* On macOS or Linux, run `conda install -c conda-forge plotnine` in the Terminal

You may have to restart Jupyter after installing. 

In [1]:
import plotnine as p9

p9.__version__

'0.12.4'

Our focus right now is _static_ visualization, where the visualization is a still image. So what packages should you actually use?

* __plotnine__ is convenient if you already know ggplot2. It's relatively new, so there are some bugs and missing features.

* __seaborn__ is designed specifically for making statistical plots. It's well-documented and stable. Most of the package's functions expect tidy data as input.

* __matplotlib__ is useful to know, since many other packages use matplotlib under the hood. That said, using matplotlib alone to create plots is painful; matplotlib is _low-level_, so it's flexible but simple plots may take [5 lines of code or more][ex]. The matplotlib PyPlot tools may be convenient if you already know MATLAB.

* __pandas__ provides built-in plotting functions, which can be convenient but are more limited than the packages above. They're also inconsistent about the expected format of the data.

We don't have time to exhaustively cover visualization packages for Python. You're welcome to explore other packages while doing the assignments for this class.

Later in the quarter, we'll see some of Python's _interactive_ visualization packages.

[ex]: https://dsaber.com/2016/10/02/a-dramatic-tour-through-pythons-data-visualization-landscape-including-ggplot-and-altair/

See the [plotnine documentation](https://plotnine.readthedocs.io/en/latest/)! Also see the [ggplot2 documentation](https://ggplot2.tidyverse.org/reference/) and the [ggplot2 cheatsheet](https://www.maths.usyd.edu.au/u/UG/SM/STAT3022/r/current/Misc/data-visualization-2.1.pdf). If you run into a bug, you may want to check for a work-around on the [plotnine bug tracker](https://github.com/has2k1/plotnine/issues).

In [None]:
import numpy as np
import pandas as pd

milk = pd.read_excel("../data/fluidmilk.xlsx", skiprows = 1)
milk.columns = milk.columns.str.replace('\n', '')
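milk = milk.rename(columns=lambda name: name.strip(' 12'))
# Rename selected columns by position. Assigning a new list of names is
# safer than modifying `milk.columns.values` in place.
cols = milk.columns.tolist()
new_names = ['Year', 'Reduced', 'Low', 'Flavored Whole', 'Flavored Other']
for i, name in zip([0, 2, 3, 5, 6], new_names):
    cols[i] = name
milk.columns = cols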
milk = milk[:-4] # get rid of the last four rows
milk = milk.drop(columns = 'Total')

milk['Year'] = pd.to_numeric(milk['Year'])

milk = milk.set_index("Year") 

# Reshape from wide (one column per kind of milk) to long (one row per year and kind).
milk1 = milk.stack()
milk1 = milk1.reset_index()
milk1.columns = ["Year", "Kind", "Sales"]
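
To see what `stack()` and `reset_index()` do, here is a minimal sketch on a made-up two-row table. The numbers and the names `wide`, `tidy`, and `tidy_alt` are illustrative assumptions, not part of the milk data. It also shows `melt()`, which is another common way to get the same long format.

```python
import pandas as pd

# Hypothetical toy data in "wide" format: one column per kind of milk.
wide = pd.DataFrame(
    {"Whole": [10, 9], "Skim": [3, 4]},
    index=pd.Index([2000, 2001], name="Year"),
)

# stack() moves the column labels into an inner index level,
# giving one value per (Year, kind) pair; reset_index() turns
# both index levels back into ordinary columns.
tidy = wide.stack().reset_index()
tidy.columns = ["Year", "Kind", "Sales"]

# melt() reaches the same long format starting from ordinary columns.
tidy_alt = wide.reset_index().melt(
    id_vars="Year", var_name="Kind", value_name="Sales"
)
```

Both results have one row per year and kind, which is the layout that plotnine and seaborn expect.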

In [None]:
milk1.head(10)

In [None]:
milk2 = milk[['Whole', 'Reduced']]
milk2 = milk2.reset_index()
milk2.head()

The syntax of plotnine closely follows the syntax of R's ggplot2. In R, we would write

```r
ggplot(milk, aes(x = Year, y = Sales, color = Kind)) + geom_line() 
```

One important difference is that plotnine requires that we quote variable names.

In [None]:
(
    p9.ggplot(milk1, p9.aes(x = "Year", y = "Sales", color = "Kind")) 
    + p9.geom_line()
    + p9.labs(title = "US Milk Sales", y = "Sales (millions of pounds)")
)

In [None]:
(
    p9.ggplot(milk2, p9.aes(x = "Whole", y = "Reduced"))
    + p9.theme_classic() 
    + p9.geom_path(p9.aes(color = "Year", size = "Whole + Reduced"), linejoin = 'mitre')
    + p9.labs(title = "Whole vs. Reduced Milk Sales in the US")
)

`plotnine` includes the familiar `p9.ggsave()` function for saving a visualization to an image file. Plot objects also have a `.save()` method that does the same thing.

## Jupyter and matplotlib

Jupyter notebooks can display most static visualizations and some interactive visualizations. If you're going to use visualization packages that depend on matplotlib, it's a good idea to set up your notebook by running:

In [None]:
# Initialize matplotlib for jupyter: 
%matplotlib inline 

import matplotlib.pyplot as plt

# Change the size of the plot
plt.rcParams["figure.figsize"] = [5, 5]

See the [matplotlib cheat sheet][link1].

[link1]: https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Python_Matplotlib_Cheat_Sheet.pdf

## Plotting the Milk Dataset with seaborn

Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.

The seaborn library is included as part of the Anaconda distribution.

See the [seaborn documentation](https://seaborn.pydata.org/)!

In [None]:
import seaborn as sns

sns.__version__

In seaborn, the __hue__ parameter determines which column in the data frame should be used for color encoding.

In [None]:
ax = sns.lineplot(x = "Year", y = "Sales", hue = "Kind", data = milk1)
ax.set_title("US Milk Sales")

If we want to adjust the size and layout, we have to learn more about matplotlib.

## The Basics of matplotlib

See the [matplotlib documentation](https://matplotlib.org/stable/users/index.html)!

First, let's change the size of the figures in the notebook. To do that, we need to go back to the code we used to initialize matplotlib, and adjust `rcParams`, matplotlib's default settings.
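
As a minimal sketch of the two ways to control figure size, the cell below sets a new default through `rcParams` and then overrides it for a single figure with the `figsize` argument. The sizes and variable names are arbitrary choices for illustration.

```python
import matplotlib.pyplot as plt

# Change the default size (width, height in inches) for all later figures.
plt.rcParams["figure.figsize"] = [8, 4]
fig_default, ax_default = plt.subplots()
default_size = fig_default.get_size_inches().tolist()

# Override the default for one figure only with the `figsize` argument.
fig_small, ax_small = plt.subplots(figsize=(3, 3))
small_size = fig_small.get_size_inches().tolist()

# Close the empty example figures so they are not displayed.
plt.close("all")
```

Changing `rcParams` affects every figure created afterwards in the session, so it usually belongs in a setup cell near the top of the notebook.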

### Jargon

The most important thing to know is matplotlib's jargon:

* _Figure_: Container for plots.
* _Axes_: Container for components of a plot ("primitives"). In other words, an Axes is a single plot.
* _Axis_: Container for the components of one axis of a plot (the x-axis or the y-axis), such as its ticks and label.
* _Tick_: Container for a tick mark on an axis, along with its label and grid line.

All of the containers and the primitives are called _Artists_.
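
The jargon is easier to remember after walking through the containers in code. This is a small sketch with made-up data that creates one Figure with two Axes, then reaches down to an Axis and its Ticks. The variable names are illustrative.

```python
import matplotlib.pyplot as plt
from matplotlib.artist import Artist

fig, axs = plt.subplots(1, 2)      # one Figure containing two Axes
left = axs[0]
left.plot([0, 1, 2], [0, 1, 4])    # adds a Line2D primitive to the left Axes

x_axis = left.xaxis                # the Axis object for the x-axis of that plot
ticks = x_axis.get_major_ticks()   # list of Tick objects on that Axis

n_axes = len(fig.axes)
# Containers and primitives alike are Artists.
all_artists = all(
    isinstance(obj, Artist)
    for obj in (fig, left, x_axis, ticks[0], left.lines[0])
)
plt.close(fig)
```

Functions in seaborn and pandas typically return an Axes, so these attributes are how you get from a finished plot to the pieces you want to adjust.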

### Saving Figures

You can save figures to an image file with the `.savefig()` method.

You can also get the Figure that contains an Axes with the `.get_figure()` method. So to save our seaborn plot:

In [None]:
ax = sns.lineplot(x = "Year", y = "Sales", hue = "Kind", data = milk1)
ax.set_title("US Milk Sales")

In [None]:
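# Saves the *current* figure. With `%matplotlib inline`, a figure is closed at the end of the
# cell that drew it, so run this in the same cell as the plot or it will save a blank image.
plt.savefig('seabornplot.png')
# Saves the Figure that contains `ax`, using matplotlib's Figure.savefig(). This works from any cell.
ax.get_figure().savefig('output.png')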
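
Here is a self-contained sketch of the same pattern using plain matplotlib and made-up data. The output path is a temporary file chosen for illustration. The `dpi` and `bbox_inches` arguments are optional: `dpi` controls the resolution, and `bbox_inches="tight"` trims extra whitespace around the plot.

```python
import os
import tempfile

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([2000, 2010, 2020], [3, 2, 1])   # made-up data
ax.set_title("Toy plot")

# Write to a temporary directory so the example does not clutter the project.
out_path = os.path.join(tempfile.mkdtemp(), "toy_plot.png")
ax.get_figure().savefig(out_path, dpi=150, bbox_inches="tight")

saved_ok = os.path.getsize(out_path) > 0
plt.close(fig)
```

The file extension determines the output format, so changing `.png` to `.pdf` or `.svg` saves a vector image instead.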

## Comparing Packages

Let's use the familiar dogs dataset to further compare the different plotting packages.

In [None]:
dogs = pd.read_csv("../data/dogs_full.csv")
dogs.head()

In [None]:
dogs.tail()

### Dot Plots

Plot the number of dogs in each category.

In [None]:
# Plotnine

p = (p9.ggplot(dogs, p9.aes(x = "group"))
+ p9.geom_point(stat = "count"))
p + p9.labs(title = "Dog Groups", x = "Group", y = "Count")

In [None]:
# Count the dogs in each group (used by the seaborn and pandas plots below)
counts = dogs["group"].value_counts()
counts

In [None]:
# Seaborn
ax = sns.stripplot(x = counts.index, y = counts)
ax.set(title = "Dog Groups", xlabel = "Group", ylabel = "Count") # sets several properties at once
ax.set_xticklabels(ax.get_xticklabels(), rotation = 45)

In [None]:
# Pandas
ax = counts.plot(style = "o", rot = 45)
ax.set(title = "Dog Groups", xlabel = "Group", ylabel = "Count")

### Box Plots

Plot the distribution of dog longevity, grouped by category.

In [None]:
# Plotnine
( 
    p9.ggplot(dogs, p9.aes("group", "longevity")) 
    + p9.geom_boxplot()
    + p9.labs(title = "Dog Longevity", x = "", y = "Years")
)

In [None]:
# Seaborn

ax = sns.boxplot(x = "group", y = "longevity", data = dogs)
ax.set(title = "Dog Longevity", xlabel = "", ylabel = "Years")
ax.set_xticklabels(ax.get_xticklabels(), rotation = 45)

In [None]:
# Pandas

ax = dogs.boxplot(by = "group", column = "longevity", rot = 45)
ax.set(title = "Dog Longevity", xlabel = "", ylabel = "Years")
# Hide grouping title Pandas adds.
ax.get_figure().suptitle("")

### Scatter Plots

Plot popularity against datadog score.

In [None]:
# Plotnine

(
    p9.ggplot(dogs, p9.aes("datadog", "popularity"))
    + p9.geom_point()
    + p9.labs(title = "Best in Show", x = "DataDog Score", y = "Popularity Rank")
    + p9.ylim(95, -5)
)

In [None]:
# Seaborn
ax = sns.regplot(x = "datadog", y = "popularity", data = dogs, 
                 fit_reg = False)
ax.set(title = "Best in Show", xlabel = "DataDog Score", ylabel = "Popularity Rank")
ax.set_ylim(reversed(ax.get_ylim()))

In [None]:
# Pandas

ax = dogs.plot.scatter(x = "datadog", y = "popularity")
ax.set(title = "Best in Show", xlabel = "DataDog Score", ylabel = "Popularity Rank")
ax.set_ylim(reversed(ax.get_ylim()))
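
The seaborn and pandas cells above flip the y-axis by passing the current limits back in reverse order. As an alternative sketch, a matplotlib Axes also has an `invert_yaxis()` method that does the flip directly. The data below is made up for illustration.

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.scatter([1.0, 2.0, 3.0], [1, 50, 90])   # made-up scores and ranks
ax.invert_yaxis()                           # rank 1 now appears at the top

bottom, top = ax.get_ylim()                 # bottom is now larger than top
inverted = ax.yaxis_inverted()
plt.close(fig)
```

Since seaborn and pandas both return an Axes, the same call should work on the `ax` objects in the cells above.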

## Image Processing

In [None]:
# Automatically display matplotlib plots, so that we don't have to write `plt.show()`.
# Normally this should be in a cell at the top of the notebook.
%matplotlib inline

In [None]:
import matplotlib.pyplot as plt
import matplotlib.image as img

# Dog image from https://unsplash.com/photos/jx_kpR7cvDc
dog = img.imread("../data/dog.png")
plt.imshow(dog)

type(dog)

In [None]:
dog.shape

In [None]:
dog_rg = dog.copy()
dog_rg[:,:,2] = 0 # Zero out the blue channel, keeping only red and green.
plt.imshow(dog_rg)

Depending on which package you use to load an image, pixels may be encoded as integers or floating point (decimal) numbers. The scikit-image package has [some documentation](http://scikit-image.org/docs/dev/user_guide/data_types.html) about what these numbers typically mean.
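
As a rough sketch of the two most common encodings, the cell below converts a tiny made-up image between 8-bit integers (0 to 255) and floating point numbers (0 to 1) using plain NumPy. The array and its names are illustrative assumptions. matplotlib's `imread` returns PNG images as floats between 0 and 1, which is worth confirming by checking `dtype` on your own image.

```python
import numpy as np

# Hypothetical 1x2 image with 8-bit integer pixels: one green, one white.
img_u8 = np.array([[[0, 255, 0], [255, 255, 255]]], dtype=np.uint8)

# Rescale to floating point in [0, 1].
img_f = img_u8.astype(np.float32) / 255

# And back to 8-bit integers.
img_back = np.round(img_f * 255).astype(np.uint8)
```

Mixing the two encodings is a common source of all-white or all-black images, so check `dtype` before doing arithmetic on pixels.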

In [None]:
dog.dtype

Let's try to find all the green pixels and make them red.

How can we do this?

In [None]:
reddog_rgb = dog.copy()
#reddog_rgb[:, :, 0] = 1 # Set red channel to large value.
reddog_rgb[:, :, 1] = 0 # Set green channel to small value.
plt.imshow(reddog_rgb)

Break problems into small steps.

If you're trying to figure out how something works, test on small "toy" examples and draw pictures.

In [None]:
import numpy as np
import skimage as ski
import skimage.color

# Switch from (red, green, blue) to (hue, saturation, value).
# Hue is the color (from red to violet).
# Saturation is how colorful (from colorless to colorful).
# Value is how bright (from black to bright color).
dog_hsv = ski.color.rgb2hsv(dog)
plt.imshow(dog_hsv)

In [None]:
dog_hsv.shape
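
To build intuition for HSV on a toy example, the sketch below converts a few hand-picked colors one at a time. It uses `colorsys` from the Python standard library instead of scikit-image, purely to keep the example small. Like the array above, it scales hue, saturation, and value to the range 0 to 1.

```python
import colorsys

names = ["red", "green", "blue", "dark green", "gray"]
rgb = [
    (1.0, 0.0, 0.0),
    (0.0, 1.0, 0.0),
    (0.0, 0.0, 1.0),
    (0.0, 0.4, 0.0),
    (0.5, 0.5, 0.5),
]

# Map each color name to its (hue, saturation, value) triple.
hsv = {name: colorsys.rgb_to_hsv(*c) for name, c in zip(names, rgb)}

for name, (h, s, v) in hsv.items():
    print(f"{name:>10}: hue={h:.2f} saturation={s:.2f} value={v:.2f}")
```

Green and dark green share the same hue (about 0.33) and differ only in value, while gray has zero saturation.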

Start by taking a small piece of the dog image, say the lower left corner.

In [None]:
grass = dog_hsv[-200:, :200, :] # bottom 200 rows, leftmost 200 columns
plt.imshow(ski.color.hsv2rgb(grass))

How can we figure out what "green" looks like in HSV?

In [None]:
mu = grass.mean(axis = (0, 1))
mu

We can preview this "green" by making a 2x2 swatch.

In [None]:
swatch = np.stack(4 * [mu]).reshape((2, 2, 3))
plt.imshow(ski.color.hsv2rgb(swatch))

Now we need to get all pixels with a hue "nearby" the grass mean.

How can we define "nearby"?

In [None]:
sd = grass.std(axis = (0, 1))
sd

In [None]:
tol = (mu[0] - 3 * sd[0], mu[0] + 3 * sd[0])
tol

In [None]:
reddog = dog_hsv.copy()
is_green = (tol[0] <= reddog[:, :, 0]) & (reddog[:, :, 0] <= tol[1])
reddog[is_green, 0] = 0
plt.imshow(ski.color.hsv2rgb(reddog))
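
The same steps can be checked by hand on a toy example. This sketch uses a made-up 2x2 HSV array and three made-up "grass" hues. It builds the interval of mean plus or minus three standard deviations, masks the pixels whose hue falls inside it, and sets their hue to 0 (red). All of the numbers and names are illustrative assumptions.

```python
import numpy as np

# Toy 2x2 HSV "image": hue, saturation, value, each in [0, 1].
toy_hsv = np.array([
    [[0.33, 0.8, 0.6], [0.30, 0.5, 0.9]],   # two greenish pixels
    [[0.60, 0.7, 0.7], [0.05, 0.9, 0.8]],   # a blue and an orange pixel
])

# Hypothetical sample of "grass" hues, standing in for the grass patch.
grass_hues = np.array([0.30, 0.32, 0.34])
mu_h, sd_h = grass_hues.mean(), grass_hues.std()
lo, hi = mu_h - 3 * sd_h, mu_h + 3 * sd_h

# Boolean mask of pixels whose hue is within the interval.
hue = toy_hsv[:, :, 0]
is_green = (lo <= hue) & (hue <= hi)

# Recolor only the masked pixels; saturation and value are untouched.
recolored = toy_hsv.copy()
recolored[is_green, 0] = 0.0
```

Because only the hue channel changes, the recolored pixels keep their original brightness and colorfulness.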

What would happen if we tried this with RGB instead of HSV?