# Session 5 -- Plotting
Here we will cover some of Python's basic plotting capabilities. We will not go into much depth, because one of the objectives of this course is to introduce PowerBI, a visualization tool that is easier to customize and better suited to presenting results to a customer. For this session, we will just go through the Notebook -- no need to follow along in Spyder.

## `plotnine`
The most popular plotting library for Python is `matplotlib`, which comes with your Anaconda install. However, for this course, we will be using a library called `plotnine`. R has an extremely popular plotting library called `ggplot2`, and `plotnine` is a way for Python users to get the power of `matplotlib` with syntax very similar to `ggplot2`.

`plotnine` does not come built into Python, and it is not included with Anaconda, but we can install it. Open the application called `Anaconda Prompt` from your Start Menu.
![Anaconda Prompt](img/04_prompt.png)

Anaconda Prompt is simply the normal Command Prompt, launched with parameters that set its starting folder and the environment it opens in.

Enter the following commands:
> `conda activate [your environment name here]`

This will take your command prompt to the environment we have been working in.

Next, enter this command to install the `plotnine` library:
> `pip install plotnine`

Confirm any prompts and allow it to install.
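
To confirm the install worked before moving on, one option is to ask Python whether it can locate the package. This is a minimal sketch using the standard-library `importlib.util.find_spec`; the helper name `is_installed` is made up for illustration and is not part of any library.

```python
from importlib.util import find_spec


def is_installed(package_name):
    """Return True if Python can locate a top-level package with this name."""
    return find_spec(package_name) is not None


# Run this from the same environment you installed into.
for name in ("pandas", "sklearn", "plotnine"):
    status = "found" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

If `plotnine` shows up as missing, the most likely cause is that `pip install` ran in a different environment than the one Python is using.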

If you want to follow along in Spyder, there's a recommended setting to change so that plots appear in new interactive windows. Otherwise, they appear inline in the console, where they are tiny and hard to read.

Navigate to the toolbar and change the following setting:
`Tools > Preferences > Ipython Console > Graphics > Backend > Automatic`
![Spyder Pref](img/00_spyder_pref.png)

![Spyder Pref](img/04_spyder_backend.png)

## `sklearn`
We'll also be using a package called `sklearn`, a library full of machine learning tools. We will only be using its `datasets` module, which contains several popular sample datasets used for demonstration purposes.

# Imports
First come our imports. We will need `sklearn.datasets`, `statistics`, `pandas`, and `plotnine`:

In [1]:
import pandas as pd
from sklearn import datasets
import statistics
from plotnine import *

The `*` in the `plotnine` import means "import everything". This gives us access to all of its functions without having to prefix them with the package name. You'll see later that we use syntax like `geom_point` rather than `plotnine.geom_point`, which preserves the similarity to `ggplot2` syntax.

There are also a lot of warnings that will pop up due to updates in libraries used behind the scenes, but they are nothing to worry about. We can use the following to suppress these warnings:

In [2]:
import warnings
warnings.filterwarnings('ignore')
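
Note that the blanket filter above hides every warning for the rest of the session. If you would rather silence only one category, and only around a specific step, a narrower sketch with the standard library looks like this. The function `noisy_step` is a hypothetical stand-in for a library call that emits a warning.

```python
import warnings


def noisy_step():
    """Stand-in for a library call that emits a warning (hypothetical)."""
    warnings.warn("this API will change", FutureWarning)
    return 42


# Warnings are silenced only inside the `with` block;
# the previous filter settings are restored on exit.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=FutureWarning)
    result = noisy_step()

print(result)
```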

Next, we'll load a dataset to demonstrate these plots on. We'll use the `iris` dataset:

In [3]:
iris = datasets.load_iris()

In [4]:
iris

{'data': array([[5.1, 3.5, 1.4, 0.2],
        [4.9, 3. , 1.4, 0.2],
        [4.7, 3.2, 1.3, 0.2],
        [4.6, 3.1, 1.5, 0.2],
        [5. , 3.6, 1.4, 0.2],
        [5.4, 3.9, 1.7, 0.4],
        [4.6, 3.4, 1.4, 0.3],
        [5. , 3.4, 1.5, 0.2],
        [4.4, 2.9, 1.4, 0.2],
        [4.9, 3.1, 1.5, 0.1],
        [5.4, 3.7, 1.5, 0.2],
        [4.8, 3.4, 1.6, 0.2],
        [4.8, 3. , 1.4, 0.1],
        [4.3, 3. , 1.1, 0.1],
        [5.8, 4. , 1.2, 0.2],
        [5.7, 4.4, 1.5, 0.4],
        [5.4, 3.9, 1.3, 0.4],
        [5.1, 3.5, 1.4, 0.3],
        [5.7, 3.8, 1.7, 0.3],
        [5.1, 3.8, 1.5, 0.3],
        [5.4, 3.4, 1.7, 0.2],
        [5.1, 3.7, 1.5, 0.4],
        [4.6, 3.6, 1. , 0.2],
        [5.1, 3.3, 1.7, 0.5],
        [4.8, 3.4, 1.9, 0.2],
        [5. , 3. , 1.6, 0.2],
        [5. , 3.4, 1.6, 0.4],
        [5.2, 3.5, 1.5, 0.2],
        [5.2, 3.4, 1.4, 0.2],
        [4.7, 3.2, 1.6, 0.2],
        [4.8, 3.1, 1.6, 0.2],
        [5.4, 3.4, 1.5, 0.4],
        [5.2, 4.1, 1.5, 0.1],
  

If you look at the data, you'll notice it is not a `DataFrame`. We'll have to take some steps to convert it.

Extra credit info:
We'll convert this by creating a data frame from the `data` attribute of the dataset, using the `feature_names` attribute as the column names. If you look at the data, there are two other attributes, `target` and `target_names`, that correspond to the species column of this dataset. `target` holds numeric codes, and `target_names` holds the species names those codes refer to. This is a little confusing, but I'm assuming they did it to keep all the data numeric. Luckily, `pandas` has a function to map these too.

In [5]:
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['Species'] = pd.Categorical.from_codes(iris.target,iris.target_names)
df.head()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | Species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
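
To make the code-to-name mapping less mysterious, here is a minimal sketch of `pd.Categorical.from_codes` on its own. The `codes` and `labels` values are made up in the same style as `iris.target` and `iris.target_names`; they are not read from the dataset.

```python
import pandas as pd

# Made-up codes and labels in the same style as `iris.target` and `iris.target_names`.
codes = [0, 0, 1, 2, 1]
labels = ["setosa", "versicolor", "virginica"]

# Each integer code is used as an index into `labels`.
species = pd.Categorical.from_codes(codes, labels)

print(list(species))
print(list(species.codes))
```

The result still stores the integer codes internally, so nothing is lost: you get readable names for plotting and the numeric codes remain available through `.codes`.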


# `pandas` Plots
The `pandas` library has some built-in plotting abilities, available by calling `plot` on a `DataFrame`. The default is a line plot. Let's plot `sepal length` against `sepal width`:

In [None]:
df.plot(x='sepal length (cm)',y='sepal width (cm)', title='My Sepal Length-Width')

You'll see I also named the plot using the parameter `title`.

A line plot doesn't make much sense here -- let's do a scatter plot instead. We can do this by adding `.scatter` to our `plot` call:

In [None]:
df.plot.scatter(x='sepal length (cm)',y='sepal width (cm)', title='My Sepal Length-Width')
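
The plot type can also be chosen with the `kind` parameter of `plot`. The sketch below shows both spellings side by side on a tiny made-up data frame (`demo` and its column names are assumptions for illustration). The `matplotlib.use("Agg")` line is only there so the sketch runs as a plain script without a display; leave it out inside the Notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in the Notebook
import pandas as pd

# Tiny made-up data set, only for illustration.
demo = pd.DataFrame({"length": [5.1, 4.9, 6.3], "width": [3.5, 3.0, 2.9]})

# These two calls request the same kind of plot.
ax_accessor = demo.plot.scatter(x="length", y="width", title="Accessor form")
ax_kind = demo.plot(kind="scatter", x="length", y="width", title="kind= form")
```

Both calls return a `matplotlib` axes object, which is what you would use if you wanted to tweak the plot further.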

We can also plot just one variable, and the x-axis will just default to an index:

In [None]:
df.plot(y='sepal length (cm)')

# `plotnine` Plotting
But what about that stuff about `ggplot` and `plotnine`? While `pandas` has basic plotting abilities, there's not much flexibility. `plotnine`, on the other hand, can be customized pretty easily.

Plotting with `plotnine` follows a specific workflow. First, let's create a window to plot on and specify what data we want to plot.

In [None]:
# Keyword arguments keep this call independent of the positional order ggplot expects
scatter = ggplot(data=df, mapping=aes(x='sepal length (cm)', y='sepal width (cm)'))
scatter

This gives us a blank plot, but don't worry, this is expected!

The next step is to add some data points. We display these by using `geom_point`, which can be thought of as short for "geometric points". You'll see in a bit why they're geometric, but first let's see an example:

In [None]:
scatter0 = scatter + geom_point(data=df)
scatter0

Alright, we have a plot! But what if we want to fit even more information into just one plot? What else can we do?

We can change the appearance of the points. Here's an example of changing the colors and shapes:

In [None]:
scatter1 = scatter + geom_point(aes(color='factor(Species)', shape='factor(Species)'),data=df)
scatter1

Now we can see the distribution of each species. The `aes` that you see in the parameters for `geom_point` is short for `aesthetic`, and within this parameter you can tie the appearance of the points to columns in the data.

Another aesthetic is labels. We can add titles and customize the axis-labels:

In [None]:
scatter1 = (scatter1 + xlab("Sepal Length") + ylab("Sepal Width") +
        ggtitle("Sepal Length-Width"))
scatter1

These layers can all be chained in a single expression. The next command also shows how to scale point size by a variable, set transparency, and add reference lines.

In [None]:
scatter = (scatter +
           geom_point(aes(color = 'petal width (cm)', shape='factor(Species)', size='petal length (cm)'), alpha=0.5) +
           geom_vline(aes(xintercept = statistics.mean(df['sepal length (cm)'])), color="red", linetype="dashed") +
           geom_hline(aes(yintercept = statistics.mean(df['sepal width (cm)'])), color="red", linetype="dashed") +
           scale_color_gradient(low="yellow", high="red") +
           xlab("Sepal Length") +  ylab("Sepal Width") +
           ggtitle("Sepal Length-Width"))
scatter
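
The dashed lines above sit at the mean of each axis. The intercepts are plain numbers, so they can be computed ahead of time and inspected. As a small sketch, here are two ways to get the same value, using the first five sepal lengths shown in `df.head()`; the variable names are made up for illustration.

```python
import math
import statistics

import pandas as pd

# The first five sepal lengths from `df.head()`, standing in for the full column.
lengths = pd.Series([5.1, 4.9, 4.7, 4.6, 5.0])

mean_stdlib = statistics.mean(lengths)  # works on any iterable of numbers
mean_pandas = lengths.mean()            # pandas' own method

print(mean_stdlib, mean_pandas)
```

The two results can differ in the last decimal places because they sum the values differently, so compare them with a tolerance rather than with `==`.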

# Boxplot
We can specify different types of plots as well. Here's a boxplot, using `pandas`:

In [None]:
df.boxplot(column=['sepal length (cm)'],by="Species")

And here's a boxplot in `plotnine`:

In [None]:
box = ggplot(data=df, mapping=aes(x='Species', y='sepal length (cm)'))
box = (box + geom_boxplot(aes(fill='Species')) +
  ylab("Sepal Length") + ggtitle("Iris Boxplot") +
  stat_summary(geom="point", shape=5, size=4))
box

We can also save plots. `plotnine` supports saving to `eps`, `pdf`, `pgf`, `png`, `ps`, `raw`, `rgba`, `svg`, and `svgz` filetypes. Use the following command to save:

In [None]:
box.save("boxplot.pdf", width=20,height=20,units="cm")

# Histogram
Finally, let's see an example of a histogram in both `pandas` and `plotnine`. Here's the `pandas` version:

In [None]:
df.hist(column="sepal width (cm)", bins=12)

And here it is in `plotnine`:

In [None]:
hist = ggplot(data=df, mapping=aes(x="sepal width (cm)"))
hist = (hist + geom_histogram(aes(fill='Species'), binwidth=0.2, color="black") +
  xlab("Sepal Width") +  ylab("Frequency") + ggtitle("Histogram of Sepal Width"))
hist
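
Notice that the `pandas` version asks for a number of bins (`bins=12`), while the `plotnine` version asks for a bin width (`binwidth=0.2`). The two are related through the range of the data. The sketch below is a rough conversion, assuming sepal widths run from about 2.0 cm to 4.4 cm (check with `df['sepal width (cm)'].min()` and `.max()`); the helper `bins_for_width` is made up for illustration.

```python
def bins_for_width(low, high, binwidth):
    """Approximate number of bins needed to cover [low, high] at a given bin width."""
    return max(1, round((high - low) / binwidth))


# Assumed range for illustration: sepal widths running from 2.0 cm to 4.4 cm.
print(bins_for_width(2.0, 4.4, 0.2))
```

This is only an approximation: each library chooses where its bin edges fall, so the two histograms may not line up bar for bar even when the bin counts agree.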

# Wrap-up
And that's it for plotting! As mentioned at the beginning, this was a simple demonstration of Python's plotting capabilities. They are powerful in their own right, especially when you generate plots programmatically across many calculations and analyses and save them to file. However, this course is directed toward working with PowerBI, which can itself integrate Python and R scripts to load data as well.