### Data Visualization

#### `matplotlib` - from the documentation:
https://matplotlib.org/3.1.1/tutorials/introductory/pyplot.html

`matplotlib.pyplot` is a collection of command style functions that make matplotlib work like MATLAB. <br>
Each pyplot function makes some change to a figure: e.g., creates a figure, creates a plotting area in a figure, plots some lines in a plotting area, decorates the plot with labels, etc.

In `matplotlib.pyplot` various states are preserved across function calls, so that it keeps track of things like the current figure and plotting area, and the plotting functions are directed to the current axes.<br>
(Note that "axes" here and in most places in the documentation refers to the axes part of a figure, not the strict mathematical term for more than one axis.)

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
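
The "current figure" and "current axes" bookkeeping described above can be inspected directly. The following is a minimal sketch, not part of the original notebook; `plt.gcf()` and `plt.gca()` are pyplot helpers that return the current figure and the current axes.

```python
import matplotlib.pyplot as plt

fig = plt.figure()         # this figure becomes the "current figure"
ax = plt.subplot(1, 1, 1)  # this axes becomes the "current axes"

# pyplot functions act on whatever is current
plt.plot([1, 2, 3])
plt.ylabel('numbers')

# The same objects can be retrieved explicitly
print(plt.gcf() is fig)    # is the current figure the one we created?
print(plt.gca() is ax)     # is the current axes the one we created?
print(ax.get_ylabel())     # the label set through plt.ylabel()
```

Because `plt.ylabel()` was routed to the current axes, the label is also visible through the `ax` object.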

Call signatures:
```
    plot([x], y, [fmt], data=None, **kwargs)
    plot([x], y, [fmt], [x2], y2, [fmt2], ..., **kwargs)
```

Quick plot

The main functions in `plt` are `plot()` and `show()`.

In [None]:
plt.plot()
plt.show()

List

In [None]:
plt.plot([8, 24, 27, 42])
plt.ylabel('numbers')
plt.show()

In [None]:
# Plot the two lists, add axes labels
x=[4,5,6,7]
y=[2,5,1,7]



`matplotlib` can use *format strings* to quickly declare the type of plots you want. Here are *some* of those formats:

|**Character**|**Description**|
|:-----------:|:--------------|
|'--'|Dashed line|
|':'|Dotted line|
|'o'|Circle marker|
|'^'|Upwards triangle marker|
|'b'|Blue|
|'c'|Cyan|
|'g'|Green|

In [None]:
plt.plot([3, 4, 9, 20], 'gs')
plt.axis([-1, 4, 0, 22])
plt.show()

In [None]:
plt.plot([3, 4, 9, 20], 'b^--', linewidth=2, markersize=12)
plt.show()

In [None]:
plt.plot([3, 4, 9, 20], color='blue', marker='^', linestyle='dashed', linewidth=2, markersize=12)
plt.show()

In [None]:
# Plot a list with 10 numbers with a magenta dotted line and circles for points.



In [None]:
import numpy as np

# evenly sampled time 
time = np.arange(0, 7, 0.3)
# gene expression
ge = np.arange(1, 8, 0.3)

# red dashes, blue squares and green triangles
plt.plot(time, ge, 'r--', time, ge**2, 'bs', time, ge**3, 'g^')
plt.show()

linestyle or ls	[ '-' | '--' | '-.' | ':' | 

In [None]:
lines = plt.plot([1, 2, 3])
plt.setp(lines)

In [None]:
names = ['A', 'B', 'C', 'D']
values = [7, 20, 33, 44]
values1 = np.random.rand(100)

plt.figure(figsize=(9, 3))

plt.subplot(131)
plt.bar(names, values)
plt.subplot(132)
plt.scatter(names, values)
plt.subplot(133)
plt.hist(values1)
plt.suptitle('Categorical Plotting')
plt.show()

In [None]:
import pandas as pd

In [None]:
df_iris = pd.read_csv('https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv')
df_iris.head()

In [None]:
x1 = df_iris.petal_length
y1 = df_iris.petal_width

x2 = df_iris.sepal_length
y2 = df_iris.sepal_width

plt.plot(x1, y1, 'g^', x2, y2, 'bs')
plt.show()

#### Histogram

In [None]:
help(plt.hist)

In [None]:
n, bins, patches = plt.hist(df_iris.petal_length, bins=20,facecolor='#8303A2', alpha=0.8, rwidth=.8, align='mid')

# Add a title
plt.title('Iris dataset petal length')

# Add y axis label
plt.ylabel('number of plants')

#### Boxplot

In [None]:
help(plt.boxplot)

In [None]:
plt.boxplot(df_iris.petal_length)

# Add a title
plt.title('Iris dataset petal length')

# Add y axis label
plt.ylabel('petal length')

The biggest issue with `matplotlib` isn't its lack of power...it is that it is too much power. With great power, comes great responsibility. When you are quickly exploring data, you don't want to have to fiddle around with axis limits, colors, figure sizes, etc. Yes, you *can* make good figures with `matplotlib`, but you probably won't.

https://python-graph-gallery.com/matplotlib/

Pandas works off of `matplotlib` by default. You can easily start visualizing dataframes and series with a single command.

#### Using pandas `.plot()`

Pandas abstracts some of those initial issues with data visualization. However, it is still a `matplotlib` plot.<br><br>
Every plot that is returned from `pandas` is subject to `matplotlib` modification.
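
As a minimal sketch of what "subject to `matplotlib` modification" means: the pandas plotting call returns a `matplotlib` `Axes` object, which can then be changed with the usual `Axes` methods. The small dataframe `df` below is a made-up stand-in for `df_iris` so that the example runs on its own.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for df_iris (values are made up for illustration)
df = pd.DataFrame({
    'petal_length': [1.4, 4.7, 6.0, 5.1],
    'petal_width': [0.2, 1.4, 2.5, 1.9],
})

# The pandas call returns a matplotlib Axes object
ax = df.plot(x='petal_length', y='petal_width', kind='scatter')

# From here on it is plain matplotlib
ax.set_title('Petal length vs. width')
ax.set_xlabel('petal length')
ax.axhline(1.0, color='grey', linestyle='--')  # horizontal reference line
```

In a notebook with `%matplotlib inline`, the figure is displayed when the cell finishes.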

In [None]:
df_iris.plot.box()
plt.show()

In [None]:
# Plot the histogram of the petal lengths
# Plot the histograms of all 4 numerical characteristics in a plot




In [None]:
df_iris.groupby("species")['petal_length'].mean().plot(kind='bar')
plt.show()

In [None]:
df_iris.plot(x='petal_length', y='petal_width', kind = "scatter")
plt.savefig('output.png')

In [None]:
plt.savefig('output.png')

https://github.com/pandas-dev/pandas/blob/v0.25.0/pandas/plotting/_core.py#L504-L1533

#### Multiple Plots

In [None]:
df_iris.petal_length.plot(kind='density')
df_iris.sepal_length.plot(kind='density')
df_iris.petal_width.plot(kind='density')

`matplotlib` allows users to define the regions of their plotting canvas. If the user intends to create a canvas with multiple plots, they would use the `subplot()` function. The `subplot` function sets the number of rows and columns the canvas will have **AND** sets the current index of where the next subplot will be rendered.
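
To make the index argument concrete, here is a small sketch (added for illustration, assuming a 2x2 grid). Each call to `plt.subplot(rows, columns, index)` creates the axes at that position and makes it the current axes; the index starts at 1 in the top-left corner and runs left to right, then top to bottom.

```python
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(6, 4))
created = []

for index in range(1, 5):
    ax = plt.subplot(2, 2, index)   # 2 rows, 2 columns, position `index`
    ax.set_title(f'index {index}')
    created.append(ax)
    # The new subplot is now the target of pyplot calls
    print(index, plt.gca() is ax)

# Positions are given in figure coordinates: x0 is the left edge, y0 the bottom edge
for index, ax in enumerate(created, start=1):
    box = ax.get_position()
    print(index, round(box.x0, 2), round(box.y0, 2))
```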

In [None]:
plt.figure(1)
# Plot three columns of df_iris in separate subplots
# subplot(rows, columns, index): the index starts at 1 in the top-left corner
plt.subplot(3, 1, 1)
df_iris.petal_length.plot(kind='density')
plt.subplot(3, 1, 2)
df_iris.sepal_length.plot(kind='density')
plt.subplot(3, 1, 3)
df_iris.petal_width.plot(kind='density')
# Some plot configuration
plt.subplots_adjust(top=.92, bottom=.08, left=.1, right=.95, hspace=.25, wspace=.35)
plt.show()

In [None]:
# Temporary styles
with plt.style.context('ggplot'):
    plt.figure(1)

    # Plot three columns of df_iris in separate subplots
    # subplot(rows, columns, index): the index starts at 1 in the top-left corner
    plt.subplot(3, 1, 1)
    df_iris.petal_length.plot(kind='density')
    plt.subplot(3, 1, 2)
    df_iris.sepal_length.plot(kind='density')
    plt.subplot(3, 1, 3)
    df_iris.petal_width.plot(kind='density')
    # Some plot configuration
    plt.subplots_adjust(top=.92, bottom=.08, left=.1, right=.95, hspace=.25, wspace=.35)
    plt.show()

In [None]:
# Plot the histograms of the petal length and width and sepal length and width
# Display them in a figure with 2x2 subplots
# Color them red, green, blue and yellow, respectively



### `seaborn` - dataset-oriented plotting

Seaborn is a library that specializes in making *prettier* `matplotlib` plots of statistical data. <br>
It is built on top of matplotlib and closely integrated with pandas data structures.

https://seaborn.pydata.org/introduction.html<br>
https://python-graph-gallery.com/seaborn/

In [None]:
import seaborn as sns

`seaborn` lets users *style* their plotting environment.

In [None]:
sns.set(style='whitegrid')

However, you can always use `matplotlib`'s `plt.style`
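
A minimal sketch of the `matplotlib` route, added for illustration: `plt.style.available` lists the built-in style names, and `plt.style.context()` applies one temporarily. The `'ggplot'` style is used here because it appears elsewhere in this notebook; reading `axes.grid` from `plt.rcParams` is just one way to see that the settings changed.

```python
import matplotlib.pyplot as plt

# Names of the styles that ship with matplotlib
print('ggplot' in plt.style.available)

# Apply a style only inside the with-block
with plt.style.context('ggplot'):
    fig, ax = plt.subplots()
    ax.plot([8, 24, 27, 42])
    grid_inside = plt.rcParams['axes.grid']   # setting while the style is active

grid_outside = plt.rcParams['axes.grid']      # setting after leaving the block
print(grid_inside, grid_outside)
```

`plt.style.use('ggplot')` takes the same style names but changes the settings for the rest of the session.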

In [None]:
#dir(sns)

In [None]:
sns.scatterplot(x='petal_length',y='petal_width',data=df_iris)
plt.show()

In [None]:
sns.scatterplot(x='petal_length',y='petal_width', hue = "species",data=df_iris)
plt.show()

#### Violin plot

A violin plot is a fancier box plot that shows the distribution of the data points directly, which removes the need for 'jitter'.

In [None]:
columns = ['petal_length', 'petal_width', 'sepal_length']

fig, axes = plt.subplots(figsize=(10, 10))
sns.violinplot(data=df_iris.loc[:,columns], ax=axes)
axes.set_ylabel('number')
axes.set_xlabel('columns')
plt.show()

#### Distplot

In [None]:
sns.set(style='darkgrid', palette='muted')

# 4 rows, 1 column
f, axes = plt.subplots(4,1, figsize=(10,10), sharex=True)
sns.despine(left=True)

# Regular distplot
sns.distplot(df_iris.petal_length, ax=axes[0])

# Histogram only (kde=False), in a different color
sns.distplot(df_iris.petal_width, kde=False, ax=axes[1], color='orange')

# Show only the kernel density estimate (hist=False), shaded
sns.distplot(df_iris.sepal_width, hist=False, kde_kws={'shade':True}, ax=axes[2], color='purple')

# Show the rug
sns.distplot(df_iris.sepal_length, hist=False, rug=True, ax=axes[3], color='green')

#### FacetGrid

In [None]:
sns.set()
columns = ['species', 'petal_length', 'petal_width']
facet_column = 'species'
g = sns.FacetGrid(df_iris.loc[:,columns], col=facet_column, hue=facet_column, col_wrap=5)
g.map(plt.scatter, 'petal_length', 'petal_width')

In [None]:
sns.relplot(x="petal_length", y="petal_width", col="species",
            hue="species", style="species", size="species",
            data=df_iris)
plt.show()

https://jakevdp.github.io/PythonDataScienceHandbook/04.14-visualization-with-seaborn.html

### `plotnine` - R ggplot2 in python

plotnine is an implementation of a grammar of graphics in Python, based on ggplot2. The grammar allows users to compose plots by explicitly mapping data to the visual objects that make up the plot.

Plotting with a grammar is powerful: it makes custom (and otherwise complex) plots easy to think about and then create, while the simple plots remain simple.



In [None]:
!pip install plotnine

https://plotnine.readthedocs.io/en/stable/

In [None]:
from plotnine import *

In [None]:
ggplot(data=df_iris) + aes(x="petal_length", y = "petal_width") + geom_point()

In [None]:
# add transparency - to avoid over plotting
ggplot(data=df_iris) + aes(x="petal_length", y = "petal_width") + geom_point(alpha=0.7)

In [None]:
# change point size 
ggplot(data=df_iris) + aes(x="petal_length", y = "petal_width") + geom_point(size = 0.7, alpha=0.7)

In [None]:
# more parameters 
ggplot(data=df_iris) + aes(x="petal_length", y = "petal_width") + geom_point() + scale_x_log10() + xlab("Petal Length")

In [None]:
n = "3"
ft = "length and width"
title = 'species : %s, petal : %s' % (n,ft)

ggplot(data=df_iris) +aes(x='petal_length',y='petal_width',color="species") + geom_point(size=0.7,alpha=0.7) + facet_wrap('~species',nrow=3) + theme(figure_size=(9,5)) + ggtitle(title)


In [None]:
p = ggplot(data=df_iris) + aes(x='petal_length') + geom_histogram(binwidth=1,color='black',fill='grey')
p

In [None]:
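# Save through the plot's own save() method (the standalone ggsave() helper may not be available in newer plotnine releases)
p.save(filename='hist_plot_with_plotnine.png')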

http://cmdlinetips.com/2018/05/plotnine-a-python-library-to-use-ggplot2-in-python/ <br>
https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf


Use ggplot to draw boxplots of `sepal_length` separated by species, add new axis labels, and put the y axis on a log10 scale.

* Write a function that takes a row of the dataframe as a parameter and, if the species is
    * setosa, returns the petal_length
    * versicolor, returns the petal_width
    * virginica, returns the sepal_length

Apply this function to every row in the dataset in a for loop and save the results in an array.
Use ggplot to make a histogram of the values.