### ==============================================================================
### SCRIPT NAME: 02_CS_Plotting V3
### PURPOSE: Data visualization and Variable selection
### PACKAGES NEEDED: os, numpy, pandas, seaborn,  matplotlib
### ==============================================================================

In this notebook you will be introduced to data visualization in Python.

Whenever in doubt, you can always take a look at some of these resources:

    help([function name]): Provides a detailed description of the function
    online pandas documentation: https://pandas.pydata.org/pandas-docs/stable/?v=20200107131408
    online matplotlib documentation: https://matplotlib.org/contents.html?v=20200131112331
    online seaborn documentation: https://seaborn.pydata.org/

Note that whenever you see multiple consecutive question marks (like '???') you will have to enter something. Evaluating a cell can be done by clicking on the 'run' button at the top or by pressing shift + enter on a selected cell.

Good luck!

## Quick Links to the Exercises

[Exercise 1: Basic Plotting with Pandas](#exercise1)  
    
[Exercise 2: Main Seaborn Functions](#exercise2)  
- [displot() for distributions](#displot)  
- [relplot() for relationships](#relplot)
- [catplot() for categorical plots](#catplot) 

[Exercise 3: Pair Plots and Correlation Heatmaps](#exercise3)  


# Import required packages #

#### Import packages
    import

In [None]:
# These libraries are commonly used for Data Analytics.  They extend Python with new functions.
import numpy as np                    # support for large, multi-dimensional arrays and matrices
import pandas as pd                   # for manipulating tabular data (flat files) for analytics (uses numpy)
import matplotlib.pyplot as plt       # Basic chart plotting
import seaborn as sns                 # More advanced and easier-to-use chart plotting (uses matplotlib)

# VISUALIZATION

# <a id='exercise1'>Exercise 1 - Basic Plotting with Pandas</a>

Histograms show the distribution of values for a particular variable of a dataset.  They allow us to see if there are outliers or any surprises in the distribution curve itself.

#### Read data
    pd.read_csv(): Loads the file into the Python environment

In [None]:
# Read the csv file
data = pd.read_csv('data/cleaned_dataset.csv')

Take a look at the data

In [None]:
# Get the names of the columns
data.columns

In [None]:
# Use describe to see the descriptives
data.describe()

# Basic plotting using pandas functions

For basic visualization, pandas has some functions (building on matplotlib) that are very easy and quick to use, for example when cleaning and structuring data. For any DataFrame or Series object, we can add .plot() at the end:

- **data.plot(kind='____')**

The **kind** argument decides which kind of plot from:
- line
- box
- bar
- scatter
- hist
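
As a minimal sketch of how the **kind** argument works, the example below draws two of the kinds listed above from the same Series. The data is synthetic and the column name `temperature` is made up for illustration; it is not the course dataset.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in data (hypothetical values, not the course dataset)
rng = np.random.default_rng(0)
toy = pd.DataFrame({'temperature': rng.normal(loc=650, scale=2, size=200)})

# Two empty subplots side by side that we can draw into
fig, (ax_hist, ax_box) = plt.subplots(ncols=2, figsize=(10, 3))

# Same Series, two different kinds of plot
toy['temperature'].plot(kind='hist', ax=ax_hist, title='hist')
toy['temperature'].plot(kind='box', ax=ax_box, title='box')
```

Only the value of `kind` changes between the two calls; the data and the remaining arguments stay the same.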

Let's explore the variable **glass_temp_zone1**

In [None]:
data['glass_temp_zone1'].plot(kind='hist')

We can also pass further arguments to this function, for example figsize=(width, height) to set the size of the figure:

In [None]:
data['glass_temp_zone1'].plot(kind='line', figsize=(14,4))

The default x-axis (if needed) is the index. Since we read in the data again from a CSV file, **we have to convert the timestamps to a DatetimeIndex again:**

In [None]:
data['time'] = pd.to_datetime(????)
data.index = data['time']

In [None]:
data['glass_temp_zone1'].plot(kind='line', figsize=(14,4))

If we pick several columns (or the entire table), each column will be plotted as a separate group.

In [None]:
data[['glass_temp_zone1', 'glass_temp_zone2']].plot(kind=????, figsize=(14,4))

To visualize only a part of the data, we can slice it with timestamps if the index is a DatetimeIndex:

    short_data = data.loc['start_time':'end_time']

In [None]:
data_B = data.loc['2021-02-12 14:00':'2021-02-12 16:00'].copy()

data_B[['glass_temp_zone1', 'glass_temp_zone2']].plot(figsize=(14,4))
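
One detail worth knowing about timestamp slicing with `.loc` is that both the start and the end label are included. The sketch below checks this on a small synthetic table with one reading every 30 minutes; the column name and values are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: one reading every 30 minutes from 13:00 to 17:00
toy = pd.DataFrame(
    {'temperature': np.linspace(650, 654, 9)},
    index=pd.date_range('2021-02-12 13:00', '2021-02-12 17:00', freq='30min'),
)

# Slice by timestamp strings; both end points are part of the result
short = toy.loc['2021-02-12 14:00':'2021-02-12 16:00']
print(len(short))  # 5 rows: 14:00, 14:30, 15:00, 15:30, 16:00
```

This differs from ordinary Python list slicing, where the end position is excluded.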

# Plotting using seaborn

#### Seaborn builds on matplotlib and gives us a few very powerful visualization functions, all with a very similar syntax:

### sns.name-of-function(data=DataFrame, kind='plot-type', x='variable1', y='variable2', hue='grouping-var1',...)

We can use:
- **data**: the DataFrame (table) object that we usually work with
- **x**: variable on the x-axis
- **y**: variable on the y-axis
- **hue**: variable to group by, shown as different colors
- **style**: variable to group by, shown as different styles (marker styles, line styles)
- **size**: variable to group by, shown as different sizes (marker size, line width)

If it's a high-level ("figure-level") function (relplot, displot, catplot), we can also use:
- **kind**: kind of plot ('scatter', 'hist', 'box', etc.)
- **row**: variable to group by, with one subplot per group arranged **row-wise**
- **col**: variable to group by, with one subplot per group arranged **column-wise**
- **height**: the height of each subplot
- **aspect**: ratio of width to height of each subplot

![](images/seaborn_structure.png)

### You can either use the low-level ("axis-level") functions like

In [None]:
sns.histplot(data=data, x='glass_temp_zone1')

### ...or high-level ("figure-level") functions like

In [None]:
sns.displot(data=data, kind='hist', x='glass_temp_zone1')

## In this notebook, we will mostly use the high-level (figure-level) functions because of the added features (like making many grouped subplots)

#### However, the syntax for both is similar, and if you want to customize a lot, the low-level axis functions might be more practical

# Quick data summary

#### The data is real and comes from Sekurit's production of car windshields. Each line in the data table represents one produced unit.
#### Let's say that we produced a few batches according to a few heating recipes (= settings). The batch and recipe of each glass are available in two corresponding columns.

In [None]:
data[['batch', 'recipe', 'glass_ID']]

A first question we can ask ourselves is:

### Is there any difference between different recipes or batches?

#  <a id='exercise2'>Exercise 2: Main Seaborn functions</a>

# <a id='displot'>sns.displot(): Distribution plots like histograms</a>

    displot(data=data_object, kind='hist', x='my_column')
    kind can be:
        'hist'
        'kde'
        'ecdf'

To visualize groups of data in seaborn, all we have to do is to add the argument 

**hue**='name-of-group-variable'

Group the data by the 'recipe' column:

In [None]:
sns.displot(data=data, kind='hist', x='glass_temp_zone1', hue=????)

Sometimes it's hard to compare groups if they are overlapping too much. We can then split the data up into subfigures instead of colors with the argument 
- **row** or **col** ='name-of-group-variable'

What does the argument **col** do?
What happens if you set both **col** and **hue** to the same variable?

In [None]:
sns.displot(data=data, kind='hist', x='glass_temp_zone1', ???='recipe')




For histograms (kind='hist'), we can customize how many bars are used with the argument **bins**=50, and we can choose to add "kernel density estimate" for estimating the probability distribution with **kde**=True

In [None]:
sns.displot(data=data, kind='hist', x='glass_temp_zone1', col='recipe', bins=50, kde=True)

# <a id='relplot'>sns.relplot(): Relationships between two continuous variables</a>

    relplot(data=data_object, kind='line', x='my_x_column', y='my_y_column')
    kind can be:
        'line'
        'scatter'

## Line plots
When using relplot(), we can specify which variables we want on the x and y axes. The most common line plot is probably the time series. If we have a DatetimeIndex, we enter x=data.index (or the name of another column in datetime format).

Does the **hue** argument work here too?

In [None]:
sns.relplot(data=data, kind=????, x=data.index, y='glass_temp_zone1', hue='recipe')

The plot looks too small for this amount of data. displot()/relplot()/catplot() can all be adjusted with the arguments:
- **height**: the height of each subplot
- **aspect**: ratio of width to height of each subplot

To make a plot wider when its height is already good, we only have to change the **aspect**:

In [None]:
sns.relplot(data=data, kind='line', x=data.index, y='glass_temp_zone1', hue='recipe', aspect=???)

### Plotting the values in many columns with each other

Sometimes you want to compare different columns to each other (instead of the same column grouped in different ways).

The seaborn functions do this when you **only give them a data-argument**

### What goes wrong in the plot below?

In [None]:
sns.relplot(data=data, kind='line')

Instead, use the list of glass temperature variables to plot only them together

In [None]:
glass_zones = ['glass_temp_zone1','glass_temp_zone2','glass_temp_zone3','glass_temp_zone4']

sns.relplot(data=data[?????], kind='line', aspect=2,)

# Scatterplots

If there is no inherent order between the data points on one axis (like chronological), a scatter plot might be more suitable.

Let's say we suspect that the quality output 'geometry_final' is dependent on the glass temperature that we have investigated:

In [None]:
sns.relplot(data=data, kind=????, x='glass_temp_zone1', y='geometry_final')

It looks like there is some correlation, but it is noisy. Could it help to split up the data?

#### Make one figure for each 'recipe', and one color for each 'batch'!

In [None]:
sns.relplot(data=data, kind='scatter', x='geometry_final', y='glass_temp_zone1', col=????, hue=????, height=4)

Relplots can also use the **size** argument to scale the marker size or line width with yet another variable. This works for categorical variables as well as continuous ones.

**hue** can also be used for continuous data, scaling a colormap between the minimum and maximum values.

In the following example, we're visualizing 4 different variables in the same scatter plot!

In [None]:
sns.relplot(data=data, 
            kind='scatter', 
            x='glass_temp_zone1', 
            y='geometry_final', 
            hue='glass_temp_zone4',
            size='cycle_time',
            height=10)

# <a id='catplot'>Categorical plots like boxplots, barplots</a>

So far, we could add hue/row/col to group data points, but we can also use categorical type plots to compare groups:

    catplot(data=data_object, kind='box', x='my_x_column', y='my_y_column')
    kind can be:
        'box'
        'bar'
        'strip'
        'swarm'
        'violin'

Try out the different kinds and see what's the difference!

In [None]:
sns.catplot(data=data, kind=????, x='batch', y='geometry_final', height=4, aspect=2)

As before, if we want to use different columns as categories, a data-argument and a list of the column names is enough for the seaborn function:

In [None]:
sns.catplot(data=data[glass_zones], kind='box', aspect=1.5)
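
Passing several columns directly is a shortcut. An equivalent route, sketched below on synthetic data with hypothetical column names, is to first reshape the table with `melt()` so that the former column names become a category column, and then name x and y explicitly.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in: two temperature columns (wide-form)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'zone_a': rng.normal(650, 1, 100),
    'zone_b': rng.normal(655, 1, 100),
})

# Reshape to long-form: one row per (zone, temperature) observation
toy_long = toy.melt(var_name='zone', value_name='temperature')

# Same kind of box plot, now with explicit x and y
grid = sns.catplot(data=toy_long, kind='box', x='zone', y='temperature',
                   height=3, aspect=1.5)
```

The long-form table is more verbose, but it allows further grouping arguments such as **hue** or **col** on additional columns.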

# <a id='exercise3'>Exercise 3: Pair Plots and correlation heatmaps</a>

We can use the scatter plot to see the relationship between two variables, but it is a bit time-consuming to run all of the pairings.  Fortunately, the Pair Plot will do it for us.  

We have many variables available, but we are interested in seeing correlation plots only between some of those:

In [None]:
# Select only a couple of variables
variables = [
    'glass_temp_zone1',
    'glass_temp_zone2',
    'geometry_final',
    'regulation_temp',
    'energy_equivalent',
    'recipe',
    ]

data_for_pair = data[????]

In [None]:
# Remove missing values
data_for_pair = data_for_pair.dropna()

    sns.pairplot(): Creates a scatter plot between each pair of variables in the provided data

In [None]:
sns.pairplot(data_for_pair)

##### Create a grouped pairplot

Does **hue** work here as well?

The subplots are mirrored on the diagonal, so one half is redundant.
Use **corner**=True to just plot the lower half.

In [None]:
sns.pairplot(data_for_pair, ???='recipe', corner=True)

#### What do you see in the charts?  Are there any correlations that stand out?

Learn more about the Pair Plot here : https://seaborn.pydata.org/generated/seaborn.pairplot.html

## Correlation Heatmap

The Pair Plot compared variables using scatter plots.  We could then visually inspect each plot and see if we noticed any correlation, either positive or negative.

A Correlation heatmap does something similar, but it uses colors to show the strength and direction of the correlation.  

#### Calculate correlations

    corr(): Creates correlation for each pair of variables

In [None]:
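# Create a new dataset that contains the correlation strengths between all pairs of data
# We use the same variables as we already had for the pair plot
# corr() needs numeric input, so keep only the numeric columns (this drops text columns such as 'recipe')
data_correlations = data_for_pair.select_dtypes(include='number').corr()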
data_correlations
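
To get a feeling for the numbers that `corr()` returns, the sketch below builds synthetic columns with a known relationship to `x`. All names are hypothetical; the point is only to show the extremes of the default (Pearson) correlation coefficient.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in with known relationships to x
rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = pd.DataFrame({
    'x': x,
    'same_direction': 2 * x + 1,         # perfectly linear, increasing
    'opposite_direction': -3 * x,        # perfectly linear, decreasing
    'unrelated': rng.normal(size=200),   # independent noise
    'label': ['A', 'B'] * 100,           # text column, not usable for corr()
})

# Keep numeric columns only, then compute pairwise correlations
corr = toy.select_dtypes(include='number').corr()
print(corr.round(2))
```

In the `x` row, `same_direction` gives 1, `opposite_direction` gives -1, and `unrelated` lands close to 0. The size of the slope (2 or -3) does not matter, only how well the points follow a straight line.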

#### Correlation heatmap
Calculate correlation for the variables in the data and plot a heatmap.

    sns.heatmap()

In [None]:
# Display a heatmap plot using the dataset of pairwise correlations
sns.heatmap(????)

How does this compare with the Pair Plot we did earlier?  Let's do it again just to compare.

In [None]:
sns.pairplot(data_for_pair, height=1.5)

Learn more about the Heatmap plot here : https://seaborn.pydata.org/generated/seaborn.heatmap.html

# Improve formatting of correlation plot

We can do a lot to improve the visual appearance of our plots. Let's look at an example.

#### Only show lower part of the plot
    mask
    np.zeros_like(): Return an array of zeros with the same shape and type as a given array
    np.triu_indices_from(): Return the indices for the upper-triangle of arr
    sns.heatmap()

In [None]:
mask = np.zeros_like(data_correlations)
mask[np.triu_indices_from(mask)] = True
sns.heatmap(data_correlations, mask=mask)
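
To see which cells such a mask hides, we can build one for a small 3-by-3 array and print it. In this sketch, cells marked True are the ones the heatmap leaves blank. The variant with `k=1` is an optional alternative that keeps the diagonal visible.

```python
import numpy as np

# Mask for a 3x3 matrix: True = hide this cell in the heatmap
mask = np.zeros((3, 3), dtype=bool)
mask[np.triu_indices_from(mask)] = True
print(mask)
# [[ True  True  True]
#  [False  True  True]
#  [False False  True]]

# Alternative: start one step above the diagonal, so the diagonal stays visible
mask_keep_diagonal = np.zeros((3, 3), dtype=bool)
mask_keep_diagonal[np.triu_indices_from(mask_keep_diagonal, k=1)] = True
print(mask_keep_diagonal)
# [[False  True  True]
#  [False False  True]
#  [False False False]]
```

The diagonal of a correlation matrix is always 1 (each variable with itself), so hiding it loses no information.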

#### Add labels 
    annot=True
    plt.subplots(figsize=(x,y)): increase figure size to x-by-y inches

In [None]:
plt.subplots(figsize=(5,5))
sns.heatmap(data_correlations, mask=mask, annot=True)

#### Change color scheme

Seaborn (and matplotlib) have a lot of good predefined colormaps that we can use:
[Here's the list of available colormaps, ordered by category](https://matplotlib.org/stable/tutorials/colors/colormaps.html)

#### Which of these do you think is best in this case?

![](./images/colormap_examples.png)

We can also fix the lower and upper limits of the color map with the arguments vmin/vmax.
Since we know that correlation coefficients go from -1 to 1, this makes sense to fix.

In [None]:
plt.subplots(figsize=(5,5))
sns.heatmap(data_correlations, mask=mask, annot=True, cmap=?????, vmin=-1, vmax=1);

In correlation heatmaps, we can include more variables and still have a pretty good overview of everything:

In [None]:
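# corr() needs numeric input, so keep only the numeric columns of the full table
data_correlations_all = data.select_dtypes(include='number').corr()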

mask = np.zeros_like(data_correlations_all)
mask[np.triu_indices_from(mask)] = True

plt.subplots(figsize=(15,15))
sns.heatmap(data_correlations_all, mask=mask, annot=True, cmap="coolwarm", vmin=-1, vmax=1)

## We can see that we have data that correlates with our output variable 'geometry_final'!

## If we want to know **how** they relate to each other, we have to do some modelling!

# Bonus: customizing plots with matplotlib functions
A lot of work is done by seaborn to make the plots look nice directly, but what if we want to change things?

The standard we get from seaborn in this case is this:

In [None]:
sns.relplot(data=data, kind='line', x='time', y='glass_temp_zone1', aspect=2, hue='recipe')

We can adjust titles, labels, fontsizes, rotation, background, etc. to make it fit our needs better:

In [None]:
sns.relplot(data=data, kind='line', x='time', y='glass_temp_zone1', aspect=2, hue='recipe')

# Customize the plot that was just drawn using matplotlib functions
plt.title('Temperature in glass zone 1 for latest production', fontsize=20)

plt.xlabel('Timestamp at chamber 18 [MM-DD HH]', fontsize=15)
plt.xticks(rotation=45)

plt.ylabel(r'Temperature glass zone 1 [$^\circ$C]', fontsize=15) # Text between $...$ is rendered as math; the r prefix (raw string) keeps the backslash intact

plt.ylim(651, 657) # "Zooming" in/out to defined limits on the y-axis (plt.xlim does the same for the x-axis)

plt.grid(True)

# Bonus: defining your own color palettes

You can define your own palette of colors by setting the 'palette' parameter:

In [None]:
my_palette = {'Recipe-A': 'grey', 'Recipe-B': 'red', 'Recipe-C':'grey'}
sns.catplot(data=data, kind='box', x='recipe', y='glass_temp_zone1', palette=my_palette);


'my_palette' is a 'dictionary' which is yet another type of object used in Python. We won't go into the details of dictionaries here, but if you're interested, you can read more about them here: https://www.w3schools.com/python/python_dictionaries.asp

#### We can also make a palette from a list of colors, without specifying which group they belong to. 

#### Why not make one for Saint-Gobain's color scheme?

In [None]:
SG_colors = ['#4DB1B3', '#0195D6', '#0F5299', '#C5284C', '#E83430', '#E66407',]

# n_colors says how many colors the palette contains; with more than 6 in this case, the colors loop around
SG_palette = sns.color_palette(SG_colors, n_colors=20) 

# With set_palette(), we can choose this new palette as the default for all plots
sns.set_palette(SG_palette)
#sns.set_palette('deep') # Uncomment this line if you want to use the default palette again

sns.catplot(data=data, kind='bar', x='batch', y='geometry_final', height=4, aspect=2)

sns.relplot(data=data, kind='scatter', x='geometry_final', y='glass_temp_zone1', col='recipe', hue='batch', height=4,)
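
The looping behaviour of **n_colors** can be checked directly on a shorter list. This sketch takes three of the colors above and asks for five.

```python
import seaborn as sns

# Three colors, but we ask for a palette of five
three_colors = ['#4DB1B3', '#0195D6', '#0F5299']
palette = sns.color_palette(three_colors, n_colors=5)

print(len(palette))      # 5
print(palette.as_hex())  # colors 4 and 5 repeat colors 1 and 2
```

If a plot has more groups than the palette has distinct colors, two groups will share a color, so it is worth checking the number of groups first.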

## Change seaborn style/theme

Seaborn has some different themes to automatically format all plots in another way. 
Uncomment one of the lines in the cell below when you're done, execute it, and then execute the whole notebook again!

- set_context() sets things like the size of the labels, lines, and other elements of the plot, but not the overall style.
    - 'paper', 'notebook', 'talk', 'poster'
- set_style()  affects things like the color of the axes, whether a grid is enabled by default, and other aesthetics.
    - 'darkgrid', 'whitegrid', 'dark', 'white', 'ticks'

What changed? Does it look nicer?

The changes will stay until the Python kernel (notebook) is restarted or until they are changed back with a similar command.

In [None]:
#sns.set_theme(style="darkgrid")
#sns.set_context("talk")