### More plotting with `matplotlib` and `seaborn`

Today we continue working with `matplotlib`, focusing on customization and subplots.  We also introduce the `seaborn` library as a second visualization library with additional functionality for plotting data.

In [None]:
#!pip install -U seaborn

In [None]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

#### `3D Plotting`

Additional projections are available, including polar and three-dimensional projections.  These can be accessed through the `projection` argument of the `axes` functions.

- [3d plotting](https://matplotlib.org/stable/gallery/mplot3d/index.html)

In [None]:
def f(x, y):
    return x**2 - y**2
x = np.linspace(-3, 3, 20)
y = np.linspace(-3, 3, 20)
X, Y = np.meshgrid(x, y)

In [None]:
ax = plt.axes(projection = '3d')
ax.plot_wireframe(X, Y, f(X, Y))
ax.set_title('Using 3d projection');

In [None]:
r = np.arange(0, 2, 0.01)
theta = 2 * np.pi * r
ax = plt.axes(projection = 'polar')
ax.plot(theta, r)
ax.set_title('Basic polar coordinate plot');
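
Because `projection` is set per axes, a single figure can mix projection types. The sketch below is one way to do this with `fig.add_subplot`, reusing the same surface and spiral as the cells above; the variable names `ax_3d` and `ax_polar` are illustrative only.

```python
import numpy as np
import matplotlib.pyplot as plt

# Same surface and spiral as in the cells above
x = np.linspace(-3, 3, 20)
X, Y = np.meshgrid(x, x)
Z = X**2 - Y**2

r = np.arange(0, 2, 0.01)
theta = 2 * np.pi * r

fig = plt.figure(figsize=(10, 4))

# Left axes: three dimensional projection
ax_3d = fig.add_subplot(1, 2, 1, projection='3d')
ax_3d.plot_wireframe(X, Y, Z)
ax_3d.set_title('3d projection')

# Right axes: polar projection
ax_polar = fig.add_subplot(1, 2, 2, projection='polar')
ax_polar.plot(theta, r)
ax_polar.set_title('polar projection')
```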

### `GridSpec`

If you want to change the layout and organization of the subplots, the `GridSpec` object lets you specify additional information such as the width and height ratios of the subplots.  The examples below are from the documentation [here](https://matplotlib.org/stable/gallery/userdemo/demo_gridspec03.html#gridspec-demo).

In [None]:
from matplotlib.gridspec import GridSpec

In [None]:
#helper for annotating
def annotate_axes(fig):
    for i, ax in enumerate(fig.axes):
        ax.text(0.5, 0.5, "ax%d" % (i+1), va="center", ha="center")
        ax.tick_params(labelbottom=False, labelleft=False)

In [None]:
fig = plt.figure()
fig.suptitle("Controlling subplot sizes with width_ratios and height_ratios")

gs = GridSpec(2, 2, width_ratios=[1, 2], height_ratios=[4, 1])
ax1 = fig.add_subplot(gs[0])
ax2 = fig.add_subplot(gs[1])
ax3 = fig.add_subplot(gs[2])
ax4 = fig.add_subplot(gs[3])

annotate_axes(fig)

In [None]:
fig = plt.figure()
fig.suptitle("Controlling spacing around and between subplots")

gs1 = GridSpec(3, 3, left=0.05, right=0.48, wspace=0.05)
ax1 = fig.add_subplot(gs1[:-1, :])
ax2 = fig.add_subplot(gs1[-1, :-1])
ax3 = fig.add_subplot(gs1[-1, -1])

gs2 = GridSpec(3, 3, left=0.55, right=0.98, hspace=0.05)
ax4 = fig.add_subplot(gs2[:, :-1])
ax5 = fig.add_subplot(gs2[:-1, -1])
ax6 = fig.add_subplot(gs2[-1, -1])

annotate_axes(fig)

plt.show()

#### Exercise

Use `GridSpec` to write a function that takes in a column from a `DataFrame` (a `Series` object) and returns a 2-row, 1-column plot where the bottom plot is a histogram and the top plot is a boxplot, similar to the image below.




![](images/example_histbox.png)
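
One possible approach to this exercise, offered as a minimal sketch rather than the intended solution: the helper name `hist_box`, the `bins` parameter, the 1:4 height ratio, and the synthetic sample are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.gridspec import GridSpec

def hist_box(series, bins=20):
    """Boxplot on top, histogram below, sharing one x-axis (hypothetical helper)."""
    fig = plt.figure(figsize=(8, 5))
    # Two rows, one column; the histogram gets four times the boxplot's height
    gs = GridSpec(2, 1, height_ratios=[1, 4], hspace=0.05)
    ax_box = fig.add_subplot(gs[0])
    ax_hist = fig.add_subplot(gs[1], sharex=ax_box)

    values = series.dropna().to_numpy()
    ax_box.boxplot(values, vert=False)  # horizontal box to line up with the histogram
    ax_box.tick_params(labelbottom=False, labelleft=False)
    ax_hist.hist(values, bins=bins)
    ax_hist.set_xlabel(series.name)
    return fig

# Synthetic data, used only to exercise the function
rng = np.random.default_rng(0)
sample = pd.Series(rng.normal(size=200), name='synthetic values')
fig = hist_box(sample)
```

Sharing the x-axis keeps the box and the histogram bars aligned, so the median and quartiles can be read directly against the distribution.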

#### Introduction to `seaborn`

The `seaborn` library is built on top of `matplotlib` and offers high level visualization tools for plotting data.  Typically a call to the `seaborn` library looks like:

```
sns.plottype(data = DataFrame, x = x, y = y, additional arguments...)
```

In [None]:
### load a sample dataset on tips
tips = sns.load_dataset('tips')
tips.head(2)

In [None]:
### boxplot of tips
sns.boxplot(data = tips, x = 'tip')

In [None]:
### boxplot of tips by day
sns.boxplot(data = tips, x = 'day', y = 'tip')

#### `hue`

The `hue` argument works like a grouping helper with `seaborn`.  Plots that have this argument will break the data into groups from the passed column and add an appropriate legend.

In [None]:
### boxplot of tips by day by smoker
sns.boxplot(data = tips, x = 'day', y = 'tip', hue = 'smoker')

#### `displot`

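`displot` visualizes one-dimensional distributions of data.  Its `kind` argument switches between histogram (`'hist'`, the default), kernel density (`'kde'`), and empirical cumulative distribution (`'ecdf'`) views.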

In [None]:
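### histogram of tips
sns.displot(data = tips, x = 'tip')

In [None]:
### kde plot
sns.displot(data = tips, x = 'tip', kind = 'kde')

In [None]:
### empirical cumulative distribution plot of tips by smoker
sns.displot(data = tips, x = 'tip', hue = 'smoker', kind = 'ecdf')

In [None]:
### using the col argument
sns.displot(data = tips, x = 'tip', col = 'smoker')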

In [None]:
#draw a histogram and a boxplot using seaborn on two axes
fig, ax = plt.subplots(1, 2, figsize = (15, 5))
sns.histplot(data = tips, x = 'tip', ax = ax[0])
sns.boxplot(data = tips, x = 'tip', ax = ax[1])

#### `relplot`

`relplot` visualizes the relationship between two variables, drawing a scatter plot by default.

In [None]:
### relplot of bill vs. tip
sns.relplot(data = tips, x = 'total_bill', y = 'tip')

In [None]:
### regression plot (lowess = True requires the statsmodels package)
sns.regplot(data = tips, x ='total_bill', y = 'tip', lowess = True )

In [None]:
### swarm
sns.swarmplot(data = tips, x = 'smoker', y = 'tip')

In [None]:
### violin plot
sns.violinplot(data = tips, x = 'smoker', y = 'tip')

In [None]:
### countplot
sns.countplot(data = tips, x = 'smoker');

1. Create a histogram of flipper length by species.  

In [None]:
penguins = sns.load_dataset('penguins')
penguins.head()

2. Create a scatterplot of bill length vs. flipper length colored by species.

3. Create a violin plot of body mass for each species, split by sex.

#### Additional Plots

- `pairplot`
- `heatmap`

In [None]:
penguins = sns.load_dataset('penguins').dropna()

In [None]:
### pairplot of penguins colored by species
sns.pairplot(data = penguins, hue = 'species')

In [None]:
### housing data
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing(as_frame = True).frame
housing.head()

#### Plotting Correlations

Correlation captures the strength of a linear relationship between features.  A correlation table is often easier to scan than a set of scatterplots when looking for relationships.  Recall, however, that correlation only detects *linear* relationships!
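
The limitation to linear relationships can be seen in a small synthetic example (the arrays below are made up for illustration). A quadratic relationship is completely determined by `x`, yet its correlation coefficient with `x` is close to zero when `x` is symmetric around zero, while a linear relationship gives a coefficient close to one.

```python
import numpy as np

x = np.linspace(-3, 3, 101)   # symmetric around zero
linear = 2 * x + 1            # perfectly linear in x
quadratic = x ** 2            # fully determined by x, but not linear

# Pearson correlation coefficient of x with each series
r_linear = np.corrcoef(x, linear)[0, 1]
r_quadratic = np.corrcoef(x, quadratic)[0, 1]

print(round(r_linear, 3), round(r_quadratic, 3))
```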

In [None]:
### correlation in data
housing.corr()

In [None]:
### heatmap of correlations
sns.heatmap(housing.corr(), annot = True)


#### Problems

Use the `diabetes` data below loaded from OpenML ([docs](https://www.openml.org/search?type=data&sort=runs&status=active&id=37)).  

In [None]:
from sklearn.datasets import fetch_openml

In [None]:
diabetes = fetch_openml(data_id = 37).frame

In [None]:
diabetes.head()

1. Distribution of ages separated by class.

2. Heatmap of features.  Any strong correlations?

In [None]:
plt.figure(figsize = (20, 5))
# numeric_only = True skips non-numeric columns such as the class label
sns.heatmap(diabetes.corr(numeric_only = True), annot = True, cmap = 'BuPu')

3. **CHALLENGE**: Create a figure with 2 rows and 4 columns of histograms, one per feature, with each histogram separated by the `class` column.  Which feature has the most distinct difference between classes?

In [None]:
fig, ax = plt.subplots(2, 4, figsize = (20, 10))
for row in range(2):
    for col in range(4):
        pass
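
The grid-filling pattern for the challenge can be sketched on synthetic data. This is one possible approach under stated assumptions, not the intended solution: the `demo` frame, its `feature_0` to `feature_7` column names, and the class labels are all made up, and plain `matplotlib` histograms stand in for whichever plotting call you prefer.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in: eight numeric features plus a two-level class column
rng = np.random.default_rng(0)
features = [f'feature_{i}' for i in range(8)]
demo = pd.DataFrame(rng.normal(size=(200, 8)), columns=features)
demo['class'] = rng.choice(['negative', 'positive'], size=200)

fig, ax = plt.subplots(2, 4, figsize=(20, 10))
# ax is a 2 x 4 array; ax.flat walks it row by row, one axes per feature
for axis, feature in zip(ax.flat, features):
    # Overlay one semi-transparent histogram per class
    for label, group in demo.groupby('class'):
        axis.hist(group[feature], bins=15, alpha=0.5, label=label)
    axis.set_title(feature)
ax[0, 0].legend(title='class')
```

Iterating over `ax.flat` avoids the nested row and column loops and the index arithmetic needed to map each grid position to a feature.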


#### EDA Case Study

**Introduction**

This case study aims to give an idea of applying EDA in a real business scenario. In this case study, we will develop a basic understanding of risk analytics in banking and financial services and understand how data is used to minimize the risk of losing money while lending to customers.

**Business Understanding**

Lenders find it hard to extend loans to applicants with insufficient or non-existent credit history, and some consumers take advantage of this by defaulting. Suppose we work for a consumer finance company that specializes in lending various types of loans to urban customers. We will use EDA to analyze the patterns present in the data, so that applicants capable of repaying the loan are not rejected.

When the company receives a loan application, it has to decide whether to approve the loan based on the applicant’s profile. Two types of risk are associated with this decision:

- If the applicant is likely to repay the loan, then not approving it results in a loss of business for the company.
- If the applicant is not likely to repay the loan, i.e. is likely to default, then approving it may lead to a financial loss for the company.

The data given below contains information about each loan application at the time of applying. It covers two types of scenarios:

- The client with payment difficulties: the client was more than X days late on at least one of the first Y instalments of the loan in our sample.
- All other cases: the payment was made on time.

When a client applies for a loan, there are four possible outcomes, decided by either the client or the company:

- Approved: the company has approved the loan application.
- Cancelled: the client cancelled the application sometime during approval, either because they changed their mind about the loan or, in some cases, because a higher risk rating led to worse pricing that they did not want.
- Refused: the company rejected the loan (because the client does not meet its requirements, etc.).
- Unused offer: the loan was cancelled by the client, but at a different stage of the process.

In this case study, we will use EDA to understand how consumer attributes and loan attributes influence the tendency of default.

**Business Objectives**

This case study aims to identify patterns that indicate whether a client has difficulty paying their installments. These patterns may be used for taking actions such as denying the loan, reducing the loan amount, or lending (to risky applicants) at a higher interest rate. This will ensure that consumers capable of repaying the loan are not rejected. Identifying such applicants using EDA is the aim of this case study.

In other words, the company wants to understand the driving factors (or driver variables) behind loan default, i.e. the variables which are strong indicators of default. The company can utilize this knowledge for its portfolio and risk assessment.

**Data Understanding**

This dataset has 3 files, as explained below:

- `application_data.csv` contains all the information about the client at the time of application. The data is about whether a client has payment difficulties.
- `previous_application.csv` contains information about the client’s previous loan data, including whether the previous application was Approved, Cancelled, Refused, or an Unused offer.
- `columns_description.xlsx` is a data dictionary that describes the meaning of the variables.


These files are located in the class repo, in the `eda_case_data` folder inside the `data` folder.  I have also shared the data files in a drive folder [here](https://drive.google.com/drive/folders/1YpPjU4Y12MPrMdWaNDa6rS2MNirh2FG3?usp=sharing).

#### Deliverable

Your group should produce a brief presentation of your findings (5 - 10 slides).  In addition to your findings, your slides should include actionable insights as a result of your exploration connected to your visualizations.  You will submit this presentation prior to next class together with a notebook containing all your visualizations.