<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Import-Required-Libraries" data-toc-modified-id="Import-Required-Libraries-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Import Required Libraries</a></span></li><li><span><a href="#Read-data-and-output-basic-attribute-information" data-toc-modified-id="Read-data-and-output-basic-attribute-information-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Read data and output basic attribute information</a></span></li><li><span><a href="#Some-more-data-analysis" data-toc-modified-id="Some-more-data-analysis-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Some more data analysis</a></span></li><li><span><a href="#Basic-inspection-of-the-numerical-attribute-'age'" data-toc-modified-id="Basic-inspection-of-the-numerical-attribute-'age'-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Basic inspection of the numerical attribute 'age'</a></span></li><li><span><a href="#Basic-inspection-of-the-categorical-attribute-'job'" data-toc-modified-id="Basic-inspection-of-the-categorical-attribute-'job'-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Basic inspection of the categorical attribute 'job'</a></span></li><li><span><a href="#More-analysis-of-the-'age'-attribute" data-toc-modified-id="More-analysis-of-the-'age'-attribute-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>More analysis of the 'age' attribute</a></span></li><li><span><a href="#Outlier-analysis--of-the-'age'-attribute" data-toc-modified-id="Outlier-analysis--of-the-'age'-attribute-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Outlier analysis  of the 'age' attribute</a></span></li><li><span><a href="#A-single-scatter-plot" data-toc-modified-id="A-single-scatter-plot-8"><span class="toc-item-num">8&nbsp;&nbsp;</span>A single scatter plot</a></span></li><li><span><a href="#Illustrate-Discretization-of-an-attribute-with-scikit-learn's-KBinsDiscretizer" 
data-toc-modified-id="Illustrate-Discretization-of-an-attribute-with-scikit-learn's-KBinsDiscretizer-9"><span class="toc-item-num">9&nbsp;&nbsp;</span>Illustrate Discretization of an attribute with scikit-learn's KBinsDiscretizer</a></span></li></ul></div>

### Import Required Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

### Read data and output basic attribute information

In [None]:
df = pd.read_csv('../data/bank-full.csv')
df.info()

### Some more data analysis

In [None]:
df_count = len(df)
print('Data set contains {0} entries'.format(df_count))

A few example data items

In [None]:
df.head()

### Basic inspection of the numerical attribute 'age'

In [None]:
age_values = df['age']

# range
min_value = age_values.min()
max_value = age_values.max()
print('Min age: ', min_value)
print('Max age: ', max_value)
print('Null Values: ', age_values.isnull().any())

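Draw a histogram of the 'age' values. Because age is integer-valued, a seaborn count plot with one bar per distinct age serves as the histogram.

In [None]:
fig, ax = plt.subplots(figsize = (20, 8))
# pass the values by keyword: recent seaborn releases no longer accept them positionally
sns.countplot(x = age_values, ax = ax)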
ax.set_title('Age Distribution', fontsize=15)
sns.despine()

### Basic inspection of the categorical attribute 'job' 
First, its range, i.e. the set of distinct values:

In [None]:
print(df['job'].unique())

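Next, a count plot showing how often each job occurs:

In [None]:
fig, ax = plt.subplots(figsize = (13, 5))
# pass the values by keyword: recent seaborn releases no longer accept them positionally
sns.countplot(x = df['job'], ax = ax)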
ax.set_title('Job Distribution', fontsize=15)
sns.despine(ax = ax)
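
The bar heights in a count plot can also be tabulated directly. The following is a minimal sketch on a small made-up series (`jobs_demo` is hypothetical toy data, not read from the data set); it assumes only that the column is a pandas Series, as `df['job']` is.

```python
import pandas as pd

# Hypothetical toy data standing in for df['job']
jobs_demo = pd.Series(['admin.', 'technician', 'admin.',
                       'services', 'admin.', 'technician'])

# Absolute frequency per category, most frequent first
counts = jobs_demo.value_counts()

# Relative frequency per category (fractions that sum to 1)
shares = jobs_demo.value_counts(normalize=True)

print(counts)
print(shares.round(2))
```

Applied to the real column, the same two calls would give the numbers behind the bars of the plot above.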

### More analysis of the 'age' attribute
First, determine the 25%, 50%, and 75% quantiles

In [None]:
Q1 = np.quantile(age_values, .25)
Q2 = np.quantile(age_values, .50)
Q3 = np.quantile(age_values, .75)
'Quantiles : 25 % : {0}, 50 % : {1}, 75 % : {2}'.format(Q1,Q2,Q3)
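
As a cross-check, pandas can compute the same quantiles directly on a Series. This is a small sketch on made-up numbers (`ages_demo` is hypothetical toy data, not taken from the data set), comparing `Series.quantile` with `np.quantile`, both using their default linear interpolation.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for age_values
ages_demo = pd.Series([22, 25, 31, 35, 39, 44, 52, 60, 75])

probs = [0.25, 0.50, 0.75]

# Series.quantile accepts a list of probabilities and returns a Series
# indexed by those probabilities
quartiles = ages_demo.quantile(probs)

# np.quantile on the same data, for comparison
quartiles_np = np.quantile(ages_demo.to_numpy(), probs)

print(quartiles)
print(np.allclose(quartiles.to_numpy(), quartiles_np))
```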

Now draw a boxplot and the distribution of the attribute values.  
The distribution plot also displays a Gaussian kernel density estimate.

In [None]:
fig, (ax1, ax2) = plt.subplots(nrows = 1, ncols = 2, figsize = (13, 5))

sns.boxplot(y = age_values, ax = ax1)  # y= draws a vertical box
ax1.set_ylabel('Age', fontsize=15)
ax1.tick_params(labelsize=15)

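# distplot is deprecated in recent seaborn releases; histplot with
# stat='density' draws the same kind of normalized histogram plus KDE
sns.histplot(x = age_values, kde = True, stat = 'density', ax = ax2)
sns.despine(ax = ax2)
ax2.set_xlabel('Age', fontsize=15)
ax2.set_ylabel('Density', fontsize=15)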
ax2.tick_params(labelsize=15)

plt.subplots_adjust(wspace=0.5)

### Outlier analysis  of the 'age' attribute
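First determine the interquartile range, IQR = Q3 - Q1. Values inside [Q1 - 1.5 IQR, Q3 + 1.5 IQR] count as 'normal'; the code below additionally clips this interval to the observed minimum and maximum.  
The boxplot above uses the same 1.5 IQR rule to place its whiskers.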

In [None]:
IQR = Q3 - Q1
upper_limit = min(Q3 + 1.5*IQR, max_value)
lower_limit = max(Q1 - 1.5*IQR, min_value)
print('Normal Range : [{0},{1}]'.format(lower_limit, upper_limit))

Count the values that fall outside this range

In [None]:
outlier_count = age_values[(age_values > upper_limit) | (age_values < lower_limit)].count()
print('Number of outliers: {0}'.format(outlier_count))
print('Percentage Outliers : {0} %'.format(round(outlier_count*100/df_count,2)))
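
The 1.5 IQR rule used above can be wrapped in a small helper. The sketch below is illustrative only: `iqr_fences` and the array `toy` are hypothetical names and made-up data, and unlike the cell above the helper does not clip the limits to the observed minimum and maximum.

```python
import numpy as np

def iqr_fences(values, k=1.5):
    """Return (lower, upper) limits Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = np.quantile(values, [0.25, 0.75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical toy data with one obviously extreme value
toy = np.array([22, 25, 31, 35, 39, 44, 52, 60, 95])

lower, upper = iqr_fences(toy)

# Boolean mask selects everything outside the 'normal' range
outliers = toy[(toy < lower) | (toy > upper)]

print('Normal range: [{0}, {1}]'.format(lower, upper))
print('Outliers:', outliers)
```

Here Q1 = 31 and Q3 = 52, so IQR = 21 and the limits are -0.5 and 83.5; only the value 95 lies outside them.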

### A single scatter plot
Again, this can be created easily with seaborn

In [None]:
sns.pairplot(df, x_vars=['age'], y_vars=['job'], hue="y", height=10)

### Illustrate Discretization of an attribute with scikit-learn's KBinsDiscretizer
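We discretize into 4 bins with three different strategies: 'uniform' (bins of equal width), 'quantile' (bins holding roughly the same number of values), and 'kmeans' (bins derived from a one-dimensional k-means clustering)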

In [None]:
from sklearn.preprocessing import KBinsDiscretizer

n_bins = 4
strategies = ['uniform', 'quantile', 'kmeans']

binning_outputs = {}
X = age_values.to_numpy() # age_values is a pandas series, the discretizer needs a numpy array
for strategy in strategies:
    # the discretization requires only two lines of code :
    # 1. create the discretizer object for that strategy
    enc = KBinsDiscretizer(n_bins=n_bins, encode='ordinal', strategy=strategy)
    # 2. fit the discretizer and transform input values
    X_binned = enc.fit_transform(X.reshape(-1,1)) 
    # X_binned is 2-dimensional --> convert to flat array
    binning_outputs[strategy] = X_binned.ravel()
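
To make the difference between the 'uniform' and 'quantile' strategies concrete, the sketch below computes both kinds of bin edges by hand with NumPy. It illustrates the idea only and is not scikit-learn's implementation; `x_demo`, `n_bins_demo`, and `assign_bins` are hypothetical names, and the data is made up.

```python
import numpy as np

# Hypothetical toy data, skewed towards small values
x_demo = np.array([18, 20, 22, 25, 30, 35, 40, 60, 90], dtype=float)
n_bins_demo = 3

# Equal-width edges: split [min, max] into intervals of the same length
uniform_edges = np.linspace(x_demo.min(), x_demo.max(), n_bins_demo + 1)

# Equal-frequency edges: place the edges at evenly spaced quantiles
quantile_edges = np.quantile(x_demo, np.linspace(0, 1, n_bins_demo + 1))

def assign_bins(values, edges):
    """Map each value to a bin index 0..len(edges)-2 using the interior edges."""
    return np.digitize(values, edges[1:-1])

uniform_counts = np.bincount(assign_bins(x_demo, uniform_edges),
                             minlength=n_bins_demo)
quantile_counts = np.bincount(assign_bins(x_demo, quantile_edges),
                              minlength=n_bins_demo)

print('uniform edges :', uniform_edges, 'counts:', uniform_counts)
print('quantile edges:', quantile_edges, 'counts:', quantile_counts)
```

On this skewed toy data the equal-width bins hold 7, 1, and 1 values, while the quantile bins hold 3 values each, which is the kind of contrast the bin counts in the plot below make visible for 'age'.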

Plot the resulting bins

In [None]:
# matplotlib.cm.get_cmap was removed in recent matplotlib releases;
# plt.get_cmap returns the same registered colormap
cmap = plt.get_cmap("Accent")

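def plot_bins(ax, X, X_binned):
    """Draw the values of X on a horizontal strip, colored by bin index."""
    ax.set_xticks([]) # hide the axis ticks
    ax.set_yticks([])
    Y = np.zeros_like(X) # all points share the same height
    ax.scatter(X, Y, c=X_binned, cmap = cmap,  marker = 'x')
    ax.set_ylim(-1,1)
    # annotate the strip with the number of values in each bin
    _, counts = np.unique(X_binned, return_counts=True)
    ax.text(60, -0.5, str(counts), horizontalalignment='center')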
    
fig, axarr = plt.subplots(len(strategies), figsize=(10,5))

axarr[0].set_title("Binning Strategies for 'age' attribute", size='large')
    
for ax, strategy in zip(axarr, strategies):
    ax.set_ylabel(strategy, size='large')
    plot_bins(ax, X, binning_outputs[strategy])
    
fig.tight_layout()
fig.subplots_adjust(top=0.88)

plt.show()