# Clustering

### Dataset: Marketing in Banking

Source: https://www.kaggle.com/janiobachmann/bank-marketing-dataset

(Original source: https://archive.ics.uci.edu/ml/datasets/bank+marketing)

#### Input variables:

##### bank client data:
 1. age (numeric)
 2. job : type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')
 3. marital : marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)
 4. education (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')
 5. default: has credit in default? (categorical: 'no','yes','unknown')
 6. housing: has housing loan? (categorical: 'no','yes','unknown')
 7. loan: has personal loan? (categorical: 'no','yes','unknown')

##### related to the last contact of the current campaign:
 8. contact: contact communication type (categorical: 'cellular','telephone') 
 9. month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')
 10. day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')
 11. duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.

##### other attributes:
 12. campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
 13. pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
 14. previous: number of contacts performed before this campaign and for this client (numeric)
 15. poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')

##### social and economic context attributes
 16. emp.var.rate: employment variation rate - quarterly indicator (numeric)
 17. cons.price.idx: consumer price index - monthly indicator (numeric) 
 18. cons.conf.idx: consumer confidence index - monthly indicator (numeric) 
 19. euribor3m: euribor 3 month rate - daily indicator (numeric)
 20. nr.employed: number of employees - quarterly indicator (numeric)
 

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

%matplotlib inline

## Load Data

In [None]:
df = pd.read_csv('bank.csv')
df.head()

In [None]:
df.info()

In [None]:
# Ignore the deposit column (which is actually the label for
# a different task - classification)

# Here, we want to discover clusters in the features
features = df.columns[df.columns != 'deposit']

## Data Cleaning

### Missing Values

In [None]:
df.isna().sum()

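No NaN values to clean! Note, however, that `isna()` only detects true NaNs: several categorical features record missing information as the string 'unknown' (see the variable list above).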
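Because the variable list above uses 'unknown' as a category, it can be worth counting those entries too. The following is a minimal sketch on a small made-up frame (`demo` and its values are hypothetical stand-ins, not rows from `bank.csv`); in the notebook the same comparison could be applied to `df[features]`.

```python
import pandas as pd

# ASSUMPTION: tiny hypothetical frame standing in for df[features]
demo = pd.DataFrame({
    'age': [30, 45, 52, 38],
    'job': ['admin.', 'unknown', 'services', 'unknown'],
    'education': ['unknown', 'high.school', 'basic.9y', 'basic.4y'],
})

# isna() sees nothing, because 'unknown' is an ordinary string
nan_counts = demo.isna().sum()

# count the 'unknown' placeholders per column instead
unknown_counts = (demo == 'unknown').sum()

print(nan_counts)
print(unknown_counts)
```

Whether to keep 'unknown' as its own category or treat it as missing is a modelling decision; this sketch only makes the counts visible.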

### Categorical Features

There are quite a few categorical features to encode.

In [None]:
for f in features:
    if df[f].dtype == 'object':
        print(f, df[f].unique())

In [None]:
from sklearn.preprocessing import LabelEncoder

# store encoders in a dictionary so that we
# can refer to them easily for using / saving / loading
encoders = dict()

# Note: LabelEncoder assigns integer codes in sorted order of the
# class labels, so the numeric order carries no real meaning
for f in features:
    if df[f].dtype == 'object':
        encoder = LabelEncoder()
        # fit and transform in one step, replacing the column in place
        df[f] = encoder.fit_transform(df[f])
        # or if you don't want to replace columns:
        # df[f + '_enc'] = encoder.fit_transform(df[f])
        encoders[f] = encoder
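Integer codes imply an ordering and spacing between categories, which a distance-based method such as K-means will treat as real. One possible alternative, shown here only as a sketch and not as the approach this notebook takes, is one-hot encoding with `pd.get_dummies`. The frame `demo` is a hypothetical stand-in for the bank data.

```python
import pandas as pd

# ASSUMPTION: tiny hypothetical frame standing in for df[features]
demo = pd.DataFrame({
    'age': [30, 45, 52, 38],
    'marital': ['married', 'single', 'divorced', 'married'],
})

# one indicator column per category; numeric columns pass through unchanged
one_hot = pd.get_dummies(demo, columns=['marital'], dtype=int)

print(one_hot)
```

The trade-off is dimensionality: a feature such as `job` with twelve categories becomes twelve columns, which changes how much weight that feature carries in distance calculations.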

In [None]:
# inspect the encoded classes
for k, v in encoders.items():
    print(k, v.classes_)
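Keeping the fitted encoders around also makes it possible to map integer codes back to the original labels, which is handy when interpreting clusters later. A minimal, self-contained sketch on a hypothetical `marital` series:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# ASSUMPTION: hypothetical values standing in for df['marital']
marital = pd.Series(['married', 'single', 'divorced', 'married'])

encoder = LabelEncoder()
codes = encoder.fit_transform(marital)

# classes_ is sorted, and each code is an index into it
print(encoder.classes_)  # ['divorced' 'married' 'single']
print(codes)             # [1 2 0 1]

# recover the original labels from the integer codes
decoded = encoder.inverse_transform(codes)
print(decoded)
```

In the notebook, the equivalent call would be `encoders['marital'].inverse_transform(df['marital'])`, assuming the column was encoded in place as above.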

## Data Visualisation

In [None]:
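# standardise features so that no feature dominates because of its scale
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df[features])

# project onto the first two principal components for plotting
pca_2d = PCA(n_components=2)
Z_2d = pca_2d.fit_transform(X_scaled)

fig, ax = plt.subplots(figsize=(10, 8))
# small, semi-transparent markers keep dense regions readable
ax.scatter(Z_2d[:, 0], Z_2d[:, 1], s=10, alpha=0.5)
ax.set(xlabel='Z[0]', ylabel='Z[1]', title='2-d PCA plot')
plt.show()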
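How far a 2-d PCA plot can be trusted depends on how much of the total variance the two components retain. The sketch below checks this with `explained_variance_ratio_`. It uses random synthetic data as a stand-in for `df[features]`, so the numbers it prints say nothing about the bank dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# ASSUMPTION: synthetic stand-in for df[features] (5 columns, 2 of them correlated)
rng = np.random.default_rng(0)
X_raw = rng.normal(size=(300, 5))
X_raw[:, 1] = X_raw[:, 0] + 0.1 * rng.normal(size=300)

X_demo = StandardScaler().fit_transform(X_raw)

pca_demo = PCA(n_components=2)
pca_demo.fit(X_demo)

# fraction of total variance captured by each retained component
evr = pca_demo.explained_variance_ratio_
print(evr)
print('retained by 2 components:', evr.sum())
```

In the notebook the same attribute is available on the fitted `pca_2d` object. A low retained fraction would suggest that structure visible (or absent) in the scatter plot may not reflect the full feature space.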

## KMeans Clustering

- Run K-means with 2 or 3 clusters
- Examine the elbow plot to see whether another K fits better
- Explore relationships between clusters and features

### KMeans (clusters = 2)
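A minimal sketch of this step. To keep the block self-contained it builds a synthetic stand-in with `make_blobs`; in the notebook, those lines would be dropped and the `X_scaled` matrix from the visualisation cell reused. The names `kmeans_2` and `labels_2`, and the `n_init` and `random_state` values, are illustrative choices rather than tuned settings.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# ASSUMPTION: synthetic stand-in for the scaled bank features
X_raw, _ = make_blobs(n_samples=300, centers=2, n_features=5, random_state=0)
X_scaled = StandardScaler().fit_transform(X_raw)

# fit K-means with K=2 and get one cluster label per row
kmeans_2 = KMeans(n_clusters=2, n_init=10, random_state=42)
labels_2 = kmeans_2.fit_predict(X_scaled)

# project to 2-d and colour each point by its assigned cluster
Z_2d = PCA(n_components=2).fit_transform(X_scaled)
fig, ax = plt.subplots(figsize=(10, 8))
ax.scatter(Z_2d[:, 0], Z_2d[:, 1], c=labels_2, s=10, alpha=0.5)
ax.set(xlabel='Z[0]', ylabel='Z[1]', title='KMeans (K=2) on 2-d PCA projection')

print('cluster sizes:', np.bincount(labels_2))
print('inertia:', kmeans_2.inertia_)
```

Clustering is done on the full scaled feature matrix and PCA is used only for display, so clusters that overlap in the plot may still be separated in the original space.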

### Elbow Plot
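The elbow plot shows the K-means inertia (within-cluster sum of squared distances) for a range of K values; a bend in the curve suggests a K beyond which extra clusters add little. The sketch below is self-contained and uses synthetic data as a stand-in for `X_scaled`; the range of 1 to 10 is an arbitrary illustrative choice.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# ASSUMPTION: synthetic stand-in for the scaled bank features
X_raw, _ = make_blobs(n_samples=300, centers=4, n_features=5, random_state=0)
X_scaled = StandardScaler().fit_transform(X_raw)

# fit one model per candidate K and record its inertia
k_values = list(range(1, 11))
inertias = []
for k in k_values:
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    km.fit(X_scaled)
    inertias.append(km.inertia_)

fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(k_values, inertias, marker='o')
ax.set(xlabel='K', ylabel='Inertia', title='Elbow plot')
```

Inertia generally keeps falling as K grows, so the lowest value is not the target; the point of interest is where the decrease levels off. On real data the bend is often less clear than on synthetic blobs.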

### Exploring clusters
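One simple way to relate clusters to features is to attach the cluster label to the data frame and compare per-cluster summaries. The sketch below does this on a small synthetic frame; `demo`, its three columns, and their distributions are hypothetical and chosen only so the two groups are easy to tell apart.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# ASSUMPTION: synthetic frame standing in for df[features]
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    'age': np.concatenate([rng.normal(30, 4, n // 2), rng.normal(55, 4, n // 2)]),
    'duration': np.concatenate([rng.normal(150, 30, n // 2), rng.normal(600, 60, n // 2)]),
    'housing': rng.integers(0, 2, n),
})
demo_features = ['age', 'duration', 'housing']

# cluster on scaled values, but summarise on the original units
X_demo = StandardScaler().fit_transform(demo[demo_features])
demo['cluster'] = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X_demo)

cluster_sizes = demo['cluster'].value_counts().sort_index()
cluster_means = demo.groupby('cluster')[demo_features].mean()

print(cluster_sizes)
print(cluster_means.round(2))
```

For label-encoded categorical columns the mean of the codes is not meaningful; a cross-tabulation such as `pd.crosstab(df['cluster'], df['job'])` (assuming a `cluster` column has been added to `df`) is usually a better summary for those.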