# Model Selection
In this notebook, we focus on model selection. We first import the libraries we will be using, define some helper functions, and load the dataset used throughout the notebook.

In [207]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import time


%matplotlib inline
pd.plotting.register_matplotlib_converters()

from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.cluster import KMeans, DBSCAN
from sklearn.neighbors import LocalOutlierFactor
from sklearn.manifold import TSNE
from sklearn import metrics

sns.set(style = "ticks")

Next, we apply the same transformations as in the Load & Cleanse notebook to get the dataframe ready.

In [170]:
file_path = "../data/data.csv"
data = pd.read_csv(file_path, index_col = "consumer_id")

cols_with_na = [col for col in data.columns if data[col].isnull().any()]

data.drop(cols_with_na, axis = 1, inplace = True)

## Clustering and Labeling

The sample data is not labeled, so we cannot apply supervised algorithms directly. Instead, we will first label the data using a clustering algorithm and then train a tree-based classifier on those labels. This way, we will be able to classify unseen data with the resulting model.
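The cluster-then-classify idea can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual pipeline: the data is a synthetic stand-in generated with `make_blobs`, and `DecisionTreeClassifier` with `max_depth=3` is only one assumed choice of tree-based classifier.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the unlabeled consumer data (assumption).
X, _ = make_blobs(n_samples=300, centers=2, n_features=5, random_state=123)
X_scaled = StandardScaler().fit_transform(X)

# Step 1: derive labels from a clustering algorithm.
cluster_labels = KMeans(n_clusters=2, n_init=10, random_state=123).fit_predict(X_scaled)

# Step 2: train a tree-based classifier on the cluster labels.
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, cluster_labels, test_size=0.25, random_state=123, stratify=cluster_labels
)
clf = DecisionTreeClassifier(max_depth=3, random_state=123).fit(X_train, y_train)

# Fraction of held-out rows where the classifier agrees with the cluster label.
agreement = clf.score(X_test, y_test)
print("Agreement with cluster labels on held-out rows: {:.3f}".format(agreement))
```

Note that the score here only measures how well the classifier reproduces the cluster assignments; it says nothing about whether the clusters themselves are meaningful.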

In this section we will try a few different approaches: _k_-means, DBSCAN and LOF. Although LOF is not really a clustering method, I think it will be useful in our case for identifying outliers and labeling them. We will use the silhouette score to measure how well each algorithm separates the clusters.
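To illustrate how LOF can be used for labeling rather than clustering, here is a small sketch on synthetic data. The data, `n_neighbors=20` and `contamination=0.05` are all assumptions chosen for the example, not values taken from this notebook.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: a dense group of points plus a few scattered ones (assumption).
rng = np.random.RandomState(123)
inliers = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
scattered = rng.uniform(low=-8, high=8, size=(10, 2))
X = StandardScaler().fit_transform(np.vstack([inliers, scattered]))

# fit_predict returns 1 for inliers and -1 for outliers.
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
pred = lof.fit_predict(X)

# Turn the LOF output into two string labels.
labels = np.where(pred == -1, "outlier", "inlier")
print(dict(zip(*np.unique(labels, return_counts=True))))
```

The `contamination` parameter fixes the expected share of outliers up front, so the number of rows labeled as outliers is largely a modelling choice.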

We will use the following function to visualise the clusters with t-SNE.

In [240]:
def tsne_plot(d, m):
    """
    Visualise clusters in two dimensions with t-SNE.

    d: dataframe with a "label" column holding the cluster labels
    m: array of shape (n_samples, 2) with the t-SNE embedding of d
    """
    d_copy = d.copy()
    
    d_copy["tsne-d1"] = m[:, 0]
    d_copy["tsne-d2"] = m[:, 1]

    plt.figure(figsize=(10,10))
    sns.scatterplot(
        x="tsne-d1", y="tsne-d2",
        palette=sns.color_palette("hls", 2),
        hue = "label",
        data=d_copy,
        legend="full",
        alpha=0.3
    )

### _k_-means
We first standardise the dataset using `StandardScaler`. Then we fit a _k_-means model with `n_clusters=2`. We also keep the original index and column names for demonstration purposes.

In [251]:
n_clusters = 2
data_norm = pd.DataFrame(StandardScaler().fit_transform(data), index = data.index)
k_means = KMeans(n_clusters=n_clusters, random_state=123).fit_predict(data_norm)
data_norm.columns = data.columns
data_norm["label"] = k_means + 1 
data_norm["label"] = data_norm["label"].apply(lambda i: str(i))
data_norm.head()

consumer_id,has_gender,has_first_name,has_last_name,has_email,has_dob,account_age,account_last_updated,account_status,app_downloads,unique_offer_clicked,total_offer_clicks,unique_offer_rides,total_offer_rides,avg_claims,min_claims,max_claims,total_offers_claimed,label
1284b75c-ecae-4015-8e3d-359c0347ede8,-1.100641,0.044766,0.064163,0.028296,-0.827427,-0.444455,-0.286783,0.0,-0.172207,-0.0837,-0.534124,0.138535,-0.236103,-0.064656,-0.036079,-0.071999,-0.742085,1
128af162-d2c3-4fe4-986c-359c8bdc6c04,-1.100641,0.044766,0.064163,0.028296,-0.827427,-0.467517,-0.286783,0.0,-0.172207,-0.0837,0.411027,-0.721042,-0.149881,-0.064656,-0.036079,-0.071999,-0.742085,1
12aada5e-36eb-4e9e-8d62-359c076c1b40,-1.100641,0.044766,0.064163,0.028296,-0.827427,-0.444455,-0.286783,0.0,-0.172207,-0.0837,-0.345094,0.425061,1.646409,-0.064656,-0.036079,-0.071999,0.942748,2
12c2e02f-bc79-4048-83ba-359cd3280dcf,-1.100641,0.044766,0.064163,0.028296,-0.827427,-0.475205,-0.286783,0.0,-0.172207,-0.0837,0.221997,-1.007567,-0.178622,-0.064656,-0.036079,-0.071999,-0.742085,1
12fabdf0-0582-489e-a6d3-35509ab8ae6f,0.908561,0.044766,0.064163,0.028296,1.208565,2.507479,-1.403976,0.0,-0.172207,0.401801,0.032967,-0.14799,0.209377,-0.064656,-0.036079,-0.071999,0.942748,2


Let's calculate the silhouette score. You can see that the score is fairly low (it ranges from -1 to 1, with values near 1 indicating well-separated clusters).

In [252]:
score = metrics.silhouette_score(data_norm[data.columns], data_norm["label"])
print("For n_clusters = {}, silhouette score is {}.".format(n_clusters, score))

For n_clusters = 2, silhouette score is 0.27662983159058524.
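One way to put a single silhouette score in context is to compare it across several values of `n_clusters`. The sketch below does this on synthetic data from `make_blobs`; the data and the range of 2 to 6 clusters are assumptions for illustration only.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the standardised dataset (assumption).
X, _ = make_blobs(n_samples=300, centers=3, n_features=5, random_state=123)
X_scaled = StandardScaler().fit_transform(X)

# Fit k-means for each candidate k and record the silhouette score.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=123).fit_predict(X_scaled)
    scores[k] = silhouette_score(X_scaled, labels)
    print("For n_clusters = {}, silhouette score is {:.3f}.".format(k, scores[k]))

# The k with the highest score is a candidate, not a guaranteed best choice.
best_k = max(scores, key=scores.get)
print("Highest silhouette score at n_clusters = {}".format(best_k))
```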


Next, we will visualise the clusters using t-SNE, using the plotting function defined at the beginning of this section. We first reduce the dimensionality and then plot the clusters. From the plot below, it is clear that we cannot separate the clusters properly.

In [None]:
start_time = time.time()
tsne = TSNE(n_components=2, perplexity=30, n_iter=1000, verbose = 0, random_state=123, learning_rate=50)
tsne_results = tsne.fit_transform(data_norm[data.columns])

print('t-SNE done! Time elapsed: {} seconds'.format(time.time()-start_time))

tsne_plot(data_norm, tsne_results)
plt.savefig("p_kmeans.png")

### DBSCAN
We will follow a similar approach to the previous section.

In [None]:
n_clusters = 2
data_norm_kmeans = pd.DataFrame(StandardScaler().fit_transform(data), index = data.index)
k_means = KMeans(n_clusters=n_clusters, random_state=123).fit_predict(data_norm_kmeans)
data_norm_kmeans.columns = data.columns
data_norm_kmeans["label"] = k_means + 1 
data_norm_kmeans["label"] = data_norm_kmeans["label"].apply(lambda i: str(i))
data_norm_kmeans.head()
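For comparison, a DBSCAN run on standardised data might look like the sketch below. It uses synthetic data, and `eps=0.3` and `min_samples=5` are assumed values that would need tuning for the real dataset. Unlike _k_-means, DBSCAN does not take a number of clusters and marks noise points with the label `-1`.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the standardised dataset (assumption).
X, _ = make_blobs(n_samples=300, centers=2, n_features=2, cluster_std=0.6, random_state=123)
X_scaled = StandardScaler().fit_transform(X)

# eps and min_samples are assumed values, not tuned for any real data.
db_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X_scaled)

# Count clusters (excluding noise) and noise points.
n_found = len(set(db_labels)) - (1 if -1 in db_labels else 0)
n_noise = int(np.sum(db_labels == -1))
print("Clusters found: {}, noise points: {}".format(n_found, n_noise))

# The silhouette score needs at least two clusters; noise points are left out here.
score = None
mask = db_labels != -1
if n_found >= 2:
    score = silhouette_score(X_scaled[mask], db_labels[mask])
    print("Silhouette score (noise excluded): {:.3f}".format(score))
```

Excluding the noise points before scoring is a design choice; including them as their own group would generally lower the score.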