# Clustering predial

### Notebook automatically generated from your model

Model KMeans (k=4), trained on 2021-08-25 04:10:56.

#### Generated on 2021-08-28 17:08:16.182745

Clustering
This notebook will reproduce the steps for clustering the dataset predial.

#### Warning

The goal of this notebook is to provide easily readable, explainable code that reproduces the main steps
of training the model. It is not complete: some of the preprocessing done by DSS visual machine learning is not
replicated here, so this notebook will not give the same results or model performance as the DSS visual machine
learning model.

Let's start by importing the required libraries:

In [0]:
import sys
import dataiku
import numpy as np
import pandas as pd
import sklearn as sk
import dataiku.core.pandasutils as pdu
from dataiku.doctor.preprocessing import PCA
from collections import defaultdict, Counter

And tune pandas display options:

In [0]:
pd.set_option('display.width', 3000)
pd.set_option('display.max_rows', 200)
pd.set_option('display.max_columns', 200)

#### Importing base data

The first step is to get our machine learning dataset:

In [0]:
# We apply the preparation that you defined. You should not modify this.
preparation_steps = []
preparation_output_schema = {u'userModified': False, u'columns': [{u'type': u'bigint', u'name': u'col_0'}, {u'type': u'bigint', u'name': u'consecutivo predial'}, {u'type': u'bigint', u'name': u'corregimiento'}, {u'type': u'bigint', u'name': u'barrio'}, {u'type': u'bigint', u'name': u'manzana'}, {u'type': u'double', u'name': u'cc_tarifa'}, {u'type': u'double', u'name': u'estrato'}, {u'type': u'bigint', u'name': u'clase'}, {u'type': u'double', u'name': u'area_terr'}, {u'type': u'double', u'name': u'area_terr_comun'}, {u'type': u'double', u'name': u'area_const'}, {u'type': u'double', u'name': u'area_const_comun'}, {u'type': u'bigint', u'name': u'destinacion'}, {u'type': u'double', u'name': u'vlr_terr'}, {u'type': u'double', u'name': u'vlr_terr_comun'}, {u'type': u'double', u'name': u'vlr_const'}, {u'type': u'double', u'name': u'vlr_const_comun'}, {u'type': u'double', u'name': u'vlr_tot_avaluo'}, {u'type': u'bigint', u'name': u'total_owners'}, {u'type': u'bigint', u'name': u'year'}, {u'type': u'double', u'name': u'invoice'}]}

ml_dataset_handle = dataiku.Dataset('predial')
ml_dataset_handle.set_preparation_steps(preparation_steps, preparation_output_schema)
%time ml_dataset = ml_dataset_handle.get_dataframe(limit = 100000)

print ('Base data has %i rows and %i columns' % (ml_dataset.shape[0], ml_dataset.shape[1]))
# First five records
ml_dataset.head(5)

#### Initial data management

The preprocessing aims at making the dataset compatible with modeling.
At the end of this step, we will have a matrix of float numbers, with no missing values.
We'll use the features and the preprocessing steps defined in Models.

Let's keep only the selected features:

In [0]:
ml_dataset = ml_dataset[[u'total_owners', u'estrato', u'destinacion', u'manzana', u'area_terr', u'cc_tarifa', u'invoice', u'area_const', u'vlr_tot_avaluo', u'vlr_terr', u'barrio', u'vlr_const']]

Let's first coerce categorical columns to unicode and numerical features to floats.

In [0]:
# astype('unicode') does not work as expected

def coerce_to_unicode(x):
    if sys.version_info < (3, 0):
        if isinstance(x, str):
            return unicode(x,'utf-8')
        else:
            return unicode(x)
    else:
        return str(x)


categorical_features = [u'estrato', u'destinacion', u'manzana', u'barrio']
numerical_features = [u'total_owners', u'area_terr', u'cc_tarifa', u'invoice', u'area_const', u'vlr_tot_avaluo', u'vlr_terr', u'vlr_const']
text_features = []
from dataiku.doctor.utils import datetime_to_epoch
for feature in categorical_features:
    ml_dataset[feature] = ml_dataset[feature].apply(coerce_to_unicode)
for feature in text_features:
    ml_dataset[feature] = ml_dataset[feature].apply(coerce_to_unicode)
for feature in numerical_features:
    if ml_dataset[feature].dtype == np.dtype('M8[ns]') or (hasattr(ml_dataset[feature].dtype, 'base') and ml_dataset[feature].dtype.base == np.dtype('M8[ns]')):
        ml_dataset[feature] = datetime_to_epoch(ml_dataset[feature])
    else:
        ml_dataset[feature] = ml_dataset[feature].astype('double')

Let's copy our dataset so that the original values remain available for profiling at the end.

In [0]:
# train is the dataset on which we will apply the ML techniques
train = ml_dataset.copy()

#### Features preprocessing

The first thing to do at the feature level is to handle the missing values.
Let's reuse the settings defined in the model:

In [0]:
drop_rows_when_missing = []
impute_when_missing = [{'impute_with': u'MEAN', 'feature': u'total_owners'}, {'impute_with': u'MODE', 'feature': u'estrato'}, {'impute_with': u'MODE', 'feature': u'destinacion'}, {'impute_with': u'MODE', 'feature': u'manzana'}, {'impute_with': u'MEAN', 'feature': u'area_terr'}, {'impute_with': u'MEAN', 'feature': u'cc_tarifa'}, {'impute_with': u'MEAN', 'feature': u'invoice'}, {'impute_with': u'MEAN', 'feature': u'area_const'}, {'impute_with': u'MEAN', 'feature': u'vlr_tot_avaluo'}, {'impute_with': u'MEAN', 'feature': u'vlr_terr'}, {'impute_with': u'MODE', 'feature': u'barrio'}, {'impute_with': u'MEAN', 'feature': u'vlr_const'}]

# Features for which we drop rows with missing values
for feature in drop_rows_when_missing:
    train = train[train[feature].notnull()]
    
    print ('Dropped missing records in %s' % feature)

# Features for which we impute missing values
for feature in impute_when_missing:
    if feature['impute_with'] == 'MEAN':
        v = train[feature['feature']].mean()
    elif feature['impute_with'] == 'MEDIAN':
        v = train[feature['feature']].median()
    elif feature['impute_with'] == 'CREATE_CATEGORY':
        v = 'NULL_CATEGORY'
    elif feature['impute_with'] == 'MODE':
        v = train[feature['feature']].value_counts().index[0]
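    elif feature['impute_with'] == 'CONSTANT':
        v = feature['value']
    else:
        # Fail loudly instead of silently reusing the previous feature's value
        raise ValueError('Unknown imputation method: %s' % feature['impute_with'])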
    train[feature['feature']] = train[feature['feature']].fillna(v)
    
    print ('Imputed missing values in feature %s with value %s' % (feature['feature'], coerce_to_unicode(v)))
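The two imputation strategies used by this model (MEAN for numerical features, MODE for categorical ones) can be checked on a small frame. This is a minimal sketch: the `toy` frame and its values are hypothetical and only illustrate the logic of the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one numerical and one categorical column with gaps.
toy = pd.DataFrame({
    "area_terr": [100.0, np.nan, 300.0, 200.0],
    "estrato": ["2", "3", None, "3"],
})

# MEAN imputation: Series.mean() skips missing values by default.
mean_value = toy["area_terr"].mean()
toy["area_terr"] = toy["area_terr"].fillna(mean_value)

# MODE imputation: value_counts() sorts by descending frequency,
# so index[0] is the most frequent value.
mode_value = toy["estrato"].value_counts().index[0]
toy["estrato"] = toy["estrato"].fillna(mode_value)

print(toy)
```

Note that the imputation values are computed from the data being imputed. If you later score new rows, reusing the values computed here is probably what you want, rather than recomputing them.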

We can now handle the categorical features (still using the settings defined in Models):

Let's dummy-encode the following features.
A binary column is created for each of the 100 most frequent values.

In [0]:
LIMIT_DUMMIES = 100

categorical_to_dummy_encode = [u'estrato', u'destinacion', u'manzana', u'barrio']

# Only keep the top 100 values
def select_dummy_values(train, features):
    dummy_values = {}
    for feature in features:
        values = [
            value
            for (value, _) in Counter(train[feature]).most_common(LIMIT_DUMMIES)
        ]
        dummy_values[feature] = values
    return dummy_values

DUMMY_VALUES = select_dummy_values(train, categorical_to_dummy_encode)

def dummy_encode_dataframe(df):
    for (feature, dummy_values) in DUMMY_VALUES.items():
        for dummy_value in dummy_values:
            dummy_name = u'%s_value_%s' % (feature, coerce_to_unicode(dummy_value))
            df[dummy_name] = (df[feature] == dummy_value).astype(float)
        del df[feature]
        print ('Dummy-encoded feature %s' % feature)

dummy_encode_dataframe(train)
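To see what the cap on dummy values does, here is a small sketch on a hypothetical column. The names `toy`, `limit` and `encoded` are illustrative only; `limit` plays the role of `LIMIT_DUMMIES`.

```python
from collections import Counter

import pandas as pd

# Hypothetical categorical column with three distinct values.
toy = pd.DataFrame({"barrio": ["a", "b", "a", "c", "a", "b"]})
limit = 2  # stand-in for LIMIT_DUMMIES

# Keep only the `limit` most frequent values.
top_values = [value for value, _ in Counter(toy["barrio"]).most_common(limit)]

# One binary column per kept value.
encoded = pd.DataFrame(index=toy.index)
for value in top_values:
    encoded["barrio_value_%s" % value] = (toy["barrio"] == value).astype(float)

print(encoded)
```

Rows whose value falls outside the kept set (here `"c"`) end up with zeros in every dummy column, so all rare values become indistinguishable from each other.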

Let's rescale the numerical features:

In [0]:
rescale_features = {u'total_owners': u'AVGSTD', u'vlr_terr': u'AVGSTD', u'area_terr': u'AVGSTD', u'cc_tarifa': u'AVGSTD', u'invoice': u'AVGSTD', u'area_const': u'AVGSTD', u'vlr_tot_avaluo': u'AVGSTD', u'vlr_const': u'AVGSTD'}
for (feature_name, rescale_method) in rescale_features.items():
    if rescale_method == 'MINMAX':
        _min = train[feature_name].min()
        _max = train[feature_name].max()
        scale = _max - _min
        shift = _min
    else:
        shift = train[feature_name].mean()
        scale = train[feature_name].std()
    if scale == 0.:
        del train[feature_name]
        
        print ('Feature %s was dropped because it has no variance' % feature_name)
    else:
        print ('Rescaled %s' % feature_name)
        train[feature_name] = (train[feature_name] - shift).astype(np.float64) / scale
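The cell above supports two rescaling methods, although this model only uses `AVGSTD`. The sketch below applies both to a hypothetical series to show what each one produces.

```python
import pandas as pd

# Hypothetical values for a single numerical feature.
values = pd.Series([10.0, 20.0, 30.0, 40.0], name="vlr_terr")

# AVGSTD: subtract the mean, divide by the standard deviation.
standardized = (values - values.mean()) / values.std()

# MINMAX: subtract the minimum, divide by the range.
minmax = (values - values.min()) / (values.max() - values.min())

print(pd.DataFrame({"avgstd": standardized, "minmax": minmax}))
```

One detail worth knowing: pandas' `Series.std()` computes the sample standard deviation (`ddof=1`), whereas scikit-learn's `StandardScaler` uses the population standard deviation (`ddof=0`). On large datasets the difference is negligible.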

Removing outliers

In [0]:
# Remove outliers from train set
from dataiku.doctor.preprocessing.dataframe_preprocessing import detect_outliers

outliers = detect_outliers(train, 0.9, 100, 0.01)
train = train[~outliers]

print ("%s outliers found" % (outliers.sum()))

#### Modeling

In [0]:
from sklearn.cluster import KMeans
clustering_model = KMeans(n_clusters=4)

We can finally cluster our dataset!

In [0]:
%time clusters = clustering_model.fit_predict(train)
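The model above is created without a `random_state`, so cluster numbering (and possibly the clusters themselves) may change between runs. The following sketch, on hypothetical synthetic data, shows how fixing `random_state` and `n_init` makes a run repeatable. It is not the configuration used by the DSS model.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)

# Two well-separated hypothetical blobs of 50 points each.
blob_a = rng.normal(loc=0.0, scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=10.0, scale=0.5, size=(50, 2))
X = np.vstack([blob_a, blob_b])

# Fixing random_state makes the centroid initialization reproducible;
# n_init is set explicitly because its default differs across versions.
model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = model.fit_predict(X)

# Number of points assigned to each cluster.
print(np.bincount(labels))
```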

Build up our result dataset

#### Results

Inertia

In [0]:
print (clustering_model.inertia_)
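Inertia is the sum of squared distances between each point and the centroid of its assigned cluster; lower is tighter. As a sanity check, the sketch below recomputes it by hand on hypothetical random data and compares it with `inertia_`.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 3))  # hypothetical data

model = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

# Look up, for every row, the centroid of the cluster it was assigned to.
assigned_centers = model.cluster_centers_[model.labels_]

# Sum of squared Euclidean distances to the assigned centroids.
manual_inertia = ((X - assigned_centers) ** 2).sum()

print(manual_inertia, model.inertia_)
```

Because inertia always decreases as `n_clusters` grows, it is mostly useful for comparing runs with the same number of clusters or for an elbow plot.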

Silhouette

In [0]:
from sklearn.metrics import silhouette_score
silhouette = silhouette_score(train.values, clusters, metric='euclidean', sample_size=2000)
print ("Silhouette score: %s" % silhouette)
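The silhouette score ranges from -1 to 1, with values near 1 indicating compact, well-separated clusters. With `sample_size` set, the score is computed on a random subset, so it can vary between runs unless `random_state` is also fixed. A sketch on hypothetical synthetic data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.RandomState(0)

# Two well-separated hypothetical blobs.
X = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=10.0, scale=0.5, size=(50, 2)),
])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Exact score on all points.
score_full = silhouette_score(X, labels, metric="euclidean")

# Approximate score on a reproducible random subset of 40 points.
score_sampled = silhouette_score(
    X, labels, metric="euclidean", sample_size=40, random_state=0
)

print(score_full, score_sampled)
```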

Join the preprocessed training set with the cluster labels we found.

In [0]:
final = train.join(pd.Series(clusters, index=train.index, name='cluster'))
final['cluster'] = final['cluster'].map(lambda cluster_id: 'cluster' + str(cluster_id))

Compute the cluster sizes

In [0]:
size = pd.DataFrame({'size': final['cluster'].value_counts()})
size.head()
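Since `final` holds rescaled and dummy-encoded values, cluster profiles are often easier to read on the raw values kept in `ml_dataset`. The sketch below shows one possible way to do that, joining labels back by index. The frames `raw` and `labels` are hypothetical stand-ins, not objects from this notebook.

```python
import pandas as pd

# Hypothetical stand-in for ml_dataset: raw, unscaled values.
raw = pd.DataFrame(
    {"area_terr": [100.0, 120.0, 5000.0, 900.0, 1100.0]},
    index=[0, 1, 2, 3, 4],
)

# Hypothetical labels: row 2 was removed as an outlier, so it has no label.
labels = pd.Series(
    ["cluster0", "cluster0", "cluster1", "cluster1"],
    index=[0, 1, 3, 4],
    name="cluster",
)

# An inner join on the index keeps only the rows that were actually clustered.
profiled = raw.join(labels, how="inner")

# Per-cluster size and mean of the raw feature.
profile = profiled.groupby("cluster")["area_terr"].agg(["count", "mean"])

print(profile)
```

This relies on the index being preserved through preprocessing, which holds in this notebook because rows are only filtered, never reindexed.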

Draw a nice scatter plot

In [0]:
axis_x = train.columns[0]   # change me
axis_y = train.columns[1]  # change me

from ggplot import ggplot, aes, geom_point
print(ggplot(aes(axis_x, axis_y, colour='cluster'), final) + geom_point())

That's it. It's now up to you to tune your preprocessing, your algorithm, and your analysis!
