# Preprocessing: transform categorical data
In `scikit-learn` the classifiers require numeric data. The library provides a set of preprocessing functions that help with the transformation. This exercise proposes two types of transformation:

- `OneHotEncoder` for purely categorical columns: if the column has **V** distinct values, it is replaced by **V** binary columns where, in each row, only the bit corresponding to the original value is true
- `OrdinalEncoder` for ordinal columns: the original **V** values are mapped into the **0..V-1** range

The additional class `ColumnTransformer` applies the different transformations to the appropriate columns in a single statement.
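
As a minimal sketch of the two encoders working together, the toy data frame below (hypothetical, not the exercise data set) has one nominal and one ordinal column. The names `toy`, `toy_preprocessor` and `toy_encoded` are illustrative only, and `sparse_threshold=0` is used so that the result is a dense array regardless of the library version.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical toy data: one nominal column and one ordinal column
toy = pd.DataFrame({
    "side": ["left", "right", "left", "right"],
    "size": ["00-04", "10-14", "05-09", "10-14"],
})

toy_preprocessor = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(), ["side"]),   # 2 distinct values -> 2 binary columns
        ("ord", OrdinalEncoder(), ["size"]),  # 3 distinct values -> codes 0, 1, 2
    ],
    sparse_threshold=0,  # always return a dense array
)

toy_encoded = toy_preprocessor.fit_transform(toy)
print(toy_encoded)
```

The first two output columns are the one-hot bits for `side`, the third is the ordinal code for `size`.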

### To do:
- import the appropriate names
- set the random state
- import the data set with the appropriate column names
- inspect the content and the data types
- read carefully the `.names` file of the data set, to understand which are the ordinal and categorical data
- data cleaning
    - the **ordinal transformer** maps strings to numbers according to the lexicographic order of the strings; in this data set the strings denote numeric subranges, so ranges with single-digit bounds end up in the wrong position: '5-9' happens to come after '20-24'
    - it is therefore necessary to transform '5-9' into '05-09', and likewise for the other similar cases
    - one way to do this is to prepare dictionaries for the translation and use the `.map` function
- prepare the lists of the ordinal, categorical and numeric columns
- prepare the preprocessor
- split the cleaned data into the X and y part
- fit_transform the preprocessor and generate the transformed data set
- split the transformed data set into train and test
- use the same method used for the exercise of 19/11 to test several classifiers
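
The lexicographic sorting problem mentioned in the data cleaning step can be seen on a handful of labels. The sketch below uses a hypothetical helper, `zero_pad_range`, which is not part of the exercise: it pads every run of digits with a regular expression instead of using a hand-written dictionary.

```python
import re

def zero_pad_range(label, width=2):
    """Left-pad every run of digits in a range label such as '5-9'."""
    return re.sub(r"\d+", lambda m: m.group().zfill(width), label)

raw = ["0-4", "5-9", "10-14", "20-24"]
print(sorted(raw))      # lexicographic order puts '5-9' last

padded = [zero_pad_range(r) for r in raw]
print(sorted(padded))   # after padding, string order matches numeric order
```

The same helper could be passed to the pandas `.map` method of a column in place of a dictionary.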

In [1]:
"""
http://scikit-learn.org/stable/auto_examples/model_selection/plot_grid_search_digits.html
@author: scikit-learn.org and Claudio Sartori
"""
import warnings
warnings.filterwarnings('ignore') # suppress warnings; comment this line out to see them

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV

from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.compose import ColumnTransformer

from sklearn.svm import SVC
from sklearn.linear_model import Perceptron
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier

print(__doc__) # print information included in the triple quotes at the beginning

random_state = 42


http://scikit-learn.org/stable/auto_examples/model_selection/plot_grid_search_digits.html
@author: scikit-learn.org and Claudio Sartori



In [2]:
# url = 'diagnosis.data'
# names = ['Temp', 'Nau', 'Lum', 'Uri', 'Mic', 'Bur', 'd1', 'd2']
# sep = "\t"
url = 'breast-cancer.data'
names = ['Class','age','menopause','tumor-size','inv-nodes',
         'node-caps','deg-malig','breast','breast-quad','irradiat']
sep = ","

df = pd.read_csv(url, names = names, sep=sep)
df.head()

| | Class | age | menopause | tumor-size | inv-nodes | node-caps | deg-malig | breast | breast-quad | irradiat |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | no-recurrence-events | 30-39 | premeno | 30-34 | 0-2 | no | 3 | left | left_low | no |
| 1 | no-recurrence-events | 40-49 | premeno | 20-24 | 0-2 | no | 2 | right | right_up | no |
| 2 | no-recurrence-events | 40-49 | premeno | 20-24 | 0-2 | no | 2 | left | left_low | no |
| 3 | no-recurrence-events | 60-69 | ge40 | 15-19 | 0-2 | no | 2 | right | left_up | no |
| 4 | no-recurrence-events | 40-49 | premeno | 0-4 | 0-2 | no | 2 | right | right_low | no |


Show the types of the columns

In [3]:
print(df.dtypes)

Class          object
age            object
menopause      object
tumor-size     object
inv-nodes      object
node-caps      object
deg-malig       int64
breast         object
breast-quad    object
irradiat       object
dtype: object


Clean the column `tumor-size`

In [4]:
# identity mapping: every distinct value initially maps to itself
tumor_size_dict = {v: v for v in df['tumor-size'].unique()}
tumor_size_dict

{'30-34': '30-34',
 '20-24': '20-24',
 '15-19': '15-19',
 '0-4': '0-4',
 '25-29': '25-29',
 '50-54': '50-54',
 '10-14': '10-14',
 '40-44': '40-44',
 '35-39': '35-39',
 '5-9': '5-9',
 '45-49': '45-49'}

In [5]:
tumor_size_dict['0-4'] = '00-04'
tumor_size_dict['5-9'] = '05-09'

In [6]:
df['tumor-size'] = df['tumor-size'].map(tumor_size_dict)

Clean the column `inv-nodes`

In [7]:
# identity mapping: every distinct value initially maps to itself
inv_nodes_dict = {v: v for v in df['inv-nodes'].unique()}

In [8]:
inv_nodes_dict['0-2']  = '00-02'
inv_nodes_dict['3-5']  = '03-05'
inv_nodes_dict['6-8']  = '06-08'
inv_nodes_dict['9-11'] = '09-11'

In [9]:
df['inv-nodes'] = df['inv-nodes'].map(inv_nodes_dict)
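
As an alternative to relabelling the values, `OrdinalEncoder` accepts an explicit `categories` argument that fixes the order of the labels. This is a different technique from the dictionary approach used in this exercise; the sketch below shows it on a hypothetical column with a shortened list of ranges.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical column with unpadded range labels
sizes = pd.DataFrame({"tumor-size": ["5-9", "20-24", "0-4", "10-14"]})

# The desired order is spelled out explicitly, so no relabelling is needed
size_order = ["0-4", "5-9", "10-14", "20-24"]
explicit_encoder = OrdinalEncoder(categories=[size_order])

codes = explicit_encoder.fit_transform(sizes).ravel()
print(codes)
```

Each code is the position of the label in `size_order`, so '5-9' is encoded as 1 even though it sorts last as a string.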

Inspect the data

In [10]:
df.head()

| | Class | age | menopause | tumor-size | inv-nodes | node-caps | deg-malig | breast | breast-quad | irradiat |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | no-recurrence-events | 30-39 | premeno | 30-34 | 00-02 | no | 3 | left | left_low | no |
| 1 | no-recurrence-events | 40-49 | premeno | 20-24 | 00-02 | no | 2 | right | right_up | no |
| 2 | no-recurrence-events | 40-49 | premeno | 20-24 | 00-02 | no | 2 | left | left_low | no |
| 3 | no-recurrence-events | 60-69 | ge40 | 15-19 | 00-02 | no | 2 | right | left_up | no |
| 4 | no-recurrence-events | 40-49 | premeno | 00-04 | 00-02 | no | 2 | right | right_low | no |


Prepare the lists of numeric features, ordinal features, categorical features

In [11]:
categorical_features = df.dtypes.loc[df.dtypes == 'object'].index.values
print("The non-numeric features are:")
print(categorical_features)

The non-numeric features are:
['Class' 'age' 'menopause' 'tumor-size' 'inv-nodes' 'node-caps' 'breast'
 'breast-quad' 'irradiat']


In [12]:
numeric_features = list(set(df.dtypes.index.values)-set(categorical_features))
print("The numeric features are:")
print(numeric_features)

The numeric features are:
['deg-malig']


In [13]:
ordinal_features =['age', 'tumor-size','inv-nodes']
print("The ordinal features are:")
print(ordinal_features)

The ordinal features are:
['age', 'tumor-size', 'inv-nodes']


In [14]:
categorical_features = list(set(categorical_features) - set(ordinal_features) - set(['Class']))
print("The categorical features are:")
print(categorical_features)

The categorical features are:
['menopause', 'irradiat', 'breast', 'node-caps', 'breast-quad']
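
The set differences used above do not guarantee a stable column order. A minimal sketch of an order-preserving variant, on a hypothetical four-column frame, uses `select_dtypes` and a list comprehension; the names `target`, `ordinal_cols`, `numeric_cols` and `nominal_cols` are illustrative.

```python
import pandas as pd

# Hypothetical frame with the same kinds of columns as the exercise data
toy = pd.DataFrame({
    "Class": ["a", "b"],
    "age": ["30-39", "40-49"],
    "breast": ["left", "right"],
    "deg-malig": [3, 2],
})
target = "Class"
ordinal_cols = ["age"]

# numeric columns are detected from the dtypes, in their original order
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

# everything else that is neither ordinal nor the target is treated as nominal
nominal_cols = [c for c in toy.columns
                if c not in numeric_cols + ordinal_cols + [target]]
print(numeric_cols, nominal_cols)
```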


Prepare the transformer

In [15]:
# transf_dtype = np.float64
transf_dtype = np.int32

# note: from scikit-learn 1.2 the `sparse` argument is named `sparse_output`
categorical_transformer = OneHotEncoder(handle_unknown='ignore', sparse = False, dtype = transf_dtype)
ordinal_transformer = OrdinalEncoder(dtype = transf_dtype)
preprocessor = ColumnTransformer(
    transformers = [('cat', categorical_transformer, categorical_features),
                    ('ord', ordinal_transformer, ordinal_features)
                   ],
                    remainder = 'passthrough'
    )

Split X and y and check the shapes

In [16]:
X = df.drop(['Class'], axis = 1)
y = df['Class']

In [17]:
labels = y.unique()
print("The labels are:")
print(labels)

The labels are:
['no-recurrence-events' 'recurrence-events']


In [18]:
X.shape

(286, 9)

Fit the preprocessor with X and check its parameters by printing the `.named_transformers_` attribute

In [19]:
preprocessor.fit(X)

ColumnTransformer(n_jobs=None, remainder='passthrough', sparse_threshold=0.3,
                  transformer_weights=None,
                  transformers=[('cat',
                                 OneHotEncoder(categorical_features=None,
                                               categories=None, drop=None,
                                               dtype=<class 'numpy.int32'>,
                                               handle_unknown='ignore',
                                               n_values=None, sparse=False),
                                 ['menopause', 'irradiat', 'breast',
                                  'node-caps', 'breast-quad']),
                                ('ord',
                                 OrdinalEncoder(categories='auto',
                                                dtype=<class 'numpy.int32'>),
                                 ['age', 'tumor-size', 'inv-nodes'])],
                  verbose=False)

In [20]:
print(preprocessor.named_transformers_)

{'cat': OneHotEncoder(categorical_features=None, categories=None, drop=None,
              dtype=<class 'numpy.int32'>, handle_unknown='ignore',
              n_values=None, sparse=False), 'ord': OrdinalEncoder(categories='auto', dtype=<class 'numpy.int32'>), 'remainder': 'passthrough'}


Fit-transform X and store the result in X_p, check the shape

In [21]:
X_p = preprocessor.fit_transform(X)

In [22]:
X_p.shape

(286, 20)

For ease of inspection transform `X_p` into a data frame `df_p` and inspect it

In [23]:
df_p = pd.DataFrame(X_p)

In [24]:
df_p.describe()

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 286.0 | 286.0 | 286.0 | 286.0 | 286.0 | 286.0 | 286.0 | 286.0 | 286.0 | 286.0 | 286.0 | 286.0 | 286.0 | 286.0 | 286.0 | 286.0 | 286.0 | 286.0 | 286.0 | 286.0 |
| mean | 0.451049 | 0.024476 | 0.524476 | 0.762238 | 0.237762 | 0.531469 | 0.468531 | 0.027972 | 0.776224 | 0.195804 | 0.003497 | 0.073427 | 0.384615 | 0.339161 | 0.083916 | 0.115385 | 2.664336 | 4.881119 | 0.517483 | 2.048951 |
| std | 0.49847 | 0.154791 | 0.500276 | 0.426459 | 0.426459 | 0.499883 | 0.499883 | 0.165182 | 0.417504 | 0.397514 | 0.059131 | 0.261293 | 0.487357 | 0.474254 | 0.277748 | 0.320046 | 1.011818 | 2.10593 | 1.110417 | 0.738217 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 25% | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 4.0 | 0.0 | 2.0 |
| 50% | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 5.0 | 0.0 | 2.0 |
| 75% | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 3.0 | 6.0 | 1.0 | 3.0 |
| max | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 5.0 | 10.0 | 6.0 | 3.0 |


In [25]:
df_p.head()

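| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 6 | 0 | 3 |
| 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 2 | 4 | 0 | 2 |
| 2 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 2 | 4 | 0 | 2 |
| 3 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 4 | 3 | 0 | 2 |
| 4 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 2 | 0 | 0 | 2 |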


The columns of the transformed data set follow the order of the transformers shown when printing the fitted preprocessor, with the passthrough remainder last; therefore the last four columns correspond to `'age', 'tumor-size', 'inv-nodes', 'deg-malig'`.

To check that the mapping is as expected, compare the sorted values of `df['tumor-size']` and `df_p[17]`, e.g. by comparing the index sequences obtained when sorting by each of the two columns

In [26]:
orig_col = 'tumor-size'
transf_col = 17
a=pd.DataFrame(zip(df[orig_col],df_p[transf_col]))
print('The number of index discordances between \'{}\' and \'{}\' is {}'.\
      format(orig_col, transf_col, sum(a.sort_values(by = 0).index.values!=a.sort_values(by = 1).index.values)))

The number of index discordances between 'tumor-size' and '17' is 0
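
A more direct check, sketched here on a hypothetical column, reads the learned mapping from the `categories_` attribute of a fitted `OrdinalEncoder`. In this notebook the fitted encoder would be reachable as `preprocessor.named_transformers_['ord']`; the names `col`, `enc` and `mapping` below are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical column with already padded labels
col = pd.DataFrame({"tumor-size": ["30-34", "00-04", "05-09", "20-24"]})
enc = OrdinalEncoder().fit(col)

# categories_[0][k] is the label that was assigned code k
mapping = {label: code for code, label in enumerate(enc.categories_[0])}
print(mapping)
```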


Train/test split

In [27]:
X_train, X_test, y_train, y_test = train_test_split(X_p,y, random_state = random_state)
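
In this exercise the preprocessor is fitted on the whole data set before the split. A possible alternative design, sketched below on synthetic data, splits first and wraps the preprocessor and the classifier in a `Pipeline`, so that the encoders are fitted only on the training folds. Everything here (the toy frame, the synthetic target, the step names `prep` and `clf`) is an assumption made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
n = 200
toy_X = pd.DataFrame({
    "breast": rng.choice(["left", "right"], size=n),
    "deg-malig": rng.integers(1, 4, size=n),
})
toy_y = (toy_X["deg-malig"] == 3).astype(int)  # synthetic, perfectly learnable target

# split the raw data first; stratify keeps the class proportions in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, stratify=toy_y, random_state=42)

pipe = Pipeline([
    ("prep", ColumnTransformer(
        [("cat", OneHotEncoder(handle_unknown="ignore"), ["breast"])],
        remainder="passthrough")),
    ("clf", DecisionTreeClassifier(random_state=42)),
])

# parameters of a pipeline step are addressed as <step name>__<parameter>
search = GridSearchCV(pipe, {"clf__max_depth": [1, 2, 3]},
                      cv=5, scoring="recall_macro")
search.fit(X_tr, y_tr)
print(search.best_params_, search.score(X_te, y_te))
```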

Classification and test

In [28]:
model_lbls = [
             'dt', 
             'nb', 
             'lp', 
             'svc', 
              'knn',
             'rfc',
             'ada',
            ]

# Set the parameters by cross-validation
tuned_param_dt = [{'max_depth': list(range(1,20))}]
tuned_param_nb = [{'var_smoothing': [10**i for i in range(1,-11, -1)]}]
tuned_param_lp = [{'early_stopping': [True]}]
tuned_param_svc = [{'kernel': ['rbf'], 
                    'gamma': [1e-3, 1e-4],
#                     'C': [1, 10, 100, 1000],
                    'C': [10**i for i in range(0,4)],
                    },
                    {'kernel': ['linear'],
                    'C': [10**i for i in range(0,4)],
                    },
                   ]
tuned_param_knn =[{'n_neighbors': list(range(1,11)),
                   'metric': ['euclidean', 'manhattan', 'chebyshev']}
                 ]
tuned_param_rfc =[{'max_depth': list(range(1,11))}]
tuned_param_ada = [{'learning_rate': [1., 0.1, 0.01, 0.001, 0.0001]}]

models = {
    'dt': {'name': 'Decision Tree       ',
           'estimator': DecisionTreeClassifier(), 
           'param': tuned_param_dt,
          },
    'nb': {'name': 'Gaussian Naive Bayes',
           'estimator': GaussianNB(),
           'param': tuned_param_nb
          },
    'lp': {'name': 'Linear Perceptron   ',
           'estimator': Perceptron(),
           'param': tuned_param_lp,
          },
    'svc':{'name': 'Support Vector      ',
           'estimator': SVC(), 
           'param': tuned_param_svc
          },
    'knn':{'name': 'K Nearest Neighbor  ',
           'estimator': KNeighborsClassifier(),
           'param': tuned_param_knn
          },
    'rfc':{'name': 'Random Forest       ',
           'estimator': RandomForestClassifier(),
           'param': tuned_param_rfc
          }, 
    'ada':{'name': 'Adaboost            ',
           'estimator': AdaBoostClassifier(),
           'param': tuned_param_ada
          },
}

scores = [
#    'precision_macro', 
    'recall_macro', 
#    'accuracy',
#    'f1_macro'
]
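
The `recall_macro` score selected above is the unweighted mean of the per-class recalls, so the minority class counts as much as the majority class. A minimal sketch with illustrative counts (49 negative and 23 positive cases, labelled 0 and 1) compares the hand computation with `recall_score`:

```python
from sklearn.metrics import recall_score

# Illustrative labels built from per-class counts (0 = negative, 1 = positive)
y_true = [0] * 49 + [1] * 23
y_pred = [0] * 48 + [1] * 1 + [0] * 19 + [1] * 4

recall_neg = 48 / 49   # 48 of the 49 negatives are predicted correctly
recall_pos = 4 / 23    # 4 of the 23 positives are predicted correctly
macro_by_hand = (recall_neg + recall_pos) / 2

macro_sklearn = recall_score(y_true, y_pred, average="macro")
print(round(macro_by_hand, 3), round(macro_sklearn, 3))
```

Accuracy on the same labels would be 52/72, about 0.72, which hides the poor recall on the positive class.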

In [29]:
# def plot_confusion_matrix(cm):
#     print(cm) 
#     fig = plt.figure(figsize=(10,10))
#     ax = fig.add_subplot(111) 
#     cax = ax.matshow(cm) 
#     plt.title('Confusion matrix of the classifier') 
#     fig.colorbar(cax) 
#     ax.set_xticklabels([''] + labels) 
#     ax.set_yticklabels([''] + labels) 
#     plt.xlabel('Predicted') 
#     plt.ylabel('True') 
#     plt.show()

def print_results(model):
    print("Best parameters set found on train set:")
    print()
    # if best is linear there is no gamma parameter
    print(model.best_params_)
    print()
    print("Grid scores on train set:")
    print()
    means = model.cv_results_['mean_test_score']
    stds = model.cv_results_['std_test_score']
    params = model.cv_results_['params']
    for mean, std, params_tuple in zip(means, stds, params):
        print("%0.3f (+/-%0.03f) for %r"
              % (mean, std * 2, params_tuple))
    print()
    print("Detailed classification report for the best parameter set:")
    print()
    print("The model is trained on the full train set.")
    print("The scores are computed on the full test set.")
    print()
    y_true, y_pred = y_test, model.predict(X_test)
    print(classification_report(y_true, y_pred))
    cm = confusion_matrix(y_true,y_pred, labels = labels)
    print(cm)
#     plot_confusion_matrix(cm)
    print()

In [30]:
results_short = {}

for score in scores:
    print('='*40)
    print("# Tuning hyper-parameters for %s" % score)
    print()

    # "..." % score is a string formatting expression:
    # the value after % replaces the %s placeholder in the string
    for m in model_lbls:
        print('-'*40)
        print("Trying model {}".format(models[m]['name']))
        clf = GridSearchCV(models[m]['estimator'], models[m]['param'], cv=5,
                           scoring=score, 
                           iid = False, # only for scikit-learn < 0.24; remove this argument in later versions
                           return_train_score = False,
                           n_jobs = 2, # this allows using multi-cores
                           )
        clf.fit(X_train, y_train)
        print_results(clf)
        results_short[m] = clf.best_score_
    print("Summary of results for {}".format(score))
    print("Estimator")
    for m in results_short.keys():
        # note: the score is a fraction in [0, 1]; the trailing '%' is only a label
        print("{}\t - score: {:4.2}%".format(models[m]['name'], results_short[m]))

# Tuning hyper-parameters for recall_macro

----------------------------------------
Trying model Decision Tree       
Best parameters set found on train set:

{'max_depth': 14}

Grid scores on train set:

0.567 (+/-0.086) for {'max_depth': 1}
0.610 (+/-0.115) for {'max_depth': 2}
0.583 (+/-0.127) for {'max_depth': 3}
0.551 (+/-0.082) for {'max_depth': 4}
0.574 (+/-0.148) for {'max_depth': 5}
0.574 (+/-0.138) for {'max_depth': 6}
0.597 (+/-0.148) for {'max_depth': 7}
0.591 (+/-0.226) for {'max_depth': 8}
0.567 (+/-0.223) for {'max_depth': 9}
0.576 (+/-0.285) for {'max_depth': 10}
0.577 (+/-0.168) for {'max_depth': 11}
0.552 (+/-0.166) for {'max_depth': 12}
0.564 (+/-0.163) for {'max_depth': 13}
0.620 (+/-0.187) for {'max_depth': 14}
0.565 (+/-0.142) for {'max_depth': 15}
0.573 (+/-0.089) for {'max_depth': 16}
0.608 (+/-0.121) for {'max_depth': 17}
0.571 (+/-0.154) for {'max_depth': 18}
0.576 (+/-0.177) for {'max_depth': 19}

Detailed classification report for the best parameter set:

T

Best parameters set found on train set:

{'learning_rate': 0.01}

Grid scores on train set:

0.586 (+/-0.146) for {'learning_rate': 1.0}
0.620 (+/-0.102) for {'learning_rate': 0.1}
0.643 (+/-0.161) for {'learning_rate': 0.01}
0.567 (+/-0.086) for {'learning_rate': 0.001}
0.567 (+/-0.086) for {'learning_rate': 0.0001}

Detailed classification report for the best parameter set:

The model is trained on the full train set.
The scores are computed on the full test set.

                      precision    recall  f1-score   support

no-recurrence-events       0.72      0.98      0.83        49
   recurrence-events       0.80      0.17      0.29        23

            accuracy                           0.72        72
           macro avg       0.76      0.58      0.56        72
        weighted avg       0.74      0.72      0.65        72

[[48  1]
 [19  4]]

Summary of results for recall_macro
Estimator
Decision Tree       	 - score: 0.62%
Gaussian Naive Bayes	 - score: 0.63%
Linear Percept