## CLASSIFICATION: UNCLASSIFIED

# An Example of Preparing Data for Machine Learning

Machine learning tends to work with data that has nice numeric vectors. You'll note that the clustering and t-SNE notebooks both assume that your data is already in a purely numeric numpy array. In practice data rarely shows up in such a nice form, and one needs to do a certain amount of preprocessing to make it all "pretty" for standard machine learning tools. In this notebook we'll step through an example of taking data from a pandas dataframe of mixed data types all the way to numeric vectors ready for clustering. Fortunately for us scikit-learn comes equipped with a number of tools to help, and pandas itself comes with some useful tricks. Thus the first thing we'll do is load the appropriate libraries.

In [1]:
from sklearn import preprocessing
from sklearn import decomposition
from sklearn import pipeline
import pandas as pd
import numpy as np
import hdbscan

Now we need some data to demonstrate on. I've chosen the standard "titanic" dataset to work with. It is available [here](https://git.cse-cst.gc.ca/projects/DM/repos/notebooks/browse/data/titanic.csv). The first thing we need to do is get the data loaded. Pandas can help with that ...

In [2]:
titanic = pd.read_csv('../data/titanic.csv')
titanic.head()

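| | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third | woman | False | NaN | Southampton | yes | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Third | man | True | NaN | Southampton | no | True |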


So we can see we have a complex dataframe (see [Visualising Dataframes with Seaborn](http://jupyter.sigint.cse:8080/urls/git.cse-cst.gc.ca/projects/DM/repos/notebooks/raw/Data_Exploration_and_Visualization/Visualising%20Dataframes%20with%20Seaborn.ipynb?at=refs%2Fheads%2Fmaster) for some exploration of this dataset elsewhere). It has some numeric data, like ``age``, and ``fare``, but there are boolean values like ``survived`` and ``adult_male``, and then there are string valued columns like ``sex``, ``class`` and ``embark_town``. Worse still we also have ``deck`` which has missing values (and other columns may also have missing values!). We somehow need to transform all of this into something we can pass to, say, clustering. How do we make it all numeric and nice?
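
Before transforming anything, it can help to check which columns have which types and how many values are missing. The following is a minimal sketch on a small made-up dataframe (the `toy` frame and its values are assumptions for illustration, not the titanic data); the same two calls should work on `titanic` itself.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame mixing numeric, boolean and string columns.
toy = pd.DataFrame({
    'age': [22.0, np.nan, 26.0],
    'adult_male': [True, False, False],
    'deck': [np.nan, 'C', np.nan],
})

# One dtype per column: tells us which columns need encoding.
print(toy.dtypes)

# Number of missing values per column: tells us which columns need imputing.
missing_counts = toy.isna().sum()
print(missing_counts)
```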

The first step is to make use of pandas itself. Pandas has a very handy function called ``get_dummies`` which looks for string valued columns like ``class`` and ``sex``, for which the data is drawn from a set of categories, and converts them into numeric columns. It does that by creating a new column for each value in each category and encoding a 0 or 1 for each observation. For those in the know this is called "one-hot encoding" of categorical data. Next we need to get just the numerical data from the dataframe. Again pandas makes this easy via the ``select_dtypes`` method, which will select out columns based on their datatype. We simply need to tell this method to use ``np.number`` (a generic numeric class from numpy) and it will return a dataframe with just the numeric data.

In [3]:
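# dtype=float keeps the one-hot columns numeric; newer pandas versions
# produce booleans by default, which select_dtypes would then drop.
numeric_data = pd.get_dummies(titanic, dtype=float).select_dtypes([np.number])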
numeric_data.head(10).T

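| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| survived | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| pclass | 3.0 | 1.0 | 3.0 | 1.0 | 3.0 | 3.0 | 1.0 | 3.0 | 3.0 | 2.0 |
| age | 22.0 | 38.0 | 26.0 | 35.0 | 35.0 | NaN | 54.0 | 2.0 | 27.0 | 14.0 |
| sibsp | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 3.0 | 0.0 | 1.0 |
| parch | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 2.0 | 0.0 |
| fare | 7.25 | 71.2833 | 7.925 | 53.1 | 8.05 | 8.4583 | 51.8625 | 21.075 | 11.1333 | 30.0708 |
| sex_female | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| sex_male | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 |
| embarked_C | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| embarked_Q | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |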
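
To see what the two pandas steps do in isolation, here is a small sketch on a made-up three-row dataframe (the `toy` frame is an assumption for illustration). It shows that the string column is expanded into one indicator column per category, and that the boolean column is not matched by ``np.number`` and so is left out of the numeric selection.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame: one numeric, one string, one boolean column.
toy = pd.DataFrame({
    'age': [22.0, 38.0, np.nan],
    'sex': ['male', 'female', 'female'],
    'adult_male': [True, False, False],
})

# One-hot encode the string column; dtype=float makes the indicators numeric.
encoded = pd.get_dummies(toy, dtype=float)

# Keep only numeric columns; the boolean 'adult_male' column is dropped here.
numeric_only = encoded.select_dtypes([np.number])

print(list(encoded.columns))
print(list(numeric_only.columns))
```

Note that the missing ``age`` value survives both steps untouched, which is why imputation is still needed afterwards.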


This looks great! Except we still have those missing values. We can't actually run many machine learning tools while we have NaNs in our data, so we need a way to get rid of them. Now, we could have dropped the NAs in pandas, but we would end up losing a lot of data from a small dataset, so instead we'll try to impute the values. Scikit-learn provides an imputer class. In our case we'll use the strategy ``'mean'`` which will fill missing values with the mean value of the feature column. Other strategies, such as ``'median'`` and ``'most_frequent'`` also exist, and do exactly what you would expect. The setup here looks a little odd -- we create an imputer object, fit it to the data and then apply a transform to the same data. Why so complicated? We'll come back to this later, but the main thing to note is that we can *train* on one dataset, and then apply what we learned to a different *test* dataset. If we are doing proper cross validation then this is important (when building classifiers, if you judge your accuracy by how well you do on the data you trained on you'll have a bad time -- you need to train on one dataset, and then see how things work on new unseen data).

In [4]:
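# preprocessing.Imputer was removed from newer scikit-learn releases;
# SimpleImputer in sklearn.impute is its replacement.
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='mean')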
imputer.fit(numeric_data)
filled_data = imputer.transform(numeric_data)
filled_data

array([[  0.        ,   3.        ,  22.        , ...,   1.        ,
          1.        ,   0.        ],
       [  1.        ,   1.        ,  38.        , ...,   0.        ,
          0.        ,   1.        ],
       [  1.        ,   3.        ,  26.        , ...,   1.        ,
          0.        ,   1.        ],
       ..., 
       [  0.        ,   3.        ,  29.69911765, ...,   1.        ,
          1.        ,   0.        ],
       [  1.        ,   1.        ,  26.        , ...,   0.        ,
          0.        ,   1.        ],
       [  0.        ,   3.        ,  32.        , ...,   0.        ,
          1.        ,   0.        ]])
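
The point about training on one dataset and applying the result to another can be made concrete with a tiny sketch. The `train` and `test` arrays below are made-up values for illustration; the imputer learns the column means from `train` only and then reuses them to fill the gaps in `test`.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical training data with one gap in each column.
train = np.array([[1.0, 10.0],
                  [3.0, np.nan],
                  [np.nan, 30.0]])

# Hypothetical unseen data where everything is missing.
test = np.array([[np.nan, np.nan]])

imputer = SimpleImputer(strategy='mean')
imputer.fit(train)                      # learns column means 2.0 and 20.0

# The test row is filled with means learned from the training data.
filled_test = imputer.transform(test)
print(imputer.statistics_)
print(filled_test)
```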

Okay we filled in all the NaN values with a reasonable guess. Our next problem is that the data isn't exactly all on the same scale. Ages range from 2 to 80, while ``pclass`` ranges from 1 to 3. We would like to make all our features roughly comparable, especially if we are going to be clustering them. To do that we want to scale the data so each feature is on the same scale -- in practice we could subtract the mean and divide by the standard deviation (if our data was normally distributed). Since this sort of thing is a common operation scikit-learn provides tools to do this. If you want to just do the standard mean and standard deviation trick then you can use ``StandardScaler``. To be a little more robust to outliers we will use ``RobustScaler`` which uses robust statistics (such as median, and inter-quartile range) to do the scaling.

In [5]:
scaler = preprocessing.RobustScaler()
scaler.fit(filled_data)
scaled_data = scaler.transform(filled_data)
scaled_data

array([[ 0.        ,  0.        , -0.59223982, ...,  0.        ,
         0.        ,  0.        ],
       [ 1.        , -2.        ,  0.63852941, ..., -1.        ,
        -1.        ,  1.        ],
       [ 1.        ,  0.        , -0.28454751, ...,  0.        ,
        -1.        ,  1.        ],
       ..., 
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [ 1.        , -2.        , -0.28454751, ..., -1.        ,
        -1.        ,  1.        ],
       [ 0.        ,  0.        ,  0.17699095, ..., -1.        ,
         0.        ,  0.        ]])
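
To illustrate why the robust version can be preferable, here is a small sketch comparing the two scalers on a single made-up feature with one extreme value (the `fares` numbers are assumptions for illustration). With ``StandardScaler`` the outlier inflates the standard deviation, so the typical values end up squeezed close together; ``RobustScaler`` centres on the median and divides by the inter-quartile range, so the typical values keep a usable spread.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Hypothetical feature: four typical values and one large outlier.
fares = np.array([[7.0], [8.0], [9.0], [10.0], [500.0]])

standard = StandardScaler().fit_transform(fares)
robust = RobustScaler().fit_transform(fares)

# Spread of the four typical values under each scaling.
standard_spread = standard[:4].max() - standard[:4].min()
robust_spread = robust[:4].max() - robust[:4].min()

print(standard.ravel())
print(robust.ravel())
print(standard_spread, robust_spread)
```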

One last thing: we have some redundancy in our features. You'll note that ``embarked_C`` and ``embark_town_Cherbourg`` from our numeric dataframe will be effectively the same. Other features will potentially be highly correlated as well. It would be nice to remove some of this obvious redundancy. We can do this by using Principal Component Analysis to reduce the overall dimension of our dataset, effectively stripping out correlated features. In scikit-learn this is done via the ``PCA`` class. Again it is the same ``fit`` and then ``transform`` process.

In [6]:
reducer = decomposition.PCA(n_components=15)
reducer.fit(scaled_data)
reduced_data = reducer.transform(scaled_data)
reduced_data

array([[-1.46723322,  0.60596188,  0.96579931, ..., -0.0108578 ,
        -0.00542452, -0.03222112],
       [ 2.52931345, -0.47960619, -1.46691513, ..., -0.26042646,
        -0.01231339,  0.04115801],
       [-1.07611019,  0.21351006, -1.61625174, ..., -0.03076527,
        -0.0157039 ,  0.05708236],
       ..., 
       [-0.39814968,  1.41296638, -0.21753861, ...,  0.02286474,
         0.01000691,  0.05320269],
       [ 0.51949812, -1.09365786, -0.82545504, ..., -0.29761009,
        -0.03558347, -0.04695924],
       [-1.44792646, -0.50599888,  0.62005618, ...,  0.01709805,
        -0.01322648, -0.04635845]])
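
As a rough illustration of how PCA strips out redundancy, the sketch below builds a made-up dataset in which one column is an exact copy of another (much like ``embarked_C`` and ``embark_town_Cherbourg``). The data and variable names are assumptions for illustration. The fitted ``explained_variance_ratio_`` attribute shows that the last component carries essentially no variance, so it can be dropped without losing information.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
a = rng.normal(size=200)
b = rng.normal(size=200)

# Hypothetical data: the second column duplicates the first.
X = np.column_stack([a, a, b])

# Fit with all three components to inspect how the variance is distributed.
pca = PCA(n_components=3).fit(X)
print(pca.explained_variance_ratio_)

# Keeping two components is enough to represent this data.
reduced = PCA(n_components=2).fit_transform(X)
print(reduced.shape)
```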

Now we have data that is numeric, has no missing values, is all on the same scale, and has some of the redundancy removed. We are in a place where we could, for example, hand the data to a clustering algorithm. Before I do that, however, I am going to demonstrate another feature of scikit-learn: pipelines. When you have some data to which you want to apply a set of transforms and then run an estimator (a classifier, a clusterer, a regressor, etc.) you can fit all the pieces together in a pipeline object. Then we can call ``fit`` and ``predict`` on the pipeline as a whole and scikit-learn will take care of passing the data from one transform to the next all the way to our final estimator. To construct a pipeline simply instantiate all the transformers and the estimator you want to use and pass them as an ordered list of ``(name, object)`` pairs to the ``Pipeline`` class (the names let you later extract individual steps or interim data from the pipeline if you wish). You can see an example below where we construct the same transforms as above, along with a clusterer object, and hand it all to the pipeline.

In [7]:
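# SimpleImputer replaces preprocessing.Imputer in newer scikit-learn releases.
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='mean')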
scaler = preprocessing.RobustScaler()
reducer = decomposition.PCA(n_components=15)
clusterer = hdbscan.HDBSCAN(min_cluster_size=20)
cluster_pipeline = pipeline.Pipeline([('impute', imputer),
                                      ('scale', scaler),
                                      ('pca', reducer),
                                      ('hdbscan', clusterer)])

Now we can run the pipeline as a whole on the original source data. Since we are doing clustering we are going to predict on the same data we fit with, which is what ``fit_predict`` does in a single call. If this was, say, a classification problem then we could ``fit`` the pipeline to the training data and then run ``predict`` on the test set. Putting all your transforms in a pipeline like this makes them easier to work with, and keeps things consistent when you rerun them, rather than passing the data around yourself by hand.

In [8]:
cluster_labels = cluster_pipeline.fit_predict(numeric_data)
cluster_labels

array([-1, -1,  3, -1, -1,  1, -1, -1, -1, -1, -1, -1,  5, -1, -1, -1, -1,
       -1,  3, -1,  4, -1, -1, -1, -1, -1,  0, -1,  7,  6, -1, -1,  7, -1,
       -1, -1,  0,  5, -1, -1,  3, -1,  0, -1,  7,  6,  1,  7, -1,  3, -1,
        5, -1,  2, -1, -1,  2,  0, -1, -1,  0, -1, -1, -1, -1, -1,  2,  5,
       -1, -1,  4, -1, -1,  0, -1, -1,  6,  6, -1,  3,  5, -1,  7, -1,  2,
       -1, -1,  6, -1,  5,  6,  5, -1, -1, -1,  6, -1, -1,  2,  4,  3,  6,
       -1, -1, -1,  6, -1, -1, -1,  7, -1, -1,  5,  3, -1,  5, -1,  4, -1,
       -1, -1,  6, -1,  2, -1, -1,  1, -1, -1, -1,  0,  5, -1,  2,  4,  0,
       -1, -1, -1, -1, -1, -1,  3,  1,  4, -1, -1, -1, -1,  4,  4, -1, -1,
       -1,  6, -1,  7,  6,  6, -1, -1,  2, -1, -1, -1, -1, -1, -1,  4, -1,
       -1, -1, -1,  5, -1, -1, -1, -1,  4, -1, -1,  0, -1, -1, -1, -1,  7,
       -1,  1, -1,  2,  4, -1, -1, -1, -1,  1, -1,  7, -1,  6, -1, -1,  0,
       -1, -1, -1,  0,  7, -1,  5,  2,  5,  4,  1, -1,  3,  4, -1,  4, -1,
        4, -1,  6, -1,  5
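
The classification case mentioned above, fitting on training data and predicting on a test set, can be sketched as follows. Everything here is an assumption for illustration: the data is randomly generated, and ``LogisticRegression`` and ``train_test_split`` are simply convenient scikit-learn choices rather than anything used in this notebook. The sketch also shows how the step names let you pull a fitted transformer back out of the pipeline.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

# Hypothetical data: 200 rows, 6 features, labels derived from two features.
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 6))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X[::9, 3] = np.nan                      # punch some holes for the imputer

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

toy_pipeline = Pipeline([('impute', SimpleImputer(strategy='mean')),
                         ('scale', RobustScaler()),
                         ('pca', PCA(n_components=3)),
                         ('classify', LogisticRegression())])

# Every step is fitted on the training data only ...
toy_pipeline.fit(X_train, y_train)

# ... and then applied, unchanged, to the unseen test data.
predictions = toy_pipeline.predict(X_test)

# Step names give access to the individual fitted objects.
fitted_pca = toy_pipeline.named_steps['pca']
print(predictions[:10])
print(fitted_pca.components_.shape)
```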

What did we get out? Well we can have a look at one of the clusters from the original dataframe to get an idea ...

In [9]:
titanic[cluster_labels == 3]

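| | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third | woman | False | NaN | Southampton | yes | True |
| 18 | 0 | 3 | female | 31.0 | 1 | 0 | 18.0 | S | Third | woman | False | NaN | Southampton | no | False |
| 40 | 0 | 3 | female | 40.0 | 1 | 0 | 9.475 | S | Third | woman | False | NaN | Southampton | no | False |
| 49 | 0 | 3 | female | 18.0 | 1 | 0 | 17.8 | S | Third | woman | False | NaN | Southampton | no | False |
| 79 | 1 | 3 | female | 30.0 | 0 | 0 | 12.475 | S | Third | woman | False | NaN | Southampton | yes | True |
| 100 | 0 | 3 | female | 28.0 | 0 | 0 | 7.8958 | S | Third | woman | False | NaN | Southampton | no | True |
| 113 | 0 | 3 | female | 20.0 | 1 | 0 | 9.825 | S | Third | woman | False | NaN | Southampton | no | False |
| 142 | 1 | 3 | female | 24.0 | 1 | 0 | 15.85 | S | Third | woman | False | NaN | Southampton | yes | False |
| 216 | 1 | 3 | female | 27.0 | 0 | 0 | 7.925 | S | Third | woman | False | NaN | Southampton | yes | True |
| 235 | 0 | 3 | female | NaN | 0 | 0 | 7.55 | S | Third | woman | False | NaN | Southampton | no | True |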


We can also group the dataframe by the cluster labels and then look at the average values of various columns to see what the clustering has separated out of the data.

In [10]:
titanic.groupby(cluster_labels)[['survived', 'age', 'pclass', 'fare', 'adult_male']].mean()

| | survived | age | pclass | fare | adult_male |
|---|---|---|---|---|---|
| -1 | 0.484171 | 30.331571 | 2.098696 | 45.535514 | 0.536313 |
| 0 | 0.135135 | 28.095238 | 2.945946 | 8.609578 | 1.0 |
| 1 | 0.0625 | 29.6875 | 3.0 | 9.681119 | 1.0 |
| 2 | 1.0 | 31.9875 | 2.0 | 19.387805 | 0.0 |
| 3 | 0.333333 | 26.16129 | 3.0 | 11.104631 | 0.0 |
| 4 | 0.0 | 33.543478 | 1.873418 | 17.135337 | 1.0 |
| 5 | 0.0 | 21.147059 | 3.0 | 8.012247 | 1.0 |
| 6 | 0.0 | 29.09375 | 3.0 | 7.990822 | 1.0 |
| 7 | 0.785714 | 20.357143 | 3.0 | 9.160868 | 0.0 |
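
Grouping a dataframe by an external array of labels, as done above, works because pandas accepts any array of the same length as the dataframe as a grouping key. Here is a minimal sketch on made-up data (the `toy` frame and `labels` array are assumptions for illustration) that also adds the size of each group. In hdbscan's convention the label ``-1`` marks points treated as noise rather than a real cluster, so that row is worth reading separately from the others.

```python
import numpy as np
import pandas as pd

# Hypothetical data and cluster labels; -1 follows hdbscan's noise convention.
toy = pd.DataFrame({'fare': [7.0, 8.0, 50.0, 60.0, 9.0],
                    'survived': [0, 0, 1, 1, 1]})
labels = np.array([0, 0, 1, 1, -1])

# Group rows by the label array, then summarise each group.
grouped = toy.groupby(labels)
summary = grouped.mean()
summary['size'] = grouped.size()

print(summary)
```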
