# Notebook 5A: Basic data cleaning and preprocessing


#### Conceptualized, organized and prepared by: Christopher Monterola

#### This notebook is based on the following references:

**Python Machine Learning by Sebastian Raschka and Vahid Mirjalili,
Second Edition, September 2017.**

# General Idea: Dealing with missing data

It is not uncommon in real-world applications for our samples to be missing one
or more values for various reasons. There could have been an error in the data
collection process, certain measurements are not applicable, or particular fields could
have been simply left blank in a survey, for example. We typically see missing values
as the blank spaces in our data table or as placeholder strings such as NaN, which
stands for not a number, or NULL (a commonly used indicator of unknown values in
relational databases).

Unfortunately, most computational tools are unable to handle such missing values,
or produce unpredictable results if we simply ignore them. Therefore, it is crucial
that we take care of those missing values before we proceed with further analyses.
In this section, we will work through several practical techniques for dealing with
missing values by removing entries from our dataset or imputing missing values
from other samples and features.

*The quality of the data and the amount of useful information that it contains are key factors that determine how well a machine learning algorithm can learn. Therefore, it is absolutely critical that we make sure to examine and preprocess a dataset before we feed it to a learning algorithm. In this chapter, we will discuss the essential data preprocessing techniques that will help us build good machine learning models.*

The topics that we will cover here are as follows:

• Removing and imputing missing values from the dataset

• Getting categorical data into shape for machine learning algorithms

# Step 1. Identify missing values in tabular data

But before we discuss several techniques for dealing with missing values, let's create a simple example data frame from a Comma-separated Values (CSV) file to get a better grasp of the problem:

In [1]:
import pandas as pd
from io import StringIO
csv_data = \
 '''A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
10.0,11.0,12.0,'''

df = pd.read_csv(StringIO(csv_data))
df

      A     B     C    D
0   1.0   2.0   3.0  4.0
1   5.0   6.0   NaN  8.0
2  10.0  11.0  12.0  NaN


Using the preceding code, we read CSV-formatted data into a pandas DataFrame
via the read_csv function and noticed that the two missing cells were replaced by
NaN. The StringIO function in the preceding code example was simply used for the
purposes of illustration. It allows us to read the string assigned to csv_data into a
pandas DataFrame as if it was a regular CSV file on our hard drive.
For a larger DataFrame, it can be tedious to look for missing values manually; in this
case, we can use the isnull method to return a DataFrame with Boolean values that
indicate whether a cell contains a numeric value (False) or if data is missing (True).
Using the sum method, we can then return the number of missing values per column
as follows:

In [2]:
df.isnull().sum()

A    0
B    0
C    1
D    1
dtype: int64

This way, we can count the number of missing values per column; in the following
subsections, we will take a look at different strategies for how to deal with this
missing data.
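Beyond per-column counts, it can help to see which rows are affected and what share of each column is missing. The following is a minimal sketch on the same toy data; the names `missing_per_row` and `missing_share` are illustrative and not part of the original notebook.

```python
import pandas as pd
from io import StringIO

csv_data = '''A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
10.0,11.0,12.0,'''
df = pd.read_csv(StringIO(csv_data))

# number of missing cells in each row (axis=1 sums across the columns)
missing_per_row = df.isnull().sum(axis=1)

# fraction of missing cells in each column (mean of the Boolean mask)
missing_share = df.isnull().mean()

print(missing_per_row)
print(missing_share)
```

The per-column share is often a more useful guide than the raw count when deciding whether a column is worth keeping.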

# Step 2. Eliminate samples or features with missing values

One of the easiest ways to deal with missing data is to simply remove the
corresponding features (columns) or samples (rows) from the dataset entirely; rows
with missing values can be easily dropped via the dropna method:

In [3]:
df.dropna(axis=0)

     A    B    C    D
0  1.0  2.0  3.0  4.0


Similarly, we can drop columns that have at least one NaN in any row by setting the
axis argument to 1:

In [4]:
df.dropna(axis=1)

      A     B
0   1.0   2.0
1   5.0   6.0
2  10.0  11.0


The dropna method supports several additional parameters that can come in handy:

In [5]:
# only drop rows where all columns are NaN
# (returns the whole DataFrame here since we don't have a row where all values are NaN)

df.dropna(how='all')

      A     B     C    D
0   1.0   2.0   3.0  4.0
1   5.0   6.0   NaN  8.0
2  10.0  11.0  12.0  NaN


In [6]:
# drop rows that have fewer than 4 non-NaN values
df.dropna(thresh=4)

     A    B    C    D
0  1.0  2.0  3.0  4.0


In [7]:
# only drop rows where NaN appear in specific columns (here: 'C')
df.dropna(subset=['C'])

      A     B     C    D
0   1.0   2.0   3.0  4.0
2  10.0  11.0  12.0  NaN


Although the removal of missing data seems to be a convenient approach, it also
comes with certain disadvantages; for example, we may end up removing too
many samples, which will make a reliable analysis impossible. Or, if we remove too
many feature columns, we will run the risk of losing valuable information that our
classifier needs to discriminate between classes. In the next section, we will thus
look at one of the most commonly used alternatives for dealing with missing values:
interpolation techniques.
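To make the cost of dropping data concrete, the sketch below measures how much of the toy table survives each dropna call. The variable names `rows_kept` and `cols_kept` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# same values as the example DataFrame used in Step 1
df = pd.DataFrame({'A': [1.0, 5.0, 10.0],
                   'B': [2.0, 6.0, 11.0],
                   'C': [3.0, np.nan, 12.0],
                   'D': [4.0, 8.0, np.nan]})

# fraction of rows that remain after dropping rows with any NaN
rows_kept = df.dropna(axis=0).shape[0] / df.shape[0]

# fraction of columns that remain after dropping columns with any NaN
cols_kept = df.dropna(axis=1).shape[1] / df.shape[1]

print(f"rows kept: {rows_kept:.2f}, columns kept: {cols_kept:.2f}")
```

Here only two missing cells are enough to discard two thirds of the rows or half of the columns, which is why imputation is often preferred.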

# Step 3. Imputing missing values

Often, the removal of samples or dropping of entire feature columns is simply not feasible, because we might lose too much valuable data. In this case, we can use different interpolation techniques to estimate the missing values from the other training samples in our dataset. One of the most common interpolation techniques is mean imputation, where we simply replace the missing value with the mean value of the entire feature column. A convenient way to achieve this is by using the SimpleImputer class from scikit-learn (which replaced the older Imputer class from sklearn.preprocessing), as shown in the following code:

In [8]:
csv_data = \
 '''A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
5.0,6.0,7,8.0
10.0,11.0,7.0,'''

df = pd.read_csv(StringIO(csv_data))

print(df)

import numpy as np
# note: sklearn.preprocessing.Imputer was removed in newer scikit-learn releases;
# SimpleImputer from sklearn.impute replaces it
from sklearn.impute import SimpleImputer
imr = SimpleImputer(missing_values=np.nan, strategy='mean')
# alternatively strategy can be 'most_frequent', 'median'
imr = imr.fit(df.values)
imputed_data = imr.transform(df.values)
imputed_data

      A     B    C    D
0   1.0   2.0  3.0  4.0
1   5.0   6.0  NaN  8.0
2   5.0   6.0  7.0  8.0
3  10.0  11.0  7.0  NaN


array([[ 1.        ,  2.        ,  3.        ,  4.        ],
       [ 5.        ,  6.        ,  5.66666667,  8.        ],
       [ 5.        ,  6.        ,  7.        ,  8.        ],
       [10.        , 11.        ,  7.        ,  6.66666667]])

Here, we replaced each NaN value with the corresponding mean, which is calculated
separately for each feature column. Note that SimpleImputer always works column-wise;
unlike the older Imputer class, it has no axis parameter, so row means would have to
be computed another way (for example, with pandas). Other options for the strategy
parameter are median or most_frequent, where the latter replaces the missing values
with the most frequent values. This is useful for imputing categorical feature values,
for example, a feature column that stores an encoding of color names, such as red,
green, and blue; we will encounter examples of such data later in this notebook.

What if the data mixes categorical and numerical columns? The following class imputes each column according to its dtype:

In [9]:
import pandas as pd
import numpy as np

from sklearn.base import TransformerMixin

class DataFrameImputer(TransformerMixin):
    """Impute missing values.

    Columns of dtype object are imputed with the most frequent value
    in column.

    Columns of other types are imputed with mean of column.
    """

    def fit(self, X, y=None):

        self.fill = pd.Series([X[c].value_counts().index[0]
            if X[c].dtype == np.dtype('O') else X[c].mean() for c in X],
            index=X.columns)

        return self

    def transform(self, X, y=None):
        return X.fillna(self.fill)

data = [
    ['a', 1, 2],
    ['b', 1, 1],
    ['b', 2, 2],
    [np.nan, np.nan, np.nan]
]

X = pd.DataFrame(data)
xt = DataFrameImputer().fit_transform(X)

print('before...')
print(X)
print('after...')
print(xt)

before...
     0    1    2
0    a  1.0  2.0
1    b  1.0  1.0
2    b  2.0  2.0
3  NaN  NaN  NaN
after...
   0         1         2
0  a  1.000000  2.000000
1  b  1.000000  1.000000
2  b  2.000000  2.000000
3  b  1.333333  1.666667


# Step 4. Handling Categorical data


Most machine learning algorithms work with numerical values only. However, it is not uncommon that real-world datasets contain one or more categorical feature columns. In this section, we will make use of simple yet effective examples to see how we deal with this type of data in numerical computing libraries.

#### Nominal and ordinal features


When we are talking about categorical data, we have to further distinguish between
nominal and ordinal features. Ordinal features can be understood as categorical
values that can be sorted or ordered. For example, t-shirt size would be an ordinal
feature, because we can define an order XL > L > M. In contrast, nominal features
don't imply any order and, to continue with the previous example, we could think of
t-shirt color as a nominal feature since it typically doesn't make sense to say that, for
example, red is larger than blue.


#### Creating an example dataset

Before we explore different techniques to handle such categorical data, let's create a
new DataFrame to illustrate the problem:

In [10]:
import pandas as pd
df = pd.DataFrame([
['green', 'M', 10.1, 'class1'],
['red', 'L', 13.5, 'class2'],
['blue', 'XL', 15.3, 'class1']])
df.columns = ['color', 'size', 'price', 'classlabel']
df

   color size  price classlabel
0  green    M   10.1     class1
1    red    L   13.5     class2
2   blue   XL   15.3     class1


As we can see in the preceding output, the newly created DataFrame contains a
nominal feature (color), an ordinal feature (size), and a numerical feature (price)
column. The class labels (assuming that we created a dataset for a supervised
learning task) are stored in the last column. The learning algorithms for classification
that we discuss in this book do not use ordinal information in class labels.

## Step 4.1 Mapping ordinal features

To make sure that the learning algorithm interprets the ordinal features correctly,
we need to convert the categorical string values into integers. Unfortunately, there is
no convenient function that can automatically derive the correct order of the labels
of our size feature, so we have to define the mapping manually. In the following
simple example, let's assume that we know the numerical difference between
features, for example, XL = L + 1 = M + 2:

In [11]:
size_mapping = {
'XL': 2,
'L': 1,
'M': 0}
df['size'] = df['size'].map(size_mapping)
df

   color  size  price classlabel
0  green     0   10.1     class1
1    red     1   13.5     class2
2   blue     2   15.3     class1


If we want to transform the integer values back to the original string representation
at a later stage, we can simply define a reverse-mapping dictionary
inv_size_mapping = {v: k for k, v in size_mapping.items()} that can then be
used via the pandas map method on the transformed feature column, similar to
the size_mapping dictionary that we used previously. We can use it as follows:

In [12]:
inv_size_mapping = {v: k for k, v in size_mapping.items()}
df['size'].map(inv_size_mapping)

0     M
1     L
2    XL
Name: size, dtype: object
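As an alternative to a hand-written dictionary, pandas offers an ordered categorical dtype. The sketch below is not part of the original notebook; the order still has to be stated manually, and the names `size_dtype` and `sizes_cat` are illustrative.

```python
import pandas as pd

sizes = pd.Series(['M', 'L', 'XL'])

# the order is still supplied by hand, smallest category first
size_dtype = pd.CategoricalDtype(categories=['M', 'L', 'XL'], ordered=True)
sizes_cat = sizes.astype(size_dtype)

# integer codes follow the stated category order
codes = sizes_cat.cat.codes.tolist()

# comparisons respect the stated order
larger_than_m = (sizes_cat > 'M').tolist()

print(codes)
print(larger_than_m)
```

The integer codes play the same role as the values in size_mapping, while the column keeps its readable string labels.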

## Step 4.2 Encoding class labels
Many machine learning libraries require that class labels are encoded as integer
values. Although most estimators for classification in scikit-learn convert class
labels to integers internally, it is considered good practice to provide class labels as
integer arrays to avoid technical glitches. To encode the class labels, we can use an
approach similar to the mapping of ordinal features discussed previously. We need
to remember that class labels are not ordinal, and it doesn't matter which integer
number we assign to a particular string label. Thus, we can simply enumerate the
class labels, starting at 0:

In [13]:
import numpy as np
class_mapping = {label:idx for idx,label in enumerate(np.unique(df['classlabel']))}
class_mapping

{'class1': 0, 'class2': 1}

Next, we can use the mapping dictionary to transform the class labels into integers:

In [14]:
df['classlabel'] = df['classlabel'].map(class_mapping)
df

   color  size  price  classlabel
0  green     0   10.1           0
1    red     1   13.5           1
2   blue     2   15.3           0


We can reverse the key-value pairs in the mapping dictionary as follows to map the
converted class labels back to the original string representation:

In [15]:
inv_class_mapping = {v: k for k, v in class_mapping.items()}
df['classlabel'] = df['classlabel'].map(inv_class_mapping)
df

   color  size  price classlabel
0  green     0   10.1     class1
1    red     1   13.5     class2
2   blue     2   15.3     class1


Alternatively, there is a convenient LabelEncoder class directly implemented in
scikit-learn to achieve this:

In [16]:
from sklearn.preprocessing import LabelEncoder
class_le = LabelEncoder()
y = class_le.fit_transform(df['classlabel'].values)
y

array([0, 1, 0])

Note that the fit_transform method is just a shortcut for calling fit and
transform separately, and we can use the inverse_transform method to transform
the integer class labels back into their original string representation:

In [17]:
df

   color  size  price classlabel
0  green     0   10.1     class1
1    red     1   13.5     class2
2   blue     2   15.3     class1
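The inverse_transform call mentioned above can be sketched as follows, on a standalone copy of the class labels (the array `labels` and the name `restored` are introduced here only for illustration):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

labels = np.array(['class1', 'class2', 'class1'])

class_le = LabelEncoder()
y = class_le.fit_transform(labels)        # strings -> integers
restored = class_le.inverse_transform(y)  # integers -> original strings

print(y)
print(restored)
```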


## Step 4.3 Performing one-hot encoding on nominal features

In the previous section, we used a simple dictionary-mapping approach to convert
the ordinal size feature into integers. Since scikit-learn's estimators for classification
treat class labels as categorical data that does not imply any order (nominal), we used
the convenient LabelEncoder to encode the string labels into integers. It may appear
that we could use a similar approach to transform the nominal color column of our
dataset, as follows:

In [18]:
X = df[['color', 'size', 'price']].values
color_le = LabelEncoder()
X[:, 0] = color_le.fit_transform(X[:, 0])
X

array([[1, 0, 10.1],
       [2, 1, 13.5],
       [0, 2, 15.3]], dtype=object)

After executing the preceding code, the first column of the NumPy array X now
holds the new color values, which are encoded as follows:

• blue = 0

• green = 1

• red = 2

If we stop at this point and feed the array to our classifier, we will make one of the
most common mistakes in dealing with categorical data. Can you spot the problem?
Although the color values don't come in any particular order, a learning algorithm
will now assume that green is larger than blue, and red is larger than green.
Although this assumption is incorrect, the algorithm could still produce useful
results. However, those results would not be optimal.
A common workaround for this problem is to use a technique called one-hot
encoding. The idea behind this approach is to create a new dummy feature for each
unique value in the nominal feature column. Here, we would convert the color
feature into three new features: blue, green, and red. Binary values can then be
used to indicate the particular color of a sample; for example, a blue sample can be
encoded as blue=1, green=0, red=0. To perform this transformation, we can use the
OneHotEncoder that is implemented in the sklearn.preprocessing module, wrapped in a
ColumnTransformer so that only the color column is encoded:

In [19]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

ct = ColumnTransformer([("color", OneHotEncoder(), [0])], remainder = 'passthrough')
X = ct.fit_transform(X)
X

array([[0.0, 1.0, 0.0, 0, 10.1],
       [0.0, 0.0, 1.0, 1, 13.5],
       [1.0, 0.0, 0.0, 2, 15.3]], dtype=object)

In the preceding code, we wrapped the OneHotEncoder in a ColumnTransformer and
passed the column position of the variable that we want to transform as the list [0]
(note that color is the first column in the feature matrix X); setting
remainder='passthrough' keeps the size and price columns unchanged. Older
scikit-learn releases selected the column through the categorical_features parameter
of OneHotEncoder, which has since been removed. When used on its own, the
OneHotEncoder returns a sparse matrix by default from its transform method, which
can be converted into a regular (dense) NumPy array for the purpose of visualization
via the toarray method. Sparse matrices are a more efficient way of storing large
datasets and are supported by many scikit-learn functions, which is especially useful
if an array contains a lot of zeros. To omit the toarray step, we could alternatively
initialize the encoder as OneHotEncoder(..., sparse_output=False) (sparse=False in
scikit-learn releases before 1.2) to return a regular NumPy array.
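To see the sparse default and the toarray conversion in isolation, here is a minimal sketch that applies OneHotEncoder directly to a single color column. The array `colors` is a standalone copy of the color values, introduced only for this example.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# one feature column, three samples
colors = np.array([['green'], ['red'], ['blue']])

ohe = OneHotEncoder()
encoded = ohe.fit_transform(colors)  # sparse matrix by default
dense = encoded.toarray()            # dense NumPy array for display

print(ohe.categories_)  # learned categories, in sorted order
print(dense)
```

The columns of the dense array follow the order in `ohe.categories_`, so the first column corresponds to blue, the second to green, and the third to red.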

An even more convenient way to create those dummy features via one-hot encoding
is to use the get_dummies function implemented in pandas. Applied to a DataFrame,
get_dummies will only convert string columns and leave all other columns unchanged.
(In pandas 2.0 and later, the dummy columns are Boolean by default; passing
dtype=int reproduces the 0/1 output shown here.)

In [20]:
pd.get_dummies(df[['price', 'color', 'size']])

   price  size  color_blue  color_green  color_red
0   10.1     0           0            1          0
1   13.5     1           0            0          1
2   15.3     2           1            0          0


When we use one-hot encoded datasets, we have to keep in mind that the encoding
introduces multicollinearity, which can be an issue for certain methods (for instance,
methods that require matrix inversion). If features are highly correlated, matrices are
computationally difficult to invert, which can lead to numerically unstable estimates.
To reduce the correlation among variables, we can simply remove one feature
column from the one-hot encoded array. Note that we do not lose any important
information by removing a feature column, though; for example, if we remove the
column color_blue, the feature information is still preserved since if we observe
color_green=0 and color_red=0, it implies that the observation must be blue.

If we use the get_dummies function, we can drop the first column by passing a True
argument to the drop_first parameter, as shown in the following code example:

In [21]:
pd.get_dummies(df[['price', 'color', 'size']], drop_first=True)


   price  size  color_green  color_red
0   10.1     0            1          0
1   13.5     1            0          1
2   15.3     2            0          0
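The scikit-learn encoder offers a comparable option. The following sketch assumes a scikit-learn release in which OneHotEncoder accepts the drop parameter, and again uses a standalone `colors` array for illustration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

colors = np.array([['green'], ['red'], ['blue']])

# drop='first' removes the column of the first (alphabetically sorted) category
ohe = OneHotEncoder(drop='first')
reduced = ohe.fit_transform(colors).toarray()

print(ohe.categories_)
print(reduced)
```

Because blue sorts first, its column is dropped, and a blue sample is represented by zeros in both remaining columns, mirroring the get_dummies result above.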


# Step 5. Bringing features onto the same scale

Feature scaling is a crucial step in our preprocessing pipeline that can easily be
forgotten. Decision trees and random forests are two of the very few machine
learning algorithms where we don't need to worry about feature scaling. Those
algorithms are scale invariant. However, the majority of machine learning and
optimization algorithms behave much better if features are on the same scale.

The importance of feature scaling can be illustrated by a simple example. Let's
assume that we have two features, where the first is measured on a scale from 1
to 10 and the second on a scale from 1 to 100,000.
When we think of the squared error function in Adaline (Chapter 2, Training Simple
Machine Learning Algorithms for Classification, in the reference book), it is intuitive
to say that the algorithm will mostly be busy optimizing the weights according to the
larger errors in the second feature. Another example is the k-nearest neighbors (KNN)
algorithm with a Euclidean distance measure; the computed distances between samples
will be dominated by the second feature axis.
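A small numerical sketch makes the KNN point concrete. The two samples below are made-up values chosen only to match the scales described above.

```python
import numpy as np

# two hypothetical samples: feature 1 on a 1-10 scale, feature 2 on a 1-100,000 scale
a = np.array([1.0, 20000.0])
b = np.array([10.0, 21000.0])

diff = a - b
distance = np.sqrt(np.sum(diff ** 2))  # Euclidean distance

# share of the squared distance contributed by the second feature
share_feature2 = diff[1] ** 2 / np.sum(diff ** 2)

print(distance)
print(share_feature2)
```

Even though the first feature spans almost its entire range between the two samples, nearly all of the squared distance comes from the second feature.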


Now, there are two common approaches to bring different features onto the same
scale: normalization and standardization. Those terms are often used quite loosely
in different fields, and the meaning has to be derived from the context. Most often,
normalization refers to the rescaling of the features to a range of [0, 1], which is a
special case of min-max scaling. To normalize our data, we can simply apply the
min-max scaling to each feature column, where the new value $x^{(i)}_{norm}$ of a sample $x^{(i)}$
can be calculated as follows:

\begin{equation}
x^{(i)}_{norm} =\frac{x^{(i)} -x_{min}}{x_{max} -x_{min}}
\end{equation}

Here, $x^{(i)}$ is a particular sample, $x_{min}$ is the smallest value in a feature column, and
$x_{max}$ the largest value. The min-max scaling procedure is implemented in scikit-learn and can be used as
follows:

In [22]:
from sklearn.preprocessing import MinMaxScaler
mms = MinMaxScaler()

X_norm = mms.fit_transform(X)
print(X)
print(X_norm)

[[0.0 1.0 0.0 0 10.1]
 [0.0 0.0 1.0 1 13.5]
 [1.0 0.0 0.0 2 15.3]]
[[0.         1.         0.         0.         0.        ]
 [0.         0.         1.         0.5        0.65384615]
 [1.         0.         0.         1.         1.        ]]
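To connect the equation with the scaler output, the sketch below applies the min-max formula by hand to the price column and compares it with MinMaxScaler. The array `prices` is a standalone copy of the price values.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

prices = np.array([[10.1], [13.5], [15.3]])

# min-max formula applied column-wise
x_min = prices.min(axis=0)
x_max = prices.max(axis=0)
manual = (prices - x_min) / (x_max - x_min)

scaled = MinMaxScaler().fit_transform(prices)

print(manual.ravel())
print(np.allclose(manual, scaled))
```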


Although normalization via min-max scaling is a commonly used technique that
is useful when we need values in a bounded interval, standardization can be more
practical for many machine learning algorithms, especially for optimization algorithms
such as gradient descent. The reason is that many linear models, such as logistic
regression and SVM, initialize the weights to 0 or small random values close
to 0.

Using standardization, we center the feature columns at mean 0 with standard
deviation 1 so that the feature columns have the same mean and variance as a
standard normal distribution, which makes it easier to learn the weights. Note that
standardization only shifts and rescales the values; it does not change the shape of
their distribution. Furthermore, standardization maintains useful information about
outliers and makes the algorithm less sensitive to them in contrast to min-max
scaling, which scales the data to a limited range of values.

The procedure for standardization can be expressed by the following equation:

\begin{equation}
x^{(i)}_{std} = \frac{x^{(i)}-\mu_x}{\sigma_x}
\end{equation}


Here, $\mu_x$ is the sample mean of a particular feature column and $\sigma_x$ is the corresponding standard deviation.

The following code illustrates the difference between the two commonly used feature scaling techniques, standardization and normalization, on a simple sample dataset consisting of the numbers 0 to 5:

In [23]:
from sklearn.preprocessing import MinMaxScaler, StandardScaler

mms = MinMaxScaler()
stdsc = StandardScaler()

# column vector with the values 0..5 (the scalers expect samples in rows)
ex = np.array([[0, 1, 2, 3, 4, 5]]).T

X_norm = mms.fit_transform(ex)
print("X_norm")
print(X_norm)

X_std = stdsc.fit_transform(ex)
print("\nX_std")
print(X_std)



X_norm
[[0. ]
 [0.2]
 [0.4]
 [0.6]
 [0.8]
 [1. ]]

X_std
[[-1.46385011]
 [-0.87831007]
 [-0.29277002]
 [ 0.29277002]
 [ 0.87831007]
 [ 1.46385011]]
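The standardized values above can be reproduced by hand. The sketch below also shows that StandardScaler divides by the population standard deviation (ddof=0), so using the sample standard deviation (ddof=1, the pandas default) gives slightly different numbers.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

ex = np.arange(6, dtype=float).reshape(-1, 1)  # column vector 0..5

# population standard deviation (ddof=0), matching StandardScaler
manual_std = (ex - ex.mean(axis=0)) / ex.std(axis=0)
sk_std = StandardScaler().fit_transform(ex)

# sample standard deviation (ddof=1) yields smaller magnitudes
sample_std = (ex - ex.mean(axis=0)) / ex.std(axis=0, ddof=1)

print(np.allclose(manual_std, sk_std))
print(manual_std.ravel())
print(sample_std.ravel())
```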


It is important to highlight that we fit the StandardScaler or MinMaxScaler class only once, on the training data, and use those parameters to transform the test set or any new data point.

>>> X_train_std = stdsc.fit_transform(X_train)

>>> X_test_std = stdsc.transform(X_test)
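A self-contained sketch of this pattern, using a made-up single-feature split (the values of `X_train` and `X_test` are hypothetical):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# hypothetical training and test values for one feature
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
X_test = np.array([[2.5], [10.0]])

stdsc = StandardScaler()
X_train_std = stdsc.fit_transform(X_train)  # learns mean and std from the training data
X_test_std = stdsc.transform(X_test)        # reuses the training mean and std

print(stdsc.mean_, stdsc.scale_)
print(X_test_std.ravel())
```

The test value 2.5 maps to 0 because it equals the training mean, regardless of the mean of the test set itself. Refitting the scaler on the test set would leak information about it and make the two sets incomparable.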

# SUMMARY

In summary, the following are the basic steps in data preprocessing/cleaning. A more tedious job is consolidating categorical values that refer to one and the same thing, such as mapping "Apple, aple, mansanas, appl" all to apple; this can be handled with a mapping similar to the one used for encoding class labels, but building that mapping is generally more laborious.

Step 1. Identify missing values in tabular data

Step 2. Eliminate samples or features with missing values OR

Step 3. Imputing missing values

Step 4. Handling Categorical data 

    4.1 Mapping ordinal features
    
    4.2 Encoding class labels
    
    4.3 Performing one-hot encoding on nominal features

Step 5. Bringing features onto the same scale
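The label consolidation mentioned at the start of this summary can be sketched with the same map approach used for class labels. The lookup table `canonical` is hypothetical and would normally be built by inspecting the unique values by hand.

```python
import pandas as pd

fruit = pd.Series(['Apple', 'aple', 'mansanas', 'appl'])

# hypothetical lookup table from observed spellings to one canonical label
canonical = {
    'apple': 'apple',
    'aple': 'apple',
    'appl': 'apple',
    'mansanas': 'apple',
}

# normalize whitespace and case first, then map to the canonical label
cleaned = fruit.str.strip().str.lower().map(canonical)

print(cleaned.tolist())
```

Values missing from the lookup table become NaN after map, which makes unhandled spellings easy to spot with isnull.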

# Illustration: Consider in the next notebook a Dirty Bank Marketing Data

Bank Marketing Data Set 

Source: [Moro et al., 2014] S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014

Data Set Information:

The data is related to direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the client would subscribe (yes/no) to a term deposit (variable y).


Attribute Information:

Input variables:
#### Bank client data:  
1 - age (numeric)   
2 - job : type of job (categorical: 'admin.','blue collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')   
3 - marital : marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)   
4 - education (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')   
5 - default: has credit in default? (categorical: 'no','yes','unknown')   
6 - housing: has housing loan? (categorical: 'no','yes','unknown')   
7 - loan: has personal loan? (categorical: 'no','yes','unknown')   
   
#### Related with the last contact of the current campaign:
8 - contact: contact communication type (categorical: 'cellular','telephone')    
9 - month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')   
10 - day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')   
11 - duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.    

#### Other attributes:   
12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)   
13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)   
14 - previous: number of contacts performed before this campaign and for this client (numeric)   
15 - poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')    

#### Social and economic context attributes:
16 - emp.var.rate: employment variation rate - quarterly indicator (numeric)   
17 - cons.price.idx: consumer price index - monthly indicator (numeric)    
18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric)    
19 - euribor3m: euribor 3 month rate - daily indicator (numeric)   
20 - nr.employed: number of employees - quarterly indicator (numeric)   

#### Output variable (desired target):  
21 - y - has the client subscribed a term deposit? (binary: 'yes','no')  
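Several attributes above use placeholders ('unknown', or 999 for pdays) rather than NaN, so isnull will not detect them directly. One possible choice, sketched here on three made-up rows that only borrow the column names from the list above, is to convert the placeholders to NaN before applying Step 1. Whether 'unknown' should be treated as missing or kept as its own category is a modeling decision.

```python
import numpy as np
import pandas as pd

# three made-up rows; only the column names come from the attribute list
bank = pd.DataFrame({
    'job': ['admin.', 'unknown', 'technician'],
    'pdays': [999, 3, 999],
    'y': ['no', 'yes', 'no'],
})

# convert the documented placeholders into NaN
bank['job'] = bank['job'].replace('unknown', np.nan)
bank['pdays'] = bank['pdays'].replace(999, np.nan)

missing_counts = bank.isnull().sum()
print(missing_counts)
```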




## Class Exercise: Implement the five steps (if applicable) and classify using kNN, Logistic Regression and Support Vector Machine. 