In [1]:
%matplotlib inline
from matplotlib import pyplot as plt
import numpy as np
from lib import plot_decision_regions

In [2]:
import pandas as pd

In the example below, we simulate reading a dataset with missing values:

In [3]:
from io import StringIO
csv_data = '''A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
0.0,11.0,12.0'''
df = pd.read_csv(StringIO(csv_data))

# empty values replaced with NaN
print("Data set with empty values")
print(df)
print()

# number of empty values in each column
print("No. of empty values in each column")
print(df.isnull().sum())
print()

# access underlying numpy array
print("Underlying numpy array")
print(df.values)

Data set with empty values
     A     B     C    D
0  1.0   2.0   3.0  4.0
1  5.0   6.0   NaN  8.0
2  0.0  11.0  12.0  NaN

No. of empty values in each column
A    0
B    0
C    1
D    1
dtype: int64

Underlying numpy array
[[  1.   2.   3.   4.]
 [  5.   6.  nan   8.]
 [  0.  11.  12.  nan]]


## eliminating samples or features with missing values

In [4]:
# drop rows that contain at least one missing value
df.dropna()

     A    B    C    D
0  1.0  2.0  3.0  4.0


In [5]:
# dropping columns with at least one NaN in any row
df.dropna(axis = 1)

     A     B
0  1.0   2.0
1  5.0   6.0
2  0.0  11.0


In [6]:
# only drop rows where all columns are NaN
df.dropna(how='all')

     A     B     C    D
0  1.0   2.0   3.0  4.0
1  5.0   6.0   NaN  8.0
2  0.0  11.0  12.0  NaN


In [7]:
# drop rows that have fewer than 4 non-NaN values
df.dropna(thresh=4)

     A    B    C    D
0  1.0  2.0  3.0  4.0


In [8]:
# only drop rows where NaN appears in specific columns (here: 'C')
df.dropna(subset=['C'])

     A     B     C    D
0  1.0   2.0   3.0  4.0
2  0.0  11.0  12.0  NaN


## imputing missing values

One of the most common interpolation techniques is
**mean imputation**, where we simply replace each missing value
with the mean of the entire column.

In [9]:
# use the Imputer class from sklearn to perform mean imputation
from sklearn.preprocessing import Imputer
# axis=0 for columns (axis=1 for rows)
# other strategies: most_frequent, median
imr = Imputer(missing_values='NaN', strategy='mean', axis=0)
# the mean is separately calculated for each column
imr = imr.fit(df)
imputed_data = imr.transform(df.values)
imputed_data

array([[  1. ,   2. ,   3. ,   4. ],
       [  5. ,   6. ,   7.5,   8. ],
       [  0. ,  11. ,  12. ,   6. ]])
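
Note that `Imputer` was deprecated in scikit-learn 0.20 and removed in 0.22. A minimal sketch of the same mean imputation with its replacement, `SimpleImputer` from `sklearn.impute`, might look like the following. It assumes a recent scikit-learn version and rebuilds the example data so that it runs on its own; `df_missing` is just an illustrative name. `SimpleImputer` always works column-wise, so there is no `axis` argument.

```python
from io import StringIO

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# same example data as above; the last row is missing its value for D
csv_data = '''A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
0.0,11.0,12.0'''
df_missing = pd.read_csv(StringIO(csv_data))

# replace each NaN with the mean of its column
imr = SimpleImputer(missing_values=np.nan, strategy='mean')
imputed_data = imr.fit_transform(df_missing.values)
print(imputed_data)
```

Other strategies such as `'median'` and `'most_frequent'` are selected through the same `strategy` parameter.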

## handling categorical data

In [10]:
import pandas as pd
df = pd.DataFrame([
        ['green', 'M', 10.1, 'class1'],
        ['red', 'L', 13.5, 'class2'],
        ['blue', 'XL', 15.3, 'class1']
    ])
df.columns = ['color', 'size', 'price', 'classlabel']
df
# color is nominal, size is ordinal, price is numerical

   color size  price classlabel
0  green    M   10.1     class1
1    red    L   13.5     class2
2   blue   XL   15.3     class1


The learning algorithms for classification discussed in the book do not use ordinal information in class labels.

For ordinal features such as `size`, we have to manually define a mapping that converts the labels to numbers.

In [11]:
size_mapping = {
    'XL': 3,
    'L': 2,
    'M': 1
}
df['size'] = df['size'].map(size_mapping)
df

   color  size  price classlabel
0  green     1   10.1     class1
1    red     2   13.5     class2
2   blue     3   15.3     class1


In this case, we have assumed the relationship between the different size labels to be XL = L + 1 = M + 2.

To convert back to the ordinal labels, we can generate an inverse mapping:

In [15]:
inv_size_mapping = {v: k for k, v in size_mapping.items()}
inv_size_mapping

{1: 'M', 2: 'L', 3: 'XL'}
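
As a small sketch of how the inverse mapping could be applied, the encoded values can be passed through `map` a second time. The series is rebuilt here so the snippet is self-contained; `sizes`, `sizes_encoded` and `sizes_restored` are illustrative names, not part of the notebook above.

```python
import pandas as pd

size_mapping = {'XL': 3, 'L': 2, 'M': 1}
inv_size_mapping = {v: k for k, v in size_mapping.items()}

sizes = pd.Series(['M', 'L', 'XL'], name='size')

# forward: ordinal labels -> integers
sizes_encoded = sizes.map(size_mapping)

# backward: integers -> original ordinal labels
sizes_restored = sizes_encoded.map(inv_size_mapping)
print(sizes_restored.tolist())
```

Values that are missing from the mapping dictionary become `NaN` after `map`, so it is worth checking that every label is covered.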

### encoding class labels

Many libraries require class labels to be encoded as integer values. Since it doesn't matter which number is assigned to which label, we can simply enumerate class labels starting at 0.

In [12]:
class_mapping = {label: idx for idx, label in 
                enumerate(np.unique(df['classlabel']))}
class_mapping

{'class1': 0, 'class2': 1}

Next, we transform the class labels into integers:

In [13]:
df['classlabel'] = df['classlabel'].map(class_mapping)
df

   color  size  price  classlabel
0  green     1   10.1           0
1    red     2   13.5           1
2   blue     3   15.3           0


We can reverse the key-value pairs to convert the integers back to string
class labels:

In [14]:
inv_class_mapping = {v: k for k, v in class_mapping.items()}
df['classlabel'] = df['classlabel'].map(inv_class_mapping)
df

   color  size  price classlabel
0  green     1   10.1     class1
1    red     2   13.5     class2
2   blue     3   15.3     class1


Alternatively, we can use the **`LabelEncoder`** class built into scikit-learn for this:

In [16]:
from sklearn.preprocessing import LabelEncoder
class_le = LabelEncoder()
# fit_transform is a shortcut for calling fit then transform
y = class_le.fit_transform(df['classlabel'].values)
y

array([0, 1, 0])

We can use the `inverse_transform` method to convert the integers back to string labels:

In [17]:
class_le.inverse_transform(y)

array(['class1', 'class2', 'class1'], dtype=object)

### one-hot encoding on nominal features

Since nominal features have no intrinsic order, it may seem appropriate to encode them the same way we encode class labels, as follows:

In [22]:
color_le = LabelEncoder()
X = df[['color', 'size', 'price']].values
X[:, 0] = color_le.fit_transform(X[:, 0])
X

array([[1, 1, 10.1],
       [2, 2, 13.5],
       [0, 3, 15.3]], dtype=object)

However, the ML algorithm will mistakenly assume that green > blue and red > green
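
To see where this implied order comes from, we can inspect the fitted encoder. `LabelEncoder` sorts the distinct labels and numbers them in that order, so the integer codes simply reflect alphabetical order rather than anything meaningful about the colors. A brief sketch using the same three colors (`code_of` is an illustrative name):

```python
from sklearn.preprocessing import LabelEncoder

colors = ['green', 'red', 'blue']
color_le = LabelEncoder()
codes = color_le.fit_transform(colors)

# classes_ holds the sorted labels; a label's position is its integer code
code_of = {str(label): int(code)
           for label, code in zip(color_le.classes_,
                                  color_le.transform(color_le.classes_))}
print(code_of)
```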

A common workaround is to use **one-hot encoding**: we create a binary feature for each distinct value, so a blue sample could be encoded as `blue=1,green=0,red=0`. We use the **`OneHotEncoder`** in sklearn to achieve this:

In [26]:
from sklearn.preprocessing import OneHotEncoder
# specify the indices of the columns that contain
# categorical values
ohe = OneHotEncoder(categorical_features=[0])
# the encoder's transform method returns a sparse matrix,
# which we convert to a regular (dense) numpy array
# we could pass sparse=False to the constructor to avoid this
ohe.fit_transform(X).toarray()

array([[  0. ,   1. ,   0. ,   1. ,  10.1],
       [  0. ,   0. ,   1. ,   2. ,  13.5],
       [  1. ,   0. ,   0. ,   3. ,  15.3]])
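
The `categorical_features` parameter was deprecated in scikit-learn 0.20 and removed in 0.22. In newer versions, one way to get the same result is to wrap `OneHotEncoder` in a `ColumnTransformer`. The following is a sketch under that assumption, with the example data rebuilt so it runs on its own (`X_df`, `ct` and `X_encoded` are illustrative names). Setting `sparse_threshold=0` asks the transformer to always return a dense array.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

X_df = pd.DataFrame({
    'color': ['green', 'red', 'blue'],
    'size': [1, 2, 3],
    'price': [10.1, 13.5, 15.3],
})

# one-hot encode column 0 (color) and pass the other columns through unchanged
ct = ColumnTransformer(
    transformers=[('onehot', OneHotEncoder(), [0])],
    remainder='passthrough',
    sparse_threshold=0,
)
X_encoded = np.asarray(ct.fit_transform(X_df), dtype=float)
print(X_encoded)
```

The one-hot columns come first (blue, green, red), followed by the passed-through `size` and `price` columns, which matches the layout of the output above.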

A more convenient way of one-hot encoding is the **`get_dummies`** function in pandas. Applied to a DataFrame, it converts only the string columns and leaves the others unchanged:

In [28]:
pd.get_dummies(df[['price', 'color', 'size']])

   price  size  color_blue  color_green  color_red
0   10.1     1         0.0          1.0        0.0
1   13.5     2         0.0          0.0        1.0
2   15.3     3         1.0          0.0        0.0
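
The dtype of the dummy columns depends on the pandas version; recent versions return booleans by default rather than the floats shown above. If a numeric dtype is needed, it can be requested explicitly through the `dtype` parameter, as in this sketch (`df_cat` and `dummies` are illustrative names, and the data is rebuilt so the snippet runs on its own):

```python
import pandas as pd

df_cat = pd.DataFrame({
    'price': [10.1, 13.5, 15.3],
    'color': ['green', 'red', 'blue'],
    'size': [1, 2, 3],
})

# only the string column 'color' is expanded; dtype=float keeps 0.0/1.0 values
dummies = pd.get_dummies(df_cat, dtype=float)
print(dummies)
```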
