# Missing value (MV) handling - the basic approaches

Perform listwise deletion and feature deletion on the provided data frame x.

Reminder:
- listwise deletion: all rows containing any MV are removed
- feature deletion: all features (columns) containing any MV are removed

In [2]:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer

x = pd.DataFrame([[1, 2, 0], [np.nan, 3, 1], [7, 6, 0], [1, 6, 0], [np.nan, np.nan, 1], [3, np.nan, 0]])
print("Original data: \n", x)

#Listwise deletion


#Feature deletion


Original data: 
      0    1  2
0  1.0  2.0  0
1  NaN  3.0  1
2  7.0  6.0  0
3  1.0  6.0  0
4  NaN  NaN  1
5  3.0  NaN  0

Listwise deletion
Data after deleting rows with missing values: 
      0    1  2
0  1.0  2.0  0
2  7.0  6.0  0
3  1.0  6.0  0

Feature deletion
Data after deleting features with missing values: 
    2
0  0
1  1
2  0
3  0
4  1
5  0
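
One way to obtain the two results above is pandas' `DataFrame.dropna`. The following is a minimal sketch rather than the only possible solution; the variable names `x_rows` and `x_cols` are chosen here for illustration.

```python
import numpy as np
import pandas as pd

x = pd.DataFrame([[1, 2, 0], [np.nan, 3, 1], [7, 6, 0],
                  [1, 6, 0], [np.nan, np.nan, 1], [3, np.nan, 0]])

# Listwise deletion: drop every row that contains at least one NaN.
x_rows = x.dropna(axis=0)
print("Data after deleting rows with missing values: \n", x_rows)

# Feature deletion: drop every column that contains at least one NaN.
x_cols = x.dropna(axis=1)
print("Data after deleting features with missing values: \n", x_cols)
```

Note that `dropna` returns a new data frame and leaves `x` unchanged, so the original data stays available for comparison.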


# Univariate feature imputation

Univariate imputation fills the missing values in a feature using only the non-missing values of that same feature (and no other features).

The SimpleImputer class provides basic strategies for imputing missing values. Missing values can be imputed with a provided constant value, or using a statistic (mean, median or most frequent value) of the column in which the missing values are located. The class also supports different missing-value encodings.
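
To illustrate the point about encodings, the sketch below assumes a hypothetical data set in which missing entries are coded as -1 instead of NaN. The array `x_coded` is made up for this example; `missing_values` is the SimpleImputer parameter that names the placeholder to look for.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data: -1 (not NaN) marks a missing entry.
x_coded = np.array([[1, 2], [-1, 3], [7, 6], [1, 6]])

# Tell the imputer which value encodes "missing", then impute the column mean.
imputer = SimpleImputer(missing_values=-1, strategy="mean")
x_imputed = imputer.fit_transform(x_coded)
print(x_imputed)
```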

Based on the provided array x, impute the missing value using the SimpleImputer class with each of the following strategies:
- mean
- median
- constant
- most frequent
Fit and transform the data and print the corresponding result for each strategy!

In [1]:
import numpy as np
from sklearn.impute import SimpleImputer

x= np.array([[1, 2], [np.nan, 3], [7, 6],[1, 6]])
print("Original data: \n",x)

# Imputation using strategy='mean'


# Imputation using strategy='median'


# Imputation using strategy='constant'


# Imputation using strategy='most_frequent'


Original data: 
 [[ 1.  2.]
 [nan  3.]
 [ 7.  6.]
 [ 1.  6.]]
Transformed data (mean imputation): 
 [[1. 2.]
 [3. 3.]
 [7. 6.]
 [1. 6.]]
Transformed data (median imputation): 
 [[1. 2.]
 [1. 3.]
 [7. 6.]
 [1. 6.]]
Transformed data (constant imputation): 
 [[ 1.  2.]
 [17.  3.]
 [ 7.  6.]
 [ 1.  6.]]
Transformed data (most frequent value imputation): 
 [[1. 2.]
 [1. 3.]
 [7. 6.]
 [1. 6.]]
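
A possible way to produce the four results is sketched below. It loops over the strategies instead of writing four separate cells, and it assumes `fill_value=17` for the constant strategy, which is the value visible in the printed output above. The `results` dictionary is only a convenience introduced here.

```python
import numpy as np
from sklearn.impute import SimpleImputer

x = np.array([[1, 2], [np.nan, 3], [7, 6], [1, 6]])

results = {}
for strategy in ("mean", "median", "constant", "most_frequent"):
    # Only the constant strategy needs a replacement value (17 is assumed here).
    kwargs = {"fill_value": 17} if strategy == "constant" else {}
    imputer = SimpleImputer(strategy=strategy, **kwargs)
    # fit_transform learns the per-column statistic and fills the gaps in one step.
    results[strategy] = imputer.fit_transform(x)
    print(f"Transformed data ({strategy} imputation): \n", results[strategy])
```

For the first column the non-missing values are 1, 7 and 1, so the mean is 3, while the median and the most frequent value are both 1.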


Univariate imputation can also be applied to string values.

Use the SimpleImputer class again in combination with a "most frequent" strategy.

You should get the ValueError shown below - and hopefully no other errors ;)

Try to fix it!

In [5]:
#Univariate imputation with string values
#first try
x_string= np.array([["Mike", 2], [np.nan, 3], ["Peter", 6],["Peter", 6]])
print("Original data: \n",x_string)

# Imputation using strategy='most_frequent'


Original data: 
 [['Mike' '2']
 ['nan' '3']
 ['Peter' '6']
 ['Peter' '6']]


ValueError: SimpleImputer does not support data with dtype <U5. Please provide either a numeric array (with a floating point or integer dtype) or categorical data represented either as an array with integer dtype or an array of string values with an object dtype.
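
The printed array hints at the cause: NumPy coerces the mixed list to a fixed-width string dtype, which also turns np.nan into the string 'nan'. One possible fix, sketched here under that assumption, is to create the array with `dtype=object` so that the strings, the numbers and the NaN keep their Python types.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# dtype=object keeps np.nan as a real missing value instead of the string 'nan'.
x_string = np.array([["Mike", 2], [np.nan, 3], ["Peter", 6], ["Peter", 6]],
                    dtype=object)
print("Original data: \n", x_string)

# Replace the missing name with the most frequent value of its column.
imputer = SimpleImputer(strategy="most_frequent")
x_filled = imputer.fit_transform(x_string)
print("Transformed data (most frequent value imputation): \n", x_filled)
```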

Use the same array again (including the missing value in the second row), apply a "constant" imputation this time, and replace the MV with your favourite name!

Original data: 
 [['Mike' 2]
 [nan 3]
 ['Peter' 6]
 ['Peter' 6]]
Transformed data (constant value imputation): 
 [['Mike' 2]
 ['Hugo' 3]
 ['Peter' 6]
 ['Peter' 6]]
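
The result above could be obtained along the following lines. This sketch again assumes the array is built with `dtype=object`, and it uses "Hugo" as the fill value because that is the name appearing in the output.

```python
import numpy as np
from sklearn.impute import SimpleImputer

x_string = np.array([["Mike", 2], [np.nan, 3], ["Peter", 6], ["Peter", 6]],
                    dtype=object)
print("Original data: \n", x_string)

# Replace every missing entry with a fixed value supplied via fill_value.
imputer = SimpleImputer(strategy="constant", fill_value="Hugo")
x_const = imputer.fit_transform(x_string)
print("Transformed data (constant value imputation): \n", x_const)
```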
