# Step 3 - Data Preprocessing

In the previous two modules, you:
* Learned to acquire data through different methods
* Discovered multiple formats in which data can be found and how to interact with each of them
* Performed exploratory data analysis 

Now we will give you a brief introduction to data preprocessing. By the end of this module you will be able to:

1. Handle missing values and outliers
2. Perform data transformations
3. Apply feature selection techniques

For this module we are going to use a different dataset from the one used in the previous two modules: the Credit Approval dataset from the [UC Irvine Machine Learning Repository](https://archive.ics.uci.edu/). This dataset contains information about credit card applications. Note that the features are anonymized (they are simply named A1 to A15) to protect the confidentiality of the data.

In [1]:
# We start by loading the packages we are going to work with
from ucimlrepo import fetch_ucirepo 
import pandas as pd 
import numpy as np
from scipy import stats
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.decomposition import PCA
from sklearn.feature_selection import VarianceThreshold
from sklearn.feature_selection import SelectKBest, f_classif

In [2]:
# load our data https://archive.ics.uci.edu/dataset/27/credit+approval
credit_approval = fetch_ucirepo(id=27) 
  
# data (as pandas dataframes) 
X = credit_approval.data.features 
y = credit_approval.data.targets 
X

Unnamed: 0,A15,A14,A13,A12,A11,A10,A9,A8,A7,A6,A5,A4,A3,A2,A1
0,0,202.0,g,f,1,t,t,1.25,v,w,g,u,0.000,30.83,b
1,560,43.0,g,f,6,t,t,3.04,h,q,g,u,4.460,58.67,a
2,824,280.0,g,f,0,f,t,1.50,h,q,g,u,0.500,24.50,a
3,3,100.0,g,t,5,t,t,3.75,v,w,g,u,1.540,27.83,b
4,0,120.0,s,f,0,f,t,1.71,v,w,g,u,5.625,20.17,b
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
685,0,260.0,g,f,0,f,f,1.25,h,e,p,y,10.085,21.08,b
686,394,200.0,g,t,2,t,f,2.00,v,c,g,u,0.750,22.67,a
687,1,200.0,g,t,1,t,f,2.00,ff,ff,p,y,13.500,25.25,a
688,750,280.0,g,f,0,f,f,0.04,v,aa,g,u,0.205,17.92,b


## Missing values and outliers

One of the most commonly encountered problems is the presence of missing values. Some of the ways we can treat these observations are the following:

* Remove observations: Simply remove the observations for which relevant data is missing.
* Median imputation (continuous features): Replace missing values with the median of the distribution, particularly if the distribution of the data is skewed.
* Mean imputation (continuous features): Replace missing values with the mean of the distribution, particularly if the data is approximately normally distributed.

Outliers are data points whose values differ significantly from the rest of the observations. One way to deal with outliers is to remove the data points whose value lies more than a certain number of standard deviations from the mean. In other words, we remove an observation if the absolute value of its z-score is above a certain threshold.
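As a minimal sketch of the z-score idea described above, the following example computes z-scores by hand with NumPy. The values and the cutoff of 2 are made up for illustration; they are not taken from the Credit Approval data.

```python
import numpy as np

# Toy data (hypothetical): five typical values and one extreme value
values = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 50.0])

# z-score: distance from the mean, measured in standard deviations
z_scores = (values - values.mean()) / values.std()

threshold = 2  # Hypothetical cutoff for this tiny sample; 3 is another common choice
is_outlier = np.abs(z_scores) > threshold

print(values[is_outlier])   # Values flagged as outliers
print(values[~is_outlier])  # Values we would keep
```

With very small samples the largest possible z-score is limited, which is why this toy example uses a lower cutoff than the one used later in this module.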

Now, let's see some code to handle these cases.

In [3]:
# First, we identify the presence of missing values
for x in X.columns:
    print(f"The column {x} has missing values: {X[x].isna().any()}")
print(f"\n{len(X)}")

The column A15 has missing values: False
The column A14 has missing values: True
The column A13 has missing values: False
The column A12 has missing values: False
The column A11 has missing values: False
The column A10 has missing values: False
The column A9 has missing values: False
The column A8 has missing values: False
The column A7 has missing values: True
The column A6 has missing values: True
The column A5 has missing values: True
The column A4 has missing values: True
The column A3 has missing values: False
The column A2 has missing values: True
The column A1 has missing values: True

690


In [4]:
# Remove any row that has a missing value
X_clean = X.dropna() # This single line of code removes any row with a missing value in at least one column
for x in X_clean.columns:
    print(f"The column {x} has missing values: {X_clean[x].isna().any()}")
print(f"\n{len(X_clean)}")

The column A15 has missing values: False
The column A14 has missing values: False
The column A13 has missing values: False
The column A12 has missing values: False
The column A11 has missing values: False
The column A10 has missing values: False
The column A9 has missing values: False
The column A8 has missing values: False
The column A7 has missing values: False
The column A6 has missing values: False
The column A5 has missing values: False
The column A4 has missing values: False
The column A3 has missing values: False
The column A2 has missing values: False
The column A1 has missing values: False

653


As we can see, the second version of our features dataframe has no missing values in any of the columns. However, we also lost 37 of the 690 observations (about 5%).
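Before dropping rows, it can help to quantify how much data would be lost. The sketch below does this on a small made-up dataframe (the column names `num` and `cat` are hypothetical); the same calls should work on `X`.

```python
import numpy as np
import pandas as pd

# Toy dataframe (hypothetical) with one missing value in each column
toy = pd.DataFrame({
    "num": [1.0, np.nan, 3.0, 4.0, 5.0],
    "cat": ["a", "b", None, "a", "b"],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()

# Rows that dropna() would remove, in absolute and relative terms
rows_dropped = len(toy) - len(toy.dropna())
fraction_dropped = rows_dropped / len(toy)

print(missing_per_column)
print(f"Rows dropped: {rows_dropped} ({fraction_dropped:.1%})")
```

If the fraction is large, imputation is usually worth considering instead of removal.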

In [5]:
# Performing mean/median imputations
X_imp = X.copy() # For this example, we are going to work with a copy
floats = X_imp.select_dtypes(include=['float']) # We identify the columns with floating point values
print(f"Continuous features: {floats.columns}\n")
for x in floats:
    if X_imp[x].isna().any():  # Check if the column has missing values
        mean = X_imp[x].mean() # Replace .mean() with .median() to impute the median
        X_imp[x] = X_imp[x].fillna(mean) # Assign the result back; inplace=True on a selected column may not modify X_imp

for x in X_imp.columns:
    print(f"The column {x} has missing values: {X_imp[x].isna().any()}")
print(f"\n{len(X_imp)}")

Continuous features: Index(['A14', 'A8', 'A3', 'A2'], dtype='object')

The column A15 has missing values: False
The column A14 has missing values: False
The column A13 has missing values: False
The column A12 has missing values: False
The column A11 has missing values: False
The column A10 has missing values: False
The column A9 has missing values: False
The column A8 has missing values: False
The column A7 has missing values: True
The column A6 has missing values: True
The column A5 has missing values: True
The column A4 has missing values: True
The column A3 has missing values: False
The column A2 has missing values: False
The column A1 has missing values: True

690


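As we can see, none of the columns with continuous features has missing values anymore. The categorical columns (A1, A4, A5, A6 and A7) still contain missing values because we only imputed the floating point columns.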
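The same imputation can also be done with scikit-learn. The sketch below uses `SimpleImputer` with the median strategy on a made-up, skewed column (the name `income` is hypothetical). It is an alternative to the loop above, not the approach used in the rest of this module.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy column (hypothetical) with one missing value and one very large value
toy = pd.DataFrame({"income": [1.0, 2.0, np.nan, 3.0, 100.0]})

# The median is less affected by the large value than the mean would be
imputer = SimpleImputer(strategy="median")
toy["income_imputed"] = imputer.fit_transform(toy[["income"]]).ravel()

print(toy)
print(f"Median used for imputation: {imputer.statistics_[0]}")
print(f"Mean of the observed values: {toy['income'].mean()}")
```

Here the median of the observed values is 2.5 while the mean is 26.5, which illustrates why the median is often preferred for skewed data.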

In [6]:
# Handling outliers 

threshold = 3
valid_rows = [] # This list will store one boolean mask per column, marking the rows that are not outliers

for x in floats:
    z_score = pd.Series(stats.zscore(X_clean[x]), index=X_clean.index) # Keep the row index so the mask aligns with X_clean
    valid = z_score.abs() <= threshold
    valid_rows.append(valid)

# Combine the outlier masks for all columns
combined = pd.concat(valid_rows, axis=1).all(axis=1)
combined

# Remove rows with outliers
X_clean = X_clean[combined]
X_clean

Unnamed: 0,A15,A14,A13,A12,A11,A10,A9,A8,A7,A6,A5,A4,A3,A2,A1
0,0,202.0,g,f,1,t,t,1.25,v,w,g,u,0.000,30.83,b
1,560,43.0,g,f,6,t,t,3.04,h,q,g,u,4.460,58.67,a
2,824,280.0,g,f,0,f,t,1.50,h,q,g,u,0.500,24.50,a
3,3,100.0,g,t,5,t,t,3.75,v,w,g,u,1.540,27.83,b
4,0,120.0,s,f,0,f,t,1.71,v,w,g,u,5.625,20.17,b
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
685,0,260.0,g,f,0,f,f,1.25,h,e,p,y,10.085,21.08,b
686,394,200.0,g,t,2,t,f,2.00,v,c,g,u,0.750,22.67,a
687,1,200.0,g,t,1,t,f,2.00,ff,ff,p,y,13.500,25.25,a
688,750,280.0,g,f,0,f,f,0.04,v,aa,g,u,0.205,17.92,b


## Data Transformations

Applying transformations to our data can make it more useful and improve our model's performance. Let's explore some common techniques to transform our data. 

## Dummies

For most models, it is important to transform our categorical variables into a series of binary columns (or dummies) to obtain a numerical representation of the presence or absence of each category. Let's see how to do that in the following code.

In [7]:
# Dummies
categorical = X.select_dtypes(include=['object']).columns # First, we identify the non-numerical features
X_dummies = pd.get_dummies(X[categorical], prefix=categorical) # Create dummies
X_dummies = pd.concat([X, X_dummies], axis=1) # Concatenate dataframes
X_dummies.drop(categorical, axis=1, inplace=True) # Keep only the dummies, as well as the original numerical features

X_dummies

Unnamed: 0,A15,A14,A11,A8,A3,A2,A13_g,A13_p,A13_s,A12_f,...,A6_w,A6_x,A5_g,A5_gg,A5_p,A4_l,A4_u,A4_y,A1_a,A1_b
0,0,202.0,1,1.25,0.000,30.83,True,False,False,True,...,True,False,True,False,False,False,True,False,False,True
1,560,43.0,6,3.04,4.460,58.67,True,False,False,True,...,False,False,True,False,False,False,True,False,True,False
2,824,280.0,0,1.50,0.500,24.50,True,False,False,True,...,False,False,True,False,False,False,True,False,True,False
3,3,100.0,5,3.75,1.540,27.83,True,False,False,False,...,True,False,True,False,False,False,True,False,False,True
4,0,120.0,0,1.71,5.625,20.17,False,False,True,True,...,True,False,True,False,False,False,True,False,False,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
685,0,260.0,0,1.25,10.085,21.08,True,False,False,True,...,False,False,False,False,True,False,False,True,False,True
686,394,200.0,2,2.00,0.750,22.67,True,False,False,False,...,False,False,True,False,False,False,True,False,True,False
687,1,200.0,1,2.00,13.500,25.25,True,False,False,False,...,False,False,False,False,True,False,False,True,True,False
688,750,280.0,0,0.04,0.205,17.92,True,False,False,True,...,False,False,True,False,False,False,True,False,False,True


In this case, we have a boolean representation for each category.
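Two optional arguments of `pd.get_dummies` are worth knowing when the data has missing values, as ours does. The sketch below uses a made-up `color` column (hypothetical) to show `dummy_na`, which adds an indicator column for missing values, and `dtype=int`, which produces 0/1 instead of booleans.

```python
import numpy as np
import pandas as pd

# Toy categorical column (hypothetical) with one missing value
toy = pd.DataFrame({"color": ["red", "blue", np.nan, "red"]})

# Default behaviour: the row with the missing value is False (or 0) in every dummy column
default_dummies = pd.get_dummies(toy, columns=["color"])

# dummy_na=True adds an explicit indicator column for missing values,
# and dtype=int yields 0/1 instead of True/False
with_na = pd.get_dummies(toy, columns=["color"], dummy_na=True, dtype=int)

print(default_dummies)
print(with_na)
```

Whether a separate missing-value indicator is useful depends on the model and on why the values are missing, so treat this as an option rather than a rule.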

## Interactions

For some models, like linear regression, interactions between features can provide additional information to the model and improve its performance.

In pandas, creating an interaction is not a complicated task.

In [8]:
# Interaction

X_int = X.copy() 

X_int['A2*3'] = X_int['A2']*X_int['A3'] # We create an interaction feature of A2 and A3 by simply multiplying the columns.
X_int

Unnamed: 0,A15,A14,A13,A12,A11,A10,A9,A8,A7,A6,A5,A4,A3,A2,A1,A2*3
0,0,202.0,g,f,1,t,t,1.25,v,w,g,u,0.000,30.83,b,0.00000
1,560,43.0,g,f,6,t,t,3.04,h,q,g,u,4.460,58.67,a,261.66820
2,824,280.0,g,f,0,f,t,1.50,h,q,g,u,0.500,24.50,a,12.25000
3,3,100.0,g,t,5,t,t,3.75,v,w,g,u,1.540,27.83,b,42.85820
4,0,120.0,s,f,0,f,t,1.71,v,w,g,u,5.625,20.17,b,113.45625
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
685,0,260.0,g,f,0,f,f,1.25,h,e,p,y,10.085,21.08,b,212.59180
686,394,200.0,g,t,2,t,f,2.00,v,c,g,u,0.750,22.67,a,17.00250
687,1,200.0,g,t,1,t,f,2.00,ff,ff,p,y,13.500,25.25,a,340.87500
688,750,280.0,g,f,0,f,f,0.04,v,aa,g,u,0.205,17.92,b,3.67360


## Data normalization and standardization

Another type of data transformation is the normalization or standardization of the continuous features. These techniques put the features on a comparable scale, which can improve the performance of models that are sensitive to the magnitude of the features. Keep in mind that normalization is itself sensitive to outliers, because the minimum and maximum values define the resulting range.

* Normalization: In this method, all values are rescaled to lie between 0 and 1, with the minimum value mapped to 0 and the maximum value mapped to 1.
* Standardization: In this method, all values are centered around the mean and divided by the standard deviation, so the transformed feature has a mean of 0 and a standard deviation of 1. Unlike normalization, the result has no fixed range.
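The two definitions above can be written directly in NumPy. This is a minimal sketch on made-up values, intended only to show the arithmetic behind the two transformations.

```python
import numpy as np

# Toy feature (hypothetical values)
x = np.array([2.0, 4.0, 6.0, 8.0])

# Normalization (min-max scaling): the minimum maps to 0 and the maximum to 1
x_norm = (x - x.min()) / (x.max() - x.min())

# Standardization: subtract the mean and divide by the standard deviation
x_std = (x - x.mean()) / x.std()

print(x_norm)
print(x_std)
print(x_std.mean(), x_std.std())  # Approximately 0 and 1
```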

The *sklearn* library offers a convenient way to perform these tasks. 

In [9]:
# Normalize continuous features
X_norm = X.copy()  # We are going to work with a copy
X_floats = X[floats.columns]
scaler_minmax = MinMaxScaler()
X_norm[floats.columns] = scaler_minmax.fit_transform(X_floats)
X_norm

Unnamed: 0,A15,A14,A13,A12,A11,A10,A9,A8,A7,A6,A5,A4,A3,A2,A1
0,0,0.1010,g,f,1,t,t,0.043860,v,w,g,u,0.000000,0.256842,b
1,560,0.0215,g,f,6,t,t,0.106667,h,q,g,u,0.159286,0.675489,a
2,824,0.1400,g,f,0,f,t,0.052632,h,q,g,u,0.017857,0.161654,a
3,3,0.0500,g,t,5,t,t,0.131579,v,w,g,u,0.055000,0.211729,b
4,0,0.0600,s,f,0,f,t,0.060000,v,w,g,u,0.200893,0.096541,b
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
685,0,0.1300,g,f,0,f,f,0.043860,h,e,p,y,0.360179,0.110226,b
686,394,0.1000,g,t,2,t,f,0.070175,v,c,g,u,0.026786,0.134135,a
687,1,0.1000,g,t,1,t,f,0.070175,ff,ff,p,y,0.482143,0.172932,a
688,750,0.1400,g,f,0,f,f,0.001404,v,aa,g,u,0.007321,0.062707,b


In [10]:
# Standardize continuous features
X_std = X.copy()  # We are going to work with a copy
X_floats = X[floats.columns]
scaler_standard = StandardScaler()
X_std[floats.columns] = scaler_standard.fit_transform(X_floats)
X_std

Unnamed: 0,A15,A14,A13,A12,A11,A10,A9,A8,A7,A6,A5,A4,A3,A2,A1
0,0,0.103555,g,f,1,t,t,-0.291083,v,w,g,u,-0.956613,-0.061777,b
1,560,-0.811931,g,f,6,t,t,0.244190,h,q,g,u,-0.060051,2.268118,a
2,824,0.552661,g,f,0,f,t,-0.216324,h,q,g,u,-0.856102,-0.591526,a
3,3,-0.483738,g,t,5,t,t,0.456505,v,w,g,u,-0.647038,-0.312843,b
4,0,-0.368582,s,f,0,f,t,-0.153526,v,w,g,u,0.174141,-0.953898,b
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
685,0,0.437505,g,f,0,f,f,-0.291083,h,e,p,y,1.070704,-0.877742,b
686,394,0.092039,g,t,2,t,f,-0.066806,v,c,g,u,-0.805846,-0.744677,a
687,1,0.092039,g,t,1,t,f,-0.066806,ff,ff,p,y,1.757198,-0.528760,a
688,750,0.552661,g,f,0,f,f,-0.652915,v,aa,g,u,-0.915403,-1.142198,b


## Feature Selection

When we work with big datasets containing lots of features, it is important to work only with the features that have an impact on the prediction of our target. By doing this, we reduce the risk of overfitting our model, and we also use our computational resources more efficiently. The [sklearn feature selection](https://scikit-learn.org/stable/modules/feature_selection.html) module offers a series of feature selection methods which are convenient for this task. We will briefly explore some of them:

* Variance Threshold: This method computes the variance of each feature and removes all features whose variance is below a threshold that we set. A feature with a variance of 0 is constant and carries no information. Note that this method does not look at the target.
* KBest: This method calculates a score (based on a scoring function) between each of our features and our target, and selects the *k* features that scored the highest.



In [11]:
# Variance Threshold
X_floats = X_std.copy()
X_floats = X_floats[floats.columns]
for x in floats: # Our X_std still has some missing values. We are going to impute the mean.
    if X_floats[x].isna().any():  # Check if the column has missing values
        mean = X_floats[x].mean() # Replace .mean() with .median() to impute the median
        X_floats[x] = X_floats[x].fillna(mean) # Assign the result back instead of relying on inplace=True
selector = VarianceThreshold(threshold=0.982) # We can adjust the threshold. We are using an arbitrarily high threshold for demonstration purposes.
X_var = selector.fit_transform(X_floats)
selected_columns = X_floats.columns[selector.get_support()]
X_var = pd.DataFrame(X_var, columns=selected_columns)
print(X_var)

           A8        A3        A2
0   -0.291083 -0.956613 -0.061777
1    0.244190 -0.060051  2.268118
2   -0.216324 -0.856102 -0.591526
3    0.456505 -0.647038 -0.312843
4   -0.153526  0.174141 -0.953898
..        ...       ...       ...
685 -0.291083  1.070704 -0.877742
686 -0.066806 -0.805846 -0.744677
687 -0.066806  1.757198 -0.528760
688 -0.652915 -0.915403 -1.142198
689  1.814125 -0.278161  0.287205

[690 rows x 3 columns]


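As we can see, our features dataframe now has fewer features. With a threshold of 0.982, the selector removed every feature whose variance is below 0.982, which is a very strict threshold. Because the features were standardized, their variances are all close to 1, so only A14 was removed, most likely because imputing its missing values with the mean slightly reduced its variance.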
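To see why a feature is kept or removed, we can inspect the variances that the selector computed. The sketch below uses a made-up dataframe and a hypothetical threshold of 0.5; the `variances_` attribute and `get_support()` are also available on the `selector` fitted above.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Toy features (hypothetical): one constant, one nearly constant, one spread out
toy = pd.DataFrame({
    "constant": [1.0, 1.0, 1.0, 1.0],
    "low_var": [0.0, 0.0, 0.0, 1.0],
    "high_var": [0.0, 10.0, 20.0, 30.0],
})

selector = VarianceThreshold(threshold=0.5)  # Hypothetical threshold
selector.fit(toy)

# Variance of each feature as computed by the selector
variances = pd.Series(selector.variances_, index=toy.columns)

# Names of the features that pass the threshold
kept = toy.columns[selector.get_support()].tolist()

print(variances)
print(kept)
```

Because variance depends on the scale of each feature, it is worth checking these values before choosing a threshold.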

In [12]:
# K-Best
y_bin = np.where(y["A16"]=="+",1,0) # The current dtype of y is object. We need to convert it into a binary series.
k_selector = SelectKBest(f_classif, k=2) # K = 2 because we only want to keep 2 features.
X_k = k_selector.fit_transform(X_floats, y_bin)
k_columns = X_floats.columns[k_selector.get_support()]
X_k = pd.DataFrame(X_k, columns=k_columns)
X_k

Unnamed: 0,A8,A3
0,-0.291083,-0.956613
1,0.244190,-0.060051
2,-0.216324,-0.856102
3,0.456505,-0.647038
4,-0.153526,0.174141
...,...,...
685,-0.291083,1.070704
686,-0.066806,-0.805846
687,-0.066806,1.757198
688,-0.652915,-0.915403


Using the F-statistic as our scoring method, the two features with the highest scores are A8 and A3. Therefore, the method returns a dataframe with those 2 columns.

We invite you to take a look at more methods through the official documentation.
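It is often useful to look at the scores themselves and not only at the selected columns. The sketch below fits `SelectKBest` on synthetic data (the column names `informative` and `noise` are hypothetical) and collects the `scores_` and `pvalues_` attributes in a table.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data (hypothetical): one feature related to the target, one pure noise
rng = np.random.default_rng(0)
target = np.array([0] * 50 + [1] * 50)
toy = pd.DataFrame({
    "informative": target + rng.normal(scale=0.5, size=100),
    "noise": rng.normal(size=100),
})

selector = SelectKBest(f_classif, k=1).fit(toy, target)

# One row per feature, sorted from the highest to the lowest F-statistic
summary = pd.DataFrame(
    {"F": selector.scores_, "p_value": selector.pvalues_},
    index=toy.columns,
).sort_values("F", ascending=False)

print(summary)
print(toy.columns[selector.get_support()].tolist())
```

The same two attributes are available on the `k_selector` fitted above, indexed in the order of `X_floats.columns`.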

## Dimensionality Reduction

Another helpful way to improve the performance of our models and take better advantage of our computational resources is to perform dimensionality reduction on our data. This technique transforms our data into a lower-dimensional space while preserving the most relevant information. A very common approach for this task is [Principal Component Analysis (PCA)](https://en.wikipedia.org/wiki/Principal_component_analysis).

The next cell shows an implementation using *sklearn*.

In [13]:
# Principal Component Analysis
X_pca = X_floats.copy()

pca = PCA(n_components=2) # We are going to reduce our data into a 2D space. 
X_pca = pca.fit_transform(X_pca)
X_pca = pd.DataFrame(X_pca, columns=[f'PC{i+1}' for i in range(2)])
print(X_pca)

          PC1       PC2
0   -0.742321  0.259076
1    1.558978  0.338133
2   -1.060236  0.404592
3   -0.095252 -0.158644
4   -0.394970 -0.773721
..        ...       ...
685 -0.201128 -0.421724
686 -0.889304  0.009526
687  0.583331 -0.700276
688 -1.641751  0.062929
689  1.391807 -0.050432

[690 rows x 2 columns]


As we see, our data was transformed and reduced into two components.
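To judge how much information the reduced representation keeps, we can look at the share of variance explained by each component. The sketch below uses synthetic data (hypothetical) with three columns, two of which are strongly correlated; the `explained_variance_ratio_` attribute is also available on the `pca` object fitted above.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data (hypothetical): the second column is almost a multiple of the first
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
data = np.hstack([
    base,
    2 * base + rng.normal(scale=0.1, size=(200, 1)),
    rng.normal(size=(200, 1)),
])

pca = PCA(n_components=2).fit(data)

# Fraction of the total variance captured by each principal component
ratios = pca.explained_variance_ratio_

print(ratios)
print(f"Total variance kept: {ratios.sum():.1%}")
```

A total close to 100% suggests that little information was lost, while a low total suggests that more components may be needed.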