<center>
  <a href="MLSD-02-DataPreprocessing-B.ipynb" target="_self">Data Preprocessing B</a> | <a href="./">Content Page</a> | <a href="MLSD-02-DataPreprocessing-D.ipynb">Data Preprocessing D | <a href="MLSD-02-DataPreprocessing-Ex-1.ipynb">Data Preprocessing Exercise</a>
</center>

# <center>DATA PREPROCESSING C</center>

<center><b>Copyright &copy 2023 by DR DANNY POO</b><br> e:dannypoo@nus.edu.sg<br> w:drdannypoo.com</center><br>

# Data Preprocessing
<b>Dataset</b>: Online Shopper data set.<br>
<b>Tasks</b>: 
- To read in and explore data set.
- To fill missing data with mean values. 
- To encode categorical data.
- To split dataset into training and test sets.
- To rescale features.

## Read in and Explore Data Set

In [1]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
# Data set link
path = "./data/onlineShopper/dataset.csv"

# Prepare dataframe using the data at given link and defined columns list
df = pd.read_csv(path)
df.head()

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,,Yes


In [3]:
# Array of features
X = df.iloc[:, :-1].values # All rows and columns except the last column
print(X)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]


In [4]:
# Array of dependent variable
y = df.iloc[:, -1].values # All rows of last column
print(y)

['No ' 'Yes' 'No ' 'No ' 'Yes' 'Yes' 'No ' 'Yes' 'No ' 'Yes']


## Fill Missing Data with Mean Values

In [5]:
# Import class SimpleImputer from impute model in sklearn
from sklearn.impute import SimpleImputer

# Create object of SimpleImputer class
imputa = SimpleImputer(missing_values = np.nan, strategy = 'mean')
''' Using the fit method, we apply the `imputa` object on the matrix of our feature x.
The `fit()` method identifies the missing values and computes the mean of such feature a missing value is present.
'''
imputa.fit(X[:, 1:3]) # identifies the missing values and computes the mean of such feature where a missing value is present.

# Replace the missing value using transform method
X[:, 1:3] = imputa.transform(X[:, 1:3])
print(X)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]


## Encode Categorical Data
During encoding, we transform text data into numeric data.<br>
Country and Purchased column data are categorical data.<br>
It is not possible to compute correlation between feature and dependent variables when mathematical equations in machine learning takes only numeric inputs.<br>
Therefore, need to convert string entries in dataset into numeric values.

For our dataset, encode:
- France into 0
- Spain into 1
- Germany into 2

But machine learning model will interpret this in numerical order between 0 for France, 1 for Spain, and 2 for Germany.<br>
But this may not always be the case.<br><br>
To prevent this misinterpretation, we make use of <b>one-hot encoding<br>
One-hot encoding converts categorical Country column into three columns.<br>
It creates a unique binary vector for each country such that there is no numerical order between the country categories.

In cases where there are high cardinality categorical variables, one-hot encoding may generate many new features. <br>
This is not a practical solution.<br>
Proposed solution is to use <b>Target Encoding</b> or <b>Leave-one-out Encoding</b> to handle these high cardinality categorical features. 


In [6]:
# Import classes
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder= 'passthrough')
X = np.array(ct.fit_transform(X))
print(X)

[[1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 35.0 58000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 37.0 67000.0]]


**Observations**:
- The Country column has been transformed into 3 columns with each row representing only one encoded column
- France is encoded into a vector [1.0 0.0 0.0]
- Spain is encoded into a vector [0.0 0.0 1.0]
- Germany is encoded into a vector [0.0 1.0 0.0]

In [7]:
# Encode dependent variable y
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder() 
y = le.fit_transform(y)
print(y)

[0 1 0 0 1 1 0 1 0 1]


**Observations**:
- Use LabelEncoder for dependent variable y because there are only two values for y

## Split Dataset into Training and Test Sets
Split dataset into a training set and a test set.<br>
The training set is the fraction of a dataset used to implement the model.<br>
The test set is the fraction of a dataset used to evaluate the performance of the model.<br>
The test set is assumed to be unknown during the process of the model implementation.

In [8]:
# Import libraries
from sklearn.model_selection import train_test_split

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2,random_state= 1)

In [9]:
# Print
print("\nX_train\n", X_train)
print("\nX_test\n", X_test)
print("\ny_train\n", y_train)
print("\ny_test\n", y_test)


X_train
 [[0.0 0.0 1.0 38.77777777777778 52000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 35.0 58000.0]]

X_test
 [[0.0 1.0 0.0 30.0 54000.0]
 [1.0 0.0 0.0 37.0 67000.0]]

y_train
 [0 1 0 0 1 1 0 1]

y_test
 [0 1]


**Observations**:
- Our features set was divided into eight observations for the X_train and 2 for the X_test, which correspond (since we set our seed, random = 1) to the same splitting of the dependent variable y.

## Rescale Features
Datasets may not have features of the same scale.<br>
Some features often have tremendous values, and others have small values.<br>
Suppose we implement our machine learning model on such datasets. <br>
In that case, features with tremendous values dominate those with small values, and the machine learning model treats those with small values as if they don’t exist (their influence on the data is not be accounted for). <br>
To ensure this is not the case, we need to scale our features on the same range, i.e., within the interval of -3 and 3.

We need to scale the Age and Salary columns of X_train and X_test into this interval.

In [10]:
# Import libraries
from sklearn.preprocessing import StandardScaler

# Instantiate StandardScaler
scaler = StandardScaler()

# Apply the feature scaling on the features other than dummy variables.
X_train[:, 3:] = scaler.fit_transform(X_train[:, 3:])
X_test[:, 3:]  = scaler.fit_transform(X_test[:, 3:])

In [11]:
# Print
print("\nX_train\n", X_train)
print("\nX_test\n", X_test)


X_train
 [[0.0 0.0 1.0 -0.19159184384578545 -1.0781259408412425]
 [0.0 1.0 0.0 -0.014117293757057777 -0.07013167641635372]
 [1.0 0.0 0.0 0.566708506533324 0.633562432710455]
 [0.0 0.0 1.0 -0.30453019390224867 -0.30786617274297867]
 [0.0 0.0 1.0 -1.9018011447007988 -1.420463615551582]
 [1.0 0.0 0.0 1.1475343068237058 1.232653363453549]
 [0.0 1.0 0.0 1.4379472069688968 1.5749910381638885]
 [1.0 0.0 0.0 -0.7401495441200351 -0.5646194287757332]]

X_test
 [[0.0 1.0 0.0 -1.0 -1.0]
 [1.0 0.0 0.0 1.0 1.0]]


**Observations**:
- Only Age and Salary columns and not the dummy variables are scaled. 
- This is because scaling the dummy variables may interfere with their intended interpretation even though they fall within the required range.
- The Age and Salary features are now scaled on the same range, i.e., within the interval of -3 and 3.

<center>
  <a href="MLSD-02-DataPreprocessing-B.ipynb" target="_self">Data Preprocessing B</a> | <a href="./">Content Page</a> | <a href="MLSD-02-DataPreprocessing-D.ipynb">Data Preprocessing D | <a href="MLSD-02-DataPreprocessing-Ex-1.ipynb">Data Preprocessing Exercise</a>
</center>