### Practical 8

### Data Preprocessing on EmployeeRecords Dataset

In [1]:
# ------------------------Basic Machine learning principle: ---------------------------------
# The features, also called independent variables, are the variables that carry the information
# from which we predict the value of the dependent variable.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [1]:
#---------------------------------Step 1:- Importing the dataset-----------------------------

# Import the dataset "EmployeeRecords.csv" and assign it to the variable "dataset".
# This step reads all the values of the dataset into a pandas DataFrame.
dataset = pd.read_csv("D:/ML Dataset/EmployeeRecords.csv")
dataset


Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,,Yes
5,France,35.0,58000.0,Yes
6,Spain,,52000.0,No
7,France,48.0,79000.0,Yes
8,Germany,50.0,83000.0,No
9,France,37.0,67000.0,Yes
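
The path passed to `read_csv` above is specific to one machine. As a minimal sketch for following along without that file, the ten rows displayed above can be rebuilt in memory; the use of `io.StringIO` and the name `csv_text` are assumptions made for this illustration and are not part of the original notebook.

```python
import io
import pandas as pd

# Inline copy of the ten rows shown above; empty fields are read as NaN.
csv_text = """Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,54000,No
Spain,38,61000,No
Germany,40,,Yes
France,35,58000,Yes
Spain,,52000,No
France,48,79000,Yes
Germany,50,83000,No
France,37,67000,Yes
"""

dataset = pd.read_csv(io.StringIO(csv_text))

print(dataset.shape)         # (rows, columns)
print(dataset.dtypes)        # Age and Salary are float because they contain NaN
print(dataset.isna().sum())  # number of missing values per column
```

Checking `isna().sum()` right after loading shows which columns will need imputation later.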


In [2]:
#---------------------------Step 2 : Follow basic principle in Machine Learning------------------------------

# Any dataset used to train a machine learning model has the same two entities:
# the features and the dependent variable vector.
# Features, or independent variables, are the columns used to predict the dependent variable.
# The dependent variable is generally the last column of the dataset.
# It holds the outcome that we want to predict from the preceding columns.


# ------------------------Step 3 : Retrieving Independent Features and Dependent variables in x and y-------------------

# The iloc indexer selects rows and columns by integer position.
# The first argument selects rows; a bare ':' without any indexes fetches all rows.
# The second argument ':-1' is a range that fetches all columns except the last one.
# X refers to the independent variables (matrix of features).

X = dataset.iloc[:, : -1].values

# -------------------------------Retrieving dependent variable i.e. last column ----------------------------------

# The iloc indexer will fetch all rows of only the last column.
# The second argument '-1' selects only the last column.
# y refers to the dependent variable.

y = dataset.iloc[:,-1].values

print(X)
print(y)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']
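
The difference between the slices `:-1` and `-1` is easy to misread, so here is a small sketch on a hypothetical three-row frame (the names `df`, `X_demo` and `y_demo` are illustrative only). It shows that the first slice yields a 2-D array of features and the second a 1-D vector.

```python
import pandas as pd

# Tiny hypothetical frame, used only to illustrate the slicing.
df = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44.0, 27.0, 30.0],
    "Salary": [72000.0, 48000.0, 54000.0],
    "Purchased": ["No", "Yes", "No"],
})

X_demo = df.iloc[:, :-1].values  # all rows, every column except the last -> 2-D array
y_demo = df.iloc[:, -1].values   # all rows, last column only -> 1-D array

print(X_demo.shape)
print(y_demo.shape)
print(X_demo.dtype)  # object, because strings and floats are mixed
```

The `object` dtype of the feature array is why later steps can write floats and strings into the same matrix.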


In [3]:
# -------------------------------------Taking care of missing data---------------------------------------

# Real-world datasets may contain missing values, encoded as blanks, NaNs or other placeholders.
# Such datasets are incompatible with most scikit-learn estimators,
# which assume that all values in an array are numerical and that all of them hold meaning.
# One solution is to discard entire rows and/or columns containing missing values.
# However, this may lead to the loss of valuable data.
# A better strategy is to impute the missing values, i.e., to infer them from the known part of the data.

# For this we use the SimpleImputer class from the sklearn.impute module.
# The SimpleImputer class provides basic strategies for imputing missing values.
# Missing values are imputed using a constant value or a statistic (mean, median or most frequent value)
# of each column in which the missing values are located.
# This class also allows for different missing-value encodings.

# Here we replace missing values, encoded as np.nan, with the mean value of the column.
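
To make the trade-off between discarding and imputing concrete, the following sketch compares the two on a hypothetical five-row frame. It uses plain pandas (`dropna` and `fillna`) as a stand-in for the idea; the notebook itself uses `SimpleImputer` in the next steps.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing Age and one missing Salary.
df = pd.DataFrame({
    "Age": [44.0, 27.0, np.nan, 38.0, 40.0],
    "Salary": [72000.0, 48000.0, 52000.0, np.nan, 61000.0],
})

dropped = df.dropna()           # discards every row that has a missing value
imputed = df.fillna(df.mean())  # fills each gap with the mean of its column

print(len(df), len(dropped), len(imputed))
print(imputed)
```

Dropping loses two of the five rows here, while mean imputation keeps all of them.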

In [4]:
# ---------------------Step 4 From sklearn library import sklearn.impute module--------------------------------

from sklearn.impute import SimpleImputer

# ------------------------Step 5 Create an instance of the SimpleImputer class---------------------------------

# The parameters used in creating the SimpleImputer object make it replace missing values with the column mean.

#1. The 'missing_values' argument specifies which value we have to replace.
#2. The 'strategy' argument specifies the replacement policy, i.e. mean, median, most_frequent or constant.

imputer = SimpleImputer(missing_values=np.nan, strategy='mean') 


In [5]:
# -------------------Step 6: Use fit() to apply imputer object to matrix of features-----------------------

# The object created in the step above is not yet connected to the matrix of features.
# In this step we apply the imputer object to the matrix of features.
# For this we use the fit method, which connects the imputer object to the matrix of features.

# The fit method looks at the specified columns and computes the mean of the non-missing values in each of them.

#                                   Syntax : fit(X[rows, columns])

# Fit the imputer on the selected rows and columns of the matrix of features 'X'.

# Here 'X' denotes the matrix of features where we want to replace the missing data.
# Inside 'X' first specify the range of rows that the fit method will read.
# Then specify the range of columns in which to look for missing data.
# Note: We must specify only columns with numeric values, such as Age and Salary.

imputer.fit(X[:,1:3])

SimpleImputer()

In [6]:
# ---------------------------- Step 7: Call transform() method using imputer object -------------------------------
# It will replace the missing values of each column with specified value i.e. mean value here.

#                                   Syntax : transform(X)

# => Impute all missing values in X and return transformed columns.
# The transform method returns the updated version of the columns passed to it.
# Thus, we assign the result back to the same columns of 'X' that were passed to transform().

X[:, 1:3] = imputer.transform(X[:, 1:3])
print(X)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
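
As a quick check of the imputed values printed above, the sketch below refits a separate imputer on the Age and Salary columns alone and compares the learned fill values with `np.nanmean`. The names `numeric`, `imputer_demo` and `filled` are introduced only for this illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Age and Salary columns from the dataset above, including the two missing entries.
numeric = np.array([
    [44.0, 72000.0], [27.0, 48000.0], [30.0, 54000.0], [38.0, 61000.0],
    [40.0, np.nan], [35.0, 58000.0], [np.nan, 52000.0], [48.0, 79000.0],
    [50.0, 83000.0], [37.0, 67000.0],
])

imputer_demo = SimpleImputer(missing_values=np.nan, strategy="mean")
filled = imputer_demo.fit_transform(numeric)

# statistics_ holds the per-column fill values learned by fit().
print(imputer_demo.statistics_)
print(np.nanmean(numeric, axis=0))  # column means computed while ignoring NaN
print(filled[4], filled[6])         # the two rows that had a missing value
```

The fill values are 349/9 for Age and 574000/9 for Salary, which match the 38.777... and 63777.777... seen in the printed matrix.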


In [9]:
#---------------------------------------Encoding Categorical data -------------------------------------

# Encoding categorical data, i.e. encoding the independent variable 'Country'.

# It may be difficult for a Machine Learning model to compute correlations between
#    categorical string-based features and the outcome in the dependent variable column.
#    Thus, it is necessary to encode these strings, i.e. categories, as numbers.
#    For this we use one-hot encoding.

# One-hot encoding turns the categorical column into 'n' columns,
# where 'n' is the number of distinct classes in that column.
# Example : If it has 3 classes then 3 columns are created; if it has 5 then 5 columns are created.
# Each class is represented by a binary vector.
# For three classes the vectors are 100, 010 and 001.
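
To show what the binary vectors look like before any scikit-learn class is involved, here is a minimal hand-rolled sketch in NumPy. It is not how `OneHotEncoder` is implemented; the array `countries` and the use of `np.unique` and `np.eye` are choices made only for this illustration.

```python
import numpy as np

countries = np.array(["France", "Spain", "Germany", "Spain"])

# np.unique returns the sorted classes and, for each entry, the index of its class.
classes, codes = np.unique(countries, return_inverse=True)

# Row i of the identity matrix is the binary vector for class i.
one_hot = np.eye(len(classes))[codes]

print(classes)
print(one_hot)
```

With the classes sorted alphabetically, France maps to 100, Germany to 010 and Spain to 001.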


In [10]:
# ---------------------------------Step 8 : OneHotEncoding for encoding categorical data------------------------------

# A. Importing ColumnTransformer class from sklearn.compose module.

from sklearn.compose import ColumnTransformer

# B. Importing OneHotEncoder class from sklearn.preprocessing module.

from sklearn.preprocessing import OneHotEncoder

# C. The next step is to create an object of the ColumnTransformer class.

# It takes two arguments:

# 1. 'transformers' is a list of tuples, each holding 3 things: (1.) a name for the step, e.g. 'encoder',
#   (2.) the transformer to apply, i.e. OneHotEncoder() and (3.) the indexes of the columns we want to encode.
# 2. 'remainder' specifies what happens to the columns that are not listed in 'transformers'.

#         Syntax : ColumnTransformer(transformers=[(name, transformer, column_indexes)],
#                                    remainder='passthrough')

# remainder='passthrough' keeps the other columns unchanged; the default 'drop' would discard them.

ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(),[0])], remainder='passthrough')

# D. Next step is to connect the 'ct' object to the matrix of features 'X' in a single step using fit_transform().
# The fit_transform() method fits and transforms 'X' at the same time.
# It returns the new matrix of features 'X', in which the one-hot columns come first,
# followed by the passed-through columns.
# The fit() method of the Machine Learning model used later will expect the matrix of features X as a numpy array.
# So we cast the output of fit_transform() to a numpy array.

X = np.array(ct.fit_transform(X))
print(X)

[[1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 35.0 58000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 37.0 67000.0]]
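
The printed matrix does not say which of the three new columns stands for which country. One way to find out, sketched here on a hypothetical three-row array, is to look up the fitted encoder by the name given in the transformer tuple and read its `categories_` attribute.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical rows in the same layout as X: Country, Age, Salary.
X_demo = np.array([["France", 44.0, 72000.0],
                   ["Spain", 27.0, 48000.0],
                   ["Germany", 30.0, 54000.0]], dtype=object)

ct_demo = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), [0])],
    remainder="passthrough",
)
encoded = np.array(ct_demo.fit_transform(X_demo))

# The fitted encoder is reachable by the name used in the tuple above.
categories = ct_demo.named_transformers_["encoder"].categories_[0]

print(categories)  # order of the one-hot columns
print(encoded)
```

The categories are sorted, so the first three columns correspond to France, Germany and Spain, in that order, followed by Age and Salary.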


In [11]:
#---------------------------------------- Step 9 : Label Encoding ------------------------------------------

# Encoding the dependent variable.
# For this we use another class called LabelEncoder, which encodes the labels as integers (here 0 and 1).

# A. Importing LabelEncoder class from sklearn.preprocessing module.

from sklearn.preprocessing import LabelEncoder

# B. Next we create an object of this LabelEncoder class.
# No parameters required since it is applied on the one single vector needed to encode.

le = LabelEncoder()

# C. Use fit_transform() to fit the 'le' object to the dependent variable 'y' and
# convert the text labels to numerical values directly.
# Here there is no need to convert to a numpy array, since fit_transform() already returns one.
# The result is assigned back to the same 'y'.

y = le.fit_transform(y)

print(y)

# Note : We must apply OneHotEncoding when a feature in the matrix of features has several
# categories, but a simple LabelEncoding is enough when there are only two classes,
# which we can directly encode as zero and one.

[0 1 0 0 1 1 0 1 0 1]
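
To see which label received which integer, the fitted encoder exposes `classes_`, and `inverse_transform` maps the codes back. The sketch below uses a short hypothetical label list; `le_demo` and `labels` are names introduced for this example.

```python
from sklearn.preprocessing import LabelEncoder

labels = ["No", "Yes", "No", "No", "Yes"]

le_demo = LabelEncoder()
encoded = le_demo.fit_transform(labels)

print(le_demo.classes_)                    # position in this array = integer code
print(encoded)
print(le_demo.inverse_transform(encoded))  # maps the codes back to the labels
```

Because the classes are sorted, 'No' becomes 0 and 'Yes' becomes 1, which agrees with the output above.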


In [12]:
#----------------------------Data Splitting and Feature Scaling-------------------------------------

# After feature scaling (standardization), most values fall roughly within the range of -3 to +3.

In [13]:
# -----------------------------Step 10 : Dataset Splitting -----------------------------

# Splitting the dataset into the Training set and Test set

# The sklearn library contains a module called model_selection, which contains a function called train_test_split().

# train_test_split() will create 4 different sets:
# i.e. a pair of matrix of features and dependent variable for the training set and
# a pair of matrix of features and dependent variable for the test set.
# X_train is the matrix of features of the training set.
# X_test is the matrix of features of the test set.
# y_train is the dependent variable of the training set.
# y_test is the dependent variable of the test set.

# Most Machine Learning models expect this format of inputs.
# For training, a model expects X_train and y_train as inputs to its fit() method.
# For predictions, also called inference, the model predicts on X_test.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size =2/10, random_state = 1)

# test_size :- float or int. As a float between 0.0 and 1.0 it is the proportion of the dataset to include
#             in the test split; as an int it is the absolute number of test samples.
#             The default value is 0.25 when train_size is also unset.
#             Here test_size is set to 0.2, i.e. 20% of the observations go to the test set
#             and 80% go to the training set.

# train_size :- float or int. As a float between 0.0 and 1.0 it is the proportion of the dataset to include
#              in the train split. The default value is the complement of the test size.


# random_state :- The observations are split randomly into the training set and the test set, so we fix
# the seed with random_state = 1 to get the same split, and therefore the same training set and test set, every time.

#              Note : It doesn't matter if the random_state is 0 or 1 or any other integer.
#              What matters is that it is set to the same value if we want to validate
#              our processing over multiple runs of the code. Setting random_state to a fixed value
#              guarantees that the same sequence of random numbers is generated each time you
#              run the code.

In [14]:
print(X_train)

[[0.0 0.0 1.0 38.77777777777778 52000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 35.0 58000.0]]


In [15]:
print(X_test)

[[0.0 1.0 0.0 30.0 54000.0]
 [1.0 0.0 0.0 37.0 67000.0]]


In [16]:
print(y_train)

[0 1 0 0 1 1 0 1]


In [17]:
print(y_test)

[0 1]
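
The effect of `random_state` can be checked directly. This sketch splits a hypothetical 10-row array twice with the same seed and confirms that both calls return identical pieces; `X_demo`, `y_demo`, `split_a` and `split_b` are illustrative names.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(20).reshape(10, 2)  # 10 hypothetical observations, 2 features
y_demo = np.arange(10)

split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)

X_train_a, X_test_a, y_train_a, y_test_a = split_a
print(X_train_a.shape, X_test_a.shape)

# Same seed, same split: every returned array matches element for element.
same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same)
```

With 10 observations and `test_size=0.2`, 8 rows go to the training set and 2 to the test set, as in the outputs above.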


In [18]:
# ------------------------------------- Step 11 : Feature Scaling ---------------------------------

# Feature Scaling
# Feature scaling is a technique to bring the independent features present in the data
# onto a common scale.
# It is performed during data pre-processing to handle highly varying values.
# If feature scaling is not done, an ML algorithm tends to give more weight to larger values
# and less weight to smaller values, regardless of the unit of the values.
# Thus, greater values tend to dominate smaller values.
# Example: If an algorithm does not use feature scaling, it may consider the value
# 4000 meters to be greater than 6 km, which is false. The algorithm may then give wrong predictions.

# So, we use Feature Scaling to bring all values to the same magnitude and thus solve this issue.
# Note : We do not have to apply feature scaling for every Machine Learning model.

In [19]:
# Example : 
# In our dataset the range of the Age feature is 27 - 50, whereas the range of Salary is 48000 - 83000.
# It shows that range of Salary is much wider than the range of Age.
# The difference in Age contributes less compared to salary to the overall difference in feature set.
# Thus, age feature will be dominated by salary feature if we do not apply feature scaling.
# Therefore, we should use Feature Scaling to bring all values to the same magnitudes to solve this.
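
How strongly Salary dominates Age can be seen with a distance computation, which is what distance-based models rely on. The sketch below takes two (Age, Salary) rows from the dataset and measures how much of the squared Euclidean distance each feature contributes; the choice of Euclidean distance is an assumption made for this illustration.

```python
import numpy as np

# Two (Age, Salary) rows from the dataset above.
a = np.array([27.0, 48000.0])
b = np.array([50.0, 83000.0])

diff = np.abs(a - b)
distance = np.linalg.norm(a - b)

# Share of the squared distance contributed by each feature.
share = diff**2 / distance**2

print(diff)
print(share)
```

Although the two people differ by almost the full Age range, nearly all of the distance comes from Salary.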

In [20]:
# Two methods of feature Scaling

# 1. Standardisation and 2. Normalisation

#    Xstand = (X - mean(X)) / standard deviation(X)

#    Xnorm = (X - min(X)) / (max(X) - min(X))
# If outliers are present use Standardisation, otherwise use Normalisation.
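
The two formulas can be written out in NumPy and compared with scikit-learn's `StandardScaler` and `MinMaxScaler`. This is a small sketch on a hypothetical column of ages; note that the numerator is the difference from the mean (or minimum), divided as a whole by the denominator.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

ages = np.array([[27.0], [30.0], [35.0], [44.0], [50.0]])  # hypothetical sample

x_stand = (ages - ages.mean()) / ages.std()               # standardisation
x_norm = (ages - ages.min()) / (ages.max() - ages.min())  # min-max normalisation

# The scikit-learn scalers apply the same formulas column by column.
print(np.allclose(x_stand, StandardScaler().fit_transform(ages)))
print(np.allclose(x_norm, MinMaxScaler().fit_transform(ages)))
```

`StandardScaler` uses the population standard deviation, which is also the default of `np.std`, so the results agree.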

In [21]:
# In contrast to Standardisation, "Max-Min Normalisation" bounds the values to the range 0 to 1,
# so it generates smaller standard deviations than Standardisation.
# It implies that the data is more concentrated around the mean if we scale it using Max-Min Normalisation.
# Thus, if a feature has outliers, normalising squeezes the remaining values into a small interval.
# Standardisation does not have a bounding range,
# so the result is not forced into a fixed interval even if there are outliers in your data.
# Standardisation is therefore generally considered more robust to outliers and preferable to Max-Min Normalisation.
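
A small numeric check of these statements, using a hypothetical feature with one extreme value. It only illustrates the bounded versus unbounded behaviour and the difference in standard deviation; it is not a benchmark.

```python
import numpy as np

values = np.array([30.0, 35.0, 40.0, 45.0, 50.0, 1000.0])  # last entry is an outlier

x_norm = (values - values.min()) / (values.max() - values.min())
x_stand = (values - values.mean()) / values.std()

print(x_norm)                       # bounded to [0, 1]; the outlier sits at 1
print(x_stand)                      # unbounded; the outlier lies above 2
print(x_norm.std(), x_stand.std())  # min-max output has the smaller standard deviation
```

Both methods are linear rescalings, so the outlier stays far from the other values in either case; what differs is the interval the results are placed in.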

In [22]:
# Question 4 : Do we need to apply feature scaling to the dummy variables i.e. columns[0,1,2]
# in the matrix of features i.e. encoded values?

# Answer : The answer is No. The goal of standardization or feature scaling is to have all the values
# of the features in the same range. Standardization transforms features so that
# they take values roughly between -3 and 3. The dummy variables already
# take values within this range because they are equal to either 1 or 0.
# So, nothing extra needs to be done here regarding standardization.

In [23]:
# Import the sklearn.preprocessing module that contains the StandardScaler class.
from sklearn.preprocessing import StandardScaler

# Now create an object of the StandardScaler class
# Note : There are no arguments to pass since we just want to compute the mean and standard deviation and
# then apply the standardisation formula to all the values in the feature.
# This is done automatically, so no parameters are needed.

sc = StandardScaler()

In [24]:
# Now we fit our StandardScaler tool on the training set X_train.
# Based on the discussion above we won't apply feature scaling to the dummy variables;
# we fit our StandardScaler object only on the Age and Salary columns.
# We take all the rows of these two columns. In X_train[:, 3:] the first ':' refers to all
# rows and '3:' refers to the column range covering the Age and Salary columns.
# Remember to take only the columns with numerical values from the larger matrix of features.

# Applying the formulas:
# 1. fit() : It only computes the mean and the standard deviation of the values of Age and Salary.

# 2. transform() : It applies the standardization formula, transforming
#                  each value of each feature (Age and Salary) into the 'Xstand' value
#                  resulting from the standardization formula.

# Thus, fit() just gets the mean and standard deviation of each of your features and
# transform() applies the standardization formula so that the values end up on the same scale.

# 3. fit_transform() : A method of the StandardScaler class that performs both steps
#                    at the same time, meaning it fits on the matrix of features to get the mean
#                    and standard deviation and after that transforms all the values of the features
#                    as per the standardization formula.

X_train[:, 3:] = sc.fit_transform(X_train[:, 3:])

In [25]:
# Now we transform the matrix of features of the test set, X_test.
# For this data we apply only the transform method, because the features of the test set
# need to be scaled by the same scaler that was fitted on the training set.
# We do not use fit_transform() on X_test because it would fit a new scaler on the test data,
# and that would not make sense: X_test stands for new, unseen data that is passed to
# the predict function to produce predictions in later stages of ML models.

X_test[:, 3:] = sc.transform(X_test[:, 3:])
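
The split between `fit_transform()` on the training set and `transform()` on the test set can be verified by hand. In this sketch, with hypothetical (Age, Salary) rows, the scaler's learned `mean_` and `scale_` come from the training rows only, and the test rows are scaled with exactly those values.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical (Age, Salary) rows standing in for the training and test sets.
train = np.array([[27.0, 48000.0], [38.0, 61000.0], [44.0, 72000.0], [50.0, 83000.0]])
test = np.array([[30.0, 54000.0], [37.0, 67000.0]])

sc_demo = StandardScaler()
train_scaled = sc_demo.fit_transform(train)  # learns mean_ and scale_ from train only
test_scaled = sc_demo.transform(test)        # reuses those training statistics

# transform() is (x - training mean) / training standard deviation.
manual = (test - sc_demo.mean_) / sc_demo.scale_

print(np.allclose(test_scaled, manual))
print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
```

The scaled training columns have mean 0 and standard deviation 1; the scaled test columns generally do not, because they were never used to compute the statistics.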

In [26]:
# On printing X_train we find that the dummy variables keep the same values, which
# are indeed still between minus three and plus three.
# The Age and Salary variables were transformed so that they take new values between minus two and plus two.
# Thus, they are now on the same scale.
# This helps to improve or optimize the training of certain ML models.

print(X_train)

[[0.0 0.0 1.0 -0.19159184384578545 -1.0781259408412425]
 [0.0 1.0 0.0 -0.014117293757057777 -0.07013167641635372]
 [1.0 0.0 0.0 0.566708506533324 0.633562432710455]
 [0.0 0.0 1.0 -0.30453019390224867 -0.30786617274297867]
 [0.0 0.0 1.0 -1.9018011447007988 -1.420463615551582]
 [1.0 0.0 0.0 1.1475343068237058 1.232653363453549]
 [0.0 1.0 0.0 1.4379472069688968 1.5749910381638885]
 [1.0 0.0 0.0 -0.7401495441200351 -0.5646194287757332]]


In [27]:
print(X_test)

[[0.0 1.0 0.0 -1.4661817944830124 -0.9069571034860727]
 [1.0 0.0 0.0 -0.44973664397484414 0.2056403393225306]]
