<a href="https://colab.research.google.com/github/kc2209/Machine_Learning/blob/main/Datapreprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Importing Libraries

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

Importing the dataset (first upload the file to Google Colab using the Files menu)

In [None]:
df = pd.read_csv("Data_preprocessing.csv")

The next step is to create the two essential inputs for a machine learning model: the feature matrix and the dependent variable vector.
The feature matrix holds the independent features, or predictors. These are the characteristics you will use to build a model that predicts the dependent variable.
The dependent variable vector stores the dependent variable, also called the target variable or response variable.
Here x is the feature matrix and y is the dependent variable vector.

In [None]:
x = df.iloc[:,:-1].values
y = df.iloc[:,-1].values

Our dataset has four columns. Country, age and salary are the independent variables (features), and Purchased is the dependent variable.

iloc is a pandas indexer, used with square brackets rather than called like a function, and it is an important tool for data manipulation. It selects specific rows and columns of a DataFrame or Series by integer position, so data is identified by numeric row and column indices rather than by labels.
For reference see: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html

.values converts the selected subset into a NumPy array.
The index -1 is used because the dependent variable 'Purchased' is the last column: the slice :-1 selects every column except the last, and -1 selects only the last.
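
The slicing described above can be tried on a tiny DataFrame. This is only a sketch: the frame toy and its three rows are made up for illustration, and the column names are assumed to match the dataset described earlier.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame with the same column layout as the dataset.
toy = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44.0, 27.0, np.nan],
    "Salary": [72000.0, 48000.0, 54000.0],
    "Purchased": ["No", "Yes", "No"],
})

# All rows, every column except the last one: the feature matrix.
features = toy.iloc[:, :-1].values
# All rows, only the last column: the dependent variable vector.
target = toy.iloc[:, -1].values

print(type(features), features.shape)
print(target)
```

Because the positions are counted from the end, the same two lines keep working if more feature columns are added, as long as the target stays in the last column.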

In [None]:
print(x)
print(y)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']


Addressing Missing Values:

Imputation is a common technique for dealing with missing values: each missing value is replaced with an estimate such as the mean, median, or mode of its column.

We can do this with the **Scikit-learn** library, which provides the SimpleImputer class for various imputation strategies.

In [None]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan,strategy='mean')
imputer.fit(x[:,1:3])
x[:,1:3]= imputer.transform(x[:,1:3])
print(x)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]


The missing_values argument tells SimpleImputer which value marks a missing entry. Here it is np.nan ("not a number"), the usual representation of missing values in the NumPy arrays commonly used in machine learning. The slice x[:,1:3] restricts the imputer to the numeric age and salary columns, since a mean cannot be computed for the text column Country.
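
To see which values the imputer actually fills in, the fitted statistics can be inspected. The sketch below uses a small made-up array (numeric) rather than the notebook's data; statistics_ is the SimpleImputer attribute that stores the per-column fill value.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical age/salary columns with one missing entry in each.
numeric = np.array([[44.0, 72000.0],
                    [27.0, np.nan],
                    [np.nan, 54000.0],
                    [38.0, 61000.0]])

mean_imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
filled = mean_imputer.fit_transform(numeric)

# statistics_ holds the per-column mean computed from the non-missing entries.
print(mean_imputer.statistics_)
print(filled)
```

Swapping strategy='mean' for 'median' or 'most_frequent' changes only the statistic that is computed; the fit and transform steps stay the same.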

Encoding Categorical Data

The next step is to encode the categorical data, since many machine learning algorithms work primarily with numerical data. Encoding transforms categorical data into numerical data. A popular method is one-hot encoding, which replaces a categorical column with one binary column per category.
Let's see how to apply this method to the Country column using the scikit-learn library.

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder',OneHotEncoder(),[0])],remainder='passthrough')
x = np.array(ct.fit_transform(x))
print(x)

[[1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 35.0 58000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 37.0 67000.0]]


ColumnTransformer enables us to apply different transformations to specific columns of our data.

*  In the transformers parameter, each entry is a tuple of three items. The first, 'encoder', is a name for the transformation; this name is arbitrary. The second, OneHotEncoder(), is the transformer we want to apply. The third, [0], indicates that the encoder is applied only to the first column.
*   The second parameter, remainder = 'passthrough', tells ColumnTransformer to leave the columns that are not mentioned unchanged.

So ct applies the one-hot encoding transformation only to the first column of our data and leaves the other columns unchanged.

To apply ct to our data x, we use fit_transform(x). fit_transform is a method of the ColumnTransformer class that performs two steps. First, fit examines the first column of our data (the only column we specified) and determines which unique values exist, so that one-hot encoding can be performed. Next, transform uses what was learned during fit to apply the designated transformation to the specified columns. In the output, the one-hot columns come first, followed by the passthrough columns.

Then np.array converts the transformed data to a NumPy array.
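
The fit and transform steps can be inspected on a small example. This is a sketch on three made-up rows (rows and ct_demo are illustrative names); named_transformers_ and categories_ are the scikit-learn attributes that expose what the encoder learned during fit.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical rows: country, age, salary.
rows = np.array([["France", 44.0, 72000.0],
                 ["Spain", 27.0, 48000.0],
                 ["Germany", 30.0, 54000.0]], dtype=object)

ct_demo = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), [0])],
    remainder="passthrough",
)
encoded = np.array(ct_demo.fit_transform(rows))

# Categories found during fit; the one-hot columns follow this (sorted) order.
print(ct_demo.named_transformers_["encoder"].categories_)
print(encoded)
```

The category order explains the printed matrix in the notebook: the first three columns stand for France, Germany and Spain respectively, which is why a Spain row has its 1 in the third column.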




In [None]:
from sklearn.preprocessing import LabelEncoder
y = LabelEncoder().fit_transform(y)
print(y)

[0 1 0 0 1 1 0 1 0 1]


Splitting the data:

The next step is to split the data into a training set and a testing set.


This can be done with the train_test_split function from the model_selection module of the scikit-learn library.

It generates two pairs:

*   A training set containing features (x_train) and target values (y_train)
*   A testing set containing features (x_test) and target values (y_test)



In [None]:
from sklearn.model_selection import train_test_split
# create four variables: x_train,x_test,y_train,y_test
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size = 0.2,random_state=1)

x and y are our feature matrix and target variable. test_size = 0.2 indicates that 20% of the data is set aside for testing.
random_state is an optional parameter that seeds the random shuffling applied to the data before splitting. random_state = 1 ensures that we get the same split every time we run this code. The specific number is arbitrary; it only needs to stay the same between runs.
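
The effect of random_state can be checked directly by splitting the same data twice. The sketch below uses made-up arrays (features and target are illustrative, not the notebook's x and y).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples with 2 features each, and a binary target.
features = np.arange(20).reshape(10, 2)
target = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

first = train_test_split(features, target, test_size=0.2, random_state=1)
second = train_test_split(features, target, test_size=0.2, random_state=1)

# Same seed, so the four returned arrays are identical across both calls.
print(all(np.array_equal(a, b) for a, b in zip(first, second)))

x_tr, x_te, y_tr, y_te = first
# With 10 samples and test_size=0.2, 8 rows go to training and 2 to testing.
print(x_tr.shape, x_te.shape)
```

Leaving random_state unset would still give an 8/2 split, but which rows land in each set could differ from run to run.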

In [None]:
print(x_train,'\n',x_test,'\n',y_train,'\n',y_test)

[[0.0 0.0 1.0 38.77777777777778 52000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 35.0 58000.0]] 
 [[0.0 1.0 0.0 30.0 54000.0]
 [1.0 0.0 0.0 37.0 67000.0]] 
 [0 1 0 0 1 1 0 1] 
 [0 1]


Feature scaling:

We will use the StandardScaler class from the preprocessing module, which standardizes the feature matrices of both the training set and the test set.

In [None]:
from sklearn.preprocessing import StandardScaler
# Create a StandardScaler object called sc to perform standardization
sc = StandardScaler()
# Note that we do not apply feature scaling to the dummy variables, i.e. the features produced by
# encoding the country column, which only take the values 0 or 1. We scale only the continuous variables such as age and salary.
x_train[:,3:] = sc.fit_transform(x_train[:,3:])
# fit computes the mean and standard deviation of each feature, and transform uses them to standardize the values
x_test[:,3:] = sc.transform(x_test[:,3:])
# Here we use only transform instead of fit_transform, so the test data is scaled with the same mean and SD computed from the training data above.
# This keeps the transformation consistent and prevents data leakage.

In [None]:
print(x_train)
print(x_test)

[[0.0 0.0 1.0 -0.19159184384578545 -1.0781259408412425]
 [0.0 1.0 0.0 -0.014117293757057777 -0.07013167641635372]
 [1.0 0.0 0.0 0.566708506533324 0.633562432710455]
 [0.0 0.0 1.0 -0.30453019390224867 -0.30786617274297867]
 [0.0 0.0 1.0 -1.9018011447007988 -1.420463615551582]
 [1.0 0.0 0.0 1.1475343068237058 1.232653363453549]
 [0.0 1.0 0.0 1.4379472069688968 1.5749910381638885]
 [1.0 0.0 0.0 -0.7401495441200351 -0.5646194287757332]]
[[0.0 1.0 0.0 -1.4661817944830124 -0.9069571034860727]
 [1.0 0.0 0.0 -0.44973664397484414 0.2056403393225306]]
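
To make the role of the training-set mean and standard deviation concrete, the test-set transformation can be reproduced by hand. This is a sketch on made-up age and salary values (train and test are illustrative arrays); it relies on StandardScaler using the population standard deviation, which is what NumPy's std computes by default.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical age/salary values for a training set and a test set.
train = np.array([[27.0, 48000.0], [38.0, 61000.0], [44.0, 72000.0], [50.0, 83000.0]])
test = np.array([[30.0, 54000.0], [37.0, 67000.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learn mean/SD from train, then scale it
test_scaled = scaler.transform(test)        # reuse the training mean/SD

# Manual standardization of the test set with the training statistics.
mean = train.mean(axis=0)
std = train.std(axis=0)  # ddof=0 (population SD)
manual_test = (test - mean) / std

print(np.allclose(test_scaled, manual_test))
```

After scaling, each training column has mean 0 and standard deviation 1. The test columns generally do not, because they are scaled with statistics learned from the training data only.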


Our data is now ready, and we can move on to building machine learning models.