
# Data Preprocessing Tools

## Importing the libraries

In [21]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## Importing the dataset

In [22]:
dataset = pd.read_csv("Data.csv")
# iloc indexes by integer position; loc indexes by label, e.g. ["name1", "name2"]
# iloc slices are [row_start:row_stop, col_start:col_stop]: start is included, stop is excluded
X = dataset.iloc[:, :-1].values  # all rows, every column except the last (features)
y = dataset.iloc[:, -1].values  # all rows, last column only (dependent variable)

In [23]:
print(X)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]


In [24]:
print(y)

['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']
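
To make the difference between position-based and label-based indexing concrete, here is a small sketch on a three-row frame rebuilt inline from the first rows printed above. The names `df`, `features_by_position`, and `features_by_label` are used only for this illustration and are not part of the notebook.

```python
import pandas as pd

# First three rows of the dataset, rebuilt inline for illustration
df = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44.0, 27.0, 30.0],
    "Salary": [72000.0, 48000.0, 54000.0],
    "Purchased": ["No", "Yes", "No"],
})

# iloc: integer positions, the stop index is excluded
features_by_position = df.iloc[:, :-1]

# loc: labels, and a label slice includes BOTH endpoints
features_by_label = df.loc[:, "Country":"Salary"]

print(list(features_by_position.columns))
print(features_by_position.equals(features_by_label))
```

Both selections pick the same three feature columns, even though the `iloc` slice stops before the last position while the `loc` slice names its last column explicitly.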


## Taking care of missing data

In [25]:
# number of missing entries in each column
missing_values = dataset.isnull().sum()
print(missing_values)

Country      0
Age          1
Salary       1
Purchased    0
dtype: int64


In [26]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
# fit computes the mean of each column (ignoring NaN); transform then replaces every NaN with that column's mean
imputer.fit(X[:, 1:3])  # only the numerical columns (Age and Salary)
X[:, 1:3] = imputer.transform(X[:, 1:3])  # write the imputed columns back into X

In [27]:
print(X)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
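
As a sanity check on the two filled-in values, the sketch below rebuilds the Age and Salary columns from the output printed before imputation and compares the imputer's learned means with `np.nanmean`. The array name `numeric` and the use of `fit_transform` (fit and transform in one call) are choices made for this illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Age and Salary columns as printed above, before imputation
numeric = np.array([
    [44.0, 72000.0], [27.0, 48000.0], [30.0, 54000.0], [38.0, 61000.0],
    [40.0, np.nan], [35.0, 58000.0], [np.nan, 52000.0], [48.0, 79000.0],
    [50.0, 83000.0], [37.0, 67000.0],
])

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
filled = imputer.fit_transform(numeric)  # fit and transform in one call

# statistics_ holds the per-column fill values learned during fit
print(imputer.statistics_)
# np.nanmean computes the same column means while skipping NaN entries
print(np.nanmean(numeric, axis=0))
```

The means work out to 349 / 9 for Age and 574000 / 9 for Salary, which matches the values that appear in the imputed rows.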


## Encoding categorical data

In [28]:
# Mapping categories to integers (e.g. France = 0, Spain = 1, Germany = 2) is not ideal for a feature column,
# because the machine learning model may interpret the numbers as a meaningful order. One-hot encoding instead
# creates one binary column per distinct category (here France = [1, 0, 0], Germany = [0, 1, 0], Spain = [0, 0, 1]).
# For the dependent column, converting the labels to integers with LabelEncoder is fine when the output is binary.

### Encoding the Independent Variable

In [29]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
# ColumnTransformer takes two arguments here:
# 1. transformers: a list of (name, transformer, columns) tuples. The name is an arbitrary label ("encoder"),
#    the transformer is the object to apply (OneHotEncoder()), and columns lists the column indices to transform ([0]).
# 2. remainder: what to do with the columns that are not listed. "passthrough" keeps them unchanged; "drop" (the default) removes them.
ct = ColumnTransformer(transformers=[("encoder", OneHotEncoder(), [0])], remainder="passthrough")
X = np.array(ct.fit_transform(X))  # np.array(...) makes sure X is a NumPy array for the later steps

In [30]:
print(X)

[[1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 35.0 58000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 37.0 67000.0]]
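
The order of the three one-hot columns above follows the sorted category names, not the order in which the countries first appear. A minimal sketch, using `OneHotEncoder` on its own with a small `countries` array that exists only for this illustration:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# One column of country names, in order of first appearance in the dataset
countries = np.array([["France"], ["Spain"], ["Germany"]])

encoder = OneHotEncoder()
# OneHotEncoder returns a sparse matrix by default; toarray() makes it dense
encoded = encoder.fit_transform(countries).toarray()

print(encoder.categories_[0])  # learned categories, sorted
print(encoded)                 # one row per input, one column per category
```

Because the categories are sorted, France maps to the first column, Germany to the second, and Spain to the third, which is why the Spain rows in the printed `X` start with `0.0 0.0 1.0`.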


### Encoding the Dependent Variable

In [31]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)  # fit_transform already returns a NumPy array of integer labels, so no conversion is needed

In [32]:
print(y)

[0 1 0 0 1 1 0 1 0 1]
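
To see which integer stands for which label, and to map predictions back to the original strings later, the fitted encoder can be inspected. A short sketch, with the `labels` list copied from the output of `print(y)` before encoding:

```python
from sklearn.preprocessing import LabelEncoder

# Dependent variable as printed before encoding
labels = ["No", "Yes", "No", "No", "Yes", "Yes", "No", "Yes", "No", "Yes"]

le = LabelEncoder()
encoded = le.fit_transform(labels)

# classes_ lists the labels in sorted order; a label's index is its integer code
print(le.classes_)
# inverse_transform maps integer codes back to the original labels
print(le.inverse_transform(encoded[:3]))
```

Here `"No"` is encoded as 0 and `"Yes"` as 1, consistent with the array printed above.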


## Splitting the dataset into the Training set and Test set
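
The notebook leaves this section empty. One common way to do the split is scikit-learn's `train_test_split`; the sketch below is an assumption about how it could look, not the author's code. The stand-in arrays `X_demo` and `y_demo` only mimic the shapes of the encoded `X` and `y`, and `test_size=0.2` and `random_state=1` are illustrative choices.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in arrays with the same shapes as the encoded X (10 x 5) and y (10,)
X_demo = np.arange(50, dtype=float).reshape(10, 5)
y_demo = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

# Hold out 20% of the rows for testing; random_state makes the shuffle reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)

print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)
```

With ten rows, this holds out two rows for testing and keeps eight for training.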

## Feature Scaling
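
This section is also empty in the notebook. A possible approach, sketched here as an assumption rather than the author's method, is to standardize the numeric columns with `StandardScaler`, fitting on the training rows only so that no information from the test rows leaks into the scaling. The small `X_train` and `X_test` arrays are stand-ins taken from rows of the encoded `X` printed earlier, and leaving the one-hot columns unscaled is a design choice for this illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Stand-in train/test features: 3 one-hot columns followed by Age and Salary
X_train = np.array([
    [1.0, 0.0, 0.0, 44.0, 72000.0],
    [0.0, 0.0, 1.0, 27.0, 48000.0],
    [0.0, 1.0, 0.0, 30.0, 54000.0],
    [0.0, 0.0, 1.0, 38.0, 61000.0],
])
X_test = np.array([
    [1.0, 0.0, 0.0, 35.0, 58000.0],
    [0.0, 1.0, 0.0, 50.0, 83000.0],
])

sc = StandardScaler()
# Learn mean and standard deviation from the training rows only
X_train[:, 3:] = sc.fit_transform(X_train[:, 3:])
# Reuse the training statistics on the test rows
X_test[:, 3:] = sc.transform(X_test[:, 3:])

print(X_train[:, 3:].mean(axis=0))  # close to 0 for each scaled column
print(X_train[:, 3:].std(axis=0))   # close to 1 for each scaled column
```

The test rows are scaled with the training mean and standard deviation, so their scaled columns will generally not have mean 0 and standard deviation 1.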