<a href="https://colab.research.google.com/github/pritish-tripathy-aiml/Python-AI-Machine-Learning/blob/main/Data_Preprocessing_in_Python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#**Data Preprocessing in Python**

In [1]:
# The Machine Learning Process
# 1
'''
Data Preprocessing
-> Import the data
-> Clean the Data
-> Split into training and test Sets
'''
# 2
'''
Modelling
-> Build the Model
-> Train the Model
-> Make Predictions
'''
# 3
'''
Evaluation
-> Calculate Performance Metrics
-> Make a verdict
'''

In [2]:
# Training Set and Test Set
'''
Before fitting any model, we always split the data into two parts: a training set and a test set.
A common split is Train = 80% and Test = 20%.
The model, e.g. ŷ = b₀ + b₁X₁ + b₂X₂, is fitted on the training set; we then apply it to the test set
and compare its predictions with the actual values.
'''
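
The idea above can be sketched on synthetic data. The data, the coefficient values (3, 2, -1), and the names `X_demo`, `y_demo`, and `model` are assumptions made for illustration only; they are not part of the course dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data (assumed for illustration): y = 3 + 2*X1 - 1*X2 + small noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 2))
y_demo = 3.0 + 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=100)

# 80% of the rows go to training, 20% are held back for testing
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# Estimate b0, b1, b2 from the training set only
model = LinearRegression().fit(X_tr, y_tr)
print("b0:", model.intercept_, "b1, b2:", model.coef_)

# Predict on the unseen test set and compare with the actual values
y_pred = model.predict(X_te)
print("Mean absolute error on the test set:", np.mean(np.abs(y_pred - y_te)))
```

Because the test rows were never seen during fitting, the error printed at the end is an honest estimate of how the model behaves on new data.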

In [3]:
# Feature Scaling
'''
Normalization => X' = (X - Xmin) / (Xmax - Xmin)
Standardization => X' = (X - mu) / sigma
'''

In [4]:
# Dataset Used -> Part 1 - Data Preprocessing/Section 2/Python/Data.csv

##**Importing the Libraries**

In [5]:
import numpy as np    # To work with Arrays
import matplotlib.pyplot as plt   # To make charts and diagrams
import pandas as pd     # To import dataset and to create matrix for variables

##**Importing the Dataset**

In [6]:
dataset = pd.read_csv('/content/Data.csv')
'''
The features (independent variables) come first, and the dependent variable is in the last column.
The dependent variable is the one we need to predict.
'''
# Takes all the Columns except the Last One
X = dataset.iloc[:, :-1].values      # Stores Features in X

# Takes only the last column (the dependent variable)
y = dataset.iloc[:, -1].values      # Stores Dependent Variables in y
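
As a quick check that does not depend on the CSV path, the same split can be sketched on an in-memory DataFrame. The rows are copied from the printed `X` and `y` in this notebook; the column names and the variable names `demo`, `X_demo`, and `y_demo` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Rows copied from the notebook's printed X and y; column names are assumed
demo = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain", "Germany",
                "France", "Spain", "France", "Germany", "France"],
    "Age": [44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37],
    "Salary": [72000, 48000, 54000, 61000, np.nan,
               58000, 52000, 79000, 83000, 67000],
    "Purchased": ["No", "Yes", "No", "No", "Yes",
                  "Yes", "No", "Yes", "No", "Yes"],
})

X_demo = demo.iloc[:, :-1].values   # every column except the last
y_demo = demo.iloc[:, -1].values    # only the last column

print(X_demo.shape, y_demo.shape)
print(demo.isna().sum())            # missing values per column
```

Counting missing values at this stage shows which columns will need attention before modelling.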

In [7]:
print(X)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]


In [8]:
print(y)

['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']


##**Taking Care of Missing Data**

In [9]:
'''
If the dataset is large and only about 1% or less of it has missing values,
you can simply delete the rows with missing data and build the model; it will not affect the model much.
With a smaller dataset, however, you need to fill in the empty values instead.
'''
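
A minimal sketch of the "just delete" option, using the two numeric columns from the printed `X`. The column names and the variables `demo`, `missing_share`, and `dropped` are assumed for illustration.

```python
import numpy as np
import pandas as pd

# Numeric columns copied from the notebook's printed X; column names are assumed
demo = pd.DataFrame({
    "Age": [44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37],
    "Salary": [72000, 48000, 54000, 61000, np.nan,
               58000, 52000, 79000, 83000, 67000],
})

# Share of rows that contain at least one missing value
missing_share = demo.isna().any(axis=1).mean()
print(f"Rows with missing values: {missing_share:.0%}")

# Dropping those rows removes the gaps but also shrinks the dataset
dropped = demo.dropna()
print(len(demo), "->", len(dropped), "rows")
```

On a dataset this small, dropping rows discards a large share of the observations, which is why the notebook fills the gaps instead.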

In [10]:
# Replace missing values in the two numeric columns (including salary) with each column's mean
from sklearn.impute import SimpleImputer    # Used to take care of Empty Values
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])
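
To see which values the imputer fills in, one option is to inspect its `statistics_` attribute after fitting. The sketch below rebuilds the two numeric columns from the printed `X`; the names `numeric`, `imputer_demo`, and `filled` are assumed for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# The two numeric columns copied from the notebook's printed X
numeric = np.array([
    [44.0, 72000.0], [27.0, 48000.0], [30.0, 54000.0], [38.0, 61000.0],
    [40.0, np.nan], [35.0, 58000.0], [np.nan, 52000.0], [48.0, 79000.0],
    [50.0, 83000.0], [37.0, 67000.0],
])

imputer_demo = SimpleImputer(missing_values=np.nan, strategy='mean')
filled = imputer_demo.fit_transform(numeric)

# One learned mean per column, computed from the non-missing entries
print(imputer_demo.statistics_)
print(filled[4], filled[6])   # the two rows that had gaps
```

The learned means should match the filled-in values that appear when `X` is printed after imputation.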

In [11]:
print(X)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]


##**Encoding Categorical Data**

In [12]:
'''
Encoding converts each category (here, each country) into numbers.
One-hot encoding gives every country its own binary (0/1) column.
'''

###**Encoding Independent Variables/Features Data**

In [13]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder',OneHotEncoder(), [0])], remainder='passthrough')
X = np.array(ct.fit_transform(X))
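
To find out which one-hot column belongs to which country, a small sketch with a standalone `OneHotEncoder` can help. The array `countries` and the names `enc` and `encoded` are assumed for illustration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# A few country values in a single column (2-D input is required)
countries = np.array([["France"], ["Spain"], ["Germany"], ["Spain"]])

enc = OneHotEncoder()
# fit_transform returns a sparse matrix by default, so convert it for printing
encoded = enc.fit_transform(countries).toarray()

print(enc.categories_[0])   # column order of the one-hot output
print(encoded)
```

The categories are sorted, so the first one-hot column stands for France, the second for Germany, and the third for Spain, which agrees with the encoded `X` printed in the notebook.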

In [15]:
print(X)

[[1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 35.0 58000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 37.0 67000.0]]


###**Encoding the Dependent Variable**

In [16]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)
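
The mapping between labels and integers can be read back from the encoder. In this sketch the names `labels`, `le_demo`, and `codes` are assumed for illustration.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

labels = np.array(["No", "Yes", "No", "No", "Yes"])

le_demo = LabelEncoder()
codes = le_demo.fit_transform(labels)

# classes_ lists the original labels; the position of each label is its integer code
print(le_demo.classes_)
print(codes)

# inverse_transform maps the integer codes back to the original labels
print(le_demo.inverse_transform(codes))
```

Keeping the fitted encoder around makes it possible to turn model predictions back into readable labels later.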

In [17]:
print(y)

[0 1 0 0 1 1 0 1 0 1]


##**Splitting the Dataset into Training and Test Set**

In [18]:
'''
We have to do Feature Scaling after we split the dataset into Training and Test Set
to prevent information leakage: the Test Set stands in for completely new data on which we evaluate our model,
so its values must not influence the scaler.
'''

In [19]:
# Splitting the Dataset into Training and Test Set
'''
Training Set -> We train (fit) our model on the Training Set
Test Set -> It is like a completely new set with which we test our Model
'''
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 1)
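
A small sketch of what `test_size` and `random_state` do, on placeholder arrays. The arrays `X_demo` and `y_demo` and the names `split_a` and `split_b` are assumed for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten placeholder rows, the same number of rows as the dataset
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Each call returns [X_train, X_test, y_train, y_test]
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)

# test_size=0.2 on 10 rows gives 8 training rows and 2 test rows
print(len(split_a[0]), len(split_a[1]))

# The same random_state reproduces exactly the same shuffle
print(all(np.array_equal(a, b) for a, b in zip(split_a, split_b)))
```

Fixing `random_state` keeps the split identical between runs, which makes results comparable.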

In [20]:
print(X_train)

[[0.0 0.0 1.0 38.77777777777778 52000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 35.0 58000.0]]


In [21]:
print(X_test)

[[0.0 1.0 0.0 30.0 54000.0]
 [1.0 0.0 0.0 37.0 67000.0]]


In [22]:
print(y_train)

[0 1 0 0 1 1 0 1]


In [23]:
print(y_test)

[0 1]


##**Feature Scaling**

In [24]:
'''
In some models, features with large values dominate features with small values,
so the dominated features are barely considered by the Machine Learning Model.
Feature Scaling puts all the features on the same scale.
'''
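
One way to see this dominance is to measure the Euclidean distance between two rows before and after scaling. The four rows below use value pairs that appear in the dataset; the names `people`, `raw_dist`, and `diff` are assumed for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two numeric features per row; the second is on a far larger scale than the first
people = np.array([[27.0, 48000.0],
                   [48.0, 79000.0],
                   [50.0, 83000.0],
                   [35.0, 58000.0]])

# Unscaled: the distance is almost entirely the difference in the second feature
raw_dist = np.linalg.norm(people[0] - people[1])
print(raw_dist)

# Scaled: both features contribute a comparable amount to the distance
scaled = StandardScaler().fit_transform(people)
diff = scaled[0] - scaled[1]
print(diff, np.linalg.norm(diff))
```

Distance-based models are the clearest case: without scaling, the first feature has practically no say in which rows count as "close".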

In [25]:
'''
There are two ways to do Feature Scaling
i) Standardisation = (x - mean(x))/standard deviation(x)
ii) Normalisation = (x - min(x))/(max(x) - min(x))
x = all the values of one feature
Standardisation is recommended as the default for feature scaling: Normalisation is suggested when most features follow a Normal Distribution,
whereas Standardisation works well across most kinds of distribution
'''
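
Both formulas can be applied by hand with NumPy. The sample values below appear in the dataset; the names `values`, `normalised`, and `standardised` are assumed for illustration.

```python
import numpy as np

values = np.array([27.0, 30.0, 38.0, 44.0, 50.0])

# Normalisation: (x - min(x)) / (max(x) - min(x)), result lies in [0, 1]
normalised = (values - values.min()) / (values.max() - values.min())

# Standardisation: (x - mean(x)) / standard deviation(x), result has mean 0 and std 1
standardised = (values - values.mean()) / values.std()

print(normalised)
print(standardised)
```

`values.std()` uses the population standard deviation, which is also what scikit-learn's `StandardScaler` uses.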

In [26]:
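# Feature Scaling is applied only to the numerical columns (index 3 onwards); the one-hot columns stay as they are
# The scaler is fitted on the training set only and then reused, unchanged, on the test set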
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train[:, 3:] = sc.fit_transform(X_train[:, 3:])
X_test[:, 3:] = sc.transform(X_test[:, 3:])
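
The fit-on-train, transform-on-test pattern can be checked in isolation. The numbers below are the numeric columns of the printed `X_train` and `X_test`; the names `train_num`, `test_num`, and `scaler` are assumed for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Numeric columns copied from the notebook's printed X_train and X_test
train_num = np.array([
    [38.77777777777778, 52000.0], [40.0, 63777.77777777778],
    [44.0, 72000.0], [38.0, 61000.0], [27.0, 48000.0],
    [48.0, 79000.0], [50.0, 83000.0], [35.0, 58000.0],
])
test_num = np.array([[30.0, 54000.0], [37.0, 67000.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_num)   # learns mean and std from the training rows
test_scaled = scaler.transform(test_num)         # reuses those training statistics

print(scaler.mean_, scaler.scale_)
print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(test_scaled)
```

The scaled training columns have mean 0 and standard deviation 1 by construction, while the test columns generally do not, because they are scaled with the training statistics rather than their own.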

In [27]:
print(X_train)

[[0.0 0.0 1.0 -0.19159184384578545 -1.0781259408412425]
 [0.0 1.0 0.0 -0.014117293757057777 -0.07013167641635372]
 [1.0 0.0 0.0 0.566708506533324 0.633562432710455]
 [0.0 0.0 1.0 -0.30453019390224867 -0.30786617274297867]
 [0.0 0.0 1.0 -1.9018011447007988 -1.420463615551582]
 [1.0 0.0 0.0 1.1475343068237058 1.232653363453549]
 [0.0 1.0 0.0 1.4379472069688968 1.5749910381638885]
 [1.0 0.0 0.0 -0.7401495441200351 -0.5646194287757332]]


In [28]:
print(X_test)

[[0.0 1.0 0.0 -1.4661817944830124 -0.9069571034860727]
 [1.0 0.0 0.0 -0.44973664397484414 0.2056403393225306]]


##**End of Data Preprocessing in Python**