## Data pre-processing for machine learning on the salary data of individuals from different countries

## Importing the libraries

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

## Importing the dataset

In [8]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [10]:
df = pd.read_csv('/content/Data_preprocessing.csv')
print (df)
X = df.iloc[:, :-1].values  # features: every column except the last, as a NumPy array
Y = df.iloc[:, -1].values   # target: the last column (Purchased), as a NumPy array

   Country   Age   Salary Purchased
0   France  44.0  72000.0        No
1    Spain  27.0  48000.0       Yes
2  Germany  30.0  54000.0        No
3    Spain  38.0  61000.0        No
4  Germany  40.0      NaN       Yes
5   France  35.0  58000.0       Yes
6    Spain   NaN  52000.0        No
7   France  48.0  79000.0       Yes
8  Germany  50.0  83000.0        No
9   France  37.0  67000.0       Yes


In [11]:
print (X)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]


In [None]:
print (Y)

['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']


### Importing Scikit-learn to handle missing values

In [12]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values = np.nan, strategy = 'mean')
imputer.fit(X[:, 1:3])  # 1:3 selects columns 1 and 2 (Age and Salary); the upper bound 3 is excluded
X[:, 1:3] = imputer.transform(X[:,1:3])
print(X)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
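
The two filled-in values can be checked by hand. The sketch below is a NumPy-only illustration of mean imputation, not scikit-learn's internal code; the arrays are copied from the printed DataFrame, and the names `age`, `salary`, `age_filled`, and `salary_filled` are introduced here for illustration.

```python
import numpy as np

# Age and Salary columns copied from the printed DataFrame; np.nan marks the gaps.
age = np.array([44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37], dtype=float)
salary = np.array([72000, 48000, 54000, 61000, np.nan,
                   58000, 52000, 79000, 83000, 67000], dtype=float)

# The mean is computed over the observed values only (NaN entries are skipped).
age_mean = np.nanmean(age)
salary_mean = np.nanmean(salary)

# Replace each missing entry with the mean of its own column.
age_filled = np.where(np.isnan(age), age_mean, age)
salary_filled = np.where(np.isnan(salary), salary_mean, salary)

print(age_filled[6], salary_filled[4])
```

The mean of the nine known ages is 349 / 9 and the mean of the nine known salaries is 574000 / 9, which correspond to the 38.777... and 63777.777... shown in the output above.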


## Encoding Categorical data:

## Encoding the independent variable

In [13]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder  # OneHotEncoder turns each category into its own 0/1 column
ct = ColumnTransformer(transformers = [('encoder', OneHotEncoder(), [0])], remainder = 'passthrough')
X = np.array(ct.fit_transform(X))
print (X)

[[1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 35.0 58000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 37.0 67000.0]]
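
The three new columns follow the categories in sorted order (France, Germany, Spain), which is why France appears as 1 0 0 in the output. A minimal NumPy sketch of the same idea, assuming nothing about scikit-learn's internals; the names `countries`, `categories`, `codes`, and `one_hot` are introduced here for illustration:

```python
import numpy as np

# Country column copied from the printed DataFrame.
countries = np.array(['France', 'Spain', 'Germany', 'Spain', 'Germany',
                      'France', 'Spain', 'France', 'Germany', 'France'])

# np.unique returns the sorted distinct categories and, for each row,
# the index of its category in that sorted list.
categories, codes = np.unique(countries, return_inverse=True)

# Row i of the identity matrix is the one-hot vector for category i.
one_hot = np.eye(len(categories))[codes]

print(categories)
print(one_hot[:3])
```

Each row contains exactly one 1, so the three columns together carry the same information as the original Country column.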


## Encoding the dependent variable
Note: *We use LabelEncoder from scikit-learn's preprocessing module instead of OneHotEncoder because the dependent variable contains only two distinct categories.

It is a simple and effective way to encode a binary categorical variable, where the order of the labels doesn't matter, unlike in ordinal encoding.*

In [14]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
Y = le.fit_transform(Y)  # learn the two labels and replace them with integers ('No' -> 0, 'Yes' -> 1)
print (Y)


[0 1 0 0 1 1 0 1 0 1]
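
To see which integer stands for which label, and to map predictions back to the original strings later, the fitted encoder can be inspected. A short sketch, assuming the same ten labels as above; the variable names `y`, `y_encoded`, and `decoded` are illustrative:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Purchased column copied from the printed DataFrame.
y = np.array(['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'])

le = LabelEncoder()
y_encoded = le.fit_transform(y)

# classes_ lists the labels in code order: position 0 is code 0, position 1 is code 1.
print(le.classes_)

# inverse_transform converts integer codes back to the original labels.
decoded = le.inverse_transform(y_encoded)
print(decoded)
```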


## Splitting the data into training and testing set
Scikit-learn offers a function called train_test_split in its model_selection module. This function shuffles the data and splits it into a training set and a testing set in the proportion we request.

It generates two pairs:

*   A training set containing features (X_train) and target variables (Y_train)
*   A testing set containing features (X_test) and target variables (Y_test)




In [15]:
from sklearn.model_selection import train_test_split
#create four variables: X_train, X_test, Y_train, Y_test
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 1)


In [18]:
print (X_train)

[[0.0 0.0 1.0 38.77777777777778 52000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 35.0 58000.0]]


In [20]:
print (X_test)

[[0.0 1.0 0.0 30.0 54000.0]
 [1.0 0.0 0.0 37.0 67000.0]]


In [21]:
print (Y_train)

[0 1 0 0 1 1 0 1]


In [22]:
print (Y_test)

[0 1]
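
With test_size = 0.2 and ten rows, two rows go to the test set and eight to the training set. Fixing random_state makes the shuffle reproducible, so rerunning the cell yields the same split. A small sketch that splits row indices instead of the real data (the name `row_ids` is introduced here for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for the ten rows of the dataset: just their row numbers.
row_ids = np.arange(10)

# Two calls with the same random_state produce the same shuffle and the same split.
train_a, test_a = train_test_split(row_ids, test_size=0.2, random_state=1)
train_b, test_b = train_test_split(row_ids, test_size=0.2, random_state=1)

print(len(train_a), len(test_a))
print(np.array_equal(test_a, test_b))
```

Every row ends up in exactly one of the two sets, so no observation is used for both training and testing.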


### Feature scaling

Even after splitting our data into training and testing sets, we might encounter issues if our features have significantly different ranges. This is where feature scaling comes in.

Feature scaling is a data preprocessing technique that transforms our features to a common scale so that no single feature dominates the others. In this dataset, Salary is in the tens of thousands while Age is in the tens. Without scaling, a model that relies on distances or on gradient-based optimization may give far more weight to Salary simply because its values are larger, leading to biased results.

It's important to note that the scaler is fitted after the train-test split, on the training set only, because the test set is supposed to be brand-new data on which we evaluate our machine learning model. If the test rows contributed to the mean and standard deviation used for scaling, information from the test set would leak into training. Keeping them out maintains the integrity of our evaluation and lets us assess the model's performance on unseen data accurately.

There are two common techniques for feature scaling: Normalization and Standardization.


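*   Normalization (min-max scaling) rescales each feature to the range 0 to 1. It is a good choice when we don't know the underlying distribution of our data

*   Standardization rescales each feature to zero mean and unit standard deviation. It is good to use when our data roughly follows a normal distribution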
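
As a rough illustration of the two techniques, normalization computes (x - min) / (max - min) and standardization computes (x - mean) / standard deviation. The helper functions and the toy `ages` array below are hypothetical and are not part of the notebook:

```python
import numpy as np

def min_max_normalize(x):
    """Rescale x linearly so its minimum maps to 0 and its maximum to 1."""
    return (x - x.min()) / (x.max() - x.min())

def standardize(x):
    """Rescale x to zero mean and unit (population) standard deviation."""
    return (x - x.mean()) / x.std()

ages = np.array([27.0, 38.0, 50.0])  # hypothetical toy values

normalized = min_max_normalize(ages)
standardized = standardize(ages)

print(normalized)
print(standardized)
```

Normalized values always fall between 0 and 1, while standardized values are centered on 0 and can be negative or larger than 1.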

We use the **StandardScaler** class from the preprocessing module, which applies standardization to both the matrix of features of the training set and the matrix of features of the test set.



In [24]:
from sklearn.preprocessing import StandardScaler
# create a StandardScaler object called sc to perform standardization
# note that we do not scale the dummy variables (columns 0 to 2), only Age and Salary (columns 3 onward)
sc = StandardScaler()
# fit on the training data (compute mean and standard deviation), then transform it
X_train[:, 3:] = sc.fit_transform(X_train[:, 3:])
# apply the same transformation to the testing data
X_test[:, 3:] = sc.transform(X_test[:, 3:])
# only transform is used on the testing data, because the mean and standard deviation were already computed from the training data

In [25]:
print (X_train)

[[0.0 0.0 1.0 -0.1915918438457856 -1.0781259408412427]
 [0.0 1.0 0.0 -0.014117293757057902 -0.07013167641635401]
 [1.0 0.0 0.0 0.5667085065333239 0.6335624327104546]
 [0.0 0.0 1.0 -0.3045301939022488 -0.30786617274297895]
 [0.0 0.0 1.0 -1.901801144700799 -1.4204636155515822]
 [1.0 0.0 0.0 1.1475343068237056 1.2326533634535488]
 [0.0 1.0 0.0 1.4379472069688966 1.5749910381638883]
 [1.0 0.0 0.0 -0.7401495441200352 -0.5646194287757336]]


In [26]:
print (X_test)

[[0.0 1.0 0.0 -1.4661817944830127 -0.9069571034860731]
 [1.0 0.0 0.0 -0.44973664397484425 0.20564033932253029]]
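
The scaled Age values above can be reproduced by hand, which also shows why only `transform` is called on the test set. This NumPy sketch is an illustration rather than scikit-learn's own code; the arrays are copied from the unscaled X_train and X_test printouts, and the variable names are introduced here:

```python
import numpy as np

# Age values of the training and test rows before scaling.
age_train = np.array([349 / 9, 40.0, 44.0, 38.0, 27.0, 48.0, 50.0, 35.0])
age_test = np.array([30.0, 37.0])

# "fit": the statistics come from the training rows only.
mean = age_train.mean()
std = age_train.std()  # population standard deviation (ddof=0)

# "transform": the same training statistics are applied to both sets.
age_train_scaled = (age_train - mean) / std
age_test_scaled = (age_test - mean) / std

print(age_train_scaled[0])
print(age_test_scaled)
```

The results correspond to the fourth column of the scaled X_train and X_test printouts. The test rows never influence `mean` or `std`, so nothing about them leaks into the fitted scaler.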
