# Machine Learning Notebook 1: Data preprocessing

### Compiled by Amit Purswani
LinkedIn: https://www.linkedin.com/in/amit-purswani-2a073777/

Repositories
1. Data Analysis:
https://github.com/kranemetal/Data-Analysis-Projects

2. Machine Learning:
https://github.com/kranemetal/MachineLearning

## Importing libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## Importing Dataset

In [2]:
# Raw string so the backslashes in the Windows path are not treated as escape sequences
df = pd.read_csv(r"C:\Users\krane\Desktop\datasets\Data.csv")

In [3]:
df.head()

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,,Yes


In [4]:
df.shape

(10, 4)
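For readers who do not have `Data.csv` at the path above, here is a minimal sketch that rebuilds the same ten rows from the values printed in this notebook. The names `csv_text` and `demo_df` are illustrative only, and the CSV layout is an assumption based on the column names shown by `df.head()`.

```python
import io

import pandas as pd

# The ten rows printed in this notebook, typed in as CSV text.
# Empty fields become NaN when pandas parses them.
csv_text = """Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,54000,No
Spain,38,61000,No
Germany,40,,Yes
France,35,58000,Yes
Spain,,52000,No
France,48,79000,Yes
Germany,50,83000,No
France,37,67000,Yes
"""

demo_df = pd.read_csv(io.StringIO(csv_text))
print(demo_df.shape)
print(demo_df.isnull().sum())
```

Reading from an in-memory string keeps the example independent of any local file path.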

## Separating independent variables x and dependent variable y

In [5]:
x = df.iloc[:, :-1].values
# :-1 selects every column except the last one
y = df.iloc[:, -1].values
# -1 selects only the last column
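The two `iloc` slices can be tried on a tiny frame to see the shapes they return. This is only a sketch; `toy`, `features` and `target` are made-up names, and the two rows are copied from the top of the dataset.

```python
import pandas as pd

toy = pd.DataFrame({
    "Country": ["France", "Spain"],
    "Age": [44.0, 27.0],
    "Purchased": ["No", "Yes"],
})

features = toy.iloc[:, :-1].values  # every row, every column except the last
target = toy.iloc[:, -1].values     # every row, last column only

print(features.shape)  # (2, 2)
print(target.shape)    # (2,)
```

Note that `features` is two-dimensional while `target` is one-dimensional, which is the layout scikit-learn estimators generally expect.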

In [6]:
print(x)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]


In [7]:
print(y)

['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']


## Taking care of missing data

In [8]:
df.isnull().sum()

Country      0
Age          1
Salary       1
Purchased    0
dtype: int64

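#### When the share of missing data is very low, around 1% of the records, we can delete those records, since removing them has little effect on the overall data. Here 2 of the 10 rows have a missing value, so we fill the gaps with the column mean instead.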
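As a rough sketch of the deletion option, the share of missing values per column can be measured before deciding whether to drop rows. The frame `toy` below holds made-up illustrative values, not the notebook's data, and `dropna` is one possible way to delete incomplete rows.

```python
import numpy as np
import pandas as pd

# Illustrative values only.
toy = pd.DataFrame({
    "Age": [44.0, 27.0, np.nan, 38.0],
    "Salary": [72000.0, 48000.0, 54000.0, np.nan],
})

# Share of missing values per column, as a percentage.
missing_pct = toy.isnull().mean() * 100
print(missing_pct)

# Drop every row that has at least one missing value.
complete_rows = toy.dropna()
print(len(toy), "->", len(complete_rows))
```

With only four rows, dropping removes half of the data, which illustrates why deletion is usually reserved for cases where very few records are affected.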

In [9]:
#import library for treating missing values
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputer.fit(x[:,1:3])
# 1:3 selects the second and third columns (Age and Salary); indexing starts at 0.
# The stop index 3 is exclusive, so the slice ends at column index 2.
x[:,1:3] = imputer.transform(x[:,1:3])
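To check what the imputer actually learns, the sketch below fits it on the Salary column alone and compares the stored statistic with a plain NumPy mean that ignores the missing entry. The names `salary`, `imp` and `filled` are illustrative.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Salary column from the dataset above, including its one missing value.
salary = np.array([[72000.0], [48000.0], [54000.0], [61000.0], [np.nan],
                   [58000.0], [52000.0], [79000.0], [83000.0], [67000.0]])

imp = SimpleImputer(missing_values=np.nan, strategy="mean")
filled = imp.fit_transform(salary)

# statistics_ holds the per-column value learned by fit (here, the mean).
print(imp.statistics_)
print(np.nanmean(salary))  # mean of the nine known salaries
```

Both printed values should agree, and the gap in row 4 is filled with that same number.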

In [10]:
print(x)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]


## Encode categorical data

### Encoding independent variable


In [11]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

In [12]:
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')

In [13]:
x = np.array(ct.fit_transform(x))

In [14]:
print(x)

[[1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 35.0 58000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 37.0 67000.0]]
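The order of the three dummy columns comes from the categories the encoder learns, which it stores in sorted order. A minimal sketch on the Country values alone, with illustrative names `countries`, `enc` and `dummies`:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

countries = np.array([["France"], ["Spain"], ["Germany"], ["Spain"]])

enc = OneHotEncoder()
# fit_transform returns a sparse matrix by default; toarray() makes it dense.
dummies = enc.fit_transform(countries).toarray()

print(enc.categories_)  # learned categories, in sorted order
print(dummies)
```

France, Germany and Spain map to the first, second and third dummy column respectively, which matches the encoded rows printed above.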


### Encoding dependent variable

In [15]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)

In [16]:
print(y)

[0 1 0 0 1 1 0 1 0 1]
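To see which label received which integer, the fitted encoder exposes `classes_`, and `inverse_transform` maps the codes back. A small sketch using the first five labels of the dataset; `labels`, `enc` and `codes` are illustrative names.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

labels = np.array(["No", "Yes", "No", "No", "Yes"])

enc = LabelEncoder()
codes = enc.fit_transform(labels)

print(enc.classes_)                  # position in this array = integer code
print(codes)
print(enc.inverse_transform(codes))  # back to the original strings
```

Here 'No' becomes 0 and 'Yes' becomes 1 because the classes are stored in sorted order.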


## Splitting dataset in Train and Test sets

In [17]:
from sklearn.model_selection import train_test_split

In [18]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size= 0.2, random_state=1)
# test_size is the fraction of rows held out as the test set; 0.2 means 20%.
# random_state seeds the shuffle; any fixed integer makes the split reproducible, which helps when several people work on the same data.
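The effect of `test_size` and `random_state` can be checked on toy arrays: two calls with the same seed should return identical splits. The arrays `features` and `target` below are placeholders, not the notebook's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

features = np.arange(20).reshape(10, 2)  # 10 toy rows, 2 columns
target = np.arange(10)

a_train, a_test, _, _ = train_test_split(features, target, test_size=0.2, random_state=1)
b_train, b_test, _, _ = train_test_split(features, target, test_size=0.2, random_state=1)

print(a_train.shape, a_test.shape)     # 8 train rows, 2 test rows
print(np.array_equal(a_test, b_test))  # same seed, same split
```

Changing `random_state` to another integer would typically select different rows for the test set.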

In [19]:
print(x_train)

[[0.0 0.0 1.0 38.77777777777778 52000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 35.0 58000.0]]


In [20]:
print(x_test)

[[0.0 1.0 0.0 30.0 54000.0]
 [1.0 0.0 0.0 37.0 67000.0]]


In [21]:
print(y_train)

[0 1 0 0 1 1 0 1]


In [22]:
print(y_test)

[0 1]


## Feature Scaling

#### Feature scaling is done after splitting the data into train and test sets.<br>If we scaled before splitting, the mean and standard deviation would be computed from all rows, test rows included, and that information would leak into training. We want the test data left untouched so it can be used to evaluate our ML models.<br>Scaling is not needed for the dummy variables created while encoding categorical variables.
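To make the leakage point concrete, the sketch below compares the mean computed from train rows only with the mean computed from train and test rows together. The Age values are taken from the dataset, but the grouping into `ages_train` and `ages_test` is an illustrative split, not the one produced in this notebook.

```python
import numpy as np

ages_train = np.array([27.0, 40.0, 44.0, 50.0])
ages_test = np.array([30.0, 37.0])

train_mean = ages_train.mean()
all_mean = np.concatenate([ages_train, ages_test]).mean()

print(train_mean)  # statistic a scaler fitted on train rows would use
print(all_mean)    # statistic a scaler fitted before splitting would use
```

The two means differ, so a scaler fitted before the split would carry information about the test rows into the training features.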

### Standardization

$$x_{std} = \frac{x - \text{mean}(x)}{\text{std.dev}(x)}$$

Most data-point values end up between -3 and 3, although the range is not strictly bounded.

<b>Generally works well in all cases.</b>
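As a quick check of the standardization formula, read as (x - mean) / standard deviation, the sketch below applies it by hand to a few Age values and compares the result with `StandardScaler`. The names `ages`, `manual` and `scaled` are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

ages = np.array([[27.0], [38.0], [44.0], [50.0]])

# Manual version of the formula. np.std defaults to the population
# standard deviation (ddof=0), which is also what StandardScaler uses.
manual = (ages - ages.mean(axis=0)) / ages.std(axis=0)

scaled = StandardScaler().fit_transform(ages)

print(np.allclose(manual, scaled))
print(scaled.mean(), scaled.std())  # approximately 0 and 1
```

After standardization the column has a mean of about 0 and a standard deviation of about 1.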

### Normalization

$$x_{norm} = \frac{x - \min(x)}{\max(x) - \min(x)}$$

Results in data-point values between 0 and 1.

<b>Recommended when we have Normal Distribution in most of the features of the dataset.</b>
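This notebook only applies standardization, so the following is a side sketch rather than part of the pipeline. It applies the normalization formula, read as (x - min) / (max - min), by hand and compares it with scikit-learn's `MinMaxScaler`, which is assumed here as the matching library class.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

ages = np.array([[27.0], [38.0], [44.0], [50.0]])

# Manual version of the formula, column by column.
manual = (ages - ages.min(axis=0)) / (ages.max(axis=0) - ages.min(axis=0))

scaled = MinMaxScaler().fit_transform(ages)

print(np.allclose(manual, scaled))
print(scaled.min(), scaled.max())  # smallest value maps to 0, largest to about 1
```

As with `StandardScaler`, the minimum and maximum should be learned from the train set only and then reused on the test set.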

In [23]:
from sklearn.preprocessing import StandardScaler

In [24]:
sc = StandardScaler()

In [25]:
x_train[:,3:] = sc.fit_transform(x_train[:,3:])

# fit computes the mean and standard deviation of each column
# transform applies them to update the values

In [26]:
print(x_train)

[[0.0 0.0 1.0 -0.19159184384578545 -1.0781259408412425]
 [0.0 1.0 0.0 -0.014117293757057777 -0.07013167641635372]
 [1.0 0.0 0.0 0.566708506533324 0.633562432710455]
 [0.0 0.0 1.0 -0.30453019390224867 -0.30786617274297867]
 [0.0 0.0 1.0 -1.9018011447007988 -1.420463615551582]
 [1.0 0.0 0.0 1.1475343068237058 1.232653363453549]
 [0.0 1.0 0.0 1.4379472069688968 1.5749910381638885]
 [1.0 0.0 0.0 -0.7401495441200351 -0.5646194287757332]]


In [27]:
x_test[:,3:] = sc.transform(x_test[:,3:])
# Note: for the test set we only call transform, because the mean and standard deviation were already computed when fitting on the train set.
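The sketch below shows what `transform` does with the statistics learned from the train set. It uses the Age and Salary columns of the train and test rows printed earlier; `train_num`, `test_num`, `scaler` and `manual` are illustrative names, and the two imputed entries are written as the column means 349/9 and 574000/9.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Age and Salary of the eight train rows and two test rows printed earlier.
train_num = np.array([[349 / 9, 52000.0],
                      [40.0, 574000 / 9],
                      [44.0, 72000.0],
                      [38.0, 61000.0],
                      [27.0, 48000.0],
                      [48.0, 79000.0],
                      [50.0, 83000.0],
                      [35.0, 58000.0]])
test_num = np.array([[30.0, 54000.0],
                     [37.0, 67000.0]])

scaler = StandardScaler().fit(train_num)  # statistics come from train only
scaled_test = scaler.transform(test_num)

# transform is (value - train mean) / train std, column by column.
manual = (test_num - scaler.mean_) / scaler.scale_

print(scaler.mean_)
print(np.allclose(manual, scaled_test))
print(scaled_test.mean(axis=0))  # not zero: test rows are scaled with train statistics
```

Because the test rows are scaled with the train mean and standard deviation, their scaled columns do not need to average to zero.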

In [28]:
print(x_test)

[[0.0 1.0 0.0 -1.4661817944830124 -0.9069571034860727]
 [1.0 0.0 0.0 -0.44973664397484414 0.2056403393225306]]


# <center>The End</center>