# Data Preprocessing Tools

## Importing the libraries

In [2]:
import pandas as pd # loads and manipulates tabular data
import numpy as np # provides n-dimensional arrays
import matplotlib.pyplot as plt

## Importing the dataset

In [3]:
# create a pandas DataFrame
dataset = pd.read_csv("Data.csv") # by default, the first row is used as the header

# create the matrix of features and the dependent variable vector
'''
Features are the independent variables used to predict an outcome (the dependent variable).
In the given dataset:
country, age and salary are the features.
purchases is the dependent variable (it depends on the features).

x is the matrix of features = first 3 columns
y is the dependent variable vector = last column
=> just slice the dataframe and extract the values.
iloc = integer-location based indexing: [rows, columns].
'''

x = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values
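
To see what the two `iloc` slices pick out, here is a minimal sketch on a tiny hand-made dataframe. The frame `toy` and its column names are made up for illustration and are not read from Data.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the dataset, used only for illustration.
toy = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44.0, 27.0, np.nan],
    "Salary": [72000.0, 48000.0, 54000.0],
    "Purchased": ["No", "Yes", "No"],
})

# All rows, every column except the last one -> matrix of features.
features = toy.iloc[:, :-1].values
# All rows, only the last column -> dependent variable vector.
target = toy.iloc[:, -1].values

print(features.shape)  # (3, 3)
print(target.shape)    # (3,)
print(features.dtype)  # object, because strings and floats are mixed
```

The object dtype is why the printed matrix shows quoted country names next to plain numbers.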


In [4]:
print(x)
print()
print(y)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]

['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']


## Taking care of missing data

In [5]:
'''
Various approaches:
1. If very little data is missing, just delete those rows.
2. Replace missing values with the mean of that column.
3. Replace missing values with the median of that column.
4. Replace missing values with the most frequent value in that column.

We will use scikit-learn's SimpleImputer for this.
'''

from sklearn.impute import SimpleImputer

# missing values are denoted by nan; replace each with the mean of its column
imputer = SimpleImputer(missing_values = np.nan, strategy = 'mean')  

# fit computes the statistic (here, the mean) of each column it is given.
'''
With the 'mean' strategy, fit expects numerical data only, so leave out the country column.
In our case only columns 2 and 3 (age and salary) have missing values, and they are already numerical.
'''
imputer.fit(x[:,1:3])
# transform replaces the missing entries with the values computed by fit.
x[:,1:3] = imputer.transform(x[:,1:3])


In [6]:
print(x)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
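
The other replacement approaches listed above map onto the `strategy` argument of `SimpleImputer`. A small sketch comparing them on a made-up age column (the values in `ages` are an assumption, chosen so that one outlier pulls the mean away from the median):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical numeric column with one missing entry and one outlier (60).
ages = np.array([[20.0], [22.0], [22.0], [np.nan], [60.0]])

filled = {}
for strategy in ("mean", "median", "most_frequent"):
    imputer = SimpleImputer(missing_values=np.nan, strategy=strategy)
    # Row 3 holds the missing value; record what each strategy puts there.
    filled[strategy] = float(imputer.fit_transform(ages)[3, 0])

print(filled)  # {'mean': 31.0, 'median': 22.0, 'most_frequent': 22.0}
```

With an outlier in the column, the median and most frequent value stay close to the typical age while the mean drifts upwards.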


## Encoding categorical data


The goal here is to encode the categorical data. There are only 3 countries, so we could label France = 0, Spain = 1 and Germany = 2. But an ML model might interpret these numbers as an ordering or hierarchy between the countries and give wrong results.
This is where one-hot encoding comes in. Each country becomes its own boolean column: does the row belong to France or not, and so on.
As a result, the single country column is split into 3 columns, and every country gets a unique binary vector. OneHotEncoder orders the new columns alphabetically (France, Germany, Spain).

For example, France,44,72000 becomes:

    France  Germany  Spain  Age  Salary
    1       0        0      44   72000
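
A minimal sketch of the encoding on its own, to check which column belongs to which country. The `countries` array is made up for illustration; `categories_` is the fitted attribute that lists the column order.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical single-column input, one country per row.
countries = np.array([["France"], ["Spain"], ["Germany"], ["Spain"]])

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default; toarray() makes it dense.
one_hot = encoder.fit_transform(countries).toarray()

print(encoder.categories_[0])  # column order of the encoded output
print(one_hot)
```

Each row contains exactly one 1, in the column of its country.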


### Encoding the Independent Variable

In [8]:
from sklearn.compose import ColumnTransformer 
from sklearn.preprocessing import OneHotEncoder
# transformers = [(name, transformer object, indices of the columns to transform)]
# If remainder isn't given, it defaults to 'drop' and the remaining columns will not appear in the final result.
# We want the other columns of the matrix to be left as they are, so we pass remainder='passthrough'.
ct = ColumnTransformer(transformers=[('encoder',OneHotEncoder(),[0])] , remainder ='passthrough')

# fit and transform
# convert the result to a NumPy array.
x= np.array(ct.fit_transform(x))

In [9]:
print(x)

[[1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 35.0 58000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 37.0 67000.0]]
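
To see what `remainder` changes, here is a small sketch that runs the same transformer with and without it. The two-row `rows` array and the `to_dense` helper are made up for illustration.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical two-row matrix of features: country, age, salary.
rows = np.array([["France", 44.0, 72000.0],
                 ["Spain", 27.0, 48000.0]], dtype=object)

def to_dense(matrix):
    # Helper for this sketch: ColumnTransformer may return a sparse matrix.
    return matrix.toarray() if hasattr(matrix, "toarray") else np.asarray(matrix)

# Default remainder='drop': only the encoded country columns survive.
dropped = ColumnTransformer(transformers=[("encoder", OneHotEncoder(), [0])])
# remainder='passthrough': age and salary are appended unchanged.
kept = ColumnTransformer(transformers=[("encoder", OneHotEncoder(), [0])],
                         remainder="passthrough")

dropped_out = to_dense(dropped.fit_transform(rows))
kept_out = to_dense(kept.fit_transform(rows))

print(dropped_out.shape)  # (2, 2): two country columns only
print(kept_out.shape)     # (2, 4): two country columns + age + salary
```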


### Encoding the Dependent Variable

In [10]:
# convert 'No' and 'Yes' to 0 and 1 (classes are numbered in sorted order)
from sklearn.preprocessing import LabelEncoder 
le = LabelEncoder()
y = le.fit_transform(y)

In [11]:
print(y)

[0 1 0 0 1 1 0 1 0 1]
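
To check which label got which integer, and to map predictions back to the original labels later, the fitted encoder can be inspected. A minimal sketch on made-up labels:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels, same two classes as the dependent variable.
labels = np.array(["No", "Yes", "No", "Yes"])

le = LabelEncoder()
encoded = le.fit_transform(labels)

# The position of a class in classes_ is the integer it is encoded as.
print(le.classes_)
print(encoded)  # [0 1 0 1]
# inverse_transform maps the integers back to the original labels.
decoded = le.inverse_transform(encoded)
print(decoded)
```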


## Splitting the dataset into the Training set and Test set

In [12]:
from sklearn.model_selection import train_test_split
# train_test_split(matrix of features, dependent variable, test_size=fraction of rows for the test set, random_state=an int)
# random_state sets the seed for the random shuffling of rows into the train and test sets.
# If you keep it constant, you get the same train/test split every time.

x_train, x_test, y_train, y_test = train_test_split(x,y,test_size = 0.2, random_state = 1)

print(x_train)
print()
print(x_test)
print()
print(y_train)
print()
print(y_test)

[[0.0 0.0 1.0 38.77777777777778 52000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 35.0 58000.0]]

[[0.0 1.0 0.0 30.0 54000.0]
 [1.0 0.0 0.0 37.0 67000.0]]

[0 1 0 0 1 1 0 1]

[0 1]
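
A quick sketch to confirm the effect of `test_size` and `random_state`. The arrays `features` and `target` are made-up stand-ins with 10 rows, like the dataset above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 rows, 2 feature columns.
features = np.arange(20).reshape(10, 2)
target = np.arange(10)

# Same seed twice -> the two splits should be identical.
split_a = train_test_split(features, target, test_size=0.2, random_state=1)
split_b = train_test_split(features, target, test_size=0.2, random_state=1)

x_train, x_test, y_train, y_test = split_a
print(x_train.shape, x_test.shape)  # (8, 2) (2, 2)

same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same)  # True
```

With 10 rows and `test_size = 0.2`, 2 rows go to the test set and 8 to the training set.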


## Feature Scaling

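Feature scaling is done to ensure that some variables don't dominate others.
For example, in a multiple linear regression model: y = b0 + b1x1 + b2x2 + b3x3 + ... + bnxn
If x3 takes much larger values than the other variables (for example, salary in the tens of thousands versus age in the tens), it can dominate the result even though no variable is meant to matter more than another. A linear regression can compensate through the coefficient b3, but many models, such as those based on distances between points, cannot, so we bring all the features onto a comparable scale.

There are two kinds of feature scaling: normalisation and standardisation.

Standardisation rescales each feature to have mean 0 and standard deviation 1, so values typically fall between about -3 and +3. Normalisation rescales each feature to the range [0, 1].

Xstand = [x - mean(x)] / standard_deviation(x)

Xnorm = [x - min(x)] / [max(x) - min(x)]

Standardisation is a reasonable default that works in most cases. Normalisation depends on min(x) and max(x), so a single outlier can squeeze the remaining values into a narrow band.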
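
The two formulas can be checked by hand with NumPy and compared against scikit-learn. A minimal sketch, assuming a made-up salary column; `MinMaxScaler` is scikit-learn's normalisation class and is not used elsewhere in this notebook.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical single feature column.
salaries = np.array([[48000.0], [54000.0], [61000.0], [72000.0], [83000.0]])

# Xstand = [x - mean(x)] / standard_deviation(x)
standardised = (salaries - salaries.mean(axis=0)) / salaries.std(axis=0)
# Xnorm = [x - min(x)] / [max(x) - min(x)]
normalised = (salaries - salaries.min(axis=0)) / (salaries.max(axis=0) - salaries.min(axis=0))

# Both manual results should agree with the scikit-learn scalers.
matches_standard = np.allclose(standardised, StandardScaler().fit_transform(salaries))
matches_minmax = np.allclose(normalised, MinMaxScaler().fit_transform(salaries))
print(matches_standard, matches_minmax)  # True True

print(normalised.min(), normalised.max())  # 0.0 1.0
```

Note that `StandardScaler` divides by the population standard deviation, which is also what `np.std` computes by default.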

In [13]:
from sklearn.preprocessing import StandardScaler 

sc = StandardScaler()
'''
Don't standardise the one-hot encoded columns. They're dummy variables that denote a country.
'''
x_train[:,3:]= sc.fit_transform(x_train[:,3:])

# We do not compute new scaling parameters from the test set. We scale it with the mean and standard deviation learned from the training set,
# so don't call fit on the test set; only transform.

x_test[:,3:] = sc.transform(x_test[:,3:])


In [14]:
print(x_train)
print()
print(x_test)

[[0.0 0.0 1.0 -0.19159184384578545 -1.0781259408412425]
 [0.0 1.0 0.0 -0.014117293757057777 -0.07013167641635372]
 [1.0 0.0 0.0 0.566708506533324 0.633562432710455]
 [0.0 0.0 1.0 -0.30453019390224867 -0.30786617274297867]
 [0.0 0.0 1.0 -1.9018011447007988 -1.420463615551582]
 [1.0 0.0 0.0 1.1475343068237058 1.232653363453549]
 [0.0 1.0 0.0 1.4379472069688968 1.5749910381638885]
 [1.0 0.0 0.0 -0.7401495441200351 -0.5646194287757332]]

[[0.0 1.0 0.0 -1.4661817944830124 -0.9069571034860727]
 [1.0 0.0 0.0 -0.44973664397484414 0.2056403393225306]]
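
To confirm that `transform` reuses the statistics learned from the training set, the fitted scaler exposes them as `mean_` and `scale_`. A small sketch on made-up age columns (`train` and `test` are assumptions for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical age columns for the training and test sets.
train = np.array([[27.0], [35.0], [44.0], [50.0]])
test = np.array([[30.0], [37.0]])

sc = StandardScaler()
train_scaled = sc.fit_transform(train)  # learns mean and std from train only
test_scaled = sc.transform(test)        # reuses the training statistics

# Reproduce the test scaling by hand from the stored training statistics.
manual = (test - sc.mean_) / sc.scale_
uses_train_stats = np.allclose(test_scaled, manual)

print(sc.mean_)           # [39.]
print(uses_train_stats)   # True
```

The scaled test values do not have mean 0 themselves, which is expected: they are expressed on the scale of the training set.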
