## Data Preprocessing Tools

### Importing the libraries 

In [3]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

### Importing the dataset

In [4]:
dataset = pd.read_csv(r'C:\Users\jaker\OneDrive\Documents\Machine Learning\Machine Learning A-Z\Part 1\Section 2\Python\Data.csv')
# x is the matrix of features (the independent variables)
# it holds the first 3 of the 4 columns in Data.csv
# iloc selects by integer position: all rows (:) and all columns except the last one (:-1)
x = dataset.iloc[:, :-1].values
# y is the dependent variable vector
# all rows (:) and only the last column (-1)
y = dataset.iloc[:, -1].values

Covered so far:

- How to import a dataset
- How to create a matrix of features (independent variables)
- How to create the dependent variable vector

In [5]:
print(x)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]


In [6]:
print(y)

['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']
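
The `iloc` slicing used above can be tried on a small in-memory table. This is a minimal sketch: the three rows are copied from the printed output, but the column names (`Country`, `Age`, `Salary`, `Purchased`) are assumptions, since the header of Data.csv is not shown here.

```python
import pandas as pd

# Three rows copied from the printed output above; the column names are assumed for illustration.
sample = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44.0, 27.0, 30.0],
    "Salary": [72000.0, 48000.0, 54000.0],
    "Purchased": ["No", "Yes", "No"],
})

# All rows, every column except the last: the matrix of features.
features = sample.iloc[:, :-1].values
# All rows, only the last column: the dependent variable vector.
target = sample.iloc[:, -1].values

print(features.shape)  # 3 rows, 3 feature columns
print(target)
```

Because the columns mix text and numbers, `.values` returns an array of dtype `object`, which is why the country names and the numbers can sit side by side in `x`.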


### Handling missing data

Generally, we don't want missing data because it can cause errors when training the machine learning model.
We could simply delete the affected rows, but in a small dataset this discards a lot of information.
Here two of the ten rows contain a missing value, so deleting them would remove 20% of the data; a less destructive solution is required.
The cell below replaces each missing value with the mean of its column.

In [7]:
from sklearn.impute import SimpleImputer # import the SimpleImputer class
# create an instance of SimpleImputer; the values to replace are NumPy's np.nan
# the replacement strategy is to use the mean of each column in place of its NaNs
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
# the fit method computes the mean of each selected column
imputer.fit(x[:, 1:3])
# the transform method returns the columns with the NaNs replaced; we assign them back into x
x[:, 1:3] = imputer.transform(x[:, 1:3])

# why 1:3? This selects the desired columns. Column 0 (country) is omitted because the mean
# strategy only works on numerical data. 1 is the lower bound, and 3 is used because Python
# slices exclude the upper bound, so the imputer acts on the 2nd and 3rd columns (indices 1 and 2).

In [8]:
print(x)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
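
To see where the two filled-in values come from, the sketch below rebuilds the two numerical columns from the printed output and compares deleting rows with mean imputation. The column names `Age` and `Salary` are assumptions made for readability; only the numbers come from the output above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Numerical columns copied from the first print(x); the column names are assumed.
numeric = pd.DataFrame({
    "Age": [44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37],
    "Salary": [72000, 48000, 54000, 61000, np.nan, 58000, 52000, 79000, 83000, 67000],
})

# How many values are missing in each column, and how many rows survive a deletion.
missing_per_column = numeric.isna().sum()
rows_after_drop = len(numeric.dropna())

# pandas skips NaN when averaging, so these are the means of the known values.
column_means = numeric.mean()

# SimpleImputer learns the same means in fit and writes them into the gaps in transform.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputed = imputer.fit_transform(numeric)

print(missing_per_column)
print(rows_after_drop)
print(column_means)
print(imputed[6, 0], imputed[4, 1])  # the two values that were NaN
```

Dropping rows would leave 8 of the 10 observations, whereas imputation keeps all 10 and fills the gaps with 349/9 for age and 574000/9 for salary, the values shown in the output above.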


### Encoding categorical data 

We need to turn the categories into numbers, but simply mapping e.g. France to 0, Germany to 1 and Spain to 2
is not enough: a model could interpret those numbers as an order between the countries that does not exist.
Instead, we use one-hot encoding, which gives each category its own binary column.

Independent variable

In [9]:
from sklearn.compose import ColumnTransformer # import the ColumnTransformer class
from sklearn.preprocessing import OneHotEncoder # import the OneHotEncoder class
# the transformers argument specifies what kind of transformation
# we want and which columns to apply it to:
# a name for the step ('encoder'),
# the transformer to use (one-hot encoding),
# and the columns it is applied to (the first column, index 0)
# remainder='passthrough' keeps the other columns unchanged instead of dropping them
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
# apply all of the above to the feature matrix 'x'
x = np.array(ct.fit_transform(x))

In [10]:
print(x)

[[1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 35.0 58000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 37.0 67000.0]]
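
The sketch below repeats the encoding on the first three rows only, to show which output column belongs to which country. It is an illustration under the same settings as the cell above; the variable names `rows` and `encoded` are made up for this example.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# First three rows of the feature matrix, copied from the output above.
rows = np.array([
    ["France", 44.0, 72000.0],
    ["Spain", 27.0, 48000.0],
    ["Germany", 30.0, 54000.0],
], dtype=object)

ct = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), [0])],
    remainder="passthrough",
)
encoded = np.array(ct.fit_transform(rows))

# The fitted encoder stores its categories in sorted order;
# the one-hot columns follow that order.
categories = list(ct.named_transformers_["encoder"].categories_[0])

print(categories)
print(encoded)
```

The categories are sorted alphabetically (France, Germany, Spain), so the first three columns of the result mean "is France", "is Germany" and "is Spain". The encoded columns come first and the passed-through columns (age, salary) follow, which is why the country column appears to have moved to the front and widened to three columns.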


Dependent variable

In [11]:
from sklearn.preprocessing import LabelEncoder #import class labelencoder
le = LabelEncoder() #create instance of labelencoder class 
y = le.fit_transform(y) #apply it to dependent variable vector

In [12]:
print(y)

[0 1 0 0 1 1 0 1 0 1]
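
To check which label became which number, the fitted encoder can be inspected. A minimal sketch using the ten labels printed earlier:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# The dependent variable vector as printed before encoding.
labels = np.array(["No", "Yes", "No", "No", "Yes", "Yes", "No", "Yes", "No", "Yes"])

le = LabelEncoder()
encoded_labels = le.fit_transform(labels)

# classes_ lists the original labels in sorted order; the position is the code.
print(le.classes_)
# inverse_transform maps the codes back to the original labels.
print(le.inverse_transform(encoded_labels[:3]))
```

`classes_` is sorted, so `'No'` is encoded as 0 and `'Yes'` as 1. With only two classes a single 0/1 column carries no misleading order, which is why the dependent variable does not need one-hot encoding here.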


### Splitting the data into training and test sets

The four arrays we need are:

- The feature matrix of the training set (X_train)
- The dependent variable vector of the training set (y_train)
- The feature matrix of the test set (X_test)
- The dependent variable vector of the test set (y_test)

In [13]:
from sklearn.model_selection import train_test_split
# create these four variables by performing train_test_split on x and y, with a ratio of 80:20
# random_state=1 fixes the shuffle so the same split is produced on every run
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1)

In [14]:
print(X_train)

[[0.0 0.0 1.0 38.77777777777778 52000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 35.0 58000.0]]


In [15]:
print(X_test)

[[0.0 1.0 0.0 30.0 54000.0]
 [1.0 0.0 0.0 37.0 67000.0]]


In [16]:
print(y_train)

[0 1 0 0 1 1 0 1]


In [17]:
print(y_test)

[0 1]
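
The behaviour of `train_test_split` can be checked on a small synthetic array. In this sketch the arrays `demo_features` and `demo_labels` are invented so that each row can be traced back to its label; they are not part of the dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten synthetic rows: row i is [2*i, 2*i + 1] and has label i.
demo_features = np.arange(20).reshape(10, 2)
demo_labels = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(
    demo_features, demo_labels, test_size=0.2, random_state=1
)
# Running the split again with the same random_state gives the same result.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    demo_features, demo_labels, test_size=0.2, random_state=1
)

print(X_tr.shape, X_te.shape)
print(np.array_equal(X_te, X_te2))
# Each feature row is still paired with its own label after shuffling.
print(np.array_equal(X_te[:, 0] // 2, y_te))
```

With ten rows and `test_size=0.2`, eight rows go to the training set and two to the test set, matching the shapes printed above. The rows are shuffled before splitting, but features and labels are shuffled together, so each row keeps its label.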


### Feature scaling

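In some machine learning models, features with a much larger numeric range (here salary, in the tens of thousands,
versus age, in the tens) can overshadow the others to the point where those are barely considered by the model.
Feature scaling puts the features on a comparable scale to resolve this issue.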

In [18]:
from sklearn.preprocessing import StandardScaler # import the StandardScaler class
sc = StandardScaler() # create an instance of it
# fit_transform learns the mean and standard deviation of each training column, then scales the training data
# all rows, columns from index 3 onward (skipping 0, 1, 2, which are one-hot encoded)
X_train[:, 3:] = sc.fit_transform(X_train[:, 3:])
# the test data is only transformed, using the mean and standard deviation learned from the training data
X_test[:, 3:] = sc.transform(X_test[:, 3:])

In [19]:
print(X_train)

[[0.0 0.0 1.0 -0.19159184384578545 -1.0781259408412425]
 [0.0 1.0 0.0 -0.014117293757057777 -0.07013167641635372]
 [1.0 0.0 0.0 0.566708506533324 0.633562432710455]
 [0.0 0.0 1.0 -0.30453019390224867 -0.30786617274297867]
 [0.0 0.0 1.0 -1.9018011447007988 -1.420463615551582]
 [1.0 0.0 0.0 1.1475343068237058 1.232653363453549]
 [0.0 1.0 0.0 1.4379472069688968 1.5749910381638885]
 [1.0 0.0 0.0 -0.7401495441200351 -0.5646194287757332]]


In [20]:
print(X_test)

[[0.0 1.0 0.0 -1.4661817944830124 -0.9069571034860727]
 [1.0 0.0 0.0 -0.44973664397484414 0.2056403393225306]]
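
Standardisation replaces each value x with (x - mean) / standard deviation, where both statistics are computed on the training set. The sketch below reproduces the scaled age column by hand, using the unscaled ages printed for the training and test sets earlier; the variable names are chosen for this illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Age values copied from the printed training and test sets (before scaling).
train_age = np.array([38.77777777777778, 40.0, 44.0, 38.0, 27.0, 48.0, 50.0, 35.0])
test_age = np.array([30.0, 37.0])

# By hand: subtract the training mean, divide by the training standard deviation.
mean = train_age.mean()
std = train_age.std()  # NumPy's default is the population standard deviation (ddof=0)
manual_train = (train_age - mean) / std
manual_test = (test_age - mean) / std  # the test set reuses the training statistics

# The same computation with StandardScaler, which expects a 2-D array.
scaler = StandardScaler()
sk_train = scaler.fit_transform(train_age.reshape(-1, 1)).ravel()
sk_test = scaler.transform(test_age.reshape(-1, 1)).ravel()

print(np.allclose(manual_train, sk_train), np.allclose(manual_test, sk_test))
print(sk_test)
```

The scaled test ages agree with the fourth column of the final `print(X_test)`. Because the test values are scaled with the training mean and standard deviation, the test column does not itself have mean 0; that is expected, and it keeps information about the test set out of the fitted scaler.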
