# Getting our data ready for ML via sklearn

Yeah, I know, it feels kind of difficult... <br>
![](https://media1.tenor.com/images/7ecb7f303712e4e24350d5b5ad9689fa/tenor.gif?itemid=5436040)



## Standard imports

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

### Our objectives 
#### 1. Split data into features (x usually) and labels (y usually)
#### 2. Filling (also called "Imputing") or disregarding missing values
#### 3. Converting non-numerical values to numerical values (a.k.a Feature encoding)

In [3]:
# Let's begin with the heart disease dataset
hd = pd.read_csv("https://raw.githubusercontent.com/ineelhere/Machine-Learning-and-Data-Science/master/scikit-learn/heart-disease.csv")
hd.head(3)

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1


In [None]:
#length/size of the dataset
len(hd)

In [None]:
# get the feature (x) ready = all columns except target
x = hd.drop("target", axis=1)
x.head(3)

In [None]:
# get the labels (y) ready = target column
y = hd.target
y.head(3)

# Always Remember! <br>
## NEVER EVALUATE or TEST your model on the DATA it has LEARNT FROM. That's why we split the data into training and test sets.
(because if you do, it is like cheating in an examination)  <br>
![](https://media.giphy.com/media/A4CLbWb9o5rj2/giphy.gif)
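To see why this matters, here is a small sketch on made-up data (not one of the datasets in this notebook). The labels are pure noise, so there is nothing real to learn. The names `X_demo`, `y_demo` and `demo_model` are only placeholders for this illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for illustration): the labels are random noise,
# so there is no real pattern for the model to find.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 5))
y_demo = rng.normal(size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

demo_model = RandomForestRegressor(n_estimators=50, random_state=42)
demo_model.fit(X_tr, y_tr)

train_score = demo_model.score(X_tr, y_tr)  # scored on rows the model has already seen
test_score = demo_model.score(X_te, y_te)   # scored on rows the model has never seen
print(f"train score: {train_score:.2f}, test score: {test_score:.2f}")
```

The training score should look far better than the test score, even though the model has learnt nothing useful. Only the test score tells you how the model behaves on new data.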

### 1. Splitting the data into test and training sets
Imagine it this way
* Test data = Final exam
* Training data = Mock exam (So we use the train_test_split first!)

In [None]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size=0.2) 
# test_size=0.2 means that we want our test dataset to be 20% of the overall data

In [None]:
#check the shapes of the new matrices just created. 
x_train.shape, x_test.shape, y_train.shape, y_test.shape

Just notice,  <br>
242 = 80% of 303 <br>
61 = 20% of 303 <br>
So, training set is 80% of overall data and test set is 20% of the overall data  <br>
(I have mentioned this above too!)
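A quick way to check those numbers without loading anything is to split a stand-in array with 303 rows. This sketch also passes `random_state`, an optional `train_test_split` parameter not used in the cells above, so the shuffle is repeatable. The array names are placeholders.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in arrays with 303 rows, the same length as the heart disease dataset.
X_demo = np.arange(303 * 2).reshape(303, 2)
y_demo = np.arange(303)

# Passing the same random_state twice gives the same shuffle twice.
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

X_tr, X_te, y_tr, y_te = split_a
print(X_tr.shape, X_te.shape)  # 242 training rows and 61 test rows

same_rows = np.array_equal(X_te, split_b[1])
print(same_rows)
```

Without `random_state`, every run picks different rows for the test set, so scores will wobble a little from run to run.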

### 2. Filling (also called "Imputing") or disregarding missing values
### Make sure the data is in numerical format/values - if not, convert it!

In [None]:
#let us use a new dataset now. So first we will get the dataset
cse = pd.read_csv("https://raw.githubusercontent.com/ineelhere/Machine-Learning-and-Data-Science/master/scikit-learn/car-sales-extended.csv")
cse.head()
#ummm... I mean car-sales-extended by cse. (I'm lazy!)

In [None]:
len(cse)

In [None]:
cse.dtypes

In [None]:
# prepare the feature and label
x = cse.drop("Price", axis=1)
y = cse.Price

In [None]:
# Splitting the data into test and training sets (Practice time!)
from sklearn.model_selection import train_test_split # No need to do it again, it's already done above. But still... :P
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size=0.2) 

In [None]:
x_train.shape, x_test.shape, y_train.shape, y_test.shape # .shape gives (rows, columns); .size would only give the total number of cells

In [None]:
# build ML model
from sklearn.ensemble import RandomForestRegressor

This random forest regressor works the same way as the random forest classifier, <br>
but this time it predicts a number, which is what we're trying to do. <br>
We're trying to predict the price of a car given some attributes about it. <br>

#### Now, our ML algorithm cannot deal with strings! So we need to take care of that --

### 3. Converting categorical values (strings) into numerical values

In [None]:
from sklearn.preprocessing import OneHotEncoder 
from sklearn.compose import ColumnTransformer
categorical_features = ["Make", "Colour", "Doors"]
one_hot = OneHotEncoder() #instantiate the one hot encoder (https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html)
transformer = ColumnTransformer([("one_hot",
                                   one_hot,
                                   categorical_features)],
                                   remainder="passthrough") 
# 'transformer' is a ColumnTransformer. We ask it to apply the one-hot encoder
# to the categorical features and, for the remaining columns,
# to just pass them through without changing anything.
transformed_x = transformer.fit_transform(x) #fit the above to our data = x
transformed_x

In [None]:
pd.DataFrame(transformed_x) #changing the above into a dataframe

### The concept of "One hot encoding in python"
![](https://cdn-images-1.medium.com/fit/t/1600/480/1*ggtP4a5YaRx6l09KQaYOnw.png)
"<p>One Hot Encoding is a common way of preprocessing categorical features for machine learning models. This type of encoding creates a new binary feature for each possible category and assigns a value of 1 to the feature of each sample that corresponds to its original category. It's easier to understand visually: in the example below, we One Hot Encode a <em>color</em> feature which consists of three categories (<em>red</em>, <em>green</em>, and <em>blue</em>).</p>
<p>Sci-kit Learn offers the <code>OneHotEncoder</code> class out of the box to handle categorical inputs using One Hot Encoding. Simply create an instance of <code>sklearn.preprocessing.OneHotEncoder</code> then fit the encoder on the input data (this is where the One Hot Encoder identifies the possible categories in the DataFrame and updates some internal state, allowing it to map each category to a unique binary feature), and finally, call <code>one_hot_encoder.transform()</code> to One Hot Encode the input DataFrame. The great thing about the <code>OneHotEncoder</code> class is that, once it has been fit on the input features, you can continue to pass it new samples, and it will encode the categorical features consistently.</p>"

Source of above information - https://towardsdatascience.com/building-a-one-hot-encoding-layer-with-tensorflow-f907d686bf39
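The red/green/blue idea from the quote can be tried on its own. This is a small sketch with a made-up `colours` array, separate from the car sales data.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# One column with three categories (green appears twice).
colours = np.array([["red"], ["green"], ["blue"], ["green"]])

encoder = OneHotEncoder()
encoder.fit(colours)         # the encoder learns which categories exist
print(encoder.categories_)   # categories are stored in sorted order

# transform() returns a sparse matrix by default; toarray() makes it dense.
encoded_colours = encoder.transform(colours).toarray()
print(encoded_colours)

# A fitted encoder encodes new samples with the same column layout.
new_sample = encoder.transform(np.array([["blue"]])).toarray()
print(new_sample)
```

Every row has exactly one 1, placed in the column that belongs to its category.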

### Now, there is another way to do this! - Using pandas!

In [None]:
dummies = pd.get_dummies(cse[["Make", "Colour", "Doors" ]])
dummies # Doors is numeric, so get_dummies leaves it as one plain column instead of one-hot encoding it like OneHotEncoder did above
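If you want pandas to encode `Doors` too, there are at least two options. The sketch below uses a tiny made-up frame (`doors_demo`) and is not the only way to do it.

```python
import pandas as pd

doors_demo = pd.DataFrame({
    "Make": ["Honda", "BMW", "Honda"],
    "Doors": [4, 5, 4],
})

# Default behaviour: only text/categorical columns are encoded,
# so Doors stays a single numeric column.
default_dummies = pd.get_dummies(doors_demo)
print(default_dummies.columns.tolist())

# Option 1: cast Doors to strings first.
cast_dummies = pd.get_dummies(doors_demo.astype({"Doors": str}))
print(cast_dummies.columns.tolist())

# Option 2: name the columns to encode explicitly.
explicit_dummies = pd.get_dummies(doors_demo, columns=["Make", "Doors"])
print(explicit_dummies.columns.tolist())
```

Both options give one column per door count, which matches what `OneHotEncoder` produced for `Doors`.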

### Now let's refit the model

In [None]:
model = RandomForestRegressor() #create a model
np.random.seed(42)
x_train, x_test,  y_train, y_test = train_test_split(transformed_x,
                                                    y,
                                                    test_size=0.2) #remember to keep the orders same (funny, I mess it up all the time!)
model.fit(x_train, y_train) #train a model

In [None]:
model.score(x_test, y_test) #run/evaluate a model
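For a regressor, `score()` returns the coefficient of determination (R^2): 1.0 is a perfect fit, and 0.0 is no better than always predicting the mean. Here is a small worked example with made-up numbers, compared against `sklearn.metrics.r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up true values and predictions.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 8.0])

ss_res = np.sum((y_true - y_pred) ** 2)         # squared error of the predictions
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # squared error of always guessing the mean
manual_r2 = 1 - ss_res / ss_tot

print(manual_r2, r2_score(y_true, y_pred))      # both are 0.925 (up to rounding)
```

Here the squared errors add up to 1.5, the spread around the mean is 20, so R^2 = 1 - 1.5/20 = 0.925.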

![](https://media1.tenor.com/images/50e1f1cdd0df8e1bd6a7feab86ca8ac8/tenor.gif?itemid=8852977)


# Handling Missing Values With Pandas

* Imputation - Fill the missing values with 'some' values. 
* Remove the samples with missing data altogether.

In [None]:
#import a new dataset
csm = pd.read_csv("https://raw.githubusercontent.com/ineelhere/Machine-Learning-and-Data-Science/master/scikit-learn/car-sales-missing-data.csv")
csm 

In [None]:
csm.isna()

In [None]:
csm.isna().sum() # compact summary: number of missing cells in each column

In [None]:
#same thing with a larger dataset
csme = pd.read_csv("https://raw.githubusercontent.com/ineelhere/Machine-Learning-and-Data-Science/master/scikit-learn/car-sales-extended-missing-data.csv")
csme

In [None]:
csme.isna().sum()

![](https://media1.tenor.com/images/bbb5ed42721b7a95d4507fe68caba984/tenor.gif?itemid=5195597)
#### Convert our data to numbers | categories to numbers (just as before). 
##### But before that, remove the "NaN" from the dataframe!

In [None]:
# option-1 use pandas to fill the missing values

# (we assign the result back instead of using inplace=True on a single column,
# which newer pandas versions warn about and may silently ignore)

# fill the 'Make' column
csme["Make"] = csme["Make"].fillna("missing")

# fill the 'Colour' column
csme["Colour"] = csme["Colour"].fillna("missing")

# fill the 'Odometer (KM)' column
csme["Odometer (KM)"] = csme["Odometer (KM)"].fillna(csme["Odometer (KM)"].mean()) # replace with the mean

# fill the 'Doors' column
csme["Doors"] = csme["Doors"].fillna(4) # as most cars have 4 doors

#lets see what we end up with
csme.isna().sum()

In [None]:
# ok, now we need to think about the price!
# lets see what is the mean price here - would it be logical to use the mean price in missing sections?
csme.Price.mean()

In [None]:
# nahhh that doesn't look ideal to me!
# So I will simply remove the rows that have a missing price.
# it's just my choice, yours might be different - cheers!
csme.dropna(inplace=True)
csme.isna().sum()
# so we lose 50 rows now!
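A plain `dropna()` removes every row that has a missing value in any column. That works here because the other columns were already filled. If they had not been, `dropna(subset=[...])` is one way to drop only the rows with a missing label. A small sketch on made-up data:

```python
import numpy as np
import pandas as pd

price_demo = pd.DataFrame({
    "Make": ["Honda", None, "Toyota", "BMW"],
    "Price": [15000.0, 12000.0, np.nan, 22000.0],
})

# Drop only the rows where the label (Price) is missing.
labelled = price_demo.dropna(subset=["Price"])

print(len(price_demo), "->", len(labelled))  # 4 -> 3
print(labelled.isna().sum())                 # Make still has 1 missing value
```

The row with a missing `Make` survives, so it can still be filled later.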

In [None]:
# prepare the feature and label
x = csme.drop("Price", axis=1)
y = csme.Price

In [None]:

from sklearn.preprocessing import OneHotEncoder 
from sklearn.compose import ColumnTransformer
categorical_features = ["Make", "Colour", "Doors"]
one_hot = OneHotEncoder() #instantiate the one hot encoder (https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html)
transformer = ColumnTransformer([("one_hot",
                                   one_hot,
                                   categorical_features)],
                                   remainder="passthrough") 
# 'transformer' is a ColumnTransformer. We ask it to apply the one-hot encoder
# to the categorical features and, for the remaining columns,
# to just pass them through without changing anything.

# CAREFUL HERE: fit on the features (x), not on the whole dataset
transformed_x = transformer.fit_transform(x) # csme still contains the Price column, so fitting on csme would leak the label into the features
transformed_x

![](https://media1.tenor.com/images/cb84e3e676def7599aec1fca4ed0e461/tenor.gif?itemid=5610528)

<p>Once your data is all in numerical format, there's one more transformation you'll probably want to do to it.</p>
<p>It's called&nbsp;<strong>Feature Scaling</strong>.</p>
<p>In other words, making sure all of your numerical data is on the same scale.</p>
<p>For example, say you were trying to predict the sale price of cars and the number of kilometres on their odometers varies from 6,000 to 345,000 but the median previous repair cost varies from 100 to 1,700. A machine learning algorithm may have trouble finding patterns in these wide-ranging variables.</p>
<p>To fix this, there are two main types of feature scaling.</p>
<ul>
<li>
<p><strong>Normalization&nbsp;</strong>(also called min-max scaling) - This rescales all the numerical values to between 0 and 1, with the lowest value becoming 0 and the highest value becoming 1. Scikit-Learn provides functionality for this in the&nbsp;<a href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html" rel="noopener noreferrer">MinMaxScaler class</a>.</p>
</li>
<li>
<p><strong>Standardization</strong>&nbsp;- This subtracts the mean value from each feature (so the resulting features have 0 mean). It then scales the features to unit variance (by dividing each feature by its standard deviation). Scikit-Learn provides functionality for this in the&nbsp;<a href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html" rel="noopener noreferrer">StandardScaler class</a>.</p>
</li>
</ul>
<p>A couple of things to note.</p>
<ul>
<li>
<p>Feature scaling usually isn't required for your target variable.</p>
</li>
<li>
<p>Feature scaling is usually not required with tree-based models (e.g. Random Forest) since they split on one feature at a time using thresholds, so the scale of a feature does not change the result.</p>
</li>
</ul>
<p><strong>Extra reading</strong></p>
<p>For further information on this topic, I'd suggest the following resources.</p>
<ul>
<li>
<p><a href="https://medium.com/@rahul77349/feature-scaling-why-it-is-required-8a93df1af310" rel="noopener noreferrer">Feature Scaling - why is it required?</a>&nbsp;by Rahul Saini</p>
</li>
<li>
<p><a href="https://benalexkeen.com/feature-scaling-with-scikit-learn/" rel="noopener noreferrer">Feature Scaling with Scikit-Learn</a>&nbsp;by Ben Alex Keen</p>
</li>
<li>
<p><a href="https://www.analyticsvidhya.com/blog/2020/04/feature-scaling-machine-learning-normalization-standardization/" rel="noopener noreferrer">Feature Scaling for Machine Learning: Understanding the Difference Between Normalization vs. Standardization</a>&nbsp;by Aniruddha Bhandari</p>
</li>
</ul>
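
To make the two kinds of feature scaling concrete, here is a minimal sketch using `MinMaxScaler` and `StandardScaler` on made-up numbers. The odometer and repair cost values are invented and only echo the ranges from the example above.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up odometer readings and repair costs on very different scales.
features = np.array([
    [6000.0, 100.0],
    [120000.0, 900.0],
    [345000.0, 1700.0],
])

normalized = MinMaxScaler().fit_transform(features)
standardized = StandardScaler().fit_transform(features)

# Normalization: each column now spans 0 to 1.
print(normalized.min(axis=0), normalized.max(axis=0))

# Standardization: each column now has mean ~0 and standard deviation ~1.
print(standardized.mean(axis=0), standardized.std(axis=0))
```

In a real project the scaler should be fitted on the training set only and then applied to the test set.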

# Handling Missing Values With Scikit-Learn

<p>Safety tips! :p</p>
<ul>
<li>
<p>Split your data first (into train/test), always keep your training &amp; test data separate</p>
</li>
<li>
<p>Fill/transform the training and test sets separately (this goes for filling data with pandas as well)</p>
</li>
<li>
<p>Don't use data from the future (test set) to fill data from the past (training set)</p>
</li>
</ul>
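
The tips above can be sketched with scikit-learn's `SimpleImputer`. This is one possible approach on a made-up odometer column, not necessarily the exact steps this notebook goes on to use. The fill value is learnt from the training rows only and then reused on the test rows.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Made-up odometer column with a few missing readings.
odo_demo = pd.DataFrame({
    "Odometer (KM)": [10000.0, np.nan, 30000.0, 50000.0, np.nan,
                      70000.0, 90000.0, 20000.0, np.nan, 40000.0],
})

# Split first, so the test rows cannot influence the fill value.
train_df, test_df = train_test_split(odo_demo, test_size=0.3, random_state=0)

imputer = SimpleImputer(strategy="mean")
train_filled = imputer.fit_transform(train_df)  # the mean is learnt from the training rows only
test_filled = imputer.transform(test_df)        # the same training mean fills the test rows

print(imputer.statistics_)  # the learnt mean of the training column
```

Calling `fit_transform` on the test set instead would quietly use test data to choose the fill value, which is the "data from the future" problem from the list above.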

In [None]:
#import a fresh version of the dataset
csm = pd.read_csv("https://raw.githubusercontent.com/ineelhere/Machine-Learning-and-Data-Science/master/scikit-learn/car-sales-missing-data.csv")
csm 