# Getting our data ready for ML via sklearn

Yeah, I know, it feels kind of difficult... <br>
![](https://media.giphy.com/media/1eBQ8eo1AYyGc/giphy.gif) ![](https://media1.tenor.com/images/7ecb7f303712e4e24350d5b5ad9689fa/tenor.gif?itemid=5436040)

See? I don't always use memes :)
Best wishes to you!

## Standard imports

In [46]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

### Our objectives 
#### 1. Splitting data into features (usually x) and labels (usually y)
#### 2. Filling (also called "imputing") or disregarding missing values
#### 3. Converting non-numerical values to numerical values (a.k.a. feature encoding)

In [47]:
# Let's begin with the heart disease dataset
hd = pd.read_csv("https://raw.githubusercontent.com/ineelhere/Machine-Learning-and-Data-Science/master/scikit-learn/heart-disease.csv")
hd.head(3)

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1


In [48]:
#length/size of the dataset
len(hd)

303

In [49]:
# get the features (x) ready = all columns except target
x = hd.drop("target", axis=1)
x.head(3)

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2


In [50]:
# get the labels (y) ready = target column
y = hd.target
y.head(3)

0    1
1    1
2    1
Name: target, dtype: int64

# Always Remember! <br>
## NEVER EVALUATE or TEST your model on the DATA it LEARNED FROM -- that's why we split the data into training and test sets.
(because if you do, it is like cheating in an examination)  <br>
![](https://media.giphy.com/media/A4CLbWb9o5rj2/giphy.gif)

### 1. Splitting the data into test and training sets
Imagine it this way
* Test data = Final exam
* Training data = Mock exam (So we use the train_test_split first!)

In [51]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size=0.2) 
# test_size=0.2 means that we want our test dataset to be 20% of the overall data

In [52]:
#check the shapes of the new matrices just created. 
x_train.shape, x_test.shape, y_train.shape, y_test.shape

((242, 13), (61, 13), (242,), (61,))

Just notice,  <br>
242 ≈ 80% of 303 <br>
61 ≈ 20% of 303 <br>
So, the training set is about 80% of the overall data and the test set is about 20% <br>
(just as `test_size=0.2` asked for!)
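
One more thing worth knowing: `train_test_split` shuffles the rows, so every run gives a different split unless you pin the randomness. Here is a minimal sketch of that, using made-up data with the same shape as ours (303 rows, 13 columns). The names `x_demo` and `y_demo`, and the choice of `random_state=42`, are my own assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up stand-in for the heart disease data: 303 rows, 13 feature columns
rng = np.random.default_rng(0)
x_demo = pd.DataFrame(rng.normal(size=(303, 13)))
y_demo = pd.Series(rng.integers(0, 2, size=303))

# random_state pins the shuffle, so the same rows land in the same set every time
x_tr, x_te, y_tr, y_te = train_test_split(x_demo, y_demo, test_size=0.2, random_state=42)
x_tr2, x_te2, y_tr2, y_te2 = train_test_split(x_demo, y_demo, test_size=0.2, random_state=42)

print(x_tr.shape, x_te.shape)          # (242, 13) (61, 13)
print(x_tr.index.equals(x_tr2.index))  # True: both calls picked the same training rows
```

Without `random_state`, the two calls would most likely pick different rows, which makes scores harder to compare between runs.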

### 2. Filling (also called "Imputing") or disregarding missing values
### Make sure the data is in numerical format - if not, convert it!

In [53]:
#let us use a new dataset now. So first we will get the dataset
cse = pd.read_csv("https://raw.githubusercontent.com/ineelhere/Machine-Learning-and-Data-Science/master/scikit-learn/car-sales-extended.csv")
cse.head()
# ummm... cse is just short for car-sales-extended. (I'm lazy!)

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431,4,15323
1,BMW,Blue,192714,5,19943
2,Honda,White,84714,4,28343
3,Toyota,White,154365,4,13434
4,Nissan,Blue,181577,3,14043


In [54]:
len(cse)

1000

In [55]:
cse.dtypes

Make             object
Colour           object
Odometer (KM)     int64
Doors             int64
Price             int64
dtype: object

In [56]:
# prepare the feature and label
x = cse.drop("Price", axis=1)
y = cse.Price

In [57]:
# Splitting the data into test and training sets (Practice time!)
from sklearn.model_selection import train_test_split # No need to do it again, it's already done above. But still... :P
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size=0.2) 

In [58]:
# .size counts elements (rows x columns), not rows: e.g. x_train has 800 rows x 4 columns = 3200
x_train.size, x_test.size, y_train.size, y_test.size

(3200, 800, 800, 200)

In [59]:
# build ML model
from sklearn.ensemble import RandomForestRegressor

A random forest regressor is built the same way as a random forest classifier (both are ensembles of decision trees). <br>
The difference is that it predicts a number instead of a class, which is what we're trying to do. <br>
We're trying to predict the price of a car given some attributes about it. <br>
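
To see the classifier/regressor difference for yourself, here is a small sketch on made-up data (not the car sales data). The targets `y_class` and `y_value` are hypothetical: one holds 0/1 labels, the other a continuous "price-like" number.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Made-up features and two kinds of target
rng = np.random.default_rng(0)
X = rng.uniform(size=(100, 2))
y_class = (X[:, 0] > 0.5).astype(int)  # discrete labels: 0 or 1
y_value = 10_000 * X[:, 0] + 500       # continuous values

clf = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y_class)
reg = RandomForestRegressor(n_estimators=20, random_state=0).fit(X, y_value)

class_pred = clf.predict(X[:3])  # each prediction is one of the known labels
value_pred = reg.predict(X[:3])  # each prediction is a float
print(class_pred, value_pred)
```

Same family of model, same `fit`/`predict` calls; only the kind of target changes.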

#### Now, our ML algorithm cannot deal with strings! So we need to take care of that --

### 3. Converting categorical values (strings) into numerical values

In [60]:
from sklearn.preprocessing import OneHotEncoder 
from sklearn.compose import ColumnTransformer
categorical_features = ["Make", "Colour", "Doors"]
one_hot = OneHotEncoder() #instantiate the one hot encoder (https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html)
transformer = ColumnTransformer([("one_hot",
                                   one_hot,
                                   categorical_features)],
                                   remainder="passthrough") 
# 'transformer' is a ColumnTransformer: we ask it to apply the one-hot encoder to the
# categorical features and, for the remaining columns, to just pass them through
# (don't do anything to those!)
transformed_x = transformer.fit_transform(x) # fit the transformer to our data (x) and transform it
transformed_x

array([[0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 3.54310e+04],
       [1.00000e+00, 0.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        1.00000e+00, 1.92714e+05],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 8.47140e+04],
       ...,
       [0.00000e+00, 0.00000e+00, 1.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 6.66040e+04],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 2.15883e+05],
       [0.00000e+00, 0.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 2.48360e+05]])

In [61]:
pd.DataFrame(transformed_x) #changing the above into a dataframe

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12
0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,35431.0
1,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,192714.0
2,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,84714.0
3,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,154365.0
4,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,181577.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,35820.0
996,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,155144.0
997,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,66604.0
998,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,215883.0


### The concept of "One hot encoding in python"
![](https://cdn-images-1.medium.com/fit/t/1600/480/1*ggtP4a5YaRx6l09KQaYOnw.png)
"One Hot Encoding is a common way of preprocessing categorical features for machine learning models. This type of encoding creates a new binary feature for each possible category and assigns a value of 1 to the feature of each sample that corresponds to its original category. It's easier to understand visually: in the example below, we One Hot Encode a *color* feature which consists of three categories (*red*, *green*, and *blue*).

Sci-kit Learn offers the `OneHotEncoder` class out of the box to handle categorical inputs using One Hot Encoding. Simply create an instance of `sklearn.preprocessing.OneHotEncoder` then fit the encoder on the input data (this is where the One Hot Encoder identifies the possible categories in the DataFrame and updates some internal state, allowing it to map each category to a unique binary feature), and finally, call `one_hot_encoder.transform()` to One Hot Encode the input DataFrame. The great thing about the `OneHotEncoder` class is that, once it has been fit on the input features, you can continue to pass it new samples, and it will encode the categorical features consistently."

Source of above information - https://towardsdatascience.com/building-a-one-hot-encoding-layer-with-tensorflow-f907d686bf39
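
The fit-then-transform idea from the quote can be sketched like this, on a made-up `Colour` column with the same three categories. The variable names are hypothetical, and I call `.toarray()` because the encoder returns a sparse matrix by default.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up colour feature with three categories
colours = pd.DataFrame({"Colour": ["red", "green", "blue", "green"]})

encoder = OneHotEncoder()
encoded = encoder.fit_transform(colours).toarray()  # fit learns the categories, transform encodes

print(encoder.categories_)  # one array per column; categories sorted: blue, green, red
print(encoded)
# [[0. 0. 1.]
#  [0. 1. 0.]
#  [1. 0. 0.]
#  [0. 1. 0.]]

# A new sample is encoded with the same column layout learned during fit
new_sample = pd.DataFrame({"Colour": ["blue"]})
new_encoded = encoder.transform(new_sample).toarray()
print(new_encoded)  # [[1. 0. 0.]]
```

Each row has exactly one 1, sitting in the column that matches its category.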

### Now, there is another way to do this! - Using pandas!

In [62]:
dummies = pd.get_dummies(cse[["Make", "Colour", "Doors" ]])
dummies # Doors is left as-is: by default get_dummies only encodes object/string/category columns, and Doors is numeric

Unnamed: 0,Doors,Make_BMW,Make_Honda,Make_Nissan,Make_Toyota,Colour_Black,Colour_Blue,Colour_Green,Colour_Red,Colour_White
0,4,0,1,0,0,0,0,0,0,1
1,5,1,0,0,0,0,1,0,0,0
2,4,0,1,0,0,0,0,0,0,1
3,4,0,0,0,1,0,0,0,0,1
4,3,0,0,1,0,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...
995,4,0,0,0,1,1,0,0,0,0
996,3,0,0,1,0,0,0,0,0,1
997,4,0,0,1,0,0,1,0,0,0
998,4,0,1,0,0,0,0,0,0,1
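
If you want pandas to one-hot encode `Doors` as well, one option is to name the columns explicitly with the `columns` argument. This is a sketch on a made-up three-row frame (not the real dataset). `dtype=int` is my own choice so the output shows 0/1, since newer pandas versions return True/False by default.

```python
import pandas as pd

# Tiny made-up frame with the same categorical columns
cars = pd.DataFrame({
    "Make": ["Honda", "BMW", "Toyota"],
    "Colour": ["White", "Blue", "White"],
    "Doors": [4, 5, 4],
})

# Listing Doors in `columns` forces it to be encoded even though it is numeric
dummies_demo = pd.get_dummies(cars, columns=["Make", "Colour", "Doors"], dtype=int)
print(dummies_demo.columns.tolist())
# ['Make_BMW', 'Make_Honda', 'Make_Toyota', 'Colour_Blue', 'Colour_White', 'Doors_4', 'Doors_5']
```

Another route with the same effect would be to convert `Doors` to strings first with `astype(str)`.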


### Now let's fit the model on the transformed data

In [66]:
model = RandomForestRegressor() #create a model
np.random.seed(42)
x_train, x_test,  y_train, y_test = train_test_split(transformed_x,
                                                    y,
                                                    test_size=0.2) # remember to keep the order the same (funny, I mess it up all the time!)
model.fit(x_train, y_train) #train a model

RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth=None, max_features='auto', max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      n_estimators=100, n_jobs=None, oob_score=False,
                      random_state=None, verbose=0, warm_start=False)

In [67]:
model.score(x_test, y_test) # evaluate the model on the test set (for regressors, score() returns R^2)

0.3235867221569877
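
What is that number? For a regressor, `score()` returns the coefficient of determination (R^2): 1.0 means perfect predictions, and 0.0 is about what you'd get by always predicting the mean of the targets. Here is a minimal sketch, on made-up data (not the car sales data), showing that `score()` matches `sklearn.metrics.r2_score` computed by hand. All names and numbers below are my own assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Made-up regression data: the target depends on the first feature plus noise
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(200, 3))
y = 3 * X[:, 0] + rng.normal(scale=1.0, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
reg = RandomForestRegressor(n_estimators=50, random_state=42).fit(X_tr, y_tr)

score = reg.score(X_te, y_te)                # what model.score() reports
manual = r2_score(y_te, reg.predict(X_te))   # the same metric, computed explicitly
print(np.isclose(score, manual))             # True
```

So a score around 0.32, like the one above, means the model explains only about a third of the variation in price.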

![](https://media1.tenor.com/images/50e1f1cdd0df8e1bd6a7feab86ca8ac8/tenor.gif?itemid=8852977)


# Handling Missing Values With Pandas

