# Choosing the right model / estimator / algorithm for our problem
(scikit-learn uses 'estimator' as its term for a machine learning algorithm or model.) <br>


### We are going to work with two kinds of problems:
### Classification 
* Predicting whether a sample belongs to a specific category or not. 
* It is the task of predicting a discrete class label. 
* A classification algorithm may predict a continuous value, but that value takes the form of a probability for a class label. 
* Classification predictions can be evaluated using accuracy, whereas regression predictions cannot.
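
To make the label-versus-probability point concrete, here is a minimal sketch. It assumes synthetic data from `make_classification` and a `LogisticRegression` classifier, neither of which is used elsewhere in this notebook; the names `X_clf`, `y_clf`, `clf`, `labels` and `probas` are made up for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# synthetic two-class data (an assumption for illustration, not the notebook's dataset)
X_clf, y_clf = make_classification(n_samples=200, n_features=5, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_clf, y_clf, test_size=0.2, random_state=42)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_tr, y_tr)

labels = clf.predict(X_te)        # discrete class labels (0 or 1)
probas = clf.predict_proba(X_te)  # continuous values: one probability per class, per sample

print(labels[:5])
print(probas[:5])
print(accuracy_score(y_te, labels))  # fraction of labels predicted correctly
```

Accuracy compares the discrete labels with the true labels, which is why it only makes sense for classification.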

### Regression 
* Predicting a number or continuous quantity. 
* A regression algorithm may predict a discrete value, but the discrete value is in the form of an integer quantity.  
* Regression predictions can be evaluated using root mean squared error, whereas classification predictions cannot.
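
As a rough sketch of how root mean squared error is computed, the snippet below takes the square root of scikit-learn's `mean_squared_error`. The arrays `y_true` and `y_pred` are made-up numbers for illustration only.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# made-up true values and predictions (illustrative only)
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

mse = mean_squared_error(y_true, y_pred)  # mean of the squared differences
rmse = np.sqrt(mse)                       # back in the same units as the target

print(mse, rmse)
```

Because RMSE measures how far predicted numbers are from the true numbers, it is meaningful for quantities but not for class labels.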

Visit https://machinelearningmastery.com/classification-versus-regression-in-machine-learning/ for more information

![](https://scikit-learn.org/stable/_static/ml_map.png)
Source - https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html <br>
<br>Just a friendly piece of advice: reading research articles is very helpful if you want a good grasp of any concept. Here is one for sklearn - https://jmlr.csail.mit.edu/papers/volume12/pedregosa11a/pedregosa11a.pdf . 
Give this a read. 
You won't understand everything in one go, but with time and practice things will become clearer.

![](https://thumbs.gfycat.com/QualifiedFewKoala-size_restricted.gif)

#### Optimus Prime - Let's Roll!

In [1]:
# standard imports
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt

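Sklearn has a built-in dataset called the "Boston Housing Dataset". We will use that for our learning.<br>
To understand the dataset, refer to https://scikit-learn.org/stable/datasets/index.html#boston-dataset <br>
Note: `load_boston` was deprecated in scikit-learn 1.0 and removed in 1.2, so the cells below need an older scikit-learn version to run as written.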

In [2]:
from sklearn.datasets import load_boston
boston = load_boston()
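boston;  # the trailing semicolon suppresses the long output in Jupyter
# here boston is a Bunch: a dictionary-like object whose keys are also available as attributes.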

In [3]:
# see the keys in the boston dictionary (just for convenience)
for i in boston:
    print(i)

data
target
feature_names
DESCR
filename
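
The dataset loaders return a dictionary-like `Bunch`, so the same entry can be read as a key or as an attribute. The sketch below uses `load_iris` as a stand-in (an assumption: it is a different dataset, chosen only because it ships with every scikit-learn version); the names `iris` and `same_object` are illustrative.

```python
from sklearn.datasets import load_iris

iris = load_iris()  # a Bunch, the same container type that load_boston returns

print(type(iris).__name__)
print(list(iris.keys()))

# key access and attribute access return the very same array
same_object = iris["data"] is iris.data
print(same_object)
```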


In [4]:
# change the dataset from dictionary to dataframe
bdf = pd.DataFrame(boston["data"], 
                   columns=boston["feature_names"])
bdf["target"] = pd.Series(boston["target"])
bdf.head()

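|   | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 | 36.2 |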


In [5]:
len(bdf)

506

![](https://scikit-learn.org/stable/_static/ml_map.png)
* \>50 samples = YES
* predicting a category = NO
* predicting a quantity = YES
* <100k samples = YES
* few features should be important = let's pretend NO. 

Following these answers on the map leads to Ridge regression: https://scikit-learn.org/stable/modules/linear_model.html#ridge-regression
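
The walk through the regression branch of the map can be sketched as a small function. `suggest_regressors` is a hypothetical helper written for this notebook, not part of scikit-learn, and it only encodes the regression questions answered above as I read them from the map.

```python
def suggest_regressors(n_samples, few_features_important):
    """Hypothetical helper: first suggestions from the regression branch of the sklearn map."""
    if n_samples <= 50:
        return ["get more data"]
    if n_samples >= 100_000:
        return ["SGDRegressor"]
    if few_features_important:
        return ["Lasso", "ElasticNet"]
    return ["Ridge", "SVR(kernel='linear')"]


# 506 samples, and we pretend that no small subset of features dominates
print(suggest_regressors(506, few_features_important=False))
```

The map also lists fallbacks when the first suggestion does not work well (for example ensemble regressors), which this sketch leaves out.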

In [6]:
# try out the Ridge Regression model
from sklearn.linear_model import Ridge

# setup random seed
np.random.seed(42)

#create the data
x = bdf.drop("target", axis=1)
y = bdf["target"]

# split into test and train sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size = 0.2)

# Instantiate Ridge Model
model = Ridge()
model.fit(x_train, y_train) #find patterns 

# check the score of the Ridge model on the test data
model.score(x_test, y_test) #evaluate the above patterns

0.6662221670168521

The above score is decent, but not the best possible. For regressors, `score()` returns the coefficient of determination (R²), whose best value is 1.0. So how do we improve? What if Ridge isn't the right fit in this case?
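
As a quick check on what `score()` reports for a regressor, the sketch below compares it with `r2_score` from `sklearn.metrics`. It assumes synthetic data from `make_regression` rather than the Boston data, and the names `X_reg`, `y_reg` and `ridge` are illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# synthetic regression data (an assumption for illustration)
X_reg, y_reg = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_reg, y_reg, test_size=0.2, random_state=42)

ridge = Ridge().fit(X_tr, y_tr)

score_from_model = ridge.score(X_te, y_te)               # what model.score() reports
score_from_metric = r2_score(y_te, ridge.predict(X_te))  # R² computed explicitly

print(score_from_model, score_from_metric)
```

The two numbers should match, since the regressor's `score()` method is defined as R² on the data it is given.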
#### Let's give "Random forest regressor" a try!
![](https://thumbs.gfycat.com/HelplessWarmheartedIguanodon-size_restricted.gif)

In [9]:
# trying the Random forest regressor
from sklearn.ensemble import RandomForestRegressor

# setup a random seed
np.random.seed(42)

# create the data
x = bdf.drop("target", axis=1)
y = bdf["target"]

# split into test and train sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size = 0.2)

# Instantiate Random forest regressor
rfr = RandomForestRegressor()
rfr.fit(x_train, y_train) #find patterns 

# check the score of the Random forest regressor on the test data
rfr.score(x_test, y_test) #evaluate the above patterns

0.873969014117403

### The score improved! NICE!
![](https://media.giphy.com/media/ToMjGppwd6mojcZSFJm/giphy.gif)
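
Since the two cells above repeat the same split-fit-score steps, one possible tidy-up is to loop over candidate estimators on a single split. This is only a sketch: it assumes synthetic data from `make_friedman1` instead of the Boston data, and `candidates` and `results` are names invented here.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# synthetic non-linear regression data (an assumption for illustration)
X_demo, y_demo = make_friedman1(n_samples=300, noise=1.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

candidates = {
    "Ridge": Ridge(),
    "RandomForestRegressor": RandomForestRegressor(random_state=42),
}

results = {}
for name, estimator in candidates.items():
    estimator.fit(X_tr, y_tr)                       # find patterns on the training set
    results[name] = estimator.score(X_te, y_te)     # R² on the same held-out test set

print(results)
```

Scoring every model on the same test split keeps the comparison fair, and passing `random_state` makes the split and the forest reproducible without relying on the global NumPy seed.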

In [7]:
from sklearn.preprocessing import OneHotEncoder 
from sklearn.compose import ColumnTransformer
categorical_features = ["Make", "Colour", "Doors"]
one_hot = OneHotEncoder() #instantiate the one hot encoder (https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html)
transformer = ColumnTransformer([("one_hot",
                                   one_hot,
                                   categorical_features)],
                                   remainder="passthrough") 
# 'transformer' is a ColumnTransformer: it applies the one-hot encoder to the categorical features
# and passes the remaining columns through unchanged (remainder="passthrough").
# Note: every name in categorical_features must be a column of x. The Boston x has no
# "Make", "Colour" or "Doors" columns, which is why this cell raises the ValueError shown below.
transformed_x = transformer.fit_transform(x) # fit the transformer to our data (x) and transform it
transformed_x

ValueError: A given column is not a column of the dataframe
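
The error above appears because the listed columns are not in `x`. Here is a minimal sketch of the same transformer working on a DataFrame that does contain them. The `cars` DataFrame and its values are invented for illustration, and `sparse_threshold=0` is added only so the result comes back as a dense array that is easy to inspect.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# made-up data that actually has the categorical columns (illustrative only)
cars = pd.DataFrame({
    "Make": ["Toyota", "Honda", "Toyota", "BMW"],
    "Colour": ["White", "Red", "Blue", "White"],
    "Doors": [4, 4, 3, 5],
    "Odometer": [35000, 90000, 120000, 15000],
})

categorical_features = ["Make", "Colour", "Doors"]

# check up front that every requested column exists
missing = [col for col in categorical_features if col not in cars.columns]
print(missing)

transformer = ColumnTransformer(
    [("one_hot", OneHotEncoder(), categorical_features)],
    remainder="passthrough",  # leave the other columns untouched
    sparse_threshold=0,       # always return a dense array
)
transformed = transformer.fit_transform(cars)
print(transformed.shape)
```

Each of the three categorical columns has three distinct values here, so the output has nine one-hot columns plus the passed-through `Odometer` column.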