# Choosing the right model / estimator / algorithm for our problem
(sklearn uses the term 'estimator' for a machine learning algorithm or model) <br>
![](https://thumbs.gfycat.com/QualifiedFewKoala-size_restricted.gif)

#### Optimus Prime - Let's Roll!

### We are going to work with - 
### Classification 
* Predicting whether a sample belongs to a particular category or not. 
* It is the task of predicting a discrete class label. 
* A classification algorithm may predict a continuous value, but the continuous value is in the form of a probability for a class label. 
* Classification predictions can be evaluated using accuracy, whereas regression predictions cannot.
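
To make the point about probabilities concrete, here is a minimal sketch using `LogisticRegression` on a tiny one-feature dataset. The numbers are made up for illustration and are not part of this notebook's data. `predict_proba` gives the continuous probability for each class, while `predict` gives the discrete label.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data (made up for illustration): one feature, two classes
X_toy = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

toy_clf = LogisticRegression()
toy_clf.fit(X_toy, y_toy)

# Continuous output: one probability per class, each row sums to 1
toy_proba = toy_clf.predict_proba(X_toy)

# Discrete output: the class label with the highest probability
toy_labels = toy_clf.predict(X_toy)

print(toy_proba.round(3))
print(toy_labels)
```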

### Regression 
* Predicting a number or continuous quantity. 
* A regression algorithm may predict a discrete value, but the discrete value is in the form of an integer quantity.  
* Regression predictions can be evaluated using root mean squared error, whereas classification predictions cannot.
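
Both metrics mentioned above can be computed by hand. The sketch below uses made-up predictions, and `accuracy` and `rmse` are hypothetical helper names written for this example (scikit-learn ships its own metric functions in `sklearn.metrics`).

```python
import math

def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared difference."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Classification: discrete labels (made-up values)
labels_true = [1, 0, 1, 1, 0]
labels_pred = [1, 0, 0, 1, 0]
print(accuracy(labels_true, labels_pred))  # 4 of 5 labels match -> 0.8

# Regression: continuous quantities (made-up values)
prices_true = [24.0, 21.6, 34.7]
prices_pred = [25.0, 20.6, 33.7]
print(rmse(prices_true, prices_pred))  # every prediction is off by about 1
```

Accuracy counts exact matches, which is why it suits discrete labels; RMSE measures how far off a prediction is, which is why it suits continuous quantities.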

Visit https://machinelearningmastery.com/classification-versus-regression-in-machine-learning/ for more information

![](https://scikit-learn.org/stable/_static/ml_map.png)
Source - https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html <br>
<br>Just a friendly piece of advice - Reading research articles is very helpful if you want to get a good grasp of any concept. Here is one for sklearn - https://jmlr.csail.mit.edu/papers/volume12/pedregosa11a/pedregosa11a.pdf . 
Give this a read. 
You won't understand everything in one go, but with time and practice things will become clearer.



In [1]:
# standard imports
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt

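Sklearn has a built-in dataset called the "Boston Housing Dataset". We will use that for our learning.<br>
To understand the dataset, refer to https://scikit-learn.org/stable/datasets/index.html#boston-dataset <br>
Note: `load_boston` was deprecated in scikit-learn 1.0 and removed in 1.2, so the cells below need an older scikit-learn version to run as written.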

In [2]:
from sklearn.datasets import load_boston
boston = load_boston()
boston;
# here boston is a dictionary-like object (a sklearn Bunch).

In [3]:
# see the keys in the boston dictionary (just for our ease)
for i in boston:
    print(i)

data
target
feature_names
DESCR
filename


In [4]:
# change the dataset from dictionary to dataframe
bdf = pd.DataFrame(boston["data"], 
                   columns=boston["feature_names"])
bdf["target"] = pd.Series(boston["target"])
bdf.head()

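| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 | 36.2 |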
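
Since `load_boston` is not available in recent scikit-learn releases (it was removed in version 1.2), the same dictionary-to-DataFrame pattern can be tried on another bundled dataset. This sketch swaps in `load_diabetes`, which is an assumption made purely for illustration: it is a different regression dataset, so its columns and numbers will not match the Boston outputs shown in this notebook.

```python
import pandas as pd
from sklearn.datasets import load_diabetes

# load_diabetes returns a dictionary-like Bunch, just like load_boston did
diabetes = load_diabetes()

# Same pattern as above: features into a DataFrame, then attach the target
ddf = pd.DataFrame(diabetes["data"], columns=diabetes["feature_names"])
ddf["target"] = pd.Series(diabetes["target"])

print(ddf.shape)
print(ddf.head())
```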


In [5]:
len(bdf)

506

# Regression
![](https://scikit-learn.org/stable/_static/ml_map.png)
* \>50 samples = YES
* predicting a category = NO
* predicting a quantity = YES
* <100k samples = YES
* few features should be important = let's pretend NO. 

Then visit this URL = https://scikit-learn.org/stable/modules/linear_model.html#ridge-regression
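
The questions above can be read as a small chain of if-statements. The sketch below is only an illustration of that walk through the map: `suggest_regressor` is a hypothetical helper written for this notebook, not a scikit-learn function, and it only covers the regression branch.

```python
def suggest_regressor(n_samples, few_features_important):
    """Hypothetical helper mirroring the regression branch of the sklearn map.

    Only the questions listed above are encoded; this is not a
    scikit-learn API.
    """
    if n_samples <= 50:
        return "get more data"
    if n_samples >= 100_000:
        return "SGDRegressor"
    if few_features_important:
        return "Lasso or ElasticNet"
    return "Ridge or SVR(kernel='linear')"

# Boston housing: 506 samples, and we pretend that not just a few features matter
print(suggest_regressor(506, few_features_important=False))
```

With 506 samples and the "let's pretend NO" answer, the helper lands on Ridge, which is the estimator tried next.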

In [6]:
# try out the Ridge Regression model
from sklearn.linear_model import Ridge

# setup random seed
np.random.seed(42)

#create the data
x = bdf.drop("target", axis=1)
y = bdf["target"]

# split into test and train sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size = 0.2)

# Instantiate Ridge Model
model = Ridge()
model.fit(x_train, y_train) #find patterns 

# check the score of the Ridge model on the test data
model.score(x_test, y_test) #evaluate the above patterns

0.6662221670168521

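For regressors, `score` returns the coefficient of determination (R²), where 1.0 means a perfect fit. A score of about 0.67 is decent, but it leaves room for improvement. So how do we improve? What if Ridge isn't the best fit for this data?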
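
The number returned by `score` for a regressor is the coefficient of determination, R². As a rough sketch of what that number means, it can be computed by hand; `r2_score_by_hand` is a hypothetical helper written for this example, and the "predictions" are hand-picked rather than produced by a model.

```python
def r2_score_by_hand(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# First five target values from the bdf.head() output above
actual = [24.0, 21.6, 34.7, 33.4, 36.2]

# Predicting every value exactly
perfect = r2_score_by_hand(actual, actual)

# Predicting the mean of the targets for every sample
mean_value = sum(actual) / len(actual)
baseline = r2_score_by_hand(actual, [mean_value] * len(actual))

print(perfect)   # a perfect model scores 1.0
print(baseline)  # always predicting the mean scores 0.0
```

So a score of 0.67 sits between "no better than predicting the mean" (0.0) and "perfect" (1.0).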
#### Let's give "Random forest regressor" a try!
![](https://media.giphy.com/media/NYx77GABHTji/giphy.gif)

In [7]:
# trying the Random forest regressor
from sklearn.ensemble import RandomForestRegressor

# setup a random seed
np.random.seed(42)

# create the data
x = bdf.drop("target", axis=1)
y = bdf["target"]

# split into test and train sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size = 0.2)

# Instantiate Random forest regressor
rfr = RandomForestRegressor()
rfr.fit(x_train, y_train) #find patterns 

# check the score of the Random forest regressor on the test data
rfr.score(x_test, y_test) #evaluate the above patterns

0.873969014117403

### The score improved! NICE!
![](https://media.giphy.com/media/ToMjGppwd6mojcZSFJm/giphy.gif)
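
The two regression cells above repeat the same steps and only swap the estimator. As a sketch, that pattern can be written as a loop over candidate estimators. The data here is synthetic, generated with `make_regression`, which is an assumption made so the example runs without the Boston dataset; the scores will not match the ones above.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: not the Boston dataset)
X_syn, y_syn = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=42)

candidates = {
    "Ridge": Ridge(),
    "RandomForestRegressor": RandomForestRegressor(random_state=42),
}

scores = {}
for name, estimator in candidates.items():
    estimator.fit(X_tr, y_tr)                    # find patterns
    scores[name] = estimator.score(X_te, y_te)   # R^2 on the held-out data

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

Because `make_regression` builds a linear relationship, Ridge is likely to come out ahead on this synthetic data. Which estimator wins depends on the data, which is why it is worth trying more than one.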
#### So far we have used regression. Now it is time to see what happens with Classification.

# Classification
![](https://media.giphy.com/media/xUPOqfRo35i4oRf8NG/giphy.gif)
(meet Bumblebee!)

In [8]:
# get the dataset
hd = pd.read_csv("https://raw.githubusercontent.com/ineelhere/Machine-Learning-and-Data-Science/master/scikit-learn/heart-disease.csv")
hd.head()

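| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |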


In [9]:
hd.describe()  #just for fun

| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 |
| mean | 54.366337 | 0.683168 | 0.966997 | 131.623762 | 246.264026 | 0.148515 | 0.528053 | 149.646865 | 0.326733 | 1.039604 | 1.39934 | 0.729373 | 2.313531 | 0.544554 |
| std | 9.082101 | 0.466011 | 1.032052 | 17.538143 | 51.830751 | 0.356198 | 0.52586 | 22.905161 | 0.469794 | 1.161075 | 0.616226 | 1.022606 | 0.612277 | 0.498835 |
| min | 29.0 | 0.0 | 0.0 | 94.0 | 126.0 | 0.0 | 0.0 | 71.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 47.5 | 0.0 | 0.0 | 120.0 | 211.0 | 0.0 | 0.0 | 133.5 | 0.0 | 0.0 | 1.0 | 0.0 | 2.0 | 0.0 |
| 50% | 55.0 | 1.0 | 1.0 | 130.0 | 240.0 | 0.0 | 1.0 | 153.0 | 0.0 | 0.8 | 1.0 | 0.0 | 2.0 | 1.0 |
| 75% | 61.0 | 1.0 | 2.0 | 140.0 | 274.5 | 0.0 | 1.0 | 166.0 | 1.0 | 1.6 | 2.0 | 1.0 | 3.0 | 1.0 |
| max | 77.0 | 1.0 | 3.0 | 200.0 | 564.0 | 1.0 | 2.0 | 202.0 | 1.0 | 6.2 | 2.0 | 4.0 | 3.0 | 1.0 |


![](https://scikit-learn.org/stable/_static/ml_map.png)

* \>50 samples = YES
* predicting a category = YES
* is the data labelled? = YES
* <100k samples = YES
* so let's try <b> `linear SVC (support vector classification)` </b>

<br>Read about Support Vector Machine (SVM) here - https://scikit-learn.org/stable/modules/svm.html
<br>Read about linear SVC here - https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html

In [13]:
# import the linear SVC estimator class
from sklearn.svm import LinearSVC

# setup a random seed
np.random.seed(42)

# create the data
x = hd.drop("target", axis=1)
y = hd["target"]

# split into test and train sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size = 0.2)

# instantiate LinearSVC
clf = LinearSVC()
clf.fit(x_train, y_train)

# evaluate the LinearSVC model
clf.score(x_test,y_test)



0.4918032786885246

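Here we see that the model reaches only ~49% accuracy.<br>
Some likely reasons:
* our dataset is small to begin with (303 rows)
* the target column has only 2 classes, 1 or 0, where 1 means the patient has heart disease and 0 means healthy. With two classes, random guessing already gets about 50% right, so a score of ~49% means the model has learned essentially nothing useful. 
* we also get a warning message from LinearSVC (google it to learn more!). Such warnings usually indicate that the solver did not converge, which is common when the features are on very different scales, as they are here (compare `chol` and `oldpeak` in the `describe()` output above).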
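
One common remedy for a LinearSVC that struggles on unscaled features is to standardize the features first and allow more iterations. The sketch below shows that idea with `make_pipeline` and `StandardScaler`. It runs on synthetic data from `make_classification` (an assumption, used so the example does not depend on the heart-disease CSV), and whether the same change helps on the heart-disease data has not been checked here.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic two-class data (assumption: a stand-in, not the heart-disease CSV)
X_syn, y_syn = make_classification(
    n_samples=300, n_features=13, n_informative=5,
    n_clusters_per_class=1, class_sep=2.0, random_state=42,
)
# Put some columns on very different scales, as raw measurements often are
X_syn = X_syn * [1, 10, 100, 1000, 1, 1, 1, 1, 1, 1, 1, 1, 1]

X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=42)

# Standardize every feature, then fit LinearSVC with a larger iteration budget
scaled_svc = make_pipeline(StandardScaler(), LinearSVC(max_iter=10000))
scaled_svc.fit(X_tr, y_tr)

scaled_accuracy = scaled_svc.score(X_te, y_te)
print(scaled_accuracy)
```

Putting the scaler inside the pipeline means it is fitted on the training data only and then applied to the test data, so no information from the test set leaks into training.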

### So, what now? Let's get back to the sklearn map again
![](https://scikit-learn.org/stable/_static/ml_map.png)

* is our data in text? = NO
* so let's try Ensemble classifiers (we will get back to KNeighbors classifiers later)

Read here to know about the ensemble classifiers - https://scikit-learn.org/stable/modules/ensemble.html
<br>Read here to know about the KNeighbors classifier - https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html

In [14]:
# import an ensemble classifier (random forest)
from sklearn.ensemble import RandomForestClassifier

# setup a random seed
np.random.seed(42)

# create the data
x = hd.drop("target", axis=1)
y = hd["target"]

# split into test and train sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size = 0.2)

# instantiate 
clf = RandomForestClassifier()
clf.fit(x_train, y_train)

# evaluate 
clf.score(x_test,y_test)

0.8524590163934426

The score improved!
![](https://media.giphy.com/media/3ohA30GDNCMLypd9Sw/giphy.gif)

### So,
* When data = `structured` (tables like the ones above) -- ensemble methods are a good first choice
* When data = `unstructured` (images, audio, text) -- go for Deep Learning or Transfer Learning