# Step 2: Choose the right estimator/algorithm for our problems

In [2]:
# Importing Standard Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

Types of problem:
* Regression - Predicting a number
* Classification - Predicting whether a sample is one thing or another

<img src="../images/sklearn-ml-map.png" width=950/>

## 2.1: Picking a machine learning model/estimator for a Regression problem

Check the Scikit-Learn machine learning map...
https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html

#### Import Boston Housing Dataset

In [3]:
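# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell requires an older scikit-learn release
from sklearn.datasets import load_boston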
boston = load_boston()
boston.keys()

dict_keys(['data', 'target', 'feature_names', 'DESCR', 'filename'])

In [4]:
boston["feature_names"]

array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',
       'TAX', 'PTRATIO', 'B', 'LSTAT'], dtype='<U7')

In [5]:
# Create Dataframe
boston_df = pd.DataFrame(boston["data"], columns=boston["feature_names"])
boston_df["target"] = pd.Series(boston["target"])
boston_df.head()

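|   | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | target |
|---|------|----|-------|------|-----|----|-----|-----|-----|-----|---------|---|-------|--------|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 | 36.2 |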


In [6]:
# No. of samples
len(boston_df)

506

## Let's try the Ridge Regression Model

In [7]:
from sklearn.linear_model import Ridge

### Create the data

In [8]:
from sklearn.model_selection import train_test_split

In [9]:
# Setup Random Seed
np.random.seed(42)

X = boston_df.drop("target", axis=1)
y = boston_df["target"]

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
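To make the split concrete, here is a minimal sketch of what `test_size=0.2` does to 506 samples. It uses placeholder zero arrays rather than the real Boston data, and it passes `random_state=42` directly to `train_test_split` as an alternative to seeding NumPy globally; both choices are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same shape as the Boston features (506 rows, 13 columns)
X = np.zeros((506, 13))
y = np.zeros(506)

# random_state makes this particular split reproducible without touching the global seed
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)  # (404, 13) (102, 13)
```

The test set size is rounded up, so 20% of 506 samples becomes 102 test rows and the remaining 404 rows are used for training.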

### Instantiate Ridge Model

In [10]:
model = Ridge()
model.fit(X_train, y_train)

Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
      normalize=False, random_state=None, solver='auto', tol=0.001)

In [11]:
# Check score
model.score(X_test, y_test)

0.6662221670168522
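For regressors, `score()` returns the coefficient of determination (R^2), where 1.0 is a perfect fit. The sketch below recomputes it by hand on synthetic data from `make_regression` (an assumption, used so the example runs without the Boston dataset) and compares it with `r2_score`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data (stand-in for the housing data)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = Ridge().fit(X_train, y_train)
y_preds = model.predict(X_test)

# R^2 = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_test - y_preds) ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
manual_r2 = 1 - ss_res / ss_tot

print(manual_r2, model.score(X_test, y_test), r2_score(y_test, y_preds))
```

All three printed values should agree, which shows that `model.score()` on a regressor is shorthand for `r2_score(y_test, model.predict(X_test))`.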

**How to improve this score?**

**What if Ridge wasn't working?**

Let's refer back to the map...
https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html

## Let's try the Random Forest Regressor

In [12]:
# Import Random Forest Regressor
from sklearn.ensemble import RandomForestRegressor

# Random Seed
np.random.seed(42)

# Create Data
X = boston_df.drop("target", axis=1)
y = boston_df["target"]

# Split data into train and test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Instantiate Random Forest Regressor
rf = RandomForestRegressor(n_estimators=100)
rf.fit(X_train, y_train)

# Evaluate Random Forest Regressor
rf.score(X_test, y_test)

0.873969014117403

In [13]:
# Check Ridge Model again
model.score(X_test, y_test)

0.6662221670168522
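When trying several estimators from the map, it can help to score them on the same split in one loop. This is a minimal sketch on synthetic data from `make_friedman1` (an assumption, chosen because it is non-linear and needs no download); the `candidates` and `results` dictionaries are hypothetical names used only for illustration.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Synthetic non-linear regression data (stand-in for the housing data)
X, y = make_friedman1(n_samples=500, n_features=10, noise=1.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Hypothetical shortlist of estimators to compare
candidates = {
    "Ridge": Ridge(),
    "RandomForestRegressor": RandomForestRegressor(n_estimators=100, random_state=42),
}

# Fit every candidate on the same training data and score it on the same test data
results = {}
for name, estimator in candidates.items():
    estimator.fit(X_train, y_train)
    results[name] = estimator.score(X_test, y_test)

for name, r2 in results.items():
    print(f"{name}: R^2 = {r2:.3f}")
```

Because every model sees identical train and test sets, the difference in scores reflects the estimators rather than the split.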

## 2.2: Picking a machine learning model/estimator for a Classification problem

Check the Scikit-Learn machine learning map...
https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html

#### Import Heart Disease Data

In [14]:
heart_disease = pd.read_csv("../data/heart-disease.csv")
heart_disease.head()

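|   | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|-----|-----|----|----------|------|-----|---------|---------|-------|---------|-------|----|------|--------|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |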


In [15]:
len(heart_disease)

303

## Let's try the 'LinearSVC'

In [16]:
# Import the LinearSVC estimator class
from sklearn.svm import LinearSVC

# Setup random seed
np.random.seed(42)

# Make the data
X = heart_disease.drop("target", axis=1)
y = heart_disease["target"]

# Train / Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Instantiate LinearSVC
clf = LinearSVC()
clf.fit(X_train, y_train)

# Evaluate LinearSVC
clf.score(X_test, y_test)



0.4918032786885246
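A score near 0.5 on a two-class problem is close to guessing. One possible reason (an assumption, not verified on the heart disease data here) is that linear models such as `LinearSVC` are sensitive to feature scale, and columns like `chol` and `oldpeak` have very different ranges. The sketch below shows the general pattern of scaling inside a pipeline, using synthetic data from `make_classification`; the `max_iter` value is an arbitrary illustrative choice.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic binary classification data (stand-in for the heart disease data)
X, y = make_classification(n_samples=300, n_features=13, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Standardise the features, then fit the linear SVM.
# The scaler is fitted on the training data only, so no test information leaks in.
scaled_svc = make_pipeline(
    StandardScaler(),
    LinearSVC(max_iter=10000, random_state=42),
)
scaled_svc.fit(X_train, y_train)

scaled_score = scaled_svc.score(X_test, y_test)
print(scaled_score)
```

Whether this helps on a given dataset has to be checked empirically; trying a different estimator, as the next section does, is an equally valid response.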

## Let's try the Random Forest Classifier

In [17]:
# Import the RandomForestClassifier estimator class
from sklearn.ensemble import RandomForestClassifier

# Setup random seed
np.random.seed(42)

# Make the data
X = heart_disease.drop("target", axis=1)
y = heart_disease["target"]

# Train / Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Instantiate Random Forest Classifier
clf = RandomForestClassifier(n_estimators=100)
clf.fit(X_train, y_train)

# Evaluate Random Forest Classifier
clf.score(X_test, y_test)

0.8524590163934426

**A little advice for modelling in machine learning:**

1. If you have structured data (tables or dataframes), use ensemble methods such as a Random Forest.
2. If you have unstructured data (text, images, audio, things not in tables), use deep learning or transfer learning.

# Step 3: Fit the model/algorithm and use it to make predictions on our data

## 3.1: Fit the model

In [30]:
# Model
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Setup random seed
np.random.seed(42)

# Make the data
X = heart_disease.drop("target", axis=1)
y = heart_disease["target"]

# Train / Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Instantiate Random Forest Classifier
clf = RandomForestClassifier(n_estimators=100)
clf.fit(X_train, y_train)

# Evaluate Random Forest Classifier
clf.score(X_test, y_test)

0.8524590163934426

## 3.2: Make predictions

Two ways:
1. `predict()`
2. `predict_proba()`

#### predict()

In [31]:
clf.predict(X_test)

array([0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0,
       1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0], dtype=int64)

In [32]:
np.array(y_test)

array([0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0,
       0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0], dtype=int64)

In [33]:
# Compare predictions
y_preds = clf.predict(X_test)
np.mean(y_preds == y_test)

0.8524590163934426

In [34]:
clf.score(X_test, y_test)

0.8524590163934426

In [35]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_preds)

0.8524590163934426

#### predict_proba() (classification models)

`predict_proba()` returns the probability of each class label for every sample.

In [36]:
# Let's test the first 5 samples

In [38]:
clf.predict_proba(X_test[:5])                      # predict_proba()

array([[0.89, 0.11],
       [0.49, 0.51],
       [0.43, 0.57],
       [0.84, 0.16],
       [0.18, 0.82]])

In [40]:
clf.predict(X_test[:5])                           # predict()

array([0, 1, 1, 0, 1], dtype=int64)

**`predict_proba()` returns, for each sample, the probability of class 0 and the probability of class 1.**

For example, for the first sample the model is 89% sure that the label is 0. That level of confidence is useful when predicting a health problem. For the second sample the model is only 51% sure that the label is 1, which is too uncertain to rely on for a health-related decision.
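The relationship between the two methods can be sketched as follows: for a random forest, `predict()` picks the class with the highest probability from `predict_proba()`. The example uses synthetic data from `make_classification` (an assumption, so it runs without the CSV file), and the 0.8 `threshold` is a hypothetical cut-off chosen only to illustrate flagging low-confidence predictions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (stand-in for the heart disease data)
X, y = make_classification(n_samples=300, n_features=13, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# One row per sample, one column per class (ordered as in clf.classes_)
proba = clf.predict_proba(X_test)

# Taking the most probable class per row reproduces predict()
labels_from_proba = clf.classes_[np.argmax(proba, axis=1)]

# Hypothetical confidence threshold: keep only predictions the model is sure about
threshold = 0.8
confident = proba.max(axis=1) >= threshold

print(proba[:5])
print(labels_from_proba[:5])
print(f"{confident.sum()} of {len(confident)} predictions reach {threshold:.0%} confidence")
```

Predictions that fall below the threshold could be routed to a human reviewer instead of being acted on automatically.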
        

#### predict()
`predict()` can also be used with regression models.

In [42]:
# Import Random Forest Regressor
from sklearn.ensemble import RandomForestRegressor

# Random Seed
np.random.seed(42)

# Create Data
X = boston_df.drop("target", axis=1)
y = boston_df["target"]

# Split data into train and test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Instantiate Random Forest Regressor
rf = RandomForestRegressor(n_estimators=100)
rf.fit(X_train, y_train)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
                      max_features='auto', max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators=100,
                      n_jobs=None, oob_score=False, random_state=None,
                      verbose=0, warm_start=False)

In [45]:
y_preds = rf.predict(X_test)
y_preds[:10]

array([23.002, 30.826, 16.734, 23.467, 16.853, 21.725, 19.232, 15.239,
       21.067, 20.738])

In [49]:
# Actual Labels
np.array(y_test[:10])

array([23.6, 32.4, 13.6, 22.8, 16.1, 20. , 17.8, 14. , 19.6, 16.8])

In [50]:
# Compare the predictions to the truth
from sklearn.metrics import mean_absolute_error
mean_absolute_error(y_test, y_preds)

2.1226372549019623

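`2.1226372549019623` is the mean absolute error (MAE): on average, each prediction is about 2.12 away from the true value, in the units of the target.

e.g. for the first test sample the absolute error is `abs(23.6 - 23.002) = 0.598`; averaging these absolute errors over every test sample gives ~2.1226.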
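As a worked example, the MAE can be computed by hand from the ten predictions and ten true values printed above. This is a minimal sketch; the array names `y_true`, `y_pred` and `abs_errors` are chosen for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# First ten true values and predictions, copied from the outputs above
y_true = np.array([23.6, 32.4, 13.6, 22.8, 16.1, 20.0, 17.8, 14.0, 19.6, 16.8])
y_pred = np.array([23.002, 30.826, 16.734, 23.467, 16.853,
                   21.725, 19.232, 15.239, 21.067, 20.738])

# MAE = mean of |truth - prediction|
abs_errors = np.abs(y_true - y_pred)
manual_mae = abs_errors.mean()

print(abs_errors)
print(manual_mae, mean_absolute_error(y_true, y_pred))
```

The value for these ten samples differs from the full test set MAE because it averages over only a subset of the errors.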