## Fit the model/algorithm to the data and use it to make predictions 

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

### 1. Fit the model to the data

In [4]:
heart_disease = pd.read_csv("../data/heart-disease.csv")
heart_disease.head()

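| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |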


In [5]:
# Split the data into features (X) and labels (y)
X = heart_disease.drop("target", axis=1)
y = heart_disease["target"]

In [6]:
# Let's use RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Setup random seed
np.random.seed(42)

# Split data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Fit the model
rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)

# Evaluate the model
rfc.score(X_test, y_test)

0.8524590163934426
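
The call to `np.random.seed(42)` above makes the result repeatable because both `train_test_split` and `RandomForestClassifier` fall back to NumPy's global random state when no `random_state` is given. As an alternative, the seed can be passed explicitly to each step. The following is a minimal sketch under assumptions: it uses synthetic data from `make_classification` rather than the heart disease dataset, and `fit_and_score` is a hypothetical helper introduced only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption, not the heart disease dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

def fit_and_score(seed):
    """Split, fit and score, passing the seed to every random step."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.2, random_state=seed
    )
    model = RandomForestClassifier(n_estimators=50, random_state=seed)
    model.fit(X_tr, y_tr)
    return model.score(X_te, y_te)

# The same seed gives the same split and the same forest, hence the same score
score_a = fit_and_score(42)
score_b = fit_and_score(42)
print(score_a == score_b)
```

Passing `random_state` keeps the reproducibility local to each call instead of depending on global state that other code might also change.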

If you'd like to learn more about the Random Forest and why it's the workhorse of machine learning, check out these resources:

- [Random Forest Wikipedia](https://en.wikipedia.org/wiki/Random_forest)
- [Random Forests in Python by yhat](http://blog.yhat.com/posts/random-forests-in-python.html)
- [An Implementation and Explanation of the Random Forest in Python by Will Koehrsen](https://towardsdatascience.com/an-implementation-and-explanation-of-the-random-forest-in-python-77bf308a9b76)

### 2. Make predictions using a machine learning model

There are two ways to make predictions:
- `predict()`
- `predict_proba()`

In [7]:
# Compare predictions to truth labels to evaluate the model
y_preds = rfc.predict(X_test)
np.mean(y_preds == y_test)

0.8524590163934426

In [8]:
rfc.score(X_test, y_test)

0.8524590163934426

In [9]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_preds)

0.8524590163934426
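
The three cells above return the same number because, for a classifier, `score()` reports mean accuracy, which is also what `accuracy_score` and the manual `np.mean` comparison compute. A small self-contained sketch can confirm this. The synthetic `make_classification` data and the names `clf`, `manual_acc`, `score_acc` and `metric_acc` are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption for illustration)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)
preds = clf.predict(X_te)

manual_acc = np.mean(preds == y_te)       # fraction of correct predictions
score_acc = clf.score(X_te, y_te)         # estimator's built-in score
metric_acc = accuracy_score(y_te, preds)  # metric function
print(manual_acc, score_acc, metric_acc)
```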

In [13]:
# Make predictions using predict_proba
# It returns the predicted probability of each class label for every sample
rfc.predict_proba(X_test)[:5]

array([[0.89, 0.11],
       [0.49, 0.51],
       [0.43, 0.57],
       [0.84, 0.16],
       [0.18, 0.82]])

In [12]:
rfc.predict(X_test)[:5]

array([0, 1, 1, 0, 1], dtype=int64)
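
Comparing the two outputs above, each predicted label corresponds to the column holding the largest probability in the matching `predict_proba` row. The sketch below illustrates that relationship on synthetic data. The `make_classification` data and the variable names are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption for illustration)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=1)
clf = RandomForestClassifier(n_estimators=25, random_state=1).fit(X_demo, y_demo)

# One row per sample, one column per class (columns follow clf.classes_)
probs = clf.predict_proba(X_demo[:10])

# Pick the class with the highest probability in each row
labels_from_probs = clf.classes_[np.argmax(probs, axis=1)]
labels_direct = clf.predict(X_demo[:10])

print(np.array_equal(labels_from_probs, labels_direct))
```

`predict_proba` is useful when the confidence matters: a row such as `[0.49, 0.51]` and a row such as `[0.18, 0.82]` both map to label `1`, but the second is a much more confident prediction.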

`predict()` can also be used with regression models, where it returns continuous values instead of class labels.

In [50]:
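# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell requires an older scikit-learn version
from sklearn.datasets import load_boston
boston_data = load_boston()
boston_data.keys()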

dict_keys(['data', 'target', 'feature_names', 'DESCR', 'filename'])

In [52]:
boston_df = pd.DataFrame(boston_data["data"],
                         columns=boston_data["feature_names"])
boston_df["target"] = pd.Series(boston_data["target"])
boston_df.head()

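| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 | 36.2 |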


In [54]:
# Split the data into features (X) and target values (y)
X = boston_df.drop("target", axis=1)
y = boston_df["target"]

In [55]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# setup random seed
np.random.seed(42)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Fit the model
rfr = RandomForestRegressor()
rfr.fit(X_train, y_train)

# Evaluate the model
rfr.score(X_test, y_test)

0.8654448653350507

In [56]:
y_preds = rfr.predict(X_test)
y_preds[:5]

array([23.081, 30.574, 16.759, 23.46 , 16.893])

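For a regressor, `score()` returns the coefficient of determination, `R^2 = 1 - u/v`, where `u` is the residual sum of squares and `v` is the total sum of squares (see the [formula in the Scikit-Learn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html#sklearn.ensemble.RandomForestRegressor.score)). Because both sums run over the same samples, the ratio of the two mean squared errors gives the same result.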

In [57]:
# Compare the predictions to the truth
from sklearn.metrics import mean_squared_error

# u: mean squared error of the model's predictions
u = mean_squared_error(y_test, y_preds)
# v: mean squared error of always predicting the mean of y_test
mean_array = np.empty_like(y_test)
mean_array.fill(y_test.mean())
v = mean_squared_error(y_test, mean_array)
print(1 - u/v)

0.8654448653350507
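
The manual calculation above reproduces the coefficient of determination, which Scikit-Learn also provides as `sklearn.metrics.r2_score`. The sketch below checks that the three routes agree. It assumes synthetic data from `make_regression` instead of the Boston dataset, and the names `reg` and `manual_r2` are introduced only for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption for illustration)
X_demo, y_demo = make_regression(
    n_samples=300, n_features=5, noise=10.0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

reg = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_tr, y_tr)
preds = reg.predict(X_te)

# Manual R^2: 1 - (model error / error of always predicting the mean)
u = mean_squared_error(y_te, preds)
v = mean_squared_error(y_te, np.full_like(y_te, y_te.mean()))
manual_r2 = 1 - u / v

print(manual_r2, r2_score(y_te, preds), reg.score(X_te, y_te))
```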


### 3. Evaluating a machine learning model

There are three ways to evaluate Scikit-Learn models/estimators:
1. Estimator `score` method
2. The `scoring` parameter
3. Problem-specific metric functions
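
The three options listed above can be sketched side by side as follows. This is a hedged illustration, not the notebook's own code: the synthetic `make_classification` data, the choice of `cross_val_score` as the function that takes `scoring`, and the choice of precision and recall as metric functions are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in data (an assumption for illustration)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# 1. Estimator score method: mean accuracy for classifiers
test_accuracy = clf.score(X_te, y_te)

# 2. The scoring parameter, here passed to cross_val_score (5-fold)
cv_scores = cross_val_score(
    RandomForestClassifier(n_estimators=50, random_state=0),
    X_demo, y_demo, cv=5, scoring="accuracy",
)

# 3. Problem-specific metric functions from sklearn.metrics
preds = clf.predict(X_te)
precision = precision_score(y_te, preds)
recall = recall_score(y_te, preds)

print(test_accuracy, cv_scores.mean(), precision, recall)
```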
    
#### 3.1 Evaluating a model with the `score` method

In [58]:
from sklearn.ensemble import RandomForestClassifier
np.random.seed(42)

X = heart_disease.drop("target", axis=1)
y = heart_disease["target"]

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

rfc = RandomForestClassifier().fit(X_train, y_train)


In [59]:
rfc.score(X_train, y_train)

1.0

In [60]:
rfc.score(X_test, y_test)

0.8524590163934426

In [61]:
from sklearn.ensemble import RandomForestRegressor
np.random.seed(42)

X = boston_df.drop("target", axis=1)
y = boston_df["target"]

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

rfr = RandomForestRegressor().fit(X_train, y_train)


In [62]:
rfr.score(X_train, y_train)

0.9763520974033731

In [63]:
rfr.score(X_test, y_test)

0.8654448653350507

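**Remember: although both models expose the same `score` method, the underlying formulas differ. A classifier's `score` returns mean accuracy, while a regressor's `score` returns the coefficient of determination (R^2).**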

#### 3.2 Evaluating a model with the `scoring` parameter