# 3. FITTING THE MODEL TO DATA AND USING IT TO MAKE PREDICTIONS
## This is for Classification Data
This file walks through the steps to fit a machine learning model to classification data. We use the heart disease data again, just like before. For classification we also make predictions with the model produced by the fit() function. For the classification case there are two prediction functions:
1. [predict()](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier.predict) function.
1. [predict_proba()](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier.predict_proba) function.

NOTE: predict() and predict_proba() are methods of the classification estimators in scikit-learn. Every classifier provides predict(); most, but not all, also provide predict_proba() (LinearSVC, for example, does not).

Regression is covered in a separate file dedicated to fitting and predicting with regression models; there we will see that the predict_proba() function does not exist.
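A quick way to see which estimators offer which prediction function is to inspect them directly. The sketch below is illustrative only: the choice of `RandomForestRegressor` and `LinearSVC` as comparison estimators is an assumption made for the example, not something used elsewhere in this file.

```python
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.svm import LinearSVC

# Unfitted estimators are enough here: we only inspect which methods exist.
estimators = {
    "RandomForestClassifier": RandomForestClassifier(),
    "RandomForestRegressor": RandomForestRegressor(),
    "LinearSVC": LinearSVC(),
}

# Every estimator listed here can predict, but only some expose class probabilities.
has_predict = {name: hasattr(est, "predict") for name, est in estimators.items()}
has_proba = {name: hasattr(est, "predict_proba") for name, est in estimators.items()}

print(has_predict)
print(has_proba)
```

Checking with `hasattr` before calling predict_proba() is a cheap guard when the model type may change later.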

To prepare, I import numpy and pandas to load and process the heart disease data.

In [1]:
# prepare the dataframe
import numpy as np 
import pandas as pd 
# get the data from csv file.
raw_df = pd.read_csv("../data/heart-disease.csv")
raw_df.head()

   age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  slope  ca  thal  target
0   63    1   3       145   233    1        0      150      0      2.3      0   0     1       1
1   37    1   2       130   250    0        1      187      0      3.5      0   0     2       1
2   41    0   1       130   204    0        0      172      0      1.4      2   0     2       1
3   56    1   1       120   236    0        1      178      0      0.8      2   0     2       1
4   57    0   0       120   354    0        1      163      1      0.6      2   0     2       1


In [3]:
# check if the dataframe has non numeric data
raw_df.dtypes

age           int64
sex           int64
cp            int64
trestbps      int64
chol          int64
fbs           int64
restecg       int64
thalach       int64
exang         int64
oldpeak     float64
slope         int64
ca            int64
thal          int64
target        int64
dtype: object

In [5]:
# check the dataframe for NaN values
raw_df.isna().sum()

age         0
sex         0
cp          0
trestbps    0
chol        0
fbs         0
restecg     0
thalach     0
exang       0
oldpeak     0
slope       0
ca          0
thal        0
target      0
dtype: int64

In [6]:
# check any null values in dataframe
raw_df.isnull().sum()

age         0
sex         0
cp          0
trestbps    0
chol        0
fbs         0
restecg     0
thalach     0
exang       0
oldpeak     0
slope       0
ca          0
thal        0
target      0
dtype: int64

raw_df contains no NaN or null values (isnull() is an alias of isna(), so the two checks above are equivalent), and every column is numeric. The data is ready for processing.
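The checks above can be condensed into a few one-line summaries. This is a minimal sketch on a small made-up frame (`demo_df` is hypothetical, with one NaN planted on purpose); in the notebook the same calls would be applied to raw_df.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in frame with one missing value planted for illustration.
demo_df = pd.DataFrame({"age": [63, 37, np.nan], "chol": [233, 250, 204]})

# isnull is an alias of isna, so the per-column counts are identical.
same_counts = demo_df.isna().sum().equals(demo_df.isnull().sum())

# Total number of missing cells across the whole frame.
total_missing = int(demo_df.isna().sum().sum())

# Names of columns that are not numeric (an empty list means all numeric).
non_numeric_cols = demo_df.select_dtypes(exclude="number").columns.tolist()

print(same_counts, total_missing, non_numeric_cols)
```

A total of 0 missing cells and an empty list of non-numeric columns would correspond to the situation found in raw_df.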

In [7]:
# separate the target column from the dataframe
y = raw_df['target']
y.head()

0    1
1    1
2    1
3    1
4    1
Name: target, dtype: int64

In [9]:
# the remaining columns form the feature dataframe
X = raw_df.drop('target', axis=1)
X.head()

   age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  slope  ca  thal
0   63    1   3       145   233    1        0      150      0      2.3      0   0     1
1   37    1   2       130   250    0        1      187      0      3.5      0   0     2
2   41    0   1       130   204    0        0      172      0      1.4      2   0     2
3   56    1   1       120   236    0        1      178      0      0.8      2   0     2
4   57    0   0       120   354    0        1      163      1      0.6      2   0     2


In [12]:
# now split the data into train and test sets using sklearn's model_selection module.
# seeding numpy's global random generator makes the split reproducible for further testing.
from sklearn.model_selection import train_test_split
np.random.seed(42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# check the resulting shapes:
(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

((242, 13), (242,), (61, 13), (61,))

The row counts are consistent: 242 training rows and 61 test rows for both the features and the target. Ready to process further.
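As an alternative to seeding numpy globally, train_test_split accepts a `random_state` argument that pins the shuffle for that one call. The sketch below uses synthetic arrays (`X_demo`, `y_demo` are made-up stand-ins with the same 303 x 13 shape as the heart disease features) to show that two calls with the same `random_state` give the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with the same shape as the features above (303 rows, 13 columns).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(303, 13))
y_demo = rng.integers(0, 2, size=303)

# random_state fixes the shuffle for this call only, independent of np.random.seed.
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

X_tr, X_te, y_tr, y_te = split_a

# Both calls should return identical train/test parts.
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))

print(X_tr.shape, X_te.shape, same_split)
```

Passing `random_state` keeps the reproducibility local to the split, which can be easier to reason about than a global seed set earlier in the notebook.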

Now I need to choose a model to fit to the training data. I choose the [Random Forest Classifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html).

In [13]:
# import the random forest classifier model
from sklearn.ensemble import RandomForestClassifier
# instantiate the model into variable:
rfc = RandomForestClassifier()
# fit the model to the train data
rfc.fit(X_train, y_train)
# let's just put score here as reference:
rfc.score(X_test, y_test)

0.8524590163934426

Remember this score: 0.8524590163934426.

Now I want to make predictions and store them in the y_preds variable. Remember that the predict() function is already built in to the Random Forest Classifier model.

In [16]:
# making predictions
y_preds = rfc.predict(X_test)
y_preds[:5]

array([0, 1, 1, 0, 1], dtype=int64)

The predictions are stored in the y_preds array. They consist of 1s and 0s, just like the contents of y_test. Since y_preds is built from X_test, it must contain the same number of entries as y_test.

In [17]:
# check the shape of y_preds and y_test
(y_preds.shape, y_test.shape)

((61,), (61,))

y_test and y_preds have exactly the same shape. We can now compare them element by element and use numpy.mean to get the fraction of entries that match.

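NOTE: Taking the mean of the matches works because class labels are discrete, so a prediction either equals the true label or it does not; this holds for any number of classes. For regression, predictions are continuous and rarely match exactly, so it is wiser to use the sklearn.metrics library.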

In [18]:
# check the y_preds to y_test
np.mean(y_preds == y_test)

0.8524590163934426

Look at that! The mean value is the same as the score we got earlier after fitting the data to the Random Forest Classifier: 0.8524590163934426.

This means both measure the same thing: the accuracy of the model on the test data, that is, the fraction of test samples predicted correctly.
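The same number can also be obtained from accuracy_score in sklearn.metrics. The sketch below compares all three routes on synthetic data; make_classification and the names `clf`, `X_demo` and `y_demo` are stand-ins assumed for the example so that it runs without the CSV file.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic two-class data standing in for the heart disease set.
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

clf = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)
preds = clf.predict(X_te)

# Three ways to compute the fraction of correct predictions on the test set.
acc_score = clf.score(X_te, y_te)
acc_mean = np.mean(preds == y_te)
acc_metric = accuracy_score(y_te, preds)

print(acc_score, acc_mean, acc_metric)
```

For classifiers, score() reports mean accuracy, which is why the three values agree.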

Since this is a classification problem, we can also use the predict_proba() function built in to the Random Forest Classifier. Remember that this function exists only on classifier models.

In [19]:
y_proba = rfc.predict_proba(X_test)
y_proba[:5]

array([[0.89, 0.11],
       [0.49, 0.51],
       [0.43, 0.57],
       [0.84, 0.16],
       [0.18, 0.82]])

The predict_proba results are stored in the y_proba array. y_proba is a nested (two-dimensional) array in which each row holds the predicted probability of each class for one test sample. Since we only have two classes, each nested array contains only two probabilities.

To check that these are probabilities, sum them: the result should be 1.

In [21]:
y_proba[0][0] + y_proba[0][1]

1.0
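The check above looks at a single row. A minimal sketch of the same check over every row at once is given below, again on synthetic stand-in data (`clf`, `X_demo`, `y_demo` are assumed names, not the notebook's variables).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic two-class data standing in for the heart disease set.
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

clf = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)

# Sum across the class axis: one total per test sample.
row_sums = proba.sum(axis=1)

# allclose tolerates tiny floating point rounding in the sums.
rows_sum_to_one = bool(np.allclose(row_sums, 1.0))

print(proba.shape, rows_sum_to_one)
```

Using np.allclose instead of `==` avoids false alarms from floating point rounding.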

To understand which probability belongs to which class, we can revisit y_preds and compare positions.

In [22]:
y_preds[:5]

array([0, 1, 1, 0, 1], dtype=int64)

The first prediction is 0, so the probability 0.89 belongs to class 0. Thus in each nested array the first number is the probability of class 0 and the second number is the probability of class 1. The predicted label is the class with the higher probability, and the column order follows the model's classes_ attribute.
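Rather than inferring the column order from the predictions, the fitted model's classes_ attribute can be read directly. The sketch below, on synthetic stand-in data (`clf`, `X_demo`, `y_demo` are assumed names), also rebuilds the predicted labels from the probabilities to show how the two outputs relate.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic two-class data standing in for the heart disease set.
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

clf = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)

# classes_[i] is the label whose probability sits in column i of proba.
class_order = clf.classes_.tolist()

# Pick the most probable column per row and map it back to a label.
labels_from_proba = clf.classes_[np.argmax(proba, axis=1)]
matches_predict = bool(np.array_equal(labels_from_proba, clf.predict(X_te)))

print(class_order, matches_predict)
```

Reading classes_ stays reliable even when the labels are not 0 and 1, or when there are more than two classes.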