## 2.2 SELECT CLASSIFICATION MODEL
In this section we analyze the heart disease data. The target is categorical, and the goal is to predict which of two classes each record belongs to: having heart disease (1) or not having heart disease (0).

A heart disease dataset already exists in this project's data directory, so we will just use that. Because it is stored as a CSV file, I first need to load it into a pandas dataframe.

In [1]:
# importing basic necessary libraries to manage data frames and set random seed
import pandas as pd 
import numpy as np 
# read the csv data and put it into the data frame df
df = pd.read_csv("../data/heart-disease.csv")
# check that the CSV file was read into the dataframe successfully
df.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


Okay, pandas created the dataframe successfully. Now I need to check the data type of each column and find out how many values are missing.

In [5]:
# checking the dataframe data types
df.dtypes

age           int64
sex           int64
cp            int64
trestbps      int64
chol          int64
fbs           int64
restecg       int64
thalach       int64
exang         int64
oldpeak     float64
slope         int64
ca            int64
thal          int64
target        int64
dtype: object

Okay, most columns are int64 (integers) and one is float64 (decimal numbers). In short, every column in this dataframe is numeric, so we should not have much trouble feeding it to a machine learning model.

Next thing to check is whether any data inside the data frame is empty. This step will use the pandas.DataFrame.isna() function.


In [3]:
# check if there are missing values inside this dataframe.
df.isna()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,False,False,False,False,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,False,False,False,False,False,False,False,False,False,False,False,False,False,False
299,False,False,False,False,False,False,False,False,False,False,False,False,False,False
300,False,False,False,False,False,False,False,False,False,False,False,False,False,False
301,False,False,False,False,False,False,False,False,False,False,False,False,False,False


Okay, this check for missing values did not behave as I expected. It returns a dataframe of the same size with True or False for every cell, whereas I only need a summary of the missing values. At least it shows the size of the whole dataframe, which is 303 rows x 14 columns.
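
One way to get that summary, as a minimal sketch: chain `.sum()` onto `isna()` to count the missing values per column. The small `demo` frame below is a made-up stand-in (not the heart disease data) with one missing cell planted on purpose so the counts are visible.

```python
import numpy as np
import pandas as pd

# Made-up stand-in frame with one deliberately missing cell in "chol"
demo = pd.DataFrame({
    "age": [63, 37, 41],
    "chol": [233.0, np.nan, 204.0],
    "target": [1, 1, 0],
})

# Count missing values per column instead of printing every cell
missing_per_column = demo.isna().sum()
print(missing_per_column)

# Total number of missing cells in the whole frame
total_missing = int(demo.isna().sum().sum())
print(total_missing)
```

On the real dataframe the same call would be `df.isna().sum()`.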

Now let's try the DataFrame.info() function to check.

In [6]:
# checking the dataframe info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       303 non-null    int64  
 1   sex       303 non-null    int64  
 2   cp        303 non-null    int64  
 3   trestbps  303 non-null    int64  
 4   chol      303 non-null    int64  
 5   fbs       303 non-null    int64  
 6   restecg   303 non-null    int64  
 7   thalach   303 non-null    int64  
 8   exang     303 non-null    int64  
 9   oldpeak   303 non-null    float64
 10  slope     303 non-null    int64  
 11  ca        303 non-null    int64  
 12  thal      303 non-null    int64  
 13  target    303 non-null    int64  
dtypes: float64(1), int64(13)
memory usage: 33.3 KB


Okay, we know the dataframe has 303 rows. Here every column is inspected, and each one reports 303 non-null values, so nothing is missing. No further preparation is needed for this dataframe.

I can now continue processing the data, starting with separating the features (X) from the labels (y).

In [7]:
# pick the y dataframe which is the heart_disease dataframe target column
y = df["target"]
# check it out
y.head()

0    1
1    1
2    1
3    1
4    1
Name: target, dtype: int64

In [8]:
# pick the X (features) dataframe, which is the rest of the dataframe excluding the target column
X = df.drop('target', axis=1)
# check it out
X.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2


Good, separating the labels (y) from the features (X) went well. Now I can split them into training and test data. To keep the split unbiased I will use the sklearn.model_selection.train_test_split function.

It shuffles the rows and selects them at random, while keeping every row of X aligned with its label in y.

However, I need to fix the random seed so that the split is repeatable, which lets me compare and validate the chosen models on the same data.

In [9]:
# set the random seed
np.random.seed(45)
# I choose 45 arbitrarily 
# set the test and train data for both features and label
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [12]:
# check the result of the split
# we now have four separate objects, so verify the shape of each
(X_train.shape, y_train.shape, X_test.shape, y_test.shape)


((242, 13), (242,), (61, 13), (61,))

Okay, the shapes are consistent: the training X and y both have 242 rows, while the test X and y both have 61.
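
As a side note on repeatability, `train_test_split` also accepts a `random_state` argument, which pins the shuffle for that call without relying on NumPy's global seed. The sketch below assumes synthetic stand-in data (`X_demo` and `y_demo`, random numbers with the same 303 x 13 shape) rather than the real CSV, and adds the optional `stratify` argument to keep the class ratio similar in both parts. It is an alternative, not what the cell above does.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same shape as the heart disease features (303 x 13)
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(303, 13)))
y_demo = pd.Series(rng.integers(0, 2, size=303))

# random_state pins the shuffle for this call; stratify keeps the class ratio similar
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=45, stratify=y_demo)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=45, stratify=y_demo)

X_tr, X_te, y_tr, y_te = split_a
print(X_tr.shape, X_te.shape)

# Both calls select exactly the same rows
same_rows = X_tr.index.equals(split_b[0].index)
print(same_rows)
```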

Now I need to choose the classification model from this image map:
<img src="../images/sklearn-ml-map.png">

Let's start from START. 
1. From the dataframe info we have 303 samples, which is more than 50, so we can continue.
1. The heart disease dataframe will be used to predict whether a subject has heart disease or not, so we are predicting a category.
1. We have labeled data, in this case the target column.
1. With 303 samples, the dataframe is well under 100K samples.

Following these steps we arrive at Linear SVC:
<img src="../images/sklearn-ml-map-cheatsheet-heart-disease-linear-svc.png">


We will use the sklearn.svm.LinearSVC
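
The path through the map can also be written down as a few conditions. `suggest_estimator` below is a hypothetical helper of my own, not part of Scikit-Learn, and it only encodes the one branch followed above.

```python
def suggest_estimator(n_samples, predicting_category, has_labels):
    """Hypothetical helper: mirrors only the classification branch walked through above."""
    if n_samples <= 50:
        return "get more data"
    if not (predicting_category and has_labels):
        return "outside the classification branch"
    if n_samples < 100_000:
        return "LinearSVC"
    return "SGDClassifier"


# 303 labeled samples, predicting a category
print(suggest_estimator(303, predicting_category=True, has_labels=True))
```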

In [16]:
# import LinearSVC
from sklearn.svm import LinearSVC
# instantiate the LinearSVC model; max_iter is raised from the default of 1000 because of a convergence warning
svc = LinearSVC(max_iter=10000)

# fit the model on the training data
svc.fit(X_train, y_train)

# score the model
svc.score(X_test, y_test)



0.8524590163934426

Still, the convergence warning does not go away, even though I already increased the number of iterations to 10,000.
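
A possible cause, which I have not verified on this dataset, is that the features sit on very different scales (`chol` is in the hundreds while `oldpeak` is a small decimal). Linear models such as LinearSVC often converge more easily on standardized features. The sketch below shows the pattern with `StandardScaler` in a pipeline. It runs on synthetic stand-in data from `make_classification`, so the score it prints says nothing about the heart disease data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in data (assumption): 303 samples, 13 features, 2 classes
X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=45)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=45)

# Standardize the features first, then fit the linear SVC on the scaled values
scaled_svc = make_pipeline(StandardScaler(), LinearSVC(max_iter=10000))
scaled_svc.fit(X_tr, y_tr)

# Mean accuracy on the held-out part of the synthetic data
score = scaled_svc.score(X_te, y_te)
print(score)
```

With the real data, the same pipeline would be fitted on `X_train, y_train` and scored on `X_test, y_test`.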

I need to find another method:
<img src="../images/sklearn-ml-map-cheatsheet-heart-disease-ensemble.png">


NOTE: this selection of ensemble classifiers is not shown methodically. The objective is to save time while still presenting a stable model with high accuracy.

sklearn.ensemble is the selected module. This leads us to the Random Forest Classifier, since we have already used this model before.

In [17]:
# import the sklearn.ensemble random forest classifier
from sklearn.ensemble import RandomForestClassifier
# instantiate the random forest model
rfc = RandomForestClassifier()

# fit the model on the training data
rfc.fit(X_train, y_train)

# score the model 
rfc.score(X_test, y_test)

0.8688524590163934

Not much different from the LinearSVC model above. The LinearSVC scores 85.25 percent while the Random Forest Classifier scores 86.89 percent. On a test set of 61 rows, that gap is a single prediction (52 correct versus 53).

However, the Random Forest Classifier trains without any warning, which makes it a more reliable choice than the LinearSVC with its convergence problem.
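
Since one 61-row test split is a small basis for comparison, a cross-validated score gives a steadier picture. Here is a minimal sketch with `cross_val_score`, again on synthetic stand-in data (`X_demo` and `y_demo` are assumptions, as are `n_estimators=50` and the seed), so the numbers are not results for the heart disease data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption): 303 samples, 13 features, 2 classes
X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=45)

# Fixed random_state so the forest itself is repeatable
rfc_demo = RandomForestClassifier(n_estimators=50, random_state=45)

# Five train/validation rounds instead of a single split
cv_scores = cross_val_score(rfc_demo, X_demo, y_demo, cv=5)
print(cv_scores.mean(), cv_scores.std())
```

The spread across folds (`cv_scores.std()`) gives a rough sense of how much a score can move between splits.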

### What about the other models?

Looking at the cheat-sheet and the examples above, you may have noticed we've skipped a few.

Why?

The first reason is time. Covering every single one would take a fair bit longer than what we've done here. And the second one is the effectiveness of ensemble methods.

A little tidbit for modelling in machine learning is:
* If you have structured data (tables or dataframes), use ensemble methods, such as a Random Forest.
* If you have unstructured data (text, images, audio, things not in tables), use deep learning or transfer learning.

For this notebook, we're focused on structured data, which is why the Random Forest has been our model of choice.

If you'd like to learn more about the Random Forest and why it's the war horse of machine learning, check out these resources:
* [Random Forest Wikipedia](https://en.wikipedia.org/wiki/Random_forest)
* [Random Forests in Python](http://blog.yhat.com/posts/random-forests-in-python.html) by yhat
* [An Implementation and Explanation of the Random Forest in Python](https://towardsdatascience.com/an-implementation-and-explanation-of-the-random-forest-in-python-77bf308a9b76) by Will Koehrsen

### Experiment until something works

The beautiful thing is, the way the Scikit-Learn API is designed, once you know the way with one model, using another is much the same.

And since a big part of being a machine learning engineer or data scientist is experimenting, you might want to try out some of the other models on the cheat-sheet and see how you go. The more you can reduce the time between experiments, the better.
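
To illustrate how similar the workflow is across estimators, here is a small sketch that fits and scores several classifiers in one loop. The choice of models, the synthetic data from `make_classification`, and the seeds are all assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption): 303 samples, 13 features, 2 classes
X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=45)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=45)

# Every estimator exposes the same fit / score interface
models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "KNeighborsClassifier": KNeighborsClassifier(),
    "RandomForestClassifier": RandomForestClassifier(random_state=45),
}

results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    results[name] = model.score(X_te, y_te)

print(results)
```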