# Random Forest: Predicting CHD

Here is an interesting problem: understanding which factors contribute to CHD, and whether CHD can be predicted by building an analytical model.

The next two sections introduce the basics of CHD, where the dataset comes from, and which attributes are available in it.

### What is coronary heart disease?


[Coronary heart disease (CHD)](https://en.wikipedia.org/wiki/Coronary_artery_disease) occurs when the coronary arteries (the arteries that supply the heart muscle with oxygen-rich blood) become narrowed by a gradual build-up of plaque within their walls. Plaque is a fatty material made up of cholesterol and other substances. Narrowed arteries can cause symptoms such as chest pain (angina), shortness of breath, and fatigue.


### Dataset Description

Data is available at: http://statweb.stanford.edu/~tibs/ElemStatLearn/datasets/
Header information is available at: http://statweb.stanford.edu/~tibs/ElemStatLearn/datasets/SAheart.info.txt

A retrospective sample of **males in a heart-disease high-risk region of the Western Cape, South Africa**. There are roughly two controls per case of CHD. Many of the CHD positive men have undergone blood pressure reduction treatment and other programs to reduce their risk factors after their CHD event. In some cases the measurements were made after these treatments. These data are taken from a larger dataset, described in Rousseauw et al, 1983, South African Medical Journal. 

### Import and load the dataset

In [4]:
import pandas as pd
import numpy as np

np.random.seed(100)



In [106]:
bank_df = pd.read_csv( "bank.csv")

In [107]:
bank_df.head()

|   | age | job | marital | education | default | balance | housing-loan | personal-loan | current-campaign | previous-campaign | subscribed |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 30 | unemployed | married | primary | no | 1787 | no | no | 1 | 0 | no |
| 1 | 33 | services | married | secondary | no | 4789 | yes | yes | 1 | 4 | no |
| 2 | 35 | management | single | tertiary | no | 1350 | yes | no | 1 | 1 | no |
| 3 | 30 | management | married | tertiary | no | 1476 | yes | yes | 4 | 0 | no |
| 4 | 59 | blue-collar | married | secondary | no | 0 | yes | no | 1 | 0 | no |


In [108]:
bank_df.columns

Index(['age', 'job', 'marital', 'education', 'default', 'balance',
       'housing-loan', 'personal-loan', 'current-campaign',
       'previous-campaign', 'subscribed'],
      dtype='object')

In [109]:
bank_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4521 entries, 0 to 4520
Data columns (total 11 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   age                4521 non-null   int64 
 1   job                4521 non-null   object
 2   marital            4521 non-null   object
 3   education          4521 non-null   object
 4   default            4521 non-null   object
 5   balance            4521 non-null   int64 
 6   housing-loan       4521 non-null   object
 7   personal-loan      4521 non-null   object
 8   current-campaign   4521 non-null   int64 
 9   previous-campaign  4521 non-null   int64 
 10  subscribed         4521 non-null   object
dtypes: int64(4), object(7)
memory usage: 388.6+ KB


In [110]:
bank_df.subscribed.value_counts(normalize=True)

no     0.88476
yes    0.11524
Name: subscribed, dtype: float64
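With roughly 88% of the rows in the `no` class, accuracy alone can be misleading. The sketch below uses hypothetical labels with a similar split (not the real data) to show that a model that always predicts the majority class already reaches about 88% accuracy while finding none of the positives. The arrays `X` and `y` are made up for illustration.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels with roughly the same split as `subscribed` (assumption)
y = np.array([0] * 885 + [1] * 115)
rng = np.random.default_rng(0)
X = rng.normal(size=(len(y), 3))  # the dummy model ignores the features

# Baseline that always predicts the most frequent class
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = baseline.predict(X)

baseline_accuracy = accuracy_score(y, pred)
minority_recall = recall_score(y, pred)  # recall for class 1

print(f"accuracy: {baseline_accuracy:.3f}")
print(f"minority-class recall: {minority_recall:.3f}")
```

Any model trained later is worth comparing against this kind of baseline, with particular attention to recall on class 1.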

The class label in the column **chd** indicates whether the person has coronary heart disease: negative (0) or positive (1).

Attributes description: 
- **sbp**:          systolic blood pressure 
- **tobacco**:      cumulative tobacco (kg) 
- **ldl**:          low density lipoprotein cholesterol 
- **adiposity**:    the size of the hips compared to the person's height 
- **famhist**:      family history of heart disease (Present, Absent) 
- **typea**:        type-A behavior 
- **obesity**:      body mass index (BMI)
- **alcohol**:      current alcohol consumption 
- **age**:          age at onset

### Encoding Categorical Features

In [112]:
# Assigning list of all column names in the DataFrame
X_features = list( bank_df.columns )
# Remove the response variable from the list
X_features.remove( 'subscribed' )
X_features

['age',
 'job',
 'marital',
 'education',
 'default',
 'balance',
 'housing-loan',
 'personal-loan',
 'current-campaign',
 'previous-campaign']

In [126]:
cat_vars = list(bank_df[X_features].select_dtypes(include='object').columns)
cat_vars

['job', 'marital', 'education', 'default', 'housing-loan', 'personal-loan']

In [127]:
encoded_bank_df = pd.get_dummies(bank_df[X_features], 
                                 columns=cat_vars,
                                 drop_first = True )

In [128]:
encoded_bank_df.head( 10 )

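|   | age | balance | current-campaign | previous-campaign | job_blue-collar | job_entrepreneur | job_housemaid | job_management | job_retired | job_self-employed | ... | job_unemployed | job_unknown | marital_married | marital_single | education_secondary | education_tertiary | education_unknown | default_yes | housing-loan_yes | personal-loan_yes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 30 | 1787 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 33 | 4789 | 1 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 1 |
| 2 | 35 | 1350 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
| 3 | 30 | 1476 | 4 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 |
| 4 | 59 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 5 | 35 | 747 | 2 | 3 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| 6 | 36 | 307 | 1 | 2 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 7 | 39 | 147 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 8 | 41 | 221 | 2 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 9 | 43 | -88 | 1 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |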
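To make the effect of `drop_first=True` concrete, here is a minimal sketch on a toy frame (the `toy` DataFrame is hypothetical, not part of the dataset). Without `drop_first`, each category gets its own indicator column; with it, the first category in sorted order is dropped and becomes the implicit reference level, which is why the encoded frame above has `marital_married` and `marital_single` but no `marital_divorced`.

```python
import pandas as pd

# Toy column with three categories (hypothetical values)
toy = pd.DataFrame({"marital": ["married", "single", "divorced", "married"]})

# One indicator column per category
full = pd.get_dummies(toy, columns=["marital"])

# Drop the first category (alphabetically 'divorced'); it becomes the baseline
reduced = pd.get_dummies(toy, columns=["marital"], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

A row with zeros in every remaining `marital_*` column therefore represents the dropped category.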


In [140]:
# Encode the response as 1 ('yes') or 0 (anything else)
bank_df.subscribed = bank_df.subscribed.map(lambda x: 1 if x == 'yes' else 0)

## Splitting Dataset into Train and Test

In [141]:
from sklearn.model_selection import train_test_split

In [142]:
train_X, test_X, train_y, test_y = train_test_split(encoded_bank_df,
                                                    bank_df.subscribed,
                                                    test_size = 0.3,
                                                    random_state = 42 ) 

In [143]:
len( train_X )

3164

In [144]:
len( test_X )

1357
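The split above does not pass `stratify`, so the share of positives can differ slightly between the train and test sets. As a sketch of one possible alternative, the example below uses made-up labels (the arrays `X` and `y` are hypothetical) to show how `stratify=y` keeps the class proportions aligned in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 12% positives (assumption, not the real data)
y = np.array([0] * 880 + [1] * 120)
X = np.arange(len(y)).reshape(-1, 1)

# stratify=y asks the splitter to preserve the class ratio in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"positive rate in train: {train_rate:.3f}")
print(f"positive rate in test:  {test_rate:.3f}")
```

Whether stratification matters here depends on how much the minority share varies across random splits; with only about 11% positives it is a cheap safeguard.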

In [145]:
!pip install imbalanced-learn



In [146]:
from imblearn.over_sampling import RandomOverSampler

ros = RandomOverSampler()

X_resampled, y_resampled = ros.fit_resample(train_X, train_y)

In [147]:
y_resampled.value_counts()

1    2795
0    2795
Name: subscribed, dtype: int64
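Random oversampling balances the classes by duplicating minority rows until both classes have the same count. The sketch below is a hand-rolled pandas approximation of that idea on a hypothetical `train` frame; it is not imblearn's implementation, only an illustration of what the resampled counts above represent.

```python
import numpy as np
import pandas as pd

# Hypothetical training frame with an 88/12 class split (assumption)
rng = np.random.default_rng(42)
train = pd.DataFrame({
    "balance": rng.integers(0, 5000, size=100),
    "subscribed": [0] * 88 + [1] * 12,
})

majority = train[train.subscribed == 0]
minority = train[train.subscribed == 1]

# Draw extra minority rows with replacement to close the gap
n_extra = len(majority) - len(minority)
extra = minority.sample(n=n_extra, replace=True, random_state=42)

# Original rows plus the duplicated minority rows
balanced = pd.concat([train, extra]).reset_index(drop=True)
counts = balanced.subscribed.value_counts()
print(counts)
```

As in the notebook, only the training data is resampled; the test set keeps its original class distribution so that the evaluation reflects real-world proportions.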

In [174]:
from sklearn.tree import DecisionTreeClassifier

In [196]:
tree_clf = DecisionTreeClassifier(max_depth=5,
                                  criterion='gini')

In [197]:
tree_clf.fit(train_X, train_y)

DecisionTreeClassifier(max_depth=5)

In [198]:
from sklearn.metrics import classification_report

In [199]:
print(classification_report(test_y,
                            tree_clf.predict(test_X)))

              precision    recall  f1-score   support

           0       0.89      1.00      0.94      1205
           1       0.50      0.03      0.05       152

    accuracy                           0.89      1357
   macro avg       0.70      0.51      0.50      1357
weighted avg       0.85      0.89      0.84      1357
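The report shows high accuracy alongside very low recall for class 1. To see how those numbers relate, here is a small sketch that derives precision, recall, and accuracy from a confusion matrix. The label arrays are a made-up toy example, not the model's actual predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy example: 8 negatives, 4 positives; the model finds only 1 positive
y_true = np.array([0] * 8 + [1] * 4)
y_pred = np.array([0] * 8 + [1, 0, 0, 0])

# For binary labels, ravel() yields tn, fp, fn, tp in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)   # of predicted positives, how many are correct
recall = tp / (tp + fn)      # of actual positives, how many were found
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.2f} recall={recall:.2f} accuracy={accuracy:.2f}")
```

Accuracy is dominated by the many true negatives, so it stays high even when most positives are missed.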



## Building Boosting Models (AdaBoost and Gradient Boosting)

In [163]:
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier

In [164]:
adaboost = AdaBoostClassifier(n_estimators=100,
                              learning_rate = 0.2)

In [165]:
adaboost.fit(train_X, train_y)

AdaBoostClassifier(learning_rate=0.2, n_estimators=100)

In [166]:
gboost = GradientBoostingClassifier(n_estimators=10, min_samples_leaf=8)
gboost.fit(X_resampled, y_resampled)

GradientBoostingClassifier(min_samples_leaf=8, n_estimators=10)

### Predicting in test set using the model

In [171]:
adaboost_test_results = pd.DataFrame( { 'actual':  test_y, 
                                        'predicted': adaboost.predict( test_X ) } )

In [172]:
from sklearn.metrics import classification_report

In [173]:
print(classification_report(adaboost_test_results.actual,
                            adaboost_test_results.predicted))

              precision    recall  f1-score   support

           0       0.89      1.00      0.94      1205
           1       0.50      0.01      0.03       152

    accuracy                           0.89      1357
   macro avg       0.69      0.51      0.48      1357
weighted avg       0.85      0.89      0.84      1357
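Only the AdaBoost model is evaluated above. One way to compare several fitted models on the minority class is to collect the class-1 row of each classification report, as in the sketch below. It runs on synthetic data from `make_classification` (an assumption standing in for the encoded frame), and both models are fitted on the same training data here, unlike the notebook, where the gradient boosting model is fitted on the oversampled set.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded data: about 88% negatives (assumption)
X, y = make_classification(n_samples=1000, n_features=10, weights=[0.88],
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Same hyperparameters as the cells above
models = {
    "adaboost": AdaBoostClassifier(n_estimators=100, learning_rate=0.2),
    "gboost": GradientBoostingClassifier(n_estimators=10, min_samples_leaf=8),
}

minority_scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    report = classification_report(y_test, model.predict(X_test),
                                   output_dict=True, zero_division=0)
    # The report dict is keyed by class label as a string
    minority_scores[name] = {
        metric: report["1"][metric]
        for metric in ("precision", "recall", "f1-score")
    }

for name, scores in minority_scores.items():
    print(name, scores)
```

The same loop could be pointed at `test_X` and `test_y` with the already fitted `adaboost` and `gboost` objects to put their class-1 metrics side by side.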



In [170]:
import xgboost