# Heart Failure Prediction with CatBoost

# Introduction

CatBoost is a powerful algorithm for gradient boosting on decision trees. Compared to other popular tree-based algorithms such as XGBoost, Random Forest, and LightGBM, it handles categorical features directly without manual encoding and usually needs less hyperparameter tuning. In this kernel we test CatBoost and optimize it with Optuna.

The CatBoost model gave us very good accuracy compared to the other models. It also made modeling much faster, since it does not require encoding categorical features, scaling, or normalization.
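For contrast, here is a minimal sketch of the one-hot encoding step that libraries without native categorical support typically need, and that CatBoost lets us skip. The `toy` frame is made up for illustration and only borrows a few column names from this dataset.

```python
import pandas as pd

# Hypothetical toy rows; only the column names mirror the real dataset.
toy = pd.DataFrame({
    "Age": [40, 49, 37, 48],
    "Sex": ["M", "F", "M", "F"],
    "ChestPainType": ["ATA", "NAP", "ATA", "ASY"],
})

# One-hot encode the string columns: each category becomes its own column.
categorical_cols = ["Sex", "ChestPainType"]
encoded = pd.get_dummies(toy, columns=categorical_cols)

print(encoded.columns.tolist())
```

Two string columns turn into five indicator columns here. With CatBoost the raw string columns can instead be passed as they are and listed in `cat_features`.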

In this kernel we focus only on modeling. To improve the accuracy score further, we would need to handle outliers and do more data analysis. This will be done in another kernel.

# Exploratory Data Analysis

In [None]:
#importing libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
data = pd.read_csv("../input/heart-failure-prediction/heart.csv")
data.head()

#### Features description
- Age: Age of the patient [years]
- Sex: Sex of the patient [M: Male, F: Female]
- ChestPainType: [TA: Typical Angina, ATA: Atypical Angina, NAP: Non-Anginal Pain, ASY: Asymptomatic]
- RestingBP: Resting blood pressure [mm Hg]
- Cholesterol: Serum cholesterol [mg/dl]
- FastingBS: Fasting blood sugar [1: if FastingBS > 120 mg/dl, 0: otherwise]
- RestingECG: Resting electrocardiogram results [Normal: Normal, ST: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV), LVH: showing probable or definite left ventricular hypertrophy by Estes' criteria]
- MaxHR: Maximum heart rate achieved [Numeric value between 60 and 202]
- ExerciseAngina: Exercise-induced angina [Y: Yes, N: No]
- Oldpeak: ST depression induced by exercise relative to rest [numeric value]
- ST_Slope: The slope of the peak exercise ST segment [Up: upsloping, Flat: flat, Down: downsloping]
- HeartDisease: Output class [1: heart disease, 0: Normal]

In [None]:
data.info()

We have no null values. There are 5 categorical and 6 numerical features, excluding the target. FastingBS is stored as a 0/1 integer, so it is counted among the numerical features here.
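The same checks can be done programmatically. The sketch below uses a made-up `toy` frame, not the real data. It also counts zeros in the numerical columns, because `info()` only reports NaN and would not reveal a placeholder value such as 0. Whether the real dataset contains such placeholders is an assumption to verify, not something shown here.

```python
import pandas as pd

# Hypothetical toy rows; the 0 in Cholesterol is an invented placeholder.
toy = pd.DataFrame({
    "Age": [40, 49, 37, 48],
    "Sex": ["M", "F", "M", "F"],
    "Cholesterol": [289, 0, 283, 214],
    "HeartDisease": [0, 1, 0, 1],
})

# NaN counts per column (this is what info() summarizes).
null_counts = toy.isna().sum()

# Count feature types, excluding the target.
features = toy.drop(columns=["HeartDisease"])
n_numerical = features.select_dtypes(include="number").shape[1]
n_categorical = features.select_dtypes(exclude="number").shape[1]

# Zeros are not NaN, so they need a separate check.
zero_counts = (features.select_dtypes(include="number") == 0).sum()

print(null_counts.sum(), n_categorical, n_numerical)
print(zero_counts)
```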

In [None]:
categorical = data.select_dtypes('object').columns
numerical = data.drop(['HeartDisease'], axis=1).select_dtypes('number').columns

## Target Feature

In [None]:
sns.countplot(x='HeartDisease', data = data)

## Numerical Features

In [None]:
data[numerical].describe()

In [None]:
fig = plt.figure(figsize=(25,30)) #figure size
a = 4  # number of rows
b = 2  # number of columns
c = 1  # initialize plot counter


for column in numerical:
    plt.subplot(a, b, c)
    sns.histplot(x=column, data=data, color='darkred')
    c+=1
    
plt.tight_layout()
plt.show()

In [None]:
skews = data[numerical].skew()
skews

The skew values are small, which means the distributions are roughly symmetric, so no transformation is needed. In any case, CatBoost does not require data normalization.
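To make the skew values easier to read, here is a small sketch on made-up series. It compares `Series.skew()` with a hand-written bias-adjusted sample skewness (the adjusted Fisher-Pearson coefficient), which to our knowledge is the estimator pandas uses. The helper `adjusted_skew` is our own name, not a library function.

```python
import numpy as np
import pandas as pd

def adjusted_skew(values):
    """Bias-adjusted sample skewness (adjusted Fisher-Pearson coefficient)."""
    x = np.asarray(values, dtype=float)
    n = x.size
    dev = x - x.mean()
    m2 = np.mean(dev ** 2)   # second central moment
    m3 = np.mean(dev ** 3)   # third central moment
    g1 = m3 / m2 ** 1.5      # unadjusted skewness
    return np.sqrt(n * (n - 1)) / (n - 2) * g1

# Hypothetical data: one symmetric series, one with a long right tail.
symmetric = pd.Series([1, 2, 3, 4, 5])
right_skewed = pd.Series([1, 1, 1, 2, 10])

print(symmetric.skew(), adjusted_skew(symmetric))
print(right_skewed.skew(), adjusted_skew(right_skewed))
```

A value near 0 indicates a symmetric distribution, while a clearly positive value indicates a long right tail.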

In [None]:
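#numeric_only=True skips the string columns, which recent pandas versions refuse to correlate
data.corr(numeric_only=True)['HeartDisease'].drop('HeartDisease').sort_values(ascending=False)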

## Categorical Features

In [None]:
data[categorical].nunique()

In [None]:
fig = plt.figure(figsize=(15,10)) #figure size
a = 2  # number of rows
b = 3  # number of columns
c = 1  # initialize plot counter

#plotting the mean of HeartDisease (share of positive cases) per category
for column in categorical:
    plt.subplot(a, b, c)
    sns.barplot(x=column, y='HeartDisease', data=data, palette='rocket')
    c+=1
    
plt.tight_layout()
plt.show()

# CatBoost

In [None]:
#catboost
from catboost import CatBoostClassifier

#hyperparameter tuning
import optuna
#optuna.logging.set_verbosity(0)

#scoring tools
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.metrics import accuracy_score, classification_report

In [None]:
#splitting data into features and target
X = data.drop(['HeartDisease'], axis=1)
y = data['HeartDisease']

#splitting data into 5 folds
cv = KFold(n_splits=5, random_state=22, shuffle=True)
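`KFold` assigns rows to folds without looking at the target, so the share of positive cases can differ between folds. As a point of comparison only (this kernel keeps `KFold`), the sketch below contrasts it with scikit-learn's `StratifiedKFold` on made-up labels. `X_toy` and `y_toy` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical imbalanced labels: 15 negatives and 5 positives.
y_toy = np.array([0] * 15 + [1] * 5)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

plain = KFold(n_splits=5, shuffle=True, random_state=22)
stratified = StratifiedKFold(n_splits=5, shuffle=True, random_state=22)

# Number of positive cases that land in each test fold.
plain_pos = [int(y_toy[test].sum()) for _, test in plain.split(X_toy)]
strat_pos = [int(y_toy[test].sum()) for _, test in stratified.split(X_toy, y_toy)]

print("KFold positives per fold:          ", plain_pos)
print("StratifiedKFold positives per fold:", strat_pos)
```

With stratification every test fold receives the same number of positives. With plain `KFold` the counts depend on the shuffle.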

In [None]:
#catboost with default parameters
cat = CatBoostClassifier(cat_features=['FastingBS', 'Sex', 'ChestPainType', 'RestingECG', 'ExerciseAngina','ST_Slope'], verbose=0)  
scores = cross_val_score(cat, X, y, cv=cv, scoring="accuracy")
print(f'Accuracy with default parameters: {round(scores.mean(), 4)}')

In [None]:
#optuna optimization
def objective(trial):

    #parameter range
    param = {
        "objective": trial.suggest_categorical("objective", ["Logloss", "CrossEntropy"]),
        "colsample_bylevel": trial.suggest_float("colsample_bylevel", 0.01, 0.1),
        "depth": trial.suggest_int("depth", 1, 12),
        "boosting_type": trial.suggest_categorical("boosting_type", ["Ordered", "Plain"]),
        "bootstrap_type": trial.suggest_categorical("bootstrap_type", ["Bayesian", "Bernoulli", "MVS"])
    }
    
    if param["bootstrap_type"] == "Bayesian":
        param["bagging_temperature"] = trial.suggest_float("bagging_temperature", 0, 10)
    elif param["bootstrap_type"] == "Bernoulli":
        param["subsample"] = trial.suggest_float("subsample", 0.1, 1)
        
    model = CatBoostClassifier(**param, cat_features=['FastingBS', 'Sex', 'ChestPainType', 'RestingECG', 'ExerciseAngina','ST_Slope'], verbose=0)  
    
    scores = cross_val_score(
        model, X, y, cv=cv,
        scoring="accuracy"
    )
    
    return scores.mean()

#optimizing to maximize accuracy in 100 trials
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)

In [None]:
print('Number of finished trials:', len(study.trials))
print('Best trial:', study.best_trial.params)

#redefining Catboost with the best trial parameters
cat = CatBoostClassifier(**study.best_trial.params, cat_features=['FastingBS', 'Sex', 'ChestPainType', 'RestingECG', 'ExerciseAngina','ST_Slope'], verbose=0)

In [None]:
optuna.visualization.plot_optimization_history(study)

In [None]:
print(f'Accuracy after optimization: {round(study.best_value, 4)}')

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

cat.fit(X_train, y_train)

In [None]:
pred = cat.predict(X_test)
print(f'Accuracy: {round(accuracy_score(y_test, pred),4)}\n')
print(classification_report(y_test, pred))
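The numbers in the classification report can be reproduced by hand from the confusion counts. The sketch below does this for the positive class on made-up labels. `y_true` and `y_pred` are hypothetical and unrelated to the model's actual predictions.

```python
import numpy as np

# Hypothetical labels and predictions, for illustration only.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 1, 0])
y_pred = np.array([1, 1, 1, 0, 0, 0, 1, 0, 1, 0])

# Confusion counts for the positive class (1 = heart disease).
tp = int(np.sum((y_true == 1) & (y_pred == 1)))
tn = int(np.sum((y_true == 0) & (y_pred == 0)))
fp = int(np.sum((y_true == 0) & (y_pred == 1)))
fn = int(np.sum((y_true == 1) & (y_pred == 0)))

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)   # of the predicted positives, how many are correct
recall = tp / (tp + fn)      # of the actual positives, how many are found
f1 = 2 * precision * recall / (precision + recall)

print(tp, tn, fp, fn)
print(accuracy, precision, recall, f1)
```

For a medical screening task, recall on the positive class is often worth watching alongside accuracy, since a false negative is a missed case.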

In [None]:
#creating a feature importances data frame
feature_importance = np.array(cat.get_feature_importance())
features = np.array(X_train.columns)
df_importances = pd.DataFrame({'Features':features,'Feature importance':feature_importance})

#sorting values
df_importances.sort_values(by=['Feature importance'], ascending=False, inplace=True)

#barplot
ax = sns.barplot(x='Feature importance', y='Features', data=df_importances, palette='rocket')
plt.show()

In [None]:
df_importances.set_index('Features', drop=True)