<a id="intro"></a>
# <a id='toc2_'></a>[<p style="background-color:#1E3A8A; font-family:calibri; font-size:130%; color:white; text-align:center; border-radius:0px 0px; padding:10px">Step 1 | Introduction 👋</p>](#toc0_)

⬆️ [Table of Contents](#contents_tabel)

<a id="problem"></a>
# <a id='toc3_'></a>[<b><span style='color:darkorange'>🤔 Dataset Problem</span></b>](#toc0_)

<style>
    .underline-red {
        background-color:#FFAA33;
        color: black;
        font-size: 14px;
        font-family: Arial, sans-serif;
        font-weight: bold;
    }
</style>

<div class="explain-box">
The Heart Disease dataset is a widely used dataset in the field of healthcare analytics and predictive modeling. The objective of this dataset is to 
<span class="underline-red">predict the presence or absence of heart disease based on clinical and demographic attributes</span>.

The dataset includes several important features such as 
<b>age, sex, chest pain type, resting blood pressure, cholesterol level, fasting blood sugar, resting ECG results, maximum heart rate achieved, exercise-induced angina, ST depression, and more</b>.

The target variable is the <span class="underline-red">heart disease diagnosis</span>, which is a binary classification outcome:
<b>1</b> indicates the presence of heart disease, and <b>0</b> indicates absence.

This classification task is crucial in real-world medical scenarios, as early prediction of heart disease can significantly improve patient outcomes. 
The dataset may exhibit class imbalance, requiring appropriate handling during model training and evaluation. 
It is commonly used for exploring classification algorithms and model evaluation metrics in a healthcare context.
</div>


<a id="objectives"></a>
# <a id='toc4_'></a>[<b><span style='color:darkorange'>📌 Notebook Objectives</span></b>](#toc0_)

**_Project Objective:_**  
The goal of this project is to build a predictive model that can determine the presence of heart disease based on patient health indicators. This is a binary classification task where the model aims to classify whether a person is likely to have heart disease (1) or not (0), using clinical and demographic data.

**_Key Steps:_**

1. **Problem Understanding**  
   Understand the context and importance of predicting heart disease early. Identify the key features that are likely to influence the presence of heart disease, such as age, cholesterol, and chest pain type.

2. **Exploratory Data Analysis (EDA)**  
   - Analyze the distribution of input features (e.g., age, blood pressure, cholesterol).
   - Explore the relationship between each feature and the target variable (heart disease diagnosis).
   - Detect and handle missing values, outliers, and potential data quality issues.

3. **Data Preparation**  
   - Handle missing or inconsistent data (e.g., imputation or removal).
   - Encode categorical variables (e.g., chest pain type, sex).
   - Normalize or standardize numerical features if necessary for model performance.

4. **Model Building**  
   - Train a classification model (e.g., logistic regression, decision trees, random forest).
   - Compare different algorithms and select the most effective one based on performance.
   - Tune hyperparameters to optimize model accuracy and generalization.

5. **Model Evaluation**  
   - Assess the model using appropriate classification metrics (e.g., accuracy, precision, recall, F1 score, AUC).
   - Interpret model predictions and evaluate the quality of results.
   - Check for overfitting or underfitting and apply techniques like cross-validation.

6. **Summary and Conclusion**  
   - Summarize the model’s performance and key findings.
   - Provide recommendations for improving the model or for deploying it in a real-world scenario (e.g., health diagnostics tool).


<a id="description"></a>
# <a id='toc6_'></a>[<b><span style='color:darkorange'>🧾 Dataset Description</span></b>](#toc0_)

<style>
    table {
        width: 100%;
        border-collapse: collapse;
        font-family: Arial, sans-serif;
    }
    th, td {
        border: 1px solid #ddd;
        padding: 8px;
        text-align: left;
    }
    th {
        background-color: #FFAA33;
        color: black;
        font-weight: bold;
    }
    tr:nth-child(even) {
        background-color: transparent;
    }
</style>

<table>
    <tr>
        <th>Column Name</th>
        <th>Description</th>
        <th>Type</th>
        <th>Possible Values</th>
    </tr>
    <tr>
        <td><b>age</b></td>
        <td>Age of the patient</td>
        <td>Numeric</td>
        <td>Typically 29–77</td>
    </tr>
    <tr>
        <td><b>sex</b></td>
        <td>Biological sex of the patient</td>
        <td>Binary</td>
        <td>0 = female, 1 = male</td>
    </tr>
    <tr>
        <td><b>cp</b></td>
        <td>Chest pain type</td>
        <td>Categorical</td>
        <td>0 = typical angina, 1 = atypical angina, 2 = non-anginal pain, 3 = asymptomatic</td>
    </tr>
    <tr>
        <td><b>trestbps</b></td>
        <td>Resting blood pressure (mm Hg)</td>
        <td>Numeric</td>
        <td>Typically 94–200</td>
    </tr>
    <tr>
        <td><b>chol</b></td>
        <td>Serum cholesterol in mg/dl</td>
        <td>Numeric</td>
        <td>Typically 126–564</td>
    </tr>
    <tr>
        <td><b>fbs</b></td>
        <td>Fasting blood sugar > 120 mg/dl</td>
        <td>Binary</td>
        <td>0 = False, 1 = True</td>
    </tr>
    <tr>
        <td><b>restecg</b></td>
        <td>Resting electrocardiographic results</td>
        <td>Categorical</td>
        <td>0 = normal, 1 = ST-T wave abnormality, 2 = left ventricular hypertrophy</td>
    </tr>
    <tr>
        <td><b>thalach</b></td>
        <td>Maximum heart rate achieved</td>
        <td>Numeric</td>
        <td>Typically 71–202</td>
    </tr>
    <tr>
        <td><b>exang</b></td>
        <td>Exercise-induced angina</td>
        <td>Binary</td>
        <td>0 = No, 1 = Yes</td>
    </tr>
    <tr>
        <td><b>oldpeak</b></td>
        <td>ST depression induced by exercise relative to rest</td>
        <td>Numeric</td>
        <td>Typically 0.0–6.2</td>
    </tr>
    <tr>
        <td><b>slope</b></td>
        <td>Slope of the peak exercise ST segment</td>
        <td>Categorical</td>
        <td>0 = upsloping, 1 = flat, 2 = downsloping</td>
    </tr>
    <tr>
        <td><b>ca</b></td>
        <td>Number of major vessels colored by fluoroscopy</td>
        <td>Numeric (discrete)</td>
        <td>0–3 (sometimes 4)</td>
    </tr>
    <tr>
        <td><b>thal</b></td>
        <td>Thalassemia type</td>
        <td>Categorical</td>
        <td>0 = normal, 1 = fixed defect, 2 = reversible defect</td>
    </tr>
    <tr>
        <td><b>target</b></td>
        <td>Presence of heart disease</td>
        <td>Binary</td>
        <td>0 = No disease, 1 = Disease present</td>
    </tr>
</table>
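
The codes listed in the table can be checked against a loaded frame before modeling. The following is a minimal sketch, assuming the column names and codes from the table above; `ALLOWED_CODES`, `find_unexpected_codes`, and the toy rows are hypothetical and not part of the notebook's helpers. `ca` and `thal` are left out because the table itself notes that their codes vary.

```python
import pandas as pd

# Allowed codes for some categorical columns, copied from the table above.
ALLOWED_CODES = {
    "sex": {0, 1},
    "cp": {0, 1, 2, 3},
    "fbs": {0, 1},
    "restecg": {0, 1, 2},
    "exang": {0, 1},
    "slope": {0, 1, 2},
    "target": {0, 1},
}


def find_unexpected_codes(frame: pd.DataFrame) -> dict:
    """Return {column: set of values that are not listed in ALLOWED_CODES}."""
    problems = {}
    for col, allowed in ALLOWED_CODES.items():
        if col not in frame.columns:
            continue  # skip documented columns that are absent from the frame
        observed = set(frame[col].dropna().unique().tolist())
        unexpected = observed - allowed
        if unexpected:
            problems[col] = unexpected
    return problems


# Toy rows made up for illustration (not taken from the dataset).
toy = pd.DataFrame({
    "sex": [0, 1, 1],
    "cp": [0, 3, 7],  # 7 is not a documented chest pain code
    "target": [1, 0, 1],
})
report = find_unexpected_codes(toy)
print(report)
```

An empty dictionary would mean every checked column only contains documented codes.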


<a id="ml-models"></a>
# <a id='toc5_'></a>[<b><span style='color:darkorange'>👨‍💻 Machine Learning Models</span></b>](#toc0_)

<div class="explain-box">
    The <b>models</b> used in this notebook:
    <ol start="1">
        <li><b>Logistic Regression</b>,</li>
        <li><b>Support Vector Machine (SVM)</b>,</li>
        <li><b>K-Nearest Neighbour (KNN)</b>,</li>
        <li><b>Decision Tree</b>,</li>
        <li><b>Random Forest</b>,</li>
        <li><b>Gradient Boosting</b>,</li>
        <li><b>Extra Trees Classifier</b>, and</li>
        <li><b>AdaBoost</b>.</li>
    </ol>
</div>

<a id="import_install"></a>
# <a id='toc7_'></a>[<p style="background-color:#1E3A8A; font-family:calibri; font-size:130%; color:white; text-align:center; border-radius:0px 0px; padding:10px">Step 2 | Installing and importing Libraries 📚</p>](#toc0_)


⬆️ [Table of Contents](#contents_tabel)

<a id="install"></a>
# <a id='toc8_'></a>[<b><span style='color:darkorange'>Step 2.1 |</span><span style='color:#1E3A8A'> Installing Libraries</span></b>](#toc0_)

In [None]:
# --- Installing Libraries ---
# !pip install numpy
# !pip install pandas
# !pip install matplotlib
# !pip install seaborn
# !pip install plotly
# !pip install scikit-learn
# !pip install statsmodels
# !pip install scipy
# !pip install xgboost
# !pip install lightgbm
# !pip install catboost
# !pip install yellowbrick
# !pip install ydata-profiling

<a id="import"></a>
# <a id='toc9_'></a>[<b><span style='color:darkorange'>Step 2.2 |</span><span style='color:#1E3A8A'> Importing Libraries</span></b>](#toc0_)

In [None]:
%load_ext autoreload
%autoreload 2

# ============================
# 📊 Helpers
# ============================
import helpers.preprocessing_utils as pu
from helpers.kaggle_downloader import KaggleDatasetManager
from helpers.note_box import make_dataset_summary_box, make_note_box

from helpers import model_utils as mdl
from helpers import config as cf
from helpers import data_visualization_utils as viz
from helpers import metrics_utils as met
from helpers import feature_engineering as fe


# ============================
# 📊 Data analysis and processing
# ============================
import pandas as pd
import numpy as np
from ydata_profiling import ProfileReport
import missingno as msno

import yellowbrick
from yellowbrick.classifier import ConfusionMatrix, ROCAUC, PrecisionRecallCurve
from yellowbrick.model_selection import LearningCurve, FeatureImportances
from yellowbrick.contrib.wrapper import wrap
from yellowbrick.style import set_palette

# ============================
# 📈 Models
# ============================
from sklearn import metrics
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import RobustScaler, OneHotEncoder , MinMaxScaler
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score, KFold
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier, ExtraTreesClassifier

# ============================
# 📈 Visualization: Matplotlib, Seaborn, Plotly
# ============================
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib import colors

import plotly.graph_objs as go
from plotly.subplots import make_subplots

# For Jupyter Notebook: render Plotly charts inline
import plotly.io as pio
pio.renderers.default = 'notebook'
# pio.renderers.default = 'browser' 

%matplotlib inline

# ============================
# ⚙️ Tools and warnings
# ============================
import warnings
from termcolor import colored

# ============================
# ✅ Import confirmation
# ============================
print(colored('\n✅ All libraries imported successfully.', 'green'))

<a id="lib_config"></a>
# <a id='toc10_'></a>[<b><span style='color:darkorange'>Step 2.3 |</span><span style='color:#1E3A8A'> Library configuration</span></b>](#toc0_)

In [None]:
pd.options.mode.copy_on_write = True # Enable pandas Copy-on-Write behavior
sns.set_style('darkgrid') # Seaborn style
warnings.filterwarnings('ignore') # Ignore warnings
pd.set_option('display.max_columns', None) # Show all columns of a DataFrame
pd.set_option('display.max_colwidth', None) # Show the full content of each cell without truncation

sns.color_palette("cool_r", n_colors=1)
sns.set_palette("cool_r")

print(colored('\nAll libraries configured successfully.', 'green'))

<a id="coloring"></a>
# <a id='toc10_'></a>[<b><span style='color:darkorange'>Step 2.4 |</span><span style='color:#1E3A8A'> Coloring console output</span></b>](#toc0_)

In [None]:
class clr:
    bold = '\033[1m'             
    orange = '\033[38;5;208m'  
    blue = '\033[38;5;75m'    
    reset = '\033[0m' 

    @staticmethod
    def print_colored_text():
        print(clr.orange + "This is dark orange text!" + clr.reset)
        print(clr.blue + "This is light blue text!" + clr.reset)
        print(clr.bold + "This is bold text!" + clr.reset)
        print(clr.bold + clr.orange + "This is bold dark orange text!" + clr.reset)

# Quick test of the color codes
clr.print_colored_text()


<a id="overview"></a>
# <a id='toc11_'></a>[<p style="background-color:#1E3A8A; font-family:calibri; font-size:130%; color:white; text-align:center; border-radius:0px 0px; padding:10px">Step 3 | Reading Dataset 👓</p>](#toc0_)


⬆️ [Table of Contents](#contents_tabel)

In [None]:
manager = KaggleDatasetManager("arezaei81/heartcsv")
data_path = manager.download_and_prepare()
data_path

In [None]:
# --- Importing Datasets ---
data = pd.read_csv('data/heart.csv')

# --- Copy Dataset ---
df = data.copy()

# --- Preview the first 10 rows ---
df.head(n=10)

<a id="profiling"></a>
# <a id='toc12_'></a>[<p style="background-color:#1E3A8A; font-family:calibri; font-size:130%; color:white; text-align:center; border-radius:0px 0px; padding:10px">Step 4 | Initial Dataset Exploration 🔍</p>](#toc0_)


⬆️ [Table of Contents](#contents_tabel)

<a id="df_train_profile"></a>
# <a id='toc13_'></a>[<b><span style='color:darkorange'>Step 4.1 |</span><span style='color:#1E3A8A'> Dataset Report</span></b>](#toc0_)

In [None]:
profile = ProfileReport(df, title='Dataset Profile', missing_diagrams={'heatmap': False, 'dendrogram': False})

profile.to_notebook_iframe()

<style>
    .underline {
        background-color:#FFAA33;
        color: black;
        font-size: 14px;
        font-family: Arial, sans-serif;
        font-weight: bold;
        padding: 2px 4px;
        border-radius: 4px;
    }
</style>

<div>
  <strong>▶️ <u>Dataset conclusions based on Profile Report</u></strong><br />
  <blockquote>
    <ul>
      <li>The <span class="underline">dataset</span> contains <strong>303 observations</strong> and <strong>14 features</strong>.</li>
      <li><span class="underline">Missing values:</span> No missing values were found in any column.</li>
      <li><span class="underline">Duplicates:</span> The dataset includes <strong>1 duplicated row</strong> (≈ 0.3%).</li>
      <li><span class="underline">Age</span> has a mean of <strong>54</strong> and ranges between <strong>29 and 77</strong>. Its distribution appears approximately normal based on histogram and skewness.</li>
      <li><span class="underline">Sex</span> distribution shows <strong>207 males</strong> and <strong>96 females</strong>, indicating a male-dominant dataset.</li>
      <li><span class="underline">Exang</span> (exercise-induced angina): most patients do <strong>not</strong> suffer from it.</li>
      <li><span class="underline">Resting blood pressure (trestbps)</span> ranges from <strong>94 to 200</strong>, with a mean of <strong>132</strong>, and follows a <strong>moderately right-skewed</strong> distribution.</li>
      <li><span class="underline">Cholesterol (chol)</span> values range from <strong>126 to 564</strong> with a mean around <strong>246</strong>. This feature also exhibits <span class="underline">strong right skewness</span>.</li>
      <li><span class="underline">Maximum heart rate (thalach)</span> ranges from <strong>71 to 202</strong>, average <strong>150</strong>.</li>
      <li><span class="underline">Oldpeak</span> (ST depression): mean is <strong>1</strong>, range <strong>0 to 6.2</strong>. Notably, <strong>99 values</strong> are exactly <strong>zero</strong>.</li>
      <li><span class="underline">Distributions:</span> Cholesterol and Oldpeak are <span class="underline">highly right-skewed</span>, while <span class="underline">thalach</span> is <span class="underline">moderately left-skewed</span>. These skewed distributions suggest potential outliers at the distribution tails.</li>
      <li><span class="underline">Low variance:</span> Features like <span class="underline">age</span>, <span class="underline">trestbps</span>, <span class="underline">chol</span>, and <span class="underline">thalach</span> have <span class="underline">low standard deviation</span>, indicating limited variation across records.</li>
      <li><span class="underline">Kurtosis:</span> Columns such as <span class="underline">age</span>, <span class="underline">trestbps</span>, <span class="underline">thalach</span> and <span class="underline">oldpeak</span> have kurtosis values <strong>below 3</strong>, meaning they are <span class="underline">platykurtic</span>. Meanwhile, <span class="underline">chol</span> is <strong>leptokurtic</strong> (kurtosis > 3), indicating heavier tails.</li>
      <li>
        <blockquote>📌 <span class="underline">Low standard deviation</span> implies values are <span class="underline">clustered around the mean</span>, while high deviation means greater spread.</blockquote>
        <blockquote>📌 Skewness:
          <ul>
            <li><span class="underline">|Skew| > 1</span>: highly skewed</li>
            <li><span class="underline">0.5 &lt; |Skew| ≤ 1</span>: moderately skewed</li>
            <li><span class="underline">|Skew| ≤ 0.5</span>: approximately symmetric</li>
          </ul>
        </blockquote>
        <blockquote>📌 <span class="underline">Kurtosis</span> describes the <span class="underline">tailedness of the distribution</span>. Normal (mesokurtic) distributions have kurtosis ≈ 3. Values &gt;3 → leptokurtic; &lt;3 → platykurtic.</blockquote>
      </li>
    </ul>
  </blockquote>
</div>
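
The skewness thresholds above can be applied programmatically. The sketch below is an illustration on synthetic data, not the notebook's own code: `skew_label` and `shape_summary` are hypothetical helper names. One assumption worth stating: pandas `kurt()` uses Fisher's definition (excess kurtosis, where a normal distribution scores about 0), so 3 is added here to compare against the "normal ≈ 3" convention used in the note above.

```python
import numpy as np
import pandas as pd


def skew_label(skew: float) -> str:
    """Map a skewness value to the three bands described above."""
    magnitude = abs(skew)
    if magnitude > 1:
        return "highly skewed"
    if magnitude > 0.5:
        return "moderately skewed"
    return "approximately symmetric"


def shape_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Skewness, its label, and kurtosis on the 'normal = 3' scale."""
    skew = frame.skew()
    excess_kurtosis = frame.kurt()  # Fisher's definition: normal is about 0
    return pd.DataFrame({
        "skew": skew,
        "skew_label": skew.map(skew_label),
        "kurtosis": excess_kurtosis + 3,  # shift to the 'normal = 3' scale
    })


# Synthetic columns, used only to exercise the helpers.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "symmetric": rng.normal(size=2000),
    "right_skewed": rng.exponential(size=2000),
})
summary = shape_summary(toy)
print(summary)
```

When reading kurtosis from a report, it helps to check which definition the tool uses before applying the "below 3" or "above 3" rule.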


<a id="wrangling_overview"></a>
# <a id='toc16_'></a>[<b><span style='color:darkorange'>Step 4.2 |</span><span style='color:#1E3A8A'> Dataset overview</span></b>](#toc0_)

In [None]:
dataset_name = 'Heart Disease'
df.head(n=10).style.background_gradient(cmap = "PuBu")

<a id="wrangling_num_of_records"></a>
# <a id='toc17_'></a>[<b><span style='color:darkorange'>Step 4.3 |</span><span style='color:#1E3A8A'> Number of records and columns</span></b>](#toc0_)

In [None]:
print(clr.orange + f'🍷 {dataset_name} Dataset has {df.shape[0]} rows and {df.shape[1]} columns' + clr.reset)

<a id="wrangling_basic_info"></a>
# <a id='toc18_'></a>[<b><span style='color:darkorange'>Step 4.4 |</span><span style='color:#1E3A8A'> Basic Info</span></b>](#toc0_)

In [None]:
print(clr.orange + f"\n{'='*50}")
print(f'🔍 Basic Info - {dataset_name} Dataset')
print(f"{'='*50}" + clr.reset)
df.info()

In [None]:
dtypes = df.dtypes.value_counts().to_dict()

make_dataset_summary_box(
    rows=df.shape[0],
    cols=df.shape[1],
    dtypes=dtypes,
    missing_info="There are no missing values." if not df.isnull().values.any() else "Some columns contain missing values."
)

In [None]:
txt = 'We can see that 9 columns (`sex`, `cp`, `fbs`, `restecg`, `exang`, `slope`, `ca`, `thal`, `target`) are numerical in terms of data type, but categorical in terms of their semantics.'
make_note_box(text=txt, title='🧠 Analytical Note:')

<a id="wrangling_unique_vals"></a>
# <a id='toc19_'></a>[<b><span style='color:darkorange'>Step 4.5 |</span><span style='color:#1E3A8A'> Unique Values</span></b>](#toc0_)

In [None]:
pu.check_unique_values(df, f'{dataset_name} Dataset')

<a id="wrangling_nums_cols"></a>
# <a id='toc21_'></a>[<b><span style='color:darkorange'>Step 4.6 |</span><span style='color:#1E3A8A'> More info about numerical types</span></b>](#toc0_)

In [None]:
semantic_cats = pu.detect_semantic_categoricals(df, max_unique=10, exclude_target="target")
semantic_cats
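
The implementation of `pu.detect_semantic_categoricals` is not shown in this notebook. A minimal sketch of what such a check might do, assuming it flags numeric columns with at most `max_unique` distinct values and skips the target; the function name `low_cardinality_numeric_columns` and the toy frame are hypothetical.

```python
import pandas as pd


def low_cardinality_numeric_columns(frame, max_unique=10, exclude=None):
    """Return numeric columns that have at most `max_unique` distinct values."""
    excluded = set(exclude or [])
    numeric = frame.select_dtypes(include="number")
    return [
        col
        for col in numeric.columns
        if col not in excluded and numeric[col].nunique() <= max_unique
    ]


# Toy frame made up for illustration.
toy = pd.DataFrame({
    "age": range(20, 40),    # 20 distinct values, treated as continuous
    "sex": [0, 1] * 10,      # 2 distinct values
    "cp": [0, 1, 2, 3] * 5,  # 4 distinct values
    "target": [0, 1] * 10,   # excluded explicitly
})
cats = low_cardinality_numeric_columns(toy, max_unique=10, exclude=["target"])
print(cats)
```

A cardinality threshold is only a heuristic: a genuinely numeric column with few distinct values would also be flagged, so the result is worth reviewing by hand.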

In [None]:
numerical_cols = df.select_dtypes(include='number').columns
true_numericals = [col for col in numerical_cols if col not in semantic_cats]

df[true_numericals].describe().T.style.background_gradient(axis=0)

In [None]:
pu.plot_numerical_distributions_plotly(df, true_numericals[:-1])

<a id="corelation_cols"></a>
# <a id='toc21_'></a>[<b><span style='color:darkorange'>Step 4.7 |</span><span style='color:#1E3A8A'> Correlation</span></b>](#toc0_)

In [None]:
pu.plot_correlation_heatmap(df, 'Numerical Variables Correlation Map', show_upper_triangle=False)

<div style="
    background-color: #D0E7FF;
    border-left: 6px solid #1E3A8A;
    padding: 14px;
    margin: 20px 0;
    border-radius: 8px;
    font-family: Arial, sans-serif;
    font-size: 15px;
    max-width: 900px;
    color: #1E3A8A;
    line-height: 1.6;
    text: #1E3A8A
">
    <b>🔍 Key Observation:</b><br><br>
    According to the correlation between variables, it can be seen that <code>cp</code>, <code>thalach</code>, <code>slope</code> have the highest positive correlation with the target variable. 
</div>
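
A ranking like the one described above can also be read directly from the correlation matrix rather than from the heatmap. The sketch below uses made-up data and a hypothetical helper name, `correlation_with_target`. It computes Pearson correlation, which is only a rough guide for nominal codes such as `cp`.

```python
import pandas as pd


def correlation_with_target(frame: pd.DataFrame, target: str) -> pd.Series:
    """Pearson correlation of every numeric column with `target`, sorted descending."""
    corr = frame.corr(numeric_only=True)[target].drop(target)
    return corr.sort_values(ascending=False)


# Toy data made up for illustration.
toy = pd.DataFrame({
    "up": [1, 2, 3, 4, 5, 6],
    "down": [6, 5, 4, 3, 2, 1],
    "noise": [1, 3, 2, 3, 1, 2],
    "target": [0, 0, 0, 1, 1, 1],
})
ranked = correlation_with_target(toy, "target")
print(ranked)
```

Sorting by absolute value instead would surface strong negative relationships alongside the positive ones.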


<a id="wrangling_duplicates"></a>
# <a id='toc23_'></a>[<b><span style='color:darkorange'>Step 4.8 |</span><span style='color:#1E3A8A'> Duplicates</span></b>](#toc0_)

In [None]:
pu.check_duplicates(df, f'{dataset_name} Dataset')

In [None]:
df[df.duplicated(keep=False)]

<div style="
    background-color: #D0E7FF;
    border-left: 6px solid #1E3A8A;
    padding: 14px;
    margin: 20px 0;
    border-radius: 8px;
    font-family: Arial, sans-serif;
    font-size: 15px;
    max-width: 900px;
    color: #1E3A8A;
    line-height: 1.6;
    text: #1E3A8A
">
    <b>🔍 Key Observation:</b><br><br>
    Rows 163 and 164 appear to be duplicates; one of them will need to be removed at a later stage of the analysis.
</div>


<a id="wrangling_outliers"></a>
# <a id='toc24_'></a>[<b><span style='color:darkorange'>Step 4.9 |</span><span style='color:#1E3A8A'> Outliers</span></b>](#toc0_)

<style>
    .underline-red {
        background-color: #FFAA33;
        color: black;
        font-size: 14px;
        font-family: Arial, sans-serif;
        font-weight: bold;
        padding: 2px 5px;
        border-radius: 4px;
    }
</style>

<div class="explain-box">
Results are provided for two methods:

1. <span class="underline-red"><b>IQR (Interquartile Range)</b></span>

- Values are considered outliers if they fall <b>below</b> the lower fence (<span class="underline-red">Q1 - 1.5 × IQR</span>) or <b>above</b> the upper fence (<span class="underline-red">Q3 + 1.5 × IQR</span>), where IQR = Q3 - Q1.

- This is the more classical method. It is less sensitive to extreme values because quartiles are barely affected by them.

2. <span class="underline-red"><b>Z-score (standardized Z-value)</b></span>

- Values for which <b>|Z| > 3</b> are considered outliers, where Z = (value - mean) / standard deviation.

- This method is more sensitive to the distribution of the data – if the data is not normally distributed, it may produce misleading results.
</div>
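
Both rules fit in a few lines of pandas. This is a sketch on made-up values, not the implementation behind `pu.check_outliers` (which is not shown here); `iqr_outlier_mask` and `zscore_outlier_mask` are hypothetical names, and the Z-score uses the population standard deviation (`ddof=0`) as an assumption.

```python
import pandas as pd


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


def zscore_outlier_mask(values: pd.Series, threshold: float = 3.0) -> pd.Series:
    """True where the absolute Z-score exceeds `threshold`."""
    z = (values - values.mean()) / values.std(ddof=0)
    return z.abs() > threshold


# Made-up readings with one extreme value at the end.
toy = pd.Series([120, 125, 130, 128, 122, 135, 127, 131, 124, 129, 126, 240])
iqr_outliers = toy[iqr_outlier_mask(toy)].tolist()
z_outliers = toy[zscore_outlier_mask(toy)].tolist()
print("IQR outliers:", iqr_outliers)
print("Z-score outliers:", z_outliers)
```

On small samples a single extreme value inflates the standard deviation, so the Z-score rule can miss points that the IQR rule flags.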

In [None]:
pu.check_outliers(df, f'{dataset_name} Dataset', numerical_cols=true_numericals[:-1], target_col='target')

<div style="
    background-color: #D0E7FF;
    border-left: 6px solid #1E3A8A;
    padding: 14px;
    margin: 20px 0;
    border-radius: 8px;
    font-family: Arial, sans-serif;
    font-size: 15px;
    max-width: 900px;
    color: #1E3A8A;
    line-height: 1.6;
    text: #1E3A8A
">
    <b>🔍 Key Observation:</b><br><br>
    - <code>trestbps</code>: 9 outliers<br>
    - <code>chol</code>: 5 outliers<br>
    - <code>thalach</code>: 1 outlier<br>
    - <code>oldpeak</code>: 5 outliers<br>
    - <code>age</code>: No outliers
</div>


<a id="cleaning"></a>
# <a id='toc25_'></a>[<p style="background-color:#1E3A8A; font-family:calibri; font-size:130%; color:white; text-align:center; border-radius:0px 0px; padding:10px">Step 5 | Cleaning data 📈</p>](#toc0_)


⬆️ [Table of Contents](#contents_tabel)

<a id="cleaning_columns_names"></a>
# <a id='toc27_'></a>[<b><span style='color:darkorange'>Step 5.1 |</span><span style='color:#1E3A8A'> Standardizing column names and values</span></b>](#toc0_)

In [None]:
df_clean = df.copy(deep=True)
df_clean

In [None]:
df_clean.columns

In [None]:
column_mapping_list = {'age': 'Age', 
                     'sex': 'Sex', 
                     'cp': 'Chest_pain_type', 
                     'trestbps': 'Resting_blook_pressure', 
                     'chol': 'Cholesterol',
                     'fbs': 'Fasting_blood_sugar', 
                     'restecg': 'Electrocardio_results', 
                     'thalach': 'Max_heart_rate',
                     'exang': 'Exercise_included_angina', 
                     'oldpeak': 'ST_depression', 
                     'slope': 'Slope_of_the_peak_ST', 
                     'ca': 'Num_of_colored_vessels', 
                     'thal': 'Thalassemia_type', 
                     'target': 'Present_of_heart_disease'}

df_clean.rename(columns=column_mapping_list, inplace=True)
df_clean.columns

<a id="cleaning_standarization"></a>
# <a id='toc28_'></a>[<b><span style='color:darkorange'>Step 5.2 |</span><span style='color:#1E3A8A'> Standardization of data types  </span></b>](#toc0_)

In [None]:
df_clean.dtypes

In [None]:
df_clean.nunique()

In [None]:
semantic_cats = pu.detect_semantic_categoricals(df_clean, max_unique=10, exclude_target="Present_of_heart_disease")
df_clean[semantic_cats] = df_clean[semantic_cats].astype("category")
df_clean.dtypes


<a id="remove_duplicates"></a>
# <a id='toc28_'></a>[<b><span style='color:darkorange'>Step 5.3 |</span><span style='color:#1E3A8A'> Duplicate handling  </span></b>](#toc0_)

In [None]:
df_clean[df_clean.duplicated()]

In [None]:
df_clean = df_clean.drop_duplicates(keep='first')

# Check that the duplicates were removed
print(f"Number of duplicates after cleaning: {df_clean.duplicated().sum()}")

<a id="outliers_treatment"></a>
# <a id='toc28_'></a>[<b><span style='color:darkorange'>Step 5.4 |</span><span style='color:#1E3A8A'> Outliers treatment  </span></b>](#toc0_)

<div style="
    background-color: #FFF9C4;
    border-left: 6px solid #FBC02D;
    padding: 14px;
    margin: 20px 0;
    border-radius: 8px;
    font-family: Arial, sans-serif;
    font-size: 15px;
    max-width: 900px;
    color: #555;
    line-height: 1.6;
    text: #555
">
    <b>🧠 Note - Sensitivity to Outliers:</b><br><br>
    - <b>SVM (Support Vector Machine):</b> SVMs can be sensitive to outliers. While the decision boundary is determined primarily by the support vectors, outliers can influence which data points are chosen as support vectors, potentially leading to suboptimal classification.<br>
    - <b>Decision Trees (DT) and Random Forests (RF):</b> These tree-based algorithms are generally robust to outliers. They make splits based on feature values, and outliers often end up in leaf nodes, having minimal impact on the overall decision-making process.<br>
    - <b>K-Nearest Neighbors (KNN):</b> KNN is sensitive to outliers because it relies on distances between data points to make predictions. Outliers can distort these distances.<br>
    - <b>AdaBoost:</b> This ensemble method, which often uses decision trees as weak learners, is generally robust to outliers. However, the iterative nature of AdaBoost can sometimes lead to overemphasis on outliers, making the final model more sensitive to them.
</div>


<div style="
    background-color: #FFF9C4;
    border-left: 6px solid #FBC02D;
    padding: 14px;
    margin: 20px 0;
    border-radius: 8px;
    font-family: Arial, sans-serif;
    font-size: 15px;
    max-width: 900px;
    color: #555;
    line-height: 1.6;
    text: #555
">
    <b>🧠 Approaches for Outlier Treatment:</b><br><br>
    - <b>Removal of Outliers:</b> Directly discard data points that fall outside of a defined range, typically based on a method like the Interquartile Range (IQR).<br>
    - <b>Capping Outliers:</b> Instead of removing, we can limit outliers to a certain threshold, such as the 1st or 99th percentile.<br>
    - <b>Transformations:</b> Applying transformations like log or Box-Cox can reduce the impact of outliers and make the data more Gaussian-like.<br>
    - <b>Robust Scaling:</b> Techniques like the RobustScaler in Scikit-learn can be used, which scales features using statistics that are robust to outliers.
</div>
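
Two of the approaches listed above, capping and a Box-Cox transformation, can be sketched as follows. The data is synthetic and the helper names `cap_to_percentiles` and `boxcox_shifted` are hypothetical. Box-Cox requires strictly positive input, so the sketch adds a constant shift first; that shift is an assumption for columns that contain zeros.

```python
import numpy as np
import pandas as pd
from scipy import stats


def cap_to_percentiles(values: pd.Series, lower: float = 0.01, upper: float = 0.99) -> pd.Series:
    """Clip values to the [lower, upper] percentile range."""
    low, high = values.quantile([lower, upper])
    return values.clip(lower=low, upper=high)


def boxcox_shifted(values: pd.Series, shift: float = 1.0):
    """Apply Box-Cox after shifting the data to be strictly positive.

    Returns the transformed series and the fitted lambda.
    """
    transformed, fitted_lambda = stats.boxcox(values.to_numpy(dtype=float) + shift)
    return pd.Series(transformed, index=values.index), fitted_lambda


# Synthetic right-skewed, non-negative data.
rng = np.random.default_rng(42)
toy = pd.Series(rng.exponential(scale=1.0, size=500))

capped = cap_to_percentiles(toy)
transformed, fitted_lambda = boxcox_shifted(toy)
print(f"skew raw={toy.skew():.2f}, capped={capped.skew():.2f}, box-cox={transformed.skew():.2f}")
```

Capping keeps every row but flattens the tails, while the transformation changes the scale of the whole column, so model coefficients are then expressed in transformed units.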


<div style="
    background-color: #D0E7FF;
    border-left: 6px solid #1E3A8A;
    padding: 14px;
    margin: 20px 0;
    border-radius: 8px;
    font-family: Arial, sans-serif;
    font-size: 15px;
    max-width: 900px;
    color: #1E3A8A;
    line-height: 1.6;
    text: #1E3A8A
">
    <b>🔍 Conclusion:</b><br><br>
    Given the nature of the algorithms (especially SVM and KNN) and the small size of our dataset, direct removal of outliers might not be the best approach. Instead, we'll focus on applying transformations like Box-Cox in the subsequent steps to reduce the impact of outliers and make the data more suitable for modeling.
</div>


<a id="eda"></a>
# <a id='toc34_'></a>[<p style="background-color:#1E3A8A; font-family:calibri; font-size:130%; color:white; text-align:center; border-radius:0px 0px; padding:10px">Step 7 | EDA 📈</p>](#toc0_)


⬆️ [Table of Contents](#contents_tabel)

In [None]:

df_clean.columns

<a id="visualization"></a>
# <a id='toc35_'></a>[<b><span style='color:darkorange'>Step 7.1 |</span><span style='color:#1E3A8A'> Visualizations</span></b>](#toc0_)

In [None]:

import plotly.express as px

fig1 = px.scatter(df_clean, x="Age", y="Num_of_colored_vessels", color="Chest_pain_type",
                  title="Age vs Num_of_colored_vessels (colored by Chest_pain_type)",
                  color_continuous_scale="Blues")

fig2 = px.histogram(df_clean, x="Sex", color="Present_of_heart_disease", barmode="group",
                    title="Sex vs Presence of Heart Disease",
                    color_discrete_sequence=px.colors.qualitative.Set1)

fig3 = px.histogram(df_clean, x="Chest_pain_type", color="Present_of_heart_disease", barmode="group",
                    title="Chest Pain Type vs Presence of Heart Disease",
                    color_discrete_sequence=px.colors.qualitative.Set2)

fig4 = px.scatter(df_clean, x="Max_heart_rate", y="Slope_of_the_peak_ST", color="Max_heart_rate",
                  title="Max Heart Rate vs Slope of the Peak ST",
                  color_continuous_scale="Viridis")

fig5 = px.box(df_clean, x="Present_of_heart_disease", y="Max_heart_rate", color="Present_of_heart_disease",
              title="Max Heart Rate vs Presence of Heart Disease",
              color_discrete_sequence=px.colors.qualitative.Pastel)

fig6 = px.box(df_clean, x="Exercise_included_angina", y="ST_depression", color="Exercise_included_angina",
              title="Exercise Induced Angina vs ST Depression",
              color_discrete_sequence=px.colors.qualitative.Set3)

fig7 = px.histogram(df_clean, x="Slope_of_the_peak_ST", color="Present_of_heart_disease", barmode="group",
                    title="Slope of the Peak ST vs Presence of Heart Disease",
                    color_discrete_sequence=px.colors.qualitative.Prism)

In [None]:
fig1.show()
fig2.show()
fig3.show()
fig4.show()
fig5.show()
fig6.show()
fig7.show()

<a id="modeling"></a>
# <a id='toc39_'></a>[<p style="background-color:#1E3A8A; font-family:calibri; font-size:130%; color:white; text-align:center; border-radius:0px 0px; padding:10px">Step 8 | Modeling 📈</p>](#toc0_)


⬆️ [Table of Contents](#contents_tabel)

In [None]:
df_clean.columns

<a id="modeling_scaling"></a>
# <a id='toc38_'></a>[<b><span style='color:darkorange'>Step 8.1 |</span><span style='color:#1E3A8A'> Data Balance Analysis</span></b>](#toc0_)

In [None]:
pu.check_balance_in_data(df_clean, 'Present_of_heart_disease')

In [None]:
pu.plot_target_balance(df_clean, 'Present_of_heart_disease')

<div style="
    background-color: #D0E7FF;
    border-left: 6px solid #1E3A8A;
    padding: 14px;
    margin: 20px 0;
    border-radius: 8px;
    font-family: Arial, sans-serif;
    font-size: 15px;
    max-width: 900px;
    color: #1E3A8A;
    line-height: 1.6;
    text: #1E3A8A
">
    <b>🔍 Conclusion:</b><br><br>
    ✅ The target variable is relatively well balanced, with 54% of observations labeled as "Presence of Heart Disease" and 46% labeled as "Absence of Heart Disease." <br>    
    ✅ No strong imbalance is present, which indicates that special balancing techniques (e.g., SMOTE, undersampling, or class weighting) may not be necessary during modeling.<br>     
    ✅ Despite the good balance, it is still recommended to monitor precision, recall, F1-score, and AUC during evaluation to ensure that both classes are properly captured.   

</div>
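
The balance check above relies on the notebook's own `pu` helpers, whose internals are not shown here. As a minimal sketch of the same idea, assuming only pandas and a made-up toy target column, the class shares and a minority/majority ratio could be computed like this:

```python
import pandas as pd

# Toy target column; the name mirrors the notebook, the values are made up.
toy = pd.DataFrame({"Present_of_heart_disease": [1, 1, 1, 0, 0, 1, 0, 1, 0, 1]})

# Share of each class as a fraction of all rows.
class_share = toy["Present_of_heart_disease"].value_counts(normalize=True)

# Ratio of the rarest class to the most common one; 1.0 means perfectly balanced.
balance_ratio = class_share.min() / class_share.max()

print(class_share.round(2))
print(f"minority/majority ratio: {balance_ratio:.2f}")
```

A ratio close to 1 supports the conclusion that no resampling is needed; where to draw the line is a judgment call rather than a fixed rule.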


<a id="modeling_split"></a>
# <a id='toc40_'></a>[<b><span style='color:darkorange'>Step 8.2 |</span><span style='color:#1E3A8A'> Splitting the Data into Training and Test Sets </span></b>](#toc0_)

In [None]:
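# Separate the features from the target column.
x = df_clean.drop(['Present_of_heart_disease'], axis=1)
y = df_clean['Present_of_heart_disease']

# Hold out 20% of the rows for testing; the fixed seed keeps the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)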
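
Although the classes are close to balanced, a stratified split is a cheap way to keep the class ratio nearly identical in both subsets. The sketch below uses synthetic arrays (not the notebook's data) with the same 54/46 ratio reported above; `X_demo` and `y_demo` are illustrative names.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 54 positives and 46 negatives.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([1] * 54 + [0] * 46)

# stratify=y_demo asks the splitter to preserve the class ratio in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("train positive share:", y_tr.mean())
print("test positive share:", y_te.mean())
```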

<a id="modeling_testing"></a>
# <a id='toc40_'></a>[<b><span style='color:darkorange'>Step 8.3 |</span><span style='color:#1E3A8A'> Processing pipeline </span></b>](#toc0_)

In [None]:
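# Take numeric columns from the feature frame, so the target is excluded by construction
# rather than by its position among the columns.
num_cols = x.select_dtypes(include='number').columns
num_cols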

In [None]:
# Take categorical columns from the feature frame as well, so the target can never slip in.
cat_cols = x.select_dtypes(include='category').columns
cat_cols

In [None]:
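# Numeric features: scale with statistics that are robust to outliers.
num_pipeline = Pipeline([
    ('scaling', RobustScaler())
])

# Categorical features: one-hot encode, dropping the first level of each column.
cat_pipeline = Pipeline([
    ('onehot', OneHotEncoder(drop='first'))
])

# Route each column group to its own pipeline.
preprocessor = ColumnTransformer([
    ('categorical', cat_pipeline, cat_cols),
    ('numerical', num_pipeline, num_cols)
])


process_pipeline = Pipeline([
    ('preprocessor', preprocessor)
])

# Fit on the training set only, then apply the fitted transform to the test set.
X_train_process = process_pipeline.fit_transform(X_train)
X_test_process = process_pipeline.transform(X_test)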
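
The preprocessing statistics (medians, interquartile ranges, category levels) should be learned from the training rows and then reused unchanged on the test rows. The sketch below illustrates that pattern on made-up toy frames; the column names `age` and `chest_pain` are illustrative and not taken from `df_clean`.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, RobustScaler

# Toy frames with one numeric and one categorical feature.
train = pd.DataFrame({"age": [40, 50, 60, 70], "chest_pain": ["a", "b", "a", "c"]})
test = pd.DataFrame({"age": [55, 65], "chest_pain": ["b", "c"]})

demo_preprocessor = ColumnTransformer([
    ("categorical", OneHotEncoder(drop="first"), ["chest_pain"]),
    ("numerical", RobustScaler(), ["age"]),
])

# Learn the median, IQR, and category levels from the training rows only.
train_out = demo_preprocessor.fit_transform(train)
# Reuse the learned statistics on the test rows; nothing is refitted here.
test_out = demo_preprocessor.transform(test)

print(train_out.shape, test_out.shape)
```

Both outputs have the same number of columns because the encoder's category levels were fixed at fit time; refitting on the test rows could produce a different column layout and different scaling.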

<a id="modeling_testing"></a>
# <a id='toc40_'></a>[<b><span style='color:darkorange'>Step 8.4 |</span><span style='color:#1E3A8A'> Testing Basic Models </span></b>](#toc0_)

In [None]:
results_list = []

<a id="modeling_model2"></a>
## <a id='toc38_1_'></a>[<b><span style='color:darkorange'>Step 8.4.2 |</span><span style='color:#1E3A8A'> K-Nearest Neighbour (KNN)</span></b>](#toc0_)

In [None]:
knn = KNeighborsClassifier()
# Train and predict on the preprocessed features from Step 8.3.
knn.fit(X_train_process, y_train)

y_pred = knn.predict(X_test_process)

knn_result = pu.metrics_calculator(y_pred, y_test, model_name='KNeighborsClassifier')
results_list.append(knn_result)
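
One way to make sure a model always receives preprocessed features is to chain the preprocessor and the estimator in a single `Pipeline`. This is a sketch on made-up toy data, not the notebook's method; the column names `max_hr` and `sex` and the choice of `n_neighbors=3` are assumptions for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, RobustScaler

# Made-up toy data; column names are illustrative only.
X_toy = pd.DataFrame({
    "max_hr": [150, 160, 120, 110, 155, 115, 165, 105],
    "sex": ["m", "f", "m", "f", "f", "m", "m", "f"],
})
y_toy = [0, 0, 1, 1, 0, 1, 0, 1]

knn_pipeline = Pipeline([
    ("preprocessor", ColumnTransformer([
        ("categorical", OneHotEncoder(drop="first"), ["sex"]),
        ("numerical", RobustScaler(), ["max_hr"]),
    ])),
    ("model", KNeighborsClassifier(n_neighbors=3)),
])

# fit() fits the preprocessor and the classifier on the same training rows.
knn_pipeline.fit(X_toy, y_toy)

# predict() applies the already-fitted preprocessing before classifying.
X_new = pd.DataFrame({"max_hr": [158, 112], "sex": ["f", "m"]})
toy_pred = knn_pipeline.predict(X_new)
print(toy_pred)
```

Distance-based models such as KNN and SVC are sensitive to feature scale, so this chaining matters most for them.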

<a id="modeling_model3"></a>
## <a id='toc38_1_'></a>[<b><span style='color:darkorange'>Step 8.4.3 |</span><span style='color:#1E3A8A'> SVC</span></b>](#toc0_)

In [None]:
svc = SVC()
# Train and predict on the preprocessed features from Step 8.3.
svc.fit(X_train_process, y_train)

y_pred = svc.predict(X_test_process)

svc_result = pu.metrics_calculator(y_pred, y_test, model_name='SVC')
results_list.append(svc_result)

<a id="modeling_model4"></a>
## <a id='toc38_1_'></a>[<b><span style='color:darkorange'>Step 8.4.4 |</span><span style='color:#1E3A8A'> GaussianNB</span></b>](#toc0_)

In [None]:
gnb = GaussianNB()
# Train and predict on the preprocessed features from Step 8.3.
gnb.fit(X_train_process, y_train)

y_pred = gnb.predict(X_test_process)

gnb_result = pu.metrics_calculator(y_pred, y_test, model_name='GaussianNB')
results_list.append(gnb_result)


<a id="modeling_model5"></a>
## <a id='toc38_1_'></a>[<b><span style='color:darkorange'>Step 8.4.5 |</span><span style='color:#1E3A8A'> RandomForestClassifier</span></b>](#toc0_)

In [None]:
# Fixed seed so the baseline score is reproducible between runs.
rfc = RandomForestClassifier(random_state=42)
rfc.fit(X_train_process, y_train)

y_pred = rfc.predict(X_test_process)

rfc_result = pu.metrics_calculator(y_pred, y_test, model_name='RandomForestClassifier')
results_list.append(rfc_result)


<a id="modeling_model6"></a>
## <a id='toc38_1_'></a>[<b><span style='color:darkorange'>Step 8.4.6 |</span><span style='color:#1E3A8A'> Decision Tree</span></b>](#toc0_)

In [None]:
# Fixed seed so the baseline score is reproducible between runs.
dtc = DecisionTreeClassifier(random_state=42)
dtc.fit(X_train_process, y_train)

y_pred = dtc.predict(X_test_process)

dtc_result = pu.metrics_calculator(y_pred, y_test, model_name='DecisionTreeClassifier')
results_list.append(dtc_result)


<a id="modeling_model6"></a>
## <a id='toc38_1_'></a>[<b><span style='color:darkorange'>Step 8.4.7 |</span><span style='color:#1E3A8A'> Gradient Boosting</span></b>](#toc0_)

In [None]:
# Fixed seed so the baseline score is reproducible between runs.
gdb = GradientBoostingClassifier(random_state=42)
gdb.fit(X_train_process, y_train)

y_pred = gdb.predict(X_test_process)

gdb_result = pu.metrics_calculator(y_pred, y_test, model_name='GradientBoostingClassifier')
results_list.append(gdb_result)


<a id="modeling_model1"></a>
## <a id='toc38_1_'></a>[<b><span style='color:darkorange'>Step 8.4.8 |</span><span style='color:#1E3A8A'> AdaBoost</span></b>](#toc0_)

In [None]:
# Fixed seed so the baseline score is reproducible between runs.
ada = AdaBoostClassifier(random_state=42)
ada.fit(X_train_process, y_train)

y_pred = ada.predict(X_test_process)

ada_result = pu.metrics_calculator(y_pred, y_test, model_name='AdaBoostClassifier')
results_list.append(ada_result)

<a id="modeling_model3"></a>
## <a id='toc38_1_'></a>[<b><span style='color:darkorange'>Step 8.4.9 |</span><span style='color:#1E3A8A'> Results</span></b>](#toc0_)

In [None]:
# Put the per-model results side by side; 'Purples' is the Matplotlib colormap name.
result = pd.concat(results_list, axis=1).style.background_gradient(cmap='Purples')
result
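
The comparison table is built from whatever `pu.metrics_calculator` returns, and that helper is not shown in this notebook. As a hedged sketch of what such a helper might compute, the hypothetical `toy_metrics` function below returns one column of scores per model using scikit-learn's metric functions; the labels and predictions are made up.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score


def toy_metrics(y_true, y_pred, model_name):
    """Hypothetical stand-in for a metrics helper: one column of scores per model."""
    scores = {
        "Accuracy": accuracy_score(y_true, y_pred),
        "Precision": precision_score(y_true, y_pred),
        "Recall": recall_score(y_true, y_pred),
        "F1": f1_score(y_true, y_pred),
    }
    return pd.Series(scores, name=model_name).to_frame()


# Made-up labels and predictions for two imaginary models.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
pred_a = [1, 0, 1, 0, 0, 1, 0, 1]
pred_b = [1, 1, 1, 1, 0, 1, 0, 0]

# Concatenate along columns so each model becomes one column of the table.
comparison = pd.concat(
    [toy_metrics(y_true, pred_a, "Model A"), toy_metrics(y_true, pred_b, "Model B")],
    axis=1,
)
print(comparison.round(3))
```

scikit-learn's metric functions take the true labels first. Swapping the two arguments exchanges precision and recall, so it is worth checking which order the notebook's own helper expects.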

<a id="modeling_testing"></a>
# <a id='toc40_'></a>[<b><span style='color:darkorange'>Step 8.5 |</span><span style='color:#1E3A8A'> Tuning Hyperparameters </span></b>](#toc0_)

In [None]:
rf_params = {
    'n_estimators': np.arange(50, 301, 50),         # Number of trees (50 to 300 in steps of 50)
    'max_depth': [None, 10, 20, 30],                # Maximum tree depth (None means no limit)
    'min_samples_split': [2, 5, 10],                # Minimum number of samples required to split a node
    'min_samples_leaf': [1, 2, 4],                  # Minimum number of samples required in a leaf
    'max_features': ['sqrt', 'log2', None],         # Number of features considered at each split
    'bootstrap': [True, False],                     # Whether to draw bootstrap samples
    'oob_score': [True, False],                     # Out-of-bag score (only valid when bootstrap=True)
    'random_state': [42]                            # Fixed random seed for reproducible results
}


logreg_params = {
    # 'penalty': ['l2'],    # Regularization type
    'C': [0.1, 1, 10],                     # Inverse regularization strength (smaller means stronger)
    'solver': ['liblinear', 'lbfgs', 'saga'],       # Optimization algorithms
    'max_iter': [100, 200, 500],                    # Maximum number of iterations
    'tol': [1e-4, 1e-3, 1e-2],                      # Convergence tolerance
    'random_state': [42]                            # Random seed
}


svc_params = {
    # 'C': [0.1, 1, 10],                     # Regularization parameter
    'kernel': ['linear', 'poly', 'rbf', 'sigmoid'], # Kernel type
    'degree': [2, 3, 4, 5],                         # Polynomial degree (used by 'poly' only)
    'gamma': ['scale', 'auto', 0.01, 0.1, 1],       # Kernel coefficient gamma
    'coef0': [0.0, 0.1, 0.5],                       # Independent term for the 'poly' and 'sigmoid' kernels
    'shrinking': [True, False],                     # Enable or disable the shrinking heuristic
    'probability': [True, False],                   # Whether to enable probability estimates
    'tol': [1e-3, 1e-4],                            # Convergence tolerance
    'random_state': [42]                            # Random seed
}

gnb_params = {
    'var_smoothing': np.logspace(-9, -3, 7)        # Variance smoothing for numerical stability
}
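
In `rf_params`, `bootstrap` and `oob_score` are varied independently, so the grid contains the pairing `bootstrap=False` with `oob_score=True`, which `RandomForestClassifier` rejects at fit time. One possible way around this, sketched below with scikit-learn's `ParameterGrid`, is to pass a list of smaller grids so that `oob_score` is only varied when bootstrapping is on. The names `flat_grid` and `split_grid` are illustrative.

```python
from sklearn.model_selection import ParameterGrid

# Flat grid: includes the invalid pairing bootstrap=False with oob_score=True.
flat_grid = {"bootstrap": [True, False], "oob_score": [True, False]}

# Split grid: oob_score is only varied when bootstrap sampling is on.
split_grid = [
    {"bootstrap": [True], "oob_score": [True, False]},
    {"bootstrap": [False], "oob_score": [False]},
]

flat_combos = list(ParameterGrid(flat_grid))
split_combos = list(ParameterGrid(split_grid))

# Combinations that would fail when a random forest is fitted.
invalid = [c for c in flat_combos if c["oob_score"] and not c["bootstrap"]]
print(len(flat_combos), len(split_combos), len(invalid))
```

`GridSearchCV` accepts the same list-of-dicts form for its `param_grid` argument, so the remaining forest parameters can be repeated in each sub-grid.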


In [None]:
# Uncomment an entry to include that model in the search.
models = [
    # {'name': 'SVC', 'model': SVC, 'params': svc_params},
    # {'name': 'Random Forest', 'model': RandomForestClassifier, 'params': rf_params},
    {'name': 'Gaussian Naive Bayes', 'model': GaussianNB, 'params': gnb_params}
]

In [None]:
mdl.run_model_comparison(models, X_train, y_train, X_test, y_test)
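
The internals of `mdl.run_model_comparison` are not shown in this notebook. As a rough sketch of what a comparison over such a `models` list could look like, the loop below runs scikit-learn's `GridSearchCV` for each entry on synthetic data. The 5-fold setting, the F1 scoring, and the names `candidates` and `summary` are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data; the notebook's own features are not used here.
X_syn, y_syn = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42, stratify=y_syn
)

# Same entry shape as the notebook's list: a name, a model class, and a grid.
candidates = [
    {"name": "Gaussian Naive Bayes", "model": GaussianNB,
     "params": {"var_smoothing": np.logspace(-9, -3, 7)}},
]

summary = {}
for entry in candidates:
    # Instantiate the class, then search its grid with 5-fold cross-validation.
    search = GridSearchCV(entry["model"](), entry["params"], cv=5, scoring="f1")
    search.fit(X_tr, y_tr)
    summary[entry["name"]] = {
        "best_params": search.best_params_,
        "cv_f1": search.best_score_,
        # score() uses the same F1 scorer on the held-out test rows.
        "test_f1": search.score(X_te, y_te),
    }

for name, info in summary.items():
    print(name, info)
```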