# **Model and evaluation Notebook**

## Objectives

- Fit and evaluate a classification model to predict if a patient will suffer from heart disease or not.
- Fulfil business requirement 2.

## Inputs

* outputs/datasets/collection/heart.csv

## Outputs

* Test set (features and target)
* Data cleaning and Feature Engineering pipeline
* Modeling pipeline
* Feature importance plot



---

# Set up the Working Directory

Define and confirm the working directory.

In [1]:
import os
current_dir = os.getcwd()
os.chdir(os.path.dirname(current_dir))
current_dir = os.getcwd()
current_dir

'/workspaces/heart-disease-analysis-and-prediction'

# Load data

Here we load the dataset and separate the target variable ('y') from the predictor variables ('X').

In [2]:
import numpy as np
import pandas as pd
df_raw_path = "outputs/datasets/collection/heart.csv"
df = pd.read_csv(df_raw_path)

# Separate predictors and target
X = df.drop(['target'], axis=1)
y = df['target']

print(X.shape)
X.head(3)

(1025, 13)


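| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 52 | 1 | 0 | 125 | 212 | 0 | 1 | 168 | 0 | 1.0 | 2 | 2 | 3 |
| 1 | 53 | 1 | 0 | 140 | 203 | 1 | 0 | 155 | 1 | 3.1 | 0 | 0 | 3 |
| 2 | 70 | 1 | 0 | 145 | 174 | 0 | 1 | 125 | 1 | 2.6 | 0 | 0 | 3 |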


---

# ML Pipeline

The pipeline below applies the data cleaning and feature engineering steps already discussed in the previous notebooks.

In [5]:
from sklearn.pipeline import Pipeline
from feature_engine.transformation import YeoJohnsonTransformer

def pipeline_transformation():
    # Apply a Yeo-Johnson transformation to the selected variables
    pipeline_base = Pipeline([
        ("YeoJohnsonTransformer", YeoJohnsonTransformer(
            variables=['age', 'cp', 'trestbps', 'chol', 'restecg',
                       'thalach', 'oldpeak', 'slope', 'ca', 'thal'])),
    ])

    return pipeline_base

pipeline_transformation()
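
To make the transformation step more concrete, here is a minimal NumPy sketch of the Yeo-Johnson formula for a fixed, hand-picked power parameter. The helper name `yeo_johnson` and the sample values are illustrative assumptions. `YeoJohnsonTransformer` estimates a separate parameter for each variable when it is fitted, so this is not the library's implementation.

```python
import numpy as np

def yeo_johnson(values, lmbda):
    """Apply the Yeo-Johnson power transform for a fixed lambda (illustrative helper)."""
    x = np.asarray(values, dtype=float)
    out = np.empty_like(x)
    pos = x >= 0

    # Non-negative inputs: Box-Cox-style transform applied to (x + 1)
    if abs(lmbda) > 1e-12:
        out[pos] = ((x[pos] + 1.0) ** lmbda - 1.0) / lmbda
    else:
        out[pos] = np.log1p(x[pos])

    # Negative inputs: mirrored transform with exponent (2 - lambda)
    if abs(lmbda - 2.0) > 1e-12:
        out[~pos] = -(((-x[~pos]) + 1.0) ** (2.0 - lmbda) - 1.0) / (2.0 - lmbda)
    else:
        out[~pos] = -np.log1p(-x[~pos])
    return out

# Made-up, right-skewed sample (not taken from heart.csv)
sample = np.array([0.0, 0.5, 1.0, 2.6, 6.2])
transformed = yeo_johnson(sample, lmbda=0.0)
print(transformed.round(3))
```

With a parameter of 1 the transform leaves the values unchanged, while smaller parameters compress the right tail of a skewed variable.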

---

Check the value counts of every variable to confirm the unbalanced variables seen in the count plots.

In [None]:
import pandas as pd

for column in df.columns:
    counts = df[column].value_counts()
    print(f"Value Counts for {column}:\n{counts}\n")

From the value counts above, the developer decided that the following variables are unbalanced:

- Sex
- FBS (Fasting Blood Sugar)
- Restecg (Resting Electrocardiographic Results)
- Exang (Exercise-Induced Angina)
- Slope
- Thal

The developer also decided to split the data, carry out feature engineering, evaluate the performance of the model and, if needed, balance the variables listed above.
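
One way to put a number on "unbalanced" is the share of rows taken by each variable's most frequent value. The sketch below is only an illustration: the helper `dominant_share`, the 0.6 threshold and the toy frame are all hypothetical and are not part of the notebook's analysis.

```python
import pandas as pd

def dominant_share(frame, columns):
    """Return the proportion of rows held by the most frequent value of each column."""
    shares = {
        # value_counts sorts in descending order, so the first entry is the largest share
        col: frame[col].value_counts(normalize=True).iloc[0]
        for col in columns
    }
    return pd.Series(shares).sort_values(ascending=False)

# Toy data standing in for the real dataset (values are made up)
toy = pd.DataFrame({
    "sex": [1, 1, 1, 0],
    "fbs": [0, 0, 0, 0],
    "exang": [0, 1, 0, 1],
})

shares = dominant_share(toy, ["sex", "fbs", "exang"])
# Hypothetical cut-off: flag a variable when one value covers more than 60% of rows
flagged = shares[shares > 0.6].index.tolist()
print(shares)
print(flagged)
```

The threshold is a judgment call. A stricter or looser cut-off would flag a different set of variables.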

# Missing data

Checking for missing data

In [None]:
vars_with_missing_data = df.columns[df.isna().sum() > 0].to_list()
vars_with_missing_data

There is no missing data.
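
Had any values been missing, a per-column summary would show how many rows are affected. This is a minimal sketch, assuming a hypothetical helper `missing_summary` and a toy frame with a gap inserted on purpose:

```python
import numpy as np
import pandas as pd

def missing_summary(frame):
    """Count and percentage of missing values for the columns that have any."""
    counts = frame.isna().sum()
    summary = pd.DataFrame({
        "missing_rows": counts,
        "missing_pct": (counts / len(frame) * 100).round(2),
    })
    # Keep only the columns that actually contain missing values
    return summary[summary["missing_rows"] > 0]

# Toy data with one deliberately missing age (values are made up)
toy = pd.DataFrame({
    "age": [52, 53, np.nan, 61],
    "chol": [212, 203, 174, 203],
})
print(missing_summary(toy))
```

An empty result corresponds to the situation found in this notebook, where no column has missing values.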

---

# Split data into train and test set

In [None]:
from sklearn.model_selection import train_test_split
TrainSet, TestSet, _, __ = train_test_split(
                                        df,
                                        df['target'],
                                        test_size=0.2,
                                        random_state=0)

print(f"TrainSet shape: {TrainSet.shape} \nTestSet shape: {TestSet.shape}")

As we can see, the train and test sets are divided as follows:

- TrainSet shape: (820, 14)
- TestSet shape: (205, 14)

The train set holds 80% of the data and the test set the remaining 20%.
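
The reported shapes follow from applying `test_size=0.2` to the 1025 rows of the dataset. Here is a quick arithmetic check, assuming the test share is rounded up to a whole number of rows and the remainder goes to the train set (the helper `split_sizes` is illustrative):

```python
import math

def split_sizes(n_rows, test_size):
    """Return (n_train, n_test), rounding the test share up to a whole row."""
    n_test = math.ceil(n_rows * test_size)
    return n_rows - n_test, n_test

n_train, n_test = split_sizes(1025, 0.2)
print(n_train, n_test)
```

This reproduces the 820 train rows and 205 test rows shown in the shapes above.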

# Save new cleaned data

We then export the cleaned train and test sets and save them in their folder.

In [None]:
import os
# Create the output folder; exist_ok avoids an error if it already exists
os.makedirs('outputs/datasets/cleaned', exist_ok=True)

In [None]:
TrainSet.to_csv("outputs/datasets/cleaned/TrainSetCleaned.csv", index=False)
TestSet.to_csv("outputs/datasets/cleaned/TestSetCleaned.csv", index=False)
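
A saved file can be read back to confirm that nothing was lost on export. This is a minimal sketch using a temporary directory and a toy frame. The data and location are placeholders, not the notebook's real outputs.

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for the real train set (values are made up)
toy = pd.DataFrame({"age": [52, 53, 70], "target": [0, 0, 1]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "TrainSetCleaned.csv")
    # index=False keeps the row index out of the file, so no extra column appears on reload
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

same_columns = list(reloaded.columns) == list(toy.columns)
same_values = bool((reloaded.to_numpy() == toy.to_numpy()).all())
print(reloaded.shape, same_columns, same_values)
```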

# Conclusions and pushing file to repo

From this notebook we learned that:
* The data did not need any cleaning
* There is no missing data
* A few variables are unbalanced
* The developer chose to evaluate the model's performance first, and then decide whether and how to balance those variables

---