# **Data Cleaning Notebook**

## Objectives

- Clean the data
- Split the data into train and test sets

## Inputs

* outputs/datasets/collection/heart.csv

## Outputs

* Generate Train and Test sets from cleaned data, saved under outputs/datasets/cleaned



---

# Set up the Working Directory

Define and confirm the working directory.

In [None]:
import os
current_dir = os.getcwd()
os.chdir(os.path.dirname(current_dir))
current_dir = os.getcwd()
current_dir

# Load data

In [None]:
import pandas as pd
df_raw_path = "outputs/datasets/collection/heart.csv"
df = pd.read_csv(df_raw_path)
df.head(3)

---

# Check imbalanced data

We noticed in the previous notebook that some of the variables were imbalanced.
The developer will visualize all features with count plots.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

for feature in df.columns:
    if feature != 'target':  # Exclude the target variable itself
        plt.figure(figsize=(8, 4))  # Adjust figure size as needed
        sns.countplot(x=feature, hue='target', data=df, palette='Set2')
        plt.title(f'Class Distribution by {feature}')
        plt.xlabel(feature)
        plt.ylabel('Count')
        plt.legend(title='Target', loc='upper right')
        plt.show()

After visualizing all features, and because the business problem is a **Classification** task that benefits from balanced features, the developer decided to address the feature imbalance.

The value counts of every variable are checked below to confirm the imbalanced variables seen in the count plots.

In [None]:
import pandas as pd

for column in df.columns:
    counts = df[column].value_counts()
    print(f"Value Counts for {column}:\n{counts}\n")

From the information above, the developer decided that the imbalanced variables are the following:

- sex
- FBS (Fasting Blood Sugar)
- Restecg (Resting Electrocardiographic Results)
- Exang (Exercise-Induced Angina)
- Slope
- Thal

The developer also decided to first split the data, perform feature engineering, and evaluate the performance of the model, and only then balance the variables listed above if needed.
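The visual judgement above can also be expressed as a number. The following is a minimal sketch, not the notebook's actual method: it computes, for each column, the share of rows taken by the most frequent value. The toy data, the `majority_share` helper and the 70% threshold are all assumptions made for illustration.

```python
import pandas as pd

# Toy data standing in for the heart dataset (values are illustrative only)
toy_df = pd.DataFrame({
    "sex": [1, 1, 1, 0, 1, 1, 0, 1],
    "fbs": [0, 0, 0, 0, 0, 0, 1, 0],
    "cp":  [0, 1, 2, 3, 0, 1, 2, 3],
})

def majority_share(frame: pd.DataFrame) -> pd.Series:
    """Return, per column, the fraction of rows held by the most frequent value."""
    # value_counts sorts in descending order, so the first entry is the majority value
    return frame.apply(lambda col: col.value_counts(normalize=True).iloc[0])

shares = majority_share(toy_df)

# Hypothetical cut-off: flag columns where one value covers more than 70% of rows
IMBALANCE_THRESHOLD = 0.70
flagged = shares[shares > IMBALANCE_THRESHOLD].index.tolist()

print(shares.round(2))
print("Flagged:", flagged)
```

A threshold like this only makes sense for categorical columns; continuous features such as age would need binning first.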

---

# Missing data

Checking for missing data

In [None]:
vars_with_missing_data = df.columns[df.isna().sum() > 0].to_list()
vars_with_missing_data

There is no missing data.
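When a dataset does contain gaps, a list of column names alone says little about their extent. As a hedged sketch, the check can be extended to report counts and percentages per column. The toy frame below is invented for illustration, and `missing_report` is a hypothetical name.

```python
import numpy as np
import pandas as pd

# Toy frame with one deliberately missing value (illustrative only)
toy_df = pd.DataFrame({
    "age": [63, 37, np.nan, 56],
    "chol": [233, 250, 204, 236],
})

missing_count = toy_df.isna().sum()
missing_report = pd.DataFrame({
    "missing_rows": missing_count,
    "missing_pct": (missing_count / len(toy_df) * 100).round(2),
})

# Keep only the columns that actually have missing values
missing_report = missing_report[missing_report["missing_rows"] > 0]
print(missing_report)
```

An empty report corresponds to the "no missing data" result found in this notebook.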

---

# Split data into train and test set

The target variable is dropped from the feature set before splitting, because the developer thinks the previous split of the data is what led to target leakage and to the bug of 100% accuracy on both the train and test sets in the model notebook.

In [None]:
from sklearn.model_selection import train_test_split

features = df.drop('target', axis=1)  # drop the target variable from the feature set
target = df['target']

TrainSet, TestSet, train_target, test_target = train_test_split(
    features,
    target,
    test_size=0.2,
    random_state=0
)

print(f"TrainSet shape: {TrainSet.shape} \nTestSet shape: {TestSet.shape}")

As we can see, the train and test sets are divided as follows:

- TrainSet shape: (820, 14)
- TestSet shape: (205, 14)

The train set holds 80% of the data and the test set the remaining 20%.
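Since leakage was the concern behind this split, a few cheap sanity checks may help. The sketch below uses toy data, not the heart dataset. It adds `stratify=target`, an option the notebook's split does not use, to keep class proportions equal in both sets. It then asserts that the label column is absent from the features and that no row appears in both sets. The names `X_train`, `X_test`, `y_train` and `y_test` are placeholders.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 20 rows with a perfectly balanced binary target (illustrative only)
toy_df = pd.DataFrame({
    "age": range(20, 40),
    "sex": [0, 1] * 10,
    "target": [0] * 10 + [1] * 10,
})
features = toy_df.drop("target", axis=1)
target = toy_df["target"]

# stratify=target is an assumption added here; it preserves the class ratio
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, random_state=0, stratify=target
)

# Leakage checks: the label must not be a feature, and no row may sit in both sets
assert "target" not in X_train.columns
assert set(X_train.index).isdisjoint(X_test.index)

print(f"Train rows: {len(X_train)}, test rows: {len(X_test)}")
print(f"Positive share, train: {y_train.mean():.2f}, test: {y_test.mean():.2f}")
```

Stratifying matters most when the target classes are uneven; with a balanced target, a plain random split usually gives similar proportions.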

# Save new cleaned data

We then export and save the cleaned data in its folder.

In [None]:
import os

# Create the output folder; exist_ok=True avoids an error if it already exists
os.makedirs('outputs/datasets/cleaned', exist_ok=True)

In [None]:
TrainSet.to_csv("outputs/datasets/cleaned/TrainSetCleaned.csv", index=False)
TestSet.to_csv("outputs/datasets/cleaned/TestSetCleaned.csv", index=False)
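A quick way to confirm an export like this is to read the file back and compare it with the in-memory frame. The following is a minimal sketch under stated assumptions: it writes a toy frame to a temporary directory rather than to the project's output folder, and `toy_train` is an invented stand-in for `TrainSet`.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for TrainSet (illustrative values only)
toy_train = pd.DataFrame({"age": [63, 37, 41], "sex": [1, 1, 0]})

# Write to a throwaway folder so the sketch leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "TrainSetCleaned.csv")
    toy_train.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

# Compare shape and column order between the original and the reloaded frame
same_shape = reloaded.shape == toy_train.shape
same_columns = list(reloaded.columns) == list(toy_train.columns)
print(f"Same shape: {same_shape}, same columns: {same_columns}")
```

Passing `index=False`, as the notebook does, keeps the row index out of the file, so the reloaded frame has no extra unnamed column.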

# Conclusions and pushing file to repo

From this notebook we learned:
* The data did not need any cleaning
* There is no missing data
* A few variables are imbalanced
* The developer chose to evaluate model performance before deciding whether and how to balance the variables

---