# **Data Cleaning Notebook**

## Objectives

- Clean the data
- Split the data into train and test sets

## Inputs

* outputs/datasets/collection/heart.csv

## Outputs

* Generate Train and Test sets from cleaned data, saved under outputs/datasets/cleaned



---

# Set up the Working Directory

Define and confirm the working directory.

In [None]:
import os
current_dir = os.getcwd()
os.chdir(os.path.dirname(current_dir))
current_dir = os.getcwd()
current_dir

# Load data

In [None]:
import pandas as pd
df_raw_path = "outputs/datasets/collection/heart.csv"
df = pd.read_csv(df_raw_path)
df.head(3)

---

# Check imbalanced data

The previous notebook showed that some of the variables are unbalanced.
The developer visualizes all features with countplots to confirm this.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

for feature in df.columns:
    if feature != 'target':  # Exclude the target variable itself
        plt.figure(figsize=(8, 4))  # Adjust figure size as needed
        sns.countplot(x=feature, hue='target', data=df, palette='Set2')
        plt.title(f'Class Distribution by {feature}')
        plt.xlabel(feature)
        plt.ylabel('Count')
        plt.legend(title='Target', loc='upper right')
        plt.show()

Check the value counts of every variable to confirm the unbalanced variables seen in the countplots.

In [None]:
import pandas as pd

for column in df.columns:
    counts = df[column].value_counts()
    print(f"Value Counts for {column}:\n{counts}\n")

From the information above, the developer decided that the following variables are unbalanced:

* sex
* FBS (Fasting Blood Sugar)
* Restecg (Resting Electrocardiographic Results)
* Exang (Exercise-Induced Angina)
* Slope
* Thal


The developer also decided to first split the data, perform feature engineering and evaluate the performance of the model, and only then, if needed, balance the variables listed above.
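
The degree of imbalance can also be expressed as a single number per variable. The following is a minimal sketch, assuming a toy DataFrame with made-up values (not the heart dataset); the helper name `majority_share` is hypothetical. It reports the fraction of rows taken by the most frequent value in each column, so values close to 1.0 point to a heavily unbalanced variable.

```python
import pandas as pd


def majority_share(frame: pd.DataFrame) -> pd.Series:
    """Return, per column, the fraction of rows taken by the most frequent value."""
    # value_counts sorts in descending order, so the first entry is the largest share
    return frame.apply(lambda col: col.value_counts(normalize=True).iloc[0])


# Toy data with hypothetical values, used only to illustrate the helper
toy = pd.DataFrame({
    "sex": [1, 1, 1, 0],
    "fbs": [0, 0, 0, 0],
    "slope": [1, 2, 1, 2],
})

shares = majority_share(toy)
print(shares.sort_values(ascending=False))
```

On the real data the same call would be `majority_share(df)`; where to draw the line between balanced and unbalanced remains a judgment call.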

---

# Missing data

Checking for missing data

In [None]:
vars_with_missing_data = df.columns[df.isna().sum() > 0].to_list()
vars_with_missing_data

There is no missing data.

---

# Split data into train and test set

The target variable is dropped from the features before splitting, because the developer thinks that the way the data was split caused target leakage and the bug of 100% accuracy on both the train and test sets in the model notebook.

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(['target'], axis=1),
    df['target'],
    test_size=0.2,
    random_state=0,
)

print("* Train set:", X_train.shape, y_train.shape,
      "\n* Test set:",  X_test.shape, y_test.shape)

As we can see, the features are divided between the train and test sets as follows:
* TrainSet shape: (820, 13)
* TestSet shape: (205, 13)

The train set holds 80% of the data and the test set the remaining 20%.
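
The split can be reproduced on synthetic data to check both the 80/20 proportions and that the target never ends up among the features. This is a minimal sketch under stated assumptions: the 1025 rows are inferred from 820 + 205, the column names `f0` to `f12` are hypothetical, and the values are random rather than taken from the heart dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1025 rows, 13 hypothetical feature columns plus a target
rng = np.random.default_rng(0)
columns = [f"f{i}" for i in range(13)] + ["target"]
toy = pd.DataFrame(rng.integers(0, 2, size=(1025, 14)), columns=columns)

X_train, X_test, y_train, y_test = train_test_split(
    toy.drop(columns=["target"]),
    toy["target"],
    test_size=0.2,
    random_state=0,
)

# Guard against target leakage: the target must not be a feature column
assert "target" not in X_train.columns
assert "target" not in X_test.columns

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```

The two assertions are cheap to keep next to the real split, since they fail immediately if the target column is left among the features.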

---

# Check for duplicates

This section was added later, when the developer understood that, in the ModelAndEvaluation notebook, duplicates were pushing the performance on both the train and test sets to 100%.

Check for duplicates

In [None]:
import pandas as pd


# Combine train and test data into one DataFrame
X_combined = pd.concat([X_train, X_test], axis=0)
y_combined = pd.concat([y_train, y_test], axis=0)

# Check for rows of feature values that appear more than once.
# This flags repeats within either set as well as across both sets.
duplicates = X_combined.duplicated()
duplicates_exist = duplicates.any()

if duplicates_exist:
    print("Duplicates exist in the combined train and test sets.")
else:
    print("No duplicates found in the combined train and test sets.")

# Print the shapes of train and test sets
print("* Train set:", X_train.shape, y_train.shape,
      "\n* Test set:",  X_test.shape, y_test.shape)


Check for the number of duplicates

In [None]:
# Check the number of duplicates in the train set
train_duplicates = X_train[X_train.duplicated(keep='first')]
num_train_duplicates = len(train_duplicates)

# Check the number of duplicates in the test set
test_duplicates = X_test[X_test.duplicated(keep='first')]
num_test_duplicates = len(test_duplicates)

print("Number of duplicates in the train set:", num_train_duplicates)
print("Number of duplicates in the test set:", num_test_duplicates)


### There are duplicates in both the train and test sets:

- Number of duplicates in the train set: 519
- Number of duplicates in the test set: 44

Removing duplicates from train and test set

In [None]:
# Remove duplicates from the test set
X_test = X_test.drop_duplicates(keep='first')
y_test = y_test.loc[X_test.index]

# Check the shape of the test set after removing duplicates
print("Test set shape after removing duplicates:", X_test.shape, y_test.shape)


In [None]:
# Remove duplicates from the train set
X_train = X_train.drop_duplicates(keep='first')
y_train = y_train.loc[X_train.index]

# Check the shape of the train set after removing duplicates
print("Train set shape after removing duplicates:", X_train.shape, y_train.shape)
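
The two cells above remove duplicates within each set separately, so a row that appears once in the train set and once in the test set would survive that step. The following is a minimal sketch of how such cross-set overlap could be detected and dropped from the test set. It uses toy stand-ins for `X_train` and `X_test` with made-up values, and the names `in_train` and `X_test_unique` are hypothetical.

```python
import pandas as pd

# Toy stand-ins for X_train / X_test (hypothetical values, not the heart data)
X_train = pd.DataFrame({"age": [63, 45, 52], "chol": [233, 210, 199]},
                       index=[10, 11, 12])
X_test = pd.DataFrame({"age": [45, 70], "chol": [210, 180]},
                      index=[20, 21])

# Collect every train row as a plain tuple of feature values
train_rows = set(X_train.itertuples(index=False, name=None))

# Flag the test rows whose feature values also occur in the train set
in_train = pd.Series(
    [row in train_rows for row in X_test.itertuples(index=False, name=None)],
    index=X_test.index,
)
print("Test rows also present in the train set:", int(in_train.sum()))

# Keep only the test rows that the model has not seen during training
X_test_unique = X_test[~in_train]
print(X_test_unique)
```

If rows are dropped this way, the target would need the same filter, for example `y_test.loc[X_test_unique.index]`. An alternative design would be to drop duplicates from `df` before splitting, which avoids the overlap altogether.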

---

# Save new cleaned data

We then export and save the cleaned data to its folder.

In [None]:
import os
try:
  os.makedirs(name='outputs/datasets/cleaned')
except Exception as e:
  print(e)

In [None]:
TrainSet = X_train
TrainSet.head()

In [None]:
TestSet = X_test
TestSet.head()

In [None]:
TrainSet.to_csv("outputs/datasets/cleaned/TrainSetCleaned.csv", index=False)
TestSet.to_csv("outputs/datasets/cleaned/TestSetCleaned.csv", index=False)

In [None]:
TargetTrainSet = y_train
TargetTrainSet.head()

In [None]:
TargetTestSet = y_test
TargetTestSet.head()

In [None]:
TargetTrainSet.to_csv("outputs/datasets/cleaned/TargetTrainSet.csv", index=False)
TargetTestSet.to_csv("outputs/datasets/cleaned/TargetTestSet.csv", index=False)
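
Because the files are written with `index=False`, the original row labels are not stored, and features and target stay aligned only by row position. The following is a minimal sketch of a round-trip check, assuming toy data with made-up values and a temporary directory instead of `outputs/datasets/cleaned`.

```python
import os
import tempfile

import pandas as pd

# Toy features and target sharing a non-default index (hypothetical values)
X = pd.DataFrame({"age": [63, 45], "chol": [233, 210]}, index=[7, 3])
y = pd.Series([1, 0], index=[7, 3], name="target")

with tempfile.TemporaryDirectory() as tmp:
    x_path = os.path.join(tmp, "TrainSetCleaned.csv")
    y_path = os.path.join(tmp, "TargetTrainSet.csv")

    # Same calls as in the notebook: the index is not written to disk
    X.to_csv(x_path, index=False)
    y.to_csv(y_path, index=False)

    # Reload both files; each gets a fresh 0..n-1 index
    X_back = pd.read_csv(x_path)
    y_back = pd.read_csv(y_path)["target"]

# Features and target must have the same number of rows to stay aligned
same_length = len(X_back) == len(y_back)
print(same_length, list(X_back.index), y_back.tolist())
```

Running the same length check on the reloaded files in the next notebook would catch a features file and a target file that were saved from different versions of the split.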

# Conclusions and pushing file to repo

From this notebook we understood:
* The data needed no cleaning beyond removing duplicates
* There is no missing data
* A few variables are unbalanced
* There were duplicates in the train and test sets, which were removed
* The developer chose to look at the model's performance before deciding whether and how to balance the variables

---