<a href="https://colab.research.google.com/github/merrymira/UPASS_ML_WEEK3/blob/main/UPASS_ML_WEEK7_practice.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# When dealing with machine learning problems, there are generally two types of data (and machine learning models):
- Supervised data: always has one or multiple targets associated with it.
- Unsupervised data: does not have any target variable.

This dataset poses a supervised machine learning problem: heart failure prediction.

Reference: https://www.kaggle.com/code/durgancegaur/a-guide-to-any-classification-problem/notebook

**About this Dataset**

Age : Age of the patient

Sex : Sex of the patient

exang: exercise induced angina (1 = yes; 0 = no)

ca: number of major vessels (0-3)

cp : chest pain type

- Value 1: typical angina
- Value 2: atypical angina
- Value 3: non-anginal pain
- Value 4: asymptomatic

trtbps : resting blood pressure (in mm Hg)

chol : cholesterol in mg/dl fetched via BMI sensor

fbs : (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)

rest_ecg : resting electrocardiographic results

- Value 0: normal
- Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
- Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria

thalach : maximum heart rate achieved

target : 0 = less chance of heart attack; 1 = more chance of heart attack

# Importing all the libraries needed

In [None]:
import os
import numpy as np
import pandas as pd
import warnings
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
warnings.filterwarnings("ignore")
pd.set_option("display.max_rows",None)
from sklearn import preprocessing
import matplotlib
matplotlib.style.use('ggplot')
from sklearn.preprocessing import LabelEncoder

# Context

<img src="https://media.giphy.com/media/8cBhJBU2wlq6H6qY4W/giphy.gif">

Cardiovascular diseases (CVDs) are the number 1 cause of death globally, taking an estimated 17.9 million lives each year, which accounts for 31% of all deaths worldwide. Four out of 5 CVD deaths are due to heart attacks and strokes, and one-third of these deaths occur prematurely in people under 70 years of age. Heart failure is a common event caused by CVDs and this dataset contains 11 features that can be used to predict a possible heart disease.

People with cardiovascular disease or who are at high cardiovascular risk (due to the presence of one or more risk factors such as hypertension, diabetes, hyperlipidaemia or already established disease) need early detection and management, where a machine learning model can be of great help.

In [None]:
df=pd.read_csv("https://raw.githubusercontent.com/merrymira/UPASS_ML_WEEK3/main/heart.csv", index_col=0)
df.head()

Unnamed: 0,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
0,63,1,3,145,233,1,0,150,0,2.3,0,1
1,37,1,2,130,250,0,1,187,0,3.5,0,1
2,41,0,1,130,204,0,0,172,0,1.4,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,1


The describe() function in pandas is very handy for getting summary statistics. It returns the count, mean, standard deviation, minimum and maximum values, and the quantiles of each numeric column. Before summarizing the data, check the data type of each column with dtypes.

In [None]:
df.dtypes

Unnamed: 0,0
Age,int64
Sex,int64
ChestPainType,int64
RestingBP,int64
Cholesterol,int64
FastingBS,int64
RestingECG,int64
MaxHR,int64
ExerciseAngina,int64
Oldpeak,float64


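pandas stores string data with the generic object dtype, so any object columns should be converted to the dedicated string dtype before working with them. In this copy of the dataset every column is already numeric (int64 or float64), so the conversion below selects no columns and changes nothing.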

In [None]:
string_col = df.select_dtypes(include="object").columns
df[string_col]=df[string_col].astype("string")
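
Because the heart dataset has no text columns, the effect of this conversion is easier to see on a small example. The sketch below uses a hypothetical toy frame (its values are made up for illustration) and applies the same two steps: select the object columns, then cast them to the string dtype.

```python
import pandas as pd

# Hypothetical toy frame; "Sex" is stored explicitly as object to mimic text read from a CSV
toy = pd.DataFrame({
    "Sex": pd.Series(["M", "F", "M"], dtype=object),
    "Age": [63, 37, 41],
})
print(toy.dtypes)

# Select columns by dtype rather than by name, then cast them to the string dtype
object_cols = toy.select_dtypes(include="object").columns
toy[object_cols] = toy[object_cols].astype("string")
print(toy.dtypes)

# The converted columns can now be picked up as the categorical (string) columns
string_cols = toy.select_dtypes(include="string").columns.to_list()
print(string_cols)
```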

In [None]:
df.dtypes

Unnamed: 0,0
Age,int64
Sex,int64
ChestPainType,int64
RestingBP,int64
Cholesterol,int64
FastingBS,int64
RestingECG,int64
MaxHR,int64
ExerciseAngina,int64
Oldpeak,float64


The dtypes are unchanged because there were no object columns to convert. On a dataset with text columns, those columns would now show the string dtype.

## Getting the categorical columns

In [None]:
string_col=df.select_dtypes("string").columns.to_list()

In [None]:
num_col=df.columns.to_list()
#print(num_col)
for col in string_col:
    num_col.remove(col)
num_col.remove("HeartDisease")

In [None]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Age,303.0,54.366337,9.082101,29.0,47.5,55.0,61.0,77.0
Sex,303.0,0.683168,0.466011,0.0,0.0,1.0,1.0,1.0
ChestPainType,303.0,0.966997,1.032052,0.0,0.0,1.0,2.0,3.0
RestingBP,303.0,131.623762,17.538143,94.0,120.0,130.0,140.0,200.0
Cholesterol,303.0,246.264026,51.830751,126.0,211.0,240.0,274.5,564.0
FastingBS,303.0,0.148515,0.356198,0.0,0.0,0.0,0.0,1.0
RestingECG,303.0,0.528053,0.52586,0.0,0.0,1.0,1.0,2.0
MaxHR,303.0,149.646865,22.905161,71.0,133.5,153.0,166.0,202.0
ExerciseAngina,303.0,0.326733,0.469794,0.0,0.0,0.0,1.0,1.0
Oldpeak,303.0,1.039604,1.161075,0.0,0.0,0.8,1.6,6.2


# Exploratory Data Analysis

The outcomes of this phase are:

- Understanding the dataset, which also helps with cleaning it.
- A clear picture of the features and the relationships between them.
- Guidelines for keeping essential variables and removing non-essential ones.
- Handling missing values and human error.
- Identifying outliers.
- Maximizing the insights gained from the dataset.

This process is time-consuming but very effective.

## Correlation Matrix
### Removing highly correlated variables can improve your model. You can compute correlations with the pandas ".corr()" function and visualize the correlation matrix with Plotly Express.
- Lighter shades represent positive correlation
- Darker shades represent negative correlation

In [None]:
px.imshow(df.corr(),title="Correlation Plot of the Heart Failure Prediction")
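
The heatmap shows correlations visually; to act on them it can help to list the strongly correlated feature pairs explicitly. The following is a minimal sketch on synthetic data. The column names "a", "b", "c" and the 0.9 threshold are assumptions chosen for illustration, not values taken from this dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data; the column names are illustrative only
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=200),  # almost a rescaled copy of "a"
    "c": rng.normal(size=200),                      # unrelated noise
})

# Absolute correlations, so strong negative relationships are caught too
corr = toy.corr().abs()

# Keep only the upper triangle (above the diagonal) so each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.9  # assumed cut-off; tune it for your own data
pairs = upper.stack()
high_pairs = pairs[pairs > threshold]
print(high_pairs)
```

On the real data the same steps would apply to df.corr(); one column from each listed pair is then a candidate for removal.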

# 1. Handling Null Values :
In any real-world dataset, there are almost always a few null values. Whether the task is regression, classification, or something else, most models cannot handle NULL or NaN values on their own, so we need to intervene.

> In pandas, a missing (NULL) value is represented as NaN, so the two terms are used interchangeably here.
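
The dataset used here turns out to have no missing values, so the intervention is sketched on a hypothetical toy frame instead. The values below are made up, and median imputation with scikit-learn's SimpleImputer is just one common choice, not the method this notebook prescribes.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame with one missing value per column
toy = pd.DataFrame({
    "RestingBP": [145.0, np.nan, 130.0, 120.0],
    "Cholesterol": [233.0, 250.0, np.nan, 236.0],
})

# Count the missing values per column
print(toy.isnull().sum())

# Replace each NaN with the median of its column
imputer = SimpleImputer(strategy="median")
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)
print(filled)
```

When a train/test split is involved, fit the imputer on the training rows only and reuse it to transform the test rows, so that no information leaks from the test set.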


In [None]:
# Checking for Type of data
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 303 entries, 0 to 302
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Age             303 non-null    int64  
 1   Sex             303 non-null    int64  
 2   ChestPainType   303 non-null    int64  
 3   RestingBP       303 non-null    int64  
 4   Cholesterol     303 non-null    int64  
 5   FastingBS       303 non-null    int64  
 6   RestingECG      303 non-null    int64  
 7   MaxHR           303 non-null    int64  
 8   ExerciseAngina  303 non-null    int64  
 9   Oldpeak         303 non-null    float64
 10  ST_Slope        303 non-null    int64  
 11  HeartDisease    303 non-null    int64  
dtypes: float64(1), int64(11)
memory usage: 30.8 KB


In [None]:
# Checking for NULLs in the data
df.isnull().sum()

Unnamed: 0,0
Age,0
Sex,0
ChestPainType,0
RestingBP,0
Cholesterol,0
FastingBS,0
RestingECG,0
MaxHR,0
ExerciseAngina,0
Oldpeak,0


# 2. Training the dataset using "Hold Out" Split


In [None]:
# Hold-out split: set aside 20% of the rows for testing

from sklearn.model_selection import train_test_split

# Assuming 'HeartDisease' is your target variable
X = df.drop('HeartDisease', axis=1)
y = df['HeartDisease']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Print the shapes of the resulting datasets
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)


X_train shape: (242, 11)
X_test shape: (61, 11)
y_train shape: (242,)
y_test shape: (61,)
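
A plain hold-out split does not guarantee that the class proportions in the test set match those of the full data. As a possible refinement, train_test_split accepts a stratify argument. The sketch below uses synthetic labels (100 rows, 30% positives, chosen for illustration) rather than the heart data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 rows, 30 positives and 70 negatives
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([1] * 30 + [0] * 70)

# stratify=y_toy keeps the 30/70 class ratio in both parts of the split
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print("train positive rate:", y_tr.mean())
print("test positive rate:", y_te.mean())
```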


In [None]:
# Fit a classification model on the training split, predict on the test split, then evaluate

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Create a Logistic Regression model
model = LogisticRegression()

# Fit the model to the training data
model.fit(X_train, y_train)

# Make predictions on the test data
y_pred = model.predict(X_test)

# Evaluate the model's performance
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Print a classification report
print(classification_report(y_test, y_pred))


Accuracy: 0.8360655737704918
              precision    recall  f1-score   support

           0       0.81      0.86      0.83        29
           1       0.87      0.81      0.84        32

    accuracy                           0.84        61
   macro avg       0.84      0.84      0.84        61
weighted avg       0.84      0.84      0.84        61
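
To see where the numbers in the report come from, they can be rebuilt from a confusion matrix. The counts below (25 true negatives, 4 false positives, 6 false negatives, 26 true positives) are inferred from the support and the rounded scores above; the notebook itself does not print them, so treat them as an assumption.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Labels rebuilt from the inferred counts: 29 actual negatives, 32 actual positives
y_true = np.array([0] * 29 + [1] * 32)
y_hat = np.array([0] * 25 + [1] * 4 + [0] * 6 + [1] * 26)

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_true, y_hat)
print(cm)

acc = accuracy_score(y_true, y_hat)      # (25 + 26) / 61
prec = precision_score(y_true, y_hat)    # 26 / (26 + 4)
rec = recall_score(y_true, y_hat)        # 26 / (26 + 6)
print("accuracy:", acc)
print("precision (class 1):", prec)
print("recall (class 1):", rec)
```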



# 3. Training the dataset using "K-folds" Split

In [None]:
from sklearn.model_selection import KFold

# Set up KFold cross-validation
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Loop through the k-folds
for fold, (train_index, test_index) in enumerate(kf.split(X)):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    # Print the shapes of the resulting datasets for each fold
    print(f"Fold {fold+1}")
    print("X_train shape:", X_train.shape)
    print("X_test shape:", X_test.shape)
    print("y_train shape:", y_train.shape)
    print("y_test shape:", y_test.shape)
    print("\n")

Fold 1
X_train shape: (242, 11)
X_test shape: (61, 11)
y_train shape: (242,)
y_test shape: (61,)


Fold 2
X_train shape: (242, 11)
X_test shape: (61, 11)
y_train shape: (242,)
y_test shape: (61,)


Fold 3
X_train shape: (242, 11)
X_test shape: (61, 11)
y_train shape: (242,)
y_test shape: (61,)


Fold 4
X_train shape: (243, 11)
X_test shape: (60, 11)
y_train shape: (243,)
y_test shape: (60,)


Fold 5
X_train shape: (243, 11)
X_test shape: (60, 11)
y_train shape: (243,)
y_test shape: (60,)




In [None]:
# Lists to store the results for each fold
accuracy_scores = []
classification_reports = []

# Loop through the k-folds
for fold, (train_index, test_index) in enumerate(kf.split(X)):
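    # Select the training and test rows for this fold
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    # Fit the model to the training data of this fold
    model = LogisticRegression()
    model.fit(X_train, y_train)

    # Make predictions on the test data of this fold
    y_pred = model.predict(X_test)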

    # Evaluate the model's performance
    accuracy = accuracy_score(y_test, y_pred)
    accuracy_scores.append(accuracy)

    # Store the classification report for each fold
    report = classification_report(y_test, y_pred, output_dict=True)
    classification_reports.append(report)

    # Print the results for this fold
    print(f"Fold {fold+1} Accuracy: {accuracy}")
    print(f"Fold {fold+1} Classification Report:")
    print(classification_report(y_test, y_pred))
    print("\n")

# Print the overall average accuracy across all folds
average_accuracy = np.mean(accuracy_scores)
print("Average Accuracy across all folds:", average_accuracy)

Fold 1 Accuracy: 0.8360655737704918
Fold 1 Classification Report:
              precision    recall  f1-score   support

           0       0.81      0.86      0.83        29
           1       0.87      0.81      0.84        32

    accuracy                           0.84        61
   macro avg       0.84      0.84      0.84        61
weighted avg       0.84      0.84      0.84        61



Fold 2 Accuracy: 0.8524590163934426
Fold 2 Classification Report:
              precision    recall  f1-score   support

           0       0.82      0.78      0.80        23
           1       0.87      0.89      0.88        38

    accuracy                           0.85        61
   macro avg       0.84      0.84      0.84        61
weighted avg       0.85      0.85      0.85        61



Fold 3 Accuracy: 0.819672131147541
Fold 3 Classification Report:
              precision    recall  f1-score   support

           0       0.85      0.77      0.81        30
           1       0.79      0.87   
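
The manual loop makes each step visible. When only the per-fold scores are needed, scikit-learn's cross_val_score can run the same fit, predict and score cycle in one call. The sketch below uses synthetic data from make_classification as a stand-in for the heart dataset, and max_iter=1000 is an assumed setting to help the solver converge.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in with the same number of features as the heart data
X_toy, y_toy = make_classification(n_samples=300, n_features=11, random_state=0)

kf_toy = KFold(n_splits=5, shuffle=True, random_state=42)

# One accuracy value per fold; the model is refit from scratch on each fold
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_toy, y_toy, cv=kf_toy, scoring="accuracy"
)
print("Per-fold accuracy:", scores)
print("Mean accuracy:", scores.mean())
```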

# 4. Training the dataset using "Stratified-folds" Split

In [None]:
from sklearn.model_selection import StratifiedKFold

# Set up StratifiedKFold cross-validation
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Loop through the stratified k-folds
for fold, (train_index, test_index) in enumerate(skf.split(X, y)): # Make sure to include y for StratifiedKFold
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    # Print the shapes of the resulting datasets for each fold
    print(f"Fold {fold+1}")
    print("X_train shape:", X_train.shape)
    print("X_test shape:", X_test.shape)
    print("y_train shape:", y_train.shape)
    print("y_test shape:", y_test.shape)
    print("\n")

Fold 1
X_train shape: (242, 11)
X_test shape: (61, 11)
y_train shape: (242,)
y_test shape: (61,)


Fold 2
X_train shape: (242, 11)
X_test shape: (61, 11)
y_train shape: (242,)
y_test shape: (61,)


Fold 3
X_train shape: (242, 11)
X_test shape: (61, 11)
y_train shape: (242,)
y_test shape: (61,)


Fold 4
X_train shape: (243, 11)
X_test shape: (60, 11)
y_train shape: (243,)
y_test shape: (60,)


Fold 5
X_train shape: (243, 11)
X_test shape: (60, 11)
y_train shape: (243,)
y_test shape: (60,)
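
The fold shapes look identical to those of plain K-fold; the difference lies in the class balance inside each fold. The sketch below compares the share of positive labels per test fold for both splitters, using synthetic labels (20 positives out of 100, an assumption made for illustration).

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Synthetic stand-in: imbalanced labels, 20% positives
y_toy = np.array([1] * 20 + [0] * 80)
X_toy = np.zeros((100, 1))  # the feature values do not matter for splitting

def positive_rates(splitter):
    """Return the share of positive labels in each test fold."""
    return [y_toy[test_idx].mean() for _, test_idx in splitter.split(X_toy, y_toy)]

plain = positive_rates(KFold(n_splits=5, shuffle=True, random_state=42))
strat = positive_rates(StratifiedKFold(n_splits=5, shuffle=True, random_state=42))

print("KFold positive rate per fold:          ", plain)
print("StratifiedKFold positive rate per fold:", strat)
```

With stratification every test fold keeps the overall 20% positive rate, while plain K-fold lets it drift from fold to fold.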




In [None]:
# Lists to store the results for each fold
accuracy_scores = []
classification_reports = []

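# Loop through the stratified k-folds
for fold, (train_index, test_index) in enumerate(skf.split(X, y)):
    # Select the training and test rows for this fold
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    # Fit the model to the training data of this fold
    model = LogisticRegression()
    model.fit(X_train, y_train)

    # Make predictions on the test data of this fold
    y_pred = model.predict(X_test)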

    # Evaluate the model's performance
    accuracy = accuracy_score(y_test, y_pred)
    accuracy_scores.append(accuracy)

    # Store the classification report for each fold
    report = classification_report(y_test, y_pred, output_dict=True)
    classification_reports.append(report)

    # Print the results for this fold
    print(f"Fold {fold+1} Accuracy: {accuracy}")
    print(f"Fold {fold+1} Classification Report:")
    print(classification_report(y_test, y_pred))
    print("\n")

# Print the overall average accuracy across all folds
average_accuracy = np.mean(accuracy_scores)
print("Average Accuracy across all folds:", average_accuracy)

Fold 1 Accuracy: 0.8360655737704918
Fold 1 Classification Report:
              precision    recall  f1-score   support

           0       0.80      0.86      0.83        28
           1       0.87      0.82      0.84        33

    accuracy                           0.84        61
   macro avg       0.84      0.84      0.84        61
weighted avg       0.84      0.84      0.84        61



Fold 2 Accuracy: 0.819672131147541
Fold 2 Classification Report:
              precision    recall  f1-score   support

           0       0.79      0.82      0.81        28
           1       0.84      0.82      0.83        33

    accuracy                           0.82        61
   macro avg       0.82      0.82      0.82        61
weighted avg       0.82      0.82      0.82        61



Fold 3 Accuracy: 0.7377049180327869
Fold 3 Classification Report:
              precision    recall  f1-score   support

           0       0.77      0.61      0.68        28
           1       0.72      0.85   