# Heart Disease Indicator Analysis

## Introduction

**Objective:** Explore possible indicators of heart disease.

**Questions:**
1. Which age range is most likely to have heart issues?
2. Do smoking habits differ between people with and without heart issues?
3. Which machine learning model among LogisticRegression, XGBClassifier, DecisionTreeClassifier is the most suitable for predicting heart issues?


## Selection of Datasets

**Dataset:** 'Heart Disease Health Indicators Dataset' from Kaggle.

**Data:**
- [Heart Disease Health Indicators Dataset](https://www.kaggle.com/datasets/alexteboul/heart-disease-health-indicators-dataset)
    - This is a cleaned and filtered dataset that is specific to heart disease in 2015.
    - 253,680 survey responses from the cleaned BRFSS 2015 dataset.
    - Optimized for binary classification of heart disease.
    - Strong class imbalance.
        - 229,787 respondents have not had heart disease, while 23,893 have.


- Original dataset: [Behavioral Risk Factor Surveillance System](https://www.kaggle.com/datasets/cdc/behavioral-risk-factor-surveillance-system)
    - Public health surveys of more than 400,000 people from 2011 to 2015.
    - Data on preventive health practices and behaviors that are linked to chronic diseases, injuries, and preventable infectious diseases in the adult population.
    - Collected by the Centers for Disease Control and Prevention (CDC).

## Methodologies

### 1. Data Import and Cleaning

- Use pandas to:
  - Import and filter the dataset.
  - Apply the `groupby()` method to group rows by the value of a column (for example, by age range or smoking status).
- Use Matplotlib and seaborn for visualization:
  - Bar graphs and pie charts for Q1 and Q2.
  - Confusion matrices for Q3.

### 2. Machine Learning Models

- For Q3, "Which machine learning model among LogisticRegression, XGBClassifier, DecisionTreeClassifier is the most suitable for predicting heart issues?"
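- The models below were picked based on their suitability for binary classification. LogisticRegression and DecisionTreeClassifier come from scikit-learn; XGBClassifier comes from XGBoost and follows the scikit-learn estimator API.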

#### 2.1. Logistic Regression

- Models the probability that a given instance belongs to a particular class.
- Ideal for binary classification tasks.
- Well-suited for linear relationships, provides interpretable coefficients, and outputs probabilities.
- Considerations: Assumes a linear relationship between features and log-odds, which might limit its ability to analyze non-linear patterns.
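
As a rough illustration of the log-odds relationship described above, the sketch below turns a linear combination of features into a probability. The coefficients, the bias, and the function name `predict_probability` are hypothetical; they are not taken from the fitted model.

```python
import math

def predict_probability(features, weights, bias):
    """Return P(class = 1) for one instance under a logistic model."""
    # The log-odds are a linear function of the features
    log_odds = bias + sum(w * x for w, x in zip(weights, features))
    # The sigmoid maps log-odds to a probability between 0 and 1
    return 1.0 / (1.0 + math.exp(-log_odds))

# Hypothetical coefficients for two binary features, e.g. [HighBP, Smoker]
weights = [0.8, 0.4]
bias = -3.0

probability = predict_probability([1, 1], weights, bias)
print(round(probability, 3))
```

Each coefficient is the change in log-odds when its feature increases by one unit, which is what makes the model interpretable.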
  
#### 2.2. XGBClassifier (XGBoost)

- Gradient boosting framework that builds an ensemble of weak learners (shallow decision trees) sequentially, with each new tree correcting the errors of the previous ones.
- Well-suited for binary classification tasks and often performs strongly on tabular data.
- High predictive performance, handles non-linear relationships well, and includes regularization to reduce overfitting.
- Considerations: May require tuning of hyperparameters.
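
The idea of building weak learners sequentially can be sketched in plain Python. This is a simplified illustration, not XGBoost itself: it uses one-split "stumps" on a single feature and squared error, whereas XGBClassifier uses full trees, a logistic loss, and regularization. All names and the toy data are hypothetical, and the sketch assumes the feature has at least two distinct values.

```python
def mean_squared_error(ys, predictions):
    return sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys)

def fit_stump(xs, residuals):
    """Find the single threshold on x that best fits the residuals."""
    best = None
    # Skip the largest value so the right-hand side is never empty
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]

def boost(xs, ys, n_rounds=5, learning_rate=0.5):
    """Add stumps one at a time, each fitted to the current residuals."""
    predictions = [sum(ys) / len(ys)] * len(ys)  # start from the mean
    errors = [mean_squared_error(ys, predictions)]
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        threshold, left_value, right_value = fit_stump(xs, residuals)
        predictions = [
            p + learning_rate * (left_value if x <= threshold else right_value)
            for x, p in zip(xs, predictions)
        ]
        errors.append(mean_squared_error(ys, predictions))
    return predictions, errors

# Toy data: a hypothetical age category and a binary label
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 1, 0, 1, 1, 1]
predictions, errors = boost(xs, ys)
print(errors)
```

The training error shrinks from round to round because every stump is fitted to what the current ensemble still gets wrong.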

#### 2.3. DecisionTreeClassifier

- Supervised learning algorithm that constructs a tree structure to make predictions based on feature conditions.
- Well-suited for binary classification tasks due to its ability to create decision boundaries based on feature conditions.
- Considerations: Each split tests a single feature, so decision boundaries are axis-aligned; boundaries that depend on combinations of features can only be approximated with many splits.

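
To make "axis-aligned" concrete, a small tree can be written as nested conditions in which every test looks at one feature only. The thresholds and the function name `toy_tree_predict` are invented for illustration; a fitted `DecisionTreeClassifier` learns its own splits from the data.

```python
def toy_tree_predict(age_category, high_bp):
    """Hand-written two-level tree; returns 1 (heart disease) or 0."""
    # First split: a single feature compared against a threshold
    if age_category <= 8:
        return 0
    # Second split: a different single feature
    if high_bp <= 0.5:
        return 0
    return 1

print(toy_tree_predict(age_category=10, high_bp=1))
print(toy_tree_predict(age_category=5, high_bp=1))
```
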
### 3. Dataset Division

  - Train (90%) and test (10%) subsets.
  - Input and output sets.
    - Predicting the output ['HeartDiseaseorAttack'] based on the input set.
    - Input: ['HighBP', 'HighChol', 'CholCheck', 'BMI', 'Smoker', 'Stroke', 'Diabetes', 'PhysActivity', 'HvyAlcoholConsump', 'AnyHealthcare', 'NoDocbcCost', 'GenHlth', 'MentHlth', 'PhysHlth', 'DiffWalk', 'Sex', 'Age']
        - Columns used for the input set are chosen based on information that is commonly asked or measured at doctor's visits.
        - Columns that are ignored are ['Fruits', 'Veggies', 'Education', 'Income'].
    - Output: ['HeartDiseaseorAttack']
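
The 90/10 division can be sketched in plain Python as a shuffled split of row indices. This only illustrates the idea; the notebook itself relies on scikit-learn's `train_test_split`, and the helper name `split_indices` is hypothetical.

```python
import random

def split_indices(n_rows, test_fraction=0.1, seed=42):
    """Shuffle row indices and split them into train and test parts."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # fixed seed for reproducibility
    n_test = round(n_rows * test_fraction)
    return indices[n_test:], indices[:n_test]

# 253,680 is the number of survey responses in the dataset
train_idx, test_idx = split_indices(253_680)
print(len(train_idx), len(test_idx))
```

With 253,680 rows this gives 228,312 training rows and 25,368 test rows, and no row appears in both subsets.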
  
### 4. Handling Imbalance

- **SMOTE (Synthetic Minority Over-sampling Technique):**
    - Addresses class imbalance in binary classification tasks.
    - Generates synthetic examples of the minority class by interpolating between existing minority-class samples and their nearest minority-class neighbors.
    - By introducing synthetic examples, SMOTE balances the class distribution of the training set, which helps the model learn the minority class. It is applied only to the training set, so the test set keeps the original class distribution.
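
The core interpolation step behind SMOTE can be sketched as follows. This is a simplified illustration with hypothetical names and feature values: it takes the neighbor as given, whereas the imbalanced-learn implementation also searches for the nearest minority-class neighbors and picks one of them at random.

```python
import random

def smote_like_sample(sample, neighbor, rng):
    """Create a synthetic point on the line segment between two points."""
    gap = rng.random()  # random position in [0, 1) along the segment
    return [s + gap * (n - s) for s, n in zip(sample, neighbor)]

rng = random.Random(42)

# Hypothetical minority-class rows with two features, e.g. [BMI, Age]
minority_point = [30.0, 9.0]
nearest_neighbor = [34.0, 11.0]

synthetic = smote_like_sample(minority_point, nearest_neighbor, rng)
print(synthetic)
```

Every synthetic row lies between two real minority-class rows, so the new examples stay inside the region the minority class already occupies.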
        
### 5. Evaluation Metrics with Classification Report

The classification report provides performance metrics for a binary classification model. Each row represents a class (0 or 1), and the columns include precision, recall, F1-score, and support (the number of actual instances of that class).

- **Precision:** The ratio of true positive predictions to the total predicted positives, indicating the accuracy of positive predictions.
- **Recall:** The ratio of true positive predictions to the total actual positives, measuring the model's ability to capture all positive instances.
- **F1-score:** The harmonic mean of precision and recall, offering a balanced assessment of a model's performance.
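
The three definitions above can be checked with a few lines of Python. The counts are hypothetical and only serve as a worked example; the helper name `precision_recall_f1` is not part of scikit-learn.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from raw prediction counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts: 60 true positives, 40 false positives, 20 false negatives
precision, recall, f1 = precision_recall_f1(tp=60, fp=40, fn=20)
print(precision, recall, round(f1, 3))
```

Here precision is 60 / 100 = 0.6 and recall is 60 / 80 = 0.75, so the F1-score (about 0.667) sits between the two, closer to the lower value.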


### 6. Visualization with Confusion Matrix

A confusion matrix is a table that provides a detailed summary of the performance of a classification model. It compares predicted labels against actual labels, categorizing instances into four outcomes:

- **True Positive (TP):** Instances correctly predicted as positive.
- **True Negative (TN):** Instances correctly predicted as negative.
- **False Positive (FP):** Instances incorrectly predicted as positive.
- **False Negative (FN):** Instances incorrectly predicted as negative.
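
The four outcomes can be counted directly from two label lists, as in this minimal sketch. The labels are made up and the helper name `confusion_counts` is hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP and FN for binary labels (1 = positive)."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            counts["TP"] += 1
        elif actual == 0 and predicted == 0:
            counts["TN"] += 1
        elif actual == 0 and predicted == 1:
            counts["FP"] += 1
        else:
            counts["FN"] += 1
    return counts

y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
counts = confusion_counts(y_true, y_pred)
print(counts)
```

For binary labels 0 and 1, scikit-learn's `confusion_matrix` arranges the same counts as `[[TN, FP], [FN, TP]]`, with actual classes as rows and predicted classes as columns.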

## Results
- **Note**: To run the code, go to Kernel -> Restart & Run All.


In [1]:
import pandas as pd

pd.set_option('display.max_columns', None)

# import the data
data = pd.read_csv('heart_disease_health_indicators_BRFSS2015.csv')

### 4.1. Which age range is most likely to have heart issues?

In [None]:
# Keep only the rows where 'HeartDiseaseorAttack' == 1
# (.copy() avoids SettingWithCopyWarning when a column is added below)
heart_disease_data = data[data['HeartDiseaseorAttack'] == 1].copy()

# Map the age codes 1 to 13 to age ranges.
# The bins are (0, 1], (1, 2], ..., (12, 13], so each code gets its own label.
heart_disease_data['Age_Range'] = pd.cut(
    heart_disease_data['Age'],
    bins=list(range(0, 14)),
    labels=[
        'Age 18-24', 'Age 25-29', 'Age 30-34', 'Age 35-39', 'Age 40-44',
        'Age 45-49', 'Age 50-54', 'Age 55-59', 'Age 60-64', 'Age 65-69',
        'Age 70-74', 'Age 75-79', 'Age 80 or older'
    ]
)


# Group by 'Age_Range' and count the rows with 'HeartDiseaseorAttack' == 1
# (observed=False keeps age ranges that have no rows)
frequency_by_age = heart_disease_data.groupby('Age_Range', observed=False).size().reset_index(name='Frequency')

print(frequency_by_age)

import seaborn as sns
import matplotlib.pyplot as plt


# Set the style of seaborn
sns.set(style="whitegrid")

# Create a bar plot with rotated x-axis labels
plt.figure(figsize=(12, 6), dpi=400)
ax = sns.barplot(x='Age_Range', y='Frequency', data=frequency_by_age, palette='viridis')
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, horizontalalignment='right')
plt.title('Frequency of Heart Disease by Age Range')
plt.xlabel('Age Range')
plt.ylabel('Frequency')
plt.tight_layout()  # Adjust layout for better spacing
plt.show()


### 4.2. Do the smoking habits differ between people with and without heart issues?

- Helper function to plot the pie chart

In [None]:
import matplotlib.pyplot as plt

def plot_pie_chart(percentage_by_smoker, name):
    # Plot the smoker / non-smoker percentages as a pie chart
    labels = percentage_by_smoker.index
    sizes = percentage_by_smoker.values
    colors = ['lightcoral', 'lightskyblue']
    
    plt.figure(figsize=(3,3), dpi=200)
    plt.pie(sizes, labels=labels, colors=colors, autopct='%1.1f%%', startangle=140)
    plt.title('Percentage of Smokers ' + name)
    plt.show()

#### 4.2.1. How many people with heart disease smoke?

In [None]:
# Get the rows where HeartDiseaseorAttack is 1
# (.copy() avoids SettingWithCopyWarning when a column is added below)
heart_disease_data = data[data['HeartDiseaseorAttack'] == 1].copy()

# Replace binary values with readable labels
heart_disease_data['Smoker_Label'] = heart_disease_data['Smoker'].map({1.0: 'Smoker', 0.0: 'Non-Smoker'})

# Group the dataset by the Smoker_Label column
frequency_by_smoker = heart_disease_data.groupby('Smoker_Label').size().reset_index(name='Frequency')

# Calculate percentages
total_cases = len(heart_disease_data)
percentage_by_smoker = heart_disease_data.groupby('Smoker_Label').size() / total_cases * 100

plot_pie_chart(percentage_by_smoker, 'with a Heart Disease')

#### 4.2.2. How many people without heart disease smoke?

In [None]:
# Get the rows where HeartDiseaseorAttack is 0
# (.copy() avoids SettingWithCopyWarning when a column is added below)
non_heart_disease_data = data[data['HeartDiseaseorAttack'] == 0].copy()

# Replace binary values with readable labels
non_heart_disease_data['Smoker_Label'] = non_heart_disease_data['Smoker'].map({1.0: 'Smoker', 0.0: 'Non-Smoker'})

# Calculate percentages relative to the group without heart disease
total_cases = len(non_heart_disease_data)
percentage_by_smoker = non_heart_disease_data.groupby('Smoker_Label').size() / total_cases * 100

plot_pie_chart(percentage_by_smoker, 'without a Heart Disease')

### 4.3. Which machine learning model among LogisticRegression, XGBClassifier, DecisionTreeClassifier is the most suitable for predicting heart issues?

### Helper Functions

In [None]:
from sklearn.metrics import confusion_matrix, classification_report
import seaborn as sns
import matplotlib.pyplot as plt

# Plot the confusion matrix for the given true and predicted labels,
# and return the classification report as text
def visualize_performance(y_true, y_pred):
    # Confusion Matrix
    cm = confusion_matrix(y_true, y_pred)
    plt.figure(figsize=(6, 4.5), dpi=200)
    sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", cbar=False)
    plt.xlabel('Predicted')
    plt.ylabel('Actual')
    plt.title('Confusion Matrix')
    plt.show()

    # Classification Report
    report = classification_report(y_true, y_pred)
    return report

In [None]:
from sklearn.metrics import accuracy_score

def test_model(X_test, y_test, model):
    """Predict on the test set, print the accuracy, and return (predictions, y_test)."""
    predictions = model.predict(X_test)

    # Compute the accuracy score of the predictions
    accuracy = accuracy_score(y_test, predictions)
    print("Accuracy: ", accuracy)

    return predictions, y_test

### Creating and Saving models
- The code below creates, trains, and saves the models so that they do not have to be retrained every time they are tested.

In [None]:
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
import joblib

# Train the given model on input set X and output set y, then save it to `file`
def create_train_and_save_model(X, y, model, file):
    # Split the data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)
    
    # Apply SMOTE to the training set.
    sm = SMOTE(random_state=42)
    X_train_res, y_train_res = sm.fit_resample(X_train, y_train)

    # Train your machine learning model on the resampled training set.
    model.fit(X_train_res, y_train_res)
    
    joblib.dump(model, file)

In [None]:
# The code below was run to create and save models as joblib files
# It is not run when executing Restart & Run All
from sklearn.linear_model import LogisticRegression
from xgboost.sklearn import XGBClassifier
from sklearn.tree import DecisionTreeClassifier

import joblib

def create_save_models():

    # Extract input set (X) and output set (y)
    X = data.drop(['HeartDiseaseorAttack', 'Education', 'Income', 'Fruits', 'Veggies'], axis=1)
    y = data['HeartDiseaseorAttack']

    model = LogisticRegression(max_iter=500)
    create_train_and_save_model(X, y, model, 'LogisticRegression.joblib')
    
    model = XGBClassifier()
    create_train_and_save_model(X, y, model, 'XGBClassifier.joblib')
    
    model = DecisionTreeClassifier()
    create_train_and_save_model(X, y, model, 'DecisionTreeClassifier.joblib')

### Testing the models

#### 4.3.1. LogisticRegression

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import joblib

# Extract input set (X) and output set (y)
X = data.drop(['HeartDiseaseorAttack', 'Education', 'Income', 'Fruits', 'Veggies'], axis=1)
y = data['HeartDiseaseorAttack']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

model = joblib.load('LogisticRegression.joblib')

predictions, y_test = test_model(X_test, y_test, model)
report = visualize_performance(y_test, predictions)
print("Classification Report:\n", report)

#### 4.3.2. XGBClassifier

In [None]:
from xgboost.sklearn import XGBClassifier
from sklearn.model_selection import train_test_split
import joblib

# Extract input set (X) and output set (y)
X = data.drop(['HeartDiseaseorAttack', 'Education', 'Income', 'Fruits', 'Veggies'], axis=1)
y = data['HeartDiseaseorAttack']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

model = joblib.load('XGBClassifier.joblib')

predictions, y_test = test_model(X_test, y_test, model)
report = visualize_performance(y_test, predictions)
print("Classification Report:\n", report)

#### 4.3.3. DecisionTreeClassifier

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
import joblib

# Extract input set (X) and output set (y)
X = data.drop(['HeartDiseaseorAttack', 'Education', 'Income', 'Fruits', 'Veggies'], axis=1)
y = data['HeartDiseaseorAttack']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

model = joblib.load('DecisionTreeClassifier.joblib')

predictions, y_test = test_model(X_test, y_test, model)
report = visualize_performance(y_test, predictions)
print("Classification Report:\n", report)