## Introduction 

This technical report documents the process and findings of a project aimed at classifying cardiovascular diseases using machine learning techniques. The project begins with exploratory data analysis (EDA) of a dataset obtained from Kaggle, focusing on understanding key attributes and identifying data inaccuracies. Subsequently, feature engineering is conducted to enhance the dataset's predictive power, followed by model design and evaluation. The report aims to provide insights into the data, methodology, and outcomes of the classification task, facilitating informed decision-making for future studies or applications in the field of healthcare analytics.


In [179]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### 1.1 Exploratory Data Analysis (EDA)
EDA was conducted with pandas, matplotlib, and seaborn to answer the following questions about the dataset:
- How are positive and negative cases of cardiovascular disease distributed?
- How are cholesterol levels and age distributed?
- What proportion of men and of women have cardiovascular disease?
- What proportion of patients smoke, and how are weight and gender distributed?

The results are summarized in Markdown cells and graphs within the notebook.


In [195]:
# Load the data; the 'id' column is used as the index, so it is not treated as a feature
df_cardio = pd.read_csv('../Lab/data/cardio_train.csv', sep=';', index_col='id')
# Convert age from days to whole years
df_cardio['age'] = (df_cardio['age'] / 365).round()
# Rename the columns to more descriptive names
df_cardio.rename(columns={'ap_hi': 'systolic', 'ap_lo': 'diastolic', 'cardio': 'cardio_disease', 'gluc': 'glucose_level', 'alco': 'alcohol_intake', 'active': 'physical_activity', 'smoke': 'smoking', 'cholesterol': 'cholesterol_level'}, inplace=True)
# Keep a copy of the renamed data before any further cleaning
df_copy = df_cardio.copy()
# Display the first rows
df_cardio.head()

### Data Preprocessing
- Age Conversion: The age column, originally containing values in days, was converted to represent years. 

- Column Renaming: Several column names were modified to improve clarity and consistency.
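
The age conversion can be illustrated on a few made-up values. This is a minimal sketch: the day counts below are hypothetical and not taken from the dataset, and a year is assumed to be 365 days, as in the notebook.

```python
import pandas as pd

# Hypothetical ages in days (illustrative values only)
ages_in_days = pd.Series([18393, 20228, 22000], name="age")

# Divide by 365 and round to the nearest whole year
ages_in_years = (ages_in_days / 365).round().astype(int)

print(ages_in_years.tolist())  # [50, 55, 60]
```

Dividing by 365 ignores leap years, which shifts an age by less than two weeks over 50 years and rarely changes the rounded result.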

In [None]:
cardi_disease_positiv = df_cardio["cardio_disease"].sum() # Count the positive cases (1s)
cardio_disease_negativ = len(df_cardio) - cardi_disease_positiv # Count the negative cases (0s)
print(f"Number of positive cases: {cardi_disease_positiv}")
print(f"Number of negative cases: {cardio_disease_negativ}")

Distribution of positive and negative cases for cardiovascular disease:
- Positive: 34,979
- Negative: 35,021

Distribution of cholesterol levels:
- Normal: 74.84%
- Above normal: 13.64%
- Well above normal: 11.52%

Proportion of smokers:
- Smokers: 8.81%

Proportion of men and women with cardiovascular disease:
- Men: 50.5%
- Women: 49.7%


In [None]:
# Count the patients at each cholesterol level (1 = normal, 2 = above normal, 3 = well above normal)
cholesterol_level = df_cardio['cholesterol_level'].value_counts()
# Convert the counts to percentages
cholesterol_level_percentage = df_cardio['cholesterol_level'].value_counts(normalize=True) * 100
# Print the percentage for each cholesterol level
print(f"Normal: {cholesterol_level_percentage[1]:.2f}%")
print(f"Above normal: {cholesterol_level_percentage[2]:.2f}%")
print(f"Well above normal: {cholesterol_level_percentage[3]:.2f}%")

In [None]:
rökare = df_cardio[df_cardio["smoking"] == 1].shape[0] # Count the number of smokers
total = df_cardio.shape[0] # Count the total number of patients
print(f"Percentage smokers: {rökare / total:.2%}") # Print the percentage of smokers

In [None]:
# Plot a histogram of the age distribution
plt.figure(figsize=(16, 2)) # Set the size of the plot
sns.histplot(data=df_cardio, x="age", element="step") # Create a histogram of the age column
plt.xlabel("Age (years)") # Set the x-axis label
plt.ylabel("Number of patients") # Set the y-axis label
plt.tight_layout()
plt.show() # Show the plot

# Plot a histogram of the weight distribution
plt.figure(figsize=(16, 2))
sns.histplot(data=df_cardio, x="weight", element="poly", bins=30)
plt.xlabel("Weight (kg)")
plt.ylabel("Number of patients")
plt.show()

# Plot a histogram of the height distribution
plt.figure(figsize=(16, 2))
sns.histplot(data=df_cardio, x="height", element="poly", bins=40)
plt.xlabel("Height (cm)")
plt.ylabel("Number of patients")
plt.show()

In [None]:
# Calculate percentage of men and women with cardio disease
# Make a copy of df_cardio
df = df_cardio.copy()
# Map gender values to 'Men' and 'Women'
df['gender'] = df_cardio['gender'].map({1: 'Women', 2: 'Men'})
# Group by gender and calculate the mean of cardio_disease (since 1 represents having disease and 0 represents not having)
percentage_by_gender = df.groupby('gender')['cardio_disease'].mean() * 100
print(percentage_by_gender)

Summary:
The dataset (70,000 individuals) is balanced between positive and negative cases of cardiovascular disease (34,979 vs. 35,021). Most patients have normal cholesterol levels (74.84%). Patients are mostly middle-aged (30-65 years), with the largest group between 50 and 60. Smoking prevalence is low (8.81%). The weight and height data need further investigation because of potential inconsistencies, and the dataset is imbalanced by gender. Even so, the prevalence of cardiovascular disease is nearly equal among men (50.5%) and women (49.7%).

### 1.2 Model design

- The correlation analysis reveals no strong linear correlation between any single feature and the target variable.
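
One way to inspect feature-to-target correlations directly, rather than reading them off a heatmap, is to pull the target column out of the correlation matrix and rank it by absolute value. The sketch below uses synthetic data; the `toy` frame and its generated values are assumptions for illustration, not the Kaggle dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (not the real dataset)
rng = np.random.default_rng(0)
n = 500
toy = pd.DataFrame({
    "age": rng.integers(30, 66, size=n),
    "weight": rng.normal(75, 12, size=n),
})
# Make the synthetic target depend on age plus noise, but not on weight
toy["cardio_disease"] = (toy["age"] + rng.normal(0, 10, size=n) > 50).astype(int)

# Pearson correlation of every feature with the target, strongest first
target_corr = (
    toy.corr()["cardio_disease"]
    .drop("cardio_disease")
    .sort_values(key=abs, ascending=False)
)
print(target_corr)
```

Pearson correlation only captures linear relationships, so a low value does not rule out a feature being useful to a non-linear model such as Random Forest.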

In [None]:
# Create a heatmap of the correlation matrix
sns.heatmap(df_cardio.corr(), annot=True, fmt=".1f", vmin=-1, vmax=1, cmap='coolwarm')
# Add a title
plt.title('Correlation Matrix')
# Apply the layout before saving so the saved image matches the displayed plot
plt.tight_layout()
# Save the plot so it can be displayed together with the other plots
plt.savefig('correlation_matrix.png')
# Show the plot
plt.show()

Values in the "weight" column below 60 kg or above 140 kg were replaced with the median weight of the entire dataset (`median_weight`). Likewise, values in the "height" column outside the 150 cm to 200 cm range were replaced with the median height (`median_height`).

In [None]:
median_weight = df_cardio['weight'].median() # Calculate the median weight
median_height = df_cardio['height'].median() # Calculate the median height
# Replace weights outside 60-140 kg with the median weight
df_cardio['weight'] = np.where((df_cardio['weight'] < 60) | (df_cardio['weight'] > 140), median_weight, df_cardio['weight'])
# Replace heights outside 150-200 cm with the median height
df_cardio['height'] = np.where((df_cardio['height'] < 150) | (df_cardio['height'] > 200), median_height, df_cardio['height'])
# BMI = weight (kg) / height (m)^2, rounded to one decimal place
df_cardio['bmi'] = (df_cardio['weight'] / (df_cardio['height'] / 100) ** 2).round(1)

def categorize_bmi(bmi):
  """Map a BMI value to its weight category."""
  # Each upper bound is exclusive, so no value falls between two categories
  if bmi < 18.5:
    return "Underweight"
  elif bmi < 25:
    return "Normal"
  elif bmi < 30:
    return "Overweight"
  elif bmi < 35:
    return "Obese(1)"
  elif bmi < 40:
    return "Obese(2)"
  else:
    return "Obese(3)"

# Apply the function to every value in the 'bmi' column
df_cardio['bmi_cat'] = df_cardio['bmi'].apply(categorize_bmi)

- Created a new feature for Body Mass Index (BMI) based on weight and height
- The BMI feature did not exhibit strong correlations with other features
- Categorized BMI into distinct classes: Underweight, Normal, Overweight, and different levels of Obesity.
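
The same categorization can also be expressed with `pd.cut`, which bins a whole column at once. This is an alternative sketch, not the notebook's method; the sample BMI values are made up, and the bin edges are assumed to match the category limits used above.

```python
import pandas as pd

# Made-up BMI values, one per category
bmi = pd.Series([17.0, 22.3, 27.5, 31.0, 36.2, 41.8])

# Bin edges; with right=False each bin includes its lower edge only
bins = [0, 18.5, 25, 30, 35, 40, float("inf")]
labels = ["Underweight", "Normal", "Overweight", "Obese(1)", "Obese(2)", "Obese(3)"]

bmi_cat = pd.cut(bmi, bins=bins, labels=labels, right=False)
print(bmi_cat.tolist())
```

`pd.cut` returns an ordered categorical, which keeps the categories in severity order when plotting or grouping.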

Blood Pressure Outlier Handling:

Values considered physiologically implausible (systolic blood pressure  < 60 mmHg or > 220 mmHg, diastolic blood pressure < 40 mmHg or > 120 mmHg) were replaced with the median value of the respective column ("systolic" or "diastolic").

In [None]:
# Clean the blood pressure values: replace outliers with the column median
# (each median is computed once, before any value is replaced)
median_systolic = df_cardio['systolic'].median()
median_diastolic = df_cardio['diastolic'].median()
df_cardio['systolic'] = np.where((df_cardio['systolic'] < 60) | (df_cardio['systolic'] > 220), median_systolic, df_cardio['systolic'])
df_cardio['diastolic'] = np.where((df_cardio['diastolic'] < 40) | (df_cardio['diastolic'] > 120), median_diastolic, df_cardio['diastolic'])

# Create a categorical variable for blood pressure.
# The most severe category is checked first so that every branch is reachable.
def categorize_blood_pressure(systolic, diastolic):
  if systolic > 180 or diastolic > 120:
    return "Hypertension Crisis"
  elif systolic >= 140 or diastolic >= 90:
    return "Stage 2 Hypertension"
  elif systolic >= 130 or diastolic >= 80:
    return "Stage 1 Hypertension"
  elif systolic >= 120:
    return "Elevated"
  elif systolic < 120 and diastolic < 80:
    return "Healthy"
  else:
    return "Invalid" # Reached only for missing values
df_cardio['blood_pressure'] = df_cardio.apply(lambda x: categorize_blood_pressure(x['systolic'], x['diastolic']), axis=1)


Blood pressure was categorized according to medical guidelines.

- The __blood_pressure__ feature did not exhibit strong correlations with other features. This was checked by creating dummy variables for its categories and computing a correlation matrix.
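
The dummy-variable check described above is not shown in the notebook, so the following is only a sketch of how it might look. The `toy` frame and its values are hypothetical, and the `bp` prefix is an arbitrary choice.

```python
import pandas as pd

# Hypothetical rows (not from the dataset)
toy = pd.DataFrame({
    "blood_pressure": [
        "Healthy", "Elevated", "Stage 1 Hypertension",
        "Stage 2 Hypertension", "Healthy", "Stage 2 Hypertension",
    ],
    "cardio_disease": [0, 0, 1, 1, 0, 1],
})

# One 0/1 column per blood pressure category
dummies = pd.get_dummies(toy["blood_pressure"], prefix="bp", dtype=int)

# Correlation of each dummy column with the target
corr_with_target = dummies.corrwith(toy["cardio_disease"])
print(corr_with_target.sort_values())
```

To compare against all other features instead of only the target, the dummy columns can be joined back to the frame and passed to `DataFrame.corr()`.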

### 1.2.2 Create two data sets

In [None]:
df_cardio_copy1 = df_cardio.copy() # First dataset: categorical BMI and blood pressure features
df_cardio_copy2 = df_cardio.copy() # Second dataset: numeric BMI and blood pressure features

# Remove bmi, systolic, diastolic, height and weight from the first dataset
df_cardio_copy1.drop(['bmi', 'systolic', 'diastolic', 'height', 'weight'], axis=1, inplace=True)
# One-hot encode bmi_cat, blood_pressure and gender in the first dataset
df_cardio_copy1 = pd.get_dummies(df_cardio_copy1, columns=['bmi_cat', 'blood_pressure', 'gender'], drop_first=False)
df_cardio_copy1.head()
# Remove bmi_cat, blood_pressure, height and weight from the second dataset
df_cardio_copy2.drop(['bmi_cat', 'blood_pressure', 'height', 'weight'], axis=1, inplace=True)
# One-hot encode gender in the second dataset
df_cardio_copy2 = pd.get_dummies(df_cardio_copy2, columns=['gender'], drop_first=False)
df_cardio_copy2.head()


### 1.2.3 Execution

Three machine learning algorithms were implemented: Random Forest, Logistic Regression, and K-Nearest Neighbors (KNN).
- GridSearchCV was used for hyperparameter tuning with cross-validation.
- Each dataset was split into training and test sets (70/30), and model performance was evaluated using accuracy and a classification report.
- The best-performing model was analyzed for each combination of algorithm and dataset.
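
The parameter grids in this section use keys such as `logisticregression__C`. With `make_pipeline`, each step is named after its class in lowercase, and a grid key is the step name, two underscores, and the parameter name. The sketch below shows how these names could be checked before running a search; the small `param_grid` is an illustrative assumption.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipe = make_pipeline(StandardScaler(), LogisticRegression())

# Step names generated by make_pipeline
step_names = [name for name, _ in pipe.steps]
print(step_names)  # ['standardscaler', 'logisticregression']

# Every tunable parameter of the pipeline, in "<step>__<param>" form
valid_keys = set(pipe.get_params().keys())

# Illustrative grid; any key missing from valid_keys would fail in GridSearchCV
param_grid = {"logisticregression__C": [0.1, 1, 10]}
unknown = [key for key in param_grid if key not in valid_keys]
print(unknown)  # []
```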

In [None]:
# Import the necessary libraries
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score, classification_report
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
# Split the data into features and target in the first dataset
X1, y1 = df_cardio_copy1.drop('cardio_disease', axis=1), df_cardio_copy1['cardio_disease']
# Split the data into training and testing sets
X_train1, X_test1, y_train1, y_test1 = train_test_split(X1, y1, test_size=0.3, random_state=1)

# Split the data into features and target in the second dataset
X2, y2 = df_cardio_copy2.drop('cardio_disease', axis=1), df_cardio_copy2['cardio_disease'] # Split the data into features and target
# Split the data into training and testing sets
X_train2, X_test2, y_train2, y_test2 = train_test_split(X2, y2, test_size=0.3, random_state=42)

# Create a pipeline with a scaler and random forest classifier
pipe = make_pipeline(StandardScaler(), RandomForestClassifier(random_state=1))
# create a parameter grid: map the parameter names to the values that should be searched for the pipeline
param_grid = {
    'randomforestclassifier__n_estimators': [50, 100, 200], # Number of trees in the forest
    'randomforestclassifier__max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None] # Maximum depth of the tree
}
# Create a grid search object (despite its name, grid_knn tunes the Random Forest pipeline)
grid_knn = GridSearchCV(pipe, param_grid, cv=5, n_jobs=-1)

In [None]:
# Fit Grid Search on the first dataset
grid_knn.fit(X_train1, y_train1)

# Evaluate the best model on the testing data
best_model1 = grid_knn.best_estimator_
y_pred1 = best_model1.predict(X_test1)

# Calculate accuracy and other relevant metrics
accuracy1 = accuracy_score(y_test1, y_pred1)
report1 = classification_report(y_test1, y_pred1)

# Hyperparameter Analysis
best_params1 = grid_knn.best_params_

# Data Collection for Presentation
# Collect relevant data for presentation

# Repeat the above steps for the second dataset
grid_knn.fit(X_train2, y_train2)
best_model2 = grid_knn.best_estimator_
y_pred2 = best_model2.predict(X_test2)
accuracy2 = accuracy_score(y_test2, y_pred2)
report2 = classification_report(y_test2, y_pred2)
best_params2 = grid_knn.best_params_



In [None]:
from sklearn.linear_model import LogisticRegression
# Define the parameter grid for GridSearchCV
param_grid_logic = {
    'logisticregression__C': [0.01, 0.1, 1, 10, 100],  # Explore different regularization strengths
    'logisticregression__penalty': ['l2'],  # Limit search to L2 penalty (can be modified)
    'logisticregression__solver': ['lbfgs', 'saga' ],  # Consider including other solvers
    'logisticregression__max_iter': [10000],  # Increase max_iter for convergence
}
# Create a pipeline for Logistic Regression with preprocessing
pipe_logic = make_pipeline(StandardScaler(), LogisticRegression(random_state=1))
# Create the GridSearchCV object for hyperparameter tuning
grid_search_logic = GridSearchCV(pipe_logic, param_grid_logic, cv=10, n_jobs=-1, scoring='accuracy')
# Fit Grid Search on the first dataset
grid_search_logic.fit(X_train1, y_train1)

# Evaluate the best logistic regression model on the testing data
best_model_logic1 = grid_search_logic.best_estimator_
y_pred_logic1 = best_model_logic1.predict(X_test1)

# Calculate accuracy and other relevant metrics
accuracy_logic1 = accuracy_score(y_test1, y_pred_logic1)
report_logic1 = classification_report(y_test1, y_pred_logic1)

# Hyperparameter Analysis
best_params_logic1 = grid_search_logic.best_params_

# Data Collection for Presentation
# Collect relevant data for presentation

# Repeat the above steps for the second dataset
grid_search_logic.fit(X_train2, y_train2)
best_model_logic2 = grid_search_logic.best_estimator_
y_pred_logic2 = best_model_logic2.predict(X_test2)
accuracy_logic2 = accuracy_score(y_test2, y_pred_logic2)
report_logic2 = classification_report(y_test2, y_pred_logic2)
best_params_logic2 = grid_search_logic.best_params_



In [None]:
# import the necessary libraries
from sklearn.neighbors import KNeighborsClassifier
# Create a pipeline for the KNN model
pipe_knn = make_pipeline(StandardScaler(), KNeighborsClassifier())
# Define the hyperparameter grid for the KNN model
param_grid_knn = {
    'kneighborsclassifier__n_neighbors': range(1, 21),
    'kneighborsclassifier__weights': ['uniform', 'distance'],
    'kneighborsclassifier__p': [1, 2], 
}
# Create the GridSearchCV object
grid_search_knn = GridSearchCV(pipe_knn, param_grid_knn, cv=5, n_jobs=-1)
# Fit Grid Search on the first dataset
grid_search_knn.fit(X_train1, y_train1)

# Evaluate the best KNN model on the testing data
best_model_knn1 = grid_search_knn.best_estimator_
y_pred_knn1 = best_model_knn1.predict(X_test1)

# Calculate accuracy and other relevant metrics
accuracy_knn1 = accuracy_score(y_test1, y_pred_knn1)
report_knn1 = classification_report(y_test1, y_pred_knn1)

# Hyperparameter Analysis
best_params_knn1 = grid_search_knn.best_params_

# Data Collection for Presentation
# Collect relevant data for presentation

# Repeat the above steps for the second dataset
grid_search_knn.fit(X_train2, y_train2)
best_model_knn2 = grid_search_knn.best_estimator_
y_pred_knn2 = best_model_knn2.predict(X_test2)
accuracy_knn2 = accuracy_score(y_test2, y_pred_knn2)
report_knn2 = classification_report(y_test2, y_pred_knn2)
best_params_knn2 = grid_search_knn.best_params_

In [None]:
# Create an empty list to store the results
results_list = []

# Define a function to append results to the list
def append_results(model_name, dataset_name, accuracy, best_params, report):
    # 'report' is the text returned by sklearn's classification_report
    results_list.append({
        'Model': model_name,
        'Dataset': dataset_name,
        'Accuracy': accuracy,
        'Best Parameters': best_params,
        'Classification Report': report
    })

# Append results for the RandomForestClassifier
append_results('Random Forest', 'Dataset 1', accuracy1, best_params1, report1)
append_results('Random Forest', 'Dataset 2', accuracy2, best_params2, report2)

# Append results for the Logistic Regression model
append_results('Logistic Regression', 'Dataset 1', accuracy_logic1, best_params_logic1, report_logic1)
append_results('Logistic Regression', 'Dataset 2', accuracy_logic2, best_params_logic2, report_logic2)

# Append results for the KNN model
append_results('KNN', 'Dataset 1', accuracy_knn1, best_params_knn1, report_knn1)
append_results('KNN', 'Dataset 2', accuracy_knn2, best_params_knn2, report_knn2)

# Convert the list of dictionaries into a DataFrame
results_df = pd.DataFrame(results_list)

# Display the results DataFrame
print(results_df)

Based on the analysis, Logistic Regression and Random Forest demonstrated comparable performance, while Logistic Regression may run faster. Logistic Regression could therefore be favored when computational efficiency is the priority, whereas Random Forest remains a strong choice when maximizing predictive accuracy matters most.
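
The trade-off between runtime and accuracy can be measured rather than estimated. The sketch below uses `cross_validate`, which reports fit times alongside scores. It runs on synthetic data from `make_classification`, and the model settings are assumptions, so the numbers it prints say nothing about the cardiovascular dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic data standing in for the real features and target
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

summary = {}
for name, model in models.items():
    # cross_validate returns per-fold fit times and test scores
    cv = cross_validate(model, X, y, cv=5, scoring="accuracy")
    summary[name] = {
        "mean_fit_time_s": cv["fit_time"].mean(),
        "mean_accuracy": cv["test_score"].mean(),
    }

for name, stats in summary.items():
    print(f"{name}: {stats['mean_fit_time_s']:.3f} s per fit, accuracy {stats['mean_accuracy']:.3f}")
```

For the grid searches in this notebook, comparable figures are already stored in each fitted `GridSearchCV` object under `cv_results_['mean_fit_time']` and `cv_results_['mean_test_score']`.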

### 1.3.1 Save model

In [205]:
import joblib
# Load the data
df = pd.read_csv('../Lab/data/cardio_train.csv', sep=';', index_col='id')
# Convert age from days to whole years
df['age'] = (df['age'] / 365).round()
# Display the first rows
df.head()

| id | age | gender | height | weight | ap_hi | ap_lo | cholesterol | gluc | smoke | alco | active | cardio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 50.0 | 2 | 168 | 62.0 | 110 | 80 | 1 | 1 | 0 | 0 | 1 | 0 |
| 1 | 55.0 | 1 | 156 | 85.0 | 140 | 90 | 3 | 1 | 0 | 0 | 1 | 1 |
| 2 | 52.0 | 1 | 165 | 64.0 | 130 | 70 | 3 | 1 | 0 | 0 | 0 | 1 |
| 3 | 48.0 | 2 | 169 | 82.0 | 150 | 100 | 1 | 1 | 0 | 0 | 1 | 1 |
| 4 | 48.0 | 1 | 156 | 56.0 | 100 | 60 | 1 | 1 | 0 | 0 | 0 | 0 |


In [208]:
# Split the data into features and target
X, y = df.drop('cardio', axis=1), df['cardio']

# Randomly hold out 100 samples; train_test_split returns the training parts before the test parts
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=100, random_state=42)

# Export the 100 held-out samples to a CSV file
X_test.to_csv('test_samples.csv', index=False)

# Refit the best model on the remaining rows so the exported samples stay unseen.
# Note: this refits the pipeline on the raw columns of df, not on the engineered features of dataset 2.
best_model_logic2.fit(X_train, y_train)

# Save the trained model
joblib.dump(best_model_logic2, 'trained_model.pkl', compress=True)


['trained_model.pkl']

### 1.3.2 Load Model 

In [209]:
# Load test samples and the trained model
test_samples = pd.read_csv('test_samples.csv')
model = joblib.load('trained_model.pkl')

# Make predictions on the test samples
predictions = model.predict(test_samples)

# Get probabilities for each class
probabilities = model.predict_proba(test_samples)
prob_class_0 = probabilities[:, 0]
prob_class_1 = probabilities[:, 1]

# Create a DataFrame for predictions
prediction_df = pd.DataFrame({
    'probability class 0': prob_class_0,
    'probability class 1': prob_class_1,
    'prediction': predictions
})

# Export predictions to CSV
prediction_df.to_csv('prediction.csv', index=False)
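
A quick sanity check for the save and load steps is to confirm that a restored model predicts exactly what the original did. The following is a self-contained sketch on synthetic data; the temporary file path and the small pipeline are assumptions for illustration and do not touch `trained_model.pkl`.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data and a small pipeline standing in for the real model
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
model = make_pipeline(StandardScaler(), LogisticRegression()).fit(X, y)

# Save to a temporary file and load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pkl")
    joblib.dump(model, path, compress=True)
    restored = joblib.load(path)

# The restored model should reproduce the original predictions
same_predictions = np.array_equal(model.predict(X), restored.predict(X))

# Each row of predict_proba holds one probability per class and sums to 1
probabilities = restored.predict_proba(X)
rows_sum_to_one = np.allclose(probabilities.sum(axis=1), 1.0)

print(same_predictions, rows_sum_to_one)
```

A loaded model also expects the same columns it was trained on, so the test CSV has to keep the column set used at fit time.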
