# **_Stroke Prediction Project_**

Sirisha Mandava, Jeff Boczkaja, Mohamed Altoobli, Jesse Kranyak

Using the Stroke Prediction Dataset from Kaggle, we set out to build a machine learning program that can accurately predict whether someone will have a stroke. We try several models, which give varying results, and report those results with a few metrics: balanced accuracy, F1 score, precision, and recall.

Source: https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset

## What do the metrics measure?

### <u>Precision</u>
Precision measures the accuracy of positive predictions. It is the ratio of true positive predictions to the total number of positive predictions made. In other words, it answers the question, "Of all the instances the model predicted as positive, how many are actually positive?" Precision is particularly important in scenarios where the cost of a false positive is high.

Formula: Precision = True Positives / (True Positives + False Positives)

### <u>Recall</u>
Recall, also known as sensitivity or true positive rate, measures the ability of a model to find all the relevant cases within a dataset. It is the ratio of true positive predictions to the total number of actual positives. Recall answers the question, "Of all the actual positives, how many did the model successfully identify?" Recall is crucial in situations where missing a positive instance is costly.

Formula: Recall = True Positives / (True Positives + False Negatives)

### <u>F1 Score</u>
The F1 Score is the harmonic mean of precision and recall. It provides a single metric that balances the two, which is particularly useful when you want to compare two or more models. The F1 Score is especially valuable when the distribution of class labels is imbalanced. A high F1 Score indicates that the model produces few false positives and few false negatives for the positive class; it does not take true negatives into account.

Formula: F1 Score = 2 * (Precision * Recall) / (Precision + Recall)

### <u>Balanced Accuracy Score</u>
Balanced Accuracy Score is defined as the average of recall obtained on each class, meaning it considers both the true positive rate and the true negative rate. It calculates the accuracy of the model by taking into account the balance between classes. For a binary classification problem, it would be the average of the proportion of correctly predicted positive observations to the total positive observations and the proportion of correctly predicted negative observations to the total negative observations.

Formula: Balanced Accuracy Score = (1/2) * ((TP / (TP + FN)) + (TN / (TN + FP)))
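
As a quick check of the four formulas above, the following sketch computes each metric from raw confusion-matrix counts. The counts are made up for illustration and are not results from this project; the function name `metrics_from_counts` is also hypothetical.

```python
def metrics_from_counts(tp, fp, fn, tn):
    """Compute the four metrics described above from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)            # true positive rate
    specificity = tn / (tn + fp)       # true negative rate
    f1 = 2 * precision * recall / (precision + recall)
    balanced_accuracy = (recall + specificity) / 2
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "balanced_accuracy": balanced_accuracy,
    }

# Illustrative counts only (not taken from the stroke dataset)
scores = metrics_from_counts(tp=30, fp=10, fn=20, tn=940)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

With a large number of true negatives, plain accuracy would look excellent here, while recall and balanced accuracy show how many positive cases are still being missed.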

In [101]:
!pip install optuna
!pip install --upgrade tensorflow-datasets tensorflow-hub tensorflow-io-gcs-filesystem tensorflow-metadata tensorflow-probability




In [None]:
!pip install -U scikit-learn imbalanced-learn




# Main Project

## 1. Importing Data

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from imblearn.over_sampling import SMOTE
from imblearn.over_sampling import SMOTENC
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.metrics import balanced_accuracy_score, classification_report, confusion_matrix, precision_recall_fscore_support
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import RandomizedSearchCV

In [None]:
df = pd.read_csv("healthcare-dataset-stroke-data.csv")
df

## 2. Analyzing and Exploring our Data

In [None]:
# Let's loop through all of our columns and see what data they contain

def describe_df(df: pd.DataFrame):
    print(f"The dataset contains {df.shape[1]} columns and {len(df)} rows")
    for col in df.columns:
        col_dtype = df[col].dtype
        print(f"\nColumn: {col} ({col_dtype})")
        if col_dtype == 'object':
            print(f"--- Percentage of NaNs: {df[col].isna().sum() / len(df[col]) * 100}")
            print(f"--- Unique values:\n {df[col].unique()}")
        else:
            print(f"--- Summary statistics:\n {df[col].describe()}")
describe_df(df)

### Check balance of our target which is 'stroke'

In [None]:
df['stroke'].value_counts() # We have pretty imbalanced data!

### Drop unneeded column of 'id'

In [None]:
df = df.drop('id', axis=1)

### Check nulls and use Imputation to replace them

In [None]:
df.isnull().sum()

In [None]:
df_original = df.copy()

In [None]:
from sklearn.impute import SimpleImputer
import pandas as pd

# Create an imputer object with a mean filling strategy
mean_imputer = SimpleImputer(strategy='mean')

# Apply the imputer to the 'bmi' column
df['bmi'] = mean_imputer.fit_transform(df[['bmi']])

# Check if any null values remain
print(df['bmi'].isnull().sum())

Let's check how mean imputation changed the BMI distribution.

In [None]:
print("After Imputation:")
print(df['bmi'].describe())
print("Before Imputation:")
print(df_original['bmi'].describe())

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 6))
sns.histplot(df_original['bmi'], color="red", label="Before Imputation", kde=True, stat="density", linewidth=0)
sns.histplot(df['bmi'], color="blue", label="After Imputation", kde=True, stat="density", linewidth=0)
plt.legend(title="BMI Distribution")
plt.title("Comparison of BMI Distribution Before and After Imputation")
plt.xlabel("BMI")
plt.ylabel("Density")
plt.show()

In [None]:
# Set the aesthetic style of the plots
sns.set_style("whitegrid")

# Plotting the distribution of ages
plt.figure(figsize=(10, 5))
sns.histplot(df['age'], bins=30, kde=True, color="skyblue")
plt.title('Distribution of Ages')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()

### Heatmap of Numerical Factors

In [None]:
df_corr = df.drop(['gender', 'ever_married', 'work_type', 'Residence_type', 'smoking_status'], axis=1)

corr = df_corr.corr()

plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation between Individual Factors and Stroke')
plt.show()

### Glucose Levels by Different Age Groups

In [None]:
# Creating age groups
df['age_group'] = pd.cut(df['age'], bins=[0, 18, 30, 40, 50, 60, 70, 80, 90, 100], labels=['0-18', '19-30', '31-40', '41-50', '51-60', '61-70', '71-80', '81-90', '91-100'])

plt.figure(figsize=(12, 8))
sns.boxplot(x='age_group', y='avg_glucose_level', data=df, palette="coolwarm")
plt.title('Distribution of Average Glucose Levels Across Different Age Groups')
plt.xlabel('Age Group')
plt.ylabel('Average Glucose Level')
plt.xticks(rotation=45)
plt.show()

## Who is having the strokes?

In [None]:
# Calculate the minimum age of someone who had a stroke
min_age_stroke = df[df['stroke'] == 1]['age'].min()
print(f'Youngest person in data with stroke: {min_age_stroke} years')

In [None]:
# # Assuming 'df' is your DataFrame containing the stroke data
# min_age_stroke = df[df['stroke'] == 1]['age'].min()

under_50_stroke = df[(df['age'] < 50) & (df['stroke'] == 1)].shape[0]
total_under_50 = df[df['age'] < 50].shape[0]
percentage_under_50_stroke = (under_50_stroke / total_under_50) * 100

over_50_stroke = df[(df['age'] >= 50) & (df['stroke'] == 1)].shape[0]
total_over_50 = df[df['age'] >= 50].shape[0]
percentage_over_50_stroke = (over_50_stroke / total_over_50) * 100

# Pie Charts for strokes based on age
labels = ['Had Stroke', 'No Stroke']
sizes_under_50 = [percentage_under_50_stroke, 100 - percentage_under_50_stroke]
sizes_over_50 = [percentage_over_50_stroke, 100 - percentage_over_50_stroke]
fig, axs = plt.subplots(1, 2, figsize=(14, 7))

# Pie chart for individuals under 50
axs[0].pie(sizes_under_50, labels=labels, autopct='%1.1f%%', startangle=140, colors=['lightcoral', 'lightblue'])
axs[0].set_title('Percentage of People Under 50 Having a Stroke')

# Pie chart for individuals 50 and older
axs[1].pie(sizes_over_50, labels=labels, autopct='%1.1f%%', startangle=140, colors=['lightcoral', 'lightblue'])
axs[1].set_title('Percentage of People 50 and Older Having a Stroke')

plt.show()

## 3. Encoding our data for use in machine learning

In machine learning, encoding data is essential for preparing categorical variables to be used as input in algorithms. Since most machine learning models require numerical data, categorical variables such as gender, smoking status, or work type need to be encoded into numerical form. This process ensures that the model can effectively interpret and learn from these features, enabling it to make accurate predictions or classifications based on the input data.
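
To make the idea concrete, here is a minimal pure-Python sketch of two common encodings: integer codes (similar in spirit to the `cat.codes` approach used below, which numbers categories in sorted order) and one-hot vectors. The sample values and helper names are illustrative assumptions, not part of the project code.

```python
def integer_encode(values):
    """Map each category to an integer, numbering categories in sorted order."""
    categories = sorted(set(values))
    mapping = {cat: code for code, cat in enumerate(categories)}
    return [mapping[v] for v in values], mapping

def one_hot_encode(values):
    """Map each category to a 0/1 vector containing a single 1."""
    categories = sorted(set(values))
    return [[1 if v == cat else 0 for cat in categories] for v in values]

# Toy sample of a categorical column
smoking = ["never smoked", "smokes", "formerly smoked", "smokes"]
codes, mapping = integer_encode(smoking)
one_hot = one_hot_encode(smoking)
print(mapping)
print(codes)
print(one_hot)
```

Integer codes imply an ordering between categories. Tree-based models tend to tolerate this, but distance-based models such as KNN may read the codes as meaningful distances; one-hot encoding avoids that at the cost of extra columns.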

### Check data types, we will convert objects into categorical variables to be encoded

In [None]:
df.dtypes

In [None]:
# Define categorical features for encoding
catFeatures = ['gender','ever_married','work_type','Residence_type','smoking_status']
# Describe the categorical features to see the number of unique categories in each
df[catFeatures].describe(include='all').loc['unique', :]

### Convert objects to categorical variables

In [None]:
# Convert categorical columns to 'category' dtype for efficient encoding
df[['gender','ever_married','work_type','Residence_type','smoking_status']] = df[['gender','ever_married','work_type','Residence_type','smoking_status']].astype('category')
df.dtypes

In [None]:
# Encode categorical features as integers
for column in ['gender', 'ever_married', 'work_type', 'Residence_type', 'smoking_status']:
    df[column] = df[column].astype('category').cat.codes

In [None]:
# Print the unique values in the encoded categorical columns for verification
for column in ['gender', 'ever_married', 'work_type', 'Residence_type', 'smoking_status']:
    unique_values = df[column].unique()
    print(f"Unique values in '{column}': {unique_values}")

### Check counts on gender, see if it is significant

In [None]:
# Check the distribution of values in the 'gender' column
df['gender'].value_counts() # We'll treat it as a binary!

In [None]:
df.head()

### Create synthetic balance in the dataset using SMOTE

Due to the imbalance in our dataset we utilize SMOTE and SMOTENC to create synthetic data to improve the outcomes of our machine learning models.

**You can choose either model, press 'ctrl + /' to uncomment or comment out code choice.** \
Rerun model with new choices for different outcomes

## SMOTE
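We will use SMOTE to create synthetic data for both the training and test sets. Because the test set is also resampled, the scores reported below are measured on partly synthetic data rather than on the original class distribution.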
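
The core idea behind SMOTE can be sketched in a few lines: pick a minority-class sample, pick one of its nearest minority-class neighbors, and create a new point somewhere on the line segment between them. The sketch below is a simplified illustration in pure Python, not the `imblearn` implementation; the function name and toy data are assumptions.

```python
import math
import random

def smote_like_sample(minority, k=2, rng=None):
    """Create one synthetic point by interpolating between a minority sample
    and one of its k nearest minority-class neighbors."""
    rng = rng or random.Random()
    base = rng.choice(minority)
    # Rank the other minority points by Euclidean distance to the base point
    others = [p for p in minority if p is not base]
    others.sort(key=lambda p: math.dist(base, p))
    neighbor = rng.choice(others[:k])
    gap = rng.random()  # position along the segment, in [0, 1)
    return [b + gap * (n - b) for b, n in zip(base, neighbor)]

# Toy minority-class points with two features
minority_points = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
synthetic = smote_like_sample(minority_points, k=2, rng=random.Random(0))
print(synthetic)
```

Because every synthetic point is an interpolation, it always lies between two existing minority samples, which is why SMOTE on its own suits continuous features better than categorical codes.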

In [None]:
oversampled = SMOTE()
eval_df = df[['gender','age','hypertension','heart_disease','smoking_status','avg_glucose_level','bmi','stroke']].sample(int(df.shape[0]*0.2),random_state=42)
train_df = df.drop(index=eval_df.index)

X_test,y_test = eval_df[['gender','age','hypertension','heart_disease','smoking_status','avg_glucose_level','bmi']], eval_df['stroke']
X_train,y_train = train_df[['gender','age','hypertension','heart_disease','smoking_status','avg_glucose_level','bmi']], train_df['stroke']


X_train, y_train = oversampled.fit_resample(X_train,y_train)
usampled_df = X_train.assign(Stroke = y_train)

X_test,y_test = oversampled.fit_resample(X_test,y_test)
usampled_eval_df = X_test.assign(Stroke = y_test)

## SMOTENC
Another option is SMOTENC, a SMOTE variant that handles a mix of categorical and continuous features. Here we create synthetic data only for the training data.

In [None]:
# # Run train test split
# X = df.drop(['stroke'], axis=1)
# y = df['stroke']

# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=85)

In [None]:
# oversample = SMOTENC(categorical_features=[0,2,3,4,5,6,9],
#                     random_state=27,  # for reproducibility
#                     sampling_strategy='auto')

# print('Original class distribution: \n')
# print(y_train.value_counts())

# # Resample the training data only; the test set keeps its original distribution
# X_train, y_train = oversample.fit_resample(X_train, y_train)

# print('-----------------------------------------')
# print('Synthetic sample class distribution: \n')
# print(y_train.value_counts())

## 4. Choose scaling method

**You can choose either model, press 'ctrl + /' to uncomment or comment out code choice.**

<u>Normalization</u> rescales the features to a fixed range, usually 0 to 1.

Advantages:

 - Useful when you need to bound your values between a specific range.
 - Maintains the original distribution without distorting differences in the ranges of values.

Disadvantages:

 - If your data contains outliers, normalization can squash the "normal" data into a small portion of the range, reducing the algorithm's ability to learn from it.

<u>Standardization</u> rescales data so that it has a mean of 0 and a standard deviation of 1.

Advantages:

 - Standardization does not bound values to a specific range, which might be useful for certain algorithms that assume no specific range.
 - Less sensitive to outliers than normalization, although the mean and standard deviation are still affected by extreme values.

Disadvantages:

 - The resulting distribution will have a mean of 0 and a standard deviation of 1, but it might not be suitable for algorithms that expect input data to be within a bounded range.
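
The difference between the two approaches is easiest to see on a small column that contains an outlier. The following sketch uses only the standard library; the glucose values are invented for illustration, and the helper names are hypothetical.

```python
from statistics import mean, pstdev

def min_max_scale(values):
    """Normalization: rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Standardization: subtract the mean, divide by the population std."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

glucose = [80.0, 90.0, 100.0, 110.0, 270.0]  # the last value is an outlier
normalized = min_max_scale(glucose)
standardized = standardize(glucose)
print([round(v, 2) for v in normalized])
print([round(v, 2) for v in standardized])
```

In the normalized output the four typical values end up squeezed into roughly the bottom sixth of the range, which is the outlier effect described above. The standardized values are not bounded, but they are centered on 0 with unit spread.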

### Normalization

In [None]:
# from sklearn.preprocessing import MinMaxScaler

# # Selecting numerical columns that need normalization
# numerical_cols = ['age', 'avg_glucose_level', 'work_type', 'bmi', 'smoking_status']

# # Initialize the MinMaxScaler
# scaler = MinMaxScaler()

# # Fit on training data
# scaler.fit(X_train[numerical_cols])

# # Transform both training and testing data
# X_train[numerical_cols] = scaler.transform(X_train[numerical_cols])
# X_test[numerical_cols] = scaler.transform(X_test[numerical_cols])

### Standardization

In [None]:
from sklearn.preprocessing import StandardScaler

# Selecting numerical columns that need standardization
numerical_cols = ['age', 'avg_glucose_level', 'smoking_status', 'bmi']

# Initialize the StandardScaler
scaler = StandardScaler()

# Fit on training data
scaler.fit(X_train[numerical_cols])

# Transform both training and testing data
X_train[numerical_cols] = scaler.transform(X_train[numerical_cols])
X_test[numerical_cols] = scaler.transform(X_test[numerical_cols])

In [None]:
# Verify processing worked
X_train.head(3)

In [None]:
# Verify processing worked
X_test.head(3)

In [None]:
display(X_train.shape)
display(X_train.info())
display(X_train.describe())
display(X_train.columns)

## 5. Decision Tree

A decision tree is a hierarchical model that helps in making decisions by mapping out possible outcomes based on different conditions. It's a visual representation where each branch represents a decision based on features in the data, ultimately leading to a prediction or classification.
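
To illustrate how a tree might choose a branch, the sketch below scores one candidate split with Gini impurity, which is the default criterion of scikit-learn's `DecisionTreeClassifier`. This is a simplified pure-Python illustration; the toy rows, labels, and function names are assumptions.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0 for a pure node, 0.5 for an even binary mix."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_impurity(rows, labels, feature_index, threshold):
    """Weighted Gini impurity after splitting on row[feature_index] <= threshold."""
    left = [y for x, y in zip(rows, labels) if x[feature_index] <= threshold]
    right = [y for x, y in zip(rows, labels) if x[feature_index] > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Toy data: [age, hypertension] with a made-up stroke label
rows = [[25, 0], [40, 0], [62, 1], [70, 1], [55, 0], [80, 1]]
labels = [0, 0, 1, 1, 0, 1]
parent_impurity = gini(labels)
child_impurity = split_impurity(rows, labels, feature_index=0, threshold=58)
print(parent_impurity, child_impurity)
```

A tree-building algorithm evaluates many such candidate thresholds and keeps the one that lowers impurity the most, then repeats the process inside each branch.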

In [None]:
from sklearn.tree import DecisionTreeClassifier

model_dt = DecisionTreeClassifier()
model_dt.fit(X_train,y_train)
y_pred = model_dt.predict(X_test)

In [None]:
y_pred_train = model_dt.predict(X_train)
y_pred_test = model_dt.predict(X_test)
print(y_pred_test)

In [None]:
print('Classification Report for Testing:')
print(classification_report(y_test, y_pred_test))

In [None]:
dt_bas = round(balanced_accuracy_score(y_test, y_pred),2)
print(f'Decision Tree balanced accuracy score {dt_bas}')

In [None]:
# Place scores in dictionary
metrics_test = precision_recall_fscore_support(y_test, y_pred, average='binary')
dt_results = {
    'Method': 'Decision Tree',
    'Precision': round(metrics_test[0],2),
    'Recall': round(metrics_test[1],2),
    'F1 Score': round(metrics_test[2],2),
    'Balanced Accuracy': dt_bas
}

In [None]:
dt_results

In [None]:
cm = confusion_matrix(y_test, y_pred)

plt.figure(figsize=(6, 6))
sns.heatmap(cm, annot=True, cmap='Blues', fmt='g')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix')
plt.show()

### We will find a good max_depth to run with our model to see if we can improve

In [None]:
models = {'train_score': [], 'test_score': [], 'max_depth': []}

for depth in range(1,15):
    models['max_depth'].append(depth)
    model = DecisionTreeClassifier( max_depth=depth)
    model.fit(X_train, y_train)
    y_test_pred = model.predict(X_test)
    y_train_pred = model.predict(X_train)

    models['train_score'].append(balanced_accuracy_score(y_train, y_train_pred))
    models['test_score'].append(balanced_accuracy_score(y_test, y_test_pred))

models_df = pd.DataFrame(models)

In [None]:
models_df.plot(x='max_depth')

You want to pick the max_depth where the test_score peaks.
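
Instead of reading the peak off the plot, the best depth can also be picked programmatically. The sketch below uses hypothetical scores laid out like the `models` dictionary built in the cell above; the numbers are invented for illustration.

```python
# Hypothetical scores in the same layout as the `models` dictionary
models_example = {
    'max_depth': [1, 2, 3, 4, 5],
    'train_score': [0.70, 0.76, 0.81, 0.88, 0.95],
    'test_score': [0.69, 0.74, 0.78, 0.77, 0.73],
}

# Index of the highest test score, then the depth at that index
best_index = max(
    range(len(models_example['test_score'])),
    key=lambda i: models_example['test_score'][i],
)
best_depth = models_example['max_depth'][best_index]
print(f"Best max_depth: {best_depth}")
```

Note that choosing a hyperparameter on the test set makes the resulting test score an optimistic estimate; a separate validation split or cross-validation would give a cleaner one.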

In [None]:
model = DecisionTreeClassifier(max_depth=7, random_state=42) # Insert max_depth from above graph
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
dt_md_bas = round(balanced_accuracy_score(y_test, y_pred),2)

In [None]:
print(f'Decision Tree with adjusted max_depth balanced accuracy score: {dt_md_bas}') # The tuning should increase the score!

In [None]:
cm = confusion_matrix(y_test, y_pred)

# Displaying the confusion matrix as a heatmap using Seaborn
plt.figure(figsize=(6, 6))
sns.heatmap(cm, annot=True, cmap='Blues', fmt='g')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix')
plt.show()

In [None]:
print('Classification Report for Testing:')
print(classification_report(y_test, y_pred))

In [None]:
# Place scores in dictionary
metrics_test = precision_recall_fscore_support(y_test, y_pred, average='binary')
dt_md_results = {
    'Method': 'Decision Tree max_depth',
    'Precision': round(metrics_test[0],2),
    'Recall': round(metrics_test[1],2),
    'F1 Score': round(metrics_test[2],2),
    'Balanced Accuracy': dt_md_bas
}

In [None]:
dt_md_results

### 5.5 PCA

Applying PCA before a Random Forest classifier can reduce dimensionality and computational cost and may improve generalization by removing noise. However, it obscures the interpretation of feature importances, and depending on the dataset it can either improve or degrade performance. We choose to run it here.
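
For intuition about what PCA keeps and discards, the sketch below computes how much of the total variance the first principal component captures for two-dimensional data, using the closed-form eigenvalues of the 2x2 covariance matrix. It is a pure-Python illustration with toy values, not the scikit-learn implementation, and the function name is hypothetical.

```python
import math

def explained_variance_ratio_2d(xs, ys):
    """Share of total variance captured by the first principal component
    of two-dimensional data, via the eigenvalues of the covariance matrix."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    var_x = sum((x - mx) ** 2 for x in xs) / (n - 1)
    var_y = sum((y - my) ** 2 for y in ys) / (n - 1)
    cov_xy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    # Eigenvalues of the symmetric matrix [[var_x, cov_xy], [cov_xy, var_y]]
    half_trace = (var_x + var_y) / 2
    spread = math.sqrt(((var_x - var_y) / 2) ** 2 + cov_xy ** 2)
    largest, smallest = half_trace + spread, half_trace - spread
    return largest / (largest + smallest)

age = [20.0, 35.0, 50.0, 65.0, 80.0]
glucose = [85.0, 95.0, 100.0, 120.0, 130.0]  # toy values
ratio = explained_variance_ratio_2d(age, glucose)
print(round(ratio, 3))
```

When the number of components equals the number of input features, as with `n_components = 7` on the seven SMOTE-path columns, PCA rotates the data without discarding any variance, so the dimensionality is not actually reduced.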

In [None]:
pca_model = PCA(n_components = 7) # 7 for SMOTE, 10 for SMOTENC
pca_model.fit(X_train)

X_train_pca = pd.DataFrame(pca_model.transform(X_train))
X_test_pca = pd.DataFrame(pca_model.transform(X_test))
X_train_pca

### 6. Random Forest

A Random Forest is a machine learning method used in both classification and regression tasks. It operates by constructing a multitude of decision trees during training and outputs the mode of the individual trees' class predictions (for classification) or their average (for regression).
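
The aggregation step can be sketched as follows, assuming each tree has already produced its own prediction. The helper names and the example votes are illustrative.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Classification: return the most common class label among the trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

def average_prediction(tree_predictions):
    """Regression: return the mean of the trees' numeric predictions."""
    return sum(tree_predictions) / len(tree_predictions)

class_votes = [0, 1, 1, 0, 1]          # five trees voting on stroke / no stroke
forest_class = majority_vote(class_votes)
forest_value = average_prediction([2.0, 3.0, 4.0])
print(forest_class, forest_value)
```

scikit-learn's `RandomForestClassifier` takes a slightly softer approach: it averages the trees' predicted class probabilities and picks the class with the highest mean, which often agrees with a hard majority vote.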

In [None]:
model = RandomForestClassifier(n_estimators=100, class_weight='balanced', random_state=42)
model.fit(X_train_pca, y_train)

In [None]:
y_test_pred = model.predict(X_test_pca)
rf_bas = round(balanced_accuracy_score(y_test, y_test_pred),2)
print(f'Random Forest balanced accuracy score: {rf_bas}')

In [None]:
cm = confusion_matrix(y_test, y_test_pred)

# Displaying the confusion matrix as a heatmap using Seaborn
plt.figure(figsize=(6, 6))
sns.heatmap(cm, annot=True, cmap='Blues', fmt='g')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix')
plt.show()

In [None]:
print('Classification Report for Testing:')
print(classification_report(y_test, y_test_pred))

In [None]:
# Place scores in dictionary
metrics_test = precision_recall_fscore_support(y_test, y_test_pred, average='binary')
rf_results = {
    'Method': 'Random Forest',
    'Precision': round(metrics_test[0],2),
    'Recall': round(metrics_test[1],2),
    'F1 Score': round(metrics_test[2],2),
    'Balanced Accuracy': rf_bas
}

In [None]:
rf_results

### We will find a good max_depth to run with our model to see if we can improve

In [None]:
models = {'train_score': [], 'test_score': [], 'max_depth': []}

for depth in range(1,10):
    models['max_depth'].append(depth)
    model = RandomForestClassifier(n_estimators=100, max_depth=depth)
    model.fit(X_train_pca, y_train)
    y_test_pred = model.predict(X_test_pca)
    y_train_pred = model.predict(X_train_pca)

    models['train_score'].append(balanced_accuracy_score(y_train, y_train_pred))
    models['test_score'].append(balanced_accuracy_score(y_test, y_test_pred))

models_df = pd.DataFrame(models)

In [None]:
models_df.plot(x='max_depth')

You want to pick the max_depth where the test_score peaks.

### Apply best max_depth to Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100, max_depth=7, random_state=42) # Insert max_depth from above graph
model.fit(X_train, y_train)
forest_score = model.score(X_train, y_train)
forest_test = model.score(X_test, y_test)
y_pred = model.predict(X_test)
rf_md_bas = round(balanced_accuracy_score(y_test, y_pred),2)

In [None]:
print(f'Random Forest with adjusted max_depth balanced accuracy score: {rf_md_bas}') # The tuning should increase the score!

In [None]:
cm = confusion_matrix(y_test, y_pred)

# Displaying the confusion matrix as a heatmap using Seaborn
plt.figure(figsize=(6, 6))
sns.heatmap(cm, annot=True, cmap='Blues', fmt='g')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix')
plt.show()

In [None]:
print('Classification Report for Testing:')
print(classification_report(y_test, y_pred))

In [None]:
# Place scores in dictionary
metrics_test = precision_recall_fscore_support(y_test, y_pred, average='binary')
rf_md_results = {
    'Method': 'Random Forest max_depth',
    'Precision': round(metrics_test[0],2),
    'Recall': round(metrics_test[1],2),
    'F1 Score': round(metrics_test[2],2),
    'Balanced Accuracy': rf_md_bas
}

In [None]:
rf_md_results

## 7. K Nearest Neighbors

The k-nearest neighbors algorithm predicts the label of a data point based on the labels of its 'k' closest neighbors in the dataset. To classify a new instance, KNN calculates the distance between the instance and all points in the training set, identifies the 'k' nearest points, and then uses a majority vote among these neighbors to determine the instance's label. For regression tasks, it averages the values of these neighbors instead.
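
The voting procedure described above can be sketched in pure Python, including the `uniform` and `distance` weighting options explored later in this section. This is a brute-force illustration with toy data, not scikit-learn's implementation; the function name is hypothetical.

```python
import math
from collections import defaultdict

def knn_predict(train_X, train_y, query, k=3, weights="uniform"):
    """Classify `query` by a vote among its k nearest training points."""
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    votes = defaultdict(float)
    for distance, label in distances[:k]:
        if weights == "distance":
            votes[label] += 1.0 / (distance + 1e-12)  # closer neighbors count more
        else:
            votes[label] += 1.0                       # every neighbor counts equally
    return max(votes, key=votes.get)

train_X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [5.0, 6.0]]
train_y = [0, 0, 0, 1, 1]
near_origin = knn_predict(train_X, train_y, [0.5, 0.5], k=3)
near_cluster = knn_predict(train_X, train_y, [4.8, 5.2], k=3, weights="distance")
print(near_origin, near_cluster)
```

Because the prediction depends directly on distances, KNN is sensitive to feature scaling, which is one reason the features are scaled before this step.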

In [None]:
param_grid = {
    'n_neighbors': [1, 3, 5, 7, 9, 11, 13, 15, 17, 19],
    'weights': ['uniform', 'distance'],
    'leaf_size': [10, 50, 100, 500]
}
random_knn = RandomizedSearchCV(KNeighborsClassifier(), param_grid, verbose=3)

random_knn.fit(X_train_pca, y_train)

In [None]:
y_pred = random_knn.predict(X_test_pca)
knn_bas = round(balanced_accuracy_score(y_test, y_pred),2)
print(f'KNN balanced accuracy score: {knn_bas}')

In [None]:
cm = confusion_matrix(y_test, y_pred)

# Displaying the confusion matrix as a heatmap using Seaborn
plt.figure(figsize=(6, 6))
sns.heatmap(cm, annot=True, cmap='Blues', fmt='g')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix')
plt.show()

In [None]:
print('Classification Report for Testing:')
print(classification_report(y_test, y_pred))

In [None]:
# Place scores in dictionary
metrics_test = precision_recall_fscore_support(y_test, y_pred, average='binary')
knn_results = {
    'Method': 'KNN',
    'Precision': round(metrics_test[0],2),
    'Recall': round(metrics_test[1],2),
    'F1 Score': round(metrics_test[2],2),
    'Balanced Accuracy': knn_bas
}

In [None]:
knn_results

### Let's tune our KNN

In [None]:
# Define ranges and settings to explore
n_neighbors_range = range(1, 50)
weights_options = ['uniform', 'distance']
scores = {weight: [] for weight in weights_options}

for weight in weights_options:
    for n_neighbors in n_neighbors_range:
        knn = KNeighborsClassifier(n_neighbors=n_neighbors, weights=weight)
        knn.fit(X_train_pca, y_train)
        y_pred = knn.predict(X_test_pca)
        score = balanced_accuracy_score(y_test, y_pred)
        scores[weight].append(score)

In [None]:
plt.figure(figsize=(12, 6))
for weight in weights_options:
    plt.plot(n_neighbors_range, scores[weight], label=f'Weights: {weight}')
plt.xlabel('Number of Neighbors')
plt.ylabel('Balanced Accuracy Score')
plt.legend()
plt.title('KNN Performance: n_neighbors vs. Balanced Accuracy')
plt.xticks(list(n_neighbors_range))
plt.show()

Select whichever line, 'uniform' or 'distance', has the highest peak and use it for optimal_weights; use the number of neighbors at that peak for optimal_n_neighbors.

In [None]:
optimal_n_neighbors = 44 # Select from the above graph
optimal_weights = 'distance'

optimal_knn = KNeighborsClassifier(n_neighbors=optimal_n_neighbors, weights=optimal_weights)
optimal_knn.fit(X_train, y_train)
y_pred_optimal = optimal_knn.predict(X_test)

In [None]:
knn_md_bas = round(balanced_accuracy_score(y_test, y_pred_optimal),2)
print(f'KNN with adjusted neighbors balanced accuracy score: {knn_md_bas}')

In [None]:
cm = confusion_matrix(y_test, y_pred_optimal)

# Displaying the confusion matrix as a heatmap using Seaborn
plt.figure(figsize=(6, 6))
sns.heatmap(cm, annot=True, cmap='Blues', fmt='g')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix')
plt.show()

In [None]:
print('Classification Report for Testing:')
print(classification_report(y_test, y_pred_optimal))

In [None]:
# Place scores in dictionary
metrics_test = precision_recall_fscore_support(y_test, y_pred_optimal, average='binary')
knn_md_results = {
    'Method': 'KNN modified neighbors',
    'Precision': round(metrics_test[0],2),
    'Recall': round(metrics_test[1],2),
    'F1 Score': round(metrics_test[2],2),
    'Balanced Accuracy': knn_md_bas
}

In [None]:
knn_md_results

## Conclusion

In [None]:
all_results = []
all_results.append(dt_results)
all_results.append(dt_md_results)
all_results.append(rf_results)
all_results.append(rf_md_results)
all_results.append(knn_results)
all_results.append(knn_md_results)

df_results = pd.DataFrame(all_results)
df_results.set_index('Method', inplace=True)

In [None]:
ax = df_results.plot(kind='bar', figsize=(10, 6), width=0.8)
plt.title('Model Performance Comparison')
plt.ylabel('Score')
plt.xticks(rotation=45)
plt.grid(axis='y', linestyle='--', linewidth=0.7)

for p in ax.patches:
    ax.annotate(str(round(p.get_height(), 2)), (p.get_x() * 1.005, p.get_height() * 1.005), fontsize=9)

plt.legend(title='Metric', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()

In [None]:
metrics = ['Precision', 'F1 Score', 'Recall', 'Balanced Accuracy']
for metric in metrics:
    max_value = df_results[metric].max()
    max_model = df_results[df_results[metric] == max_value].index[0]
    print(f"Model with highest {metric}: {max_model} ({max_value})")

The initial analysis of the dataset revealed a significant class imbalance, raising concerns that the models would be biased toward the majority (no-stroke) class. To mitigate this, we experimented with both SMOTE and SMOTENC for oversampling, with SMOTE giving the better results.

When we evaluated various machine learning models for classification, we observed that the models did not exhibit strong predictive capabilities before tuning. After tuning, however, the improvements were notable, particularly in balanced accuracy. Other projects that used this dataset reported similar findings. An interesting discovery during our investigation was that datasets incorporating bloodwork data tend to yield more accurate stroke predictions. This suggests that lifestyle-based predictive models might best serve as preliminary tools for healthcare professionals, guiding at-risk patients toward more definitive bloodwork analyses.

Despite the challenges presented by lifestyle data, the Random Forest Classifier was the standout model after tuning, specifically when set to the optimal max depth. This model achieved a balanced accuracy score of 80%, making it the most effective of the classifiers we tested for predicting stroke potential. The Random Forest Classifier with an appropriate max depth is what we would recommend as a tool for stroke prediction, given its potential utility in clinical settings for early stroke risk assessment.



---




# Deep Learning & Interface
At this point in the project we can begin implementing a deep learning model for stroke prediction, which could potentially improve the performance of our predictions. Deep learning is particularly good at capturing non-linear interactions between features. Here’s how we will approach the next part of this project:

## Deep Learning Model Approach
### Data Preprocessing:
1. Normalize or standardize the input features so that the model trains efficiently.
2. One-hot encode categorical variables.

### Model Architecture:
1. Use a simple feedforward neural network with several dense layers as a starting point.
2. Include dropout layers to prevent overfitting.
3. Use the ReLU activation function for hidden layers and a sigmoid activation function at the output layer for binary classification (stroke or no stroke).
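
To show what this architecture computes, here is a minimal pure-Python forward pass through one ReLU hidden layer and a sigmoid output. It is only a sketch: the weights are made up, the layer sizes are arbitrary, and dropout is omitted because it is only active during training. A real model would be built with a deep learning framework.

```python
import math

def relu(x):
    return max(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dense(inputs, weights, biases, activation):
    """One dense layer: activation(W @ inputs + b), with W given as a list of rows."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

# Made-up weights for a 3-input -> 2-hidden -> 1-output network
hidden_w = [[0.5, -0.2, 0.1], [-0.3, 0.8, 0.4]]
hidden_b = [0.0, 0.1]
output_w = [[1.2, -0.7]]
output_b = [-0.1]

features = [0.6, 0.1, 0.9]  # e.g. scaled age, hypertension flag, scaled glucose
hidden = dense(features, hidden_w, hidden_b, relu)
probability = dense(hidden, output_w, output_b, sigmoid)[0]
print(round(probability, 3))
```

The sigmoid output always lies strictly between 0 and 1, so it can be read as the predicted probability of the positive class (stroke) and thresholded to get a label.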

### Compilation:
1. Compile the model using the Adam optimizer.
2. Use binary cross-entropy as the loss function since this is a binary classification problem.
3. Track accuracy as a metric and tune the model's hyperparameters using Optuna.
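
For reference, binary cross-entropy can be computed by hand as in the sketch below. The labels and probabilities are invented, and the clipping constant `eps` is an illustrative choice rather than a value taken from any particular framework.

```python
import math

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

labels = [1, 0, 0, 1]
confident = [0.9, 0.1, 0.2, 0.8]   # predictions close to the true labels
unsure = [0.6, 0.4, 0.5, 0.5]      # predictions close to 0.5
loss_confident = binary_cross_entropy(labels, confident)
loss_unsure = binary_cross_entropy(labels, unsure)
print(round(loss_confident, 3), round(loss_unsure, 3))
```

The loss is lower when predicted probabilities are both correct and confident, and it grows quickly for confident predictions that are wrong, which is what drives the optimizer during training.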

### Training:
1. Train the model using a suitable batch size and number of epochs.
2. Use callbacks such as ModelCheckpoint to save the best model and EarlyStopping to halt training when performance plateaus. To further limit overfitting, we will also add a learning-rate reduction callback (such as Keras's ReduceLROnPlateau).
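
The early-stopping logic can be sketched independently of any framework. The function below is a simplified stand-in for what an early-stopping callback does with a `patience` setting; the name, the `min_delta` parameter, and the loss values are illustrative assumptions.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (stop_epoch, best_epoch), both 0-based.

    Training halts once the validation loss has failed to improve by more
    than `min_delta` for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch

# Invented validation losses: improvement stalls after epoch 2
val_losses = [0.70, 0.61, 0.55, 0.56, 0.57, 0.58, 0.54]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=3)
print(stop_epoch, best_epoch)
```

Training stops a few epochs after the best one, so saving the best weights with a checkpoint is what lets you recover the model from the best epoch rather than the last one.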

### Evaluation:
1. Evaluate the model on a validation set to check for overfitting and underfitting.
2. Adjust the model architecture and hyperparameters based on performance metrics.

### Deployment:
1. Once the model is trained and validated, deploy it in a Gradio interface to make it interactive.

## Extra Analysis
## Preprocessing
### OHE
### TT&S
### SMOTE
### Val

In [None]:
df.drop('age_group', axis=1, inplace=True)


In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Load the original dataset with continuous age values
df_continuous = df.copy()

# Scatter Plot for Age vs. Avg Glucose Level colored by Stroke outcome
plt.figure(figsize=(10, 6))
sns.scatterplot(x='age', y='avg_glucose_level', hue='stroke', data=df_continuous)
plt.title('Age vs. Avg Glucose Level by Stroke Outcome')
plt.show()

# Line Plot for Age vs. Stroke Rate
# Calculate stroke rate by age
age_stroke_rate = df_continuous.groupby('age')['stroke'].mean().reset_index()

plt.figure(figsize=(10, 6))
sns.lineplot(x='age', y='stroke', data=age_stroke_rate)
plt.title('Stroke Rate by Age')
plt.xlabel('Age')
plt.ylabel('Stroke Rate')
plt.show()

# Pairplot for Age, BMI, Glucose and Stroke
sns.pairplot(df_continuous, vars=['age', 'bmi', 'avg_glucose_level'], hue='stroke')
plt.suptitle('Pairwise Relationships for Age, BMI, Glucose Level')
plt.show()

# Heatmap for Age and Stroke Correlation with Other Variables
corr_matrix = df_continuous.corr()
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()


In [None]:
!pip install catboost


In [None]:
!pip install optuna


In [None]:
from sklearn.metrics import precision_recall_curve
import numpy as np

# Predict probabilities
probabilities = model.predict_proba(X_test)[:, 1]

# Calculate precision-recall pairs for different probability thresholds
precisions, recalls, thresholds = precision_recall_curve(y_test, probabilities)

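# Find the lowest threshold whose precision reaches the target level.
# precisions has one more entry than thresholds (its last value is always 1.0),
# so drop that entry to keep the index valid for thresholds.
target_precision = 0.95  # Change to the desired precision level
threshold_index = np.argmax(precisions[:-1] >= target_precision)
best_threshold = thresholds[threshold_index]
print("Best threshold for high precision:", best_threshold)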


In [None]:
# Apply threshold to positive probabilities to create binary output
predictions = (probabilities >= best_threshold).astype(int)

# Evaluate the final model precision and other metrics
from sklearn.metrics import classification_report, confusion_matrix

print(classification_report(y_test, predictions))
print("Confusion Matrix:\n", confusion_matrix(y_test, predictions))


In [None]:
# Bin 'BMI' into categories
df['bmi_bins'] = pd.cut(df['bmi'], bins=[0, 18.5, 24.9, 29.9, 34.9, 39.9, np.inf], labels=[0, 1, 2, 3, 4, 5])

# Bin 'avg_glucose_level' into four categories using fixed cut points at 90, 140 and 200
df['glucose_bins'] = pd.cut(df['avg_glucose_level'], bins=[0, 90, 140, 200, np.inf], labels=[0, 1, 2, 3])
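
`pd.cut` uses right-closed intervals by default, so a BMI of exactly 18.5 lands in the first bin and a BMI of 24.95 lands in the bin above 24.9. The standard-library sketch below mimics that rule for the BMI edges used above; `bmi_bin` is a hypothetical helper written only for illustration.

```python
from bisect import bisect_left

# Inner edges of the BMI bins; the outer edges are 0 and infinity
BMI_EDGES = [18.5, 24.9, 29.9, 34.9, 39.9]


def bmi_bin(bmi):
    """Return the bin code for a BMI value using right-closed intervals (a, b]."""
    # Missing or non-positive values fall outside every bin; -1 mirrors
    # the code that `.cat.codes` assigns to missing categories.
    if bmi is None or bmi != bmi or bmi <= 0:
        return -1
    return bisect_left(BMI_EDGES, bmi)


for value in (18.5, 18.6, 24.9, 24.95, 40.0):
    print(value, "->", bmi_bin(value))
```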



In [None]:
df.head()

In [None]:
df.drop(['bmi', 'avg_glucose_level'], axis=1, inplace=True)

In [None]:
print(df.dtypes)


In [None]:
import pandas as pd

# Convert bin columns to type 'category' if not already
df[['bmi_bins', 'glucose_bins']] = df[['bmi_bins', 'glucose_bins']].astype('category')

# Convert category columns to integers
df['bmi_bins'] = df['bmi_bins'].cat.codes
df['glucose_bins'] = df['glucose_bins'].cat.codes

# Set stroke as the target
X = df.drop('stroke', axis=1)  # Features
y = df['stroke']  # Target

# Apply one-hot encoding to X
X = pd.get_dummies(X)

X.head()

In [None]:
from sklearn.preprocessing import MinMaxScaler

# Scale 'age' feature
scaler = MinMaxScaler()
X['age'] = scaler.fit_transform(X[['age']])
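
The cell above fits the scaler on all rows before the train/test split. A common alternative is to learn the minimum and maximum from the training rows only and reuse them for the test rows, so that no test-set information reaches the model. The sketch below shows that idea in plain Python; the function names and the age values are made up for illustration.

```python
def fit_min_max(values):
    """Learn the scaling range from training values only."""
    return min(values), max(values)


def apply_min_max(values, lo, hi):
    """Scale values with a previously learned range; unseen values may fall outside [0, 1]."""
    span = (hi - lo) or 1.0  # avoid division by zero for a constant column
    return [(v - lo) / span for v in values]


train_age = [20.0, 40.0, 60.0, 80.0]  # hypothetical training ages
test_age = [30.0, 82.0]               # hypothetical test ages

lo, hi = fit_min_max(train_age)
train_scaled = apply_min_max(train_age, lo, hi)
test_scaled = apply_min_max(test_age, lo, hi)
print(train_scaled, test_scaled)
```

With scikit-learn this corresponds to calling `scaler.fit_transform` on the training split and `scaler.transform` on the test split.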

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)


In [None]:
smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)


In [None]:
X_train_smote, X_val, y_train_smote, y_val = train_test_split(X_train_smote, y_train_smote, test_size=0.2, random_state=42)


In [None]:
from catboost import CatBoostClassifier

# Initialize the CatBoostClassifier
model = CatBoostClassifier(
    iterations=100,
    depth=10,
    learning_rate=0.05,
    random_strength=2,
    bagging_temperature=0.2,
    od_type='IncToDec',
    l2_leaf_reg=3,
    loss_function='Logloss',
    eval_metric='Precision',  # Monitored on the eval set; the optimized loss is still Logloss
    verbose=False
)

# Train the model
model.fit(X_train_smote, y_train_smote, eval_set=(X_val, y_val))


In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix

# Random Forest model
rf_model = RandomForestClassifier(max_depth=7, random_state=42)
rf_model.fit(X_train_smote, y_train_smote)
y_pred_rf = rf_model.predict(X_test)

print("Random Forest Performance:")
print(classification_report(y_test, y_pred_rf))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_rf))


In [None]:
from sklearn.tree import DecisionTreeClassifier

# Decision Tree model
dt_model = DecisionTreeClassifier(max_depth=7, random_state=42)
dt_model.fit(X_train_smote, y_train_smote)
y_pred_dt = dt_model.predict(X_test)

print("Decision Tree Performance:")
print(classification_report(y_test, y_pred_dt))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_dt))



In [None]:
from sklearn.ensemble import VotingClassifier

# Create a voting classifier that includes Random Forest and Decision Tree
ensemble_model = VotingClassifier(estimators=[
    ('rf', rf_model),
    ('dt', dt_model)
], voting='hard')

ensemble_model.fit(X_train_smote, y_train_smote)
y_pred_ensemble = ensemble_model.predict(X_test)

print("Ensemble Model Performance:")
print(classification_report(y_test, y_pred_ensemble))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_ensemble))



In [None]:
import matplotlib.pyplot as plt

# Extracting feature importance from the Random Forest model
feature_importances = rf_model.feature_importances_
features = X_train.columns
importance_df = pd.DataFrame({'Features': features, 'Importance': feature_importances}).sort_values(by='Importance', ascending=False)

# Plotting feature importances
plt.figure(figsize=(10, 6))
plt.barh(importance_df['Features'], importance_df['Importance'])
plt.xlabel('Importance')
plt.ylabel('Features')
plt.title('Feature Importance from Random Forest')
plt.gca().invert_yaxis()  # Invert y-axis to show the most important at the top
plt.show()



In [None]:
X_train_smote.head()

In [None]:
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_auc_score

# Ratio of negative to positive samples (close to 1 here, since SMOTE has already balanced the classes)
ratio = float(np.sum(y_train_smote == 0)) / np.sum(y_train_smote == 1)

# Instantiate an XGBClassifier with scale_pos_weight
xgb_clf = XGBClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=7,
    subsample=0.8,
    colsample_bytree=0.8,
    gamma=1,
    scale_pos_weight=ratio,  # Setting the class weight
    use_label_encoder=False,
    eval_metric='logloss',
    random_state=42
)

# Fit the classifier to the training data
xgb_clf.fit(X_train_smote, y_train_smote)

# Predict the labels for the test set
y_pred_xgb = xgb_clf.predict(X_test)

# Evaluate the classifier
print("Adjusted XGBoost Performance:")
print(classification_report(y_test, y_pred_xgb))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_xgb))

# Compute ROC-AUC score
roc_auc = roc_auc_score(y_test, xgb_clf.predict_proba(X_test)[:,1])
print("ROC-AUC Score:", roc_auc)
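
`scale_pos_weight` is usually set to the ratio of negative to positive training examples. As a small worked example under assumed class counts, the sketch below compares an imbalanced label set with a balanced one; `pos_weight_ratio` is a hypothetical helper and the counts are illustrative.

```python
def pos_weight_ratio(labels):
    """Ratio of negative to positive labels, the usual choice for scale_pos_weight."""
    negatives = sum(1 for y in labels if y == 0)
    positives = sum(1 for y in labels if y == 1)
    if positives == 0:
        raise ValueError("no positive examples")
    return negatives / positives


imbalanced = [0] * 1444 + [1] * 89   # assumed, imbalanced class counts
balanced = [0] * 1444 + [1] * 1444   # what SMOTE resampling aims for

ratio_imbalanced = pos_weight_ratio(imbalanced)
ratio_balanced = pos_weight_ratio(balanced)
print(f"Imbalanced ratio: {ratio_imbalanced:.2f}, balanced ratio: {ratio_balanced:.2f}")
```

When the training data has already been resampled to balance, the ratio is close to 1 and the weighting has little effect.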


Looking good, but let's double-check with cross-validation.

In [None]:
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score, roc_auc_score

# Get cross-validated predictions for each data point
y_pred = cross_val_predict(xgb_clf, X_train_smote, y_train_smote, cv=5)

# Compute confusion matrix
conf_mat = confusion_matrix(y_train_smote, y_pred)
print("Confusion Matrix:")
print(conf_mat)

# Calculate precision, recall, and F1 score
precision = precision_score(y_train_smote, y_pred)
recall = recall_score(y_train_smote, y_pred)
f1 = f1_score(y_train_smote, y_pred)

print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1 Score: {f1}")

# Compute ROC-AUC
# ROC-AUC is a ranking metric, so use predicted probabilities rather than binary predictions
y_scores = cross_val_predict(xgb_clf, X_train_smote, y_train_smote, cv=5, method='predict_proba')
roc_auc = roc_auc_score(y_train_smote, y_scores[:, 1])  # Column 1 holds the probability of class 1

print(f"ROC-AUC Score: {roc_auc}")
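
ROC-AUC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, which is why it is computed from probabilities rather than hard labels. The sketch below computes it by brute force over all pairs; `roc_auc_pairwise` and the toy scores are assumptions for illustration, and the quadratic loop is only suitable for small inputs.

```python
def roc_auc_pairwise(y_true, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 1, 1, 1]
probs = [0.10, 0.60, 0.35, 0.55, 0.90]       # hypothetical predicted probabilities
hard = [1 if p > 0.5 else 0 for p in probs]  # the same predictions thresholded at 0.5

auc_from_probs = roc_auc_pairwise(y_true, probs)
auc_from_hard = roc_auc_pairwise(y_true, hard)
print(auc_from_probs, auc_from_hard)
```

Thresholding collapses the scores into two values, so many pairs become ties and the two results differ.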


In [None]:
from sklearn.ensemble import RandomForestClassifier

# Assuming X_train_smote and y_train_smote are defined
clf = RandomForestClassifier()
clf.fit(X_train_smote, y_train_smote)

# Get feature importances
importances = clf.feature_importances_
feature_names = X_train_smote.columns

# Sort feature importances in descending order and plot
indices = np.argsort(importances)[::-1]

plt.figure()
plt.title("Feature importances")
plt.bar(range(X_train_smote.shape[1]), importances[indices])
plt.xticks(range(X_train_smote.shape[1]), feature_names[indices], rotation=90)
plt.show()


### Model Architecture

#### Model log:
Model 1:
- loss: 0.4380
- accuracy: 0.7573
- Test Loss: 0.43803343176841736
- Test Accuracy: 0.7573385238647461

Model 1.2: used binning to address issues with age, blood glucose, and BMI
- loss: 0.5950
- accuracy: 0.5564
- Test Loss: 0.59500360488891
- Test Accuracy: 0.5564253330230713
- Precision: 0.10252996005326231
- Recall: 0.8651685393258427
- F1-Score: 0.18333333333333332
- AUC-ROC: 0.766799464658097

Model 1.3: Bayesian hyperparameter tuning, optimized model layer structure
- loss: 0.4934
- accuracy: 0.8108
- Test Loss: 0.49342384934425354
- Test Accuracy: 0.810828447341919
- Accuracy: 0.8101761252446184
- Precision: 0.13405797101449277
- Recall: 0.4157303370786517
- F1 Score: 0.2027397260273973
- Confusion Matrix:
  [[1205  239]
  [  52   37]]
- ROC AUC Score: 0.7156035046219926

Model 1.3.1: ensemble with 1.3
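
The metrics logged for Model 1.3 can be re-derived from its confusion matrix, which is a useful sanity check on the log. The sketch below assumes the scikit-learn layout, where rows are actual classes and columns are predicted classes.

```python
# Confusion matrix for Model 1.3 from the log: rows = actual, columns = predicted
tn, fp = 1205, 239
fn, tp = 52, 37

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1 Score:  {f1:.4f}")
```

The values agree with the logged accuracy, precision, recall, and F1 score, and they show that the high accuracy comes mostly from the 1205 true negatives rather than from correctly identified strokes.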



In [None]:
!pip list | grep tensorflow


In [None]:
!pip install --upgrade tensorflow tensorflow-datasets tensorflow-estimator tensorflow-gcs-config tensorflow-hub tensorflow-io-gcs-filesystem tensorflow-metadata tensorflow-probability


Bayesian tuning to produce new fixed model

In [None]:
import optuna
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau

# Build the fixed model from the best parameters found by Keras Tuner
def create_fixed_model():
    model = Sequential()
    # Using the best parameters from Keras Tuner
    model.add(Dense(416, activation='relu', input_shape=(X_train_smote.shape[1],)))  # units_input from Keras Tuner
    model.add(Dropout(0.1))  # dropout_input from Keras Tuner

    # Additional layers based on the best configuration found
    num_layers = 1  # from num_layers in Keras Tuner
    if num_layers > 0:
        model.add(Dense(384, activation='relu'))  # units_layer_0 from Keras Tuner
        model.add(Dropout(0.2))  # dropout_layer_0 from Keras Tuner

    model.add(Dense(1, activation='sigmoid'))
    learning_rate = 10 ** (-3.201883719724095)  # Convert log_learning_rate to actual learning rate
    model.compile(optimizer=Adam(learning_rate=learning_rate), loss='binary_crossentropy', metrics=['accuracy'])
    return model

# Objective function for Optuna (the architecture is fixed, so no hyperparameters are sampled here)
def objective(trial):
    model = create_fixed_model()
    callbacks = [
        EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True),
        ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=5, min_lr=1e-6)
    ]
    history = model.fit(
        X_train_smote, y_train_smote,
        validation_data=(X_val, y_val),
        epochs=100, batch_size=32,
        callbacks=callbacks,
        verbose=0
    )
    val_accuracy = model.evaluate(X_val, y_val, verbose=1)[1]
    return val_accuracy

# Create a study that maximizes validation accuracy
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=10)
best_trial = study.best_trial
print(f'Accuracy: {best_trial.value}')
print(f"Best hyperparameters: {best_trial.params}")


In [None]:
!pip install keras-tuner


Hyperparameter tuning of the fixed model: learning rate and batch size


In [None]:
import optuna
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau

# Define a fixed model with best hyperparameters from previous testing
def create_fixed_model(trial):
    model = Sequential()
    # Base configuration from previous Bayesian optimization
    model.add(Dense(416, activation='relu', input_shape=(X_train_smote.shape[1],)))
    model.add(Dropout(0.1))
    model.add(Dense(384, activation='relu'))
    model.add(Dropout(0.2))
    model.add(Dense(1, activation='sigmoid'))

    # Tunable learning rate
    learning_rate = trial.suggest_float('learning_rate', 1e-4, 1e-2)
    model.compile(optimizer=Adam(learning_rate=learning_rate), loss='binary_crossentropy', metrics=['accuracy'])
    return model

# Objective function for Optuna to fine-tune other parameters
def objective(trial):
    model = create_fixed_model(trial)
    batch_size = trial.suggest_categorical('batch_size', [16, 32, 64, 128])

    callbacks = [
        EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True),
        ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=2, min_lr=1e-6)
    ]

    history = model.fit(
        X_train_smote, y_train_smote,
        validation_data=(X_val, y_val),
        epochs=50,  # Optimize for quicker iterations
        batch_size=batch_size,
        callbacks=callbacks,
        verbose=0
    )

    val_accuracy = model.evaluate(X_val, y_val, verbose=0)[1]
    return val_accuracy

# Create and run the study
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=20)  # Increase trials if needed
best_trial = study.best_trial

print(f'Best Validation Accuracy: {best_trial.value}')
print(f"Best hyperparameters: {best_trial.params}")


In [None]:
# from tensorflow.keras.models import Sequential
# from tensorflow.keras.layers import Dense, Dropout
# from tensorflow.keras.optimizers import Adam
# from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau
# #untuned 48/48 [==============================] - 0s 2ms/step - loss: 0.5950 - accuracy: 0.5564
# #Test Loss: 0.595003604888916, Test Accuracy: 0.5564253330230713
# # 48/48 [==============================] - 0s 2ms/step - loss: 0.5038 - accuracy: 0.8082
# # Test Loss: 0.5038195848464966, Test Accuracy: 0.8082191944122314


# # Building the model with optimal parameters


# # Best hyperparameters
# #hyperparameters = {
#     'n_layers': 3,
#     'dropout_rate': 0.1931642791147507,
#     'lr': 0.0003865222724920752,
#     'n_units_first': 74,
#     'n_units_l0': 77,
#     'n_units_l1': 119,
#     'n_units_l2': 27
# }

# # Define the model with hyperparameters
# model = Sequential([
#     Dense(hyperparameters['n_units_first'], activation='relu', input_shape=(X_train_smote.shape[1],)),  # First layer
#     Dense(hyperparameters['n_units_l0'], activation='relu'),  # Second layer
#     Dense(hyperparameters['n_units_l1'], activation='relu'),  # Third layer
#     Dropout(hyperparameters['dropout_rate']),  # Dropout layer
#     Dense(hyperparameters['n_units_l2'], activation='relu'),  # Fourth layer
#     Dense(1, activation='sigmoid')  # Output layer
# ])

# # Compile the model with the optimal learning rate
# model.compile(optimizer=Adam(learning_rate=hyperparameters['lr']),
#               loss='binary_crossentropy',
#               metrics=['accuracy'])

# # Early stopping to prevent overfitting
# early_stopping = EarlyStopping(
#     monitor='val_loss',
#     patience=10,
#     restore_best_weights=True
# )

# # Adjust ReduceLROnPlateau to be more patient
# reduce_lr = ReduceLROnPlateau(
#     monitor='val_loss',
#     factor=0.1,  # Reduce the learning rate by a factor of 0.1
#     patience=5,  # Increased patience
#     min_lr=1e-6  # Lower bound on the learning rate
# )

# # Train the model with the SMOTE-augmented training data and callbacks
# history = model.fit(
#     X_train_smote, y_train_smote,
#     epochs=100,
#     batch_size=32,
#     validation_data=(X_val, y_val),
#     callbacks=[early_stopping, reduce_lr]
# )

# # Model summary
# model.summary()


In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau, ModelCheckpoint
import tensorflow as tf

# Set the best hyperparameters from Optuna
best_learning_rate = 0.002363396022137484
best_batch_size = 64

# Build the model with the optimal hyperparameters
def build_final_model():
    model = Sequential()
    model.add(Dense(416, activation='relu', input_shape=(X_train_smote.shape[1],)))
    model.add(Dropout(0.1))
    model.add(Dense(384, activation='relu'))
    model.add(Dropout(0.2))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(optimizer=Adam(learning_rate=best_learning_rate), loss='binary_crossentropy', metrics=['accuracy'])
    return model

# Create the model
model = build_final_model()

# Callbacks for early stopping and learning rate reduction
callbacks = [
    EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True),
    ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=2, min_lr=1e-6),
    ModelCheckpoint(filepath='best_model.h5', monitor='val_accuracy', save_best_only=True)
]

# Train the model
history = model.fit(
    X_train_smote, y_train_smote,
    validation_data=(X_val, y_val),
    epochs=100,  # Set a higher epoch if needed
    batch_size=best_batch_size,
    callbacks=callbacks,
    verbose=1
)

# Evaluate the model
test_loss, test_accuracy = model.evaluate(X_test, y_test)
print(f"Test Loss: {test_loss}, Test Accuracy: {test_accuracy}")

# Load the best model
best_model = tf.keras.models.load_model('best_model.h5')


In [None]:
test_loss, test_acc = model.evaluate(X_test, y_test)
print(f"Test Loss: {test_loss}, Test Accuracy: {test_acc}")


In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, roc_auc_score, roc_curve
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf

# Load the best model (already loaded as `best_model` in the previous step)
best_model = tf.keras.models.load_model('best_model.h5')

# Predict probabilities for the test set and threshold them at 0.5
y_probs = best_model.predict(X_test).ravel()
y_pred = (y_probs > 0.5).astype(int)

# Compute evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_probs)

print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1 Score: {f1}")
print(f"Confusion Matrix:\n{conf_matrix}")
print(f"ROC AUC Score: {roc_auc}")

# Plot ROC Curve
fpr, tpr, thresholds = roc_curve(y_test, y_probs)
plt.figure()
plt.plot(fpr, tpr, label=f'ROC curve (area = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], 'k--')  # Dashed diagonal
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend(loc="lower right")
plt.show()



In [None]:
# import numpy as np
# from sklearn.model_selection import StratifiedKFold
# from tensorflow.keras.models import Sequential
# from tensorflow.keras.layers import Dense, Dropout
# from tensorflow.keras.optimizers import Adam

# # Assuming X_train_smote and y_train_smote are pandas DataFrame and Series respectively
# def build_model():
#     model = Sequential([
#         Dense(83, activation='relu', input_shape=(X_train_smote.shape[1],)),  # Adjust input_shape to match feature size
#         Dense(121, activation='relu'),
#         Dropout(0.3475302741733841),  # Adjust dropout rate if necessary
#         Dense(1, activation='sigmoid')
#     ])
#     model.compile(optimizer=Adam(learning_rate=0.07826801938092297),
#                   loss='binary_crossentropy',
#                   metrics=['accuracy'])
#     return model

# n_splits = 5
# kfold = StratifiedKFold(n_splits=n_splits, shuffle=True)

# scores = []
# for train, test in kfold.split(X_train_smote, y_train_smote):
#     model = build_model()
#     # Use .iloc for proper indexing when using pandas data structures
#     model.fit(X_train_smote.iloc[train], y_train_smote.iloc[train], epochs=100, batch_size=32, verbose=0)
#     score = model.evaluate(X_train_smote.iloc[test], y_train_smote.iloc[test], verbose=0)
#     scores.append(score)

# # Print the cross-validation scores
# print(f"Cross-validated scores: {scores}")


In [None]:
import matplotlib.pyplot as plt

# Plot training & validation accuracy values
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('Model accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='upper left')
plt.show()

# Plot training & validation loss values
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('Model loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='upper left')
plt.show()


Error Analysis

In [None]:
# Get the predicted probabilities
y_pred_probs = model.predict(X_test)

# Convert probabilities to binary predictions using a threshold (default is 0.5)
y_pred = (y_pred_probs > 0.5).astype(int)

# Get the actual predictions and the indices where the predictions were incorrect
incorrect_indices = np.where(y_pred.flatten() != y_test)[0]


In [None]:
# Creating a DataFrame to compare actual labels and predicted probabilities/predictions
errors_df = pd.DataFrame({'Actual': y_test, 'Predicted_Prob': y_pred_probs.flatten(), 'Predicted': y_pred.flatten()})
errors_df['Error'] = errors_df['Actual'] != errors_df['Predicted']

# Extract the subset of the DataFrame where predictions are incorrect
misclassified = errors_df[errors_df['Error']]

# Analyze the distribution of probabilities for misclassified cases
sns.histplot(misclassified['Predicted_Prob'], bins=30, kde=False)
plt.title('Distribution of Predicted Probabilities for Misclassified Cases')
plt.show()


In [None]:
# Cases where the model was very wrong
high_confidence_errors = misclassified[(misclassified['Predicted_Prob'] > 0.9) | (misclassified['Predicted_Prob'] < 0.1)]
X_test_errors = X_test.loc[high_confidence_errors.index]

# Check for common features among high confidence errors
common_features = X_test_errors.mean() - X_test.mean()

# This would give you the top 5 features with the highest divergence
top_divergent_features = common_features.abs().sort_values(ascending=False).head(5).index.tolist()
print(top_divergent_features)


In [None]:
for feature in top_divergent_features:
    plt.figure(figsize=(10, 5))
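    # Normalize each histogram to a density so the two groups are comparable despite different sizes
    sns.histplot(X_test_errors[feature], color='red', label='High Confidence Errors', kde=True, stat='density')
    sns.histplot(X_test[feature], color='blue', label='All Test Data', kde=True, stat='density')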
    plt.title(f'Distribution of {feature} for All Test Data vs High Confidence Errors')
    plt.xlabel(feature)
    plt.ylabel('Density')
    plt.legend()
    plt.show()


In [None]:
# Assuming 'high_confidence_errors' is a DataFrame with cases where the model was very wrong
for feature in common_features.sort_values(key=abs, ascending=False).index[:5]:  # Top 5 features
    plt.figure(figsize=(10, 4))
    sns.kdeplot(X_test_errors[feature], label='Errors', fill=True)
    sns.kdeplot(X_test[feature], label='All Test Data', fill=True)
    plt.title(f'Distribution of {feature} for All Test Data vs Errors')
    plt.xlabel(feature)
    plt.ylabel('Density')
    plt.legend()
    plt.show()


In [None]:
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

# Predict probabilities once, then threshold them at 0.5 to get class labels
test_probs = model.predict(X_test).ravel()
model_predictions = (test_probs > 0.5).astype("int32")

precision = precision_score(y_test, model_predictions)
recall = recall_score(y_test, model_predictions)
f1 = f1_score(y_test, model_predictions)
roc_auc = roc_auc_score(y_test, test_probs)  # AUC uses the probabilities, not the labels

print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1-Score: {f1}")
print(f"AUC-ROC: {roc_auc}")
