In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)


# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

## Introduction

Heart disease remains one of the leading causes of death in both [Canada](https://www150.statcan.gc.ca/n1/en/daily-quotidien/190530/dq190530c-eng.pdf?st=g0roC2Py) and the [USA](https://www.medicalnewstoday.com/articles/282929.php#heart-disease). It can be influenced by a multitude of factors that are related to nutrition, environmental factors, lifestyle, and genetics. In this analysis we will explore some of the potential physiological indicators of heart disease and determine whether we can obtain a model that gives reliable predictions. 

Historically, women have been underrepresented in clinical trials for [medical treatments](https://journals.lww.com/psychopharmacology/Abstract/2019/05000/Systematic_Review_of_Gender_Bias_in_the_Clinical.15.aspx). In some ways this has hurt women's health, because research optimized for men is being used to treat the [physiologically different](https://www.health.harvard.edu/newsletter_article/mars-vs-venus-the-gender-gap-in-health) female body. It may instead be more beneficial to analyze data for each sex independently. Based on this idea, we will explore models that train on features that are important to each sex.

Other kernels associated with this dataset do a great job of exploring the data from a visual standpoint so I encourage you to check the other ones out and give them a thumbs up if you find them helpful.  

Some of the goals for this analysis are as follows:
1. Explore using Altair, rather than seaborn
2. As a bonus, implement a Heroku dashboard (unfinished)
3. Obtain a model that is a good predictor of heart disease based on the features in this dataset
4. Get insight into which features are the most important
5. Assess whether gender-focused models should be considered when predicting on gender-specific data

Without further ado, let's get started!

## Initial Exploration

In [None]:
# Read in the kaggle dataset
heart_df = pd.read_csv("/kaggle/input/heart-disease-dataset/heart.csv")
heart_df.head()

It's important to first understand the 13 features involved with this dataset:
- **age** *age in years*
- **sex** *(1 = male; 0 = female)*
- **cp** *chest pain type (0 - Typical Angina (Heart related) 1 - Atypical Angina (Non-heart related) 2 - Non-Anginal pain (Non-heart related) 3 - Asymptomatic (No disease))*
- **trestbps** *resting blood pressure (in mm Hg on admission to the hospital)*
- **chol** *serum cholesterol in mg/dl (healthy levels are < 200 mg/dl)*
- **fbs** *(fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)*
- **restecg** *resting electrocardiographic results ( 0 = normal, 1 = ST-T wave abnormality, 2= probable or definite left ventricular hypertrophy by Estes' criteria )*
- **thalach** *maximum heart rate achieved*
- **exang** *exercise induced angina (1 = yes; 0 = no)*
- **oldpeak** *ST depression induced by exercise relative to rest*
- **slope** *the slope of the peak exercise ST segment (1 = upsloping, 2 = flat, 3 = downsloping)*
- **ca** *number of major vessels (0-3) colored by fluoroscopy*
- **thal** *(1 = normal; 2 = fixed defect; 3 = reversible defect)*

Result: 
- **target** *(1 -heart problem or 0 - no heart problem)*


In [None]:
heart_df.hist(figsize = (15,15))

Some observations about the feature histograms:
- chol contains what seems to be a potential outlier at above 500 mg/dl
- age varies but most of the density exists at >50 years
- target imbalance isn't a problem (approximately equal numbers of "heart problem" and "no heart problem" outcomes)
- there seem to be more males (~700) than females (~300) in the dataset
- in the data description, ca should range from 0 to 3, but a value of 4 also appears (not sure why this is)
- "Thal" is not well-defined in the UCI dataset description



In [None]:
# Check for null values
heart_df.isnull().sum()

The data has already been formatted well based on the above observations so we won't focus on any further data cleaning here, although I encourage you to comment or make any suggestions that come to mind. For example, is it appropriate to consider the extreme **chol** value (~500 mg/dl) as an outlier and remove it before analysis?
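
One common way to approach the outlier question is the interquartile range (IQR) rule. The sketch below is only an illustration: the helper `iqr_outliers` and the readings in `chol_demo` are made up for this example and are not taken from the dataset, and the notebook itself does not remove any values.

```python
import pandas as pd

def iqr_outliers(values, k=1.5):
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = pd.Series(values, dtype="float64")
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Synthetic cholesterol readings in mg/dl (hypothetical, NOT the real data)
chol_demo = pd.Series([180, 195, 210, 230, 245, 260, 275, 290, 564])
mask = iqr_outliers(chol_demo)

# Show which readings the rule would flag
print(chol_demo[mask])
```

Whether a flagged value should actually be dropped is a judgment call: an extreme but physiologically plausible reading may carry real signal.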

## Initial Data Split

At this point, we want to make sure we don't make any biased decisions regarding the data we will ultimately test our models on. Before exploring the relationships in the data, we will abide by the golden rule and split the data for the final testing phase.

In [None]:
X_training, X_vault, y_training, y_vault = train_test_split(heart_df.drop(columns = 'target'),
                                                            heart_df[['target']],
                                                            test_size = 0.2, 
                                                            random_state = 1)
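
Because the dataset has more males than females, one option (not used in this notebook) is to stratify the split so that the vault keeps roughly the same sex and target proportions as the full data. The sketch below assumes a synthetic frame named `demo` with made-up values; only `train_test_split` and its `stratify` parameter come from scikit-learn.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the heart data (hypothetical values)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "sex": rng.integers(0, 2, size=200),
    "chol": rng.normal(240, 50, size=200),
    "target": rng.integers(0, 2, size=200),
})

# Combine sex and target into one label so both are balanced across the split
strata = demo["sex"].astype(str) + "_" + demo["target"].astype(str)

X_tr, X_va, y_tr, y_va = train_test_split(
    demo.drop(columns="target"),
    demo["target"],
    test_size=0.2,
    random_state=1,
    stratify=strata,
)

print(demo["sex"].mean(), X_va["sex"].mean())
```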

## Altair Exploration

I know a lot of people tend to use seaborn or built-in plotting methods while exploring Kaggle datasets. Altair is an emerging plotting package for Python that I believe has a lot of potential. Therefore, we will explore the data using this package, and my hope is that I can inspire more people to add it to their repertoire. First, we need to add some code that allows Altair plots to render [properly](https://www.kaggle.com/jakevdp/altair-kaggle-renderer) in the Kaggle notebook.

In [None]:
# The following code was taken from https://www.kaggle.com/jakevdp/altair-kaggle-renderer
# (attributed to the user "no data sources"). This code allows Altair charts to render properly
# within the Kaggle kernel.

import altair as alt
import json
from IPython.display import HTML

KAGGLE_HTML_TEMPLATE = """
<style>
.vega-actions a {{
    margin-right: 12px;
    color: #757575;
    font-weight: normal;
    font-size: 13px;
}}
.error {{
    color: red;
}}
</style>
<div id="{output_div}"></div>
<script>
requirejs.config({{
    "paths": {{
        "vega": "{base_url}/vega@{vega_version}?noext",
        "vega-lib": "{base_url}/vega-lib?noext",
        "vega-lite": "{base_url}/vega-lite@{vegalite_version}?noext",
        "vega-embed": "{base_url}/vega-embed@{vegaembed_version}?noext",
    }}
}});
function showError(el, error){{
    el.innerHTML = ('<div class="error">'
                    + '<p>JavaScript Error: ' + error.message + '</p>'
                    + "<p>This usually means there's a typo in your chart specification. "
                    + "See the javascript console for the full traceback.</p>"
                    + '</div>');
    throw error;
}}
require(["vega-embed"], function(vegaEmbed) {{
    const spec = {spec};
    const embed_opt = {embed_opt};
    const el = document.getElementById('{output_div}');
    vegaEmbed("#{output_div}", spec, embed_opt)
      .catch(error => showError(el, error));
}});
</script>
"""

class KaggleHtml(object):
    def __init__(self, base_url='https://cdn.jsdelivr.net/npm'):
        self.chart_count = 0
        self.base_url = base_url
        
    @property
    def output_div(self):
        return "vega-chart-{}".format(self.chart_count)
        
    def __call__(self, spec, embed_options=None, json_kwds=None):
        # we need to increment the div, because all charts live in the same document
        self.chart_count += 1
        embed_options = embed_options or {}
        json_kwds = json_kwds or {}
        html = KAGGLE_HTML_TEMPLATE.format(
            spec=json.dumps(spec, **json_kwds),
            embed_opt=json.dumps(embed_options),
            output_div=self.output_div,
            base_url=self.base_url,
            vega_version=alt.VEGA_VERSION,
            vegalite_version=alt.VEGALITE_VERSION,
            vegaembed_version=alt.VEGAEMBED_VERSION
        )
        return {"text/html": html}
    
alt.renderers.register('kaggle', KaggleHtml())
print("Define and register the kaggle renderer. Enable with\n\n"
      "    alt.renderers.enable('kaggle')")

The following scatter matrix was not particularly impactful, but I decided to keep it here in case someone wants to utilize something similar in the future. It is a nice way to identify potential variable-variable relationships, and in our case, see if there are any detectable differences between the different sexes (male = 1, female = 0). 

In [None]:
alt.renderers.enable('kaggle')
# https://altair-viz.github.io/gallery/scatter_matrix.html
alt.Chart(X_training).mark_circle().encode(
    alt.X(alt.repeat("column"), type = "quantitative"),
    alt.Y(alt.repeat("row"), type = "quantitative"),
    alt.Color("sex:N")
).properties(
    width = 120,
    height = 120
).repeat(
    row = ["age", "chol", "oldpeak","thalach", "trestbps"],
    column = ["trestbps", "thalach", "oldpeak", "chol", "age"]
)

Instead of a scatter matrix, we can use a heatmap to display the pairwise correlations in a dataset. So far I haven't found a ready-made correlation heatmap example, so I created my own method, using some inspiration from this [heatmap](https://altair-viz.github.io/gallery/layered_heatmap_text.html). The trick to making the heatmap lies in how we construct the DataFrame. Below I go through the DataFrame processing step by step.

In [None]:
# Combine the features and target of the training dataset 
# (Note: I didn't do this on the original dataframe to avoid bias entering our analysis)
pd.concat([X_training, y_training], sort = False, axis = 1)

In [None]:
# .corr() finds correlational relationships for each column
# I use reset_index() for the melting step to treat column names as a variable instead of an index (see below)
pd.concat([X_training, y_training], sort = False, axis = 1).corr().reset_index()

In [None]:
# added rounding and melt data frame columns into "var2"
pd.concat([X_training, y_training], sort = False, axis = 1).corr().round(2).reset_index().melt(id_vars = "index",var_name = "var2", value_name = "corr_val")
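
To make the reshaping easier to follow, here is the same chain applied to a tiny made-up frame (`toy` and its columns `a`, `b`, `c` are hypothetical). Each row of the result is one cell of the correlation matrix, which is the long format the heatmap needs.

```python
import pandas as pd

# Tiny synthetic frame: b rises with a, c falls as a rises
toy = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [4, 3, 2, 1]})

# Wide correlation matrix, with the row labels moved into an "index" column
wide = toy.corr().round(2).reset_index()

# One row per (index, var2) pair, with the correlation in "corr_val"
long = wide.melt(id_vars="index", var_name="var2", value_name="corr_val")
print(long)
```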

Next, it's just a matter of making the plots:

In [None]:
# Inspired by https://altair-viz.github.io/gallery/layered_heatmap_text.html
def make_heatmap(corr_df):
    '''
    Take in a long-format correlation DataFrame and create a heatmap.

    Arguments:
    corr_df (DataFrame) - correlation values with columns "index", "var2" and "corr_val"

    Returns:
    A layered Altair chart (colored cells with text labels)
    '''
    # Shared x/y encoding for both layers
    base = alt.Chart(corr_df).encode(
        alt.X("index"),
        alt.Y("var2")
    )
    # Colored cells
    heatmap = base.mark_rect().encode(
        alt.Color("corr_val", scale = alt.Scale(scheme = "viridis"))
    )
    # Correlation value printed on each cell
    text = base.mark_text().encode(
        text = "corr_val"
    )
    return (heatmap + text).properties(height = 500, width = 500)

In [None]:

corr_df = pd.concat([X_training, y_training], sort = False, axis = 1).corr().round(2).reset_index().melt(id_vars = "index",var_name = "var2", value_name = "corr_val")
make_heatmap(corr_df)

Creating a correlation heatmap allows us to discover potential relationships between the variables and the target. Thalach, cp, exang, and oldpeak had the highest Pearson correlations with the target value. The variables slope and oldpeak have the strongest correlation with each other, at -0.58.

But what happens if we assess the correlations by sex? Do the feature correlations change?

In [None]:
male_corr = (
    pd.concat([X_training, y_training], sort = False, axis = 1)
    .query("sex == 1").drop(["sex"], axis = 1)
    .corr().round(2).reset_index()
    .melt(id_vars = "index", var_name = "var2", value_name = "corr_val")
)
make_heatmap(male_corr)

In [None]:
female_corr = (
    pd.concat([X_training, y_training], sort = False, axis = 1)
    .query("sex == 0").drop(["sex"], axis = 1)
    .corr().round(2).reset_index()
    .melt(id_vars = "index", var_name = "var2", value_name = "corr_val")
)
make_heatmap(female_corr)

In [None]:
heart_df.query("sex == 1").shape

In [None]:
heart_df.query("sex == 0").shape

There are [differences in heart disease prevalence](https://www.ncbi.nlm.nih.gov/pubmed/10863872) between males and females. One issue with the dataset is that it contains more than twice as many observations for males as for females. This could explain some of the differences in feature correlations that we observe when assessing each sex independently.

### Generating Models

The plan is to train 3 different types of models, on all the data, each using the top 7 features ranked by absolute correlation with the target:
1. An overall model based on highest correlated features inclusive of both sexes
2. A model trained on features highly correlated in male heart disease
3. A model trained on features highly correlated in female heart disease

The question is whether the models perform about the same when predicting heart disease outcomes for men versus women.
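
In this notebook the top features are read off the heatmaps by eye. As a rough sketch, the same ranking could be computed directly; the helper `top_k_correlated` and the synthetic frame `demo` below are hypothetical and only illustrate the idea.

```python
import numpy as np
import pandas as pd

def top_k_correlated(df, target="target", k=7):
    """Return the k column names with the largest absolute Pearson correlation to target."""
    corr = df.corr()[target].drop(target)
    return corr.abs().sort_values(ascending=False).head(k).index.tolist()

# Synthetic data: "strong" tracks the signal closely, "weak" loosely, "noise" not at all
rng = np.random.default_rng(42)
n = 300
signal = rng.normal(size=n)
demo = pd.DataFrame({
    "strong": signal + rng.normal(scale=0.1, size=n),
    "weak": signal + rng.normal(scale=3.0, size=n),
    "noise": rng.normal(size=n),
    "target": (signal > 0).astype(int),
})

print(top_k_correlated(demo, k=2))
```

For the sex-specific lists, the same helper could be applied after filtering the training frame with `query("sex == 1")` or `query("sex == 0")` and dropping the `sex` column.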

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_training, y_training, test_size = 0.2, random_state = 2)

numerical_categories = ["age", "chol", "oldpeak", "thalach", "trestbps"]

preprocessor = ColumnTransformer(transformers=[
        ('scale', StandardScaler(), numerical_categories)], remainder = "passthrough")

new_columns = ["age", "chol", "oldpeak", "thalach", "trestbps", "sex", "cp", "fbs", "restecg", "exang", "slope", "ca", "thal"]

X_train_scaled = pd.DataFrame(preprocessor.fit_transform(X_train), columns = new_columns)
X_test_scaled = pd.DataFrame(preprocessor.transform(X_test), columns = new_columns)
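
The hard-coded `new_columns` list works because `ColumnTransformer` places the transformed columns first and appends the passthrough columns in their original order. A small sketch of deriving that order instead of typing it out, using a made-up frame `demo` (the names `numeric`, `pre`, and `out_columns` are assumptions for this example):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Tiny synthetic frame with two numeric and two categorical-coded columns
demo = pd.DataFrame({
    "age": [40, 50, 60, 70],
    "sex": [0, 1, 1, 0],
    "chol": [200.0, 220.0, 260.0, 240.0],
    "cp": [0, 1, 2, 3],
})
numeric = ["age", "chol"]

pre = ColumnTransformer(
    transformers=[("scale", StandardScaler(), numeric)],
    remainder="passthrough",
)

# Scaled columns come first, then the untouched columns in their original order
passthrough = [c for c in demo.columns if c not in numeric]
out_columns = numeric + passthrough

# Keeping the original index keeps the rows aligned with the target labels
scaled = pd.DataFrame(pre.fit_transform(demo), columns=out_columns, index=demo.index)
print(scaled)
```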

### Comprehensive Model

Trained on highest correlated features while not considering gender.

In [None]:
X_train_comp = X_train_scaled[["thalach", "oldpeak","exang","cp","ca","thal","slope"]]
X_test_comp = X_test_scaled[["thalach", "oldpeak","exang","cp","ca","thal","slope"]]

In [None]:
comp_RF_model = RandomForestClassifier(n_estimators = 150)
comp_RF_model.fit(X_train_comp, y_train.to_numpy().ravel())
print(f"Training Accuracy: {comp_RF_model.score(X_train_comp, y_train)}")
print(f"Test Accuracy: {comp_RF_model.score(X_test_comp, y_test)}")

In [None]:
comp_XGBoost = XGBClassifier()
comp_XGBoost.fit(X_train_comp, y_train.to_numpy().ravel())
print(f"Training Accuracy: {comp_XGBoost.score(X_train_comp, y_train)}")
print(f"Test Accuracy: {comp_XGBoost.score(X_test_comp, y_test)}")

### Male-focused Model

Trained on the highest-correlated features when only considering male data.

In [None]:
X_train_male = X_train_scaled[["thalach", "oldpeak","exang","cp","slope","ca","age"]]
X_test_male = X_test_scaled[["thalach", "oldpeak","exang","cp","slope","ca","age"]]

In [None]:
male_RF_model = RandomForestClassifier(n_estimators = 150)
male_RF_model.fit(X_train_male, y_train.to_numpy().ravel())
print(f"Training Accuracy: {male_RF_model.score(X_train_male, y_train)}")
print(f"Test Accuracy: {male_RF_model.score(X_test_male, y_test)}")

In [None]:
male_XGBoost = XGBClassifier()
male_XGBoost.fit(X_train_male, y_train.to_numpy().ravel())
print(f"Training Accuracy: {male_XGBoost.score(X_train_male, y_train)}")
print(f"Test Accuracy: {male_XGBoost.score(X_test_male, y_test)}")

### Female-focused Model

Trained on the highest-correlated features when only considering female data.

In [None]:
X_train_female = X_train_scaled[["cp", "exang","thal","oldpeak","ca","slope","trestbps"]]
X_test_female = X_test_scaled[["cp", "exang","thal","oldpeak","ca","slope","trestbps"]]

In [None]:
female_RF_model = RandomForestClassifier(n_estimators = 150)
female_RF_model.fit(X_train_female, y_train.to_numpy().ravel())
print(f"Training Accuracy: {female_RF_model.score(X_train_female, y_train)}")
print(f"Test Accuracy: {female_RF_model.score(X_test_female, y_test)}")

In [None]:
female_XGBoost = XGBClassifier()
female_XGBoost.fit(X_train_female, y_train.to_numpy().ravel())
print(f"Training Accuracy: {female_XGBoost.score(X_train_female, y_train)}")
print(f"Test Accuracy: {female_XGBoost.score(X_test_female, y_test)}")

## Testing on the hidden vault data

Finally, we will test these models on the vault data to see how well they do when predicting male heart disease versus female heart disease.

In [None]:
vault_male = pd.concat([X_vault, y_vault], axis = 1, sort = False).query("sex == 1")
vault_female = pd.concat([X_vault, y_vault], axis = 1, sort = False).query("sex == 0")
print(vault_male.shape)
print(vault_female.shape)

In [None]:
# Apply the scaling
X_vault_scaled = pd.DataFrame(preprocessor.transform(X_vault), columns = new_columns)
X_vault_male_scaled = pd.DataFrame(preprocessor.transform(vault_male.drop(columns = ['target'])), columns = new_columns)
X_vault_female_scaled = pd.DataFrame(preprocessor.transform(vault_female.drop(columns = ['target'])), columns = new_columns)

In [None]:
comp_features = ["thalach", "oldpeak","exang","cp","ca","thal","slope"]
male_features = ["thalach", "oldpeak","exang","cp","slope","ca","age"]
female_features = ["cp", "exang","thal","oldpeak","ca","slope","trestbps"]

XG_score_dict = {"XG Boost Scores":["Both Genders", "Male Data","Female Data"]}

XG_score_dict["Comprehensive Model Acc"] = [comp_XGBoost.score(X_vault_scaled[comp_features], y_vault),
                                        comp_XGBoost.score(X_vault_male_scaled[comp_features], vault_male[['target']]),
                                       comp_XGBoost.score(X_vault_female_scaled[comp_features], vault_female[['target']])]

XG_score_dict["Male-focused Model Acc"] = [male_XGBoost.score(X_vault_scaled[male_features], y_vault),
                                        male_XGBoost.score(X_vault_male_scaled[male_features], vault_male[['target']]),
                                       male_XGBoost.score(X_vault_female_scaled[male_features], vault_female[['target']])]

XG_score_dict["Female-focused Model Acc"] =[female_XGBoost.score(X_vault_scaled[female_features], y_vault),
                                        female_XGBoost.score(X_vault_male_scaled[female_features], vault_male[['target']]),
                                       female_XGBoost.score(X_vault_female_scaled[female_features], vault_female[['target']])]
pd.DataFrame(XG_score_dict).round(2)

The comprehensive model performed best when predicting for data including both genders, as well as the male-exclusive data. Interestingly, the female-focused model predicted the female data with 6% less accuracy than both the comprehensive and male-focused models.
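
A per-sex accuracy comparison like the one in the table can also be computed in a single pass by grouping the predictions. This is a minimal sketch with made-up labels; `accuracy_by_group` is a hypothetical helper, not part of scikit-learn.

```python
import numpy as np
import pandas as pd

def accuracy_by_group(y_true, y_pred, groups):
    """Return the fraction of correct predictions within each group."""
    correct = np.asarray(y_true) == np.asarray(y_pred)
    frame = pd.DataFrame({"correct": correct, "group": np.asarray(groups)})
    return frame.groupby("group")["correct"].mean()

# Hypothetical labels and predictions (1 = male, 0 = female in the sex column)
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 0]
sex = [1, 1, 1, 0, 0, 0]

acc = accuracy_by_group(y_true, y_pred, sex)
print(acc)
```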

In [None]:
RF_score_dict = {"RF Scores":["Both Genders", "Male Data","Female Data"]}

RF_score_dict["Comprehensive Model Acc"] = [comp_RF_model.score(X_vault_scaled[comp_features], y_vault),
                                        comp_RF_model.score(X_vault_male_scaled[comp_features], vault_male[['target']]),
                                       comp_RF_model.score(X_vault_female_scaled[comp_features], vault_female[['target']])]

RF_score_dict["Male-focused Model Acc"] = [male_RF_model.score(X_vault_scaled[male_features], y_vault),
                                        male_RF_model.score(X_vault_male_scaled[male_features], vault_male[['target']]),
                                       male_RF_model.score(X_vault_female_scaled[male_features], vault_female[['target']])]

RF_score_dict["Female-focused Model Acc"] =[female_RF_model.score(X_vault_scaled[female_features], y_vault),
                                        female_RF_model.score(X_vault_male_scaled[female_features], vault_male[['target']]),
                                       female_RF_model.score(X_vault_female_scaled[female_features], vault_female[['target']])]

pd.DataFrame(RF_score_dict).round(2)

The RF model performed better in general than the XGBoost model, with the lowest prediction score being 98%. Once again the female-focused model performed worse when predicting female-exclusive data. The male-focused model performed slightly better than the comprehensive model when predicting male-exclusive data. 

## Conclusion

In this study, different models were constructed based on the highest-correlated features for each gender. Overall, the RF model performed better than the XGBoost model on both the test and hidden vault data. The main question was whether training a model on features important to a specific gender improved performance on gender-specific data, compared to a model trained on features important to both genders. The only case where an improvement was observed was the male-focused RF model, which performed slightly better (~1%) than the comprehensive model on male-specific data.

Interestingly, both the female-focused RF and XGBoost models performed worse on female-specific data than both the comprehensive and male-focused models. One potential explanation is the small number of female observations in the vault data (53). Such a small sample may not be sufficient to give a reliable accuracy score. It is also important to note that the correlations used were based on a subset of the data and don't necessarily provide an accurate representation of the population. In hindsight, it may have been better to separate the data by gender first and then split each group into independent training and test sets.
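
To get a feel for how much uncertainty 53 observations leave, one can compute a confidence interval for an accuracy score. The sketch below uses the Wilson score interval; the accuracy value 0.90 is purely illustrative and is not a result from this notebook, and `wilson_interval` is a helper written for this example.

```python
import math

def wilson_interval(accuracy, n, z=1.96):
    """Wilson score interval for a proportion observed on n samples (z = 1.96 is roughly 95%)."""
    denom = 1 + z**2 / n
    centre = (accuracy + z**2 / (2 * n)) / denom
    half = z * math.sqrt(accuracy * (1 - accuracy) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

# Same illustrative accuracy, evaluated on 53 samples versus ten times as many
low_small, high_small = wilson_interval(0.90, 53)
low_large, high_large = wilson_interval(0.90, 530)

print(f"n=53:  ({low_small:.3f}, {high_small:.3f})")
print(f"n=530: ({low_large:.3f}, {high_large:.3f})")
```

The interval for the smaller sample is noticeably wider, so differences of a few percent between models on the female vault data should be read with caution.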

In this study, we explored potential differences in models constructed with a specific gender in mind. Perhaps, there is a better method of feature selection that could be implemented. Furthermore, more models could be explored along with their hyperparameters. 
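
As a starting point for the hyperparameter exploration, here is a minimal `GridSearchCV` sketch on synthetic data. The parameter grid, `X_demo`, and `y_demo` are assumptions chosen for illustration; none of these settings were tuned or tested on the heart disease data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic 7-feature classification problem standing in for the real data
X_demo, y_demo = make_classification(
    n_samples=200, n_features=7, n_informative=4, random_state=0
)

# Small, hypothetical grid; a real search would likely cover more values
param_grid = {"n_estimators": [50, 150], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Cross-validation inside the search means the separate test split is touched less often, which fits the "golden rule" followed earlier in the notebook.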

If anyone has any comments or suggestions, I encourage you to voice them, as I am always looking for feedback. If you enjoyed this study or learned something useful, I would appreciate an upvote. I may consider creating a Heroku dashboard to visualize the heart disease data in the future (using Plotly Dash).

