# Part I. Introduction

Machine learning is valuable in disease diagnosis because it can predict outcomes from high-dimensional patterns that are difficult for humans to recognize. The goal of this group project is to apply machine learning models to a heart disease dataset and make predictions.

# Part II. Description

### Attribute Information:
- age: The person's age in years
- sex: The person's sex (1 = male, 0 = female)
- cp: The chest pain experienced (0 = typical angina, 1 = atypical angina, 2 = non-anginal pain, 3 = asymptomatic)
- trestbps: The person's resting blood pressure (mm Hg on admission to the hospital)
- chol: The person's cholesterol measurement in mg/dl
- fbs: The person's fasting blood sugar (> 120 mg/dl, 1 = true; 0 = false)
- restecg: Resting electrocardiographic measurement (0 = normal, 1 = having ST-T wave abnormality, 2 = showing probable or definite left ventricular hypertrophy by Estes' criteria)
- thalach: The person's maximum heart rate achieved
- exang: Exercise induced angina (1 = yes; 0 = no)
- oldpeak: ST depression induced by exercise relative to rest ('ST' relates to positions on the ECG plot)
- slope: the slope of the peak exercise ST segment (0 = upsloping, 1 = flat, 2 = downsloping)
- ca: The number of major vessels (0-3; the CSV used here also contains the value 4)
- thal: A blood disorder called thalassemia (1 = normal; 2 = fixed defect; 3 = reversible defect)
- target: Heart disease (0 = no, 1 = yes)
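The documented codes can be checked against the raw values before any relabeling. The following is a minimal sketch using only the standard library; the helper name `unexpected_codes` and the two sample rows are hypothetical, and only columns whose code sets are stated unambiguously in the list above are included.

```python
# Allowed integer codes per column, taken from the attribute list above.
# The helper name and the sample rows below are hypothetical.
ALLOWED_CODES = {
    "sex": {0, 1},
    "fbs": {0, 1},
    "restecg": {0, 1, 2},
    "exang": {0, 1},
    "slope": {0, 1, 2},
    "target": {0, 1},
}


def unexpected_codes(rows, allowed=ALLOWED_CODES):
    """Return {column: sorted list of codes that are not in the documented set}."""
    found = {}
    for row in rows:
        for column, codes in allowed.items():
            value = row.get(column)
            if value is not None and value not in codes:
                found.setdefault(column, set()).add(value)
    return {column: sorted(values) for column, values in found.items()}


sample_rows = [
    {"sex": 1, "fbs": 0, "restecg": 0, "exang": 0, "slope": 0, "target": 1},
    {"sex": 0, "fbs": 1, "restecg": 3, "exang": 0, "slope": 2, "target": 1},
]
print(unexpected_codes(sample_rows))
```

For a pandas DataFrame, `DataFrame.to_dict("records")` produces rows in this shape, assuming the CSV keeps the short column names listed above.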

# Part III. Data Processing and Model Application

In [1]:
# import basic modules
import pydotplus
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import display
from IPython.display import Image
from sklearn import tree
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import GridSearchCV

# import ML model here
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingClassifier

# suppress warning messages
import warnings
warnings.filterwarnings("ignore")

# set random seed
RANDOM_SEED = 42

In [2]:
data = pd.read_csv('https://raw.githubusercontent.com/jiajieyuan1010/kaggle/master/heart.csv')

In [3]:
# fix column names
data.columns=["age","sex","chest_pain_experienced","resting_blood_pressure","cholesterol_measurement",
              "fasting_blood_sugar","resting_electrocardiographic_measurement","maximum_heart_rate",
              "exercise_induced_angina","st_depression","peak_exercise_slope","number_of_major_vessels",
              "thalassemia","target"]

# map the integer codes of the categorical columns to readable labels;
# Series.replace avoids chained assignment and leaves unlisted codes unchanged
data['sex'] = data['sex'].replace({0: 'female', 1: 'male'})

data['chest_pain_experienced'] = data['chest_pain_experienced'].replace(
    {0: 'typical angina', 1: 'atypical angina', 2: 'non-anginal pain', 3: 'asymptomatic'})

data['fasting_blood_sugar'] = data['fasting_blood_sugar'].replace(
    {0: 'lower than 120mg/ml', 1: 'higher than 120mg/ml'})

data['resting_electrocardiographic_measurement'] = data['resting_electrocardiographic_measurement'].replace(
    {0: 'normal', 1: 'ST-T wave abnormality', 2: 'left ventricular hypertrophy'})

data['exercise_induced_angina'] = data['exercise_induced_angina'].replace({0: 'no', 1: 'yes'})

data['peak_exercise_slope'] = data['peak_exercise_slope'].replace(
    {0: 'upsloping', 1: 'flat', 2: 'downsloping'})

data['thalassemia'] = data['thalassemia'].replace(
    {1: 'normal', 2: 'fixed defect', 3: 'reversable defect'})

data.head()

Unnamed: 0,age,sex,chest_pain_experienced,resting_blood_pressure,cholesterol_measurement,fasting_blood_sugar,resting_electrocardiographic_measurement,maximum_heart_rate,exercise_induced_angina,st_depression,peak_exercise_slope,number_of_major_vessels,thalassemia,target
0,63,male,asymptomatic,145,233,higher than 120mg/ml,normal,150,no,2.3,upsloping,0,normal,1
1,37,male,non-anginal pain,130,250,lower than 120mg/ml,ST-T wave abnormality,187,no,3.5,upsloping,0,fixed defect,1
2,41,female,atypical angina,130,204,lower than 120mg/ml,normal,172,no,1.4,downsloping,0,fixed defect,1
3,56,male,atypical angina,120,236,lower than 120mg/ml,ST-T wave abnormality,178,no,0.8,downsloping,0,fixed defect,1
4,57,female,typical angina,120,354,lower than 120mg/ml,ST-T wave abnormality,163,yes,0.6,downsloping,0,fixed defect,1


In [4]:
data_final = pd.get_dummies(data, drop_first=True)
display(data_final.head(), data_final.describe(), data_final.shape)

# get variable names
Y_NAME = 'target'
X_NAME = [x for x in data_final.columns if x != Y_NAME]

Unnamed: 0,age,resting_blood_pressure,cholesterol_measurement,maximum_heart_rate,st_depression,number_of_major_vessels,target,sex_male,chest_pain_experienced_atypical angina,chest_pain_experienced_non-anginal pain,chest_pain_experienced_typical angina,fasting_blood_sugar_lower than 120mg/ml,resting_electrocardiographic_measurement_left ventricular hypertrophy,resting_electrocardiographic_measurement_normal,exercise_induced_angina_yes,peak_exercise_slope_flat,peak_exercise_slope_upsloping,thalassemia_fixed defect,thalassemia_normal,thalassemia_reversable defect
0,63,145,233,150,2.3,0,1,1,0,0,0,0,0,1,0,0,1,0,1,0
1,37,130,250,187,3.5,0,1,1,0,1,0,1,0,0,0,0,1,1,0,0
2,41,130,204,172,1.4,0,1,0,1,0,0,1,0,1,0,0,0,1,0,0
3,56,120,236,178,0.8,0,1,1,1,0,0,1,0,0,0,0,0,1,0,0
4,57,120,354,163,0.6,0,1,0,0,0,1,1,0,0,1,0,0,1,0,0


Unnamed: 0,age,resting_blood_pressure,cholesterol_measurement,maximum_heart_rate,st_depression,number_of_major_vessels,target,sex_male,chest_pain_experienced_atypical angina,chest_pain_experienced_non-anginal pain,chest_pain_experienced_typical angina,fasting_blood_sugar_lower than 120mg/ml,resting_electrocardiographic_measurement_left ventricular hypertrophy,resting_electrocardiographic_measurement_normal,exercise_induced_angina_yes,peak_exercise_slope_flat,peak_exercise_slope_upsloping,thalassemia_fixed defect,thalassemia_normal,thalassemia_reversable defect
count,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0
mean,54.366337,131.623762,246.264026,149.646865,1.039604,0.729373,0.544554,0.683168,0.165017,0.287129,0.471947,0.851485,0.013201,0.485149,0.326733,0.462046,0.069307,0.547855,0.059406,0.386139
std,9.082101,17.538143,51.830751,22.905161,1.161075,1.022606,0.498835,0.466011,0.371809,0.453171,0.500038,0.356198,0.114325,0.500606,0.469794,0.499382,0.254395,0.498528,0.236774,0.487668
min,29.0,94.0,126.0,71.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,47.5,120.0,211.0,133.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,55.0,130.0,240.0,153.0,0.8,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
75%,61.0,140.0,274.5,166.0,1.6,1.0,1.0,1.0,0.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0
max,77.0,200.0,564.0,202.0,6.2,4.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


(303, 20)
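The effect of `drop_first=True` is easier to see on a small frame. The sketch below uses a hypothetical three-row frame (`toy`, with a shortened column name `slope`), not the project data. For each text column, `pd.get_dummies` sorts the levels and drops the first one, which then acts as the baseline.

```python
import pandas as pd

# Hypothetical three-row frame, used only to illustrate the encoding step.
toy = pd.DataFrame({
    "age": [63, 37, 41],
    "sex": ["male", "male", "female"],
    "slope": ["upsloping", "flat", "downsloping"],
})

# Text columns are one-hot encoded; numeric columns pass through unchanged.
# drop_first=True removes the first level (in sorted order) of each encoded column.
encoded = pd.get_dummies(toy, drop_first=True)

print(list(encoded.columns))
```

Here `female` and `downsloping` are the dropped baselines, which matches the column names in the output above, where `sex_male`, `peak_exercise_slope_flat` and `peak_exercise_slope_upsloping` remain.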

# Part IV. App

In [5]:
# We add all necessary Plotly and Dash libraries
import plotly.graph_objects as go

import dash
import dash_core_components as dcc
import dash_html_components as html
#import dash_daq as daq
from dash.dependencies import Input, Output

# We collect the feature column names (every column except the target)
col_names = [x for x in data_final.columns if x != Y_NAME]

# We train a simple RF model
model = RandomForestRegressor()
model.fit(data_final.drop("target", axis=1), data_final["target"])

# We create a DataFrame to store the features' importance and their corresponding label
df_feature_importances = pd.DataFrame(model.feature_importances_*100,columns=["Importance"],index=col_names)
df_feature_importances = df_feature_importances.sort_values("Importance", ascending=False)
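The model above is a `RandomForestRegressor` fitted on the 0/1 target using all rows, so nothing is held out for evaluation. As a point of comparison only, and not the method used in this notebook, the sketch below fits a `RandomForestClassifier` on a training split and scores its `predict_proba` output on held-out rows. The data is a synthetic stand-in generated with `make_classification`; the sample and feature counts are assumptions chosen to mirror the shape printed earlier.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

RANDOM_SEED = 42

# Synthetic stand-in for a 303-row, 19-feature design matrix (assumption).
X, y = make_classification(n_samples=303, n_features=19, random_state=RANDOM_SEED)

# Stratified split keeps the class balance similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=RANDOM_SEED)

clf = RandomForestClassifier(n_estimators=100, random_state=RANDOM_SEED)
clf.fit(X_train, y_train)

# Column 1 of predict_proba is the estimated probability of class 1.
proba = clf.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, proba)
print(f"held-out ROC AUC: {auc:.3f}")
```

With the project data, `X` and `y` would be `data_final[X_NAME]` and `data_final[Y_NAME]`.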

In [6]:
df = data_final

# We create a Features Importance Bar Chart
fig_features_importance = go.Figure()
fig_features_importance.add_trace(
    go.Bar(x=df_feature_importances.index,
           y=df_feature_importances["Importance"],
           marker_color='indianred'))
fig_features_importance.update_layout(
    title_text='<b>The importance of various factors that cause heart disease</b>', title_x=0.5)
# The command below can be activated in a standard notebook to display the chart
#fig_features_importance.show()

# We define the question shown above each of the seven inputs
slider_1_label = 'Have you ever experienced typical angina chest pain?'

slider_2_label = 'Do you have symptoms of a thalassemia fixed defect?'

slider_3_label = 'What is the number of major vessels in your heart?'

slider_4_label = 'Is your fasting blood sugar lower than 120 mg/dl?'

slider_5_label = 'Do you have symptoms of exercise-induced angina?'

slider_6_label = 'Is the resting electrocardiographic measurement normal?'

slider_7_label = 'Please choose your gender'

In [7]:
group_colors = {"control": "light blue", "reference": "red"}

app = dash.Dash(__name__,
                meta_tags=[{
                    "name": "viewport",
                    "content": "width=device-width"
                }])
server = app.server

app.layout = html.Div(
    style={
        'textAlign': 'center',
        'width': '1000px',
        'fontFamily': 'Verdana'
    },
    children=[

        # Title display
        html.H1(children="Heart Disease Prediction"),

        # Dash Graph Component calls the fig_features_importance parameters
        dcc.Graph(figure=fig_features_importance),

        # We display the most important feature's name
        html.H4(children=slider_1_label),
        html.Div([
            dcc.RadioItems(id='X1_slider',
                           options=[{
                               'label': 'Yes',
                               'value': 1
                           }, {
                               'label': 'No',
                               'value': 0
                           }],
                           value=1),
        ],
            style={
            'width': '48%',
                     'display': 'inline-block'
        }),

        # The same logic is applied to the following names / sliders
        html.H4(children=slider_2_label),
        html.Div([
            dcc.RadioItems(id='X2_slider',
                           options=[{
                               'label': 'Yes',
                               'value': 1
                           }, {
                               'label': 'No',
                               'value': 0
                           }],
                           value=1),
        ],
            style={
            'width': '48%',
                     'display': 'inline-block'
        }),
        html.H4(children=slider_3_label),
        html.Div([
            dcc.RadioItems(id='X3_slider',
                           options=[
                               {
                                   'label': '0',
                                   'value': 0
                               },
                               {
                                   'label': '1',
                                   'value': 1
                               },
                               {
                                   'label': '2',
                                   'value': 2
                               },
                               {
                                   'label': '3',
                                   'value': 3
                               },
                               {
                                   'label': '4',
                                   'value': 4
                               },
                           ],
                           value=1),
        ],
            style={
            'width': '48%',
                     'display': 'inline-block'
        }),
        html.H4(children=slider_4_label),
        html.Div([
            dcc.RadioItems(id='X4_slider',
                           options=[
                               {
                                   'label': 'Yes',
                                   'value': 1
                               },
                               {
                                   'label': 'No',
                                   'value': 0
                               },
                           ],
                           value=1),
        ],
            style={
            'width': '48%',
                     'display': 'inline-block'
        }),
        html.H4(children=slider_5_label),
        html.Div([
            dcc.RadioItems(id='X5_slider',
                           options=[
                               {
                                   'label': 'Yes',
                                   'value': 1
                               },
                               {
                                   'label': 'No',
                                   'value': 0
                               },
                           ],
                           value=1),
        ],
            style={
            'width': '48%',
                     'display': 'inline-block'
        }),
        
        html.H4(children=slider_6_label),
        html.Div([
            dcc.RadioItems(id='X6_slider',
                           options=[
                               {
                                   'label': 'Yes',
                                   'value': 1
                               },
                               {
                                   'label': 'No',
                                   'value': 0
                               },
                           ],
                           value=1),
        ],
            style={
            'width': '48%',
                     'display': 'inline-block'
        }),
        
        html.H4(children=slider_7_label),
        html.Div([
            dcc.RadioItems(id='X7_slider',
                           options=[
                               {
                                   'label': 'Male',
                                   'value': 1
                               },
                               {
                                   'label': 'Female',
                                   'value': 0
                               },
                           ],
                           value=1),
        ],
            style={
            'width': '48%',
                     'display': 'inline-block'
        }),

        # The prediction result will be displayed and updated here
        html.H2(id="prediction_result"),
        html.A("Link to external site: How to prevent heart disease ?",
               href='https://www.cdc.gov/heartdisease/prevention.htm',
               target="_blank"),
    ])

In [8]:
# The callback provides one Output in the form of a string (the "children" property)
@app.callback(
    Output(component_id="prediction_result", component_property="children"),
    # The values of the seven radio groups are obtained through their id and value property
    [
        Input("X1_slider", "value"),
        Input("X2_slider", "value"),
        Input("X3_slider", "value"),
        Input("X4_slider", "value"),
        Input("X5_slider", "value"),
        Input("X6_slider", "value"),
        Input("X7_slider", "value"),
    ])
# The input variables are listed in the same order as the callback Inputs
def update_prediction(X1, X2, X3, X4, X5, X6, X7):

    # Start from the column means, in the column order the model was trained on,
    # then overwrite by name the seven features the user answered, so that each
    # answer lands in the column its question refers to
    row = df[col_names].mean()
    row["chest_pain_experienced_typical angina"] = X1
    row["thalassemia_fixed defect"] = X2
    row["number_of_major_vessels"] = X3
    row["fasting_blood_sugar_lower than 120mg/ml"] = X4
    row["exercise_induced_angina_yes"] = X5
    row["resting_electrocardiographic_measurement_normal"] = X6
    row["sex_male"] = X7
    input_X = row.to_numpy().reshape(1, -1)

    # Prediction is calculated based on the input_X array
    prediction = model.predict(input_X)[0]

    # And returned to the Output of the callback function
    return "The probability of having heart disease is {:.1f}%.".format(
        prediction * 100)


if __name__ == "__main__":
    app.run_server()

Dash is running on http://127.0.0.1:8050/

 * Serving Flask app "__main__" (lazy loading)
 * Environment: production
   Use a production WSGI server instead.
 * Debug mode: off


 * Running on http://127.0.0.1:8050/ (Press CTRL+C to quit)
127.0.0.1 - - [10/Dec/2020 16:50:12] "GET / HTTP/1.1" 200 -
127.0.0.1 - - [10/Dec/2020 16:50:13] "GET /_dash-dependencies HTTP/1.1" 200 -
127.0.0.1 - - [10/Dec/2020 16:50:13] "GET /_dash-layout HTTP/1.1" 200 -
127.0.0.1 - - [10/Dec/2020 16:50:13] "POST /_dash-update-component HTTP/1.1" 200 -
