#Diagnosing Heart Disease with AI: Creating a Model
*Daniela Ganelin,
[Inspirit AI](inspiritai)*

Heart disease is one of the world's biggest health problems! [**Almost half**](https://www.heart.org/en/news/2019/01/31/cardiovascular-diseases-affect-nearly-half-of-american-adults-statistics-show) of American adults have some kind of heart disease.

Usually, heart disease is diagnosed through a [special X-ray](https://www.mayoclinic.org/tests-procedures/coronary-angiogram/about/pac-20384904) where dye is injected into the body. Of course, that's a pretty complicated and expensive procedure!

<img src="https://upload.wikimedia.org/wikipedia/commons/f/f3/Herzkatheterlabor_modern.jpeg" alt="drawing" height="250px"/>

<img src="https://upload.wikimedia.org/wikipedia/commons/3/30/Cerebral_angiography%2C_arteria_vertebralis_sinister_injection.JPG" alt="drawing" height="250px"/>

What if we could instead use AI to diagnose heart disease based on **some simple lab tests** that any doctor or nurse could perform? **How could AI help people?**

Let's try it! In this lab, we'll:
- Explore a heart disease dataset
- Make graphs to visualize the data
- Try to diagnose heart disease with simple rules
- Make and improve a machine learning model to diagnose heart disease!

#Understanding Our Data

In [None]:
#@title Run this to load our tools and data!

#Check out this post for more details! https://www.kaggle.com/ronitf/heart-disease-uci/discussion/105877

import pandas as pd   # Great for tables (google spreadsheets, microsoft excel, csv). 
import os # Good for navigating your computer's files 
import gdown
import seaborn as sns
from sklearn.metrics import accuracy_score
from sklearn import tree
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier

gdown.download('https://drive.google.com/uc?id=1JmrQ7RAIWQR7NK9ziHpW9FTpathgGSI3', 'heart.csv', True)
patient_data = pd.read_csv("heart.csv")
patient_data = patient_data[['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'thalach', 'exang', 'target']]
column_names = {'age':'age','sex':'sex','cp':'chest_pain', 'trestbps':'blood_pressure','chol':'cholesterol','fbs':'high_blood_sugar','thalach':'heart_rate','exang':'exercise_pain','target':'disease'}
patient_data = patient_data.rename(column_names,axis=1)
patient_data['chest_pain'] = (patient_data['chest_pain'] > 0).astype(int) #1 for yes, 0 for no
patient_data['disease'] = 1 - patient_data['disease'] #1 for yes, 0 for no
patient_data = patient_data[['age', 'blood_pressure',  'cholesterol', 'heart_rate', 'sex', 'high_blood_sugar', 'chest_pain', 'exercise_pain', 'disease']]

def show_predictions(predictions):
  df = patient_data[['heart_rate','disease']].copy()
  df['prediction'] = predictions
  print ("Percent accurate:", round(100 * accuracy_score(patient_data['disease'], predictions), 1)) #accuracy_score returns a fraction from 0 to 1, so multiply by 100
  display(df)

def visualize_tree(model, input_data):
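  fig_size = min(max(model.get_depth(), 1) * 2, 40) #use the fitted tree's real depth, so max_depth=None also works
  fig, axes = plt.subplots(nrows = 1,ncols = 1,figsize = (fig_size,fig_size), dpi=800)
  tree.plot_tree(model, 
                class_names=['no disease', 'disease'], 
                feature_names = list(input_data.columns), #a plain list works across scikit-learn versions
                filled = True,
                impurity = False)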

Let's take a look at the Cleveland Heart Disease Dataset from real patients!

In [None]:
display (patient_data)

Let's understand our data! Work with your team to answer the questions, and click Play to check your answers.

In [None]:
#@title How many patients are there? (Choose a number and then click play!)
num_patients = 0 #@param {type:"slider", min:0, max:500, step:1}
 
if num_patients == len(patient_data):
  print ("Correct!")
else:
  print ("Try again!")


In [None]:
#@title Our output - the column we're trying to predict - is:
to_predict = "Choose here!" #@param ["Choose here!", 'age', 'sex', 'chest_pain', 'blood_pressure', 'cholesterol','high_blood_sugar', 'heart_rate', 'exercise_pain', 'disease']

if to_predict == "disease":
  print ("Yes! We're going to use all the other columns, or features, to predict whether someone has heart disease.")
else:
  print ("Try again!")

In [None]:
#@title How do we interpret the "disease" feature?

healthy = "Choose here!" #@param ["Choose here!", "-1", "0", "1", "2"]
heart_disease = "Choose here!" #@param ["Choose here!", "-1", "0", "1", "2"]

if healthy == "0" and heart_disease == "1":
  print ("Correct!")
else:
  print ("Try again!")

In [None]:
#@title How many input features (other columns) will we use to make our predictions?
num_features = 0 #@param {type:"slider", min:0, max:12, step:1}
 
if num_features == len(patient_data.columns)-1:
  print ("Correct! They are:", list(patient_data.drop('disease',axis=1).columns))
else:
  print ("Try again!")


#Exploring our Numerical Data

**Discuss:** Of our input features, which do you think would be **most** useful for predicting whether someone has heart disease? Make a guess:


In [None]:
most_useful = "Choose here!" #@param ['Choose here!', 'age', 'blood_pressure', 'cholesterol', 'heart_rate', 'sex', 'high_blood_sugar', 'chest_pain', 'exercise_pain']

print ("Let's test your guess!")

Let's explore one feature at a time! We'll start with the **numerical** features: features where the values are **numbers**.

##Exploring `age`

Here's the `age` column:

In [None]:
display(patient_data[['age']])

In [None]:
#@title Age is measured in:

age_units = "Choose here!" #@param ["Choose here!", "months", "days", "years", "centuries"]

if age_units == "years":
  print ("Correct!")
else:
  print ("Try again!")

Let's graph our data! We'll use a library called "seaborn", or "sns", to make graphs.

**Discuss:** How do we read this graph? What patterns do you notice? How much would "age" help us predict whether someone has heart disease?

In [None]:
sns.catplot(x="disease", y="age", data=patient_data)

##Exploring `blood_pressure`

Now, let's check out the `blood_pressure` variable! Please **show that column of the data**, like we did before:

In [None]:
#YOUR CODE HERE (1 line)

In [None]:
#@title Solution
display(patient_data[['blood_pressure']])

In [None]:
#@title "blood_pressure" is measured in "mm Hg". I predict that it's healthier for blood pressure to be:
healthier_blood_pressure = "Choose here!" #@param ["Choose here!", "Lower", "Higher"]

if healthier_blood_pressure == "Lower":
  print ("Good prediction!")
else:
  print ("Try again!")

Please **make a graph** showing how blood pressure and heart disease are related! **Discuss:** How useful is this feature? Any surprises?

In [None]:
#YOUR CODE HERE (1 line)

In [None]:
#@title Solution
sns.catplot(x="disease", y="blood_pressure", data=patient_data)

##Exploring `cholesterol`

Next up is the `cholesterol` feature.

In [None]:
#@title "cholesterol" is measured in "mg/dl". I predict that it's healthier for cholesterol to be:
healthier_cholesterol = "Choose here!" #@param ["Choose here!", "Lower", "Higher"]

if healthier_cholesterol == "Lower":
  print ("Good prediction!")
else:
  print ("Try again!")

As before, please **print out the column** and **make and discuss a graph!** Any surprises?

In [None]:
#YOUR CODE HERE to print the column 

In [None]:
#YOUR CODE HERE to make a graph

In [None]:
#@title Solution
display(patient_data[['cholesterol']])
sns.catplot(x="disease", y="cholesterol", data=patient_data)


##Exploring `heart_rate`

Last numerical feature: heart rate!

In [None]:
#@title Heart rate is the patient's highest heart rate while they exercise, in beats per minute. I predict that it's healthier for the maximum heart rate to be...
healthier_heart_rate = "Choose here!" #@param ["Choose here!", "Lower", "Higher"]

if healthier_heart_rate == "Higher":
  print ("Good prediction! It's healthier to have a high exercise heart rate and a low resting heart rate.")
else:
  print ("Try again!")

Let's test your prediction! You know what to do: look at the feature and the graph.

In [None]:
#YOUR CODE HERE

In [None]:
#@title Solution
sns.catplot(x="disease", y="heart_rate", data=patient_data)
display(patient_data[['heart_rate']])


#Exploring our Categorical Data

Now, let's check out our **categorical features**. For each of these, there are two possible values: for example, "sick" and "healthy". We represent these as 0 and 1 so our algorithms can do math with the data!
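
Here's a small sketch of why 0 and 1 are handy. The labels below are made up for illustration (they are not from the patient table), and `encoded` is just a name chosen for this example. Once the words become numbers, adding them up counts the "sick" rows, and averaging them gives the fraction that are sick.

```python
import pandas as pd

# Made-up labels for illustration only (NOT real patient data).
labels = pd.Series(["sick", "healthy", "sick", "healthy", "healthy"])

# Turn each word into a number: 0 for "healthy", 1 for "sick".
encoded = labels.map({"healthy": 0, "sick": 1})

print(list(encoded))                     # the 0/1 version of the labels
print("Number sick:", encoded.sum())     # adding 0s and 1s counts the 1s
print("Fraction sick:", encoded.mean())  # the average is the share of 1s
```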

##Exploring the `sex` Feature

Please print out the `sex` column, as before:

In [None]:
#YOUR CODE HERE

In [None]:
#@title Solution
display(patient_data[['sex']])

In [None]:
#@title How do we interpret the "sex" feature?

female = "Choose here!" #@param ["Choose here!", "-1", "0", "1", "2"]
male = "Choose here!" #@param ["Choose here!", "-1", "0", "1", "2"]

if female == "0" and male == "1":
  print ("Correct!")
else:
  print ("Try again!")

We're going to make a different kind of graph for the categorical variables!

**Discuss:** What's the relationship between gender and heart disease? In our overall dataset, do we have more men or women? Why might this be a problem?

In [None]:
sns.catplot(x="disease", hue="sex", kind = "count",  data=patient_data)

## Exploring `high_blood_sugar`

Let's check out the next categorical feature.

In [None]:
#@title How do we interpret the "high_blood_sugar" feature?

blood_sugar_is_high = "Choose here!" #@param ["Choose here!", "-1", "0", "1", "2"]
blood_sugar_is_normal = "Choose here!" #@param ["Choose here!", "-1", "0", "1", "2"]

if blood_sugar_is_high == "1" and blood_sugar_is_normal == "0":
  print ("Correct!")
else:
  print ("Try again!")

Please output the column from the data and a graph - what do the results show?

In [None]:
#YOUR CODE HERE

In [None]:
#@title Solution
sns.catplot(x="disease", hue="high_blood_sugar", kind = "count",  data=patient_data)
display(patient_data[['high_blood_sugar']])

## Exploring the `chest_pain` feature

Next, let's explore two similar features: `chest_pain` and `exercise_pain`.

In [None]:
#@title The "chest_pain" feature shows whether people have chest pain in general. How do we interpret it?

no_pain = "Choose here!" #@param ["Choose here!", "-1", "0", "1", "2"]
has_pain = "Choose here!" #@param ["Choose here!", "-1", "0", "1", "2"]

if no_pain == "0" and has_pain == "1":
  print ("Correct!")
else:
  print ("Try again!")

As before, please output the column from the data and a graph.

**Any surprising results here?** How could you explain them?

In [None]:
#YOUR CODE HERE

In [None]:
#@title Solution
sns.catplot(x="disease", hue="chest_pain", kind = "count",  data=patient_data)
display(patient_data[['chest_pain']])

##Exploring `exercise_pain`

You made it - last feature!



In [None]:
#@title The "exercise_pain" feature shows whether people have chest pain during exercise. How do we interpret it?

no_pain = "Choose here!" #@param ["Choose here!", "-1", "0", "1", "2"]
has_pain = "Choose here!" #@param ["Choose here!", "-1", "0", "1", "2"]

if no_pain == "0" and has_pain == "1":
  print ("Correct!")
else:
  print ("Try again!")

As before, please output the column from the data and a graph.

**How is this feature different from the previous one? What could explain it?**

In [None]:
#YOUR CODE HERE

In [None]:
#@title Solution
sns.catplot(x="disease", hue="exercise_pain", kind = "count",  data=patient_data)
display(patient_data[['exercise_pain']])

# Making Predictions with Rules

Now that we're familiar with our features, let's use them to predict `disease`!

**Based on the graphs, which features do you think would be most helpful for predicting whether someone has heart disease?**

First, let's try making a tiny decision tree ourselves. Then we'll use machine learning!

Let's use just **one** feature for now: `heart_rate`. 

If you had to guess whether someone had heart disease based on their heart rate, what **cutoff value** would you use? (Check out your graphs!)

In [None]:
def predict_disease(heart_rate):
  cutoff = __ #YOUR CODE HERE: choose a number!
  if heart_rate < cutoff: 
    return 1 #predict heart disease
  else:
    return 0 #predict no heart disease

In [None]:
#@title Solution
def predict_disease(heart_rate):
  cutoff = 141 #YOUR CODE HERE: choose a number!
  if heart_rate < cutoff: 
    return 1 #predict heart disease
  else:
    return 0 #predict no heart disease

Let's check out our predictions!

**Discuss:** Can you explain how each prediction was made? How often are your predictions correct?

In [None]:
predictions = patient_data['heart_rate'].apply(predict_disease)
show_predictions(predictions)

Experiment with **changing your cutoff** to see what accuracy you can achieve!

**Optional:** Also experiment with using a different feature. What if you use `cholesterol` or `sex` instead of `heart_rate`? Is a different feature more useful?
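
Instead of trying cutoffs one at a time, you could let a loop try many of them. The sketch below is one possible way to do that, not the only one. `toy_data` is a tiny made-up table (not real patients) so the cell runs on its own, and `accuracy_for_cutoff` is a hypothetical helper name. In this notebook, you would pass `patient_data` instead.

```python
import pandas as pd

# Made-up toy table (NOT real patient data) so this sketch runs on its own.
toy_data = pd.DataFrame({
    "heart_rate": [110, 125, 130, 145, 150, 160, 170, 135],
    "disease":    [1,   1,   1,   0,   0,   0,   0,   0],
})

def accuracy_for_cutoff(data, cutoff):
    """Fraction of rows where the rule 'heart_rate < cutoff' matches the label."""
    predictions = (data["heart_rate"] < cutoff).astype(int)
    return (predictions == data["disease"]).mean()

# Try every cutoff from 100 to 200 in steps of 5 and keep the most accurate one.
cutoffs = range(100, 201, 5)
best_cutoff = max(cutoffs, key=lambda c: accuracy_for_cutoff(toy_data, c))

print("Best cutoff:", best_cutoff)
print("Accuracy:", accuracy_for_cutoff(toy_data, best_cutoff))
```

This is the same rule as `predict_disease`, just written so the cutoff can change on each pass through the loop.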

# Making Predictions with Machine Learning

It takes a while to guess a rule by hand - and that's just with one feature!

Instead, let's use **machine learning** to make predictions automatically. Here are the steps:

### Step 1: Prepare our Data

We need to select out `input_data` and `output_data`. Please enter the name of each column! (We're still using one input feature for now.)

In [None]:
input_data = patient_data[[]] #FILL ME IN 
output_data = patient_data[[]] #FILL ME IN

display(input_data)
display(output_data)

In [None]:
#@title Solution
input_data = patient_data[['heart_rate']] #FILL ME IN 
output_data = patient_data[['disease']] #FILL ME IN
display (input_data)
display (output_data)

### Step 2: Set up our Model

Now, we need to set up the model (machine learning tool) we'll use. In this case, that's a decision tree!

In [None]:
tree_model = DecisionTreeClassifier(max_depth=1)

### Step 3: Train our Model

Now, we'll need to feed our `input_data` and `output_data` into the model and `train` it! 

Please use `tree_model.fit()` and fill in the `input_data` and `output_data` so our model can learn.

In [None]:
#YOUR CODE HERE (1 line)

In [None]:
#@title Solution
tree_model.fit(input_data, output_data)

### Step 4: Make Predictions

Now, let's see how good our model's predictions are! 

We'll use `tree_model.predict` and fill in the `input_data`. **Discuss:** why don't we need to plug in the output?

In [None]:
predictions = tree_model.predict(input_data)

show_predictions(predictions)

### Step 5: Visualize our Tree

Finally, let's visualize our decision tree to see **how** it makes decisions.

Try **explaining this tree**! Discuss:
*   What "cutoff" value did the computer choose? Can you get the same results on your own?
*   What does "samples" mean? How does the tree split up the data into "classes"?
*   What does "value" mean? (Hint: this shows us how many people are sick or healthy in each class.)
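
If the picture is hard to read, a text version can help with these questions. The sketch below trains a tiny tree on a made-up table (`toy_data`, not real patients) and prints its cutoff, its "samples" counts, and a text diagram using scikit-learn's `export_text`. The names `toy_tree` and `root_cutoff` are chosen just for this example. In this notebook, you could try the same calls on `tree_model`.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up toy table (NOT real patient data) so this sketch runs on its own.
toy_data = pd.DataFrame({
    "heart_rate": [110, 125, 130, 145, 150, 160, 170, 135],
    "disease":    [1,   1,   1,   0,   0,   0,   0,   0],
})
toy_input = toy_data[["heart_rate"]]
toy_output = toy_data["disease"]

toy_tree = DecisionTreeClassifier(max_depth=1, random_state=0)
toy_tree.fit(toy_input, toy_output)

# The cutoff the tree learned at its first split (the "root" of the tree).
root_cutoff = toy_tree.tree_.threshold[0]
print("Cutoff:", root_cutoff)

# "samples": how many rows reached each box (the root first, then its two branches).
print("Samples per box:", list(toy_tree.tree_.n_node_samples))

# A text diagram of the same tree.
print(export_text(toy_tree, feature_names=["heart_rate"]))
```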


In [None]:
visualize_tree(tree_model, input_data)

Congratulations - you've trained and evaluated your first machine learning model for diagnosing heart disease! 

**Would you trust a model like this** to make diagnoses? Could doctors use it instead of X-rays?


Now, let's make it better.

**Make sure to save this notebook for next time!**

# Improving our Model
 
Let's try a few approaches!

## Using a Different Feature

So far, we've been predicting disease based just on `heart_rate`. But that might not be the best way of predicting!

Below, please **copy over and run the code for steps 1-5 again, with one difference: use a different feature than heart_rate.**

In [None]:
#FILL IN YOUR CODE BELOW!

#STEP 1: Prepare your data
#Use a different input feature!

#STEP 2: Prepare your model

#STEP 3: Train your model

#STEP 4: Make predictions

#STEP 5: Visualize your tree

In [None]:
#@title Solution


#STEP 1: Prepare your data
input_data = patient_data[['sex']] #FILL ME IN 
output_data = patient_data[['disease']] #FILL ME IN

#STEP 2: Prepare your model
#Keep max_depth=1 as before; only the feature changes
tree_model = DecisionTreeClassifier(max_depth=1)

#STEP 3: Train your model
tree_model.fit(input_data, output_data)

#STEP 4: Make predictions
predictions = tree_model.predict(input_data)
show_predictions(predictions)

#STEP 5: Visualize your tree
visualize_tree(tree_model, input_data)

Experiment with a few features! Which single feature seems most useful? Does that surprise you?

Can you interpret the tree for each feature?
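
To compare features quickly, you could wrap steps 1-4 in a function and loop over the feature names. This is only a sketch under a few assumptions: `toy_data` is a made-up table (not real patients), and `single_feature_accuracy` is a hypothetical helper name. In this notebook, you would pass `patient_data` and loop over its real columns.

```python
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

# Made-up toy table (NOT real patient data) so this sketch runs on its own.
toy_data = pd.DataFrame({
    "heart_rate": [110, 125, 130, 145, 150, 160, 170, 135],
    "sex":        [1,   0,   1,   0,   1,   0,   1,   1],
    "disease":    [1,   1,   1,   0,   0,   0,   0,   0],
})

def single_feature_accuracy(data, feature):
    """Train a depth-1 tree on one feature and return its accuracy on the same data."""
    model = DecisionTreeClassifier(max_depth=1, random_state=0)
    model.fit(data[[feature]], data["disease"])
    predictions = model.predict(data[[feature]])
    return accuracy_score(data["disease"], predictions)

for feature in ["heart_rate", "sex"]:
    print(feature, "accuracy:", single_feature_accuracy(toy_data, feature))
```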

## Changing Max_Depth

Before, we used a DecisionTreeClassifier with `max_depth = 1`.

**Can you guess what max_depth means? Let's experiment with changing it!**

Below, please **copy over and run the code for steps 1-5 again, with one difference: try increasing `max_depth` a little bit.**

You can use any feature!

In [None]:
#FILL IN YOUR CODE BELOW!

#STEP 1: Prepare your data

#STEP 2: Prepare your model
#Use a bigger max_depth this time!

#STEP 3: Train your model

#STEP 4: Make predictions

#STEP 5: Visualize your tree

In [None]:
#@title Solution


#STEP 1: Prepare your data
input_data = patient_data[['heart_rate']] #FILL ME IN 
output_data = patient_data[['disease']] #FILL ME IN


#STEP 2: Prepare your model
#Use a bigger max_depth this time!
tree_model = DecisionTreeClassifier(max_depth=4)

#STEP 3: Train your model
tree_model.fit(input_data, output_data)

#STEP 4: Make predictions
predictions = tree_model.predict(input_data)
show_predictions(predictions)

#STEP 5: Visualize your tree
visualize_tree(tree_model, input_data)

Based on the diagram, **what does max_depth represent?** How do we interpret this new diagram?

Does changing it improve your model?
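
One way to explore this is to train the same tree with several `max_depth` values and compare the results. The sketch below does that on a made-up table (`toy_data`, not real patients), and `accuracy_by_depth` is just a name chosen for this example. Keep in mind that the accuracy here is measured on the same rows the tree learned from.

```python
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

# Made-up toy table (NOT real patient data) so this sketch runs on its own.
toy_data = pd.DataFrame({
    "heart_rate": [110, 120, 125, 130, 135, 145, 150, 155, 160, 170],
    "disease":    [1,   0,   1,   1,   0,   0,   0,   1,   0,   0],
})
toy_input = toy_data[["heart_rate"]]
toy_output = toy_data["disease"]

# Train one tree per max_depth and record how accurate each one is.
accuracy_by_depth = {}
for depth in range(1, 6):
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(toy_input, toy_output)
    predictions = model.predict(toy_input)
    accuracy_by_depth[depth] = accuracy_score(toy_output, predictions)
    print("max_depth =", depth, "accuracy =", accuracy_by_depth[depth])
```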

##Optional: Using Multiple Features

We've tried using different features! But we're still using only one feature at a time, which means we're ignoring most of our data. 

Let's let the computer **learn from multiple features**. To do this, just use multiple features for the input_data, for example:

`input_data = patient_data[['heart_rate','cholesterol','sex']]`

You'll probably want to use a bigger `max_depth`!

In [None]:
#YOUR CODE HERE to set up, train, and test your model with multiple features!

In [None]:
#@title Solution


#STEP 1: Prepare your data
input_data = patient_data[['age', 'blood_pressure', 'cholesterol', 'heart_rate', 'sex',
       'high_blood_sugar', 'chest_pain', 'exercise_pain']] #FILL ME IN 
output_data = patient_data[['disease']] #FILL ME IN


#STEP 2: Prepare your model
#Use a bigger max_depth this time!
tree_model = DecisionTreeClassifier(max_depth=3)

#STEP 3: Train your model
tree_model.fit(input_data, output_data)

#STEP 4: Make predictions
predictions = tree_model.predict(input_data)
show_predictions(predictions)

#STEP 5: Visualize your tree
visualize_tree(tree_model, input_data)

**How is your model** making decisions now?
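
Besides reading the diagram, you could ask the trained tree which features it leaned on most. scikit-learn stores this in the model's `feature_importances_` attribute after training. The sketch below uses a made-up table (`toy_data`, not real patients), and `importances` is just a name chosen for this example. In this notebook, you could try the same idea with `tree_model` and `input_data`.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Made-up toy table (NOT real patient data) so this sketch runs on its own.
toy_data = pd.DataFrame({
    "heart_rate":  [110, 125, 130, 145, 150, 160, 170, 135],
    "cholesterol": [240, 200, 260, 210, 250, 190, 230, 220],
    "sex":         [1,   0,   1,   0,   1,   0,   1,   1],
    "disease":     [1,   1,   1,   0,   0,   0,   0,   0],
})
toy_input = toy_data[["heart_rate", "cholesterol", "sex"]]
toy_output = toy_data["disease"]

toy_tree = DecisionTreeClassifier(max_depth=3, random_state=0)
toy_tree.fit(toy_input, toy_output)

# One score per feature; a bigger score means the tree relied on it more.
importances = pd.Series(toy_tree.feature_importances_, index=toy_input.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

A feature with a score of 0 was never used in any split of the tree.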

Keep playing around with your model to improve it! You can explore some other options [here](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html).

And consider - can you get it good enough that you'd **feel comfortable recommending doctors use it**?

One key note: we didn't separate our data in this intro notebook, but to see how your model works in the real world, it's important to use separate [training data and testing data](https://developers.google.com/machine-learning/crash-course/training-and-test-sets/video-lecture). That way, you'll be able to detect [overfitting](https://machinelearningmastery.com/overfitting-machine-learning-models/)!
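
Here is a minimal sketch of what that separation could look like, using scikit-learn's `train_test_split`. The table is made up (`toy_data`, not real patients), and the choices of `test_size=0.25` and `max_depth=2` are assumptions for illustration. In this notebook, you would split `input_data` and `output_data` instead.

```python
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Made-up toy table (NOT real patient data) so this sketch runs on its own.
toy_data = pd.DataFrame({
    "heart_rate": [110, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170],
    "disease":    [1,   1,   0,   1,   1,   0,   1,   0,   0,   0,   1,   0],
})

# Hold back a quarter of the rows; the model never sees them during training.
train_input, test_input, train_output, test_output = train_test_split(
    toy_data[["heart_rate"]], toy_data["disease"], test_size=0.25, random_state=0
)

model = DecisionTreeClassifier(max_depth=2, random_state=0)
model.fit(train_input, train_output)

# Compare accuracy on rows the model learned from vs. rows it has never seen.
train_accuracy = accuracy_score(train_output, model.predict(train_input))
test_accuracy = accuracy_score(test_output, model.predict(test_input))
print("Training accuracy:", train_accuracy)
print("Testing accuracy:", test_accuracy)
```

If the training accuracy is much higher than the testing accuracy, that gap is a sign of overfitting.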