**Heart Disease**

#**Importing Necessary Python Modules**

Python can be extended with a variety of open-source add-ins called **modules** that add extra features to the basic setup. In the cell below, the name of each module follows the `import` statement, and its purpose is given in a comment after the hash symbol (#).



In [5]:
import pandas as pd                 #Data analysis
import numpy as np                  #Calculations
import plotly.express as px         #Graphing
import matplotlib.pyplot as plt     #Graphing
from IPython.display import Image   #Display images
import warnings                     #Ignore version warnings
warnings.simplefilter('ignore', FutureWarning)


#**Context**

Coronary artery disease (CAD), also called coronary heart disease (CHD), involves the reduction of blood flow to the heart muscle due to the build-up of plaque in the arteries of the heart. It is the most common form of cardiovascular disease. Currently, invasive coronary angiography is the gold standard for establishing the presence, location, and severity of CAD. However, this diagnostic method is costly and is associated with morbidity and mortality in CAD patients. Therefore, it would be beneficial to develop a non-invasive alternative to replace the current gold standard.

Other, less invasive diagnostic methods have been proposed in the scientific literature, including exercise electrocardiogram, thallium scintigraphy, and fluoroscopy of coronary calcification. However, the diagnostic accuracy of these tests ranges only from 35% to 75%. Therefore, it would be beneficial to develop a computer-aided diagnostic tool that combines the results of these non-invasive tests with other patient attributes to boost their diagnostic power, with the aim of ultimately replacing the current invasive gold standard.

A total of 303 consecutive patients referred for coronary angiography at the Cleveland Clinic between May 1981 and September 1984 participated in the experiment. No patient had a history or electrocardiographic evidence of prior myocardial infarction or known valvular or cardiomyopathic diseases.


In [39]:
# Replace 'image_url' with the URL of the image you want to display
image_url = 'https://my.clevelandclinic.org/-/scassets/images/org/health/articles/24129-heart-disease-illustration'

# Display the image
Image(url=image_url)

#**About the Dataset**

The dataset comprises 303 observations, 13 features, and 1 target attribute. The 13 features include the results of the aforementioned non-invasive diagnostic tests along with other relevant patient information. The target variable includes the result of the invasive coronary angiogram which represents the presence or absence of coronary artery disease in the patient. The 14 variables (13 features and 1 target attribute) are described below.

| **Item**    | **Description**                                                                                   |
|-------------|---------------------------------------------------------------------------------------------------|
| AGE         | Displays the age of the individual.                                                               |
| SEX         | Displays the gender of the individual using the following format: 1 = male, 0 = female.          |
| CP          | Displays the type of chest pain experienced by the individual:                                    |
|             | - 0 = typical angina                                                                             |
|             | - 1 = atypical angina                                                                            |
|             | - 2 = non-anginal pain                                                                           |
|             | - 3 = asymptomatic                                                                               |
| TRESTBPS    | Displays the resting blood pressure value of an individual in mmHg (unit).                        |
| CHOL        | Displays the serum cholesterol in mg/dl (unit).                                                  |
| FBS         | Compares the fasting blood sugar value of an individual with 120mg/dl:                             |
|             | - 1: fasting blood sugar >120mg/dl                                                               |
|             | - 0: fasting blood sugar ≤ 120mg/dl                                                              |
| RESTECG     | Displays resting electrocardiographic results:                                                   |
|             | - 0 = normal                                                                                     |
|             | - 1 = having ST-T wave abnormality                                                               |
|             | - 2 = left ventricular hypertrophy                                                               |
| THALACH     | Displays the max heart rate achieved by an individual.                                           |
| EXANG       | Exercise-induced angina:                                                                         |
|             | - 1 = yes                                                                                        |
|             | - 0 = no                                                                                         |
| OLDPEAK     | ST depression induced by exercise relative to rest. Displays the value (integer or float).       |
| SLOPE       | Peak exercise ST segment:                                                                        |
|             | - 1 = upsloping                                                                                  |
|             | - 2 = flat                                                                                       |
|             | - 3 = downsloping                                                                                |
| CA          | Number of major vessels (0-3) colored by fluoroscopy. Displays the value (integer or float).      |
| THAL        | Displays the thallium scintigraphy (stress test) result:                                         |
|             | - 1 = normal                                                                                     |
|             | - 2 = fixed defect                                                                              |
|             | - 3 = reversible defect                                                                         |
| TARGET      | Displays whether the individual is suffering from heart disease or not:                           |
|             | - 0 = absence                                                                                   |
|             | - 1 = present                                                                                   |



Let's take a look at the data. To do this, we first import it directly from the URL below.





**Data**

In [2]:
url="https://archive.ics.uci.edu/static/public/45/data.csv"
df=pd.read_csv(url)

Next, we can display the data by **typing the name** of the DataFrame. To ensure we can see all columns, we'll use the `pd.set_option` function.

In [3]:
# Set display options to show all columns
pd.set_option('display.max_columns', None)
df

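|     | age | sex | cp  | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca  | thal | num |
|-----|-----|-----|-----|----------|------|-----|---------|---------|-------|---------|-------|-----|------|-----|
| 0   | 63  | 1   | 1   | 145      | 233  | 1   | 2       | 150     | 0     | 2.3     | 3     | 0.0 | 6.0  | 0   |
| 1   | 67  | 1   | 4   | 160      | 286  | 0   | 2       | 108     | 1     | 1.5     | 2     | 3.0 | 3.0  | 2   |
| 2   | 67  | 1   | 4   | 120      | 229  | 0   | 2       | 129     | 1     | 2.6     | 2     | 2.0 | 7.0  | 1   |
| 3   | 37  | 1   | 3   | 130      | 250  | 0   | 0       | 187     | 0     | 3.5     | 3     | 0.0 | 3.0  | 0   |
| 4   | 41  | 0   | 2   | 130      | 204  | 0   | 2       | 172     | 0     | 1.4     | 1     | 0.0 | 3.0  | 0   |
| ... | ... | ... | ... | ...      | ...  | ... | ...     | ...     | ...   | ...     | ...   | ... | ...  | ... |
| 298 | 45  | 1   | 1   | 110      | 264  | 0   | 0       | 132     | 0     | 1.2     | 2     | 0.0 | 7.0  | 1   |
| 299 | 68  | 1   | 4   | 144      | 193  | 1   | 0       | 141     | 0     | 3.4     | 2     | 2.0 | 7.0  | 2   |
| 300 | 57  | 1   | 4   | 130      | 131  | 0   | 0       | 115     | 1     | 1.2     | 2     | 1.0 | 7.0  | 3   |
| 301 | 57  | 0   | 2   | 130      | 236  | 0   | 2       | 174     | 0     | 0.0     | 2     | 1.0 | 3.0  | 1   |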
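Before analyzing a DataFrame, it can help to confirm its size and check for missing values. In the display above, `ca` and `thal` appear as decimals; in pandas this often happens when an integer-like column contains missing entries, though the output shown does not confirm that here. The sketch below uses a small made-up frame named `sample` (an assumption for illustration, not the real data) to show the calls involved.

```python
import numpy as np
import pandas as pd

# Made-up rows that mimic a few columns of the data; not real patient records.
sample = pd.DataFrame({
    "age": [63, 67, 37, 41, 56],
    "cp": [1, 4, 3, 2, 2],
    "ca": [0.0, 3.0, np.nan, 0.0, 1.0],
})

n_rows, n_cols = sample.shape              # (number of rows, number of columns)
missing_per_column = sample.isna().sum()   # count of missing values in each column

print(n_rows, n_cols)
print(sample.dtypes)                       # data type pandas assigned to each column
print(missing_per_column)
```

The same three calls (`.shape`, `.dtypes`, `.isna().sum()`) can be run on `df` after it is loaded.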


#**ASSIGNMENT 1 - Descriptive Statistics: Graphical and Numerical Summary**

**INSTRUCTIONS**

Use Python to analyze the data set and complete each of the following. For problems that require a written response, type the answer below.

##**QUESTION 1**
(8 POINTS)

Determine whether the four variables below are qualitative or quantitative. If they are quantitative, specify whether they are continuous or discrete.

| **Variable**              | **Type**          | **Description**                                      |
|---------------------------|-------------------|------------------------------------------------------|
| **Age**                   | **Quan** or Qual      | **Continuous** or Discrete or N/A                  |
| **Sex**                   | Quan or **Qual**       | Continuous or Discrete or **N/A**        |
| **Chest Pain Type (CP)**  | Quan or **Qual**       | Continuous or Discrete or **N/A**       |
| **Resting BP (TRESTBPS)** | **Quan** or Qual      | Continuous or **Discrete**  |
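
One rough way to support a classification like the one above is to look at how many distinct values each column takes. This is only a hint: a column stored as numbers (such as a 0/1 code) can still be qualitative. The frame `sample` below is made up for illustration.

```python
import pandas as pd

# Made-up values for illustration only.
sample = pd.DataFrame({
    "age": [63, 67, 67, 37, 41, 56],
    "sex": [1, 1, 1, 1, 0, 1],
    "cp": [1, 4, 4, 3, 2, 2],
    "trestbps": [145, 160, 120, 130, 130, 120],
})

# One row per column: its stored data type and its number of distinct values.
summary = pd.DataFrame({
    "dtype": sample.dtypes.astype(str),
    "distinct_values": sample.nunique(),
})
print(summary)
```

Columns with only a handful of distinct codes (here `sex` and `cp`) are usually category labels, while columns with many distinct measured values are usually quantitative.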

##**QUESTION 2**

Construct a frequency table, relative frequency table, and relative frequency bar chart to describe the distribution of chest pain type. State any fact that jumps out to you.

**2a**  (2 POINTS) Construct a frequency table for chest pain type.

In [6]:
#Frequency table
variable = df['cp']  #variable = df[...]
freq_table = variable.value_counts()   #Series method; the top-level pd.value_counts is deprecated in recent pandas
freq_table

| cp | count |
|----|-------|
| 4  | 144   |
| 3  | 86    |
| 2  | 50    |
| 1  | 23    |


**2b** (2 POINTS) Construct a relative frequency table for chest pain type.

In [7]:
#Relative frequency table
relative_freq_table=freq_table/len(df)   #relative_freq_table=freq_table/...
relative_freq_table

| cp | count    |
|----|----------|
| 4  | 0.475248 |
| 3  | 0.283828 |
| 2  | 0.165017 |
| 1  | 0.075908 |
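
As an alternative to dividing the counts by the number of rows, pandas can compute relative frequencies directly with `value_counts(normalize=True)`. The sketch below compares the two approaches on a short made-up Series (the values are illustrative, not taken from the data set).

```python
import pandas as pd

# Made-up chest pain codes for illustration.
cp = pd.Series([4, 4, 3, 2, 4, 3, 1, 4], name="cp")

freq = cp.value_counts()                            # counts per category
rel_freq_manual = freq / len(cp)                    # divide by the number of rows
rel_freq_builtin = cp.value_counts(normalize=True)  # proportions computed by pandas

print(rel_freq_manual)
print(rel_freq_builtin)
```

The two results agree when the column has no missing values. If it does, `value_counts` drops them by default, so `normalize=True` divides by the number of non-missing entries, while dividing by `len(...)` still counts every row.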


**2c** (2 POINTS) Construct a relative frequency bar chart to describe the distribution of chest pain type. State any fact that jumps out to you.

In [13]:
dff = pd.DataFrame(relative_freq_table)

fig = px.bar(x=dff.index,y=dff['count'],
             title='Relative Frequency Distribution Bar Chart', #title=...,
             labels={'x':'Chest Pain Type', 'y':'Relative Frequency'})     #labels={'x':..., 'y':...})
fig.show()

**Notable Fact:**

## **QUESTIONS 3-6**

For questions 3-6: Find your variable based on your last name and use that variable when answering questions #3 to #6.  

| **Last Name Initial**  | **Variable**                                         |
|------------------------|------------------------------------------------------|
| A-F                    | Resting blood pressure                    |
| G-M                    | Serum cholesterol                              |
| N-S                    | Maximum heart rate achieved                |
| T-Z                    | ST depression induced by exercise relative to rest  |



###**QUESTION 3**
(6 POINTS)

Construct a histogram for your variable. Use number of bins = 15.

In [14]:
fig = px.histogram(x=df['trestbps'],nbins = 15,                           #fig = px.histogram(x=df[...],nbins = ...,
             title='Histogram of Resting Blood Pressure',                 #title=...,
             labels={'x':'Resting Blood Pressure', 'y':'Frequency'})      #labels={'x':...,'y':...})
fig.show()

In [15]:
fig = px.histogram(x=df['chol'],nbins = 15,                           #fig = px.histogram(x=df[...],nbins = ...,
             title='Histogram of Serum Cholesterol',                  #title=...,
             labels={'x':'Serum Cholesterol', 'y':'Frequency'})       #labels={'x':...,'y':...})
fig.show()

In [16]:
fig = px.histogram(x=df['thalach'],nbins = 15,                                 #fig = px.histogram(x=df[...],nbins = ...,
             title='Histogram of Maximum Heart Rate Achieved',                 #title=...,
             labels={'x':'Maximum Heart Rate Achieved', 'y':'Frequency'})      #labels={'x':...,'y':...})
fig.show()

In [None]:
fig = px.histogram(x=df['oldpeak'],nbins = 15,                                                        #fig = px.histogram(x=df[...],nbins = ...,
             title='Histogram of ST Depression Induced by Exercise Relative to Rest',                 #title=...,
             labels={'x':'ST Depression Induced by Exercise Relative to Rest', 'y':'Frequency'})      #labels={'x':...,'y':...})
fig.show()
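
To see the numbers behind a 15-bin histogram, the bin edges and counts can be computed with NumPy. This is a minimal sketch on simulated values (the mean, spread, and seed are arbitrary assumptions), and it uses exactly 15 equal-width bins; a plotting library may choose slightly different bin boundaries for display.

```python
import numpy as np

# Simulated measurements for illustration; not the real data.
rng = np.random.default_rng(0)
values = rng.normal(loc=130, scale=18, size=303)

# counts[i] is the number of values falling between edges[i] and edges[i + 1].
counts, edges = np.histogram(values, bins=15)

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:7.1f} to {right:7.1f}: {count}")
```

With real data, `values` could be replaced by a column such as `df['trestbps']`.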

###**QUESTION 4**
(6 POINTS)

Construct a boxplot for your variable.  

In [35]:
px.box(x=df['trestbps'],                                 #px.box(x=df[...],
       title='Boxplot of Resting Blood Pressure',        #title=...,
       labels={'x':'Resting Blood Pressure'})            #labels={'x':...})

In [34]:
px.box(x=df['chol'],                                         #px.box(x=df[...],
       title='Boxplot of Serum Cholesterol',                 #title=...,
       labels={'x':'Serum Cholesterol'})                     #labels={'x':...})

In [33]:
px.box(x=df['thalach'],                                                #px.box(x=df[...],
       title='Boxplot of Maximum Heart Rate Achieved',                 #title=...,
       labels={'x':'Maximum Heart Rate Achieved'})                     #labels={'x':...})

In [32]:
px.box(x=df['oldpeak'],                                                                       #px.box(x=df[...],
       title='Boxplot of ST Depression Induced by Exercise Relative to Rest',                 #title=...,
       labels={'x':'ST Depression Induced by Exercise Relative to Rest'})                     #labels={'x':...})

###**QUESTION 5**
(4 POINTS)

Calculate the following summary statistics for your variable: minimum, maximum, mean, median, standard deviation, Q1, and Q3. Round to three decimal places.

In [None]:
descriptive_stats = df[['cp','trestbps','chol','thalach','oldpeak']].describe().round(3)        #descriptive_stats = df[[...]].describe().round(3)

descriptive_stats                                                                               #...


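|       | cp    | trestbps | chol    | thalach | oldpeak |
|-------|-------|----------|---------|---------|---------|
| count | 303.0 | 303.0    | 303.0   | 303.0   | 303.0   |
| mean  | 0.967 | 131.624  | 246.264 | 149.647 | 1.04    |
| std   | 1.032 | 17.538   | 51.831  | 22.905  | 1.161   |
| min   | 0.0   | 94.0     | 126.0   | 71.0    | 0.0     |
| 25%   | 0.0   | 120.0    | 211.0   | 133.5   | 0.0     |
| 50%   | 1.0   | 130.0    | 240.0   | 153.0   | 0.8     |
| 75%   | 2.0   | 140.0    | 274.5   | 166.0   | 1.6     |
| max   | 3.0   | 200.0    | 564.0   | 202.0   | 6.2     |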


###**QUESTION 6**
(8 POINTS)

Use information from questions #3, #4, and #5 to describe your variable in terms of shape, center, spread, and outliers. Interpret your findings.

The distribution of resting blood pressure is skewed right. The median is 130 mmHg and the IQR is 20 mmHg. There are 7 outliers.

The distribution of serum cholesterol is skewed right. The median is 240 mg/dl and the IQR is 63.5 mg/dl. There are 5 outliers.

The distribution of the maximum heart rate achieved is skewed left. The median is 153 beats per minute and the IQR is 32.5 beats per minute. There is 1 outlier.

The distribution of ST depression induced by exercise relative to rest is skewed right. The median is 0.8 and the IQR is 1.6. There are 4 outliers.
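
The IQR and outlier counts in descriptions like these can be checked numerically. The sketch below applies the common 1.5 × IQR rule to a short made-up Series; the values are illustrative, and a boxplot may compute quartiles with a slightly different method, so its outlier count can differ from this calculation.

```python
import pandas as pd

# Made-up measurements for illustration; not the real data.
values = pd.Series([94, 110, 120, 120, 125, 130, 130, 135, 140, 140, 150, 200])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1                     # spread of the middle half of the data

lower_fence = q1 - 1.5 * iqr      # values below this are flagged as outliers
upper_fence = q3 + 1.5 * iqr      # values above this are flagged as outliers
outliers = values[(values < lower_fence) | (values > upper_fence)]

# A mean above the median is one informal hint of right skew.
mean_minus_median = values.mean() - values.median()

print(iqr, lower_fence, upper_fence)
print(outliers.tolist())
print(mean_minus_median)
```

Replacing `values` with a column such as `df['chol']` gives the corresponding quantities for the real data.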

##**QUESTION 7**
(8 POINTS)

Calculate and state the mean age, mean resting blood pressure, and mean cholesterol for those with heart disease, and for those without heart disease. Round to two decimal places. Compare the results and answer a question about the code.

In [36]:
heart_disease_means = df[df['exang'] == 1][['age', 'trestbps', 'chol']].mean().round(2)
no_heart_disease_means = df[df['exang'] == 0][['age', 'trestbps', 'chol']].mean().round(2)

sec_stat = pd.DataFrame({'age': [no_heart_disease_means['age'], heart_disease_means['age']],
                         'trestbps': [no_heart_disease_means['trestbps'], heart_disease_means['trestbps']],
                         'chol': [no_heart_disease_means['chol'], heart_disease_means['chol']]},
                        index=['0', '1'])

print(sec_stat)

     age  trestbps    chol
0  53.86    130.90  244.49
1  55.63    133.32  251.24


**Compare Results**: Patients with heart disease tended to be older, with higher blood pressure and cholesterol levels.
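
Group means can also be computed in one step with `groupby`. The data displayed earlier stores the angiography result in a column named `num` with values from 0 to 4; the sketch below assumes that any value above 0 means disease is present. That threshold, the helper column `has_disease`, and the rows themselves are assumptions made for illustration.

```python
import pandas as pd

# Made-up rows for illustration; not real patient records.
sample = pd.DataFrame({
    "age": [63, 67, 67, 37, 41, 56],
    "trestbps": [145, 160, 120, 130, 130, 120],
    "chol": [233, 286, 229, 250, 204, 236],
    "num": [0, 2, 1, 0, 0, 0],
})

# Assumption: num > 0 is treated as "disease present" (1), otherwise absent (0).
sample["has_disease"] = (sample["num"] > 0).astype(int)

# One row of means per group, rounded to two decimal places.
group_means = (
    sample.groupby("has_disease")[["age", "trestbps", "chol"]]
    .mean()
    .round(2)
)
print(group_means)
```

The result has one row per group, so the two sets of means can be compared side by side without building a second DataFrame by hand.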

| **Last Name Initial** | **Code**                                      |
|------------------------|------------------------------------------------------|
| A-F                    | `df['exang'] == 1`                    |
| G-M                    | `.mean().round(2)`                              |
| N-S                    | `df[df['exang'] == 0]`               |
| T-Z                    | `df[df['exang'] == 1][['age', 'trestbps', 'chol']]`  |

**Interpret Code:** State what you think the purpose of the code snippet is, based on your last name.
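
A chained expression such as the ones in the table can be easier to read when each step is stored in its own variable. The sketch below does this on a small made-up frame; the intermediate names `mask`, `rows`, `subset`, and `means` are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Made-up rows for illustration; not real patient records.
sample = pd.DataFrame({
    "age": [63, 67, 37, 41],
    "chol": [233, 286, 250, 204],
    "exang": [0, 1, 0, 1],
})

mask = sample["exang"] == 1        # True/False for each row
rows = sample[mask]                # keeps only the rows where mask is True
subset = rows[["age", "chol"]]     # keeps only the listed columns
means = subset.mean().round(2)     # column means, rounded to two decimals

print(mask.tolist())
print(rows)
print(means)
```

Printing each intermediate result shows what every piece of the one-line version contributes.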

##**QUESTION 8**

Generate a paragraph of at least 100 words to address one of the following questions:


### **QUESTION 8a**
(2 POINTS)

Discuss how analyzing your chosen data set using statistical methods could help you become better prepared for future courses in your major.

### **QUESTION 8b**
(2 POINTS)

Discuss how analyzing your chosen data set using statistical methods could be instrumental in becoming better prepared for your future career.