# **Heart Disease - Project 1**
### Analyzing qualitative and quantitative variables.

# **Importing Necessary Python Modules**

Python supports a variety of open-source add-ins called **modules** that extend its basic functionality. Each module's name follows the `import` statement, and its purpose is given in a comment after the hash symbol (#).



In [5]:
import pandas as pd                 #Data analysis
import numpy as np                  #Calculations
import plotly.express as px         #Graphing
import matplotlib.pyplot as plt     #Graphing
from IPython.display import Image   #Display images
import warnings                     #Ignore version warnings
warnings.simplefilter('ignore', FutureWarning)


In [6]:
# Replace 'image_url' with the URL of the image you want to display
image_url = 'https://my.clevelandclinic.org/-/scassets/images/org/health/articles/24129-heart-disease-illustration'

# Display the image
Image(url=image_url)

# **Context**

Coronary heart disease (CHD), also referred to as coronary artery disease (CAD), involves the reduction of blood flow to the heart muscle due to the build-up of plaque in the arteries of the heart. It is the most common form of cardiovascular disease. Currently, invasive coronary angiography represents the gold standard for establishing the presence, location, and severity of CHD. However, this diagnostic method is costly and associated with morbidity (count of people with the disease) and mortality (count of deaths) in CHD patients. Therefore, it would be beneficial to develop a non-invasive alternative to replace the current gold standard.

Other less invasive diagnostic methods have been proposed in the scientific literature, including exercise electrocardiogram, thallium scintigraphy, and fluoroscopy of coronary calcification. However, the diagnostic accuracy of these tests ranges only between 35% and 75%. Therefore, it would be beneficial to develop a computer-aided diagnostic tool that combines the results of these non-invasive tests with other patient attributes to boost their diagnostic power, with the aim of ultimately replacing the current invasive gold standard.

Three hundred three (303) consecutive patients referred for coronary angiography at the Cleveland Clinic between May 1981 and September 1984 participated in the experiment. No patient had a history of or electrocardiographic evidence of prior myocardial infarction or known valvular or cardiomyopathic diseases.


# **About the Dataset**

The dataset comprises 303 observations, 13 features, and 1 target attribute. A feature is a variable that is believed to contribute to CHD, and is also referred to as a predictive variable. A target variable is the variable you want to predict (CHD, in this situation). The 13 features include the results of the aforementioned non-invasive diagnostic tests along with other relevant patient information. The target variable includes the result of the invasive coronary angiogram which represents the presence or absence of coronary heart disease in the patient. The 14 variables (13 features and 1 target attribute) are described below.

| **Variable**| **Description**                                          |
|:------------|:---------------------------------------------------------|
| AGE         | The age of the individual.                               |
| SEX         | Gender of the individual: 0 = female, 1 = male.          |
| CP          | The type of chest pain experienced by the individual:    |
|             | * 0 = typical angina                                     |
|             | * 1 = atypical angina                                    |
|             | * 2 = non-anginal pain                                   |
|             | * 3 = asymptomatic                                       |
| TRESTBPS    | Resting blood pressure of an individual (mmHg)           |
| CHOL        | Serum cholesterol in mg/dL                               |
| FBS         | Compares the fasting blood sugar value with 120mg/dL:    |
|             | * 0: fasting blood sugar ≤ 120mg/dL                      |
|             | * 1: fasting blood sugar >120mg/dL                       |
| RESTECG     | Resting electrocardiographic results:                    |
|             | * 0 = normal                                             |
|             | * 1 = having ST-T wave abnormality                       |
|             | * 2 = left ventricular hypertrophy                       |
| THALACH     | Max heart rate achieved, in beats per minute (bpm)       |
| EXANG       | Exercise-induced angina:                                 |
|             | * 0 = No                                                 |
|             | * 1 = Yes                                                |
| OLDPEAK     | ST depression (mm) induced by exercise relative to rest. |
| SLOPE       | Peak exercise ST segment:                                |
|             | * 1 = upsloping                                          |
|             | * 2 = flat                                               |
|             | * 3 = downsloping                                        |
| CA          | Number of major vessels (0-3) colored by fluoroscopy.    |
| THAL        | Thalassemia:                                             |
|             | * 1 = normal                                             |
|             | * 2 = fixed defect                                       |
|             | * 3 = reversible defect                                  |
| TARGET      | Whether the individual is suffering from heart disease:  |
|             | * 0 = absence                                            |
|             | * 1 = present                                            |



Let's take a look at the data. To do this, we first import it directly from the URL below.



# **A Snippet of the Data**

In [7]:
url='https://raw.githubusercontent.com/ksuaray/STAT108F24_Projects_Jupyter/main/ProjectDataSets/heart.csv'
df=pd.read_csv(url)

Next, we can display the data by *typing the name* of the DataFrame. To ensure we can see all columns, we'll use the *pd.set_option* function.

In [8]:
# Set display options to show all columns
pd.set_option('display.max_columns', None)
df

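|     | age | sex | cp  | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca  | thal | target |
|----:|----:|----:|----:|---------:|-----:|----:|--------:|--------:|------:|--------:|------:|----:|-----:|-------:|
| 0   | 63  | 1   | 3   | 145      | 233  | 1   | 0       | 150     | 0     | 2.3     | 0     | 0   | 1    | 1      |
| 1   | 37  | 1   | 2   | 130      | 250  | 0   | 1       | 187     | 0     | 3.5     | 0     | 0   | 2    | 1      |
| 2   | 41  | 0   | 1   | 130      | 204  | 0   | 0       | 172     | 0     | 1.4     | 2     | 0   | 2    | 1      |
| 3   | 56  | 1   | 1   | 120      | 236  | 0   | 1       | 178     | 0     | 0.8     | 2     | 0   | 2    | 1      |
| 4   | 57  | 0   | 0   | 120      | 354  | 0   | 1       | 163     | 1     | 0.6     | 2     | 0   | 2    | 1      |
| ... | ... | ... | ... | ...      | ...  | ... | ...     | ...     | ...   | ...     | ...   | ... | ...  | ...    |
| 298 | 57  | 0   | 0   | 140      | 241  | 0   | 1       | 123     | 1     | 0.2     | 1     | 0   | 3    | 0      |
| 299 | 45  | 1   | 3   | 110      | 264  | 0   | 1       | 132     | 0     | 1.2     | 1     | 0   | 3    | 0      |
| 300 | 68  | 1   | 0   | 144      | 193  | 1   | 1       | 141     | 0     | 3.4     | 1     | 2   | 3    | 0      |
| 301 | 57  | 1   | 0   | 130      | 131  | 0   | 1       | 115     | 1     | 1.2     | 1     | 1   | 3    | 0      |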


# **INSTRUCTIONS**

* Use Python to analyze the data set and complete each of the following.
* Replace each ellipsis (...) with the relevant names or code.  
* For problems that require a written response, double click the text box to start typing.
* Reference the 3 tutorials from activity for assistance.
* Attend office hours if you still need help.

## **QUESTION 1**
Determine whether the four variables below are qualitative or quantitative. If they are quantitative, specify whether they are continuous or discrete.

| Variable | Classification |
|------------------------|----------------|
|Age                     |                |
|Sex                     |                |
|Chest pain type         |                |
|Resting blood pressure  |                |

## **QUESTION 2**

Construct a frequency table, relative frequency table, and relative frequency bar chart to describe the distribution of chest pain type. State any fact that jumps out to you.

**2a)** Construct a table that contains the frequency and relative frequency distribution for chest pain type. Round relative frequency to 3 decimal places.

In [9]:
# Define the name of the variable to be analyzed
variable = df['cp']                                                        #variable = df[...]

# Create the frequency table and sort the categories in numerical order.
# .sort_index() works here because the category names are numerical.
# Rename the resulting Series from "count" to "Frequency".
freq_table = variable.value_counts().sort_index()
freq_table = freq_table.rename('Frequency')

# Create the relative frequency table, and rename the counts column to
#   Relative Frequency.
relative_freq_table = freq_table/len(df)                                            #relative_freq_table=freq_table/... HINT: look back at Project 0 or Tutorial 1.
relative_freq_table = relative_freq_table.rename('Relative Frequency').round(3)     # relative_freq_table = relative_freq_table.rename('...').round(...)

# Combine both tables
combined_table=pd.concat([freq_table, relative_freq_table], axis=1)                 # combined_table=pd.concat([..., ...], axis=1)

# Print the combined table.
combined_table                                                          # ...


| cp | Frequency | Relative Frequency |
|:---|----------:|-------------------:|
| 0  | 143       | 0.472              |
| 1  | 50        | 0.165              |
| 2  | 87        | 0.287              |
| 3  | 23        | 0.076              |

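As a cross-check on the table above, relative frequencies can also be obtained directly from `value_counts(normalize=True)`. The sketch below uses a small, made-up Series of chest pain codes (an assumption for illustration, not the project data), so the numbers it prints are not the project's answers.

```python
import pandas as pd

# Hypothetical chest pain codes, NOT the real data set.
cp = pd.Series([0, 0, 2, 1, 0, 2, 3, 0], name="cp")

# Frequency: how many times each code appears, in code order.
freq = cp.value_counts().sort_index().rename("Frequency")

# Relative frequency: normalize=True divides each count by the total.
rel_freq = (
    cp.value_counts(normalize=True)
    .sort_index()
    .round(3)
    .rename("Relative Frequency")
)

# Place the two columns side by side, aligned on the category code.
table = pd.concat([freq, rel_freq], axis=1)
print(table)
```

Dividing the counts by `len(df)`, as in the cell above, gives the same values when the column has no missing entries; the relative frequencies should sum to 1 up to rounding.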

**2b)** Construct a relative frequency bar chart to describe the distribution of chest pain type.

In [10]:
dfrf = pd.DataFrame(relative_freq_table)

fig = px.bar(x=dfrf.index,y=dfrf['Relative Frequency'],
             title='Relative Frequency Distribution Bar Chart') #title=...,

fig.update_layout(xaxis_title="Chest Pain Type")        #xaxis_title=...
fig.update_layout(yaxis_title="Relative Frequency")             #yaxis_title=...
fig.show()

**2c)** Describe the distribution of chest pain type. When discussing pain types, use the descriptors in addition to the number to help the reader understand. For example, if you want to talk about chest pain type 2, you could say something like "... non-anginal pain (type 2) ...".

The most common chest pain type is typical angina (type 0), followed by non-anginal pain (type 2), atypical angina (type 1), and then asymptomatic (type 3).

## **QUESTION 3**

For question 3 you will analyze a quantitative variable. Find your variable based on your last name and use that variable when answering all parts of question 3.  

Once you find your variable description, scroll up to "About the Dataset" to find the variable name. Then look at the "Snippet of Data" to get the exact variable name, especially since variable names are case sensitive.

| **Last Name** | **Variable Description**                                      |
|------------------------|------------------------------------------------------|
| A-F                    | Resting blood pressure                    |
| G-M                    | Serum cholesterol                              |
| N-S                    | Maximum heart rate achieved                |
| T-Z                    | ST depression induced by exercise relative to rest  |



**3a)** Construct a histogram for your variable. Use number of bins = 20.

In [11]:
# Create the histogram, with the x-axis being the variable specified in the
#   table based on your last name.

fig = px.histogram(x=df['trestbps'], nbins = 20,                          #fig = px.histogram(x=df[...],nbins = ...,
             title='Histogram of Resting Blood Pressure',                 #title=...,
             labels={'x':'Resting Blood Pressure', 'y':'Frequency'})      #labels={'x':...,'y':...})

# Update the vertical axis title.
fig.update_layout(yaxis_title="Frequency")                       #fig.update_layout(yaxis_title="...")

# Print the histogram.
fig.show()

In [12]:
fig = px.histogram(x=df['chol'],nbins = 20,                           #fig = px.histogram(x=df[...],nbins = ...,
             title='Histogram of Serum Cholesterol',                  #title=...,
             labels={'x':'Serum Cholesterol', 'y':'Frequency'})       #labels={'x':...,'y':...})
fig.show()

In [13]:
fig = px.histogram(x=df['thalach'],nbins = 20,                                 #fig = px.histogram(x=df[...],nbins = ...,
             title='Histogram of Maximum Heart Rate Achieved',                 #title=...,
             labels={'x':'Maximum Heart Rate Achieved', 'y':'Frequency'})      #labels={'x':...,'y':...})
fig.show()

In [14]:
fig = px.histogram(x=df['oldpeak'],nbins = 20,                                                        #fig = px.histogram(x=df[...],nbins = ...,
             title='Histogram of ST Depression Induced by Exercise Relative to Rest',                 #title=...,
             labels={'x':'ST Depression Induced by Exercise Relative to Rest', 'y':'Frequency'})      #labels={'x':...,'y':...})
fig.show()

**3b)** Construct a boxplot for your variable.  

In [15]:
# Create the boxplot, with a title, and specify horizontal axis label.
px.box(x=df['trestbps'],                                 #px.box(x=df[...],
       title='Boxplot of Resting Blood Pressure',        #title=...,
       labels={'x':'Resting Blood Pressure'})            #labels={'x':...})

In [16]:
px.box(x=df['chol'],                                         #px.box(x=df[...],
       title='Boxplot of Serum Cholesterol',                 #title=...,
       labels={'x':'Serum Cholesterol'})                     #labels={'x':...})

In [17]:
px.box(x=df['thalach'],                                                #px.box(x=df[...],
       title='Boxplot of Maximum Heart Rate Achieved',                 #title=...,
       labels={'x':'Maximum Heart Rate Achieved'})                     #labels={'x':...})

In [18]:
px.box(x=df['oldpeak'],                                                                       #px.box(x=df[...],
       title='Boxplot of ST Depression Induced by Exercise Relative to Rest',                 #title=...,
       labels={'x':'ST Depression Induced by Exercise Relative to Rest'})                     #labels={'x':...})

# **Heart Disease - Project 1**
### Analyzing qualitative and quantitative variables.

# **Importing Necessary Python Modules**

Python supports a variety of open-source add-ins called **modules** that extend its basic functionality. Each module's name follows the `import` statement, and its purpose is given in a comment after the hash symbol (#).



In [19]:
import pandas as pd                 #Data analysis
import numpy as np                  #Calculations
import plotly.express as px         #Graphing
import matplotlib.pyplot as plt     #Graphing
from IPython.display import Image   #Display images
import warnings                     #Ignore version warnings
warnings.simplefilter('ignore', FutureWarning)


In [20]:
# Replace 'image_url' with the URL of the image you want to display
image_url = 'https://my.clevelandclinic.org/-/scassets/images/org/health/articles/24129-heart-disease-illustration'

# Display the image
Image(url=image_url)

# **Context**

Coronary heart disease (CHD), also referred to as coronary artery disease (CAD), involves the reduction of blood flow to the heart muscle due to the build-up of plaque in the arteries of the heart. It is the most common form of cardiovascular disease. Currently, invasive coronary angiography represents the gold standard for establishing the presence, location, and severity of CHD. However, this diagnostic method is costly and associated with morbidity (count of people with the disease) and mortality (count of deaths) in CHD patients. Therefore, it would be beneficial to develop a non-invasive alternative to replace the current gold standard.

Other less invasive diagnostic methods have been proposed in the scientific literature, including exercise electrocardiogram, thallium scintigraphy, and fluoroscopy of coronary calcification. However, the diagnostic accuracy of these tests ranges only between 35% and 75%. Therefore, it would be beneficial to develop a computer-aided diagnostic tool that combines the results of these non-invasive tests with other patient attributes to boost their diagnostic power, with the aim of ultimately replacing the current invasive gold standard.

Three hundred three (303) consecutive patients referred for coronary angiography at the Cleveland Clinic between May 1981 and September 1984 participated in the experiment. No patient had a history of or electrocardiographic evidence of prior myocardial infarction or known valvular or cardiomyopathic diseases.


# **About the Dataset**

The dataset comprises 303 observations, 13 features, and 1 target attribute. A feature is a variable that is believed to contribute to CHD, and is also referred to as a predictive variable. A target variable is the variable you want to predict (CHD, in this situation). The 13 features include the results of the aforementioned non-invasive diagnostic tests along with other relevant patient information. The target variable includes the result of the invasive coronary angiogram which represents the presence or absence of coronary heart disease in the patient. The 14 variables (13 features and 1 target attribute) are described below.

| **Variable**| **Description**                                          |
|:------------|:---------------------------------------------------------|
| AGE         | The age of the individual.                               |
| SEX         | Gender of the individual; 0 = female, 1 = male.          |
| CP          | The type of chest pain experienced by the individual     |
|             | * 0 = typical angina                                     |
|             | * 1 = atypical angina                                    |
|             | * 2 = non-anginal pain                                   |
|             | * 3 = asymptomatic                                       |
| TRESTBPS    | Resting blood pressure of an individual (mmHg)           |
| CHOL        | Serum cholesterol in mg/dL                               |
| FBS         | Compares the fasting blood sugar value with 120mg/dL     |
|             | * 0: fasting blood sugar ≤ 120mg/dL                      |
|             | * 1: fasting blood sugar >120mg/dL                       |
| RESTECG     | Resting electrocardiographic results                     |
|             | * 0 = normal                                             |
|             | * 1 = having ST-T wave abnormality                       |
|             | * 2 = left ventricular hypertrophy                       |
| THALACH     | Max heart rate achieved, in beats per minute (bpm)       |
| EXANG       | Exercise-induced angina                                  |
|             | * 0 = No                                                 |
|             | * 1 = Yes                                                |
| OLDPEAK     | ST depression (mm) induced by exercise relative to rest. |
| SLOPE       | Peak exercise ST segment                                 |
|             | * 1 = upsloping                                          |
|             | * 2 = flat                                               |
|             | * 3 = downsloping                                        |
| CA          | Number of major vessels (0-3) colored by fluoroscopy.    |
| THAL        | Thalassemia                                              |
|             | * 1 = normal                                             |
|             | * 2 = fixed defect                                       |
|             | * 3 = reversible defect                                  |
| TARGET      | Whether the individual is suffering from heart disease   |
|             | * 0 = heart disease absent                               |
|             | * 1 = heart disease present                              |



Let's take a look at the data. To do this, we first import it directly from the URL below.



# **A Snippet of the Data**

In [21]:
url='https://raw.githubusercontent.com/thamilton562/STAT108_Projects_Students/main/DataSets/Heart%20Disease.csv'
df=pd.read_csv(url)

Next, we can display the data by *typing the name* of the DataFrame. To ensure we can see all columns, we'll use the *pd.set_option* function.

In [22]:
# Set display options to show all columns
pd.set_option('display.max_columns', None)
df

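|     | age | sex | cp  | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca  | thal | target |
|----:|----:|----:|----:|---------:|-----:|----:|--------:|--------:|------:|--------:|------:|----:|-----:|-------:|
| 0   | 63  | 1   | 3   | 145      | 233  | 1   | 0       | 150     | 0     | 2.3     | 0     | 0   | 1    | 1      |
| 1   | 37  | 1   | 2   | 130      | 250  | 0   | 1       | 187     | 0     | 3.5     | 0     | 0   | 2    | 1      |
| 2   | 41  | 0   | 1   | 130      | 204  | 0   | 0       | 172     | 0     | 1.4     | 2     | 0   | 2    | 1      |
| 3   | 56  | 1   | 1   | 120      | 236  | 0   | 1       | 178     | 0     | 0.8     | 2     | 0   | 2    | 1      |
| 4   | 57  | 0   | 0   | 120      | 354  | 0   | 1       | 163     | 1     | 0.6     | 2     | 0   | 2    | 1      |
| ... | ... | ... | ... | ...      | ...  | ... | ...     | ...     | ...   | ...     | ...   | ... | ...  | ...    |
| 298 | 57  | 0   | 0   | 140      | 241  | 0   | 1       | 123     | 1     | 0.2     | 1     | 0   | 3    | 0      |
| 299 | 45  | 1   | 3   | 110      | 264  | 0   | 1       | 132     | 0     | 1.2     | 1     | 0   | 3    | 0      |
| 300 | 68  | 1   | 0   | 144      | 193  | 1   | 1       | 141     | 0     | 3.4     | 1     | 2   | 3    | 0      |
| 301 | 57  | 1   | 0   | 130      | 131  | 0   | 1       | 115     | 1     | 1.2     | 1     | 1   | 3    | 0      |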


# **INSTRUCTIONS**

* Use Python to analyze the data set and complete each of the following.
* Replace each ellipsis (...) with the relevant names or code.  
* For problems that require a written response, double click the text box to start typing.
* Reference the 3 tutorials from activity for assistance.
* Attend office hours if you still need help.

## **QUESTION 1**
Determine whether the four variables below are qualitative or quantitative. If they are quantitative, specify whether they are continuous or discrete.

| Variable | Classification |
|------------------------|----------------|
|Age                     | Quantitative, continuous |
|Sex                     | Qualitative              |
|Chest pain type         | Qualitative              |
|Resting blood pressure  | Quantitative, discrete   |

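A quick way to support classifications like these is to look at how each column is stored and how many distinct values it takes. The sketch below uses a small hypothetical sample that borrows the column names from the heart data; the values are illustrative only. Note that a numeric dtype does not by itself make a variable quantitative: `sex` and `cp` are stored as integers, but the integers are category codes.

```python
import pandas as pd

# Hypothetical mini-sample using the same column names as the heart data.
sample = pd.DataFrame({
    "age": [63, 37, 41, 56, 57],
    "sex": [1, 1, 0, 1, 0],
    "cp": [3, 2, 1, 1, 0],
    "trestbps": [145, 130, 130, 120, 120],
})

# For each column, record the storage type and the number of distinct values.
# A handful of repeated codes hints at a coded category; a wide spread of
# values hints at a measurement. The final call still needs the data dictionary.
summary = pd.DataFrame({
    "dtype": sample.dtypes.astype(str),
    "distinct_values": sample.nunique(),
})
print(summary)
```

On the full data set, the same two checks could be run with `df.dtypes` and `df.nunique()`.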
## **QUESTION 2**

Construct a frequency table, relative frequency table, and relative frequency bar chart to describe the distribution of chest pain type. State any fact that jumps out to you.

**2a)** Construct a table that contains the frequency and relative frequency distribution for chest pain type. Round relative frequency to 3 decimal places.

In [23]:
# Define the name of the variable to be analyzed
variable = df['cp']                                                        #variable = df[...]

# Create the frequency table and sort the categories in numerical order.
# .sort_index works here because the category names are numerical.
freq_table = variable.value_counts().sort_index()

# Rename "count" to "Frequency", and "cp" to "Chest Pain Type"
freq_table = freq_table.rename('Frequency')
freq_table = freq_table.rename_axis('Chest Pain Type')

# Create a mapping for the chest pain categories, from number to names.
cp_names = {
    0: "Typical Angina",
    1: "Atypical Angina",
    2: "Non-Anginal Pain",
    3: "Asymptomatic"
}

# Replace numeric CP names with descriptive names.
freq_table.index = freq_table.index.map(cp_names)

# Create the relative frequency table, and rename the counts column to
#   Relative Frequency.
relative_freq_table = freq_table/len(df)                                            #relative_freq_table=freq_table/... HINT: look back at Project 0 or Tutorial 1.
relative_freq_table = relative_freq_table.rename('Relative Frequency').round(3)     # relative_freq_table = relative_freq_table.rename('...').round(...)

# Combine both tables
combined_table=pd.concat([freq_table, relative_freq_table], axis=1)                 # combined_table=pd.concat([..., ...], axis=1)

# Print the combined table.
combined_table                                                          # ...


| Chest Pain Type  | Frequency | Relative Frequency |
|:-----------------|----------:|-------------------:|
| Typical Angina   | 143       | 0.472              |
| Atypical Angina  | 50        | 0.165              |
| Non-Anginal Pain | 87        | 0.287              |
| Asymptomatic     | 23        | 0.076              |


**2b)** Construct a relative frequency bar chart to describe the distribution of chest pain type.

In [24]:
# Create a DataFrame
dfrf = relative_freq_table.reset_index()

# Rename the columns for clarity
dfrf.columns = ['Chest Pain Type', 'Relative Frequency']


fig = px.bar(x=dfrf['Chest Pain Type'],y=dfrf['Relative Frequency'],
             title='Relative Frequency Distribution Bar Chart') #title=...,

fig.update_layout(xaxis_title="Chest Pain Type")        #xaxis_title=...
fig.update_layout(yaxis_title="Relative Frequency")             #yaxis_title=...
fig.show()


**2c)** Describe the distribution of chest pain type. When discussing pain types, use the descriptors in addition to the number to help the reader understand. For example, if you want to talk about chest pain type 2, you could say something like "... non-anginal pain (type 2) ...".

The patients are most likely to have typical angina (type 0), followed by non-anginal pain (type 2), atypical angina (type 1), and then asymptomatic (type 3).

## **QUESTION 3**

For question 3 you will analyze a quantitative variable. Find your variable based on your last name and use that variable when answering all parts of question 3.  

Once you find your variable description, scroll up to "About the Dataset" to find the variable name. Then look at the "Snippet of Data" to get the exact variable name, especially since variable names are case sensitive.

| **Last Name** | **Variable Description**                                      |
|------------------------|------------------------------------------------------|
| A-F                    | Resting blood pressure                    |
| G-M                    | Serum cholesterol                              |
| N-S                    | Maximum heart rate achieved                |
| T-Z                    | ST depression induced by exercise relative to rest  |



**3a)** Construct a histogram for your variable. Use number of bins = 20.

In [25]:
# Create the histogram, with the x-axis being the variable specified in the
#   table based on your last name.

fig = px.histogram(x=df['trestbps'], nbins = 20,                          #fig = px.histogram(x=df[...],nbins = ...,
             title='Histogram of Resting Blood Pressure',                 #title=...,
             labels={'x':'Resting Blood Pressure', 'y':'Frequency'})      #labels={'x':...,'y':...})

# Update the vertical axis title.
fig.update_layout(yaxis_title="Frequency")                       #fig.update_layout(yaxis_title="...")

# Print the histogram.
fig.show()


**3c)** Calculate the following summary statistics for your variable: 5 number summary, mean, and standard deviation. Round to three decimal places.

In [26]:
# Calculate the numerical summaries
# Indicate your variable.
descriptive_stats = df[['trestbps','chol','thalach','oldpeak']].describe().round(3)       #descriptive_stats = df[[...]].describe().round(3)

# Print the results.
descriptive_stats                                                                               #...


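|       | trestbps | chol    | thalach | oldpeak |
|:------|---------:|--------:|--------:|--------:|
| count | 303.0    | 303.0   | 303.0   | 303.0   |
| mean  | 131.624  | 246.264 | 149.647 | 1.04    |
| std   | 17.538   | 51.831  | 22.905  | 1.161   |
| min   | 94.0     | 126.0   | 71.0    | 0.0     |
| 25%   | 120.0    | 211.0   | 133.5   | 0.0     |
| 50%   | 130.0    | 240.0   | 153.0   | 0.8     |
| 75%   | 140.0    | 274.5   | 166.0   | 1.6     |
| max   | 200.0    | 564.0   | 202.0   | 6.2     |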

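The rows of the `describe()` output map onto the requested statistics: `min`, `25%`, `50%`, `75%`, and `max` form the five-number summary, while `mean` and `std` are reported separately. As a rough sketch, the same quantities can be computed one at a time; the five values below are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
import pandas as pd

# Hypothetical readings, NOT the real column.
values = pd.Series([94, 120, 130, 140, 200], name="trestbps")

# Five-number summary: minimum, Q1, median, Q3, maximum.
five_num = values.quantile([0, 0.25, 0.5, 0.75, 1]).round(3)
five_num.index = ["min", "Q1", "median", "Q3", "max"]

mean = round(values.mean(), 3)
std = round(values.std(), 3)   # pandas uses the sample formula (n - 1)

print(five_num)
print("mean:", mean)
print("std:", std)
```

Because `Series.std()` divides by n - 1 by default, it matches the `std` row reported by `describe()`.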

**3d)** Use information from (3a), (3b), and (3c) to describe your variable in terms of shape, outliers, center, and spread.
* Use the correct center and the correct spread based on the shape of the distribution.
* Specify which center and which spread you are using. For ex: Say "The mean is ..." or "The median is ...", rather than "The center is ..."
* When addressing outliers, if any, list a minimum of 3 values. If there are 1 or 2 outliers, list their values.
* Include units, if any, for all numbers.

The distribution of resting blood pressure is skewed right. There are several outliers, such as 172 mmHg, 180 mmHg, and 200 mmHg. The median is 130 mmHg and the IQR is 20 mmHg.

The distribution of serum cholesterol is skewed right. There are several outliers, including 394 mg/dL, 417 mg/dL, and 564 mg/dL. The median is 240 mg/dL and the IQR is 63.5 mg/dL.

The distribution of the maximum heart rate achieved is skewed left. There is 1 outlier: 71 bpm. The median is 153 bpm and the IQR is 32.5 bpm.

The distribution of ST depression induced by exercise relative to rest is skewed right. There are 4 outliers, including 4.2 mm, 5.6 mm, and 6.2 mm. The median is 0.8 mm and the IQR is 1.6 mm.

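The outliers listed above can be checked numerically with the common 1.5 × IQR rule, which is assumed here to be the rule behind the boxplot points; quartiles computed by a plotting library may differ slightly from those in the summary table. In this sketch the quartiles for resting blood pressure are taken from the table in 3c, `iqr_fences` is a hypothetical helper, and the short list of values is made up for illustration.

```python
import pandas as pd

def iqr_fences(q1, q3, k=1.5):
    """Return the (lower, upper) fences for the k * IQR outlier rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Q1 = 120 and Q3 = 140 for trestbps, from the summary statistics in 3c.
lower, upper = iqr_fences(120.0, 140.0)
print(lower, upper)   # 90.0 170.0

# Hypothetical readings to compare against the fences.
values = pd.Series([94, 120, 130, 140, 172, 180, 200])
outliers = values[(values < lower) | (values > upper)]
print(outliers.tolist())   # [172, 180, 200]
```

Any value below the lower fence or above the upper fence is flagged, which is consistent with 172, 180, and 200 mmHg being reported as outliers.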
## **QUESTION 4**

Remember that the goal of this study is to determine if there is a nonsurgical way to predict if a patient has heart disease. Before using statistical inference, let's analyze the relationship between the several features (variables) and the target variable.

Our target variable is whether or not a patient has heart disease. The feature variables we will consider are: age, resting blood pressure, and serum cholesterol.

Calculate and state the mean age, mean resting blood pressure, and mean cholesterol for those with heart disease, and for those without heart disease. Round to two decimal places. Compare the results and answer a question about the code.

**4a)** Calculate and state the mean age, mean resting blood pressure, and mean cholesterol for those with heart disease, and for those without heart disease. Round to two decimal places.

In [27]:
heart_disease_means = df[df['target'] == 1][['age', 'trestbps', 'chol']].mean().round(2)                        #heart_disease_means = df[df['target'] == 1][['...', '...', '...']].mean().round(...)
no_heart_disease_means = df[df['target'] == 0][['age', 'trestbps', 'chol']].mean().round(2)                     #no_heart_disease_means = df[df['target'] == 0][['...', '...', '...']].mean().round(...)

sec_stat = pd.DataFrame({'age': [no_heart_disease_means['age'], heart_disease_means['age']],
                         'trestbps': [no_heart_disease_means['trestbps'], heart_disease_means['trestbps']],
                         'chol': [no_heart_disease_means['chol'], heart_disease_means['chol']]},
                        index=['No Heart Disease', 'Heart Disease'])

print(sec_stat)

                   age  trestbps    chol
No Heart Disease  56.6     134.4  251.09
Heart Disease     52.5     129.3  242.23

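The same comparison can be produced in one step with `groupby`, which splits the rows by the value of `target` and averages each group. This is an alternative sketch rather than the approach the cell above uses, and the six-row sample is hypothetical, so its means differ from the real results.

```python
import pandas as pd

# Hypothetical mini-sample with the same column names as the heart data.
sample = pd.DataFrame({
    "age": [63, 37, 41, 57, 45, 68],
    "trestbps": [145, 130, 130, 140, 110, 144],
    "chol": [233, 250, 204, 241, 264, 193],
    "target": [1, 1, 1, 0, 0, 0],
})

# Group rows by target, average the three features, then label the groups.
group_means = (
    sample.groupby("target")[["age", "trestbps", "chol"]]
    .mean()
    .round(2)
    .rename(index={0: "No Heart Disease", 1: "Heart Disease"})
)
print(group_means)
```

With more than two groups, this form scales without writing a separate filter for each value.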

**4b)** Compare the results for people with and without heart disease.

Patients with heart disease tend to be younger patients who have lower resting blood pressure and lower serum cholesterol.

-OR-

People with no heart disease have a higher mean age, higher mean resting blood pressure, and higher mean serum cholesterol.

**4c)** In the table below are some snippets of code.

| **Last Name Initial** | **Code**                                      |
|------------------------|------------------------------------------------------|
| A-F                    | `df['target'] == 1`                    |
| G-M                    | `.mean().round(2)`                              |
| N-S                    | `df[df['target'] == 0]`               |
| T-Z                    | `df[df['target'] == 1][['age', 'trestbps', 'chol']]`  |

Based on your last name, interpret the snippet of code.

A-F: df['target'] == 1 checks to see if the patient has heart disease.

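To see what a comparison such as `df['target'] == 1` produces, it can help to run it on a tiny table. The three-row DataFrame below is hypothetical; the point is only that the comparison returns one True/False value per row, and that passing this mask back into the DataFrame keeps the True rows.

```python
import pandas as pd

# Hypothetical three-patient table.
sample = pd.DataFrame({"age": [63, 57, 45], "target": [1, 0, 1]})

# The comparison is evaluated row by row and yields a Boolean Series.
mask = sample["target"] == 1
print(mask.tolist())                  # [True, False, True]

# Indexing with the mask keeps only the rows marked True.
with_disease = sample[mask]
print(with_disease["age"].tolist())   # [63, 45]
```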

## **QUESTION 5**

Generate a paragraph of at least 100 words to address one of the following questions. That is, answer only 5a or 5b, but not both.

**5a)** Discuss how analyzing your chosen data set using statistical methods could help you become better prepared for future courses in your major.

...

--OR--

**5b)** Discuss how analyzing your chosen data set using statistical methods could be instrumental in preparing you for your future career.

...
