In [2]:
# importing modules
import pandas as pd
import numpy as np

In [13]:
# load data
heart = pd.read_csv('heart_disease.csv')
yes_hd = heart[heart.heart_disease == 'presence']
no_hd = heart[heart.heart_disease == 'absence']

# Examine main data
print(heart)

     age     sex  trestbps  chol                cp  exang  fbs  thalach  \
0     63    male       145   233    typical angina      0    1      150   
1     67    male       160   286      asymptomatic      1    0      108   
2     67    male       120   229      asymptomatic      1    0      129   
3     37    male       130   250  non-anginal pain      0    0      187   
4     41  female       130   204   atypical angina      0    0      172   
..   ...     ...       ...   ...               ...    ...  ...      ...   
298   45    male       110   264    typical angina      0    0      132   
299   68    male       144   193      asymptomatic      0    1      141   
300   57    male       130   131      asymptomatic      1    0      115   
301   57  female       130   236   atypical angina      0    0      174   
302   38    male       138   175  non-anginal pain      0    0      173   

    heart_disease  
0         absence  
1        presence  
2        presence  
3         absence  

The full dataset has been loaded for you as `heart`, then split into two subsets:

- `yes_hd`, which contains data for patients with heart disease
- `no_hd`, which contains data for patients without heart disease
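
The split works by boolean masking. As a minimal sketch, the toy frame below copies `chol` and `heart_disease` from the first four rows printed above; the names `toy`, `toy_yes`, and `toy_no` are illustrative only and not part of the project.

```python
import pandas as pd

# Toy frame: values copied from the first four rows of the printed dataset
toy = pd.DataFrame({
    'chol': [233, 286, 229, 250],
    'heart_disease': ['absence', 'presence', 'presence', 'absence'],
})

# A boolean mask keeps only the rows where the condition is True
mask = toy['heart_disease'] == 'presence'
toy_yes = toy[mask]    # rows with heart disease
toy_no = toy[~mask]    # rows without heart disease

print(len(toy_yes), len(toy_no))
```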


For this project, we’ll investigate the following variables:

- `chol`: serum cholesterol in mg/dl
- `fbs`: An indicator for whether fasting blood sugar is greater than 120 mg/dl (`1` = true; `0` = false)


To start, we’ll investigate cholesterol levels for patients with heart disease. Use the dataset `yes_hd` to save cholesterol levels for patients with heart disease as a variable named `chol_hd`.

In [7]:
# save the chol column for patients with heart disease as a Series
chol_hd = yes_hd['chol']

In general, total cholesterol over 240 mg/dl is considered “high” (and therefore unhealthy). Calculate the mean cholesterol level for patients who were diagnosed with heart disease and print it out. Is it higher than 240 mg/dl?

In [26]:
mean_hd_chol = np.average(chol_hd)
sd_hd_chol = np.std(chol_hd)
print(f'The mean cholesterol for people diagnosed with heart disease is: {np.round(mean_hd_chol,2)} mg/dL.')
print(f'The standard deviation of cholesterol for people diagnosed with heart disease is: {np.round(sd_hd_chol,2)} mg/dL.')

The mean cholesterol for people diagnosed with heart disease is: 251.47 mg/dL.
The standard deviation of cholesterol for people diagnosed with heart disease is: 49.31 mg/dL.


Do people with heart disease have high cholesterol levels (greater than or equal to 240 mg/dl) on average? Import the function from `scipy.stats` that you can use to test the following null and alternative hypotheses:

- Null: People with heart disease have an average cholesterol level equal to 240 mg/dl
- Alternative: People with heart disease have an average cholesterol level that is greater than 240 mg/dl

Note: Unfortunately, the `scipy.stats` function we’ve been using does not (at the time of writing) have an `alternative` parameter to change the alternative hypothesis for this test. Therefore, you’ll have to run a two-sided test. However, since you calculated earlier that the average cholesterol level for heart disease patients is greater than 240 mg/dl, you can calculate the p-value for the one-sided test indicated above simply by dividing the two-sided p-value in half.

**This is no longer the case: `ttest_1samp` now accepts an `alternative` parameter, so I have used `alternative='greater'`. The null hypothesis can be rejected.**
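
The halving shortcut from the note can be checked against the `alternative='greater'` option. The sketch below uses synthetic data rather than the heart dataset (the sample, seed, and variable names are assumptions for illustration) and assumes a SciPy version whose `ttest_1samp` accepts `alternative`.

```python
import numpy as np
from scipy.stats import ttest_1samp

# Synthetic sample (illustrative only, not the heart dataset)
rng = np.random.default_rng(0)
sample = rng.normal(loc=250, scale=50, size=100)

# Two-sided test, then the one-sided ("greater") version of the same test
t_two, p_two = ttest_1samp(sample, 240)
t_one, p_one = ttest_1samp(sample, 240, alternative='greater')

# Halving is only valid when the statistic points in the direction
# of the alternative; otherwise the one-sided p-value is 1 - p/2
if t_two > 0:
    p_halved = p_two / 2
else:
    p_halved = 1 - p_two / 2

print(p_one, p_halved)
```

The two printed values should agree up to floating-point error, which is why the shortcut depends on first checking that the sample mean lies on the side named by the alternative.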

In [38]:
# import 1 sample ttest as we are comparing to a single number
from scipy.stats import ttest_1samp

The t-statistic is 2.73 and the p-value is 0.0035411033905155707.


In [39]:
# manually running this results in the following
x_bar = mean_hd_chol
mu = 240
standard_dev = sd_hd_chol
n = len(yes_hd)
print(n)
tval = (x_bar - mu) / (standard_dev/(n**0.5))
print(tval)
# close to the ttest_1samp statistic; the small gap comes from np.std defaulting to ddof=0

139
2.74366742227647
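
The manual statistic differs slightly from the `ttest_1samp` value because `np.std` defaults to the population standard deviation (`ddof=0`), while the t-test uses the sample standard deviation (`ddof=1`). A minimal sketch on synthetic data (the sample and seed are assumptions, not the heart dataset) shows the effect:

```python
import numpy as np
from scipy.stats import ttest_1samp

# Synthetic sample of the same size as yes_hd (illustrative values only)
rng = np.random.default_rng(1)
sample = rng.normal(loc=250, scale=50, size=139)
mu = 240
n = len(sample)

# Manual t-statistic with the population SD (np.std default, ddof=0)
t_pop = (sample.mean() - mu) / (np.std(sample) / np.sqrt(n))
# Manual t-statistic with the sample SD (ddof=1), as the t-test uses
t_samp = (sample.mean() - mu) / (np.std(sample, ddof=1) / np.sqrt(n))

t_scipy, _ = ttest_1samp(sample, mu)
print(t_pop, t_samp, t_scipy)
```

With `ddof=1` the manual value matches SciPy; the `ddof=0` version is larger in magnitude by a factor of `sqrt(n / (n - 1))`.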


Run the hypothesis test indicated in task 3 and print out the p-value. Can you conclude that heart disease patients have an average cholesterol level significantly greater than 240 mg/dl? Use a significance threshold of 0.05.

In [47]:
# run a ttest to examine the null hypothesis
hd_t_value, hd_p_value = ttest_1samp(chol_hd, 240, alternative='greater')
print(f'The t-statistic is {np.round(hd_t_value,2)} and the p-value is {hd_p_value}.')
if hd_p_value > 0.05:
    print('We fail to reject the null hypothesis as the p-value is greater than 0.05.')
else:
    print('The null hypothesis is rejected as the p-value is less than 0.05.')

The t-statistic is 2.73 and the p-value is 0.0035411033905155707.
The null hypothesis is rejected as the p-value is less than 0.05.


Repeat steps 1-4 in order to run the same hypothesis test, but for patients in the sample who were **not** diagnosed with heart disease. Do patients without heart disease have average cholesterol levels significantly above 240 mg/dl?

In [48]:
# Repeating the above for no HD
chol_nhd = no_hd['chol']
mean_nhd_chol = np.average(chol_nhd)
sd_nhd_chol = np.std(chol_nhd)
print(f'The mean cholesterol for people without a heart disease diagnosis is: {np.round(mean_nhd_chol,2)} mg/dL.')
print(f'The standard deviation of cholesterol for people without a heart disease diagnosis is: {np.round(sd_nhd_chol,2)} mg/dL.')

# run a ttest to examine the null hypothesis
nhd_t_value, nhd_p_value = ttest_1samp(chol_nhd, 240, alternative='greater')
print(f'The t-statistic is {np.round(nhd_t_value,2)} and the p-value is {nhd_p_value}.')
if nhd_p_value > 0.05:
    print('We fail to reject the null hypothesis as the p-value is greater than 0.05.')
else:
    print('The null hypothesis is rejected as the p-value is less than 0.05.')

The mean cholesterol for people without a heart disease diagnosis is: 242.64 mg/dL.
The standard deviation of cholesterol for people without a heart disease diagnosis is: 53.29 mg/dL.
The t-statistic is 0.63 and the p-value is 0.26397120232220506.
We fail to reject the null hypothesis as the p-value is greater than 0.05.


Let’s now return to the full dataset (saved as `heart`). How many patients are there in this dataset? Save the number of patients as `num_patients` and print it out.

In [50]:
num_patients = len(heart)
print(f'The total number of patients is {num_patients}.')

The total number of patients is 303.


Remember that the `fbs` column of this dataset indicates whether or not a patient’s fasting blood sugar was greater than 120 mg/dl (`1` means that their fasting blood sugar was greater than 120 mg/dl; `0` means it was less than or equal to 120 mg/dl).

Calculate the number of patients with fasting blood sugar greater than 120. Save this number as `num_highfbs_patients` and print it out.

In [55]:
# count the rows where fbs == 1 (summing a boolean mask counts the True values)
num_highfbs_patients = np.sum(heart['fbs'] == 1)
print(f"How many patients' fasting blood sugar levels are above 120 mg/dL? {num_highfbs_patients}")

How many patients' fasting blood sugar levels are above 120 mg/dL? 45


Sometimes, part of an analysis will involve comparing a sample to known population values to see if the sample appears to be representative of the general population.

By some estimates, about 8% of the U.S. population had diabetes (diagnosed or undiagnosed) in 1988 when this data was collected. While there are multiple tests that contribute to a diabetes diagnosis, fasting blood sugar levels greater than 120 mg/dl can be indicative of diabetes (or at least, pre-diabetes). If this sample were representative of the population, approximately how many people would you expect to have diabetes? Calculate and print out this number.

Is this value similar to the number of patients with a fasting blood sugar above 120 mg/dl, or different?

In [57]:
pc8_of_sample = num_patients * 0.08
print(f'The number of patients expected to have fbs > 120 mg/dL is {np.round(pc8_of_sample,2)}.')
if num_highfbs_patients > pc8_of_sample: 
    print('There are more patients than would be expected to have diabetes in this study.')

The number of patients expected to have fbs > 120 mg/dL is 24.24.
There are more patients than would be expected to have diabetes in this study.


Does this sample come from a population in which the rate of fbs > 120 mg/dl is equal to 8%? Import the function from `scipy.stats` that you can use to test the following null and alternative hypotheses:

- Null: This sample was drawn from a population where 8% of people have fasting blood sugar > 120 mg/dl
- Alternative: This sample was drawn from a population where more than 8% of people have fasting blood sugar > 120 mg/dl

In [58]:
# import the binomial test from scipy stats
# (binomtest replaces binom_test, which has been removed from recent SciPy releases)
from scipy.stats import binomtest


Run the hypothesis test indicated in task 9 and print out the p-value. Using a significance threshold of 0.05, can you conclude that this sample was drawn from a population where the rate of fasting blood sugar > 120 mg/dl is significantly greater than 8%?

In [67]:
# run a binomial test to see whether our cohort differs statistically from the population
binom_ptest = binomtest(num_highfbs_patients, num_patients, 0.08, alternative='greater').pvalue
print(f'From this test we see the p-value is {binom_ptest}, suggesting the sample comes from a population where more than 8% of people have fbs > 120 mg/dL.')

From this test we see the p-value is 4.689471951449078e-05, suggesting the sample comes from a population where more than 8% of people have fbs > 120 mg/dL.
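
As a cross-check, the one-sided binomial p-value is the probability of seeing at least 45 high-fbs patients out of 303 when the true rate is 8%. A minimal sketch, using the counts reported above and the `scipy.stats.binom` distribution (the variable names are illustrative):

```python
import numpy as np
from scipy.stats import binom

# Counts and null rate taken from the analysis above
k, n, p = 45, 303, 0.08

# P(X >= k) = P(X > k - 1), via the survival function
p_sf = binom.sf(k - 1, n, p)

# The same tail probability, summed term by term from the pmf
p_sum = binom.pmf(np.arange(k, n + 1), n, p).sum()

print(p_sf, p_sum)
```

Both values should agree with the p-value from the binomial test, since the `'greater'` alternative is exactly this upper-tail probability.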
