#### HYPOTHESIS TESTING

<br>

## Heart Disease Research Pt. 1
<hr>

In [2]:
import pandas as pd
import numpy as np
from scipy.stats import ttest_1samp, binom_test

### Cholesterol Analysis

`yes_hd` contains data for patients <b>with</b> heart disease <br>
`no_hd` contains data for patients <b>without</b> heart disease

In [3]:
heart = pd.read_csv('heart_disease.csv')
yes_hd = heart[heart.heart_disease == 'presence']
no_hd = heart[heart.heart_disease == 'absence']

In [5]:
heart.head()

| Unnamed: 0 | age | sex | trestbps | chol | cp | exang | fbs | thalach | heart_disease |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 63.0 | male | 145.0 | 233.0 | typical angina | 0.0 | 1.0 | 150.0 | absence |
| 1 | 67.0 | male | 160.0 | 286.0 | asymptomatic | 1.0 | 0.0 | 108.0 | presence |
| 2 | 67.0 | male | 120.0 | 229.0 | asymptomatic | 1.0 | 0.0 | 129.0 | presence |
| 3 | 37.0 | male | 130.0 | 250.0 | non-anginal pain | 0.0 | 0.0 | 187.0 | absence |
| 4 | 41.0 | female | 130.0 | 204.0 | atypical angina | 0.0 | 0.0 | 172.0 | absence |


<br>

`chol`: Serum cholesterol in mg/dl <br>
`fbs`: An indicator for whether fasting blood sugar is greater than 120 mg/dl (`1` = true, `0` = false)

In [4]:
chol_hd = yes_hd.chol
chol_hd.head()

1    286.0
2    229.0
6    268.0
8    254.0
9    203.0
Name: chol, dtype: float64

<hr>

In general, total cholesterol over 240 mg/dl is considered “high” (and therefore unhealthy). What is the mean cholesterol level for patients who were diagnosed with heart disease? Is it higher than 240 mg/dl?

In [9]:
avg_chol_hd = np.mean(chol_hd)
print(avg_chol_hd)

251.4748201438849


<hr>

Do people with heart disease have high cholesterol levels (greater than 240 mg/dl) on average?

<br>

<b>Null</b>: People with heart disease have an average cholesterol level equal to 240 mg/dl <br>
<b>Alternative</b>: People with heart disease have an average cholesterol level that is greater than 240 mg/dl

In [10]:
chol_hd_array = np.array(chol_hd)
#print(chol_hd_array)

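# ttest_1samp returns a two-sided p-value; halve it to get the one-sided p-value
# (valid here because the sample mean is above 240, so tstat is positive)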
tstat, pval = ttest_1samp(chol_hd_array, 240)
print(pval / 2)

0.0035411033905155707


Since 0.0035 is less than 0.05, we reject the null hypothesis. The data suggest that people with heart disease have an average cholesterol level significantly higher than 240 mg/dl.
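Halving the two-sided p-value only gives the correct one-sided result when the t-statistic points in the direction of the alternative. The sketch below checks this against `ttest_1samp` called with `alternative='greater'`, which assumes SciPy 1.6 or later. The `sample` values are hypothetical illustration data, not rows from `heart_disease.csv`.

```python
import numpy as np
from scipy.stats import ttest_1samp

# Hypothetical cholesterol readings (mg/dl); NOT taken from heart_disease.csv
sample = np.array([252.0, 231.0, 270.0, 245.0, 262.0, 238.0, 281.0, 249.0])
reference = 240

# Two-sided test, as in the cell above
tstat, p_two_sided = ttest_1samp(sample, reference)

# One-sided test requested directly (assumes SciPy >= 1.6)
_, p_greater = ttest_1samp(sample, reference, alternative='greater')

# Halving matches the one-sided test only when tstat > 0;
# when tstat < 0 the "greater" p-value is 1 - p_two_sided / 2
p_halved = p_two_sided / 2 if tstat > 0 else 1 - p_two_sided / 2

print(tstat, p_greater, p_halved)
```

If the installed SciPy supports it, passing `alternative='greater'` avoids the manual division and the sign check.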

<br>
<hr>

Run the same tests again for those with no heart disease. Do patients without heart disease have average cholesterol levels significantly above 240 mg/dl?

In [11]:
chol_no_hd = no_hd.chol
chol_no_hd.head()

0    233.0
3    250.0
4    204.0
5    236.0
7    354.0
Name: chol, dtype: float64

In [12]:
avg_chol_no_hd = np.mean(chol_no_hd)
print(avg_chol_no_hd)

242.640243902439


<b>Null</b>: People that don't have heart disease have an average cholesterol level equal to 240 mg/dl <br>
<b>Alternative</b>: People that don't have heart disease have an average cholesterol level that is greater than 240 mg/dl

In [15]:
chol_no_hd_array = np.array(chol_no_hd)
#print(chol_no_hd_array)

tstat, pval = ttest_1samp(chol_no_hd_array, 240)
print(pval / 2)

0.26397120232220506


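Since 0.264 is greater than 0.05, we fail to reject the null hypothesis. The sample mean of about 242.6 mg/dl is not significantly above 240 mg/dl, so the data provide no evidence that people without heart disease have an average cholesterol level greater than 240 mg/dl. This does not show that the average is exactly 240 mg/dl.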
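One way to see why a large p-value is not evidence that the mean equals 240 mg/dl is to look at a confidence interval for the mean: many values other than 240 are usually compatible with the data too. The sketch below computes a 95% t-interval with `scipy.stats`. The `sample` values are hypothetical illustration data, not rows from `heart_disease.csv`.

```python
import numpy as np
from scipy import stats

# Hypothetical cholesterol readings (mg/dl); NOT taken from heart_disease.csv
sample = np.array([238.0, 251.0, 209.0, 244.0, 301.0, 226.0, 262.0, 221.0])

mean = sample.mean()
sem = stats.sem(sample)      # standard error of the mean (uses ddof=1)
df = len(sample) - 1         # degrees of freedom for the t distribution

# 95% confidence interval for the population mean
low, high = stats.t.interval(0.95, df, loc=mean, scale=sem)

print(f"mean = {mean:.1f}, 95% CI = ({low:.1f}, {high:.1f})")
```

Here the interval contains 240 but also a wide range of other values, so the data cannot single out 240 as the true mean.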

<hr>

### Fasting Blood Sugar Analysis

Find the total number of patients:

In [17]:
num_patients = len(heart)
print(num_patients)

303


<hr>

The `fbs` column indicates whether or not a patient’s fasting blood sugar was greater than 120 mg/dl (`1` = greater than 120 mg/dl, and `0` = less than or equal to 120 mg/dl).

<br>

Find the number of patients with an `fbs` greater than 120 mg/dl:

In [18]:
num_highfbs_patients = np.sum(heart.fbs[heart.fbs == 1])
print(num_highfbs_patients)
# the observed number of successes is 45 patients

45.0


<hr>

By some estimates, about 8% of the U.S. population had diabetes (diagnosed or undiagnosed) in 1988, when this data was collected. If this sample were representative of the population, approximately how many people would you expect to have diabetes? Is this value similar to, or different from, the number of patients with a fasting blood sugar above 120 mg/dl?

In [19]:
num_of_us_pop_diabetes = num_patients * 0.08
print(num_of_us_pop_diabetes)

24.240000000000002


This comes out to about 24 patients, roughly half of the 45 patients counted in `num_highfbs_patients`.
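The same comparison can be made on the rate scale rather than the count scale. This is a minimal sketch that hard-codes the counts printed above (303 patients, 45 with `fbs` equal to `1`). The variable names are illustrative and not part of the notebook.

```python
num_patients = 303     # total patients, from the output above
num_highfbs = 45       # patients with fbs == 1, from the output above
assumed_rate = 0.08    # the 1988 diabetes estimate quoted above

# Expected count if the sample matched the assumed population rate
expected = num_patients * assumed_rate

# Observed share of patients with fbs > 120 mg/dl
observed_rate = num_highfbs / num_patients

print(f"expected count: {expected:.2f}")
print(f"observed rate: {observed_rate:.3f}")
print(f"observed / expected: {num_highfbs / expected:.2f}")
```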

<br>
<hr>

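Does this sample come from a population in which the rate of fbs > 120 mg/dl is equal to 8%?

<br>

<b>Null</b>: This sample was drawn from a population in which the rate of fbs > 120 mg/dl is 8% <br>
<b>Alternative</b>: This sample was drawn from a population in which the rate of fbs > 120 mg/dl is greater than 8%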

In [20]:
p_value = binom_test(num_highfbs_patients, n = num_patients, p = 0.08, alternative = 'greater')
print("{:.8f}".format(float(p_value)))

0.00004689


Since 0.0000469 is less than 0.05, we reject the null hypothesis. This suggests that the sample comes from a population in which more than 8% of people have fbs > 120 mg/dl.
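Recent SciPy releases replace `binom_test` with `binomtest`, which returns a result object rather than a bare p-value. The sketch below reruns the test with `binomtest` (assuming SciPy 1.7 or later) and cross-checks it against the binomial survival function, using the counts printed above. Note that `binomtest` expects an integer count, so `45` is passed rather than `45.0`.

```python
from scipy.stats import binom, binomtest

k, n, p0 = 45, 303, 0.08   # successes, sample size, and null rate from the cells above

# One-sided exact binomial test (assumes SciPy >= 1.7)
result = binomtest(k, n=n, p=p0, alternative='greater')

# The same quantity computed directly: P(X >= k) = P(X > k - 1)
p_manual = binom.sf(k - 1, n, p0)

print(f"{result.pvalue:.8f}")
print(f"{p_manual:.8f}")
```

Both lines should print the same value, since the one-sided p-value is the upper-tail probability of the binomial distribution under the null.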