# Heart Disease Research Part I

In this project, you’ll investigate data from a sample of patients who were evaluated for heart disease at the Cleveland Clinic Foundation. The data was downloaded from the
[UCI Machine Learning Repository](https://archive.ics.uci.edu/dataset/45/heart+disease) and then cleaned for analysis. The principal investigators responsible for data collection were:

##### Data citation:

1. Hungarian Institute of Cardiology. Budapest: Andras Janosi, M.D.
2. University Hospital, Zurich, Switzerland: William Steinbrunn, M.D.
3. University Hospital, Basel, Switzerland: Matthias Pfisterer, M.D.
4. V.A. Medical Center, Long Beach and Cleveland Clinic Foundation: Robert Detrano, M.D., Ph.D.

##### Additional Information

This database contains 76 attributes, but all published experiments refer to using a subset of 14 of them.  In particular, the Cleveland database is the only one that has been used by ML researchers to date.  The "goal" field refers to the presence of heart disease in the patient.  It is integer valued from 0 (no presence) to 4. Experiments with the Cleveland database have concentrated on simply attempting to distinguish presence (values 1,2,3,4) from absence (value 0).  
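
The presence/absence rule described above can be sketched in a few lines. This is only an illustration: the `goal` values and the `to_binary` helper are hypothetical, not taken from the dataset.

```python
# Hypothetical "goal" values on the 0-4 scale (illustrative, not real patient data).
goal = [0, 2, 1, 0, 3, 4]


def to_binary(values):
    """Map 0 -> 0 (absence) and 1, 2, 3, 4 -> 1 (presence)."""
    return [1 if v > 0 else 0 for v in values]


has_disease = to_binary(goal)
print(has_disease)
```

On a pandas Series the same rule can be written as `(series > 0).astype(int)`.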
   
The names and social security numbers of the patients were recently removed from the database, replaced with dummy values.

One file has been "processed", that one containing the Cleveland database.  All four unprocessed files also exist in this directory.

To see Test Costs (donated by Peter Turney), please see the folder "Costs" 

##### Initialize the UCI Machine Learning Repository API

In [35]:
# Import the dataset into your code
from ucimlrepo import fetch_ucirepo 

# fetch dataset 
heart_disease = fetch_ucirepo(id=45) 


##### Import libraries

In [36]:
import pandas as pd
import numpy as np

##### Data Acquisition
Load the heart disease dataset for analysis and take a first look at the DataFrame.

In [37]:
# Combine the features and targets into one DataFrame and rename the 'num' column to 'target'
heart = pd.concat([heart_disease.data.features, heart_disease.data.targets], axis=1).rename(columns={'num': 'target'})
heart

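|  | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 1 | 145 | 233 | 1 | 2 | 150 | 0 | 2.3 | 3 | 0.0 | 6.0 | 0 |
| 1 | 67 | 1 | 4 | 160 | 286 | 0 | 2 | 108 | 1 | 1.5 | 2 | 3.0 | 3.0 | 2 |
| 2 | 67 | 1 | 4 | 120 | 229 | 0 | 2 | 129 | 1 | 2.6 | 2 | 2.0 | 7.0 | 1 |
| 3 | 37 | 1 | 3 | 130 | 250 | 0 | 0 | 187 | 0 | 3.5 | 3 | 0.0 | 3.0 | 0 |
| 4 | 41 | 0 | 2 | 130 | 204 | 0 | 2 | 172 | 0 | 1.4 | 1 | 0.0 | 3.0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 298 | 45 | 1 | 1 | 110 | 264 | 0 | 0 | 132 | 0 | 1.2 | 2 | 0.0 | 7.0 | 1 |
| 299 | 68 | 1 | 4 | 144 | 193 | 1 | 0 | 141 | 0 | 3.4 | 2 | 2.0 | 7.0 | 2 |
| 300 | 57 | 1 | 4 | 130 | 131 | 0 | 0 | 115 | 1 | 1.2 | 2 | 1.0 | 7.0 | 3 |
| 301 | 57 | 0 | 2 | 130 | 236 | 0 | 2 | 174 | 0 | 0.0 | 2 | 1.0 | 3.0 | 1 |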


##### Data Dictionary/Variable Notes

Complete attribute documentation:
- ``id``: patient identification number;
- ``age``: age in years;
- ``sex``: 1 = male; 0 = female;
- ``cp``: chest pain type (1: typical angina, 2: atypical angina, 3: non-anginal pain, 4: asymptomatic);
- ``trestbps``: resting blood pressure (in mm Hg on admission to the hospital);
- ``chol``: serum cholesterol in mg/dl;
- ``fbs``: fasting blood sugar > 120 mg/dl (1 = true; 0 = false);
- ``restecg``: resting electrocardiographic results (0: normal, 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV), 2: showing probable or definite left ventricular hypertrophy by Estes' criteria);
- ``thalach``: maximum heart rate achieved;
- ``exang``: exercise induced angina (1 = yes; 0 = no);
- ``oldpeak``: ST depression induced by exercise relative to rest;
- ``slope``: the slope of the peak exercise ST segment (1: upsloping, 2: flat, 3: downsloping);
- ``ca``: number of major vessels (0-3) colored by fluoroscopy;
- ``thal``: 3: normal, 6: fixed defect, 7: reversible defect;
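
The coded columns in the dictionary above can be turned into readable labels with a plain lookup table. The following is a minimal sketch: the `decode` helper and the sample codes are hypothetical, and the handling of missing values is an assumption (the preview shows `thal` as floats, which may or may not indicate missing entries).

```python
import math

# Label tables copied from the data dictionary above.
CP_LABELS = {
    1: "typical angina",
    2: "atypical angina",
    3: "non-anginal pain",
    4: "asymptomatic",
}
THAL_LABELS = {3: "normal", 6: "fixed defect", 7: "reversible defect"}


def decode(codes, labels):
    """Translate numeric codes to labels; None/NaN become 'missing' (assumed policy)."""
    decoded = []
    for code in codes:
        if code is None or (isinstance(code, float) and math.isnan(code)):
            decoded.append("missing")
        else:
            decoded.append(labels.get(int(code), "unknown"))
    return decoded


# Hypothetical sample codes, not real patient rows.
print(decode([1, 4, 3], CP_LABELS))
print(decode([6.0, 3.0, float("nan")], THAL_LABELS))
```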


### Task 1

Split the dataframe into two subsets:

- ``yes_hd``, which contains data for patients with heart disease
- ``no_hd``, which contains data for patients without heart disease

In [38]:
yes_hd = heart[heart.target != 0]
no_hd = heart[heart.target == 0]

For this project, we’ll investigate two variables: __chol__ and __fbs__.
To start, we’ll investigate cholesterol levels for patients with heart disease. Use the dataset ``yes_hd`` to save cholesterol levels for patients with heart disease as a variable named ``chol_hd``.

In [39]:
chol_hd = yes_hd['chol']
chol_hd.iloc[:5]

1    286
2    229
6    268
8    254
9    203
Name: chol, dtype: int64

### Task 2

In general, total cholesterol over 240 mg/dl is considered “high” (and therefore unhealthy). Calculate the mean cholesterol level for patients who were diagnosed with heart disease and print it out. Is it higher than 240 mg/dl?

In [40]:
print(f'The mean cholesterol for patients diagnosed with heart disease: {round(chol_hd.mean(),2)}')

The mean cholesterol for patients diagnosed with heart disease: 251.47


### Task 3

Do people with heart disease have high cholesterol levels (greater than 240 mg/dl) on average? Import the function from ``scipy.stats`` that you can use to test the following null and alternative hypotheses:

- H0: People with heart disease have an average cholesterol level equal to 240 mg/dl
- H1: People with heart disease have an average cholesterol level that is greater than 240 mg/dl
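
For intuition, the statistic behind a one-sample t-test is t = (sample mean − hypothesized mean) / (sample standard deviation / √n). A minimal sketch of that formula using only the standard library follows; the function name and the cholesterol values are hypothetical. `ttest_1samp` additionally converts t into a p-value using Student's t distribution with n − 1 degrees of freedom, a step not reproduced here.

```python
import math
import statistics


def one_sample_t_statistic(sample, popmean):
    """Return (mean - popmean) / (s / sqrt(n)), with s the sample standard deviation."""
    n = len(sample)
    mean = statistics.mean(sample)
    s = statistics.stdev(sample)  # uses the n - 1 denominator
    return (mean - popmean) / (s / math.sqrt(n))


# Hypothetical cholesterol readings in mg/dl (illustrative, not from the dataset).
readings = [250, 260, 230, 270, 240]
t_stat = one_sample_t_statistic(readings, 240)
print(t_stat)
```

A positive t means the sample mean lies above 240 mg/dl; the one-sided test asks whether it lies far enough above to be unlikely under H0.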

In [41]:
from scipy.stats import ttest_1samp

### Task 4

Run the hypothesis test indicated in task 3 and print out the p-value. Can you conclude that heart disease patients have an average cholesterol level significantly greater than 240 mg/dl? Use a significance threshold of 0.05.

In [42]:
stat, pval = ttest_1samp(chol_hd, 240, alternative='greater')
print(f'p-value {pval}')
print()
if pval < 0.05:
    print("Reject the null hypothesis; there is sufficient evidence to conclude that people with heart disease have an average cholesterol level greater than 240 mg/dl.")
else:
    print("Fail to reject the null hypothesis; there is insufficient evidence to conclude that people with heart disease have an average cholesterol level greater than 240 mg/dl.")

p-value 0.0035411033905155707

Reject the null hypothesis; there is sufficient evidence to conclude that people with heart disease have an average cholesterol level greater than 240 mg/dl.


### Task 5

Repeat steps 1-4 in order to run the same hypothesis test, but for patients in the sample who were not diagnosed with heart disease. Do patients without heart disease have average cholesterol levels significantly above 240 mg/dl?

In [43]:
chol_no_hd = no_hd['chol']
chol_no_hd.iloc[:5]

0    233
3    250
4    204
5    236
7    354
Name: chol, dtype: int64

In [44]:
print(f'The mean cholesterol for patients who have not been diagnosed with heart disease: {round(chol_no_hd.mean(),2)}')

The mean cholesterol for patients who have not been diagnosed with heart disease: 242.64


In [45]:
# ttest_1samp was imported directly in Task 3, so it is called without a `stats.` prefix
stat, pval = ttest_1samp(chol_no_hd, 240, alternative='greater')
if pval < 0.05:
    print("Reject the null hypothesis; there is sufficient evidence to conclude that people without heart disease have an average cholesterol level greater than 240 mg/dl.")
else:
    print("Fail to reject the null hypothesis; there is insufficient evidence to conclude that people without heart disease have an average cholesterol level greater than 240 mg/dl.")

Fail to reject the null hypothesis; there is insufficient evidence to conclude that people without heart disease have an average cholesterol level greater than 240 mg/dl.


## Fasting Blood Sugar Analysis

### Task 6

Let’s now return to the full dataset (saved as ``heart``). How many patients are there in this dataset? Save the number of patients as ``num_patients`` and print it out.

In [46]:
num_patients = len(heart)
print(f'Number of patients {num_patients}')

Number of patients 303


### Task 7

Remember that the fbs column of this dataset indicates whether or not a patient’s fasting blood sugar was greater than 120 mg/dl (1 means that their fasting blood sugar was greater than 120 mg/dl; 0 means it was less than or equal to 120 mg/dl).

Calculate the number of patients with fasting blood sugar greater than 120. Save this number as ``num_highfbs_patients`` and print it out.

In [47]:
num_highfbs_patients = len(heart[heart['fbs'] == 1])
print(f'Number of patients with fasting blood sugar greater than 120: {num_highfbs_patients}')

Number of patients with fasting blood sugar greater than 120: 45


### Task 8

Sometimes, part of an analysis will involve comparing a sample to known population values to see if the sample appears to be representative of the general population.

By some estimates, about 8% of the U.S. population had diabetes (diagnosed or undiagnosed) in 1988 when this data was collected. While there are multiple tests that contribute to a diabetes diagnosis, fasting blood sugar levels greater than 120 mg/dl can be indicative of diabetes (or at least, pre-diabetes). If this sample were representative of the population, approximately how many people would you expect to have diabetes? Calculate and print out this number.

Is this value similar to, or different from, the number of patients with a fasting blood sugar above 120 mg/dl?

In [48]:
print(f'The estimated number of people with diabetes: {round(num_patients*0.08,0)}')

The estimated number of people with diabetes: 24.0


### Task 9

Does this sample come from a population in which the rate of fbs > 120 mg/dl is equal to 8%? Import the function from ``scipy.stats`` that you can use to test the following null and alternative hypotheses:

- H0: This sample was drawn from a population where 8% of people have fasting blood sugar > 120 mg/dl
- H1: This sample was drawn from a population where more than 8% of people have fasting blood sugar > 120 mg/dl
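
Under H0, the number of high-fbs patients follows a Binomial(n, 0.08) distribution, and the one-sided p-value is the probability of seeing at least the observed count. The sketch below computes that tail sum directly with the standard library, using the counts from Tasks 6 and 7 (45 of 303). The helper name is hypothetical, and this is an illustration of the idea rather than a description of how `scipy` implements its test.

```python
from math import comb


def binom_tail_pvalue(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summed term by term."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# 45 patients with fbs > 120 mg/dl out of 303, tested against a rate of 8%.
p_value = binom_tail_pvalue(45, 303, 0.08)
print(p_value)
```

A small tail probability means a count this high would be rare if the true rate were 8%.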

In [49]:
from scipy.stats import binomtest

### Task 10

Run the hypothesis test indicated in task 9 and print out the p-value. Using a significance threshold of 0.05, can you conclude that this sample was drawn from a population where the rate of fasting blood sugar > 120 mg/dl is significantly greater than 8%?

In [50]:
result = binomtest(num_highfbs_patients, num_patients, 0.08, alternative='greater')
print(f'p-value {result.pvalue}')
print()
if result.pvalue < 0.05:
    print("Reject the null hypothesis; there is sufficient evidence to conclude that more than 8% of people have fasting blood sugar > 120 mg/dl.")
else:
    print("Fail to reject the null hypothesis; there is insufficient evidence to conclude that more than 8% of people have fasting blood sugar > 120 mg/dl.")


p-value 4.689471951448875e-05

Reject the null hypothesis; there is sufficient evidence to conclude that more than 8% of people have fasting blood sugar > 120 mg/dl.
