# Project Title
Predicting the probability of heart disease based on different risk factors
# Project Proposal
Using the Heart Disease UCI and Heart Failure Clinical Data datasets, our goal is to identify which risk factors are most strongly related to heart disease and to develop a model that predicts the probability of heart disease from these risk factors.
# Reference to Data Sources
## Heart Disease UCI
Acknowledgements
Creators:

Hungarian Institute of Cardiology. Budapest: Andras Janosi, M.D.
University Hospital, Zurich, Switzerland: William Steinbrunn, M.D.
University Hospital, Basel, Switzerland: Matthias Pfisterer, M.D.
V.A. Medical Center, Long Beach and Cleveland Clinic Foundation: Robert Detrano, M.D., Ph.D.
Donor:
David W. Aha (aha '@' ics.uci.edu) (714) 856-8779

*Link*
https://www.kaggle.com/ronitf/heart-disease-uci 

## Heart Failure Clinical Data
Creator
Davide Chicco, Giuseppe Jurman: Machine learning can predict survival of patients with heart failure from serum creatinine and ejection fraction alone. BMC Medical Informatics and Decision Making 20, 16 (2020).
Original Publication
https://bmcmedinformdecismak.biomedcentral.com/articles/10.1186/s12911-020-1023-5

*Link*
https://www.kaggle.com/andrewmvd/heart-failure-clinical-data


In [1]:
%%bigquery
SELECT *
FROM `heart_disease_dataset.heart_disease_ICU`;

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,62,0,0,140,268,0,0,160,0,3.6,0,2,2,0
1,53,1,0,140,203,1,0,155,1,3.1,0,0,3,0
2,59,1,0,170,326,0,0,140,1,3.4,0,0,3,0
3,62,0,0,160,164,0,0,145,0,6.2,0,3,3,0
4,55,1,0,140,217,0,1,111,1,5.6,0,0,3,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,34,1,3,118,182,0,0,174,0,0.0,2,0,2,1
299,42,1,3,148,244,0,0,178,0,0.8,2,2,2,1
300,60,0,3,150,240,0,1,171,0,0.9,2,0,2,1
301,59,1,3,160,273,0,0,125,0,0.0,2,0,2,0


In [1]:
%%bigquery
SELECT *
FROM `heart_disease_dataset.heart_failure_clinical_data`;

Unnamed: 0,age,anaemia,creatinine_phosphokinase,diabetes,ejection_fraction,high_blood_pressure,platelets,serum_creatinine,serum_sodium,sex,smoking,time,DEATH_EVENT
0,42.0,1,250,1,15,0,213000.00,1.3,136,0,0,65,1
1,46.0,0,168,1,17,1,271000.00,2.1,124,0,0,100,1
2,65.0,1,160,1,20,0,327000.00,2.7,116,0,0,8,1
3,53.0,1,91,0,20,1,418000.00,1.4,139,0,0,43,1
4,50.0,1,582,1,20,1,279000.00,1.0,134,0,0,186,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
294,63.0,1,122,1,60,0,267000.00,1.2,145,1,0,147,0
295,45.0,0,308,1,60,1,377000.00,1.0,136,1,0,186,0
296,70.0,0,97,0,60,1,220000.00,0.9,138,1,0,186,0
297,53.0,1,446,0,60,1,263358.03,1.0,139,1,0,215,0


# Table information

The Heart Disease UCI dataset (table `heart_disease_ICU`) contains 13 fields that can be used to predict heart disease in patients. These fields include resting blood pressure (`trestbps`), fasting blood sugar (`fbs`) and maximum heart rate achieved (`thalach`), so the dataset focuses on biological indicators for each of the 303 patients. The 14th field (`target`) marks the presence of heart disease as 1 and its absence as 0.

Meanwhile, the Heart Failure Clinical Data dataset provides biological indicators as well as behavioral risk factors, such as whether a patient smokes. This dataset can also be used to study mortality among patients with heart failure (`DEATH_EVENT`), and it would be interesting to examine the relationships between the two tables and their variables.

We queried both tables and confirmed that they contain no null or missing values, which simplifies our analysis.
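A minimal sketch of the kind of missing-value check described above, written in plain Python rather than BigQuery so that it runs anywhere. The `count_missing` helper is hypothetical, and the sample rows copy three columns from the first heart failure records shown earlier; the queries the team actually ran are not part of this notebook.

```python
def count_missing(rows, columns):
    """Return {column: number of rows where the value is None or absent}."""
    return {col: sum(1 for row in rows if row.get(col) is None) for col in columns}

# Illustrative sample: three records copied from the heart failure output above.
sample_rows = [
    {"age": 42.0, "ejection_fraction": 15, "DEATH_EVENT": 1},
    {"age": 46.0, "ejection_fraction": 17, "DEATH_EVENT": 1},
    {"age": 65.0, "ejection_fraction": 20, "DEATH_EVENT": 1},
]

missing = count_missing(sample_rows, ["age", "ejection_fraction", "DEATH_EVENT"])
print(missing)  # every count is 0 for this sample
```

A column with any count above zero would need to be imputed or filtered before modeling.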

In [None]:
# Goal: find out who is more likely to have heart failure and which risk factors matter most (rj)
# Target variable: heart failure outcome (DEATH_EVENT)
# Number of observations: 299
# Number of features: 12


In [23]:
%%bigquery 
select * from `heart_disease_dataset.heart_failure_clinical_data` where DEATH_EVENT = 1 order by age

Unnamed: 0,age,anaemia,creatinine_phosphokinase,diabetes,ejection_fraction,high_blood_pressure,platelets,serum_creatinine,serum_sodium,sex,smoking,time,DEATH_EVENT
0,42.0,1,250,1,15,0,213000.00,1.30,136,0,0,65,1
1,45.0,0,582,0,14,0,166000.00,0.80,127,1,0,14,1
2,45.0,0,582,0,20,1,126000.00,1.60,135,1,0,180,1
3,45.0,0,7702,1,25,1,390000.00,1.00,139,1,0,60,1
4,45.0,1,981,0,30,0,136000.00,1.10,137,1,0,11,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
91,90.0,1,47,0,40,1,204000.00,2.10,132,1,1,8,1
92,90.0,1,60,1,50,0,226000.00,1.00,134,1,0,30,1
93,94.0,0,582,1,38,1,263358.03,1.83,134,1,0,27,1
94,95.0,1,112,0,40,1,196000.00,1.00,138,0,0,24,1


In [34]:
%%bigquery
SELECT count(anaemia) as A, CAST(ROUND(count(anaemia)/299*100,2) AS STRING)||'%' as percentage_of_anaemia
FROM  `heart_disease_dataset.heart_failure_clinical_data` 
where anaemia = 1 


Unnamed: 0,A,percentage_of_anaemia
0,129,43.14%
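The percentage queries in this notebook divide by a hard-coded 299. The sketch below shows the same calculation with the denominator taken from the data itself, so the result stays correct if rows are added or filtered. The `share` helper is hypothetical, and the `anaemia` list is rebuilt from the counts reported above (129 of 299) rather than read from the table.

```python
def share(flags):
    """Percentage of truthy flags, formatted with two decimals (e.g. '43.14%')."""
    if not flags:
        raise ValueError("cannot compute a share of zero rows")
    pct = 100 * sum(1 for f in flags if f) / len(flags)
    return f"{pct:.2f}%"

# Rebuild the anaemia column from the reported counts: 129 of 299 patients.
anaemia = [1] * 129 + [0] * (299 - 129)
print(share(anaemia))  # 43.14%
```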


In [37]:
%%bigquery
SELECT count(*) as D, CAST(ROUND(count(*)/299*100,2) AS STRING)||'%' as percentage_of_diabetes
FROM  `heart_disease_dataset.heart_failure_clinical_data` 
where diabetes = 1 

Unnamed: 0,D,percentage_of_diabetes
0,125,41.81%


In [26]:
%%bigquery
SELECT count(*) as H, CAST(ROUND(count(*)/299*100,2) AS STRING)||'%' as percentage_of_high_blood_pressure
FROM `heart_disease_dataset.heart_failure_clinical_data` 
where high_blood_pressure = 1 

Unnamed: 0,H,percentage_of_high_blood_pressure
0,105,35.12%


In [25]:
%%bigquery
SELECT count(*) as Male, CAST(ROUND(count(*)/299*100,2) AS STRING)||'%' as percentage_of_male
FROM `heart_disease_dataset.heart_failure_clinical_data` 
where sex = 1 

Unnamed: 0,Male,percentage_of_male
0,194,64.88%


In [None]:
# Males make up 64.88% of the heart failure patients in this dataset, which suggests that men are more likely to have heart failure.

In [24]:
%%bigquery
SELECT count(*) as S, CAST(ROUND(count(*)/299*100,2) AS STRING)||'%' as percentage_of_smoking
FROM  `heart_disease_dataset.heart_failure_clinical_data` 
where smoking = 1 

Unnamed: 0,S,percentage_of_smoking
0,96,32.11%


In [5]:
# Distribution of patients across ejection_fraction levels
# Reference ranges (rj): too low < 40, normal 50 to 70, too high > 75
# Note: the query below uses < 50 as the LOW cut-off

In [33]:
%%bigquery
SELECT ejection_fraction_level, CAST(ROUND(count(ejection_fraction)/299*100,2)AS STRING)||'%' AS Percentage
FROM(
SELECT 
*,
-- Note: a value of exactly 50 matches neither of the first two branches, so it is labeled 'HIGH'
CASE 
  WHEN ejection_fraction < 50 THEN 'LOW'
  WHEN ejection_fraction > 50 AND ejection_fraction <70 THEN 'NORMAL'
  ELSE 'HIGH' END AS ejection_fraction_level
FROM `ba775-team-7b.heart_disease_dataset.heart_failure_clinical_data`) AS q1
GROUP BY ejection_fraction_level
-- Order by the numeric count; ordering by the formatted string sorts character by character
ORDER BY count(ejection_fraction) DESC

Unnamed: 0,ejection_fraction_level,Percentage
0,LOW,79.93%
1,NORMAL,12.37%
2,HIGH,7.69%
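The CASE expression above uses strict inequalities on both sides of 50, so a value of exactly 50 falls into the ELSE branch. The sketch below shows half-open bins that leave no gap at the boundary. The cut-offs 50 and 70 come from the query; treating exactly 50 as NORMAL is an assumption about the intended behavior, and the `ef_level` helper is hypothetical. These labels are illustrative, not clinical guidance.

```python
def ef_level(ejection_fraction):
    """Label an ejection fraction (percent) using half-open bins with no gaps."""
    if ejection_fraction < 50:
        return "LOW"
    if ejection_fraction < 70:  # 50 <= value < 70, so exactly 50 counts as NORMAL
        return "NORMAL"
    return "HIGH"

for value in (49, 50, 69, 70):
    print(value, ef_level(value))
```

Checking the values on either side of each cut-off, as the loop does, is a quick way to confirm that every input lands in the intended bin.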


In [None]:
# Distribution of patients across serum_creatinine levels
# Levels (rj): 0.5 to 1.2 mg/dL (normal), 1.2 to 2.2, over 2.2

In [31]:
%%bigquery
SELECT Creatinine_Level, CAST(ROUND(count(Creatinine_Level)/299*100,2)AS STRING)||'%' AS Percentage
FROM(
SELECT 
*,
-- Note: values of exactly 1.2, and any value below 0.5, fall through to the ELSE branch ('>2.2')
CASE 
  WHEN serum_creatinine >= 0.5 AND serum_creatinine <1.2 THEN '0.5-1.2 (normal)'
  WHEN serum_creatinine > 1.2 AND serum_creatinine <2.2 THEN '1.2-2.2'
  ELSE '>2.2' END AS Creatinine_Level
FROM `ba775-team-7b.heart_disease_dataset.heart_failure_clinical_data`) AS q1
GROUP BY Creatinine_Level
-- Order by the numeric count rather than the formatted percentage string
ORDER BY count(Creatinine_Level) DESC

Unnamed: 0,Creatinine_Level,Percentage
0,0.5-1.2 (normal),58.19%
1,1.2-2.2,24.08%
2,>2.2,17.73%
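When percentages are stored as formatted strings such as '7.69%', sorting compares characters, not numbers. The sketch below contrasts the two orderings using the three ejection fraction percentages reported earlier. The `pct_value` helper is hypothetical.

```python
def pct_value(text):
    """Convert a string such as '7.69%' to the float 7.69."""
    return float(text.rstrip("%"))

percentages = ["79.93%", "7.69%", "12.37%"]

as_strings = sorted(percentages, reverse=True)                 # character-by-character
as_numbers = sorted(percentages, key=pct_value, reverse=True)  # numeric

print(as_strings)  # ['79.93%', '7.69%', '12.37%']
print(as_numbers)  # ['79.93%', '12.37%', '7.69%']
```

Sorting on the underlying count or on the numeric percentage avoids the problem whenever the values differ in their number of digits.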


In [6]:
# What percentage of heart failure patients died during the follow-up period? (rj)

In [30]:
%%bigquery
SELECT count(DEATH_EVENT) as death, CAST(ROUND(count(DEATH_EVENT)/299*100,2) AS STRING)||'%' as percentage
FROM `heart_disease_dataset.heart_failure_clinical_data`
where DEATH_EVENT = 1

Unnamed: 0,death,percentage
0,96,32.11%


In [2]:
# Which age group is most likely to have heart failure?
# Ranges: < 35, 35 to 45, 45 to 55, 55 to 65, > 65 (rj)
# Patients aged 65 and over form the largest group in this dataset (38.46%)

In [28]:
%%bigquery
select age_group,CAST(ROUND(count(age_group)/299 * 100,2) AS STRING)||'%' AS percentage_age_group
from
(select *,
case 
when age < 35 then '< 35'
when age <45 then '35-45'
when age <55 then '45-55'
when age <65 then '55-65'
else '>65' end as age_group -- this group also includes age = 65 exactly
from `heart_disease_dataset.heart_failure_clinical_data`)
group by age_group
order by age_group

Unnamed: 0,age_group,percentage_age_group
0,35-45,6.02%
1,45-55,25.75%
2,55-65,29.77%
3,>65,38.46%
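The age grouping above can be sketched in Python with `bisect`, which keeps the bin edges in one list. The edges 35, 45, 55 and 65 come from the query; the label wording is an assumption chosen to make each boundary explicit, and `age_group` is a hypothetical helper.

```python
from bisect import bisect_right

AGE_EDGES = [35, 45, 55, 65]
AGE_LABELS = ["<35", "35 to <45", "45 to <55", "55 to <65", "65+"]

def age_group(age):
    """Return the label of the half-open age bin that contains `age`."""
    # bisect_right counts how many edges are <= age, which is the bin index.
    return AGE_LABELS[bisect_right(AGE_EDGES, age)]

for age in (34.9, 35, 64.9, 65, 95):
    print(age, age_group(age))
```

As in the SQL, an age equal to an edge belongs to the bin that starts at that edge, so 65 is grouped with the oldest patients.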
