# CS-E5710 BAYESIAN DATA ANALYSIS - PROJECT
## Table of Contents:
* [1. INTRODUCTION](#1)
* [2. DESCRIPTION OF DATA AND THE ANALYSIS PROBLEM](#2)
* [3. MODELS' DESCRIPTION](#3)
* [4. MODEL COMPARISON](#4)
* [5. PREDICTIVE PERFORMANCE ASSESSMENT](#5)
* [6. DISCUSSION AND POTENTIAL IMPROVEMENTS](#6)
* [7. CONCLUSION](#7)
* [8. SELF-REFLECTION LESSONS](#8)

## 1. INTRODUCTION<a class="anchor" id="1"></a>
Predicting a patient's heart condition is crucial for making sound medical decisions and possibly saving lives. A misdiagnosis of heart disease may cause serious problems: left without adequate treatment, the disease can lead to irreversible damage to the heart muscle and can be life-threatening. A correct diagnosis of heart disease status is therefore vital not only for patients' health but also for their lives.
Unfortunately, heart disease status cannot be diagnosed easily: many blood vessels in the human body lead directly to the heart, and checking the condition of all of them with an imaging method such as a coronary angiogram is expensive and time-consuming. Hence, before deciding whether to conduct such complicated examinations, it is important for the doctor to accurately assess the patient's heart condition based on easily measurable parameters such as age, sex, chest pain type, resting blood pressure, and blood cholesterol level.

The report is divided into ... different sections:\
• **Section 1** introduces the application domain of the Heart Disease Detection Bayesian problem\
• **Section 2** explains how the Bayesian problem is formulated. The aim of this section is to define what data points represent and which properties of a data point are used as its features and labels.\
• **Section 3** visualizes the data and relation between the features and the labels\
• ...

## 2. DESCRIPTION OF DATA AND THE ANALYSIS PROBLEM<a class="anchor" id="2"></a>

1. age
2. sex: 0 = female, 1 = male
3. cp: chest pain type (4 values): 1 = typical angina, 2 = atypical angina, 3 = non-anginal pain, 4 = asymptomatic
4. trestbps: resting blood pressure
5. chol: serum cholesterol in mg/dl
6. fbs: fasting blood sugar > 120 mg/dl (1 = True, 0 = False)
7. restecg: resting electrocardiographic results (0 = normal, 1 = having ST-T wave abnormality, 2 = showing probable or definite left ventricular hypertrophy by Estes's criteria)
8. thalach: maximum heart rate achieved
9. exang: exercise-induced angina
10. oldpeak: ST depression induced by exercise relative to rest
11. slope: the slope of the peak exercise ST segment
12. ca: number of major vessels (0-4) colored by fluoroscopy
13. thal: 3 = normal; 6 = fixed defect; 7 = reversible defect
14. target: angiographic disease status (0 = <50% diameter narrowing, no disease present; 1 = >50% diameter narrowing, disease present)

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### Read the CSV file
df = pd.read_csv('heart.csv')

### Preview the first rows of the data
df.head()

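| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |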
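
The preview shows only the first five rows, so it cannot by itself confirm that the data is complete. A minimal sketch of an explicit missing-value check follows. The helper name `count_missing` is hypothetical, and the small frame is built from three columns of the rows shown above, with one artificial gap inserted purely for illustration.

```python
import numpy as np
import pandas as pd


def count_missing(frame: pd.DataFrame) -> pd.Series:
    """Return the number of missing (NaN) entries in each column."""
    return frame.isna().sum()


# Illustrative frame: three columns taken from the preview above.
# The NaN in 'chol' is artificial, inserted only to show a non-zero count.
sample = pd.DataFrame(
    {
        "age": [63, 37, 41, 56, 57],
        "chol": [233, 250, np.nan, 236, 354],
        "target": [1, 1, 1, 1, 1],
    }
)

missing = count_missing(sample)
print(missing)
```

On the full data the call would be `count_missing(df)`; a count of zero in every column indicates that no entries are missing.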


In [6]:
### Standardize the numeric features (zero mean, unit variance)
from sklearn.preprocessing import StandardScaler

SS = StandardScaler()
col_to_scale = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak']
df[col_to_scale] = SS.fit_transform(df[col_to_scale])
df.head()

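| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.952197 | 1 | 3 | 0.763956 | -0.256334 | 1 | 0 | 0.015443 | 0 | 1.087338 | 0 | 0 | 1 | 1 |
| 1 | -1.915313 | 1 | 2 | -0.092738 | 0.072199 | 0 | 1 | 1.633471 | 0 | 2.122573 | 0 | 0 | 2 | 1 |
| 2 | -1.474158 | 0 | 1 | -0.092738 | -0.816773 | 0 | 0 | 0.977514 | 0 | 0.310912 | 2 | 0 | 2 | 1 |
| 3 | 0.180175 | 1 | 1 | -0.663867 | -0.198357 | 0 | 1 | 1.239897 | 0 | -0.206705 | 2 | 0 | 2 | 1 |
| 4 | 0.290464 | 0 | 0 | -0.663867 | 2.08205 | 0 | 1 | 0.583939 | 1 | -0.379244 | 2 | 0 | 2 | 1 |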
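
To make the scaling step concrete, the sketch below reproduces by hand what `StandardScaler` computes for one column: subtract the column mean and divide by the population standard deviation (`ddof=0`). The five `age` values come from the first preview and serve only as toy input, so the results differ from the scaled values above, which use the mean and standard deviation of all rows.

```python
import numpy as np

# The five 'age' values from the first preview, used only as toy input.
age = np.array([63.0, 37.0, 41.0, 56.0, 57.0])

# StandardScaler divides by the population standard deviation (ddof=0).
mean = age.mean()
std = age.std(ddof=0)
age_scaled = (age - mean) / std

# After scaling, the column has mean 0 and standard deviation 1
# (up to floating-point error).
print(age_scaled.mean(), age_scaled.std(ddof=0))
```

Putting all numeric features on the same scale also makes it easier to reason about a single prior scale for the regression coefficients later on.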


Our application can be modelled as a machine learning problem in which **data points** represent patients who have already undergone heart tests. Each data point is characterized by 13 health parameters (features) such as age, sex, chest pain type, and resting blood pressure. The **label** (quantity of interest) of a data point is the heart disease status, where 0 indicates no heart disease and 1 indicates its presence.
We gathered the data points with known label values from the patient health records in the heart disease dataset of the UCI Machine Learning Repository, available at: https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/

The above data repository contains health records from four locations: Cleveland and Long Beach (USA), Switzerland, and Hungary. However, only the Cleveland dataset is in good condition and maintained; the other datasets have a lot of missing information. We therefore use the Cleveland dataset, which consists of 303 data points suitable for our machine learning problem.
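
The data point and label formulation described in this section can be sketched as a split into a feature matrix and a label vector. The names `toy`, `X`, and `y` are illustrative assumptions, and the toy frame holds only a few columns and rows copied from the first preview.

```python
import pandas as pd

# Toy frame with a subset of columns; values copied from the first preview.
toy = pd.DataFrame(
    {
        "age": [63, 37, 41],
        "sex": [1, 1, 0],
        "chol": [233, 250, 204],
        "target": [1, 1, 1],
    }
)

# Features: every column except the label.
X = toy.drop(columns="target")
# Label: heart disease status (0 = no presence, 1 = presence).
y = toy["target"]

print(X.shape, y.shape)
```

Applied to the full `df`, the same two lines would give a feature matrix with 13 columns and a label vector with one entry per patient.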

## 3. MODELS' DESCRIPTION<a class="anchor" id="3"></a>

### 3.1 Linear regression model
TODO:
1. Theoretical/mathematical description of the model
2. Explanation of the choice of priors
3. Convergence diagnostics (Rhat, ESS) and HMC-specific convergence diagnostics (divergences, tree depth)
4. Posterior predictive checks and what was done to improve the model
5. Sensitivity analysis with respect to the prior choices
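
Among the convergence diagnostics listed above, Rhat compares the variance between chains with the variance within chains; values close to 1 suggest the chains have mixed. The sketch below implements the basic split-Rhat formula on synthetic draws. It omits the rank normalization that newer diagnostics add, and both the function name `split_rhat` and the synthetic chains are illustrative assumptions, not part of the project code.

```python
import numpy as np


def split_rhat(chains: np.ndarray) -> float:
    """Basic split-Rhat for draws of shape (n_chains, n_draws)."""
    n_draws = chains.shape[1]
    half = n_draws // 2
    # Split every chain into a first and a second half.
    halves = np.concatenate(
        [chains[:, :half], chains[:, n_draws - half:]], axis=0
    )
    # W: mean of the within-chain variances.
    within = halves.var(axis=1, ddof=1).mean()
    # B: between-chain variance, scaled by the chain length.
    between = half * halves.mean(axis=1).var(ddof=1)
    # Pooled estimate of the marginal posterior variance.
    var_plus = (half - 1) / half * within + between / half
    return float(np.sqrt(var_plus / within))


rng = np.random.default_rng(0)
mixed = rng.normal(size=(4, 1000))           # chains sharing one distribution
stuck = mixed + np.arange(4).reshape(4, 1)   # chains centred at 0, 1, 2, 3

rhat_mixed = split_rhat(mixed)
rhat_stuck = split_rhat(stuck)
print(rhat_mixed, rhat_stuck)
```

The well-mixed chains should give a value very close to 1, while the shifted chains give a clearly larger value, which is the pattern to look for in the real posterior draws.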

### 3.2 Logistic regression model
TODO:
1. Theoretical/mathematical description of the model
2. Explanation of the choice of priors
3. Convergence diagnostics (Rhat, ESS) and HMC-specific convergence diagnostics (divergences, tree depth)
4. Posterior predictive checks and what was done to improve the model
5. Sensitivity analysis with respect to the prior choices

### 3.3 Hierarchical model
TODO:
1. Theoretical/mathematical description of the model
2. Explanation of the choice of priors
3. Convergence diagnostics (Rhat, ESS) and HMC-specific convergence diagnostics (divergences, tree depth)
4. Posterior predictive checks and what was done to improve the model
5. Sensitivity analysis with respect to the prior choices

## 4. MODEL COMPARISON<a class="anchor" id="4"></a>

## 5. PREDICTIVE PERFORMANCE ASSESSMENT<a class="anchor" id="5"></a>
TODO:
Assessment of predictive performance (e.g. classification accuracy) and evaluation of the practical usefulness of that accuracy. If not applicable, explain why predictive performance assessment is not applicable in this case.
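
As a minimal sketch of the classification accuracy mentioned above, the helper below thresholds predicted probabilities and compares them with the true labels. The function name `classification_accuracy`, the 0.5 threshold, and the example numbers are all assumptions for illustration; in the project, `p_pred` might be, for instance, the posterior mean predicted probability per patient.

```python
import numpy as np


def classification_accuracy(y_true, p_pred, threshold=0.5):
    """Share of cases where the thresholded probability matches the label."""
    y_true = np.asarray(y_true)
    y_hat = (np.asarray(p_pred) >= threshold).astype(int)
    return float((y_hat == y_true).mean())


# Made-up labels and predicted probabilities, for illustration only.
y_true = [1, 0, 1, 1, 0]
p_pred = [0.9, 0.2, 0.4, 0.7, 0.6]

acc = classification_accuracy(y_true, p_pred)
print(acc)
```

For a diagnostic setting, accuracy alone can hide the difference between missed disease cases and false alarms, so the threshold is worth treating as a choice rather than a fixed value.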

## 6. DISCUSSION AND POTENTIAL IMPROVEMENTS<a class="anchor" id="6"></a>

## 7. CONCLUSION<a class="anchor" id="7"></a>
TODO: What was learned from the data analysis

## 8. SELF-REFLECTION LESSONS<a class="anchor" id="8"></a>
TODO: What the group learned during the project