# **Project Proposal**
## Title: ...
Group 8

Team members: ...

## **Introduction**
Heart disease is a subset of cardiovascular diseases, the leading cause of death worldwide. The Framingham Heart Study found that many different factors, such as age, are correlated with heart disease. 

The question we will try to answer with this project is: Can we use `age`, `exang` (exercise induced angina), `??` to predict whether someone will be diagnosed with heart disease (yes or no)? 

The dataset we will use is the "Heart Disease" Data Set, which consists of 14 attributes: 
* `age` (in years)
* `sex` (1 = male; 0 = female)
* `cp`: Chest pain type
    - Value 1: typical angina
    - Value 2: atypical angina
    - Value 3: non-anginal pain
    - Value 4: asymptomatic
* `trestbps`: resting blood pressure (mmHg)
* `chol`: serum cholesterol (mg/dl)
* `fbs`: fasting blood sugar > 120 mg/dl (1 = true; 0 = false) 
* `restecg`: resting electrocardiographic results
* `thalach`: maximum heart rate achieved 
* `exang`: exercise-induced angina (1 = yes; 0 = no)
* `oldpeak`: ST depression induced by exercise relative to rest
* `slope`: slope of the peak exercise ST segment
    - Value 1: upsloping
    - Value 2: flat
    - Value 3: downsloping
* `ca`: number of major vessels (0, 1, 2, 3)
* `thal`: 
    - Value 3 = normal
    - Value 6 = fixed defect
    - Value 7 = reversible defect
* `num`: diagnosis of heart disease
    - Value 0: Healthy, <50% diameter narrowing
    - Value 1: Diagnosed with stage 1, >50% diameter narrowing
    - Value 2: Diagnosed with stage 2, >50% diameter narrowing
    - Value 3: Diagnosed with stage 3, >50% diameter narrowing
    - Value 4: Diagnosed with stage 4, >50% diameter narrowing


## **Preliminary exploratory data analysis**

### Reading and cleaning the data

In [None]:
### Run this cell before continuing. 
library(tidyverse)
library(repr)
library(tidymodels)
library(RColorBrewer)

In the following cell we perform five steps:

* We read the data directly from the internet into this R notebook.
* Since the data doesn't have the columns' names in its file, we explicitly write the names from the information we find in https://archive.ics.uci.edu/ml/datasets/Heart+Disease.
* We remove rows that have null values by filtering. In this dataset null values are represented as "?", and we know that `ca` and `thal` contain some of these null values.
* We change the data type of some of the variables, since the webpage mentioned above shows that some variables should be categorical, as they only have a few possible values.
* We create a new variable, `target`, from `num`. Instead of distinguishing the different diagnosis values (`num` = 1, 2, 3 or 4), we only consider whether the person has been diagnosed with heart disease or not.

In [None]:
url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"

# The file has no header row, so the column names are supplied explicitly.
data <- read_csv(url, col_names = c("age", "sex", "cp", "trestbps", "chol", "fbs", "restecg", "thalach",
                                    "exang", "oldpeak", "slope", "ca", "thal", "num")) %>%
        # Drop rows where `ca` or `thal` holds the missing-value marker "?".
        filter(ca != "?" & thal != "?") %>%
        # Convert the categorical variables to factors.
        mutate(sex = as.factor(sex), cp = as.factor(cp), fbs = as.factor(fbs), restecg = as.factor(restecg),
               exang = as.factor(exang), slope = as.factor(slope), thal = as.factor(thal), num = as.factor(num),
               ca = as.factor(ca),
               # Binary diagnosis: 0 = no heart disease, 1 = any diagnosed stage.
               target = ifelse(num==0, 0, 1))
        

head(data, 10)

At this point we can see that the data looks tidy. The dataset already contained tidy data, so we only needed to add the columns' names, remove the null values and specify the data types. We additionally created a new variable to simplify answering our question.
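
As an aside, the "?" markers can also be turned into `NA` while the file is parsed, which keeps numeric columns numeric. The following is a minimal base-R sketch under stated assumptions: the three rows and the reduced set of four columns are hypothetical toy values, not rows from the dataset. `read_csv()` has an `na` argument that plays the same role as `na.strings` here.

```r
# Hypothetical three-row extract in a comma-separated layout;
# the second row has "?" in the `ca` column.
raw_text <- "63,0.0,6.0,0\n67,?,3.0,2\n41,0.0,3.0,0"

toy <- read.csv(text = raw_text, header = FALSE,
                col.names = c("age", "ca", "thal", "num"),
                na.strings = "?")      # treat "?" as missing while parsing

colSums(is.na(toy))                    # number of missing values per column
toy_complete <- na.omit(toy)           # keep only the complete rows
nrow(toy_complete)
```

With this approach the rows with missing values are removed by `na.omit()` instead of by comparing strings against "?".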

### Splitting data into Train and Test

For the next part, we want to explore our dataset, but only the training dataset. Therefore we first split the data into two sub-datasets. We use 75% of our data for training and stratify the split by `num`, the diagnosis variable, so that the training and test datasets keep similar proportions of each diagnosis value.


In [None]:
data_split <- initial_split(data, prop = 0.75, strata = num)  
data_train <- training(data_split)   
data_test <- testing(data_split)

head(data_train, 5)
head(data_test, 5)

Now we can compare the number of rows in the whole dataset with the sizes of the training dataset and the test dataset.

In [None]:
nrow(data)
nrow(data_train)
nrow(data_test)
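
To illustrate what stratifying a split means, here is a minimal base-R sketch on hypothetical toy data (80 observations of one class and 20 of another). It is only an illustration of the idea and not how `initial_split()` is implemented; the seed value is arbitrary and is set only so that the sketch is repeatable.

```r
set.seed(2024)  # arbitrary seed, chosen only so the sketch is repeatable

# Hypothetical toy data: 80 observations of class "0" and 20 of class "1".
toy <- data.frame(id = 1:100,
                  num = factor(rep(c("0", "1"), times = c(80, 20))))

# Sample 75% of the row indices within each class separately.
rows_by_class <- split(seq_len(nrow(toy)), toy$num)
train_idx <- unlist(lapply(rows_by_class, function(idx) {
  idx[sample.int(length(idx), size = floor(0.75 * length(idx)))]
}), use.names = FALSE)

toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

# Class proportions in each subset.
prop.table(table(toy_train$num))
prop.table(table(toy_test$num))
```

Because the sampling is done per class, both subsets keep the 80/20 class balance of the toy data. Calling `set.seed()` before the split in the notebook would likewise make the training and test datasets reproducible between runs.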

### Training Data Summary Table
We perform an exploratory data analysis of the training data and show it in a table format. For each value of `num`, the table reports the number of observations, the means of the numeric variables and the most frequent value of each categorical variable. Rows with missing data were already removed when cleaning the data.

In [None]:
str(data_train)

In [None]:
#library(mlr)
#summarizeColumns(data_train)

# Not used: the mlr package could not be installed in this environment.

In [None]:
# Summarize the training data by diagnosis value.
EDA_data <- data_train %>%
    group_by(num) %>%
    summarize(n = n(),
              mean_age = mean(age, na.rm = TRUE),
              mean_trestbps = mean(trestbps, na.rm = TRUE),
              mean_chol = mean(chol, na.rm = TRUE),
              mean_thalach = mean(thalach, na.rm = TRUE),
              mean_oldpeak = mean(oldpeak, na.rm = TRUE),
              # Most frequent level of each categorical variable within the group
              # (which.max() picks the first level when there is a tie).
              across(c(sex, cp, fbs, restecg, exang, slope, ca, thal),
                     ~ names(which.max(table(.x))),
                     .names = "majority_{.col}"))

EDA_data
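
The `majority_*` columns are meant to hold the most frequent level of each categorical variable. As a minimal sketch of one way to compute that in base R, the helper below counts the values with `table()` and returns the name of the largest count. The helper name `most_frequent` and the toy vector are hypothetical and only serve as an illustration.

```r
# Hypothetical helper: most frequent value of a categorical vector.
most_frequent <- function(x) {
  counts <- table(x)                  # count the occurrences of each value
  names(counts)[which.max(counts)]    # which.max() returns the first maximum on ties
}

# Hypothetical toy chest pain types, not taken from the dataset.
toy_cp <- factor(c(4, 4, 3, 2, 4, 3))
most_frequent(toy_cp)
```

Here the value 4 appears three times, more often than any other value, so the helper returns `"4"` as a character string.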

### Training Data Visualization Analysis

Now we perform another exploratory data analysis to visualize the training data with plots that are relevant to our analysis, such as how the variables relate to each other and to the diagnosis.

In [None]:
# Scatter plot of thalach against oldpeak, coloured by diagnosis (num)

options(repr.plot.width = 10, repr.plot.height = 8) # Set the plot size to an appropriate size

thalach_oldpeak_plot <- data_train %>%
    ggplot(aes(x = thalach, y = oldpeak)) + 
        geom_point(aes(colour = num), alpha = 0.6, size = 4) + # alpha controls the transparency of overlapping points
        labs(x = "Maximum heart rate achieved (thalach)",
             y = "ST depression induced by exercise relative to rest (oldpeak)",
             colour = "Diagnosis (num)") + 
        ggtitle("Maximum heart rate vs. ST depression in the training data") +
        scale_color_manual(values = c("#000000", "#F5213D", "#56B4E9", "#39C45E", "#E6FF00"))


thalach_oldpeak_plot

In [None]:
# Pairwise correlations between the numeric variables, using only the training data.
data_num <- data_train %>%
    select(age, trestbps, chol, thalach, oldpeak)

res <- cor(data_num)
round(res, 2)
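
Each entry of this matrix is a Pearson correlation coefficient, which is the default method of `cor()`. As a minimal sketch of what that number means, the example below computes it by hand for two hypothetical toy vectors (not values from the dataset) and compares the result with `cor()`.

```r
# Hypothetical toy values, not taken from the dataset.
x <- c(1, 2, 3, 4, 5)
y <- c(2, 1, 4, 3, 5)

# Pearson correlation: the sum of the products of the deviations from the means,
# divided by the square root of the product of the sums of squared deviations.
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

r_manual
cor(x, y)
```

Both expressions give 0.8 for these toy vectors. Values close to 1 or -1 indicate a strong linear relationship, and values close to 0 indicate a weak one.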

In [None]:
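library(ggcorrplot)
options(repr.plot.width = 12, repr.plot.height = 12) # Set the plot size to an appropriate size

# Expand the factors into indicator columns, then plot the correlation matrix
# of all resulting columns, using only the training data.
model.matrix(~0+., data = data_train) %>% 
  cor(use = "pairwise.complete.obs") %>% 
  ggcorrplot(show.diag = FALSE, type = "lower", lab = TRUE, lab_size = 2)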