![ine-divider](https://user-images.githubusercontent.com/7065401/92672068-398e8080-f2ee-11ea-82d6-ad53f7feb5c0.png)
<hr>

### Introduction to Data Science with Python — Starter Pass

# Predicting heart disease

In this project, you will use a dataset from a cardiovascular study of residents of the town of Framingham, Massachusetts. The dataset holds patient information: over 4,000 records with 15 attributes each.

The goal of this project is to **predict whether the patient has a 10-year risk of future coronary heart disease (CHD)**. To do that, you will need to put into practice all the topics you saw in previous lessons.

![heart](https://user-images.githubusercontent.com/7065401/103839631-9583ce80-506e-11eb-87fe-3ebf2a7a0be8.png)

![orange-divider](https://user-images.githubusercontent.com/7065401/92672455-187a5f80-f2ef-11ea-890c-40be9474f7b7.png)

## Knowing our data

Before starting, it's important to load all the libraries we'll be using and to understand the data we'll be working with.

---
### Attributes

#### Demographic

- Sex: male or female
- Age: Age of the patient
- Education: no further information provided

#### Behavioral

- Current Smoker: whether or not the patient is a current smoker
- Cigs Per Day: the number of cigarettes that the person smoked on average in one day

#### Information on medical history
- BP Meds: whether or not the patient was on blood pressure medication
- Prevalent Stroke: whether or not the patient had previously had a stroke
- Prevalent Hyp: whether or not the patient was hypertensive
- Diabetes: whether or not the patient had diabetes

#### Information on current medical condition
- Tot Chol: total cholesterol level
- Sys BP: systolic blood pressure
- Dia BP: diastolic blood pressure
- BMI: Body Mass Index
- Heart Rate: heart rate. In medical research, variables such as heart rate, though in fact discrete, are treated as continuous because of their large number of possible values.
- Glucose: glucose level

#### Target variable to predict
- TenYearCHD: 10-year risk of coronary heart disease (binary: “1:Yes”, “0:No”)


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
df = pd.read_csv('data/heart.csv')

Take a look at some records of your data, analyze columns and values:

In [None]:
# your code goes here
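
A first look usually combines `head`, `shape`, `dtypes` and `describe`. The sketch below is one possible way to do it, run on a tiny made-up frame so that it is self-contained. The column names `age`, `sysBP` and `glucose` are assumptions about the CSV (only `TenYearCHD` is named above), and the values are invented. With the real data, the same calls apply to `df`.

```python
import numpy as np
import pandas as pd

# Tiny made-up stand-in for the real data. Column names other than
# TenYearCHD are assumptions, not taken from data/heart.csv.
sample = pd.DataFrame({
    "age": [39, 46, 48, 61, 46],
    "sysBP": [106.0, 121.0, 127.5, 150.0, 130.0],
    "glucose": [77.0, 76.0, np.nan, 103.0, 85.0],
    "TenYearCHD": [0, 0, 0, 1, 0],
})

print(sample.head())      # first rows
print(sample.shape)       # (rows, columns)
print(sample.dtypes)      # type of each column
print(sample.describe())  # summary statistics for numeric columns

# Class balance of the target, as proportions
target_share = sample["TenYearCHD"].value_counts(normalize=True)
print(target_share)
```

Checking the class balance early is worthwhile: if one class is rare, plain accuracy can look good even for a model that never predicts it.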


![orange-divider](https://user-images.githubusercontent.com/7065401/92672455-187a5f80-f2ef-11ea-890c-40be9474f7b7.png)

## Dealing with missing values

Check whether there are any missing values in the data.

> If there are, remove the rows with missing values. A more advanced approach would be to impute the missing values, or to remove insignificant columns with more than 30% missing values.

In [None]:
# your code goes here
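
The check-then-drop approach described above can be sketched as follows. This is a minimal example on a made-up frame; the column names `age`, `glucose` and `BMI` are assumptions, and with the real data the same calls apply to `df`.

```python
import numpy as np
import pandas as pd

# Made-up frame with a few gaps; column names are assumptions.
sample = pd.DataFrame({
    "age": [39, 46, 48, 61, 46],
    "glucose": [77.0, np.nan, 70.0, 103.0, np.nan],
    "BMI": [26.97, 28.73, np.nan, 28.58, 23.10],
    "TenYearCHD": [0, 0, 0, 1, 0],
})

# Missing values per column, as counts and as a share of all rows
missing_count = sample.isna().sum()
missing_share = sample.isna().mean()
print(missing_count)
print(missing_share)

# Simple approach: drop every row that has at least one missing value
cleaned = sample.dropna().reset_index(drop=True)
print(cleaned.shape)
```

`missing_share` is the number to compare against the 30% rule of thumb mentioned above when deciding whether to drop a whole column instead of rows.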


![orange-divider](https://user-images.githubusercontent.com/7065401/92672455-187a5f80-f2ef-11ea-890c-40be9474f7b7.png)

## Correlation analysis

Plot a heatmap showing correlation between variables.

In [None]:
# your code goes here
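
One way to draw such a heatmap is `DataFrame.corr` combined with seaborn's `heatmap`. The sketch below uses randomly generated data in which `diaBP` is deliberately built from `sysBP`, so one strongly correlated pair shows up. The column names and all values are assumptions for illustration only.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs as a script; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic data: diaBP is constructed from sysBP plus noise (assumption for the demo)
rng = np.random.default_rng(0)
n = 200
sys_bp = rng.normal(130, 20, n)
sample = pd.DataFrame({
    "sysBP": sys_bp,
    "diaBP": 0.6 * sys_bp + rng.normal(0, 3, n),
    "age": rng.integers(30, 70, n),
    "TenYearCHD": rng.integers(0, 2, n),
})

# Pairwise Pearson correlation between all numeric columns
corr = sample.corr()

fig, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
ax.set_title("Correlation matrix")
fig.tight_layout()
```

Fixing `vmin=-1` and `vmax=1` keeps the colour scale symmetric, so strong positive and strong negative correlations are equally easy to spot.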


---
### Have you detected any highly correlated variables?

If so, remove one of them to avoid collinearity.

In [None]:
# your code goes here
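
Instead of reading pairs off a plot, the same decision can be made programmatically. The following is a sketch under stated assumptions: the data is synthetic, the column names are hypothetical, and the `0.8` threshold is an arbitrary cut-off picked for this example, not a value prescribed by the exercise.

```python
import numpy as np
import pandas as pd

# Synthetic features: diaBP is constructed from sysBP plus noise (assumption for the demo)
rng = np.random.default_rng(0)
n = 200
sys_bp = rng.normal(130, 20, n)
features = pd.DataFrame({
    "sysBP": sys_bp,
    "diaBP": 0.6 * sys_bp + rng.normal(0, 3, n),
    "age": rng.normal(50, 8, n),
})

threshold = 0.8  # arbitrary cut-off chosen for this sketch

# Absolute correlations, keeping only the upper triangle so each pair appears once
corr = features.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# A column is dropped if it correlates strongly with any column listed before it
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
reduced = features.drop(columns=to_drop)
print(to_drop)
```

Run this on the feature columns only. The target should stay out of the comparison, since a feature that correlates with the target is exactly what the model needs.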


![orange-divider](https://user-images.githubusercontent.com/7065401/92672455-187a5f80-f2ef-11ea-890c-40be9474f7b7.png)

## Features and Labels

Assign to the `X` variable your features, and to the `y` variable your labels.

In [None]:
# your code goes here
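
A common pattern is to drop the target column to get the features and select it to get the labels. The sketch below uses a made-up frame; `age` and `sysBP` are assumed column names, while `TenYearCHD` is the target named above.

```python
import pandas as pd

# Made-up frame; feature column names are assumptions.
sample = pd.DataFrame({
    "age": [39, 46, 48, 61],
    "sysBP": [106.0, 121.0, 127.5, 150.0],
    "TenYearCHD": [0, 0, 0, 1],
})

X = sample.drop(columns="TenYearCHD")  # every column except the target
y = sample["TenYearCHD"]               # the target as a Series

print(X.shape, y.shape)
```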


![orange-divider](https://user-images.githubusercontent.com/7065401/92672455-187a5f80-f2ef-11ea-890c-40be9474f7b7.png)

## Standardization

As the `X` features aren't on the same scale, let's standardize them.

To do that use the `StandardScaler` from scikit-learn.

In [None]:
from sklearn.preprocessing import StandardScaler

# your code goes here
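
`StandardScaler` rescales each column to zero mean and unit variance. A minimal sketch on made-up values (the column names are assumptions):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up features on clearly different scales; column names are assumptions.
X = pd.DataFrame({
    "age": [39.0, 46.0, 48.0, 61.0],
    "sysBP": [106.0, 121.0, 127.5, 150.0],
})

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # NumPy array, one column per feature

print(X_scaled.mean(axis=0))  # approximately 0 for every column
print(X_scaled.std(axis=0))   # approximately 1 for every column
```

Note that `fit_transform` returns a NumPy array, so column names are lost unless you wrap the result back into a `DataFrame`. A stricter variant fits the scaler on the training split only and applies `transform` to the test split, so that test-set statistics do not influence training.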


![orange-divider](https://user-images.githubusercontent.com/7065401/92672455-187a5f80-f2ef-11ea-890c-40be9474f7b7.png)

## Train and test splits

Let's split the `X` and `y` data into training and testing sets.

Keep 80% of the data in the training set and 20% of the data in the test set.

> You can set a `random_state` to get the same split every time the code is run.

In [None]:
from sklearn.model_selection import train_test_split

# your code goes here
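
An 80/20 split can be sketched as below, on synthetic arrays. The `random_state=42` value is arbitrary, and `stratify=y` is an optional extra not required by the exercise: it keeps the class proportions the same in both sets, which helps when one class is rare.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 3 features, 15% positive labels (assumption for the demo)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([0] * 85 + [1] * 15)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,    # 20% of the rows go to the test set
    random_state=42,  # fixed seed so the split is reproducible
    stratify=y,       # optional: preserve class proportions in both sets
)

print(X_train.shape, X_test.shape)
```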


![orange-divider](https://user-images.githubusercontent.com/7065401/92672455-187a5f80-f2ef-11ea-890c-40be9474f7b7.png)

## My first model

Now you will need to create a model to make predictions. In this case, it will be a `LogisticRegression` classifier:

1. Create the classifier `clf` model.
2. Fit/train that model with your `X_train` and `y_train` data.
3. Get the predictions of that model over your `X_test` set.
4. Get the score of that model using your `X_test` and `y_test` data.


In [None]:
from sklearn.linear_model import LogisticRegression

# your code goes here
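
The four steps above can be sketched as follows. The data here is synthetic, with labels generated from the first two features, so the score it prints says nothing about the heart dataset. `max_iter=1000` is a choice made for this sketch to avoid convergence warnings.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data: labels depend on the first two features plus noise (assumption for the demo)
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=300) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

clf = LogisticRegression(max_iter=1000)  # 1. create the classifier
clf.fit(X_train, y_train)                # 2. train it on the training set
y_pred = clf.predict(X_test)             # 3. predict labels for the test set
score = clf.score(X_test, y_test)        # 4. mean accuracy on the test set

print(score)
```

For classifiers, `score` returns accuracy: the share of test rows whose predicted label matches the true label. If the positive class is rare, it is worth also looking at `sklearn.metrics.classification_report` or a confusion matrix.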


![orange-divider](https://user-images.githubusercontent.com/7065401/92672455-187a5f80-f2ef-11ea-890c-40be9474f7b7.png)

## Other classifiers

scikit-learn offers many methods/algorithms for this type of problem. Are we sure that the `LogisticRegression` classifier we used is the best model?

Let's try other methods/algorithms to see if any of them can achieve a better score.

> You can try a `RandomForestClassifier`, a `DecisionTreeClassifier`, a `LinearSVC`, or any other classifier from scikit-learn.


In [None]:
# your code goes here
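
Because scikit-learn classifiers share the same `fit` / `score` interface, several of them can be compared in a loop. This is a sketch on synthetic data with mostly default hyperparameters; the ranking it prints does not carry over to the heart dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: labels depend on the first two features plus noise (assumption for the demo)
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=300) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=42),
    "RandomForestClassifier": RandomForestClassifier(n_estimators=100, random_state=42),
    "LinearSVC": LinearSVC(max_iter=5000),
}

# Fit every model on the same training data and score it on the same test data
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)

# Print the models from best to worst test accuracy
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

A single train/test split gives a noisy comparison. `sklearn.model_selection.cross_val_score` is one way to get a more stable estimate for each model.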


![orange-divider](https://user-images.githubusercontent.com/7065401/92672455-187a5f80-f2ef-11ea-890c-40be9474f7b7.png)