## The Data

At this link, you will find a dataset containing information about heart disease patients: https://www.dropbox.com/scl/fi/0vrpdnq5asmeulc4gd50y/ha_1.csv?rlkey=ciisalceotl77ffqhqe3kujzv&dl=1

A description of the original dataset can be found here: https://archive.ics.uci.edu/dataset/45/heart+disease (However, this dataset has been cleaned and reduced, and the people have been given fictitious names.)

In [12]:
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import SVC
from plotnine import *
df = pd.read_csv('https://www.dropbox.com/scl/fi/0vrpdnq5asmeulc4gd50y/ha_1.csv?rlkey=ciisalceotl77ffqhqe3kujzv&dl=1')

## 1. Logistic Regression

Fit a Logistic Regression using only `age` and `chol` (cholesterol) as predictors.

For a 55-year-old, how high would their cholesterol need to be for the doctors to predict that heart disease is present?

How high would it need to be for the doctors to estimate a 90% chance that heart disease is present?
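Both questions reduce to solving the fitted model's linear equation for cholesterol. As a sketch, writing the fitted intercept and coefficients as $\beta_0$, $\beta_{\text{age}}$, and $\beta_{\text{chol}}$ (notation introduced here, not part of the original assignment), and $p$ for the modeled probability of the class the model treats as positive:

```latex
\log\frac{p}{1-p} = \beta_0 + \beta_{\text{age}}\,\text{age} + \beta_{\text{chol}}\,\text{chol}
\quad\Longrightarrow\quad
\text{chol} = \frac{\log\frac{p}{1-p} - \beta_0 - \beta_{\text{age}}\,\text{age}}{\beta_{\text{chol}}}
```

Setting $p = 0.5$ makes the log-odds $0$, which gives the predicted-class boundary; setting $p = 0.9$ makes the log-odds $\log 9$.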


In [3]:
X = df[['age', 'chol']]
y = df['diagnosis']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

pipeline = Pipeline([
    ('logreg', LogisticRegression())  # Logistic regression
])

# Fit on the full dataset; the train/test split above is not used here
pipeline.fit(X, y)
pipeline.named_steps['logreg'].coef_

array([[0.04686331, 0.00180124]])

In [4]:
pipeline.named_steps['logreg'].intercept_

array([-3.24011226])

In [5]:
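# Solve log(0.9 / 0.1) = intercept + b_age * 55 + b_chol * chol for chol; log(0.9 / 0.1) = log(9)
(np.log(9) + 3.24011226 - 0.04686331 * 55) / 0.00180124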

1587.7144563390887
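The cell above covers the 90% question only. A minimal sketch that handles any probability, including the 50% cutoff used for the predicted class, could look like the following. The function name `chol_threshold` is hypothetical, and the constants are copied from the `coef_` and `intercept_` outputs printed above.

```python
import math

# Intercept and coefficients copied from the fitted LogisticRegression outputs above.
INTERCEPT = -3.24011226
COEF_AGE = 0.04686331
COEF_CHOL = 0.00180124


def chol_threshold(age, prob=0.5):
    """Cholesterol at which the modeled probability equals `prob` for a given age."""
    log_odds = math.log(prob / (1 - prob))  # logit; 0.0 when prob == 0.5
    return (log_odds - INTERCEPT - COEF_AGE * age) / COEF_CHOL


print(chol_threshold(55))       # 50% cutoff, i.e. the predicted-class boundary
print(chol_threshold(55, 0.9))  # 90% cutoff, the same quantity as the cell above
```

With these coefficients the 50% cutoff for a 55-year-old comes out at roughly 368. In scikit-learn's binary `LogisticRegression`, `coef_` describes the log-odds of `classes_[1]`, so it is worth checking `pipeline.named_steps['logreg'].classes_` to confirm that this label is the disease-present one before reading the result as "disease is predicted above this value".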

## 2. Linear Discriminant Analysis

Fit an LDA model using only `age` and `chol` (cholesterol) as predictors.

For a 55-year-old, how high would their cholesterol need to be for the doctors to predict that heart disease is present?

In [6]:
X = df[['age', 'chol']]
y = df['diagnosis']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

pipeline = Pipeline([
    ('LinDA', LinearDiscriminantAnalysis())
])

pipeline.fit(X, y)
pipeline.named_steps['LinDA'].coef_

array([[0.04655744, 0.00178967]])

In [7]:
pipeline.named_steps['LinDA'].intercept_

array([-3.21967766])

In [None]:
# Decision boundary: solve intercept + b_age * 55 + b_chol * chol = 0 for chol
(3.21967766 - 0.04655744 * 55) / 0.00178967
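Typing the coefficients in by hand invites transcription errors. As a sketch, the same boundary can be read directly off a fitted estimator and checked against `decision_function`. The helper name `boundary_chol` is hypothetical, and the data below is a made-up stand-in for `df`, generated only so the example runs on its own.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis


def boundary_chol(model, age):
    """Solve intercept + b_age * age + b_chol * chol = 0 for chol."""
    b_age, b_chol = model.coef_[0]
    return -(model.intercept_[0] + b_age * age) / b_chol


# Synthetic stand-in for df[['age', 'chol']] and df['diagnosis'] (made-up data).
rng = np.random.default_rng(0)
n = 200
age = rng.normal(55, 9, n)
chol = rng.normal(245, 50, n)
label = (0.05 * age + 0.004 * chol + rng.normal(0, 1, n) > 3.8).astype(int)
X = np.column_stack([age, chol])

lda = LinearDiscriminantAnalysis().fit(X, label)
chol_star = boundary_chol(lda, 55)

# The decision score at the computed point should be zero up to rounding error.
score = lda.decision_function([[55, chol_star]])[0]
print(chol_star, score)
```

The same helper should work for any binary linear classifier that exposes `coef_` and `intercept_` in this shape, which includes `LogisticRegression` and `SVC(kernel='linear')`.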

## 3. Support Vector Classifier

Fit an SVC model using only `age` and `chol` as predictors. Don't forget to tune the regularization parameter.

For a 55-year-old, how high would their cholesterol need to be for the doctors to predict that heart disease is present?

In [9]:
X = df[['age', 'chol']]
y = df['diagnosis']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

pipeline = Pipeline([
    ('SVC', SVC(kernel='linear'))  # Specify 'linear' kernel
])

pipeline.fit(X, y)
pipeline.named_steps['SVC'].coef_

array([[0.06439772, 0.00365896]])

In [10]:
pipeline.named_steps['SVC'].intercept_

array([-4.68603406])

In [11]:
# Decision boundary: solve intercept + b_age * 55 + b_chol * chol = 0 for chol
(4.68603406 - 0.06439772 * 55) / 0.00365896

312.70072916894406
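The prompt for this section asks for the regularization parameter to be tuned, while the fit above uses the default `C=1.0`. One possible way to tune it is a cross-validated grid search, sketched below. The grid values, the 5-fold setting, and the added `StandardScaler` step are assumptions rather than part of the original notebook, and the data is a made-up stand-in for `df`.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for df[['age', 'chol']] and df['diagnosis'] (made-up data).
rng = np.random.default_rng(1)
n = 200
age = rng.normal(55, 9, n)
chol = rng.normal(245, 50, n)
y = (0.05 * age + 0.004 * chol + rng.normal(0, 1, n) > 3.8).astype(int)
X = np.column_stack([age, chol])

# Scaling is an assumption added here; it keeps the linear SVC fast to fit.
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('svc', SVC(kernel='linear')),
])

# Candidate values for C (smaller C means stronger regularization).
param_grid = {'svc__C': [0.01, 0.1, 1, 10]}

search = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy')
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

If the scaler is kept, the `coef_` and `intercept_` of the tuned SVC are in standardized units, so they would need converting back with the scaler's `mean_` and `scale_` before being used in a boundary calculation on raw `age` and `chol` values.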

## 4. Comparing Decision Boundaries

Make a scatterplot of `age` and `chol`, coloring the points by their true disease outcome. Add a line to the plot representing the **linear separator** (aka **decision boundary**) for each of the three models above.

In [None]:
X = df[['age', 'chol']]
y = df['diagnosis']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

pipeline = Pipeline([
    ('logreg', LogisticRegression()),
])

pipeline.fit(X_train, y_train)
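The cell above stops after fitting a single model. One way to finish the plot is to turn each model's intercept and coefficients into a slope and intercept for `chol` as a function of `age`, then draw the lines with `geom_abline`. This is only a sketch: the helper names are hypothetical, and the numbers are copied from the full-data fits printed in sections 1 to 3, not from the `X_train` fit in the cell above.

```python
import pandas as pd

# (intercept, age coefficient, chol coefficient) copied from the outputs printed earlier.
FITS = {
    'Logistic Regression': (-3.24011226, 0.04686331, 0.00180124),
    'LDA': (-3.21967766, 0.04655744, 0.00178967),
    'SVC': (-4.68603406, 0.06439772, 0.00365896),
}


def boundary_lines(fits):
    """Rewrite b0 + b_age*age + b_chol*chol = 0 as chol = intercept + slope*age."""
    rows = []
    for name, (b0, b_age, b_chol) in fits.items():
        rows.append({
            'model': name,
            'intercept': -b0 / b_chol,
            'slope': -b_age / b_chol,
        })
    return pd.DataFrame(rows)


def boundary_plot(df, lines):
    """Scatterplot of age vs. chol colored by diagnosis, with one line per model."""
    # Imported here so the line computation can be used without plotnine installed.
    from plotnine import ggplot, aes, geom_point, geom_abline

    return (
        ggplot(df, aes(x='age', y='chol', color='diagnosis'))
        + geom_point()
        + geom_abline(
            aes(intercept='intercept', slope='slope', linetype='model'),
            data=lines,
        )
    )


lines = boundary_lines(FITS)
print(lines)
# In the notebook, with df loaded as above: boundary_plot(df, lines)
```

Evaluating each line at `age = 55` reproduces the per-model cholesterol thresholds computed in the earlier sections, which is a quick consistency check on the slopes and intercepts.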
