# Abstract
The goal of this project was to demonstrate a linear or logistic regression on a given dataset. One of the initial challenges was understanding what qualities to look for in a dataset for this task. While prioritizing continuous data over categorical data helped, the most important quality was how well I could interpret or understand the results, regardless of how cleanly the model fit the data.

Inspired by the work of [Davide Chicco & Giuseppe Jurman](https://bmcmedinformdecismak.biomedcentral.com/articles/10.1186/s12911-020-1023-5#Sec2), I set out to see how well a basic logistic regression would serve as a prediction model for the death of a patient with a heart condition.

The model correctly classified about 81% of the records it was trained on, which adds some confidence to the results found in the reference paper. In that study, they used neural networks in their final model, which is a much more flexible model than a basic regression. Moving forward, it would be interesting to see how a larger data set, paired with a more flexible model, would improve the accuracy of the predictions.

# Importing Data and Libraries

In [12]:
from import_file import *

data = pd.read_csv("../CSV/heart_failure_clinical_records_dataset_cleaned.csv")
display(data.head())

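| Unnamed: 0 | age | anaemia | creatinine_phosphokinase | diabetes | ejection_fraction | high_blood_pressure | platelets | serum_creatinine | serum_sodium | sex | smoking | time | DEATH_EVENT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 75 | 0 | 582 | 0 | 20 | 1 | 265000.0 | 1.9 | 130 | 1 | 0 | 4 | 1 |
| 1 | 65 | 0 | 146 | 0 | 20 | 0 | 162000.0 | 1.3 | 129 | 1 | 1 | 7 | 1 |
| 2 | 50 | 1 | 111 | 0 | 20 | 0 | 210000.0 | 1.9 | 137 | 1 | 0 | 7 | 1 |
| 3 | 90 | 1 | 47 | 0 | 40 | 1 | 204000.0 | 2.1 | 132 | 1 | 1 | 8 | 1 |
| 4 | 75 | 1 | 246 | 0 | 15 | 0 | 127000.0 | 1.2 | 137 | 1 | 0 | 10 | 1 |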
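The `Unnamed: 0` column in the output above looks like a row index that was written into the CSV when the cleaned file was saved. The following is a minimal sketch, assuming that is the case, of reading it back as the index so that it does not end up among the model features. The `csv_text` string is a stand-in built from the five rows shown above, not the full file.

```python
import io

import pandas as pd

# Stand-in for the CSV file: the header and the five rows shown above.
# The empty first header field is what pandas displays as "Unnamed: 0".
csv_text = """,age,anaemia,creatinine_phosphokinase,diabetes,ejection_fraction,high_blood_pressure,platelets,serum_creatinine,serum_sodium,sex,smoking,time,DEATH_EVENT
0,75,0,582,0,20,1,265000.0,1.9,130,1,0,4,1
1,65,0,146,0,20,0,162000.0,1.3,129,1,1,7,1
2,50,1,111,0,20,0,210000.0,1.9,137,1,0,7,1
3,90,1,47,0,40,1,204000.0,2.1,132,1,1,8,1
4,75,1,246,0,15,0,127000.0,1.2,137,1,0,10,1
"""

# Default read: the saved index comes back as an ordinary column.
raw = pd.read_csv(io.StringIO(csv_text))

# index_col=0 tells pandas to use the first column as the row index instead.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print("default read:", list(raw.columns)[:2])
print("index_col=0: ", list(indexed.columns)[:2])
```

With the real file, the same `index_col=0` argument could be passed to the `pd.read_csv` call in the cell above.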


# Logistic Regression Model
The linear regression model had an $R^2 = 0.39$, whereas the logistic regression scored an accuracy of $0.807$. The two numbers measure different things and are not directly comparable, but since the variable we are predicting is binary, the logistic regression model is the more appropriate choice.
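As a brief illustration of why the binary target favors the logistic model (the symbols $\beta_0$, $\beta$, and $x$ are introduced here for illustration and do not appear in the notebook): a linear regression predicts $\hat{y} = \beta_0 + \beta^\top x$, which can fall outside $[0, 1]$, while a logistic regression passes the same linear combination through the logistic function so that the output can be read as a probability:

$$
P(\text{DEATH\_EVENT} = 1 \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta^\top x)}}
$$

A patient is then classified as a death event when this probability exceeds a threshold, commonly $0.5$.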

In [15]:
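# Features: every column except the target.
# Note: this keeps the "Unnamed: 0" index column and the follow-up "time" column.
X = data.drop(columns=["DEATH_EVENT"])
Y = data["DEATH_EVENT"]

# Hold out 20% of the rows for testing; the fixed seed makes the split reproducible.
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.20, random_state=2020)

model_logistic = LogisticRegression()
model_logistic = model_logistic.fit(X_train, Y_train)
# Mean accuracy on the training split (X_test and Y_test are not used here).
print("Logistic Score:\t", model_logistic.score(X_train, Y_train))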

Logistic Score:	 0.8071748878923767
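The score above is computed on `X_train`, the same rows the model was fit on. Below is a minimal sketch of how the held-out split could be scored as well, and compared against a majority-class baseline. It runs on synthetic data from `make_classification` as a stand-in for the heart failure table, so the numbers it prints say nothing about the real dataset. The names `X_demo`, `y_demo`, and `clf` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: 300 rows and 12 features, chosen arbitrarily).
X_demo, y_demo = make_classification(n_samples=300, n_features=12, random_state=2020)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=2020
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

train_acc = clf.score(X_tr, y_tr)  # accuracy on rows the model has seen
test_acc = clf.score(X_te, y_te)   # accuracy on held-out rows

# Baseline: always predict the most common class in the training labels.
majority = int(y_tr.mean() >= 0.5)
baseline_acc = float((y_te == majority).mean())

print(f"train accuracy:    {train_acc:.3f}")
print(f"test accuracy:     {test_acc:.3f}")
print(f"baseline accuracy: {baseline_acc:.3f}")
```

In the notebook, the equivalent call would be `model_logistic.score(X_test, Y_test)`. A test accuracy close to the training accuracy, and clearly above the baseline, would be better evidence of predictive power than the training score alone.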


# Conclusion
A training accuracy of $0.81$ suggests that these features carry real predictive signal for the life or death of a patient facing a heart condition, although it was measured on the same data the model was fit on. Unfortunately, I have not found a way to extract a feature ranking from the fitted sklearn model, although moving forward I would like to compare such a ranking with the features Davide Chicco & Giuseppe Jurman concluded were the strongest indicators. Moreover, the study would definitely benefit from a larger sample size to model from.
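On the feature question: scikit-learn's `LogisticRegression` does not perform feature selection as such, but a fitted model exposes its learned coefficients through the `coef_` attribute. The sketch below shows one possible way to turn those into a rough ranking, assuming the features are standardized first so that coefficient magnitudes are comparable. It uses synthetic data with made-up column names (`strong`, `weak`, `noise`), not the heart failure features, and coefficient size is only a heuristic for importance.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: "strong" drives the label, "weak" helps a little, "noise" not at all.
rng = np.random.default_rng(2020)
X_demo = pd.DataFrame(rng.normal(size=(400, 3)), columns=["strong", "weak", "noise"])
signal = 3.0 * X_demo["strong"] + 0.5 * X_demo["weak"]
y_demo = (signal + rng.normal(size=400) > 0).astype(int)

# Standardize, then fit, so each coefficient is on a per-standard-deviation scale.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_demo, y_demo)

# coef_ has shape (1, n_features) for a binary target; take the single row.
coefs = pd.Series(
    pipe.named_steps["logisticregression"].coef_[0], index=X_demo.columns
)

# Order features by absolute coefficient, largest first.
ranked = coefs.reindex(coefs.abs().sort_values(ascending=False).index)
print(ranked)
```

Applied to the notebook's `X_train` and `Y_train`, a ranking like this could be set side by side with the features highlighted in the reference paper.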