### College of Computing and Informatics, Drexel University
### INFO 213: Data Science Programming II
---

## Project Proposal

## Project Title: Cardiovascular Diseases Risk Prediction

## Student(s): Kevin Shi, Rithvik Sukumaran

#### Date: 8/10/2023
---

#### Purpose
---
You are asked to propose a final project and present it in class. This proposal should describe the problem, the data sets, and the goal(s) of the project. Use the Project Requirements at the end of this notebook for choosing and scoping your project.

### 1. Introduction
---
*(Introduce the project and describe the objectives.)*

Welcome to our Cardiovascular Diseases Risk Prediction Project! Our primary objective is to develop a comprehensive understanding of the risk factors associated with cardiovascular diseases (CVDs) and to create predictive models that assess an individual's susceptibility to these conditions based on personal lifestyle factors. Cardiovascular diseases are a significant health issue for millions of people around the world. Existing health conditions and lifestyle factors play a crucial role in determining an individual's vulnerability to CVDs, and understanding these factors is essential for assessing a person's future health outlook. Our data come from the 2021 Behavioral Risk Factor Surveillance System (BRFSS) survey conducted by the Centers for Disease Control and Prevention (CDC). The dataset we use was derived from that survey and contains 308,854 records and 19 variables, all related to lifestyle factors that have commonly been associated with an increased risk of various cardiovascular diseases. By analyzing the dataset, we can pinpoint high-risk populations, identify trends, and formulate health suggestions, and we can also build machine learning models that predict the risk of CVDs from personal lifestyle factors.

Here are the objectives of our project:


*   Create a K-Nearest Neighbors classification model for predictive analyses
*   Create a Random Forest classifier for predictive analyses
*   Create a Naive Bayes classification model for predictive analyses
*   Use accuracy, precision, recall, and F1-score to evaluate the performance of our models
*   Visualize the results of our machine learning and predictive models
*   Compare our models and analyze their effectiveness
*   Discover and analyze the key attributes associated with CVD risk prediction
*   Publish and share our results with interested parties (related companies, organizations, or the general public)
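
The four evaluation metrics named in the objectives can all be derived from the counts of true/false positives and negatives. The block below is a minimal sketch of that calculation in plain Python, not our final evaluation code; the function name `classification_metrics` and the toy label lists are made up for illustration only. In the project itself, scikit-learn's `accuracy_score`, `precision_score`, `recall_score`, and `f1_score` would likely be used instead.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall, and F1 for a binary task.

    Hypothetical helper for illustration; `positive` marks the class
    treated as "has CVD".
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    # Guard every division so empty or one-sided inputs return 0.0.
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)

    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up labels (1 = CVD, 0 = no CVD); these are NOT project results.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Reporting precision and recall alongside accuracy matters here because a model can reach high accuracy simply by predicting the majority class, which would be of little use for risk prediction.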



### 2. Problem Definition
---
*(Define the problem that will be solved in this data analytics project.)*

The problem we aim to solve is the prediction of CVD based on a select set of risk factors. More specifically, we aim to predict the *risk* of CVD from the risk factors present in the data. In doing so, we will learn which factors have the greatest impact on one's risk of CVD.

This project operates as a case study on practical applications of machine learning techniques in a medical context. Many studies before us have proven the effectiveness of artificial intelligence in medicine, but as the limelight tends towards complex deep learning methodologies addressing similarly complex problems, our study focuses on a simpler problem using simpler methods. To this end, we aim to show that for many applications, the simpler machine learning algorithms will suffice.

The application of artificial intelligence in medicine is an important topic to discuss. It is no surprise that AI has a variety of applications in many different fields, medicine included. In a medical context, machine learning can be used to support and supplement clinicians, diagnosticians, and other medical professionals in their tasks. Letting artificial intelligence pick up the slack reduces the burden on these experts and allows them to spend less time on trivial issues.

We aim to show that while large volumes of data may be required, it is not necessary to invest huge amounts of resources into developing complex learning algorithms to make an impact in common medical tasks. The simpler models will more than suffice for a majority of applications.

### 3. Data Sources
---
*(Describe the origin of the data sources. What is the format of the original data? How to access the data?)*



*   The Behavioral Risk Factor Surveillance System (BRFSS) collects comprehensive health-related data from U.S. residents through telephone surveys conducted by various state health departments in collaboration with the Centers for Disease Control and Prevention (CDC).
*   Our dataset is derived from the BRFSS data collected by the CDC in 2021 (the data were accessed from a local machine).
*   The original dataset contained 438,693 records and 304 features. However, because many of the features were irrelevant to this particular study, a smaller set of 19 features, all related to lifestyle factors that have commonly been associated with an increased risk of various cardiovascular diseases, was incorporated into our Cardiovascular Diseases Risk Prediction Dataset.
*   Restricting the data to these 19 features for building predictive models of cardiovascular diseases reduced the number of usable records from 438,693 to 308,854.
*   Our dataset therefore has 308,854 rows (records) and 19 columns (features). Of the 19 features, 7 are numerical, 9 are categorical, and 3 are ordinal.
*   The format of our dataset is a csv file.
*   To access the data, visit [Kaggle](https://www.kaggle.com/datasets/alphiree/cardiovascular-diseases-risk-prediction-dataset), and download the csv data file.
*   More information about the origin of the data source and the study performed can be found in this [research article](https://eajournals.org/ejcsit/vol11-issue-3-2023/integrated-machine-learning-model-for-comprehensive-heart-disease-risk-assessment-based-on-multi-dimensional-health-factors/). We will reference this research article in our predictive analyses.
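
As a quick sanity check on the figures above, the row count and the split between numerical and non-numerical columns can be verified directly from the csv file. The block below is a minimal sketch using only the Python standard library; the helper name `summarize_csv` is hypothetical, and the column names in the in-memory sample are assumptions about the Kaggle file rather than a confirmed schema.

```python
import csv
import io


def summarize_csv(handle):
    """Count data rows and split columns into numeric and non-numeric.

    A column counts as numeric only if every value parses as a float.
    Hypothetical helper for illustration.
    """
    reader = csv.DictReader(handle)
    columns = reader.fieldnames or []
    is_numeric = {name: True for name in columns}
    n_rows = 0

    for row in reader:
        n_rows += 1
        for name in columns:
            if is_numeric[name]:
                try:
                    float(row[name])
                except (TypeError, ValueError):
                    is_numeric[name] = False

    numeric_cols = [c for c in columns if is_numeric[c]]
    other_cols = [c for c in columns if not is_numeric[c]]
    return n_rows, numeric_cols, other_cols


# Tiny made-up sample; column names are assumed, values are not real data.
sample = io.StringIO(
    "Exercise,Heart_Disease,BMI,Smoking_History\n"
    "Yes,No,24.5,No\n"
    "No,Yes,31.2,Yes\n"
    "Yes,No,27.8,Yes\n"
)
n_rows, numeric_cols, other_cols = summarize_csv(sample)
print(n_rows, numeric_cols, other_cols)
```

To run the same check on the downloaded file, pass `open(path, newline="")` instead of the in-memory sample, where `path` points to wherever the csv was saved. If the figures above hold, the check should report 308,854 rows and 7 numeric columns out of 19; the remaining 12 would be the categorical and ordinal features.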



### 4. The Goal(s) of the predictions
---
*(What are the expected results of the project?)*



1.   Identifying Key Risk Factors

After conducting exploratory data analysis (EDA) and researching the common causes of cardiovascular diseases according to trusted sources such as the CDC, the National Health Service, and the World Health Organization, we gained a general understanding of the factors most likely to lead to the development of CVDs. Features in our dataset that we expect to affect the risk of having CVDs include:

*   Physical activity level
*   Diabetes
*   Smoking history
*   Gender
*   Age
*   Fried potato consumption

It is important to note, however, that although we expect these features to be among the leading scientific and medical causes of CVDs, not all of them may turn out to be significant contributors in our machine learning models. Nevertheless, we hope that the results of our models and predictions will support the established scientific knowledge about the causes of CVDs. We may also uncover legitimate findings that have not been reported before, which could themselves be valid results.

2.   Model Assessment

Our expected results are the performance metrics of our three models, as well as any insights on the risk factors gained over the course of the study. The performance metrics of our models (which will include accuracy along with precision, recall, and F1) allow us to assess the effectiveness of machine learning in predicting CVD.

3.   Public Health Insights

The knowledge gained about risk factors can benefit us and others when it comes to managing our health and preventing CVD. Recognizing the specific lifestyle factors that have the most significant impact on CVD risk can help the general public look after their health. If our predictive models are successful, we can further develop personalized recommendations for minimizing CVD risk.

4.   Research Contribution

Our results could also prove useful to doctors, researchers, and other professionals. They can lend further support to established findings as well as serve as evidence for hypotheses that have not yet been tested.
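
For the first goal above (identifying key risk factors), one simple starting point during EDA is to compare the share of CVD-positive records across the levels of a single feature. The block below is a rough sketch of that comparison in plain Python; the helper name `positive_rate_by_group`, the column names, and the toy records are all made up for illustration and say nothing about the real data.

```python
from collections import defaultdict


def positive_rate_by_group(records, feature, target, positive="Yes"):
    """Return the fraction of positive-target records per feature level.

    Hypothetical helper for illustration; `records` is a list of dicts.
    """
    totals = defaultdict(int)
    positives = defaultdict(int)
    for record in records:
        group = record[feature]
        totals[group] += 1
        if record[target] == positive:
            positives[group] += 1
    return {group: positives[group] / totals[group] for group in totals}


# Made-up records; these are NOT drawn from the BRFSS data.
records = [
    {"Smoking_History": "Yes", "Heart_Disease": "Yes"},
    {"Smoking_History": "Yes", "Heart_Disease": "No"},
    {"Smoking_History": "No", "Heart_Disease": "No"},
    {"Smoking_History": "No", "Heart_Disease": "No"},
    {"Smoking_History": "Yes", "Heart_Disease": "Yes"},
    {"Smoking_History": "No", "Heart_Disease": "Yes"},
]
rates = positive_rate_by_group(records, "Smoking_History", "Heart_Disease")
print(rates)
```

A large gap between groups only suggests that a feature is worth a closer look. It does not establish that the feature is a cause, and the fitted models may still rank features differently once the other variables are taken into account.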





---
(*Use the following requirements for writing your reports. DO NOT DELETE THE CELLS BELOW*)

# Project Requirements

This final project examines the level of knowledge the students have learned from the course. The following course outcomes will be checked against the content of the report:

Upon successful completion of this course, a student will be able to:
* Describe the key Python tools and libraries that relate to a typical data analytics project.
* Identify data science libraries, frameworks, modules, and toolkits in Python that efficiently implement the most common data science algorithms and techniques.
* Apply the latest Python techniques in data acquisition, transformation, and predictive analytics for data science projects.
* Discuss the underlying principles and main characteristics of the most common methods and techniques for data analytics.
* Build data analytic and predictive models for real world data sets using existing Python libraries.

**Marking will be focused on both presentation and content.**

## Written Presentation Requirements
The report will be judged on the basis of visual appearance, grammatical correctness, and quality of writing, as well as its contents. Please make sure that the text of your report is well-structured, using paragraphs, full sentences, and other features of well-written presentation.

## Technical Content:
* Is the problem well defined and described thoroughly?
* Is the size and complexity of the data set used in this project comparable to that of the example data sets used in the lectures and assignments?
* Did the report describe the characteristics of the data?
* Did the report describe the goals of the data analysis?
* Did the analysis conduct exploratory analyses on the data?
* Did the analysis build models of the data and evaluate the performance of the models?
* Overall, what is the rating of this project?