# **Lab One: Visualization and Data Preprocessing**

*Contributors:* Balaji Avvaru, Joshua Eysenbach, Vijay Kaniti, Daniel Turner

## **Business Understanding**

This analysis uses a dataset of patient records labeled with the presence or absence of a cardiovascular disease diagnosis. It contains 11 attributes that were gathered to identify patient characteristics that correlate with heart disease.

This dataset was obtained from Kaggle (https://www.kaggle.com/sulianova/cardiovascular-disease-dataset), but the original source of the data is unclear.

The goal of any prediction algorithm using this data is to determine whether any of these attributes, alone or in combination, can accurately predict a cardiovascular disease diagnosis. Such prediction models could provide valuable insight into which conditions or behaviors are correlated with heart disease, and could aid diagnosis or help mitigate the disease by clarifying its possible causes.

The features included are described on the Kaggle page for this data as being separable into three categories:
* *Objective*: Factual initial information about the patient;
* *Examination*: Information resulting from medical examination;
* *Subjective*: Information given by the patient.

The distinctions between these categories are paramount to interpreting and qualifying the results of any analysis or modelling of this data, because each category carries a different degree of validity and different potential biases. We keep this in mind as we explore the data and when we eventually make recommendations.

**TODO: Need more on the specific measurable qualities of the prediction models (which of accuracy, specificity, etc. are more important?)**
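
Accuracy and specificity, mentioned above, can both be computed from the four counts of a binary confusion matrix. The block below is a minimal sketch rather than part of our analysis: the function name `classification_rates` and the toy labels are hypothetical, and sensitivity is included only because it is the usual counterpart to specificity.

```python
def classification_rates(y_true, y_pred):
    """Return accuracy, sensitivity, and specificity for binary labels (1 = disease)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Share of actual disease cases that were flagged.
        "sensitivity": tp / (tp + fn),
        # Share of actual healthy cases that were cleared.
        "specificity": tn / (tn + fp),
    }


# Toy labels for illustration only; these are not taken from the dataset.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]
rates = classification_rates(y_true, y_pred)
print(rates)
```

The sketch assumes both classes appear in `y_true`; otherwise one of the denominators is zero.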



## **Data Understanding**

### Initial Import

The dataset acquired from Kaggle is stored for our use on GitHub. The code for importing the data is combined with the initial loading of the *pandas* and *NumPy* packages below.

In [18]:
import pandas as pd
import numpy as np

In [23]:
cd = pd.read_csv("https://raw.githubusercontent.com/jteysen/MSDS-7331-Machine-Learning-I/master/Data/cardio_train.csv")

We can verify that the import was successful and get a preview of our data with the *.head()* command.

In [24]:
cd.head()

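| | id | age | gender | height | weight | ap_hi | ap_lo | cholesterol | gluc | smoke | alco | active | cardio |
|---|----|-----|--------|--------|--------|-------|-------|-------------|------|-------|------|--------|--------|
| 0 | 0 | 18393 | 2 | 168 | 62.0 | 110 | 80 | 1 | 1 | 0 | 0 | 1 | 0 |
| 1 | 1 | 20228 | 1 | 156 | 85.0 | 140 | 90 | 3 | 1 | 0 | 0 | 1 | 1 |
| 2 | 2 | 18857 | 1 | 165 | 64.0 | 130 | 70 | 3 | 1 | 0 | 0 | 0 | 1 |
| 3 | 3 | 17623 | 2 | 169 | 82.0 | 150 | 100 | 1 | 1 | 0 | 0 | 1 | 1 |
| 4 | 4 | 17474 | 1 | 156 | 56.0 | 100 | 60 | 1 | 1 | 0 | 0 | 0 | 0 |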


### Feature Descriptions

The attributes included in the dataset are outlined below per their descriptions on the Kaggle page.

|Attribute Name | Category | Description |
|---------------|----------|-------------|
|age | Objective | Age of the Patient (days) |
|height | Objective | Height of the Patient (cm) |
|weight | Objective | Weight of the Patient (kg) |
|gender | Objective | Gender of the Patient |
|ap_hi | Examination | Systolic blood pressure |
|ap_lo | Examination | Diastolic blood pressure |
|cholesterol | Examination | Cholesterol level -  1: normal, 2: above normal, 3: well above normal |
|gluc | Examination | Glucose level - 1: normal, 2: above normal, 3: well above normal |
|smoke | Subjective | Patient does or does not describe themselves as a smoker |
|alco | Subjective | Patient does or does not regularly drink alcohol |
|active | Subjective | Patient does or does not regularly exercise |
|cardio | Target Variable | Diagnosis of presence or absence of cardiovascular disease |
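
Because the category of each attribute matters when interpreting results, it can help to keep the groupings from the table in code. The block below is a small sketch: the column lists come from the table above, but the names `feature_categories` and `category_of` are our own illustrative choices.

```python
# Column groupings copied from the feature description table.
feature_categories = {
    "Objective": ["age", "height", "weight", "gender"],
    "Examination": ["ap_hi", "ap_lo", "cholesterol", "gluc"],
    "Subjective": ["smoke", "alco", "active"],
}

# Reverse lookup: attribute name -> category.
category_of = {
    column: category
    for category, columns in feature_categories.items()
    for column in columns
}

n_features = len(category_of)
print(n_features, "features;", "smoke is", category_of["smoke"])
```

A lookup like this makes it easy to select, for example, only the self-reported columns with `cd[feature_categories["Subjective"]]` once the data is loaded.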

In [25]:
cd.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70000 entries, 0 to 69999
Data columns (total 13 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   id           70000 non-null  int64  
 1   age          70000 non-null  int64  
 2   gender       70000 non-null  int64  
 3   height       70000 non-null  int64  
 4   weight       70000 non-null  float64
 5   ap_hi        70000 non-null  int64  
 6   ap_lo        70000 non-null  int64  
 7   cholesterol  70000 non-null  int64  
 8   gluc         70000 non-null  int64  
 9   smoke        70000 non-null  int64  
 10  alco         70000 non-null  int64  
 11  active       70000 non-null  int64  
 12  cardio       70000 non-null  int64  
dtypes: float64(1), int64(12)
memory usage: 6.9 MB


Note that there are no missing values in the data: each of the 13 columns has 70,000 non-null entries, matching the 70,000 rows.
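
The same check can be made explicitly by counting nulls per column. The sketch below rebuilds only the five rows shown in the preview above so that it runs without a download; the names `sample`, `missing_per_column`, and `total_missing` are illustrative.

```python
import pandas as pd

# The five preview rows from cd.head(), re-entered by hand for a self-contained example.
sample = pd.DataFrame(
    {
        "id": [0, 1, 2, 3, 4],
        "age": [18393, 20228, 18857, 17623, 17474],
        "gender": [2, 1, 1, 2, 1],
        "height": [168, 156, 165, 169, 156],
        "weight": [62.0, 85.0, 64.0, 82.0, 56.0],
        "ap_hi": [110, 140, 130, 150, 100],
        "ap_lo": [80, 90, 70, 100, 60],
        "cholesterol": [1, 3, 3, 1, 1],
        "gluc": [1, 1, 1, 1, 1],
        "smoke": [0, 0, 0, 0, 0],
        "alco": [0, 0, 0, 0, 0],
        "active": [1, 1, 0, 1, 0],
        "cardio": [0, 1, 1, 1, 0],
    }
)

# Count missing entries in each column, then in the whole frame.
missing_per_column = sample.isnull().sum()
total_missing = int(missing_per_column.sum())
print(missing_per_column)
print("Total missing values:", total_missing)
```

On the full dataset, the equivalent call would be `cd.isnull().sum()`, which should agree with the non-null counts reported by `cd.info()`.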