# Cardiovascular Disease Dataset Analysis

This notebook explores the cardiovascular disease dataset described in the data dictionary below: it loads the data, cleans it, and examines individual features.

## Environment Setup
1. `cd environment`
2. `conda env create -f project_env.yml`
3. `conda activate project_env`

If that doesn't work, install the dependencies directly from a notebook cell:
1. `!pip install pandas numpy matplotlib scikit-learn`

## Data Dictionary
- `patientid`: Patient Identification Number (Numeric)
- `age`: Age in Years (Numeric)
- `gender`: Gender (Binary: 0 = female, 1 = male)
- `chestpain`: Chest pain type (Nominal: 0 = typical angina, 1 = atypical angina, 2 = non-anginal pain, 3 = asymptomatic)
- `restingBP`: Resting blood pressure (Numeric: 94-200 mm Hg)
- `serumcholestrol`: Serum cholesterol (Numeric: 126-564 mg/dl)
- `fastingbloodsugar`: Fasting blood sugar > 120 mg/dl (Binary: 0 = false, 1 = true)
- `restingelectro`: Resting electrocardiogram results (Nominal: 0 = normal, 1 = having ST-T wave abnormality, 2 = showing probable or definite left ventricular hypertrophy)
- `maxheartrate`: Maximum heart rate achieved (Numeric: 71-202 bpm)
- `exerciseangia`: Exercise induced angina (Binary: 0 = no, 1 = yes)
- `oldpeak`: ST depression induced by exercise (Numeric: 0-6.2)
- `slope`: Slope of peak exercise ST segment (Nominal: 1 = upsloping, 2 = flat, 3 = downsloping)
- `noofmajorvessels`: Number of major vessels (Numeric: 0-3)
- `target`: Classification (Binary: 0 = Absence of Heart Disease, 1 = Presence of Heart Disease)
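
The ranges and codes listed above can double as a sanity check on the loaded data. The following is a minimal sketch, assuming the dictionary is accurate; the helper name `find_violations` and the toy rows are illustrative and not part of the original notebook.

```python
import pandas as pd

# Ranges and codes transcribed from the data dictionary above.
NUMERIC_RANGES = {
    "restingBP": (94, 200),
    "serumcholestrol": (126, 564),
    "maxheartrate": (71, 202),
    "oldpeak": (0.0, 6.2),
    "noofmajorvessels": (0, 3),
}
CATEGORY_CODES = {
    "gender": {0, 1},
    "chestpain": {0, 1, 2, 3},
    "fastingbloodsugar": {0, 1},
    "restingelectro": {0, 1, 2},
    "exerciseangia": {0, 1},
    "slope": {1, 2, 3},
    "target": {0, 1},
}


def find_violations(df: pd.DataFrame) -> dict:
    """Return {column: count of non-missing values outside the documented range}.

    Hypothetical helper for illustration; missing values are skipped, not counted.
    """
    violations = {}
    for col, (low, high) in NUMERIC_RANGES.items():
        if col in df.columns:
            values = df[col].dropna()
            bad = int((~values.between(low, high)).sum())
            if bad:
                violations[col] = bad
    for col, allowed in CATEGORY_CODES.items():
        if col in df.columns:
            values = df[col].dropna()
            bad = int((~values.isin(allowed)).sum())
            if bad:
                violations[col] = bad
    return violations


# Toy rows, made up for illustration (not taken from the dataset)
toy = pd.DataFrame({
    "restingBP": [120, 250],        # 250 exceeds the documented maximum
    "slope": [1, 0],                # 0 is not a documented slope code
    "noofmajorvessels": [0, None],  # the missing value is ignored
    "target": [0, 1],
})
print(find_violations(toy))
```

On the real data, `find_violations(data)` would return an empty dictionary if every column stays within its documented range.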

## Data Loading and Cleaning
We will load the dataset and perform basic cleaning, handling any missing values and preparing the data for analysis.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import math

In [2]:
# Load the dataset
data = pd.read_csv('/path_to_dataset/cardiovascular_disease.csv')

# Check the structure of the dataset
print('Dataset Shape: (number of rows, number of columns)')
print(data.shape)

# Display the first few rows
data.head()

Dataset Shape: (number of rows, number of columns)
(303, 14)


The output shows 303 patient records and 14 attributes, and the columns displayed by `head()` match the data dictionary. Next, we will check for missing values and clean the data.
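
The column comparison can also be done programmatically rather than by eye. This is a small sketch, assuming the column names in the data dictionary are spelled exactly as in the CSV; `check_columns` is a hypothetical helper and the toy frame is invented to show a mismatch.

```python
import pandas as pd

# Column names as listed in the data dictionary
EXPECTED_COLUMNS = [
    "patientid", "age", "gender", "chestpain", "restingBP",
    "serumcholestrol", "fastingbloodsugar", "restingelectro",
    "maxheartrate", "exerciseangia", "oldpeak", "slope",
    "noofmajorvessels", "target",
]


def check_columns(df: pd.DataFrame):
    """Return (missing, unexpected) column names relative to EXPECTED_COLUMNS."""
    missing = [c for c in EXPECTED_COLUMNS if c not in df.columns]
    unexpected = [c for c in df.columns if c not in EXPECTED_COLUMNS]
    return missing, unexpected


# Toy frame with one deliberately misspelled column ("Target" instead of "target")
toy = pd.DataFrame(columns=EXPECTED_COLUMNS[:-1] + ["Target"])
missing, unexpected = check_columns(toy)
print("Missing:", missing)
print("Unexpected:", unexpected)
```

Calling `check_columns(data)` on the loaded frame should return two empty lists when the file matches the dictionary.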

In [3]:
# Check for missing values
print('Missing values in each column:')
print(data.isnull().sum())

Missing values in each column:
patientid            0
age                  0
gender               0
chestpain            0
restingBP            0
serumcholestrol      0
fastingbloodsugar    0
restingelectro       0
maxheartrate         0
exerciseangia        0
oldpeak              0
slope                0
noofmajorvessels     4
target               0
dtype: int64


There are 4 missing values in the `noofmajorvessels` column. Because this feature is an integer count, we impute them with the column mean rounded up to the nearest integer.

In [4]:
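# Impute missing values in 'noofmajorvessels' with the mean, rounded up to an integer
mean_value = int(math.ceil(data['noofmajorvessels'].mean()))
print(f"Imputed missing values with mean: {mean_value}")
# Assign the result back: in-place fillna on a selected column may not update
# the DataFrame in recent pandas versions. Cast to int so the column is int64 again.
data['noofmajorvessels'] = data['noofmajorvessels'].fillna(mean_value).astype(int)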

Imputed missing values with mean: 1
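
For a discrete count such as `noofmajorvessels`, the rounded-up mean is only one possible fill value. The sketch below compares it with the median and the mode on a toy column. The values are made up for illustration, and choosing the mode here is an assumption, not the notebook's method.

```python
import math
import pandas as pd

# Toy column with two missing entries (values invented for illustration)
vessels = pd.Series([0, 0, 1, 3, None, 0, 2, None, 0], name="noofmajorvessels")

mean_ceiled = int(math.ceil(vessels.mean()))  # strategy used in the cell above
median_value = int(vessels.median())          # less sensitive to a few large counts
mode_value = int(vessels.mode().iloc[0])      # most frequent observed count

print(f"ceil(mean)={mean_ceiled}, median={median_value}, mode={mode_value}")

# Fill with the chosen value and restore an integer dtype
filled = vessels.fillna(mode_value).astype(int)
print(filled.tolist())
```

When most patients have zero affected vessels, the rounded-up mean can pull imputed rows toward a less typical value, which is why the alternatives are worth comparing before settling on one.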


Now that we've imputed missing values, let's check the data types and begin exploring the dataset with visualizations.

In [5]:
# Check data types and overall info
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   patientid          303 non-null    int64  
 1   age                303 non-null    int64  
 2   gender             303 non-null    int64  
 3   chestpain          303 non-null    int64  
 4   restingBP          303 non-null    int64  
 5   serumcholestrol    303 non-null    int64  
 6   fastingbloodsugar  303 non-null    int64  
 7   restingelectro     303 non-null    int64  
 8   maxheartrate       303 non-null    int64  
 9   exerciseangia      303 non-null    int64  
 10  oldpeak            303 non-null    float64
 11  slope              303 non-null    int64  
 12  noofmajorvessels   303 non-null    int64  
 13  target             303 non-null    int64  
dtypes: float64(1), int64(13)
memory usage: 33.3 KB
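
The `info()` output stores nominal features such as `chestpain` and `slope` as `int64`, which hides the fact that their codes are labels rather than quantities. One option, sketched here on a toy frame, is to map the codes to the labels from the data dictionary and store them as categoricals. The constant names and toy values are assumptions for illustration.

```python
import pandas as pd

# Code-to-label mappings transcribed from the data dictionary
CHESTPAIN_LABELS = {
    0: "typical angina",
    1: "atypical angina",
    2: "non-anginal pain",
    3: "asymptomatic",
}
SLOPE_LABELS = {1: "upsloping", 2: "flat", 3: "downsloping"}

# Toy rows, made up for illustration
toy = pd.DataFrame({"chestpain": [0, 2, 3, 2], "slope": [1, 2, 2, 3]})

# Work on a copy so the integer-coded columns stay available
labeled = toy.copy()
labeled["chestpain"] = pd.Categorical(
    toy["chestpain"].map(CHESTPAIN_LABELS),
    categories=list(CHESTPAIN_LABELS.values()),
)
labeled["slope"] = pd.Categorical(
    toy["slope"].map(SLOPE_LABELS),
    categories=list(SLOPE_LABELS.values()),
)

print(labeled.dtypes)
print(labeled)
```

Keeping the full category list means that labels absent from the data (here, "atypical angina") still appear with a count of zero in `value_counts()`, which makes plots comparable across subsets.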


## Data Visualization
We will now visualize the distributions of a few key attributes and look at how `maxheartrate` relates to the target variable (`target` = presence of heart disease).

In [6]:
# Visualize the distribution of age
plt.figure(figsize=(8,6))
plt.hist(data['age'], bins=20, color='lightblue', edgecolor='black')
plt.title('Age Distribution')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()

In [7]:
# Plot the distribution of chest pain types
plt.figure(figsize=(8,6))
data['chestpain'].value_counts().plot(kind='bar', color='salmon', edgecolor='black')
plt.title('Distribution of Chest Pain Types')
plt.xlabel('Chest Pain Type')
plt.ylabel('Count')
plt.show()

In [8]:
# Visualize the relationship between maxheartrate and target
# (target is binary, so points fall on two horizontal bands; alpha reveals overlap)
plt.figure(figsize=(8,6))
plt.scatter(data['maxheartrate'], data['target'], color='green', alpha=0.5)
plt.title('Max Heart Rate vs. Heart Disease Presence')
plt.xlabel('Max Heart Rate')
plt.ylabel('Heart Disease (0 = No, 1 = Yes)')
plt.show()

These plots show the overall distributions of `age` and `chestpain`, and how `maxheartrate` is spread across the two `target` classes. Only the last plot involves the target directly; relating `age` or `chestpain` to heart disease would require splitting those plots by `target`.
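
A numeric summary can complement the plots when relating features to `target`. The sketch below uses a toy frame with invented values, so the numbers it prints say nothing about the real dataset. It shows a per-class mean, a normalized cross-tabulation, and a Pearson correlation, which for a 0/1 column is the point-biserial correlation.

```python
import pandas as pd

# Toy rows, made up for illustration (not taken from the dataset)
toy = pd.DataFrame({
    "maxheartrate": [150, 172, 128, 160, 115, 181],
    "chestpain":    [0, 2, 0, 1, 0, 2],
    "target":       [0, 1, 0, 1, 0, 1],
})

# Mean of a numeric feature within each target class
mean_hr_by_target = toy.groupby("target")["maxheartrate"].mean()

# Share of each target class within every chest pain type (each row sums to 1)
chestpain_vs_target = pd.crosstab(toy["chestpain"], toy["target"], normalize="index")

# Pearson correlation between a numeric feature and the binary target
hr_target_corr = toy["maxheartrate"].corr(toy["target"])

print(mean_hr_by_target)
print(chestpain_vs_target)
print(f"maxheartrate vs target correlation: {hr_target_corr:.2f}")
```

Replacing `toy` with `data` applies the same summaries to the loaded dataset.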