# M8: Heart Disease Dataset

Congratulations, you have made it to the last module! Throughout the course, you have covered the fundamental knowledge and packages needed to apply Python programming in bioinformatics.

The aim of this module is to help you consolidate what you have learned. We will introduce a new dataset for you to analyse and explore. You'll be given a series of exercises designed for you to complete independently, with minimal external assistance.

If you find yourself stuck, make sure to give it a proper attempt on your own first. If that doesn't resolve the issue, revisit earlier modules to refresh your memory. And if you're still unsure, feel free to use Google or consult the solution sheet as a last resort.

## Initial Exploration

The dataset that we will be using is the Heart Disease Dataset from the UC Irvine Machine Learning Repository: https://archive.ics.uci.edu/dataset/45/heart+disease

This dataset is intended for a machine learning task: given a set of patient features, the goal is to predict whether the patient has been diagnosed with heart disease (the target variable). While building such a model is beyond the scope of this course, you will conduct some initial exploratory analyses to become familiar with the dataset and its contents.

These exploratory analyses are often an essential first step, regardless of whether you aim to develop a machine learning model or to pursue other kinds of investigation.

**Exercise 1:**
Take a few minutes to read through the dataset description on the website to familiarise yourself with its structure and the variables it contains.

Great! Now that you've had a look at the dataset description, let's dive into the data itself.

Don't worry if you didn't understand everything — things should become clearer as you familiarise yourself with the dataset through practice.

At the top right of the website, you'll find a button labelled `IMPORT IN PYTHON`. Clicking on it will show you which package to install and how to load the dataset using it.

**Exercise 2:**
Install the required package (using any of the methods you've learned) and import the dataset into your Python environment.

````{dropdown}
```{admonition} Hint
:class: tip

For the first part of the exercises, you mainly need to follow the example code provided on the website.

Take your time going through the initial code; it will help you begin exploring the dataset and understanding what is provided and how to work with it.
```
````

In [1]:
# Your code goes here
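
If you get stuck, the loading step might look roughly like the sketch below. It assumes that the package shown under `IMPORT IN PYTHON` is `ucimlrepo` (installable with `pip install ucimlrepo`) and that the dataset ID is 45, as in the URL above. The helper name `load_heart_disease` is hypothetical; the import sits inside the function so that the definition itself runs even before the package is installed.

```python
def load_heart_disease(dataset_id=45):
    """Fetch the Heart Disease dataset from the UCI repository.

    Needs an internet connection and the `ucimlrepo` package.
    """
    # Imported lazily so this cell can be run before the package is installed.
    from ucimlrepo import fetch_ucirepo

    return fetch_ucirepo(id=dataset_id)


# Uncomment once the package is installed:
# heart_disease = load_heart_disease()
```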


**Exercise 3:**
Create an instance of the Heart Disease dataset using the `fetch_ucirepo` function. Then:
- Check the **type** of the dataset object you've created.
- Use `dir()` on the object to list its attributes.
- Try accessing `.metadata`, `.variables`, and `.data`. What types are these? What kind of information do they contain?

This will help you understand how the dataset is structured and how to navigate it.

````{dropdown}
```{admonition} Hint
:class: tip

Use `type(heart_disease)` and `dir(heart_disease)` to explore the structure of the object.

You’ll find that `heart_disease` has components like `.metadata` (a dictionary), `.variables` (a DataFrame), and `.data` (a dictionary containing the features and target).
```
````

In [2]:
# Your code goes here
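
The same inspection pattern works on any Python object. The sketch below uses a hypothetical stand-in called `dataset` that only mimics the three components named in the exercise; its contents are invented, and the real object may list more names under `dir()`.

```python
from types import SimpleNamespace

import pandas as pd

# Hypothetical stand-in with invented contents; replace it with the
# object returned by fetch_ucirepo.
dataset = SimpleNamespace(
    metadata={"name": "toy dataset"},
    variables=pd.DataFrame({"name": ["feature_a"], "type": ["Integer"]}),
    data={
        "features": pd.DataFrame({"feature_a": [50]}),
        "targets": pd.DataFrame({"target": [0]}),
    },
)

print(type(dataset))

# dir() also lists internal "dunder" names, so keep only the public ones.
public_attributes = [name for name in dir(dataset) if not name.startswith("_")]
print(public_attributes)

# Check the type of each component before deciding how to work with it.
for name in public_attributes:
    print(name, "->", type(getattr(dataset, name)).__name__)
```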

**Exercise 4:**
Print the metadata information about the dataset. Try to answer the following questions based on what you find:
- How many data points (patients) and how many features are there?
- What are the names of the demographic features?
- What is the name of the target variable?
- Are there any missing values in the dataset? If so, how can they be identified?

In [3]:
# Your code goes here

**Exercise 5:**
Print the variable information for the dataset. Try to answer the following questions based on the output:
- Which variables have missing values?
- What is the unit of *resting blood pressure*?
- How many categorical variables are there?

In [4]:
# Your code goes here
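
Questions like these can usually be answered by filtering the variable table rather than reading it by eye. The sketch below uses an invented table; the column names (`name`, `type`, `units`, `missing_values`) are an assumption about what `heart_disease.variables` contains, so check them against your own output first.

```python
import pandas as pd

# Hypothetical variable table: the rows are invented and the column
# names are assumed to match those of heart_disease.variables.
variables = pd.DataFrame(
    {
        "name": ["feature_a", "feature_b", "feature_c"],
        "type": ["Integer", "Categorical", "Categorical"],
        "units": ["years", None, None],
        "missing_values": ["no", "yes", "no"],
    }
)

# Names of the variables flagged as having missing values.
with_missing = variables.loc[variables["missing_values"] == "yes", "name"].tolist()
print(with_missing)

# Units of a single variable, looked up by name.
units = variables.loc[variables["name"] == "feature_a", "units"].iloc[0]
print(units)

# Number of variables of each type.
type_counts = variables["type"].value_counts()
print(type_counts)
```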

```{admonition} Reflection
:class: note

Based on the metadata and variable information, write down answers to the following:

- What types of features are present in the dataset (categorical, integer, real)?
- Which features seem to be demographic?
- Are there any features with missing values?
- What does the target variable represent, and how is it structured?
- Are there any variables you expect might be particularly important for heart disease prediction?

Take a moment to write 3–5 bullet points. This will help you later when you're cleaning the data and doing visualisation or analysis.
```

**Exercise 6:**
Extract the feature values and the target values into two separate variables. Print each of them.
- Do the number of rows and columns make sense?
- Are you able to understand what information they contain based on the metadata exploration?
- If not, revisit the website, metadata, and variable information to clarify.

Next, import the `pandas` library. Combine the feature and target values into a single pandas DataFrame, where each row represents a patient and each column represents a feature, with the final column being the target variable. Display the first five rows of the resulting DataFrame.

````{dropdown}
```{admonition} Hint
:class: tip

If you're unsure how best to combine the features and target, start by checking their types. You'll see that both are already pandas objects, so you can use a pandas command to combine them into the required layout. This should only require one line of code.

Whenever you're unsure how to manipulate the data, the first step should always be to check its type. This will help you better understand what operations are available.
```
````

In [5]:
# Your code goes here
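
One way to place two DataFrames side by side is `pd.concat` with `axis=1`. The sketch below shows this on two small invented tables; the names `X`, `y`, and `df` and all values are placeholders, not the real data.

```python
import pandas as pd

# Toy stand-ins for the feature and target DataFrames (values are invented).
X = pd.DataFrame({"feature_a": [63, 41, 57], "feature_b": [1, 0, 1]})
y = pd.DataFrame({"target": [0, 2, 1]})

# Check the types and shapes first: both should have the same number of rows.
print(type(X), X.shape)
print(type(y), y.shape)

# axis=1 places the frames side by side, aligning rows on the index.
df = pd.concat([X, y], axis=1)
print(df.head())
```

Because `pd.concat` aligns on the index, it is worth confirming that both inputs share the same index before combining them.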

## Data Cleaning

Now that we have everything in a single, tidy DataFrame, we need to make sure the data is properly cleaned before we begin analysing it.

Let's start with missing values. From the metadata, we already know that the variables `ca` and `thal` contain missing values.

**Exercise 7:**
To decide how to handle these missing values, we first need a more detailed understanding. For both `ca` and `thal`, print the following information:
- The full column of values
- The number of `NaN` values in the column
- The number of unique values in the column, and what those values are
- The type of each feature (use the variable information DataFrame to check this)

In [6]:
# Your code goes here
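
The checks listed above map onto a handful of pandas Series methods. The sketch below runs them on a single invented column named `feature_c`; apply the same calls to `ca` and `thal` in your own DataFrame.

```python
import pandas as pd

# Toy column with one missing entry (values are invented).
df = pd.DataFrame({"feature_c": [0.0, 2.0, float("nan"), 1.0, 0.0]})
column = df["feature_c"]

print(column)                             # the full column
n_missing = column.isna().sum()           # number of NaN values
n_unique = column.nunique()               # unique values, NaN excluded by default
unique_values = column.dropna().unique()  # the unique values themselves
print(n_missing, n_unique, unique_values)

# value_counts(dropna=False) shows NaN alongside the other values.
print(column.value_counts(dropna=False))
```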

You should notice that `ca` contains integer values and `thal` is categorical, with four and three unique values respectively (excluding `NaN`). Because both take only a handful of discrete values, replacing the missing entries with the mean would usually produce a value that is not one of the valid categories. A simpler approach is to remove the patients (rows) that have any missing (`NaN`) values.
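
To see why the mean is a poor fit for this kind of column, consider the small sketch below. The category codes are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical category codes with one missing entry.
codes = pd.Series([3.0, 7.0, float("nan"), 6.0, 3.0])

mean_code = codes.mean()           # NaN is skipped by default
valid_codes = set(codes.dropna())  # the codes that actually occur

print(mean_code)
print(mean_code in valid_codes)
```

The mean falls between the existing codes, so filling the gap with it would introduce a category that does not exist.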

**Exercise 8:**
Remove all rows (patients) that contain any NaN values. Then:
- Print the number of NaN values in each column to ensure that none remain
- Print the shape of the DataFrame and check whether the number of removed rows makes sense
- Reset the row index after dropping the rows with missing data

In [7]:
# Your code goes here
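
A minimal sketch of this step on an invented table is shown below, assuming `dropna` and `reset_index` are the tools you choose; the column names and values are placeholders.

```python
import pandas as pd

# Toy table with one incomplete row (values are invented).
df = pd.DataFrame(
    {
        "feature_a": [63, 41, 57, 38],
        "feature_c": [0.0, float("nan"), 1.0, 2.0],
    }
)
rows_before = len(df)

# Drop every row with at least one NaN, then renumber the index from 0.
cleaned = df.dropna().reset_index(drop=True)

print(cleaned.isna().sum())  # should be 0 for every column
print(cleaned.shape)
print("rows removed:", rows_before - len(cleaned))
```

Note that the number of removed rows can be smaller than the total number of `NaN` values if a single row has more than one missing entry.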

**Exercise 9:**
Check that none of the columns or rows are duplicates of one another.

In [8]:
# Your code goes here
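
One possible way to check for duplicates uses `DataFrame.duplicated`, applied once to the table and once to its transpose. The sketch below uses an invented table that deliberately contains one repeated row and one repeated column.

```python
import pandas as pd

# Toy table: the last row repeats the first, and the last column
# repeats the first column (values are invented).
df = pd.DataFrame(
    {
        "feature_a": [63, 41, 63],
        "feature_b": [1, 0, 1],
        "feature_a_copy": [63, 41, 63],
    }
)

# duplicated() marks each row that repeats an earlier row.
n_duplicate_rows = df.duplicated().sum()

# Transposing turns columns into rows, so the same check works for columns.
n_duplicate_columns = df.T.duplicated().sum()

print(n_duplicate_rows, n_duplicate_columns)
```

Transposing a table with mixed column types converts the values to a common type, so treat the column check as a quick screen rather than a strict test.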

As a final step, we want to ensure that the target variable (`num`) is in the correct format.
According to the website, it should take five possible values: 0, 1, 2, 3, and 4.
A value of 0 indicates no heart disease, while values 1–4 represent different categories of heart disease.


**Exercise 10:**
- Check the type of `num` and its unique values, as you did in earlier exercises, to confirm that the data matches the description.
- Create a new column at the end of the DataFrame called `heart_disease_binary`. This column should contain 0 if `num` is 0, and 1 otherwise. Use a lambda function to achieve this transformation.
- Print the final DataFrame to verify the result.

In [9]:
# Your code goes here
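
The lambda-based transformation can be sketched as follows on an invented `num` column; the values are placeholders, and only the column names come from the exercise.

```python
import pandas as pd

# Toy target column (values are invented).
df = pd.DataFrame({"num": [0, 2, 1, 0, 4]})

# Confirm the type and the set of values before transforming.
print(df["num"].dtype, sorted(df["num"].unique()))

# Map 0 to 0 and every other value to 1.
df["heart_disease_binary"] = df["num"].apply(lambda value: 0 if value == 0 else 1)
print(df)
```

Assigning to a new column name appends it as the last column, which matches the layout asked for in the exercise.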

## Exploratory Analyses