# M8: Final Practice: Heart Disease Dataset

You have made it to the last module! Throughout the course, you have covered the fundamental knowledge and Python packages needed to apply programming in bioinformatics.

The aim of this module is to help you consolidate what you have learned. We will introduce a new dataset for you to analyse and explore. You'll be given a series of exercises designed for you to complete independently, with minimal external assistance.

If you find yourself stuck, make sure to give it a proper attempt on your own first. If that doesn't resolve the issue, revisit earlier modules to refresh your memory. And if you're still unsure, feel free to use Google or consult the solution sheet as a last resort.

---
---

## Initial Exploration

The dataset that we will be using is the *Heart Disease Dataset* from the UC Irvine Machine Learning Repository: https://archive.ics.uci.edu/dataset/45/heart+disease

This dataset is intended for a machine learning task: given a set of patient features, the goal is to predict whether the patient has been diagnosed with heart disease (the target variable). While building such a model is beyond the scope of this course, you will conduct some initial exploratory analyses to become familiar with the dataset and its contents.

Exploratory analysis is often an essential first step, regardless of whether your aim is to develop a machine learning model or to pursue other kinds of investigation.

### Exercise 1: Understanding the Dataset

*Take a few minutes to read through the dataset description on the website and familiarise yourself with its structure and the variables it contains.*

---

Great! Now that you've had a look at the dataset description, let's dive into the data itself.

Don't worry if you didn't understand everything — things should become clearer as you familiarise yourself with the dataset through practice.

At the top right of the website, you'll find a button labelled `IMPORT IN PYTHON`. Clicking on it will show you which package to install and how to load the dataset using it.

### Exercise 2: Import the Dataset Using the UCI Interface

*Install the required package (using any of the methods you've learned) and import the dataset into your Python environment.*

```{admonition}
:class: tip

For the first part of the exercises, you mainly just need to follow the example code provided on the website.

Take your time going through the initial code; it will help you begin exploring the dataset and understand what is provided and how to work with it.
```

In [3]:
# Write your code here.

---

### Exercise 3: Explore the Structure of the Dataset Object

*Create an instance of the Heart Disease dataset using the `fetch_ucirepo` function. Then:*
- *Check the **type** of the dataset object you've created.*
- *Use `dir()` on the object to list its attributes.*
- *Try accessing `.metadata`, `.variables`, and `.data`. What types are these? What kind of information do they contain?*

*This will help you understand how the dataset is structured and how to navigate it.*

```{admonition}
:class: tip

Use `type(heart_disease)` and `dir(heart_disease)` to explore the structure of the object.

You’ll find that `heart_disease` has components like `.metadata` (a dictionary), `.variables` (a DataFrame), and `.data` (a dictionary containing the features and target).
```

In [4]:
# Write your code here.

---

### Exercise 4: Summarise the Dataset Metadata

*Print the metadata information about the dataset. Try to answer the following questions based on what you find:*
- *How many data points (patients) and how many features are there?*
- *What are the names of the demographic features?*
- *What is the name of the target variable?*
- *Are there any missing values in the dataset?*

In [5]:
# Write your code here.

---

### Exercise 5: Inspect Variable Details and Metadata

*Print the variable information for the dataset. Try to answer the following questions based on the output:*
- *Which variables have missing values?*
- *What is the unit of **resting blood pressure**?*
- *How many categorical variables are there?*

In [6]:
# Write your code here.

---

### Exercise 6: Convert Data to DataFrame Format for Exploration

*Extract the feature values and the target values into two separate variables. Print each of them.*
- *Do the number of rows and columns make sense?*
- *Are you able to understand what information they contain based on the metadata exploration?*
- *If not, revisit the website, metadata, and variable information to clarify.*

*Next, import the `pandas` library. Combine the feature and target values into a single pandas DataFrame, where each row represents a patient and each column represents a feature, with the final column being the target variable. Display the first five rows of the resulting DataFrame.*

```{admonition}
:class: tip

If you're unsure how best to combine the features and target, start by checking their types. You'll see that both are already pandas objects, so you'll need a pandas command to combine the two into the required layout. This should only require one line of code.

Whenever you're unsure how to manipulate the data, the first step should always be to check its type. This will help you better understand what operations are available.
```
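As a small illustration of the "check the type first" advice, here is a sketch on made-up data. The column names and values are hypothetical and not taken from the Heart Disease dataset. It shows one way to place two pandas objects side by side with `pd.concat`.

```python
import pandas as pd

# Hypothetical toy data standing in for a feature table and a target table
features = pd.DataFrame({"height_cm": [170, 182, 165], "weight_kg": [68, 90, 55]})
target = pd.DataFrame({"label": [0, 1, 0]})

# Step 1: check what kind of objects we are dealing with
print(type(features).__name__, type(target).__name__)

# Step 2: both are DataFrames, so pandas can join them column-wise (axis=1)
combined = pd.concat([features, target], axis=1)
print(combined.shape)
print(combined.head())
```

Because both tables share the same row index, the rows line up one-to-one and the target ends up as the last column.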

In [7]:
# Write your code here.

---
---

## Data Cleaning

Now that we have everything in a single, tidy DataFrame, we need to make sure the data is properly cleaned before we begin analysing it.

Let's start with missing values. From the metadata, we already know that the variables `ca` and `thal` contain missing values.

### Exercise 7: Investigating Missing Values in the Dataset

*To decide how to handle these missing values, we first need a more detailed understanding. For both `ca` and `thal`, print the following information:*
- *The full column of values*
- *How many `NaN` values each column contains*
- *The number of unique values in the column, and what those values are*
- *The type of each feature (use the variable information DataFrame to check this)*

In [9]:
# Write your code here.

---

You should notice that `ca` contains integer values with four unique values, while `thal` is categorical with three unique values (excluding `NaN`). Because `thal` holds category codes and `ca` is a small count, replacing the missing values with the mean would not make sense: the result would usually not be a valid value. Instead, a simple approach is to remove patients (rows) with any missing (`NaN`) values.
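To make this reasoning concrete, here is a minimal sketch on a made-up table. The column names and values are hypothetical. It counts missing values, shows that the mean of a set of category codes is not itself a valid code, and then drops incomplete rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table with one count-like column and one category-code column
toy = pd.DataFrame({
    "vessels": [0.0, 1.0, np.nan, 3.0, 2.0],
    "category": [3.0, np.nan, 7.0, 6.0, 3.0],
})

# Count missing values per column
nan_counts = toy.isna().sum()
print(nan_counts)

# The mean of the category codes is not one of the codes (3, 6, 7)
category_mean = toy["category"].mean()
print(category_mean)

# Drop every row that has at least one NaN, then renumber the rows from 0
cleaned = toy.dropna().reset_index(drop=True)
print(cleaned.shape)
```

Dropping rows is only one option. It works well when few rows are affected, as in this toy table.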


### Exercise 8: Removing Rows with Missing Values

*Remove all rows (patients) that contain any `NaN` values. Then:*
- *Print the number of NaN values in each column to ensure that none remain*
- *Print the shape of the DataFrame and check whether the number of removed rows makes sense*
- *Reset the row index after dropping the rows with missing data*

In [10]:
# Write your code here.

---

### Exercise 9: Checking for Duplicate Rows and Columns

*Check that none of the columns or rows are duplicates of one another.*

In [11]:
# Write your code here.

---

As a final step, we want to ensure that the target variable (`num`) is in the correct format.
According to the website, it should take five possible values: 0, 1, 2, 3, and 4.
A value of 0 indicates no heart disease, while values 1–4 represent different categories of heart disease.

### Exercise 10: Create a Binary Target Variable
- *Check the type of `num` and its unique values, as you did in earlier exercises, to confirm that the data matches the description.*
- *Create a new column at the end of the DataFrame called `heart_disease_binary`. This column should contain 0 if `num` is 0, and 1 otherwise. Use a lambda function to achieve this transformation.*
- *Print the final DataFrame to verify the result.*

In [12]:
# Write your code here.

```{admonition} Why we use a lambda here
:class: tip
The lambda function allows us to create a binary indicator:  
- 0 → no heart disease  
- 1–4 → heart disease present

This is often useful for binary classification tasks, or to simplify analysis.
```

---
---

## Exploratory Data Analysis (EDA)

#### Visualising the Data

Now that the dataset is clean, we can begin exploring the distributions of different types of features.

Visualisation is a key part of **exploratory data analysis (EDA)**: it helps us detect patterns, identify outliers, and better understand the structure of our data — all of which can inform future modelling decisions.

We'll start by looking at the **distribution of individual features**, which helps us understand:
- What kind of data each column contains (categorical, continuous, etc.)
- Whether values are skewed, symmetric, uniform, or contain outliers
- Which features are best visualised as
    - **histograms** (for continuous values)
    - **bar plots** (for categorical values)

#### How Do You Know What Type a Feature Is?

Different types of variables require different plots and analysis methods. Let's walk through the main categories and how to spot them.

##### Two Main Types
1. **Numerical** (quantities and measurements)
2. **Categorical** (groups, types, labels)

##### Numerical Features
These represent measurable quantities, and they can be either:

| Type        | Description                         | Example                      | Visualisation        |
|-------------|-------------------------------------|------------------------------|-----------------------|
| Continuous  | Can take any value within a range   | `chol`, `oldpeak`            | Histogram             |
| Discrete    | Countable whole numbers             | `ca` (number of vessels)     | Histogram or bar plot |

```{note}
Even though both types are numeric, you might treat them differently in analysis.

For example, **log transformations or standardisation** usually make sense for continuous variables  
but not for small-range discrete variables like `ca` or integer-coded categories like `slope`.
```

```{admonition} Continuous vs Discrete – Real vs Representation
:class: note

Some variables, like **age**, may appear as **discrete** in a dataset (e.g. whole years),  
but they are **inherently continuous** — people can be 18.5 or 73.2 years old.  
Because it spans a wide range and behaves like a measurement, **age is typically analysed as continuous**.

In contrast, features like **number of vessels (`ca`)** are **truly discrete**:  
you can’t have 2.5 vessels — it’s either 2 or 3. These are counts and must be treated accordingly.

- Discrete features with **many unique values** (like age) are often analysed as continuous.
- Truly discrete, **count-based** features are handled differently, especially in statistical models.
```

##### Categorical Features
These represent groups or labels. They can be:

| Type     | Description                           | Example                    | Visualisation |
|----------|---------------------------------------|----------------------------|----------------|
| Nominal | No natural order                       | sex, thal, chest pain type | Bar plot       |
| Ordinal | Ordered categories                     | slope (up, flat, down)     | Bar plot       |

```{admonition}
:class: warning

Just because a feature is stored as an integer doesn't mean it's numerical.

Many **categorical variables** are encoded as integers (e.g., `sex`, `cp`, `slope`),  
but these values represent **categories**, not measurements.

Always check the metadata or variable descriptions to know whether a variable is truly categorical.
```

##### Special Note: Identifiers

Some biomedical datasets include IDs (e.g., patient ID or record number). These are usually:
- Unique per row
- Not useful for prediction
- Best excluded from plots and modelling

#### How Can You Figure This Out?

Try the following steps to classify your features:

1. Use the `variables` table:
   It often indicates if a variable is "Categorical" or "Integer".

2. Check `.dtypes`:
   - `float64` → likely continuous
   - `int64` → might be either discrete or categorical

3. Use `.nunique()`:
   - Few unique values (e.g. 2–4) → probably categorical
   - Many unique values (30+) → likely continuous

4. Use the variable descriptions:
   - Do they describe a measurement (e.g. blood pressure)? → Numerical
   - Do they describe a group or type (e.g. chest pain type)? → Categorical

5. Plot to be sure:
   - Histograms reveal shape and spread of numerical data
   - Bar plots show category frequencies clearly
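The `.dtypes` and `.nunique()` checks can be sketched on a small made-up table. The column names, values, and the threshold `MAX_CATEGORIES` are all hypothetical choices for illustration. Treat the result as a first guess to confirm against the variable descriptions, not as a final classification.

```python
import pandas as pd

# Hypothetical toy table: one measurement and two integer-coded columns
toy = pd.DataFrame({
    "pressure": [120.5, 131.0, 118.2, 140.7, 125.3, 135.9],
    "group_code": [1, 2, 1, 3, 2, 1],
    "flag": [0, 1, 1, 0, 1, 0],
})

# Data type of each column
print(toy.dtypes)

# Number of distinct values in each column
unique_counts = toy.nunique()
print(unique_counts)

# Heuristic: columns with very few distinct values are candidate categoricals
MAX_CATEGORIES = 4  # assumed cut-off, adjust to your data
candidates = [col for col in toy.columns if unique_counts[col] <= MAX_CATEGORIES]
print(candidates)
```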

---

### Exercise 11: Classify Each Variable by Type

*Before plotting or analysing features, it's important to understand what **kind of variable** each column represents.*

*Your goal is to carefully classify each feature in the dataset into a suitable variable type.*
*This helps guide which visualisation, transformation, or statistical method to apply later.*

**Common Variable Types to Consider:**

*(Not all types may be present in this dataset)*
- **Continuous numerical** – real-valued measurements with many unique values
- **Discrete numerical** – count-based integers
- **Nominal categorical** – unordered categories
    - *Binary* – a special case with exactly two values
- **Ordinal categorical** – ordered categories
- **Identifier / metadata** – e.g., patient IDs (not used in modelling)

**Tools to Help You Decide:**
- `.dtypes` to check each column’s data type
- `.nunique()` to see how many distinct values each feature has
- `variables[['name', 'type', 'description']]` to understand what each column represents

**Task:**
- *Create a list of variables that fall into each type above (if applicable).*
- *Focus especially on identifying categorical and continuous features.*

In [13]:
# Write your code here.

---

### Exercise 12: Explore Categorical Variable Distributions

*Your goal is to explore how categorical variables are distributed across the dataset.*

*This can help you:*
- *Understand class balance (e.g., male vs female)*
- *Identify rare or dominant categories*
- *Spot issues like unbalanced variables, which can affect modelling*

**Task:**
1. *Import the visualisation library*
2. *Use a loop to create bar plots for each categorical feature.*
    - *Use `.value_counts()` to get category counts.*
    - *Use `plot(kind='bar')` to create the bar plot.*
3. *Try displaying all the plots in a grid of subplots for clarity.*

In [14]:
# Write your code here.

---

### Exercise 13: Compare Categorical Distributions by Sex

*In this exercise, you'll explore how categorical feature distributions vary between male and female patients.*

**Task:**
1. *Use your list of categorical features.*
2. *Create bar plots for each feature, stratified by sex:*
    - *One bar for each category within each sex group.*
3. *Use `pd.crosstab()` or `df.groupby('sex')[feature].value_counts()` to count values.*
4. *Plot them using `kind='bar'` or `kind='barh'` with legends and labels.*

```{admonition}
:class: tip

Use subplots to display multiple plots at once.
```

```{admonition} Unequal group sizes can affect your interpretations
:class: warning

In our dataset, around **68% of the patients are male**, and only **32% are female**.  
This imbalance can impact how we interpret feature distributions.

For example, if a certain chest pain type appears more often in males, it may simply reflect the larger number of male patients — not a true difference in risk or symptoms.

To make more meaningful comparisons, we’ll now **stratify** our plots by sex and compare distributions **within** each group.
```
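The effect of unequal group sizes can be seen in a small made-up example. The `sex` and `symptom` values below are invented for illustration. Raw counts differ a lot between the groups, while the within-group proportions, computed with `normalize="index"` in `pd.crosstab`, are identical.

```python
import pandas as pd

# Hypothetical toy data: 8 patients in one group, 2 in the other
toy = pd.DataFrame({
    "sex": ["M"] * 8 + ["F"] * 2,
    "symptom": ["a", "a", "a", "a", "b", "b", "b", "b", "a", "b"],
})

# Raw counts: the larger group dominates every category
counts = pd.crosstab(toy["sex"], toy["symptom"])
print(counts)

# Row-normalised proportions: each row sums to 1, so groups are comparable
proportions = pd.crosstab(toy["sex"], toy["symptom"], normalize="index")
print(proportions)
```

Plotting proportions instead of counts is one way to compare groups of different sizes fairly.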

In [15]:
# Write your code here.

---

### Exercise 14: Visualising Continuous Features with Histograms

*Now let's explore the continuous numerical features you identified earlier.*

*Histograms are ideal for:*
- *Understanding the shape of the distribution (e.g. symmetric, skewed, bimodal)*
- *Identifying outliers or unusual values*
- *Comparing ranges and spread between features*

**Task:**
1. *Create a list of continuous features from your earlier classification.*
2. *Plot a histogram for each of them.*
    - *You can use `df[feature].hist()` or `plt.hist()`.*
3. *Optionally, group them in subplots for clarity.*

```{admonition}
:class: tip
- Use `bins=20` and `edgecolor='black'` for clean visuals
- Set `figsize=(15, 10)` for a good layout
```

In [16]:
# Write your code here.

````{admonition} What to do with skewed distributions?
:class: tip

Skewed variables can sometimes distort correlations or model predictions.

A common fix is to apply a **log transformation** to make the data more normal.  
This is especially helpful for positively skewed data like `oldpeak` or `chol`.

```python
import numpy as np
df['log_oldpeak'] = np.log1p(df['oldpeak'])  # log1p handles 0s safely
```
````

---

### Exercise 15: Compare Continuous Features by Sex

*Let's explore how continuous feature distributions differ between male and female patients. This can help identify:*
- *Biological or physiological differences*
- *Features that may require sex-specific modelling*
- *Whether standardisation should be stratified*

**Task:**
- *Use the list of continuous features: `['age', 'trestbps', 'chol', 'thalach', 'oldpeak']`*
- *For each feature, plot overlaid histograms for male and female patients*
- *Use `alpha=0.5` and different colours to distinguish the groups*
- *Add titles, axis labels, and a legend*

In [17]:
# Write your code here.

---

### Exercise 16: Boxplots of Continuous Features by Sex

*In the previous exercise, you used **stratified histograms** to see how the distribution shape of each continuous variable differs between males and females.*

*Now, you'll use **boxplots** to visualise:*

- *Summary statistics (median, quartiles)*

- *Spread and skewness*

- *Outliers more clearly*


*While histograms are better for understanding the full shape of a distribution (e.g., unimodal, skewed, bimodal), boxplots give a cleaner summary of central tendency and variability, especially for comparisons across groups.*

*Using both gives a more complete picture and helps confirm or refine your interpretations.*

**Task:**

1. *Import seaborn if you haven't yet*

2. *Choose a list of continuous features to explore*

3. *For each feature, create a boxplot comparing the distribution for males and females*


In [18]:
# Write your code here.

---

### Exercise 17: Exploring Relationships Between Continuous Features

So far, you've explored each continuous variable individually. Now let's investigate how they relate to each other.

A **correlation matrix** is a powerful way to:

- Identify **linear relationships** (positive or negative)

- Detect **redundant features** (high correlation means similar information)

- Understand which variables may interact in modelling

**Task:**

1. *Create a new dataframe with only the continuous features*

2. *Use `.corr()` to compute the Pearson correlation matrix*

3. *Plot the matrix using `seaborn.heatmap()`*
    - *Set `annot=True` to show correlation values*
    - *Set `cmap='coolwarm'` for intuitive colouring*

4. *Interpret the heatmap:*
   - *Which features are positively or negatively correlated?*
   - *Are any relationships strong (absolute correlation above $0.6$)?*
   - *Do any pairs look redundant or surprisingly independent?*
   - *What might this imply about modelling?*

```{admonition} How to interpret correlations
:class: tip

- Values near +1 mean strong positive correlation

- Values near –1 mean strong negative correlation

- Values near 0 mean no linear relationship

We'll use this to guide our understanding of which features are related and which are independent.

```

```{admonition} Why not always trust correlation blindly?
:class: warning

Some of our continuous features are **right-skewed** or contain **outliers**.

*Pearson correlation* measures only linear association, and significance tests based on it assume approximately normal data, so in these cases it may:

- Underestimate a nonlinear relationship

- Be overly sensitive to outliers

To explore this further, you could:

- Use `.corr(method='spearman')` for rank-based correlation

- Apply transformations (e.g., log) to reduce skew before computing correlations
```
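The difference between the two correlation methods can be sketched with a made-up pair of variables. Here `y` is the cube of `x`, so the relationship is perfectly monotonic but not linear. This toy data is an assumption chosen to make the contrast obvious and is not related to the dataset.

```python
import pandas as pd

# Hypothetical toy data: y increases with x, but along a curve rather than a line
toy = pd.DataFrame({"x": range(1, 11)})
toy["y"] = toy["x"] ** 3

# Pearson measures how well a straight line describes the relationship
pearson = toy["x"].corr(toy["y"])

# Spearman works on ranks, so any strictly increasing relationship scores 1
spearman = toy["x"].corr(toy["y"], method="spearman")

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

Pearson comes out below 1 because the points curve away from a straight line, while Spearman reports a perfect monotonic association.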

In [19]:
# Write your code here.

---

### Exercise 18: Spearman Correlation

You've just explored **Pearson correlation**, which measures *linear* relationships.
But some of our variables are **skewed**, and may violate Pearson's assumptions.

In this exercise, you'll compute **Spearman correlation**, which is:

- Based on **ranks**, not raw values

- More **robust to outliers and skew**

- Better at detecting **monotonic (possibly nonlinear)** trends

**Task:**

1. *Use `.corr(method='spearman')` to compute the Spearman correlation matrix.*

2. *Plot it as a heatmap (just like before).*

3. *Compare it to your Pearson matrix:*
   - *Which pairs change the most?*
   - *Are there any new strong relationships?*
   - *Do any associations disappear?*

```{admonition} Why this matters
:class: info

Seeing how these two methods differ gives you a deeper view of your data and helps decide whether some features need **transformation** or **nonlinear methods** downstream.
```

In [20]:
# Write your code here.

---

### Exercise 19: Explore Feature Relationships with the Target Variable

You've explored relationships between features. Now it's time to ask which features differ most between patients with and without heart disease.

This is a crucial part of exploratory analysis and feature selection. You’ll stratify patients based on the binary outcome: `heart_disease_binary` (0 = no disease, 1 = disease).

**Task:**
1. *Pick your continuous features*
2. *For each one:*
    - *Create a boxplot grouped by `heart_disease_binary`*
    - *Optionally: also create stratified histograms or violin plots*
3. *Observe:*
    - *Which variables show clear separation between the two classes?*
    - *Are there noticeable shifts in distribution or outliers?*
    - *Which features might be informative for classification?*

*This step helps identify potential predictors of heart disease, builds intuition about what differences exist between groups, and sets the stage for later statistical testing or modelling.*

In [21]:
# Write your code here.

---

### Exercise 20: Statistical Testing — Heart Disease vs Continuous Features

You've already visualised how the distributions of continuous variables differ depending on whether a patient has heart disease.

Now you'll perform **statistical tests** to check if those differences are **significant**.

**Goal:**
*Test whether each continuous feature differs **significantly** between:*

- *`heart_disease_binary = 0` → No heart disease*

- *`heart_disease_binary = 1` → Heart disease present*

**Task:**

1. *If needed, install SciPy first. Then import `ttest_ind` and `mannwhitneyu` from `scipy.stats`.*

2. *Run both tests for each continuous feature:*

	- *T-test: checks if means are significantly different*

	- *Mann–Whitney U: non-parametric alternative for non-normal distributions*

3. *Print and interpret the p-values:*

	- *Which features show statistically significant differences?*

	- *Do results change between t-test and Mann–Whitney?*

```{admonition} Reminder: Check distribution shape
:class: tip

Use your **histograms and boxplots** from earlier to decide which test is appropriate.

- If a feature is **skewed** or has **many outliers**, the **Mann–Whitney U test** is more reliable.

- If a feature is **symmetric** and looks normally distributed, the **t-test** is usually appropriate.

This helps ensure you're using valid statistical assumptions, and makes your results more trustworthy.
```
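As a minimal sketch of how the two tests are called, the example below uses simulated values rather than the dataset. The group means, spread, sample size, and random seed are arbitrary assumptions. Passing `equal_var=False` runs Welch's version of the t-test, which does not assume equal variances in the two groups.

```python
import numpy as np
from scipy.stats import mannwhitneyu, ttest_ind

# Simulated measurements for two hypothetical groups (seed fixed for reproducibility)
rng = np.random.default_rng(0)
group_a = rng.normal(loc=50, scale=5, size=40)
group_b = rng.normal(loc=58, scale=5, size=40)

# Welch's t-test: compares the group means
t_stat, t_p = ttest_ind(group_a, group_b, equal_var=False)

# Mann-Whitney U test: rank-based, no normality assumption
u_stat, u_p = mannwhitneyu(group_a, group_b, alternative="two-sided")

print(f"t-test p-value:       {t_p:.2e}")
print(f"Mann-Whitney p-value: {u_p:.2e}")
```

With roughly normal data like this, both tests should lead to the same conclusion. Disagreement becomes more likely when a feature is skewed or has outliers.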

In [22]:
# Write your code here.

```{admonition}
:class: warning
When results **disagree**, it often reflects a violation of assumptions — such as **skewness** or **outliers** — which makes the **non-parametric test** more trustworthy.
```

```{admonition} Takeaways
:class: tip

- Use both tests for robustness, especially with non-normal data.

- Several features (like `thalach` and `oldpeak`) show strong group separation — they may be useful for modelling heart disease.

- For features like `chol`, the visualisations and test disagreements suggest caution: it may require transformation or further exploration.
```

---

### Exercise 21: Exploring Categorical Features vs Heart Disease

You've explored continuous variables in detail — now it's time to analyse how **categorical features** relate to heart disease.

This is especially important for features like:
- `cp` (chest pain type)
- `thal` (thallium stress test result)
- `slope` (slope of ST segment)
- `sex`, `fbs`, `restecg`, etc.

**Part 1: Visualise Category vs. Target**

**Task:** 

*For each categorical feature:*
- *Create a **grouped bar chart** showing the count of heart disease (0 vs 1) per category.*
- *Use `pd.crosstab()` or `df.groupby()`.*
- *Plot with `plot(kind='bar')` (add `stacked=True` for a stacked chart instead) or `seaborn.countplot()`.*


In [23]:
# Write your code here.

**Part 2: Statistical Test — Chi-Squared**

**Task:**

*Run a **chi-squared test of independence** for each categorical feature vs. heart disease.*
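One possible way to run such a test is `scipy.stats.chi2_contingency`, applied to a contingency table built with `pd.crosstab`. The sketch below uses a made-up table with hypothetical column names and values. For 2×2 tables, SciPy applies Yates' continuity correction by default.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical toy data: a two-level category and a binary outcome
toy = pd.DataFrame({
    "category": ["a"] * 30 + ["b"] * 30,
    "outcome": [0] * 25 + [1] * 5 + [0] * 8 + [1] * 22,
})

# Contingency table of observed counts (rows: category, columns: outcome)
table = pd.crosstab(toy["category"], toy["outcome"])
print(table)

# Test whether category and outcome are independent
chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.2e}")
```

The `expected` array holds the counts that would be expected if the two variables were independent. Comparing it with `table` shows where the association comes from.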


In [24]:
# Write your code here.

```{admonition} Interpreting Chi-Squared
:class: tip

- A **low p-value** (e.g., < 0.05) means the feature is **significantly associated** with heart disease.
- If the result is not significant, the feature may not be useful on its own, although it might still matter in interaction with other features.
```


---

### *(Optional)* Exercise 22: K-Means Clustering — Grouping Patients

So far, you've explored individual features and compared them to heart disease status.
Now let's see if we can **automatically group patients** based on patterns in their data — without using the diagnosis label.

This is known as **unsupervised learning**, and one of the most widely used methods is **K-Means Clustering**.

*In this exercise, you'll:*

- *Select features that appear **statistically informative***
- ***Standardise** them so they're on the same scale*
- *Use **K-Means** to form two clusters ($K = 2$)*
- *Check how well the clusters align with actual heart disease labels*

```{admonition} What is K-Means Clustering?
:class: info

K-Means is an **unsupervised** machine learning algorithm. It tries to:
- Divide the data into $K$ groups (clusters) based on similarity
- Assign each data point to the cluster with the **nearest "centroid"** (the mean of the points in that cluster)
- Recompute the centroids and repeat the process until the clusters stop changing

In this exercise, we'll set $K = 2$, asking the algorithm to find *two* natural groups of patients, based only on selected features (not the diagnosis).

Then, we'll compare the resulting clusters with the actual heart disease labels to see if the grouping aligns with reality.
```


**Task**

1. ***Select features**:*
*Use a mix of continuous and categorical features that showed **strong group differences** earlier.*
   - For example:
     - Continuous: `age`, `thalach`, `oldpeak`, `trestbps`
     - Categorical: `cp`, `thal`, `slope`, `sex`

2. ***One-hot encode** the categorical variables using [`pd.get_dummies()`](https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html)*

```{admonition} What is One-Hot Encoding
:class: tip

Most machine learning algorithms, including K-Means, **can't handle categorical variables directly**.

**One-hot encoding** turns each category into its own **binary (0 or 1)** column.

Example:
If the `cp` (chest pain type) variable has 4 categories (0, 1, 2, 3), it will become 3 new columns: `cp_1`, `cp_2`, `cp_3` (we drop one to avoid redundancy, for example by passing `drop_first=True` to `pd.get_dummies()`).

This lets the model use each category as a separate feature.
```
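Here is a minimal sketch of one-hot encoding on a made-up table. The column names and values are hypothetical. Because the category is stored as integers, the column is named explicitly through `columns=`, since `pd.get_dummies()` otherwise only encodes string or categorical columns.

```python
import pandas as pd

# Hypothetical toy table: an integer-coded category and a numeric column
toy = pd.DataFrame({"pain_type": [1, 2, 3, 4, 2], "age": [63, 41, 57, 50, 66]})

# Encode only the named column; drop_first removes the first category's column
encoded = pd.get_dummies(toy, columns=["pain_type"], drop_first=True, dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

A row with zeros in all three new columns belongs to the dropped category (`pain_type` 1 in this toy table).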

3. ***Standardise** the continuous features using [`StandardScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)*

4. ***Combine** the encoded categorical and scaled continuous features into one single dataset*

5. ***Fit KMeans** with `n_clusters=2` and assign cluster labels*

````{admonition} How to run K-Means in Python
:class: tip

```python
from sklearn.cluster import KMeans

# Create a KMeans object (random_state makes the result reproducible)
kmeans = KMeans(n_clusters=2, random_state=0)

# X_cluster is the combined feature table from step 4.
# Fit the model and assign a cluster label to each patient
df['cluster'] = kmeans.fit_predict(X_cluster)
```
````

6. ***Compare** the resulting clusters to actual `heart_disease_binary` values using:*
   - *A **confusion matrix** (`pd.crosstab()`)*
   - *An optional **heatmap** for visualisation*

````{admonition} What is a Confusion Matrix?
:class: tip
A **confusion matrix** is a table used to compare the model's predictions with the actual values.  
It shows how many patients were correctly or incorrectly assigned to each group.

For example, in our case:

- Rows = the **cluster labels** assigned by K-Means (cluster 0 or 1)
- Columns = the **true diagnosis** (`heart_disease_binary` = 0 or 1)

This helps you assess how well the clusters match real diagnoses.
Cluster numbers are arbitrary, so cluster 0 does not necessarily correspond to "no disease".

To create it in pandas:

```python
import pandas as pd

# Rows: K-Means cluster labels, columns: true diagnosis
pd.crosstab(df['cluster'], df['heart_disease_binary'],
            rownames=['Cluster'], colnames=['Heart Disease'])
```
````

In [25]:
# Write your code here.