# Dataframes in Python 🧮

**Author:** James Cranley (<james.cranley@doctors.org.uk>)  
**Date:** March 2025

---

## Overview

This notebook provides an introduction to working with tabular data using dataframes in Python. 

Dataframes are one of the most versatile and widely used data structures for data analysis, enabling you to read, manipulate, and write 'rectangular' data.

### What You'll Learn

✅ **Reading and Writing Data:** How to import and export data in Excel and CSV formats.

✅ **Inspecting Dataframes:** Techniques for exploring the structure and summary statistics of your data.

✅ **Subsetting Data:** Methods for filtering and extracting specific portions of your data for analysis.

## Basics of Jupyter notebooks

In [3]:
# This is a code cell
some_text = "Click inside a cell and press Shift + Enter to run it."
print(some_text)

Click inside a cell and press Shift + Enter to run it.


In [4]:
# This is a code cell
some_more_text = "We can add new cells by clicking the + button"
print(some_more_text)

We can add new cells by clicking the + button


In [7]:
a = 1
b = 2

a + b

3

By default cells expect code.

We can also write **text** in cells and use **markdown formatting** to make it look fancy 🎩

For instance, we can make text **bold** or *italic*

Or we
- can
- write
- lists

## Exercise
<div style="border-left: 4px solid #007acc; padding: 0.5em 1em; background-color: #f0f8ff; margin: 1em 0;">
  <h4>👨‍🏫 Scenario</h4>
  <p>Bob, your research supervisor, has just sent you an Excel file containing data on a heart failure patient cohort.</p>
  <p>He'd like you to take a look at it, explore what's inside, and start making sense of the data.</p>
</div>

<details>
  <summary><strong>ℹ️ Data Source</strong></summary>
  <p>Data are from: <em>Machine learning has been used to predict patient survival based on key clinical parameters</em> — <a href="https://doi.org/10.1186/s12911-020-1023-5" target="_blank">Chicco & Jurman, 2020</a>.</p>
</details>

### Import packages

Although widely used in Python data analysis, the **DataFrame** is not a native Python data type. Instead, it comes from a third-party library called **pandas**.

To use pandas (or any other external package), we first need to make it available in our notebook by **importing** it.

We'll start by importing the necessary packages in the next code cell.

In [9]:
import os # operating system: for interacting with files
import pandas # key tabular data package in python
import numpy # important package for mathematics in python

To save on typing, it's common to use **abbreviations** (or *aliases*) when importing packages.  
For example, instead of writing `pandas` every time, we can import it as `pd`:

```python
import pandas as pd
```

For clarity in this tutorial we have used the unabbreviated form.

### Reading data into dataframes

In [10]:
input_file_path = "./data/HF_data.xlsx" # define the relative path to the file

We read the Excel file into a **pandas dataframe** called `HF_data`.

In [11]:
HF_data = pandas.read_excel(input_file_path) # Pandas library has a built in Excel read/write function
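
CSV files can be read in much the same way with `pandas.read_csv()`. The sketch below is a minimal, self-contained illustration: it writes a small made-up table to a temporary CSV file and reads it back. The table, the file name `toy_data.csv` and the variable names are assumptions for demonstration only and are not part of Bob's data.

```python
import os
import tempfile

import pandas

# Toy table standing in for real data (values are made up for illustration)
toy_data = pandas.DataFrame({
    "patient_ID": [101, 102, 103],
    "age": [75.0, 55.0, 65.0],
})

# Write the toy table to a temporary CSV file (hypothetical file name)
temporary_folder = tempfile.mkdtemp()
toy_csv_path = os.path.join(temporary_folder, "toy_data.csv")
toy_data.to_csv(toy_csv_path, index=False)

# Read the CSV file back into a new dataframe
toy_from_csv = pandas.read_csv(toy_csv_path)
print(toy_from_csv)
```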

### Inspect the data

In [12]:
HF_data

Unnamed: 0,patient_ID,age,anaemia,creatinine_phosphokinase,diabetes,ejection_fraction,high_blood_pressure,platelets,serum_creatinine,serum_sodium,sex,smoking
0,360061,75.0,0,582,0,20,1,265000.00,1.9,130,1,0
1,118349,55.0,0,7861,0,38,0,263358.03,1.1,136,1,0
2,557032,65.0,0,146,0,20,0,162000.00,1.3,129,1,1
3,851734,50.0,1,111,0,20,0,210000.00,1.9,137,1,0
4,375585,65.0,1,160,1,20,0,327000.00,2.7,116,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...
294,102821,62.0,0,61,1,38,1,155000.00,1.1,143,1,1
295,150663,55.0,0,1820,0,38,0,270000.00,1.2,139,0,0
296,230520,45.0,0,2060,1,60,0,742000.00,0.8,138,0,0
297,449982,45.0,0,2413,0,38,0,140000.00,1.4,140,1,1


In a Jupyter notebook, simply typing the **name of a DataFrame** and running the cell will display a quick overview of its contents.

⚠️ **Note:** This output is *truncated* — you'll only see the first and last few rows by default, not the entire dataset.

Unlike in spreadsheet software such as MS Excel, we do not normally look at all the data; we *peek* at *parts* of it.

We can use the `head()` and `tail()` methods to peek at the top and bottom of the data.

In [13]:
HF_data.head() # the head() method shows us the top of the data

Unnamed: 0,patient_ID,age,anaemia,creatinine_phosphokinase,diabetes,ejection_fraction,high_blood_pressure,platelets,serum_creatinine,serum_sodium,sex,smoking
0,360061,75.0,0,582,0,20,1,265000.0,1.9,130,1,0
1,118349,55.0,0,7861,0,38,0,263358.03,1.1,136,1,0
2,557032,65.0,0,146,0,20,0,162000.0,1.3,129,1,1
3,851734,50.0,1,111,0,20,0,210000.0,1.9,137,1,0
4,375585,65.0,1,160,1,20,0,327000.0,2.7,116,0,0


In [18]:
#💡
HF_data.tail() # the tail() method shows us the bottom of the data, we can adjust the number of rows shown

Unnamed: 0,patient_ID,age,anaemia,creatinine_phosphokinase,diabetes,ejection_fraction,high_blood_pressure,platelets,serum_creatinine,serum_sodium,sex,smoking
294,102821,62.0,0,61,1,38,1,155000.0,1.1,143,1,1
295,150663,55.0,0,1820,0,38,0,270000.0,1.2,139,0,0
296,230520,45.0,0,2060,1,60,0,742000.0,0.8,138,0,0
297,449982,45.0,0,2413,0,38,0,140000.0,1.4,140,1,1
298,513901,50.0,0,196,0,45,0,395000.0,1.6,136,1,1
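
Both `head()` and `tail()` show five rows by default and accept a number if we want more or fewer. A minimal sketch on a small made-up table (the values are illustrative, not taken from the cohort):

```python
import pandas

# Toy table with six rows (made-up values for illustration)
toy_data = pandas.DataFrame({
    "patient_ID": [101, 102, 103, 104, 105, 106],
    "age": [75, 55, 65, 50, 65, 90],
})

first_two = toy_data.head(2)   # only the first 2 rows
last_three = toy_data.tail(3)  # only the last 3 rows

print(first_two)
print(last_three)
```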


Let's have a look at the table after **sorting by ejection fraction**.

Note that we set the `ascending` argument to `False`, so the highest values come first.

In [22]:
#💡 asc/desc,.head()
HF_data.sort_values(by='ejection_fraction', ascending=False)

Unnamed: 0,patient_ID,age,anaemia,creatinine_phosphokinase,diabetes,ejection_fraction,high_blood_pressure,platelets,serum_creatinine,serum_sodium,sex,smoking
64,191487,45.0,0,582,0,80,0,263358.03,1.18,137,0,0
217,461100,54.0,1,427,0,70,1,151000.00,9.00,137,0,0
8,498503,65.0,0,157,0,65,0,263358.03,1.50,138,0,0
52,264586,60.0,0,3964,1,62,0,263358.03,6.80,146,0,0
211,632767,50.0,0,582,0,62,1,147000.00,0.80,140,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...
210,805975,70.0,0,212,1,17,1,389000.00,1.00,136,1,1
126,305431,46.0,0,168,1,17,1,271000.00,2.10,124,0,0
6,741515,75.0,1,246,0,15,0,127000.00,1.20,137,1,0
66,615429,42.0,1,250,1,15,0,213000.00,1.30,136,0,0
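
Sorting can be chained with `head()` to keep only the top few rows. The sketch below uses a small made-up table; note that `sort_values()` returns a new, sorted dataframe and leaves the original untouched.

```python
import pandas

# Toy table (made-up values for illustration)
toy_data = pandas.DataFrame({
    "patient_ID": [101, 102, 103, 104],
    "ejection_fraction": [20, 38, 60, 15],
})

# Two highest ejection fractions: sort descending, then take the top 2 rows
highest_ef = toy_data.sort_values(by="ejection_fraction", ascending=False).head(2)

# Two lowest ejection fractions: sort ascending (the default), then take the top 2 rows
lowest_ef = toy_data.sort_values(by="ejection_fraction", ascending=True).head(2)

print(highest_ef)
print(lowest_ef)
```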


Let's get an overview of the table with the `info()` method

In [23]:
# We can inspect the data
HF_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 299 entries, 0 to 298
Data columns (total 12 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   patient_ID                299 non-null    int64  
 1   age                       299 non-null    float64
 2   anaemia                   299 non-null    int64  
 3   creatinine_phosphokinase  299 non-null    int64  
 4   diabetes                  299 non-null    int64  
 5   ejection_fraction         299 non-null    int64  
 6   high_blood_pressure       299 non-null    int64  
 7   platelets                 299 non-null    float64
 8   serum_creatinine          299 non-null    float64
 9   serum_sodium              299 non-null    int64  
 10  sex                       299 non-null    int64  
 11  smoking                   299 non-null    int64  
dtypes: float64(3), int64(9)
memory usage: 28.2 KB


We can see the table has **12 columns** and **299 patients/rows**.

The **data is complete** (no null values in any column)

`Dtype` indicates **the data are numbers**, either integers (int64) or non-integer numbers (float64)
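
The same facts can also be checked individually rather than read off the `info()` printout. A minimal sketch, using a made-up table with one deliberately missing value:

```python
import pandas

# Toy table with one missing age (made-up values for illustration)
toy_data = pandas.DataFrame({
    "patient_ID": [101, 102, 103],
    "age": [75.0, 55.0, None],
    "smoking": [0, 1, 0],
})

# shape is a (rows, columns) tuple
number_of_rows, number_of_columns = toy_data.shape

# dtypes lists the data type of each column
column_types = toy_data.dtypes

# isna().sum() counts the missing values in each column
missing_per_column = toy_data.isna().sum()

print(number_of_rows, number_of_columns)
print(column_types)
print(missing_per_column)
```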

Let's **summarise the data** using the `describe()` method

In [24]:
HF_data.describe() # Get summary statistics

Unnamed: 0,patient_ID,age,anaemia,creatinine_phosphokinase,diabetes,ejection_fraction,high_blood_pressure,platelets,serum_creatinine,serum_sodium,sex,smoking
count,299.0,299.0,299.0,299.0,299.0,299.0,299.0,299.0,299.0,299.0,299.0,299.0
mean,537456.488294,60.833893,0.431438,581.839465,0.41806,38.083612,0.351171,263358.029264,1.39388,136.625418,0.648829,0.32107
std,260925.766048,11.894809,0.496107,970.287881,0.494067,11.834841,0.478136,97804.236869,1.03451,4.412477,0.478136,0.46767
min,100898.0,40.0,0.0,23.0,0.0,14.0,0.0,25100.0,0.5,113.0,0.0,0.0
25%,312815.5,51.0,0.0,116.5,0.0,30.0,0.0,212500.0,0.9,134.0,0.0,0.0
50%,554821.0,60.0,0.0,250.0,0.0,38.0,0.0,262000.0,1.1,137.0,1.0,0.0
75%,753848.0,70.0,1.0,582.0,1.0,45.0,1.0,303500.0,1.4,140.0,1.0,1.0
max,996295.0,95.0,1.0,7861.0,1.0,80.0,1.0,850000.0,9.4,148.0,1.0,1.0


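We can see from the `ejection_fraction` column that the mean EF is about 38% (the median is also 38%), which suggests the cohort is made up largely of patients with reduced ejection fraction (HFrEF).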
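
A mean alone does not tell us how many patients have a reduced EF. One way to check is to compute the proportion of rows below a cut-off. The sketch below uses made-up values, and the 40% threshold is an assumption based on the commonly used definition of HFrEF; it is not stated in the dataset.

```python
import pandas

# Toy ejection fractions (made-up values for illustration)
toy_data = pandas.DataFrame({
    "ejection_fraction": [20, 38, 60, 35, 45],
})

# Assumed cut-off for reduced ejection fraction (not taken from the dataset)
reduced_ef_threshold = 40

mean_ef = toy_data["ejection_fraction"].mean()

# A comparison gives True/False per row; the mean of that is the proportion of True
proportion_reduced = (toy_data["ejection_fraction"] <= reduced_ef_threshold).mean()

print(mean_ef)
print(proportion_reduced)
```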

We can see that columns like `diabetes` and `anaemia` are numerical (they have summary statistics). Let's inspect just those columns:

In [25]:
columns_of_interest = ['anaemia','diabetes'] # Define a list of columns
HF_data[columns_of_interest] # We subset the dataframe to these columns only

Unnamed: 0,anaemia,diabetes
0,0,0
1,0,0
2,0,0
3,1,0
4,1,1
...,...,...
294,0,1
295,0,0
296,0,1
297,0,0


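Indeed, the values are all 0 or 1: these columns are **binary indicators** (0 = absent, 1 = present), a common way to represent yes/no categorical variables as numbers. One-hot encoding extends the same idea to variables with more than two categories, using one 0/1 column per category.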
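
To confirm that a column only contains 0s and 1s, and to see how many of each there are, `value_counts()` is handy. A minimal sketch on made-up data:

```python
import pandas

# Toy indicator columns (made-up values for illustration)
toy_data = pandas.DataFrame({
    "anaemia": [0, 0, 1, 1, 0],
    "diabetes": [0, 1, 1, 0, 0],
})

# Count how often each distinct value appears in the column
anaemia_counts = toy_data["anaemia"].value_counts()

# List the distinct values, sorted, to check that only 0 and 1 occur
distinct_diabetes_values = sorted(toy_data["diabetes"].unique().tolist())

print(anaemia_counts)
print(distinct_diabetes_values)
```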

### Filtering dataframes

We start to explore the data. **How many patients are >60 years old?**

In [26]:
# Subset the data by a condition, e.g. age above 60
HF_data[HF_data['age'] > 60]

Unnamed: 0,patient_ID,age,anaemia,creatinine_phosphokinase,diabetes,ejection_fraction,high_blood_pressure,platelets,serum_creatinine,serum_sodium,sex,smoking
0,360061,75.0,0,582,0,20,1,265000.00,1.9,130,1,0
2,557032,65.0,0,146,0,20,0,162000.00,1.3,129,1,1
4,375585,65.0,1,160,1,20,0,327000.00,2.7,116,0,0
5,994132,90.0,1,47,0,40,1,204000.00,2.1,132,1,1
6,741515,75.0,1,246,0,15,0,127000.00,1.2,137,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...
283,225433,65.0,0,1688,0,38,0,263358.03,1.1,138,1,1
288,895252,65.0,0,892,1,35,0,263358.03,1.1,142,0,0
289,790283,90.0,1,337,0,38,0,390000.00,0.9,144,0,0
293,694317,63.0,1,103,1,35,0,179000.00,0.9,136,1,1


137 of 299 patients are >60
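
Rather than reading the row count off the displayed table, we can ask pandas for it directly. A minimal sketch on made-up ages:

```python
import pandas

# Toy table (made-up values for illustration)
toy_data = pandas.DataFrame({
    "patient_ID": [101, 102, 103, 104, 105],
    "age": [75, 55, 65, 50, 90],
})

# The comparison produces one True/False value per row
is_older_than_60 = toy_data["age"] > 60

# Summing True/False values counts the True ones
number_older_than_60 = int(is_older_than_60.sum())

# Equivalent: filter the rows first, then count them
filtered_rows = toy_data[is_older_than_60]

print(number_older_than_60, "of", len(toy_data), "patients are >60")
```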

We want to find the **non-smoking diabetic patients**

In [28]:
# Subset by 2 conditions: Find non-smokers (smoking == 0) with diabetes (diabetes == 1)
HF_data[
    (HF_data['smoking'] == 0) & # condition 1, the '&' means both must be satisfied
    (HF_data['diabetes'] == 1) # condition 2
]

Unnamed: 0,patient_ID,age,anaemia,creatinine_phosphokinase,diabetes,ejection_fraction,high_blood_pressure,platelets,serum_creatinine,serum_sodium,sex,smoking
4,375585,65.0,1,160,1,20,0,327000.00,2.70,116,0,0
19,625179,48.0,1,582,1,55,0,87000.00,1.90,121,0,0
21,390659,65.0,1,128,1,30,1,297000.00,1.60,136,0,0
23,110310,53.0,0,63,1,60,0,368000.00,0.80,135,1,0
24,449959,75.0,0,582,1,30,1,263358.03,1.83,134,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...
285,372413,55.0,1,170,1,40,0,336000.00,1.20,135,1,0
287,393308,45.0,0,582,1,55,0,543000.00,1.00,132,0,0
288,895252,65.0,0,892,1,35,0,263358.03,1.10,142,0,0
290,112579,45.0,0,615,1,55,0,222000.00,0.80,141,0,0


95 of 299 were non-smoking and diabetic
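
Conditions can also be combined with `|` (either condition) and negated with `~` (not). Each condition needs its own parentheses. A minimal sketch on made-up data:

```python
import pandas

# Toy table (made-up values for illustration)
toy_data = pandas.DataFrame({
    "patient_ID": [101, 102, 103, 104],
    "smoking": [0, 1, 0, 1],
    "diabetes": [1, 1, 0, 0],
})

# '&' : both conditions must be satisfied
non_smoking_diabetic = toy_data[(toy_data["smoking"] == 0) & (toy_data["diabetes"] == 1)]

# '|' : at least one condition must be satisfied
smoker_or_diabetic = toy_data[(toy_data["smoking"] == 1) | (toy_data["diabetes"] == 1)]

# '~' : inverts a condition
not_diabetic = toy_data[~(toy_data["diabetes"] == 1)]

print(len(non_smoking_diabetic), len(smoker_or_diabetic), len(not_diabetic))
```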

### Merging dataframes

<div style="border-left: 4px solid #007acc; padding: 0.5em 1em; background-color: #f0f8ff; margin: 1em 0;">
  <h4>👨‍🏫 Email from Bob</h4>
  <p>I've just found the mortality outcomes for these patients in a separate file, sending it to you now!</p>
</div>

In [29]:
mortality_data_path = "./data/HF_data_mortality.xlsx"

In [30]:
HF_mortality_data = pandas.read_excel(mortality_data_path)

In [32]:
HF_mortality_data

Unnamed: 0,patient_ID,one_year_mortality_status
0,205724,0
1,780516,1
2,305431,1
3,990619,1
4,552248,0
...,...,...
284,141395,0
285,798565,0
286,264622,1
287,657980,1


We can see that this Excel file contains just **two columns**:  
`patient_ID` and `one_year_mortality_status`.

There are **289 rows**, while the original dataset had **299 rows** — so the **mortality status is missing for 10 patients**.

Let’s now **merge** the `HF_data` and `HF_mortality_data` DataFrames to bring everything together.

Note that in Microsoft Excel, performing tasks like merging tables can be tricky — often involving VLOOKUPs, manual copy-pasting, or complex formulas.

In contrast, pandas makes this process **simple** with the `merge()` function.

In [33]:
merged_data = HF_data.merge(
    right=HF_mortality_data,   # The DataFrame you're merging *into* HF_data
    left_on='patient_ID',      # Column in HF_data to match on
    right_on='patient_ID',     # Column in HF_mortality_data to match on
    how='left'                 # Keep all rows from HF_data, even if there's no match in HF_mortality_data
)

We use the `merge()` function to combine the two DataFrames based on the `patient_ID` column.

- `right`: The second DataFrame we're merging (`HF_mortality_data`)
- `left_on` / `right_on`: Columns to join on in each DataFrame (in this case, both are `patient_ID`)
- `how='left'`: A **left join** — this keeps all rows from `HF_data`, and fills in matching values from `HF_mortality_data`. If there's no match, the `one_year_mortality_status` will be `NaN`.
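
To see how the join type changes the result, the sketch below merges two small made-up tables. It uses `on='patient_ID'`, a shorthand for when the key column has the same name in both tables, and `indicator=True`, which adds a `_merge` column showing where each row came from. Note that the status column becomes a float after a left join, because the missing value `NaN` cannot be stored in an integer column.

```python
import pandas

# Toy clinical table (made-up values for illustration)
toy_clinical = pandas.DataFrame({
    "patient_ID": [101, 102, 103],
    "age": [75.0, 55.0, 65.0],
})

# Toy outcome table: patient 102 is absent, patient 104 is extra
toy_outcomes = pandas.DataFrame({
    "patient_ID": [101, 103, 104],
    "one_year_mortality_status": [1, 0, 1],
})

# Left join: keep every row of toy_clinical, fill unmatched outcomes with NaN
left_joined = toy_clinical.merge(
    right=toy_outcomes, on="patient_ID", how="left", indicator=True
)

# Inner join: keep only patients present in both tables
inner_joined = toy_clinical.merge(right=toy_outcomes, on="patient_ID", how="inner")

print(left_joined)
print(inner_joined)
```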


In [34]:
merged_data

Unnamed: 0,patient_ID,age,anaemia,creatinine_phosphokinase,diabetes,ejection_fraction,high_blood_pressure,platelets,serum_creatinine,serum_sodium,sex,smoking,one_year_mortality_status
0,360061,75.0,0,582,0,20,1,265000.00,1.9,130,1,0,1.0
1,118349,55.0,0,7861,0,38,0,263358.03,1.1,136,1,0,1.0
2,557032,65.0,0,146,0,20,0,162000.00,1.3,129,1,1,1.0
3,851734,50.0,1,111,0,20,0,210000.00,1.9,137,1,0,1.0
4,375585,65.0,1,160,1,20,0,327000.00,2.7,116,0,0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
294,102821,62.0,0,61,1,38,1,155000.00,1.1,143,1,1,0.0
295,150663,55.0,0,1820,0,38,0,270000.00,1.2,139,0,0,0.0
296,230520,45.0,0,2060,1,60,0,742000.00,0.8,138,0,0,0.0
297,449982,45.0,0,2413,0,38,0,140000.00,1.4,140,1,1,0.0


Now we can see we have **13 columns** and the `one_year_mortality_status` column has 10 null rows (i.e. missing data)

We can **inspect the rows where mortality data is missing**. The `isna()` method finds rows where the data is null (missing).

In [35]:
merged_data[merged_data['one_year_mortality_status'].isna()] # NA means Not Available.

Unnamed: 0,patient_ID,age,anaemia,creatinine_phosphokinase,diabetes,ejection_fraction,high_blood_pressure,platelets,serum_creatinine,serum_sodium,sex,smoking,one_year_mortality_status
30,960758,94.0,0,582,1,38,1,263358.03,1.83,134,1,0,
40,742068,70.0,0,582,0,20,1,263358.03,1.83,134,1,1,
56,456816,70.0,1,75,0,35,0,223000.0,2.7,138,1,1,
114,120158,60.0,1,754,1,40,1,328000.0,1.2,126,1,0,
149,753705,60.0,0,2261,0,35,1,228000.0,0.9,136,1,0,
175,924014,60.0,1,95,0,60,0,337000.0,1.0,138,1,1,
187,996295,60.0,0,1896,1,25,0,365000.0,2.1,144,0,0,
189,294677,40.0,0,244,0,45,1,275000.0,0.9,140,0,0,
194,953261,45.0,0,582,0,20,1,126000.0,1.6,135,1,0,
212,145211,78.0,0,224,0,50,0,481000.0,1.4,138,1,1,
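
If an analysis needs complete outcome data, one option is to count the missing values and then drop those rows with `dropna()`. This is only a sketch on made-up data; whether dropping rows is appropriate depends on the analysis.

```python
import pandas

# Toy merged table with one missing outcome (made-up values for illustration)
toy_merged = pandas.DataFrame({
    "patient_ID": [101, 102, 103],
    "one_year_mortality_status": [1.0, None, 0.0],
})

# Count the rows where the outcome is missing
number_missing = int(toy_merged["one_year_mortality_status"].isna().sum())

# Keep only rows where the outcome is present; the original table is unchanged
complete_cases = toy_merged.dropna(subset=["one_year_mortality_status"])

print(number_missing)
print(complete_cases)
```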


### Saving dataframes

You decide to **save the dataframe with the mortality data now added as an Excel file**.

This is very simple with the `to_excel()` method.

In [36]:
output_file_path = "./results/HF_data_inc_mortality.xlsx" # this is where the file will save to

In [37]:
merged_data.to_excel(output_file_path, index=False)  # Save the file without row numbers (index=False)
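
Saving typically fails if the output folder does not exist yet, so it can help to create it first with `os.makedirs()`. The sketch below does this in a temporary location and saves a made-up table as CSV with `to_csv()`, which takes the same `index=False` argument. The folder and file names here are hypothetical.

```python
import os
import tempfile

import pandas

# Toy table (made-up values for illustration)
toy_data = pandas.DataFrame({
    "patient_ID": [101, 102],
    "one_year_mortality_status": [1, 0],
})

# Build a hypothetical results folder inside a temporary location
base_folder = tempfile.mkdtemp()
results_folder = os.path.join(base_folder, "results")

# Create the folder; exist_ok=True avoids an error if it is already there
os.makedirs(results_folder, exist_ok=True)

# Save as CSV without the row numbers
output_csv_path = os.path.join(results_folder, "toy_data.csv")
toy_data.to_csv(output_csv_path, index=False)

print(os.path.exists(output_csv_path))
```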