# Introduction to Pandas for Clinicians
## A Foundation for Data Science in Personalised Medicine

Welcome to this introduction to Pandas, one of the most widely used Python libraries for data analysis. This notebook is written specifically for clinicians who want to apply data science to personalised medicine.

By the end of this notebook, you'll understand how to:
- Import Python modules and why this is important
- Read CSV files containing patient data
- Work with DataFrames (think of them as sophisticated spreadsheets)
- Filter data to find specific patient populations or conditions

Let's begin!

---

## 1. Importing Modules in Python

### Why Do We Import Modules?

In Python, modules are like specialized toolkits. Just as you wouldn't bring every medical instrument to every patient encounter, Python doesn't load every possible function when it starts. Instead, we import only the tools we need.

**Think of it this way:**
- **Base Python** = Your basic clinical skills
- **Pandas** = Your specialized diagnostic equipment
- **Other modules** = Additional specialist tools (statistics, visualization, etc.)

### How to Import Modules

```python
# The standard way to import pandas
import pandas as pd

# Why 'as pd'? 
# - It's a widely accepted convention
# - Saves typing (pd.read_csv vs pandas.read_csv)
# - Makes code more readable
```
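
The same aliasing pattern applies to any module, not just pandas. Here is a minimal sketch using Python's built-in `statistics` module, chosen only because it needs no installation. The alias `st` and the blood pressure readings are arbitrary choices for illustration.

```python
# Three ways to reach the same function
import statistics                 # full module name
import statistics as st           # alias, same idea as 'import pandas as pd'
from statistics import mean       # import a single function by name

systolic_readings = [140, 130, 110]   # illustrative values

# All three calls run exactly the same function
print(statistics.mean(systolic_readings))
print(st.mean(systolic_readings))
print(mean(systolic_readings))
```

All three lines print the same number. The alias form is usually preferred because it keeps the module name visible while staying short.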

In [1]:
# Let's import pandas and check it's working
import pandas as pd
print(f"Pandas version: {pd.__version__}")
print("✅ Pandas imported successfully!")

Pandas version: 2.2.3
✅ Pandas imported successfully!


### Other Common Imports You'll See

In [2]:
# These are other modules commonly used with pandas
import numpy as np          # For numerical operations
import matplotlib.pyplot as plt  # For creating plots
# import seaborn as sns     # For statistical visualizations

print("All modules imported successfully!")

All modules imported successfully!


---

## 2. Reading CSV Files with Pandas

### What is a CSV File?

CSV stands for "Comma-Separated Values." A CSV file is plain text: each line is one row, and commas separate the columns. Spreadsheet software, statistical packages, and Python can all open it, and many clinical systems (EMRs, lab systems, etc.) can export data in this format.
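
To see how simple the format is, here is a small sketch that reads CSV text directly from a string instead of a file. The two patient rows and the name `df_demo` are made up for this illustration.

```python
import io
import pandas as pd

# A CSV file is plain text: one header line, then one line per record
csv_text = """patient_id,age,diagnosis
P001,45,Hypertension
P002,62,Diabetes
"""

# io.StringIO lets read_csv treat the string as if it were a file on disk
df_demo = pd.read_csv(io.StringIO(csv_text))
print(df_demo)
```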

### Creating Sample Clinical Data

First, let's create some sample patient data to work with:

In [3]:
# Create a sample dataset that mimics clinical data
sample_data = {
    'patient_id': ['P001', 'P002', 'P003', 'P004', 'P005', 'P006', 'P007', 'P008'],
    'age': [45, 62, 38, 71, 29, 55, 68, 42],
    'gender': ['M', 'F', 'F', 'M', 'F', 'M', 'F', 'M'],
    'diagnosis': ['Hypertension', 'Diabetes', 'Asthma', 'Diabetes', 'Hypertension', 'Diabetes', 'Asthma', 'Hypertension'],
    'systolic_bp': [140, 130, 110, 145, 125, 155, 115, 138],
    'hba1c': [5.2, 8.1, 5.0, 9.2, 4.8, 7.8, 5.1, 5.5],
    'medication_adherence': [0.85, 0.72, 0.95, 0.68, 0.90, 0.75, 0.88, 0.82]
}

# Convert to DataFrame and save as CSV
df_sample = pd.DataFrame(sample_data)
df_sample.to_csv('sample_patient_data.csv', index=False)
print("✅ Sample CSV file created: 'sample_patient_data.csv'")

✅ Sample CSV file created: 'sample_patient_data.csv'


### Reading the CSV File

In [4]:
# Read the CSV file into a pandas DataFrame
df = pd.read_csv('sample_patient_data.csv')

print("✅ CSV file loaded successfully!")
print(f"Data shape: {df.shape[0]} patients, {df.shape[1]} variables")

✅ CSV file loaded successfully!
Data shape: 8 patients, 7 variables


### Common CSV Reading Options

In [5]:
# pandas.read_csv() has many useful options for clinical data:

# If your CSV has different separators (semicolons are common in European data)
# df = pd.read_csv('file.csv', sep=';')

# If you want to use the patient ID column as the row labels (the index)
# df = pd.read_csv('file.csv', index_col='patient_id')

# If you have missing data coded differently
# df = pd.read_csv('file.csv', na_values=['NULL', 'N/A', 'missing'])

print("These options help handle real-world clinical data variations")

These options help handle real-world clinical data variations
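
The options above are commented out, so here is a runnable sketch that combines them. The export text is hypothetical: it uses semicolons as separators and the word `missing` for an absent lab value. `io.StringIO` stands in for a real file path.

```python
import io
import pandas as pd

# Hypothetical export: semicolon-separated, 'missing' marks an absent value
raw_export = """patient_id;age;hba1c
P001;45;5.2
P002;62;missing
P003;38;5.0
"""

df_opts = pd.read_csv(
    io.StringIO(raw_export),               # stands in for a file path
    sep=';',                               # column separator
    index_col='patient_id',                # use patient IDs as row labels
    na_values=['NULL', 'N/A', 'missing'],  # treat these strings as missing
)

print(df_opts)
print(f"Missing HbA1c values: {df_opts['hba1c'].isna().sum()}")
```

Without `na_values`, the word `missing` would be read as text and the whole `hba1c` column would no longer be numeric.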


---

## 3. Understanding DataFrames

### What is a DataFrame?

A DataFrame is like a smart spreadsheet. It has:
- **Rows** (observations) - each row represents one patient
- **Columns** (variables) - each column represents one measurement or characteristic
- **Index** - row labels (0, 1, 2, ... by default; you can use patient IDs instead)
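
The index is easiest to understand by changing it. This sketch uses three made-up patients (the names `df_idx` and `by_patient` are illustrative) to show the default numeric index and then patient IDs as row labels.

```python
import pandas as pd

df_idx = pd.DataFrame({
    'patient_id': ['P001', 'P002', 'P003'],
    'age': [45, 62, 38],
    'diagnosis': ['Hypertension', 'Diabetes', 'Asthma'],
})

# By default the index is simply 0, 1, 2, ...
print(list(df_idx.index))

# set_index returns a new DataFrame whose row labels are the patient IDs
by_patient = df_idx.set_index('patient_id')

# .loc looks a row up by its label
print(by_patient.loc['P002'])
```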

### Exploring Your DataFrame

In [6]:
# First, let's look at our data
print("=== FIRST FEW ROWS ===")
print(df.head())  # Shows first 5 rows by default

=== FIRST FEW ROWS ===
  patient_id  age gender     diagnosis  systolic_bp  hba1c  \
0       P001   45      M  Hypertension          140    5.2   
1       P002   62      F      Diabetes          130    8.1   
2       P003   38      F        Asthma          110    5.0   
3       P004   71      M      Diabetes          145    9.2   
4       P005   29      F  Hypertension          125    4.8   

   medication_adherence  
0                  0.85  
1                  0.72  
2                  0.95  
3                  0.68  
4                  0.90  


In [7]:
print("\n=== BASIC INFORMATION ===")
print(f"Shape: {df.shape}")  # (rows, columns)
print(f"Column names: {list(df.columns)}")


=== BASIC INFORMATION ===
Shape: (8, 7)
Column names: ['patient_id', 'age', 'gender', 'diagnosis', 'systolic_bp', 'hba1c', 'medication_adherence']


In [8]:
print("\n=== DATA TYPES ===")
print(df.dtypes)
# Data types matter: they determine which operations work on a column
# - object = text/categorical data
# - int64/float64 = numerical data
# - datetime64 = dates/times


=== DATA TYPES ===
patient_id               object
age                       int64
gender                   object
diagnosis                object
systolic_bp               int64
hba1c                   float64
medication_adherence    float64
dtype: object
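
Checking data types pays off when a numeric column arrives as text. The sketch below assumes a hypothetical lab export in which one HbA1c result was entered as the word `pending`. It uses `pd.to_numeric` to recover the numbers.

```python
import pandas as pd

# Hypothetical lab export where one HbA1c value was entered as text
labs = pd.DataFrame({'hba1c': ['5.2', '8.1', 'pending']})

# The column is stored as text, so it cannot be averaged as it stands
print(labs['hba1c'].dtype)

# pd.to_numeric converts to numbers; errors='coerce' turns entries
# that cannot be parsed into NaN (missing) instead of raising an error
labs['hba1c_numeric'] = pd.to_numeric(labs['hba1c'], errors='coerce')
print(labs)
print(f"Mean HbA1c (missing values skipped): {labs['hba1c_numeric'].mean():.2f}")
```

Coercing silently drops information, so it is worth counting how many values became missing before going further.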


In [9]:
print("\n=== SUMMARY STATISTICS ===")
print(df.describe())
# This gives you key statistics for numerical columns
# Very useful for quality checks!


=== SUMMARY STATISTICS ===
             age  systolic_bp     hba1c  medication_adherence
count   8.000000     8.000000  8.000000               8.00000
mean   51.250000   132.250000  6.337500               0.81875
std    15.097303    15.229201  1.736941               0.09433
min    29.000000   110.000000  4.800000               0.68000
25%    41.000000   122.500000  5.075000               0.74250
50%    50.000000   134.000000  5.350000               0.83500
75%    63.500000   141.250000  7.875000               0.88500
max    71.000000   155.000000  9.200000               0.95000
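
`describe()` summarises numerical columns only. For text columns such as gender or diagnosis, `value_counts()` is one way to get a comparable overview. This sketch rebuilds just those two columns from the sample data under the illustrative name `df_cat`.

```python
import pandas as pd

df_cat = pd.DataFrame({
    'gender': ['M', 'F', 'F', 'M', 'F', 'M', 'F', 'M'],
    'diagnosis': ['Hypertension', 'Diabetes', 'Asthma', 'Diabetes',
                  'Hypertension', 'Diabetes', 'Asthma', 'Hypertension'],
})

# Count how many patients fall into each category
diagnosis_counts = df_cat['diagnosis'].value_counts()
print(diagnosis_counts)

# normalize=True gives proportions instead of counts
gender_share = df_cat['gender'].value_counts(normalize=True)
print(gender_share)
```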


### Accessing Individual Columns

In [10]:
# Get a single column (this creates a "Series")
ages = df['age']
print("=== AGE COLUMN ===")
print(ages)
print(f"\nType: {type(ages)}")

=== AGE COLUMN ===
0    45
1    62
2    38
3    71
4    29
5    55
6    68
7    42
Name: age, dtype: int64

Type: <class 'pandas.core.series.Series'>


In [11]:
# Get multiple columns (this creates a DataFrame)
clinical_measures = df[['age', 'systolic_bp', 'hba1c']]
print("=== CLINICAL MEASURES ===")
print(clinical_measures.head())

=== CLINICAL MEASURES ===
   age  systolic_bp  hba1c
0   45          140    5.2
1   62          130    8.1
2   38          110    5.0
3   71          145    9.2
4   29          125    4.8


### Key DataFrame Attributes

In [12]:
print("=== USEFUL DATAFRAME ATTRIBUTES ===")
print(f"Shape: {df.shape}")
print(f"Size (total cells): {df.size}")
print(f"Number of patients: {len(df)}")
print(f"Column names: {list(df.columns)}")
print(f"Index: {list(df.index)}")

=== USEFUL DATAFRAME ATTRIBUTES ===
Shape: (8, 7)
Size (total cells): 56
Number of patients: 8
Column names: ['patient_id', 'age', 'gender', 'diagnosis', 'systolic_bp', 'hba1c', 'medication_adherence']
Index: [0, 1, 2, 3, 4, 5, 6, 7]


---

## 4. Filtering DataFrames

### Why Filter Data?

In clinical practice, you often need to focus on specific patient populations:
- Diabetic patients only
- Patients over 65
- Those with poor medication adherence
- Combinations of conditions

Filtering helps you create these targeted datasets.

### Basic Filtering with Conditions

In [13]:
# Filter 1: Find all diabetic patients
diabetic_patients = df[df['diagnosis'] == 'Diabetes']
print("=== DIABETIC PATIENTS ===")
print(diabetic_patients)
print(f"\nFound {len(diabetic_patients)} diabetic patients")

=== DIABETIC PATIENTS ===
  patient_id  age gender diagnosis  systolic_bp  hba1c  medication_adherence
1       P002   62      F  Diabetes          130    8.1                  0.72
3       P004   71      M  Diabetes          145    9.2                  0.68
5       P006   55      M  Diabetes          155    7.8                  0.75

Found 3 diabetic patients
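
It can help to see what happens inside the square brackets. The sketch below splits the filter into two steps on a four-patient subset of the sample data. The names `df_mask` and `is_diabetic` are illustrative.

```python
import pandas as pd

df_mask = pd.DataFrame({
    'patient_id': ['P001', 'P002', 'P003', 'P004'],
    'diagnosis': ['Hypertension', 'Diabetes', 'Asthma', 'Diabetes'],
})

# Step 1: the comparison yields one True/False per row (a boolean Series)
is_diabetic = df_mask['diagnosis'] == 'Diabetes'
print(is_diabetic)

# Step 2: indexing with that Series keeps only the rows marked True
print(df_mask[is_diabetic])

# True counts as 1, so summing the Series counts the matching patients
print(f"Diabetic patients: {is_diabetic.sum()}")
```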


In [14]:
# Filter 2: Find patients over 60
elderly_patients = df[df['age'] > 60]
print("=== PATIENTS OVER 60 ===")
print(elderly_patients[['patient_id', 'age', 'diagnosis']])

=== PATIENTS OVER 60 ===
  patient_id  age diagnosis
1       P002   62  Diabetes
3       P004   71  Diabetes
6       P007   68    Asthma


In [15]:
# Filter 3: Find patients with high blood pressure (>140 mmHg)
# Note: > is strict, so a reading of exactly 140 is excluded; use >= to include it
hypertensive_patients = df[df['systolic_bp'] > 140]
print("=== PATIENTS WITH HIGH BP ===")
print(hypertensive_patients[['patient_id', 'systolic_bp', 'diagnosis']])

=== PATIENTS WITH HIGH BP ===
  patient_id  systolic_bp diagnosis
3       P004          145  Diabetes
5       P006          155  Diabetes


### Multiple Conditions (AND Logic)

In [16]:
# Find elderly diabetic patients (both conditions must be true)
elderly_diabetics = df[(df['age'] > 60) & (df['diagnosis'] == 'Diabetes')]
print("=== ELDERLY DIABETIC PATIENTS ===")
print(elderly_diabetics)

# Note: Use & for AND, not 'and' ('and' raises a ValueError on whole columns)
# Always use parentheses around each condition, because & binds more tightly than > or ==

=== ELDERLY DIABETIC PATIENTS ===
  patient_id  age gender diagnosis  systolic_bp  hba1c  medication_adherence
1       P002   62      F  Diabetes          130    8.1                  0.72
3       P004   71      M  Diabetes          145    9.2                  0.68


### Multiple Conditions (OR Logic)

In [17]:
# Find patients who are either very young OR very old
extreme_ages = df[(df['age'] < 35) | (df['age'] > 65)]
print("=== PATIENTS WITH EXTREME AGES ===")
print(extreme_ages[['patient_id', 'age']])

# Note: Use | for OR, not 'or'

=== PATIENTS WITH EXTREME AGES ===
  patient_id  age
3       P004   71
4       P005   29
6       P007   68


### Filtering with Multiple Values

In [18]:
# Find patients with diabetes OR hypertension
chronic_conditions = df[df['diagnosis'].isin(['Diabetes', 'Hypertension'])]
print("=== PATIENTS WITH CHRONIC CONDITIONS ===")
print(chronic_conditions)

=== PATIENTS WITH CHRONIC CONDITIONS ===
  patient_id  age gender     diagnosis  systolic_bp  hba1c  \
0       P001   45      M  Hypertension          140    5.2   
1       P002   62      F      Diabetes          130    8.1   
3       P004   71      M      Diabetes          145    9.2   
4       P005   29      F  Hypertension          125    4.8   
5       P006   55      M      Diabetes          155    7.8   
7       P008   42      M  Hypertension          138    5.5   

   medication_adherence  
0                  0.85  
1                  0.72  
3                  0.68  
4                  0.90  
5                  0.75  
7                  0.82  


### Practical Clinical Example

In [19]:
# Real-world scenario: Find high-risk diabetic patients
# Criteria: Diabetes + HbA1c > 7.0 + Poor adherence < 0.80

high_risk_diabetics = df[
    (df['diagnosis'] == 'Diabetes') & 
    (df['hba1c'] > 7.0) & 
    (df['medication_adherence'] < 0.80)
]

print("=== HIGH-RISK DIABETIC PATIENTS ===")
print(high_risk_diabetics)
print(f"\nIdentified {len(high_risk_diabetics)} high-risk patients")

if len(high_risk_diabetics) > 0:
    print(f"Average HbA1c in this group: {high_risk_diabetics['hba1c'].mean():.1f}%")
    print(f"Average adherence: {high_risk_diabetics['medication_adherence'].mean():.2f}")

=== HIGH-RISK DIABETIC PATIENTS ===
  patient_id  age gender diagnosis  systolic_bp  hba1c  medication_adherence
1       P002   62      F  Diabetes          130    8.1                  0.72
3       P004   71      M  Diabetes          145    9.2                  0.68
5       P006   55      M  Diabetes          155    7.8                  0.75

Identified 3 high-risk patients
Average HbA1c in this group: 8.4%
Average adherence: 0.72
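
When a filter has several criteria, giving each one a name can make the code read like the clinical definition. This is a stylistic sketch, not the only way to write it. The variable names are illustrative, and the data is a four-patient subset of the sample.

```python
import pandas as pd

df_risk = pd.DataFrame({
    'patient_id': ['P001', 'P002', 'P004', 'P006'],
    'diagnosis': ['Hypertension', 'Diabetes', 'Diabetes', 'Diabetes'],
    'hba1c': [5.2, 8.1, 9.2, 7.8],
    'medication_adherence': [0.85, 0.72, 0.68, 0.75],
})

# Name each criterion separately
has_diabetes = df_risk['diagnosis'] == 'Diabetes'
poor_control = df_risk['hba1c'] > 7.0
poor_adherence = df_risk['medication_adherence'] < 0.80

# Combine the named conditions with &
high_risk = df_risk[has_diabetes & poor_control & poor_adherence]
print(high_risk['patient_id'].tolist())
```

Named conditions can also be reused, for example to count how many patients meet each criterion on its own.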


### Filtering Summary

In [20]:
print("=== FILTERING OPERATORS SUMMARY ===")
print("==  : equals")
print("!=  : not equals")
print(">   : greater than")
print(">=  : greater than or equal")
print("<   : less than")
print("<=  : less than or equal")
print("&   : AND (both conditions)")
print("|   : OR (either condition)")
print("~   : NOT (opposite)")
print(".isin([list]) : value is in the list")

=== FILTERING OPERATORS SUMMARY ===
==  : equals
!=  : not equals
>   : greater than
>=  : greater than or equal
<   : less than
<=  : less than or equal
&   : AND (both conditions)
|   : OR (either condition)
~   : NOT (opposite)
.isin([list]) : value is in the list
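
Two operators in this summary, `!=` and `~`, were not used in the earlier cells. Here is a short sketch of both on a four-patient illustrative table (`df_not`).

```python
import pandas as pd

df_not = pd.DataFrame({
    'patient_id': ['P001', 'P002', 'P003', 'P004'],
    'diagnosis': ['Hypertension', 'Diabetes', 'Asthma', 'Diabetes'],
})

# != keeps the rows whose value differs
non_diabetic = df_not[df_not['diagnosis'] != 'Diabetes']
print(non_diabetic)

# ~ flips every True/False in a condition; here it inverts .isin()
not_chronic = df_not[~df_not['diagnosis'].isin(['Diabetes', 'Hypertension'])]
print(not_chronic)
```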


---

## 5. Putting It All Together: A Clinical Workflow

Let's walk through a typical clinical data analysis workflow using the sample data:

In [21]:
print("=== CLINICAL DATA ANALYSIS WORKFLOW ===")

# Step 1: Load and examine the data
print(f"1. Loaded data: {df.shape[0]} patients, {df.shape[1]} variables")

# Step 2: Data quality check
print(f"2. Missing values: {df.isnull().sum().sum()}")

# Step 3: Patient population summary
print("3. Patient population:")
print(f"   - Age range: {df['age'].min()}-{df['age'].max()} years")
print(f"   - Gender distribution: {df['gender'].value_counts().to_dict()}")
print(f"   - Diagnoses: {df['diagnosis'].value_counts().to_dict()}")

# Step 4: Clinical insights
print("4. Clinical insights:")
diabetics = df[df['diagnosis'] == 'Diabetes']
if len(diabetics) > 0:
    print(f"   - Diabetic patients: {len(diabetics)}")
    print(f"   - Average HbA1c: {diabetics['hba1c'].mean():.1f}%")
    print(f"   - Patients with HbA1c > 7%: {len(diabetics[diabetics['hba1c'] > 7.0])}")

hypertensives = df[df['diagnosis'] == 'Hypertension']
if len(hypertensives) > 0:
    print(f"   - Hypertensive patients: {len(hypertensives)}")
    print(f"   - Average systolic BP: {hypertensives['systolic_bp'].mean():.0f} mmHg")

=== CLINICAL DATA ANALYSIS WORKFLOW ===
1. Loaded data: 8 patients, 7 variables
2. Missing values: 0
3. Patient population:
   - Age range: 29-71 years
   - Gender distribution: {'M': 4, 'F': 4}
   - Diagnoses: {'Hypertension': 3, 'Diabetes': 3, 'Asthma': 2}
4. Clinical insights:
   - Diabetic patients: 3
   - Average HbA1c: 8.4%
   - Patients with HbA1c > 7%: 3
   - Hypertensive patients: 3
   - Average systolic BP: 134 mmHg


---

## 6. Next Steps and Best Practices

### What You've Learned
✅ How to import pandas and other essential modules  
✅ How to read CSV files containing clinical data  
✅ How to explore and understand DataFrames  
✅ How to filter data to find specific patient populations  

### Best Practices for Clinical Data

In [22]:
# Always check your data after loading
print("=== DATA VALIDATION CHECKLIST ===")
print("✓ Check data shape and types")
print("✓ Look for missing values")
print("✓ Verify value ranges make clinical sense")
print("✓ Check for duplicate patients")
print("✓ Validate categorical values")

# Example validation
print(f"\nAge range check: {df['age'].min()}-{df['age'].max()} years")
print(f"Systolic BP range: {df['systolic_bp'].min()}-{df['systolic_bp'].max()} mmHg")
print(f"HbA1c range: {df['hba1c'].min():.1f}-{df['hba1c'].max():.1f}%")

=== DATA VALIDATION CHECKLIST ===
✓ Check data shape and types
✓ Look for missing values
✓ Verify value ranges make clinical sense
✓ Check for duplicate patients
✓ Validate categorical values

Age range check: 29-71 years
Systolic BP range: 110-155 mmHg
HbA1c range: 4.8-9.2%
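
The checklist items can each be turned into a line or two of pandas. The sketch below uses a small made-up table (`df_check`) with deliberate problems. The 50-300 mmHg plausibility limits and the expected gender codes `M` and `F` are assumptions for illustration, not clinical guidance.

```python
import pandas as pd

# Illustrative data with deliberate problems: a repeated patient ID,
# an implausible systolic BP, and an unexpected gender code
df_check = pd.DataFrame({
    'patient_id': ['P001', 'P002', 'P002', 'P003'],
    'gender': ['M', 'F', 'F', 'X'],
    'systolic_bp': [140, 130, 130, 900],
})

# Duplicate patients: duplicated() flags repeats after the first occurrence
n_duplicates = df_check['patient_id'].duplicated().sum()
print(f"Duplicate patient IDs: {n_duplicates}")

# Value ranges: between() is inclusive at both ends (limits are assumed)
out_of_range = df_check[~df_check['systolic_bp'].between(50, 300)]
print(out_of_range)

# Categorical values: anything outside the expected codes (assumed M/F)
unexpected_gender = df_check[~df_check['gender'].isin(['M', 'F'])]
print(unexpected_gender)
```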


### What's Next?

This notebook covered the fundamentals. In your data science journey, you'll learn:
- More advanced filtering with `.loc[]` and `.iloc[]`
- Grouping and aggregating data (e.g., average by diagnosis)
- Merging datasets (combining lab results with demographics)
- Handling missing data
- Creating visualizations
- Statistical analysis
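
As a brief preview of the first two items, here is a hedged sketch of `.loc[]`, `.iloc[]`, and grouping on a four-patient illustrative table (`df_next`). It is a taste of the syntax, not a full treatment.

```python
import pandas as pd

df_next = pd.DataFrame({
    'patient_id': ['P001', 'P002', 'P003', 'P004'],
    'diagnosis': ['Hypertension', 'Diabetes', 'Asthma', 'Diabetes'],
    'systolic_bp': [140, 130, 110, 145],
})

# .loc[rows, columns] filters rows and picks columns in one step
diabetic_bp = df_next.loc[df_next['diagnosis'] == 'Diabetes',
                          ['patient_id', 'systolic_bp']]
print(diabetic_bp)

# .iloc selects by position: first two rows, first column
first_ids = df_next.iloc[:2, 0]
print(first_ids)

# groupby + mean: average systolic BP within each diagnosis
mean_bp = df_next.groupby('diagnosis')['systolic_bp'].mean()
print(mean_bp)
```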

### Resources for Further Learning

- **Pandas Documentation**: https://pandas.pydata.org/docs/
- **Practice with real datasets**: Kaggle, UCI ML Repository
- **Clinical data standards**: HL7 FHIR, OMOP Common Data Model

---

## Practice Exercises

Try these exercises to reinforce your learning:

In [23]:
print("=== PRACTICE EXERCISES ===")
print("1. Find all female patients over 50")
print("2. Calculate the average medication adherence by diagnosis")
print("3. Find patients with systolic BP between 120-140 mmHg")
print("4. Identify patients who might need medication adjustment")
print("   (Hint: Consider diagnosis + relevant biomarker values)")

=== PRACTICE EXERCISES ===
1. Find all female patients over 50
2. Calculate the average medication adherence by diagnosis
3. Find patients with systolic BP between 120-140 mmHg
4. Identify patients who might need medication adjustment
   (Hint: Consider diagnosis + relevant biomarker values)


**Congratulations!** You now have the foundation to work with clinical data using pandas. This is your first step toward leveraging data science for personalised medicine.