# Data Science for Personalized Medicine
## Introduction to Jupyter Notebooks, Pandas, and Machine Learning

This notebook demonstrates the fundamental concepts of data science applied to personalized medicine. We'll cover:

1. **Reading data** with pandas
2. **Exploring and filtering** patient data
3. **Visualizing** clinical patterns
4. **Training a Random Forest model** to predict treatment outcomes

---

## 1. Import Required Libraries

First, we'll import the essential libraries for data science in healthcare.  
  
If you've used R before, this is similar to loading a package with `library(dplyr)`.

In [2]:
# Data manipulation and analysis
import pandas as pd
import numpy as np

# Data visualization
import matplotlib.pyplot as plt

# Machine learning
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.preprocessing import LabelEncoder

# Set random seed for reproducibility
np.random.seed(42)

print("Libraries imported successfully!")

Libraries imported successfully!


## 2. Create Synthetic Patient Data

For this demonstration, we'll create a synthetic dataset representing patients with cardiovascular conditions. In practice, you would load real clinical data from CSV files, databases, or electronic health records.  
  
You can skip this cell if you only want to see the final dataset and aren't interested in how we produce it. 

In [3]:
# Create synthetic patient data
np.random.seed(42)
n_patients = 1000

# Generate patient demographics and clinical measurements
patient_data = {
    'patient_id': range(1, n_patients + 1),
    'age': np.random.normal(65, 12, n_patients).astype(int),
    'gender': np.random.choice(['Male', 'Female'], n_patients),
    'bmi': np.random.normal(28, 5, n_patients),
    'systolic_bp': np.random.normal(140, 20, n_patients),
    'diastolic_bp': np.random.normal(90, 10, n_patients),
    'cholesterol': np.random.normal(220, 40, n_patients),
    'glucose': np.random.normal(120, 30, n_patients),
    'smoking': np.random.choice(['Never', 'Former', 'Current'], n_patients, p=[0.5, 0.3, 0.2]),
    'family_history': np.random.choice(['Yes', 'No'], n_patients, p=[0.3, 0.7]),
    'exercise_hours_week': np.random.exponential(2, n_patients)
}

# Create DataFrame
df = pd.DataFrame(patient_data)

# Ensure realistic ranges
df['age'] = np.clip(df['age'], 18, 95)
df['bmi'] = np.clip(df['bmi'], 15, 50)
df['systolic_bp'] = np.clip(df['systolic_bp'], 90, 200)
df['diastolic_bp'] = np.clip(df['diastolic_bp'], 60, 120)
df['cholesterol'] = np.clip(df['cholesterol'], 120, 350)
df['glucose'] = np.clip(df['glucose'], 70, 250)
df['exercise_hours_week'] = np.clip(df['exercise_hours_week'], 0, 15)

# Create treatment outcome based on risk factors (this simulates real-world relationships)
# Higher risk = lower probability of positive treatment outcome
risk_score = (
    (df['age'] > 70) * 0.2 +
    (df['bmi'] > 30) * 0.15 +
    (df['systolic_bp'] > 150) * 0.2 +
    (df['cholesterol'] > 240) * 0.15 +
    (df['smoking'] == 'Current') * 0.25 +
    (df['family_history'] == 'Yes') * 0.1 +
    (df['exercise_hours_week'] < 1) * 0.1
)

# Convert risk score to treatment outcome (0 = Poor, 1 = Good)
treatment_probability = 1 / (1 + np.exp(5 * (risk_score - 0.5)))  # Logistic function
df['treatment_outcome'] = np.random.binomial(1, treatment_probability, n_patients)
df['treatment_outcome'] = df['treatment_outcome'].map({0: 'Poor', 1: 'Good'})

print(f"Created dataset with {len(df)} patients")
print(f"Columns: {list(df.columns)}")

Created dataset with 1000 patients
Columns: ['patient_id', 'age', 'gender', 'bmi', 'systolic_bp', 'diastolic_bp', 'cholesterol', 'glucose', 'smoking', 'family_history', 'exercise_hours_week', 'treatment_outcome']
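
To see how a risk score turns into a probability of a good outcome, the logistic mapping from the cell above can be evaluated at a few hand-picked scores. This is a minimal sketch; the sample scores and the variable names are illustrative assumptions, not values taken from the dataset.

```python
import numpy as np

# Illustrative risk scores (assumed values, not taken from the dataset)
example_risk_scores = np.array([0.0, 0.25, 0.5, 0.75, 1.0])

# Same decreasing logistic function used to build treatment_outcome
good_outcome_probability = 1 / (1 + np.exp(5 * (example_risk_scores - 0.5)))

for score, prob in zip(example_risk_scores, good_outcome_probability):
    print(f"risk_score={score:.2f} -> P(Good)={prob:.3f}")
```

A score of 0.5 maps to a probability of exactly 0.5, and higher scores push the probability of a good outcome toward zero.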


## 3. Reading and Exploring Data with Pandas

Now let's explore our dataset using pandas - the fundamental tool for data manipulation in Python.
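
With real clinical data, the DataFrame would typically come from a file rather than from a dictionary. The sketch below reads a small CSV with `pd.read_csv`; the CSV text is made up and stands in for a hypothetical file such as `patients.csv`.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a real file such as "patients.csv"
csv_text = """patient_id,age,gender,systolic_bp
1,70,Male,138.3
2,63,Female,111.0
3,72,Male,121.6
"""

# pd.read_csv accepts a file path or any file-like object
example_df = pd.read_csv(io.StringIO(csv_text))

print(example_df.shape)
print(example_df.dtypes)
```

Passing a path instead, for example `pd.read_csv("patients.csv")`, works the same way once such a file exists.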

In [4]:
# Display basic information about the dataset
print("Dataset Shape:", df.shape)


Dataset Shape: (1000, 12)


In [5]:
print("\nFirst 5 rows:")
df.head()


First 5 rows:


| | patient_id | age | gender | bmi | systolic_bp | diastolic_bp | cholesterol | glucose | smoking | family_history | exercise_hours_week | treatment_outcome |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 70 | Male | 26.173392 | 138.331241 | 109.725422 | 253.586555 | 77.668295 | Former | No | 0.216361 | Poor |
| 1 | 2 | 63 | Female | 28.923402 | 111.007096 | 76.140121 | 165.699298 | 144.094642 | Never | No | 1.896558 | Good |
| 2 | 3 | 72 | Male | 21.264369 | 121.562804 | 95.055892 | 183.973145 | 127.188826 | Never | No | 0.197896 | Poor |
| 3 | 4 | 83 | Female | 23.14193 | 119.920853 | 104.891131 | 181.431997 | 141.991628 | Former | No | 2.807144 | Good |
| 4 | 5 | 62 | Male | 34.00207 | 144.145347 | 112.714497 | 263.202242 | 70.0 | Never | No | 0.041448 | Poor |


In [6]:
# Get summary statistics
print("Summary Statistics:")
df.describe()

Summary Statistics:


| | patient_id | age | bmi | systolic_bp | diastolic_bp | cholesterol | glucose | exercise_hours_week |
|---|---|---|---|---|---|---|---|---|
| count | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 |
| mean | 500.5 | 64.717 | 28.155972 | 140.087859 | 89.535834 | 219.454459 | 119.757603 | 2.0176 |
| std | 288.819436 | 11.674125 | 4.899528 | 19.979528 | 10.10871 | 39.205808 | 29.095575 | 1.919296 |
| min | 1.0 | 26.0 | 15.0 | 90.0 | 60.0 | 120.0 | 70.0 | 0.004286 |
| 25% | 250.75 | 57.0 | 24.89351 | 125.795433 | 82.849786 | 192.809278 | 99.83755 | 0.614676 |
| 50% | 500.5 | 65.0 | 28.083845 | 140.070712 | 89.596407 | 219.718642 | 118.632253 | 1.437502 |
| 75% | 750.25 | 72.0 | 31.421283 | 153.647426 | 95.854506 | 246.515244 | 140.395475 | 2.845803 |
| max | 1000.0 | 95.0 | 43.965538 | 200.0 | 120.0 | 344.516408 | 212.948983 | 15.0 |


In [7]:
# Check data types and missing values
# df.info() prints its report directly and returns None, so it needs no print()
print("Data Types and Missing Values:")
df.info()

Data Types and Missing Values:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 12 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   patient_id           1000 non-null   int64  
 1   age                  1000 non-null   int64  
 2   gender               1000 non-null   object 
 3   bmi                  1000 non-null   float64
 4   systolic_bp          1000 non-null   float64
 5   diastolic_bp         1000 non-null   float64
 6   cholesterol          1000 non-null   float64
 7   glucose              1000 non-null   float64
 8   smoking              1000 non-null   object 
 9   family_history       1000 non-null   object 
 10  exercise_hours_week  1000 non-null   float64
 11  treatment_outcome    1000 non-null   object 
dtypes: float64(6), int64(2), object(4)
memory usage: 93.9+ KB


In [8]:
print("\nMissing values per column:")
print(df.isnull().sum())


Missing values per column:
patient_id             0
age                    0
gender                 0
bmi                    0
systolic_bp            0
diastolic_bp           0
cholesterol            0
glucose                0
smoking                0
family_history         0
exercise_hours_week    0
treatment_outcome      0
dtype: int64
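
The synthetic data has no gaps, but real clinical data usually does. Below is a minimal sketch of two common responses, dropping incomplete rows or filling numeric gaps with the median. The `labs` frame and its values are made up for illustration, and which option is appropriate depends on the study.

```python
import numpy as np
import pandas as pd

# Small illustrative frame with gaps (values are made up)
labs = pd.DataFrame({
    'cholesterol': [200.0, np.nan, 240.0, 220.0],
    'smoking': ['Never', 'Current', None, 'Former'],
})

# Count the gaps per column, as in the cell above
print(labs.isnull().sum())

# Option 1: drop any row with a missing value
complete_rows = labs.dropna()

# Option 2: fill numeric gaps with the column median
filled = labs.copy()
filled['cholesterol'] = filled['cholesterol'].fillna(filled['cholesterol'].median())
print(filled)
```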


## 4. Filtering and Viewing Data

Let's learn how to filter and subset our patient data based on clinical criteria.  
  
Again, this is similar to the way that data frames are manipulated in R.

In [9]:
# Filter patients by age group
elderly_patients = df[df['age'] >= 70]
print(f"Number of elderly patients (≥70 years): {len(elderly_patients)}")

# Display first few elderly patients
print("\nFirst 5 elderly patients:")
elderly_patients.head()

Number of elderly patients (≥70 years): 330

First 5 elderly patients:


| | patient_id | age | gender | bmi | systolic_bp | diastolic_bp | cholesterol | glucose | smoking | family_history | exercise_hours_week | treatment_outcome |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 70 | Male | 26.173392 | 138.331241 | 109.725422 | 253.586555 | 77.668295 | Former | No | 0.216361 | Poor |
| 2 | 3 | 72 | Male | 21.264369 | 121.562804 | 95.055892 | 183.973145 | 127.188826 | Never | No | 0.197896 | Poor |
| 3 | 4 | 83 | Female | 23.14193 | 119.920853 | 104.891131 | 181.431997 | 141.991628 | Former | No | 2.807144 | Good |
| 6 | 7 | 83 | Male | 22.765445 | 125.565249 | 94.914295 | 239.492209 | 84.367043 | Never | No | 4.426708 | Good |
| 7 | 8 | 74 | Male | 30.683264 | 143.536417 | 95.697604 | 262.877195 | 108.080302 | Former | No | 2.535468 | Good |


In [10]:
# Filter patients by gender
male_patients = df[df['gender'] == "Male"]
print(f"Number of male patients: {len(male_patients)}")

# Display first few male patients
print("\nFirst 5 male patients:")
male_patients.head()

Number of male patients: 489

First 5 male patients:


| | patient_id | age | gender | bmi | systolic_bp | diastolic_bp | cholesterol | glucose | smoking | family_history | exercise_hours_week | treatment_outcome |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 70 | Male | 26.173392 | 138.331241 | 109.725422 | 253.586555 | 77.668295 | Former | No | 0.216361 | Poor |
| 2 | 3 | 72 | Male | 21.264369 | 121.562804 | 95.055892 | 183.973145 | 127.188826 | Never | No | 0.197896 | Poor |
| 4 | 5 | 62 | Male | 34.00207 | 144.145347 | 112.714497 | 263.202242 | 70.0 | Never | No | 0.041448 | Poor |
| 5 | 6 | 62 | Male | 24.715529 | 141.386887 | 85.956026 | 257.711541 | 129.344726 | Former | No | 0.5737 | Good |
| 6 | 7 | 83 | Male | 22.765445 | 125.565249 | 94.914295 | 239.492209 | 84.367043 | Never | No | 4.426708 | Good |


In [11]:
# Column names are case-sensitive: 'Gender' is not 'gender', so this line raises a KeyError
male_patients = df[df['Gender'] == "Male"]

KeyError: 'Gender'

In [12]:
# Multiple condition filtering - high-risk patients
high_risk_patients = df[
    (df['systolic_bp'] > 150) & 
    (df['cholesterol'] > 240) & 
    (df['smoking'] == 'Current')
]

print(f"High-risk patients (high BP + high cholesterol + smoking): {len(high_risk_patients)}")
print(f"Percentage of total: {len(high_risk_patients)/len(df)*100:.1f}%")

if len(high_risk_patients) > 0:
    print("\nTreatment outcomes in high-risk group:")
    print(high_risk_patients['treatment_outcome'].value_counts())

High-risk patients (high BP + high cholesterol + smoking): 16
Percentage of total: 1.6%

Treatment outcomes in high-risk group:
treatment_outcome
Poor    15
Good     1
Name: count, dtype: int64
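
The parentheses in the filter above matter: pandas combines boolean masks element-wise with `&`, `|`, and `~`, and `&` binds more tightly than comparisons such as `>`. The sketch below shows the pattern on a tiny made-up frame (the `demo` name and values are assumptions), alongside the equivalent `DataFrame.query` form.

```python
import pandas as pd

# Tiny illustrative frame (made-up values)
demo = pd.DataFrame({
    'systolic_bp': [160, 145, 170, 155],
    'cholesterol': [250, 260, 230, 245],
    'smoking': ['Current', 'Current', 'Never', 'Former'],
})

# Each comparison gets its own parentheses before the masks are combined with &
mask = (demo['systolic_bp'] > 150) & (demo['cholesterol'] > 240)
with_mask = demo[mask]

# The same filter written with DataFrame.query
with_query = demo.query("systolic_bp > 150 and cholesterol > 240")

print(with_mask)
```

Both forms select the same rows; `query` can be easier to read when many conditions are combined.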


In [13]:
# Group analysis - treatment outcomes by gender
print("Treatment outcomes by gender:")
gender_outcomes = df.groupby(['gender', 'treatment_outcome']).size().unstack(fill_value=0)
print(gender_outcomes)

# Calculate success rates
print("\nTreatment success rates by gender:")
success_rates = df.groupby('gender')['treatment_outcome'].apply(lambda x: (x == 'Good').mean())
print(success_rates)

Treatment outcomes by gender:
treatment_outcome  Good  Poor
gender                       
Female              317   194
Male                327   162

Treatment success rates by gender:
gender
Female    0.620352
Male      0.668712
Name: treatment_outcome, dtype: float64
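
The counts and the success rates can also be obtained in one step with `pd.crosstab`. This is a sketch on made-up data (six hypothetical patients), not a replacement for the cell above.

```python
import pandas as pd

# Made-up outcomes for six patients
demo = pd.DataFrame({
    'gender': ['Male', 'Male', 'Male', 'Female', 'Female', 'Female'],
    'treatment_outcome': ['Good', 'Good', 'Poor', 'Good', 'Poor', 'Poor'],
})

# normalize='index' turns each row of counts into proportions that sum to 1
rates = pd.crosstab(demo['gender'], demo['treatment_outcome'], normalize='index')
print(rates)
```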


## 6. Preparing Data for Machine Learning

Before we can train our Random Forest model, we need to prepare the data:

In [14]:
# Prepare features for machine learning
# Only numeric columns are used here; categorical columns (gender, smoking,
# family_history) would first need converting to numerical format

# Select features for the model
feature_columns = [
    'age', 'bmi', 'systolic_bp', 'diastolic_bp', 'cholesterol', 'glucose',
    'exercise_hours_week'
]

X = df[feature_columns]
y = df['treatment_outcome']

print(f"\nFeature matrix shape: {X.shape}")
print(f"Target vector shape: {y.shape}")
print(f"\nTarget classes: {y.unique()}")


Feature matrix shape: (1000, 7)
Target vector shape: (1000,)

Target classes: ['Poor' 'Good']
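
To include categorical columns such as `gender` or `smoking` as features, they would first need a numeric encoding. One common option is one-hot encoding with `pd.get_dummies`, sketched below on made-up rows; this is an illustration under that assumption, not the approach used in this notebook.

```python
import pandas as pd

# Made-up rows with the same categorical columns as the patient dataset
demo = pd.DataFrame({
    'age': [70, 63, 72],
    'gender': ['Male', 'Female', 'Male'],
    'smoking': ['Former', 'Never', 'Current'],
})

# One-hot encode the categorical columns; numeric columns pass through unchanged
encoded = pd.get_dummies(demo, columns=['gender', 'smoking'], dtype=int)
print(encoded.columns.tolist())
```

Each category becomes its own 0/1 column, so the encoded frame can be passed to scikit-learn directly.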


## 7. Training a Random Forest Model

Random Forest is a common choice for medical prediction tasks because it:
- Handles numeric and encoded categorical features together, without feature scaling
- Provides feature importance rankings
- Is relatively interpretable
- Performs well with limited data preprocessing

In [15]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set size: {X_train.shape[0]} patients")
print(f"Testing set size: {X_test.shape[0]} patients")
print(f"\nTraining set outcome distribution:")
print(y_train.value_counts())

Training set size: 800 patients
Testing set size: 200 patients

Training set outcome distribution:
treatment_outcome
Good    515
Poor    285
Name: count, dtype: int64
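
The `stratify=y` argument keeps the class balance the same in both splits. The sketch below checks this on made-up labels with a 70/30 balance; the variable names are assumptions chosen to avoid clashing with the notebook's own.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up labels: 70% 'Good', 30% 'Poor'
labels = pd.Series(['Good'] * 70 + ['Poor'] * 30)
features = pd.DataFrame({'x': range(100)})

X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.2, random_state=0, stratify=labels
)

# Both splits keep the original 70/30 class balance
print(y_tr.value_counts(normalize=True))
print(y_te.value_counts(normalize=True))
```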


In [16]:
# Create and train the Random Forest model
rf_model = RandomForestClassifier(
    n_estimators=100,        # Number of trees in the forest
    max_depth=10,           # Maximum depth of trees
    min_samples_split=5,    # Minimum samples required to split a node
    min_samples_leaf=2,     # Minimum samples required at a leaf node
    random_state=42         # For reproducibility
)

# Train the model
print("Training Random Forest model...")
rf_model.fit(X_train, y_train)
print("Model training completed!")

Training Random Forest model...
Model training completed!
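
A fitted Random Forest exposes its feature importance rankings through the `feature_importances_` attribute. The sketch below uses made-up data in which only one feature carries signal; `toy_X`, `toy_y`, and `toy_model` are illustrative names, not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Made-up data: the label depends on 'signal' only; 'noise' is irrelevant
toy_X = pd.DataFrame({
    'signal': rng.normal(size=300),
    'noise': rng.normal(size=300),
})
toy_y = np.where(toy_X['signal'] > 0, 'Good', 'Poor')

toy_model = RandomForestClassifier(n_estimators=50, random_state=0)
toy_model.fit(toy_X, toy_y)

# feature_importances_ is aligned with the training columns and sums to 1
importances = pd.Series(toy_model.feature_importances_, index=toy_X.columns)
print(importances.sort_values(ascending=False))
```

For the model trained above, the equivalent would be `pd.Series(rf_model.feature_importances_, index=feature_columns)`.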


In [17]:
# Make predictions on the test set
# Store them in y_pred so that y_test keeps the true labels
y_pred = rf_model.predict(X_test)

In [18]:
print("Accuracy = ", accuracy_score(y_test, y_pred))

In [None]:
print(classification_report(y_test, y_pred))
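
Accuracy and the classification report both compare true labels with predicted labels, and a confusion matrix shows the same comparison as counts. The sketch below uses six made-up labels (the `true_labels` and `predicted_labels` lists are assumptions for illustration) so that the numbers can be checked by hand.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up true and predicted outcomes for six patients
true_labels = ['Good', 'Good', 'Poor', 'Poor', 'Good', 'Poor']
predicted_labels = ['Good', 'Poor', 'Poor', 'Good', 'Good', 'Poor']

# Rows are true classes, columns are predicted classes, in the order given by labels
cm = confusion_matrix(true_labels, predicted_labels, labels=['Good', 'Poor'])
print(cm)

# Accuracy is the share of predictions on the diagonal of the matrix
acc = accuracy_score(true_labels, predicted_labels)
print(f"Accuracy = {acc:.3f}")
```

Here four of the six predictions match the true label, so the accuracy is 4/6.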



We will discuss model evaluation more in the second lesson, including how to interpret these results!

# Hints  
  
Coding is about iteration and improvement; it is normal not to get things right the first time. But what can you do if you get stuck or your code is not working as expected?  


### 1) Read any error messages you get: they may have a clue about what has gone wrong

In [27]:
male_patients = df[df['Gender'] == "Male"]

KeyError: 'Gender'
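
A `KeyError` names the key that pandas could not find, so comparing it with the actual column names usually reveals the problem. The sketch below shows one way to do this on a made-up two-row frame; `demo` and `missing_key` are illustrative names.

```python
import pandas as pd

demo = pd.DataFrame({'gender': ['Male', 'Female'], 'age': [70, 63]})

missing_key = None
try:
    demo[demo['Gender'] == 'Male']
except KeyError as err:
    # The exception carries the key that pandas could not find
    missing_key = err.args[0]
    print(f"KeyError: {err}")
    # Listing the real column names usually reveals the typo
    print("Available columns:", list(demo.columns))
```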

### 2) Try asking colleagues for help; they may have experienced the same issue

### 3) Ask a chatbot, like ChatGPT. 
  
Whilst it is always best to try to fix problems yourself first (to aid in learning), AI chatbots like ChatGPT are really useful in coding.  
  
If you are really stuck, explain your problem to the chatbot and it may be able to explain what is going wrong.