<a href="https://colab.research.google.com/github/logannye/research/blob/main/Discharge_Notes2vec.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook converts each patient's most recent discharge note from MIMIC-IV-Note into a 768-dimensional vector embedding by mean-pooling the token outputs of Bio_ClinicalBERT, a pretrained model hosted on Hugging Face:
https://huggingface.co/emilyalsentzer/Bio_ClinicalBERT

Computing and downloading the vector files

In [1]:
# Step 0: Mount Google Drive to access the dataset
from google.colab import drive
drive.mount('/content/drive')

# Step 1: Install the Transformers and Pandas libraries
!pip install transformers pandas

# Step 2: Import necessary libraries
from transformers import AutoTokenizer, AutoModel
import torch
import pandas as pd
import numpy as np

# Step 3: Load the tokenizer and model for Bio_ClinicalBERT
tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
model = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")

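# Function to vectorize a single clinical note
def vectorize_clinical_note(text):
    # Truncate to BERT's 512-token limit. A single note needs no padding, and
    # padding to max_length would pull [PAD] positions into the mean below.
    inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean-pool the token embeddings into one vector of shape (1, 768)
    embeddings = outputs.last_hidden_state.mean(dim=1)
    return embeddings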

# Load the subject IDs for the two subsets
pancan_subj_path = '/content/drive/MyDrive/Galen-Health/Datasets/mimic-iv-note/note/pancan_subj.npy'
safe_subj_path = '/content/drive/MyDrive/Galen-Health/Datasets/mimic-iv-note/note/safe_subj.npy'
pancan_subj = np.load(pancan_subj_path, allow_pickle=True).tolist()
safe_subj = np.load(safe_subj_path, allow_pickle=True).tolist()

# Load the clinical notes DataFrame
csv_file_path = '/content/drive/MyDrive/Galen-Health/Datasets/mimic-iv-note/note/discharge.csv'
df_notes = pd.read_csv(csv_file_path)

# Function to filter for the last note per patient and vectorize, adding zero vector if not present
def filter_and_vectorize_last_notes(df, patient_list, id_col='subject_id', note_col='text', time_col='charttime', vector_length=768):
    # Initialize a dictionary to hold vectors for all patients, defaulting to zero vectors
    patient_vectors = {patient_id: np.zeros((vector_length,)) for patient_id in patient_list}

    # Filter for the patients of interest who are in the DataFrame
    df_filtered = df[df[id_col].isin(patient_list)]

    # Sort by patient ID and chart time, then drop duplicates to keep only the last entry per patient
    df_last_notes = df_filtered.sort_values(by=[id_col, time_col]).drop_duplicates(subset=id_col, keep='last').reset_index(drop=True)

    # Vectorize the final note for each patient in df_last_notes
    for _, row in df_last_notes.iterrows():
        patient_id = row[id_col]
        note_text = row[note_col]
        patient_vectors[patient_id] = vectorize_clinical_note(note_text).numpy().flatten()  # Flatten to ensure it's a 1D array

    # Stack the vectors in patient_list order, one row per entry (repeated IDs included)
    vectors = np.array([patient_vectors[patient_id] for patient_id in patient_list])

    return vectors

# Vectorize the final clinical note for each patient in both groups, ensuring a zero vector for missing subjects
pancan_vectors = filter_and_vectorize_last_notes(df_notes, pancan_subj, id_col='subject_id', note_col='text', time_col='charttime')
safe_vectors = filter_and_vectorize_last_notes(df_notes, safe_subj, id_col='subject_id', note_col='text', time_col='charttime')

# Save the vectors
np.save('/content/drive/MyDrive/Galen-Health/Datasets/mimic-iv-note/note/pancan_vectors.npy', pancan_vectors)
np.save('/content/drive/MyDrive/Galen-Health/Datasets/mimic-iv-note/note/safe_vectors.npy', safe_vectors)
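
The last-note selection inside `filter_and_vectorize_last_notes` relies on sorting by patient and chart time, then keeping the final row per patient. The sketch below replays that step on a toy table so the behavior is easy to inspect. The rows and the `toy_notes` name are invented for illustration, and parsing `charttime` with `pd.to_datetime` is an added precaution that assumes ISO-like timestamps; it is not part of the cell above.

```python
import pandas as pd

# Toy stand-in for discharge.csv; every value here is made up.
toy_notes = pd.DataFrame({
    "subject_id": [101, 102, 101, 103, 102],
    "charttime": [
        "2130-01-05 10:00:00",
        "2131-03-01 08:30:00",
        "2130-06-20 14:15:00",
        "2129-11-11 09:00:00",
        "2131-02-01 16:45:00",
    ],
    "text": ["101 early", "102 late", "101 late", "103 only", "102 early"],
})

# Parse timestamps so the sort is chronological rather than string-based.
toy_notes["charttime"] = pd.to_datetime(toy_notes["charttime"])

# Same pattern as the notebook: sort, then keep the last row per patient.
last_notes = (
    toy_notes.sort_values(by=["subject_id", "charttime"])
    .drop_duplicates(subset="subject_id", keep="last")
    .reset_index(drop=True)
)
print(last_notes[["subject_id", "text"]])
```

Each patient ends up with exactly one row, the one carrying the latest `charttime`.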


Mounted at /content/drive
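
Mean pooling over `last_hidden_state` weights every token position equally, so any padded positions in the input would be averaged in alongside the real tokens. One common way to avoid that is to weight the average by the attention mask. The following is a minimal NumPy sketch of that idea, not the notebook's own method: `masked_mean_pool` and the toy arrays are hypothetical, and in the notebook the inputs would correspond to `outputs.last_hidden_state.numpy()` and `inputs["attention_mask"].numpy()`.

```python
import numpy as np

def masked_mean_pool(hidden_states, attention_mask):
    """Average token vectors per sequence, counting only positions where the mask is 1.

    hidden_states:  array of shape (batch, seq_len, hidden_dim)
    attention_mask: array of shape (batch, seq_len), 1 for real tokens, 0 for padding
    """
    mask = attention_mask[:, :, None].astype(hidden_states.dtype)
    summed = (hidden_states * mask).sum(axis=1)
    # Clip the token count to at least 1 to avoid dividing by zero.
    counts = np.clip(mask.sum(axis=1), 1.0, None)
    return summed / counts

# Toy batch: 2 sequences, 3 token slots, hidden size 2 (values invented).
hidden = np.array([
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],  # last slot is padding
    [[2.0, 2.0], [4.0, 4.0], [6.0, 6.0]],      # no padding
])
mask = np.array([
    [1, 1, 0],
    [1, 1, 1],
])

pooled = masked_mean_pool(hidden, mask)  # padding excluded
naive = hidden.mean(axis=1)              # padding included
print(pooled)
print(naive)
```

For the first sequence the masked average uses only the two real tokens, while the naive average is dominated by the padded slot.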



In [4]:
from google.colab import files

# Download the saved .npy files
files.download('/content/drive/MyDrive/Galen-Health/Datasets/mimic-iv-note/note/pancan_vectors.npy')
files.download('/content/drive/MyDrive/Galen-Health/Datasets/mimic-iv-note/note/safe_vectors.npy')
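
Because patients without a discharge note keep an all-zero vector, it can be worth counting those placeholder rows after loading a saved file. The sketch below shows one way to do that. It writes a small invented array to a temporary directory instead of reading the real `pancan_vectors.npy`, and the names `toy_vectors` and `missing_mask` are hypothetical.

```python
import os
import tempfile

import numpy as np

# Toy stand-in for a saved vector file: 3 patients, hidden size 4 (values invented).
toy_vectors = np.array([
    [0.1, -0.2, 0.3, 0.0],
    [0.0, 0.0, 0.0, 0.0],  # patient with no discharge note
    [0.5, 0.4, -0.1, 0.2],
])

# Round-trip through np.save / np.load, as the notebook does with its .npy files.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_vectors.npy")
    np.save(path, toy_vectors)
    loaded = np.load(path)

# A row is a placeholder if every component is exactly zero.
missing_mask = ~loaded.any(axis=1)
n_missing = int(missing_mask.sum())
print(f"{n_missing} of {len(loaded)} rows are zero placeholders")
```

Rows flagged by `missing_mask` can then be dropped or handled separately before the vectors are used downstream.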
