# Retrieve Reports, Extract Word Embeddings, and Compute Sentence Embeddings

This notebook prepares and processes narrative clinical reports for downstream survival analysis. It includes the following key steps:

1. **Loading and Preprocessing Raw Reports**  
   Raw patient-level data is imported from CSV using `import_and_prepare_dataframe`, which handles target encoding, date parsing, and computation of per-patient start and end dates.

2. **Loading the Pretrained NLP Model**  
   A BERT-based model and its tokenizer are loaded with `load_nlp_model` and moved to the appropriate compute device (CPU or GPU).

3. **Sentence Embedding Computation**  
   Sentence embeddings are computed from text using `compute_sentence_embeddings`, which supports two methods: taking the CLS token embedding, or SIF (smooth inverse frequency) weighting as proposed by Arora et al.

4. **Batch Processing and Export**  
   Using `process_and_export_embeddings`, the full dataset is split into batches; sentence embeddings are computed, unused columns are dropped, and each batch is exported to its own CSV file for efficient downstream use.
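
To make the two pooling methods in step 3 concrete, here is a minimal NumPy sketch. It is not the implementation of `compute_sentence_embeddings`, whose internals are not shown in this notebook. The function names, the toy token vectors, and the `word_prob` table are all assumptions made for illustration. The SIF part follows the usual recipe: weight each token vector by `a / (a + p(w))`, average, then remove the projection onto the first singular vector of the sentence matrix.

```python
import numpy as np

def cls_embedding(token_vectors):
    """Use the first token's vector (the [CLS] / <s> position) as the sentence embedding."""
    return np.asarray(token_vectors, dtype=float)[0]

def sif_weighted_average(tokens, token_vectors, word_prob, a=1e-3):
    """Average token vectors with SIF weights a / (a + p(w)); unseen tokens get weight 1."""
    vectors = np.asarray(token_vectors, dtype=float)
    weights = np.array([a / (a + word_prob.get(tok, 0.0)) for tok in tokens])
    return weights @ vectors / len(tokens)

def remove_first_component(sentence_matrix):
    """Subtract each row's projection onto the first right singular vector."""
    _, _, vt = np.linalg.svd(sentence_matrix, full_matrices=False)
    u = vt[0]
    return sentence_matrix - np.outer(sentence_matrix @ u, u)

# Toy data (hypothetical): three short "reports" with 4-dimensional token vectors.
rng = np.random.default_rng(0)
reports = [["tumor", "size", "stable"],
           ["no", "new", "lesion", "seen"],
           ["tumor", "seen"]]
token_vecs = [rng.normal(size=(len(r), 4)) for r in reports]
# Made-up unigram probabilities; in practice these come from corpus frequencies.
word_prob = {"tumor": 0.05, "size": 0.01, "stable": 0.01, "no": 0.2,
             "new": 0.02, "lesion": 0.01, "seen": 0.03}

cls_vectors = np.vstack([cls_embedding(v) for v in token_vecs])
averaged = np.vstack([sif_weighted_average(r, v, word_prob)
                      for r, v in zip(reports, token_vecs)])
sif_vectors = remove_first_component(averaged)
print(cls_vectors.shape, sif_vectors.shape)
```

Both methods yield one fixed-size vector per report. The CLS variant relies on the model having learned a useful summary at the first position, whereas SIF down-weights frequent tokens and strips the direction shared by all sentences.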

In [1]:
import types
import sys
from numbers import Real, Integral

# Create a fake module to emulate 'sklearn.utils._param_validation'
# (used by skglm in newer versions of scikit-learn, >=1.3)
param_validation = types.ModuleType("sklearn.utils._param_validation")

# Define a minimal replacement for Interval used in _parameter_constraints
class Interval:
    def __init__(self, dtype, left, right, closed="neither"):
        self.dtype = dtype
        self.left = left
        self.right = right
        self.closed = closed

# Define a minimal replacement for StrOptions used in _parameter_constraints
class StrOptions:
    def __init__(self, options):
        self.options = set(options)

# Add the custom classes to the fake module
param_validation.Interval = Interval
param_validation.StrOptions = StrOptions

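# Inject the fake module into sys.modules before skglm is imported, but only
# when the real module is unavailable (e.g. sklearn < 1.3). This prevents
# skglm from raising an ImportError without shadowing a working installation.
try:
    import sklearn.utils._param_validation  # noqa: F401
except ImportError:
    sys.modules["sklearn.utils._param_validation"] = param_validation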

In [2]:
import pandas as pd
import numpy as np
from tqdm import tqdm

In [3]:
import os

# Add the src directory to the Python path.
# __file__ is not defined inside a notebook, so resolve paths from the working directory.
notebook_dir = os.getcwd()
src_path = os.path.abspath(os.path.join(notebook_dir, '..', 'src', 'sigbert'))
if src_path not in sys.path:
    sys.path.insert(0, src_path)

# Now import our custom modules
from _utils import *


In [16]:
# Load the pretrained model
tokenizer, model, device = load_nlp_model(path_model="../models/OncoBERT_v1.0")

Some weights of CamembertForSequenceClassification were not initialized from the model checkpoint at ../models/OncoBERT_v1.0 and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## 3) Application

In [None]:
# Define path to input reports
path_import = ...
df = import_and_prepare_dataframe(path_import)

# Define export path
export_path = ...
print("export_path = ", export_path)

cols_to_drop = ['text', 'word_embeddings', 'embeddings']

# Process and export the dataset with sentence embeddings
df_short_fin = process_and_export_embeddings(
    df, tokenizer, model, device, export_path, 
    method_embd="Arora", cols_to_drop=cols_to_drop
)
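
The split, drop, and export pattern used by the call above can be sketched with pandas as follows. This is an illustration under stated assumptions, not the body of `process_and_export_embeddings`: the helper name `export_in_parts`, the `n_parts` and `prefix` parameters, the file naming scheme, and the toy columns are all hypothetical, and the embedding step is left out.

```python
import os
import tempfile

import numpy as np
import pandas as pd

def export_in_parts(df, export_dir, n_parts=3, cols_to_drop=(), prefix="part"):
    """Split df row-wise into n_parts, drop unused columns, and write one CSV per part."""
    os.makedirs(export_dir, exist_ok=True)
    paths = []
    # Split row positions rather than the frame itself, then select with iloc.
    for i, idx in enumerate(np.array_split(np.arange(len(df)), n_parts)):
        # errors="ignore" skips columns that are absent from this frame.
        part = df.iloc[idx].drop(columns=list(cols_to_drop), errors="ignore")
        path = os.path.join(export_dir, f"{prefix}_{i}.csv")
        part.to_csv(path, index=False)
        paths.append(path)
    return paths

# Toy frame (hypothetical columns) standing in for the prepared reports.
toy = pd.DataFrame({
    "ID": range(7),
    "text": ["report"] * 7,
    "emb_0": np.linspace(0.0, 1.0, 7),
})

with tempfile.TemporaryDirectory() as tmp:
    written = export_in_parts(toy, tmp, n_parts=3,
                              cols_to_drop=["text", "word_embeddings"])
    reloaded = pd.concat([pd.read_csv(p) for p in written], ignore_index=True)

print(len(written), list(reloaded.columns))
```

Dropping the raw text and per-token embeddings before writing keeps each CSV small, since only the identifiers and the fixed-size sentence vectors are needed downstream.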