# Example for converting patient data to instruction data

This notebook demonstrates the core workflow for converting raw clinical data into **Instruction Tuning** examples for the TwinWeaver model.

We will walk through the process of:
1.  **Loading Data**: Importing raw tabular data (longitudinal events and static demographics).
2.  **Configuration**: Setting up the pipeline to match your data schema.
3.  **Splitting**: Generating "Splits" (input/output samples) from a patient's timeline.
4.  **Conversion**: Transforming these splits into text-based **Instruction** (Input) and **Answer** (Target) pairs suitable for fine-tuning an LLM.

In [1]:
import pandas as pd

from twinweaver import (
    DataManager,
    Config,
    DataSplitterForecasting,
    DataSplitterEvents,
    ConverterInstruction,
    DataSplitter,
)



## Basic Setup


### Load Data

We require three standardized dataframes to construct the patient digital twin:

* **Events (`df_events`)**: The longitudinal history in 'long' format (one row per event). Required columns:
    * `patientid`: Unique patient identifier.
    * `date`: Date of the event.
    * `event_category`: High-level grouping (e.g., 'lab', 'drug', 'condition', 'lot').
    * `event_name`: Specific variable name (e.g., 'Hemoglobin', 'Metformin').
    * `event_value`: The result/value (e.g., '12.5', 'Start').
    * `event_descriptive_name`: Natural language description used in the text prompt.
* **Constant (`df_constant`)**: Static patient information (one row per patient). Contains demographics like birth year, gender, and histology.
* **Constant Description (`df_constant_description`)**: Metadata mapping constant columns to natural language descriptions. Columns: `variable`, `comment`.
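
Before handing data to the pipeline, it can help to confirm that the events table carries every required column listed above. The following is a minimal sketch using only the standard library; `REQUIRED_EVENT_COLUMNS` and `missing_event_columns` are hypothetical helper names for illustration and are not part of TwinWeaver.

```python
# Required columns of the events table, as listed in this notebook.
REQUIRED_EVENT_COLUMNS = (
    "patientid",
    "date",
    "event_category",
    "event_name",
    "event_value",
    "event_descriptive_name",
)


def missing_event_columns(columns):
    """Return the required event columns that are absent, in schema order."""
    present = set(columns)
    return [name for name in REQUIRED_EVENT_COLUMNS if name not in present]


# Example: a header that lacks the natural-language description column.
example_header = ["patientid", "date", "event_category", "event_name", "event_value"]
print(missing_event_columns(example_header))
```

With a pandas DataFrame, the same check can be run as `missing_event_columns(df_events.columns)`.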

In [2]:
# Load data - generated example data
df_events = pd.read_csv("./example_data/events.csv")
df_constant = pd.read_csv("./example_data/constant.csv")
df_constant_description = pd.read_csv("./example_data/constant_description.csv")

### Configuration and Data Manager

We initialize the `Config` object, which serves as the central control for column mapping, token limits, and prompt templates. You can override defaults here to match your specific dataset schema (e.g., specifying which columns in `df_constant` to include).

The `DataManager` then ingests the raw dataframes, handling preprocessing steps like date parsing, unique event mapping, and train/test splitting at the patient level.
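
Splitting at the patient level means that all events of one patient end up in the same partition, so no patient's history leaks from training into testing. The sketch below illustrates that idea in plain Python. It is not the `DataManager` implementation, and `split_patients`, `test_fraction`, and `seed` are assumed names chosen for this example.

```python
import random


def split_patients(patient_ids, test_fraction=0.2, seed=0):
    """Assign each unique patient to exactly one of 'train' or 'test'."""
    unique_ids = sorted(set(patient_ids))
    rng = random.Random(seed)  # fixed seed keeps the assignment reproducible
    rng.shuffle(unique_ids)
    n_test = round(len(unique_ids) * test_fraction)
    test_ids = set(unique_ids[:n_test])
    return {pid: ("test" if pid in test_ids else "train") for pid in unique_ids}


# Event rows repeat patient ids; the result still has one label per patient.
event_patient_ids = ["p0", "p0", "p1", "p2", "p2", "p3", "p4", "p4", "p4"]
assignment = split_patients(event_patient_ids, test_fraction=0.2)
print(assignment)
```

Because the label is attached to the patient rather than to individual rows, every event row inherits the split of its `patientid`.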

In [3]:
config = Config()  # Override values here to customize pipeline
config.constant_columns_to_use = [
    "birthyear",
    "gender",
    "histology",
    "smoking_history",
]  # Manually set from constant DF
config.constant_birthdate_column = "birthyear"


dm = DataManager(config=config)
dm.load_indication_data(df_events=df_events, df_constant=df_constant, df_constant_description=df_constant_description)
dm.process_indication_data()
dm.setup_unique_mapping_of_events()
dm.setup_dataset_splits()
dm.infer_var_types()

### Initialize Splitters and Converter

To generate diverse training examples, we use specialized **Data Splitters**:
* `DataSplitterEvents`: Identifies time points for predicting discrete outcomes (e.g., progression, death).
* `DataSplitterForecasting`: Identifies time points for forecasting continuous variables (e.g., future lab values).

The `ConverterInstruction` is the core engine that transforms these data points into tokenized text. It respects a token budget (e.g., 8192 tokens) to ensure the generated prompts fit within the model's context window.
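
To make the idea of a token budget concrete, here is a small sketch of one plausible strategy: keep the most recent visits that still fit. This is an assumption for illustration only. The notebook does not show how `ConverterInstruction` trims the history, and the whitespace-based `count_tokens` default is a stand-in for a real tokenizer.

```python
def fit_to_budget(visit_texts, budget, count_tokens=lambda text: len(text.split())):
    """Keep the most recent visits whose combined token count fits the budget."""
    kept = []
    used = 0
    for text in reversed(visit_texts):  # walk from newest to oldest
        cost = count_tokens(text)
        if used + cost > budget:
            break  # stop at the first visit that no longer fits
        kept.append(text)
        used += cost
    kept.reverse()  # restore chronological order
    return kept, used


visits = [
    "diagnosis C34.90 recorded",
    "PD-L1 is >= 50%",
    "hemoglobin is 13.2",
]
kept, used = fit_to_budget(visits, budget=7)
print(kept, used)
```

In this toy example the oldest visit is dropped because adding it would exceed the budget of 7 whitespace-separated tokens.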

In [4]:
# This data splitter handles event prediction tasks
data_splitter_events = DataSplitterEvents(dm, config=config)
data_splitter_events.setup_variables()

# This data splitter handles forecasting tasks
data_splitter_forecasting = DataSplitterForecasting(
    data_manager=dm,
    config=config,
)
# If you don't want to do forecasting QA, proportional sampling, or 3-sigma filtering, you can skip this step
data_splitter_forecasting.setup_statistics()

# We will also use the easier interface that combines both data splitters
data_splitter = DataSplitter(data_splitter_events, data_splitter_forecasting)

# Set up the instruction converter
converter = ConverterInstruction(
    nr_tokens_budget_total=8192,
    config=config,
    dm=dm,
    variable_stats=data_splitter_forecasting.variable_stats,  # Optional, needed for forecasting QA tasks
)
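
The comment in the cell above mentions 3-sigma filtering. As a rough illustration of the general technique (not TwinWeaver's implementation, which this notebook does not show), values farther than three standard deviations from a reference mean are treated as outliers and dropped. The function name and all numbers below are made up for the example.

```python
from statistics import mean, stdev


def three_sigma_filter(values, reference_mean, reference_std):
    """Keep values within three standard deviations of the reference mean."""
    lower = reference_mean - 3 * reference_std
    upper = reference_mean + 3 * reference_std
    return [v for v in values if lower <= v <= upper]


# Hypothetical reference sample used to estimate the statistics.
reference = [12.8, 13.1, 13.3, 12.9, 13.4, 13.0]
ref_mean = mean(reference)
ref_std = stdev(reference)

# 131.8 mimics a decimal-shift entry error and falls far outside the band.
candidates = [13.18, 13.3, 131.8]
print(three_sigma_filter(candidates, ref_mean, ref_std))
```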



## Examine patient data

The data manager exposes the IDs of all loaded patients. We pick one of them as an example.

In [5]:
patientid = dm.all_patientids[4]
patientid

'p4'

Let's check out this patient's data. `patient_data` is a dictionary with two keys:
- `"events"`: A pandas DataFrame containing all time-series events (original events and molecular data combined and sorted by date).
- `"constant"`: A pandas DataFrame containing the static (constant) data for the patient.

In [6]:
patient_data = dm.get_patient_data(patientid)
patient_data["events"].head(20)

| | patientid | date | event_category | event_name | event_value | event_descriptive_name | meta_data | source |
|---|---|---|---|---|---|---|---|---|
| 9778 | p4 | 2021-03-26 | main_diagnosis | initial_diagnosis | NSCLC | initial cancer diagnosis | Non-Small Cell Lung Cancer | events |
| 9779 | p4 | 2021-03-26 | diagnosis | C34.90 | diagnosed | Malignant neoplasm of unspecified part of unsp... | Malignant neoplasm of unspecified part of unsp... | events |
| 9798 | p4 | 2021-04-13 | signature | TMB | 3.7 | Tumor Mutational Burden | NGS | genetic |
| 9797 | p4 | 2021-04-13 | biomarker_ihc | PD-L1 | >= 50% | PD-L1 Expression (TPS) | IHC 22C3 | genetic |
| 9796 | p4 | 2021-04-13 | gene_sv | KEAP1 | Present | KEAP1 Missense | NGS | genetic |
| 9795 | p4 | 2021-04-13 | gene_sv | STK11 | Present | STK11 Missense | NGS | genetic |
| 9794 | p4 | 2021-04-13 | gene_sv | TP53 | Present | TP53 Missense | NGS | genetic |
| 9793 | p4 | 2021-04-13 | basic_biomarker | HER2 | Wild Type | HER2 | NGS | genetic |
| 9791 | p4 | 2021-04-13 | basic_biomarker | NTRK2 | Wild Type | NTRK2 | NGS | genetic |
| 9790 | p4 | 2021-04-13 | basic_biomarker | NTRK1 | Wild Type | NTRK1 | NGS | genetic |


In [7]:
patient_data["constant"]

| | patientid | birthyear | gender | histology | smoking_history | data_split |
|---|---|---|---|---|---|---|
| 4 | p4 | 1950 | Female | Squamous Cell Carcinoma | Former | train |


## Convert patient data to string

### Generate Training Splits

A single patient's timeline can yield multiple training examples. A **Split** represents a specific point in time (the "split date") where we divide the data:
* **Input**: History *before* the split date.
* **Target**: Future events or values *after* the split date.

The `get_splits_from_patient_with_target` method samples valid split points at random times around the lines of therapy and determines an appropriate target type for each (forecasting vs. event prediction). This lets the model learn from various stages of a patient's journey.
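
The core idea of a split can be sketched in a few lines of plain Python: everything up to the split date becomes model input, everything after it becomes the target. This is only an illustration of the concept, not the splitter's implementation. The split date below is chosen for the example, and placing events that fall exactly on the split date into the history is an assumption.

```python
from datetime import date


def split_timeline(events, split_date):
    """Divide (date, name, value) events into history and future at split_date."""
    history = [e for e in events if e[0] <= split_date]
    future = [e for e in events if e[0] > split_date]
    return history, future


# A shortened timeline using values that appear in this notebook.
events = [
    (date(2021, 3, 26), "initial cancer diagnosis", "NSCLC"),
    (date(2021, 4, 13), "PD-L1 Expression (TPS)", ">= 50%"),
    (date(2021, 7, 13), "hemoglobin - 718-7", "13.18"),
    (date(2021, 8, 3), "hemoglobin - 718-7", "13.3"),
]

history, future = split_timeline(events, split_date=date(2021, 6, 22))
print(len(history), "input events,", len(future), "target events")
```

Sampling several split dates along the same timeline yields several input/target pairs per patient.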

In [8]:
forecasting_splits, events_splits, reference_dates = data_splitter.get_splits_from_patient_with_target(
    patient_data,
)

For each split, we can now generate the instruction and answer strings. Here we pick the first split as an example.

In [9]:
split_idx = 0
p_converted = converter.forward_conversion(
    forecasting_splits=forecasting_splits[split_idx],
    event_splits=events_splits[split_idx],
    override_mode_to_select_forecasting="both",
)

Converted integer ages in birthyear to age format


### Inspect the Output

The `forward_conversion` method returns `p_converted`, a dictionary containing the final LLM training example:

* **`instruction`**: The full text prompt. It includes the patient's history (demographics + events) followed by the specific task questions.
* **`answer`**: The target completion string containing the correct answers for the tasks.
* **`meta`**: Structured metadata used to generate the text, useful for debugging or evaluation.

In [10]:
print(p_converted["instruction"])

The following is a patient, starting with the demographic data, following visit by visit everything that the patient experienced. All lab codes refer to LOINC codes.

Starting with demographic data:
	Age of patient at first event is 71 years,
	Gender of the patient is Female,
	Histological subtype of NSCLC is Squamous Cell Carcinoma,
	Smoking status at diagnosis is Former.

On the first visit, the patient experienced the following: 
	Malignant neoplasm of unspecified part of unspecified bronchus or lung (C34.90) is diagnosed,
	initial cancer diagnosis is nsclc.

2.57 weeks later, the patient visited and experienced the following: 
	<genetic>
	ALK is Wild Type,
	BRAF is Wild Type,
	EGFR is Wild Type,
	HER2 is Wild Type,
	KRAS is Wild Type,
	MET is Wild Type,
	NTRK1 is Wild Type,
	NTRK2 is Wild Type,
	NTRK3 is Wild Type,
	RET is Wild Type,
	ROS1 is Wild Type,
	PD-L1 Expression (TPS) is >= 50%,
	KEAP1 Missense is Present,
	STK11 Missense is Present,
	TP53 Missense is Present,
	Tumor Mutat
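
The relative time offset and the age shown in the prompt above can be checked against the raw tables with simple date arithmetic. The sketch below assumes that offsets are expressed in weeks rounded to two decimals and that age is the difference between the year of the first event and the birth year. Both assumptions are consistent with the output shown here but are not confirmed by the library's documentation.

```python
from datetime import date

first_visit = date(2021, 3, 26)   # diagnosis rows in patient_data["events"]
second_visit = date(2021, 4, 13)  # genetic rows in patient_data["events"]

# 18 days between the two visits, expressed in weeks.
weeks_between = (second_visit - first_visit).days / 7
print(f"{weeks_between:.2f} weeks later")  # 2.57 weeks later

birth_year = 1950  # from patient_data["constant"]
age_at_first_event = first_visit.year - birth_year
print(f"Age of patient at first event is {age_at_first_event} years")  # 71 years
```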

In [11]:
print(p_converted["answer"])

Task 1 is forecasting QA:
3 weeks later, the patient visited and experienced the following: 
	hemoglobin - 718-7 is b,
	potassium - 2823-3 is b.

3 weeks later, the patient visited and experienced the following: 
	hemoglobin - 718-7 is c,
	potassium - 2823-3 is b.

Task 2 is time to event prediction:
Here is the prediction: the event (next line of therapy) was not censored and occurred.

Task 3 is forecasting:
3 weeks later, the patient visited and experienced the following: 
	hemoglobin - 718-7 is 13.18,
	potassium - 2823-3 is 3.98.

3 weeks later, the patient visited and experienced the following: 
	hemoglobin - 718-7 is 13.3,
	potassium - 2823-3 is 3.91.




In [12]:
p_converted["answer"]

'Task 1 is forecasting QA:\n3 weeks later, the patient visited and experienced the following: \n\themoglobin - 718-7 is b,\n\tpotassium - 2823-3 is b.\n\n3 weeks later, the patient visited and experienced the following: \n\themoglobin - 718-7 is c,\n\tpotassium - 2823-3 is b.\n\nTask 2 is time to event prediction:\nHere is the prediction: the event (next line of therapy) was not censored and occurred.\n\nTask 3 is forecasting:\n3 weeks later, the patient visited and experienced the following: \n\themoglobin - 718-7 is 13.18,\n\tpotassium - 2823-3 is 3.98.\n\n3 weeks later, the patient visited and experienced the following: \n\themoglobin - 718-7 is 13.3,\n\tpotassium - 2823-3 is 3.91.\n\n'

## Reverse Conversion: Text to Structured Data

Finally, we demonstrate the **Reverse Conversion** process. This is the inverse of the instruction generation step. It takes the text string (which would be generated by the model during inference) and parses it back into structured Pandas DataFrames.

This capability is crucial for:
* **Evaluation**: Comparing the model's text predictions against ground truth data programmatically.
* **Integration**: Converting the model's narrative outputs back into downstream clinical systems or dashboards.

In this example, we take the `answer` string we just generated and confirm it can be reconstructed into a structured dataframe using the `reverse_conversion` method.

In [13]:
date = reference_dates["date"][0]
return_list = converter.reverse_conversion(p_converted["answer"], dm, date)
return_list[0]["result"]

| | date | event_category | event_name | event_descriptive_name | event_value | source |
|---|---|---|---|---|---|---|
| 0 | 2021-07-13 | lab | hemoglobin_-_718-7 | hemoglobin - 718-7 | b | events |
| 1 | 2021-07-13 | lab | potassium_-_2823-3 | potassium - 2823-3 | b | events |
| 2 | 2021-08-03 | lab | hemoglobin_-_718-7 | hemoglobin - 718-7 | c | events |
| 3 | 2021-08-03 | lab | potassium_-_2823-3 | potassium - 2823-3 | b | events |
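
To give an intuition for what such parsing involves, here is a simplified, standalone sketch that turns "N weeks later" blocks into dated rows using regular expressions. It is not the `reverse_conversion` implementation. The function name is hypothetical, offsets are assumed to accumulate from visit to visit, and the reference date of 2021-06-22 is inferred by subtracting three weeks from the first date in the table above.

```python
import re
from datetime import date, timedelta

VISIT_RE = re.compile(r"^(\d+(?:\.\d+)?) weeks later")
VALUE_RE = re.compile(r"^\t(.+) is (.+?)[,.]$")


def parse_forecast_block(text, reference_date):
    """Turn 'N weeks later' blocks into (date, name, value) rows."""
    rows = []
    current = reference_date
    for line in text.splitlines():
        visit = VISIT_RE.match(line)
        if visit:
            # Offsets are treated as cumulative from the previous visit.
            current = current + timedelta(weeks=float(visit.group(1)))
            continue
        value = VALUE_RE.match(line)
        if value:
            rows.append((current, value.group(1), value.group(2)))
    return rows


block = (
    "3 weeks later, the patient visited and experienced the following: \n"
    "\themoglobin - 718-7 is b,\n"
    "\tpotassium - 2823-3 is b.\n"
    "\n"
    "3 weeks later, the patient visited and experienced the following: \n"
    "\themoglobin - 718-7 is c,\n"
    "\tpotassium - 2823-3 is b.\n"
)

for row in parse_forecast_block(block, reference_date=date(2021, 6, 22)):
    print(row)
```

Under these assumptions the parsed dates are 2021-07-13 and 2021-08-03, matching the structured result above.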
