# Chunking via GPT4; Propositions

The raw reports below have been broken down into *propositions* with a GPT4 prompt. The goal is to either use an internal 200+ param model to create this 45k dataset for Flan-T5 training, or find a public dataset of roughly 46k reports (it must cover more than just discharge summaries) and use GPT4 to create the Propositionizer.

In [1]:
rpt_progress = [
  "The patient's name is Jordan M.",
  "The date of the progress note is June 2, 2025.",
  "The progress note was authored by Dr. A. Patel.",
  "Dr. A. Patel is a psychiatry resident.",
  "Jordan M. reported feeling 'a little more steady' on June 2, 2025.",
  "Jordan M. shared that writing in their journal has helped organize thoughts.",
  "Jordan M. shared that writing in their journal has reduced anxiety.",
  "Jordan M. expressed nervousness about returning home.",
  "Jordan M. acknowledged that the outpatient plan is 'a good next step.'",
  "Jordan M. was alert and oriented to person, place, and time.",
  "Jordan M.'s affect was brighter than on previous days.",
  "Jordan M. was not in acute distress on June 2, 2025.",
  "Jordan M. engaged actively in the morning CBT group.",
  "Jordan M. offered supportive comments to peers during the CBT group.",
  "Jordan M.'s mood appeared improved on June 2, 2025.",
  "Jordan M. showed increased motivation on June 2, 2025.",
  "Jordan M. showed increased insight on June 2, 2025.",
  "Jordan M. did not report suicidal ideation on June 2, 2025.",
  "Jordan M. showed a good response to sertraline.",
  "The treatment plan includes continuing sertraline at 100mg daily.",
  "The treatment plan includes encouraging Jordan M. to use journaling as a coping strategy.",
  "The treatment plan includes revisiting discharge planning on June 3, 2025."
]

In [2]:
rpt_referral = [
  "The patient's name is Jordan M.",
  "The date of the referral note is June 3, 2025.",
  "The referring clinician is Dr. A. Patel.",
  "Jordan M. was referred to the SunnyCare Outpatient Mental Health Program – CBT Track.",
  "Jordan M. is being discharged following stabilization.",
  "Jordan M.'s stabilization followed a suicide attempt.",
  "Jordan M.'s suicide attempt was related to depressive symptoms.",
  "Jordan M.'s suicide attempt was related to recent psychosocial stressors.",
  "Jordan M. showed a positive response to inpatient treatment.",
  "Jordan M. expressed motivation to continue structured therapy.",
  "Weekly Cognitive Behavioral Therapy (CBT) sessions were requested for Jordan M.",
  "Psychiatric follow-up for medication management was requested for Jordan M.",
  "Case management support was requested for Jordan M., if available.",
  "Jordan M. has been diagnosed with Major Depressive Disorder.",
  "The Major Depressive Disorder diagnosed in Jordan M. is recurrent.",
  "The Major Depressive Disorder diagnosed in Jordan M. is moderate in severity.",
  "Jordan M. is currently prescribed sertraline 100mg daily.",
  "Jordan M. engages well in group therapy.",
  "Jordan M. uses journaling as a coping skill.",
  "Jordan M.'s initial intake appointment is scheduled for June 10, 2025 at 2:00 PM.",
  "Jordan M. has been provided with contact details for the intake appointment.",
  "Jordan M. has been provided with a calendar reminder for the intake appointment."
]


In [3]:
rpt_discharge = [
  "The patient's name is Jordan M.",
  "Jordan M. is 29 years old.",
  "Jordan M. was admitted to the hospital on May 28, 2025.",
  "Jordan M. was discharged from the hospital on June 4, 2025.",
  "Jordan M. was diagnosed with Major Depressive Disorder.",
  "The Major Depressive Disorder diagnosed in Jordan M. is recurrent.",
  "The Major Depressive Disorder diagnosed in Jordan M. is moderate in severity.",
  "The reason for Jordan M.'s admission was a suicide attempt by overdose.",
  "Jordan M.'s suicide attempt followed a recent job loss.",
  "Jordan M.'s suicide attempt followed a period of social isolation.",
  "Jordan M. engaged well with the inpatient care team.",
  "Jordan M. participated in individual therapy during hospitalization.",
  "Jordan M. participated in group therapy during hospitalization.",
  "Jordan M. responded positively to medication adjustments.",
  "Jordan M. was prescribed sertraline during hospitalization.",
  "The dosage of sertraline for Jordan M. was titrated to 100mg.",
  "Jordan M. showed increased insight into psychological triggers.",
  "Jordan M. identified journaling as a helpful coping strategy.",
  "Jordan M. identified walking as a helpful coping strategy.",
  "Jordan M. has weekly contact with their sister.",
  "Jordan M. expressed motivation to reconnect with a friend in the city.",
  "Jordan M. was referred to an outpatient CBT program.",
  "Jordan M.'s first outpatient CBT appointment is scheduled for June 10.",
  "Jordan M.'s safety plan was reviewed.",
  "Jordan M.'s safety plan was shared.",
  "Jordan M. received a crisis line contact card.",
  "Jordan M. was provided with discharge medications for 30 days.",
  "Jordan M. shared insightful reflections in the journaling group.",
  "Jordan M. was kind toward peers during hospitalization.",
  "Jordan M. was encouraging toward peers during hospitalization.",
  "Jordan M. expressed a clear desire to build a routine.",
  "Jordan M. expressed a clear desire to return to part-time work.",
  "Jordan M.'s preferred name is Jordan.",
  "Jordan M.'s pronouns are they/them."
]

This is a placeholder for any kind of storage, e.g. SQLite.

In [4]:
storage = []

Now let's add our data to `storage`. Each proposition becomes one record, and `chunk_id` is its position within its own report.

In [5]:
def add_to_storage(pid, dt, rtype, store, chunks):
    store += [{'pid': pid, 'report_date': dt, 'text': c, 'report_type': rtype, 'chunk_id': i}
              for i, c in enumerate(chunks)]

In [6]:
add_to_storage('Jordan', '2023-01-12', 'discharge_summary', storage, rpt_discharge)

In [7]:
add_to_storage('Jordan', '2023-01-17', 'referral', storage, rpt_referral)

In [8]:
add_to_storage('Jordan', '2023-02-10', 'progress_note', storage, rpt_progress)

In [9]:
storage[:5]

[{'pid': 'Jordan',
  'report_date': '2023-01-12',
  'text': "The patient's name is Jordan M.",
  'report_type': 'discharge_summary',
  'chunk_id': 0},
 {'pid': 'Jordan',
  'report_date': '2023-01-12',
  'text': 'Jordan M. is 29 years old.',
  'report_type': 'discharge_summary',
  'chunk_id': 1},
 {'pid': 'Jordan',
  'report_date': '2023-01-12',
  'text': 'Jordan M. was admitted to the hospital on May 28, 2025.',
  'report_type': 'discharge_summary',
  'chunk_id': 2},
 {'pid': 'Jordan',
  'report_date': '2023-01-12',
  'text': 'Jordan M. was discharged from the hospital on June 4, 2025.',
  'report_type': 'discharge_summary',
  'chunk_id': 3},
 {'pid': 'Jordan',
  'report_date': '2023-01-12',
  'text': 'Jordan M. was diagnosed with Major Depressive Disorder.',
  'report_type': 'discharge_summary',
  'chunk_id': 4}]
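
Since the list above is only a stand-in for real storage, here is a minimal sketch of the same record shape in SQLite using the standard-library `sqlite3` module. The table name `chunks` and the helpers `open_store`, `add_report`, and `load_all` are hypothetical names chosen for illustration; they are not part of the notebook.

```python
import sqlite3

# Hypothetical schema mirroring the dict keys used in `storage`.
SCHEMA = """
CREATE TABLE IF NOT EXISTS chunks (
    pid         TEXT    NOT NULL,
    report_date TEXT    NOT NULL,
    report_type TEXT    NOT NULL,
    chunk_id    INTEGER NOT NULL,
    text        TEXT    NOT NULL,
    PRIMARY KEY (pid, report_date, report_type, chunk_id)
)
"""

def open_store(path=":memory:"):
    """Open (or create) the database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.row_factory = sqlite3.Row  # rows can be converted to dicts
    conn.execute(SCHEMA)
    return conn

def add_report(conn, pid, dt, rtype, chunks):
    """Insert one report's propositions, numbering them from 0."""
    rows = [(pid, dt, rtype, i, c) for i, c in enumerate(chunks)]
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT INTO chunks (pid, report_date, report_type, chunk_id, text) "
            "VALUES (?, ?, ?, ?, ?)",
            rows,
        )

def load_all(conn):
    """Return every chunk as a dict, in insertion order."""
    cur = conn.execute(
        "SELECT pid, report_date, text, report_type, chunk_id "
        "FROM chunks ORDER BY rowid"
    )
    return [dict(r) for r in cur]

conn = open_store()
add_report(conn, "Jordan", "2023-01-12", "discharge_summary",
           ["The patient's name is Jordan M.", "Jordan M. is 29 years old."])
rows = load_all(conn)
```

Reading the rows back in insertion order keeps the list positions stable, which matters once a positional index is built on top of them.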

## BM25

Why BM25? `storage` exists now, and it is tempting to just add an attribute called `bm25` to each record. Wrong. We need to build a separate index over all docs, plus some way to search that index. Position `i` in the BM25 index corresponds 1:1 to `storage[i]`, so a match in the index maps straight back to its record.
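
To make the positional relationship concrete, here is a small pure-Python scorer. It is a sketch of the common textbook BM25 formula with a smoothed, non-negative IDF, not the `rank_bm25` implementation, and `bm25_scores` and `toy_store` are hypothetical names. The point is only that the scores come back in the same order as the corpus, so the index of the best score is also the index of the record.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score every document against the query; output order matches corpus order."""
    n_docs = len(corpus_tokens)
    avgdl = sum(len(d) for d in corpus_tokens) / n_docs
    # Document frequency: in how many documents does each token appear?
    df = Counter(t for d in corpus_tokens for t in set(d))
    scores = []
    for doc in corpus_tokens:
        tf = Counter(doc)
        length_norm = k1 * (1 - b + b * len(doc) / avgdl)
        score = 0.0
        for t in query_tokens:
            if t not in tf:
                continue
            idf = math.log((n_docs - df[t] + 0.5) / (df[t] + 0.5) + 1)
            score += idf * tf[t] * (k1 + 1) / (tf[t] + length_norm)
        scores.append(score)
    return scores

toy_store = [
    {"text": "Jordan M. is 29 years old."},
    {"text": "Jordan M. was prescribed sertraline during hospitalization."},
    {"text": "Jordan M.'s safety plan was reviewed."},
]
corpus = [s["text"].split() for s in toy_store]
scores = bm25_scores("sertraline dosage".split(), corpus)

# Index of the best score doubles as the index into the store.
best = max(range(len(scores)), key=scores.__getitem__)
print(toy_store[best]["text"])
```

If records are added to the store later, the index has to be rebuilt, otherwise the positions drift apart.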

In [10]:
from rank_bm25 import BM25Okapi

In [11]:
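def build_bm25(store):
    # Tokenize on whitespace and materialize the corpus as a list.
    # Note: str.split() is case- and punctuation-sensitive.
    return BM25Okapi([s['text'].split() for s in store])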

In [12]:
bm25_index = build_bm25(storage) ; bm25_index

<rank_bm25.BM25Okapi at 0x7fd0f7b9ffd0>

In [13]:
def search_bm25(query, store, bm25, k=5):
    q_tok = query.split()
    scores = bm25.get_scores(q_tok)
    topk = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return [store[i] for i in topk]
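
Both the index and the query are tokenized with `str.split()`, so `Disorder.` and `disorder` count as different tokens, and function words such as `the` can drive matches. A minimal sketch of a normalizing tokenizer follows; `tokenize` and the tiny `STOPWORDS` set are hypothetical and only illustrative. If something like this were used, it would have to be applied identically when building the index and when searching it.

```python
import re

# Illustrative stopword list only; a real one would be longer.
STOPWORDS = frozenset({"the", "is", "a", "an", "of", "to", "what",
                       "does", "on", "in", "was"})

def tokenize(text):
    """Lowercase, drop possessive 's, split on non-alphanumerics, remove stopwords."""
    lowered = re.sub(r"'s\b", "", text.lower())
    tokens = re.findall(r"[a-z0-9]+", lowered)
    return [t for t in tokens if t not in STOPWORDS]

print(tokenize("What is the patient's overall sentiment?"))
# ['patient', 'overall', 'sentiment']
```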

## Query Expansion

Given a user query such as `When was the patient diagnosed and what with?`, GPT4 can convert it into the following more 'aligned' queries:

```json
[
  "When was the patient diagnosed?",
  "What was the patient's diagnosis?"
]
```

Let's write a function that calls the GPT4 API to do the query expansion:

In [14]:
import os

In [16]:
from openai import OpenAI
import json
import re

In [17]:
client = OpenAI()

In [18]:
def expand_query_gpt(query, model='gpt-4o'):
    messages = [
        {
            "role": "system",
            "content": (
                "You will be given a user query. Your task is to rewrite or expand the query so that it better matches "
                "the structure of previously extracted propositions from clinical text. Follow these rules:\n\n"
                "- Split compound or complex questions into simple, focused questions.\n"
                "- Maintain the original phrasing from the input whenever possible.\n"
                "- Replace pronouns or vague terms (e.g., they, it, this) with explicit noun phrases based on the context implied by the query.\n"
                "- Add minimal clarifying context so that each question can be understood independently (i.e., decontextualize it).\n\n"
                "Output a list of reformulated questions as strings in JSON format.\n\n"
                "Do not generate answers. Only rewrite the query into a more proposition-aligned form."
            )
        },
        {"role": "user", "content": f'User Query: "{query}"'}
    ]

    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0,
    )

    content = response.choices[0].message.content.strip()

    if content.startswith("```"):
        content = re.sub(r"^```(?:json)?\s*|\s*```$", "", content.strip(), flags=re.DOTALL)

    try:
        return json.loads(content)
    except json.JSONDecodeError:
        print("Malformed JSON from model:")
        print(content)
        return [query]
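
The function above returns whatever `json.loads` produces, which could be a dict or a bare string rather than a list of questions. A minimal sketch of a stricter parser is shown here; `parse_question_list` is a hypothetical helper, and unwrapping a single-key object is an assumption about how a model might wrap its answer, not documented behaviour.

```python
import json
import re

def parse_question_list(content, fallback):
    """Parse model output into a non-empty list of strings, else return [fallback]."""
    text = content.strip()
    # Strip a surrounding markdown code fence, with or without a "json" tag.
    text = re.sub(r"^```(?:json)?\s*|\s*```$", "", text)
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return [fallback]
    # Assumption: the list may arrive wrapped in a single-key object.
    if isinstance(data, dict) and len(data) == 1:
        data = next(iter(data.values()))
    if isinstance(data, list) and data and all(isinstance(q, str) for q in data):
        return data
    return [fallback]

print(parse_question_list('```json\n["When was the patient diagnosed?"]\n```', "original"))
# ['When was the patient diagnosed?']
```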

Calls to this API cost money, so the example call below is left commented out.

In [19]:
# expand_query_gpt("Does the patient have anything they're looking forward to?")
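
One way to avoid paying twice for the same query is to memoize the expansion. This is a sketch under the assumption that expansions are deterministic enough to reuse (the call above uses `temperature=0`); `make_cached_expander` is a hypothetical helper and `fake_expand` stands in for the real API call so the example runs offline.

```python
from functools import lru_cache

def make_cached_expander(expand_fn, maxsize=256):
    """Wrap an expansion function so repeated queries are served from memory."""
    @lru_cache(maxsize=maxsize)
    def _cached(query):
        # Store an immutable tuple so callers cannot corrupt the cached value.
        return tuple(expand_fn(query))

    def expand(query):
        return list(_cached(query))

    return expand

calls = []

def fake_expand(query):
    """Stand-in for the real API call; records each invocation."""
    calls.append(query)
    return [query]

expand = make_cached_expander(fake_expand)
expand("When was the patient diagnosed?")
expand("When was the patient diagnosed?")
print(len(calls))  # 1: the second call was served from the cache
```

The cache lives only for the current session; persisting it across runs would need a file or database.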

In [20]:
QUERY = ["When was the patient diagnosed?","What was the patient's diagnosis?"]

In [21]:
QUERY_SENTIMENT = ["What is the patient's overall sentiment based on the clinical notes?", "Does the patient require an encouraging follow-up note?",
  "Does the patient require a tough love approach in the follow-up note?"]

In [22]:
QUERY_FUTUREPLANS = ['Is the patient looking forward to any specific events?', 'Is the patient anticipating any particular activities?',
 'Does the patient have any future plans they are excited about?']

**Searching against BM25 index**

In [23]:
res = [
    {'query': q, 'chunk': chunk}
    for q in QUERY_SENTIMENT
    for chunk in search_bm25(q, storage, bm25_index, k=100)
]

In [24]:
res[:3]

[{'query': "What is the patient's overall sentiment based on the clinical notes?",
  'chunk': {'pid': 'Jordan',
   'report_date': '2023-01-12',
   'text': "The patient's name is Jordan M.",
   'report_type': 'discharge_summary',
   'chunk_id': 0}},
 {'query': "What is the patient's overall sentiment based on the clinical notes?",
  'chunk': {'pid': 'Jordan',
   'report_date': '2023-01-17',
   'text': "The patient's name is Jordan M.",
   'report_type': 'referral',
   'chunk_id': 0}},
 {'query': "What is the patient's overall sentiment based on the clinical notes?",
  'chunk': {'pid': 'Jordan',
   'report_date': '2023-02-10',
   'text': "The patient's name is Jordan M.",
   'report_type': 'progress_note',
   'chunk_id': 0}}]
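
Note that `storage` holds 78 chunks (34 + 22 + 22), so `k=100` returns every chunk for every query and BM25 is only ordering candidates here, not filtering them. If filtering is wanted, one option is to drop candidates with a zero score. The sketch below assumes a zero score means no token overlap with the query; `top_k_positive` is a hypothetical helper operating on a plain list of scores.

```python
def top_k_positive(scores, k=5):
    """Indices of the k highest scores, skipping anything that scored zero or less."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [i for i in ranked if scores[i] > 0][:k]

toy_scores = [0.0, 1.2, 0.0, 0.4]
print(top_k_positive(toy_scores, k=100))  # [1, 3]
```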

## ReRanking via Embedding Strategy

In [25]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

Let's use Qwen3-Embedding-0.6B as our initial embedding model; it's ranked fairly high on the HuggingFace MTEB Leaderboard.

In [26]:
from sentence_transformers import SentenceTransformer

In [27]:
qwen_embed = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")

In [28]:
def embed_texts(texts, is_query=False):
    if is_query: return qwen_embed.encode(texts, prompt_name="query")
    return qwen_embed.encode(texts)

In [29]:
def rerank(query, chunks, top_k=10):
    qv = embed_texts([query], is_query=True)[0]
    doc_vecs = embed_texts([c['chunk']['text'] for c in chunks], is_query=False)
    sims = cosine_similarity([qv], doc_vecs)[0]
    idxs = np.argsort(sims)[-top_k:][::-1]
    return [chunks[i] for i in idxs]
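
For intuition, the ranking step inside `rerank` can be reproduced with plain Python on toy vectors. This is an illustrative sketch, not a replacement for the `sklearn` and `numpy` calls above; `cosine` and `top_k_by_cosine` are hypothetical names and the vectors are made up.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def top_k_by_cosine(query_vec, doc_vecs, top_k=2):
    """Return the indices of the top_k most similar documents, best first."""
    sims = [cosine(query_vec, d) for d in doc_vecs]
    order = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)
    return order[:top_k], sims

query_vec = [1.0, 0.0]
doc_vecs = [[0.0, 1.0], [2.0, 0.0], [1.0, 1.0]]
idxs, sims = top_k_by_cosine(query_vec, doc_vecs)
print(idxs)  # [1, 2]
```

Cosine similarity ignores vector length, which is why `[2.0, 0.0]` scores a perfect 1.0 against `[1.0, 0.0]`.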

In [30]:
reranked_chunks = []

In [31]:
for q in QUERY_SENTIMENT:
    res_q = [r for r in res if r['query'] == q] # filtering res for a given query, q
    reranked_chunks += rerank(q, res_q, top_k=10)

In [32]:
reranked_chunks

[{'query': "What is the patient's overall sentiment based on the clinical notes?",
  'chunk': {'pid': 'Jordan',
   'report_date': '2023-01-17',
   'text': 'Jordan M. showed a positive response to inpatient treatment.',
   'report_type': 'referral',
   'chunk_id': 8}},
 {'query': "What is the patient's overall sentiment based on the clinical notes?",
  'chunk': {'pid': 'Jordan',
   'report_date': '2023-01-12',
   'text': 'Jordan M. responded positively to medication adjustments.',
   'report_type': 'discharge_summary',
   'chunk_id': 13}},
 {'query': "What is the patient's overall sentiment based on the clinical notes?",
  'chunk': {'pid': 'Jordan',
   'report_date': '2023-02-10',
   'text': 'The progress note was authored by Dr. A. Patel.',
   'report_type': 'progress_note',
   'chunk_id': 2}},
 {'query': "What is the patient's overall sentiment based on the clinical notes?",
  'chunk': {'pid': 'Jordan',
   'report_date': '2023-01-12',
   'text': 'Jordan M. engaged well with the inpati

## Prompt Generation for LLM

In [33]:
def format_prompt(reranked_chunks):
    # build context passages
    context_blocks = [
        f"Query: {r['query']}\nPassage: {r['chunk']['text']}"
        for r in reranked_chunks
    ]

    # unique queries in order (dict.fromkeys keeps first-seen order; a set would not)
    all_queries = list(dict.fromkeys(r['query'] for r in reranked_chunks))

    prompt = (
        "Refer to the passages below and answer the following question with just a few words.\n\n"
        + "\n\n".join(context_blocks)
        + "\n\nRefer to the context above and answer the following question with just a few words.\n"
        + "Question: " + " / ".join(all_queries)
    )

    return prompt

In [34]:
# prompt = format_prompt(reranked_chunks) ; prompt

In [35]:
def format_prompt_collapsed(reranked_chunks):
    # Unique queries (preserve order)
    seen, queries = set(), []
    for r in reranked_chunks:
        q = r['query']
        if q not in seen:
            seen.add(q)
            queries.append(q)
    
    # Deduplicate chunks (optional)
    seen_texts, context_blocks = set(), []
    for r in reranked_chunks:
        txt = r['chunk']['text']
        if txt not in seen_texts:
            seen_texts.add(txt)
            context_blocks.append(f"Passage: {txt}")

    prompt = (
        "Refer to the passages below and answer the following question in a short, informative sentence or two.\n\n"
        + "\n\n".join(context_blocks)
        + "\n\nQuestion: " + " / ".join(queries)
    )
    
    return prompt

In [36]:
prompt = format_prompt_collapsed(reranked_chunks) ; prompt

"Refer to the passages below and answer the following question in a short, informative sentence or two.\n\nPassage: Jordan M. showed a positive response to inpatient treatment.\n\nPassage: Jordan M. responded positively to medication adjustments.\n\nPassage: The progress note was authored by Dr. A. Patel.\n\nPassage: Jordan M. engaged well with the inpatient care team.\n\nPassage: Jordan M. was encouraging toward peers during hospitalization.\n\nPassage: Jordan M. was kind toward peers during hospitalization.\n\nPassage: The patient's name is Jordan M.\n\nPassage: Jordan M. expressed motivation to continue structured therapy.\n\nPassage: Psychiatric follow-up for medication management was requested for Jordan M.\n\nPassage: The treatment plan includes revisiting discharge planning on June 3, 2025.\n\nPassage: The date of the referral note is June 3, 2025.\n\nPassage: Jordan M. was provided with discharge medications for 30 days.\n\nPassage: Jordan M. has been provided with a calendar r

**Are we getting all the passages in this prompt?!**
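
The echoed string is cut off in the output, so this cannot be confirmed by eye. With three queries and `top_k=10`, `reranked_chunks` holds 30 entries, and after de-duplication by text the prompt should contain at most 30 passages. A minimal sketch of a programmatic check follows; `missing_passages` is a hypothetical helper, and the toy data only mimics the shape of `reranked_chunks`.

```python
def missing_passages(prompt, reranked_chunks):
    """Return chunk texts that do not appear as a 'Passage: ...' block in the prompt."""
    wanted = []
    for r in reranked_chunks:
        text = r["chunk"]["text"]
        if text not in wanted:  # de-duplicate while keeping first-seen order
            wanted.append(text)
    return [t for t in wanted if f"Passage: {t}" not in prompt]

toy_chunks = [
    {"query": "q1", "chunk": {"text": "Jordan M. engages well in group therapy."}},
    {"query": "q2", "chunk": {"text": "Jordan M. uses journaling as a coping skill."}},
    {"query": "q2", "chunk": {"text": "Jordan M. engages well in group therapy."}},
]
toy_prompt = "Passage: Jordan M. engages well in group therapy.\n\nQuestion: q1 / q2"

print(missing_passages(toy_prompt, toy_chunks))
# ['Jordan M. uses journaling as a coping skill.']
```

An empty list means every unique passage made it into the prompt.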

In [38]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

In [39]:
model_name = "Qwen/Qwen2.5-32B-Instruct"

In [40]:
model = AutoModelForCausalLM.from_pretrained(model_name,torch_dtype="auto",device_map="auto")

Some parameters are on the meta device because they were offloaded to the cpu.


In [41]:
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [42]:
def ask_qwen(prompt, max_tokens=512):
    msgs = [
        {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
        {"role": "user", "content": prompt}
    ]
    txt = tokenizer.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer([txt], return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=max_tokens)

    # Trim prompt from output
    trimmed = [o[len(i):] for i, o in zip(inputs.input_ids, out)]
    return tokenizer.batch_decode(trimmed, skip_special_tokens=True)[0].strip()
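
The trimming step relies on the fact that, for a decoder-only model, the generated sequence starts with the prompt tokens and the new tokens follow. A toy illustration with plain lists is below; the token ids are made up.

```python
# Made-up token ids: the output repeats the prompt, then adds two new tokens.
prompt_ids = [[101, 7, 8, 9]]
generated = [[101, 7, 8, 9, 42, 43]]

# Slice off the first len(prompt) tokens from each generated sequence.
trimmed = [out[len(inp):] for inp, out in zip(prompt_ids, generated)]
print(trimmed)  # [[42, 43]]
```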


In [43]:
ask_qwen(prompt)

'Based on the clinical notes, Jordan M. exhibits a positive sentiment and engagement with treatment, indicating that an encouraging follow-up note would be appropriate rather than a tough love approach.'

## Results:

**'Is the patient looking forward to any specific events?', 'Is the patient anticipating any particular activities?', 'Does the patient have any future plans they are excited about?'**

**Answer from 32B LLM**

'Jordan M. is looking forward to the outpatient plan and is motivated to continue with structured therapy, indicating anticipation for these future activities.'