In [1]:
import pandas as pd
from datasets import (Dataset, load_dataset)
from transformers import AutoTokenizer
import json
from helper import init_ipynb

init_ipynb()

True

In [5]:
questions = [
    "A 60-year-old woman develops generalized seizure activity lasting 10 minutes; seizure activity appears to arrest after administration of 4 mg of IV lorazepam. She has chronic kidney disease but is otherwise in good health. Which of the following is the best next pharmacological step in management?",
    "The cortical hamartomas of tuberous sclerosis display different MRI characteristics depending on a patient’s age. Which of the following pathophysiologic processes is responsible for this age-related change in the MRI characteristics of these lesions?",
    "A patient with which of the following conditions should not be placed on a ketogenic diet?",
    "10-year-old child with epilepsy since age 2 that is refractory to medical treatment has been followed serially with brain MRI scans, which show progressive atrophy of the left hemisphere. What is the most likely diagnosis?"
]

answers = [
    "The correct answer is C (IV fosphenytoin). Regardless of whether prolonged seizure activity stops after the administration of an appropriate dose of a benzodiazepine, rapid administration of a longer-acting anticonvulsant is generally recommended. This allows for prevention of additional seizures as the effect of the benzodiazepine wears off over the course of several hours. Of the options listed, fosphenytoin is the preferred option. A continuous infusion of propofol or midazolam is not indicated in this setting unless clinical or EEG evidence of ongoing seizures (ie, refractory status epilepticus) exists.",
    "Cortical hamartomas (tubers) are the most characteristic lesions in tuberous sclerosis complex. These lesions can cause focal seizures, which in some patients may be refractory to antiepileptic drugs; however, not all tubers are epileptogenic. The MRI appearance of tubers changes with myelination. In neonates they are hyperintense on T1 and hypointense on T2-weighted images compared to the surrounding white matter. In older children they are hyperintense on T2-weighted images with poorly defined borders.",
    "Ketogenic diets are contraindicated in patients with pancreatitis, hepatic failure, primary carnitine deficiency, carnitine palmitoyl transferase I and II deficiency, carnitine translocase deficiency, beta-oxidation defects, pyruvate carboxylase deficiency, and porphyria. In the intensive care setting where diet therapy is being considered for treatment of refractory status epilepticus, the ketogenic diet is also contraindicated in patients who cannot tolerate enteral feeds, including those with ileus, who are receiving a propofol infusion (to avoid fatal propofol infusion syndrome), and in patients who have metabolic, hemodynamic, or cardiorespiratory instability.",
    "MRI in Rasmussen encephalitis shows progressive atrophy of one of the cerebral hemispheres, usually beginning in the opercular region."
]

with open("docs/Guidelines_q_a.json", "r") as f:
    guidelines = json.load(f)

In [12]:
dataset = Dataset.from_pandas(
    pd.DataFrame.from_records(guidelines)
)

In [14]:
dataset = dataset.rename_column("q", "questions")
dataset = dataset.rename_column("a", "answers")
dataset

Dataset({
    features: ['questions', 'answers', 'reference'],
    num_rows: 1074
})

In [15]:
dataset = dataset.map(lambda x : {
    # Strip trailing periods first so the joined text does not end up with ".."
    "answers" : f"{x['answers'].rstrip('.')}. For more information, please read {x['reference'].rstrip('.')}."
})

dataset["answers"][0]

Map:   0%|          | 0/1074 [00:00<?, ? examples/s]

'The first step in diagnosing a TNE is taking a detailed history and performing a physical examination. For more information, please read Franco, A. C., et al. (2021). Management of a first unprovoked epileptic seizure in adolescence and adulthood. Epileptic Disorders, 23(4), 537-551.'

In [16]:
dataset = dataset.train_test_split(test_size=.3)
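
The split above is drawn at random, so re-running the cell gives a different partition unless a seed is fixed (`train_test_split` accepts a `seed` argument for that). The sketch below only illustrates the idea in plain Python; `seeded_split` is a hypothetical helper, not part of `datasets`, and its rounding of the test share is an assumption.

```python
import random


def seeded_split(rows, test_size=0.3, seed=0):
    """Shuffle a copy of `rows` with a fixed seed, then cut off the test share."""
    shuffled = list(rows)
    # A dedicated Random instance keeps the global random state untouched.
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]


train, test = seeded_split(range(10), test_size=0.3, seed=42)
print(len(train), len(test))
```

Calling the helper twice with the same seed returns the same two lists, which is the property a reproducible train/test split needs.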

In [17]:
dataset.push_to_hub("cryptoni/epilepsy_guidelines_QA")

Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

README.md:   0%|          | 0.00/403 [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/datasets/cryptoni/epilepsy_guidelines_QA/commit/8cf3efac63a1507488397f0bb8bd8c875737040e', commit_message='Upload dataset', commit_description='', oid='8cf3efac63a1507488397f0bb8bd8c875737040e', pr_url=None, pr_revision=None, pr_num=None)

In [27]:
LLAMA2_MODEL_MAX_LENGTH=2024


def load_sft_data(tok):
    """
        Load the Q&A dataset from the Hub and tokenize its train and test
        splits (the split itself is the one stored on the Hub).

        args :
            - tok (AutoTokenizer) : tokenizer

        returns :
            train and test datasets, with the tokenized questions in
            `input_ids` / `attention_mask` and the tokenized answers in `target`
    """
    dataset = load_dataset("cryptoni/epilepsy_guidelines_QA")
    max_length = min(tok.model_max_length, LLAMA2_MODEL_MAX_LENGTH)

    def tokenize(texts):
        # Every example is truncated or padded to exactly max_length tokens
        return tok(texts, return_tensors="pt", max_length=max_length, truncation=True, padding="max_length")

    for split in ["train", "test"]:
        # Tokenized questions become the model inputs
        dataset[split] = dataset[split].map(lambda x : tokenize(x["questions"]), batched=True)
        # Tokenized answers are stored as target token ids
        dataset[split] = dataset[split].map(lambda x : {"target" : tokenize(x["answers"]).input_ids}, batched=True)

    return dataset["train"], dataset["test"]


tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
tok.pad_token = tok.eos_token
dataset = load_sft_data(tok)
dataset

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Map:   0%|          | 0/2 [00:00<?, ? examples/s]

(Dataset({
     features: ['questions', 'answers', 'input_ids', 'attention_mask', 'target'],
     num_rows: 2
 }),
 Dataset({
     features: ['questions', 'answers', 'input_ids', 'attention_mask', 'target'],
     num_rows: 2
 }))
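
The `truncation=True, padding="max_length"` arguments used above give every example the same length. The toy sketch below shows what that means for one list of token ids. It is not the tokenizer's implementation: `pad_or_truncate` is a hypothetical helper, right-side padding is an assumption (the real padding side depends on the tokenizer configuration), and the pad id `0` is a placeholder (the notebook reuses the EOS token as pad token).

```python
def pad_or_truncate(token_ids, max_length, pad_id):
    """Cut `token_ids` to `max_length`, then right-pad with `pad_id`.

    Returns the padded ids and an attention mask (1 = real token, 0 = padding).
    """
    ids = list(token_ids)[:max_length]
    mask = [1] * len(ids)
    n_pad = max_length - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad


ids, mask = pad_or_truncate([5, 6, 7], max_length=5, pad_id=0)
print(ids, mask)
```

Because the pad token here is the same as the EOS token, the attention mask is the only way to tell real end-of-sequence tokens from padding.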

In [26]:
dataset[0][0].keys()

dict_keys(['questions', 'answers', 'input_ids', 'attention_mask', 'target'])

In [8]:
from helper import init_ipynb
envfound = init_ipynb()
from transformers import AutoTokenizer, LlamaForCausalLM



llm = LlamaForCausalLM.from_pretrained("checkpoints/epitron_baseline_PMCo_M7B_e3", local_files_only=True)
tok = AutoTokenizer.from_pretrained("checkpoints/epitron_baseline_PMCo_M7B_e3", local_files_only=True)

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

KeyboardInterrupt: 

In [4]:
import os
from pathlib import Path

Path(os.getcwd())/"checkpoints"

PosixPath('/home/antoinemagron/smb_share/TeamMembers/Magron_Antoine/EpiLLM/checkpoints/t')

# SFT model

In [1]:
from transformers import LlamaForCausalLM, AutoTokenizer


llm = LlamaForCausalLM.from_pretrained("cryptoni/epitron_LL3_PMC_N6")
llm.load_adapter("cryptoni/epitron_sft_n6_full")
tok = AutoTokenizer.from_pretrained("cryptoni/epitron_LL3_PMC_N6")

Loading checkpoint shards:   0%|          | 0/7 [00:00<?, ?it/s]

adapter_config.json:   0%|          | 0.00/626 [00:00<?, ?B/s]

adapter_model.bin:   0%|          | 0.00/13.7M [00:00<?, ?B/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [1]:
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest


from huggingface_hub import snapshot_download

qa_lora_path = snapshot_download(repo_id="cryptoni/epitron_sft_n6_full")

Fetching 5 files:   0%|          | 0/5 [00:00<?, ?it/s]

In [3]:
llm = LLM("cryptoni/epitron_LL3_PMC_N6", enable_lora=True)

INFO 06-18 13:31:58 llm_engine.py:100] Initializing an LLM engine (v0.4.2) with config: model='cryptoni/epitron_LL3_PMC_N6', speculative_config=None, tokenizer='cryptoni/epitron_LL3_PMC_N6', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=cryptoni/epitron_LL3_PMC_N6)


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


OutOfMemoryError: CUDA out of memory. Tried to allocate 224.00 MiB. GPU 

In [4]:
llm.generate(
    ["What is the importance of EEG in diagnosing a first unprovoked epileptic seizure?"],
    sampling_params=SamplingParams(temperature=0),
    lora_request=LoRARequest("q&a_adapter", 1, qa_lora_path),
)[0].outputs[0].text



Processed prompts: 100%|██████████| 1/1 [00:00<00:00,  2.34it/s]


' A prospective study\nA. M. Al-Haddad\nA. M'

In [10]:
# import os

# os.chdir("..")
from evaluation import MCQBenchmark


mcq = MCQBenchmark("docs/self_assessment/aes7_processed.json", prompt_template=lambda:())

FileNotFoundError: [Errno 2] No such file or directory: 'docs/self_assessment/aes7_processed.json'

# Advanced guidelines

In [2]:
with open("docs/Guidelines_q_a_improved.json", 'r') as f:
    qa = pd.DataFrame.from_records(json.load(f))

In [3]:
## Conversational: show the first follow-up of every question that has one
follow_ups = qa[~qa.follow_up.isna()].follow_up.values
for follow_up in follow_ups:
    display(follow_up[0])

conversations = [follow_up[0] for follow_up in follow_ups]

{'q': 'That sounds like a lot. Does it affect people of all ages?',
 'a': "Yes, epilepsy can affect people of all ages. It's more common in young children and older adults, but anyone can develop epilepsy at any time in their life.",
 'reference': 'https://www.epilepsy.com/what-is-epilepsy/statistics (last reviewed February 2019)'}

{'q': 'What could cause someone to develop epilepsy?',
 'a': 'Epilepsy can be caused by various factors, including genetic influences, head trauma, brain conditions like tumors or strokes, infectious diseases, prenatal injury, and developmental disorders.',
 'reference': 'https://www.epilepsy.com/what-is-epilepsy/statistics (last reviewed February 2019)'}

{'q': 'Why is it more common in young children and older adults?',
 'a': "In young children, epilepsy can be due to developmental issues or genetic factors. In older adults, it is often related to other health problems like strokes or Alzheimer's disease.",
 'reference': 'https://www.epilepsy.com/what-is-epilepsy/statistics (last reviewed February 2019)'}

{'q': 'Are there any reasons why more men might have epilepsy than women?',
 'a': "The exact reasons aren't completely understood, but it might be related to different risks of head injuries, genetic factors, and other health conditions that differ between men and women.",
 'reference': 'https://www.epilepsy.com/what-is-epilepsy/statistics (last reviewed February 2019)'}

{'q': 'Are there regions where epilepsy is more prevalent?',
 'a': 'Epilepsy is found worldwide, but the prevalence can be higher in low- and middle-income countries due to limited access to healthcare and higher rates of conditions like brain infections.',
 'reference': 'https://www.epilepsy.com/what-is-epilepsy/statistics (last reviewed February 2019)'}

{'q': 'Why might African Americans be more affected by epilepsy?',
 'a': 'This could be due to a combination of factors, including disparities in healthcare access, higher rates of certain health conditions, and socioeconomic factors.',
 'reference': 'https://www.epilepsy.com/what-is-epilepsy/statistics (last reviewed February 2019)'}

{'q': 'What kind of challenges might children with epilepsy face?',
 'a': 'Children with epilepsy might face challenges like learning difficulties, social stigma, and managing their condition alongside school and other activities. However, with proper treatment and support, many overcome these challenges.',
 'reference': 'https://www.epilepsy.com/what-is-epilepsy/statistics (last reviewed February 2019)'}

{'q': 'What kind of increased risks are there?',
 'a': 'Some increased risks include injury during a seizure, sudden unexpected death in epilepsy (SUDEP), and complications from underlying conditions that cause epilepsy. Proper management and treatment are crucial to minimize these risks.',
 'reference': 'https://www.epilepsy.com/what-is-epilepsy/statistics (last reviewed February 2019)'}

{'q': 'How can I find a support group?',
 'a': 'You can start by checking with your healthcare provider, local hospitals, or organizations like the Epilepsy Foundation. Many communities also have local support groups and online forums.',
 'reference': 'https://www.epilepsy.com/what-is-epilepsy/statistics (last reviewed February 2019)'}

{'q': 'What are the direct and indirect costs associated with epilepsy?',
 'a': 'Direct costs include medical expenses like hospital visits, medications, and treatments. Indirect costs involve lost productivity due to missed work or school, as well as the financial impact on families and caregivers.',
 'reference': 'https://www.epilepsy.com/what-is-epilepsy/statistics (last reviewed February 2019)'}

In [4]:
patient_sources = [ref for ref in qa.reference.unique() if "epilepsyontario.org" in ref or "epilepsy.com" in ref]
patient_qa = qa[qa.reference.isin(patient_sources)]
print("Patient questions : ", len(patient_qa.index))
patient_qa.head(5)


Patient questions :  48


Unnamed: 0,q,a,reference,follow_up
212,What should I do if my friend is having a seiz...,Stay calm and keep your friend safe until they...,"Epilepsy Ontario, https://epilepsyontario.org/...",
213,I have epilepsy. What is epilepsy?,Epilepsy is a common neurological condition ch...,"Epilepsy Ontario, https://epilepsyontario.org/...",
214,I am epileptic. How is epilepsy diagnosed?,A physician or nurse practitioner diagnoses ep...,"Epilepsy Ontario, https://epilepsyontario.org/...",
215,I am a patient with epilepsy. How is epilepsy ...,Medication is the most common and effective tr...,"Epilepsy Ontario, https://epilepsyontario.org/...",
216,I have epilepsy. Are there different kinds of ...,"Yes, there are many types of seizures, categor...","Epilepsy Ontario, https://epilepsyontario.org/...",


In [5]:
technical_qa = qa[~qa.reference.isin(patient_sources)]
print("Technical open questions : ", len(technical_qa.index))
technical_qa.head(5)

Technical open questions :  1593


Unnamed: 0,q,a,reference,follow_up
0,What is the first step in diagnosing a transie...,The first step in diagnosing a TNE is taking a...,"Franco, A. C., et al. (2021). Management of a ...",
1,What is the importance of EEG in diagnosing a ...,An EEG is important for diagnosing a first unp...,"Franco, A. C., et al. (2021). Management of a ...",
2,When should an MRI be considered for a first u...,An MRI with an epilepsy protocol should be con...,"Franco, A. C., et al. (2021). Management of a ...",
3,What factors should be considered when decidin...,The decision to treat a first unprovoked epile...,"Franco, A. C., et al. (2021). Management of a ...",
4,What is the typical treatment regimen for a fi...,The typical treatment regimen for a first unpr...,"Franco, A. C., et al. (2021). Management of a ...",


### Improve the dataset

In [6]:
with open("docs/conversations.txt", "r") as f:
    # Keep non-empty lines that contain no braces
    interactions = [line for line in f.read().split("\n") if "{" not in line and "}" not in line and line != ""]

# Pair every line with the line that follows it
conversations_ds = [[current, following] for current, following in zip(interactions, interactions[1:])]

# Drop pairs that straddle two conversations (the second line is a "START" marker)
conversations_ds = [pair for pair in conversations_ds if pair[1] != "START"]
print(len(conversations_ds))

170
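
To make the pairing logic concrete, here is a minimal sketch on a made-up transcript. `pair_turns` is a hypothetical helper and the toy lines are invented for illustration; only the `START` marker convention comes from the cell above.

```python
def pair_turns(lines, marker="START"):
    """Pair each line with the next one, skipping pairs whose second line is the marker."""
    return [[current, following]
            for current, following in zip(lines, lines[1:])
            if following != marker]


toy = ["START", "Doctor: Hello", "Patient: Hi", "START", "Doctor: Good morning"]
pairs = pair_turns(toy)
for context, reply in pairs:
    print(repr(context), "->", repr(reply))
```

A pair whose first element is `START` is kept, so the opening utterance of each conversation still appears as a reply; only the pair that would bridge the end of one conversation and the start of the next is dropped.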


In [7]:
for conversation in conversations_ds:
    print("*"*200)
    print(conversation[0])
    print(conversation[1])

********************************************************************************************************************************************************************************************************
START
Doctor: Good morning! How can I assist you today?
********************************************************************************************************************************************************************************************************
Doctor: Good morning! How can I assist you today?
Patient: Hi Doctor, I was recently diagnosed with epilepsy and I'm trying to understand more about it. What exactly causes epilepsy?
********************************************************************************************************************************************************************************************************
Patient: Hi Doctor, I was recently diagnosed with epilepsy and I'm trying to understand more about it. What exactly causes epilepsy?
Doctor: Epilepsy is prima

## Transform MCQ

In [20]:
from models import OpenAIGPT
from tqdm.notebook import tqdm
tqdm.pandas()


gpt = OpenAIGPT("gpt-4", 1)
def create_mcq(q, a):
    """Ask GPT-4 to turn an open question/answer pair into an MCQ that ends with its answer."""
    return gpt.query([
        {
            "role": "system",
            "content": "You're a neurologist preparing an exam for the best students in the country. Transform this question into an MCQ question. You will write a creative and complicated MCQ and finish by writing the answer"
        },
        {
            "role": "user",
            "content":
            f"""
                Question: {q}
                Answer: {a}
            """
        }
    ])
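
The triple-quoted f-string in `create_mcq` carries its source indentation into the prompt. A minimal sketch of building the same system/user message list without that extra whitespace follows; `build_mcq_messages` is a hypothetical helper, and the example question, answer and system prompt are placeholders.

```python
def build_mcq_messages(q, a, system_prompt):
    """Return the system and user messages for one question/answer pair."""
    return [
        {"role": "system", "content": system_prompt},
        # A plain f-string keeps the user message free of indentation whitespace.
        {"role": "user", "content": f"Question: {q}\nAnswer: {a}"},
    ]


messages = build_mcq_messages(
    "What is epilepsy?",
    "A neurological condition.",
    "Transform this question into an MCQ.",
)
print(messages[1]["content"])
```

Keeping the message construction in a pure function also makes it possible to inspect the prompt without calling the model.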


In [26]:
# Explicit copy so that adding a column never writes into a view of technical_qa
mcq_sample = technical_qa.sample(100).copy()


# Index the row by column label; positional access on a labelled row is deprecated in pandas
mcq_sample["mcq"] = mcq_sample[["q", "a"]].progress_apply(lambda x : create_mcq(x["q"], x["a"]), axis=1)

  0%|          | 0/100 [00:00<?, ?it/s]


In [28]:
print(mcq_sample.mcq.values[0])

MCQ Question: 

During an evaluation for epilepsy surgery, intracranial EEG monitoring is often employed. What are the primary objectives of implementing this type of assessment in the process? 

A. To measure blood flow in the brain and to visualize the structure of the brain.
B. To record sleep patterns and circadian rhythms in patients with epilepsy. 
C. To record ictal and interictal electrographic data for epileptogenic zone delineation and to determine the location of eloquent cortex to define safety margins for surgery. 
D. To examine the cognitive abilities of the patient and to analyze their psychological well-being.

Answer: C. To record ictal and interictal electrographic data for epileptogenic zone delineation and to determine the location of eloquent cortex to define safety margins for surgery.


In [50]:
mcq_sample.to_csv("docs/MCQ_q_a.csv")

In [41]:
## verify

answers_letter = []
for x in mcq_sample.mcq:
    # The text after "Answer:" starts with the letter of the correct option
    answer_text = x.split("Answer:")[1]
    print("-"*100)
    print(x.split("\n")[-1])
    print(answer_text)
    print(answer_text.strip()[0])
    answers_letter.append(answer_text.strip()[0])

----------------------------------------------------------------------------------------------------
Answer: C. To record ictal and interictal electrographic data for epileptogenic zone delineation and to determine the location of eloquent cortex to define safety margins for surgery.
 C. To record ictal and interictal electrographic data for epileptogenic zone delineation and to determine the location of eloquent cortex to define safety margins for surgery.
C
----------------------------------------------------------------------------------------------------
Answer: B) Establish management strategies for epilepsy therapies to help reduce seizure incidences and SUDEP risk while taking into account patient preferences and balances the risks and benefits of any novel regime.
 B) Establish management strategies for epilepsy therapies to help reduce seizure incidences and SUDEP risk while taking into account patient preferences and balances the risks and benefits of any novel regime.
B
----

In [42]:
mcq_sample["answer_letter"] = answers_letter
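
Taking the first character after `Answer:` works for the outputs shown above, but it raises an `IndexError` when the marker is missing and returns a wrong letter if the model writes something like "Answer: The correct option is C". A slightly more defensive sketch, assuming options are labelled with single capital letters; `extract_answer_letter` and `ANSWER_RE` are hypothetical names:

```python
import re

# "Answer:", optional whitespace, optional "(", then a single capital letter
# that is not the start of a longer word.
ANSWER_RE = re.compile(r"Answer:\s*\(?([A-Z])\b")


def extract_answer_letter(mcq_text):
    """Return the option letter after the last 'Answer:' marker, or None."""
    matches = ANSWER_RE.findall(mcq_text)
    return matches[-1] if matches else None


print(extract_answer_letter("Answer: C. To record ictal data"))
print(extract_answer_letter("Answer: B) Establish management strategies"))
```

Rows for which the helper returns `None` can then be reviewed by hand instead of silently receiving a wrong letter.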

In [64]:
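# The last line of the generated text holds the answer; everything before it is the question and its options
mcq_sample["mcq_answer_full"] = mcq_sample["mcq"].apply(lambda x : x.split("\n")[-1])
mcq_sample["mcq_question_clean"] = mcq_sample["mcq"].apply(lambda x : "\n".join(x.split("\n")[:-1]))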
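
Splitting on the last line assumes the answer sits on the final line and that the text has no trailing newline. A sketch that splits at the last `Answer:` marker instead is shown below; `split_mcq` is a hypothetical helper and the sample text is invented.

```python
def split_mcq(mcq_text, marker="Answer:"):
    """Split generated MCQ text into (question_with_options, answer_line).

    Returns (text, None) when the marker is absent.
    """
    question, sep, answer = mcq_text.rpartition(marker)
    if not sep:
        return mcq_text.strip(), None
    return question.strip(), (sep + answer).strip()


sample = "Which option is right?\n\nA. First\nB. Second\n\nAnswer: B. Second\n"
question, answer = split_mcq(sample)
print(question)
print(answer)
```

With the trailing newline in `sample`, a plain `split("\n")[-1]` would return an empty string, while the marker-based split still recovers the answer line.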

In [65]:
for x in mcq_sample["mcq_question_clean"].values:
    print("-"*100)
    print(x)

----------------------------------------------------------------------------------------------------
MCQ Question: 

During an evaluation for epilepsy surgery, intracranial EEG monitoring is often employed. What are the primary objectives of implementing this type of assessment in the process? 

A. To measure blood flow in the brain and to visualize the structure of the brain.
B. To record sleep patterns and circadian rhythms in patients with epilepsy. 
C. To record ictal and interictal electrographic data for epileptogenic zone delineation and to determine the location of eloquent cortex to define safety margins for surgery. 
D. To examine the cognitive abilities of the patient and to analyze their psychological well-being.

----------------------------------------------------------------------------------------------------
MCQ Question: 

What is the optimal course of action that clinicians should undertake to manage the risk of Sudden Unexpected Death in Epilepsy (SUDEP) in patients