<a href="https://colab.research.google.com/github/hewansirak/iCog-Trainings/blob/main/Model_Comparison_Experiment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Model Comparison between GPT/Gemini vs BioFinetuned Models**


## Preprocessing and Initial Functions

In [1]:
!pip install sacremoses

Collecting sacremoses
  Downloading sacremoses-0.1.1-py3-none-any.whl.metadata (8.3 kB)
Downloading sacremoses-0.1.1-py3-none-any.whl (897 kB)
Installing collected packages: sacremoses
Successfully installed sacremoses-0.1.1


In [2]:
import requests
import re
from typing import Union, Dict, Optional, List

In [3]:
from google.colab import files
import json

In [4]:
# Load MedCAT-annotated abstracts
with open("medcatServiceResponse1.json", "r") as f:
    medcat_json_1 = json.load(f)

with open("medcatServiceResponse2.json", "r") as f:
    medcat_json_2 = json.load(f)


The MedCAT JSON comes from two files, `medcatServiceResponse1.json` and `medcatServiceResponse2.json`, uploaded to the Colab session. Each file contains an abstract annotated by MedCAT.



In [5]:
def parse_medcat_response(medcat_json):
    """
    Parses MedCAT's response JSON to keep only the required fields.

    Returns:
        dict: {
            "text": <original text>,
            "annotations": [
                {
                    "pretty_name": ...,
                    "cui": ...,
                    "types": [...],
                    "detected_name": ...
                },
                ...
            ]
        }
    """
    result = medcat_json.get("result", {})
    raw_annotations = result.get("annotations", [])
    text = result.get("text", "")

    filtered_annotations = []

    if raw_annotations and isinstance(raw_annotations[0], dict):
        for _, annotation in raw_annotations[0].items():
            filtered = {
                "pretty_name": annotation.get("pretty_name"),
                "detected_name": annotation.get("detected_name"),
                "cui": annotation.get("cui"),
                "types": annotation.get("types", []),

            }
            filtered_annotations.append(filtered)

    return {
        "text": text,
        "annotations": filtered_annotations
    }

##### test case for the above method
# text = "The patient was diagnosed with cancer."
# json_response = annotate_with_medcat(text)
# cleaned = parse_medcat_response(json_response)
# print(json.dumps(cleaned, indent=2))
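
The commented test above depends on an `annotate_with_medcat` helper that is not defined in this notebook. As a rough, offline alternative, the sketch below feeds a hand-written response through the same filtering logic. The `mock_response` dictionary, the placeholder CUI, the `source_value` field, and the name `parse_sketch` are all assumptions made for illustration; only the `result -> annotations -> [ {id: annotation} ]` shape is taken from `parse_medcat_response`.

```python
# Hypothetical MedCAT-style response; the CUI is a placeholder, not a real lookup.
mock_response = {
    "result": {
        "text": "The patient was diagnosed with cancer.",
        "annotations": [
            {
                "0": {
                    "pretty_name": "Patients",
                    "detected_name": "patient",
                    "cui": "C0000000",
                    "types": ["Patient or Disabled Group"],
                    "source_value": "patient",  # extra field, expected to be dropped
                }
            }
        ],
    }
}


def parse_sketch(medcat_json):
    """Condensed copy of the filtering done by parse_medcat_response."""
    result = medcat_json.get("result", {})
    raw = result.get("annotations", [])
    # Annotations arrive as a list whose first element maps ids to annotation dicts.
    first = raw[0] if raw and isinstance(raw[0], dict) else {}
    annotations = [
        {
            "pretty_name": a.get("pretty_name"),
            "detected_name": a.get("detected_name"),
            "cui": a.get("cui"),
            "types": a.get("types", []),
        }
        for a in first.values()
    ]
    return {"text": result.get("text", ""), "annotations": annotations}


parsed_mock = parse_sketch(mock_response)
print(parsed_mock)
```

An empty or malformed response falls through to `{"text": "", "annotations": []}` rather than raising, which matches the defaults used in the original function.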

In [6]:
parsed_1 = parse_medcat_response(medcat_json_1)
parsed_2 = parse_medcat_response(medcat_json_2)

In [7]:
parsed_1

{'text': 'An age-related decline in immune functions, referred to as immunosenescence, is partially responsible for the increased prevalence and severity of infectious diseases, and the low efficacy of vaccination in elderly persons. Immunosenescence is characterized by a decrease in cell-mediated immune function as well as by reduced humoral immune responses. Age-dependent defects in T- and B-cell function coexist with age-related changes within the innate immune system. In this review, we discuss the mechanisms and consequences of age-associated immune alterations as well as their implications for health in old age.',
 'annotations': [{'pretty_name': 'C0851454',
   'detected_name': 'age~related',
   'cui': 'C0851454',
   'types': ['']},
  {'pretty_name': 'Reduced',
   'detected_name': 'decline',
   'cui': 'C0392756',
   'types': ['Qualitative Concept']},
  {'pretty_name': 'functioning immune',
   'detected_name': 'immune~function',
   'cui': 'C1817756',
   'types': ['Organ or Tissue 

In [8]:
parsed_2

{'text': 'The current evidence-based guideline on self-medication in migraine and tension-type headache of the German, Austrian and Swiss headache societies and the German Society of Neurology is addressed to physicians engaged in primary care as well as pharmacists and patients. The guideline is especially concerned with the description of the methodology used, the selection process of the literature used and which evidence the recommendations are based upon. The following recommendations about self-medication in migraine attacks can be made: The efficacy of the fixed-dose combination of acetaminophen, acetylsalicylic acid and caffeine and the monotherapies with ibuprofen or naratriptan or acetaminophen or phenazone are scientifically proven and recommended as first-line therapy. None of the substances used in self-medication in migraine prophylaxis can be seen as effective. Concerning the self-medication in tension-type headache, the following therapies can be recommended as first-li

In [9]:
FOL_generation_prompt = """You are a biomedical text reasoning assistant. Your task is to extract relationships from biomedical text and express them in the form of subject–predicate–object triples, formatted as JSON.

Only use concepts from the provided annotations as the **subject** and **object** of each triple. The **predicate** should describe the relationship between them, based on the context of the original text.

Each annotation contains:
- `pretty_name`: a normalized biomedical concept
- `detected_name`: the phrase as it appears in the original text
- `types`: the semantic category of the concept

Use `pretty_name` for the subject and object values. The `detected_name` shows the original wording found in the text but should not appear in the output.

---

### Example

Text:
"The patient was diagnosed with cancer."

Annotations:
- Concept: Patients (Type: Patient or Disabled Group, Mentioned as: "patient")
- Concept: cancer diagnosis (Type: Diagnostic Procedure, Mentioned as: "diagnosed~with~cancer")

Expected Output (JSON):
```{{
  "triples": [
    {{
      "subject": "Patients",
      "predicate": "diagnosed_with",
      "object": "cancer diagnosis"
    }}
  ]
}}```

---

Now, extract triples from the following input:

Text:
{texts}

Annotations:
{concepts}

"""

In [10]:
import re

def smart_truncate_prompt(prompt, tokenizer, max_input_tokens=900):
    """
    Truncate the prompt to fit within max_input_tokens, respecting sentence boundaries.
    """

    # Split by sentence endings
    sentences = re.split(r'(?<=[.!?])\s+', prompt)

    selected_sentences = []
    total_tokens = 0

    for sentence in sentences:
        sentence_tokens = tokenizer.tokenize(sentence)
        if total_tokens + len(sentence_tokens) > max_input_tokens:
            break
        selected_sentences.append(sentence)
        total_tokens += len(sentence_tokens)

    truncated_prompt = ' '.join(selected_sentences)

    # Final safety check
    final_tokens = tokenizer.tokenize(truncated_prompt)
    assert len(final_tokens) <= max_input_tokens, "Truncated prompt still too long."

    return truncated_prompt
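
One consequence of this approach is worth checking before relying on it: sentences are kept from the start of the prompt, so whatever is dropped comes from the end, which in `FOL_generation_prompt` is where the `Text:` and `Annotations:` sections sit. The sketch below reproduces the same loop with a whitespace tokenizer so that it runs without a model; `WhitespaceTokenizer` and `truncate_by_sentence` are hypothetical names, and real subword tokenizers will give different counts.

```python
import re


class WhitespaceTokenizer:
    """Hypothetical stand-in for a Hugging Face tokenizer: one token per word."""

    def tokenize(self, text):
        return text.split()


def truncate_by_sentence(prompt, tokenizer, max_input_tokens):
    """Keep leading sentences until the token budget would be exceeded."""
    sentences = re.split(r"(?<=[.!?])\s+", prompt)
    kept, total = [], 0
    for sentence in sentences:
        n_tokens = len(tokenizer.tokenize(sentence))
        if total + n_tokens > max_input_tokens:
            break
        kept.append(sentence)
        total += n_tokens
    return " ".join(kept)


prompt = "You are an assistant. Extract triples. Text: aging weakens immunity."
short = truncate_by_sentence(prompt, WhitespaceTokenizer(), max_input_tokens=6)
print(short)  # the trailing "Text: ..." sentence no longer fits and is dropped
```

If the instructions must survive and the abstract must too, one option is to budget the fixed instruction part first and truncate only the annotation list.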


## **Gemini LLM**

In [None]:
import google.generativeai as genai
from google.colab import userdata
import os

In [None]:
def gemini_generate(prompt: str):
    try:
        api_key = userdata.get('GOOGLE_API_KEY')
        if not api_key:
            raise ValueError("GOOGLE_API_KEY not found in Colab secrets. Please add it.")
        genai.configure(api_key=api_key)

        model = genai.GenerativeModel(model_name="gemini-1.5-flash-latest")
        response = model.generate_content(prompt)
        return response.text
    except Exception as e:
        print(f"Gemini API Error: {str(e)}")
        raise

In [None]:
def generate_triples_from_concepts(parsed_medcat_response, prompt):
    """
    Generate First-Order Logic (FOL) relationships from annotated MedCAT concepts using a Gemini LLM.

    Args:
        parsed_medcat_response (dict): Parsed response containing annotations and original text.
        prompt (str): Prompt with placeholders like {concepts} and {texts}.

    Returns:
        str: LLM-generated FOL output.
    """

    annotations = parsed_medcat_response.get("annotations", [])
    text = parsed_medcat_response.get("text", "")

    concepts_str = "\n".join([
        f"- Concept: {c['pretty_name']} (Type: {', '.join(c['types'])}, Mentioned as: \"{c['detected_name']}\")"
        for c in annotations
    ])

    # Debugging print statements to see the values being passed
    print("Concepts String:")
    print(concepts_str)
    print("Text:")
    print(text)

    filled_prompt = prompt.format(concepts=concepts_str, texts=text)

    print("Filled Prompt:")
    print(filled_prompt)

    # To switch models, swap in a different generate function here
    response_text = gemini_generate(filled_prompt)
    return response_text.strip()


In [None]:
triples_1 = generate_triples_from_concepts(parsed_1, FOL_generation_prompt)
triples_2 = generate_triples_from_concepts(parsed_2, FOL_generation_prompt)

Concepts String:
- Concept: C0851454 (Type: , Mentioned as: "age~related")
- Concept: Reduced (Type: Qualitative Concept, Mentioned as: "decline")
- Concept: functioning immune (Type: Organ or Tissue Function, Mentioned as: "immune~function")
- Concept: Referring (Type: Functional Concept, Mentioned as: "referred")
- Concept: immunosenescence (Type: Physiologic Function, Mentioned as: "immunosenescence")
- Concept: Person Responsible (Type: Finding, Mentioned as: "partially~responsible")
- Concept: High Prevalence (Type: Quantitative Concept, Mentioned as: "increased~prevalence")
- Concept: Severities (Type: Qualitative Concept, Mentioned as: "severity")
- Concept: Communicable Diseases (Type: Disease or Syndrome, Mentioned as: "infectious~disease")
- Concept: low (Type: Qualitative Concept, Mentioned as: "low")
- Concept: Effectiveness (Type: Qualitative Concept, Mentioned as: "efficacy~of")
- Concept: Vaccination (Type: Therapeutic or Preventive Procedure, Mentioned as: "vaccination"

Use the same prompt with three models: one is GPT, and the others are fine-tuned models such as `microsoft/BiomedNLP-PubMedBERT` and `dmis-lab/biobert-v1.1`.

- Create a generate function for each model, analogous to `gemini_generate` (for example, one for OpenAI).
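
One possible way to run the same prompt against several backends without redefining `generate_triples_from_concepts` each time is to pass the generate function in as an argument. The sketch below is an assumption about how that could look, not the notebook's method; `generate_triples`, `build_concepts_str`, and `fake_backend` are hypothetical names, and `fake_backend` stands in for a real call such as `gemini_generate`.

```python
from typing import Callable, Dict, List


def build_concepts_str(annotations: List[Dict]) -> str:
    """Render annotations in the bullet format the prompt expects."""
    return "\n".join(
        '- Concept: {} (Type: {}, Mentioned as: "{}")'.format(
            c["pretty_name"], ", ".join(c["types"]), c["detected_name"]
        )
        for c in annotations
    )


def generate_triples(parsed: Dict, prompt: str, generate_fn: Callable[[str], str]) -> str:
    """Fill the prompt and delegate generation to whichever backend is passed in."""
    filled = prompt.format(
        concepts=build_concepts_str(parsed.get("annotations", [])),
        texts=parsed.get("text", ""),
    )
    return generate_fn(filled).strip()


# Hypothetical backend used only to make the sketch runnable offline.
seen_prompts = []


def fake_backend(prompt: str) -> str:
    seen_prompts.append(prompt)
    return ' {"triples": []} '


parsed = {
    "text": "The patient was diagnosed with cancer.",
    "annotations": [
        {"pretty_name": "Patients", "detected_name": "patient", "types": ["Patient or Disabled Group"]}
    ],
}
result = generate_triples(parsed, "Text:\n{texts}\n\nAnnotations:\n{concepts}\n", fake_backend)
print(result)
```

With this shape, comparing models reduces to calling `generate_triples` once per backend function with the same `parsed` input and prompt.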

In [None]:
triples_1

'```json\n{\n  "triples": [\n    {\n      "subject": "C0851454",\n      "predicate": "related_to",\n      "object": "Reduced"\n    },\n    {\n      "subject": "Reduced",\n      "predicate": "in",\n      "object": "functioning immune"\n    },\n    {\n      "subject": "Referring",\n      "predicate": "refers_to",\n      "object": "immunosenescence"\n    },\n    {\n      "subject": "immunosenescence",\n      "predicate": "responsible_for",\n      "object": "High Prevalence"\n    },\n    {\n      "subject": "High Prevalence",\n      "predicate": "of",\n      "object": "Communicable Diseases"\n    },\n    {\n      "subject": "immunosenescence",\n      "predicate": "responsible_for",\n      "object": "Severities"\n    },\n    {\n      "subject": "Severities",\n      "predicate": "of",\n      "object": "Communicable Diseases"\n    },\n    {\n      "subject": "immunosenescence",\n      "predicate": "responsible_for",\n      "object": "low"\n    },\n    {\n      "subject": "low",\n      "predic

In [None]:
triples_2

'```json\n{\n  "triples": [\n    {\n      "subject": "Guidelines",\n      "predicate": "is_addressed_to",\n      "object": "Physicians"\n    },\n    {\n      "subject": "Guidelines",\n      "predicate": "is_addressed_to",\n      "object": "Pharmacist"\n    },\n    {\n      "subject": "Guidelines",\n      "predicate": "is_addressed_to",\n      "object": "Patients"\n    },\n    {\n      "subject": "Physicians",\n      "predicate": "involved_in",\n      "object": "Primary Health Care"\n    },\n    {\n      "subject": "Guidelines",\n      "predicate": "concerns",\n      "object": "Description"\n    },\n    {\n      "subject": "Guidelines",\n      "predicate": "concerns",\n      "object": "Methodology"\n    },\n    {\n      "subject": "Guidelines",\n      "predicate": "concerns",\n      "object": "Choose"\n    },\n    {\n      "subject": "Guidelines",\n      "predicate": "concerns",\n      "object": "Process"\n    },\n    {\n      "subject": "Guidelines",\n      "predicate": "concerns",\n  
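
Both Gemini replies above are strings wrapped in a Markdown `json` code fence, so they need unwrapping before they can be compared as data. The following is a minimal sketch under that assumption; `extract_triples` is a hypothetical helper, and the sample reply contains a single triple copied from the `triples_1` output.

```python
import json
import re

FENCE = "`" * 3  # built programmatically to keep this example easy to embed
FENCED_BLOCK = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def extract_triples(response_text):
    """Return the list under "triples", or [] if the reply is not valid JSON."""
    match = FENCED_BLOCK.search(response_text)
    payload = match.group(1) if match else response_text
    try:
        data = json.loads(payload)
    except json.JSONDecodeError:
        return []
    return data.get("triples", []) if isinstance(data, dict) else []


reply = (
    FENCE
    + 'json\n{\n  "triples": [\n    {"subject": "immunosenescence", '
    + '"predicate": "responsible_for", "object": "Severities"}\n  ]\n}\n'
    + FENCE
)
triples = extract_triples(reply)
print(triples)
```

Returning an empty list on malformed output keeps a comparison loop running when a model produces free text instead of JSON.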

## **BioGPT**

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

In [None]:
tokenizer = AutoTokenizer.from_pretrained("microsoft/BioGPT")
model = AutoModelForCausalLM.from_pretrained("microsoft/BioGPT")

def biogpt_generate(prompt: str, max_new_tokens=150):
    max_input_len = model.config.max_position_embeddings - max_new_tokens

    safe_prompt = smart_truncate_prompt(prompt, tokenizer, max_input_tokens=max_input_len)

    inputs = tokenizer(safe_prompt, return_tensors="pt")

    outputs = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        temperature=0.7,
        top_k=50,
        top_p=0.95,
        pad_token_id=tokenizer.eos_token_id
    )

    return tokenizer.decode(outputs[0], skip_special_tokens=True)



The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [None]:
def generate_triples_from_concepts(parsed_medcat_response, prompt):
    """
    Generate First-Order Logic (FOL) relationships from annotated MedCAT concepts using BioGPT.

    Args:
        parsed_medcat_response (dict): Parsed response containing annotations and original text.
        prompt (str): Prompt with placeholders like {concepts} and {texts}.

    Returns:
        str: LLM-generated FOL output.
    """

    annotations = parsed_medcat_response.get("annotations", [])
    text = parsed_medcat_response.get("text", "")

    concepts_str = "\n".join([
        f"- Concept: {c['pretty_name']} (Type: {', '.join(c['types'])}, Mentioned as: \"{c['detected_name']}\")"
        for c in annotations
    ])

    # Debugging print statements to see the values being passed
    print("Concepts String:")
    print(concepts_str)
    print("Text:")
    print(text)

    filled_prompt = prompt.format(concepts=concepts_str, texts=text)

    print("Filled Prompt:")
    print(filled_prompt)

    # To switch models, swap in a different generate function here
    response_text = biogpt_generate(filled_prompt)
    return response_text.strip()


In [None]:
triples_1 = generate_triples_from_concepts(parsed_1, FOL_generation_prompt)
triples_2 = generate_triples_from_concepts(parsed_2, FOL_generation_prompt)

Concepts String:
- Concept: C0851454 (Type: , Mentioned as: "age~related")
- Concept: Reduced (Type: Qualitative Concept, Mentioned as: "decline")
- Concept: functioning immune (Type: Organ or Tissue Function, Mentioned as: "immune~function")
- Concept: Referring (Type: Functional Concept, Mentioned as: "referred")
- Concept: immunosenescence (Type: Physiologic Function, Mentioned as: "immunosenescence")
- Concept: Person Responsible (Type: Finding, Mentioned as: "partially~responsible")
- Concept: High Prevalence (Type: Quantitative Concept, Mentioned as: "increased~prevalence")
- Concept: Severities (Type: Qualitative Concept, Mentioned as: "severity")
- Concept: Communicable Diseases (Type: Disease or Syndrome, Mentioned as: "infectious~disease")
- Concept: low (Type: Qualitative Concept, Mentioned as: "low")
- Concept: Effectiveness (Type: Qualitative Concept, Mentioned as: "efficacy~of")
- Concept: Vaccination (Type: Therapeutic or Preventive Procedure, Mentioned as: "vaccination"

In [None]:
triples_1

'You are a biomedical text reasoning assistant. Your task is to extract relationships from biomedical text and express them in the form of subject predicate object triples, formatted as JSON. Only use concepts from the provided annotations as the * * subject * * and * * object * * of each triple. The * * predicate * * should describe the relationship between them, based on the context of the original text. Each annotation contains: - pretty _ name: a normalized biomedical concept - detected _ name: the phrase as it appears in the original text - types: the semantic category of the concept Use pretty _ namefor the subject and object values. The detected _ nameshows the original wording found in the text but should not appear in the output. --- # # # Example Text: "The patient was diagnosed with cancer." Annotations: - Concept: Patients (Type: Patient or Disabled Group, Mentioned as: "patient") - Concept: cancer diagnosis (Type: Diagnostic Procedure, Mentioned as: "diagnosed ~ with ~ can

In [None]:
triples_2

'You are a biomedical text reasoning assistant. Your task is to extract relationships from biomedical text and express them in the form of subject predicate object triples, formatted as JSON. Only use concepts from the provided annotations as the * * subject * * and * * object * * of each triple. The * * predicate * * should describe the relationship between them, based on the context of the original text. Each annotation contains: - pretty _ name: a normalized biomedical concept - detected _ name: the phrase as it appears in the original text - types: the semantic category of the concept Use pretty _ namefor the subject and object values. The detected _ nameshows the original wording found in the text but should not appear in the output. --- # # # Example Text: "The patient was diagnosed with cancer." Annotations: - Concept: Patients (Type: Patient or Disabled Group, Mentioned as: "patient") - Concept: cancer diagnosis (Type: Diagnostic Procedure, Mentioned as: "diagnosed ~ with ~ can
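
Both BioGPT outputs above start with the prompt itself. For decoder-only models, Hugging Face `generate` returns the prompt token ids followed by the newly generated ones, so decoding `outputs[0]` in full reproduces the prompt before any continuation. A minimal sketch of keeping only the continuation is shown below; `decode_new_tokens` and `ToyTokenizer` are hypothetical names, used so the example runs without downloading a model.

```python
def decode_new_tokens(tokenizer, output_ids, prompt_len):
    """Decode only the tokens that come after the first prompt_len positions."""
    return tokenizer.decode(output_ids[prompt_len:], skip_special_tokens=True)


class ToyTokenizer:
    """Hypothetical stand-in exposing the same decode signature as a HF tokenizer."""

    vocab = ["<eos>", "Text:", "aging", "weakens", "immunity", "triples"]

    def decode(self, ids, skip_special_tokens=True):
        words = [self.vocab[i] for i in ids]
        if skip_special_tokens:
            words = [w for w in words if w != "<eos>"]
        return " ".join(words)


prompt_ids = [1, 2, 3, 4]          # "Text: aging weakens immunity"
output_ids = prompt_ids + [5, 0]   # prompt echoed back, then "triples <eos>"
completion = decode_new_tokens(ToyTokenizer(), output_ids, len(prompt_ids))
print(completion)
```

In `biogpt_generate`, the equivalent slice would be `outputs[0][inputs["input_ids"].shape[1]:]` passed to `tokenizer.decode`.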

## **BioMedLM**


In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("stanford-crfm/BioMedLM")
model = AutoModelForCausalLM.from_pretrained("stanford-crfm/BioMedLM")

# Optional: reuse the EOS token as the pad token (fixes padding issues in generation)
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.eos_token_id

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

def biomedlm_generate(prompt: str, max_new_tokens: int = 150):
    max_input_tokens = model.config.n_positions - max_new_tokens
    tokenized = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=max_input_tokens).to(device)

    # Optional: report the prompt length in tokens
    input_len = tokenized['input_ids'].shape[1]
    print(f"Prompt token length: {input_len} / {model.config.n_positions}")

    # Generate output
    with torch.no_grad():
        output = model.generate(
            **tokenized,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            top_k=50,
            top_p=0.95,
            temperature=0.7,
            pad_token_id=tokenizer.pad_token_id
        )

    # Decode and return
    return tokenizer.decode(output[0], skip_special_tokens=True)



In [None]:
def generate_triples_from_concepts(parsed_medcat_response, prompt):
    """
    Generate First-Order Logic (FOL) relationships from annotated MedCAT concepts using BioMedLM.

    Args:
        parsed_medcat_response (dict): Parsed response containing annotations and original text.
        prompt (str): Prompt with placeholders like {concepts} and {texts}.

    Returns:
        str: LLM-generated FOL output.
    """

    annotations = parsed_medcat_response.get("annotations", [])
    text = parsed_medcat_response.get("text", "")

    concepts_str = "\n".join([
        f"- Concept: {c['pretty_name']} (Type: {', '.join(c['types'])}, Mentioned as: \"{c['detected_name']}\")"
        for c in annotations
    ])

    print("Concepts String:")
    print(concepts_str)
    print("Text:")
    print(text)

    filled_prompt = prompt.format(concepts=concepts_str, texts=text)

    print("Filled Prompt:")
    print(filled_prompt)

    response_text = biomedlm_generate(filled_prompt)
    return response_text.strip()

In [None]:
test_prompt = "What is gene mutation?"

generated_text = biomedlm_generate(test_prompt)
print(generated_text)

Prompt token length: 5 / 1024
What is gene mutation?


In [None]:
test_prompt = "Who is Hewan?"

generated_text = biomedlm_generate(test_prompt)
print(generated_text)

Prompt token length: 7 / 1024
Who is Hewan?" A total of 228 patients (59.2%).

Discussion {#Sec15}


In [None]:
triples_1 = generate_triples_from_concepts(parsed_1, FOL_generation_prompt)
triples_2 = generate_triples_from_concepts(parsed_2, FOL_generation_prompt)

Concepts String:
- Concept: C0851454 (Type: , Mentioned as: "age~related")
- Concept: Reduced (Type: Qualitative Concept, Mentioned as: "decline")
- Concept: functioning immune (Type: Organ or Tissue Function, Mentioned as: "immune~function")
- Concept: Referring (Type: Functional Concept, Mentioned as: "referred")
- Concept: immunosenescence (Type: Physiologic Function, Mentioned as: "immunosenescence")
- Concept: Person Responsible (Type: Finding, Mentioned as: "partially~responsible")
- Concept: High Prevalence (Type: Quantitative Concept, Mentioned as: "increased~prevalence")
- Concept: Severities (Type: Qualitative Concept, Mentioned as: "severity")
- Concept: Communicable Diseases (Type: Disease or Syndrome, Mentioned as: "infectious~disease")
- Concept: low (Type: Qualitative Concept, Mentioned as: "low")
- Concept: Effectiveness (Type: Qualitative Concept, Mentioned as: "efficacy~of")
- Concept: Vaccination (Type: Therapeutic or Preventive Procedure, Mentioned as: "vaccination"

In [None]:
triples_1

'You are a biomedical text reasoning assistant. Your task is to extract relationships from biomedical text and express them in the form of subject–predicate–object triples, formatted as JSON.\n\nOnly use concepts from the provided annotations as the **subject** and **object** of each triple. The **predicate** should describe the relationship between them, based on the context of the original text.\n\nEach annotation contains:\n- `pretty_name`: a normalized biomedical concept\n- `detected_name`: the phrase as it appears in the original text\n- `types`: the semantic category of the concept\n\nUse `pretty_name` for the subject and object values. The `detected_name` shows the original wording found in the text but should not appear in the output.\n\n---\n\n### Example\n\nText:\n"The patient was diagnosed with cancer."\n\nAnnotations:\n- Concept: Patients (Type: Patient or Disabled Group, Mentioned as: "patient")\n- Concept: cancer diagnosis (Type: Diagnostic Procedure, Mentioned as: "diagn

In [None]:
triples_2

'You are a biomedical text reasoning assistant. Your task is to extract relationships from biomedical text and express them in the form of subject–predicate–object triples, formatted as JSON.\n\nOnly use concepts from the provided annotations as the **subject** and **object** of each triple. The **predicate** should describe the relationship between them, based on the context of the original text.\n\nEach annotation contains:\n- `pretty_name`: a normalized biomedical concept\n- `detected_name`: the phrase as it appears in the original text\n- `types`: the semantic category of the concept\n\nUse `pretty_name` for the subject and object values. The `detected_name` shows the original wording found in the text but should not appear in the output.\n\n---\n\n### Example\n\nText:\n"The patient was diagnosed with cancer."\n\nAnnotations:\n- Concept: Patients (Type: Patient or Disabled Group, Mentioned as: "patient")\n- Concept: cancer diagnosis (Type: Diagnostic Procedure, Mentioned as: "diagn

## **MedAlpaca**

In [11]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

In [12]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer_medalpaca = AutoTokenizer.from_pretrained("medalpaca/medalpaca-7b")
model_medalpaca = AutoModelForCausalLM.from_pretrained(
    "medalpaca/medalpaca-7b",
    torch_dtype=torch.float16,
    device_map="auto"
)

tokenizer_medalpaca.pad_token = tokenizer_medalpaca.eos_token
model_medalpaca.config.pad_token_id = tokenizer_medalpaca.eos_token_id

def medalpaca_generate(prompt: str, max_new_tokens=150):
    formatted_prompt = f"### Instruction:\n{prompt}\n\n### Response:"
    inputs = tokenizer_medalpaca(formatted_prompt, return_tensors="pt", truncation=True, max_length=1024).to(device)

    with torch.no_grad():
        outputs = model_medalpaca.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            top_k=50,
            top_p=0.95,
            temperature=0.7,
            pad_token_id=tokenizer_medalpaca.pad_token_id
        )
    return tokenizer_medalpaca.decode(outputs[0], skip_special_tokens=True)




In [14]:
test_prompt = "What is gene mutation?"

generated_text = medalpaca_generate(test_prompt)
print(generated_text)

### Instruction:
What is gene mutation?

### Response:A gene mutation is a change in a DNA sequence that is passed on to daughter cells. Most mutations are not beneficial, but some may be helpful or even necessary for the development of the individual or the species.

### What is the difference between a point mutation and a frameshift mutation?

A point mutation is a type of mutation that results in a single nucleotide change, while a frameshift mutation occurs when there is a deletion or insertion of multiple nucleotides. Frameshift mutations can disrupt the reading frame of the gene, which can lead to a change in the amino acid sequence of the resulting protein.

### What is


In [15]:
test_prompt = "What are the symptoms of HIV?"

generated_text = medalpaca_generate(test_prompt)
print(generated_text)

### Instruction:
What are the symptoms of HIV?

### Response:Early HIV infection does not cause any symptoms. This is called the " asymptomatic" stage. As the infection develops, people may experience a flu -like illness, headache, rash, or swollen lymph nodes. The symptoms go away after a few weeks. Some people may notice swollen lymph nodes in the groin, buttocks, or neck. These symptoms are often the result of infection with other viruses. People who have AIDS often have a weakened immune system and can develop opportunistic infections. These infections are caused by germs that normally do not cause problems in people with healthy immune


In [16]:
test_prompt = "represent the following in first order logic in subject predicate and object format: Kebede is the parent of Hlina?"

generated_text = medalpaca_generate(test_prompt)
print(generated_text)

### Instruction:
represent the following in first order logic in subject predicate and object format: Kebede is the parent of Hlina?

### Response:No

```
KBD : Person
KBD.name "KEBDE" .
KBD.age 25 .
KBD.sex "male" .
KBD/mother <KBD.name> .
KBD/father <KBD.name> .

HL : Person
HL.name "Hlina" .
HL.age 20 .
HL.sex "female" .
HL/mother <KBD.name> .
HL/father <KBD.name> .
```

In the above answer, we have two persons, Kebede and Hlina. Kebede has two parents, his mother
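
The three test outputs above echo the `### Instruction:` template and, in the first case, continue into a new `###` section after the answer. One way to isolate the answer is to take the text after `### Response:` and cut it at the next `###` heading. This is a sketch under that assumption; `extract_response` is a hypothetical helper, and the sample text is abridged from the first test output.

```python
def extract_response(generated_text, marker="### Response:"):
    """Return the text after the response marker, up to the next '###' heading."""
    _, sep, tail = generated_text.partition(marker)
    if not sep:
        # Marker missing: fall back to the full text rather than returning nothing.
        tail = generated_text
    return tail.split("\n###", 1)[0].strip()


raw = (
    "### Instruction:\nWhat is gene mutation?\n\n"
    "### Response:A gene mutation is a change in a DNA sequence that is passed on to daughter cells.\n\n"
    "### What is the difference between a point mutation and a frameshift mutation?"
)
answer = extract_response(raw)
print(answer)
```

A cut like this only trims the decoded string; the model still spends its `max_new_tokens` budget on the extra sections.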


In [17]:
def generate_triples_from_concepts(parsed_medcat_response, prompt):
    """
    Generate First-Order Logic (FOL) relationships from annotated MedCAT concepts using MedAlpaca.

    Args:
        parsed_medcat_response (dict): Parsed response containing annotations and original text.
        prompt (str): Prompt with placeholders like {concepts} and {texts}.

    Returns:
        str: LLM-generated FOL output.
    """

    annotations = parsed_medcat_response.get("annotations", [])
    text = parsed_medcat_response.get("text", "")

    concepts_str = "\n".join([
        f"- Concept: {c['pretty_name']} (Type: {', '.join(c['types'])}, Mentioned as: \"{c['detected_name']}\")"
        for c in annotations
    ])

    # Debugging print statements to see the values being passed
    print("Concepts String:")
    print(concepts_str)
    print("Text:")
    print(text)

    filled_prompt = prompt.format(concepts=concepts_str, texts=text)

    print("Filled Prompt:")
    print(filled_prompt)

    # To switch models, swap in a different generate function here
    response_text = medalpaca_generate(filled_prompt)
    return response_text.strip()

In [18]:
triples_1 = generate_triples_from_concepts(parsed_1, FOL_generation_prompt)

Concepts String:
- Concept: C0851454 (Type: , Mentioned as: "age~related")
- Concept: Reduced (Type: Qualitative Concept, Mentioned as: "decline")
- Concept: functioning immune (Type: Organ or Tissue Function, Mentioned as: "immune~function")
- Concept: Referring (Type: Functional Concept, Mentioned as: "referred")
- Concept: immunosenescence (Type: Physiologic Function, Mentioned as: "immunosenescence")
- Concept: Person Responsible (Type: Finding, Mentioned as: "partially~responsible")
- Concept: High Prevalence (Type: Quantitative Concept, Mentioned as: "increased~prevalence")
- Concept: Severities (Type: Qualitative Concept, Mentioned as: "severity")
- Concept: Communicable Diseases (Type: Disease or Syndrome, Mentioned as: "infectious~disease")
- Concept: low (Type: Qualitative Concept, Mentioned as: "low")
- Concept: Effectiveness (Type: Qualitative Concept, Mentioned as: "efficacy~of")
- Concept: Vaccination (Type: Therapeutic or Preventive Procedure, Mentioned as: "vaccination"

In [19]:
FOL_prompt_medalpaca = """
You are a biomedical assistant. Extract subject–predicate–object triples from the text using the concepts below.

Use only `pretty_name` values from the annotations as subject and object.

Text:
{texts}

Annotations:
{concepts}

Respond ONLY with a JSON array of subject–predicate–object triples, NO explanation, NO Python code.

Give response in this format:
{{

  "triples": [
    {{
      "subject": "X",
      "predicate": "Y",
      "object": "Z"
    }}
  ]
}}

"""

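The braces in the JSON example of `FOL_prompt_medalpaca` are doubled because the prompt is later filled with `str.format`, which treats single braces as replacement fields. The sketch below shows the effect on a shortened template. The names `mini_template` and `broken_template` are made up for this illustration.

```python
# Hypothetical miniature of the prompt template, for illustration only
mini_template = 'Text:\n{texts}\n\nFormat:\n{{\n  "triples": []\n}}'

filled = mini_template.format(texts="The patient has cancer.")
print(filled)  # the doubled braces are emitted as single braces

# The same template with single braces is parsed as a replacement field and fails
broken_template = 'Text:\n{texts}\n\nFormat:\n{\n  "triples": []\n}'
try:
    broken_template.format(texts="The patient has cancer.")
    format_failed = False
except (KeyError, ValueError, IndexError):
    format_failed = True
print("single-brace template failed:", format_failed)
```

Values substituted for `{texts}` or `{concepts}` are not re-parsed, so braces inside an abstract or concept name do not need escaping.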

In [20]:
test_response = {
    "text": "The patient has cancer and was treated with chemotherapy.",
    "annotations": [
        {"pretty_name": "Patients", "detected_name": "patient", "types": ["Population Group"]},
        {"pretty_name": "cancer", "detected_name": "cancer", "types": ["Disease or Syndrome"]},
        {"pretty_name": "chemotherapy", "detected_name": "chemotherapy", "types": ["Therapeutic Procedure"]}
    ]
}

result = generate_triples_from_concepts(test_response, FOL_prompt_medalpaca)
print(result)

Concepts String:
- Concept: Patients (Type: Population Group, Mentioned as: "patient")
- Concept: cancer (Type: Disease or Syndrome, Mentioned as: "cancer")
- Concept: chemotherapy (Type: Therapeutic Procedure, Mentioned as: "chemotherapy")
Text:
The patient has cancer and was treated with chemotherapy.
Filled Prompt:

You are a biomedical assistant. Extract subject–predicate–object triples from the text using the concepts below.

Use only `pretty_name` values from the annotations as subject and object.

Text:
The patient has cancer and was treated with chemotherapy.

Annotations:
- Concept: Patients (Type: Population Group, Mentioned as: "patient")
- Concept: cancer (Type: Disease or Syndrome, Mentioned as: "cancer")
- Concept: chemotherapy (Type: Therapeutic Procedure, Mentioned as: "chemotherapy")

Respond ONLY with a JSON array of subject–predicate–object triples, NO explanation, NO Python code.

Give response in this format:
{

  "triples": [
    {
      "subject": "X",
      "pre
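
Because the prompt itself contains a JSON example, and the MedAlpaca output printed earlier in this notebook includes the `### Instruction:` / `### Response:` scaffold, the generated text may contain more than one JSON object. A minimal sketch of a parser that reads only what follows the last response marker is shown below. The function name `extract_triples` and the sample string are assumptions for illustration; the sample is hand-written, not real model output.

```python
import json


def extract_triples(generated_text, marker="### Response:"):
    """Return the list stored under "triples" in the first valid JSON object
    found after the last response marker, or [] if none is found."""
    # If the marker is absent, rpartition leaves the whole text in `tail`
    _, _, tail = generated_text.rpartition(marker)

    decoder = json.JSONDecoder()
    pos = tail.find("{")
    while pos != -1:
        try:
            obj, _ = decoder.raw_decode(tail, pos)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict) and isinstance(obj.get("triples"), list):
            return obj["triples"]
        # Try the next opening brace if this one did not yield a usable object
        pos = tail.find("{", pos + 1)
    return []


# Hand-written example string (not actual model output)
sample_text = (
    '### Instruction:\n'
    '{"triples": [{"subject": "X", "predicate": "Y", "object": "Z"}]}\n'
    '### Response:\n'
    'Here are the triples:\n'
    '{"triples": [{"subject": "Patients", "predicate": "has", "object": "cancer"}]}'
)
parsed_triples = extract_triples(sample_text)
print(parsed_triples)
```

Returning an empty list on failure keeps downstream comparison code simple, at the cost of hiding malformed responses; logging the raw text in that case may be worthwhile.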

In [21]:
test_response = {
    "text": "An age-related decline in immune functions, referred to as immunosenescence, is partially responsible for the increased prevalence and severity of infectious diseases, and the low efficacy of vaccination in elderly persons.",
    'annotations': [{'pretty_name': 'C0851454',
   'detected_name': 'age~related',
   'cui': 'C0851454',
   'types': ['']},
  {'pretty_name': 'Reduced',
   'detected_name': 'decline',
   'cui': 'C0392756',
   'types': ['Qualitative Concept']},
  {'pretty_name': 'functioning immune',
   'detected_name': 'immune~function',
   'cui': 'C1817756',
   'types': ['Organ or Tissue Function']},
  {'pretty_name': 'Referring',
   'detected_name': 'referred',
   'cui': 'C0205543',
   'types': ['Functional Concept']},
  {'pretty_name': 'immunosenescence',
   'detected_name': 'immunosenescence',
   'cui': 'C0596761',
   'types': ['Physiologic Function']},
  {'pretty_name': 'Person Responsible',
   'detected_name': 'partially~responsible',
   'cui': 'C1273518',
   'types': ['Finding']},
  {'pretty_name': 'High Prevalence',
   'detected_name': 'increased~prevalence',
   'cui': 'C1512456',
   'types': ['Quantitative Concept']},
  {'pretty_name': 'Severities',
   'detected_name': 'severity',
   'cui': 'C0439793',
   'types': ['Qualitative Concept']},
  {'pretty_name': 'Communicable Diseases',
   'detected_name': 'infectious~disease',
   'cui': 'C0009450',
   'types': ['Disease or Syndrome']},
  {'pretty_name': 'low',
   'detected_name': 'low',
   'cui': 'C0205251',
   'types': ['Qualitative Concept']},
  {'pretty_name': 'Effectiveness',
   'detected_name': 'efficacy~of',
   'cui': 'C1280519',
   'types': ['Qualitative Concept']},
  {'pretty_name': 'Vaccination',
   'detected_name': 'vaccination',
   'cui': 'C0042196',
   'types': ['Therapeutic or Preventive Procedure']},
  {'pretty_name': 'Elderly (population group)',
   'detected_name': 'elderly',
   'cui': 'C0001792',
   'types': ['Population Group']},
  {'pretty_name': 'Individual',
   'detected_name': 'person',
   'cui': 'C0237401',
   'types': ['Population Group']},
  {'pretty_name': 'immunosenescence',
   'detected_name': 'immunosenescence',
   'cui': 'C0596761',
   'types': ['Physiologic Function']},
  {'pretty_name': 'Characterization',
   'detected_name': 'characterized',
   'cui': 'C1880022',
   'types': ['Activity']},
  {'pretty_name': 'Decrease',
   'detected_name': 'decrease',
   'cui': 'C0547047',
   'types': ['Quantitative Concept']},
  {'pretty_name': 'CD8-Positive T-Lymphocytes',
   'detected_name': 'cell',
   'cui': 'C0242629',
   'types': ['Cell']},
  {'pretty_name': 'Effect',
   'detected_name': 'mediated',
   'cui': 'C1280500',
   'types': ['Qualitative Concept']},
  {'pretty_name': 'functioning immune',
   'detected_name': 'immune~function',
   'cui': 'C1817756',
   'types': ['Organ or Tissue Function']},
  {'pretty_name': 'Reduced',
   'detected_name': 'reduced',
   'cui': 'C0392756',
   'types': ['Qualitative Concept']},
  {'pretty_name': 'Humoral immune response',
   'detected_name': 'humoral~immune~response',
   'cui': 'C1155229',
   'types': ['Cell Function']},
  {'pretty_name': 'Health',
   'detected_name': 'health',
   'cui': 'C0018684',
   'types': ['Idea or Concept']},
  {'pretty_name': 'Old age',
   'detected_name': 'old~age',
   'cui': 'C1999167',
   'types': ['Population Group']}]
}

resultTwo = generate_triples_from_concepts(test_response, FOL_prompt_medalpaca)
print(resultTwo)

Concepts String:
- Concept: C0851454 (Type: , Mentioned as: "age~related")
- Concept: Reduced (Type: Qualitative Concept, Mentioned as: "decline")
- Concept: functioning immune (Type: Organ or Tissue Function, Mentioned as: "immune~function")
- Concept: Referring (Type: Functional Concept, Mentioned as: "referred")
- Concept: immunosenescence (Type: Physiologic Function, Mentioned as: "immunosenescence")
- Concept: Person Responsible (Type: Finding, Mentioned as: "partially~responsible")
- Concept: High Prevalence (Type: Quantitative Concept, Mentioned as: "increased~prevalence")
- Concept: Severities (Type: Qualitative Concept, Mentioned as: "severity")
- Concept: Communicable Diseases (Type: Disease or Syndrome, Mentioned as: "infectious~disease")
- Concept: low (Type: Qualitative Concept, Mentioned as: "low")
- Concept: Effectiveness (Type: Qualitative Concept, Mentioned as: "efficacy~of")
- Concept: Vaccination (Type: Therapeutic or Preventive Procedure, Mentioned as: "vaccination"