# LLM-Based Biomedical Data Summarization


In [None]:
# Step 1: Install needed packages (run this once)
!pip install langchain langchain-community transformers





`transformers`: Hugging Face’s library for working with pre-trained language models such as BioGPT.

In [None]:
# Step 2: Write mock_api.py
mock_api_code = '''
import json

def get_benchling_eln_entries(query="compound"):
    entries = [
        {
            "id": "eln_001",
            "title": "Transfection with Compound 9831",
            "createdAt": "2024-06-21T15:30:00Z",
            "content": "HEK293T cells treated with compound 9831 at 10 µM. 85% inhibition. Moderate cytotoxicity.",
            "author": "Dr. Maya Singh"
        },
        {
            "id": "eln_002",
            "title": "Dose Response Study for Compound 9831",
            "createdAt": "2024-06-24T12:00:00Z",
            "content": "Compound 9831 tested at 1, 10, 50 µM. IC50 = 8.2 µM. Minimal off-target effects.",
            "author": "Dr. Ravi Kumar"
        },
        {
            "id": "eln_003",
            "title": "Compound 9831 on SH-SY5Y Cells",
            "createdAt": "2024-06-28T09:15:00Z",
            "content": "No inhibition on SH-SY5Y cells. Likely cell-line specific effect.",
            "author": "Dr. Maya Singh"
        },
        {
            "id": "eln_004",
            "title": "Compound 5291 Initial Trial",
            "createdAt": "2024-06-30T10:00:00Z",
            "content": "Compound 5291 tested on A549 cells. 40% inhibition at 20 µM. Stable response.",
            "author": "Dr. Wei Zhang"
        },
        {
            "id": "eln_005",
            "title": "IC50 of Compound 5291",
            "createdAt": "2024-07-01T13:00:00Z",
            "content": "Dose response of Compound 5291 gives IC50 = 22 µM. No cytotoxicity observed.",
            "author": "Dr. Wei Zhang"
        },
    ]
    return [e for e in entries if query.lower() in e["title"].lower() or query.lower() in e["content"].lower()]

def get_cdd_compound_info(compound_name="compound"):
    compounds = [
        {
            "compound_id": "cmpd_9831",
            "name": "Compound 9831",
            "structure": "CC1=CC(=O)NC(C)=C1",
            "assays": [
                {"assay_id": "a_001", "title": "HEK293T Inhibition", "result": "85", "units": "%", "dose": "10 µM"},
                {"assay_id": "a_002", "title": "IC50 HEK293T", "result": "8.2", "units": "µM", "dose": "1-50 µM"},
                {"assay_id": "a_003", "title": "CYP450 Off-target", "result": "Low", "units": "qualitative"}
            ]
        },
        {
            "compound_id": "cmpd_5291",
            "name": "Compound 5291",
            "structure": "CC(C)C1=CC=CC=C1",
            "assays": [
                {"assay_id": "a_004", "title": "A549 Inhibition", "result": "40", "units": "%", "dose": "20 µM"},
                {"assay_id": "a_005", "title": "IC50 A549", "result": "22", "units": "µM", "dose": "1-100 µM"}
            ]
        }
    ]
    return [c for c in compounds if compound_name.lower() in c["name"].lower()]
'''
with open("mock_api.py", "w") as f:
    f.write(mock_api_code)





This cell writes a Python file named `mock_api.py` that mocks two APIs providing biomedical data:

`get_benchling_eln_entries(query)`:
Returns a filtered list of mock Electronic Lab Notebook (ELN) entries that mention the given query in their title or content. These entries simulate experimental notes related to compounds, including details like cell tests, inhibition percentages, IC50 values, and authorship.

`get_cdd_compound_info(compound_name)`:
Returns a filtered list of mock compound records modeled on a CDD (Collaborative Drug Discovery) Vault export. Each compound record includes its ID, name, chemical structure (as a SMILES string), and associated assay results such as inhibition percentages, IC50 values, and off-target effects.

This mock API lets your code fetch example experimental and assay data without needing a real external service.
It helps you develop and test your summarization pipeline with consistent, predictable data.
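
As a quick illustration of the filtering behavior, the sketch below applies the same case-insensitive substring match to a trimmed-down copy of the mock entries. The `filter_entries` helper and the shortened `entries` list are illustrative stand-ins and are not part of `mock_api.py`.

```python
# Trimmed-down copy of three mock ELN entries (illustrative only).
entries = [
    {"title": "Transfection with Compound 9831",
     "content": "HEK293T cells treated with compound 9831 at 10 µM."},
    {"title": "Compound 9831 on SH-SY5Y Cells",
     "content": "No inhibition on SH-SY5Y cells."},
    {"title": "Compound 5291 Initial Trial",
     "content": "Compound 5291 tested on A549 cells."},
]

def filter_entries(entries, query="compound"):
    """Keep entries whose title or content contains the query, ignoring case."""
    q = query.lower()
    return [e for e in entries
            if q in e["title"].lower() or q in e["content"].lower()]

all_hits = filter_entries(entries)                     # default query matches every entry
hits_9831 = filter_entries(entries, "compound 9831")   # second entry matches on title only
no_hits = filter_entries(entries, "compound 0000")     # unknown compound gives an empty list

print(len(all_hits), len(hits_9831), len(no_hits))     # prints: 3 2 0
```

Note that the default query `"compound"` matches every record, so callers should pass a specific compound name to get a useful subset.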

For better understanding (Preeti):

Why do we use an API-style interface, even though the data is hard-coded as Python dictionaries inside the module?

**Modularity & Abstraction**: The API functions hide the details of how data is stored or fetched. Your main code just “asks” the API for data without worrying about where it comes from.

**Simulating Real-world Use**: In real projects, data usually lives on servers or databases and you access it via APIs. Using mock APIs helps you develop and test your code as if you’re working with a real external service.

**Easier Updates & Maintenance**: If the data changes or grows, you only need to update it inside the API functions, not throughout your whole codebase.

**Reusability**: APIs let you reuse the same data retrieval logic in different parts of your project or even in other projects.
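
The points above can be sketched in a few lines. In this minimal example the caller depends only on a "fetch" function, so a mock can later be swapped for a real client without touching the caller. The names `mock_fetch` and `list_titles` are hypothetical and exist only for illustration.

```python
from typing import Callable, Dict, List

def mock_fetch(query: str) -> List[Dict]:
    """Simplified stand-in for a data-retrieval function (hypothetical)."""
    data = [
        {"title": "Transfection with Compound 9831"},
        {"title": "Compound 5291 Initial Trial"},
    ]
    return [e for e in data if query.lower() in e["title"].lower()]

def list_titles(query: str, fetch: Callable[[str], List[Dict]]) -> List[str]:
    """Caller code depends only on the fetch interface, not on where the data lives."""
    return [e["title"] for e in fetch(query)]

titles = list_titles("9831", mock_fetch)
print(titles)  # prints: ['Transfection with Compound 9831']
```

A function that queries a real service could be passed in place of `mock_fetch`, provided it accepts a query string and returns a list of dictionaries with a `title` key.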

In [None]:
# Step 3: Write llm_chat.py
llm_chat_code = '''
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
# HuggingFacePipeline lives in langchain-community (installed in Step 1);
# importing it from langchain.llms is deprecated.
from langchain_community.llms import HuggingFacePipeline
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

# 1. Load BioGPT model and tokenizer
model_name = "microsoft/BioGPT"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# 2. Build HF text-generation pipeline
# Note: temperature only affects sampling. Unless do_sample=True is also set,
# generation is greedy and the temperature value is ignored.
hf_pipeline = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=300,
    temperature=0.7,
    return_full_text=False,
)

# 3. Wrap the pipeline with LangChain's LLM interface
pipe = HuggingFacePipeline(pipeline=hf_pipeline)

# 4. Define prompt template
prompt_template = PromptTemplate(
    input_variables=["eln", "cdd"],
    template=(
        "You are a biomedical assistant. Summarize the compound's performance using the data below.\\n"
        "ELN entries:\\n{eln}\\n\\n"
        "CDD data:\\n{cdd}\\n\\n"
        "Summary:"
    )
)

# 5. Create the LLMChain
chain = LLMChain(llm=pipe, prompt=prompt_template)

# 6. Helper function to clean and format input data
def summarize_data(eln_entries, cdd_data):
    eln_str = "\\n".join([f"{e['title']}: {e['content']}" for e in eln_entries])
    cdd_str = "\\n".join([
        f"{c['name']}:\\n" + "\\n".join([f"- {a['title']}: {a['result']} {a['units']}" for a in c["assays"]])
        for c in cdd_data
    ])

    try:
        return chain.invoke({"eln": eln_str, "cdd": cdd_str})
    except Exception as e:
        return f"[Error during summarization: {e}]"
'''

with open("llm_chat.py", "w") as f:
    f.write(llm_chat_code)


This cell writes `llm_chat.py`, a reusable module that takes experimental data, asks BioGPT to summarize it, and returns the chain's output, which includes the generated summary. Model loading, input formatting, prompting, and generation are all set up in one place.

1. Loads the BioGPT model and tokenizer, which are designed for biomedical text generation. `AutoModelForCausalLM` loads a causal language model, that is, a model trained to predict the next token from the tokens that precede it.

2. Builds a text-generation pipeline from the model and tokenizer and sets how much it writes (`max_new_tokens=300`), how random the output is (`temperature=0.7`, which takes effect only when sampling is enabled with `do_sample=True`), and which part of the output you see (`return_full_text=False` drops the prompt and keeps only the newly generated text).

3. Wraps the Hugging Face pipeline (`hf_pipeline`) in a LangChain-compatible LLM object so that it can be used with LangChain’s chains and prompt templates.

4. Defines a reusable template that combines ELN and CDD data into a clear, consistent prompt for the LLM to generate a scientific summary. LLMs like GPT or BioGPT don’t “just know” what to do; their output depends heavily on how the prompt is written.

5. Creates a chain that will:
   - take ELN and CDD data as input;
   - format them using `prompt_template`;
   - pass the full prompt to the model;
   - return the model’s generated summary text together with the inputs.

6. Defines a function that formats ELN and CDD data into text, sends it to the language model chain for summarization, and returns the chain's result, or an error message if something goes wrong.
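
To make steps 4 and 6 concrete, the sketch below builds the exact prompt text the model would receive for a small input. It is a plain-Python stand-in: `str.format` replaces LangChain's `PromptTemplate`, and `build_prompt` is a hypothetical helper that mirrors the string formatting inside `summarize_data` without loading any model.

```python
# Same template text as in llm_chat.py, filled with str.format instead of PromptTemplate.
TEMPLATE = (
    "You are a biomedical assistant. Summarize the compound's performance using the data below.\n"
    "ELN entries:\n{eln}\n\n"
    "CDD data:\n{cdd}\n\n"
    "Summary:"
)

# Small sample input taken from the mock data.
eln_entries = [
    {"title": "Dose Response Study for Compound 9831",
     "content": "Compound 9831 tested at 1, 10, 50 µM. IC50 = 8.2 µM. Minimal off-target effects."},
]
cdd_data = [
    {"name": "Compound 9831",
     "assays": [
         {"title": "HEK293T Inhibition", "result": "85", "units": "%"},
         {"title": "IC50 HEK293T", "result": "8.2", "units": "µM"},
     ]},
]

def build_prompt(eln_entries, cdd_data):
    """Flatten the records into text the way summarize_data does, then fill the template."""
    eln_str = "\n".join(f"{e['title']}: {e['content']}" for e in eln_entries)
    cdd_str = "\n".join(
        f"{c['name']}:\n"
        + "\n".join(f"- {a['title']}: {a['result']} {a['units']}" for a in c["assays"])
        for c in cdd_data
    )
    return TEMPLATE.format(eln=eln_str, cdd=cdd_str)

prompt = build_prompt(eln_entries, cdd_data)
print(prompt)
```

Printing the prompt before sending it to the model is a cheap way to check that the data was flattened as intended.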

In [None]:
!pip install -U sacremoses

# Step 4: Import and run the summarization inline
from mock_api import get_benchling_eln_entries, get_cdd_compound_info
from llm_chat import summarize_data

compound_name = "compound 9831"
eln_data = get_benchling_eln_entries(compound_name)
cdd_data = get_cdd_compound_info(compound_name)

print("=== ELN Data ===")
for e in eln_data:
    print(f"- {e['title']}: {e['content']}")

print("\n=== CDD Data ===")
for c in cdd_data:
    print(f"- {c['name']} assays:")
    for a in c["assays"]:
        print(f"  * {a['title']} = {a['result']} {a['units']}")

print("\n=== LLM Summary ===")
print(summarize_data(eln_data, cdd_data))



Device set to use cpu


=== ELN Data ===
- Transfection with Compound 9831: HEK293T cells treated with compound 9831 at 10 µM. 85% inhibition. Moderate cytotoxicity.
- Dose Response Study for Compound 9831: Compound 9831 tested at 1, 10, 50 µM. IC50 = 8.2 µM. Minimal off-target effects.
- Compound 9831 on SH-SY5Y Cells: No inhibition on SH-SY5Y cells. Likely cell-line specific effect.

=== CDD Data ===
- Compound 9831 assays:
  * HEK293T Inhibition = 85 %
  * IC50 HEK293T = 8.2 µM
  * CYP450 Off-target = Low qualitative

=== LLM Summary ===
{'eln': 'Transfection with Compound 9831: HEK293T cells treated with compound 9831 at 10 µM. 85% inhibition. Moderate cytotoxicity.\nDose Response Study for Compound 9831: Compound 9831 tested at 1, 10, 50 µM. IC50 = 8.2 µM. Minimal off-target effects.\nCompound 9831 on SH-SY5Y Cells: No inhibition on SH-SY5Y cells. Likely cell-line specific effect.', 'cdd': 'Compound 9831:\n- HEK293T Inhibition: 85 %\n- IC50 HEK293T: 8.2 µM\n- CYP450 Off-target: Low qualitative', 'text': 

### About the code:

This code block fetches mock experimental and assay data for a compound, displays the raw data, and then produces a natural language summary of that data using a biomedical language model. It demonstrates an end-to-end workflow from data retrieval to AI-powered summarization.

### About the output:

The summary is returned as a dictionary containing:

`eln`: The raw concatenated ELN text input.

`cdd`: The raw concatenated CDD assay text input.

`text`: The summary generated by the language model, which reads:

"In summary, compound 9831 is a safe and potent small molecule for in vitro use."

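Because the result is a dictionary rather than a plain string, printing it shows the inputs as well as the summary. A minimal sketch of pulling out only the generated text follows. `extract_summary` is a hypothetical helper, and the `example` dictionary is hand-written to mirror the shape described above, with the inputs shortened.

```python
def extract_summary(result):
    """Return only the generated text from a chain result.

    Assumes the result is either a dict with a "text" key (the shape described above)
    or a plain string (the error message summarize_data returns on failure).
    """
    if isinstance(result, dict):
        return result.get("text", "").strip()
    return str(result)

# Hand-written example shaped like the chain output (inputs shortened).
example = {
    "eln": "(ELN text omitted)",
    "cdd": "(CDD text omitted)",
    "text": " In summary, compound 9831 is a safe and potent small molecule for in vitro use.",
}

summary = extract_summary(example)
print(summary)
```

In the notebook, the same helper could be applied as `print(extract_summary(summarize_data(eln_data, cdd_data)))`.
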
So far, we have generated a summary of one compound's data using the language model (BioGPT).

Code using ChatGPT