This repository contains the code and assets for a prototype system that builds a biomedical knowledge graph for gut health and inflammatory bowel disease (IBD) and exposes it via a Streamlit QA interface. The project compares a custom GenAI‑driven extraction pipeline with Neo4j’s LLM Graph Builder on shared biomedical articles (e.g. Burger et al. for the Human Gut Cell Atlas (HGCA) common coordinate framework).
The main goals are:
- Extract entities and relations from gut / IBD literature using LLMs.
- Populate a custom knowledge graph and a Neo4j Aura instance with a consistent schema.
- Support question‑answering over the graph (Streamlit app).
- Qualitatively compare answers and graph structure against the Neo4j LLM Graph Builder.

The repository is organised as follows:

- src/ – core Python code:
  - preprocessing (PDF → text, sentence splitting)
  - LLM prompts and extraction
  - KG construction and Neo4j loading
  - Streamlit app entry point
- data/
  - raw/ – source PDFs or text (not all tracked in Git)
  - processed/ – cleaned text, sentence indices, extracted triples, etc.
- eval/ – notebooks and scripts for:
  - baseline keyword search
- images/ – screenshots and figures used in the dissertation and README (Streamlit UI, Neo4j graphs, HGCA examples)
- requirement.txt – Python dependencies for the project

To set up the project:

- Clone the repo

      git clone https://github.com/lostiithi/MED_LLM.git
      cd MED_LLM

- Create and activate a virtual environment

      python -m venv .venv
      source .venv/bin/activate # Windows: .venv\Scripts\activate

- Install dependencies

      pip install -r requirement.txt

- Configure environment variables

  Create a .env file (not committed to Git) with at least:

      OPENAI_API_KEY=...
      NEO4J_URI=...
      NEO4J_USER=...
      NEO4J_PASSWORD=...

Example:

    python src/parse/parse_pdf.py \
      --input data/raw \
      --output data/processed

This produces cleaned text and sentence indices for each PDF.
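
The sentence-splitting part of this step can be illustrated with a minimal sketch. It is not the code in `src/parse/parse_pdf.py`; the function name `split_sentences` and the regex rule are assumptions chosen for illustration, and a simple rule like this will mis-split abbreviations such as "e.g." or "et al.".

```python
import re

# Hypothetical rule: a sentence ends at ., ! or ? followed by whitespace
# and an uppercase letter or digit. The real pipeline may use another method.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+(?=[A-Z0-9])")


def split_sentences(text: str) -> list[tuple[int, str]]:
    """Return (sentence_index, sentence) pairs from extracted PDF text."""
    # Collapse line breaks and repeated spaces left over from PDF extraction.
    cleaned = re.sub(r"\s+", " ", text).strip()
    if not cleaned:
        return []
    parts = [p.strip() for p in _SENTENCE_END.split(cleaned) if p.strip()]
    return list(enumerate(parts))


sample = (
    "Crohn's disease often affects the terminal\nileum. "
    "Ulcerative colitis affects the colon."
)
sentences = split_sentences(sample)
for index, sentence in sentences:
    print(index, sentence)
```

Keeping the index next to each sentence lets later steps point back to the supporting sentence for an extracted triple.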

    python src/run_extraction.py \
      --config src/config/extraction_hgca.yml \
      --input data/processed \
      --output data/processed/sentences.csv

This step calls the LLM to extract entities and relations and saves triples to CSV.
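
As a rough sketch of the "save triples to CSV" part of this step, the example below parses an LLM response into rows and writes them with the standard `csv` module. It does not call an LLM. The JSON response shape, the column names in `FIELDS`, and the helper names are all assumptions for illustration; the actual schema of `sentences.csv` may differ.

```python
import csv
import io
import json

# Assumed column layout, for illustration only.
FIELDS = ["sentence_id", "subject", "relation", "object"]


def parse_triples(sentence_id: int, llm_response: str) -> list[dict]:
    """Parse a JSON list of {subject, relation, object} items, skipping bad ones."""
    try:
        items = json.loads(llm_response)
    except json.JSONDecodeError:
        return []  # the model returned something that is not valid JSON
    if not isinstance(items, list):
        return []
    rows = []
    for item in items:
        # Keep only items that carry all three non-empty triple fields.
        if isinstance(item, dict) and all(item.get(k) for k in FIELDS[1:]):
            row = {"sentence_id": sentence_id}
            row.update({k: str(item[k]).strip() for k in FIELDS[1:]})
            rows.append(row)
    return rows


def write_triples(rows: list[dict], handle) -> None:
    """Write triple rows with a header line to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


# A made-up response: one complete triple and one incomplete item.
response = (
    '[{"subject": "ulcerative colitis", "relation": "AFFECTS", "object": "colon"},'
    ' {"subject": "incomplete"}]'
)
rows = parse_triples(0, response)
buffer = io.StringIO()
write_triples(rows, buffer)
print(buffer.getvalue())
```

Dropping malformed items at parse time keeps the CSV loadable even when the model output is only partly well formed.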

    python src/llm_test.py \
      --triples data/processed/sentences.csv \
      --output data/processed/llm_entities.csv \
      --output data/processed/llm_relations.csv

Use the Neo4j import option to upload the CSV files.
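
One way to derive an entity table and a relation table from triples is sketched below. This is an assumption about what such a step could look like, not the logic of `src/llm_test.py`; the function name `split_triples`, the integer IDs, and the case-insensitive deduplication are hypothetical choices.

```python
def split_triples(triples):
    """Turn (subject, relation, object) triples into entity rows and relation rows."""
    entity_ids = {}  # normalised entity name -> integer id
    relations = []
    for subject, relation, obj in triples:
        ids = []
        for name in (subject, obj):
            key = name.strip().lower()  # merge names that differ only in case
            if key not in entity_ids:
                entity_ids[key] = len(entity_ids) + 1
            ids.append(entity_ids[key])
        relations.append((ids[0], relation, ids[1]))
    entities = [(entity_id, name) for name, entity_id in entity_ids.items()]
    return entities, relations


# Made-up triples for illustration.
triples = [
    ("Crohn's disease", "AFFECTS", "terminal ileum"),
    ("crohn's disease", "ASSOCIATED_WITH", "inflammation"),
]
entities, relations = split_triples(triples)
print(entities)
print(relations)
```

Separate node and relationship tables keyed by ID are a common shape for CSV-based graph imports, because each entity is created once and referenced by every edge that touches it.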
Neo4j connection details are taken from .env.
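
A minimal sketch of reading those connection details is shown below. It assumes the values from .env have already been loaded into the process environment (how the project does this is not stated here), and the helper name `neo4j_settings` is hypothetical.

```python
import os

# Variable names taken from the .env example in this README.
REQUIRED = ("NEO4J_URI", "NEO4J_USER", "NEO4J_PASSWORD")


def neo4j_settings(env=None) -> dict:
    """Collect Neo4j settings, failing early if any variable is missing or empty."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED}


# Placeholder values passed explicitly so the example runs without a real .env.
example_env = {
    "NEO4J_URI": "neo4j+s://example.invalid",
    "NEO4J_USER": "neo4j",
    "NEO4J_PASSWORD": "change-me",
}
settings = neo4j_settings(example_env)
print(settings["NEO4J_URI"])
```

Checking all variables up front gives one clear error message instead of a connection failure later in the pipeline.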
To start the Gut / IBD Knowledge Graph QA interface:

    streamlit run src/app.py

The app lets you:
- Ask HGCA competency questions (e.g. Crohn’s lesions in terminal ileum).
- See answers generated over the custom KG.
- Inspect retrieved nodes, edges, and supporting sentences.
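
The idea of retrieving edges together with their supporting sentences can be sketched with a small in-memory stand-in. The app itself presumably queries the knowledge graph; the `Edge` class, the `retrieve` function, and the substring matching rule below are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    subject: str
    relation: str
    obj: str
    sentence: str  # supporting sentence the triple was extracted from


def retrieve(question: str, edges: list[Edge]) -> list[Edge]:
    """Return edges whose subject or object is mentioned in the question."""
    q = question.lower()
    return [e for e in edges if e.subject.lower() in q or e.obj.lower() in q]


# Made-up edges for illustration.
edges = [
    Edge("Crohn's lesions", "LOCATED_IN", "terminal ileum",
         "Crohn's lesions are frequently found in the terminal ileum."),
    Edge("ulcerative colitis", "AFFECTS", "colon",
         "Ulcerative colitis affects the colon."),
]
hits = retrieve("Which lesions occur in the terminal ileum?", edges)
for edge in hits:
    print(edge.subject, edge.relation, edge.obj, "|", edge.sentence)
```

Carrying the source sentence on each edge is what makes it possible to show evidence next to an answer.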
Under eval/ you will find:
- Baseline keyword search for each competency question.
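
A keyword baseline of this kind can be as simple as ranking sentences by how many question terms they share. The sketch below is an assumed formulation, not the code under eval/; the stopword list, the tokeniser, and the function names are hypothetical.

```python
import re

# Small illustrative stopword list; a real baseline would likely use a fuller one.
STOPWORDS = {"the", "a", "an", "in", "of", "is", "are", "which", "what", "and", "to"}


def tokens(text: str) -> set[str]:
    """Lowercase alphanumeric tokens with stopwords removed."""
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS


def keyword_search(question: str, sentences: list[str], top_k: int = 3):
    """Rank sentences by the number of distinct terms shared with the question."""
    q = tokens(question)
    scored = [(len(q & tokens(s)), i, s) for i, s in enumerate(sentences)]
    scored = [item for item in scored if item[0] > 0]  # drop non-matching sentences
    scored.sort(key=lambda item: (-item[0], item[1]))  # best score first, then order
    return [(s, score) for score, _, s in scored[:top_k]]


# Made-up corpus for illustration.
corpus = [
    "Crohn's lesions are common in the terminal ileum.",
    "The colon absorbs water.",
    "The ileum is part of the small intestine.",
]
results = keyword_search("Crohn's lesions in terminal ileum", corpus)
for sentence, score in results:
    print(score, sentence)
```

Because it only counts term overlap, a baseline like this gives a simple reference point for judging what the graph-based answers add.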
For questions about this project, please contact:
- Mohammed Ihthisham Neelam Kadavil (GitHub: lostithi)