# Medical Research Project

[paperai](https://github.com/neuml/paperai) is an AI application for medical and scientific papers.

This notebook walks through an end-to-end process that builds a research report on `Colon Cancer in Young Adults`, a growing concern. We'll filter PubMed abstracts, build a `txtai` embeddings index and then run a research report.

# Install dependencies

Install `paperetl`, `paperai` and all dependencies. This step also downloads the NLTK data needed for processing.

In [None]:
%%capture
!pip install git+https://github.com/neuml/paperai

# Download NLTK data
!python -c "import nltk; nltk.download(['punkt', 'punkt_tab', 'averaged_perceptron_tagger_eng'])"

# Build article repository

The first step is filtering the PubMed abstracts down to those about `Colon Cancer`. We'll keep only articles tagged with the MeSH code [D015179](https://meshb.nlm.nih.gov/record/ui?ui=D015179) (Colorectal Neoplasms).

This notebook assumes the [PubMed baseline dataset](https://pubmed.ncbi.nlm.nih.gov/download/) is available in the local `/data/sources/pubmed/data` directory.
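
To make the MeSH filter concrete, here is a minimal sketch of what matching on a MeSH code means for PubMed XML records. It is not how `paperetl` implements the filter (that is driven by the config directory passed on the command line). The function names `has_mesh_code` and `filter_articles` are hypothetical and exist only for this illustration; the element names (`PubmedArticle`, `DescriptorName`, `PMID`) follow the PubMed XML format.

```python
import xml.etree.ElementTree as ET

# MeSH descriptor used in this notebook (Colorectal Neoplasms)
MESH_CODE = "D015179"

def has_mesh_code(article, code=MESH_CODE):
    """Return True if a PubmedArticle element is tagged with the MeSH descriptor."""
    # DescriptorName elements carry the MeSH unique identifier in the UI attribute
    return any(d.get("UI") == code for d in article.iter("DescriptorName"))

def filter_articles(xml_text, code=MESH_CODE):
    """Return the PMIDs of articles in a PubMed XML document that match the code."""
    root = ET.fromstring(xml_text)
    pmids = []
    for article in root.iter("PubmedArticle"):
        if has_mesh_code(article, code):
            pmids.append(article.findtext("MedlineCitation/PMID"))
    return pmids
```

Calling `filter_articles` on the text of a PubMed XML file returns only the identifiers of articles tagged with the descriptor, which is the subset the next step loads into SQLite.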

In [None]:
!python -m paperetl.file /data/sources/pubmed/data /data/sources/pubmed/subsets/CRC/data /data/sources/pubmed/subsets/CRC/config

Once complete, the output directory will contain a SQLite database with over 100K article abstracts.

In [6]:
!sqlite3 /data/sources/pubmed/subsets/CRC/data/articles.sqlite "SELECT COUNT(*) FROM articles"

109054


In [11]:
!sqlite3 /data/sources/pubmed/subsets/CRC/data/articles.sqlite "SELECT Title FROM articles LIMIT 3"

A phase II trial of interferon alpha-2b with folinic acid and 5-fluorouracil administered by 4-hour infusion in metastatic colorectal carcinoma.
In vitro modulation of haematoporphyrin derivative photodynamic therapy on colorectal carcinoma multicellular spheroids by verapamil.
Accurate method to measure the percentage hepatic replacement by tumour and its use in prognosis of patients with advanced colorectal cancer.
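
The same checks can be run from Python with the standard `sqlite3` module, which is handy when the results feed into later cells. This is a minimal sketch: the `articles` table and `Title` column come from the queries above, while the helper name `article_summary` is hypothetical.

```python
import sqlite3

def article_summary(connection, limit=3):
    """Return the article count and the first `limit` titles."""
    cursor = connection.cursor()

    # Total number of articles in the database
    count = cursor.execute("SELECT COUNT(*) FROM articles").fetchone()[0]

    # Sample of titles, with the limit passed as a bound parameter
    rows = cursor.execute("SELECT Title FROM articles LIMIT ?", (limit,))
    titles = [row[0] for row in rows]

    return count, titles
```

For the database built above, the call would be `article_summary(sqlite3.connect("/data/sources/pubmed/subsets/CRC/data/articles.sqlite"))`.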


# Build the paperai index

Now that the articles have been parsed, let's build a paperai index. We'll index the articles and vectorize each section using the [PubMedBERT Embeddings](https://huggingface.co/NeuML/pubmedbert-base-embeddings) vector model. For this example, we'll limit the index to the top 10,000 most cited articles within the dataset.

In [None]:
!python -m paperai.index /data/sources/pubmed/subsets/CRC/data neuml/pubmedbert-base-embeddings 0 10000

Once complete, additional index files will now be available in the data directory.

In [12]:
!ls /data/sources/pubmed/subsets/CRC/data

articles.sqlite  config.json  embeddings  ids


# Research report

Now that the data is all ready, let's build a report!

In [None]:
%%writefile crc.yml
name: ColonCancer
options:
    llm: Intelligent-Internet/II-Medical-8B-1706-GGUF/II-Medical-8B-1706.Q4_K_M.gguf
    system: You are a medical literature document parser. You extract fields from data.
    template: |
        Quickly extract the following field using the provided rules and context.

        Rules:
          - Keep it simple, don't overthink it
          - ONLY extract the data
          - NEVER explain why the field is extracted
          - NEVER restate the field name only give the field value
          - Say no data if the field can't be found within the context

        Field:
        {question}

        Context:
        {context}

    context: 5
    params:
        maxlength: 4096
        stripthink: True

Research:
    query: colon cancer young adults
    columns:
        - name: Date
        - name: Study
        - name: Study Link
        - name: Journal
        - {name: Sample Size, query: number of patients, question: Sample Size}
        - {name: Objective, query: objective, question: Study Objective}
        - {name: Causes, query: possible causes, question: List of possible causes}
        - {name: Detection, query: diagnosis, question: List of ways to diagnose}

This configuration defines the report name, the [RAG pipeline](https://neuml.github.io/txtai/pipeline/text/rag/) parameters and the columns to generate. We'll use a local LLM designed for working with medical literature, but any LLM supported by txtai can be used (Transformers, llama.cpp, APIs such as OpenAI/Claude, etc.). See the [LLM pipeline](https://neuml.github.io/txtai/pipeline/text/llm/) for more on this.
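
As a rough illustration of how the `template`, `context` and `stripthink` options fit together, the sketch below fills a shortened copy of the template with a question and the top-ranked passages, then cleans up a model response. It is a simplification under stated assumptions, not the actual paperai or txtai implementation: `build_prompt` and `strip_think` are hypothetical helpers, the rules section of the template is omitted for brevity, and it assumes that `context: 5` means the five best-matching passages are passed to the LLM and that `stripthink` removes `<think>...</think>` blocks from the output.

```python
import re

# Shortened copy of the report template (rules section omitted)
TEMPLATE = """Quickly extract the following field using the provided rules and context.

Field:
{question}

Context:
{context}
"""

def build_prompt(question, passages, template=TEMPLATE, context=5):
    """Fill the template with the question and the top `context` passages."""
    # `passages` is assumed to be ordered from most to least relevant
    selected = passages[:context]
    return template.format(question=question, context="\n".join(selected))

def strip_think(text):
    """Drop <think>...</think> blocks from a model response (assumed behavior)."""
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()
```

For a column such as `Sample Size`, the `query` value (`number of patients`) would select the passages and the `question` value (`Sample Size`) would fill the `{question}` placeholder.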

In [None]:
!python -m paperai.report crc.yml 10 csv /data/sources/pubmed/subsets/CRC/data/

In [13]:
import pandas as pd
from IPython.display import display, HTML

display(HTML(pd.read_csv("Research.csv").to_html(index=False)))

Date,Study,Study Link,Journal,Sample Size,Objective,Causes,Detection
2023-04-27,Young-onset colorectal cancer.,https://pubmed.ncbi.nlm.nih.gov/37105987,Nature reviews. Disease primers,no data,"To better understand the underlying mechanism and define relationships between environmental factors and YO-CRC development, long-term prospective studies are needed with lifestyle data collected from childhood.","hereditary cancer syndrome, antibiotic use, low physical activity, obesity, environmental factors",no data
2020-04-06,Trends in the epidemiology of young-onset colorectal cancer: a worldwide systematic review.,https://pubmed.ncbi.nlm.nih.gov/32252672,BMC cancer,no data,Assess trends in young-onset colorectal cancer (yCRC) incidence and pooled overall APCi.,"- Increasing prevalence and incidence trends of young-onset colorectal cancer (yCRC) in adults under 50 years, with annual percentage changes (APCp +2.6, APCi up to +4.03) \n- Increased risk of rectal cancer in adults less than 50 years (9/14 studies) with significant annual percentage change in incidence (p < 0.001)",no data
2019-02-28,Early-onset colorectal cancer in young individuals.,https://pubmed.ncbi.nlm.nih.gov/30520562,Molecular oncology,no data,Provide considerations for future perspectives.,"hereditary cancer predisposing syndromes, familial CRC",no data
2018-03-28,Colorectal Cancer in the Young.,https://pubmed.ncbi.nlm.nih.gov/29616330,Current gastroenterology reports,no data,The Study Objective is to determine whether current screening approaches should be modified and whether causation and treatment options may differ in a molecular subset of EOCRCs.,no data,no data
2017-05-15,The Growing Challenge of Young Adults With Colorectal Cancer.,https://pubmed.ncbi.nlm.nih.gov/28516436,"Oncology (Williston Park, N.Y.)",no data,"address specific issues pertaining to AYA patients with colorectal cancer, including evaluation for hereditary colorectal cancer syndromes, clinicopathologic and biologic features unique to AYA patients with colorectal cancer, treatment outcomes, and survivorship.",no data,evaluation for hereditary colorectal cancer syndromes
2017-02-28,Colorectal cancer is a leading cause of cancer incidence and mortality among adults younger than 50 years in the USA: a SEER-based analysis with comparison to other young-onset cancers.,https://pubmed.ncbi.nlm.nih.gov/27864324,Journal of investigative medicine : the official publication of the American Federation for Clinical Research,no data,"summarize extracted data, both overall, and stratified by sex.","breast cancer, lung cancer",no data
2016-04-07,Different risk factors for advanced colorectal neoplasm in young adults.,https://pubmed.ncbi.nlm.nih.gov/27053853,World journal of gastroenterology,70428,To evaluate and compare odds ratios (OR) for ACRN between young-adults (YA < 50 years) and older-adults (OA ≥ 50 years).,"age, male sex, current smoking, family history of colorectal cancer, diabetes mellitus related factors, obesity, CEA, low-density lipoprotein-cholesterol",colonoscopy
2015-11-01,High Prevalence of Hereditary Cancer Syndromes in Adolescents and Young Adults With Colorectal Cancer.,https://pubmed.ncbi.nlm.nih.gov/26195711,Journal of clinical oncology : official journal of the American Society of Clinical Oncology,no data,The Study Objective is to determine whether patients diagnosed with colorectal cancer at age 35 years or younger should receive genetic counseling regardless of their family history and phenotype.,hereditary cancer syndromes,genetic counseling
2015-03-30,Colorectal cancer in young adults.,https://pubmed.ncbi.nlm.nih.gov/25480403,Digestive diseases and sciences,no data,Further studies are needed regarding this topic.,"behavioral and environmental causes, inherited CRC syndromes",no data
2004-03-30,Colorectal cancer in the young.,https://pubmed.ncbi.nlm.nih.gov/15006562,American journal of surgery,55,Characterize CRC in the young population and determine how CRC in this population should be further addressed regarding detection and treatment.,no data,no data


The results show the fields we asked for and, when available, the extracted data. This is a great way to make a first pass over a large dataset and identify the most relevant articles to dig into during the research process.
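
One way to act on the report is to keep only the rows where a given field was actually extracted. The sketch below uses the standard `csv` module; the helper name `rows_with_data` is hypothetical, and it assumes the literal `no data` marker requested in the prompt rules.

```python
import csv
import io

def rows_with_data(csv_text, column, missing="no data"):
    """Return rows where `column` holds an extracted value rather than the missing marker."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if row[column].strip().lower() != missing]
```

For example, `rows_with_data(open("Research.csv", encoding="utf-8").read(), "Sample Size")` on the report above keeps the two studies that reported a sample size (70428 and 55).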

An important philosophy of `paperai` is that it doesn't use LLMs to summarize and abstract away the raw data. Instead, it builds an index that helps researchers get at the raw data they care about more easily.