# HealTac 2024 Tutorial
## PubMed abstract information extraction

- Yunsoo Kim (yunsoo.kim.23@ucl.ac.uk), Jinge Wu (jinge.wu.20@ucl.ac.uk), Honghan Wu (honghan.wu@ucl.ac.uk)

<a target="_blank" href="https://colab.research.google.com/github/knowlab/healtac_2024_tutorial.github.io/blob/main/pubmed_IE.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

Installing dependencies

In [None]:
!nvidia-smi

In [None]:
!pip install datasets transformers accelerate requests beautifulsoup4

In [None]:
import requests
from bs4 import BeautifulSoup

We will use the NCBI Entrez E-utilities to search PubMed (ESearch) and to retrieve abstracts (EFetch).

In [None]:
def get_pmids(term, maxCount):
    """Return up to maxCount PubMed IDs (as strings) matching the search term."""
    url = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi'
    maxCount = int(maxCount)
    pmid_list = []
    # Request the IDs in pages of at most 9,999, moving the start offset each time.
    for start in range(0, maxCount, 9999):
        params = {'db': 'pubmed', 'term': term,
                  'retmax': min(9999, maxCount - start), 'retstart': start}
        # Passing params lets requests URL-encode search terms that contain spaces.
        r = requests.get(url, params=params)
        soup = BeautifulSoup(r.text, 'html.parser')
        pmid_list += [tag.get_text(strip=True) for tag in soup.find_all('id')]
    return pmid_list
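
Paging through ESearch results needs a start offset and a page size for each request. The sketch below shows only that arithmetic, so it can be checked without network access. `page_windows` is a hypothetical helper introduced here for illustration; it is not part of the tutorial or of E-utilities.

```python
def page_windows(max_count, page_size=9999):
    """Return (retstart, retmax) pairs that together cover max_count records."""
    windows = []
    for start in range(0, max_count, page_size):
        # Cap the last page so the total never exceeds max_count.
        windows.append((start, min(page_size, max_count - start)))
    return windows


print(page_windows(1))
print(page_windows(25000))
```

Capping the last page keeps the number of requested IDs equal to `max_count`, even when it is not a multiple of the page size.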

In [None]:
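def get_pubmed_abstracts(pmid):
    """Return the abstract text for one PMID, or an empty string if there is none."""
    url = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi'
    r = requests.get(url, params={'db': 'pubmed', 'id': pmid, 'retmode': 'xml'})
    soup = BeautifulSoup(r.text, 'html.parser')
    # Normalise whitespace instead of using strip=True, which glues words together
    # around inline tags such as italicised gene names.
    sections = [' '.join(tag.get_text().split()) for tag in soup.find_all('abstracttext')]
    return '\n'.join(sections)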
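
To see how abstract sections come out of the EFetch XML, here is a minimal offline sketch using only the standard library. `SAMPLE_XML` is a made-up fragment that mimics the `AbstractText` elements of a PubMed record, and `abstract_from_xml` is a hypothetical helper. Prefixing each section with its `Label` attribute is a design choice of this sketch, not something the function above does.

```python
import xml.etree.ElementTree as ET

# Made-up fragment shaped like a PubMed EFetch record (illustration only).
SAMPLE_XML = """<PubmedArticleSet><PubmedArticle><MedlineCitation><Article><Abstract>
<AbstractText Label="RESULTS">Variants were found in <i>MYL3</i> and <i>TTN</i>.</AbstractText>
<AbstractText Label="CONCLUSIONS">Most were of uncertain significance.</AbstractText>
</Abstract></Article></MedlineCitation></PubmedArticle></PubmedArticleSet>"""


def abstract_from_xml(xml_text):
    """Join all AbstractText sections, keeping spaces around inline markup."""
    root = ET.fromstring(xml_text)
    parts = []
    for node in root.iter("AbstractText"):
        # itertext() walks nested inline tags such as <i> in document order.
        text = " ".join("".join(node.itertext()).split())
        label = node.get("Label")
        parts.append(f"{label}: {text}" if label else text)
    return "\n".join(parts)


print(abstract_from_xml(SAMPLE_XML))
```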

In [None]:
pmids=get_pmids('SCN5A', 1)
print(pmids)

In [None]:
get_pubmed_abstracts('38849547')

In [None]:
data={'pmid': '15485686',
 'text': 'A novel SCN5A mutation manifests as a malignant form of long QT syndrome with perinatal onset of tachycardia/bradycardia.\nOBJECTIVE: Congenital long QT syndrome (LQTS) with in utero onset of the rhythm disturbances is associated with a poor prognosis. In this study we investigated a newborn patient with fetal bradycardia, 2:1 atrioventricular block and ventricular tachycardia soon after birth. METHODS: Mutational analysis and DNA sequencing were conducted in a newborn. The 2:1 atrioventricular block improved to 1:1 conduction only after intravenous lidocaine infusion or a high dose of mexiletine, which also controlled the ventricular tachycardia. RESULTS: A novel, spontaneous LQTS-3 mutation was identified in the transmembrane segment 6 of domain IV of the Na(v)1.5 cardiac sodium channel, with a G-->A substitution at codon 1763, which changed a valine (GTG) to a methionine (ATG). The proband was heterozygous but the mutation was absent in the parents and the sister. Expression of this mutant channel in tsA201 mammalian cells by site-directed mutagenesis revealed a persistent tetrodotoxin-sensitive but lidocaine-resistant current that was associated with a positive shift of the steady-state inactivation curve, steeper activation curve and faster recovery from inactivation. We also found a similar electrophysiological profile for the neighboring V1764M mutant. But, the other neighboring I1762A mutant had no persistent current and was still associated with a positive shift of inactivation. CONCLUSIONS: These findings suggest that the Na(v)1.5/V1763M channel dysfunction and possible neighboring mutants contribute to a persistent inward current due to altered inactivation kinetics and clinically congenital LQTS with perinatal onset of arrhythmias that responded to lidocaine and mexiletine.',
 'entities': ['sodium',
  'valine (GTG) to a methionine (ATG)',
  'atrioventricular block',
  'tachycardia',
  'long QT syndrome',
  'bradycardia',
  'mexiletine',
  'Na(v)1.5',
  'SCN5A',
  'LQTS',
  'G-->A substitution at codon 1763',
  'V1763M',
  'lidocaine',
  'V1764M',
  'arrhythmias',
  'I1762A',
  'patient',
  'tetrodotoxin',
  'tsA201',
  'ventricular tachycardia']}
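
The `entities` list stores surface strings only, without their positions in the text. If character offsets are needed, they can be recovered by exact string search. The sketch below is one possible way to do this; `find_entity_spans` and the short sample sentence are hypothetical and used for illustration only.

```python
def find_entity_spans(text, entities):
    """Map each entity string to the (start, end) offsets of all its occurrences."""
    spans = {}
    for entity in entities:
        hits = []
        start = text.find(entity)
        while start != -1:
            hits.append((start, start + len(entity)))
            start = text.find(entity, start + 1)
        spans[entity] = hits
    return spans


sample_text = "SCN5A mutation in long QT syndrome; SCN5A encodes Na(v)1.5."
sample_entities = ["SCN5A", "long QT syndrome", "Na(v)1.5", "lidocaine"]
spans = find_entity_spans(sample_text, sample_entities)
for entity, hits in spans.items():
    print(entity, hits)
```

Running `find_entity_spans(data['text'], data['entities'])` would show which gold entities occur verbatim in the abstract; an empty list means the string was not found.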

In [None]:
text=data['text']

In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig

In [None]:
model = AutoModelForCausalLM.from_pretrained(
        'bluesky333/medphi2',
        device_map="auto",
        trust_remote_code=True,
    ).eval()

model.generation_config = GenerationConfig.from_pretrained(
    'bluesky333/medphi2', trust_remote_code=True
)

In [None]:
tokenizer = AutoTokenizer.from_pretrained(
    'bluesky333/medphi2'
)

In [None]:
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

In [None]:
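def get_response(question):
    """Generate a completion for the prompt and return only the newly generated text."""
    inputs = tokenizer(question, return_tensors="pt", return_attention_mask=False).to(DEVICE)
    response = model.generate(
        **inputs,
        eos_token_id=tokenizer.eos_token_id,  # stop at the end-of-sequence token
        pad_token_id=tokenizer.eos_token_id,  # reuse EOS as the padding token
        max_new_tokens=1024,
    )
    # Drop the prompt tokens so that only the model's continuation is decoded.
    prompt_length = inputs['input_ids'].shape[1]
    output = tokenizer.decode(response[0][prompt_length:], skip_special_tokens=True)
    return output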

In [None]:
data['entities']

In [None]:
get_response("Extract biomedical named entities. \n"+text)
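
Once the model's answer has been turned into a list of strings, it can be compared with the gold list in `data['entities']`. The following is a minimal sketch of exact-match precision, recall and F1, ignoring case. `score_entities` is a hypothetical helper and the two short lists are illustrative, not real model output.

```python
def score_entities(predicted, gold):
    """Exact-match precision, recall and F1 over case-insensitive entity strings."""
    pred = {p.strip().lower() for p in predicted if p.strip()}
    ref = {g.strip().lower() for g in gold if g.strip()}
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    if true_positives == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


gold = ["SCN5A", "lidocaine", "mexiletine", "long QT syndrome"]
predicted = ["scn5a", "Lidocaine", "sodium channel"]  # illustrative only
precision, recall, f1 = score_entities(predicted, gold)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

Exact matching is strict: a prediction such as "sodium channel" gets no credit against the gold entity "sodium", so partial-match scoring may be more appropriate in practice.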

In [None]:
print(get_response("Extract biomedical named entities and map them with their types. \n"+text))

In [None]:
print(get_response("Extract biomedical named entities and map them with their types. Give me the result in a table.  \n"+text))
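
Asking for a table makes the answer easier to post-process. Assuming the model replies with a Markdown-style pipe table (which is not guaranteed), the rows can be read into dictionaries as sketched below. `parse_table` is a hypothetical helper and `sample_output` is a hand-written stand-in, not an actual response from the model.

```python
def parse_table(output):
    """Parse a pipe-delimited table into a list of {header: cell} dictionaries."""
    rows = []
    for line in output.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip any prose around the table
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        # Skip the separator row, e.g. |---|:---:|
        if all(cell and set(cell) <= set("-:") for cell in cells):
            continue
        rows.append(cells)
    if not rows:
        return []
    header, *body = rows
    return [dict(zip(header, row)) for row in body]


# Hand-written stand-in for a model response (assumption, for illustration).
sample_output = """Here is the table:
| Entity | Type |
|---|---|
| SCN5A | Gene |
| lidocaine | Chemical |"""

table = parse_table(sample_output)
print(table)
```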

In-context learning

In [None]:
print(get_response("Extract biomedical named entities and map them with their types. Also, do relation extraction. Give me the result in a table.  \n"+text))

In [None]:
print(get_response("Extract biomedical named entities and map them with their types. Also, do relation extraction. The relationship format should be Gene A <-> Disease B if Gene A has an association with Disease B. Give me the result in a table.  \n"+text))
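
The prompt above describes the output format in words. In-context learning is often done by also including worked demonstrations in the prompt. The sketch below assembles such a few-shot prompt. `build_few_shot_prompt`, the `Text:`/`Relations:` layout and the single demonstration are assumptions made for illustration; they are not taken from the tutorial.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble an instruction, worked demonstrations and the new text into one prompt."""
    blocks = [instruction.strip()]
    for example_text, example_answer in examples:
        blocks.append(f"Text: {example_text}\nRelations: {example_answer}")
    # Leave the final answer slot empty for the model to complete.
    blocks.append(f"Text: {query}\nRelations:")
    return "\n\n".join(blocks)


instruction = (
    "Extract gene-disease relations. "
    "Write each relation as Gene A <-> Disease B."
)
# Illustrative demonstration only.
examples = [
    ("Variants in BRCA1 are associated with breast cancer.", "BRCA1 <-> breast cancer"),
]
prompt = build_few_shot_prompt(
    instruction, examples, "A novel SCN5A mutation manifests as long QT syndrome."
)
print(prompt)
```

The resulting string can be passed to `get_response` in place of the hand-written prompt.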

Chain-of-thought prompting

In [None]:
get_pubmed_abstracts('38849547')

In [None]:
text0='Sudden infant death syndrome (SIDS) is still the leading cause of death for newborns in developed countries. The pathophysiological mechanisms have not been fully clarified, but in some of SIDS cases variants of genes associated with inherited cardiac conditions are found. In this study, an analysis of SCD-related genes was performed to determine the prevalence of rare pathogenic (P) or likely pathogenic (LP) variants that could provide an unambiguous explanation for the fatal event. A cohort of 76 SIDS cases underwent Next-Generation Sequencing (NGS) analysis with a custom panel of SCD-related genes. Rare variants were classified according to the guidelines provided by the American College of Medical Genetics and Genomics (ACMG) and the specifications of the ClinGen association. Post-mortem genetic testing identified 50 (65.8%) carriers of at least one variant in SCD genes. 104 rare genetic variants were found, 65.4% in genes encoding structural proteins. Only 4 out of 76 cases (5.3%) hosted at least a P or LP variant found in genes with structural or structural/arrhythmogenic functions (SLC22A5, SCN5A, MYL3 and TTN). 99 variants were classified as of uncertain significance (VUS)'

In [None]:
get_response("Calculate the chi-square p-value of the difference in the distribution of variants between gene groups by function. \n"+text0)

In [None]:
print(get_response("Calculate the chi-square p-value of the difference in the distribution of variants between gene groups by function. Show me the step by step calculation. \n"+text0))
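
To judge the model's step-by-step answer, it helps to have a reference calculation. The abstract reports 104 rare variants, 65.4% of them in genes encoding structural proteins, but it does not state a null distribution. The sketch below therefore runs a chi-square goodness-of-fit test under the assumption of an equal split between structural and all other genes. That null hypothesis, the two-group split and the helper names are assumptions made here, not something stated in the abstract or the tutorial.

```python
import math


def chi_square_gof(observed, expected):
    """Chi-square goodness-of-fit statistic: sum of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def p_value_df1(chi2):
    """Upper-tail p-value for a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2))


total = 104
structural = round(0.654 * total)  # counts recovered from the reported percentage
other = total - structural
observed = [structural, other]
expected = [total / 2, total / 2]  # ASSUMPTION: equal split under the null

chi2 = chi_square_gof(observed, expected)
p = p_value_df1(chi2)
print(f"observed={observed} chi2={chi2:.3f} p={p:.4f}")
```

With two categories there is one degree of freedom, which is why the p-value can be computed with `math.erfc` alone. A different null hypothesis, for example one weighted by the number of genes per group on the panel, would give a different result.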