# What if we use an LLM as a synthetic data generator for NER / RE?

## Loading a model with Ollama

In [None]:
!sudo apt update && sudo apt install -y pciutils lshw
!curl -fsSL https://ollama.com/install.sh | sh

In [None]:
!nohup ollama serve > ollama.log 2>&1 &

In [None]:
!ollama run vicuna:13b "What is the capital of Spain?"

In [None]:
!ollama list
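
Because `ollama serve` is started in the background, the next cells can run before the server is ready. The sketch below polls the server until it answers. The helper name `wait_for_ollama` and its default timeout values are illustrative assumptions, not part of Ollama; the address is the one used by the client later in this notebook.

```python
import time
import urllib.error
import urllib.request


def wait_for_ollama(base_url: str = "http://127.0.0.1:11434",
                    timeout: float = 30.0,
                    interval: float = 1.0) -> bool:
    """Poll base_url until it answers with HTTP 200 or `timeout` seconds elapse.

    Hypothetical helper: returns True as soon as the server responds,
    False if it never did within the time budget.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(base_url, timeout=interval) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server is not accepting connections yet
        time.sleep(interval)
    return False
```

Calling `wait_for_ollama()` in a cell right after the `ollama serve` cell should make the later requests less likely to fail with a connection error; if it returns `False`, `ollama.log` is the first place to look.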

### How good is the model at 'solving' the task?

---

Here we want to extract relations between Organisms and Chemicals, in the form:
```cypher
(Organism)-[:produces]->(Chemical)
```

We will give the LLM a few demonstrations of the task, using manually curated examples (5-shot).



<style>body {text-align: justify}</style>

#### Example 1 [PMID 24048364](https://pubmed.ncbi.nlm.nih.gov/24048364/)

**Title: Producers and important dietary sources of ochratoxin A and citrinin**

**Abstract:**
<font color='green'>Ochratoxin A</font> (OTA) is a very important mycotoxin, and its research is focused right now on the new findings of OTA, like being a complete carcinogen, information about OTA producers and new exposure sources of OTA. <font color='green'>Citrinin</font> (CIT) is another important mycotoxin, too, and its research turns towards nephrotoxicity. Both additive and synergistic effects have been described in combination with OTA. OTA is produced in foodstuffs by Aspergillus Section Circumdati (<font color='blue'>Aspergillus ochraceus</font>, <font color='blue'>A. westerdijkiae</font>, <font color='blue'>A. steynii</font>) and Aspergillus Section Nigri (<font color='blue'>Aspergillus carbonarius</font>, <font color='blue'>A. foetidus</font>, <font color='blue'>A. lacticoffeatus</font>, <font color='blue'>A. niger</font>, <font color='blue'>A. sclerotioniger</font>, <font color='blue'>A. tubingensis</font>), mostly in subtropical and tropical areas. OTA is produced in foodstuffs by <font color='blue'>Penicillium verrucosum</font> and <font color='blue'>P. nordicum</font>, notably in temperate and colder zones. CIT is produced in foodstuffs by Monascus species (<font color='blue'>Monascus purpureus</font>, <font color='blue'>M. ruber</font>) and Penicillium species (<font color='blue'>Penicillium citrinum</font>, <font color='blue'>P. expansum</font>, <font color='blue'>P. radicicola</font>, <font color='blue'>P. verrucosum</font>). OTA was frequently found in foodstuffs of both plant origin (e.g., cereal products, coffee, vegetable, liquorice, raisins, wine) and animal origin (e.g., pork/poultry). CIT was also found in foodstuffs of vegetable origin (e.g., cereals, pomaceous fruits, black olive, roasted nuts, spices), food supplements based on rice fermented with red microfungi Monascus purpureus and in foodstuffs of animal origin (e.g., cheese).

In [None]:
tiab = "Producers and important dietary sources of ochratoxin A and citrinin. Ochratoxin A (OTA) is a very important mycotoxin, and its research is focused right now on the new findings of OTA, like being a complete carcinogen, information about OTA producers and new exposure sources of OTA. Citrinin (CIT) is another important mycotoxin, too, and its research turns towards nephrotoxicity. Both additive and synergistic effects have been described in combination with OTA. OTA is produced in foodstuffs by Aspergillus Section Circumdati (Aspergillus ochraceus, A. westerdijkiae, A. steynii) and Aspergillus Section Nigri (Aspergillus carbonarius, A. foetidus, A. lacticoffeatus, A. niger, A. sclerotioniger, A. tubingensis), mostly in subtropical and tropical areas. OTA is produced in foodstuffs by Penicillium verrucosum and P. nordicum, notably in temperate and colder zones. CIT is produced in foodstuffs by Monascus species (Monascus purpureus, M. ruber) and Penicillium species (Penicillium citrinum, P. expansum, P. radicicola, P. verrucosum). OTA was frequently found in foodstuffs of both plant origin (e.g., cereal products, coffee, vegetable, liquorice, raisins, wine) and animal origin (e.g., pork/poultry). CIT was also found in foodstuffs of vegetable origin (e.g., cereals, pomaceous fruits, black olive, roasted nuts, spices), food supplements based on rice fermented with red microfungi Monascus purpureus and in foodstuffs of animal origin (e.g., cheese)."

In [None]:
prompt = f"""Instructions: The task is to extract natural-products relationships between organisms and chemicals from the input text.
Return all the extracted relations where the text reports evidence that an organism produces a chemical, in the form of 'organism produces chemical.'
Here are some examples:
INPUT: The antimicrobially active EtOH extracts of Maytenus heterophylla yielded a new dihydroagarofuran alkaloid, 1beta-acetoxy-9alpha-benzoyloxy-2beta,6alpha-dinicotinoyloxy-beta-dihydroagarofuran, together with the known compounds beta-amyrin, maytenfolic acid, 3alpha-hydroxy-2-oxofriedelane-20alpha-carboxylic acid, lup-20(29)-ene-1beta,3beta-diol, (-)-4'-methylepigallocatechin, and (-)-epicatechin.
OUTPUT: Maytenus heterophylla produces 1beta-acetoxy-9alpha-benzoyloxy-2beta,6alpha-dinicotinoyloxy-beta-dihydroagarofuran. Maytenus heterophylla produces beta-amyrin. Maytenus heterophylla produces maytenfolic acid. Maytenus heterophylla produces 3alpha-hydroxy-2-oxofriedelane-20alpha-carboxylic acid. Maytenus heterophylla produces lup-20(29)-ene-1beta,3beta-diol. Maytenus heterophylla produces (-)-4'-methylepigallocatechin. Maytenus heterophylla produces (-)-epicatechin.
INPUT: Ten new ergosteroids, gloeophyllins A-J (1-10), have been isolated from the solid cultures of Gloeophyllum abietinum.
OUTPUT: Gloeophyllum abietinum produces gloeophyllin A. Gloeophyllum abietinum produces gloeophyllin B. Gloeophyllum abietinum produces gloeophyllin C. Gloeophyllum abietinum produces gloeophyllin D. Gloeophyllum abietinum produces gloeophyllin E. Gloeophyllum abietinum produces gloeophyllin F. Gloeophyllum abietinum produces gloeophyllin G. Gloeophyllum abietinum produces gloeophyllin H. Gloeophyllum abietinum produces gloeophyllin I. Gloeophyllum abietinum produces gloeophyllin J.
INPUT: The present work describes the isolation of the cyclic peptides geodiamolides A, B, H and I (1-4) from G. corticostylifera and their anti-proliferative effects against sea urchin eggs and human breast cancer cell lineages.
OUTPUT: G. corticostylifera produces geodiamolide A. G. corticostylifera produces geodiamolide B. G. corticostylifera produces geodiamolide H. G. corticostylifera produces geodiamolide I.
INPUT: Four new cyclic peptides, patellamide G (2) and ulithiacyclamides E-G (3-5), along with the known patellamides A-C (6-8) and ulithiacyclamide B (9), were isolated from the ascidian Lissoclinum patella collected in Pohnpei, Federated States of Micronesia.
OUTPUT: Lissoclinum patella produces patellamide G. Lissoclinum patella produces ulithiacyclamide E. Lissoclinum patella produces ulithiacyclamide F. Lissoclinum patella produces ulithiacyclamide G. Lissoclinum patella produces patellamide A. Lissoclinum patella produces patellamide B. Lissoclinum patella produces patellamide C. Lissoclinum patella produces ulithiacyclamide B.
INPUT: Chemical investigation of Trogopterus faeces has led to the isolation of seven flavonoids. Their structures were elucidated by chemical and spectral analyses. In an anticoagulative assay, three kaempferol coumaroyl rhamnosides had significant antithrombin activity. This is the first report on the occurrence of flavonoid glycosides in Trogopterus faeces.
OUTPUT: Trogopterus faeces produces flavonoids. Trogopterus faeces produces kaempferol coumaroyl rhamnosides. Trogopterus faeces produces flavonoid glycosides.
INPUT: {tiab}
OUTPUT: """
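
The prompt above is written by hand. To swap demonstrations or change their number, the same layout can be assembled from a list of (input, output) pairs. This is a minimal sketch: `build_few_shot_prompt` is a hypothetical helper, and the single demonstration and the query sentence are taken from the texts in this notebook.

```python
from typing import Sequence, Tuple


def build_few_shot_prompt(instructions: str,
                          demonstrations: Sequence[Tuple[str, str]],
                          query: str) -> str:
    """Assemble an INPUT/OUTPUT few-shot prompt ending with an open OUTPUT slot."""
    lines = [f"Instructions: {instructions}", "Here are some examples:"]
    for input_text, output_text in demonstrations:
        lines.append(f"INPUT: {input_text}")
        lines.append(f"OUTPUT: {output_text}")
    # The query goes last; the model is expected to complete the final OUTPUT.
    lines.append(f"INPUT: {query}")
    lines.append("OUTPUT: ")
    return "\n".join(lines)


demonstrations = [
    (
        "Ten new ergosteroids, gloeophyllins A-J (1-10), have been isolated "
        "from the solid cultures of Gloeophyllum abietinum.",
        "Gloeophyllum abietinum produces gloeophyllin A. "
        "Gloeophyllum abietinum produces gloeophyllin B.",  # truncated for brevity
    ),
]
example_prompt = build_few_shot_prompt(
    "The task is to extract natural-products relationships between organisms "
    "and chemicals from the input text.",
    demonstrations,
    "OTA is produced in foodstuffs by Penicillium verrucosum and P. nordicum, "
    "notably in temperate and colder zones.",
)
print(example_prompt)
```

Keeping the demonstrations in a list makes it easy to compare, for instance, 1-shot against 5-shot prompts on the same abstract.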

In [None]:
from openai import OpenAI
client = OpenAI(
    api_key="FREE",
    base_url="http://127.0.0.1:11434/v1",
)

print(f"What we send to the model: \n{prompt}")

response = client.chat.completions.create(
    model="vicuna:13b",
    messages=[
            {"role": "system", "content": "You are an expert in Relation Extraction"},
            {"role": "user", "content": prompt},
        ],
    )
output = response.choices[0].message.content

print(f"What we received: \n {output}")
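
The model answers in free text, so the relations still have to be pulled out of the string. The sketch below assumes the model follows the 'organism produces chemical.' pattern requested in the prompt, which a general-purpose LLM may not always do. `parse_relations` and its regular expression are illustrative, not part of any library.

```python
import re
from typing import List, Tuple

# A relation is "<organism> produces <chemical>." where the closing period is
# followed by whitespace or the end of the string. Requiring that lookahead
# keeps abbreviated genus names ("G. corticostylifera") and locants inside
# chemical names ("bicyclo[2.2.1]heptane") in one piece.
RELATION_PATTERN = re.compile(r"\s*(.+?) produces (.+?)\.(?=\s|$)")


def parse_relations(output: str) -> List[Tuple[str, str]]:
    """Return (organism, chemical) pairs found in the model output."""
    return [
        (organism.strip(), chemical.strip())
        for organism, chemical in RELATION_PATTERN.findall(output)
    ]


sample_output = (
    "G. corticostylifera produces geodiamolide A. "
    "G. corticostylifera produces geodiamolide B."
)
pairs = parse_relations(sample_output)
print(pairs)
```

Text that does not match the pattern is silently skipped, so comparing the number of parsed pairs with a manual read of the raw output is a reasonable sanity check.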

### But how good is the LLM at doing the opposite, i.e. generating synthetic abstracts from labels?

In [None]:
# prompt = "Instructions: Given a title, a list of keywords and main findings, create an abstract for a scientific article.\nTitle: Presence of free myxol and 4-hydroxymyxol and absence of myxol glycosides in Anabaena variabilis ATCC 29413, and proposal of a biosynthetic pathway of carotenoids.\nKeywords: free forms, biosynthesis pathway, biosynthetic pathway, glycosyl transferase, gene, configurations, molecular structures, configuration mixture, biosynthesis.\nMain findings: Echinenone, Canthaxanthin and Myxol were isolated from Trichormus variabilis. Echinenone, Myxol and Canthaxanthin were isolated from Anabaena variabilis.\nAbstract:"
prompt = "Instructions: Given a title, a list of keywords and main findings, create an abstract for a scientific article.\nTitle: Anti-inflammatory compounds from Sideritis javalambrensis n-hexane extract.\nKeywords: carrageenan mouse paw edema test, anti-inflammatory activities, active principles, purified fractions, esters of tyrosol with fatty acids.\nMain findings: Tyrosyl stearate, OH-Manoyl oxide, Tyrosyl palmitate, [(2S)-2-amino-3-(4-hydroxyphenyl)propanoyl] docosanoate and [(2S)-2-amino-3-(4-hydroxyphenyl)propanoyl] tetracosanoate were isolated from Sideritis javalambrensis. OH-Manoyl oxide, [(2S)-2-amino-3-(4-hydroxyphenyl)propanoyl] docosanoate, [(2S)-2-amino-3-(4-hydroxyphenyl)propanoyl] tetracosanoate, Tyrosyl stearate and Tyrosyl palmitate were isolated from Sideritis pungens. Tyrosyl stearate, [(2S)-2-amino-3-(4-hydroxyphenyl)propanoyl] docosanoate, OH-Manoyl oxide, [(2S)-2-amino-3-(4-hydroxyphenyl)propanoyl] tetracosanoate and Tyrosyl palmitate were isolated from Sideritis linearifolia.\nAbstract:"

In [None]:
print(f"What we send to the model: \n{prompt}")

response = client.chat.completions.create(
    model="vicuna:13b",
    messages=[
            {"role": "system", "content": "You are an expert in synthetic abstract generation."},
            {"role": "user", "content": prompt},
        ],
    )
output = response.choices[0].message.content

print(f"What we received: \n {output}")
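
To generate many synthetic abstracts, the prompt has to be built from structured labels rather than typed by hand. The sketch below reproduces the Title / Keywords / Main findings layout used above. `build_generation_prompt` and `join_names` are hypothetical helpers, each organism is assumed to have at least one chemical, and the example values are a subset of those in the prompt above.

```python
from typing import Dict, List


def join_names(names: List[str]) -> str:
    """Join names as 'A, B and C' (assumes a non-empty list)."""
    if len(names) == 1:
        return names[0]
    return ", ".join(names[:-1]) + " and " + names[-1]


def build_generation_prompt(title: str,
                            keywords: List[str],
                            findings: Dict[str, List[str]]) -> str:
    """Build an abstract-generation prompt from a title, keywords and relations.

    `findings` maps each organism to the chemicals isolated from it.
    """
    sentences = []
    for organism, chemicals in findings.items():
        verb = "was" if len(chemicals) == 1 else "were"
        sentences.append(f"{join_names(chemicals)} {verb} isolated from {organism}.")
    return (
        "Instructions: Given a title, a list of keywords and main findings, "
        "create an abstract for a scientific article.\n"
        f"Title: {title}\n"
        f"Keywords: {', '.join(keywords)}.\n"
        f"Main findings: {' '.join(sentences)}\n"
        "Abstract:"
    )


generation_prompt = build_generation_prompt(
    "Anti-inflammatory compounds from Sideritis javalambrensis n-hexane extract.",
    ["anti-inflammatory activities", "active principles"],
    {"Sideritis javalambrensis": ["Tyrosyl stearate", "OH-Manoyl oxide", "Tyrosyl palmitate"]},
)
print(generation_prompt)
```

Since the relations are known before the abstract is generated, each synthetic abstract comes with its labels for free, which is what makes this direction attractive for building training data.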

## Can we fine-tune a model on synthetic abstracts? Does it work well? (Yes!)


---

[Here](https://zenodo.org/records/8422294/files/synthetic-datasets-RE-NP-vicuna-13B-v1.3.tar.gz) you can find a synthetic dataset of abstracts generated with vicuna:13b.


In [None]:
!pip install transformers
!pip install peft
!pip install sacremoses



In [None]:
import gc
import torch

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

In [None]:
# Make sure enough GPU memory is free: check whether the Ollama server is still running
! pgrep ollama

In [None]:
# ! kill xx  # replace xx with the PID printed by pgrep above to free the GPU

In [None]:
# model and adapters path
model_hf = "microsoft/BioGPT-Large"
lora_adapters = "mdelmas/BioGPT-Large-Natural-Products-RE-Diversity-synt-v1.0" # You can also try: mdelmas/BioGPT-Large-Natural-Products-RE-Extended-synt-v1.0

# Load model and plug adapters using peft
model = AutoModelForCausalLM.from_pretrained(model_hf, device_map={"":0})
model = PeftModel.from_pretrained(model, lora_adapters)
model = model.merge_and_unload()
tokenizer = AutoTokenizer.from_pretrained(model_hf)

In [None]:
def inference(text):
  """Generate relations for `text` with the merged BioGPT model and print one per line."""
  device = torch.device("cuda")

  # Decoding arguments
  EVAL_GENERATION_ARGS = {"max_length": 1024,
                          "do_sample": False,
                          "forced_eos_token_id": tokenizer.eos_token_id,
                          "num_beams": 3,
                          "early_stopping": True,
                          "length_penalty": 1.5,
                          # "temperature": 0
                          }

  # Prepare the input
  input_text = text + tokenizer.eos_token + tokenizer.bos_token

  # Tokenize
  input_tokens = tokenizer(input_text, return_tensors='pt').to(device)

  # Generate
  with torch.no_grad():
    beam_output = model.generate(**input_tokens, **EVAL_GENERATION_ARGS)
  output = tokenizer.decode(beam_output[0][len(input_tokens["input_ids"][0]):], skip_special_tokens=True)

  # Parse and print
  rels = output.strip().split("; ")
  for rel in rels:
    print("- " + rel)
  torch.cuda.empty_cache()
  gc.collect()
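
`inference` only prints the relations. To reuse them, for example to compare against annotations, the decoded string can be split into a list instead. The sketch below keeps the same `"; "` separator as the function above and drops repeated entries; `split_relations` is a hypothetical helper, and it makes no assumption about the internal format of each relation.

```python
from typing import List


def split_relations(decoded: str, separator: str = "; ") -> List[str]:
    """Split a decoded generation into unique relations, keeping first-seen order."""
    seen = set()
    relations = []
    for chunk in decoded.strip().split(separator):
        relation = chunk.strip()
        if relation and relation not in seen:
            seen.add(relation)
            relations.append(relation)
    return relations


# Placeholder strings stand in for real model output.
unique_relations = split_relations("relation 1; relation 2; relation 1")
print(unique_relations)
```

Returning this list from `inference` instead of printing it would be a small change to the function above.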

In [None]:
title_text = "Producers and important dietary sources of ochratoxin A and citrinin."
abstract_text = "Ochratoxin A (OTA) is a very important mycotoxin, and its research is focused right now on the new findings of OTA, like being a complete carcinogen, information about OTA producers and new exposure sources of OTA. Citrinin (CIT) is another important mycotoxin, too, and its research turns towards nephrotoxicity. Both additive and synergistic effects have been described in combination with OTA. OTA is produced in foodstuffs by Aspergillus Section Circumdati (Aspergillus ochraceus, A. westerdijkiae, A. steynii) and Aspergillus Section Nigri (Aspergillus carbonarius, A. foetidus, A. lacticoffeatus, A. niger, A. sclerotioniger, A. tubingensis), mostly in subtropical and tropical areas. OTA is produced in foodstuffs by Penicillium verrucosum and P. nordicum, notably in temperate and colder zones. CIT is produced in foodstuffs by Monascus species (Monascus purpureus, M. ruber) and Penicillium species (Penicillium citrinum, P. expansum, P. radicicola, P. verrucosum). OTA was frequently found in foodstuffs of both plant origin (e.g., cereal products, coffee, vegetable, liquorice, raisins, wine) and animal origin (e.g., pork/poultry). CIT was also found in foodstuffs of vegetable origin (e.g., cereals, pomaceous fruits, black olive, roasted nuts, spices), food supplements based on rice fermented with red microfungi Monascus purpureus and in foodstuffs of animal origin (e.g., cheese)."
text = title_text + " " + abstract_text
inference(text)

#### Example 2 [PMID 12193025](https://pubmed.ncbi.nlm.nih.gov/12193025/)

**Title: 5-hydroxytryptamine-derived alkaloids from two marine sponges of the genus Hyrtios**

**Abstract:**
Indonesian specimens of the marine sponges <font color='blue'>Hyrtios erectus</font> and <font color='blue'>H. reticulatus</font> were found to contain 5-hydroxytryptamine-derived alkaloids. Their structures were determined on the basis of their spectral properties. <font color='blue'>H. erectus</font> contained <font color='green'>hyrtiosulawesine</font> (4), a new beta-carboline alkaloid, together with the already known alkaloids <font color='green'>5-hydroxyindole-3-carbaldehyde</font> (1), <font color='green'>hyrtiosin B</font> (2), and <font color='green'>5-hydroxy-3-(2-hydroxyethyl)indole</font> (3). <font color='blue'>H. reticulatus</font> contained the novel derivative <font color='green'>1,6-dihydroxy-1,2,3,4-tetrahydro-beta-carboline</font> (11) together with <font color='green'>serotonin</font> (5), <font color='green'>6-hydroxy-1-methyl-1,2,3,4-tetrahydro-beta-carboline</font> (7), and <font color='green'>6-hydroxy-3,4-dihydro-1-oxo-beta-carboline</font> (9).

In [None]:
title_text = "5-hydroxytryptamine-derived alkaloids from two marine sponges of the genus Hyrtios"
abstract_text = "Indonesian specimens of the marine sponges Hyrtios erectus and H. reticulatus were found to contain 5-hydroxytryptamine-derived alkaloids. Their structures were determined on the basis of their spectral properties. H. erectus contained hyrtiosulawesine (4), a new beta-carboline alkaloid, together with the already known alkaloids 5-hydroxyindole-3-carbaldehyde (1), hyrtiosin B (2), and 5-hydroxy-3-(2-hydroxyethyl)indole (3). H. reticulatus contained the novel derivative 1,6-dihydroxy-1,2,3,4-tetrahydro-beta-carboline (11) together with serotonin (5), 6-hydroxy-1-methyl-1,2,3,4-tetrahydro-beta-carboline (7), and 6-hydroxy-3,4-dihydro-1-oxo-beta-carboline (9)."
text = title_text + " " + abstract_text
inference(text)
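
Reading the printed relations gives a first impression, but a number is easier to compare across models. The sketch below computes precision, recall and F1 over sets of (organism, chemical) pairs with exact string matching, which is a strict assumption since name variants count as errors. The gold pairs are a subset of those stated in the Example 2 abstract; the predicted pairs are hypothetical and were not produced by any model.

```python
from typing import Dict, Set, Tuple

Relation = Tuple[str, str]


def score_relations(predicted: Set[Relation], gold: Set[Relation]) -> Dict[str, float]:
    """Exact-match precision, recall and F1 between two sets of relations."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Subset of the relations stated in the Example 2 abstract.
gold = {
    ("H. erectus", "hyrtiosulawesine"),
    ("H. erectus", "hyrtiosin B"),
    ("H. reticulatus", "serotonin"),
}
# Hypothetical predictions, for illustration only.
predicted = {
    ("H. erectus", "hyrtiosulawesine"),
    ("H. reticulatus", "serotonin"),
    ("H. reticulatus", "hyrtiosin B"),  # wrong organism for this chemical
}
scores = score_relations(predicted, gold)
print(scores)
```

In practice, organism and chemical names would probably need normalization (case, abbreviations such as "H. erectus" versus "Hyrtios erectus") before exact matching gives a fair score.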