# Extracting information from papers

This notebook illustrates some examples of working with text data using small, local language models.

## Running this notebook on a newer MacBook with Apple Silicon Chip

You will need an environment with Python and Jupyter installed. To create an environment with Anaconda for Python 3.12, execute: 

```
conda create --name llm-narrative python=3.12
conda activate llm-narrative
conda install jupyter
jupyter notebook
```

## Running this notebook on older MacBooks or any other machine

Please run this notebook on [Google Colab](https://colab.research.google.com/). After opening the notebook there, change the runtime settings to use a GPU; see [these instructions](https://www.geeksforgeeks.org/how-to-use-gpu-in-google-colab/) for how to do that.



### Install required libraries

For newer MacBooks with Apple chips, we use `mlx-lm` to load a small, quantized version of the Llama 3 8B Instruct model so that it can run on a single laptop (https://ollama.com/library/llama3). For older MacBooks and other machines, we use a quantized version of the model provided by the Hugging Face community (https://huggingface.co/astronomer/Llama-3-8B-Instruct-GPTQ-4-Bit).

Depending on the machine, different packages are required and will be installed below.

In [1]:
import platform
import requests

# for the newer MacBooks with the Apple Chip
# changed for testing but change back later
if platform.processor() == 'arm':
    ! pip install mlx-lm torch transformers
# for all other machines
else:
    ! pip install torch transformers optimum accelerate auto-gptq bitsandbytes
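The check above relies on `platform.processor()` returning `'arm'`, which is what Apple Silicon Macs report. As a hedged alternative, the sketch below looks at `platform.system()` and `platform.machine()` instead, which also separates Macs from other ARM machines. The helper name `is_apple_silicon` is made up for illustration and is not part of the notebook.

```python
import platform

def is_apple_silicon(system=None, machine=None):
    """Return True on macOS running natively on an ARM (Apple Silicon) chip.

    Hypothetical helper: the arguments default to the values reported by the
    current machine and exist only so the logic can be tried with other values.
    """
    system = platform.system() if system is None else system
    machine = platform.machine() if machine is None else machine
    return system == "Darwin" and machine == "arm64"

# Prints True or False depending on the machine running the notebook
print(is_apple_silicon())
```

Note that a Python interpreter running under Rosetta reports an Intel architecture, so this sketch would send it down the non-MLX branch.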



### Load Llama 3 8B
Next we download and load the quantized version of the Llama 3 8B language model.



In [2]:
if platform.processor() == 'arm':
    from mlx_lm import load, generate
    model, tokenizer = load("mlx-community/Meta-Llama-3-8B-Instruct-4bit")
else:
    from transformers import AutoTokenizer, AutoModelForCausalLM, AutoConfig
    import torch

    MODEL_ID = "astronomer/Llama-3-8B-Instruct-GPTQ-4-Bit"
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

    config = AutoConfig.from_pretrained(MODEL_ID)
    config.quantization_config["disable_exllama"] = False
    config.quantization_config["exllama_config"] = {"version":2}

    model = AutoModelForCausalLM.from_pretrained(
            MODEL_ID, 
            device_map='auto', 
            torch_dtype=torch.bfloat16, 
            trust_remote_code=True, 
            # low_cpu_mem_usage=True,
            # load_in_4bit=True,
            config=config,
        )


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
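To see why a 4-bit quantized model fits on a laptop, a rough back-of-the-envelope calculation helps. The sketch below assumes that "8B" means roughly 8 billion parameters and counts only the weights; quantization scales, activations and the key-value cache need additional memory. The function name `weight_memory_gb` is ours.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Approximate memory for the weights alone, in gigabytes (10**9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

N_PARAMS = 8e9  # assumption: "8B" taken as roughly 8 billion parameters

# Compare 16-bit weights with 4-bit quantized weights
for bits in (16, 4):
    print(f"{bits}-bit weights: about {weight_memory_gb(N_PARAMS, bits):.0f} GB")
```

Under these assumptions, quantizing from 16 to 4 bits cuts the weight memory to a quarter.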


### Running the model with an example prompt

We show that the model can run with an example prompt. First we define the system prompt, which tells the model what character to adopt. Then we give it an instruction to introduce itself. Again, depending on the machine and therefore model used, we use slightly different functions to generate output.

In [35]:
from transformers import pipeline
from IPython.display import display

SYSTEM_MSG = "You are a helpful chatbot assistant."

def generateFromPrompt(promptStr, maxTokens=100):
    # Both branches send the same system and user messages
    messages = [
        {"role": "system", "content": SYSTEM_MSG},
        {"role": "user", "content": promptStr},
    ]
    if platform.processor() == 'arm':
        # mlx-lm expects a prompt string, so render the chat template to text first
        input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
        prompt = tokenizer.decode(input_ids)
        response = generate(model, tokenizer, prompt=prompt, max_tokens=maxTokens)
    else:
        pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, max_new_tokens=maxTokens)
        result = pipe(messages)
        # The pipeline returns the whole conversation; the last message is the model's reply
        response = result[0]['generated_text'][-1]['content']
    return response


response = generateFromPrompt("Please introduce yourself")

print(response+"...")

"Hello! My name is Ada, and I'm a helpful chatbot assistant. I'm here to assist you with any questions or tasks you may have. I'm a large language model, trained on a vast amount of text data, which enables me to understand and respond to natural language inputs.\n\nI'm designed to be friendly, approachable, and knowledgeable. I can help you with a wide range of topics, from general knowledge and entertainment to more specific areas like science, technology, and health...."

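To make the role of the system and user messages more concrete, here is a minimal sketch of how a chat template turns a list of messages into one prompt string. The function `render_llama3_chat` is hypothetical, and the special-token strings are our assumption about the Llama 3 Instruct format; the authoritative template ships with the tokenizer and is what `tokenizer.apply_chat_template` uses.

```python
def render_llama3_chat(messages, add_generation_prompt=True):
    """Render chat messages roughly the way a Llama 3 Instruct template does.

    Illustrative only: the real template is stored with the tokenizer.
    """
    text = "<|begin_of_text|>"
    for message in messages:
        # Each message is wrapped in a role header and closed by an end-of-turn token
        text += f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
        text += f"{message['content'].strip()}<|eot_id|>"
    if add_generation_prompt:
        # An open assistant header asks the model to write the next turn
        text += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return text

messages = [
    {"role": "system", "content": "You are a helpful chatbot assistant."},
    {"role": "user", "content": "Please introduce yourself"},
]
print(render_llama3_chat(messages))
```

The system message comes first, so it frames everything the model reads afterwards, and the open assistant header at the end is where generation starts.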
### Searching for papers with OpenAlex

Now we need the following functions to search the internet for papers. We use the OpenAlex API for this.

In [None]:
# Rebuild the abstract text from the inverted index returned by OpenAlex,
# which maps each word to the list of positions where it occurs
def reconstruct_text(inverted_index):
    word_index = []
    for word, positions in inverted_index.items():
        for position in positions:
            word_index.append((word, position))

    # Sort by position to restore the original word order
    word_index.sort(key=lambda x: x[1])

    return ' '.join(word for word, _ in word_index)

# Search OpenAlex for works matching a phrase and return them with their abstracts
def search_openalex(search_phrase, result_count=10, min_year='2013'):
    base_url = "https://api.openalex.org/works"  # OpenAlex works endpoint

    # Only return works with an abstract and full text, published since min_year
    filters = [
        "has_abstract:true",
        "has_fulltext:true",
        f"from_publication_date:{min_year}-01-01"
    ]

    # Construct the query parameters
    params = {
        "search": search_phrase,
        "filter": ",".join(filters),
        "per_page": result_count,  # Limit the number of results
    }

    r = requests.get(base_url, params=params)
    r.raise_for_status()  # Stop early if the request failed
    res_json = r.json()

    # Rebuild a readable abstract for every returned work
    abstract_list = []
    for work in res_json["results"]:
        abstract_list.append(reconstruct_text(work['abstract_inverted_index']))

    return res_json["results"], abstract_list
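OpenAlex does not return abstracts as plain text but as an inverted index, that is, a dictionary mapping each word to the positions where it occurs. The sketch below shows the reconstruction on a tiny, made-up index. The helper `rebuild_from_inverted_index` is an illustrative stand-in that follows the same idea as `reconstruct_text`.

```python
def rebuild_from_inverted_index(inverted_index):
    """Turn {word: [positions]} back into a sentence ordered by position."""
    positioned_words = [
        (position, word)
        for word, positions in inverted_index.items()
        for position in positions
    ]
    # Sorting the (position, word) pairs restores the original word order
    return " ".join(word for _, word in sorted(positioned_words))

# Made-up example: "the" occurs twice, at positions 0 and 3
example_index = {"drug": [1], "the": [0, 3], "placebo": [4], "beat": [2]}
print(rebuild_from_inverted_index(example_index))
```

A word that occurs several times is stored once with several positions, which is why the positions have to be flattened before sorting.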

### Let's try getting papers for our search now!

In [None]:
# enter your search phrase
search_phrase = 'depression randomized control trial'
# enter how many abstracts you want to receive
number_of_abstracts = 10
# collect the abstracts
res_, abstract = search_openalex(search_phrase, number_of_abstracts)
# show the title of the first result
print(res_[0]['title'])
# show the abstract of the first result
print(abstract[0])
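It can help to look at the request that such a search sends. The sketch below builds the query string with the standard library only and does not contact the API. The function name `build_openalex_url` is hypothetical, and the exact character encoding produced by `requests` may differ slightly from what `urlencode` produces here.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def build_openalex_url(search_phrase, result_count=10, min_year="2013"):
    """Assemble a works query URL like the one search_openalex sends (illustrative)."""
    filters = [
        "has_abstract:true",
        "has_fulltext:true",
        f"from_publication_date:{min_year}-01-01",
    ]
    params = {
        "search": search_phrase,
        "filter": ",".join(filters),  # several filters are joined with commas
        "per_page": result_count,
    }
    return "https://api.openalex.org/works?" + urlencode(params)

url = build_openalex_url("depression randomized control trial", 10)
print(url)

# Decode the query string again to inspect the individual parameters
query = parse_qs(urlsplit(url).query)
print(query["filter"][0].split(","))
```

Pasting the printed URL into a browser is a quick way to inspect the raw JSON response before parsing it in code.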

### Now, we want to analyse them

In [None]:
# Functions to extract knowledge graphs from a paper or abstract given as input
# 
# Written 2024 by Joshua Sammet, Chair of medical Knowledge and Decision, University of St. Gallen
#

import torch
import transformers
import argparse
import logging
import json
import os

# Needed for OpenAlex
import requests

#logger = get_logger(__name__, log_level="INFO")

def parse_args():
    parser = argparse.ArgumentParser(description="Extract PICO information from paper abstracts with a language model.")
    parser.add_argument(
        "--pretrained_model_name_or_path",
        type=str,
        default=None,
        required=True,
        help="Path to pretrained model or model identifier from huggingface.co/models.",
    )
    # Return the parsed command-line arguments to the caller
    return parser.parse_args()
    

#--------------Prompts--------------
PICO_message_abstract="""You are an expert agent specialized in extracting PICO elements from abstracts of scientific publications.
The PICO elements are population, intervention, comparison and outcome. Your task is to identify the entities and
relations requested from an abstract of a scientific paper that is given to you in a prompt.
You must generate the output in a JSON containing a list with JSON objects having the following keys:
"head", "head_type", "relation", "tail", and "tail_type".
The "head" key must contain the text of the extracted entity from the provided user prompt,
the "head_type" key must contain the type of the extracted head entity which must be one of the PICO elements, the "relation" key must contain the type of relation
between the "head" and the "tail", the "tail" key must represent the text of an extracted entity which is the tail
of the relation, and the "tail_type" key must contain the type of the tail entity. Attempt to extract around 10 entities and relations.
"""
PICO_bulletpoints_abstract="""You are an expert agent specialized in extracting PICO elements from abstracts of scientific publications.
The PICO elements are population, intervention, comparison and outcome. Your task is to identify and extract the content from an abstract of a scientific paper that is given to you in a prompt.
You must generate the output in the form of a list of bullet points. The content of each bullet point should summarize an aspect of the abstract with regard to at least one PICO element. Please mention the relevant PICO elements at the beginning of each bullet point.
Attempt to extract all relevant content from the abstract with around 10 bullet points.
"""
#Answer with 'I am ready' if you understood."""

#the "head_type" key must contain the type of the extracted head entity which must be a topic from the field of clinical trials in medicine or one of the PICO elements

PICO_message_papers="""You are an expert agent specialized in extracting PICO elements from scientific publications.
The PICO elements are population, intervention, comparison and outcome. They are used to systematically review clinical literature.
Your task is to identify the entities and relations requested from a scientific paper that is given to you in a user prompt.
You must generate the output in a JSON containing a list with JSON objects having the following keys:
"head", "head_type", "relation", "tail", and "tail_type".
The "head" key must contain the text of the extracted entity from the provided user prompt,
the "head_type" key must contain the type of the extracted head entity which must be a topic from the field of
clinical trials in medicine or one of the PICO elements, the "relation" key must contain the type of relation
between the "head" and the "tail", the "tail" key must represent the text of an extracted entity which is the tail
of the relation, and the "tail_type" key must contain the type of the tail entity. Attempt to extract as
many entities and relations as you can.
Answer with 'I am ready' if you understood."""
# Placeholder prompts (currently empty) for user-defined entities and/or relations
entity_message = ""
entity_and_relations_message = ""
relations_message = ""

"""
Function to find publications of interest and analyse them
INPUT:
model_name - name of the model that should be used
search_phrase - topic that should be searched for (e.g. 'depression treatment random controlled trial')
number_of_abstracts - how many papers should be checked
entities - define the entities in the knowledge graph
relations - define the relations in the knowledge graph
prompt - specify which prompt should be used to instruct the model

OUTPUT:
prints the extracted information for each paper and returns None
"""
def find_and_analyse(model_name, search_phrase, number_of_abstracts=10, entities=None, relations=None, prompt=None):
    # Get papers from OpenAlex
    results, abstracts = search_openalex(search_phrase, number_of_abstracts)
    
    # Define system prompt message: an explicit prompt wins, otherwise pick
    # a default based on whether entities and/or relations were given
    if prompt is None:
        if entities is None:
            if relations is None:
                system_message = PICO_bulletpoints_abstract #PICO_message_abstract
            else:
                system_message = relations_message
        else:
            if relations is None:
                system_message = entity_message
            else:
                system_message = entity_and_relations_message
    else:
        system_message = prompt
    
    # Setup LLM (requires a CUDA GPU and the flash-attn package)
    model = transformers.AutoModelForCausalLM.from_pretrained(model_name, torch_dtype='auto', device_map='cuda', attn_implementation="flash_attention_2")
    tokenizer = transformers.AutoTokenizer.from_pretrained(
        model_name
        # , use_fast=False, # only use for Orca because of bug
    )
    
    for i in range(len(abstracts)):
        print(f"For paper number {i+1}, titled {results[i]['title']}, the knowledge graph of the abstract gives the following information:\n")
        # Build a ChatML-style prompt from the system message and the abstract
        prompt = f"<|im_start|>system\n{system_message}<|im_end|>\n<|im_start|>user\n Extract the knowledge graph from the following abstract:\n {abstracts[i]}<|im_end|>\n<|im_start|> assistant\n "

        inputs = tokenizer(prompt, return_tensors='pt', return_attention_mask=False).to("cuda")
        output_ids = model.generate(inputs["input_ids"],max_new_tokens=250)
        answer = tokenizer.batch_decode(output_ids)[0]
        # Keep only the text generated after the assistant tag
        cut_answer = answer.split("<|im_start|> assistant\n",1)[1]
        #json_answer = json.loads(cut_answer.split("</s>",1)[0])
        print(cut_answer + '\n')

    return None
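The prompt handling inside `find_and_analyse` can be looked at in isolation: a system message and a user message are wrapped in ChatML-style tags, and after generation everything up to the assistant tag is cut away. The sketch below reproduces that idea with plain strings and no model. The helper names, the example abstract and the faked model output are all made up, and the marker is written without the extra spaces used in the cell above. Whether a given model was trained on this tag format is a separate question that the sketch does not address.

```python
ASSISTANT_MARKER = "<|im_start|>assistant\n"

def build_chatml_prompt(system_message, user_message):
    """Wrap a system and a user message in ChatML-style tags (illustrative)."""
    return (
        f"<|im_start|>system\n{system_message}<|im_end|>\n"
        f"<|im_start|>user\n{user_message}<|im_end|>\n"
        f"{ASSISTANT_MARKER}"
    )

def cut_answer(decoded_text):
    """Keep only the text after the last assistant marker, up to an end tag."""
    answer = decoded_text.rsplit(ASSISTANT_MARKER, 1)[-1]
    return answer.split("<|im_end|>", 1)[0].strip()

prompt = build_chatml_prompt(
    "You extract PICO elements.",
    "Abstract: Adults with depression received therapy or usual care.",
)
# Stand-in for a decoded model output: the prompt followed by a generated answer
fake_output = prompt + "- Population: adults with depression<|im_end|>"
print(cut_answer(fake_output))
```

Splitting on the last marker rather than the first keeps the sketch robust if the abstract itself happens to contain the marker text.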

In [None]:
# analyse the papers
find_and_analyse("meta-llama/Meta-Llama-3-8B", search_phrase, number_of_abstracts)
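The `PICO_message_abstract` prompt asks the model for a JSON list of objects with the keys "head", "head_type", "relation", "tail" and "tail_type". Language models often wrap such output in extra text or drop keys, so the answer should be checked before it is used. The sketch below is one possible way to do that; the function `parse_triples` and the sample output are made up for illustration.

```python
import json

REQUIRED_KEYS = {"head", "head_type", "relation", "tail", "tail_type"}

def parse_triples(model_output):
    """Extract the JSON list from a model answer and keep only complete objects."""
    start = model_output.find("[")
    end = model_output.rfind("]")
    if start == -1 or end <= start:
        return []  # no JSON list found
    try:
        items = json.loads(model_output[start:end + 1])
    except json.JSONDecodeError:
        return []  # the bracketed text was not valid JSON
    # Keep only dictionaries that contain all five requested keys
    return [item for item in items
            if isinstance(item, dict) and REQUIRED_KEYS.issubset(item)]

# Made-up model answer with surrounding text and one incomplete object
sample_output = (
    'Here is the result:\n'
    '[{"head": "adults with depression", "head_type": "population", '
    '"relation": "receives", "tail": "cognitive behavioural therapy", '
    '"tail_type": "intervention"}, {"head": "incomplete"}]\nDone.'
)
for triple in parse_triples(sample_output):
    print(triple["head"], "->", triple["relation"], "->", triple["tail"])
```

Returning an empty list on failure keeps a loop over many abstracts running even when a single answer cannot be parsed.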