# Extracting information from papers

This notebook illustrates some examples of working with text data using small, local language models.

## Running this notebook on a newer MacBook with an Apple Silicon chip

You will need an environment with Python and Jupyter installed. To create an environment with Anaconda for Python 3.12, execute: 

```
conda create --name llm-narrative python=3.12
conda activate llm-narrative
conda install jupyter
jupyter notebook
```

## Running this notebook on older MacBooks or any other machine

Please run this notebook on [Google Colab](https://colab.research.google.com/). After opening the notebook there, change the runtime settings to use a GPU; see [these instructions](https://www.geeksforgeeks.org/how-to-use-gpu-in-google-colab/) for how to do that.



### Install required libraries

For the newer MacBooks with Apple chips we use `mlx-lm` to load a small, 4-bit quantized instruction-tuned model, so that it can run on a single laptop. The loading cell further down currently uses `mlx-community/gemma-3n-E4B-it-lm-4bit`; the Llama 3 8B Instruct model (https://ollama.com/library/llama3) is kept there as a commented-out alternative. For older MacBooks and other machines we use a quantized version of Llama 3 8B Instruct provided by the Hugging Face community (https://huggingface.co/astronomer/Llama-3-8B-Instruct-GPTQ-4-Bit).

Depending on the machine, different packages are required and will be installed below.

In [2]:
import platform
import requests

In [None]:
# newer MacBooks with an Apple chip (platform.processor() reports 'arm' there)
if platform.processor() == 'arm':
    ! pip install mlx-lm torch transformers
# for all other machines
else:
    ! pip install torch transformers optimum accelerate auto-gptq bitsandbytes
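
The branch above keys off `platform.processor()`. As a minimal sketch, a slightly stricter check can also confirm that the machine is a Mac. The helper name `is_apple_silicon` is hypothetical and not part of the notebook, and the expected values `Darwin` and `arm64` are an assumption about what native Python reports on Apple Silicon:

```python
import platform

def is_apple_silicon(system=None, machine=None):
    """Return True when running natively on an Apple Silicon Mac.

    `system` and `machine` default to the values reported by the
    `platform` module; they are parameters only so that the logic
    can be exercised on any machine.
    """
    system = platform.system() if system is None else system
    machine = platform.machine() if machine is None else machine
    # Assumption: native Python on Apple Silicon reports Darwin / arm64.
    return system == "Darwin" and machine == "arm64"

print("Apple Silicon detected:", is_apple_silicon())
```

The same helper could replace the `platform.processor() == 'arm'` comparisons used throughout the notebook if the simpler check ever misfires.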



### Install Model
Next we download and load the quantized version of the selected language model.

In [3]:
if platform.processor() == 'arm':
    from mlx_lm import load, generate
    # alternative: load("mlx-community/Meta-Llama-3-8B-Instruct-4bit")
    model, tokenizer = load("mlx-community/gemma-3n-E4B-it-lm-4bit")
else:
    from transformers import AutoTokenizer, AutoModelForCausalLM, AutoConfig
    import torch

    MODEL_ID = "astronomer/Llama-3-8B-Instruct-GPTQ-4-Bit"
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

    config = AutoConfig.from_pretrained(MODEL_ID)
    config.quantization_config["disable_exllama"] = False
    config.quantization_config["exllama_config"] = {"version":2}

    model = AutoModelForCausalLM.from_pretrained(
            MODEL_ID, 
            device_map='auto', 
            torch_dtype=torch.bfloat16, 
            trust_remote_code=True, 
            config=config,
        )


### Running the model with an example prompt

We show that the model can run with an example prompt. First we define the system prompt, which tells the model what character to adopt. Then we give it an instruction to introduce itself. Again, depending on the machine and therefore model used, we use slightly different functions to generate output.

In [4]:
from transformers import pipeline
from IPython.display import display

SYSTEM_MSG = "You are a helpful assistant for processing scientific literature concisely."

def generateFromPrompt(promptStr, maxTokens=20):
    if platform.processor() == 'arm':
        # Gemma-style chat turns; the system message is passed as a leading "model" turn
        messages = [{"role": "model", "content": SYSTEM_MSG},
                    {"role": "user", "content": promptStr}]
        # each turn is rendered as <start_of_turn>role, newline, content, <end_of_turn>,
        # followed by an open model turn for the reply
        tokenizer.chat_template = (
            "{% for message in messages %}"
            "<start_of_turn>{{ message['role'] }}\n{{ message['content'] }}<end_of_turn>\n"
            "{% endfor %}"
            "<start_of_turn>model\n"
        )
        input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
        prompt = tokenizer.decode(input_ids)
        response = generate(model, tokenizer, prompt=prompt, max_tokens=maxTokens)
    else:
        # include the system message here as well, so both branches use it
        messages = [{"role": "system", "content": SYSTEM_MSG},
                    {"role": "user", "content": promptStr}]
        pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, max_new_tokens=maxTokens)
        result = pipe(messages)
        # the last message in the returned conversation is the model's reply
        response = result[0]['generated_text'][-1]['content']
    return response


response = generateFromPrompt("Please introduce yourself")

print(response+"...")



Hello! I am a large language model created by Google DeepMind. I'm an open...
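
To make the chat template in the cell above easier to follow, here is a minimal sketch in plain Python of what it is meant to produce: each message wrapped in turn markers, followed by an open `model` turn for the reply. The function name `format_turns` is hypothetical, and the exact layout (role, newline, content) is an assumption based on the markers used in the template, not a guaranteed match for the tokenizer's built-in format.

```python
def format_turns(messages):
    """Render chat messages as turn-delimited text for a Gemma-style model."""
    parts = []
    for message in messages:
        parts.append(
            f"<start_of_turn>{message['role']}\n{message['content']}<end_of_turn>\n"
        )
    # Leave a model turn open so that generation continues from here.
    parts.append("<start_of_turn>model\n")
    return "".join(parts)

example = format_turns([{"role": "user", "content": "Please introduce yourself"}])
print(example)
```

Printing the rendered prompt like this is a quick way to check that roles and content are not accidentally fused before the text is passed to the model.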


### Now we need the following functions to search for papers. We will use the OpenAlex API for this.

In [5]:
# rebuild the abstract text from OpenAlex's inverted index (word -> list of positions)
def reconstruct_text(inverted_index):
    if not inverted_index:
        return ''
    # collect (word, position) pairs and sort them by position
    word_index = []
    for word, positions in inverted_index.items():
        for position in positions:
            word_index.append((word, position))
    word_index.sort(key=lambda pair: pair[1])
    return ' '.join(word for word, _ in word_index)

# function that uses OpenAlex to search for papers
def search_openalex(search_phrase, result_count=10, min_year='2013'):
    base_url = "https://api.openalex.org/works"

    # only return works that have an abstract and full text, published since min_year
    filters = [
        "has_abstract:true",
        "has_fulltext:true",
        f"from_publication_date:{min_year}-01-01"
    ]

    # construct the query parameters
    params = {
        "search": search_phrase,
        "filter": ",".join(filters),
        "per_page": result_count,  # limit the number of results
    }

    r = requests.get(base_url, params=params)
    r.raise_for_status()  # stop early on an HTTP error instead of parsing an error response
    results = r.json()["results"]

    # rebuild each abstract from its inverted index
    abstract_list = [reconstruct_text(work.get('abstract_inverted_index')) for work in results]

    return results, abstract_list
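
OpenAlex returns each abstract as an inverted index, a mapping from every word to the positions where it occurs, which is why `reconstruct_text` is needed. A small worked example shows the idea; the index below is made up for illustration and `rebuild_abstract` is a hypothetical stand-in for the function above:

```python
def rebuild_abstract(inverted_index):
    """Rebuild plain text from a {word: [positions]} mapping."""
    if not inverted_index:
        return ""
    # Flatten to (position, word) pairs, then sort by position.
    pairs = [
        (position, word)
        for word, positions in inverted_index.items()
        for position in positions
    ]
    return " ".join(word for _, word in sorted(pairs))

# Made-up index for illustration only (not real OpenAlex data).
toy_index = {"is": [1], "Depression": [0], "and": [3], "common": [2], "recurrent": [4]}
print(rebuild_abstract(toy_index))  # Depression is common and recurrent
```

A word that occurs several times simply lists several positions, so it is emitted once per position.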

### Let's try getting papers for our search now!

In [6]:
# enter your search phrase
search_phrase = 'depression randomised controlled trial'
# enter how many abstracts you want to receive
number_of_abstracts = 10
# collect the abstracts
res_, abstract = search_openalex(search_phrase, number_of_abstracts)
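
For reference, the request that `search_openalex` sends can be previewed without contacting the API. The sketch below rebuilds the query string with the standard library. The helper name `preview_query` is hypothetical, the parameter names are copied from the function above rather than checked against the OpenAlex documentation, and the exact percent-encoding may differ slightly from what `requests` produces.

```python
from urllib.parse import urlencode

def preview_query(search_phrase, result_count=10, min_year="2013"):
    """Return the URL that a search with these arguments would request."""
    filters = [
        "has_abstract:true",
        "has_fulltext:true",
        f"from_publication_date:{min_year}-01-01",
    ]
    params = {
        "search": search_phrase,
        "filter": ",".join(filters),
        "per_page": result_count,
    }
    return "https://api.openalex.org/works?" + urlencode(params)

url = preview_query("depression randomised controlled trial", 10)
print(url)
```

Pasting the printed URL into a browser is a convenient way to inspect the raw JSON when a search returns unexpected results.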

#### Now let's try extracting some information from the abstracts...
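
The four PICO questions used in the next cell differ only in the element being asked for, so the prompts can be generated from one template. This is a minimal sketch: `build_pico_prompts` and `clean_answer` are hypothetical helpers, and treating an empty reply as missing is an assumption rather than the notebook's behaviour.

```python
PICO_ELEMENTS = ["population", "intervention", "comparator", "outcome"]

def build_pico_prompts(abstract_text):
    """Return one extraction prompt per PICO element for a single abstract."""
    return {
        element: (
            f"PICO: Please extract the {element}, if any, mentioned in the "
            f"following study abstract: {abstract_text}. Return only the concise "
            "answer or 'NONE' if not mentioned. Response: "
        )
        for element in PICO_ELEMENTS
    }

def clean_answer(raw_response):
    """Strip surrounding whitespace; map empty or 'NONE' replies to None."""
    answer = raw_response.strip()
    return None if answer.upper() in ("", "NONE") else answer

prompts = build_pico_prompts("A trial of exercise for depression")
print(prompts["population"])
print(clean_answer("\nNONE\n "))  # None
```

Each prompt could then be passed to `generateFromPrompt` in a loop, with `clean_answer` applied to the reply before printing.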

In [7]:
# iterate over the results actually returned, which may be fewer than requested
for i in range(len(res_)):
    print(f'===== {i} =====')
    print(res_[i]['title'])
    print(abstract[i])
    if abstract[i]:
        population = generateFromPrompt("PICO: Please extract the population, if any, mentioned in the following study abstract: "+abstract[i]+". Return only the concise answer or 'NONE' if not mentioned. Response: ")
        intervention = generateFromPrompt("PICO: Please extract the intervention, if any, mentioned in the following study abstract: "+abstract[i]+". Return only the concise answer or 'NONE' if not mentioned. Response: ")
        comparator = generateFromPrompt("PICO: Please extract the comparator, if any, mentioned in the following study abstract: "+abstract[i]+". Return only the concise answer or 'NONE' if not mentioned. Response: ")
        outcome = generateFromPrompt("PICO: Please extract the outcome, if any, mentioned in the following study abstract: "+abstract[i]+". Return only the concise answer or 'NONE' if not mentioned. Response: ")
        print("Population:",population,", intervention:",intervention,", comparator:",comparator,", outcome:",outcome)

===== 0 =====
Improving Adherence and Clinical Outcomes in Self-Guided Internet Treatment for Anxiety and Depression: Randomised Controlled Trial
Australian and New Zealand Clinical Trials Registry ACTRN12610001058066.
Population: 
NONE
 , intervention: 
NONE
 , comparator: 
NONE
 , outcome: 
NONE

===== 1 =====
Physical exercise and internet-based cognitive–behavioural therapy in the treatment of depression: Randomised controlled trial
Background Depression is common and tends to be recurrent. Alternative treatments are needed that are non-stigmatising, accessible and can be prescribed by general medical practitioners. Aims To compare the effectiveness of three interventions for depression: physical exercise, internet-based cognitive–behavioural therapy (ICBT) and treatment as usual (TAU). A secondary aim was to assess changes in self-rated work capacity. Method A total of 946 patients diagnosed with mild to moderate depression were recruited through primary healthcare centres across 