<a href="https://colab.research.google.com/github/marcelotournier/llm_notebook_playground/blob/main/Mistral7_pubmed_summarizer_gguf.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Summarizing PubMed Automated Article RSS Feeds about Type 2 Diabetes

This notebook demonstrates an LLM summarizing the newest papers on Type 2 Diabetes from a PubMed RSS feed.

You can create your own RSS feeds and test them with this code by following the guide at https://www.nlm.nih.gov/pubs/techbull/mj05/mj05_rss.html

Keep each RSS feed to at most 5 articles. The whole feed is sent to the model as a single prompt, so larger feeds can exceed the model's context length or run very slowly on the available GPU.
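As a rough way to judge whether a feed is likely to fit, the sketch below estimates a token count from the character count. The ratio of 4 characters per token is an assumption (a common rule of thumb for English text, not a measured property of this model), and `estimate_tokens` and `fits_context` are hypothetical helper names. The value `32000` mirrors the `n_ctx` setting used later in this notebook. Markup-heavy XML may tokenize less efficiently than prose, so treat the estimate as optimistic.

```python
CONTEXT_TOKENS = 32000  # mirrors the n_ctx value used when constructing the model
CHARS_PER_TOKEN = 4     # assumption: rough rule of thumb, not measured for this model


def estimate_tokens(text, chars_per_token=CHARS_PER_TOKEN):
    """Return a crude token estimate based on character count (rounded up)."""
    return -(-len(text) // chars_per_token)  # ceiling division


def fits_context(text, reserve_for_output=1024):
    """Check whether the text plus a reserved output budget fits in the context."""
    return estimate_tokens(text) + reserve_for_output <= CONTEXT_TOKENS


sample = "x" * 40_000
print(estimate_tokens(sample), fits_context(sample))
```

For an exact count, the text would have to be run through the model's own tokenizer once the model is loaded.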

### Config:
- Select the menu "Runtime", then "Change runtime type"
- In the option "Hardware accelerator" choose "T4 GPU"

### Using:
- `generate(prompt)` will give you the whole response at once (can take a while)
- `stream(prompt)` will print the response one token at a time, as in ChatGPT

In [7]:
# Install the llama-cpp-python bindings, built with CUDA (cuBLAS) GPU support.
# For a CPU build with OpenBLAS, use these CMAKE_ARGS instead:
# "-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS"
!CMAKE_ARGS="-DLLAMA_CUBLAS=ON" \
  pip install llama-cpp-python --quiet

In [8]:
# Config params - do not touch here unless you know what you are doing.

INPUT_MODEL_PATH = "https://huggingface.co/TheBloke/SciPhi-Mistral-7B-32k-GGUF/resolve/main/sciphi-mistral-7b-32k.Q4_K_M.gguf"
OUTPUT_MODEL_PATH = "downloaded_model.gguf"
GPU_LAYERS = 999
SYS_PROMPT = """You are a helpful, respectful and honest assistant.
Always answer as helpfully as possible, while being safe.
Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content.
Please ensure that your responses are socially unbiased and positive in nature.
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct.
If you don't know the answer to a question, please don't share false information.
Summarize the findings of this XML string containing scientific articles into a blog post article. Include the references whenever convenient.
{prompt}
"""

In [9]:
# Download the model
import os


os.system(f"curl -L {INPUT_MODEL_PATH} -o {OUTPUT_MODEL_PATH}")

0
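The model file is large, so re-running the download cell repeats a slow transfer. A minimal sketch of a guard that skips the download when the file already exists is shown below. `download_if_missing` is a hypothetical helper, not part of this notebook, and it uses the standard library's `urlretrieve` in place of `curl`; the `fetch` parameter exists only so the logic can be exercised without network access.

```python
import os
from urllib.request import urlretrieve


def download_if_missing(url, path, fetch=urlretrieve):
    """Download url to path unless a non-empty file is already there.

    Returns True if a download was performed, False if it was skipped.
    """
    if os.path.exists(path) and os.path.getsize(path) > 0:
        return False
    fetch(url, path)
    return True


# Usage in this notebook (left commented out to avoid triggering the download):
# download_if_missing(INPUT_MODEL_PATH, OUTPUT_MODEL_PATH)
```

Note that the size check only guards against an empty file; it does not detect a partially downloaded model.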

In [None]:
# Constructing model - do not touch here unless you know what you are doing.
from llama_cpp import Llama


llm = Llama(
      model_path=OUTPUT_MODEL_PATH,
      n_gpu_layers=GPU_LAYERS, # Offload all layers to the GPU; set to 0 for CPU only
      # seed=1337, # Uncomment to set a specific seed
      n_ctx=32000, # Context window size in tokens
)


In [12]:
# Testing model
print(llm(
      "Q: Name the planets in the solar system? A: ", # Prompt
      # max_tokens=32, # Generate up to 32 tokens, set to None to generate up to the end of the context window
      # stop=["Q:", "\n"], # Stop generating just before the model would generate a new question
      echo=True # Echo the prompt back in the output
))

Llama.generate: prefix-match hit

llama_print_timings:        load time =     360.45 ms
llama_print_timings:      sample time =       9.25 ms /    16 runs   (    0.58 ms per token,  1729.73 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_print_timings:        eval time =     516.84 ms /    16 runs   (   32.30 ms per token,    30.96 tokens per second)
llama_print_timings:       total time =     597.27 ms /    17 tokens


{'id': 'cmpl-4534da59-35c0-4001-b3d1-1b3c65c1fe50', 'object': 'text_completion', 'created': 1710251604, 'model': 'downloaded_model.gguf', 'choices': [{'text': 'Q: Name the planets in the solar system? A: 1. Mercury 2. Venus 3. Earth 4.', 'index': 0, 'logprobs': None, 'finish_reason': 'length'}], 'usage': {'prompt_tokens': 14, 'completion_tokens': 16, 'total_tokens': 30}}
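The response printed above is a plain dict, with the generated text under `choices[0]['text']`. Its `finish_reason` of `'length'` shows that generation stopped at the token limit (16 completion tokens here) rather than at a natural end, which is why the list of planets is cut off. The sketch below pulls out those two fields; `completion_text` and `was_truncated` are hypothetical helper names, and the sample dict is abbreviated from the output above.

```python
def completion_text(response):
    """Return the generated text from a completion response dict."""
    return response["choices"][0]["text"]


def was_truncated(response):
    """Return True if generation stopped because the token limit was reached."""
    return response["choices"][0]["finish_reason"] == "length"


# Abbreviated copy of the response shown above
sample_response = {
    "object": "text_completion",
    "choices": [{
        "text": "Q: Name the planets in the solar system? A: 1. Mercury 2. Venus 3. Earth 4.",
        "index": 0,
        "logprobs": None,
        "finish_reason": "length",
    }],
    "usage": {"prompt_tokens": 14, "completion_tokens": 16, "total_tokens": 30},
}

print(completion_text(sample_response))
if was_truncated(sample_response):
    print("[output was cut off at the token limit]")
```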


In [None]:
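# Helper functions to generate outputs
def generate(prompt):
    # max_tokens=None lets generation run until the model stops or the
    # context window is full (the default cap is only 16 tokens).
    return llm(SYS_PROMPT.format(prompt=prompt),
               max_tokens=None,
               stream=False)


def stream(prompt):
    # stream=True yields completion chunks one at a time instead of a
    # single response dict; each chunk carries its own text fragment.
    for chunk in llm(SYS_PROMPT.format(prompt=prompt),
                     max_tokens=None,
                     stream=True):
        print(chunk["choices"][0]["text"], end='', flush=True)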

In [None]:
# Testing the model
print(generate("tell me about diabetes"))




# Usage examples:

You can create your own PubMed RSS feed and set the variable `pubmed_rss` below to its link. For better results, restrict the feed to the 5 most recent articles; in the feed URL this is controlled by the `limit` query parameter.

In [None]:
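import requests

# Note: this example feed requests 10 items (limit=10); lower it to 5 to follow the recommendation above.
pubmed_rss = "https://pubmed.ncbi.nlm.nih.gov/rss/search/1PQPGz2gzLgGzu9Ukv6Gc6AJfbcOKwnGqdz9iHxCjqhaIYdM-F/?limit=10&utm_campaign=pubmed-2&fc=20240306113136"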

In [None]:
# Fetch the raw feed XML; the timeout avoids hanging on a stalled connection
text = requests.get(pubmed_rss, timeout=30).text
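Passing the raw XML to the model spends much of the context window on markup. One possible way to shrink the input is to extract only the title, link, and description of each item before prompting. The sketch below assumes the feed follows the standard RSS 2.0 layout (`<item>` elements with `<title>`, `<link>`, and `<description>` children); `extract_items` and `to_prompt_text` are hypothetical helpers, and the sample feed is made up for illustration. If you use this, the instruction in `SYS_PROMPT` that mentions an "XML string" would no longer describe the input exactly.

```python
import xml.etree.ElementTree as ET


def extract_items(rss_xml, max_items=5):
    """Return up to max_items dicts with title, link, and description."""
    root = ET.fromstring(rss_xml)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
            "description": (item.findtext("description") or "").strip(),
        })
        if len(items) >= max_items:
            break
    return items


def to_prompt_text(items):
    """Join the extracted fields into a compact plain-text block."""
    return "\n\n".join(
        f"{i['title']}\n{i['link']}\n{i['description']}" for i in items
    )


# Made-up sample feed, used only to exercise the helpers
sample_rss = """<rss version="2.0"><channel><title>demo</title>
<item><title>Paper A</title><link>https://example.org/a</link><description>Abstract A</description></item>
<item><title>Paper B</title><link>https://example.org/b</link><description>Abstract B</description></item>
</channel></rss>"""

print(to_prompt_text(extract_items(sample_rss)))
```

In the notebook, `to_prompt_text(extract_items(text))` could then be passed to `generate` or `stream` in place of `text`.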

In [None]:
# This takes a long time, most likely because the whole feed XML is sent as the prompt
generate(text)

In [None]:
# Streaming variant of the same call; it is slow for the same reason (large input text)
stream(text)