<a href="https://colab.research.google.com/github/peterverhaar/exploring_ai/blob/main/A_RAG_example_using_HF_and_LangChain_v3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Building a simple Retrieval-Augmented Generation (RAG) pipeline**

In [1]:
!pip install transformers
!pip install chromadb
!pip install sentence-transformers
!pip install langchain
!pip install langchain_community
!pip install PyPDF2
!pip install pypdf
!pip install accelerate
!pip install numpy



In [2]:
import requests
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings
from transformers import AutoTokenizer, AutoModel, pipeline
from sentence_transformers import SentenceTransformer
import torch


In [3]:
response = requests.get('https://barcelona-declaration.org/downloads/BarcelonaDeclaration.pdf')
with open("BarcelonaDeclaration.pdf",'wb') as out:
  out.write(response.content)

## **Loading a PDF file**
Assuming the file is already available in the session storage (the cell above downloads it from [this link](https://barcelona-declaration.org/downloads/BarcelonaDeclaration.pdf)).

In [4]:
loader_pdf = PyPDFLoader("BarcelonaDeclaration.pdf")
pages = loader_pdf.load_and_split()

In [5]:
print(f'The pdf file contains {len(pages)} pages.')

The pdf file contains 13 pages.


In [6]:
full_text = ''
for page in pages:
  full_text += page.page_content + ''

print(f'The full pdf contains {len(full_text)} characters.')

The full pdf contains 17933 characters.


## **Initialize the text splitter**
This is an important step, as different splitting methods and chunk sizes can lead to different results.

In [7]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1024, chunk_overlap=10)
chunks = text_splitter.split_text(full_text)
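
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-size chunking with overlap. It is a simplification and not the algorithm used by `RecursiveCharacterTextSplitter`, which additionally tries to break on separators such as paragraphs and sentences. The function name `sliding_window_chunks` and the toy text are assumptions for illustration.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Cut text into pieces of at most chunk_size characters.

    Consecutive pieces share chunk_overlap characters, so a sentence cut
    at a boundary still appears partly in both neighbouring chunks.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    pieces = []
    for start in range(0, len(text), step):
        pieces.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last piece already reaches the end of the text
    return pieces


toy_text = "abcdefghijklmnopqrstuvwxyz"
toy_chunks = sliding_window_chunks(toy_text, chunk_size=10, chunk_overlap=2)
print(toy_chunks)
```

A larger overlap repeats more text across chunks, which costs storage but reduces the chance that a relevant passage is split in two.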

*Optional: run the cell below to see the average chunk length after the split.*

In [8]:
total_characters = 0
for chunk in chunks:
  total_characters += len(chunk)

print(f'Average length: {round(total_characters/len(chunks),2)}')

Average length: 941.95


## **Initialize embeddings**
Different embedding models may be better suited to specific tasks.

*Note: an HF_TOKEN is needed.*

In [9]:

from langchain_community.embeddings import HuggingFaceBgeEmbeddings

embeddings = HuggingFaceBgeEmbeddings(
    model_name="BAAI/bge-base-en-v1.5",
    model_kwargs={'device':'cpu'},
    encode_kwargs={'normalize_embeddings': True}
)
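
The `normalize_embeddings` option asks the model wrapper to scale every embedding to unit length. A minimal sketch of why that is convenient, using made-up 3-dimensional vectors instead of real 768-dimensional embeddings: once vectors have length 1, their dot product equals their cosine similarity. The helper names below are assumptions for illustration.

```python
import math


def l2_normalize(vector):
    """Scale a vector so that its Euclidean length becomes 1."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def dot(a, b):
    """Sum of the element-wise products of two equally long vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between a and b, computed on the raw vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy vectors standing in for two chunk embeddings (made-up values)
vec_a = [3.0, 4.0, 0.0]
vec_b = [4.0, 3.0, 0.0]

unit_a = l2_normalize(vec_a)
unit_b = l2_normalize(vec_b)

print("cosine on raw vectors:    ", cosine_similarity(vec_a, vec_b))
print("dot on normalized vectors:", dot(unit_a, unit_b))
```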



Optional: display a sample embedding, if you wish to see what an embedding looks like.

In [10]:
sample_chunk = chunks[0]
print(sample_chunk)

April 16, 2024  /  DOI: https://doi.org/10.5281/zenodo.10958522Preamble
Commitments
Annex A :  Background and context
Annex B :  Definitions1
2
4
8INDEXVast amounts of information are being used to manage the research enterprise, from information 
about research actors and their activities to information about inputs and outputs in the research 
process and signals of the use, esteem, and societal impact of research. This information often 
plays a vital role in the distribution of resources and the evaluation of researchers and institutions. 
Research performing and research funding organizations use this information to set strategic 
priorities. The information is also indispensable for researchers and societal stakeholders to find and 
assess relevant research outputs.
However, a large share of all research information  is locked inside proprietary infrastructures. It is 
managed by companies that are accountable primarily to their shareholders, not to the research


In [11]:
import numpy as np

sample_embedding = np.array(embeddings.embed_query(sample_chunk))

print("Size of the embedding: ", sample_embedding.shape)
#print("Sample embedding of a document chunk: ", sample_embedding)

Size of the embedding:  (768,)

## **Create vectorstore**
Here we use the Chroma database, but another vector store can be chosen.

In [12]:

vectorstore = Chroma.from_texts(chunks, embeddings)


Optionally, you can test the vector store directly with a simple similarity search and display more results.
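
A minimal sketch of such a test, assuming the vector store exposes LangChain's `similarity_search(query, k=...)` method, which returns a list of documents with a `page_content` attribute. The helper name `show_top_matches`, the preview length, and the example query in the note below are assumptions for illustration.

```python
def show_top_matches(store, query, k=3, preview_chars=200):
    """Print a short preview of the k chunks most similar to the query.

    `store` is assumed to be a LangChain vector store such as a Chroma
    instance; only its similarity_search method is used.
    """
    matches = store.similarity_search(query, k=k)
    for rank, doc in enumerate(matches, start=1):
        # Flatten line breaks so each preview fits on one line
        preview = doc.page_content[:preview_chars].replace("\n", " ")
        print(f"[{rank}] {preview}")
    return matches
```

In this notebook the call could look like `show_top_matches(vectorstore, "What is open research information?", k=5)`; raising `k` displays more results.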

In [13]:
from google.colab import userdata
my_secret_key = userdata.get('HF_TOKEN')
# Do not print the token itself; only confirm that a value was loaded
print('HF_TOKEN loaded:', bool(my_secret_key))



In [14]:
from huggingface_hub import notebook_login

notebook_login()


## **Create the HuggingFacePipeline with the specified model and device**

In [15]:
from transformers import AutoTokenizer, AutoModelForCausalLM
from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline
import torch

#Select an LLM. Note that different models can be selected, depending on the use case
#model_id = "microsoft/phi-1_5"
#model_id= "microsoft/Phi-3-mini-128k-instruct"
#model_id = "BAAI/bge-m3" # Note: this is an embedding model, not a text-generation model
#model_id = "mistralai/Mistral-7B-v0.3" # Access to this model must be authenticated (at HF)
model_id = "meta-llama/Meta-Llama-3-8B-Instruct" # Access to this model must also be authenticated (at HF)

#Initialize the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)


if tokenizer.pad_token_id is None:
    tokenizer.pad_token_id = tokenizer.eos_token_id

device = 0 if torch.cuda.is_available() else -1

torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

#------------------------------------------------------------------------------
# Load the model with AutoModelForCausalLM
hf_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    low_cpu_mem_usage=True,
    torch_dtype=torch_dtype,
    device_map="auto"
)

#Create the HuggingFacePipeline with a positive temperature value
hf_pipeline = pipeline(
    "text-generation",
    model=hf_model,
    tokenizer=tokenizer,
    temperature=0.1,  #model creativity
    max_new_tokens=100  #Smaller number of tokens, a faster response
)

#Set the model for the LLM based on the above defined pipeline
llm = HuggingFacePipeline(pipeline=hf_pipeline)
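
The `temperature` setting rescales the model's next-token scores before sampling: low values concentrate probability on the top candidate, while high values spread it out. A minimal sketch with made-up scores for three candidate tokens; the function name and the numbers are assumptions for illustration.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [score / temperature for score in logits]
    highest = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - highest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens

for temp in (0.1, 1.0, 2.0):
    probs = softmax_with_temperature(toy_logits, temp)
    print(temp, [round(p, 3) for p in probs])
```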





## **Convert the vector store into a retriever object**
This allows us to perform similarity searches, i.e. information retrieval based on vector embeddings.

In [19]:
retriever = vectorstore.as_retriever()
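
Conceptually, a retriever embeds the question, scores it against every stored chunk embedding, and returns the best-scoring chunks. A minimal sketch of that ranking step with made-up, already normalized 2-dimensional vectors; it is an illustration, not a description of how Chroma works internally. The names `top_k_chunks` and `toy_index` are assumptions.

```python
import heapq


def top_k_chunks(query_vector, index, k=2):
    """Return the k (score, text) pairs with the highest dot product.

    `index` is a list of (text, vector) pairs. The vectors are assumed to be
    unit length, so the dot product equals the cosine similarity.
    """
    scored = (
        (sum(q * v for q, v in zip(query_vector, vector)), text)
        for text, vector in index
    )
    return heapq.nlargest(k, scored)


# Made-up unit vectors standing in for chunk embeddings
toy_index = [
    ("chunk about open research information", [1.0, 0.0]),
    ("chunk about proprietary infrastructures", [0.6, 0.8]),
    ("chunk about something unrelated", [0.0, 1.0]),
]

toy_query = [0.8, 0.6]  # made-up embedding of a question
for score, text in top_k_chunks(toy_query, toy_index, k=2):
    print(f"{score:.2f}  {text}")
```

With LangChain, the number of returned chunks can be set when creating the retriever, for example `vectorstore.as_retriever(search_kwargs={"k": 3})`.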

## **Definition of the RAG template**
This template will format the context and question to create a prompt for generating responses

In [20]:
from langchain_core.prompts import ChatPromptTemplate

rag_template = """Use the following pieces of context to answer the question.
If you don't know the answer, just say that you don't know, don't try to make up an answer. Provide only the answer, nothing else.
{context}
Question: {question}
"""
rag_prompt = ChatPromptTemplate.from_template(rag_template)
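
To see what the model actually receives, the sketch below fills the same template text with plain `str.format`. It uses a small stand-in class instead of LangChain's `Document` and a hypothetical helper `format_docs` that joins the chunk texts; both are assumptions for illustration, as are the example chunks and the question.

```python
from dataclasses import dataclass


@dataclass
class FakeDocument:
    """Stand-in for a retrieved document; only page_content is modelled."""
    page_content: str


def format_docs(docs):
    """Join the text of the retrieved chunks, separated by blank lines."""
    return "\n\n".join(doc.page_content for doc in docs)


demo_template = """Use the following pieces of context to answer the question.
If you don't know the answer, just say that you don't know, don't try to make up an answer. Provide only the answer, nothing else.
{context}
Question: {question}
"""

retrieved = [
    FakeDocument("First retrieved chunk."),
    FakeDocument("Second retrieved chunk."),
]

demo_prompt = demo_template.format(
    context=format_docs(retrieved),
    question="What is the declaration about?",
)
print(demo_prompt)
```

In a chain, such a helper could be placed after the retriever (`retriever | format_docs`) so that only the chunk text, and not the document metadata, ends up in the prompt.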


## **Setting up a RAG chain using LangChain components**

In [21]:
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

rag_chain = (
  {"context": retriever, "question": RunnablePassthrough()}
  | rag_prompt
  | llm
  | StrOutputParser()
)
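
The `|` operator pipes the output of one step into the next, and the leading dictionary sends the same input to each of its values. A minimal plain-Python sketch of that data flow, with stub functions in place of the real retriever and LLM; all names and return values below are made up for illustration.

```python
def fake_retriever(question):
    """Stub: a real retriever would return the most similar chunks."""
    return ["chunk A", "chunk B"]


def passthrough(value):
    """Return the input unchanged, like RunnablePassthrough."""
    return value


def build_prompt(inputs):
    """Combine the context chunks and the question into one prompt string."""
    context = "\n".join(inputs["context"])
    return f"{context}\nQuestion: {inputs['question']}\n"


def fake_llm(prompt):
    """Stub: a real LLM would generate text conditioned on the prompt."""
    return prompt + "Answer: (generated text)"


def run_chain(question):
    # Step 1: the dictionary fans the question out to both branches
    inputs = {
        "context": fake_retriever(question),
        "question": passthrough(question),
    }
    # Steps 2-3: each remaining step consumes the previous step's output
    return fake_llm(build_prompt(inputs))


demo_output = run_chain("What is open research information?")
print(demo_output)
```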


## **Asking a question through RAG chain**

NOTE: Depending on the selected LLM, the generated response may need to be formatted, i.e. unwanted text may need to be removed.
Without a GPU, generating an answer can take a long time.

In [22]:
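def extract_answer(text):
  #Keep the text after the first "Answer:" marker. If the model did not emit
  #the marker, return the full text instead of raising an IndexError
  parts = text.split("Answer:")
  return parts[1] if len(parts) > 1 else text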

def answer_question(question):
  #Invoke the chain and print the answer
  answer = rag_chain.invoke(question)
  print(extract_answer(answer))

In [None]:
question = "Can you give a summary of the Barcelona declaration? List the main statements using bullet points."
answer_question(question)

In [23]:
question = "Why is the Barcelona Declaration important to the scientific community?"
answer_question(question)

 The Barcelona Declaration on Open Research Information is important to the scientific community because it provides a means by which a scholarly infrastructure can provide assurances to the community that it qualifies for the level of trust accorded to an open scholarly infrastructure. It also guarantees that not only those performing an assessment but also those being assessed have access to all ‘evidence’ considered in the assessment, offering the transparency and accountability that are crucial to foster responsible assessment practices.


In [None]:
question = "What does open research mean exactly?"
answer_question(question)

In [None]:
question = "How many members does CoARA have?"
answer_question(question)

Formatting the generated response, i.e. removing any unwanted text. Different models may need different formats.

In [None]:
question = "Can you give a French translation of the main points of the Barcelona declaration?"
answer_question(question)

In [None]:
question = "What are the main named entities in the Barcelona declaration? Give the answer in the form of a JSON file"
answer_question(question)

In [None]:
question = "Which sentences in the Barcelona declaration express a negative sentiment?"
answer_question(question)

In [None]:
question = "Which sentences in the Barcelona declaration express a positive sentiment?"
answer_question(question)

In [None]:
question = "How does the Barcelona declaration differ from the Berlin Open Access Declaration from 2003?"
answer_question(question)

In [None]:
question = "What does the Barcelona declaration mention about Artificial Intelligence?"
answer_question(question)

In [None]:
question = "What does the Barcelona declaration mention about open software?"
answer_question(question)

In [None]:
question = "What are the main sights in the city of Limassol? Which places should I visit?"
answer_question(question)