# NLP Exercise 4: Retrieval Augmented Generation and BERTs
---


## RAGs

To run LLMs locally, download and install Ollama from here:

https://ollama.com/

Then pull a model from your terminal, for example Mistral:

`ollama pull mistral`

And to run it:

`ollama run mistral`
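
Before running the rest of the notebook, it can help to confirm that the `ollama` command is reachable from Python. The following is a minimal sketch, not part of the original exercise: the helper names `ollama_installed` and `list_local_models` are hypothetical, and it assumes the standard `ollama list` subcommand is available once Ollama is installed.

```python
import shutil
import subprocess


def ollama_installed() -> bool:
    """Return True if an `ollama` executable is found on the PATH."""
    return shutil.which("ollama") is not None


def list_local_models() -> list[str]:
    """Return the output lines of `ollama list`, or [] if it cannot be run."""
    if not ollama_installed():
        return []
    try:
        result = subprocess.run(
            ["ollama", "list"], capture_output=True, text=True, timeout=10
        )
    except (OSError, subprocess.TimeoutExpired):
        return []
    if result.returncode != 0:
        # For example, the Ollama server is not running
        return []
    return result.stdout.splitlines()


print("Ollama installed:", ollama_installed())
print(list_local_models())
```

If the list is empty, either Ollama is not installed, its server is not running, or no model has been pulled yet.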


The RAG pipeline:

![RAG pipeline](rag_pipeline-1.gif)

## Documents and DataBase Preparation

We will use board game rules (PDF files in the data folder) as the documents for our RAG system.

In [4]:
import os
import shutil
from langchain_community.document_loaders import PyPDFDirectoryLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.documents import Document
from langchain_community.vectorstores import Chroma

In [5]:
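# Define the Chroma DB path and the data path.
# os.path.join avoids backslash escape sequences and works on every OS.
chroma_path = os.path.join('Week_4', 'chroma')
data_path = os.path.join('Week_4', 'data')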

In [6]:
# Load documents
def load_documents():
    document_loader = PyPDFDirectoryLoader(data_path)
    return document_loader.load()

documents = load_documents()

In [7]:
# Chunk documents
def split_documents(documents: list[Document]):
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=800,
        chunk_overlap=80,
        length_function=len,
        is_separator_regex=False
    )
    return splitter.split_documents(documents)

chunks = split_documents(documents)
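
To build intuition for `chunk_size` and `chunk_overlap`, here is a simplified sketch that uses a plain sliding window over characters. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraph and line breaks); the function name `sliding_window_chunks` is hypothetical and only illustrates how consecutive chunks share some text.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into character windows that overlap by chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
demo_chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=2)
print(demo_chunks)  # ['abcdefghij', 'ijklmnopqr', 'qrstuvwxyz']
```

The last two characters of each chunk reappear at the start of the next one, so a sentence cut at a chunk boundary is still partly visible in both chunks.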

In [8]:
# Get chunk ids
def calculate_chunk_ids(chunks):

    # This will create IDs like "data/monopoly.pdf:6:2"
    # Page Source : Page Number : Chunk Index

    last_page_id = None
    current_chunk_index = 0

    for chunk in chunks:
        source = chunk.metadata.get("source")
        page = chunk.metadata.get("page")
        current_page_id = f"{source}:{page}"

        # If the page ID is the same as the last one, increment the index.
        if current_page_id == last_page_id:
            current_chunk_index += 1
        else:
            current_chunk_index = 0

        # Calculate the chunk ID.
        chunk_id = f"{current_page_id}:{current_chunk_index}"
        last_page_id = current_page_id

        # Add it to the page meta-data.
        chunk.metadata["id"] = chunk_id

    return chunks
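
The ID scheme can be checked without loading any PDF. The sketch below re-implements the same numbering rule on a hypothetical stand-in class (`FakeChunk`, which only carries a `metadata` dict like a LangChain `Document`), so it runs without LangChain installed. The file name and page numbers are illustrative.

```python
from dataclasses import dataclass, field


@dataclass
class FakeChunk:
    """Hypothetical stand-in for a Document: only the metadata dict is needed."""
    metadata: dict = field(default_factory=dict)


def assign_chunk_ids(chunks):
    """Same rule as calculate_chunk_ids: restart the index on every new page."""
    last_page_id = None
    index = 0
    for chunk in chunks:
        page_id = f"{chunk.metadata.get('source')}:{chunk.metadata.get('page')}"
        index = index + 1 if page_id == last_page_id else 0
        chunk.metadata["id"] = f"{page_id}:{index}"
        last_page_id = page_id
    return chunks


demo = [
    FakeChunk({"source": "data/monopoly.pdf", "page": 6}),
    FakeChunk({"source": "data/monopoly.pdf", "page": 6}),
    FakeChunk({"source": "data/monopoly.pdf", "page": 6}),
    FakeChunk({"source": "data/monopoly.pdf", "page": 7}),
]
ids = [chunk.metadata["id"] for chunk in assign_chunk_ids(demo)]
print(ids)
```

Note that the rule assumes chunks from the same page are adjacent in the list, which holds when they come straight from the splitter.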

## Vector Embedding

In this section, we use the `OllamaEmbeddings` class from the `langchain_community` library to embed our document chunks. It converts text into numerical vectors, which can be used for downstream tasks such as similarity search and clustering.

`OllamaEmbeddings` is initialized with `nomic-embed-text`, a model designed for embedding text. Like `mistral`, it has to be pulled first with `ollama pull nomic-embed-text`.

In [1]:
from langchain_community.embeddings.ollama import OllamaEmbeddings

def embedding_function():
    embeddings = OllamaEmbeddings(model='nomic-embed-text')
    return embeddings
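
To see why vectors are useful for similarity search, the sketch below compares toy vectors with cosine similarity. The three-dimensional vectors are made up for illustration (real embeddings have hundreds of dimensions), and cosine similarity is only one common choice; the metric actually used depends on how the vector store is configured.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings, invented for this example
toy_vectors = {
    "hotel rules": [0.9, 0.1, 0.8],
    "dice rules": [0.0, 1.0, 0.1],
}
query_vector = [1.0, 0.0, 1.0]

scores = {name: cosine_similarity(query_vector, vec) for name, vec in toy_vectors.items()}
best_match = max(scores, key=scores.get)
print(scores)
print(best_match)
```

The text whose vector points in nearly the same direction as the query vector gets the highest score and would be retrieved first.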

### Add chunked documents to the Chroma DB using `OllamaEmbeddings`

In [9]:
def add_to_chroma(chunks: list[Document]):
    # Load the database
    db = Chroma(
        persist_directory=chroma_path, embedding_function=embedding_function()
    )

    chunks_with_ids = calculate_chunk_ids(chunks)

    # Look up the IDs already stored so that only new chunks are added
    existing_items = db.get(include=[])  # IDs are always returned
    existing_ids = set(existing_items["ids"])
    print(f"Number of existing documents in DB: {len(existing_ids)}")

    new_chunks = [
        chunk for chunk in chunks_with_ids
        if chunk.metadata["id"] not in existing_ids
    ]

    if new_chunks:
        print(f"Adding new documents: {len(new_chunks)}")
        new_chunk_ids = [chunk.metadata["id"] for chunk in new_chunks]
        db.add_documents(new_chunks, ids=new_chunk_ids)
        # Explicit persist; newer Chroma versions may persist automatically
        db.persist()
    else:
        print("No new documents to add")

In [11]:
add_to_chroma(chunks)

Number of existing documents in DB: 47
No new documents to add
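
The output above reports no new documents because every chunk ID was already stored by an earlier run. The ID check that makes the function safe to re-run can be sketched in isolation as follows; `select_new_ids` is a hypothetical helper and the IDs are illustrative.

```python
def select_new_ids(candidate_ids, existing_ids):
    """Keep only the IDs that are not stored yet, preserving their order."""
    existing = set(existing_ids)  # set lookup keeps the check fast
    return [cid for cid in candidate_ids if cid not in existing]


stored = ["data/monopoly.pdf:0:0", "data/monopoly.pdf:0:1"]
incoming = ["data/monopoly.pdf:0:0", "data/monopoly.pdf:0:1", "data/monopoly.pdf:1:0"]

first_run = select_new_ids(incoming, stored)
second_run = select_new_ids(incoming, stored + first_run)
print(first_run)   # ['data/monopoly.pdf:1:0']
print(second_run)  # []
```

Because the check compares IDs only, a chunk whose text changed but whose ID stayed the same would not be re-embedded.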


## Query Data

In this section, we query the data using the RAG system. We embed the query text with `OllamaEmbeddings` and search the Chroma database for relevant chunks. The retrieved chunks are inserted into a prompt, and the model's answer is displayed along with its sources.

The following steps will be performed:
1. Prepare the Chroma database with the embedding function.
2. Search the database for the three chunks most similar to the query text (`k=3`).
3. Combine the retrieved chunks into a context string and format the prompt.
4. Invoke the `mistral` model with the prompt and print the response along with the sources.

The `query_rag` function handles these steps and prints the prompt, the response, and the source IDs.
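
The steps above can be sketched without a database or an LLM. In this toy version, a word-overlap score stands in for embedding similarity (an assumption made purely for illustration), and the documents, IDs, and helper names are hypothetical. It shows only the retrieval and prompt-assembly part; no model is invoked.

```python
TOY_TEMPLATE = """Answer the question based only on the following context:

{context}

---

Answer the question based on the above context: {question}
"""


def overlap_score(query: str, text: str) -> int:
    """Stand-in for vector similarity: number of words shared with the query."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def build_prompt(query: str, docs: list[dict], k: int = 2):
    """Pick the k best-scoring documents and place them in the prompt."""
    ranked = sorted(docs, key=lambda d: overlap_score(query, d["text"]), reverse=True)
    top = ranked[:k]
    context = "\n\n---\n\n".join(d["text"] for d in top)
    return TOY_TEMPLATE.format(context=context, question=query), [d["id"] for d in top]


toy_docs = [
    {"id": "rules.pdf:0:0", "text": "you may buy a hotel from the bank after four houses"},
    {"id": "rules.pdf:1:0", "text": "roll the dice to move your token"},
    {"id": "rules.pdf:2:0", "text": "build houses evenly before you build a hotel"},
]
toy_prompt, toy_sources = build_prompt("how do i build a hotel", toy_docs, k=2)
print(toy_prompt)
print(toy_sources)
```

Only the two best-matching documents reach the prompt, which is what keeps the model's answer grounded in relevant text.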

In [23]:
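from langchain_community.vectorstores import Chroma
from langchain_core.prompts import ChatPromptTemplate
from langchain_community.llms.ollama import Ollama

# Must point at the directory that add_to_chroma() persisted to.
# This overrides the Week_4 path defined earlier, so adjust it if the
# notebook is not run from inside the Week_4 folder.
chroma_path = 'chroma'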

PROMPT_TEMPLATE = """
Answer the question based only on the following context:

{context}

---

Answer the question based on the above context: {question}
"""

In [34]:
def query_rag(query_text: str):
    # Prepare the DB.
    embedding = embedding_function()
    db = Chroma(persist_directory=chroma_path, embedding_function=embedding)

    # Search the DB.
    results = db.similarity_search_with_score(query_text, k=3)

    # Combine the results into a single context string.
    context_text = "\n\n---\n\n".join([doc.page_content for doc, _score in results])
    
    # Format the prompt with the context and the query.
    prompt_template = ChatPromptTemplate.from_template(PROMPT_TEMPLATE)
    prompt = prompt_template.format(context=context_text, question=query_text)
    print(prompt)

    # Invoke the model with the formatted prompt.
    model = Ollama(model="mistral")
    response_text = model.invoke(prompt)

    # Extract the sources from the results.
    sources = [doc.metadata.get("id", None) for doc, _score in results]
    print(response_text)
    print(sources)

In [35]:
query_rag("How do I build a hotel?")

Human: 
Answer the question based only on the following context:

complete color-group, hdshe may buy a hotel from the Bank and erect 
it on any property of the color-group. Hdshe returns the four houses 
from that property to the Bank and pays the price for the hotel as shown 
on the Ttle Deed card. Only one hotel may be erected on any one 
property. 
BUILDING SHORTAGES: When the Bank has no houses to sell, players 
wishing to build must wait for some player to return or sell histher 
houses to the Bank before building. If there are a limited number of 
houses and hotels available and two or more players wish to buy more 
than the Bank has, the houses or hotels must be sold at auction to the 
highest bidder.

---

Following the above rules, you may buy and erect at any time as 
many houses as your judgement and financial standing will allow. But 
you must build evenly, i.e., you cannot erect more than one house on 
any one property of any color-group until you have built one house on 

## BERTs

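BERT and related Transformer models can be used on a wide variety of language tasks:
1. Sentiment Analysis
2. Question Answering
3. Text Prediction
4. Text Generation
5. Summarization

With very few lines of code, the Hugging Face `transformers` pipelines cover all of the tasks above. Each pipeline loads a model suited to its task, so not every model used below is BERT itself.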

In [38]:
from transformers import pipeline

In [43]:
# Sentiment Analysis
classifier = pipeline("sentiment-analysis")
classifier("I love Vietnam")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


[{'label': 'POSITIVE', 'score': 0.9806610345840454}]
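
The log above recommends specifying a model name and revision. A minimal sketch of doing so is shown below; the model ID and revision are copied from that log message, while the helper name `build_sentiment_classifier` is hypothetical. The import is placed inside the function so the snippet can be defined even where `transformers` is not installed.

```python
# Model ID and revision taken from the pipeline's log message
SENTIMENT_MODEL = "distilbert/distilbert-base-uncased-finetuned-sst-2-english"
SENTIMENT_REVISION = "714eb0f"


def build_sentiment_classifier():
    """Create the sentiment pipeline with an explicit model and revision."""
    from transformers import pipeline  # requires the `transformers` package

    return pipeline(
        "sentiment-analysis",
        model=SENTIMENT_MODEL,
        revision=SENTIMENT_REVISION,
    )
```

Calling `build_sentiment_classifier()("I love Vietnam")` should then use the same model as the default pipeline while avoiding the warning; the first call downloads the model if it is not cached.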

In [50]:
# Fill Mask
unmask = pipeline("fill-mask", model="roberta-base")
unmask("This course will teach you about <mask> models.", top_k=3)

Device set to use cpu


[{'score': 0.08234003931283951,
  'token': 30412,
  'token_str': ' mathematical',
  'sequence': 'This course will teach you about mathematical models.'},
 {'score': 0.04040270298719406,
  'token': 209,
  'token_str': ' these',
  'sequence': 'This course will teach you about these models.'},
 {'score': 0.03654172271490097,
  'token': 23805,
  'token_str': ' simulation',
  'sequence': 'This course will teach you about simulation models.'}]

In [55]:
# Summarization
summarizer = pipeline("summarization")
summarizer(
    """
    Vietnam, officially the Socialist Republic of Vietnam, is a country at the eastern edge of mainland Southeast Asia, with an area of about 331,000 square kilometres (128,000 sq mi) and a population of over 100 million, making it the world's fifteenth-most populous country. One of the two Marxist–Leninist states in Southeast Asia, Vietnam shares land borders with China to the north, and Laos and Cambodia to the west. It shares maritime borders with Thailand through the Gulf of Thailand, and the Philippines, Indonesia, and Malaysia through the South China Sea.
    Its capital is Hanoi and its largest city is Ho Chi Minh City.
    Vietnam was inhabited by the Paleolithic age, with states established in the first millennium BC on the Red River Delta in modern-day northern Vietnam. 
    The Han dynasty annexed Northern and Central Vietnam, which were subsequently under Chinese rule from 111 BC until the first dynasty emerged in 939. 
    Successive monarchical dynasties absorbed Chinese influences through Confucianism and Buddhism, and expanded southward to the Mekong Delta, conquering Champa. 
    During most of the 17th and 18th centuries, Vietnam was effectively divided into two domains of Đàng Trong and Đàng Ngoài. 
    The Nguyễn—the last imperial dynasty—surrendered to France in 1883. In 1887, its territory was integrated into French Indochina as three separate regions. 
    In the immediate aftermath of World War II, the nationalist coalition Viet Minh, led by the communist revolutionary Ho Chi Minh, launched the August Revolution and declared Vietnam's independence from the Empire of Japan in 1945.
    Vietnam went through prolonged warfare in the 20th century. After World War II, France returned to reclaim colonial power in the First Indochina War, from which Vietnam emerged victorious in 1954. 
    As a result of the treaties signed between the Viet Minh and France, Vietnam was also separated into two parts. The Vietnam War began shortly after, between the communist North Vietnam, supported by the Soviet Union and China, and the anti-communist South Vietnam, supported by the United States.
    Upon the North Vietnamese victory in 1975, Vietnam reunified as a unitary socialist state under the Communist Party of Vietnam (CPV) in 1976. An ineffective planned economy, a trade embargo by the West, and wars with Cambodia and China crippled the country further.
    In 1986, the CPV initiated economic and political reforms similar to the Chinese economic reform, transforming the country to a socialist-oriented market economy. The reforms facilitated Vietnamese reintegration into the global economy and politics.
    Vietnam is a developing country with a lower-middle-income economy. It has high levels of corruption, censorship, environmental issues and a poor human rights record. 
    It is part of international and intergovernmental institutions including the ASEAN, the APEC, the Non-Aligned Movement, the OIF, and the WTO. It has assumed a seat on the United Nations Security Council twice.
    """
)

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


[{'summary_text': ' Vietnam is one of the two Marxist–Leninist states in Southeast Asia . Its capital is Hanoi and its largest city is Ho Chi Minh City . Vietnam is a developing country with a lower-middle-income economy . It has high levels of corruption, censorship, environmental issues and a poor human rights record .'}]