# RAG using LangChain

## Package installation and imports

In [None]:
# !pip install langchain -i https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
# !pip install langchain_community -i https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
# !pip install langchain_huggingface -i https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
# !pip install langchain_text_splitters -i https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
# !pip install langchain_chroma -i https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
# !pip install rank-bm25 -i https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
# !pip install huggingface_hub -i https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple

In [1]:
import os
import json
import bs4
import nltk
import torch
import pickle
import numpy as np

# from pyserini.index import IndexWriter
# from pyserini.search import SimpleSearcher
from numpy.linalg import norm
from rank_bm25 import BM25Okapi
from nltk.tokenize import word_tokenize

from langchain_community.llms import Ollama
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains import create_retrieval_chain
from langchain_community.vectorstores import Chroma, FAISS
from sentence_transformers import SentenceTransformer
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.embeddings import JinaEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter, TokenTextSplitter
from langchain.docstore.document import Document
from langchain_core.prompts import ChatPromptTemplate
from langchain_community.document_loaders import WebBaseLoader
from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer

from tqdm import tqdm

USER_AGENT environment variable not set, consider setting it to identify your requests.


In [2]:
nltk.download('punkt')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /home/hui/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /home/hui/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

## Hugging Face login
- Please request access to the model first: https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct
- If you have not been granted access to this model, you can use another LLM that does not require an access request.
- Save your Hugging Face token; otherwise you will need to regenerate it every time.
- When using Ollama, no login is required to access and use the Llama model.

In [2]:
from huggingface_hub import login

hf_token = os.environ["HF_TOKEN"]  # set HF_TOKEN in your environment; never hard-code or commit a token
login(token=hf_token, add_to_git_credential=True)

In [3]:
!huggingface-cli whoami

huiyuiui
orgs:  NTHU


## TODO1: Set up the environment of Ollama

### Introduction to Ollama
- Ollama is a platform designed for running and managing large language models (LLMs) directly **on local devices**, providing a balance between performance, privacy, and control.
- Other tools also help users run and accelerate LLMs on local devices, such as *vLLM*, *Llamafile*, and *GPT4All*.

### Launch colabxterm

In [5]:
# TODO1-1: You should install colab-xterm and launch it.
# Write your commands here.
!pip install colab-xterm -i https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple # https://pypi.org/project/colab-xterm

%load_ext colabxterm

Looking in indexes: https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple


In [6]:
# TODO1-2: You should install Ollama.
# You may need root privileges if you use a local machine instead of Colab.
!curl -fsSL https://ollama.com/install.sh | sh

>>> Cleaning up old version at /usr/local/lib/ollama
>>> Installing ollama to /usr/local
>>> Downloading Linux amd64 bundle
######################################################################## 100.0%
>>> Adding ollama user to render group...
>>> Adding ollama user to video group...
>>> Adding current user to ollama group...
>>> Creating ollama systemd service...
>>> NVIDIA GPU installed.
>>> The Ollama API is now available at 127.0.0.1:11434.
>>> Install complete. Run "ollama" from the command line.


In [10]:
!ollama serve

Couldn't find '/home/hui/.ollama/id_ed25519'. Generating new private key.
Your new public key is: 

ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIOm+Mfx1tKI8MiBtp8NWBp/eMR97RqgLe3dtCgrXntE1

2024/12/20 16:33:17 routes.go:1259: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/home/hui/.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* 

In [11]:
# TODO1-3: Pull Llama3.2:1b via Ollama and start the Ollama service in the xterm
# Write your commands in the xterm
!ollama pull llama3.2:1b

pulling manifest
pulling 74701a8c35f6...   0% ▕                ▏    0 B/1.3 GB

## Ollama testing
You can test your Ollama status with the following cells.
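
Before running those cells, it can help to confirm that the server is reachable at all. The sketch below is a minimal reachability check, assuming the default address `127.0.0.1:11434` reported by the installer output above; `ollama_is_up` is a hypothetical helper name, and it only tests that something answers HTTP 200 at that address, not that a particular model has been pulled.

```python
import urllib.error
import urllib.request


def ollama_is_up(base_url: str = "http://127.0.0.1:11434", timeout: float = 2.0) -> bool:
    """Return True if an HTTP server answers with status 200 at base_url."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status.
        return False


print("Ollama reachable:", ollama_is_up())
```

If this prints `False`, start the service (`ollama serve`) in the xterm before invoking the model.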

In [4]:
# Setting up the model that this tutorial will use
MODEL = "llama3.2:1b" # https://ollama.com/library/llama3.2:1b
EMBED_MODEL = "jinaai/jina-embeddings-v2-base-en"

In [5]:
# Initialize an instance of the Ollama model
llm = Ollama(model=MODEL)
# Invoke the model to generate responses
response = llm.invoke("What is the capital of Taiwan?")
print(response)

The capital of Taiwan is Taipei.




## Build a simple RAG system by using LangChain

### TODO2: Load the cat-facts dataset and prepare the retrieval database

In [None]:
!wget https://huggingface.co/ngxson/demo_simple_rag_py/resolve/main/cat-facts.txt

--2024-12-20 22:57:06--  https://huggingface.co/ngxson/demo_simple_rag_py/resolve/main/cat-facts.txt
Resolving huggingface.co (huggingface.co)... 3.169.137.5, 3.169.137.111, 3.169.137.119, ...
Connecting to huggingface.co (huggingface.co)|3.169.137.5|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 22657 (22K) [text/plain]
Saving to: ‘cat-facts.txt.1’


2024-12-20 22:57:07 (304 MB/s) - ‘cat-facts.txt.1’ saved [22657/22657]



In [7]:
# TODO2-1: Load the cat-facts dataset (as `refs`, which is a list of strings for all the cat facts)
# Write your code here
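# Open with an explicit encoding (the file contains typographic quotes) and close it automatically.
with open("cat-facts.txt", "r", encoding="utf-8") as file:
    content = file.readlines()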

refs = [line.strip() for line in content]

print(refs)

['On average, cats spend 2/3 of every day sleeping. That means a nine-year-old cat has been awake for only three years of its life.', 'Unlike dogs, cats do not have a sweet tooth. Scientists believe this is due to a mutation in a key taste receptor.', 'When a cat chases its prey, it keeps its head level. Dogs and humans bob their heads up and down.', 'The technical term for a cat’s hairball is a “bezoar.”', 'A group of cats is called a “clowder.”', 'Female cats tend to be right pawed, while male cats are more often left pawed. Interestingly, while 90% of humans are right handed, the remaining 10% of lefties also tend to be male.', 'A cat can’t climb head first down a tree because every claw on a cat’s paw points the same way. To get down from a tree, a cat must back down.', 'Cats make about 100 different sounds. Dogs make only about 10.', 'A cat’s brain is biologically more similar to a human brain than it is to a dog’s. Both humans and cats have identical regions in their brains that 

In [8]:
from langchain_core.documents import Document
docs = [Document(page_content=doc, metadata={"id": i}) for i, doc in enumerate(refs)]

In [9]:
# Create an embedding model
model_kwargs = {'trust_remote_code': True}
encode_kwargs = {'normalize_embeddings': False}
embeddings_model = HuggingFaceEmbeddings(
    model_name=EMBED_MODEL,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs
)
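
`normalize_embeddings` is set to `False` above. Whether that matters depends on the metric the vector store ranks by. For unit-length vectors, squared Euclidean distance and cosine similarity are tied together by ‖a − b‖² = 2 − 2·cos(a, b), so both produce the same ranking; for unnormalized vectors they can disagree. The sketch below illustrates this with made-up 2-D vectors (not real embeddings) and hypothetical helper names. It assumes a store that ranks by Euclidean distance, which is believed to be the default of LangChain's FAISS wrapper.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Toy vectors (assumptions for illustration only).
query = [1.0, 0.0]
docs = {"a": [10.0, 0.0],  # same direction as the query, but long
        "b": [1.0, 1.0]}   # different direction, but close in raw distance

nearest_raw_l2 = min(docs, key=lambda name: squared_l2(query, docs[name]))
nearest_cosine = max(docs, key=lambda name: cosine(query, docs[name]))
nearest_unit_l2 = min(
    docs, key=lambda name: squared_l2(l2_normalize(query), l2_normalize(docs[name]))
)

print(nearest_raw_l2, nearest_cosine, nearest_unit_l2)
```

With these toy values, raw Euclidean distance prefers `b`, while cosine similarity and normalized Euclidean distance both prefer `a`.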

In [10]:
# TODO2-2: Prepare the retrieval database
# You should create a vector store (FAISS here; a commented-out Chroma alternative follows in the next cell).
# search_type can be "similarity" (default), "mmr", or "similarity_score_threshold"
vector_store = FAISS.from_documents(
    # Write your code here
    documents=docs, embedding=embeddings_model
)
retriever = vector_store.as_retriever(
    # Write your code here
    search_type="similarity", search_kwargs={"k": 3}
)

In [143]:
# vector_store = Chroma.from_documents(
#     # Write your code here
#     documents=docs, embedding=embeddings_model
# )
# retriever = vector_store.as_retriever(
#     # Write your code here
#     search_type='mmr', search_kwargs={"k": 3, "fetch_k": 5}
# )
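
The commented-out alternative uses `search_type='mmr'` (maximal marginal relevance): fetch `fetch_k` candidates by similarity, then pick `k` of them while penalizing redundancy with the documents already chosen. A minimal pure-Python sketch of that selection rule follows. It uses assumed toy vectors and a `lambda_mult` trade-off weight (the name mirrors LangChain's option; the default value here is an assumption), and it is an illustration, not LangChain's actual implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr_select(query_vec, doc_vecs, k=3, fetch_k=5, lambda_mult=0.5):
    """Return indices of k documents chosen by maximal marginal relevance."""
    # Step 1: keep the fetch_k candidates most similar to the query.
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine(query_vec, doc_vecs[i]),
        reverse=True,
    )
    candidates = ranked[:fetch_k]
    selected = []

    # Step 2: greedily add the candidate with the best relevance/redundancy trade-off.
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy vectors: documents 0 and 1 are near-duplicates, document 2 covers another aspect.
query = [1.0, 1.0, 0.0]
doc_vecs = [[1.0, 0.2, 0.0], [1.0, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

print(mmr_select(query, doc_vecs, k=2, fetch_k=3))                   # diversity-aware
print(mmr_select(query, doc_vecs, k=2, fetch_k=3, lambda_mult=1.0))  # pure relevance
```

With `lambda_mult=1.0` the redundancy term vanishes and the selection reduces to plain top-`k` similarity, so the near-duplicate is kept; with the lower weight it is replaced by the more diverse document.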

### Prompt setting

In [11]:
# TODO3: Set up the `system_prompt` and configure the prompt.
system_prompt = (
    # Adjacent string literals are concatenated as-is, so each sentence ends with a space.
    "Use the given context to answer the question. "
    "Use the key points in the context to construct a concise answer. "
    "Before finalizing your answer, verify it strictly against the given context. "
    "Your answer will be evaluated based on its accuracy and relevance to the context. "
    "Context: {context}"
) # Write your code here
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{input}"),
    ]
)

- For the vector store, common choices such as FAISS and Chroma (https://python.langchain.com/docs/integrations/vectorstores/) are designed to search extremely large databases efficiently.

In [12]:
# TODO4: Build and run the RAG system
# TODO4-1: Load the QA chain
# You should create a chain for passing a list of Documents to a model.
question_answer_chain = create_stuff_documents_chain(llm, prompt) # Write your code here

# TODO4-2: Create retrieval chain
# You should create retrieval chain that retrieves documents and then passes them on.
chain = create_retrieval_chain(retriever, question_answer_chain) # Write your code here
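
Conceptually, the two chains compose as follows: retrieve documents for the input, "stuff" their text into the `{context}` slot of the prompt, and pass the filled prompt to the model. The sketch below mimics that flow with plain functions. `fake_retriever` and `fake_llm` are hypothetical stand-ins, the `"\n\n"` separator is an assumption, and the returned keys are meant to mirror the `input`, `context`, and `answer` fields of the retrieval chain's result.

```python
def fake_retriever(query, corpus, k=3):
    """Rank corpus lines by shared words with the query (toy stand-in for vector search)."""
    q_words = set(query.lower().split())
    scored = sorted(
        corpus,
        key=lambda doc: len(q_words & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:k]


def stuff_documents(docs, separator="\n\n"):
    """Join all retrieved documents into one context string."""
    return separator.join(docs)


def fake_llm(prompt):
    """Placeholder for the model call; it only reports the prompt size."""
    return f"[model output for a prompt of {len(prompt)} characters]"


PROMPT_TEMPLATE = (
    "Use the given context to answer the question. "
    "Context: {context}\n\nQuestion: {input}"
)


def run_chain(query, corpus):
    """Retrieve, fill the prompt, call the model, and return a result dict."""
    docs = fake_retriever(query, corpus)
    prompt = PROMPT_TEMPLATE.format(context=stuff_documents(docs), input=query)
    return {"input": query, "context": docs, "answer": fake_llm(prompt)}


corpus = [
    "A group of cats is called a “clowder.”",
    "Cats make about 100 different sounds. Dogs make only about 10.",
    "The technical term for a cat’s hairball is a “bezoar.”",
]
result = run_chain("How many different sounds can cats make?", corpus)
print(result["context"][0])
print(result["answer"])
```

Inspecting `context` alongside `answer` in this way is a convenient first step when a wrong answer needs to be traced to either retrieval or generation.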

In [13]:
# Question (queries) and answer pairs
# Please do not modify this cell.
queries = [
    "How much of a day do cats spend sleeping on average?",
    "What is the technical term for a cat's hairball?",
    "What do scientists believe caused cats to lose their sweet tooth?",
    "What is the top speed a cat can travel over short distances?",
    "What is the name of the organ in a cat's mouth that helps it smell?",
    "Which wildcat is considered the ancestor of all domestic cats?",
    "What is the group term for cats?",
    "How many different sounds can cats make?",
    "What is the name of the first cat in space?",
    "How many toes does a cat have on its back paws?"
]
answers = [
    "2/3",
    "Bezoar",
    "a mutation in a key taste receptor",
    ["31 mph", "49 km"],
    "Jacobson’s organ",
    "the African Wild Cat",
    "clowder",
    "100",
    ["Felicette", "Astrocat"],
    "four",
]
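
The `answers` list mixes single strings with lists of acceptable alternatives, and some entries use a typographic apostrophe (`Jacobson’s`). One way to check a response against either form is a small helper like the sketch below. `normalize` and `is_correct` are hypothetical names, and the apostrophe normalization is an assumption that makes this check slightly more lenient than the evaluation loop in the next cell.

```python
def normalize(text):
    """Lowercase and map typographic apostrophes to the ASCII apostrophe."""
    return text.lower().replace("’", "'").replace("‘", "'")


def is_correct(response, expected):
    """Return True if any acceptable answer occurs as a substring of the response."""
    candidates = expected if isinstance(expected, list) else [expected]
    norm_response = normalize(response)
    return any(normalize(candidate) in norm_response for candidate in candidates)


print(is_correct('The technical term is a "bezoar."', "Bezoar"))
print(is_correct("It is called Jacobson's organ.", "Jacobson’s organ"))
print(is_correct("Up to 49 km/h over short distances.", ["31 mph", "49 km"]))
```

Because the check is a plain substring match, a response can be counted as correct even when the surrounding reasoning is wrong, so it is worth reading the responses as well as the score.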

In [17]:
counts = 0
for i, query in enumerate(queries):
    # TODO4-3: Run the RAG system
    response = chain.invoke({"input": query}) # Write your code here
    print(f"Query: {query}\nResponse: {response['answer']}")
    # The following lines perform evaluations.
    # if the answer shows up in your response, the response is considered correct.
    if isinstance(answers[i], list):
        for answer in answers[i]:
            if answer.lower() in response['answer'].lower():
                counts += 1
                print("Correct")
                break
    else:
        if answers[i].lower() in response['answer'].lower():
            counts += 1
            print("Correct")
    print()

# TODO5: Improve to let the LLM correctly answer the ten questions.
print(f"Correct numbers: {counts}")

Query: How much of a day do cats spend sleeping on average?
Response: Based on the context, if we take the information that 2/3 of every day is spent sleeping and a nine-year-old cat has been awake for only three years out of its life, it can be inferred that a nine-year-old cat spends only 1/3 of its waking hours sleeping.

Therefore, the average number of days in a nine-year-old cat's life would be approximately 9 * (1/3) = 3 days.
Correct

Query: What is the technical term for a cat's hairball?
Response: Based on the given context, I will construct a concise answer using the key points.

The technical term for a cat's hairball is a "bezoar."
Correct

Query: What do scientists believe caused cats to lose their sweet tooth?
Response: Based on the context, it appears that scientists believe cats' loss of sweetness is due to a mutation in a key taste receptor. However, this exact quote from the given context doesn't directly answer the question about what caused cats' loss of sweetness.