# RAG

# Installing requirements

In [None]:
!pip install datasets --quiet
!pip install torch  transformers accelerate bitsandbytes pypdf chromadb sentence-transformers pydantic --quiet
!pip install llama-index llama-index-embeddings-huggingface llama-index-llms-huggingface llama-index-readers-file llama-index-vector-stores-chroma llama-index-llms-anthropic --quiet
!pip install rouge-score

# Loading Dataset

Wikipédia szedetet fogok használni datasetnek, de annak a szimplifikált változatát a gyorsabb futás érdekében.

In [None]:
from datasets import load_dataset
dataset =load_dataset("wikipedia", "20220301.simple",trust_remote_code=True)

# SET up model

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM

import torch
import sys
import chromadb
from llama_index.core import VectorStoreIndex, download_loader, ServiceContext, Settings
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.core.storage.storage_context import StorageContext
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from transformers import BitsAndBytesConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, StoppingCriteria, StoppingCriteriaList
from llama_index.core import PromptTemplate
from llama_index.llms.huggingface import HuggingFaceLLM
from pathlib import Path
import os
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from llama_index.core.postprocessor.llm_rerank import LLMRerank
from llama_index.core.workflow import (
    Context,
    Workflow,
    StartEvent,
    StopEvent,
    step,
)
from llama_index.core.response.pprint_utils import pprint_response
from llama_index.core.postprocessor import SentenceTransformerRerank

Itt létrehozom a modelt (A zephyr 7b beta modellt választottam).
Használok quantifikációt, hogy könnyebben fusson a colab gépen.
Ezenkívül a paramétereket olyanra állítottam, hogy pontosabb, fókuszált válaszokat adjon.

In [None]:
import torch
from transformers import BitsAndBytesConfig
from llama_index.core.prompts import PromptTemplate
from llama_index.llms.huggingface import HuggingFaceLLM

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

llm = HuggingFaceLLM(
    model_name="HuggingFaceH4/zephyr-7b-beta",
    tokenizer_name="HuggingFaceH4/zephyr-7b-beta",
    query_wrapper_prompt=PromptTemplate("<|system|>\n</s>\n<|user|>\n{query_str}</s>\n<|assistant|>\n"),
    context_window=3900,
    max_new_tokens=256,
    model_kwargs={"quantization_config": quantization_config},
    # tokenizer_kwargs={},
    generate_kwargs={"temperature": 0.5, "top_k": 25, "top_p": 0.8, "do_sample": True, "pad_token_id": tokenizer.pad_token_id},

    device_map="auto",
)



Beállítom az embedding modellt és a változókat.

In [4]:
from llama_index.core import Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

Settings.llm = llm
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
Settings.chunk_size = 1024
Settings.chunk_overlap = 50

Így válaszol a LLM RAG nélkül.

In [5]:
print(llm.complete("What happened on 20 of April?").text)

Here are some significant events that occurred on April 20 in different years:

- 1611: The Dutch East India Company (VOC) was founded in the Netherlands.

- 1864: During the American Civil War, Union forces led by General William Tecumseh Sherman captured the city of Macon, Georgia.

- 1896: The first modern Olympic Games opened in Athens, Greece.

- 1916: During World War I, Irish nationalists launched the Easter Rising, an unsuccessful uprising against British rule.

- 1920: The League of Nations, the first international organization for collective security, was established in the aftermath of World War I.

- 1932: Mahatma Gandhi began his famous Salt March in India, protesting against the British salt tax.

- 1961: Yuri Gagarin became the first human to travel to space, orbiting the Earth in the Vostok 1 spacecraft.

- 1968: The My Lai Massacre, a mass killing of Vietnamese civilians by


# Dataset feldolgozása
Az adatokat átalakítom Document formára, és utána elmentjük a vectorstoreindexbe őket. (A folyamat közben fel lesznek darabolva, és beágyazva, hogy lehessen később keresni belőle)

Az adatokat ChromeVectorStoreban tároljuk.

In [6]:
from llama_index.core import Document
from llama_index.core import VectorStoreIndex


documents = [
    Document(text=f"{row['title']}\n{row['text']}", id = f"doc_id_{i}")
    for i, row in enumerate(dataset["train"])
]

In [7]:
#Gyorsabb futás érdekében csak az adatok részét használjuk
documents = documents[:10000]

In [8]:
client = chromadb.PersistentClient(path="./test")
collection = client.get_or_create_collection(name="firstcollection5")

In [9]:
# Set up ChromaVectorStore and load in data
vector_store = ChromaVectorStore(chroma_collection=collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

In [None]:
# Create the VectorStoreIndex from the documents
index = VectorStoreIndex.from_documents(
    documents, show_progress=True,  verbose = True, storage_context =storage_context)

#Eredmény

Itt látszani fog, hogyan használja a RAG-t. Felrakjuk a kérdést, ez alapján a kérdéshez közel álló indexeket előveszi és berakja a contextbe, ami alapján válaszol az LLM.

In [17]:
query="What is the similarity between December and April?"

query_engine =index.as_query_engine(similarity_top_k=5)


import time

start_time = time.time()

response = query_engine.query(query)

end_time = time.time()
print(f"Elapsed time: {end_time - start_time:.2f} seconds")
pprint_response(response)

Elapsed time: 8.85 seconds
Final Response: The similarity between December and April, as
mentioned in the context information, is that both months end on the
same day of the week, which is a Sunday in leap years and a Saturday
in common years. In other words, the last day of December and the last
day of April fall on the same weekday.


Itt látható a context amiből válaszol. Pár context fölösleges benne.

In [18]:
response

Response(response='The similarity between December and April, as mentioned in the context information, is that both months end on the same day of the week, which is a Sunday in leap years and a Saturday in common years. In other words, the last day of December and the last day of April fall on the same weekday.', source_nodes=[NodeWithScore(node=TextNode(id_='f72b30d1-95b5-4af8-b8bf-481c0e721074', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='a706d2d2-eb02-4512-962c-79fe2181f267', node_type=<ObjectType.DOCUMENT: '4'>, metadata={}, hash='3d7e18e0531257c7affc659f2601884f851d58f7a1d4884e121021f2fee544b1'), <NodeRelationship.NEXT: '3'>: RelatedNodeInfo(node_id='dccf1bd8-c788-4604-810b-aa140308ebfe', node_type=<ObjectType.TEXT: '1'>, metadata={}, hash='a516f059a359cd6f469d172939deede77d7f56f5e778973a2ca9a1e2ef645c2d')}, text='April\nApril is the fourth month of the year in 

Itt reranking segítségével leszűkitjük a contextet (ebben az esetben 1-re, amitől a példa jól látszik, de valós esetben nem kell ilyen kicsire csökkenteni.)

In [19]:
rerank = SentenceTransformerRerank( model="cross-encoder/ms-marco-MiniLM-L-2-v2", top_n=1)

In [20]:
query_engine = index.as_query_engine(similarity_top_k=5, node_postprocessors=[rerank] )

start_time = time.time()

response = query_engine.query(query)

end_time = time.time()
print(f"Elapsed time: {end_time - start_time:.2f} seconds")

response



Elapsed time: 4.22 seconds


Response(response="December and April both end on the same day of the week. This is because each month's last day is exactly 35 weeks (245 days) apart.", source_nodes=[NodeWithScore(node=TextNode(id_='4bf3bc0b-aa5b-4d07-9711-2bec85d6cb32', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='8e1eb27d-731d-4929-a8db-e0d28d3dc176', node_type=<ObjectType.DOCUMENT: '4'>, metadata={}, hash='123ce736a765ded1a48581ad7ccd8bcac1337ed6f81f505a31ce243b2833a87c'), <NodeRelationship.NEXT: '3'>: RelatedNodeInfo(node_id='619a9a74-9016-41d6-b818-4c869e916a06', node_type=<ObjectType.TEXT: '1'>, metadata={}, hash='ed65d1204aabebf228306973511f22574e4002cba35be37a6d6fbd28d7bc41e4')}, text='December\nDecember (Dec.) is the twelfth and last month of the year in the Gregorian calendar, with 31 days, coming between November and January. With the name of the month coming from the Latin decem for "ten

Itt létrehozzuk a chatbotot, adunk promptot neki, hogy milyen magatartást/választ várunk tőle.

In [15]:
from llama_index.core.memory import ChatMemoryBuffer
memory = ChatMemoryBuffer.from_defaults(token_limit=1500)
chat_engine = index.as_chat_engine(chat_mode="context", verbose=True, memory=memory,
    system_prompt=(
        "You are a chatbot, you have to answer the questions asked. Only use the context provided, dont use any previously known information, do not hallucinate."
    ),
    node_postprocessors=[rerank])

#Chatbot
És itt lehet beszélgetni a chatbottal, ami emlékezik a beszélgetés egy részére, és a "bye"-al ki lehet lépni, ami után elfelejti az eddigi beszégetést.

In [16]:
print("If you want to leave the conversation, say bye \n______________________________\n")
while True:
  user_input = input("Enter your query: ")
  if user_input.lower() == "bye":
    break
  response = chat_engine.chat(user_input)
  pprint_response(response)
  print("\n_____________________________\n")
chat_engine.reset()

If you want to leave the conversation, say bye 
______________________________

Enter your query: What is the similarity between December and April?
Final Response: According to the context provided, December and April
both end on the same day of the week. This means that if December 31st
falls on a Wednesday, for example, then April 30th will also fall on a
Wednesday, as each other's last days are exactly 35 weeks (245 days)
apart.

_____________________________

Enter your query: and what of February?
Final Response: In the Northern Hemisphere, February is a winter
month, similar to December and January. In the Southern Hemisphere, it
is a summer month, similar to December and January as well. However,
the similarity in terms of the day of the week on which December 31st
and April 30th fall does not apply to February, as the number of days
between December 31st and February 28th (or 29th in leap years) can
vary, resulting in different weekdays for February 28th and the
corresponding 

#Válaszok értékelése:
1.: A felhasználóktól meg lehet kérdezni hogy elégedettek voltak-e a programmal. Figyelni kell rá, hogy a felhasználók gyakrabban jeleznek vissza valamiről, ami rosszul működik, mintha valami jól.

2.: Lehet egy verifikációs adathalmazt tartani elvárt válaszokkal, és ezt össze lehet hasonlítani a kapott válasszal. (pl: ROUGE,  Recall-Oriented Understudy for Gisting Evaluation, vagy embedding alapján.)

3.: Pár, az üzlethez értő szakember leteszteli olyan kérdésekkel, amik szerintük gyakran előfordulnak.

In [35]:
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)
scores = scorer.score('The quick brown fox jumps over the lazy dog',
                      'The quick brown dog jumps on the log.')
scores

{'rouge1': Score(precision=0.75, recall=0.6666666666666666, fmeasure=0.7058823529411765),
 'rougeL': Score(precision=0.625, recall=0.5555555555555556, fmeasure=0.5882352941176471)}

#Elérhetővé tétel
Ez a program egy API ként futna belső szerveren,hogy könnyen lehessen kezelni, hogy ki és mennyi ideig férhet hozzá. Az API elérhető lehet webes felületről/alkalmazásba beépítve/mobilról is akár, (sőt kiegészíthető hangfelismerés és felolvasással is, üzleti igény függő). Mivel a chatbot használata költséges, ezért csak visszaigazolt felhasználóval szabad használni, és előfizetéstől függően változik a limit és priority ha sokan használják.

#Továbbfejlesztés
A használt modelleket/database-t letölteni és onnan betölteni.

A kód osztályokba és függvényekbe szervezése. Könnyebb bemutatni a működését ahogyan most van egy notebookból, de később össze kell szervezni könnyebb felhasználhatóság/fejlesztés/módosítás miatt.


In [38]:
import pandas as pd
from datetime import datetime


data = {
    "Question": ["What is the similarity between December and April?","and what of February?"],
    "Response": ["""According to the context provided, December and April
both end on the same day of the week. This is because December has
exactly 35 weeks (245 days) between its last day and April's first
day, which is enough time for both months to align on the same
weekday. This similarity holds true for every year, regardless of leap
years.""",
                 """According to the context provided, February begins on
the same day of the week as March and November in common years, and on
the same day of the week as August in leap years. Additionally,
February always ends on the same day of the week as January in common
years. In a leap year, February is the only month to begin and end on
the same day of the week. So, while February does not align with
December or April in terms of weekday endings, it does have
similarities in terms of weekday startings.
"""],
}

dashboard = pd.DataFrame(data)


dashboard.head()


Unnamed: 0,Question,Response
0,What is the similarity between December and Ap...,"According to the context provided, December an..."
1,and what of February?,"According to the context provided, February be..."


In [39]:
dashboard.to_csv("dashboard_input_table.csv", index=False)