## RAG Architecture

![RAG Architecture](./images/rag_screehnshot.png)


## Initialize Embedding model

In [1]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

## Pinecone setup
pip install pinecone

Add `PINECONE_API_KEY` to your `.env` file.

LangChain docs: https://python.langchain.com/docs/integrations/vectorstores/pinecone/

In [2]:
from pinecone import Pinecone, ServerlessSpec



## Initialize Vector DB


In [3]:
from dotenv import load_dotenv
import os

load_dotenv()

pc = Pinecone(api_key=os.getenv("PINECONE_API_KEY"))
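
`os.getenv` returns `None` when a variable is not set, so a missing entry in `.env` is easy to overlook. A minimal sketch of failing early with a clear message; `require_env` is a hypothetical helper (not part of `python-dotenv` or the Pinecone client), and the variable name and value in the demo are placeholders:

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise with a clear message."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your .env file")
    return value


# Demo with a placeholder variable (not a real credential).
os.environ["EXAMPLE_API_KEY"] = "placeholder"
example_key = require_env("EXAMPLE_API_KEY")
```

In this notebook the call would be `require_env("PINECONE_API_KEY")`, placed after `load_dotenv()`.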



In [4]:
import time

index_name = "langchain-test-index"  # change if desired

existing_indexes = [index_info["name"] for index_info in pc.list_indexes()]

if index_name not in existing_indexes:
    pc.create_index(
        name=index_name,
        dimension=1536,
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1"),
    )
    while not pc.describe_index(index_name).status["ready"]:
        time.sleep(1)

index = pc.Index(index_name)
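
The index is created with `dimension=1536` because `text-embedding-3-small` returns 1536-dimensional vectors by default, and with `metric="cosine"`, which ranks vectors by the angle between them rather than by their length. A small illustration of that metric in plain Python; the three-dimensional vectors are made up for the example and `cosine_similarity` is a local helper, not a Pinecone call:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors for illustration only; real embeddings here have 1536 values.
query_vec = [0.65, -0.23, 0.17]
same_direction = [1.3, -0.46, 0.34]  # query_vec scaled by 2
orthogonal = [0.23, 0.65, 0.0]       # dot product with query_vec is zero

sim_same = cosine_similarity(query_vec, same_direction)
sim_orth = cosine_similarity(query_vec, orthogonal)
```

Scaling a vector does not change its cosine similarity, so `sim_same` is 1 up to floating-point error, while `sim_orth` is 0.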

In [5]:
from langchain_pinecone import PineconeVectorStore

vector_store = PineconeVectorStore(index=index, embedding=embeddings)

## Text splitting

In [6]:
from langchain_text_splitters import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=10,
)
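
`CharacterTextSplitter` splits on a separator (`"\n\n"` by default) and then merges the pieces back together until a chunk would exceed `chunk_size` characters. The sketch below is a simplified illustration of that split-then-merge idea, not the library's implementation: `split_and_merge` is a hypothetical helper, it ignores `chunk_overlap`, and the sample text is made up.

```python
def split_and_merge(text, separator="\n\n", chunk_size=1000):
    """Split on `separator`, then greedily pack pieces into chunks of at most chunk_size."""
    chunks, current = [], ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            # Still fits (or this is the first piece of a new chunk): keep growing.
            current = candidate
        else:
            # Adding the piece would overflow: close the chunk and start a new one.
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = (
    "Token\n\nDefinition: a unit of text."
    "\n\nTokenizer\n\nDefinition: a tool that splits text."
)
demo_chunks = split_and_merge(sample, chunk_size=40)
```

Because merging is greedy, a short heading such as `Tokenizer` can end up separated from the definition that follows it.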

In [7]:

with open("./data/appendix-keywords.txt", encoding="utf-8") as f:
    file = f.read()

In [8]:
file

'Semantic Search\n\nDefinition: A vector store is a system that stores data converted to vector format. It is used for search, classification, and other data analysis tasks.\nExample: Vectors of word embeddings can be stored in a database for quick access.\nRelated keywords: embedding, database, vectorization, vectorization\n\nEmbedding\n\nDefinition: Embedding is the process of converting textual data, such as words or sentences, into a low-dimensional, continuous vector. This allows computers to understand and process the text.\nExample: Represent the word “apple” as a vector such as [0.65, -0.23, 0.17].\nRelated keywords: natural language processing, vectorization, deep learning\n\nToken\n\nDefinition: A token is a breakup of text into smaller units. These can typically be words, sentences, or phrases.\nExample: Split the sentence “I am going to school” into “I am”, “to school”, and “going”.\nAssociated keywords: tokenization, natural language processing, parsing\n\nTokenizer\n\nDef

In [9]:
chunks =  text_splitter.create_documents([file])
chunks

[Document(metadata={}, page_content='Semantic Search\n\nDefinition: A vector store is a system that stores data converted to vector format. It is used for search, classification, and other data analysis tasks.\nExample: Vectors of word embeddings can be stored in a database for quick access.\nRelated keywords: embedding, database, vectorization, vectorization\n\nEmbedding\n\nDefinition: Embedding is the process of converting textual data, such as words or sentences, into a low-dimensional, continuous vector. This allows computers to understand and process the text.\nExample: Represent the word “apple” as a vector such as [0.65, -0.23, 0.17].\nRelated keywords: natural language processing, vectorization, deep learning\n\nToken\n\nDefinition: A token is a breakup of text into smaller units. These can typically be words, sentences, or phrases.\nExample: Split the sentence “I am going to school” into “I am”, “to school”, and “going”.\nAssociated keywords: tokenization, natural language pro

In [10]:
for chunk in chunks:
    chunk.metadata = {"source": "appendix-keywords.txt"}

chunks

[Document(metadata={'source': 'appendix-keywords.txt'}, page_content='Semantic Search\n\nDefinition: A vector store is a system that stores data converted to vector format. It is used for search, classification, and other data analysis tasks.\nExample: Vectors of word embeddings can be stored in a database for quick access.\nRelated keywords: embedding, database, vectorization, vectorization\n\nEmbedding\n\nDefinition: Embedding is the process of converting textual data, such as words or sentences, into a low-dimensional, continuous vector. This allows computers to understand and process the text.\nExample: Represent the word “apple” as a vector such as [0.65, -0.23, 0.17].\nRelated keywords: natural language processing, vectorization, deep learning\n\nToken\n\nDefinition: A token is a breakup of text into smaller units. These can typically be words, sentences, or phrases.\nExample: Split the sentence “I am going to school” into “I am”, “to school”, and “going”.\nAssociated keywords: t

## Associate ids with each document

In [11]:
from uuid import uuid4


uuids = [str(uuid4()) for _ in range(len(chunks))]


In [12]:
uuids

['ffb63d4c-05aa-4b17-be28-d65a031ed75a',
 '70bd099d-d1db-488b-bd0b-2d7247b3d865',
 'a701e729-d87f-4714-810e-f640351bd498',
 'a23454ac-5b34-46f2-897d-cadb25b2948a',
 '6a2b3010-5a0a-4a71-bb73-44effc39c8cd',
 'd27bf8f4-883f-4694-8812-cf1dcf3d8e30',
 '4223cae7-00fc-44d1-a133-b2cdb5b42757',
 'b4b1da28-fcf0-4482-9187-8910b5788be7',
 '631fffc1-c90b-42f6-a8a3-979f8865c256',
 '43946758-a20a-4ffc-82af-164f2b970fcb',
 '8bab8a2f-903e-4e0a-876d-0baece5c73b3',
 '21d123bf-be1f-47f9-b176-0b2dcf7fa53c',
 'e2a1216f-fe23-464f-b92f-3a6c376ec3f9',
 'e982248e-08cf-4cf4-992e-80517b57181b']
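
`uuid4()` is random, so every run of the notebook produces new IDs, and storing the same chunks again would add them under fresh IDs. One possible alternative, sketched under the assumption that the vector store overwrites entries that share an ID, is to derive each ID from the chunk itself with `uuid5`. `stable_ids` is a hypothetical helper and the sample texts are placeholders:

```python
from uuid import NAMESPACE_URL, uuid5


def stable_ids(texts, source):
    """Derive one repeatable ID per chunk from its source, position, and content."""
    return [
        str(uuid5(NAMESPACE_URL, f"{source}#{position}:{text}"))
        for position, text in enumerate(texts)
    ]


# Placeholder chunk texts; in the notebook these would be the chunks' page_content.
sample_texts = ["first placeholder chunk", "second placeholder chunk"]
first_run = stable_ids(sample_texts, "appendix-keywords.txt")
second_run = stable_ids(sample_texts, "appendix-keywords.txt")
```

Both runs yield the same list, so repeated ingestion would target the same IDs.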

## Store embeddings in vector store

In [13]:
vector_store.add_documents(documents=chunks, ids=uuids)

['ffb63d4c-05aa-4b17-be28-d65a031ed75a',
 '70bd099d-d1db-488b-bd0b-2d7247b3d865',
 'a701e729-d87f-4714-810e-f640351bd498',
 'a23454ac-5b34-46f2-897d-cadb25b2948a',
 '6a2b3010-5a0a-4a71-bb73-44effc39c8cd',
 'd27bf8f4-883f-4694-8812-cf1dcf3d8e30',
 '4223cae7-00fc-44d1-a133-b2cdb5b42757',
 'b4b1da28-fcf0-4482-9187-8910b5788be7',
 '631fffc1-c90b-42f6-a8a3-979f8865c256',
 '43946758-a20a-4ffc-82af-164f2b970fcb',
 '8bab8a2f-903e-4e0a-876d-0baece5c73b3',
 '21d123bf-be1f-47f9-b176-0b2dcf7fa53c',
 'e2a1216f-fe23-464f-b92f-3a6c376ec3f9',
 'e982248e-08cf-4cf4-992e-80517b57181b']

## Retrieve using Similarity Search

In [14]:
results = vector_store.similarity_search(
    "What is a vector store?",
    k=2,
)
for res in results:
    print(f"* {res.page_content} [{res.metadata}]")

* Tokenizer

Definition: A tokenizer is a tool that splits text data into tokens. It is used to preprocess data in natural language processing.
Example: Split the sentence “I love programming.” into [“I”, “love”, “programming”, “.”].
Associated keywords: tokenization, natural language processing, parsing

VectorStore

Definition: A vector store is a system that stores data converted to vector format. It is used for search, classification, and other data analysis tasks.
Example: Vectors of word embeddings can be stored in a database for quick access.
Related keywords: embedding, database, vectorization, vectorization

SQL

Definition: SQL(Structured Query Language) is a programming language for managing data in a database. You can query, modify, insert, delete, and more data.
Example: SELECT * FROM users WHERE age > 18; looks up information about users who are 18 years old or older.
Associated keywords: database, query, data management, data management

CSV [{'source': 'appendix-keyword

## Retrieve using similarity search and score

In [15]:
results = vector_store.similarity_search_with_score(
    "Will it be hot tomorrow?", k=2
)
for res, score in results:
    print(f"* [SIM={score:3f}] {res.page_content} [{res.metadata}]")

* [SIM=0.087242] Definition: Transformers are a type of deep learning model used in natural language processing, mainly for translation, summarization, text generation, etc. It is based on the Attention mechanism.
Example: Google Translator uses transformer models to perform translations between different languages.
Related keywords: Deep learning, Natural language processing, Attention

HuggingFace

Definition: HuggingFace is a library that provides a variety of pre-trained models and tools for natural language processing. It helps researchers and developers to easily perform NLP tasks.
Example: You can use HuggingFace's Transformers library to perform tasks such as sentiment analysis, text generation, and more.
Related keywords: natural language processing, deep learning, libraries

Digital Transformation [{'source': 'appendix-keywords.txt'}]
* [SIM=0.072915] Definition: Structured data is data that is organized according to a set format or schema. It can be easily searched and analy

## Using custom retriever
- similarity score
- threshold for the similarity score
- number of documents to retrieve (k)

In [16]:
retriever = vector_store.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={"k": 5, "score_threshold": 0.5},
)
retriever.invoke("What is a vector store?")

[Document(id='70bd099d-d1db-488b-bd0b-2d7247b3d865', metadata={'source': 'appendix-keywords.txt'}, page_content='Tokenizer\n\nDefinition: A tokenizer is a tool that splits text data into tokens. It is used to preprocess data in natural language processing.\nExample: Split the sentence “I love programming.” into [“I”, “love”, “programming”, “.”].\nAssociated keywords: tokenization, natural language processing, parsing\n\nVectorStore\n\nDefinition: A vector store is a system that stores data converted to vector format. It is used for search, classification, and other data analysis tasks.\nExample: Vectors of word embeddings can be stored in a database for quick access.\nRelated keywords: embedding, database, vectorization, vectorization\n\nSQL\n\nDefinition: SQL(Structured Query Language) is a programming language for managing data in a database. You can query, modify, insert, delete, and more data.\nExample: SELECT * FROM users WHERE age > 18; looks up information about users who are 18
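
Conceptually, `search_type="similarity_score_threshold"` fetches up to `k` candidates and then drops any whose relevance score falls below `score_threshold`. A plain-Python sketch of that filtering step; `filter_by_score` is a hypothetical helper rather than the retriever's code, and the scores are made up, not taken from the index above:

```python
def filter_by_score(scored_docs, k, score_threshold):
    """Keep at most k (doc, score) pairs whose score is at least score_threshold."""
    ranked = sorted(scored_docs, key=lambda pair: pair[1], reverse=True)
    return [(doc, score) for doc, score in ranked if score >= score_threshold][:k]


# Made-up candidates for illustration.
candidates = [("chunk A", 0.82), ("chunk B", 0.47), ("chunk C", 0.61), ("chunk D", 0.09)]
kept = filter_by_score(candidates, k=5, score_threshold=0.5)
```

With these numbers only `chunk A` and `chunk C` survive, so the retriever can return fewer than `k` documents.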

# Custom RAG

In [18]:
from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_template(
    "Answer the question based on the context provided. If the answer is not provided in the context, say 'I don't know'."
    "\n\nHere is the context:\n{context}"
    "\n--------------------------------"
    "\n\nHere is the question provided by the user: {question}"
)


query = "What is a Tokenizer?"

context = retriever.invoke(query)




In [19]:
docs = [text_info.page_content for text_info in context]
docs

['Semantic Search\n\nDefinition: A vector store is a system that stores data converted to vector format. It is used for search, classification, and other data analysis tasks.\nExample: Vectors of word embeddings can be stored in a database for quick access.\nRelated keywords: embedding, database, vectorization, vectorization\n\nEmbedding\n\nDefinition: Embedding is the process of converting textual data, such as words or sentences, into a low-dimensional, continuous vector. This allows computers to understand and process the text.\nExample: Represent the word “apple” as a vector such as [0.65, -0.23, 0.17].\nRelated keywords: natural language processing, vectorization, deep learning\n\nToken\n\nDefinition: A token is a breakup of text into smaller units. These can typically be words, sentences, or phrases.\nExample: Split the sentence “I am going to school” into “I am”, “to school”, and “going”.\nAssociated keywords: tokenization, natural language processing, parsing\n\nTokenizer',
 'T

In [20]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

formatted_prompt = prompt.format(context=docs, question=query)

In [21]:
formatted_prompt

llm_response = llm.invoke(formatted_prompt)

llm_response


AIMessage(content='A tokenizer is a tool that splits text data into tokens. It is used to preprocess data in natural language processing. For example, it can split the sentence “I love programming.” into [“I”, “love”, “programming”, “.”].', additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 52, 'prompt_tokens': 953, 'total_tokens': 1005, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_name': 'gpt-4o-mini-2024-07-18', 'system_fingerprint': 'fp_06737a9306', 'finish_reason': 'stop', 'logprobs': None}, id='run-227a8949-8d8a-405f-b8e1-f4910bff9e1c-0', usage_metadata={'input_tokens': 953, 'output_tokens': 52, 'total_tokens': 1005, 'input_token_details': {'audio': 0, 'cache_read': 0}, 'output_token_details': {'audio': 0, 'reasoning': 0}})

In [22]:
import textwrap
print(textwrap.fill(llm_response.content, width=100))

A tokenizer is a tool that splits text data into tokens. It is used to preprocess data in natural
language processing. For example, it can split the sentence “I love programming.” into [“I”, “love”,
“programming”, “.”].
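
In cell 20, `docs` is a Python list, so `prompt.format(context=docs, ...)` inserts the list's `repr` into the prompt, including brackets, quotes, and escaped `\n` sequences. A minimal sketch of joining the chunks into one plain string first; `format_docs` is a hypothetical helper and the sample texts are placeholders standing in for `docs`:

```python
def format_docs(page_contents, separator="\n\n"):
    """Join retrieved chunk texts into a single context string."""
    return separator.join(text.strip() for text in page_contents)


# Placeholder chunk texts standing in for `docs`.
sample_docs = [
    "Tokenizer\n\nDefinition: a tool that splits text data into tokens.",
    "SQL\n\nDefinition: a language for managing data in a database.\n",
]
context_text = format_docs(sample_docs)
```

The result could then be passed as `prompt.format(context=format_docs(docs), question=query)`.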


# RetrievalQA
Link: https://python.langchain.com/docs/versions/migrating_chains/retrieval_qa/


In [23]:
from langchain import hub
from langchain.chains import RetrievalQA

# See full prompt at https://smith.langchain.com/hub/rlm/rag-prompt
prompt = hub.pull("rlm/rag-prompt")





In [24]:
prompt

ChatPromptTemplate(input_variables=['context', 'question'], input_types={}, partial_variables={}, metadata={'lc_hub_owner': 'rlm', 'lc_hub_repo': 'rag-prompt', 'lc_hub_commit_hash': '50442af133e61576e74536c6556cefe1fac147cad032f4377b60c436e6cdcb6e'}, messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context', 'question'], input_types={}, partial_variables={}, template="You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.\nQuestion: {question} \nContext: {context} \nAnswer:"), additional_kwargs={})])

In [25]:
qa_chain = RetrievalQA.from_llm(
    llm, 
    retriever=vector_store.as_retriever(), 
    prompt=prompt
)

qa_chain.invoke({"query": "What is a tokenizer?"})



{'query': 'What is a tokenizer?',
 'result': 'A tokenizer is a tool that splits text data into smaller units called tokens, which are essential for preprocessing in natural language processing. For example, it can break the sentence “I love programming.” into the tokens [“I”, “love”, “programming”, “.”]. Tokenization helps in analyzing and understanding text data more effectively.'}

Conversation QA Chain: https://python.langchain.com/v0.1/docs/use_cases/chatbots/retrieval/