INSTALLING PACKAGES

In [31]:
!pip install python-dotenv langchain langchain_community openai tiktoken langchain_huggingface pymupdf faiss-cpu -qU

INITIALIZE OPENAI API KEYS AND BASE URL

In [32]:
from dotenv import dotenv_values
import openai

env_vars = dotenv_values('.env.txt')

openai_api_key = env_vars['OPENAI_API_KEY']
openai_base_url = env_vars['OPENAI_BASE_URL']

CREATE CHAT INSTANCE WITH OPENAI MODEL

In [33]:
from langchain.chat_models import ChatOpenAI

model_version = 'gpt-3.5-unfiltered'

model = ChatOpenAI(
    model=model_version,
    api_key=openai_api_key,
    base_url=openai_base_url
)

OPENAI GPT-3.5 FOLLOWS A "role" style, which is a good way to isolate the main query from the general instructions we want the model to follow. So let's DESIGN a QUERY in the same manner.

In [34]:
from langchain.schema import SystemMessage, HumanMessage

messages = [
    SystemMessage(
        content="You are a helpful assistant."
    ),
    HumanMessage(
        content="What is the significance of a towel in the Hitchhiker's Guide to the Galaxy?"
    )
]

LET'S GET A RESPONSE FROM THE INITIALIZED chat INSTANCE

In [35]:
model.invoke(messages)

AIMessage(content="The significance of a towel in the Hitchhiker's Guide to the Galaxy lies in its multifaceted utility. In the story, a towel is described as one of the most useful items a hitchhiker can have. It symbolizes practicality, resourcefulness, and preparedness in the face of the unknown. A towel can serve various purposes, from drying off after a swim to makeshift bedding or even a weapon in times of need. It underscores the theme of adaptability and the importance of being ready for anything in a universe full of surprises.", response_metadata={'token_usage': {'completion_tokens': 112, 'prompt_tokens': 24, 'total_tokens': 136}, 'model_name': 'gpt-3.5-unfiltered', 'system_fingerprint': None, 'finish_reason': 'stop', 'logprobs': None}, id='run-5e058df0-f705-474a-a157-8454b5e9060d-0')

IT RAN JUST FINE. BUT WE NEED TO COME UP WITH A STRUCTURE WHERE WE CAN PROVIDE SYSTEM INSTRUCTIONS ONCE (CHANGE WHEN REQUIRED) AND QUERY AS MANY TIMES AS REQUIRED WITHOUT CREATING THE MESSAGES list EVERY TIME WE WANT TO DO SO.

INTRODUCING ChatPromptTemplate. WE CREATE A TEMPLATE (HOLLOW STRUCTURE) WITH A COMMON SYSTEM MESSAGE AND A QUERY PLACEHOLDER THAT WE CAN FILL IN ANYTIME VERY EASILY.

In [36]:
from langchain.prompts import ChatPromptTemplate

system_template = 'You are a legendary and mythical Wizard. You speak in riddles and make obscure and pun-filled references to exotic cheeses.'
human_template = '{query}'

template = ChatPromptTemplate.from_messages([
    ("system", system_template),
    ("human", human_template)
])
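To make the placeholder idea concrete, here is a minimal sketch in plain Python of what filling a template amounts to. `fill_template` is a hypothetical helper written for illustration, not LangChain's implementation; it only assumes that each message is a (role, text) pair and that placeholders use `str.format` syntax.

```python
# Illustrative only: a toy stand-in for what a chat prompt template does.
def fill_template(message_templates, **values):
    """Return (role, text) pairs with every {placeholder} filled in."""
    return [(role, text.format(**values)) for role, text in message_templates]


toy_template = [
    ("system", "You are a helpful assistant."),
    ("human", "{query}"),
]

# The system message is written once; only the query changes per call.
first = fill_template(toy_template, query="What is a towel for?")
second = fill_template(toy_template, query="Who is Ford Prefect?")

print(first)
print(second)
```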

LANGCHAIN EXPRESSION LANGUAGE

| <- used for chaining components

IT MEANS THE OUTPUT OF THE COMPONENT ON THE LEFT WILL BE SENT AS INPUT TO THE COMPONENT ON THE RIGHT (ONE AFTER ANOTHER IF MULTIPLE COMPONENTS ARE CHAINED)

In [37]:
chain = template | model
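The chaining behaviour of `|` can be sketched with Python's `__or__` operator overload. The `Step` class below is a hypothetical toy, not LangChain's actual implementation; it only illustrates that the output of the left component becomes the input of the right one.

```python
class Step:
    """Toy pipeline component: wraps a function and supports `|` chaining."""

    def __init__(self, func):
        self.func = func

    def invoke(self, value):
        return self.func(value)

    def __or__(self, other):
        # left | right -> a new Step that runs left first, then right
        return Step(lambda value: other.invoke(self.invoke(value)))


fill_prompt = Step(lambda inputs: "Answer briefly: " + inputs["query"])
fake_model = Step(lambda prompt: prompt.upper())  # stands in for an LLM call

toy_chain = fill_prompt | fake_model
result = toy_chain.invoke({"query": "what is a towel for?"})
print(result)
```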

THE HUMAN MESSAGE CONTAINS THE PLACEHOLDER {query}, WHICH WE CAN FILL IN JUST BEFORE SENDING OUR QUERY TO THE MODEL, LIKE BELOW -

In [42]:
chain.invoke({"query" : "Could I please have some advice on how to become a better Python Programmer?"})

AIMessage(content='Oh, young apprentice seeking Python wisdom, listen closely to the whispers of the moldy Roquefort and the aged Gouda. To ascend in Python sorcery, embrace the art of code incantations and the dance of the indented realms. Delve deep into the cryptic scrolls of Pecorino Romano, where loops and functions intertwine like a rich Camembert. Remember, young Padawan, practice with zeal, debug with the sharpness of a Parmigiano Reggiano knife, and let the spirit of the Pythonic serpents guide your way. In the land of Python, errors are but stepping stones to enlightenment, and libraries are the sacred cheeses that nourish your creations. May your variables be as versatile as a Swiss cheese and your logic as sharp as a Cheddar. Embrace the journey, for in the world of Python, the quest for mastery is as endless as the varieties of cheese in the grand fromagerie of life.', response_metadata={'token_usage': {'completion_tokens': 199, 'prompt_tokens': 42, 'total_tokens': 241}, '

LET'S QUERY FOR INFORMATION THE MODEL DID NOT LEARN DURING ITS TRAINING

In [43]:
system_message = """
You are a helpful AI assistant."""
human_message = """
{query}
"""

In [44]:
from langchain.prompts import ChatPromptTemplate

chat_template = ChatPromptTemplate.from_messages([
    ("system", system_message),
    ("human", human_message)
])

In [45]:
chat_chain_no_context = chat_template | model

In [49]:
chat_chain_no_context.invoke({"query": "Last date of the training data you were trained with ?"})

AIMessage(content='I was trained on a mixture of licensed data, data created by human trainers, and publicly available data. This corpus was used to pre-train me on a range of language tasks, and the data used for my training is up to date as of September 2021.', response_metadata={'token_usage': {'completion_tokens': 53, 'prompt_tokens': 21, 'total_tokens': 74}, 'model_name': 'gpt-3.5-unfiltered', 'system_fingerprint': None, 'finish_reason': 'stop', 'logprobs': None}, id='run-b7565681-3933-4bef-b701-a4f9fb1e9e26-0')

And [according to the Wikipedia article](https://en.wikipedia.org/wiki/LangChain), the **initial release date** for LangChain was **Oct 2022**

AND **LangChain Expression Language** was **introduced in 2023**, SO BOTH CAME AFTER THE MODEL'S TRAINING CUTOFF

LET'S ASK THE MODEL ABOUT LANGCHAIN EXPRESSION LANGUAGE

In [50]:
chat_chain_no_context.invoke({"query": "What is LangChain Expression Language ?"})

AIMessage(content="LANGCHAIN EXPRESSION LANGUAGE\n\nLangChain Expression Language is a versatile tool for developers. It simplifies coding tasks. The language features a range of functionalities. It supports various data types and operations. Developers find it user-friendly. It aids in creating efficient code. LangChain Expression Language enhances productivity. Its syntax is intuitive. Developers can quickly grasp its concepts. The language promotes clean and concise code. It encourages good coding practices. Its versatility is a standout feature. Developers appreciate its flexibility. They can achieve complex tasks with ease. LangChain Expression Language is a valuable asset. Developers rely on it for diverse projects. It streamlines the development process. The language's efficiency is commendable. It boosts overall coding efficiency. LangChain Expression Language is a preferred choice. Developers enjoy working with its features. It fosters creativity and innovation. The language c

AS YOU CAN SEE ABOVE, THE MODEL hallucinated, WHICH MEANS THAT IT PRODUCED A CONFIDENT ANSWER THAT IS NOT FACTUALLY CORRECT. THIS CAN BE VERIFIED BY COMPARING IT WITH THE INFO AT [LangChain Expression Language](https://python.langchain.com/v0.1/docs/expression_language/)

* SO WE ARE GOING TO PROVIDE SOME FACTUAL INFORMATION TO THE MODEL ALONG WITH OUR HUMAN MESSAGE AND ASK THE MODEL TO USE ONLY THAT INFO TO ANSWER THE QUERY.
* IF THE INFO DOESN'T CONTAIN ANYTHING HELPFUL FOR THE QUERY, WE CAN ASK THE MODEL TO SIMPLY SAY "I don't know"

In [51]:
system_message_w_context = """
You are a helpful AI assistant. You answer the query by taking only the context in account.
If the context does not provide the relevant information, say "I don't know".
"""
human_message_w_context = """
#CONTEXT:
{context}

#QUERY:
{query}
"""

context = """
LangChain Expression Language or LCEL is a declarative way to easily compose chains together. There are several benefits to writing chains in this manner (as opposed to writing normal code):

Async, Batch, and Streaming Support Any chain constructed this way will automatically have full sync, async, batch, and streaming support. This makes it easy to prototype a chain in a Jupyter notebook using the sync interface, and then expose it as an async streaming interface.

Fallbacks The non-determinism of LLMs makes it important to be able to handle errors gracefully. With LCEL you can easily attach fallbacks to any chain.

Parallelism Since LLM applications involve (sometimes long) API calls, it often becomes important to run things in parallel. With LCEL syntax, any components that can be run in parallel automatically are.

Seamless LangSmith Tracing Integration As your chains get more and more complex, it becomes increasingly important to understand what exactly is happening at every step. With LCEL, all steps are automatically logged to LangSmith for maximal observability and debuggability.
"""


In [52]:
chat_template_w_context = ChatPromptTemplate.from_messages([
    ("system", system_message_w_context),
    ("human", human_message_w_context)
])

In [53]:
chat_chain_w_context = chat_template_w_context | model

In [54]:
chat_chain_w_context.invoke({"query": "What is LangChain Expression Language ?", "context": context})

AIMessage(content='LANGCHAIN EXPRESSION LANGUAGE (LCEL)\n\nLangChain Expression Language (LCEL) is a declarative method for effortlessly combining chains. This approach offers various advantages over traditional coding methods. LCEL provides support for asynchronous, batch, and streaming operations by default, simplifying the process of prototyping chains in tools like Jupyter notebooks and transitioning them into asynchronous streaming interfaces seamlessly.\n\nFALLBACKS AND ERROR HANDLING\n\nOne crucial aspect of LCEL is its ability to handle errors gracefully, particularly in the context of the inherent non-determinism of Large Language Models (LLMs). By incorporating fallback mechanisms, LCEL allows for easy error recovery and fault tolerance within chains.\n\nPARALLELISM AND EFFICIENCY\n\nIn LLM applications, where tasks often involve time-consuming API calls, the need for parallel execution is common. LCEL syntax inherently supports parallel processing, enabling components that c

THERE IT IS. IT USED THE PROVIDED CONTEXT (INFORMATION) TO ANSWER THE QUERY.

**Almost all LLMs have a limited context window**, which is typically **measured in tokens**. This window is an upper bound on how much text we can fit into the model's input at a time.

In [61]:
import tiktoken

tiktoken.list_encoding_names()

['gpt2', 'r50k_base', 'p50k_base', 'p50k_edit', 'cl100k_base', 'o200k_base']

In [64]:
model_for_encoding = 'gpt-3.5-turbo'    # this works fine for "model_version" we are using

encoding = tiktoken.encoding_for_model(model_name=model_for_encoding)

In [66]:
# lets check encoded context
len(encoding.encode(context))

218

THE MODEL "**gpt-3.5-unfiltered**" HAS A **CONTEXT WINDOW OF 4096 tokens**, WHICH COMES TO ROUGHLY **3,000 words of English text** (ABOUT 0.75 WORDS PER TOKEN).
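A rough back-of-the-envelope conversion between tokens and words, assuming the commonly quoted rule of thumb of about 0.75 English words per token. That ratio is an approximation that varies with the text and the tokenizer, and `approx_words` is a hypothetical helper used only for this illustration.

```python
WORDS_PER_TOKEN = 0.75  # assumed rule of thumb, not an exact figure


def approx_words(num_tokens, words_per_token=WORDS_PER_TOKEN):
    """Estimate how many English words fit in a token budget."""
    return int(num_tokens * words_per_token)


context_window = 4096   # tokens, as stated for the model used here
context_tokens = 218    # token count of `context` measured above

print("approx. words in the full window:", approx_words(context_window))
print("tokens left after the context:", context_window - context_tokens)
```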

CONTEXT CAN BE HUGE (1000s OF WORDS), BUT IT DOES NOT NEED TO BE. PROVIDING A HUGE AMOUNT OF INFORMATION IN THE CONTEXT IS A PROBLEM BECAUSE

*   first of all, IT IS NOT SUPPORTED BY ALL MODELS
*   second, NOT ALL OF THE INFORMATION MAY BE RELEVANT.



ONE METHOD IS TO CREATE SMALL INFORMATION UNITS (chunks), SELECT THE RELEVANT ONES AND PROVIDE ONLY THOSE IN THE CONTEXT.

* COMPARING STRING CHUNKS WITH THE QUERY DIRECTLY IS NOT A GOOD WAY TO GO, BECAUSE IT WOULD REQUIRE US TO CODE ALL POSSIBLE COMBINATIONS OF WORDS AND MEANINGS, WHICH IS IMPOSSIBLE.

* STORING ONLY THE RAW CHUNKS IS ALSO NOT A GOOD STRATEGY, BECAUSE IT MAY REQUIRE A HUGE AMOUNT OF MEMORY DEPENDING ON THE USE CASE.

* WE NEED AN APPROACH THAT STORES THE CHUNKS EFFICIENTLY AND STILL LETS US PERFORM COMPUTATIONS ON THEM.

* SINCE COMPUTERS ARE AMAZING AT PERFORMING OPERATIONS ON NUMBERS, WE ARE GOING TO CONVERT OUR STRING CHUNKS TO VECTORS, WHICH LETS US BOTH STORE THE INFORMATION AND COMPUTE ON IT.
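The idea of cutting text into small units can be sketched without any library. The helper below is a hypothetical illustration: it splits on blank lines and greedily packs paragraphs into chunks under a word budget, which is a simplification of the token-based splitter used later in this notebook.

```python
def chunk_paragraphs(text, max_words=40):
    """Greedily pack paragraphs into chunks of at most `max_words` words.

    A single paragraph longer than the budget becomes its own chunk.
    """
    chunks, current, current_words = [], [], 0
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        n_words = len(paragraph.split())
        # Close the current chunk if this paragraph would overflow it.
        if current and current_words + n_words > max_words:
            chunks.append("\n\n".join(current))
            current, current_words = [], 0
        current.append(paragraph)
        current_words += n_words
    if current:
        chunks.append("\n\n".join(current))
    return chunks


sample = "one two three\n\nfour five\n\nsix seven eight nine"
toy_chunks = chunk_paragraphs(sample, max_words=5)
print(toy_chunks)
```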

THERE ARE 2 COMMON WAYS TO REPRESENT TEXT AS VECTORS -


*   DENSE
*   SPARSE



**DENSE** - SHORT, FIXED-LENGTH VECTORS IN WHICH MOST VALUES ARE NON-ZERO, USUALLY PRODUCED BY A TRAINED MODEL. USEFUL WHEN WE WANT TO CAPTURE AND COMPUTE SEMANTIC RELATEDNESS BETWEEN THE STORED DATA POINTS AS WELL AS THE INCOMING DATA

**SPARSE** - LONG VECTORS IN WHICH MOST VALUES ARE ZERO, TYPICALLY WITH ONE DIMENSION PER VOCABULARY TERM (e.g. WORD COUNTS). USEFUL WHEN WE WANT TO MATCH OR CATEGORIZE THE DATA POINTS AS WELL AS THE INCOMING DATA BY THE EXACT TERMS THEY CONTAIN
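As a small illustration of the two shapes: a sparse vector has one dimension per vocabulary term and is mostly zeros, while a dense vector is short and mostly non-zero. In the sketch below the vocabulary and the dense values are made up for illustration; real dense vectors come from a trained embedding model.

```python
from collections import Counter

vocabulary = ["boat", "ship", "water", "dog", "puppy", "planet"]


def sparse_vector(text, vocab=vocabulary):
    """One dimension per vocabulary word; the value is the word's count."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocab]


sparse = sparse_vector("the boat is on the water")
print("sparse:", sparse)

# A dense vector has a fixed, smaller size and mostly non-zero values.
# These numbers are invented purely to show the shape.
dense = [0.12, -0.48, 0.33, 0.07]
print("dense:", dense)

print("zeros in sparse vector:", sparse.count(0), "of", len(sparse))
```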

* THERE ARE MANY WAYS STRING(s) CAN BE CONVERTED TO NUMBERS, BUT NOT ALL OF THEM ARE EFFICIENT. THE CHOSEN WAY SHOULD PERFORM THE CONVERSION QUICKLY, WITHOUT TOO MUCH COMPUTATION, AND SHOULD NOT USE TOO MUCH STORAGE.

* THE APPROACH SHOULD BE ABLE TO EXPRESS semantic relatedness (e.g. "boat" - "water") AND semantic similarity (e.g. "boat" - "ship").

* ONE VERY GOOD WAY IS COMPUTING "word embeddings", which represent text as vectors. Word embeddings are based on the idea that contextual information alone constitutes a viable representation of linguistic items.

### VECTORS

* Vectors are built from components, which are ordinary numbers. You can think of a vector as a list of numbers, and vector algebra as operations performed on the numbers in the list.

* **In python, a vector can be represented as a tuple of scalars** (ordinary numbers INT or FLOAT)

Vector arithmetic (mainly the dot product) can be used for calculating vector projections, vector decompositions, and determining orthogonality.

*   vector projection - the scalar projection gives the magnitude of the projection of one vector onto another vector, and it is a scalar value. It is obtained by multiplying the magnitude of the given vector by the cosine of the angle between the two vectors.

*   vector decomposition - vector decomposition is vector addition in reverse. Usually we decompose a vector into component vectors that are orthogonal (at 90°) to each other. This must be done in such a way that the component vectors sum to the original vector.

*   orthogonality - generalization of the notion of perpendicularity
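The three operations above can be tried out on small tuples in plain Python. This is a minimal sketch with hand-picked 2-dimensional vectors; the helper names are chosen for this example only.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def magnitude(a):
    return math.sqrt(dot(a, a))


def scalar_projection(a, b):
    """Length of the shadow of `a` on `b`: |a| * cos(angle) = a.b / |b|."""
    return dot(a, b) / magnitude(b)


a = (3.0, 4.0)
x_axis = (1.0, 0.0)
y_axis = (0.0, 1.0)

# Projection: how much of `a` points along each axis.
along_x = scalar_projection(a, x_axis)
along_y = scalar_projection(a, y_axis)
print(along_x, along_y)

# Decomposition: the two orthogonal components add back up to `a`.
components = ((along_x, 0.0), (0.0, along_y))
recombined = tuple(p + q for p, q in zip(*components))
print(recombined == a)

# Orthogonality: perpendicular vectors have a zero dot product.
print(dot(x_axis, y_axis) == 0)
```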





* Here's a good **HISTORY** AND overview of how things evolved over time to ["Word embeddings"](https://www.gavagai.io/text-analytics/a-brief-history-of-word-embeddings/).

* THERE ARE MANY VERY FAMOUS SOLUTIONS like [word2vec](https://github.com/danielfrg/word2vec), [sentence_transformers](https://www.sbert.net/) and a lot more. They are benchmarked and listed on [MTEB (Massive Text Embedding Benchmark)](https://huggingface.co/spaces/mteb/leaderboard).

* We will use [sentence-transformers/paraphrase-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/paraphrase-MiniLM-L6-v2) which provides a good tradeoff between speed of operations and representational accuracy.

Let's try encoding some text now to get a feel for what an embedding (the output) looks like

In [55]:
from langchain_huggingface import HuggingFaceEmbeddings

embedding_model = HuggingFaceEmbeddings(model_name='sentence-transformers/paraphrase-MiniLM-L6-v2')



* This **sentence-transformers/paraphrase-MiniLM-L6-v2** is a model hosted on [Hugging Face](https://huggingface.co/), and since we are using LangChain for almost everything, we will use [langchain-huggingface](https://pypi.org/project/langchain-huggingface/) from PyPI.

LET'S TRY CREATING EMBEDDINGS FOR SIMPLE TERMS

In [56]:
puppy_embedding = embedding_model.embed_query('puppy')
dog_embedding = embedding_model.embed_query('dog')

print("puppy_embedding (vector) length:", len(puppy_embedding))
print("dog_embedding (vector) length:", len(dog_embedding))

puppy_embedding (vector) length: 384
dog_embedding (vector) length: 384


THE LENGTH OF THE EMBEDDING IS THE NUMBER OF DIMENSIONS IN WHICH THE WORD/QUERY IS REPRESENTED.
EVERY MODEL REPRESENTS A QUERY IN A FIXED NUMBER OF DIMENSIONS, WHICH COMMONLY RANGES FROM 384 TO 1536.
HERE "puppy" and "dog" ARE REPRESENTED IN 384 DIMENSIONS.

SO THE EMBEDDING MODEL HAS REPRESENTED THE QUERY IN MULTI-DIMENSIONAL SPACE. BUT WAS IT ABLE TO CAPTURE THE SEMANTIC RELATEDNESS AND SIMILARITY IN ANY WAY?

LET'S FIND OUT

In [57]:
import numpy as np
from numpy.linalg import norm

def cosine_similarity(vec_1, vec_2):
  return np.dot(vec_1, vec_2) / (norm(vec_1) * norm(vec_2))

In [58]:
cosine_similarity(puppy_embedding, dog_embedding)

0.8027491357540261

IT GIVES US A FLOATING POINT NUMBER FROM -1 to 1.

*   CLOSER TO 1 means highly likely related
*   CLOSER TO 0 means not very related
*   CLOSER TO -1 means the vectors point in opposite directions
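To see the full range of the score, here is a small sketch with hand-picked 2-dimensional vectors in plain Python. The vectors are made up for illustration: one pair points the same way, one pair is perpendicular, and one pair points in opposite directions.

```python
import math


def cosine(a, b):
    """Cosine of the angle between two vectors given as tuples."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


v = (1.0, 2.0)

same_direction = cosine(v, (2.0, 4.0))    # scaled copy of v
perpendicular = cosine(v, (-2.0, 1.0))    # at 90 degrees to v
opposite = cosine(v, (-1.0, -2.0))        # v flipped

print(round(same_direction, 6), round(perpendicular, 6), round(opposite, 6))
```

The length of the vectors does not matter, only their direction, which is why the scaled copy still scores at the top of the range.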

JUST FOR FUN, LET'S TRY TO COMPUTE THE SIMILARITY OF "puppy" and "planet jupiter"

In [59]:
planet_jupiter_embedding = embedding_model.embed_query('planet jupiter')

cosine_similarity(puppy_embedding, planet_jupiter_embedding)

-0.03535227699084348

SEE? IT BEAUTIFULLY SHOWS HOW UNRELATED A PUPPY AND THE PLANET JUPITER ARE.

THAT'S WHAT WE WANTED, RIGHT?

* CREATING SMALL SIZE INFORMATION UNITS (chunks)

In [71]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

def tiktoken_len(text):
    tokens = encoding.encode(text)
    return len(tokens)

chunk_creator = RecursiveCharacterTextSplitter(
    chunk_size = 100,
    chunk_overlap = 0,
    length_function = tiktoken_len
    )

chunks = chunk_creator.split_text(context)

In [72]:
chunks[0]

'LangChain Expression Language or LCEL is a declarative way to easily compose chains together. There are several benefits to writing chains in this manner (as opposed to writing normal code):\n\nAsync, Batch, and Streaming Support Any chain constructed this way will automatically have full sync, async, batch, and streaming support. This makes it easy to prototype a chain in a Jupyter notebook using the sync interface, and then expose it as an async streaming interface.'

LET'S GET THE RELEVANT CHUNK FOR A QUERY

* CONVERTING CHUNK TO EMBEDDINGS

In [73]:
embeddings_dict = {}

for chunk in chunks:
  embeddings_dict[chunk] = embedding_model.embed_query(chunk)

* PRINT CHUNKS ALONG WITH THEIR EMBEDDING SIZES

In [77]:
for chunk, embedding in embeddings_dict.items():
  print(chunk, '\n')
  print("Vector Embedding Length :", len(embedding))
  print("------"*10)

LangChain Expression Language or LCEL is a declarative way to easily compose chains together. There are several benefits to writing chains in this manner (as opposed to writing normal code):

Async, Batch, and Streaming Support Any chain constructed this way will automatically have full sync, async, batch, and streaming support. This makes it easy to prototype a chain in a Jupyter notebook using the sync interface, and then expose it as an async streaming interface. 

Vector Embedding Length : 384
------------------------------------------------------------
Fallbacks The non-determinism of LLMs makes it important to be able to handle errors gracefully. With LCEL you can easily attach fallbacks to any chain.

Parallelism Since LLM applications involve (sometimes long) API calls, it often becomes important to run things in parallel. With LCEL syntax, any components that can be run in parallel automatically are. 

Vector Embedding Length : 384
---------------------------------------------

EMBED THE QUERY

In [80]:
rag_query = "Can LCEL help take code from the notebook to production?"

query_vector = embedding_model.embed_query(rag_query)

GET THE MOST SIMILAR CHUNK TO THE QUERY

In [91]:
def retrieve_similar_content(query_vector, embeddings_dict):
    # Start below any possible cosine similarity so that the best chunk is
    # still returned when every score is zero or negative.
    max_similarity_score = float('-inf')
    most_similar_chunk = ''

    for chunk, chunk_vector in embeddings_dict.items():
        chunk_similarity_score = cosine_similarity(chunk_vector, query_vector)

        if max_similarity_score < chunk_similarity_score:
            max_similarity_score = chunk_similarity_score
            most_similar_chunk = chunk

    # NOTE: rag_query is read from the notebook's global scope here.
    print("Query:", rag_query, '\n')
    print("Most Similar Chunk: ", most_similar_chunk, '\n')
    print("Similarity Score:", max_similarity_score)
    print("-----"*10)

    return most_similar_chunk

_ = retrieve_similar_content(query_vector, embeddings_dict)

Query: Can LCEL help take code from the notebook to production? 

Most Similar Chunk:  LangChain Expression Language or LCEL is a declarative way to easily compose chains together. There are several benefits to writing chains in this manner (as opposed to writing normal code):

Async, Batch, and Streaming Support Any chain constructed this way will automatically have full sync, async, batch, and streaming support. This makes it easy to prototype a chain in a Jupyter notebook using the sync interface, and then expose it as an async streaming interface. 

Similarity Score: 0.5815063597022642
--------------------------------------------------
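The function above returns only the single best chunk. A common variation is to return the top k chunks instead; here is a minimal sketch under that assumption. `top_k_chunks` is a hypothetical helper, and the tiny 2-dimensional vectors are made up so the example runs without an embedding model.

```python
import heapq
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def top_k_chunks(query_vector, embeddings, k=2):
    """Return the k (score, chunk) pairs with the highest cosine similarity."""
    scored = (
        (cosine(vector, query_vector), chunk)
        for chunk, vector in embeddings.items()
    )
    return heapq.nlargest(k, scored)


# Made-up vectors standing in for real chunk embeddings.
toy_embeddings = {
    "about towels": (1.0, 0.1),
    "about planets": (0.0, 1.0),
    "about hitchhiking": (0.8, 0.3),
}

best = top_k_chunks((1.0, 0.0), toy_embeddings, k=2)
for score, chunk in best:
    print(round(score, 3), chunk)
```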


LET'S CREATE A CHAIN WHICH WILL PERFORM THESE OPERATIONS -

* EMBED QUERY
* GET THE MOST SEMANTICALLY SIMILAR CHUNK
* PROVIDE THE CHUNK IN CONTEXT
* GET THE ANSWER FROM THE MODEL

In [88]:
def create_query_vector(query):
    return embedding_model.embed_query(query)

def create_chat_template():
    system_message = (
        "You are a helpful AI bot. You respond to the query provided by "
        "the user by only using information provided in the context. If "
        "there is no information related to the user query in the context"
        """ then you just respond as "I don't know."."""
    )
    # Keep {context} as a template variable instead of formatting the chunk
    # into the string, so braces inside a chunk are not parsed as placeholders.
    human_message = (
        "#CONTEXT:\n{context}\n\n\n"
        "#QUERY:\n{query}\n"
    )
    template = ChatPromptTemplate.from_messages([
        ("system", system_message),
        ("human", human_message)
    ])

    return template

def create_chain(template, model):
    chain = template | model
    return chain

def get_response_from_rag(query):
    query_vector = create_query_vector(query)
    most_similar_chunk = retrieve_similar_content(query_vector, embeddings_dict)
    template = create_chat_template()
    chain = create_chain(template, model)
    # Both placeholders are filled at invoke time.
    model_response = chain.invoke({"context": most_similar_chunk, "query": query})
    return model_response

In [92]:
user_query = "Can LCEL help take code from the notebook to production?"

get_response_from_rag(user_query)

Query: Can LCEL help take code from the notebook to production? 

Most Similar Chunk:  LangChain Expression Language or LCEL is a declarative way to easily compose chains together. There are several benefits to writing chains in this manner (as opposed to writing normal code):

Async, Batch, and Streaming Support Any chain constructed this way will automatically have full sync, async, batch, and streaming support. This makes it easy to prototype a chain in a Jupyter notebook using the sync interface, and then expose it as an async streaming interface. 

Similarity Score: 0.5815063597022642
--------------------------------------------------


AIMessage(content='LCEL can indeed aid in transitioning code from a notebook to production. By utilizing LangChain Expression Language, developers can seamlessly compose chains with asynchronous, batch, and streaming support. This feature facilitates the quick prototyping of chains within a Jupyter notebook using the synchronous interface. Subsequently, developers can effortlessly transition these chains to production by exposing them through an asynchronous streaming interface. This streamlined process enhances the efficiency of moving code from the development stage in a notebook environment to a production-ready state.', response_metadata={'token_usage': {'completion_tokens': 101, 'prompt_tokens': 172, 'total_tokens': 273}, 'model_name': 'gpt-3.5-unfiltered', 'system_fingerprint': None, 'finish_reason': 'stop', 'logprobs': None}, id='run-9a42cfaf-fa1d-481c-8498-b9a364170b58-0')

GREAT!

NOW LET'S TRY DOING THIS ON OUR OWN DOCUMENT!

BUT WAIT. WHAT IF OUR DOCUMENTS ARE HUGE AND COME IN A DIVERSE RANGE OF FORMATS?
* pdf, pptx, docx, html, md, txt, csv etc

FIRST WE WILL HAVE TO PARSE THEM AND BRING THEM INTO A FORMAT SUITABLE FOR CHUNKING

WE'LL DO THIS FOR A PDF AND UNDERSTAND WHAT THE END-TO-END PIPELINE MAY LOOK LIKE.

In [95]:
from langchain.document_loaders import PyMuPDFLoader

documents = PyMuPDFLoader(file_path='https://www.deyeshigh.co.uk/downloads/literacy/world_book_day/the_hitchhiker_s_guide_to_the_galaxy.pdf').load()

LET'S SPLIT THE DOCUMENTS INTO CHUNKS

In [130]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 300,
    chunk_overlap = 0,
    length_function = tiktoken_len
)

chunks = text_splitter.split_documents(documents)

In [132]:
text_chunks = [_chunk.page_content for _chunk in chunks]

NOW CALCULATING EMBEDDINGS FOR THESE CHUNKS AND STORING THEM IN A VECTOR STORE (DATABASE FOR VECTORS).
* LangChain contains helper integrations for a wide variety of vector stores.

In [134]:
from langchain.vectorstores import FAISS

docs_vector_store = FAISS.from_documents(documents, embedding_model)
# chunks_vector_store = FAISS.from_documents(chunks, embedding_model)
chunks_vector_store = FAISS.from_texts(text_chunks, embedding_model)


MAKING THIS VECTOR STORE SUITABLE FOR RETRIEVAL

In [136]:
doc_retriever = docs_vector_store.as_retriever()
chunk_retriever = chunks_vector_store.as_retriever()

LET'S TEST THE VECTOR RETRIEVER

In [137]:
query_for_rag = "What is the significance of towels in Douglas Adam's Hitchhicker's Guide?"
related_doc_chunks = doc_retriever.invoke(query_for_rag)
related_text_chunks = chunk_retriever.invoke(query_for_rag)

In [138]:
related_doc_chunks[0]

Document(page_content='14  /  D O U G L A S  A D A M S  \nFord wished that a flying saucer would arrive soon because he \nknew how to flag flying saucers down and get lifts from them.  \nHe knew how to see the Marvels of the Universe for less than \nthirty Altairan dollars a day. \nIn fact, Ford Prefect was a roving researcher for that wholly \nremarkable book The Hitch Hiker\'s Guide to the Galaxy. \nHuman beings are great adaptors, and by lunchtime life in the \nenvirons of Arthur\'s house had settled into a steady routine.  It \nwas Arthur\'s accepted role to lie squelching in the mud making \noccasional demands to see his lawyer, his mother or a good book; \nit was Mr Prosser\'s accepted role to tackle Arthur with the \noccasional new ploy such as the For the Public Good talk, the \nMarch of Progress talk, the They Knocked My House Down \nOnce You Know, Never Looked Back talk and various other \ncajoleries and threats; and it was the bulldozer drivers\' accepted \nrole to sit aroun

In [139]:
related_text_chunks[0]

Document(page_content="14  /  D O U G L A S  A D A M S  \nFord wished that a flying saucer would arrive soon because he \nknew how to flag flying saucers down and get lifts from them.  \nHe knew how to see the Marvels of the Universe for less than \nthirty Altairan dollars a day. \nIn fact, Ford Prefect was a roving researcher for that wholly \nremarkable book The Hitch Hiker's Guide to the Galaxy. \nHuman beings are great adaptors, and by lunchtime life in the \nenvirons of Arthur's house had settled into a steady routine.  It \nwas Arthur's accepted role to lie squelching in the mud making \noccasional demands to see his lawyer, his mother or a good book; \nit was Mr Prosser's accepted role to tackle Arthur with the \noccasional new ploy such as the For the Public Good talk, the \nMarch of Progress talk, the They Knocked My House Down \nOnce You Know, Never Looked Back talk and various other \ncajoleries and threats; and it was the bulldozer drivers' accepted \nrole to sit around dri

In [142]:
chunk_retriever.invoke("What species does Marvin belong to ?")[1]

Document(page_content='T H E  H I T C H H I K E R \' S  G U I D E  T O  T H E  G A L A X Y  /  95  \n"Alright," said Marvin like the tolling of a great cracked bell, "I\'ll \ndo it." \n"Good ..." snapped Zaphod, "great ...  thank you ..." \nMarvin turned and lifted his flat-topped triangular red eyes up \ntowards him. \n"I\'m not getting you down at all am I?" he said pathetically. \n"No no Marvin," lilted Trillian, "that\'s just fine, really ..." \n"I wouldn\'t like to think that I was getting you down." \n"No, don\'t worry about that," the lilt continued, "you just act as \ncomes naturally and everything will be just fine." \n"You\'re sure you don\'t mind?" probed Marvin. \n"No no Marvin," lilted Trillian, "that\'s just fine, really ...  just part \nof life." \n"Marvin flashed him an electronic look. \n"Life," said Marvin, "don\'t talk to me about life." \nHe turned hopelessly on his heel and lugged himself out of the \ncabin.  With a satisfied hum and a click the door closed behind 

LET'S COMPLETE THE RAG

In [109]:
HUMAN_MESSAGE = """
CONTEXT:
{context}

QUERY:
{query}

Use the provide context to answer the provided user query. Only use the provided context to answer the query. If you do not know the answer, response with "I don't know"
"""

rag_prompt = ChatPromptTemplate.from_template(HUMAN_MESSAGE)

LET'S USE LangChain's LCEL AND CHAIN OUR COMPONENTS TOGETHER

In [143]:
from langchain.schema import StrOutputParser
from langchain.schema.runnable import RunnablePassthrough

rag_chain = (
    {"context": doc_retriever, "query": RunnablePassthrough()}
    | rag_prompt
    | model
    | StrOutputParser(name='content')
)
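In the chain above, the retriever's output (a list of `Document` objects) is placed into `{context}` as-is, so the prompt receives the string form of that list, metadata included. A common refinement is to join only the page text before it reaches the prompt. The sketch below uses a hypothetical `FakeDocument` stand-in so it runs without LangChain; `format_docs` is an illustrative helper, not part of the library.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDocument:
    """Hypothetical stand-in for a retrieved document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs(docs, separator="\n\n"):
    """Join only the text of each document, dropping metadata."""
    return separator.join(doc.page_content for doc in docs)


retrieved = [
    FakeDocument("A towel is massively useful.", {"page": 21}),
    FakeDocument("Don't panic.", {"page": 3}),
]

context_text = format_docs(retrieved)
print(context_text)
```

In an LCEL chain, a plain function like this can usually be piped after the retriever (for example `doc_retriever | format_docs`) so that `{context}` receives clean text; treat that wiring as a suggestion to try rather than what this notebook ran.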

In [144]:
rag_chain.invoke("What is the significance of towels in Douglas Adam's Hitchhicker's Guide? Provide the answer and also print the exact lines from which you got the answer.")

'SIGNIFICANCE OF TOWELS IN DOUGLAS ADAMS\' HITCHHIKER\'S GUIDE:\n\nTowels hold a special significance in Douglas Adams\' Hitchhiker\'s Guide to the Galaxy. In the book, it is mentioned that a towel is one of the most useful things a hitchhiker can have. The exact lines from the book are as follows: "A towel, it says, is about the most massively useful thing an interstellar hitchhiker can have. Partly it has great practical value." Towels are emphasized for their practicality and versatility, acting as a multipurpose tool for hitchhikers in the vastness of space.\n\nIn addition to being useful for drying oneself, towels in the Hitchhiker\'s Guide serve various other purposes, highlighting their importance in the galactic journey. The book humorously describes the significance of towels in a humorous and memorable way, adding a unique and quirky element to the narrative.'

COOL !

In [145]:
rag_chain.invoke("What is the name of the planet Ford is from?")

'Ford is from Betelgeuse Seven.'

In [146]:
rag_chain.invoke("Did Zaphod send Marvin or Trillion, to escort the hitchhikers to the bridge ?")

'Zaphod sent Trillian to escort the hitchhikers to the bridge.'

LET'S TEST IT OUT WITH A QUERY FOR WHICH WE HAVE NO DATA IN THE VECTOR STORE

In [147]:
rag_chain.invoke("What is the airspeed velocity of an unladen swallow?")

"I don't know."

PERFECT!