### A) Importing Libraries

<b>- TextLoader, UnstructuredPDFLoader, UnstructuredURLLoader, PyPDFLoader:</b> These document loaders bring different types of documents, including plain text files, PDFs, and URLs, into the LangChain pipeline. In this notebook, only PyPDFLoader is used, to load a PDF document.

<b>- CharacterTextSplitter, RecursiveCharacterTextSplitter:</b> These are used to split the loaded documents into smaller chunks. For example, RecursiveCharacterTextSplitter tries a prioritized list of separators (e.g., paragraph breaks, line breaks, spaces) until each chunk fits the specified chunk size, and keeps the specified overlap between consecutive chunks.

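<b>- HuggingFaceEmbeddings:</b> This is used to create embeddings for the documents using a HuggingFace sentence-transformers model. With no arguments it uses `sentence-transformers/all-mpnet-base-v2`, as the output of cell 4 shows.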

<b>- FAISS:</b> FAISS (Facebook AI Similarity Search) is used as a vector store for efficient similarity search. It allows for the fast retrieval of similar documents given an embedding.

<b>- load_qa_chain:</b> This is used for setting up a question-answering chain that uses a language model to answer queries based on documents.

<b>- HuggingFaceHub:</b> This is used to interact with HuggingFace models, providing access to pre-trained models like flan-t5-xxl (which is a large language model).

<b>- VectorstoreIndexCreator:</b> This is used for creating a vector store and indexing documents using tools like ChromaDB, though it's not directly used in this code.

<b>- RetrievalQA:</b> This chain integrates document retrieval and question answering. It uses a retriever (here, one built from the FAISS vector store) to get relevant documents and a language model to answer questions based on those documents.

<b>- os / dotenv:</b> These are used for loading environment variables from a .env file, which is typically used for storing sensitive information like API tokens.
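
A missing token usually only surfaces later as an authentication error, so it can help to check for it right after the environment is loaded. The sketch below uses only the standard library. `require_env` is a hypothetical helper name (not part of `dotenv` or LangChain), and it assumes `load_dotenv()` has already been called.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if it is unset or empty.

    Hypothetical helper for illustration; not part of dotenv or LangChain.
    """
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name!r} is not set; add it to your .env file."
        )
    return value


# Usage (assumes load_dotenv() has been called beforehand):
# token = require_env("HUGGINGFACEHUB_API_TOKEN")
```

Failing early like this is a design choice; returning `None` and letting the API call fail would also work, but the error message would be less direct.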


In [1]:
from langchain_community.document_loaders import TextLoader, UnstructuredPDFLoader, UnstructuredURLLoader, PyPDFLoader
from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS  
from langchain.chains.question_answering import load_qa_chain
from langchain_community.llms import HuggingFaceHub
from langchain.indexes import VectorstoreIndexCreator
from langchain.chains import RetrievalQA
import os
from dotenv import load_dotenv
import warnings
warnings.filterwarnings("ignore")

### B) Loading and Splitting the PDF:

In retrieval-based question answering, the application retrieves contextual documents from an external dataset and passes them to the LLM as part of its execution. This is useful when we want to ask questions about specific documents (e.g., PDFs, videos, etc.). If we want to create an application to chat with our data, we first need to load the data into a format the pipeline can work with.

<p align="center"><img src="./assets_img/a1.PNG" width="60%"/></p>

Here we have a PDF document, so we use `PyPDFLoader()` to load it. LangChain also provides loaders for other sources:

- __YoutubeAudioLoader:__ Loads the audio of YouTube videos.
- __WebBaseLoader:__ Loads the content of web pages from their URLs.
- __NotionDirectoryLoader:__ Loads data exported from Notion.

<p align="center"><img src="./assets_img/a2.PNG" width="40%"/></p>


In [2]:
load_dotenv()  # Load environment variables from the .env file.
HUGGINGFACEHUB_API_TOKEN = os.getenv("HUGGINGFACEHUB_API_TOKEN")  # Read the HuggingFace API token.
loader = PyPDFLoader("./GenAI_data.pdf")  # Create a loader for the PDF file.
pages = loader.load_and_split()  # Load the PDF and split it into documents (one per page here).
print('Document pages :', len(pages))
pages

Document pages : 2


[Document(metadata={'producer': 'GPL Ghostscript 10.01.1', 'creator': 'PyPDF', 'creationdate': '2024-03-07T17:18:45+01:00', 'moddate': '2024-03-07T17:18:45+01:00', 'source': './GenAI_data.pdf', 'total_pages': 2, 'page': 0, 'page_label': '1'}, page_content='Austin ¥ Boston ¥ Chicago ¥ Denver ¥ Harrisburg ¥ O lympia ¥ Sacramento ¥ Silicon Valley ¥ Washington, D.C.  \n \nARTIFICIAL INTELLIGENCE (AI) & GENERATIVE AI \n \nWhat is Artificial Intelligence? \n \nArtificial Intelligence (AI) is a field of science concerned with building machines that can reason, \nlearn, and act in such a way that would normally re quire human intelligence or that involves data \nwhose scale exceeds what humans can analyze.1 \n \nWhat is Generative Artificial Intelligence? \n \nAI has been around for decades, but the field has recently garnered significant attention due to \nadvancements in the subfield of generative AI, and the subsequent release of generative AI chatbots \nÐ like ChatGPT and Bard Ð for public
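
The output above shows that each loaded page is a record with `page_content` (the text) and `metadata` (source file, page index, and so on). As a rough mental model, the sketch below mimics that shape in plain Python. `SimpleDocument` and `pages_to_documents` are hypothetical names for illustration, not LangChain classes, and the real `Document` carries more metadata fields.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a loaded page: text plus descriptive metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def pages_to_documents(page_texts, source):
    """Wrap a list of page strings as documents with zero-based page indices."""
    total = len(page_texts)
    return [
        SimpleDocument(
            page_content=text,
            metadata={"source": source, "page": index, "total_pages": total},
        )
        for index, text in enumerate(page_texts)
    ]


# Toy input standing in for text extracted from a two-page PDF.
demo_pages = pages_to_documents(["First page text.", "Second page text."], "./example.pdf")
print(len(demo_pages), demo_pages[1].metadata["page"])
```

Keeping the metadata next to the text is what later lets an answer be traced back to a specific file and page.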

### C) Text Splitting

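<b>Document splitting:</b> Documents are split into smaller chunks so that each chunk can be embedded and retrieved on its own. The split points matter: we want chunks that keep meaningful relationships between pieces of text, e.g., a sentence should not be cut in half.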

<p align="center"><img src="./assets_img/a3.PNG" width="40%"/></p>

- The input text is split based on a defined chunk size with a defined chunk overlap. The chunk size is the maximum size of a chunk, as measured by a length function that usually counts characters or tokens.
- The chunk overlap repeats a small amount of text at the end of one chunk and the start of the next, which preserves some continuity of context between two consecutive chunks.
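
To make chunk size and chunk overlap concrete, here is a minimal sliding-window chunker in plain Python. It is a simplified sketch, not LangChain's implementation: the real splitters also respect separators, while this one cuts at fixed character offsets. `chunk_text` is a hypothetical name.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must satisfy 0 <= chunk_overlap < chunk_size")

    step = chunk_size - chunk_overlap  # How far the window advances each time.
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # The current window already reached the end of the text.
        start += step
    return chunks


print(chunk_text("abcdefghijklmnopqrstuvwxyzabcdefg", chunk_size=26, chunk_overlap=4))
```

With a chunk size of 26 and an overlap of 4, the second chunk starts 22 characters in, so the last four characters of the first chunk (`wxyz`) are repeated at the start of the second.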

<p align="center"><img src="./assets_img/a4.PNG" width="50%"/></p>

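__Examples:__ Both use `r_splitter`, a `RecursiveCharacterTextSplitter`; the outputs shown are consistent with `chunk_size=26` and `chunk_overlap=4`.
- Text that fits in a single chunk

        text1 = 'abcdefghijklmnopqrstuvwxyz'
        r_splitter.split_text(text1)
        # Output - ['abcdefghijklmnopqrstuvwxyz']
- Text longer than the chunk size

        text2 = 'abcdefghijklmnopqrstuvwxyzabcdefg'
        r_splitter.split_text(text2)
        # Output - ['abcdefghijklmnopqrstuvwxyz', 'wxyzabcdefg']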

In [3]:
# RecursiveCharacterTextSplitter: splits the text into chunks of at most 1024 characters,
# with an overlap of up to 64 characters between consecutive chunks, which helps maintain context.
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1024,
    chunk_overlap=64,
    # Plain strings for the newlines (raw strings would match a literal backslash followed by "n").
    separators=['\n\n', '\n', r'(?<=\. )', ' ', ''],
    is_separator_regex=True  # Required so that the lookbehind pattern is treated as a regex.
)

# Splitting is attempted at double newlines (\n\n), then single newlines (\n), then after ". ", then spaces, etc.
docs = text_splitter.split_documents(pages) # It splits the pages into smaller chunks.

print(len(docs))
docs

6


[Document(metadata={'producer': 'GPL Ghostscript 10.01.1', 'creator': 'PyPDF', 'creationdate': '2024-03-07T17:18:45+01:00', 'moddate': '2024-03-07T17:18:45+01:00', 'source': './GenAI_data.pdf', 'total_pages': 2, 'page': 0, 'page_label': '1'}, page_content='Austin ¥ Boston ¥ Chicago ¥ Denver ¥ Harrisburg ¥ O lympia ¥ Sacramento ¥ Silicon Valley ¥ Washington, D.C.  \n \nARTIFICIAL INTELLIGENCE (AI) & GENERATIVE AI \n \nWhat is Artificial Intelligence? \n \nArtificial Intelligence (AI) is a field of science concerned with building machines that can reason, \nlearn, and act in such a way that would normally re quire human intelligence or that involves data \nwhose scale exceeds what humans can analyze.1 \n \nWhat is Generative Artificial Intelligence? \n \nAI has been around for decades, but the field has recently garnered significant attention due to \nadvancements in the subfield of generative AI, and the subsequent release of generative AI chatbots \nÐ like ChatGPT and Bard Ð for public
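
The idea behind the separator list can be sketched in a few lines of plain Python: try the coarsest separator first, and only fall back to finer ones for pieces that are still too long. This is a simplified illustration under stated assumptions, not LangChain's algorithm. In particular it does not merge small pieces back together or add overlap, and `split_by_priority` is a hypothetical name.

```python
import re


def split_by_priority(text, separators, chunk_size):
    """Recursively split text using regex separators in priority order.

    An empty-string separator means "cut at fixed character offsets".
    """
    if len(text) <= chunk_size:
        return [text] if text else []

    for i, sep in enumerate(separators):
        if sep == "":
            # Last resort: hard cuts every chunk_size characters.
            return [text[j:j + chunk_size] for j in range(0, len(text), chunk_size)]
        if re.search(sep, text):
            pieces = [p for p in re.split(sep, text) if p]
            result = []
            for piece in pieces:
                # Oversized pieces are retried with the remaining, finer separators.
                result.extend(split_by_priority(piece, separators[i + 1:], chunk_size))
            return result

    return [text]  # No separator applies; return the text unsplit.


sample = "One. Two. Three"
print(split_by_priority(sample, [r"\n\n", r"(?<=\. )", ""], chunk_size=6))
```

The lookbehind `(?<=\. )` matches the empty position right after a period and a space, so the period stays attached to the sentence it ends.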

### D) Embeddings and Vector Database Creation

<p align="center"><img src="./assets_img/a5.PNG" width="50%"/></p>

We split up our document into small chunks and now we need to put these chunks into an index so that we are able to retrieve them easily when we want to answer questions on this document. We use embeddings and vector stores for this purpose.

- __Vector stores__ and __embeddings__ come after text splitting as we need to store our documents in an easily accessible format. Embeddings take a piece of text and create a numerical representation of the text. 
- Text with semantically similar content will have similar vectors in embedding space. Thus, we can compare embeddings (vectors) and find texts that are similar.

<p align="center"><img src="./assets_img/a6.PNG" width="40%"/> &nbsp; <img src="./assets_img/a7.PNG" width="40%"/></p>

- A vector store is a database where you can easily look up similar vectors later on. This becomes useful when we try to find documents that are relevant to a question.
- When we want to get an answer for a question, we create embeddings of the question and then we compare the embeddings of the question with all the different vectors in the vector store and pick the n most similar.
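
The lookup described above can be sketched with toy vectors in plain Python. The three-dimensional vectors are made up for illustration (real embeddings here have 768 dimensions), cosine similarity is one common choice of metric and an assumption in this sketch, and `cosine_similarity` and `top_n` are hypothetical helper names. A real vector store such as FAISS uses optimized index structures instead of a full scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_n(query_vector, store, n):
    """Return the n texts whose vectors are most similar to query_vector."""
    scored = [(cosine_similarity(query_vector, vector), text) for text, vector in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:n]]


# Toy "vector store": (text, made-up embedding) pairs.
toy_store = [
    ("cats purr", [0.9, 0.1, 0.0]),
    ("dogs bark", [0.8, 0.2, 0.1]),
    ("stock prices fell", [0.0, 0.1, 0.9]),
]
print(top_n([1.0, 0.0, 0.0], toy_store, n=2))
```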


<b>- HuggingFaceEmbeddings:</b> This model generates embeddings for the document chunks using a HuggingFace model. Embeddings are numerical representations of text that capture semantic meaning.

<b>- FAISS.from_documents():</b> This embeds the document chunks (docs) with the given embedding model and stores the resulting vectors in a FAISS vector store. FAISS allows efficient similarity search, which is crucial for retrieving relevant documents when querying.


In [4]:
embeddings = HuggingFaceEmbeddings()
print(embeddings)
db = FAISS.from_documents(docs, embeddings)
db

client=SentenceTransformer(
  (0): Transformer({'max_seq_length': 384, 'do_lower_case': False}) with Transformer model: MPNetModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
) model_name='sentence-transformers/all-mpnet-base-v2' cache_folder=None model_kwargs={} encode_kwargs={} multi_process=False show_progress=False


<langchain_community.vectorstores.faiss.FAISS at 0x2a400113650>

### E) Model Setup

We load the large FLAN-T5 model (`google/flan-t5-xxl`) from HuggingFace's model hub. The model is used for text generation tasks.

In [5]:
llm = HuggingFaceHub(
    repo_id="google/flan-t5-xxl",  # Model: large FLAN-T5 model.
    model_kwargs={
        "temperature": 1,  # Controls output randomness (higher values, more randomness).
        "max_length": 300  # Upper limit on the length of the generated text.
    },
    task="text-generation"  # Task: text generation.
)
llm

HuggingFaceHub(client=<InferenceClient(model='google/flan-t5-xxl', timeout=None)>, repo_id='google/flan-t5-xxl', task='text-generation', model_kwargs={'temperature': 1, 'max_length': 300})
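
To give some intuition for the `temperature` setting, the sketch below shows the way temperature is commonly applied when sampling: the model's raw scores (logits) are divided by the temperature before the softmax. This is a generic illustration with made-up scores, not the code the hosted model runs, and `softmax_with_temperature` is a hypothetical name.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities; higher temperature flattens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # Subtract the maximum for numerical stability.
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.0, 0.0]  # Made-up scores for three candidate tokens.
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])
```

At a low temperature most of the probability mass sits on the highest-scoring token; at a high temperature the candidates become closer to equally likely, which is why outputs look more random.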

### F) Question-Answering Chain


__Similarity Search:__ We now ask questions using similarity search and pass `k`, which specifies the number of documents to return. Retrieval matters at query time: when a query comes in, we want to retrieve the most relevant splits.

<p align="center"><img src="./assets_img/a8.PNG" width="40%"/> &nbsp; <img src="./assets_img/a9.PNG" width="40%"/> </p>


In [6]:
qa = RetrievalQA.from_chain_type(  # RetrievalQA: sets up a question-answering chain.
    llm=llm,
    chain_type="stuff",  # 'stuff' places all retrieved documents into a single prompt for the model.
    retriever=db.as_retriever(  # db.as_retriever: wraps the FAISS vector store as a retriever.
        search_kwargs={"k": 3}  # Retrieve the top k=3 most relevant documents.
    )
)
qa

RetrievalQA(verbose=False, combine_documents_chain=StuffDocumentsChain(verbose=False, llm_chain=LLMChain(verbose=False, prompt=PromptTemplate(input_variables=['context', 'question'], input_types={}, partial_variables={}, template="Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.\n\n{context}\n\nQuestion: {question}\nHelpful Answer:"), llm=HuggingFaceHub(client=<InferenceClient(model='google/flan-t5-xxl', timeout=None)>, repo_id='google/flan-t5-xxl', task='text-generation', model_kwargs={'temperature': 1, 'max_length': 300}), output_parser=StrOutputParser(), llm_kwargs={}), document_prompt=PromptTemplate(input_variables=['page_content'], input_types={}, partial_variables={}, template='{page_content}'), document_variable_name='context'), retriever=VectorStoreRetriever(tags=['FAISS', 'HuggingFaceEmbeddings'], vectorstore=<langchain_community.vectorstores.faiss.FAISS object at 
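
The "stuff" strategy is easy to picture with the prompt template shown in the output above: the retrieved chunks are joined into one context block and inserted into the template together with the question. The sketch below reproduces that idea in plain Python. `build_stuff_prompt` is a hypothetical name, and joining the chunks with a blank line is an assumption about the separator.

```python
PROMPT_TEMPLATE = (
    "Use the following pieces of context to answer the question at the end. "
    "If you don't know the answer, just say that you don't know, "
    "don't try to make up an answer.\n\n{context}\n\nQuestion: {question}\nHelpful Answer:"
)


def build_stuff_prompt(chunks, question, separator="\n\n"):
    """Join all retrieved chunks into one context block and fill the template."""
    context = separator.join(chunks)
    return PROMPT_TEMPLATE.format(context=context, question=question)


prompt = build_stuff_prompt(
    ["AI is a field of science.", "Generative AI creates new content."],
    "What is AI?",
)
print(prompt)
```

Because every retrieved chunk ends up in the same prompt, the chunk size and `k` together determine how much of the model's input budget the context consumes.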

In [7]:
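def answer_QA_chain(q):
    """Answer question q using the 3 most similar chunks from the vector store."""
    chain = load_qa_chain(llm, chain_type="stuff")
    docs = db.similarity_search(q, k=3)  # Retrieve the 3 most similar chunks.
    response = chain.run(input_documents=docs, question=q)
    # If the response includes the prompt, keep only the text after the answer marker.
    marker = "Helpful Answer:"
    start_index = response.find(marker)
    if start_index != -1:
        return response[start_index + len(marker):].strip()
    return response.strip()  # No marker found: return the full response instead of None.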
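
The string handling in `answer_QA_chain` can be checked without calling the model. The sketch below isolates it in a hypothetical helper, `extract_answer`, and returns the whole response when the marker is absent, which is an assumption about the desired fallback.

```python
def extract_answer(response, marker="Helpful Answer:"):
    """Return the text after the first occurrence of marker, or the whole response if absent."""
    _, found, after = response.partition(marker)
    return after.strip() if found else response.strip()


# A response that echoes the prompt before the answer.
echoed = "Question: Who is Chean?\nHelpful Answer: I don't know.\n"
print(extract_answer(echoed))
```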

In [8]:
query = "Can you give me brief history of AI?"
resp = answer_QA_chain(query)
print(resp)

Sure, here's a brief history of AI:

- **1950**: Alan Turing published a paper entitled "Computing Machinery and Intelligence," which introduced the concept of a "learning machine."
- **1956**: John McCarthy coined the term "artificial intelligence" at the Dartmouth Conference, marking the official birth of AI as a field of study.
- **1966**: Joseph Weizenbaum created ELIZA, one of the first chatbots, at MIT. ELIZA was designed to simulate a psychotherapist's conversation with a patient.
- **1970s-1980s**: AI research focused on expert systems, which used AI to mimic the decision-making abilities of human experts.
- **1990s**: The internet boom led to advancements in machine learning, with AI being used for tasks like image and speech recognition.
- **2010s**: Deep learning, a subset of machine learning, gained prominence. This led to significant improvements in AI's ability to understand and generate human-like text, images, and speech.
- **2020s**: The release of generative AI chatbo

In [9]:
# chain = load_qa_chain(llm, chain_type="stuff")
# query = "Can you give me brief history of AI?"
# docs = db.similarity_search(query, k=3)
# response = chain.run(input_documents=docs, question=query)
# print('Response:', response)
# response

In [10]:
query = "Who is Chean?"
resp = answer_QA_chain(query)
print(resp)

I don't know.


__The key components are:__
- Document processing: Loading and splitting text.
- Embeddings: Using HuggingFace embeddings to convert text into vectors.
- Vector store (FAISS): Storing and retrieving document vectors efficiently.
- Question answering (`load_qa_chain` / `RetrievalQA`): Answering queries with the language model, based on the documents retrieved from the vector store.

https://medium.com/@onkarmishra/using-langchain-for-question-answering-on-own-data-3af0a82789ed