## Prepare the data


### Document Cleaning

In [5]:
import regex as re
def clean_text(text):
    # Remove the initial system message if it exists
    text = re.sub(r"Setting `pad_token_id` to `eos_token_id`:\d+ for open-end generation.\n\n", "", text)

    # Replace escaped newlines and other escaped characters
    text = text.replace('\\n', '\n').replace("\\'", "'")

    # Remove URLs
    text = re.sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', '', text)

    # Remove email addresses
    text = re.sub(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', '', text)

    # Optionally, remove additional unwanted characters or sequences
    text = re.sub(r"\[Document\(page_content='", "", text)
    text = re.sub(r"'\)\]", "", text)
    text = re.sub(r"\\n", " ", text)  # Replace any remaining escaped newlines with a space
    text = re.sub(r"\n", " ", text)  # Replace real newlines with a space

    # Remove content within brackets that contains only a number, like [8] or [37]
    text = re.sub(r"\[\d+\]", "", text)

    return text.strip()
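
A quick way to sanity-check cleaning rules like these is to run them on a small hand-made string. The sketch below uses the standard-library `re` module and a made-up sample string (not taken from the paper). It mirrors only the URL, email, and citation-marker steps, with a simplified URL pattern, so treat it as an illustration rather than a copy of `clean_text`.

```python
import re

# Hypothetical sample text, made up for illustration
sample = "See [8] and https://example.com/page for details.\nContact: someone@example.org"

# Simplified versions of three of the cleaning steps
no_urls = re.sub(r"https?://\S+", "", sample)
no_emails = re.sub(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b", "", no_urls)
no_citations = re.sub(r"\[\d+\]", "", no_emails)

# Collapse newlines and repeated spaces left behind by the removals
cleaned = re.sub(r"\s+", " ", no_citations).strip()
print(cleaned)
```

Checking the result by eye on a few such strings makes it easier to spot a pattern that removes too much or too little.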

### Importing the desired Research Paper for context

In [6]:
from langchain.document_loaders import PyPDFLoader

# Initialize the PDF loader with the URL of the PDF document
loader = PyPDFLoader("https://arxiv.org/pdf/1706.03762.pdf")

# Load the document by calling loader.load(), which returns a list of Document objects (one per page)
pages = loader.load()

# Extract content from each page and store it in a list
docs = [clean_text(page.page_content) for page in pages]

# Example of accessing documents
print(len(docs))  # Prints the number of documents/pages
print(docs[0][:500])  # Prints the first 500 characters of the first document/page

15
Provided proper attribution is provided, Google hereby grants permission to reproduce the tables and figures in this paper solely for use in journalistic or scholarly works. Attention Is All You Need Ashish Vaswani∗ Google Brain  Shazeer∗ Google Brain  Parmar∗ Google Research  Uszkoreit∗ Google Research  Llion Jones∗ Google Research  N. Gomez∗ † University of Toronto .eduŁukasz Kaiser∗ Google Brain  Illia Polosukhin∗ ‡  Abstract The dominant sequence transduction models are based on complex recu




### Chunking the documents

In [7]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=30)

chunked_docs = splitter.create_documents(docs)
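
To build intuition for the two parameters: `chunk_size` caps the length of each chunk in characters, and `chunk_overlap` is the number of characters that consecutive chunks share. The sketch below is a simplified fixed-window version written for illustration. It is not the algorithm `RecursiveCharacterTextSplitter` uses, since that splitter prefers to break on separators such as paragraphs, lines, and spaces. The helper name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks

text = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_chunks(text, chunk_size=10, chunk_overlap=3)
print(chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

The overlap means a sentence cut at a chunk boundary still appears partly in both neighbors, which can help retrieval when the relevant passage straddles two chunks.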

## Create the embeddings + retriever

Now that the docs are all of the appropriate size, we can create a database with their embeddings.

To create document chunk embeddings we'll use the `HuggingFaceEmbeddings` and the [`BAAI/bge-base-en-v1.5`](https://huggingface.co/BAAI/bge-base-en-v1.5) embeddings model. There are many other embeddings models available on the Hub, and you can keep an eye on the best performing ones by checking the [Massive Text Embedding Benchmark (MTEB) Leaderboard](https://huggingface.co/spaces/mteb/leaderboard).


To create the vector database, we'll use `FAISS`, a library developed by Facebook AI. This library offers efficient similarity search and clustering of dense vectors, which is what we need here. FAISS is currently one of the most widely used libraries for nearest-neighbor (NN) search in massive datasets.

We'll access both the embeddings model and FAISS via the LangChain API.

In [8]:
from langchain.vectorstores import FAISS
from langchain.embeddings import HuggingFaceEmbeddings

db = FAISS.from_documents(chunked_docs,
                          HuggingFaceEmbeddings(model_name='BAAI/bge-base-en-v1.5'))


We need a way to retrieve the relevant documents given an unstructured query. For that, we'll use the `as_retriever` method with `db` as the backbone:
- `search_type="similarity"` means we want to perform similarity search between the query and documents
- `search_kwargs={'k': 4}` instructs the retriever to return top 4 results.
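
Conceptually, a similarity retriever embeds the query, scores it against every stored vector, and returns the `k` best matches. The pure-Python sketch below illustrates that idea with made-up 3-dimensional vectors and cosine similarity. FAISS does this with optimized index structures, and the metric it actually uses depends on how the index is configured, so the names and numbers here are assumptions for illustration only.

```python
import heapq
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k):
    """Return the ids of the k vectors most similar to the query."""
    scored = ((cosine(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items())
    return [doc_id for _, doc_id in heapq.nlargest(k, scored)]

# Hypothetical embeddings, made up for illustration
doc_vecs = {
    "attention": [0.9, 0.1, 0.0],
    "recurrence": [0.1, 0.9, 0.0],
    "training": [0.0, 0.2, 0.9],
}
query_vec = [1.0, 0.2, 0.0]

results = top_k(query_vec, doc_vecs, k=2)
print(results)
# ['attention', 'recurrence']
```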


In [9]:
retriever = db.as_retriever(
    search_type="similarity",
    search_kwargs={'k': 4}
)

The vector database and retriever are now set up. Next, we need to set up the remaining piece of the chain: the model.

## Load quantized model

For this example, we chose [`HuggingFaceH4/zephyr-7b-beta`](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta), a small but powerful model.

With many models being released every week, you may want to replace this model with the latest and greatest. The best way to keep track of open source LLMs is to check the [Open-source LLM leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).

To reduce memory usage and make inference faster, we will load the quantized version of the model:

In [10]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

model_name = 'HuggingFaceH4/zephyr-7b-beta'

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
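
As a rough back-of-the-envelope check on why 4-bit loading matters, the sketch below estimates the memory needed for the weights alone. It assumes roughly 7 billion parameters (an assumption based on the model's name). It also ignores quantization constants, activations, and the key/value cache, so real usage will be higher.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

num_params = 7e9  # assumption: roughly 7 billion parameters

bf16_gb = weight_memory_gb(num_params, 16)
nf4_gb = weight_memory_gb(num_params, 4)

print(f"16-bit weights: ~{bf16_gb:.1f} GB")
print(f"4-bit weights:  ~{nf4_gb:.1f} GB")
```

Under these assumptions the weights shrink from about 14 GB to about 3.5 GB, which is the difference between needing a large GPU and fitting on a common free-tier one.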


In [11]:
tokenizer = AutoTokenizer.from_pretrained(model_name)


In [12]:
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config)


## Set up the LLM chain

Finally, we have all the pieces we need to set up the LLM chain.

First, create a `text-generation` pipeline using the loaded model and its tokenizer.

Next, create a prompt template - this should follow the format of the model, so if you substitute the model checkpoint, make sure to use the appropriate formatting.

In [29]:
from langchain.prompts import PromptTemplate

prompt_template = """
Please provide a concise and factual response to the following question based on the provided context.

---Context Start---
{context}
---Context End---

Question: {question}

---Answer Start---
"""

prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=prompt_template,
)
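
To see the text the model will actually receive, the same template can be filled in with plain `str.format`, which uses the same `{placeholder}` style. The context and question strings below are stand-ins chosen for illustration.

```python
template = """
Please provide a concise and factual response to the following question based on the provided context.

---Context Start---
{context}
---Context End---

Question: {question}

---Answer Start---
"""

filled = template.format(
    context="The Transformer is based solely on attention mechanisms.",  # stand-in context
    question="What is a transformer?",
)
print(filled)
```

Because the filled prompt ends with the `---Answer Start---` marker, whatever the model generates after it can later be separated from the prompt by searching for that marker.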


In [39]:
from langchain.llms import HuggingFacePipeline
from transformers import pipeline
from langchain_core.output_parsers import StrOutputParser

text_generation_pipeline = pipeline(
    model=model,
    tokenizer=tokenizer,
    task="text-generation",
    temperature=0.2,
    do_sample=True,
    top_k=50,
    top_p=0.95,
    repetition_penalty=1.1,
    return_full_text=True,  # include the prompt in the generated text
    max_new_tokens=400,
)

# Return only the text that follows the answer marker.
# Defined before the chain because the chain references it.
def extract_answer(full_text):
    marker = "---Answer Start---"
    marker_index = full_text.find(marker)
    if marker_index == -1:
        return "No answer found or output format error."
    # Keep everything after the marker, without leading/trailing whitespace
    return full_text[marker_index + len(marker):].strip()

llm = HuggingFacePipeline(pipeline=text_generation_pipeline)
llm_chain = prompt | llm | StrOutputParser() | extract_answer


Finally, we need to combine the `llm_chain` with the retriever to create a RAG chain. We pass the original question through to the final generation step, as well as the retrieved context docs:

In [40]:
from langchain_core.runnables import RunnablePassthrough

# Reuse the retriever created earlier (similarity search, top 4 results)
rag_chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | llm_chain
)
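
The dictionary at the start of the chain sends the same input to each of its values: the retriever receives the question and produces the context, while `RunnablePassthrough` forwards the question unchanged. The plain-Python analogy below sketches that data flow. It is not LangChain code, and `fake_retriever` and `fake_llm_chain` are hypothetical stand-ins.

```python
def fake_retriever(question):
    # Hypothetical stand-in: a real retriever would return the most similar chunks
    return ["chunk about attention", "chunk about encoders"]

def passthrough(value):
    # Forwards its input unchanged, like RunnablePassthrough
    return value

def fake_llm_chain(inputs):
    # Hypothetical stand-in: a real chain would format the prompt and call the model
    return f"Answering {inputs['question']!r} using {len(inputs['context'])} chunks"

def rag(question):
    # Every branch of the mapping receives the same input
    inputs = {"context": fake_retriever(question), "question": passthrough(question)}
    return fake_llm_chain(inputs)

result = rag("What is a transformer?")
print(result)
# Answering 'What is a transformer?' using 2 chunks
```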


## Compare the results

Let's see the difference RAG makes in generating answers to paper-specific questions.

In [41]:
question = "What is a transformer?"

First, let's see what kind of answer we can get with just the model itself, no context added:

In [42]:
llm_chain.invoke({"context":"", "question": question})

'A transformer is an electrical device that transfers electrical energy from one circuit to another, typically through the use of electromagnetic induction. It allows for the transformation of voltage levels between high and low values without changing the frequency or direction of alternating current (AC) electricity. Transformers are commonly used in power transmission and distribution systems, as well as in various electronic devices such as computers and appliances.'

As you can see, the model interpreted the question as being about an electrical transformer rather than the Transformer architecture used in AI.
Let's see if adding context from the research paper helps the model give a more relevant answer:

In [43]:
rag_chain.invoke(question)

'A transformer is a new simple network architecture proposed in the given context, which is based solely on attention mechanisms and dispenses with recurrence and convolutions entirely. It allows for significantly more parallelization and can achieve better translation quality compared to previous models that use recurrent networks or attention mechanisms in conjunction with them. The transformer draws global dependencies between input and output using an attention mechanism, eliminating the need for sequence-aligned RNNs or convolution. This approach enables faster training times and requires fewer resources compared to previous models, making it a promising development in the field of neural sequence transduction.'

As we can see, the added context helps the exact same model provide a much more relevant and informed answer to the paper-specific question.

Notably, support for combining multiple adapters for inference has been added to the library and is described in its documentation, so for the next iteration of this RAG it may be worth including documentation embeddings.

In [44]:
response = rag_chain.invoke(question)

## Adding a transcriber

In [45]:
from transformers import pipeline
transcriber = pipeline(task="automatic-speech-recognition")

No model was supplied, defaulted to facebook/wav2vec2-base-960h and revision 55bb623 (https://huggingface.co/facebook/wav2vec2-base-960h).
Using a pipeline without specifying a model name and revision in production is not recommended.

The audio for this test question was generated with AWS Polly.

In [46]:
question = transcriber("/kaggle/input/speech/speech_20240415163418027.mp3")['text']

In [47]:
question

'WHAT ARE APPLICATIONS OF ATTENSION IN A MODEL'
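
The transcript comes back in upper case and without punctuation. One optional preprocessing step is to normalize the casing before passing it to the RAG chain. Whether this improves retrieval is not tested here, and the helper `normalize_transcript` is hypothetical. It also assumes the input is a question and does not correct misrecognized words.

```python
def normalize_transcript(text):
    """Lower-case an all-caps transcript, capitalize the first letter, and end with '?'."""
    cleaned = " ".join(text.split()).lower()
    if not cleaned:
        return cleaned
    cleaned = cleaned[0].upper() + cleaned[1:]
    # Assumption: the transcript is a question, so make sure it ends with '?'
    return cleaned if cleaned.endswith("?") else cleaned + "?"

normalized = normalize_transcript("WHAT ARE APPLICATIONS OF ATTENSION IN A MODEL")
print(normalized)
# What are applications of attension in a model?
```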

In [48]:
response = rag_chain.invoke(question)

## Adding text-to-speech model

In [50]:
from gtts import gTTS
from IPython.display import Audio

def text_to_speech_gtts(text, filename="output.mp3"):
    tts = gTTS(text=text, lang='en', slow=False)  # Language is English and speech is at a normal rate.
    tts.save(filename)
    return Audio(filename)

audio = text_to_speech_gtts(response)
audio


We can enhance this by developing a Streamlit app that lets users enter a link to a paper and ask questions about it in real time.