# Introduction to Retrieval Augmented Generation (RAG), In-context Learning, and Next-generation LLMOps

### 1. Retrieval Augmented Generation (RAG):
Retrieval Augmented Generation (RAG) combines the strengths of large-scale retrieval systems with the generative capabilities of transformer models like GPT. In a RAG-based system, when a question is posed, relevant documents or passages are first retrieved from a corpus and then fed alongside the query to a generative model. This two-step process lets the model draw on both external knowledge from the corpus and the knowledge encoded in its own parameters, producing more informed and contextually accurate responses.
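
The two-step flow can be sketched in a few lines of plain Python. This is a minimal illustration, not the pipeline built later in this notebook: the corpus, the word-overlap scoring, and the helper names `tokenize`, `retrieve`, and `build_prompt` are all assumptions made for the example, and the generation step is left out so that only retrieval and prompt assembly are shown.

```python
from collections import Counter

# Hypothetical toy corpus; a real system would index many documents.
CORPUS = [
    "To reset the robot arm, hold the power button for ten seconds.",
    "The gripper should be cleaned weekly with a dry cloth.",
    "If the robot does not turn on, check that the battery is charged.",
]

def tokenize(text):
    """Lowercase the text and strip basic punctuation from each word."""
    return [word.strip(".,?!").lower() for word in text.split()]

def retrieve(query, corpus, k=1):
    """Rank passages by word overlap with the query (a stand-in for a real retriever)."""
    query_words = Counter(tokenize(query))

    def overlap(passage):
        return sum((Counter(tokenize(passage)) & query_words).values())

    return sorted(corpus, key=overlap, reverse=True)[:k]

def build_prompt(query, passages):
    """Place the retrieved passages ahead of the question."""
    context = "\n".join(passages)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

rag_query = "My robot does not turn on"
rag_passages = retrieve(rag_query, CORPUS, k=1)
rag_prompt = build_prompt(rag_query, rag_passages)
print(rag_prompt)
```

The resulting string is what would be handed to the generative model. Swapping the word-overlap scorer for an embedding-based search changes only `retrieve`.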

### 2. In-context Learning:
Traditional machine learning models learn from extensive labeled datasets. In contrast, in-context learning refers to a model, typically a language model, using a few examples or other context supplied in the prompt at inference time to tailor its output. This allows LLMs to be adapted to new tasks on the fly, without explicit fine-tuning.
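
As a rough sketch of what "examples provided at inference time" can look like, the snippet below assembles a few-shot prompt. The sentiment task, the example messages, and the helper name `build_few_shot_prompt` are hypothetical and chosen only for illustration. No model weights change; the examples live entirely in the prompt text.

```python
# Hypothetical labeled examples supplied at inference time.
EXAMPLES = [
    ("The arm moves smoothly now, thank you!", "positive"),
    ("The gripper broke after one day.", "negative"),
]

def build_few_shot_prompt(examples, new_input):
    """Format the labeled examples, then append the unlabeled input."""
    blocks = ["Classify the sentiment of each message."]
    for text, label in examples:
        blocks.append(f"Message: {text}\nSentiment: {label}")
    # The final block is left open so the model completes the label.
    blocks.append(f"Message: {new_input}\nSentiment:")
    return "\n\n".join(blocks)

few_shot_prompt = build_few_shot_prompt(EXAMPLES, "My robot will not turn on.")
print(few_shot_prompt)
```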

### 3. LLM Chains/Pipelines:
LLM chains or pipelines involve stringing together multiple stages or components of a system to achieve a complex task. For instance, in a RAG system, a retrieval component fetches relevant data from a database, followed by a generation component that constructs the final answer. By chaining different LLM modules, developers can build systems that harness diverse capabilities and can be modularly updated or optimized.
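
Conceptually, a chain is function composition: each stage consumes the previous stage's output. The sketch below shows that idea in plain Python and is not LangChain's implementation. The stage functions and `make_chain` are hypothetical names, and `generate_stage` is a placeholder rather than a real model call.

```python
from functools import reduce

def make_chain(*stages):
    """Compose stages left to right: each stage receives the previous stage's output."""
    return lambda value: reduce(lambda acc, stage: stage(acc), stages, value)

# Hypothetical stages; a real pipeline would call a retriever and an LLM here.
def retrieve_stage(question):
    return {"question": question, "context": "Check that the battery is charged."}

def prompt_stage(state):
    return f"Context: {state['context']}\nQuestion: {state['question']}\nAnswer:"

def generate_stage(prompt_text):
    # Placeholder for a model call; it only reports the prompt size.
    return f"[model output for a {len(prompt_text)}-character prompt]"

demo_chain = make_chain(retrieve_stage, prompt_stage, generate_stage)
demo_output = demo_chain("Why won't my robot turn on?")
print(demo_output)
```

Because each stage is an ordinary function, any one of them can be replaced or tested separately, which is the modularity the paragraph above describes.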

### 4. RAG for On-Premise LLM Applications:
With the growing need for data privacy and proprietary data handling, many enterprises seek to harness the power of LLMs in-house. RAG is well suited to such use cases: by pairing a locally hosted model with retrieval over on-premise data repositories, enterprises can build powerful LLM applications tailored to their specific needs while keeping their data confidential.

### 5. RAG and Fine-Tuning:
While RAG is a powerful approach on its own, it can also be combined with fine-tuning to enhance LLM capabilities further. Fine-tuning allows models to be explicitly trained on specific datasets, honing them for certain tasks. When coupled with RAG, fine-tuned models can make informed decisions using retrieved external knowledge, yielding even more precise and task-specific results.

### 6. The Advent of LLMOps:
With the rise of large language models and their multifaceted applications, there's an emerging discipline known as LLMOps, focused on the operational aspects of deploying, monitoring, and maintaining LLM systems. RAG, in-context learning, and LLM pipelines become central topics in this space, paving the way for next-generation machine learning operations tailored to the specific needs of LLMs.

Install dependencies. Only run the first line if you are running this on the Intel Developer Cloud's Jupyter Environment.

In [None]:
!source /opt/intel/oneapi/setvars.sh #comment out if not running on Intel Developer Cloud Jupyter
!pip install langchain==0.0.330
!pip install pygpt4all==1.1.0
!pip install gpt4all==1.0.12
!pip install transformers==4.30.2
!pip install datasets==2.14.6
!pip install tiktoken==0.4.0
!pip install chromadb==0.4.15
!pip install sentence_transformers==2.2.2

Set the model initialization parameters: 

**model_path = './ggml-model-gpt4all-falcon-q4_0.bin'**: Specifies the local file path of the pre-trained model to load. The file name indicates GPT4All Falcon, a GPT4All model (unrelated to GPT-4), with 4-bit (q4_0) quantized weights in GGML format.

**n_threads = 32**: Sets the number of CPU threads used for inference. More threads generally speed up generation on multi-core systems, so adjust this value to the cores available on your machine.

**max_tokens = 50**: Caps the number of tokens (words or subwords) the model generates per response at 50. It is passed to GPT4All as n_predict and does not limit the length of the input prompt.

**repeat_penalty = 1.20**: Penalizes tokens that have already appeared in the recent output. A value greater than 1.0 discourages the model from repeating itself; 1.0 applies no penalty.

**n_batch = 32**: Sets the batch size for prompt processing, that is, how many prompt tokens are evaluated at a time (not how many samples). Larger values can speed up prompt ingestion at the cost of more memory.

**top_k = 1**: Configures top-k sampling. When generating text, the model samples only from the k most probable next tokens. With k = 1, only the single most probable token is considered, which amounts to greedy, deterministic decoding.

In summary, these lines of code are configuration settings for loading a specific pre-trained model and setting various parameters for its operation, including how data is processed and how text is generated.
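
To make the two sampling parameters more concrete, here is a small sketch of how a repeat penalty and top-k filtering can act on next-token scores. It follows one common formulation (divide positive scores, multiply negative ones). Whether GPT4All's backend applies exactly this rule is an assumption, and the scores and the helper names `apply_repeat_penalty` and `top_k_probabilities` are made up for the example.

```python
import math

def apply_repeat_penalty(logits, previous_tokens, penalty):
    """Lower the score of tokens that were already generated (one common formulation)."""
    adjusted = dict(logits)
    for token in set(previous_tokens):
        if token in adjusted:
            score = adjusted[token]
            # Divide positive scores and multiply negative ones so the token always loses ground.
            adjusted[token] = score / penalty if score > 0 else score * penalty
    return adjusted

def top_k_probabilities(logits, k):
    """Keep the k highest-scoring tokens and renormalize them with a softmax."""
    kept = sorted(logits.items(), key=lambda item: item[1], reverse=True)[:k]
    peak = kept[0][1]
    weights = {token: math.exp(score - peak) for token, score in kept}
    total = sum(weights.values())
    return {token: weight / total for token, weight in weights.items()}

# Hypothetical next-token scores; "robot" was already generated once.
example_logits = {"robot": 2.0, "arm": 1.5, "the": 1.8, "battery": 0.2}
penalized = apply_repeat_penalty(example_logits, previous_tokens=["robot"], penalty=1.20)
print(top_k_probabilities(penalized, k=1))
print(top_k_probabilities(penalized, k=2))
```

With a k of 1 all probability mass lands on a single token, so sampling becomes deterministic. Here the penalty also changes which token that is, because "robot" drops below "the" after being penalized.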

In [None]:
model_path = './ggml-model-gpt4all-falcon-q4_0.bin'
n_threads =32
max_tokens = 50
repeat_penalty = 1.20
n_batch = 32
top_k = 1

The code below imports necessary libraries like os, requests, tqdm, and the GPT4All module from langchain.llms. It's designed to download a gpt4all-falcon-q4_0 language model from a specified URL on the Hugging Face platform. Using the requests library, the model is streamed and downloaded in manageable chunks. The progress of the download is visually represented using a progress bar from the tqdm library. The downloaded model is then saved to a local file, as defined by the model_path variable, in binary mode. Users are cautioned that due to the model's size, the download might take some time.

In [None]:
import os
import requests

from tqdm import tqdm
from langchain.llms import GPT4All

# download the gpt4all-falcon model (q4_0 quantization) from Hugging Face
url = "https://huggingface.co/nomic-ai/gpt4all-falcon-ggml/resolve/main/ggml-model-gpt4all-falcon-q4_0.bin"
# send a GET request to the URL to download the file. Stream since it's large
response = requests.get(url, stream=True)
# stop early with a clear error if the server did not return the file
response.raise_for_status()
# open the file in binary mode and write the contents of the response to it in chunks
# This is a large file, so be prepared to wait.
with open(model_path, 'wb') as f:
    for chunk in tqdm(response.iter_content(chunk_size=10000)):
        if chunk:
            f.write(chunk)

The code below initializes and loads the model into memory.

In [None]:
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

# Callbacks support token-wise streaming
callbacks = [StreamingStdOutCallbackHandler()]
# Verbose is required to pass to the callback manager
llm = GPT4All(model=model_path, callbacks=callbacks, verbose=True,
              n_threads=n_threads, n_predict=max_tokens, repeat_penalty=repeat_penalty,
              n_batch=n_batch, top_k=top_k)

The code below imports the required LangChain components, downloads a sample dataset of customer service dialogues from Hugging Face using the datasets library, and saves the dialogue column to a local text file.

In [None]:
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.indexes import VectorstoreIndexCreator
from langchain.embeddings import HuggingFaceEmbeddings
from datasets import load_dataset
import pandas as pd

data = './dialogues.txt'

# Download the customer service robot support dialogues from Hugging Face
dataset = load_dataset("FunDialogues/customer-service-robot-support")
# Convert the dataset to a pandas dataframe
dialogues = dataset['train']
df = pd.DataFrame(dialogues, columns=['id', 'description', 'dialogue'])
# Print the first 5 rows of the dataframe
print(df.head())
# only keep the dialogue column
dialog_df = df['dialogue']

# save the data to txt file
dialog_df.to_csv(data, sep=' ', index=False)

Below, we establish the chunk size and chunk overlap. 

With chunk_size set to 500 and overlap set to 25, the RecursiveCharacterTextSplitter tries to divide the text into chunks of at most 500 characters, with neighboring chunks sharing up to 25 characters. In the idealized case, characters 476-500 of the first chunk are the same as characters 1-25 of the second chunk. In practice the splitter cuts at separators such as paragraph breaks, newlines, and spaces, so chunks are often shorter than 500 characters and the shared text can be shorter than 25 characters. The overlap keeps context that straddles a boundary available in both chunks, which matters because each chunk is embedded and retrieved separately.
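
The idealized fixed-width case can be sketched in plain Python. This is not how RecursiveCharacterTextSplitter is implemented (it prefers to cut at separators). The helper `split_with_overlap` and the synthetic text are assumptions used only to show the overlap arithmetic.

```python
def split_with_overlap(text, chunk_size, overlap):
    """Fixed-width sliding window: each chunk starts chunk_size - overlap characters after the previous one."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    pieces = []
    for start in range(0, len(text), step):
        pieces.append(text[start:start + chunk_size])
        # Stop once a chunk has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return pieces

# Synthetic 1200-character text made of repeating digits.
sample_text = "".join(str(i % 10) for i in range(1200))
pieces = split_with_overlap(sample_text, chunk_size=500, overlap=25)
print([len(piece) for piece in pieces])
print(pieces[0][-25:] == pieces[1][:25])
```

Each chunk begins 475 characters after the previous one, so the last 25 characters of one chunk are repeated at the start of the next.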

In [None]:
chunk_size = 500
overlap = 25

The code below uses a TextLoader to load the saved dialogue file, then a RecursiveCharacterTextSplitter to segment it into overlapping chunks. VectorstoreIndexCreator converts each text chunk into a numerical vector with HuggingFaceEmbeddings and stores the vectors in a Chroma (chromadb) vector database, its default vector store, so that chunks similar to a query can be found efficiently.
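
The lookup that a vector index performs can be illustrated with a tiny in-memory example. The 3-dimensional vectors, the texts, and the helper names below are hypothetical. Cosine similarity is used for illustration, while the distance metric Chroma actually applies depends on its configuration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def toy_similarity_search(query_vector, indexed_vectors, k):
    """Return the k (text, score) pairs with the highest cosine similarity."""
    scored = [
        (text, cosine_similarity(query_vector, vector))
        for text, vector in indexed_vectors.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Hypothetical 3-dimensional embeddings; real models produce hundreds of dimensions.
toy_index = {
    "robot will not power on": [0.9, 0.1, 0.0],
    "gripper cleaning schedule": [0.0, 0.2, 0.9],
    "battery charging steps": [0.7, 0.6, 0.1],
}
hits = toy_similarity_search([1.0, 0.2, 0.0], toy_index, k=2)
print([text for text, _ in hits])
```

A production vector store adds persistence and approximate nearest-neighbor search on top of this basic idea.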

In [None]:
loader = TextLoader(data)
# Text Splitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=overlap)
# Embed the document and store into chroma DB
index = VectorstoreIndexCreator(embedding= HuggingFaceEmbeddings(), text_splitter=text_splitter).from_loaders([loader])

Below, we provide user input to the RAG-based LLM chatbot.

In [None]:
user_input = 'my robot is not turning on, can you help me please?'
context_verbosity = False
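# number of chunks to retrieve; this reuses the name top_k, but the LLM above
# was already created with the sampling top_k, so its setting is unaffected
top_k = 2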

In [None]:
import time

# perform a similarity search and retrieve the context from our documents
results = index.vectorstore.similarity_search(user_input, k=top_k)
# join all retrieved context (the top_k most similar chunks) into one string
context = "\n".join([document.page_content for document in results])
if context_verbosity:
    print("Retrieving information related to your question...")
    print(f"Found this content which is most similar to your question: {context}")
template = """
Please use the following robotic technical support context to answer the question.
Context: {context}
---
This is the user's question: {question}
Answer: This is what our robot arm technical specialist suggests."""
prompt = PromptTemplate(template=template, input_variables=["context", "question"]).partial(context=context)
llm_chain = LLMChain(prompt=prompt, llm=llm)
        
print("Processing the information with gpt4all...\n")
start_time = time.time()
response = llm_chain.run(user_input)
elapsed_time_milliseconds  = (time.time() - start_time) * 1000

# approximate the token count by counting whitespace-separated words
tokens = len(response.split())
time_per_token_milliseconds = elapsed_time_milliseconds  / tokens if tokens != 0 else 0

processed_response = response + f" --> {time_per_token_milliseconds:.4f} milliseconds/token AND Time taken for response: {elapsed_time_milliseconds:.2f} milliseconds"

processed_response

# Conclusion

In this Jupyter notebook, we introduced Retrieval Augmented Generation (RAG) and its integration with Large Language Models (LLMs) using the LangChain framework. We covered the ideas behind RAG, in-context learning, LLM chains, and LLMOps, and how they fit together. Using LangChain's document loader, text splitter, embedding model, and vector store components, we built a RAG-based chatbot that answers questions about robot technical support by retrieving relevant dialogue snippets and passing them to a locally running GPT4All model.





