# RAG-based text documents Q&A chat with LangChain and Llama 2 7B model 

RAG stands for retrieval-augmented generation. It works by retrieving external documents and supplying them to the LLM as context when executing queries. Using this technique we can ask our language model questions specific to the content of these documents. We will build a simple demo of it in which the LLM answers questions about a set of external text files.

As the external data we want to ask questions about, we'll use the syllabus and lecture transcript text files from Stanford's excellent CS224N course, Natural Language Processing with Deep Learning.

In this experiment I used:

* Data source: text files
* Model: Llama 2 7B Chat (`meta-llama/Llama-2-7b-chat-hf`)
* RAG: LangChain

In [1]:
# !pip install langchain
# !pip install tiktoken
# !pip install chromadb
# !pip install lark
# !pip install sentence-transformers
# !pip install transformers accelerate  # accelerate is needed for device_map="auto"

## 1. Loading documents

We'll start by loading the data we want to ask our LLM about.

In [2]:
import os
# import openai

# openai.api_key  = os.environ['OPENAI_API_KEY']

DATA_PATH = './data/02_llama_docs_chat/'

In [3]:
from langchain.document_loaders import TextLoader


loaders = [
    TextLoader(DATA_PATH + "CS224n_Syllabus.txt"),
    TextLoader(DATA_PATH + "CS224N_NLP_with_Deep_Learning_Winter_2021_Lecture_1.txt"),
    TextLoader(DATA_PATH + "CS224N_NLP_with_Deep_Learning_Winter_2021_Lecture_2.txt"),
    TextLoader(DATA_PATH + "CS224N_NLP_with_Deep_Learning_Winter_2021_Lecture_3.txt"),
    TextLoader(DATA_PATH + "CS224N_NLP_with_Deep_Learning_Winter_2021_Lecture_4.txt")
]
pages = []
for loader in loaders:
    pages.extend(loader.load())

In [4]:
len(pages)

5

In [5]:
pages[0].page_content[0:200]

'CS224n: Natural Language Processing with Deep Learning\nStanford / Winter 2021\n\nNatural language processing (NLP) is a crucial part of artificial intelligence (AI), modeling how people share informatio'

In [6]:
pages[0].metadata

{'source': './data/02_llama_docs_chat/CS224n_Syllabus.txt'}

## 2. Documents splitting

Our next step is splitting the data. An LLM's input context has a limited token length, which is why we need to chunk our input data.
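To get a feel for that limit, here is a rough back-of-the-envelope sketch rather than an exact calculation. It assumes Llama 2's 4096-token context window, the 1024 new tokens reserved for the answer later in this notebook, the 1500-character chunks used below, four retrieved chunks per query, and a crude rule of thumb of about four characters per token. The helper names and the 100-token prompt overhead are hypothetical.

```python
# Rough prompt-budget check. All numbers here are assumptions for illustration.
CONTEXT_WINDOW = 4096   # Llama 2 context length in tokens
MAX_NEW_TOKENS = 1024   # tokens reserved for the generated answer
CHARS_PER_TOKEN = 4     # crude rule of thumb for English text, not a real tokenizer


def estimate_tokens(num_chars: int, chars_per_token: int = CHARS_PER_TOKEN) -> int:
    """Estimate a token count from a character count, rounding up."""
    return -(-num_chars // chars_per_token)


def prompt_fits(chunk_chars: int, num_chunks: int, overhead_tokens: int = 100) -> bool:
    """Check whether the chunks plus prompt overhead leave room for the answer."""
    prompt_tokens = num_chunks * estimate_tokens(chunk_chars) + overhead_tokens
    return prompt_tokens <= CONTEXT_WINDOW - MAX_NEW_TOKENS


print(estimate_tokens(1500))    # 375
print(prompt_fits(1500, 4))     # four 1500-character chunks fit
print(prompt_fits(200_000, 1))  # a hypothetical 200,000-character document does not
```

A real check would count tokens with the model's own tokenizer, since the characters-per-token ratio varies with the text.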

Later down the line, individual chunks will be represented as embedding vectors, and the chunks passed to the model as input will be selected by their semantic similarity to the posed question or problem.

We will use one of LangChain's simplest and most essential splitters, `RecursiveCharacterTextSplitter`, and run it on the text content of our loaded files.

In [7]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1500,
    chunk_overlap = 150
)

In [8]:
splits = text_splitter.split_documents(pages)

In [9]:
len(splits)

203

In [10]:
len(pages)

5

In [11]:
splits[0]

Document(page_content='CS224n: Natural Language Processing with Deep Learning\nStanford / Winter 2021\n\nNatural language processing (NLP) is a crucial part of artificial intelligence (AI), modeling how people share information. In recent years, deep learning approaches have obtained very high performance on many NLP tasks. In this course, students gain a thorough introduction to cutting-edge neural networks for NLP.\n\nInstructors:\nChris Manning\nJohn Hewitt\n\nTA:\nCourse Coordinator\nAmelie Byun\nTeaching Assistants\nDaniel Do\nRachel Gardner\nDavide Giovanardi\nAlvin Hou\nPrerna Khullar\nGita Krishna\nMegan Leszczynski\nElissa Li\nMandy Lu\nShikhar Murty\nAkshay Smit\nDilara Soylu\nAngelica Sun\nChris Waites\nAndrew Wang\nRui Wang\nYuyan Wang\nZihan Wang\nLingjue Xie\nRui Yan\nAnna Yang\nLauren Zhu\nLogistics', metadata={'source': './data/02_llama_docs_chat/CS224n_Syllabus.txt'})
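To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified sketch. It is not LangChain's algorithm: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraphs and lines, while this hypothetical `split_with_overlap` helper just slides a fixed character window over the text.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(chunks)  # ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk starts with the last three characters of the previous one. The overlap is there so that a sentence cut at a chunk boundary still appears whole in at least one chunk.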

## 3. Embeddings and vector storage

As mentioned above, each generated chunk will be represented as an embedding vector that captures the semantic meaning of its text in a high-dimensional space.

A ChatGPT-based variant of this demo would pair the LLM with `OpenAIEmbeddings`. The first major change we will make in order to use the Llama 2 model instead of ChatGPT, and to keep using only open-source tools in the process, is switching the embeddings.

We will use the `thenlper/gte-large` model, loaded through LangChain's `HuggingFaceEmbeddings` wrapper.

In [13]:
# from langchain.embeddings.openai import OpenAIEmbeddings
# embedding = OpenAIEmbeddings()

In [14]:
# from sentence_transformers import SentenceTransformer

In [None]:
from langchain.embeddings import HuggingFaceEmbeddings
 
embedding = HuggingFaceEmbeddings(
    model_name="thenlper/gte-large",
    # model_kwargs={"device": "cuda"},  # TODO: uncomment to run on a GPU
    encode_kwargs={"normalize_embeddings": True},
)
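The `normalize_embeddings` option scales every embedding to unit length. The sketch below uses toy three-dimensional vectors, not real model output, to show why that is convenient. For unit vectors a plain dot product equals cosine similarity, and squared Euclidean distance is a simple function of it, so the common distance measures all rank neighbours the same way. All helper names are hypothetical.

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vector):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(dot(vector, vector))
    return [x / norm for x in vector] if norm else list(vector)


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def squared_l2_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Toy "embeddings" chosen for easy arithmetic.
u = [3.0, 4.0, 0.0]
v = [1.0, 2.0, 2.0]
u_hat, v_hat = l2_normalize(u), l2_normalize(v)

print(cosine_similarity(u, v))            # cosine of the raw vectors
print(dot(u_hat, v_hat))                  # same value from a plain dot product
print(squared_l2_distance(u_hat, v_hat))  # equals 2 - 2 * cosine for unit vectors
```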

To use the generated chunk embeddings, we need to store them in a persistent and easily accessible way. Vector stores do exactly this: a vector store is a vector database that holds our embeddings, which are then used to retrieve relevant chunks when we query our LLM.

Chroma will serve our embedding storage and retrieval needs well. Its `from_documents` method also takes care of transforming the text chunks into embeddings.

In [None]:
# Remove the previous version if it exists
# !rm -rf ./data/02_llama_docs_chat/chroma/

In [None]:
from langchain.vectorstores import Chroma

In [None]:
persist_directory = DATA_PATH + "chroma/"

In [15]:
persist_directory

'./data/02_llama_docs_chat/chroma/'

In [16]:
chroma_db = Chroma.from_documents(
    documents=splits,
    embedding=embedding,
    persist_directory=persist_directory
)

In [17]:
chroma_db.persist()

In [18]:
chroma_db._collection.count()

203

Let's use a simple embedding similarity search to answer some questions about the documents by identifying text chunks that could contain information related to each question. No LLMs yet, just a nearest-neighbour search over the question and chunk embeddings. Because the embeddings are normalized, ranking by distance matches ranking by cosine similarity.

In [19]:
question = "Who is the course instructor?"
docs = chroma_db.similarity_search(question, k=3)
docs[0].page_content

"Lectures: are on Tuesday/Thursday 4:30-5:50pm Pacific Time (Remote, Zoom link is posted on Canvas).\nLecture videos for enrolled students: are posted on Canvas (requires login) shortly after each lecture ends. Unfortunately, it is not possible to make these videos viewable by non-enrolled students.\nPublicly available lecture videos and versions of the course: Complete videos from the 2019 edition are available (free!) on the Stanford Online Hub and on the CS224N YouTube channel. Anyone is welcome to enroll in XCS224N: Natural Language Processing with Deep Learning, the Stanford Artificial Intelligence Professional Program version of this course, throughout the year (medium fee, community TAs and certificate). You can enroll in CS224N via Stanford online in the (northern hemisphere) Autumn to do the course in the Winter (high cost, gives Stanford credit). The lecture slides and assignments are updated online each year as the course progresses. We are happy for anyone to use these reso

In [20]:
question = "In which lecture - give the number and date - are word embeddings described?"
docs = chroma_db.similarity_search(question, k=3)
docs[0].page_content

'Gensim word vectors example:\n[code] [preview] \tSuggested Readings:\n\nEfficient Estimation of Word Representations in Vector Space (original word2vec paper)\nDistributed Representations of Words and Phrases and their Compositionality (negative sampling paper)\n\nAssignment 1 out\n[code]\n[preview] \t\nThu Jan 14 \tWord Vectors 2 and Word Window Classification\n[slides] [notes] \tSuggested Readings:\n\nGloVe: Global Vectors for Word Representation (original GloVe paper)\nImproving Distributional Similarity with Lessons Learned from Word Embeddings\nEvaluation methods for unsupervised word embeddings\n\nAdditional Readings:\n\nA Latent Variable Model Approach to PMI-based Word Embeddings\nLinear Algebraic Structure of Word Senses, with Applications to Polysemy\nOn the Dimensionality of Word Embedding\n\n\t\nFri Jan 15 \tPython Review Session\n[code] [preview] \t10:00am - 11:20am \t\t\nTue Jan 19 \tBackprop and Neural Networks\n[slides] [notes] \tSuggested Readings:\n\nmatrix calculus 

In [21]:
question = "How many assignments are there in the course?"
docs = chroma_db.similarity_search(question, k=3)
docs[0].page_content

'Credit:\n    Assignment 1 (6%): Introduction to word vectors\n    Assignment 2 (12%): Derivatives and implementation of word2vec algorithm\n    Assignment 3 (12%): Dependency parsing and neural network foundations\n    Assignment 4 (12%): Neural Machine Translation with sequence-to-sequence, attention, and subwords\n    Assignment 5 (12%): Self-supervised learning and fine-tuning with Transformers\nDeadlines: All assignments are due on either a Tuesday or a Thursday before class (i.e. before 4:30pm). All deadlines are listed in the schedule.\nSubmission: Assignments are submitted via Gradescope. If you need to sign up for a Gradescope account, please use your @stanford.edu email address. Further instructions are given in each assignment handout. Do not email us your assignments.\nLate start: If the result gives you a higher grade, we will not use your assignment 1 score, and we will give you an assignment grade based on counting each of assignments 2–5 at 13.5%.\nCollaboration: Study 

It works pretty well. That's the power of embeddings alone: the high-dimensional vector representation of each chunk's meaning is enough to identify the semantic relation between the question we pose and the part of the text that can help answer it.
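The search above can be pictured as "embed the question, score every chunk, keep the top k". The sketch below follows that recipe in plain Python. It is only an illustration: the `embed` function is a hypothetical bag-of-words stand-in for a real embedding model, so it matches shared words rather than meaning, and a real vector store uses an index instead of scoring every chunk.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Hypothetical stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similarity_search(question: str, chunks: list[str], k: int = 3) -> list[str]:
    """Return the k chunks whose vectors are most similar to the question vector."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda chunk: cosine(q, embed(chunk)), reverse=True)
    return ranked[:k]


toy_chunks = [
    "assignments are submitted via gradescope",
    "lectures are on tuesday and thursday",
    "word vectors are introduced in the first assignment",
]
print(similarity_search("when are the lectures", toy_chunks, k=1))
```

With dense embeddings from a model such as `thenlper/gte-large`, the same ranking step can match a question to a chunk even when they share no words.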

## 4. Question answering

Retrieval, the R in RAG, feeds the generation step: in question answering, the `RetrievalQA` chain passes the retrieved chunks to the loaded target LLM, which generates the answer.

In [87]:
# from langchain.chat_models import ChatOpenAI
# llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
# llm.predict("Hello world!")

'Hello! How can I assist you today?'

In [24]:
import torch  # required for torch.float16 below
from langchain.llms import HuggingFacePipeline
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig, pipeline
 
MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"
 
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)
 
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, trust_remote_code=True, device_map="auto"
)
 
generation_config = GenerationConfig.from_pretrained(MODEL_NAME)
generation_config.max_new_tokens = 1024
generation_config.temperature = 0.0001
generation_config.top_p = 0.95
generation_config.do_sample = True
generation_config.repetition_penalty = 1.15
 
text_pipeline = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    generation_config=generation_config,
)
 
llm = HuggingFacePipeline(pipeline=text_pipeline, model_kwargs={"temperature": 0})

llm.predict("Hello world!")

In [88]:
from langchain.prompts import PromptTemplate

template = """Use this context to answer the question. Use up to four sentences. Keep the answer concise. Say you don't know if you don't know the answer.
{context}
Question: {question}
Helpful Answer:"""
QA_CHAIN_PROMPT = PromptTemplate(input_variables=["context", "question"],template=template,)


In [89]:
from langchain.chains import RetrievalQA
qa_chain = RetrievalQA.from_chain_type(llm,
                                       retriever=chroma_db.as_retriever(),
                                       return_source_documents=True,
                                       chain_type_kwargs={"prompt": QA_CHAIN_PROMPT})
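Conceptually, the chain's default "stuff" strategy places the retrieved chunks into the `{context}` slot of the prompt and the user's question into `{question}`. The sketch below imitates that step with plain string formatting. It is a simplified illustration, not LangChain's implementation. The `build_prompt` helper is hypothetical, and joining chunks with a blank line is an assumption.

```python
TEMPLATE = """Use this context to answer the question. Use up to four sentences. Keep the answer concise. Say you don't know if you don't know the answer.
{context}
Question: {question}
Helpful Answer:"""


def build_prompt(question: str, retrieved_chunks: list[str]) -> str:
    """Join the retrieved chunks and substitute them into the prompt template."""
    context = "\n\n".join(retrieved_chunks)
    return TEMPLATE.format(context=context, question=question)


prompt = build_prompt(
    "How many assignments are there in the course?",
    [
        "Assignment 1 (6%): Introduction to word vectors",
        "Assignment 2 (12%): Derivatives and implementation of word2vec algorithm",
    ],
)
print(prompt)
```

The LLM only ever sees this final string, which is why retrieval quality directly bounds answer quality.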

In [90]:
question = "Who is the course instructor?"
result = qa_chain({"query": question})
result["result"]

'The main instructor for the course is Christopher Manning.'

In [91]:
question = "In which lecture - give the number and date - are word embeddings described?"
result = qa_chain({"query": question})
print(result["result"])

In Lecture 1 on Thu Jan 21, word embeddings are described.


In [92]:
question = "How many assignments are there in the course?"
result = qa_chain({"query": question})
print(result["result"])

There are five assignments in the course.


The final problem we need to address is continuity: for the question answering to resemble a natural conversation, each question should be able to build on the previous ones.

The problem becomes visible when we ask a follow-up question that refers implicitly to the previous one.

In [93]:
question = "What are their topics?"
result = qa_chain({"query": question})
print(result["result"])

The topics of the previous offerings include Transformers and Pretraining, Question Answering, and Natural Language Generation.


As we can see, in the follow-up question the LLM lost the context of the course assignments mentioned in the previous question. That is expected, as LLMs are not equipped with dialogue memory: each query is processed independently.

To make the Q&A more natural we will use `ConversationBufferMemory`, a conversation memory mechanism provided by LangChain. Instead of the original `RetrievalQA` chain we will also use the more complex `ConversationalRetrievalChain`, which handles the conversation history and feeds it to the LLM along with the queries.
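The idea behind these two pieces can be sketched in a few lines: a buffer stores every question and answer verbatim, and before retrieval the follow-up question is combined with that history so the LLM can rewrite it as a standalone question. The code below is a toy illustration under those assumptions. `BufferMemory`, `build_condense_prompt`, and the prompt wording are hypothetical and only approximate what LangChain does internally.

```python
class BufferMemory:
    """Toy stand-in for a conversation buffer: it stores every turn verbatim."""

    def __init__(self) -> None:
        self.turns: list[tuple[str, str]] = []

    def save(self, question: str, answer: str) -> None:
        """Append one question/answer exchange to the history."""
        self.turns.append((question, answer))

    def as_text(self) -> str:
        """Render the whole history as plain text for inclusion in a prompt."""
        return "\n".join(f"Human: {q}\nAssistant: {a}" for q, a in self.turns)


def build_condense_prompt(memory: BufferMemory, follow_up: str) -> str:
    """Build a prompt asking the LLM to rewrite a follow-up as a standalone question."""
    return (
        "Given the conversation below, rephrase the follow-up question "
        "as a standalone question.\n\n"
        f"Chat history:\n{memory.as_text()}\n"
        f"Follow-up question: {follow_up}\n"
        "Standalone question:"
    )


memory_sketch = BufferMemory()
memory_sketch.save(
    "How many assignments are there in the course?",
    "There are five assignments in the course.",
)
condense_prompt = build_condense_prompt(memory_sketch, "What are their topics?")
print(condense_prompt)
```

Given this prompt, a capable LLM can resolve "their" to the course assignments. The rewritten question is then used for retrieval and answering.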

In [105]:
from langchain.memory import ConversationBufferMemory
memory = ConversationBufferMemory(
    memory_key="chat_history",
    return_messages=True
)

In [106]:
from langchain.prompts import PromptTemplate

template = """Use this context to answer the question. Use up to four sentences. Keep the answer concise. Say you don't know if you don't know the answer.
{context}
Question: {question}
Helpful Answer:"""
QA_CHAIN_PROMPT = PromptTemplate(input_variables=["context", "question"],template=template,)


In [107]:
from langchain.chains import ConversationalRetrievalChain
qa_chain_memory = ConversationalRetrievalChain.from_llm(
    llm,
    retriever=chroma_db.as_retriever(),
    memory=memory,
    combine_docs_chain_kwargs={"prompt": QA_CHAIN_PROMPT}
)

In [108]:
question = "How many assignments are there in the course?"
result = qa_chain_memory({"question": question})
print(result['answer'])

There are five assignments in the course.


In [109]:
question = "What are their topics? List them as numbered list."
result = qa_chain_memory({"question": question})
print(result['answer'])

1. Introduction to word vectors
2. Derivatives and implementation of word2vec algorithm
3. Dependency parsing and neural network foundations
4. Neural Machine Translation with sequence-to-sequence, attention, and subwords
5. Self-supervised learning and fine-tuning with Transformers


In [110]:
question = "What is the third one?"
result = qa_chain_memory({"question": question})
print(result['answer'])

The topic of the third assignment is Dependency parsing and neural network foundations.


Now, after adding the conversation memory buffer, we can see that the follow-up questions do not lose the context of previous answers and give more details about them, keeping the conversation's continuity intact.

Our RAG-based Q&A chat over text files is now fully functional and works quite well.