# Welcome to Lab 2
## Here we will solve the context length problem using RAG (Retrieval-Augmented Generation)

RAG involves the following steps:
1. Splitting the document into chunks.
2. Embedding these chunks and storing the vectors in a vector DB.
3. Using FAISS (Facebook AI Similarity Search) to find the vectors closest to the question asked.
4. Sending the retrieved chunks together with the question to the LLM.

We install the following packages:

* transformers: A large and convenient package from Hugging Face with utilities for loading models from the Hugging Face Hub, fine-tuning models, creating pipelines, and loading tokenizers.
* faiss-cpu: Builds the FAISS index on the CPU; indexing and search also run on the CPU.
* sentence-transformers: A well-known package for converting sentences into embeddings.
* langchain: Another large and powerful library used for many things, such as splitting text into chunks and making calls to OpenAI. It also has its own FAISS wrapper, but we will use the faiss library directly instead.
* openai: Used to call the OpenAI models.
* python-dotenv: Loads keys into the environment from a .env file.
* PyMuPDF: Used for easy PDF manipulation (imported as `fitz`).
* tiktoken: Counts the tokens in a text.

In [None]:
%%capture
!pip install transformers faiss-cpu sentence-transformers langchain==0.0.354 pypdf openai==1.3.9 python-dotenv==1.0.0 PyMuPDF==1.24.2 tqdm tiktoken

In [None]:
from transformers import pipeline,BertTokenizer
from sentence_transformers import SentenceTransformer,util
from langchain.text_splitter import RecursiveCharacterTextSplitter
import fitz
import faiss
import numpy as np
import openai
from dotenv import load_dotenv
import os
from tqdm import tqdm
import tiktoken

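load_dotenv()  # loads OPENAI_API_KEY from a .env file, if one exists

# Use the placeholder below only when the key is not already set in the environment
os.environ.setdefault("OPENAI_API_KEY", "<Place your key here>")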
model_name = "gpt-3.5-turbo"

token_encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")

### Generating embeddings for the document chunks with the all-mpnet-base-v2 or all-MiniLM-L6-v2 model from Hugging Face

Both all-mpnet-base-v2 and all-MiniLM-L6-v2 are sentence-transformer models used for tasks like semantic search, clustering, and sentence embedding.

The all-mpnet-base-v2 model provides the best quality, while all-MiniLM-L6-v2 is 5 times faster and still offers good quality.

Roughly, these models turn a sentence into an embedding as follows:

 * The input sentence is tokenized into smaller units (tokens) using the tokenizer that belongs to the model (e.g., a BERT-style tokenizer).
 * Each token is converted into an initial vector representation (embedding) using an embedding layer.
 * The token embeddings are then passed through multiple transformer layers.
 * After passing through the transformer layers, the model has a contextualized embedding for each token.
 * The token embeddings are pooled (for example, averaged) into a single fixed-size vector that represents the entire sentence: 384 dimensions for all-MiniLM-L6-v2 and 768 for all-mpnet-base-v2.
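How the per-token vectors become one sentence vector can be sketched with masked mean pooling. This is a minimal NumPy illustration, assuming the model pools by averaging; the function name `mean_pool` and the toy arrays are made up for the example and are not part of the sentence-transformers API.

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors of each sentence, ignoring padding tokens.

    token_embeddings: (batch, tokens, dim) contextualized token vectors.
    attention_mask:   (batch, tokens) with 1 for real tokens, 0 for padding.
    """
    mask = attention_mask[:, :, None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    return summed / counts

# Toy input: 2 sentences, up to 3 tokens each, 4-dimensional token vectors.
token_embeddings = np.arange(24, dtype=np.float32).reshape(2, 3, 4)
attention_mask = np.array([[1, 1, 0],   # third token is padding
                           [1, 1, 1]])

sentence_vectors = mean_pool(token_embeddings, attention_mask)
print(sentence_vectors.shape)  # one fixed-size vector per sentence
```

Whatever the sentence length, the result has the same number of dimensions, which is what makes the vectors comparable in a single index.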

In [None]:
%%capture
encoder = SentenceTransformer("all-MiniLM-L6-v2")
# encoder = SentenceTransformer("allenai/longformer-base-4096")

### Loading the contents of a PDF and converting them to chunks to form the embeddings

In [None]:
pdf = "/content/AWS1.pdf"

#### Making the chunks
This cell creates chunks using `RecursiveCharacterTextSplitter()` with `chunk_size` set to 20 and `chunk_overlap` set to 5. Because `length_function` is `len`, both values are measured in characters, not words: each chunk holds at most 20 characters and neighbouring chunks can share up to 5 characters.

In the `for` loop we go through each page of the document, split the page text into chunks, and append that page's list of chunks to `knowledge_base`, so `knowledge_base` ends up as a list with one list of chunks per page.

The `knowledge_base` is then converted to embeddings and indexed.
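To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a simplified sliding-window chunker in plain Python. It is only a sketch: `sliding_chunks` is a hypothetical helper, and it cuts at fixed character positions, whereas `RecursiveCharacterTextSplitter` prefers to split on separators such as paragraph breaks and spaces, so its chunks will look different.

```python
def sliding_chunks(text, chunk_size=20, chunk_overlap=5):
    """Cut text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters. Hypothetical helper,
    not the algorithm used by RecursiveCharacterTextSplitter.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]

sample = "This Agreement takes effect when you use any of the Services."
chunks = sliding_chunks(sample)
for chunk in chunks:
    print(len(chunk), repr(chunk))
```

With a limit of 20 characters, each chunk carries only a few words, which explains the very short fragments in the output of the next cell.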

In [None]:
knowledge_base = []
recursive_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 20,
    chunk_overlap = 5,
    length_function = len
)
file = fitz.open(pdf)
for page in file:
  text = page.get_text()
  text = recursive_splitter.split_text(text)
  knowledge_base.append(text)

print(knowledge_base[:3])

[['AWS Customer', 'Agreementfififfff', 'Last Updated: April', '20, 2023', "See What's Changed", 'This AWS Customer', 'Agreement (this', '“Agreement”)', 'contains the terms', 'and conditions that', 'that govern your', 'your access to and', 'and use of the', 'the Services (as', '(as deﬁned below)', 'and is an agreement', 'between the', 'the applicable AWS', 'AWS Contracting', 'Party speciﬁed in', 'in Section 12 below', 'and by the name', 'name "Amazon web', 'web services" (also', 'referred to', 'as “AWS,” “we,”', '“us,” or “our”) and', 'and you or the', 'the entity " XYZ', 'XYZ Software', 'solutions" you', 'you represent', '(“you” or “your”).', 'This Agreement', 'takes eﬀect on 1st', '1st April 2023', 'or, if earlier,', 'when you use any of', 'of the Services', '(the “Eﬀective', 'Date”). You', 'You represent to us', 'us that you are', 'are lawfully able', 'able to enter into', 'into contracts', '(e.g., you are not', 'not a', '*Please note that', 'that as of April 1,', '1, 2016, customers

### This cell takes in the knowledge base and creates an index/db of the embeddings
`knowledge_base` is the array of chunks taken from the document.

`encoder.encode(knowledge_base)` converts these chunks into dense vectors, returned as a 2-D array.

`vectors.shape[1]` is the embedding dimension, i.e. the length of each vector.

`faiss.IndexFlatL2(vector_dimension)` creates a flat `faiss` index that compares vectors by Euclidean (L2) distance.

`faiss.normalize_L2(vectors)` scales each vector in place to unit length, so that ranking by L2 distance matches ranking by cosine similarity. The normalized vectors are then added to the index with `embed_index.add(vectors)`.
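Why normalize before using an L2 index? For unit-length vectors, the squared L2 distance equals 2 - 2 * cosine similarity, so the nearest neighbour by L2 is also the most similar by cosine. The NumPy sketch below checks this on random toy vectors; it stands in for `faiss` and does not call it.

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(5, 8)).astype(np.float32)   # 5 toy "chunk" embeddings
query = rng.normal(size=(1, 8)).astype(np.float32)     # 1 toy "question" embedding

# Scale every row to unit L2 length (the same operation faiss.normalize_L2 applies in place).
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
query /= np.linalg.norm(query, axis=1, keepdims=True)

squared_l2 = ((vectors - query) ** 2).sum(axis=1)
cosine = (vectors @ query.T).ravel()

# For unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
identity_holds = np.allclose(squared_l2, 2 - 2 * cosine, atol=1e-5)
same_winner = squared_l2.argmin() == cosine.argmax()
print(identity_holds, same_winner)
```

Normalization only has this effect when it is applied to both the indexed vectors and the query vector.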

In [None]:
vectors = encoder.encode(knowledge_base)
vector_dimension = vectors.shape[1]
embed_index = faiss.IndexFlatL2(vector_dimension)
faiss.normalize_L2(vectors)
embed_index.add(vectors)

In [None]:
print(embed_index)

<faiss.swigfaiss_avx2.IndexFlatL2; proxy of <Swig Object of type 'faiss::IndexFlatL2 *' at 0x7a43c4ee9b90> >


### The answer_question() function takes the question, the index of embeddings and the number of results wanted, then searches the index for the entries that best fit the question.

In this function:

1. The question is encoded into a vector, wrapped in a 2-D array of shape (1, dimension) as `faiss` expects, and normalized.
2. `index.search(_vector, results_len)` returns the distances and the indexes of the most relevant entries.
3. We use the retrieved indexes to get the values from `knowledge_base[]`.
4. `retrieved_idxs.ravel()` flattens the 2-D array of indexes into 1-D so it can be looped over.
5. `answer` stores the array of entries that are most relevant.
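The shapes involved in these steps can be seen in a brute-force stand-in for the index search. This is a sketch in NumPy, not the `faiss` implementation: `brute_force_search` and the toy data are hypothetical, and the function simply mimics the `(1, k)` output layout of a single-query search.

```python
import numpy as np

def brute_force_search(index_vectors, query_vector, k):
    """Return (distances, indices) of the k nearest rows, each of shape (1, k)."""
    query = np.asarray(query_vector, dtype=np.float32)[None, :]   # (d,) -> (1, d)
    squared_l2 = ((index_vectors - query) ** 2).sum(axis=1)
    nearest = np.argsort(squared_l2)[:k]
    return squared_l2[nearest][None, :], nearest[None, :]

toy_knowledge_base = ["chunk A", "chunk B", "chunk C", "chunk D"]
toy_vectors = np.array([[1, 0], [0, 1], [0.9, 0.1], [-1, 0]], dtype=np.float32)

distances, retrieved_idxs = brute_force_search(toy_vectors, [1.0, 0.0], k=2)
print(retrieved_idxs.shape)                 # 2-D: one row per query
hits = [toy_knowledge_base[i] for i in retrieved_idxs.ravel()]
print(hits)
```

Because there is one row per query, a single question yields a one-row result, and `ravel()` turns it into a flat list of positions in the knowledge base.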

In [None]:
def answer_question(question,index,results_len):
  """
  This function takes a question and retrieves the most relevant entries with Faiss.

  Args:
      question: The user's question as a string.
      index: The Faiss index built from the knowledge base.
      results_len: The number of results to retrieve.

  Returns:
      A dictionary whose "answer" key holds the retrieved entries.
  """
  answer = []
  search_vector = encoder.encode(question)
  _vector = np.array([search_vector])  # shape (1, dimension)
  faiss.normalize_L2(_vector)

  distances, retrieved_idxs = index.search(_vector, results_len)
  print(len(retrieved_idxs.ravel()))
  # Collect the knowledge base entries at the retrieved indexes
  for idx in retrieved_idxs.ravel():
    if idx >= 0:  # Faiss pads with -1 when fewer than results_len entries exist
      answer.append(knowledge_base[idx])

  # Return the retrieved entries for transparency
  return {"answer": answer}

### Generating the answers/chunks related to the question from the array of embeddings

This function retrieves the chunks relevant to the user's question by calling `answer_question(question,embed_index,10)`.

In the `for` loop we iterate through each retrieved entry and enclose it in numbered tags, `<ContextN></ContextN>`, starting from `<Context0>`.

The tagged chunks are then printed and returned as a single string.
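The tagging step can be tried on its own, without an index or an API key. The sketch below uses made-up passages and a hypothetical helper, `format_contexts`; unlike the notebook function it joins the blocks with blank lines instead of appending one after every block, and it adds the question in the `<Question>` tag style used later in this lab.

```python
def format_contexts(passages):
    """Wrap each passage in numbered <ContextN> tags, starting at 0."""
    blocks = [f"<Context{i}>{passage}</Context{i}>" for i, passage in enumerate(passages)]
    return "\n\n".join(blocks)

toy_passages = ["First retrieved passage.", "Second retrieved passage."]
toy_question = "What does the first passage say?"
prompt = format_contexts(toy_passages) + "\n\n<Question> " + toy_question + " </Question>"
print(prompt)
```

Numbered tags give the model clear boundaries between passages and make it possible to ask which context an answer came from.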

In [None]:
def return_RAG_passage(question,embed_index):
  RAG_passage = ''
  answer_dict = answer_question(question,embed_index,10)
  for i in range(len(answer_dict['answer'])):
    RAG_passage += '<Context'+str(i)+'>'+' '.join(answer_dict['answer'][i])
    RAG_passage += '</Context'+str(i)+'>'+'\n\n'

  print(RAG_passage)
  return RAG_passage

## Here we create the function that generates results using `openai.chat.completions.create`

In [None]:
def CallOpenAI(user,system):
  response = openai.chat.completions.create(
              model= model_name, # model = "deployment_name".
              temperature= 0,
              top_p= 0,
              messages=[
                  {"role": "system", "content": system},
                  {"role": "user", "content": user}
              ]
          )
  return response

## First we try analysing a document with less content

### We format the prompt to include the RAG chunks along with the question
### What happens in the Background?

The most relevant chunks are retrieved from the vector index and formatted so that each one sits between its own pair of tags:

`<Context0></Context0>`

`<Context1></Context1>`

.

.

.

`<ContextN></ContextN>`

In [None]:
question = "What is the governing courts for Amazon Web Services South Africa ProprietaryLimited"

rag_passage = return_RAG_passage(question,embed_index)

10
<Context0>Learn About AWS Resources for AWS Getting Started Training and and Certification Developers on AWS Developer Center SDKs & Tools
Help Help
Contact Us Get Expert Help “Indirect Taxes” means applicable taxes and duties, including, without limitation, VAT, VAT, service tax, tax, GST, excise taxes, sales and and transactions taxes, and gross receipts tax. “Intellectual Property License” means the separate license terms that that apply to your your access to and and use of AWS AWS Content and and Services located at https://aws.amazon. azon.com/legal/aws-i aws-ip-license-terms terms (and any successor or or related locations designated by us), us), as may be be updated by us us from time to to time. “Losses” means any any claims, damages, losses, liabilities, costs, and expenses (including reasonable attorneys’ fees). “Policies” means the Acceptable Use Use Policy, Privacy Notice, the Site Site Terms, the the Service Terms, the AWS Trademark Guidelines, all all restrictions des

Here we append the question, wrapped in `<Question>` tags, to the formatted RAG chunks before sending the prompt to GPT for analysis

In [None]:
full_prompt_SD = rag_passage +"\n\n" +"<Question> "+question+" </Question>"
print(len(token_encoding.encode(full_prompt_SD)))

10585


In [None]:
response = CallOpenAI(full_prompt_SD,"You are a Professional lawyer who can analyse documents thoroughly")

Finally we get the correct result, and the prompt uses fewer tokens than sending the whole document would

In [None]:
print(response.choices[0].message.content)

The governing courts for Amazon Web Services South Africa Proprietary Limited are the South Gauteng High Court in Johannesburg, as specified in the document provided.


## Now we try analysing a document with large content, the one we were facing problems with in Lab 0.1

### Loading the contents of a PDF and converting them to chunks to form the embeddings

In [None]:
pdf = "/content/PROFRAC HOLDINGS, LLC credit agreement.pdf"

#### Making the chunks of the document
This cell creates chunks using `RecursiveCharacterTextSplitter()` with `chunk_size` set to 20 and `chunk_overlap` set to 5. Because `length_function` is `len`, both values are measured in characters, not words: each chunk holds at most 20 characters and neighbouring chunks can share up to 5 characters.

In the `for` loop we go through each page of the document, split the page text into chunks, and append that page's list of chunks to `knowledge_base`, so `knowledge_base` ends up as a list with one list of chunks per page.

The `knowledge_base` is then converted to embeddings and indexed.

In [None]:
knowledge_base = []
recursive_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 20,
    chunk_overlap = 5,
    length_function = len
)
file = fitz.open(pdf) # <------Make changes in the PDF file path that you want to use
for page in file:
  text = page.get_text()
  text = recursive_splitter.split_text(text)
  knowledge_base.append(text)

print(knowledge_base[:3])

[['EX-10.1 2', '2 d497551dex101.htm', 'EX-10.1', 'Exhibit 10.1', 'Execution Version', 'FOURTH AMENDMENT', 'TO TERM LOAN CREDIT', 'AGREEMENT', 'THIS FOURTH', 'AMENDMENT TO TERM', 'TERM LOAN CREDIT', 'AGREEMENT (this', '“Amendment”), dated', 'as of February 1,', '1, 2023, relating', 'to the', 'Credit Agreement', 'referred to below,', 'is made by and', 'and among PROFRAC', 'HOLDINGS II, LLC, a', 'a Texas limited', 'liability company', '(the “Borrower”),', 'PROFRAC HOLDINGS,', 'LLC, a Texas', 'limited liability', 'company', '(“Holdings”), the', 'the Guarantors', 'party hereto, each', 'each of the', 'the Additional Term', 'Term B Loan', 'Lenders (as defined', 'below), each of the', 'the other Lenders', 'party hereto, as', 'as required, as the', 'the case may be, by', 'by the terms of', 'of this Amendment', 'and the Existing', 'Credit Agreement,', 'and PIPER SANDLER', 'FINANCE LLC, as the', 'the Agent and the', 'the Collateral', 'Agent for the', 'the Lenders.', 'RECITALS', 'WHEREAS, the', 't

### This cell takes in the knowledge base and creates an index/db of the embeddings
`knowledge_base` is the array of chunks taken from the document.

`encoder.encode(knowledge_base)` converts these chunks into dense vectors, returned as a 2-D array.

`vectors.shape[1]` is the embedding dimension, i.e. the length of each vector.

`faiss.IndexFlatL2(vector_dimension)` creates a flat `faiss` index that compares vectors by Euclidean (L2) distance.

`faiss.normalize_L2(vectors)` scales each vector in place to unit length, so that ranking by L2 distance matches ranking by cosine similarity. The normalized vectors are then added to the index with `embed_index.add(vectors)`.

In [None]:
vectors = encoder.encode(knowledge_base)
vector_dimension = vectors.shape[1]
embed_index = faiss.IndexFlatL2(vector_dimension)
faiss.normalize_L2(vectors)
embed_index.add(vectors)

In [None]:
print(embed_index)

<faiss.swigfaiss_avx2.IndexFlatL2; proxy of <Swig Object of type 'faiss::IndexFlatL2 *' at 0x7a43c4df3de0> >


## Just like before, we format the RAG chunks to be enclosed in `<Context>` tags

In [None]:
question = "What is the Acknowledgement Regarding Any Supported QFCs?"

rag_passage = return_RAG_passage(question,embed_index)

10
<Context0>7.23  FCPA
   118 7.24  Sanctioned Persons 118 7.25  Designation of Senior Debt 118 7.26  Insurance 118 7.27  FTS Assets 118 ARTICLE VIII AFFIRMATIVE AND AND NEGATIVE COVENANTS 8.1   Taxes 118 8.2   Legal Existence and Good Good Standing 119 8.3   Compliance with Law; Law; Maintenance of of Licenses 119 8.4   Maintenance of Property, Inspection 119 8.5   Insurance 120 8.6   Environmental Laws 121 8.7   Compliance with ERISA 121 8.8   Dispositions 121 8.9   Mergers, Consolidations, etc 121 8.10  Distributions 122 8.11  Investments 126 
8.12  Debt 126 8.13  Prepayments of Debt 130 8.14  Transactions with Affiliates 131 8.15  Business Conducted 134 
8.16  Liens 134 8.17  Restrictive Agreements 134 8.18  Restrictions on FTS Acquisition Transactions 136 8.19  Fiscal Year; Accounting 136 8.20  Financial Covenants 137 8.21  Information Regarding Collateral 138 8.22  Ratings 138 8.23  Additional Obligors; Covenant to Give Security 138 8.24  Use of of Proceeds 140 8.25  Further Ass

## Adding the question to the RAG chunks

### Also you can see that the token count has decreased significantly to 8347, from 163227 before.
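To put the two token counts in proportion, the short calculation below uses only the figures quoted above; the variable names are illustrative.

```python
tokens_full_document = 163227   # prompt size quoted for the earlier attempt
tokens_with_rag = 8347          # prompt size printed in this lab

fraction_kept = tokens_with_rag / tokens_full_document
reduction_factor = tokens_full_document / tokens_with_rag
print(f"RAG prompt is {fraction_kept:.1%} of the original size")
print(f"Reduction factor: {reduction_factor:.1f}x")
```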

In [None]:
full_prompt_LD = rag_passage +"\n\n" +question
print(len(token_encoding.encode(full_prompt_LD)))

8347


In [None]:
response = CallOpenAI(full_prompt_LD,"You are a Professional lawyer who can analyse documents thoroughly")

### Finally we get the answer, where the previous lab threw an error

In [None]:
print(response.choices[0].message.content)

The "Acknowledgement Regarding Any Supported QFCs" is a provision in a legal document that addresses the treatment of hedge agreements or other agreements or instruments that are Qualified Financial Contracts (QFCs) and are supported by the loan documents. 

In this provision, the parties acknowledge and agree that in the event of a proceeding under a U.S. Special Resolution Regime involving a Covered Entity that is a party to a Supported QFC, the transfer of such Supported QFC and the benefit of QFC Credit Support will be effective to the same extent as if the Supported QFC and QFC Credit Support were governed by the U.S. Special Resolution Regimes. This provision outlines the effects of a Bail-In Action on such liabilities, including potential reduction, conversion into shares or other instruments of ownership, or variation of terms.

The provision also clarifies the rights and obligations of the parties in relation to the resolution power of the Federal Deposit Insurance Corporation