# <font color='cyan'>**Applying RAG - LLM - Prompt Engineering for Text Analysis in Legal Documents**</font>

## Installing and Loading Packages

In [32]:
!pip install -q faiss-cpu --quiet

In [33]:
!pip install -q pymupdf --quiet

In [34]:
!pip install -q sentence-transformers --quiet

In [35]:
!pip install -q openai --quiet

In [36]:
!pip install -q langchain --quiet

In [37]:
# Imports
import langchain
import textwrap
from langchain import PromptTemplate
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.document_loaders import PyMuPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI
import warnings
warnings.filterwarnings('ignore')

In [38]:
%env TOKENIZERS_PARALLELISM=true

env: TOKENIZERS_PARALLELISM=true


The TOKENIZERS_PARALLELISM environment variable is read by the Hugging Face tokenizers library. It controls whether tokenizers (which convert text into tokens, the smaller units that language models process) run in parallel.

When set to true, tokenizers may use multiple threads, which speeds up batch tokenization but can lead to deadlocks if the process is forked after a tokenizer has already been used.

When set to false, parallelism is disabled, which avoids these concurrency issues in environments that use multiple threads or processes.
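Outside a notebook, where the `%env` magic is not available, the same variable can be set from plain Python. The snippet below is a minimal sketch that chooses `"false"` (the conservative setting described above); it assumes the assignment runs before any tokenizer is created.

```python
import os

# Set the variable before any tokenizer is created.
# "false" disables tokenizer-level parallelism (the conservative choice).
os.environ["TOKENIZERS_PARALLELISM"] = "false"

print(os.environ["TOKENIZERS_PARALLELISM"])
```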

In [39]:
# Function to load the PDF
def load_pdf(file_path):

    # Creates an instance of the PyMuPDFLoader class, passing the PDF file path as an argument.
    loader = PyMuPDFLoader(file_path=file_path)

    # Uses the 'load' method of the 'loader' object to load the PDF content.
    # This returns a list of Document objects, one per PDF page, each with the page text and metadata.
    docs = loader.load()

    # Returns the loaded content of the PDF.
    return docs

In [40]:
# Function to divide documents into several chunks
def split_docs(documents, chunk_size = 1000, chunk_overlap = 20):

     # Creates an instance of the RecursiveCharacterTextSplitter class.
     # This class divides long texts into smaller chunks.
     # 'chunk_size' defines the size of each chunk, and 'chunk_overlap' defines the overlap between consecutive chunks.
     text_splitter = RecursiveCharacterTextSplitter(chunk_size = chunk_size, chunk_overlap = chunk_overlap)

     # Uses the 'split_documents' method of the 'text_splitter' object to split the provided document.
     # 'documents' is a variable that contains the text or set of texts to be divided.
     chunks = text_splitter.split_documents(documents = documents)

     # Returns the chunks of text resulting from the split.
     return chunks
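To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified pure-Python sketch. It is not the algorithm used by RecursiveCharacterTextSplitter (which prefers to split on separators such as paragraph and line breaks); `sliding_chunks` is a hypothetical helper that cuts fixed-size character windows only to illustrate how consecutive chunks share text.

```python
def sliding_chunks(text, chunk_size=1000, chunk_overlap=20):
    """Cut fixed-size character windows; each window starts
    (chunk_size - chunk_overlap) characters after the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]

# Small values make the overlap visible.
sample = "abcdefghijklmnopqrstuvwxyz"
demo_chunks = sliding_chunks(sample, chunk_size=10, chunk_overlap=3)
for chunk in demo_chunks:
    print(chunk)
```

With these toy settings, the last three characters of each chunk reappear at the start of the next one, which is what keeps a sentence that straddles a boundary retrievable from either side.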

In [41]:
# Load the embeddings model
def load_embedding_model(model_path, normalize_embedding=True):

     # Returns an instance of the HuggingFaceEmbeddings class.
     # 'model_name' is the identifier of the embeddings model to be loaded.
     # 'model_kwargs' is a dictionary of additional arguments for model configuration, in this case setting the device to 'cpu'.
     # 'encode_kwargs' is a dictionary of arguments for the encoding method, here specifying whether embeddings should be normalized.
     return HuggingFaceEmbeddings(model_name = model_path,
                                  model_kwargs = {'device':'cpu'},
                                  encode_kwargs = {'normalize_embeddings': normalize_embedding})
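The reason normalization matters can be shown with a short sketch: once vectors have unit length, their dot product equals their cosine similarity. The helper names (`l2_normalize`, `dot`, `cosine`) and the toy vectors are illustrative assumptions, not part of the notebook or of any library.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

u, v = [3.0, 4.0], [1.0, 2.0]
u_hat, v_hat = l2_normalize(u), l2_normalize(v)

# For unit vectors the two quantities coincide (up to floating-point error).
print(dot(u_hat, v_hat), cosine(u, v))
```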

In [42]:
# Function to embed the chunks and index the embeddings with FAISS
def create_embeddings(chunks, embedding_model, storing_path = "model/vectorstore"):

     # Creates a 'vectorstore' (a FAISS index) from the given documents.
     # 'chunks' is the list of text segments and 'embedding_model' is the embedding model used to convert text to embeddings.
     vectorstore = FAISS.from_documents(chunks, embedding_model)

     # Saves the created 'vectorstore' to a local path specified by 'storing_path'.
     # This allows persistence of the FAISS index for future use.
     vectorstore.save_local(storing_path)

     # Returns the created 'vectorstore', which contains the embeddings and can be used for similarity search and comparison operations.
     return vectorstore
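Conceptually, a similarity search over the vectorstore ranks the stored chunk vectors by their distance to the query vector. The brute-force sketch below illustrates that idea with squared Euclidean distance; it is an assumption-laden toy (the `top_k` helper, the two-dimensional vectors, and the chunk labels are all hypothetical) and does not reproduce FAISS internals.

```python
def top_k(query_vec, stored, k=2):
    """Return the k (text, squared_distance) pairs closest to query_vec."""
    scored = [
        (text, sum((q - x) ** 2 for q, x in zip(query_vec, vec)))
        for text, vec in stored
    ]
    return sorted(scored, key=lambda pair: pair[1])[:k]

# Toy "index": each entry pairs a chunk label with a made-up embedding.
toy_store = [
    ("chunk about elections", [1.0, 0.0]),
    ("chunk about courts", [0.0, 1.0]),
    ("chunk about parliament", [0.9, 0.1]),
]

hits = top_k([1.0, 0.0], toy_store, k=2)
for text, distance in hits:
    print(text, round(distance, 3))
```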

In [43]:
# Creating the chain
def load_qa_chain(retriever, llm, prompt):

     # Returns an instance of the RetrievalQA class, which chains document retrieval and answer generation for Question Answering (QA).
     # 'llm' is the large language model that generates the answer (here, an OpenAI completion model).
     # 'retriever' is the component that fetches the chunks most relevant to the question from the vectorstore.
     # 'chain_type' defines how the retrieved chunks are passed to the LLM. "stuff" inserts ("stuffs") all retrieved chunks into a single prompt.
     # 'return_source_documents': when True, the retrieved source documents are returned along with the answer.
     # 'chain_type_kwargs' is a dictionary of additional arguments specific to the chosen chain type. Here, it passes the custom 'prompt'.
     return RetrievalQA.from_chain_type(llm = llm,
                                        retriever = retriever,
                                        chain_type = "stuff",
                                        return_source_documents = True,
                                        chain_type_kwargs = {'prompt': prompt})
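The "stuff" strategy can be sketched in a few lines of plain Python: the retrieved chunk texts are joined into one context string, which is then substituted into the prompt together with the question. `stuff_prompt`, the toy template, and the blank-line separator are assumptions made for illustration; the real chain handles this internally.

```python
def stuff_prompt(template_text, chunk_texts, question, separator="\n\n"):
    """Join all retrieved chunks into one context and fill the template."""
    context = separator.join(chunk_texts)
    return template_text.format(context=context, question=question)

toy_template = "Context:\n{context}\n\nQuestion: {question}\nAnswer:"
filled = stuff_prompt(
    toy_template,
    ["First chunk.", "Second chunk."],
    "What is stated?",
)
print(filled)
```

Because every retrieved chunk ends up in a single prompt, this strategy only works while the combined chunks fit in the model's context window.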

In [44]:
# Function to obtain LLM (Large Language Model) answers
def get_response(query, chain):

     # Invokes the 'chain' (the Question Answering pipeline) with the provided 'query'.
     # 'chain' is a callable RetrievalQA object that receives a query and returns a dictionary with the LLM answer under the 'result' key.
     response = chain({'query': query})

     # Uses the textwrap library to format the response. 'textwrap.fill' wraps the response text into lines of specified width (100 characters in this case),
     # making it easier to read in environments like Jupyter Notebook.
     wrapped_text = textwrap.fill(response['result'], width=100)

     # Print formatted text
     print(wrapped_text)
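Since the chain is built with `return_source_documents = True`, the response dictionary also carries the retrieved chunks, which helps when checking an answer against the text it came from. The sketch below assumes the response has `'result'` and `'source_documents'` keys and that each document exposes `page_content` and a `metadata` dictionary with a `'page'` entry; `format_sources` is a hypothetical helper, exercised here with a stand-in object rather than a real chain.

```python
import textwrap
from types import SimpleNamespace

def format_sources(response, width=100, preview_chars=80):
    """Format the answer followed by a short preview of each source chunk."""
    lines = [textwrap.fill(response["result"], width=width), "", "Sources:"]
    for doc in response["source_documents"]:
        page = doc.metadata.get("page", "?")
        preview = doc.page_content[:preview_chars].replace("\n", " ")
        lines.append(f"- page {page}: {preview}")
    return "\n".join(lines)

# Stand-in for a real chain response, used only to exercise the helper.
fake_response = {
    "result": "A short answer.",
    "source_documents": [
        SimpleNamespace(page_content="Some retrieved text.", metadata={"page": 3})
    ],
}
print(format_sources(fake_response))
```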

In [45]:
# Defining the OpenAI LLM; the API key is read from the OPENAI_API_KEY environment variable instead of being hardcoded in the notebook
import os
llm_api = OpenAI(openai_api_key=os.environ["OPENAI_API_KEY"])

Model card for the embeddings model used below: https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2

In [46]:
# Load the Embedding model
embed = load_embedding_model(model_path = "all-MiniLM-L6-v2")

In [47]:
# Load the PDF file
docs = load_pdf(file_path = "/content/COI.pdf")

In [48]:
# Split the file into chunks
documents = split_docs(documents = docs)

In [49]:
# Create the vectorstore
vectorstore = create_embeddings(documents, embed)

In [50]:
# Convert vectorstore to a retriever
retriever = vectorstore.as_retriever()

In [51]:
template = """
### System:
You act as a legal assistant. Your job is to answer the client's questions using only the context provided to you. If you don't know the answer, simply indicate that you don't have that knowledge. Don't try to create fictional answers.

### Context:
{context}

### User:
{question}

### Response:
"""

In [52]:
# Creating the prompt from the template
prompt = PromptTemplate.from_template(template)
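The chain fills the prompt through two placeholders, `{context}` and `{question}`, so a quick sanity check on the template text can catch a misspelled placeholder before the chain is built. The sketch below uses only the standard library; `placeholder_names` is a hypothetical helper and `demo_template` is a shortened stand-in for the notebook's template.

```python
from string import Formatter

def placeholder_names(template_text):
    """Return the set of {placeholder} names found in a format string."""
    return {name for _, name, _, _ in Formatter().parse(template_text) if name}

demo_template = "### Context:\n{context}\n\n### User:\n{question}\n\n### Response:\n"

# Any name listed here is expected by the chain but absent from the template.
missing = {"context", "question"} - placeholder_names(demo_template)
print(missing or "template has both placeholders")
```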

In [53]:
# Creating the chain (pipeline)
K_chain = load_qa_chain(retriever, llm_api, prompt)

In [54]:
get_response("Who should be held responsible for damages caused to third parties?", K_chain)

The jurisdiction of all courts is excluded, except for the Supreme Court under article 136, with
respect to disputes or complaints referred to in clause (1). The administrative tribunal may also
receive representations and make orders for redress of grievances, as specified by the President.
Additionally, the tribunal has the power to punish for contempt. However, if the damages were caused
by the State or Union of India, they may be held responsible and may be sued in relation to their
respective affairs.


In [55]:
get_response("Is a natural person a legal entity of internal public law?", K_chain)

No, a natural person is not considered a legal entity of internal public law. They are not
incorporated and do not have the same rights and obligations as a legal entity.


In [56]:
get_response("What is the minimum age to contest in a parliament election?", K_chain)

In order to be qualified to be chosen to fill a seat in Parliament, a person must be at least 25
years of age for a seat in the House of the People and at least 30 years of age for a seat in the
Council of States. This requirement is outlined in article 84 of the Constitution. However, it is
important to note that this age requirement may be subject to change if any laws are made by
Parliament in the future.


In [61]:
get_response("What is covered under Article 20?", K_chain)

Article 20 covers (a) any offence committed before the enactment of the Constitution, (b) any
dispute in respect of any right, liability or obligation under article 314 as originally enacted,
(c) the reports of a Commission appointed under clause (1) of article 340, and (d) any other
provisions of the Constitution.


In [58]:
get_response("Which article should I refer if I want to know more about intellectual property?", K_chain)

Article 49 of the Constitution deals with patents, inventions and designs; copyright; trade-marks
and merchandise marks. This article provides for the regulation and protection of intellectual
property rights in India.


In [65]:
get_response(" What all proof do I require to prove my innocence?", K_chain)

You may require evidence to prove your innocence, such as alibi witnesses, surveillance footage, or
other forms of evidence that can support your defense. It is important to consult with a lawyer who
can advise you on the specific proof that may be needed for your particular case.
