## Overview

This project develops a generative question-answering pipeline using the retrieval-augmented generation ([RAG](https://www.deepset.ai/blog/llms-retrieval-augmentation)) approach with Haystack 2.0. The query pipeline has four main components: [SentenceTransformersTextEmbedder](https://docs.haystack.deepset.ai/docs/sentencetransformerstextembedder) for creating an embedding of the user query, [InMemoryEmbeddingRetriever](https://docs.haystack.deepset.ai/docs/inmemoryembeddingretriever) for fetching relevant documents, [PromptBuilder](https://docs.haystack.deepset.ai/docs/promptbuilder) for filling in a prompt template, and [GoogleAIGeminiGenerator](https://haystack.deepset.ai/integrations/google-ai) for generating responses.


# Creating a RAG Pipeline Using Haystack 2.0

- **Project Title**: NyAI Saathi
- **Components Used**: [`InMemoryDocumentStore`](https://docs.haystack.deepset.ai/docs/inmemorydocumentstore), [`SentenceTransformersDocumentEmbedder`](https://docs.haystack.deepset.ai/docs/sentencetransformersdocumentembedder), [`SentenceTransformersTextEmbedder`](https://docs.haystack.deepset.ai/docs/sentencetransformerstextembedder), [`InMemoryEmbeddingRetriever`](https://docs.haystack.deepset.ai/docs/inmemoryembeddingretriever), [`PromptBuilder`](https://docs.haystack.deepset.ai/docs/promptbuilder), [`GoogleAIGeminiGenerator`](https://haystack.deepset.ai/integrations/google-ai)
- **Prerequisites**: You must have a [Gemini API key](https://ai.google.dev/).
- **Goal**: To create a RAG application that helps legal professionals retrieve legal documents efficiently.

> This application uses Haystack 2.0. To learn more, read the [Haystack 2.0 announcement](https://haystack.deepset.ai/blog/haystack-2-release) or visit the [Haystack 2.0 Documentation](https://docs.haystack.deepset.ai/docs/intro).

## Preparing the Colab Environment

- [Enable GPU Runtime in Colab](https://docs.haystack.deepset.ai/docs/enabling-gpu-acceleration)
- [Set logging level to INFO](https://docs.haystack.deepset.ai/docs/logging)

## Installing Haystack

Install Haystack 2.0 and other required packages with `pip`:

In [None]:
%%bash

pip install haystack-ai
pip install "datasets>=2.6.1"
pip install "sentence-transformers>=2.2.0"

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
cudf-cu12 24.4.1 requires pyarrow<15.0.0a0,>=14.0.1, but you have pyarrow 17.0.0 which is incompatible.
ibis-framework 8.0.0 requires pyarrow<16,>=2, but you have pyarrow 17.0.0 which is incompatible.


## Fetching and Indexing Documents

You'll start creating your question answering system by downloading the data and indexing it, together with its embeddings, in a DocumentStore.

In this tutorial, you will take a simple approach to writing documents and their embeddings into the DocumentStore. For a full indexing pipeline with preprocessing, cleaning and splitting, check out our tutorial on [Preprocessing Different File Types](https://haystack.deepset.ai/tutorials/30_file_type_preprocessing_index_pipeline).


### Initializing the DocumentStore

Initialize a DocumentStore to index your documents. A DocumentStore stores the Documents that the question answering system uses to find answers to your questions. In this tutorial, you'll be using the `InMemoryDocumentStore`.

In [None]:
from haystack.document_stores.in_memory import InMemoryDocumentStore

document_store = InMemoryDocumentStore(embedding_similarity_function="cosine")

> `InMemoryDocumentStore` is the simplest DocumentStore to get started with. It requires no external dependencies and it's a good option for smaller projects and debugging. But it doesn't scale up so well to larger Document collections, so it's not a good choice for production systems. To learn more about the different types of external databases that Haystack supports, see [DocumentStore Integrations](https://haystack.deepset.ai/integrations?type=Document+Store).
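
The `embedding_similarity_function="cosine"` setting means documents are compared to the query by the cosine of the angle between their embeddings. The following is a minimal plain-Python sketch of that measure, for illustration only; `cosine_similarity` is a hypothetical helper and not Haystack's internal implementation.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen for this sketch: a zero vector matches nothing
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))  # same direction: 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))  # orthogonal: 0.0
```

Because the measure ignores vector length, two embeddings pointing in the same direction score 1.0 regardless of their magnitudes.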

The DocumentStore is now ready. Now it's time to fill it with some Documents.

In [None]:
from huggingface_hub import login

login()

In [None]:
from datasets import load_dataset
from haystack import Document

dataset = load_dataset("opennyaiorg/InJudgements_dataset", split="train", token=True)
# Put each labelled field on its own line so the values do not run together
docs = [
    Document(content=f"Title: {doc['Titles']}\nCourt name: {doc['Court_Name']}\nJudgement text: {doc['Text']}\nCase type: {doc['Case_Type']}\nCourt type: {doc['Court_Type']}\nDoc_url (reference): {doc['Doc_url']}", meta={"Titles": doc["Titles"], "Doc_url": doc["Doc_url"], "Doc_size": doc["Doc_size"]})
    for doc in dataset
]

### Fetch the Data

You'll use Indian legal judgements from the [Indian Kanoon website](https://indiankanoon.org/) as Documents. A group working in the same domain has already preprocessed the data and uploaded it to Hugging Face as a dataset: [Indian Legal Judgement Data](https://huggingface.co/datasets/opennyaiorg/InJudgements_dataset), so no additional cleaning or splitting is needed.

The `load_dataset` cell above fetches the data and converts each record into a Haystack Document.
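
Each Document's content is assembled from several dataset columns. A small helper can keep the labels and separators in one place and skip missing values. This is a sketch only: it assumes the column names used in this notebook (`Titles`, `Court_Name`, `Text`, `Case_Type`, `Court_Type`, `Doc_url`), and `build_content` and the sample record are hypothetical.

```python
# Labels paired with the dataset columns used in this notebook.
FIELDS = [
    ("Title", "Titles"),
    ("Court name", "Court_Name"),
    ("Judgement text", "Text"),
    ("Case type", "Case_Type"),
    ("Court type", "Court_Type"),
    ("Doc_url (reference)", "Doc_url"),
]

def build_content(record):
    """Join the labelled fields of one record, one per line, skipping empty values."""
    lines = []
    for label, column in FIELDS:
        value = record.get(column)
        if value:  # skips None and empty strings
            lines.append(f"{label}: {value}")
    return "\n".join(lines)

# Made-up record for illustration; it is not taken from the dataset.
sample = {
    "Titles": "A vs B",
    "Court_Name": "Example High Court",
    "Text": "...",
    "Case_Type": None,
    "Doc_url": "https://example.org/doc/1",
}
print(build_content(sample))
```

With a helper like this, the list comprehension could pass `content=build_content(doc)` instead of a long concatenation.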

### Initialize a Document Embedder

To store your data in the DocumentStore with embeddings, initialize a [SentenceTransformersDocumentEmbedder](https://docs.haystack.deepset.ai/docs/sentencetransformersdocumentembedder) with the model name and call `warm_up()` to download the embedding model.

> If you'd like, you can use a different [Embedder](https://docs.haystack.deepset.ai/docs/embedders) for your documents.

In [None]:
from haystack.components.embedders import SentenceTransformersDocumentEmbedder

doc_embedder = SentenceTransformersDocumentEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")
doc_embedder.warm_up()

### Write Documents to the DocumentStore

Run the `doc_embedder` with the Documents. The embedder creates an embedding for each document and saves it in the Document object's `embedding` field. Then, write the Documents to the DocumentStore with the `write_documents()` method, which returns the number of documents written.

In [None]:
docs_with_embeddings = doc_embedder.run(docs)
document_store.write_documents(docs_with_embeddings["documents"])

11970
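
The embedder processes documents in batches, so the number of forward passes is the document count divided by the batch size, rounded up. The sketch below works through that arithmetic for the 11,970 documents written above. The batch size of 32 is an assumption (check the embedder's `batch_size` setting for the actual value), and `batch_count` and `iter_batches` are hypothetical helpers.

```python
import math

def batch_count(n_items, batch_size):
    """Number of batches needed to cover n_items."""
    return math.ceil(n_items / batch_size)

def iter_batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Assumed batch size of 32 for the 11,970 documents in this notebook.
print(batch_count(11970, 32))  # 375

# The last batch is smaller when the count is not a multiple of the batch size.
print([len(batch) for batch in iter_batches(list(range(70)), 32)])  # [32, 32, 6]
```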

## Building the RAG Pipeline

The next step is to build a [Pipeline](https://docs.haystack.deepset.ai/docs/pipelines) to generate answers for the user query following the RAG approach. To create the pipeline, you first need to initialize each component, add them to your pipeline, and connect them.

### Initialize a Text Embedder

Initialize a text embedder to create an embedding for the user query. The created embedding will later be used by the Retriever to retrieve relevant documents from the DocumentStore.

> ⚠️ You used the `sentence-transformers/all-MiniLM-L6-v2` model to create embeddings for your documents, so you need to use the same model to embed the user queries.

In [None]:
!pip install google-ai-haystack

In [None]:
from haystack.components.embedders import SentenceTransformersTextEmbedder

text_embedder = SentenceTransformersTextEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")

### Initialize the Retriever

Initialize an [InMemoryEmbeddingRetriever](https://docs.haystack.deepset.ai/docs/inmemoryembeddingretriever) and make it use the InMemoryDocumentStore you initialized earlier in this tutorial. This Retriever fetches the documents most relevant to the query.

In [None]:
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever

retriever = InMemoryEmbeddingRetriever(document_store, top_k=10)
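
Conceptually, an embedding retriever scores every stored embedding against the query embedding and keeps the `top_k` best matches. The sketch below shows that idea with toy two-dimensional vectors. `retrieve_top_k` and the toy store are hypothetical, and this is not how Haystack implements retrieval internally.

```python
import heapq
import math

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0

def retrieve_top_k(query_embedding, store, top_k=10):
    """Return (score, doc_id) pairs for the top_k most similar embeddings, best first."""
    scored = ((cosine(query_embedding, emb), doc_id) for doc_id, emb in store.items())
    return heapq.nlargest(top_k, scored)

# Toy store mapping document ids to made-up embeddings.
toy_store = {"doc_a": [1.0, 0.0], "doc_b": [0.0, 1.0], "doc_c": [0.7, 0.7]}
results = retrieve_top_k([1.0, 0.1], toy_store, top_k=2)
print([doc_id for _, doc_id in results])  # ['doc_a', 'doc_c']
```

A larger `top_k` gives the generator more context at the cost of a longer prompt.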

### Define a Template Prompt

Create a custom prompt for a generative question answering task using the RAG approach. The prompt should take in two parameters: `documents`, which are retrieved from a document store, and a `question` from the user. Use the Jinja2 looping syntax to combine the content of the retrieved documents in the prompt.

Next, initialize a [PromptBuilder](https://docs.haystack.deepset.ai/docs/promptbuilder) instance with your prompt template. The PromptBuilder, when given the necessary values, will automatically fill in the variable values and generate a complete prompt. This approach allows for a more tailored and effective question-answering experience.

In [None]:
from haystack.components.builders import PromptBuilder

template = """
You're a legal research assistant.
Given the following context of judgements, answer the user query and provide references as clickable markdown links to each case's Doc_url.

Context:
{% for document in documents %}
    {{ document.content }}
{% endfor %}

Question: {{question}}
Answer:
"""

prompt_builder = PromptBuilder(template=template)
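
To see what the `{% for %}` loop contributes, here is a plain-Python approximation of the lower part of the rendered prompt: each retrieved document's content is listed under `Context:`, followed by the question. `render_prompt` and the sample contents are hypothetical. The real `PromptBuilder` renders the template with Jinja2, so exact whitespace will differ.

```python
def render_prompt(contents, question):
    """Approximate the Context/Question/Answer part of the template."""
    # One indented line per document, like the body of the template's for loop.
    context = "\n".join(f"    {content}" for content in contents)
    return (
        "Context:\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

# Made-up document contents for illustration.
sample_contents = ["Title: Case one ...", "Title: Case two ..."]
prompt = render_prompt(sample_contents, "Which cases concern land disputes?")
print(prompt)
```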

### Initialize a Generator


Generators are the components that interact with large language models (LLMs). Set the `GOOGLE_API_KEY` environment variable, which the [GoogleAIGeminiGenerator](https://haystack.deepset.ai/integrations/google-ai) reads to communicate with Google's Gemini models. The generator itself is initialized with a model name when it is added to the pipeline in the next section.

In [None]:
import os
from getpass import getpass

from haystack_integrations.components.generators.google_ai import GoogleAIGeminiGenerator

# Prompt for the key instead of hardcoding it, so it never ends up in the saved notebook
if "GOOGLE_API_KEY" not in os.environ:
    os.environ["GOOGLE_API_KEY"] = getpass("Enter Gemini API key:")

> You can replace `GoogleAIGeminiGenerator` in your pipeline with another `Generator`. Check out the full list of generators [here](https://docs.haystack.deepset.ai/docs/generators).

### Build the Pipeline

To build a pipeline, add all components to your pipeline and connect them. Create connections from the "embedding" output of `text_embedder` to the "query_embedding" input of `retriever`, from `retriever` to `prompt_builder`, and from `prompt_builder` to `llm`. Explicitly connect the output of `retriever` to the "documents" input of `prompt_builder`, because `prompt_builder` has two inputs ("documents" and "question").

For more information on pipelines and creating connections, refer to [Creating Pipelines](https://docs.haystack.deepset.ai/docs/creating-pipelines) documentation.

In [None]:
from haystack import Pipeline

basic_rag_pipeline = Pipeline()
# Add components to your pipeline
basic_rag_pipeline.add_component("text_embedder", text_embedder)
basic_rag_pipeline.add_component("retriever", retriever)
basic_rag_pipeline.add_component("prompt_builder", prompt_builder)
basic_rag_pipeline.add_component("llm", GoogleAIGeminiGenerator(model="gemini-1.5-flash-latest"))

# Now, connect the components to each other
basic_rag_pipeline.connect("text_embedder.embedding", "retriever.query_embedding")
basic_rag_pipeline.connect("retriever", "prompt_builder.documents")
basic_rag_pipeline.connect("prompt_builder", "llm")

<haystack.core.pipeline.pipeline.Pipeline object at 0x7b8d0e19cd30>
🚅 Components
  - text_embedder: SentenceTransformersTextEmbedder
  - retriever: InMemoryEmbeddingRetriever
  - prompt_builder: PromptBuilder
  - llm: GoogleAIGeminiGenerator
🛤️ Connections
  - text_embedder.embedding -> retriever.query_embedding (List[float])
  - retriever.documents -> prompt_builder.documents (List[Document])
  - prompt_builder.prompt -> llm.parts (str)

That's it! Your RAG pipeline is ready to generate answers to questions!
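
To make the connection semantics concrete, the toy sketch below wires plain functions together using the same `sender.output -> receiver.input` pairs listed above. Every function here is a hypothetical stand-in that returns canned values. It illustrates how outputs are forwarded to named inputs, not how Haystack actually executes a pipeline.

```python
# Toy components: each takes keyword inputs and returns a dict of named outputs.
def toy_text_embedder(text):
    return {"embedding": [float(len(text)), 1.0]}  # fake embedding

def toy_retriever(query_embedding):
    return {"documents": ["doc about land", "doc about tenancy"]}  # canned result

def toy_prompt_builder(documents, question):
    return {"prompt": "Context: " + "; ".join(documents) + " Question: " + question}

def toy_llm(parts):
    return {"replies": [f"(answer based on a {len(parts)}-character prompt)"]}

# Insertion order doubles as execution order in this sketch.
components = {
    "text_embedder": toy_text_embedder,
    "retriever": toy_retriever,
    "prompt_builder": toy_prompt_builder,
    "llm": toy_llm,
}

# (sender, output) -> (receiver, input), mirroring the connections listed above.
connections = {
    ("text_embedder", "embedding"): ("retriever", "query_embedding"),
    ("retriever", "documents"): ("prompt_builder", "documents"),
    ("prompt_builder", "prompt"): ("llm", "parts"),
}

def run_toy_pipeline(inputs):
    """Run components in order, forwarding each output along its connection."""
    pending = {name: dict(inputs.get(name, {})) for name in components}
    outputs = {}
    for name, component in components.items():
        outputs[name] = component(**pending[name])
        for output_name, value in outputs[name].items():
            target = connections.get((name, output_name))
            if target:
                receiver, input_name = target
                pending[receiver][input_name] = value
    return outputs

toy_question = "Which cases concern land disputes?"
toy_result = run_toy_pipeline(
    {"text_embedder": {"text": toy_question}, "prompt_builder": {"question": toy_question}}
)
print(toy_result["llm"]["replies"][0])
```

Note that `prompt_builder` receives `documents` through a connection but `question` directly from the caller, which is why the question must be supplied at run time.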

## Asking a Question

When asking a question, use the `run()` method of the pipeline. Make sure to provide the question to both the `text_embedder` and the `prompt_builder`. This ensures that the `{{question}}` variable in the template prompt gets replaced with your specific question.

In [None]:
question = "I'm a lawyer and currently handling a case of property dispute, It's a case of illegal occupation of land. tell me about some previous cases that are similar to mine"

response = basic_rag_pipeline.run({"text_embedder": {"text": question}, "prompt_builder": {"question": question}, "retriever": {"top_k": 10}})

from IPython.display import HTML
from markdown import markdown

# Convert the model's markdown reply to HTML so the notebook renders it
def Markdown(string):
    return HTML(markdown(string))

Markdown(response["llm"]["replies"][0])
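
Each call to `run()` repeats the question in two places. A small helper can build the input dictionary once. This is a sketch under the assumption that the component names match this notebook (`text_embedder`, `prompt_builder`, `retriever`, `llm`); `build_inputs`, `ask`, and `EchoPipeline` are hypothetical names.

```python
def build_inputs(question, top_k=10):
    """Input dictionary for the pipeline: the question feeds both the embedder and the prompt."""
    return {
        "text_embedder": {"text": question},
        "prompt_builder": {"question": question},
        "retriever": {"top_k": top_k},
    }

def ask(pipeline, question, top_k=10):
    """Run any object with a compatible run() method and return the first reply."""
    response = pipeline.run(build_inputs(question, top_k))
    return response["llm"]["replies"][0]

class EchoPipeline:
    """Stand-in used only to demonstrate the helper without calling a model."""

    def run(self, data):
        return {"llm": {"replies": ["echo: " + data["prompt_builder"]["question"]]}}

print(ask(EchoPipeline(), "What is adverse possession?"))
```

With the real pipeline, the equivalent call would be `Markdown(ask(basic_rag_pipeline, question))`.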

In [None]:
question = "I'm a lawyer and currently handling a case of property dispute, It's a case of illegal occupation of land. tell me about some recent cases that are similar to mine"

response = basic_rag_pipeline.run({"text_embedder": {"text": question}, "prompt_builder": {"question": question}, "retriever": {"top_k": 10}})

Markdown(response["llm"]["replies"][0])

In [None]:
question = "Identify potential policy changes that could be advocated for in light of the given case. Naresh Shridhar Mirajkar And Ors vs State Of Maharashtra And Anr (1966)"

response = basic_rag_pipeline.run({"text_embedder": {"text": question}, "prompt_builder": {"question": question}, "retriever": {"top_k": 10}})

Markdown(response["llm"]["replies"][0])

In [None]:
question = "Develop a hypothetical question that could be raised in parliament based on the provided case. Naresh Shridhar Mirajkar And Ors vs State Of Maharashtra And Anr (1966)."

response = basic_rag_pipeline.run({"text_embedder": {"text": question}, "prompt_builder": {"question": question}, "retriever": {"top_k": 10}})

Markdown(response["llm"]["replies"][0])

In [None]:
question = "Identify and summarize the key legal issues in the given case. Naresh Shridhar Mirajkar And Ors vs State Of Maharashtra And Anr (1966)"

response = basic_rag_pipeline.run({"text_embedder": {"text": question}, "prompt_builder": {"question": question}, "retriever": {"top_k": 10}})

Markdown(response["llm"]["replies"][0])

In [None]:
question = "Draft an argument appealing the decision of the given case. Naresh Shridhar Mirajkar And Ors vs State Of Maharashtra And Anr (1966)"

response = basic_rag_pipeline.run({"text_embedder": {"text": question}, "prompt_builder": {"question": question}, "retriever": {"top_k": 10}})

Markdown(response["llm"]["replies"][0])
