# FarmerGPT - Build Chat with PDF using LanceDB

This sample project demonstrates how to build a chat-with-PDF application using LanceDB and LangChain.
It uses the PDF Dataset/FARM5CROPS_farmergpt.pdf, which contains crop variety information for sugarcane, turmeric, bamboo, cashew nuts, and more.
The use case and prompts can be customized to suit your own requirements.

Install packages

In [1]:
! pip install langchain lancedb
! pip install tantivy
! pip install -U langchain-openai langchain-community pypdf
! pip install gradio

Collecting lancedb
  Downloading lancedb-0.17.0-cp39-abi3-manylinux_2_28_x86_64.whl.metadata (4.7 kB)
Collecting deprecation (from lancedb)
  Downloading deprecation-2.1.0-py2.py3-none-any.whl.metadata (4.6 kB)
Collecting pylance==0.20.0 (from lancedb)
  Downloading pylance-0.20.0-cp39-abi3-manylinux_2_28_x86_64.whl.metadata (7.4 kB)
Collecting overrides>=0.7 (from lancedb)
  Downloading overrides-7.7.0-py3-none-any.whl.metadata (5.8 kB)
Downloading lancedb-0.17.0-cp39-abi3-manylinux_2_28_x86_64.whl (29.9 MB)
Downloading pylance-0.20.0-cp39-abi3-manylinux_2_28_x86_64.whl (33.5 MB)
Downloading overrides-7.7.0-py3-none-any.whl (17 kB)
Downloading deprecation-2.1.0-py2.py3-none-any.whl (11 kB)
Installing collected packages: overrides, deprecat

In [7]:
# Pass your OpenAI key here, or swap in any other LLM
import os
os.environ["OPENAI_API_KEY"] = "sk-proj-"
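
Hard-coding a key in a notebook cell makes it easy to leak when the notebook is shared. A minimal sketch of one alternative, assuming the key is either already set in the environment or typed in at a hidden prompt; `resolve_api_key` is a hypothetical helper written for this example, not part of LangChain or the OpenAI SDK.

```python
import os
from getpass import getpass


def resolve_api_key(env=os.environ, prompt=getpass, name="OPENAI_API_KEY"):
    """Return the key from `env`, asking the user only when it is missing."""
    key = env.get(name, "").strip()
    if not key:
        # getpass hides the typed value, so it does not end up in cell output
        key = prompt(f"Enter {name}: ").strip()
    if not key:
        raise ValueError(f"{name} is required")
    # Store it back so libraries that read the environment can find it
    env[name] = key
    return key
```

Calling `resolve_api_key()` once, before the embeddings and chat model are created, would leave the environment variable set for the rest of the session.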

In [4]:
# Download the sample pdf
!wget https://github.com/vectordb-recipes/examples/raw/main/RAG-On-PDF/Dataset/FARM5CROPS_farmergpt.pdf -O FARM5CROPS_farmergpt.pdf

--2024-12-20 11:04:42--  https://github.com/vectordb-recipes/examples/raw/main/RAG-On-PDF/Dataset/FARM5CROPS_farmergpt.pdf
Resolving github.com (github.com)... 140.82.112.3
Connecting to github.com (github.com)|140.82.112.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/vectordb-recipes/examples/main/RAG-On-PDF/Dataset/FARM5CROPS_farmergpt.pdf [following]
--2024-12-20 11:04:43--  https://raw.githubusercontent.com/vectordb-recipes/examples/main/RAG-On-PDF/Dataset/FARM5CROPS_farmergpt.pdf
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1398674 (1.3M) [application/octet-stream]
Saving to: ‘FARM5CROPS_farmergpt.pdf’


2024-12-20 11:04:43 (34.5 MB/s) - ‘FARM5CROPS_farmergpt.pdf’ saved [1398674/13986
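
Before the PDF text is embedded, it is split into chunks. The sketch below illustrates the general idea of separator-based chunking in plain Python. It is a simplified stand-in, not LangChain's `CharacterTextSplitter`: it has no overlap between chunks, and the `chunk_size` value and sample text are made up for illustration.

```python
def split_text(text, separator="\n\n", chunk_size=200):
    """Greedily pack separator-delimited pieces into chunks of at most chunk_size characters.

    A single piece longer than chunk_size is kept whole rather than cut mid-paragraph.
    """
    chunks, current = [], ""
    for piece in text.split(separator):
        piece = piece.strip()
        if not piece:
            continue
        candidate = piece if not current else current + separator + piece
        if current and len(candidate) > chunk_size:
            # Adding this piece would overflow the chunk, so start a new one
            chunks.append(current)
            current = piece
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


# Made-up sample text, three paragraphs separated by blank lines
sample = "Sugarcane varieties by state.\n\nTurmeric varieties.\n\nBamboo species."
for chunk in split_text(sample, chunk_size=50):
    print(repr(chunk))
```

Smaller chunks tend to give more focused retrieval hits, while larger ones keep more surrounding context in each hit; the right size depends on the document.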

In [14]:
from lancedb.rerankers import LinearCombinationReranker
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.vectorstores import LanceDB
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_text_splitters import CharacterTextSplitter
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain.memory import ConversationBufferMemory

class QueryProcessor:
    def __init__(self, file_path, db_url="lancedb_temp", table_name="lancedb_indic"):
        """
        Initialize the QueryProcessor with the PDF file and set up the vector store.

        Parameters:
            file_path (str): Path to the PDF file.
            db_url (str): URI for the LanceDB vector store.
            table_name (str): Name of the table in LanceDB.
        """
        # Load and process the PDF document
        loader = PyPDFLoader(file_path)
        documents = loader.load()
        text_splitter = CharacterTextSplitter()
        self.documents = text_splitter.split_documents(documents)

        # Initialize embeddings and vector store
        embeddings = OpenAIEmbeddings()
        self.vector_store = LanceDB(
            uri=db_url,
            embedding=embeddings,
            table_name=table_name
        )

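        # Build the searchable store from the chunks, with a reranker attached.
        # Note: this call does not receive db_url or table_name, so the store
        # created above (self.vector_store) is not the one queried in get_answer.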
        self.reranker = LinearCombinationReranker(weight=0.3)
        self.docsearch = LanceDB.from_documents(
            self.documents, embeddings, reranker=self.reranker
        )

        print("Embedding stored in lancedb")
        # Initialize LLM and memory
        self.llm = ChatOpenAI(
            model_name="gpt-4o",
            temperature=0.01,
        )
        self.memory = ConversationBufferMemory(memory_key="chat_history")

    def generate_prompt_template(self, main_instructions, prompt_instructions, context_name, query):
        """
        Generate a prompt template for LangChain LLM.

        Parameters:
            main_instructions (str): Main instructions for the LLM.
            prompt_instructions (str): Additional instructions for how to use the data.
            context_name (str): The name of the context (e.g., search results).
            query (str): The query from the user.

        Returns:
            PromptTemplate: The generated prompt template.
        """
        template = f"""{main_instructions}

        {prompt_instructions}

        {context_name}:
        {{context}}

        Previous Conversations:
        {{chat_history}}
        Human: {query}
        Chatbot:"""
        return PromptTemplate(template=template, input_variables=["context", "chat_history"])

    def get_answer(self, query):
        """
        Process a query and return the answer based on the preloaded PDF.

        Parameters:
            query (str): The user's query.

        Returns:
            str: The answer to the query.
        """
        # Perform similarity search
        docs = self.docsearch.similarity_search_with_relevance_scores(query)

        # Generate a prompt
        prompt = self.generate_prompt_template(
            main_instructions="Act as a knowledgeable assistant. Answer the query comprehensively and concisely based on the provided content.",
            prompt_instructions=(
                "Focus on extracting the most relevant and accurate information from the context. "
                "Prioritize clarity, conciseness, and detail in your response. "
                "When summarizing, ensure key points are highlighted without losing important nuances. "
                "If the context is insufficient to fully address the query, acknowledge the limitation clearly."
            ),
            context_name="PDF Content",
            query=query,
        )

        # Create the LangChain pipeline
        chain = prompt | self.llm | StrOutputParser()

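        # Flatten the (Document, score) pairs into plain text for the prompt
        context = "\n\n".join(doc.page_content for doc, _score in docs)

        # Invoke the chain with the stored history, then record this turn
        history = self.memory.load_memory_variables({})["chat_history"]
        answer = chain.invoke({"context": context, "chat_history": history})
        self.memory.save_context({"input": query}, {"output": answer})
        return answer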

# Initialize the QueryProcessor with the PDF file (done once)
file_path = "/content/FARM5CROPS_farmergpt.pdf"
query_processor = QueryProcessor(file_path)

Embedding stored in lancedb
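
`generate_prompt_template` mixes two kinds of placeholders in one f-string: single-brace names such as `{main_instructions}` are filled immediately, while double-brace names such as `{{context}}` survive as `{context}` and are filled later when the chain runs. The sketch below reproduces that behaviour with plain `str.format` standing in for `PromptTemplate` (an assumption made to keep the example dependency-free). `build_template` is a hypothetical helper, and the brace escaping on the query is an addition that the notebook's code does not perform.

```python
def build_template(main_instructions, query):
    """Fill the instructions and query now; leave context and history for later."""
    # Escape braces in user text so the later .format() call treats them literally
    safe_query = query.replace("{", "{{").replace("}", "}}")
    return f"""{main_instructions}

PDF Content:
{{context}}

Previous Conversations:
{{chat_history}}
Human: {safe_query}
Chatbot:"""


template = build_template("Answer from the context.", "Which varieties use {braces}?")
# Second stage: the remaining placeholders are filled at query time
prompt = template.format(context="(retrieved chunks)", chat_history="(none yet)")
print(prompt)
```

Without the escaping step, a question containing `{` or `}` would be read as another placeholder in the second stage and fail to format.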


In [15]:
query = "give me some sugarcane variety names?"
answer = query_processor.get_answer(query)
print("Answer:", answer)

Answer: The sugarcane varieties are categorized by states in India as follows:

- **Andhra Pradesh:**
  - Early varieties: Co.6907, 84A125, 81A99, 83A30, 85A261, 87A298, Co.8014, 86V96, 91V83.
  - Mid-late varieties: COA7607, CO8021, COT.8201, Co7805, COV92102 (83V15), 83V288.
  - Late varieties: Co.7219, CoR8001, 87A380, Co7706.

- **Bihar:**
  - Varieties: Bo 99, CoP 9301, CoSe 98231, CoS 8436, Cos 95255, Bo 102, Bo 91, Bo 110, CoP 9206, CoSe 95422, CoSe 92423, UP 9530.

- **Gujarat:**
  - Varieties: Co 86002, Co 86032, CoSi 95071, Co 86249, CoN 05072.

- **Haryana:**
  - Varieties: CoJ 64, CoS 8436, CoS 88230, CoS 767.

- **Karnataka:**
  - Varieties: Co 94012, CoC 671, Co 92020, Co 8014, Co 86032, Co 62175, Co 8371, Co 740, Co 8011.

- **Maharashtra:**
  - Varieties: CoC 671, Co 86032, Co 8011, Co 94012, CoM 265, Co 92005.

- **Odisha:**
  - Varieties: Co 62175, CoA 89085, Co 87A298, Co86V96.

- **Punjab:**
  - Varieties: CoJ 85, CoJ 88, CoS8436, CoH 119, Co89003.

- **Tamil Nadu:*

In [16]:
# Reuse with another query
query2 = "What crops grow in dry regions?"
answer2 = query_processor.get_answer(query2)
print("Answer:", answer2)

Answer: In dry regions, crops that are typically grown include those that are drought-resistant and can thrive with minimal water. Some common crops suitable for dry regions are:

1. **Millets**: These are hardy grains that require less water and can grow in poor soil conditions. Examples include pearl millet (bajra) and finger millet (ragi).

2. **Sorghum**: Known for its drought tolerance, sorghum is a staple in many dry areas.

3. **Pulses**: Legumes such as chickpeas, lentils, and pigeon peas are often grown in dry regions due to their ability to fix nitrogen and improve soil fertility.

4. **Oilseeds**: Crops like sesame and mustard are suitable for dry climates.

5. **Cotton**: This crop can be grown in semi-arid regions with proper irrigation management.

6. **Cactus and Succulents**: While not traditional crops, these plants are increasingly being explored for their potential in arid agriculture.

These crops are chosen for their ability to withstand water scarcity and their ad
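
For the `Previous Conversations` part of the prompt to carry anything useful across calls like the two above, each turn has to be recorded after the answer is produced and rendered back to text before the next prompt is built. A minimal sketch of that pattern in plain Python; `ChatHistory` is a hypothetical stand-in written for illustration, not LangChain's `ConversationBufferMemory`, and the answers are placeholders.

```python
class ChatHistory:
    """Keep (question, answer) turns and render them as prompt text."""

    def __init__(self, max_turns=5):
        self.max_turns = max_turns
        self.turns = []

    def add_turn(self, question, answer):
        self.turns.append((question, answer))
        # Drop the oldest turns so the prompt does not grow without bound
        self.turns = self.turns[-self.max_turns:]

    def render(self):
        return "\n".join(f"Human: {q}\nChatbot: {a}" for q, a in self.turns)


history = ChatHistory(max_turns=2)
history.add_turn("Name a sugarcane variety.", "(first answer)")
history.add_turn("Which state is it grown in?", "(second answer)")
print(history.render())
```

Capping the number of turns is one simple way to keep the prompt within the model's context window; summarizing older turns is another option.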

# General template for your chat with PDF
##### Change the prompts as per your requirements

In [None]:
# Gradio app

import gradio as gr
from lancedb.rerankers import LinearCombinationReranker
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.vectorstores import LanceDB
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_text_splitters import CharacterTextSplitter
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain.memory import ConversationBufferMemory

class QueryProcessor:
    def __init__(self, file_path, db_url="lancedb_temp", table_name="lancedb_indic"):
        loader = PyPDFLoader(file_path)
        documents = loader.load()
        text_splitter = CharacterTextSplitter()
        self.documents = text_splitter.split_documents(documents)

        embeddings = OpenAIEmbeddings()
        self.vector_store = LanceDB(
            uri=db_url,
            embedding=embeddings,
            table_name=table_name
        )

        self.reranker = LinearCombinationReranker(weight=0.3)
        self.docsearch = LanceDB.from_documents(
            self.documents, embeddings, reranker=self.reranker
        )

        self.llm = ChatOpenAI(
            model_name="gpt-4o",
            temperature=0.01,
        )
        self.memory = ConversationBufferMemory(memory_key="chat_history")

    def generate_prompt_template(self, main_instructions, prompt_instructions, context_name, query):
        template = f"""{main_instructions}

        {prompt_instructions}

        {context_name}:
        {{context}}

        Previous Conversations:
        {{chat_history}}
        Human: {query}
        Chatbot:"""
        return PromptTemplate(template=template, input_variables=["context", "chat_history"])

    def get_answer(self, query):
        docs = self.docsearch.similarity_search_with_relevance_scores(query)

        prompt = self.generate_prompt_template(
            main_instructions="Act as a knowledgeable assistant. Answer the query comprehensively and concisely based on the provided content.",
            prompt_instructions=(
                "Focus on extracting the most relevant and accurate information from the context. "
                "Prioritize clarity, conciseness, and detail in your response. "
                "When summarizing, ensure key points are highlighted without losing important nuances. "
                "If the context is insufficient to fully address the query, acknowledge the limitation clearly."
            ),
            context_name="Search Results",
            query=query,
        )

        chain = prompt | self.llm | StrOutputParser()

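        # Flatten the (Document, score) pairs into plain text for the prompt
        context = "\n\n".join(doc.page_content for doc, _score in docs)

        # Invoke the chain with the stored history, then record this turn
        history = self.memory.load_memory_variables({})["chat_history"]
        answer = chain.invoke({"context": context, "chat_history": history})
        self.memory.save_context({"input": query}, {"output": answer})
        return answer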

def initialize_processor(pdf_file):
    global query_processor
    query_processor = QueryProcessor(pdf_file.name)
    return "PDF successfully loaded and processed. You can now ask questions."

def query_processor_fn(question):
    global query_processor
    if query_processor is None:
        return "Please upload a PDF first."
    return query_processor.get_answer(question)

query_processor = None

# Define Gradio interface
with gr.Blocks() as app:
    gr.Markdown("# RAG On PDF - FarmersGPT")

    with gr.Row():
        pdf_upload = gr.File(label="Upload PDF", file_types=[".pdf"])
        pdf_status = gr.Textbox(label="Status", interactive=False)

    load_pdf_btn = gr.Button("Load PDF")

    with gr.Row():
        user_query = gr.Textbox(label="Ask a question", placeholder="Enter your question here...")
        answer_box = gr.Textbox(label="Answer", interactive=False)

    ask_question_btn = gr.Button("Get Answer")

    load_pdf_btn.click(initialize_processor, inputs=[pdf_upload], outputs=[pdf_status])
    ask_question_btn.click(query_processor_fn, inputs=[user_query], outputs=[answer_box])

app.launch(debug=True, share=True)


Colab notebook detected. This cell will run indefinitely so that you can see errors and logs. To turn off, set debug=False in launch().
* Running on public URL: https://6e972183d1c4f70073.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)
