**Install required packages**

In [1]:
!pip install pypdf
!pip install langchain-community
!pip install sentence-transformers
!pip install gradio
!pip install llama-index
!pip install llama-index-embeddings-langchain
!pip install llama-index-llms-llama-cpp

Collecting pypdf
  Downloading pypdf-5.1.0-py3-none-any.whl.metadata (7.2 kB)
Downloading pypdf-5.1.0-py3-none-any.whl (297 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 298.0/298.0 kB 5.7 MB/s eta 0:00:00
Installing collected packages: pypdf
Successfully installed pypdf-5.1.0
Collecting langchain-community
  Downloading langchain_community-0.3.8-py3-none-any.whl.metadata (2.9 kB)
Collecting SQLAlchemy<2.0.36,>=1.4 (from langchain-community)
  Downloading SQLAlchemy-2.0.35-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (9.6 kB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain-community)
  Downloading dataclasses_json-0.6.7-py3-none-any.whl.metadata (25 kB)
Collecting httpx-sse<0.5.0,>=0.4.0 (from langchain-community)
  Downloading httpx_sse-0.4.0-py3-none-any.whl.metadata (9.0 kB)
Collecting langchain<0.4.0,>=0.3.8 (from langchain-community)
  Downloading langchain-0.3.9-py3-none-any.whl.metadata (7.1 kB)
Colle

**Gradio Application**

Before running the Gradio application, make sure to upload your PDF file. On Colab, the code below reads PDFs from `/content`.
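
A quick way to confirm the upload worked is to list the PDFs before building the index. The sketch below is only an illustration: `find_pdfs` is a hypothetical helper (not part of LlamaIndex or Gradio), and it assumes the same `/content` directory and top-level-only lookup that the application cell uses.

```python
from pathlib import Path


def find_pdfs(directory):
    """Return the PDF files directly inside `directory`, sorted by path."""
    root = Path(directory)
    if not root.is_dir():
        # A missing directory is treated the same as "no PDFs found".
        return []
    return sorted(
        p for p in root.iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


if __name__ == "__main__":
    # "/content" is the Colab working directory used by the application cell.
    pdfs = find_pdfs("/content")
    if pdfs:
        print("Found:", [p.name for p in pdfs])
    else:
        print("No PDF found in /content; upload one before building the index.")
```

If the list comes back empty, upload the file first; otherwise the index would be built from no documents.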

In [4]:
import os
import gradio as gr
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from langchain_community.embeddings import HuggingFaceEmbeddings
from llama_index.llms.llama_cpp import LlamaCPP
from llama_index.llms.llama_cpp.llama_utils import (
    messages_to_prompt,
    completion_to_prompt,
)

model_url = 'https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF/resolve/main/Llama-3.2-3B-Instruct-Q4_K_M.gguf'
llm = LlamaCPP(
    # You can pass in the URL to a GGUF model to download it automatically
    model_url=model_url,
    temperature=0.1,
    max_new_tokens=256,
    context_window=2048,
    # kwargs to pass to __call__()
    generate_kwargs={},
    # kwargs to pass to __init__()
    # set to at least 1 to use GPU
    model_kwargs={"n_gpu_layers": 1},
    # these helpers build prompts in the Llama 2 chat format
    # (note: Llama 3.x models use a different chat template)
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
    verbose=True,
)
# Initialize the embedding model (the LLM is created above)
embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-small-en-v1.5")

def initialize_index():
    """Initialize the vector store index from PDF files in the data directory"""
    # Load documents from the data directory
    loader = SimpleDirectoryReader(
        # input_dir="data", # for huggingface
        input_dir="/content", # for Colab
        required_exts=[".pdf"]
    )
    documents = loader.load_data()

    # Create index
    index = VectorStoreIndex.from_documents(
        documents,
        embed_model=embeddings,
    )

    # Return query engine with Llama
    return index.as_query_engine(llm=llm)

# Initialize the query engine at startup
query_engine = initialize_index()

def process_query(
    message: str,
    history: list[tuple[str, str]],
) -> str:
    """Process a query using the RAG system"""
    try:
        # Get response from the query engine
        response = query_engine.query(
            message,
            #streaming=True
        )
        return str(response)
    except Exception as e:
        return f"Error processing query: {str(e)}"

# Create the Gradio interface
demo = gr.ChatInterface(
    process_query,
    title="PDF Question Answering with RAG + Llama",
    description="Ask questions about the content of the loaded PDF documents using a Llama model",
    #undo_btn="Delete Previous",
    #clear_btn="Clear",
)

if __name__ == "__main__":
    demo.launch(debug=True)

llama_model_loader: loaded meta data with 35 key-value pairs and 255 tensors from /tmp/llama_index/models/Llama-3.2-3B-Instruct-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Llama 3.2 3B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Llama-3.2
llama_model_loader: - kv   5:                         general.size_label str              = 3B
llama_model_loader: - kv   6:                            general.license str              = llama3.2
llama_model_loader: - kv  

Running Gradio in a Colab notebook requires sharing enabled. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. This cell will run indefinitely so that you can see errors and logs. To turn off, set debug=False in launch().
* Running on public URL: https://b12fced75d620a28ba.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)



llama_print_timings:        load time =  205764.00 ms
llama_print_timings:      sample time =       3.05 ms /    19 runs   (    0.16 ms per token,  6223.39 tokens per second)
llama_print_timings: prompt eval time =  489152.80 ms /  1517 tokens (  322.45 ms per token,     3.10 tokens per second)
llama_print_timings:        eval time =   10612.32 ms /    18 runs   (  589.57 ms per token,     1.70 tokens per second)
llama_print_timings:       total time =  499839.21 ms /  1535 tokens

llama_print_timings:        load time =  103034.68 ms
llama_print_timings:      sample time =       4.23 ms /    35 runs   (    0.12 ms per token,  8274.23 tokens per second)
llama_print_timings: prompt eval time =  300115.73 ms /  1413 tokens (  212.40 ms per token,     4.71 tokens per second)
llama_print_timings:        eval time =   13176.51 ms /    34 runs   (  387.54 ms per token,     2.58 tokens per second)
llama_print_timings:       total time =  313364.91 ms /  1447 tokens
Llama.generate: 65 prefix-

Keyboard interruption in main thread... closing server.
Killing tunnel 127.0.0.1:7860 <> https://b12fced75d620a28ba.gradio.live
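
The `llama_print_timings` lines in the output above already include per-token rates, but it can be handy to extract them for comparison across runs. The following is a minimal sketch, assuming the log format shown above; `parse_timings` and `TIMING_RE` are hypothetical names introduced here, not part of llama.cpp or LlamaIndex.

```python
import re

# Matches lines such as:
#   llama_print_timings: prompt eval time =  489152.80 ms /  1517 tokens ( ... )
# The "/ N tokens|runs" part is optional because the "load time" line omits it.
TIMING_RE = re.compile(
    r"llama_print_timings:\s*(?P<stage>[a-z ]+?) time =\s*(?P<ms>[\d.]+) ms"
    r"(?: /\s*(?P<count>\d+) (?:tokens|runs))?"
)


def parse_timings(log_text):
    """Return one dict per timing line: stage, elapsed ms, count, items per second."""
    rows = []
    for match in TIMING_RE.finditer(log_text):
        ms = float(match.group("ms"))
        count = int(match.group("count")) if match.group("count") else None
        # Rate is count divided by elapsed seconds; undefined without a count.
        rate = count / (ms / 1000.0) if count and ms > 0 else None
        rows.append(
            {
                "stage": match.group("stage").strip(),
                "ms": ms,
                "count": count,
                "per_second": rate,
            }
        )
    return rows


if __name__ == "__main__":
    sample = (
        "llama_print_timings: prompt eval time =  489152.80 ms /  1517 tokens\n"
        "llama_print_timings:        eval time =   10612.32 ms /    18 runs\n"
    )
    for row in parse_timings(sample):
        print(row["stage"], round(row["per_second"], 2))
```

In the first run above, prompt evaluation (1517 tokens in about 489 seconds) accounts for nearly all of the roughly 500-second total, so prompt length rather than generation is the main cost in this setup.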
