
# === Install Required Packages ===

In [33]:
%pip install -U langchain-mistralai

Collecting langchain-mistralai
  Downloading langchain_mistralai-0.2.11-py3-none-any.whl.metadata (2.0 kB)
Downloading langchain_mistralai-0.2.11-py3-none-any.whl (16 kB)
Installing collected packages: langchain-mistralai
Successfully installed langchain-mistralai-0.2.11


In [34]:
# Import libraries
import os
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.prompts import ChatPromptTemplate


# For Mistral LLM

In [35]:
from langchain_mistralai import ChatMistralAI # For Mistral AI API
# OR if using a local/hosted Mistral model via Langchain's generic LLM interface:
# from langchain_community.llms import Ollama # Example for local Ollama setup
# from langchain_community.chat_models import ChatOllama # Example for chat models with Ollama

# === Set Paths ===

In [36]:
CHROMA_PATH = "Chroma"
DOC_PATH = "/content/sample_data/20220202_alphabet_10K.pdf"

# === Step 1: Load and Chunk PDF ===

In [37]:
# load your pdf doc
loader = PyPDFLoader(DOC_PATH)
pages = loader.load()



In [38]:
# Split the document into chunks of at most 500 characters, with a 50-character overlap between neighbouring chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = text_splitter.split_documents(pages)
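To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a plain sliding character window. It is a simplification for illustration only: `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraphs and sentences, which this sketch does not do. The helper name `sliding_window_chunks` is hypothetical and not part of LangChain.

```python
def sliding_window_chunks(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows that overlap by chunk_overlap.

    Hypothetical helper for illustration; not the LangChain splitting algorithm.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    # Each window starts this many characters after the previous one.
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


# 1200 characters of varied text so that the overlap is visible
sample = "".join(chr(97 + i % 26) for i in range(1200))
demo_chunks = sliding_window_chunks(sample)
print([len(c) for c in demo_chunks])
print(demo_chunks[0][-50:] == demo_chunks[1][:50])
```

The last 50 characters of one chunk reappear at the start of the next, so a sentence that straddles a boundary is still fully contained in at least one chunk.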

# === Step 2: Embeddings (HuggingFace) ===

In [39]:
# Load the HuggingFace sentence-embedding model used to vectorize the chunks
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")




# === Step 3: Store in Chroma Vector DB ===

In [40]:
# embed the chunks as vectors and load them into the database.
db_chroma = Chroma.from_documents(chunks, embeddings, persist_directory=CHROMA_PATH)
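Because `persist_directory` writes the index to disk, re-running the cell above against the same directory will most likely add the same chunks a second time (an assumption about the default behaviour, worth verifying on your installed version). A small guard like the one below can decide whether to rebuild. The helper `index_exists` is hypothetical; if an index is found, reopening it would look roughly like `Chroma(persist_directory=CHROMA_PATH, embedding_function=embeddings)`.

```python
from pathlib import Path


def index_exists(persist_dir: str) -> bool:
    """Return True if persist_dir is a directory that already contains files.

    Hypothetical helper: it only checks the file system, not the index contents.
    """
    path = Path(persist_dir)
    return path.is_dir() and any(path.iterdir())


CHROMA_PATH = "Chroma"  # same value as in the notebook
if index_exists(CHROMA_PATH):
    print("Existing index found; reopen it instead of embedding the chunks again.")
else:
    print("No index found; build it with Chroma.from_documents(...).")
```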

# === Step 4: Define User Query ===

In [42]:
# this is an example of a user question (query)
query = 'Summarize the 20220202_alphabet_10K document'

# === Step 5: Retrieve Similar Chunks ===

In [43]:
# retrieve context - top 5 most relevant (closest) chunks to the query vector
docs_chroma = db_chroma.similarity_search_with_score(query, k=5)
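Conceptually, this search ranks every stored chunk vector by its distance to the query vector and keeps the `k` closest. The sketch below shows that idea with squared Euclidean distance on toy 2-dimensional vectors. It is an illustration under assumptions: the metric Chroma actually applies depends on how the collection is configured, a real index uses approximate nearest-neighbour search rather than a full scan, and `top_k` is a hypothetical helper. With a distance-based score, lower means closer.

```python
def squared_l2(a: list[float], b: list[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(query_vec: list[float], doc_vecs: list[list[float]], k: int = 5) -> list[tuple[int, float]]:
    """Return (index, distance) pairs for the k vectors closest to query_vec.

    Hypothetical brute-force helper for illustration only.
    """
    scored = [(i, squared_l2(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1])[:k]


toy_query = [1.0, 0.0]
toy_docs = [[0.9, 0.1], [0.0, 1.0], [1.0, 0.0]]
nearest = top_k(toy_query, toy_docs, k=2)
print([index for index, _distance in nearest])
```

Printing the scores in `docs_chroma` in the same way is a quick check on whether the retrieved chunks are actually close to the query.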

In [44]:
# Join the text of the retrieved chunks into a single context string for the prompt
context_text = "\n\n".join([doc.page_content for doc, _score in docs_chroma])

# === Step 6: Prompt Template ===

In [58]:
# Prompt template that restricts the answer to the retrieved context
PROMPT_TEMPLATE = """
Answer the question based only on the following context:
{context}
Answer the question based on the above context: {question}.
Provide a detailed answer.
Don’t justify your answers.
Don’t give information not mentioned in the CONTEXT INFORMATION.
Do not say "according to the context" or "mentioned in the context" or similar.
"""


# === Step 7: Set Mistral as LLM for RAG ===

In [59]:
import os
from google.colab import userdata

In [60]:
# Retrieve the Mistral API key from Colab Secrets
try:
    # 1. Retrieve the secret from Colab's userdata.
    MISTRAL_API_KEY = userdata.get("MISTRAL_API_KEY")
    # 2. Set it as an environment variable for the current session.
    os.environ["MISTRAL_API_KEY"] = MISTRAL_API_KEY
except Exception as e:
    print(f"Error retrieving MISTRAL_API_KEY from Colab Secrets: {e}")
    print("Please ensure you have set 'MISTRAL_API_KEY' in Colab Secrets and granted notebook access.")
    raise  # Re-raise to stop this cell; the LLM cannot work without the key.
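`google.colab.userdata` is only available inside Colab. When the notebook runs elsewhere, one option is to read the key from an environment variable and fail with a clear message if it is missing. The sketch below assumes the key has been exported beforehand; `resolve_api_key` is a hypothetical helper, and the demo call passes a dummy mapping rather than a real key.

```python
import os
from typing import Mapping


def resolve_api_key(name: str, env: Mapping[str, str] = os.environ) -> str:
    """Return the named key from env, or raise a clear error if it is missing or blank.

    Hypothetical helper for running outside Colab.
    """
    value = env.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to Colab Secrets or export it as an environment variable."
        )
    return value


# Demo with a dummy mapping; in practice call resolve_api_key("MISTRAL_API_KEY")
print(resolve_api_key("MISTRAL_API_KEY", {"MISTRAL_API_KEY": "demo-key"}))
```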

# Initialize Mistral LLM

In [64]:

# If using Mistral AI's official API:
model = ChatMistralAI(model="mistral-medium", temperature=0.7) # You can choose "mistral-tiny", "mistral-small", "mistral-medium"

In [65]:
# load retrieved context and user query in the prompt template
prompt_template = ChatPromptTemplate.from_template(PROMPT_TEMPLATE)
prompt = prompt_template.format(context=context_text, question=query)

In [66]:
# call LLM model to generate the answer based on the given context and query
response_text = model.invoke(prompt) # Use .invoke() for ChatModels in Langchain
print(response_text.content) # Access the content of the response

The document is Alphabet Inc.'s Form 10-K for the fiscal year ended December 31, 2021. It includes a table of contents listing key sections such as:

**PART I:**
- **Item 1. Business** (Page 4)
- **Item 1A. Risk Factors** (Page 10)
- **Item 1B. Unresolved Staff Comments** (Page 24)
- **Item 2. Properties** (Page 24)
- **Item 3. Legal Proceedings** (Page 24)
- **Item 4. Mine Safety Disclosures** (Page 24)

**PART II:**
- **Item 5. Market for Registrant’s Common Equity, Related Stockholder Matters and Issuer Purchases of Equity Securities** (Page 25)
- **Item 6. [Reserved]** (Page 27)

Additional sections include references to stock plans:
- **10.06.1**: Alphabet Inc. Amended and Restated 2012 Stock Plan - Form of Alphabet Restricted Stock Unit Agreement (Annual Report on Form 10-K, February 4, 2020).
- **10.06.2**: Alphabet Inc. Amended and Restated 2012 Stock Plan - Performance Stock Unit Agreement (Annual Report on Form 10-K, February 4, 2020).
- **10.07**: Alphabet Inc. 2021 Stock Pl