
In [1]:
# === Install Required Packages ===

In [38]:
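# sentence-transformers is needed by HuggingFaceEmbeddings; langchain provides the splitter and prompt classes
!pip install langchain langchain-mistralai langchain-community pypdf chromadb sentence-transformers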



In [39]:
# Import libraries
import os
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.prompts import ChatPromptTemplate


# For Mistral LLM

In [40]:
from langchain_mistralai import ChatMistralAI # For Mistral AI API

# === Set Paths ===

In [41]:
CHROMA_PATH = "Chroma"
DOC_PATH = "/content/sample_data/20220202_alphabet_10K.pdf"

# === Step 1: Load and Chunk PDF ===

In [42]:
# load the PDF; PyPDFLoader returns one Document per page
loader = PyPDFLoader(DOC_PATH)
pages = loader.load()



In [43]:
# split the pages into chunks of at most 500 characters, with 50 characters of overlap between neighbours
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = text_splitter.split_documents(pages)
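
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of overlapping windows in plain Python. It is a simplification and not how `RecursiveCharacterTextSplitter` works internally: the real splitter prefers to break on separators such as paragraph and line breaks, so its chunks are usually shorter than the limit. The function name `sliding_window_chunks` and the sample text are made up for illustration.

```python
def sliding_window_chunks(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list[str]:
    """Cut text into fixed-size windows that overlap by chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# Hypothetical 1200-character sample made of repeating letters.
sample = "".join(chr(97 + i % 26) for i in range(1200))
pieces = sliding_window_chunks(sample)
print([len(p) for p in pieces])  # [500, 500, 300]
```

The last 50 characters of each window reappear at the start of the next one, so a sentence cut at a boundary still appears whole in at least one chunk.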

# === Step 2: Embeddings (HuggingFace) ===

In [44]:
# load the embedding model used to vectorize both the chunks and the query
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
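
The embedding model maps each chunk and each query to a fixed-length vector (384 dimensions for `all-MiniLM-L6-v2`), and retrieval then compares those vectors. The sketch below shows one common comparison, cosine similarity, on toy 3-dimensional vectors. The vectors and the helper name `cosine_similarity` are illustrative assumptions, and the metric Chroma actually uses depends on how the collection is configured.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


# Toy vectors standing in for real embeddings (hypothetical values).
query_vec = [1.0, 0.0, 1.0]
chunk_same_direction = [2.0, 0.0, 2.0]
chunk_orthogonal = [0.0, 3.0, 0.0]

print(cosine_similarity(query_vec, chunk_same_direction))  # close to 1: same direction
print(cosine_similarity(query_vec, chunk_orthogonal))      # 0.0: nothing in common
```

In the notebook, real vectors would come from `embeddings.embed_query(...)` and `embeddings.embed_documents(...)`.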




# === Step 3: Store in Chroma Vector DB ===

In [45]:
# embed the chunks as vectors and load them into the database.
db_chroma = Chroma.from_documents(chunks, embeddings, persist_directory=CHROMA_PATH)
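
Because `persist_directory` keeps the collection on disk, re-running the cell above may add the same chunks a second time under new auto-generated IDs. One possible guard, sketched below, is to derive a repeatable ID for each chunk and pass the list through the `ids` argument of `Chroma.from_documents`. This assumes the wrapper overwrites entries whose IDs already exist, which is worth verifying against the installed version. The helper `make_chunk_ids` and the demo objects are hypothetical.

```python
import hashlib
from types import SimpleNamespace


def make_chunk_ids(chunks) -> list[str]:
    """Derive a repeatable ID for each chunk from its source, page, position, and text."""
    ids = []
    for index, chunk in enumerate(chunks):
        source = chunk.metadata.get("source", "")
        page = chunk.metadata.get("page", "")
        # The position index keeps IDs unique even if two chunks share identical text.
        key = f"{source}|{page}|{index}|{chunk.page_content}"
        ids.append(hashlib.sha256(key.encode("utf-8")).hexdigest())
    return ids


# Stand-ins for LangChain Document objects (hypothetical sample data).
demo_chunks = [
    SimpleNamespace(page_content="Item 1. Business", metadata={"source": "demo.pdf", "page": 0}),
    SimpleNamespace(page_content="Item 1A. Risk Factors", metadata={"source": "demo.pdf", "page": 0}),
]
demo_ids = make_chunk_ids(demo_chunks)

# In the notebook, the IDs would be passed alongside the chunks:
# db_chroma = Chroma.from_documents(
#     chunks, embeddings, ids=make_chunk_ids(chunks), persist_directory=CHROMA_PATH
# )
```

The IDs stay the same across runs only while the PDF and the splitter settings are unchanged; changing `chunk_size` produces a new set.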

# === Step 4: Define User Query ===

In [46]:
# this is an example of a user question (query)
query = 'Summarize the 20220202_alphabet_10K document'

# === Step 5: Retrieve Similar Chunks ===

In [47]:
# retrieve context - top 5 most relevant (closest) chunks to the query vector
docs_chroma = db_chroma.similarity_search_with_score(query, k=5)
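
`similarity_search_with_score` returns `(document, score)` pairs in which the score is a distance, so lower values mean a closer match. The sketch below shows one way to inspect the pairs and drop weak matches before they reach the prompt. The threshold, the sample scores, and the helper name `filter_by_distance` are assumptions for illustration; a sensible cutoff depends on the embedding model and the distance metric in use.

```python
def filter_by_distance(results, max_distance: float):
    """Keep (doc, score) pairs whose distance is at most max_distance, closest first."""
    kept = [(doc, score) for doc, score in results if score <= max_distance]
    return sorted(kept, key=lambda pair: pair[1])


# Plain strings stand in for Document objects; the scores are made up.
demo_results = [
    ("chunk about revenue", 0.82),
    ("chunk about stock plans", 0.41),
    ("unrelated chunk", 1.65),
]

close_results = filter_by_distance(demo_results, max_distance=1.0)
for doc, score in close_results:
    print(f"{score:.2f}  {doc}")
```

In the notebook the same call would be `filter_by_distance(docs_chroma, max_distance=...)`, with the cutoff chosen after looking at the scores of a few real queries.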

In [48]:
# join the text of the retrieved chunks into a single context string for the prompt
context_text = "\n\n".join([doc.page_content for doc, _score in docs_chroma])
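
A variation on the join above is to label each chunk with its page number and cap the total length, which makes it easier to trace an answer back to the PDF. The sketch assumes each document carries a zero-based `page` entry in its metadata, as `PyPDFLoader` normally provides. The helper `build_context`, the `max_chars` budget, and the demo objects are hypothetical.

```python
from types import SimpleNamespace


def build_context(results, max_chars: int = 4000) -> str:
    """Join retrieved chunks with page labels, stopping once max_chars would be exceeded."""
    parts = []
    total = 0
    for doc, _score in results:
        page = doc.metadata.get("page")
        # Show 1-based page numbers, since the stored index is assumed to be 0-based.
        label = f"[page {page + 1}]" if isinstance(page, int) else "[page ?]"
        block = f"{label}\n{doc.page_content}"
        if parts and total + len(block) > max_chars:
            break  # budget used up; the first chunk is always kept
        parts.append(block)
        total += len(block)
    return "\n\n".join(parts)


# Stand-ins for (Document, score) pairs returned by the retriever.
demo_pairs = [
    (SimpleNamespace(page_content="Alphabet overview.", metadata={"page": 0}), 0.41),
    (SimpleNamespace(page_content="Risk factors.", metadata={"page": 4}), 0.82),
]
demo_context = build_context(demo_pairs)
print(demo_context)
```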

# === Step 6: Prompt Template ===

In [49]:
# prompt template that restricts the answer to the retrieved context
PROMPT_TEMPLATE = """
Answer the question based only on the following context:
{context}
Answer the question based on the above context: {question}.
Provide a detailed answer.
Don’t justify your answers.
Don’t give information not mentioned in the CONTEXT INFORMATION.
Do not say "according to the context" or "mentioned in the context" or similar.
"""


# === Step 7 - Set Mistral as LLM for RAG ===

In [50]:
import os
from google.colab import userdata

In [51]:
# Retrieve the Mistral API key from Colab Secrets
try:
    # 1. Retrieve the secret from Colab's userdata.
    MISTRAL_API_KEY = userdata.get("MISTRAL_API_KEY")
    # 2. Set it as an environment variable for the current session.
    os.environ["MISTRAL_API_KEY"] = MISTRAL_API_KEY
except Exception as e:
    print(f"Error retrieving MISTRAL_API_KEY from Colab Secrets: {e}")
    print("Please ensure you have set 'MISTRAL_API_KEY' in Colab Secrets and granted notebook access.")
    # Re-raise so the cell fails visibly; exit() may shut down the notebook kernel instead.
    raise
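
The lookup above only works inside Colab. A small fallback, sketched below, tries Colab Secrets first and then an ordinary environment variable, so the same notebook can run locally. The helper name `load_api_key` is an assumption, not part of any library.

```python
import os


def load_api_key(name: str = "MISTRAL_API_KEY") -> str:
    """Return the API key from Colab Secrets if available, else from the environment."""
    try:
        from google.colab import userdata  # only importable inside Colab
        value = userdata.get(name)
        if value:
            return value
    except Exception:
        # Not running in Colab, or the secret is missing or not shared with this notebook.
        pass
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"{name} was not found in Colab Secrets or the environment")
    return value


# Usage in the notebook:
# os.environ["MISTRAL_API_KEY"] = load_api_key()
```

Raising an error instead of exiting keeps the kernel alive, so the key can be set and the cell re-run.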

# Initialize Mistral LLM

In [52]:

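# Mistral AI's official API.
# Other options include "mistral-tiny" and "mistral-small"; available model names
# change over time, so check Mistral's current model list if this one is rejected.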
model = ChatMistralAI(model="mistral-medium", temperature=0.7)

In [53]:
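# fill the prompt template with the retrieved context and the user query;
# format_messages returns a list of chat messages that the chat model accepts directly
prompt_template = ChatPromptTemplate.from_template(PROMPT_TEMPLATE)
prompt = prompt_template.format_messages(context=context_text, question=query)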

In [54]:
# call the LLM to generate the answer from the given context and query
response_text = model.invoke(prompt)  # chat models in LangChain are called with .invoke()
print(response_text.content)  # the generated text is in the .content attribute

The 20220202_alphabet_10K document is Alphabet Inc.'s Form 10-K for the fiscal year ended December 31, 2021. It includes a table of contents listing key sections such as:

- **PART I**:
  - **Item 1. Business**: Overview of Alphabet’s operations.
  - **Item 1A. Risk Factors**: Potential risks facing the company.
  - **Item 1B. Unresolved Staff Comments**: Pending regulatory feedback.
  - **Item 2. Properties**: Details on physical assets.
  - **Item 3. Legal Proceedings**: Information on ongoing legal matters.
  - **Item 4. Mine Safety Disclosures**: Not applicable to Alphabet.

- **PART II**:
  - **Item 5. Market for Registrant’s Common Equity**: Discusses stock performance and equity transactions.

The document references stock plans, including the **2012 Stock Plan** and the **2021 Stock Plan**, with agreements for Restricted Stock Units and Performance Stock Units filed in previous reports. It notes that 2019 financial discussions are omitted to avoid redundancy with the 2020 repor