[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1sbC-EPHE3NIBAOCCo-Xuix6l0cKxoRsV?usp=sharing)

## Final Exam

# Building a Question-Answering System with RAG and Mistral
## Overview
In this notebook, we'll build a Question-Answering system using Retrieval-Augmented Generation (RAG) and the Mistral language model. The system will answer multiple-choice questions about a document on Natural Language Processing.
## What we'll build
Our system will:

- Process a PDF document about NLP developments
- Create a vector database for efficient information retrieval
- Use Mistral to generate accurate answers to multiple-choice questions
- Evaluate the answers against provided correct responses

### Technical Components

- Document Processing: PDF extraction and text chunking
- Vector Storage: Document embeddings and retrieval
- Language Model: Mistral for answer generation
- RAG Pipeline: Combining retrieval and generation
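
To make the retrieval component concrete, here is a minimal sketch of similarity-based lookup in plain Python. The three-dimensional vectors and the helper names `cosine_similarity` and `top_k` are hypothetical and only for illustration; real embeddings come from an embedding model, and the FAISS index used later in this notebook may rank by a different distance metric.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunk_vecs, k=2):
    """Return the indices of the k chunk vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]

# Hypothetical toy "embeddings" (assumption: not produced by any real model).
toy_chunks = ["BERT for classification", "sarcasm detection", "pizza recipes"]
toy_vectors = [[0.9, 0.1, 0.0], [0.2, 0.9, 0.1], [0.0, 0.1, 0.9]]
toy_query = [0.8, 0.3, 0.0]  # pretend embedding of a question about BERT

best = top_k(toy_query, toy_vectors, k=2)
print([toy_chunks[i] for i in best])
```

In a RAG pipeline, the text of the top-ranked chunks is what gets passed to the language model as context.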

## Learning Objectives
By completing this notebook, you will learn:

- How to implement a RAG system from scratch
- Techniques for processing and chunking PDF documents
- Methods for creating and managing vector embeddings
- Integration of Mistral LLM for question answering
- Best practices for prompt engineering with multiple-choice questions

## Dataset
We'll use:

- A PDF document discussing NLP developments (Understanding Natural Language Processing.pdf)
- A set of multiple-choice questions testing comprehension of the document

### **You can work together**

In [None]:
!gdown "https://drive.google.com/uc?id=1BLJOIJONLof1ufwrx1-HXj0mrzCFKe8E"

Downloading...
From: https://drive.google.com/uc?id=1BLJOIJONLof1ufwrx1-HXj0mrzCFKe8E
To: /content/Understanding Natural Language Processing.pdf
100% 32.8k/32.8k [00:00<00:00, 66.1MB/s]


In [None]:
%pip install pinecone
%pip install langchain
%pip install langchain-community
%pip install langchain-core
%pip install PyPDF2
%pip install -qU langchain_mistralai
%pip install mistralai
%pip install markdown
%pip install faiss-cpu

Collecting pinecone
  Downloading pinecone-5.4.2-py3-none-any.whl.metadata (19 kB)
Collecting pinecone-plugin-inference<4.0.0,>=2.0.0 (from pinecone)
  Downloading pinecone_plugin_inference-3.1.0-py3-none-any.whl.metadata (2.2 kB)
Collecting pinecone-plugin-interface<0.0.8,>=0.0.7 (from pinecone)
  Downloading pinecone_plugin_interface-0.0.7-py3-none-any.whl.metadata (1.2 kB)
Downloading pinecone-5.4.2-py3-none-any.whl (427 kB)
Downloading pinecone_plugin_inference-3.1.0-py3-none-any.whl (87 kB)
Downloading pinecone_plugin_interface-0.0.7-py3-none-any.whl (6.2 kB)
Installing collected packages: pinecone-plugin-interface, pinecone-plugin-inference, pinecone
Successfully installed pinecone-5.4.2 pinecone-plugin-inference-3.1.0 pinecone-plugin

In [None]:
from google.colab import userdata
import getpass
import os

# Mistral API key: read it from Colab secrets, or prompt for it if the secret is unavailable
if "MISTRAL_API_KEY" not in os.environ:
    try:
        os.environ["MISTRAL_API_KEY"] = userdata.get('MISTRAL_API_KEY')
    except Exception:
        os.environ["MISTRAL_API_KEY"] = getpass.getpass("Provide your Mistral API Key: ")

In [None]:
import os
from langchain_mistralai import ChatMistralAI, MistralAIEmbeddings
from langchain_community.vectorstores import FAISS
import PyPDF2
import time
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate

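def extract_text_from_pdf(file_path):
    """Concatenate the text of every page in the PDF."""
    with open(file_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        text = ""
        for page in reader.pages:
            # extract_text() may yield nothing for image-only pages
            text += page.extract_text() or ""
    return text

def create_chunks(text, chunk_size=1000, overlap=200):
    """Split text into overlapping chunks, cutting at a space where possible."""
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        if end < len(text):
            # Cut at the last space only if the chunk stays longer than the
            # overlap; rfind returns -1 when no space exists, and a cut that
            # is too early would stop the loop from advancing.
            space = text.rfind(' ', start, end)
            if space > start + overlap:
                end = space
        chunks.append(text[start:end].strip())
        start = end - overlap
    return chunks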

llm = ChatMistralAI(model="mistral-tiny", temperature=0)
embeddings = MistralAIEmbeddings(model="mistral-embed")

text = extract_text_from_pdf("Understanding Natural Language Processing.pdf")
chunks = create_chunks(text)

vector_store = FAISS.from_texts(chunks, embeddings)
print(f"Processed {len(chunks)} chunks from PDF")

retriever = vector_store.as_retriever()

prompt = ChatPromptTemplate.from_messages([
    ("system", """You are an AI tasked with answering multiple choice questions.
    Based ONLY on the provided context, select the most accurate answer among the options.
    If the answer cannot be determined from the context, respond with "Cannot determine".

    Otherwise, respond ONLY with the letter (A, B, C, or D) of your answer.

    Context: {context}
    """),
    ("human", """Question: {input}

Options:
A) {option_a}
B) {option_b}
C) {option_c}
D) {option_d}""")
])

doc_chain = create_stuff_documents_chain(llm, prompt)
rag_chain = create_retrieval_chain(retriever, doc_chain)

def get_answer(rag_chain, question_data):
    # Include the 'input' key with the question
    chain_input = {
        "input": question_data["question"],  # This is the required 'input' key
        "option_a": question_data["options"][0],
        "option_b": question_data["options"][1],
        "option_c": question_data["options"][2],
        "option_d": question_data["options"][3]
    }
    result = rag_chain.invoke(chain_input)
    return result['answer'].strip()



Processed 5 chunks from PDF
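
As a rough sanity check on the chunk count, the loop in `create_chunks` moves the start of each chunk forward by about `chunk_size - overlap` characters. The sketch below estimates the number of chunks from the text length alone. The helper name `approx_chunk_count` and the sample lengths are hypothetical, and the estimate ignores the step that cuts at the last space, so the real count can be slightly higher.

```python
import math

def approx_chunk_count(n_chars, chunk_size=1000, overlap=200):
    """Approximate chunk count when each start advances by chunk_size - overlap."""
    if n_chars <= 0:
        return 0
    stride = chunk_size - overlap  # characters gained per new chunk
    return math.ceil(n_chars / stride)

# Hypothetical text lengths, not measured from the PDF.
for n in (800, 801, 4000, 4200):
    print(n, approx_chunk_count(n))
```

Note that with this looping scheme the last chunk can be very short and may lie entirely inside the overlap of the previous one.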


In [None]:
questions = {
    "questions": [
        {
            "question": "What was the main limitation of the TalkBot chatbot in 2015?",
            "options": [
                "It couldn't process multiple languages",
                "It couldn't understand context and nuance in complex conversations",
                "It had no internet access",
                "It was too slow in responding"
            ],
            "correct": "B"
        },
        {
            "question": "What accuracy did BERT achieve in medical symptom classification?",
            "options": [
                "75%",
                "85%",
                "92%",
                "67%"
            ],
            "correct": "C"
        },
        {
            "question": "What accuracy did MIT's system achieve in detecting sarcasm in tweets?",
            "options": [
                "75%",
                "87%",
                "92%",
                "85%"
            ],
            "correct": "B"
        },
        {
            "question": "What improvement percentage is achieved by combining text and images versus text-only?",
            "options": [
                "10-15%",
                "15-20%",
                "20-25%",
                "25-30%"
            ],
            "correct": "B"
        },
        {
            "question": "How is modern text classification implemented in the document's example?",
            "options": [
                "Using spaCy",
                "Using NLTK",
                "Using BERT",
                "Using Word2Vec"
            ],
            "correct": "C"
        }
    ]
}

In [None]:
# Test and evaluation
answers = []
for q in questions["questions"]:
    answer = get_answer(rag_chain, q)
    answers.append(answer)
    print(f"Q: {q['question']}")
    print(f"AI answered: {answer}, Correct: {q['correct']}\n")
    time.sleep(15)

Q: What was the main limitation of the TalkBot chatbot in 2015?
AI answered: B) It couldn't understand context and nuance in complex conversations, Correct: B



HTTPStatusError: Error response 429 while fetching https://api.mistral.ai/v1/chat/completions: {"message":"Requests rate limit exceeded"}
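
The 429 response above shows that a fixed 15-second pause was not enough to stay under the rate limit. One common remedy is to retry with exponential backoff. The sketch below is a generic wrapper, not part of LangChain or the Mistral client; the name `invoke_with_backoff` is hypothetical, and it assumes rate-limit errors can be recognized by "429" appearing in the exception message, as in the error above.

```python
import time

def invoke_with_backoff(call, max_retries=5, base_delay=2.0, sleep=time.sleep):
    """Run call() and retry with exponentially growing pauses on rate-limit errors."""
    for attempt in range(max_retries + 1):
        try:
            return call()
        except Exception as exc:
            # Assumption: a rate-limit error mentions 429 in its message.
            is_rate_limit = "429" in str(exc)
            if not is_rate_limit or attempt == max_retries:
                raise
            sleep(base_delay * (2 ** attempt))  # 2s, 4s, 8s, ...

# Demo with a fake call that is rate-limited twice before succeeding.
attempts = {"n": 0}

def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("Error response 429: Requests rate limit exceeded")
    return "B"

delays = []  # record the pauses instead of actually sleeping
result = invoke_with_backoff(flaky, sleep=delays.append)
print(result, delays)
```

In the evaluation loop, the call could then be wrapped as `invoke_with_backoff(lambda: get_answer(rag_chain, q))`.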

In [None]:
mario_question = {
    "question": "What does Mario eat and how much experience does he have?",
    "options": [
        "pizza, 21",
        "pasta, 15",
        "pizza, 25",
        "pasta, 21"
    ],
    "correct": "A"
}
answer = get_answer(rag_chain, mario_question)
answers.append(answer)
print(f"Q: {mario_question['question']}")
print(f"AI answered: {answer}, Correct: {mario_question['correct']}\n")

Q: What does Mario eat and how much experience does he have?
AI answered: A) pizza, 21, Correct: A
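
The replies shown above include the option text as well as the letter, so comparing them directly with the `correct` field would count them as wrong. A minimal sketch of one way to score the answers follows. The helpers `extract_letter` and `accuracy` are hypothetical, they assume the letter appears at the very start of the reply, and the third sample reply is invented to show the unparseable case.

```python
import re

def extract_letter(raw_answer):
    """Return the leading option letter (A-D) of a reply, or None if absent."""
    match = re.match(r"\s*\(?([A-D])\b", raw_answer)
    return match.group(1) if match else None

def accuracy(raw_answers, correct_letters):
    """Fraction of replies whose extracted letter equals the expected letter."""
    if not correct_letters:
        return 0.0
    hits = sum(
        extract_letter(raw) == expected
        for raw, expected in zip(raw_answers, correct_letters)
    )
    return hits / len(correct_letters)

replies = [
    "B) It couldn't understand context and nuance in complex conversations",
    "A) pizza, 21",
    "Cannot determine",  # hypothetical reply with no option letter
]
expected = ["B", "A", "C"]

print([extract_letter(r) for r in replies])
print(f"Accuracy: {accuracy(replies, expected):.2f}")
```

With the notebook's variables, the same function could be called as `accuracy(answers, [q["correct"] for q in questions["questions"]])` once every question has been answered.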

