[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1-PPB5oaKxoeIh2OXzJc3JFrvhtK0PM-b?usp=sharing)

## Final Exam

# Building a Question-Answering System with RAG and Mistral
## Overview
In this notebook, we'll build a Question-Answering system using Retrieval-Augmented Generation (RAG) and the Mistral language model. The system will answer multiple-choice questions about a document on Natural Language Processing.
## What we'll build
Our system will:

- Process a PDF document about NLP developments
- Create a vector database for efficient information retrieval
- Use Mistral to generate accurate answers to multiple-choice questions
- Evaluate the answers against provided correct responses
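
To make the document-processing step concrete, here is a minimal sketch of overlapping, character-based chunking. The function name `chunk_text` and the `chunk_size` and `overlap` defaults are illustrative assumptions, not values prescribed by this notebook; in practice a library text splitter may be used instead.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows that overlap.

    Hypothetical helper: the sizes are placeholders to tune for your document.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


# Tiny example so the overlap is easy to see.
print(chunk_text("abcdefghij", chunk_size=4, overlap=1))
```

The overlap means a sentence cut at a chunk boundary still appears whole in at least one neighbouring chunk, which tends to help retrieval.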

### Technical Components

- Document Processing: PDF extraction and text chunking
- Vector Storage: Document embeddings and retrieval
- Language Model: Mistral for answer generation
- RAG Pipeline: Combining retrieval and generation
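
The way these components fit together can be sketched with a toy example. This is not the notebook's actual pipeline: `embed` below is a bag-of-words stand-in for a real embedding model, the example chunks are made up, and `retrieve` and `build_prompt` are hypothetical helper names. It only illustrates the idea of ranking chunks by similarity to the question and passing the best ones to the language model as context.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, context_chunks: list[str]) -> str:
    """Combine retrieved context and the question into one prompt string."""
    context = "\n\n".join(context_chunks)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


# Made-up chunks, for illustration only.
chunks = [
    "Transformers are widely used for text classification.",
    "Sarcasm detection in tweets is a hard problem.",
    "Pizza recipes vary from region to region.",
]
top = retrieve("sarcasm in tweets", chunks, k=1)
print(build_prompt("sarcasm in tweets", top))
```

A real vector store replaces the `sorted` call with an approximate nearest-neighbour search, but the retrieve-then-prompt shape stays the same.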

## Learning Objectives
By completing this notebook, you will learn:

- How to implement a RAG system from scratch
- Techniques for processing and chunking PDF documents
- Methods for creating and managing vector embeddings
- Integration of Mistral LLM for question answering
- Best practices for prompt engineering with multiple-choice questions

## Dataset
We'll use:

- A PDF document discussing NLP developments (Understanding Natural Language Processing.pdf)
- A set of multiple-choice questions testing comprehension of the document

### **You can work together**

In [1]:
!gdown "https://drive.google.com/uc?id=1BLJOIJONLof1ufwrx1-HXj0mrzCFKe8E"

Downloading...
From: https://drive.google.com/uc?id=1BLJOIJONLof1ufwrx1-HXj0mrzCFKe8E
To: /content/Understanding Natural Language Processing.pdf
100% 32.8k/32.8k [00:00<00:00, 48.7MB/s]


In [2]:
questions = {
    "questions": [
            {
                "question": "What was the main limitation of the TalkBot chatbot in 2015?",
                "options": [
                    "It couldn't process multiple languages",
                    "It couldn't understand context and nuance in complex conversations",
                    "It had no internet access",
                    "It was too slow in responding"
                ],
                "correct": "B"
            },
            {
                "question": "What accuracy did BERT achieve in medical symptom classification?",
                "options": [
                    "75%",
                    "85%",
                    "92%",
                    "67%"
                ],
                "correct": "C"
            },
            {
                "question": "What accuracy did MIT's system achieve in detecting sarcasm in tweets?",
                "options": [
                    "75%",
                    "87%",
                    "92%",
                    "85%"
                ],
                "correct": "B"
            },
            {
                "question": "What improvement percentage is achieved by combining text and images versus text-only?",
                "options": [
                    "10-15%",
                    "15-20%",
                    "20-25%",
                    "25-30%"
                ],
                "correct": "B"
            },
            {
                "question": "How is modern text classification implemented in the document's example?",
                "options": [
                    "Using spaCy",
                    "Using NLTK",
                    "Using BERT",
                    "Using Word2Vec"
                ],
                "correct": "C"
            },
            {
                "question": "What does Mario eat and how much experience does he have?",
                "options": [
                    "pizza, 21",
                    "pasta, 15",
                    "pizza, 25",
                    "pasta, 21"
                ],
                "correct": "A"
            }
        ]
    }
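
One possible way to turn an entry of this structure into a prompt and to score predicted letters is sketched below. The helper names `format_question` and `score`, and the wording of the answer instruction, are assumptions for illustration; the sample item is copied from the dictionary above so the block runs on its own.

```python
LETTERS = "ABCD"


def format_question(item: dict) -> str:
    """Render a question and its options as lettered lines."""
    lines = [item["question"]]
    for letter, option in zip(LETTERS, item["options"]):
        lines.append(f"{letter}. {option}")
    lines.append("Answer with a single letter (A, B, C, or D).")
    return "\n".join(lines)


def score(predictions: list[str], items: list[dict]) -> float:
    """Fraction of predictions whose first letter matches the 'correct' key."""
    if not items:
        return 0.0
    hits = sum(
        pred.strip().upper()[:1] == item["correct"]
        for pred, item in zip(predictions, items)
    )
    return hits / len(items)


# One item copied from the questions dictionary above.
sample = {
    "question": "What accuracy did BERT achieve in medical symptom classification?",
    "options": ["75%", "85%", "92%", "67%"],
    "correct": "C",
}
print(format_question(sample))
print(score(["C"], [sample]))
```

Comparing only the first letter keeps the check tolerant of replies such as "C. 92%", at the cost of accepting any answer that merely starts with the right letter.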

In [None]:
%pip install pinecone
%pip install langchain
%pip install langchain-community
%pip install langchain-core
%pip install PyPDF2
%pip install -qU langchain_mistralai
%pip install mistralai
%pip install markdown
%pip install faiss-cpu

Collecting pinecone
  Downloading pinecone-5.4.2-py3-none-any.whl.metadata (19 kB)
Collecting pinecone-plugin-inference<4.0.0,>=2.0.0 (from pinecone)
  Downloading pinecone_plugin_inference-3.1.0-py3-none-any.whl.metadata (2.2 kB)
Collecting pinecone-plugin-interface<0.0.8,>=0.0.7 (from pinecone)
  Downloading pinecone_plugin_interface-0.0.7-py3-none-any.whl.metadata (1.2 kB)
Downloading pinecone-5.4.2-py3-none-any.whl (427 kB)
     427.3/427.3 kB 9.6 MB/s eta 0:00:00
Downloading pinecone_plugin_inference-3.1.0-py3-none-any.whl (87 kB)
     87.5/87.5 kB 5.0 MB/s eta 0:00:00
Downloading pinecone_plugin_interface-0.0.7-py3-none-any.whl (6.2 kB)
Installing collected packages: pinecone-plugin-interface, pinecone-plugin-inference, pinecone
Successfully installed pinecone-5.4.2 pinecone-plugin-inference-3.1.0 pinecone-plugin

In [None]:
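import getpass
import os

# Mistral API key: use an existing environment variable if set, then try
# Colab secrets, and fall back to an interactive prompt.
if "MISTRAL_API_KEY" not in os.environ:
    try:
        # Imported here so the cell also runs outside Google Colab.
        from google.colab import userdata
        os.environ["MISTRAL_API_KEY"] = userdata.get("MISTRAL_API_KEY")
    except Exception:
        # Not in Colab, or the secret is missing or inaccessible.
        os.environ["MISTRAL_API_KEY"] = getpass.getpass("Provide your Mistral API Key: ")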