<a href="https://colab.research.google.com/github/mani-droid/data-science-projects/blob/master/Advanced_RAG_Project_Day6.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install langchain faiss-cpu sentence-transformers openai chromadb PyPDF2 streamlit




Import the necessary libraries


In [2]:
import os
import requests
from langchain.text_splitter import RecursiveCharacterTextSplitter
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np
from transformers import pipeline
from PyPDF2 import PdfReader

`filename` is the local path where `download_pdf` saves the file fetched from the URL. `pdf_path` is the path that `extract_text_from_pdf` reads. The two parameters have different names, but in this notebook they refer to the same file.


In [3]:
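def download_pdf(url, filename):
  # Fetch the file and raise on HTTP errors instead of saving an error page.
  response = requests.get(url, timeout=30)
  response.raise_for_status()
  with open(filename, 'wb') as f:
    f.write(response.content)

def extract_text_from_pdf(pdf_path):
  reader = PdfReader(pdf_path)
  text = ''
  for page in reader.pages:
    # Fall back to '' in case a page yields no extractable text.
    text += page.extract_text() or ''
  return text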

**URL to download and the local filename (`pdf_path`) where its content is saved**

In [4]:
url = "https://arxiv.org/pdf/2312.10997"
pdf_path = "arxiv.pdf"
download_pdf(url,pdf_path)
text = extract_text_from_pdf(pdf_path)
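
`download_pdf` writes whatever the server returns, so a failed request could leave an HTML error page on disk under a `.pdf` name. The following is a minimal sketch of a sanity check that could run before text extraction. The helper name `looks_like_pdf` is hypothetical and not part of the notebook. It relies only on the fact that PDF files begin with the bytes `%PDF-`.

```python
import os
import tempfile

def looks_like_pdf(path):
    """Return True if the file starts with the PDF magic bytes '%PDF-'."""
    with open(path, "rb") as f:
        return f.read(5) == b"%PDF-"

def _write_temp(data):
    # Demo helper: write bytes to a temporary file and return its path.
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    return path

# Demo with two small stand-in files (not real downloads).
good = _write_temp(b"%PDF-1.7\n% stand-in content")
bad = _write_temp(b"<html>Not Found</html>")
good_result = looks_like_pdf(good)
bad_result = looks_like_pdf(bad)
print(good_result, bad_result)
os.remove(good)
os.remove(bad)
```

In the notebook, a call such as `looks_like_pdf(pdf_path)` could be placed between `download_pdf` and `extract_text_from_pdf`.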

**Splitting large text into smaller chunks**

In [5]:
def chunk_text(text, chunk_size=500, chunk_overlap=50):
  # Separators are tried in order: paragraphs, lines, sentences, then words.
  splitter = RecursiveCharacterTextSplitter(
      chunk_size=chunk_size,
      chunk_overlap=chunk_overlap,
      separators=["\n\n", "\n", ".", " "]
  )
  return splitter.split_text(text)

In [6]:
chunks = chunk_text(text)
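
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified sliding-window sketch in plain Python. It is not how `RecursiveCharacterTextSplitter` works internally: that splitter also tries the listed separators so chunks tend to end on natural boundaries. The function name `sliding_window_chunks` is hypothetical and only illustrates the two parameters.

```python
def sliding_window_chunks(text, chunk_size=500, chunk_overlap=50):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks

demo = sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
print(demo)  # ['abcd', 'defg', 'ghij']
```

Each chunk repeats the last character of the previous one, which is the overlap that helps keep a sentence cut at a boundary retrievable from either side.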


**Create Embeddings and FAISS Index**

In [7]:
model_embed = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model_embed.encode(chunks)
embeddings.shape



(236, 384)

In [8]:
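def embed_and_store(chunks):
  # Encode every chunk into a fixed-size vector (384 dimensions for this model).
  model_embed = SentenceTransformer("all-MiniLM-L6-v2")
  embeddings = model_embed.encode(chunks)

  # Build an exact (brute-force) L2 index; FAISS expects float32 input.
  dimension = embeddings.shape[1]
  index = faiss.IndexFlatL2(dimension)
  index.add(np.asarray(embeddings, dtype=np.float32))

  return index, chunks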

In [9]:
index, all_chunks = embed_and_store(chunks)
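
`faiss.IndexFlatL2` performs an exhaustive search and reports squared Euclidean distances. The sketch below mimics that behaviour in plain Python on toy 2-dimensional vectors, so it runs without FAISS. The function `l2_search` and the sample vectors are illustrative assumptions, not part of the notebook.

```python
def l2_search(vectors, query, top_k=3):
    """Return (squared_distances, indices) of the top_k vectors nearest to query."""
    scored = []
    for i, vec in enumerate(vectors):
        # Squared Euclidean distance between the stored vector and the query.
        dist = sum((a - b) ** 2 for a, b in zip(vec, query))
        scored.append((dist, i))
    scored.sort()  # smallest distance first
    top = scored[:top_k]
    return [d for d, _ in top], [i for _, i in top]

# Toy stand-ins for chunk embeddings.
vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
distances, indices = l2_search(vectors, [2.0, 0.0], top_k=2)
print(distances, indices)  # [1.0, 4.0] [1, 0]
```

The returned indices play the same role as `indices[0]` from `index.search`: they point back into the list of chunks.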

In [10]:
index, all_chunks

(<faiss.swigfaiss_avx512.IndexFlatL2; proxy of <Swig Object of type 'faiss::IndexFlatL2 *' at 0x7eeae94f15c0> >,
 ['1\nRetrieval-Augmented Generation for Large\nLanguage Models: A Survey\nYunfan Gaoa, Yun Xiongb, Xinyu Gaob, Kangxiang Jiab, Jinliu Panb, Yuxi Bic, Yi Daia, Jiawei Suna, Meng\nWangc, and Haofen Wanga,c\naShanghai Research Institute for Intelligent Autonomous Systems, Tongji University\nbShanghai Key Laboratory of Data Science, School of Computer Science, Fudan University\ncCollege of Design and Innovation, Tongji University\nAbstract —Large Language Models (LLMs) showcase impres-',
  'sive capabilities but encounter challenges like hallucination,\noutdated knowledge, and non-transparent, untraceable reasoning\nprocesses. Retrieval-Augmented Generation (RAG) has emerged\nas a promising solution by incorporating knowledge from external\ndatabases. This enhances the accuracy and credibility of the\ngeneration, particularly for knowledge-intensive tasks, and allows\nfor conti

In [11]:
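def retrieve_and_answer(query, index, chunks):
    # Embed the query with the same model that embedded the chunks.
    model = SentenceTransformer("all-MiniLM-L6-v2")
    query_embedding = model.encode([query])

    # Look up the top_k nearest chunks in the FAISS index.
    top_k = 3
    distances, indices = index.search(np.asarray(query_embedding, dtype=np.float32), top_k)

    context = "\n".join([chunks[i] for i in indices[0]])

    # device=0 selects the first CUDA GPU, so this cell needs a GPU runtime.
    qa_model = pipeline("text-generation", model="gpt2", device=0)
    prompt = f"Answer the question based on the context below:\n\n{context}\n\nQuestion: {query}\nAnswer:"

    # generated_text contains the prompt followed by the model's continuation.
    response = qa_model(prompt, max_new_tokens=150, do_sample=True, temperature=0.7)
    return response[0]['generated_text']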
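
By default the text-generation pipeline returns the prompt followed by the continuation, so the printed answer starts with the whole retrieved context. As a sketch, the prompt assembly and the answer extraction can be factored into two small helpers. The names `build_prompt` and `extract_answer` are hypothetical, and the generated text in the demo is a made-up stand-in, not real model output.

```python
def build_prompt(query, chunks, indices):
    """Join the retrieved chunks into a context block and append the question."""
    context = "\n".join(chunks[i] for i in indices)
    return (
        "Answer the question based on the context below:\n\n"
        f"{context}\n\nQuestion: {query}\nAnswer:"
    )

def extract_answer(generated_text, prompt):
    """Strip the echoed prompt so only the model's continuation remains."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].strip()
    return generated_text.strip()

# Toy chunks and a fake generation result, for illustration only.
toy_chunks = [
    "RAG combines retrieval with generation.",
    "FAISS stores vectors.",
    "GPT-2 is a language model.",
]
prompt = build_prompt("What is RAG?", toy_chunks, [0, 2])
fake_generated = prompt + " A method that retrieves documents first."
answer_only = extract_answer(fake_generated, prompt)
print(answer_only)  # A method that retrieves documents first.
```

Separating these steps also makes it easier to inspect which chunks were retrieved before any text is generated.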


In [12]:
answer = retrieve_and_answer("What is retrieval-augmented generation?", index, all_chunks)
print(answer)

Device set to use cuda:0
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Answer the question based on the context below:

arXiv:2212.14024 , 2022.
[24] Z. Jiang, F. F. Xu, L. Gao, Z. Sun, Q. Liu, J. Dwivedi-Yu, Y . Yang,
J. Callan, and G. Neubig, “Active retrieval augmented generation,”
arXiv preprint arXiv:2305.06983 , 2023.
[25] A. Asai, Z. Wu, Y . Wang, A. Sil, and H. Hajishirzi, “Self-rag:
Learning to retrieve, generate, and critique through self-reflection,”
arXiv preprint arXiv:2310.11511 , 2023.
[26] Z. Ke, W. Kong, C. Li, M. Zhang, Q. Mei, and M. Bendersky,
LLM-auto-eval-best-practices-RAG, 2023.
[164] S. Es, J. James, L. Espinosa-Anke, and S. Schockaert, “Ragas: Au-
tomated evaluation of retrieval augmented generation,” arXiv preprint
arXiv:2309.15217 , 2023.
[165] J. Saad-Falcon, O. Khattab, C. Potts, and M. Zaharia, “Ares: An
automated evaluation framework for retrieval-augmented generation
systems,” arXiv preprint arXiv:2311.09476 , 2023.
[166] C. Jarvis and J. Allard, “A survey of techniques for
below a certain threshold would activates the ret