# Utilizing OpenAI LLMs for Retrieval Augmented Generation with FAISS

This project uses OpenAI large language models (LLMs) for retrieval augmented generation (RAG), with FAISS providing efficient similarity search over the embedded text. Text parsed from a PDF serves as the knowledge source, so the retrieval and answer generation steps can be tested end to end.

## Libraries are installed


In [None]:
!pip install openai --quiet
!pip install langchain --quiet
!pip install pypdf --quiet
!pip install faiss-cpu --quiet
!pip install tiktoken --quiet

ERROR: pip's dependency resolver does not currently take into account all the packag

## Google Drive is mounted. PDF files are stored there

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


## OpenAI token is saved as environment variable

In [None]:
import os
os.environ["OPENAI_API_KEY"] = "token"
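Hardcoding the token in a notebook makes it easy to leak when the notebook is shared. A minimal sketch of an alternative, assuming the key is either already set in the environment or typed in interactively; `resolve_api_key` is a hypothetical helper, not part of the OpenAI or LangChain libraries:

```python
import os
from getpass import getpass
from typing import Callable, Mapping


def resolve_api_key(
    env: Mapping[str, str] = os.environ,
    prompt: Callable[[str], str] = getpass,
) -> str:
    """Return the OpenAI key from the environment, or ask for it without echoing."""
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        # Fall back to an interactive prompt so the key never appears in the notebook.
        key = prompt("OpenAI API key: ").strip()
    if not key:
        raise ValueError("No OpenAI API key provided")
    return key


# Usage in a notebook cell (prompts only when the variable is not set yet):
# os.environ["OPENAI_API_KEY"] = resolve_api_key()
```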

## Libraries are imported

In [None]:
# Standard library (pickle is not used in the cells below)
import os
import pickle
import re

# Third-party (PdfReader is not used in the cells below)
from langchain.callbacks import get_openai_callback
from langchain.chains.question_answering import load_qa_chain
from langchain.chat_models import ChatOpenAI
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS
from langchain_community.document_loaders import PyPDFLoader
from pypdf import PdfReader

## Model is instantiated


In [None]:
chat = ChatOpenAI(model='gpt-3.5-turbo')

## PDF document is loaded

In [None]:
pdf = "/content/gdrive/MyDrive/GenAIEne2024/XBOX News Leaks.pdf"
loader = PyPDFLoader(pdf)
pages = loader.load_and_split()
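`load_and_split()` returns the PDF text as a list of chunks rather than one long string. The idea of chunking with overlap can be sketched as below. This is a simplified fixed-window stand-in, not the splitter LangChain actually applies, and `chunk_text` with its parameters is a hypothetical helper:

```python
def chunk_text(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into windows of chunk_size characters sharing `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample_chunks = chunk_text("abcdefghij", chunk_size=4, overlap=1)
print(sample_chunks)
```

The overlap keeps a sentence that straddles a boundary visible in both neighbouring chunks, at the cost of embedding some text twice.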

## A regex is used to extract the PDF name, which is then used to name the corresponding vector store

In [None]:
store_name = re.findall(r'([^\/]+)\.pdf$',pdf)[0]
store_name

'XBOX News Leaks'
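For comparison, the same name can be obtained without a regex. A small sketch using the standard library's `pathlib`, assuming the path uses forward slashes as in the cell above:

```python
from pathlib import PurePosixPath

pdf = "/content/gdrive/MyDrive/GenAIEne2024/XBOX News Leaks.pdf"

# .stem is the final path component with its last suffix removed.
store_name = PurePosixPath(pdf).stem
print(store_name)
```

One difference to keep in mind: the regex version raises an `IndexError` when the path does not end in `.pdf`, whereas `.stem` strips whatever the last suffix happens to be.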

## OpenAI embeddings are initialized to turn text into vectors, and a FAISS index is built from the pages

In [None]:
embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_documents(pages,embedding=embeddings)

## Vector store is saved to Google Drive

In [None]:
vectorstore.save_local(f"/content/gdrive/MyDrive/GenAIEne2024/{store_name}_vector_store")

## Query is created


In [None]:
query = 'What is this document about?'

## Similarity search is performed with the aforementioned query. The top 3 most similar documents are retrieved to be used as context.

In [None]:
docs = vectorstore.similarity_search(query=query, k=3)  # top 3 chunks closest to the query
chain = load_qa_chain(llm=chat, chain_type="stuff", verbose=True)  # "stuff": all chunks go into one prompt
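To make the retrieval step less of a black box, here is a toy sketch of top-k nearest-neighbour search in plain Python. It assumes Euclidean (L2) distance as the metric and uses made-up 2-dimensional vectors; real OpenAI embeddings have far more dimensions, and FAISS performs this search much more efficiently. The names `l2_distance`, `top_k`, and `toy_vectors` are hypothetical:

```python
import math


def l2_distance(a: list[float], b: list[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query_vec: list[float], doc_vecs: dict[str, list[float]], k: int) -> list[str]:
    """Return the names of the k vectors closest to query_vec, nearest first."""
    scored = sorted((l2_distance(query_vec, vec), name) for name, vec in doc_vecs.items())
    return [name for _, name in scored[:k]]


# Made-up embeddings standing in for four PDF chunks.
toy_vectors = {
    "page_a": [1.0, 0.0],
    "page_b": [0.9, 0.1],
    "page_c": [0.0, 1.0],
    "page_d": [-1.0, 0.0],
}

nearest = top_k([1.0, 0.2], toy_vectors, k=3)
print(nearest)
```

With `k=3`, the chunk farthest from the query (`page_d` in this toy data) is left out of the context.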

## Executing the QA chain to retrieve info and create response

In [None]:
with get_openai_callback() as cb:
  response = chain.run(input_documents = docs, question = query)
  print(cb)


> Entering new StuffDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
System: Use the following pieces of context to answer the user's question.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
----------------
Bethesda   
Every game mentioned so far is a previously released game, but what about new ones? Some 
major upcoming Xbox games could be going multiplatform, too. According to The Verge , Xbox 
exclusive  Indiana Jones and the Great Circle  could be exclusive no more. Sources reportedly told 
the site that Xbox was considering launching the action -adventure title on PS5. The exact timing 
of that is unclear. There’s a chance, though, that it launches alongside the Xbox Series X/S 
version later this year. If that happens, it’ll be a major disruption of the usual “console exclusive” 
strategy and a signal that Xbox is plotting a significant strategic shift.

Rare   
Xbox’
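The prompt printed above suggests how the "stuff" strategy works: the retrieved chunks are concatenated and placed under a fixed system instruction, followed by the user's question. A rough sketch of that assembly in plain Python; the system text is copied from the output above, while the blank-line separator between chunks and the `build_stuff_prompt` helper are assumptions made for illustration:

```python
SYSTEM_TEMPLATE = (
    "Use the following pieces of context to answer the user's question. \n"
    "If you don't know the answer, just say that you don't know, "
    "don't try to make up an answer.\n"
    "----------------\n"
    "{context}"
)


def build_stuff_prompt(
    doc_texts: list[str], question: str, separator: str = "\n\n"
) -> list[tuple[str, str]]:
    """Join all retrieved chunks into one context block and pair it with the question."""
    context = separator.join(doc_texts)  # assumed separator: one blank line
    return [
        ("system", SYSTEM_TEMPLATE.format(context=context)),
        ("human", question),
    ]


messages = build_stuff_prompt(
    ["First retrieved chunk.", "Second retrieved chunk."],
    "What is this document about?",
)
for role, text in messages:
    print(f"{role}: {text}\n")
```

Because every chunk goes into a single prompt, this approach only works while the combined text fits within the model's context window.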

## Displaying response

In [None]:
print(response)

This document is about potential upcoming Xbox games, specifically Indiana Jones and the Great Circle, Sea of Thieves, and Hi-Fi Rush, potentially being released on other platforms such as PlayStation and Nintendo Switch. It discusses rumors and speculation surrounding these games and their potential multiplatform releases.
