## A project to build a personalized chatbot that answers questions about a specific set of data that we provide.
## It uses LangChain to orchestrate the pipeline, OpenAI to generate embeddings, and the FAISS library to index and store the text embeddings efficiently, all based on a knowledge base supplied as a PDF file.


### First, set up the environment: install the required libraries and set the OpenAI API key.

In [None]:
# Download the needed libraries

!pip install langchain
!pip install openai
!pip install PyPDF2
!pip install faiss-cpu
!pip install tiktoken

In [23]:
from PyPDF2 import PdfReader
from langchain.vectorstores import FAISS
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
import os
from langchain.chains.question_answering import load_qa_chain
from langchain.llms import OpenAI

In [24]:
# Set the OpenAI API key (an OpenAI account is required).
# Replace the placeholder string below with your own key.


os.environ["OPENAI_API_KEY"] = "OPENAI_API_KEY"
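
The cell above stores a placeholder string, so later calls to OpenAI would fail until a real key is supplied. A minimal sketch of a guard that catches this early is shown below. The helper name `require_api_key` and the demo key are hypothetical and are not part of LangChain or the OpenAI library.

```python
import os


def require_api_key(env=os.environ, name="OPENAI_API_KEY"):
    """Return the key stored under `name`, or raise if it is missing or still the placeholder."""
    key = env.get(name, "").strip()
    # The notebook's placeholder value is the variable name itself, so reject that too.
    if not key or key == name:
        raise RuntimeError(f"Set {name} to a real key before running the notebook")
    return key


# Demo with a plain dict standing in for os.environ (the key below is made up).
demo_env = {"OPENAI_API_KEY": "sk-example-not-a-real-key"}
checked_key = require_api_key(demo_env)
print(checked_key)
```

In the notebook itself, calling `require_api_key()` with no arguments checks the real environment.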

In [25]:
# Read in the pdf file


pdf_reader = PdfReader("/content/BIG-DATA-COURS.pdf")

In [26]:
# Extract the text of every page and concatenate it into a single string called text


text = ''
for page in pdf_reader.pages:
  tmp_text = page.extract_text()
  # Skip pages that yield no text (e.g. image-only pages)
  if tmp_text:
    text += tmp_text

In [27]:
# Split the text into chunks; each chunk is embedded and retrieved as one unit
# Here a chunk holds at most 512 characters, and consecutive chunks can share up to 32 characters of overlap


text_splitter = RecursiveCharacterTextSplitter(
  chunk_size=512,
  chunk_overlap=32,
  length_function=len,
)
texts = text_splitter.split_text(text)
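
To illustrate what `chunk_size` and `chunk_overlap` mean, here is a simplified sketch that cuts a string into fixed-size character windows. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that class splits on separators such as paragraph and line breaks before merging pieces). The function name `sliding_chunks` and the sample string are assumptions made for this example.

```python
def sliding_chunks(text, chunk_size=512, chunk_overlap=32):
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
demo_chunks = sliding_chunks(sample, chunk_size=10, chunk_overlap=3)
print(demo_chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk starts with the last three characters of the previous one. The overlap is there so that a sentence cut at a chunk boundary still appears whole in at least one chunk.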

In [28]:
# Initialize the OpenAI embeddings model
# Embeddings capture the semantics of our data (contextualization)
# We will use them to convert the chunks of text into vectors


embeddings = OpenAIEmbeddings()

# Finally, build a vector database (our knowledge base) from the chunks using the FAISS library and the OpenAI embeddings


docsearch = FAISS.from_texts(texts, embeddings)
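
Conceptually, a vector store keeps one vector per chunk and answers a query by returning the chunks whose vectors lie closest to the query vector. The sketch below shows that idea with a brute-force search in plain Python. The three-dimensional vectors and the chunk texts are invented for illustration (real OpenAI embeddings have many more dimensions), and FAISS uses optimized index structures rather than this loop.

```python
import math

# Toy "index": (chunk text, made-up embedding vector) pairs.
toy_index = [
    ("Big data is defined by volume, variety and velocity.", [0.9, 0.1, 0.0]),
    ("Hadoop stores files across a cluster.", [0.1, 0.9, 0.1]),
    ("Spark processes data in memory.", [0.0, 0.3, 0.9]),
]


def l2_distance(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_chunks(query_vector, index, k=2):
    """Return the k chunk texts whose vectors are closest to query_vector."""
    ranked = sorted(index, key=lambda item: l2_distance(query_vector, item[1]))
    return [chunk for chunk, _ in ranked[:k]]


# A made-up query vector that points roughly in the direction of the first chunk.
query_vector = [0.8, 0.2, 0.1]
top_chunks = nearest_chunks(query_vector, toy_index, k=2)
print(top_chunks)
```

The first chunk ranks highest because its vector is nearest to the query vector. The notebook's `similarity_search` does the same kind of ranking on real embeddings.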

In [29]:
# The QA chain connects the chunks returned by the similarity search to the user's question
# The same mechanism can be used for more complex tasks

# The "stuff" chain type inserts all retrieved documents into a single prompt,
# which suits applications like this one, where documents are small and only a few are passed in for most calls


chain = load_qa_chain(OpenAI(), chain_type="stuff")
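
The idea behind the "stuff" strategy can be sketched in a few lines: join the retrieved chunks into one context block and place it in front of the question. The prompt wording and the function name `build_stuff_prompt` below are hypothetical. LangChain ships its own prompt template, which is not reproduced here.

```python
def build_stuff_prompt(chunks, question):
    """Put every retrieved chunk into one prompt, followed by the question."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Made-up chunks standing in for the results of a similarity search.
demo_prompt = build_stuff_prompt(
    ["Chunk one about big data.", "Chunk two about Hadoop."],
    "what is big data ?",
)
print(demo_prompt)
```

Because every chunk goes into the same prompt, this approach only works while the combined text fits within the model's context window.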

In [31]:
# Finally, we get to ask our PDF questions

query = "what is big data ?"
docs = docsearch.similarity_search(query)
chain.run(input_documents=docs, question=query).strip()

'Big data refers to a large volume of data that is characterized by its massive size, variety of formats and structures, and the need for fast processing. It is a complex type of data that requires advanced technologies and algorithms for storage and analysis. It is often defined using the concept of the 3Vs: velocity, variety, and volume.'

In [32]:
query = "for which engineer profile is this course useful ?"
docs = docsearch.similarity_search(query)
chain.run(input_documents=docs, question=query).strip()

'This course is useful for engineers working with big data, as it covers topics such as Hadoop, MapReduce, and Spark, which are commonly used in big data technologies.'