<a href="https://colab.research.google.com/github/iptimoshenko/openai_forbiz_tasks/blob/main/PDF_Query_LLM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install langchain
!pip install openai
!pip install PyPDF2
!pip install faiss-cpu
!pip install tiktoken

In [None]:
from PyPDF2 import PdfReader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import FAISS

In [None]:
# Get an API key from OpenAI; you will need to create an account first.
# Account page: https://platform.openai.com/account/billing/overview
import os
os.environ["OPENAI_API_KEY"] = ""
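Pasting the key into the cell above makes it easy to leak when the notebook is shared. As a minimal sketch of an alternative, the hypothetical helper `ensure_api_key` below (not part of any library) reuses a key that is already in the environment and otherwise asks for it with a hidden prompt:

```python
import os
from getpass import getpass


def ensure_api_key(var_name="OPENAI_API_KEY", prompt=getpass):
    """Return the key stored in `var_name`, prompting for it if unset.

    Hypothetical helper for illustration; `prompt` can be swapped out,
    which also makes the function easy to test.
    """
    key = os.environ.get(var_name, "").strip()
    if not key:
        # getpass hides the typed value, so it is not echoed into the cell output.
        key = prompt(f"Enter {var_name}: ").strip()
        os.environ[var_name] = key
    return key
```

Calling `ensure_api_key()` once, instead of assigning the key as a string literal, keeps the secret out of the saved notebook source.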

In [None]:
# Mount Google Drive so that files stored there can be read from the notebook.
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)
root_dir = "/content/gdrive/My Drive/"

Mounted at /content/gdrive


In [None]:
# Path to the PDF file, relative to the Drive root.
reader = PdfReader(os.path.join(root_dir, 'ecstatic_earthling/apps/The-Field-Guide-to-Data-Science.pdf'))

In [None]:
reader

<PyPDF2._reader.PdfReader at 0x7f7538f02340>

In [None]:
# Extract the text of every page and concatenate it into raw_text.
raw_text = ''
for page in reader.pages:
    text = page.extract_text()
    if text:  # skip pages with no extractable text
        raw_text += text

In [None]:
len(raw_text)

162337

In [None]:
raw_text[:100]

'    THE FIELD  GUIDE   \n            to   DATA  SCIENCE\n© COPYRIGHT 2013 BOOZ ALLEN HAMILTON INC. ALL'

In [None]:
# Split the text into smaller chunks so that information retrieval does not hit the model's token limit.

text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
)
texts = text_splitter.split_text(raw_text)
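To see what `chunk_size` and `chunk_overlap` mean in practice, here is a simplified pure-Python sketch of separator-based splitting with overlap. The function `split_with_overlap` is a hypothetical illustration of the general idea, not LangChain's actual implementation, and its edge-case behaviour may differ from `CharacterTextSplitter`.

```python
def split_with_overlap(text, separator="\n", chunk_size=1000, chunk_overlap=200):
    """Greedily pack separator-delimited pieces into overlapping chunks.

    Simplified illustration only; not LangChain's implementation.
    """
    pieces = [p for p in text.split(separator) if p]
    chunks, current = [], []

    def length(parts):
        return len(separator.join(parts))

    for piece in pieces:
        if current and length(current + [piece]) > chunk_size:
            # The current chunk is full: emit it.
            chunks.append(separator.join(current))
            # Keep only a tail of at most chunk_overlap characters that
            # still leaves room for the next piece.
            while current and (
                length(current) > chunk_overlap
                or length(current + [piece]) > chunk_size
            ):
                current.pop(0)
        current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


# Ten short lines, split into chunks of at most 30 characters.
sample = "\n".join(f"line {i:02d}" for i in range(10))
demo_chunks = split_with_overlap(sample, chunk_size=30, chunk_overlap=10)
for chunk in demo_chunks:
    print(repr(chunk))
```

In this toy run each chunk starts with the last line of the previous one. The overlap plays the same role in the real splitter: a sentence that straddles a chunk boundary still appears intact in at least one chunk.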

In [None]:
len(texts)

202

In [None]:
texts[0]

'THE FIELD  GUIDE   \n            to   DATA  SCIENCE\n© COPYRIGHT 2013 BOOZ ALLEN HAMILTON INC. ALL RIGHTS RESERVED.FOREWORD\nEvery aspect of our lives, from life-saving disease \ntreatments, to national security, to economic stability \nand even the convenience of selecting a restaurant, \ncan be improved by creating better data analytics \nthrough Data Science. \nWe live in a world of incredible \nbeauty and complexity. A world \nincreasingly measured, mapped, \nand recorded into digital bits for \neternity. Our human existence \nis pouring into the digital realm \nfaster than ever. From global \nbusiness operations to simple \nexpressions of love – an essential \npart of humanity now exists in \nthe digital world. \nData  is the byproduct of our \nnew digital existence.  Recorded \nbits of data from mundane \ntraﬃc cameras to telescopes \npeering into the depths of \nspace are propelling us into \nthe greatest age of discovery \nour species has ever known. \nAs we move from isolatio

In [None]:
texts[1]

'traﬃc cameras to telescopes \npeering into the depths of \nspace are propelling us into \nthe greatest age of discovery \nour species has ever known. \nAs we move from isolation into \nour ever-connected and recorded \nfuture, data is becoming the \nnew currency and a vital natural \nresource. /T_he power, importance, and responsibility such incredible \ndata stewardship will demand of \nus in the coming decades is hard \nto imagine – but we often fail to \nfully appreciate the insights data \ncan provide us today. Businesses \nthat do not rise to the occasion \nand garner insights from this new \nresource are destined for failure.\nAn essential part of human \nnature is our insatiable curiosity \nand the need to /f_ind answers to \nour hardest problems. Today, the \nemerging /f_ield of Data Science is \nan auspicious and profound new \nway of applying our curiosity \nand technical tradecraft to create \nvalue from data that solves our \nhardest problems. Leaps in \nhuman imagination,

In [None]:
# Create the OpenAI embeddings client; the vectors themselves are computed later, when texts are embedded.
embeddings = OpenAIEmbeddings(disallowed_special=())

In [None]:
embeddings

OpenAIEmbeddings(client=<openai.resources.embeddings.Embeddings object at 0x7f2519108df0>, async_client=<openai.resources.embeddings.AsyncEmbeddings object at 0x7f251912ed70>, model='text-embedding-ada-002', deployment='text-embedding-ada-002', openai_api_version='', openai_api_base=None, openai_api_type='', openai_proxy='', embedding_ctx_length=8191, openai_api_key='<redacted>', openai_organization=None, allowed_special=set(), disallowed_special=set(), chunk_size=1000, max_retries=2, request_timeout=None, headers=None, tiktoken_model_name=None, show_progress_bar=False, model_kwargs={}, skip_empty=False, default_headers=None, default_query=None, http_client=None)

In [None]:
# Embed every chunk and build an in-memory FAISS index over the resulting vectors.
docsearch = FAISS.from_texts(texts, embeddings)

In [None]:
docsearch

<langchain.vectorstores.faiss.FAISS at 0x7a28ec660160>
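Conceptually, a vector store keeps one embedding per chunk and answers a query by returning the chunks whose embeddings lie closest to the query's embedding. The brute-force sketch below illustrates that idea with hand-made 3-dimensional vectors standing in for real embeddings. The vectors, the `nearest_chunks` helper, and the choice of Euclidean distance are assumptions made for illustration; FAISS itself uses optimized index structures.

```python
import math

# Toy "index": (chunk text, made-up embedding). Real embeddings have
# far more dimensions and come from the embedding model.
toy_index = [
    ("Data scientists turn data into insight.", [0.9, 0.1, 0.0]),
    ("Neural networks model nonlinear patterns.", [0.1, 0.9, 0.1]),
    ("Clustering groups similar records.", [0.0, 0.2, 0.9]),
]


def nearest_chunks(query_vector, index, k=2):
    """Return the k chunk texts whose vectors are closest to query_vector."""
    ranked = sorted(index, key=lambda item: math.dist(query_vector, item[1]))
    return [text for text, _ in ranked[:k]]


# A made-up query vector that points mostly along the first dimension.
top = nearest_chunks([0.8, 0.2, 0.1], toy_index, k=2)
print(top)
```

The chunks returned this way are what the question-answering step later receives as context.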

In [None]:
from langchain.chains.question_answering import load_qa_chain
from langchain.llms import OpenAI

In [None]:
chain = load_qa_chain(OpenAI(), chain_type="stuff")
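The "stuff" chain type places all of the supplied documents into a single prompt and sends one request to the model. A rough sketch of that prompt assembly follows; the template wording and the `build_stuff_prompt` helper are hypothetical and do not reproduce LangChain's actual prompt.

```python
def build_stuff_prompt(documents, question):
    """Concatenate all documents into one context block ahead of the question.

    Hypothetical template for illustration only.
    """
    context = "\n\n".join(documents)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_stuff_prompt(
    ["Chunk A text.", "Chunk B text."],
    "What does chunk A say?",
)
print(prompt)
```

Because every document ends up in the same prompt, the combined text has to fit within the model's context window, which is one reason the chunks are kept small.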

In [None]:
query = "who is data scientist?"
# Retrieve the chunks most similar to the query, then let the LLM answer using them as context.
docs = docsearch.similarity_search(query)
chain.run(input_documents=docs, question=query)

' Data scientists are people who use their creativity, curiosity, technical skills, and detail-oriented nature to solve problems using data.'

In [None]:
docs

In [None]:
query = "what are the best machine learning algorithms for time series analysis, their strengths and weaknesses?"
docs = docsearch.similarity_search(query)
chain.run(input_documents=docs, question=query)

' Neural networks are a good choice for evaluating weekly variable contributions, as they can condition upon week without greatly increasing the complexity. Other algorithms such as principal component regression and unsupervised learning techniques may also be beneficial, depending on the goal of the analysis. Each of these algorithms have their own strengths and weaknesses, so it is best to research each before making a decision.'

In [None]:
query = "what are the best machine learning algorithms for performance prediction, their strengths and weaknesses?"
docs = docsearch.similarity_search(query)
chain.run(input_documents=docs, question=query)

' It depends on the data and the problem, but some common machine learning algorithms for performance prediction include regression, clustering, classification, and recommendation. Each has its own strengths and weaknesses, so it is best to evaluate them based on the specific data and problem.'