### Query a PDF document read from a directory

This example uses **OpenAI** models, which require an API token.  
See the `Account setup` section of  
https://platform.openai.com/docs/quickstart?context=python

### Sample environment file holding API tokens
import os

# Placeholder values: replace each with your own token
SERPAPI_API_KEY="**Google**"
HF_TOKEN="**HuggingFaces**"
GITHUB_TOKEN="**GitHub**"
OPENAI_API_KEY="**OpenAI**"

def set_environment():
    # Export every module-level name containing API, ID, or TOKEN as an environment variable
    variable_dict = globals().items()
    for key, value in variable_dict:
        if "API" in key or "ID" in key or "TOKEN" in key:
            os.environ[key] = value
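
The helper above copies every matching module-level name into `os.environ`. A minimal variant, assuming the tokens are passed in explicitly rather than discovered through `globals()`, might look like the sketch below. The name `export_tokens` and the `EXAMPLE_*` entries are hypothetical and not part of the original config module.

```python
import os

def export_tokens(tokens, markers=("API", "ID", "TOKEN")):
    """Copy string-valued entries whose name contains a marker into os.environ."""
    exported = []
    for key, value in tokens.items():
        # os.environ accepts only strings, so anything else is skipped
        if isinstance(value, str) and any(marker in key for marker in markers):
            os.environ[key] = value
            exported.append(key)
    return exported

# Hypothetical entries for illustration only; none of these is a real token
names = export_tokens({
    "EXAMPLE_API_KEY": "placeholder-value",  # exported
    "EXAMPLE_TOKEN": 123,                    # skipped: not a string
    "VERBOSE": "yes",                        # skipped: no marker in the name
})
print(names)
```

Returning the list of exported names makes it easy to confirm which variables were set without printing their values.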
  

In [2]:
%load_ext autoreload
%autoreload 2
import os
from pathlib import Path
import sys
# Read OPENAI_API_KEY from the config file, or set it directly as an environment variable:
# os.environ['OPENAI_API_KEY'] = "<your OpenAI access token>"
sys.path.append(r'..')
from configs import set_environment
set_environment()
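
Before building anything on top of the key, it can help to confirm that it is actually present. The following is a small sketch, assuming only the standard library; `require_env` and `mask` are hypothetical helper names, not part of the notebook or of LangChain.

```python
import os

def require_env(name):
    """Return the value of an environment variable or raise a clear error."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set; export it or add it to your config file")
    return value

def mask(secret, visible=4):
    """Hide all but the last few characters so the key never appears in cell output."""
    return "*" * max(len(secret) - visible, 0) + secret[-visible:]

# Usage in the notebook, after set_environment() has run:
# print(mask(require_env("OPENAI_API_KEY")))
```

Failing early with a named variable is usually easier to debug than a `KeyError` or an authentication error several cells later.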

In [7]:
# Initial imports
from langchain.llms import OpenAI
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
from langchain.chat_models import ChatOpenAI

In [10]:
# load related packages to split, embed, store
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.chains import ConversationalRetrievalChain

### Query PDF document

In [11]:
from langchain.document_loaders import PyPDFLoader       # load .pdf file
from langchain.document_loaders import Docx2txtLoader    # load .docx file
from langchain.document_loaders import TextLoader        # load .txt file
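
Three loaders are imported here, one per file type. A minimal sketch of choosing among them by file extension follows. `pick_loader` and its `registry` argument are hypothetical helpers, and the commented usage assumes the three LangChain classes imported in this cell are in scope.

```python
import os

def pick_loader(file_name, registry):
    """Build a loader for file_name using the class registered for its extension."""
    _, ext = os.path.splitext(file_name)
    try:
        loader_cls = registry[ext.lower()]
    except KeyError:
        raise ValueError(f"Unsupported file type: {ext or '(none)'}") from None
    return loader_cls(file_name)

# Usage with the loaders imported in the notebook (assumed to be in scope):
# registry = {".pdf": PyPDFLoader, ".docx": Docx2txtLoader, ".txt": TextLoader}
# loader = pick_loader("./data/falcon-users-guide-2021-09.pdf", registry)
```

Keeping the mapping in a dictionary means a new file type needs only one new entry, and an unsupported extension fails with an explicit message instead of an undefined variable.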

In [12]:
# load PDF document
file_name = "./data/falcon-users-guide-2021-09.pdf"
name, ext = os.path.splitext(file_name)
print(f"name:{name} - extension:{ext}")

name:falcon-users-guide-2021-09 - extension:.pdf
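
`os.path.splitext` removes only the extension, so its first element keeps any directory prefix. When just the bare file name is wanted, `pathlib` exposes it as `stem`, with the extension in `suffix`. A short sketch using the same path:

```python
import os
from pathlib import Path

file_name = "./data/falcon-users-guide-2021-09.pdf"

root, ext = os.path.splitext(file_name)   # root keeps the "./data/" prefix
path = Path(file_name)

print(f"splitext root: {root}")
print(f"name:{path.stem} - extension:{path.suffix}")
```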


In [13]:
if ext == ".pdf":
  loader = PyPDFLoader(file_name)
else:
  raise ValueError(f"Unsupported file type: {ext}")

In [14]:
# load document
document = loader.load()

# create a splitter - chunk_size = 1000, and chunk_overlap = 150
text_chunks = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)

# split document into chunks using the splitter
chunks = text_chunks.split_documents(document)

# get OpenAI Embeddings
embeddings = OpenAIEmbeddings()

# Use Chroma to store vectors of embeddings
vector_store = Chroma.from_documents(chunks, embeddings)


Overwriting cache for 0 11250
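
To make `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified sketch that slices a string with a fixed-size window. It is not how `RecursiveCharacterTextSplitter` works internally, since that class prefers to break on separators such as paragraph and line breaks. `sliding_chunks` is a hypothetical helper used only to show how consecutive chunks share text.

```python
def sliding_chunks(text, chunk_size=1000, chunk_overlap=150):
    """Split text into fixed-size windows where neighbours share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks

text = "abcdefghij" * 200            # 2000 characters of toy data
pieces = sliding_chunks(text)
print([len(p) for p in pieces])      # [1000, 1000, 300]
print(pieces[0][-150:] == pieces[1][:150])   # True: the shared overlap
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of embedding some text twice.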


In [19]:
# some verification
chunks[20]

Document(page_content='© Space Exploration Technologies Corp.  All rights reserved.  1 \n \n \n \nThe Falcon launch vehicle user’s guide is a planning document provided for customers of Space X (Space Exploration \nTechnologies Corp.). This document is applicable to the Falcon vehicle configuration s with a 5.2 m ( 17-ft) diameter \nfairing and the related launch service (Section  2). \nThis user’s guide is intended for pre -contract mission planning and for understanding SpaceX’s standard services. The \nuser’s guide is not intended for detailed design use. Data for detailed design purposes will be exchanged directly between \na SpaceX customer and a SpaceX mission manager.   \nSpac eX reserves the right to update this user’s guide as required. Future revisions are assumed to always be in process \nas SpaceX gathers additional data and works to improve its launch vehicle design.  \n \nSpaceX  offers  a family of launch vehic les that improve s launch  reliability and  increase s acces

In [72]:
# Embeddings are numerical representations of text: high-dimensional vectors
print(type(embeddings))
str_ch20 = str(chunks[20]).split(' \\n')[0]   # keep the text before the first ' \\n'
print(type(str_ch20))

# print the first 20 dimensions of the chunk's embedding
print(embeddings.embed_query(str_ch20)[:20])

<class 'langchain_community.embeddings.openai.OpenAIEmbeddings'>
<class 'str'>
[0.012253612080443187, 0.00555551788108997, -0.015626605345768544, -0.02889964123714779, -0.008996928895600685, -0.0010382387385281654, -0.02378199583325705, -0.019307477085444176, -0.018814870781425685, -0.014422453705164264, 0.010584220117907488, 0.01005056204679071, -0.01527083267747565, -0.015188731937246755, -0.005360527808139104, 0.02508193213645632, 0.006947818564784627, -0.030678500853322018, 4.270761859123458e-05, 0.014449820618573896]
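
Because each embedding is a plain list of floats, two pieces of text can be compared by comparing their vectors. Cosine similarity is one common measure. The sketch below uses toy vectors rather than real embeddings, and the distance metric a vector store actually applies depends on its configuration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy vectors standing in for embeddings
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(same_direction, orthogonal)
```

Values near 1 indicate vectors pointing the same way, values near 0 indicate unrelated directions, and negative values indicate opposing directions.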


### ChatGPT

In [44]:
# setup ChatOpenAI
llm = ChatOpenAI(api_key=os.environ['OPENAI_API_KEY'], 
                 model='gpt-3.5-turbo', 
                 temperature=0.1, 
                 max_tokens=100)

In [45]:
# retrieve document from vector store
retriever = vector_store.as_retriever()

In [46]:
# setup retrieval chain
conv_ret_chain = ConversationalRetrievalChain.from_llm(llm, retriever=retriever)

In [70]:
# start asking questions
session_chat = {}
session_chat['history'] = []
print(type(session_chat))
def ask_questions():
  # keep prompting until the user types something containing "Thank"
  while True:
    question = input("Enter your question: ")
    if "Thank" in question:
      print("Thank you for using ChatGPT")
      return
    response = conv_ret_chain.run({'question':question,
                                   'chat_history':session_chat['history']})
    print(response, end="\n\n")
    # store the turn so follow-up questions have context
    session_chat['history'].append((question, response))

<class 'dict'>
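
The chat history passed to the chain is a list of `(question, answer)` tuples that grows by one entry per turn. The sketch below shows that bookkeeping in isolation. `run_session` and `echo_chain` are hypothetical stand-ins, with `echo_chain` replacing the real call to `conv_ret_chain` so the example runs without an API key or user input.

```python
def run_session(answer_fn, questions, stop_word="Thank"):
    """Feed questions to answer_fn in order, accumulating (question, answer) history."""
    history = []
    for question in questions:
        if stop_word in question:
            break  # same exit condition as the interactive loop
        answer = answer_fn({"question": question, "chat_history": list(history)})
        history.append((question, answer))
    return history

def echo_chain(inputs):
    """Stand-in for the retrieval chain: reports how much history it received."""
    turns = len(inputs["chat_history"])
    return f"answer to {inputs['question']!r} with {turns} prior turns"

history = run_session(
    echo_chain,
    ["What is Falcon?", "And its fairing?", "Thank you", "never asked"],
)
for question, answer in history:
    print(question, "->", answer)
```

Separating the loop from `input()` and from the model call also makes the history handling easy to test on its own.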


In [71]:
ask_questions()

Enter your question:  in 3 sentences summarize how "Elon Musk" got investments to "Tesla Motors"


Elon Musk co-founded Tesla Motors in 2003 and invested a significant amount of his own money into the company. He also secured investments from other co-founders, including Martin Eberhard and Marc Tarpenning, as well as external investors like JB Straubel and Ian Wright. Musk's vision for electric vehicles and sustainable energy attracted further investments as the company grew and gained traction in the market.


Enter your question:  In one sentence summarize SpaceX


SpaceX is a private aerospace company founded by Elon Musk that focuses on developing reliable, cost-effective launch vehicles like the Falcon family to increase access to space for commercial, government, and international customers.


Enter your question:  Thank You


Thank you for using ChatGPT


In [69]:
ask_questions()

Enter your question:  Thank You


Thank you for using ChatGPT
