In [1]:
%pip install -qU langchain openai tiktoken pinecone-client[grpc] pypdf chromadb

Note: you may need to restart the kernel to use updated packages.


In [2]:
from langchain.document_loaders import PyPDFLoader

loader = PyPDFLoader("data/GNSS-book.pdf")
pages = loader.load_and_split()
len(pages)

117

In [3]:
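import tiktoken

# cl100k_base is the encoding used by gpt-3.5-turbo
tokenizer = tiktoken.get_encoding('cl100k_base')

def tiktoken_len(text):
    """Return the number of cl100k_base tokens in text."""
    tokens = tokenizer.encode(
        text,
        disallowed_special=()
    )
    return len(tokens)

tiktoken_len(pages[41].page_content)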

411

In [4]:
# Total token count across all pages
total = sum(tiktoken_len(page.page_content) for page in pages)
total

49890

In [5]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma

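text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,      # measured in tokens, because length_function counts tokens
    chunk_overlap=20,
    length_function=tiktoken_len,
    separators=["\n\n", "\n", " ", ""]
)

# In-memory Chroma collection; no embedding function is passed, so the default is used
index = Chroma("gnss_book")
for page in pages:
    chunks = text_splitter.split_text(page.page_content)
    # Attach the page's metadata (source, page number) to every chunk from that page
    index.add_texts(chunks, metadatas=[page.metadata for _ in chunks])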

Using embedded DuckDB without persistence: data will be transient
No embedding_function provided, using default embedding function: SentenceTransformerEmbeddingFunction
  from .autonotebook import tqdm as notebook_tqdm
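
To make the `chunk_size=150` and `chunk_overlap=20` settings above more concrete, here is a minimal sketch of fixed-size chunking with overlap over a token list. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter works on separators and merges pieces up to the size limit); the function name `sliding_window_chunks` and the integer stand-in tokens are assumptions made for illustration only.

```python
def sliding_window_chunks(tokens, chunk_size=150, chunk_overlap=20):
    """Split a token list into windows of at most chunk_size tokens,
    where consecutive windows share chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the input
    return chunks

# Stand-in for a tokenized page: 411 integer "tokens" (hypothetical data,
# sized like the page measured earlier in the notebook).
page_tokens = list(range(411))
chunks = sliding_window_chunks(page_tokens)
print([len(c) for c in chunks])
```

With these settings each window advances by 130 tokens, so the last 20 tokens of one chunk reappear at the start of the next. The overlap reduces the chance that a sentence relevant to a query is cut in half at a chunk boundary.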


In [6]:
query = "airplane"
index.similarity_search(query)

[Document(page_content='distance and direction between the ship and \nhelicopter. Using this relative distance and \ndirection, the unmanned helicopter is able \nto autonomously approach and land on the \nship’s flight deck. \nAn article (From Fledgling to Flight) about the \nlanding of the unmanned Little Bird helicopter \non a moving ship is in the 2013 Velocity magazine \navailable at: resources.hexagonpositioning.\ncom/from-fledgling-to-flight .\nDelivering critical medical supplies\nDelivering essential medical supplies \nto hospitals and medical clinics in remote \nareas is often challenging. Long distances \nand poor infrastructure can delay deliveries \nof supplies that are desperately needed to', metadata={'source': 'data/GNSS-book.pdf', 'page': 99}),
 Document(page_content='com/put-to-the-test .\nLanding an unmanned helicopter \non a ship \nThe autonomous landing of an unmanned \nhelicopter is already challenging as the navigation system needs to deal with \nmovement of the h
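
The query "airplane" returns helicopter passages even though the word itself does not appear in the visible result text, because the search compares embedding vectors rather than keywords. The sketch below shows the general idea with cosine similarity over toy vectors. The three-dimensional vectors, the `toy_index` dictionary, and the `top_k` helper are all hypothetical, and the embedding model and distance metric that Chroma actually uses may differ.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=2):
    """Return the k (doc_id, score) pairs most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), doc_id)
              for doc_id, vec in doc_vecs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [(doc_id, score) for score, doc_id in scored[:k]]

# Hypothetical embeddings: real ones have hundreds of dimensions.
toy_index = {
    "helicopter landing": [0.9, 0.1, 0.0],
    "medical supplies": [0.2, 0.8, 0.1],
    "satellite orbits": [0.0, 0.1, 0.9],
}
query_vec = [1.0, 0.0, 0.0]  # pretend embedding of the query "airplane"
results = top_k(query_vec, toy_index, k=2)
for doc_id, score in results:
    print(f"{score:.3f}  {doc_id}")
```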

In [25]:
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA

def open_file(filepath):
    with open(filepath, 'r', encoding='utf-8') as infile:
        return infile.read()

# Strip whitespace so a trailing newline in the key file does not end up in the key
OPENAI_API_KEY = open_file('openaiapikey.txt').strip()

llm = ChatOpenAI(
    openai_api_key=OPENAI_API_KEY,
    model_name='gpt-3.5-turbo',
    temperature=0.0
)

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",  # put all retrieved chunks into a single prompt
    retriever=index.as_retriever(),
    #return_source_documents=True
)
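
The "stuff" chain type concatenates every retrieved chunk into one prompt and sends that prompt to the model in a single call. The sketch below illustrates the assembly step only. The `build_stuff_prompt` helper, the prompt wording, and the two sample chunks are assumptions for illustration, not LangChain's actual template.

```python
def build_stuff_prompt(question, retrieved_chunks):
    """Join all retrieved chunks into one context block and append the question."""
    context = "\n\n".join(retrieved_chunks)
    return (
        "Use the following context to answer the question. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

# Hypothetical chunks standing in for what the retriever would return.
sample_chunks = [
    "SBAS systems broadcast corrections to GNSS receivers.",
    "Receivers apply the corrections to range calculations.",
]
prompt = build_stuff_prompt("What is augmentation?", sample_chunks)
print(prompt)
```

Because everything goes into one prompt, the retrieved chunks plus the question must fit in the model's context window, which is one reason to keep chunks small.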


In [8]:
qa.run('What is the GNSS?')

'GNSS stands for Global Navigation Satellite System. It is a technology that uses a network of satellites to provide location, navigation, and timing information to users anywhere in the world. It is becoming a ubiquitous technology in many aspects of our lives, and has a wide range of applications in cost efficiency and safety of life. However, there are also potential threats to GNSS that need to be considered.'

In [9]:
qa.run('What can degrade the quality of positioning?')

'There are several factors that can degrade the quality of positioning, including poor satellite geometry, atmospheric interference, multipath errors, and receiver noise. These factors can cause errors in the positioning data and increase the standard deviation, which reduces the accuracy of the positioning output. To ensure integrity and accuracy, it is important to validate the positioning data and use techniques such as differential GNSS to mitigate errors and improve performance.'

In [10]:
qa.run('What is CrossCheck?')

"I'm sorry, there is no context provided about CrossCheck. Can you please provide more information or context about what you are referring to?"

In [27]:
res = qa.run('What is augmentation?')

In [28]:
res

'Augmentation is the process of improving the accuracy, integrity, and availability of the basic GNSS signals across a large geographical region. This is done through the use of satellite-based systems such as SBAS (Satellite-Based Augmentation System) like WAAS and EGNOS. These systems provide corrections to GNSS signals, which are then broadcast to GNSS receivers throughout the coverage area. User equipment receives the corrections and applies them to range calculations, resulting in improved accuracy.'