### 01_load2vectordb

#### Overview
This script is part of a pipeline that processes PDF documents and stores their data in a vector database for easy retrieval and analysis. The script performs the following key steps:

1. Data loading from PDFs
2. Text splitting
3. Embedding creation
4. Context-based chatbot

#### Use Cases
- Enhance search capabilities within a document database.
- Enable similarity-based document analysis.
- Facilitate advanced text analytics in a variety of domains.


##### Imports

In [1]:
from dotenv import load_dotenv
import os
from langchain.chat_models import ChatOpenAI
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain_core.documents import Document
import tiktoken
from langchain.text_splitter import RecursiveCharacterTextSplitter
import re
import PyPDF2
from langchain.vectorstores import Chroma
from typing import List, Optional, Union
from langchain.output_parsers import YamlOutputParser
from langchain.prompts import PromptTemplate
from langchain_core.pydantic_v1 import BaseModel, Field



##### Load the API key

In [2]:
# Load the .env file
load_dotenv()

# Accessing variables
myapikey = os.getenv('OPENAI_API_KEY')

##### 1. Data Loading from PDFs

In [3]:
# Define the folder path to the PDF files
folder_path = '../data/cz_vfr_manual/'

# Get a list of PDF files in the directory
pdf_files = [f for f in os.listdir(folder_path) if f.endswith('.pdf')]
pdf_paths = [os.path.join(folder_path, file) for file in pdf_files]

##### 2. Text Splitting

In [4]:
# tiktoken encoding used by gpt-4 (its name is available as tokenizer.name)
tokenizer = tiktoken.encoding_for_model('gpt-4')

# create the length function: number of tokens in a text
def tiktoken_len(text):
    tokens = tokenizer.encode(
        text,
        disallowed_special=()
    )
    return len(tokens)

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=64,
    length_function=tiktoken_len,
    separators=[". "],
    is_separator_regex=False,
)

def extract_cz_tags_from_filename(filename):
    # Extract the first occurrence of ICAO code from the filename
    match = re.search(r'lk[a-z]{2,6}', filename)
    
    # Return the found code in uppercase if there's a match, otherwise return None
    return match.group().upper() if match else None
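
As a quick check of the tag extraction, the sketch below repeats `extract_cz_tags_from_filename` and runs it on two filenames that appear in this notebook's own error output. The last three filenames are hypothetical and only illustrate edge cases: the pattern is case-sensitive and not anchored, so any lowercase `lk` followed by letters will match.

```python
import re

def extract_cz_tags_from_filename(filename):
    # Same pattern as above: "lk" followed by 2 to 6 lowercase letters
    match = re.search(r'lk[a-z]{2,6}', filename)
    return match.group().upper() if match else None

# Filenames taken from the notebook's error messages
print(extract_cz_tags_from_filename('ad-lkbe_text_en.pdf'))   # LKBE
print(extract_cz_tags_from_filename('ad-lkcebi_map_en.pdf'))  # LKCEBI

# Hypothetical filenames, used only to show edge cases
print(extract_cz_tags_from_filename('readme.pdf'))        # None
print(extract_cz_tags_from_filename('AD-LKBE_TEXT.pdf'))  # None (uppercase is not matched)
print(extract_cz_tags_from_filename('walkthrough.pdf'))   # LKTHROUG (unanchored match)
```

If stray matches like the last one become a problem, anchoring the pattern to the `ad-` prefix seen in the filenames would be one option.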

OpenAI Embeddings model

In [5]:
model_name = 'text-embedding-ada-002'

embed = OpenAIEmbeddings(
    model=model_name,
    openai_api_key=myapikey
)


Extracting data from MAP and TEXT PDFs into structured objects

In [135]:
model = ChatOpenAI(openai_api_key=myapikey, temperature=0, model_name="gpt-3.5-turbo-1106")

# Define your desired data structure.
class Fee(BaseModel):
    type: Optional[str]
    amount: Optional[str]
    info: Optional[str]

class Fees(BaseModel):
    operating_fees: Optional[List[Fee]] = Field(description="Landing fees at airport")
    parking_fees: Optional[List[Fee]]  = Field(description="Parking fees at airport")
    passanger_fees: Optional[List[Fee]]  = Field(description="Passenger fees at airport")
    other_information: Optional[str] = Field(description="Other information, for example how to pay fees")


class Runway(BaseModel):
    RWY: Optional[str]
    magnetic_direction: Optional[str]
    rwy_dimensions: Optional[str]

class Geocordinates(BaseModel):
    latitude: Optional[str]
    longitude: Optional[str]

class Airport(BaseModel):
    icao_code: str = Field(description="ICAO code of the airport")
    name: Optional[str] = Field(description="Official name of the airport")
    full_address: Optional[str] = Field(description="Official address of the airport")
    city: Optional[str] = Field(description="Municipality of the airport")
    airport_type: Optional[str] = Field(description="Type of the airport")
    runways: Optional[List[Runway]] = Field(description="Runways of the airport")
    traffic: Optional[str] = Field(description="Types of traffic handled, e.g., IFR, VFR")
    geocoordinates: Optional[List[Geocordinates]] = Field(description="Geographic coordinates of the airport as latitude and longitude")
    elevation_ft: Optional[Union[int, str]] = Field(description="Elevation of the airport in feet")
    elevation_m: Optional[Union[int, str]] = Field(description="Elevation of the airport in meters")
    frequency: Optional[Union[dict[str, str], str]] = Field(description="Communication frequencies used by the airport")
    operating_hours: Optional[str] = Field(description="Operating hours of the airport")
    fuel: Optional[Union[List[str], str]] = Field(description="Types of fuel available at the airport")



# Function to set up parser and prompt template, and create a processing chain
def setup_processing_chain(pydantic_object):
    """
    Sets up the processing chain with a parser and a prompt template.

    The chain expects a dict with a "query" key when invoked.

    :param pydantic_object: The Pydantic model class to be used with the parser.
    :return: The constructed processing chain.
    """

    # Set up a parser
    parser = YamlOutputParser(pydantic_object=pydantic_object)

    # Create a prompt template
    prompt = PromptTemplate(
        template="Extract structure information: \n{format_instructions}\n{query}\n",
        input_variables=["query"],
        partial_variables={"format_instructions": parser.get_format_instructions()},
    )

    # Create the processing chain
    chain = prompt | model | parser

    return chain

# Example usage
chain = setup_processing_chain(Airport)
fees_chain = setup_processing_chain(Fees)

def parse2struct(text):
    return chain.invoke({"query": text})

def fees2struct(text):
    return fees_chain.invoke({"query": text})

Extracting fee data from TEXT PDFs into a struct

In [137]:
yaml_special_chars = [':','{', '}', '[', ']', '#', '-', '!', '&', '*', '%', '"', "'", '|', '>', '\\n', '\\t']
# Ensure the target directory exists
output_dir = "../data/extracted_cz_vfr_manual/"
os.makedirs(output_dir, exist_ok=True)
# Iterate over each PDF file
for path in pdf_paths:
    # Extract the filename from the path
    filename = os.path.basename(path)

    # Process only files that end with 'text_en.pdf'
    if filename.endswith('text_en.pdf'):
        full_text = ""
        pdf_reader = PyPDF2.PdfReader(path)

        # Iterate through each page and extract text
        for page in pdf_reader.pages:
            full_text += page.extract_text() + "\n"
        for char in yaml_special_chars:
            full_text = full_text.replace(char, ' ')
        # Keep only the last 500 words of the extracted text
        words = full_text.split()
        last_500_words = ' '.join(words[-500:])

        try:
            # Parse the extracted text to structured data
            airport_info = fees2struct(last_500_words)

            # Serialize the Fees object to JSON
            airport_json = airport_info.json(indent=4)

            # Define the JSON file name
            json_filename = os.path.join(output_dir, f"{filename}_fees.json")

            # Write the JSON data to a file
            with open(json_filename, 'w', encoding='utf-8') as json_file:
                json_file.write(airport_json)

        except Exception as e:
            # Handle exceptions (e.g., log them)
            print(f"An error occurred while processing {filename}: {e}")
            # You can choose to continue to the next file, or take other actions

An error occurred while processing ad-lkbe_text_en.pdf: Failed to parse Fees from completion ```yaml
operating_fees:
  - type: Landing fees
    amount: 110,00
    info: Per aeroplane up to 1500 kg MTOW
  - type: Landing fees
    amount: 200,00
    info: Per aeroplane over 1500 kg MTOW
  - type: Landing fees
    amount: 70,00
    info: Ultralight aircraft, powered gliders
  - type: Landing fees
    amount: 50,00
    info: Glider
  - type: Special charge
    amount: 500,00
    info: 
      Value added tax is included in rates.
parking_fees:
  - up to 7 HR
    amount: 60,00
  - ultralight aircraft up to 7 HR
    amount: 35,00
  - from 7 HR up to 24 HR
    amount: 120,00
  - every other day
    amount: 60,00
  Value added tax is included in rates.
passanger_fees: []
other_information: The charge for dispatch of international flight is subject of a settlement with the aerodrome operator.
```. Got: mapping values are not allowed here
  in "<unicode string>", line 21, column 11:
        amoun
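
The sanitising loop and the last-500-words step in the cell above can also be written as two small helpers. This is a minimal sketch, not the notebook's code: `sanitize_for_yaml` and `last_words` are hypothetical names. One detail it makes visible is that `'\\n'` and `'\\t'` in `yaml_special_chars` are the two-character sequences backslash-n and backslash-t, so real newlines and tabs are left untouched.

```python
import re

# Same list as in the notebook cell
yaml_special_chars = [':', '{', '}', '[', ']', '#', '-', '!', '&', '*', '%',
                      '"', "'", '|', '>', '\\n', '\\t']

# One alternation built from the escaped entries (longest first)
_pattern = re.compile(
    '|'.join(re.escape(c) for c in sorted(yaml_special_chars, key=len, reverse=True))
)

def sanitize_for_yaml(text):
    # Replace each listed character or sequence with a single space
    return _pattern.sub(' ', text)

def last_words(text, n=500):
    # Keep only the final n whitespace-separated words
    return ' '.join(text.split()[-n:])

print(repr(sanitize_for_yaml('Landing fee: 110,00 [CZK]')))  # 'Landing fee  110,00  CZK '
print(repr(sanitize_for_yaml('a\nb')))    # 'a\nb'  (a real newline is kept)
print(repr(sanitize_for_yaml('a\\nb')))   # 'a b'   (literal backslash-n is replaced)
print(last_words('a b c d e', n=2))       # d e
```

Note that this only cleans the text sent to the model. The parse failure shown above comes from the YAML the model returned, which this step cannot prevent.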

Extracting airport data from MAP PDFs and saving it as JSON

In [138]:
yaml_special_chars = [':','{', '}', '[', ']', '#', '-', '!', '&', '*', '%', '"', "'", '|', '>', '\\n', '\\t']

# Ensure the target directory exists
output_dir = "../data/extracted_cz_vfr_manual/"
os.makedirs(output_dir, exist_ok=True)
# Iterate over each PDF file
for path in pdf_paths:
    # Extract the filename from the path
    filename = os.path.basename(path)

    # Process only files that end with 'map_en.pdf'
    if filename.endswith('map_en.pdf'):
        full_text = ""
        pdf_reader = PyPDF2.PdfReader(path)

        # Iterate through each page and extract text
        for page in pdf_reader.pages:
            full_text += page.extract_text() + "\n"
        # Replace each special character with a space
        for char in yaml_special_chars:
            full_text = full_text.replace(char, ' ')
        try:
            # Parse the extracted text to structured data
            airport_info = parse2struct(full_text)
            # Serialize the Airport object to JSON
            airport_json = airport_info.json(indent=4)

            # Define the JSON file name
            json_filename = os.path.join(output_dir, f"{airport_info.icao_code}_info.json")

            # Write the JSON data to a file
            with open(json_filename, 'w', encoding='utf-8') as json_file:
                json_file.write(airport_json)

        except Exception as e:
            # Handle exceptions (e.g., log them)
            print(f"An error occurred while processing {filename}: {e}")
            # You can choose to continue to the next file, or take other actions
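
The output of this cell shows two kinds of failure: a mapping entry with an empty value (`Group frequencyARP:` under `frequency`), and a completion that wraps the whole airport in a one-item list (`- icao_code: LKCS`). One possible cleanup, sketched here under the assumption that the YAML is loaded into plain Python objects before validation (which the notebook does not do, since `YamlOutputParser` parses and validates in one step), is to unwrap single-item lists and drop empty values. `drop_empty` and `normalize_parsed` are hypothetical helper names.

```python
def drop_empty(value):
    """Recursively remove None values from dicts and lists."""
    if isinstance(value, dict):
        return {k: drop_empty(v) for k, v in value.items() if v is not None}
    if isinstance(value, list):
        return [drop_empty(v) for v in value if v is not None]
    return value

def normalize_parsed(parsed):
    # Unwrap a one-item list such as the "- icao_code: LKCS" completion
    if isinstance(parsed, list) and len(parsed) == 1:
        parsed = parsed[0]
    return drop_empty(parsed)

# Shapes taken from the error output of this cell
cebin = {
    'icao_code': 'LKCEBI',
    'frequency': {'ČEBÍN RADIO': '125,830', 'Group frequencyARP': None},
}
print(normalize_parsed(cebin))
# {'icao_code': 'LKCEBI', 'frequency': {'ČEBÍN RADIO': '125,830'}}

print(normalize_parsed([{'icao_code': 'LKCS', 'name': 'České Budějovice'}]))
# {'icao_code': 'LKCS', 'name': 'České Budějovice'}
```

Whether the cleaned dictionaries then validate against `Airport` would still need to be checked against the real completions.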




An error occurred while processing ad-lkcebi_map_en.pdf: Failed to parse Airport from completion ```yaml
icao_code: LKCEBI
name: Čebín
full_address: Čebínskokuřimský spolek aviatiků z.s., 1. Května 1306/4, 664 34 Kuřim
city: Kuřim
airport_type: Private
runways:
  - RWY: 
    magnetic_direction: 
    rwy_dimensions: 
traffic: VFR
geocoordinates:
  - latitude: 49° 19  08  N
    longitude: 16° 29  48  E
elevation_ft: 1004
elevation_m: 306
frequency: 
  ČEBÍN RADIO: 125,830
  Group frequencyARP: 
operating_hours: Year round
fuel: NIL
```. Got: 2 validation errors for Airport
frequency -> Group frequencyARP
  none is not an allowed value (type=type_error.none.not_allowed)
frequency
  str type expected (type=type_error.str)
An error occurred while processing ad-lkcs_map_en.pdf: Failed to parse Airport from completion ```yaml
- icao_code: LKCS
  name: České Budějovice
  full_address: U Zimního stadionu 1952/2, 370 01 ČESKÉ BUDĚJOVICE 7
  city: České Budějovice
  airport_type: Public domestic 

Loading PDFs into chunks and attaching metadata

In [6]:
chunked_texts_with_tags = []

# Iterate over each PDF file
for path in pdf_paths:
    full_text = ""
    pdf_reader = PyPDF2.PdfReader(path)
    filename = os.path.basename(path)
    pdf_tags = extract_cz_tags_from_filename(filename)

    #if filename.endswith('text_en.pdf'):
    if filename.endswith('.pdf'):
        for page in pdf_reader.pages:
            # Extract each page once; skip pages with fewer than 200 characters of text
            page_text = page.extract_text() or ""
            if len(page_text) < 200:
                continue
            # Turn line breaks into ". " so the splitter's ". " separator can use them
            full_text += page_text.replace('\n', '. ')

        chunks = text_splitter.split_text(full_text)
        # Add metadata to each chunk: ICAO code and airport name
        for chunk in chunks:
            # NOTE: airport_name is hard-coded to "Letňany" for every file
            chunked_texts_with_tags.append(
                Document(page_content=chunk, metadata={"airport": pdf_tags, "airport_name": "Letňany"})
            )
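
To see what `chunk_size` and `chunk_overlap` mean, here is a deliberately simplified sketch. It is a fixed sliding window over a list of tokens, not the algorithm `RecursiveCharacterTextSplitter` uses (that one splits on the configured separators and merges the pieces), and whitespace-separated words stand in for tiktoken tokens. `sliding_window_chunks` is a hypothetical helper.

```python
def sliding_window_chunks(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of chunk_size that share chunk_overlap tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once the current window reaches the end of the input
        if start + chunk_size >= len(tokens):
            break
    return chunks

words = "a b c d e f g h i j".split()
for chunk in sliding_window_chunks(words, chunk_size=4, chunk_overlap=1):
    print(chunk)
# ['a', 'b', 'c', 'd']
# ['d', 'e', 'f', 'g']
# ['g', 'h', 'i', 'j']
```

With the notebook's settings (512 and 64), consecutive windows in this simplified scheme would share 64 tokens, which helps keep a sentence that straddles a boundary retrievable from either side.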

In [11]:
doc

Document(page_content='foo', metadata={'source': '1'})

In [7]:
chunked_texts_with_tags

['LKBA, 28 FEB 19 (2) Břeclav VFR-AD-LKBA-VOC. VFR příručka - Česká republika. LKBA Břeclav. §Veřejné vnitrostátní letištěVFR den, výsadková činnost, navijákové. vzlety. Břeclav RADIO. 119,655ARP:  48° 47\' 27" N, 16° 53\' 33" E. 3,5 km N Břeclav. ELEV:  525 ft / 160 m. Okruh:  1500 ft / 457 m AMSL. 3° E/2005. ! Při vzletech z RWY 08 a přistáních na RWY 26 křižuje dráhu letadla silnice s provozem řízeným. světelnou signalizací ovládanou dispečerem Poskytování informací známému provozu. Přesto není. zaručeno respektování světelné signalizace všemi účastníky silničního provozu. Proto komunikaci v. blízkosti prahu RWY 26 je nutné v rámci bezpečnosti přelétávat v minimální výšce 15 m od nejnižší části. letadla nebo vlečeného předmětu.05 NOV 20 (1) Břeclav VFR-AD-LKBA-ADC. VFR příručka - Česká republika. LKBA Břeclav. Břeclav RADIO. 119,655RWYMagnetický. směrRozměry. RWYÚnosnost TORA TODA ASDA LDA. 08 080° 740 x 60 5700 kg / 0.7 MPa 740 770 740 740. 26 260° 740 x 60 5700 kg / 0.7 MPa 740 77

In [44]:
import json
json_file_path = '../data/extracted_cz_vfr_manual/LKLT_info.json'

# Open the file and load the JSON data
with open(json_file_path, 'r', encoding='utf-8') as file:
    data = json.load(file)

##### 3. Embedding Creation

In [14]:
vectorstore = Chroma(embedding_function=OpenAIEmbeddings(), persist_directory="../data/vectordbtest/")
vfr_db = vectorstore.from_documents([doc], embed, persist_directory="../data/vectordbtest/")

##### 4. Context-based chatbot

In [155]:
vectorstore = Chroma(embedding_function=OpenAIEmbeddings(), persist_directory="../data/vectordb/")

In [168]:
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI


qa_chain = RetrievalQA.from_chain_type(
    llm=OpenAI(),
    retriever=vectorstore.as_retriever(search_kwargs={'k': 3}),
    return_source_documents=True
)

# We can now execute queries against our Q&A chain
# (the query is Czech for "Opening hours of LKHK")
result = qa_chain({'query': 'Otevírací doba LKHK'})
print(result['result'])

 

The opening hours of LKHK are 0800 - 1500 on Saturdays, Sundays, and holidays from 15 April to 15 October. However, please note that advance notice and approval from the airport operator is required for any arrivals or departures outside of these hours.
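
Conceptually, the retriever with `search_kwargs={'k': 3}` returns the three stored chunks whose embeddings are closest to the query embedding. The toy sketch below illustrates that idea with cosine similarity over hand-made 2-D vectors. The vectors, texts and helper names are all hypothetical, and Chroma uses its own index and distance configuration, so this is an illustration of top-k retrieval rather than of Chroma's internals.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, indexed, k=3):
    # indexed is a list of (text, vector) pairs
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in indexed]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]

# Hypothetical "embeddings" for four chunks
indexed = [
    ("opening hours", [1.0, 0.0]),
    ("runway dimensions", [0.0, 1.0]),
    ("fuel", [0.7, 0.7]),
    ("fees", [-1.0, 0.0]),
]

print(top_k([1.0, 0.1], indexed, k=3))
# ['opening hours', 'fuel', 'runway dimensions']
```

The retrieved chunks are then passed to the LLM as context, which is how the answer above could be produced from the stored manual text.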
