### RAG Flow - Baseline

##### Initialize environment

In [1]:
import json

from elasticsearch import Elasticsearch
from openai import OpenAI
from sentence_transformers import SentenceTransformer

from helper_functions import encoding
from helper_functions import indexing
from helper_functions import prompting
from helper_functions import searching



In [8]:
with open("data/documents_with_ids.json") as f:
    documents = json.load(f)

Setup:
- Documents:
    - API response, chunked by entire section (e.g. financial statements, management's discussion and analysis)
- Embedding:
    - SentenceTransformer all-mpnet-base-v2
- Search:
    - Elasticsearch without boosting (i.e. base source without any changes)
    - Similarity measure: cosine similarity
- LLM:
    - Ollama moondream: a small open-source model that can run on a local machine's CPU
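As a reminder of what the similarity measure in the setup computes, here is a minimal pure-Python sketch of cosine similarity. The function name `cosine_similarity` is illustrative only; Elasticsearch computes this internally and the notebook never calls such a function.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # A zero vector has no direction; treat its similarity as 0 by convention.
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

The score depends only on direction, not magnitude, which is why it is a common choice for comparing sentence embeddings.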

In [9]:
embedding_model = SentenceTransformer("all-mpnet-base-v2")

llm_client = OpenAI(base_url="http://localhost:11434/v1/", api_key="ollama")
model_name = "moondream"

es_client = Elasticsearch("http://localhost:9200")
similarity = "cosine"
base_source = ["reporting_period", "filing_type", "section", "text", "id"]

In [10]:
documents = encoding.encode_documents(documents, embedding_model)

100%|██████████| 45/45 [00:18<00:00,  2.44it/s]


In [11]:
dims = len(documents[0]["text_vector"])

index_name = "sec-filing-index"

indexing.index_documents(
    documents=documents,
    es_client=es_client,
    index_name=index_name,
    dims=dims,
    similarity=similarity,
)

100%|██████████| 45/45 [00:02<00:00, 21.68it/s]
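The internals of `indexing.index_documents` are not shown here. As a rough sketch only, an index definition for this setup might look like the following. The function name `build_index_settings`, the field types, and the `company` field are assumptions; only the `dense_vector` options (`dims`, `index`, `similarity`) follow the documented Elasticsearch mapping.

```python
def build_index_settings(dims, similarity="cosine"):
    """Hypothetical index definition; field names and types are assumed."""
    return {
        "settings": {"number_of_shards": 1, "number_of_replicas": 0},
        "mappings": {
            "properties": {
                "id": {"type": "keyword"},
                "company": {"type": "keyword"},  # assumed filter field
                "reporting_period": {"type": "keyword"},
                "filing_type": {"type": "keyword"},
                "section": {"type": "text"},
                "text": {"type": "text"},
                # Embedding of the chunk text, indexed for kNN search.
                "text_vector": {
                    "type": "dense_vector",
                    "dims": dims,
                    "index": True,
                    "similarity": similarity,
                },
            }
        },
    }
```

With `all-mpnet-base-v2`, `dims` would be 768, matching the length of each `text_vector` computed above.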


-----

##### Generate responses

In [19]:
query = "What was the revenue for the quarter ending 2023-06-30"

Plain text query

In [20]:
search_query_text = searching.elastic_search_text_query(
    query=query,
    company="pltr",
    base_source=base_source,
)

answer_text = prompting.rag(
    query=query,
    search_query=search_query_text,
    search_client=es_client,
    llm_client=llm_client,
    retrieval_model=model_name,
)

In [21]:
answer_text

''
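For context, a plain-text query of this kind could be expressed in the Elasticsearch query DSL roughly as below. This is a sketch, not the actual body of `searching.elastic_search_text_query`: the function name `build_text_query`, the searched fields, the `company` filter field, and the result size are all assumptions.

```python
def build_text_query(query, company, source, size=5):
    """Hypothetical keyword search body: full-text match plus company filter."""
    return {
        "size": size,
        "query": {
            "bool": {
                # Score documents by full-text match on the assumed fields.
                "must": {
                    "multi_match": {
                        "query": query,
                        "fields": ["text", "section"],
                        "type": "best_fields",
                    }
                },
                # Restrict to one company without affecting the score.
                "filter": {"term": {"company": company}},
            }
        },
        "_source": source,
    }
```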

Vectorized text data

In [22]:
search_query_text_vector = searching.elastic_search_knn_query(
    embedding_model=embedding_model,
    query=query,
    filter_company="pltr",
    source=base_source,
    vector_field="text_vector",
)

answer_text_vector = prompting.rag(
    query=query,
    search_query=search_query_text_vector,
    search_client=es_client,
    llm_client=llm_client,
    retrieval_model=model_name,
)

In [23]:
answer_text_vector

'\n Question 4.4 from the paragraph provided: General and Administrative expenses decreased by ... $72.0 million, or 12%.'
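The vector search above can be pictured as an Elasticsearch kNN request. The sketch below assumes the query has already been embedded (for example with `embedding_model.encode(query)`) and only builds the request body. The function name `build_knn_query`, the `company` filter field, and the `k` and `num_candidates` defaults are assumptions, not the helper's actual values.

```python
def build_knn_query(query_vector, company, source,
                    vector_field="text_vector", k=5, num_candidates=1000):
    """Hypothetical kNN search body in the Elasticsearch 8.x request shape."""
    return {
        "knn": {
            "field": vector_field,
            "query_vector": list(query_vector),
            "k": k,                            # nearest neighbours to return
            "num_candidates": num_candidates,  # candidates examined per shard
            "filter": {"term": {"company": company}},
        },
        "_source": source,
    }
```

Raising `num_candidates` generally trades speed for recall; `k` controls how many chunks end up in the prompt context.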

Vectorized text data + vectorized attribute data

In [24]:
search_query_combined_vector = searching.elastic_search_combined_query(
    embedding_model=embedding_model,
    query=query,
    filter_company="pltr",
    source=base_source,
    vector_fields=["text_vector", "non_text_vector"],
)

answer_combined = prompting.rag(
    query=query,
    search_query=search_query_combined_vector,
    search_client=es_client,
    llm_client=llm_client,
    retrieval_model=model_name,
)

In [25]:
answer_combined

"\n\n1. Operating segments of a company being consolidated together into one reportable accounting segment based on their geographical region. The geographical regions are divided into three main divisions and then these three are combined into one single division under an analysis based on the specific attributes shared by all such areas within that division. The primary factors to be considered would include economic climate along with local government policies, regulatory framework enforced at the state level, and tax rates at the country's national level for a comprehensive outlook of where these segments would potentially overlap or diverge economically if they were to report in different regions separately under their respective regions' geographic divisions."

---

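Thoughts:
- None of the three responses is relevant to the query: the plain-text search produced an empty answer, and both vector searches produced answers about unrelated passages
- As a next optimization step, chunk the data further so that searches return more relevant contexts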
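One possible way to chunk further is a fixed-size word window with overlap, sketched below. The function name `chunk_text` and the default sizes are illustrative assumptions and have not been tuned on this data; splitting on sentence or table boundaries may suit financial filings better.

```python
def chunk_text(text, max_words=200, overlap=40):
    """Split text into overlapping word windows of at most max_words words."""
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be in [0, max_words)")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once the current window reaches the end of the text.
        if start + max_words >= len(words):
            break
    return chunks
```

Each chunk would then be encoded and indexed as its own document, carrying over metadata such as `section` and `reporting_period` from its parent.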