### RAG Flow - Additionally Chunked

##### Initialize environment

In [146]:
import json

from elasticsearch import Elasticsearch
from openai import OpenAI
from sentence_transformers import SentenceTransformer

from helper_functions import encoding
from helper_functions import indexing
from helper_functions import prompting
from helper_functions import searching

In [147]:
with open("data/documents_chunked_3000.json") as f:
    documents = json.load(f)

Setup:
- Documents:
    - API response, chunked by table and by a 3000-character limit
- Embedding:
    - SentenceTransformer all-mpnet-base-v2
- Search:
    - Elasticsearch without boosting (i.e. base source without any changes)
    - Similarity measure: cosine similarity
- 2 LLMs:
    - Ollama moondream - open-source model that can run on a local machine's CPU
    - OpenAI's GPT-3.5 Turbo via API
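For reference, the cosine similarity named in the setup can be sketched in plain Python. This is only an illustration on toy vectors; Elasticsearch computes the score internally, and the function name `cosine_similarity` and the sample vectors are made up for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors (assumption for illustration only);
# real all-mpnet-base-v2 embeddings have 768 dimensions.
query_vec = [1.0, 0.0, 1.0]
doc_same_direction = [2.0, 0.0, 2.0]
doc_orthogonal = [0.0, 3.0, 0.0]

print(cosine_similarity(query_vec, doc_same_direction))
print(cosine_similarity(query_vec, doc_orthogonal))
```

Because the measure depends only on direction, a chunk whose embedding points the same way as the query scores highest regardless of vector length.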

In [148]:
embedding_model = SentenceTransformer("all-mpnet-base-v2")

llm_client = OpenAI(base_url="http://localhost:11434/v1/", api_key="ollama")
model_name = "moondream"

es_client = Elasticsearch("http://localhost:9200")
similarity = "cosine"
base_source = ["reporting_period", "filing_type", "section", "text", "id"]

In [149]:
documents = encoding.encode_documents(documents, embedding_model)

100%|██████████| 181/181 [00:59<00:00,  3.02it/s]


In [150]:
dims = len(documents[0]["text_vector"])

index_name = "sec-filing-index"

indexing.index_documents(
    documents=documents,
    es_client=es_client,
    index_name=index_name,
    dims=dims,
    similarity=similarity,
)

100%|██████████| 181/181 [00:02<00:00, 75.41it/s]
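The body of `indexing.index_documents` is not shown in this notebook. As a rough sketch of what an index definition for this setup might look like, the function below builds a mapping with two `dense_vector` fields. The function name `build_index_body`, the `company` field, the field types, and the shard settings are all assumptions, not the helper's actual implementation.

```python
def build_index_body(dims, similarity):
    """Build a hypothetical index definition with two dense_vector fields."""
    vector_field = {
        "type": "dense_vector",
        "dims": dims,
        "index": True,
        "similarity": similarity,
    }
    return {
        # Single-node settings, assumed for a local Elasticsearch instance.
        "settings": {"number_of_shards": 1, "number_of_replicas": 0},
        "mappings": {
            "properties": {
                "company": {"type": "keyword"},  # assumed filter field
                "reporting_period": {"type": "keyword"},
                "filing_type": {"type": "keyword"},
                "section": {"type": "text"},
                "text": {"type": "text"},
                "id": {"type": "keyword"},
                "text_vector": dict(vector_field),
                "non_text_vector": dict(vector_field),
            }
        },
    }


# 768 is the embedding size of all-mpnet-base-v2.
index_body = build_index_body(dims=768, similarity="cosine")
print(index_body["mappings"]["properties"]["text_vector"])
```

The `dims` value has to match the embedding length, which is why the cell above reads it from the first encoded document instead of hard-coding it.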


-----

##### Generate responses - Moondream

In [151]:
query = "What was the revenue for the quarter ending 2023-06-30?"

Plain text query

In [152]:
search_query_text = searching.elastic_search_text_query(
    query=query,
    company="pltr",
    base_source=base_source,
)

answer_text = prompting.rag(
    query=query,
    search_query=search_query_text,
    search_client=es_client,
    llm_client=llm_client,
    retrieval_model=model_name,
)

In [153]:
answer_text

''

Vectorized text data

In [154]:
search_query_text_vector = searching.elastic_search_knn_query(
    embedding_model=embedding_model,
    query=query,
    filter_company="pltr",
    source=base_source,
    vector_field="text_vector",
)

answer_text_vector = prompting.rag(
    query=query,
    search_query=search_query_text_vector,
    search_client=es_client,
    llm_client=llm_client,
    retrieval_model=model_name,
)

In [155]:
answer_text_vector

'\n 1. Business operating costs for the year ending December 31, 2020 are $72.0 million less than a year ago compared to a year prior to the COVID crisis resulting in over two years of operational decline during this period of the year due to the pandemic which affected nearly all aspects of daily human lives including employment and general businesses activities that make up the economy contributing to its decline at the moment.\n 2. Other than cost savings, the biggest difference between the year 2020 and 2019 was an increase in sales of $564,895 during 2020 as opposed to a decrease due to the COVID crisis being the reason for it. It is evident that despite considerable restrictions on business activities imposed by public health experts over the past two years, sales did see an upward trend in 2020 with most businesses continuing or improving their operations and economic growth amid challenges posed by the pandemic, such as increased adoption of remote work arrangements which allow
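The internals of `searching.elastic_search_knn_query` are not shown here. A minimal sketch of the kind of request body a filtered kNN search could use is given below. The function name `build_knn_search_body`, the `company` filter field, and the `k` and `num_candidates` defaults are assumptions for illustration.

```python
def build_knn_search_body(query_vector, company, source, vector_field,
                          k=5, num_candidates=100):
    """Build a hypothetical Elasticsearch kNN search body with a company filter."""
    return {
        "knn": {
            "field": vector_field,
            "query_vector": query_vector,
            "k": k,  # number of nearest neighbours to return
            "num_candidates": num_candidates,  # candidates considered per shard
            "filter": {"term": {"company": company}},  # assumed field name
        },
        "_source": source,
    }


# Toy query vector; in the notebook it would come from embedding_model.encode(query).
body = build_knn_search_body(
    query_vector=[0.1, 0.2, 0.3],
    company="pltr",
    source=["reporting_period", "filing_type", "section", "text", "id"],
    vector_field="text_vector",
)
print(body["knn"]["filter"])
```

Applying the filter inside the `knn` clause restricts the neighbour search itself to the chosen company, so the `k` results are not spent on other companies' filings.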

Vectorized text data + vectorized attribute data

In [156]:
search_query_combined_vector = searching.elastic_search_combined_query(
    embedding_model=embedding_model,
    query=query,
    filter_company="pltr",
    source=base_source,
    vector_fields=["text_vector", "non_text_vector"],
)

answer_combined = prompting.rag(
    query=query,
    search_query=search_query_combined_vector,
    search_client=es_client,
    llm_client=llm_client,
    retrieval_model=model_name,
)

In [157]:
answer_combined

'______________________________________________________________'

Thoughts:
- The answers are still nowhere near correct, but the text-vector-based semantic search now returns something at least remotely relevant
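To separate retrieval problems from generation problems, it can help to look at the retrieved hits directly before they reach the LLM. The sketch below assumes the standard Elasticsearch response shape (`hits.hits[*]._source`); the function name `hits_containing` and the sample response are invented for illustration and contain placeholder text only.

```python
def hits_containing(response, needle, field="text"):
    """Return the ids of hits whose `field` contains `needle` (case-insensitive)."""
    matches = []
    for hit in response["hits"]["hits"]:
        source = hit.get("_source", {})
        if needle.lower() in str(source.get(field, "")).lower():
            matches.append(source.get("id"))
    return matches


# Fake search response with placeholder content (not real filing data).
fake_response = {
    "hits": {
        "hits": [
            {"_source": {"id": "a1", "reporting_period": "2023-06-30",
                         "text": "placeholder table mentioning Revenue"}},
            {"_source": {"id": "b2", "reporting_period": "2020-12-31",
                         "text": "placeholder text about risk factors"}},
        ]
    }
}

print(hits_containing(fake_response, "revenue"))
print(hits_containing(fake_response, "2023-06-30", field="reporting_period"))
```

If neither check returns any ids for a real query, the LLM never saw the relevant chunk, and prompt changes alone are unlikely to fix the answer.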

----

##### Generate responses - ChatGPT 3.5

In [158]:
llm_client_chatgpt = OpenAI()
model_name_chatgpt = "gpt-3.5-turbo"

In [159]:
answer_text_gpt = prompting.rag(
    query=query,
    search_query=search_query_text,
    search_client=es_client,
    llm_client=llm_client_chatgpt,
    retrieval_model=model_name_chatgpt,
    index_name=index_name,
)

In [160]:
answer_text_gpt

'Based on the context, the revenue for the quarter ending 2023-06-30 was not provided.'

In [161]:
answer_text_vector_gpt = prompting.rag(
    query=query,
    search_query=search_query_text_vector,
    search_client=es_client,
    llm_client=llm_client_chatgpt,
    retrieval_model=model_name_chatgpt,
    index_name=index_name,
)

In [162]:
answer_text_vector_gpt

'The revenue for the quarter ending 2023-06-30 was $2,225,012 million.'

In [163]:
answer_combined_gpt = prompting.rag(
    query=query,
    search_query=search_query_combined_vector,
    search_client=es_client,
    llm_client=llm_client_chatgpt,
    retrieval_model=model_name_chatgpt,
    index_name=index_name,
)

In [164]:
answer_combined_gpt

'The revenue for the quarter ending 2023-06-30 was $2,225,012 in thousands.'

---

---

---

##### Initialize environment

Retrying the flow, this time with documents chunked at a 500-character limit

In [85]:
with open("data/documents_chunked_500.json") as f:
    documents_500 = json.load(f)
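The chunking script that produced this file is not part of the notebook. A minimal sketch of character-limit chunking is shown below, assuming chunks break on whitespace. It covers only the character limit, not the per-table split, and the function name `chunk_text` and the sample text are hypothetical.

```python
def chunk_text(text, limit):
    """Split text into chunks of at most `limit` characters, breaking on whitespace."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    chunks, current = [], ""
    for word in text.split():
        # Hard-split any single token that is longer than the limit.
        while len(word) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(word[:limit])
            word = word[limit:]
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= limit:
            current = candidate
        else:
            # Adding this word would overflow, so close the current chunk.
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


# Placeholder text: 200 repetitions of one word, not real filing content.
sample = "Revenue " * 200
chunks_500 = chunk_text(sample, 500)
chunks_3000 = chunk_text(sample, 3000)
print(len(chunks_500), len(chunks_3000))
```

A smaller limit yields more, shorter chunks, which is consistent with the document count rising from 181 to 669 between the two runs.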

Setup:
- Documents:
    - API response, chunked by table and by a 500-character limit
- Embedding:
    - SentenceTransformer all-mpnet-base-v2
- Search:
    - Elasticsearch without boosting (i.e. base source without any changes)
    - Similarity measure: cosine similarity
- 2 LLMs:
    - Ollama moondream - open-source model that can run on a local machine's CPU
    - OpenAI's GPT-3.5 Turbo via API

In [86]:
embedding_model = SentenceTransformer("all-mpnet-base-v2")

llm_client = OpenAI(base_url="http://localhost:11434/v1/", api_key="ollama")
model_name = "moondream"

es_client = Elasticsearch("http://localhost:9200")
similarity = "cosine"
base_source = ["reporting_period", "filing_type", "section", "text", "id"]

In [88]:
documents_500 = encoding.encode_documents(documents_500, embedding_model)

100%|██████████| 669/669 [03:43<00:00,  2.99it/s]


In [131]:
dims = len(documents_500[0]["text_vector"])

index_name_500 = "sec-filing-index-500"

indexing.index_documents(
    documents=documents_500,
    es_client=es_client,
    index_name=index_name_500,
    dims=dims,
    similarity=similarity,
)

100%|██████████| 669/669 [00:08<00:00, 77.53it/s]


-----

##### Generate responses - Moondream

In [132]:
query = "What was the revenue for the quarter ending 2023-06-30?"

Plain text query

In [133]:
search_query_text = searching.elastic_search_text_query(
    query=query,
    company="pltr",
    base_source=base_source,
)

answer_text = prompting.rag(
    query=query,
    search_query=search_query_text,
    search_client=es_client,
    llm_client=llm_client,
    retrieval_model=model_name,
    index_name=index_name_500,
)

In [134]:
answer_text

''

Vectorized text data

In [135]:
search_query_text_vector = searching.elastic_search_knn_query(
    embedding_model=embedding_model,
    query=query,
    filter_company="pltr",
    source=base_source,
    vector_field="text_vector",
)

answer_text_vector = prompting.rag(
    query=query,
    search_query=search_query_text_vector,
    search_client=es_client,
    llm_client=llm_client,
    retrieval_model=model_name,
    index_name=index_name_500,
)

In [136]:
answer_text_vector

'ery  5'

Vectorized text data + vectorized attribute data

In [137]:
search_query_combined_vector = searching.elastic_search_combined_query(
    embedding_model=embedding_model,
    query=query,
    filter_company="pltr",
    source=base_source,
    vector_fields=["text_vector", "non_text_vector"],
)

answer_combined = prompting.rag(
    query=query,
    search_query=search_query_combined_vector,
    search_client=es_client,
    llm_client=llm_client,
    retrieval_model=model_name,
    index_name=index_name_500,
)

In [138]:
answer_combined

'\nThere is more text to answer this question but I can tell you that the two main points in question are not only from different texts but they contain two numbers that say $831,047 and $2,599,540 which might be revenue or loss figure.'

----

##### Generate responses - ChatGPT 3.5

In [139]:
llm_client_chatgpt = OpenAI()
model_name_chatgpt = "gpt-3.5-turbo"

In [None]:
answer_text_gpt = prompting.rag(
    query=query,
    search_query=search_query_text,
    search_client=es_client,
    llm_client=llm_client_chatgpt,
    retrieval_model=model_name_chatgpt,
    index_name=index_name_500,
)

In [141]:
answer_text_gpt

'The revenue for the quarter ending 2023-06-30 is not explicitly mentioned in the provided context.'

In [142]:
answer_text_vector_gpt = prompting.rag(
    query=query,
    search_query=search_query_text_vector,
    search_client=es_client,
    llm_client=llm_client_chatgpt,
    retrieval_model=model_name_chatgpt,
    index_name=index_name_500,
)

In [143]:
answer_text_vector_gpt

'The revenue for the quarter ending 2023-06-30 was not directly provided in the context. The total revenue for the year ending 2023-12-31 was $2,225,012 million.'

In [144]:
answer_combined_gpt = prompting.rag(
    query=query,
    search_query=search_query_combined_vector,
    search_client=es_client,
    llm_client=llm_client_chatgpt,
    retrieval_model=model_name_chatgpt,
    index_name=index_name_500,
)

In [145]:
answer_combined_gpt

'The revenue for the quarter ending 2023-06-30 was not directly provided in the context.'

---

Note:
- The responses suggest that narrowing the chunking window moves the RAG flow a step in the right direction

- The main issue is still that the search isn't returning the relevant information
    - Either the financial data needs to be parsed in a more structured manner, so that the search results contain the information needed
    - Or the LLM needs to at least receive a large contextual chunk that holds the information, along with proper instructions for parsing it

- Note that the $2,225,012 figure that keeps coming up is Palantir's actual FY2023 revenue (but it is in thousands, not millions)
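The unit mix-up in the last point is easy to quantify. The sketch below scales the reported figure under both readings; the helper name `to_dollars` and the unit labels are assumptions made for this example.

```python
UNIT_MULTIPLIERS = {"thousands": 1_000, "millions": 1_000_000}


def to_dollars(reported_value, unit):
    """Convert a reported figure to whole dollars for the given unit label."""
    return reported_value * UNIT_MULTIPLIERS[unit]


reported_value = 2_225_012  # figure as it appears in the model answers

as_thousands = to_dollars(reported_value, "thousands")
as_millions = to_dollars(reported_value, "millions")

print(f"read as thousands: ${as_thousands:,}")
print(f"read as millions:  ${as_millions:,}")
print(f"overstatement factor: {as_millions // as_thousands}x")
```

Read in thousands, the figure is about $2.2 billion; read in millions, it becomes about $2.2 trillion, a thousandfold overstatement. Passing the table's unit header to the LLM together with the figure would be one way to avoid this kind of misreading.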