# ARK Disruptive Innovation Insights

### Introduction

#### Quick links:

[Go to Coding Section](#CODING-SECTION) - See how each of these approaches is implemented.

[Go to Example Output Section](#OUTPUT-ENHANCEMENTS) - See example responses from each pipeline side-by-side.

[Go to Evaluation Section](#EVALUATION-SECTION) - See the results of the analysis for each enhancement.

[Go to TIFIN Question Evaluation Section](#TIFIN-EVALUATION-QUESTION-PROCESSING) - See the final evaluation using the TIFIN questions as a validation set.

[Go to Future Efforts](#FUTURE-EFFORTS) - Thoughts on what could be done in future with more time and resources.

In the following Notebook, I will showcase some advanced approaches to RAG.  I will address all of the tasks mentioned in the TIFIN exercise, as well as other common features that provide a more robust system.

#### Key aspects of the baseline implementation:
The exercise asks me to start with a "baseline" RAG approach that uses a more "basic" approach to chunking/text splitting.  I take this notion further and provide a true baseline RAG pipeline: the basic text splitter; the most common vectorization approach (semantic), using OpenAI's older text-embedding-ada-002 embedding model; a widely available vector database that is sufficient for demonstrations (ChromaDB); the most common distance metric (cosine similarity); and an older LLM (GPT-3.5 Turbo). Here are the requested strategy descriptions:

- <b>Ingestion (also commonly called indexing)</b>: We use PyPDF2's PdfReader to extract the text and LangChain's Document class to hold the content in a form that is efficient to access and manipulate in the RAG pipeline. This data is passed to the most basic text splitter implementation in LangChain, the Character Text Splitter, whose split_documents function chunks it prior to vectorization and returns the chunks as LangChain Document objects. The documents are vectorized by OpenAI's text-embedding-ada-002 embedding service and stored in a Chroma vector database for later retrieval.

- <b>Information retrieval</b>: Using LangChain's vectorstore object, which wraps our Chroma vector database, we initialize a retriever with the as_retriever function. The retriever fetches documents from the Chroma database using the default cosine-similarity-based vector search mechanism.  Much of the retrieval process is abstracted by the LangChain vectorstore and retriever classes, but under the hood it uses the OpenAI embedding function we pass in to vectorize the query, performs a vector search with cosine similarity as its distance metric, and returns the documents whose vectors matched most closely.

- <b>Answer synthesis (also commonly called generation)</b>: In this stage, we build a prompt template and a small helper function for formatting documents as they pass through our LangChain chain.  We then use LCEL to define the chain: a mini-chain first retrieves and formats the documents, and that context is used to "hydrate" the prompt template with the relevant retrieved data. At this point, the prompt is complete with dynamic context and can be passed to the LLM for answer generation.  The last step uses the StrOutputParser class to extract only the text we need from the LLM response.

- <b>Performance evaluation</b>: Evaluation is performed using ragas, a Python library focused specifically on evaluating RAG pipeline performance. We will review this evaluation process in much more depth as we step through it for each of the enhancements.  We synthesize our own question/answer pairs to serve as the "testing" data for the enhancements we make.  We then use the evaluation questions provided by TIFIN as the final "validation" set to determine the final performance of our enhancements.

- <b>Bonus security feature</b>: I am assuming this is a consumer-facing application, so I added a security feature common for this type of application. Many other security features are typically applied on the front end to prevent things like SQL injection, which front-end developers usually have strong defenses against.  But generative AI has brought a new line of attacks that target the LLM used in the RAG pipeline directly.  This security feature targets a type of attack called 'prompt injection' or 'jailbreaking', and should significantly reduce the exposure to this type of attack.  All RAG pipeline related functions in this demonstration include this security feature.  Of course, further improvements can be made to protect against other types of generative AI security threats, and the battle against bad actors is a never-ending process.

#### Summary of Baseline RAG Pipeline components:
- Naive RAG (basic implementation of RAG pipeline)
- Search mechanism for retriever: dense vector/semantic search
- chunking strategy: CharacterTextSplitter (basic text splitter)
- embedding model: OpenAI text-embedding-ada-002
- LLM: ChatGPT-3.5
- Distance metric for vector search: cosine similarity
- reranking strategy: none

See the notes throughout the code explaining the thoughts and considerations for each step in this implementation.

#### Enhancement Evaluations

Quick reference of the enhancements to evaluate:

1. <b>Advanced "chunking" technique</b> - Introduce the Recursive Character Text Splitter 
2. <b>Hybrid</b> - Combines the existing semantic search with keyword search and uses Reciprocal Rank Fusion reranking to rerank the results. Vector split: 50/50 dense/sparse.
3. <b>GPT-4o</b> - Upgraded LLM (GPT-4o). Vector split: 50/50 dense/sparse.
4. <b>Hybrid 30/70</b> - Same settings as the Hybrid approach (#2), but with the weights adjusted to put more focus on the keyword/sparse vectors. Vector split: 30/70 dense/sparse.
5. <b>Embedding Open Source Alt</b> - test against OpenAI's advanced embedding model, text-embedding-3-large. 
6. <b>Query expansion</b> - This introduces a common approach to try to improve the coherence, faithfulness, and relevance of answers.
7. <b>Ultimate combination</b> - This combines the other enhancements that were determined to be improvements in isolation, to see how they perform together.
8. <b>Final Evaluation on TIFIN Questions</b> - This evaluation generates a new baseline evaluation based on the TIFIN evaluation question using just the reference-free evaluation metrics (which do not require ground truth).  And then we will run an evaluation of the TIFIN questions using the "ultimate combo" configuration of enhancements, compare this to the baseline results, and analyze the compared results. This provides a final validation analysis of how our enhancements performed on answering the TIFIN evaluation questions.

[Go to Coding Section](#CODING-SECTION) - See how each of these approaches is implemented.

[Go to Example Output Section](#OUTPUT-ENHANCEMENTS) - See example responses from each pipeline side-by-side.

[Go to Evaluation Section](#EVALUATION-SECTION) - See the results of the analysis for each enhancement.

[Go to Future Efforts](#FUTURE-EFFORTS) - Thoughts on what could be done in future with more time and resources.

# CODING SECTION

[Back to top](#ARK-Disruptive-Innovation-Insights)

### IMPORTS, HELPER FUNCTIONS

In [None]:
# First uninstall all related packages to start fresh
%pip uninstall -y langchain-core langchain-openai langchain-experimental langchain-community langchain chromadb ragas sentence-transformers streamlit pillow langchain-text-splitters typing-extensions

# Install core dependencies first
# Quote version specifiers so the shell does not treat > and < as redirections
%pip install "torch>=1.11.0"
%pip install "transformers>=4.38.0,<5.0.0"
%pip install "networkx>=2.8"
%pip install "pillow>=7.1.0,<11.0.0"

# First install the langchain ecosystem with specific versions
%pip install --no-deps langchain-core==0.3.6
%pip install --no-deps langchain-community==0.3.1
%pip install --no-deps langchain-openai==0.2.1
%pip install --no-deps langchain==0.3.1
%pip install --no-deps langchain-experimental==0.3.2

# Now install their dependencies
%pip install langchain-core==0.3.6
%pip install langchain-community==0.3.1
%pip install langchain-openai==0.2.1
%pip install langchain==0.3.1
%pip install langchain-experimental==0.3.2

# Now install remaining packages
%pip install chromadb==0.5.11
%pip install python-dotenv==1.0.1
%pip install PyPDF2==3.0.1
%pip install rank_bm25==0.2.2
%pip install tqdm==4.66.5
%pip install matplotlib==3.9.2
%pip install openai==1.52.2
%pip install sentence-transformers==3.1.1
%pip install scikit-image==0.23.2
%pip install streamlit==1.33.0
%pip install typing-extensions==4.12.2
%pip install nltk
%pip install pypdf
%pip install pypdfium2
%pip install timm
%pip install langchainhub

# Install ragas last since it has conflicting dependencies
%pip install --no-deps ragas==0.1.20

# Verify versions
%pip list | grep -E "langchain|core|community|experimental|openai|ragas"

In [1]:
import subprocess
import sys
import os
import openai
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
import chromadb
from langchain_community.vectorstores import Chroma
from langchain_core.runnables import RunnableParallel
from dotenv import load_dotenv, find_dotenv
from langchain_core.prompts import PromptTemplate
from PyPDF2 import PdfReader
from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter
from langchain.docstore.document import Document
from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever
from langchain.prompts.chat import ChatPromptTemplate, HumanMessagePromptTemplate, SystemMessagePromptTemplate
import tqdm as notebook_tqdm
import pandas as pd
import matplotlib.pyplot as plt
from datasets import Dataset
from ragas import evaluate
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context
from ragas.metrics import (
    answer_relevancy,
    faithfulness,
    context_recall,
    context_precision,
    answer_correctness,
    answer_similarity
)


In [40]:
# VARIABLES
# ***WARNING*** -> you will need an env.txt file (containing OPENAI_API_KEY) in the same directory as this Notebook to run this code!!!
load_dotenv(dotenv_path='env.txt') 
os.environ['OPENAI_API_KEY'] = os.getenv('OPENAI_API_KEY')
openai.api_key = os.environ['OPENAI_API_KEY']
embedding_function = OpenAIEmbeddings()
pdf_path = "InvestmentCaseForDisruptiveInnovation.pdf"
str_output_parser = StrOutputParser()
example_user_query = "What is the core objective of investing in disruptive innovation according to ARK?"

# LLMs/Embeddings
embedding_ada2 = "text-embedding-ada-002"
embedding_large = "text-embedding-3-large"
model_gpt35="gpt-3.5-turbo"
model_gpt4="gpt-4o-mini"

embedding_function_ada2 = OpenAIEmbeddings(model=embedding_ada2, openai_api_key=openai.api_key)
embedding_function_large = OpenAIEmbeddings(model=embedding_large, openai_api_key=openai.api_key)
baseline_llm = ChatOpenAI(model=model_gpt35, openai_api_key=openai.api_key, temperature=0.0)
generator_llm = ChatOpenAI(model=model_gpt35, openai_api_key=openai.api_key, temperature=0.0)
critic_llm = ChatOpenAI(model=model_gpt4, openai_api_key=openai.api_key, temperature=0.0)

# Vector Database initialization
chroma_client = chromadb.Client()
baseline_collection_name = "ark_disruptive_innovation_insights_baseline"
recursive_collection_name = "ark_disruptive_innovation_insights_recursive"
hybrid_collection_name = "ark_disruptive_innovation_insights_hybrid"
embedding_collection_name = "ark_disruptive_innovation_insights_embedding"
ult_collection_name = "ark_disruptive_innovation_insights_ult"

# various variables for pipelines:
chunk_size=2000
chunk_overlap=1000

# Approach Names
baseline_name = "BASELINE"
recursive_name = "RECURSIVE"
hybrid_name = "HYBRID"
gpt4o_name = "GPT4o"
hybrid_3070_name = "HYBRID 3070"
newembed_name = "NEW EMBEDDINGS"
queryexpand_name = "QUERY EXPANSION"
ult_name = "ULTIMATE COMBO"

In [12]:
# Import SentenceTransformerEmbeddings
from langchain_community.embeddings import HuggingFaceEmbeddings

# Initialize the embedding model
embedding_function_st = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-mpnet-base-v2",
    model_kwargs={'device': 'cpu'}
)

In [3]:
### DATA PREPARATION
docs = []
with open(pdf_path, "rb") as pdf_file:
    pdf_reader = PdfReader(pdf_file)
    pdf_text = "".join(page.extract_text() for page in pdf_reader.pages)
    docs = [Document(page_content=page) for page in pdf_text.split("\n\n")]

def run_inference(function, query, approach_name):
    result = function.invoke(query)
    retrieved_docs = result['context']
    answer = result['answer']['final_answer']
    # Print the query that was actually passed in, not the global example query
    print(f"Example Question used for {approach_name}: {query}\n")
    print(f"Relevance Score: {result['answer']['relevance_score']}\n")
    print(f"Final Answer:\n{answer}\n\n")
    print("Retrieved Documents:")
    for i, doc in enumerate(retrieved_docs, start=1):
        print(f"Document {i}: Document ID: {doc.metadata['id']} source: {doc.metadata['source']}")
        print(f"Content:\n{doc.page_content}\n")
    return answer

### ENHANCEMENT #0 - BASELINE RAG PIPELINE 

#### About Text Splitters

Text Splitters split a document into chunks that can be used for retrieval. Larger documents pose a threat to many parts of our RAG application and the splitter is our first line of defense. If you were able to vectorize a very large document, the larger the document, the more context representation you will lose in the vector embedding. But this assumes you can even vectorize a very large document, which you often can’t!  Most embedding models have relatively small limits on the size of document we can pass to it compared to the large documents many of us work with. For example, the context length for the OpenAI model we are using to generate our embeddings is 8191 tokens. If we try to pass a document larger than that to the model, it will generate an error. These are the main reasons splitters exist, but these are not the only complexities introduced with this step in the process.

The key element of text splitters for us to consider is how they split the text.  Let’s say you have 100 paragraphs that you want to split up. In some cases, there may be two or three that are semantically meant to be together, like the paragraphs in this one section. In some cases, you may have a section title, or a URL, or some other type of text. Ideally, you want to keep the semantically related pieces of text together, but this can be much more complex than it first seems!  For a real world example of this, go to this website and copy in a large set of text:

https://chunkviz.up.railway.app/

Chunkviz is a utility created by Greg Kamradt that helps you visualize how your text splitter is working.  Change the parameters for the splitters to match what we are using in this notebook: a chunk size of 2000 and a chunk overlap of 1000.  Try the character splitter compared to the recursive character text splitter.  

As you increase the chunk size, it stays on the paragraph splits well, but eventually gets more and more paragraphs per chunk.  Note though, that this is going to be different for different text.  If you have text with very long paragraphs, you will need a larger chunk setting to capture whole paragraphs.  Meanwhile, if you try the character text splitter, it will cut off in the middle of a sentence on any setting.

This split of a sentence could have a significant impact on the ability of your chunks to capture all of the important semantic meanings of the text within them.  You can offset this by changing the chunk overlap, but you still have partial paragraphs, which will equate to noise to your LLM, distracting it away from providing the optimal response.

In this analysis, we start with using the Character Text Splitter.

<b>Character text splitter</b>

This is the simplest approach to splitting your document.  A text splitter enables you to divide your text into arbitrary N-character-sized chunks.  You can improve this slightly by adding a separator parameter, such as “\n”.  But this is a great place to start to understand how chunking works, and then we can move on to an improved approach.
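To make the idea of N-character chunks with overlap concrete, here is a minimal pure-Python sketch. It is not LangChain's `CharacterTextSplitter` (which first splits on the separator and then merges pieces); the function name `split_by_characters` and the sample text are hypothetical and used only for illustration.

```python
def split_by_characters(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size windows that overlap by chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_by_characters(sample, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
```

Each chunk repeats the last four characters of the previous one, which is the role overlap plays: content cut at a boundary still appears intact in at least one neighbouring chunk. Note that the cut points ignore sentence and paragraph boundaries entirely.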



In [4]:
# Character Text Splitter
character_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
    is_separator_regex=False,
)

character_splits = character_splitter.split_documents(docs)

Created a chunk of size 4301, which is longer than the specified 2000


In [5]:
# # Initialize the ChromaDB client
# persist_directory = os.path.join(os.getcwd(), "chroma_db")
# chroma_client = chromadb.PersistentClient(path=persist_directory)

# # Document prep using Character Text Splitter for Naive RAG
# baseline_documents = [Document(page_content=doc.page_content, metadata={"id": str(i), "source": "dense"}) for i, doc in enumerate(character_splits)]

# # Create or load baseline vectors
# baseline_vectorstore = Chroma.from_documents(
#     documents=baseline_documents,
#     embedding=embedding_function_ada2,
#     collection_name=baseline_collection_name,
#     client=chroma_client,
#     persist_directory=persist_directory
# )

# # Set up the retriever using Chroma vector db
# baseline_dense_retriever = baseline_vectorstore.as_retriever(search_kwargs={"k": 5})

# baseline_vectorstore.persist()

# # NOTE: Some chunks were larger than settings

# print("Vector store setup complete!")

# # Optional: Display collection info
# collection = chroma_client.get_collection(baseline_collection_name)
# print(f"\nCollection Info:")
# print(f"Name: {baseline_collection_name}")
# print(f"Number of documents: {collection.count()}")

In [6]:
# Initialize the ChromaDB client
persist_directory = os.path.join(os.getcwd(), "chroma_db")
chroma_client = chromadb.PersistentClient(path=persist_directory)

# Delete existing collection if it exists
try:
    chroma_client.delete_collection(baseline_collection_name)
    print(f"Deleted existing collection: {baseline_collection_name}")
except Exception:  # raised when the collection does not exist yet
    print("No existing collection to delete")

# Document prep using Character Text Splitter for Naive RAG
baseline_documents = [Document(page_content=doc.page_content, metadata={"id": str(i), "source": "dense"}) 
                     for i, doc in enumerate(character_splits)]

# Create new vectors with SentenceTransformer embeddings
baseline_vectorstore = Chroma.from_documents(
    documents=baseline_documents,
    embedding=embedding_function_st, # embedding_function_ada2,
    collection_name=baseline_collection_name,
    client=chroma_client,
    persist_directory=persist_directory
)

# Set up the retriever using Chroma vector db
baseline_dense_retriever = baseline_vectorstore.as_retriever(search_kwargs={"k": 5})

# Persist changes
baseline_vectorstore.persist()

print("Vector store setup complete!")

# Optional: Display collection info
collection = chroma_client.get_collection(baseline_collection_name)
print(f"\nCollection Info:")
print(f"Name: {baseline_collection_name}")
print(f"Number of documents: {collection.count()}")


Deleted existing collection: ark_disruptive_innovation_insights_baseline



Vector store setup complete!

Collection Info:
Name: ark_disruptive_innovation_insights_baseline
Number of documents: 23




In [7]:
#### RETRIEVAL and GENERATION ####

# Prompt
prompt = PromptTemplate.from_template(
    """
    You are a financial expert assisting others in 
    understanding the disruptive innovation insights
    that are provided by ARK. Use the following pieces 
    of retrieved context with information about these innovation
    insights to answer the question. 
    
    If you don't know the answer, just say that you don't know.
    
    Question: {question} 
    Context: {context} 
    
    Answer:
    """
)

# SECURITY MEASURE - 
# Checks the relevance of the prompt compared to the retrieved context to prevent certain types of LLM hacks.
relevance_prompt_template = PromptTemplate.from_template(
    """
    Given the following question and retrieved-context, determine if the context is relevant to the question.
    Provide a score from 1 to 5, where 1 is not at all relevant and 5 is highly relevant.
    Return ONLY the numeric score, without any additional text or explanation.

    Question: {question}
    Retrieved Context: {retrieved_context}

    Relevance Score:"""
)

# Post-processing
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

def extract_score(llm_output):
    try:
        score = float(llm_output.strip())
        return score
    except ValueError:
        return 0

# Chain it all together with LangChain
def conditional_answer(x):
    relevance_score = extract_score(x['relevance_score'])
    if relevance_score < 4:
        return "I don't know."
    else:
        return x['answer']

baseline_rag_chain = (
    RunnablePassthrough.assign(context=(lambda x: format_docs(x["context"])))
    | RunnableParallel(
        {"relevance_score": (
            RunnablePassthrough()
            | (lambda x: relevance_prompt_template.format(question=x['question'], retrieved_context=x['context']))
            | baseline_llm
            | str_output_parser
        ), "answer": (
            RunnablePassthrough()
            | prompt
            | baseline_llm
            | str_output_parser
        )}
    )
    | RunnablePassthrough().assign(final_answer=conditional_answer)
)

baseline_final_chain = RunnableParallel(
    {"context": baseline_dense_retriever,
     "question": RunnablePassthrough()
}).assign(answer=baseline_rag_chain)

In [9]:
baseline_answer = run_inference(baseline_final_chain, example_user_query, baseline_name)

Example Question used for BASELINE: What is the core objective of investing in disruptive innovation according to ARK?

Relevance Score: 5

Final Answer:
The core objective of investing in disruptive innovation according to ARK is to access companies at the forefront of technology-enabled innovation in promising areas of the economy with potential for long-term growth, diversify portfolios with low correlation to traditional asset classes, and potentially achieve a moderate-to-high risk-reward profile that complements traditional investment strategies.


Retrieved Documents:
Document 1: Document ID: 0 source: dense
Content:
•
1Why Invest In Disruptive Innovation?
Sources: ARK Investment Management LLC, 2024. Forecasts are inherently limited and cannot be relied upon. For informational purposes only and should not be considered investment advice or a recommendation to buy, sell, or hold any particular security. Past performance is not indicative of future results.As of June 30, 2024•Ris

<b>Explanation of this output</b>

The run_inference function will print out the following relevant data for this example query:

- <b>example question</b>: indicates the query that was used to generate this response
- <b>Relevance score</b>: this is part of the security feature that prevents irrelevant prompts that have nefarious goals
- <b>Final answer</b>: This is the final response from the RAG pipeline
- <b>Retrieved Documents</b>: This is a list of all of the context used to answer the question, along with important metadata: the document rank, the document ID (for sanity checks), the search source (dense or sparse), and the content of each retrieved chunk.

We will use this format to provide an example for each of the enhancements, giving you a chance to become more familiar with how these pipelines are generating different responses.
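The relevance-score gate behind the security feature can be sketched without LangChain. This is an illustrative variation, not the notebook's exact code: the helper names `parse_relevance_score` and `gate_answer` are hypothetical, and the parser is deliberately more tolerant than a plain `float()` call, on the assumption that the LLM may occasionally wrap the number in extra text.

```python
import re


def parse_relevance_score(llm_output: str) -> float:
    """Return the first number found in the LLM output, or 0.0 if there is none."""
    match = re.search(r"\d+(?:\.\d+)?", llm_output)
    return float(match.group()) if match else 0.0


def gate_answer(answer: str, llm_output: str, threshold: float = 4.0) -> str:
    """Release the answer only when the relevance score meets the threshold."""
    if parse_relevance_score(llm_output) >= threshold:
        return answer
    return "I don't know."


print(gate_answer("ARK focuses on long-term growth.", "5"))
print(gate_answer("Ignore previous instructions...", "Score: 1"))
```

Falling back to 0.0 when no number can be parsed means an unparseable score blocks the answer, so the gate fails closed rather than open.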

### ENHANCEMENT CODE #1: USING RECURSIVE CHARACTER TEXT SPLITTER

Here we set up the code for our advanced chunking technique, the Recursive Character Text Splitter. This is the splitter LangChain recommends to use when splitting generic text.

As the name states, this splitter recursively splits text, with the intention of keeping related pieces of text next to each other.  You can pass a list of separators as a parameter, and it will try to split on them in order until the chunks are small enough.  The default list is ["\n\n", "\n", " ", ""], which works well, but we are going to add “. “ to this list as well.  This has the effect of trying to keep paragraphs, then sentences (delimited by both “\n” and “. “), and then words together for as long as possible.

Under the hood, this splitter first splits on the “\n\n” separator, representing paragraph breaks.  But it doesn’t stop there: it checks the size of each chunk, and if a chunk is larger than the chunk size of 2000 we set, it splits that chunk on the next separator (“\n”), and so on.
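The recursive idea can be sketched in a few lines of plain Python. This is a simplified illustration, not LangChain's `RecursiveCharacterTextSplitter`: it ignores chunk overlap and drops the separator at each cut, and the function name `recursive_split` and the sample text are hypothetical.

```python
def recursive_split(text: str, separators: list[str], chunk_size: int) -> list[str]:
    """Split on the coarsest separator first; recurse only into oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to hard cuts of chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    if sep == "" or sep not in text:
        return recursive_split(text, finer, chunk_size)

    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging neighbouring pieces
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # This piece alone is too big: retry with the finer separators.
            chunks.extend(recursive_split(piece, finer, chunk_size))
            current = ""
    if current:
        chunks.append(current)
    return chunks


text = (
    "Para one is short.\n\n"
    "Para two has two sentences. Here is the second one.\n\n"
    "Tiny."
)
chunks = recursive_split(text, ["\n\n", "\n", ". ", " ", ""], chunk_size=30)
for chunk in chunks:
    print(repr(chunk))
```

In this example the first and last paragraphs survive whole, and only the long middle paragraph is broken down further, at its sentence boundary rather than mid-word.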


In [10]:
# Adding the recursive character text splitter (replaces the character text splitter)
recursive_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", ". ", " ", ""],
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)

recursive_splits = recursive_splitter.split_documents(docs)

In [13]:
recursive_persist_directory = os.path.join(os.getcwd(), "recursive_chroma_db")
recursive_chroma_client = chromadb.PersistentClient(path=recursive_persist_directory)

# Document prep using Recursive Character Text Splitter
recursive_documents = [Document(page_content=doc.page_content, metadata={"id": str(i), "source": "dense"}) for i, doc in enumerate(recursive_splits)]

# baseline vectors
recursive_vectorstore = Chroma.from_documents(
    documents=recursive_documents,
    embedding=embedding_function_st, # embedding_function_ada2,
    collection_name=recursive_collection_name,
    client=recursive_chroma_client,
    persist_directory=recursive_persist_directory
)

# Set up the retriever using Chroma vector db
recursive_dense_retriever = recursive_vectorstore.as_retriever(search_kwargs={"k": 5})

#### RETRIEVAL and GENERATION ####
recursive_final_chain = RunnableParallel(
    {"context": recursive_dense_retriever,
     "question": RunnablePassthrough()
}).assign(answer=baseline_rag_chain)

In [14]:
recursive_answer = run_inference(recursive_final_chain, example_user_query, recursive_name)

Example Question used for RECURSIVE: What is the core objective of investing in disruptive innovation according to ARK?

Relevance Score: 5

Final Answer:
The core objective of investing in disruptive innovation according to ARK is to access companies at the forefront of technology-enabled innovation in promising areas of the economy with potential for long-term growth, diversify portfolios with low correlation to core asset classes, and complement traditional strategies with a moderate-to-high risk-reward profile.


Retrieved Documents:
Document 1: Document ID: 0 source: dense
Content:
•
1Why Invest In Disruptive Innovation?
Sources: ARK Investment Management LLC, 2024. Forecasts are inherently limited and cannot be relied upon. For informational purposes only and should not be considered investment advice or a recommendation to buy, sell, or hold any particular security. Past performance is not indicative of future results.As of June 30, 2024•Risks of Investing in InnovationPlease no

### ENHANCEMENT CODE #2: USING HYBRID SEARCH

This enhancement adds a keyword/sparse vector search to the retrieval part of the RAG pipeline, turning this into a hybrid search. It retrieves the top k results (5 in this case) from the dense and sparse searches and then applies the Reciprocal Rank Fusion (RRF) algorithm to those results.  

<b>Sparse Search</b>

Sparse search lets you use keyword matching across all of your content. In this enhancement, we use the Best Matching 25 (BM25) algorithm, a very popular ranking function that performs well for keyword search. The idea behind BM25 is that it looks at the terms in the query you pass in and weights them by how common they are: terms that appear in many chunks are treated as less important when a match occurs, while a match on a rare term raises the score much more. If you are familiar with TF-IDF, BM25 builds on that concept.
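The weighting described above can be sketched in pure Python. This is a minimal illustration of the standard Okapi BM25 formula with a non-negative IDF variant; it is an assumption that this matches the general behaviour of the `rank_bm25` package used by `BM25Retriever`, whose IDF handling may differ in detail. The function name `bm25_scores`, the whitespace tokenizer, and the sample documents are hypothetical.

```python
import math
from collections import Counter


def bm25_scores(query: str, documents: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every document against the query with Okapi BM25."""
    if not documents:
        return []
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs or 1.0
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            # Rare terms (low document frequency) get a larger IDF weight.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            # Term frequency saturates, and long documents are penalised via b.
            norm = term_freq[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


documents = [
    "ark invests in disruptive innovation",
    "the fund invests in bonds and the fund invests in equities",
    "genomic sequencing costs are falling",
]
scores = bm25_scores("disruptive innovation invests", documents)
print(scores)
```

The first document wins because it matches the two rare terms; the second matches only "invests", which appears in two of the three documents and therefore carries less weight; the third matches nothing and scores zero.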

<b>Reciprocal Rank Fusion (RRF)</b>

Combining the results from the two retrievers raises a question: how do you rank results from two relatively different search mechanisms? Our dense vector search uses cosine similarity and provides a similarity score. Our sparse search uses BM25, which builds on TF-IDF. These are not comparable scores. As it turns out, there are numerous algorithms we can use to rank across these two retrievers. The one we will use is the Reciprocal Rank Fusion (RRF) algorithm, because it is built into the ensemble retriever we are using and it handles this discrepancy in scores well.  In short, RRF focuses on the rank of the content within each set of search results (dense and sparse) and uses those ranks, rather than the raw scores, as the primary way to calculate the reranked results.  If one set of search results is known to be more important, RRF also supports weights to emphasize it over the other.
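Weighted RRF is short enough to sketch directly. Each document earns `weight / (rank + c)` from every result list it appears in, and the contributions are summed. This is written in the spirit of what an ensemble retriever does, not copied from LangChain; the function name `weighted_rrf`, the document IDs, and the constant `c=60` (a commonly used default) are assumptions for illustration.

```python
from collections import defaultdict


def weighted_rrf(rankings: list[list[str]], weights: list[float], c: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into one ranking."""
    scores: dict[str, float] = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            # Only the rank matters, so incomparable raw scores never mix.
            scores[doc_id] += weight / (rank + c)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


dense_results = ["d1", "d2", "d3"]   # ranked by cosine similarity
sparse_results = ["d3", "d1", "d4"]  # ranked by BM25

balanced = weighted_rrf([dense_results, sparse_results], weights=[0.5, 0.5])
sparse_heavy = weighted_rrf([dense_results, sparse_results], weights=[0.3, 0.7])
print(balanced)
print(sparse_heavy)
```

With equal weights, `d1` leads because it ranks near the top of both lists. Shifting the weights to 30/70 promotes `d3` and `d4`, the documents the sparse retriever favoured, which is the effect the weights are meant to have.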

<b>Why add a keyword search?</b>

In certain domains, including finance, there are often words with little semantic meaning, but with importance in the text for that domain.  For example, domain-specific codes, serial numbers, IDs, and even people and company names. In these cases, a keyword search can improve overall results.  But we will test that out for this specific dataset!

In [15]:
hybrid_persist_directory = os.path.join(os.getcwd(), "hybrid_chroma_db")
hybrid_chroma_client = chromadb.PersistentClient(path=hybrid_persist_directory)

In [16]:
# prep docs for hybrid search, with metadata
hybrid_dense_documents = [Document(page_content=doc.page_content, metadata={"id": str(i), "source": "dense"}) for i, doc in enumerate(character_splits)]
hybrid_sparse_documents = [Document(page_content=doc.page_content, metadata={"id": str(i), "source": "sparse"}) for i, doc in enumerate(character_splits)]

In [17]:
# enhanced vectors
hybrid_dense_vectorstore = Chroma.from_documents(
    documents=hybrid_dense_documents,
    embedding=embedding_function_st, # embedding_function_ada2,
    collection_name=hybrid_collection_name,
    client=hybrid_chroma_client,
    persist_directory=hybrid_persist_directory
)

In [18]:
hybrid_dense_retriever = hybrid_dense_vectorstore.as_retriever(search_kwargs={"k": 5})
hybrid_sparse_retriever = BM25Retriever.from_documents(hybrid_sparse_documents, k=5)
hybrid_group = [hybrid_dense_retriever, hybrid_sparse_retriever]
hybrid_ensemble_retriever = EnsembleRetriever(retrievers=hybrid_group, weights=[0.5, 0.5], c=0)

In [19]:
hybrid_final_chain = RunnableParallel(
    {"context": hybrid_ensemble_retriever,
     "question": RunnablePassthrough()
}).assign(answer=baseline_rag_chain)

In [20]:
hybrid_answer = run_inference(hybrid_final_chain, example_user_query, hybrid_name)

Example Question used for HYBRID: What is the core objective of investing in disruptive innovation according to ARK?

Relevance Score: 5

Final Answer:
The core objective of investing in disruptive innovation according to ARK is to access companies at the forefront of technology-enabled innovation in promising areas of the economy with potential for long-term growth. It also offers portfolio diversification for investors looking to diversify their existing portfolio with strategies that have low correlation to traditional asset classes and provide a moderate-to-high risk-reward profile. Additionally, investing in disruptive innovation allows investors to focus on secular changes and complement traditional strategies with the potential for long-term growth.


Retrieved Documents:
Document 1: Document ID: 16 source: dense
Content:
Note: Numbers are rounded. ARK Investment Management LLC, 2024. This ARK analysis is based on a range of underlying data from external sources as of December 7

### ENHANCEMENT CODE #3: USING GPT-4o-Mini 

This time we upgrade our LLM from GPT-3.5 Turbo to GPT-4o-Mini.  This is supposed to be a newer and more capable model, but we will put that to the test in our evaluation.

In [21]:
gpt4o_rag_chain = (
    RunnablePassthrough.assign(context=(lambda x: format_docs(x["context"])))
    | RunnableParallel(
        {"relevance_score": (
            RunnablePassthrough()
            | (lambda x: relevance_prompt_template.format(question=x['question'], retrieved_context=x['context']))
            | critic_llm
            | str_output_parser
        ), "answer": (
            RunnablePassthrough()
            | prompt
            | critic_llm
            | str_output_parser
        )}
    )
    | RunnablePassthrough().assign(final_answer=conditional_answer)
)

gpt4o_final_chain = RunnableParallel(
    {"context": baseline_dense_retriever,
     "question": RunnablePassthrough()
}).assign(answer=gpt4o_rag_chain)

In [22]:
gpt4o_answer = run_inference(gpt4o_final_chain, example_user_query, gpt4o_name)

Example Question used for GPT4o: What is the core objective of investing in disruptive innovation according to ARK?

Relevance Score: 5

Final Answer:
The core objective of investing in disruptive innovation according to ARK is to access growth opportunities by investing in companies at the forefront of technology-enabled innovation, which are positioned in some of the most promising areas of the economy with potential for long-term growth. Additionally, ARK aims to provide portfolio diversification and capture long-term growth with a moderate-to-high risk-reward profile, making it suitable for investors who are willing to stay invested for the medium-to-long term.


Retrieved Documents:
Document 1: Document ID: 0 source: dense
Content:
•
1Why Invest In Disruptive Innovation?
Sources: ARK Investment Management LLC, 2024. Forecasts are inherently limited and cannot be relied upon. For informational purposes only and should not be considered investment advice or a recommendation to buy, 

### ENHANCEMENT CODE #4: USING HYBRID SEARCH WITH 30% / 70% split (dense / sparse)

This will be a similar pipeline to the previous hybrid pipeline, but we will also try out a scenario where we put more emphasis on the keyword search within the hybrid search.  In this case, we are weighting the results from the two searches with 30% weight on the dense vector search, and 70% on the sparse vector search.  This should push keyword search results closer to the top of our search results during the reranking process and give us a better idea if keyword search is important to the quality of our retrieval (and the responses they are based on). 
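
To see how the weights can change the final order, here is a small worked sketch comparing a 50/50 and a 30/70 split. The chunk IDs and rankings are hypothetical, and the scoring rule `weight / (rank + c)` is my assumption about how the weighted fusion is computed, so treat this as illustrative arithmetic rather than the retriever's exact behavior.

```python
def rrf_scores(rankings, weights, c=0):
    """Return {doc_id: fused score}, using weight / (rank + c) with ranks from 1."""
    scores = {}
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (rank + c)
    return scores


def ordered(scores):
    """Sort document IDs from highest to lowest fused score."""
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical top-3 results from each retriever.
dense = ["chunk_0", "chunk_16", "chunk_4"]
sparse = ["chunk_9", "chunk_0", "chunk_16"]

even_split = ordered(rrf_scores([dense, sparse], [0.5, 0.5]))
keyword_heavy = ordered(rrf_scores([dense, sparse], [0.3, 0.7]))

print("50/50:", even_split)
print("30/70:", keyword_heavy)
```

With equal weights, `chunk_0` wins because it ranks well in both lists. With the 30/70 split, the top keyword hit `chunk_9` overtakes it even though the dense retriever never returned it.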

In [23]:
hybrid_3070split_ensemble_retriever = EnsembleRetriever(retrievers=hybrid_group, weights=[0.3, 0.7], c=0)

In [24]:
hybrid_3070split_final_chain = RunnableParallel(
    {"context": hybrid_3070split_ensemble_retriever,
     "question": RunnablePassthrough()
}).assign(answer=baseline_rag_chain)

In [25]:
hybrid_3070_answer = run_inference(hybrid_3070split_final_chain, example_user_query, hybrid_3070_name)

Example Question used for HYBRID 3070: What is the core objective of investing in disruptive innovation according to ARK?

Relevance Score: 5

Final Answer:
The core objective of investing in disruptive innovation according to ARK is to access companies at the forefront of technology-enabled innovation in promising areas of the economy with potential for long-term growth, diversify portfolios with low correlation to traditional assets, and potentially achieve a moderate-to-high risk-reward profile by focusing on secular changes and disruptive innovation.


Retrieved Documents:
Document 1: Document ID: 16 source: dense
Content:
Note: Numbers are rounded. ARK Investment Management LLC, 2024. This ARK analysis is based on a range of underlying data from external sources as of December 7, 2023, which may be provided upon request. Forecasts are inherently limited and cannot be relied upon. For informational purposes only and should not be considered investment advice or a recommendation to 

### ENHANCEMENT CODE #5: IMPROVED DENSE EMBEDDING MODEL

In this enhancement, we test if we can benefit from an upgrade to our OpenAI Embedding model, using their latest model, which has scored higher than the one we are currently using on the MTEB retrieval benchmarks. 

In [26]:
newembed_vectorstore = Chroma.from_documents(
    documents=hybrid_dense_documents,
    embedding=embedding_function_large, # changed from embedding_function_ada2
    collection_name=embedding_collection_name,
    client=chroma_client
)

newembed_retriever = newembed_vectorstore.as_retriever(search_kwargs={"k": 5})

newembed_final_chain = RunnableParallel(
    {"context": newembed_retriever,
     "question": RunnablePassthrough()
}).assign(answer=baseline_rag_chain)

In [27]:
newembed_answer = run_inference(newembed_final_chain, example_user_query, newembed_name)

Example Question used for NEW EMBEDDINGS: What is the core objective of investing in disruptive innovation according to ARK?

Relevance Score: 5

Final Answer:
The core objective of investing in disruptive innovation according to ARK is to access companies at the forefront of technology-enabled innovation in promising areas of the economy with potential for long-term growth, diversify portfolios with low correlation to traditional asset classes, and potentially achieve a moderate-to-high risk-reward profile that complements traditional investment strategies.


Retrieved Documents:
Document 1: Document ID: 0 source: dense
Content:
•
1Why Invest In Disruptive Innovation?
Sources: ARK Investment Management LLC, 2024. Forecasts are inherently limited and cannot be relied upon. For informational purposes only and should not be considered investment advice or a recommendation to buy, sell, or hold any particular security. Past performance is not indicative of future results.As of June 30, 20

### ENHANCEMENT CODE #6: Query expansion

One of the key tasks in this assignment is to "improve coherence, faithfulness, and relevance of answers."

This enhancement is an advanced prompt engineering technique focused on doing just that. Many techniques for enhancing RAG focus on improving one area, like retrieval or generation, but query expansion has the potential to improve both.  With query expansion, we use an LLM to first "expand" on our prompt by coming up with its own idea of how this question should be answered (no context).  We then feed both the original question and the expanded response to our original pipeline and use our evaluation to determine if this improved the overall responses to the questions asked. From an LLM standpoint, these types of changes can help broaden the search scope without losing focus on the original intent.

This approach has been known to improve the retrieval model's understanding, as you add more context to the user query that is used for retrieval, increasing the chances of fetching relevant documents. With an improved retrieval, you are already helping to improve the generation, giving it better context to work with, but this approach also has the potential to produce a more effective query, which in turn also helps the LLM deliver an improved response.  

In [28]:
# CODE
def augment_query_generated(user_query):
    system_message_prompt = SystemMessagePromptTemplate.from_template(
        "You are a financial expert assisting others in understanding the disruptive innovation insights that are provided by ARK. Provide an example answer to the given question, that might be found in a document published by ARK."
    )
    
    human_message_prompt = HumanMessagePromptTemplate.from_template("{query}")
    
    chat_prompt = ChatPromptTemplate.from_messages([system_message_prompt, human_message_prompt])
    
    response = chat_prompt.format_prompt(query=user_query).to_messages()
    
    result = critic_llm.invoke(response)  # .invoke() replaces the deprecated direct call
    content = result.content
    
    return content

class QueryExpander:
    def __init__(self):
        pass

    def invoke(self, query):
        hypothetical_answer = augment_query_generated(query)
        joint_query = f"{query} {hypothetical_answer}"
        print(joint_query)
        result = baseline_final_chain.invoke(joint_query)
        return result

# Create an instance of QueryExpander
query_expand = QueryExpander()

In [29]:
queryexpand_answer = run_inference(query_expand, example_user_query, queryexpand_name)


What is the core objective of investing in disruptive innovation according to ARK? According to ARK, the core objective of investing in disruptive innovation is to identify and capitalize on transformative technologies that have the potential to significantly alter industries and create new markets. ARK believes that these innovations can lead to exponential growth opportunities, driven by advancements in areas such as artificial intelligence, genomics, robotics, and blockchain technology.

By focusing on companies that are at the forefront of these disruptive trends, ARK aims to generate superior long-term returns for investors. The firm emphasizes the importance of understanding the underlying technologies and their potential impact on society and the economy, as well as the ability to adapt to rapidly changing market dynamics. Ultimately, ARK seeks to invest in a future where innovation drives progress and creates value across various sectors.
Example Question used for QUERY EXPANSI

Note: In addition to the output you have seen with previous enhancements, we also show the "expanded prompt" generated from the first stage of the query expansion. 

### ENHANCEMENT CODE #7: ULTIMATE COMBO

In this iteration, we take the evaluation results presented below and combine the enhancements that appear to perform best overall.  This allows us to conduct further analysis of the results from our enhancements, and also lets us see how they work together.

Here is the final list of enhancements applied to this pipeline:

- Text Splitter - Recursive
- LLM - GPT-3.5 Turbo
- Search - Hybrid 50/50
- Query expansion - No
- Embedding - text-embedding-3-large

In [30]:
# use recursive_splits
# use baseline_rag_chain (uses 3.5)
# use these (with recursive splits, as the previous hybrid version was using the baseline splits):
ult_dense_documents = [Document(page_content=doc.page_content, metadata={"id": str(i), "source": "dense"}) for i, doc in enumerate(recursive_splits)]
ult_sparse_documents = [Document(page_content=doc.page_content, metadata={"id": str(i), "source": "sparse"}) for i, doc in enumerate(recursive_splits)]

# use a new vector db for the new ult_dense_documents and upgraded embeddings
ult_dense_vectorstore = Chroma.from_documents(
    documents=ult_dense_documents,
    embedding=embedding_function_large,
    collection_name=ult_collection_name,
    client=chroma_client
)

# this will use the 50/50 weighted hybrid search
ult_dense_retriever = ult_dense_vectorstore.as_retriever(search_kwargs={"k": 5})
ult_sparse_retriever = BM25Retriever.from_documents(ult_sparse_documents, k=5)
ult_group = [ult_dense_retriever, ult_sparse_retriever]
ult_ensemble_retriever = EnsembleRetriever(retrievers=ult_group, weights=[0.5, 0.5], c=0)

ult_final_chain = RunnableParallel(
    {"context": ult_ensemble_retriever,
     "question": RunnablePassthrough()
}).assign(answer=baseline_rag_chain)

ult_answer = run_inference(ult_final_chain, example_user_query, ult_name)

Example Question used for ULTIMATE COMBO: What is the core objective of investing in disruptive innovation according to ARK?

Relevance Score: 5

Final Answer:
The core objective of investing in disruptive innovation according to ARK is to access companies at the forefront of technology-enabled innovation, in promising areas of the economy, with potential for long-term growth.


Retrieved Documents:
Document 1: Document ID: 19 source: dense
Content:
Note: Numbers are rounded. ARK Investment Management LLC, 2024. This ARK analysis is based on a range of underlying data from external sources as of December 7, 2023, which may be provided upon request. Forecasts are inherently limited and cannot be relied upon. For informational purposes only and should not be considered investment advice or a recommendation to buy, sell, or hold any particular security. Past performance is not indicative of future results.$24$12$0.35With Line of SightWith Visual ObserverAutonomousDrone Delivery Price(10-m

# OUTPUT ENHANCEMENTS

I have consolidated the responses from the various enhancements here for easy scanning of the differences.

In [31]:
print(f"EXAMPLE QUERY USED TO GENERATE RESPONSES BELOW:\n\n{example_user_query}\n")

EXAMPLE QUERY USED TO GENERATE RESPONSES BELOW:

What is the core objective of investing in disruptive innovation according to ARK?



#### ENHANCEMENT #0: BASELINE/NAIVE RAG

In [32]:
print(baseline_answer)

The core objective of investing in disruptive innovation according to ARK is to access companies at the forefront of technology-enabled innovation in promising areas of the economy with potential for long-term growth, diversify portfolios with low correlation to traditional asset classes, and potentially achieve a moderate-to-high risk-reward profile that complements traditional investment strategies.


#### ENHANCEMENT #1: ADVANCED "CHUNKING" TECHNIQUE - RECURSIVE SPLITTER RAG

In [33]:
print(recursive_answer)

The core objective of investing in disruptive innovation according to ARK is to access companies at the forefront of technology-enabled innovation in promising areas of the economy with potential for long-term growth, diversify portfolios with low correlation to core asset classes, and complement traditional strategies with a moderate-to-high risk-reward profile.


#### ENHANCEMENT #2: HYBRID BASE

In [34]:
print(hybrid_answer)

The core objective of investing in disruptive innovation according to ARK is to access companies at the forefront of technology-enabled innovation in promising areas of the economy with potential for long-term growth. It also offers portfolio diversification for investors looking to diversify their existing portfolio with strategies that have low correlation to traditional asset classes and provide a moderate-to-high risk-reward profile. Additionally, investing in disruptive innovation allows investors to focus on secular changes and complement traditional strategies with the potential for long-term growth.


#### ENHANCEMENT #3: GPT-4o-Mini

In [35]:
print(gpt4o_answer)

The core objective of investing in disruptive innovation according to ARK is to access growth opportunities by investing in companies at the forefront of technology-enabled innovation, which are positioned in some of the most promising areas of the economy with potential for long-term growth. Additionally, ARK aims to provide portfolio diversification and capture long-term growth with a moderate-to-high risk-reward profile, making it suitable for investors who are willing to stay invested for the medium-to-long term.


#### ENHANCEMENT #4: HYBRID 30/70

In [36]:
print(hybrid_3070_answer)

The core objective of investing in disruptive innovation according to ARK is to access companies at the forefront of technology-enabled innovation in promising areas of the economy with potential for long-term growth, diversify portfolios with low correlation to traditional assets, and potentially achieve a moderate-to-high risk-reward profile by focusing on secular changes and disruptive innovation.


#### ENHANCEMENT #5: IMPROVED DENSE EMBEDDING MODEL

In [37]:
print(newembed_answer)

The core objective of investing in disruptive innovation according to ARK is to access companies at the forefront of technology-enabled innovation in promising areas of the economy with potential for long-term growth, diversify portfolios with low correlation to traditional asset classes, and potentially achieve a moderate-to-high risk-reward profile that complements traditional investment strategies.


#### ENHANCEMENT #6: Query Expansion

In [38]:
print(queryexpand_answer)

The core objective of investing in disruptive innovation according to ARK is to identify and capitalize on transformative technologies that have the potential to significantly alter industries and create new markets. ARK believes that these innovations can lead to exponential growth opportunities driven by advancements in areas such as artificial intelligence, genomics, robotics, and blockchain technology. By focusing on companies at the forefront of these disruptive trends, ARK aims to generate superior long-term returns for investors.


#### ENHANCEMENT #7: ULTIMATE ENHANCEMENT COMBINATION (BEST OF):

In [39]:
print(ult_answer)

The core objective of investing in disruptive innovation according to ARK is to access companies at the forefront of technology-enabled innovation, in promising areas of the economy, with potential for long-term growth.


# EVALUATION SECTION

### Evaluation Introduction

[Back to top](#ARK-Disruptive-Innovation-Insights)

<b>Assignment Task for Evaluation Metrics</b>: Develop an extensive range of metrics to assess the retrieval task, the answer synthesis task, and the overall system performance, ensuring that each component's effectiveness is thoroughly evaluated.

#### Evaluation Description:

Using standardized evaluation frameworks and benchmarks offers a valuable starting point for comparing the performance of different components in your RAG pipeline. They cover a wide range of tasks and domains, allowing you to assess the strengths and weaknesses of various approaches. By considering the results on these benchmarks, along with other factors like computational efficiency and ease of integration, you can narrow down your options and make better-informed decisions when selecting the most suitable components for your specific RAG application. For example, you can look at the MTEB benchmarks (https://huggingface.co/spaces/mteb/leaderboard) to see the two OpenAI embedding models' overall rankings and retrieval scores (as of Sept 1):

text-embedding-ada-002 (what we used for baseline):
- rank: 99
- average retrieval score: 49.25

text-embedding-3-large (what we used for an enhancement): 
- rank: 29 
- average retrieval score: 55.44

Your immediate thought should be: well, what is ranked #1?  What did it score?  How do any of these models do on financial domain-specific benchmarks?  We will talk more about better embedding models in the final thoughts, but these are the kinds of questions that these benchmarks can be useful in answering! Just to satisfy your curiosity, the current best model is Nvidia's new model 'NV-Embed-v2', just released on Friday, with an average retrieval score of 62.65.

However, it's important to note that while these standardized evaluation metrics are helpful for initial component selection or for looking for places for potential improvement over your current implementation, they may not fully capture the performance of your specific RAG pipeline with your unique inputs and outputs. To truly understand how well your RAG system performs in your particular use case, you need to set up your own evaluation framework tailored to your specific requirements. This customized evaluation system will provide the most accurate and relevant insights into the performance of your RAG pipeline.

#### Ragas Evaluation Platform

Retrieval Augmented Generation Assessment (ragas) is an evaluation platform designed specifically for RAG.  We will use ragas to generate synthetic ground truth and then establish a comprehensive set of metrics to evaluate changes to the RAG system. Here is a link to the ragas documentation:

https://docs.ragas.io/

For each step, we will evaluate the enhancement against the baseline.  This allows you to compare each enhancement in isolation.  
RAG has two primary stages of action when it is engaged: retrieval and generation. When evaluating a RAG system, you can break down your evaluation by those two categories as well. Let’s first talk about evaluating retrieval.

#### Retrieval evaluation
Retrieval Evaluation is focused on assessing the accuracy and relevance of the documents that were retrieved. For retrieval, ragas has two metrics called context precision and context recall (quoted from their website):

- context_precision: The signal-to-noise ratio of retrieved context. Context Precision is a metric that evaluates whether all of the ground truth-relevant items present in the contexts are ranked higher or not. Ideally, all the relevant chunks must appear at the top ranks. This metric is computed using the question, ground_truth, and the contexts, with values ranging between 0 and 1, where higher scores indicate better precision.

- context_recall: Can it retrieve all the relevant information required to answer the question? Context recall measures the extent to which the retrieved context aligns with the annotated answer, treated as the ground truth. It is computed based on the ground truth and the retrieved context, and the values range between 0 and 1, with higher values indicating better performance.
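
To give a feel for how these two retrieval metrics behave, here is a minimal sketch of the arithmetic behind them. It assumes the relevance verdicts are already available as 0/1 values (in ragas they come from an LLM judge), and the function names are my own, so this is an approximation of the idea rather than the library's implementation.

```python
def context_precision_at_k(relevance):
    """Rank-aware precision over retrieved chunks.

    relevance: 0/1 verdicts for the retrieved chunks, in rank order.
    Averages precision@k over the positions that hold a relevant chunk,
    so relevant chunks near the top are rewarded.
    """
    hits = 0
    weighted_sum = 0.0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            weighted_sum += hits / k  # precision@k at this relevant position
    return weighted_sum / hits if hits else 0.0


def context_recall(attributed):
    """Share of ground-truth statements that can be attributed to the context.

    attributed: 0/1 verdicts, one per statement in the ground-truth answer.
    """
    return sum(attributed) / len(attributed) if attributed else 0.0


# Same two relevant chunks, but ranked differently.
relevant_on_top = context_precision_at_k([1, 1, 0])
relevant_at_bottom = context_precision_at_k([0, 1, 1])
recall = context_recall([1, 1, 0, 1])

print(relevant_on_top, relevant_at_bottom, recall)
```

The two precision calls retrieve the same relevant chunks, yet the score drops when they sit at the bottom of the list, which is the "ranked higher or not" behavior described above.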

#### Generation evaluation (answer synthesis)
Generation evaluation measures the appropriateness of the response generated by the system when the context is provided.  We do this with ragas using these two metrics, as described on the ragas documentation website:

- faithfulness: How factually accurate is the generated answer? This measures the factual consistency of the generated answer against the given context. It is calculated from the answer and retrieved context. The answer is scaled to a (0,1) range, with higher being better.

- answer_relevancy: How relevant is the generated answer to the question? Answer relevancy focuses on assessing how pertinent the generated answer is to the given prompt. A lower score is assigned to answers that are incomplete or contain redundant information and higher scores indicate better relevancy. This metric is computed using the question, the context and the answer.
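
The following sketch shows the kind of arithmetic I understand to sit behind these two generation metrics. It assumes the LLM-driven steps have already happened: claims extracted from the answer have been checked against the context, and questions regenerated from the answer have been embedded. The function names and toy vectors are hypothetical.

```python
import math


def faithfulness_score(claim_supported):
    """Fraction of claims in the answer that the retrieved context supports."""
    return sum(claim_supported) / len(claim_supported) if claim_supported else 0.0


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def answer_relevancy_score(question_vec, generated_question_vecs):
    """Mean similarity between the real question and questions inferred from the answer."""
    sims = [cosine_similarity(question_vec, vec) for vec in generated_question_vecs]
    return sum(sims) / len(sims)


# Three of four claims are backed by the context.
faith = faithfulness_score([True, True, False, True])

# Toy 2-d embeddings: one regenerated question matches, one is unrelated.
relevancy = answer_relevancy_score([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])

print(faith, relevancy)
```

An answer that drifts off topic leads to regenerated questions that no longer resemble the original, which pulls the relevancy score down.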

#### End-to-End evaluation (overall system performance)
Beyond providing the metrics for evaluating each stage of RAG pipeline in isolation, ragas provides metrics for the entire RAG system, called end-to-end evaluation. End-to-End metrics are for evaluating the end-to-end performance of a pipeline, gauging the overall experience of using the pipeline. Combining these metrics provides a comprehensive evaluation of the RAG pipeline.  We do this with ragas using these two metrics, as described on the ragas documentation website:

- answer_correctness: Gauges the accuracy of the generated answer when compared to the ground truth. The assessment of answer correctness involves gauging the accuracy of the generated answer when compared to the ground truth. This evaluation relies on the ground truth and the answer, with scores ranging from 0 to 1. A higher score indicates a closer alignment between the generated answer and the ground truth, signifying better correctness.

- answer_similarity: Assesses the semantic resemblance between the generated answer and the ground truth. The concept of answer semantic similarity pertains to the assessment of the semantic resemblance between the generated answer and the ground truth. This evaluation is based on the ground truth and the answer, with values falling within the range of 0 to 1. A higher score signifies a better alignment between the generated answer and the ground truth.
 
Evaluating the end-to-end performance of a pipeline is also crucial, as it directly affects the user experience and helps to ensure a comprehensive evaluation.
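
As a rough sketch of how an end-to-end score can combine factual overlap with semantic similarity, consider the example below. The 0.75 / 0.25 weighting and the F1-style factual component are assumptions based on my reading of the ragas documentation, the counts are invented, and the similarity value is simply passed in rather than computed from embeddings.

```python
def f1_from_counts(true_pos, false_pos, false_neg):
    """F1 over statements: overlap between the answer and the ground truth."""
    denominator = true_pos + 0.5 * (false_pos + false_neg)
    return true_pos / denominator if denominator else 0.0


def answer_correctness_score(true_pos, false_pos, false_neg, similarity,
                             weights=(0.75, 0.25)):
    """Weighted blend of factual F1 and semantic similarity (weights are assumed)."""
    factual = f1_from_counts(true_pos, false_pos, false_neg)
    return weights[0] * factual + weights[1] * similarity


# Invented example: 3 statements match the ground truth, 1 is extra, 1 is missing,
# and the answer's embedding similarity to the ground truth is 0.9.
correctness = answer_correctness_score(3, 1, 1, similarity=0.9)
print(correctness)
```

Here `answer_similarity` corresponds to the `similarity` input on its own, while `answer_correctness` also penalizes statements that are missing from or added to the ground truth.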

### SYNTHETIC DATA GENERATION

Ground truth is a key element for us to conduct this evaluation analysis. But while you (TIFIN) provided evaluation questions, you did not provide any ground truth...

<b>Oh no!</b>

No problem! We can use ragas to generate synthetic data for this purpose.  Ragas gives you the ability to define the types of questions you want to generate, so in this case, we will use the following distribution:

- simple questions comprising 34%
- reasoning questions 33%
- multi_context questions 33%

Initially, we will synthesize 20 questions with ground truth answers to use for the following evaluations.  In a "real world" scenario, we would discuss this distribution, and possibly even add our own category and question types.  We do not get into the individual analysis of each question type in this evaluation, where you might see that a certain RAG pipeline does better with multi_context questions than reasoning questions, but that is also a possibility with the way this framework is set up!

<b>TIFIN Validation Set</b>: We will use the evaluation questions provided by TIFIN as our final validation set using metrics that do not require ground truth (called reference-free metrics).  This multi-tier approach to evaluation reduces the risk of data leakage and over-fitting and aligns with common practices in machine learning development (with test and validation data sets).

In [None]:
# generator with openai models
generator = TestsetGenerator.from_langchain(
    generator_llm,
    critic_llm,
    embedding_function
)

In [None]:
# Extract text from PDF
pdf_reader = PdfReader(pdf_path)
text = ""
for page in pdf_reader.pages:
    text += page.extract_text()

# Split recursively, but not using Document objects
splits = recursive_splitter.split_text(text)

# Create a list of Document objects from the chunks
documents = [Document(page_content=chunk) for chunk in splits]

#### FOR FOLLOWING CODE: Uncomment and run once to generate a source for the test dataset! ####
# generate testset -
testset = generator.generate_with_langchain_docs(
    documents,
    test_size=20,
    distributions={
        simple: 0.34,
        reasoning: 0.33,
        multi_context: 0.33
    }
)

In [None]:
# comparison dataframe
testset_df = testset.to_pandas()

# save dataframes to CSV files in the specified directory
testset_df.to_csv(os.path.join('testset_data.csv'), index=False)

print("testset DataFrame saved successfully in the local directory.")

In [None]:
# Inference sometimes fails, resulting in a NaN for ground truth; let's clean those out:
# Read the input CSV file
df = pd.read_csv('testset_data.csv')

# Remove rows with nan values in the ground_truth column
df = df[df['ground_truth'].notna()]

# Save the updated DataFrame to a new CSV file
df.to_csv('processed_testset_data.csv', index=False)

#### ONCE THIS CODE HAS RUN, WILL HAVE TEST DATASET IN CSV, DONT RUN AGAIN, $$$

In [46]:
# pull data from saved testset, rather than generating above
### load dataframe from CSV file
saved_testset_df = pd.read_csv(os.path.join('processed_testset_data.csv'))
print("testset DataFrame loaded successfully from local directory.")
saved_testset_df.head(5)

testset DataFrame loaded successfully from local directory.


| Unnamed: 0 | question | contexts | ground_truth | evolution_type | metadata | episode_done |
|---|---|---|---|---|---|---|
| 0 | What should investors consider when buying or ... | ['. Buying or selling ETF shares on an exchang... | Investors should consider that buying or selli... | simple | [{}] | True |
| 1 | What factors influence the clinical success pr... | ['Sources: ARK Investment Management LLC, 2024... | The factors influencing the clinical success p... | simple | [{}] | True |
| 2 | What are the potential benefits of investing i... | ['. Autonomy should reduce the cost of taxi, d... | The potential benefits of investing in disrupt... | simple | [{}] | True |
| 3 | What role do adaptive robots play in transform... | ['. Autonomy should reduce the cost of taxi, d... | Adaptive Robots, catalyzed by artificial intel... | simple | [{}] | True |
| 4 | What impact does generative AI have on the cos... | ["Post 1997 assumes constant words per employe... | Generative AI lowers the cost of user-generate... | simple | [{}] | True |


### PREPARE DATASET BASED ON SYNTHESIZED DATA

In [47]:
# Convert the DataFrame to a dictionary
saved_testing_data = saved_testset_df.astype(str).to_dict(orient='list')

# Create the testing_dataset
saved_testing_dataset = Dataset.from_dict(saved_testing_data)

# Drop the columns the evaluation does not use, keeping
# "question", "contexts", "ground_truth", and "metadata"
saved_testing_dataset_sm = saved_testing_dataset.remove_columns(["evolution_type", "episode_done"])

In [48]:
saved_testing_dataset_sm

Dataset({
    features: ['question', 'contexts', 'ground_truth', 'metadata'],
    num_rows: 18
})

At this point, we have 18 question/answer pairs that we can use to evaluate each of our pipelines against the baseline.  We will start by generating the baseline scores that all other scores will be compared against.

### EVAL SETS FOR EACH CHAIN

In [49]:
# Function to generate answers using the RAG chain
def generate_answer(question, ground_truth, rag_chain):
    result = rag_chain.invoke(question)
    return {
        "question": question,
        "answer": result["answer"]["final_answer"],
        "contexts": [doc.page_content for doc in result["context"]],
        "ground_truth": ground_truth
    }
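
To show how this helper might be applied across the whole test set, here is a small self-contained sketch. `StubChain` and `StubDoc` are hypothetical stand-ins that only mimic the output shape of the chains in this notebook, and `build_eval_data` is an illustrative name of my own. In the notebook, a real chain such as `baseline_final_chain` would take the stub's place.

```python
class StubDoc:
    """Hypothetical stand-in for a retrieved LangChain Document."""

    def __init__(self, page_content):
        self.page_content = page_content


class StubChain:
    """Hypothetical chain returning the same output shape as the notebook's chains."""

    def invoke(self, question):
        return {
            "context": [StubDoc("chunk one"), StubDoc("chunk two")],
            "question": question,
            "answer": {"relevance_score": "5", "final_answer": f"Answer to: {question}"},
        }


def generate_answer(question, ground_truth, rag_chain):
    """Same logic as the notebook helper: run the chain and flatten its output."""
    result = rag_chain.invoke(question)
    return {
        "question": question,
        "answer": result["answer"]["final_answer"],
        "contexts": [doc.page_content for doc in result["context"]],
        "ground_truth": ground_truth,
    }


def build_eval_data(questions, ground_truths, rag_chain):
    """Run every question through the chain and return column-oriented data."""
    rows = [generate_answer(q, gt, rag_chain) for q, gt in zip(questions, ground_truths)]
    columns = ("question", "answer", "contexts", "ground_truth")
    return {column: [row[column] for row in rows] for column in columns}


eval_data = build_eval_data(
    ["What is disruptive innovation?", "Why diversify?"],
    ["Ground truth A", "Ground truth B"],
    StubChain(),
)
print(eval_data["answer"])
```

The column-oriented dictionary has one list per field, which is the shape a `Dataset.from_dict` call expects.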

In [50]:
## EVALUATION VARIABLES AND HELPER FUNCTIONS 
# Analysis that consolidates everything into easier-to-read scores
# key columns to compare
key_columns = [
    'faithfulness',
    'answer_relevancy',
    'context_precision',
    'context_recall',
    'answer_correctness',
    'answer_similarity'
]

ref_free_key_columns = ['faithfulness','answer_relevancy']

metrics_to_track = [
        faithfulness,
        answer_relevancy,
        context_precision,
        context_recall,
        answer_correctness,
        answer_similarity
    ]

ref_free_metrics_to_track = [
        faithfulness,
        answer_relevancy
    ]

baseline_run = "baseline_run"
recursive_run = "recursive_run"
hybrid_run = "hybrid_run"
gpt4o_run = "gpt4o_run"
hybrid_3070_run = "hybrid3070_run"
newembed_run = "new_embeddings_run"
queryexpand_run = "query_expansion_run"
ult_run = "ult_combo_run"
tifin_eval_utl_run = "tifin_eval_run"

def generate_comparisons(new_df, run_name):
    
    baseline_run = "baseline_run"
    new_means = new_df[key_columns].mean()
    
    # comparison dataframe
    comparison_df = pd.DataFrame({baseline_run: baseline_means, run_name: new_means})
    
    # difference between the means
    comparison_df['Difference'] = comparison_df[run_name] - comparison_df[baseline_run] 
    
    new_df.to_csv(os.path.join(f'{run_name}_data.csv'), index=False)
    comparison_df.to_csv(os.path.join(f'{run_name}_comparison_data.csv'), index=True)
    
    print("Dataframes saved successfully in the local directory.")

    ### load dataframes from CSV files
    comparison_df = pd.read_csv(os.path.join(f'{run_name}_comparison_data.csv'), index_col=0)
    
    print("Dataframes loaded successfully from the local directory.")
    
    # Analysis that consolidates everything into easier to read scores
    print("Performance Comparison:")
    print("\n**Retrieval**:")
    print(comparison_df.loc[['context_precision', 'context_recall']])
    print("\n**Generation**:")
    print(comparison_df.loc[['faithfulness', 'answer_relevancy']])
    print("\n**End-to-end evaluation**:")
    print(comparison_df.loc[['answer_correctness', 'answer_similarity']])

    # plotting - create subplots for each category with increased spacing
    fig, axes = plt.subplots(3, 1, figsize=(12, 18), sharex=False)
    bar_width = 0.35
    categories = ['Retrieval', 'Generation', 'End-to-end evaluation']
    metrics = [
        ['context_precision', 'context_recall'],
        ['faithfulness', 'answer_relevancy'],
        ['answer_correctness', 'answer_similarity']
    ]
    
    # iterate over each category and plot the corresponding metrics
    for i, (category, metric_list) in enumerate(zip(categories, metrics)):
        ax = axes[i]
        x = range(len(metric_list))
    
        # plot bars for the baseline run (hex color #D51900)
        baseline_bars = ax.bar(x, comparison_df.loc[metric_list, baseline_run], width=bar_width, label=baseline_run, color='#D51900', hatch='///')
    
        # label each baseline bar with its score
        for bar in baseline_bars:
            height = bar.get_height()
            ax.text(bar.get_x() + bar.get_width() / 2, height, f'{height:.1%}', ha='center', va='bottom', fontsize=10)
    
        # plot bars for the comparison run (hex color #992111), offset by one bar width
        run_bars = ax.bar([pos + bar_width for pos in x], comparison_df.loc[metric_list, run_name], width=bar_width, label=run_name, color='#992111', hatch='\\\\\\')
    
        # label each comparison bar with its score
        for bar in run_bars:
            height = bar.get_height()
            ax.text(bar.get_x() + bar.get_width() / 2, height, f'{height:.1%}', ha='center', va='bottom', fontsize=10)
    
        ax.set_title(category, fontsize=14, pad=20)
        ax.set_xticks([pos + bar_width / 2 for pos in x])
        ax.set_xticklabels(metric_list, rotation=45, ha='right', fontsize=12)
    
        # anchor the legend's lower-right corner to the top-right corner of the axes
        ax.legend(fontsize=12, loc='lower right', bbox_to_anchor=(1, 1))
    
    # Add overall labels and title
    fig.text(0.04, 0.5, 'Scores', va='center', rotation='vertical', fontsize=14)
    fig.suptitle('Performance Comparison', fontsize=16)
    
    # adjust the spacing between subplots and increase the top margin
    plt.tight_layout(rect=[0.05, 0.03, 1, 0.95])
    plt.subplots_adjust(hspace=0.6, top=0.92)
    plt.show()

def generate_comparisons_reference_free(baseline_means, new_df, run_name):
    
    baseline_run = "baseline_run"
        
    new_means = new_df[ref_free_key_columns].mean()
    
    # comparison dataframe
    comparison_df = pd.DataFrame({baseline_run: baseline_means, run_name: new_means})
    
    # difference between the means
    comparison_df['Difference'] = comparison_df[run_name] - comparison_df[baseline_run] 
    
    new_df.to_csv(os.path.join(f'{run_name}_data.csv'), index=False)
    comparison_df.to_csv(os.path.join(f'{run_name}_comparison_data.csv'), index=True)
    
    print("Dataframes saved successfully in the local directory.")


    ### load dataframes from CSV files
    comparison_df = pd.read_csv(os.path.join(f'{run_name}_comparison_data.csv'), index_col=0)
    
    print("Dataframes loaded successfully from the local directory.")

    # Analysis that consolidates everything into easier to read scores
    print("Performance Comparison:")
    print("\n**Generation**:")
    print(comparison_df.loc[ref_free_key_columns])
    
    categories = ['Generation']
    metrics = [ref_free_key_columns]
    
    # plotting - create subplots for each category with increased spacing
    fig, ax = plt.subplots(figsize=(12, 6))
    bar_width = 0.35
    
    # iterate over each category and plot the corresponding metrics
    for i, (category, metric_list) in enumerate(zip(categories, metrics)):
        x = range(len(metric_list))
    
        # plot bars for the baseline run (hex color #D51900)
        baseline_bars = ax.bar(x, comparison_df.loc[metric_list, baseline_run], width=bar_width, label=baseline_run, color='#D51900', hatch='///')
    
        # label each baseline bar with its score
        for bar in baseline_bars:
            height = bar.get_height()
            ax.text(bar.get_x() + bar.get_width() / 2, height, f'{height:.1%}', ha='center', va='bottom', fontsize=10)
    
        # plot bars for the comparison run (hex color #992111), offset by one bar width
        run_bars = ax.bar([pos + bar_width for pos in x], comparison_df.loc[metric_list, run_name], width=bar_width, label=run_name, color='#992111', hatch='\\\\\\')
    
        # label each comparison bar with its score
        for bar in run_bars:
            height = bar.get_height()
            ax.text(bar.get_x() + bar.get_width() / 2, height, f'{height:.1%}', ha='center', va='bottom', fontsize=10)
    
        ax.set_title(category, fontsize=14, pad=30)  # Increase the padding to leave room between the title and the bars
        ax.set_xticks([pos + bar_width / 2 for pos in x])
        ax.set_xticklabels(metric_list, rotation=45, ha='right', fontsize=12)
    
        # anchor the legend's lower-right corner to the top-right corner of the axes
        ax.legend(fontsize=12, loc='lower right', bbox_to_anchor=(1, 1))
    
    # Add overall labels and title
    fig.text(0.04, 0.5, 'Scores', va='center', rotation='vertical', fontsize=14)
    fig.suptitle('Performance Comparison', fontsize=16)
    
    # adjust the spacing between subplots and increase the top margin
    plt.tight_layout(rect=[0.05, 0.03, 1, 0.95])
    plt.subplots_adjust(top=0.85)  # Increase the top margin to push the chart down
    plt.show()

### ENHANCEMENT #0 EVAL: BASELINE

In [None]:
# BASELINE EVAL - Baseline RAG with Dense Vectors
baseline_testing_dataset = saved_testing_dataset_sm.map(lambda x: generate_answer(x["question"], x["ground_truth"], baseline_final_chain), remove_columns=saved_testing_dataset_sm.column_names)

In [None]:
# Similarity search score
baseline_score = evaluate(baseline_testing_dataset,metrics=metrics_to_track)
baseline_df = baseline_score.to_pandas()
baseline_df

In [None]:
# mean scores for each key column
baseline_means = baseline_df[key_columns].mean()

# save dataframes to CSV files in the specified directory
baseline_df.to_csv(os.path.join('baseline_run_data.csv'), index=False)

<b>Baseline evaluation discussion</b>

These scores represent our baseline evaluation. By themselves, they already provide some value. This is typically a good time to step through each question looking for outliers: anything with a 0 or very low score and, conversely, anything with a very high or perfect score. These can indicate problems in the data or the code, or more general areas where our RAG pipeline is limited.
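
The outlier pass described above could be sketched roughly as follows. This is a minimal illustration, not part of the original pipeline: the helper name `flag_outliers`, the `low`/`high` cutoffs, and the tiny demo frame are all assumptions made for the example, and the cutoffs are not ragas defaults.

```python
import pandas as pd

def flag_outliers(df, columns, low=0.2, high=0.99):
    """Return one row per (question, metric) whose score looks suspicious.

    `low` and `high` are illustrative cutoffs (assumptions), not ragas defaults.
    """
    rows = []
    for col in columns:
        for idx, score in df[col].items():
            if pd.isna(score):
                reason = "missing"   # ragas can return NaN when a metric fails to parse
            elif score <= low:
                reason = "low"
            elif score >= high:
                reason = "high"
            else:
                continue
            rows.append({
                "row": idx,
                "question": df.at[idx, "question"],
                "metric": col,
                "score": score,
                "flag": reason,
            })
    return pd.DataFrame(rows, columns=["row", "question", "metric", "score", "flag"])

# Made-up scores standing in for baseline_df (hypothetical data)
demo_df = pd.DataFrame({
    "question": ["q1", "q2", "q3"],
    "faithfulness": [0.0, 0.75, 1.0],
    "answer_relevancy": [0.95, 0.10, 0.90],
})
flagged = flag_outliers(demo_df, ["faithfulness", "answer_relevancy"])
print(flagged)
```

In the notebook, the equivalent call would be something like `flag_outliers(baseline_df, key_columns)`, followed by reading the flagged questions, answers, and contexts by hand.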

### ENHANCEMENT EVAL #1: USING RECURSIVE CHARACTER TEXT SPLITTER

In [None]:
# Recursive Splitter RAG with Dense Vectors
recursive_testing_dataset = saved_testing_dataset_sm.map(lambda x: generate_answer(x["question"], x["ground_truth"], recursive_final_chain), remove_columns=saved_testing_dataset_sm.column_names)

In [None]:
# Similarity search score
recursive_score = evaluate(recursive_testing_dataset,metrics=metrics_to_track)
recursive_df = recursive_score.to_pandas()
recursive_df

In [None]:
generate_comparisons(recursive_df, recursive_run)

<b>ENHANCEMENT EVAL #1 Evaluation discussion</b>:

The recursive approach scores better across the board, so we will add it to our selected enhancements.

### ENHANCEMENT EVAL #2: USING HYBRID SEARCH

In [None]:
# Hybrid RAG
hybrid_testing_dataset = saved_testing_dataset_sm.map(lambda x: generate_answer(x["question"], x["ground_truth"], hybrid_final_chain), remove_columns=saved_testing_dataset_sm.column_names)

In [None]:
# hybrid score
hybrid_score = evaluate(hybrid_testing_dataset,metrics=metrics_to_track)
hybrid_df = hybrid_score.to_pandas()
hybrid_df

In [None]:
generate_comparisons(hybrid_df, hybrid_run)

<b>ENHANCEMENT EVAL #2 Evaluation discussion</b>:

We saw significant improvement in context precision, while context recall stayed the same.  

While adding hybrid search is focused on improving retrieval, generation is almost always impacted as well because it is downstream from retrieval. In this case, faithfulness increases significantly, while answer_relevancy drops slightly. The end-to-end measures show slight increases. Considering the small sample size, this is likely a statistical draw.

Conclusion: Will compare this to the 30/70 hybrid approach, but it does seem that a hybrid approach in general can improve the pipeline.

### ENHANCEMENT EVAL #3: USING GPT-4o-Mini 

In [None]:
# GPT 4o RAG
gpt4o_testing_dataset = saved_testing_dataset_sm.map(lambda x: generate_answer(x["question"], x["ground_truth"], gpt4o_final_chain), remove_columns=saved_testing_dataset_sm.column_names)

In [None]:
# hybrid 4o score
gpt4o_score = evaluate(gpt4o_testing_dataset,metrics=metrics_to_track)
gpt4o_df = gpt4o_score.to_pandas()
gpt4o_df

In [None]:
generate_comparisons(gpt4o_df, gpt4o_run)

<b>ENHANCEMENT EVAL #3 Evaluation discussion</b>:

Generation and end-to-end evaluation actually dropped significantly. Given these results, we will not implement this enhancement in the final pipeline.

Note: It seems highly unlikely (but not impossible) that ChatGPT 3.5 Turbo would perform better than ChatGPT 4o in any scenario. Given more time, digging much further into these results would be one of my first tasks after this analysis.

### ENHANCEMENT EVAL #4: USING HYBRID SEARCH WITH 30% / 70% split (dense / sparse)

In [None]:
# Hybrid 4o RAG 30/70 split
hybrid_3070split_testing_dataset = saved_testing_dataset_sm.map(lambda x: generate_answer(x["question"], x["ground_truth"], hybrid_3070split_final_chain), remove_columns=saved_testing_dataset_sm.column_names)

In [None]:
# hybrid 4o 30/70 score
hybrid_3070split_score = evaluate(hybrid_3070split_testing_dataset,metrics=metrics_to_track)
hybrid_3070split_df = hybrid_3070split_score.to_pandas()
hybrid_3070split_df

In [None]:
generate_comparisons(hybrid_3070split_df, hybrid_3070_run)

<b>ENHANCEMENT EVAL #4 Evaluation discussion</b>:

Note that I added the scores for the original hybrid above for comparison, which had a 50/50 weighting for dense/sparse.  If compared to just the hybrid approaches, the 50/50 version performed notably better for context precision, but the 30/70 still improved over the baseline.  The context_recall is where the 30/70 really made an improvement, where the 50/50 version stayed even with the baseline.  These are the most direct measures, as hybrid search is focused on retrieval improvement, but let's look at downstream impact.

For the generation metrics, we saw a very significant improvement in faithfulness from both approaches, and the 50/50 approach was also significantly better than the 30/70. Answer relevancy was already very high with the baseline, and the hybrid approaches did about the same. The end-to-end measures showed about the same slight improvement with both versions of hybrid.

Given these scores, it depends on your priorities as to what you determine to be the best approach.  Retrieval might be more important in some cases, and so you could make a case for either approach.  But in my case, and per the assignment, I am aiming to improve coherence, faithfulness, and relevance of answers.  Given that, I will go with the approach that represents the most improved version of generation, which is the 50/50 weighted hybrid approach.

### ENHANCEMENT EVAL #5: new embedding model - option 1 - best small open-source embedding model available

In [None]:
# Optimized Embeddings
newembed_testing_dataset = saved_testing_dataset_sm.map(lambda x: generate_answer(x["question"], x["ground_truth"], newembed_final_chain), remove_columns=saved_testing_dataset_sm.column_names)

In [None]:
# Similarity search score
newembed_score = evaluate(newembed_testing_dataset,metrics=metrics_to_track)
newembed_df = newembed_score.to_pandas()
newembed_df

In [None]:
generate_comparisons(newembed_df, newembed_run)

<b>ENHANCEMENT EVAL #5 Evaluation discussion</b>:

Retrieval improved, which is the key area where new embeddings should help. There was also a slight improvement in the end-to-end metrics and a large increase in faithfulness.

Given these results, we will implement the upgraded embeddings model in the final pipeline implementation.

### ENHANCEMENT EVAL #6: Query Expansion

In [None]:
# Query Expansion
query_expand = QueryExpander()
queryexpand_testing_dataset = saved_testing_dataset_sm.map(lambda x: generate_answer(x["question"], x["ground_truth"], query_expand), remove_columns=saved_testing_dataset_sm.column_names)

In [None]:
# Similarity search score
queryexpand_score = evaluate(queryexpand_testing_dataset,metrics=metrics_to_track)
queryexpand_df = queryexpand_score.to_pandas()
queryexpand_df

In [None]:
generate_comparisons(queryexpand_df, queryexpand_run) 

<b>ENHANCEMENT EVAL #6 Query Expansion Discussion</b>:

Unfortunately, given the large drop in the generation metrics and the slight drop in the end-to-end evaluations, these results suggest that the query expansion technique did not improve this pipeline. We will not implement this approach in the final pipeline.

This is an area where we can likely make quick and effective improvements by testing out different prompts.  I wouldn't throw this approach out just yet, but more testing and evaluation is warranted before deciding to use it.

### ENHANCEMENT EVAL #7 Final Evaluation using the "ultimate combination" of enhancements evaluated

Again, for the "ultimate combo", we take the results from above and combine the enhancements that seem to give the best overall approach. This allows us to extend the analysis of each individual enhancement, and also to see how they work together.

Here is the final list of enhancements applied to this pipeline:

- Text Splitter - Recursive
- LLM - 3.5 Turbo
- Search - Hybrid 50/50
- query expansion - No
- Embedding - large

In [None]:
# Ultimate combo
ult_testing_dataset = saved_testing_dataset_sm.map(lambda x: generate_answer(x["question"], x["ground_truth"], ult_final_chain), remove_columns=saved_testing_dataset_sm.column_names)

In [None]:
# Ultimate combo score
ult_score = evaluate(ult_testing_dataset,metrics=metrics_to_track)
ult_df = ult_score.to_pandas()
ult_df

In [None]:
generate_comparisons(ult_df, ult_run)

<b>Discussion of ultimate combo</b>

We saw notable improvements across the board, but we did not see a cumulative effect that is better than any one enhancement.  This is something that would need to be further researched to make sure some enhancements are not interfering with others.  The data is all there to look through and will reveal these answers.  This analysis just gives us the indication that we should take those steps.

### Heatmap of all scores

I ran out of time, but ideally we would use all of the run data to generate a full heatmap comparing all of the results side by side.
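
A rough sketch of what that heatmap could look like is below. It is not something that was run for this analysis: the helper names `build_score_table` and `plot_score_heatmap` are hypothetical, and the demo scores are made up. In the notebook, the per-run frames could be read back from the `{run_name}_data.csv` files that `generate_comparisons` writes.

```python
import pandas as pd

def build_score_table(run_frames, metric_columns):
    """Return a runs x metrics table of mean scores.

    `run_frames` maps a run name to its per-question ragas DataFrame.
    """
    means = {name: frame[metric_columns].mean() for name, frame in run_frames.items()}
    # Series become columns, so transpose to get one row per run
    return pd.DataFrame(means).T[metric_columns]

def plot_score_heatmap(score_table):
    """Draw the score table as an annotated heatmap and return the figure."""
    # imported lazily so the table helper above has no plotting dependency
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(figsize=(10, 6))
    image = ax.imshow(score_table.to_numpy(), cmap="Reds", vmin=0.0, vmax=1.0, aspect="auto")
    ax.set_xticks(range(len(score_table.columns)))
    ax.set_xticklabels(score_table.columns, rotation=45, ha="right")
    ax.set_yticks(range(len(score_table.index)))
    ax.set_yticklabels(score_table.index)
    # write the mean score inside each cell
    for r in range(score_table.shape[0]):
        for c in range(score_table.shape[1]):
            ax.text(c, r, f"{score_table.iat[r, c]:.1%}", ha="center", va="center")
    fig.colorbar(image, ax=ax, label="Mean score")
    ax.set_title("Mean ragas scores by run")
    fig.tight_layout()
    return fig

# Made-up scores for two runs (hypothetical data, two questions each)
demo_runs = {
    "baseline_run": pd.DataFrame({"faithfulness": [0.5, 1.0], "answer_relevancy": [0.9, 0.7]}),
    "hybrid_run": pd.DataFrame({"faithfulness": [1.0, 1.0], "answer_relevancy": [0.8, 0.6]}),
}
score_table = build_score_table(demo_runs, ["faithfulness", "answer_relevancy"])
print(score_table)
```

Calling `plot_score_heatmap(score_table)` in a notebook cell would then render the grid, with one row per run and one column per metric.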

# TIFIN EVALUATION QUESTION PROCESSING

[Back to top](#ARK-Disruptive-Innovation-Insights)

We use the combination of best approaches from the previous section to generate responses to the TIFIN evaluation questions.  We then compare how a select group of the enhancements perform with those enhanced results as ground truth. 

Using the final results from the evaluation of the synthesized data, I selected the best combination of features to generate/synthesize answers for the evaluation questions provided by TIFIN.  For the sake of this evaluation, we can consider those responses the "best of class" given the available technology.  Now, we will go back and run a final evaluation of a selection of the implementations using the evaluation questions and "best of class" synthesized responses.

This will allow us to use the 20 provided evaluation questions to benchmark both the baseline and some of the advanced RAG methods. We will document the performance improvements achieved by the advanced methods over the baseline.

### KEY THINGS TO NOTE FROM THIS EVALUATION
1. Note the modularity of the pipelines, where I was able to swap in different components where needed without having to rebuild the entire pipeline.  With a little more effort, this could become very modular, improving the flexibility and efficiency of the development process.
2. All of the data generated to run these evaluations is included with this code.  This allows you to reproduce the evaluations I ran exactly as I ran them.  This alleviates the need to re-run any of the expensive ragas evaluation generation, which makes numerous calls to LLM and embedding APIs to generate the evaluation.  You just need to re-run the final function in each evaluation section, and it will re-use the csv files that were generated with the ragas data.
3. This is really more of a demonstration of the evaluation capabilities that I can add to a RAG pipeline and the insights they can provide. The results portrayed here should probably not be taken seriously without further investigation. With only 20 testing samples, and another 20 evaluation samples, this is likely not giving you a full picture of the performance of these enhancements. In a "real world" scenario, I would increase the testing data substantially and spend more effort breaking out the "types" of questions and testing them individually, so that we can ensure the outcomes fit the priorities of the organization more closely. With such a small sample size, you likely do not have enough statistical coverage for a reliable baseline, nor reliable scores for the other evaluations. For example, if I re-ran the synthetic data generation and then re-ran the evaluations, the scores could vary by 5-15% (speaking from experience). This means the "real" baseline could be 15% higher and another evaluation score 15% lower, significantly altering our view of that result. The larger your sample size, the more you reduce that variance and converge on the proper scores.
4. While generating a larger sample size is relatively easy with this code (you just change the number in the generation function), it adds significant expense across the effort.  So I had to avoid it in this analysis when I am paying for it, but it should be relatively inexpensive from an enterprise standpoint.
5. We would also perform EDA on the questions, answers, ground truth, and context, and use this to identify additional issues to address within the pipeline.  In my experience, this EDA has been particularly helpful in identifying how to make very specific improvements that can dramatically improve the performance of the RAG pipeline(s).
6. I would have liked to provide more commentary on the individual results from each enhancement, but I ran up against time constraints. I was able to summarize my thoughts for each though, and generally, I would have gone directly into the data for each result, looked for outliers, overfitting, and other common issues where we can make quick improvements.  These are definitely present in this data.  I would also have discussed each metric specifically, indicating what each meant, how we can use it in improving the pipeline, and the next steps based on the results of that specific metric.
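
One way to make the sample-size caveat in point 3 concrete is to put an interval around each score difference instead of reading the difference of means on its own. The sketch below uses a paired percentile bootstrap; it was not part of the original analysis, the function name and defaults are assumptions, and the scores are made up.

```python
import random
import statistics

def bootstrap_mean_difference(baseline_scores, new_scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for mean(new) - mean(baseline).

    Scores are paired by question, so the per-question differences are
    resampled with replacement rather than the two lists independently.
    """
    if len(baseline_scores) != len(new_scores):
        raise ValueError("score lists must be paired by question")
    rng = random.Random(seed)
    diffs = [new - base for base, new in zip(baseline_scores, new_scores)]
    n = len(diffs)
    resampled_means = sorted(
        statistics.fmean(rng.choices(diffs, k=n)) for _ in range(n_resamples)
    )
    lower = resampled_means[int((alpha / 2) * n_resamples)]
    upper = resampled_means[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.fmean(diffs), lower, upper

# Made-up faithfulness scores for eight questions (hypothetical data)
baseline = [0.50, 0.80, 0.00, 1.00, 0.60, 0.75, 0.40, 0.90]
enhanced = [0.60, 0.70, 0.50, 1.00, 0.55, 0.80, 0.30, 1.00]
mean_diff, ci_low, ci_high = bootstrap_mean_difference(baseline, enhanced)
print(f"mean difference: {mean_diff:+.3f}, 95% interval: [{ci_low:+.3f}, {ci_high:+.3f}]")
```

If the interval comfortably spans zero, the enhancement and the baseline are hard to tell apart at that sample size, which is the "statistical draw" situation mentioned in the earlier discussions.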

### Formatting/Ingesting TIFIN Questions
The following code ingests the evaluation questions provided by TIFIN and converts them into the data format needed to run the ragas evaluation outlined previously.

In [57]:
# ingest the evaluation questions into the evaluation format
import csv

# Read the questions from the text file
with open('Evaluation_Questions.txt', 'r') as file:
    questions = [line.strip() for line in file if line.strip()]

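# Strip the leading question number (e.g. "3. "), splitting only on the first '. '
# so that questions containing '. ' later in the text are not truncated
questions = [question.split('. ', 1)[1] for question in questions]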

# Create the CSV data
csv_data = [
    ['question', 'contexts', 'ground_truth', 'evolution_type', 'metadata', 'episode_done']
]

for question in questions:
    csv_data.append([question, '[]', '', 'multi_context', '[{}]', 'True'])

# Write the CSV data to a file
with open('tifineval.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerows(csv_data)

In [58]:
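# pull data from the saved testset rather than regenerating it
# Note: this reads 'tifin_eval.csv', while the previous cell writes 'tifineval.csv';
# the two file names need to match when running from scratch
### load dataframe from CSV file
tifin_eval_testset_df = pd.read_csv(os.path.join('tifin_eval.csv'))
print("testset DataFrame loaded successfully from local directory.")
tifin_eval_testset_df.head(5)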

#### ONCE THIS CODE HAS RUN, WILL HAVE TEST DATASET IN CSV, DONT RUN AGAIN, $$$

# Convert the DataFrame to a dictionary
tifin_eval_testing_data = tifin_eval_testset_df.astype(str).to_dict(orient='list')

# Create the testing_dataset
tifin_eval_testing_dataset = Dataset.from_dict(tifin_eval_testing_data)

# Drop the columns the evaluation does not need;
# "question", "contexts", "ground_truth", and "metadata" remain
tifin_eval_testing_dataset_sm = tifin_eval_testing_dataset.remove_columns(["evolution_type", "episode_done"])

tifin_eval_testing_dataset_sm

testset DataFrame loaded successfully from local directory.


Dataset({
    features: ['question', 'contexts', 'ground_truth', 'metadata'],
    num_rows: 20
})

### TIFIN QUESTIONS BASELINE

In [None]:
# Run a baseline eval with the TIFIN evaluation data (so we have a direct comparison - apples to apples)
# TIFIN eval with the baseline pipeline
tifin_eval_baseline_testing_dataset = tifin_eval_testing_dataset_sm.map(lambda x: generate_answer(x["question"], x["ground_truth"],baseline_final_chain), remove_columns=tifin_eval_testing_dataset_sm.column_names)

In [None]:
# Baseline score on the TIFIN questions (reference-free metrics only)
tifin_eval_baseline_score = evaluate(
    tifin_eval_baseline_testing_dataset,
    metrics=[
        faithfulness,
        answer_relevancy,
    ]
)
tifin_eval_baseline_df = tifin_eval_baseline_score.to_pandas()
tifin_eval_baseline_df

In [None]:
# mean scores for each key column
tifin_eval_baseline_means = tifin_eval_baseline_df[ref_free_key_columns].mean()

# save dataframes to CSV files in the specified directory
tifin_eval_baseline_df.to_csv(os.path.join('tifin_eval_baseline_run_data.csv'), index=False)

### TIFIN QUESTIONS WITH ULTIMATE COMBO PIPELINE

In [None]:
# compare the TIFIN eval baseline scores with the ult combo scores

In [None]:
# TIFIN eval with the Ultimate combo
tifin_eval_utl_testing_dataset = tifin_eval_testing_dataset_sm.map(lambda x: generate_answer(x["question"], x["ground_truth"],ult_final_chain), remove_columns=tifin_eval_testing_dataset_sm.column_names)

In [None]:
# Ultimate combo score
tifin_eval_utl_score = evaluate(
    tifin_eval_utl_testing_dataset,
    metrics=[
        faithfulness,
        answer_relevancy,
    ]
)
tifin_eval_utl_df = tifin_eval_utl_score.to_pandas()
tifin_eval_utl_df

In [None]:
generate_comparisons_reference_free(tifin_eval_baseline_means, tifin_eval_utl_df, tifin_eval_utl_run)

### Note About Analysis: Reference Free Metrics
You may be wondering why there are only 2 metrics here when there are 6 in the previous evaluations.

These metrics are considered "reference-free" because they do not require ground truth as a reference, unlike many other metrics in this field. This is an innovative concept discussed in much more depth in the following paper, which is the inspiration for the ragas platform:

https://arxiv.org/abs/2309.15217

Ragas has emphasized building reference-free metrics in their work. For many fields where ground truth is difficult to collect, this is an important point, as it makes at least some of the evaluation still possible. Faithfulness and answer relevance are two of the reference-free metrics proposed in the paper. This can be particularly useful when implementing metrics in production pipelines, where ground truth is not readily available, and data is constantly streaming in and needs to be monitored for changes and degradation.

For this analysis, this concept fits well with the questions provided by TIFIN, since they do not have ground truth included with them.
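
As a reminder of what these two scores measure, the definitions below sketch how the paper linked above describes them. The exact prompts and the number of generated questions used by the ragas version in this notebook are not shown here, so treat those details as assumptions.

Faithfulness is the fraction of statements extracted from the answer that the retrieved context supports, where $S$ is the set of extracted statements and $V \subseteq S$ is the subset judged to be supported:

$$F = \frac{|V|}{|S|}$$

Answer relevance is the average embedding similarity between the original question $q$ and $n$ questions $q_i$ that an LLM generates from the answer alone:

$$AR = \frac{1}{n}\sum_{i=1}^{n} \operatorname{sim}(q, q_i)$$

For illustration only: an answer with three extracted statements, one of which the context supports, would score $F = 1/3 \approx 0.333$. Neither formula involves a ground-truth answer, which is why both can be computed for questions that come without one.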

<b>Full disclosure</b>: I personally know the primary author of this paper (and the ragas platform), Shahul Es, and have had discussions with him about how and why these metrics are so important to this type of analysis. Shahul Es has even written the foreword for my book, which will be released in September and is focused on implementing RAG in the enterprise. I will include the full chapter 9 from my book on RAG evaluation with this submission. I've made it a point to get to know this company because they are doing such great work in the field.

### Discussion:

We saw a significant improvement in faithfulness and a significant decrease in answer relevancy.

With further analysis, we need to look directly at the data and determine why we are seeing that decrease.  For example, here are some problematic scores for the evaluation questions:

<b>#3	Can you list the converging innovation platforms identified by ARK?</b>

We scored 0% on faithfulness and answer relevancy with our "ultimate combo" of enhancements. Compare this to the baseline pipeline, where we scored 100% for faithfulness and 99% on answer relevancy. This is a big red flag. Clearly the data needed to score well is there, since the baseline found it. There may even be a bug in the code (such as the context formatting causing issues with the LLM inference). This sort of insight will help us go into the code and data, make changes as required, and incrementally and methodically improve the RAG pipeline.

<b>#8	What does the ARK’s Convergence Scoring Framework illustrate about innovation platforms?</b>

We scored 33% on faithfulness, but 96% on answer relevancy with our "ultimate combo" of enhancements. Compare this to the baseline pipeline, where we scored 60% for faithfulness and 98% on answer relevancy.

Something seems off with the faithfulness score for this particular question, which suggests we look further into this area to determine the cause(s). Perhaps the context is off due to the hybrid search.

<b>#11	How do AI Chatbots contribute to the development of robotaxis?</b>

We scored 0% on faithfulness, but 100% on answer relevancy with our "ultimate combo" of enhancements. That is a big red flag; there may even be a bug in the code (such as the context formatting causing issues with the LLM inference). Again, compare this to the baseline pipeline, where we get the exact same scores. That is also a red flag, since we should see some sort of difference based on the enhancements we introduced.

<b>#12	What are breakthroughs in DNA Sequencing, particularly with neural networks?</b>

Our scores between the baseline and "ultimate combo" are similar here:

- Baseline: faithfulness 0.38461538461538464, answer_relevancy 0.8471160564373204
- Ultimate combo: faithfulness 0.3333333333333333, answer_relevancy 0.9379678735358913

We saw a drop in faithfulness, but a significant improvement in answer_relevancy. More generally, faithfulness is significantly lower than most other scores, so this is another area for further analysis.

<b>#18	What thematic strategies do ARK ETFs focus on?</b>
On both baseline and "ultimate combo" we have a faithfulness of 0.0. Our answer relevancy scores are very high, with an improvement to near perfect, from a baseline of 0.9695597827924756 to an ultimate combo score of 0.9999999999999997. This is another case where faithfulness and answer relevancy are so different that it raises red flags and warrants going back to check what is going on.

### Future analysis

After this initial analysis, it would also be prudent to go back and look at the high scores with just as much scrutiny. Is there a valid reason we get 100% on many scores? Given the nature of Generative AI, this suggests there may be some overfitting occurring.

I would note as well, as we start to recognize patterns and understand the challenges more in-depth, this type of analysis can be set up to be more systematic, even using generative AI models to help to look for potential issues and solutions.
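
A first step toward making this review systematic could look like the sketch below, which encodes the three patterns called out in the discussion: a metric regressing against the baseline, a large gap between faithfulness and answer relevancy, and scores identical to the baseline. The function name and thresholds are assumptions; the first two demo rows use the rounded scores quoted above for questions #3 and #11, and the third row is made up.

```python
import pandas as pd

def find_red_flags(baseline_df, new_df, drop_threshold=0.25, gap_threshold=0.5):
    """Compare two per-question score frames and list the questions worth a manual look.

    Both thresholds are illustrative assumptions.
    """
    merged = baseline_df.merge(new_df, on="question", suffixes=("_base", "_new"))
    flags = []
    for _, row in merged.iterrows():
        reasons = []
        # 1. a metric that got noticeably worse than the baseline
        for metric in ("faithfulness", "answer_relevancy"):
            if row[f"{metric}_base"] - row[f"{metric}_new"] >= drop_threshold:
                reasons.append(f"{metric} regressed")
        # 2. the two metrics disagree strongly for the new run
        if abs(row["answer_relevancy_new"] - row["faithfulness_new"]) >= gap_threshold:
            reasons.append("faithfulness/relevancy gap")
        # 3. the enhancements changed nothing at all
        if (row["faithfulness_base"] == row["faithfulness_new"]
                and row["answer_relevancy_base"] == row["answer_relevancy_new"]):
            reasons.append("identical to baseline")
        if reasons:
            flags.append({"question": row["question"], "reasons": "; ".join(reasons)})
    return pd.DataFrame(flags, columns=["question", "reasons"])

questions = [
    "Can you list the converging innovation platforms identified by ARK?",
    "How do AI Chatbots contribute to the development of robotaxis?",
    "Hypothetical healthy question",  # made-up row for contrast
]
baseline_scores = pd.DataFrame({
    "question": questions,
    "faithfulness": [1.00, 0.00, 0.90],
    "answer_relevancy": [0.99, 1.00, 0.95],
})
ult_scores = pd.DataFrame({
    "question": questions,
    "faithfulness": [0.00, 0.00, 0.95],
    "answer_relevancy": [0.00, 1.00, 0.96],
})
red_flags = find_red_flags(baseline_scores, ult_scores)
print(red_flags.to_string(index=False))
```

In the notebook, the two inputs would be `tifin_eval_baseline_df` and `tifin_eval_utl_df`, and the flagged rows would be the starting list for reading answers and contexts by hand.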

# FUTURE EFFORTS

[Back to top](#ARK-Disruptive-Innovation-Insights)

### Potential Future Enhancements:
There is an almost unlimited set of other areas we could explore. Here are some that we could easily add:
- <b>Combinations</b> - The evaluation provided compares each enhancement to the original baseline RAG pipeline. As certain enhancements start to stand out, you can start to combine them and use that combination as the baseline to continue to compare future enhancements against.  

- <b>Other Advanced Chunking Strategies</b> - Test the SemanticChunker class from LangChain, which attempts to chunk text based on the semantics and context of the text, rather than an arbitrary character count. I have tested this many times with other applications, and under certain circumstances it can perform better and is worth considering. I avoided it in this demonstration because it costs significantly more than the other chunking strategies (which are essentially free). If you would like to provide an API key for it though, I'm happy to show you. It is very easy to swap out the splitter! I would say though, in my experience, the recursive character text splitter is fairly capable despite its simplicity, and should not be overlooked.

- <b>Different Semantic Distance Metrics</b> - This demonstration relied on the default cosine similarity distance metric, which is a powerful distance metric, but not the only one. Other options include Euclidean distance and dot product (not technically a distance metric, but it is typically included!), Lin similarity, Jaccard similarity, Hamming distance, Manhattan distance, and Levenshtein distance. These could all be implemented and evaluated for improvement in the retrieval mechanism. This is something that could be easily changed during the vectorstore initialization and could be tested in the same manner as these other enhancements.
  
- <b>Other vector stores (databases)</b> - We could test other vector store solutions, which can vary significantly. Many require robust infrastructure implementations, and so I avoided them for this demonstration. This demonstration uses the Chroma database, which is versatile and very capable for smaller demonstrations like this one. It even has some production applications, such as when you need to spin up and generate a vector database in real time. But generally, there are better options that can provide more robust, flexible, and capable solutions for large-scale, production-focused GenAI applications. I have significant experience with PostgreSQL (using the PGVector extension for the vector database), which offers the most advanced vector database capabilities on the market commonly seen in other high-end competitors like Pinecone. For example, you can index your database with Locality-Sensitive Hashing (LSH), Product Quantization (PQ), and Hierarchical Navigable Small Worlds (HNSW). PostgreSQL offers you the world's #1 database solution with the largest community (so there are more people that can help you support it!) combined with the most advanced vector database capabilities. I have also used Weaviate, Pinecone, Elasticsearch, Milvus, and LanceDB as vector database solutions. All are sufficient solutions if they are already in place. Weaviate is probably the most interesting of the other options, with its GraphQL-like API, making it highly adaptable and giving you some development options the others do not offer. With databases, the evaluation we used in this exercise will give you an idea of the retrieval quality of the databases, but we would also want to add database-specific metrics, like speed and efficiency, to fully vet these options. And even within each database solution, you can test different options, like using various indexing methods, various search algorithms, and so on.
  
- <b>Different Interface Solutions</b> - There was no request for an interface for this application, so I avoided that! It is typically the most problematic component to launch across different environments, and you likely have specialization in this area already.  But if that were a need, there are many POC-focused UI frameworks, like Streamlit and Gradio, and then there are more robust solutions that start pushing into the web development arena.  I have built many Streamlit- and Gradio-based applications, as well as mobile POCs using Flutter.  I have also supported many more robust web frameworks in production.  This is an area where we could experiment.  Evaluation would be considerably different (using user feedback, usage metrics, and similar measures to track our progress and effectiveness).

- <b>Fine-tuned LLM for financial expertise</b> - This is actually one of my specialties. I focus on understanding both the RAG system and LLM fine-tuning so that the two work together as effectively as possible.  This can only be achieved and optimized with full knowledge of both processes.  Examples for me include fine-tuning LLMs to act as J&J scientists and J&J clinicians and, for other organizations, as an anesthesiology expert and a data science professor.  This is not just about "adding more data" to the LLM, as the real value in fine-tuning lies in teaching the LLM to talk like people do in the domain and to take on the personality of the target domain.  Finance is another area where this personality adjustment can play a big role. I started my career at First Union Securities as a junior stock analyst, and I understand the nuances of how financial concepts are communicated in this field more than most developers.  I can talk the talk, and even walk the walk, but even more importantly, I can fine-tune a generative AI model to do the same!  This is a hypothesis we could test within the framework of these evaluations.

- <b>Fine-tuned embedding model focused on financial terminology</b> - The financial domain has its own terminology and semantics, like any other focused domain.  Embedding models can be fine-tuned on that terminology to improve the semantic understanding behind the vector search, resulting in better retrieval.

- <b>Embedding models listed higher than OpenAI's on MTEB</b> - As mentioned earlier, OpenAI's best model is ranked 29th in retrieval and has not been thoroughly vetted specifically for the financial domain. We can research and identify models that not only do much better in general, but also do well specifically in finance (or, as in the previous suggestion, we can fine-tune them to do better in finance).

- <b>Agentic Workflow with LangGraph</b> - LangGraph provides a strong foundation for orchestrating agents and reducing the early challenges the agentic approach introduced (endless loops, lack of control).  An agent could be quickly implemented on top of the pipelines already presented, adding a generative-AI-powered layer that reasons about and refines the responses before they are returned.

- <b>Numerous other LLMs</b> - Given the modular framework presented here, it is easy to try other LLMs, as we did with the upgraded OpenAI model.  We could continue this effort across many other models.  One particular goal could be finding the "best" open model as a baseline, and then fine-tuning that model on financial information to give it the voice and personality of a friendly financial advisor with the brains of a cutting-edge LLM.

- <b>Multi-Modal RAG</b> - Multi-modal models offer a whole new dimension of response to end users' questions.  Multi-modality can apply to both the input and the output of the models.  A significant amount of financial information is presented in graphics, and that alone should be enough to get us into MM-RAG. I've built MM-RAG pipelines with the same precision and meticulousness that I bring to LLM-based pipelines, but with a significant step up in the presentation and understanding of related graphics.  I demonstrate this kind of effort in the last chapter of my book, where I show how to extract the information needed from a graphic-intensive environmental analysis report from Google during the indexing process, complete with pictures of windmills and an understanding of its most challenging charts and graphs.
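
Returning to the distance-metric item at the top of this list, here is a minimal sketch of why the choice of metric can change what gets retrieved. It is not taken from the pipelines above: the two-dimensional vectors and the document names are made-up toy values, and the helper functions are hypothetical illustrations written in plain Python rather than any vector store's API.

```python
import math

def dot_product(a, b):
    """Inner product; larger means more similar (sensitive to vector length)."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine of the angle between a and b; larger means more similar."""
    return dot_product(a, b) / (math.sqrt(dot_product(a, a)) * math.sqrt(dot_product(b, b)))

def euclidean_distance(a, b):
    """Straight-line (L2) distance; smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan_distance(a, b):
    """Sum of absolute coordinate differences (L1); smaller means more similar."""
    return sum(abs(x - y) for x, y in zip(a, b))

def rank(query, docs, score, higher_is_better):
    """Return document names ordered from best to worst match for one metric."""
    ordered = sorted(docs.items(), key=lambda kv: score(query, kv[1]), reverse=higher_is_better)
    return [name for name, _ in ordered]

# Toy, unnormalized "embeddings" (assumed values, chosen to expose the differences).
query = [1.0, 0.0]
docs = {
    "short_aligned": [0.5, 0.0],   # same direction as the query, but shorter
    "long_off_axis": [3.0, 3.0],   # different direction, much longer
    "near_neighbor": [1.0, 0.4],   # closest point in space
}

rankings = {
    "cosine": rank(query, docs, cosine_similarity, higher_is_better=True),
    "dot": rank(query, docs, dot_product, higher_is_better=True),
    "euclidean": rank(query, docs, euclidean_distance, higher_is_better=False),
    "manhattan": rank(query, docs, manhattan_distance, higher_is_better=False),
}

for metric, order in rankings.items():
    print(f"{metric:>9}: {order}")
```

On these toy vectors, cosine similarity favors the document pointing in the same direction, dot product favors the longest vector, and the Euclidean and Manhattan distances favor the nearest point. With unit-normalized embeddings, cosine similarity, dot product, and Euclidean distance all produce the same ordering, so any measurable difference in the evaluation would depend on whether the embedding model and vector store normalize their vectors. That is worth verifying before running the comparison.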

# FINAL THOUGHT
Normally, I would not present this analysis to the team at this stage.  I would spend more time, probably a week or more, stepping through the scored data directly, looking for the major red flags and researching whether there are major flaws in the data or code, or just limitations beyond our control.  As I indicated above, there are numerous red flags, which I view as opportunities to make significant improvements given more time.  I would consider the work done here just the start of the first stage of analysis, establishing a baseline.  Only after this stage would we begin really digging into the major advancements.