<a href="https://colab.research.google.com/github/kwschultz/rag-local-llms/blob/main/Advanced_RAG_TinyLlama_LLM_LlamaIndex.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Advanced RAG Challenge
### Kasey Schultz, PhD - 09/21/2024

To run this notebook:
- Run the `pip install ...` code block
- Enable the GPU on Colab: in the top-right panel showing "RAM" and "Disk", click the down arrow, select T4 GPU, and restart the runtime.
- Create a directory called './rag_documents/', and [upload this PDF](https://research.ark-invest.com/hubfs/1_Download_Files_ETF_Website/Investment_Case/Investment%20Case%20For%20Disruptive%20Innovation.pdf?__hstc=6077420.c65cff1b18e88e8ac51483b62d058ce9.1610918026083.1628712241411.1628775645389.291&__hssc=6077420.3.1628775645389&__hsfp=814967395.pdf)
---

# Approach for Advanced RAG

## Document Ingestion

- Using Llama Index document loader to extract text from PDF file.

- Using Llama Index's IngestionPipeline class to combine chunking and text embedding

- With more time I would explore Llama-Index's LlamaParse API, which is suited to PDFs and would likely boost performance.

### PDF Parsing

- I experimented with the pdfplumber package to parse the PDF file structure programmatically, analyzing code blocks, font sizes, and whitespace to partition the text into chunks. It did not perform well on this PDF file without extensive tweaking.

### --> Basic Document Ingestion Pipeline

#### Chunking
- __Chunking__ using the naive `TokenTextSplitter` (imported from `llama_index.core.node_parser` in the code cells of this notebook). This chops the full document text into chunks of uniform size, regardless of text content.
  - Chunk size is a fixed 300 tokens with 10% chunk overlap (30 tokens), so that text cut at a chunk boundary still appears intact in a neighboring chunk.
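
The fixed-size strategy above can be sketched in a few lines. This is a minimal illustration, not the library's implementation: `chunk_tokens` is a hypothetical helper, and a plain list of word strings stands in for real tokenizer output.

```python
def chunk_tokens(tokens, chunk_size=300, overlap=30):
    """Split a token list into fixed-size windows that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks


# Placeholder "tokens"; a real pipeline would use tokenizer output instead.
tokens = [f"w{i}" for i in range(1000)]
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])
```

Each window starts 270 tokens after the previous one, so the last 30 tokens of one chunk are repeated at the start of the next.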

#### Text Embedder
- Embedder used: `"sentence-transformers/all-MiniLM-L6-v2"` (embedding dimension 384), a small, efficient SentenceTransformer with good performance for its size.


### --> Advanced Document Ingestion Pipeline

- __Improved chunking__ strategy using `RecursiveCharacterTextSplitter` from LangChain. This is a smarter chunking strategy that divides the text based on new lines, spaces, etc. that bound individual paragraphs and sentences. Chunk sizes vary.
  - If time allows, I want to experiment with `StatisticalChunker` from the `semantic-chunkers` package. This method uses a text embedding and similarity scores to partition the text into semantically similar chunks, to better divide separate topics and entities. There are multiple methods; the "gradient" method is where I would start.
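
To make the recursive idea concrete, here is a simplified sketch. It is an assumption-laden illustration rather than LangChain's actual algorithm: `recursive_split` is a hypothetical function, the separator list is chosen for the example, and separators are dropped or re-inserted more crudely than a production splitter would.

```python
def recursive_split(text, max_chars=500, separators=("\n\n", "\n", ". ", " ")):
    """Split on the coarsest separator first; recurse with finer ones if a piece is too long."""
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    if not separators:
        # No separator left to try: fall back to a hard cut.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    sep, finer = separators[0], separators[1:]
    pieces = []
    for part in text.split(sep):
        pieces.extend(recursive_split(part, max_chars, finer))
    # Greedily merge neighbouring pieces back together while they still fit.
    chunks, current = [], ""
    for piece in pieces:
        candidate = f"{current}{sep}{piece}" if current else piece
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


doc = "First paragraph about AI.\n\nSecond paragraph about robotics. It has two sentences.\n\nThird."
chunks = recursive_split(doc, max_chars=40)
for chunk in chunks:
    print(repr(chunk))
```

Unlike the fixed-size splitter, chunk boundaries here fall on paragraph or sentence breaks whenever possible, which is why chunk sizes vary.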

---
## --> Embedding Models

### Basic text embedding
- For basic embedding I am choosing the versatile SentenceTransformer MiniLM (`"sentence-transformers/all-MiniLM-L6-v2"`). It has good performance for its size (384-dimensional embedding) and is a good baseline.

### Improved text embedding
- For advanced embedding I am choosing a larger open-source text embedder, `"BAAI/bge-large-en-v1.5"`. This embedder is focused on RAG and has a larger 1024-dimensional embedding space for more fine-grained contextual understanding.

### If more time
- I would experiment with vision-language model embeddings that preserve the relationship between image/layout and text information within PDFs containing images and graphs. One significant limitation of RAG on PDFs using only text embeddings is that the process of extracting text from the documents discards a significant amount of information.
- I would start with the ColPali model and ColBERT embeddings. The vector store and retrieval methods would need to be enhanced to support multi-vectors (tensors).
---
## --> Information Retrieval

### Vector Store for Document Chunks
Using ChromaDB as a local vector store, with a similarity search that returns the top k=5 embedded document chunks.
- Since the number of documents/chunks is small, the ChromaDB store is saved locally so the pipeline can be re-run quickly without recomputing embeddings.
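
As a rough picture of what the top-k lookup does, the sketch below ranks chunks by cosine similarity in plain Python. The choice of cosine is an assumption made for illustration (the metric actually used depends on how the Chroma collection is configured), and `cosine` and `top_k` are hypothetical helpers, not ChromaDB calls.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, chunk_vecs, k=5):
    """Return (index, score) pairs for the k chunks most similar to the query."""
    scored = [(i, cosine(query_vec, v)) for i, v in enumerate(chunk_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy 3-dimensional vectors stand in for the 384-dimensional MiniLM embeddings.
chunk_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
hits = top_k([1.0, 0.2, 0.0], chunk_vecs, k=2)
print(hits)
```

A vector store does the same ranking with an index instead of a full scan, which matters little at the size of this document.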

#### Basic RAG LLM

- For Basic RAG I am using a pre-trained HuggingFace model locally. I tried Mistral API and OpenAI API but got rate limited or needed an account upgrade that was not immediate.
  - Model: "TinyLlama/TinyLlama-1.1B-Chat-v1.0", good performance for a smaller LLM. Allows for quicker iteration and experimentation on the whole pipeline

#### Advanced RAG LLM

- For advanced RAG this can be upgraded to a medium-size LLM with better comprehension that has been fine-tuned for question answering or RAG (e.g. Mixtral 8x7B or Llama 3.1 8B).
- Even the 8-bit quantized version of these models takes a very long time to load on Google Colab. Leaving this for future follow up.

## --> Retrieval Reranking and Answer Synthesis

### Basic RAG
- Implementing a simple Response Synthesizer from Llama Index
  - Using Compact Response mode, top k chunks are aggregated for the context injected into the query prompt. LLM reasons over all chunks for final answer.
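
The aggregation step can be pictured as packing ranked chunks into one prompt under a token budget. The sketch below is a simplification under stated assumptions: `compact_context` is a hypothetical helper, word counts stand in for token counts, and chunks that do not fit are simply dropped, whereas a real synthesizer may issue further LLM calls for them.

```python
def compact_context(chunks, question, max_tokens=2048, reserve=256,
                    count=lambda s: len(s.split())):
    """Concatenate retrieved chunks, in rank order, until the prompt budget is used up."""
    # Leave room for the question and for the tokens the model will generate.
    budget = max_tokens - reserve - count(question)
    packed, used = [], 0
    for chunk in chunks:
        cost = count(chunk)
        if used + cost > budget:
            break
        packed.append(chunk)
        used += cost
    return "\n\n".join(packed)


context = compact_context(["a b c", "d e", "f g h i"], "q q", max_tokens=10, reserve=2)
print(repr(context))
```

The returned string is what would fill the `{context_str}` slot of a question-answering prompt.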

### Advanced RAG
I implemented a more complex Response Synthesizer
- Use Re-ranking model to have a more fine-grained look at the relevant chunks
  - model = "sentence-transformers/msmarco-distilroberta-base-v2"
- Use Refinement prompting to allow an LLM to revise the answer by sequentially reviewing the reranked retrieved chunks
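
The two steps above, rerank then refine, can be sketched independently of any model. Everything here is illustrative: `rerank`, `refine_answer`, `overlap_score`, and `fake_llm` are hypothetical names, the word-overlap scorer stands in for a cross-encoder, and the stub LLM only records the prompts it receives.

```python
def rerank(query, chunks, score_fn, top_n=3):
    """Re-score retrieved chunks with a more precise scorer and keep the best top_n."""
    return sorted(chunks, key=lambda c: score_fn(query, c), reverse=True)[:top_n]


def refine_answer(query, ranked_chunks, llm):
    """Draft an answer from the best chunk, then revise it once per remaining chunk."""
    answer = llm(f"Context:\n{ranked_chunks[0]}\n\nQuestion: {query}\nAnswer:")
    for chunk in ranked_chunks[1:]:
        answer = llm(
            f"Question: {query}\nExisting answer: {answer}\n"
            f"New context:\n{chunk}\n"
            "Refine the existing answer only if the new context helps; otherwise repeat it."
        )
    return answer


def overlap_score(query, chunk):
    """Toy relevance score: number of words shared by query and chunk."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))


calls = []


def fake_llm(prompt):
    """Stub LLM that records each prompt and returns a numbered draft."""
    calls.append(prompt)
    return f"draft-{len(calls)}"


chunks = ["batteries power robots", "rockets reach orbit", "battery costs decline for robots"]
ranked = rerank("battery costs for robots", chunks, overlap_score, top_n=2)
final = refine_answer("battery costs for robots", ranked, fake_llm)
print(ranked, final)
```

Refinement costs one LLM call per chunk, so it trades latency for the chance to correct an early draft with later context.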

#### If more time
- Can employ hybrid RAG, extracting metadata from the chunks and applying metadata filtering (e.g. keyword extraction and filtering)
- Can try FlagReranker('BAAI/bge-reranker-v2-m3'), fine tuned for RAG Re-Ranking
  

---
## --> Performance Evaluation

### Benchmark evaluation
With no provided answers, and no human subject matter experts to annotate, I used GPT-4o to provide answers to the questions and serve as "pseudo ground truth" for evaluation. The GPT-4o prompt for multi-modal question answering from the uploaded PDF document:
```
Using only the information provided in the pages of that PDF document, please answer the following question. Make sure your response is succinct and contains no more than 40 words. Question: {<question>}
```

### RAG Evaluation Metrics

I will use Llama Index's implementation of DeepEval to programmatically compute RAG evaluation metrics for retriever and reranker performance, as well as hallucination detection. These functions come from `deepeval.metrics`.

#### --> Retriever Metrics
Retriever is evaluated on the top-k contexts retrieved for the RAG prompt.

1. ContextualPrecision
  - Requires 'expected_output'; I would substitute the GPT-4o responses in lieu of proper ground truth
2. ContextualRecall
  - Requires 'expected_output'
3. ContextualRelevancy
  - Independent of "correct answers"
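
For intuition about what a rank-aware retriever metric rewards, the sketch below computes a rank-weighted precision from 0/1 relevance labels. It reflects a common formulation and is an assumption about, not a copy of, how `deepeval.metrics` scores ContextualPrecision; in DeepEval the labels come from an LLM judge, while here they are hand-written, and `contextual_precision` is a hypothetical helper.

```python
def contextual_precision(relevance):
    """Rank-weighted precision over a ranked list of 0/1 relevance labels."""
    total_relevant = sum(relevance)
    if total_relevant == 0:
        return 0.0
    score, hits = 0.0, 0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            score += hits / k  # precision at rank k, counted only at relevant ranks
    return score / total_relevant


print(contextual_precision([1, 1, 0, 0, 0]))  # relevant chunks ranked first
print(contextual_precision([0, 0, 1, 0, 1]))  # relevant chunks ranked late
```

The same two relevant chunks score lower when they appear further down the top-k list, which is the behavior a reranker is meant to improve.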

#### --> Answer Synthesis Metrics

1. AnswerRelevancyMetric
  1. Can also evaluate the change in answer relevancy using the top 1 chunk vs. using the top k chunks.
2. FaithfulnessMetric
  1. Hallucination detection, evaluating groundedness of responses.

#### --> Overall System Metric

1. Correctness
  1. Using GPT 4o responses as "correct" responses (compared to what the simpler LLM generates).
2. Can combine all individual metrics for an aggregated score: a simple sum of scores, or the mean, to characterize the pipeline as a whole (and compare Basic RAG to Advanced).

#### --> Timing Metrics

- It is important to measure how long each step in the RAG pipeline takes, and the overall time.
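
One lightweight way to collect per-stage timings is a context manager around each step. This is a sketch using only the standard library; `timed` and the stage names are hypothetical, and the `time.sleep` calls are placeholders for the real retrieval and synthesis calls.

```python
import time
from contextlib import contextmanager

timings = {}


@contextmanager
def timed(stage):
    """Accumulate wall-clock seconds spent in a named pipeline stage."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = timings.get(stage, 0.0) + time.perf_counter() - start


with timed("retrieval"):
    time.sleep(0.01)  # placeholder for the retriever call
with timed("synthesis"):
    time.sleep(0.02)  # placeholder for the LLM call

# Overall time is the sum of the stages measured so far.
timings["total"] = sum(timings.values())
print(timings)
```

Because times accumulate per stage name, the same block can wrap calls inside the loop over all questions.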

#### Augmenting the question/answer evaluation dataset

- With more time, I could use Llama Index to auto-generate potential question and answer pairs from the content to augment the evaluation test data.

## Evaluation ~~Results~~ Errors with Basic and Advanced RAG

- I was not able to complete all the evaluation metrics for Basic RAG or Advanced RAG
  - Using the DeepEval standard GPT4 API call for the LLM judge gave me rate limits and 404s even with a paid account.
  - I also got errors using the Llama Index wrapper around the DeepEval methods.
- I dug into defining a local LLM for judging, and that implementation eventually led to errors in calling the model for the evaluation metrics
- With more time I would fix the errors, complete the batch evaluations, and compare the Basic and Advanced implementations

<br>

- I moved the errored out evaluation calls to the end of the notebook for reference.


---

---

In [None]:
!pip install --upgrade sentence-transformers semantic-chunkers semantic-router transformers langchain langchain-community \
datasets nltk chromadb llama-index bitsandbytes==0.43.3 accelerate pdfplumber langchain-text-splitters langchain_experimental \
flash-attn faiss-cpu mistralai FlagEmbedding peft langchain_mistralai deepeval lm-format-enforcer llama-index-embeddings-huggingface \
llama-index-embeddings-instructor llama-index-vector-stores-chroma llama-index-embeddings-mistralai llama-index-llms-mistralai \
llama-index-llms-huggingface llama-index-llms-langchain

In [None]:
# ----- API KEYS -----
import os
from google.colab import userdata
from huggingface_hub import login

mistral_api_key = userdata.get('MISTRAL_API_KEY')
hf_token = userdata.get('HF_TOKEN')
openai_api_key = userdata.get('OPENAI_API_KEY')

login(token = hf_token)

os.environ['OPENAI_API_KEY'] = openai_api_key
os.environ['HF_TOKEN'] = hf_token
os.environ['MISTRAL_API_KEY'] = mistral_api_key
os.environ['TRANSFORMERS_CACHE'] = './hf_cache'
os.environ['HF_CACHE'] = './hf_cache'

os.makedirs('./rag_documents', exist_ok=True)  # safe to re-run if the directory already exists
os.makedirs('./hf_cache', exist_ok=True)

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [None]:
import torch
import textwrap
from tqdm import tqdm
import transformers
from langchain import hub
from sentence_transformers import SentenceTransformer
from transformers import AutoModel, AutoTokenizer, AutoModelForCausalLM
from langchain.embeddings import HuggingFaceEmbeddings
from langchain_community.document_loaders import PDFPlumberLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_text_splitters import CharacterTextSplitter, SentenceTransformersTokenTextSplitter
from langchain.docstore.document import Document as LangchainDocument
from transformers import BitsAndBytesConfig
from FlagEmbedding import FlagReranker
from llama_index.core import Document
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core.node_parser import SemanticSplitterNodeParser, TokenTextSplitter
from llama_index.llms.huggingface import HuggingFaceLLM
from llama_index.core.ingestion import IngestionPipeline, IngestionCache
from llama_index.embeddings.instructor import InstructorEmbedding
from llama_index.core.node_parser import LangchainNodeParser
from llama_index.core import SimpleDirectoryReader, Settings, VectorStoreIndex
from langchain_experimental.text_splitter import SemanticChunker

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

device(type='cuda')

## Embedding Models


In [None]:
# # ----- Basic RAG --------
basic_embedder_name = "sentence-transformers/all-MiniLM-L6-v2"  # HuggingFace
# # ----- Advanced RAG --------
advanced_embedder_name = "BAAI/bge-large-en-v1.5"  # HuggingFace

## PDF Parsing and Document Chunking


### Basic Chunking by n-tokens

In [None]:
def get_doc_pages(doc_path):
    doc_loader = PDFPlumberLoader(doc_path)
    pages = doc_loader.load()
    return pages

def get_pdf_text(doc_path):
    doc_pages = get_doc_pages(doc_path)
    page_text = [p.page_content for p in doc_pages]
    full_text = "\n\n".join(page_text)
    return doc_pages, page_text, full_text

def basic_chunking(full_text: str, tokenizer_name: str) -> list[str]:
    """
    Split documents into chunks of size `chunk_size` characters and return a
    list of chunks (str). Chunk size determined by SentenceTransformer text
    embedder context window size.
    """
    text_splitter = TokenTextSplitter(chunk_size=300, chunk_overlap=0)
    return text_splitter.split_text(full_text)


arkk_pages, arkk_page_text, arkk_full_text = get_pdf_text("./rag_documents/Investment_Case_For_Disruptive_Innovation.pdf")
basic_chunks_tokensplit_fulltext = basic_chunking(arkk_full_text, basic_embedder_name)

In [None]:
len(basic_chunks_tokensplit_fulltext)

31

In [None]:
textwrap.wrap(basic_chunks_tokensplit_fulltext[0], width=120)

['1 • Why Invest In Disruptive Innovation? As of June 30, 2024 Sources: ARK Investment Management LLC, 2024. Forecasts are',
 'inherently limited and cannot be relied upon. For informational purposes only and should not be considered investment',
 'advice or a recommendation to buy, sell, or hold any particular security. Past performance is not indicative of future',
 'results.   • 2 DISCLOSURE Risks of Investing in Innovation Please note: Companies that ARK believes are capitalizing on',
 'disruptive innovation and developing technologies to displace older technologies or create new markets may not in fact',
 'do so. ARK aims to educate investors and seeks to size the potential investment opportunity, noting that risks and',
 'uncertainties may impact our projections and research models. Investors should use the content presented for',
 'informational purposes only, and be aware of market risk, disruptive innovation risk, regulatory risk, and risks related',
 'to certain innovation ar

In [None]:
textwrap.wrap(basic_chunks_tokensplit_fulltext[1], width=120)

['and company risks. (See Disclosure Page) Sources: ARK Investment Management LLC, 2023.   3 Public Blockchains Upon',
 'large-scale adoption, we believe all money and contracts likely will migrate onto Public Blockchains that enableand',
 'Multiomic verifydigital scarcityand proof of ownership. The financial ecosystem is likely to reconfigure to accommodate',
 'the rise of Sequencing Cryptocurrencies and Smart Contracts. These technologies increase transparency, reduce the',
 'influence of capital and The cost to gather, sequence, and understand regulatory controls, and collapse contract',
 'execution costs. In digital biological data is falling precipitously. such a world, Digital Wallets would become',
 'increasingly Multiomic Technologies provide research Five Innovation necessary as more assets become money-like, and',
 'corporations scientists, therapeutic organizations and health and consumers adapt to the new financial infrastructure.',
 'platforms with unprecedented access to 

In [None]:
textwrap.wrap(basic_chunks_tokensplit_fulltext[2], width=120)

['technology ’ s integration into every economic sector. we programmable biology capabilities, including the defining this',
 'believe the adoption of neural networks should prove more design and synthesis of novel biological momentous than the',
 'introduction of the internet. at scale these constructs with applications across industries, systems will require',
 'unprecedented computational resources, and particularly agriculture and food production. technological era ai - specific',
 'compute hardware should dominate the next gen cloud datacenters that train and operate ai models. the potential for end',
 "- users is clear : a constellation of ai - driven intelligent devices that pervade people's lives, changing the way that",
 'they spend, work, and play. the adoption of artificial intelligence should transform every sector, impact every',
 'business, and catalyze every innovation platform. energy storage robotics declining costs of advanced battery technology',
 'should cause catal

## Basic RAG with Llama-Index, ChromaDB, TinyLlama (LLM)

In [None]:
# Must create doc_directory and upload to Colab
reader = SimpleDirectoryReader('./rag_documents/')
ark_documents = reader.load_data()



---

## LLM and Embedder

In [None]:
# from transformers import AutoConfig

quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
)


hf_model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
context_window = 2048
llm_model = HuggingFaceLLM(
    model_name=hf_model_name,
    tokenizer_name=hf_model_name,
    context_window=context_window,
    max_new_tokens=156,
    model_kwargs={"quantization_config": quantization_config},
    generate_kwargs={"top_k": 50},
    device_map="cuda",
)

# hf_model_name = 'mistralai/Mixtral-8x7B-v0.1'
# context_window = 4096
# llm_model = HuggingFaceLLM(
#     model_name=hf_model_name,
#     tokenizer_name=hf_model_name,
#     context_window=context_window,
#     max_new_tokens=256,
#     model_kwargs={"quantization_config": quantization_config},
#     generate_kwargs={"temperature": 0.7, "top_k": 50, "top_p": 0.95},
#     device_map="cuda",
# )

tokenizer = AutoTokenizer.from_pretrained(
    hf_model_name, cache_dir='./hf_cache'
)

embed_model = HuggingFaceEmbedding(model_name=basic_embedder_name, cache_folder='./hf_cache')

# Llama index settings from model
Settings.tokenizer = tokenizer
Settings.llm = llm_model
Settings.embed_model = embed_model
Settings.context_window = context_window





### Llama Index Ingestion Pipeline

In [None]:
# from llama_index.llms.mistralai import MistralAI
# from llama_index.embeddings.mistralai import MistralAIEmbedding

# Wrap the token splitter in LangchainNodeParser
# (TokenTextSplitter here is the LlamaIndex class imported above)
basic_langchain_parser = LangchainNodeParser(
    TokenTextSplitter(chunk_size=300, chunk_overlap=30)
)

Settings.chunk_size = 300
Settings.chunk_overlap = 30

# create the pipeline with transformations
basic_ingestion_pipeline = IngestionPipeline(
    transformations=[
        basic_langchain_parser,
        embed_model
    ]
)

# run the pipeline
basic_nodes = basic_ingestion_pipeline.run(documents=ark_documents)

In [None]:
len(basic_nodes)

63

## Llama Index and ChromaDB Vector Store

In [None]:
import chromadb
from llama_index.core import (
    PromptTemplate,
    Settings,
    SimpleDirectoryReader,
    StorageContext,
    VectorStoreIndex,
    get_response_synthesizer,
)
from llama_index.core.postprocessor import SimilarityPostprocessor
from llama_index.core.prompts.prompt_type import PromptType
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.response_synthesizers import ResponseMode
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.vector_stores.chroma import ChromaVectorStore

In [None]:
# initialize client, setting path to save data
db = chromadb.PersistentClient(path="./chroma_db_basic_minilm")

# create collection
chroma_collection = db.get_or_create_collection("quickstart")

# assign chroma as the vector_store to the context
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# create your index
basic_index = VectorStoreIndex(basic_nodes, storage_context=storage_context, embed_model=embed_model)


## Llama Index Retriever (Basic), with Prompt Engineering

In [None]:
# Retriever function
basic_retriever = VectorIndexRetriever(
    index=basic_index,
    similarity_top_k=5,
)

# Craft the prompt template
DEFAULT_TEXT_QA_PROMPT_TMPL = (
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the context information and not prior knowledge, "
    "answer the query. In your answer, do not begin by saying 'Based on the given context information', just give the answer directly. \n"
    "Query: {query_str}\n"
    "Answer: "
)
DEFAULT_TEXT_QA_PROMPT = PromptTemplate(
    DEFAULT_TEXT_QA_PROMPT_TMPL, prompt_type=PromptType.QUESTION_ANSWER
)


# configure response synthesizer, using the LLM,
# Compact response concatenates all retrieved relevant chunks
# Text QA template taken from LangChain, optimized for Llama3
basic_response_synthesizer = get_response_synthesizer(
    llm = llm_model,
    response_mode=ResponseMode.COMPACT,
    text_qa_template=DEFAULT_TEXT_QA_PROMPT
)

# assemble query engine
basic_query_engine = RetrieverQueryEngine(
    retriever=basic_retriever,
    response_synthesizer=basic_response_synthesizer,
)


## Questions for RAG

In [None]:
questions = """
Q01. What is the core objective of investing in disruptive innovation according to ARK?

Q02. What are the significant risks associated with investing in innovation as highlighted by ARK?

Q03. Can you list the converging innovation platforms identified by ARK?

Q04. How does ARK describe the impact of Artificial Intelligence on technology’s integration into economic sectors?

Q05. What transformative potential does Multiomic Sequencing hold according to ARK?

Q06. What are the implications of declining battery technology costs as outlined by ARK?

Q07. How is the field of Robotics anticipated to evolve with the advancements in AI?

Q08. What does the ARK’s Convergence Scoring Framework illustrate about innovation platforms?

Q09. How do neural networks serve as a catalyst for other technologies?

Q10. What unique view does ARK have towards Autonomous Mobility and its market potential?

Q11. How do AI Chatbots contribute to the development of robotaxis?

Q12. What are breakthroughs in DNA Sequencing, particularly with neural networks?

Q13. How does the application of AI language models in robotics enhance general task completion rates?

Q14. In what ways are battery advances critical to the future of intelligent devices and augmented reality?

Q15. How do reusable rockets contribute to global connectivity?

Q16. What economic implications do disruptive innovations have according to ARK?

Q17. What are the top 10 holdings of ARK Innovation ETF (ARKK)?

Q18. What thematic strategies do ARK ETFs focus on?

Q19. What is ARK's strategy for capturing the benefits of disruptive innovation in its investment approach?

Q20. How does ARK ensure its investment strategies align with reality of disruptive innovation trends?
"""
questions = [q[5:] for q in questions.splitlines() if q]

## Answers from GPT4o (pseudo ground truth)

In [None]:
gpt_answers = """
A01. The core objective of investing in disruptive innovation, according to ARK, is to access long-term growth by investing in companies at the forefront of technological advancements, transforming industries and creating new markets​.

A02. ARK highlights significant risks in investing in innovation, including market risk, regulatory hurdles, competitive pressures, political or legal challenges, and uncertainties surrounding the success of disruptive technologies in creating new markets.

A03. The converging innovation platforms identified by ARK are public blockchains, multiomic sequencing, artificial intelligence, energy storage, and robotics​.

A04. ARK describes Artificial Intelligence as accelerating technology integration across all economic sectors by automating knowledge work, solving complex problems, and transforming businesses through AI-driven intelligent devices, which impact how people spend, work, and play.

A05. According to ARK, Multiomic Sequencing holds transformative potential by enabling precision therapies, advancing gene editing, and unlocking programmable biology, which could revolutionize healthcare and industries like agriculture and food production​.

A06. ARK outlines that declining battery technology costs will drive the expansion of autonomous mobility, enable micro-mobility and aerial systems, reduce transportation costs, and transform energy systems by shifting from liquid fuel to electricity.

A07. ARK anticipates that advancements in AI will enable adaptive robots to operate alongside humans, navigate legacy infrastructure, enhance manufacturing through 3D printing, and reduce production costs with AI-guided robots.

A08. ARK’s Convergence Scoring Framework illustrates that innovation platforms, such as AI, robotics, and blockchain, are converging, generating a significant technological wave with the potential to transform industries and drive economic growth​.

A09. Neural networks act as a catalyst by automating knowledge work, solving complex problems, and accelerating technology integration across sectors, driving the development of AI-driven devices that transform industries and catalyze other innovation platforms.

A10. ARK views Autonomous Mobility as transformative, driven by declining battery costs, enabling new forms of transportation like flying taxis, reducing transport costs, and potentially reshaping city landscapes by decreasing the need for individual car ownership

A11. AI chatbots, like ChatGPT, improve conversational AI, which helps autonomous systems like robotaxis interpret complex instructions and interact with users. This advances the human-machine interface, making robotaxi operations more efficient and user-friendly​

A12. Breakthroughs in DNA sequencing, powered by neural networks, provide unprecedented access to digital biological data, enabling precision therapies. These advancements accelerate gene editing and programmable biology, unlocking new possibilities in healthcare and industries like agriculture​.

A13. AI language models in robotics enhance task completion rates by improving robots' ability to understand and execute complex instructions, enabling more adaptive and precise interactions in various environments​

A14. Battery advances are critical as they reduce costs, enabling more efficient intelligent devices and augmented reality systems by supporting longer usage, greater mobility, and integration into everyday tasks and environments​.

A15. Reusable rockets reduce the cost of launching satellite constellations, enabling uninterruptible global connectivity by facilitating the deployment of infrastructure needed for consistent communication and data services across the world​

A16. According to ARK, disruptive innovations drive economic growth by enhancing productivity, lowering costs, and creating new markets, ultimately contributing to significant long-term gains in real GDP and consumer surplus.

A17. The top 10 holdings of ARK Innovation ETF (ARKK) are Tesla, Roku, Coinbase, Roblox, Block, CRISPR Therapeutics, Robinhood, UiPath, Palantir, and Shopify, collectively representing 63% of the portfolio.

A18. ARK ETFs focus on thematic strategies such as innovation, next-generation internet, autonomous technology and robotics, genomic revolution, fintech innovation, Israel innovative technology, 3D printing, and space exploration and innovation.

A19. ARK's strategy for capturing disruptive innovation involves combining top-down and bottom-up research to identify innovative companies, focusing on multi-cap exposure across sectors, and aiming for the best risk-reward opportunities from innovation-driven themes.

A20. ARK ensures alignment with disruptive innovation trends by conducting rigorous research, combining top-down and bottom-up analysis, and continually assessing both market and sector risks, as well as regulatory and competitive landscapes​.
"""
gpt_answers = [a[5:] for a in gpt_answers.splitlines() if a]
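
The `[5:]` slice assumes every label is exactly five characters long, including the trailing space. A slightly more forgiving alternative is sketched below; `strip_labels` and the sample strings are hypothetical and only illustrate the parsing step.

```python
import re

# Matches a leading "Q01." or "A01." label plus any whitespace after it.
LABEL = re.compile(r"^[QA]\d{2}\.\s*")


def strip_labels(block):
    """Drop blank lines and remove the leading label from each remaining line."""
    return [LABEL.sub("", line.strip()) for line in block.splitlines() if line.strip()]


sample = """
A01. First answer.

A02.Second answer without a space.
"""
parsed = strip_labels(sample)
print(parsed)
```

With the regular expression, a missing space after the label no longer swallows the first letter of the answer.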

## Test one RAG question

In [None]:
print(questions[0])
print()
# response = retrieval_chain.invoke({"input": questions[0]}) # for LangChain
# answer = response["answer"]

response = basic_query_engine.query(questions[0])  #Llama Index
print(str(response))


What is the core objective of investing in disruptive innovation according to ARK?


ARK aims to educate investors and seeks to size the potential investment opportunity, noting that risks and uncertainties may impact our projections and research models. Investors should use the content presented for informational purposes only, and be aware of market risk, disruptive innovation risk, regulatory risk, and risks related to certain innovation areas. Please read risk disclosure carefully.


In [None]:
textwrap.wrap(response.response, width=100)

[' ARK aims to educate investors and seeks to size the potential investment opportunity, noting that',
 'risks and uncertainties may impact our projections and research models. Investors should use the',
 'content presented for informational purposes only, and be aware of market risk, disruptive',
 'innovation risk, regulatory risk, and risks related to certain innovation areas. Please read risk',
 'disclosure carefully.']

## Call all questions and store answers

In [None]:
basic_rag_answers = []
basic_rag_responses = []
for q in tqdm(questions):
    # response = retrieval_chain.invoke({"input": q}) # LangChain
    # answer = response["answer"]
    response = basic_query_engine.query(q)  # LlamaIndex
    basic_rag_answers.append(response.response)
    basic_rag_responses.append(response)

100%|██████████| 20/20 [05:24<00:00, 16.24s/it]


In [None]:
with open('basic_rag_answers.txt', 'w') as f:
    for i, answer in enumerate(basic_rag_answers):
        f.write(f'A{i+1:02}. {answer}\n---------------\n\n')

In [None]:
basic_rag_response_contexts = []
for response in basic_rag_responses:
    basic_rag_response_contexts.append([n.text for n in response.source_nodes])

In [None]:
basic_rag_evaluation_data = []
for i in range(len(questions)):
  eval_data = {
      "input": questions[i],
      "actual_output": basic_rag_answers[i],
      "expected_output": gpt_answers[i],
      "context": basic_rag_response_contexts[i],
      'response': basic_rag_responses[i]
  }
  basic_rag_evaluation_data.append(eval_data)


In [None]:
import pickle
with open('basic_rag_evaluation_data.pkl', 'wb') as f:
    pickle.dump(basic_rag_evaluation_data, f)

In [None]:
import pickle
with open('basic_rag_evaluation_data.pkl', 'rb') as f:
    basic_rag_evaluation_data = pickle.load(f)

In [None]:
# unpack saved data
basic_rag_answers = []
basic_rag_response_contexts = []
basic_rag_responses = []

for i in range(len(questions)):
  eval_data = basic_rag_evaluation_data[i]
  basic_rag_answers.append(eval_data['actual_output'])
  basic_rag_response_contexts.append(eval_data['context'])
  basic_rag_responses.append(eval_data['response'])


## Evaluations



In [None]:
# from llama_index.llms.openai import OpenAI
# gpt = OpenAI(temperature=0, model="gpt-3.5-turbo")

# # hit rate limit with OpenAI (default evaluation method for DeepEval)

In [None]:
# Trying to use Llama Index wrapper for Eval methods

import nest_asyncio
nest_asyncio.apply()

# Faithfulness
from llama_index.core.evaluation import (
    FaithfulnessEvaluator,
    RelevancyEvaluator,
    CorrectnessEvaluator,
    SemanticSimilarityEvaluator,
    ContextRelevancyEvaluator,
    AnswerRelevancyEvaluator,
)

faithfulness = FaithfulnessEvaluator(llm=llm_model)
relevancy = RelevancyEvaluator(llm=llm_model)
correctness = CorrectnessEvaluator(llm=llm_model)
semantic_similarity = SemanticSimilarityEvaluator()
context_relevancy = ContextRelevancyEvaluator()
answer_relevancy = AnswerRelevancyEvaluator(llm=llm_model)

In [None]:
faithfulness_eval = faithfulness.evaluate_response(response=basic_rag_responses[0], query=questions[0])
print(f"\nFaithfulness score for Q01. with Basic RAG: {faithfulness_eval.score}")
print(f"\nFaithfulness reason for Q01. with Basic RAG: {faithfulness_eval.reason}")
print('\n', faithfulness_eval)

In [None]:
relevancy.evaluate_response(response=basic_rag_responses[0], query=questions[0], reference=gpt_answers[0])

EvaluationResult(query='What is the core objective of investing in disruptive innovation according to ARK?', contexts=['•Risks of Investing in InnovationPlease note: Companies that ARK believes are capitalizing on disruptive innovation and developing technologies to displace older technologies or create new markets may not in fact do so. ARK aims to educate investors and seeks to size the potential investment opportunity, noting that risks and uncertainties may impact our projections and research models. Investors should use the content presented for informational purposes only, and be aware of market risk, disruptive innovation risk, regulatory risk, and risks related to certain innovation areas. Please read risk disclosure carefully.\nDISRUPTIVE INNOVATIONRAPID PACE OF CHANGE\nUNCERTAINTY AND UNKNOWNS EXPOSURE ACROSS SECTORS AND MARKET CAPRISK OF INVESTING', 'IN INNOVATIONREGULATORY HURDLES\nCOMPETITIVE LANDSCAPEPOLITICAL OR LEGAL PRESSURE\nSources: ARK Investment Management LLC, 202

## Evaluation Wrapper Errors with Llama Index
- I was not able to complete all the evaluation metrics for Basic RAG or Advanced RAG.
  - Using the standard GPT-4 API call for the LLM judge produced rate-limit errors and 404s, even with a paid account.
  - I then tried defining a local LLM as the judge, but that implementation led to errors when calling the model for the evaluation metrics.
- With more time I would fix the errors, complete the batch evaluations, and compare the Basic and Advanced implementations.

<br>

- I moved the failing evaluation calls to the end of the notebook for reference.
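
Rate-limit errors like the ones mentioned above are commonly handled by retrying the judge call with exponential backoff. The following is a minimal sketch under stated assumptions, not code from this notebook: `RateLimitError` is a hypothetical stand-in for whatever exception the API client raises, and `retry_with_backoff` is a hypothetical helper name.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for the rate-limit exception raised by an API client."""


def retry_with_backoff(call, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call `call()` and retry on RateLimitError with exponential backoff plus jitter."""
    for attempt in range(max_retries + 1):
        try:
            return call()
        except RateLimitError:
            if attempt == max_retries:
                raise  # out of retries, surface the error to the caller
            # Delay doubles each attempt (base, 2*base, 4*base, ...) plus a small jitter
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1 * base_delay)
            sleep(delay)
```

Retries of this kind would not help with the 404 `model_not_found` error, which depends on account access to the requested model.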


---

# Advanced RAG

In [None]:
## Advanced RAG
from llama_index.core.postprocessor import SentenceTransformerRerank

adv_embed_model = HuggingFaceEmbedding(model_name=advanced_embedder_name, cache_folder='./hf_cache')

# Llama index settings from model
Settings.tokenizer = tokenizer
Settings.llm = llm_model
Settings.embed_model = adv_embed_model
Settings.context_window = context_window

# Wrap the LangChain parser
advanced_langchain_parser = LangchainNodeParser(
    RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
)
Settings.chunk_size = 500
Settings.chunk_overlap = 0

# create the pipeline with transformations
advanced_ingestion_pipeline = IngestionPipeline(
    transformations=[
        advanced_langchain_parser,
        adv_embed_model
    ]
)
# run the pipeline
advanced_nodes = advanced_ingestion_pipeline.run(documents=ark_documents)

# initialize client, setting path to save data
adv_db = chromadb.PersistentClient(path="./chroma_db_advanced")
# create collection
adv_chroma_collection = adv_db.get_or_create_collection("quickstart")
# assign chroma as the vector_store to the context
adv_vector_store = ChromaVectorStore(chroma_collection=adv_chroma_collection)
adv_storage_context = StorageContext.from_defaults(vector_store=adv_vector_store)
# create your index
advanced_index = VectorStoreIndex(advanced_nodes, storage_context=adv_storage_context, embed_model=adv_embed_model)

advanced_retriever = VectorIndexRetriever(
    index=advanced_index,
    similarity_top_k=8,
)

# Craft the prompt template
# QA Template
DEFAULT_TEXT_QA_PROMPT_TMPL = (
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the context information and not prior knowledge, "
    "answer the query. In your answer, do not begin by saying 'Based on the given context information', just give the answer directly. \n"
    "Query: {query_str}\n"
    "Answer: "
)
DEFAULT_TEXT_QA_PROMPT = PromptTemplate(
    DEFAULT_TEXT_QA_PROMPT_TMPL, prompt_type=PromptType.QUESTION_ANSWER
)
# Refinement template
DEFAULT_REFINE_PROMPT_TMPL = (
    "The original query is as follows: {query_str}\n"
    "We have provided an existing answer: {existing_answer}\n"
    "We have the opportunity to refine the existing answer "
    "(only if needed) with some more context below.\n"
    "------------\n"
    "{context_msg}\n"
    "------------\n"
    "Given the new context, refine the original answer to better "
    "answer the query. "
    "If the context isn't useful, return the original answer.\n"
    "Refined Answer: "
)
DEFAULT_REFINE_PROMPT = PromptTemplate(
    DEFAULT_REFINE_PROMPT_TMPL, prompt_type=PromptType.REFINE
)

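# Configure the response synthesizer with the LLM.
# REFINE mode answers from the first retrieved chunk, then refines that
# answer once for each additional chunk.
# The templates above are adapted from LlamaIndex's default QA and refine
# prompts, with an added instruction to answer directly.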
advanced_response_synthesizer = get_response_synthesizer(
    llm = llm_model,
    response_mode=ResponseMode.REFINE,
    text_qa_template=DEFAULT_TEXT_QA_PROMPT,
    refine_template=DEFAULT_REFINE_PROMPT
)
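
To make the `REFINE` response mode concrete, here is a rough sketch of the refine loop. It is an illustration, not LlamaIndex's actual implementation: the templates are shortened, and `refine_synthesize` and `fake_llm` are hypothetical names introduced only for this example.

```python
QA_TMPL = (
    "Context information is below.\n{context_str}\n"
    "Query: {query_str}\nAnswer: "
)
REFINE_TMPL = (
    "The original query is as follows: {query_str}\n"
    "We have provided an existing answer: {existing_answer}\n"
    "New context:\n{context_msg}\n"
    "Refined Answer: "
)


def refine_synthesize(llm, query, chunks):
    """Answer from the first chunk, then refine once per remaining chunk."""
    if not chunks:
        return ""
    answer = llm(QA_TMPL.format(context_str=chunks[0], query_str=query))
    for chunk in chunks[1:]:
        answer = llm(
            REFINE_TMPL.format(
                query_str=query, existing_answer=answer, context_msg=chunk
            )
        )
    return answer


prompts_seen = []


def fake_llm(prompt):
    """Hypothetical stand-in for an LLM call; records the prompt it receives."""
    prompts_seen.append(prompt)
    return f"answer v{len(prompts_seen)}"


final = refine_synthesize(
    fake_llm, "What is ARK's objective?", ["chunk A", "chunk B", "chunk C"]
)
print(final)              # answer v3
print(len(prompts_seen))  # 3
```

In this sketch the number of LLM calls equals the number of chunks, so latency grows with how many chunks survive retrieval and reranking.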

In [None]:
# assemble advanced query engine
advanced_query_engine = RetrieverQueryEngine(
    retriever=advanced_retriever,
    response_synthesizer=advanced_response_synthesizer,
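    # Pretrained reranker. NOTE: the load warning in this cell's output reports
    # that the classification head of this checkpoint is newly initialized, so
    # its rerank scores are unreliable; a checkpoint trained as a cross-encoder
    # would be a safer choice.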
    node_postprocessors = [
        SentenceTransformerRerank(
            model="sentence-transformers/msmarco-distilroberta-base-v2", top_n=3
        )
    ]
    # Another ReRanker option, fine tuned for RAG Re-Ranking
    # reranker = FlagReranker('BAAI/bge-reranker-v2-m3', use_fp16=True)
)


Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at sentence-transformers/msmarco-distilroberta-base-v2 and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
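
The query engine above retrieves 8 candidates and keeps the top 3 after reranking. The retrieve-then-rerank pattern can be sketched as follows. This is a toy illustration under stated assumptions: `token_overlap` is a hypothetical stand-in for the model-based score a real reranker would assign to each (query, passage) pair, and `rerank` is a hypothetical helper, not the `SentenceTransformerRerank` API.

```python
def token_overlap(query, text):
    """Toy relevance score: fraction of query tokens that appear in the text."""
    q_tokens = set(query.lower().split())
    t_tokens = set(text.lower().split())
    return len(q_tokens & t_tokens) / len(q_tokens) if q_tokens else 0.0


def rerank(query, candidates, score_fn, top_n=3):
    """Re-score retrieved candidates against the query and keep the best top_n."""
    scored = [(score_fn(query, text), text) for text in candidates]
    # sort() is stable, so ties keep their original retrieval order
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:top_n]]


query = "disruptive innovation risk"
candidates = [
    "regulatory hurdles and legal pressure",
    "disruptive innovation risk and market risk",
    "innovation platforms",
    "risk of investing in disruptive innovation",
]
top = rerank(query, candidates, token_overlap, top_n=2)
```

Retrieving more candidates than are finally used gives the second-stage scorer room to promote passages that the embedding search ranked lower.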




In [None]:
print(questions[0])
print()
response = advanced_query_engine.query(questions[0])  #Llama Index
textwrap.wrap(response.response, width=100)

What is the core objective of investing in disruptive innovation according to ARK?



[' ARK seeks to capture disruptive innovation by investing in companies that are at the forefront of',
 'technological advancements that are disrupting established industries. This strategy aims to',
 'generate long-term capital appreciation while also providing exposure to companies that are at the',
 "forefront of innovation. ARK's investment approach is based on a fundamental analysis of the",
 'underlying business models and technologies, as well as a rigorous screening process to identify',
 "companies that are poised for growth and potential disruption. ARK's investment team is composed of",
 'experienced investment professionals with a deep understanding of the technology and business',
 'landscape. They are committed to identifying and investing in companies that are at the forefront of']

In [None]:
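advanced_rag_answers = []
advanced_rag_responses = []
for q in tqdm(questions):
    response = advanced_query_engine.query(q)
    # Keep the full Response object (for evaluation) and the plain answer text
    advanced_rag_responses.append(response)
    advanced_rag_answers.append(response.response)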

 15%|█▌        | 3/20 [01:56<11:37, 41.06s/it]

----

## Evaluation Experiments with DeepEval and Llama Index

In [None]:
faithfulness_de = DeepEvalFaithfulnessEvaluator(model="gpt-3.5-turbo")
evaluation_result = faithfulness_de.evaluate_response(
    query=questions[0], response=basic_rag_responses[0]
)
print(evaluation_result)


Output()

ERROR:root:OpenAI rate limit exceeded. Retrying: 1 time(s)...
ERROR:root:OpenAI rate limit exceeded. Retrying: 1 time(s)...
ERROR:root:OpenAI rate limit exceeded. Retrying: 2 time(s)...
ERROR:root:OpenAI rate limit exceeded. Retrying: 2 time(s)...
ERROR:root:OpenAI rate limit exceeded. Retrying: 3 time(s)...
ERROR:root:OpenAI rate limit exceeded. Retrying: 3 time(s)...
ERROR:root:OpenAI rate limit exceeded. Retrying: 4 time(s)...
ERROR:root:OpenAI rate limit exceeded. Retrying: 4 time(s)...


KeyboardInterrupt: 

In [None]:
faithfulness_de = DeepEvalFaithfulnessEvaluator()
evaluation_result = faithfulness_de.evaluate_response(
    query=questions[0], response=basic_rag_responses[0]
)
print(evaluation_result)

Output()

NotFoundError: Error code: 404 - {'error': {'message': 'The model `gpt-4o` does not exist or you do not have access to it.', 'type': 'invalid_request_error', 'param': None, 'code': 'model_not_found'}}

In [None]:
from deepeval.integrations.llama_index import (
    DeepEvalAnswerRelevancyEvaluator,
    DeepEvalFaithfulnessEvaluator,
    DeepEvalContextualRelevancyEvaluator,
    DeepEvalSummarizationEvaluator,
    DeepEvalBiasEvaluator,
    DeepEvalToxicityEvaluator,
)

In [None]:
ans_relevancy_de = DeepEvalAnswerRelevancyEvaluator(model=llm_de)
ans_relevancy_de.evaluate(query=questions[0], response=basic_rag_responses[0], contexts=basic_rag_response_contexts[0])

Output()

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


AttributeError: 'list' object has no attribute 'find'

In [None]:
from tqdm.asyncio import tqdm_asyncio
from llama_index.core.evaluation import BatchEvalRunner
runner_no_llm = BatchEvalRunner(
    {"semantic_similarity": semantic_similarity, "context_relevancy": context_relevancy},
    workers=8,
)

In [None]:
runner.evaluate_responses(
    queries = questions[:1],
    responses = basic_rag_responses[:1],
    references = gpt_answers[:1],
    contexts = basic_rag_response_contexts[:1]
)

TypeError: llama_index.core.evaluation.faithfulness.FaithfulnessEvaluator.aevaluate() got multiple values for keyword argument 'contexts'
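
A likely explanation for this `TypeError`, assuming the evaluator's response-based entry point extracts the contexts from the response's source nodes and then forwards any extra keyword arguments: passing `contexts` explicitly supplies the same keyword twice. I have not verified this against the installed LlamaIndex version, so the sketch below only reproduces the mechanism. `SketchEvaluator` and the dict-shaped `response` are hypothetical.

```python
class SketchEvaluator:
    """Hypothetical evaluator mimicking how contexts could be forwarded."""

    def evaluate(self, query=None, response=None, contexts=None, **kwargs):
        return {"query": query, "response": response, "contexts": contexts}

    def evaluate_response(self, query=None, response=None, **kwargs):
        # Contexts are pulled from the response object itself...
        contexts = response["source_texts"]
        # ...so a caller-supplied `contexts` in kwargs is passed a second time here.
        return self.evaluate(
            query=query, response=response["text"], contexts=contexts, **kwargs
        )


response = {"text": "ARK invests in innovation.", "source_texts": ["chunk 1"]}
evaluator = SketchEvaluator()

# Works: contexts come from the response only
ok = evaluator.evaluate_response(query="q", response=response)

# Fails: contexts is supplied twice (once from the response, once by the caller)
try:
    evaluator.evaluate_response(query="q", response=response, contexts=["chunk 1"])
    error_message = None
except TypeError as exc:
    error_message = str(exc)
```

If this is the cause, dropping the `contexts` argument from the batch call should avoid the error, since the responses already carry their source nodes.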

In [None]:
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
from deepeval.models import DeepEvalBaseLLM

class CustomTinyLlama(DeepEvalBaseLLM):
    def __init__(self, hf_model, hf_tokenizer):
        self.model = hf_model
        self.tokenizer = hf_tokenizer

    def load_model(self):
        return self.model

    def generate(self, prompt: str) -> str:
        model = self.load_model()

        pipeline = transformers.pipeline(
            "text-generation",
            model=model,
            tokenizer=self.tokenizer,
            use_cache=True,
            device_map="auto",
            max_length=2500,
            do_sample=True,
            top_k=5,
            num_return_sequences=1,
            eos_token_id=self.tokenizer.eos_token_id,
            pad_token_id=self.tokenizer.eos_token_id,
        )

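        # NOTE: this returns the raw pipeline output (a list of dicts with a
        # 'generated_text' key), not a str as the annotation promises.
        return pipeline(prompt)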

    async def a_generate(self, prompt: str) -> str:
        return self.generate(prompt)

    def get_model_name(self):
        return "TinyLlama 1B"

llm_de = CustomTinyLlama(
    AutoModelForCausalLM.from_pretrained(
            hf_model_name,
            device_map="auto",
        ),
    AutoTokenizer.from_pretrained(hf_model_name),
)

In [None]:

from deepeval.integrations.llamaindex import DeepEvalFaithfulnessEvaluator
evaluator = DeepEvalFaithfulnessEvaluator()
evaluation_result = evaluator.evaluate_response(
    query=questions[0], response=basic_rag_responses[0]
)
print(evaluation_result)

ModuleNotFoundError: No module named 'deepeval.integrations.llamaindex'

In [None]:
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric

def compute_answer_relevancy(
    question: str,
    response: str,
    retrieved_context,
    threshold=0.5,
    local_llm_evaluator=None,
    ):
    answer_relevancy_metric = AnswerRelevancyMetric(
        threshold=threshold, model=local_llm_evaluator
    )
    test_case = LLMTestCase(
        input=question,
        actual_output=response,
        retrieval_context=retrieved_context
    )
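    # assert_test(test_case, [answer_relevancy_metric])
    # measure() computes the score; return the metric itself so that
    # .score and .reason are available to the caller
    answer_relevancy_metric.measure(test_case)
    return answer_relevancy_metric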


In [None]:
print(llm_de.generate("What is aquaman's favorite food?"))

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


[{'generated_text': "What is aquaman's favorite food? Aquaman's favorite food is seafood, especially fish. He loves to eat fish and other seafood dishes, and is known to be particularly fond of seafood from the ocean, which is fitting for a character who is a superhero and a king of the underwater kingdom of Atlantis.\nWhat is Aquaman's favorite food?\nAquaman's favorite food is seafood, especially fish. He loves to eat fish and other seafood dishes, and is known to be particularly fond of seafood from the ocean, which is fitting for a character who is a superhero and a king of the underwater kingdom of Atlantis.\nWhat is Aquaman's favorite food?\nAquaman's favorite food is seafood, especially fish. He loves to eat fish and other seafood dishes, and is known to be particularly fond of seafood from the ocean, which is fitting for a character who is a superhero and a king of the underwater kingdom of Atlantis.\nWhat is Aquaman's favorite food?\nAquaman's favorite food is seafood, especia

In [None]:
question="What if these shoes don't fit?"
response="We offer a 30-day full refund at no extra cost."
retrieved_context=["All customers are eligible for a 30 day full refund at no extra cost."]

relevancy_metric = compute_answer_relevancy(
    question, response, retrieved_context, local_llm_evaluator=llm_de
)
print(relevancy_metric.score, relevancy_metric.reason)

Output()

AttributeError: 'list' object has no attribute 'find'
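
A likely cause of this `AttributeError`: `CustomTinyLlama.generate` returns the raw text-generation pipeline output, which the earlier `print(llm_de.generate(...))` output shows to be a list of dicts, while the caller appears to expect a plain string and calls a string method such as `.find` on it. A minimal sketch of unwrapping that output, assuming the `[{'generated_text': ...}]` shape shown above; `extract_completion` is a hypothetical helper name:

```python
def extract_completion(pipeline_output, prompt):
    """Return only the newly generated text from a text-generation pipeline result."""
    text = pipeline_output[0]["generated_text"]
    # The output shown above echoes the prompt first, so strip it when present
    if text.startswith(prompt):
        text = text[len(prompt):]
    return text.strip()


raw = [{"generated_text": "What is 2 + 2? The answer is 4."}]
completion = extract_completion(raw, "What is 2 + 2?")
```

Returning a string like this from `generate` would match its `-> str` annotation. Whether a 1B model can then produce the structured output an LLM judge needs is a separate question.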

---

---

# ----- End of Report -----

---
## Unused Code Below, Experimentation

In [None]:
def basic_chunking_docs(doc_list: list[LangchainDocument], tokenizer_name: str) -> list[LangchainDocument]:
    """
    Split documents into token-based chunks and return them as a list of
    Document objects. Chunk size is determined by the context window of the
    SentenceTransformer text embedder named by `tokenizer_name`.
    """
    text_splitter = SentenceTransformersTokenTextSplitter(
        model_name=tokenizer_name,
        chunk_overlap=0,
    )
    return text_splitter.split_documents(doc_list)


def advanced_chunking(
    doc_list: list[str],
    huggingface_embedder_name: str
    ) -> list[LangchainDocument]:
    """
    Split document pages into chunks using semantic similarity.
    Uses the higher-dimensional advanced text embedder and the embedding
    gradient chunk-boundary method, since all data is in the financial
    domain and may have highly correlated embedding similarity.

    Chunking is applied to each page separately to preserve that semantic boundary.
    """
    text_splitter = SemanticChunker(
        HuggingFaceEmbeddings(model_name=huggingface_embedder_name),
        breakpoint_threshold_type="gradient"
    )
    return text_splitter.create_documents(doc_list)


In [None]:
# from semantic_router.encoders import HuggingFaceEncoder
# from semantic_chunkers import StatisticalChunker
# from langchain_community.document_loaders import TextLoader
# from langchain_mistralai.chat_models import ChatMistralAI
# from langchain_mistralai.embeddings import MistralAIEmbeddings
# from langchain_community.vectorstores import FAISS
# from langchain_community.docstore.in_memory import InMemoryDocstore
# from langchain.text_splitter import RecursiveCharacterTextSplitter
# from langchain.chains.combine_documents import create_stuff_documents_chain
# from langchain_core.prompts import ChatPromptTemplate
# from langchain.chains import create_retrieval_chain

## BASIC RAG IMPLEMENTATION - LangChain

# # Load data
arkk_pages, arkk_page_text, arkk_full_text = get_pdf_text("Investment_Case_For_Disruptive_Innovation.pdf")
# # Split text into chunks
doc_chunks = basic_chunking_docs(arkk_pages, basic_embedder_name)

# # Define the embedding model
embedding_function = HuggingFaceEmbeddings(model_name=basic_embedder_name)

# # Create the vector store
vector_db = FAISS.from_documents(
    doc_chunks,
    embedding_function,
    docstore= InMemoryDocstore(),
)

# Define a retriever interface
retriever = vector_db.as_retriever(
    search_kwargs={"k": 3},
    search_type = 'similarity'
)

# Define LLM
llm = ChatMistralAI(mistral_api_key=mistral_api_key)

# Define prompt template
prompt = hub.pull("rlm/rag-prompt-mistral")
""" Prompt used
<s> [INST] You are an assistant for question-answering tasks. Use the following pieces
of retrieved context to answer the question. If you don't know the answer, just say
that you don't know. Use three sentences maximum and keep the answer concise. [/INST]
 </s> \n[INST] Question: {question} \nContext: {context} \nAnswer: [/INST]"
"""

# Create a retrieval chain to answer questions
document_chain = create_stuff_documents_chain(llm, prompt)
retrieval_chain = create_retrieval_chain(retriever, document_chain)




## Download Llama3 instruct 8b for local LLM eval
# from transformers import BitsAndBytesConfig
# from transformers import AutoModelForCausalLM, AutoTokenizer


# quantization_config = BitsAndBytesConfig(
#     load_in_4bit=True,
#     bnb_4bit_compute_dtype=torch.float16,
#     bnb_4bit_quant_type="nf4",
#     bnb_4bit_use_double_quant=True,
# )

# model_4bit = AutoModelForCausalLM.from_pretrained(
#     "NousResearch/Meta-Llama-3.1-8B-Instruct",
#     device_map="auto",
#     quantization_config=quantization_config,
# )
# llama_tokenizer = AutoTokenizer.from_pretrained(
#     "NousResearch/Meta-Llama-3.1-8B-Instruct"
# )

In [None]:
# ### Llama3 prompt
# langchain_prompt = hub.pull("rlm/rag-prompt-llama3")
# ## NEED to implement the partial variables by name for query
# """
# "<|begin_of_text|><|start_header_id|>system<|end_header_id|> You are an assistant for question-answering tasks.
# Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that
# you don't know. Use three sentences maximum and keep the answer concise <|eot_id|><|start_header_id|>user<|end_header_id|>
# \nQuestion: {question} \nContext: {context} \nAnswer: <|eot_id|><|start_header_id|>assistant<|end_header_id|>\n
# """


# # RAG Prompt
# langchain_prompt = hub.pull("rlm/rag-prompt")
# """
# You are an assistant for question-answering tasks. Use the following pieces of
# retrieved context to answer the question. If you don't know the answer, just say
# that you don't know. Use three sentences maximum and keep the answer concise.

# Question: {question}

# Context: {context}

# Answer:
# """

# ## Mistral/Mixtral prompt
# langchain_prompt = hub.pull("rlm/rag-prompt-mistral")
# """
# <s> [INST] You are an assistant for question-answering tasks. Use the following pieces
#  of retrieved context to answer the question. If you don't know the answer, just say
#  that you don't know. Use three sentences maximum and keep the answer concise. [/INST] </s>

# [INST] Question: {question}

# Context: {context}

# Answer: [/INST]
# """
