### Steps:
1. Decide on your LLM - GPT-4o-mini
2. Select your embedding model
3. Document reader
4. Index
5. Storing
6. Query Engine or Chat Engine
7. Evaluation
8. Observability

#### A Note on Tokenization
By default, LlamaIndex uses a global tokenizer for all token counting. This defaults to cl100k from tiktoken, which is the tokenizer to match the default LLM gpt-3.5-turbo.

If you change the LLM, you may need to update this tokenizer to ensure accurate token counts, chunking, and prompting.

The tokenizer used by LLaMA is a SentencePiece Byte-Pair Encoding tokenizer. It is specific to LLaMA models and differs from the tokenizers used by OpenAI models.
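
To see why the choice of tokenizer affects token counts and chunking, here is a minimal sketch with two toy tokenizers. Both `whitespace_tokenizer` and `char_tokenizer` are stand-ins invented for this illustration, not real model tokenizers. Like the function assigned to `Settings.tokenizer` later in this notebook, each one is a callable that takes a string and returns a list of tokens.

```python
from typing import Callable, List


# Toy stand-ins for real tokenizers (assumptions for illustration only).
def whitespace_tokenizer(text: str) -> List[str]:
    """One token per whitespace-separated word."""
    return text.split()


def char_tokenizer(text: str) -> List[str]:
    """One token per non-whitespace character."""
    return [ch for ch in text if not ch.isspace()]


def count_tokens(text: str, tokenizer: Callable[[str], list]) -> int:
    """Count tokens with whichever tokenizer callable is plugged in."""
    return len(tokenizer(text))


text = "Llama 3.1 is a set of foundation models"
word_count = count_tokens(text, whitespace_tokenizer)
char_count = count_tokens(text, char_tokenizer)
print(word_count, char_count)
```

The same text produces very different counts, so a chunk size or context budget expressed in tokens only means what you expect when the tokenizer matches the model.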

#### Resources
* https://docs.llamaindex.ai/en/stable/module_guides/models/llms/

The cells below load the OpenAI API key, run a quick completion to check the LLM, and then set a global tokenizer:


In [1]:
import os
from dotenv import load_dotenv
load_dotenv()

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

In [2]:
from llama_index.llms.openai import OpenAI

# non-streaming
resp = OpenAI(model="gpt-4o-mini").complete("Llama 3.1 is ")
print(resp)

As of my last update in October 2023, Llama 3.1 refers to a version of the LLaMA (Large Language Model Meta AI) series developed by Meta (formerly Facebook). The LLaMA models are designed for various natural language processing tasks and are known for their efficiency and performance. Each version typically includes improvements in architecture, training data, and capabilities compared to its predecessors.

If you have specific questions about Llama 3.1 or its features, feel free to ask!


In [3]:
from llama_index.core import Settings

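# tiktoken: match the encoding to the LLM in use
# (gpt-4o-mini here; older tiktoken releases may not recognize this model name)
import tiktoken

Settings.tokenizer = tiktoken.encoding_for_model("gpt-4o-mini").encode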

# huggingface
# from transformers import AutoTokenizer

# Settings.tokenizer = AutoTokenizer.from_pretrained(
#     "HuggingFaceH4/zephyr-7b-beta"
# )

Embeddings are used in LlamaIndex to represent your documents using a sophisticated numerical representation. Embedding models take text as input, and return a long list of numbers used to capture the semantics of the text. These embedding models have been trained to represent text this way, and help enable many applications, including search!

At a high level, if a user asks a question about dogs, then the embedding for that question will be highly similar to text that talks about dogs.

By default, LlamaIndex uses text-embedding-ada-002 from OpenAI.

When calculating the similarity between embeddings, there are many methods to use (dot product, cosine similarity, etc.). By default, LlamaIndex uses cosine similarity when comparing embeddings.
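
As a rough illustration of the comparison described above, the sketch below computes cosine similarity by hand. The three-dimensional vectors are made up for the example; real embeddings have hundreds or thousands of dimensions and come from the embedding model.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Made-up vectors standing in for real embeddings.
question_about_dogs = [0.9, 0.1, 0.0]
text_about_dogs = [0.8, 0.2, 0.1]
text_about_taxes = [0.0, 0.1, 0.9]

sim_dogs = cosine_similarity(question_about_dogs, text_about_dogs)
sim_taxes = cosine_similarity(question_about_dogs, text_about_taxes)
print(round(sim_dogs, 3), round(sim_taxes, 3))
```

Because the result depends only on the angle between the vectors, scaling a vector does not change its similarity to another one.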

* https://docs.llamaindex.ai/en/stable/module_guides/models/embeddings/

In [4]:
from llama_index.embeddings.openai import OpenAIEmbedding
# global
Settings.embed_model = OpenAIEmbedding()

The SimpleDirectoryReader is the most commonly used data connector that just works.
By default, it can be used to parse a variety of file types on your local filesystem into a list of Document objects. Additionally, it can be configured to read from a remote filesystem just as easily. This is made possible through the fsspec protocol.

[Resource](https://github.com/run-llama/llama_index/blob/main/docs/docs/examples/data_connectors/simple_directory_reader_remote_fs.ipynb)

In [7]:
# create a folder in content called data and upload files
from llama_index.core import SimpleDirectoryReader
reader = SimpleDirectoryReader(
    input_dir='data',
    recursive=True,  # recursively searches all subdirectories
)

In [8]:
docs = reader.load_data()
print(f"Loaded {len(docs)} docs")
# docs[:71]  # the first 71 documents, which excludes the references and contributors

Loaded 92 docs


In [10]:
# show the metadata of each document
# for idx, doc in enumerate(docs):
#     print(f"{idx} - {doc.metadata}")

### Chunking

The LlamaIndex chunk-size evaluation linked in the cell below suggests that a chunk size of 1024 might strike an optimal balance between response time and the quality of the responses, measured in terms of faithfulness and relevancy.
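
To make the chunk size setting concrete, here is a simplified sketch of fixed-size chunking with overlap. It counts whitespace-separated words as tokens, which is an assumption for illustration; LlamaIndex's own splitter uses the configured tokenizer and tries to respect sentence boundaries. The helper `chunk_text` and its parameters are hypothetical.

```python
from typing import List


def chunk_text(text: str, chunk_size: int, chunk_overlap: int = 0) -> List[str]:
    """Split text into chunks of at most chunk_size words, sharing
    chunk_overlap words between consecutive chunks."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    tokens = text.split()
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        # Stop once this chunk has reached the end of the text.
        if start + chunk_size >= len(tokens):
            break
    return chunks


text = " ".join(f"w{i}" for i in range(10))
small = chunk_text(text, chunk_size=4, chunk_overlap=1)
large = chunk_text(text, chunk_size=8, chunk_overlap=1)
print(len(small), len(large))
```

Smaller chunks give more, finer-grained pieces to embed and retrieve; larger chunks carry more context per piece but cost more tokens per retrieved node.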

In [11]:
# https://docs.llamaindex.ai/en/stable/module_guides/indexing/vector_store_index/
# https://www.llamaindex.ai/blog/evaluating-the-ideal-chunk-size-for-a-rag-system-using-llamaindex-6207e5d3fec5

from llama_index.core import VectorStoreIndex
from llama_index.core import Settings

# global
Settings.chunk_size = 1024
index = VectorStoreIndex.from_documents(docs)

### Querying
Querying is the most important part of your LLM application.

LlamaIndex supports several response modes for turning the retrieved chunks into an answer (see the response modes link below). One of them is:

* refine: create and refine an answer by sequentially going through each retrieved text chunk. This makes a separate LLM call per Node/retrieved chunk.
  Details: the first chunk is used in a query using the text_qa_template prompt. Then the answer and the next chunk (as well as the original question) are used in another query with the refine_template prompt. And so on until all chunks have been parsed.
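
The refine loop described above can be sketched in a few lines. This is not LlamaIndex's implementation: the prompt wording is invented, and `fake_llm` is a hypothetical stand-in that records the prompts it receives so the control flow is visible without an API call.

```python
from typing import Callable, List

# Simplified templates (hypothetical wording, not LlamaIndex's actual prompts).
TEXT_QA_TEMPLATE = "Context:\n{context}\n\nQuestion: {query}\nAnswer:"
REFINE_TEMPLATE = (
    "Question: {query}\n"
    "Existing answer: {answer}\n"
    "New context:\n{context}\n\n"
    "Refine the existing answer if the new context helps:"
)


def refine(query: str, chunks: List[str], llm: Callable[[str], str]) -> str:
    """Answer from the first chunk, then refine once per remaining chunk."""
    if not chunks:
        raise ValueError("need at least one retrieved chunk")
    answer = llm(TEXT_QA_TEMPLATE.format(context=chunks[0], query=query))
    for chunk in chunks[1:]:
        answer = llm(
            REFINE_TEMPLATE.format(query=query, answer=answer, context=chunk)
        )
    return answer


prompts_seen: List[str] = []


def fake_llm(prompt: str) -> str:
    """Stand-in LLM: records the prompt and returns a numbered answer."""
    prompts_seen.append(prompt)
    return f"answer v{len(prompts_seen)}"


final = refine("Llama 3.1 is", ["chunk A", "chunk B", "chunk C"], fake_llm)
print(final, len(prompts_seen))
```

Three chunks lead to three LLM calls, which is why refine is thorough but becomes slower and more expensive as the number of retrieved chunks grows.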

#### Resources:
* https://docs.llamaindex.ai/en/stable/module_guides/querying/
* https://docs.llamaindex.ai/en/stable/module_guides/deploying/query_engine/response_modes/


In [12]:
query_engine = index.as_query_engine()
response = query_engine.query("Llama 3.1 is")

In [13]:
print(response)

Llama 3.1 is a set of foundation models for language that includes models with 8B, 70B, and 405B parameters. These models support multilinguality, coding, reasoning, and tool usage. The development process optimized for data, scale, and managing complexity to enhance the quality and performance of the models.


To stream the response:

In [14]:
query_engine = index.as_query_engine(streaming=True)
streaming_response = query_engine.query("Llama 3.1 is")
streaming_response.print_response_stream()

a set of foundation models for language that natively support multilinguality, coding, reasoning, and tool usage.

#### Low-Level Composition API
* https://docs.llamaindex.ai/en/stable/module_guides/deploying/query_engine/usage_pattern/

In [15]:
from llama_index.core import VectorStoreIndex, get_response_synthesizer
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.query_engine import RetrieverQueryEngine

# build index
index = VectorStoreIndex.from_documents(docs)

# configure retriever
retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=2,
)

# configure response synthesizer
response_synthesizer = get_response_synthesizer(
    response_mode="tree_summarize",
    #verbose=True,
)

# assemble query engine
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer,
)

# query
response = query_engine.query("Llama 3.1 is")
print(response)

a set of foundation models for language that natively support multilinguality, coding, reasoning, and tool usage.


### Evaluation

LlamaIndex offers key modules to measure the quality of generated results, as well as key modules to measure retrieval quality.

* Response Evaluation: Does the response match the retrieved context? Does it also match the query? Does it match the reference answer or guidelines?
* Retrieval Evaluation: Are the retrieved sources relevant to the query?
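
The retrieval question can be turned into numbers when you know which source should come back for each query. Two common ranking metrics are hit rate and mean reciprocal rank (MRR). The sketch below computes both on made-up data; the function names and document IDs are hypothetical.

```python
from typing import Dict, List


def hit_rate(retrieved: Dict[str, List[str]], expected: Dict[str, str]) -> float:
    """Fraction of queries whose expected document appears in the results."""
    if not retrieved:
        raise ValueError("need at least one query")
    hits = sum(1 for query, ids in retrieved.items() if expected[query] in ids)
    return hits / len(retrieved)


def mean_reciprocal_rank(
    retrieved: Dict[str, List[str]], expected: Dict[str, str]
) -> float:
    """Average of 1/rank of the expected document (0 when it is missing)."""
    if not retrieved:
        raise ValueError("need at least one query")
    total = 0.0
    for query, ids in retrieved.items():
        if expected[query] in ids:
            total += 1.0 / (ids.index(expected[query]) + 1)
    return total / len(retrieved)


# Made-up retrieval results, ordered from most to least similar.
retrieved = {
    "q1": ["doc_a", "doc_b"],  # expected document at rank 1
    "q2": ["doc_c", "doc_d"],  # expected document at rank 2
    "q3": ["doc_e", "doc_f"],  # expected document not retrieved
}
expected = {"q1": "doc_a", "q2": "doc_d", "q3": "doc_z"}

hr = hit_rate(retrieved, expected)
mrr = mean_reciprocal_rank(retrieved, expected)
print(round(hr, 3), round(mrr, 3))
```

Hit rate only asks whether the right source was retrieved at all, while MRR also rewards ranking it near the top.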

#### Response Evaluation
Evaluating generated results can be difficult: unlike traditional machine learning, the predicted result isn't a single number, and it can be hard to define quantitative metrics for this problem.

LlamaIndex offers LLM-based evaluation modules to measure the quality of results. This uses a "gold" LLM (e.g. GPT-4) to decide whether the predicted answer is correct in a variety of ways.
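
Conceptually, an LLM-based faithfulness check asks a judge model whether an answer is supported by some context and turns the verdict into a pass/fail flag. The sketch below shows that shape only. The prompt wording and `is_faithful` are hypothetical, and `keyword_judge` is a trivial stand-in for the judge LLM; none of this is the LlamaIndex implementation.

```python
from typing import Callable, List

# Hypothetical judge prompt, for illustration only.
JUDGE_PROMPT = (
    "Is the statement supported by the context? Answer YES or NO.\n"
    "Statement: {answer}\n"
    "Context:\n{context}\n"
    "Verdict:"
)


def is_faithful(
    answer: str, contexts: List[str], judge: Callable[[str], str]
) -> bool:
    """Ask the judge for a YES/NO verdict and convert it to a boolean."""
    prompt = JUDGE_PROMPT.format(answer=answer, context="\n".join(contexts))
    verdict = judge(prompt)
    return verdict.strip().upper().startswith("YES")


def keyword_judge(prompt: str) -> str:
    """Stand-in for a judge LLM: YES only if the context mentions '405B'."""
    context = prompt.split("Context:\n", 1)[1]
    return "YES" if "405B" in context else "NO"


answer = "Llama 3.1 includes a 405B model"
passing_supported = is_faithful(
    answer, ["The largest model has 405B parameters."], keyword_judge
)
passing_unsupported = is_faithful(answer, ["Dogs are mammals."], keyword_judge)
print(passing_supported, passing_unsupported)
```

With a real judge LLM the verdict is itself a model output, so a low temperature and a strong model help keep the evaluation stable.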

* https://docs.llamaindex.ai/en/stable/module_guides/evaluating/

In [17]:
# Needed in a Jupyter Notebook, which already runs its own event loop
import nest_asyncio
nest_asyncio.apply()


In [18]:
# The FaithfulnessEvaluator evaluates whether the answer is faithful to the retrieved contexts (in other words, whether there is hallucination).
from llama_index.core.evaluation import FaithfulnessEvaluator

# create llm
llm = OpenAI(model="gpt-4", temperature=0.0)

# define evaluator
evaluator = FaithfulnessEvaluator(llm=llm)

# query index
query_engine = index.as_query_engine()
response = query_engine.query(
    "LLama 3.1 is"
)
eval_result = evaluator.evaluate_response(response=response)
print(str(eval_result.passing))

True


* https://huggingface.co/learn/cookbook/en/llm_judge
* https://huggingface.co/learn/cookbook/en/rag_evaluation - includes an excellent diagram for studying RAG

In [19]:
response_str = response.response
for source_node in response.source_nodes:
    eval_result = evaluator.evaluate(
        response=response_str, contexts=[source_node.get_content()]
    )
    print(str(eval_result.passing))

True
False
