## Note to Reviewer:
I have not yet used LLMs in a corporate setting, but I have done training and have learned the fundamentals on my own. I know that GenAI will be a focus of this role. I wanted to provide sample code to demonstrate my skills in this area. I will be able to leverage my knowledge and training obtained outside of a corporate setting to more quickly learn on the job.

In [0]:
from langchain.chat_models import ChatDatabricks
from langchain.embeddings import DatabricksEmbeddings
from langchain_chroma import Chroma
from langchain_core.prompts import PromptTemplate
from langchain.chains import LLMChain
from langchain_experimental.text_splitter import SemanticChunker
from langchain_community.document_loaders import GutenbergLoader

from trulens_eval import (
    Feedback,
    TruChain,
    Tru,
)
from trulens_eval.app import App
from trulens_eval.feedback.provider.langchain import Langchain

import re
import pandas as pd
import numpy as np

In [0]:
# Reading in a dataset with main characters from each novel so I can perform extensive evaluation of models
character_df = pd.read_csv("/dbfs/mnt/finance_tables/alexandre_dumas_characters.csv")

In [0]:
character_df.head(8)

In [0]:
chat_llm_name = "databricks-dbrx-instruct"
eval_llm_name = "databricks-mixtral-8x7b-instruct"
embedding_name = "databricks-gte-large-en"

In [0]:
###### Setting embedding models
db_embedding_model = DatabricksEmbeddings(endpoint=embedding_name)

###### Setting llms for prediction and evaluation
db_llm = ChatDatabricks(endpoint=chat_llm_name, temperature=0)
db_eval_llm = ChatDatabricks(endpoint=eval_llm_name, temperature=0)

# Using a different model for evaluation than for generating the biographies to avoid bias
# TruLens needs a LangChain provider object to run its feedback functions
langchain_provider = Langchain(
    chain=ChatDatabricks(endpoint=eval_llm_name, temperature=0)
)

# Using semantic splitter solely for this use case because non-semantic alternatives did not perform well in v1
text_splitter = SemanticChunker(
    db_embedding_model, 
    breakpoint_threshold_type="standard_deviation"
)

In [0]:
# Adding book title metadata to help the model find the correct context
book_dict = {
    "1257": "The Three Musketeers",
    "1184": "The Count of Monte Cristo",
    "1259": "Twenty Years After",
    "2759": "The Man in the Iron Mask",
    "965": "The Black Tulip",
}


# Performing clean up of text to improve quality
def clean_section(txt):
    txt = re.sub(r"\n|\r", " ", txt)
    return re.sub(" +", " ", txt)


# Cleaning and chunking data
processed_documents = []

for document_number in list(book_dict.keys()):
    loader = GutenbergLoader(
        f"https://www.gutenberg.org/cache/epub/{document_number}/pg{document_number}.txt"
    )
    data = loader.load()
    document_text = clean_section(data[0].page_content)
    documents = text_splitter.create_documents([document_text])
    for doc in documents:
        doc.metadata["book_title"] = book_dict[document_number]
    processed_documents.append(documents)

In [0]:
# Putting vector index into ChromaDB
vectorstore = Chroma.from_documents(
    documents=processed_documents[0], embedding=db_embedding_model
)
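# Add the remaining books. processed_documents holds one list per book, so flatten it first.
# add_documents is the synchronous call; aadd_documents returns a coroutine that was never awaited.
remaining_documents = [doc for docs in processed_documents[1:] for doc in docs]
vectorstore.add_documents(remaining_documents)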
retriever = vectorstore.as_retriever(search_kwargs=dict(k=6))

## Prompt building:
- Adding an example to the prompt greatly improved the model responses
- Added visual cues for the user inputs
- Outlined the steps the model should take to build the biography
- Directed the model to not make up information to reduce hallucinations
- First versions had a lot of generic adjectives and character descriptions. I told the LLM to instead focus on key plot elements to improve the quality of the biographies.

In [0]:
# Define the prompt template with multiple user-provided input variables
custom_prompt_template = """
You are an AI assistant that writes biographies for book characters. The user will
provide you with a character name. When given a character and book title,
use the given context to create a short biography of the character from that book. Use the following steps:
1. Use the document metadata to identify the appropriate document
2. Once the correct document is identified, determine the main plot events this character was involved in
3. Then use these plot events to construct a biography using the given example to show what kind of information and what style to use.

If you are not familiar with the character provided, do not make up information.

Use only the given context to build the biography:
#### Context: {context}

Create a biography for the following character:
#### Character name: {character_name} 

Search for context from the following book:
#### Book title: {book_title}. 

Use the following example to generate output: 
#### Example: Edmond Dantès: A Biography

Edmond Dantès, the protagonist of Alexandre Dumas' novel "The Count of Monte Cristo," is a young sailor who is betrayed by his friends and imprisoned for a crime he did not commit. Born in Marseille, France, Dantès is the son of a Bonapartist admiral and is engaged to be married to his father's ward, Mercédès.

Dantès' life takes a drastic turn when he is falsely accused of treason and imprisoned in the Château d'If, a notorious island prison. While in prison, Dantès meets a fellow prisoner, Abbé Faria, who becomes his mentor and teaches him about the world outside of his cell.

After Faria's death, Dantès finds a hidden treasure that Faria had been searching for, and he uses it to escape from prison and start a new life. Adopting the persona of the wealthy and mysterious Count of Monte Cristo, Dantès sets out to clear his name and seek revenge against those who wronged him.

Throughout the novel, Dantès faces numerous challenges and obstacles as he navigates the complexities of high society and tries to uncover the truth about his past. Despite the many injustices he has suffered, Dantès remains a sympathetic and compelling character."""

# Create the PromptTemplate with the specified input variables
input_variables = ["character_name", "book_title", "context"]
PROMPT = PromptTemplate(
    template=custom_prompt_template, input_variables=input_variables
)

# Specifying LLM chains for prediction and evaluation using different models
llm_chain = LLMChain(prompt=PROMPT, llm=db_llm)
eval_llm_chain = LLMChain(prompt=PROMPT, llm=db_eval_llm)

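# Call the chain with the required inputs. Doing one example for testing.
character_name = "Aramis"
book_title = "The Three Musketeers"

# Retrieve the relevant chunks and pass their text (not the retriever object) as context
retrieved_docs = retriever.invoke(f"{character_name} in {book_title}")
context = "\n\n".join(
    f"[{doc.metadata['book_title']}] {doc.page_content}" for doc in retrieved_docs
)
result = llm_chain.run(
    {"character_name": character_name, "book_title": book_title, "context": context}
)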

print(result)

## Trulens Triad
### Context Relevance
The first step of any RAG application is retrieval; to verify the quality of our retrieval, we want to make sure that each chunk of context is relevant to the input query. This is critical because this context will be used by the LLM to form an answer, so any irrelevant information in the context could be weaved into a hallucination. TruLens enables you to evaluate context relevance by using the structure of the serialized record.

### Groundedness
After the context is retrieved, it is then formed into an answer by an LLM. LLMs are often prone to stray from the facts provided, exaggerating or expanding to a correct-sounding answer. To verify the groundedness of our application, we can separate the response into individual claims and independently search for evidence that supports each within the retrieved context.

### Answer Relevance
Last, our response still needs to helpfully answer the original question. We can verify this by evaluating the relevance of the final response to the user input.

Source: https://www.trulens.org/trulens_eval/getting_started/core_concepts/rag_triad/
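
To make the groundedness idea concrete, here is a minimal, self-contained sketch. It is not how TruLens computes the score (TruLens asks an LLM judge to assess each claim); simple token overlap stands in for the judge purely for illustration, and `tokens`, `claim_support`, and `toy_groundedness` are hypothetical helper names.

```python
import re


def tokens(text):
    """Lower-cased word set; a crude stand-in for real semantic matching."""
    return set(re.findall(r"[a-z']+", text.lower()))


def claim_support(claim, chunks):
    """Share of the claim's words found in the best-matching context chunk."""
    claim_tokens = tokens(claim)
    if not claim_tokens:
        return 0.0
    return max(
        (len(claim_tokens & tokens(chunk)) / len(claim_tokens) for chunk in chunks),
        default=0.0,
    )


def toy_groundedness(response, chunks):
    """Split the response into sentence-level claims and average their support."""
    claims = [s.strip() for s in re.split(r"(?<=[.!?])\s+", response) if s.strip()]
    scores = [claim_support(claim, chunks) for claim in claims]
    return sum(scores) / len(scores) if scores else 0.0


chunks = ["Aramis was one of the three musketeers."]
response = "Aramis was one of the three musketeers. He owned a castle in Spain."
print(toy_groundedness(response, chunks))  # 0.5: first claim supported, second not
```

The second sentence has no support in the retrieved chunk, so it pulls the score down. That is the behavior the groundedness metric is meant to surface.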

In [0]:
# Create trulens evaluation function
# Would normally put this in a separate .py file but I thought putting in a notebook would speed up review.
def run_tru_evaluation(
    app_id_name,
    character_df,
    langchain_provider,
    llm_chain,
    retriever,
    reset_database=False,
):
    tru = Tru()

    if reset_database:
        tru.reset_database()

    # select context to be used in feedback
    context = App.select_context(retriever)

    # Define a groundedness feedback function
    groundedness = (
        Feedback(
            langchain_provider.groundedness_measure_with_cot_reasons,
            name="Groundedness",
        )
        .on(context.collect())
        .on_output()
    )

    # Question/answer relevance between overall question and answer
    answer_relevance = Feedback(
        langchain_provider.relevance, name="Answer Relevance"
    ).on_input_output()
    # Question/statement relevance between question and each context chunk
    context_relevance = (
        Feedback(langchain_provider.qs_relevance, name="Context Relevance")
        .on_input()
        .on(context)
        .aggregate(np.mean)
    )

    tru_recorder = TruChain(
        llm_chain,
        app_id=app_id_name,
        feedbacks=[answer_relevance, context_relevance, groundedness],
    )

    # Loop through characters and book titles for evaluation
    for index, row in character_df.iterrows():
        character_name = row["character"]
        book_title = row["title"]

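        # Retrieve the relevant chunks and pass their text (not the retriever object) as context
        retrieved_docs = retriever.invoke(f"{character_name} in {book_title}")
        context_text = "\n\n".join(doc.page_content for doc in retrieved_docs)

        with tru_recorder as recording:
            llm_chain.invoke(
                input={
                    "character_name": character_name,
                    "book_title": book_title,
                    "context": context_text,
                }
            )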

    # Block until feedback for the last record has been computed, so the
    # results read below are populated. (The previous loop only cast values
    # it never used, and failed outside IPython because record stayed None.)
    recording.get().wait_for_feedback_results()

    records, feedback = tru.get_records_and_feedback(app_ids=[app_id_name])

    # Attach the inputs to each record (the two columns were previously swapped)
    records["character"] = character_df["character"].reset_index(drop=True)
    records["book_title"] = character_df["title"].reset_index(drop=True)

    return records

In [0]:
app_id_name = "CharacterBiographies"

# Performing evaluation
records = run_tru_evaluation(
    app_id_name,
    character_df,
    langchain_provider,
    # Evaluate the chain built on the generation model; the judge model lives in langchain_provider
    llm_chain,
    retriever,
    reset_database=False,
)

In [0]:
# Printing results
records.head(10)

In [0]:
# Showing leaderboard
tru = Tru()

tru.get_leaderboard(app_ids=[app_id_name])

## Results
| Model | Threshold Type | Embedding Model | Answer Relevance | Context Relevance | Groundedness | Latency |
| -------- | ------- | ------- | ------- | ------- | ------- | ------- |
| DBRX | Percentile | BGE | 0.983333 | 0.9025 | 0.873267 | 3.191667 |
| Llama3 | Percentile | BGE | 0.988542 | 0.870312 | 0.906493 | 3.005208 |
| Llama2 | Percentile | BGE | 0.986905 | 0.877976 | 0.89929 | 3.02381 |
| Mixtral | Percentile | BGE | 0.979167 | 0.926042 | 0.850545 | 3.358333 |
| DBRX | Percentile | GTE | 0.958333 | 0.958333 | 0.855829 | 3.5 |
| DBRX | Std Dev | GTE | 0.979167 | 0.96875 | 0.835575 | 3.416667 |
| DBRX | IQR | GTE | 0.972222 | 0.965278 | 0.811165 | 3.361111 |

### Best Model:
- **LLM**: databricks-dbrx-instruct performed best
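- **Embedding model**: databricks-gte-large-en gave substantially better context relevance than databricks-bge-large-en (0.958 vs. 0.903 with DBRX and the percentile threshold), at the cost of slightly lower answer relevance and groundedness.
- **Breakpoint threshold type**: Using the standard deviation to place chunk breakpoints gave slightly better answer and context relevance than percentile and interquartile range, though percentile scored slightly higher on groundedness.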
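
The three threshold types differ only in how they turn the distances between consecutive sentence embeddings into a cut-off. The sketch below illustrates that idea in plain Python. The formulas reflect my understanding of how LangChain's `SemanticChunker` behaves and should be treated as an assumption, and the `amount` values and sample distances are made up for illustration.

```python
import statistics


def percentile(values, q):
    """Linear-interpolation percentile, with q in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)


def breakpoint_threshold(distances, kind, amount):
    """Distance above which two neighbouring sentences are split apart."""
    if kind == "percentile":
        return percentile(distances, amount)
    if kind == "standard_deviation":
        return statistics.mean(distances) + amount * statistics.pstdev(distances)
    if kind == "interquartile":
        iqr = percentile(distances, 75) - percentile(distances, 25)
        return statistics.mean(distances) + amount * iqr
    raise ValueError(f"unknown threshold type: {kind}")


def split_points(distances, kind, amount):
    """Indices of sentence gaps whose distance exceeds the threshold."""
    threshold = breakpoint_threshold(distances, kind, amount)
    return [i for i, d in enumerate(distances) if d > threshold]


# Hypothetical cosine distances between consecutive sentences
distances = [0.10, 0.12, 0.11, 0.90, 0.10, 0.13, 0.12, 0.85, 0.11]
print(split_points(distances, "standard_deviation", 1.0))  # [3, 7]
```

On this toy input all three rules find the same two topic shifts. On real text they diverge, which is why the threshold type was worth comparing.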

## Decision Making in this Code:
- Used **langchain instead of llamaindex**
  - Llamaindex hides a lot of its functionality "under the hood". For example, it uses OpenAI by default, and after I switched that off and supplied an alternate tokenizer, it did not split the data into properly sized chunks
  - I saw much improved scalability and performance when using langchain versus llamaindex
- I chose a **semantic data chunker** for this code.
  - I started off with llamaindex's advanced RAG functions (sentence chunker, window retrieval, etc.) but they had several downsides:
    - They used substantially more memory and often wouldn't finish on a reasonably-sized cluster
    - They had trouble extracting relevant context
- I used **visual cues in my prompt** to indicate where user-defined inputs and context were located
- Langchain has a **PromptTemplate functionality** that helped clean up the prompt

## Next Steps for Model Improvement
- **Ground Truth Evaluation:** I would hand-create some "golden samples" that could be used for ground-truth evaluation to supplement the generic Trulens functions. I would also create my own custom evaluation metric and grading rubric. MLflow has a method that can then be used with an LLM-as-a-judge to evaluate the model with the custom rubric.
- **Human Evaluation:** I would perform more extensive human quality evaluation with a subject-matter expert.
- **Model Tuning:** I would do more extensive hyperparameter tuning. For example, I would adjust the number of similar chunks retrieved from the vector index (k). I would also experiment with long-context reorder post-processing so that relevant chunks are not overlooked in the middle of the context.
- **Few-Shot Learning:** I would utilize model memory to give more examples of good biographies so the model could better pick up on the correct writing style.
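
As a rough illustration of the long-context reorder idea mentioned under Model Tuning, the sketch below takes chunks already sorted from most to least relevant and moves the strongest ones to the edges of the context, where models tend to pay the most attention. `reorder_for_long_context` is a hypothetical helper written for this note; it mirrors the general technique rather than any specific library implementation.

```python
def reorder_for_long_context(chunks_by_relevance):
    """Put the most relevant chunks at the edges and the least relevant in the middle.

    Expects chunks sorted from most to least relevant.
    """
    reordered = []
    for i, chunk in enumerate(reversed(chunks_by_relevance)):
        if i % 2 == 1:
            reordered.append(chunk)      # odd positions go to the back
        else:
            reordered.insert(0, chunk)   # even positions go to the front
    return reordered


ranked = ["c1", "c2", "c3", "c4", "c5"]  # c1 is the most relevant chunk
print(reorder_for_long_context(ranked))  # ['c1', 'c3', 'c5', 'c4', 'c2']
```

The least relevant chunk (`c5`) ends up in the middle, and the two most relevant chunks sit at the start and end of the context.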