## <b><font color='darkblue'>Preface</font></b>
([source](https://learn.deeplearning.ai/courses/langchain/lesson/6/evaluation)) <b><font size='3ptx'>Building applications with language models involves many moving parts. One of the most critical components is ensuring that the outcomes produced by your models are reliable and useful across a broad array of inputs, and that they work well with your application's other software components. ([more](https://python.langchain.com/v0.1/docs/guides/productionization/evaluation/))</font></b>

<b> Ensuring reliability usually boils down to some combination of application design, testing & evaluation, and runtime checks.</b>

The guides in this section review the APIs and functionality LangChain provides to help you better evaluate your applications. Evaluation and testing are both critical when thinking about deploying LLM applications, since production environments require repeatable and useful outcomes.

### <b><font color='darkgreen'>Outline</font></b>
* Example generation
* Manual evaluation (and debugging)
* LLM-assisted evaluation
* LangChain evaluation platform

In [1]:
!pip freeze | grep -P '(openai|langchain)'

langchain==0.2.1
langchain-anthropic==0.1.15
langchain-community==0.0.38
langchain-core==0.2.3
langchain-google-genai==1.0.6
langchain-groq==0.1.3
langchain-openai==0.1.9
langchain-text-splitters==0.2.0
langchainhub==0.1.14
openai==1.28.1


In [2]:
import datetime
import os
import openai
from dotenv import load_dotenv, find_dotenv
from langchain_openai import ChatOpenAI
import pandas as pd


TEST_DATA = pd.DataFrame({'index': ['row1', 'row2'], 'review': ['review1', 'review2']})

a = load_dotenv(find_dotenv(os.path.expanduser('~/.env'))) # read local .env file
openai.api_key = os.environ['OPENAI_API_KEY']

In [12]:
# account for deprecation of LLM model
import datetime
# Get the current date
current_date = datetime.datetime.now().date()

# Define the date after which the model should be set to "gpt-3.5-turbo"
target_date = datetime.date(2024, 6, 12)

# Set the model variable based on the current date
if current_date > target_date:
    llm_model = "gpt-3.5-turbo"
else:
    llm_model = "gpt-3.5-turbo-0301"
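The date check above can also be wrapped in a small function so the cutoff behaviour is easy to verify without waiting for the date to pass. This is only a sketch: `pick_model` and its `cutoff` parameter are hypothetical names introduced here, and the two model names are copied from the cell above.

```python
import datetime


def pick_model(today, cutoff=datetime.date(2024, 6, 12)):
    """Return the model name for a given date (hypothetical helper).

    Mirrors the cell above: dates strictly after the cutoff use
    "gpt-3.5-turbo", everything else uses "gpt-3.5-turbo-0301".
    """
    if today > cutoff:
        return "gpt-3.5-turbo"
    return "gpt-3.5-turbo-0301"


# Equivalent to the if/else in the notebook cell
llm_model = pick_model(datetime.datetime.now().date())
print(llm_model)
```

Because the comparison is strict, the cutoff date itself still selects the older model name.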

## <b><font color='darkblue'>Create our Q and A application</font></b>
We will leverage [the code in Ch5](https://github.com/johnklee/ml_articles/blob/master/deeplearning_ai/langchain/ch5_question_and_answer.ipynb) to build a QA chatbot for evaluation:

In [22]:
from langchain.chains import RetrievalQA
from langchain.indexes import VectorstoreIndexCreator
from langchain_community.document_loaders import TextLoader
from langchain_community.document_loaders.csv_loader import CSVLoader
# HuggingFaceEmbeddings lives in langchain_community; the langchain.embeddings path is deprecated
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores.docarray.in_memory import DocArrayInMemorySearch
# ChatOpenAI comes from langchain_openai; the duplicate langchain.chat_models import is dropped
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAI
from IPython.display import display, Markdown

TEST_CSV_FILE_PATH = 'test_data/qa_data.csv'

In [24]:
loader = CSVLoader(TEST_CSV_FILE_PATH)
docs = loader.load()

In [25]:
model_name = "sentence-transformers/paraphrase-multilingual-mpnet-base-v2" # Selecting a sentence embedding model
#model_kwargs = {'device': 'cuda'}
#encode_kwargs = {'normalize_embeddings': False}
hf_embeddings = HuggingFaceEmbeddings(
    model_name=model_name,
)

In [26]:
index = VectorstoreIndexCreator(
    vectorstore_cls=DocArrayInMemorySearch,
    embedding=hf_embeddings,
).from_loaders([loader])

In [27]:
db = DocArrayInMemorySearch.from_documents(
    docs, 
    hf_embeddings
)

In [28]:
llm = ChatOpenAI(temperature = 0.0, model=llm_model)
retriever = db.as_retriever()

In [29]:
qa_stuff = RetrievalQA.from_chain_type(
    llm=llm, 
    chain_type="stuff", 
    retriever=retriever, 
    verbose=True
)

### <b><font color='darkgreen'>Coming up with test datapoints</font></b>

In [31]:
docs[0].page_content

'type: car\ndescription: The 2024 ElectroZoom is a sleek, all-electric sedan designed for the modern driver. Available in a range of vibrant colors, including Sapphire Blue, Ruby Red, and Onyx Black, the ElectroZoom boasts a spacious interior with premium vegan leather seating and state-of-the-art technology features. With a range of up to 350 miles on a single charge and lightning-fast acceleration, the ElectroZoom offers both performance and sustainability.'

In [32]:
docs[1].page_content

'type: clothes\ndescription: The CozyCloud Hoodie is a unisex pullover made from ultra-soft, organic cotton fleece. Available in sizes XS to 3XL and a variety of calming colors like Sky Blue, Lavender Mist, and Charcoal Gray, the CozyCloud Hoodie features a relaxed fit, kangaroo pocket, and adjustable drawstring hood. Perfect for lounging at home or layering for outdoor adventures, the CozyCloud Hoodie is designed for ultimate comfort and versatility.'

### <b><font color='darkgreen'>Hard-coded examples</font></b>

In [42]:
examples = [
    {
        "query": "What is the estimated range of the 2024 ElectroZoom on a single charge?",
        "answer": "The 2024 ElectroZoom has an estimated range of up to 350 miles on a single charge."
    },
    {
        "query": "What sizes are available for the CozyCloud Hoodie?",
        "answer": "The CozyCloud Hoodie is available in sizes XS to 3XL."
    }
]
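Before hand-written examples are fed to a chain, it may help to confirm that each one has the `query` and `answer` keys used throughout this notebook. The check below is a minimal sketch; `validate_examples` is a hypothetical helper, not part of LangChain, and the sample data is a copy of one example from the cell above plus one deliberately broken entry.

```python
def validate_examples(examples):
    """Return a list of (index, key) pairs for missing or empty fields.

    Hypothetical helper: an empty list means every example has a
    non-empty string under both "query" and "answer".
    """
    problems = []
    for i, example in enumerate(examples):
        for key in ("query", "answer"):
            value = example.get(key) if isinstance(example, dict) else None
            if not isinstance(value, str) or not value.strip():
                problems.append((i, key))
    return problems


sample_examples = [
    {
        "query": "What sizes are available for the CozyCloud Hoodie?",
        "answer": "The CozyCloud Hoodie is available in sizes XS to 3XL.",
    },
    {"query": "A question with no reference answer"},  # deliberately incomplete
]

print(validate_examples(sample_examples))
```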

### <b><font color='darkgreen'>LLM-Generated examples</font></b>

In [34]:
from langchain.evaluation.qa import QAGenerateChain

In [35]:
example_gen_chain = QAGenerateChain.from_llm(ChatOpenAI(model=llm_model))

In [36]:
new_examples = example_gen_chain.apply_and_parse(
    [{"doc": t} for t in docs[-5:]]
)



In [37]:
new_examples[0]

{'qa_pairs': {'query': 'Who is the 13.3-inch MacBook Pro perfect for according to the document?',
  'answer': 'According to the document, the 13.3-inch MacBook Pro is perfect for students, professionals, and creative individuals.'}}

In [38]:
docs[-5].page_content

'type: 3c\ndescription: The 13.3-inch MacBook Pro is a powerful and versatile laptop that is perfect for students, professionals, and creative individuals. It features a stunning Retina display, a long-lasting battery, and a powerful M1 chip.'

### <b><font color='darkgreen'>Combine examples</font></b>

In [43]:
examples += [new_example['qa_pairs'] for new_example in new_examples]

In [44]:
examples[:4]

[{'query': 'What is the estimated range of the 2024 ElectroZoom on a single charge?',
  'answer': 'The 2024 ElectroZoom has an estimated range of up to 350 miles on a single charge.'},
 {'query': 'What sizes are available for the CozyCloud Hoodie?',
  'answer': 'The CozyCloud Hoodie is available in sizes XS to 3XL.'},
 {'query': 'Who is the 13.3-inch MacBook Pro perfect for according to the document?',
  'answer': 'According to the document, the 13.3-inch MacBook Pro is perfect for students, professionals, and creative individuals.'},
 {'query': 'What are the key features of the stiletto heels described in the document?',
  'answer': 'The key features of the stiletto heels are a pointed toe, a stiletto heel, and a heel height of 4 inches.'}]
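The combining step relies on each generated item being wrapped in a `qa_pairs` key, as in the `new_examples[0]` output shown earlier. A slightly more defensive version of that flattening could look like the sketch below. `flatten_generated` is a hypothetical helper, and accepting already-flat items is an assumption added for illustration, not documented LangChain behaviour.

```python
def flatten_generated(generated):
    """Unwrap {'qa_pairs': {...}} items into flat query/answer dicts.

    Hypothetical helper. Items that are already flat are passed
    through unchanged (an assumption made for this sketch).
    """
    flat = []
    for item in generated:
        pair = item.get("qa_pairs", item)
        flat.append({"query": pair["query"], "answer": pair["answer"]})
    return flat


# Sample data shaped like the generated output shown earlier
generated_sample = [
    {"qa_pairs": {"query": "Q1?", "answer": "A1."}},
    {"query": "Q2?", "answer": "A2."},
]

hard_coded_sample = [{"query": "Q0?", "answer": "A0."}]
combined = hard_coded_sample + flatten_generated(generated_sample)
print(len(combined))
```

Building a new list instead of using `+=` leaves the hard-coded list untouched, which avoids duplicated entries if the cell is re-run.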

## <b><font color='darkblue'>Manual Evaluation</font></b>

In [45]:
import langchain
langchain.debug = True

In [46]:
qa_stuff.invoke(examples[0]["query"])

[32;1m[1;3m[chain/start][0m [1m[chain:RetrievalQA] Entering Chain run with input:
[0m{
  "query": "What is the estimated range of the 2024 ElectroZoom on a single charge?"
}
[32;1m[1;3m[chain/start][0m [1m[chain:RetrievalQA > chain:StuffDocumentsChain] Entering Chain run with input:
[0m[inputs]
[32;1m[1;3m[chain/start][0m [1m[chain:RetrievalQA > chain:StuffDocumentsChain > chain:LLMChain] Entering Chain run with input:
[0m{
  "question": "What is the estimated range of the 2024 ElectroZoom on a single charge?",
  "context": "type: car\ndescription: The 2024 ElectroZoom is a sleek, all-electric sedan designed for the modern driver. Available in a range of vibrant colors, including Sapphire Blue, Ruby Red, and Onyx Black, the ElectroZoom boasts a spacious interior with premium vegan leather seating and state-of-the-art technology features. With a range of up to 350 miles on a single charge and lightning-fast acceleration, the ElectroZoom offers both performance and sustain

{'query': 'What is the estimated range of the 2024 ElectroZoom on a single charge?',
 'result': 'The estimated range of the 2024 ElectroZoom on a single charge is up to 350 miles.'}

In [47]:
# Turn off the debug mode
langchain.debug = False
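The debug trace shows the retrieved text arriving in the `context` field next to the `question`, which is what the `"stuff"` chain type does: it places the retrieved documents into a single prompt. The sketch below imitates that step in plain Python so the inputs to the LLM call are easier to reason about. `stuff_inputs` is a hypothetical function, and the blank-line separator is an assumption rather than a statement about LangChain internals.

```python
def stuff_inputs(question, page_contents, separator="\n\n"):
    """Build the question/context pair that a 'stuff'-style chain sends to the LLM.

    Hypothetical sketch: all retrieved texts are joined into one context
    string. The separator is an assumption for illustration.
    """
    context = separator.join(page_contents)
    return {"question": question, "context": context}


# Shortened stand-ins for retrieved documents
retrieved = [
    "type: car\ndescription: The 2024 ElectroZoom is a sleek, all-electric sedan.",
    "type: clothes\ndescription: The CozyCloud Hoodie is a unisex pullover.",
]

inputs = stuff_inputs(
    "What is the estimated range of the 2024 ElectroZoom on a single charge?",
    retrieved,
)
print(inputs["context"])
```

When a manual check returns a wrong answer, comparing the `context` in the trace against the expected source document helps separate retrieval problems from generation problems.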

## <b><font color='darkblue'>LLM assisted evaluation</font></b>

In [51]:
from langchain.evaluation.qa import QAEvalChain

In [50]:
predictions = qa_stuff.batch(examples)



> Entering new RetrievalQA chain...
> Entering new RetrievalQA chain...
> Entering new RetrievalQA chain...
> Entering new RetrievalQA chain...
> Entering new RetrievalQA chain...
> Entering new RetrievalQA chain...
> Entering new RetrievalQA chain...

> Finished chain.
> Finished chain.
> Finished chain.
> Finished chain.
> Finished chain.
> Finished chain.
> Finished chain.


In [52]:
llm = ChatOpenAI(temperature=0, model=llm_model)
eval_chain = QAEvalChain.from_llm(llm)

In [53]:
graded_outputs = eval_chain.evaluate(examples, predictions)

In [58]:
graded_outputs[1]

{'results': 'CORRECT'}

In [59]:
for i, eg in enumerate(examples):
    print(f"Example {i}:")
    # Read the question and reference answer from the example itself
    print("Question: " + eg['query'])
    print("Real Answer: " + eg['answer'])
    print("Predicted Answer: " + predictions[i]['result'])
    print("Predicted Grade: " + graded_outputs[i]['results'])
    print()

Example 0:
Question: What is the estimated range of the 2024 ElectroZoom on a single charge?
Real Answer: The 2024 ElectroZoom has an estimated range of up to 350 miles on a single charge.
Predicted Answer: The estimated range of the 2024 ElectroZoom on a single charge is up to 350 miles.
Predicted Grade: CORRECT

Example 1:
Question: What sizes are available for the CozyCloud Hoodie?
Real Answer: The CozyCloud Hoodie is available in sizes XS to 3XL.
Predicted Answer: The CozyCloud Hoodie is available in sizes XS to 3XL.
Predicted Grade: CORRECT

Example 2:
Question: Who is the 13.3-inch MacBook Pro perfect for according to the document?
Real Answer: According to the document, the 13.3-inch MacBook Pro is perfect for students, professionals, and creative individuals.
Predicted Answer: The 13.3-inch MacBook Pro is perfect for students, professionals, and creative individuals.
Predicted Grade: CORRECT

Example 3:
Question: What are the key features of the stiletto heels described in the 
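Reading grades one by one gets tedious as the example set grows, so a short summary can help. The sketch below counts the grades and computes the share marked `CORRECT`. `summarize_grades` is a hypothetical helper; it assumes each graded output is a dict with a `results` key, as in the `graded_outputs[1]` output above, and the sample grades are made up for illustration.

```python
from collections import Counter


def summarize_grades(graded_outputs, key="results"):
    """Count grades and return (counts, fraction graded CORRECT).

    Hypothetical helper. Grades are normalized with strip/upper because
    an LLM grader may not always return identical formatting.
    """
    counts = Counter(g[key].strip().upper() for g in graded_outputs)
    total = sum(counts.values())
    accuracy = counts.get("CORRECT", 0) / total if total else 0.0
    return counts, accuracy


# Made-up grades shaped like the evaluator output
sample_grades = [
    {"results": "CORRECT"},
    {"results": "CORRECT"},
    {"results": "INCORRECT"},
    {"results": " correct"},
]

counts, accuracy = summarize_grades(sample_grades)
print(dict(counts), accuracy)
```

The grader is itself an LLM, so this number is best treated as a rough signal and spot-checked by hand.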

## <b><font color='darkblue'>LangChain evaluation platform</font></b>
The LangChain evaluation platform, LangChain Plus, can be accessed at https://www.langchain.plus/ using the invite code `lang_learners_2023`.

## <b><font color='darkblue'>Supplement</font></b>
* [Deeplearning.ai - Langchain Ch2: Model, prompt and parser](https://github.com/johnklee/ml_articles/blob/master/deeplearning_ai/langchain/ch2_model_prompt_and_parser.ipynb)
* [Deeplearning.ai - Langchain Ch3: Memory](https://github.com/johnklee/ml_articles/blob/master/deeplearning_ai/langchain/ch3_memory.ipynb)
* [Deeplearning.ai - Langchain Ch4: Chain](https://github.com/johnklee/ml_articles/blob/master/deeplearning_ai/langchain/ch4_chains.ipynb)
* [Deeplearning.ai - Langchain Ch5: Question and answer](https://github.com/johnklee/ml_articles/blob/master/deeplearning_ai/langchain/ch5_question_and_answer.ipynb)
* [Deeplearning.ai - Langchain Ch6: Evaluation](https://github.com/johnklee/ml_articles/blob/master/deeplearning_ai/langchain/ch6_evaluation.ipynb)
* [Deeplearning.ai - Langchain Ch7: Agents](https://github.com/johnklee/ml_articles/blob/master/deeplearning_ai/langchain/ch7_agents.ipynb)