# LangChain: Evaluation 📊

## Introduction 
In this notebook, we explore several techniques for evaluating applications built with LangChain, a framework for creating applications that leverage large language models (LLMs). We will cover the following topics:  

- Example generation  
- Manual evaluation (and debugging)  
- LLM-assisted evaluation  
- LangChain evaluation platform  

### Setup

In [None]:
# Import necessary libraries
import os
from dotenv import load_dotenv, find_dotenv

# Load environment variables from .env file
_ = load_dotenv(find_dotenv()) 

**Note:** Because older model snapshots are deprecated over time, we use the current date to select an appropriate model:

In [None]:
# Handling Model Deprecation
import datetime

# Get the current date
current_date = datetime.datetime.now().date()

# Define the date after which the model should be set to "gpt-3.5-turbo"
target_date = datetime.date(2024, 6, 12)

# Set the model variable based on the current date
if current_date > target_date:
    llm_model = "gpt-3.5-turbo"
else:
    llm_model = "gpt-3.5-turbo-0301"
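
The date check above can also be wrapped in a small function so that the cutoff behavior is easy to verify for any date. This is only a sketch: `select_model` and `CUTOFF` are hypothetical names that do not appear in the notebook, and the model names are copied from the cell above.

```python
import datetime

# Hypothetical helper (not in the notebook): same cutoff logic, but the date is a parameter.
CUTOFF = datetime.date(2024, 6, 12)

def select_model(today: datetime.date) -> str:
    """Return the model name the notebook would pick on the given date."""
    return "gpt-3.5-turbo" if today > CUTOFF else "gpt-3.5-turbo-0301"

# The comparison is strict, so the cutoff day itself still selects the dated snapshot.
print(select_model(datetime.date(2024, 6, 12)))
print(select_model(datetime.date(2024, 6, 13)))
```

Passing the date in, instead of reading the clock inside the function, keeps the selection reproducible when the notebook is re-run later.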

### Create our QandA application
In this section, we build a Question and Answer (QandA) application using LangChain. This involves loading the data, creating a vector store, and initializing the LLM that generates responses. Together, these components form a QandA application that retrieves relevant documents and answers queries about the provided dataset.

In [3]:
# Import the required modules from langchain
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import CSVLoader
from langchain.indexes import VectorstoreIndexCreator
from langchain.vectorstores import DocArrayInMemorySearch
from langchain.embeddings import OpenAIEmbeddings

In [4]:
# Load the CSV file
file = 'OutdoorClothingCatalog_1000.csv'
loader = CSVLoader(file_path=file, encoding='utf-8')

In [5]:
# Load data from CSV
data = loader.load()
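
Judging from the document outputs recorded later in this notebook, each CSV row becomes one document whose text lists every column as `header: value`, one per line. The sketch below imitates that layout with the standard library only. It is an approximation for intuition, not the loader's actual implementation; `row_to_text` and the two sample rows are made up.

```python
import csv
import io

# Two made-up rows; the first header is empty, like an unnamed index column.
raw = io.StringIO(",name,description\n0,Trail Hat,Sun hat\n1,Camp Mug,Steel mug\n")

def row_to_text(row: dict) -> str:
    """Join every column as 'header: value', one column per line."""
    return "\n".join(f"{key}: {value}" for key, value in row.items())

pages = [row_to_text(row) for row in csv.DictReader(raw)]
print(repr(pages[0]))
```

An empty header name would explain why the recorded documents start with text such as `: 10`.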

In [8]:
#from langchain.embeddings import OpenAIEmbeddings  # or other embeddings
# Create the embedding model
embedding_model = OpenAIEmbeddings()  # You can use any supported embedding model here



In [9]:
# Create the vector store index
index = VectorstoreIndexCreator(
    vectorstore_cls=DocArrayInMemorySearch,
    embedding=embedding_model 
).from_loaders([loader])
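
For intuition about what an in-memory vector index does, here is a toy sketch: every text is turned into a vector, and a query returns the stored texts whose vectors are most similar. It uses word counts and cosine similarity as a stand-in for a learned embedding model. The names `embed`, `cosine`, `toy_index`, and `search` are illustrative and are not LangChain APIs.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (a stand-in for a learned embedding)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

docs = [
    "striped knit pullover set with side pockets",
    "stretch down hooded jacket with adjustable hood",
    "recycled dog mat for wet paws",
]
# Embed each document once, up front, and keep the vectors in memory.
toy_index = [(doc, embed(doc)) for doc in docs]

def search(query: str, k: int = 1) -> list:
    """Return the k stored documents most similar to the query."""
    q = embed(query)
    ranked = sorted(toy_index, key=lambda pair: cosine(q, pair[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

print(search("does the pullover set have pockets"))
```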



In [10]:
# Initialize the ChatOpenAI model with temperature 0 to make responses as repeatable as possible
llm = ChatOpenAI(temperature = 0.0, model=llm_model)

# Create a RetrievalQA chain with the specified LLM, retriever, and additional configurations
qa = RetrievalQA.from_chain_type(
    llm=llm, 
    chain_type="stuff", 
    retriever=index.vectorstore.as_retriever(), 
    verbose=True,
    chain_type_kwargs = {
        "document_separator": "<<<<>>>>>"
    }
)
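
The `"stuff"` chain type places all retrieved documents into a single prompt, joined by the document separator. A minimal sketch of that idea follows. The prompt wording and the helper `build_stuffed_prompt` are assumptions made for illustration and do not reproduce LangChain's actual prompt template.

```python
# Illustrative stand-ins: the prompt wording and helper name are assumptions.
SEPARATOR = "<<<<>>>>>"  # same separator string the notebook passes in chain_type_kwargs

def build_stuffed_prompt(question: str, retrieved_docs: list) -> str:
    """Concatenate every retrieved document into one context block."""
    context = SEPARATOR.join(retrieved_docs)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

prompt = build_stuffed_prompt(
    "Do the pants have side pockets?",
    ["name: Pullover Set\nPull-on pants have side pockets.", "name: Down Jacket\nAdjustable hood."],
)
print(prompt)
```

Because every retrieved document goes into one prompt, this approach is simple but limited by the model's context window.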



### Coming up with test datapoints

Let's take a quick look at two of the documents in the loaded dataset. 

- Test datapoint 1

In [11]:
# Access and display the 11th document in the dataset (index 10) to inspect its content and metadata
data[10]

Document(metadata={'source': 'OutdoorClothingCatalog_1000.csv', 'row': 10}, page_content=": 10\nname: Cozy Comfort Pullover Set, Stripe\ndescription: Perfect for lounging, this striped knit set lives up to its name. We used ultrasoft fabric and an easy design that's as comfortable at bedtime as it is when we have to make a quick run out.\n\nSize & Fit\n- Pants are Favorite Fit: Sits lower on the waist.\n- Relaxed Fit: Our most generous fit sits farthest from the body.\n\nFabric & Care\n- In the softest blend of 63% polyester, 35% rayon and 2% spandex.\n\nAdditional Features\n- Relaxed fit top with raglan sleeves and rounded hem.\n- Pull-on pants have a wide elastic waistband and drawstring, side pockets and a modern slim leg.\n\nImported.")

**Explanation of the output:** The output above is the document for row 10 of `OutdoorClothingCatalog_1000.csv` (the 11th row, since indexing starts at 0). It covers the "Cozy Comfort Pullover Set, Stripe": its description, size and fit details, fabric composition, and additional features such as the relaxed-fit top and pull-on pants with side pockets.

- Test datapoint 2

In [12]:
# Access and display the 12th document in the dataset (index 11) to inspect its content and metadata
data[11]

Document(metadata={'source': 'OutdoorClothingCatalog_1000.csv', 'row': 11}, page_content=': 11\nname: Ultra-Lofty 850 Stretch Down Hooded Jacket\ndescription: This technical stretch down jacket from our DownTek collection is sure to keep you warm and comfortable with its full-stretch construction providing exceptional range of motion. With a slightly fitted style that falls at the hip and best with a midweight layer, this jacket is suitable for light activity up to 20° and moderate activity up to -30°. The soft and durable 100% polyester shell offers complete windproof protection and is insulated with warm, lofty goose down. Other features include welded baffles for a no-stitch construction and excellent stretch, an adjustable hood, an interior media port and mesh stash pocket and a hem drawcord. Machine wash and dry. Imported.')

**Explanation of the output:** The output above is the document for row 11 of `OutdoorClothingCatalog_1000.csv` (the 12th row). It covers the "Ultra-Lofty 850 Stretch Down Hooded Jacket": its description, fit, activity suitability, material composition, care instructions, and additional features such as welded baffles, an adjustable hood, and an interior media port.

### Hard-coded examples

In this section, we define a set of hard-coded examples to test the QandA application. Each example pairs a specific query with its expected answer, which lets us check that the application produces correct outputs for known inputs.

In [None]:
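# Set up hard-coded examples
# Note: each backslash continuation keeps the next line's indentation inside the string,
# which is why the recorded queries contain a run of spaces.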
examples = [
    {
        "query": "Do the Cozy Comfort Pullover Set\
        have side pockets?",
        "answer": "Yes"
    },
    {
        "query": "What collection is the Ultra-Lofty \
        850 Stretch Down Hooded Jacket from?",
        "answer": "The DownTek collection"
    }
]
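
The queries above are split across lines with a backslash inside the string literal, which keeps the indentation of the continued line as part of the query. The sketch below contrasts that with adjacent string literals, which Python concatenates without extra spaces. The variable names are illustrative.

```python
# Backslash continuation inside a string literal keeps the next line's indentation.
with_backslash = "Do the Cozy Comfort Pullover Set\
        have side pockets?"

# Adjacent string literals are concatenated by Python, with no stray spaces.
with_concatenation = (
    "Do the Cozy Comfort Pullover Set "
    "have side pockets?"
)

print(repr(with_backslash))
print(repr(with_concatenation))
```

The recorded outputs in this notebook were produced with the backslash form, so they show the extra spaces.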

### LLM-Generated examples

In [14]:
from langchain.evaluation.qa import QAGenerateChain

In [15]:
# Initialize the QAGenerateChain with the LLM model
example_gen_chain = QAGenerateChain.from_llm(ChatOpenAI(model=llm_model))

In [16]:
# Generate new examples
new_examples = example_gen_chain.apply_and_parse(
    [{"doc": t} for t in data[:5]])



**Explanation of the output:** The slice `data[:5]` selects the elements at indices 0 through 4 (index 5 is excluded), so we have generated five examples, one per document.
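
For intuition about the "parse" half of `apply_and_parse`, the sketch below turns one model reply into a query/answer pair. It assumes the reply follows a `QUESTION: ... ANSWER: ...` layout. The regular expression, the helper `parse_qa`, and the sample reply are illustrative assumptions, not code taken from the notebook or a guaranteed match for the library's internals.

```python
import re

# Assumed reply layout and pattern, for illustration only.
QA_PATTERN = re.compile(r"QUESTION: (.*?)\n+ANSWER: (.*)", re.DOTALL)

def parse_qa(reply: str) -> dict:
    """Split one generated reply into a {'query': ..., 'answer': ...} pair."""
    match = QA_PATTERN.search(reply)
    if match is None:
        raise ValueError("reply does not follow the QUESTION:/ANSWER: layout")
    return {"query": match.group(1).strip(), "answer": match.group(2).strip()}

reply = "QUESTION: What collection is the jacket from?\nANSWER: The DownTek collection"
print(parse_qa(reply))
```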

Let's look at each of these five examples alongside its source document. 

- Example 1 

In [17]:
new_examples[0]

{'qa_pairs': {'query': "What is the approximate weight of the Women's Campside Oxfords per pair?",
  'answer': "The approximate weight of the Women's Campside Oxfords per pair is 1 lb. 1 oz."}}

In [18]:
data[0]

Document(metadata={'source': 'OutdoorClothingCatalog_1000.csv', 'row': 0}, page_content=": 0\nname: Women's Campside Oxfords\ndescription: This ultracomfortable lace-to-toe Oxford boasts a super-soft canvas, thick cushioning, and quality construction for a broken-in feel from the first time you put them on. \n\nSize & Fit: Order regular shoe size. For half sizes not offered, order up to next whole size. \n\nSpecs: Approx. weight: 1 lb.1 oz. per pair. \n\nConstruction: Soft canvas material for a broken-in feel and look. Comfortable EVA innersole with Cleansport NXT® antimicrobial odor control. Vintage hunt, fish and camping motif on innersole. Moderate arch contour of innersole. EVA foam midsole for cushioning and support. Chain-tread-inspired molded rubber outsole with modified chain-tread pattern. Imported. \n\nQuestions? Please contact us for any inquiries.")

- Example 2 

In [34]:
new_examples[1]

{'qa_pairs': {'query': 'What are the dimensions of the small and medium sizes of the Recycled Waterhog Dog Mat, Chevron Weave?',
  'answer': 'The small size measures 18" x 28" and the medium size measures 22.5" x 34.5".'}}

In [35]:
data[1]

Document(metadata={'source': 'OutdoorClothingCatalog_1000.csv', 'row': 1}, page_content=': 1\nname: Recycled Waterhog Dog Mat, Chevron Weave\ndescription: Protect your floors from spills and splashing with our ultradurable recycled Waterhog dog mat made right here in the USA. \n\nSpecs\nSmall - Dimensions: 18" x 28". \nMedium - Dimensions: 22.5" x 34.5".\n\nWhy We Love It\nMother nature, wet shoes and muddy paws have met their match with our Recycled Waterhog mats. Ruggedly constructed from recycled plastic materials, these ultratough mats help keep dirt and water off your floors and plastic out of landfills, trails and oceans. Now, that\'s a win-win for everyone.\n\nFabric & Care\nVacuum or hose clean.\n\nConstruction\n24 oz. polyester fabric made from 94% recycled materials.\nRubber backing.\n\nAdditional Features\nFeatures an -exclusive design.\nFeatures thick and thin fibers for scraping dirt and absorbing water.\nDries quickly and resists fading, rotting, mildew and shedding.\nUse

- Example 3 

In [37]:
new_examples[2]

{'qa_pairs': {'query': "What features does the Infant and Toddler Girls' Coastal Chill Swimsuit, Two-Piece offer according to the document?",
  'answer': 'The swimsuit offers bright colors, ruffles, exclusive whimsical prints, four-way-stretch and chlorine-resistant fabric, UPF 50+ sun protection, crossover no-slip straps, fully lined bottom for secure fit and maximum coverage. It is recommended to machine wash and line dry for best results.'}}

In [38]:
data[2]

Document(metadata={'source': 'OutdoorClothingCatalog_1000.csv', 'row': 2}, page_content=": 2\nname: Infant and Toddler Girls' Coastal Chill Swimsuit, Two-Piece\ndescription: She'll love the bright colors, ruffles and exclusive whimsical prints of this toddler's two-piece swimsuit! Our four-way-stretch and chlorine-resistant fabric keeps its shape and resists snags. The UPF 50+ rated fabric provides the highest rated sun protection possible, blocking 98% of the sun's harmful rays. The crossover no-slip straps and fully lined bottom ensure a secure fit and maximum coverage. Machine wash and line dry for best results. Imported.")

- Example 4

In [40]:
new_examples[3]

{'qa_pairs': {'query': 'What is the main fabric used in the Refresh Swimwear, V-Neck Tankini Contrasts?',
  'answer': 'The main fabric used in the Refresh Swimwear, V-Neck Tankini Contrasts is 82% recycled nylon with 18% Lycra® spandex.'}}

In [41]:
data[3]

Document(metadata={'source': 'OutdoorClothingCatalog_1000.csv', 'row': 3}, page_content=": 3\nname: Refresh Swimwear, V-Neck Tankini Contrasts\ndescription: Whether you're going for a swim or heading out on an SUP, this watersport-ready tankini top is designed to move with you and stay comfortable. All while looking great in an eye-catching colorblock style. \n\nSize & Fit\nFitted: Sits close to the body.\n\nWhy We Love It\nNot only does this swimtop feel good to wear, its fabric is good for the earth too. In recycled nylon, with Lycra® spandex for the perfect amount of stretch. \n\nFabric & Care\nThe premium Italian-blend is breathable, quick drying and abrasion resistant. \nBody in 82% recycled nylon with 18% Lycra® spandex. \nLined in 90% recycled nylon with 10% Lycra® spandex. \nUPF 50+ rated – the highest rated sun protection possible. \nHandwash, line dry.\n\nAdditional Features\nLightweight racerback straps are easy to get on and off, and won't get in your way. \nFlattering V-ne

- Example 5 

In [43]:
new_examples[4]

{'qa_pairs': {'query': 'What technology is used in the EcoFlex 3L Storm Pants to make them more breathable and waterproof?',
  'answer': 'The EcoFlex 3L Storm Pants use TEK O2 technology to provide enhanced breathability and waterproof protection.'}}

In [None]:
data[4]

Document(metadata={'source': 'OutdoorClothingCatalog_1000.csv', 'row': 4}, page_content=": 4\nname: EcoFlex 3L Storm Pants\ndescription: Our new TEK O2 technology makes our four-season waterproof pants even more breathable. It's guaranteed to keep you dry and comfortable – whatever the activity and whatever the weather. Size & Fit: Slightly Fitted through hip and thigh. \n\nWhy We Love It: Our state-of-the-art TEK O2 technology offers the most breathability we've ever tested. Great as ski pants, they're ideal for a variety of outdoor activities year-round. Plus, they're loaded with features outdoor enthusiasts appreciate, including weather-blocking gaiters and handy side zips. Air In. Water Out. See how our air-permeable TEK O2 technology keeps you dry and comfortable. \n\nFabric & Care: 100% nylon, exclusive of trim. Machine wash and dry. \n\nAdditional Features: Three-layer shell delivers waterproof protection. Brand new TEK O2 technology provides enhanced breathability. Interior gai

In [None]:
# Re-run the generation step. This is a fresh LLM call, so the wording can differ from the outputs shown above.
new_examples = example_gen_chain.apply_and_parse(
    [{"doc": t} for t in data[:5]])

In [19]:
# Unwrap each item so that 'query' and 'answer' become top-level keys
new_examples = [item['qa_pairs'] for item in new_examples]
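
The list comprehension above raises a `KeyError` if the cell is run a second time, because the items are already unwrapped by then. A tolerant variant is sketched below. `flatten_example` is a hypothetical helper and the sample items are made up.

```python
def flatten_example(item: dict) -> dict:
    """Return a flat {'query', 'answer'} dict whether or not it is wrapped in 'qa_pairs'."""
    return item.get("qa_pairs", item)

generated = [
    {"qa_pairs": {"query": "What fabric is the shell?", "answer": "100% polyester"}},
    {"query": "Is it imported?", "answer": "Yes"},  # already flat
]
flat = [flatten_example(item) for item in generated]
print(flat)
```

Applying `flatten_example` again to `flat` returns the same items, so re-running the cell is harmless.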

In [20]:
# Let's check the modified structure
new_examples[4]

{'query': 'What is the main technology used in the EcoFlex 3L Storm Pants to ensure breathability and comfort?',
 'answer': 'The main technology used in the EcoFlex 3L Storm Pants is the TEK O2 technology, which offers the most breathability ever tested by the company.'}

In [21]:
data[4]

Document(metadata={'source': 'OutdoorClothingCatalog_1000.csv', 'row': 4}, page_content=": 4\nname: EcoFlex 3L Storm Pants\ndescription: Our new TEK O2 technology makes our four-season waterproof pants even more breathable. It's guaranteed to keep you dry and comfortable – whatever the activity and whatever the weather. Size & Fit: Slightly Fitted through hip and thigh. \n\nWhy We Love It: Our state-of-the-art TEK O2 technology offers the most breathability we've ever tested. Great as ski pants, they're ideal for a variety of outdoor activities year-round. Plus, they're loaded with features outdoor enthusiasts appreciate, including weather-blocking gaiters and handy side zips. Air In. Water Out. See how our air-permeable TEK O2 technology keeps you dry and comfortable. \n\nFabric & Care: 100% nylon, exclusive of trim. Machine wash and dry. \n\nAdditional Features: Three-layer shell delivers waterproof protection. Brand new TEK O2 technology provides enhanced breathability. Interior gai

### Combine examples

Now let's combine all the examples. The combined list contains 7 examples:  
-  the two hard-coded examples (examples 1 and 2 in the new list);  
-  the five LLM-generated examples built from the CSV dataset (examples 3 to 7 in the list).   

In [22]:
# Combine hard-coded and generated examples
examples += new_examples

In [23]:
# Run the QA model on the first example
qa.run(examples[0]["query"])

> Entering new RetrievalQA chain...

> Finished chain.


'Yes, the Cozy Comfort Pullover Set does have side pockets.'

### Manual Evaluation

In this section, we evaluate the QandA application manually by turning on LangChain's debug mode. Debug mode prints the intermediate steps, inputs, and outputs of the chain, which helps us identify and debug issues. By running the QA chain with sample queries, we can check by hand that the retrieved context and the final answer are correct.

In [24]:
import langchain
langchain.debug = True

In [25]:
# Run the QA chain with a sample query for manual evaluation
qa.run(examples[0]["query"])

[chain/start] [chain:RetrievalQA] Entering Chain run with input:
{
  "query": "Do the Cozy Comfort Pullover Set        have side pockets?"
}
[chain/start] [chain:RetrievalQA > chain:StuffDocumentsChain] Entering Chain run with input:
[inputs]
[chain/start] [chain:RetrievalQA > chain:StuffDocumentsChain > chain:LLMChain] Entering Chain run with input:
{
  "context": ": 10\nname: Cozy Comfort Pullover Set, Stripe\ndescription: Perfect for lounging, this striped knit set lives up to its name. We used ultrasoft fabric and an easy design that's as comfortable at bedtime as it is when we have to make a quick run out.\n\nSize & Fit\n- Pants are Favorite Fit: Sits lower on the waist.\n- Relaxed Fit: Our most generous fit sits farthest from the body.\n\nFabric & Care\n- In the softest blend of 63% polyester, 35% rayon and 2% spandex.\n\nAdditiona

'Yes, the Cozy Comfort Pullover Set does have side pockets.'

In [26]:
# Turn off the debug mode
langchain.debug = False
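
Toggling a debug flag by hand is easy to forget, especially if the chain raises an error before the flag is reset. One option is a small context manager that restores the previous value automatically. The sketch below uses a stand-in object so that it runs without LangChain installed; `debug_enabled` is a hypothetical helper, and passing the real `langchain` module to it is assumed, not verified, to behave the same way.

```python
from contextlib import contextmanager
from types import SimpleNamespace

@contextmanager
def debug_enabled(module):
    """Set module.debug = True inside the block, then restore the old value."""
    previous = getattr(module, "debug", False)
    module.debug = True
    try:
        yield
    finally:
        # Runs even if the block raises, so the flag never stays on by accident.
        module.debug = previous

# Stand-in object so the sketch runs without LangChain installed.
fake_langchain = SimpleNamespace(debug=False)

with debug_enabled(fake_langchain):
    print("inside:", fake_langchain.debug)
print("after:", fake_langchain.debug)
```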

## LLM-assisted evaluation
In this section, we use an LLM to grade the application's answers. We first run the QA chain on every example to collect predictions, then pass the examples and predictions to `QAEvalChain`, which compares each predicted answer with the reference answer and labels it (for example, `CORRECT`). This scales better than reading every answer by hand and tolerates differences in wording that an exact string comparison would reject.
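
One reason to grade with an LLM and not with a plain string comparison is that a correct answer rarely matches the reference word for word. The sketch below applies a strict comparison to the first example's reference answer and the answer recorded for it in this notebook. `exact_match` is an illustrative helper.

```python
reference = "Yes"
prediction = "Yes, the Cozy Comfort Pullover Set does have side pockets."

def exact_match(expected: str, actual: str) -> bool:
    """Strict comparison after trimming whitespace and lowercasing."""
    return expected.strip().lower() == actual.strip().lower()

# The prediction is right, yet a strict comparison rejects it.
print(exact_match(reference, prediction))
```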

In [29]:
# Run the QA model on the provided examples to generate predictions for each query
predictions = qa.apply(examples)

> Entering new RetrievalQA chain...
> Finished chain.

> Entering new RetrievalQA chain...
> Finished chain.

> Entering new RetrievalQA chain...
> Finished chain.

> Entering new RetrievalQA chain...
> Finished chain.

> Entering new RetrievalQA chain...
> Finished chain.

> Entering new RetrievalQA chain...
> Finished chain.

> Entering new RetrievalQA chain...
> Finished chain.


In [30]:
# Iterate over the generated predictions and print each one with its example number
for i, prediction in enumerate(predictions):
    print(f"Example {i+1}: {prediction}\n")


Example 1: {'query': 'Do the Cozy Comfort Pullover Set        have side pockets?', 'answer': 'Yes', 'result': 'Yes, the Cozy Comfort Pullover Set does have side pockets.'}

Example 2: {'query': 'What collection is the Ultra-Lofty         850 Stretch Down Hooded Jacket from?', 'answer': 'The DownTek collection', 'result': 'The Ultra-Lofty 850 Stretch Down Hooded Jacket is from the DownTek collection.'}

Example 3: {'query': "What is the approximate weight of the Women's Campside Oxfords per pair?", 'answer': "The approximate weight of the Women's Campside Oxfords per pair is 1 lb. 1 oz.", 'result': "The approximate weight of the Women's Campside Oxfords per pair is 1 lb. 1 oz."}

Example 4: {'query': 'What are the dimensions of the small and medium sizes of the Recycled Waterhog Dog Mat in the Chevron Weave design?', 'answer': 'The small size has dimensions of 18" x 28" and the medium size has dimensions of 22.5" x 34.5".', 'result': 'The dimensions of the small size of the Recycled Wat

In [31]:
from langchain.evaluation.qa import QAEvalChain

In [32]:
# Initialize the ChatOpenAI model with zero temperature for deterministic responses
llm = ChatOpenAI(temperature=0, model=llm_model)

# Create an evaluation chain using the initialized LLM for question-answer evaluation
eval_chain = QAEvalChain.from_llm(llm)

In [33]:
# Evaluate the predictions against the examples using the evaluation chain, and store the graded outputs
graded_outputs = eval_chain.evaluate(examples, predictions)

In [37]:
# Iterate over each example and its corresponding prediction to display the question, real answer, predicted answer, and predicted grade
for i, eg in enumerate(examples):
    print(f"Example {i}:")
    print("Question: " + predictions[i]['query'])
    print("Real Answer: " + predictions[i]['answer'])
    print("Predicted Answer: " + predictions[i]['result'])
    # On older LangChain releases the grade may be stored under 'text' instead:
    # print("Predicted Grade: " + graded_outputs[i]['text'])
    print("Predicted Grade: " + graded_outputs[i]['results'])
    print()

Example 0:
Question: Do the Cozy Comfort Pullover Set        have side pockets?
Real Answer: Yes
Predicted Answer: Yes, the Cozy Comfort Pullover Set does have side pockets.
Predicted Grade: CORRECT

Example 1:
Question: What collection is the Ultra-Lofty         850 Stretch Down Hooded Jacket from?
Real Answer: The DownTek collection
Predicted Answer: The Ultra-Lofty 850 Stretch Down Hooded Jacket is from the DownTek collection.
Predicted Grade: CORRECT

Example 2:
Question: What is the approximate weight of the Women's Campside Oxfords per pair?
Real Answer: The approximate weight of the Women's Campside Oxfords per pair is 1 lb. 1 oz.
Predicted Answer: The approximate weight of the Women's Campside Oxfords per pair is 1 lb. 1 oz.
Predicted Grade: CORRECT

Example 3:
Question: What are the dimensions of the small and medium sizes of the Recycled Waterhog Dog Mat in the Chevron Weave design?
Real Answer: The small size has dimensions of 18" x 28" and the medium size has dimensions of 

In [38]:
graded_outputs[0]

{'results': 'CORRECT'}
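
Once every example has a grade, a single summary number is often easier to track than the individual printouts. The sketch below counts the grade labels and computes the share marked `CORRECT`. The grades are made up and only mimic the shape of `graded_outputs[0]` shown above; `summarize` is a hypothetical helper.

```python
from collections import Counter

# Made-up grades in the same shape as the recorded graded_outputs[0].
sample_grades = [{"results": "CORRECT"}, {"results": "CORRECT"}, {"results": "INCORRECT"}]

def summarize(grades: list, key: str = "results") -> dict:
    """Count each grade label and compute the share marked CORRECT."""
    counts = Counter(grade[key].strip().upper() for grade in grades)
    total = sum(counts.values())
    accuracy = counts["CORRECT"] / total if total else 0.0
    return {"counts": dict(counts), "accuracy": accuracy}

print(summarize(sample_grades))
```

The `key` parameter is there because the name of the grade field can differ between library versions.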

## Conclusion
In this notebook, we demonstrated how to evaluate a LangChain Q&A application using hard-coded and LLM-generated examples, manual inspection with debug mode, and LLM-assisted grading with `QAEvalChain`. These methods help identify potential improvements and debug issues effectively.