# LangChain: Evaluation

## Outline:

* Example generation
* Manual evaluation (and debugging)
* LLM-assisted evaluation
* LangChain evaluation platform

1. first of all, we need the chain or application that we want to evaluate

In [20]:
import os

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file

Note: LLMs do not always produce the same results. When executing the code in your notebook, you may get slightly different answers than those in the video.

In [21]:
# account for deprecation of LLM model
import datetime
# Get the current date
current_date = datetime.datetime.now().date()

# Define the date after which the model should be set to "gpt-3.5-turbo"
target_date = datetime.date(2024, 6, 12)

# Set the model variable based on the current date
if current_date > target_date:
    llm_model = "gpt-3.5-turbo"
else:
    llm_model = "gpt-3.5-turbo-0301"

## Create our QandA application

2. we are going to use the Q&A application from L4

In [22]:
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import CSVLoader
from langchain.indexes import VectorstoreIndexCreator
from langchain.vectorstores import DocArrayInMemorySearch

3. we are going to load the same data that we were using

In [23]:
file = 'OutdoorClothingCatalog_1000.csv'
loader = CSVLoader(file_path=file)
data = loader.load()

4. we are going to create the index in a single line

In [24]:
index = VectorstoreIndexCreator(
    vectorstore_cls=DocArrayInMemorySearch
).from_loaders([loader])

5. and then we are going to create the retrieval QA chain

In [25]:
llm = ChatOpenAI(temperature = 0.0, model=llm_model)
qa = RetrievalQA.from_chain_type(
    llm=llm, 
    chain_type="stuff", 
    retriever=index.vectorstore.as_retriever(), 
    verbose=True,
    chain_type_kwargs = {
        "document_separator": "<<<<>>>>>"
    }
)

### Coming up with test datapoints

6. which data points do we want to evaluate?
- we choose data points ourselves that we think are good examples
    - for each one, come up with an example question and a ground-truth answer to evaluate against
- we look at a few of the documents to understand what is inside them

In [26]:
data[10]

Document(page_content=": 10\nname: Cozy Comfort Pullover Set, Stripe\ndescription: Perfect for lounging, this striped knit set lives up to its name. We used ultrasoft fabric and an easy design that's as comfortable at bedtime as it is when we have to make a quick run out.\n\nSize & Fit\n- Pants are Favorite Fit: Sits lower on the waist.\n- Relaxed Fit: Our most generous fit sits farthest from the body.\n\nFabric & Care\n- In the softest blend of 63% polyester, 35% rayon and 2% spandex.\n\nAdditional Features\n- Relaxed fit top with raglan sleeves and rounded hem.\n- Pull-on pants have a wide elastic waistband and drawstring, side pockets and a modern slim leg.\n\nImported.", metadata={'source': 'OutdoorClothingCatalog_1000.csv', 'row': 10})

In [27]:
data[11]

Document(page_content=': 11\nname: Ultra-Lofty 850 Stretch Down Hooded Jacket\ndescription: This technical stretch down jacket from our DownTek collection is sure to keep you warm and comfortable with its full-stretch construction providing exceptional range of motion. With a slightly fitted style that falls at the hip and best with a midweight layer, this jacket is suitable for light activity up to 20° and moderate activity up to -30°. The soft and durable 100% polyester shell offers complete windproof protection and is insulated with warm, lofty goose down. Other features include welded baffles for a no-stitch construction and excellent stretch, an adjustable hood, an interior media port and mesh stash pocket and a hem drawcord. Machine wash and dry. Imported.', metadata={'source': 'OutdoorClothingCatalog_1000.csv', 'row': 11})

### Hard-coded examples

7. from the details above we create example Q&A pairs by hand
- this doesn't scale well; is there a way we can automate it?

In [28]:
examples = [
    {
        "query": "Do the Cozy Comfort Pullover Set\
        have side pockets?",
        "answer": "Yes"
    },
    {
        "query": "What collection is the Ultra-Lofty \
        850 Stretch Down Hooded Jacket from?",
        "answer": "The DownTek collection"
    }
]
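A side note on the string literals above: a backslash continuation inside a string drops the newline but keeps the indentation of the next line, so the query ends up containing a run of spaces. Below is a minimal standard-library sketch of two ways to avoid that. The names `clean_query` and `joined_query` are illustrative and not part of the notebook.

```python
import re

# Backslash continuation inside a string literal: the newline is dropped,
# but the leading spaces of the next line remain part of the string.
query = "Do the Cozy Comfort Pullover Set\
        have side pockets?"

# Option 1: collapse every run of whitespace into a single space.
clean_query = re.sub(r"\s+", " ", query).strip()

# Option 2: rely on implicit concatenation of adjacent string literals.
joined_query = (
    "Do the Cozy Comfort Pullover Set "
    "have side pockets?"
)

print(repr(query))
print(repr(clean_query))
```

Either form keeps the source line short without changing the text that is sent to the chain.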

### LLM-Generated examples

8. one of the ways to automate it is with language models themselves
- we import the QA generation chain
- it will take documents and create Q&A pairs

In [29]:
from langchain.evaluation.qa import QAGenerateChain


9. it will do this using the LLM itself

In [30]:
example_gen_chain = QAGenerateChain.from_llm(ChatOpenAI(model=llm_model))

In [31]:
# the warning below can be safely ignored

10. we will use the apply_and_parse method
- it applies the output parser to the result
- because we want to get back a dictionary with the Q&A pair, not just a string
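To make the "dictionary, not just a string" point concrete, here is a small standard-library sketch of what an output parser does. It is not LangChain's implementation: the `QUESTION:` / `ANSWER:` layout of the raw text, the `parse_qa` helper, and the sample string are all assumptions written by hand for illustration.

```python
import re

# Hand-written stand-in for raw LLM output (assumed layout, not real model output).
raw_output = (
    "QUESTION: Do the pants of the Cozy Comfort Pullover Set have side pockets?\n"
    "ANSWER: Yes"
)

# One capture group for the question, one for the answer.
QA_PATTERN = re.compile(r"QUESTION:\s*(.*?)\s*ANSWER:\s*(.*)", re.DOTALL)


def parse_qa(text: str) -> dict:
    """Turn a 'QUESTION: ... ANSWER: ...' string into a query/answer dict."""
    match = QA_PATTERN.search(text)
    if match is None:
        raise ValueError("text does not contain a QUESTION/ANSWER pair")
    return {"query": match.group(1).strip(), "answer": match.group(2).strip()}


parsed = parse_qa(raw_output)
print(parsed)
```

The resulting dictionary has the same `query` and `answer` keys as the hard-coded examples, which is what lets the two lists be combined later.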

Here I ran into problems:

ChatGPT summary and fix:
- Root cause: ChatOpenAI + QAGenerateChain tries to sum OpenAIObject token-usage fields → TypeError.
- Fix: use langchain.llms.OpenAI with gpt-3.5-turbo-instruct for the evaluator and pass d.page_content (strings), not Documents.

Course code

In [32]:
new_examples = example_gen_chain.apply_and_parse(
    [{"doc": t} for t in data[:5]]
)

TypeError: unsupported operand type(s) for +=: 'OpenAIObject' and 'OpenAIObject'
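The message has the shape Python produces when `+=` is applied to two objects whose class defines no addition. The toy reproduction below uses a hypothetical `Usage` class as a stand-in. It is not LangChain or OpenAI code and only illustrates the kind of failure described in the summary above.

```python
class Usage:
    """Hypothetical stand-in: holds a token count but defines no __add__/__iadd__."""

    def __init__(self, tokens: int):
        self.tokens = tokens


total = Usage(10)
try:
    total += Usage(5)  # no addition defined for Usage, so this raises TypeError
except TypeError as err:
    message = str(err)

print(message)

# Summing plain integers taken out of the objects works as expected.
total_tokens = sum(u.tokens for u in [Usage(10), Usage(5)])
print(total_tokens)
```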

ChatGPT solution:
1. Fix (staying on the pinned versions)
- Use the completions wrapper OpenAI (instruct model) for the evaluator; QAGenerateChain works fine with OpenAI in this version.
2. Alternative (keeping ChatOpenAI but avoiding the evaluator)
- If you must use ChatOpenAI, avoid QAGenerateChain.apply_and_parse on this stack (or upgrade LangChain, which was not an option here).
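The second half of the fix summarized earlier, passing `page_content` strings instead of `Document` objects, is a plain data-shaping step. The sketch below uses a hypothetical `Doc` dataclass and two shortened hand-written records as stand-ins. In the notebook the same comprehension would run over the `data` list loaded from the CSV.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a loaded document (text plus metadata)."""

    page_content: str
    metadata: dict = field(default_factory=dict)


# Shortened, hand-written records used only for illustration.
data = [
    Doc(": 0\nname: Women's Campside Oxfords", {"row": 0}),
    Doc(": 10\nname: Cozy Comfort Pullover Set, Stripe", {"row": 10}),
]

# Shape used by the course code: the whole object is passed as "doc".
object_inputs = [{"doc": d} for d in data[:5]]

# Shape suggested by the fix: only the text is passed as "doc".
text_inputs = [{"doc": d.page_content} for d in data[:5]]

print(type(object_inputs[0]["doc"]).__name__, type(text_inputs[0]["doc"]).__name__)
```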

11. we get a query and the answers

In [19]:
new_examples[0]

NameError: name 'new_examples' is not defined

12. let's check the document

In [33]:
data[0]

Document(page_content=": 0\nname: Women's Campside Oxfords\ndescription: This ultracomfortable lace-to-toe Oxford boasts a super-soft canvas, thick cushioning, and quality construction for a broken-in feel from the first time you put them on. \n\nSize & Fit: Order regular shoe size. For half sizes not offered, order up to next whole size. \n\nSpecs: Approx. weight: 1 lb.1 oz. per pair. \n\nConstruction: Soft canvas material for a broken-in feel and look. Comfortable EVA innersole with Cleansport NXT® antimicrobial odor control. Vintage hunt, fish and camping motif on innersole. Moderate arch contour of innersole. EVA foam midsole for cushioning and support. Chain-tread-inspired molded rubber outsole with modified chain-tread pattern. Imported. \n\nQuestions? Please contact us for any inquiries.", metadata={'source': 'OutdoorClothingCatalog_1000.csv', 'row': 0})

### Combine examples

13. add the generated examples to the examples we created by hand

In [34]:
examples += new_examples

NameError: name 'new_examples' is not defined

14. run an example through the chain and see what result it produces
- we run the query and get an answer
- the output is limited in what it shows: what is the prompt, what are the documents?
- sometimes it is not enough to see just the final answer

In [35]:
qa.run(examples[0]["query"])



> Entering new RetrievalQA chain...

> Finished chain.


'Yes, the Cozy Comfort Pullover Set does have side pockets on the pull-on pants.'

## Manual Evaluation

15. if we turn on the debug utility, the chain prints a lot more information
- often the problem is in the retrieval step

In [36]:
import langchain
langchain.debug = True

16. it is printing out a lot more information
- retrieval
- stuff
- llm chain
- question
- context
- what is entering the LLM
- full prompt that is passed
- human question
- return time, token usage, model name
- final response - side pockets

In [37]:
qa.run(examples[0]["query"])

[32;1m[1;3m[chain/start][0m [1m[1:RunTypeEnum.chain:RetrievalQA] Entering Chain run with input:
[0m{
  "query": "Do the Cozy Comfort Pullover Set        have side pockets?"
}
[32;1m[1;3m[chain/start][0m [1m[1:RunTypeEnum.chain:RetrievalQA > 2:RunTypeEnum.chain:StuffDocumentsChain] Entering Chain run with input:
[0m[inputs]
[32;1m[1;3m[chain/start][0m [1m[1:RunTypeEnum.chain:RetrievalQA > 2:RunTypeEnum.chain:StuffDocumentsChain > 3:RunTypeEnum.chain:LLMChain] Entering Chain run with input:
[0m{
  "question": "Do the Cozy Comfort Pullover Set        have side pockets?",
  "context": ": 10\nname: Cozy Comfort Pullover Set, Stripe\ndescription: Perfect for lounging, this striped knit set lives up to its name. We used ultrasoft fabric and an easy design that's as comfortable at bedtime as it is when we have to make a quick run out.\n\nSize & Fit\n- Pants are Favorite Fit: Sits lower on the waist.\n- Relaxed Fit: Our most generous fit sits farthest from the body.\n\nFabric & C

'Yes, the Cozy Comfort Pullover Set does have side pockets on the pull-on pants.'

In [38]:
# Turn off the debug mode
langchain.debug = False

## LLM assisted evaluation

17. create predictions for all the different examples

In [39]:
predictions = qa.apply(examples)



> Entering new RetrievalQA chain...

> Finished chain.


> Entering new RetrievalQA chain...

> Finished chain.


18. import the QA eval chain

In [40]:
from langchain.evaluation.qa import QAEvalChain

19. create this chain with a language model

In [41]:
llm = ChatOpenAI(temperature=0, model=llm_model)
eval_chain = QAEvalChain.from_llm(llm)

20. call evaluate on this chain, pass examples and predictions and get back a bunch of graded outputs

In [42]:
graded_outputs = eval_chain.evaluate(examples, predictions)

TypeError: unsupported operand type(s) for +=: 'OpenAIObject' and 'OpenAIObject'

21. to see what is going on with each example, we loop through them
- print the question, the real answer (generated by the LLM for the generated examples), the predicted answer, and the predicted grade (both generated by the LLM)

In [43]:
for i, eg in enumerate(examples):
    print(f"Example {i}:")
    print("Question: " + predictions[i]['query'])
    print("Real Answer: " + predictions[i]['answer'])
    print("Predicted Answer: " + predictions[i]['result'])
    print("Predicted Grade: " + graded_outputs[i]['text'])
    print()

Example 0:
Question: Do the Cozy Comfort Pullover Set        have side pockets?
Real Answer: Yes
Predicted Answer: Yes, the Cozy Comfort Pullover Set does have side pockets.


NameError: name 'graded_outputs' is not defined
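The loop stops with a NameError because the grading cell failed and `graded_outputs` was never created. A defensive sketch that prints whatever is available might look like the following. `format_report` is a hypothetical helper, and the single prediction record is copied by hand from the output above with the query whitespace tidied.

```python
def format_report(predictions, graded_outputs):
    """Build the per-example report, tolerating missing grades."""
    lines = []
    for i, pred in enumerate(predictions):
        if i < len(graded_outputs):
            # "text" is the key used by the loop in this notebook.
            grade = graded_outputs[i].get("text", "<no grade text>")
        else:
            grade = "<not graded>"
        lines.append(f"Example {i}:")
        lines.append("Question: " + pred["query"])
        lines.append("Real Answer: " + pred["answer"])
        lines.append("Predicted Answer: " + pred["result"])
        lines.append("Predicted Grade: " + grade)
        lines.append("")
    return "\n".join(lines)


# Hand-copied record for illustration; the grading step is assumed to have failed.
predictions = [
    {
        "query": "Do the Cozy Comfort Pullover Set have side pockets?",
        "answer": "Yes",
        "result": "Yes, the Cozy Comfort Pullover Set does have side pockets.",
    }
]
graded_outputs = []

report = format_report(predictions, graded_outputs)
print(report)
```

This keeps the questions and predictions inspectable even when no grades exist yet.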

22. let's look at the 1st example
- answers are arbitrary strings; as long as they have the same semantic meaning they should be graded as similar
- plain string comparison cannot do that, so we need new evaluation metrics
- we are using the LLM to do the evaluation

In [44]:
graded_outputs[0]

NameError: name 'graded_outputs' is not defined
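To see why string-based metrics fall short for step 22, the sketch below compares the reference answer with the chain's prediction from earlier in the notebook. It uses exact matching and a character-level similarity from the standard library. `exact_match` is an illustrative helper and is not part of LangChain.

```python
from difflib import SequenceMatcher

reference = "Yes"
prediction = (
    "Yes, the Cozy Comfort Pullover Set does have side pockets "
    "on the pull-on pants."
)


def exact_match(a: str, b: str) -> bool:
    """Case-insensitive exact comparison after trimming whitespace."""
    return a.strip().lower() == b.strip().lower()


is_exact = exact_match(reference, prediction)

# Character-level similarity in [0, 1]; the long prediction scores low
# even though it agrees with the reference.
similarity = SequenceMatcher(None, reference.lower(), prediction.lower()).ratio()

print(is_exact, round(similarity, 3))
```

Both checks rate the prediction as a poor match even though it means the same thing as the reference, which is the gap that LLM-assisted grading is meant to close.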