# Semantic Kernel: Evaluation

## Outline:

* Example generation
* Manual evaluation (and debugging)

Prerequisite: You have run the L4-SK-CreateDB notebook to populate the vector database with the catalog from the CSV file.


In [3]:
import os

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file

In [42]:
# Standard Semantic Kernel initialization
import semantic_kernel as sk
import os
import logging
from semantic_kernel.connectors.ai.open_ai import OpenAIChatCompletion

logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger('__name__')
kernel=sk.Kernel(log=logger)

api_key = os.environ['OPENAI_API_KEY']
kernel.add_chat_service(
        "chat-gpt", OpenAIChatCompletion("gpt-3.5-turbo", api_key)
)

<semantic_kernel.kernel.Kernel at 0x7ff588cb7970>

In [43]:
from semantic_kernel.connectors.ai.open_ai import OpenAITextEmbedding
kernel.add_text_embedding_generation_service(
        "ada", OpenAITextEmbedding("text-embedding-ada-002", api_key)
    )

<semantic_kernel.kernel.Kernel at 0x7ff588cb7970>

In [44]:
# Register the pre-populated vector store with embeddings from the catalog CSV (created in the L4-SK-CreateDB notebook)
from semantic_kernel.connectors.memory.chroma import ChromaMemoryStore
memstore=ChromaMemoryStore(persist_directory="catalog")
kernel.register_memory_store(memory_store=memstore)

INFO:chromadb.telemetry.posthog:Anonymized telemetry enabled. See https://docs.trychroma.com/telemetry for more information.
INFO:chromadb.db.duckdb:loaded in 1000 embeddings
INFO:chromadb.db.duckdb:loaded in 1 collections


## Create our Q&A application

In [45]:
import pandas as pd
df = pd.read_csv('OutdoorClothingCatalog_1000.csv')

### Coming up with test datapoints

In [46]:
print("name:", df.iloc[10]["name"], "\ndescription:", df.iloc[10].description)

name: Cozy Comfort Pullover Set, Stripe 
description: Perfect for lounging, this striped knit set lives up to its name. We used ultrasoft fabric and an easy design that's as comfortable at bedtime as it is when we have to make a quick run out.

Size & Fit
- Pants are Favorite Fit: Sits lower on the waist.
- Relaxed Fit: Our most generous fit sits farthest from the body.

Fabric & Care
- In the softest blend of 63% polyester, 35% rayon and 2% spandex.

Additional Features
- Relaxed fit top with raglan sleeves and rounded hem.
- Pull-on pants have a wide elastic waistband and drawstring, side pockets and a modern slim leg.

Imported.


In [47]:
print("name:", df.iloc[11]["name"], "\ndescription:", df.iloc[11].description)

name: Ultra-Lofty 850 Stretch Down Hooded Jacket 
description: This technical stretch down jacket from our DownTek collection is sure to keep you warm and comfortable with its full-stretch construction providing exceptional range of motion. With a slightly fitted style that falls at the hip and best with a midweight layer, this jacket is suitable for light activity up to 20° and moderate activity up to -30°. The soft and durable 100% polyester shell offers complete windproof protection and is insulated with warm, lofty goose down. Other features include welded baffles for a no-stitch construction and excellent stretch, an adjustable hood, an interior media port and mesh stash pocket and a hem drawcord. Machine wash and dry. Imported.


### Hard-coded examples

In [48]:
# Manually create some questions from the two examples above
examples = [
    {
        "query": "Do the Cozy Comfort Pullover Set\
        have side pockets?",
        "answer": "Yes"
    },
    {
        "query": "What collection is the Ultra-Lofty \
        850 Stretch Down Hooded Jacket from?",
        "answer": "The DownTek collection"
    }
]
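
One caveat with the cell above: a backslash inside a string literal joins the two source lines but keeps the second line's leading indentation, so each query ends up with a run of spaces in the middle (visible in the outputs further down). A minimal sketch of an alternative that relies on Python's implicit concatenation of adjacent string literals; the name `examples_clean` is introduced here for illustration only and is not used elsewhere in the notebook:

```python
# Adjacent string literals inside parentheses are concatenated at compile time,
# so no source indentation leaks into the resulting text.
examples_clean = [
    {
        "query": ("Do the Cozy Comfort Pullover Set "
                  "have side pockets?"),
        "answer": "Yes",
    },
    {
        "query": ("What collection is the Ultra-Lofty "
                  "850 Stretch Down Hooded Jacket from?"),
        "answer": "The DownTek collection",
    },
]

print(examples_clean[0]["query"])
```

The extra spaces did not stop the model from answering in this run, but they do change the text that gets embedded for retrieval, so cleaner queries are a reasonable default.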

### LLM-Generated examples

#### Pick a few records and have the LLM generate more evaluation questions, instead of writing them manually

In [49]:
# Fetch records 0-2 from the database; we will pass these to the LLM to generate question/answer pairs for evaluation
docs = await memstore.get_batch_async(collection_name="outdoordb", keys=["0", "1", "2"], with_embeddings=False)


In [50]:
docs[0]._text

"Women's Campside Oxfords :  This ultracomfortable lace-to-toe Oxford boasts a super-soft canvas, thick cushioning, and quality construction for a broken-in feel from the first time you put them on. \n\nSize & Fit: Order regular shoe size. For half sizes not offered, order up to next whole size. \n\nSpecs: Approx. weight: 1 lb.1 oz. per pair. \n\nConstruction: Soft canvas material for a broken-in feel and look. Comfortable EVA innersole with Cleansport NXT® antimicrobial odor control. Vintage hunt, fish and camping motif on innersole. Moderate arch contour of innersole. EVA foam midsole for cushioning and support. Chain-tread-inspired molded rubber outsole with modified chain-tread pattern. Imported. \n\nQuestions? Please contact us for any inquiries."

In [51]:
qdocs = "\n```\n".join(doc._text for doc in docs)

In [52]:
async def example_gen_from_llm(qdocs):

    # Ask the LLM to generate questions and answers from the records fetched from the vector DB.
    # The returned context exposes the generated text under the "input" key.
    prompt = """{{ $qdocs}} 
    
    Question: Please generate one question and answer for each of above record delimited by triple backticks  
    and return results in a well formed JSON list with fields named as query and answer.
    """
    
    questgen = kernel.create_semantic_function(prompt, temperature=0.0)
    context_variables = sk.ContextVariables(variables={
        "qdocs": qdocs
    })
    response = questgen(variables=context_variables)
    return response

In [53]:
new_examples = await example_gen_from_llm(qdocs)


DEBUG:__name__:Extracting blocks from template: {{ $qdocs}} 
    
    Question: Please generate one question and answer for each of above record delimited by triple backticks  
    and return results in a well formed JSON list with fields named as query and answer.
    
DEBUG:asyncio:Using selector: EpollSelector
DEBUG:__name__:Rendering string template: {{ $qdocs}} 
    
    Question: Please generate one question and answer for each of above record delimited by triple backticks  
    and return results in a well formed JSON list with fields named as query and answer.
    
DEBUG:__name__:Extracting blocks from template: {{ $qdocs}} 
    
    Question: Please generate one question and answer for each of above record delimited by triple backticks  
    and return results in a well formed JSON list with fields named as query and answer.
    
DEBUG:__name__:Rendering list of 2 blocks
DEBUG:__name__:Rendered prompt: Women's Campside Oxfords :  This ultracomfortable lace-to-toe Oxford boasts

In [54]:
new_examples["input"]

'[\n  {\n    "query": "What is the weight of the Women\'s Campside Oxfords?",\n    "answer": "Approx. weight: 1 lb.1 oz. per pair."\n  },\n  {\n    "query": "What is the size of the Small Recycled Waterhog Dog Mat?",\n    "answer": "Small - Dimensions: 18\\" x 28\\"."\n  },\n  {\n    "query": "What is the fabric composition of the Recycled Waterhog Dog Mat?",\n    "answer": "24 oz. polyester fabric made from 94% recycled materials."\n  },\n  {\n    "query": "What is the sun protection rating of the Infant and Toddler Girls\' Coastal Chill Swimsuit?",\n    "answer": "UPF 50+ rated fabric provides the highest rated sun protection possible, blocking 98% of the sun\'s harmful rays."\n  }\n]'

In [55]:
# Parse the JSON string returned by the LLM into a Python list of dicts
import json
jlist = json.loads(new_examples["input"])
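
`json.loads` works here because the model returned clean JSON, but nothing guarantees that. Note also that the reply above contains four question/answer pairs for three records, so the "one per record" instruction was not followed exactly. A minimal sketch of a more defensive parser follows; the helper name `parse_qa_list` is hypothetical, and the Markdown-fence stripping is an assumption about how some replies may be wrapped, not something observed in this run:

```python
import json

FENCE = "`" * 3  # a Markdown code fence, built this way to avoid a literal fence here

def parse_qa_list(raw: str) -> list:
    """Parse an LLM reply expected to be a JSON list of {"query", "answer"} objects."""
    text = raw.strip()
    # Assumption: a model may wrap its JSON in a Markdown code fence; strip it if so.
    if text.startswith(FENCE):
        lines = text.splitlines()[1:]          # drop the opening fence line
        if lines and lines[-1].strip().startswith(FENCE):
            lines = lines[:-1]                 # drop the closing fence line
        text = "\n".join(lines)
    data = json.loads(text)                    # raises json.JSONDecodeError if malformed
    if not isinstance(data, list):
        raise ValueError("expected a JSON list")
    for item in data:
        if not isinstance(item, dict) or not {"query", "answer"}.issubset(item):
            raise ValueError(f"missing 'query' or 'answer' in {item!r}")
    return data

# Usage with a fenced reply (illustrative input, not model output):
fenced = FENCE + "json\n" + '[{"query": "q", "answer": "a"}]' + "\n" + FENCE
parsed = parse_qa_list(fenced)
print(parsed)
```

In the notebook, `parse_qa_list(new_examples["input"])` could stand in for the bare `json.loads` call.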


### Combine examples

In [56]:
# Add the LLM-generated evaluation Q&A pairs to the manual ones defined earlier
examples += jlist

In [57]:
# Now ask the LLM to answer the evaluation questions, using retrieved documents as context
async def ragqna(kernel, query, limit):
    docs = await kernel.memory.search_async(collection="outdoordb", limit=limit, min_relevance_score=0.3, query=query)
    qdocs = "\n```\n".join(doc.text for doc in docs)
    
    prompt = """{{ $qdocs}} 
    
    Use the above documents delimited by triple backticks and answer the following question: {{ $query }}
    
    
    """
    
    qna = kernel.create_semantic_function(prompt, temperature=0.0)
    context_variables = sk.ContextVariables(variables={
        "qdocs": qdocs,
        "query": query
    })
    response = qna(variables=context_variables)
    return response

## Manual Evaluation

In [60]:
# Let's test one example before running all evaluations
response = await ragqna(kernel, examples[0]["query"], 3)

DEBUG:openai:message='Request to OpenAI API' method=post path=https://api.openai.com/v1/embeddings
DEBUG:openai:api_version=None data='{"model": "text-embedding-ada-002", "input": ["Do the Cozy Comfort Pullover Set        have side pockets?"], "encoding_format": "base64"}' message='Post details'
INFO:openai:message='OpenAI API response' path=https://api.openai.com/v1/embeddings processing_ms=34 request_id=b7ea65ff84f4092e5deb2c27f5ba5df9 response_code=200
DEBUG:chromadb.db.index.hnswlib:time to pre process our knn query: 1.6689300537109375e-06
DEBUG:chromadb.db.index.hnswlib:time to run knn query: 0.00015115737915039062
DEBUG:__name__:Extracting blocks from template: {{ $qdocs}} 
    
    Use the above documents delimited by triple backticks and answer the following question: {{ $query }}
    
    
    
DEBUG:asyncio:Using selector: EpollSelector
DEBUG:__name__:Rendering string template: {{ $qdocs}} 
    
    Use the above documents delimited by triple backticks and answer the followin

In [62]:
print(examples[0]["query"])
print(response["input"])

Do the Cozy Comfort Pullover Set        have side pockets?
Yes, the Cozy Comfort Pullover Set does have side pockets.


In [63]:
# Now that one example works, let's run all the evaluation questions
# We will save the inferred answer in the "Predicted" field
for example in examples:
    response = await ragqna(kernel, example["query"], 3)
    example["Predicted"] = response["input"]

DEBUG:openai:message='Request to OpenAI API' method=post path=https://api.openai.com/v1/embeddings
DEBUG:openai:api_version=None data='{"model": "text-embedding-ada-002", "input": ["Do the Cozy Comfort Pullover Set        have side pockets?"], "encoding_format": "base64"}' message='Post details'
INFO:openai:message='OpenAI API response' path=https://api.openai.com/v1/embeddings processing_ms=42 request_id=f7003ab68f6822aa5b7f3873b030d8f4 response_code=200
DEBUG:chromadb.db.index.hnswlib:time to pre process our knn query: 1.9073486328125e-06
DEBUG:chromadb.db.index.hnswlib:time to run knn query: 0.00030112266540527344
DEBUG:__name__:Extracting blocks from template: {{ $qdocs}} 
    
    Use the above documents delimited by triple backticks and answer the following question: {{ $query }}
    
    
    
DEBUG:asyncio:Using selector: EpollSelector
DEBUG:__name__:Rendering string template: {{ $qdocs}} 
    
    Use the above documents delimited by triple backticks and answer the following q

DEBUG:openai:message='Request to OpenAI API' method=post path=https://api.openai.com/v1/chat/completions
DEBUG:openai:api_version=None data='{"model": "gpt-3.5-turbo", "messages": [{"role": "user", "content": "Ultra-Lofty 850 Stretch Down Hooded Jacket :  This technical stretch down jacket from our DownTek collection is sure to keep you warm and comfortable with its full-stretch construction providing exceptional range of motion. With a slightly fitted style that falls at the hip and best with a midweight layer, this jacket is suitable for light activity up to 20\\u00b0 and moderate activity up to -30\\u00b0. The soft and durable 100% polyester shell offers complete windproof protection and is insulated with warm, lofty goose down. Other features include welded baffles for a no-stitch construction and excellent stretch, an adjustable hood, an interior media port and mesh stash pocket and a hem drawcord. Machine wash and dry. Imported.\\n```\\nWomen\'s Ultra-Loft Down Sweater Hooded Jac

INFO:openai:message='OpenAI API response' path=https://api.openai.com/v1/chat/completions processing_ms=None request_id=3ffc0529ab2fed0b9e670fde572bdfbe response_code=429
INFO:openai:error_code=None error_message='Rate limit reached for default-gpt-3.5-turbo in organization org-rocrupyvzgcl4yf25rqq6d1v on tokens per min. Limit: 90000 / min. Current: 89510 / min. Contact us through our help center at help.openai.com if you continue to have issues.' error_param=None error_type=tokens message='OpenAI API error received' stream_error=False
DEBUG:openai:message='Request to OpenAI API' method=post path=https://api.openai.com/v1/embeddings
DEBUG:openai:api_version=None data='{"model": "text-embedding-ada-002", "input": ["What is the size of the Small Recycled Waterhog Dog Mat?"], "encoding_format": "base64"}' message='Post details'
INFO:openai:message='OpenAI API response' path=https://api.openai.com/v1/embeddings processing_ms=35 request_id=7a0ce1ec3f62317bd1ebee9e59a69a5e response_code=200


DEBUG:openai:message='Request to OpenAI API' method=post path=https://api.openai.com/v1/chat/completions
DEBUG:openai:api_version=None data='{"model": "gpt-3.5-turbo", "messages": [{"role": "user", "content": "Recycled Waterhog Dog Mat, Chevron Weave :  Protect your floors from spills and splashing with our ultradurable recycled Waterhog dog mat made right here in the USA. \\n\\nSpecs\\nSmall - Dimensions: 18\\" x 28\\". \\nMedium - Dimensions: 22.5\\" x 34.5\\".\\n\\nWhy We Love It\\nMother nature, wet shoes and muddy paws have met their match with our Recycled Waterhog mats. Ruggedly constructed from recycled plastic materials, these ultratough mats help keep dirt and water off your floors and plastic out of landfills, trails and oceans. Now, that\'s a win-win for everyone.\\n\\nFabric & Care\\nVacuum or hose clean.\\n\\nConstruction\\n24 oz. polyester fabric made from 94% recycled materials.\\nRubber backing.\\n\\nAdditional Features\\nFeatures an -exclusive design.\\nFeatures thick

INFO:openai:message='OpenAI API response' path=https://api.openai.com/v1/chat/completions processing_ms=879 request_id=11806ca3644c02378ab39b7ba4d18c17 response_code=200
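
The log above includes an HTTP 429 (rate limit) response, and several `Predicted` fields in the results end up empty. The truncated log does not confirm that the two are related, but if they are, one option is to retry with exponential backoff whenever a call yields an empty answer. A minimal sketch, assuming an empty string signals failure; `retry_until_nonempty` and its parameters are hypothetical and not part of Semantic Kernel:

```python
import asyncio

async def retry_until_nonempty(make_call, max_attempts=4, base_delay=1.0):
    """Await make_call() until it returns a non-empty string or attempts run out."""
    result = ""
    for attempt in range(max_attempts):
        result = await make_call()
        if result:
            return result
        if attempt < max_attempts - 1:
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            await asyncio.sleep(base_delay * 2 ** attempt)
    return result  # still empty after max_attempts tries

# Stand-in for the real call so the sketch runs without an API key:
# it returns an empty string twice, then an answer.
_replies = iter(["", "", "Yes, it has side pockets."])

async def fake_call():
    return next(_replies)

answer = asyncio.run(retry_until_nonempty(fake_call, base_delay=0.01))
print(answer)
```

Inside the notebook, where an event loop is already running, `await retry_until_nonempty(...)` would replace `asyncio.run(...)`, and `make_call` would be a small async function that awaits `ragqna` and returns `response["input"]`.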


In [64]:
examples

[{'query': 'Do the Cozy Comfort Pullover Set        have side pockets?',
  'answer': 'Yes',
  'Predicted': 'Yes, the Cozy Comfort Pullover Set does have side pockets.'},
 {'query': 'What collection is the Ultra-Lofty         850 Stretch Down Hooded Jacket from?',
  'answer': 'The DownTek collection',
  'Predicted': 'The Ultra-Lofty 850 Stretch Down Hooded Jacket is from the DownTek collection.'},
 {'query': "What is the weight of the Women's Campside Oxfords?",
  'answer': 'Approx. weight: 1 lb.1 oz. per pair.',
  'Predicted': ''},
 {'query': 'What is the size of the Small Recycled Waterhog Dog Mat?',
  'answer': 'Small - Dimensions: 18" x 28".',
  'Predicted': ''},
 {'query': 'What is the fabric composition of the Recycled Waterhog Dog Mat?',
  'answer': '24 oz. polyester fabric made from 94% recycled materials.',
  'Predicted': ''},
 {'query': "What is the sun protection rating of the Infant and Toddler Girls' Coastal Chill Swimsuit?",
  'answer': "UPF 50+ rated fabric provides the h
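
Scanning the raw list by eye gets tedious as the number of examples grows. A minimal sketch of a first-pass check that flags examples with no prediction at all, so they can be re-run before comparing answers. The list `examples_snapshot` is a hand-copied subset of the output above (with the stray spaces in the first query collapsed), used so the snippet runs on its own:

```python
# Subset of the results shown above, copied by hand for illustration.
examples_snapshot = [
    {"query": "Do the Cozy Comfort Pullover Set have side pockets?",
     "answer": "Yes",
     "Predicted": "Yes, the Cozy Comfort Pullover Set does have side pockets."},
    {"query": "What is the weight of the Women's Campside Oxfords?",
     "answer": "Approx. weight: 1 lb.1 oz. per pair.",
     "Predicted": ""},
    {"query": "What is the size of the Small Recycled Waterhog Dog Mat?",
     "answer": 'Small - Dimensions: 18" x 28".',
     "Predicted": ""},
]

# Collect the queries whose prediction is empty or whitespace-only.
missing = [ex["query"] for ex in examples_snapshot if not ex["Predicted"].strip()]

print(f"{len(missing)} of {len(examples_snapshot)} examples have no prediction:")
for query in missing:
    print(" -", query)
```

In the notebook the same comprehension can be applied directly to `examples`.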

## LLM-assisted evaluation

In [None]:
# Exercise: Pass the output above back to the LLM and ask it to judge how close each "Predicted" value is to its "answer"
# Hint: Reuse the pattern from the ragqna function above (a prompt template plus context variables)
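
One possible starting point for the exercise, offered as a sketch rather than a reference solution: build a grading prompt per example and map the model's reply to a boolean. The prompt wording, the CORRECT/INCORRECT labels, and the helper names `build_grading_prompt` and `parse_grade` are all assumptions made for illustration. The snippet does not call the LLM, so it runs without an API key:

```python
def build_grading_prompt(example: dict) -> str:
    """Render a grading prompt for one example with query/answer/Predicted keys."""
    return (
        "You are grading a question-answering system.\n"
        f"Question: {example['query']}\n"
        f"Reference answer: {example['answer']}\n"
        f"Predicted answer: {example['Predicted']}\n"
        "Reply with exactly one word: CORRECT if the predicted answer agrees "
        "with the reference answer, otherwise INCORRECT."
    )

def parse_grade(reply: str) -> bool:
    """Map the grader's reply to True (correct) or False (incorrect)."""
    verdict = reply.strip().upper()
    # Check INCORRECT first so the two labels cannot be confused.
    if verdict.startswith("INCORRECT"):
        return False
    if verdict.startswith("CORRECT"):
        return True
    raise ValueError(f"unexpected grader reply: {reply!r}")

example = {
    "query": "What collection is the Ultra-Lofty 850 Stretch Down Hooded Jacket from?",
    "answer": "The DownTek collection",
    "Predicted": "The Ultra-Lofty 850 Stretch Down Hooded Jacket is from the DownTek collection.",
}
grading_prompt = build_grading_prompt(example)
print(grading_prompt)
```

The rendered prompt could then be sent to the model in the same way `ragqna` sends its prompt, for example by passing it in as a context variable, with `parse_grade` applied to the reply. Examples whose `Predicted` field is empty are best re-run first, since grading them tells you nothing about answer quality.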