
# Logging and Monitoring Text Metrics for LLMs with LangKit and WhyLabs

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/whylabs/LanguageToolkit/blob/main/langkit/examples/Batch_to_Whylabs.ipynb)

In [None]:
%pip install 'langkit[all]' -q

In this example, we'll show how you can generate out-of-the-box text metrics using LangKit and whylogs, and then log and monitor them in the WhyLabs Observability Platform.

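With LangKit, you can extract relevant signals from unstructured text data, such as text statistics (readability, complexity, and grade), sentiment scores, regex pattern matches, and similarity to known jailbreak attempts and LLM refusals.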



## Loading the Dataset - Chatbot prompts

Let's first download a Hugging Face dataset containing prompts and responses from a chatbot. We'll generate text metrics for the prompts and responses, and then log them to WhyLabs.

In [1]:
from datasets import load_dataset
print("initialize hugging face archived chat prompt/response dataset...")
archived_chats = load_dataset('alespalla/chatbot_instruction_prompts', split="test", streaming=True)

initialize hugging face archived chat prompt/response dataset...
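Because the dataset is loaded with `streaming=True`, records are fetched lazily as you iterate rather than downloaded up front. The sketch below illustrates that consumption pattern with a hypothetical stand-in generator (`fake_chat_stream`), so it runs without any download. The `prompt` and `response` field names are assumptions made for the illustration.

```python
from itertools import islice
from typing import Dict, Iterator, List


def fake_chat_stream() -> Iterator[Dict[str, str]]:
    """Hypothetical stand-in for a streaming dataset: yields records lazily."""
    n = 0
    while True:
        # Field names are an assumption made for this illustration.
        yield {"prompt": f"question {n}", "response": f"answer {n}"}
        n += 1


def take_batch(stream: Iterator[Dict[str, str]], size: int) -> List[Dict[str, str]]:
    """Consume up to `size` records from the stream and return them as a list."""
    return list(islice(stream, size))


stream = iter(fake_chat_stream())
first_batch = take_batch(stream, 3)
second_batch = take_batch(stream, 3)  # continues where the first batch stopped
print(first_batch[0]["prompt"], second_batch[0]["prompt"])
```

Each call advances the same iterator, which is why later cells can keep calling `next(chats)` without ever seeing a record twice.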


In order to send our profile to WhyLabs, let's first set up an account. You can skip this if you already have an account and a model set up.

We will need three pieces of information:

- API token
- Organization ID
- Dataset ID (or model-id)

Go to https://whylabs.ai/free and sign up for a free account. You can follow along with the quick start examples if you wish, but if you only want to follow this demonstration, you can skip the quick start instructions.

After that, you'll be prompted to create an API token. Once you create it, copy it and store it locally. The second important piece of information is your org ID; take note of it as well. Once you have your API token and org ID, go to https://hub.whylabsapp.com/models to see your projects dashboard. There you can create a new project and take note of its ID (for a model project it will look like `model-xxxx`).

In [None]:
from langkit.config import check_or_prompt_for_api_keys

check_or_prompt_for_api_keys()
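If you prefer to set the three values yourself instead of being prompted, a small pre-flight check can tell you which ones are still missing. This is only a sketch: the environment variable names below are the ones whylogs' WhyLabs writer is commonly configured with, and whether `check_or_prompt_for_api_keys` reads exactly these is an assumption. `missing_settings` is a hypothetical helper, and the values shown are placeholders.

```python
import os
from typing import List, Mapping

# Assumed environment variable names for the three pieces of information.
REQUIRED_SETTINGS = (
    "WHYLABS_API_KEY",             # API token
    "WHYLABS_DEFAULT_ORG_ID",      # organization ID
    "WHYLABS_DEFAULT_DATASET_ID",  # dataset ID (or model ID)
)


def missing_settings(env: Mapping[str, str]) -> List[str]:
    """Return the names of required settings that are absent or empty."""
    return [name for name in REQUIRED_SETTINGS if not env.get(name)]


# Placeholder values only; substitute your own credentials.
example_env = {
    "WHYLABS_API_KEY": "<your-api-token>",
    "WHYLABS_DEFAULT_ORG_ID": "org-xxxx",
}
print(missing_settings(example_env))

# The same check against the real process environment.
print(missing_settings(os.environ))
```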

## Initializing Metrics from LangKit

In order to calculate the text metrics, we simply need to import the relevant modules from `LangKit`. In this case, we will calculate metrics using the following modules:

- textstat: text statistics such as scores for readability, complexity, and grade
- sentiment: sentiment scores
- regexes: label text according to user-defined regex pattern groups
- themes: compute sentence similarity scores with respect to two groups of reference texts: a) known jailbreak attempts and b) LLM refusal-of-service responses

After importing the modules, we can generate a schema that will inform whylogs of the metrics we want to calculate. We can then use this schema to log our data.

In [None]:
from langkit import llm_metrics

print("downloading models and initializing metrics...")
text_metrics_schema = llm_metrics.init()
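To make the idea behind the `regexes` module more concrete, here is a plain-Python sketch of labeling text according to regex pattern groups. It is not LangKit's implementation: the pattern groups, the `label_text` helper, and the first-match rule are all hypothetical choices made for illustration.

```python
import re
from typing import Dict, List, Optional

# Hypothetical pattern groups for illustration; not LangKit's built-in set.
PATTERN_GROUPS: Dict[str, List[str]] = {
    "email address": [r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"],
    "phone number": [r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"],
}

# Compile once so repeated calls do not re-parse the patterns.
COMPILED = {
    label: [re.compile(pattern) for pattern in patterns]
    for label, patterns in PATTERN_GROUPS.items()
}


def label_text(text: str) -> Optional[str]:
    """Return the label of the first group with a matching pattern, else None."""
    for label, patterns in COMPILED.items():
        if any(pattern.search(text) for pattern in patterns):
            return label
    return None


print(label_text("You can reach me at jane@example.com"))
print(label_text("Call 555-123-4567 after lunch"))
print(label_text("What is the capital of France?"))
```

A metric like this turns free text into a categorical value, which can then be profiled and monitored like any other column.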

## Profiling and Writing to WhyLabs - Single Example

The following code block will log a single prompt/response pair. The resulting profile will then be sent over to your dashboard at WhyLabs.

In [10]:
import whylogs as why
from whylogs.api.writer.whylabs import WhyLabsWriter

# Let's define a WhyLabs writer
writer = WhyLabsWriter()

# define an iterator over the hugging face dataset of archived prompts
chats = iter(archived_chats)
# grab the first archived prompt/response and log it
archived_prompt_response = next(chats)
print("Log this first prompt with whylogs and grab the profile")
profile = why.log(archived_prompt_response, schema=text_metrics_schema).profile()

# This is a single-prompt profile for today; let's write it to WhyLabs
print("Writing initial profile to WhyLabs:")
status = writer.write(profile)
print(f"Done writing initial profile to WhyLabs, with success: {status}")
print()


Log this first prompt with whylogs and grab the profile
Writing initial profile to WhyLabs:
Done writing initial profile to WhyLabs, with success: (True, 'log-zj4EscAKHbaokR7c')
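The printed status above is a tuple of a success flag and a string. A minimal sketch of turning it into a readable message follows. The tuple shape is taken from the output above; what the string contains when a write fails is an assumption, and `summarize_write_status` is a hypothetical helper.

```python
from typing import Tuple


def summarize_write_status(status: Tuple[bool, str]) -> str:
    """Turn a (success, detail) status tuple into a human-readable message."""
    success, detail = status
    if success:
        return f"upload accepted (reference: {detail})"
    # Assumption: on failure the second element describes what went wrong.
    return f"upload failed: {detail}"


print(summarize_write_status((True, "log-zj4EscAKHbaokR7c")))
```

Checking the flag before moving on is useful in scripts, where a failed upload would otherwise pass silently.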



## Profiling and Writing to WhyLabs - Multiple Batches

Let's get closer to a realistic scenario. If you have an LLM-powered system, you'll want to monitor its text inputs and outputs as they stream in. Here we simulate that by iterating through the examples and logging them in daily batches: one profile for each of the previous 6 days, which, together with the profile written above for today, gives 7 days of data. Each daily profile covers 11 examples: one to create the profile and 10 more tracked into it.

In [11]:
from datetime import datetime, timedelta, timezone

current_date = datetime.now(timezone.utc)

print("Now lets write some data to simulate daily logging for the past 7 days.")
batch_size = 10
for day in range(1, 7):
  # create a separate profile for each day
  archived_prompt_response = next(chats)
  profile = why.log(archived_prompt_response, schema=text_metrics_schema).profile()
  # Now log additional archived prompt/response pairs to aggregate statistics
  # in this profile. The pairs logged per profile can be very numerous, or profiled
  # on different machines; because the statistics are all mergeable, we get the
  # rollup of these statistics across the instances processing your data for this day.
  archived_prompt_responses = []
  dataset_date = current_date - timedelta(days=day)
  print(f"Downloading {batch_size} records from Hugging Face for {dataset_date} and profiling")

  for _ in range(batch_size):
    record = next(chats)
    archived_prompt_responses.append(record)

  for record in archived_prompt_responses:
    profile.track(record)
    print(".", end="", flush=True)
  # Now let's take the aggregate profile, set the timestamp, and write it to WhyLabs
  profile.set_dataset_timestamp(dataset_date)
  writer.write(profile)
  print()
print("Done. Go see your metrics on the WhyLabs dashboard!")

Now lets write some data to simulate daily logging for the past 7 days.
Downloading 10 records from Hugging Face for 2023-10-11 16:25:50.201344+00:00 and profiling
..........
Downloading 10 records from Hugging Face for 2023-10-10 16:25:50.201344+00:00 and profiling
..........
Downloading 10 records from Hugging Face for 2023-10-09 16:25:50.201344+00:00 and profiling
..........
Downloading 10 records from Hugging Face for 2023-10-08 16:25:50.201344+00:00 and profiling
..........
Downloading 10 records from Hugging Face for 2023-10-07 16:25:50.201344+00:00 and profiling
..........
Downloading 10 records from Hugging Face for 2023-10-06 16:25:50.201344+00:00 and profiling
..........
Done. Go see your metrics on the WhyLabs dashboard!
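The code comments above rely on the profile statistics being mergeable. As a rough illustration of what that means, the sketch below tracks a few simple statistics over text lengths and shows that profiling two halves separately and merging gives the same result as profiling everything in one place. `LengthSummary` is a hypothetical toy class, not whylogs' profile implementation, which uses more sophisticated sketches.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class LengthSummary:
    """Toy mergeable summary of text lengths (count, total, min, max)."""
    count: int = 0
    total: int = 0
    minimum: float = float("inf")
    maximum: float = float("-inf")

    def track(self, text: str) -> None:
        """Fold one more text into the summary."""
        n = len(text)
        self.count += 1
        self.total += n
        self.minimum = min(self.minimum, n)
        self.maximum = max(self.maximum, n)

    def merge(self, other: "LengthSummary") -> "LengthSummary":
        """Combine two summaries without revisiting the raw texts."""
        return LengthSummary(
            count=self.count + other.count,
            total=self.total + other.total,
            minimum=min(self.minimum, other.minimum),
            maximum=max(self.maximum, other.maximum),
        )


def summarize(texts: Iterable[str]) -> LengthSummary:
    summary = LengthSummary()
    for text in texts:
        summary.track(text)
    return summary


texts = ["hi", "hello there", "hey"]
single = summarize(texts)                                   # one machine sees everything
merged = summarize(texts[:1]).merge(summarize(texts[1:]))   # two machines, then merge
print(merged == single)
```

Only the small summaries need to travel between machines, never the raw prompts and responses.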


And that's it! You can now go to your WhyLabs dashboard and explore the profiles for the past 7 days.

Feel free to play around with the code and the metrics. You can inject anomalies manually to see how the metrics change, or you can set up monitors and alerts in the WhyLabs dashboard.