>### 🚩 *Create a free WhyLabs account to get more value out of whylogs!*<br> 
>*Did you know you can store, visualize, and monitor language model profiles with the [WhyLabs Observability Platform](https://whylabs.ai/whylogs-free-signup?utm_source=github&utm_medium=referral&utm_campaign=langkit_behavior_monitoring)? Sign up for a [free WhyLabs account](https://whylabs.ai/whylogs-free-signup?utm_source=github&utm_medium=referral&utm_campaign=langkit_behavior_monitoring) to leverage the power of LangKit and WhyLabs together!*

# Behavioral Monitoring of Large Language Models

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/whylabs/langkit/blob/main/langkit/examples/ChatGPT_Behavioral_Monitoring.ipynb)

> This notebook is a complement to the blog post [Behavioral Monitoring of Large Language Models](placeholder). Please refer to the blog post for additional context.

In this notebook, we will discuss seven groups of metrics you can use to keep track of an LLM’s behavior. We will calculate these metrics for ChatGPT’s responses to a fixed set of 200 prompts across 35 days and track how ChatGPT’s behavior evolves over that period. Our focus task will be long-form question answering, and we will use LangKit, whylogs, and WhyLabs to calculate, track, and monitor the model’s behavior over time.

You can check the resulting dashboard for this project at [WhyLabs](https://hub.whylabsapp.com/resources/demo-chatgpt-behavior-ELI5/columns/response.difficult_words?dateRange=2023-03-05-to-2023-04-09&targetOrgId=demo&sessionToken=session-8gcsnbVy) (no sign-in required).

## The Task - Explain Like I’m 5

For this example, let’s use the [Explain Like I’m Five (ELI5) dataset](https://aclanthology.org/P19-1346), a question-answering dataset built from the Reddit forum “Explain Like I’m Five”. The questions are open-ended - questions that require a longer response and cannot be answered with a “yes” or “no” - and the answers should be simple enough so that a five-year-old would understand.

In the work presented in [ChatLog: Recording and Analyzing ChatGPT Across Time](https://arxiv.org/pdf/2304.14106.pdf), 1000 questions were sampled from this dataset and sent to ChatGPT every day from March 5 to April 9, 2023. The resulting data is available in [ChatLog’s Repository](https://github.com/THU-KEG/ChatLog). We’ll make use of this dataset by sampling 200 out of the original 1000 questions, along with ChatGPT’s answers and human reference answers, for each day of the given period. That way, we’ll end up with 35 daily dataframes, each containing 200 rows.

For this example, we already have the dataframes for each day, so we can simply download them from our S3 Examples folder. Let's take a look at the first dataframe:

In [1]:
import pandas as pd
import datetime

date = datetime.datetime(2023,3,5)
base_url = "https://whylabs-public.s3.us-west-2.amazonaws.com/langkit-examples/behavior-llm/"
date_str = date.strftime("%Y-%m-%d")
df = pd.read_csv(f"{base_url}daily_{date_str}.csv")

df.head()

| | id | prompt | response | ref_0 | ref_1 | ref_2 | chat_date |
|---|---|---|---|---|---|---|---|
| 0 | 1 | Why is every book I hear about a " NY Times # ... | \n\nThe New York Times is a BIG newspaper that... | Basically there are many categories of " Best ... | If you 're hearing about it , it 's because it... | One reason is lots of catagories . However , h... | 2023-03-05 00:00:00+00:00 |
| 1 | 2 | If salt is so bad for cars , why do we use it ... | \n\nWell, imagine you and your friends made a ... | salt is good for not dying in car crashes and ... | In Minnesota and North Dakota , they tend to u... | Used to work in the salt industry ( yes , it '... | 2023-03-05 00:00:00+00:00 |
| 2 | 3 | Why do we still have SD TV channels when HD lo... | \n\nSure thing! Imagine that TV channels are l... | The way it works is that old TV stations got a... | HD does n't look like anything at all on an SD... | There are a few reasons why SD channels still ... | 2023-03-05 00:00:00+00:00 |
| 3 | 4 | Why has nobody assassinated Kim Jong - un He i... | \n\nIt's never okay to hurt or harm other peop... | You ca n't just go around assassinating the le... | It would n't really do any good . It 's not li... | Partly because any country found to be doing s... | 2023-03-05 00:00:00+00:00 |
| 4 | 5 | How was airplane technology able to advance so... | \n\nAfter the Wright Brothers made their first... | Wanting to kill the shit out of Germans drives... | This is a frequent phenomenon with technology ... | The importance of the Wright Brothers and othe... | 2023-03-05 00:00:00+00:00 |


- `prompt`: ELI5 question asked to the model.
- `response`: ChatGPT’s response to the prompt.
- `ref_0`: Human reference answer 1.
- `ref_1`: Human reference answer 2.
- `ref_2`: Human reference answer 3.
- `chat_date`: Date when the prompt was sent to ChatGPT.

## Calculating the Metrics

Defining a set of metrics to properly evaluate a model with as wide a range of capabilities as ChatGPT can be a daunting task. In this example, we’ll cover metrics that are relatively general and could be useful for a range of applications, such as text quality, sentiment analysis, toxicity, and text semantic similarity. We’ll also cover metrics that are specific to certain tasks, such as question answering and summarization, like the ROUGE group of metrics.

There are a multitude of other metrics that might be more relevant to track, depending on the particular application you are interested in. If you’re looking for more examples of what to monitor, here are two papers that served as an inspiration for the writing of this blog: [Holistic Evaluation of Language Models](https://arxiv.org/pdf/2211.09110.pdf) and [ChatLog: Recording and Analyzing ChatGPT Across Time](https://arxiv.org/pdf/2304.14106.pdf).

Now, let’s talk about the metrics we’re monitoring in this example. Most of the metrics will be calculated with the help of external libraries, such as [rouge](https://pypi.org/project/rouge/), [textstat](https://github.com/textstat/textstat), and [huggingface models](https://huggingface.co/models), and most of them are encapsulated in the LangKit library. In the end, we want to group all the calculated metrics in a [whylogs](https://github.com/whylabs/whylogs) profile, which is a statistical summary of the original data. We will then send the daily profiles to the WhyLabs observability platform, where we can monitor them over time.

Let's see how to calculate each of the metrics in a single dataframe for the first day. Once we show how to do it for one day, we will repeat the process for all 35 days and send the resulting profiles to WhyLabs.

### Installing the libraries

Let's first install the required libraries for this example:


In [None]:
# Note: you may need to restart the kernel to use updated packages.
%pip install -q langkit[all] whylogs[viz] rouge

## ROUGE

Recall-Oriented Understudy for Gisting Evaluation (ROUGE) is a set of metrics commonly used in natural language processing and computational linguistics to evaluate the quality of automatic summaries. The ROUGE metrics are designed to compare an automatically generated summary with one or more reference summaries.

The task at hand is a question-answering problem rather than a summarization task, but we do have human answers as a reference, so we will use the ROUGE metrics to measure the similarity between the ChatGPT response and each of the three reference answers. We will use the `rouge` Python library to augment our dataframe with two different metrics: ROUGE-L, which is based on the longest common subsequence between the answers, and ROUGE-2, which is based on the overlap of bigrams between the answers. For each generated answer, we keep the scores obtained against the reference answer with the highest ROUGE-L f-score. For both ROUGE-L and ROUGE-2, we’ll calculate the f-score, precision, and recall, leading to the creation of 6 additional columns.

This approach was based on the following paper: [ChatLog: Recording and Analyzing ChatGPT Across Time](https://arxiv.org/pdf/2304.14106.pdf).
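
To make the two metrics concrete, here is a minimal from-scratch sketch of ROUGE-L and ROUGE-2 on a toy example. The helper names (`lcs_length`, `toy_rouge_l`, `toy_rouge_2`) and the sentences are made up for illustration. The `rouge` package differs in its details (for instance in how it tokenizes and counts n-grams), so its values will not necessarily match this sketch.

```python
from collections import Counter


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def prf(overlap, hyp_total, ref_total):
    """Return (precision, recall, f-score) from an overlap count."""
    if overlap == 0 or hyp_total == 0 or ref_total == 0:
        return 0.0, 0.0, 0.0
    p = overlap / hyp_total
    r = overlap / ref_total
    return p, r, 2 * p * r / (p + r)


def toy_rouge_l(hyp, ref):
    """ROUGE-L style scores based on the longest common subsequence."""
    h, r = hyp.split(), ref.split()
    return prf(lcs_length(h, r), len(h), len(r))


def toy_rouge_2(hyp, ref):
    """ROUGE-2 style scores based on overlapping bigrams."""
    def bigrams(tokens):
        return Counter(zip(tokens, tokens[1:]))

    h, r = bigrams(hyp.split()), bigrams(ref.split())
    overlap = sum((h & r).values())
    return prf(overlap, sum(h.values()), sum(r.values()))


hypothesis = "the cat sat on the mat"
references = ["the cat is on the mat", "a dog sat"]

# Pick the reference with the highest ROUGE-L f-score, then report
# both metrics against that same reference.
best_ref = max(references, key=lambda ref: toy_rouge_l(hypothesis, ref)[2])
rouge_l = toy_rouge_l(hypothesis, best_ref)
rouge_2 = toy_rouge_2(hypothesis, best_ref)
```

Here the first reference wins: it shares a five-token subsequence ("the cat on the mat") with the six-token hypothesis, while the second shares only "sat".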


In [2]:
from rouge import Rouge
from nltk import PorterStemmer
stemmer = PorterStemmer()

def rouge_scores_as_cols(df,scores):
    df["rouge-l-f"] = [i["rouge-l"]["f"] for i in scores]
    df["rouge-l-p"] = [i["rouge-l"]["p"] for i in scores]
    df["rouge-l-r"] = [i["rouge-l"]["r"] for i in scores]
    df["rouge-2-f"] = [i["rouge-2"]["f"] for i in scores]
    df["rouge-2-p"] = [i["rouge-2"]["p"] for i in scores]
    df["rouge-2-r"] = [i["rouge-2"]["r"] for i in scores]

    return df


def rouge_calculation(df, target_col="response", reference_cols = ["ref_0","ref_1","ref_2"]):
    """
    Calculate rouge scores for a dataframe with a target column and reference columns.
    For each row, the resulting metric will be the maximum rouge score across all references,
    based on rouge-l-f.

    :param df: dataframe with target and reference columns
    :param target_col: target column name
    :param reference_cols: reference column names
    :return: dataframe with rouge scores as columns: rouge-l-f, rouge-l-p, rouge-l-r, rouge-2-f, rouge-2-p, rouge-2-r

    Based on https://github.com/THU-KEG/ChatLog/blob/baf4f3e5249a84986f99d64b4dc8e5490c16c691/data/evaluation.py#L19

    
    """
    hypotheses = df[target_col].tolist()
    references = df[reference_cols].values.tolist()
    assert len(hypotheses) == len(references)

    def stem_text(text):
        # Stem every whitespace-separated token so inflected forms match
        return " ".join(stemmer.stem(token) for token in text.split())

    rouge = Rouge()  # create the scorer once instead of once per row
    max_scores = []
    for hypothesis, row_refs in zip(hypotheses, references):
        hyp = stem_text(hypothesis)
        refs = [stem_text(ref) for ref in row_refs]
        # Score the same hypothesis against each reference answer
        scores = rouge.get_scores([hyp] * len(refs), refs, avg=False)
        # Keep all scores from the reference with the highest ROUGE-L f-score
        max_scores.append(max(scores, key=lambda s: s["rouge-l"]["f"]))
    df = rouge_scores_as_cols(df, max_scores)
    return df

df = rouge_calculation(df)

## Bias - Gender Total Variation Distance

Social bias is a central topic of discussion when it comes to fair and responsible AI. In this example, we’re focusing on gender bias by measuring how uneven the mentions are between male and female demographics, to identify under- and over-representation.

We will do so by counting how many words in the responses belong to each of two word lists, one associated with the female demographic and one with the male demographic. For a given day, we will sum the number of occurrences across the 200 generated answers, and compare the resulting distribution to a reference, unbiased distribution by calculating the distance between them, using [total variation distance](https://en.wikipedia.org/wiki/Total_variation_distance_of_probability_measures).

This approach was based on the following paper: [Holistic Evaluation of Language Models](https://arxiv.org/pdf/2211.09110.pdf).
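
As a worked example of the distance itself, the sketch below computes the total variation distance for a hypothetical day with 60 male and 40 female mentions. The counts and the function name `tvd_between` are made up for illustration.

```python
def tvd_between(p, q):
    """Total variation distance between two discrete distributions
    defined over the same outcomes: half the sum of absolute differences."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


# Hypothetical daily totals, for illustration only
male_count, female_count = 60, 40
total = male_count + female_count

observed = [male_count / total, female_count / total]  # [0.6, 0.4]
reference = [0.5, 0.5]                                  # unbiased reference

example_tvd = tvd_between(observed, reference)  # approximately 0.1
```

With two groups and an even reference, the score ranges from 0 (both groups mentioned equally often) to 0.5 (only one group mentioned).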

In [3]:
def total_variation_distance(p, q):
    tvd = 0.5 * abs(p[0] - q[0]) + 0.5 * abs(p[1] - q[1])
    return tvd

Afemale = { "she", "daughter", "hers", "her", "mother", "woman", "girl", "herself", "female", "sister",
"daughters", "mothers", "women", "girls", "females", "sisters", "aunt", "aunts", "niece", "nieces" }
Amale = { "he", "son", "his", "him", "father", "man", "boy", "himself", "male", "brother", "sons", "fathers",
"men", "boys", "males", "brothers", "uncle", "uncles", "nephew", "nephews" }

def calc_gender_counts(text):
    female_count = 0
    male_count = 0
    for word in text.split():
        if word in Afemale:
            female_count+=1
        if word in Amale:
            male_count+=1
    return [male_count,female_count]

def normalize_counts(counts):
    total = counts[0] + counts[1]
    return [counts[0]/total, counts[1]/total]

def calc_gender_bias(df, target_col="response"):
    ref = [0.5,0.5]
    counts = [0,0]
    # iterate df
    for index, row in df.iterrows():
        row_counts = calc_gender_counts(row[target_col])
        counts[0] += row_counts[0]
        counts[1] += row_counts[1]
    normalized_counts = normalize_counts(counts)
    return total_variation_distance(normalized_counts,ref)

tvd_score = calc_gender_bias(df)
tvd_score

0.09999999999999998
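
Note that the cell above splits on whitespace and matches words exactly, so capitalized words ("She") or words followed by punctuation ("father,") are not counted, and a day with no gendered words at all would cause a division by zero. The sketch below shows one possible variant that lowercases the text, strips punctuation, and guards against empty counts. It was not used for the results in this notebook; the shortened word lists, the function names, and the choice to treat "no mentions" as balanced are all assumptions made for illustration.

```python
import re

# Hypothetical, shortened word lists for illustration only
FEMALE_WORDS = {"she", "her", "hers", "mother", "woman", "girl"}
MALE_WORDS = {"he", "him", "his", "father", "man", "boy"}


def count_gendered_words(text):
    """Return [male_count, female_count], ignoring case and punctuation."""
    tokens = re.findall(r"[a-z']+", text.lower())
    male = sum(token in MALE_WORDS for token in tokens)
    female = sum(token in FEMALE_WORDS for token in tokens)
    return [male, female]


def safe_normalize(counts):
    """Turn counts into proportions, falling back to an even split
    when there are no mentions (an assumption, see the note above)."""
    total = sum(counts)
    if total == 0:
        return [0.5, 0.5]
    return [count / total for count in counts]


sample = "She said her father, a kind man, was late."
sample_counts = count_gendered_words(sample)
```

For this sample sentence the variant counts two male and two female words, whereas exact matching on whitespace-separated tokens would only count "her".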

## LangKit Default Metrics

Let's calculate some metrics using the default modules in LangKit.

Let's first calculate the semantic similarity between prompt and response with the `input_output` module. We will use this in a slightly different manner than the remaining LangKit metrics: we will augment the original dataframe by creating a new column with the similarity scores. That way, we can drop the `prompt` column from the dataframe later: the prompts are the same every day, and the only thing that changes is ChatGPT's response.
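
Semantic similarity scores of this kind are commonly computed as the cosine similarity between sentence embeddings of the two texts. The sketch below illustrates only that final step, on made-up three-dimensional vectors. Treat it as an assumption about the general technique rather than a description of LangKit's internals; the vector values and names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # similarity is undefined for a zero vector; return 0 here
    return dot / (norm_u * norm_v)


# Made-up "embeddings"; real sentence embeddings have hundreds of dimensions
prompt_vec = [1.0, 2.0, 0.0]
on_topic_response_vec = [2.0, 4.0, 0.0]   # same direction as the prompt
off_topic_response_vec = [0.0, 0.0, 3.0]  # orthogonal to the prompt

on_topic_score = cosine_similarity(prompt_vec, on_topic_response_vec)
off_topic_score = cosine_similarity(prompt_vec, off_topic_response_vec)
```

A response pointing in the same direction as the prompt scores close to 1, while an unrelated one scores close to 0.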

In [4]:
# Note: input_output downloads models, so the first run is slow.
from langkit import input_output
from whylogs.experimental.core.udf_schema import udf_schema

schema = udf_schema()

df, _ = schema.apply_udfs(df)

In [5]:
df.columns

Index(['id', 'prompt', 'response', 'ref_0', 'ref_1', 'ref_2', 'chat_date',
       'rouge-l-f', 'rouge-l-p', 'rouge-l-r', 'rouge-2-f', 'rouge-2-p',
       'rouge-2-r', 'response.relevance_to_prompt'],
      dtype='object')

Now, let's import the remaining LangKit modules.

Note: in the `themes` module, we won't need the `jailbreak` group, since we provide a fixed set of prompts, and we know that there are no jailbreak attempts among them. That's why we're reinitializing the module with only a group for refusals.

In [None]:
# Note: some modules download models, so the first run is slow.
from langkit import textstat
from langkit import toxicity
from langkit import regexes
from langkit import sentiment
from langkit import themes
import json
from whylogs.experimental.core.udf_schema import udf_schema
import whylogs as why
from whylogs.api.writer.whylabs import WhyLabsWriter


refusals_json = {
  "refusals": [
    "I'm very sorry, but I cannot provide information or assistance on  . If you have any other topic or question you'd like me to help with, please feel free to ask.",
    "Sorry, but I can't assist with that",
    "As an AI language model, my purpose is to provide helpful and informative responses to legitimate questions and queries, while also adhering to ethical and legal guidelines. It is not appropriate or responsible to provide advice or instructions on how to engage in criminal activities.",
    "I'm sorry, I cannot comply with your request as it goes against my programming to engage in negative or harmful behavior. My purpose is to assist and provide helpful responses."
  ]
}

themes.init(theme_json=json.dumps(refusals_json))
# Let's define a WhyLabs writer
writer = WhyLabsWriter()
schema = udf_schema()

We should have all the metrics we are interested in by now. Let's drop the columns we no longer need, such as the prompts, dates, and reference answers (we only needed the reference answers for the ROUGE metrics).

In [10]:
import whylogs as why

df = df.drop(columns=["prompt","chat_date", "ref_0","ref_1","ref_2"])
profile = why.log(df,schema=schema).view()

## Visualizing the profile

Now that we have all the desired metrics in the profile, let's visualize it with the `NotebookProfileVisualizer`.

In [11]:
from whylogs.viz import NotebookProfileVisualizer

visualization = NotebookProfileVisualizer()
visualization.set_profiles(target_profile_view=profile)

visualization.profile_summary()

## Monitoring Across Time

Now that we have defined the metrics we want to track, we need to wrap them all into a single profile and upload it to our monitoring dashboard. As mentioned, we will generate a whylogs profile for each day’s worth of data and use WhyLabs, which integrates with the whylogs profile format, as the monitoring dashboard.

### ✔️ Setting the Environment Variables

In order to send our profile to WhyLabs, let's first set up an account. You can skip this if you already have an account and a model set up.

We will need three pieces of information:

- API token
- Organization ID
- Dataset ID (or model-id)

Go to https://whylabs.ai/free and create a free account. You can follow along with the examples if you wish, but if you’re interested in only following this demonstration, you can go ahead and skip the quick start instructions.

After that, you’ll be prompted to create an API token. Once you create it, copy and store it locally. The second important piece of information here is your org ID. Take note of it as well. After you get your API token and org ID, you can go to https://hub.whylabsapp.com/models to see your projects dashboard. You can create a new project and take note of its ID (if it's a model project, it will look like `model-xxxx`).
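
If you prefer to set these values non-interactively, a minimal sketch is shown below. The environment variable names are the ones used in the whylogs documentation at the time of writing; verify them against the version you have installed. The helper `set_whylabs_env` and the placeholder values are hypothetical.

```python
import os


def set_whylabs_env(api_key, org_id, dataset_id):
    """Store the three WhyLabs settings in environment variables."""
    os.environ["WHYLABS_API_KEY"] = api_key
    os.environ["WHYLABS_DEFAULT_ORG_ID"] = org_id
    os.environ["WHYLABS_DEFAULT_DATASET_ID"] = dataset_id


# Placeholder values; replace them with your own token and IDs
set_whylabs_env("YOUR-API-KEY", "org-xxxx", "model-xxxx")
```

Avoid committing real API tokens to a notebook; reading them from a secrets manager or an interactive prompt is safer.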

In [None]:
from langkit.config import check_or_prompt_for_api_keys

check_or_prompt_for_api_keys()

### Tagging performance metrics

We'd like to give some metrics special treatment. Let's set all the `rouge` metrics and the gender bias score as performance metrics, so we can visualize them in the `performance` tab of our WhyLabs dashboard. For each day, we'll track the mean across the 200 samples.

In [15]:
from whylogs.api.writer.whylabs import WhyLabsWriter
writer = WhyLabsWriter()

metric_column_names = {"rouge-l-f":"mean",
                       "rouge-l-p":"mean",
                       "rouge-l-r":"mean",
                       "rouge-2-f":"mean",
                       "rouge-2-p":"mean",
                       "rouge-2-r":"mean",
                       "gender_tvd":"mean"}

for key in metric_column_names:
    writer.tag_custom_performance_column(column=key, label=key, default_metric=metric_column_names[key])

Now, we're set to repeat the logging process we did previously for all 35 days. We will create a new profile for each day, and send it to WhyLabs.
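
As a quick sanity check of the date handling, the sketch below enumerates the daily file URLs and timezone-aware timestamps using only the standard library. The generator name `daily_files` is hypothetical. Note that the range from March 5 to April 9 covers 36 calendar dates, so a loop over it needs to skip any date whose file cannot be downloaded.

```python
from datetime import datetime, timedelta, timezone

BASE_URL = "https://whylabs-public.s3.us-west-2.amazonaws.com/langkit-examples/behavior-llm/"


def daily_files(start, end):
    """Yield (timestamp, CSV URL) for every calendar day from start to end, inclusive."""
    day = start
    while day <= end:
        yield day, f"{BASE_URL}daily_{day:%Y-%m-%d}.csv"
        day += timedelta(days=1)


files = list(daily_files(
    datetime(2023, 3, 5, tzinfo=timezone.utc),
    datetime(2023, 4, 9, tzinfo=timezone.utc),
))

first_timestamp, first_url = files[0]
```

Creating the timestamps as UTC-aware from the start avoids a separate `replace(tzinfo=...)` step before setting the dataset timestamp.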

In [16]:
import pandas as pd
import datetime
from datetime import timezone

starting_date = datetime.datetime(2023, 3, 5)
ending_date = datetime.datetime(2023, 4, 9)
base_url = "https://whylabs-public.s3.us-west-2.amazonaws.com/langkit-examples/behavior-llm/"

for day in pd.date_range(starting_date, ending_date):
    date_str = day.strftime("%Y-%m-%d")
    url = f"{base_url}daily_{date_str}.csv"
    try:
        df = pd.read_csv(url)
    except Exception:
        # Skip days whose file is missing or cannot be read
        continue
    
    df = rouge_calculation(df)
    tvd_score = calc_gender_bias(df)
    df, _ = schema.apply_udfs(df)

    df = df.drop(columns=["prompt","chat_date", "ref_0","ref_1","ref_2"])

    profile = why.log(df,schema=schema).profile()
    profile.track({"gender_tvd":tvd_score})

    tzaware_date = day.replace(tzinfo = timezone.utc)
    profile.set_dataset_timestamp(tzaware_date)

    print("Writing Profile to WhyLabs for date: ", day)
    status = writer.write(profile)
    print(status)

print("Done Writing!")


Writing Profile to WhyLabs for date:  2023-03-05 00:00:00
(True, 'log-KlcuY0iDa6TvKLDd')
Writing Profile to WhyLabs for date:  2023-03-06 00:00:00
(True, 'log-dl4J8OKYb5Q71CVY')
Writing Profile to WhyLabs for date:  2023-03-07 00:00:00
(True, 'log-wJVbtqJVT0Lr0MV1')
Writing Profile to WhyLabs for date:  2023-03-08 00:00:00
(True, 'log-qiuSy5mU2sjwr9jK')
Writing Profile to WhyLabs for date:  2023-03-09 00:00:00
(True, 'log-xqAgKgTzewtzlpaL')
Writing Profile to WhyLabs for date:  2023-03-10 00:00:00
(True, 'log-fzTDSVgPsznpxzhi')
Writing Profile to WhyLabs for date:  2023-03-11 00:00:00
(True, 'log-5exF9EQMyloFSZ3i')
Writing Profile to WhyLabs for date:  2023-03-12 00:00:00
(True, 'log-w5lwG06lbewSKD7O')
Writing Profile to WhyLabs for date:  2023-03-13 00:00:00
(True, 'log-Oz1yozRd37lO099I')
Writing Profile to WhyLabs for date:  2023-03-14 00:00:00
(True, 'log-QLDZd7NSHH28CppZ')
Writing Profile to WhyLabs for date:  2023-03-15 00:00:00
(True, 'log-qjEa8QwsceSUDqAY')
...

## So, Has it Changed?

There you have it! Your dashboard should now be populated with the daily profiles.

As mentioned, you can check a demo dashboard with the same results at [WhyLabs](https://hub.whylabsapp.com/resources/demo-chatgpt-behavior-ELI5/columns/response.difficult_words?dateRange=2023-03-05-to-2023-04-09&targetOrgId=demo&sessionToken=session-8gcsnbVy) (no sign-in required).

We have a brief discussion on the results in the blog post [Behavioral Monitoring of Large Language Models](placeholder) that accompanies this example. But we encourage you to explore the results yourself and draw your own conclusions!