# Monitoring Hugging Face LLMs with LangKit

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/whylabs/langkit/blob/main/langkit/examples/huggingface_langkit_whylabs.ipynb)

In this example, we'll show how to generate out-of-the-box text metrics for Hugging Face LLMs using LangKit and monitor them in the WhyLabs Observability Platform.

LangKit can extract relevant signals from unstructured text data, such as:

- [Text Quality](https://github.com/whylabs/langkit/blob/main/langkit/docs/features/quality.md)
- [Text Relevance](https://github.com/whylabs/langkit/blob/main/langkit/docs/features/relevance.md)
- [Security and Privacy](https://github.com/whylabs/langkit/blob/main/langkit/docs/features/security.md)
- [Sentiment and Toxicity](https://github.com/whylabs/langkit/blob/main/langkit/docs/features/sentiment.md)

We'll use the GPT2 model for this example since it's lightweight and easy to run without a GPU, but any of the larger Hugging Face models can be used.

### Install Hugging Face Transformers & LangKit

In [None]:
%pip install transformers
%pip install 'langkit[all]'

### Import and initialize the Hugging Face GPT2 model + tokenizer

In [None]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

### Create GPT model function
This will take in a prompt and return a dictionary containing the prompt and model response.

In [3]:
def gpt_model(prompt):

  # Encode the prompt
  input_ids = tokenizer.encode(prompt, return_tensors='pt')

  # Generate a response
  output = model.generate(input_ids, max_length=100, temperature=0.8,
                          do_sample=True, pad_token_id=tokenizer.eos_token_id)

  # Decode the output
  response = tokenizer.decode(output[0], skip_special_tokens=True)

  # Combine the prompt and the output into a dictionary
  prompt_and_response = {
      "prompt": prompt,
      "response": response
  }

  return prompt_and_response

In [None]:
prompt_and_response = gpt_model("Tell me a story about a cute dog")
print(prompt_and_response)
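
Decoder-only models such as GPT2 return the prompt tokens followed by the newly generated tokens from `model.generate`, so the decoded `response` above begins with the prompt text. If you'd rather profile only the generated continuation, a small helper along these lines may work. Note that `strip_prompt_echo` is a hypothetical name (not part of Transformers or LangKit), and the sketch assumes the decoded text starts with the prompt verbatim.

```python
def strip_prompt_echo(prompt: str, response: str) -> str:
    """Return the response without a leading copy of the prompt, if present."""
    if response.startswith(prompt):
        # Drop the echoed prompt and any whitespace that followed it
        return response[len(prompt):].lstrip()
    # The prompt was not echoed verbatim, so leave the response untouched
    return response


# Illustrative strings only, not real model output
example = {
    "prompt": "Tell me a story about a cute dog",
    "response": "Tell me a story about a cute dog named Rex.",
}
cleaned = strip_prompt_echo(example["prompt"], example["response"])
print(cleaned)
```

Whether to strip the echo is a design choice: keeping it matches the notebook's original output, while removing it keeps response metrics from being influenced by the prompt text.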

### Create & Inspect Language Metrics with LangKit

LangKit provides a toolkit of metrics for LLM applications. Let's initialize them and create a profile of the data that can be viewed in WhyLabs for quick analysis.

In [None]:
from langkit import llm_metrics # alternatively use 'light_metrics'
import whylogs as why

why.init(session_type='whylabs_anonymous')
# Note: llm_metrics.init() downloads models, so it is slow the first time.
schema = llm_metrics.init()

In [6]:
# Let's look at our prompt_and_response created above
profile = why.log(prompt_and_response, name="HF prompt & response", schema=schema)

✅ Aggregated 1 rows into profile 'HF prompt & response'

Visualize and explore this profile with one-click
🔍 https://hub.whylabsapp.com/resources/model-1/profiles?sessionToken=session-8gcsnbVy&profile=ref-AVbB1yOaSblsa89U

Click the link generated above to view the language metrics in WhyLabs.

We can also see all our values by viewing our LangKit profile as a pandas DataFrame.

You can use this data in real time to make a decision about prompts and responses, such as setting guardrails on your model.

In [7]:
profview = profile.view()
profview.to_pandas()

| column | cardinality/est | cardinality/lower_1 | cardinality/upper_1 | counts/inf | counts/n | counts/nan | counts/null | distribution/max | distribution/mean | distribution/median | ... | udf/toxicity:distribution/q_95 | udf/toxicity:distribution/q_99 | udf/toxicity:distribution/stddev | udf/toxicity:frequent_items/frequent_strings | udf/toxicity:types/boolean | udf/toxicity:types/fractional | udf/toxicity:types/integral | udf/toxicity:types/object | udf/toxicity:types/string | udf/toxicity:types/tensor |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| prompt | 1.0 | 1.0 | 1.00005 | 0 | 1 | 0 | 0 |  | 0.0 |  | ... | 0.005245 | 0.005245 | 0.0 | [FrequentItem(value='0.005245', est=1, upper=1... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| response | 1.0 | 1.0 | 1.00005 | 0 | 1 | 0 | 0 |  | 0.0 |  | ... | 0.004731 | 0.004731 | 0.0 | [FrequentItem(value='0.004731', est=1, upper=1... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| response.relevance_to_prompt | 1.0 | 1.0 | 1.00005 | 0 | 1 | 0 | 0 | 0.627636 | 0.627636 | 0.627636 | ... |  |  |  |  |  |  |  |  |  |  |
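
To make the guardrail idea concrete, here is a minimal sketch of a threshold check on metric values like the ones in the profile output above. Everything in it is an assumption for illustration: `check_guardrails`, the dictionary keys, and the threshold values are hypothetical and should be tuned for your own application. They are not LangKit defaults.

```python
def check_guardrails(metrics, max_toxicity=0.5, min_relevance=0.3):
    """Return a list of guardrail violations (an empty list means none)."""
    violations = []
    toxicity = metrics.get("response.toxicity")
    relevance = metrics.get("response.relevance_to_prompt")
    # Flag responses that score as too toxic
    if toxicity is not None and toxicity > max_toxicity:
        violations.append(f"toxicity {toxicity:.3f} exceeds {max_toxicity}")
    # Flag responses that drift too far from the prompt
    if relevance is not None and relevance < min_relevance:
        violations.append(f"relevance {relevance:.3f} is below {min_relevance}")
    return violations


# Values copied from the profile output above (hypothetical key names)
metrics = {
    "response.toxicity": 0.004731,
    "response.relevance_to_prompt": 0.627636,
}
violations = check_guardrails(metrics)
print("blocked" if violations else "allowed", violations)
```

In a real service, you could call a check like this before returning the model's response and fall back to a canned reply when any violation is reported.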


## ML Monitoring for Hugging Face LLMs in WhyLabs


To send LangKit profiles to WhyLabs we will need three pieces of information:

- API token
- Organization ID
- Dataset ID (or model-id)

Go to [https://whylabs.ai/free](https://whylabs.ai/free) and grab a free account. You can follow along with the quick start examples or skip them if you'd like to follow this example immediately.

1. Create a new project and note its ID (if it's a model project, it will look like `model-xxxx`)
2. Create an API token from the "Access Tokens" tab
3. Copy your org ID from the same "Access Tokens" tab

Replace the placeholder string values below with your own WhyLabs organization ID, API key, and dataset ID:

In [None]:
import os
# set authentication & project keys
os.environ["WHYLABS_DEFAULT_ORG_ID"] = 'ORGID'
os.environ["WHYLABS_API_KEY"] = 'APIKEY'
os.environ["WHYLABS_DEFAULT_DATASET_ID"] = 'MODELID'
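
It's easy to leave one of these placeholders in by accident, so a quick sanity check before writing profiles can save a failed upload. The sketch below is an assumption, not part of whylogs: `unset_whylabs_settings` is a hypothetical helper that reports which variables are missing or still set to the placeholder strings used above.

```python
import os

# Variable names from the cell above, mapped to their placeholder values
REQUIRED_SETTINGS = {
    "WHYLABS_DEFAULT_ORG_ID": "ORGID",
    "WHYLABS_API_KEY": "APIKEY",
    "WHYLABS_DEFAULT_DATASET_ID": "MODELID",
}


def unset_whylabs_settings(env=None):
    """Return the names of settings that are missing or still placeholders."""
    env = os.environ if env is None else env
    return [
        name
        for name, placeholder in REQUIRED_SETTINGS.items()
        if env.get(name, "") in ("", placeholder)
    ]


# Demo with a made-up environment: one real-looking value, one placeholder, one missing
demo_env = {"WHYLABS_DEFAULT_ORG_ID": "org-1234", "WHYLABS_API_KEY": "APIKEY"}
print(unset_whylabs_settings(demo_env))
```

Calling `unset_whylabs_settings()` with no arguments checks the real `os.environ`.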

In [None]:
from whylogs.api.writer.whylabs import WhyLabsWriter
from langkit import llm_metrics # alternatively use 'light_metrics'
import whylogs as why

# Note: llm_metrics.init() downloads models, so it is slow the first time.
schema = llm_metrics.init()

In [None]:
# Single Profile
telemetry_agent = WhyLabsWriter()
profile = why.log(prompt_and_response, schema=schema)
telemetry_agent.write(profile.view())

This will write a single profile to WhyLabs.

As more profiles are written on different dates, you'll get a time series that you can analyze and set monitors on, as in the [Demo org](https://bit.ly/3NOq0Od).

You can also backfill batches of data by overwriting the date and time as seen in [this example](https://github.com/whylabs/langkit/blob/main/langkit/examples/Batch_to_Whylabs.ipynb).

![](https://github.com/whylabs/langkit/blob/main/static/img/sentiment-monitor.png?raw=1)
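
For the backfilling mentioned above, the first step is deciding which timestamps to assign to past batches. Here is a rough sketch that builds one UTC midnight timestamp per day. `backfill_timestamps` is a hypothetical helper, and the whylogs calls in the comments reflect our understanding of the API (`set_dataset_timestamp` on the dataset profile), so treat them as an assumption and check the linked example for the exact usage.

```python
from datetime import datetime, timedelta, timezone


def backfill_timestamps(days, now=None):
    """Return one UTC midnight timestamp per day for the past `days` days, oldest first."""
    now = now or datetime.now(timezone.utc)
    today = now.replace(hour=0, minute=0, second=0, microsecond=0)
    return [today - timedelta(days=offset) for offset in range(days, 0, -1)]


# A fixed "now" keeps the demo reproducible
stamps = backfill_timestamps(3, now=datetime(2023, 7, 4, 15, 30, tzinfo=timezone.utc))
for ts in stamps:
    # With whylogs installed, each day's batch could be logged and stamped
    # roughly like this (assumed API, see the linked example):
    #   profile = why.log(batch, schema=schema).profile()
    #   profile.set_dataset_timestamp(ts)
    print(ts.isoformat())
```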

## Optional: Use a Rolling Logger
A rolling logger can be used instead of the method above to write profiles at pre-defined intervals.

In [None]:
telemetry_agent = why.logger(mode="rolling", interval=5, when="M", schema=schema, base_name="huggingface")
telemetry_agent.append_writer("whylabs")

In [None]:
# Log data + model outputs to WhyLabs.ai
telemetry_agent.log(prompt_and_response)

<whylogs.api.logger.result_set.ProfileResultSet at 0x7f6f7c5b3af0>

In [None]:
# Close the whylogs rolling logger when the service is shut down
telemetry_agent.close()
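
In a long-running service, it helps to make sure the logger is closed even when something goes wrong mid-loop. The sketch below shows one way to do that with `try`/`finally`. It uses a stand-in class, `FakeRollingLogger`, so it runs without whylogs; both that class and `log_records` are hypothetical names, and only the `log` and `close` calls mirror the cells above.

```python
class FakeRollingLogger:
    """Stand-in for the whylogs rolling logger so this sketch runs offline."""

    def __init__(self):
        self.logged = []
        self.closed = False

    def log(self, record):
        self.logged.append(record)

    def close(self):
        self.closed = True


def log_records(logger, records):
    """Log each record, closing the logger even if a call raises."""
    try:
        for record in records:
            logger.log(record)
    finally:
        logger.close()


fake = FakeRollingLogger()
log_records(
    fake,
    [
        {"prompt": "hi", "response": "hello"},
        {"prompt": "bye", "response": "goodbye"},
    ],
)
print(len(fake.logged), fake.closed)
```

To use this pattern with the real logger, pass `telemetry_agent` in place of `fake` and feed it the dictionaries returned by `gpt_model`.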

## More Resources

Learn more about monitoring LLMs in production with LangKit:

- [Intro to LangKit Example](https://github.com/whylabs/langkit/blob/main/langkit/examples/Intro_to_Langkit.ipynb)
- [LangKit LangChain Integration](https://github.com/whylabs/langkit/blob/main/langkit/examples/Langchain_OpenAI_LLM_Monitoring_with_WhyLabs.ipynb)
- [LangKit GitHub](https://github.com/whylabs/langkit)
- [whylogs GitHub - data logging & AI telemetry](https://github.com/whylabs/whylogs)
- [WhyLabs - Safeguard your Large Language Models](https://whylabs.ai/safeguard-large-language-models)
- [Hugging Face GPT2 Model](https://huggingface.co/gpt2)