[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/whylabs/LanguageToolkit/blob/main/langkit/examples/Logging_Text.ipynb)

In [None]:
%pip install langkit


In [2]:
import whylogs as why
from langkit import light_metrics

llm_schema = light_metrics.init()
print("Done initializing metrics.")


schema does not contain metadata, LangKit won't update metadata


Done initializing metrics.


`light_metrics` is composed of the following modules:
- `textstat`: Text quality, readability, complexity, and grade level.
- `regexes`: Regex pattern matching for sensitive information.
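
To make the `regexes` idea concrete, here is a minimal sketch of group-based pattern matching that uses only Python's standard `re` module. The pattern groups, the `PATTERN_GROUPS` name, and the `has_patterns` helper are illustrative assumptions, not LangKit's actual pattern set or implementation. The sketch only mirrors the behavior of returning the name of the first matching group, or `None` when nothing matches.

```python
import re

# Hypothetical pattern groups for illustration only; LangKit ships its own set.
PATTERN_GROUPS = {
    "phone number": [
        # Optional country code, then a 3-3-4 digit layout.
        re.compile(r"(?:\+?\d{1,3}[\s-]?)?\d{3}[\s-]\d{3}[\s-]\d{4}"),
    ],
    "email address": [
        re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    ],
}


def has_patterns(text):
    """Return the name of the first group with a matching pattern, else None."""
    for group_name, patterns in PATTERN_GROUPS.items():
        if any(pattern.search(text) for pattern in patterns):
            return group_name
    return None


print(has_patterns("my phone is +1 309-404-7587"))
print(has_patterns("Hello"))
```

Returning a group name rather than the matched text keeps the sensitive value itself out of any downstream profile.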

## Scenario 1: Feature Extraction

LangKit can be used to extract features from text data.

The following snippet extracts additional features from the text data, using the `llm_schema` created above to guide the extraction.

In [5]:
from langkit import extract
import pandas as pd

df = pd.DataFrame({
    'prompt': ['Hello', 'What is your number?'],
    'response': ['World', 'my phone is +1 309-404-7587'],
})

enhanced_df = extract(df, schema=llm_schema)

enhanced_df


| | prompt | response | prompt.flesch_reading_ease | response.flesch_reading_ease | prompt.automated_readability_index | response.automated_readability_index | prompt.aggregate_reading_level | response.aggregate_reading_level | prompt.syllable_count | response.syllable_count | ... | prompt.letter_count | response.letter_count | prompt.polysyllable_count | response.polysyllable_count | prompt.monosyllable_count | response.monosyllable_count | prompt.difficult_words | response.difficult_words | prompt.has_patterns | response.has_patterns |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Hello | World | 36.62 | 121.22 | 2.6 | 2.6 | 0.0 | 0.0 | 2 | 1 | ... | 5 | 5 | 0 | 0 | 0 | 1 | 0 | 0 | | |
| 1 | What is your number? | my phone is +1 309-404-7587 | 92.8 | 117.16 | 0.6 | 2.7 | 1.0 | 2.0 | 5 | 5 | ... | 16 | 20 | 0 | 0 | 3 | 5 | 0 | 0 | | phone number |


You can also provide a __dictionary__:

In [7]:
row = {"prompt": "What is your number?", "response": "my phone is +1 309-404-7587"}
enhanced_row = extract(row, schema=llm_schema)
enhanced_row


{'prompt': 'What is your number?',
 'response': 'my phone is +1 309-404-7587',
 'prompt.flesch_reading_ease': 92.8,
 'response.flesch_reading_ease': 117.16,
 'prompt.automated_readability_index': 0.6,
 'response.automated_readability_index': 2.7,
 'prompt.aggregate_reading_level': 1.0,
 'response.aggregate_reading_level': 2.0,
 'prompt.syllable_count': 5,
 'response.syllable_count': 5,
 'prompt.lexicon_count': 4,
 'response.lexicon_count': 5,
 'prompt.sentence_count': 1,
 'response.sentence_count': 1,
 'prompt.character_count': 17,
 'response.character_count': 23,
 'prompt.letter_count': 16,
 'response.letter_count': 20,
 'prompt.polysyllable_count': 0,
 'response.polysyllable_count': 0,
 'prompt.monosyllable_count': 3,
 'response.monosyllable_count': 5,
 'prompt.difficult_words': 0,
 'response.difficult_words': 0,
 'prompt.has_patterns': None,
 'response.has_patterns': 'phone number'}
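
As a rough check on the readability values above, the sketch below applies the standard Flesch reading-ease formula to the word, sentence, and syllable counts from the same output. The helper names are hypothetical, and the half-up rounding of the two ratios to one decimal place is an assumption about how the score was computed: it reproduces the 92.8 shown for the prompt, whereas the unrounded ratios would give about 97.0.

```python
import math


def round_half_up(value, digits=1):
    """Round to `digits` decimals with halves going up (1.25 -> 1.3)."""
    factor = 10 ** digits
    return math.floor(value * factor + 0.5) / factor


def flesch_reading_ease(words, sentences, syllables, round_ratios=True):
    """Flesch reading ease: 206.835 - 1.015 * (words per sentence) - 84.6 * (syllables per word)."""
    words_per_sentence = words / sentences
    syllables_per_word = syllables / words
    if round_ratios:
        # Assumption: both ratios are rounded to one decimal before use.
        words_per_sentence = round_half_up(words_per_sentence)
        syllables_per_word = round_half_up(syllables_per_word)
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


# Counts taken from the output above: lexicon_count, sentence_count, syllable_count.
prompt_score = flesch_reading_ease(words=4, sentences=1, syllables=5)
response_score = flesch_reading_ease(words=5, sentences=1, syllables=5)
print(f"prompt: {prompt_score:.3f}, response: {response_score:.3f}")
```

Scores above 100 are possible for very short texts made of one-syllable words, which is why the response lands at 117.16.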

## Scenario 2: Statistical Profiling with whylogs

LangKit modules define UDFs (user-defined functions) that automatically register themselves in the collection of UDFs that whylogs applies to string features.
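
Conceptually, this wiring resembles a registry of functions that are applied to every string column, with each result stored under a `column.metric` name. The sketch below illustrates that idea in plain Python. `UDF_REGISTRY`, `register_udf`, and `apply_udfs` are hypothetical names and are not part of the whylogs or LangKit API, and the two metric functions are simplified stand-ins.

```python
# Hypothetical registry mapping a metric name to a function of one string.
UDF_REGISTRY = {}


def register_udf(name):
    """Decorator that records a function under the given metric name."""
    def decorator(func):
        UDF_REGISTRY[name] = func
        return func
    return decorator


@register_udf("character_count")
def character_count(text):
    # Count every character that is not whitespace.
    return sum(1 for ch in text if not ch.isspace())


@register_udf("letter_count")
def letter_count(text):
    # Count alphanumeric characters, ignoring punctuation and whitespace.
    return sum(1 for ch in text if ch.isalnum())


def apply_udfs(row):
    """Return a copy of `row` with one `column.metric` entry per string column and UDF."""
    enhanced = dict(row)
    for column, value in row.items():
        if isinstance(value, str):
            for name, func in UDF_REGISTRY.items():
                enhanced[f"{column}.{name}"] = func(value)
    return enhanced


print(apply_udfs({"prompt": "Hello,", "response": "World!"}))
```

Because registration happens when a module is imported, a caller only needs to build the schema once and hand it to the logger.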

All we have to do is pass the schema to `why.log()`:

In [6]:
results = why.log({"prompt": "Hello,", "response": "World!"}, schema=llm_schema)
print("Done profiling! Let's look at some of the metrics:")

view = results.view()
for col_name in view.get_columns():
    print(col_name)
print()
print("Here is the summary for response metrics")
view.get_column("response").to_summary_dict()


Done profiling! Let's look at some of the metrics:
prompt
response

Here is the summary for response metrics


{'counts/n': 1,
 'counts/null': 0,
 'counts/nan': 0,
 'counts/inf': 0,
 'types/integral': 0,
 'types/fractional': 0,
 'types/boolean': 0,
 'types/string': 1,
 'types/object': 0,
 'types/tensor': 0,
 'distribution/mean': 0.0,
 'distribution/stddev': 0.0,
 'distribution/n': 0,
 'distribution/max': nan,
 'distribution/min': nan,
 'distribution/q_01': None,
 'distribution/q_05': None,
 'distribution/q_10': None,
 'distribution/q_25': None,
 'distribution/median': None,
 'distribution/q_75': None,
 'distribution/q_90': None,
 'distribution/q_95': None,
 'distribution/q_99': None,
 'cardinality/est': 1.0,
 'cardinality/upper_1': 1.000049929250618,
 'cardinality/lower_1': 1.0,
 'udf/has_patterns:counts/n': 1,
 'udf/has_patterns:counts/null': 1,
 'udf/has_patterns:counts/nan': 0,
 'udf/has_patterns:counts/inf': 0,
 'udf/has_patterns:types/integral': 0,
 'udf/has_patterns:types/fractional': 0,
 'udf/has_patterns:types/boolean': 0,
 'udf/has_patterns:types/string': 0,
 'udf/has_patterns:types/ob