## IBM Granite: Granite Guardian HAP 38M

Model Summary

This model is IBM's lightweight, 4-layer binary toxicity classifier for English. Its low latency makes it a suitable guardrail for any large language model. It can also be used for bulk data processing where high throughput is needed. It has been trained on several English benchmark datasets, specifically to detect hateful, abusive, profane, and other toxic content in plain text.

Developers: IBM Research
Release Date: September 6th, 2024
License: Apache 2.0.

======================================================

Usage

This model offers very low inference latency and can run on CPUs as well as on GPUs and AIUs. It has 38 million parameters: compared to the original RoBERTa model architecture, the number of hidden layers is reduced from 12 to 4, the hidden size from 768 to 576, and the intermediate size from 3072 to 768. The chart below compares this model's CPU latency and accuracy with those of other models.
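As a rough check on the 38 million figure, the parameter count of a RoBERTa-style encoder can be estimated from the sizes quoted above. This is a minimal sketch, not IBM's own accounting: the vocabulary size (50265), the number of position embeddings (514), and the single token-type embedding are assumptions taken from the standard RoBERTa base configuration, the function name is hypothetical, and the classification head is left out.

```python
def encoder_param_count(hidden, intermediate, layers,
                        vocab_size=50265, max_positions=514, type_vocab=1):
    """Approximate parameter count of a RoBERTa-style encoder (no task head).

    vocab_size, max_positions and type_vocab are ASSUMED to match the
    standard RoBERTa base configuration; they are not stated in this document.
    """
    # Word, position and token-type embeddings, plus one LayerNorm (weight + bias).
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + 2 * hidden
    # Query, key, value and output projections (weights + biases), plus LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + 2 * hidden
    # Two feed-forward projections (weights + biases), plus LayerNorm.
    feed_forward = ((hidden * intermediate + intermediate)
                    + (intermediate * hidden + hidden)
                    + 2 * hidden)
    return embeddings + layers * (attention + feed_forward)


# Sizes quoted in the text for this model and for the original RoBERTa.
hap_38m = encoder_param_count(hidden=576, intermediate=768, layers=4)
roberta_base = encoder_param_count(hidden=768, intermediate=3072, layers=12)

print(f"HAP 38M estimate:      {hap_38m / 1e6:.1f}M parameters")
print(f"RoBERTa base estimate: {roberta_base / 1e6:.1f}M parameters")
```

Under these assumptions most of the remaining parameters sit in the embedding table rather than in the four transformer layers, which is consistent with the reduction in depth and width being the main source of the latency gain.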

========================================================

Ethical Considerations and Limitations

The use of model-based guardrails for Large Language Models (LLMs) involves risks and ethical considerations that users must be aware of. This model operates on chunks of text and provides a score indicating the presence of hate speech, abuse, or profanity. However, its efficacy can be limited by several factors, such as an inability to capture nuanced meanings and the risk of false positives or false negatives on text that is dissimilar to the training data. Previous research has demonstrated the risk of various biases in toxicity and hate speech detection, and those findings are also relevant to this work. We urge the community to use this model responsibly and with ethical intentions.

In [1]:
import os
from dotenv import load_dotenv
from huggingface_hub import InferenceClient


In [2]:
load_dotenv()

True

In [3]:
client = InferenceClient(
    provider = "hf-inference",
    api_key = os.environ["HF_TOKEN"]
)

In [16]:
result = client.text_classification(
    """I’m not threatening anyone, but people like you really don’t belong in serious conversations.
You slow everyone down, and things would honestly work better if you just stayed quiet.
""",
    model="ibm-granite/granite-guardian-hap-38m",
)

In [21]:
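# The endpoint returns generic label names; relabel them in place for readability.
for r in result:
    if r.label == "LABEL_0":
        r.label = "NOT HARMFUL"
    else:
        # Any other label (LABEL_1 for this binary classifier) is treated as harmful.
        r.label = "HARMFUL"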
result

[TextClassificationOutputElement(label='NOT HARMFUL', score=0.996728777885437),
 TextClassificationOutputElement(label='HARMFUL', score=0.0032712456304579973)]
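
When the classifier is used as a guardrail, the two scores usually need to be reduced to a single decision. The sketch below shows one way to do that without mutating the result. The helper names and the 0.5 threshold are assumptions for illustration, not part of the model or of `huggingface_hub`, and the sample objects are stand-ins that only mimic the `.label` and `.score` attributes of the output above.

```python
from types import SimpleNamespace

# Accept both the raw label and the relabelled name used in the cell above.
HARMFUL_LABELS = {"LABEL_1", "HARMFUL"}


def harmful_score(elements):
    """Return the score attached to the harmful class (hypothetical helper)."""
    for element in elements:
        if element.label in HARMFUL_LABELS:
            return element.score
    raise ValueError("no harmful-class entry in classifier output")


def is_harmful(elements, threshold=0.5):
    """Assumed decision rule: flag text whose harmful score reaches the threshold."""
    return harmful_score(elements) >= threshold


# Stand-in objects with rounded scores from the output above (no network call).
sample = [
    SimpleNamespace(label="LABEL_0", score=0.9967),
    SimpleNamespace(label="LABEL_1", score=0.0033),
]
print(is_harmful(sample))
```

The threshold trades false positives against false negatives, so it is worth tuning on text that resembles the target traffic rather than leaving it at an arbitrary default.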