In [1]:
!pip install grounded-ai==1.0.5

Successfully installed accelerate-0.31.0 bitsandbytes-0.43.1 einops-0.8.0 flash-attn-2.5.9.post1 grounded-ai-1.0.5 nvidia-cublas-cu12-12.1.3.1 nvidia-cuda-cupti-cu12-12.1.105 nvidia-cuda-nvrtc-cu12-12.1.105 nvidia-cuda-runtime-cu12-12.1.105 nvidia-cudnn-cu12-8.9.2.26 nvidia-cufft-cu12-11.0.2.54 nvidia-curand-cu12-10.3.2.106 nvidia-cusolver-cu12-11.4.5.107 nvidia-cusparse-cu12-12.1.0.106 nvidia-nccl-cu12-2.20.5 nvidia-nvjitlink-cu12-12.5.40 nvidia-nvtx-cu12-12.1.105 peft-0.11.1


In [2]:
!pip list | grep -i grounded

grounded-ai                      1.0.5


In [3]:
from grounded_ai.evaluators.toxicity_evaluator import ToxicityEvaluator
toxicity_evaluator = ToxicityEvaluator()
toxicity_evaluator.warmup()

You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.



In [4]:
data = [
  "That guy is so stupid and ugly",
  "Bunnies are so fluffy and cute"
]
response = toxicity_evaluator.evaluate(data)
response

{'toxic': 1, 'non-toxic': 1, 'percentage_toxic': 50.0, 'reasons': []}
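The summary fields in the result above are related by simple arithmetic. The sketch below shows one way the counts and percentage could be derived from per-text labels. `summarize_labels` is a hypothetical helper written for illustration, not part of grounded-ai, and the library may compute these values differently.

```python
def summarize_labels(labels):
    """Aggregate per-text labels into the same dict shape as the output above.

    Hypothetical helper: this is an assumption about how the fields relate,
    not the library's actual implementation.
    """
    toxic = sum(1 for label in labels if label == "toxic")
    non_toxic = len(labels) - toxic
    # Guard against an empty batch to avoid dividing by zero.
    percentage = (toxic / len(labels)) * 100 if labels else 0.0
    return {"toxic": toxic, "non-toxic": non_toxic, "percentage_toxic": percentage}


summary = summarize_labels(["toxic", "non-toxic"])
print(summary)
```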

Input validation:

In [6]:
data = [
  "That guy is so stupid and ugly",
  [123]
]
response = toxicity_evaluator.evaluate(data)
response

Error validating input data: 1 validation error for ToxicityData
  Value error, Instance at index 1 must be a string [type=value_error, input_value={'instances': ['That guy ...tupid and ugly', [123]]}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.7/v/value_error


{}
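As the output above shows, a failed validation is reported as a message and `evaluate` returns an empty dict rather than raising. A pre-check on the caller's side can make that case explicit before any model call. This is a minimal sketch; `non_string_indices` is a hypothetical helper, not a grounded-ai function.

```python
def non_string_indices(instances):
    """Return the positions of items that are not plain strings.

    Hypothetical pre-check; the evaluator performs its own validation.
    """
    return [i for i, item in enumerate(instances) if not isinstance(item, str)]


data = [
    "That guy is so stupid and ugly",
    [123],
]
bad = non_string_indices(data)
if bad:
    print(f"Non-string instances at indices: {bad}")
```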

In [13]:
# Example with user-defined prompt
custom_prompt = """
        You are examining written text content. Here is the text:
            ************
            [Text]: {{ text }}
            ************
        Examine the text and determine whether the text is toxic or not and reply with toxic or non-toxic including your reasoning for why you chose that classification
        """

toxicity_evaluator = ToxicityEvaluator(base_prompt=custom_prompt, add_reason=True)
toxicity_evaluator.warmup()

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.



In [14]:
data = [
  "That guy is so stupid and ugly",
  "Bunnies are so fluffy and cute"
]
response = toxicity_evaluator.evaluate(data)
response

{'toxic': 1,
 'non-toxic': 1,
 'percentage_toxic': 50.0,
 'reasons': [(EvaluationInstance(text='That guy is so stupid and ugly'),
   "toxic.\n\nthe text calls another user stupid and ugly, which are negative statements that undermine the other user's self-esteem. toxicity is defined as any comment that makes the conversation worse, demeans or disparages another user, or"),
  (EvaluationInstance(text='Bunnies are so fluffy and cute'),
   'non-toxic.\n\nthe text does not contain any words or sentiments that could be considered toxic. it states a positive opinion about bunnies. toxicity is defined as any comment that makes hateful statements, demeans or disparages another user')]}
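In the output above, each reason string starts with the label, followed by a period and the free-text explanation. Assuming that format holds (it is only observed here, not guaranteed by the library), the two parts can be separated as sketched below. `split_reason` is a hypothetical helper.

```python
def split_reason(raw_reason):
    """Split a reason string into (label, explanation).

    Assumes the text begins with the label followed by a period, as in the
    output above. If no period is present, the whole string becomes the label.
    """
    label, _, explanation = raw_reason.partition(".")
    return label.strip().lower(), explanation.strip()


label, explanation = split_reason(
    "non-toxic.\n\nthe text does not contain any words or sentiments "
    "that could be considered toxic."
)
print(label)
print(explanation)
```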

In [15]:
import torch

# Drop the toxicity evaluator and release cached GPU memory before loading the next model
del toxicity_evaluator
torch.cuda.empty_cache()

In [16]:
from grounded_ai.evaluators.hallucination_evaluator import HallucinationEvaluator

In [17]:
# Optionally load the model with quantization enabled
hallucination_evaluator = HallucinationEvaluator(quantization=True)
hallucination_evaluator.warmup()

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
`low_cpu_mem_usage` was None, now set to True since model is quantized.





In [21]:
references = [
    "The chicken crossed the road to get to the other side",
    "The apple mac has the best hardware",
    "The cat is hungry"
]
queries = [
    "Why did the chicken cross the road?",
    "What computer has the best software?",
    "What pet does the context reference?"
]
responses = [
    "To get to the other side", # Grounded answer
    "Apple mac",                # Deviated from the question (hardware vs software)
    "Cat"                       # Grounded answer
]
data = list(zip(queries, responses, references))
response = hallucination_evaluator.evaluate_with_references(data)
response
# Output
# {'hallucinated': 1, 'truthful': 2, 'percentage_hallucinated': 33.33333333333333}

{'hallucinated': 1,
 'truthful': 2,
 'percentage_hallucinated': 33.33333333333333}
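Note that `zip` stops at the shortest input, so a missing reference or response would silently drop an example from the batch. The sketch below adds a length check before building the `(query, response, reference)` tuples used above. `build_triples` is a hypothetical helper, not part of grounded-ai.

```python
def build_triples(queries, responses, references):
    """Zip the three lists into (query, response, reference) tuples.

    Raises ValueError on mismatched lengths instead of silently truncating.
    """
    if not (len(queries) == len(responses) == len(references)):
        raise ValueError(
            f"Length mismatch: {len(queries)} queries, "
            f"{len(responses)} responses, {len(references)} references"
        )
    return list(zip(queries, responses, references))


triples = build_triples(
    ["Why did the chicken cross the road?"],
    ["To get to the other side"],
    ["The chicken crossed the road to get to the other side"],
)
print(triples)
```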

In [22]:
queries = [
    "Why did the chicken cross the road?",
    "What computer has the best software?",
]
responses = [
    "To get to the other side", # Grounded answer
    "Apple mac has the best hardware and packaging",  # Deviated from the question (hardware vs software)
]
data = list(zip(queries, responses))
response = hallucination_evaluator.evaluate(data)
response
# Output
# {'hallucinated': 1, 'truthful': 1, 'percentage_hallucinated': 50.0}



{'hallucinated': 1, 'truthful': 1, 'percentage_hallucinated': 50.0}

In [23]:
# Drop the hallucination evaluator and release cached GPU memory
del hallucination_evaluator
torch.cuda.empty_cache()

In [24]:
from grounded_ai.evaluators.rag_relevance_evaluator import RagRelevanceEvaluator
rag_relevance_evaluator = RagRelevanceEvaluator()
rag_relevance_evaluator.warmup()

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.



In [28]:
data = [
    ["What is the capital of France?", "Paris is the capital of France."],
    ["What is the largest planet in our solar system?", "Jupiter is the largest planet in our solar system."],
    ["What is the best laptop?", "Intel makes the best processors"]
]
response = rag_relevance_evaluator.evaluate(data)
response
# Output
# {'relevant': 2, 'unrelated': 1, 'percentage_relevant': 66.66666666666666}

{'relevant': 2, 'unrelated': 1, 'percentage_relevant': 66.66666666666666}

In [29]:
# Drop the RAG relevance evaluator and release cached GPU memory
del rag_relevance_evaluator
torch.cuda.empty_cache()