## Filtering logprobs based on a Grammar specification

In [1]:
#%pip install --upgrade --quiet transformers-cfg

In [1]:
MODEL_ID = "microsoft/Phi-3-mini-4k-instruct"

import os
from dotenv import load_dotenv
load_dotenv("../keys.env");

## Use Grammar to ensure that only an arithmetic expression gets generated

In [2]:
from transformers import pipeline
from transformers_cfg.grammar_utils import IncrementalGrammarConstraint
from transformers_cfg.generation.logits_process import GrammarConstrainedLogitsProcessor

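pipe = pipeline(
    task="text-generation",
    model=MODEL_ID,
    # return_full_text is left at its default: for chat-style input the pipeline
    # then returns the whole message list, and the functions below read the
    # reply from the last message's 'content'.
)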

Device set to use cuda:0


In [10]:
def get_expression_that_solves(math_problem: str) -> str:
    system_prompt = """
    You are a math instructor. I will ask you a math question.
    Respond with the mathematical expression that can be used to solve the problem.
    """
    
    # define the grammar: one or more lines of the form `expr = term`
    grammar_str = """
root  ::= (expr "=" ws term "\n")+
expr  ::= term ([-+*/] term)*
term  ::= ident | num | "(" ws expr ")" ws
ident ::= [a-z] [a-z0-9_]* ws
num   ::= [0-9]+ ws
ws    ::= [ \t\n]*
    """
    grammar = IncrementalGrammarConstraint(grammar_str, "root", pipe.tokenizer)
    grammar_processor = GrammarConstrainedLogitsProcessor(grammar)

    input_message = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": math_problem}   
    ]

    results = pipe(input_message, 
                   max_new_tokens=256, 
                   do_sample=False, 
                   logits_processor=[grammar_processor])
    return results[0]['generated_text'][-1]['content'].strip()
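
To make the idea of "filtering logprobs" concrete, here is a minimal sketch of what a grammar-constrained logits processor does conceptually at each decoding step: tokens the grammar cannot accept get a score of negative infinity, so greedy decoding picks the best remaining token. Everything in it (`mask_logits`, `greedy_pick`, `is_arithmetic`, the toy vocabulary and scores) is hypothetical and for illustration only. The real `GrammarConstrainedLogitsProcessor` tracks the parse state of the grammar instead of using a fixed character set.

```python
import math

def mask_logits(logits, vocab, is_allowed):
    """Return a copy of `logits` with every disallowed token set to -inf."""
    return [
        logit if is_allowed(token) else -math.inf
        for logit, token in zip(logits, vocab)
    ]

def greedy_pick(logits, vocab):
    """Return the token with the highest score (what do_sample=False does)."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return vocab[best]

# Simplifying assumption: a token is "allowed" if it only uses these characters.
ALLOWED_CHARS = set("0123456789+-*/=() \n")

def is_arithmetic(token):
    return all(ch in ALLOWED_CHARS for ch in token)

# Toy vocabulary and made-up scores for a single decoding step.
vocab = ["The", " answer", "3", " +", "2", " ="]
logits = [4.0, 3.5, 2.0, 1.0, 0.5, 0.1]

masked = mask_logits(logits, vocab, is_arithmetic)
unconstrained = greedy_pick(logits, vocab)
constrained = greedy_pick(masked, vocab)
print(repr(unconstrained), "->", repr(constrained))
```

Without the mask the model would start a prose answer with `The`; with the mask the highest-scoring token that survives is `3`.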

In [11]:
result = get_expression_that_solves("""
Bill has 3 apples and 2 oranges.
Mae has 2 apples and 4 oranges.
How many total apples do Bill and Mae have?
""")
print(result)

bill_apples +mae_apples = total_apples

3 +2 = 5


Our result:
```
bill_apples +mae_apples = total_apples

3 +2 = 5
```

In [12]:
# Here the natural answer is a comparison, (3+2) > (2+4), but the grammar
# only permits "=", so the model has no valid way to express it.
result = get_expression_that_solves("""
Bill has 3 apples and 2 oranges.
Mae has 2 apples and 4 oranges.
Do Bill and Mae have more apples than oranges?
""")
print(result)

1 +2 = 3
2 +4 = 6

3 = 3
6 = 6

3 = 6

[... "3 = 6" repeated 37 more times ...]

3 =


Our result:
```
1 +2 = 3
2 +4 = 6

3 = 3
...
```
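This is what happens when the grammar is too limited for the question. The grammar only permits equalities, so the model cannot express the comparison it needs. It falls back to lines that are valid under the grammar but meaningless (`3 = 6`), and because `root` allows any number of lines, it keeps repeating them until it hits `max_new_tokens`.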
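
One possible remedy is to widen the grammar so that a comparison is expressible and to stop after a single line. The sketch below is an untested assumption, not something this notebook ran against the model: `COMPARISON_GRAMMAR` and its `cmp` rule are hypothetical, and `matches_flat_subset` is only a regex approximation of the parenthesis-free part of that grammar, useful as a quick sanity check of which strings it would accept.

```python
import re

# Hypothetical variant of the grammar above: exactly one line (no "+" repetition)
# whose two sides are joined by "=", "<" or ">".
COMPARISON_GRAMMAR = r"""
root  ::= expr cmp ws expr "\n"
cmp   ::= "=" | "<" | ">"
expr  ::= term ([-+*/] term)*
term  ::= ident | num | "(" ws expr ")" ws
ident ::= [a-z] [a-z0-9_]* ws
num   ::= [0-9]+ ws
ws    ::= [ \t\n]*
"""

# Regex approximation of the grammar WITHOUT parenthesised terms.
TERM = r"(?:[a-z][a-z0-9_]*|[0-9]+)[ \t\n]*"
EXPR = rf"{TERM}(?:[-+*/]{TERM})*"
LINE = re.compile(rf"{EXPR}[=<>][ \t\n]*{EXPR}\n")

def matches_flat_subset(text: str) -> bool:
    """True if `text` is one line accepted by the paren-free subset."""
    return LINE.fullmatch(text) is not None

for candidate in ["3 +2 > 2 +4\n", "3 = 6\n", "3 ! 6\n"]:
    print(repr(candidate), matches_flat_subset(candidate))
```

If this grammar were used, it would be passed to `IncrementalGrammarConstraint` in place of `grammar_str`, exactly as in `get_expression_that_solves`. Note that, as in the original grammar, whitespace is only allowed after a term, which is why the outputs look like `3 +2`.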

## Use Grammar to specify a format + validation

We want each record in the form `author | title | year`.


In [13]:
def parse_book_info(paragraph: str) -> str:
    system_prompt = """
    You will be given a short paragraph about a book.
    Extract the author, title, and publication year of the book.
    Return the result as author | title | year
    If any piece of information is not found, fill the spot with NULL
    """
    
    # define the grammar: a single author|title|year record; NULL marks a missing field
    grammar_str = """
record ::= author separator title separator year
author ::= [a-zA-Z ]* | unk
title ::= [a-zA-Z ]* | unk
year ::= [1-2][0-9][0-9][0-9] | unk
unk ::= "NULL"
separator ::= "|"
    """
    grammar = IncrementalGrammarConstraint(grammar_str, "record", pipe.tokenizer)
    grammar_processor = GrammarConstrainedLogitsProcessor(grammar)

    input_message = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": paragraph}   
    ]

    results = pipe(input_message, 
                   max_new_tokens=256, 
                   do_sample=False, 
                   logits_processor=[grammar_processor])
    return results[0]['generated_text'][-1]['content'].strip()


parse_book_info("""
Love in the Time of Cholera (Spanish: El amor en los tiempos del cólera) is a novel written in Spanish
by Colombian Nobel Prize-winning author Gabriel García Márquez and published in 1985.
""")

'Gabriel Garcia Marquez | Love in the Time of Cholera |1985'

Result:
```
Gabriel Garcia Marquez | Love in the Time of Cholera |1985
```
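Note that the accents have been removed: the grammar restricts author and title to `[a-zA-Z ]`, so García Márquez can only be emitted as Garcia Marquez.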

In [15]:
parse_book_info("""
The Tirukkural (Tamil: திருக்குறள், lit. 'sacred verses')
is a classic Tamil language text whose authorship is traditionally attributed to Valluvar,
also known in full as Thiruvalluvar. The text has been dated variously from 300 BCE to 5th century CE. 
The traditional accounts describe it as the last work of the third Sangam, but linguistic analysis
suggests a later date of 450 to 500 CE and that it was composed after the Sangam period.
""")

'Valluvar | The Tirukkural |NULL'

Result:
```
Valluvar | The Tirukkural |NULL
```
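Note the use of NULL for the year: the paragraph gives only approximate date ranges, none of which fits the four-digit `year` pattern, so the model falls back to `unk`.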
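
Because the grammar guarantees the `author | title | year` shape, the string can be turned into structured data with very little code. The following is a minimal sketch of such a post-processing step. `BookRecord` and `parse_record` are hypothetical names introduced here, and mapping `NULL` (or an empty field, which `[a-zA-Z ]*` permits) to `None` is an assumption about what downstream code would want.

```python
from typing import Optional, TypedDict

class BookRecord(TypedDict):
    author: Optional[str]
    title: Optional[str]
    year: Optional[int]

def parse_record(record: str) -> BookRecord:
    """Split an 'author | title | year' string into a typed record."""
    parts = [part.strip() for part in record.split("|")]
    if len(parts) != 3:
        raise ValueError(f"expected 3 fields, got {len(parts)}: {record!r}")
    # Treat the grammar's NULL marker and empty fields as missing values.
    author, title, year = (None if p in ("", "NULL") else p for p in parts)
    return {
        "author": author,
        "title": title,
        "year": int(year) if year is not None else None,
    }

print(parse_record("Gabriel Garcia Marquez | Love in the Time of Cholera |1985"))
print(parse_record("Valluvar | The Tirukkural |NULL"))
```

The `int(year)` conversion cannot fail on grammar-conforming output, since `year` is either four digits or `NULL`.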