In [2]:
import numpy as np

In [3]:
import checklist
from checklist.test_suite import TestSuite
import logging
logging.basicConfig(level=logging.ERROR)

In [4]:
from transformers import pipeline

In [5]:
model = pipeline("question-answering", model="./trained_model", device=0)

In [6]:
suite_path = '../checklist/release_data/squad/squad_suite.pkl'
suite = TestSuite.from_file(suite_path)

In [7]:
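def predconfs(context_question_pairs):
    """Return (predictions, confidences) for a list of (context, question) pairs."""
    preds = []
    confs = []
    for c, q in context_question_pairs:
        try:
            p = model(question=q, context=c, truncation=True)
        except Exception:
            # Record a placeholder so preds and confs stay aligned with the inputs,
            # then skip to the next pair instead of reusing a stale or undefined p.
            print('Failed', q)
            preds.append(' ')
            confs.append(1)
            continue
        preds.append(p['answer'])
        confs.append(p['score'])
    return preds, np.array(confs)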
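
The wrapper passed to suite.run has to return exactly one prediction and one confidence per input pair, including pairs where the model raises. The sketch below checks that property without loading the real pipeline. `fake_model` and `predconfs_checked` are hypothetical names introduced for illustration only; the stand-in fails on an empty context just to exercise the error path, and confidences are returned as a plain list rather than a NumPy array to keep the example dependency-free.

```python
def fake_model(question, context, **kwargs):
    """Hypothetical stand-in for the QA pipeline: raises on an empty context."""
    if not context:
        raise ValueError("empty context")
    # Pretend the answer is always the first word of the context.
    return {"answer": context.split()[0], "score": 0.5}


def predconfs_checked(pairs, qa=fake_model):
    """Return one prediction and one confidence per (context, question) pair."""
    preds, confs = [], []
    for c, q in pairs:
        try:
            p = qa(question=q, context=c)
        except Exception:
            # Placeholder entry keeps the outputs aligned with the inputs.
            preds.append(" ")
            confs.append(1.0)
        else:
            preds.append(p["answer"])
            confs.append(p["score"])
    return preds, confs


pairs = [
    ("Tyler is cleaner than Sarah.", "Who is less clean?"),
    ("", "Who?"),  # triggers the failure path in fake_model
]
preds, confs = predconfs_checked(pairs)
assert len(preds) == len(confs) == len(pairs)
```

Using try/except/else keeps the success path out of the except branch, so a failed call can never append a second entry for the same pair.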

In [8]:
suite.run(predconfs, overwrite=True)  # pass n=100 for a quicker test run

Running A is COMP than B. Who is more / less COMP?
Predicting 988 examples
Running Intensifiers (very, super, extremely) and reducers (somewhat, kinda, etc)?
Predicting 5964 examples
Running size, shape, age, color
Predicting 2000 examples
Running Profession vs nationality
Predicting 5000 examples
Running Animal vs Vehicle
Predicting 2000 examples
Running Animal vs Vehicle v2
Predicting 1984 examples
Running Synonyms
Predicting 1788 examples
Running A is COMP than B. Who is antonym(COMP)? B
Predicting 1984 examples
Running A is more X than B. Who is more antonym(X)? B. Who is less X? B. Who is more X? A. Who is less antonym(X)? A.
Predicting 7856 examples
Running Question typo
Predicting 1000 examples
Running Question contractions
Predicting 1005 examples
Running Add random sentence to context
Predicting 1500 examples
Running Change name everywhere
Predicting 5500 examples
Running Change location everywhere
Predicting 5500 examples
Running There was a change in profession
Predicting 96

In [9]:
def format_squad_with_context(x, pred, conf, label=None, *args, **kwargs):
    c, q = x
    ret = 'C: %s\nQ: %s\n' % (c, q)
    if label is not None:
        ret += 'A: %s\n' % label
    ret += 'P: %s\n' % pred
    return ret
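
To see what the formatter produces before handing it to suite.summary, it can be called directly on a single example. The snippet below repeats the function so it runs on its own; the sample context, question, label, and prediction are taken from the first failing example in the summary output, and the confidence value 0.9 is an arbitrary placeholder that the formatter ignores.

```python
def format_squad_with_context(x, pred, conf, label=None, *args, **kwargs):
    # x is a (context, question) pair; conf is accepted but not printed.
    c, q = x
    ret = 'C: %s\nQ: %s\n' % (c, q)
    if label is not None:
        ret += 'A: %s\n' % label
    ret += 'P: %s\n' % pred
    return ret


example = ("Jennifer is cooler than Charles.", "Who is less cool?")
formatted = format_squad_with_context(example, pred="Jennifer", conf=0.9, label="Charles")
print(formatted)
```

When `label` is omitted, the `A:` line is skipped and only the context, question, and prediction are shown.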

In [10]:
suite.summary(format_example_fn=format_squad_with_context)

Vocabulary

A is COMP than B. Who is more / less COMP?
Test cases:      494
Fails (rate):    492 (99.6%)

Example fails:
C: Jennifer is cooler than Charles.
Q: Who is less cool?
A: Charles
P: Jennifer


----
C: Tyler is cleaner than Sarah.
Q: Who is less clean?
A: Sarah
P: Tyler


----
C: Robert is darker than Melissa.
Q: Who is less dark?
A: Melissa
P: Robert


----


Intensifiers (very, super, extremely) and reducers (somewhat, kinda, etc)?
Test cases:      497
Fails (rate):    497 (100.0%)

Example fails:
C: Zachary is hopeful about the project. Joseph is highly hopeful about the project.
Q: Who is least hopeful about the project?
A: Zachary
P: Joseph

C: Joseph is highly hopeful about the project. Zachary is hopeful about the project.
Q: Who is least hopeful about the project?
A: Zachary
P: Joseph

C: Joseph is highly hopeful about the project. Zachary is a little hopeful about the project.
Q: Who is least hopeful about the project?
A: Zachary
P: Joseph is highly hopeful about the 

In [11]:
suite.visual_summary_table()

Please wait as we prepare the table data...



In [12]:
test = suite.tests['Question typo']
test.run(predconfs, overwrite=True)

Predicting 1000 examples


In [14]:
test.summary()

Test cases:      500
Fails (rate):    100 (20.0%)

Example fails:
Caris & Co. ("The Writers Guild of America strike that halted production of network programs for much of the 2007–08 season affected the network in 2007–08 and 2008–09, as various ABC shows that premiered in 2007, such as Dirty Sexy Money, Pushing Daisies, Eli Stone and Samantha Who?, did not live to see a third season; other series such as Boston Legal and the U.S. version of Life on Mars suffered from low viewership, despite the former, a spin off of The Practice, being a once-highlighted breakout series when it debuted in 2005. One of the network's strike-replacement programs during that time was the game show Duel, which premiered in December 2007. The program would become a minor success for the network during its initial six-episode run, which led ABC to renew Duel as a regular series starting in April 2008. However, Duel suffered from low viewership during its run as a regular series, and ABC canceled the program 