In [1]:
%reload_ext autoreload
%autoreload 2

import pandas as pd
import utils

### Goals
- Top priority is debugging: find all inference bugs that cause low or zero performance.
- Then find the low-performing groups and learn what's wrong.

### Learned
- Added a few TODOs to the Obsidian driver notes:
    - How to eval in vector space?
    - How to build `dataloader` for evaluation?
    - How to deal with `multiple choice` eval? 
- Full eval on `t0` to serve as today's baseline for the next stage of experiments.

In [2]:
csv_path = '/workspaces/seed/paper/sanhMultitaskPromptedTraining2022a/evaluation_result/20230112/T0_results.csv'
new_csv_path = '/workspaces/seed/paper/sanhMultitaskPromptedTraining2022a/evaluation_result/20230113/T0.csv'

df = pd.read_csv(csv_path)
new_df = pd.read_csv(new_csv_path)

# replace subset_name which is NaN with 'all'
df['subset_name'] = df['subset_name'].fillna('all')
new_df['subset_name'] = new_df['subset_name'].fillna('all')

In [17]:
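def report(df):
    """Per-(dataset, subset) accuracy stats across prompts, most unstable first."""
    return df.groupby(['dataset_name', 'subset_name']) \
        .agg({'accuracy': ['max', 'min', 'mean', 'std', 'count']}) \
        .rename(columns={'count': 'num_prompts'}) \
        .sort_values(by=('accuracy', 'std'), ascending=False)

def peak(df, dataset_name, subset_name):
    """Rows for one (dataset, subset) pair, sorted by prompt name."""
    # Boolean mask selects the rows that match both the dataset and the subset.
    return df[(df['dataset_name'] == dataset_name) & (df['subset_name'] == subset_name)].sort_values(by='prompt_name')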

In [18]:
report(df)

dataset_name,subset_name,accuracy_max,accuracy_min,accuracy_mean,accuracy_std,num_prompts
super_glue,cb,0.803571,0.0,0.469048,0.394331,15
super_glue,wic,0.62,0.0,0.227,0.292833,10
super_glue,wsc.fixed,0.692308,0.009615,0.521154,0.208841,10
anli,all,0.445,0.0,0.258,0.2043,15
super_glue,copa,0.96,0.32,0.80969,0.19117,12
super_glue,rte,0.875,0.735,0.825,0.047022,10
winogrande,winogrande_xl,0.63,0.55,0.595,0.032787,5
hellaswag,all,0.035,0.02,0.0275,0.006455,4


In [19]:
report(new_df)

dataset_name,subset_name,accuracy_max,accuracy_min,accuracy_mean,accuracy_std,num_prompts
super_glue,wsc.fixed,0.8,0.0,0.52,0.246306,10
super_glue,copa,0.95,0.35,0.750183,0.168953,12
super_glue,wic,0.6,0.0,0.43,0.161933,10
super_glue,cb,0.8,0.25,0.723333,0.141253,15
winogrande,winogrande_xl,0.8,0.45,0.58,0.135093,5
anli,all,0.55,0.1,0.39,0.115264,15
super_glue,rte,0.85,0.65,0.765,0.066875,10
hellaswag,all,0.05,0.0,0.0125,0.025,4


### What's wrong with `cb`?
- `Never != never`: this is the problem with comparing in text space. Even in token space, case sensitivity remains.
  - Quick fix: naively lowercase all t2t generated outputs and targets.
  - Evaluating in text space is not right; this will have to be rebuilt later in vector space.
- After the fix, no `cb` prompt scores 0 (the minimum is now 0.25).
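
The lowercase fix described above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual `utils` code: the names `normalize` and `exact_match_accuracy` are hypothetical, and collapsing whitespace is an extra assumption beyond plain lowercasing.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (assumed normalization, not from utils)."""
    return " ".join(text.lower().split())


def exact_match_accuracy(outputs, targets):
    """Fraction of outputs equal to their target after normalization."""
    if len(outputs) != len(targets):
        raise ValueError("outputs and targets must have the same length")
    if not outputs:
        return 0.0
    hits = sum(normalize(o) == normalize(t) for o, t in zip(outputs, targets))
    return hits / len(outputs)


# 'Never' now matches 'never'; 'Sometimes' vs 'always' is still a miss.
outputs = ["Never", "Sometimes", "always"]
targets = ["never", "always", "Always"]
print(exact_match_accuracy(outputs, targets))
```

This only hides the case mismatch; it does not address the deeper concern about evaluating in text space.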

In [20]:
peak(df, 'super_glue', 'cb')

Unnamed: 0,model_name,dataset_name,subset_name,test_size,prompt_name,accuracy
16,bigscience/T0,super_glue,cb,56,GPT-3 style,0.767857
22,bigscience/T0,super_glue,cb,56,MNLI crowdsource,0.0
15,bigscience/T0,super_glue,cb,56,always/sometimes/never,0.0
11,bigscience/T0,super_glue,cb,56,based on the previous passage,0.803571
10,bigscience/T0,super_glue,cb,56,can we infer,0.785714
12,bigscience/T0,super_glue,cb,56,claim true/false/inconclusive,0.017857
17,bigscience/T0,super_glue,cb,56,consider always/sometimes/never,0.0
13,bigscience/T0,super_glue,cb,56,does it follow that,0.732143
21,bigscience/T0,super_glue,cb,56,does this imply,0.803571
18,bigscience/T0,super_glue,cb,56,guaranteed true,0.767857


In [21]:
peak(new_df, 'super_glue', 'cb')

Unnamed: 0,checkpoint,dataset_name,subset_name,test_size,time,prompt_name,accuracy
16,bigscience/T0,super_glue,cb,20,0.887944,GPT-3 style,0.8
22,bigscience/T0,super_glue,cb,20,1.005579,MNLI crowdsource,0.75
15,bigscience/T0,super_glue,cb,20,0.848107,always/sometimes/never,0.7
11,bigscience/T0,super_glue,cb,20,0.847646,based on the previous passage,0.8
10,bigscience/T0,super_glue,cb,20,0.866905,can we infer,0.75
12,bigscience/T0,super_glue,cb,20,0.908543,claim true/false/inconclusive,0.8
17,bigscience/T0,super_glue,cb,20,0.850977,consider always/sometimes/never,0.6
13,bigscience/T0,super_glue,cb,20,0.833466,does it follow that,0.75
21,bigscience/T0,super_glue,cb,20,0.852169,does this imply,0.8
18,bigscience/T0,super_glue,cb,20,0.851614,guaranteed true,0.75


### How about `wic`?
- Same case-sensitivity problem. Decided to rerun a full eval of `t0_3b` on all datasets and reanalyze.
- `wic` has no 0 now, except for the `similar-sense` prompt. Looking at the samples below (two sentences, the prompt question, the target, then the LM output), I think it's just a very unclear prompt.
```
I expect to receive wages.
We were expecting a visit from our relatives.
Similar sense of expect?
yes
noun

To pick rags.
Don't always pick on your little brother.
Similar sense of pick?
no
he picks up his little brother and puts him down.

They kept a log of all transmission by the radio station.
An email log.
Similar sense of log?
yes
similar sense of log

The professionalization of warfare.
The professionalization of American sports.
Similar sense of professionalization?
yes
noun

He drank a mixture of beer and lemonade.
The mixture of sulphuric acid and water produces heat.
Similar sense of mixture?
no
noun
```
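
One way to confirm that a prompt is unclear, rather than the model simply guessing wrong, is to measure how often the output falls outside the label set. The sketch below is an assumption-laden illustration: `out_of_label_rate` is a hypothetical helper, and the outputs are copied from the five samples above.

```python
def out_of_label_rate(outputs, labels):
    """Fraction of outputs that are not any of the allowed labels (case-insensitive)."""
    allowed = {label.strip().lower() for label in labels}
    if not outputs:
        return 0.0
    misses = [o for o in outputs if o.strip().lower() not in allowed]
    return len(misses) / len(outputs)


# LM outputs for the `similar-sense` samples shown above.
wic_outputs = [
    "noun",
    "he picks up his little brother and puts him down.",
    "similar sense of log",
    "noun",
    "noun",
]
print(out_of_label_rate(wic_outputs, ["yes", "no"]))  # 1.0: no output is yes/no
```

A rate near 1.0 suggests the model does not understand what answer format the prompt expects, which exact-match scoring then reports as 0 accuracy.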

In [22]:
peak(df, 'super_glue', 'wic')

Unnamed: 0,model_name,dataset_name,subset_name,test_size,prompt_name,accuracy
59,bigscience/T0,super_glue,wic,200,GPT-3-prompt,0.0
62,bigscience/T0,super_glue,wic,200,GPT-3-prompt-with-label,0.515
58,bigscience/T0,super_glue,wic,200,affirmation_true_or_false,0.515
57,bigscience/T0,super_glue,wic,200,grammar_homework,0.0
63,bigscience/T0,super_glue,wic,200,polysemous,0.62
61,bigscience/T0,super_glue,wic,200,question-context,0.0
56,bigscience/T0,super_glue,wic,200,question-context-meaning,0.0
55,bigscience/T0,super_glue,wic,200,question-context-meaning-with-label,0.01
60,bigscience/T0,super_glue,wic,200,same_sense,0.61
64,bigscience/T0,super_glue,wic,200,similar-sense,0.0


In [23]:
peak(new_df, 'super_glue', 'wic')

Unnamed: 0,checkpoint,dataset_name,subset_name,test_size,time,prompt_name,accuracy
59,bigscience/T0,super_glue,wic,20,0.252208,GPT-3-prompt,0.45
62,bigscience/T0,super_glue,wic,20,0.259192,GPT-3-prompt-with-label,0.4
58,bigscience/T0,super_glue,wic,20,0.306139,affirmation_true_or_false,0.55
57,bigscience/T0,super_glue,wic,20,0.2608,grammar_homework,0.45
63,bigscience/T0,super_glue,wic,20,0.32057,polysemous,0.5
61,bigscience/T0,super_glue,wic,20,0.2521,question-context,0.45
56,bigscience/T0,super_glue,wic,20,0.243914,question-context-meaning,0.45
55,bigscience/T0,super_glue,wic,20,0.268794,question-context-meaning-with-label,0.45
60,bigscience/T0,super_glue,wic,20,0.315969,same_sense,0.6
64,bigscience/T0,super_glue,wic,20,0.439067,similar-sense,0.0


### `hellaswag` is the elephant in the room: consistently bad performance across all prompts
Preview of `hellaswag`:
```
input_text: How does this sentence end?
[header] How to recover from an emotional affair [title] Forgive yourself. [step] While forgiving others can be challenging, it's often even harder to forgive yourself. Remember that if you had known the path of your actions and their consequences, you probably would not have done what you did.

(a)  Take some time to live in the past and let go of those emotions. For example, if you had experienced a miscarriage, forgiveness would be easy.

(b)  Forgiveness in someone else can only serve to make it harder. [substeps] Cheating can be very emotional, and can even be worse.

(c)  To begin forgiving yourself, admit that you messed up or made a mistake. Making mistakes is part of being human and no one is exempt from it.

(d)  In the moment, forgive yourself for all the things you could have done differently. [substeps] Though you may have felt wronged, forgiving yourself for your actions will take time and effort to live.

Hint: the topic of the sentence is Family Life
target: to begin forgiving yourself, admit that you messed up or made a mistake. making mistakes is part of being human and no one is exempt from it.
LM output: forgiveness

input_text: How does this sentence end?
A doctor in a lab coat talks about the lenses too, while people are showing how to use them. another news anchor

(a)  talks about contacts lenses and how robotic they can be.

(b)  also talks about the same lenses and how it has become a dangerous trend among teenagers.

(c)  is interviewed about the incident.

(d)  talks about the lens and drink is an advertisement that the lens is called printed in a foreign language.

Hint: the topic of the sentence is Putting in contact lenses
target: also talks about the same lenses and how it has become a dangerous trend among teenagers.
LM output: contact, lens, put
```
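- The model generates short free text instead of one of the options (a)-(d), so an exact match against the full target ending almost never succeeds. I don't know yet how to handle multiple-choice evaluation.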
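
A common approach to multiple-choice evaluation is rank classification: score every answer option under the model and predict the highest-scoring one, instead of comparing generated text. The sketch below shows only the selection logic. `rank_classify` and `score_fn` are hypothetical names, and `toy_score` is a stand-in word-overlap scorer, not a real language model; with a seq2seq model, `score_fn` would presumably return the summed log-probability of the option's tokens given the input.

```python
def rank_classify(score_fn, input_text, choices, length_normalize=False):
    """Return the index of the highest-scoring choice.

    score_fn(input_text, choice) is a hypothetical callable that returns a
    log-likelihood-like score (higher is better).
    """
    if not choices:
        raise ValueError("choices must not be empty")
    scores = []
    for choice in choices:
        score = score_fn(input_text, choice)
        if length_normalize:
            # Optionally divide by word count so long options are not penalized.
            score /= max(len(choice.split()), 1)
        scores.append(score)
    return max(range(len(choices)), key=scores.__getitem__)


def toy_score(input_text, choice):
    """Stand-in scorer (NOT a real LM): minus the number of words unseen in the input."""
    context = set(input_text.lower().split())
    return -sum(word not in context for word in choice.lower().split())


choices = ["the cat sat", "a dog ran"]
predicted = rank_classify(toy_score, "the cat sat on the mat", choices)
print(choices[predicted])  # the cat sat
```

Accuracy is then the fraction of examples where the predicted index equals the index of the correct option, so outputs such as `forgiveness` can no longer fall outside the answer set.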

In [31]:
peak(df, 'hellaswag', 'all')

Unnamed: 0,model_name,dataset_name,subset_name,test_size,prompt_name,accuracy
65,bigscience/T0,hellaswag,all,200,Predict ending with hint,0.02
66,bigscience/T0,hellaswag,all,200,Randomized prompts template,0.03
67,bigscience/T0,hellaswag,all,200,complete_first_then,0.025
68,bigscience/T0,hellaswag,all,200,if_begins_how_continues,0.035


### `anli`
- Adversarial NLI eval set (three-way: correct / incorrect / inconclusive). Max accuracy is 0.445 for now. It is just a hard dataset.

```
input_text: "Be Right Back" is the first episode of the second series of British science fiction anthology series "Black Mirror". It was written by series creator and showrunner Charlie Brooker, directed by Owen Harris and first aired on Channel 4 on 11 February 2013. Using only the above description and what you know about the world, ""Be Right Back" has existed for over 6 years" is definitely correct, incorrect, or inconclusive?
target: correct
LM output: incorrect

input_text: Club Atlético Unión de Mar del Plata is an Argentine sports club from Mar del Plata, Buenos Aires Province. The club was founded on December 1, 1926, and its main sports are football and basketball. In football, Unión currently plays in the Torneo Argentino A, which is the regionalised third division of the Argentine football league system. Using only the above description and what you know about the world, "Club Atlético Unión de Mar del Plata has been around for 100 years" is definitely correct, incorrect, or inconclusive?
target: incorrect
LM output: correct

input_text: Bantiger TV Tower is a 196 metre tall tower used for FM- and TV-transmission at on the Bantiger mountain, a mountain east of Berne situated in the municipality of Bolligen. The Bantiger TV Tower was built between 1991 and 1996 as replacement of a 100 metres tall radio tower, built in 1954. Using only the above description and what you know about the world, "Bantiger TV Tower is used for AM-, FM-, and TV-transmissions.  " is definitely correct, incorrect, or inconclusive?
target: incorrect
LM output: correct
```


### `winogrande`
- Entity resolution: decide which of the two candidates the `_` stands for.

```
input_text: In the sentence below, does the _ stand for Patricia or Felicia?
Patricia decided to buy Felicia dinner because they had been through a lot and _ just inherited some money.
target: patricia
LM output: felicia

input_text: In the sentence below, does the _ stand for south or north?
The clothing in the north was warmer than the clothing in the south because there was more snow in the _ .
target: north
LM output: north

input_text: In the sentence below, does the _ stand for transporter or plane?
Timmy bought a transporter for his cat so he could take him on the plane but the _ was too small.
target: transporter
LM output: plane
```

In [35]:
peak(df, 'anli', 'all').sort_values(by='accuracy', ascending=True)

Unnamed: 0,model_name,dataset_name,subset_name,test_size,prompt_name,accuracy
25,bigscience/T0,anli,all,200,MNLI crowdsource,0.0
35,bigscience/T0,anli,all,200,always/sometimes/never,0.0
37,bigscience/T0,anli,all,200,consider always/sometimes/never,0.0
34,bigscience/T0,anli,all,200,guaranteed/possible/impossible,0.0
38,bigscience/T0,anli,all,200,claim true/false/inconclusive,0.035
31,bigscience/T0,anli,all,200,take the following as truth,0.07
39,bigscience/T0,anli,all,200,guaranteed true,0.395
26,bigscience/T0,anli,all,200,should assume,0.405
30,bigscience/T0,anli,all,200,justified in saying,0.41
29,bigscience/T0,anli,all,200,based on the previous passage,0.42


In [41]:
peak(df, 'winogrande', 'winogrande_xl').sort_values(by='accuracy', ascending=True)

Unnamed: 0,model_name,dataset_name,subset_name,test_size,prompt_name,accuracy
54,bigscience/T0,winogrande,winogrande_xl,200,Replace,0.55
53,bigscience/T0,winogrande,winogrande_xl,200,fill in the blank,0.575
51,bigscience/T0,winogrande,winogrande_xl,200,stand for,0.6
50,bigscience/T0,winogrande,winogrande_xl,200,does underscore refer to,0.62
52,bigscience/T0,winogrande,winogrande_xl,200,underscore refer to,0.63


### lab

In [10]:
checkpoint = 'bigscience/T0'
t2t = utils.build_t2t(checkpoint)

In [40]:
dataset_name = 'winogrande'
subset_name = 'winogrande_xl'
raw_dataset = utils.load_raw_dataset(dataset_name, subset_name)

[2023-01-13 03:03:53,182] [datasets.builder] [builder.py:785] Found cached dataset winogrande (/workspaces/seed/cache/hf_dataset/winogrande/winogrande_xl/1.1.0/a826c3d3506aefe0e9e9390dcb53271070536586bab95849876b2c1743df56e2)


In [42]:
prompt_name = 'stand for'
prompt = utils.get_prompt(dataset_name, subset_name, prompt_name)
input_text, target_text = utils.preprocess_dataset(raw_dataset, prompt, cutoff=10)

[2023-01-13 03:04:34,403] [datasets.arrow_dataset] [arrow_dataset.py:3930] Loading cached shuffled indices for dataset at /workspaces/seed/cache/hf_dataset/winogrande/winogrande_xl/1.1.0/a826c3d3506aefe0e9e9390dcb53271070536586bab95849876b2c1743df56e2/cache-5aca6c830a1dfa33.arrow


In [43]:
for i, t in zip(input_text, target_text):
    print('input_text:', i)
    print('target:', t)
    print('LM output:', t2t(i)[0])
    print()

input_text: In the sentence below, does the _ stand for Patricia or Felicia?
Patricia decided to buy Felicia dinner because they had been through a lot and _ just inherited some money.
target: patricia
LM output: felicia

input_text: In the sentence below, does the _ stand for south or north?
The clothing in the north was warmer than the clothing in the south because there was more snow in the _ .
target: north
LM output: north

input_text: In the sentence below, does the _ stand for transporter or plane?
Timmy bought a transporter for his cat so he could take him on the plane but the _ was too small.
target: transporter
LM output: plane

input_text: In the sentence below, does the _ stand for diner or food truck?
It was easier for the diner to follow their budget than the food truck because the _ had more money to spend.
target: diner
LM output: food truck

input_text: In the sentence below, does the _ stand for headphone or clock?
John could not hear his alarm clock when he was sleep