# Estimating surprisal from language models
Take sentences from CommitmentBank, MegaAttitudes, and the stimuli from our experiment, mask the attitude predicate, and get the model's predicted probability for the target verb in the masked position. Then compute the surprisal of the verb from that probability.
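
As a rough sketch of that last step: surprisal is the negative log of the probability the model assigns to the verb. The function name `surprisal` and the choice of base 2 (bits) are assumptions made here for illustration; the notebook itself does not fix a base.

```python
import math

def surprisal(probability):
    """Return the surprisal -log2(p), in bits, of an event with probability p."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(probability)

# a verb the model finds likely is less surprising than one it finds unlikely
print(surprisal(0.5))
print(surprisal(0.125))
```

The same function could later be applied to the score returned by the fill-mask pipeline.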

In [1]:
from transformers import pipeline
import pandas as pd
import numpy as np
import re
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

In [2]:
# This makes the display show more info
pd.set_option('display.max_rows', None)
pd.set_option('display.max_colwidth', None)

# Contents
1. [Read in the three datasets](#Read-in-the-three-datasets)
2. [Getting the correct verb token](#Getting-the-correct-verb-token)
    1. [Remaining cases](#Remaining-cases)
    2. [Proposed Solution](#Proposed-Solution)
        1. [Step 1. Create a new column with list of pos tagged verbs from Sentence](#Step-1.-Create-a-new-column-with-list-of-pos-tagged-verbs-from-Sentence)
        2. [Step 2. Lemmatize VerbList](#Step-2.-Lemmatize-VerbList)
        3. [TROUBLESHOOT NEEDED](#TROUBLESHOOT-NEEDED)
    3. [Combine the dataframes together again](#Combine-the-dataframes-together-again)
    4. [Mask out the VerbToken from Sentence](#Mask-out-the-VerbToken-from-Sentence)
3. [Masked language modeling to estimate surprisal](#Masked-language-modeling-to-estimate-surprisal)

# Read in the three datasets
- Subset the dfs to just the relevant columns: ID, Verb, Sentence
- Make sure that the column names are consistent across the three dfs

In [3]:
# CommitmentBank
# raw url: https://raw.githubusercontent.com/khuyen-le/projectivity-factors/master/data/CommitmentBank-All.csv
cb = pd.read_csv("../data/CommitmentBank-ALL.csv")[["uID","Verb","Target"]].drop_duplicates()
cb = cb.rename(columns={"Target": "Sentence","uID":"ID"})

In [4]:
# MegaVeridicality
# raw URL: https://raw.githubusercontent.com/khuyen-le/projectivity-factors/master/data/mega-veridicality-v2.csv
mv = pd.read_csv("../data/mega-veridicality-v2.csv")[["verb","frame","voice","sentence"]].drop_duplicates()
mv = mv.rename(columns={"verb": "Verb", "sentence":"Sentence"})
mv["ID"] = mv[['frame', 'voice']].apply(lambda x: '_'.join(x), axis=1)
mv = mv.drop(columns=["frame","voice"])

In [5]:
# Arousal/Valence Study
# raw URL: https://raw.githubusercontent.com/khuyen-le/projectivity-factors/master/data/1_sliderprojection/exp1_test-trials.csv
vs = pd.read_csv("../data/1_sliderprojection/exp1_test-trials.csv")[["Word","utterance","exp"]]
vs = vs[vs["exp"]=="stim"].drop_duplicates().drop(columns=["exp"])
vs = vs.rename(columns={"Word": "Verb","utterance":"Sentence"})
vs["ID"] = ""

In [6]:
# Combine them together into one df
df = pd.concat([cb,mv,vs])

In [7]:
df.head()

Unnamed: 0,ID,Verb,Sentence
0,BNC-1,admit,They were still close enough to shore for him to return her to the police if she admitted she was not an experienced ocean sailor.
9,BNC-1002,say,Indeed it could be said that they had prospered.
17,BNC-1003,say,He might have said to her that some time in the middle of the nineteenth century a cult had grown up around the idea of the home.
29,BNC-1005,say,Of course she could say it was for the children as people always did... It was true up to a point.
37,BNC-1006,say,Robyn swallowed and took a deep breath trying to compose herself so that when he returned she could say that it was all right she felt fine now.


# Getting the correct verb token
What we need to do is mask out the correct verb in each sentence. The correct verb is in the Verb column, so in principle we can use apply() with str.replace() to swap the verb for [MASK]. The problem is that the verbs in the sentences are inflected tokens, while the verbs in Verb are lemmas.


For some of the verbs this is not a problem, because the verb lemma is a prefix of the verb token (e.g. admit/admitted).


Solution:
1. Create a new verb token column
2. Use a regex with f-string interpolation to find the token; this works in cases where the lemma is a prefix of the token

In [None]:
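# Frustratingly, this isn't working: the f-string interpolates the string
# representation of the whole Verb column into one pattern, instead of
# using each row's own verb.
# df["VerbToken"] = df['Sentence'].str.extract(fr'({df["Verb"]}\w*)')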

Find a match in the Sentence column for the verb from the Verb column using a regex. re.search() returns a match object, so you have to call .group() to get the matched string. When there is no match, re.search() returns None, and you can't call .group() on that.

In [8]:
df["VerbToken"] = df.apply(lambda x: re.search(fr'({x["Verb"]}\w*)',x['Sentence']), axis=1)

# In some cases there is nothing captured, it returns a NoneType and causes the code to fail
# because NoneType has no method .group()
df["VerbToken"] = df["VerbToken"].apply(lambda x: x.group() if x is not None else x)
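
To see what this matching step does and does not catch, here is a small standalone sketch on two sentences from the CommitmentBank rows shown above. The helper `find_verb_token` is hypothetical. Unlike the cell above, it also adds `re.escape` and a leading `\b`; that is an assumption about what is wanted (it keeps `say` from matching inside a word like `essay`), not something the notebook currently does.

```python
import re

def find_verb_token(verb, sentence):
    """Return the first word in `sentence` that starts with `verb`, or None."""
    match = re.search(rf"\b{re.escape(verb)}\w*", sentence)
    return match.group() if match is not None else None

# regular inflection: the lemma is a prefix of the token, so a token is found
print(find_verb_token("admit", "if she admitted she was not an experienced ocean sailor."))

# irregular inflection: "said" does not start with "say", so nothing is found
print(find_verb_token("say", "Indeed it could be said that they had prospered."))
```

Rows of the second kind are the ones that end up with an empty VerbToken.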


## Remaining cases

In [11]:
# cases where the above solution did not work
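# take an explicit copy so that adding columns to `empty` later
# does not trigger pandas' SettingWithCopyWarning
empty = df[df["VerbToken"].isnull()].copy()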

In [12]:
len(empty)/len(df)*100

9.060509554140127

## Proposed Solution
Overall idea: lemmatize Sentence and find the verb lemma that matches the entry in the Verb column. What we ultimately need, though, is the inflected verb token rather than the lemma, because replacing the verb in Sentence with [MASK] via str.replace() requires the exact string that occurs in the sentence.

More concrete:
1. Make a new column with POS tag verbs from Sentence
2. Lemmatize the verbs from the new column
3. Here there be dragons

In [None]:
# code from: https://gaurav5430.medium.com/using-nltk-for-lemmatizing-sentences-c1bfff963258

# initialize the lemmatizer
lemmatizer = WordNetLemmatizer()

# function to convert nltk tag to wordnet tag
def nltk_tag_to_wordnet_tag(nltk_tag):
    if nltk_tag.startswith('J'):
        return wordnet.ADJ
    elif nltk_tag.startswith('V'):
        return wordnet.VERB
    elif nltk_tag.startswith('N'):
        return wordnet.NOUN
    elif nltk_tag.startswith('R'):
        return wordnet.ADV
    else:          
        return None

# def lemmatize_sentence(sentence):
#     #tokenize the sentence and find the POS tag for each token
#     nltk_tagged = nltk.pos_tag(nltk.word_tokenize(sentence))  
#     #tuple of (token, wordnet_tag)
#     wordnet_tagged = map(lambda x: (x[0], nltk_tag_to_wordnet_tag(x[1])), nltk_tagged)
#     lemmatized_sentence = []
#     for word, tag in wordnet_tagged:
#         if tag is None:
#             #if there is no available tag, append the token as is
#             lemmatized_sentence.append(word)
#         else:        
#             #else use the tag to lemmatize the token
#             lemmatized_sentence.append(lemmatizer.lemmatize(word, tag))
#     return " ".join(lemmatized_sentence)

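Failed attempt to do it all in one function. It fails because `break` leaves the loop at the first verb whose lemma does not match (often an auxiliary such as "have" or "be"), so the target verb is never reached; `continue` would do what the comment describes.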

```
def lemmatize_verb_from_sentence(sentence,verb):
    #tokenize the sentence and find the POS tag for each token
    nltk_tagged = nltk.pos_tag(nltk.word_tokenize(sentence))  
    #tuple of (token, wordnet_tag)
    wordnet_tagged = map(lambda x: (x[0], nltk_tag_to_wordnet_tag(x[1])), nltk_tagged)
    lw = []
#     for i in range(0,len(empty)-1):
#     for v in empty["Verb"].values:
#         verb_from_empty = empty["Verb"].values[i]
    for word, tag in wordnet_tagged:
        if tag is None:
            continue
        elif tag != 'v':
            continue
        else:
            lemma = lemmatizer.lemmatize(word, tag)
            if lemma != verb:
                # Go to the next word/tag pair to find the relevant verb
                break
            elif lemma == verb:
                print("{verb}: {word} {lemma}".format(verb=verb,word=word,lemma=lemma))
                lw.append(word)
                lw.append(lemma)
#     print(lw)
    return ' '.join(lw)
```

### Step 1. Create a new column with list of pos tagged verbs from Sentence

In [11]:
def get_verb(sentence):
    nltk_tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    verbs = []
    for i in nltk_tagged:
        if 'VB' in i[1]:
            verbs.append(i)
    return verbs

In [12]:
empty["VerbList"] = empty["Sentence"].apply(lambda x: get_verb(x))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  empty["VerbList"] = empty["Sentence"].apply(lambda x: get_verb(x))


In [13]:
# this is possibly not necessary
# empty["VerbTagged"] = empty["Verb"].apply(lambda x: nltk.pos_tag([x]))



### Step 2. Lemmatize VerbList

In [14]:
def lemmatize_from_nltk_tagged_list(nltk_tagged):
    #tuple of (token, wordnet_tag)
    wordnet_tagged = map(lambda x: (x[0], nltk_tag_to_wordnet_tag(x[1])), nltk_tagged)
    lemmatized_sentence = []
    for word, tag in wordnet_tagged:
        if tag is None:
            #if there is no available tag, append the token as is
            lemmatized_sentence.append(word)
        else:        
            #else use the tag to lemmatize the token
            lemmatized_sentence.append(lemmatizer.lemmatize(word, tag))
    return lemmatized_sentence

In [15]:
empty["VerbListLemmatized"] = empty["VerbList"].apply(lambda x: lemmatize_from_nltk_tagged_list(x))



In [16]:
empty.head()

Unnamed: 0,ID,Verb,Sentence,VerbToken,VerbList,VerbTagged,VerbListLemmatized
9,BNC-1002,say,Indeed it could be said that they had prospered.,,"[(be, VB), (said, VBD), (had, VBD), (prospered, VBN)]","[(say, VB)]","[be, say, have, prosper]"
17,BNC-1003,say,He might have said to her that some time in the middle of the nineteenth century a cult had grown up around the idea of the home.,,"[(have, VB), (said, VBD), (had, VBD), (grown, VBN)]","[(say, VB)]","[have, say, have, grow]"
575,BNC-1145,tell,She could also have told this was Tina's mother before Mrs Darne went off down the passage that led to the Headmaster's Flat.,,"[(have, VB), (told, VBN), (was, VBD), (went, VBD), (led, VBD)]","[(tell, NN)]","[have, tell, be, go, lead]"
716,BNC-1187,think,They may have thought they were putting it out of its misery - a lifetime beautifying the lorry-route to the A1.,,"[(have, VB), (thought, VBN), (were, VBD), (putting, VBG), (beautifying, VBG)]","[(think, NN)]","[have, think, be, put, beautify]"
733,BNC-1194,think,Perhaps he thought that her own wishes would hardly be considered in the matter.,,"[(thought, VBD), (be, VB), (considered, VBN)]","[(think, NN)]","[think, be, consider]"


In [17]:
empty["VerbListLemmatizedTagged"] = empty["VerbListLemmatized"].apply(lambda x: nltk.pos_tag(x))



In [408]:
empty.head()

Unnamed: 0,ID,Verb,Sentence,VerbToken,VerbList,VerbTagged,VerbListLemmatized,VerbListLemmatizedTagged
9,BNC-1002,say,Indeed it could be said that they had prospered.,,"[(be, VB), (said, VBD), (had, VBD), (prospered, VBN)]","[(say, VB)]","[be, say, have, prosper]","[(be, VB), (say, VBN), (have, VBP), (prosper, NN)]"
17,BNC-1003,say,He might have said to her that some time in the middle of the nineteenth century a cult had grown up around the idea of the home.,,"[(have, VB), (said, VBD), (had, VBD), (grown, VBN)]","[(say, VB)]","[have, say, have, grow]","[(have, VBP), (say, VBN), (have, VBP), (grow, NNS)]"
575,BNC-1145,tell,She could also have told this was Tina's mother before Mrs Darne went off down the passage that led to the Headmaster's Flat.,,"[(have, VB), (told, VBN), (was, VBD), (went, VBD), (led, VBD)]","[(tell, NN)]","[have, tell, be, go, lead]","[(have, VB), (tell, NN), (be, VB), (go, VBN), (lead, JJ)]"
716,BNC-1187,think,They may have thought they were putting it out of its misery - a lifetime beautifying the lorry-route to the A1.,,"[(have, VB), (thought, VBN), (were, VBD), (putting, VBG), (beautifying, VBG)]","[(think, NN)]","[have, think, be, put, beautify]","[(have, VB), (think, NN), (be, VB), (put, VBN), (beautify, VB)]"
733,BNC-1194,think,Perhaps he thought that her own wishes would hardly be considered in the matter.,,"[(thought, VBD), (be, VB), (considered, VBN)]","[(think, NN)]","[think, be, consider]","[(think, NN), (be, VB), (consider, JJR)]"


### TROUBLESHOOT NEEDED
What I want to do is:
1. Pair the elements of the two lists so that each token sits next to its corresponding lemma (zip() should be able to do that; I think we would get a list of pairs).
2. Search those pairs for the one whose lemma matches the verb in the Verb column. The match gives us the correct (token, lemma) pair.
3. Index into that pair to get the VerbToken that corresponds to the verb lemma in the Verb column.
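
One way these three steps could look, as a minimal sketch in plain Python. The helper `token_for_lemma` is hypothetical, and the two lists are copied from the first row of `empty` above with the POS tags dropped. It pairs tokens with lemmas via `zip()` and returns the token whose lemma equals the target verb.

```python
def token_for_lemma(tokens, lemmas, verb):
    """Return the first token whose lemma equals `verb`, or None if there is none."""
    for token, lemma in zip(tokens, lemmas):
        if lemma == verb:
            return token
    return None

# first row of `empty`, where Verb == "say"
tokens = ["be", "said", "had", "prospered"]  # from VerbList, tags dropped
lemmas = ["be", "say", "have", "prosper"]    # from VerbListLemmatized

print(token_for_lemma(tokens, lemmas, "say"))
```

This relies on the token list and the lemma list being aligned position by position, which holds here because the lemmas were produced from the token list in order.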

In [28]:
l = [[x,y] for x,y in zip(list(empty["VerbList"]),list(empty["VerbListLemmatizedTagged"]))]

In [31]:
l[0][1]

[('be', 'VB'), ('say', 'VBN'), ('have', 'VBP'), ('prosper', 'NN')]

In [32]:
l[0][0]

[('be', 'VB'), ('said', 'VBD'), ('had', 'VBD'), ('prospered', 'VBN')]

In [424]:
empty["Grouped"].values[2][1][i][1]
empty["VerbListLemmatizedTagged"].values[2][1][i][1]

'NN'

In [None]:
empty["Grouped2"] = empty.Grouped[]

In [426]:
len(empty)

569

In [None]:

for x, y in zip(xs, ys):
    print(x, y)


In [None]:

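def search_two_cols(col1, col2):
    # For each row, keep the value from col1 if it occurs in the list
    # stored in col2 (e.g. Verb vs. VerbListLemmatized), else None.
    matches = []
    for x, y in zip(col1, col2):
        if x in y:
            matches.append(x)
        else:
            matches.append(None)
    return matches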

In [397]:
empty.head()

Unnamed: 0,ID,Verb,Sentence,VerbToken,VerbList,VerbTagged
9,BNC-1002,say,Indeed it could be said that they had prospered.,,"[(be, VB), (said, VBD), (had, VBD), (prospered, VBN)]","[(say, VB)]"
17,BNC-1003,say,He might have said to her that some time in the middle of the nineteenth century a cult had grown up around the idea of the home.,,"[(have, VB), (said, VBD), (had, VBD), (grown, VBN)]","[(say, VB)]"
575,BNC-1145,tell,She could also have told this was Tina's mother before Mrs Darne went off down the passage that led to the Headmaster's Flat.,,"[(have, VB), (told, VBN), (was, VBD), (went, VBD), (led, VBD)]","[(tell, NN)]"
716,BNC-1187,think,They may have thought they were putting it out of its misery - a lifetime beautifying the lorry-route to the A1.,,"[(have, VB), (thought, VBN), (were, VBD), (putting, VBG), (beautifying, VBG)]","[(think, NN)]"
733,BNC-1194,think,Perhaps he thought that her own wishes would hardly be considered in the matter.,,"[(thought, VBD), (be, VB), (considered, VBN)]","[(think, NN)]"


In [None]:
def lemmatize_verb_from_list(sentence,verb):
    # tokenize the sentence and find the POS tag for each token
    nltk_tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    # list of (token, wordnet_tag) tuples
    wordnet_tagged = [(tok, nltk_tag_to_wordnet_tag(tag)) for tok, tag in nltk_tagged]
    lw = []
    for word, tag in wordnet_tagged:
        # skip every token that is not tagged as a verb
        if tag != wordnet.VERB:
            continue
        lemma = lemmatizer.lemmatize(word, tag)
        if lemma != verb:
            # Go to the next word/tag pair to find the relevant verb
            # (the earlier version used `break` here, which stopped the
            # search at the first non-matching verb)
            continue
        print("{verb}: {word} {lemma}".format(verb=verb,word=word,lemma=lemma))
        lw.append(word)
        lw.append(lemma)
    return ' '.join(lw)

## Combine the dataframes together again

## Mask out the VerbToken from Sentence

- Once the issues above are worked out, then the rest of this should be pretty straightforward.

- For discussion about which model is best, check out the following twitter thread: https://twitter.com/bruno_nicenboim/status/1379168059311656963

- Probably we should use GPT3, not BERT.

In [357]:
df["Masked"] = df.apply(lambda x: x['Sentence'].replace(x["VerbToken"],"[MASK]"),axis=1)
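
A note of caution on the cell above: `str.replace` swaps every occurrence of the token, including occurrences inside longer words, and it raises a `TypeError` for rows where VerbToken is still `None`. The sketch below is one possible alternative, not the notebook's method; `mask_token` is a hypothetical helper that masks only the first whole-word occurrence.

```python
import re

def mask_token(sentence, token, mask="[MASK]"):
    """Replace the first whole-word occurrence of `token` in `sentence` with `mask`."""
    if token is None:
        return None  # no verb token was found for this row
    pattern = rf"\b{re.escape(token)}\b"
    return re.sub(pattern, mask, sentence, count=1)

print(mask_token("Perhaps he thought that her own wishes would hardly be considered.", "thought"))
```

Masking only the first occurrence is itself an assumption; a sentence that contains the target verb twice would need a decision about which occurrence is the attitude predicate.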

# Masked language modeling to estimate surprisal

- Info on fill-mask pipeline: https://huggingface.co/transformers/main_classes/pipelines.html#transformers.FillMaskPipeline
- Info on particular models: https://huggingface.co/models


In [5]:
unmasker = pipeline('fill-mask', model='bert-large-uncased')

Some weights of the model checkpoint at bert-large-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [13]:
# using indexing and key to get the relevant output score
unmasker("Dana was [MASK] that Mars has no water.",targets="surprised")[0]['score']

NameError: name 'unmasker' is not defined

In [36]:
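def mlm_over_df(input_df):
    # Score every row with the fill-mask pipeline and store the probability
    # the model assigns to the target verb in the masked position.
    # Column names are assumed to follow df above: "Masked" holds the
    # sentence with [MASK], "VerbToken" the inflected verb that was masked.
    scores = []
    for _, row in input_df.iterrows():
        sentence = row["Masked"]
        verb = row["VerbToken"]
        mask_fill = unmasker(sentence, targets=verb)
        scores.append(mask_fill[0]['score'])
    input_df["mlm_score"] = scores
    return input_df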

In [None]:
df["Masked"]