# Pre-Processing data for language modeling
Take sentences from CommitmentBank, MegaVeridicality, and the stimuli from our experiment, mask the attitude predicate, and get the predicted probability of the target verb in the masked position. Then compute the verb's surprisal from that probability.
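
The surprisal step can be sketched as below. This is a minimal illustration rather than the downstream code: the helper name `surprisal_bits` is hypothetical, and the base-2 logarithm (surprisal in bits) is an assumption, since the text does not say which base is used.

```python
import math


def surprisal_bits(probability: float) -> float:
    """Return the surprisal, -log2(p), of an event with probability p."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(probability)


# A verb predicted with probability 0.25 carries 2 bits of surprisal.
print(surprisal_bits(0.25))  # 2.0
```

Less probable verbs get higher surprisal, so a verb the model finds unlikely in the masked slot ends up with a large value.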

In [1]:
from transformers import pipeline
import pandas as pd
import numpy as np
import re
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

In [2]:
# This makes the display show more info
pd.set_option('display.max_rows', None)
pd.set_option('display.max_colwidth', None)

# Contents
1. [Read in the three datasets](#Read-in-the-three-datasets)
2. [Masking out the correct verb](#Masking-out-the-correct-verb)
    1. [Remaining cases](#Remaining-cases)
    2. [Proposed Solution](#Proposed-Solution)
        1. [Step 1. Create a new column with list of pos tagged verbs from Sentence](#Step-1.-Create-a-new-column-with-list-of-pos-tagged-verbs-from-Sentence)
        2. [Step 2. Lemmatize VerbList](#Step-2.-Lemmatize-VerbList)
        3. [TROUBLESHOOT NEEDED](#TROUBLESHOOT-NEEDED)
    3. [Combine the dataframes together again](#Combine-the-dataframes-together-again)
    4. [Mask out the VerbToken from Sentence](#Mask-out-the-VerbToken-from-Sentence)
3. [Masked language modeling to estimate surprisal](#Masked-language-modeling-to-estimate-surprisal)

# Read in the three datasets
- Subset the dfs to just the relevant columns: ID, Verb, Sentence
- Make sure that the column names are consistent across the three dfs

In [3]:
# CommitmentBank
# raw url: https://raw.githubusercontent.com/khuyen-le/projectivity-factors/master/data/CommitmentBank-All.csv
cb = pd.read_csv("../../data/CommitmentBank-ALL.csv")[["uID","Verb","Target"]].drop_duplicates()
cb = cb.rename(columns={"Target": "Sentence","uID":"ID"})
len(cb)

1200

In [4]:
# MegaVeridicality
# raw URL: https://raw.githubusercontent.com/khuyen-le/projectivity-factors/master/data/mega-veridicality-v2.csv
mv = pd.read_csv("../../data/mega-veridicality-v2.csv")[["verb","frame","voice","sentence"]].drop_duplicates()
mv = mv.rename(columns={"verb": "Verb", "sentence":"Sentence"})
mv["ID"] = mv[['frame', 'voice']].apply(lambda x: '_'.join(x), axis=1)
mv = mv.drop(columns=["frame","voice"])
len(mv)

5026

In [5]:
# Arousal/Valence Study
# raw URL: https://raw.githubusercontent.com/khuyen-le/projectivity-factors/master/data/1_sliderprojection/exp1_test-trials.csv
vs = pd.read_csv("../../data/1_sliderprojection/exp1_test-trials.csv")[["Word","utterance","exp"]]
vs = vs[vs["exp"]=="stim"].drop_duplicates().drop(columns={"exp"})
vs = vs.rename(columns={"Word": "Verb","utterance":"Sentence"})
vs["ID"] = "projection"
len(vs)

54

In [6]:
# Combine them together into one df
df = pd.concat([cb,mv,vs])

In [7]:
1200 + 5026 + 54

6280

In [8]:
len(df)

6280

# Getting the correct verb token
What we need to do is mask out the correct verb in each of the sentences. We have the correct verb in the Verb column. We can easily use apply() with str.replace() to switch the verb with [MASK]. The problem is that the verbs in the sentences are inflected tokens, while the verbs in Verb are lemmatized.


For some of the verbs, we don't need to worry about this problem because there is morphological overlap between the Verb Token and the Verb Lemma. 


Solution:
1. Create a new verb token column
2. Fill it with a regex match built from the Verb via f-string interpolation; this works whenever the inflected token starts with the lemma

In [9]:
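# Frustratingly, this isn't working: the f-string interpolates the whole Verb Series
# into a single pattern instead of building one pattern per row
# df["VerbToken"] = df['Sentence'].str.extract(fr'({df["Verb"]}\w*)')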

Find a match in the Sentence column for the verb from the Verb column using a regex. re.search() returns a match object, so you have to call .group() to get the matched string. When there is no match, re.search() returns None, and you can't call .group() on that.

In [10]:
df["Token"] = df.apply(lambda x: re.search(fr'({x["Verb"]}\w*)',x['Sentence']), axis=1)

# In some cases there is nothing captured, it returns a NoneType and causes the code to fail
# because NoneType has no method .group()
df["Token"] = df["Token"].apply(lambda x: x.group() if x is not None else x)
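
The match-or-None logic above can be sketched as a standalone function. This is a variation under stated assumptions, not the code that produced the counts in this notebook: the helper name `find_verb_token` is hypothetical, and the `re.escape` call and the `\b` word boundary are additions, so running it on the data could give different match counts.

```python
import re
from typing import Optional


def find_verb_token(verb: str, sentence: str) -> Optional[str]:
    """Return the first word in `sentence` that starts with `verb`, or None."""
    # re.escape guards against regex metacharacters in the lemma;
    # \b anchors the match at the start of a word.
    match = re.search(rf"\b{re.escape(verb)}\w*", sentence)
    return match.group() if match is not None else None


print(find_verb_token("believe", "Mary believed that it rained."))  # believed
print(find_verb_token("think", "Mary thought that it rained."))     # None
```

The second call shows the problem case: "thought" does not start with "think", so irregular forms need the separate handling further down.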

## Mask out the VerbToken

In [11]:
nonempty = df[~df["Token"].isnull()].copy()  # explicit copy, so adding a column is safe

In [12]:
nonempty["Masked"] = nonempty.apply(lambda x: x['Sentence'].replace(x["Token"],"[MASK]"),axis=1)
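
Note that `str.replace` swaps every occurrence of the token, including occurrences inside longer words. A stricter variant is sketched below. It assumes the target predicate is the first whole-word occurrence of the token, which the data may not always satisfy; `mask_token` is a hypothetical helper, not part of this notebook.

```python
import re


def mask_token(sentence: str, token: str, mask: str = "[MASK]") -> str:
    """Replace the first whole-word occurrence of `token` with `mask`."""
    pattern = rf"\b{re.escape(token)}\b"
    # A function replacement inserts `mask` literally, with no escape processing.
    return re.sub(pattern, lambda _match: mask, sentence, count=1)


print(mask_token("He knew that she knew the answer.", "knew"))
# He [MASK] that she knew the answer.
```

Keeping exactly one mask per sentence matters for the later modeling step if it expects a single masked position.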


In [13]:
len(nonempty)

5711

In [14]:
5711+569

6280

In [15]:
len(df)

6280

# Remaining cases

In [16]:
# cases where the above solution did not work
empty = df[df["Token"].isnull()]
len(empty)

569

In [17]:
len(empty)/len(df)*100

9.060509554140127

## Proposed Solution
Overarching idea: lemmatize the verbs in Sentence and find the lemma that matches the row's Verb. We still need the inflected token rather than the lemma: to replace the correct verb in Sentence with [MASK], str.replace() needs the exact string that occurs in the sentence.

More concrete:
1. Make a new column with POS tag verbs from Sentence
2. Lemmatize the verbs from the new column
3. Here there be dragons

### Step 1. Create a new column with list of pos tagged verbs from Sentence

In [18]:
def get_verb(sentence):
    nltk_tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    verbs = []
    for i in nltk_tagged:
        if 'VB' in i[1]:
            verbs.append(i)
    return verbs
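
The filtering inside `get_verb` can be tried without the NLTK models by feeding it pre-tagged tokens. The sketch below uses hand-written Penn Treebank tags for illustration only (real tags come from `nltk.pos_tag`), and `keep_verbs` is a hypothetical name.

```python
from typing import List, Tuple

TaggedToken = Tuple[str, str]


def keep_verbs(tagged: List[TaggedToken]) -> List[TaggedToken]:
    """Keep the (token, tag) pairs whose tag starts with 'VB'."""
    return [(token, tag) for token, tag in tagged if tag.startswith("VB")]


# Hand-written tags for illustration; not actual tagger output.
example = [("Mary", "NNP"), ("felt", "VBD"), ("that", "IN"),
           ("it", "PRP"), ("rained", "VBD"), (".", ".")]
print(keep_verbs(example))  # [('felt', 'VBD'), ('rained', 'VBD')]
```

If the tagger labels the target as something other than a verb (for instance a noun tag on "thought"), the token never reaches the list, and that row cannot be matched in the later steps.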

In [19]:
# assign() returns a new DataFrame, which avoids the SettingWithCopyWarning
empty = empty.assign(VerbList=empty["Sentence"].apply(get_verb))


### Step 2. Lemmatize VerbList

In [20]:
# code from: https://gaurav5430.medium.com/using-nltk-for-lemmatizing-sentences-c1bfff963258

# initialize the lemmatizer
lemmatizer = WordNetLemmatizer()

# function to convert nltk tag to wordnet tag
def nltk_tag_to_wordnet_tag(nltk_tag):
    if nltk_tag.startswith('J'):
        return wordnet.ADJ
    elif nltk_tag.startswith('V'):
        return wordnet.VERB
    elif nltk_tag.startswith('N'):
        return wordnet.NOUN
    elif nltk_tag.startswith('R'):
        return wordnet.ADV
    else:          
        return None

def lemmatize_from_nltk_tagged_list(nltk_tagged):
    #tuple of (token, wordnet_tag)
    wordnet_tagged = map(lambda x: (x[0], nltk_tag_to_wordnet_tag(x[1])), nltk_tagged)
    lemmatized_sentence = []
    for word, tag in wordnet_tagged:
        if tag is None:
            #if there is no available tag, append the token as is
            lemmatized_sentence.append(word)
        else:        
            #else use the tag to lemmatize the token
            lemmatized_sentence.append(lemmatizer.lemmatize(word, tag))
    return lemmatized_sentence

# assign() returns a new DataFrame, which avoids the SettingWithCopyWarning
empty = empty.assign(VerbListLemmatized=empty["VerbList"].apply(lemmatize_from_nltk_tagged_list))


### Pull just the Verb Token and Lemma from VerbListLemmatized

In [21]:
# solution by brandon papineau
check_list = []
iter_list = []
for index,row in empty.iterrows():
    inner_list = []
    if row["Verb"] not in check_list:
        check_list.append(row["Verb"])
    for i in row["VerbListLemmatized"]:
        if i in check_list:
            lemma = i
            locator = row["VerbListLemmatized"].index(i)
            tagged = row["VerbList"][locator]
            inner_list.append([lemma,tagged])
    iter_list.append(inner_list)
# assign() returns a new DataFrame, which avoids the SettingWithCopyWarning
empty = empty.assign(LemmaTokenPair=iter_list)


In [22]:
len(empty)

569
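
One thing to watch in the loop above: `check_list` grows across rows, so as written a row can also match the lemma of a verb that was added by an earlier row. A row-local variant compares each lemma only against that row's own Verb. The sketch below is an assumption about the intended behavior, not the code behind the counts in this notebook, and `pair_lemma_with_token` is a hypothetical helper.

```python
from typing import List, Tuple

TaggedToken = Tuple[str, str]


def pair_lemma_with_token(
    verb: str,
    tagged_verbs: List[TaggedToken],
    lemmas: List[str],
) -> List[Tuple[str, TaggedToken]]:
    """Pair each lemma equal to `verb` with the tagged token it came from."""
    # zip keeps lemma and token aligned by position, so a repeated lemma is
    # paired with its own token rather than with the first occurrence.
    return [(lemma, tagged)
            for lemma, tagged in zip(lemmas, tagged_verbs)
            if lemma == verb]


tagged = [("felt", "VBD"), ("rained", "VBD")]
lemmas = ["feel", "rain"]
print(pair_lemma_with_token("feel", tagged, lemmas))
# [('feel', ('felt', 'VBD'))]
```

On a DataFrame this could be applied per row with `apply(..., axis=1)`; doing so would likely change the number of successful matches reported below.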

# Separate out the successful cases

In [23]:
good = empty[empty.astype(str)["LemmaTokenPair"] != "[]"]
len(good)

425

In [24]:
# First pair -> tagged (token, tag) tuple -> token string
good = good.assign(Token=good['LemmaTokenPair'].apply(lambda x: x[0][1][0]))
len(good)

425

### Mask out the verb token

In [25]:
good = good.assign(Masked=good.apply(lambda x: x['Sentence'].replace(x["Token"],"[MASK]"),axis=1))
len(good)

425

# Separate out the unsuccessful cases

Several cases aren't getting caught because:
1. The word isn't correctly tagged as a verb, so not ending up in VerbList in the first place
    - example: 'thought' in item 901
2. The word isn't lemmatized correctly, so the match with Verb isn't happening
    - examples: 'felt'
3. Orthographic differences/errors
    - examples: 'realize'/'realise', 'facinate' / 'fascinate'
4. The Verb has a particle (the majority of cases)
    - example: 'flip_out' vs. 'flip'

In [26]:
missing = empty[empty.astype(str)["LemmaTokenPair"] == "[]"]
len(missing)

144

## Look at cases of Type 4
Solution: same as before, but split the Verb on the underscore and match on the first part.
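
The split on the underscore can be sketched in plain Python. `head_of_particle_verb` is a hypothetical helper; the cells below do the same thing column-wise with `str.split("_", expand=True)`.

```python
def head_of_particle_verb(verb: str) -> str:
    """Return the part of a particle verb before the first underscore."""
    return verb.split("_", 1)[0]


print(head_of_particle_verb("flip_out"))  # flip
print(head_of_particle_verb("believe"))   # believe
```

Only the head is matched and masked this way, so the particle (here "out") stays in the masked sentence.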

In [27]:
type_4 = missing.loc[missing["Verb"].str.contains("_")].drop(columns={"LemmaTokenPair"})
len(type_4)

101

In [28]:
type_4a = type_4["Verb"].str.split("_",expand=True)
type_4 = type_4a.merge(type_4, left_index = True, right_index = True).rename(columns={0:"VerbSplit"})

In [29]:
# run BPap's code again
check_list = []
iter_list = []
for index,row in type_4.iterrows():
    inner_list = []
    if row["VerbSplit"] not in check_list: # search for check on the result of splitting the Verb column
        check_list.append(row["VerbSplit"])
    for i in row["VerbListLemmatized"]:
        if i in check_list:
            lemma = i
            locator = row["VerbListLemmatized"].index(i)
            tagged = row["VerbList"][locator]
            inner_list.append([lemma,tagged])
    iter_list.append(inner_list)
type_4["LemmaTokenPair"] = iter_list

In [30]:
# Split up the LemmaTokenPair column into two columns
type_4 = type_4.LemmaTokenPair.apply(pd.Series).merge(type_4, left_index = True, right_index = True)
len(type_4)

101

## Cases where that worked

In [31]:
t4_nonnull = type_4.loc[~type_4[0].isnull()]
len(t4_nonnull)

82

In [32]:
t4_nonnull = t4_nonnull[0].apply(pd.Series).merge(t4_nonnull, left_index = True, right_index = True)
t4_nonnull = t4_nonnull.rename(columns={"0_x":"Lemma"})
t4_nonnull["Token"] = t4_nonnull["1_x"].apply(lambda x: x[0])

### Mask out 

In [33]:
t4_nonnull["Masked"] = t4_nonnull.apply(lambda x: x['Sentence'].replace(x["Token"],"[MASK]"),axis=1)

In [34]:
len(t4_nonnull)

82

## Missed cases

In [35]:
t4_null = type_4.loc[type_4[0].isnull()]
len(t4_null)

19

In [36]:
t4_null = t4_null.drop(columns=[0])
len(t4_null)

19

In [37]:
t4_null["Token"] = t4_null.apply(lambda x: re.search(fr'({x["VerbSplit"]}\w*)',x['Sentence']), axis=1)
t4_null["Token"] = t4_null["Token"].apply(lambda x: x.group() if x is not None else x)

### Mask Out

In [38]:
t4_null["Masked"] = t4_null.apply(lambda x: x['Sentence'].replace(x["Token"],"[MASK]"),axis=1)

In [39]:
len(t4_null)

19

## Everything else

In [40]:
every_else = missing.loc[~missing["Verb"].str.contains("_")].drop(columns={"LemmaTokenPair"})
len(every_else)

43

In [41]:
every_else = every_else[["ID","Verb","Sentence"]]
every_else["Token"] = ""

In [42]:
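# orthographic differences/errors: align Verb with the spelling used in Sentence
# (.loc with a row mask and a column label avoids chained assignment)
every_else.loc[every_else["Verb"]=="facinate", "Verb"] = "fascinate"
every_else.loc[every_else["Verb"]=="realize", "Verb"] = "realise"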

every_else["Token"] = every_else.apply(lambda x: re.search(fr'({x["Verb"]}\w*)',x['Sentence']), axis=1)
every_else["Token"] = every_else["Token"].apply(lambda x: x.group() if x is not None else x)

In [43]:
# Hard-code the surface token for the irregular forms that the regex cannot recover
irregular_tokens = {
    "understand": "understood",
    "feel": "felt",
    "think": "thought",
    "hope": "hoping",
    "see": "saw",
    "spellbind": "spellbound",
    "sing": "sung",
    "swear": "swore",
    "bear": "borne",
    "choose": "chosen",
    "undertake": "undertook",
    "uphold": "upheld",
    "satisfy": "satisfied",
    "teach": "taught",
    "foretell": "foretold",
    "curse": "curst",
    "send": "sent",
    "weep": "wept",
    # NOTE: if Sentence contains "fought", this spelling will not match and the row stays unmasked
    "fight": "faught",
    "forbid": "forbade",
}
# .loc with a row mask and a column label avoids chained assignment
for verb, token in irregular_tokens.items():
    every_else.loc[every_else["Verb"]==verb, "Token"] = token
len(every_else)

43

### MASK OUT

In [44]:
every_else["Masked"] = every_else.apply(lambda x: x['Sentence'].replace(x["Token"],"[MASK]"),axis=1)

In [45]:
len(every_else)

43

# Combine the dataframes together again

In [46]:
print(f"nonempty: {len(nonempty)}")
print(f"good: {len(good)}")
print(f"t4_nonnull: {len(t4_nonnull)}")
print(f"t4_null: {len(t4_null)}")
print(f"every_else: {len(every_else)}")
print(f"total: {len(nonempty) + len(good) + len(t4_nonnull) + len(t4_null) + len(every_else)}")

nonempty: 5711
good: 425
t4_nonnull: 82
t4_null: 19
every_else: 43
total: 6280


In [47]:
nonempty = nonempty[["ID","Verb","Sentence","Masked","Token"]]
good = good[["ID","Verb","Sentence","Masked","Token"]]
t4_nonnull = t4_nonnull[["ID","Verb","Sentence","Masked","Token"]]
t4_null = t4_null[["ID","Verb","Sentence","Masked","Token"]]
every_else = every_else[["ID","Verb","Sentence","Masked","Token"]]

In [48]:
len(nonempty) + len(good) + len(t4_nonnull) + len(t4_null) + len(every_else)

6280

In [49]:
d = pd.concat([nonempty,good,t4_nonnull,t4_null,every_else])
len(d)

6280
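
The row counts add up, but they do not show whether every sentence was actually masked. A quick check is to look for rows whose masked sentence does not contain exactly one `[MASK]`. The sketch below works on plain (ID, masked sentence) pairs; `find_mask_problems` and the example rows are made up for illustration.

```python
from typing import Iterable, List, Tuple


def find_mask_problems(rows: Iterable[Tuple[str, str]], mask: str = "[MASK]") -> List[str]:
    """Return the IDs of rows whose masked sentence lacks exactly one mask."""
    return [row_id for row_id, masked in rows if masked.count(mask) != 1]


rows = [
    ("a", "Mary [MASK] that it rained."),
    ("b", "Mary fought that it rained."),    # token was never found
    ("c", "He [MASK] that she [MASK] it."),  # token replaced twice
]
print(find_mask_problems(rows))  # ['b', 'c']
```

On the combined DataFrame, something like `d[d["Masked"].str.count(r"\[MASK\]") != 1]` should surface the same rows for manual inspection.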

# Save to CSV

In [50]:
d.to_csv("../../data/data_for_lm.csv")