## Is a pre-trained LSTM model aware of relative clause island constraints?

Part of the code for tokenizing sentences and calculating token-by-token surprisal was adapted from https://github.com/kuribayashi4/surprisal_reading_time_en_ja and edited to fit the purpose of this project and to run without issues in my environment.

### Load repository for calculating surprisals

In [None]:
!git clone https://github.com/kuribayashi4/surprisal_reading_time_en_ja.git

Cloning into 'surprisal_reading_time_en_ja'...
remote: Enumerating objects: 41, done.
remote: Counting objects: 100% (41/41), done.
remote: Compressing objects: 100% (35/35), done.
remote: Total 41 (delta 10), reused 26 (delta 5), pack-reused 0
Unpacking objects: 100% (41/41), 3.17 MiB | 4.89 MiB/s, done.


In [None]:
%cd /content/surprisal_reading_time_en_ja

/content/surprisal_reading_time_en_ja


In [None]:
#before executing this cell, I changed the package name "mecab" to "mecab-python3" in requirements.txt
!pip install -r requirements.txt
!pip install unidic-lite

In [None]:
#access files saved in Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
#import necessary packages
import MeCab
import unidic
import unicodedata
import mojimoji
import torch
import sentencepiece as spm
import matplotlib.ticker as plticker
import pandas as pd
import japanize_matplotlib
import numpy as np

from torch.nn import CrossEntropyLoss
from fairseq.models.transformer_lm import TransformerLanguageModel
from fairseq.models.lstm_lm import LSTMLanguageModel
from matplotlib import pyplot as plt
from matplotlib.pyplot import figure

### Load the model

In [None]:
JA_MODEL_PATH = '/content/drive/MyDrive/Classes/LIGN 167 - FA22'

pretrained_lm = LSTMLanguageModel.from_pretrained(JA_MODEL_PATH,
                                                  'checkpoint_last_pretrained.pt',
                                                  data_name_or_path='japanese-dict',
                                                  bpe='sentencepiece',
                                                  sentencepiece_model='japanese-dict/spm/japanese_gpt2_unidic.model')

In [None]:
#load word divider and SentencePiece tokenizer
wakati = MeCab.Tagger("-Owakati")
sp = spm.SentencePieceProcessor()
sp.Load("japanese-dict/spm/japanese_gpt2_unidic.model")

True

In [None]:
def concat_bos(tensor, bos):
    return torch.cat([torch.tensor([bos]), tensor])

# reduction='none' returns one loss value per token; it replaces the deprecated reduce=False
loss_fct = CrossEntropyLoss(ignore_index=-1, reduction='none')

def batch_surprisal(df, lm):
  dataset = pd.DataFrame(columns = ["island", "sent_number", "token", "surprisal"])
  sent_number = 0
  for i, row in df.iterrows():
    sent_number += 1
    tokens = []
    sent = unicodedata.normalize('NFKC', mojimoji.han_to_zen(row['sentence']))
    sent_wakati = wakati.parse(sent).strip()
    pieces = ' '.join(sp.EncodeAsPieces(sent_wakati))
    input_ids = lm.binarize(pieces)
    bos = lm.src_dict.bos()
    input_ids_with_special_token = concat_bos(input_ids, bos)

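    # forward pass without gradient tracking; results[0] holds the logits
    with torch.no_grad():
      results = lm.models[0](input_ids_with_special_token.view(1,-1))
    # the output at position t is scored against the token at position t+1,
    # so the last output is dropped; cross-entropy gives surprisal in nats
    surprisals = loss_fct(results[0][0][:-1], input_ids)
    surprisals = surprisals.tolist()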
    assert len(surprisals) == len(input_ids)

    for idx in input_ids:
      tokens.append(lm.src_dict[idx].strip('▁'))

    # store one row per token; sentences whose 'island' value is neither 0 nor 1 are skipped
    if row['island'] in (0, 1):
      to_append = {'island': [int(row['island'])] * len(tokens),
                   'sent_number': [sent_number] * len(tokens),
                   'token': tokens,
                   'surprisal': surprisals}
      # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
      dataset = pd.concat([dataset, pd.DataFrame(to_append)])

  return dataset
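
The surprisal stored for each token is its negative log-probability given the preceding tokens, which is what the per-token cross-entropy returns. The following is a minimal sketch of that computation in plain Python, not the fairseq code path: `token_surprisals` and the toy logits are hypothetical and only illustrate how the scores at one position are matched with the token that comes next.

```python
import math

def token_surprisals(logits, target_ids):
    """Return -log p(target) for each position, in nats.

    logits[t] holds the unnormalised scores emitted after reading
    positions 0..t of the BOS-prefixed input, so it is scored against
    target_ids[t], the token that actually comes next.
    """
    surprisals = []
    for scores, target in zip(logits, target_ids):
        # log-sum-exp with max subtraction for numerical stability
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        surprisals.append(log_z - scores[target])
    return surprisals

# Hypothetical toy input: a vocabulary of 4 token IDs and a 2-token sentence.
toy_logits = [
    [0.0, 0.0, 0.0, 0.0],  # uniform prediction over the vocabulary
    [5.0, 0.0, 0.0, 0.0],  # confident prediction of token ID 0
]
toy_targets = [2, 0]
toy_surprisals = token_surprisals(toy_logits, toy_targets)
```

Under the uniform prediction every token has surprisal log(4) nats, while the confidently predicted token has a surprisal close to zero.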



### Import test data and compute token-by-token surprisal using the language model

In [None]:
data = pd.read_csv('/content/drive/MyDrive/[file path]/test_stimuli.csv')  # change path names before running

In [None]:
#get and save the result
result = batch_surprisal(data, pretrained_lm)
result.to_csv('result_LSTM.csv')

After exporting the result, I manually coded the region of interest (i.e., all the words following the noun that underwent long-distance relativization) in a `critical` column, where 1 marks a token inside the region, and reimported the edited version.

Example:

Because the book that __ wrote was featured in the news, <ins>the professor looks proud.</ins>

\[The book that __ wrote was featured in the news\] <ins>the professor looks proud.</ins>


In [None]:
result_edited = pd.read_csv('/content/drive/MyDrive/[file path]/result_LSTM.csv') # change path names before running

In [None]:
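#label each token with its condition based on the sentence number:
#1-8 = no extraction, 9-16 = extraction without island, 17 and above = extraction out of an island
cond_list = []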
for i, row in result_edited.iterrows():
  if row['sent_number'] < 9:
    cond_list.append('noext')
  elif row['sent_number'] > 16:
    cond_list.append('ext_isl')
  else:
    cond_list.append('ext_noisl')

result_edited['condition'] = cond_list
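
The same labeling can be written as a small function, which keeps the block boundaries in one place. This is a sketch that assumes, as the thresholds above imply, that the stimuli are ordered in three blocks of eight sentences; `condition_for` and `n_per_condition` are hypothetical names.

```python
def condition_for(sent_number, n_per_condition=8):
    """Map a 1-based sentence number to its condition label."""
    labels = ['noext', 'ext_noisl', 'ext_isl']
    # integer division gives the 0-based block the sentence falls in
    block = (sent_number - 1) // n_per_condition
    # any sentence past the second block counts as 'ext_isl', as in the loop above
    return labels[min(block, len(labels) - 1)]
```

With pandas, the column could then be filled with `result_edited['sent_number'].map(condition_for)`.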

In [None]:
#get mean surprisal on the critical region by condition (noext / ext_noisl / ext_isl)
critical = result_edited[result_edited['critical']==1]
critical.groupby('condition')['surprisal'].mean()

condition
ext_isl      4.817421
ext_noisl    4.859113
noext        4.627623
Name: surprisal, dtype: float64

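The mean surprisal on the critical region is almost identical for extraction out of an island (4.82) and extraction without an island violation (4.86), and both are only slightly higher than in the no-extraction condition (4.63). The pre-trained LSTM LM is therefore no more surprised by long-distance extraction that violates an island than by extraction that does not. In other words, the LM does not seem to be aware of the relative clause island constraint.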
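
The means reported above pool all critical-region tokens, so sentences with longer critical regions contribute more to each condition mean. One possible alternative, which is not what this notebook did, is to average within each sentence first and then across sentences. The sketch below uses hypothetical toy rows rather than the actual results, and `mean_by_condition` is an illustrative name.

```python
from collections import defaultdict
from statistics import mean

def mean_by_condition(rows):
    """rows: iterable of (condition, sent_number, surprisal) for critical tokens."""
    # step 1: collect the critical-token surprisals of each sentence
    per_sentence = defaultdict(list)
    for condition, sent_number, surprisal in rows:
        per_sentence[(condition, sent_number)].append(surprisal)
    # step 2: average within each sentence, then across sentences per condition
    per_condition = defaultdict(list)
    for (condition, _), values in per_sentence.items():
        per_condition[condition].append(mean(values))
    return {c: mean(v) for c, v in per_condition.items()}

# Hypothetical toy rows, not taken from result_LSTM.csv.
toy_rows = [
    ('ext_isl', 17, 4.0), ('ext_isl', 17, 6.0),    # sentence mean 5.0
    ('ext_isl', 18, 2.0),                          # sentence mean 2.0
    ('ext_noisl', 9, 3.0), ('ext_noisl', 9, 5.0),  # sentence mean 4.0
]
toy_means = mean_by_condition(toy_rows)
```

In the toy rows the pooled token mean for `ext_isl` would be 4.0, whereas the per-sentence mean is 3.5, which shows how the two aggregations can diverge when critical regions differ in length.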