## Is a pre-trained GPT-2 model aware of island constraints?

Sentence tokenization and token-by-token surprisal calculation use helper functions from the `minicons` package (https://github.com/kanishkamisra/minicons).

### Load required packages

In [None]:
!pip install transformers==4.34.1

In [None]:
!pip install sentencepiece==0.1.96

In [None]:
!pip install minicons

In [None]:
!pip install torch==2.1.0



In [None]:
# needs to be run when testing the BERT model
!pip install fugashi
!pip install ipadic

In [None]:
from minicons import scorer
import torch
from torch.utils.data import DataLoader
import pandas as pd
import numpy as np
from transformers import (pipeline, AutoTokenizer, TextDataset,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments, AutoModelWithLMHead)
from matplotlib import pyplot as plt
from matplotlib.pyplot import figure
import json
from scipy import stats

In [None]:
# access files saved in Google Drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### GPT-2 pre-trained models
* [GPT-2 xsmall](https://huggingface.co/rinna/japanese-gpt2-xsmall): 37M parameters
* [GPT-2 small](https://huggingface.co/colorfulscoop/gpt2-small-ja): 110M parameters
* [GPT-2 medium](https://huggingface.co/rinna/japanese-gpt2-medium): 336M parameters
* [GPT-2 large](https://huggingface.co/rinna/japanese-gpt-1b): 1.3B parameters

In [None]:
pretrained_model_gpt2_xsmall = scorer.IncrementalLMScorer("rinna/japanese-gpt2-xsmall", 'cpu')

In [None]:
pretrained_model_gpt2_small = scorer.IncrementalLMScorer("colorfulscoop/gpt2-small-ja", 'cpu')

In [None]:
pretrained_model_gpt2_medium = scorer.IncrementalLMScorer("rinna/japanese-gpt2-medium", 'cpu')

In [None]:
pretrained_model_gpt2_large = scorer.IncrementalLMScorer("rinna/japanese-gpt-1b", 'cpu')

In [None]:
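def batch_surprisal_transformer(model, data):
  # Return one row per token: the condition labels of its sentence, the token, and its surprisal in bits.
  columns = ['item', 'RC-licensor', 'gap', 'island', 'ext-type', 'token', 'surprisal']
  frames = []
  for _, row in data.iterrows():
    tokens = []
    surprisals = []
    # token_score returns, for each input sentence, a list of (token, score) tuples
    results = model.token_score(row['sentence'], surprisal = True, base_two = True)
    for sentence_scores in results:
      for scored_token in sentence_scores:
        tokens.append(scored_token[0])
        surprisals.append(scored_token[1])
    # repeat the sentence-level labels once per token
    to_append = {'item': list(np.repeat(row['item'], len(tokens), axis=0)),
                 'RC-licensor': list(np.repeat(row['RC-licensor'], len(tokens), axis=0)),
                 'gap': list(np.repeat(row['gap'], len(tokens), axis=0)),
                 'island': list(np.repeat(row['island'], len(tokens), axis=0)),
                 'ext-type': list(np.repeat(row['ext-type'], len(tokens), axis=0)),
                 'token': tokens,
                 'surprisal': surprisals}
    frames.append(pd.DataFrame(to_append))
  if not frames:
    return pd.DataFrame(columns = columns)
  # DataFrame.append was removed in pandas 2.0, so concatenate the per-sentence frames instead
  return pd.concat(frames)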
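
The function above requests surprisal in base two (`surprisal = True, base_two = True`), so the scores are in bits. As a minimal sketch of what that quantity means, the snippet below computes surprisal as the negative base-two logarithm of a probability. The probabilities are hypothetical values chosen for illustration, not model output, and `surprisal_bits` is a helper name introduced here.

```python
import math

def surprisal_bits(prob):
    # Surprisal of an event with probability `prob`, in bits: -log2(prob)
    return -math.log2(prob)

# Hypothetical conditional probabilities for three successive tokens
# (illustrative values only, not taken from any model)
token_probs = [0.5, 0.25, 0.125]
surprisals = [surprisal_bits(p) for p in token_probs]
print(surprisals)
```

Less probable tokens receive higher surprisal: halving the probability adds one bit.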

### Import test data

In [None]:
data = pd.read_csv('/path/to/test_stimuli.csv') # change path names before running

### Compute token-by-token surprisal: Xsmall

In [None]:
result_xs = batch_surprisal_transformer(pretrained_model_gpt2_xsmall, data)
result_xs.to_csv('/path/to/result_gpt2_xs.csv') # change path names before running

### Compute token-by-token surprisal: Small

In [None]:
result_sm = batch_surprisal_transformer(pretrained_model_gpt2_small, data)
result_sm.to_csv('/path/to/result_gpt2_sm.csv') # change path names before running

### Compute token-by-token surprisal: Medium

In [None]:
result_md = batch_surprisal_transformer(pretrained_model_gpt2_medium, data)
result_md.to_csv('/path/to/result_gpt2_md.csv') # change path names before running

### Compute token-by-token surprisal: Large

In [None]:
result_lg = batch_surprisal_transformer(pretrained_model_gpt2_large, data)
result_lg.to_csv('/path/to/result_gpt2_lg.csv') # change path names before running

After exporting the results, I manually coded the region of interest in each file (i.e., all the words following the noun that underwent long-distance relativization) and re-imported the edited versions, in which tokens in the critical region are marked with 1 in the `critical` column.

In [None]:
# change path names before running
xs_data = pd.read_csv('/path/to/result_gpt2_xs.csv')
sm_data = pd.read_csv('/path/to/result_gpt2_sm.csv')
md_data = pd.read_csv('/path/to/result_gpt2_md.csv')
lg_data = pd.read_csv('/path/to/result_gpt2_lg.csv')

In [None]:
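def get_licensing_interaction(data):
  # `data` holds one mean surprisal per RC-licensor x gap x island x ext-type cell
  # (`.item()` below raises an error unless each cell matches exactly one row).
  # For each extraction type, the licensing interaction is
  # (licensor, no gap - no licensor, no gap) - (licensor, gap - no licensor, gap),
  # computed separately for the non-island and island conditions.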
  ext_types = ['subject', 'object']
  results = {}
  for ext_type in ext_types:
    res = []
    data_sub = data[data['ext-type']==ext_type]
    data_sub_isl = data_sub[data_sub['island']==1]
    data_sub_nonisl = data_sub[data_sub['island']==0]
    cond_a_isl = data_sub_isl[(data_sub_isl['RC-licensor']==0) & (data_sub_isl['gap']==0)]['surprisal'].item()
    cond_b_isl = data_sub_isl[(data_sub_isl['RC-licensor']==1) & (data_sub_isl['gap']==0)]['surprisal'].item()
    cond_c_isl = data_sub_isl[(data_sub_isl['RC-licensor']==0) & (data_sub_isl['gap']==1)]['surprisal'].item()
    cond_d_isl = data_sub_isl[(data_sub_isl['RC-licensor']==1) & (data_sub_isl['gap']==1)]['surprisal'].item()
    cond_a_nonisl = data_sub_nonisl[(data_sub_nonisl['RC-licensor']==0) & (data_sub_nonisl['gap']==0)]['surprisal'].item()
    cond_b_nonisl = data_sub_nonisl[(data_sub_nonisl['RC-licensor']==1) & (data_sub_nonisl['gap']==0)]['surprisal'].item()
    cond_c_nonisl = data_sub_nonisl[(data_sub_nonisl['RC-licensor']==0) & (data_sub_nonisl['gap']==1)]['surprisal'].item()
    cond_d_nonisl = data_sub_nonisl[(data_sub_nonisl['RC-licensor']==1) & (data_sub_nonisl['gap']==1)]['surprisal'].item()
    nonisl_res = round((cond_b_nonisl - cond_a_nonisl) - (cond_d_nonisl - cond_c_nonisl), 2)
    isl_res = round((cond_b_isl - cond_a_isl) - (cond_d_isl - cond_c_isl), 2)
    print(ext_type.upper(), ' non-island licensing interaction: ', nonisl_res)
    print(ext_type.upper(), ' island licensing interaction: ', isl_res)
    res.extend([nonisl_res, isl_res])
    results[ext_type] = res
  return results
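
To make the arithmetic in `get_licensing_interaction` concrete, here is a small worked sketch of the same difference-of-differences for a single 2x2 design. The surprisal values are hypothetical, and `licensing_interaction` is a helper written for this illustration that uses a pivot table instead of the row filters above.

```python
import pandas as pd

# Hypothetical mean surprisals (bits) for one extraction type in one
# island condition; the values are illustrative, not experimental results
toy = pd.DataFrame({
    'RC-licensor': [0, 1, 0, 1],
    'gap':         [0, 0, 1, 1],
    'surprisal':   [5.0, 6.0, 9.0, 7.0],
})

def licensing_interaction(cells):
    # Rows are indexed by gap (0/1), columns by RC-licensor (0/1)
    table = cells.pivot(index='gap', columns='RC-licensor', values='surprisal')
    no_gap_effect = table.loc[0, 1] - table.loc[0, 0]  # cond_b - cond_a
    gap_effect = table.loc[1, 1] - table.loc[1, 0]     # cond_d - cond_c
    return no_gap_effect - gap_effect

toy_interaction = licensing_interaction(toy)
print(toy_interaction)
```

In this toy case the licensor raises surprisal by 1 bit when there is no gap and lowers it by 2 bits when there is a gap, so the interaction is 1 - (-2) = 3 bits.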

In [None]:
xs_critical = xs_data[xs_data['critical'] == 1]
xs_data_summary = xs_critical.groupby(['RC-licensor', 'gap', 'island', 'ext-type'])['surprisal'].mean().to_frame().reset_index()
xs_result = get_licensing_interaction(xs_data_summary)

SUBJECT  non-island licensing interaction:  2.69
SUBJECT  island licensing interaction:  2.91
OBJECT  non-island licensing interaction:  2.98
OBJECT  island licensing interaction:  2.62


In [None]:
sm_critical = sm_data[sm_data['critical'] == 1]
sm_data_summary = sm_critical.groupby(['RC-licensor', 'gap', 'island', 'ext-type'])['surprisal'].mean().to_frame().reset_index()
sm_result = get_licensing_interaction(sm_data_summary)

SUBJECT  non-island licensing interaction:  3.61
SUBJECT  island licensing interaction:  3.01
OBJECT  non-island licensing interaction:  3.97
OBJECT  island licensing interaction:  3.26


In [None]:
md_critical = md_data[md_data['critical'] == 1]
md_data_summary = md_critical.groupby(['RC-licensor', 'gap', 'island', 'ext-type'])['surprisal'].mean().to_frame().reset_index()
md_result = get_licensing_interaction(md_data_summary)

SUBJECT  non-island licensing interaction:  2.6
SUBJECT  island licensing interaction:  2.31
OBJECT  non-island licensing interaction:  3.45
OBJECT  island licensing interaction:  2.75


In [None]:
lg_critical = lg_data[lg_data['critical'] == 1]
lg_data_summary = lg_critical.groupby(['RC-licensor', 'gap', 'island', 'ext-type'])['surprisal'].mean().to_frame().reset_index()
lg_result = get_licensing_interaction(lg_data_summary)

SUBJECT  non-island licensing interaction:  0.81
SUBJECT  island licensing interaction:  0.45
OBJECT  non-island licensing interaction:  1.21
OBJECT  island licensing interaction:  0.34
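
The printed interactions can be collected into one table to compare island and non-island conditions across model sizes. The sketch below copies the values from the outputs above; the `reduction` column (non-island minus island) and the variable names are introduced here for illustration. It only tabulates differences and does not test whether they are statistically reliable.

```python
import pandas as pd

# Licensing interactions (bits) copied from the printed outputs above:
# (model, extraction type) -> (non-island, island)
reported = {
    ('xsmall', 'subject'): (2.69, 2.91),
    ('xsmall', 'object'):  (2.98, 2.62),
    ('small', 'subject'):  (3.61, 3.01),
    ('small', 'object'):   (3.97, 3.26),
    ('medium', 'subject'): (2.60, 2.31),
    ('medium', 'object'):  (3.45, 2.75),
    ('large', 'subject'):  (0.81, 0.45),
    ('large', 'object'):   (1.21, 0.34),
}

rows = []
for (model_name, ext_type), (nonisl, isl) in reported.items():
    rows.append({'model': model_name,
                 'ext-type': ext_type,
                 'non-island': nonisl,
                 'island': isl,
                 # how much smaller the interaction is inside an island
                 'reduction': round(nonisl - isl, 2)})

summary = pd.DataFrame(rows)
print(summary.to_string(index=False))
```

A positive `reduction` means the licensing interaction is smaller in the island condition than in the non-island condition; a negative value means the opposite.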
