<a href="https://colab.research.google.com/github/UMWordLab/multilingual_amaze/blob/main/Multilingual_A_maze_Alternative_Generation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# BERT-based Multilingual A-maze Alternative Generation

## 1. Preliminaries
Please run the following cells to install and import the necessary libraries.

In [1]:
%%capture

!pip install minicons
!pip install transformers
!pip install wordfreq
# unicodedata ships with the Python standard library, so it is not installed with pip

In [15]:
%%capture

from minicons import scorer
import torch
from torch.utils.data import DataLoader
from transformers import BertTokenizer, BertForMaskedLM

from wordfreq import get_frequency_dict, zipf_frequency
import unicodedata

import math
import random

from google.colab import files
import csv
import io

## 2. Selecting a Minicons language model
Please run the following cell and input the language model you would like to use for the experiment. It should be a masked language model, like BERT.


In [4]:
langmodel = input("What minicons language model would you like to use?\nYou can select any from this list: https://huggingface.co/models\nThe name of the model can be copied using the clipboard icon next to the name on the webpage.\n")
print(langmodel, "selected as model.")
model = scorer.MaskedLMScorer(langmodel, 'cpu')  # masked-LM scorer, matching the BERT-style model requested above
tokenizer = BertTokenizer.from_pretrained(langmodel)

What minicons language model would you like to use?
You can select any from this list: https://huggingface.co/models
The name of the model can be copied using the clipboard icon next to the name on the webpage.
onlplab/alephbert-base
onlplab/alephbert-base selected as model.



## 3. Selecting frequency information

Please run the following cell to specify how you would like to collect frequency information for the experiment, and to define the frequency band used to find similarly frequent words.

In [16]:
strict_scripts = {
    # map 2-letter ISO codes to the Unicode script tags included in character names
    # add more languages here as desired!
    "ar": "ARABIC",
    "he": "HEBREW"
}

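def script_check(word, script):
  # ensure first alphabetic character in token has appropriate script
  for ch in word:
    if ch.isalpha():
      # default of "" avoids a ValueError for characters without a Unicode name
      return script in unicodedata.name(ch, "")
  return False  # no alphabetic character found, e.g. empty or punctuation-only tokens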

freq_type = input("What type of frequency information would you like to use?\nYou can select from the following options:\n- wf: uses the wordfreq package, which provides multi-corpus frequency estimates for over 40 languages.\n- csv: requires upload of a csv specifying your own vocabulary and frequency counts.\n")

if freq_type == "wf":
  # Use the wordfreq package
  lang_code = input("What wordfreq language would you like to use?\nYou can select any from the list here: https://pypi.org/project/wordfreq/. Use the two letter ISO code to reference your language.\n")
  freq_dict_raw = get_frequency_dict(lang=lang_code, wordlist = "best")
  script = strict_scripts.get(lang_code, None)
  if script:
    freq_dict = dict((x, zipf_frequency(x, lang=lang_code)) for x,y in freq_dict_raw.items() if script_check(x, script)) # convert to Zipf scale (base-10 logarithm of frequency per billion words)
  else:
    freq_dict = dict((x, zipf_frequency(x, lang=lang_code)) for x,y in freq_dict_raw.items()) # convert to Zipf scale (base-10 logarithm of frequency per billion words)
  freq_window = float(input("wordfreq reports frequencies on the Zipf scale, the base-10 logarithm of frequency per billion words.\nWhat is the window of frequency on this scale that you would like to use to consider words 'similar' frequency?\nE.g., with a window of 1 Zipf, the word 'glove', with a Zipf of about 4 (10 per million), could match the words:\n-'boast', Zipf of 3 (1 per million)\n-'floor', Zipf of 5 (100 per million)\n"))
elif freq_type == "csv":
  # Upload a csv
  print("Please upload the csv that contains the word-to-frequency mapping.\nIt should have two columns, labeled 'word' and 'frequency'.")
  uploaded = files.upload()
  freq_file = next(iter(uploaded))
  # build freq_dict from the uploaded bytes so that the later cells can look words up
  freq_reader = csv.DictReader(io.StringIO(uploaded[freq_file].decode('utf-8-sig')))
  freq_dict = dict((row['word'], float(row['frequency'])) for row in freq_reader)
  freq_window = float(input("Given the frequency values you used in your input data, what is the window of frequency on this scale that you would like to use to consider words 'similar' frequency?\nE.g., if your data provides frequencies per million, at a window of 10, the word 'glove', with a frequency of about 10 per million, could match:\n-'boast', 1 per million\n-'fever', 20 per million\n"))
else:
  raise ValueError("Invalid frequency type.")



What type of frequency information would you like to use?
You can select from the following options:
- wf: uses the wordfreq package, which provides multi-corpus frequency estimates for over 40 languages.
- csv: requires upload of a csv specifying your own vocabulary and frequency counts.
wf
What wordfreq language would you like to use?
You can select any from the list here: https://pypi.org/project/wordfreq/. Use the two letter ISO code to reference your language.
he
wordfreq reports frequencies on the Zipf scale, the base-10 logarithm of frequency per billion words.
What is the window of frequency on this scale that you would like to use to consider words 'similar' frequency?
E.g., with a window of 1 Zipf, the word 'glove', with a Zipf of about 4 (10 per million), could match the words:
-'boast', Zipf of 3 (1 per million)
-'floor', Zipf of 5 (100 per million)
0.5
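The Zipf arithmetic in the prompt above can be checked by hand. The following is a minimal sketch, not part of the notebook's pipeline: the helper names `zipf_from_per_million` and `within_window` are hypothetical, and the frequencies are the round numbers quoted in the prompt rather than values looked up in wordfreq.

```python
import math

def zipf_from_per_million(freq_per_million):
    # Zipf = log10(frequency per billion words) = log10(frequency per million) + 3
    return math.log10(freq_per_million) + 3

def within_window(target_zipf, candidate_zipf, window):
    # strict inequality on both sides, like the comparison in find_similar_frequency
    return abs(target_zipf - candidate_zipf) < window

# round numbers from the prompt text, used here purely for illustration
glove = zipf_from_per_million(10)   # 10 per million
boast = zipf_from_per_million(1)    # 1 per million

print(glove, boast)
print(within_window(glove, 4.3, 0.5))    # a hypothetical candidate 0.3 Zipf away
print(within_window(glove, boast, 0.5))  # a full Zipf unit away
```

Because the comparison is strict, a candidate that sits exactly one window away from the target is not counted as a match.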


## 4. Providing your stimuli
Please run the following cells to upload your stimuli. They should be in a single-column CSV, with the column labeled "sentences". More functionality to come.
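As a rough illustration of the expected file layout, the sketch below parses a small in-memory CSV the same way `csv.DictReader` would parse an uploaded file. The sentences are placeholders, and `example_csv` and `example_sentences` are hypothetical names used only here.

```python
import csv
import io

# hypothetical file contents: a header row named "sentences", then one sentence per row
# a sentence that contains a comma has to be wrapped in double quotes
example_csv = (
    "sentences\n"
    "The dog chased the ball.\n"
    '"After lunch, she read the letter."\n'
)

reader = csv.DictReader(io.StringIO(example_csv))
example_sentences = [row["sentences"] for row in reader]
print(example_sentences)
```

Without the quotes, the comma in the second sentence would be read as a column separator and the sentence would be cut short.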

In [7]:
print("Please upload your file that contains the stimuli sentences to be used for alternative generation.")
uploaded = files.upload()
stim_file = next(iter(uploaded))

def process_stimuli_file(filename):
  res = []
  with open(filename, mode='r', encoding='utf-8-sig') as csv_file:
      csv_reader = csv.DictReader(csv_file)
      for row in csv_reader:
          sent = row['sentences']
          res.append(sent)
  return res

sentences = process_stimuli_file(stim_file)
print("Stimuli saved. ")

Please upload your file that contains the stimuli sentences to be used for alternative generation.


Saving AMaze_Input_1.csv to AMaze_Input_1.csv
Stimuli saved. 


## 5. Main Functions
- find_similar_frequency
- tokenization
- calculate_surprisal
- find_alternative
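The surprisal step in this list can be sketched without loading a model. The example below is an illustration under stated assumptions, not the notebook's implementation: the logits and candidate ids are made-up toy values, and `softmax`, `surprisal_bits` and `top_n_by_surprisal` are hypothetical helper names. It shows how logits at a masked position become probabilities, how surprisal in bits is derived from them, and how the highest-surprisal candidates are kept.

```python
import math

def softmax(logits):
    # subtract the maximum first for numerical stability
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def surprisal_bits(logits, token_id):
    # surprisal = -log2 p(token | context)
    return -math.log2(softmax(logits)[token_id])

def top_n_by_surprisal(candidates, logits, n):
    # candidates maps each candidate word to its (toy) vocabulary id
    scored = [(w, surprisal_bits(logits, i)) for w, i in candidates.items()]
    # ascending sort, then keep the last n, i.e. the least expected candidates
    return sorted(scored, key=lambda pair: pair[1])[-n:]

toy_logits = [2.0, 0.5, -1.0, -3.0]                      # made-up scores for a 4-word vocabulary
toy_candidates = {"w0": 0, "w1": 1, "w2": 2, "w3": 3}    # placeholder words
best = top_n_by_surprisal(toy_candidates, toy_logits, 2)
print([w for w, _ in best])
```

Lower logits mean lower probability and therefore higher surprisal, so the candidates with the smallest scores are the ones retained as foils.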

In [23]:
model = BertForMaskedLM.from_pretrained(langmodel)

# instead of random selection, provide a window for frequency selection
# iterates over freq_dict until it has assembled {goal} words within {window}
# or it has hit its maximum search count of {timeout} words
def find_similar_frequency(word, window, goal, timeout, verbose_mode):
  res = set()
  print('\nword: ', word)
  # shuffle the vocabulary up front so that every branch below can iterate over it
  words = list(freq_dict.items())
  random.shuffle(words)
  if word in freq_dict:
    word_frq = freq_dict[word]
    if verbose_mode:
      print('\tFrequency found in list:', word_frq)
    n = 0
    for w, f in words:
      if w != word and len(w)==len(word):
        if word_frq < (f + window) and word_frq > (f - window):
          if verbose_mode:
            print('\t\tfound match:',w,f)
          res.add(w)
      n += 1
      if n == timeout or len(res) == goal:
        break
    if len(res) < goal:
      # error handling - word exists in given freq list but not enough alternatives found within the window
      if verbose_mode:
        print('\tThere weren\'t enough words that matched the length and frequency in the given window.\n\tAdding samples with more relaxed length constraints.')
      n = 0
      for w, f in words:
        if w != word and len(w)>=len(word)-1 and len(w)<=len(word)+1:
          if word_frq < (f + window) and word_frq > (f - window):
            if verbose_mode:
              print('\t\tfound match:',w,f)
            res.add(w)
        n += 1
        if n == timeout or len(res) == goal:
          break
    if len(res) < goal:
      # error handling - word exists in given freq list but not enough alternatives found within the window, even after relaxing length
      if verbose_mode:
        print('\tThere weren\'t enough words that matched the relaxed length and frequency in the given window.\n\tAdding samples with more relaxed frequency constraints.')
      n = 0
      for w, f in words:
        if w != word and len(w)>=len(word)-1 and len(w)<=len(word)+1:
          if word_frq < (f + 2*window) and word_frq > (f - 2*window):
            if verbose_mode:
              print('\t\tfound match:',w,f)
            res.add(w)
        n += 1
        if n == timeout or len(res) == goal:
          break
  else:
    # error handling - word doesn't exist in given frequency list
    # complete random selection
    if verbose_mode:
      print('\tFrequency not found in list. Drawing a random sample based on length alone.')
    n = 0
    for w, f in words:
      if w != word and len(w)==len(word):
        if verbose_mode:
          print('\t\tfound match:',w,f)
        res.add(w)
      n += 1
      if n == timeout or len(res) == goal:
        break
  return list(res)


def tokenization(sentence, separation):
  # build one masked copy of the sentence per word
  # for example, for a sentence with the words AA B CCC DD we get
  # [[MASK] B CCC DD], [AA [MASK] CCC DD], [AA B [MASK] DD], [AA B CCC [MASK]]
  # note: this assumes that each whitespace-separated word maps to a single token
  masked_list = []
  inputs = tokenizer(sentence, add_special_tokens=True, return_tensors="pt") # we tokenize this sentence
  for i in range(len(separation)): # note that we don't replace [CLS] or [SEP]
    masked_list.append(inputs['input_ids'].clone())
    # replace the i-th word with the tokenizer's own [MASK] id
    # (103 for bert-base vocabularies, but the id is model-specific)
    # we +1 because position 0 holds the [CLS] token
    masked_list[i][0][i + 1] = tokenizer.mask_token_id
  return masked_list

def calculate_surprisal(sentence, word, token, start_position, verbose_mode, window, goal, timeout, n_highest):
    inputs = tokenizer(sentence, is_split_into_words=True, add_special_tokens=True, return_tensors="pt") # create a placeholder for masked sentences
    inputs['input_ids'] = token  # replace placeholder with masked sentence
    with torch.no_grad(): # inference only, so no gradients are needed
      outputs = model(**inputs) # let the model predict
    # find a list of similar frequency words
    similar = find_similar_frequency(word, window, goal, timeout, verbose_mode)
    # probability of every vocabulary item at the masked position
    # start_position is the token index of the masked word (word index + 1 because of [CLS])
    probs = torch.softmax(outputs[0][0][start_position], dim=-1)
    surprisal_list = []
    # calculate surprisal of each word in similar[]
    print('\tCalculating surprisals...')
    for candidate in similar:
      # look up the vocabulary id of the candidate (word -> id)
      # note: a candidate that is not a single vocabulary entry maps to the [UNK] id
      candidate_id = tokenizer.convert_tokens_to_ids(candidate)
      # surprisal in bits, stored as a plain float
      surprisal_list.append(-1 * torch.log2(probs[candidate_id]).item())

    # now we have a list of surprisal, find the highest n
    max_indexes = sorted(range(len(surprisal_list)), key=lambda i: surprisal_list[i])[-n_highest:]
    max_words = [(similar[index], surprisal_list[index]) for index in max_indexes]
    if verbose_mode:
      print('\tBest candidates chosen:')
      for output in max_words:
        print('\t\t', output[0])
    return max_words

def find_alternative(sentence, split, window, verbose_mode=False, goal=20, timeout=100000, n_highest=5):
  result = []
  # we get a list of [MASK] at different word position of a sentence
  # detailed description in tokenization()
  token_list = tokenization(sentence, split)
  # the loop starts at 1, so no foils are generated for the first word
  for i in range(1, len(split)):
    # word i sits at token position i + 1 because of the leading [CLS] token
    start_position = i + 1
    alternatives = calculate_surprisal(sentence, split[i], token_list[i], start_position, verbose_mode, window, goal, timeout, n_highest)
    result += [[split[i], n, alternative[0]] for n, alternative in enumerate(alternatives)] # alternative[1] is the surprisal in bits, if you want it
  return result

## 6. Alternative Generation

This block runs the alternative generation and creates an output file under the name of your choosing.

Recommendations: 100 candidate foils, save 5.

But note: Evaluating 100 candidate foils takes about 4-5 minutes per sentence. Plan accordingly.
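The output file written by the cell below has the columns `sentence_id`, `word_id`, `word`, `foil_id` and `foil`. A minimal sketch of reading such a file back and grouping the foils by word might look as follows. The rows here are hypothetical placeholders held in memory, and `example_output` and `foils` are names used only in this example. For a real file, open it with `encoding='utf-8-sig'`, the encoding used when writing.

```python
import csv
import io
from collections import defaultdict

# hypothetical output rows; the words and foils are placeholders
example_output = (
    '"sentence_id","word_id","word","foil_id","foil"\n'
    '1,1,"dog",0,"lamp"\n'
    '1,2,"dog",1,"sock"\n'
    '1,3,"ran",0,"ink"\n'
)

# group foils by (sentence_id, word); a word repeated within one sentence
# would be merged under the same key in this simple sketch
foils = defaultdict(list)
for row in csv.DictReader(io.StringIO(example_output)):
    foils[(int(row["sentence_id"]), row["word"])].append(row["foil"])

print(dict(foils))
```

Note that, as the generation loop is written, `word_id` advances once per output row, so it is unique per foil rather than per word.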

In [24]:
user_goal = int(input("How many possible frequency-matched foils do you want to sample? "))
user_n = int(input("How many alternative foils do you want to save for each word? "))

outfile_name = input("What is the name of your output file? ")
# the with-block closes the file even if generation stops with an error
with open(outfile_name, mode='a', encoding='utf-8-sig', newline='') as f:
  writer = csv.writer(f, quotechar='"', quoting=csv.QUOTE_NONNUMERIC)
  counter = 1

  print('\nBeginning generation...\n')

  writer.writerow(['sentence_id', 'word_id', 'word', 'foil_id', 'foil'])
  for i, sentence in enumerate(sentences):
    result = find_alternative(sentence, sentence.split(), freq_window, verbose_mode=True, goal=user_goal, n_highest=user_n)
    for output in result:
      writer.writerow([i+1, counter] + output)
      counter += 1

How many possible frequency-matched foils do you want to sample? 100
How many alternative foils do you want to save for each word? 5
What is the name of your output file? out5

Beginning generation...


word:  גן
	Frequency found in list: 5.06
		found match: גב 4.8
		found match: דק 4.67
		found match: תל 5.5
		found match: אח 4.88
		found match: מר 5.2
		found match: בר 5.46
		found match: נס 4.73
		found match: זר 4.72
		found match: שב 4.94
		found match: חל 4.57
		found match: עמ 5.06
		found match: דף 5.0
		found match: אה 5.24
		found match: חן 5.04
		found match: קח 4.63
	There weren't enough words that matched the length and frequency in the given window.
	Adding samples with more relaxed length constraints.
		found match: צדק 4.87
		found match: הלב 5.16
		found match: גב 4.8
		found match: מאז 5.52
		found match: יצר 4.68
		found match: עלי 5.45
		found match: דק 4.67
		found match: הון 4.66
		found match: תל 5.5
		found match: יתר 4.97
		found match: שאל 4.76
		found match: 