### Import libraries

In [1]:
import json
import requests
import time
import csv
from tqdm import tqdm

### Load dataset

The data is downloaded from https://huggingface.co/datasets/albertxu/CrosswordQA/tree/main.

In [2]:
all_answers = []

with open('train.csv', 'r', newline='') as f:
    reader = csv.DictReader(f)
    for row in tqdm(reader):
        all_answers.append(row['answer'])

6420790it [00:11, 535958.75it/s]


In [3]:
diff_answers = list(set(all_answers))

diff_lengths = [len(a.split(' ')) for a in diff_answers]

In [4]:
num_of_len = [0 for _ in range(max(diff_lengths)+1)]

# The API does not accept phrases longer than 5 words, so inspect the 5-word answers
long_len = 5
long_answers = []

for i, l in enumerate(diff_lengths):
    num_of_len[l] += 1
    if l == long_len:
        long_answers.append(diff_answers[i])
    
print(num_of_len)

[0, 202286, 194293, 31784, 8157, 1079, 109, 14]
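
The tally above can also be sketched with `collections.Counter`, which avoids pre-sizing a list by the maximum length. This is an alternative formulation, not the notebook's own code; `word_count_histogram` and the toy `sample` list are hypothetical names and data used only for illustration.

```python
from collections import Counter

def word_count_histogram(answers):
    """Map number of space-separated words -> number of distinct answers."""
    return Counter(len(a.split(' ')) for a in set(answers))

# Toy data standing in for the `answer` column (illustrative only)
sample = ['too good to be true', 'too good to be true', 'as it was', 'rock']

hist = word_count_histogram(sample)

# Distinct answers with exactly five words, mirroring `long_answers`
five_word = sorted(a for a in set(sample) if len(a.split(' ')) == 5)

print(dict(hist))
print(five_word)
```

A `Counter` returns 0 for lengths that never occur, so no index bookkeeping is needed.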


### Preview the long answers

Some errors are immediately visible. For example, common words are left unsplit, as in 'your ownside of the bed'. This runs against the intuition given in the paper that splitting is harder for less common words.

In [5]:
long_answers[:10]

['as steady as a rock',
 'i think therefore i am',
 'how did you do that',
 'too good to be true',
 'your ownside of the bed',
 'the ballad of reading Gaol',
 'will you be my wife',
 'we have is the room',
 'they say you should live',
 'it is not the work']

### Use the Google Books ngram API to get the list of probabilities

In [6]:
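def get_probs(phrase):
    """Return the frequency time series for `phrase`, [] if absent, None on failure."""
    try:
        # Let requests URL-encode the phrase (spaces become '+')
        res = requests.get("https://books.google.com/ngrams/json",
                           params={"content": phrase,
                                   "year_start": 1800, "year_end": 2019,
                                   "case_insensitive": "true"},
                           timeout=10)
        res.raise_for_status()
        j = res.json()
    except (requests.RequestException, ValueError):
        print(f"Request for \"{phrase}\" failed!")
        return None
    if len(j):
        return j[0]['timeseries']
    return []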

In [7]:
possible_missplit = []

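for a in tqdm(long_answers[:200]):
    time.sleep(1)
    probs = get_probs(a)
    # None means the request failed; only an empty result marks a possible missplit
    if probs is not None and not probs:
        possible_missplit.append(a)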
        
print(possible_missplit)

100%|██████████| 200/200 [05:24<00:00,  1.62s/it]

['your ownside of the bed', 'we have is the room', 'they say you should live', 'as it has no blade', 'the catcher in the eggnog', 'then windup in the hole', 'they do when i hitthesack', 'the coldman and the sea', 'runs out of thy me', 'eat out to help out', 'chemist who was reading a', 'fish and chips and dip', 'best speeches of all time', 'all down on the job', 'claps of the roman empire', 'and he will stick his', 'a room with a viewing', 'idont know the key to', 'the belle of the brawl', 'ounce in a blue moon', 'ill have a cold one', 'not just a pretty cafe', 'the flit of the bumblebee', 'springs already at the gate', 'they say the chemicals in', 'a hand in the bird', 'was admitted as new crime', 'titans of comfort and joy', 'to the best of onesbelief', 'the art of her deal', 'says she can stopany time', 'it is the next note', 'Noble Stroman of the mall', 'a day in the ussr', 'and badnews about the car', 'rosesdie and that is that', 'with onesnose in the air', 'shed by how much i', '




There are at least three reasons for the missplits shown above:
- Common words are left unsplit (e.g. 'ownside', 'hitthesack').
- Words are misspelled.
- Apostrophes are omitted (e.g. 'idont', 'ill have a cold one').

At the least, this shows that the segmenter proposed in the paper has a counterintuitive weakness: it fails to split common words. However, many of the phrases returned above are not missplits at all, so my heuristic is not good enough either. These phrases are missing from the corpus simply because 5-grams are sparse. The results may change for bigrams or trigrams. Missplits should also show up among the unigrams, but the full list is too long to check through the API.
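
One way to probe the sparsity concern is to query shorter windows of each phrase instead of the whole 5-gram. The following is a minimal sketch of the windowing step only; `sliding_ngrams` is a hypothetical helper that is not part of the original notebook, and whether this actually reduces false positives is untested.

```python
def sliding_ngrams(phrase, n):
    """Return all contiguous n-word windows of `phrase`, in order."""
    words = phrase.split()
    return [' '.join(words[i:i + n]) for i in range(len(words) - n + 1)]

phrase = 'your ownside of the bed'

bigrams = sliding_ngrams(phrase, 2)
trigrams = sliding_ngrams(phrase, 3)

print(bigrams)
print(trigrams)
```

Each window could then be passed to `get_probs`, with the same one-second pause between requests. A window that comes back empty while its neighbours do not would point to where the missplit sits. With `n=1` the same helper yields the unigrams, although the number of requests grows quickly.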