In [1]:
import torch
import random
import numpy as np

## Sentence scorer

Load the language model GPT-2, used to score the "readability" of sentences.
I chose the package `lm_scorer` for its ease of use; a lower-level API would leave more room to optimize the code.

In [2]:
from lm_scorer.models.auto import AutoLMScorer as LMScorer



In [3]:
list(LMScorer.supported_model_names())

['gpt2', 'gpt2-medium', 'gpt2-large', 'gpt2-xl', 'distilgpt2']

In [4]:
torch.cuda.is_available()

True

In [5]:
device = "cuda:0" if torch.cuda.is_available() else "cpu"
lmscorer = LMScorer.from_pretrained("gpt2-large", device=device, batch_size=1)



In [6]:
ts = lmscorer.tokens_score("I like this package, a lot.")
ts

([0.014608712866902351,
  0.006463255267590284,
  0.08234534412622452,
  0.00042812313768081367,
  0.09876105189323425,
  0.005750838667154312,
  0.3983594477176666,
  0.6796929240226746,
  0.0018101882888004184],
 [40, 588, 428, 5301, 11, 257, 1256, 13, 50256],
 ['I', 'Ġlike', 'Ġthis', 'Ġpackage', ',', 'Ġa', 'Ġlot', '.', '<|endoftext|>'])
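
The first element of the tuple above holds one score per token. A minimal sketch of how these can be collapsed into a single sentence score, assuming (as the rest of this notebook does) that they are per-token probabilities. The values are rounded copies of the output above, and `sentence_log_score` is a helper name introduced here for illustration only:

```python
import math

# Rounded per-token probabilities copied from the output above.
token_probs = [0.0146, 0.00646, 0.0823, 0.000428, 0.0988,
               0.00575, 0.398, 0.680, 0.00181]

def sentence_log_score(probs, drop_last=False):
    """Sum of log-probabilities; optionally ignore the final '<|endoftext|>' token."""
    if drop_last:
        probs = probs[:-1]
    return sum(math.log(p) for p in probs)

full = sentence_log_score(token_probs)
without_eos = sentence_log_score(token_probs, drop_last=True)
print(full, without_eos)
```

Summing logarithms is equivalent to taking the logarithm of the product of the probabilities, and it is easier to compare across sentences.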

## Build patterned words

For each letter of the alphabet, build the list of words that have the right "pattern" or "shape" to represent the Morse code of this letter (see readme.txt).

#### Build from scratch

In [7]:
patterned_words = []

import re
wordlist = "/usr/share/dict/words"
with open(wordlist) as infile:
    dico = infile.readlines()

dash_regexp = "[bdfghjklpqtyBDFGHJKLPQTY]"
dot_regexp  = "[aceimnorsuvwxz]"

with open("morse_code.txt") as infile:
    for line in infile:
        letter = line[0]
        code = line[2:-1]
        print(letter, code, "  ", end='')
        code_regexp = code.replace('.', dot_regexp).replace('-', dash_regexp)
        regexp = re.compile('^' + code_regexp + "\n")
        words_current_letter = []
        for word in dico:
            if regexp.match(word):
                words_current_letter.append(word.strip())
        patterned_words.append(words_current_letter)
        # optimization: instead of processing all words for each letter,
        # process all words only once, and for each word append it to the right list.
        # The current snippet executes almost instantly anyway, and only once.

A .-   B -...   C -.-.   D -..   E .   F ..-.   G --.   H ....   I ..   J .---   K -.-   L .-..   M --   N -.   O ---   P .--.   Q --.-   R .-.   S ...   T -   U ..-   V ...-   W .--   X -..-   Y -.--   Z --..   
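
The single-pass optimization mentioned in the code comment could look roughly like the sketch below. It is not what the notebook runs: it computes the dot/dash pattern of each word once and looks the pattern up in a dictionary. The names `word_pattern` and `group_by_code`, the three-letter Morse sample and the toy word list are all assumptions made for illustration.

```python
from collections import defaultdict

# Same character classes as dash_regexp and dot_regexp.
DASH = set("bdfghjklpqtyBDFGHJKLPQTY")
DOT = set("aceimnorsuvwxz")

def word_pattern(word):
    """Return the dot/dash shape of a word, or None if a character fits neither class."""
    symbols = []
    for ch in word:
        if ch in DASH:
            symbols.append('-')
        elif ch in DOT:
            symbols.append('.')
        else:
            return None
    return ''.join(symbols)

def group_by_code(words, morse):
    """Visit every word once and append it to the list of the matching letter."""
    code_to_letter = {code: letter for letter, code in morse.items()}
    groups = defaultdict(list)
    for word in words:
        letter = code_to_letter.get(word_pattern(word))
        if letter is not None:
            groups[letter].append(word)
    return groups

morse_sample = {"A": ".-", "B": "-...", "N": "-."}  # small excerpt only
groups = group_by_code(["at", "bean", "to", "Zed", "xyz"], morse_sample)
print(dict(groups))
```

Words whose pattern matches no code in the table, or that contain a character outside both classes, are skipped.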

Print the result

In [8]:
for i in range(len(patterned_words)):
    print(chr(i+97).capitalize(), end=': ')
    print(' '.join(patterned_words[i]))

A: ad ah at ay cf ch ct ed eh id if it mg ml my of oh op sh sq uh up wk wt
B: Baez Barr Bass Baum Bean Beau Benz Bern Bess Biro Boas Boer Bonn Bono Born Boru Bose Bran Brie Brno Bros Burr Dana Dane Dare Dave Dawn Dean Dena Deon Devi Diaz Diem Dina Dino Dion Dior Dona Donn Dora Drew Dunn Duse Fern Finn Fran Gaea Gaia Gaza Gena Gene Gere Gina Gino Giza Gore Gris Grus Guam Gwen Haas Hans Hera Hess Hiss Horn Howe Hume Huns Jain Jame Jami Jana Jane Java Jean Jeri Jess Jews Joan Joni Jose Jove Juan June Juno Kama Kane Kano Kans Kara Kari Karo Keri Kern Kerr Kiev Knox Kris Kroc Kwan Lana Lane Laos Lara Lars Laue Lean Lear Lena Leno Leon Leos Lesa Levi Lima Lina Lisa Liza Lois Lome Lora Lori Love Lowe Luce Luis Luna Luvs Lvov Paar Pace Parr Pena Penn Perm Peru Pisa Pius Pres Puzo Tami Tara Tass Teri Terr Tess Tina Toni Tran Tues Tums Yacc Ymir Yuan Yuma Yuri Yves baas bane bani bans bare barn bars base bass beam bean bear beau been beer bees berm bias bier bins boar boas bone boom boon boor bo

#### Re-read after manual selection, with commas

Read the result, manually select interesting words (or clean up too obscure words), and rebuild the list `patterned_words`.

In the input file, the 20th Morse code, '-' (the letter T), is handled specially: among the one-letter candidates "b d f g h j k l p q t y", the only one that seems usable is "'t". So we merge this code with the previous one and offer the two choices "can't" and "won't" to encode the pair of codes "... -".

In [9]:
patterned_words = []
with open("patterned_words_en3_manual-selection.txt") as infile:
    lines = infile.readlines()
for line in lines:
    patterned_words.append(line.strip().split(' '))

For each word, try with and without a comma in front:

In [10]:
patterned_words = [patterned_words[0]] + [
    [' '+w for w in pw] + [', '+w for w in pw]
    for pw in patterned_words[1:] ]
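
To make the effect of this expansion concrete, here is the same transformation applied to a toy list. The helper name `with_comma_variants` and the toy words are illustrative assumptions, not part of the notebook's data:

```python
def with_comma_variants(word_lists):
    """Keep the first list as is; for the others, offer ' word' and ', word'."""
    first, rest = word_lists[0], word_lists[1:]
    return [first] + [
        [' ' + w for w in ws] + [', ' + w for w in ws]
        for ws in rest
    ]

toy = [["It", "If"], ["does", "goes"], ["take"]]
expanded = with_comma_variants(toy)
print(expanded)
```

Each list after the first doubles in size, and the leading space or comma is stored with the word, so a sentence can be built by plain string concatenation.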

Print the result

In [11]:
for i in range(len(patterned_words)):
    letter = i+97
    if i>=19: letter = letter+1
    print(f'{chr(letter).capitalize()} ({len(patterned_words[i])}): ', end='')
    print('|'.join(patterned_words[i]))

A (13): Ad|Ah|At|Ay|Ed|Eh|If|It|My|Of|Oh|Uh|Up
B (624):  Bean| Beau| Bonn| Dana| Dave| Dean| Dora| Gaia| Gwen| Hans| Hera| Horn| Hume| Jain| Jean| Jess| Jose| Jove| June| Kiev| Laos| Lara| Lars| Lear| Lena| Leon| Lisa| Lois| Luis| Peru| Pisa| Tina| Toni| Yves| bans| bare| barn| bars| base| bass| beam| bean| bear| beau| been| beer| bees| bias| bins| boar| boas| bone| boom| bore| born| boss| bows| brew| brim| buns| burn| burr| burs| buzz| dame| damn| dams| dare| darn| dawn| dean| dear| deem| deer| demo| dens| dice| dies| dime| dims| dine| dire| disc| diva| dive| docs| doer| does| dome| done| doom| door| dorm| dose| down| doze| draw| drew| drum| dues| dune| duos| face| fair| fame| fans| fare| farm| fawn| faze| fear| fees| fern| fine| fins| fire| firm| firs| five| foam| foes| fore| form| four| free| from| fume| furs| fuse| fuss| fuzz| gain| game| gave| gaze| gear| gems| gene| gens| germ| gins| give| gnaw| gnus| goes| gone| gore| gown| gram| grew| grim| grin| grow| gums| guns| guru| hair| h

## Main

Now that we have `lmscorer` and `patterned_words`, let's write the main algorithm `draw_one_sentence` as described in the readme.

In [12]:
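# Draw an index at random, with probability proportional to the weight at this index.
def weighted_random(probas):
    # assumes sum(probas)=1 (up to rounding)
    r = random.uniform(0, 1)
    for i, p in enumerate(probas):
        r = r - p
        if r < 0:
            return i
    # rounding can leave r >= 0 after the last weight: fall back to the last index
    return len(probas) - 1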

test = [0,0,0,0]
for i in range(10_000):
    r = weighted_random([.1, .3, .4, .2])
    test[r] = test[r] +1
test

[991, 2989, 3953, 2067]
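
For comparison, the standard library offers the same kind of draw through `random.choices`, which accepts weights that need not sum to 1. A minimal sketch of the same experiment (the seed is fixed only to make the run repeatable):

```python
import random
from collections import Counter

random.seed(0)  # fixed seed so that the run is repeatable
weights = [.1, .3, .4, .2]

# Draw 10,000 indices, each with probability proportional to its weight.
draws = random.choices(range(len(weights)), weights=weights, k=10_000)
counts = Counter(draws)
print([counts[i] for i in range(len(weights))])
```

The counts should come out roughly proportional to the weights, as in the test above.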

In [13]:
def normalize(probas):
    s = sum(probas)
    return [p/s for p in probas]

In [14]:
all_sentences_drawn = []

In [15]:
def draw_one_sentence():
    result = ""
    for letter in range(len(patterned_words)):
        print(chr(97+letter+(1 if letter>=19 else 0)), end='')
        # recall that the 20th codepoint is special in patterned_words_en3_manual-selection.txt
    
        # score all possible words:
        probas = []
        for w in patterned_words[letter]:
            tokens_scores = lmscorer.tokens_score(result + w)
            tokens_scores = tokens_scores[0][0:-1] # remove the score of the '<|endoftext|>' token
            proba = np.prod(tokens_scores)
            probas.append(proba)
        probas = normalize(probas)
    
        # pick a word:
        chosen_index = weighted_random(probas)
        chosen_word = patterned_words[letter][chosen_index]
    
        # add it to the growing sentence:
        result = result + chosen_word
    print()
    scores = lmscorer.tokens_score(result)[0]
    ret = (sum([np.log(x) for x in scores]), result)
    all_sentences_drawn.append(ret)
    return ret
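
A note on the `np.prod` step: the product of token probabilities shrinks quickly as the sentence grows. It stays representable for the sentences drawn here, but for much longer texts it could underflow to 0 and make `normalize` divide by zero. One possible safeguard, sketched under the assumption that log-probabilities are available (the helper name `normalize_log_scores` and the numbers are hypothetical), is to normalize in log space:

```python
import math

def normalize_log_scores(log_scores):
    """Turn log-probabilities into weights summing to 1, without underflow."""
    m = max(log_scores)
    # Subtracting the maximum keeps every exponent <= 0 and the largest term at 1.
    shifted = [math.exp(s - m) for s in log_scores]
    total = sum(shifted)
    return [x / total for x in shifted]

# Hypothetical log-scores, too negative for exp() to represent directly.
log_scores = [-1000.0, -1000.0 + math.log(3.0)]
weights = normalize_log_scores(log_scores)
print(weights)
```

The second candidate is three times as likely as the first, and the resulting weights preserve that ratio even though `math.exp(-1000.0)` alone evaluates to 0.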

See how it works for one sentence:

In [16]:
%%time
draw_one_sentence()

abcdefghijklmnopqrsuvwxyz
CPU times: user 1min 47s, sys: 8.4 s, total: 1min 55s
Wall time: 1min 23s


(-177.95977919605784,
 "If your bags for a ride the nose is ugly, but when by do fly eggs that she won't set sail off back baby blue.")

Run it many times and sort to read the best results:

In [17]:
for i in range(300):
    draw_one_sentence()

abcdefghijklmnopqrsuvwxyz
abcdefghijkl

In [18]:
for score, sentence in sorted(all_sentences_drawn, reverse=True):
    print(f'{score:.5} {sentence}')

-158.46 It does take two a side, the case is ugly but also by be fly able that she can't say much ill luck till then.
-166.88 If your bike has a side the rear is ugly, but when by to fly, adds that she won't say much, add good gold flow.
-169.64 At home, like her s says the news is ugly but also PG, be fly able that she won't say much off hand till then.
-169.72 It does take her a mile the race as edgy, but when by do fly eggs that she can't cut away off just hold them.
-170.17 It does take two a side, the same as myth, but when Lt Jo FTP adds that she can't say much, all just talk then.
-170.7 If Jean John has a mole, the news is ugly, but when BB Jr BBQ adds that she won't eat meat all beef till then.
-170.72 It does take two a side, the ever so myth fed upon by to fly eggs that she won't rip much off good lady blue.
-170.89 It does take too a wide the view on ugly, but when Ty to BBQ adds that she can't say much ill luck till then.
-171.12 If your bike has a make the rear is ugly bu

## Possible improvements
- Don't re-tokenize and re-score the whole sentence each time a word is added: process only the added word.
- Handle the comma as one possible token: would it be faster? If a comma gets selected, don't advance the word counter pointing into `patterned_words`.
- Score without the `<|endoftext|>` token? But `lmscorer.tokenizer` seems unable to delegate the tokenisation.
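
Regarding the first item, a small sketch of why scoring only the added word should give the same sampling weights: the probability of the sentence so far is a common factor of every candidate and cancels during normalization. All numbers below are hypothetical, and the sketch assumes that each candidate's score factors exactly into the prefix score times the conditional score of the added word.

```python
import math

def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical numbers, for illustration only.
prefix_logprob = -12.0              # log P(sentence so far)
cond_logprobs = [-2.0, -3.5, -5.0]  # log P(candidate word | sentence so far)

# Weights from the full score of "sentence so far + candidate" ...
full = normalize([math.exp(prefix_logprob + c) for c in cond_logprobs])
# ... and weights from the score of the added word alone.
incremental = normalize([math.exp(c) for c in cond_logprobs])
print(full)
print(incremental)
```

Both lists agree up to rounding, so only the tokens of the added word would need to be scored at each step.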