# Quordle Starting Words

Quordle starting strategy differs from Wordle's: rather than trying to split the set of possible words in half, you want to cover the most common letter combinations so you get as far as you can out of the gate. This makes the search for the best starting words easier.

In [1]:
!wget https://github.com/htrc/HTRC-Useful-Datasets/raw/master/token-frequencies/eng-vocab-1.txt.bz2

--2022-04-20 15:57:36--  https://github.com/htrc/HTRC-Useful-Datasets/raw/master/token-frequencies/eng-vocab-1.txt.bz2
Resolving github.com (github.com)... 140.82.112.4
Connecting to github.com (github.com)|140.82.112.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/htrc/HTRC-Useful-Datasets/master/token-frequencies/eng-vocab-1.txt.bz2 [following]
--2022-04-20 15:57:36--  https://raw.githubusercontent.com/htrc/HTRC-Useful-Datasets/master/token-frequencies/eng-vocab-1.txt.bz2
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 731024 (714K) [application/octet-stream]
Saving to: ‘eng-vocab-1.txt.bz2’


2022-04-20 15:57:36 (15.4 MB/s) - ‘eng-vocab-1.txt.bz2’ saved [731024/731024]



In [2]:
import pandas as pd
import string

truncate_at = 100000

words = pd.read_csv('eng-vocab-1.txt.bz2', compression='bz2', names=['word'])
words = words.iloc[:truncate_at]
words = words[words.word.astype(str).apply(len) == 5]
words = words[words.word.str.contains('^[a-z]+$')]
words

Unnamed: 0,word
22,which
39,their
44,there
46,other
51,would
...,...
99962,suger
99979,bronn
99980,tacts
99985,hunde


Strategy: use the 10 most likely characters in the first two words, selecting the words that represent those common letters in the most likely positions.

In [5]:
words[['pos1', 'pos2', 'pos3', 'pos4', 'pos5']] = words.word.str.split('', expand=True).iloc[:, 1:6]
words

Unnamed: 0,word,pos1,pos2,pos3,pos4,pos5
22,which,w,h,i,c,h
39,their,t,h,e,i,r
44,there,t,h,e,r,e
46,other,o,t,h,e,r
51,would,w,o,u,l,d
...,...,...,...,...,...,...
99962,suger,s,u,g,e,r
99979,bronn,b,r,o,n,n
99980,tacts,t,a,c,t,s
99985,hunde,h,u,n,d,e


Adjustment: Wordle answers seem to be dictionary words, so a trailing 's' as a plural and a trailing 'ed' as a verb form never seem to show up. When calculating character stats, keep only a random selection of those words, in proportion to how many of them look like valid answers.

Based on a manual review of random samples, I count 9 of 70 words ending in 's' and 4 of 60 words ending in 'ed' as valid words.

In [6]:
random_exclude = []
random_exclude += words[words.word.str.endswith('ed')].sample(frac=(60-4)/60).word.tolist()
random_exclude += words[words.word.str.endswith('s')].sample(frac=(70-9)/70).word.tolist()
len(random_exclude)

2006
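
As a quick sanity check on the sampling fractions above, here is the arithmetic on its own. The variable names are mine and only the four counts come from the manual review.

```python
# Manual review counts from the text: valid words / words sampled.
valid_ed, sampled_ed = 4, 60
valid_s, sampled_s = 9, 70

# Fraction of each group treated as invalid and therefore excluded.
exclude_frac_ed = (sampled_ed - valid_ed) / sampled_ed
exclude_frac_s = (sampled_s - valid_s) / sampled_s

print(f"exclude {exclude_frac_ed:.1%} of '-ed' words")
print(f"exclude {exclude_frac_s:.1%} of '-s' words")
```

Since `sample` is random, the exact contents of `random_exclude` change between runs; passing `random_state` to `sample` would make the list reproducible.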

In [7]:
starters_to_generate = 2  # for Sedecordle, 3 or 4 would be better
trunc_words = words[~words.word.isin(random_exclude)]
allchars = pd.concat([trunc_words[f'pos{x}'] for x in range(1, 6)])
charcounts = allchars.value_counts()
topx = charcounts.head(5*starters_to_generate).index.tolist()
topx

['e', 'a', 'r', 'o', 'n', 'i', 'l', 't', 's', 'c']
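
To illustrate what the counting step does, here is a minimal pure-Python sketch on a toy list (the first five words of the table above). It uses `collections.Counter` in place of pandas `value_counts`, so it is an analogue rather than the notebook's code.

```python
from collections import Counter

# Toy vocabulary: the first five rows of the word table above.
toy_words = ["which", "their", "there", "other", "would"]

# Count every character across all positions, like allchars.value_counts().
char_counts = Counter("".join(toy_words))

# Take the most common letters, like charcounts.head(n).
top_letters = [c for c, _ in char_counts.most_common(3)]
print(char_counts["h"], char_counts["e"])
print(top_letters)
```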

Adjustment: I would argue that if a vowel is still unknown after the first two words, you can likely discover it in a later guess. So let's drop 'e', 'u', and 'r' from the pool before picking the top letters.

In [9]:
topx = charcounts[~charcounts.index.isin(list('eur'))].head(5*starters_to_generate).index.tolist()
topx_re = '[{}]'.format("".join(topx))
topx_re

'[aoniltscmd]'
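
The character class above is later anchored as `^[...]+$` to keep only words made entirely of those letters. A small sketch with the standard `re` module shows the effect; the three test words are picked from tables in this notebook.

```python
import re

top_letters = list("aoniltscmd")  # the letters shown in the output above
char_class = "[{}]".format("".join(top_letters))

# Anchored version: every character of the word must be a top letter.
only_top_letters = re.compile(f"^{char_class}+$")

for w in ["claim", "stand", "purge"]:
    print(w, bool(only_top_letters.match(w)))
```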

In [10]:
pos_chars = words[['word']+[f'pos{x}' for x in range(1,6)]].melt(id_vars='word', var_name='pos', value_name='char')
a = pos_chars[~pos_chars.word.isin(random_exclude)].loc[pos_chars.char.str.isalpha(), ['pos', 'char']].value_counts().reset_index()
char_stats = a.groupby('char').apply(lambda x: x.set_index('pos').to_dict()[0]).to_dict()
char_stats['a']

{'pos1': 616, 'pos2': 1570, 'pos3': 802, 'pos4': 735, 'pos5': 822}
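
For readers who find the melt/groupby chain hard to follow, here is a rough pure-Python sketch that builds the same kind of nested mapping (character, then position, then count) from a toy word list. The name `position_counts` is mine, and the counts are for the toy list only.

```python
from collections import defaultdict

toy_words = ["state", "stand", "coast"]

# position_counts[char]["posN"] = number of words with `char` at position N (1-based)
position_counts = defaultdict(lambda: defaultdict(int))
for word in toy_words:
    for i, ch in enumerate(word, start=1):
        position_counts[ch][f"pos{i}"] += 1

print(dict(position_counts["s"]))
print(dict(position_counts["t"]))
```

One difference worth noting: in a plain dict like `char_stats`, a character that never occurs at some position simply has no key for it, so a lookup for that position can fail.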

Assign a score to each word by summing, for each character, how often that character occurs at its position.

In [11]:
def score_word(word):
    assert len(word) == 5
    chars = list(word)
    score = 0
    for pos, char in enumerate(chars):
        score += char_stats[char]['pos' + str(pos + 1)]  # column names are 1-based (pos1..pos5)
    return score

words['score'] = words.word.apply(score_word)
words.iloc[:10]

Unnamed: 0,word,pos1,pos2,pos3,pos4,pos5,score
22,which,w,h,i,c,h,2131
39,their,t,h,e,i,r,3007
44,there,t,h,e,r,e,3910
46,other,o,t,h,e,r,2518
51,would,w,o,u,l,d,2762
57,these,t,h,e,s,e,3753
78,under,u,n,d,e,r,2639
86,about,a,b,o,u,t,2233
87,state,s,t,a,t,e,4496
91,first,f,i,r,s,t,3228
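
Here is a small worked example of the scoring rule, written as a variant that treats an unseen character/position pair as 0 instead of raising a `KeyError`. The function name `score_word_safe` and the `'b'` entry are made up for illustration; only the `'a'` counts come from the output above.

```python
def score_word_safe(word, stats):
    """Sum the positional counts of each character; unseen pairs count as 0."""
    assert len(word) == 5
    return sum(
        stats.get(ch, {}).get(f"pos{i}", 0)
        for i, ch in enumerate(word, start=1)
    )

# Only the 'a' entry is taken from the notebook output; 'b' is hypothetical.
toy_stats = {
    "a": {"pos1": 616, "pos2": 1570, "pos3": 802, "pos4": 735, "pos5": 822},
    "b": {"pos1": 100},
}

# 100 + 1570 + 802 + 735 + 822
print(score_word_safe("baaaa", toy_stats))
```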


Find words composed only of the top letters.

In [12]:
topx_words = words[words.word.str.contains(f"^{topx_re}+$")]

# for starter words, avoid duplicate characters
repeats = pos_chars[['word', 'char']].value_counts()
repeats = repeats[repeats > 1]
repeats = repeats.index.get_level_values(0).tolist()
print(repeats[:3])
topx_words = topx_words[~topx_words.word.isin(repeats)]
topx_words.head(3)

['aaaaa', 'nnnnn', 'ooooo']


Unnamed: 0,word,pos1,pos2,pos3,pos4,pos5,score
925,claim,c,l,a,i,m,2899
943,lands,l,a,n,d,s,3401
1180,stand,s,t,a,n,d,2935
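
The repeated-letter filter can also be expressed per word without going through `pos_chars`. This is a sketch of an equivalent check, not the code used above, and `has_repeated_letter` is a name I made up.

```python
def has_repeated_letter(word):
    """True if any character occurs more than once in the word."""
    return len(set(word)) < len(word)

for w in ["claim", "state", "bronn"]:
    print(w, has_repeated_letter(w))
```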


Now find a partner for each word: the first word in the top-letter list that shares no letters with it.

In [13]:
def best_word_match(word):
    exclude_chars = list(word)
    candidate_words = pos_chars[~pos_chars.char.isin(exclude_chars)].word.value_counts()
    candidate_words = candidate_words[candidate_words == 5].index.tolist()
    search = topx_words[topx_words.word.isin(candidate_words)]
    #return search
    if search.empty:
        return None
    else:
        return search.iloc[0]['word']

best_word_match('lanes')

'dicto'
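
The matching logic boils down to "first candidate with no letters in common". A minimal set-based sketch, assuming the candidate list is already in preference order (`first_disjoint_partner` and `toy_candidates` are hypothetical names):

```python
def first_disjoint_partner(word, candidates):
    """Return the first candidate sharing no letters with `word`, else None."""
    used = set(word)
    for cand in candidates:
        if used.isdisjoint(cand):
            return cand
    return None

# Toy candidate list in preference order; words taken from the tables above.
toy_candidates = ["claim", "lands", "stand", "coast", "dicto"]
print(first_disjoint_partner("lanes", toy_candidates))
```

Because the search set has only ten letters and starter words avoid repeats, a word built from the top letters can only pair with a word using exactly the other five, which is why many rows end up without a partner.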

In [14]:
topx_words['partner'] = topx_words.word.apply(best_word_match)
topx_words.head(4)

Unnamed: 0,word,pos1,pos2,pos3,pos4,pos5,score,partner
925,claim,c,l,a,i,m,2899,
943,lands,l,a,n,d,s,3401,
1180,stand,s,t,a,n,d,2935,
1287,coast,c,o,a,s,t,3831,


In [15]:
partnered = topx_words.merge(topx_words[['score', 'word']], left_on='partner', right_on='word')[['word_x', 'word_y', 'score_x', 'score_y']]
partnered['total_score'] = partnered.score_x + partnered.score_y
partnered = partnered.sort_values('total_score', ascending=False)
# most pairs show up twice, as (a, b) and (b, a), so display every second row
partnered.iloc[::2]

Unnamed: 0,word_x,word_y,score_x,score_y,total_score
8,salim,contd,4165,3628,7793
32,conld,maist,3513,4023,7536
9,malis,contd,3782,3628,7410
19,colds,matin,3178,4184,7362
7,simla,contd,3537,3628,7165
20,scold,matin,2486,4184,6670
1,midst,lacon,2838,3665,6503
34,condi,smalt,3317,3181,6498
26,monti,scald,3549,2741,6290
35,molti,scand,3419,2831,6250


Some words from the tail of the vocabulary may not be in the Quordle dictionary. First restrict to words near the head of the vocabulary, then manually exclude words known not to be in the dictionary.

In [16]:
words_not_in_quordle = ['latin', 'contd']
focus_on_top_n = 40000
include_words = words.word.loc[:focus_on_top_n].tolist()
final = partnered[partnered[['word_x', 'word_y']].isin(include_words).all(axis=1)]
final[~final[['word_x', 'word_y']].isin(words_not_in_quordle).any(axis=1)]

Unnamed: 0,word_x,word_y,score_x,score_y,total_score
19,colds,matin,3178,4184,7362
29,matin,colds,4184,3178,7362
20,scold,matin,2486,4184,6670
27,colts,admin,3510,2599,6109
10,admin,colts,2599,3510,6109
17,actin,molds,2787,3078,5865
14,molds,actin,3078,2787,5865
28,clots,admin,2640,2599,5239
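
The final table still lists some pairs in both orders (for example colds/matin and matin/colds). One way to collapse those, sketched in plain Python on rows copied from the table, is to key each pair by an order-independent `frozenset`. This is an alternative to slicing every second row, not what the notebook does.

```python
# (word_x, word_y, total_score) rows copied from the table above.
pairs = [
    ("colds", "matin", 7362),
    ("matin", "colds", 7362),
    ("scold", "matin", 6670),
    ("colts", "admin", 6109),
    ("admin", "colts", 6109),
]

seen = set()
unique_pairs = []
for a, b, total in pairs:
    key = frozenset((a, b))  # same key for (a, b) and (b, a)
    if key not in seen:
        seen.add(key)
        unique_pairs.append((a, b, total))

for row in unique_pairs:
    print(row)
```

Unlike `iloc[::2]`, this does not depend on mirrored rows being adjacent after sorting.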


Find high-scoring words among those that use none of the top letters. Really, there is only one option.

In [17]:
# find high scoring remaining words
others = words[~words.word.str.contains(topx_re)].sort_values('score', ascending=False)
others[~others.word.isin(repeats)].head(10)

Unnamed: 0,word,pos1,pos2,pos3,pos4,pos5,score
55840,burge,b,u,r,g,e,4140
22262,purge,p,u,r,g,e,4094
7112,burke,b,u,r,k,e,4064
52285,furze,f,u,r,z,e,3781
65657,kurve,k,u,r,v,e,3618
53983,kurze,k,u,r,z,e,3559
58986,grube,g,r,u,b,e,3420
7750,buyer,b,u,y,e,r,3375
70969,heury,h,e,u,r,y,3360
67160,huger,h,u,g,e,r,3295


## Solver

A little solver that takes the guesses so far, a regex of the green characters, and the known yellow characters.

In [43]:
from functools import reduce
words_tried = ['acrid', 'merry', 'woman', 'about', 'botch', 'level', 'verse', 'fatty', 'colin', 'stare', 'scarf', 'bluff', 'shelf', 'bushy', 'scope', 'learn', 'scold'] #@param {type:'raw'}
yellow_chars = '' #@param {type: 'string' }
#@markdown For green characters, use a '.' to represent unknown positions, e.g. `.ea..`
green_char_re = 'li.en' #@param {type: 'string' }
assert len(green_char_re) == 5
top_n = 120000 #@param {type: 'number' }

chars_tried = list(set("".join(words_tried)))
grey_chars = "".join([c for c in chars_tried if c not in (yellow_chars+green_char_re)])

full_regex = ''
for i, char in enumerate(green_char_re):
    if char != '.':
        full_regex += char
        continue
    exclude_chars = "".join(set([x[i] for x in words_tried if x[i] not in grey_chars]))
    if len(exclude_chars):
        full_regex += f"[^{exclude_chars}]"
    else:
        full_regex += char
full_regex = f"^{full_regex}$"
print(full_regex)

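candidates = words[words.word.str.contains(full_regex)]
# an empty character class '[]' is not a valid regex, so only filter when there are grey letters
if len(grey_chars):
    candidates = candidates[~candidates.word.str.contains(f'[{grey_chars}]')]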

if len(yellow_chars):
    yellow_contains = reduce(lambda x,y: x & candidates.word.str.contains(y), yellow_chars[1:], candidates.word.str.contains(yellow_chars[0]))
    candidates = candidates[yellow_contains]
candidates.loc[:top_n].sort_values('score', ascending=False)

^li[^le]en$


Unnamed: 0,word,pos1,pos2,pos3,pos4,pos5,score
7855,linen,l,i,n,e,n,4159
58814,liken,l,i,k,e,n,3453
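
For reference, here is a self-contained sketch of the same filtering idea in plain Python, without the `words` DataFrame or the form parameters. The function names and the tiny vocabulary are made up for illustration, and the guesses are a subset of the list above. Banned letters are sorted, so the character class may be ordered differently from the printed regex above.

```python
import re

def build_position_regex(green, guesses, yellow=""):
    """Return (regex, grey_letters) for one board."""
    assert len(green) == 5
    known = set(green.replace(".", "")) | set(yellow)
    grey = set("".join(guesses)) - known
    parts = []
    for i, ch in enumerate(green):
        if ch != ".":
            parts.append(ch)  # green: letter is fixed at this position
            continue
        # Known letters already guessed at this position cannot be here.
        banned = sorted({g[i] for g in guesses if g[i] in known})
        parts.append("[^{}]".format("".join(banned)) if banned else ".")
    return "^" + "".join(parts) + "$", grey

def filter_candidates(vocab, green, guesses, yellow=""):
    regex, grey = build_position_regex(green, guesses, yellow)
    pattern = re.compile(regex)
    return [
        w for w in vocab
        if pattern.match(w)
        and not (set(w) & grey)          # contains no grey letters
        and all(y in w for y in yellow)  # contains every yellow letter
    ]

guesses = ["colin", "shelf", "learn"]
regex, grey = build_position_regex("li.en", guesses)
print(regex)
print(filter_candidates(["linen", "liken", "lilac", "lumen"], "li.en", guesses))
```

Using a set for the grey letters sidesteps the empty-character-class problem that a regex like `[]` would run into when no letters have been ruled out yet.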

