---
title: "Champagne for my real friends"
format: 
  html:
    code-fold: true
    toc: true
toc-depth: 2
jupyter: python3
---


# Champagne for my real friends
## Real pain for my sham friends

What a catchy slogan. Let's find some more -- and learn about classical Natural Language Processing along the way.

::: {.callout-tip collapse="true"}
## What is classical Natural Language Processing? Why not use ChatGPT?

So-called classical natural-language processing relies on computational representations of the kinds of grammar concepts used by linguists. Think back to middle-school English class. You learned about nouns and verbs and how to diagram a sentence. Natural language processing used to be a lot of that.

This differs from AI-based natural language processing, whose representation of language tends to be more of black box. That black box represeentation is good for a lot of things, but it's famously terrible at computational humor.

Here's what ChatGPT gave me when I asked it for _can you come up with some variants on the "champagne for my real friends, real pain for my sham friends?" template, but using different words than champagne, sham and pain?_.

1. "Laughter for my true friends, heartfelt tears for my false friends."
2. "Confetti for my loyal friends, bitter truth for my deceitful friends."
3. "Hugs for my genuine friends, icy distance for my deceptive friends."
4. "Sunshine for my authentic friends, thunderstorms for my treacherous friends."

Those are terrible! We can do better.
:::

## introduction to the CMU Pronouncing Dictionary

the CMU Pronouncing Dictionary is a longstanding natural language processing resource that lists words and their pronunciations. If we're gonna find more slogans like _champagne for my real friends, real pain for my sham friends_, we'll have to find words like _champagne_ that sound like compounds of a word like _sham_ and a word like _pain_. 

Another example of what we're looking for is _focus for my real friends, real cuss for my faux friends_. In this example, _focus_ is our **champagne word**, and it's composed of _faux_ (a **sham word**) and _cuss_ (a **pain word**).

Look below to see what the pronunciation of _champagne_ looks like, and how it's made up of the pronunciations of _sham_ and _pain_.

In [1]:
!pip install --disable-pip-version-check -q cmudict # install cmudict
import cmudict
cmudict_dict = cmudict.dict()

for word in ["champagne", "sham", "pain"]:
    print("pronunciation of {} is {}".format(word, cmudict_dict[word]))


pronunciation of champagne is [['SH', 'AE0', 'M', 'P', 'EY1', 'N']]
pronunciation of sham is [['SH', 'AE1', 'M']]
pronunciation of pain is [['P', 'EY1', 'N']]


In sum, we're looking for a **sham word** and a **pain word** that are both negative, but when combined, make a positive **champagne word**.

## Cleaning up the pronunciations

The pronunciations in our example up there almost match.

```
champagne: [['SH', 'AE0', 'M', 'P', 'EY1', 'N']]
sham:      [['SH', 'AE1', 'M']]
pain:                        [['P', 'EY1', 'N']]
```

But not quite. The problem is that the vowel in _sham_ (`AE1`) isn't quite the same as the first vowel in _champagne_ (`AE0`). The numbers represent stress. `1` is primary stress; `2` is secondary, and `0` is no stress. We don't really care about stress for our joke format, so we will "clean" the data to ignore stress.

In [2]:
def remove_stress(phon):
    return ''.join(i for i in phon if not i.isdigit())

def clean_phonemes(pron):
    pron = tuple(remove_stress(phon) for phon in pron) # remove stress        
    return pron

# map nouns to their "cleaned" pronunciations
word_pronunciations = {}
for word, prons in cmudict_dict.items():
    word_pronunciations[word] = [clean_phonemes(pron) for pron in prons]


## Okay, let's go.

Let's use "faux" as our sham word. So we're looking for phrases like _faux pain for my real friends, real pain for my faux friends_... except where _faux pain_ is a real word.

Due to an error in the CMU pronouncing dictionary, we have to specify that _faux_ is actually pronounced like _foe_. (Not like _fox_.)

Let's find our champagne words, but starting with the sound "faux".

In [3]:
SHAM_WORD = "faux"
SHAM_WORD_OVERRIDE = "foe" # None or "foe" # in case the pronunciation of the TARGET_WORD is wrong, as it oddly is for "faux"

In [4]:

SHAM_SYLLABLE = word_pronunciations[SHAM_WORD_OVERRIDE or SHAM_WORD][0]

def find_candidate_champagne_words(word_prons, sham_syllable=SHAM_SYLLABLE):
    """
    champagne words have to start with the same sounds as the cham word (but can't be identical, it has to be longer!)
    """
    word, prons = word_prons # destructuring
    return any([                                              # candidate words must:
               pron[:len(sham_syllable)] == sham_syllable and # start with same sounds
               pron != sham_syllable                          # but be different
            for pron in prons])

nouns_starting_with_sham_word = list(filter(find_candidate_champagne_words, word_pronunciations.items()))
print("Here are words (our candidate champagne words) that start with {}".format(SHAM_SYLLABLE))
for noun, prons in nouns_starting_with_sham_word:
    print(f" - {noun}: {prons}")

Here are words (our candidate champagne words) that start with ('F', 'OW')
 - faucette: [('F', 'OW', 'S', 'EH', 'T')]
 - faucheux: [('F', 'OW', 'SH', 'OW')]
 - faupel: [('F', 'OW', 'P', 'EH', 'L')]
 - fauteux: [('F', 'OW', 'T', 'OW')]
 - foal: [('F', 'OW', 'L')]
 - foale: [('F', 'OW', 'L')]
 - foale's: [('F', 'OW', 'L', 'Z')]
 - foaling: [('F', 'OW', 'L', 'IH', 'NG')]
 - foam: [('F', 'OW', 'M')]
 - foaming: [('F', 'OW', 'M', 'IH', 'NG')]
 - foams: [('F', 'OW', 'M', 'Z')]
 - foamy: [('F', 'OW', 'M', 'IY')]
 - fobel: [('F', 'OW', 'B', 'AH', 'L')]
 - fobel's: [('F', 'OW', 'B', 'AH', 'L', 'Z')]
 - fobes: [('F', 'OW', 'B', 'Z')]
 - focaccia: [('F', 'OW', 'K', 'AA', 'CH', 'IY', 'AH')]
 - focal: [('F', 'OW', 'K', 'AH', 'L')]
 - focus: [('F', 'OW', 'K', 'AH', 'S'), ('F', 'OW', 'K', 'IH', 'S')]
 - focused: [('F', 'OW', 'K', 'AH', 'S', 'T'), ('F', 'OW', 'K', 'IH', 'S', 'T')]
 - focuses: [('F', 'OW', 'K', 'AH', 'S', 'IH', 'Z'), ('F', 'OW', 'K', 'IH', 'S', 'IH', 'Z')]
 - focusing: [('F', 'OW', 'K'

## a first try:

For each of our candidate **champagne words** (_folklore_, _folder_, etc.), we're going to check and see if it contains a **pain word**. That is, we're asking if _klore_ or _lder_ are words -- even if they're spelled differently.

We do this really naively, by looping through every pronunciation of every single word in the dictionary, to see if matches the second half of the pronunciation of our candidate **champagne word**.

In [5]:
# naive version # don't erase!!

for candidate_word, candidate_prons in nouns_starting_with_sham_word:
    for pron in candidate_prons:
        for word, prons in word_pronunciations.items():
            if (pron[len(SHAM_SYLLABLE):] in prons) or (pron[len(SHAM_SYLLABLE)-1:] in prons):
                print("{} for my real friends, real {} for my {} friends".format(candidate_word, word, SHAM_WORD))

faucette for my real friends, real set for my faux friends
faucette for my real friends, real sette for my faux friends
faucheux for my real friends, real chau for my faux friends
faucheux for my real friends, real schau for my faux friends
faucheux for my real friends, real show for my faux friends
faupel for my real friends, real pehl for my faux friends
faupel for my real friends, real pell for my faux friends
faupel for my real friends, real pelle for my faux friends
fauteux for my real friends, real toe for my faux friends
fauteux for my real friends, real tow for my faux friends
fauteux for my real friends, real towe for my faux friends
foal for my real friends, real ohl for my faux friends
foal for my real friends, real ol' for my faux friends
foal for my real friends, real ole for my faux friends
foale for my real friends, real ohl for my faux friends
foale for my real friends, real ol' for my faux friends
foale for my real friends, real ole for my faux friends
foale's for my r

folklore for my real friends, real clore for my faux friends
folkman for my real friends, real oakman for my faux friends
folkrock for my real friends, real croc for my faux friends
folkrock for my real friends, real crock for my faux friends
folkrock for my real friends, real kroc for my faux friends
folkrock for my real friends, real krock for my faux friends
folkrock for my real friends, real krok for my faux friends
folks for my real friends, real oak's for my faux friends
folks for my real friends, real oakes for my faux friends
folks for my real friends, real oaks for my faux friends
folks for my real friends, real oaks' for my faux friends
folks for my real friends, real ochs for my faux friends
folks' for my real friends, real oak's for my faux friends
folks' for my real friends, real oakes for my faux friends
folks' for my real friends, real oaks for my faux friends
folks' for my real friends, real oaks' for my faux friends
folks' for my real friends, real ochs for my faux fri

Very cool! Those are kind of right.

_focusing for my real friends, real kissing for my faux friends_ has got a ring to it, but kind of has the wrong valence. Who wishes kisses on their faux friends?

But let's look closer.


 > foamy for my real friends, real me for my faux friends
 
This is kind of funny, but _foamy_ is an adjective. That means that the phrase doesn't make a lot of sense. Let's try to skip those.

 > _focus for my real friends, real cos for my faux friends_
 >
 > _focus for my real friends, real kiss for my faux friends_
 >
 > _focus for my real friends, real kos for my faux friends_
 
 
Some of the proposals are doubled. And, _cos_ and _kos_ aren't words I've heard of, we should try to eliminate those.


## Let's build a fancier version:

above, we noticed a few problems:

- non-nouns
- weird, very rare words
- multiple pronunciations of the same sounds
- overprecise vowel matches mean we miss some slant rhymes that should be ok.

In [6]:
#| echo: false
#| code-fold: true
import itertools

from os import environ
NSFW_WORDS_ALLOWED = environ.get("NSFW_WORDS_ALLOWED")
import base64

# helper functions for masking NSFW terms.
def mask_nsfw(string):
    return base64.b64encode(string.encode()).decode()
    
def unmask_nsfw(b64string):
    return base64.b64decode(b64string).decode()

test_input_str = 'asdfasdf'
assert unmask_nsfw(mask_nsfw(test_input_str)) == test_input_str

### champagne and pain words have to be nouns

Let's use Wordnet to filter just to words that are nouns.

unfortunately this could cost us gems like 

 - "folkrock for my real friends, real croc for my faux friends" 
 - "chamonix for real friends, real money for my sham friends"
 - "midrash for my real friends, real rash for my mid friends"
 
because _folkrock_, _chamonix_ and _midrash_ aren't in Wordnet. So we add them back here as special extra nouns.

In [7]:
#| echo: false
#| code-fold: true

from nltk.corpus import wordnet
from nltk import download as nltk_download; nltk_download('wordnet', quiet=True);

EXTRA_NOUNS = ["folkrock", # best genre evar
               "chamonix", # an area in France
               'poo', # missing from wordnet :(
              ]


# add pronunciations for some words that are missing from CMU
cmudict_pronunciations = cmudict.dict()
EXTRA_PRONUNCIATIONS = {
    'fomites': [cmudict_pronunciations['foe'][0] + cmudict_pronunciations['mites'][0]],
    'fomite': [cmudict_pronunciations['foe'][0] + cmudict_pronunciations['mite'][0]],
    'shamrocks': [cmudict_pronunciations['sham'][0] + cmudict_pronunciations['rocks'][0]],
    'midrash': [cmudict_pronunciations['mid'][0] + ['R'] + cmudict_pronunciations['gosh'][0][1:]],
    'badges': [cmudict_pronunciations["bad"][0] + cmudict_pronunciations['badges'][0][2:]],
    'meh': [['M', 'EH1']]
}
if NSFW_WORDS_ALLOWED:
    (unmask_nsfw('aml6eg=='), [cmudict_pronunciations['badges'][0][2:]]),

for word, prons in EXTRA_PRONUNCIATIONS.items():
    cmudict_pronunciations[word] += prons
    
EXTRA_NOUNS += EXTRA_PRONUNCIATIONS.keys()


In [8]:
def is_noun(word):
    return any(synset.pos() for synset in wordnet.synsets(word, pos=wordnet.NOUN)) or word in EXTRA_NOUNS

In [9]:
#| echo: false
#| code-fold: true

assert is_noun('mite')      # testing basic case
assert is_noun('might')     # testing words that can be multiple PoS
assert is_noun('shamrocks') # testing plurals
assert is_noun('mites')     # testing plurals
assert is_noun('poo')       # testing manual additions
assert not is_noun('photographed')     # testing handling of inflected verbs
assert not is_noun('focused')          # testing handling of inflected verbs

### let's ignore very rare words

First we download a list of word frequencies; we'll ignore anything that occurs less than 500,000 times in the corpus. Then, we download a list of nouns; we'll ignore anything that isn't a noun. (We add back a few nouns I like.)

In [10]:

# ignore words that occur less than this many times in the Google Web Trillion Word Corpus 
# many of these less-frequent words are very bizarre, but appear in the CMU Pronouncing Dictionary nevertheless.
# the champagne and sham words can be rare, but they must be nouns.
# the pain word must be a more common noun.

# download file of "The 1/3 million most frequent words, all lowercase, with counts"
!wget -nc --quiet https://norvig.com/ngrams/count_1w.txt
import csv
with open("count_1w.txt") as f:
    word_frequencies = dict([(row[0], int(row[1])) for row in csv.reader(f, delimiter="\t")])

MIN_FREQUENCY = 100_000


### let's be a little more chill about vowels

We also don't care about all the vowel distinctions that the dictionary uses. We'll implement "vowel reductions" that mirror some of the ways American English speakers can change vowels in fast or casual speech, so that near-identical words are okay.

Also we ignore very uncommon words and non-nouns, which don't fit in the joke format.

::: {.callout-tip collapse="true"}
What are these vowels?
:::

In [11]:
#| echo: true
#| code-fold: true

# finding examples of each vowel, so we can find which vowels to combine
try:
    import pandas as pd
    has_pandas = True
except:
    has_pandas = False
    
vowels = [ph for ph, types in cmudict.phones() if types[0] == 'vowel']
vowel_examples = []
for vowel in vowels:
    for word, prons in sorted(cmudict_dict.items(), key=lambda word_prons: -word_frequencies.get(word_prons[0], 0)):
        if "'" in word or '.' in word: continue
        if not is_noun(word): continue
        if 'R' in prons[0]: continue
        if len(word) < 3 or len(word) >= 6: continue

        if vowel in clean_phonemes(prons[0]):
            vowel_examples.append((vowel, word, prons[0]))
            break
if has_pandas:
    display(pd.DataFrame(vowel_examples, columns=["vowel", "example", "pronunciation"]).set_index('vowel'))
else:
    for vowel, example, pron in vowel_examples:
        print(vowel, example, pronunciation)
    

Unnamed: 0_level_0,example,pronunciation
vowel,Unnamed: 1_level_1,Unnamed: 2_level_1
AA,was,"[W, AA1, Z]"
AE,have,"[HH, AE1, V]"
AH,one,"[W, AH1, N]"
AO,law,"[L, AO1]"
AW,out,"[AW1, T]"
AY,time,"[T, AY1, M]"
EH,web,"[W, EH1, B]"
ER,first,"[F, ER1, S, T]"
EY,page,"[P, EY1, JH]"
IH,will,"[W, IH1, L]"


In [12]:

VOWEL_REDUCTIONS = {
    "AH": "AH", "AA": "AH", "AO": "AH", "UH": "AH", "IH": "AH", "EH": "AH", 
    "OW": "AH" # special for chamonix / money
}

VOWEL_STRICTNESS = 2
# 1 = least; compare only first char of vowel symbol 
# 2 = medium; perform some vowel reductions
# 3 = most; compare vowels as is (with stress removed)

def is_vowel(phon):
    return phones_dict[remove_stress(phon)][0] == 'vowel'

phones_dict = dict(cmudict.phones())
def reduce_phonemes(pron, vowel_strictness=VOWEL_STRICTNESS):
    if vowel_strictness < 3:
        pron = [(phon[0] if (vowel_strictness == 1) else VOWEL_REDUCTIONS.get(remove_stress(phon), phon)) if is_vowel(phon) else phon for phon in pron]
    pron = clean_phonemes(pron) # remove stress        
    return pron

Here's how the vowel strictness works:


In [13]:
#| echo: true
#| code-fold: true

# `assert` lines are tests to make sure I didn't F it up.
print("pronunciation of 'ruck', as is    {}"  .format(          tuple(cmudict_dict["ruck"][0])))
print("pronunciation of 'ruck', cleaned  {}".format( reduce_phonemes(cmudict_dict["ruck"][0], vowel_strictness=2)))
assert reduce_phonemes(cmudict.dict()["ruck"][0], vowel_strictness=2)[1] == "AH"

print("pronunciation of 'wreck', as is   {}"  .format(          tuple(cmudict_dict["wreck"][0])))
print("pronunciation of 'wreck', cleaned {}".format( reduce_phonemes(cmudict_dict["wreck"][0], vowel_strictness=2)))
assert reduce_phonemes(cmudict.dict()["wreck"][0], vowel_strictness=2)[1] == "AH"

assert reduce_phonemes(cmudict.dict()["ruck"][0], vowel_strictness=1)[1] == "A"
assert reduce_phonemes(cmudict.dict()["wreck"][0], vowel_strictness=1)[1] == "E"

assert reduce_phonemes(cmudict.dict()["ruck"][0], vowel_strictness=3)[1] == "AH"
assert reduce_phonemes(cmudict.dict()["wreck"][0], vowel_strictness=3)[1] == "EH"


pronunciation of 'ruck', as is    ('R', 'AH1', 'K')
pronunciation of 'ruck', cleaned  ('R', 'AH', 'K')
pronunciation of 'wreck', as is   ('R', 'EH1', 'K')
pronunciation of 'wreck', cleaned ('R', 'AH', 'K')


In [14]:
#| echo: false
#| code-fold: true

# map nouns to their "cleaned" pronunciations

# champagne words are allowed to be uncommon.
strict_noun_pronunciations = {} # the sham word has to exactly match the start of the champagne word, because 
                                # the joke format relies on ambiguity between "champagne" and "sham pain".

strict_word_pronunciations = {} # candidate sham words (which are adjectives, not nouns) and can be uncommon.

# but pain words have to be common.
strict_common_noun_pronunciations = {}
reduced_common_noun_pronunciations = {} # the pain word can match a reduced pronunciation of the back half of the
                                        # champagne word.
for word, prons in cmudict_pronunciations.items():
    strict_word_pronunciations[word] = [reduce_phonemes(pron, vowel_strictness=3) for pron in prons]
    if is_noun(word): 
        strict_noun_pronunciations[word] = [reduce_phonemes(pron, vowel_strictness=3) for pron in prons]
        if word_frequencies.get(word, 0) > MIN_FREQUENCY or word in EXTRA_NOUNS:
            reduced_common_noun_pronunciations[word] = [reduce_phonemes(pron) for pron in prons]
            strict_common_noun_pronunciations[word]  = [reduce_phonemes(pron, vowel_strictness=3) for pron in prons]

In [15]:
#| echo: true
#| code-fold: true

def get_best_pain_word(champagne_prons, sham_syllable, quiet=True):
    """
    find the most common word that matches the back half of the champagne word
    if we can't find one, find one that matches the back half of the champagne word, 
        but repeating the last phoneme of the sham word 
        (e.g. chamonix for my real friends, real money for my sham friends)
        this is "gemination" -- we double up the 
    if we still can't find one, try with reduced pronunciations
    """

    # this lets us prefer better vowel matches to worse ones, i.e.
    # ok: folkrock for my real friends, real crook for my faux friends
    # better: folkrock for my real friends, real croc for my faux friends
    # previously it prefers "crook" because crook is more common than croc

    # this also lets us prefer ungeminated matches over geminated ones
    
    for champagne_pron in champagne_prons:
        if sham_syllable != champagne_pron[:len(sham_syllable)] and \
           reduce_phonemes(sham_syllable, vowel_strictness=2) != champagne_pron[:len(sham_syllable)]: 
            # print("{} doesn't match {}, skipping".format(sham_syllable, champagne_pron)) # just for debug.
            continue
        for (pronunciations, syllable_offset), score in zip(list(itertools.product(
            [strict_common_noun_pronunciations, reduced_common_noun_pronunciations], 
            [0, 1])
                                                   ), range(4, 0, -1)):
            # gemination is only allowed for consonant-final sham words.
            # "photograph for my real friends, real autograph for my faux friends" should be invalid
            if syllable_offset == 1 and is_vowel(sham_syllable[-1]): continue
                
            for pain_word, pain_prons in sorted(pronunciations.items(), key=lambda word_prons: -word_frequencies.get(word[0], 0)):
                if (champagne_pron[len(sham_syllable)-syllable_offset:] in pain_prons):
                    return pain_word, score
    return None, None

get_best_pain_word(strict_noun_pronunciations["photos"], strict_noun_pronunciations["foe"][0])[0]

'toes'

In [16]:
from collections import defaultdict

def find_champagne_real_pain_phrases(sham_word, sham_word_override=None, real_word="real", quiet=True, include_score=False):
    # sham syllables can be any PoS and need to have the extras added.
    # e.g. "mid" and "faux" are adjectives; "meh" isn't in the CMU pron dictionary.
    sham_syllable = strict_word_pronunciations[sham_word_override or sham_word][0]
    
    nouns_starting_with_sham_word = list(filter(
        lambda candidate_word: find_candidate_champagne_words(candidate_word, sham_syllable=sham_syllable), 
        strict_noun_pronunciations.items()
    ))
    if not quiet:
        print("Here are words (our candidate champagne words) that start with {}".format(sham_syllable))
        for noun, prons in nouns_starting_with_sham_word:
            print(f" - {noun}: {prons}")
    champagne_pain_pairs = set()
    for candidate_champagne_word, _ in nouns_starting_with_sham_word:
        for candidate_champagne_prons in zip(strict_common_noun_pronunciations.get(candidate_champagne_word, []), 
                reduced_common_noun_pronunciations.get(candidate_champagne_word, [])):
            best_pain_word, score = get_best_pain_word(candidate_champagne_prons, sham_syllable)
            if best_pain_word:
                champagne_pain_pairs.add((candidate_champagne_word, best_pain_word, score))    
        
    return [(f"{candidate_champagne_word} for my {real_word} friends, {real_word} {pain_word} for my {sham_word} friends" + (f" (score: {score})" if include_score else ''))
            for candidate_champagne_word, pain_word, score in sorted(champagne_pain_pairs, key=lambda x: -x[2])]


In [17]:
for champagne_phrase in find_champagne_real_pain_phrases(SHAM_WORD, sham_word_override=SHAM_WORD_OVERRIDE):
    print(champagne_phrase)

fomites for my real friends, real mites for my faux friends
focus for my real friends, real cus for my faux friends
folio for my real friends, real leo for my faux friends
phobos for my real friends, real bus for my faux friends
foam for my real friends, real mm for my faux friends
photons for my real friends, real tonnes for my faux friends
focusing for my real friends, real kissing for my faux friends
photo for my real friends, real toe for my faux friends
fomite for my real friends, real might for my faux friends
photos for my real friends, real toes for my faux friends
phony for my real friends, real knee for my faux friends
photon for my real friends, real ton for my faux friends
folkrock for my real friends, real crock for my faux friends
phoney for my real friends, real knee for my faux friends
focuses for my real friends, real kisses for my faux friends
focus for my real friends, real kis for my faux friends
folds for my real friends, real olds for my faux friends
phone for my 

# a worked example

Let's walk through this step by step.

Chamonix is a province in France. It has two very different pronunciations.

In [18]:
cmudict.dict()["chamonix"]

[['CH', 'AE1', 'M', 'AH0', 'N', 'IH0', 'K', 'S'],
 ['SH', 'AE0', 'M', 'OW0', 'N', 'IY0']]

Here's what the pronunciation looks like with the stresses removed.

In [19]:
strict_word_pronunciations["chamonix"]

[('CH', 'AE', 'M', 'AH', 'N', 'IH', 'K', 'S'),
 ('SH', 'AE', 'M', 'OW', 'N', 'IY')]

In [20]:
sham_word = "sham"
sham_syllable = strict_word_pronunciations[sham_word][0]
sham_syllable

('SH', 'AE', 'M')

See how `('SH', 'AE', 'M')` is the first half of the second pronunciation, `('SH', 'AE', 'M', 'OW', 'N', 'IY')`

Now, we're looking to see if there are any _pain words_ that are pronounced like any of these options for the pronunciation of the back half of _chamonix_.

First, we're looking for something that matches `('OW', 'N', 'IY')` or `('M', 'OW', 'N', 'IY')` -- the back half without any vowel reductions. `('OW', 'N', 'IY')` would be preferable, since it doesn't require double-up the `M` in both the sham word (_sham_) and the pain word (_money_, in this case).

Here are those strict pronunciations that we're looking for.

In [21]:
for chamonix_pronunciation in strict_word_pronunciations["chamonix"]:
    print(chamonix_pronunciation[len(sham_syllable):])
    print(chamonix_pronunciation[len(sham_syllable)-1:])    

('AH', 'N', 'IH', 'K', 'S')
('M', 'AH', 'N', 'IH', 'K', 'S')
('OW', 'N', 'IY')
('M', 'OW', 'N', 'IY')


Or, if we can't find something that matches the strict, we'd look for something that matches the back half of _chamonix_ with reductions.

Here's what those reduced pronunciations look like:

In [22]:
for chamonix_pronunciation in reduced_common_noun_pronunciations["chamonix"]:
    print(chamonix_pronunciation[len(sham_syllable):])
    print(chamonix_pronunciation[len(sham_syllable)-1:])

('AH', 'N', 'AH', 'K', 'S')
('M', 'AH', 'N', 'AH', 'K', 'S')
('AH', 'N', 'IY')
('M', 'AH', 'N', 'IY')


With those eight acceptable pronunciations in mind, we loop through every noun in the English language, to see if any nouns are pronounced that way. We find three!

In [23]:
strict_noun_pronunciations["money"]

[('M', 'AH', 'N', 'IY')]

In [24]:
strict_noun_pronunciations["onyx"]

[('AA', 'N', 'IH', 'K', 'S')]

In [25]:
strict_noun_pronunciations["annex"]

[('AE', 'N', 'EH', 'K', 'S'), ('AH', 'N', 'EH', 'K', 'S')]

None of those three words match either strict pronunciation of _chamonix_. So we get no result from `get_best_pain_word` here.

In [26]:
get_best_pain_word(strict_noun_pronunciations["chamonix"], strict_noun_pronunciations["sham"][0])

(None, None)

But they do match the reduced pronunciation of _chamonix_. 

In [27]:
get_best_pain_word(reduced_common_noun_pronunciations["chamonix"], strict_noun_pronunciations["sham"][0])

('money', 3)

_Money_ is picked becuase it is the more common than either _onyx_ or _annex_.

In [28]:
for word in ["money", "onyx", "annex"]:
    print("frequency of {: >5} is {:,.0f}".format(word, word_frequencies[word]))

frequency of money is 190,205,072
frequency of  onyx is 2,315,135
frequency of annex is 8,465,905


# more fun examples

In [29]:
for champagne_phrase in find_champagne_real_pain_phrases("sham", quiet=False):
    print(champagne_phrase)


Here are words (our candidate champagne words) that start with ('SH', 'AE', 'M')
 - chamonix: [('CH', 'AE', 'M', 'AH', 'N', 'IH', 'K', 'S'), ('SH', 'AE', 'M', 'OW', 'N', 'IY')]
 - champagne: [('SH', 'AE', 'M', 'P', 'EY', 'N')]
 - champagnes: [('SH', 'AE', 'M', 'P', 'EY', 'N', 'Z')]
 - champlain: [('SH', 'AE', 'M', 'P', 'L', 'EY', 'N')]
 - shamble: [('SH', 'AE', 'M', 'B', 'AH', 'L')]
 - shambles: [('SH', 'AE', 'M', 'B', 'AH', 'L', 'Z')]
 - shampoo: [('SH', 'AE', 'M', 'P', 'UW')]
 - shampoos: [('SH', 'AE', 'M', 'P', 'UW', 'Z')]
 - shamrock: [('SH', 'AE', 'M', 'R', 'AA', 'K')]
 - shamrocks: [('SH', 'AE', 'M', 'R', 'AA', 'K', 'S')]
champagnes for my real friends, real pains for my sham friends
shamrocks for my real friends, real rocks for my sham friends
champlain for my real friends, real plain for my sham friends
shamrock for my real friends, real roc for my sham friends
shampoo for my real friends, real poo for my sham friends
champagne for my real friends, real pain for my sham friends

In [30]:
for champagne_phrase in find_champagne_real_pain_phrases("mid"):
    print(champagne_phrase)


midstream for my real friends, real stream for my mid friends
midwest for my real friends, real west for my mid friends
midwife for my real friends, real wife for my mid friends
midair for my real friends, real air for my mid friends
midweek for my real friends, real week for my mid friends
midterm for my real friends, real term for my mid friends
midterms for my real friends, real terms for my mid friends
midrash for my real friends, real rush for my mid friends
midpoint for my real friends, real point for my mid friends
mideast for my real friends, real east for my mid friends
middling for my real friends, real ling for my mid friends
midwives for my real friends, real wives for my mid friends
midwinter for my real friends, real winter for my mid friends
midnight for my real friends, real knight for my mid friends
midline for my real friends, real line for my mid friends
midway for my real friends, real way for my mid friends
midday for my real friends, real day for my mid friends
mi

In [31]:
for champagne_phrase in find_champagne_real_pain_phrases("bad", quiet=False):
    print(champagne_phrase)


Here are words (our candidate champagne words) that start with ('B', 'AE', 'D')
 - badges: [('B', 'AE', 'JH', 'IH', 'Z'), ('B', 'AE', 'D', 'JH', 'IH', 'Z')]
 - badlands: [('B', 'AE', 'D', 'L', 'AE', 'N', 'D', 'Z')]
 - badminton: [('B', 'AE', 'D', 'M', 'IH', 'N', 'T', 'AH', 'N')]
 - badmintons: [('B', 'AE', 'D', 'M', 'IH', 'N', 'T', 'AH', 'N', 'Z')]
 - badness: [('B', 'AE', 'D', 'N', 'AH', 'S')]
badlands for my real friends, real lands for my bad friends
badges for my real friends, real jaws for my bad friends
badness for my real friends, real nes for my bad friends


In [32]:
for champagne_phrase in find_champagne_real_pain_phrases("meh", quiet=True):
    print(champagne_phrase)


messes for my real friends, real says for my meh friends
memory for my real friends, real murray for my meh friends
marrow for my real friends, real rho for my meh friends
menus for my real friends, real news for my meh friends
marriage for my real friends, real ridge for my meh friends
married for my real friends, real read for my meh friends
medic for my real friends, real dick for my meh friends
mesquite for my real friends, real skeet for my meh friends
medics for my real friends, real dicks for my meh friends
melange for my real friends, real lange for my meh friends
merits for my real friends, real ritz for my meh friends
mary for my real friends, real re for my meh friends
merit for my real friends, real rut for my meh friends
meadows for my real friends, real doze for my meh friends
maris for my real friends, real rus for my meh friends
metrics for my real friends, real tricks for my meh friends
marriages for my real friends, real ridges for my meh friends
mecca for my real fri

## Future enhancements

### make sure that the pain word has a negative valence

There are two ways to do this:

 - dictionary-based sentiment analysis methodology like AFINN.
 - asking GPT4.
 
The benefit of this would be to exclude phrases like:
 
> "fomite for my real friends, real might for my faux friends" 

which is not a very good phrase, because we're wishing something bad on our real friends, and something good on our faux friends. That's the opposite of what we want.

> "fomites for my real friends, real mites for my faux friends"

is better because "mites" are bad. It's still not ideal, since "fomites" are bad, and we're wishing infection-carrying objects on our friends.

This sentiment/valence rating could be used instead of frequency to decide which pain word to use. Right now, for "folkrock", we generate _"folkrock for my real friends, real crock for my faux friends"_. This is not ideal because "crock" has only a very vague negative valence (implicitly it's a _crock of shit_, I guess). It would be better if it was "croc" instead as the pain word -- because it's mean to wish a crocodile on someone.

for singular, count nouns, we should add "a" (or "an") before hand, so it makes a bit more sense.

> _**a** phoney for my real friends, **a** real knee for my faux friends_

this would be tricky, because we don't want to do this for count nouns (like _champagne_). 

### metrics for my real friends

compile all the examples here and get some humans to rate them on how funny they are, and see if we can't figure out some more rules to filter out the crappy ones and generate more funny ones.

### more phonology

The vowel reduction logic above is pretty rough. Additionally, diphthongs are often close enough to single vowels to count as a rhyme. You could imagine _Champlain for my real friends, real playin' for my sham friends_ working -- but the code as of now wouldn't be able to equate _plain_ and _playin'_. 

In [33]:
#| echo: true
#| code-fold: true

# regression testing
assert not any(['photograph' in phrase and 'autograph' in phrase for phrase in find_champagne_real_pain_phrases("faux", sham_word_override="foe")])
assert any(['shamrocks' in phrase for phrase in find_champagne_real_pain_phrases("sham")])
assert any(['money' in phrase and 'chamonix' in phrase for phrase in find_champagne_real_pain_phrases("sham")])
assert any(['midrash' in phrase for phrase in find_champagne_real_pain_phrases("mid")])
assert any(['folkrock' in phrase for phrase in find_champagne_real_pain_phrases("faux", sham_word_override="foe")])
assert any(['fomites' in phrase for phrase in find_champagne_real_pain_phrases("faux", sham_word_override="foe")])
if NSFW_WORDS_ALLOWED:
    assert any(['badges' in phrase and unmask_nsfw('aml6eg==') in phrase for phrase in find_champagne_real_pain_phrases("bad")])
else:
    assert not any(['badges' in phrase and unmask_nsfw('aml6eg==') in phrase for phrase in find_champagne_real_pain_phrases("bad")])



AssertionError: 

In [None]:
#| echo: true
#| code-fold: true


# debugging:
def why_is_this_word_absent(word):
    if not is_noun(word):
        print(f"is_noun({word}) is false")
    if word in word_frequencies and word_frequencies.get(word) < MIN_FREQUENCY:
        print("{}'s frequency is {}, which is below MIN_FREQUENCY".format(word, word_frequencies.get(word), MIN_FREQUENCY))
    if word not in word_frequencies:
        print(f"{word} not in word frequencies")
    if not cmudict_pronunciations[word]:
        print(f"{word} not in cmudict")
why_is_this_word_absent('poo')
why_is_this_word_absent('shamrocks')
why_is_this_word_absent('midrash')
why_is_this_word_absent('badges')



In [None]:
#| echo: true
#| code-fold: true

# debugging
candidate_champagne_word = "midrash"
sham_syllable = word_pronunciations["mid"][0]
for champagne_prons in zip(strict_common_noun_pronunciations.get(candidate_champagne_word, []), 
                reduced_common_noun_pronunciations.get(candidate_champagne_word, [])):
    for champagne_pron in champagne_prons:
        if sham_syllable != champagne_pron[:len(sham_syllable)] and \
           reduce_phonemes(sham_syllable, vowel_strictness=2) != champagne_pron[:len(sham_syllable)]: 
            print("{} doesn't match {}, skipping".format(sham_syllable, champagne_pron))
            continue
        for pronunciations, syllable_offset in list(itertools.product(
            [strict_common_noun_pronunciations, reduced_common_noun_pronunciations], 
            [0, 1])
                                                   ):
            for pain_word, pain_prons in sorted(pronunciations.items(), key=lambda word_prons: -word_frequencies.get(word[0], 0)):
                if (champagne_pron[len(sham_syllable)-syllable_offset:] in pain_prons):
                    print(pain_word)
