<a href="https://colab.research.google.com/github/jakelang1348/RhymePredictorRNN/blob/main/Rhyme_Predictor.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Libraries included

In [None]:
import re
import json
import sys
import collections
import os
import random
import numpy as np
import torch
import torchvision
import torch.nn as nn
from datetime import datetime
from sklearn.model_selection import train_test_split
from collections import Counter
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, TensorDataset

Data preprocessing. Converts the initial large dataset of English words into a dataset containing only single-syllable English words.
Link to the dataset I used for initial preprocessing: https://github.com/open-dict-data/ipa-dict/blob/master/data/en_US.txt.
Writes a new file at output_file_path containing only the single-syllable words.

In [None]:
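def is_single_syllable(ipa):
    #counts individual vowel symbols, so a word is kept only if its transcription has exactly one
    #same vowel set as get_phoneme_endings_groups (the aspiration mark ʰ is not a vowel)
    vowel_pattern = r'[aeiouɑɔɛɪʌʊæɐɜɞɘɵʉɨɤəɚɝ]'
    vowels = re.findall(vowel_pattern, ipa, re.IGNORECASE)
    return len(vowels) == 1

def filter_single_syllable_words(input_file_path, output_file_path):
    #the file contains IPA symbols, so read and write it as UTF-8
    with open(input_file_path, 'r', encoding='utf-8') as input_file, open(output_file_path, 'w', encoding='utf-8') as output_file:
        for line in input_file:
            word, ipa = line.strip().split('\t')
            if is_single_syllable(ipa):
                output_file.write(line)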

input_file_path = '/content/drive/MyDrive/ling441/en_US.txt' #path to dataset from github repo
output_file_path = '/content/drive/MyDrive/ling441/en_US_single_syllable.txt' #path to the new file

filter_single_syllable_words(input_file_path, output_file_path)

print("Single syllable words filtered and saved to the new file.")


Single syllable words filtered and saved to the new file.
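
One thing to keep in mind about the filter above: it counts individual vowel symbols, so a word whose only vowel sound is written with two symbols (a diphthong such as aɪ) is counted as having two vowels and dropped. Below is a minimal sketch of an alternative, assuming each unbroken run of vowel symbols is one syllable nucleus. `count_vowel_runs` is a hypothetical helper, not part of the notebook, and the heuristic would miscount two adjacent vowels that belong to separate syllables.

```python
import re

# Same vowel symbols as the notebook's grouping step (assumption: no other vowels occur).
VOWELS = 'aeiouɑɔɛɪʌʊæɐɜɞɘɵʉɨɤəɚɝ'
VOWEL_RUN = re.compile(f'[{VOWELS}]+')

def count_vowel_runs(ipa):
    """Count unbroken runs of vowel symbols in an IPA string (hypothetical helper)."""
    return len(VOWEL_RUN.findall(ipa))

# Print the run count for a monophthong word, a diphthong word, and a two-syllable word.
for sample in ['/ˈkæt/', '/ˈbaɪt/', '/ˈhæpi/']:
    print(sample, count_vowel_runs(sample))
```

Using this in place of the symbol count would change which words end up in the single-syllable file, so the numbers reported in this notebook would no longer apply.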


Second data preprocessing step. Groups words by their ending, taken from the last vowel phoneme onward, to find sets of words that should rhyme.

In [None]:
#splits file into lists of english words and their phonetic forms
def split_file(filepath):
    with open(filepath, 'r', encoding='utf-8') as file:
        lines = file.readlines()
    english_word = []
    phonetic_word = []
    for line in lines:
        word, phonetic = line.strip().split('\t')
        english_word.append(word)
        phonetic_word.append(phonetic)
    return english_word, phonetic_word


#maps each ending (from the last vowel onward) to the list of words that share that ending
def get_phoneme_endings_groups(words, ipa):
    endings = {}
    vowels = 'aeiouɑɔɛɪʌʊæɐɜɞɘɵʉɨɤəɚɝ'

    for word, phonetic in zip(words, ipa):
        last_vowel_index = max(i for i, char in enumerate(phonetic) if char in vowels)
        ending = phonetic[last_vowel_index:]
        if ending not in endings:
            endings[ending] = []
        endings[ending].append(word)
    return endings

filepath = '/content/drive/MyDrive/ling441/en_US_single_syllable.txt'

word, ipa = split_file(filepath)
endings = get_phoneme_endings_groups(word, ipa)
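
To make the grouping concrete, here is a small sketch of the same idea on five words taken from this notebook's sample output. `group_by_ending` and the `toy_*` names are hypothetical and only for illustration; the logic is meant to mirror get_phoneme_endings_groups.

```python
from collections import defaultdict

VOWELS = 'aeiouɑɔɛɪʌʊæɐɜɞɘɵʉɨɤəɚɝ'

def group_by_ending(words, ipas):
    """Map each ending (last vowel onward) to the words that share it."""
    groups = defaultdict(list)
    for word, ipa in zip(words, ipas):
        # Index of the last vowel symbol in the transcription.
        last_vowel = max(i for i, ch in enumerate(ipa) if ch in VOWELS)
        groups[ipa[last_vowel:]].append(word)
    return dict(groups)

toy_words = ['mum', 'swum', 'minced', 'winced', 'haug']
toy_ipas = ['/ˈməm/', '/ˈswəm/', '/ˈmɪnst/', '/ˈwɪnst/', '/ˈhɔɡ/']
toy_groups = group_by_ending(toy_words, toy_ipas)
print(toy_groups)
```

Note that the closing slash of the transcription stays part of each ending, which matches the keys printed by the notebook (for example 'əz/').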

Print a sample to make sure it properly groups words. Can change sample size if necessary

In [None]:
def print_sample_endings(phoneme_endings_groups, num_samples=20):
    print(f"Printing {num_samples} samples from the phoneme endings groups:\n")
    for i, (ending, words) in enumerate(phoneme_endings_groups.items()):
        if i >= num_samples:
            break
        print(f"Ending '{ending}': {words}\n")

print_sample_endings(endings, num_samples=10)

Printing 10 samples from the phoneme endings groups:

Ending 'əz/': ["'cause", "'twas", 'buzz', 'fuzz', 'luz', 'twas']

Ending 'ɔɹs/': ["'course", 'borse', 'bourse', 'coarse', 'corse', 'course', 'force', 'forse', 'hoarse', 'horse', 'morse', 'morss', 'norse', 'nourse', 'sorce', 'source', 'vorce']

Ending 'uz/': ["'cuse", "blue's", 'blues', "blues'", 'boos', 'booz', 'booze', 'brews', 'bruise', 'bruse', 'buse', 'buus', 'chews', 'choose', 'clews', 'clues', "crew's", 'crewes', 'crews', 'cruise', 'cruse', 'cruz', 'cruze', 'cues', 'dews', "do's", 'drewes', 'drews', 'druse', 'druze', 'dues', 'ewes', 'flus', 'foos', 'fuse', 'glues', 'goos', 'groos', 'guse', 'hewes', 'hews', 'hoos', 'hues', 'huse', 'jews', "jews'", 'joos', 'kloos', 'koos', 'kruse', 'kuse', 'kuze', "leu's", 'loos', 'lose', "lou's", 'luiz', 'luse', 'meuse', 'mewes', 'moos', 'muise', 'muse', "news'", 'oohs', 'ooze', 'pews', "pru's", "pugh's", 'pughs', "q.'s", 'q.s', "q's", 'queues', 'roos', 'roose', 'rues', 'ruse', 'schmooze', 'scr

Randomly generates rhyming and non-rhyming pairs. For a rhyming pair, picks a random ending group and samples two different words from it. For a non-rhyming pair, picks two different ending groups and takes one random word from each. The pairs are then shuffled and split into training and testing datasets.

In [None]:
def generate_full_ipa_pairs(phoneme_endings_groups, num_pairs, rhyme=True):
    pairs = []
    endings = list(phoneme_endings_groups.keys())
    if rhyme:
        #only groups with at least two words can supply a rhyming pair
        endings = [e for e in endings if len(phoneme_endings_groups[e]) > 1]
    #build the word -> full ipa lookup once instead of rescanning the lists for every pair
    full_ipa = dict(zip(word, ipa))

    #keep sampling until exactly num_pairs pairs are collected
    while len(pairs) < num_pairs:
        if rhyme:
            #choose random ending group, then two different words from it
            ending = random.choice(endings)
            word1, word2 = random.sample(phoneme_endings_groups[ending], 2)
        else:
            #choose two different ending groups
            ending1, ending2 = random.sample(endings, 2)
            word1 = random.choice(phoneme_endings_groups[ending1])
            word2 = random.choice(phoneme_endings_groups[ending2])

        #get the full ipa form
        ipa1 = full_ipa.get(word1)
        ipa2 = full_ipa.get(word2)

        if ipa1 and ipa2:
            pairs.append((word1, word2, rhyme, ipa1, ipa2))

    return pairs

#create num_pairs rhyming pairs and num_pairs non-rhyming pairs
rhyming_pairs_full_ipa = generate_full_ipa_pairs(endings, 5000, rhyme=True)
non_rhyming_pairs_full_ipa = generate_full_ipa_pairs(endings, 5000, rhyme=False)

#randomly shuffle the data
random.shuffle(rhyming_pairs_full_ipa)
random.shuffle(non_rhyming_pairs_full_ipa)

#get index for splitting
split_index_rhyme = int(len(rhyming_pairs_full_ipa) * 0.8)
split_index_non_rhyme = int(len(non_rhyming_pairs_full_ipa) * 0.8)
#create rhyme sets
train_set_rhyme = rhyming_pairs_full_ipa[:split_index_rhyme]
test_set_rhyme = rhyming_pairs_full_ipa[split_index_rhyme:]
#create non-rhyme sets
train_set_non_rhyme = non_rhyming_pairs_full_ipa[:split_index_non_rhyme]
test_set_non_rhyme = non_rhyming_pairs_full_ipa[split_index_non_rhyme:]
#combine
balanced_train_set = train_set_rhyme + train_set_non_rhyme
balanced_test_set = test_set_rhyme + test_set_non_rhyme
#shuffle again
random.shuffle(balanced_train_set)
random.shuffle(balanced_test_set)

#for displaying sets
num_train_rhyme, num_train_non_rhyme = len(train_set_rhyme), len(train_set_non_rhyme)
num_test_rhyme, num_test_non_rhyme = len(test_set_rhyme), len(test_set_non_rhyme)

train_sample_balanced = balanced_train_set[:5]
test_sample_balanced = balanced_test_set[:5]

(num_train_rhyme, num_train_non_rhyme), (num_test_rhyme, num_test_non_rhyme), train_sample_balanced, test_sample_balanced



((4000, 4000),
 (1000, 1000),
 [('helmke', 'gulped', False, '/ˈhɛɫmk/', '/ˈɡəɫpt/'),
  ('haug', 'reust', False, '/ˈhɔɡ/', '/ˈɹust/'),
  ('mum', 'swum', True, '/ˈməm/', '/ˈswəm/'),
  ('minced', 'winced', True, '/ˈmɪnst/', '/ˈwɪnst/'),
  ('thoughts', "trust's", False, '/ˈθɔts/', '/ˈtɹəsts/')],
 [('notched', 'buerge', False, '/ˈnɑtʃt/', '/ˈbjuɹdʒ/'),
  ('luke', 'glueck', True, '/ˈɫuk/', '/ˈɡɫuk/'),
  ('walck', 'paulk', True, '/ˈwɔɫk/', '/ˈpɔɫk/'),
  ('obst', 'cusp', False, '/ˈɑbst/', '/ˈkəsp/'),
  ('broom', 'fume', True, '/ˈbɹum/', '/ˈfjum/')])

Prepares the data as torch tensors for use in the RNN

In [None]:
def create_char_encoding(data):
    chars = set(ch for pair in data for ipa in pair[3:5] for ch in ipa)
    char_to_int = {ch: i + 1 for i, ch in enumerate(sorted(chars))}
    return char_to_int

def words_to_sequences(ipas, char_to_int, max_length):
    sequences = []
    for ipa in ipas:
        encoded = [char_to_int.get(ch, 0) for ch in ipa]
        encoded += [0] * (max_length - len(encoded))
        sequences.append(encoded)
    return sequences

char_to_int = create_char_encoding(rhyming_pairs_full_ipa + non_rhyming_pairs_full_ipa)

max_ipa_length = max(len(ipa) for pair in rhyming_pairs_full_ipa + non_rhyming_pairs_full_ipa for ipa in pair[3:5])

train_sequences = [(words_to_sequences(pair[3:5], char_to_int, max_ipa_length), pair[2]) for pair in balanced_train_set]
test_sequences = [(words_to_sequences(pair[3:5], char_to_int, max_ipa_length), pair[2]) for pair in balanced_test_set]

train_data = [(torch.tensor(words, dtype=torch.long), torch.tensor(label, dtype=torch.float)) for words, label in train_sequences]
test_data = [(torch.tensor(words, dtype=torch.long), torch.tensor(label, dtype=torch.float)) for words, label in test_sequences]

train_loader = DataLoader(train_data, batch_size=64, shuffle=True)
test_loader = DataLoader(test_data, batch_size=64, shuffle=False)
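
As a quick illustration of what the encoding step produces, here is a sketch in plain Python on two transcriptions from the sample output. `build_char_to_int`, `encode`, and the `toy_*` names are hypothetical stand-ins for create_char_encoding and words_to_sequences.

```python
def build_char_to_int(ipas):
    """Assign ids starting at 1 to every character seen; 0 is reserved for padding."""
    chars = sorted({ch for ipa in ipas for ch in ipa})
    return {ch: i + 1 for i, ch in enumerate(chars)}

def encode(ipa, char_to_int, max_length):
    """Turn a transcription into ids and pad with zeros at the end."""
    ids = [char_to_int.get(ch, 0) for ch in ipa]
    return ids + [0] * (max_length - len(ids))

toy_ipas = ['/ˈməm/', '/ˈswəm/']
toy_encoding = build_char_to_int(toy_ipas)
toy_max_length = max(len(ipa) for ipa in toy_ipas)
toy_encoded = [encode(ipa, toy_encoding, toy_max_length) for ipa in toy_ipas]
print(toy_encoding)
print(toy_encoded)
```

The shorter word gets its zeros at the end, so the part of the word that decides the rhyme is followed by padding rather than being the last thing in the sequence.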


Model definition

In [None]:
class Rhyme_Predictor(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, num_layers):
        super(Rhyme_Predictor, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.rnn = nn.LSTM(embedding_dim, hidden_dim, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_dim * 2, output_dim)

    def forward(self, x):
        x1, x2 = x[:, 0, :], x[:, 1, :]
        embedded1 = self.embedding(x1)
        embedded2 = self.embedding(x2)
        _, (hidden1, _) = self.rnn(embedded1)
        _, (hidden2, _) = self.rnn(embedded2)
        hidden = torch.cat((hidden1[-1], hidden2[-1]), dim=1)
        out = self.fc(hidden)
        return torch.sigmoid(out)

vocab_size = len(char_to_int) + 1
embedding_dim = 100
hidden_dim = 128
output_dim = 1
num_layers = 1

model = Rhyme_Predictor(vocab_size, embedding_dim, hidden_dim, output_dim, num_layers)
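
In the model above, the LSTM also steps over the trailing padding positions. Their embeddings are zero vectors because of padding_idx=0, but the recurrent state still updates, so hidden[-1] is the state after the padding rather than after the word's last real character. Below is a minimal sketch of one way around that, using pack_padded_sequence so the LSTM stops at each word's true length. `PackedEncoder` is a hypothetical class, the sizes are arbitrary, and I have not tested whether this changes the accuracy reported in this notebook.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

class PackedEncoder(nn.Module):
    """Hypothetical variant of the notebook's encoder that skips padded steps."""
    def __init__(self, vocab_size, embedding_dim, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.rnn = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)

    def forward(self, x):
        # x: (batch, max_len) character ids, with 0 used only as trailing padding.
        lengths = (x != 0).sum(dim=1).clamp(min=1).cpu()
        packed = pack_padded_sequence(self.embedding(x), lengths,
                                      batch_first=True, enforce_sorted=False)
        _, (hidden, _) = self.rnn(packed)
        # (batch, hidden_dim): the state at each word's last real character.
        return hidden[-1]

torch.manual_seed(0)
encoder = PackedEncoder(vocab_size=10, embedding_dim=8, hidden_dim=16).eval()
short_pad = torch.tensor([[1, 6, 2, 5, 2, 1, 0]])
long_pad = torch.tensor([[1, 6, 2, 5, 2, 1, 0, 0, 0, 0]])
with torch.no_grad():
    # The same word with different amounts of padding should encode identically.
    same = torch.allclose(encoder(short_pad), encoder(long_pad), atol=1e-6)
print(same)
```

The check at the end compares the same word padded to two different lengths; with packing, the amount of padding should not affect the encoding.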


Training/Testing Loop

In [None]:
lr = .001 #most consistent lr
criterion = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=lr)

def train(model, train_loader, test_loader, criterion, optimizer, num_epochs=10): #10 epochs default
    for epoch in range(num_epochs):
        model.train()
        total_loss = 0
        correct = 0
        total = 0

        for words, labels in train_loader:
            optimizer.zero_grad()
            #squeeze only the output dimension so a final batch of size 1 keeps its batch dimension
            outputs = model(words).squeeze(1)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            total_loss += loss.item()

            predicted = (outputs > 0.5).float()
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

        train_accuracy = correct / total
        train_loss = total_loss / len(train_loader)
        print(f"Epoch {epoch + 1}/{num_epochs}, Loss: {train_loss:.4f}, Train Accuracy: {train_accuracy:.4f}")

word_to_ipa = dict(zip(word, ipa)) #for printing results
ipa_to_word = dict(zip(ipa, word)) #for printing results; words that share a transcription collapse to one entry


def test(model, test_loader, num_examples=10):
    model.eval()
    correct = 0
    total = 0
    correct_examples = []
    incorrect_examples = []

    #invert the encoding once so padded sequences can be decoded back to ipa strings
    int_to_char = {i: ch for ch, i in char_to_int.items()}

    with torch.no_grad():
        for words, labels in test_loader:
            outputs = model(words)
            predicted = (outputs.squeeze(1) > 0.5).float()
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

            #get examples to print
            for i in range(words.size(0)):
                phonetic1 = ''.join(int_to_char[x] for x in words[i, 0].tolist() if x != 0)
                phonetic2 = ''.join(int_to_char[x] for x in words[i, 1].tolist() if x != 0)
                word1 = ipa_to_word.get(phonetic1, "N/A")
                word2 = ipa_to_word.get(phonetic2, "N/A")
                example = (phonetic1, word1, phonetic2, word2, labels[i], predicted[i])

                if len(correct_examples) < num_examples and predicted[i] == labels[i]:
                    correct_examples.append(example)
                elif len(incorrect_examples) < num_examples and predicted[i] != labels[i]:
                    incorrect_examples.append(example)

    accuracy = correct / total

    #print examples
    print("\nCorrectly Predicted Examples:")
    for phonetic1, word1, phonetic2, word2, label, pred in correct_examples:
        print(f"Words: {word1} ({phonetic1}), {word2} ({phonetic2}) - Label: {label.item()}, Prediction: {pred.item()}")

    print("\nIncorrectly Predicted Examples:")
    for phonetic1, word1, phonetic2, word2, label, pred in incorrect_examples:
        print(f"Words: {word1} ({phonetic1}), {word2} ({phonetic2}) - Label: {label.item()}, Prediction: {pred.item()}")

    return accuracy

In [None]:
train(model, train_loader, test_loader, criterion, optimizer, num_epochs=25)

Epoch 1/25, Loss: 0.6922, Train Accuracy: 0.5186
Epoch 2/25, Loss: 0.6856, Train Accuracy: 0.5509
Epoch 3/25, Loss: 0.6650, Train Accuracy: 0.6109
Epoch 4/25, Loss: 0.6247, Train Accuracy: 0.6500
Epoch 5/25, Loss: 0.5910, Train Accuracy: 0.6833
Epoch 6/25, Loss: 0.5560, Train Accuracy: 0.7081
Epoch 7/25, Loss: 0.5296, Train Accuracy: 0.7285
Epoch 8/25, Loss: 0.5029, Train Accuracy: 0.7401
Epoch 9/25, Loss: 0.4802, Train Accuracy: 0.7654
Epoch 10/25, Loss: 0.4555, Train Accuracy: 0.7815
Epoch 11/25, Loss: 0.4356, Train Accuracy: 0.7997
Epoch 12/25, Loss: 0.4125, Train Accuracy: 0.8146
Epoch 13/25, Loss: 0.3886, Train Accuracy: 0.8271
Epoch 14/25, Loss: 0.3627, Train Accuracy: 0.8434
Epoch 15/25, Loss: 0.3412, Train Accuracy: 0.8589
Epoch 16/25, Loss: 0.3184, Train Accuracy: 0.8661
Epoch 17/25, Loss: 0.2978, Train Accuracy: 0.8782
Epoch 18/25, Loss: 0.2736, Train Accuracy: 0.8876
Epoch 19/25, Loss: 0.2527, Train Accuracy: 0.9022
Epoch 20/25, Loss: 0.2455, Train Accuracy: 0.9032
Epoch 21/

In [None]:
test_accuracy = test(model, test_loader, num_examples=10)
print(f"\nTest Accuracy: {test_accuracy}")



Correctly Predicted Examples:
Words: notched (/ˈnɑtʃt/), buerge (/ˈbjuɹdʒ/) - Label: 0.0, Prediction: 0.0
Words: luque (/ˈɫuk/), glueck (/ˈɡɫuk/) - Label: 1.0, Prediction: 1.0
Words: walck (/ˈwɔɫk/), paulk (/ˈpɔɫk/) - Label: 1.0, Prediction: 1.0
Words: obst (/ˈɑbst/), cusp (/ˈkəsp/) - Label: 0.0, Prediction: 0.0
Words: proved (/ˈpɹuvd/), moved (/ˈmuvd/) - Label: 1.0, Prediction: 1.0
Words: bengt (/ˈbɛŋkt/), tramps (/ˈtɹæmpz/) - Label: 0.0, Prediction: 0.0
Words: lex (/ˈɫɛks/), utz (/ˈəts/) - Label: 0.0, Prediction: 0.0
Words: yorks (/ˈjɔɹks/), corks (/ˈkɔɹks/) - Label: 1.0, Prediction: 1.0
Words: swooned (/ˈswund/), pruned (/ˈpɹund/) - Label: 1.0, Prediction: 1.0
Words: eastes (/ˈists/), priests (/ˈpɹists/) - Label: 1.0, Prediction: 1.0

Incorrectly Predicted Examples:
Words: broome (/ˈbɹum/), fume (/ˈfjum/) - Label: 1.0, Prediction: 0.0
Words: jure (/ˈdʒʊɹ/), twas (/ˈtwəz/) - Label: 0.0, Prediction: 1.0
Words: wreak (/ˈɹik/), rang (/ˈɹæŋ/) - Label: 0.0, Prediction: 1.0
Words: kuhns (

Description of the task:
The task is to create an RNN that predicts whether two words rhyme. The dataset is from https://github.com/open-dict-data/ipa-dict/blob/master/data/en_US.txt. It contains English words and their IPA transcriptions, along with stress signifiers. I cut this dataset down to single-syllable words to focus my training. The single-syllable words and their phonetic transcriptions were then grouped by phoneme ending, where each ending runs from the last vowel sequence (in this case, simply the last vowel sound) to the end of the word. Each group contains every word with the same ending, so any two words in a group necessarily rhyme. This made preparing data for the model straightforward: I created equal numbers of rhyming and non-rhyming pairs by drawing two words from the same group (rhyme) or one word each from two different groups (non-rhyme). The pairs were converted into tensors, which were used as the input for the model. The model trained on a random 80% of the pairs and was tested on the remaining 20%. The functions were set up to be as tunable as possible, so you can set different training sizes, different hyperparameters, etc.

Discussion of results:
The results I achieved were a bit underwhelming. For a task like rhyme recognition, which follows a relatively simple rule-based pattern, I couldn't get the model to perform much better than 70% accuracy on testing no matter how I tuned the hyperparameters, even though the training accuracy reached around 99% in the last epochs. A gap that large suggests the model is overfitting the training pairs. A learning rate of .001 appeared to give the most consistent results. Increasing the sample size did increase accuracy, albeit marginally. What perplexes me the most is that the model will sometimes predict that two words rhyme even when they don't share a single phoneme. Overall, though, an average of around 70% accuracy shows that the model was able to make some connections and do better than random guessing.
EDIT: I realized right around the time I was going to submit that I was inputting the English words to the model instead of the IPA words. I fixed this issue and, to my surprise, the accuracy did not increase by much: I achieved a max accuracy of around 76% even using the phonetic transcriptions. So, I think it is quite interesting that the model performs almost as well with the spellings as input as it does with the pronunciations.
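
For reference, the rule-based pattern mentioned above can be written directly. This is a minimal sketch, assuming the same vowel set and the same "last vowel onward" definition of an ending used in preprocessing; `ending_from_last_vowel` and `rule_based_rhyme` are hypothetical helpers. Since the labels in this notebook were generated from that same rule, the rule reproduces them by construction, so it works as a sanity check rather than a fair competitor to the RNN. The two pairs below are ones the model got wrong in the test output.

```python
VOWELS = 'aeiouɑɔɛɪʌʊæɐɜɞɘɵʉɨɤəɚɝ'

def ending_from_last_vowel(ipa):
    """Return the transcription from its last vowel symbol to the end."""
    last_vowel = max(i for i, ch in enumerate(ipa) if ch in VOWELS)
    return ipa[last_vowel:]

def rule_based_rhyme(ipa1, ipa2):
    """Two words rhyme under this rule when their endings are identical."""
    return ending_from_last_vowel(ipa1) == ending_from_last_vowel(ipa2)

# (ipa1, ipa2, label) taken from the incorrectly predicted examples.
examples = [('/ˈbɹum/', '/ˈfjum/', True), ('/ˈɹik/', '/ˈɹæŋ/', False)]
for ipa1, ipa2, label in examples:
    print(ipa1, ipa2, 'label:', label, 'rule:', rule_based_rhyme(ipa1, ipa2))
```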

Challenges Faced:
Overall, the biggest challenge I faced with this project was creating the dataset. The preprocessing was not a simple task. I needed to filter the original dataset, modify the new dataset I created from it, group the words, and then find a way to create the actual training/testing dataset, all of which consumed a significant amount of time. The preprocessing took around 2-3x as long as creating, training, testing, and tuning the model. It was not only time consuming to convert the original dataset into something more appropriate for this model, but also strenuous in terms of design choices and determining the best way to create rhyme and non-rhyme pairs. I struggled to find the best way to ensure my training and testing data included varied phonemes and phoneme groups, so I settled on using randomness to create my dataset and assumed that increasing the size of the dataset would produce varied results. However, this method provides no guarantee that all phonemes show up, nor does it guarantee that any one phoneme shows up in the same proportion as it actually appears in the dataset.
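
The coverage concern above can at least be measured. Here is a minimal sketch that counts how often each ending group appears in a list of generated pairs and reports the groups that never appear. `ending_coverage` and the `toy_*` data are hypothetical; in the notebook, the pairs would be the tuples returned by generate_full_ipa_pairs and the groups would be `endings`.

```python
from collections import Counter

def ending_coverage(pairs, word_to_ending):
    """Count how many times each ending group is used across all pairs."""
    counts = Counter()
    for word1, word2, *_ in pairs:  # extra tuple fields (label, ipa) are ignored
        counts[word_to_ending[word1]] += 1
        counts[word_to_ending[word2]] += 1
    return counts

toy_groups = {
    'əm/': ['mum', 'swum'],
    'ɪnst/': ['minced', 'winced'],
    'uk/': ['luke', 'glueck'],
}
# Reverse lookup: word -> the ending group it belongs to.
word_to_ending = {w: ending for ending, ws in toy_groups.items() for w in ws}

toy_pairs = [('mum', 'swum', True), ('minced', 'mum', False)]
counts = ending_coverage(toy_pairs, word_to_ending)
missing = set(toy_groups) - set(counts)
print(counts)
print('never sampled:', missing)
```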

Discussion of Resources Used:
The first resource used was a dataset of English words and their phonetic transcriptions, including stress marks to denote which syllable in each word is stressed: https://github.com/open-dict-data/ipa-dict/blob/master/data/en_US.txt.
I then modified this dataset and created a new one containing only the single-syllable words. That is the dataset I used to collect groupings of rhymes and non-rhymes, which were the actual data converted into tensors for the RNN. I used PyTorch for the RNN as that is the main ML library I am used to at this point.

As a further step:
I think the first step I would take next in this project is changing the way preprocessing is done. While I think collecting random groupings was the easiest method and achieved fine results, I think that something less random would produce better results overall. It's possible that none of the most common phoneme endings were included in the training dataset, or that all of the least used phoneme endings were, while the testing dataset contained only the most common phonetic endings, and vice versa. So, I think the best next step would be to rethink how to preprocess this data using a more mathematical/probabilistic approach. Apart from that, I would love to see how well this model performs with multi-syllabic words, so that is another obvious next step.
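
One possible shape for a less random split is sketched below, assuming the goal is for every ending group to be represented on both sides: each group's words are divided 80/20 first, and pairs would then be generated separately from the training groups and the testing groups. `stratified_word_split` and the `toy_*` data are hypothetical, and I have not run this against the model.

```python
import random

def stratified_word_split(groups, train_fraction=0.8, seed=0):
    """Split each ending group's words into train and test portions."""
    rng = random.Random(seed)
    train_groups, test_groups = {}, {}
    for ending, words in groups.items():
        shuffled = words[:]          # copy so the original group is untouched
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * train_fraction)
        train_groups[ending] = shuffled[:cut]
        test_groups[ending] = shuffled[cut:]
    return train_groups, test_groups

toy_groups = {
    'uz/': ['blues', 'booze', 'bruise', 'choose', 'clues',
            'cruise', 'fuse', 'lose', 'muse', 'ooze'],
    'ɔɹs/': ['coarse', 'course', 'force', 'hoarse', 'horse'],
}
toy_train, toy_test = stratified_word_split(toy_groups)
for ending in toy_groups:
    print(ending, len(toy_train[ending]), len(toy_test[ending]))
```

A side effect of splitting by word is that no test word appears in training. Small groups are a limitation: a group needs at least two words on the test side before it can supply a rhyming test pair.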