Homework 5: Neural Language Models  (& 🎃 SpOoKy 👻 authors 🧟 data) - Task 4
----

Names & Sections
----
Names: __Tijana Cosic (4120) & Max Breslauer-Friedman (4120)__

Task 4: Compare your generated sentences (15 points)
----

In this task, you'll analyze one of the files that you produced in Task 3. You'll need to compare against the corresponding file that we have provided for you that was generated from the vanilla n-gram language model.

Choose *__one__* of the following two options.

Option 1: Evaluate the generated words of *character*-based models
---

Your job for this option is to programmatically measure two things:
1. the percentage of words produced by each model that are valid English words.
2. the percentage of words produced by each model that are valid English words *and* were not seen at train time.

For this task, a word is defined as "characters between _ " or "characters between spaces" (if you replaced your underscores with spaces when you printed out your new sentences).
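
As a minimal sketch of this word definition and the two percentages listed above, the following uses a toy dictionary, a toy training vocabulary, and toy sentences. All three are hypothetical stand-ins for illustration, not the NLTK corpus or the spooky authors data, and the helper names `split_words` and `word_percentages` are not part of the assignment.

```python
def split_words(sentence):
    """Split a generated sentence on underscores or whitespace."""
    return sentence.replace("_", " ").split()


def word_percentages(sentences, dictionary, train_vocab):
    """Return (% valid, % valid and unseen) over all words in the sentences."""
    words = [w.lower() for s in sentences for w in split_words(s)]
    if not words:
        return 0.0, 0.0
    valid = [w for w in words if w in dictionary]
    unseen = [w for w in valid if w not in train_vocab]
    total = len(words)
    return 100 * len(valid) / total, 100 * len(unseen) / total


# Hypothetical toy data, for illustration only
toy_dictionary = {"the", "cat", "sat", "owl"}
toy_train_vocab = {"the", "cat"}
toy_sentences = ["the_cat_sat", "the owl zzt"]

valid_pct, unseen_pct = word_percentages(toy_sentences, toy_dictionary, toy_train_vocab)
print(round(valid_pct, 2), round(unseen_pct, 2))
```

Both percentages share the same denominator (all generated words), which matches how the two measurements are phrased above.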


Make sure to turn in any necessary supporting files along with your submission.


In [1]:
# your imports here
import nltk
nltk.download('words')
import csv
import keras
from keras.preprocessing.text import Tokenizer
import neurallm_utils as nutils
import random

[nltk_data] Downloading package words to /Users/maxnbf/nltk_data...
[nltk_data]   Package words is already up-to-date!
[nltk_data] Downloading package punkt to /Users/maxnbf/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [2]:
# code here!

# abstract into util functions
NGRAM = 3 # The ngram language model you want to train
EMBEDDING_SAVE_FILE_WORD = "spooky_embedding_word.txt" # The file to save your word embeddings to
EMBEDDING_SAVE_FILE_CHAR = "spooky_embedding_char.txt" # The file to save your character embeddings to
TRAIN_FILE = 'spooky_author_train.csv' # The file to train your language model on

# reads in the data tokenized by char and by word
data_by_char = nutils.read_file_spooky(TRAIN_FILE, NGRAM, by_character=True)
data_by_word = nutils.read_file_spooky(TRAIN_FILE, NGRAM, by_character=False)

# tokenizes by words
word_tokenizer = Tokenizer()
word_tokenizer.fit_on_texts(data_by_word)

# loads the english words corpus
english_words = set(nltk.corpus.words.words())

# converts the words that were seen during training into a set
trained_words = set(word_tokenizer.word_index.keys())

def is_valid(word):
    '''
    Checks if a word is a valid English word.
    Parameters:
    word (str): The word to be checked.
    Returns:
    bool: True if the word is a valid English word, False otherwise.
    '''
    
    return word.lower() in english_words

def is_valid_and_unseen(word):
    '''
    Checks if a word is both a valid English word and unseen during training.
    Parameters:
    word (str): The word to be checked.
    Returns:
    bool: True if the word is both a valid English word and unseen during training, False otherwise.
    '''
    
    return word.lower() in english_words and word.lower() not in trained_words

def read_sentences(file_path):
    '''
    Reads the sentences from the specified file path.
    Parameters:
    file_path (str): The path of the file.
    Returns:
    list: A list of sentences in the file.
    '''
    
    with open(file_path, 'r') as file:
        return file.read().split('\n')
    
def process_sentences(sentences, model_type, trained_words, unseen_words=False):
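    '''
    Processes the sentences and extracts the words.
    Parameters:
    sentences (list): The sentences to be processed.
    model_type (str): The model that the sentences are from.
    trained_words (set): The words that were seen during training.
    unseen_words (bool): If True, also counts the valid words that were not seen during training.
    Returns:
    tuple: list of valid (model_type, word) tuples, list of invalid (model_type, word) tuples,
    and the count of valid words unseen at train time (0 if unseen_words is False).
    '''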
    
    # initializes empty lists for the valid and invalid words
    valid_words = []
    invalid_words = []
    
    valid_unseen_count = 0
    
    # for every sentence, splits it into words and checks if the words are valid or not
    for sentence in sentences:
        words = sentence.split()
        
        for word in words:
            if is_valid(word):
                valid_words.append((model_type, word))
                # if unseen_words == True, counts the number of words that are valid words
                # and also unseen at train time
                if unseen_words:
                    if is_valid_and_unseen(word):
                        valid_unseen_count += 1
            else:
                invalid_words.append((model_type, word))
    
    return valid_words, invalid_words, valid_unseen_count

# loads the sentences from the text files
neural_sentences = read_sentences('char_model_sentences.txt')
vanilla_sentences = read_sentences('spooky_vanilla_3_char.txt')

# processes the sentences from both models
neural_valid, neural_invalid, valid_unseen_count_neural = process_sentences(neural_sentences, 'neural', trained_words,
                                                                            unseen_words=True)
vanilla_valid, vanilla_invalid, valid_unseen_count_vanilla = process_sentences(vanilla_sentences, 'vanilla', trained_words,
                                                                               unseen_words=True)

# writes the valid words to a CSV file
with open('valid_words_lms.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(['model', 'sequence'])
    writer.writerows(neural_valid + vanilla_valid)

# writes the invalid words to a CSV file
with open('invalid_words_lms.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(['model', 'sequence'])
    writer.writerows(neural_invalid + vanilla_invalid)

# counts the total number of words generated by each model
total_words_neural = sum(len(sentence.split()) for sentence in neural_sentences)
total_words_vanilla = sum(len(sentence.split()) for sentence in vanilla_sentences)

# calculates and prints the percentage of valid words for both models
neural_valid_percentage = round((len(neural_valid) / total_words_neural * 100), 2)
vanilla_valid_percentage = round((len(vanilla_valid) / total_words_vanilla * 100), 2)

print('Neural Valid Words Percentage:', neural_valid_percentage, '%')
print('Vanilla Valid Words Percentage:', vanilla_valid_percentage, '%')

# calculates and prints the percentage of valid AND unseen words for both models
neural_valid_unseen_percentage = round((valid_unseen_count_neural / total_words_neural * 100), 2)
vanilla_valid_unseen_percentage = round((valid_unseen_count_vanilla / total_words_vanilla * 100), 2)

print('Neural Valid & Unseen Words Percentage:', neural_valid_unseen_percentage, '%')
print('Vanilla Valid & Unseen Words Percentage:', vanilla_valid_unseen_percentage, '%')

Neural Valid Words Percentage: 44.53 %
Vanilla Valid Words Percentage: 43.12 %
Neural Valid & Unseen Words Percentage: 5.2 %
Vanilla Valid & Unseen Words Percentage: 5.24 %


3. How did you determine what a valid English word is? __A sequence counts as a valid English word if its lowercased form appears in NLTK's `words` corpus; otherwise it is counted as invalid.__
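
One limitation of a plain set lookup is that a token with attached punctuation (for example a trailing comma or period) can never match a dictionary entry. A minimal sketch of one possible alternative is to strip leading and trailing punctuation before the lookup. This is not what the code above does, and the helper names and the toy dictionary here are hypothetical.

```python
import string


def normalize(token):
    """Lowercase a token and strip leading/trailing ASCII punctuation."""
    return token.lower().strip(string.punctuation)


def is_valid_normalized(token, dictionary):
    """Check the normalized token against a set of known words."""
    word = normalize(token)
    # A token made only of punctuation normalizes to '' and is never valid
    return bool(word) and word in dictionary


# Hypothetical toy dictionary standing in for the NLTK words corpus
toy_dictionary = {"wasp", "arbor"}

print(is_valid_normalized("Wasp.", toy_dictionary))
print(is_valid_normalized(",", toy_dictionary))
```

Applying this kind of normalization would change the reported percentages, so it would need to be used consistently for both models.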

4. Gather the sequences of characters that are determined not to be words. Sampling at least 100 of these sequences, how many of them *should have* been counted as words in your opinion? __5 sequences.__
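
Because the manual count depends on which sequences happen to be drawn, it can help to make the sample reproducible. A minimal sketch, assuming a fixed seed is acceptable for this task (the function name `sample_sequences` and the seed value are illustrative choices):

```python
import random


def sample_sequences(sequences, k=100, seed=0):
    """Draw up to k sequences without replacement, reproducibly."""
    rng = random.Random(seed)      # private generator, leaves global state alone
    k = min(k, len(sequences))     # avoid ValueError when fewer than k exist
    return rng.sample(sequences, k)


# Hypothetical stand-in for the list of invalid sequences
toy_invalid = ["seq{}".format(i) for i in range(250)]
first_draw = sample_sequences(toy_invalid)
second_draw = sample_sequences(toy_invalid)
print(len(first_draw), first_draw == second_draw)
```

With the same seed and the same input list, the same 100 sequences are drawn each time, so the hand-counted answer can be checked again later.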

In [3]:
# more code here, as needed!

# initializes an empty list to store the invalid words
invalid_words = []

# opens the invalid words CSV file and extracts the invalid words
with open('invalid_words_lms.csv', newline='') as csvfile:
    csv_reader = csv.reader(csvfile)
    next(csv_reader, None)  # skips the header row ('model', 'sequence')
    for row in csv_reader:
        invalid_words.append(row[1])

# randomly samples 100 invalid words
sample_invalid = random.sample(invalid_words, k=100)

print(sample_invalid)

['aneandent', 'hich', ',', 'thich', 'triagot', 'brumpled', 'throake', 'smaysidge.', 'lience', 'metionst', 'makented', 'wers.', 'ned', 'mented', 'hoaterd', ',', 'siturs', 'disless', 'usidiumad', 'itue', 'ankgualog', 'nows', 'sity', 'bled', 'ack', 'dournow', 'themakint.', 'dusubjectishat', 'whow', 'wilen', ',', ',', 'hicher', 'ang', 'earsupone,', 'thads', 'tadenespird', 'arber,', 'ond', 'oure', 'whicusearsespectillut', 'faing', 'wasoung;', 'harbe', 'arfaverressend', 'therld', ',', 'woodvively', 'foll', ',', 'gand', 'nottander', 'afenal', 'oung', 'ment', 'antmor', 'disold', 'lor', 'aticke', 'im', 'mom', 'coment', '.', 'stessacted', 'furneasepoes', 'curan,', 'postaked', 'beforedge', 'ang', 'somake', 'ficapproresten', 'theal', 'ehind', 'hemoreauthe', 'erp', 'whis', 'weacelthat', 'hameakingle', 'aguarges', 'hadear', 'terety', 'uposs', 'lairaw;', "'s", 'fifte', 'suirent', 'relwas', ',', 'ins', 'dently,', 'wic', 'youlleganclocrucint.', 'usleat', 'tharn', 'dify', ',', 'invisompad', 'thous', 'fl

Submit two CSV files alongside this notebook: `valid_words_lms.csv` and `invalid_words_lms.csv`. Both files should have __two__ columns: `model`, `sequence`. `model` will have the value `neural` or `vanilla`. `sequence` will be the corresponding sequence of characters. `valid_words_lms.csv` should contain all sequences from both models you determined to be valid words. `invalid_words_lms.csv` will have all sequences from both models you programmatically determined to be invalid words.
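
Before submitting, the format described above can be sanity-checked with a short script. The following is a sketch under the assumption that the header row is exactly `model,sequence`; the function names `check_rows` and `check_csv` are hypothetical helpers, not part of the assignment.

```python
import csv
import io


def check_rows(rows):
    """Validate the header and model values; return the number of data rows."""
    rows = list(rows)
    if not rows or rows[0] != ["model", "sequence"]:
        raise ValueError("expected header row: model,sequence")
    for line_no, row in enumerate(rows[1:], start=2):
        if len(row) != 2:
            raise ValueError("line {}: expected 2 columns".format(line_no))
        if row[0] not in {"neural", "vanilla"}:
            raise ValueError("line {}: unknown model {!r}".format(line_no, row[0]))
    return len(rows) - 1


def check_csv(path):
    """Run check_rows on a CSV file on disk."""
    with open(path, newline="") as f:
        return check_rows(csv.reader(f))


# Toy in-memory example; a lone comma is a legal sequence once quoted by csv
toy_file = io.StringIO('model,sequence\r\nneural,the\r\nvanilla,","\r\n')
toy_count = check_rows(csv.reader(toy_file))
print(toy_count)
```

In practice one would call `check_csv('valid_words_lms.csv')` and `check_csv('invalid_words_lms.csv')` after the files have been written.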