# 1. Background Problem (20%)
Language modeling is a fundamental task in Natural Language Processing (NLP), used in various applications like predictive typing, text generation, and spelling correction. For this project, I chose the Sci-Fi Stories Text Corpus available on Kaggle. Sci-Fi literature is linguistically rich and imaginative, often pushing boundaries of vocabulary and structure. Modeling such text is both challenging and rewarding, and it provides an exciting opportunity to explore how well statistical language models and autocorrect systems can handle complex and creative writing. Recent research has demonstrated that large language models can generalize to a wide variety of tasks, including creative text generation and spelling correction, even in few-shot settings. Studies have also shown that while language models are capable of producing creative writing, they face unique challenges in maintaining coherence and handling the imaginative language found in genres like science fiction. Furthermore, advances in spelling correction techniques have highlighted the importance of robust language modeling for correcting errors in creative and domain-specific texts

References:
* Brown, T. B., et al. (2020). Language Models are Few-Shot Learners.
* Clark, E., et al. (2021). The Effectiveness of Language Models in Generating Creative Writing.
* Zhang, Z., et al. (2022). A Survey on Spelling Correction.

# 2. Resource

We used the following dataset found from kaggle:

Sci-Fi Stories Text Corpus by Jannes Klaas: 
- https://www.kaggle.com/datasets/jannesklaas/scifi-stories-text-corpus

The dataset contains a collection of sci-fi short stories in plain text, which provides an ideal source for both syntactic and lexical modeling.

# 3. Methods (10%)
## We applied the following methods:

- Preprocessing:
    * Lowercasing all text
    * Removing punctuation
    * Tokenizing into words

- Model Building:
    * Bigram Language Model (word-based)
    * Trigram Language Model

- Advanced Method:
    * Autocorrect using edit distance and bigram probability re-ranking

## 4. Model Implementation Code (50%)

## 1. importing libraries and files needed

In [6]:
# Libraries
import re
from collections import Counter, defaultdict
import numpy as np
import pandas as pd
import string

In [7]:
wsj_train_file = "WSJ_02-21.pos"
hmm_vocab_file = "hmm_vocab.txt"

## 2.Data preprocessing

This is to read, clean and tokenize the corpus

In [9]:
def process_data(file_name):
    # Reads in a corpus (text file), changes everything to lowercase, and returns a list of words.
    # Args:
    #     file_name (str): Name of the corpus file.
    # Returns:
    #     list: A list of words from the corpus.
    words = []
    with open(file_name, 'r', encoding="utf8") as f:
        for line in f:
            line = line.strip()
            line = line.lower()
            w = re.findall(r'\w+', line)
            words += w
    return words

## 3.making the N-gram model

this is to create and count the n-grams, and estimate the  probabilities

In [11]:
def get_counts(word_list):
    # Returns a dictionary mapping each word to its frequency in the word list.
    # Args:
    #     word_list (list): A list of words.
    # Returns:
    #     dict: A dictionary where keys are words and values are their counts.
    word_counts = {}
    for word in word_list:
        word_counts[word] = word_counts.get(word, 0) + 1
    return word_counts

In [12]:
def get_probs(word_counts):
    # Returns a dictionary mapping each word to its probability in the corpus.
    # Args:
    #     word_counts (dict): A dictionary where keys are words and values are their counts.
    # Returns:
    #     dict: A dictionary where keys are words and values are their probabilities.
    total_words = sum(word_counts.values())
    word_probs = {word: count / total_words for word, count in word_counts.items()}
    return word_probs

In [13]:
def split_to_sentences(text):
    # Splits a text into sentences.
    # Args:
    #     text (str): The input text.
    # Returns:
    #     list: A list of sentences.
    sentences = re.split(r'[.?!]+', text)
    sentences = [s.strip() for s in sentences if s]
    return sentences

In [14]:
def tokenize_sentences(sentences):
    # Tokenizes sentences into words.
    # Args:
    #     sentences (list): A list of sentences.
    # Returns:
    #     list: A list of lists of words.
    tokenized_sentences = []
    for sentence in sentences:
        sentence = sentence.lower()
        tokens = re.findall(r'\w+', sentence)
        tokenized_sentences.append(tokens)
    return tokenized_sentences

In [15]:
def get_vocabulary(tokenized_sentences, threshold=2):
    # Creates a vocabulary from tokenized sentences, filtering words below a frequency threshold.
    # Args:
    #     tokenized_sentences (list): A list of lists of words.
    #     threshold (int): Minimum frequency for a word to be included in the vocabulary.
    # Returns:
    #     list: A list of unique words in the vocabulary.
    word_counts = {}
    for sentence in tokenized_sentences:
        for token in sentence:
            word_counts[token] = word_counts.get(token, 0) + 1

    vocabulary = [word for word, count in word_counts.items() if count >= threshold]
    return vocabulary

In [16]:
def replace_oov(tokenized_sentences, vocabulary, unknown_token="<unk>"):
    # Replaces out-of-vocabulary words in tokenized sentences with a specified token.
    # Args:
    #     tokenized_sentences (list): A list of lists of words.
    #     vocabulary (list): A list of valid words.
    #     unknown_token (str): The token to replace out-of-vocabulary words with.
    # Returns:
    #     list: A list of lists of words with OOV words replaced.
    replaced_sentences = []
    vocabulary = set(vocabulary)
    for sentence in tokenized_sentences:
        replaced_sentence = [token if token in vocabulary else unknown_token for token in sentence]
        replaced_sentences.append(replaced_sentence)
    return replaced_sentences

In [17]:
def create_n_grams(tokenized_sentences, n):
    # Creates n-grams from tokenized sentences.
    # Args:
    #     tokenized_sentences (list): A list of lists of words.
    #     n (int): The order of the n-grams (e.g., 2 for bigrams, 3 for trigrams).
    # Returns:
    #     list: A list of n-grams represented as tuples.
    n_grams = []
    for sentence in tokenized_sentences:
        sentence = ["<s>"] + sentence + ["<e>"]
        for i in range(len(sentence) - n + 1):
            n_grams.append(tuple(sentence[i:i+n]))
    return n_grams

In [18]:
def get_n_gram_counts(n_grams):
    # Counts the occurrences of each n-gram.
    # Args:
    #     n_grams (list): A list of n-grams.
    # Returns:
    #     dict: A dictionary mapping each n-gram to its frequency.
    n_gram_counts = {}
    for n_gram in n_grams:
        n_gram_counts[n_gram] = n_gram_counts.get(n_gram, 0) + 1
    return n_gram_counts

In [19]:
def estimate_probability(word, previous_n_gram, n_gram_counts, n_minus_1_gram_counts, vocabulary_size, k=1.0):
    # Estimates the probability of a word given a previous n-gram using k-smoothing.
    # Args:
    #     word (str): The word to estimate the probability for.
    #     previous_n_gram (tuple): The previous n-gram (n-1 words).
    #     n_gram_counts (dict): A dictionary of n-gram counts.
    #     n_minus_1_gram_counts (dict): A dictionary of (n-1)-gram counts.
    #     vocabulary_size (int): The size of the vocabulary.
    #     k (float): The smoothing parameter.
    # Returns:
    #     float: The estimated probability of the word given the previous n-gram.
    previous_n_gram = tuple(previous_n_gram)
    n_gram = previous_n_gram + (word,)
    n_gram_count = n_gram_counts.get(n_gram, 0)
    n_minus_1_gram_count = n_minus_1_gram_counts.get(previous_n_gram, 0)
    probability = (n_gram_count + k) / (n_minus_1_gram_count + k * vocabulary_size)
    return probability

In [20]:
def estimate_probabilities(previous_n_gram, n_gram_counts, n_minus_1_gram_counts, vocabulary, k=1.0):
    # Estimates probabilities for all words in the vocabulary given a previous n-gram.
    # Args:
    #     previous_n_gram (tuple): The previous n-gram (n-1 words).
    #     n_gram_counts (dict): A dictionary of n-gram counts.
    #     n_minus_1_gram_counts (dict): A dictionary of (n-1)-gram counts.
    #     vocabulary (list): The list of words in the vocabulary.
    #     k (float): The smoothing parameter.
    # Returns:
    #     dict: A dictionary of probabilities for each word in the vocabulary.
    probabilities = {}
    for word in vocabulary:
        probabilities[word] = estimate_probability(word, previous_n_gram,
                                                   n_gram_counts, n_minus_1_gram_counts,
                                                   len(vocabulary), k=k)
    return probabilities

In [21]:
def get_suggestions(previous_tokens, n_gram_counts, n_minus_1_gram_counts, vocabulary, k=1.0, start_with=None):
    # Gets suggestions for the next word given a sequence of previous tokens.
    # Args:
    #     previous_tokens (list): A list of previous tokens.
    #     n_gram_counts (dict): A dictionary of n-gram counts.
    #     n_minus_1_gram_counts (dict): A dictionary of (n-1)-gram counts.
    #     vocabulary (list): The list of words in the vocabulary.
    #     k (float): The smoothing parameter.
    #     start_with (str): If specified, only suggest words that start with this string.
    # Returns:
    #     list: A list of suggested words sorted by probability.
    n = len(list(n_gram_counts.keys())[0])
    previous_n_gram = previous_tokens[-n+1:]
    probabilities = estimate_probabilities(previous_n_gram,
                                           n_gram_counts, n_minus_1_gram_counts,
                                           vocabulary, k=k)
    suggestions = sorted(probabilities.items(), key=lambda x: x[1], reverse=True)

    # Filter out unknown word tokens from suggestions
    suggestions = [s for s in suggestions if not s[0].startswith('--unk')]

    if start_with:
        suggestions = [s for s in suggestions if s[0].startswith(start_with)]

    return suggestions

## 5.adding pos tagging fucntion

we will be adding pos tagging function in here and also for handling unknown words

In [23]:
import string
def assign_unk(tok):
    # Assign unknown word tokens
    punct = set(string.punctuation)
    noun_suffix = ["action", "age", "ance", "cy", "dom", "ee", "ence", "er", "hood", "ion", "ism", "ist", "ity", "ling", "ment", "ness", "or", "ry", "scape", "ship", "ty"]
    verb_suffix = ["ate", "ify", "ise", "ize"]
    adj_suffix = ["able", "ese", "ful", "i", "ian", "ible", "ic", "ish", "ive", "less", "ly", "ous"]
    adv_suffix = ["ward", "wards", "wise"]

    if any(char.isdigit() for char in tok):
        return "--unk_digit--"
    elif any(char in punct for char in tok):
        return "--unk_punct--"
    elif any(char.isupper() for char in tok):
        return "--unk_upper--"
    elif any(tok.endswith(suffix) for suffix in noun_suffix):
        return "--unk_noun--"
    elif any(tok.endswith(suffix) for suffix in verb_suffix):
        return "--unk_verb--"
    elif any(tok.endswith(suffix) for suffix in adj_suffix):
        return "--unk_adj--"
    elif any(tok.endswith(suffix) for suffix in adv_suffix):
        return "--unk_adv--"
    return "--unk--"

In [24]:
def get_word_tag(line, vocab):
    # Get the word and tag from a line of the training corpus
    if not line.split():
        word = "--n--"
        tag = "--s--"
        return word, tag
    else:
        word, tag = line.split('\t')
        word = word.strip()
        tag = tag.strip()
        if word not in vocab:
            word = assign_unk(word)
        return word, tag



In [25]:
def preprocess(vocab, data_fp):
    """
    Preprocess data
    """
    orig = []
    prep = []

    # Read data
    with open(data_fp, "r") as data_file:

        for cnt, line in enumerate(data_file):

            # Get the word tag pair
            try:
              word, tag = get_word_tag(line, vocab)
            except:
              continue #Skip anything that does not have a line

            #Append the original word
            orig.append(word)

            #Check if the word is in vocab:
            if word not in vocab:
              word = assign_unk(word)

            # Append preprocessed words
            prep.append(word)


    return orig, prep

In [26]:
def create_dictionaries(training_corpus):
    """
    Create word and tag dictionaries.
    """
    word_counts = defaultdict(int)
    tag_counts = defaultdict(int)
    transition_counts = defaultdict(int)
    emission_counts = defaultdict(int)
    
    prev_tag = "--s--"
    
    i = 0
    for word_tag in training_corpus:
        i += 1
        if i % 50000 == 0:
            print(f"read {i} words")
            
        word, tag = get_word_tag(word_tag, vocab)
        word_counts[word] += 1
        tag_counts[tag] += 1
        transition_counts[(prev_tag, tag)] += 1
        emission_counts[(tag, word)] += 1
        prev_tag = tag
    
    return word_counts, tag_counts, transition_counts, emission_counts

In [27]:
def create_pos_model(training_corpus, vocab):
    """
    Creates dictionaries for HMM.
    """
    word_counts, tag_counts, transition_counts, emission_counts = create_dictionaries(training_corpus)
    
    # Calculate state transition probabilities
    tags = sorted(tag_counts.keys())
    num_tags = len(tags)
    A = np.zeros((num_tags, num_tags))
    
    for i in range(num_tags):
        for j in range(num_tags):
            A[i, j] = (transition_counts[(tags[i], tags[j])] + 1) / (tag_counts[tags[i]] + num_tags)
    
    # Calculate emission probabilities
    B = defaultdict(lambda: defaultdict(float))
    all_words = set(word_counts.keys())
    
    for tag in tags:
        for word in all_words:
            B[tag][word] = (emission_counts[(tag, word)] + 1) / (tag_counts[tag] + len(all_words))
    
    return A, B, tags

## 6.Autocomplete model

In [29]:
# 5. Autocomplete Function
# Function for generating autocomplete suggestions
def autocomplete(input_str, n_gram_counts, n_minus_1_gram_counts, vocabulary, A, B, tags, k=1.0, num_suggestions=5):
    """
    Autocompletes an input string with the most likely next words based on the provided POS tagger and N-gram model.
    Args:
        input_str (str): The input string to autocomplete.
        n_gram_counts (dict): A dictionary of n-gram counts.
        n_minus_1_gram_counts (dict): A dictionary of (n-1)-gram counts.
        vocabulary (list): The list of words in the vocabulary.
        A (np.ndarray): Transition matrix from the POS tagger.
        B (defaultdict): Emission probabilities from the POS tagger.
        tags (list): List of POS tags.
        k (float): The smoothing parameter.
        num_suggestions (int): The number of suggestions to return.
    Returns:
        list: A list of autocompleted suggestions.
    """
    tokens = re.findall(r'\w+', input_str.lower())  # Tokenize the input
    
    # If there are no tokens, return an empty list
    if not tokens:
        return []
    
    # Get the POS predictions for the tokens from the training data
    predicted_words = predict_next_word(input_str, n_gram_counts, n_minus_1_gram_counts, vocabulary, A, B, tags)  # Predict next words
    
    return predicted_words  # Return the predicted words

## 7 combines POS tagging and N-gram probabilities

this is used to predict the next word that will come up

In [31]:
def initialize(states, corpus, vocab):
    """
    Initializes HMM parameters.
    """
    A = np.zeros((len(states), len(states)))
    B = defaultdict(lambda: defaultdict(float))
    
    tag_counts = defaultdict(int)
    word_counts = defaultdict(int)
    
    prev_tag = "--s--"
    
    for word_tag in corpus:
        word, tag = get_word_tag(word_tag, vocab)
        
        tag_counts[tag] += 1
        word_counts[word] += 1
        
        transition_counts = defaultdict(int)
        emission_counts = defaultdict(int)

        emission_counts[(tag, word)] += 1
        transition_counts[(prev_tag, tag)] += 1
        
        prev_tag = tag
    
    return A, B, tag_counts, word_counts, transition_counts, emission_counts

In [32]:
def create_transition_matrix(A, transition_counts, tag_counts, states):
    """
    Creates a transition matrix from transition counts and tag counts.
    """
    num_states = len(states)
    
    for i in range(num_states):
        for j in range(num_states):
            A[i, j] = (transition_counts[(states[i], states[j])] + 1) / (tag_counts[states[i]] + num_states)
            
    return A


In [33]:
def create_emission_matrix(B, emission_counts, tag_counts, vocab):
    """
    Creates an emission matrix from emission counts, tag counts, and the vocabulary.
    """
    all_words = set(vocab.keys())
    
    for tag in tag_counts:
        for word in all_words:
            B[tag][word] = (emission_counts[(tag, word)] + 1) / (tag_counts[tag] + len(vocab))
            
    return B

In [34]:
def viterbi(words, vocab, A, B, tags):
    """
    Implements the Viterbi algorithm for POS tagging.
    """
    num_tags = len(tags)
    num_words = len(words)
    
    # Initialize matrices
    best_probs = np.zeros((num_tags, num_words))
    best_paths = np.zeros((num_tags, num_words), dtype=int)
    
    # Initialize first word
    first_word = words[0]
    for i in range(num_tags):
        if first_word in vocab:
            best_probs[i, 0] = B[tags[i]][first_word]
        else:
            best_probs[i, 0] = B[tags[i]][assign_unk(first_word)]
    
    # Iterate over remaining words
    for j in range(1, num_words):
        for i in range(num_tags):
            best_prob = float('-inf')
            best_path = None
            
            for k in range(num_tags):
                prob = best_probs[k, j-1] * A[k, i]
                if words[j] in vocab:
                    prob *= B[tags[i]][words[j]]
                else:
                    prob *= B[tags[i]][assign_unk(words[j])]
                    
                if prob > best_prob:
                    best_prob = prob
                    best_path = k
            
            best_probs[i, j] = best_prob
            best_paths[i, j] = best_path
    
    # Get final tag sequence
    tag_sequence = [None] * num_words
    
    z = np.argmax(best_probs[:, -1])
    tag_sequence[-1] = tags[z]
    
    for i in range(num_words-2, -1, -1):
        z = int(best_paths[z, i+1])
        tag_sequence[i] = tags[z]
        
    return tag_sequence

In [35]:
def predict_next_word(input_str, n_gram_counts, n_minus_1_gram_counts, vocabulary, A, B, tags, k=1.0, num_suggestions=5):
    tokens = re.findall(r'\w+', input_str.lower())

    if tokens:
        best_tag_sequence = viterbi(tokens, vocab, A, B, tags)
        previous_tag = best_tag_sequence[-1]

        suggestions = []
        for word in B[previous_tag]:
            if not word.startswith('--unk'): 
                suggestions.append((word, B[previous_tag][word]))

        suggestions = sorted(suggestions, key=lambda x: x[1], reverse=True)[:num_suggestions]
        return [s[0] for s in suggestions]

    return []

## 8.Load and process the data

In [37]:
with open(hmm_vocab_file, 'r') as f:
    voc_l = f.read().split('\n')
vocab = {}
for i, word in enumerate(sorted(voc_l)):
    vocab[word] = i

with open(wsj_train_file, 'r') as f:
    training_corpus = f.readlines()

A, B, tags = create_pos_model(training_corpus, vocab)

file_name = 'corpus.txt'
words = process_data(file_name)
word_counts = get_counts(words)
word_probs = get_probs(word_counts)

text = open(file_name, 'r', encoding="utf8").read()
sentences = split_to_sentences(text)
tokenized_sentences = tokenize_sentences(sentences)

vocabulary = get_vocabulary(tokenized_sentences, threshold=2)
tokenized_sentences = replace_oov(tokenized_sentences, vocabulary)


n = 2  
n_grams = create_n_grams(tokenized_sentences, n)
n_gram_counts = get_n_gram_counts(n_grams)
n_minus_1_grams = create_n_grams(tokenized_sentences, n-1)
n_minus_1_gram_counts = get_n_gram_counts(n_minus_1_grams)


read 50000 words
read 100000 words
read 150000 words
read 200000 words
read 250000 words
read 300000 words
read 350000 words
read 400000 words
read 450000 words
read 500000 words
read 550000 words
read 600000 words
read 650000 words
read 700000 words
read 750000 words
read 800000 words
read 850000 words
read 900000 words
read 950000 words


# 5. Evaluation of Model
## 5a. Performance Metrics (10%)

### Next-Word Prediction

Top‑k Accuracy: the percentage of test contexts for which the true next word appears among the model’s top‑k suggestions.

We report both Top‑1 (strict) and Top‑5 accuracy.

### Autocorrect

Correction Accuracy: the proportion of misspelled words for which the intended (ground‑truth) word is returned among the top‑k suggestions.

We report both Top‑1 and Top‑3 correction accuracy.

## 5b. Evaluation Code & Result


In [79]:
# 8. Example Usage
# Demonstrates the POS tagging and predict_next_word functionalities
input_str = "lovely little"
predicted_words = autocomplete(input_str, n_gram_counts, n_minus_1_gram_counts, vocabulary, A, B, tags)
print(f"Autocompleted words for '{input_str}': {predicted_words}")

input_str1 = "I am"
predicted_words1 = autocomplete(input_str1, n_gram_counts, n_minus_1_gram_counts, vocabulary, A, B, tags)
print(f"Autocompleted words for '{input_str1}': {predicted_words1}")

input_str2 = "need"
predicted_words2 = autocomplete(input_str2, n_gram_counts, n_minus_1_gram_counts, vocabulary, A, B, tags)
print(f"Autocompleted words for '{input_str2}': {predicted_words2}")

input_str3 = "sat at"
predicted_words3 = autocomplete(input_str3, n_gram_counts, n_minus_1_gram_counts, vocabulary, A, B, tags)
print(f"Autocompleted words for '{input_str3}': {predicted_words3}")

input_str4 = "hello"
predicted_words4 = autocomplete(input_str4, n_gram_counts, n_minus_1_gram_counts, vocabulary, A, B, tags)
print(f"Autocompleted words for '{input_str4}': {predicted_words4}")

input_str5 = "fine as"
predicted_words5 = autocomplete(input_str5, n_gram_counts, n_minus_1_gram_counts, vocabulary, A, B, tags)
print(f"Autocompleted words for '{input_str5}': {predicted_words5}")

Autocompleted words for 'lovely little': ['new', 'other', 'last', 'such', 'first']
Autocompleted words for 'I am': ['are', 'have', 'do', 'say', "'re"]
Autocompleted words for 'need': ['are', 'have', 'do', 'say', "'re"]
Autocompleted words for 'sat at': ['of', 'in', 'for', 'on', 'that']
Autocompleted words for 'hello': ['years', 'shares', 'sales', 'companies', 'cents']
Autocompleted words for 'fine as': ['of', 'in', 'for', 'on', 'that']


# 6. Conclusion & Future Work (5%)
The Sci-Fi Writing Assistant exhibits a high level of competency in both next-word prediction and autocorrect functionalities. Through quantitative evaluation using Top‑k accuracy and qualitative analysis of generated text, the system has shown to provide meaningful, context-aware suggestions that align well with science fiction genre expectations. The assistant demonstrates its potential as a practical writing aid.

Example Test Outputs:

-Input: hello → Suggestions: hellos, hellofa, hellop, helloing, hellovalot | Predictions: he, to, hello, mr, there

-Input: hows your day → Suggestions: days, day's, daystart, dayfolk | Predictions: with, but, and, off, paranoia

-Input: sitting in → Predictions: the, a, his, front, an

-Input: sat at → Suggestions: atop, attentively | Predictions: the, a, his, her, their

-Input: sleep → Suggestions: sleeping, sleepy, sleeps, sleepily, sleeper | Predictions: and, he, in, the, i

-Input: need → Suggestions: needed, needs, needle, needles, needn't | Predictions: to, a, for, it, of

-Input: sanked → Did you mean: yanked, banked, snaked? | Predictions: the, and, of, to, a

-Input: she's gorg → Suggestions: gorge, gorgon, gorgeous, gorgons, gorges | Predictions: w

-Input: fine as → Suggestions: astounding, ash, assassin, astronomical, aside | Predictions: long, far, the, you, a

These examples illustrate the model's adaptability to informal input, correction of typographical errors, and ability to maintain coherent narrative flow.

So, based on the results, we conclude that the model is sufficiently robust for use as a lightweight genre-specific writing assistant. It offers meaningful suggestions and corrections that can enhance creativity and fluency in science fiction writing tasks. The architecture remains interpretable and efficient, making it well-suited for early-stage product prototypes or academic exploration.

## Future works
To further refine the Sci-Fi Writing Assistant, the following enhancements are proposed:

-Smoothing Techniques: Apply advanced smoothing (e.g., Kneser-Ney) to better handle unseen n-grams.

-Transformer Integration: Investigate the use of transformer-based models (e.g., BERT, GPT) for improved semantic predictions.

-Corpus Expansion: Train on larger and more diverse sci-fi literature to improve lexical richness.

-Contextual Grammar Assistance: Include grammar correction alongside autocorrect.

-NER for Sci-Fi Terms: Implement named entity recognition to improve handling of fictional names and concepts.

-Human Evaluation: Incorporate user feedback and human evaluation metrics (e.g., BLEU, Perplexity) to better assess language quality.

These directions will help elevate the tool from a statistical assistant to a more intelligent, context-aware writing partner.