# 1. Background Problem (20%)
Language modeling is a fundamental task in Natural Language Processing (NLP), used in various applications like predictive typing, text generation, and spelling correction. For this project, I chose the Sci-Fi Stories Text Corpus available on Kaggle. Sci-Fi literature is linguistically rich and imaginative, often pushing boundaries of vocabulary and structure. Modeling such text is both challenging and rewarding, and it provides an exciting opportunity to explore how well statistical language models and autocorrect systems can handle complex and creative writing.

# 2. Resource

We used the following dataset found from kaggle:

Sci-Fi Stories Text Corpus by Jannes Klaas: 
- https://www.kaggle.com/datasets/jannesklaas/scifi-stories-text-corpus

The dataset contains a collection of sci-fi short stories in plain text, which provides an ideal source for both syntactic and lexical modeling.

# 3. Methods (10%)
## We applied the following methods:

- Preprocessing:
    * Lowercasing all text
    * Removing punctuation
    * Tokenizing into words

- Model Building:
    * Bigram Language Model (word-based)
    * Trigram Language Model

- Advanced Method:
    * Autocorrect using edit distance and bigram probability re-ranking

## 4. Model Implementation Code (50%)

# 5. Evaluation of Model
## 5a. Performance Metrics (10%)
Since this is a language generation and correction task, we use qualitative evaluation:
- Coherence of generated sentences
- Accuracy of autocorrect predictions (manually tested)

## 5b. Evaluation Code & Result
We evaluate our model using the following code to generate:
- A sentence from the bigram model
- Autocorrect outputs for various intentionally misspelled words

This demonstrates the qualitative performance of the language model and correction system.

# 6. Conclusion & Future Work (5%)
Our bigram and trigram models were able to generate reasonable Sci-Fi themed text based on the training corpus. The autocorrect system showed good potential in correcting common misspellings using both edit distance and word frequency.

Future work:
- Use of smoothing techniques for unseen n-grams
- Implementation of transformer-based models (e.g., GPT)
- Better evaluation using a held-out test set and BLEU/Perplexity scores

In [22]:
import re
import string
from collections import defaultdict, Counter
import nltk
nltk.data.path.append("/Users/stevgo/nltk_data")
nltk.download('averaged_perceptron_tagger', download_dir='/Users/stevgo/nltk_data')

from nltk import pos_tag, word_tokenize
print(pos_tag(word_tokenize("This is a test sentence.")))
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

def load_corpus(file_path):
    with open(file_path, 'r', encoding='utf-8') as f:
        return f.read()

def preprocess_text(text):
    text = text.lower()
    text = re.sub(f"[{re.escape(string.punctuation)}]", "", text)
    return nltk.word_tokenize(text)

def build_ngram_model_with_pos(words, n=2):
    pos_tags = nltk.pos_tag(words)
    ngrams = defaultdict(Counter)
    for i in range(len(pos_tags) - n + 1):
        prefix = tuple(pos_tags[i:i + n - 1])  # includes POS
        next_word = pos_tags[i + n - 1][0]
        ngrams[prefix][next_word] += 1
    return ngrams

def predict_next_words_pos(model, prefix, top_k=5):
    prefix_words = nltk.word_tokenize(prefix.lower())
    pos_tags = nltk.pos_tag(prefix_words)
    for n in range(len(pos_tags), 0, -1):
        sub_prefix = tuple(pos_tags[-n:])
        if sub_prefix in model:
            return [word for word, _ in model[sub_prefix].most_common(top_k)]
    return []

def autocomplete_pos_interface(file_path, user_input, n=2, top_k=5):
    raw_text = load_corpus(file_path)
    tokens = preprocess_text(raw_text)
    model = build_ngram_model_with_pos(tokens, n)
    return predict_next_words_pos(model, user_input, top_k)

# Example:
# autocomplete_pos_interface("don.txt", "the brave", n=3)


[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/stevgo/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


LookupError: 
**********************************************************************
  Resource [93maveraged_perceptron_tagger_eng[0m not found.
  Please use the NLTK Downloader to obtain the resource:

  [31m>>> import nltk
  >>> nltk.download('averaged_perceptron_tagger_eng')
  [0m
  For more information see: https://www.nltk.org/data.html

  Attempted to load [93mtaggers/averaged_perceptron_tagger_eng/[0m

  Searched in:
    - '/Users/stevgo/nltk_data'
    - '/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/nltk_data'
    - '/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/share/nltk_data'
    - '/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - '/Users/stevgo/nltk_data'
    - '/Users/stevgo/nltk_data'
    - '/Users/stevgo/nltk_data'
    - '/Users/stevgo/nltk_data'
    - '/Users/stevgo/nltk_data'
    - '/Users/stevgo/nltk_data'
    - '/Users/stevgo/nltk_data'
    - '/Users/stevgo/nltk_data'
    - '/Users/stevgo/nltk_data'
    - '/Users/stevgo/nltk_data'
**********************************************************************
