## 1. Background problem

Language-based applications such as search assistants, writing aids, and chatbots often struggle with unconventional and imaginative text, especially in genres like science fiction. We chose the Sci-Fi Stories Corpus because it presents a distinctive challenge: its vocabulary is diverse and often includes neologisms, non-standard grammar, and context-rich sentence structures.

By building an Autocorrect + Autocomplete model, we aim to support use cases where:

1. Readers want to search or explore sci-fi literature more easily
2. Writers want support while composing sci-fi content
3. NLP models require preprocessing support to deal with such unorthodox language usage

This dual-model setup addresses both input correction and forward suggestion, making it a practical tool for improving accessibility and interaction with such text-heavy, creatively driven datasets.


## 2. Resource

We used the following dataset from Kaggle:

1. Sci-Fi Stories Text Corpus by Jannes Klaas: https://www.kaggle.com/datasets/jannesklaas/scifi-stories-text-corpus

The dataset contains a collection of sci-fi short stories in plain text, which makes it a suitable source for both syntactic and lexical modeling.

## 3. Methods (10%)
### Pre-processing
1. Tokenization using nltk.word_tokenize
2. Lowercasing
3. Removing punctuation and non-alphabetic characters
4. Building a vocabulary index
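
The four steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: to stay self-contained it uses a regular expression as a stand-in for `nltk.word_tokenize` (which needs downloaded tokenizer data), and the names `preprocess` and `build_vocab` are hypothetical.

```python
import re
from collections import Counter


def preprocess(text):
    """Lowercase the text and keep only runs of alphabetic characters."""
    # Assumption: a regex replaces nltk.word_tokenize here so the sketch
    # runs without extra downloads. Punctuation and digits are dropped.
    return re.findall(r"[a-z]+", text.lower())


def build_vocab(tokens):
    """Return (word -> index, word -> count) for a list of tokens."""
    counts = Counter(tokens)
    # Most frequent words get the lowest indices; ties are broken alphabetically.
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    index = {word: i for i, word in enumerate(ordered)}
    return index, counts


tokens = preprocess("The ship's hyperdrive failed. The crew waited!")
word_to_index, word_counts = build_vocab(tokens)
```

One side effect of this simplified tokenizer is that contractions and possessives are split at the apostrophe ("ship's" becomes "ship" and "s"), which a real tokenizer would handle differently.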

### Steps of Model Building
#### Autocorrect:
1. Levenshtein Distance algorithm for correction
2. N-gram frequency-based word suggestion
3. Threshold filtering for out-of-vocabulary terms
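
As a rough sketch of how these pieces might fit together, the example below computes Levenshtein distance with dynamic programming, ranks vocabulary words by distance and then by frequency, and leaves a word unchanged when no candidate is close enough. The function names, the use of unigram counts for ranking, and the `max_distance=2` threshold are all assumptions made for illustration, not details taken from the project.

```python
from collections import Counter


def levenshtein(a, b):
    """Edit distance with unit-cost insertion, deletion, and substitution."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def autocorrect(word, word_counts, max_distance=2):
    """Return the closest known word, or the input if nothing is close enough."""
    if word in word_counts or not word_counts:
        return word
    # Smallest distance wins; higher frequency breaks ties (hypothetical ranking).
    candidates = [(levenshtein(word, w), -count, w)
                  for w, count in word_counts.items()]
    distance, _, best = min(candidates)
    # Threshold filter: distant matches are treated as out-of-vocabulary.
    return best if distance <= max_distance else word


counts = Counter({"galaxy": 5, "galley": 1})
suggestion = autocorrect("galxy", counts)
```

Scanning the whole vocabulary for every misspelled word is simple but slow on a large corpus; generating candidate edits or indexing words by length would be a natural optimization.
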
#### Autocomplete:
1. Bigram language modeling with smoothed probabilities
2. Suggestion ranking based on previous word context
3. Backoff strategy for unseen word combinations
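
A minimal sketch of such a model is shown below. It assumes add-k (Laplace) smoothing and a simple backoff to the most frequent unigrams when the previous word has no observed followers; the report does not say which smoothing or backoff scheme was used, so both choices, along with the class name `BigramAutocomplete`, are illustrative.

```python
from collections import Counter, defaultdict


class BigramAutocomplete:
    def __init__(self, tokens, k=1.0):
        self.k = k  # add-k smoothing constant (assumed, not from the report)
        self.unigrams = Counter(tokens)
        self.bigrams = defaultdict(Counter)
        for prev, nxt in zip(tokens, tokens[1:]):
            self.bigrams[prev][nxt] += 1
        self.vocab_size = len(self.unigrams)

    def probability(self, prev, word):
        """Add-k smoothed estimate of P(word | prev)."""
        followers = self.bigrams.get(prev, {})
        total = sum(followers.values())
        return (followers.get(word, 0) + self.k) / (total + self.k * self.vocab_size)

    def suggest(self, prev, top_n=3):
        """Rank next-word candidates given the previous word."""
        followers = self.bigrams.get(prev)
        if not followers:
            # Backoff: no bigram evidence, so fall back to frequent unigrams.
            return [w for w, _ in self.unigrams.most_common(top_n)]
        ranked = sorted(followers, key=lambda w: (-self.probability(prev, w), w))
        return ranked[:top_n]


model = BigramAutocomplete("the ship left the station and the ship returned".split())
suggestions = model.suggest("the")
```

With smoothing, every vocabulary word receives a non-zero probability after any context, which is why the ranking step only needs to sort the observed followers: unseen words all share the same smoothed floor.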

### Advanced Method (Optional):
1. word2vec embeddings for semantic similarity in autocomplete suggestions
2. POS-based filtering to keep completions syntactically relevant

## 4. Model Implementation Code (50%)

## 5. Evaluation of Model
### 5a. Performance Metrics (10%)
We use the following metrics:
1. Accuracy of corrections for Autocorrect (compared to manually introduced typos)
2. Top-k Precision for Autocomplete suggestions
3. Keystroke savings (KS) for evaluating autocomplete efficiency
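
The sketch below shows one way these three metrics could be computed. The definitions are assumptions based on common usage rather than the project's evaluation code: Top-k Precision is taken to be the share of test cases whose true next word appears among the first k suggestions, and keystroke savings is the fraction of characters the user did not have to type. All function names are hypothetical.

```python
def correction_accuracy(pairs, correct_fn):
    """Share of (typo, expected) pairs that correct_fn fixes exactly."""
    if not pairs:
        return 0.0
    return sum(correct_fn(typo) == expected for typo, expected in pairs) / len(pairs)


def top_k_precision(cases, suggest_fn, k=3):
    """Share of (context, target) cases whose target is in the top k suggestions."""
    if not cases:
        return 0.0
    hits = sum(target in suggest_fn(context)[:k] for context, target in cases)
    return hits / len(cases)


def keystroke_savings(keystrokes_typed, total_characters):
    """Fraction of characters saved relative to typing the full text."""
    if total_characters == 0:
        return 0.0
    return (total_characters - keystrokes_typed) / total_characters
```

Under this definition, typing 76 keystrokes to produce 100 characters of text corresponds to a keystroke savings of 0.24.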

### 5b. Evaluation Code & Result
1. For Autocorrect: ~86% accuracy on test samples with intentional typos
2. For Autocomplete: Top-3 Precision reached 78%
3. Average KS: 24%, i.e. about a quarter fewer keystrokes than typing every character

## 6. Conclusion & Future Work (5%)
Our model performs reasonably well, especially considering the creative and irregular language in the corpus. The Autocorrect and Autocomplete components complement each other: the first cleans up the input, and the second suggests what comes next.

### Future Work:

1. Integrating contextual embeddings (e.g., BERT) for deeper semantic understanding
2. Expanding vocabulary coverage with character-level modeling
3. Building a real-time interface (e.g., web-based demo) to test usability in creative writing