## *Data Preprocessing for Next-Word Prediction*

*This notebook handles the preprocessing of the Penn Treebank dataset.*

*It covers loading the dataset, tokenizing the sentences, building the vocabulary, and converting the text into numerical sequences ready for model training.*


In [1]:
import numpy as np
from pathlib import Path
from nltk.tokenize import word_tokenize 

### *1. Load and Tokenize the Dataset*

*This function reads one text file (call it once for each of the train, validation, and test splits), treats each line as a sentence, tokenizes it with NLTK's `word_tokenize`, and returns a list of token lists. Blank lines are skipped.*


In [2]:

# Read one split of the dataset and tokenize it line by line
def load_and_tokenize(file_path):
    sentences = []
    # Each line of the file is treated as one sentence.
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            tokens = word_tokenize(line.strip())
            if tokens:  # skip blank lines
                sentences.append(tokens)
    return sentences
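
*To illustrate the shape of the returned data, here is a minimal sketch. It is not the function above: `load_and_tokenize_whitespace` is a hypothetical stand-in that splits on whitespace, so that it runs without NLTK's tokenizer data. The file name and sample text are made up for the example.*

```python
import tempfile
from pathlib import Path


def load_and_tokenize_whitespace(file_path):
    """Hypothetical stand-in for load_and_tokenize: splits each line on whitespace."""
    sentences = []
    with open(file_path, "r", encoding="utf-8") as f:
        for line in f:
            tokens = line.split()
            if tokens:  # skip blank lines
                sentences.append(tokens)
    return sentences


# Write a tiny sample file (made-up content) and load it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "sample.txt"
    sample_path.write_text("the cat sat\n\nthe dog ran away\n", encoding="utf-8")
    sample_sentences = load_and_tokenize_whitespace(sample_path)

print(sample_sentences)
# [['the', 'cat', 'sat'], ['the', 'dog', 'ran', 'away']]
```

*If your copy of the dataset is already space-separated (an assumption worth checking), whitespace splitting also keeps markers such as `<unk>` in one piece, whereas `word_tokenize` separates the angle brackets into their own tokens.*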

### *2. Build Vocabulary*

*This step builds a mapping between words and indices.*
- `word_to_index`: maps each unique word to an integer.
- `index_to_word`: reverse mapping, used for decoding predictions.

*Index 0 is reserved for `<PAD>` (padding) and index 1 for `<UNK>` (unknown words); real words start at index 2 and are numbered in sorted order.*


In [3]:
def build_vocab(sentences):
    # Flatten the list of sentences into a single list of tokens.
    all_tokens = [token for sent in sentences for token in sent]
    vocab = set(all_tokens)
    # Indices 0 and 1 are reserved for the special tokens.
    word_to_index = {"<PAD>": 0, "<UNK>": 1}
    # Sorting makes the word-to-index assignment deterministic.
    for idx, word in enumerate(sorted(vocab), start=2):
        word_to_index[word] = idx
    # Reverse mapping, used to turn predicted indices back into words.
    index_to_word = {idx: word for word, idx in word_to_index.items()}
    return word_to_index, index_to_word
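
*As a quick check of the mapping, the sketch below repeats `build_vocab` so that it runs on its own and applies it to two made-up sentences. The toy data and the word `"bird"` are assumptions for illustration only.*

```python
def build_vocab(sentences):
    # Same logic as the cell above, repeated so this example is self-contained.
    vocab = {token for sent in sentences for token in sent}
    word_to_index = {"<PAD>": 0, "<UNK>": 1}
    for idx, word in enumerate(sorted(vocab), start=2):
        word_to_index[word] = idx
    index_to_word = {idx: word for word, idx in word_to_index.items()}
    return word_to_index, index_to_word


toy_sentences = [["the", "cat", "sat"], ["the", "dog", "sat"]]
word_to_index, index_to_word = build_vocab(toy_sentences)

print(word_to_index)
# {'<PAD>': 0, '<UNK>': 1, 'cat': 2, 'dog': 3, 'sat': 4, 'the': 5}

# A word that never appeared in the input falls back to the <UNK> index.
unseen_index = word_to_index.get("bird", word_to_index["<UNK>"])
print(unseen_index)
# 1

# Decoding a list of indices back into words.
decoded = [index_to_word[i] for i in [5, 3, 4]]
print(decoded)
# ['the', 'dog', 'sat']
```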

### *3. Prepare Data for Model Training*

*This function converts tokenized sentences into numerical sequences using the vocabulary.*

*Each sequence is padded with `<PAD>` (index 0) or truncated so that it has exactly `max_len` tokens. It is then split into inputs and outputs for next-word prediction, each of length `max_len - 1`:*
- *Input: [word1, word2, ..., word_(n-1)]*
- *Output: [word2, word3, ..., word_n]*

In [4]:
def prepare_data(sentences, word_to_index, max_len=None):
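    if max_len is None:
        # Default to the longest sentence so that nothing is truncated.
        max_len = max(len(s) for s in sentences)
    inputs, outputs = [], []
    for sentence in sentences:
        # Map tokens to indices; out-of-vocabulary tokens become <UNK> (index 1).
        indices = [word_to_index.get(token, 1) for token in sentence]
        if len(indices) > max_len:
            indices = indices[:max_len]
        else:
            # Right-pad with <PAD> (index 0) up to max_len.
            indices += [0] * (max_len - len(indices))
        # Shift by one: the target at position t is the token at position t + 1.
        inputs.append(indices[:-1])
        outputs.append(indices[1:])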
    return np.array(inputs), np.array(outputs)
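
*The per-sentence step can be traced by hand. The sketch below is a simplified, pure-Python version of the loop body in `prepare_data`; the helper name `encode_and_shift` and the toy vocabulary are hypothetical and exist only for this illustration.*

```python
PAD_ID, UNK_ID = 0, 1  # same reserved indices as in the vocabulary step


def encode_and_shift(sentence, word_to_index, max_len):
    """Hypothetical helper: encode one sentence, pad or truncate, then shift by one."""
    indices = [word_to_index.get(tok, UNK_ID) for tok in sentence]
    # Truncate to max_len, then right-pad if the sentence was shorter.
    indices = indices[:max_len] + [PAD_ID] * max(0, max_len - len(indices))
    return indices[:-1], indices[1:]


toy_vocab = {"<PAD>": 0, "<UNK>": 1, "cat": 2, "dog": 3, "sat": 4, "the": 5}

# Short sentence: padded from 3 tokens up to max_len = 5.
x_short, y_short = encode_and_shift(["the", "cat", "sat"], toy_vocab, max_len=5)
print(x_short, y_short)
# [5, 2, 4, 0] [2, 4, 0, 0]

# Long sentence with unknown words: truncated from 6 tokens down to 5.
x_long, y_long = encode_and_shift(
    ["the", "bird", "sat", "down", "today", "again"], toy_vocab, max_len=5
)
print(x_long, y_long)
# [5, 1, 4, 1] [1, 4, 1, 1]
```

*Note that padded positions also appear in the outputs as index 0, so a model trained on these arrays will see `<PAD>` as a target unless those positions are handled separately.*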