# Georgian Spellchecker

Georgian is a pretty unique language when it comes to writing: every sound maps cleanly to a single letter, there are no uppercase forms, and the alphabet itself is completely different from Latin or Cyrillic.

This makes Georgian look “simple” on paper, but in practice typing errors still happen all the time: letters get duplicated, swapped, dropped, or replaced. Because Georgian spelling is so consistent, even small mistakes stand out immediately, which makes it a great candidate for a character-level spellchecking approach that learns what valid Georgian words actually look like.

We tackle the problem of spellchecking Georgian words using a character-level sequence-to-sequence neural network. Unlike contextual spellcheckers that rely on surrounding text, this model operates on individual words in isolation and learns to correct spelling errors by transforming a corrupted character sequence into its correct form.

By training on pairs of misspelled and correctly spelled Georgian words, the model internalizes the structural and orthographic rules of the Georgian alphabet and learns how realistic typing errors - such as missing, duplicated, swapped, or mistyped characters - can be repaired at the character level.

## Getting Georgian Words

Firstly, we need a large list of correctly spelled Georgian words.

Plenty of Georgian word lists are available online in a format that is ready to use for model training.

Notable GitHub repos:
- https://github.com/gamag/ka_GE.spell/tree/master/dictionaries
- https://github.com/akalongman/geo-words
- https://github.com/AleksandreSukh/GeorgianWordsDataBase/blob/master/wordsChunk_2.json

As well as online dictionaries:
- http://www.nplg.gov.ge/gwdict/
- https://www.ganmarteba.ge/

Purely for fun, I chose to crawl the [ganmarteba online dictionary](https://www.ganmarteba.ge/), which yielded about 41,000 unique words.

I won't walk through the crawler logic here (scraping the whole dictionary took about an hour), but you can look at the implementation in *ganmarteba.ge_crawler.py* in the *crawler* package.

In [37]:
with open('./words/ganmarteba.ge_words.txt', 'r', encoding='utf-8') as f:
    words = f.read().split('\n')
print(f'Collected total {len(words)} words\n')

print("First couple of words:")
print(words[:5])

Collected total 41180 words

First couple of words:
['ააალებს', 'ააალმასებს', 'ააბარგებს', 'ააბიბინებს', 'ააბნევს']


## Corrupting Words

For training, we need *source* and *target* pairs as input for our model. To simulate real-world typos, I implemented four types of error functions:

- **Deletion** - randomly removes a character from the word.
- **Duplication** - randomly repeats a character.
- **Substitution** - replaces a character with a neighboring keyboard character.
- **Swap** - exchanges the positions of two neighboring characters.

These corruptions help the model learn how to map incorrect forms back to their correct Georgian spelling.

Every function picks the character index at random, which roughly mimics how unpredictable real typos are.

In [38]:
import random

def delete_char(word):
    if len(word) <= 1:
        return word
    i = random.randrange(len(word))
    return word[:i] + word[i + 1:]

word = random.choice(words)
print(f"Result of character deletion: {word} -> {delete_char(word)}")

Result of character deletion: დაუოკდება -> აუოკდება


In [39]:
def duplicate_char(word):
    # Guard against empty input, mirroring delete_char and swap_chars
    if not word:
        return word
    i = random.randrange(len(word))
    return word[:i] + word[i] + word[i:]

word = random.choice(words)
print(f"Result of character duplication: {word} -> {duplicate_char(word)}")

Result of character duplication: აკენკვა -> აკენნკვა


It’s easy for a finger to accidentally hit a neighboring key on the keyboard. For this reason, we create a dictionary mapping each Georgian letter to its adjacent keys, including the shifted variants.

In [40]:
GEORGIAN_QWERTY_KEYBOARD_NEIGHBORS = {
    "ა": "ქწსზ",
    "ბ": "ვგჰნ",
    "გ": "ფტჰბვ",
    "დ": "ერფცხს",
    "ე": "წსდრ",
    "ვ": "ცფგბ",
    "ზ": "ასხ",
    "თ": "ღრფგყტ",
    "ი": "უჯკო",
    "კ": "იოლმჯ",
    "ლ": "კოპ",
    "მ": "ნჯკლ",
    "ნ": "ბჰჯმ",
    "ო": "იკლპ",
    "პ": "ოლ",
    "ჟ": "ჰუიკმნჯ",
    "რ": "ღედფგტ",
    "ს": "აწდხზშ",
    "ტ": "რფგჰყთ",
    "უ": "ყჰჯკი",
    "ფ": "დრტგვც",
    "ქ": "წსა",
    "ღ": "ედფგტრ",
    "ყ": "ტგჰჯუ",
    "შ": "აჭწდძზხს",
    "ჩ": "ხდფვც",
    "ც": "ხდფვჩ",
    "ძ": "შასხზ",
    "წ": "ქასდეჭ",
    "ჭ": "ქაშსდეწ",
    "ხ": "ზსდც",
    "ჯ": "ჰუიკმნჟ",
    "ჰ": "გყუჯნბ",
}

In [41]:
def substitute_char(word):
    # Guard against empty input, mirroring delete_char and swap_chars
    if not word:
        return word
    i = random.randrange(len(word))
    char = word[i]
    neighbors = GEORGIAN_QWERTY_KEYBOARD_NEIGHBORS.get(char)
    if not neighbors:
        return word
    replacement = random.choice(neighbors)
    return word[:i] + replacement + word[i + 1:]

word = random.choice(words)
print(f"Result of character substitution: {word} -> {substitute_char(word)}")

Result of character substitution: თრთის -> თრტის
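
Because the neighbor map is typed by hand, a quick consistency check can catch entries that would turn `substitute_char` into a no-op or introduce characters outside the alphabet. The sketch below is only an illustration: `find_map_issues` is a hypothetical helper, and the sample map is a deliberately small, partly broken excerpt written for this example.

```python
ALPHABET = set("აბგდევზთიკლმნოპჟრსტუფქღყშჩცძწჭხჯჰ")

# Hypothetical sample: the last entry deliberately lists its own key.
SAMPLE_NEIGHBORS = {
    "ზ": "ასხ",
    "პ": "ოლ",
    "ძ": "შასხძ",
}

def find_map_issues(neighbors, alphabet):
    """Return (key, problem) tuples for suspicious entries."""
    issues = []
    for key, chars in neighbors.items():
        # A key that is its own neighbor makes substitution a no-op
        if key in chars:
            issues.append((key, "lists itself as a neighbor"))
        # Characters outside the alphabet cannot be encoded later
        unknown = sorted(set(chars) - alphabet)
        if unknown:
            issues.append((key, f"unknown characters: {''.join(unknown)}"))
    return issues

issues = find_map_issues(SAMPLE_NEIGHBORS, ALPHABET)
for key, problem in issues:
    print(f"{key}: {problem}")
```

Running the same check against the full `GEORGIAN_QWERTY_KEYBOARD_NEIGHBORS` dictionary should ideally print nothing.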


In [42]:
def swap_chars(word):
    if len(word) < 2:
        return word
    i = random.randrange(len(word) - 1)
    return (
            word[:i]
            + word[i + 1]
            + word[i]
            + word[i + 2:]
    )

word = random.choice(words)
print(f"Result of character swapping: {word} -> {swap_chars(word)}")

Result of character swapping: უმწეობა -> უმეწობა


To make the corruptions feel more natural and varied, we randomly select which functions to apply to each word.

For extra control we can also choose how many errors should be applied to a word.

For my model I chose:
- if word length < 10: exactly 1 error
- if word length >= 10: 2 or 3 errors

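Each word also needs several corrupted variants. I set `number_of_corrupted_words=4`; note that because the loop below keeps running while the set size is `<=` that number, each word actually ends up with 5 variants, as the sample output shows.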

I find this simple rule easier to follow than defining a probability distribution over error functions and word lengths.
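
If uniform selection ever proves too crude, weighting the error types is a small step up. The following is a minimal sketch under assumptions: the weights are made up for illustration, and the error names stand in for the actual functions.

```python
import random

# Hypothetical weights: how often each error type should be picked.
ERROR_WEIGHTS = {
    "delete": 0.3,
    "duplicate": 0.2,
    "substitute": 0.3,
    "swap": 0.2,
}

def pick_error_types(num_errors, rng=random):
    """Sample error-type names according to ERROR_WEIGHTS."""
    names = list(ERROR_WEIGHTS)
    weights = list(ERROR_WEIGHTS.values())
    return rng.choices(names, weights=weights, k=num_errors)

rng = random.Random(0)  # fixed seed so the sample is reproducible
sample = pick_error_types(10_000, rng)
for name in ERROR_WEIGHTS:
    print(f"{name}: {sample.count(name) / len(sample):.2f}")
```

The printed frequencies should land close to the configured weights; mapping the names back to `delete_char`, `duplicate_char`, and so on is a one-line dictionary lookup.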

In [43]:
ERROR_FUNCTIONS = [
    delete_char,
    duplicate_char,
    substitute_char,
    swap_chars,
]

def get_corrupted_words(word, number_of_corrupted_words=4):
    corrupted_words = set()

    word_len = len(word)

    # words shorter than 10 characters get exactly 1 error
    # longer words get 2 or 3 errors
    word_error_count_threshold = 10
    if word_len < word_error_count_threshold:
        min_error, max_error = 1, 1
    else:
        min_error, max_error = 2, 3

    while len(corrupted_words) <= number_of_corrupted_words:
        corrupted = word
        num_errors = random.randint(min_error, max_error)

        # apply error functions
        for _ in range(num_errors):
            error_fn = random.choice(ERROR_FUNCTIONS)
            corrupted = error_fn(corrupted)

        corrupted_words.add(corrupted)

    return list(corrupted_words)

In [44]:
sample_words = ['თბილისი', 'ქუთაისი', 'ბათუმი', 'თელავი']

print("\n__ word augmentation examples__")
for w in sample_words:
    print(f"\n Word: {w}")
    for i, c in enumerate(get_corrupted_words(w), 1):
        print(f"   {i:>2}. {c}")


__ word augmentation examples__

 Word: თბილისი
    1. თბლისი
    2. თბბილისი
    3. თბილიისი
    4. თბიკისი
    5. თვილისი

 Word: ქუთაისი
    1. ქთუაისი
    2. ქუუთაისი
    3. ქუთაიისი
    4. ქუთასი
    5. ქურაისი

 Word: ბათუმი
    1. ბაუთმი
    2. ბაუმი
    3. ბათუმჯ
    4. ბათუმ
    5. ბათჰმი

 Word: თელავი
    1. თეკავი
    2. თელავიი
    3. თელაი
    4. თეავი
    5. თეალვი
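
One way to sanity-check the augmentation is to measure the edit distance between a word and its corrupted variants. The sketch below uses a plain Levenshtein distance; `levenshtein` is a hypothetical helper, not part of the notebook, and the pairs are taken from the sample output above. Note that swapping two neighbors counts as two edits under this metric.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete a character from a
                curr[j - 1] + 1,     # insert a character into a
                prev[j - 1] + cost,  # substitute (or keep) a character
            ))
        prev = curr
    return prev[-1]

pairs = [
    ("თბილისი", "თბლისი"),    # deletion
    ("თბილისი", "თბბილისი"),  # duplication
    ("თბილისი", "თბიკისი"),   # substitution
    ("ქუთაისი", "ქთუაისი"),   # swap
]
distances = [levenshtein(correct, corrupted) for correct, corrupted in pairs]
print(distances)
```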


# Georgian Character Vocabulary and Encoding

We define the **character vocabulary** for Georgian letters and implement helper functions to encode words into sequences of indices and decode them back.


In [45]:
# Georgian alphabet and special tokens
GEORGIAN_LETTERS = list("აბგდევზთიკლმნოპჟრსტუფქღყშჩცძწჭხჯჰ-")
special_tokens = ["<PAD>", "<SOS>", "<EOS>"]

# Full vocabulary
vocab = special_tokens + GEORGIAN_LETTERS

# Character ↔ Index mappings
char2idx = {c: i for i, c in enumerate(vocab)}
idx2char = {i: c for i, c in enumerate(vocab)}

# Special constants
PAD_IDX = char2idx["<PAD>"]
vocab_size = len(vocab)

print(f"Vocabulary size: {vocab_size}")

Vocabulary size: 37


## Encoding Words

### The Translation Layer (Text → Numbers)
Neural networks cannot understand the concept of the letter "ა" or "ბ". They only understand numbers (vectors).

- **The Vocabulary:** This is a lookup table. Since "ა" sits at index 3 in our vocabulary, every time the model sees the number 3 it "thinks" of the concept of "ა".
- **The Sequence:** A word becomes a list of integers. თბილისი becomes [10, 4, 11, 13, 11, 20, 11].

### The "Traffic Signals": `<SOS>` and `<EOS>`

In a Sequence-to-Sequence (Seq2Seq) model, the network needs to know when to start speaking and when to shut up. This is what these tokens handle.


#### 1. `<SOS>` Start Of Sequence
The Decoder part is autoregressive, meaning it generates one character at a time based on what it generated previously.

- **The Problem:** To generate the first character (e.g., "თ"), the model needs an input to look at. But it hasn't generated anything yet!
- **The Solution:** We feed it the `<SOS>` token.

The Logic: "If I see `<SOS>` and have the memory of the corrupted word, the first letter I should write is 'თ'."


#### 2. `<EOS>` End Of Sequence

Neural networks work with fixed-size tensors, but words have variable lengths ("მზე" has 3 letters, "გამარჯობა" has 9).

- **The Problem:** Without a stop signal, the model would just keep generating garbage characters until it hits the maximum allowed length (e.g., `თბილისი<pad><pad><pad>...`).

- **The Solution:** We teach the model that after the final "ი", the next "character" is `<EOS>`.

The Logic: When the model predicts `<EOS>`, our code knows the word is finished and cuts off the generation loop.

#### 3. `<PAD>` Padding Token

While `<SOS>` and `<EOS>` control the flow of a single word, `<PAD>` is essential for training multiple words at once (batching).

- **The Problem:** To train efficiently, we feed the GPU a batch of words (e.g., 64 at a time). GPUs require inputs to be perfectly rectangular matrices. However, "მზე" (3 letters) and "გამარჯობა" (9 letters) have different lengths, so they cannot naturally stack into a single matrix.

- **The Solution:** We extend the shorter sequences with the `<PAD>` token until they match the length of the longest word in the batch.

The Logic: These tokens act as "filler" or silence. We explicitly tell the loss function to ignore `PAD_IDX`, ensuring the model doesn't waste effort trying to learn or predict these empty spaces.
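
The padding idea can be sketched in plain Python before any tensors are involved. The helper names `encode` and `pad_batch` are illustrative assumptions; the vocabulary order mirrors the one defined earlier, so `<PAD>` is index 0.

```python
letters = list("აბგდევზთიკლმნოპჟრსტუფქღყშჩცძწჭხჯჰ-")
vocab = ["<PAD>", "<SOS>", "<EOS>"] + letters
char2idx = {c: i for i, c in enumerate(vocab)}
PAD = char2idx["<PAD>"]

def encode(word):
    """Wrap the character indices in <SOS> ... <EOS>."""
    return [char2idx["<SOS>"]] + [char2idx[c] for c in word] + [char2idx["<EOS>"]]

def pad_batch(sequences, pad_value):
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_value] * (max_len - len(seq)) for seq in sequences]

batch = [encode("მზე"), encode("გამარჯობა")]
padded = pad_batch(batch, PAD)

# True where a real token sits, False where the loss should be ignored
mask = [[tok != PAD for tok in row] for row in padded]

for row, m in zip(padded, mask):
    print(row, "real tokens:", sum(m))
```

Both rows now have the same length, and the mask marks which positions carry information.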

In [46]:
def encode_word(word):
    """Encode a word as a list of character indices, including <SOS> and <EOS>."""
    seq = [char2idx["<SOS>"]] + [char2idx[c] for c in word] + [char2idx["<EOS>"]]
    return seq

word = "თბილისი"
encoded = encode_word(word)
print(f"Original word: {word}")
print(f"Encoded sequence: {encoded}")

Original word: თბილისი
Encoded sequence: [1, 10, 4, 11, 13, 11, 20, 11, 2]


## Decoding Sequences
After the model predicts a sequence of indices, we convert them back to readable text, ignoring special tokens.

In [47]:
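def decode_sequence(seq):
    """Decode indices into a word: stop at <EOS>, skip <SOS> and <PAD>."""
    chars = []
    for i in seq:
        token = idx2char[int(i)]
        if token == "<EOS>":
            break  # anything the model emits after <EOS> is not part of the word
        if token not in ("<SOS>", "<PAD>"):
            chars.append(token)
    return "".join(chars)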

decoded = decode_sequence(encoded)
print(f"Decoded word: {decoded}")

Decoded word: თბილისი


### To note:
- `vocab` contains **all Georgian letters** plus the hyphen and the special tokens `<PAD>`, `<SOS>`, `<EOS>`.
- `encode_word` converts words to numeric sequences for the model.
- `decode_sequence` converts model outputs back to text.
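
One caveat: `encode_word` looks every character up in `char2idx`, so a word containing anything outside the vocabulary (Latin letters, digits, spaces) raises a `KeyError`. A small filter like the one below could be applied to the word list first. This is a sketch, and both `is_encodable` and the candidate strings are made up for illustration.

```python
ALLOWED_CHARS = set("აბგდევზთიკლმნოპჟრსტუფქღყშჩცძწჭხჯჰ-")

def is_encodable(word):
    """True if the word is non-empty and uses only vocabulary characters."""
    return bool(word) and all(c in ALLOWED_CHARS for c in word)

candidates = ["თბილისი", "", "tbilisi", "ცხრა-ათასი", "თბილისი2024"]
clean = [w for w in candidates if is_encodable(w)]
print(clean)
```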

# Dataset Preparation

We create a **training dataset** with `(corrupted_word, correct_word)` pairs.
Corrupted words are generated using functions for deletion, duplication, swaps, and keyboard-neighbor substitution.

We shuffle the word list, generate corrupted versions, and combine them into `(input, target)` pairs.

It is very **important** to also include `(correct_word, correct_word)` pairs in the dataset, so that the model learns to leave already correct words unchanged instead of making arbitrary adjustments.

In [48]:
with open('./words/ganmarteba.ge_words.txt', 'r', encoding='utf-8') as f:
    words = f.read().split('\n')
random.shuffle(words)

dataset = [
    (encode_word(c), encode_word(w))
    for w in words
    for c in [w] + get_corrupted_words(w)
]
print(f'Generated {len(dataset)} corrupted pairs from {len(words)} words')

Generated 247080 corrupted pairs from 41180 words


We'll also need a small `Dataset` wrapper so that PyTorch's `DataLoader` can index the pairs.

In [49]:
from torch.utils.data import Dataset

class SpellDataset(Dataset):
    def __init__(self, pairs):
        self.pairs = pairs

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        return self.pairs[idx]

## DataLoader Setup

Standard neural network training requires us to feed data in "batches" (groups of examples) to the GPU. However, PyTorch's default batching assumes all your data inputs are identical in size (like fixed-size images).

Because words vary in length (e.g., "მზე" is 3 letters, "გამარჯობა" is 9), we cannot simply stack them together. We need a custom collate function (`collate_fn`) to handle this packing process.

### The Role of `collate_fn`

The `collate_fn` is a helper function that runs every time the DataLoader fetches a batch of samples. Its job is to:

1. Find the Maximum Length: Look at all words in the current batch and find the longest one.

2. Create a Blank Canvas: Initialize a matrix (Tensor) filled entirely with our `<PAD>` token, sized to the longest word.

3. Fill in the Data: Copy the actual word sequences into this matrix.


In [50]:
import torch

def collate_fn(batch):
    src_seqs, tgt_seqs = zip(*batch)

    src_lens = [len(s) for s in src_seqs]
    tgt_lens = [len(t) for t in tgt_seqs]

    max_src = max(src_lens)
    max_tgt = max(tgt_lens)

    src_batch = torch.full((len(batch), max_src), PAD_IDX, dtype=torch.long)
    tgt_batch = torch.full((len(batch), max_tgt), PAD_IDX, dtype=torch.long)

    for i, (s, t) in enumerate(zip(src_seqs, tgt_seqs)):
        src_batch[i, :len(s)] = torch.tensor(s)
        tgt_batch[i, :len(t)] = torch.tensor(t)

    return src_batch, tgt_batch
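
As an alternative to filling the tensors by hand, PyTorch ships `torch.nn.utils.rnn.pad_sequence`, which does the same right-padding. The sketch below is an equivalent formulation, not the notebook's implementation: `collate_with_pad_sequence` is a hypothetical name, `PAD_IDX = 0` is assumed to match the vocabulary above, and the index lists are arbitrary.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

PAD_IDX = 0  # assumed to equal char2idx["<PAD>"]

def collate_with_pad_sequence(batch):
    """Pad source and target sequences to their respective batch maximum."""
    src_seqs, tgt_seqs = zip(*batch)
    src_batch = pad_sequence(
        [torch.tensor(s) for s in src_seqs],
        batch_first=True, padding_value=PAD_IDX,
    )
    tgt_batch = pad_sequence(
        [torch.tensor(t) for t in tgt_seqs],
        batch_first=True, padding_value=PAD_IDX,
    )
    return src_batch, tgt_batch

# Two (source, target) pairs with different lengths
batch = [
    ([1, 10, 4, 2], [1, 10, 4, 11, 2]),
    ([1, 14, 9, 7, 2], [1, 14, 9, 7, 2]),
]
src, tgt = collate_with_pad_sequence(batch)
print(src.shape, tgt.shape)
print(src)
```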

We also need to split the data into training and validation sets. I am using an 80/20 split.

I am also using the `pin_memory` hardware optimization.

- `pin_memory=True`: The DataLoader places each batch in page-locked (pinned) host memory, which allows much faster transfer of data from RAM to the GPU (CUDA).


In [51]:
from torch.utils.data import DataLoader

split_ratio = 0.8
split_idx = int(len(dataset) * split_ratio)

train_dataset = SpellDataset(dataset[:split_idx])
val_dataset = SpellDataset(dataset[split_idx:])

device = torch.device("cuda" if torch.cuda.is_available() else "cpu") # Check for GPU
pin_memory = (device.type == "cuda") # If using GPU, pin_memory speeds up transfer

train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True, collate_fn=collate_fn, pin_memory=pin_memory)
val_loader = DataLoader(val_dataset, batch_size=64, shuffle=False, collate_fn=collate_fn, pin_memory=pin_memory)

print(f"Training pairs: {len(train_dataset)}, Validation pairs: {len(val_dataset)}")

Training pairs: 197664, Validation pairs: 49416
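
The split above slices the list of pairs. Because all variants of a word sit next to each other in `dataset`, at most one word can end up with variants on both sides of the cut. To make the separation explicit, the word list itself can be split before any corruption is generated. A minimal sketch, where `split_words` and the five sample words are illustrative assumptions:

```python
import random

def split_words(words, train_ratio=0.8, seed=0):
    """Shuffle a copy of the word list and split it by word, not by pair."""
    shuffled = list(words)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

words = ["თბილისი", "ქუთაისი", "ბათუმი", "თელავი", "გორი"]
train_words, val_words = split_words(words)

print(f"train: {len(train_words)} words, validation: {len(val_words)} words")
print("overlap:", set(train_words) & set(val_words))
```

Pairs would then be generated separately for each list, so no validation word is ever seen during training in any form.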


# Model, Loss, Optimizer

## Model Architecture: The Seq2Seq Approach

The architecture follows a Sequence-to-Sequence (Seq2Seq) design using Long Short-Term Memory (LSTM) networks. This choice is specifically suited for spellchecking because the model must process an entire input sequence to understand the "context" of a word before generating the corrected version.

#### The Encoder-Decoder Structure
The model is split into two distinct functional units:

- **The Encoder:** It processes the corrupted Georgian word character by character. As it reads, it updates its internal "hidden state." By the time it reaches the end of the word, this hidden state acts as a context vector—a compressed mathematical summary of the input.
- **The Decoder:** The decoder is tasked with "unpacking" that context vector into a sequence of correct characters. It is autoregressive, meaning it uses the character it just predicted as the input for its next prediction.

---
#### Hyperparameter Choices
When defining the CharSeq2Seq model, the hyperparameters are chosen to balance representational power with computational efficiency:

- **Embedding Dimension** (`embed_dim=64`): Since our vocabulary is small (37 tokens: 33 letters, the hyphen, and 3 special tokens), a 64-dimensional space is sufficient to capture relationships between characters. Higher dimensions risk overfitting, while lower dimensions might fail to distinguish between similar characters.
- **Hidden Dimension** (`hidden_dim=256`): This is the size of the "memory" passed from the Encoder to the Decoder. 256 units provide enough workspace to store the information of long Georgian words and complex error patterns.
- **Layers** (`num_layers=2`): Stacking two LSTMs allows the model to learn a hierarchy. Intuitively, the first layer can focus on low-level character transitions, while the second layer can pick up higher-level structural rules of the Georgian language.
- **Dropout** (`0.2`): `nn.LSTM` applies dropout to the outputs of every layer except the last, so with two layers 20% of the activations passed from the first layer to the second are randomly zeroed during training. This discourages the model from "memorizing" specific words and pushes it to learn robust spelling rules that generalize to unseen typos.
- **Batch First** (`True`): This configuration ensures the data is processed in the format (`batch_size, sequence_length, features`), which is the standard intuitive format for modern PyTorch workflows.
- **Teacher Forcing** (`teacher_forcing_ratio=0.5`): With a certain probability, we ignore the model's actual prediction and instead feed it the correct character from the target sequence as the input for the next step. We use a ratio of 0.5, meaning half the time the model learns from its own mistakes, and half the time it is "guided" by the ground truth.
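
To get a feel for what these choices cost, the parameter count can be worked out by hand. The sketch below assumes the standard `nn.LSTM` layout (four gates, each with input weights, recurrent weights, and two bias vectors) and the vocabulary size of 37 printed earlier; `lstm_params` is a hypothetical helper.

```python
def lstm_params(input_dim, hidden_dim, num_layers):
    """Parameter count of a unidirectional LSTM with biases."""
    total = 0
    for layer in range(num_layers):
        in_dim = input_dim if layer == 0 else hidden_dim
        # 4 gates x hidden units x (input weights + recurrent weights + 2 biases)
        total += 4 * hidden_dim * (in_dim + hidden_dim + 2)
    return total

vocab_size, embed_dim, hidden_dim, num_layers = 37, 64, 256, 2

embedding = vocab_size * embed_dim                # shared lookup table
encoder = lstm_params(embed_dim, hidden_dim, num_layers)
decoder = lstm_params(embed_dim, hidden_dim, num_layers)
fc = hidden_dim * vocab_size + vocab_size         # weights + bias
total = embedding + encoder + decoder + fc

print(f"embedding: {embedding:,}")
print(f"encoder:   {encoder:,}")
print(f"decoder:   {decoder:,}")
print(f"fc:        {fc:,}")
print(f"total:     {total:,}")
```

Almost all of the roughly 1.7 million parameters live in the two LSTMs, so `hidden_dim` is the setting that dominates model size.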

In [52]:
from torch import nn

class CharSeq2Seq(nn.Module):

    def __init__(self, embed_dim=64, hidden_dim=256):
        super().__init__()
        self.vocab_size = vocab_size
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.encoder = nn.LSTM(embed_dim, hidden_dim, num_layers=2, batch_first=True, dropout=0.2)
        self.decoder = nn.LSTM(embed_dim, hidden_dim, num_layers=2, batch_first=True, dropout=0.2)
        self.fc = nn.Linear(hidden_dim, vocab_size)

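    def forward(self, src, tgt=None, teacher_forcing_ratio=0.5):
        # Encode: the final (hidden, cell) state summarizes the corrupted word
        embedded_src = self.embedding(src)
        _, hidden = self.encoder(embedded_src)

        # Decode one character per step; without a target, cap generation at 30 steps
        batch_size = src.size(0)
        max_len = tgt.size(1) if tgt is not None else 30
        outputs = torch.zeros(batch_size, max_len, self.vocab_size, device=src.device)

        # Note: tgt itself starts with <SOS>, so step 0 is trained to emit <SOS>;
        # decode_sequence strips it from the final prediction.
        input_token = torch.full((batch_size, 1), char2idx["<SOS>"], dtype=torch.long, device=src.device)
        finished = torch.zeros(batch_size, dtype=torch.bool, device=src.device)

        for t in range(max_len):
            embedded_input = self.embedding(input_token)
            output, hidden = self.decoder(embedded_input, hidden)
            output = self.fc(output)

            pred = output.argmax(2)
            outputs[:, t, :] = output.squeeze(1)

            # Teacher forcing: feed the ground-truth token instead of the prediction
            if tgt is not None and random.random() < teacher_forcing_ratio:
                input_token = tgt[:, t].unsqueeze(1)
            else:
                input_token = pred

            # At inference, stop once every sequence in the batch has emitted <EOS>
            if tgt is None:
                finished |= pred.squeeze(1) == char2idx["<EOS>"]
                if finished.all():
                    break

        return outputs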

## Loss Function and Optimization

To turn this architecture into a functioning spellchecker, we need a way to measure its mistakes and a strategy to correct them.

#### Criterion: Cross-Entropy Loss
We use `nn.CrossEntropyLoss` as our mathematical "judge." In character-level modeling, the decoder’s final layer outputs one raw score (logit) per vocabulary entry for every position in the word; `nn.CrossEntropyLoss` applies the softmax internally and compares the resulting distribution with the correct character.

- **Multiclass Classification:** At its core, the model is trying to solve a classification problem for every character: "Which of the 37 vocabulary entries (33 letters, the hyphen, and 3 special tokens) is most likely to be here?"
- **The Role of** `ignore_index=PAD_IDX`: This is a crucial setting. It tells the loss function to completely ignore the positions where we added `<PAD>` tokens. We don't want to penalize the model for "predicting" padding, nor do we want it to waste its learning capacity on filler characters used just for batch alignment.
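
To make the effect of `ignore_index` concrete, here is a plain-Python sketch of a mean cross-entropy that skips padded positions. It illustrates the idea only and is not PyTorch's implementation; `masked_cross_entropy`, the toy logits, and the choice to return `0.0` for an all-padding input are assumptions of this example.

```python
import math

PAD_IDX = 0

def masked_cross_entropy(logits, targets, ignore_index=PAD_IDX):
    """Mean negative log-likelihood over positions whose target is not ignore_index."""
    total, count = 0.0, 0
    for row, target in zip(logits, targets):
        if target == ignore_index:
            continue  # padded position: contributes nothing to the loss
        log_norm = math.log(sum(math.exp(x) for x in row))  # log of softmax denominator
        total += log_norm - row[target]
        count += 1
    return total / count if count else 0.0

# Three positions, vocabulary of 3; the last position is padding
logits = [[0.0, 2.0, 0.0], [0.0, 0.0, 2.0], [5.0, 0.0, 0.0]]
targets = [1, 2, PAD_IDX]
loss = masked_cross_entropy(logits, targets)

# Changing the logits at the padded position leaves the loss untouched
other = masked_cross_entropy(logits[:2] + [[0.0, 9.0, 9.0]], targets)
print(f"loss: {loss:.4f}, with different padding logits: {other:.4f}")
```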

#### Optimizer: Adam

The Adam (Adaptive Moment Estimation) optimizer is used to update the model weights. Unlike standard Stochastic Gradient Descent (SGD), which applies one global learning rate, Adam scales the step for every parameter individually using running averages of its past gradients (the "momentum") and of their squares. This helps the model navigate the complex loss landscape of an LSTM more efficiently.
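
The per-parameter scaling is easiest to see on a single scalar. The sketch below follows the commonly published Adam update rule with the same default hyperparameters as `optim.Adam`; `adam_step` is a hypothetical helper and the gradients are made-up numbers.

```python
import math

def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad          # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction for the zero init
    v_hat = v / (1 - beta2 ** t)
    param -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# Two parameters starting at 1.0 whose gradients differ by a factor of 1000
w_large, _, _ = adam_step(1.0, grad=10.0, m=0.0, v=0.0, t=1)
w_small, _, _ = adam_step(1.0, grad=0.01, m=0.0, v=0.0, t=1)

print(f"step with large gradient: {1.0 - w_large:.6f}")
print(f"step with small gradient: {1.0 - w_small:.6f}")
```

Both parameters move by roughly `lr` on the first step, regardless of gradient magnitude, which is the normalization plain SGD lacks.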

In [53]:
import torch.optim as optim

model = CharSeq2Seq().to(device)
criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX)
optimizer = optim.Adam(model.parameters(), lr=0.001)

## Training Loop

The loop is structured to handle the Autoregressive nature of the model while monitoring for Generalization (validation) versus Memorization (training).

#### 1. The Training Phase & Gradient Descent

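We call `model.train()` at the start of every epoch to switch modules such as dropout back into training mode. This matters because each epoch ends with `model.eval()`; without the call, the LSTM layers would stop applying the 0.2 dropout after the first epoch, rendering the regularization useless.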

- **Zeroing Gradients:** `optimizer.zero_grad()` is a mandatory step in PyTorch because gradients accumulate by default.
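- **The Loss Objective:** We’re using a flattened Cross-Entropy Loss: the model output of shape `(batch, seq_len, vocab_size)` is reshaped to `(batch * seq_len, vocab_size)` and the targets to `(batch * seq_len)`, so every character position counts as one classification example.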
- **Weight Update:** `loss.backward()` triggers the autograd engine, calculating the partial derivatives of the loss with respect to every weight in the Encoder and Decoder. `optimizer.step()` then adjusts those weights based on the Adam update rule.

#### 2. Validation & The "Inference Gap"

Switching to `model.eval()` and `torch.no_grad()` is standard, but the key change is setting `teacher_forcing_ratio=0.0`.

- **The Reason:** During training, "Teacher Forcing" guides the model by feeding it the correct previous character.
- **The Goal:** In validation, we remove the "training wheels." By forcing the model to rely on its own previous predictions, we get an honest assessment of its performance in a real-world, autoregressive inference scenario.

#### 3. Monitoring Metrics
By comparing the `avg_train_loss` and `avg_val_loss`, you can monitor the bias-variance tradeoff. If the validation loss starts to diverge or rise while training loss falls, the model is memorizing the dataset rather than learning Georgian orthography.
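
That comparison can be automated with a small early-stopping helper. This is a sketch, not part of the notebook: the `EarlyStopping` class, the patience of 2, and the validation losses are all made up for illustration.

```python
class EarlyStopping:
    """Signal a stop when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=2, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch; return True if training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

stopper = EarlyStopping(patience=2)
val_losses = [0.90, 0.60, 0.55, 0.57, 0.58, 0.61]  # made-up numbers

stopped_epoch = None
for epoch, loss in enumerate(val_losses, start=1):
    if stopper.step(loss):
        stopped_epoch = epoch
        print(f"Stopping after epoch {epoch}, best val loss {stopper.best:.2f}")
        break
```

In the training loop, `stopper.step(avg_val_loss)` would be called once at the end of each epoch.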

In [None]:
import time

num_epochs = 10

for epoch in range(num_epochs):
    start_time = time.time()
    train_loss = 0

    model.train()
    for src, tgt in train_loader:
        # Move tensors to the same device as the model (CPU/GPU)
        src, tgt = src.to(device), tgt.to(device)

        # Clear accumulated gradients from the previous iteration
        optimizer.zero_grad()

        # Forward pass: Uses Teacher Forcing (default ratio 0.5) to stabilize early learning
        output = model(src, tgt)

        # This treats every character position as an individual classification target
        loss = criterion(output.view(-1, model.vocab_size), tgt.view(-1))

        # Backpropagation: Compute gradients for every trainable parameter
        loss.backward()

        # Update weights based on the Adam optimizer update rule
        optimizer.step()

        train_loss += loss.item()

    # Calculate average training loss for the epoch
    avg_train_loss = train_loss / len(train_loader)

    # Set model to evaluation mode (disables Dropout for consistent inference)
    model.eval()
    val_loss = 0

    # Disable autograd engine to reduce memory overhead and speed up computation
    with torch.no_grad():
        for src, tgt in val_loader:
            src, tgt = src.to(device), tgt.to(device)

            # Disable Teacher Forcing (ratio=0.0) to evaluate true autoregressive performance
            output = model(src, tgt, teacher_forcing_ratio=0.0)

            # Use the same flattening logic to compute cross-entropy on validation data
            loss = criterion(output.view(-1, model.vocab_size), tgt.view(-1))
            val_loss += loss.item()

    avg_val_loss = val_loss / len(val_loader)

    # Logging & performance tracking
    epoch_time = time.time() - start_time
    print(
        f"Epoch {epoch + 1}: "
        f"Train Loss: {avg_train_loss:.4f}, "
        f"Val Loss: {avg_val_loss:.4f}, "
        f"Time: {epoch_time:.2f}s"
    )

Training a character-level Seq2Seq model on ~41,000 words (expanded to ~247,000 pairs via augmentation, about 198,000 of them used for training) is computationally intensive.

The time required for convergence depends heavily on how fast your hardware can run the LSTM's sequential, step-by-step operations.

It took me about 40 minutes to train 10 epochs on Google's T4 GPU.

## Save Final Model
Export the trained model to disk.

In [None]:
final_model_path = "georgian_spellcheck_model_final.pth"
torch.save(model.state_dict(), final_model_path)
print(f"Final model saved: {final_model_path}")
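
To use the saved weights later, the same architecture has to be instantiated first and the state dict loaded into it. The sketch below shows the round trip with a tiny `nn.Linear` stand-in so that it runs on its own; with the notebook's model you would construct `CharSeq2Seq()` instead. The temporary file path is an assumption of this example.

```python
import os
import tempfile

import torch
from torch import nn

# Stand-in for the trained CharSeq2Seq model
model = nn.Linear(4, 2)
path = os.path.join(tempfile.mkdtemp(), "model_final.pth")
torch.save(model.state_dict(), path)

# Later, possibly on a machine without a GPU: rebuild the architecture, then load
restored = nn.Linear(4, 2)
state = torch.load(path, map_location="cpu")
restored.load_state_dict(state)
restored.eval()  # disable dropout and other training-only behavior

x = torch.ones(1, 4)
with torch.no_grad():
    same = torch.equal(model(x), restored(x))
print(f"Restored model reproduces the original output: {same}")
```

`map_location="cpu"` lets weights saved on a GPU be loaded on a CPU-only machine.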