### Training dataset structure

All data must follow one consistent format, i.e.
- one decimal separator throughout: always either "," or ".", never both
- controlled unit vocabulary: KG, G, L, ML, SZT, etc., always in one casing, never mixed variants such as SZt, szt, sZT
- names should be standardized, e.g.
    - text only (I tried to follow this as much as possible, but it turned out to be almost impossible)
    - no tax letters
    - no flags
    - no quantity
    - no price
    - no total
- missing data should be represented by a special symbol, because empty labels are ambiguous
- corrupt labels should be handled explicitly with the special \_NONE\_ symbol (for names, the label should instead mimic a linguistically meaningful name)
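
The rules above can be sketched as a small normalization step. This is only an illustration: the helper names are hypothetical, the unit set contains just the units listed above (the real vocabulary is longer), and "." is picked as the single decimal separator purely for the example.

```python
import re

NONE = "_NONE_"  # special symbol for missing or corrupt labels
UNITS = {"KG", "G", "L", "ML", "SZT"}  # assumption: only the units named above

def normalize_unit(raw):
    """Upper-case the unit; anything outside the vocabulary becomes _NONE_."""
    unit = (raw or "").strip().upper()
    return unit if unit in UNITS else NONE

def normalize_number(raw):
    """Force '.' as the only decimal separator; unparsable values become _NONE_."""
    text = (raw or "").strip().replace(",", ".")
    return text if re.fullmatch(r"\d+(\.\d+)?", text) else NONE
```

For example, `normalize_unit("szt")` yields `"SZT"` and `normalize_number("3,50")` yields `"3.50"`, while an empty or garbled value maps to `_NONE_` instead of an empty label.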

### Tokenizer design

Vocabulary:
- All possible characters
- Special markers:
    - `<PAD>` for padding
    - `<BOS>` for beginning of sequence
    - `<EOS>` for end of sequence
    - `<UNK>` for unknown

A character-level approach is a very good fit for this task because word and subword tokenizers struggle with:
- merged words
- broken words
- hallucinated capitals
- random digits in names
- unknown brand names

A character-level model can freely cherry-pick characters from noisy data.

Maximum input and output lengths should leave a safety margin without wasting memory.

All sequences should be padded or cut to the maximum length.

Decoding should skip the special markers (a purely aesthetic matter).

**IMPORTANT**: the vocab must explicitly include the special \_NONE\_ symbol and \n (newline)
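
A minimal sketch of such a tokenizer, in plain Python. The class name `CharTokenizer`, the choice to put the special markers first (so that `<PAD>` gets ID 0), and treating \_NONE\_ as one token are assumptions made for illustration, not a fixed design.

```python
PAD, BOS, EOS, UNK, NONE = "<PAD>", "<BOS>", "<EOS>", "<UNK>", "_NONE_"
SPECIALS = [PAD, BOS, EOS, UNK, NONE]

class CharTokenizer:
    def __init__(self, alphabet):
        # Specials come first so <PAD> is ID 0; newline is added explicitly.
        chars = sorted(set(alphabet) | {"\n"})
        self.itos = SPECIALS + chars
        self.stoi = {tok: i for i, tok in enumerate(self.itos)}

    def encode(self, text, max_len):
        # A label that is exactly _NONE_ becomes one token, not six characters.
        if text == NONE:
            body = [self.stoi[NONE]]
        else:
            body = [self.stoi.get(ch, self.stoi[UNK]) for ch in text]
        # Cut to leave room for <BOS>/<EOS>, then pad up to max_len.
        ids = [self.stoi[BOS]] + body[: max_len - 2] + [self.stoi[EOS]]
        return ids + [self.stoi[PAD]] * (max_len - len(ids))

    def decode(self, ids):
        # Stop at <EOS>; drop <PAD>, <BOS> and <UNK>; keep _NONE_ as text.
        out = []
        for i in ids:
            tok = self.itos[i]
            if tok == EOS:
                break
            if tok in (PAD, BOS, UNK):
                continue
            out.append(tok)
        return "".join(out)
```

`encode` assumes `max_len` is at least 2. Keeping \_NONE\_ in the decoded text is deliberate here, since it carries meaning for downstream parsing, unlike the other markers.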

### Architecture plan

Encoder --> Decoder --> Linear --> Softmax

No philosophy here: the "philosophy" is simply that a transformer is the best tool for seq2seq text

- **Embedding layer** maps each character ID to a dense vector, giving the model a continuous representation of characters
- **Positional embedding** adds position information; without it the transformer sees the characters as an unordered set
- **Encoder** (4 layers)
    - *Self-attention*
        - Each character looks at all other characters
        - Learns dependencies like price patterns, spacing, noise patterns, word boundaries
        - Essential for noisy OCR text
    - *Feed-forward network* expands and compresses information, does non-linear feature extraction
    - *Layer norm + residual connections*
        - Stabilize training
        - Allow deep stacking
        - Help prevent vanishing gradients
- **Decoder** analogous to the encoder, with a few modifications:
    - *Masked self-attention*: when generating output, the model can only see the past tokens
    - *Cross-attention*: the decoder looks back at the encoder's output to decide what to spit out
- **Output linear layer** maps the decoder hidden state to a vector the size of the vocabulary
- **Softmax** turns scores into probabilities; the model chooses the next character from this distribution
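
To make the masking and softmax steps concrete, here is a small framework-free sketch in plain Python. It is not the model itself: the function names are hypothetical, and a real implementation would do the same thing on tensors inside the attention layers.

```python
import math

def causal_mask(length):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(length)] for i in range(length)]

def softmax(scores):
    """Turn raw scores into probabilities (max subtracted for numerical stability)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def masked_attention_weights(scores, mask_row):
    """Softmax over one row of attention scores; masked positions get weight 0."""
    masked = [s if keep else float("-inf") for s, keep in zip(scores, mask_row)]
    return softmax(masked)
```

With `causal_mask(3)`, the second position can only attend to the first two, so `masked_attention_weights([1.0, 2.0, 3.0], mask[1])` gives the third score a weight of exactly 0 no matter how large it is. The same `softmax` applied to the output layer's scores gives the distribution over the next character.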

### Training and loss design

The total loss is a weighted sum of per-field cross-entropy (CE) losses:

    TOTAL_LOSS =
        w_name * CE(name) +
        w_unit * CE(unit) +
        w_amount * CE(amount) +
        w_quantity * CE(quantity) +
        w_price * CE(price) +
        w_total * CE(total)

with example weights:

    w_name = 1.0      # important
    w_unit = 0.7
    w_amount = 1.5    # critical
    w_quantity = 1.5  # critical
    w_price = 0.3
    w_total = 0.3
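
As a rough sketch, the weighted sum could be computed like this in plain Python. The function names and the `pad_id=0` default are assumptions for illustration; in a real training loop the per-field CE would come from the framework's loss function.

```python
import math

# Example weights from the list above.
WEIGHTS = {"name": 1.0, "unit": 0.7, "amount": 1.5,
           "quantity": 1.5, "price": 0.3, "total": 0.3}

def cross_entropy(probs, target_ids, pad_id=0):
    """Mean negative log-likelihood of the gold IDs, skipping padded positions.

    probs[t] is the predicted distribution over the vocabulary at step t.
    """
    steps = [(p, gold) for p, gold in zip(probs, target_ids) if gold != pad_id]
    if not steps:
        return 0.0
    return -sum(math.log(p[gold]) for p, gold in steps) / len(steps)

def total_loss(per_field_ce, weights=WEIGHTS):
    """Weighted sum of the per-field cross-entropy values."""
    return sum(weights[field] * ce for field, ce in per_field_ce.items())
```

Padded positions are skipped so that short labels are not rewarded just for predicting padding correctly.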

Teacher forcing for all sequence outputs is *ESSENTIAL*: feeding the decoder the *gold* previous character, rather than its own prediction, stabilizes training

Dropout should be used, since it helps keep the model from overfitting on small datasets
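
In practice teacher forcing comes down to shifting the gold sequence by one position. A minimal sketch, with a hypothetical helper name and made-up IDs:

```python
def teacher_forcing_pair(target_ids, bos_id, eos_id):
    """Build the decoder input and the labels from one gold sequence.

    At step t the decoder sees the gold characters up to t-1 and is
    trained to predict the gold character at t.
    """
    decoder_input = [bos_id] + target_ids   # shifted right, starts with <BOS>
    labels = target_ids + [eos_id]          # what the model must predict
    return decoder_input, labels
```

For gold IDs `[5, 6, 7]` with `bos_id=1` and `eos_id=2`, the decoder input is `[1, 5, 6, 7]` and the labels are `[5, 6, 7, 2]`.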

Optional but cool metrics:
- NAME accuracy, compare strings with Levenshtein distance
- QUANTITY accuracy, exact class match
- UNIT accuracy
- PRICE accuracy, numeric difference after parsing to floats
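
Two of these metrics can be sketched in plain Python as follows. The function names are hypothetical, and normalizing the edit distance by the longer string's length is just one possible choice.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def name_similarity(pred, gold):
    """1.0 for identical names, falling toward 0.0 as the edit distance grows."""
    longest = max(len(pred), len(gold))
    return 1.0 if longest == 0 else 1.0 - levenshtein(pred, gold) / longest

def price_error(pred, gold):
    """Absolute numeric difference, or None if either value does not parse."""
    try:
        return abs(float(pred.replace(",", ".")) - float(gold.replace(",", ".")))
    except ValueError:
        return None
```

Returning `None` for unparsable prices keeps a \_NONE\_ or garbled prediction from being counted as a numeric error of zero.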