# Tokenization & Word↔Index Mapping with NLTK, Gensim, spaCy, and PyTorch/torchtext

This notebook shows **built-in tokenization** features and **word↔index** mappings across popular libraries.

Libraries covered:
- **NLTK** (tokenizers)
- **Gensim** (`simple_preprocess`, `Dictionary`)
- **spaCy** (tokenizer + `StringStore`)
- **PyTorch/torchtext** (`get_tokenizer`, `build_vocab_from_iterator`) + quick `nn.Embedding` demo

> If a library is missing, the cell will **gracefully explain** how to install it and, when possible, show a small fallback demo so the flow isn’t broken.

In [2]:
# Example texts we will reuse across libraries
texts = [
    "I love NLP, esp. tokenization!",
    "Byte-Pair Encoding is cool; so is WordPiece.",
    "Let's build vocabularies and map tokens ↔ ids."
]
print('Loaded', len(texts), 'example texts.')
for i, t in enumerate(texts, 1):
    print(f'{i}.', t)

Loaded 3 example texts.
1. I love NLP, esp. tokenization!
2. Byte-Pair Encoding is cool; so is WordPiece.
3. Let's build vocabularies and map tokens ↔ ids.


## 1) NLTK — Tokenize and Map Words ↔ Indexes
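We try `nltk.word_tokenize` first. It depends on the **punkt** sentence-tokenizer data, which must be downloaded separately (newer NLTK releases look for `punkt_tab` instead). If the data isn’t available, we fall back to `wordpunct_tokenize`, which is purely regex-based and needs no download.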
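
To make the fallback less of a black box, here is a minimal sketch that mimics `wordpunct_tokenize` with a plain regular expression. The pattern `\w+|[^\w\s]+` is the one NLTK's `WordPunctTokenizer` is built on; the helper name `wordpunct_sketch` is made up for illustration and is not part of NLTK.

```python
import re

# Runs of word characters, or runs of characters that are neither
# word characters nor whitespace (i.e. punctuation and symbols).
WORDPUNCT_PATTERN = re.compile(r"\w+|[^\w\s]+")

def wordpunct_sketch(text):
    """Hypothetical stand-in for nltk.wordpunct_tokenize (illustration only)."""
    return WORDPUNCT_PATTERN.findall(text)

print(wordpunct_sketch("Let's build vocabularies and map tokens ↔ ids."))
print(wordpunct_sketch("Byte-Pair Encoding is cool; so is WordPiece."))
```

Because the pattern splits at every punctuation mark, `Let's` becomes `Let`, `'`, `s` and `Byte-Pair` becomes `Byte`, `-`, `Pair`, whereas `word_tokenize` keeps `'s` and `Byte-Pair` as single tokens.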

In [3]:
import itertools
try:
    import nltk
    try:
        from nltk.tokenize import word_tokenize
        # Probe the tokenizer: a LookupError is raised if its model data
        # (punkt, or punkt_tab on newer NLTK releases) is missing
        word_tokenize('probe')
        tok_fn = word_tokenize
        method = "nltk.word_tokenize (punkt)"
    except LookupError:
        from nltk.tokenize import wordpunct_tokenize as tok_fn
        method = "nltk.wordpunct_tokenize (no model download)"
    print('Tokenizer method:', method)
    nl_tokens = [tok_fn(t) for t in texts]
    for i, toks in enumerate(nl_tokens, 1):
        print(f'NLTK tokens {i}:', toks)

    # Build a simple word-index mapping (contiguous indices)
    vocab = sorted(set(itertools.chain.from_iterable(nl_tokens)))
    stoi = {w:i for i,w in enumerate(vocab, start=1)}  # reserve 0 for <pad>
    itos = {i:w for w,i in stoi.items()}
    print('\nNLTK stoi sample (first 10):', list(itertools.islice(stoi.items(), 10)))

    # Convert first document to ids and back
    ids_doc0 = [stoi[w] for w in nl_tokens[0]]
    back_doc0 = [itos[i] for i in ids_doc0]
    print('Doc0 -> ids:', ids_doc0)
    print('ids -> Doc0:', back_doc0)
except Exception as e:
    print('NLTK not available or failed to tokenize:', type(e).__name__, str(e))
    print('Try: pip install nltk  (and optionally: nltk.download("punkt"), or nltk.download("punkt_tab") on newer releases)')

Tokenizer method: nltk.word_tokenize (punkt)
NLTK tokens 1: ['I', 'love', 'NLP', ',', 'esp', '.', 'tokenization', '!']
NLTK tokens 2: ['Byte-Pair', 'Encoding', 'is', 'cool', ';', 'so', 'is', 'WordPiece', '.']
NLTK tokens 3: ['Let', "'s", 'build', 'vocabularies', 'and', 'map', 'tokens', '↔', 'ids', '.']

NLTK stoi sample (first 10): [('!', 1), ("'s", 2), (',', 3), ('.', 4), (';', 5), ('Byte-Pair', 6), ('Encoding', 7), ('I', 8), ('Let', 9), ('NLP', 10)]
Doc0 -> ids: [8, 18, 10, 3, 15, 4, 21, 1]
ids -> Doc0: ['I', 'love', 'NLP', ',', 'esp', '.', 'tokenization', '!']
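
The mapping above reserves id 0 for `<pad>`, but `stoi[w]` raises a `KeyError` for any token that was not in the training texts. One common way around this is to reserve a second id for unknown tokens. The sketch below assumes the special-token names `<pad>` and `<unk>` and uses hypothetical helpers `encode` and `decode`; the token list is copied from the NLTK output for the first document.

```python
# Tokens of the first document, copied from the NLTK output above
doc0 = ['I', 'love', 'NLP', ',', 'esp', '.', 'tokenization', '!']

PAD, UNK = '<pad>', '<unk>'  # assumed special-token names

# ids 0 and 1 are reserved; real tokens start at 2
stoi_unk = {PAD: 0, UNK: 1}
stoi_unk.update({w: i for i, w in enumerate(sorted(set(doc0)), start=2)})
itos_unk = {i: w for w, i in stoi_unk.items()}

def encode(tokens):
    """Map tokens to ids, sending unseen tokens to the <unk> id."""
    return [stoi_unk.get(tok, stoi_unk[UNK]) for tok in tokens]

def decode(ids):
    """Map ids back to tokens."""
    return [itos_unk[i] for i in ids]

ids = encode(['I', 'love', 'transformers', '!'])
print(ids)          # [5, 8, 1, 2]
print(decode(ids))  # ['I', 'love', '<unk>', '!']
```

The round trip is lossy by design: every unseen word decodes to `<unk>`, so the original text cannot be recovered from the ids alone.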


## 2) Gensim — `simple_preprocess` + `Dictionary` (token ↔ id)
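`gensim.utils.simple_preprocess` lowercases the text, strips punctuation, and discards tokens shorter than `min_len` (default 2) or longer than `max_len` (default 15); `deacc=True` additionally removes accent marks. The cell below sets `min_len=1` so that single-letter tokens such as `i` survive. `gensim.corpora.Dictionary` builds the token↔id mapping and supports `doc2bow`, `doc2idx`, and reverse lookup.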
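
As a rough illustration of those preprocessing rules, the following sketch approximates `simple_preprocess` with the standard library only. It is not Gensim's implementation: the token pattern (runs of non-digit word characters) and the helper names `strip_accents` and `preprocess_sketch` are assumptions made for this example.

```python
import re
import unicodedata

# Assumption: a token is a run of word characters that are not digits
TOKEN_PATTERN = re.compile(r"(?:(?!\d)\w)+")

def strip_accents(text):
    """Remove combining accent marks (roughly what deacc=True does)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

def preprocess_sketch(text, deacc=False, min_len=2, max_len=15):
    """Hypothetical approximation of gensim.utils.simple_preprocess."""
    text = text.lower()
    if deacc:
        text = strip_accents(text)
    return [
        tok for tok in TOKEN_PATTERN.findall(text)
        if min_len <= len(tok) <= max_len and not tok.startswith("_")
    ]

sample = "Let's build vocabularies and map tokens ↔ ids."
print(preprocess_sketch(sample))             # default min_len=2 drops 's'
print(preprocess_sketch(sample, min_len=1))  # keeps the single-letter 's'
```

With the default `min_len=2`, the stray `s` from `Let's` disappears; with `min_len=1` it is kept, which is why the notebook passes `min_len=1`.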

In [4]:
import itertools  # imported here as well so the cell runs on its own
try:
    from gensim.utils import simple_preprocess
    from gensim.corpora import Dictionary

    gs_tokens = [simple_preprocess(t, deacc=True, min_len=1) for t in texts]
    print('Gensim tokens:')
    for i, toks in enumerate(gs_tokens, 1):
        print(f' {i}:', toks)

    # Build dictionary (token<->id)
    dictionary = Dictionary(gs_tokens)
    print('\nDictionary size:', len(dictionary))
    print('token2id (first 10):', list(itertools.islice(dictionary.token2id.items(), 10)))

    # doc2bow (list of (token_id, count)) and doc2idx (ids sequence)
    bow0 = dictionary.doc2bow(gs_tokens[0])
    idx0 = dictionary.doc2idx(gs_tokens[0], unknown_word_index=-1)
    print('Doc0 doc2bow:', bow0)
    print('Doc0 doc2idx:', idx0)

    # Reverse lookup id->token
    ids_only = [idx for idx in idx0 if idx >= 0]
    rev0 = [dictionary[i] for i in ids_only]
    print('id->token (from doc2idx):', rev0)
except Exception as e:
    print('Gensim not available:', type(e).__name__, str(e))
    print('Try: pip install gensim')

Gensim tokens:
 1: ['i', 'love', 'nlp', 'esp', 'tokenization']
 2: ['byte', 'pair', 'encoding', 'is', 'cool', 'so', 'is', 'wordpiece']
 3: ['let', 's', 'build', 'vocabularies', 'and', 'map', 'tokens', 'ids']

Dictionary size: 20
token2id (first 10): [('esp', 0), ('i', 1), ('love', 2), ('nlp', 3), ('tokenization', 4), ('byte', 5), ('cool', 6), ('encoding', 7), ('is', 8), ('pair', 9)]
Doc0 doc2bow: [(0, 1), (1, 1), (2, 1), (3, 1), (4, 1)]
Doc0 doc2idx: [1, 2, 3, 0, 4]
id->token (from doc2idx): ['i', 'love', 'nlp', 'esp', 'tokenization']
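
To clarify what `doc2bow` and `doc2idx` return, here is a plain-Python sketch of the same two views of a document. It is not Gensim's code: the id-assignment rule (new tokens get the next free id, in sorted order within each document) is an assumption chosen because it reproduces the ids printed above, and the `_sketch` function names are hypothetical.

```python
from collections import Counter

# First two token lists from the Gensim output above
docs = [
    ['i', 'love', 'nlp', 'esp', 'tokenization'],
    ['byte', 'pair', 'encoding', 'is', 'cool', 'so', 'is', 'wordpiece'],
]

# Assumed rule: unseen tokens get the next free id, document by document,
# in sorted order within each document.
token2id = {}
for doc in docs:
    for tok in sorted(set(doc)):
        token2id.setdefault(tok, len(token2id))

def doc2bow_sketch(doc):
    """Sorted (token_id, count) pairs; out-of-vocabulary tokens are skipped."""
    counts = Counter(token2id[tok] for tok in doc if tok in token2id)
    return sorted(counts.items())

def doc2idx_sketch(doc, unknown_word_index=-1):
    """One id per token, in order; unknown tokens map to unknown_word_index."""
    return [token2id.get(tok, unknown_word_index) for tok in doc]

print(doc2bow_sketch(docs[1]))                # 'is' (id 8) is counted twice
print(doc2idx_sketch(['i', 'love', 'bert']))  # [1, 2, -1]
```

The bag-of-words view discards word order and keeps counts, while the index view keeps order and marks unknown tokens, which is why the notebook filters out negative ids before the reverse lookup.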


## 3) spaCy — Tokenizer + `StringStore` (token ↔ hash id)
We use `spacy.blank("en")` so no model download is needed. `nlp.vocab.strings` maps **string ↔ integer key** in both directions. Note that these keys are **64-bit hash values**, not contiguous indices, so they cannot index an embedding table directly. For NN embeddings, build your own contiguous mapping.

In [5]:
try:
    import spacy
    nlp = spacy.blank('en')  # tokenizer only, no model download
    spacy_tokens = [[tok.text for tok in nlp(t)] for t in texts]
    for i, toks in enumerate(spacy_tokens, 1):
        print(f'spaCy tokens {i}:', toks)

    # Built-in StringStore mapping (string<->id)
    stringstore = nlp.vocab.strings
    spacy_ids = [[stringstore[tok] for tok in toks] for toks in spacy_tokens]
    spacy_back = [[stringstore[id_] for id_ in ids] for ids in spacy_ids]
    print('\nspaCy StringStore ids (doc0):', spacy_ids[0])
    print('Back to tokens (doc0):', spacy_back[0])

    # Optional: contiguous mapping for embeddings
    import itertools
    vocab = sorted(set(itertools.chain.from_iterable(spacy_tokens)))
    stoi = {w:i for i,w in enumerate(vocab, start=1)}
    itos = {i:w for w,i in stoi.items()}
    ids_doc0 = [stoi[w] for w in spacy_tokens[0]]
    back_doc0 = [itos[i] for i in ids_doc0]
    print('\nContiguous stoi (first 10):', list(itertools.islice(stoi.items(), 10)))
    print('Doc0 -> ids:', ids_doc0)
    print('ids -> Doc0:', back_doc0)
except Exception as e:
    print('spaCy not available:', type(e).__name__, str(e))
    print('Try: pip install spacy  (no model needed for tokenizer-only demo)')

spaCy tokens 1: ['I', 'love', 'NLP', ',', 'esp', '.', 'tokenization', '!']
spaCy tokens 2: ['Byte', '-', 'Pair', 'Encoding', 'is', 'cool', ';', 'so', 'is', 'WordPiece', '.']
spaCy tokens 3: ['Let', "'s", 'build', 'vocabularies', 'and', 'map', 'tokens', '↔', 'ids', '.']

spaCy StringStore ids (doc0): [4690420944186131903, 3702023516439754181, 15832915187156881108, 2593208677638477497, 9888022622711118288, 12646065887601541794, 15418258291467594259, 17494803046312582752]
Back to tokens (doc0): ['I', 'love', 'NLP', ',', 'esp', '.', 'tokenization', '!']

Contiguous stoi (first 10): [('!', 1), ("'s", 2), (',', 3), ('-', 4), ('.', 5), (';', 6), ('Byte', 7), ('Encoding', 8), ('I', 9), ('Let', 10)]
Doc0 -> ids: [9, 20, 11, 3, 17, 5, 23, 1]
ids -> Doc0: ['I', 'love', 'NLP', ',', 'esp', '.', 'tokenization', '!']
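
The gap between hash ids and embedding rows can be made concrete with the numbers printed above. The sketch below remaps the hashes to small row indices in first-seen order; reserving row 0 for padding and the names `hash_to_row` and `row_to_hash` are assumptions of this example, not spaCy features.

```python
# Hash ids copied from the spaCy StringStore output above (doc0)
hash_ids = [
    4690420944186131903, 3702023516439754181, 15832915187156881108,
    2593208677638477497, 9888022622711118288, 12646065887601541794,
    15418258291467594259, 17494803046312582752,
]

# Row 0 is reserved for padding (an assumption of this sketch)
hash_to_row = {}
for h in hash_ids:
    if h not in hash_to_row:
        hash_to_row[h] = len(hash_to_row) + 1
row_to_hash = {row: h for h, row in hash_to_row.items()}

rows = [hash_to_row[h] for h in hash_ids]
print(rows)                    # [1, 2, 3, 4, 5, 6, 7, 8]
print(max(hash_ids) >= 2**63)  # True: too large for a signed 64-bit integer
```

An embedding table indexed by raw hashes would need on the order of 2^64 rows, and several of these values do not even fit in a signed 64-bit tensor, so a remapping step like this is needed in practice.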


## 4) PyTorch / torchtext — Tokenizer + Vocabulary + Embedding
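We try `torchtext.data.utils.get_tokenizer("basic_english")` and `torchtext.vocab.build_vocab_from_iterator`. `torchtext` is no longer actively developed, so it may be absent from recent PyTorch environments. If importing or running it fails, we fall back to a simple regex tokenizer and build a vocabulary manually to show the mapping and an `nn.Embedding` lookup. The fallback still requires PyTorch itself.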

In [6]:
import re
try:
    import torch
    from torchtext.data.utils import get_tokenizer
    from torchtext.vocab import build_vocab_from_iterator
    import torch.nn as nn

    print('Using torchtext basic_english tokenizer')
    tokenizer = get_tokenizer('basic_english')
    tt_tokens = [tokenizer(t) for t in texts]
    for i, toks in enumerate(tt_tokens, 1):
        print(f'torchtext tokens {i}:', toks)

    # <pad>=0, <unk>=1: same id layout as the fallback branch
    vocab = build_vocab_from_iterator(tt_tokens, specials=['<pad>', '<unk>'])
    vocab.set_default_index(vocab['<unk>'])
    print('\nVocab size:', len(vocab))
    print('Sample token->id:', {tok: vocab[tok] for tok in tt_tokens[0]})

    # Map doc0 to ids and back
    ids0 = [vocab[t] for t in tt_tokens[0]]
    back0 = [vocab.lookup_token(i) for i in ids0]
    print('Doc0 -> ids:', ids0)
    print('ids -> Doc0:', back0)

    # Embedding example
    emb = nn.Embedding(num_embeddings=len(vocab), embedding_dim=8)
    x = torch.tensor(ids0).unsqueeze(0)  # shape (1, seq_len)
    E = emb(x)
    print('Embedding output shape (batch, seq, dim):', tuple(E.shape))
except Exception as e:
    print('torchtext not available; showing a minimal PyTorch fallback.')
    print('Error:', type(e).__name__, str(e))
    import torch
    import torch.nn as nn
    # very simple regex tokenizer
    tt_tokens = [re.findall(r"\w+|[^\w\s]", t.lower()) for t in texts]
    for i, toks in enumerate(tt_tokens, 1):
        print(f'fallback tokens {i}:', toks)
    # build contiguous vocab (reserve 0:<pad>, 1:<unk>)
    vocab_set = sorted(set(itertools.chain.from_iterable(tt_tokens)))
    stoi = {w: i + 2 for i, w in enumerate(vocab_set)}
    stoi['<pad>'] = 0
    stoi['<unk>'] = 1
    itos = {i:w for w,i in stoi.items()}
    ids0 = [stoi.get(t, 1) for t in tt_tokens[0]]
    back0 = [itos[i] for i in ids0]
    print('Doc0 -> ids:', ids0)
    print('ids -> Doc0:', back0)
    # Embedding
    emb = nn.Embedding(num_embeddings=len(stoi), embedding_dim=8)
    x = torch.tensor(ids0).unsqueeze(0)
    E = emb(x)
    print('Embedding output shape (batch, seq, dim):', tuple(E.shape))

torchtext not available; showing a minimal PyTorch fallback.
Error: ModuleNotFoundError No module named 'torchtext'
fallback tokens 1: ['i', 'love', 'nlp', ',', 'esp', '.', 'tokenization', '!']
fallback tokens 2: ['byte', '-', 'pair', 'encoding', 'is', 'cool', ';', 'so', 'is', 'wordpiece', '.']
fallback tokens 3: ['let', "'", 's', 'build', 'vocabularies', 'and', 'map', 'tokens', '↔', 'ids', '.']
Doc0 -> ids: [14, 18, 20, 4, 13, 6, 24, 2]
ids -> Doc0: ['i', 'love', 'nlp', ',', 'esp', '.', 'tokenization', '!']
Embedding output shape (batch, seq, dim): (1, 8, 8)
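
The embedding step above handles a single sequence. Real batches contain sequences of different lengths, which is what the reserved `<pad>` id is for. The following plain-Python sketch shows the idea without PyTorch: pad every sequence to a common length, then treat the embedding lookup as row selection from a table. The helpers `pad_batch` and `embed` and the random table are illustrative assumptions; in PyTorch, `nn.Embedding` accepts a `padding_idx` argument that plays a similar role for the padding row.

```python
import random

PAD_ID = 0         # same <pad> id as the fallback vocabulary above
EMBEDDING_DIM = 8
VOCAB_SIZE = 29    # 27 fallback tokens + <pad> + <unk>

random.seed(0)
# One row of EMBEDDING_DIM floats per id; the padding row is all zeros
table = [
    [0.0] * EMBEDDING_DIM if i == PAD_ID
    else [random.gauss(0.0, 1.0) for _ in range(EMBEDDING_DIM)]
    for i in range(VOCAB_SIZE)
]

def pad_batch(batch, pad_id=PAD_ID):
    """Right-pad every id sequence to the length of the longest one."""
    max_len = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in batch]

def embed(batch):
    """Look up one table row per id: (batch, seq) -> (batch, seq, dim)."""
    return [[table[i] for i in seq] for seq in batch]

# Doc0 ids from the output above, plus a shorter sequence ('i love nlp')
batch = pad_batch([[14, 18, 20, 4, 13, 6, 24, 2], [14, 18, 20]])
vectors = embed(batch)
print(batch[1])  # [14, 18, 20, 0, 0, 0, 0, 0]
print(len(vectors), len(vectors[0]), len(vectors[0][0]))  # 2 8 8
```

Keeping the padding row at zero means padded positions contribute nothing when the vectors are summed or averaged downstream.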


## Summary
- **NLTK**: `word_tokenize` (needs punkt) or `wordpunct_tokenize` — then build your own `stoi/itos`.
- **Gensim**: `simple_preprocess` + `Dictionary` gives **token↔id** out of the box, plus `doc2idx`, `doc2bow`.
- **spaCy**: tokenizer + `StringStore` (string↔hash), or build your own contiguous ids for embeddings.
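- **PyTorch/torchtext**: `get_tokenizer('basic_english')` + `build_vocab_from_iterator`, then `vocab[token]` (or `vocab(tokens)` for a whole list) and `lookup_token()`; the resulting contiguous ids **plug directly into `nn.Embedding`**. Without `torchtext`, a regex tokenizer and a hand-built `stoi/itos` do the same job.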