# Words and Tokens

How do we define words in NLP, and how does that choice affect our machine learning model and the downstream task?

## Words

Deciding what counts as a word can depend heavily on the downstream task. For topic modeling or text classification, punctuation may not play a big role; however, tasks like sentiment analysis, speech to text, or parsing linguistic features may rely on such information. Furthermore, in speech to text one might encounter filler words like *uh* or *um*, which may or may not be useful.

Further complications include the casing of a word (*they* versus *They*) and the inflection of words (*cats* versus *cat*). A **lemma** refers to a set of inflected word forms which share the same stem, part of speech and word sense. In contrast, a **word form** is the inflected form itself, one of the natural variations of a word that come up in language.

The English language is rather forgiving in this sense, because a given **lemma** tends to have a limited number of **word forms**. In contrast, languages like **Arabic** or **Turkish** rely on a large variety of **word forms** for any given **lemma**. This can be more challenging because, if these tokens are treated independently, a model has to recover the relationship between a wide array of possible **word forms**.

One phenomenon that comes up in English corpora is called **Herdan's Law** or **Heaps' Law**, which implies that the size of the vocabulary for a text grows significantly faster than the square root of its length in words. Namely, for `V` the vocabulary (i.e. the unique words) of a corpus consisting of `N` tokens (the individual, non-unique words of a text), the following holds:

```
|V| = kN^b
```
where `k` and `b` are positive constants and `0 < b < 1` (usually roughly 0.65 to 0.75). This pattern will come up when we experiment with tokenization in the next section.
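
As a rough illustration of how `k` and `b` could be estimated, the sketch below records the vocabulary size as a token stream grows and fits a straight line to the log-log points, since `log|V| = log k + b log N`. The function names `vocabulary_growth` and `fit_heaps` are illustrative, and the check at the end uses synthetic points rather than a real corpus; on real data you would pass in the output of whichever tokenizer you chose.

```python
import math

def vocabulary_growth(tokens, step=1000):
    """Return (N, |V|) pairs sampled every `step` tokens."""
    seen = set()
    points = []
    for n, token in enumerate(tokens, start=1):
        seen.add(token)
        if n % step == 0:
            points.append((n, len(seen)))
    return points

def fit_heaps(points):
    """Least-squares fit of log|V| = log k + b * log N; returns (k, b)."""
    xs = [math.log(n) for n, _ in points]
    ys = [math.log(v) for _, v in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    k = math.exp(mean_y - b * mean_x)
    return k, b

# Synthetic check: points generated from |V| = 10 * N^0.7 should recover k and b.
synthetic = [(n, 10 * n ** 0.7) for n in (1_000, 10_000, 100_000, 1_000_000)]
k, b = fit_heaps(synthetic)
print(round(k, 2), round(b, 2))
```

With a real corpus, `fit_heaps(vocabulary_growth(tokens))` gives a quick estimate; the result will depend on the tokenizer as much as on the text.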
 
 

## Tokenization

Strictly speaking, tokenization can be done with hand-written regular expressions and bash scripts; however, there now exist libraries like spaCy which make tokenization and the extraction of linguistic features much easier. In addition, to overcome the limitations posed by morphologically complex languages, sub-word tokenization approaches have also been developed.

Nonetheless, this isn't a solved problem. Because tokenization is one of the first steps in preparing text for NLP applications, decisions made here have long-lasting consequences for the downstream model.

## Text Normalization

As part of a standard text normalization process, the following steps typically arise:

- Tokenization
- Normalizing word formats
- Segmenting sentences

If you read enough NLP papers, especially those using more classical approaches, you will observe that significant thought and effort can be put into optimizing the normalization approach for downstream tasks. This might include one or more of the following:

- making tokenization exceptions for unusual patterns specific to your corpus which might otherwise indicate word boundaries (e.g. 'New York' or 'New' 'York')
- accepting punctuation or not as a token
- replacing URLs or money sums with special tokens like #URL and #MONEY
- splitting tokens like "can't" to "can" "'t"
- lowercasing all text

Each of these decisions may have practical or performance related impacts which one should think about.
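
A minimal sketch of a few of these decisions, using only the standard library. The regular expressions and the `normalize` function are illustrative assumptions rather than a recommended pipeline (the URL pattern, for instance, will swallow trailing punctuation); the `#URL` and `#MONEY` placeholders follow the list above.

```python
import re

PLACEHOLDERS = {"#URL", "#MONEY"}
URL_RE = re.compile(r"https?://\S+")
MONEY_RE = re.compile(r"\$\d+(?:,\d{3})*(?:\.\d+)?")
# A placeholder, a clitic such as 't or 's, a word, or a single punctuation mark
TOKEN_RE = re.compile(r"#URL|#MONEY|'\w+|\w+|[^\w\s]")

def normalize(text, lowercase=True, keep_punct=True):
    """Replace URLs and dollar amounts, tokenize, then optionally filter and lowercase."""
    text = URL_RE.sub("#URL", text)
    text = MONEY_RE.sub("#MONEY", text)
    tokens = TOKEN_RE.findall(text)
    if not keep_punct:
        # Keep placeholders and anything containing a word character
        tokens = [t for t in tokens if t in PLACEHOLDERS or re.search(r"\w", t)]
    if lowercase:
        tokens = [t if t in PLACEHOLDERS else t.lower() for t in tokens]
    return tokens

print(normalize("I can't pay $1,200.50, see https://example.com/bill now!"))
```

Toggling `lowercase` and `keep_punct` on the same input is a cheap way to see how much each decision changes the resulting vocabulary.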


### Demo

Here is a small example comparing a couple of different approaches to see how a document might get tokenized.

Make sure to install a spaCy model if necessary:

```
!python -m spacy download en_core_web_sm
```

#### Importing Libraries 

In [1]:
from sklearn import datasets
import pandas as pd
from collections import Counter, defaultdict
from pprint import pprint

import nltk
nltk.download('punkt')

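import spacy
nlp = spacy.load("en_core_web_sm")  # loaded here, but replaced by the blank pipeline below

from spacy.tokenizer import Tokenizer
from spacy.pipeline import Sentencizer
from spacy.lang.en import English
# Blank English pipeline that only gets a rule-based sentencizer
nlp = English()
# spaCy v2 API, as used when this notebook was run;
# in spaCy v3 the equivalent is nlp.add_pipe("sentencizer")
sentencizer = nlp.create_pipe("sentencizer")
nlp.add_pipe(sentencizer)
# Create a blank Tokenizer with just the English vocab
tokenizer = Tokenizer(nlp.vocab)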

[nltk_data] Downloading package punkt to
[nltk_data]     /home/kylehiroyasu/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [2]:
data = datasets.fetch_20newsgroups()

In [3]:
texts = data['data']

#### Comparing Speed Performance

Here we compare the speed of the `nltk` tokenizers with a simple `regex` and with the built-in `spaCy` tokenizer. First we'll see how long each approach takes on a sample news dataset.

In [4]:
%%time

word_tokenizer_count = Counter()

for text in texts:
    tokens = nltk.tokenize.word_tokenize(text)
    word_tokenizer_count.update(tokens)

CPU times: user 22.4 s, sys: 0 ns, total: 22.4 s
Wall time: 22.4 s


In [5]:
%%time

wordpunct_tokenizer_count = Counter()

for text in texts:
    tokens = nltk.tokenize.wordpunct_tokenize(text)
    wordpunct_tokenizer_count.update(tokens)

CPU times: user 1.5 s, sys: 666 µs, total: 1.5 s
Wall time: 1.5 s


In [6]:
%%time

nltk_regex_tokenizer_count = Counter()

for text in texts:
    tokens = nltk.tokenize.regexp_tokenize(text, r"[\w']+")
    nltk_regex_tokenizer_count.update(tokens)

CPU times: user 1.07 s, sys: 0 ns, total: 1.07 s
Wall time: 1.07 s


In [7]:
%%time

spacy_tokenizer_count = Counter()

for text in tokenizer.pipe(texts):
    spacy_tokenizer_count.update([t.text for t in text])

CPU times: user 16.6 s, sys: 54.8 ms, total: 16.7 s
Wall time: 16.7 s


#### Comparing Token Counts

In [8]:
counters = [word_tokenizer_count, wordpunct_tokenizer_count, nltk_regex_tokenizer_count, spacy_tokenizer_count]
names = ['word', 'wordpunct', 'regex','spaCy']
results = []
for name, count in zip(names, counters):
    results.append({
        'name':name,
        'unique tokens': len(count.keys()),
        'total_tokens': sum(count.values())
    })
    
results_df = pd.DataFrame(results)
results_df

|   | name | unique tokens | total_tokens |
|---|------|---------------|--------------|
| 0 | word | 204783 | 4790508 |
| 1 | wordpunct | 173119 | 5051134 |
| 2 | regex | 163358 | 3663948 |
| 3 | spaCy | 282056 | 3811544 |


In [9]:
for name, count in zip(names, counters):
    pprint(name)
    pprint(count.most_common(10))

'word'
[('>', 187306),
 (',', 161254),
 ('.', 143573),
 ('the', 129570),
 ('--', 116980),
 (':', 113896),
 (')', 71944),
 ('to', 71194),
 ('(', 70544),
 ('*', 68230)]
'wordpunct'
[('.', 282553),
 (',', 151604),
 ('the', 129745),
 (':', 108345),
 ('>', 71695),
 ('to', 71604),
 ('-', 70394),
 ('of', 67699),
 (">'", 66852),
 ('AX', 62396)]
'regex'
[('the', 129708),
 ('to', 71596),
 ('of', 67699),
 ("'AX", 61917),
 ('a', 57392),
 ('and', 54634),
 ('I', 43987),
 ('is', 41377),
 ('in', 39257),
 ('that', 36498)]
'spaCy'
[('\n', 289433),
 ('the', 127670),
 (' ', 78811),
 ('to', 69836),
 ('of', 66705),
 ('a', 56148),
 ('\n\n', 55935),
 ('and', 52580),
 ('is', 39554),
 ('in', 37787)]


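As you can see, even for a relatively small corpus the tokenization method you choose can have a large impact on the resulting vocabulary and token counts. Furthermore, depending on the scale of the data, some approaches will take substantially longer than others. Note that the blank `Tokenizer(nlp.vocab)` used here carries no punctuation rules and so splits essentially on whitespace, which is why whitespace tokens such as `'\n'` top its list and why it produces the largest vocabulary. Although `spaCy` appears to be much slower than the regex-based tokenizers, its full pipelines provide a much greater range of linguistic features out of the box, so if you need richer linguistic annotations this library will make the most sense.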
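
To see where the differences in the counts come from, it can help to run comparable splitting rules on a single string. The sketch below uses only the standard library: plain whitespace splitting (roughly how a blank spaCy `Tokenizer` without prefix or suffix rules behaves, apart from its handling of whitespace tokens), the pattern `\w+|[^\w\s]+` (the pattern behind NLTK's `wordpunct_tokenize`), and the `[\w']+` pattern from the regex cell above. The sample sentence is made up.

```python
import re

sample = "Don't panic: it's only a test, isn't it?"

splitters = {
    "whitespace": lambda s: s.split(),
    "wordpunct": lambda s: re.findall(r"\w+|[^\w\s]+", s),
    "regex": lambda s: re.findall(r"[\w']+", s),
}

for name, split in splitters.items():
    tokens = split(sample)
    print(f"{name:<10} {len(tokens):>2} tokens: {tokens}")
```

The whitespace rule keeps punctuation glued to words (`'panic:'`, `'it?'`), which inflates the number of unique tokens, while the word-punctuation rule breaks contractions into three pieces, which inflates the total token count.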

### Sentencization

Another often important aspect of preprocessing is breaking texts into their individual sentences. While looking for a period is usually a good start, there are many exceptions to this rule which make the process rather complicated. To show the variation in results, we can again compare `NLTK` and `spaCy`.
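
To make the "exceptions" concrete, here is a deliberately naive splitter that breaks after `.`, `!` or `?` whenever whitespace follows. The function name and the sample sentence are illustrative; this is not how either library works internally.

```python
import re

def naive_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

sample_text = "Dr. Smith arrived at 4:30 p.m. He left early."
for sentence in naive_sentences(sample_text):
    print(repr(sentence))
```

The rule finds three "sentences" here because it also splits after the abbreviation `Dr.`; handling abbreviations, initials and similar cases is where sentence splitters start to differ.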


In [10]:
%%time
nltk_sentences = set()
nltk_count = 0
for text in texts:
    #sents = nltk.tokenize.sent_tokenize(text)
    sents = [s.strip() for s in nltk.tokenize.sent_tokenize(text)]
    nltk_count += len(sents)
    nltk_sentences.update(sents)

CPU times: user 5.07 s, sys: 35.8 ms, total: 5.11 s
Wall time: 5.11 s


In [11]:
%%time
spacy_sentences = set()
spacy_count = 0
for text in texts:
    #sents = [s.text for s in list(nlp(text).sents)]
    sents = [s.text.strip() for s in list(nlp(text).sents)]
    spacy_count += len(sents)
    spacy_sentences.update(sents)

CPU times: user 31.5 s, sys: 31.3 ms, total: 31.5 s
Wall time: 31.5 s


In [12]:
pprint(u"Number of sentences NLTK : {} spaCy: {}".format(nltk_count, spacy_count))

'Number of sentences NLTK : 187884 spaCy: 189612'


In [13]:
pprint(u"Number of unique sentences NLTK : {} spaCy: {}".format(len(nltk_sentences), len(spacy_sentences)))

'Number of unique sentences NLTK : 167792 spaCy: 165475'


In [14]:
intersection = len(nltk_sentences.intersection(spacy_sentences))
pprint("Number of identical sentences: {}".format(intersection))

'Number of identical sentences: 146573'


Again we see a discrepancy in the number of sentences found by the two libraries, and an even greater discrepancy in how the individual sentences are split: only 146573 sentences are identical across both sets. If this step is crucial to your application, it is certainly worthwhile to understand the differences and whether either approach is appropriate for your corpus. If the text you are trying to process is nonstandard, you may need to find your own solution, fix texts before **sentencizing** them, or write `spaCy` exceptions to correct these patterns.

Here is an example text to compare the results of the two **sentencizers**; `text` is the last document left over from the loops above. The two libraries handle white space differently, which is why each sentence is stripped before comparison; once stripped, the outputs for this particular text match.

In [15]:
pprint([s.strip() for s in nltk.tokenize.sent_tokenize(text)])

['From: gunning@cco.caltech.edu (Kevin J. Gunning)\n'
 'Subject: stolen CBR900RR\n'
 'Organization: California Institute of Technology, Pasadena\n'
 'Lines: 12\n'
 'Distribution: usa\n'
 'NNTP-Posting-Host: alumni.caltech.edu\n'
 'Summary: see above\n'
 '\n'
 'Stolen from Pasadena between 4:30 and 6:30 pm on 4/15.',
 'Blue and white Honda CBR900RR california plate KG CBR.',
 'Serial number\nJH2SC281XPM100187, engine number 2101240.',
 'No turn signals or mirrors, lights taped over for track riders session\n'
 'at Willow Springs tomorrow.',
 "Guess I'll miss it.",
 ':-(((\n\nHelp me find my baby!!!',
 'kjg']


In [16]:
pprint([s.text.strip() for s in list(nlp(text).sents)])

['From: gunning@cco.caltech.edu (Kevin J. Gunning)\n'
 'Subject: stolen CBR900RR\n'
 'Organization: California Institute of Technology, Pasadena\n'
 'Lines: 12\n'
 'Distribution: usa\n'
 'NNTP-Posting-Host: alumni.caltech.edu\n'
 'Summary: see above\n'
 '\n'
 'Stolen from Pasadena between 4:30 and 6:30 pm on 4/15.',
 'Blue and white Honda CBR900RR california plate KG CBR.',
 'Serial number\nJH2SC281XPM100187, engine number 2101240.',
 'No turn signals or mirrors, lights taped over for track riders session\n'
 'at Willow Springs tomorrow.',
 "Guess I'll miss it.",
 ':-(((\n\nHelp me find my baby!!!',
 'kjg']


In [17]:
[s.text.strip() for s in list(nlp(text).sents)]

['From: gunning@cco.caltech.edu (Kevin J. Gunning)\nSubject: stolen CBR900RR\nOrganization: California Institute of Technology, Pasadena\nLines: 12\nDistribution: usa\nNNTP-Posting-Host: alumni.caltech.edu\nSummary: see above\n\nStolen from Pasadena between 4:30 and 6:30 pm on 4/15.',
 'Blue and white Honda CBR900RR california plate KG CBR.',
 'Serial number\nJH2SC281XPM100187, engine number 2101240.',
 'No turn signals or mirrors, lights taped over for track riders session\nat Willow Springs tomorrow.',
 "Guess I'll miss it.",
 ':-(((\n\nHelp me find my baby!!!',
 'kjg']

## Byte-Pair and Wordpiece Encoding

Two additional forms of tokenization which relax the strict focus on word boundaries are byte-pair encoding and wordpiece encoding. The advantage of such systems is that, by relaxing word boundaries, we can find common subwords and even use these word parts to represent new words which weren't present in the training corpus.



### Byte-Pair Encoding

Byte-pair encoding (BPE) is regarded as one of the simplest methods of finding such a subword vocabulary. To build the vocabulary by which the tokenizer will split a corpus, the algorithm starts by counting all adjacent character pairs in the corpus. Then the most common pair is merged to create a new "character" and the process is repeated. This approach assumes that the text has already been tokenized into individual words, which prevents us from merging characters across word boundaries. Finally, the merging process terminates after a chosen number of merges, which can be seen as a hyperparameter of the training process.

Below we will implement an example which illustrates the process. First we'll tokenize our data as above, without sentencizing it; then we can begin the byte-pair encoding process.

In [18]:
bpe_corpus  = []

for text in tokenizer.pipe(texts):
    bpe_corpus.append([t.text for t in text if not t.is_punct and not t.is_space])


In [19]:
pprint(bpe_corpus[10][0:5])

['From:', 'irwin@cmptrc.lonestar.org', '(Irwin', 'Arnstein)', 'Subject:']


So at this point we have a list of lists which contain our tokens. Now we want to start counting the frequency of character pairs. One open question is what happens to single-character words like "I" or "a". In BPE we append a special end-of-word character to each word, so even these words contribute a pair.

Since there is a lot of overlap in tokens, i.e. a finite vocabulary, we can start by counting how often each token appears in the corpus.

In [20]:
token_frequencies = Counter()
for text in bpe_corpus:
    token_frequencies.update(text)
token_frequencies.most_common(10)

[('the', 127670),
 ('to', 69836),
 ('of', 66705),
 ('a', 56148),
 ('and', 52580),
 ('is', 39554),
 ('in', 37787),
 ('I', 37616),
 ('that', 34582),
 ('>', 27843)]

Now we just have to iterate through the character pairs, taking into consideration how often each word appears and keeping in mind the trailing special character.

To make splitting and rejoining subword pairs easier, we will initially split words into their individual characters, using a space as a delimiter.

In [54]:
new_token_frequencies = dict()
initial_unique_characters = set()

for key, value in token_frequencies.items():
    characters = list(key)
    characters += ['</W>']
    initial_unique_characters.update(characters)
    new_key = ' '.join(characters)
    new_token_frequencies[new_key] = value
    
list(new_token_frequencies.keys())[0:5]

['F r o m : </W>',
 'l e r x s t @ w a m . u m d . e d u </W>',
 "( w h e r e ' s </W>",
 'm y </W>',
 't h i n g ) </W>']

Now we need to find the most common character pair.

In [49]:
pairs = defaultdict(int)

for token, value in new_token_frequencies.items():
    characters = token.split()
    for index in range(len(characters)-1):
        pairs[characters[index], characters[index+1]] += value

In [53]:
most_common_pair = max(pairs, key=pairs.get)
most_common_pair

('e', '</W>')

Having found the most common pair, we need to save this pair in our vocab and modify the keys of the token frequencies to reflect the merge. After that we can continue to find the next most common pair.

In [64]:
vocab = list(initial_unique_characters)
joined_pair = ''.join(most_common_pair)
vocab.append(joined_pair)

merged_token_frequencies = dict()
for token, value in new_token_frequencies.items():
    characters = token.split()
    new_characters = []

    while len(characters) > 1:
        merged_characters = ''.join(characters[0:2])
        if merged_characters in vocab:
            new_characters.append(merged_characters)
            characters.pop(1)
            characters.pop(0)
        else:
            new_characters.append(characters[0])
            characters.pop(0)
    if len(characters) == 1:
        new_characters.append(characters[0])
    new_token = ' '.join(new_characters)
    merged_token_frequencies[new_token] = value

To make the whole process more convenient we could refactor everything into a tokenizer class.

In [7]:
class bpe_tokenizer(object):
    
    def __init__(self, corpus, n_merges=10):
        self.corpus = corpus
        self.n_merges = n_merges
        self.n_merges_done = 0
        self.token_counts = self.count_tokens()
        self.vocab = self.initialize_vocab()

    def make_bpe_style_tokens(self, token_counts):
        new_token_counts = dict()

        for key, value in token_counts.items():
            characters = list(key)
            characters += ['</W>']
            new_key = ' '.join(characters)
            new_token_counts[new_key] = value
        return new_token_counts

    def count_tokens(self):
        tokenizer = Tokenizer(nlp.vocab)
        token_counts = Counter()
        for text in tokenizer.pipe(self.corpus):
            tokens = [t.text for t in text if not t.is_punct and not t.is_space]
            token_counts.update(tokens)
        bpe_token_counts = self.make_bpe_style_tokens(token_counts)
        return bpe_token_counts

    def initialize_vocab(self):
        vocab = set()
        for token in self.token_counts.keys():
            characters = token.split()
            vocab.update(characters)
        return list(vocab)
    
    def find_most_common_pair(self):
        pairs = defaultdict(int)

        for token, value in self.token_counts.items():
            characters = token.split()
            for index in range(len(characters)-1):
                pairs[characters[index], characters[index+1]] += value
        most_common_pair = max(pairs, key=pairs.get)
        return most_common_pair
    
    def modify_vocab(self, new_character):
        self.vocab.append(new_character)
    
    def modify_tokens(self, pair):
        # Merge only adjacent occurrences of this exact pair. Checking whether the
        # concatenated string is in the vocab could also merge a different pair
        # that happens to spell the same string (e.g. 'a' + 'bc' vs. 'ab' + 'c').
        merged = ''.join(pair)
        modified_token_frequencies = dict()
        for token, value in self.token_counts.items():
            characters = token.split()
            new_characters = []
            index = 0
            while index < len(characters):
                if tuple(characters[index:index+2]) == tuple(pair):
                    new_characters.append(merged)
                    index += 2
                else:
                    new_characters.append(characters[index])
                    index += 1
            new_token = ' '.join(new_characters)
            modified_token_frequencies[new_token] = value
        self.token_counts = modified_token_frequencies
    
    def merge(self, n=None):
        if n is None:
            n = self.n_merges
        for i in range(n):
            pair = self.find_most_common_pair()
            self.modify_vocab(''.join(pair))
            self.modify_tokens(pair)
        self.n_merges_done += n

In [8]:
test = bpe_tokenizer(texts)

In [9]:
test.merge()

In [10]:
test.vocab[-10:-1]

['e</W>', 'th', 's</W>', 't</W>', 'in', 'er', 'an', 'd</W>', 'on']

The book uses a simpler example; for comparison, we can run our tokenizer on the same toy corpus.

In [11]:
example_text = ['low'] * 5
example_text += ['lowest'] * 2
example_text += ['newer'] * 6
example_text += ['wider'] * 3
example_text += ['new'] * 2


In [12]:
example_tokenizer = bpe_tokenizer(example_text)
example_tokenizer.merge(n=8)

In [13]:
example_tokenizer.vocab

['r',
 'i',
 's',
 'd',
 'o',
 't',
 '</W>',
 'e',
 'w',
 'l',
 'n',
 'er',
 'er</W>',
 'ne',
 'new',
 'lo',
 'low',
 'newer</W>',
 'low</W>']

Funnily enough, the example provided in the book highlights a point of ambiguity in the description: it isn't clear how to break ties, for example between 'er' and 'r</W>'. Since the algorithm iteratively merges characters, the choice can change which pairs are available for later merges.

Despite this ambiguity, the text's example and the given code agree on the last 4 merges.
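
The tie can be made visible directly. `max(pairs, key=pairs.get)` returns the first pair with the maximal count in the dictionary's iteration order, which in Python 3.7+ is insertion order, so the winner depends on the order in which words were counted. The sketch below uses the toy corpus with a lowercase `</w>` marker; the explicit rule at the end (highest count, then lexicographically largest pair) is just one possible convention, chosen for illustration.

```python
from collections import defaultdict

vocab = {
    'l o w </w>': 5,
    'l o w e s t </w>': 2,
    'n e w e r </w>': 6,
    'w i d e r </w>': 3,
    'n e w </w>': 2,
}

def count_pairs(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for left, right in zip(symbols, symbols[1:]):
            pairs[left, right] += freq
    return pairs

pairs = count_pairs(vocab)
best_count = max(pairs.values())
tied = [pair for pair, count in pairs.items() if count == best_count]
print(best_count, tied)

# max() keeps the first maximal key it sees, i.e. the earliest inserted pair
print(max(pairs, key=pairs.get))

# An explicit, order-independent rule: highest count, then largest pair
print(max(pairs, key=lambda pair: (pairs[pair], pair)))
```

The two rules pick different first merges, `('e', 'r')` versus `('r', '</w>')`, which is enough to send the early merges down different paths.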

### Wordpiece Segmentation

Rather than using character pair frequencies as the decisive metric, tokenizers like **Wordpiece** choose the merges that maximize the language model likelihood of the training data, and then use a greedy longest-match to split new documents into word pieces. This approach is discussed in more detail in later chapters.
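
The greedy longest-match step can be sketched independently of how the vocabulary was learned. In the sketch below the vocabulary is a hand-made toy set and the `[UNK]` fallback is an assumption for illustration; real Wordpiece vocabularies also mark word-internal pieces (BERT, for example, prefixes them with `##`), which is omitted here.

```python
def greedy_longest_match(word, vocab, unknown="[UNK]"):
    """Split `word` left to right, always taking the longest piece found in `vocab`."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate from the right until it is a known piece
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # No piece matches at this position: give up on the whole word
            return [unknown]
        pieces.append(word[start:end])
        start = end
    return pieces

toy_vocab = {"un", "break", "able", "low", "er", "est", "new"}
for word in ["unbreakable", "lowest", "newer", "wider"]:
    print(word, greedy_longest_match(word, toy_vocab))
```

Because the match is greedy, the split depends only on the vocabulary and not on the merge order that produced it.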

## Additional Notes:

Below is a reference implementation copied from the book.

In [40]:
import re, collections

def get_stats(vocab):
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols)-1):
            pairs[symbols[i], symbols[i+1]] += freq
    return pairs

def merge_vocab(pair, v_in):
    v_out = dict()
    bigram = re.escape(' '.join(pair))
    p = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    for word in v_in:
        w_out = p.sub(''.join(pair), word)
        v_out[w_out] = v_in[word]
    return v_out

num_merges = 8
vocab = {
    'l o w </w>':5,
    'l o w e s t </w>':2,
    'n e w e r </w>':6,
    'w i d e r </w>':3,
    'n e w </w>':2,
    
}


for i in range(num_merges):
    pairs = get_stats(vocab)
    best = max(pairs, key=pairs.get)
    vocab = merge_vocab(best, vocab)
    print(best)

('e', 'r')
('er', '</w>')
('n', 'e')
('ne', 'w')
('l', 'o')
('lo', 'w')
('new', 'er</w>')
('low', '</w>')
