## Tokens
![deep](https://user-images.githubusercontent.com/12748752/134754236-8d5549c9-bd05-408d-ba63-0d56ab83c999.png)
* A token is a single unit, or piece of information. 
* Typically in NLP we will find that models consume a *token*, which can represent a multitude of different things, such as:
     * A word
     * Part of a word
     * A single character
     * Punctuation mark *[,!-.]*
     * Special token like *\<URL\>*, or *\<NAME\>*
     * Model-specific special tokens, like *[CLS]* and *[SEP]* for BERT

### Tweet Data
![light](https://user-images.githubusercontent.com/12748752/134754235-ae8efaf0-a27a-46f0-b439-b114cbb8cf3e.png)


```python
tweet = """I’m amazed how often in practice, not only does a @huggingface NLP model solve your problem, but one of their public finetuned checkpoints, is good enough for the job.

Both impressed, and a little disappointed how rarely I get to actually train a model that matters :("""
```

> ### Splitting the paragraph into **word-level tokens**

```python
tweet.split()[:20]
```

```
['I’m',
 'amazed',
 'how',
 'often',
 'in',
 'practice,',
 'not',
 'only',
 'does',
 'a',
 '@huggingface',
 'NLP',
 'model',
 'solve',
 'your',
 'problem,',
 'but',
 'one',
 'of',
 'their']
```

> ### Splitting the paragraph into **character-level tokens**

```python
[char for char in tweet][:10]
```

```
['I', '’', 'm', ' ', 'a', 'm', 'a', 'z', 'e', 'd']
```

### Difference between Character-level Tokens and Word-level Tokens
![light](https://user-images.githubusercontent.com/12748752/134754235-ae8efaf0-a27a-46f0-b439-b114cbb8cf3e.png)

> ### Advantage of Character-level Tokens over Word-level Tokens
* The *advantage* of using **character-level embeddings** is that any model we train on this data only needs to remember the characters of the alphabet, punctuation characters, and spaces/newlines.
* So the model **vocabulary** (the list of all the tokens it *knows*) is very small.
* Additionally, if a new word appears outside of training, the model can still digest it, because the word is built from characters it already knows.
* A **word-level embedding** model, by contrast, would not understand the new word and would replace it with an *unknown token*.
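
The unseen-word behaviour described above can be sketched with a toy example. The training text, the `<UNK>` token name, and the helper functions `word_tokens` and `char_tokens` are assumptions made for illustration only; real models build their vocabularies from far larger corpora.

```python
# Toy training text (an assumption for this sketch, adapted from the tweet).
train_text = "not only does a model solve your problem"

# Word-level vocabulary: one entry per distinct word seen in training.
word_vocab = set(train_text.split())

# Character-level vocabulary: one entry per distinct character seen in training.
char_vocab = set(train_text)

def word_tokens(text):
    """Split on whitespace; words outside the vocabulary become <UNK>."""
    return [w if w in word_vocab else "<UNK>" for w in text.split()]

def char_tokens(text):
    """Split into characters; characters outside the vocabulary become <UNK>."""
    return [c if c in char_vocab else "<UNK>" for c in text]

new_text = "your model only seems"
print(word_tokens(new_text))  # 'seems' was never seen as a whole word
print(char_tokens("seems"))   # but every one of its characters was seen
```

The word-level tokenizer has to fall back to `<UNK>` for *'seems'*, while the character-level tokenizer can still represent it from known characters.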

> ### Advantage of Word-level Tokens over Character-level Tokens  
* Words carry a significant level of semantic meaning, and when we use character-level embeddings this is mostly lost.
* At a high level, we can view character-level embeddings as being good for syntax, and word-level embeddings as being better for semantics.
* In reality, though, word-level embeddings almost always outperform character-level embeddings.

> ### Part-word tokens
* The latest transformer models often split text into **part-word tokens**.
* So for example, we may find that the word *'being'* is split into the tokens *\["be", "-ing"\]*, or *'amazingly'* into *\["amaz", "-ing", "-ly"\]*.

* In addition to this, we typically separate **punctuation** too, so in our previous example the tokens *'@huggingface'* and *'impressed,'* would become *\["@", "huggingface"\]* and *\["impressed", ","\]* respectively.
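
As a rough sketch of the two ideas above, the snippet below separates punctuation with a regular expression and peels suffixes off a word using a small hand-written suffix list. The functions `split_punctuation` and `split_suffixes` and the `SUFFIXES` tuple are hypothetical and exist only for this illustration; real transformer tokenizers learn their part-word vocabulary from data rather than using fixed rules like these.

```python
import re

def split_punctuation(text):
    """Return runs of word characters and individual punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

# Hypothetical suffix list, chosen only to reproduce the examples in the text.
SUFFIXES = ("ing", "ly")

def split_suffixes(word):
    """Peel known suffixes off the end of a word, one at a time."""
    pieces = []
    stripped = True
    while stripped:
        stripped = False
        for suffix in SUFFIXES:
            # Keep at least two characters as the remaining stem.
            if word.endswith(suffix) and len(word) - len(suffix) >= 2:
                pieces.insert(0, "-" + suffix)
                word = word[: -len(suffix)]
                stripped = True
                break
    return [word] + pieces

print(split_punctuation("@huggingface impressed,"))
print(split_suffixes("being"))
print(split_suffixes("amazingly"))
```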

* In our tweet we might want to find any token that begins with **@** and convert it to **\<USER\>**, a unique token that we define to identify usernames in our tweets. This rule is logical because including every Twitter username could add millions of tokens to our model's vocabulary, yet a username tells the model nothing about the meaning of the text. For example:

`@elonmusk thinks that the NLP models that @joebloggs made are super cool`

* To our model, this is not meaningfully different from:

`@joebloggs thinks that the NLP models that @huggingface made are super cool`

* The meaning and subsequent classification of both tweets should really be identical in our model. 
* So, it is logical to replace usernames with a single shared token. This approach is something that is commonly used for many different things such as:
    * emails
    * names/usernames
    * URLs
    * monetary values
    * or any other numbers

* But of course we don't want to do this for everything; this is simply a rough guide as to what we *may* want to replace with a special token.
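
One possible way to apply this kind of replacement is with a few regular expressions, as in the minimal sketch below. The patterns, the token names other than `<USER>` and `<URL>`, and the `normalize` function are assumptions made for illustration; they are deliberately simple and would need tightening for real data.

```python
import re

# Hypothetical patterns and token names, chosen for this illustration.
# Order matters: URLs and emails are replaced before usernames and numbers.
REPLACEMENTS = [
    (re.compile(r"https?://\S+"), "<URL>"),
    (re.compile(r"\S+@\S+\.\S+"), "<EMAIL>"),
    (re.compile(r"@\w+"), "<USER>"),
    (re.compile(r"[$£€]\d+(?:[.,]\d+)*"), "<MONEY>"),
    (re.compile(r"\d+(?:[.,]\d+)*"), "<NUMBER>"),
]

def normalize(text):
    """Replace each matched span with its shared special token."""
    for pattern, token in REPLACEMENTS:
        text = pattern.sub(token, text)
    return text

tweet_a = "@elonmusk thinks that the NLP models that @joebloggs made are super cool"
tweet_b = "@joebloggs thinks that the NLP models that @huggingface made are super cool"

print(normalize(tweet_a))
print(normalize(tweet_a) == normalize(tweet_b))
```

After normalization both example tweets become the same string, which is the behaviour the text argues for.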


> ### BERT model-specific special tokens

* The BERT transformer model uses *five* special tokens:


<style type="text/css">
.tg  {border-collapse:collapse;border-spacing:0;}
.tg td{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
  overflow:hidden;padding:10px 5px;word-break:normal;}
.tg th{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
  font-weight:normal;overflow:hidden;padding:10px 5px;word-break:normal;}
.tg .tg-0lax{text-align:left;vertical-align:top}
</style>
<table class="tg">
<thead>
  <tr>
    <th class="tg-0lax">Token</th>
    <th class="tg-0lax">Meaning</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td class="tg-0lax"><span style="font-weight:bold">[PAD]</span></td>
    <td class="tg-0lax">Padding token, allows us to maintain same-length sequences (up to a maximum of 512 tokens for BERT) even when sentences of different sizes are fed in</td>
  </tr>
  <tr>
    <td class="tg-0lax"><span style="font-weight:bold">[UNK]</span></td>
    <td class="tg-0lax">Used when a word is unknown to BERT</td>
  </tr>
  <tr>
    <td class="tg-0lax"><span style="font-weight:bold">[CLS]</span></td>
    <td class="tg-0lax">Appears at the start of every sequence</td>
  </tr>
  <tr>
    <td class="tg-0lax"><span style="font-weight:bold">[SEP]</span></td>
    <td class="tg-0lax">Indicates a separator or end of sequence</td>
  </tr>
  <tr>
    <td class="tg-0lax"><span style="font-weight:bold">[MASK]</span></td>
    <td class="tg-0lax">Used when masking tokens, for example in training with masked language modelling (MLM)</td>
  </tr>
</tbody>
</table>

* So if we take the *'NLP models'* tweet, processing it directly with our BERT-specific tokens might look like this:

```
['[CLS]', '[UNK]', 'thinks', 'that', 'the', 'nlp', 'models', 'that', '[UNK]', 'made', 'are', 'super', 'cool', '[SEP]', '[PAD]', '[PAD]', ..., '[PAD]']
```

> ### Here, we have:

* Applied the **\[CLS\]** token to indicate the start of the sequence.
* Replaced both username tokens, *@elonmusk* and *@joebloggs*, with the unknown token **\[UNK\]**, because in this simplified example they are not 'known' words to BERT. Alternatively, we could have replaced them with our own special **user** token.
* Added the **\[SEP\]** token to the end of our sequence.
* Padded the sequence up to a length of 512 tokens *(the maximum input sequence length of the BERT model)* using **\[PAD\]** tokens.

* Different models will have different special tokens, but we will often find that they are used for similar reasons.

* That's everything on tokens for now, although we will cover tokenization in more depth (and the code too) for different models in later notebooks.