## Summary

### Hugging Face Tokenizer Properties
The Hugging Face API allows you to use a variety of tokenizers, each with its own properties. In this demo, we compared:

- `bert-base-cased`
- `xlm-roberta-base`
- `google/pegasus-xsum`
- `allenai/longformer-base-4096`

### Maximum Length
Tokenizers differ in how much text they can handle, and some support much longer input sequences than others. The `.model_max_length` property of the tokenizer object reports the maximum number of tokens the associated model can accept.

If your data exceeds the maximum length of your tokenizer, you may need to chunk it before tokenizing, or switch to a tokenizer with a longer maximum length.
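The chunking idea can be sketched in plain Python. This is a minimal illustration, not the demo's own code: the helper names `effective_max_length` and `chunk_token_ids`, the `fallback` and `overlap` parameters, and the placeholder threshold are all assumptions made for the example. A stand-in object replaces a real tokenizer so the sketch runs without downloading a model.

```python
from types import SimpleNamespace

# Assumption: values above this threshold are treated as "no real limit set".
# Some tokenizers report a very large placeholder when no limit is configured.
PLACEHOLDER_THRESHOLD = 1_000_000


def effective_max_length(tokenizer, fallback=512):
    """Return tokenizer.model_max_length, or `fallback` if it looks unset."""
    max_len = getattr(tokenizer, "model_max_length", None)
    if max_len is None or max_len > PLACEHOLDER_THRESHOLD:
        return fallback
    return int(max_len)


def chunk_token_ids(token_ids, max_length, overlap=0):
    """Split a list of token IDs into windows of at most `max_length` IDs.

    Consecutive windows share `overlap` IDs so context is not cut abruptly.
    """
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    if not 0 <= overlap < max_length:
        raise ValueError("overlap must satisfy 0 <= overlap < max_length")
    step = max_length - overlap
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + max_length])
        if start + max_length >= len(token_ids):
            break  # the last window already reaches the end of the input
    return chunks


# Stand-in for a real tokenizer (hypothetical limit of 8 tokens).
fake_tokenizer = SimpleNamespace(model_max_length=8)
ids = list(range(20))  # pretend these are input IDs from a tokenizer
limit = effective_max_length(fake_tokenizer)
chunks = chunk_token_ids(ids, limit, overlap=2)
for chunk in chunks:
    print(len(chunk), chunk)
```

With a real tokenizer, the IDs would typically come from `tokenizer(text, add_special_tokens=False)["input_ids"]`. The limit usually has to leave room for any special tokens the tokenizer adds, which `tokenizer.num_special_tokens_to_add()` reports.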

### Special Tokens
Different tokenizers will have different special tokens defined. They might have tokens representing:

- Unknown token
- Beginning of sequence token
- Separator token
- Padding token
- Classifier token
- Mask token
  
Additionally, there may be multiple variants of a given special token. For example, some tokenizers define several unknown tokens (e.g. `<unk>` and `<unk_2>`).
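The special tokens listed above can be inspected through attributes on the tokenizer object. The sketch below assumes the usual attribute names (`unk_token`, `bos_token`, `sep_token`, `pad_token`, `cls_token`, `mask_token`, `additional_special_tokens`). The helper `summarize_special_tokens` is a hypothetical name, and the stand-in object uses made-up values rather than those of any real model.

```python
from types import SimpleNamespace

# Attribute names for the special-token roles listed above.
SPECIAL_TOKEN_ATTRS = [
    "unk_token",
    "bos_token",
    "sep_token",
    "pad_token",
    "cls_token",
    "mask_token",
]


def summarize_special_tokens(tokenizer):
    """Map each special-token role to its string, or None if undefined."""
    summary = {name: getattr(tokenizer, name, None) for name in SPECIAL_TOKEN_ATTRS}
    # Extra variants (for example several unknown tokens) are kept in a list.
    summary["additional_special_tokens"] = list(
        getattr(tokenizer, "additional_special_tokens", None) or []
    )
    return summary


# Stand-in object with illustrative values, not taken from a real model.
fake_tokenizer = SimpleNamespace(
    unk_token="<unk>",
    pad_token="<pad>",
    additional_special_tokens=["<unk_2>", "<unk_3>"],
)
summary = summarize_special_tokens(fake_tokenizer)
for role, value in summary.items():
    print(f"{role}: {value}")
```

Passing a tokenizer loaded with `AutoTokenizer.from_pretrained(...)` instead of the stand-in should give the same kind of summary. The tokenizer's `special_tokens_map` property offers a similar view of the tokens that are actually set.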

### Hugging Face Tokenizers Takeaways
**Different tokenizers can create very different tokens for the same piece of text.** When choosing a tokenizer, consider what properties are important to you, such as the maximum length and the special tokens.
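A side-by-side comparison makes this difference concrete. The following is a rough sketch: `compare_tokenizations` is a hypothetical helper, and the two toy classes are stand-ins that only mimic the `.tokenize(text)` method of a real tokenizer so the example runs without any downloads.

```python
def compare_tokenizations(text, tokenizers):
    """Return {label: tokens} for each tokenizer in `tokenizers`.

    `tokenizers` maps a label to any object with a .tokenize(text) method.
    """
    return {name: tok.tokenize(text) for name, tok in tokenizers.items()}


class WhitespaceTokenizer:
    """Toy stand-in: splits on whitespace."""

    def tokenize(self, text):
        return text.split()


class CharTokenizer:
    """Toy stand-in: one token per non-space character."""

    def tokenize(self, text):
        return [ch for ch in text if not ch.isspace()]


results = compare_tokenizations(
    "Hello world",
    {"whitespace": WhitespaceTokenizer(), "char": CharTokenizer()},
)
for name, tokens in results.items():
    print(f"{name}: {len(tokens)} tokens -> {tokens}")
```

With `transformers` installed, the same helper could be given a dictionary such as `{name: AutoTokenizer.from_pretrained(name) for name in model_names}`, where `model_names` holds the checkpoints compared in the demo.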

If none of the available tokenizers performs the way you need, you can also train a new tokenizer from an existing one on your own data to adapt it to your use case (fast tokenizers provide `train_new_from_iterator` for this).

### Documentation on Hugging Face Tokenizers and Models
- [PreTrainedTokenizer](https://huggingface.co/docs/transformers/main_classes/tokenizer)
- [AutoTokenizer](https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoTokenizer)
- Documentation on some available models:
    - [bert-base-cased](https://huggingface.co/docs/transformers/model_doc/bert)
    - [xlm-roberta-base](https://huggingface.co/docs/transformers/model_doc/xlm-roberta)
    - [google/pegasus-xsum](https://huggingface.co/docs/transformers/model_doc/pegasus)
    - [allenai/longformer-base-4096](https://huggingface.co/docs/transformers/model_doc/longformer)

[Hugging Face Tokenizer Properties Demo](./2.11e.ipynb)
