Add new pre-trained models BERTweet and PhoBERT #6129

Merged
merged 34 commits on Sep 18, 2020
Changes from all commits
Commits
34 commits
b2b24d1
Add BERTweet and PhoBERT models
datquocnguyen Jul 29, 2020
b30bcdf
Update modeling_auto.py
datquocnguyen Jul 29, 2020
aa2e5b5
Update tokenization_auto.py
datquocnguyen Jul 29, 2020
a49233f
Add BERTweet and PhoBERT to pretrained_models.rst
datquocnguyen Jul 30, 2020
c51345a
Update tokenization_auto.py
datquocnguyen Jul 30, 2020
2cf49f9
Update BertweetTokenizer - without nltk
datquocnguyen Jul 30, 2020
abc3d25
Update model card for BERTweet
datquocnguyen Jul 30, 2020
1287460
PhoBERT - with Auto mode - without import fastBPE
datquocnguyen Aug 30, 2020
efcbb59
PhoBERT - with Auto mode - without import fastBPE
datquocnguyen Aug 31, 2020
1a10bb5
BERTweet - with Auto mode - without import fastBPE
datquocnguyen Aug 31, 2020
6c95a39
Add PhoBERT and BERTweet to TF modeling auto
datquocnguyen Aug 31, 2020
65ad9bb
Improve Docstrings for PhobertTokenizer and BertweetTokenizer
datquocnguyen Sep 1, 2020
6592499
Update PhoBERT and BERTweet model cards
datquocnguyen Sep 3, 2020
80b5cb2
Resolved merge conflicts
datquocnguyen Sep 3, 2020
db78504
Fixed a merge conflict in tokenization_auto
datquocnguyen Sep 3, 2020
384c599
Used black to reformat BERTweet- and PhoBERT-related files
datquocnguyen Sep 4, 2020
50407ac
Used isort to reformat BERTweet- and PhoBERT-related files
datquocnguyen Sep 4, 2020
28f514a
Reformatted BERTweet- and PhoBERT-related files based on flake8
datquocnguyen Sep 4, 2020
bda755b
Updated test files
datquocnguyen Sep 4, 2020
8e5aa34
Updated test files
datquocnguyen Sep 4, 2020
6d6e9cd
Updated tf test files
datquocnguyen Sep 4, 2020
f9d694c
Updated tf test files
datquocnguyen Sep 4, 2020
0811a1a
Updated tf test files
datquocnguyen Sep 4, 2020
7e083b9
Updated tf test files
datquocnguyen Sep 4, 2020
3885e12
Update commits from huggingface
datquocnguyen Sep 15, 2020
926a6f6
Merge branch 'master' of git://github.com/huggingface/transformers
datquocnguyen Sep 15, 2020
8db4b5b
Update commits from huggingface
datquocnguyen Sep 15, 2020
3a883d2
Delete unnecessary files
datquocnguyen Sep 15, 2020
e1ee98a
Add tokenizers to auto and init files
datquocnguyen Sep 16, 2020
f15b08b
Add test files for tokenizers
datquocnguyen Sep 16, 2020
62beafe
Revised model cards
datquocnguyen Sep 16, 2020
73de519
Update save_vocabulary function in BertweetTokenizer and PhobertToken…
datquocnguyen Sep 16, 2020
1207489
Revised test files
datquocnguyen Sep 17, 2020
257b9f1
Update orders of Phobert and Bertweet tokenizers in auto tokenization…
datquocnguyen Sep 17, 2020
71 changes: 71 additions & 0 deletions model_cards/vinai/bertweet-base/README.md
@@ -0,0 +1,71 @@
# <a name="introduction"></a> BERTweet: A pre-trained language model for English Tweets

- BERTweet is the first public large-scale language model pre-trained for English Tweets. BERTweet is trained based on the [RoBERTa](https://github.com/pytorch/fairseq/blob/master/examples/roberta/README.md) pre-training procedure, using the same model configuration as [BERT-base](https://github.com/google-research/bert).
- The corpus used to pre-train BERTweet consists of 850M English Tweets (16B word tokens ~ 80GB), containing 845M Tweets streamed from 01/2012 to 08/2019 and 5M Tweets related to the **COVID-19** pandemic.
- BERTweet outperforms its strong baselines RoBERTa-base and [XLM-R-base](https://arxiv.org/abs/1911.02116), as well as previous state-of-the-art models, on three downstream Tweet NLP tasks: Part-of-speech tagging, Named-entity recognition and text classification.

The general architecture and experimental results of BERTweet can be found in our EMNLP-2020 demo [paper](https://arxiv.org/abs/2005.10200):

    @inproceedings{bertweet,
    title     = {{BERTweet: A pre-trained language model for English Tweets}},
    author    = {Dat Quoc Nguyen and Thanh Vu and Anh Tuan Nguyen},
    booktitle = {Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations},
    year      = {2020}
    }

**Please CITE** our paper when BERTweet is used to help produce published results or is incorporated into other software.

For further information or requests, please go to [BERTweet's homepage](https://github.com/VinAIResearch/BERTweet)!

## <a name="install2"></a> Installation

- Python version >= 3.6
- [PyTorch](http://pytorch.org/) version >= 1.4.0
- `pip3 install transformers emoji`

## <a name="models2"></a> Pre-trained model

Model | #params | Arch. | Pre-training data
---|---|---|---
`vinai/bertweet-base` | 135M | base | 845M English Tweets (80GB)


## <a name="usage2"></a> Example usage


```python
import torch
from transformers import AutoModel, AutoTokenizer #, BertweetTokenizer

bertweet = AutoModel.from_pretrained("vinai/bertweet-base")
tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base")
#tokenizer = BertweetTokenizer.from_pretrained("vinai/bertweet-base")

# INPUT TWEET IS ALREADY NORMALIZED!
line = "SC has first two presumptive cases of coronavirus , DHEC confirms HTTPURL via @USER :cry:"

input_ids = torch.tensor([tokenizer.encode(line)])

with torch.no_grad():
    features = bertweet(input_ids)  # Model outputs are now tuples
```
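
The model output is a tuple whose first element holds the contextual embeddings of the input tokens. A minimal sketch of extracting them (continuing the snippet above; shapes shown for the base architecture):

```python
# features[0] contains the last hidden states with shape
# (batch_size, sequence_length, hidden_size); hidden_size is 768 for the base model.
last_hidden_states = features[0]
print(last_hidden_states.shape)
```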

## <a name="preprocess"></a> Normalize raw input Tweets

Before applying `fastBPE` to the pre-training corpus of 850M English Tweets, we tokenized these Tweets using `TweetTokenizer` from the NLTK toolkit and used the `emoji` package to translate emotion icons into text strings (here, each icon is referred to as one word token). We also normalized the Tweets by converting user mentions and web/URL links into the special tokens `@USER` and `HTTPURL`, respectively. It is thus recommended to apply the same pre-processing steps to raw input Tweets in BERTweet-based downstream applications.

```python
import torch
from transformers import BertweetTokenizer

# Load the BertweetTokenizer with a normalization mode if the input Tweet is raw
tokenizer = BertweetTokenizer.from_pretrained("vinai/bertweet-base", normalization=True)

# BERTweet's tokenizer can also be loaded in the "Auto" mode
# from transformers import AutoTokenizer
# tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", normalization=True)

line = "SC has first two presumptive cases of coronavirus, DHEC confirms https://postandcourier.com/health/covid19/sc-has-first-two-presumptive-cases-of-coronavirus-dhec-confirms/article_bddfe4ae-5fd3-11ea-9ce4-5f495366cee6.html?utm_medium=social&utm_source=twitter&utm_campaign=user-share… via @postandcourier"

input_ids = torch.tensor([tokenizer.encode(line)])
```
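
To sanity-check the normalization, the encoded ids can be mapped back to tokens; a minimal sketch (the exact subword segmentation may vary with the tokenizer version):

```python
# The user mention and the URL should have been normalized into @USER and HTTPURL.
tokens = tokenizer.convert_ids_to_tokens(input_ids[0].tolist())
print(tokens)
```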
51 changes: 51 additions & 0 deletions model_cards/vinai/phobert-base/README.md
@@ -0,0 +1,51 @@
# <a name="introduction"></a> PhoBERT: Pre-trained language models for Vietnamese

Pre-trained PhoBERT models are the state-of-the-art language models for Vietnamese ([Pho](https://en.wikipedia.org/wiki/Pho), i.e. "Phở", is a popular food in Vietnam):

- The two PhoBERT versions, "base" and "large", are the first public large-scale monolingual language models pre-trained for Vietnamese. The PhoBERT pre-training approach is based on [RoBERTa](https://github.com/pytorch/fairseq/blob/master/examples/roberta/README.md), which optimizes the [BERT](https://github.com/google-research/bert) pre-training procedure for more robust performance.
- PhoBERT outperforms previous monolingual and multilingual approaches, obtaining new state-of-the-art performance on four downstream Vietnamese NLP tasks: Part-of-speech tagging, Dependency parsing, Named-entity recognition and Natural language inference.

The general architecture and experimental results of PhoBERT can be found in our EMNLP-2020 Findings [paper](https://arxiv.org/abs/2003.00744):

    @article{phobert,
    title     = {{PhoBERT: Pre-trained language models for Vietnamese}},
    author    = {Dat Quoc Nguyen and Anh Tuan Nguyen},
    journal   = {Findings of EMNLP},
    year      = {2020}
    }

**Please CITE** our paper when PhoBERT is used to help produce published results or is incorporated into other software.

For further information or requests, please go to [PhoBERT's homepage](https://github.com/VinAIResearch/PhoBERT)!

## Installation <a name="install2"></a>
- Python version >= 3.6
- [PyTorch](http://pytorch.org/) version >= 1.4.0
- `pip3 install transformers`

## Pre-trained models <a name="models2"></a>


Model | #params | Arch. | Pre-training data
---|---|---|---
`vinai/phobert-base` | 135M | base | 20GB of texts
`vinai/phobert-large` | 370M | large | 20GB of texts

## Example usage <a name="usage2"></a>

```python
import torch
from transformers import AutoModel, AutoTokenizer #, PhobertTokenizer

phobert = AutoModel.from_pretrained("vinai/phobert-base")
tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base")
#tokenizer = PhobertTokenizer.from_pretrained("vinai/phobert-base")

# INPUT TEXT MUST BE ALREADY WORD-SEGMENTED!
line = "Tôi là sinh_viên trường đại_học Công_nghệ ."

input_ids = torch.tensor([tokenizer.encode(line)])

with torch.no_grad():
    features = phobert(input_ids)  # Model outputs are now tuples
```
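
As noted above, PhoBERT expects word-segmented input. A minimal sketch of segmenting raw text with VnCoreNLP's RDRSegmenter (as recommended on PhoBERT's homepage), assuming the `vncorenlp` Python wrapper is installed and the jar path is adjusted to your setup:

```python
from vncorenlp import VnCoreNLP

# Load the RDRSegmenter word segmenter (the path to the jar file is a placeholder).
rdrsegmenter = VnCoreNLP("/path/to/VnCoreNLP-1.1.1.jar", annotators="wseg", max_heap_size="-Xmx500m")

text = "Tôi là sinh viên trường đại học Công nghệ ."

# tokenize() returns a list of word-segmented sentences, each a list of tokens.
sentences = rdrsegmenter.tokenize(text)
line = " ".join(sentences[0])  # e.g. "Tôi là sinh_viên trường đại_học Công_nghệ ."
```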
51 changes: 51 additions & 0 deletions model_cards/vinai/phobert-large/README.md
@@ -0,0 +1,51 @@
# <a name="introduction"></a> PhoBERT: Pre-trained language models for Vietnamese

Pre-trained PhoBERT models are the state-of-the-art language models for Vietnamese ([Pho](https://en.wikipedia.org/wiki/Pho), i.e. "Phở", is a popular food in Vietnam):

- The two PhoBERT versions, "base" and "large", are the first public large-scale monolingual language models pre-trained for Vietnamese. The PhoBERT pre-training approach is based on [RoBERTa](https://github.com/pytorch/fairseq/blob/master/examples/roberta/README.md), which optimizes the [BERT](https://github.com/google-research/bert) pre-training procedure for more robust performance.
- PhoBERT outperforms previous monolingual and multilingual approaches, obtaining new state-of-the-art performance on four downstream Vietnamese NLP tasks: Part-of-speech tagging, Dependency parsing, Named-entity recognition and Natural language inference.

The general architecture and experimental results of PhoBERT can be found in our EMNLP-2020 Findings [paper](https://arxiv.org/abs/2003.00744):

    @article{phobert,
    title     = {{PhoBERT: Pre-trained language models for Vietnamese}},
    author    = {Dat Quoc Nguyen and Anh Tuan Nguyen},
    journal   = {Findings of EMNLP},
    year      = {2020}
    }

**Please CITE** our paper when PhoBERT is used to help produce published results or is incorporated into other software.

For further information or requests, please go to [PhoBERT's homepage](https://github.com/VinAIResearch/PhoBERT)!

## Installation <a name="install2"></a>
- Python version >= 3.6
- [PyTorch](http://pytorch.org/) version >= 1.4.0
- `pip3 install transformers`

## Pre-trained models <a name="models2"></a>


Model | #params | Arch. | Pre-training data
---|---|---|---
`vinai/phobert-base` | 135M | base | 20GB of texts
`vinai/phobert-large` | 370M | large | 20GB of texts

## Example usage <a name="usage2"></a>

```python
import torch
from transformers import AutoModel, AutoTokenizer #, PhobertTokenizer

phobert = AutoModel.from_pretrained("vinai/phobert-large")
tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-large")
#tokenizer = PhobertTokenizer.from_pretrained("vinai/phobert-base")

# INPUT TEXT MUST BE ALREADY WORD-SEGMENTED!
line = "Tôi là sinh_viên trường đại_học Công_nghệ ."

input_ids = torch.tensor([tokenizer.encode(line)])

with torch.no_grad():
    features = phobert(input_ids)  # Model outputs are now tuples
```
2 changes: 2 additions & 0 deletions src/transformers/__init__.py
@@ -145,6 +145,7 @@
from .tokenization_bert import BasicTokenizer, BertTokenizer, BertTokenizerFast, WordpieceTokenizer
from .tokenization_bert_generation import BertGenerationTokenizer
from .tokenization_bert_japanese import BertJapaneseTokenizer, CharacterTokenizer, MecabTokenizer
from .tokenization_bertweet import BertweetTokenizer
from .tokenization_camembert import CamembertTokenizer
from .tokenization_ctrl import CTRLTokenizer
from .tokenization_distilbert import DistilBertTokenizer, DistilBertTokenizerFast
@@ -166,6 +167,7 @@
from .tokenization_mobilebert import MobileBertTokenizer, MobileBertTokenizerFast
from .tokenization_openai import OpenAIGPTTokenizer, OpenAIGPTTokenizerFast
from .tokenization_pegasus import PegasusTokenizer
from .tokenization_phobert import PhobertTokenizer
from .tokenization_reformer import ReformerTokenizer
from .tokenization_retribert import RetriBertTokenizer, RetriBertTokenizerFast
from .tokenization_roberta import RobertaTokenizer, RobertaTokenizerFast
4 changes: 4 additions & 0 deletions src/transformers/tokenization_auto.py
@@ -54,6 +54,7 @@
from .tokenization_bert import BertTokenizer, BertTokenizerFast
from .tokenization_bert_generation import BertGenerationTokenizer
from .tokenization_bert_japanese import BertJapaneseTokenizer
from .tokenization_bertweet import BertweetTokenizer
from .tokenization_camembert import CamembertTokenizer
from .tokenization_ctrl import CTRLTokenizer
from .tokenization_distilbert import DistilBertTokenizer, DistilBertTokenizerFast
@@ -68,6 +69,7 @@
from .tokenization_mobilebert import MobileBertTokenizer, MobileBertTokenizerFast
from .tokenization_openai import OpenAIGPTTokenizer, OpenAIGPTTokenizerFast
from .tokenization_pegasus import PegasusTokenizer
from .tokenization_phobert import PhobertTokenizer
from .tokenization_reformer import ReformerTokenizer
from .tokenization_retribert import RetriBertTokenizer, RetriBertTokenizerFast
from .tokenization_roberta import RobertaTokenizer, RobertaTokenizerFast
@@ -96,6 +98,8 @@
(MarianConfig, (MarianTokenizer, None)),
(BartConfig, (BartTokenizer, BartTokenizerFast)),
(LongformerConfig, (LongformerTokenizer, LongformerTokenizerFast)),
(RobertaConfig, (BertweetTokenizer, None)),
(RobertaConfig, (PhobertTokenizer, None)),
(RobertaConfig, (RobertaTokenizer, RobertaTokenizerFast)),
(ReformerConfig, (ReformerTokenizer, None)),
(ElectraConfig, (ElectraTokenizer, ElectraTokenizerFast)),