
Only access loss tensor every logging_steps #6802

Merged: 23 commits, Aug 31, 2020
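The change described by the PR title lives in the `Trainer` training loop, which is not part of the file list below. As a hedged sketch of the pattern (keep the running loss as a device tensor and call `.item()`, a host sync that is expensive on TPU, only once every `logging_steps`), it might look like this; `train_loop` and its arguments are illustrative, not code from this PR:

```python
import torch

def train_loop(model, optimizer, dataloader, logging_steps: int = 500):
    device = next(model.parameters()).device
    tr_loss = torch.tensor(0.0, device=device)  # running loss stays on-device
    logging_loss_scalar = 0.0
    for step, batch in enumerate(dataloader, start=1):
        loss = model(**batch)[0]     # assume the model returns the loss first
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        tr_loss += loss.detach()     # no .item() here, so no per-step device sync
        if step % logging_steps == 0:
            tr_loss_scalar = tr_loss.item()  # one host sync per logging window
            print({"loss": (tr_loss_scalar - logging_loss_scalar) / logging_steps})
            logging_loss_scalar = tr_loss_scalar
```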
Commits (23)
18fc69a
Only access loss tensor every logging_steps
jysohn23 Aug 28, 2020
20f7786
Fix style (#6803)
sshleifer Aug 28, 2020
3cac867
t5 model should make decoder_attention_mask (#6800)
sshleifer Aug 28, 2020
5ab21b0
[s2s] Test hub configs in self-scheduled CI (#6809)
sshleifer Aug 28, 2020
ac47458
[s2s] round runtime in run_eval (#6798)
sshleifer Aug 29, 2020
0f58903
Pegasus finetune script: add --adafactor (#6811)
sshleifer Aug 29, 2020
22933e6
[bart] rename self-attention -> attention (#6708)
sshleifer Aug 29, 2020
563485b
[tests] fix typos in inputs (#6818)
stas00 Aug 30, 2020
a584761
Fixed open in colab link (#6825)
PandaWhoCodes Aug 30, 2020
d176aaa
Add model card for singbert lite. Update widget for singbert and sing…
zyuanlim Aug 30, 2020
0eecace
BR_BERTo model card (#6793)
rdenadai Aug 30, 2020
32fe440
clearly indicate shuffle=False (#6312)
xujiaze13 Aug 30, 2020
dfa10a4
[s2s README] Add more dataset download instructions (#6737)
sshleifer Aug 30, 2020
0e83769
Style
LysandreJik Aug 31, 2020
05c3214
Patch logging issue
LysandreJik Aug 31, 2020
4561f05
Set default logging level to `WARNING` instead of `INFO`
LysandreJik Aug 31, 2020
895d394
TF Flaubert w/ pre-norm (#6841)
LysandreJik Aug 31, 2020
2de7ee0
Dataset and DataCollator for BERT Next Sentence Prediction (NSP) task…
mojave-pku Aug 31, 2020
d2f9cb8
Fix in Adafactor docstrings (#6845)
sgugger Aug 31, 2020
c48546c
Fix resuming training for Windows (#6847)
sgugger Aug 31, 2020
ac03af4
Only access loss tensor every logging_steps
jysohn23 Aug 28, 2020
db74df3
Merge branch 'tpu-mlm' of https://github.com/jysohn23/transformers in…
jysohn23 Aug 31, 2020
2b981cd
comments
jysohn23 Aug 31, 2020
4 changes: 2 additions & 2 deletions examples/lightning_base.py
@@ -178,10 +178,10 @@ def train_dataloader(self):
return self.train_loader

def val_dataloader(self):
return self.get_dataloader("dev", self.hparams.eval_batch_size)
return self.get_dataloader("dev", self.hparams.eval_batch_size, shuffle=False)

def test_dataloader(self):
return self.get_dataloader("test", self.hparams.eval_batch_size)
return self.get_dataloader("test", self.hparams.eval_batch_size, shuffle=False)

def _feature_file(self, mode):
return os.path.join(
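For context, the calls above assume a `get_dataloader` hook that accepts a `shuffle` flag. A minimal sketch of such a hook follows; the class name and `get_dataset` helper are assumptions for illustration, not part of this diff:

```python
from torch.utils.data import DataLoader, Dataset

class BaseModuleSketch:
    """Hedged sketch of the hook the calls above rely on, not the real class."""

    def get_dataset(self, mode: str) -> Dataset:  # assumed helper
        raise NotImplementedError

    def get_dataloader(self, mode: str, batch_size: int, shuffle: bool = False) -> DataLoader:
        # Passing shuffle=False explicitly for "dev" and "test" makes it clear
        # that evaluation order is never randomized.
        return DataLoader(self.get_dataset(mode), batch_size=batch_size, shuffle=shuffle)
```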
32 changes: 23 additions & 9 deletions examples/seq2seq/README.md
@@ -6,8 +6,9 @@ Please tag @sshleifer with any issues/unexpected behaviors, or send a PR!
For `bertabs` instructions, see [`bertabs/README.md`](bertabs/README.md).


### Data
XSUM Data:
## Datasets

#### XSUM:
```bash
cd examples/seq2seq
wget https://s3.amazonaws.com/datasets.huggingface.co/summarization/xsum.tar.gz
@@ -17,23 +18,33 @@ export XSUM_DIR=${PWD}/xsum
this should make a directory called `xsum/` with files like `test.source`.
To use your own data, copy that files format. Each article to be summarized is on its own line.

CNN/DailyMail data
#### CNN/DailyMail
```bash
cd examples/seq2seq
wget https://s3.amazonaws.com/datasets.huggingface.co/summarization/cnn_dm.tgz
tar -xzvf cnn_dm.tgz
wget https://s3.amazonaws.com/datasets.huggingface.co/summarization/cnn_dm_v2.tgz
tar -xzvf cnn_dm_v2.tgz # empty lines removed
mv cnn_cln cnn_dm
export CNN_DIR=${PWD}/cnn_dm
this should make a directory called `cnn_dm/` with files like `test.source`.
```
this should make a directory called `cnn_dm/` with 6 files.

WMT16 English-Romanian Translation Data:
#### WMT16 English-Romanian Translation Data:
download with this command:
```bash
wget https://s3.amazonaws.com/datasets.huggingface.co/translation/wmt_en_ro.tar.gz
tar -xzvf wmt_en_ro.tar.gz
export ENRO_DIR=${PWD}/wmt_en_ro
this should make a directory called `wmt_en_ro/` with files like `test.source`.
```
this should make a directory called `wmt_en_ro/` with 6 files.

#### WMT English-German:
```bash
wget https://s3.amazonaws.com/datasets.huggingface.co/translation/wmt_en_de.tgz
tar -xzvf wmt_en_de.tar.gz
export DATA_DIR=${PWD}/wmt_en_de
```

#### Private Data

If you are using your own data, it must be formatted as one directory with 6 files:
```
train.source
train.target
val.source
val.target
test.source
test.target
```

@@ -75,7 +86,8 @@ Datasets: `LegacySeq2SeqDataset` will be used for all tokenizers without a `prep
Future work/help wanted: A new dataset to support multilingual tasks.


### Command Line Options
### Finetuning Scripts
All finetuning bash scripts call finetune.py (or distillation.py) with reasonable command line arguments. They usually require extra command line arguments to work.

To see all the possible command line options, run:

Expand Down Expand Up @@ -110,6 +122,8 @@ The following command should work on a 16GB GPU:
--model_name_or_path facebook/bart-large
```

There is a starter finetuning script for pegasus at `finetune_pegasus_xsum.sh`.

### Translation Finetuning

First, follow the wmt_en_ro download instructions.
2 changes: 1 addition & 1 deletion examples/seq2seq/finetune_pegasus_xsum.sh
@@ -10,5 +10,5 @@ python finetune.py \
--n_val 1000 \
--val_check_interval 0.25 \
--max_source_length 512 --max_target_length 56 \
--freeze_embeds --max_target_length 56 --label_smoothing 0.1 \
--freeze_embeds --label_smoothing 0.1 --adafactor --task summarization_xsum \
"$@"
2 changes: 1 addition & 1 deletion examples/seq2seq/run_eval.py
@@ -67,7 +67,7 @@ def generate_summaries_or_translations(
fout.write(hypothesis + "\n")
fout.flush()
fout.close()
runtime = time.time() - start_time
runtime = int(time.time() - start_time) # seconds
n_obs = len(examples)
return dict(n_obs=n_obs, runtime=runtime, seconds_per_sample=round(runtime / n_obs, 4))

24 changes: 22 additions & 2 deletions examples/seq2seq/test_seq2seq_examples.py
@@ -13,9 +13,10 @@
from torch.utils.data import DataLoader

import lightning_base
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from transformers import AutoConfig, AutoModelForSeq2SeqLM, AutoTokenizer
from transformers.hf_api import HfApi
from transformers.modeling_bart import shift_tokens_right
from transformers.testing_utils import CaptureStderr, CaptureStdout, require_multigpu
from transformers.testing_utils import CaptureStderr, CaptureStdout, require_multigpu, require_torch_and_cuda, slow

from .distillation import distill_main, evaluate_checkpoint
from .finetune import SummarizationModule, main
@@ -116,6 +117,25 @@ def setUpClass(cls):
logging.disable(logging.CRITICAL) # remove noisy download output from tracebacks
return cls

@slow
@require_torch_and_cuda
def test_hub_configs(self):
"""I put require_torch_and_cuda cause I only want this to run with self-scheduled."""

model_list = HfApi().model_list()
org = "sshleifer"
model_ids = [x.modelId for x in model_list if x.modelId.startswith(org)]
allowed_to_be_broken = ["sshleifer/blenderbot-3B", "sshleifer/blenderbot-90M"]
failures = []
for m in model_ids:
if m in allowed_to_be_broken:
continue
try:
AutoConfig.from_pretrained(m)
except Exception:
failures.append(m)
assert not failures, f"The following models could not be loaded through AutoConfig: {failures}"

@require_multigpu
def test_multigpu(self):
updates = dict(
14 changes: 9 additions & 5 deletions model_cards/rdenadai/BR_BERTo/README.md
@@ -14,13 +14,17 @@ Portuguese (Brazil) model for text inference.

## Params

Trained on a corpus of 5_258_624 sentences, with 132_807_374 non unique tokens (992_418 unique tokens).
Trained on a corpus of 6_993_330 sentences.

- Vocab size: 220_000
- RobertaForMaskedLM size : 32
- Num train epochs: 2
- Time to train: ~23hs (on GCP with a Nvidia T4)
- Vocab size: 150_000
- RobertaForMaskedLM size : 512
- Num train epochs: 3
- Time to train: ~10days (on GCP with a Nvidia T4)

I follow the great tutorial from HuggingFace team:

[How to train a new language model from scratch using Transformers and Tokenizers](https://huggingface.co/blog/how-to-train)

More info here:

[BR_BERTo](https://github.com/rdenadai/BR-BERTo)
6 changes: 3 additions & 3 deletions model_cards/zanelim/singbert-large-sg/README.md
@@ -13,17 +13,17 @@ datasets:
- reddit singapore, malaysia
- hardwarezone
widget:
- text: "die [MASK] must try"
- text: "kopi c siew [MASK]"
- text: "die [MASK] must try"
---

# Model name

SingBert - Bert for Singlish (SG) and Manglish (MY).
SingBert Large - Bert for Singlish (SG) and Manglish (MY).

## Model description

Similar to [SingBert](https://huggingface.co/zanelim/singbert) but initialized from [BERT large uncased (whole word masking)](https://github.com/google-research/bert#pre-trained-models), with pre-training finetuned on
Similar to [SingBert](https://huggingface.co/zanelim/singbert) but the large version, which was initialized from [BERT large uncased (whole word masking)](https://github.com/google-research/bert#pre-trained-models), with pre-training finetuned on
[singlish](https://en.wikipedia.org/wiki/Singlish) and [manglish](https://en.wikipedia.org/wiki/Manglish) data.

## Intended uses & limitations
168 changes: 168 additions & 0 deletions model_cards/zanelim/singbert-lite-sg/README.md
@@ -0,0 +1,168 @@
---
language: en
tags:
- singapore
- sg
- singlish
- malaysia
- ms
- manglish
- albert-base-v2
license: mit
datasets:
- reddit singapore, malaysia
- hardwarezone
widget:
- text: "dont play [MASK] leh"
- text: "die [MASK] must try"
---

# Model name

SingBert Lite - Bert for Singlish (SG) and Manglish (MY).

## Model description

Similar to [SingBert](https://huggingface.co/zanelim/singbert) but the lite-version, which was initialized from [Albert base v2](https://github.com/google-research/albert#albert), with pre-training finetuned on
[singlish](https://en.wikipedia.org/wiki/Singlish) and [manglish](https://en.wikipedia.org/wiki/Manglish) data.

## Intended uses & limitations

#### How to use

```python
>>> from transformers import pipeline
>>> nlp = pipeline('fill-mask', model='zanelim/singbert-lite-sg')
>>> nlp("die [MASK] must try")

[{'sequence': '[CLS] die die must try[SEP]',
'score': 0.7731555700302124,
'token': 1327,
'token_str': '▁die'},
{'sequence': '[CLS] die also must try[SEP]',
'score': 0.04763784259557724,
'token': 67,
'token_str': '▁also'},
{'sequence': '[CLS] die still must try[SEP]',
'score': 0.01859409362077713,
'token': 174,
'token_str': '▁still'},
{'sequence': '[CLS] die u must try[SEP]',
'score': 0.015824034810066223,
'token': 287,
'token_str': '▁u'},
{'sequence': '[CLS] die is must try[SEP]',
'score': 0.011271446943283081,
'token': 25,
'token_str': '▁is'}]

>>> nlp("dont play [MASK] leh")

[{'sequence': '[CLS] dont play play leh[SEP]',
'score': 0.4365769624710083,
'token': 418,
'token_str': '▁play'},
{'sequence': '[CLS] dont play punk leh[SEP]',
'score': 0.06880936771631241,
'token': 6769,
'token_str': '▁punk'},
{'sequence': '[CLS] dont play game leh[SEP]',
'score': 0.051739856600761414,
'token': 250,
'token_str': '▁game'},
{'sequence': '[CLS] dont play games leh[SEP]',
'score': 0.045703962445259094,
'token': 466,
'token_str': '▁games'},
{'sequence': '[CLS] dont play around leh[SEP]',
'score': 0.013458190485835075,
'token': 140,
'token_str': '▁around'}]

>>> nlp("catch no [MASK]")

[{'sequence': '[CLS] catch no ball[SEP]',
'score': 0.6197211146354675,
'token': 1592,
'token_str': '▁ball'},
{'sequence': '[CLS] catch no balls[SEP]',
'score': 0.08441998809576035,
'token': 7152,
'token_str': '▁balls'},
{'sequence': '[CLS] catch no joke[SEP]',
'score': 0.0676785409450531,
'token': 8186,
'token_str': '▁joke'},
{'sequence': '[CLS] catch no?[SEP]',
'score': 0.040638409554958344,
'token': 60,
'token_str': '?'},
{'sequence': '[CLS] catch no one[SEP]',
'score': 0.03546864539384842,
'token': 53,
'token_str': '▁one'}]

>>> nlp("confirm plus [MASK]")

[{'sequence': '[CLS] confirm plus chop[SEP]',
'score': 0.9608421921730042,
'token': 17144,
'token_str': '▁chop'},
{'sequence': '[CLS] confirm plus guarantee[SEP]',
'score': 0.011784233152866364,
'token': 9120,
'token_str': '▁guarantee'},
{'sequence': '[CLS] confirm plus confirm[SEP]',
'score': 0.010571340098977089,
'token': 10265,
'token_str': '▁confirm'},
{'sequence': '[CLS] confirm plus egg[SEP]',
'score': 0.0033525123726576567,
'token': 6387,
'token_str': '▁egg'},
{'sequence': '[CLS] confirm plus bet[SEP]',
'score': 0.0008760977652855217,
'token': 5676,
'token_str': '▁bet'}]

```

Here is how to use this model to get the features of a given text in PyTorch:
```python
from transformers import AlbertTokenizer, AlbertModel
tokenizer = AlbertTokenizer.from_pretrained('zanelim/singbert-lite-sg')
model = AlbertModel.from_pretrained("zanelim/singbert-lite-sg")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
```

and in TensorFlow:
```python
from transformers import AlbertTokenizer, TFAlbertModel
tokenizer = AlbertTokenizer.from_pretrained("zanelim/singbert-lite-sg")
model = TFAlbertModel.from_pretrained("zanelim/singbert-lite-sg")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
```

#### Limitations and bias
This model was finetuned on colloquial Singlish and Manglish corpus, hence it is best applied on downstream tasks involving the main
constituent languages- english, mandarin, malay. Also, as the training data is mainly from forums, beware of existing inherent bias.

## Training data
Colloquial singlish and manglish (both are a mixture of English, Mandarin, Tamil, Malay, and other local dialects like Hokkien, Cantonese or Teochew)
corpus. The corpus is collected from subreddits- `r/singapore` and `r/malaysia`, and forums such as `hardwarezone`.

## Training procedure

Initialized with [albert base v2](https://github.com/google-research/albert#albert) vocab and checkpoints (pre-trained weights).

Pre-training was further finetuned on training data with the following hyperparameters
* train_batch_size: 4096
* max_seq_length: 128
* num_train_steps: 125000
* num_warmup_steps: 5000
* learning_rate: 0.00176
* hardware: TPU v3-8
2 changes: 1 addition & 1 deletion model_cards/zanelim/singbert/README.md
@@ -13,8 +13,8 @@ datasets:
- reddit singapore, malaysia
- hardwarezone
widget:
- text: "die [MASK] must try"
- text: "kopi c siew [MASK]"
- text: "die [MASK] must try"
---

# Model name
4 changes: 2 additions & 2 deletions notebooks/03-pipelines.ipynb
@@ -2358,7 +2358,7 @@
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/github/huggingface/transformers/blob/generation_pipeline_docs/notebooks/03-pipelines.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
"<a href=\"https://colab.research.google.com/github/huggingface/transformers/blob/master/notebooks/03-pipelines.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
@@ -3402,4 +3402,4 @@
]
}
]
}
}
2 changes: 2 additions & 0 deletions src/transformers/__init__.py
@@ -200,6 +200,7 @@
from .data.data_collator import (
DataCollator,
DataCollatorForLanguageModeling,
DataCollatorForNextSentencePrediction,
DataCollatorForPermutationLanguageModeling,
DataCollatorWithPadding,
default_data_collator,
@@ -211,6 +212,7 @@
SquadDataset,
SquadDataTrainingArguments,
TextDataset,
TextDatasetForNextSentencePrediction,
)
from .generation_utils import top_k_top_p_filtering
from .modeling_albert import (
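The two new exports (`TextDatasetForNextSentencePrediction` and `DataCollatorForNextSentencePrediction`) suggest a BERT-style MLM plus NSP pretraining setup roughly like the hedged sketch below. The input file format (one sentence per line, blank lines between documents), the `corpus.txt` path, and the argument choices are assumptions, not code from this PR:

```python
from transformers import (
    BertForPreTraining,
    BertTokenizer,
    DataCollatorForNextSentencePrediction,
    TextDatasetForNextSentencePrediction,
    Trainer,
    TrainingArguments,
)

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForPreTraining.from_pretrained("bert-base-uncased")

# Assumed input: a plain-text file with one sentence per line and blank lines
# separating documents (the convention of BERT's original pretraining data).
dataset = TextDatasetForNextSentencePrediction(
    tokenizer=tokenizer,
    file_path="corpus.txt",   # hypothetical path
    block_size=128,
)
collator = DataCollatorForNextSentencePrediction(tokenizer=tokenizer)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="nsp-pretraining", logging_steps=500),
    data_collator=collator,
    train_dataset=dataset,
)
trainer.train()
```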