add custom datasets tutorial #6466

Merged
merged 5 commits into from
Aug 17, 2020

joeddav
Contributor

@joeddav joeddav commented Aug 13, 2020

A tutorial showing examples for working with custom datasets on several tasks. Goals:

  1. Keep it general. The point is to show people how to use their own datasets, so don't use any processors or utilities that are dataset-specific.
  2. Show several tasks with different data formats. I include sequence classification with IMDb, token classification with W-NUT NER, and question answering with SQuAD 2.0. I also link to the blog post on how to train a language model.
  3. Prepare the data in a way that works with Trainer, TFTrainer, native PyTorch, and native TensorFlow with Keras's fit method.
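
To illustrate goal 3, here is a minimal, framework-free sketch of the kind of data shape involved. It is not the tutorial's actual code: `EncodedDataset` is a hypothetical name, and the token IDs are made up rather than produced by a real tokenizer. The assumption is that tokenizer output is a dict of per-example lists, and that a thin map-style wrapper returning one dict per example (features plus a `labels` entry) is roughly what a training loop consumes. In real code the values would typically be tensors.

```python
from typing import Dict, List


class EncodedDataset:
    """Hypothetical map-style dataset: one dict of features (plus label) per example."""

    def __init__(self, encodings: Dict[str, List[List[int]]], labels: List[int]):
        # Every feature field must have exactly one row per label.
        lengths = {len(rows) for rows in encodings.values()}
        if lengths != {len(labels)}:
            raise ValueError("every encoding field needs one row per label")
        self.encodings = encodings
        self.labels = labels

    def __len__(self) -> int:
        return len(self.labels)

    def __getitem__(self, idx: int) -> Dict[str, object]:
        # Pick row `idx` from each feature field and attach the matching label.
        item = {key: rows[idx] for key, rows in self.encodings.items()}
        item["labels"] = self.labels[idx]
        return item


# Made-up token IDs standing in for real tokenizer output (padded to length 4).
encodings = {
    "input_ids": [[101, 2023, 102, 0], [101, 2919, 3185, 102]],
    "attention_mask": [[1, 1, 1, 0], [1, 1, 1, 1]],
}
labels = [1, 0]

dataset = EncodedDataset(encodings, labels)
print(len(dataset))
print(dataset[1])
```

Keeping the wrapper this thin means the same encodings and labels can be reused across the different training paths, with only the final conversion to framework-specific tensors changing.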

codecov bot commented Aug 13, 2020

Codecov Report

Merging #6466 into master will decrease coverage by 1.34%.
The diff coverage is 83.53%.

@@            Coverage Diff             @@
##           master    #6466      +/-   ##
==========================================
- Coverage   79.77%   78.42%   -1.35%     
==========================================
  Files         148      153       +5     
  Lines       27214    28001     +787     
==========================================
+ Hits        21710    21960     +250     
- Misses       5504     6041     +537     
Impacted Files                                  Coverage Δ
src/transformers/configuration_reformer.py      100.00% <ø> (ø)
src/transformers/data/test_generation_utils.py  0.00% <0.00%> (ø)
src/transformers/modeling_marian.py             90.00% <ø> (-0.91%) ⬇️
src/transformers/modeling_utils.py              87.15% <ø> (-0.20%) ⬇️
src/transformers/trainer_tf.py                  12.25% <0.00%> (-0.13%) ⬇️
src/transformers/pipelines.py                   25.63% <4.00%> (-54.16%) ⬇️
src/transformers/optimization.py                25.55% <7.14%> (-70.00%) ⬇️
src/transformers/testing_utils.py               51.92% <28.57%> (-20.81%) ⬇️
src/transformers/trainer.py                     37.84% <37.50%> (-0.18%) ⬇️
src/transformers/modeling_tf_bert.py            98.38% <50.00%> (+1.79%) ⬆️
... and 72 more

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7bc0056...31ea640.

@joeddav joeddav requested a review from sgugger August 13, 2020 17:39
Collaborator

@sgugger sgugger left a comment

This is a great addition, nice work! I've left a few nits/suggestions.

docs/source/custom_datasets.rst (10 review threads, all outdated and resolved)
Member

@LysandreJik LysandreJik left a comment

Very cool! Great that you did both TensorFlow and PyTorch. Love that you showcased how to use keras' fit method as well.

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
@LysandreJik
Member

Don't mind the failing test, it's been fixed on master.

@joeddav joeddav changed the title from "[WIP] add custom datasets tutorial" to "add custom datasets tutorial" Aug 17, 2020
@joeddav joeddav merged commit d0c2389 into huggingface:master Aug 17, 2020
@joeddav joeddav deleted the custom-datasets branch August 28, 2020 15:08
Zigur pushed a commit to Zigur/transformers that referenced this pull request Oct 26, 2020
* add custom datasets tutorial

* python -> bash code blocks

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* minor review feedback changes

* add working native QA snippet

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
fabiocapsouza pushed a commit to fabiocapsouza/transformers that referenced this pull request Nov 15, 2020
fabiocapsouza added a commit to fabiocapsouza/transformers that referenced this pull request Nov 15, 2020
3 participants