Support adding additional special tokens #26

Merged
merged 6 commits into from Nov 30, 2021

seopbo
Member

@seopbo seopbo commented Nov 29, 2021

  • Adds the ability to include additional_special_tokens when training a subword tokenizer.
  • Ensures they do not duplicate the special_tokens used by default.

Refs: #5

- Defines corpus_type as one of "docu_text", "docu_json", "sent_text",
  "sent_json".
  - Updates the load_corpora function accordingly.
  - Renames the loading script and class that correspond to "sent_text".
  - Updates the argument parser in serialize_corpora.py to match
    corpus_type.
  - Refactors train_tokenizer.py to match corpus_type.
  - Renames model_name to model_type.

Refs: #23
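
The four corpus_type values named in the commit message above could be exposed through an argument parser roughly like this. This is a minimal sketch, not the code in serialize_corpora.py or train_tokenizer.py: the flag spellings `--corpus_type` and `--model_type`, the choice to make `--corpus_type` required, and the helper names `build_parser` and `parse` are all assumptions made for illustration.

```python
import argparse
from typing import List, Optional

# The four corpus types listed in the commit message.
CORPUS_TYPES = ("docu_text", "docu_json", "sent_text", "sent_json")


def build_parser() -> argparse.ArgumentParser:
    """Build a hypothetical parser that only accepts the known corpus types."""
    parser = argparse.ArgumentParser(description="corpus_type sketch (illustrative only)")
    # Restricting `choices` makes argparse reject any other value up front.
    parser.add_argument("--corpus_type", choices=CORPUS_TYPES, required=True)
    # The commit renames model_name to model_type; valid values are not
    # stated in the source, so this sketch accepts any string.
    parser.add_argument("--model_type", type=str, default=None)
    return parser


def parse(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse the given argument list (or sys.argv when None)."""
    return build_parser().parse_args(argv)
```

Using `choices` keeps validation in one place, so a loader such as load_corpora would only ever see one of the four supported values.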
- Aligns the black style settings with huggingface transformers.
- Re-applies black and isort.

Refs: #23
- Adds the ability to include additional_special_tokens when training a
  subword tokenizer.
- Ensures they do not duplicate the special_tokens used by default.

Refs: #5
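
The deduplication described above (additional special tokens must not repeat the default ones) can be sketched as an order-preserving merge. This is an illustration under stated assumptions, not the code merged in train_tokenizer.py: the function name `merge_special_tokens` and the example token strings are hypothetical.

```python
from typing import Iterable, List, Optional


def merge_special_tokens(
    default_tokens: Iterable[str],
    additional_tokens: Optional[Iterable[str]] = None,
) -> List[str]:
    """Return default tokens followed by additional ones, without duplicates.

    Order is preserved so that the default tokens keep their positions,
    which matters if token ids are assigned in list order.
    """
    merged: List[str] = []
    seen = set()
    for token in list(default_tokens) + list(additional_tokens or []):
        if token not in seen:
            seen.add(token)
            merged.append(token)
    return merged


# Hypothetical token strings, chosen only for the example.
defaults = ["<pad>", "<unk>", "<s>", "</s>"]
extras = ["<unk>", "<sep>", "<mask>", "<sep>"]
special_tokens = merge_special_tokens(defaults, extras)
```

Here `special_tokens` keeps the four defaults first and appends only `<sep>` and `<mask>`; the merged list could then be passed to whichever trainer the project uses.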
@seopbo seopbo self-assigned this Nov 29, 2021
@seopbo seopbo added the enhancement New feature or request label Nov 29, 2021
Collaborator

@bzantium bzantium left a comment

LGTM :)

Collaborator

@monologg monologg left a comment

LGTM

train_tokenizer.py (outdated, resolved)
train_tokenizer.py (outdated, resolved)
seopbo and others added 2 commits November 30, 2021 18:12
Co-authored-by: Inje Ryu <36367357+iron-ij@users.noreply.github.com>
Co-authored-by: Inje Ryu <36367357+iron-ij@users.noreply.github.com>
@seopbo seopbo requested a review from iron-ij November 30, 2021 09:15
@seopbo seopbo merged commit a90d2f4 into main Nov 30, 2021
@seopbo seopbo deleted the feature/#5 branch November 30, 2021 10:23
@seopbo seopbo mentioned this pull request Nov 30, 2021

4 participants