
Do we expect BertSum to work on text tokenized using a tokenizer other than Core-NLP? #32

Closed
enzoampil opened this issue May 22, 2019 · 3 comments

Comments

@enzoampil

Is it safe to assume that BERTSUM will perform without issues on input text tokenized using a tokenizer other than Core-NLP as long as we implement sentence splitting?

@nlpyang
Owner

nlpyang commented May 22, 2019

Hi, if the sentence splitting is correct, it should work fine. The choice of tokenizer is not that important, since we always apply BERT's subword tokenizer after the tokenization.
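For readers landing here, a minimal sketch of what this answer implies, not code from the BertSum repo: it assumes NLTK's `sent_tokenize` as the non-CoreNLP sentence splitter and the Hugging Face `BertTokenizer` as the subword tokenizer (BertSum itself uses `pytorch_pretrained_bert`, but the idea is the same).

```python
# Sketch: any sentence splitter can replace CoreNLP, because BERT's subword
# tokenizer is applied to each sentence afterwards.
# Assumes: nltk (with the "punkt" data downloaded) and the transformers package.
from nltk.tokenize import sent_tokenize
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

document = (
    "BertSum extends BERT for extractive summarization. "
    "Each sentence is prefixed with its own [CLS] token."
)

# 1. Sentence splitting with a non-CoreNLP tool.
sentences = sent_tokenize(document)

# 2. BERT's subword tokenization is applied per sentence afterwards,
#    so the upstream word tokenizer matters little.
subtokenized = [tokenizer.tokenize(sent.lower()) for sent in sentences]
print(subtokenized)
```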

@enzoampil
Author

Thank you for confirming!

@Santosh-Gupta

Hello,

I was wondering if you could provide a sample of what the tokenized data should look like. I am not able to install the Stanford Tokenizer (Java) in Google Colab, so I cannot inspect the output files and reproduce that format with an alternative tokenizer.
