
Improved Tokenize documentation + added TokenizerI as superclass for TweetTokenizer #2878

Merged: 7 commits into nltk:develop on Nov 21, 2021

Conversation


@tomaarsen (Member) commented Nov 7, 2021

Hello!

Pull request overview

  • Improved tokenize documentation:
    • Added docstrings for span_tokenize and tokenize for NLTKWordTokenizer and TreebankWordTokenizer.
    • Added Python 3.5+ typing to several methods.
  • Made TokenizerI the superclass for TweetTokenizer, which adds the tokenize_sents method for free.

The documentation changes speak for themselves, so I'll only briefly discuss the other changes.
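To give a feel for the documentation changes: the sketch below shows the general shape of an NLTK-style tokenizer's `tokenize` and `span_tokenize` methods with Python 3.5+ type annotations. The class and its naive whitespace logic are illustrative stand-ins only, not the PR's actual `NLTKWordTokenizer`/`TreebankWordTokenizer` code.

```python
from typing import Iterator, List, Tuple


class SketchTokenizer:
    """Illustrative only: the style of annotations added in this PR."""

    def tokenize(self, text: str) -> List[str]:
        # Naive whitespace split, standing in for real tokenizer logic.
        return text.split()

    def span_tokenize(self, text: str) -> Iterator[Tuple[int, int]]:
        # Yield (start, end) character offsets of each token in `text`.
        pos = 0
        for tok in self.tokenize(text):
            start = text.index(tok, pos)
            yield (start, start + len(tok))
            pos = start + len(tok)
```

For example, `list(SketchTokenizer().span_tokenize("a b"))` yields `[(0, 1), (2, 3)]`, pairing each token with its offsets in the original string.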

New superclass for TweetTokenizer

TweetTokenizer is (I believe) the only tokenizer that didn't yet subclass TokenizerI. Making it a subclass of TokenizerI gives us tokenize_sents automatically, and will make it easier to implement span_tokenize in the future.
I've added a test to show that this works correctly.
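A minimal sketch of why the subclass change gives `tokenize_sents` for free: `TokenizerI.tokenize_sents` is implemented in terms of the abstract `tokenize`, so any subclass that implements `tokenize` inherits it. The class bodies below are simplified stand-ins, not NLTK's actual implementations.

```python
from abc import ABC, abstractmethod
from typing import List


class TokenizerI(ABC):
    """Minimal stand-in for nltk.tokenize.api.TokenizerI."""

    @abstractmethod
    def tokenize(self, s: str) -> List[str]:
        ...

    def tokenize_sents(self, strings: List[str]) -> List[List[str]]:
        # Inherited for free by every subclass that implements tokenize().
        return [self.tokenize(s) for s in strings]


class TweetTokenizer(TokenizerI):
    """Toy tokenize(); the real one handles emoticons, @handles, etc."""

    def tokenize(self, s: str) -> List[str]:
        return s.split()


tweets = ["NLTK is great :)", "@user hi!"]
print(TweetTokenizer().tokenize_sents(tweets))
```

Because `tokenize_sents` lives on the base class, the subclass gains it without any extra code, which is exactly the benefit described above.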

Oh, also: I made a small mistake with git, so there's one unrelated commit that I didn't notice until just now. It doesn't affect the PR and shouldn't matter if we squash and merge anyway, but it can be removed from the squash commit message.

  • Tom Aarsen


@tomaarsen (Member, Author) commented Nov 20, 2021

Resolved merge conflicts with return_str deprecations.

@stevenbird stevenbird merged commit b30b6ac into nltk:develop Nov 21, 2021
16 checks passed

@stevenbird (Member) commented Nov 21, 2021

Thanks @tomaarsen!


@tomaarsen (Member, Author) commented Nov 21, 2021

Gladly!

@tomaarsen tomaarsen deleted the enhancement/tokenize_documentation branch Nov 21, 2021

2 participants