Data and sentence splitting fixes #8

Merged
merged 5 commits into grammarly:main on Feb 9, 2022

@osyvokon osyvokon commented Feb 4, 2022

This PR makes several changes:

  1. Represent newlines with the \n sequence
  2. Manually fix about a dozen annotated documents (newlines, lists, tables)
  3. Better sentence-splitting. From now on, source and target files are guaranteed to have the same number of lines. In particular, this fixes issue #7 (Issue with data point 730 in train split)
  4. Regenerate derivative data views (source only, target only, tokenized, sentence-split, etc.) from the original annotated files on every release. This is to ensure they are always in sync.
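
To illustrate items 1 and 3, here is a minimal sketch of how newlines could be escaped and how the line-count guarantee could be checked. It is not the repository's actual script: the function names `escape_newlines`, `unescape_newlines` and `assert_parallel` and the sample documents are hypothetical and made up for illustration.

```python
from typing import List


def escape_newlines(text: str) -> str:
    # Replace each real newline with the two characters backslash + "n",
    # so that a multi-line document fits on one physical line.
    return text.replace("\n", "\\n")


def unescape_newlines(line: str) -> str:
    # Inverse of escape_newlines. Assumes the original text did not
    # already contain a literal backslash-n sequence of its own.
    return line.replace("\\n", "\n")


def assert_parallel(source_lines: List[str], target_lines: List[str]) -> int:
    # Fail loudly if the source and target views disagree on line count.
    if len(source_lines) != len(target_lines):
        raise ValueError(
            f"line count mismatch: {len(source_lines)} source "
            f"vs {len(target_lines)} target"
        )
    return len(source_lines)


# Hypothetical sample data: two documents, the first spanning two lines.
docs_source = ["First line\nsecond line", "Single line"]
docs_target = ["First line,\nsecond line.", "Single line."]

source_view = [escape_newlines(d) for d in docs_source]
target_view = [escape_newlines(d) for d in docs_target]
n = assert_parallel(source_view, target_view)
```

Running a check like `assert_parallel` on every release, right after regenerating the derivative views, would catch the kind of misalignment reported in issue #7 before it ships.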

Mostly related to newlines, lists and broken annotations.
This fixes several issues:

1. About 10 annotated tokens have been fixed (newlines, lists, tables)
2. Better sentence-splitting. Now source and target files are guaranteed
   to have the same number of lines.

This commit closes issue grammarly#7
@pavlo-kuchmiichuk pavlo-kuchmiichuk merged commit aa76000 into grammarly:main Feb 9, 2022