fix: remove line breaks that appear to be in middle of sentences #484

Merged
bodnarbm merged 2 commits into master from remove-extra-new-lines on Sep 21, 2022

Conversation

bodnarbm
Contributor

@bodnarbm bodnarbm commented Sep 16, 2022

Summary

In the source data, there are description records that seem to be copy and paste entries from a PDF or other program that created line breaks in the middle of sentences (likely the end of the line when displayed in the original program). Examples are for trainings 1663 and 9381.

This change removes line breaks if they are the only thing between two alphabetical characters. While there are some false matches, the overall result does look better.

Addresses NJWE-29.

Test plan

Start locally and then compare

  1. http://localhost:3000/training/9381 to https://training.njcareers.org/training/9381; and
  2. http://localhost:3000/training/1663 to https://training.njcareers.org/training/1663

Before
Screen Shot 2022-09-16 at 12 48 23 PM

After
Screen Shot 2022-09-16 at 12 48 28 PM

@bodnarbm bodnarbm self-assigned this Sep 16, 2022
Collaborator

@namanaman namanaman left a comment

Nice!

Contributor

@derekkinsman derekkinsman left a comment

Looks good to me.

My only other thought would be to swap any newline for a space unless it directly follows punctuation (.\n and :\n would suggest the end of a paragraph). But depending on the quality of data entry, you might have to account for "punctuation space newline" as well. And I'm not convinced that would improve the false positive rate. But overall, this does the job.
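
A minimal sketch of the alternative described in this comment, for illustration only. The function name `joinWrappedLines` is hypothetical, and this is not the code that was merged. It assumes a break is kept when the previous line ends in "." or ":" (optionally followed by one space) or when either neighbouring line is empty, and is otherwise replaced with a space.

```typescript
// Hypothetical helper, not taken from the PR: joins lines with a space
// unless the break looks like the end of a paragraph.
function joinWrappedLines(text: string): string {
  let out = "";
  let prev: string | null = null;
  for (const line of text.split("\n")) {
    if (prev === null) {
      // First line: nothing to join with yet.
      out = line;
    } else {
      // Keep the break after "." or ":" (with an optional trailing space),
      // and around empty lines so blank-line paragraph breaks survive.
      const keepBreak = prev === "" || line === "" || /[.:] ?$/.test(prev);
      out += (keepBreak ? "\n" : " ") + line;
    }
    prev = line;
  }
  return out;
}

const wrappedSample = "Learn to weld\nand cut metal.\nCourse topics: \nsafety\n\nNext paragraph";
console.log(JSON.stringify(joinWrappedLines(wrappedSample)));
```

In this sample only the break after "weld" is joined. A list item such as "safety" followed by another unpunctuated line would still be merged, which is the false positive risk raised in the comment.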

@bodnarbm
Contributor Author

@derekkinsman I'll take a quick look at what the set of characters immediately before the \n looks like to see if there is something more we can do. It won't help with the false positives from the current iteration, as they will continue to match regardless, but it might help with any false negatives. I'll look quickly today, and will merge this as is later if it looks like a negative matcher at the start won't help much. (We would definitely need to make sure that any repeated \n\n+ matches continue to be line breaks.)
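
The inspection described above could be done with a small tally like the one below. This is a sketch under assumptions: the function name `tallyCharsBeforeNewline` and the sample descriptions are hypothetical, and newlines that are part of a run of two or more are skipped because those are meant to stay as line breaks.

```typescript
// Hypothetical helper: counts which character sits immediately before each
// single newline across a set of description strings.
function tallyCharsBeforeNewline(descriptions: string[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const text of descriptions) {
    // Start at 1: a newline at index 0 has no preceding character.
    for (let i = 1; i < text.length; i++) {
      if (text.charAt(i) !== "\n") continue;
      // Skip newlines belonging to a run of two or more (paragraph breaks).
      const isPartOfRun = text.charAt(i - 1) === "\n" || text.charAt(i + 1) === "\n";
      if (isPartOfRun) continue;
      const before = text.charAt(i - 1);
      counts[before] = (counts[before] || 0) + 1;
    }
  }
  return counts;
}

// Made-up sample data, not taken from the real training records.
const sampleDescriptions = [
  "Learn to weld\nand cut metal.\nTopics:\nsafety\n\nNext",
  "a,\nb",
];
console.log(tallyCharsBeforeNewline(sampleDescriptions));
```

A tally dominated by letters would suggest most single newlines are mid-sentence wraps, while many "." or ":" entries would suggest intentional breaks worth excluding.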

@bodnarbm
Contributor Author

Tried a few other options, but anything more than the regex ([A-Za-z])\n([A-Za-z]) was creating too many false positives and removing newlines that should be there. Absent an ML model trained to remove the newlines, I think this regex is probably the best balance between removing unnecessary newlines and keeping those that should be there.
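
For illustration, the regex quoted above could be applied roughly as follows. This is a sketch, not the merged code: the names `MID_SENTENCE_BREAK` and `removeMidSentenceBreaks` are hypothetical, and replacing the break with a space is an assumption, since the PR text only says the line break is removed.

```typescript
// Pattern quoted in the comment above: a newline that is the only thing
// between two alphabetical characters.
const MID_SENTENCE_BREAK = /([A-Za-z])\n([A-Za-z])/g;

// Hypothetical helper. Assumption: the break is replaced with a single space.
function removeMidSentenceBreaks(description: string): string {
  return description.replace(MID_SENTENCE_BREAK, "$1 $2");
}

// A wrapped sentence is rejoined.
console.log(JSON.stringify(removeMidSentenceBreaks("Students learn to\nweld")));
// Breaks after punctuation and blank-line paragraph breaks are left alone.
console.log(JSON.stringify(removeMidSentenceBreaks("First paragraph.\n\nSecond one.\nThird")));
// False match: an unpunctuated list is flattened into one line.
console.log(JSON.stringify(removeMidSentenceBreaks("Topics covered\nSafety\nWelding")));
```

The last call shows the kind of false match mentioned in the summary: list items without trailing punctuation look the same to the pattern as a wrapped sentence.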

@bodnarbm bodnarbm merged commit 07ddf4e into master Sep 21, 2022
@bodnarbm bodnarbm deleted the remove-extra-new-lines branch September 21, 2022 18:10