fix: remove line breaks that appear to be in middle of sentences #484

bodnarbm · 2022-09-16T16:41:55Z

Summary

In the source data, there are description records that seem to be copy and paste entries from a PDF or other program that created line breaks in the middle of sentences (likely the end of the line when displayed in the original program). Examples are for trainings 1663 and 9381.

This change removes line breaks if they are the only thing between two alphabetical characters. While there are some false matches, the overall result does look better.

Addresses NJWE-29.

Test plan

Start locally and then compare

Before

After

namanaman

Nice!

derekkinsman

Looks good to me.

My only other thought would be to swap any newline for a space unless it directly follows punctuation (.\n and :\n would suggest the end of a paragraph). But depending on the quality of data entry, you might have to account for "punctuation space newline" as well. And I'm not convinced that would improve the false positive rate. But overall, this does the job.
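A rough sketch of that alternative, for comparison only. The function name is hypothetical, and the choices to treat only `.` and `:` as paragraph-ending punctuation and to leave blank-line runs alone are assumptions, not something this PR implements:

```typescript
// Hypothetical helper (not part of this PR): replace a lone newline with a
// space unless the last non-whitespace character before it is '.' or ':'.
function joinNewlinesUnlessAfterPunctuation(text: string): string {
  // (\S)      last non-whitespace character on the line
  // [ \t]*    optional trailing spaces ("punctuation space newline")
  // \n(?!\n)  a newline that is not followed by another newline
  return text.replace(/(\S)[ \t]*\n(?!\n)/g, (match: string, last: string) => {
    // Keep the break when the line ends with '.' or ':'.
    if (last === "." || last === ":") {
      return match;
    }
    return last + " ";
  });
}
```

Because a blank-line run never matches, paragraph breaks written as `\n\n` stay as they are. Lines that end a list item without punctuation would still be joined, so the false positive concern remains.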

bodnarbm · 2022-09-19T17:05:50Z

@derekkinsman I'll take a quick look at what the set of characters immediately before the \n looks like to see if there is something more we can do. It won't help with the false positives from the current iteration, as they will continue to match regardless, but it might help with any false negatives. I'll look quickly today, and will merge this as is later if it looks like a negative matcher at the start won't help much. (We would definitely need to make sure that any repeated \n\n+ matches continue to be line breaks.)
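One way to do that kind of survey, as a minimal sketch. The function name and the idea of passing the descriptions in as a plain string array are assumptions for illustration; how the source data is actually loaded is not shown here:

```typescript
// Hypothetical helper: count which characters appear immediately before a
// lone newline (a newline not followed by another newline).
function tallyCharsBeforeNewline(descriptions: string[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const text of descriptions) {
    // A fresh regex per description keeps lastIndex from leaking between strings.
    const pattern = /([^\n])\n(?!\n)/g;
    let match: RegExpExecArray | null;
    while ((match = pattern.exec(text)) !== null) {
      const ch = match[1];
      counts[ch] = (counts[ch] || 0) + 1;
    }
  }
  return counts;
}
```

Sorting the resulting counts would show how often a break follows a letter, a digit, punctuation, or trailing whitespace, which is the information needed to judge whether a negative matcher is worth adding.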

bodnarbm · 2022-09-21T17:14:47Z

Tried a few other options, but anything more than the regex ([A-Za-z])\n([A-Za-z]) was creating too many false positives and removing new lines that should be there. Absent an ML model trained to remove the new lines, I think this regex is probably the best balance between removing unnecessary new lines and keeping those that should be there.
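For reference, a minimal sketch of how that regex behaves. The pattern is the one quoted above; the function name and the replacement string are assumptions (this thread does not say whether the break is replaced with a space or dropped entirely, so a single space is used here):

```typescript
// Hypothetical wrapper around the regex quoted in this PR.
function removeMidSentenceLineBreaks(description: string): string {
  // Assumption: the line break is replaced with one space so the words on
  // either side do not run together.
  return description.replace(/([A-Za-z])\n([A-Za-z])/g, "$1 $2");
}

// A break in the middle of a sentence is removed.
const midSentence = removeMidSentenceLineBreaks("Learn to weld\nsteel safely.");

// Paragraph breaks are untouched: no letter sits directly before the \n\n.
const paragraphs = removeMidSentenceLineBreaks("First paragraph.\n\nSecond paragraph.");

// False match: an unpunctuated list is flattened into one line.
const flattenedList = removeMidSentenceLineBreaks("Topics\nSafety\nTools");

// False negative: a digit before the break is not matched, so the break stays.
const digitBreak = removeMidSentenceLineBreaks("over 10\nhours of practice");
```

The last two cases illustrate the trade-off: widening the character classes would catch more mid-sentence breaks but would also join more lines that were meant to stay separate.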

bodnarbm requested review from derekkinsman and namanaman September 16, 2022 16:41

bodnarbm self-assigned this Sep 16, 2022

namanaman approved these changes Sep 19, 2022

derekkinsman approved these changes Sep 19, 2022

Merge branch 'master' into remove-extra-new-lines

7c432a7

bodnarbm merged commit 07ddf4e into master Sep 21, 2022

bodnarbm deleted the remove-extra-new-lines branch September 21, 2022 18:10
