Add SBD tests from Pragmatic Segmenter #24

Merged
merged 55 commits into from
Jun 8, 2019
a650313
[WIP] abbreviation replacer
nipunsadvilkar May 11, 2019
6e840e8
BugFix: Amend false variable return
nipunsadvilkar May 11, 2019
7ec3e4a
[WIP] multiple period abbreviation replacer
nipunsadvilkar May 11, 2019
b73c95a
21 tests for english passing
nipunsadvilkar May 11, 2019
b7735cb
[BugFix] "\\1." instead of r"\\1." in replace_abbreviation_as_sentenc…
nipunsadvilkar May 11, 2019
daf0355
[BugFix] Fixed post_process_segments & Add WithMultiplePeriodsAndEmai…
nipunsadvilkar May 11, 2019
fcb8c89
33 tests for english passing
nipunsadvilkar May 11, 2019
015fbe0
45 tests for english passing
nipunsadvilkar May 11, 2019
373609c
[BugFix] Quotation text list flattening after post_process_segments
nipunsadvilkar May 11, 2019
3ed8fb2
Add GeoLocationRule rule
nipunsadvilkar May 12, 2019
1271de5
Add FileFormatRule rule
nipunsadvilkar May 12, 2019
62f4807
Init clean rule: Add Newline rule
nipunsadvilkar May 12, 2019
beb3382
Setup segmenter class" -m "Check desired output with clean=True
nipunsadvilkar May 12, 2019
6fc865f
Switch from Processor to Segmenter class
nipunsadvilkar May 12, 2019
5761593
Add new clean=True test cases
nipunsadvilkar May 12, 2019
f623535
Add cleaner funcs: [WIP] search_for_connected_sentences func
nipunsadvilkar May 13, 2019
91673df
Complete Cleaner funcs: Fixed search_for_connected_sentences func
nipunsadvilkar May 13, 2019
f032d5c
Add doc_type "pdf" test
nipunsadvilkar May 13, 2019
95864df
55 tests for english passing: Normal, PDF, CLEAN
nipunsadvilkar May 16, 2019
5f25991
Add pytest skip for holy grail sentence
nipunsadvilkar May 19, 2019
80d1a67
Add abbreviation replacer, clean rules python files
nipunsadvilkar May 19, 2019
28d80e6
[WIP] bugfix numbered list
nipunsadvilkar May 19, 2019
7e028f9
Fixed numbered list. 56 english test passing
nipunsadvilkar May 19, 2019
54e09e0
63 english test passing
nipunsadvilkar May 19, 2019
990310d
68 english test passing
nipunsadvilkar May 19, 2019
6658104
74 english test passing
nipunsadvilkar May 19, 2019
0c9a0e8
80 english test passing
nipunsadvilkar May 19, 2019
fbecd7e
95 english test passing
nipunsadvilkar May 19, 2019
b30cc72
100 english test passing
nipunsadvilkar May 19, 2019
20ee563
Seperate test for golden rules and clean param rules
nipunsadvilkar May 19, 2019
c92749e
[BugFix] post_process_segments return txt -> pass
nipunsadvilkar May 20, 2019
3807001
[Major BugFix] re.sub with args ✗ -> re.IGNORECASE | *kwarg* ✔ -> fla…
nipunsadvilkar May 20, 2019
7d341aa
116 english test passing
nipunsadvilkar May 20, 2019
ecc079b
124 english test passing
nipunsadvilkar May 30, 2019
e93234b
130 english test passing
nipunsadvilkar May 30, 2019
6f41b5f
140 english tests passing
nipunsadvilkar Jun 6, 2019
498332a
150 english tests passing
nipunsadvilkar Jun 6, 2019
2f254f3
153 english tests passing
nipunsadvilkar Jun 6, 2019
d521798
154 english tests passing
nipunsadvilkar Jun 6, 2019
3215827
[BugFix] substitute_found_list_items on matched items only
nipunsadvilkar Jun 6, 2019
dfce6a4
154 english tests passing
nipunsadvilkar Jun 6, 2019
5f7db13
Add I.V keyword in replace_abbreviation_as_sentence_boundary
nipunsadvilkar Jun 7, 2019
8f282fa
Typo fix in SubSymbolsRules
nipunsadvilkar Jun 7, 2019
83df013
Regex escape chars in cleaner for ?
nipunsadvilkar Jun 7, 2019
2eae0b0
Fix regex of QUOTATION_AT_END_OF_SENTENCE_REGEX
nipunsadvilkar Jun 7, 2019
f81e9f4
166 english tests passing
nipunsadvilkar Jun 7, 2019
8bb7130
BugFix replacing `!` -> &ᓴ& and `?` -> &ᓷ&
nipunsadvilkar Jun 7, 2019
d4fe911
Fix regex of SingleUpperCaseLetterRule
nipunsadvilkar Jun 7, 2019
93aedec
Fix NUMBERED_REFERENCE_REGEX
nipunsadvilkar Jun 7, 2019
dff14b2
173 english tests passing
nipunsadvilkar Jun 7, 2019
37ade3b
174 english tests passing
nipunsadvilkar Jun 7, 2019
05bfca9
179 english tests passing
nipunsadvilkar Jun 8, 2019
eeec0ae
Added single big text english
nipunsadvilkar Jun 8, 2019
a9f7a6f
Update docstrings for tests
nipunsadvilkar Jun 8, 2019
963bf26
Remove print statements and few comments
nipunsadvilkar Jun 8, 2019
1 change: 1 addition & 0 deletions pySBD/__init__.py
@@ -0,0 +1 @@
from .segmenter import Segmenter
95 changes: 95 additions & 0 deletions pySBD/abbreviation_replacer.py
@@ -0,0 +1,95 @@
# -*- coding: utf-8 -*-
import re
from pySBD.rules import Text
# TODO: SENTENCE_STARTERS should be lang specific
from pySBD.lang.standard import Abbreviation, SENTENCE_STARTERS
from pySBD.lang.common.numbers import (Common, SingleLetterAbbreviationRules,
                                       AmPmRules)


def replace_pre_number_abbr(txt, abbr):
    # Protect the period when the abbreviation precedes a number or "(".
    txt = re.sub(r'(?<=\s{abbr})\.(?=\s\d)|(?<=^{abbr})\.(?=\s\d)'.format(abbr=abbr.strip()), "∯", txt)
    txt = re.sub(r'(?<=\s{abbr})\.(?=\s+\()|(?<=^{abbr})\.(?=\s+\()'.format(abbr=abbr.strip()), "∯", txt)
    return txt


def replace_prepositive_abbr(txt, abbr):
    # Prepositive abbreviations (titles and the like) never end a sentence.
    txt = re.sub(r'(?<=\s{abbr})\.(?=\s)|(?<=^{abbr})\.(?=\s)'.format(abbr=abbr.strip()), "∯", txt)
    txt = re.sub(r'(?<=\s{abbr})\.(?=:\d+)|(?<=^{abbr})\.(?=:\d+)'.format(abbr=abbr.strip()), "∯", txt)
    return txt


def replace_period_of_abbr(txt, abbr):
    # Protect the period when the following text cannot start a sentence.
    txt = re.sub(r"(?<=\s{abbr})\.(?=((\.|\:|-|\?)|(\s([a-z]|I\s|I'm|I'll|\d|\())))|(?<=^{abbr})\.(?=((\.|\:|\?)|(\s([a-z]|I\s|I'm|I'll|\d))))".format(abbr=abbr.strip()), '∯', txt)
    txt = re.sub(r"(?<=\s{abbr})\.(?=,)|(?<=^{abbr})\.(?=,)".format(abbr=abbr.strip()), '∯', txt)
    return txt


def replace_abbreviation_as_sentence_boundary(txt):
    # Restore the real period when a sentence starter follows the abbreviation.
    for word in SENTENCE_STARTERS:
        escaped = re.escape(word)
        regex = r"(U∯S|U\.S|U∯K|E∯U|E\.U|U∯S∯A|U\.S\.A|I|i.v|I.V)∯(?=\s{}\s)".format(escaped)
        txt = re.sub(regex, '\\1.', txt)
    return txt


class AbbreviationReplacer(object):

    def __init__(self, text, language='en'):
        self.text = text
        self.language = language

    def replace(self):
        self.text = Text(self.text).apply(Common.PossessiveAbbreviationRule,
                                          Common.KommanditgesellschaftRule,
                                          *SingleLetterAbbreviationRules.All)
        self.text = self.search_for_abbreviations_in_string()
        self.replace_multi_period_abbreviations()
        self.text = Text(self.text).apply(*AmPmRules.All)
        self.text = replace_abbreviation_as_sentence_boundary(self.text)
        return self.text

    def replace_multi_period_abbreviations(self):
        mpa = re.findall(Common.MULTI_PERIOD_ABBREVIATION_REGEX, self.text, flags=re.IGNORECASE)
        if not mpa:
            return self.text
        for each in mpa:
            replacement = each.replace('.', '∯')
            # Escape the match so its periods are matched literally
            # rather than as regex wildcards.
            self.text = re.sub(re.escape(each), replacement, self.text)

    def search_for_abbreviations_in_string(self):
        original = self.text
        lowered = original.lower()
        for abbr in Abbreviation.ABBREVIATIONS:
            stripped = abbr.strip()
            if stripped not in lowered:
                continue
            abbrev_match = re.findall(
                r'(?:^|\s|\r|\n){}'.format(stripped), original,
                flags=re.IGNORECASE)
            if not abbrev_match:
                continue
            # NOTE: the braces below are literal regex characters, so this
            # pattern only matches text of the form "{abbr} x".
            next_word_start = r"(?<={" + str(re.escape(stripped)) + "} ).{1}"
            char_array = re.findall(next_word_start, self.text)
            for ind, match in enumerate(abbrev_match):
                self.text = self.scan_for_replacements(self.text, match, ind, char_array)
        return self.text

    def scan_for_replacements(self, txt, am, ind, char_array):
        char = char_array[ind] if char_array else ''
        prepositive = Abbreviation.PREPOSITIVE_ABBREVIATIONS
        number_abbr = Abbreviation.NUMBER_ABBREVIATIONS
        upper = str(char).isupper()
        if (not upper or am.strip().lower() in prepositive):
            if am.strip().lower() in prepositive:
                txt = replace_prepositive_abbr(txt, am)
            elif am.strip().lower() in number_abbr:
                txt = replace_pre_number_abbr(txt, am)
            else:
                txt = replace_period_of_abbr(txt, am)
        return txt


if __name__ == "__main__":
    s = "Here’s the - ahem - official citation: Baker, C., Anderson, Kenneth, Martin, James, & Palen, Leysia."
    print(AbbreviationReplacer(s).replace())
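The helpers in this file all rely on one idea: a period that belongs to an abbreviation is swapped for the placeholder `∯`, so that a later splitting pass does not treat it as a sentence boundary. The sketch below isolates that idea using only the standard library. The function names `protect_abbreviation` and `protect_multi_period` are hypothetical and not part of the PR, and escaping the abbreviation with `re.escape` is an assumption of this sketch rather than something the PR's `replace_*_abbr` helpers do.

```python
import re

PLACEHOLDER = "∯"  # same stand-in character the module above uses


def protect_abbreviation(text, abbr):
    """Swap the period after `abbr` for PLACEHOLDER when a lowercase letter
    or a digit follows, so a later splitter will not break there."""
    escaped = re.escape(abbr.strip())
    # Two fixed-width lookbehinds: abbreviation after whitespace, or at the
    # very start of the text.
    pattern = (r"(?<=\s{a})\.(?=\s[a-z\d])"
               r"|(?<=^{a})\.(?=\s[a-z\d])").format(a=escaped)
    return re.sub(pattern, PLACEHOLDER, text)


def protect_multi_period(text, abbr):
    """Replace every period inside a multi-period abbreviation, matching
    the abbreviation literally instead of as a regex."""
    return re.sub(re.escape(abbr), abbr.replace(".", PLACEHOLDER), text)


print(protect_abbreviation("We met dr. smith today.", "dr"))
print(protect_multi_period("U.S.A. and UxSxA.", "U.S.A."))
```

Without `re.escape`, the pattern `U.S.A.` would also match `UxSxA.`, because an unescaped `.` matches any character.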
Empty file added pySBD/clean/__init__.py
Empty file.
80 changes: 80 additions & 0 deletions pySBD/clean/rules.py
@@ -0,0 +1,80 @@
# -*- coding: utf-8 -*-
from pySBD.rules import Rule


class CleanRules(object):

    # NOTE: Special characters may need an extra \\ in a plain string.
    # If the regex is defined with r'' then don't add the extra \\.

    # Rubular: http://rubular.com/r/V57WnM9Zut
    NewLineInMiddleOfWordRule = Rule(r'\n(?=[a-zA-Z]{1,2}\n)', '')

    # Rubular: http://rubular.com/r/dMxp5MixFS
    DoubleNewLineWithSpaceRule = Rule(r'\n \n', "\r")

    # Rubular: http://rubular.com/r/H6HOJeA8bq
    DoubleNewLineRule = Rule(r'\n\n', "\r")

    # Rubular: http://rubular.com/r/FseyMiiYFT
    NewLineFollowedByPeriodRule = Rule(r'\n(?=\.(\s|\n))', '')

    ReplaceNewlineWithCarriageReturnRule = Rule(r'\n', "\r")

    EscapedNewLineRule = Rule(r'\\n', "\n")

    EscapedCarriageReturnRule = Rule(r'\\r', "\r")

    TypoEscapedNewLineRule = Rule(r'\\\ n', "\n")

    TypoEscapedCarriageReturnRule = Rule(r'\\\ r', "\r")

    # Rubular: http://rubular.com/r/bAJrhyLNeZ
    InlineFormattingRule = Rule(r'{b\^&gt;\d*&lt;b\^}|{b\^>\d*<b\^}', '')

    # Rubular: http://rubular.com/r/8mc1ArOIGy
    TableOfContentsRule = Rule(r'\.{4,}\s*\d+-*\d*', "\r")

    # Rubular: http://rubular.com/r/DwNSuZrNtk
    ConsecutivePeriodsRule = Rule(r'\.{5,}', ' ')

    # Rubular: http://rubular.com/r/IQ4TPfsbd8
    ConsecutiveForwardSlashRule = Rule(r'\/{3}', '')

    # Rubular: http://rubular.com/r/6dt98uI76u
    NO_SPACE_BETWEEN_SENTENCES_REGEX = r'(?<=[a-z])\.(?=[A-Z])'
    # NO_SPACE_BETWEEN_SENTENCES_REGEX = r'[a-z]\.[A-Z]'
    NoSpaceBetweenSentencesRule = Rule(NO_SPACE_BETWEEN_SENTENCES_REGEX, '. ')

    # Rubular: http://rubular.com/r/l6KN6rH5XE
    NO_SPACE_BETWEEN_SENTENCES_DIGIT_REGEX = r'(?<=\d)\.(?=[A-Z])'
    NoSpaceBetweenSentencesDigitRule = Rule(NO_SPACE_BETWEEN_SENTENCES_DIGIT_REGEX, '. ')

    URL_EMAIL_KEYWORDS = ['@', 'http', '.com', 'net', 'www', '//']

    # Rubular: http://rubular.com/r/3GiRiP2IbD
    NEWLINE_IN_MIDDLE_OF_SENTENCE_REGEX = r'(?<=\s)\n(?=([a-z]|\())'

    # Rubular: http://rubular.com/r/Gn18aAnLdZ
    # The lookahead matches the bullet alone; a stray quote character
    # inside the pattern would stop it from matching "\n•".
    NewLineFollowedByBulletRule = Rule(r"\n(?=•)", "\r")

    QuotationsFirstRule = Rule(r"''", '"')
    QuotationsSecondRule = Rule(r'``', '"')


class HTML(object):
    # Rubular: http://rubular.com/r/9d0OVOEJWj
    HTMLTagRule = Rule(r"<\/?\w+((\s+\w+(\s*=\s*(?:\".*?\"|'.*?'|[\^'\">\s]+))?)+\s*|\s*)\/?>", '')

    # Rubular: http://rubular.com/r/XZVqMPJhea
    EscapedHTMLTagRule = Rule(r'&lt;\/?[^gt;]*gt;', '')

    All = [HTMLTagRule, EscapedHTMLTagRule]


class PDF(object):
    # Rubular: http://rubular.com/r/UZAVcwqck8
    NewLineInMiddleOfSentenceRule = Rule(r'(?<=[^\n]\s)\n(?=\S)', '')

    # Rubular: http://rubular.com/r/eaNwGavmdo
    NewLineInMiddleOfSentenceNoSpacesRule = Rule(r"\n(?=[a-z])", ' ')
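Every rule above pairs a regex with its replacement, and `Text(...).apply(...)` runs a sequence of them in order. The diff does not show `pySBD.rules`, so the `Rule` and `Text` classes below are hypothetical stand-ins written only to illustrate that pattern. The real implementations may differ.

```python
import re


class Rule(object):
    """Hypothetical stand-in for pySBD.rules.Rule: a pattern and its replacement."""

    def __init__(self, pattern, replacement):
        self.pattern = pattern
        self.replacement = replacement


class Text(str):
    """Hypothetical stand-in for pySBD.rules.Text."""

    def apply(self, *rules):
        # Apply each rule in the order given; later rules see earlier output.
        result = str(self)
        for rule in rules:
            result = re.sub(rule.pattern, rule.replacement, result)
        return result


# Two patterns copied from CleanRules above.
double_newline = Rule(r'\n\n', "\r")
no_space_between_sentences = Rule(r'(?<=[a-z])\.(?=[A-Z])', '. ')

cleaned = Text("Hello world.Today is Tuesday.\n\nNext").apply(
    double_newline, no_space_between_sentences)
print(repr(cleaned))
```

Because the rules run sequentially, their order matters. This is likely why `DoubleNewLineWithSpaceRule` and `DoubleNewLineRule` are applied before `ReplaceNewlineWithCarriageReturnRule` in the cleaner.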
126 changes: 107 additions & 19 deletions pySBD/cleaner.py
@@ -1,6 +1,8 @@
 # -*- coding: utf-8 -*-
-# from pySBD.languages import Language
-# from pySBD.languages import Language
+import re
+from pySBD.rules import Text
+from pySBD.clean.rules import PDF, HTML, CleanRules as cr
+from pySBD.lang.standard import Abbreviation


 class Cleaner(object):
@@ -12,23 +14,109 @@ def __init__(self, text, language='common', doc_type=None):
         self.doc_type = doc_type

     def clean(self):
-        raise NotImplementedError
-        # clean_rule_1 = remove_all_newlines(self.text)
-        # clean_rule_2 = replace_double_newlines(clean_rule_1)
-        # clean_rule_3 = replace_newlines(clean_rule_2)
-        # clean_rule_4 = replace_escaped_newlines(clean_rule_3)
-        # clean_rule_5 = remove_or_escape_html_tags(clean_rule_4)
-        # clean_rule_6 = replace_punctuation_in_brackets(clean_rule_5)
-        # clean_rule_7 = inlineformattingrule(clean_rule_6)
-        # clean_rule_8 = clean_quotations(clean_rule_7)
-        # clean_rule_9 = clean_table_of_contents(clean_rule_8)
-        # clean_rule_10 = check_for_no_space_in_between_sentences(clean_rule_9)
-        # clean_rule_11 = clean_consecutive_characters(clean_rule_10)
+        if not self.text:
+            return self.text
+        self.remove_all_newlines()
+        self.replace_double_newlines()
+        self.replace_newlines()
+        self.replace_escaped_newlines()
+        self.text = Text(self.text).apply(*HTML.All)
+        self.replace_punctuation_in_brackets()
+        self.text = Text(self.text).apply(cr.InlineFormattingRule)
+        self.clean_quotations()
+        self.clean_table_of_contents()
+        self.check_for_no_space_in_between_sentences()
+        self.clean_consecutive_characters()
+        return self.text

     def remove_all_newlines(self):
-        raise NotImplementedError
-        # rm_all_newline_1 = remove_newline_in_middle_of_sentence(self)
-        # rm_newline_in_middle = remove_newline_in_middle_of_word(self)
-        # return rm_newline_in_middle
+        self.remove_newline_in_middle_of_sentence()
+        self.remove_newline_in_middle_of_word()

-    # def remove_newline_in_middle_of_sentence(self):
+    def remove_newline_in_middle_of_sentence(self):
+        def replace_w_blank(match):
+            match = match.group()
+            sub = re.sub(cr.NEWLINE_IN_MIDDLE_OF_SENTENCE_REGEX, '', match)
+            return sub
+        self.text = re.sub(r'(?:[^\.])*', replace_w_blank, self.text)
+
+    def remove_newline_in_middle_of_word(self):
+        self.text = Text(self.text).apply(cr.NewLineInMiddleOfWordRule)
+
+    def replace_double_newlines(self):
+        self.text = Text(self.text).apply(cr.DoubleNewLineWithSpaceRule,
+                                          cr.DoubleNewLineRule)
+
+    def remove_pdf_line_breaks(self):
+        self.text = Text(
+            self.text).apply(cr.NewLineFollowedByBulletRule,
+                             PDF.NewLineInMiddleOfSentenceRule,
+                             PDF.NewLineInMiddleOfSentenceNoSpacesRule)
+
+    def replace_newlines(self):
+        if self.doc_type == 'pdf':
+            self.remove_pdf_line_breaks()
+        else:
+            self.text = Text(
+                self.text).apply(cr.NewLineFollowedByPeriodRule,
+                                 cr.ReplaceNewlineWithCarriageReturnRule)
+
+    def replace_escaped_newlines(self):
+        self.text = Text(
+            self.text).apply(cr.EscapedNewLineRule,
+                             cr.EscapedCarriageReturnRule,
+                             cr.TypoEscapedNewLineRule,
+                             cr.TypoEscapedCarriageReturnRule)
+
+    def replace_punctuation_in_brackets(self):
+        def replace_punct(match):
+            match = match.group()
+            if '?' in match:
+                sub = re.sub(re.escape('?'), '&ᓷ&', match)
+                return sub
+            return match
+        self.text = re.sub(r'\[(?:[^\]])*\]', replace_punct, self.text)
+
+    def clean_quotations(self):
+        # This method was added explicitly;
+        # pragmatic-segmenter applies it
+        # at a different location.
+        self.text = re.sub('`', "'", self.text)
+        self.text = Text(self.text).apply(
+            cr.QuotationsFirstRule,
+            cr.QuotationsSecondRule)
+
+    def clean_table_of_contents(self):
+        self.text = Text(self.text).apply(
+            cr.TableOfContentsRule,
+            cr.ConsecutivePeriodsRule,
+            cr.ConsecutiveForwardSlashRule)
+
+    def search_for_connected_sentences(self, word, txt, regex, rule):
+        if not re.search(regex, word):
+            return txt
+        if any(k in word for k in cr.URL_EMAIL_KEYWORDS):
+            return txt
+        if any(a in word for a in Abbreviation.ABBREVIATIONS):
+            return txt
+        new_word = Text(word).apply(rule)
+        # Escape the word so its period is matched literally, not as a wildcard.
+        txt = re.sub(re.escape(word), new_word, txt)
+        return txt
+
+    def check_for_no_space_in_between_sentences(self):
+        words = self.text.split(' ')
+        for word in words:
+            self.text = self.search_for_connected_sentences(word, self.text, cr.NO_SPACE_BETWEEN_SENTENCES_REGEX, cr.NoSpaceBetweenSentencesRule)
+            self.text = self.search_for_connected_sentences(word, self.text, cr.NO_SPACE_BETWEEN_SENTENCES_DIGIT_REGEX, cr.NoSpaceBetweenSentencesDigitRule)
+
+    def clean_consecutive_characters(self):
+        self.text = Text(self.text).apply(
+            cr.ConsecutivePeriodsRule,
+            cr.ConsecutiveForwardSlashRule)
+
+
+if __name__ == "__main__":
+    # text = "Hello world.Today is Tuesday.Mr. Smith went to the store and bought 1,000.That is a lot."
+    text = "• 9. Stop smoking \n• 10. Get some rest \n \nYou have the best chance of having a problem-free pregnancy and a healthy baby if you follow \na few simple guidelines: \n\n1. Organise your pregnancy care early"
+    c = Cleaner(text)
+    print(c.clean())
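`replace_punctuation_in_brackets` passes a function as the replacement argument to `re.sub`, so only the text inside square brackets is rewritten. A minimal standalone sketch of that callback technique follows. The function name `protect_question_marks_in_brackets` is hypothetical, and `&ᓷ&` is the placeholder the PR uses for `?`.

```python
import re

QUESTION_PLACEHOLDER = "&ᓷ&"  # placeholder the PR uses for "?"


def protect_question_marks_in_brackets(text):
    """Hide "?" inside [...] so it is not mistaken for a sentence end."""

    def replace_punct(match):
        # match.group() is the whole bracketed span, brackets included.
        return match.group().replace("?", QUESTION_PLACEHOLDER)

    # re.sub calls replace_punct once per bracketed span and inserts the
    # string it returns; text outside the brackets is left alone.
    return re.sub(r"\[[^\]]*\]", replace_punct, text)


print(protect_question_marks_in_brackets("Is it [really?] done?"))
```

A callback keeps the scoping in one pass: the outer pattern selects the region and the function decides what changes within it. `remove_newline_in_middle_of_sentence` uses the same structure.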