Branch: master
HjalmarrSv Modernized
I wanted to properly parse the links on https://dumps.wikimedia.org/mirrors.html when the page is copied as text.
My proposed changes do the job.
Basically, I replaced the + at the end of line 5 with *(\/)?
The pipe symbol could lead to crashes, which is why I broke line 5 up into three lines. After reading various posts, I suggest not using the pipe (|).
Latest commit fa74706 Dec 18, 2019
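The effect of the change described in the commit message can be sketched roughly as follows. This is only an illustration in Python: the patterns `OLD` and `NEW_PATTERNS`, the helper `find_protected`, and the `example.org` URL are assumptions made up for this sketch, not the actual contents of basic-protected-patterns. It shows how a trailing `+` on the path group requires at least one path segment, whereas `*(\/)?` also accepts a bare host and an optional trailing slash, and how one alternation with `|` can be split into separate one-scheme patterns.

```python
import re

# Hypothetical "before" pattern: the final "+" demands at least one
# non-empty path segment, so "https://host/" is not matched at all.
OLD = r"https?://[\w\-.]+(/[\w\-.]+)+"

# Hypothetical "after" patterns: the final "+" becomes "*(\/)?", and the
# schemes are listed as separate patterns instead of one "(a|b|c)" group.
NEW_PATTERNS = [
    r"https?://[\w\-.]+(/[\w\-.]+)*(\/)?",
    r"ftp://[\w\-.]+(/[\w\-.]+)*(\/)?",
    r"rsync://[\w\-.]+(/[\w\-.]+)*(\/)?",
]

def find_protected(text, patterns):
    """Return every substring of text matched by any of the patterns."""
    found = []
    for pattern in patterns:
        for match in re.finditer(pattern, text):
            found.append(match.group(0))
    return found

sample = "see https://dumps.wikimedia.org/ and ftp://example.org/pub/wikimedia/"
print(find_protected(sample, [OLD]))
print(find_protected(sample, NEW_PATTERNS))
```

Keeping one scheme per pattern mirrors the suggestion to avoid the pipe: each line stays simple, and a failure can be traced to a single pattern.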
Name Latest commit message Commit time
mosestokenizer rename directory to work with python import Nov 9, 2018
basic-protected-patterns Modernized Dec 17, 2019
deescape-special-chars-PTB.perl Add option "-b" (unbuffer output) to tokenizer scripts Nov 9, 2018
deescape-special-chars.perl Add option "-b" (unbuffer output) to tokenizer scripts Nov 9, 2018
delete-long-words.perl
detokenizer.perl Korean words has spaces =) Jan 19, 2018
escape-special-chars.perl Add option "-b" (unbuffer output) to tokenizer scripts Nov 9, 2018
lowercase.perl Add option "-b" (unbuffer output) to tokenizer scripts Nov 9, 2018
normalize-punctuation.perl Single quotes should be escaped as single quotes. Nov 25, 2019
pre-tok-clean.perl Add license notices to scripts. May 29, 2015
pre-tokenizer.perl Add license notices to scripts. May 29, 2015
pre_tokenize_cleaning.py Add license notices to scripts. May 29, 2015
remove-non-printing-char.perl Add option "-b" (unbuffer output) to tokenizer scripts Nov 9, 2018
replace-unicode-punctuation.perl Update replace-unicode-punctuation.perl Oct 14, 2019
tokenizer.perl tokenizer.perl: split final dots unconditionally Nov 7, 2018
tokenizer_PTB.perl ga (mostly) behaves more like fr/it Sep 23, 2015