Scripts and data for the Crúbadán web crawler: http://crubadan.org/
kscanne/crubadan
This repository contains scripts and data from the Crúbadán project: http://crubadan.org/

*** Normalization ***

In the "normalize" directory, you'll find the script that we apply to web-crawled texts in various languages to clean them up. In general, we perform only very "gentle" cleaning, to make the texts more useful for language modeling and similar applications.

As an example: in some Cyrillic-script languages, users commonly type a lookalike Latin-script character in place of the intended Cyrillic one, e.g. Latin "ö" (U+00F6) for Cyrillic "ӧ" (U+04E7). Our script converts U+00F6 to U+04E7 for languages where this is an issue (Komi, Udmurt, ...). In contrast, we don't attempt to restore missing diacritics or perform any other cleaning that isn't deterministic.

The rules are expressed as Perl substitutions and can be found in the file rules.txt. The script reads UTF-8 text (Normalization Form C) on standard input and writes the normalized text to standard output.

We welcome contributions from additional language communities. At present, the ruleset covers only a fraction of the 2000+ languages our crawler recognizes.
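
To make the lookalike example above concrete, here is a minimal sketch of a deterministic, per-language character replacement. It is not the project's script: the actual rules are Perl substitutions in rules.txt, and this illustration is written in Python. The function name `normalize`, the `LOOKALIKES` table, and the use of ISO 639-3 codes as keys are all assumptions made for the example. Only the U+00F6 to U+04E7 mapping for Komi and Udmurt is taken from the text above.

```python
import unicodedata

# Hypothetical rule table: language code -> {Latin lookalike: Cyrillic character}.
# Only the U+00F6 -> U+04E7 mapping for Komi and Udmurt comes from the README;
# the ISO 639-3 keys and the table layout are assumptions for illustration.
LOOKALIKES = {
    "kpv": {"\u00f6": "\u04e7"},  # Komi-Zyrian
    "udm": {"\u00f6": "\u04e7"},  # Udmurt
}


def normalize(text: str, lang: str) -> str:
    """Return `text` with the lookalike replacements for `lang` applied.

    The text is first put into NFC (an extra defensive step assumed here;
    the README states that input is expected to be NFC already), so a
    decomposed "o" + U+0308 is handled the same way as precomposed U+00F6.
    Languages without an entry in the table pass through unchanged.
    """
    text = unicodedata.normalize("NFC", text)
    table = str.maketrans(LOOKALIKES.get(lang, {}))
    return text.translate(table)
```

Calling `normalize(line, "kpv")` on each line read from standard input would mirror the stdin-to-stdout behaviour described above. Because every rule is a fixed one-to-one character mapping, the output is fully determined by the input and the language, which is the property that separates this kind of cleaning from guesswork such as restoring missing diacritics.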