GitHub - kscanne/crubadan: Scripts and data for the Crúbadán web crawler: http://crubadan.org/

This repository contains scripts and data from the Crúbadán project:
http://crubadan.org/


*** Normalization ***

In the "normalize" directory, you'll find the script that we apply
to web-crawled texts in various languages to clean them up.  
In general, we only perform very "gentle" cleaning, in order
to make the texts more useful for language-modeling and so on. 

As an example: in some Cyrillic-script languages, it's common for
users to type a "lookalike" Latin script character for what ought to be
a Cyrillic one; e.g. Latin "ö" (U+00F6) for Cyrillic "ӧ" (U+04E7).
Our script converts U+00F6 to U+04E7 for languages where this is an
issue (Komi, Udmurt, ...).
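
The lookalike conversion described above can be sketched in a few lines.
This is only an illustration in Python, not the project's Perl script:
the table name, the function name, and the use of ISO 639-3 codes
("kpv" for Komi-Zyrian, "udm" for Udmurt) to select languages are all
assumptions made for the example.

```python
# Hypothetical lookalike table: Latin code point -> Cyrillic code point.
# Only the pair named in the text is included.
LOOKALIKES = {0x00F6: 0x04E7}  # Latin "ö" -> Cyrillic "ӧ"

# Assumed language selectors (ISO 639-3 codes); the real script may
# identify languages differently.
AFFECTED_LANGUAGES = {"kpv", "udm"}


def fix_lookalikes(text, lang):
    """Replace Latin lookalikes with Cyrillic letters for affected languages."""
    if lang not in AFFECTED_LANGUAGES:
        # Leave text in other languages untouched, e.g. German "schön".
        return text
    return text.translate(LOOKALIKES)
```

Because the mapping is one-to-one and applied only for the listed
languages, the result is fully determined by the input.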

In contrast, we do not attempt to restore missing diacritics or
perform any other cleaning that is not deterministic.

The rules are expressed as Perl substitutions, and can be 
found in the file rules.txt.  The script reads UTF-8 text 
(Normalization form C) on standard input, and sends the 
normalized text to standard output.
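
As a rough illustration of this filter-style interface, here is a
minimal Python sketch. It is not the project's script: the rule list,
its format, and the function names are hypothetical, and the NFC step
is a defensive assumption (the text above only says the input is
expected to be in Normalization Form C).

```python
import io
import re
import sys
import unicodedata

# Hypothetical rule list of (compiled pattern, replacement) pairs.
# The real rules live in rules.txt as Perl substitutions.
RULES = [(re.compile("\u00f6"), "\u04e7")]  # Latin "ö" -> Cyrillic "ӧ"


def normalize_stream(src, dst, rules=RULES):
    """Copy text from src to dst line by line, applying each rule in order."""
    for line in src:
        # Assumption: compose to NFC first so rules see precomposed letters.
        line = unicodedata.normalize("NFC", line)
        for pattern, replacement in rules:
            line = pattern.sub(replacement, line)
        dst.write(line)


def main():
    """Run as a UTF-8 filter from standard input to standard output."""
    stdin = io.TextIOWrapper(sys.stdin.buffer, encoding="utf-8")
    stdout = io.TextIOWrapper(sys.stdout.buffer, encoding="utf-8")
    normalize_stream(stdin, stdout)
    stdout.flush()
```

Calling main() from a script would let it sit in a shell pipeline,
reading crawled text on standard input and writing the cleaned text to
standard output.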

We welcome contributions from additional language communities.  
The ruleset at present only covers a fraction of the 2000+
languages our crawler recognizes.