Merge pull request #51 from label-sleuth/lang
Update language documentation for extended 150+ language support
Showing 3 changed files with 173 additions and 58 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,61 +1,176 @@ | ||
# Language Support | ||
|
||
Currently, Label Sleuth can work with text data in the following languages: | ||
- `ENGLISH` <br /><defvalue>default</defvalue> | ||
- `ITALIAN` | ||
- `ROMANIAN` | ||
- `HEBREW` | ||
- `ARABIC` | ||
|
||
To start up the system with your chosen language, use the following command: | ||
Label Sleuth supports text data in _more than 150 languages_. To start the system with your chosen language, use the following command: | ||
```text | ||
python -m label_sleuth.start_label_sleuth --language <YOUR_LANGUAGE> | ||
``` | ||
|
||
Note that not every machine learning model is compatible with every language. For model-language compatibility, see [here](model_training.md#model-policies). | ||
|
||
|
||
## Adding support for a new language | ||
|
||
The system can easily be extended to support additional languages, and we encourage developers who are fluent in additional languages to contribute them to Label Sleuth. | ||
|
||
To support a new language, follow the steps below: | ||
|
||
1. **Find your desired language in the page for [FastText word vectors](https://fasttext.cc/docs/en/crawl-vectors.html)** | ||
|
||
Assuming your language is listed on this webpage, you will need to check for the *2- or 3-letter **language code*** associated with the language. | ||
|
||
This can be done by looking at the download links that appear next to your desired language. The language code can be found within the template `cc.{XX}.300` in the download link. For example, the download link for Nepali is https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.ne.300.bin.gz, indicating that the appropriate code is `ne`. This code is the `fasttext_language_id`. | ||
|
||
2. **Compile or find a list of *stop-words* for the language** | ||
|
||
Stop-words are words that are considered less "meaningful" (in English, for instance, a word like *"such"* is considered a stop-word, as it carries very little semantic meaning compared to words like *"kitchen"* or *"celebrate"*). For this reason, stop-words are often ignored by automated language systems, including some components of Label Sleuth. | ||
|
||
Bear in mind that the specific set of stop-words you choose is not so crucial; if in doubt, you can go for a very short list of words that come to mind. | ||
|
||
|
||
3. **Create a Language object** | ||
|
||
Each language is defined by a Language instance in [languages.py](https://github.com/label-sleuth/label-sleuth/blob/main/label_sleuth/models/core/languages.py). Create a new object for your language, filling in the name of the language as well as the information from steps 1-2. For example, this is the language object for Arabic: | ||
|
||
```python | ||
Arabic = Language(name='Arabic', | ||
stop_words=["التى", "التي", "الذى", "الذي", "الذين", "ذلك", "هذا", "هذه", "هؤلاء", "قد", "وقد", "حيث", | ||
"ان", "إن", "انه", "وان", "فان", "فإن", "بان", "اي", "أي", "ايضا", "أيضا", "إياه"], | ||
fasttext_language_id='ar', | ||
right_to_left=True) | ||
``` | ||
|
||
*Note that in this particular case an additional parameter of `right_to_left` is specified; this is only necessary for languages that use a right-to-left writing direction.* | ||
|
||
4. **Add the object to the `Languages` class** | ||
|
||
[languages.py](https://github.com/label-sleuth/label-sleuth/blob/main/label_sleuth/models/core/languages.py) also contains a `Languages` class which holds all the languages supported by the system. Simply add your newly-created language object to `Languages`. | ||
|
||
5. **Try it out** | ||
|
||
All done! As specified at the top of this page, you can now start Label Sleuth with your chosen language by running `python -m label_sleuth.start_label_sleuth --language <YOUR_LANGUAGE>`. (The first startup may take a little longer, as the system downloads the necessary files for the newly added language.) | ||
|
||
Once Label Sleuth has started up using the chosen language, you can load documents in this language and work with the system as usual. Try out the system in the new language for a few model training iterations to make sure that the language extension works and the system can learn a model in the new language. | ||
|
||
Be sure to [open a pull request](https://github.com/label-sleuth/label-sleuth) so that fellow speakers of the language can use it! | ||
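Step 1 above reads the language code off the `cc.{XX}.300` part of a FastText download link. As a minimal sketch of that lookup, the helper below pulls the code out of a link with a regular expression. The function name `fasttext_language_id` is hypothetical and is not part of Label Sleuth; it only illustrates the naming pattern described in the text.

```python
import re
from typing import Optional

# Matches the `cc.{XX}.300` template, where {XX} is a 2- or 3-letter code.
_CODE_PATTERN = re.compile(r"cc\.([a-z]{2,3})\.300")


def fasttext_language_id(download_url: str) -> Optional[str]:
    """Return the language code embedded in a FastText download link.

    Hypothetical helper for illustration only; returns None when the
    link does not follow the `cc.{XX}.300` template.
    """
    match = _CODE_PATTERN.search(download_url)
    return match.group(1) if match else None


nepali_url = "https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.ne.300.bin.gz"
print(fasttext_language_id(nepali_url))
```

For the Nepali link quoted in step 1, this yields `ne`, the value to pass as `fasttext_language_id` when creating the `Language` object.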
where `<YOUR_LANGUAGE>` is the name of the language from the list of supported languages below. Note that if the language name consists of multiple words, it should be enclosed in double quotes. | ||
|
||
:::{note} | ||
Not every machine learning model is compatible with every language. For model-language compatibility, see [here](model_training.md#model-policies). | ||
::: | ||
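The quoting rule for multi-word language names can be sketched as follows. The helper `build_start_command` is hypothetical and only assembles the command string given in the text; it assumes that a name containing whitespace is the only case that needs double quotes, and it does not launch Label Sleuth.

```python
def build_start_command(language: str) -> str:
    """Build the start command, double-quoting multi-word language names.

    Hypothetical helper for illustration; it only formats the command
    string and does not run anything.
    """
    needs_quotes = any(ch.isspace() for ch in language)
    quoted = f'"{language}"' if needs_quotes else language
    return f"python -m label_sleuth.start_label_sleuth --language {quoted}"


print(build_start_command("Hebrew"))
print(build_start_command("Scottish Gaelic"))
```

A single-word name such as `Hebrew` is passed as-is, while `Scottish Gaelic` is wrapped in double quotes so that the shell treats it as one argument.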
|
||
## Supported languages | ||
|
||
Label Sleuth supports the following languages: | ||
|
||
| Language | | ||
| -------- | | ||
| Afrikaans | | ||
| Albanian | | ||
| Alemannic | | ||
| Amharic | | ||
| Arabic | | ||
| Aragonese | | ||
| Armenian | | ||
| Assamese | | ||
| Asturian | | ||
| Azerbaijani | | ||
| Bashkir | | ||
| Basque | | ||
| Bavarian | | ||
| Belarusian | | ||
| Bengali | | ||
| Bihari | | ||
| Bishnupriya Manipuri | | ||
| Bosnian | | ||
| Breton | | ||
| Bulgarian | | ||
| Burmese | | ||
| Catalan | | ||
| Cebuano | | ||
| Central Bicolano | | ||
| Chechen | | ||
| Chinese | | ||
| Chuvash | | ||
| Corsican | | ||
| Croatian | | ||
| Czech | | ||
| Danish | | ||
| Divehi | | ||
| Dutch | | ||
| Eastern Punjabi | | ||
| Egyptian Arabic | | ||
| Emilian-Romagnol | | ||
| English <defvalue>default</defvalue> | | ||
| Erzya | | ||
| Esperanto | | ||
| Estonian | | ||
| Fiji Hindi | | ||
| Finnish | | ||
| French | | ||
| Galician | | ||
| Georgian | | ||
| German | | ||
| Goan Konkani | | ||
| Greek | | ||
| Gujarati | | ||
| Haitian | | ||
| Hebrew | | ||
| Hill Mari | | ||
| Hindi | | ||
| Hungarian | | ||
| Icelandic | | ||
| Ido | | ||
| Ilokano | | ||
| Indonesian | | ||
| Interlingua | | ||
| Irish | | ||
| Italian | | ||
| Japanese | | ||
| Javanese | | ||
| Kannada | | ||
| Kapampangan | | ||
| Kazakh | | ||
| Khmer | | ||
| Kirghiz | | ||
| Korean | | ||
| Kurdish (Kurmanji) | | ||
| Kurdish (Sorani) | | ||
| Latin | | ||
| Latvian | | ||
| Limburgish | | ||
| Lithuanian | | ||
| Lombard | | ||
| Low Saxon | | ||
| Luxembourgish | | ||
| Macedonian | | ||
| Maithili | | ||
| Malagasy | | ||
| Malay | | ||
| Malayalam | | ||
| Maltese | | ||
| Manx | | ||
| Marathi | | ||
| Mazandarani | | ||
| Meadow Mari | | ||
| Minangkabau | | ||
| Mingrelian | | ||
| Mirandese | | ||
| Mongolian | | ||
| Nahuatl | | ||
| Neapolitan | | ||
| Nepali | | ||
| Newar | | ||
| North Frisian | | ||
| Northern Sotho | | ||
| Norwegian (Bokmål) | | ||
| Norwegian (Nynorsk) | | ||
| Occitan | | ||
| Oriya | | ||
| Ossetian | | ||
| Palatinate German | | ||
| Pashto | | ||
| Persian | | ||
| Piedmontese | | ||
| Polish | | ||
| Portuguese | | ||
| Quechua | | ||
| Romanian | | ||
| Romansh | | ||
| Russian | | ||
| Sakha | | ||
| Sanskrit | | ||
| Sardinian | | ||
| Scots | | ||
| Scottish Gaelic | | ||
| Serbian | | ||
| Serbo-Croatian | | ||
| Sicilian | | ||
| Sindhi | | ||
| Sinhalese | | ||
| Slovak | | ||
| Slovenian | | ||
| Somali | | ||
| Southern Azerbaijani | | ||
| Spanish | | ||
| Sundanese | | ||
| Swahili | | ||
| Swedish | | ||
| Tagalog | | ||
| Tajik | | ||
| Tamil | | ||
| Tatar | | ||
| Telugu | | ||
| Thai | | ||
| Tibetan | | ||
| Turkish | | ||
| Turkmen | | ||
| Ukrainian | | ||
| Upper Sorbian | | ||
| Urdu | | ||
| Uyghur | | ||
| Uzbek | | ||
| Venetian | | ||
| Vietnamese | | ||
| Volapük | | ||
| Walloon | | ||
| Waray | | ||
| Welsh | | ||
| West Flemish | | ||
| West Frisian | | ||
| Western Punjabi | | ||
| Yiddish | | ||
| Yoruba | | ||
| Zazaki | | ||
| Zeelandic | |