Merge pull request #51 from label-sleuth/lang
Update language documentation for extended 150+ language support
yannisk2 committed Jul 10, 2023
2 parents 441ef8b + 9901534 commit a4571d3
Showing 3 changed files with 173 additions and 58 deletions.
2 changes: 1 addition & 1 deletion docs/docs/dev/configuration.md
@@ -15,7 +15,7 @@ A custom configuration file can be applied by passing the `--config_path` parame
Alternatively, it is possible to override specific configuration parameters at startup by appending them to the "start_label_sleuth" command. For example, to set up the system to work with text data in Arabic, one can set the system language by using the following command:

```
python -m label_sleuth.start_label_sleuth --language ARABIC
python -m label_sleuth.start_label_sleuth --language Arabic
```


227 changes: 171 additions & 56 deletions docs/docs/dev/languages.md
@@ -1,61 +1,176 @@
# Language Support

Currently, Label Sleuth can work with text data in the following languages:
- `ENGLISH` <br /><defvalue>default</defvalue>
- `ITALIAN`
- `ROMANIAN`
- `HEBREW`
- `ARABIC`

To start up the system with your chosen language, use the following command:
Label Sleuth supports text data in _more than 150 languages_. To start the system with your chosen language, use the following command:
```text
python -m label_sleuth.start_label_sleuth --language <YOUR_LANGUAGE>
```

Note that not every machine learning model is compatible with every language. For model-language compatibility, see [here](model_training.md#model-policies).


## Adding support for a new language

The system can easily be extended to support additional languages, and we encourage developers who are fluent in additional languages to contribute them to Label Sleuth.

To support a new language, follow the steps below:

1. **Find your desired language in the page for [FastText word vectors](https://fasttext.cc/docs/en/crawl-vectors.html)**

Assuming your language is listed on this webpage, you will need to check for the *2- or 3-letter **language code*** associated with the language.

This can be done by looking at the download links that appear next to your desired language. The language code can be found within the template `cc.{XX}.300` in the download link. For example, the download link for Nepali is https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.ne.300.bin.gz, indicating that the appropriate code is `ne`. This code is the `fasttext_language_id`.

2. **Compile or find a list of *stop-words* for the language**

Stop-words are words that are considered less "meaningful". In English, for instance, a word like *"such"* is considered a stop-word, as it carries very little semantic meaning compared to words like *"kitchen"* or *"celebrate"*. For this reason, stop-words are often ignored by automated language systems, including some components of Label Sleuth.

Bear in mind that the specific set of stop-words you choose is not crucial; if in doubt, you can go for a very short list of words that come to mind.


3. **Create a Language object**

Each language is defined by a Language instance in [languages.py](https://github.com/label-sleuth/label-sleuth/blob/main/label_sleuth/models/core/languages.py). Create a new object for your language, filling in the name of the language as well as the information from steps 1-2. For example, this is the language object for Arabic:

```python
Arabic = Language(name='Arabic',
stop_words=["التى", "التي", "الذى", "الذي", "الذين", "ذلك", "هذا", "هذه", "هؤلاء", "قد", "وقد", "حيث",
"ان", "إن", "انه", "وان", "فان", "فإن", "بان", "اي", "أي", "ايضا", "أيضا", "إياه"],
fasttext_language_id='ar',
right_to_left=True)
```

*Note that in this particular case an additional parameter of `right_to_left` is specified; this is only necessary for languages that use a right-to-left writing direction.*

4. **Add the object to the `Languages` class**

[languages.py](https://github.com/label-sleuth/label-sleuth/blob/main/label_sleuth/models/core/languages.py) also contains a `Languages` class which holds all the languages supported by the system. Simply add your newly-created language object to `Languages`.

5. **Try it out**

All done! As described at the top of this page, you can now start Label Sleuth with your chosen language using `python -m label_sleuth.start_label_sleuth --language <YOUR_LANGUAGE>`. (The first start-up may take a little longer, as the system downloads the necessary files for the newly added language.)

Once Label Sleuth has started up using the chosen language, you can load documents in this language and work with the system as usual. Try out the system in the new language for a few model training iterations to make sure that the language extension works and the system can learn a model in the new language.

Be sure to [open a pull request](https://github.com/label-sleuth/label-sleuth) so that fellow speakers of the language can use it!
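
The language-code lookup described in step 1 can be sketched in a few lines. This is only an illustration: the helper name `extract_fasttext_language_id` is hypothetical and not part of Label Sleuth, and the pattern assumes the code is 2 or 3 lowercase letters inside the `cc.{XX}.300` template, as described above.

```python
import re
from typing import Optional

# Pattern for the `cc.{XX}.300` template found in the FastText download links.
# Assumption: the language code is 2 or 3 lowercase letters.
_FASTTEXT_CODE = re.compile(r"cc\.([a-z]{2,3})\.300")


def extract_fasttext_language_id(download_url: str) -> Optional[str]:
    """Return the language code embedded in a download link, or None if absent."""
    match = _FASTTEXT_CODE.search(download_url)
    return match.group(1) if match else None


# The Nepali download link quoted in step 1.
nepali_url = "https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.ne.300.bin.gz"
print(extract_fasttext_language_id(nepali_url))  # prints: ne
```

The returned value is what would be passed as `fasttext_language_id` when creating the `Language` object in step 3.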
In the start-up command, `<YOUR_LANGUAGE>` is the name of a language from the list of supported languages below. If the language name consists of multiple words, enclose it in double quotes.
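
As a rough illustration of the quoting rule for multi-word language names, the sketch below builds the start-up command string for a given name. The helper `build_start_command` is hypothetical and not part of Label Sleuth; it only wraps names that contain a space in double quotes.

```python
def build_start_command(language: str) -> str:
    """Build the start-up command, double-quoting multi-word language names."""
    # A name containing a space must be quoted so that the shell passes it
    # to --language as a single argument.
    language_arg = f'"{language}"' if " " in language else language
    return f"python -m label_sleuth.start_label_sleuth --language {language_arg}"


print(build_start_command("Romanian"))
# prints: python -m label_sleuth.start_label_sleuth --language Romanian
print(build_start_command("Egyptian Arabic"))
# prints: python -m label_sleuth.start_label_sleuth --language "Egyptian Arabic"
```

Single-word names such as `Romanian` are passed through unchanged, while names such as `Egyptian Arabic` or `Kurdish (Kurmanji)` are quoted.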

:::{note}
Not every machine learning model is compatible with every language. For model-language compatibility, see [here](model_training.md#model-policies).
:::

## Supported languages

Label Sleuth supports the following languages:

| Language |
| -------- |
| Afrikaans |
| Albanian |
| Alemannic |
| Amharic |
| Arabic |
| Aragonese |
| Armenian |
| Assamese |
| Asturian |
| Azerbaijani |
| Bashkir |
| Basque |
| Bavarian |
| Belarusian |
| Bengali |
| Bihari |
| Bishnupriya Manipuri |
| Bosnian |
| Breton |
| Bulgarian |
| Burmese |
| Catalan |
| Cebuano |
| Central Bicolano |
| Chechen |
| Chinese |
| Chuvash |
| Corsican |
| Croatian |
| Czech |
| Danish |
| Divehi |
| Dutch |
| Eastern Punjabi |
| Egyptian Arabic |
| Emilian-Romagnol |
| English <defvalue>default</defvalue> |
| Erzya |
| Esperanto |
| Estonian |
| Fiji Hindi |
| Finnish |
| French |
| Galician |
| Georgian |
| German |
| Goan Konkani |
| Greek |
| Gujarati |
| Haitian |
| Hebrew |
| Hill Mari |
| Hindi |
| Hungarian |
| Icelandic |
| Ido |
| Ilokano |
| Indonesian |
| Interlingua |
| Irish |
| Italian |
| Japanese |
| Javanese |
| Kannada |
| Kapampangan |
| Kazakh |
| Khmer |
| Kirghiz |
| Korean |
| Kurdish (Kurmanji) |
| Kurdish (Sorani) |
| Latin |
| Latvian |
| Limburgish |
| Lithuanian |
| Lombard |
| Low Saxon |
| Luxembourgish |
| Macedonian |
| Maithili |
| Malagasy |
| Malay |
| Malayalam |
| Maltese |
| Manx |
| Marathi |
| Mazandarani |
| Meadow Mari |
| Minangkabau |
| Mingrelian |
| Mirandese |
| Mongolian |
| Nahuatl |
| Neapolitan |
| Nepali |
| Newar |
| North Frisian |
| Northern Sotho |
| Norwegian (Bokmål) |
| Norwegian (Nynorsk) |
| Occitan |
| Oriya |
| Ossetian |
| Palatinate German |
| Pashto |
| Persian |
| Piedmontese |
| Polish |
| Portuguese |
| Quechua |
| Romanian |
| Romansh |
| Russian |
| Sakha |
| Sanskrit |
| Sardinian |
| Scots |
| Scottish Gaelic |
| Serbian |
| Serbo-Croatian |
| Sicilian |
| Sindhi |
| Sinhalese |
| Slovak |
| Slovenian |
| Somali |
| Southern Azerbaijani |
| Spanish |
| Sundanese |
| Swahili |
| Swedish |
| Tagalog |
| Tajik |
| Tamil |
| Tatar |
| Telugu |
| Thai |
| Tibetan |
| Turkish |
| Turkmen |
| Ukrainian |
| Upper Sorbian |
| Urdu |
| Uyghur |
| Uzbek |
| Venetian |
| Vietnamese |
| Volapük |
| Walloon |
| Waray |
| Welsh |
| West Flemish |
| West Frisian |
| Western Punjabi |
| Yiddish |
| Yoruba |
| Zazaki |
| Zeelandic |
2 changes: 1 addition & 1 deletion docs/docs/installation.md
@@ -68,7 +68,7 @@ Recommended procedure to install Label Sleuth:
As an example, if you would like to use the system with text data in Romanian, rather than the default setting of English, you can do so by entering the following command:

```text
python -m label_sleuth.start_label_sleuth --language ROMANIAN
python -m label_sleuth.start_label_sleuth --language Romanian
```

