Language Detection

Language detection for English, German, French, Italian, Spanish and Romansh.

The machine learning model was implemented with Keras and tested with TensorFlow as the backend. The model for English, German, French, Italian and Romansh was trained on 75'000 text extracts from Wikipedia and RTR.ch, because the Romansh Wikipedia is much smaller than those of the other languages. The model for English, German, French, Italian and Spanish was trained on 200'000 chunks of text from Wikipedia.

Accuracy

For the model that includes Romansh, the accuracy of the predictions is 94.65%. For the model with English, German, French, Italian and Spanish, the accuracy is over 99%.

The model consists of an embedding layer with 10 dimensions as input, a hidden layer with 4 LSTM units and an output layer with 5 units and softmax activation. The maximum length of an input text is 200 words. Longer texts can still be analyzed, since the first 200 words are enough to detect the language.
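The architecture described above can be sketched in Keras roughly as follows. This is a minimal sketch, not the project's actual training code: the vocabulary size (`VOCAB_SIZE`) and both function names are assumptions, since the README does not state them. The helper `count_parameters` applies the standard parameter formulas for embedding, LSTM and dense layers to show how small the model is.

```python
# Hypothetical vocabulary size; the README does not state the real value.
VOCAB_SIZE = 10000
EMBEDDING_DIM = 10   # embedding dimensions (from the README)
LSTM_UNITS = 4       # LSTM units in the hidden layer (from the README)
NUM_LANGUAGES = 5    # output units, one per language (from the README)


def build_model(vocab_size=VOCAB_SIZE):
    """Build an Embedding -> LSTM -> softmax model (requires Keras)."""
    # Imported lazily so the rest of this sketch runs without Keras installed.
    from keras.models import Sequential
    from keras.layers import Embedding, LSTM, Dense

    return Sequential([
        Embedding(vocab_size, EMBEDDING_DIM),
        LSTM(LSTM_UNITS),
        Dense(NUM_LANGUAGES, activation="softmax"),
    ])


def count_parameters(vocab_size=VOCAB_SIZE):
    """Return the trainable parameter count per layer, computed by hand."""
    embedding = vocab_size * EMBEDDING_DIM
    # An LSTM has four weight blocks (input, forget and output gates plus the
    # cell candidate), each with input weights, recurrent weights and a bias.
    lstm = 4 * (LSTM_UNITS * (EMBEDDING_DIM + LSTM_UNITS) + LSTM_UNITS)
    dense = LSTM_UNITS * NUM_LANGUAGES + NUM_LANGUAGES
    return {"embedding": embedding, "lstm": lstm, "dense": dense}
```

Under these assumptions almost all parameters sit in the embedding layer; the LSTM and output layers together hold only 265 weights.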

Several texts can be posted to the REST service at once for analysis (see the examples below).

Scope of Application

Determining the language in which a text was written is one of the most important tasks in the automated processing of documents. Classification, sentiment analysis, fake news detection or spam analysis cannot be performed without knowing in which language the text, tweet or review was written.

This is especially relevant in Switzerland, which has four official languages. Forum entries and user product reviews there are often written in a mix of several languages.

Installation

The REST service was tested with Python 3.6, Keras 2.1 and TensorFlow 1.5. Run the following command to install the dependencies:

pip3 install -r ./requirements.txt

Starting the REST Service

Start the REST server from a terminal. The -h option prints the usage:

ld-rest.py -h
Usage: ld-rest.py --model=<model> --host=<host> --port=<port>

model: id of the model to load, composed of the languages and the version (example: en-de-fr-it-rm_1.0.0). By default, the model for English, German, French, Italian and Romansh (en-de-fr-it-rm_1.0.0) is loaded. The identifier for the model with Spanish is en-de-fr-it-es_1.0.0.

Note: The first time you start the service, the model (including the trained weights) and the tokenizer are downloaded from ipublia's website over HTTPS. They are stored in the directory ~/.ipublia/data/language-detection.
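The model identifier format described above (language codes joined by hyphens, then an underscore and the version) can be taken apart with a few lines of Python. The helper `parse_model_id` is hypothetical and not part of ld-rest.py; it only illustrates the naming scheme.

```python
def parse_model_id(model_id):
    """Split an id such as 'en-de-fr-it-rm_1.0.0' into languages and version."""
    languages, separator, version = model_id.partition("_")
    if not separator or not languages or not version:
        raise ValueError("expected '<languages>_<version>', got %r" % model_id)
    return languages.split("-"), version


languages, version = parse_model_id("en-de-fr-it-es_1.0.0")
print(languages, version)
```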

Example

python3 ./src/ld-rest.py --model=en-de-fr-it-rm_1.0.0 --host=127.0.0.1 --port=5000

Querying the REST Service

Queries use the following JSON format:

{
  "texts": [
    "Workers in a small Illinois town are worried that a Supreme Court decision curbing union power would hurt their community.",
    "Es soll sich um den bisher grössten Streik im Bildungswesen Grossbritanniens handeln.",
    "Pour la première fois, des parties de peintures rupestres de trois grottes espagnoles ont été attribuées à l’homme de Neandertal.",
    "Sono passati 10 anni dallo sciopero. Ma sarà il prossimo decennio ad essere decisivo per il futuro delle Officine FFS di Bellinzona.",
    "Cun questa decisiun è la sessiun dal favrer dal Cussegl grond ida a fin schon in zic pli baud.",
    "Non mi piace affatto questo film. Gli attori sono cattivi e la storia è noiosa!"
  ]
}

Example Call with Curl

Example of a curl call in a terminal:

curl -H "Content-Type: application/json" -X POST -d '{"texts": ["Workers in a small Illinois town are worried...","Es soll sich um den bisher grössten Streik..."]}' http://127.0.0.1:5000/predict
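The same call can be made from Python with the standard library alone. This is a minimal sketch that assumes the service is running at the host, port and path used in the curl example; the function names `build_request` and `detect_languages` are illustrative and not part of the project.

```python
import json
import urllib.request

# Host, port and path taken from the curl example above.
PREDICT_URL = "http://127.0.0.1:5000/predict"


def build_request(texts, url=PREDICT_URL):
    """Create a POST request carrying the texts in the documented JSON format."""
    body = json.dumps({"texts": list(texts)}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def detect_languages(texts, url=PREDICT_URL):
    """Send the texts to a running service and return the decoded JSON answer."""
    with urllib.request.urlopen(build_request(texts, url)) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server started as shown earlier, `detect_languages(["Es soll sich um den bisher grössten Streik..."])` would return the parsed answer as a dictionary.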

For the full query shown in the previous section, the answer will look like this:

{
  "predictions": [
    {
      "lang": {
        "label": "en", 
        "probability": {
          "de": 0.1527421623468399, 
          "en": 0.3940083086490631, 
          "fr": 0.15050075948238373, 
          "it": 0.15083013474941254, 
          "rm": 0.1519186794757843
        }
      }, 
      "text": "Workers in a small Illinois town are worried that a Supreme Court decision curbing union power would hurt their community."
    }, 
    {
      "lang": {
        "label": "de", 
        "probability": {
          "de": 0.40271395444869995, 
          "en": 0.14915402233600616, 
          "fr": 0.14908619225025177, 
          "it": 0.1491246223449707, 
          "rm": 0.1499212086200714
        }
      }, 
      "text": "Es soll sich um den bisher grössten Streik im Bildungswesen Grossbritanniens handeln."
    }, 
    {
      "lang": {
        "label": "fr", 
        "probability": {
          "de": 0.16025426983833313, 
          "en": 0.16263195872306824, 
          "fr": 0.2883320748806, 
          "it": 0.193227618932724, 
          "rm": 0.19555407762527466
        }
      }, 
      "text": "Pour la première fois, des parties de peintures rupestres de trois grottes espagnoles ont été attribuées à l’homme de Neandertal."
    }, 
    {
      "lang": {
        "label": "rm", 
        "probability": {
          "de": 0.1611061990261078, 
          "en": 0.156982421875, 
          "fr": 0.1632387787103653, 
          "it": 0.18945465981960297, 
          "rm": 0.32921797037124634
        }
      }, 
      "text": "Sono passati 10 anni dallo sciopero. Ma sarà il prossimo decennio ad essere decisivo per il futuro delle Officine FFS di Bellinzona."
    }, 
    {
      "lang": {
        "label": "rm", 
        "probability": {
          "de": 0.15039856731891632, 
          "en": 0.15162697434425354, 
          "fr": 0.15005752444267273, 
          "it": 0.14968758821487427, 
          "rm": 0.39822936058044434
        }
      }, 
      "text": "Cun questa decisiun è la sessiun dal favrer dal Cussegl grond ida a fin schon in zic pli baud."
    }, 
    {
      "lang": {
        "label": "it", 
        "probability": {
          "de": 0.15224331617355347, 
          "en": 0.15012496709823608, 
          "fr": 0.15062405169010162, 
          "it": 0.39602142572402954, 
          "rm": 0.15098623931407928
        }
      }, 
      "text": "Non mi piace affatto questo film. Gli attori sono cattivi e la storia è noiosa!"
    }
  ], 
  "success": true
}
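Each prediction carries the winning label and the full probability distribution, so a client can also judge how clear-cut a result is. The sketch below assumes the answer has already been decoded into a Python dictionary; the function name `summarize_predictions` and the cut-off `min_probability` are hypothetical and not values from the project.

```python
def summarize_predictions(answer, min_probability=0.35):
    """Return (label, probability, is_confident) for each prediction.

    min_probability is a hypothetical cut-off, not a value from the project.
    """
    summary = []
    for prediction in answer["predictions"]:
        label = prediction["lang"]["label"]
        probability = prediction["lang"]["probability"][label]
        summary.append((label, probability, probability >= min_probability))
    return summary


# Two entries abbreviated from the sample answer above.
sample = {
    "predictions": [
        {"lang": {"label": "en", "probability": {"en": 0.3940083086490631}}},
        {"lang": {"label": "rm", "probability": {"rm": 0.32921797037124634}}},
    ],
    "success": True,
}
print(summarize_predictions(sample))
# [('en', 0.3940083086490631, True), ('rm', 0.32921797037124634, False)]
```

In the sample answer above, the third and fourth predictions win with probabilities of about 0.29 and 0.33, so a cut-off of 0.35 would flag both for review.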