Merge f399588 into 34d8d0f

nielstron · Oct 3, 2018 · bc43ce8
2 parents 34d8d0f + f399588
commit bc43ce8

Showing 38 changed files with 21,726 additions and 20,996 deletions.
diff --git a/README.md b/README.md
@@ -206,7 +206,8 @@ Here is an example of an entry in *entities.json*:
 }
 ```
 
--   *name* and *URI* are self explanatory.
+-   *name* is self explanatory.
+-   *URI* is the name of the Wikipedia page of the entity (e.g. `https://en.wikipedia.org/wiki/Speed` => `Speed`).
 -   *dimensions* is the dimensionality, a list of dictionaries each
     having a *base* (the name of another entity) and a *power* (an
     integer, can be negative).
@@ -218,22 +219,23 @@ Here is an example of an entry in *units.json*:
     "name": "metre per second",
     "surfaces": ["metre per second", "meter per second"],
     "entity": "speed",
-    "URI": "https://en.wikipedia.org/wiki/Metre_per_second",
+    "URI": "Metre_per_second",
     "dimensions": [{"base": "metre", "power": 1}, {"base": "second", "power": -1}],
     "symbols": ["mps"]
 },
 {
     "name": "year",
     "surfaces": [ "year", "annum" ],
     "entity": "time",
-    "URI": "https://en.wikipedia.org/wiki/Year",
+    "URI": "Year",
     "dimensions": [],
     "symbols": [ "a", "y", "yr" ],
     "prefixes": [ "k", "M", "G", "T", "P", "E" ]
 }
 ```
 
--   *name* and *URI* are self explanatory.
+-   *name* is self explanatory.
+-   *URI* follows the same scheme as in *entities.json*.
 -   *surfaces* is a list of strings that refer to that unit. The library
     takes care of plurals, no need to specify them.
 -   *entity* is the name of an entity in *entities.json*
@@ -266,23 +268,19 @@ If you'd like to contribute follow these steps:
 (Optional, will be done automatically after pushing)
 8. Create a Pull Request when having commited and pushed your changes
 
-
 Language support
 ----------------
 [![Travis dev build state](https://travis-ci.com/nielstron/quantulum3.svg?branch=language_support "Travis dev build state")](https://travis-ci.com/nielstron/quantulum3)
 [![Coverage Status](https://coveralls.io/repos/github/nielstron/quantulum3/badge.svg?branch=language_support)](https://coveralls.io/github/nielstron/quantulum3?branch=dev)
 
 There is a branch for language support, namely `language_support`.
-From inspecting the functions and values given in the new `_lang.en_US` submodule,
+From inspecting the `README` file in the `_lang` subdirectory and
+the functions and values given in the new `_lang.en_US` submodule,
 one should be able to create own language submodules.
 The new language modules should automatically be invoked and be available,
 both through the `lang=` keyword argument in the parser functions as well
 as in the automatic unittests.
 
-No changes above the own language submoduel (i.e. `_lang.de_DE`) should
+No changes outside your own language submodule folder (e.g. `_lang.de_DE`) should
 be necessary. If there are problems implementing a new language, don't hesitate to open an issue.
 
-Language support is very beta currently and will thus stay in the seperate branch
-until at least one additional language has been added. That way the current parsing
-quality can be assured and meanwhile be extended for additional languages.
-
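The README hunks above switch the `URI` field from a full Wikipedia URL to the bare page name. As a minimal sketch of that mapping (the helper name `wikipedia_page_name` is hypothetical and not part of this commit), existing entries could be converted like this:

```python
from urllib.parse import urlparse


def wikipedia_page_name(uri):
    """Return the page name for a full Wikipedia URL.

    Hypothetical helper for illustration; values that are already
    in the short form (no URL scheme) are returned unchanged.
    """
    parsed = urlparse(uri)
    if not parsed.scheme:
        return uri  # already a bare page name such as "Year"
    prefix = "/wiki/"
    if not parsed.path.startswith(prefix):
        raise ValueError("not a Wikipedia page URL: %r" % uri)
    return parsed.path[len(prefix):]


print(wikipedia_page_name("https://en.wikipedia.org/wiki/Metre_per_second"))
# Metre_per_second
```

Passing short names through unchanged keeps the helper safe to run on files that have already been migrated.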
diff --git a/quantulum3/_lang/README.md b/quantulum3/_lang/README.md
@@ -0,0 +1,106 @@
+Checklist language support
+--------------------------
+
+The following list describes all files and functions needed to
+add a new language to the quantulum project.
+
+### load.py
+
+#### `pluralize(singular, count)`
+
+Turn a given word into its plural form based
+on the given count (`None` -> plural)
+
+#### `number_to_words(number)`
+
+Turn the given float into a pronounceable string
+
+
+### regex.py
+
+`TEXT_PATTERN` is a regex pattern describing numbers that are spelled out
+(at least partly). It can be freely assigned but has to contain the formatting
+groups used in the following example pattern. Braces that should appear
+literally in the resulting regular expression have to be escaped by doubling them (see below).
+
+`number_pattern_no_groups` will be replaced by the below number pattern with
+no regex groups inserted.
+
+`numberwords_regex` is just a concatenation of all possible numberwords (defined in
+regex.py) with `|`, resulting in an "either or" decision for the regex engine.
+
+```pythonregexp
+(?:
+    (?<![a-zA-Z0-9+.-])    # lookbehind, avoid "Area51"
+    {number_pattern_no_groups}
+)?
+[ -]?(?:{numberwords_regex})
+[ -]?(?:{numberwords_regex})?[ -]?(?:{numberwords_regex})?[ -]?(?:{numberwords_regex})?
+[ -]?(?:{numberwords_regex})?[ -]?(?:{numberwords_regex})?[ -]?(?:{numberwords_regex})?
+```
+`NUM_PATTERN` is a regex pattern describing the general form of a number.
+It is important that the formatting groups below (marked by enclosing `{}`)
+are contained in the regex string. It is defined on a package-wide level
+and should not have to be changed for localization.
+
+`{{3}}` is a case of escaped braces and will be formatted to `{3}`.
+
+```pythonregexp
+(?{number}              # required number
+    [+-]?                  #   optional sign
+    \.?\d+                 #   required digits
+    (?:[{grouping}]\d{{3}})*         #   allowed grouping
+    (?{decimals}[{decimal_operators}]\d+)?    #   optional decimals
+)
+(?{scale}               # optional exponent
+    (?:{multipliers})?                #   multiplicative operators
+    (?{base}(E|e|\d+)\^?)    #   required exponent prefix
+    (?{exponent}[+-]?\d+|[{superscript}])      #   required exponent, superscript or normal
+)?
+(?{fraction}             # optional fraction
+    \ \d+/\d+|\ ?[{unicode_fract}]|/\d+
+)?
+```
+
+The rest should be self-explanatory by inspecting `en_US/regex.py`.
+
+
+### parser.py, classifier.py, speak.py
+
+Refer to the corresponding python modules in `en_US`. They contain example code and documentation on the
+necessary functions. Most currently included functions are necessary.
+
+
+### train
+
+This directory contains `json`-files with training data for the classifier for that language.
+The serialized objects are lists of dicts with two keys:
+
+- `unit`: the unit which is associated with the following text
+- `text`: a text associated with the unit. It will be tokenized and stemmed
+          and then be used as training data for the unit
+
+All `json`-files will be included for training. All other keys in the dictionary
+will be ignored. They can be used for comments.
+
+The classifier can later be trained via `scripts/train.py`.
+
+
+### tests
+
+This directory contains `json`-files with test data for the language.
+The tests are subdivided in `expand`, `quantities` and `quantities.ambiguity`.
+
+- `expand`: Test if parsing and replacing in the text, correctly pluralized, works
+- `quantities`: Test if units are correctly parsed **without** a classifier
+- `quantities.ambiguity`: Test if the classifier succeeds at telling which unit
+                          was meant in the text, invoking the classifier.
+
+### units.json, entities.json
+
+These files can be built exactly like the corresponding files on the package top-level.
+
+Units and entities defined here are added if they are not defined in the global files;
+otherwise their values override those of the global entries. `surfaces` and `URI` are fields that should be
+overridden by all languages.
+
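The brace handling described in the `regex.py` section of the new README can be tried out with plain `str.format`. The sketch below uses a deliberately simplified template, not the package's actual `NUM_PATTERN`, and the keyword values (`"P<value>"`, `":"`, `","`) are assumptions chosen for illustration:

```python
import re

# Simplified stand-in for a pattern template: {number} and {grouping} are
# formatting groups, while {{3}} is an escaped brace pair that becomes {3}.
TEMPLATE = r"(?{number}[+-]?\d+(?:[{grouping}]\d{{3}})*)"

# Assumed fill-ins: a named group in one variant, a non-capturing
# group in the other.
with_groups = TEMPLATE.format(number="P<value>", grouping=",")
no_groups = TEMPLATE.format(number=":", grouping=",")

print(with_groups)  # (?P<value>[+-]?\d+(?:[,]\d{3})*)
print(no_groups)    # (?:[+-]?\d+(?:[,]\d{3})*)

match = re.fullmatch(with_groups, "1,234,567")
print(match.group("value"))  # 1,234,567
```

Filling the same template twice is one way to obtain a group-free variant that can be embedded in a larger pattern, which is presumably what `number_pattern_no_groups` refers to.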
diff --git a/quantulum3/_lang/__init__.py b/quantulum3/_lang/__init__.py
diff --git a/quantulum3/_lang/en_US/__init__.py b/quantulum3/_lang/en_US/__init__.py
@@ -0,0 +1,4 @@
+# Standard library
+from pathlib import Path
+
+lang = Path(__file__).parent.name
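The single statement in `en_US/__init__.py` derives the language code from the name of the folder that contains the file. A short sketch with a hard-coded path (an assumption standing in for `__file__`) shows the effect:

```python
from pathlib import Path

# Stand-in for __file__ inside the en_US submodule; the path is only
# used for illustration and does not need to exist on disk.
init_file = Path("quantulum3/_lang/en_US/__init__.py")

lang = init_file.parent.name
print(lang)  # en_US
```

With this approach a new language folder such as `de_DE` can reuse the same `__init__.py` unchanged.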
diff --git a/quantulum3/_lang/en_US/classifier.py b/quantulum3/_lang/en_US/classifier.py
@@ -0,0 +1,43 @@
+# -*- coding: utf-8 -*-
+"""
+:mod:`Quantulum` classifier functions.
+"""
+
+# Standard library
+import re
+import string
+
+try:
+    from stemming.porter2 import stem
+except ImportError:
+    stem = None
+
+
+################################################################################
+def clean_text(text):
+    """
+    Clean text for TFIDF
+    """
+    if not stem:
+        raise ImportError("Module stemming is not installed.")
+
+    my_regex = re.compile(r'[%s]' % re.escape(string.punctuation))
+    new_text = my_regex.sub(' ', text)
+
+    new_text = [
+        stem(i) for i in new_text.lower().split()
+        if not re.findall(r'[0-9]', i)
+    ]
+
+    new_text = ' '.join(new_text)
+
+    return new_text
+
+
+################################################################################
+def stop_words():
+    """
+    Return the string identifying the stop word language for the TFIDF vectorizer
+    :return: name of the stop word list, here ``'english'``
+    """
+    return 'english'
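To see what `clean_text` does to an input without installing the `stemming` package, the following sketch repeats its steps with an identity function in place of the Porter2 stemmer. The name `clean_text_sketch` and the `stem` parameter are hypothetical, so real output will differ wherever the stemmer shortens a word:

```python
import re
import string


def clean_text_sketch(text, stem=lambda word: word):
    """Mirror the steps of clean_text with a pluggable stemmer."""
    # Replace every punctuation character with a space.
    punctuation = re.compile(r'[%s]' % re.escape(string.punctuation))
    tokens = punctuation.sub(' ', text).lower().split()
    # Drop tokens that contain a digit, then stem and re-join the rest.
    return ' '.join(stem(t) for t in tokens if not re.search(r'[0-9]', t))


print(clean_text_sketch("The car drove 30 km/h, approx."))
# the car drove km h approx
```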
diff --git a/quantulum3/clf.joblib b/quantulum3/_lang/en_US/clf.joblib