
Commit

Merge f399588 into 34d8d0f
nielstron committed Oct 3, 2018
2 parents 34d8d0f + f399588 commit bc43ce8
Showing 38 changed files with 21,726 additions and 20,996 deletions.
20 changes: 9 additions & 11 deletions README.md
@@ -206,7 +206,8 @@ Here is an example of an entry in *entities.json*:
}
```

- - *name* and *URI* are self explanatory.
+ - *name* is self-explanatory.
+ - *URI* is the name of the Wikipedia page of the entity (e.g. `https://en.wikipedia.org/wiki/Speed` => `Speed`).
- *dimensions* is the dimensionality, a list of dictionaries each
having a *base* (the name of another entity) and a *power* (an
integer, can be negative).
@@ -218,22 +219,23 @@ Here is an example of an entry in *units.json*:
"name": "metre per second",
"surfaces": ["metre per second", "meter per second"],
"entity": "speed",
"URI": "https://en.wikipedia.org/wiki/Metre_per_second",
"URI": "Metre_per_second",
"dimensions": [{"base": "metre", "power": 1}, {"base": "second", "power": -1}],
"symbols": ["mps"]
},
{
"name": "year",
"surfaces": [ "year", "annum" ],
"entity": "time",
"URI": "https://en.wikipedia.org/wiki/Year",
"URI": "Year",
"dimensions": [],
"symbols": [ "a", "y", "yr" ],
"prefixes": [ "k", "M", "G", "T", "P", "E" ]
}
```

- - *name* and *URI* are self explanatory.
+ - *name* is self-explanatory.
+ - *URI* follows the same scheme as in *entities.json*.
- *surfaces* is a list of strings that refer to that unit. The library
takes care of plurals, no need to specify them.
- *entity* is the name of an entity in *entities.json*
@@ -266,23 +268,19 @@ If you'd like to contribute, follow these steps:
(Optional, will be done automatically after pushing)
8. Create a Pull Request once you have committed and pushed your changes


Language support
----------------
[![Travis dev build state](https://travis-ci.com/nielstron/quantulum3.svg?branch=language_support "Travis dev build state")](https://travis-ci.com/nielstron/quantulum3)
[![Coverage Status](https://coveralls.io/repos/github/nielstron/quantulum3/badge.svg?branch=language_support)](https://coveralls.io/github/nielstron/quantulum3?branch=dev)

There is a branch for language support, namely `language_support`.
- From inspecting the functions and values given in the new `_lang.en_US` submodule,
+ From inspecting the `README` file in the `_lang` subdirectory and
+ the functions and values given in the new `_lang.en_US` submodule,
one should be able to create one's own language submodules.
The new language modules should automatically be invoked and be available,
both through the `lang=` keyword argument in the parser functions and
in the automatic unit tests.
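
For illustration, a hedged sketch of how a new language would then be selected
(the `de_DE` module and the sentence are hypothetical):

```python
from quantulum3 import parser

# Hypothetical: assumes a _lang.de_DE submodule has been added.
quants = parser.parse("Der Zug fuhr 100 Meter pro Sekunde.", lang="de_DE")
print(quants)
```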

- No changes above the own language submoduel (i.e. `_lang.de_DE`) should
+ No changes outside one's own language submodule folder (e.g. `_lang.de_DE`) should
be necessary. If there are problems implementing a new language, don't hesitate to open an issue.

Language support is currently very much in beta and will thus stay in the separate branch
until at least one additional language has been added. That way the current parsing
quality can be assured while support for additional languages is developed.

106 changes: 106 additions & 0 deletions quantulum3/_lang/README.md
@@ -0,0 +1,106 @@
Checklist for language support
--------------------------

The following list describes all files and functions that are necessary to
add a new language to the quantulum project.

### load.py

#### `pluralize(singular, count)`

Turn a given word into its plural form based
on the given count (`None` -> plural)
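
A minimal sketch of what this could look like for English; the crude suffix
rules below are an assumption, not the actual `en_US` implementation:

```python
def pluralize(singular, count=None):
    """Return the plural form of `singular` unless count == 1 (`None` -> plural)."""
    if count == 1:
        return singular
    # Crude English fallback rules, for illustration only.
    if singular.endswith(("s", "x", "z", "ch", "sh")):
        return singular + "es"
    return singular + "s"
```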

#### `number_to_words(number)`

Turn the given float into a pronounceable string
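
A hedged sketch, assuming the third-party `num2words` package; the actual
`en_US` module may implement this differently:

```python
from num2words import num2words  # assumption: third-party package

def number_to_words(number):
    """Turn a float into a pronounceable string, e.g. 1.5 -> 'one point five'."""
    return num2words(number, lang="en")
```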


### regex.py

`TEXT_PATTERN` is a regex pattern describing numbers that are (at least partly)
spelled out. It can be freely assigned but has to contain the formatting
groups used in the following example pattern. Braces that should end up
untouched in the resulting regular expression have to be escaped (see below for how this is done).

`number_pattern_no_groups` will be replaced by the number pattern below, with
no regex groups inserted.

`numberwords_regex` is a concatenation of all possible number words (defined in
regex.py), joined with `|`, resulting in an "either-or" decision for the regex engine.

```pythonregexp
(?:
(?<![a-zA-Z0-9+.-]) # lookbehind, avoid "Area51"
{number_pattern_no_groups}
)?
[ -]?(?:{numberwords_regex})
[ -]?(?:{numberwords_regex})?[ -]?(?:{numberwords_regex})?[ -]?(?:{numberwords_regex})?
[ -]?(?:{numberwords_regex})?[ -]?(?:{numberwords_regex})?[ -]?(?:{numberwords_regex})?
```
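
To illustrate the substitution mechanism only (the names and values below are
simplified stand-ins, not quantulum3's actual internals):

```python
import re

# Simplified template containing the two formatting groups described above.
TEXT_PATTERN = r"(?:{number_pattern_no_groups})?[ -]?(?:{numberwords_regex})"

pattern = TEXT_PATTERN.format(
    number_pattern_no_groups=r"\d+",
    numberwords_regex="|".join(["one", "two", "twenty", "hundred"]),
)
match = re.search(pattern, "they walked twenty miles")
print(match.group().strip())  # twenty
```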
`NUM_PATTERN` is a regex pattern describing the general form of a number.
It is important that the formatting groups below are contained in
the regex string (marked by enclosing `{}`). It is defined at the package level
and should not have to be changed for localization.

`{{3}}` is a case of escaped braces and will be formatted to `{3}`.

```pythonregexp
(?{number} # required number
[+-]? # optional sign
\.?\d+ # required digits
(?:[{grouping}]\d{{3}})* # allowed grouping
(?{decimals}[{decimal_operators}]\d+)? # optional decimals
)
(?{scale} # optional exponent
(?:{multipliers})? # multiplicative operators
(?{base}(E|e|\d+)\^?) # required exponent prefix
(?{exponent}[+-]?\d+|[{superscript}]) # required exponent, superscript or normal
)?
(?{fraction} # optional fraction
\ \d+/\d+|\ ?[{unicode_fract}]|/\d+
)?
```
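
A tiny demonstration of the brace escaping mentioned above:

```python
# `{{3}}` survives formatting as the literal quantifier `{3}`:
template = r"(?:[{grouping}]\d{{3}})*"
print(template.format(grouping=",."))  # (?:[,.]\d{3})*
```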

The rest should be self-explanatory by inspecting `en_US/regex.py`.


### parser.py, classifier.py, speak.py

Refer to the corresponding Python modules in `en_US`. They contain example code and
documentation for the necessary functions. Most of the currently included functions are required.


### train

This directory contains `json` files with training data for that language's classifier.
The serialized objects are lists containing dicts with two keys:

- `unit`: the unit which is associated with the following text
- `text`: a text associated with the unit. It will be tokenized and stemmed
and then be used as training data for the unit

All `json` files will be included for training. All other keys in the dicts
are ignored; they can be used for comments.
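
A hedged sketch of what one such training file might contain (the unit name
and text are illustrative):

```python
import json

# Only "unit" and "text" are used for training; extra keys such as
# "comment" are ignored and may document the entry.
entries = [
    {
        "unit": "metre per second",
        "text": "The sprinter reached a top speed of ten metres per second.",
        "comment": "hypothetical example entry",
    }
]
print(json.dumps(entries, indent=2))
```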

The classifier can later be trained via `scripts/train.py`.


### tests

This directory contains `json` files with test data for the language.
The tests are subdivided into `expand`, `quantities` and `quantities.ambiguity`.

- `expand`: tests whether parsing and replacing quantities in the text, correctly pluralized, works
- `quantities`: tests whether units are correctly parsed **without** a classifier
- `quantities.ambiguity`: tests whether the classifier succeeds at telling which unit
  was meant in the text

### units.json, entities.json

These files can be built exactly like the corresponding files at the package top level.

The units and entities defined here will be added if they are not defined in the global
files, or they override values of the global entries. `surfaces` and `URI` are fields that
should be overridden by all languages.
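
For instance, a language module might override only the localized fields of a
globally defined unit; a hedged sketch (the German surface forms are purely
illustrative):

```python
# Hypothetical entry for _lang/de_DE/units.json; fields that are not
# overridden here keep their values from the global units.json.
override = {
    "name": "metre per second",
    "surfaces": ["Meter pro Sekunde"],  # localized surface forms
    "URI": "Meter_pro_Sekunde",         # localized Wikipedia page name
}
```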

Empty file added quantulum3/_lang/__init__.py
4 changes: 4 additions & 0 deletions quantulum3/_lang/en_US/__init__.py
@@ -0,0 +1,4 @@
# Standard library
from pathlib import Path

lang = Path(__file__).parent.name
43 changes: 43 additions & 0 deletions quantulum3/_lang/en_US/classifier.py
@@ -0,0 +1,43 @@
# -*- coding: utf-8 -*-
"""
:mod:`Quantulum` classifier functions.
"""

# Standard library
import re
import string

try:
from stemming.porter2 import stem
except ImportError:
stem = None


################################################################################
def clean_text(text):
"""
Clean text for TFIDF
"""
if not stem:
raise ImportError("Module stemming is not installed.")

my_regex = re.compile(r'[%s]' % re.escape(string.punctuation))
new_text = my_regex.sub(' ', text)

new_text = [
stem(i) for i in new_text.lower().split()
if not re.findall(r'[0-9]', i)
]

new_text = ' '.join(new_text)

return new_text


################################################################################
def stop_words():
"""
Return the string identifying the stop-word language for the TFIDF vectorizer.
"""
return 'english'
File renamed without changes.
