
Commit

docs: Move citations into a global references ReST file
lentinj committed Nov 16, 2018
1 parent e519c52 commit a0ae951
Showing 4 changed files with 11 additions and 15 deletions.
1 change: 1 addition & 0 deletions docs/index.rst

@@ -12,6 +12,7 @@ CLiC
    advanced
    appendices
    footnotes
+   references

 * :ref:`genindex`
 * :ref:`search`
6 changes: 6 additions & 0 deletions docs/references.rst

@@ -0,0 +1,6 @@
+References
+==========
+
+.. [ICU] http://userguide.icu-project.org/boundaryanalysis
+.. [UAX29] https://www.unicode.org/reports/tr29/tr29-33.html#Word_Boundaries
+.. [UNIDECODE] https://pypi.org/project/Unidecode/
6 changes: 1 addition & 5 deletions server/clic/region/chapter.py

@@ -149,7 +149,7 @@
 blank line in the text).

 ``chapter.paragraph`` are then broken up into ``chapter.sentence``, using the
-Unicode sentence segmentation in [UAX29], using the implementation in the [ICU]
+Unicode sentence segmentation in [UAX29]_, using the implementation in the [ICU]_
 library.

 * We use the ``en_GB@ss=standard`` locale (ss=standard tells ICU to not treat

@@ -207,10 +207,6 @@
  ('chapter.sentence', 141, 236, 2, 'Above the door was p...Oliver, News Agent."'),
  ('chapter.paragraph', 238, 395, 2, 'So if you wish to st...et all these things.'),
  ('chapter.sentence', 238, 395, 3, 'So if you wish to st...et all these things.')]
-
-.. [ICU] http://userguide.icu-project.org/boundaryanalysis
-.. [UAX29] https://www.unicode.org/reports/tr29/tr29-33.html#Word_Boundaries
-.. [UNIDECODE] https://pypi.org/project/Unidecode/
 """
 import re
13 changes: 3 additions & 10 deletions server/clic/tokenizer.py

@@ -7,8 +7,8 @@
 Method
 ------

-To extract tokens, we use Unicode text segmentation as described in [UAX29],
-using the implementation in the [ICU] library and standard rules for en_GB, and
+To extract tokens, we use Unicode text segmentation as described in [UAX29]_,
+using the implementation in the [ICU]_ library and standard rules for en_GB, and
 then apply our own additions (see later).

 Please read the document for a full description of ICU word boundaries, however

@@ -43,7 +43,7 @@
 Tokens are then normalised into types by:-

 * Lower-casing, ``The`` -> ``the``.
-* Normalising any non-ascii characters with [UNIDECODE], e.g.
+* Normalising any non-ascii characters with [UNIDECODE]_, e.g.

   * ``can’t`` -> ``can't``.
   * ``café`` -> ``cafe``.

 * Removing any surrounding underscores, e.g. ``_connoisseur_`` -> ``connoisseur``.

@@ -133,13 +133,6 @@
 ... ''')]
 ['we', 'have', 'books', 'everywhere',
  'moo', 'oi', 'nk']
-
-References
-----------
-
-.. [ICU] http://userguide.icu-project.org/boundaryanalysis
-.. [UAX29] https://www.unicode.org/reports/tr29/tr29-33.html#Word_Boundaries
-.. [UNIDECODE] https://pypi.org/project/Unidecode/
 """
 import re
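
The tokenizer docstring above lists how tokens are normalised into types: lower-casing, ASCII-folding non-ASCII characters, and stripping surrounding underscores. A standard-library sketch of those steps follows. Note the real code cites the third-party Unidecode package; here ``unicodedata`` NFKD folding plus an explicit apostrophe mapping stands in for it, and covers far fewer characters:

```python
# Sketch of the token -> type normalisation steps listed in the docstring.
# Stands in for Unidecode with stdlib unicodedata; illustration only.
import unicodedata

# NFKD folding does not map the curly apostrophe, so translate it explicitly
# (a deliberately tiny subset of what Unidecode handles).
PUNCT = {ord('\u2019'): "'"}

def normalise(token):
    """Lower-case, strip surrounding underscores, and fold to ASCII."""
    t = token.lower().strip('_')
    t = t.translate(PUNCT)
    # NFKD decomposes accented characters; encoding to ASCII drops the
    # combining marks, e.g. 'café' -> 'cafe'
    return unicodedata.normalize('NFKD', t).encode('ascii', 'ignore').decode('ascii')

print(normalise('The'))            # the
print(normalise('can\u2019t'))     # can't
print(normalise('caf\u00e9'))      # cafe
print(normalise('_connoisseur_'))  # connoisseur
```

These are exactly the worked examples from the docstring, so the sketch can be checked against them directly.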
