Merge branch 'release/0.3.0'
kororo committed Jul 29, 2018
2 parents 6cb2cff + cc6587e commit c79f683
Showing 23 changed files with 447 additions and 171 deletions.
2 changes: 2 additions & 0 deletions .gitignore
@@ -13,3 +13,5 @@ tests/data/nlp_test
.tox
*\.coverage\.*
dist/
tests/data/export
tests/data/nlp
6 changes: 6 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,11 @@
# Change Logs

## 0.3.0
- Add Phase API
- Less forgiving data storage (when an unrecognised property is added)
- Add simple CLI
- Improve docs

## 0.2.4
- Enable py35
- Add remote download support
62 changes: 51 additions & 11 deletions README.rst
@@ -46,13 +46,14 @@ The **TRAIN_DATA**, describes sentences and annotated entities to be trained. It
excelcy = ExcelCy.execute(file_path='https://github.com/kororo/excelcy/raw/master/tests/data/test_data_01.xlsx')
# use the nlp object as per spaCy API
doc = excelcy.nlp('Google rebrands its business apps')
# or save it for faster bootstrap for application
# or save it to disk for a faster application bootstrap
excelcy.nlp.to_disk('/model')
ExcelCy is Friendly
-------------------

ExcelCy training is divided into phases, the example Excel file can be found in `tests/data/test_data_01.xlsx <https://github.com/kororo/excelcy/raw/master/tests/data/test_data_01.xlsx>`__ :
By default, ExcelCy training is divided into phases. An example Excel file can be found in `tests/data/test_data_01.xlsx <https://github.com/kororo/excelcy/raw/master/tests/data/test_data_01.xlsx>`__:

1. Discovery
^^^^^^^^^^^^
@@ -65,12 +66,14 @@ The first phase is to collect sentences from data source in sheet "source". The
2. Preparation
^^^^^^^^^^^^^^

Next phase, the sentences will be analysed in sheet "prepare", based on:
In the next phase, the Gold annotations need to be defined in sheet "prepare", based on:

- Current Data Model: Using spaCy API of **nlp(sentence).ents**
- Phrase pattern: Robertus Johansyah, Uber, Google, Amazon
- Regex pattern: ^([0-1]?[0-9]|2[0-3]):[0-5][0-9]$

All annotations here are considered Gold annotations, as described `here <https://spacy.io/usage/training#example-new-entity-type>`__.
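
The regex pattern listed above matches times on a 24-hour clock, such as 9:05 or 23:59. The following is a minimal sketch using only Python's standard ``re`` module, independent of ExcelCy; the ``is_time`` helper and the sample strings are illustrative assumptions rather than part of the project:

```python
import re

# Regex pattern copied from the "prepare" example above: H:MM or HH:MM on a 24-hour clock.
TIME_PATTERN = re.compile(r'^([0-1]?[0-9]|2[0-3]):[0-5][0-9]$')


def is_time(text: str) -> bool:
    """Return True when the whole string is a valid 24-hour time."""
    return TIME_PATTERN.match(text) is not None


if __name__ == '__main__':
    # Illustrative samples: two valid times, then an invalid hour and an invalid minute.
    for sample in ['9:05', '23:59', '24:00', '12:60']:
        print(sample, is_time(sample))
```

Because the pattern is anchored with ``^`` and ``$``, it only matches a value that consists of a time and nothing else.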

3. Training
^^^^^^^^^^^

@@ -81,6 +84,29 @@ Main phase of NER training, which described in `Simple Style Training <https://s

The last phase is to test and save the results, and to repeat the phases if required.

ExcelCy is Flexible
-------------------

Need more specific exports and phases? This can be controlled with the phase API. The following illustrates a real-world scenario:

1. Train from `tests/data/test_data_05.xlsx <https://github.com/kororo/excelcy/raw/master/tests/data/test_data_05.xlsx>`__

.. code-block:: bash

    # Note: this will create a directory and file "export/train_05.xlsx"
    $ excelcy execute https://github.com/kororo/excelcy/raw/master/tests/data/test_data_05.xlsx

2. Open the result in "export/train_05.xlsx". It shows all sentences identified from the given source. However, there is an error: "Himalayas" is identified as "PRODUCT".
3. To fix this, add a phrase matcher for "Himalayas = FAC", as illustrated in `tests/data/test_data_05a.xlsx <https://github.com/kororo/excelcy/raw/master/tests/data/test_data_05a.xlsx>`__
4. Train again and check the result in "export/train_05a.xlsx"

.. code-block:: bash

    # Note: this will create a directory and file "export/train_05a.xlsx"
    $ excelcy execute https://github.com/kororo/excelcy/raw/master/tests/data/test_data_05a.xlsx

5. Check that the nlp data model is backed up in "nlp" and that the result is corrected in "export/train_05a.xlsx"

ExcelCy is Comprehensive
------------------------

@@ -95,26 +121,26 @@ Under the hood, ExcelCy has strong and well-defined data storage. At any given p
# excelcy.load(file_path='test_data_01.xlsx')
# or define manually
excelcy.storage.config = Config(nlp_base='en_core_web_sm', train_iteration=2, train_drop=0.2)
print(json.dumps(excelcy.storage.items(), indent=2))
print(json.dumps(excelcy.storage.as_dict(), indent=2))
# add sources
excelcy.storage.source.add(kind='text', value='Robertus Johansyah is the maintainer ExcelCy')
excelcy.storage.source.add(kind='textract', value='tests/data/source/test_source_01.txt')
excelcy.discover()
print(json.dumps(excelcy.storage.items(), indent=2))
print(json.dumps(excelcy.storage.as_dict(), indent=2))
# add phrase matcher Robertus Johansyah -> PERSON
excelcy.storage.prepare.add(kind='phrase', value='Robertus Johansyah', entity='PERSON')
excelcy.prepare()
print(json.dumps(excelcy.storage.items(), indent=2))
print(json.dumps(excelcy.storage.as_dict(), indent=2))
# train it
excelcy.train()
print(json.dumps(excelcy.storage.items(), indent=2))
print(json.dumps(excelcy.storage.as_dict(), indent=2))
# test it
doc = excelcy.nlp('Robertus Johansyah is maintainer ExcelCy')
print(json.dumps(excelcy.storage.items(), indent=2))
print(json.dumps(excelcy.storage.as_dict(), indent=2))
Features
@@ -150,20 +176,31 @@ To train the spaCy model:
Note: `tests/data/test_data_01.xlsx <https://github.com/kororo/excelcy/raw/master/tests/data/test_data_01.xlsx>`__

CLI
---

ExcelCy has a basic CLI command, ``execute``:

.. code-block:: bash

    $ excelcy execute https://github.com/kororo/excelcy/raw/master/tests/data/test_data_01.xlsx

Data Definition
---------------

ExcelCy has a data definition, which is expressed in `api.yml <https://github.com/kororo/excelcy/raw/master/data/api.yml>`__. As long as data is given in this specific format and structure, ExcelCy will be able to support any type of data format. Check out the Excel file format in `api.xlsx <https://github.com/kororo/excelcy/raw/master/data/api.xlsx>`__. Data classes are defined with `attrs <https://github.com/python-attrs/attrs>`__; see `storage.py <https://github.com/kororo/excelcy/raw/master/excelcy/storage.py>`__ for more detail.


TODO
----

- [X] Start get cracking into spaCy

- [ ] More features and enhancements listed `here <https://github.com/kororo/excelcy/labels/enhancement>`__

- [ ] [`link <https://github.com/kororo/excelcy/issues/3>`__] Add CLI support
- [ ] [`link <https://github.com/kororo/excelcy/issues/4>`__] Add export outputs such as identified Entities, Tags
- [ ] [`link <https://github.com/kororo/excelcy/issues/5>`__] JSONL integration with Prodigy
- [ ] [`link <https://github.com/kororo/excelcy/issues/6>`__] Add enabled, notes columns
- [ ] Add special case for tokenisation described `here <https://spacy.io/usage/linguistic-features#special-cases>`__
- [ ] Add custom tags.
- [ ] Add classifier text training described `here <https://spacy.io/usage/training#textcat>`__
@@ -174,18 +211,21 @@ TODO
- [X] Add list of patterns easily (such as kitten breeds).
- [X] Add more data structure check in Excel and more warning messages
- [X] Add plugin, otherwise just extends for now.
- [X] [`link <https://github.com/kororo/excelcy/issues/4>`__] Add export outputs such as identified Entities, Tags
- [X] [`link <https://github.com/kororo/excelcy/issues/3>`__] Add CLI support
- [X] [`link <https://github.com/kororo/excelcy/issues/2>`__] Improve experience
- [X] [`link <https://github.com/kororo/excelcy/issues/1>`__] Add more file format such as YML, JSON. Make standardise and well documented on data structure.
- [X] Add support to accept sentences to Excel


- [X] Submit to Prodigy Universe

FAQ
---

**What is the idx column in the Excel sheet?**

The idea is to give reference between two things. Imagine in sheet "train", like to know where the sentence generated from in sheet "source".
The idea is to provide a reference between two things. For example, in sheet "train" you may want to know which row in sheet "source" a sentence was generated from. Also, because Excel lets you sort rows, idx acts as a safeguard that keeps things in the correct order.
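
As a rough illustration of this safeguard, the sketch below uses plain Python dictionaries to stand in for Excel rows. The row contents and the ``restore_order`` helper are made up for illustration and are not ExcelCy's API; only the ``idx`` key mirrors the column discussed above.

```python
# Hypothetical rows standing in for an Excel sheet.
rows = [
    {'idx': 1, 'sentence': 'Uber blew through $1 million a week'},
    {'idx': 2, 'sentence': 'Google rebrands its business apps'},
    {'idx': 3, 'sentence': 'Amazon opens a new office'},
]


def restore_order(items):
    """Return the rows sorted by their idx column, i.e. in their original order."""
    return sorted(items, key=lambda row: row['idx'])


# Simulate a user sorting the sheet alphabetically by sentence in Excel.
shuffled = sorted(rows, key=lambda row: row['sentence'])

# Sorting by idx recovers the original order regardless of how the sheet was sorted.
restored = restore_order(shuffled)
```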

**Can ExcelCy import/export to X, Y, Z data format?**

Binary file modified data/api.xlsx
Binary file not shown.
17 changes: 17 additions & 0 deletions data/api.yml
@@ -12,6 +12,23 @@ config:
  train_iteration: 2
  # X dropout rate based on https://spacy.io/usage/training#tips-dropout
  train_drop: 0.2
# list API execution to control the journey
phase:
  items:
    '1':
      idx: 1
      enabled: true
      notes: null
      fn: save_nlp
      args:
        key1: val1
        key2: val2
    '2':
      idx: 2
      enabled: true
      notes: null
      fn: discover
      args: {}
# data source to train
source:
  items:
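
One way to read the ``phase`` section above: each item names a function (``fn``) and its keyword arguments (``args``), and items run in ``idx`` order when ``enabled`` is true. The sketch below is an assumption about how such a list could be dispatched, not ExcelCy's actual implementation; ``run_phases`` and ``FakeRunner`` are hypothetical names introduced for illustration.

```python
from typing import Any, Dict, List


class FakeRunner:
    """Hypothetical stand-in for an object exposing phase functions by name."""

    def __init__(self) -> None:
        self.calls: List[str] = []

    def discover(self, **kwargs: Any) -> None:
        self.calls.append('discover')

    def save_nlp(self, **kwargs: Any) -> None:
        # Record the argument names so the dispatch is observable.
        self.calls.append('save_nlp:' + ','.join(sorted(kwargs)))


def run_phases(runner: Any, phases: Dict[str, Dict[str, Any]]) -> None:
    """Call each enabled phase's function on runner, ordered by idx."""
    for item in sorted(phases.values(), key=lambda p: p['idx']):
        if not item.get('enabled', True):
            continue
        fn = getattr(runner, item['fn'])  # look the phase function up by name
        fn(**(item.get('args') or {}))


# Same shape as the "phase.items" mapping in api.yml above.
phases = {
    '1': {'idx': 1, 'enabled': True, 'notes': None, 'fn': 'save_nlp',
          'args': {'key1': 'val1', 'key2': 'val2'}},
    '2': {'idx': 2, 'enabled': True, 'notes': None, 'fn': 'discover', 'args': {}},
}

runner = FakeRunner()
run_phases(runner, phases)
```

Looking functions up by name keeps the phase list purely declarative, at the cost of failing at run time if ``fn`` names a function that does not exist.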
2 changes: 1 addition & 1 deletion excelcy/__init__.py
@@ -1,5 +1,5 @@
from excelcy.excelcy import ExcelCy


__version__ = '0.2.4'
__version__ = '0.3.0'
__all__ = ['ExcelCy']
14 changes: 14 additions & 0 deletions excelcy/cli.py
@@ -0,0 +1,14 @@
import sys

from excelcy import ExcelCy


def main(argv: list = None):
    # quick CLI execution
    args = argv or sys.argv
    if args[1] == 'execute':
        excelcy = ExcelCy.execute(file_path=args[2])


if __name__ == '__main__':
    main()
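
The ``main`` function above indexes ``args[1]`` and ``args[2]`` directly, so it raises an ``IndexError`` when the sub-command or the file path is missing. A possible alternative, sketched below with the standard ``argparse`` module, is not part of this commit; ``parse_args`` is a hypothetical helper. It only parses arguments and leaves the call to ``ExcelCy.execute(file_path=...)`` to the caller.

```python
import argparse
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse 'excelcy execute <file_path>' style arguments."""
    parser = argparse.ArgumentParser(prog='excelcy')
    subparsers = parser.add_subparsers(dest='command')
    subparsers.required = True  # a sub-command must be given

    execute = subparsers.add_parser('execute', help='train from an Excel file or URL')
    execute.add_argument('file_path', help='local path or URL of the Excel file')

    # argv=None makes argparse fall back to sys.argv[1:]
    return parser.parse_args(argv)


args = parse_args(['execute', 'tests/data/test_data_01.xlsx'])
# With the excelcy package installed, the caller could then run:
#     ExcelCy.execute(file_path=args.file_path)
```

With this approach, missing arguments produce a usage message and a non-zero exit instead of a traceback.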