Merge branch 'release/0.3.0'

kororo · Jul 29, 2018 · c79f683
2 parents 6cb2cff + cc6587e
commit c79f683
Showing 23 changed files with 447 additions and 171 deletions.
diff --git a/.gitignore b/.gitignore
@@ -13,3 +13,5 @@ tests/data/nlp_test
 .tox
 *\.coverage\.*
 dist/
+tests/data/export
+tests/data/nlp
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,5 +1,11 @@
 # Change Logs
 
+## 0.3.0
+- Add Phase API
+- Less forgiving data storage (when an unrecognised property is added)
+- Add simple CLI
+- Improve docs
+
 ## 0.2.4
 - Enable py35
 - Add remote download support

diff --git a/README.rst b/README.rst
@@ -46,13 +46,14 @@ The **TRAIN_DATA**, describes sentences and annotated entities to be trained. It
     excelcy = ExcelCy.execute(file_path='https://github.com/kororo/excelcy/raw/master/tests/data/test_data_01.xlsx')
     # use the nlp object as per spaCy API
     doc = excelcy.nlp('Google rebrands its business apps')
-    # or save it for faster bootstrap for application
+    # or save it for faster bootstrap of the application
     excelcy.nlp.to_disk('/model')
 
+
 ExcelCy is Friendly
 -------------------
 
-ExcelCy training is divided into phases, the example Excel file can be found in `tests/data/test_data_01.xlsx <https://github.com/kororo/excelcy/raw/master/tests/data/test_data_01.xlsx>`__ :
+By default, ExcelCy training is divided into phases. An example Excel file can be found in `tests/data/test_data_01.xlsx <https://github.com/kororo/excelcy/raw/master/tests/data/test_data_01.xlsx>`__:
 
 1. Discovery
 ^^^^^^^^^^^^
@@ -65,12 +66,14 @@ The first phase is to collect sentences from data source in sheet "source". The
 2. Preparation
 ^^^^^^^^^^^^^^
 
-Next phase, the sentences will be analysed in sheet "prepare", based on:
+In the next phase, the Gold annotations need to be defined in sheet "prepare", based on:
 
 - Current Data Model: Using spaCy API of **nlp(sentence).ents**
 - Phrase pattern: Robertus Johansyah, Uber, Google, Amazon
 - Regex pattern: ^([0-1]?[0-9]|2[0-3]):[0-5][0-9]$
 
+All annotations here are considered Gold annotations, which are described `here <https://spacy.io/usage/training#example-new-entity-type>`__.
+
 3. Training
 ^^^^^^^^^^^
 
@@ -81,6 +84,29 @@ Main phase of NER training, which described in `Simple Style Training <https://s
 
 The last phase, is to test/save the results and repeat the phases if required.
 
+ExcelCy is Flexible
+-------------------
+
+Need more specific exports and phases? They can be controlled using the phase API. This illustrates a real-world scenario:
+
+1. Train from `tests/data/test_data_05.xlsx <https://github.com/kororo/excelcy/raw/master/tests/data/test_data_05.xlsx>`__
+
+    .. code-block:: bash
+
+        # Note: this will create a directory and file "export/train_05.xlsx"
+        $ excelcy execute https://github.com/kororo/excelcy/raw/master/tests/data/test_data_05.xlsx
+
+2. Open the result in "export/train_05.xlsx"; it shows all sentences identified from the given source. However, there is an error: "Himalayas" is identified as "PRODUCT".
+3. To fix this, add a phrase matcher for "Himalayas = FAC". This is illustrated in `tests/data/test_data_05a.xlsx <https://github.com/kororo/excelcy/raw/master/tests/data/test_data_05a.xlsx>`__
+4. Train again and check the result in "export/train_05a.xlsx"
+
+    .. code-block:: bash
+
+        # Note: this will create a directory and file "export/train_05a.xlsx"
+        $ excelcy execute https://github.com/kororo/excelcy/raw/master/tests/data/test_data_05a.xlsx
+
+5. Check that the nlp data model is backed up in "nlp" and that the result is corrected in "export/train_05a.xlsx"
+
 ExcelCy is Comprehensive
 ------------------------
 
@@ -95,26 +121,26 @@ Under the hood, ExcelCy has strong and well-defined data storage. At any given p
     # excelcy.load(file_path='test_data_01.xlsx')
     # or define manually
     excelcy.storage.config = Config(nlp_base='en_core_web_sm', train_iteration=2, train_drop=0.2)
-    print(json.dumps(excelcy.storage.items(), indent=2))
+    print(json.dumps(excelcy.storage.as_dict(), indent=2))
 
     # add sources
     excelcy.storage.source.add(kind='text', value='Robertus Johansyah is the maintainer ExcelCy')
     excelcy.storage.source.add(kind='textract', value='tests/data/source/test_source_01.txt')
     excelcy.discover()
-    print(json.dumps(excelcy.storage.items(), indent=2))
+    print(json.dumps(excelcy.storage.as_dict(), indent=2))
 
     # add phrase matcher Robertus Johansyah -> PERSON
     excelcy.storage.prepare.add(kind='phrase', value='Robertus Johansyah', entity='PERSON')
     excelcy.prepare()
-    print(json.dumps(excelcy.storage.items(), indent=2))
+    print(json.dumps(excelcy.storage.as_dict(), indent=2))
 
     # train it
     excelcy.train()
-    print(json.dumps(excelcy.storage.items(), indent=2))
+    print(json.dumps(excelcy.storage.as_dict(), indent=2))
 
     # test it
     doc = excelcy.nlp('Robertus Johansyah is maintainer ExcelCy')
-    print(json.dumps(excelcy.storage.items(), indent=2))
+    print(json.dumps(excelcy.storage.as_dict(), indent=2))
 
 
 Features
@@ -150,20 +176,31 @@ To train the spaCy model:
 
 Note: `tests/data/test_data_01.xlsx <https://github.com/kororo/excelcy/raw/master/tests/data/test_data_01.xlsx>`__
 
+CLI
+---
+
+ExcelCy has a basic CLI command for execute:
+
+.. code-block:: bash
+
+    $ excelcy execute https://github.com/kororo/excelcy/raw/master/tests/data/test_data_01.xlsx
+
+
 Data Definition
 ---------------
 
 ExcelCy has a data definition, which is expressed in `api.yml <https://github.com/kororo/excelcy/raw/master/data/api.yml>`__. As long as the data is given in this specific format and structure, ExcelCy will be able to support any type of data format. Check out the Excel file format in `api.xlsx <https://github.com/kororo/excelcy/raw/master/data/api.xlsx>`__. Data classes are defined with `attrs <https://github.com/python-attrs/attrs>`__; check `storage.py <https://github.com/kororo/excelcy/raw/master/excelcy/storage.py>`__ for more detail.
 
+
 TODO
 ----
 
 - [X] Start get cracking into spaCy
 
 - [ ] More features and enhancements listed `here <https://github.com/kororo/excelcy/labels/enhancement>`__
 
-    - [ ] [`link <https://github.com/kororo/excelcy/issues/3>`__] Add CLI support
-    - [ ] [`link <https://github.com/kororo/excelcy/issues/4>`__] Add export outputs such as identified Entities, Tags
+    - [ ] [`link <https://github.com/kororo/excelcy/issues/5>`__] JSONL integration with Prodigy
+    - [ ] [`link <https://github.com/kororo/excelcy/issues/6>`__] Add enabled, notes columns
     - [ ] Add special case for tokenisation described `here <https://spacy.io/usage/linguistic-features#special-cases>`__
     - [ ] Add custom tags.
     - [ ] Add classifier text training described `here <https://spacy.io/usage/training#textcat>`__
@@ -174,18 +211,21 @@ TODO
     - [X] Add list of patterns easily (such as kitten breed.
     - [X] Add more data structure check in Excel and more warning messages
     - [X] Add plugin, otherwise just extends for now.
+    - [X] [`link <https://github.com/kororo/excelcy/issues/4>`__] Add export outputs such as identified Entities, Tags
+    - [X] [`link <https://github.com/kororo/excelcy/issues/3>`__] Add CLI support
     - [X] [`link <https://github.com/kororo/excelcy/issues/2>`__] Improve experience
     - [X] [`link <https://github.com/kororo/excelcy/issues/1>`__] Add more file format such as YML, JSON. Make standardise and well documented on data structure.
     - [X] Add support to accept sentences to Excel
 
+
 - [X] Submit to Prodigy Universe
 
 FAQ
 ---
 
 **What is that idx columns in the Excel sheet?**
 
-The idea is to give reference between two things. Imagine in sheet "train", like to know where the sentence generated from in sheet "source".
+The idea is to give a reference between two things. For example, in sheet "train" you may want to know which row in sheet "source" a sentence was generated from. Also, because rows in Excel can be sorted, idx is a safeguard to keep things in the correct order.
 
 **Can ExcelCy import/export to X, Y, Z data format?**
 

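For reference, the regex pattern listed in the README's Preparation section above (`^([0-1]?[0-9]|2[0-3]):[0-5][0-9]$`) matches 24-hour clock times. The sketch below checks it with Python's standard `re` module only. The helper name `is_time` is hypothetical and is not part of ExcelCy, and how ExcelCy itself applies the pattern is not shown in this diff.

```python
import re

# Pattern copied verbatim from the README's "Regex pattern" bullet.
TIME_PATTERN = re.compile(r'^([0-1]?[0-9]|2[0-3]):[0-5][0-9]$')


def is_time(text: str) -> bool:
    """Return True if the README pattern matches the given text."""
    # is_time is a hypothetical helper name used only for this illustration.
    return TIME_PATTERN.match(text) is not None


if __name__ == '__main__':
    for candidate in ['09:30', '7:05', '23:59', '24:00', '12:60']:
        print(candidate, is_time(candidate))
```

Because the pattern is anchored with `^` and `$`, it accepts a value that is a time in its entirety (hours 0 to 23, minutes 00 to 59), not a time embedded in a longer sentence.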
diff --git a/data/api.xlsx b/data/api.xlsx
diff --git a/data/api.yml b/data/api.yml
@@ -12,6 +12,23 @@ config:
   train_iteration: 2
   # X dropout rate based on https://spacy.io/usage/training#tips-dropout
   train_drop: 0.2
+# list API execution to control the journey
+phase:
+  items:
+    '1':
+      idx: 1
+      enabled: true
+      notes: null
+      fn: save_nlp
+      args:
+        key1: val1
+        key2: val2
+    '2':
+      idx: 2
+      enabled: true
+      notes: null
+      fn: discover
+      args: {}
 # data source to train
 source:
   items:

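One way to read the `phase` section added to `api.yml` above: each item names a function (`fn`) to call with keyword arguments (`args`), in `idx` order, skipping items that are not `enabled`. The sketch below illustrates that reading only. `DemoRunner` and `run_phases` are hypothetical names, the third (disabled) item is invented for the demonstration, and this is not taken from ExcelCy's actual implementation.

```python
from typing import Any, Dict, List, Tuple

# Phase table mirroring the YAML structure in api.yml.
# Item '3' is an invented example of a disabled phase.
PHASES: Dict[str, Dict[str, Any]] = {
    '2': {'idx': 2, 'enabled': True, 'notes': None, 'fn': 'discover', 'args': {}},
    '1': {'idx': 1, 'enabled': True, 'notes': None, 'fn': 'save_nlp',
          'args': {'key1': 'val1', 'key2': 'val2'}},
    '3': {'idx': 3, 'enabled': False, 'notes': 'skipped', 'fn': 'train', 'args': {}},
}


class DemoRunner:
    """Hypothetical stand-in that records calls instead of doing NLP work."""

    def __init__(self) -> None:
        self.calls: List[Tuple[str, Dict[str, Any]]] = []

    def save_nlp(self, **kwargs: Any) -> None:
        self.calls.append(('save_nlp', kwargs))

    def discover(self, **kwargs: Any) -> None:
        self.calls.append(('discover', kwargs))

    def train(self, **kwargs: Any) -> None:
        self.calls.append(('train', kwargs))


def run_phases(runner: DemoRunner, phases: Dict[str, Dict[str, Any]]) -> List[Tuple[str, Dict[str, Any]]]:
    """Call each enabled phase on the runner, ordered by its idx field."""
    for phase in sorted(phases.values(), key=lambda item: item['idx']):
        if not phase.get('enabled', True):
            continue  # disabled phases are skipped
        method = getattr(runner, phase['fn'])  # look up the method named by fn
        method(**(phase.get('args') or {}))
    return runner.calls


if __name__ == '__main__':
    print(run_phases(DemoRunner(), PHASES))
```

Sorting by `idx` rather than by dictionary key keeps the execution order stable even if the items are reordered in the source file.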
diff --git a/excelcy/__init__.py b/excelcy/__init__.py
@@ -1,5 +1,5 @@
 from excelcy.excelcy import ExcelCy
 
 
-__version__ = '0.2.4'
+__version__ = '0.3.0'
 __all__ = ['ExcelCy']
diff --git a/excelcy/cli.py b/excelcy/cli.py
@@ -0,0 +1,14 @@
+import sys
+
+from excelcy import ExcelCy
+
+
+def main(argv: list = None):
+    # quick CLI execution: excelcy execute <file_path>
+    args = argv or sys.argv
+    if len(args) > 2 and args[1] == 'execute':
+        ExcelCy.execute(file_path=args[2])
+
+
+if __name__ == '__main__':
+    main()
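
The quick CLI above reads positional values straight from the argument list. A possible alternative, sketched here with the standard `argparse` module, gives usage help and validation for free. This is an illustration, not part of the commit: `build_parser` and the `executor` parameter are hypothetical, and a real entry point would pass `ExcelCy.execute` where the sketch uses a stand-in. Unlike the original `main`, this `argv` excludes the program name.

```python
import argparse
from typing import Callable, List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with a single 'execute' sub-command."""
    parser = argparse.ArgumentParser(prog='excelcy')
    subparsers = parser.add_subparsers(dest='command')
    execute = subparsers.add_parser('execute', help='run all phases for a data file')
    execute.add_argument('file_path', help='path or URL of the data file')
    return parser


def main(argv: Optional[List[str]] = None,
         executor: Callable[[str], object] = print) -> int:
    """Parse argv (defaults to sys.argv[1:]) and dispatch the sub-command.

    executor is a hypothetical hook; the default just prints the file path.
    """
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.command == 'execute':
        executor(args.file_path)
        return 0
    # No sub-command given: show usage and signal failure.
    parser.print_help()
    return 1


if __name__ == '__main__':
    raise SystemExit(main())
```

Injecting the executor keeps the sketch runnable without ExcelCy or spaCy installed and makes the dispatch logic easy to test.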