
Commit

Merge branch 'dev' into main
letuananh committed Apr 29, 2021
2 parents 18cf2ee + c544769 commit aae4f87
Showing 13 changed files with 194 additions and 186 deletions.
41 changes: 8 additions & 33 deletions README.md
@@ -28,13 +28,7 @@ pip install speach

## ELAN support

The speach library contains a command-line tool for converting EAF files into CSV.

```bash
python -m speach eaf2csv input_elan_file.eaf -o output_file_name.csv
```

For more complex analyses, speach Python scripts can be used to extract metadata and annotations from ELAN transcripts, for example:
Speach can be used to extract annotations as well as metadata from ELAN transcripts, for example:

``` python
from speach import elan
@@ -46,33 +40,14 @@ eaf = elan.open_eaf('./test/data/test.eaf')
for tier in eaf:
print(f"{tier.ID} | Participant: {tier.participant} | Type: {tier.type_ref}")
for ann in tier:
print(f"{ann.ID.rjust(4, ' ')}. [{ann.from_ts} -- {ann.to_ts}] {ann.text}")
print(f"{ann.ID.rjust(4, ' ')}. [{ann.from_ts} :: {ann.to_ts}] {ann.text}")
```
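
Transcript metadata can be read in the same way. The attribute names below (`author`, `date`, `media_file`, `media_url`) are taken from the docs/index.rst example in this same commit, so treat this as a sketch rather than the complete metadata API:

``` python
from speach import elan

eaf = elan.open_eaf('./test/data/test.eaf')
# document-level metadata
print(f"Author: {eaf.author} | Date: {eaf.date}")
print(f"Media file: {eaf.media_file} | Media URL: {eaf.media_url}")
```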

## Text corpus

```python
>>> from speach import ttl
>>> doc = ttl.Document('mydoc')
>>> sent = doc.new_sent("I am a sentence.")
>>> sent
#1: I am a sentence.
>>> sent.ID
1
>>> sent.text
'I am a sentence.'
>>> sent.import_tokens(["I", "am", "a", "sentence", "."])
>>> sent.tokens
[`I`<0:1>, `am`<2:4>, `a`<5:6>, `sentence`<7:15>, `.`<15:16>]
>>> doc.write_ttl()
```
Speach also provides command-line tools for processing EAF files.

Running the script above generates the following corpus files:

```
-rw-rw-r--. 1 tuananh tuananh 0 3月 29 13:10 mydoc_concepts.txt
-rw-rw-r--. 1 tuananh tuananh 0 3月 29 13:10 mydoc_links.txt
-rw-rw-r--. 1 tuananh tuananh 20 3月 29 13:10 mydoc_sents.txt
-rw-rw-r--. 1 tuananh tuananh 0 3月 29 13:10 mydoc_tags.txt
-rw-rw-r--. 1 tuananh tuananh 58 3月 29 13:10 mydoc_tokens.txt
```

```bash
# this command converts an EAF file into CSV
python -m speach eaf2csv input_elan_file.eaf -o output_file_name.csv
```

Read the [Speach documentation](https://speach.readthedocs.io/) for more information.
2 changes: 1 addition & 1 deletion docs/Makefile
@@ -14,7 +14,7 @@ help:


serve:
python -m http.server 7000 --bind 127.0.0.1 --directory ${BUILDDIR}/dirhtml
python -m http.server 7001 --bind 127.0.0.1 --directory ${BUILDDIR}/dirhtml

.PHONY: help Makefile

Binary file added docs/_static/images/ttl.png
46 changes: 11 additions & 35 deletions docs/api.rst
@@ -1,42 +1,18 @@
Speach APIs
===============
.. _api:

API Reference
=============

An overview of ``speach`` modules.

.. module:: speach

ELAN support
------------

speach supports reading and manipulating multi-tier transcriptions from ELAN directly.

.. automodule:: speach.elan
:members: open_eaf, parse_eaf_stream

.. autoclass:: ELANDoc
:members:
:member-order: groupwise

.. autoclass:: ELANTier
:members:
:member-order: groupwise

TTL Interlinear Gloss Format
----------------------------

TTLIG is a human-friendly interlinear gloss format that can be edited using any text editor.

.. module:: speach.ttlig

TTL SQLite
----------

TTL supports the SQLite storage format to manage large-scale corpora.

.. module:: speach.sqlite
Contents
--------

WebVTT
------
.. toctree::
:maxdepth: 2

Speach supports manipulating the Web Video Text Tracks format (WebVTT).
Read more on the :ref:`page_vtt` page.
api_elan
api_ttl
api_vtt
15 changes: 15 additions & 0 deletions docs/api_elan.rst
@@ -0,0 +1,15 @@
ELAN module
===========

``speach`` supports reading and manipulating multi-tier transcriptions from ELAN directly.

.. automodule:: speach.elan
:members: open_eaf, parse_eaf_stream

.. autoclass:: ELANDoc
:members:
:member-order: groupwise

.. autoclass:: ELANTier
:members:
:member-order: groupwise
62 changes: 62 additions & 0 deletions docs/api_ttl.rst
@@ -0,0 +1,62 @@
texttaglib module
=================

TTL (abbreviated from ``texttaglib``) is a Python implementation of the corpus linguistic method
described in :ref:`Tuan Anh (2019) <ta_2019>`.
TTL was designed to be a robust linguistic documentation framework which is flexible enough
to handle linguistic data from different sources
(Core NLP, ELAN, CoNLL, Semcor, Babelfy, Glosstag Wordnet, Tatoeba project, TSDB++, to name a few).

TTL can be used as a data interchange format for converting to and from different data formats.

.. image:: _static/images/ttl.png

Text corpus
-----------

>>> from speach import ttl
>>> doc = ttl.Document('mydoc')
>>> sent = doc.new_sent("I am a sentence.")
>>> sent
#1: I am a sentence.
>>> sent.ID
1
>>> sent.text
'I am a sentence.'
>>> sent.import_tokens(["I", "am", "a", "sentence", "."])
>>> sent.tokens
[`I`<0:1>, `am`<2:4>, `a`<5:6>, `sentence`<7:15>, `.`<15:16>]
>>> doc.write_ttl()

Running the script above generates the following corpus files:

::

-rw-rw-r--. 1 tuananh tuananh 0 3月 29 13:10 mydoc_concepts.txt
-rw-rw-r--. 1 tuananh tuananh 0 3月 29 13:10 mydoc_links.txt
-rw-rw-r--. 1 tuananh tuananh 20 3月 29 13:10 mydoc_sents.txt
-rw-rw-r--. 1 tuananh tuananh 0 3月 29 13:10 mydoc_tags.txt
-rw-rw-r--. 1 tuananh tuananh 58 3月 29 13:10 mydoc_tokens.txt

TIG - TTL Interlinear Gloss format
----------------------------------

TIG is a human-friendly interlinear gloss format that can be edited using any text editor.

.. module:: speach.tig
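
To give a general feel for interlinear glossing, here is a purely illustrative entry in the common three-line style (source text, morpheme glosses, free translation); it shows the idea only and is not a specification of the actual TIG syntax:

::

    inu  ga   hashit-te  iru
    dog  NOM  run-GER    PROG
    'The dog is running.'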

TTL SQLite
----------

TTL supports the SQLite storage format to manage large-scale corpora.

.. module:: speach.sqlite
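
The storage idea can be sketched with Python's standard ``sqlite3`` module; note that this illustrates the concept only and does not use the actual ``speach.sqlite`` API:

.. code-block:: python

   import sqlite3

   # a minimal corpus store: one table for sentences, one for tokens
   con = sqlite3.connect("mydoc.db")
   con.execute("CREATE TABLE IF NOT EXISTS sent (ID INTEGER PRIMARY KEY, text TEXT)")
   con.execute("CREATE TABLE IF NOT EXISTS token (sid INTEGER, cfrom INTEGER, cto INTEGER, text TEXT)")
   con.execute("INSERT INTO sent VALUES (1, 'I am a sentence.')")
   con.executemany("INSERT INTO token VALUES (1, ?, ?, ?)",
                   [(0, 1, 'I'), (2, 4, 'am'), (5, 6, 'a'), (7, 15, 'sentence'), (15, 16, '.')])
   con.commit()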

References
----------

.. _ta_2019:

- Le, T. A. (2019). *Developing and applying an integrated semantic framework for natural language understanding* (pp. 69-78).
  `DOI:10.32657/10220/49370 <https://doi.org/10.32657/10220/49370>`_

7 changes: 2 additions & 5 deletions docs/api_vtt.rst
@@ -1,13 +1,10 @@
.. _page_vtt:

Web VTT APIs
============
Web VTT module
==============

Speach supports WebVTT, the Web Video Text Tracks format.
Read more about it at https://www.w3.org/2013/07/webvtt.html

APIs
----

.. automodule:: speach.vtt
:members: sec2ts, ts2sec
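
A quick, hedged illustration of the two helpers listed above; the exact timestamp formatting is an assumption, not verified output:

.. code-block:: python

   from speach import vtt

   # seconds -> WebVTT timestamp string (illustrative shape: HH:MM:SS.mmm)
   print(vtt.sec2ts(10.5))             # e.g. 00:00:10.500
   # timestamp string -> seconds
   print(vtt.ts2sec("00:00:10.500"))   # e.g. 10.5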
65 changes: 65 additions & 0 deletions docs/elan.rst
@@ -0,0 +1,65 @@
ELAN Recipes
============

Common snippets for processing ELAN transcriptions with ``speach``.

Open an ELAN file
-----------------

>>> from speach import elan
>>> eaf = elan.open_eaf('./data/test.eaf')
>>> eaf
<speach.elan.ELANDoc object at 0x7f67790593d0>

Parse an existing text stream
-----------------------------

If you have an input stream ready, you can parse its content with the :code:`parse_eaf_stream()` function.

.. code-block:: python

   >>> from speach import elan
   >>> with open('./data/test.eaf') as eaf_stream:
   ...     eaf = elan.parse_eaf_stream(eaf_stream)
   ...
   >>> eaf
   <speach.elan.ELANDoc object at 0x7f6778f7a9d0>

Accessing tiers & annotations
-----------------------------

You can loop through all tiers in an ``ELANDoc`` object (i.e. an eaf file)
and all annotations in each tier using Python's ``for ... in ...`` loops.
For example:

.. code-block:: python

   for tier in eaf:
       print(f"{tier.ID} | Participant: {tier.participant} | Type: {tier.type_ref}")
       for ann in tier:
           print(f"{ann.ID.rjust(4, ' ')}. [{ann.from_ts.ts} -- {ann.to_ts.ts}] {ann.text}")

Accessing nested tiers in ELAN
------------------------------

If you want to loop through the root tiers only, you can use the :code:`roots` list of an ``ELANDoc``:

.. code-block:: python

   eaf = elan.open_eaf('./data/test_nested.eaf')
   # accessing nested tiers
   for tier in eaf.roots:
       print(f"{tier.ID} | Participant: {tier.participant} | Type: {tier.type_ref}")
       for child_tier in tier.children:
           print(f"    | {child_tier.ID} | Participant: {child_tier.participant} | Type: {child_tier.type_ref}")
           for ann in child_tier.annotations:
               print(f"    |- {ann.ID.rjust(4, ' ')}. [{ann.from_ts} -- {ann.to_ts}] {ann.text}")

Converting ELAN files to CSV
----------------------------

``speach`` includes a command line tool to convert an EAF file into CSV.

.. code-block:: bash

   python -m speach eaf2csv my_transcript.eaf -o my_transcript.csv
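
Filtering annotations by participant
------------------------------------

A small follow-up recipe, built only from the tier and annotation API shown above, collects every annotation text for a single speaker; ``"Speaker1"`` is a placeholder participant name, not a value from the sample data:

.. code-block:: python

   from speach import elan

   eaf = elan.open_eaf('./data/test.eaf')
   # gather the text of every annotation belonging to one participant
   speaker_lines = []
   for tier in eaf:
       if tier.participant == "Speaker1":
           for ann in tier:
               speaker_lines.append(ann.text)
   print("\n".join(speaker_lines))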
66 changes: 13 additions & 53 deletions docs/index.rst
@@ -31,73 +31,33 @@ Speach can be used to extract annotations as well as metadata from ELAN transcripts

.. code:: python
from speach import elan
from speach import elan
# Test ELAN reader function in speach
eaf = elan.open_eaf('./data/test.eaf')
# Test ELAN reader function in speach
eaf = elan.open_eaf('./test/data/test.eaf')
# accessing metadata
print(f"Author: {eaf.author} | Date: {eaf.date} | Format: {eaf.fileformat} | Version: {eaf.version}")
print(f"Media file: {eaf.media_file}")
print(f"Time units: {eaf.time_units}")
print(f"Media URL: {eaf.media_url} | MIME type: {eaf.mime_type}")
print(f"Media relative URL: {eaf.relative_media_url}")
# accessing tiers & annotations
for tier in eaf:
print(f"{tier.ID} | Participant: {tier.participant} | Type: {tier.type_ref}")
for ann in tier:
print(f"{ann.ID.rjust(4, ' ')}. [{ann.from_ts} :: {ann.to_ts}] {ann.text}")
# accessing tiers & annotations
for tier in eaf.tiers():
print(f"{tier.ID} | Participant: {tier.participant} | Type: {tier.type_ref}")
for ann in tier.annotations:
print(f"{ann.ID.rjust(4, ' ')}. [{ann.from_ts.ts} -- {ann.to_ts.ts}] {ann.value}")
Speach contains a command-line tool for converting EAF files into CSV.
Speach also provides command-line tools for processing EAF files.

.. code:: bash
# this command converts an EAF file into CSV
python -m speach eaf2csv input_elan_file.eaf -o output_file_name.csv
Text corpus
-----------

>>> from speach import ttl
>>> doc = ttl.Document('mydoc')
>>> sent = doc.new_sent("I am a sentence.")
>>> sent
#1: I am a sentence.
>>> sent.ID
1
>>> sent.text
'I am a sentence.'
>>> sent.import_tokens(["I", "am", "a", "sentence", "."])
>>> sent.tokens
[`I`<0:1>, `am`<2:4>, `a`<5:6>, `sentence`<7:15>, `.`<15:16>]
>>> doc.write_ttl()

Running the script above generates the following corpus files:

::

-rw-rw-r--. 1 tuananh tuananh 0 3月 29 13:10 mydoc_concepts.txt
-rw-rw-r--. 1 tuananh tuananh 0 3月 29 13:10 mydoc_links.txt
-rw-rw-r--. 1 tuananh tuananh 20 3月 29 13:10 mydoc_sents.txt
-rw-rw-r--. 1 tuananh tuananh 0 3月 29 13:10 mydoc_tags.txt
-rw-rw-r--. 1 tuananh tuananh 58 3月 29 13:10 mydoc_tokens.txt

SQLite support
--------------

TTL data can be stored in a SQLite database for better corpus analysis.

Table of contents
-----------------
More information:

.. toctree::
:maxdepth: 2
:maxdepth: 1

tutorials
recipes
api

Useful Links
------------

