Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,42 +1,18 @@ | ||
Speach APIs | ||
=============== | ||
.. _api: | ||
|
||
API Reference | ||
============= | ||
|
||
An overview of ``speach`` modules. | ||
|
||
.. module:: speach | ||
|
||
ELAN support | ||
------------ | ||
|
||
speach supports reading and manipulating multi-tier transcriptions from ELAN directly. | ||
|
||
.. automodule:: speach.elan | ||
   :members: open_eaf, parse_eaf_stream | ||
|
||
.. autoclass:: ELANDoc | ||
   :members: | ||
   :member-order: groupwise | ||
|
||
.. autoclass:: ELANTier | ||
   :members: | ||
   :member-order: groupwise | ||
|
||
TTL Interlinear Gloss Format | ||
---------------------------- | ||
|
||
TTLIG is a human-friendly interlinear gloss format that can be edited using any text editor. | ||
|
||
.. module:: speach.ttlig | ||
|
||
TTL SQLite | ||
---------- | ||
|
||
TTL supports an SQLite storage format for managing large-scale corpora. | ||
|
||
.. module:: speach.sqlite | ||
Contents | ||
-------- | ||
|
||
WebVTT | ||
------ | ||
.. toctree:: | ||
   :maxdepth: 2 | ||
|
||
Speach supports manipulating the Web Video Text Tracks format (WebVTT). | ||
Read more on the :ref:`page_vtt` page. | ||
   api_elan | ||
   api_ttl | ||
   api_vtt |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,15 @@ | ||
ELAN module | ||
=========== | ||
|
||
``speach`` supports reading and manipulating multi-tier transcriptions from ELAN directly. | ||
|
||
.. automodule:: speach.elan | ||
   :members: open_eaf, parse_eaf_stream | ||
|
||
.. autoclass:: ELANDoc | ||
   :members: | ||
   :member-order: groupwise | ||
|
||
.. autoclass:: ELANTier | ||
   :members: | ||
   :member-order: groupwise |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,62 @@ | ||
texttaglib module | ||
================= | ||
|
||
TTL (short for ``texttaglib``) is a Python implementation of the corpus linguistic method | ||
described in :ref:`Tuan Anh (2019) <ta_2019>`. | ||
TTL is designed to be a robust linguistic documentation framework that is flexible enough | ||
to handle linguistic data from many different sources | ||
(Core NLP, ELAN, CoNLL, Semcor, Babelfy, Glosstag Wordnet, Tatoeba project, TSDB++, to name a few). | ||
|
||
TTL can be used as a data interchange format for converting to and from different data formats. | ||
|
||
.. image:: _static/images/ttl.png | ||
|
||
Text corpus | ||
----------- | ||
|
||
>>> from speach import ttl | ||
>>> doc = ttl.Document('mydoc') | ||
>>> sent = doc.new_sent("I am a sentence.") | ||
>>> sent | ||
#1: I am a sentence. | ||
>>> sent.ID | ||
1 | ||
>>> sent.text | ||
'I am a sentence.' | ||
>>> sent.import_tokens(["I", "am", "a", "sentence", "."]) | ||
>>> sent.tokens | ||
[`I`<0:1>, `am`<2:4>, `a`<5:6>, `sentence`<7:15>, `.`<15:16>] | ||
>>> doc.write_ttl() | ||
|
||
The script above generates the following corpus files: | ||
|
||
:: | ||
|
||
    -rw-rw-r--. 1 tuananh tuananh 0 3月 29 13:10 mydoc_concepts.txt | ||
    -rw-rw-r--. 1 tuananh tuananh 0 3月 29 13:10 mydoc_links.txt | ||
    -rw-rw-r--. 1 tuananh tuananh 20 3月 29 13:10 mydoc_sents.txt | ||
    -rw-rw-r--. 1 tuananh tuananh 0 3月 29 13:10 mydoc_tags.txt | ||
    -rw-rw-r--. 1 tuananh tuananh 58 3月 29 13:10 mydoc_tokens.txt | ||
|
||
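The token spans printed in the doctest above (for example, ``I`` at ``<0:1>``) are character offsets into the sentence text. As a rough illustration only, and not necessarily how ``speach`` computes them, offsets like these can be recovered by scanning the sentence from left to right. The helper name ``align_tokens`` is hypothetical and is not part of the library.

```python
def align_tokens(text, tokens):
    """Return (token, start, end) triples by scanning text left to right.

    Hypothetical helper for illustration; not part of speach.
    """
    spans = []
    cursor = 0  # position in text where the search for the next token starts
    for tok in tokens:
        start = text.find(tok, cursor)
        if start < 0:
            raise ValueError(f"token {tok!r} not found after offset {cursor}")
        end = start + len(tok)  # end offset is exclusive
        spans.append((tok, start, end))
        cursor = end
    return spans


spans = align_tokens("I am a sentence.", ["I", "am", "a", "sentence", "."])
for tok, start, end in spans:
    print(f"`{tok}`<{start}:{end}>")
```

Advancing the cursor past each match keeps repeated tokens (such as a second ``a``) from being aligned to an earlier occurrence.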
TIG - TTL Interlinear Gloss format | ||
---------------------------------- | ||
|
||
TIG is a human-friendly interlinear gloss format that can be edited using any text editor. | ||
|
||
.. module:: speach.tig | ||
|
||
TTL SQLite | ||
---------- | ||
|
||
TTL supports an SQLite storage format for managing large-scale corpora. | ||
|
||
.. module:: speach.sqlite | ||
|
||
References | ||
---------- | ||
|
||
.. _ta_2019: | ||
|
||
- Le, T. A. (2019). *Developing and applying an integrated semantic framework for natural language understanding (pp. 69-78)*. | ||
`DOI:10.32657/10220/49370 <https://doi.org/10.32657/10220/49370>`_ | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,13 +1,10 @@ | ||
.. _page_vtt: | ||
|
||
Web VTT APIs | ||
============ | ||
Web VTT module | ||
============== | ||
|
||
Speach supports Web VTT - The Web Video Text Tracks Format. | ||
Read more about it at: https://www.w3.org/2013/07/webvtt.html | ||
|
||
APIs | ||
---- | ||
|
||
.. automodule:: speach.vtt | ||
   :members: sec2ts, ts2sec |
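
The ``sec2ts`` and ``ts2sec`` members suggest conversions between seconds and WebVTT timestamps. The sketch below shows one way such a round trip can work using only the standard library. It is not the ``speach.vtt`` implementation: the function names ``seconds_to_timestamp`` and ``timestamp_to_seconds`` are made up for this example, and the exact output format of the library functions may differ. It assumes non-negative input.

```python
def seconds_to_timestamp(seconds):
    """Format non-negative seconds as a WebVTT-style hh:mm:ss.ttt string."""
    total_ms = round(seconds * 1000)  # work in whole milliseconds
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def timestamp_to_seconds(ts):
    """Parse hh:mm:ss.ttt or mm:ss.ttt (the hours part is optional in WebVTT)."""
    parts = ts.split(":")
    if len(parts) == 2:
        parts = ["0"] + parts  # no hours component given
    hours, minutes, secs = parts
    return int(hours) * 3600 + int(minutes) * 60 + float(secs)


stamp = seconds_to_timestamp(3723.5)
print(stamp)
print(timestamp_to_seconds(stamp))
```

Converting to integer milliseconds before splitting into fields avoids accumulating floating-point error across the hour, minute, and second components.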
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,65 @@ | ||
ELAN Recipes | ||
============ | ||
|
||
Common snippets for processing ELAN transcriptions with ``speach``. | ||
|
||
Open an ELAN file | ||
----------------- | ||
|
||
>>> from speach import elan | ||
>>> eaf = elan.open_eaf('./data/test.eaf') | ||
>>> eaf | ||
<speach.elan.ELANDoc object at 0x7f67790593d0> | ||
|
||
Parse an existing text stream | ||
----------------------------- | ||
|
||
If you already have an input stream, you can parse its content with the :code:`parse_eaf_stream()` function. | ||
|
||
.. code-block:: python | ||
|
||
    >>> from speach import elan | ||
    >>> with open('./data/test.eaf') as eaf_stream: | ||
    ...     eaf = elan.parse_eaf_stream(eaf_stream) | ||
    ... | ||
    >>> eaf | ||
    <speach.elan.ELANDoc object at 0x7f6778f7a9d0> | ||
|
||
Accessing tiers & annotations | ||
----------------------------- | ||
|
||
You can loop through all tiers in an ``ELANDoc`` object (i.e. an EAF file) | ||
and all annotations in each tier using Python's ``for ... in ...`` loops. | ||
For example: | ||
|
||
.. code-block:: python | ||
|
||
    for tier in eaf: | ||
        print(f"{tier.ID} | Participant: {tier.participant} | Type: {tier.type_ref}") | ||
        for ann in tier: | ||
            print(f"{ann.ID.rjust(4, ' ')}. [{ann.from_ts.ts} -- {ann.to_ts.ts}] {ann.text}") | ||
|
||
Accessing nested tiers in ELAN | ||
------------------------------ | ||
|
||
If you want to loop through the root tiers only, you can use the :code:`roots` list of an ``ELANDoc``: | ||
|
||
.. code-block:: python | ||
|
||
    from speach import elan | ||
    eaf = elan.open_eaf('./data/test_nested.eaf') | ||
    # accessing nested tiers | ||
    for tier in eaf.roots: | ||
        print(f"{tier.ID} | Participant: {tier.participant} | Type: {tier.type_ref}") | ||
        for child_tier in tier.children: | ||
            print(f" | {child_tier.ID} | Participant: {child_tier.participant} | Type: {child_tier.type_ref}") | ||
            for ann in child_tier.annotations: | ||
                print(f" |- {ann.ID.rjust(4, ' ')}. [{ann.from_ts} -- {ann.to_ts}] {ann.text}") | ||
|
||
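
The loop above descends a single level below each root tier. When tiers nest more deeply, a recursive walk over ``children`` visits every level. The sketch below illustrates the idea with a stand-in ``Tier`` dataclass rather than the real ``ELANTier`` class; the ``parent_ref`` field and the ``build_roots`` and ``walk`` helpers are assumptions made for this example and are not part of ``speach``.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Tier:
    """Stand-in for a tier; only the fields needed for the tree are modelled."""
    ID: str
    parent_ref: Optional[str] = None  # ID of the parent tier, None for a root
    children: list = field(default_factory=list)


def build_roots(tiers):
    """Link each tier to its parent and return the tiers that have no parent."""
    by_id = {t.ID: t for t in tiers}
    roots = []
    for t in tiers:
        if t.parent_ref is None:
            roots.append(t)
        else:
            by_id[t.parent_ref].children.append(t)
    return roots


def walk(tier, depth=0):
    """Yield (depth, tier) pairs in depth-first order."""
    yield depth, tier
    for child in tier.children:
        yield from walk(child, depth + 1)


tiers = [Tier("utterance"), Tier("words", "utterance"),
         Tier("gloss", "words"), Tier("notes")]
roots = build_roots(tiers)
for root in roots:
    for depth, tier in walk(root):
        print("  " * depth + tier.ID)
```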
Converting ELAN files to CSV | ||
---------------------------- | ||
|
||
``speach`` includes a command-line tool to convert an EAF file into CSV. | ||
|
||
.. code-block:: bash | ||
|
||
    python -m speach eaf2csv my_transcript.eaf -o my_transcript.csv |
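
To give a feel for what such a conversion involves, here is a minimal standard-library sketch that flattens the time-aligned annotations of a small EAF document into CSV rows. It is not the ``eaf2csv`` implementation, and the column layout is an assumption made for this example; the actual tool's output may differ. The helper names ``eaf_to_rows`` and ``rows_to_csv`` and the sample document are hypothetical. Only ``ALIGNABLE_ANNOTATION`` elements are handled, so reference annotations are skipped.

```python
import csv
import io
import xml.etree.ElementTree as ET

# A tiny hand-written EAF document used only for this example.
SAMPLE_EAF = """<ANNOTATION_DOCUMENT>
  <TIME_ORDER>
    <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="0"/>
    <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="1500"/>
  </TIME_ORDER>
  <TIER TIER_ID="utterance" PARTICIPANT="P1" LINGUISTIC_TYPE_REF="default-lt">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1" TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>hello, world</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>
"""


def eaf_to_rows(eaf_xml):
    """Return (tier, annotation, start_ms, end_ms, text) tuples."""
    root = ET.fromstring(eaf_xml)
    # Map time slot IDs to millisecond values; unaligned slots have no TIME_VALUE.
    slots = {
        slot.get("TIME_SLOT_ID"): int(slot.get("TIME_VALUE"))
        for slot in root.iter("TIME_SLOT")
        if slot.get("TIME_VALUE") is not None
    }
    rows = []
    for tier in root.iter("TIER"):
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            start = slots.get(ann.get("TIME_SLOT_REF1"))
            end = slots.get(ann.get("TIME_SLOT_REF2"))
            text = ann.findtext("ANNOTATION_VALUE", default="")
            rows.append((tier.get("TIER_ID"), ann.get("ANNOTATION_ID"), start, end, text))
    return rows


def rows_to_csv(rows):
    """Serialise rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["tier", "annotation", "start_ms", "end_ms", "text"])
    writer.writerows(rows)
    return buffer.getvalue()


rows = eaf_to_rows(SAMPLE_EAF)
csv_text = rows_to_csv(rows)
print(csv_text)
```

Using the ``csv`` module rather than joining strings by hand ensures that annotation text containing commas or quotes is escaped correctly.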