
Commit

Merge branch 'dev' into main
letuananh committed Apr 29, 2021
2 parents 18cf2ee + c544769 commit aae4f87
Showing 13 changed files with 194 additions and 186 deletions.
41 changes: 8 additions & 33 deletions README.md
@@ -28,13 +28,7 @@ pip install speach

## ELAN support

The speach library contains a command-line tool for converting EAF files into CSV.

```bash
python -m speach eaf2csv input_elan_file.eaf -o output_file_name.csv
```

For more complex analyses, speach Python scripts can be used to extract metadata and annotations from ELAN transcripts, for example:
Speach can be used to extract annotations as well as metadata from ELAN transcripts, for example:

``` python
from speach import elan
@@ -46,33 +40,14 @@ eaf = elan.open_eaf('./test/data/test.eaf')
for tier in eaf:
print(f"{tier.ID} | Participant: {tier.participant} | Type: {tier.type_ref}")
for ann in tier:
print(f"{ann.ID.rjust(4, ' ')}. [{ann.from_ts} -- {ann.to_ts}] {ann.text}")
print(f"{ann.ID.rjust(4, ' ')}. [{ann.from_ts} :: {ann.to_ts}] {ann.text}")
```
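
Transcript metadata can be read in the same way. The attribute names below (`author`, `date`, `media_file`, `media_url`) are taken from the docs/index.rst example in this same commit, so treat this as a sketch rather than the complete metadata API:

``` python
from speach import elan

eaf = elan.open_eaf('./test/data/test.eaf')
# document-level metadata
print(f"Author: {eaf.author} | Date: {eaf.date}")
print(f"Media file: {eaf.media_file} | Media URL: {eaf.media_url}")
```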

## Text corpus

```python
>>> from speach import ttl
>>> doc = ttl.Document('mydoc')
>>> sent = doc.new_sent("I am a sentence.")
>>> sent
#1: I am a sentence.
>>> sent.ID
1
>>> sent.text
'I am a sentence.'
>>> sent.import_tokens(["I", "am", "a", "sentence", "."])
>>> sent.tokens
[`I`<0:1>, `am`<2:4>, `a`<5:6>, `sentence`<7:15>, `.`<15:16>]
>>> doc.write_ttl()
```
Speach also provides command-line tools for processing EAF files.

Running the script above generates the following corpus files:

```
-rw-rw-r--. 1 tuananh tuananh 0 3月 29 13:10 mydoc_concepts.txt
-rw-rw-r--. 1 tuananh tuananh 0 3月 29 13:10 mydoc_links.txt
-rw-rw-r--. 1 tuananh tuananh 20 3月 29 13:10 mydoc_sents.txt
-rw-rw-r--. 1 tuananh tuananh 0 3月 29 13:10 mydoc_tags.txt
-rw-rw-r--. 1 tuananh tuananh 58 3月 29 13:10 mydoc_tokens.txt
```

```bash
# this command converts an EAF file into CSV
python -m speach eaf2csv input_elan_file.eaf -o output_file_name.csv
```

Read the [Speach documentation](https://speach.readthedocs.io/) for more information.
2 changes: 1 addition & 1 deletion docs/Makefile
@@ -14,7 +14,7 @@ help:


serve:
python -m http.server 7000 --bind 127.0.0.1 --directory ${BUILDDIR}/dirhtml
python -m http.server 7001 --bind 127.0.0.1 --directory ${BUILDDIR}/dirhtml

.PHONY: help Makefile

Binary file added docs/_static/images/ttl.png
46 changes: 11 additions & 35 deletions docs/api.rst
@@ -1,42 +1,18 @@
Speach APIs
===============
.. _api:

API Reference
=============

An overview of ``speach`` modules.

.. module:: speach

ELAN support
------------

speach supports reading and manipulating multi-tier transcriptions from ELAN directly.

.. automodule:: speach.elan
:members: open_eaf, parse_eaf_stream

.. autoclass:: ELANDoc
:members:
:member-order: groupwise

.. autoclass:: ELANTier
:members:
:member-order: groupwise

TTL Interlinear Gloss Format
----------------------------

TTLIG is a human-friendly interlinear gloss format that can be edited using any text editor.

.. module:: speach.ttlig

TTL SQLite
----------

TTL supports the SQLite storage format to manage large-scale corpora.

.. module:: speach.sqlite
Contents
--------

WebVTT
------
.. toctree::
:maxdepth: 2

Speach supports manipulating the Web Video Text Tracks format (WebVTT).
Read more on the :ref:`page_vtt` page.
api_elan
api_ttl
api_vtt
15 changes: 15 additions & 0 deletions docs/api_elan.rst
@@ -0,0 +1,15 @@
ELAN module
===========

``speach`` supports reading and manipulating multi-tier transcriptions from ELAN directly.

.. automodule:: speach.elan
:members: open_eaf, parse_eaf_stream

.. autoclass:: ELANDoc
:members:
:member-order: groupwise

.. autoclass:: ELANTier
:members:
:member-order: groupwise
62 changes: 62 additions & 0 deletions docs/api_ttl.rst
@@ -0,0 +1,62 @@
texttaglib module
=================

TTL (abbreviated from ``texttaglib``) is a Python implementation of the corpus linguistic method
described in :ref:`Tuan Anh (2019) <ta_2019>`.
TTL was designed to be a robust linguistic documentation framework which is flexible enough
to handle linguistic data from different sources
(Core NLP, ELAN, CoNLL, Semcor, Babelfy, Glosstag Wordnet, Tatoeba project, TSDB++, to name a few).

TTL can be used as a data interchange format for converting to and from different data formats.

.. image:: _static/images/ttl.png

Text corpus
-----------

>>> from speach import ttl
>>> doc = ttl.Document('mydoc')
>>> sent = doc.new_sent("I am a sentence.")
>>> sent
#1: I am a sentence.
>>> sent.ID
1
>>> sent.text
'I am a sentence.'
>>> sent.import_tokens(["I", "am", "a", "sentence", "."])
>>> sent.tokens
[`I`<0:1>, `am`<2:4>, `a`<5:6>, `sentence`<7:15>, `.`<15:16>]
>>> doc.write_ttl()

Running the script above generates the following corpus files:

::

-rw-rw-r--. 1 tuananh tuananh 0 3月 29 13:10 mydoc_concepts.txt
-rw-rw-r--. 1 tuananh tuananh 0 3月 29 13:10 mydoc_links.txt
-rw-rw-r--. 1 tuananh tuananh 20 3月 29 13:10 mydoc_sents.txt
-rw-rw-r--. 1 tuananh tuananh 0 3月 29 13:10 mydoc_tags.txt
-rw-rw-r--. 1 tuananh tuananh 58 3月 29 13:10 mydoc_tokens.txt

TIG - TTL Interlinear Gloss format
----------------------------------

TIG is a human-friendly interlinear gloss format that can be edited using any text editor.

.. module:: speach.tig
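
To give a general feel for interlinear glossing, here is a purely illustrative entry in the common three-line style (source text, morpheme glosses, free translation); it shows the idea only and is not a specification of the actual TIG syntax:

::

    inu  ga   hashit-te  iru
    dog  NOM  run-GER    PROG
    'The dog is running.'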

TTL SQLite
----------

TTL supports the SQLite storage format to manage large-scale corpora.

.. module:: speach.sqlite
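
The storage idea can be sketched with Python's standard ``sqlite3`` module; note that this illustrates the concept only and does not use the actual ``speach.sqlite`` API:

.. code-block:: python

   import sqlite3

   # a minimal corpus store: one table for sentences, one for tokens
   con = sqlite3.connect("mydoc.db")
   con.execute("CREATE TABLE IF NOT EXISTS sent (ID INTEGER PRIMARY KEY, text TEXT)")
   con.execute("CREATE TABLE IF NOT EXISTS token (sid INTEGER, cfrom INTEGER, cto INTEGER, text TEXT)")
   con.execute("INSERT INTO sent VALUES (1, 'I am a sentence.')")
   con.executemany("INSERT INTO token VALUES (1, ?, ?, ?)",
                   [(0, 1, 'I'), (2, 4, 'am'), (5, 6, 'a'), (7, 15, 'sentence'), (15, 16, '.')])
   con.commit()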

References
----------

.. _ta_2019:

- Le, T. A. (2019). *Developing and applying an integrated semantic framework for natural language understanding* (pp. 69-78).
  `DOI:10.32657/10220/49370 <https://doi.org/10.32657/10220/49370>`_

7 changes: 2 additions & 5 deletions docs/api_vtt.rst
@@ -1,13 +1,10 @@
.. _page_vtt:

Web VTT APIs
============
Web VTT module
==============

Speach supports WebVTT, the Web Video Text Tracks format.
Read more about it at https://www.w3.org/2013/07/webvtt.html

APIs
----

.. automodule:: speach.vtt
:members: sec2ts, ts2sec
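
A quick, hedged illustration of the two helpers listed above; the exact timestamp formatting is an assumption, not verified output:

.. code-block:: python

   from speach import vtt

   # seconds -> WebVTT timestamp string (illustrative shape: HH:MM:SS.mmm)
   print(vtt.sec2ts(10.5))             # e.g. 00:00:10.500
   # timestamp string -> seconds
   print(vtt.ts2sec("00:00:10.500"))   # e.g. 10.5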
65 changes: 65 additions & 0 deletions docs/elan.rst
@@ -0,0 +1,65 @@
ELAN Recipes
============

Common snippets for processing ELAN transcriptions with ``speach``.

Open an ELAN file
-----------------

>>> from speach import elan
>>> eaf = elan.open_eaf('./data/test.eaf')
>>> eaf
<speach.elan.ELANDoc object at 0x7f67790593d0>

Parse an existing text stream
-----------------------------

If you have an input stream ready, you can parse its content with the :code:`parse_eaf_stream()` function.

.. code-block:: python

   >>> from speach import elan
   >>> with open('./data/test.eaf') as eaf_stream:
   ...     eaf = elan.parse_eaf_stream(eaf_stream)
   ...
   >>> eaf
   <speach.elan.ELANDoc object at 0x7f6778f7a9d0>

Accessing tiers & annotations
-----------------------------

You can loop through all tiers in an ``ELANDoc`` object (i.e. an eaf file)
and all annotations in each tier using Python's ``for ... in ...`` loops.
For example:

.. code-block:: python

   for tier in eaf:
       print(f"{tier.ID} | Participant: {tier.participant} | Type: {tier.type_ref}")
       for ann in tier:
           print(f"{ann.ID.rjust(4, ' ')}. [{ann.from_ts.ts} -- {ann.to_ts.ts}] {ann.text}")

Accessing nested tiers in ELAN
------------------------------

If you want to loop through the root tiers only, you can use the :code:`roots` list of an ``ELANDoc``:

.. code-block:: python

   eaf = elan.open_eaf('./data/test_nested.eaf')
   # accessing nested tiers
   for tier in eaf.roots:
       print(f"{tier.ID} | Participant: {tier.participant} | Type: {tier.type_ref}")
       for child_tier in tier.children:
           print(f"    | {child_tier.ID} | Participant: {child_tier.participant} | Type: {child_tier.type_ref}")
           for ann in child_tier.annotations:
               print(f"    |- {ann.ID.rjust(4, ' ')}. [{ann.from_ts} -- {ann.to_ts}] {ann.text}")

Converting ELAN files to CSV
----------------------------

``speach`` includes a command line tool to convert an EAF file into CSV.

.. code-block:: bash

   python -m speach eaf2csv my_transcript.eaf -o my_transcript.csv
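
Filtering annotations by participant
------------------------------------

A small follow-up recipe, built only from the tier and annotation API shown above, collects every annotation text for a single speaker; ``"Speaker1"`` is a placeholder participant name, not a value from the sample data:

.. code-block:: python

   from speach import elan

   eaf = elan.open_eaf('./data/test.eaf')
   # gather the text of every annotation belonging to one participant
   speaker_lines = []
   for tier in eaf:
       if tier.participant == "Speaker1":
           for ann in tier:
               speaker_lines.append(ann.text)
   print("\n".join(speaker_lines))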
66 changes: 13 additions & 53 deletions docs/index.rst
@@ -31,73 +31,33 @@ Speach can be used to extract annotations as well as metadata from ELAN transcripts

.. code:: python
from speach import elan
from speach import elan
# Test ELAN reader function in speach
eaf = elan.open_eaf('./data/test.eaf')
# Test ELAN reader function in speach
eaf = elan.open_eaf('./test/data/test.eaf')
# accessing metadata
print(f"Author: {eaf.author} | Date: {eaf.date} | Format: {eaf.fileformat} | Version: {eaf.version}")
print(f"Media file: {eaf.media_file}")
print(f"Time units: {eaf.time_units}")
print(f"Media URL: {eaf.media_url} | MIME type: {eaf.mime_type}")
print(f"Media relative URL: {eaf.relative_media_url}")
# accessing tiers & annotations
for tier in eaf:
print(f"{tier.ID} | Participant: {tier.participant} | Type: {tier.type_ref}")
for ann in tier:
print(f"{ann.ID.rjust(4, ' ')}. [{ann.from_ts} :: {ann.to_ts}] {ann.text}")
# accessing tiers & annotations
for tier in eaf.tiers():
print(f"{tier.ID} | Participant: {tier.participant} | Type: {tier.type_ref}")
for ann in tier.annotations:
print(f"{ann.ID.rjust(4, ' ')}. [{ann.from_ts.ts} -- {ann.to_ts.ts}] {ann.value}")
Speach contains a command-line tool for converting EAF files into CSV.
Speach also provides command-line tools for processing EAF files.

.. code:: bash
# this command converts an EAF file into CSV
python -m speach eaf2csv input_elan_file.eaf -o output_file_name.csv
Text corpus
-----------

>>> from speach import ttl
>>> doc = ttl.Document('mydoc')
>>> sent = doc.new_sent("I am a sentence.")
>>> sent
#1: I am a sentence.
>>> sent.ID
1
>>> sent.text
'I am a sentence.'
>>> sent.import_tokens(["I", "am", "a", "sentence", "."])
>>> sent.tokens
[`I`<0:1>, `am`<2:4>, `a`<5:6>, `sentence`<7:15>, `.`<15:16>]
>>> doc.write_ttl()

Running the script above generates the following corpus files:

::

-rw-rw-r--. 1 tuananh tuananh 0 3月 29 13:10 mydoc_concepts.txt
-rw-rw-r--. 1 tuananh tuananh 0 3月 29 13:10 mydoc_links.txt
-rw-rw-r--. 1 tuananh tuananh 20 3月 29 13:10 mydoc_sents.txt
-rw-rw-r--. 1 tuananh tuananh 0 3月 29 13:10 mydoc_tags.txt
-rw-rw-r--. 1 tuananh tuananh 58 3月 29 13:10 mydoc_tokens.txt

SQLite support
--------------

TTL data can be stored in a SQLite database for better corpus analysis.

Table of contents
-----------------
More information:

.. toctree::
:maxdepth: 2
:maxdepth: 1

tutorials
recipes
api

Useful Links
------------

