Phonology: better TextGrid parsing #1101

myrix · 2024-01-16T16:32:00Z

TextGrid parsing

Some valid TextGrid markup in the system can't be processed because the library we currently use, pympi, fails to parse it properly.

See e.g. some markups in the /dictionary/330/3/perspective/330/4/view and /dictionary/3104/11/perspective/3104/12/view perspectives. At least some of them fail to parse due to newlines in the interval text, which is valid and works properly in Praat but can't be processed by Lingvodoc as of now.
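To make the failure mode concrete, here is a minimal sketch of how a parser that matches one line at a time loses an interval whose text contains a newline. The fragment and the `LINE_TEXT` pattern are made up for illustration; this is not pympi's actual code, only the general shape of the problem.

```python
import re

# Fragment of a long-format TextGrid; the second interval's text spans two lines.
FRAGMENT = '''        intervals [1]:
            xmin = 0
            xmax = 1.5
            text = "a"
        intervals [2]:
            xmin = 1.5
            xmax = 3
            text = "first line
second line"
'''

# Hypothetical per-line pattern, for illustration only (not pympi's actual code).
LINE_TEXT = re.compile(r'^\s*text = "(.*)"\s*$')


def texts_line_by_line(source):
    """Collect interval texts by matching each line separately."""
    found = []
    for line in source.splitlines():
        match = LINE_TEXT.match(line)
        if match:
            found.append(match.group(1))
    return found


# Only the single-line text is recovered; the two-line text never matches,
# because neither of its lines is a complete `text = "..."` entry.
print(texts_line_by_line(FRAGMENT))
```

Praat accepts the same fragment because it reads the quoted string up to its closing quote, regardless of line breaks.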

Functionality due to #1100 may help with locating such improperly parsed markups.

We need a better way to parse markups, probably a new TextGrid parsing library or libraries, at least for phonology but maybe for other uses too.

Requirements:

Satisfies Lingvodoc licensing requirements: no GPL; the likes of Apache, BSD and MIT are ok. If in doubt, consult Oleg Borisenko.
Obviously, can parse the valid TextGrid markup we can't parse now.
Doesn't introduce new parsing errors.
Is more or less actively developed, or at least maintained.

Maybe https://github.com/kylebgorman/textgrid would be ok, though we should obviously search for more; maybe something better can be found. Investigate and check.

As searching the source code indicates, apart from phonology, pympi is used to parse TextGrid in elan_functions.py, https://github.com/ispras/lingvodoc/blob/d40f3409e228abeec43425ecdc157644f7492661/lingvodoc/utils/elan_functions.py#L165, to convert it to EAF. If possible, i.e. if the new TextGrid library or libraries have EAF conversion, we should replace this pympi use as well.

Though actually, looking at another issue, it looks like we will definitely need TextGrid <-> EAF conversion capabilities, so if the TextGrid parsing abilities of whatever we use for that are better than pympi's, we should definitely replace this use of it.

TGT dependency removal

An additional, closely related issue: the backend dependency on the tgt TextGrid parsing library should be removed, as its GPL license is incompatible with Lingvodoc, and its use everywhere in the source code should be replaced with other libraries, perhaps with the new TextGrid library or libraries among others.

It should not be difficult: it looks like tgt is used only in https://github.com/ispras/lingvodoc/blob/d40f3409e228abeec43425ecdc157644f7492661/lingvodoc/scripts/convert_rules.py, specifically for conversions between TextGrid and EAF. Definitely please also remove the unused duplicate _export_to_elan() definition from https://github.com/ispras/lingvodoc/blob/d40f3409e228abeec43425ecdc157644f7492661/lingvodoc/utils/elan_functions.py#L125.

General improvements

If possible, all the functionality and code related to TextGrid, or at least the parts directly or indirectly touched by the work on this issue, should be improved, both architecturally and with immediately obvious optimizations.

E.g., as searching the sources indicates, the to_eaf() function https://github.com/ispras/lingvodoc/blob/d40f3409e228abeec43425ecdc157644f7492661/lingvodoc/utils/elan_functions.py#L26 is not used anywhere and should then be removed.


vmonakhov · 2024-01-21T20:00:38Z

Found several libraries for Praat TextGrid parsing. Unfortunately, none of them can convert to EAF and most of them have a GPLv3 license. So pympi + tgt is the best pair, but we have to dispose of them.

https://github.com/kylebgorman/textgrid - developed and partly suitable, MIT, no ELAN support
https://github.com/Legisign/Praat-textgrids - developed and suitable but GPLv3, no ELAN support
https://github.com/timmahrt/praatIO - developed, MIT, no ELAN support
https://github.com/rolandomunoz/mytextgrid - less developed, GPLv3, no ELAN support
https://github.com/nltk/nltk - developed, Apache-2.0, TextGrid is a tiny part of it

Found out that pympi has more special-purpose methods for working with TextGrid, but the libs above store and process TextGrid in a simpler way. So all the methods used now would have to be changed.

myrix · 2024-01-22T04:26:08Z

If all else fails, we can leave pympi to work with EAF, use one of the TextGrid libraries to read TextGrid and write our own TextGrid -> EAF converter which will construct the EAF via pympi, like e.g. in https://github.com/ispras/lingvodoc/blob/c61649f929e923066383e9dbd0e3b5b6883ec4ab/lingvodoc/scripts/docx_import.py#L834.
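A minimal sketch of what such a converter could look like, assuming the TextGrid has already been read by some other library into a plain `{tier_name: [(start_s, end_s, text), ...]}` mapping. The function names and that input shape are hypothetical; only `Eaf()`, `add_tier()` and `add_annotation()` are pympi calls, and pympi is imported lazily so the unit-conversion helper can be used without it.

```python
def to_eaf_annotations(tiers, skip_empty=True):
    """Convert {tier_name: [(start_s, end_s, text), ...]} to millisecond annotations.

    TextGrid times are in seconds, EAF time slots are integer milliseconds.
    Zero-length intervals are dropped, since pympi rejects them.
    """
    result = {}
    for name, intervals in tiers.items():
        converted = []
        for start_s, end_s, text in intervals:
            start_ms, end_ms = round(start_s * 1000), round(end_s * 1000)
            if end_ms <= start_ms or (skip_empty and not text.strip()):
                continue
            converted.append((start_ms, end_ms, text))
        result[name] = converted
    return result


def build_eaf(tiers):
    """Construct a pympi Eaf object from parsed TextGrid tiers (hypothetical helper)."""
    from pympi.Elan import Eaf  # imported lazily; requires pympi to be installed

    eaf = Eaf()  # note: a fresh Eaf already contains a tier named 'default'
    for name, annotations in to_eaf_annotations(tiers).items():
        eaf.add_tier(name)
        for start_ms, end_ms, text in annotations:
            eaf.add_annotation(name, start_ms, end_ms, text)
    return eaf
```

Keeping the seconds-to-milliseconds step separate from the pympi calls means the TextGrid reader can be swapped without touching the EAF construction.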

Or, another way: let's get rid of tgt, not use any new TextGrid libraries, and fix the problem with pympi's TextGrid parsing, which, as far as I remember from my investigation some time ago, is traced to line-by-line decoding and parsing at https://github.com/dopefishh/pympi/blob/c17292c21dacb747a20fc1069450792b52c8a6f8/pympi/Praat.py#L92.

I had an idea about that and tried to write it up; it turned out to be easier to develop it a little, and now it seems to work for TextGrids with multiline annotations in arbitrary encoding. See https://github.com/myrix/pympi, please check if it would work for our needs.
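For illustration only, the general idea can be sketched as follows: decode the whole file first, then match quoted strings across line breaks instead of line by line. This is not the code from the fork; `decode_textgrid`, `interval_texts` and the BOM-based encoding guess are assumptions made for this sketch.

```python
import codecs
import re

# Byte-order marks that identify the encodings tried here; UTF-32 is ignored.
BOMS = (
    (codecs.BOM_UTF8, 'utf-8-sig'),
    (codecs.BOM_UTF16_LE, 'utf-16'),
    (codecs.BOM_UTF16_BE, 'utf-16'),
)

# A quoted TextGrid string: any characters except '"', including newlines,
# with a literal quote written as a doubled '""'.
TEXT_ENTRY = re.compile(r'text\s*=\s*"((?:[^"]|"")*)"')


def decode_textgrid(data, fallback='utf-8'):
    """Decode the whole file at once, picking the codec from the BOM if present."""
    for bom, encoding in BOMS:
        if data.startswith(bom):
            return data.decode(encoding)
    return data.decode(fallback)


def interval_texts(text):
    """Return interval texts, matching across line breaks and unescaping quotes."""
    return [m.group(1).replace('""', '"') for m in TEXT_ENTRY.finditer(text)]


sample = 'text = "first line\nsecond ""quoted"" line"\n'.encode('utf-16')
print(interval_texts(decode_textgrid(sample)))
```

Decoding before splitting matters for UTF-16 input in particular, where splitting raw bytes on newline bytes can cut a two-byte character in half.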

vmonakhov · 2024-01-23T12:57:27Z

The 'pympi' lib was modified, mostly by @myrix, and the custom version is now used in Lingvodoc:
git+https://github.com/vmonakhov/pympi.git
The changes are:
TextGrid: multi-line parsing, escaped quote parsing dopefishh/pympi#55
vmonakhov/pympi@dbdde16
We have not taken any new TextGrid library. The exceptions due to newlines in markups and markup encoding have disappeared.
Performed some refactoring of pympi calls within the Lingvodoc code. We got rid of the 'tgt' library.
Removed several unused functions.

myrix added the bug, enhancement and backend labels Jan 16, 2024

myrix assigned vmonakhov Jan 16, 2024

vmonakhov mentioned this issue Jan 23, 2024

Better textgrid parsing -- resolved https://github.com/ispras/lingvodoc-react/issues/1101 ispras/lingvodoc#1487

Merged

vmonakhov closed this as completed in ispras/lingvodoc@4feafe2 Jan 23, 2024
