Phonology: better TextGrid parsing #1101

Closed
myrix opened this issue Jan 16, 2024 · 3 comments
Labels: backend, bug, enhancement

Comments


myrix commented Jan 16, 2024

TextGrid parsing

Some valid TextGrid markup present in the system can't be processed due to the inability of the currently used library, pympi, to properly parse it.

See e.g. some markups in the /dictionary/330/3/perspective/330/4/view and /dictionary/3104/11/perspective/3104/12/view perspectives. At least some of them fail to parse because of newlines in the interval text, which is valid and handled correctly by Praat but can't be processed by Lingvodoc as of now.
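To illustrate the failure mode, here is a minimal stdlib-only sketch of reading long-format intervals from the whole decoded file content instead of line by line. The names `INTERVAL_RE`, `parse_intervals` and `SAMPLE` are hypothetical, this is not how pympi or any of the libraries discussed here work, and it assumes Praat's convention of doubling quotes inside text values.

```python
import re

# One long-format interval: xmin, xmax, then a quoted text value.
# [^"] also matches newlines, so multi-line text is captured as a whole;
# "" inside the value is Praat's escaped double quote.
INTERVAL_RE = re.compile(
    r'xmin\s*=\s*([\d.]+)\s*'
    r'xmax\s*=\s*([\d.]+)\s*'
    r'text\s*=\s*"((?:[^"]|"")*)"'
)

def parse_intervals(textgrid_source):
    """Return (xmin, xmax, text) tuples found in already decoded TextGrid content."""
    result = []
    for match in INTERVAL_RE.finditer(textgrid_source):
        xmin, xmax, text = match.groups()
        # Undo the quote doubling used for escaping.
        result.append((float(xmin), float(xmax), text.replace('""', '"')))
    return result

# Hypothetical fragment with a newline and escaped quotes in interval text.
SAMPLE = '''intervals [1]:
    xmin = 0
    xmax = 1.5
    text = "first line
second line"
intervals [2]:
    xmin = 1.5
    xmax = 2
    text = "say ""hi"""
'''

for interval in parse_intervals(SAMPLE):
    print(interval)
```

A parser that splits the input on newlines first would see `text = "first line` as an unterminated value, which is the kind of markup that currently fails.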

Functionality due to #1100 may help with locating such improperly parsed markups.

We need a better way to parse markups, probably a new TextGrid parsing library or libraries, at least for phonology but maybe for other uses too.

Requirements:

  1. Must satisfy Lingvodoc licensing requirements: no GPL; the likes of Apache, BSD, MIT are ok. If in doubt, consult Oleg Borisenko.
  2. Obviously, must parse the valid TextGrid markup we can't parse now.
  3. Must not introduce new parsing errors.
  4. Should be more or less actively developed, or at least maintained.
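Requirements 2 and 3 could be checked with a small comparison harness along these lines. This is only a sketch: `compare_parsers`, the report keys and the idea of passing parsers as plain callables are assumptions for illustration, not existing Lingvodoc code.

```python
def _attempt(parse, source):
    """Run a parser, returning (succeeded, result)."""
    try:
        return True, parse(source)
    except Exception:
        return False, None

def compare_parsers(sources, old_parse, new_parse):
    """Classify each markup by how the old and the candidate parser behave on it.

    sources: dict mapping a markup name to its content.
    """
    report = {'fixed': [], 'regressed': [], 'differs': [], 'same': [], 'both_fail': []}
    for name, source in sources.items():
        old_ok, old_result = _attempt(old_parse, source)
        new_ok, new_result = _attempt(new_parse, source)
        if new_ok and not old_ok:
            report['fixed'].append(name)        # requirement 2
        elif old_ok and not new_ok:
            report['regressed'].append(name)    # violates requirement 3
        elif not old_ok and not new_ok:
            report['both_fail'].append(name)
        elif old_result != new_result:
            report['differs'].append(name)      # needs manual inspection
        else:
            report['same'].append(name)
    return report
```

A candidate would be acceptable when `regressed` is empty and `fixed` covers the markups that fail now; entries in `differs` would still need a manual look.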

Maybe https://github.com/kylebgorman/textgrid would be ok, though obviously we should search for more; maybe something better will be found. Investigate and check.

As searching the source code indicates, apart from phonology, pympi is used to parse TextGrid in elan_functions.py, https://github.com/ispras/lingvodoc/blob/d40f3409e228abeec43425ecdc157644f7492661/lingvodoc/utils/elan_functions.py#L165, to convert it to EAF. If possible, i.e. if the new TextGrid library or libraries have EAF conversion, we should replace this pympi use as well.

Though actually, looking at another issue, it looks like we would definitely need TextGrid <-> EAF conversion capabilities, so if the TextGrid parsing abilities of whatever we use for that are better than pympi's, we definitely should replace this use of it.

TGT dependency removal

An additional, closely related issue: the backend dependency on the tgt TextGrid parsing library should be removed, as its GPL license is incompatible with Lingvodoc. Its use everywhere in the source code should be replaced with other libraries, perhaps with the new TextGrid library or libraries among others.

It should not be difficult: it looks like tgt is used only in https://github.com/ispras/lingvodoc/blob/d40f3409e228abeec43425ecdc157644f7492661/lingvodoc/scripts/convert_rules.py (definitely please remove the unused duplicate _export_to_elan() definition from https://github.com/ispras/lingvodoc/blob/d40f3409e228abeec43425ecdc157644f7492661/lingvodoc/utils/elan_functions.py#L125), specifically for conversions between TextGrid and EAF.

General improvements

If possible, all the functionality and code related to TextGrid, or at least the parts directly or indirectly touched by the work on this issue, should be improved, both architecturally and with immediately obvious optimizations.

E.g., as searching the sources indicates, the to_eaf() function https://github.com/ispras/lingvodoc/blob/d40f3409e228abeec43425ecdc157644f7492661/lingvodoc/utils/elan_functions.py#L26 is not used anywhere, so it should be removed.

myrix added the bug, enhancement and backend labels Jan 16, 2024
@vmonakhov

Found several libraries for Praat parsing. Unfortunately, none of them can convert to EAF, and most of them have a GPLv3 license. So pympi + tgt is the best pair, but we have to get rid of them.

https://github.com/kylebgorman/textgrid - developed and partly suitable, MIT, no ELAN support
https://github.com/Legisign/Praat-textgrids - developed and suitable but GPLv3, no ELAN support
https://github.com/timmahrt/praatIO - developed, MIT, no ELAN support
https://github.com/rolandomunoz/mytextgrid - less developed, GPLv3, no ELAN support
https://github.com/nltk/nltk - developed, Apache-2.0, TextGrid is a tiny part of it

Found out that pympi has more specialized methods for working with TextGrid, but the libs above store and process TextGrid in a simpler way. So all the methods used now would have to be changed.


myrix commented Jan 22, 2024

If all else fails, we can leave pympi to work with EAF, use one of the TextGrid libraries to read TextGrid and write our own TextGrid -> EAF converter which will construct the EAF via pympi, like e.g. in https://github.com/ispras/lingvodoc/blob/c61649f929e923066383e9dbd0e3b5b6883ec4ab/lingvodoc/scripts/docx_import.py#L834.
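A rough sketch of what such a converter could look like follows. The function name `intervals_to_eaf` and the input layout (tier name mapped to `(start_sec, end_sec, text)` tuples) are hypothetical, and the `eaf` argument is assumed to behave like `pympi.Elan.Eaf`, i.e. to provide `add_tier()` and `add_annotation()` with times in milliseconds.

```python
def intervals_to_eaf(tiers, eaf):
    """Copy parsed TextGrid intervals into an EAF-like object.

    tiers: dict mapping tier name -> list of (start_sec, end_sec, text),
           produced by whichever TextGrid reader is chosen.
    eaf:   object with add_tier(tier_id) and
           add_annotation(tier_id, start_ms, end_ms, value),
           assumed here to be a pympi.Elan.Eaf instance.
    """
    for tier_name, intervals in tiers.items():
        eaf.add_tier(tier_name)
        for start, end, text in intervals:
            # EAF time slots are integer milliseconds, TextGrid uses seconds.
            start_ms, end_ms = round(start * 1000), round(end * 1000)
            # Skip empty intervals and ones that collapse to zero length.
            if not text.strip() or end_ms <= start_ms:
                continue
            eaf.add_annotation(tier_name, start_ms, end_ms, text)
    return eaf

# Intended use with pympi (not executed here):
#   import pympi
#   eaf = intervals_to_eaf(tiers, pympi.Elan.Eaf())
#   eaf.to_file('result.eaf')
```

Dropping empty intervals is a design choice of this sketch: TextGrid interval tiers cover the whole time range, including silence, while EAF tiers only need the annotated stretches.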

Or, another way: let's get rid of tgt, not use any new TextGrid libraries, and fix the problem with pympi's TextGrid parsing, which, as far as I remember from my investigation some time ago, is traced to line-by-line decoding and parsing at https://github.com/dopefishh/pympi/blob/c17292c21dacb747a20fc1069450792b52c8a6f8/pympi/Praat.py#L92.

I had an idea about that and tried to write it up; it turned out to be easier to develop it a little, and now it seems to work for TextGrids with multiline annotations in arbitrary encodings. See https://github.com/myrix/pympi, and please check whether it would work for our needs.
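For reference, the general idea of decoding the whole file before parsing, rather than decoding line by line, can be sketched as below. This is not the code from the fork; `decode_textgrid` and `BOMS` are hypothetical names, and the sketch assumes that UTF-16 files start with a byte order mark and that everything else can be decoded with the fallback encoding.

```python
import codecs

# Byte order marks checked in order; only UTF-8 and UTF-16 are considered.
BOMS = (
    (codecs.BOM_UTF8, 'utf-8'),
    (codecs.BOM_UTF16_LE, 'utf-16-le'),
    (codecs.BOM_UTF16_BE, 'utf-16-be'),
)

def decode_textgrid(raw, fallback='utf-8'):
    """Decode the complete file content (bytes) into a single string."""
    for bom, encoding in BOMS:
        if raw.startswith(bom):
            # Strip the BOM and decode everything after it in one go.
            return raw[len(bom):].decode(encoding)
    # No BOM: assume the fallback encoding.
    return raw.decode(fallback)
```

With the content decoded as a whole, newlines inside quoted text are ordinary characters of the string, so the parser can decide for itself where a value ends.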


vmonakhov commented Jan 23, 2024

  1. The 'pympi' lib was modified, mostly by @myrix, and the custom version is now used in Lingvodoc:
    git+https://github.com/vmonakhov/pympi.git
    The changes are:
    TextGrid: multi-line parsing, escaped quote parsing dopefishh/pympi#55
    vmonakhov/pympi@dbdde16
    We have not adopted any new TextGrid library. The exceptions caused by newlines in markups and by markup encoding have disappeared.

  2. Did some refactoring of pympi calls within Lingvodoc code. We got rid of the 'tgt' library.
    Removed several unused functions.
