Phonology: better TextGrid parsing #1101
Labels: backend, bug, enhancement
TextGrid parsing
Some valid TextGrid markup present in the system can't be processed because the library currently in use, pympi, fails to parse it properly. See, e.g., some of the markups in the /dictionary/330/3/perspective/330/4/view and /dictionary/3104/11/perspective/3104/12/view perspectives: at least some of them fail to parse because of newlines in the interval text, which is valid and handled correctly by Praat but cannot currently be processed by Lingvodoc.
Functionality due to #1100 may help with locating such improperly parsed markups.
We need a better way to parse markups, probably a new TextGrid parsing library or libraries, at least for phonology but maybe for other uses too.
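To make the failing case concrete, here is a minimal sketch of what interval text with embedded newlines looks like and how a parser can accept it. It is not Lingvodoc code and not a replacement for a real library: the names parse_intervals, _INTERVAL_RE and SAMPLE are made up for illustration, and it assumes the long (verbose) TextGrid text format with interval tiers only, ignoring point tiers, the short format and encoding detection.

```python
import re

# A Praat string literal is double-quoted, doubles any embedded quote (""),
# and may span several lines. [^"] matches newlines, so multi-line text works.
_INTERVAL_RE = re.compile(
    r'xmin\s*=\s*([-\d.eE+]+)\s*'
    r'xmax\s*=\s*([-\d.eE+]+)\s*'
    r'text\s*=\s*"((?:[^"]|"")*)"'
)


def parse_intervals(textgrid_source):
    """Return (xmin, xmax, text) for each interval in a long-format TextGrid."""
    return [
        (float(xmin), float(xmax), text.replace('""', '"'))
        for xmin, xmax, text in _INTERVAL_RE.findall(textgrid_source)
    ]


# Hypothetical sample: the first interval's text contains a newline,
# the second contains doubled (escaped) quotes.
SAMPLE = '''File type = "ooTextFile"
Object class = "TextGrid"

xmin = 0
xmax = 1.5
tiers? <exists>
size = 1
item []:
    item [1]:
        class = "IntervalTier"
        name = "words"
        xmin = 0
        xmax = 1.5
        intervals: size = 2
        intervals [1]:
            xmin = 0
            xmax = 0.7
            text = "first line
second line"
        intervals [2]:
            xmin = 0.7
            xmax = 1.5
            text = "say ""hi"""
'''

if __name__ == "__main__":
    for interval in parse_intervals(SAMPLE):
        print(interval)
```

A sample like this could also serve as a regression fixture when comparing candidate libraries: whichever one is chosen should return both intervals with the newline preserved.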
Requirements:
Maybe https://github.com/kylebgorman/textgrid would be OK, though we should obviously search for alternatives, since something better may exist; investigate and check.
As searching the source code indicates, apart from phonology, pympi is used to parse TextGrid in elan_functions.py, https://github.com/ispras/lingvodoc/blob/d40f3409e228abeec43425ecdc157644f7492661/lingvodoc/utils/elan_functions.py#L165, to convert it to EAF. If possible, i.e. if the new TextGrid library or libraries support EAF conversion, we should replace this use of pympi as well.
Though actually, looking at another issue, it looks like we will definitely need TextGrid <-> EAF conversion capabilities, so if the TextGrid parsing of whatever we use for that is better than pympi's, we should definitely replace this use of it.
TGT dependency removal
An additional, closely related issue: the backend dependency on the tgt TextGrid parsing library should be removed, as its GPL license is incompatible with Lingvodoc, and its use everywhere in the source code should be replaced with other libraries, perhaps with the new TextGrid library or libraries among others.
It should not be difficult: it looks like tgt is used only in https://github.com/ispras/lingvodoc/blob/d40f3409e228abeec43425ecdc157644f7492661/lingvodoc/scripts/convert_rules.py (definitely please remove the unused duplicate_export_to_elan() definition from https://github.com/ispras/lingvodoc/blob/d40f3409e228abeec43425ecdc157644f7492661/lingvodoc/utils/elan_functions.py#L125), specifically for conversions between TextGrid and EAF.
General improvements
Where possible, all functionality and code related to TextGrid, or at least the parts directly or indirectly touched by the work on this issue, should be improved, both architecturally and with immediately obvious optimizations.
E.g., as searching the sources indicates, the to_eaf() function https://github.com/ispras/lingvodoc/blob/d40f3409e228abeec43425ecdc157644f7492661/lingvodoc/utils/elan_functions.py#L26 is not used anywhere and should therefore be removed.
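The usage claims above come from searching the sources by hand. One possible way to re-check them is a small import audit based on the standard ast module. This is only a sketch: find_imports and scan_tree are hypothetical helper names, and it finds static import statements only, not dynamic imports or calls to individual functions such as to_eaf().

```python
import ast
from pathlib import Path


def find_imports(source, modules):
    """Return sorted (line, module) pairs for imports of the given top-level modules."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names = [node.module]
        else:
            continue
        for name in names:
            top = name.split(".")[0]  # "pympi.Praat" counts as "pympi"
            if top in modules:
                hits.append((node.lineno, top))
    return sorted(hits)


def scan_tree(root, modules):
    """Map each .py file under root to its matching imports; clean files are omitted."""
    report = {}
    for path in sorted(Path(root).rglob("*.py")):
        try:
            hits = find_imports(path.read_text(encoding="utf-8"), modules)
        except (SyntaxError, UnicodeDecodeError):
            continue  # unparsable file: skip it rather than abort the audit
        if hits:
            report[str(path)] = hits
    return report
```

Assuming it is run from a checkout of the repository, scan_tree("lingvodoc", {"tgt", "pympi"}) would list every file that still imports either library, which gives a simple check that the tgt dependency is really gone once the work is done.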