
Interlinear glossed text #10

Closed
xrotwang opened this issue Nov 23, 2015 · 55 comments

@xrotwang
Contributor

Cross-linguistic datasets often contain examples as interlinear glossed text (IGT) following the Leipzig Glossing Rules (LGR). While this kind of data can clearly be modelled as tabular data, and would thus fit into CSV files, that format would probably go against the idea of having formats that can easily be edited by hand. To satisfy this requirement, a format that groups examples as blocks of aligned IGT constituents might be better suited. One candidate for such a format is the Toolbox variant exported by ELAN:

\utterance_id ...
\ELANBegin 5.708
\ELANEnd 8.974
\ELANParticipant
\utterance <text in source language>
\gramm_units <morphemes split by whitespace>
\rp_gloss <glosses split by whitespace>
\ft <translation>
\comment
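
Parsing such marker-based (SFM) records is simple; below is a minimal Python sketch (standard library only), assuming records are separated by blank lines and every content line starts with a backslash-prefixed marker:

def parse_sfm(text):
    """Parse SFM records into a list of dicts mapping each marker to a list of values."""
    records = []
    for block in text.strip().split('\n\n'):
        record = {}
        for line in block.splitlines():
            if line.startswith('\\'):
                marker, _, value = line[1:].partition(' ')
                # markers may repeat within a record, so collect values in a list
                record.setdefault(marker, []).append(value.strip())
        records.append(record)
    return records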
@xrotwang
Contributor Author

The ODIN project came up with a custom XML format to store IGT (which doesn't seem to add much over Toolbox):

<igt id="i840">
  <metadata type="xigt-meta">
    <meta type="language" name="czech" iso-639-3="ces" tiers="phrases words morphemes"/>
    <meta type="language" name="English" iso-639-3="eng" tiers="translations"/>
    <meta type="odin-source" doc-id="2990" line-range="3297-3299" line-types="L+DB G+DB T+DB"/>
  </metadata>
  <tier id="o" type="odin-raw">
    <item id="o1" line="3297" tag="L+DB">  (52) Pjdu        tam [já a ty]                                                    NP2 saw Bill , Alice VP, Alice V Bill , Alice saw NP, Alice saw</item>
    <item id="o2" line="3298" tag="G+DB">       will.go-1 SG there I and you                                                 Bill }                                          (Goodall 1987:2.21)</item>
    <item id="o3" line="3299" tag="T+DB">      'You and I wil l go there.'</item>
  </tier>
  <tier id="c" type="odin-clean">
    <item id="c1" line="3297" tag="L+DB">Pjdu        tam [já a ty]                                                    NP2 saw Bill , Alice VP, Alice V Bill , Alice saw NP, Alice saw</item>
    <item id="c2" line="3298" tag="G+DB">will.go-1 SG there I and you                                                 Bill }                                          (Goodall 1987:2.21)</item>
    <item id="c3" line="3299" tag="T+DB">You and I wil l go there.</item>
  </tier>
  <tier id="p" type="phrases" content="c">
    <item id="p0" content="c1[0:140]"/>
  </tier>
  <tier id="g" type="glosses" content="c">
    <item id="g0" content="c2[0:144]" alignment="p1"/>
  </tier>
  <tier id="t" type="translations" content="c">
    <item id="t0" content="c3[0:25]" alignment="p1"/>
  </tier>
</igt>

@xrotwang
Contributor Author

I think we clearly do not want to deal with any document-specific ordering/numbering/grouping of examples (which seems to account for much of the variation within the ODIN corpus).

@LinguList
Contributor

One may argue that we have a similar situation for alignments: we can annotate them in CSV, but for manual editing the most convenient way is to use tab-separated formats, unless one uses a specific tool to edit alignments. So for alignments we have the CSV way of handling it, plus other formats that we have not really discussed so far, but which have been used, for example, in the alignment benchmark.

@xrotwang
Contributor Author

Btw. are there any standard packages for Python, R, etc. to read and write Toolbox files? The one I came up with is here: https://github.com/clld/clldutils/blob/master/clldutils/sfm.py

@xrotwang
Contributor Author

@LinguList I'd argue that one difference to the situation for alignments is that IGT are typically written by hand, whereas alignments are often(?) created automatically and consumed by programs.

@LinguList
Contributor

Guillaume Jacques has hired some programmers (Céline Buret has done the main part) to convert from Toolbox to LaTeX and the like; they have a library here: https://pypi.python.org/pypi/pylmflib/1.0

Here's the github: https://github.com/buret/pylmflib

I have started to work on a Toolbox-LingPy converter, without knowing of the work by Guillaume, since people told me it would be useful to have one:

https://github.com/dighl/lift

@xflr6
Member

xflr6 commented Nov 23, 2015

There is also some prior work (and data) here: https://github.com/langsci/lsp-xml
XML though, so not for manual editing.

@LinguList
Contributor

@xrotwang I wouldn't be completely sure about that: if classical linguists make alignments, they will do them in Excel spreadsheets. But I agree that we should encourage people to make alignments in the tools that we are developing for them, since otherwise it gets messy anyway.

@xrotwang
Contributor Author

The code from pylmflib seems to be pretty much tied to MDF dictionaries, i.e. it makes assumptions about the semantics of markers. I don't know enough about MDF to tell whether it has standard markers for all the layers/tiers/lines we would want to have for IGTs.

@xflr6
Member

xflr6 commented Nov 23, 2015

There are some peculiarities with how Toolbox counts (combining) characters for alignment when reading/writing UTF-8-encoded files (AFAIR, ELAN's import/export might even differ in some cases). Hand-editing in a normal text editor is an issue there. So one might want to define a strict subset (or rather a sane variant?) of the syntax (or rather use something else?).
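
A quick Python illustration of the counting problem: the same visible text can be a different number of code points depending on Unicode normalization, so any column counting has to pick one normalization and stick to it:

import unicodedata

nfc = unicodedata.normalize('NFC', 'já')  # 'a' with acute composed into one code point
nfd = unicodedata.normalize('NFD', 'já')  # decomposed into 'a' + combining acute
assert (len(nfc), len(nfd)) == (2, 3)     # same visible text, different counts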

Here is an R package that loads Toolbox files: http://bitbucket.org/tzakharko/toolboxsearch/

@xrotwang
Contributor Author

@xflr6 My first idea would have been CSV, with a column MORPHEMES_AND_GLOSS storing analyzed text and gloss as two lines of \t-separated chunks. But at least looking at such a file in LibreOffice, it didn't seem like a good idea. How does Excel fare with multiline cell values?

@stevepepper

Excel doesn't handle multiline cell values at all. On import it treats a newline within a cell as an end of record, even if the cell is delimited with quotes.

@xrotwang
Contributor Author

@stevepepper oops. Good call. I guess this is a showstopper.

@xrotwang
Contributor Author

JSON may be another good option, primarily because it

  • supports lists natively, so there is no need to mark up morpheme boundaries with custom markers,
  • lends itself to editing via JavaScript tools, which could easily be hosted online.

The format could look like this:

{
  "iso_code": "nmn",
  "sentences": [
    {
      "gloss": [
        "3SG",
        "dry-PFV/STAT"
      ],
      "morph": [
        "ha",
        "\u01c1oo-a"
      ],
      "text": "Ha \u01c1ooa.",
      "trans": "It is dry."
    }
  ]
}

@stevepepper

Does IGT really lend itself to being "modelled as tabular data"? A single instance might consist of:

(ID?, language?, source?, utterance?, segmented, glossed, translation?)

This could be represented as seven columns, but not intuitively (since they would be displayed horizontally, whereas they are typically presented vertically). In theory, the elements of the segmented utterance could be represented one per column with the gloss below them, but then you'd either need two rows, or line breaks within cells, which would be screwed up by Excel.

I think the ELAN representation is the better starting point, but I personally would prefer to see a minimalistic form of XML tagging rather than LaTeX-style commands:

<igt id="53" lang="ces" src="xyz">
<u>Pjdu tam [já a ty]</u>
<s>Pjdu tam [já a ty]</s>
<g>will.go.1SG there I and you</g>
<t>You and I will go there.</t>
</igt>

In <s> and <g> morphemes are delimited by white-space. In this example, <u> is superfluous because there are no morpheme boundaries within words. The whole example could be represented as JSON_DATA in a CSV file, could it not?
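
To illustrate how little tooling this would need, here is a Python sketch (assuming exactly the element names proposed above) that recovers the aligned word/gloss pairs:

import xml.etree.ElementTree as ET

igt = ET.fromstring("""<igt id="53" lang="ces" src="xyz">
<u>Pjdu tam [já a ty]</u>
<s>Pjdu tam [já a ty]</s>
<g>will.go.1SG there I and you</g>
<t>You and I will go there.</t>
</igt>""")
# Units in <s> and <g> are whitespace-delimited, so zip() aligns them.
pairs = list(zip(igt.findtext('s').split(), igt.findtext('g').split()))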

@LinguList
Contributor

If you handle IGT with whitespace as delimiter in the columns as specified by @stevepepper, all you have to do is define how whitespace inside a gloss is distinguished from whitespace as delimiter. Then you have basically the same format we have for alignments and multi-tiered sequence representation in phonetic entries, where we may also have multiple values virtually aligned but put in different columns, like the word "th o ch t e r", its simplified version in sound classes "T O X T E R", etc. Seems to be straightforward like that, given that we need to allow for one delimiter inside the cells anyway (at least I assume we will do so for the phonetic alignments).

@xrotwang
Contributor Author

AFAICT the Leipzig Glossing Rules do not allow whitespace in the gloss, other than to indicate word boundaries (morpheme boundaries are to be expressed as hyphens). So I guess, in terms of the data model, basically all alternatives are expressive enough. What remains is the problem of "manipulating by hand with a text editor".

Coming back to my use case: currently, I have the following workflow in mind for submissions to Dictionaria:

  1. Author submits files.
  2. Technical editor assesses technical correctness of files, including
    • linked multimedia files are available,
    • correctness of IGTs, i.e. the number of morphemes in gloss and segmented text is the same (see the sketch below),
    • etc.
  3. Author fixes technical problems.

In many dictionaries the same examples are used for multiple entries. The typical Toolbox way of dealing with this seems to be copy & paste. So if we want authors to fix examples, they either have to do this for all instances of the same example, or we could offer a file with unique examples, extracted from their Toolbox file. That's my favourite variant, in particular because it would also work for submission formats other than Toolbox; and that's my motivation for requiring this format to be editable "with bare hands".
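
For the IGT check in step 2, a Python sketch against the JSON representation proposed above (field names as in that example):

def check_igt(sentence):
    """Return an error message for a malformed IGT sentence, or None."""
    morphs, glosses = sentence['morph'], sentence['gloss']
    if len(morphs) != len(glosses):
        return 'length mismatch: {0} morphs vs. {1} glosses'.format(len(morphs), len(glosses))
    for m, g in zip(morphs, glosses):
        # morpheme boundaries are hyphens per LGR, so hyphen counts must match
        if m.count('-') != g.count('-'):
            return 'morpheme boundary mismatch: {0} / {1}'.format(m, g)
    return None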

@xrotwang
Contributor Author

I'm starting to think that JSON is the way to go. Another advantage over Toolbox (and XML as well) is that JSON requires a Unicode file encoding, which rules out some classes of encoding-handling errors.

@xflr6
Member

xflr6 commented Nov 23, 2015

Maybe it's a good idea to include a version field in the format, so the exact schema can evolve.

@stevepepper

@xrotwang -- For the specific use case of Authors-Fixing-Examples, a spreadsheet is almost certainly the best tool, with one morpheme per column, as here:

A            B            C      D   E    F
meta         53; ces; Forkel (2015:311)
utterance    Pjdu tam [já a ty]
segmented    Pjdu         tam    já  a    ty
glossed      will.go.1SG  there  I   and  you
translation  You and I will go there.

The first column is a label corresponding to the element type in the XML that I proposed above (or the style name in a Word or Writer document). This example would display better in MS Excel and OpenOffice Calc than it does here, because cell content that exceeds the cell boundary is visible when the following cell is empty. (In other words, column B would not have to be wider than the other columns, as it is here.)
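
Reading such a sheet back into a program takes only a few lines; a Python sketch, assuming the sheet is saved as CSV and uses the row labels above:

import csv

def read_example(path):
    example = {}
    with open(path, newline='', encoding='utf-8') as f:
        for row in csv.reader(f):
            label, cells = row[0], [c for c in row[1:] if c]
            if label in ('segmented', 'glossed'):
                example[label] = cells            # one morpheme or gloss per column
            else:
                example[label] = ' '.join(cells)  # free text spread over cells
    return example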

Most importantly: One must distinguish between the underlying data model and the way it is represented for any particular purpose. The latter can vary: JSON lists in a database, CSV for simple editing (as above), specific styles in word processing documents, XML for some forms of interchange, etc. I believe we should define an abstract format (most easily expressed as XML, in my opinion) and mappings from it to multiple application-specific representations.

P.S. XML also requires Unicode.

P.P.S. Should we not also consider other use cases?

@xrotwang
Contributor Author

@stevepepper I think in this case the underlying data model is fairly well-defined by the LGR and what I'm after is the best representation for authoring/editing such data, where "best" is measured by

  • how well known the necessary tools may be to the average user/linguist,
  • how much validation could be off-loaded to existing tools,
  • etc.

@stevepepper

@xrotwang I would like to see a formal definition of the data model, both to be sure that it really is well-defined and to establish common terminology. Don't you think that would be useful?

As for the average user/linguist, we can assume that (s)he knows Word and Excel, period. Many (but not all) field linguists will know Toolbox, so we should definitely cater for it, but it is a legacy tool, so we shouldn't build solutions around it. I would not want to have to explain to an average user how to edit JSON directly.

Most validation probably has to come from providing good visual feedback, but some could be implemented using relatively simple Excel macros with the model I proposed above.

@xrotwang
Contributor Author

I think the full data model is exemplified by this, using terminology of the LGR document as keys:

    {
      "segmented_words": [
        ["ha"],
        ["\u01c1oo", "a"]
      ],
      "glosses": [
        ["3SG"],
        ["dry", "PFV/STAT"]
      ],
      "text": "Ha \u01c1ooa.",
      "translation": "It is dry."
    }
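
A JSON Schema pinning this down could look something like the following sketch (illustrative, not normative):

{
  "type": "object",
  "required": ["segmented_words", "glosses", "text", "translation"],
  "properties": {
    "segmented_words": {
      "type": "array",
      "items": {"type": "array", "items": {"type": "string"}}
    },
    "glosses": {
      "type": "array",
      "items": {"type": "array", "items": {"type": "string"}}
    },
    "text": {"type": "string"},
    "translation": {"type": "string"}
  }
}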

@stevepepper

Thanks. You may be right, but it's impossible to tell unless you provide the interpretation for this example as well. Can you show how it would be formatted in a grammar or dictionary? And what is the meaning of the slash in "PFV/STAT"? I don't find the slash as a delimiter in LGR... Also, what kind of thing is "ha", and what is the relationship between X and Y in (a) [X], [Y] and (b) [X, Y]? We need names for these things.

In addition, one example is not enough. The definition of the model should take the form of a schema of some kind (like the sketch above), against which we can test multiple examples.

I'm not trying to be difficult :-) I'm just worried that some important details might get overlooked if we don't have a formal description. I'll be happy once we have a schema AND a demonstration that every example in the LGR document can be represented in every detail...

@xrotwang
Contributor Author

@stevepepper I'm not too worried that important details might get overlooked, as long as I don't overlook essentials of my use case. The worst case then would be that I come up with a file format which is not as widely applicable as it could be; but I don't risk ending up, after a long time, with a format that is too big to be applicable in practice anywhere. So I'm aiming only slightly higher than "ad hoc" and "de facto" :)

@stevepepper

So, your goal is not to be able to represent every detail. That's fair enough, and it's probably the right decision. But it is an assumption that needs to be stated explicitly, if you want outsiders (like me) to contribute. That's where a formal model helps.

@LinguList
Contributor

But isn't glossing always a sloppy business where nothing is really defined, apart from the fact that one has at least two lines, one which shows the original language and one which shows the target language, and things are aligned using whatever people can think of, especially in Word?

Judging from known use cases, however, @xrotwang's example should allow for at least a second line of glosses, ideally for as many as one wants, since people may want to gloss more than two languages, as for example in those cases where one has a specific writing system and a transliteration.

@HedvigS

HedvigS commented Nov 23, 2015

Slightly beside the point, I am aware, but I just wanted to say that my ever-favourite glossed text collection is this one, and it seems like it was also a good fit for the less techy people who were using it:
http://www.univie.ac.at/negation/sprachen/annot-en.html

@d97hah

d97hah commented Nov 23, 2015

It is also my impression that although some things in glossing have well-defined rules, not even those are seriously adhered to. I remember Nordhoff giving the figure of 40% for the IGT examples in a modern grammar (Teiwa by Klamer) that actually follow the specification claimed. Apparently many deviations were bracketed extra information of various kinds. Maybe worth anticipating such a need in the format. All the best, H


@HedvigS

HedvigS commented Nov 23, 2015

Another potentially superfluous observation, but if you haven't already, take a look at some of the articles in this volume, to get a grasp of what the end users think about: http://nflrc.hawaii.edu/ldc/?p=263

I bet most in this thread have already read everything there, but just in case: I've recently discovered that many people haven't seen those articles.

Btw, personally, I like gb4e
https://www.ctan.org/tex-archive/macros/latex/contrib/gb4e

@LinguList
Contributor

My impression of language typology is being challenged hard these days. I was innocently assuming that all these things, like glossing and describing the features of a language, were nicely resolved, while thinking badly about the aberrancies of phonetic transcription, the dark sides of the IPA, and concept translation in Swadesh lists; but now it turns out this is a repeating pattern in all aspects of linguistics. The question is, however: should I feel some kind of evil Schadenfreude as a historical linguist, or should I just be sad about both typology and historical linguistics?

@HedvigS, thanks for posting gb4e, I was trying to remember the whole day which package I used for glossing in LaTeX...

@HedvigS

HedvigS commented Nov 23, 2015

Well, if you want to get all philosophical, @LinguList, I've also been thinking about this a lot (with less experience behind me, of course), and I came to the conclusion that maybe diversity in methods and tools is not necessarily a bad thing; it might even be good. Diversity in analysis, as long as it is accessible and sufficiently motivated, keeps us from getting locked into one perspective.

but perhaps let's save that discussion for another thread :)

@LinguList
Contributor

Yes, another thread, since this can become an endless discussion. But I think too much freedom is at the core of all evil, although I hate to say this, since it contradicts all my political convictions...

But for another thread: we should think of following up on the diversity problems (due to encoder variability) we encounter in typology but also in lexicostatistics (we have similar problems there, but people tend to ignore them). Maybe this could become an interesting investigation in the context of GlottoBank.

@HedvigS

HedvigS commented Nov 23, 2015

@LinguList Agreed!

I vaguely rambled about similar things here: http://humans-who-read-grammars.blogspot.nl/p/help-linguistics-is-hard.html

@xrotwang
Contributor Author

@d97hah The WALS experience certainly supports the Nordhoff figure of (at most) 40% regular IGTs. That's why the clld db table for sentences has an xhtml column: that's where you stick all the weird things, paradigm tables, etc. I also think it's important to cater to this need, if only to make sure people don't have to sneak stuff into the more regulated fields. But again, JSON seems to have the easiest extensibility story of the alternatives considered so far.

@nthieberger

Hi,

Sorry to be late to the discussion. I am keen to build an online set of IGT with media, and developed EOPAS.org for that purpose. The specifications are here: http://www.eopas.org/help

It currently simply imports a particular Toolbox XML output or ELAN .eaf (using a given template). I will extend it to allow for the FlexText format. Given that this is how most linguists are creating their IGT, it is easiest to allow that to be imported.

Nick


@stevepepper

@xrotwang Do you really expect authors to edit JSON text like the example you gave above? Seriously?

@xrotwang
Contributor Author

@stevepepper No. JSON would be the best option when we have to provide tooling ourselves.

@stevepepper

@xrotwang OK. I'm relieved to hear that :-) I guess I was confused by what you wrote earlier:

what I'm after is the best representation for authoring/editing such data, where "best" is measured by

  • how well known the necessary tools may be to the average user/linguist,
  • how much validation could be off-loaded to existing tools,
  • etc.

So the question of the best representation for authoring/editing IGT data remains to be answered, yes?

I made a proposal a few days ago for how one might do this in Excel but no-one has so far commented on it. I have since shown it to a field linguist working on a dictionary for Dictionaria here at the University of Oslo and he confirmed that he would be comfortable working with such a format. Are there alternative proposals?

@xrotwang
Contributor Author

@stevepepper I do think your proposed spreadsheet format is a good representation for authoring and editing. It's just a bit inconvenient to process for tools, because of the variable number of columns.

@stevepepper

@xrotwang I've always felt that 3 or 4 more lines of code is a small price to pay for a format that is maximally convenient and intuitive for dozens, if not hundreds, of users... :-)

@xrotwang
Contributor Author

@stevepepper I would guess "maximally convenient" for most users means "whatever they used before". So this is why ELAN/Toolbox/FLEx importable and exportable formats are on this list. With a couple of lines of code it may also be possible to fine-tune a JSON editor enough to be usable.

@xrotwang
Contributor Author

xrotwang commented Dec 9, 2015

@nthieberger Thank you for the info on EOPAS - I think any format that already has applications supporting it should be prioritized. One question though: I'm told the Toolbox default markers for the tiers of IGT are \tx, \mb, \gl, and \ft. Why does EOPAS only understand \tx, \mr, \mg, \fg?

@nthieberger

Indeed! These were just the field markers that I was using at the time it was developed (over 10 years ago for the first version). But of course this can be changed if there were an accepted standard for field markers, and there is also going to be an import routine for FlexText files sometime soon.

@nthieberger

More on this. From the perspective of an archive holding all kinds of material, each with its own way of being presented, a filetype for IGT would allow an EOPAS-style presentation of the text and media, while providing persistence and citability at whatever level of granularity is given in the text. In 2016 PARADISEC will experiment with a new filetype (.ixt) that will call a viewer (to be built) that knows how to present IGT and linked media. The .ixt format will be XML, based roughly on Toolbox XML, as follows (but open to discussion). We will write converters from FlexText to this format. We will also include manuscript images of text as time-aligned objects with the text (see e.g. http://162.243.107.145/transcripts/2). As this is only just starting development, input of good ideas is welcome!

<phrase endTime='1395.22' id='o_37' startTime='1388.22'>
  <transcription>Go kiplake pa, kiplake pan, ranru matur. </transcription>
  <wordlist>
    <word>
      <text>Go</text>
      <morphemelist>
        <morpheme>
          <text kind='morpheme'>go</text>
          <text kind='gloss'>and</text>
        </morpheme>
      </morphemelist>
    </word>
    <word>
      <text>kiplake</text>
      <morphemelist>
        <morpheme>
          <text kind='morpheme'>ki=</text>
          <text kind='gloss'>3S.PS=</text>
        </morpheme>
        <morpheme>
          <text kind='morpheme'>plak</text>
          <text kind='gloss'>with</text>
        </morpheme>
        <morpheme>
          <text kind='morpheme'>-e</text>
          <text kind='gloss'>-TS</text>
        </morpheme>
        <morpheme>
          <text kind='morpheme'>-ø</text>
          <text kind='gloss'>-3S.O</text>
        </morpheme>
      </morphemelist>
    </word>
    [...]
  </wordlist>
  <translation>He took her and went, they slept until she became that man's wife.</translation>
</phrase>
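
For what it's worth, flattening such a phrase element back into Toolbox-style tiers takes only a few lines; a Python sketch using the element names above:

import xml.etree.ElementTree as ET

def phrase_to_tiers(phrase):
    """Flatten an .ixt <phrase> element (as above) into Toolbox-style tiers."""
    morphs, glosses = [], []
    for word in phrase.iter('word'):
        for m in word.iter('morpheme'):
            morphs.append(m.find("text[@kind='morpheme']").text)
            glosses.append(m.find("text[@kind='gloss']").text)
    return {
        'tx': phrase.findtext('transcription'),
        'mb': ' '.join(morphs),
        'gl': ' '.join(glosses),
        'ft': phrase.findtext('translation'),
    }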

@xrotwang
Contributor Author

@nthieberger Would startTime and endTime be attributes available on word elements as well? In general, it seems that synchronization with an audio recording may be better off as the content of a separate resource? If so, the question would be which parts of an IGT resource should be addressable via IDs from other resources.

@nthieberger
Copy link

Yes, time is currently at the level of the chunk (sentence, IU, ...), but in EOPAS morphemes are citable (http://www.eopas.org/transcripts/212#!/p2/w5 cites a word) within the chunk, which is also citable (http://www.eopas.org/transcripts/212#t=19.44,22.24). I think that IGT must cite timecodes and be immediately playable in order to provide verifiability of the primary material.

@goodmami

goodmami commented Mar 1, 2016

Hi, I stumbled upon this page looking for something else, but since I see you mentioned the Xigt format we're using in the ODIN project, I thought I'd chip in. In case I repeat something already mentioned, I apologize for not reading every reply here carefully.

First, IGT data is vaguely tabular, in that it's intended to be read in aligned columns, but the annotation structure is actually more like a tree. One phrase is made up of many words, each word can be several morphemes, and each morpheme may have several glosses. A translation usually follows the glosses, but it's better thought of as an annotation of the phrase than of words or morphemes. You can probably model an IGT with CSV/TSV files, but I think it would be difficult to do so accurately while keeping the general look of an IGT. Therefore, I think using spreadsheet software like Excel would be terribly limiting for producers of IGT. Toolbox is nice because of its automatic "parsing" (i.e. morphological analysis) functionality. Other tools focus more on, say, aligning text to audio/video, or managing dictionaries, etc. Linguists will often use several tools in the process of creating IGT.
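
That tree structure is easy to write down explicitly; an illustrative Python sketch:

from dataclasses import dataclass, field
from typing import List

@dataclass
class Morpheme:
    form: str
    glosses: List[str] = field(default_factory=list)  # one morpheme, possibly several glosses

@dataclass
class Word:
    form: str
    morphemes: List[Morpheme] = field(default_factory=list)

@dataclass
class Phrase:
    words: List[Word] = field(default_factory=list)
    translation: str = ''  # annotates the phrase, not individual words or morphemes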

Second, the Leipzig Glossing Rules, despite the name, are a set of conventions for linguists to follow. Linguists often deviate from these "rules", so you can't expect any given IGT to fully comply with the LGR. You can't even reliably expect there to be the same number of space-separated tokens on the morpheme and gloss lines, or the same number of hyphens. Such an assumption will get you pretty far, but you'll have to abandon it if you want to cover many sources of data.

The purpose of the Xigt format is to better enable NLP tasks using IGT as the data source, so it might not be best for, e.g., an archival format. But if you're interested I'd be happy to explain how it could be used. It is canonically an XML format, but we also have a JSON format, on top of which we are building a REST+JSONP server for IGT corpora, and some other tools.

If you want to use Toolbox SFM files, the NLTK project has a Toolbox reader. I also created a Toolbox reader for the Xigt/ODIN stuff, with some functions to help return the proper annotation structure when the author didn't follow the LGR strictly.

This comment is getting long so I'll stop, but let me know if you have any questions.

@xrotwang
Contributor Author

xrotwang commented Mar 2, 2016

@goodmami Thanks for the pointers, especially the one to sleipnir. A format (xigtjson) that already has tool support is certainly a good candidate for an interchange format.

@goodmami

I looked through the replies a bit more, and I should add that we haven't yet done anything in particular for audio data. Xigt is very free in what and how it annotates things, so there's nothing stopping someone from describing audio data, but we have so far only been concerned with text data and have not implemented audio (or video) playback or annotation in our applications. It would, however, be straightforward to port audio annotations from some other format into Xigt.

Also, @nthieberger, it's nice to see you here. I was inspired by your '09 paper titled "Culture clash – Humanities research and computing: a case study of Interlinear Glossed Text (IGT)" when we were creating Xigt.

@xrotwang xrotwang added this to the CLDF 2.0 milestone Sep 20, 2017
@sylvainloiseau

sylvainloiseau commented Apr 25, 2018

I would be very interested in a module for interlinear glossed texts. I'm not sure it would be useful for editing, but it will most certainly be useful for quantitative analyses. Are you planning to include such a module in a future version? I have worked on a tool (an R package: https://github.com/sylvainloiseau/interlineaR) for turning IGT (EMELD or Toolbox) into a set of tables with a relational data model. It also includes a function for turning LIFT dictionaries into a set of tables along the lines of the Dictionary module.
Best,
Sylvain

@thiagochacon
Copy link

thiagochacon commented Apr 25, 2018 via email

@xrotwang
Contributor Author

@sylvainloiseau There is a CLDF component for IGT. It exploits the fact that the CSVW spec provides a mechanism to specify secondary delimiters in CSV files, e.g. a separator for words in an IGT line. While this probably isn't enough to specify IGT exhaustively, it serves the purpose you mention: making it simpler for tools like R to access (well-understood) IGT corpora.
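
For illustration, the column description for the gloss line in the CSVW metadata looks roughly like this (a sketch; see the CLDF spec for the authoritative ExampleTable definition):

{
  "name": "Gloss",
  "propertyUrl": "http://cldf.clld.org/v1.0/terms.rdf#gloss",
  "separator": "\t"
}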

@xrotwang
Contributor Author

So, meanwhile, exploration has continued, and for use in a paper we came up with pyigt - a Python library to access IGT data included in a CLDF dataset.

It's pretty minimalistic, but I think it closely follows the design principle of CLDF:

Only specify things that have actual (computational) use cases.

I.e. we only exploit/support the "simple", Leipzig-Glossing-Rules case: an IGT is an ordered set of (word, gloss) pairs, possibly with aligned morpheme markers in word and gloss.

The use case this is built for is "dataset enrichment", i.e. extracting wordlists/dictionaries from a corpus of IGT.
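
In code, the supported case boils down to something like this (an illustrative Python sketch reusing the example from earlier in this thread, not pyigt's actual API):

# An IGT as an ordered set of (word, gloss) pairs with aligned morpheme markers.
igt = [
    ('ha', '3SG'),
    ('ǁoo-a', 'dry-PFV/STAT'),
]
for word, gloss in igt:
    # Hyphen counts must match for morphemes and glosses to align.
    assert len(word.split('-')) == len(gloss.split('-'))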

@xrotwang
Contributor Author

I'm closing this issue now, since there already is a CLDF component for IGT, with at least one application. I don't mean to shut down discussion, though! Feel free to criticize shortcomings of the current implementation in new issues :)
