
Interlinear glossed text #10

Closed
xrotwang opened this issue Nov 23, 2015 · 55 comments

@xrotwang
Contributor

Cross-linguistic datasets often contain examples as interlinear glossed text (IGT) following the Leipzig Glossing Rules (LGR). While this kind of data can clearly be modelled as tabular data, and would thus fit into CSV files, that format would probably go against the idea of having formats that can easily be edited by hand. To satisfy this requirement, a format that groups examples as blocks of aligned IGT constituents might be better suited. One candidate for such a format is the Toolbox variant exported by ELAN:

\utterance_id ...
\ELANBegin 5.708
\ELANEnd 8.974
\ELANParticipant
\utterance <text in source language>
\gramm_units <morphemes split by whitespace>
\rp_gloss <glosses split by whitespace>
\ft <translation>
\comment
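
Parsing such marker-based (SFM) records is simple; below is a minimal Python sketch (standard library only), assuming records are separated by blank lines and every content line starts with a backslash-prefixed marker:

def parse_sfm(text):
    """Parse SFM records into a list of dicts mapping each marker to a list of values."""
    records = []
    for block in text.strip().split('\n\n'):
        record = {}
        for line in block.splitlines():
            if line.startswith('\\'):
                marker, _, value = line[1:].partition(' ')
                # markers may repeat within a record, so collect values in a list
                record.setdefault(marker, []).append(value.strip())
        records.append(record)
    return records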
@xrotwang
Contributor Author

The ODIN project came up with a custom XML format to store IGT (which doesn't seem to add much over Toolbox):

<igt id="i840">
  <metadata type="xigt-meta">
    <meta type="language" name="czech" iso-639-3="ces" tiers="phrases words morphemes"/>
    <meta type="language" name="English" iso-639-3="eng" tiers="translations"/>
    <meta type="odin-source" doc-id="2990" line-range="3297-3299" line-types="L+DB G+DB T+DB"/>
  </metadata>
  <tier id="o" type="odin-raw">
    <item id="o1" line="3297" tag="L+DB">  (52) Pjdu        tam [já a ty]                                                    NP2 saw Bill , Alice VP, Alice V Bill , Alice saw NP, Alice saw</item>
    <item id="o2" line="3298" tag="G+DB">       will.go-1 SG there I and you                                                 Bill }                                          (Goodall 1987:2.21)</item>
    <item id="o3" line="3299" tag="T+DB">      'You and I wil l go there.'</item>
  </tier>
  <tier id="c" type="odin-clean">
    <item id="c1" line="3297" tag="L+DB">Pjdu        tam [já a ty]                                                    NP2 saw Bill , Alice VP, Alice V Bill , Alice saw NP, Alice saw</item>
    <item id="c2" line="3298" tag="G+DB">will.go-1 SG there I and you                                                 Bill }                                          (Goodall 1987:2.21)</item>
    <item id="c3" line="3299" tag="T+DB">You and I wil l go there.</item>
  </tier>
  <tier id="p" type="phrases" content="c">
    <item id="p0" content="c1[0:140]"/>
  </tier>
  <tier id="g" type="glosses" content="c">
    <item id="g0" content="c2[0:144]" alignment="p1"/>
  </tier>
  <tier id="t" type="translations" content="c">
    <item id="t0" content="c3[0:25]" alignment="p1"/>
  </tier>
</igt>

@xrotwang
Contributor Author

I think we clearly do not want to deal with any document-specific ordering/numbering/grouping of examples (which seems to account for much of the variation within the ODIN corpus).

@LinguList
Contributor

One may argue that we have a similar situation for alignments: we can annotate them in CSV, but for manual editing the most convenient way is to use tab-separated formats, unless one uses a specific tool to edit alignments. So for alignments we have the CSV way of handling it, plus other formats that we have not really discussed so far, but which have been used, for example, in the alignment benchmark.

@xrotwang
Contributor Author

Btw. are there any standard packages for Python, R, etc. to read and write Toolbox files? The one I came up with is here: https://github.com/clld/clldutils/blob/master/clldutils/sfm.py

@xrotwang
Contributor Author

@LinguList I'd argue that one difference to the situation for alignments is that IGT are typically written by hand, whereas alignments are often(?) created automatically and consumed by programs.

@LinguList
Contributor

Guillaume Jacques has hired some programmers (Céline Buret has done the main part) to convert from Toolbox to LaTeX and the like; they have a library here: https://pypi.python.org/pypi/pylmflib/1.0

Here's the github: https://github.com/buret/pylmflib

I have started to work on a Toolbox-LingPy converter, without knowing of the work by Guillaume, since people told me it would be useful to have one:

https://github.com/dighl/lift

@xflr6
Member

xflr6 commented Nov 23, 2015

There is also some prior work (and data) here: https://github.com/langsci/lsp-xml
XML though, so not for manual editing.

@LinguList
Contributor

@xrotwang I wouldn't be completely sure about that: if classical linguists make alignments, they will do them in Excel spreadsheets. But I agree that we should encourage people to make alignments in the tools that we are developing for them, since otherwise it gets messy anyway.

@xrotwang
Contributor Author

The code from pylmflib seems to be pretty much tied to MDF dictionaries, i.e. it makes assumptions about the semantics of markers. I don't know enough about MDF to tell whether it has standard markers for all the layers/tiers/lines we would want to have for IGTs.

@xflr6
Member

xflr6 commented Nov 23, 2015

There are some peculiarities with how Toolbox counts (combining) characters for alignment when reading/writing UTF-8-encoded files (AFAIR, ELAN's import/export might even differ in some cases). Hand-editing in a normal text editor is an issue there. So one might want to define a strict subset (or rather a sane variant?) of the syntax (or rather use something else?).
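
A quick Python illustration of the counting problem: the same visible text can be a different number of code points depending on Unicode normalization, so any column counting has to pick one normalization and stick to it:

import unicodedata

nfc = unicodedata.normalize('NFC', 'já')  # 'a' with acute composed into one code point
nfd = unicodedata.normalize('NFD', 'já')  # decomposed into 'a' + combining acute
assert (len(nfc), len(nfd)) == (2, 3)     # same visible text, different counts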

Here is an R package that loads Toolbox files: http://bitbucket.org/tzakharko/toolboxsearch/

@xrotwang
Contributor Author

@xflr6 My first idea would have been CSV, with a column MORPHEMES_AND_GLOSS storing analyzed text and gloss as two lines of \t-separated chunks. But at least looking at such a file in LibreOffice, it didn't seem like a good idea. How does Excel fare with multiline cell values?

@stevepepper

Excel doesn't handle multiline cell values at all. On import it treats a newline within a cell as an end of record, even if the cell is delimited with quotes.

@xrotwang
Contributor Author

@stevepepper oops. Good call. I guess this is a showstopper.

@xrotwang
Contributor Author

JSON may be another good option, primarily because it

  • supports lists natively, so there is no need to mark up morpheme boundaries with custom markers,
  • lends itself to editing via JavaScript tools, which could easily be hosted online.

The format could look like this:

{
  "iso_code": "nmn",
  "sentences": [
    {
      "gloss": [
        "3SG",
        "dry-PFV/STAT"
      ],
      "morph": [
        "ha",
        "\u01c1oo-a"
      ],
      "text": "Ha \u01c1ooa.",
      "trans": "It is dry."
    }
  ]
}

@stevepepper

Does IGT really lend itself to being "modelled as tabular data"? A single instance might consist of:

(ID?, language?, source?, utterance?, segmented, glossed, translation?)

This could be represented as seven columns, but not intuitively (since they would be displayed horizontally, whereas they are typically presented vertically). In theory, the elements of the segmented utterance could be represented one per column with the gloss below them, but then you'd either need two rows, or line breaks within cells, which would be screwed up by Excel.

I think the ELAN representation is the better starting point, but I personally would prefer to see a minimalistic form of XML tagging rather than LaTeX-style commands:

<igt id="53" lang="ces" src="xyz">
<u>Pjdu tam [já a ty]</u>
<s>Pjdu tam [já a ty]</s>
<g>will.go.1SG there I and you</g>
<t>You and I will go there.</t>
</igt>

In <s> and <g> morphemes are delimited by white-space. In this example, <u> is superfluous because there are no morpheme boundaries within words. The whole example could be represented as JSON_DATA in a CSV file, could it not?
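
To illustrate how little tooling this would need, here is a Python sketch (assuming exactly the element names proposed above) that recovers the aligned word/gloss pairs:

import xml.etree.ElementTree as ET

igt = ET.fromstring("""<igt id="53" lang="ces" src="xyz">
<u>Pjdu tam [já a ty]</u>
<s>Pjdu tam [já a ty]</s>
<g>will.go.1SG there I and you</g>
<t>You and I will go there.</t>
</igt>""")
# Units in <s> and <g> are whitespace-delimited, so zip() aligns them.
pairs = list(zip(igt.findtext('s').split(), igt.findtext('g').split()))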

@LinguList
Contributor

If you handle IGT with whitespace as delimiter in the columns as specified by @stevepepper, all you have to do is define how whitespace inside a gloss is distinguished from whitespace as delimiter. Then you have basically the same format we have for alignments and multi-tiered sequence representation in phonetic entries, where we may also have multiple values virtually aligned but put in different columns, like the word "th o ch t e r", its simplified version in sound classes "T O X T E R", etc. Seems to be straightforward like that, given that we need to allow for one delimiter inside the cells anyway (at least I assume we will do so for the phonetic alignments).

@xrotwang
Contributor Author

AFAICT the Leipzig Glossing Rules do not allow whitespace in the gloss, other than to indicate word boundaries (morpheme boundaries are to be expressed as hyphens). So I guess, in terms of the data model, basically all alternatives are expressive enough. What remains is the problem of "manipulating by hand with a text editor".

Coming back to my use case: currently, I have the following workflow in mind for submissions to Dictionaria:

  1. Author submits files.
  2. Technical editor assesses technical correctness of files, including
    • linked multimedia files are available,
    • correctness of IGTs, i.e. the number of morphemes in gloss and segmented text is the same (see the sketch below),
    • etc.
  3. Author fixes technical problems.

In many dictionaries the same examples are used for multiple entries. The typical Toolbox way of dealing with this seems to be copy & paste. So if we want authors to fix examples, they either have to do this for all instances of the same example, or we could offer a file with unique examples, extracted from their Toolbox file. That's my favourite variant, in particular because it would also work for submission formats other than Toolbox; and that's my motivation for requiring this format to be editable "with bare hands".
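
For the IGT check in step 2, a Python sketch against the JSON representation proposed above (field names as in that example):

def check_igt(sentence):
    """Return an error message for a malformed IGT sentence, or None."""
    morphs, glosses = sentence['morph'], sentence['gloss']
    if len(morphs) != len(glosses):
        return 'length mismatch: {0} morphs vs. {1} glosses'.format(len(morphs), len(glosses))
    for m, g in zip(morphs, glosses):
        # morpheme boundaries are hyphens per LGR, so hyphen counts must match
        if m.count('-') != g.count('-'):
            return 'morpheme boundary mismatch: {0} / {1}'.format(m, g)
    return None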

@xrotwang
Contributor Author

I'm starting to think that JSON is the way to go. Another advantage over Toolbox (and XML as well) is that JSON requires a Unicode file encoding, which rules out some classes of encoding-handling errors.

@xflr6
Member

xflr6 commented Nov 23, 2015

Maybe it's a good idea to include a version field in the format, so the exact schema can evolve.

@stevepepper

@xrotwang -- For the specific use case of Authors-Fixing-Examples, a spreadsheet is almost certainly the best tool, with one morpheme per column, as here:

A            B            C      D   E    F
meta         53; ces; Forkel (2015:311)
utterance    Pjdu tam [já a ty]
segmented    Pjdu         tam    já  a    ty
glossed      will.go.1SG  there  I   and  you
translation  You and I will go there.

The first column is a label corresponding to the element type in the XML that I proposed above (or the style name in a Word or Writer document). This example would display better in MS Excel and OpenOffice Calc than it does here, because cell content that exceeds the cell boundary is visible when the following cell is empty. (In other words, column B would not have to be wider than the other columns, as it is here.)
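
Reading such a sheet back into a program takes only a few lines; a Python sketch, assuming the sheet is saved as CSV and uses the row labels above:

import csv

def read_example(path):
    example = {}
    with open(path, newline='', encoding='utf-8') as f:
        for row in csv.reader(f):
            label, cells = row[0], [c for c in row[1:] if c]
            if label in ('segmented', 'glossed'):
                example[label] = cells            # one morpheme or gloss per column
            else:
                example[label] = ' '.join(cells)  # free text spread over cells
    return example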

Most importantly: One must distinguish between the underlying data model and the way it is represented for any particular purpose. The latter can vary: JSON lists in a database, CSV for simple editing (as above), specific styles in word processing documents, XML for some forms of interchange, etc. I believe we should define an abstract format (most easily expressed as XML, in my opinion) and mappings from it to multiple application-specific representations.

P.S. XML also requires Unicode.

P.P.S. Should we not also consider other use cases?

@xrotwang
Contributor Author

@stevepepper I think in this case the underlying data model is fairly well-defined by the LGR and what I'm after is the best representation for authoring/editing such data, where "best" is measured by

  • how well known the necessary tools may be to the average user/linguist,
  • how much validation could be off-loaded to existing tools,
  • etc.

@stevepepper

@xrotwang I would like to see a formal definition of the data model, both to be sure that it really is well-defined and to establish common terminology. Don't you think that would be useful?

As for the average user/linguist, we can assume that (s)he knows Word and Excel, period. Many (but not all) field linguists will know Toolbox, so we should definitely cater for it, but it is a legacy tool, so we shouldn't build solutions around it. I would not want to have to explain to an average user how to edit JSON directly.

Most validation probably has to come from providing good visual feedback, but some could be implemented using relatively simple Excel macros with the model I proposed above.

@xrotwang
Contributor Author

I think the full data model is exemplified by this, using terminology of the LGR document as keys:

    {
      "segmented_words": [
        ["ha"],
        ["\u01c1oo", "a"]
      ],
      "glosses": [
        ["3SG"],
        ["dry", "PFV/STAT"]
      ],
      "text": "Ha \u01c1ooa.",
      "translation": "It is dry."
    }
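
A JSON Schema pinning this down could look something like the following sketch (illustrative, not normative):

{
  "type": "object",
  "required": ["segmented_words", "glosses", "text", "translation"],
  "properties": {
    "segmented_words": {
      "type": "array",
      "items": {"type": "array", "items": {"type": "string"}}
    },
    "glosses": {
      "type": "array",
      "items": {"type": "array", "items": {"type": "string"}}
    },
    "text": {"type": "string"},
    "translation": {"type": "string"}
  }
}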

@stevepepper

Thanks. You may be right, but it's impossible to tell unless you provide the interpretation for this example as well. Can you show how it would be formatted in a grammar or dictionary? And what is the meaning of the slash in "PFV/STAT"? I don't find the slash as a delimiter in LGR... Also, what kind of thing is "ha", and what is the relationship between X and Y in (a) [X], [Y] and (b) [X, Y]? We need names for these things.

In addition, one example is not enough. The definition of the model should take the form of a schema of some kind (like the sketch above), against which we can test multiple examples.

I'm not trying to be difficult :-) I'm just worried that some important details might get overlooked if we don't have a formal description. I'll be happy once we have a schema AND a demonstration that every example in the LGR document can be represented in every detail...

@xrotwang
Contributor Author

@stevepepper I'm not too worried that important details might get overlooked, as long as I don't overlook essentials of my use case. The worst case then would be that I come up with a file format which is not as widely applicable as it could be; but I don't risk ending up, after a long time, with a format that is too big to be applicable in practice anywhere. So I'm aiming only slightly higher than "ad hoc" and "de facto" :)

@stevepepper

So, your goal is not to be able to represent every detail. That's fair enough, and it's probably the right decision. But it is an assumption that needs to be stated explicitly, if you want outsiders (like me) to contribute. That's where a formal model helps.

@LinguList
Contributor

But isn't glossing always a sloppy business where nothing is really defined, apart from the fact that one has at least two lines, one which shows the original language and one which shows the target language, and things are aligned using whatever people can think of, especially in Word?

Judging from known use cases, however, @xrotwang's example should allow for at least a second line of glosses, ideally for as many as one wants, since people may want to gloss more than two languages, as for example in those cases where one has a specific writing system and a transliteration.

@HedvigS

HedvigS commented Nov 23, 2015

Slightly beside the point, I am aware, but I just wanted to say that my ever-favourite glossed text collection is this one, and it seems like it was also a good fit for the less techy people who were using it:
http://www.univie.ac.at/negation/sprachen/annot-en.html

@d97hah

d97hah commented Nov 23, 2015

It is also my impression that although some things in glossing have well-defined rules, not even those are seriously adhered to. I remember Nordhoff giving the figure of 40% for the IGT examples in a modern grammar (Teiwa by Klamer) that actually follow the specification claimed. Apparently many deviations were bracketed extra information of various kinds. Maybe worth anticipating such a need in the format. All the best, H


@HedvigS

HedvigS commented Nov 23, 2015

Another potentially superfluous observation, but if you haven't already, take a look at some of the articles in this volume, to get a grasp of what the end users think about: http://nflrc.hawaii.edu/ldc/?p=263

I bet most in this thread have already read everything there, but just in case: I've recently discovered that many people haven't seen those articles.

Btw, personally, I like gb4e
https://www.ctan.org/tex-archive/macros/latex/contrib/gb4e

@LinguList
Contributor

My impression of language typology is being challenged hard these days. I was innocently assuming that all these things, like glossing and describing the features of a language, were nicely resolved, while thinking badly about the aberrancies of phonetic transcription, the dark sides of the IPA, and concept translation in Swadesh lists; but now it turns out this is a repeating pattern in all aspects of linguistics. The question is, however: should I feel some kind of evil Schadenfreude as a historical linguist, or should I just be sad about both typology and historical linguistics?

@HedvigS, thanks for posting gb4e, I was trying to remember the whole day which package I used for glossing in LaTeX...

@HedvigS

HedvigS commented Nov 23, 2015

Well, if you want to get all philosophical, @LinguList, I've also been thinking about this a lot (with less experience behind me, of course), and I came to the conclusion that maybe diversity in methods and tools is not necessarily a bad thing; it might even be good. Diversity in analysis, as long as it is accessible and sufficiently motivated, keeps us from getting locked into one perspective.

but perhaps let's save that discussion for another thread :)

@LinguList
Contributor

Yes, another thread, since this can become an endless discussion. But I think too much freedom is at the core of all evil, although I hate to say this, since it contradicts all my political convictions...

But for another thread: we should think of following up on the diversity problems (due to encoder variability) we encounter in typology but also in lexicostatistics (we have similar problems there, but people tend to ignore them). Maybe this could become an interesting investigation in the context of GlottoBank.

@HedvigS

HedvigS commented Nov 23, 2015

@LinguList Agreed!

I vaguely rambled about similar things here: http://humans-who-read-grammars.blogspot.nl/p/help-linguistics-is-hard.html

@xrotwang
Contributor Author

@d97hah The WALS experience certainly supports the Nordhoff figure of (at most) 40% regular IGTs. That's why the clld db table for sentences has an xhtml column: that's where you stick all the weird things, paradigm tables, etc. I also think it's important to cater to this need, if only to make sure people don't have to sneak stuff into the more regulated fields. But again, JSON seems to have the easiest extensibility story of the alternatives considered so far.

@nthieberger

Hi,

Sorry to be late to the discussion. I am keen to build an online set of IGT with media, and developed EOPAS.org for that purpose. The specifications are here: http://www.eopas.org/help

It currently simply imports a particular Toolbox XML output or ELAN .eaf (using a given template). I will extend it to allow for the FlexText format. Given that this is how most linguists are creating their IGT, it is easiest to allow that to be imported.

Nick


@stevepepper

@xrotwang Do you really expect authors to edit JSON text like the example you gave above? Seriously?

@xrotwang
Contributor Author

@stevepepper No. JSON would be the best option when we have to provide tooling ourselves.

@stevepepper

@xrotwang OK. I'm relieved to hear that :-) I guess I was confused by what you wrote earlier:

what I'm after is the best representation for authoring/editing such data, where "best" is measured by

  • how well known the necessary tools may be to the average user/linguist,
  • how much validation could be off-loaded to existing tools,
  • etc.

So the question of the best representation for authoring/editing IGT data remains to be answered, yes?

I made a proposal a few days ago for how one might do this in Excel but no-one has so far commented on it. I have since shown it to a field linguist working on a dictionary for Dictionaria here at the University of Oslo and he confirmed that he would be comfortable working with such a format. Are there alternative proposals?

@xrotwang
Contributor Author

@stevepepper I do think your proposed spreadsheet format is a good representation for authoring and editing. It's just a bit inconvenient to process for tools, because of the variable number of columns.

@stevepepper

@xrotwang I've always felt that 3 or 4 more lines of code is a small price to pay for a format that is maximally convenient and intuitive for dozens, if not hundreds, of users... :-)

@xrotwang
Contributor Author

@stevepepper I would guess "maximally convenient" for most users means "whatever they used before". So this is why ELAN/Toolbox/FLEx importable and exportable formats are on this list. With a couple of lines of code it may also be possible to fine-tune a JSON editor enough to be usable.

@xrotwang
Contributor Author

xrotwang commented Dec 9, 2015

@nthieberger Thank you for the info on EOPAS - I think any format that already has applications supporting it should be prioritized. One question though: I'm told the Toolbox default markers for the tiers of IGT are \tx, \mb, \gl, and \ft. Why does EOPAS only understand \tx, \mr, \mg, \fg?

@nthieberger

Indeed! These were just the field markers that I was using at the time it was developed (over 10 years ago for the first version). But of course this can be changed if there were an accepted standard for field markers, and there is also going to be an import routine for FlexText files sometime soon.

@nthieberger

More on this. From the perspective of an archive holding all kinds of material, each with its own way of being presented, a filetype for IGT would allow an EOPAS-style presentation of the text and media, while providing persistence and citability at whatever level of granularity is given in the text. In 2016 PARADISEC will experiment with a new filetype (.ixt) that will call a viewer (to be built) that knows how to present IGT and linked media. The .ixt format will be XML, based roughly on Toolbox XML, as follows (but open to discussion). We will write converters from FlexText to this format. We will also include manuscript images of text as time-aligned objects with the text (see e.g. http://162.243.107.145/transcripts/2). As this is only just starting development, input of good ideas is welcome!

<phrase endTime='1395.22' id='o_37' startTime='1388.22'>
  <transcription>Go kiplake pa, kiplake pan, ranru matur. </transcription>
  <wordlist>
    <word>
      <text>Go</text>
      <morphemelist>
        <morpheme>
          <text kind='morpheme'>go</text>
          <text kind='gloss'>and</text>
        </morpheme>
      </morphemelist>
    </word>
    <word>
      <text>kiplake</text>
      <morphemelist>
        <morpheme>
          <text kind='morpheme'>ki=</text>
          <text kind='gloss'>3S.PS=</text>
        </morpheme>
        <morpheme>
          <text kind='morpheme'>plak</text>
          <text kind='gloss'>with</text>
        </morpheme>
        <morpheme>
          <text kind='morpheme'>-e</text>
          <text kind='gloss'>-TS</text>
        </morpheme>
        <morpheme>
          <text kind='morpheme'>-ø</text>
          <text kind='gloss'>-3S.O</text>
        </morpheme>
      </morphemelist>
    </word>
    [...]
  </wordlist>
  <translation>He took her and went, they slept until she became that man's wife.</translation>
</phrase>
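
For what it's worth, flattening such a phrase element back into Toolbox-style tiers takes only a few lines; a Python sketch using the element names above:

import xml.etree.ElementTree as ET

def phrase_to_tiers(phrase):
    """Flatten an .ixt <phrase> element (as above) into Toolbox-style tiers."""
    morphs, glosses = [], []
    for word in phrase.iter('word'):
        for m in word.iter('morpheme'):
            morphs.append(m.find("text[@kind='morpheme']").text)
            glosses.append(m.find("text[@kind='gloss']").text)
    return {
        'tx': phrase.findtext('transcription'),
        'mb': ' '.join(morphs),
        'gl': ' '.join(glosses),
        'ft': phrase.findtext('translation'),
    }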

@xrotwang
Contributor Author

@nthieberger Would startTime and endTime be attributes available on word elements as well? In general, it seems that synchronization with an audio recording may be better off as the content of a separate resource? If so, the question would be which parts of an IGT resource should be addressable via IDs from other resources.

@nthieberger
Copy link

Yes, time is currently at the level of the chunk (sentence, IU, ...), but in EOPAS morphemes are citable (http://www.eopas.org/transcripts/212#!/p2/w5 cites a word) within the chunk, which is also citable (http://www.eopas.org/transcripts/212#t=19.44,22.24). I think that IGT must cite timecodes and be immediately playable in order to provide verifiability of the primary material.

@goodmami

goodmami commented Mar 1, 2016

Hi, I stumbled upon this page looking for something else, but since I see you mentioned the Xigt format we're using in the ODIN project, I thought I'd chip in. In case I repeat something already mentioned, I apologize for not reading every reply here carefully.

First, IGT data is vaguely tabular, in that it's intended to be read in aligned columns, but the annotation structure is actually more like a tree. One phrase is made up of many words, each word can be several morphemes, and each morpheme may have several glosses. A translation usually follows the glosses, but it's better thought of as an annotation of the phrase than of words or morphemes. You can probably model an IGT with CSV/TSV files, but I think it would be difficult to do so accurately while keeping the general look of an IGT. Therefore, I think using spreadsheet software like Excel would be terribly limiting for producers of IGT. Toolbox is nice because of its automatic "parsing" (i.e. morphological analysis) functionality. Other tools focus more on, say, aligning text to audio/video, or managing dictionaries, etc. Linguists will often use several tools in the process of creating IGT.
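
That tree structure is easy to write down explicitly; an illustrative Python sketch:

from dataclasses import dataclass, field
from typing import List

@dataclass
class Morpheme:
    form: str
    glosses: List[str] = field(default_factory=list)  # one morpheme, possibly several glosses

@dataclass
class Word:
    form: str
    morphemes: List[Morpheme] = field(default_factory=list)

@dataclass
class Phrase:
    words: List[Word] = field(default_factory=list)
    translation: str = ''  # annotates the phrase, not individual words or morphemes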

Second, the Leipzig Glossing Rules, despite the name, are a set of conventions for linguists to follow. Linguists often deviate from these "rules", so you can't expect any given IGT to fully comply with the LGR. You can't even reliably expect there to be the same number of space-separated tokens on the morpheme and gloss lines, or the same number of hyphens. Such an assumption will get you pretty far, but you'll have to abandon it if you want to cover many sources of data.

The purpose of the Xigt format is to better enable NLP tasks using IGT as the data source, so it might not be best for, e.g., an archival format. But if you're interested I'd be happy to explain how it could be used. It is canonically an XML format, but we also have a JSON format, on top of which we are building a REST+JSONP server for IGT corpora, and some other tools.

If you want to use Toolbox SFM files, the NLTK project has a Toolbox reader. I also created a Toolbox reader for the Xigt/ODIN stuff, with some functions to help return the proper annotation structure when the author didn't follow the LGR strictly.

This comment is getting long so I'll stop, but let me know if you have any questions.

@xrotwang
Contributor Author

xrotwang commented Mar 2, 2016

@goodmami Thanks for the pointers, especially the one to sleipnir. A format (xigtjson) that already has tool support is certainly a good candidate for an interchange format.

@goodmami

I looked through the replies a bit more, and I should add that we haven't yet done anything in particular for audio data. Xigt is very free in what and how it annotates things, so there's nothing stopping someone from describing audio data, but we have so far only been concerned with text data and have not implemented audio (or video) playback or annotation in our applications. It would, however, be straightforward to port audio annotations from some other format into Xigt.

Also, @nthieberger, it's nice to see you here. I was inspired by your '09 paper titled "Culture clash – Humanities research and computing: a case study of Interlinear Glossed Text (IGT)" when we were creating Xigt.

@xrotwang xrotwang added this to the CLDF 2.0 milestone Sep 20, 2017
@sylvainloiseau

sylvainloiseau commented Apr 25, 2018

I would be very interested in a module for interlinear glossed texts. I'm not sure it would be useful for editing, but it will most certainly be useful for quantitative analyses. Are you planning to include such a module in a future version? I have worked on a tool (an R package: https://github.com/sylvainloiseau/interlineaR) for turning IGT (EMELD or Toolbox) into a set of tables with a relational data model. It also includes a function for turning LIFT dictionaries into a set of tables along the lines of the Dictionary module.
Best,
Sylvain

@thiagochacon
Copy link

thiagochacon commented Apr 25, 2018 via email

@xrotwang
Contributor Author

@sylvainloiseau There is a CLDF component for IGT. It exploits the fact that the CSVW spec provides a mechanism to specify secondary delimiters in CSV files, e.g. a separator for words in an IGT line. While this probably isn't enough to specify IGT exhaustively, it serves the purpose you mention: making it simpler for tools like R to access (well-understood) IGT corpora.
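
For illustration, the column description for the gloss line in the CSVW metadata looks roughly like this (a sketch; see the CLDF spec for the authoritative ExampleTable definition):

{
  "name": "Gloss",
  "propertyUrl": "http://cldf.clld.org/v1.0/terms.rdf#gloss",
  "separator": "\t"
}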

@xrotwang
Contributor Author

So, meanwhile, exploration has continued, and for use in a paper we came up with pyigt - a Python library to access IGT data included in a CLDF dataset.

It's pretty minimalistic, but I think it closely follows the design principle of CLDF:

Only specify things that have actual (computational) use cases.

I.e. we only exploit/support the "simple", Leipzig-Glossing-Rules case: an IGT is an ordered set of (word, gloss) pairs, possibly with aligned morpheme markers in word and gloss.

The use case this is built for is "dataset enrichment", i.e. extracting wordlists/dictionaries from a corpus of IGT.
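
In code, the supported case boils down to something like this (an illustrative Python sketch reusing the example from earlier in this thread, not pyigt's actual API):

# An IGT as an ordered set of (word, gloss) pairs with aligned morpheme markers.
igt = [
    ('ha', '3SG'),
    ('ǁoo-a', 'dry-PFV/STAT'),
]
for word, gloss in igt:
    # Hyphen counts must match for morphemes and glosses to align.
    assert len(word.split('-')) == len(gloss.split('-'))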

@xrotwang
Contributor Author

I'm closing this issue now, since there already is a CLDF component for IGT, with at least one application. I don't mean to shut down discussion, though! Feel free to criticize shortcomings of the current implementation in new issues :)
