Skip to content

Development considerations, documentation and code for KanripoX

Notifications You must be signed in to change notification settings

kanripox/kanripox-dev

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

38 Commits
 
 
 
 
 
 
 
 

Repository files navigation

Development for KanripoX

Overview

KanripoX is the next generation format for the Kanseki Repository. It combines the previous experiences with various encoding standards, text collections, philological methods and user requirements into a collection of texts, related data and tools.

It thus draws freely from existing material, where possible acknowledging the debt. Apologies in advance if this at times has not been actualized, any oversight is not intentional.

Anatomy of a text repository

Individual texts have all information bundled in a repository, just as in the Kanseki repository. All versions and all related information is in the same branch, thus always visible and accessable (at least in theory, but maybe not in practice, as the information might reside somewhere else and/or has usage restrictions). The specifics are quite variable and detailed in the Manifext.xml, which is one of the required files.

Textual sources

Textual sources are treated slightly different depending on their status as documentary or interpretative. Apart from that, they are treated the same.

Text content might be encoded in XML using the TEI Guidelines, or in the Mandoku text format of the Kanseki Repository. Processing for additional formats might be added in the future.

In the case of a XML file, the text content is distributed across several files, usually one file for a smaller unit of the text, such as a juan or booklet, in the same way this is done in the Mandoku format of the Kanseki Repository. These files are put into subfolders according to the edition they belong to, a parent XML files includes all of them into one virtual document (using XInclude). -> for other files use alignment table that indicates were to split.

The language of the texts in the Kanseki Repository in general will be “Literary Chinese”, for which the ISO 639-3 language code is lzh. For other languages, the identifiers should be used as fitting.

Machine readable identifiers for the text should be derived from their status and editions, eg. KR3a0007_SBCK, KR3a0007_tls-en -> Edition labels (SBCK, master, tls-en) can not include “_” and need to be unique across all files. Identifiers within the text can be unique to that text only.

Documentary files (doc)

Documentary files document a specific edition, usually a print edition. It can exist as a transcription, a digital facsimile or a combination of both. The XML files for this are prepared according to the TEI Guidelines.

Interpretative files (int)

Versions of the text that are not intended to directly replicate an existing version, but rather derive from such a version, for example by normalizing character representation, translation (published translations that are reproduced as such, without editorial intervention, should go into the documentary section), punctuation etc.

Commentaries (com)

Other derived texts, summaries etc. will go here. Or put these into the above categories?

Auxiliary files (aux)

These are files that are required or useful for making use of the the text files. They include machine readable lists of text tokens, link lists that are used to align the various versions of a text and other types of annotations, some might be algorithmically derived from the texts, others hand-crafted.

Structural divisions (div)

The structural divisions link an abstract notion of the divisions (books, chapters, sections) in a work to the addressable, machine readable locations in a file. At the same time, they also serve to structurally align the different texts at an abstract level.

Link tables (lnk)

Various types:

  • corresponding sections of text
    • within this text, but also with other texts
  • link text to commentary
  • link text to translation
  • logical structure
  • virtual anthology, excerpts

Nexus tables (list of related passages)

For every text, we will produce a file that contains the token alignments for all the other editions, based on a textual unit. This will currently either be <line> or <seq>, while alignments on larger units like paragraphs do not seem to work very well.

<nexus xml:id='KR5c0073_tls.seg1-2' tp='3' tcount='3'>
<locationRef ed='CH1a0918a_chant' target='CH1a0918_CHANT_001-1a.3' tp='5' tcount='3'/>
<locationRef ed='CH1a0918b_chant' target='CH1a0918_CHANT_082-1a.8' tp='18' tcount='39'/>
<locationRef ed='CH8x3004_chant' target='CH8x3004_CHANT_002-1a.7' tp='3051' tcount='4'/>
<locationRef ed='CH8x3005_chant' target='CH8x3005_CHANT_001-1a.49' tp='118' tcount='3'/>
<locationRef ed='CH8x3006_chant' target='CH8x3006_CHANT_001-1a.3' tp='2' tcount='3'/>
<locationRef ed='CH8x3007_chant' target='CH8x3007_CHANT_002-1a.7' tp='3048' tcount='4'/>
<locationRef ed='KR5c0057_tls' target='KR5c0057_tls_001-1a.4' tp='5' tcount='3'/>
<locationRef ed='KX5c0045_HFL' target='KX5c0045_HFL_001-001a.03' tp='15' tcount='3'/>
<locationRef ed='KX5c0045_ZTDZ' target='KX5c0045_SJB_001-110474b.03' tp='15' tcount='3'/>
<locationRef ed='KX5c0046_HFL' target='KX5c0046_HFL_000-001a.03' tp='18' tcount='3'/>
<locationRef ed='KX5c0046_ZTDZ' target='KX5c0046_SJB_000-110482a.03' tp='18' tcount='3'/>
<locationRef ed='KX5c0065_SBCK' target='KX5c0065_SBCK_001-1a.04' tp='929' tcount='35'/>
<locationRef ed='KX5c0065_ZTDZ' target='KX5c0065_SJB_001-120001a.04' tp='29' tcount='35'/>
<locationRef ed='KX5c0073_HFL' target='KX5c0073_HFL_001-001a.03' tp='17' tcount='3'/>
<locationRef ed='KX5c0073_ZTDZ' target='KX5c0073_SJB_001-120272c.03' tp='17' tcount='3'/>
</nexus>

In the second example, the corresponding segment in the other texts is completely missing, therefore there is no link possible to a corresponding section in these texts. For the sake of processing with collatex, we introduce a dummy ID here, identified with ‘d’ (for deleted) in place of the text location.

Syntactic Word Locations (swl)

Annotations from the TLS project, according to the format used there.

Token files (tok)

Tokenized versions of the XML files. These files have all the Chinese characters of the text files, but punctuation and other non-textual content is either retained in attributes, or removed. The format allows also a rudimentary form of structural hierarchy (nested <tg> elements) and page or line-breaks to mark locations in a text. For processing in nature language processing (NLP) application, one can simple grab all the <t> elements and work with that. These text files can serve as proxy for more elaborate versions of a digital text, in a similar way to a thumbnail works for a high-resolution image, for search or preview.

<tlist ed="CH1a0907_CHANT" xmlns="http://kanripo.org/ns/KRX/Token/1.0">
<tg>
    <t tp="1" n="CH1a0907_CHANT_001-1a.1-h" role="h" pos="1">君</t>
    <t tp="2" n="CH1a0907_CHANT_001-1a.1-h" role="h" pos="2">道</t>
    <tg>
        <pb ed="CH1a0907_CHANT" n="CH1a0907_CHANT_001-1a"/>
    <t tp="3" n="CH1a0907_CHANT_001-1a.2" p="1.1" role="p" pos="1">晉</t>
    <t tp="4" n="CH1a0907_CHANT_001-1a.2" role="p" pos="2">平</t>
    <t tp="5" n="CH1a0907_CHANT_001-1a.2" role="p" pos="3">公</t>
    <t tp="6" n="CH1a0907_CHANT_001-1a.2" role="p" pos="4">問</t>
    <t tp="7" n="CH1a0907_CHANT_001-1a.2" role="p" pos="5">於</t>
    <t tp="8" n="CH1a0907_CHANT_001-1a.2" role="p" pos="6">師</t>
    <t tp="9" n="CH1a0907_CHANT_001-1a.2" role="p" pos="7">曠</t>
    <t tp="10" n="CH1a0907_CHANT_001-1a.2" f=":" role="p" pos="8">曰</t>
    </tg>
    <tg>
    <t tp="11" n="CH1a0907_CHANT_001-1a.3" p="「" role="p" pos="1">人</t>
    <t tp="12" n="CH1a0907_CHANT_001-1a.3" role="p" pos="2">君</t>
    <t tp="13" n="CH1a0907_CHANT_001-1a.3" role="p" pos="3">之</t>
    <t tp="14" n="CH1a0907_CHANT_001-1a.3" role="p" pos="4">道</t>
    <t tp="15" n="CH1a0907_CHANT_001-1a.3" role="p" pos="5">如</t>
    <t tp="16" n="CH1a0907_CHANT_001-1a.3" f="?」" role="p" pos="6">何</t>
    </tg>
[...]
</tlist>

Ideas

  • mark one edition as “pivot”, this will be the one with segments marked.

About

Development considerations, documentation and code for KanripoX

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published