
# [DRAFT] The TEI Critical Apparatus Module and Digital Critical Editions, or Why Your Digital Edition Should Have a Data Model [DRAFT]

The myriad possibilities offered by digital methods have a way of crowding out the requirements when it comes to thinking about digital critical editions, to the point where the edition itself tends to disappear. The requirements of a critical edition are fairly well understood, even though the details vary. For Classical Latin texts in the Anglo-American tradition, a critical edition is a text that represents the editor’s attempt to (re)construct an ancient work[1] using the sources and scholarship available. That text (the editor’s hypothesis) is supported by an apparatus consisting of notes which indicate where source texts, previous editions, and other conjectures vary from or support what the editor has printed, or where the editor makes or records conjectures that they do not want to print in the text. The apparatus uses a highly abbreviated syntax whose abbreviations are either defined in the edition’s introduction or conform to longstanding practice. The introduction also discusses matters like the history and nature of the text’s transmission.

A digital edition might do these basic things, but also many more. It might present images and transcriptions of the source manuscripts. It might link to prior editions. It might permit automated differencing between any of these. It might contain the editor’s collation of manuscripts. It might permit user annotation and commentary. It might expand the apparatus, allowing users to follow the full trail of scholarship around any aspect of textual variation. It might permit users to experiment with the editor’s decisions about the text and to agree or disagree with them, making their own edition. It might eschew any attempt at reconstruction and instead present only diplomatic editions of the sources.

A printed critical edition conceals a vast amount of work under its relatively straightforward presentation of the text. The real promises of the digital edition are threefold: that the threads of that vast work might be made visible, and even manipulable, to readers; that the work itself might be distributed and therefore accomplished faster and at a lower cost to contributors; and that the end result might be reusable and re-purposable by other scholars. Much of the groundwork for digital editions has been done. There exist tools for presenting and annotating digitized manuscripts. Many out-of-copyright editions have been scanned and OCR’ed by entities like the Internet Archive. There are tools for collating multiple copies of texts automatically and for visualizing the differences between them.

For an interactive, online edition to be possible, one where the reader can choose to promote or demote readings and conjectures, the edition must both know how those textual variants relate to one another and be able to remember what choices have been made among them. It must be able to maintain its state (that is, remember what changes have been made to it) and encode a dependency graph expressing the interdependencies between variants. It must, in addition, record links to various data sources and support linking to itself for annotation purposes.

## A digression on formats and workflows

The difficulties involved in creating critical editions of most ancient works are hideous. Collating all of the various sources, deciding which readings are relevant, considering all of the relevant scholarship, and finally deciding what to put in the text and what to put in the apparatus all take years of work. And as we’ve noted, in the digital environment, the choices of what to put in and what to leave out are even harder, because there are fewer constraints. In order to think clearly about what the affordances of an online critical edition should be, we will need to consider how we would want to represent all the aspects of that edition. Let us take it as a given that we can do in HTML, visually, anything that we could do in print, mutatis mutandis. There is nothing stopping us from creating an online analog to the printed edition that looks much the same as the original (the Loeb online editions do this rather well, for example). And, further, there is nothing preventing us from adding functionality to those HTML editions. There is a problem with this approach, though: once you’ve produced an edition like this, it is, in many ways, as much of a dead end as a printed edition would be. Because the semantics of HTML are no real improvement over those of print, where form signals function, we’re left with a format that will only work properly within its original setting.

The question, then, is whether we want to have a proper data model for our edition, and if so, how do we do it?

The latest version of HTML gives us some 79 elements that are usable in the body of an HTML document. Many of these have very specific semantics that make them unlikely to apply to our use case, while others are quite vague. Often, HTML’s semantics are quite strange: `<ul>`, for example, represents a list whose order is not significant, while `<ol>` represents one whose order is significant. This is a bizarre distinction. All lists have order, by definition; sometimes we number the items and sometimes we don’t. The semantics of HTML’s ordered and unordered lists are a kind of post facto rationalization of the original intent, which was to be able to display lists with either numbers or bullets. HTML’s `<p>` is, similarly, not a paragraph in the sense of its normal definition: “a subdivision of a written composition that consists of one or more sentences, deals with one point or gives the words of one speaker, and begins on a new usually indented line” [Merriam-Webster]. Rather, it marks any typographic block of text. You cannot have an HTML list inside a paragraph, for example, even though a) a list inside a paragraph is a perfectly natural and normal phenomenon, and b) lists need not be formatted as blocks of text themselves, even in HTML.

We can characterize these elements, at least in terms of their appearance, using the `class` attribute along with an accompanying CSS stylesheet. But there is no way to explicitly mark, e.g., a line of verse as such. If we want our HTML structure to constitute a data model, we will have to add some sort of formal definition (a set of guidelines) saying, for example, that `<p class="line">` indicates a line of verse. Such a thing is an achievable goal, but a very large one, and we should note that most of the work has already been done in the form of the Text Encoding Initiative Guidelines.

TEI has what HTML lacks: a (mostly) very well-considered and mature set of semantic tags for encoding texts. It lacks what HTML has, however: rules for how to display and interact with its elements in a web browser. And this is where the question of workflow comes in: it is typical, in TEI workflows, to mark up your text and then to transform it, using XSLT, into whatever forms you need to support your digital publication. This workflow is very useful where multiple outputs are desired. You can, for example, generate HTML and print views, along with index documents for custom search engines and visualizations of various kinds, all from the same source. In converting TEI to HTML, however, we tend to throw away all of the semantic distinctions in the markup in favor of typographic distinctions in the display. We have a data model in mind while doing the markup, but it fails to carry over to the online version.

The solution is obvious, although it represents a lot of work and was impractical or impossible until recent advances in browser technology: we need to be able to keep and use the semantics of TEI while styling and adding functionality to the document in the same ways we do with HTML. Various ways of implementing this solution have been proposed. These include the modification of HTML elements using RDFa or microformats to impose TEI semantics onto them, and the use of an in-browser XSLT transformation to wrap the TEI document in an HTML envelope so that it can be decorated using standard CSS (TEI Boilerplate). The demonstration for this paper uses an approach inspired by TEI Boilerplate (and reuses its CSS), namely the conversion of TEI into HTML Custom Elements. Instead of sticking with the semantically poor element set of HTML, this solution reframes TEI as HTML via a simple 1:1 transformation, registering renamed TEI elements with the browser. This is a “bleeding edge” feature that doesn’t work in all browsers, but the result displays fine even in, e.g., Safari, which doesn’t support Custom Elements, because browsers are built to handle all manner of terrible, incorrect, messy HTML.
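To make the mechanism concrete, here is a minimal sketch of the registration step, assuming the TEI elements have already been renamed with a “tei-” prefix (as in the demonstration described below); the empty class bodies and the exact list of names are illustrative, not the demo’s actual code:

```javascript
// Register TEI-derived names as autonomous custom elements. Custom element
// names must be lowercase and contain a hyphen, so <rdgGrp> becomes
// <tei-rdggrp>. The empty class bodies are placeholders for per-element
// behavior (event handlers, rendering hooks, etc.).
for (const name of ['app', 'lem', 'rdg', 'rdggrp', 'note', 'wit', 'witdetail']) {
  customElements.define(`tei-${name}`, class extends HTMLElement {});
}
```

Once renamed this way, the elements can be styled directly with ordinary CSS selectors (e.g. a rule targeting `tei-lem`), whether or not the browser supports registration.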

The publishing workflow is, potentially at least, streamlined under this model, because the encoded document is carried through into the presentation rather than transformed into something else. It is likely to be easier to develop re-usable solutions, because most customizations will involve only modifying the CSS rather than adding or modifying one or more XSLT templates. The information loss entailed by a typical XSLT TEI-to-HTML transformation also makes reusing code to customize the output potentially tricky, meaning solutions have to address multiple parts of the toolchain.[2]

## The edition

Do we, in a digital environment, still need critical editions like those embodied in the OCT? Should we not instead focus on digitally publishing all of our source texts and defer (or even forget about) the project of producing new critical editions? There are many extant editions available for free online (albeit lacking critical apparatus). Are these not sufficient? It would be silly to argue that having transcriptions of every extant source would not be a very good thing. But do those sources in aggregate somehow replace the critical edition? I would assert that they do not, and that the fundamental problem is one of usability: a domain expert will be pleased to have many or even all of the source texts available. But how many users of critical editions are actually domain experts? I suspect most potential readers of texts would be paralyzed by the number of choices available. And indeed this is my own experience of sites like the William Blake Archive. On the one hand, the range of choices is fantastic. On the other, the question arises: “How on Earth do I use this?” Almost all users, in fact, will want to start with a single edition in which those choices have been filtered by an expert. A critical edition with apparatus does this, and gives you useful notes on what variants exist and how to investigate further, should you wish to. The great promise of the digital medium is that such further investigation might involve only clicking the right links rather than visiting libraries and archives in multiple countries. What we have now, though, are editions that lack those links to their sources, and an increasing number of unfiltered sources. My argument, then, is that the need for critical editions in the digital medium is undiminished.

It has been argued that digital editions are perhaps fundamentally different from traditional print editions,[3] and it is clear that in terms of the technologies and formats applied to their publication, this is true. But this sets up an opposition that does not really exist. As Barbara Bordalejo argues, a digital critical edition is not an ontologically different beast from a printed critical edition, despite the differences in the methods of its production. The whole argument of TEI is that it permits the capture and preservation of meaning in a transcribed text, independent of its representation. A TEI transcription, to be sure, permits more finely-grained, machine-actionable interpretations to be encoded, but it is not ontologically different from a print transcription. The important difference between a printed document and one marked up using TEI is that the latter presents a machine-actionable data model that complements the editor’s (and reader’s) mental model. But to say that we can do things digitally that are not possible in print is not to say that we are doing anything fundamentally different, just that it is possible to do it better. Our efforts are therefore perhaps best focused on deciding how to adapt the existing methods of production to the digital environment.

## The hard part

If we are to use TEI, we have to confront the problems of TEI’s Critical Apparatus module. These are solvable, but they are complicated both by the complexity involved in marking up a critical edition and by the insufficiencies of the module itself. TEI provides the following elements for the encoding of textual variation:

- `<app>`: the container element for elements recording variance
- `<lem>`: the element for recording the reading of the base text
- `<rdg>`: the element for recording a variant reading
- `<rdgGrp>`: an element that permits the grouping of readings (useful, for example, if a witness contains two readings)
- `<note>`: may be used in an `<app>` to record editorial notes. A `<note>` may use a `target` attribute to link to the reading it comments on, or, inside an `<app>`, may be assumed to comment on the immediately preceding `<lem>` or `<rdg>`.
- `<wit>`: an element for recording sigla as they appeared in a source edition. A `<wit>` applies to the `<lem>` or `<rdg>` immediately preceding it.
- `<witDetail>`: an element for attaching standoff notes to a reading.

In a TEI `<app>`, the variant possibilities in the text are set in parallel, optionally at the locus of variation. So, for example:

```xml
<app xml:id="d1e280">
  <lem xml:id="d1e281" wit="#N #Λ">ut</lem>
  <rdg xml:id="d1e283" wit="#A">et</rdg>
  <rdg xml:id="d1e285" source="#Nodell">ac</rdg>
  <rdg xml:id="d1e287" source="#Heyworth" cert="low">quam</rdg>
  <note target="#d1e287">perhaps</note>
</app>
```

states that at this location witnesses N and Λ read “ut”, and the editor accepts this as the reading of the text; witness A has “et”; Nodell conjectured “ac”; and Heyworth (the editor of the text) also sets forth a possible conjecture, “quam”, though he is not confident in it. Heyworth’s printed apparatus, by contrast, reads “8 ut NΛ: et A: ac Nodell: fort. quam”, which says the same thing, but as a note rather than as a model of the variance. Why would we choose the TEI encoding over the traditional and well-understood apparatus note? The TEI form offers us the possibility of generating new ways to visualize the variance, whereas the note is static and can be nothing other than what it appears to be. In the TEI version, the text stream may be said to fork into a number of branches, one of which is marked (because it is the `<lem>`) as preferred. What we would like, then, is to be able to visualize this branching, and perhaps to read the text in different ways by changing the preferred branch.

This can be done with the TEI version because it gives us a data model of the text. Our markup of the text has transformed it into a document which, when loaded into a processor (such as a web browser), has state. That state can be altered, resulting in different readings: if we swap a `<rdg>` for its corresponding `<lem>`, for example, we will see a different text in the output of the processor. We need to ensure, during this process, that the transition is atomic, i.e., that when the `<rdg>` becomes a `<lem>`, the `<lem>` always becomes a reading. We cannot end up with two preferred readings in the text! Our document therefore has some characteristics of a Finite State Machine: it has a (possibly very large) number of possible states, and it provides the means to transition between those states (swapping the element names of a `<lem>` and a `<rdg>`). A better way to think about it may be that each `<app>` represents an independent FSM, and the document’s state is the composite of the states of its `<app>`s. The apparatus entries are not wholly independent, however, as we will see below.
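As a concrete illustration, the composite state could be captured by recording, for each `<app>`, which child currently holds the lemma. This is a hypothetical sketch, assuming the TEI has been converted to “tei-”-prefixed custom elements and that `xml:id` values have been carried over as HTML `id`s:

```javascript
// Snapshot the document's composite state: one entry per tei-app, mapping
// the app's id to the id of the child that is currently the lemma.
function documentState(root) {
  const state = {};
  for (const app of root.querySelectorAll('tei-app')) {
    const lem = app.querySelector(':scope > tei-lem');
    state[app.id] = lem ? lem.id : null; // exactly one preferred reading per app
  }
  return state;
}
```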

Perhaps the biggest problem with the current expression of the TEI Guidelines is that they only permit the encoding of phrase-level content in apparatus. Structural variations, such as the omission or transposition of whole lines of verse, cannot currently be represented using inline apparatus in TEI, even though recording variations at the location where they occur is the simplest way to approach encoding variation. Efforts to change this state of affairs are underway, but they are complicated by the fact that a structural `<app>` could be used to violate TEI’s Abstract Model, allowing things like nested paragraphs. It is, therefore, a somewhat complex validation problem, as well as a question of improving documentation and guidance to steer users away from such infractions.

Other problems with the module are less of an obstacle, but still need to be dealt with. The module was created to model textual variation, but it has elements named for components of the print critical apparatus, which, I would argue, does not model textual variation but merely describes it. The TEI Critical Apparatus module is therefore doing something ontologically different from a printed apparatus, but, perhaps as a result of confusion due to the naming scheme, it has had aspects of the printed apparatus grafted onto it. The `<wit>` element, for example, exists expressly to record the form of sigla used in a critical edition being transcribed,[4] but `<app>` and friends are, according to the author of the Guidelines chapter in question,[5] not meant to be used for the representation of a printed critical apparatus but rather, as stated above, for the creation of a model of textual variation. The distinction is crucial: the transcription of a printed apparatus necessitates the recording of its typography and physical layout on the page. A model of textual variation, on the other hand, cares about recording where and how textual variation occurs, not how it happened to be represented in a particular print edition.

The Guidelines also lack guidance on how to represent variation in the form of transpositions, deletions, interpolations, and the like. This may be because these phenomena often occur at structural levels, such as the verse line. This, then, is both a documentation problem and one that depends on the resolution of the problem of modeling structural variations. Should the question of structural variation be successfully resolved, then a line transposition, for example, could be marked up like this:

```xml
<app xml:id="app-lem-l25-l26" exclude="#app-rdg-Housman-l25-26">
  <lem xml:id="d1e462" source="#Heyworth">
    <l n="25" xml:id="l25">desine iam <app xml:id="d1e468"><lem xml:id="d1e467">reuocare</lem><rdg xml:id="d1e469" source="#Francius">renouare</rdg></app> tuis periuria verbis,</l>
    <l n="26" xml:id="l26">Cynthia, et oblitos parce <app xml:id="d1e477"><lem xml:id="d1e476">movere</lem><rdg xml:id="d1e478" source="#Passerat">monere</rdg></app> deos;</l>
  </lem>
</app>
<!-- ... lines 27–31 ... -->
<l n="32" xml:id="l32">...</l>
<app xml:id="app-rdg-Housman-l25-26" exclude="#app-lem-l25-l26">
  <rdg xml:id="d1e603" source="#Housman">
    <l copyOf="#l25"/>
    <l copyOf="#l26"/>
  </rdg>
  <note>Housman put these lines after 32.</note>
</app>
```

Here, we use the `@copyOf` attribute on the transposed lines to indicate that they are identical to the ones in the base text, and the `@exclude` attribute on the container `<app>`s to indicate that they are in alternation, i.e., that if one applies to the text, the other cannot. If we were to translate the TEI model into English, it would say something like: “At lines 25–26 we give the base text. This is a locus of variation that is in alternation with the locus following line 32, where the reading attributed to Housman places these lines.” This example also illustrates a case where `<app>`s intersect, and are therefore not wholly independent of one another. An `<app>` may contain other `<app>`s, which means that in the case of transpositions a state change in, e.g., the `<app>` with id “d1e468” (where the reading is either “reuocare” or “renouare”) must cascade to any copies, such as those instantiated when the `<l>` elements bearing `@copyOf` attributes are filled from the originals for display.
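That cascade might be sketched as follows; this is an illustration under the same assumptions as before (“tei-”-prefixed elements, ids carried over), with the additional note that HTML lowercases attribute names, so `@copyOf` is read back as `copyof`:

```javascript
// After any lem/rdg swap, refresh every copied element from its original,
// so that a state change inside line 25 propagates to the transposed copy
// of line 25 that appears after line 32.
function refreshCopies(root) {
  for (const copy of root.querySelectorAll('[copyof]')) {
    const original = root.querySelector(copy.getAttribute('copyof')); // e.g. "#l25"
    copy.replaceChildren(...original.cloneNode(true).childNodes);
  }
}
```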

## An example: Encoding Propertius 1.15

http://hcayless.github.io/appcrit/Propertius-1-15.html

In order to determine the range of changes needed to the TEI, as well as to demonstrate the possibilities of model-based critical editions, I have chosen to encode Propertius 1.15, using Heyworth’s 2007 Oxford Classical Text as its basis. I have altered the TEI schema generation process to produce a schema that permits structural elements inside `<lem>` and `<rdg>`, in order to illustrate the possibilities of this kind of revision. As I mentioned above, the demonstrator uses a new (and still under revision) addition to the HTML specification, called Custom Elements, which permits the definition of new HTML elements that browsers will recognize as such. An XSLT stylesheet takes the source TEI and converts it into HTML, substituting for the TEI elements HTML elements of the same name with the prefix “tei-”. The demo’s appearance is based on a copy of TEI Boilerplate modified to accommodate the new element names. Javascript is used to generate a traditional-looking critical apparatus at the bottom of the page, and to add buttons in the right margin which let users view individual apparatus entries and, when clicked, open a dialog for modifying the display of the text, in which clicking on alternate readings in the apparatus activates them in the text.
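The apparatus-generation step might be sketched like this; the formatting rules below are simplified assumptions for illustration, not the demo’s actual code:

```javascript
// Build a print-style apparatus entry (e.g. "ut NΛ: et A: ac Nodell") from
// a tei-app element by listing each reading with its witnesses or source.
function apparatusEntry(app) {
  const parts = [];
  for (const rdg of app.querySelectorAll(':scope > tei-lem, :scope > tei-rdg')) {
    const sigla = (rdg.getAttribute('wit') || rdg.getAttribute('source') || '')
      .replace(/#/g, '')    // "#N #Λ" -> "N Λ"
      .replace(/\s+/g, ''); // sigla are conventionally run together: "NΛ"
    parts.push(`${rdg.textContent.trim()} ${sigla}`.trim());
  }
  return parts.join(': ');
}
```

A real implementation would also need conventions like “fort.” for low-certainty conjectures, which this sketch omits.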

The demo works by modifying the data model of the document, which results in an automatic re-rendering of the content. When an alternate reading is selected, a Javascript routine (named `swapLem`) is triggered that takes the selected reading as input, replaces its `<rdg>` tag with a `<lem>`, finds the corresponding `<lem>`, and replaces it with a `<rdg>`. No effort is made, nor needs to be made, to re-style the document, move text from one place to another, etc. The change in the page’s Document Object Model (DOM) means that the page’s styling rules are automatically re-applied.
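A minimal sketch of that routine, under the assumptions used throughout (“tei-”-prefixed elements, ids preserved), might look like the following; the demo’s actual code may differ in detail:

```javascript
// Rename an element in place, carrying over its attributes and children.
function rename(el, newName) {
  const repl = document.createElement(newName);
  for (const { name, value } of el.attributes) repl.setAttribute(name, value);
  while (el.firstChild) repl.appendChild(el.firstChild);
  el.replaceWith(repl);
  return repl;
}

// Promote a selected reading to lemma. The demotion of the current lemma
// happens in the same (single-threaded) operation, so the app never ends
// up with two preferred readings.
function swapLem(rdg) {
  const app = rdg.closest('tei-app');
  const lem = app.querySelector(':scope > tei-lem');
  if (lem) rename(lem, 'tei-rdg');
  rename(rdg, 'tei-lem');
}
```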

This way of doing things has a number of potentially useful implications. Because the `swapLem` function is atomic, and because the state of a loaded document depends entirely on its settings when it was first loaded plus the list of `swapLem` calls that have been made since, we can save and re-create the manipulated state of a document simply by logging the calls made to `swapLem` and replaying them, meaning it will be possible for users to tweak the document as they wish and to save its state for later use or redistribution. Further, because the browser DOM contains a document isomorphic to the original TEI document, with only the addition of derived HTML elements such as `<span>` and `<button>` tags to enable proper display and functionality, it would be easy to save the document itself and convert it back to TEI by stripping all of the HTML, removing the “tei-” prefix from the TEI elements, and putting them back in their proper namespace. We can also easily do things like apply all the changes suggested by Housman, or all of the readings from Ω (with the caveat that the readings in the document are Heyworth’s selection, not all the possible variants).
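State capture then reduces to logging and replaying, e.g. (hypothetical names):

```javascript
// Record each promotion so the sequence can be saved and redistributed.
const swapLog = [];
function loggedSwapLem(rdg) {
  swapLog.push(rdg.id); // the id of the reading being promoted
  swapLem(rdg);
}

// Replaying a saved log against a freshly loaded copy of the document
// reproduces the manipulated state exactly.
function replay(log) {
  for (const id of log) swapLem(document.getElementById(id));
}
```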

## Conclusion

The proposals presented here cover a lot of ground: they involve the acceptance of TEI as an appropriate format for modeling critical editions, the revision of TEI to permit operations it currently does not support, the use of web standards that are still in development and not yet supported in all browsers, and the adoption of what TEI Simple calls a “processing model” for TEI, meaning the acceptance of a defined presentation for TEI elements. Whether this is a viable way forward for publishing interactive critical editions will depend on whether editors are willing to produce TEI (or work with encoders to do so), and on whether the kinds of functionality represented here are worth the additional work and expense of doing that encoding. If, however, the main obstacle to the wide adoption and publication of digital critical editions has been the lack of a clear way to do it, then perhaps this paper points a way forward out of the morass of choices.

## Notes

1. Here, I am using the term “work” in more or less the way Bordalejo (2013) does: “the work is a conception in the mind of an author at a particular point in time that serves as a minimal denominator to identify its remaining physical manifestations.”

2. If our publication workflow involves (1) a TEI document, processed by (2) XSLT, output to (3) HTML, which is decorated using (4) CSS and Javascript, then customization of the publication hinges on what can be done to #3, which depends not just on #4, but what #1 contains and what #2 is able to output. Customization therefore implicates (potentially) the entire workflow.

3. Bodard and Garcés argue that “the model we are proposing here is for digital critical editions to be recognized as a deeper, richer and potentially very different kind of publication from printed editions of texts.” Pierazzo hedges her bets in her 2011 article, demonstrating clearly the differences involved in digital documentary editions and concluding that “ultimately their very natures are substantially, if not ontologically, different.” Bordalejo argues, correctly I think, that the key distinction is not one of substance, but of subject, that there is indeed no ontological distinction between digital and non-digital editions.

4. Moreover, it does not play well with `<note>` and `<witDetail>`, which also record information from the apparatus.

5. Peter Robinson, in a comment on a draft of notes toward the revision of the Critical Apparatus chapter of the Guidelines.

## References (incomplete)

Bodard, Gabriel & Juan Garcés, “Open Source Critical Editions: a Rationale,” in Text Editing, Print, and the Digital World, edd. Marilyn Deegan & Kathryn Sutherland, Ashgate, (2009), pp. 83–98.

Bordalejo, Barbara, “The Texts We See and the Works We Imagine: the Shift of Focus of Textual Scholarship in the Digital Age,” Ecdotica 10 (2013), pp. 64–76.

Pierazzo, Elena, “A rationale of digital documentary editions,” LLC 26.4 (2011), pp. 463–477.