Review frus1964-68v23 (Congo) #1

Closed
10 tasks done
joewiz opened this issue Feb 24, 2014 · 8 comments

Comments


joewiz commented Feb 24, 2014

  • Passes FRUS TEI Schematron check
  • Passes RelaxNG schema check
  • Commit schema-compliant file to SVN
  • Upload the FRUS TEI file into localhost
  • Upload page images to S3
  • Table of contents
  • Front matter
  • Back matter
  • Random sample
  • Volume metadata

joewiz commented Feb 24, 2014

The Schematron passed just fine, but the RelaxNG schema produced 159 problems initially. I fixed these using a combination of approaches:

  • The schema flagged hi/@rend="smallcaps" and hi/@rend="roman". Added these values to our ODD file, frus.odd.
  • The schema flagged <opener> and <salute>; these will be nice to have to translate into flush-left paragraphs as an alternative to p/@rend="flushleft" and a good complement to closer/signed. Added these elements to our ODD.
  • Added <gap> to our ODD, constraining the attributes to just @quantity and @unit; will need to notify DSCS to use @quantity instead of @extent.
  • Deleted instances of orgName, affiliation not allowed in our ODD; while nice from a semantic perspective, they don't add particular analytical value in the mode applied by DSCS.
  • Changed ref/@ana in d290 to ref/@target

Also spotted these problems in the course of the schema review:

  • Line breaks <lb/> needed on pgII between lines, "DEPARTMENT OF STATE Office of the Historian...". Added the missing line break elements.


joewiz commented Feb 24, 2014

From previous SVN commits:

  • Added missing <lb> line break elements in multi-line signatures. Found these instances with this XPath in oXygen: //closer[.//affiliation and not(.//lb)]. This leverages vendor's use of <affiliation> elements for the 2nd and subsequent lines following the signature.
  • Also, worked on Published & Unpublished Sources headings in the source note.
  • Added missing @type attributes to subject and participant lists (vendor seems to have been thrown by lists whose headings were variants of the usual entries, i.e., PRESENT, PRECIS, RE, CRYPTONYM, etc.) TODO add to our guidelines.
  • Fixed missing space in d83fn4: "Congo Crisis,Document 71"
  • In scanning cross references to other volumes, found generally good tagging, but (TODO) we should standardize our style guide for linked cross references. There's a lot of room for interpretation about how much of, and which portions of a cross reference to tag, and when to take enumerated volume, document, or footnote numbers.
  • Another issue is paragraphs that were tightly spaced (vertically) in the PDF but are tagged simply as paragraphs, indistinguishable from other paragraphs. We often use this tight spacing to set off lists, quotes, etc. from the normal flow of paragraphs. We should decide if tight spacing needs to be tagged or not, and whether to continue with the current practice (of list/item, sans @type). (TODO)
  • Added space missing at start of numbered paragraphs (frus1949v01 #19-27) in d579, e.g.: <p>27.The US enjoys...
  • Similarly, a space was missing between document number and heading of d580: <head>580.Memorandum From...
  • Based on these two examples, I searched with this regular expression: \d\.[A-Z] (i.e., one digit followed by a period and a capital letter) and found other instances of this in d342, d495, d496, d501, d504-6, d509-10, d512-5, d517-8, d520-2, d524-5, d528-9, d531, d533, d535, d539, d541, d544, d548, d550, d560, d565, d568-70. Besides conjoined document numbers/headings, this phenomenon was manifest in cable numbers, e.g.: <p>2402.Ref... The document heading cases could be a candidate for a schematron error. The paragraph-level instances could be a warning, since they're not strictly forbidden?
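
As a rough illustration of the search described above, here is a minimal Python sketch that applies the same pattern, \d\.[A-Z], to a few lines. The function name `find_missing_spaces` and the sample lines are hypothetical; only the pattern and the two quoted snippets come from the notes above.

```python
import re

# Pattern quoted in the comment above: a digit, a period, then a capital letter
MISSING_SPACE = re.compile(r"\d\.[A-Z]")

def find_missing_spaces(lines):
    """Return (line_number, matched_text) pairs for every hit."""
    hits = []
    for number, line in enumerate(lines, start=1):
        for match in MISSING_SPACE.finditer(line):
            hits.append((number, match.group(0)))
    return hits

# Hypothetical sample: two conjoined cases and one already-correct line
sample = [
    "<p>27.The US enjoys...</p>",
    "<head>580.Memorandum From...</head>",
    "<p>27. The US enjoys...</p>",
]

for line_number, text in find_missing_spaces(sample):
    print(line_number, text)
```

A check like this only reports candidates; whether a hit is an error (document headings) or a warning (paragraph-level cases) would still need the distinction suggested above.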


joewiz commented Feb 26, 2014

Initial notes on the random sample:

  • The PDF has 921 pages. 5% = 46 pages.
  • Setting aside the front matter, which I already looked at closely, the body has 887 pages. 5% = 44 pages. Pages 1-44 would cover documents 1-32. Going by documents, 5% of 582 documents would be 29 documents.
  • Best to take a random 30 documents. How about documents 1-5 of each 100 documents? (Other reviews could take other approaches - best that we vary our approaches.)
  • d1: the <pb> for page 2 broke in the middle of the word. Our guidelines have always been not to break a word, but to place the pb after the final word of a page.
  • d3: dang, I should've replaced (in signatures) with .
  • d3: "Conakat" not tagged as a term (CONAKAT is in the terms list)


joewiz commented Feb 26, 2014

#d100-#d104

  • no issues

#d200-#d204

  • Noticed in this range of documents that here and throughout Stan and Leop were not tagged with the <gloss> element. Added it in this range, but should be added elsewhere.
  • Silently corrected typo in #d208: assasinate > assassinate
  • #d203 for tight spacing text in 4A-E, changed <p> to <list>-nested <item> elements sans @type. (TODO: clarify guidelines on this, esp. wrt. @type.)
  • #d204 tagged ChiCom as <gloss>. Also caught 5 instances elsewhere with regex search for \schicoms?\s (whitespace + chicom + optional s + whitespace). Wondering why this (and Stan and Leop) were missed - perhaps because of case variation? If so, perhaps this was a prudent, intentional omission. And this could point to something we should be on the lookout for.
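
A minimal Python sketch of the ChiCom search, assuming the search was run case-insensitively (the notes do not say how case was handled, so `re.IGNORECASE` is an assumption). The function name and sample sentence are hypothetical.

```python
import re

# Pattern from the comment above: whitespace + chicom + optional s + whitespace.
# IGNORECASE is an assumption, added so case variants are caught.
CHICOM = re.compile(r"\schicoms?\s", re.IGNORECASE)

def find_chicom(text):
    """Return each matched term with the surrounding whitespace stripped."""
    return [match.group(0).strip() for match in CHICOM.finditer(text)]

# Hypothetical sample with three case variants, one followed by a comma
sample = "The ChiCom mission met the Chicoms, while CHICOMS were reported."
print(find_chicom(sample))
```

Because the pattern requires whitespace on both sides, a term followed directly by punctuation (the "Chicoms," in the sample) is not reported, so a broader boundary such as \b may be worth considering.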

Also

  • fixed all instances of <pb> breaking in the middle of words, moving the <pb> to the end of the word: Find: ([^\s]+)(<pb[^>]+?>)([^\s]+) Replace with: $1$3 $2
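
For reference, the same find/replace can be sketched in Python, where the replacement groups are written \1\3 \2 rather than $1$3 $2. The function name and sample text are hypothetical, and the original fix was presumably done in an editor rather than with this script.

```python
import re

# Same pattern as the Find expression above: word part, <pb> tag, word part
PB_IN_WORD = re.compile(r"([^\s]+)(<pb[^>]+?>)([^\s]+)")

def move_page_breaks(text):
    """Rejoin a word split by <pb> and place the <pb> after the word."""
    return PB_IN_WORD.sub(r"\1\3 \2", text)

# Hypothetical sample: a page break in the middle of a word
sample = 'plot to assas<pb n="2"/>sinate him'
print(move_page_breaks(sample))
```

Note that the pattern treats any non-whitespace run as a "word", so a <pb> sitting directly between two tags or between punctuation and a word would also be moved; results are worth spot-checking.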


joewiz commented Feb 26, 2014

#d300-#d304

  • #d300: Noticed that Leo was tagged - this matched case of entry in terms list. But noticed that there is a "Leo G. Cyr" in the persons list. A possibility for mistagging, especially in cryptic telegrams? Similarly, many names are tagged, even if only the last name is present. I recall our guidance was to tag people only if the full name or title + last name was present. The concern about tagging instances where only the last name is present is that there could be ambiguity and thus mistagging.
  • #d301: Noticed smooshed spacing in item C, between <hi> and <gloss>. TODO: add check for sibling elements like these, which results in a space being inserted if serialization parameter indent=no. Similarly, sibling <gloss> elements (e.g., #d402 "AmbLeo")
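
One possible shape for the sibling check mentioned in the TODO above, as a minimal sketch using Python's standard library. The function name and sample are hypothetical, and the sample omits the TEI namespace for brevity (in a real file the tags would carry it).

```python
import xml.etree.ElementTree as ET

def adjacent_siblings(root):
    """Return (first_tag, second_tag) for siblings with nothing between them."""
    pairs = []
    for parent in root.iter():
        children = list(parent)
        for first, second in zip(children, children[1:]):
            # An empty or missing tail means no text or whitespace
            # separates this element from its next sibling
            if not first.tail:
                pairs.append((first.tag, second.tag))
    return pairs

# Hypothetical sample: <hi> directly followed by <gloss>, and two adjacent <gloss>
sample = ET.fromstring(
    "<p>C. <hi>Item</hi><gloss>Leo</gloss> and "
    "<gloss>Amb</gloss><gloss>Leo</gloss></p>"
)
print(adjacent_siblings(sample))
```

Some adjacent pairs are intentional (the "AmbLeo" case above), so this is better suited to a warning than an error.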


joewiz commented Feb 27, 2014

#d400-#d404

  • #d402: odd double accent mark on the "e" in "Chargé" in the PDF was luckily not preserved in the XML!
  • #d402: noticed extraneous @corresp on the <signed> element. Deleted all 235 instances of this in the volume. Tell DSCS to omit this in the future.

#d500-#d504

  • #d501: since the decision options follow the signature (and TEI doesn't allow paragraph content to appear following a <closer> element), DSCS followed our previous practice and tagged the signature with a <p rend="right">. But we now encase material following the closer, like this decision option block, in a <postscript> element, which is allowed following a <closer>. (TODO: document use of <postscript>, as well as frus:attachment if we don't just use <postscript> in its stead - perhaps better to use a core TEI element rather than creating a new element, but only if we're not abusing the tag.) Applied this closer/postscript change to #d86, #d226, #d246 (I moved the interesting right-aligned phrase right above the signature from its own paragraph into the signed element... I'm thinking closers should make bold explicit instead of implicit; and should @rend="roman" reset both italic and bold or just italic?). There are still about 20 cases of this, which can be found with //p[@rend='right'] - should be addressed when we flesh out the guidelines on this. Many good cases of this here that can be used as illustrations for the guidelines.
  • #d501: added @rend="flushleft" to the 3 options in <p> elements at the end to ensure they're rendered flush left.
  • #d503: telegraph number (?) - the thing to the left of the dateline - needs @rend="flushleft"
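
The //p[@rend='right'] search mentioned above can also be run outside oXygen. Here is a minimal sketch with Python's standard library, assuming the volume uses the standard TEI namespace; the function name and sample fragment are hypothetical.

```python
import xml.etree.ElementTree as ET

# Standard TEI namespace; the "tei" prefix is just a local label
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

def right_aligned_paragraphs(root):
    """Return all <p rend="right"> elements below root."""
    return root.findall(".//tei:p[@rend='right']", TEI_NS)

# Hypothetical fragment: a signature tagged as a right-aligned paragraph,
# followed by decision options
sample = ET.fromstring(
    '<div xmlns="http://www.tei-c.org/ns/1.0">'
    '<p rend="right">Signature</p>'
    '<p rend="flushleft">Approve</p>'
    '<p rend="flushleft">Disapprove</p>'
    '</div>'
)

for paragraph in right_aligned_paragraphs(sample):
    print(paragraph.text)
```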

In summary:

  • No significant issues in the volume to hold up release, but many areas where DSCS can improve for next time, illustrating where our guidelines could be tighter.

joewiz closed this as completed Feb 27, 2014

joewiz commented Feb 27, 2014

Spotted a few things during the ebook review:

  • #d142 has a table - do we need borders? no, but there is a "total" line that is missing. TODO figure out how to encode/render these total/subtotal lines.
  • #d569 fn2 is empty - indeed, the footnote is missing in the PDF too. resolved: delete the empty footnote.


joewiz commented Aug 18, 2016

This issue was moved to HistoryAtState/frus#10
