Retagging Joyce’s dialogue #19

Open
5 of 7 tasks
yellwork opened this issue Feb 8, 2017 · 13 comments

Comments

@yellwork
Collaborator

yellwork commented Feb 8, 2017

This is a to-do issue to pick out the various tasks discussed in #9:

  • Convert all double-hyphen dialogue dashes to the quotation dash or horizontal bar.

  • Shift the </said> tags in <said>―</said> structures to the end of character speech. Add all intermedial <said> tagging.

  • Proof the </said> tagging for every episode. How? We will visualize all of the episodes in a browser and colour just the </said> tagged dialogue.
    Episodes remaining: 1. “Telemachus” 2. “Nestor” 3. “Proteus” 4. “Calypso” 5. “Lotus Eaters” 6. “Hades” 7. “Aeolus” 8. “Lestrygonians” 9. “Scylla and Charybdis” 10. “Wandering Rocks” 11. “Sirens” 12. “Cyclops” 13. “Nausicaa” 14. “Oxen of the Sun” 15. “Circe” 16. “Eumaeus” 17. “Ithaca” 18. “Penelope”

  • Disambiguate the appropriate <emph> to <said> tagging.
    [there might be a few other stragglers]

  • Add @who attribution for every instance of <said> (or in “Circe” <sp>). Use character names for the values.

  • Switch @who values to @xml:id.

  • Compile a <listPerson> dossier of speakers.
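The browser check described in the proofing task above could be sketched roughly as follows. This is only a minimal illustration, not the project's actual tooling: the function name `highlight_said`, the highlight colour, and the idea of turning `<said>` into an HTML `<span>` are all assumptions, and the sample is a namespace-free fragment rather than a full TEI file.

```python
import xml.etree.ElementTree as ET

# Hypothetical stylesheet: any highlight colour would do.
STYLE = "span.said { background: #ffe58a; }"

def highlight_said(xml_fragment):
    """Return an HTML page in which only <said> content is coloured."""
    root = ET.fromstring(xml_fragment)
    for element in root.iter():
        if element.tag == "said":
            who = element.get("who", "")
            element.attrib.clear()
            element.tag = "span"
            element.set("class", "said")
            element.set("title", who)  # hover text shows the speaker value
        elif element.tag == "lb":
            element.attrib.clear()
            element.tag = "br"  # keep the line breaks visible
    body = ET.tostring(root, encoding="unicode", method="html")
    return (
        "<!DOCTYPE html><html><head><meta charset='utf-8'>"
        f"<style>{STYLE}</style></head><body>{body}</body></html>"
    )

page = highlight_said(
    '<p><lb n="060120"/><said who="lb">―The grand canal,</said> he said.</p>'
)
```

Writing `page` to a file and opening it in a browser should make any narration that has leaked inside a `<said>`, or dialogue left outside one, stand out at a glance.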

@c-forster

c-forster commented Feb 8, 2017

Having a to-do list for this seems wise. FYI: The following ack one-liner will extract names from the who attribute of said tags.

ack -o "(?<=<said who=\")[\w\'\. ]*" *.xml

This will compile a sorted list of all the names across the corpus:

ack -ho "(?<=<said who=\")[\w\'\. ]*" *.xml | sort | uniq 

I was using it as a sanity check to catch misspellings when I marked up "Telemachus."
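For anyone without ack installed, the same sanity check might look something like this in Python. It is a sketch under assumptions: the helper names (`tally_speakers`, `tally_files`, `rare_values`) are made up for illustration, the sample speech is placeholder text rather than Joyce's, and the pattern only sees `who` when it is the first attribute on `<said>`, exactly like the one-liner.

```python
import re
from collections import Counter

# Same lookbehind pattern as the ack one-liner.
WHO_PATTERN = re.compile(r'(?<=<said who=")[\w\'. ]*')

def tally_speakers(xml_text):
    """Count how often each who value occurs on <said> tags."""
    return Counter(WHO_PATTERN.findall(xml_text))

def tally_files(paths):
    """Sum the counts over several episode files."""
    total = Counter()
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            total += tally_speakers(handle.read())
    return total

def rare_values(counts, threshold=1):
    """Values seen at most `threshold` times: likely typos or minor speakers."""
    return sorted(name for name, n in counts.items() if n <= threshold)

# Placeholder sample with one deliberate misspelling ("Stephan").
sample = (
    '<said who="Stephen">―First line.</said> '
    '<said who="Mulligan">―Second line.</said> '
    '<said who="Stephen">―Third line.</said> '
    '<said who="Stephan">―Fourth line.</said>'
)
counts = tally_speakers(sample)
suspects = rare_values(counts)
```

Sorting by frequency rather than alphabetically tends to push misspellings to the bottom of the list, next to genuine one-line speakers, so they still need a human eye.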

Could we also assign, or let people claim, episodes to mark up with dialogue on this, or another issue? I am going to tackle another episode as soon as I can, and want to avoid reduplicating labor.

@yellwork
Collaborator Author

yellwork commented Feb 9, 2017

Good idea. Can we formally assign them or do we just call dibs here?

After you started <said> tagging, Chris, I snagged a lot of the low-hanging fruit (the less chatty, shorter episodes). Claiming the longer ones now makes sense because they’re likely to take a considerable bit of time to mark up.

Those ack commands will come in very handy once we start figuring out the speaking parts.

@yellwork
Collaborator Author

yellwork commented Feb 9, 2017

Going to do the @who attribution on “Proteus” now.

@yellwork
Collaborator Author

yellwork commented Feb 9, 2017

Going to tackle @who on “Aeolus” now.

@JonathanReeve
Member

JonathanReeve commented Feb 15, 2017

@c-forster, that ack hack is great. I use ag, "The Silver Searcher," myself, and was able to get it to work the same way using ag --nofilename -o "(?<=<said who=\")[\w\'\. ]*" *.xml | sort | uniq. I'll put this into a makefile so that we can run these sorts of things easily.

yellwork added a commit that referenced this issue Feb 16, 2017
Caught/tweaked a few <foreign> as well.
yellwork added a commit that referenced this issue Feb 16, 2017
yellwork added a commit that referenced this issue Feb 16, 2017
yellwork added a commit that referenced this issue Feb 22, 2017
Also <emph> to <quote> or <name>
yellwork added a commit that referenced this issue Mar 2, 2017
yellwork added a commit that referenced this issue Mar 5, 2017
@yellwork
Collaborator Author

yellwork commented Nov 2, 2017

I’m simplifying this. A <listPerson> for the entire novel would be incredible, but … too much work for now. So I’m going to switch all @who values to character initials and put the key in the separate plaintext file persons.txt.

yellwork added a commit that referenced this issue Nov 2, 2017
@JonathanReeve
Member

JonathanReeve commented Nov 3, 2017

Sounds good. I'm not seeing the key in persons.txt, though? Anyway when it's there, if it's in some kind of regular format, like comma- or tab-separated, then it'll be easy to make a list of these keys to add to the header.

@yellwork
Collaborator Author

yellwork commented Nov 3, 2017

I’m doing it all offline while I go through all eighteen episodes. I’ll merge them all into the repository once done.

My local persons.txt looks like this:

db [tab]Davy Byrne
dbc [tab]Davy Byrne's curate
dbm [tab]D.B. Murphy
dd [tab]Dan Dawson
did [tab]Dilly Dedalus

That could be the basis for a <listPerson> – information I’d love to see added but too much for us right now (I feel).
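As a rough idea of how little it might take to get from that file to a skeleton `<listPerson>`, here is a sketch. It assumes each line is `key<TAB>full name` with a real tab character; the function names and the choice of one `<persName>` per `<person>` are illustrative assumptions, not an agreed encoding, and the output carries no TEI namespace.

```python
import xml.etree.ElementTree as ET

# Clark notation for xml:id; ElementTree serialises it as xml:id.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

def parse_persons(text):
    """Map each key to its full name from tab-separated lines."""
    persons = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        key, name = line.split("\t", 1)
        persons[key.strip()] = name.strip()
    return persons

def build_list_person(persons):
    """Build a bare <listPerson> with one <person> per key."""
    list_person = ET.Element("listPerson")
    for key, name in persons.items():
        person = ET.SubElement(list_person, "person")
        person.set(XML_ID, key)
        ET.SubElement(person, "persName").text = name
    return list_person

sample = "db \tDavy Byrne\ndbc \tDavy Byrne's curate\ndbm \tD.B. Murphy\n"
persons = parse_persons(sample)
list_person = build_list_person(persons)
xml_out = ET.tostring(list_person, encoding="unicode")
```

With `xml:id` values in place, the `who` attributes would become pointers such as `who="#db"`, which is the switch described in the task list.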

@JonathanReeve
Member

JonathanReeve commented Nov 3, 2017

@yellwork
Collaborator Author

yellwork commented Nov 3, 2017

Some content that was marooned in the closed #9 was your suggestion, Jonathan, for unclear @who values. Something like:

<lb n="060004"/><said xml:id="060004-a" who="Cunningham">―Come on, Simon.
<certainty target="#060004-a" match="@who" locus="value" assertedValue="Power" degree="0.5">
    <desc>It's unclear here whether it's Cunningham or Power speaking.</desc>
</certainty> 
</said>

I’m going to go ahead and use this encoding whenever an unclear speaker is limited to a handful of candidates. Unless you’ve another idea?

yellwork added a commit that referenced this issue Nov 4, 2017
persons.txt contains a list of all speakers in the novel.
yellwork added a commit that referenced this issue Nov 4, 2017
Several lgs nested in quote or said/quote.
Added speaker ambiguity at U 01.671.
yellwork added a commit that referenced this issue Nov 4, 2017
yellwork added a commit that referenced this issue Nov 4, 2017
Has some lingering unclear speakers (see U 6.116–118 and 6.139, 6.215,
6.384).
yellwork added a commit that referenced this issue Nov 4, 2017
yellwork added a commit that referenced this issue Nov 4, 2017
There are some lingering unclears in these episodes.
yellwork added a commit that referenced this issue Nov 4, 2017
yellwork added a commit that referenced this issue Nov 4, 2017
@JonathanReeve
Member

JonathanReeve commented Nov 5, 2017

@yellwork
Collaborator Author

yellwork commented Nov 5, 2017

How do we attribute dialogue in an exchange between several people? There’s a spot like this in Hades where no speakers are given for several lines of dialogue:

<lb n="060114"/><said who="lb">―I met M'Coy this morning,</said> Mr Bloom said. <said who="lb">He said he'd try to come.</said></p>
<p><lb n="060115"/>The carriage halted short.
<lb n="060116"/><said who="unclear">―What's wrong?</said>
<lb n="060117"/><said who="unclear">―We're stopped.</said>
<lb n="060118"/><said who="unclear">―Where are we?</said></p>
<p><lb n="060119"/>Mr Bloom put his head out of the window.
<lb n="060120"/><said who="lb">―The grand canal,</said> he said.</p>

The unclears can only be Cunningham, Power or Simon Dedalus (with Bloom, perhaps, chiming in at U 6.117). How best would that be encoded?

@JonathanReeve
Member

JonathanReeve commented Nov 8, 2017

I read the TEI docs on <certainty> again but this is the best I could think of:

<lb n="060114"/><said who="lb">―I met M'Coy this morning,</said> Mr Bloom said. <said who="lb">He said he'd try to come.</said></p>
<p><lb n="060115"/>The carriage halted short.
<lb n="060116"/><said who="unclear">―What's wrong?
<certainty match="@who" locus="value" assertedValue="Power" degree="0.33" />
<certainty match="@who" locus="value" assertedValue="Cunningham" degree="0.33" />
<certainty match="@who" locus="value" assertedValue="Simon Dedalus" degree="0.33" /> 
</said>
<lb n="060117"/><said who="unclear">―We're stopped.
<certainty match="@who" locus="value" assertedValue="Power" degree="0.33" />
<certainty match="@who" locus="value" assertedValue="Cunningham" degree="0.33" />
<certainty match="@who" locus="value" assertedValue="Simon Dedalus" degree="0.33" /> 
</said>
<lb n="060118"/><said who="unclear">―Where are we?
<certainty match="@who" locus="value" assertedValue="Power" degree="0.33" />
<certainty match="@who" locus="value" assertedValue="Cunningham" degree="0.33" />
<certainty match="@who" locus="value" assertedValue="Simon Dedalus" degree="0.33" /> 
</said></p>
<p><lb n="060119"/>Mr Bloom put his head out of the window.
<lb n="060120"/><said who="lb">―The grand canal,</said> he said.</p>

...which is super kludgey and not very DRY. Ideally we could do target="#060116 #060117 #060118" on a single <certainty> set, and avoid all this repetition, but it doesn't look like XML can handle multiple attribute values.
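Until there is a tidier encoding, the repetition could at least be generated rather than typed by hand. The following is a hedged sketch only: `add_equal_certainty` is a hypothetical helper, the equal split of `degree` simply mirrors the 0.33 values above, and the fragment is namespace-free, whereas real TEI files would need the TEI namespace in the tag names.

```python
import xml.etree.ElementTree as ET

def add_equal_certainty(said, candidates):
    """Append one <certainty> per candidate, splitting the degree equally."""
    degree = f"{1 / len(candidates):.2f}"
    for name in candidates:
        ET.SubElement(said, "certainty", {
            "match": "@who",
            "locus": "value",
            "assertedValue": name,
            "degree": degree,
        })

fragment = (
    '<p><lb n="060116"/><said who="unclear">―What\'s wrong?</said>'
    '<lb n="060117"/><said who="unclear">―We\'re stopped.</said>'
    '<lb n="060118"/><said who="unclear">―Where are we?</said></p>'
)
root = ET.fromstring(fragment)

# Snapshot the matches first so the tree is not walked while it grows.
for said in list(root.iter("said")):
    if said.get("who") == "unclear":
        add_equal_certainty(said, ["Power", "Cunningham", "Simon Dedalus"])

result = ET.tostring(root, encoding="unicode")
```

The markup itself stays just as verbose, but the candidate list lives in one place, so adding Bloom as a fourth possibility for a single line would be a one-word change.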

@tcatapano, any ideas?

yellwork added a commit that referenced this issue Nov 17, 2017
yellwork added a commit that referenced this issue Nov 17, 2017