Using the solidus to separate morpheme segments is against OSIS philosophy #50

Open
DavidHaslam opened this issue Dec 30, 2017 · 3 comments

Comments

@DavidHaslam

DavidHaslam commented Dec 30, 2017

The general philosophy of OSIS is to use XML elements for all the semantic markup.

Using the solidus within the text to separate morpheme segments within Hebrew words goes against this OSIS philosophy. One friend has described this as "bad, bad, very bad".

cf. The XML files for the CrossWire WLC module conform more closely to this principle, in that they use the XML seg element for this purpose. The original data was obtained from the website tanach.us, but further preprocessing was done before building the latest version of the module, which differs from its earliest version in this respect.

e.g. The mod2imp output of the CrossWire WLC module generally looks like this:

$$$Genesis 1:1
<w><seg type="x-morph">בְּ</seg><seg type="x-morph">רֵאשִׁ֖ית</seg> </w>
<w><seg type="x-morph">בָּרָ֣א</seg> </w>
<w><seg type="x-morph">אֱלֹהִ֑ים</seg> </w>
<w><seg type="x-morph">אֵ֥ת</seg> </w>
<w><seg type="x-morph">הַ</seg><seg type="x-morph">שָּׁמַ֖יִם</seg> </w>
<w><seg type="x-morph">וְ</seg><seg type="x-morph">אֵ֥ת</seg> </w>
<w><seg type="x-morph">הָ</seg><seg type="x-morph">אָֽרֶץ</seg> </w>
<w type="x-sofpasuq">׃ </w>

NB. In this extract, the output was also converted to Word Per Line format afterwards.
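
To illustrate why element-based markup is easier to consume, here is a minimal Python sketch that reads one word line from the extract above with the standard library XML parser. It is only an illustration, not part of any CrossWire or OSHB tooling; the variable names are my own.

```python
import xml.etree.ElementTree as ET

# One word line copied from the extract above.
line = '<w><seg type="x-morph">בְּ</seg><seg type="x-morph">רֵאשִׁ֖ית</seg> </w>'

w = ET.fromstring(line)

# Each morpheme is its own element, so no in-text separator has to be parsed.
morphemes = [seg.text for seg in w.findall("seg[@type='x-morph']")]

# The word as it is read, with no separator characters mixed into the text.
surface = "".join(morphemes)

print(len(morphemes), surface)
```

With the solidus convention, the same information can only be recovered by splitting the text content on "/", which means the text node no longer holds the plain Hebrew word.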

Aside: That is not to say that the WLC module is perfect.
Irrespective of any text-critical issues, at least these mistakes were made when it was first built.

  1. The Hebrew text should not have been normalized to NFC.
  2. There should not be a space either before or after each MAQAF.
  3. The space between Hebrew words should be outside the w elements.

These are not your responsibility. I mention them merely in passing.

Those defects were rectified in the WLC module after I created this issue in 2017.
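
For anyone unsure what the three points in the aside mean in practice, here is a rough Python sketch. It assumes a source text that stores the dagesh before the vowel point; the helper names `tighten_maqaf` and `move_space_outside` are hypothetical, and the regular expressions are deliberately simplistic.

```python
import re
import unicodedata

MAQAF = "\u05BE"  # HEBREW PUNCTUATION MAQAF

# 1. NFC reorders Hebrew combining marks by canonical combining class,
#    so bet + dagesh + sheva comes out as bet + sheva + dagesh.
typed = "\u05D1\u05BC\u05B0"
nfc = unicodedata.normalize("NFC", typed)
print([f"U+{ord(c):04X}" for c in typed])
print([f"U+{ord(c):04X}" for c in nfc])

# 2. Remove any whitespace on either side of a maqaf.
def tighten_maqaf(text):
    return re.sub(rf"\s*{MAQAF}\s*", MAQAF, text)

# 3. Move a trailing space from inside a <w> element to just after it.
def move_space_outside(xml_line):
    return re.sub(r"\s+</w>", "</w> ", xml_line)

print(move_space_outside('<w><seg type="x-morph">בָּרָ֣א</seg> </w>'))
```

The first part shows that normalization changes the stored code point sequence even though the rendered word may look the same, which is why a normalized text no longer matches its source character for character.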

@DavidHaslam
Author

@dowens76 @DavidTroidl

Does nobody involved in this project take any notice of issues?

This was posted in December 2017 so what's going on?

@jag3773
Member

jag3773 commented Jul 26, 2023

Hi @DavidHaslam, I suspect many people agree with you on that, myself included. Making such a change in the text as it is now would certainly cause all sorts of backwards incompatibility issues.

I'd be in favor of offering an alternate version of the files in the repo that has the fields separated according to OSIS philosophy. If you want to put in a PR with the changes you suggest, I think we'd be willing to incorporate it.
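
As a starting point for such an alternate version, the conversion could be sketched roughly as below. This is only an outline under assumptions: the fragment is a simplified, namespace-free stand-in for the real files, `split_morphemes` is a hypothetical helper, and any attributes on the w elements are left untouched.

```python
import xml.etree.ElementTree as ET

def split_morphemes(root, sep="/"):
    """Replace the text of each <w> with one <seg type="x-morph"> per morpheme."""
    # Copy the matches into a list first, since the tree is modified in the loop.
    for w in list(root.iter("w")):
        text = w.text
        if not text or len(w):
            continue  # empty word, or it already has child elements
        w.text = None
        for part in text.split(sep):
            seg = ET.SubElement(w, "seg", {"type": "x-morph"})
            seg.text = part
    return root

# Simplified fragment (assumption): real files carry a namespace and attributes.
fragment = '<verse osisID="Gen.1.1"><w>בְּ/רֵאשִׁ֖ית</w><w>בָּרָ֣א</w></verse>'
root = split_morphemes(ET.fromstring(fragment))
print(ET.tostring(root, encoding="unicode"))
```

Writing the result to separate files would leave the existing solidus-based files as they are, which avoids the backwards incompatibility concern.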

@DavidHaslam
Author

@jag3773

Since I added this issue in 2017, the website tanach.us has had a change of title.

  • Instead of Westminster Leningrad Codex
  • it's now Unicode XML Leningrad Codex

There are other significant changes, but one relevant to this issue is that the solidus (/) markers that used to separate morphological segments have all been removed!
