<img align="right" src="images/tf-small.png"/>

# Strong numbers

# Application: Strong numbers
Stephen Ku has prepared a Strong number mapping for version `4`, based on 
[OpenScriptures Bible Lexicon](https://github.com/openscriptures/HebrewLexicon).

This provides us with a nice use case:
can we apply the Strong number mapping for version `4` to versions `4b` and `4c`
as well?
Below we will get a pretty good view on the differences between the versions.
We use the
[BHSA transcription](https://shebanq.ancient-data.org/shebanq/static/docs/BHSA-transcription.pdf)
to write down the diffs.

In [None]:
import os
import collections
from tf.fabric import Fabric

We need a map from a version to its previous version.
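As a minimal sketch of what such a map looks like, the version list can be paired with itself shifted by one position. The name `preVersionSketch` is only illustrative and is not used elsewhere in this notebook.

```python
# Versions in chronological order, as used in this notebook.
versions = ['4', '4b', '4c']

# Pair every version (except the first) with the one that precedes it.
# zip stops at the shorter sequence, so the last version is never a value.
preVersionSketch = dict(zip(versions[1:], versions))

print(preVersionSketch)
```

The first version has no predecessor, so it does not occur as a key.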

In [None]:
versions = ['4', '4b', '4c']
locations = {
    '4': '~/github/text-fabric-data-legacy',
    '4b': '~/github/text-fabric-data-legacy',
    '4c': '~/github/text-fabric-data', 
}

preVersion = dict(((v, versions[i]) for (i,v) in enumerate(versions[1:])))
preVersion

Load all versions in one go!
For each version we load the `omap` feature that maps the slots from the previous version to the slots of this version.

In [None]:
TF = {}
api = {}
for v in versions:
    omap = '' if v == '4' else 'omap@{}-{}'.format(preVersion[v], v)
    TF[v] = Fabric(locations=locations[v], modules='hebrew/etcbc{}'.format(v))
    api[v] = TF[v].load('''
        {} lex
    '''.format(omap))

A4 = api['4']
A4b = api['4b']
A4c = api['4c']

# Strong numbers

Let us apply the maps to assign Strong numbers to the words of versions `4b` and `4c`.
We have a mapping for version `4`, compiled as a CSV file by Stephen Ku from the OpenScriptures data.
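The reading step can be tried out on an in-memory sample first. The sketch below assumes the layout that the reading code in this notebook relies on: one header line, then a slot number and a Strong value separated by the first comma. The header names and the values `S1`, `S2` are made-up placeholders, not real data, and `readStrongs` is a hypothetical helper.

```python
import io

# Made-up sample in the assumed layout: header line, then "slot,strong".
sample = "slot,strong\n1,S1\n2,S2\n3,S1\n"


def readStrongs(fh):
    """Read slot -> Strong value pairs from an open text file handle."""
    result = {}
    next(fh, None)  # skip the header line
    for line in fh:
        line = line.rstrip()
        if not line:
            continue  # tolerate blank lines
        # Split on the first comma only, so the value may contain commas.
        (slot, _, strong) = line.partition(',')
        result[int(slot)] = strong
    return result


toyStrongs = readStrongs(io.StringIO(sample))
print(toyStrongs)
```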

First we perform a basic check on the Strong numbers as provided for version 4.

In [None]:
STRONG = 'hebrew/strong'
strongDir = '{}/{}'.format(os.path.expanduser(locations['4c']), STRONG)
strongFile = '{}/{}'.format(strongDir, 'MonadStrong.csv')
strongs = {}

In [None]:
strongs['4'] = {}
first = True
with open(strongFile, encoding='utf-16') as fh:
    for line in fh:
        if first:
            first = False
            continue
        (slot, strong) = line.rstrip().split(',', 1)
        strongs['4'][int(slot)] = strong

## Consistency check

Do slots with the same lexeme get identical Strong numbers?

In [None]:
def checkConsistency(v):
    strongFromLex = collections.defaultdict(set)
    lexFromStrong = collections.defaultdict(set)

    for n in api[v].F.otype.s('word'):
        if n in strongs[v]:
            strongFromLex[api[v].F.lex.v(n)].add(strongs[v][n])
            lexFromStrong[strongs[v][n]].add(api[v].F.lex.v(n))


    multipleStrongs = set()
    for (lx, strongset) in strongFromLex.items():
        if len(strongset) > 1:
            multipleStrongs.add(lx)

    multipleLexs = set()
    for (st, lexset) in lexFromStrong.items():
        if len(lexset) > 1:
            multipleLexs.add(st)

    print('{} lexemes with multiple Strong numbers'.format(len(multipleStrongs)))
    print('{} Strong numbers with multiple lexemes'.format(len(multipleLexs)))
    for lx in sorted(multipleStrongs)[0:10]:
        print('{}: {}'.format(lx, ', '.join(sorted(strongFromLex[lx]))))

In [None]:
checkConsistency('4')

Obviously not. The ETCBC lexemes and the Strong numbers are different classification systems for word occurrences in the Bible!
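To make the two directions of this check concrete, here is a small sketch on toy data. The lexemes `A`, `B`, `C`, the values `S1` to `S3`, and the helper `ambiguous` are all hypothetical; the sketch only illustrates the idea and is not tied to the Text-Fabric API.

```python
import collections

# Toy data: slot -> lexeme and slot -> Strong value (made-up placeholders).
toyLex = {1: 'A', 2: 'A', 3: 'B', 4: 'C'}
toyStrong = {1: 'S1', 2: 'S2', 3: 'S3', 4: 'S3'}


def ambiguous(keyOf, valueOf):
    """Return the keys that co-occur with more than one distinct value."""
    values = collections.defaultdict(set)
    for n in keyOf:
        if n in valueOf:
            values[keyOf[n]].add(valueOf[n])
    return {k for (k, vs) in values.items() if len(vs) > 1}


multiStrong = ambiguous(toyLex, toyStrong)  # lexemes with several Strong values
multiLex = ambiguous(toyStrong, toyLex)     # Strong values with several lexemes

print(multiStrong, multiLex)
```

In the toy data, lexeme `A` carries two Strong values and `S3` covers two lexemes, so neither classification refines the other.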

# Map the Strong numbers
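The transfer itself is a simple composition: every slot that carries a Strong number passes it on to each slot it maps to in the next version. The sketch below shows this with a plain dictionary standing in for the `omap` edge feature. The dictionary `toyOmap` and all values are made up for illustration.

```python
# Strong values per slot in the older version (made-up placeholders).
toyStrongs4 = {1: 'S1', 2: 'S2', 3: 'S3'}

# Hypothetical slot map: old slot -> slots in the newer version.
# Slot 2 maps to two new slots; slot 3 has no counterpart.
toyOmap = {1: [1], 2: [2, 3], 3: []}

toyStrongs4b = {}
for (n, s) in toyStrongs4.items():
    for m in toyOmap.get(n, ()):
        toyStrongs4b[m] = s

print(toyStrongs4b)
```

Slots without a counterpart in the newer version lose their annotation, and a slot that maps to several new slots hands the same value to each of them.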

In [None]:
strongs['4b'] = {}
for (n, s) in strongs['4'].items():
    for m in A4b.Es('omap@4-4b').f(n):
        strongs['4b'][m] = s

In [None]:
strongs['4c'] = {}
for (n, s) in strongs['4b'].items():
    for m in A4c.Es('omap@4b-4c').f(n):
        strongs['4c'][m] = s

# Check consistency again

Now in the new versions.

In [None]:
checkConsistency('4b')

In [None]:
checkConsistency('4c')

That looks good.

# Writing the Strong numbers

In [None]:
nodeFeatures = {}
provenance = dict(
    source='Strong numbers provided by https://github.com/openscriptures/HebrewLexicon',
    author='Compiled for ETCBC by Stephen Ku; transferred across versions by Dirk Roorda',
)

for v in versions:
    metaData = {
        '': provenance,
        'otext@strong': {
            'about': 'Provides Strong numbers to Hebrew Words',
            'see': 'https://github.com/ETCBC/text-fabric/blob/master/Versions/strong.ipynb',
            'fmt:lex-strong-plain': '{strong} ',
        },
        'strong': {
            'valueType': 'str',
        },
    }
    nodeFeatures = dict(strong=strongs[v])
    TF[v].save(
        module='hebrew/strong/{}'.format(v),
        nodeFeatures=nodeFeatures,
        metaData=metaData,
    )

# Using Strong numbers

Let us load the new `strong` feature in the newest ETCBC version, `4c`.

In [None]:
TF = Fabric(modules=['hebrew/etcbc4c', 'hebrew/strong/4c'])
api = TF.load('''
        g_word_utf8
        lex strong
''')
api.makeAvailableIn(globals())

We print the first three verses of Genesis 1 in lexeme and in Strong representation.
The module `strong` defines a new text format!

In [None]:
(book, chapter) = ('Genesis', 1)

for verse in range(1,4):
    vn = T.nodeFromSection((book, chapter, verse))
    words = L.d(vn, otype='word')
    for fmt in ('lex-trans-plain', 'lex-strong-plain'):
        print('{} {}:{} ({})\n\t{}'.format(
            book, chapter, verse, fmt,
            T.text(words, fmt=fmt)
        ))

# Divergence between lexemes and Strong

As we noted when constructing the Strong features, there is no 1-1 correspondence between ETCBC lexemes and Strong numbers. Let us inspect a few cases where they diverge.

We reimplement something like `checkConsistency()` above, but now based on the loaded `strong` feature.
We collect the slots that exhibit one lexeme with several Strong numbers and vice versa.

So let's just collect all relevant information.

In [None]:
strongLex = collections.defaultdict(lambda: collections.defaultdict(set))
lexStrong = collections.defaultdict(lambda: collections.defaultdict(set))

indent(reset=True)
info('Gathering lexemes and Strongs')
for n in F.otype.s('word'):
    lex = F.lex.v(n)
    sng = F.strong.v(n)
    if sng is not None:
        strongLex[sng][lex].add(n)
        lexStrong[lex][sng].add(n)
info('Done: {} lexemes and {} Strongs'.format(len(lexStrong), len(strongLex)))

Now rank the lexemes by the number of Strongs they are associated with, and the Strongs by the number of lexemes they
are associated with.
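A minimal sketch of this ranking on toy data follows. It assumes the same nested shape as the collections gathered above (key, then related value, then set of slots). All names and values are made up, and the extra tie-break on the key is an addition that keeps the order deterministic.

```python
# Toy data: lexeme -> Strong value -> set of slots (made-up placeholders).
toyLexStrong = {
    'A': {'S1': {1}, 'S2': {2}},
    'B': {'S3': {3}},
    'C': {'S3': {4}, 'S4': {5}, 'S5': {6}},
}

# Most related values first; ties are broken alphabetically by key.
toyRanked = sorted(toyLexStrong, key=lambda x: (-len(toyLexStrong[x]), x))

print(toyRanked)
```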

In [None]:
lexRanked = sorted(lexStrong, key=lambda x: -len(lexStrong[x]))
sngRanked = sorted(strongLex, key=lambda x: -len(strongLex[x]))

Inspect the top 10 of both.

In [None]:
def inspectTop(dataRanked, data, amount):
    for d in dataRanked[0:amount]:
        print(d)
        related = data[d]
        for r in related:
            occs = sortNodes(related[r])
            print('\t{} ({} occs)'.format(r, len(occs)))
            n = occs[0]
            s = L.u(n, otype='sentence')[0]
            ws = L.d(s, otype='word')
            print('\t\te.g. {} {}:{} - {} in {}'.format(
                *T.sectionFromNode(n),
                F.g_word_utf8.v(n),
                T.text(ws, fmt='text-orig-full'),
            ))

In [None]:
inspectTop(lexRanked, lexStrong, 10)

In [None]:
inspectTop(sngRanked, strongLex, 10)