# 3.3 Looking at the Lexical Vocabulary from the Perspective of the Literary Material

In section 3.2 we asked whether we can see differences between Old Babylonian literary compositions in their usage of vocabulary (lemmas and MWEs) attested in the lexical corpus. In this notebook we will change perspective and ask: are there particular lexical texts (or groups of lexical texts) that show a greater engagement with literary vocabulary than others?

In large part, this notebook uses the same techniques and the same code as section 3.2, and the reader is referred there for further explanation. In some respects, however, the process is different.

In [1]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning) # this suppresses a warning about pandas from tqdm
import pandas as pd
from ipywidgets import interact
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from tqdm.auto import tqdm
tqdm.pandas() # initiate pandas support in tqdm, allowing progress_apply() and progress_map()
from nltk import trigrams, bigrams
import zipfile
import json

In [2]:
lit_lines = pd.read_pickle('output/litlines.p')
lit_lines

Unnamed: 0,id_text,id_line,lemma,lemma_mwe
0,epsd2/literary/P209784,4,niŋšu[goods]n ŋal[be]v/i,niŋšu[goods]n ŋal[be]v/i
1,epsd2/literary/P209784,5,ibi[smoke]n,ibi[smoke]n
2,epsd2/literary/P209784,6,an[sky]n e[leave]v/i,an[sky]n e[leave]v/i
3,epsd2/literary/P251427,4,x[na]na x-x[na]na gal[big]v/i anki[universe]n ...,x[na]na x-x[na]na gal[big]v/i anki[universe]n ...
4,epsd2/literary/P251427,5,utu[1]dn nirŋal[authoritative]aj dumu[child]n ...,utu[1]dn nirŋal[authoritative]aj dumu[child]n ...
...,...,...,...,...
44196,epsd2/literary/X010001,58,lal[syrup]n ŋeštin[vine]n ulušin[beer]n kurun[...,lal[syrup]n ŋeštin[vine]n ulušin[beer]n kurun[...
44197,epsd2/literary/X010001,60,kirugu[notation]n ešakamak[third]nu,kirugu[notation]n ešakamak[third]nu
44198,epsd2/literary/X010001,62,šid[count]v/t 4(u)[na]na 9(diš)[na]na mu[name]n,šid[count]v/t 4(u)[na]na 9(diš)[na]na mu[name]n
44199,epsd2/literary/X010001,63,širnamšubak[subscript]n gula[1]dn,širnamšubak[subscript]n gula[1]dn


In [3]:
with open('output/lit_lex_vocab.txt', 'r', encoding = 'utf8') as l:
    lit_lex_vocab = l.read().splitlines()

Make ngrams: unigrams, bigrams, and trigrams. Represent bigrams and trigrams as MWEs, connected by underscores. Create a full list of all lemmas and ngrams, omitting all non-lemmatized words (or ngrams that include non-lemmatized words).

In [4]:
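def make_ngrams(lemmas):
    """Add the unigrams, bigrams, and trigrams of one line to the global list lit_vocab."""
    lemmas = lemmas.split()
    # nltk's bigrams() and trigrams() yield tuples of consecutive lemmas
    lemmas_n = list(bigrams(lemmas)) + list(trigrams(lemmas))
    lemmas_n = ['_'.join(lem) for lem in lemmas_n]  # represent n-grams as MWEs
    lemmas = set(lemmas + lemmas_n)
    # drop non-lemmatized words and every n-gram that contains one
    lemmas = [lem for lem in lemmas if '[na]na' not in lem]
    lit_vocab.extend(lemmas)  # lit_vocab is initialized in the next cell
    return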

In [5]:
lit_vocab = []
lit_lines['lemma'].progress_apply(make_ngrams)
lit_vocab = list(set(lit_vocab))
lit_vocab.sort()
lit_vocab[:25]


['a.igi.lu[boatman]n',
 'a.igi.lu[boatman]n_šir[song]n',
 'a.igi.lu[boatman]n_šir[song]n_dug[good]v/i',
 'a.zi&zi.lagab[grass]n',
 'a.zi&zi.lagab[grass]n_a[water]n',
 'a.zi&zi.lagab[grass]n_a[water]n_de[pour]v/t',
 'a.zi&zi.lagab[grass]n_duašaga[1]sn',
 'a.zi&zi.lagab[grass]n_e[speak]v/t',
 'a.zi&zi.lagab[grass]n_gid[long]v/i',
 'a.zi&zi.lagab[grass]n_gid[long]v/i_ašag[field]n',
 'a.zi&zi.lagab[grass]n_mu[grow]v/i',
 'a.zi&zi.lagab[grass]n_munud[bed]n',
 'a[arm]n',
 'a[arm]n_ak[do]v/t',
 'a[arm]n_ak[do]v/t_enlil[1]dn',
 'a[arm]n_al[cvne]n',
 'a[arm]n_al[cvne]n_e[speak]v/t',
 'a[arm]n_ala[manacles]n',
 'a[arm]n_ala[manacles]n_la[hang]v/t',
 'a[arm]n_an[1]dn',
 'a[arm]n_an[1]dn_šum[give]v/t',
 'a[arm]n_an[sky]n',
 'a[arm]n_an[sky]n_bad[open]v/t',
 'a[arm]n_ana[what?]qp',
 'a[arm]n_ana[what?]qp_si[fill]v/t']

> Note: This step can also be done with `CountVectorizer`, setting `ngram_range = (1, 3)`. Disadvantages of that approach:
> - we do not need a full DTM for the literary corpus
> - the DTM has to be built on *lines* instead of *documents*, to prevent words from consecutive lines from forming bigrams or trigrams; afterwards, `groupby` and `agg` are needed to get a DTM at the document level
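
The line-based `CountVectorizer` approach described in the note can be sketched as follows. This is only an illustration on invented toy data: the frame `toy`, its ids, and the single-letter "lemmas" are assumptions, not part of the corpus. The DTM is built on lines, so no n-gram crosses a line boundary, and the counts are then summed per document.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data: two documents, three lines in total.
toy = pd.DataFrame({
    'id_text': ['T1', 'T1', 'T2'],
    'lemma': ['a b', 'c d', 'a b c'],
})

# Tokens are separated by spaces; case is left untouched.
cv_lines = CountVectorizer(token_pattern=r'[^ ]+', ngram_range=(1, 3), lowercase=False)
line_dtm = cv_lines.fit_transform(toy['lemma'])

# vocabulary_ maps each term to its column index; sort the terms by that index.
terms = sorted(cv_lines.vocabulary_, key=cv_lines.vocabulary_.get)
line_df = pd.DataFrame(line_dtm.toarray(), columns=terms, index=toy['id_text'])

# Sum the line-level counts to get one row per document.
doc_df = line_df.groupby(level='id_text').sum()
print(doc_df)
```

In this toy case the bigram `b c` is counted for `T2`, where both words stand on one line, but not for `T1`, where they stand on consecutive lines. Note that `CountVectorizer` joins n-grams with spaces rather than underscores.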


# Read Lexical Corpus

In [6]:
lex_lines = pd.read_pickle('output/lexlines.p')
lex_lines['lemma'] = [lemma.replace(' ', '_') for lemma in lex_lines['lemma']]
lex_lines = lex_lines.loc[~lex_lines.lemma.str.contains(r'\[na\]na')].copy() # raw string for the regex; copy() so that later assignments do not act on a view
lex_lines

Unnamed: 0,id_text,id_line,lemma
0,dcclt/P117394,2,kid[mat]n
1,dcclt/P117394,3,kid[mat]n_andul[shade]n
2,dcclt/P117394,4,kid[mat]n_antadul[cloak]n
3,dcclt/P117395,2,ŋeše[key]n
4,dcclt/P117395,3,pakud[~tree]n
...,...,...,...
69313,dcclt/signlists/Q000056,531,gakkul[mash-tub]n
69315,dcclt/signlists/Q000056,534,kilib[total]n
69316,dcclt/signlists/Q000056,535,šuniŋin[total]n
69317,dcclt/signlists/Q000056,536,šuniŋin[total]n


### Special Case: OB Nippur Ura 6
The sixth chapter of the Old Babylonian Nippur version of the thematic list Ura deals with foodstuffs and drinks. This chapter was not standardized (each exemplar has its own order of items and sections) and therefore no composite text has been created in [DCCLT](http://oracc.org/dcclt). Instead, the "composite" of [OB Nippur Ura 6](http://oracc.org/dcclt/Q000043) consists of the concatenation of all known Nippur exemplars of the list of foodstuffs. In our current dataframe, therefore, there are no lines where the field `id_text` equals "dcclt/Q000043".

We create a "composite" by changing the field `id_text` in all exemplars of [OB Nippur Ura 6](http://oracc.org/dcclt/Q000043) to "dcclt/Q000043". 

In [7]:
Ura6 = ["dcclt/P227657",
"dcclt/P227743",
"dcclt/P227791",
"dcclt/P227799",
"dcclt/P227925",
"dcclt/P227927",
"dcclt/P227958",
"dcclt/P227967",
"dcclt/P227979",
"dcclt/P228005",
"dcclt/P228008",
"dcclt/P228200",
"dcclt/P228359",
"dcclt/P228368",
"dcclt/P228488",
"dcclt/P228553",
"dcclt/P228562",
"dcclt/P228663",
"dcclt/P228726",
"dcclt/P228831",
"dcclt/P228928",
"dcclt/P229015",
"dcclt/P229093",
"dcclt/P229119",
"dcclt/P229304",
"dcclt/P229332",
"dcclt/P229350",
"dcclt/P229351",
"dcclt/P229352",
"dcclt/P229353",
"dcclt/P229354",
"dcclt/P229356",
"dcclt/P229357",
"dcclt/P229358",
"dcclt/P229359",
"dcclt/P229360",
"dcclt/P229361",
"dcclt/P229362",
"dcclt/P229365",
"dcclt/P229366",
"dcclt/P229367",
"dcclt/P229890",
"dcclt/P229925",
"dcclt/P230066",
"dcclt/P230208",
"dcclt/P230230",
"dcclt/P230530",
"dcclt/P230586",
"dcclt/P231095",
"dcclt/P231128",
"dcclt/P231424",
"dcclt/P231446",
"dcclt/P231453",
"dcclt/P231458",
"dcclt/P231742",
"dcclt/P266520"]
lex_lines.loc[lex_lines["id_text"].isin(Ura6), "id_text"] = "dcclt/Q000043"

In [8]:
lex_comp = lex_lines.groupby(['id_text']).agg({'lemma': ' '.join}).reset_index()
lex_comp

Unnamed: 0,id_text,lemma
0,dcclt/P117394,kid[mat]n kid[mat]n_andul[shade]n kid[mat]n_an...
1,dcclt/P117395,ŋeše[key]n pakud[~tree]n raba[clamp]n
2,dcclt/P117396,hašhur[apple]n hašhur[apple]n_baza[dwarf]n haš...
3,dcclt/P117397,laqipu[1]dn ninkugnunak[1]dn ninagrunak[1]dn
4,dcclt/P117404,ig[door]n_eren[cedar]n ig[door]n_dib[board]n i...
...,...,...
756,dcclt/signlists/P333171,nun[object]n nun[prince]n nun[object]n gurud[t...
757,dcclt/signlists/P447993,ba[allot]v/t zaʾe[you]ip ŋaʾe[i]ip sag[good]v/...
758,dcclt/signlists/P447994,zah[disappear]v/i zah[disappear]v/i zah[disapp...
759,dcclt/signlists/P447997,lahar[ewe]n sag[good]v/i ne[brazier]n zah[mark...


Since the data are drawn from multiple (sub)projects, it is possible that there are duplicates. We take the version with the largest number of (lemmatized) words.

In [9]:
lex_comp['id_text'] = [i[-7:] for i in lex_comp['id_text']]
lex_comp['length'] = [len(lem.split()) for lem in lex_comp['lemma']]
lex_comp = lex_comp.sort_values(by = 'length', ascending = False)
lex_comp = lex_comp.drop_duplicates(subset = 'id_text', keep = 'first')
lex_comp

Unnamed: 0,id_text,lemma,length
703,Q000043,a[water]n ninda[bread]n kaš[beer]n tu[soup]n t...,1098
707,Q000050,izi[fire]n ne[brazier]n didal[ashes]n didal[as...,1020
705,Q000047,lu[person]n lugal[king]n namdumu[status]n sukk...,902
709,Q000055,a[water]n duru[wet]v/i a[water]n a[water]n aya...,777
699,Q000039,taškarin[boxwood]n esi[tree]n ŋešnu[tree]n hal...,706
...,...,...,...
743,P333147,umun[insect]n,1
741,P333145,lahhušu[base]n,1
657,P427591,har[ring]n,1
560,P333845,heši[darken]v/i,1


> NOTE: Instead of a binary DTM, it may be better to build a regular (count) DTM and compute the number of matches with
```python
lex_df['n_matches'] = lex_df[vocab].astype(bool).sum(axis = 1, numeric_only=True)
```
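
As a minimal sketch of that idea, the example below uses an invented count DTM (the ids `X1` and `X2` and the counts are hypothetical). It shows that `astype(bool)` recovers the number of distinct matches, while the raw counts remain available for other measures.

```python
import pandas as pd

# Hypothetical count DTM: rows are lexical texts, columns are vocabulary items.
counts = pd.DataFrame(
    {'a[arm]n': [2, 0], 'kid[mat]n': [1, 0], 'ig[door]n': [0, 3]},
    index=pd.Index(['X1', 'X2'], name='id_text'),
)
toy_vocab = list(counts.columns)

# astype(bool) turns every non-zero count into True, so the row sum is the
# number of distinct vocabulary items attested in each text.
counts['n_matches'] = counts[toy_vocab].astype(bool).sum(axis=1)
# The raw counts give the total number of matching tokens instead.
counts['n_tokens'] = counts[toy_vocab].sum(axis=1)
print(counts)
```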

In [10]:
cv = CountVectorizer(preprocessor = lambda x: x, tokenizer = lambda x: x.split(), vocabulary = lit_vocab, binary=True)
dtm = cv.fit_transform(lex_comp['lemma'])
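# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() requires version 1.0 or later
lex_df = pd.DataFrame(dtm.toarray(), columns= cv.get_feature_names_out(), index=lex_comp["id_text"])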

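Caveat: `lit_lex_vocab`, saved from notebook 3.2, includes all lemmas and all MWEs shared by the literary and the lexical corpus. The `CountVectorizer` above, however, only sees complete lexical *entries*, because the lemmas of each entry have been connected by underscores. This is presumably why some items of `lit_lex_vocab` are missing from the vocabulary computed below. A possible alternative: do not connect the lemmas, use `CountVectorizer()` with `ngram_range = (1, 4)` on the lexical *lines*, and then combine the lines into compositions in the DTM.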
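
The effect can be illustrated in plain Python. The helper `entry_ngrams` and the three-item `toy_vocab` are hypothetical and serve only to show the difference between matching whole entries and matching the n-grams inside an entry.

```python
def entry_ngrams(entry, max_n=4):
    """Return all n-grams (n <= max_n) of an underscore-joined entry, again joined by underscores."""
    lemmas = entry.split('_')
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(lemmas) - n + 1):
            grams.append('_'.join(lemmas[i:i + n]))
    return grams

entry = 'kid[mat]n_andul[shade]n'  # a two-lemma entry from the lexical data above
# Hypothetical literary vocabulary, for illustration only.
toy_vocab = {'kid[mat]n', 'andul[shade]n', 'kid[mat]n_andul[shade]n'}

# Split on whitespace only, the whole entry is a single token ...
whole_entry_matches = toy_vocab & set(entry.split())
# ... whereas the n-grams of the entry also expose the individual lemmas.
ngram_matches = toy_vocab & set(entry_ngrams(entry))
print(sorted(whole_entry_matches))
print(sorted(ngram_matches))
```

With whole-entry matching only the full MWE is found; with n-grams the two component lemmas are found as well.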

In [11]:
lex_df.shape

(760, 112791)

In [12]:
lex_df = lex_df.loc[: , lex_df.sum(axis=0) != 0].copy()
vocab = lex_df.columns

In [13]:
lex_df.shape, len(vocab)

((760, 3265), 3265)

In [14]:
set(lit_lex_vocab) - set(vocab)

{'a[time]n',
 'abbur[nook]n',
 'absaŋ[strap]n',
 'ad[bead]n',
 'ad[log]n',
 'ada[contest?]n',
 'adus[plank]n',
 'agade[1]sn',
 'agar[meadow]n',
 'ahan[vomit]n',
 'akar[implement]n',
 'allub[crab]n',
 'an[sky]n_bala[turn]v/t_ki[place]n_bala[turn]v/t',
 'ana[upper]aj',
 'anaŋ[drink]n',
 'anaš[why?]qp',
 'angam[consequently]ma',
 'anki[universe]n',
 'anta[companion]n',
 'anta[upper]aj',
 'anubda[quarter]n',
 'anumun[water]n',
 'anzag[horizon]n',
 'aria[steppe]n',
 'aslum[sheep]n',
 'azgu[neck-stock]n',
 'aš[six]nu',
 'aše[now]n',
 'aškud[extremities]n',
 'aškud[~door]n',
 'ašte[chair]n',
 'aʾu[water]n',
 'babada[porridge]n',
 'bahar[potter]n',
 'bar[cvve]v/t',
 'barhuda[tool]n',
 'barhuš[fish]n',
 'barim[land]n',
 'barsal[sheep]n',
 'birig[sneer]v/t',
 'buluggur[scythe?]n',
 'bursaŋ[building]n',
 'dagaltuma[~wool]aj',
 'dalhamun[storm]n',
 'dib[board]n',
 'dima[object]n',
 'dirig[excess]n',
 'diš[one]nu',
 'du[all]v/i',
 'dub[heap]v/t',
 'dub[pole-pin]n',
 'dubban[stone]n',
 'dubla[tower]

In [15]:
lex_df["n_matches"] = lex_df[vocab].sum(axis = 1, numeric_only=True)

In [16]:
lex_df

id_text,a[arm]n,a[arm]n_ak[do]v/t,a[arm]n_bad[open]v/t,a[arm]n_dar[split]v/t,a[arm]n_daŋal[wide]v/i,a[arm]n_durah[goat]n,a[arm]n_e[leave]v/i,a[arm]n_gab[left]n,a[arm]n_gal[big]v/i,a[arm]n_gud[ox]n,...,šutug[reed-hut]n,šutum[storehouse]n,šutur[garment]n,šuziʾana[1]dn,šuš[cover]v/t,šušin[1]sn,šušru[distressed]v/i,šuʾi[barber]n,šuʾura[goose]n,n_matches
Q000043,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,134
Q000050,1,1,1,1,1,1,0,1,1,0,...,0,0,0,0,0,0,0,0,0,532
Q000047,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,1,0,0,1,0,404
Q000055,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,598
Q000039,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,202
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
P333147,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
P333145,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
P427591,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
P333845,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1


In [17]:
# First get the metadata. 
cat = {}
for proj in ['dcclt', 'dcclt/signlists', 'dcclt/nineveh', 'dcclt/ebla']:
    f = proj.replace('/', '-')
    file = f"jsonzip/{f}.zip" # The ZIP file was downloaded in notebook 3_1
    z = zipfile.ZipFile(file) 
    st = z.read(f"{proj}/catalogue.json").decode("utf-8")
    j = (json.loads(st))
    cat.update(j["members"])
cat_df = pd.DataFrame(cat).T
cat_df["id_text"] = cat_df["id_text"].fillna(cat_df["id_composite"])
cat_df = cat_df.fillna('')
cat_df = cat_df[["id_text", "designation", "subgenre"]]

In [18]:
lex = pd.merge(cat_df, lex_df['n_matches'], on = 'id_text', how = 'inner')
lex = pd.merge(lex, lex_comp[['length', 'id_text']], on = 'id_text', how = 'inner')

In [19]:
lex['norm'] = lex['n_matches'] / lex['length']
lex = lex.sort_values(by = 'norm', ascending = False)
lex.loc[lex.length > 250]

Unnamed: 0,id_text,designation,subgenre,n_matches,length,norm
174,P228842,"MSL 14, 018 Bb",OB Nippur Ea,333,410,0.812195
753,Q000055,OB Nippur Ea,Sign Lists,598,777,0.769627
754,Q000056,OB Nippur Aa,Sign Lists,221,396,0.558081
751,Q000050,OB Nippur Izi,Acrographic Word Lists,532,1020,0.521569
707,P447992,"OECT 04, 152",OB Diri Oxford,136,286,0.475524
755,Q000057,OB Nippur Diri,Sign Lists,262,577,0.454073
749,Q000047,OB Nippur Lu,Thematic Word Lists,404,902,0.447894
752,Q000052,Nippur Nigga,Acrographic Word Lists,227,511,0.444227
554,P310404,YBC 09868,,95,254,0.374016
584,P333149,"MSL 09, 124-137",OB Ea,101,286,0.353147


In [20]:
anchor = '<a href="http://oracc.org/dcclt/{}" target="_blank">{}</a>' # HTML attributes are separated by spaces, not commas
lex2 = lex.copy()
lex2['id_text'] = [anchor.format(val,val) for val in lex['id_text']]

In [21]:
@interact(sort_by = lex2.columns, rows = (1, len(lex2), 1), min_length = (1,500,5))
def sort_df(sort_by = "norm", ascending = False, rows = 25, min_length = 250):
    return lex2.loc[lex2.length >= min_length].sort_values(by = sort_by, ascending = ascending).reset_index(drop=True)[:rows].style

Next step: identify important words with TF-IDF.

Note (to do): first make n-grams per line (as above), then use `TfidfVectorizer()` with that vocabulary.
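
One way to read that note is sketched below. The analyzer function `line_ngrams`, the toy documents, and `toy_vocab` are assumptions made for illustration; they are not the code used in this notebook. The sketch assumes that the lines of a document are joined with newlines, so that the analyzer can build n-grams per line before `TfidfVectorizer` weights them against a fixed vocabulary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def line_ngrams(doc, max_n=3):
    """Split a document into lines and return the n-grams of each line, joined by underscores."""
    grams = []
    for line in doc.split('\n'):
        lemmas = line.split()
        for n in range(1, max_n + 1):
            grams += ['_'.join(lemmas[i:i + n]) for i in range(len(lemmas) - n + 1)]
    return grams

# Hypothetical toy documents; lines are separated by newlines.
toy_docs = ['a b\nc d', 'a b c']
# Hypothetical fixed vocabulary (in the notebook this role would be played by `vocab`).
toy_vocab = ['a', 'a_b', 'b_c']

# With a callable analyzer the vectorizer uses the returned tokens as they are;
# terms outside the fixed vocabulary are ignored.
tv_lines = TfidfVectorizer(analyzer=line_ngrams, vocabulary=toy_vocab)
tfidf = tv_lines.fit_transform(toy_docs).toarray()
print(tfidf)
```

In the first toy document `b` and `c` stand on different lines, so `b_c` receives no weight there; in the second document it does.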

In [22]:
lit_comp2 = lit_lines.groupby(['id_text']).agg({'lemma' : ' '.join}).reset_index()
lit_comp2['id_text'] = [i[-7:] for i in lit_comp2['id_text']]

In [23]:
tv = TfidfVectorizer(token_pattern = r'[^ ]+', ngram_range = (1,3))
dtm = tv.fit_transform(lit_comp2['lemma'])
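# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() requires version 1.0 or later
lit_df = pd.DataFrame(dtm.toarray(), columns= tv.get_feature_names_out(), index=lit_comp2["id_text"])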
#cols = [col for col in lit_df.columns if not '[na]na' in col]
#lit_df = lit_df[cols]

In [24]:
lit_df

id_text,$a-$a[na]na,$a-$a[na]na ŋi[night]n,$a-$a[na]na ŋi[night]n ud[sun]n,$a[na]na,$a[na]na $e[na]na,$a[na]na $e[na]na x-ku-ne-ne-a[na]na,$a[na]na ab-x[na]na,$a[na]na ab-x[na]na x[na]na,$a[na]na x[na]na,$a[na]na x[na]na kag[mouth]n,...,šu₂-šu₂-gin₇[na]na me-e[na]na,šu₂-šu₂-gin₇[na]na me-e[na]na i-lu[na]na,šu₂-šu₂-ke₄[na]na,šu₂-šu₂-ke₄[na]na ma-ra-an-gi₄-gi₄-ne[na]na,šu₂-šu₂-ke₄[na]na ma-ra-an-gi₄-gi₄-ne[na]na mu-un-gar₃[na]na,šu₂[na]na,šu₂[na]na niŋak[magic]n,šu₂[na]na niŋak[magic]n x[na]na,šu₂[na]na ŋiri[foot]n,šu₂[na]na ŋiri[foot]n šu₂[na]na
P209784,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
P251427,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
P251713,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
P251728,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
P252215,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Q000823,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Q000824,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Q000825,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Q002338,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
lit_df.columns = [voc.replace(' ', '_') for voc in lit_df.columns]
lit_lex_df = lit_df[vocab].copy() #select columns with terms in lexical vocabulary
lit_lex_df

In [None]:
mean = lit_lex_df.sum(axis=0) / lit_lex_df.astype(bool).sum(axis=0)
mean

In [None]:
lit_lex_tfidf = lex_df[vocab].mul(mean, axis = 1) # select the vocabulary columns (lex_df[:-1] would drop the last row, not the n_matches column)

In [None]:
lit_lex_tfidf

In [None]:
lit_lex_tfidf['weighted'] = lit_lex_tfidf[vocab].sum(axis=1, numeric_only = True)

In [None]:
lit_lex_tfidf

In [None]:
lit_lex_tfidf = lit_lex_tfidf.loc[lit_lex_tfidf.sum(axis=1) > 0]

In [None]:
lex2 = pd.merge(cat_df, lit_lex_tfidf['weighted'], on = 'id_text', how = 'inner')
lex2 = pd.merge(lex2, lex[['length', 'n_matches', 'id_text']], on = 'id_text', how = 'inner')

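Instead of dividing by length, look at the mean value of `weighted`, that is, the summed weight divided by the number of matched terms:
```python
lit_lex_tfidf['norm'] = lit_lex_tfidf['weighted'] / lit_lex_tfidf[vocab].astype(bool).sum(axis = 1)
```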
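
A small worked example may make the mean weight per matched term concrete. The matrix `weights`, its ids, and its values are invented for illustration; zero stands for a term that is not attested in a lexical text.

```python
import pandas as pd

# Hypothetical weighted matrix: zero means "term not attested in this lexical text".
weights = pd.DataFrame(
    {'term1': [0.5, 0.0], 'term2': [0.25, 0.0], 'term3': [0.0, 0.75]},
    index=pd.Index(['X1', 'X2'], name='id_text'),
)
terms = list(weights.columns)

weighted = weights[terms].sum(axis=1)                # summed weight per text
n_matched = weights[terms].astype(bool).sum(axis=1)  # number of non-zero weights per text
mean_weight = weighted / n_matched
print(mean_weight)
```

Both toy texts have the same summed weight (0.75), but `X2` reaches it with a single term, so its mean weight is twice that of `X1`. Dividing by text length instead would not distinguish the two in this way.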

In [None]:
lex2['norm'] = lex2['weighted'] / lex2['n_matches']
#lex2['norm'] = lex2['weighted'] / lex2['length']
lex2.sort_values(by = 'norm', ascending = False)

In [None]:
anchor = '<a href="http://oracc.org/dcclt/{}" target="_blank">{}</a>' # HTML attributes are separated by spaces, not commas
lex3 = lex2.copy()
lex3['id_text'] = [anchor.format(val,val) for val in lex2['id_text']]

In [None]:
@interact(sort_by = lex3.columns, rows = (1, len(lex3), 1), min_length = (1,500,5))
def sort_df(sort_by = "weighted", ascending = False, rows = 25, min_length = 200):
    return lex3.loc[lex3.length >= min_length].sort_values(by = sort_by, ascending = ascending).reset_index(drop=True)[:rows].style