
# 3.3 Looking at the Lexical Vocabulary from the Perspective of the Literary Material

In section 3.2 we asked whether we can see differences between Old Babylonian literary compositions in their usage of vocabulary (lemmas and MWEs) attested in the lexical corpus. In this notebook we will change perspective and ask: are there particular lexical texts (or groups of lexical texts) that show a greater engagement with literary vocabulary than others?

In [3.1](./3_1_Lit_Lex_Vocab.ipynb) and [3.2](./3_2_Lit_Lex.ipynb) we used Multiple Word Expressions (MWEs), connecting the words found in a lexical entry by underscores (using `MWETokenizer()` from the nltk module). The lemmas and MWEs were visualized in Venn diagrams to illustrate the intersection between lexical and literary vocabulary.

In this notebook we will use the ngram option of the `CountVectorizer()` function in order to find sequences of lemmas that are shared between lexical and literary texts. An ngram is a contiguous sequence of *n* words (or lemmas).
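
To make the definition concrete, here is a minimal sketch in plain Python that lists all ngrams (n = 1 to 5) of a single three-lemma entry. The helper `ngrams()` is a hypothetical illustration, not part of the notebook or of scikit-learn; `CountVectorizer()` does this work internally.

```python
def ngrams(tokens, n_min=1, n_max=5):
    """Return all contiguous ngrams of `tokens` as space-joined strings."""
    result = []
    for n in range(n_min, n_max + 1):
        # a window of size n fits len(tokens) - n + 1 times (never, if n is too large)
        for start in range(len(tokens) - n + 1):
            result.append(" ".join(tokens[start:start + n]))
    return result

entry = "a[arm]n si[horn]n sa[equal]v/t".split()
entry_ngrams = ngrams(entry)
for gram in entry_ngrams:
    print(gram)
```

A three-lemma entry yields three unigrams, two bigrams, and one trigram; the 4-gram and 5-gram windows do not fit and contribute nothing.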


In large part, this notebook uses the same techniques and the same code as section 3.2 did, and the reader is referred there for further explanation. In some aspects, however, the process is different. In particular, we will use various aspects of `CountVectorizer()` and the related function `TfidfVectorizer()` to understand the relationship in more detail.

In [1]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning) # this suppresses a warning about pandas from tqdm
import pandas as pd
from ipywidgets import interact
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from tqdm.auto import tqdm
tqdm.pandas() # initiate pandas support in tqdm, allowing progress_apply() and progress_map()
import zipfile
import json
import numpy as np

Open the file `lexlines.p` which was produced in [3_1_Lit_Lex_Vocab.ipynb](./3_1_Lit_Lex_Vocab.ipynb). The file contains the pickled version of the DataFrame `lex_lines` in which the lexical ([dcclt](http://oracc.org/dcclt)) corpus is represented in line-by-line format.

The field `id_text` is represented as `dcclt/P227743` or `dcclt/signlists/Q000057`. In practice, we only need the last seven characters, the P, Q, or X number.

In [2]:
lex_lines = pd.read_pickle('output/lexlines.p')
lex_lines['id_text'] = lex_lines['id_text'].str[-7:]
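
The slice `str[-7:]` relies on every ID consisting of one letter followed by six digits. The small sketch below, on the two example IDs mentioned above, compares it with splitting on the last slash, an alternative that would not depend on the ID length. The variable names are illustrative only.

```python
import pandas as pd

ids = pd.Series(["dcclt/P227743", "dcclt/signlists/Q000057"])

# Approach used in this notebook: keep the last seven characters
by_slice = ids.str[-7:]

# Alternative: keep whatever follows the last slash
by_split = ids.str.split("/").str[-1]

print(by_slice.tolist())
print(by_split.tolist())
```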

### Special Case: OB Nippur Ura 6
The sixth chapter of the Old Babylonian Nippur version of the thematic list Ura deals with foodstuffs and drinks. This chapter was not standardized (each exemplar has its own order of items and sections) and therefore no composite text has been created in [DCCLT](http://oracc.org/dcclt). Instead, the "composite" of [OB Nippur Ura 6](http://oracc.org/dcclt/Q000043) consists of the concatenation of all known Nippur exemplars of the list of foodstuffs. In our current dataframe, therefore, there are no lines where the field `id_text` equals "Q000043".

We create a "composite" by changing the field `id_text` in all exemplars of [OB Nippur Ura 6](http://oracc.org/dcclt/Q000043) to "Q000043". 

In [3]:
Ura6 = ["P227657",
"P227743",
"P227791",
"P227799",
"P227925",
"P227927",
"P227958",
"P227967",
"P227979",
"P228005",
"P228008",
"P228200",
"P228359",
"P228368",
"P228488",
"P228553",
"P228562",
"P228663",
"P228726",
"P228831",
"P228928",
"P229015",
"P229093",
"P229119",
"P229304",
"P229332",
"P229350",
"P229351",
"P229352",
"P229353",
"P229354",
"P229356",
"P229357",
"P229358",
"P229359",
"P229360",
"P229361",
"P229362",
"P229365",
"P229366",
"P229367",
"P229890",
"P229925",
"P230066",
"P230208",
"P230230",
"P230530",
"P230586",
"P231095",
"P231128",
"P231424",
"P231446",
"P231453",
"P231458",
"P231742",
"P266520"]
lex_lines.loc[lex_lines["id_text"].isin(Ura6), "id_text"] = "Q000043"
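
The relabeling step can be tried out on a toy DataFrame. In this sketch the lemmas and the short exemplar list are placeholders chosen for illustration; only the pattern (`isin()` to build a boolean mask, `.loc` to assign) mirrors the cell above.

```python
import pandas as pd

toy = pd.DataFrame({
    "id_text": ["P227657", "P227743", "Q000055"],
    "lemma": ["ninda[bread]n", "kaš[beer]n", "a[water]n"],  # illustrative lemmas
})
exemplars = ["P227657", "P227743"]  # a two-item stand-in for the Ura6 list

# Rows whose id_text is in the list get the composite number; others are untouched
toy.loc[toy["id_text"].isin(exemplars), "id_text"] = "Q000043"
print(toy)
```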

# Computing text length
In order to evaluate the number of matches between a lexical text and the literary corpus we need a measure of text length. Text length is defined here as the number of lemmatized words in a text.

First the lines of `lex_lines` are aggregated to lexical compositions in the DataFrame `lex_comp`.

In [4]:
lex_comp = lex_lines.groupby(
    [lex_lines["id_text"]]).aggregate(
    {"lemma": ' '.join}).reset_index()

The function `lex_length()` computes the number of lemmas in each composition by first splitting the field `lemma` into individual lemmas. A list comprehension removes all unlemmatized words, and the length of the resulting list is returned.

In [5]:
def lex_length(lemmas):
    lemmas = lemmas.split()
    lemmas = [lemma for lemma in lemmas if '[na]na' not in lemma] # remove unlemmatized words
    length = len(lemmas)
    return length
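
As a quick check of what the function counts, the sketch below applies an equivalent definition to a made-up line. The token `x[na]na` is an assumed example of how an unlemmatized word appears in the `lemma` field; the actual form part will vary.

```python
def lex_length(lemmas):
    """Count the lemmatized tokens in a space-separated lemma string."""
    tokens = lemmas.split()
    return len([t for t in tokens if '[na]na' not in t])

sample = "amar[young]n x[na]na ga[milk]n"  # hypothetical line with one unlemmatized word
print(lex_length(sample))
print(lex_length("x[na]na"))  # a line without any lemmatized word has length 0
```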

First add the new field `length` by calling the function `lex_length()` for every row.

The DataFrame `lex_comp` has data from all Old Babylonian lexical texts currently in [dcclt](http://oracc.org/dcclt). Not all of these texts are lemmatized. In particular, documents that have been linked to a composite text are usually not lemmatized. Such documents have no lemmatized contents and therefore have length 0. These documents are removed from `lex_comp`.

Since the lexical data are drawn from multiple (sub)projects, it is possible that there are duplicates. Duplicates have the same P, Q, or X number. We select the version with the largest number of (lemmatized) words and drop the others. To do so, the DataFrame is ordered by length (from large to small) and, if duplicate `id_text`s are found, only the first one is kept with the Pandas method `drop_duplicates()`.

In [6]:
lex_comp['length'] = lex_comp['lemma'].progress_map(lex_length)
lex_comp = lex_comp.loc[lex_comp['length'] > 0] # remove compositions that have no lemmatized content
lex_comp = lex_comp.sort_values(by = 'length', ascending=False)
lex_comp = lex_comp.drop_duplicates(subset = 'id_text', keep = 'first')
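
The effect of the three filtering steps can be seen on a toy table. The IDs and lengths below are invented for illustration: one text appears twice with different lengths and one text has no lemmatized content.

```python
import pandas as pd

toy = pd.DataFrame({
    "id_text": ["P000001", "P000001", "P000002", "P000003"],  # hypothetical IDs
    "length": [12, 40, 0, 7],
})

toy = toy.loc[toy["length"] > 0]                           # drop unlemmatized texts
toy = toy.sort_values(by="length", ascending=False)        # longest version first
toy = toy.drop_duplicates(subset="id_text", keep="first")  # keep the longest version

result = dict(zip(toy["id_text"].tolist(), toy["length"].tolist()))
print(result)
```

Because sorting comes before `drop_duplicates()`, `keep="first"` always retains the longest version of a duplicated text.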





# Open list of Vocabulary Intersection
The file `lit_lex_vocab.txt` contains a list of all lemmas and Multiple Word Expressions that are shared by the literary corpus and the lexical corpus. This list was produced in [3_2_Lit_Lex.ipynb](./3_2_Lit_Lex.ipynb). In sections [3.1](./3_1_Lit_Lex_Vocab.ipynb) and [3.2](./3_2_Lit_Lex.ipynb) lexical *entries* were turned into MWEs by connecting the individual lemmas by underscores (as in `amar[young]n_ga[milk]n_gu[eat]v/t`). In this notebook we will take a different approach by using ngrams. For that reason we need to replace all those underscores by spaces.

This vocabulary is used in the next section for building a Document Term Matrix.

In [7]:
with open('output/lit_lex_vocab.txt', 'r', encoding = 'utf8') as l:
    lit_lex_vocab = l.read().splitlines()
lit_lex_vocab = [v.replace('_', ' ') for v in lit_lex_vocab]
lit_lex_vocab[:25]

['a[arm]n',
 'a[arm]n ak[do]v/t',
 'a[arm]n bad[open]v/t',
 'a[arm]n dar[split]v/t',
 'a[arm]n daŋal[wide]v/i',
 'a[arm]n durah[goat]n',
 'a[arm]n e[leave]v/i',
 'a[arm]n gab[left]n',
 'a[arm]n gal[big]v/i',
 'a[arm]n gud[ox]n',
 'a[arm]n gur[thick]v/i',
 'a[arm]n gur[turn]v/i',
 'a[arm]n il[raise]v/t',
 'a[arm]n kalag[strong]v/i',
 'a[arm]n kud[cut]v/t',
 'a[arm]n la[hang]v/t',
 'a[arm]n mah[great]v/i',
 'a[arm]n me[battle]n',
 'a[arm]n sag[good]v/i',
 'a[arm]n si[horn]n sa[equal]v/t',
 'a[arm]n sig[weak]v/i',
 'a[arm]n tal[broad]v/i',
 'a[arm]n tulu[slacken]v/t',
 'a[arm]n ud[sun]n',
 'a[arm]n zid[right]v/i']

### Document Term Matrix

The lexical corpus is transformed into a Document Term Matrix (or DTM), in the same way we did in [3.2](./3_2_Lit_Lex.ipynb) with the literary corpus - but with two important differences. First, as our data we will use the DataFrame `lex_lines` which represents the lexical corpus in *line* format, rather than in document format. This is important, because we do not want the ngrams (see below) to jump over line boundaries. The DTM will thus treat each line as a document. In the next step we will add up all the line-based data to create a proper DTM.

Second, the parameter `ngram_range` is set to (1, 5). `CountVectorizer()` will create a column for each word (unigram; n=1), but also for each sequence of two words (bigram; n=2), three words (trigram; n=3), etc. The entry `amar[young]n ga[milk]n gu[eat]v/t` (calf that eats milk) will be represented as:

| type             | representation  |
|------------------|-----------------|
| unigram        | amar\[young\]n |
|                    | ga\[milk\]n |
|                    | gu\[eat\]v/t |
| bigram             | amar\[young\]n ga\[milk\]n |
|                    | ga\[milk\]n gu\[eat\]v/t |
| trigram | amar\[young\]n ga\[milk\]n gu\[eat\]v/t|

For longer entries we may also get 4-grams and 5-grams. We will use `CountVectorizer()` on the representation of lexical texts in *lines* so that the ngrams do not extend beyond the end of an entry. Afterwards, lines are combined into compositions.

A three-word entry which was treated as a single unit in [3.1](./3_1_Lit_Lex_Vocab.ipynb) and [3.2](./3_2_Lit_Lex.ipynb) now results in 6 potential columns in the Document Term Matrix.

Potentially, this results in a very big (and very sparse) matrix. In order to limit its size somewhat we use the vocabulary `lit_lex_vocab`, which contains all lemmas and lexical entries shared by the lexical and literary corpora.

In [8]:
cv = CountVectorizer(preprocessor = lambda x: x, tokenizer = lambda x: x.split(), vocabulary = lit_lex_vocab, ngram_range=(1, 5))
dtm = cv.fit_transform(lex_lines['lemma'])
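# get_feature_names() was removed in scikit-learn 1.2; on versions before 1.0 use cv.get_feature_names()
lex_lines_dtm = pd.DataFrame(dtm.toarray(), columns= cv.get_feature_names_out(), index=lex_lines["id_text"])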
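
The following sketch illustrates, on two invented lines, why the vectorizer is applied to lines rather than to whole compositions. The three-item vocabulary is hypothetical; its last item, `gu[eat]v/t amar[young]n`, can only be found if an ngram is allowed to run from the end of the first line into the second.

```python
from sklearn.feature_extraction.text import CountVectorizer

lines = ["amar[young]n ga[milk]n gu[eat]v/t", "amar[young]n"]
vocab = ["amar[young]n", "amar[young]n ga[milk]n", "gu[eat]v/t amar[young]n"]

cv = CountVectorizer(preprocessor=lambda x: x, tokenizer=lambda x: x.split(),
                     token_pattern=None,  # unused, because a tokenizer is supplied
                     vocabulary=vocab, ngram_range=(1, 5))

# Each line is its own document: ngrams stop at the line end
line_counts = cv.fit_transform(lines).toarray()

# The same lines joined into one document: the spurious bigram appears
joined_counts = cv.transform([" ".join(lines)]).toarray()

print(line_counts)
print(joined_counts)
```

With a fixed `vocabulary`, the columns follow the order of the vocabulary list, and ngrams that are not in the list are simply ignored.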

We may reformat the line-based DTM into a true DTM by using the Pandas functions `groupby()` and `aggregate()`. The `aggregate()` function, in this case, is `sum`: for every word or ngram add the frequencies of all the lines of a single lexical composition.

In [9]:
lex_comp_dtm = lex_lines_dtm.groupby('id_text').agg(sum).reset_index()
#vocab = lex_comp_dtm.columns[1:] instead, use lit_lex_vocab
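
A toy version of this aggregation, with invented IDs and counts, shows how the line-based rows collapse into one row per composition:

```python
import pandas as pd

# Three lines: two belong to one (hypothetical) tablet, one to a composite
line_dtm = pd.DataFrame(
    {"ga[milk]n": [1, 1, 1], "amar[young]n ga[milk]n": [1, 0, 0]},
    index=pd.Index(["P000001", "P000001", "Q000001"], name="id_text"),
)

# Sum the counts of all lines that share the same id_text
comp_dtm = line_dtm.groupby("id_text").sum().reset_index()
print(comp_dtm)
```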

In [10]:
lex_comp_dtm["n_matches"] = lex_comp_dtm[lit_lex_vocab].astype(bool).sum(axis = 1, numeric_only=True)
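
The field `n_matches` counts *distinct* vocabulary items: converting the counts to booleans turns every non-zero frequency into `True`, so an item that appears five times in a text still contributes 1. The sketch below, on invented counts, contrasts this with the plain sum of frequencies (called `n_tokens` here for illustration only).

```python
import pandas as pd

vocab = ["ga[milk]n", "amar[young]n ga[milk]n"]
comp_dtm = pd.DataFrame({
    "id_text": ["P000001", "Q000001"],  # hypothetical IDs
    "ga[milk]n": [2, 1],
    "amar[young]n ga[milk]n": [1, 0],
})

# Number of distinct vocabulary items attested in each text
comp_dtm["n_matches"] = comp_dtm[vocab].astype(bool).sum(axis=1)
# For comparison: total number of occurrences
comp_dtm["n_tokens"] = comp_dtm[vocab].sum(axis=1)
print(comp_dtm[["id_text", "n_matches", "n_tokens"]])
```

Note that ngrams count as items in their own right, so a single three-lemma entry can contribute several matches.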

In [11]:
# Get the metadata. 
cat = {}
for proj in ['dcclt', 'dcclt/signlists', 'dcclt/nineveh', 'dcclt/ebla']:
    f = proj.replace('/', '-')
    file = f"jsonzip/{f}.zip" # The ZIP file was downloaded in notebook 3_1
    z = zipfile.ZipFile(file) 
    st = z.read(f"{proj}/catalogue.json").decode("utf-8")
    j = (json.loads(st))
    cat.update(j["members"])
cat_df = pd.DataFrame(cat).T
cat_df["id_text"] = cat_df["id_text"].fillna(cat_df["id_composite"])
cat_df = cat_df.fillna('')
cat_df = cat_df[["id_text", "designation", "subgenre"]]
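
The catalogue handling can be sketched without the ZIP files by parsing a small JSON string. The two catalogue entries below are invented; the sketch assumes, as the cell above suggests, that composites carry an `id_composite` field where exemplars carry `id_text`.

```python
import json
import pandas as pd

raw = """{"members": {
  "P000001": {"id_text": "P000001", "designation": "Tablet A", "subgenre": "OB Ura"},
  "Q000001": {"id_composite": "Q000001", "designation": "Composite B", "subgenre": "Sign Lists"}
}}"""

members = json.loads(raw)["members"]
toy_cat = pd.DataFrame(members).T  # one row per text

# Composites have no id_text: fall back on id_composite
toy_cat["id_text"] = toy_cat["id_text"].fillna(toy_cat["id_composite"])
toy_cat = toy_cat.fillna("")
toy_cat = toy_cat[["id_text", "designation", "subgenre"]]
print(toy_cat)
```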

In [12]:
lex = pd.merge(cat_df, lex_comp_dtm[['n_matches', 'id_text']], on = 'id_text', how = 'inner')
lex = pd.merge(lex, lex_comp[['length', 'id_text']], on = 'id_text', how = 'inner')

In [13]:
lex['norm'] = lex['n_matches'] / lex['length']
lex = lex.sort_values(by = 'norm', ascending = False)
lex.loc[lex.length > 250]

|     | id_text | designation | subgenre | n_matches | length | norm |
|-----|---------|-------------|----------|-----------|--------|------|
| 178 | P228842 | MSL 14, 018 Bb | OB Nippur Ea | 333 | 410 | 0.812195 |
| 765 | Q000055 | OB Nippur Ea | Sign Lists | 599 | 779 | 0.768935 |
| 766 | Q000056 | OB Nippur Aa | Sign Lists | 231 | 408 | 0.566176 |
| 718 | P447992 | OECT 04, 152 | OB Diri Oxford | 147 | 295 | 0.498305 |
| 763 | Q000050 | OB Nippur Izi | Acrographic Word Lists | 688 | 1400 | 0.491429 |
| 418 | P247810 | IB 1514 | OB Lu | 133 | 274 | 0.485401 |
| 770 | Q002268 | OB Nippur Ugumu | Thematic Word Lists | 163 | 348 | 0.468391 |
| 767 | Q000057 | OB Nippur Diri | Sign Lists | 278 | 594 | 0.468013 |
| 764 | Q000052 | Nippur Nigga | Acrographic Word Lists | 391 | 844 | 0.46327 |
| 762 | Q000048 | OB Nippur Kagal | Acrographic Word Lists | 447 | 1015 | 0.440394 |
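
The merge and normalization steps can be sketched on invented numbers. The example also suggests why a minimum length is useful: a very short text with a few matches easily gets a high `norm` value and would otherwise dominate the ranking.

```python
import pandas as pd

matches = pd.DataFrame({"id_text": ["P000001", "Q000001", "X000001"],  # hypothetical IDs
                        "n_matches": [3, 50, 2]})
lengths = pd.DataFrame({"id_text": ["P000001", "Q000001"],
                        "length": [4, 400]})

# An inner merge keeps only texts present in both tables (X000001 is dropped)
toy = pd.merge(matches, lengths, on="id_text", how="inner")
toy["norm"] = toy["n_matches"] / toy["length"]
toy = toy.sort_values(by="norm", ascending=False)

long_only = toy.loc[toy["length"] > 250]
print(toy)
print(long_only)
```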


In [14]:
anchor = '<a href="http://oracc.org/dcclt/{}" target="_blank">{}</a>' # HTML attributes are separated by spaces, not commas
lex2 = lex.copy()
lex2['id_text'] = [anchor.format(val,val) for val in lex['id_text']]

In [15]:
@interact(sort_by = lex2.columns, rows = (1, len(lex2), 1), min_length = (1,500,5))
def sort_df(sort_by = "norm", ascending = False, rows = 25, min_length = 250):
    return lex2.loc[lex2.length >= min_length].sort_values(by = sort_by, ascending = ascending).reset_index(drop=True)[:rows].style


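The next step is to look at the importance of the shared words, using tf-idf weights.

To do so, we first build ngrams from the literary corpus (as above), and then apply `TfidfVectorizer()` with the shared vocabulary `lit_lex_vocab`.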
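
A minimal sketch of this step on two invented "compositions" is given below. The documents and the three-item vocabulary are placeholders; the parameters `token_pattern`, `ngram_range`, and `vocabulary` are used in the same way as in the next cell.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["lugal[king]n e[house]n lugal[king]n",  # hypothetical composition 1
        "e[house]n gal[big]v/i"]                # hypothetical composition 2
vocab = ["lugal[king]n", "e[house]n", "e[house]n gal[big]v/i"]

# token_pattern r'[^ ]+' treats every run of non-space characters as one token
tv = TfidfVectorizer(token_pattern=r"[^ ]+", ngram_range=(1, 5), vocabulary=vocab)
weights = tv.fit_transform(docs).toarray()
print(weights.round(3))
```

With the default settings each row is normalized to unit length, and an item that occurs in fewer documents receives a higher weight than one that occurs everywhere.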

In [16]:
lit_lines = pd.read_pickle('output/litlines.p')
lit_comp2 = lit_lines.groupby(['id_text']).agg({'lemma' : ' '.join}).reset_index()
lit_comp2['id_text'] = [i[-7:] for i in lit_comp2['id_text']]
tv = TfidfVectorizer(token_pattern = r'[^ ]+', ngram_range = (1,5), vocabulary =lit_lex_vocab)
dtm = tv.fit_transform(lit_comp2['lemma'])
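# get_feature_names() was removed in scikit-learn 1.2; on versions before 1.0 use tv.get_feature_names()
lit_df = pd.DataFrame(dtm.toarray(), columns= tv.get_feature_names_out(), index=lit_comp2["id_text"])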
lit_df

id_text,a[arm]n,a[arm]n ak[do]v/t,a[arm]n bad[open]v/t,a[arm]n dar[split]v/t,a[arm]n daŋal[wide]v/i,a[arm]n durah[goat]n,a[arm]n e[leave]v/i,a[arm]n gab[left]n,a[arm]n gal[big]v/i,a[arm]n gud[ox]n,...,šutum[storehouse]n,šutur[garment]n,šuziʾana[1]dn,šuš[cover]v/t,šušana[one-third]nu,šuši[sixty]nu,šušin[1]sn,šušru[distressed]v/i,šuʾi[barber]n,šuʾura[goose]n
P209784,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
P251427,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
P251713,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
P251728,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
P252215,0.070647,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Q000823,0.043850,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.010059,0.0,0.0,0.0,0.0,0.0,0.0
Q000824,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
Q000825,0.048206,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.044233,0.0,0.0,0.0,0.0,0.0,0.0
Q002338,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0


In [17]:
mean = lit_df[lit_lex_vocab].sum(axis=0) / lit_df[lit_lex_vocab].astype(bool).sum(axis=0) #total weights by total hits
mean = mean.array
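
The cell above computes, for each vocabulary item, the mean of its non-zero tf-idf weights: the sum of the weights divided by the number of compositions in which the item occurs. The sketch below reproduces this on invented weights. It also shows an edge case: an item that occurs in no composition yields 0/0, which Pandas represents as `NaN`. Replacing such values by 0 with `fillna()` is an assumption made here for illustration, not a step taken in the notebook.

```python
import pandas as pd

lit = pd.DataFrame({
    "lugal[king]n": [0.8, 0.4, 0.0],
    "e[house]n": [0.0, 0.0, 0.0],   # never attested in this toy corpus
    "gal[big]v/i": [0.6, 0.0, 0.0],
})

hits = lit.astype(bool).sum(axis=0)  # number of documents with a non-zero weight
mean = lit.sum(axis=0) / hits        # total weight divided by total hits
mean_safe = mean.fillna(0.0)         # assumed handling of unattested items
print(mean)
```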

In [18]:
lit_lex_tfidf = lex_comp_dtm.copy()
lit_lex_tfidf[lit_lex_vocab] = lit_lex_tfidf[lit_lex_vocab].mul(mean, axis = 1)
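
Multiplying with `axis = 1` applies one weight per column: every frequency in the lexical DTM is multiplied by the mean literary tf-idf weight of that vocabulary item. A toy version with invented counts and weights, including the row sum that serves as the weighted score of a text:

```python
import numpy as np
import pandas as pd

vocab = ["lugal[king]n", "e[house]n"]
counts = pd.DataFrame({"id_text": ["P000001", "Q000001"],  # hypothetical IDs
                       "lugal[king]n": [2, 0],
                       "e[house]n": [1, 3]})
weights = np.array([0.5, 0.25])  # one (invented) weight per vocabulary item

weighted = counts.copy()
# Column-wise multiplication: weights[i] is applied to the i-th vocabulary column
weighted[vocab] = weighted[vocab].mul(weights, axis=1)
# The weighted score of a text is the sum over all vocabulary columns
weighted["weighted"] = weighted[vocab].sum(axis=1)
print(weighted)
```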

In [19]:
lit_lex_tfidf

Unnamed: 0,id_text,a[arm]n,a[arm]n ak[do]v/t,a[arm]n bad[open]v/t,a[arm]n dar[split]v/t,a[arm]n daŋal[wide]v/i,a[arm]n durah[goat]n,a[arm]n e[leave]v/i,a[arm]n gab[left]n,a[arm]n gal[big]v/i,...,šutur[garment]n,šuziʾana[1]dn,šuš[cover]v/t,šušana[one-third]nu,šuši[sixty]nu,šušin[1]sn,šušru[distressed]v/i,šuʾi[barber]n,šuʾura[goose]n,n_matches
0,P117394,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,2
1,P117395,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,1
2,P117396,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,3
3,P117397,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0
4,P117404,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2130,Q000302,0.298731,0.0,0.0,0.0,0.0,0.0,0.0,0.090876,0.0,...,0.0,0.0,0.109124,0.0,0.0,0.0,0.0,0.0,0.0,363
2131,Q002268,0.074683,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,163
2132,X000101,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,3
2133,X000345,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0


In [20]:
lit_lex_tfidf['weighted'] = lit_lex_tfidf[lit_lex_vocab].sum(axis=1, numeric_only = True)

In [21]:
lit_lex_tfidf

Unnamed: 0,id_text,a[arm]n,a[arm]n ak[do]v/t,a[arm]n bad[open]v/t,a[arm]n dar[split]v/t,a[arm]n daŋal[wide]v/i,a[arm]n durah[goat]n,a[arm]n e[leave]v/i,a[arm]n gab[left]n,a[arm]n gal[big]v/i,...,šuziʾana[1]dn,šuš[cover]v/t,šušana[one-third]nu,šuši[sixty]nu,šušin[1]sn,šušru[distressed]v/i,šuʾi[barber]n,šuʾura[goose]n,n_matches,weighted
0,P117394,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,2,0.174882
1,P117395,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,1,0.050622
2,P117396,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,3,0.519652
3,P117397,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0,0.000000
4,P117404,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,5,0.460578
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2130,Q000302,0.298731,0.0,0.0,0.0,0.0,0.0,0.0,0.090876,0.0,...,0.0,0.109124,0.0,0.0,0.0,0.0,0.0,0.0,363,104.884014
2131,Q002268,0.074683,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,163,25.920406
2132,X000101,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,3,0.267095
2133,X000345,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0,0.000000


In [22]:
lit_lex_tfidf.shape

(2135, 3511)

In [23]:
lex2 = pd.merge(cat_df, lit_lex_tfidf[['weighted', 'id_text']], on = 'id_text', how = 'inner')
lex2 = pd.merge(lex2, lex[['length', 'n_matches', 'id_text']], on = 'id_text', how = 'inner')

Instead of dividing by length, we could look at the mean value of `weighted`:
```python
lex2['norm'] = lex2['weighted'] / lex2.astype(bool).sum(axis = 1)
```

In [24]:
#lex2['norm'] = lex2['weighted'] / lex2['n_matches']
lex2['norm'] = lex2['weighted'] / lex2['length']
lex2.sort_values(by = 'norm', ascending = False)

Unnamed: 0,id_text,designation,subgenre,weighted,length,n_matches,norm
47,P225075,"TIM 10, 099",exercise,1.259534,3,1,0.419845
544,P278700,N 3685,,0.984156,4,4,0.246039
698,P389436,"OLZ 17, 306 P375",OB Lu,0.540678,3,4,0.180226
77,P227805,"PBS 05, 150",Grammatical,7.388074,44,6,0.167911
439,P249388,"AUCT 5, 181",OB Ura,0.670868,4,3,0.167717
...,...,...,...,...,...,...,...
42,P225058,"TIM 10, 082",OB Ura,0.000000,1,0,0.000000
43,P225059,"TIM 10, 083",exercise,0.000000,3,0,0.000000
54,P225122,"TIM 10, 114",OB Ugumu,0.000000,3,0,0.000000
3,P117397,"MVN 13, 624",,0.000000,3,0,0.000000


In [25]:
anchor = '<a href="http://oracc.org/dcclt/{}" target="_blank">{}</a>' # HTML attributes are separated by spaces, not commas
lex3 = lex2.copy()
lex3['id_text'] = [anchor.format(val,val) for val in lex2['id_text']]

In [27]:
@interact(sort_by = lex3.columns, rows = (1, len(lex3), 1), min_length = (1,500,5))
def sort_df(sort_by = "norm", ascending = False, rows = 25, min_length = 200):
    return lex3.loc[lex3.length >= min_length].sort_values(by = sort_by, ascending = ascending).reset_index(drop=True)[:rows].style
