# TF-IDF

We'll rely on term frequency times inverse document frequency (TF-IDF) to measure meaningful similarity between documents. Let's start by generating a similarity matrix for the separate constituent parts of _Stjórn_.
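
To make the measure concrete, here is a minimal stdlib-only sketch of TF-IDF weighting followed by cosine similarity. It assumes plain whitespace tokenisation and the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1` with L2 normalisation, which as far as I know matches the `TfidfVectorizer` defaults. The function names and the three toy sentences are invented for illustration and do not come from the corpus.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, scaled to unit length."""
    tokenised = [doc.lower().split() for doc in docs]  # assumption: whitespace tokens
    n_docs = len(tokenised)
    # Document frequency: the number of documents each term occurs in.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        tf = Counter(tokens)
        # Raw term count times smoothed inverse document frequency.
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        }
        # L2-normalise so that cosine similarity reduces to a dot product.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors


def cosine(a, b):
    """Dot product of two unit-length sparse vectors."""
    return sum(w * b[term] for term, w in a.items() if term in b)


# Invented toy documents, only to exercise the functions.
toy_docs = [
    "guð skapaði himin ok jörð",
    "guð skapaði manninn",
    "konungr reið til þings",
]
vecs = tfidf_vectors(toy_docs)
sim_01 = cosine(vecs[0], vecs[1])  # two shared terms
sim_02 = cosine(vecs[0], vecs[2])  # no shared terms
```

Terms that occur in every document get the lowest weight, so similarity is driven by vocabulary that a pair of texts shares but the rest of the corpus lacks.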

## TODO

Go back to the generate-scripts and filter out accented vowels as well as eth (ð).

In [None]:
import os,glob,json
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
#from unidecode import unidecode

In [None]:
titles = ['prologue', 'introduction', 'gn', 'ex', 'lv', 'nm', 'dt', 'ios', 'idc', 'rt', '1sm', '2sm', '3rg', '4rg']
tokens = []
for title in titles:
    with open(f"nlp/{title}.txt") as raw:
        document = raw.read().replace('\n', ' ')
        tokens.extend(document.split())

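# Token offsets into the concatenated text, one entry per part of Stjórn.
# A 2-tuple is a plain (start, end) slice. A 3-tuple (start, end, resume)
# takes tokens[start:end] plus everything from `resume` to the end, so
# stjorn3 skips over the stretch assigned to stjorn4.
work_indices = {
    'stjorn1': (650,124417),
    'stjorn2': (124417,147678),
    'stjorn3': (147678,156943,160719),
    'stjorn4': (156943,160719)
}

stjorn = dict()
for _work, _range in work_indices.items():
    if len(_range) == 2:
        stjorn[_work] = ' '.join(tokens[_range[0]:_range[1]])
    else:
        stjorn[_work] = ' '.join(tokens[_range[0]:_range[1]] + tokens[_range[2]:])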

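menota = dict()
for text in glob.glob('../menota/dipl/*txt'):
    ref = os.path.basename(text).replace('.txt', '')
    with open(text) as doc:
        # NOTE: newlines are removed outright here, not replaced with a space
        # as for Stjórn above, so words separated only by a line break get joined.
        menota[ref] = doc.read().replace('\n', '')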

In [8]:
vectorizer = TfidfVectorizer(min_df=1)
model = vectorizer.fit_transform(stjorn.values())
df = pd.DataFrame(cosine_similarity(model), stjorn.keys(), stjorn.keys())
df

| | stjorn1 | stjorn2 | stjorn3 | stjorn4 |
|---|---|---|---|---|
| stjorn1 | 1.0 | 0.811713 | 0.508459 | 0.842683 |
| stjorn2 | 0.811713 | 1.0 | 0.394915 | 0.833537 |
| stjorn3 | 0.508459 | 0.394915 | 1.0 | 0.46975 |
| stjorn4 | 0.842683 | 0.833537 | 0.46975 | 1.0 |


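This tells us _Stjórn III_ is the most distinct of the four parts: its similarity to the others ranges from `0.39` to `0.51`, while the other three all score above `0.81` against one another. Remarkably, it is also the constituent text least similar to _Stjórn IV_ (`0.47`), which covers some of the same ground. Perhaps further analysis can tell us how.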
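
One possible starting point for that analysis is to ask which terms contribute most to a given similarity score. For unit-length TF-IDF vectors the cosine similarity is a sum of per-term products, so the products can simply be ranked. The sketch below is an assumption about how one might go about it, not something the notebook already does: `top_shared_terms` is a hypothetical helper, and the weights are hand-made placeholders rather than values from the model.

```python
def top_shared_terms(vec_a, vec_b, n=5):
    """Rank shared terms by their contribution to the cosine similarity.

    Both arguments are {term: weight} dicts, assumed to be unit length.
    """
    contributions = {
        term: weight * vec_b[term]
        for term, weight in vec_a.items()
        if term in vec_b
    }
    return sorted(contributions.items(), key=lambda kv: kv[1], reverse=True)[:n]


# Hypothetical, roughly unit-length weights, for illustration only.
doc_a = {"ok": 0.8, "konungr": 0.5, "guð": 0.33}
doc_b = {"ok": 0.6, "guð": 0.7, "þing": 0.39}
ranked = top_shared_terms(doc_a, doc_b)
```

With the real model, the weights would come from the rows of `model`, and the term names from `vectorizer.get_feature_names_out()` in recent scikit-learn versions.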

First let's add _Konungs skuggsjá_ from Menota, as well as Unger's own edition of the _Norwegian Homily Book_. Fingers crossed that we have got the normalization standard of the former to approach Unger's methods reasonably well.

In [9]:
nhb_titles = ['alcuin', 'hom', 'olafr', 'visio', 'paternoster', 'anhang1'] # this is the sequence matched in Menota
nhb = ''
for title in nhb_titles:
    filepath = f'../nhb/nlp/{title}.txt'
    with open(filepath) as doc:
        # NOTE: as with Menota, newlines are removed rather than replaced with a space.
        nhb = nhb + doc.read().replace('\n', '')
# The Stjórn parts first, then Konungs skuggsjá and the Homily Book, matching `labels`.
labels = list(stjorn.keys()) + ['ks', 'nhb']
stjorn_plus = list(stjorn.values()) + [menota['nks235g_konungs_skuggsja'], nhb]
model = vectorizer.fit_transform(stjorn_plus)
df = pd.DataFrame(cosine_similarity(model), labels, labels)
df

| | stjorn1 | stjorn2 | stjorn3 | stjorn4 | ks | nhb |
|---|---|---|---|---|---|---|
| stjorn1 | 1.0 | 0.786055 | 0.523931 | 0.812032 | 0.242538 | 0.385928 |
| stjorn2 | 0.786055 | 1.0 | 0.40258 | 0.797535 | 0.228309 | 0.329922 |
| stjorn3 | 0.523931 | 0.40258 | 1.0 | 0.474961 | 0.696934 | 0.827073 |
| stjorn4 | 0.812032 | 0.797535 | 0.474961 | 1.0 | 0.244245 | 0.385029 |
| ks | 0.242538 | 0.228309 | 0.696934 | 0.244245 | 1.0 | 0.710246 |
| nhb | 0.385928 | 0.329922 | 0.827073 | 0.385029 | 0.710246 | 1.0 |


Next, let's model all of Menota along with _Stjórn_. We'll leave Unger's _Homily Book_ in alongside the Menota edition, just for comparison's sake.

In [None]:
corpus = []
titles = []
for k,v in stjorn.items():
    titles.append(k)
    corpus.append(v)
titles.append('nhb')
corpus.append(nhb)
for k,v in menota.items():
    titles.append(k)
    corpus.append(v)
model = vectorizer.fit_transform(corpus)
df = pd.DataFrame(cosine_similarity(model), titles, titles)
df

| | stjorn1 | stjorn2 | stjorn3 | stjorn4 | nhb | am132_egils_saga | am162btheta_njals_saga | holmPerg30_langslog | am1056IX_konungs_skuggsja_fragment | am78_kristinrettir | ... | am132_hallfredar_saga | am35_heimskringla1 | am242_codex_wormianus | dg8II_olafs_saga | am28_codex_runicus | holmPerg34_boejarlog | dg8I_landslog | nraNorrFragm55A_hakonar_saga | nraNorrFragm52_olafs_saga_helga_hin_elzta | holmPerg6_barlaams_saga |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| stjorn1 | 1.0 | 0.748352 | 0.463369 | 0.716519 | 0.321304 | 0.729321 | 0.285188 | 0.450608 | 0.460412 | 0.185049 | ... | 0.638843 | 0.263255 | 0.772092 | 0.240468 | 0.308482 | 0.599206 | 0.555753 | 0.41925 | 0.20371 | 0.256146 |
| stjorn2 | 0.748352 | 1.0 | 0.367732 | 0.690449 | 0.278872 | 0.683951 | 0.22812 | 0.471444 | 0.422574 | 0.194668 | ... | 0.615179 | 0.228013 | 0.700719 | 0.225097 | 0.278706 | 0.635361 | 0.551214 | 0.362248 | 0.188396 | 0.242982 |
| stjorn3 | 0.463369 | 0.367732 | 1.0 | 0.422768 | 0.804556 | 0.423967 | 0.735041 | 0.264177 | 0.232953 | 0.575662 | ... | 0.358496 | 0.747495 | 0.434986 | 0.755522 | 0.118952 | 0.330666 | 0.343903 | 0.557097 | 0.643045 | 0.816467 |
| stjorn4 | 0.716519 | 0.690449 | 0.422768 | 1.0 | 0.305745 | 0.627295 | 0.264138 | 0.405317 | 0.363065 | 0.182923 | ... | 0.537848 | 0.241138 | 0.637794 | 0.219982 | 0.242823 | 0.509276 | 0.477872 | 0.374106 | 0.188195 | 0.214269 |
| nhb | 0.321304 | 0.278872 | 0.804556 | 0.305745 | 1.0 | 0.354436 | 0.653088 | 0.338534 | 0.236618 | 0.671409 | ... | 0.277612 | 0.702398 | 0.406509 | 0.78854 | 0.08495 | 0.371127 | 0.402062 | 0.437926 | 0.642324 | 0.815319 |
| am132_egils_saga | 0.729321 | 0.683951 | 0.423967 | 0.627295 | 0.354436 | 1.0 | 0.35114 | 0.504718 | 0.443 | 0.26233 | ... | 0.758933 | 0.378394 | 0.78232 | 0.353706 | 0.276509 | 0.682695 | 0.589778 | 0.548603 | 0.297073 | 0.318854 |
| am162btheta_njals_saga | 0.285188 | 0.22812 | 0.735041 | 0.264138 | 0.653088 | 0.35114 | 1.0 | 0.176453 | 0.141569 | 0.474757 | ... | 0.295919 | 0.660115 | 0.285784 | 0.647034 | 0.071623 | 0.271676 | 0.212355 | 0.469486 | 0.583432 | 0.701552 |
| holmPerg30_langslog | 0.450608 | 0.471444 | 0.264177 | 0.405317 | 0.338534 | 0.504718 | 0.176453 | 1.0 | 0.353913 | 0.472909 | ... | 0.423559 | 0.211825 | 0.558766 | 0.300529 | 0.309754 | 0.535706 | 0.627539 | 0.262156 | 0.179076 | 0.236361 |
| am1056IX_konungs_skuggsja_fragment | 0.460412 | 0.422574 | 0.232953 | 0.363065 | 0.236618 | 0.443 | 0.141569 | 0.353913 | 1.0 | 0.191966 | ... | 0.38313 | 0.168137 | 0.519105 | 0.182938 | 0.172372 | 0.439638 | 0.395357 | 0.207105 | 0.152221 | 0.187431 |
| am78_kristinrettir | 0.185049 | 0.194668 | 0.575662 | 0.182923 | 0.671409 | 0.26233 | 0.474757 | 0.472909 | 0.191966 | 1.0 | ... | 0.197786 | 0.553633 | 0.313521 | 0.656271 | 0.188478 | 0.391393 | 0.60287 | 0.349404 | 0.510014 | 0.654574 |


That's not a good sign: the two editions of the _Norwegian Homily Book_ do no better than `0.54` similarity. If I run `unidecode` (which strips the accents off vowels, among other things), it only gets up to `0.55`.
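
For the TODO at the top, a stdlib-only alternative to `unidecode` could look like the sketch below. It strips combining marks after NFD decomposition and maps eth by hand, since eth has no Unicode decomposition. The helper name `fold`, the choice of `d` as the replacement, and the decision to leave thorn, æ and ø alone are all assumptions, not the normalization the generate-scripts actually use.

```python
import unicodedata

# Letters that do not decompose under NFD need an explicit mapping.
# Assumption: eth folds to "d"; extend the table if other letters should fold too.
EXTRA = str.maketrans({"ð": "d", "Ð": "D"})


def fold(text):
    """Strip accents from vowels and replace eth, leaving everything else intact."""
    decomposed = unicodedata.normalize("NFD", text)
    # Drop combining marks (category Mn), e.g. the acute accent split off from "ó".
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return stripped.translate(EXTRA)


folded = fold("Stjórn guð þá")
```

Applying the same function to both editions before vectorizing would at least rule out accent and eth conventions as the cause of the low score. If I remember correctly, `TfidfVectorizer` also accepts `strip_accents='unicode'`, but that should leave eth untouched for the same reason.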