# Multimodale Versuche der Alignierung historischer Texte

_Andreas Wagner und Manuela Bragagnolo, Max-Planck-Institut für europäische Rechtsgeschichte, Frankfurt/M._

&lt;<wagner@rg.mpg.de>&gt; &lt;<bragagnolog@rg.mpg.de>&gt;

# Table of Contents
 <p><div class="lev1 toc-item"><a href="#Multimodale-Versuche-der-Alignierung-historischer-Texte" data-toc-modified-id="Multimodale-Versuche-der-Alignierung-historischer-Texte-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Multimodale Versuche der Alignierung historischer Texte</a></div><div class="lev2 toc-item"><a href="#Introduction" data-toc-modified-id="Introduction-11"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Introduction</a></div><div class="lev1 toc-item"><a href="#Preparations" data-toc-modified-id="Preparations-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Preparations</a></div><div class="lev1 toc-item"><a href="#TF/IDF-" data-toc-modified-id="TF/IDF--3"><span class="toc-item-num">3&nbsp;&nbsp;</span>TF/IDF </a></div><div class="lev1 toc-item"><a href="#Translations?" data-toc-modified-id="Translations?-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Translations?</a></div><div class="lev2 toc-item"><a href="#New-Approach:-Use-Aligner-from-Machine-Translation-Studies-" data-toc-modified-id="New-Approach:-Use-Aligner-from-Machine-Translation-Studies--41"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>New Approach: Use Aligner from Machine Translation Studies </a></div><div class="lev1 toc-item"><a href="#Similarity-" data-toc-modified-id="Similarity--5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Similarity </a></div><div class="lev1 toc-item"><a href="#Word-Clouds-" data-toc-modified-id="Word-Clouds--6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Word Clouds </a></div>

## Introduction

This file is the continuation of preceding work. Previously, I have worked my way through a couple of text-analysing approaches - such as tf/idf frequencies, n-grams and the like - in the context of a project concerned with Juan de Solórzano Pereira's *Politica Indiana*. This can be seen [here](TextProcessing_Solorzano.ipynb).

In the former context, I got somewhat stuck when I was trying to automatically align corresponding passages of two editions of the same work ... where the one edition would be a **translation** of the other and thus we would have two different languages. In vector terminology, two languages means two almost orthogonal vectors and it makes little sense to search for similarities there.

The present file takes this up, tries to refine an approach taken there and to find alternative ways of analysing a text across several languages. This time, the work concerned is Martín de Azpilcueta's *Manual de confesores*, a work of the 16th century that has seen very many editions and translations, quite a few of them even by the work's original author and it is the subject of the research project ["Martín de Azpilcueta’s Manual for Confessors and the Phenomenon of Epitomisation"](http://www.rg.mpg.de/research/martin-de-azpilcuetas-manual-for-confessors) by Manuela Bragagnolo. 

(There are a few DH-ey things about the project that are not directly of concern here, like a synoptic display of several editions or the presentation of the divergence of many actual translations of a given term. Such aspects are being treated with other software, like [HyperMachiavel](http://hyperprince.ens-lyon.fr/hypermachiavel) or [Lera](http://lera.uzi.uni-halle.de/).)

As in the previous case, the programming language used in the following examples is "python" and the tool used to get prose discussion and code samples together is called ["jupyter"](http://jupyter.org/). (A common way of installing both the language and the jupyter software, especially in windows, is by installing a python "distribution" like [Anaconda](https://www.anaconda.com/what-is-anaconda/).) In jupyter, you have a "notebook" that you can populate with text (if you want to use it, jupyter understands [markdown](http://jupyter-notebook.readthedocs.io/en/stable/examples/Notebook/Working%20With%20Markdown%20Cells.html) code formatting) or code, and a program that pipes a nice rendering of the notebook to a web browser as you are reading right now. In many places in such a notebook, the output that the code samples produce is printed right below the code itself. Sometimes this can be quite a lot of output and depending on your viewing environment you might have to scroll quite some way to get to the continuation of the discussion.

You can save your notebook online (the current one is [here at github](https://github.com/awagner-mainz/notebooks/blob/master/gallery/TextProcessing_Azpilcueta.ipynb)) and there is an online service, nbviewer, able to render any notebook that it can access online. So chances are you are reading this present notebook at the web address [https://nbviewer.jupyter.org/github/awagner-mainz/notebooks/blob/master/gallery/TextProcessing_Azpilcueta.ipynb](https://nbviewer.jupyter.org/github/awagner-mainz/notebooks/blob/master/gallery/TextProcessing_Azpilcueta.ipynb).

A final word about the elements of this notebook:

<div class="alert alertbox alert-success">At some points I am mentioning things I consider to be important decisions or take-away messages for scholarly readers. E.g. whether or not to insert certain artefacts into the very transcription of your text, what the methodological ramifications of a certain approach or parameter are, what the implications of an example solution are, or what a possible interpretation of a certain result might be. I am highlighting these things in a block like this one here or at least in <font color="green">**green bold font**</font>.</div>

<div class="alert alertbox alert-danger">**NOTE:** As I am continually improving the notebook on the side of the source text, wordlists and other parameters, it is sometimes hard to keep the prose description in sync. So while the actual descriptions still apply, the numbers that are mentioned in the prose (as where we have e.g. a "table with 20 rows and 1.672 columns") might no longer reflect the latest state of the sources, auxiliary files and parameters and you should take these with a grain of salt. Best double check them by reading the actual code ;-)

I apologize for the inconsistency.</div>

# Preparations

In [23]:
from typing import Dict
import lxml
from lxml import etree

document=etree.fromstring("""
<TEI xmlns="http://www.tei-c.org/ns/1.0">
<text>
  <body>
    <div n="1">
      <p>
      ... <milestone unit="number" n="9"/>aun que el amor de Dios ha de ser
      grandissimo ..., como despues de. S. Tho.
      <ref target="#nm-0406">b</ref><note xml:id="nm-0406"><p>1. Sec. quaestio
      109. ar. 3.</p></note>, poco ha lo tratamos
      <ref target="#nm-0407">c</ref><note xml:id="nm-0407"><p>in addit. ca.
      Quoniam. de consec. disti. 1. nu. 10.</p></note>. Anadimos, (virtual)
      <milestone unit="number" n="10"/>porque aquella basta, ...
      <ref target="#nm-0408">d</ref><note xml:id="nm-0408"><p>in 4. dis. 14.
      q. 1. art. 3.</p></note>, que pone exemplo ..., que Gabriel sigue
      <ref target="#nm-0409">e</ref><note xml:id="nm-0409"><p>in 4. dis. 14.
      q. 1. col. 12. &amp; 13. &amp; in. 3. di. 27. q. 1. co. 15.</p></note>.
      <milestone unit="other" rendition="#asterisk"/> Y aun, aquel doctissimo,
      ... <ref target="#nm-040a">f</ref><note xml:id="nm-040a"><p>In Codice de
      poeni. q. 2.</p></note>, y con razon, ..., el martyrio atribuya esto
      <ref target="#nm-040b">g</ref><note xml:id="nm-040b"><p>Lib. 2. c. 16.
      de natu. &amp; gra.</p></note>, porque mas haze para esto el amor, ...
      que lo que se padece <ref target="#nm-040c">h</ref><note xml:id="nm-040c">
      <p>Arg. c. 13. 1. ad Corinth.</p></note>. Y puede ser que mas ame, ...,
      como lo prueua bien Medina
      <ref target="#nm-040d">i</ref><note xml:id="nm-040d"><p>in predi.
      q. 2.</p></note>. Por lo qual largamente paresce quan lexos esta esto
      dela opinion de Luthero<milestone unit="other" rendition="#asterisk"/>.
      De lo dicho se collige la razon, ..., segun Syluestro
      <ref target="#nm-040e">k</ref><note xml:id="nm-040e"><p>verb. Contritio.
      q. 1.</p></note>. Diximos <milestone unit="number" n="11"/> (auer
      pecado,) porque el arrepentimiento ...
      </p>
    </div>
  </body>
</text>
</TEI>""")

def segment(chapter: lxml.etree._Element) -> Dict[str, str]:
    segments = {} # this will be returned
    t = []        # this is a buffer
    chap_label = str(chapter.get("n"))
    sect_label = "0"
    for element in chapter.iter():
        if element.get("unit")=="number":
            # milestone: fill and close the previous segment:
            label = chap_label + "_" + sect_label
            segments[label] = " ".join(t)
            # reset buffer
            t = []
            # if there is text after the milestone,
            # add it as first content to the buffer
            if element.tail:
                t.append(" ".join(str.replace(element.tail, "\n", " ").strip().split()))
            # prepare for next labelmaking
            sect_label = str(element.get("n"))
        else:
            if element.text:
                t.append(" ".join(str.replace(element.text, "\n", " ").strip().split()))
            if element.tail:
                t.append(" ".join(str.replace(element.tail, "\n", " ").strip().split()))
    # all elements are processed,
    # add text remainder/current text buffer content
    label = chap_label + "_" + sect_label
    segments[label] = " ".join(t)
    return segments

nsmap = {"tei": "http://www.tei-c.org/ns/1.0"}
xp_divs = etree.XPath("(//tei:body/tei:div)", namespaces = nsmap)

segmented = {}
divs = xp_divs(document)
segments = (segment(div) for div in divs)
for d in segments:
    print(d)

{'1_0': '  ... ', '1_9': 'aun que el amor de Dios ha de ser grandissimo ..., como despues de. S. Tho. b , poco ha lo tratamos 1. Sec. quaestio 109. ar. 3. c . Anadimos, (virtual) in addit. ca. Quoniam. de consec. disti. 1. nu. 10.', '1_10': 'porque aquella basta, ... d , que pone exemplo ..., que Gabriel sigue in 4. dis. 14. q. 1. art. 3. e . in 4. dis. 14. q. 1. col. 12. & 13. & in. 3. di. 27. q. 1. co. 15. Y aun, aquel doctissimo, ... f , y con razon, ..., el martyrio atribuya esto In Codice de poeni. q. 2. g , porque mas haze para esto el amor, ... que lo que se padece Lib. 2. c. 16. de natu. & gra. h  . Y puede ser que mas ame, ..., como lo prueua bien Medina Arg. c. 13. 1. ad Corinth. i . Por lo qual largamente paresce quan lexos esta esto dela opinion de Luthero in predi. q. 2. . De lo dicho se collige la razon, ..., segun Syluestro k . Diximos verb. Contritio. q. 1.', '1_11': '(auer pecado,) porque el arrepentimiento ...'}


In [79]:
document=etree.fromstring("""
<TEI xmlns="http://www.tei-c.org/ns/1.0">
<text><body>
  <div n="1">
    <p>... <milestone unit="number" n="9"/>aa ab ac<ref target="#nm-0406">ad</ref><note xml:id="nm-0406"><p>ae af</p></note> ag
    <ref target="#nm-0407">ah</ref><note xml:id="nm-0407"><p>ai aj</p></note> ak
    <milestone unit="number" n="10"/>ba bb bc<ref target="#nm-0408">bd</ref><note xml:id="nm-0408"><p>be bf</p></note> bg
    <ref target="#nm-0409">bh</ref><note xml:id="nm-0409"><p>bi bj</p></note><milestone unit="other" rendition="#asterisk"/> bk bl<ref target="#nm-040a">bm</ref><note xml:id="nm-040a"><p>bn bo</p></note> bp
    <ref target="#nm-040b">bq</ref><note xml:id="nm-040b"><p>br bs</p></note> bt
    <ref target="#nm-040c">bu</ref><note xml:id="nm-040c"><p>bv</p></note> bw<milestone unit="other" rendition="#asterisk"/>bx. by <milestone unit="number" n="11"/>ca cb ...</p>
  </div>
</body></text>
</TEI>""")

import lxml
from lxml import etree

def flatten(element: lxml.etree._Element):
    t = ""
    if element.text:
        t += " ".join(str.replace(element.text, "\n", " ").strip().split())
    if element.get("unit")=="number":
        t += t + "+ms_" + str(element.get("n")) + "+"
        if element.tail:
            t += " ".join(str.replace(element.tail, "\n", " ").strip().split())
    if element.getchildren():
        t += " ".join((flatten(child)) for child in element.getchildren())
    if element.tail and not(element.get("unit")=="number"):
        t += " ".join(str.replace(element.tail, "\n", " ").strip().split())
    # all elements are processed, add text remainder/current text buffer content
    return t

nsmap = {"tei": "http://www.tei-c.org/ns/1.0"}
xp_divs = etree.XPath("(//tei:body/tei:div)", namespaces = nsmap)
divs = xp_divs(document)

segments = "".join(flatten(div) for div in divs)
print(segments)

...+ms_9+aa ab ac ad ae afag ah ai ajak +ms_10+ba bb bc bd be bfbg bh bi bj bk bl bm bn bobp bq br bsbt bu bvbw bx. by +ms_11+ca cb ...


Unlike in the previous case, where we had word files that we could export as plaintext, in this case Manuela has prepared a sample chapter with four editions transcribed *in parallel* in an office spreadsheet. So we first of all make sure that we have good **UTF-8** comma-separated-value files, e.g. by uploading a **csv** export of our office program of choice to [a CSV Linting service](https://csvlint.io/). (As a side remark, in my case, exporting with LibreOffice provided me with options to select UTF-8 encoding and choose the field delimiter and resulted in a valid csv file. MS Excel did neither of those.) Below, we expect the file at the following position:

In [1]:
sourcePath = 'DHd2019/cap6_align_-_2018-01.csv'

Then, we can go ahead and open the file in python's csv reader:

In [2]:
import csv

sourceFile = open(sourcePath, newline='', encoding='utf-8')
sourceTable = csv.reader(sourceFile)

And next, we read each line into new elements of four respective lists (since we're dealing with one sample chapter, we try to handle it all in memory first and see if we run into problems):

*(Note here and in the following that in most cases, when the program is counting, it does so beginning with zero. Which means that if we end up with 20 segments, they are going to be called segment 0, segment 1, ..., segment 19. There is not going to be a segment bearing the number twenty, although we do have twenty segments. The first one has the number zero and the twentieth one has the number nineteen. Even for more experienced coders, this sometimes leads to mistakes, called "off-by-one errors".)*

In [3]:
import re

# Initialize a list of lists, or two-dimensional list ...
Editions = [[]]

# ...with four sub-lists 0 to 3
for i in range(3):
    a = []
    Editions.append(a)

# Now populate it from our sourceTable
sourceFile.seek(0)             # in repeated runs, restart from the beginning of the file
for row in sourceTable:
    for i, field in enumerate(row):    # We normalize quite a bit here already:
        p = field.replace('¶', ' ¶ ')                  # spaces around ¶ 
        p = re.sub("&([^c])"," & \\1", p)              # always spaces around &, except for &c
        p = re.sub("([,.:?/])(\S)","\\1 \\2", p)       # always a space after ',.:?/'
        p = re.sub("([0-9])([a-zA-Z])", "\\1 \\2", p)  # always a space between numbers and word characters
        p = re.sub("([a-z]) ?\\(\\1\\b", " (\\1", p)         # if a letter is repeated on its own in a bracketed
                                                       # expression it's a note and we eliminate the character
                                                       # from the preceding word
        p = " ".join(p.split())                        # always only one space
        Editions[i].append(p)

print(str(len(Editions[0])) + " rows read.\n")

# As an example, see the first seven sections of the third edition (1556 SPA):
for field in range(len(Editions[2])):
    print(Editions[2][field])

41 rows read.

1556 SPA
¶ Capitulo. 6. De las circunstancias del pecado.
Sumario.
1 Circunstancia que es? nu. I. y que ay siete especies della. nu. 2. Y que se ha de confessar de necessidad, la que muda la especie. nu. 3. Pero no la de aver pecado en confinança de se confessar. n. 4. /Circunstancia de homicidio, y de fornicacion en lugar sagrado se ha de confessar, y la vedada por otra ley diversa &c. nu. 5/ Circunstancia de mentira iocosa, y la que alivia el pecado quando se ha de confessar. nu. 6. 7. & 8. Y quando la del dia de fiesta, de ayuno, o de oracion, o del lugar sagrado. nu. 9 & 10. Y la de la proprioa persona, y de la religion. nu. 11. Y ha de pecar contra consciencia. nume. 12/ [p. 32, corretto 31; 24 pdf] Circunstancia como no es el numero de los pecados nu. 14. Pecaodo multipliarse tantas vezes, quantas se itera, como se ha de entender, y si crece el numero de los pecados por se interpolar la voluntad. nu. 16. Y por mudar el proposito, para no acabar el pecado con otras 

Actually, let's define two more list variables to hold information about the different editions - language and year of print:

In [4]:
numOfEds = 4
language = ["PT", "PT", "ES", "LA"] # I am using language codes that later on can be used in babelnet
year = [1549, 1552, 1556, 1573]

# TF/IDF <a name="tfidf"></a>

In the previous (i.e. Solórzano) analyses, things like tokenization, lemmatization and stop-word lists filtering are explained step by step. Here, we rely on what we have found there and feed it all into functions that are ready-made and available in suitable libraries...

First, we build our lemmatization resource and "function":

In [5]:
lemma = [{} for i in range(numOfEds)]
# lemma    = {}    # we build a so-called dictionary for the lookups

for i in range(numOfEds):
    
    wordfile_path = 'Azpilcueta/wordforms-' + language[i].lower() + '.txt'

    # open the wordfile (defined above) for reading
    wordfile = open(wordfile_path, encoding='utf-8')

    tempdict = []
    for line in wordfile.readlines():
        tempdict.append(tuple(line.split('>'))) # we split each line by ">" and append
                                                # a tuple to a temporary list.

    lemma[i] = {k.strip(): v.strip() for k, v in tempdict} # for every tuple in the temp. list,
                                                    # we strip whitespace and make a key-value
                                                    # pair, appending it to our "lemma"
                                                    # dictionary
    wordfile.close

    print(str(len(lemma[i])) + ' ' + language[i] + ' wordforms known to the system.')


878594 PT wordforms known to the system.
878594 PT wordforms known to the system.
613097 ES wordforms known to the system.
1854 LA wordforms known to the system.


Again, a quick test: Let's see with which "lemma"/basic word the particular wordform "diremos" is associated, or, in other words, what *value* our lemma variable returns when we query for the *key* "diremos":

In [6]:
lemma[language.index("PT")]['diremos']

'dizer'

And we are going to need the stopwords lists:

In [7]:
stopwords = []

for i in range(numOfEds):
    
    stopwords_path = 'DHd2019/stopwords-' + language[i].lower() + '.txt'
    stopwords.append(open(stopwords_path, encoding='utf-8').read().splitlines())

    print(str(len(stopwords[i])) + ' ' + language[i]
          + ' stopwords known to the system, e.g.: ' + str(stopwords[i][100:119]) + '\n')

690 PT stopwords known to the system, e.g.: ['ii', 'iii', 'iv', 'v', 'vi', 'vii', 'viii', 'ix', 'x', 'xii', 'xiii', 'xiv', 'xv', 'acerca', 'ad', 'adeus', 'agora', 'ainda', 'alem']

690 PT stopwords known to the system, e.g.: ['ii', 'iii', 'iv', 'v', 'vi', 'vii', 'viii', 'ix', 'x', 'xii', 'xiii', 'xiv', 'xv', 'acerca', 'ad', 'adeus', 'agora', 'ainda', 'alem']

756 ES stopwords known to the system, e.g.: ['cierta', 'ciertas', 'cierto', 'ciertos', 'cinco', 'claro', 'comentó', 'como', 'cómo', 'con', 'conmigo', 'conocer', 'conseguimos', 'conseguir', 'considera', 'consideró', 'consigo', 'consigue', 'consiguen']

396 LA stopwords known to the system, e.g.: ['ac', 'ad', 'adhic', 'adhuc', 'ae', 'ait', 'ali', 'alii', 'aliis', 'alio', 'aliqua', 'aliqui', 'aliquid', 'aliquis', 'aliquo', 'am', 'an', 'ante', 'apud']



(In contrast to simpler numbers that have been filtered out by the stopwords filter, I have left numbers representing years like "1610" in place.)

And, later on when we try sentence segmentation, we are going to need the list of abbreviations - words where a subsequent period not necessarily means a new sentence:

In [8]:
abbreviations = [] # As of now, this is one for all languages :-(

abbrs_path = 'DHd2019/abbreviations.txt'
abbreviations = open(abbrs_path, encoding='utf-8').read().splitlines()

print(str(len(abbreviations)) + ' abbreviations known to the system, e.g.: ' + str(abbreviations[100:119]))

229 abbreviations known to the system, e.g.: ['in', 'ind', 'ing', 'Io', 'iul', 'Iuli', 'iur', 'iust', 'IV', 'iv', 'IX', 'ix', 'J', 'K', 'l', 'L', 'li', 'lib', 'M']


Next, we should find some very characteristic words for each segment for each edition. (Let's say we are looking for the "Top 20".) We should build a vocabulary for each edition individually and only afterwards work towards a common vocabulary of several "Top n" sets.

In [9]:
import re
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

numTopTerms = 20

# So first we build a tokenising and lemmatising function (per language) to work as
# an input filter to the CountVectorizer function
def ourLaLemmatiser(str_input):
    wordforms = re.split('\W+', str_input)
    return [lemma[language.index("LA")][wordform].lower().strip() if wordform in lemma[language.index("LA")] else wordform.lower().strip() for wordform in wordforms ]
def ourEsLemmatiser(str_input):
    wordforms = re.split('\W+', str_input)
    return [lemma[language.index("ES")][wordform].lower().strip() if wordform in lemma[language.index("ES")] else wordform.lower().strip() for wordform in wordforms ]
def ourPtLemmatiser(str_input):
    wordforms = re.split('\W+', str_input)
    return [lemma[language.index("PT")][wordform].lower().strip() if wordform in lemma[language.index("PT")] else wordform.lower().strip() for wordform in wordforms ]

def ourLemmatiser(lang):
    if (lang == "LA"):
        return ourLaLemmatiser
    if (lang == "ES"):
        return ourEsLemmatiser
    if (lang == "PT"):
        return ourPtLemmatiser

def ourStopwords(lang):
    if (lang == "LA"):
        return stopwords[language.index("LA")]
    if (lang == "ES"):
        return stopwords[language.index("ES")]
    if (lang == "PT"):
        return stopwords[language.index("PT")]

topTerms = []
for i in range(numOfEds):

    topTermsEd = []
    # Initialize the library's function, specifying our
    # tokenizing function from above and our stopwords list.
    tfidf_vectorizer = TfidfVectorizer(stop_words=ourStopwords(language[i]), use_idf=True, tokenizer=ourLemmatiser(language[i]), norm='l2')

    # Finally, we feed our corpus to the function to build a new "tfidf_matrix" object
    tfidf_matrix = tfidf_vectorizer.fit_transform(Editions[i])

    # convert your matrix to an array to loop over it
    mx_array = tfidf_matrix.toarray()

    # get your feature names
    fn = tfidf_vectorizer.get_feature_names()

    # now loop through all segments and get the respective top n words.
    pos = 0
    for j in mx_array:
        # We have empty segments, i.e. none of the words in our vocabulary has any tf/idf score > 0
        if (j.max() == 0):
            topTermsEd.append([("", 0)])
        # otherwise append (present) lemmatised words until numTopTerms or the number of words (-stopwords) is reached
        else:
            topTermsEd.append(
                [(fn[x], j[x]) for x in ((j*-1).argsort()) if j[x] > 0] \
                [:min(numTopTerms, len(
                    [word for word in re.split('\W+', Editions[i][pos]) if ourLemmatiser(language[i])(word) not in stopwords]
                ))])
        pos += 1
    topTerms.append(topTermsEd)

# Translations?

Maybe there is an approach to inter-lingual comparison after all. After a first unsuccessful try with [conceptnet.io](http://conceptnet.io), I next want to try [Babelnet](http://babelnet.org) in order to lookup synonyms, related terms and translations. I still have to study the [API](http://babelnet.org/guide)...



For example, let's take this single segment 19:

In [10]:
segment_no = 18

And then first let's see how this segment compares in the different editions:

In [11]:
print("Comparing words from segments " + str(segment_no) + " ...")
print(" ")
print("Here is the segment in the four editions:")
print(" ")
for i in range(numOfEds):
    print("Ed. " + str(i) + ":")
    print("------")
    print(Editions[i][segment_no])
    print(" ")

print(" ")
print(" ")

# Build List of most significant words for a segment

print("Most significant words in the segment:")
print(" ")
for i in range(numOfEds):
    print("Ed. " + str(i) + ":")
    print("------")
    print(topTerms[i][segment_no])
    print(" ")

Comparing words from segments 18 ...
 
Here is the segment in the four editions:
 
Ed. 0:
------
6. ¶ A circunstancia do dia da festa (segundo algũs) [8] he necessaria na confissam : como o que em tal dia fornicou : porque fex obra servuil f. pecado mortal que he obra do diabo. Eam quebrou dous mandamentos . f. ho sexto & ho terceyro. Outros tem que nam he necessaria porque segundo S. Thomas 3 sententiarum : ho precepto de sanctificar ho sabado, entendido literalmente nam defende obra servil spiritual f. peccado Silvest. confessio. Ho doutor Navarro, de poenitentia d. 5. c. Consideret : tem que entonces a circunstancia do dia de festa se ha de confessar de necessidade : quando ho peccado fosse feyto a fim de fazer obra manual defesa em aquelle dia : ou quando fiz esse peccado mortal com [p. 27, 73 pdf] intençam & proposito de quebrantar a festa. Esta [9] opiniam parece razoavel & assaz secura.
 
Ed. 1:
------
9. ¶ A VIII que a circunstancia do dia [9] da festa não se ha de confessar ne

Now we look up the "concepts" associated to those words in babelnet. Then we look up the concepts associated with the words of the present segment from another edition/language, and see if the concepts are the same.

But we have to decide on some particular editions to get things started. Let's take the Spanish and Latin ones:

In [12]:
startEd = 1
secondEd = 2

And then we can continue...

In [13]:
import urllib
import json
from collections import defaultdict

babelAPIKey = '18546fd3-8999-43db-ac31-dc113506f825'
babelGetSynsetIdsURL = "https://babelnet.io/v5/getSynsetIds?" + \
                       "targetLang=LA&targetLang=ES&targetLang=PT" + \
                       "&searchLang=" + language[startEd] + \
                       "&key=" + babelAPIKey + \
                       "&lemma="

# Build lists of possible concepts
top_possible_conceptIDs = defaultdict(list)
for (word, val) in topTerms[startEd][segment_no]:
    concepts_uri = babelGetSynsetIdsURL + urllib.parse.quote(word)
    response = urllib.request.urlopen(concepts_uri)
    conceptIDs = json.loads(response.read().decode(response.info().get_param('charset') or 'utf-8'))
    for rel in conceptIDs:
        top_possible_conceptIDs[word].append(rel.get("id"))

print(" ")
print("For each of the '" + language[startEd] + "' words, here are possible synsets:")
print(" ")

for word in top_possible_conceptIDs:
    print(word + ":" + " " + ', '.join(c for c in top_possible_conceptIDs[word]))
    print(" ")

print(" ")
print(" ")
print(" ")

babelGetSynsetIdsURL2 = "https://babelnet.io/v5/getSynsetIds?" + \
                        "targetLang=LA&targetLang=ES&targetLang=PT" + \
                        "&searchLang=" + language[secondEd] + \
                        "&key=" + babelAPIKey + \
                        "&lemma="

# Build list of 10 most significant words in the second language
top_possible_conceptIDs_2 = defaultdict(list)
for (word, val) in topTerms[secondEd][segment_no]:
    concepts_uri = babelGetSynsetIdsURL2 + urllib.parse.quote(word)
    response = urllib.request.urlopen(concepts_uri)
    conceptIDs = json.loads(response.read().decode(response.info().get_param('charset') or 'utf-8'))
    for rel in conceptIDs:
        top_possible_conceptIDs_2[word].append(rel.get("id"))

print(" ")
print("For each of the '" + language[secondEd] + "' words, here are possible synsets:")
print(" ")
for word in top_possible_conceptIDs_2:
    print(word + ":" + " " + ', '.join(c for c in top_possible_conceptIDs_2[word]))
    print(" ")

 
For each of the 'PT' words, here are possible synsets:
 
festa: bn:00008828n, bn:00001736n, bn:00036825n, bn:15095656n, bn:00033859n, bn:00040340n, bn:00060836n, bn:10812634n, bn:00016986n, bn:00016987n, bn:04048895n, bn:00060835n, bn:00034150n, bn:10733905n, bn:00008436n, bn:00034151n, bn:02506874n, bn:00017345n, bn:00008433n, bn:00071089n, bn:06971214n, bn:18397962n, bn:10858695n
 
verbo: bn:00079779n, bn:00079778n, bn:00060722n, bn:00081546n
 
ne: bn:00006824n, bn:03518732n, bn:00035065n, bn:03295403n
 
z: bn:01487626n, bn:00032569n, bn:03226685n, bn:04052201n, bn:02173555n, bn:04959525n, bn:14056020n, bn:13940586n, bn:00682740n, bn:01436748n
 
artículo: bn:00006137n
 
ubi: bn:03316041n
 
obrar: bn:00084350v
 
aurea: bn:03183579n, bn:00034247n, bn:04307446n, bn:04508374n, bn:00844080n, bn:02427608n, bn:15419989n
 
resoluto: bn:00101015a
 
 
 
 
 
For each of the 'ES' words, here are possible synsets:
 
fiesta: bn:16356131n, bn:00036825n, bn:02951623n, bn:17131948n, bn:01840936n, b

In [14]:
# calculate number of overlapping terms
values_a = set([item for sublist in top_possible_conceptIDs.values() for item in sublist])
values_b = set([item for sublist in top_possible_conceptIDs_2.values() for item in sublist])
overlaps = values_a & values_b
print("Overlaps: " + str(overlaps))

babelGetSynsetInfoURL = "https://babelnet.io/v5/getSynset?key=" + babelAPIKey + \
                        "&targetLang=LA&targetLang=ES&targetLang=PT" + \
                        "&id="

for c in overlaps:
    info_uri = babelGetSynsetInfoURL + c
    response = urllib.request.urlopen(info_uri)
    words = json.loads(response.read().decode(response.info().get_param('charset') or 'utf-8'))
    
    senses = words['senses']
    for result in senses[:1]:
        lemma = result['properties'].get('fullLemma')
        resultlang = result['properties'].get('language')
        print(c + ": " + lemma + " (" + resultlang.lower() + ")")

# what's left: do a nifty ranking

Overlaps: {'bn:00036825n', 'bn:00060836n', 'bn:00034151n', 'bn:00016986n', 'bn:00034150n', 'bn:00079779n', 'bn:00079778n', 'bn:00008436n', 'bn:00033859n', 'bn:00060835n'}
bn:00036825n: solemnidad (es)
bn:00060836n: festa (pt)
bn:00034151n: festival (es)
bn:00016986n: celebración (es)
bn:00034150n: festival (es)
bn:00079779n: verbo (es)
bn:00079778n: verbo (pt)
bn:00008436n: banquete (es)
bn:00033859n: fiesta (es)
bn:00060835n: fiesta (es)


Actually I think this is somewhat promising - an overlap of four independent, highly meaning-bearing words, or of forty-something related concepts. At first glance, they should be capable of distinguishing this section from all the other ones. However, getting this result was made possible by quite a bit of manual tuning the stopwords and lemmatization dictionaries before, so this work is important and cannot be eliminated.

## New Approach: Use Aligner from Machine Translation Studies <a name="newApproach"/>

In contrast to what I thought previously, there is a couple of tools for automatically aligning parallel texts after all. After some investigation of the [literature](https://www.zotero.org/groups/2198990/hyperazpilcueta/items/collectionKey/KQ84ZD4G), the most promising candidate seems to be [*HunAlign*](https://github.com/danielvarga/hunalign). However, as this is a commandline tool written in C++ (there is [*LF Aligner*](https://sourceforge.net/projects/aligner/), a GUI, available), it is not possible to run it from within this notebook.

First results were problematic, due to the different literary conventions that our editions follow: Punctuation was used inconsistently (but sentence length is one of the most relevant factors for aligning), as were abbreviations and notes.

My current idea is to use this notebook to preprocess the texts and to feed a cleaned up version of them to hunalign...

Coming back to this after a first couple of rounds with *Hunalign*, I have the feeling that the fact that literary conventions are so divergent probably means that Aligning via sentence lengths is a bad idea in our from the outset. Probably better to approach this with GMA or similar methods. Anyway, here are the first attempts with *Hunalign*:

In [15]:
from nltk import sent_tokenize

## First, train the sentence tokenizer:
from pprint import pprint
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktLanguageVars, PunktTrainer
 
class BulletPointLangVars(PunktLanguageVars):
    sent_end_chars = ('.', '?', ':', '!', '¶')

trainer = PunktTrainer()
trainer.INCLUDE_ALL_COLLOCS = True
tokenizer = PunktSentenceTokenizer(trainer.get_params(), lang_vars = BulletPointLangVars())
for tok in abbreviations : tokenizer._params.abbrev_types.add(tok)

## Now we sentence-segmentize all our editions, printing results and saving them to files:

# folder for the several segment files:
outputBase = 'Azpilcueta/sentences'
dest       = None

# Then, sentence-tokenize our segments:
for i in range(numOfEds):
    dest = open(outputBase + '_' + str(year[i]) + '.txt',
                    encoding='utf-8',
                    mode='w')
    print("Sentence-split of ed. " + str(i) + ":")
    print("------")
    for s in range(0, len(Editions[i])):
        for a in tokenizer.tokenize(Editions[i][s]):
            dest.write(a.strip() + '\n')
            print(a)
        dest.write('<p>\n')
        print('<p>')
    dest.close()


Sentence-split of ed. 0:
------
1549 por
<p>
¶
[1]Capitulo sexto das circunstancias.
<p>
<p>
<p>
<p>
<p>
1. Preguntelhe tambẽ as circunstancias necessarias quando ho penitente as nam sabe dizer.
As quaes segundo sam Boãventura & Ricardo 4 dist 17 sam em tres maneyras :
hunas que mudan em otra specie.
Estas de necessidade se ham de confessar :
assi como em ho furto, a circumstancia do lugar sacrago, ou da cousa sagrada :
mudaa ē sacrilegio.
Polo qual o que furtou algũa cousa de lugar sagrado .
f. da igreja :
nam nastaria dizer que fes hum furto :
mas he necessario dizer que ho furtou da igreja :
ou que a cousa era sagrada.
posto que a uvesse tomado de lugar nam sagrado.
Ho mesmo he do homicidio quando ho ouuesse cometido na igreja.
<p>
<p>
<p>
<p>
<p>
2. [2] Ou se cometeo peccado de fornicaçã com molher casada :
nam satisfas dizendo que cometeo peccado de luxuria :
mas he necessario declarar, que era com casada, religiosa ou parenta.
Porque primeyro he adulterio :
o segundo sacrilegio :

... lemmatize/stopwordize it---

In [17]:
# folder for the several segment files:
outputBase = 'Azpilcueta/sentences-lemmatized'
dest       = None

# Then, sentence-tokenize our segments:
for i in range(numOfEds):
    dest = open(outputBase + '_' + str(year[i]) + '.txt',
                    encoding='utf-8',
                    mode='w')
    stp = set(stopwords[i])
    print("Cleaned/lemmatized ed. " + str(i) + " [" + language[i] + "]:")
    print("------")
    for s in range(len(Editions[i])):
        for a in tokenizer.tokenize(Editions[i][s]):
            dest.write(" ".join([x for x in ourLemmatiser(language[i])(a) if x not in stp]) + '\n')
            print(" ".join([x for x in ourLemmatiser(language[i])(a) if x not in stp]))
        dest.write('<p>\n')
        print('<p>')
    dest.close()


Cleaned/lemmatized ed. 0 [PT]:
------
1549
<p>


TypeError: string indices must be integers

With these preparations made, *Hunaligning* 1552 and 1556 reports "Quality 0.63417" for unlemmatized and "Quality 0.51392" for lemmatized versions of the texts for its findings which still contain many errors. Removing ":" from the sentence end marks gives "Quality 0.517048/0.388377", but from a first impression with fewer errors. Results can be output in different formats, xls files are [here](Azpilcueta/align_2018.07.05_16.10.43/sentences_1552-sentences_1556.xls) and [here](Azpilcueta/align_2018.07.05_15.45.13/sentences-lemmatized_1552-sentences-lemmatized_1556.xls).

# Similarity <a name="DocumentSimilarity"/>

It seems we could now create another matrix replacing lemmata with concepts and retaining the tf/idf values (so as to keep a weight coefficient to the concepts). Then we should be able to calculate similarity measures across the same concepts...

The approach to choose would probably be the "cosine similarity" of concept vector spaces. Again, there is a library ready for us to use (but you can find some documentation [here](http://blog.christianperone.com/2013/09/machine-learning-cosine-similarity-for-vector-space-models-part-iii/), [here](http://scikit-learn.org/stable/modules/metrics.html#cosine-similarity) and [here](https://en.wikipedia.org/wiki/Cosine_similarity).)

**However, this is where I have to take a break now. I will return to here soon...**

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

similarities = pd.DataFrame(cosine_similarity(tfidf_matrix))
similarities[round(similarities, 0) == 1] = 0 # Suppress a document's similarity to itself
print("Pairwise similarities:")
print(similarities)

In [None]:
print("The two most similar segments in the corpus are")
print("segments", \
      similarities[similarities == similarities.values.max()].idxmax(axis=0).idxmax(axis=1), \
      "and", \
      similarities[similarities == similarities.values.max()].idxmax(axis=0)[ similarities[similarities == similarities.values.max()].idxmax(axis=0).idxmax(axis=1) ].astype(int), \
      ".")
print("They have a similarity score of")
print(similarities.values.max())

<div class="alert alertbox alert-success">Of course, in every set of documents, we will always find two that are similar in the sense of them being more similar to each other than to the other ones. Whether or not this actually *means* anything in terms of content is still up to scholarly interpretation. But at least it means that a scholar can look at the two documents and when she determines that they are not so similar after all, then perhaps there is something interesting to say about similar vocabulary used for different puproses. Or the other way round: When the scholar knows that two passages are similar, but they have a low "similarity score", shouldn't that say something about the texts's rhetorics?</div>

# Word Clouds <a name="WordClouds"/>

We can use a library that takes word frequencies like above, calculates corresponding relative sizes of words and creates nice wordcloud images for our sections (again, taking the fourth segment as an example) like this:

In [None]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# We make tuples of (lemma, tf/idf score) for one of our segments
# But we have to convert our tf/idf weights to pseudo-frequencies (i.e. integer numbers)
frq = [ int(round(x * 100000, 0)) for x in Editions[1][3]]
freq = dict(zip(fn, frq))

wc = WordCloud(background_color=None, mode="RGBA", max_font_size=40, relative_scaling=1).fit_words(freq)

# Now show/plot the wordcloud
plt.figure()
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()

In order to have a nicer overview over the many segments than is possible in this notebook, let's create a new html file listing some of the characteristics that we have found so far...

In [None]:
outputDir = "Azpilcueta"
htmlfile = open(outputDir + '/Overview.html', encoding='utf-8', mode='w')

# Write the html header and the opening of a layout table
htmlfile.write("""<!DOCTYPE html>
<html>
    <head>
        <title>Section Characteristics</title>
        <meta charset="utf-8"/>
    </head>
    <body>
        <table>
""")

a = [[]]
a.clear()
dicts = []
w = []

# For each segment, create a wordcloud and write it along with label and
# other information into a new row of the html table
for i in range(len(mx_array)):
    # this is like above in the single-segment example...
    a.append([ int(round(x * 100000, 0)) for x in mx_array[i]])
    dicts.append(dict(zip(fn, a[i])))
    w.append(WordCloud(background_color=None, mode="RGBA", \
                       max_font_size=40, min_font_size=10, \
                       max_words=60, relative_scaling=0.8).fit_words(dicts[i]))
    # We write the wordcloud image to a file
    w[i].to_file(outputDir + '/wc_' + str(i) + '.png')
    # Finally we write the column row
    htmlfile.write("""
            <tr>
                <td>
                    <head>Section {a}: <b>{b}</b></head><br/>
                    <img src="./wc_{a}.png"/><br/>
                    <small><i>length: {c} words</i></small>
                </td>
            </tr>
            <tr><td>&nbsp;</td></tr>
""".format(a = str(i), b = label[i], c = len(tokenised[i])))

# And then we write the end of the html file.
htmlfile.write("""
        </table>
    </body>
</html>
""")
htmlfile.close()

This should have created a nice html file which we can open [here](./Solorzano/Overview.html).