<a href="https://colab.research.google.com/github/nicoschmidt/DigitalCentones/blob/master/SmartTexts2019/Practicum.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Smart Texts 2019 Practicum May 27th:
# Christus Patiens and the opportunities for Digital Reading

Welcome to the Smart Texts practicum! Today we will investigate the possibilities of computer-aided intertextual reading.
In this session you will get to know some techniques to handle a digital version of the Greek cento Christus Patiens from late antiquity by Gregorius Nazianzenus. You can have a look at a digital version, an edition by Johann Georg Brambs from 1885 in the Scaife Viewer: https://scaife.perseus.org/reader/urn:cts:greekLit:tlg2022.tlg003.opp-grc1:1.1-1.30/

It is also available as TEI formatted xml file on GitHub (along with many other texts!) in the repository of the ***OpenGreekAndLatin*** project: https://github.com/OpenGreekAndLatin/First1KGreek/blob/master/data/tlg2022/tlg003/tlg2022.tlg003.opp-grc1.xml

We will analyze this formatted text with our software. As you have learned in the seminar today, a cento is a patchwork poem made of many parts, lines or single words from other (source) texts. In the case of Christus Patiens, the source texts are largely taken from tragedies by Euripides, such as Medea, Bacchae and others.

We will use the alignment algorithm discussed during today's seminar, Smith-Waterman, to find for each line of the Christus Patiens the corresponding (matching) text in the source tragedies. For time reasons and because of limited availability of the source text as formatted xml files, we will restrict our analysis to the first 70 lines of the Christus Patiens (after the preface) and match them with the first 48 lines of Euripides' *Medea*.

As with most algorithms, the Smith-Waterman algorithm comes with some parameters which need to be tuned in order to produce meaningful results. In the coming exercises you will play around with those parameters to tweak the algorithm. You will compare your results to the matches found by an expert researcher in this field (Tuillier 1969).
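
To make the role of the three scores concrete, here is a minimal sketch of the scoring part of Smith-Waterman. It is not the implementation in `DigitalCentones/smith_waterman.py`: the function name `smith_waterman_score` is made up for this illustration, it returns only the best score (no traceback), and the two example strings are the first line of the Christus Patiens and the first line of *Medea*, typed here in lower case without diacritics.

```python
def smith_waterman_score(a, b, match_score=2, mismatch_score=-1, gap_score=-1):
    """Return the best local alignment score between two sequences.

    Illustrative sketch only (hypothetical helper, not the repository code).
    `a` and `b` can be strings (character-based) or lists (word-based).
    """
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1] and b[j-1]
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            same = a[i - 1] == b[j - 1]
            diagonal = H[i - 1][j - 1] + (match_score if same else mismatch_score)
            gap_in_b = H[i - 1][j] + gap_score   # element of a aligned to a gap
            gap_in_a = H[i][j - 1] + gap_score   # element of b aligned to a gap
            # The 0 lets an alignment start anywhere: this makes it "local"
            H[i][j] = max(0, diagonal, gap_in_b, gap_in_a)
            best = max(best, H[i][j])
    return best


target = 'ειθ ωφελ εν λειμωνι μηδ ερπειν οφις'
source = 'ειθ ωφελ αργους μη διαπτασθαι σκαφος'
print(smith_waterman_score(target, source))
print(smith_waterman_score(target, source, mismatch_score=-3, gap_score=-3))
```

Raising the penalties for mismatches and gaps favours short, exact matches; lowering them allows longer alignments with more differences.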

If you are looking for a more advanced programming exercise, you can also work on extensions of the alignment program, either during this practicum or afterwards in a student project.


# 1) Get to know the data



## Prepare code
Steps:
- get code and data files from GitHub
- import code and other libs

In [0]:
!([ -d DigitalCentones ]) && rm -r DigitalCentones
!git clone https://github.com/nicoschmidt/DigitalCentones

from DigitalCentones.smith_waterman import SmithWaterman
from DigitalCentones.alignment import *
from IPython.display import HTML
import numpy as np

## Load Christus Patiens and Medea text files
Steps:
- copy xml file of Christus Patiens from GitHub repository ***First1KGreek*** by the ***OpenGreekAndLatin*** project
- parse the xml file
- remove preface lines
- load first 48 lines from Medea as plain text (no xml available yet :( )
---


### Exercise 1.1: Get familiar with the text
Have a look at the [xml file](https://github.com/OpenGreekAndLatin/First1KGreek/blob/master/data/tlg2022/tlg003/tlg2022.tlg003.opp-grc1.xml).
It contains all lines of the poem with lots of meta information, in the header as well as along each page:
- editorial information
- information about online edition
- line numbers
- footnotes

We parse the lines like this:
```
<l n="1">Εἴθ᾿ ὤφελ’ ἐν λειμῶνι μηδ’ ἕρπειν ὄφις,</l>

```
Have a look at the function [load_xml(fname) on line 167 of alignment.py](https://github.com/nicoschmidt/DigitalCentones/blob/3c1df2ce56f5d4776be395dd62773e70b715cd88/alignment.py#L167).
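
If you want to experiment with the parsing yourself, the following sketch shows one possible way to pull verse lines out of a TEI document with Python's standard library. It is not the `load_xml` function from the repository. The function name `read_verse_lines` is made up, and the sketch assumes that the file uses the standard TEI namespace `http://www.tei-c.org/ns/1.0`; the sample document contains only the one line quoted above.

```python
import xml.etree.ElementTree as ET

# Assumption: the document declares the standard TEI namespace
TEI_NS = {'tei': 'http://www.tei-c.org/ns/1.0'}

sample = '''<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<l n="1">Εἴθ᾿ ὤφελ’ ἐν λειμῶνι μηδ’ ἕρπειν ὄφις,</l>
</body></text></TEI>'''


def read_verse_lines(xml_string):
    """Return a list of (line_number, text) pairs for all <l> elements."""
    root = ET.fromstring(xml_string)
    lines = []
    for element in root.iterfind('.//tei:l', TEI_NS):
        # itertext() also collects text inside nested elements, if any
        text = ''.join(element.itertext()).strip()
        lines.append((element.get('n'), text))
    return lines


for number, text in read_verse_lines(sample):
    print(number, text)
```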

---


### Exercise 1.2: Get familiar with the data structures
The lines are loaded in a list-like structure.

While parsing, some pre-processing is done that will help with finding alignments:
- punctuation removed
- lower case
- diacritics removed
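
The three pre-processing steps can be sketched with the standard library module `unicodedata`. This is only an illustration under our own assumptions, not the code used in `alignment.py`: the helper names are made up, and treating the Unicode category `Sk` (which contains the Greek psili used as an elision mark) like punctuation is a choice made for this sketch.

```python
import unicodedata


def remove_punctuation(text):
    # Drop punctuation (categories starting with 'P') and modifier
    # symbols ('Sk'), e.g. commas, apostrophes and the Greek psili
    return ''.join(c for c in text
                   if not unicodedata.category(c).startswith('P')
                   and unicodedata.category(c) != 'Sk')


def strip_diacritics(text):
    # NFD splits a letter from its accents and breathings,
    # then we drop the combining marks and keep the base letters
    decomposed = unicodedata.normalize('NFD', text)
    return ''.join(c for c in decomposed if not unicodedata.combining(c))


def preprocess(raw):
    """Return (lower-case text, lower-case text without diacritics)."""
    lowered = remove_punctuation(raw).lower()
    return lowered, strip_diacritics(lowered)


with_diacritics, without_diacritics = preprocess('Εἴθ᾿ ὤφελ’ ἐν λειμῶνι μηδ’ ἕρπειν ὄφις,')
print(with_diacritics)
print(without_diacritics)
```

The two return values correspond roughly to what the notebook accesses as `line.text` and `line.text_no_diacritics`.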


In [0]:
# get Christus Patiens xml file (if not present)
!([ ! -f tlg2022.tlg003.opp-grc1.xml ]) && wget https://raw.githubusercontent.com/OpenGreekAndLatin/First1KGreek/master/data/tlg2022/tlg003/tlg2022.tlg003.opp-grc1.xml
lines_chr_pat = load_xml('tlg2022.tlg003.opp-grc1.xml')
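# the first 30 entries are preface lines: renumber so that entry 30 becomes line 1
for i in range(30, len(lines_chr_pat)):
    lines_chr_pat[i].no = i - 29
# drop the preface and keep entries 30..99 (renumbered 1..70)
lines_chr_pat = lines_chr_pat[30:100]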
print('Christus Patiens:', lines_chr_pat[:3],'\n\n')

lines_medea = load_plain_text('DigitalCentones/data/Medea-48.txt', 'Medea', 'Medea', 'Euripides')
print('Medea:', lines_medea[:3], '\n\n')




---

**Print all characters with diacritics**



In [0]:
import unicodedata  # explicit import, needed for the character names below

# print all characters in the corpus:
alphabet = np.unique([c for lines in [lines_chr_pat, lines_medea] for line in lines for c in line.text])
print('ID, Character, Name')
for a in alphabet:
    print('{} "{}" {}'.format(hex(ord(a)), a, unicodedata.name(a, 'not defined')))

**Print all characters without diacritics**

In [0]:
alphabet_no_diacritics = np.unique([c for lines in [lines_chr_pat, lines_medea] for line in lines for c in line.text_no_diacritics])
print('ID, Character, Name')
for a in alphabet_no_diacritics:
    print('{} "{}" {}'.format(hex(ord(a)), a, unicodedata.name(a, 'not defined')))


## Load all Matches found by Tuillier 1969
As a baseline, we will later compare our alignment results to the matches found by an expert researcher in the field, Andre Tuillier, who listed all the corresponding lines he found relevant in his 1969 edition of the Christus Patiens. Thanks to Lena, who analyzed these correspondences in her PhD thesis, we have them as a [machine-readable text file](https://github.com/nicoschmidt/DigitalCentones/blob/master/data/ChristusPatiens_Tulliers_matches.csv) and can load them here.

In [0]:
# Input file is a text file with one header row and remaining rows structured as
# '<target_line_number>,<source_work_number>,<source_work_line_number>...\n'
line_to_line_map = np.loadtxt('DigitalCentones/data/ChristusPatiens_Tulliers_matches.csv', int, delimiter=',', skiprows=1, usecols=[0,1,2])

# These are all the source works in which Tuillier found matches
source_work_names = ['Agamemnon',
                     'Prometheus',
                     'Alkestis',
                     'Andromache',
                     'Bakchen',
                     'Hekabe',
                     'Helena',
                     'Hippolytos',
                     'Iphigenie in Aulis',
                     'Iphigenie in Tauris',
                     'Medea',
                     'Orestes',
                     'Phönikerinnen',
                     'Rhesos',
                     'Troerinnen',
                     'Ilias',
                     'Alexandra']

# store the map as a dictionary with target_line_number as keys and
# tuples (<source_work_name>, <source_work_line_number>) as values
line_to_line_map = {target_line_no: (source_work_names[source_work_id - 1], source_line_id)
                    for target_line_no, source_work_id, source_line_id in line_to_line_map}
for k in list(line_to_line_map)[:50]:
    print('{}: {}'.format(k, line_to_line_map[k]))
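
A side note on the data structure: a plain dictionary holds exactly one value per key, so if several rows share the same target line number, only the last one is kept. Whether the CSV file contains such rows is something you can check yourself. The sketch below uses made-up rows (not taken from the file) and a hypothetical `names` lookup to show how all matches per target line could be collected instead.

```python
from collections import defaultdict

# Hypothetical rows: (target_line_number, source_work_number, source_line_number)
rows = [(1, 11, 1), (1, 8, 3), (2, 11, 2)]
# Work numbers follow the 1-based order of source_work_names above
names = {8: 'Hippolytos', 11: 'Medea'}

# A plain dict keeps only the last row for each target line
single = {t: (names[w], s) for t, w, s in rows}

# A defaultdict of lists keeps every row
multi = defaultdict(list)
for t, w, s in rows:
    multi[t].append((names[w], s))

print(single[1])
print(multi[1])
```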

# 2) Compare alignment schemes

## The expert's results
First we go through the lines listed by Tuillier to find exact line-to-line alignments and visualize them.

With our automatic alignment search later on, we seek to get as close as possible to these matches **without knowing anything about ancient Greek tragedies!** 
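
One way to make "as close as possible" measurable is to count how many automatically found matches agree with the expert. The following sketch is only a suggestion: the function name `agreement_with_expert` and the simplified data shapes are assumptions made for this example, and the numbers are toy values, not real results.

```python
def agreement_with_expert(found, expert, work='Medea'):
    """Compare found matches with the expert's matches for one source work.

    found:  {target_line_no: source_line_no}          (hypothetical shape)
    expert: {target_line_no: (work_name, source_line_no)}
    Returns (precision, recall).
    """
    # Only expert matches pointing into the work we searched can be found
    relevant = {t: s for t, (w, s) in expert.items() if w == work}
    hits = sum(1 for t, s in found.items() if relevant.get(t) == s)
    precision = hits / len(found) if found else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Toy data for illustration only
expert = {1: ('Medea', 1), 2: ('Hippolytos', 5), 3: ('Medea', 3)}
found = {1: 1, 3: 4, 4: 2}
precision, recall = agreement_with_expert(found, expert)
print('precision: {:.2f}, recall: {:.2f}'.format(precision, recall))
```

Precision tells you how many of the found alignments the expert also lists; recall tells you how many of the expert's matches were found.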

In [0]:
# find matches by searching in given line-to-line map
smith_waterman = SmithWaterman(match_score=2,
                               mismatch_score=-1,
                               gap_score=-1,
                               n_max_alignments=1,
                               min_score_treshold=0)
alignments_outfile = 'alignments_word-based_Toullier'
target_lines = lines_chr_pat
source_lines_dict = {lines_medea[0].work_id:lines_medea}
alignments = find_line_to_line_alignments(target_lines, source_lines_dict, smith_waterman.align, 3, lambda line:line.text_no_diacritics.split(' '), line_to_line_map)
save_alignments_html(alignments, target_lines, source_lines_dict, alignments_outfile, lambda line:line.text_raw.split(' '), 4)
HTML(filename='./'+ alignments_outfile + '.html')

## Exercise 2.1: The Smith-Waterman Algorithm
Check out the code at https://github.com/nicoschmidt/DigitalCentones/blob/master/smith_waterman.py and try to understand how the alignment function works.



## Exercise 2.2: Find character-based line-to-line alignments with diacritics
Now we use the Smith-Waterman algorithm to find all alignments automatically.
Run the following code and inspect the output.

Where does it differ from the expert's alignments? Where does it find good or bad alignments?

Tweak the scores for match, mismatch and gaps and try to find a good setting.

In [0]:
smith_waterman = SmithWaterman(match_score=2,
                               mismatch_score=-1,
                               gap_score=-1,
                               n_max_alignments=1,
                               min_score_treshold=0)
line_form_func = lambda line:line.text # we use the lower-case text with diacritics
alignments_outfile = 'alignments_character-based_with_diacritics'
target_lines = lines_chr_pat
source_lines_dict = {lines_medea[0].work_id:lines_medea}
alignments = find_line_to_line_alignments(target_lines, source_lines_dict, smith_waterman.align, 3, line_form_func)
save_alignments_html(alignments, target_lines, source_lines_dict, alignments_outfile, line_form_func)
HTML(filename='./'+ alignments_outfile + '.html')

## Exercise 2.3: Find character-based line-to-line alignments without diacritics
Now we change the input strings to be text without diacritics (have a look at the line_form_func and Exercise 1.2 as well as [the data structure](https://github.com/nicoschmidt/DigitalCentones/blob/e51e1100201cc08971d1c742adcc44275c5359e1/alignment.py#L75)).

Run the following code and inspect the output.

Where does it differ? Where does it find good/bad alignments?

Tweak the scores for match, mismatch and gaps and try to find a good setting.

In [0]:
smith_waterman = SmithWaterman(match_score=2,
                               mismatch_score=-1,
                               gap_score=-1,
                               n_max_alignments=1,
                               min_score_treshold=0)
line_form_func = lambda line:line.text_no_diacritics
alignments_outfile = 'alignments_character-based_without_diacritics'
target_lines = lines_chr_pat
source_lines_dict = {lines_medea[0].work_id:lines_medea}
alignments = find_line_to_line_alignments(target_lines, source_lines_dict, smith_waterman.align, 3, line_form_func)
save_alignments_html(alignments, target_lines, source_lines_dict, alignments_outfile, line_form_func, 4)
HTML(filename='./'+ alignments_outfile + '.html')

## Exercise 2.4: Find word-based line-to-line alignments without diacritics
Instead of treating the lines as sequences of characters, we can also treat them as sequences of words and do the matching at the word level.
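
The difference matters for processing speed because Smith-Waterman fills a table with one cell for every pair of elements from the two sequences. The following sketch compares the table sizes for one pair of lines. The helper `table_cells` is made up for this illustration, and the second string is the first line of *Medea*, typed here without diacritics.

```python
line_a = 'ειθ ωφελ εν λειμωνι μηδ ερπειν οφις'
line_b = 'ειθ ωφελ αργους μη διαπτασθαι σκαφος'


def table_cells(seq_a, seq_b):
    # The scoring table has one extra row and one extra column of zeros
    return (len(seq_a) + 1) * (len(seq_b) + 1)


# Character-based: every character (including spaces) is one element
char_cells = table_cells(line_a, line_b)
# Word-based: the same split as in line_form_func below
word_cells = table_cells(line_a.split(' '), line_b.split(' '))

print('character-based table cells:', char_cells)
print('word-based table cells:', word_cells)
```

The price for the smaller table is that two words either match completely or not at all, so small spelling differences are no longer tolerated.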

Run the following code and inspect the output.

Where does it differ? Where does it find good/bad alignments? What about processing speed?

Tweak the scores for match, mismatch and gaps and try to find a good setting.

In [0]:
smith_waterman = SmithWaterman(match_score=2,
                               mismatch_score=-1,
                               gap_score=-1,
                               n_max_alignments=1,
                               min_score_treshold=0)
line_form_func = lambda line:line.text_no_diacritics.split(' ')
alignments_outfile = 'alignments_word-based_without_diacritics'
target_lines = lines_chr_pat
source_lines_dict = {lines_medea[0].work_id:lines_medea}
alignments = find_line_to_line_alignments(target_lines, source_lines_dict, smith_waterman.align, 3, line_form_func)
save_alignments_html(alignments, target_lines, source_lines_dict, alignments_outfile, lambda line:line.text_raw.split(' '), 4)
HTML(filename='./'+ alignments_outfile + '.html')

# 3) Advanced exercises and extensions

Now that you have the basic ideas and a feeling for how the algorithm behaves and how its parameters affect the output, here are some suggestions for how to improve the software further.




## Exercise 3.1: Incorporating morphological information
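Because ancient Greek is highly inflected, the same word can appear in different forms in the cento and in its source, so matching on lemmas can find correspondences that exact word matching misses.
Have a look at the formatting of the morphology database files: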
```
<d>
  <token>
    <word-form-id>1</word-form-id>
    <word-form>ἀάατον</word-form>
    <word-form-without-diacritics>ααατον</word-form-without-diacritics>
    <lemma>ἀάατος</lemma>
    <lemma-without-diacritics>ααατος</lemma-without-diacritics>
    <morphological-analysis>a-s---fa-</morphological-analysis>
    <lemma-id>1</lemma-id>
    <translation>not to be injured, inviolable</translation>
    <dialect/>
  </token>
<t>
  <i>1</i>
  <f>ἀάατον</f>
  <b>ααατον</b>
  <l>ἀάατος</l>
  <e>ααατος</e>
  <p>a-s---fa-</p>
  <d>1</d>
  <s>not to be injured, inviolable</s>
  <a/>
</t>
```



Steps:
- Load the morphological database from GitHub
- Map all words to their lemmatized form
- Find alignments using the lemmas
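
The mapping step can be pictured as a simple dictionary lookup. The sketch below is not the repository's `load_morphology` or `add_lemma_ids`. The first dictionary entry is taken from the example record above, the second entry (another form of the same adjective) is added by us for illustration, and the function name `lemmatize` is made up. It also ignores the fact that one word form can belong to several lemmas.

```python
# Word form (without diacritics) -> lemma (without diacritics)
form_to_lemma = {
    'ααατον': 'ααατος',  # from the example record above
    'ααατου': 'ααατος',  # assumed second form, for illustration only
}


def lemmatize(words, form_to_lemma):
    # Unknown forms fall back to themselves so they can still match exactly
    return [form_to_lemma.get(word, word) for word in words]


print(lemmatize(['ααατον', 'και'], form_to_lemma))
print(lemmatize(['ααατου', 'και'], form_to_lemma))
```

After this step, two lines that use different forms of the same word produce identical elements, which the word-based alignment can then match.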


In [0]:
# download files if not present
!([ ! -f MorpheusGreek1-319492.xml ]) && wget https://raw.githubusercontent.com/gcelano/MorpheusGreekUnicode/master/MorpheusGreek1-319492.xml
!([ ! -f MorpheusGreek319493-638984.xml ]) && wget https://raw.githubusercontent.com/gcelano/MorpheusGreekUnicode/master/MorpheusGreek319493-638984.xml
!([ ! -f MorpheusGreek638985-958476.xml ]) && wget https://raw.githubusercontent.com/gcelano/MorpheusGreekUnicode/master/MorpheusGreek638985-958476.xml
morph_xml_fnames = ['MorpheusGreek1-319492.xml',
                    'MorpheusGreek319493-638984.xml',
                    'MorpheusGreek638985-958476.xml']

# load data base
morph_dict, max_id = load_morphology(morph_xml_fnames, False)
print('{} lemmas loaded'.format(max_id))

# add all words of Christus Patiens and Medea that were not in the data base
m = max_id
max_id = add_lemma_ids(lines_chr_pat, morph_dict, max_id, False)
print('{} lemmas added by Christus Patiens'.format(max_id - m))
m = max_id
max_id = add_lemma_ids(lines_medea, morph_dict, max_id, False)
print('{} lemmas added by Medea'.format(max_id - m))
print('{} total lemmas'.format(max_id))



## Exercise 3.2: Word count statistics
Modify the alignment scoring by introducing a weighted match score scheme, where frequent (but potentially uninteresting) words get lower match scores and rare (but potentially interesting) words get higher match scores.
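
As a starting point, here is one possible weighting scheme based on inverse word frequency. Everything in it is an assumption made for this sketch: the function names, the formula `log(total / count) + 1`, and the three toy lines, which are not taken from the corpus. How to plug such weights into the `SmithWaterman` class is left to you.

```python
import math
from collections import Counter


def word_weights(lines):
    """Map each word to a weight that shrinks as the word gets more frequent."""
    counts = Counter(word for line in lines for word in line.split(' '))
    total = sum(counts.values())
    # log(total / count) is 0 for a word that makes up the whole corpus;
    # adding 1 keeps every weight positive
    return {word: math.log(total / count) + 1.0 for word, count in counts.items()}


def weighted_match(a, b, weights, mismatch_score=-1):
    # Equal words score their weight, unknown words get a neutral weight of 1
    return weights.get(a, 1.0) if a == b else mismatch_score


toy_lines = ['και ειθ ωφελ', 'και μη', 'και εν']  # toy data, not from the corpus
weights = word_weights(toy_lines)
print(weighted_match('και', 'και', weights))   # frequent word, lower score
print(weighted_match('ειθ', 'ειθ', weights))   # rare word, higher score
```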


## Exercise 3.3: Alternative alignment algorithms
Create an alternative alignment function, e.g. one that just counts the words two lines have in common, ignoring word order. Does this improve the results? In which situations?
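
A minimal sketch of such an order-free comparison could look like this. The function names are made up, and the two source lines are the first two lines of *Medea*, typed here without diacritics for illustration. Note that this function only returns a score, so it would still need to be adapted to whatever `find_line_to_line_alignments` expects from an alignment function.

```python
from collections import Counter


def bag_of_words_score(words_a, words_b):
    # Multiset intersection: a repeated word counts as often as it occurs in both
    common = Counter(words_a) & Counter(words_b)
    return sum(common.values())


def best_source_line(target_words, source_lines):
    """Return (index, score) of the source line sharing the most words."""
    scores = [bag_of_words_score(target_words, words) for words in source_lines]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


target = 'ειθ ωφελ εν λειμωνι μηδ ερπειν οφις'.split(' ')
sources = ['ειθ ωφελ αργους μη διαπτασθαι σκαφος'.split(' '),
           'κολχων ες αιαν κυανεας συμπληγαδας'.split(' ')]
print(best_source_line(target, sources))
```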