<a href="https://colab.research.google.com/github/nicoschmidt/DigitalCentones/blob/master/SmartTexts2019/Practicum.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Smart Texts 2019 Practicum May 27th:
# Christus Patiens and the opportunities for Digital Reading

Welcome to the Smart Texts practicum! Today we will investigate the possibilities of computer-aided intertextual reading.
In this session you will learn some techniques for working with a digital version of the Christus Patiens, a Greek cento from late antiquity by Gregorius Nazianzenus. You can read a digital version, based on the 1885 edition by Johann Georg Brambs, in the Scaife Viewer: https://scaife.perseus.org/reader/urn:cts:greekLit:tlg2022.tlg003.opp-grc1:1.1-1.30/

It is also available as a TEI-formatted XML file on GitHub (along with many other texts!) in the repository of the ***OpenGreekAndLatin*** project: https://github.com/OpenGreekAndLatin/First1KGreek/blob/master/data/tlg2022/tlg003/tlg2022.tlg003.opp-grc1.xml

We will use this formatted text as input for our software. As you have learned in the seminar today, a cento is a patchwork poem made up of parts, lines or single words taken from other (source) texts. In the case of Christus Patiens, the source texts are largely tragedies by Euripides, such as Medea and Bacchae.

We will use the Smith-Waterman alignment algorithm discussed in today's seminar to find, for each line of the Christus Patiens, the matching text in the source tragedies. Because time is short and few source texts are available as formatted XML files, we restrict our analysis to the first 100 lines of the Christus Patiens file (30 lines of preface, which we drop, followed by the first 70 lines of the poem) and match them against the first 48 lines of Euripides' *Medea*.

As with most algorithms, the Smith-Waterman algorithm comes with some parameters that need to be tuned in order to produce meaningful results. In the coming exercises you will experiment with those parameters to tweak the algorithm. You will compare your results to the matches found by an expert researcher in this field (Tuillier 1969).
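
To make the role of these parameters concrete, here is a minimal, self-contained sketch of the Smith-Waterman scoring recurrence. It is not the implementation in `DigitalCentones/smith_waterman.py`: the function name `smith_waterman_score` is made up for this illustration, and it returns only the best local score, not the aligned segments. The default values mirror the match, mismatch and gap scores used in the cells of this notebook.

```python
def smith_waterman_score(a, b, match_score=2, mismatch_score=-1, gap_score=-1):
    """Return the best local alignment score between the sequences a and b."""
    # H[i][j] holds the best score of a local alignment ending at a[i-1] and b[j-1]
    H = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            step = match_score if a[i - 1] == b[j - 1] else mismatch_score
            H[i][j] = max(0,                        # start a new local alignment
                          H[i - 1][j - 1] + step,   # match or mismatch
                          H[i - 1][j] + gap_score,  # gap in b
                          H[i][j - 1] + gap_score)  # gap in a
            best = max(best, H[i][j])
    return best

# two line beginnings from this notebook (diacritics already removed)
print(smith_waterman_score('μηδ εν ναπαισι τουδ', 'μηδ εν ναπαισι πηλιου'))
```

Because the function only compares elements with `==`, it works on strings (character by character) as well as on lists of words.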

If you are looking for a more advanced programming exercise, you can work on extensions of the alignment program, either during this practicum or afterwards in a student project.


# 1) Get to know the data



## Prepare code
Steps:
- get code and data files from GitHub
- import code and other libs

In [1]:
!([ -d DigitalCentones ]) && rm -r DigitalCentones
!git clone https://github.com/nicoschmidt/DigitalCentones

from DigitalCentones.smith_waterman import SmithWaterman
from DigitalCentones.alignment import *
from IPython.display import HTML
import numpy as np
import unicodedata  # used below to print the Unicode names of characters

Cloning into 'DigitalCentones'...

## Load Christus Patiens and Medea text files
Steps:
- copy xml file of Christus Patiens from GitHub repository ***First1KGreek*** by the ***OpenGreekAndLatin*** project
- parse the xml file
- remove preface lines
- load first 48 lines from Medea as plain text (no xml available yet :( )
---


### Exercise 1.1: Get familiar with the text
Have a look at the [xml file](https://github.com/OpenGreekAndLatin/First1KGreek/blob/master/data/tlg2022/tlg003/tlg2022.tlg003.opp-grc1.xml).
It contains all lines of the poem with lots of meta information, in the header as well as along each page:
- editorial information
- information about online edition
- line numbers
- footnotes

We parse the lines like this:
```
<l n="1">Εἴθ᾿ ὤφελ’ ἐν λειμῶνι μηδ’ ἕρπειν ὄφις,</l>

```
Have a look at the function [load_xml(fname) on line 167 of alignment.py](https://github.com/nicoschmidt/DigitalCentones/blob/3c1df2ce56f5d4776be395dd62773e70b715cd88/alignment.py#L167).
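
As a rough illustration of what parsing such lines involves (a sketch, not the code of `load_xml`), the snippet below extracts the line number and text of every `<l>` element with Python's standard `xml.etree.ElementTree`. It assumes the elements live in the TEI namespace `http://www.tei-c.org/ns/1.0`, as is usual for TEI P5 files. The tiny `sample` document and the helper name `read_verse_lines` are invented for this example.

```python
import xml.etree.ElementTree as ET

# assumption: the file declares the standard TEI P5 namespace
TEI_NS = {'tei': 'http://www.tei-c.org/ns/1.0'}

# made-up miniature document containing two verse lines from the poem
sample = '''<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<l n="1">Εἴθ᾿ ὤφελ’ ἐν λειμῶνι μηδ’ ἕρπειν ὄφις,</l>
<l n="2">μηδ’ ἐν νάπαισι τοῦδ’ ὑφεδρεύειν δράκων</l>
</body></text></TEI>'''

def read_verse_lines(xml_text):
    """Return a list of (line number, raw text) pairs for all <l> elements."""
    root = ET.fromstring(xml_text)
    return [(l.get('n'), ''.join(l.itertext()).strip())
            for l in root.iterfind('.//tei:l', TEI_NS)]

verses = read_verse_lines(sample)
for number, text in verses:
    print(number, text)
```

Using `itertext()` rather than `.text` keeps the words of a line together even when the line contains nested elements such as notes.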

---


### Exercise 1.2: Get familiar with the data structures
The lines are loaded in a list-like structure.

While parsing, some pre-processing is done that will help with finding alignments:
- punctuation removed
- lower case
- diacritics removed
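
The three pre-processing steps can be sketched with the standard library alone. This is an illustration under assumptions, not the code used in `alignment.py`: the helper name `normalize_line` is hypothetical, and it applies all three steps at once, so its result corresponds to the `text_no_diacritics` field shown in the output below.

```python
import unicodedata

def normalize_line(raw):
    """Lower-case a verse, strip its diacritics and drop punctuation."""
    # NFD splits each accented letter into a base letter plus combining marks
    decomposed = unicodedata.normalize('NFD', raw.lower())
    # combining accents and breathings have the Unicode category 'Mn'
    no_marks = ''.join(c for c in decomposed if unicodedata.category(c) != 'Mn')
    # keep letters and whitespace only; this drops commas, apostrophes, etc.
    letters = ''.join(c for c in no_marks if c.isalpha() or c.isspace())
    # collapse any runs of whitespace into single spaces
    return ' '.join(letters.split())

print(normalize_line('Εἴθ᾿ ὤφελ’ ἐν λειμῶνι μηδ’ ἕρπειν ὄφις,'))
```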


In [2]:
# get Christus Patiens xml file (if not present)
!([ ! -f tlg2022.tlg003.opp-grc1.xml ]) && wget https://raw.githubusercontent.com/OpenGreekAndLatin/First1KGreek/master/data/tlg2022/tlg003/tlg2022.tlg003.opp-grc1.xml
lines_chr_pat = load_xml('tlg2022.tlg003.opp-grc1.xml')
for i in range(30, len(lines_chr_pat)):
    lines_chr_pat[i].no = -29+i # hack to re-number lines
lines_chr_pat = lines_chr_pat[30:100]
for i in range(len(lines_chr_pat)): lines_chr_pat[i].idx = i
print('Christus Patiens:', lines_chr_pat[:3],'\n\n')

lines_medea = load_plain_text('DigitalCentones/data/Medea-48.txt', 'Medea', 'Medea', 'Euripides')
print('Medea:', lines_medea[:3], '\n\n')

Christus Patiens: [work_id:            urn:cts:greekLit:tlg2022.tlg003.opp-grc1
idx:                0
no:                 1
speaker:            ΘΕΟΤΟΚΟΣ.
text_raw:           Εἴθ᾿ ὤφελ’ ἐν λειμῶνι μηδ’ ἕρπειν ὄφις,
text:               εἴθ ὤφελ ἐν λειμῶνι μηδ ἕρπειν ὄφις
text_no_diacritics: ειθ ωφελ εν λειμωνι μηδ ερπειν οφις
text_lemma:         
lemma_ids:          [], work_id:            urn:cts:greekLit:tlg2022.tlg003.opp-grc1
idx:                1
no:                 2
speaker:            ΘΕΟΤΟΚΟΣ.
text_raw:           μηδ’ ἐν νάπαισι τοῦδ’ ὑφεδρεύειν δράκων
text:               μηδ ἐν νάπαισι τοῦδ ὑφεδρεύειν δράκων
text_no_diacritics: μηδ εν ναπαισι τουδ υφεδρευειν δρακων
text_lemma:         
lemma_ids:          [], work_id:            urn:cts:greekLit:tlg2022.tlg003.opp-grc1
idx:                2
no:                 3
speaker:            ΘΕΟΤΟΚΟΣ.
text_raw:           ἀγκυλομήτης· οὐ γὰρ ἂν πλευρᾶς φῦμα,
text:               ἀγκυλομήτης οὐ γὰρ ἂν πλευρᾶς φῦμα
text_no_diacritics: αγκυλο




---

**Print all characters with diacritics**



In [3]:
# print all characters in the corpus:
alphabet = np.unique([c for lines in [lines_chr_pat, lines_medea] for line in lines for c in line.text])
print('ID, Character, Name')
for a in alphabet:
    print('{} "{}" {}'.format(hex(ord(a)), a, unicodedata.name(a, 'not defined')))

ID, Character, Name
0x20 " " SPACE
0x3ac "ά" GREEK SMALL LETTER ALPHA WITH TONOS
0x3ad "έ" GREEK SMALL LETTER EPSILON WITH TONOS
0x3ae "ή" GREEK SMALL LETTER ETA WITH TONOS
0x3af "ί" GREEK SMALL LETTER IOTA WITH TONOS
0x3b1 "α" GREEK SMALL LETTER ALPHA
0x3b2 "β" GREEK SMALL LETTER BETA
0x3b3 "γ" GREEK SMALL LETTER GAMMA
0x3b4 "δ" GREEK SMALL LETTER DELTA
0x3b5 "ε" GREEK SMALL LETTER EPSILON
0x3b6 "ζ" GREEK SMALL LETTER ZETA
0x3b7 "η" GREEK SMALL LETTER ETA
0x3b8 "θ" GREEK SMALL LETTER THETA
0x3b9 "ι" GREEK SMALL LETTER IOTA
0x3ba "κ" GREEK SMALL LETTER KAPPA
0x3bb "λ" GREEK SMALL LETTER LAMDA
0x3bc "μ" GREEK SMALL LETTER MU
0x3bd "ν" GREEK SMALL LETTER NU
0x3be "ξ" GREEK SMALL LETTER XI
0x3bf "ο" GREEK SMALL LETTER OMICRON
0x3c0 "π" GREEK SMALL LETTER PI
0x3c1 "ρ" GREEK SMALL LETTER RHO
0x3c2 "ς" GREEK SMALL LETTER FINAL SIGMA
0x3c3 "σ" GREEK SMALL LETTER SIGMA
0x3c4 "τ" GREEK SMALL LETTER TAU
0x3c5 "υ" GREEK SMALL LETTER UPSILON
0x3c6 "φ" GREEK SMALL LETTER PHI
0x3c7 "χ" GREEK SMALL L

**Print all characters without diacritics**

In [4]:
alphabet_no_diacritics = np.unique([c for lines in [lines_chr_pat, lines_medea] for line in lines for c in line.text_no_diacritics])
print('ID, Character, Name')
for a in alphabet_no_diacritics:
    print('{} "{}" {}'.format(hex(ord(a)), a, unicodedata.name(a, 'not defined')))


ID, Character, Name
0x20 " " SPACE
0x3b1 "α" GREEK SMALL LETTER ALPHA
0x3b2 "β" GREEK SMALL LETTER BETA
0x3b3 "γ" GREEK SMALL LETTER GAMMA
0x3b4 "δ" GREEK SMALL LETTER DELTA
0x3b5 "ε" GREEK SMALL LETTER EPSILON
0x3b6 "ζ" GREEK SMALL LETTER ZETA
0x3b7 "η" GREEK SMALL LETTER ETA
0x3b8 "θ" GREEK SMALL LETTER THETA
0x3b9 "ι" GREEK SMALL LETTER IOTA
0x3ba "κ" GREEK SMALL LETTER KAPPA
0x3bb "λ" GREEK SMALL LETTER LAMDA
0x3bc "μ" GREEK SMALL LETTER MU
0x3bd "ν" GREEK SMALL LETTER NU
0x3be "ξ" GREEK SMALL LETTER XI
0x3bf "ο" GREEK SMALL LETTER OMICRON
0x3c0 "π" GREEK SMALL LETTER PI
0x3c1 "ρ" GREEK SMALL LETTER RHO
0x3c2 "ς" GREEK SMALL LETTER FINAL SIGMA
0x3c3 "σ" GREEK SMALL LETTER SIGMA
0x3c4 "τ" GREEK SMALL LETTER TAU
0x3c5 "υ" GREEK SMALL LETTER UPSILON
0x3c6 "φ" GREEK SMALL LETTER PHI
0x3c7 "χ" GREEK SMALL LETTER CHI
0x3c8 "ψ" GREEK SMALL LETTER PSI
0x3c9 "ω" GREEK SMALL LETTER OMEGA


## Load all Matches found by Tuillier 1969
As a baseline, we will later compare our alignment results to the matches found by an expert researcher in the field, Andre Tuillier, who listed all the corresponding lines he considered relevant in his 1969 edition of the Christus Patiens. Thanks to Lena, who analyzed these correspondences for her PhD thesis, we have them as a [machine-readable text file](https://github.com/nicoschmidt/DigitalCentones/blob/master/data/ChristusPatiens_Tulliers_matches.csv) and can load them here.

In [5]:
# Input file is a text file with one header row and remaining rows structured as
# '<target_line_number>,<source_work_number>,<source_work_line_number>...\n'
line_to_line_map = np.loadtxt('DigitalCentones/data/ChristusPatiens_Tulliers_matches.csv', int, delimiter=',', skiprows=1, usecols=[0,1,2])

# These are all the source works in which Tuillier found matches
source_work_names = ['Agamemnon',
                     'Prometheus',
                     'Alkestis',
                     'Andromache',
                     'Bakchen',
                     'Hekabe',
                     'Helena',
                     'Hippolytos',
                     'Iphigenie in Aulis',
                     'Iphigenie in Tauris',
                     'Medea',
                     'Orestes',
                     'Phönikerinnen',
                     'Rhesos',
                     'Troerinnen',
                     'Ilias',
                     'Alexandra']

# store the map as a dictionary with target_line_number as keys and tuples (<source_work_name>, <source_line_number>) as values
# (if a target line is listed more than once, only its last entry is kept)
line_to_line_map = {target_line_no: (source_work_names[source_work_id-1], source_line_id) for target_line_no, source_work_id, source_line_id in line_to_line_map}
for k in list(line_to_line_map)[:50]:
    print('{}: {}'.format(k, line_to_line_map[k]))

1: ('Medea', 1)
2: ('Medea', 3)
3: ('Medea', 6)
4: ('Medea', 20)
5: ('Hekabe', 1123)
6: ('Medea', 8)
8: ('Medea', 9)
14: ('Medea', 10)
15: ('Medea', 11)
17: ('Prometheus', 1027)
25: ('Alexandra', 151)
32: ('Medea', 14)
33: ('Medea', 15)
34: ('Medea', 13)
37: ('Medea', 16)
38: ('Medea', 17)
39: ('Agamemnon', 764)
40: ('Troerinnen', 605)
41: ('Troerinnen', 620)
42: ('Troerinnen', 621)
43: ('Medea', 20)
45: ('Prometheus', 1027)
46: ('Medea', 25)
47: ('Medea', 26)
50: ('Hippolytos', 450)
51: ('Medea', 21)
52: ('Medea', 22)
53: ('Medea', 34)
54: ('Medea', 35)
55: ('Medea', 36)
56: ('Medea', 56)
57: ('Medea', 57)
58: ('Medea', 58)
59: ('Medea', 59)
61: ('Hekabe', 736)
65: ('Agamemnon', 611)
66: ('Agamemnon', 612)
71: ('Agamemnon', 587)
72: ('Agamemnon', 588)
73: ('Agamemnon', 589)
75: ('Agamemnon', 593)
76: ('Agamemnon', 591)
77: ('Troerinnen', 747)
78: ('Troerinnen', 748)
79: ('Agamemnon', 594)
80: ('Agamemnon', 595)
81: ('Agamemnon', 596)
82: ('Agamemnon', 597)
88: ('Rhesos', 63)
90: ('Rhe

# 2) Compare alignment schemes

## The expert's results
First we go through the lines listed by Tuillier to find exact line-to-line alignments and visualize them.

With our automatic alignment search later on, we seek to get as close as possible to these matches **without knowing anything about ancient Greek tragedies!** 
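
One way to put a number on "as close as possible" is to count how many automatically found line pairs agree with the expert's list. The sketch below does this on plain dictionaries. The function name `agreement_with_expert` and the shape of `found` (target line number mapped to source line number) are assumptions made for this example and do not reflect the data structures returned by `find_line_to_line_alignments`. The sample values are copied from the outputs shown in this notebook.

```python
def agreement_with_expert(found, expert, work='Medea'):
    """Compare found matches with the expert's matches for a single source work.

    found:  {target_line_no: source_line_no}
    expert: {target_line_no: (source_work_name, source_line_no)}
    Returns (precision, recall).
    """
    # only matches into the given work can be found when we search that work alone
    expected = {t: s for t, (w, s) in expert.items() if w == work}
    hits = sum(1 for t, s in found.items() if expected.get(t) == s)
    precision = hits / len(found) if found else 0.0
    recall = hits / len(expected) if expected else 0.0
    return precision, recall

# first entries of the expert's map
expert = {1: ('Medea', 1), 2: ('Medea', 3), 3: ('Medea', 6),
          4: ('Medea', 20), 5: ('Hekabe', 1123), 6: ('Medea', 8)}
# best source line per target line, as in the character-based run with diacritics
found = {1: 1, 2: 3, 3: 6, 4: 20, 5: 35, 6: 8, 7: 4}

precision, recall = agreement_with_expert(found, expert)
print('precision: {:.2f}, recall: {:.2f}'.format(precision, recall))
```

Precision drops here because the automatic search returns a best candidate for every line, including lines for which the expert lists no Medea source. A minimum score threshold is one way to trade precision against recall.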

In [6]:
# find matches by searching in given line-to-line map
smith_waterman = SmithWaterman(match_score=2,
                               mismatch_score=-1,
                               gap_score=-1,
                               n_max_alignments=1,
                               min_score_treshold=0)
alignments_outfile = 'alignments_word-based_Toullier'
target_lines = lines_chr_pat
source_lines_dict = {lines_medea[0].work_id:lines_medea}
alignments = find_line_to_line_alignments(target_lines, source_lines_dict, smith_waterman.align, 3, lambda line:line.text_no_diacritics.split(' '), line_to_line_map)
save_alignments_html(alignments, target_lines, source_lines_dict, alignments_outfile, lambda line:line.text_raw.split(' '), 4)
HTML(filename='./'+ alignments_outfile + '.html')

searching all verses for alignments...
0/70 (  0%)
10/70 ( 14%)
20/70 ( 29%)
30/70 ( 43%)
40/70 ( 57%)
50/70 ( 71%)
60/70 ( 86%)


Target Line ID,Target Line No.,Target Line Text,Source Line Text,Source Line No.,Source Text ID,Alignment Score
0,1,"Εἴθ᾿ ὤφελ’ ἐν λειμῶνι μηδ’ ἕρπειν ὄφις,",Εἴθ’ ὤφελ’ Ἀργοῦς μὴ διαπτάσθαι σκάφος,1.0,Medea,4.0
1,2,μηδ’ ἐν νάπαισι τοῦδ’ ὑφεδρεύειν δράκων,μηδ’ ἐν νάπαισι Πηλίου πεσεῖν ποτε,3.0,Medea,6.0
2,3,"ἀγκυλομήτης· οὐ γὰρ ἂν πλευρᾶς φῦμα,",Πελίαι μετῆλθον. οὐ γὰρ ἂν δέσποιν’ ἐμὴ,6.0,Medea,6.0
3,4,"μήτηρ γένους δύστηνος ἠπατημένη,",,,,
4,5,"τόλμημα τολμᾶν παντάτολμον ἀνέτλη,",,,,
5,6,"ἔρνους ἔρωτι θυμὸν ἐκπεπληγμένη,",ἔρωτι θυμὸν ἐκπλαγεῖσ’ Ἰάσονος·,8.0,Medea,4.0
6,7,"θεώσεως πεισθεῖσα τυχεῖν αὐτόθεν,",,,,
7,8,οὐδ’ ἂν φαγεῖν πείσασα καρποῦ τὸν πόσιν,οὐδ’ ἂν κτανεῖν πείσασα Πελιάδας κόρας,9.0,Medea,5.0
8,9,τοῦ μηδὲ συμφέροντος αὐτίκα σφίσι,,,,
9,10,"λειμῶνος ἐξῴκιστο τοῦ πανολβίου,",,,,


## Exercise 2.1: The Smith-Waterman Algorithm
Check out the code at https://github.com/nicoschmidt/DigitalCentones/blob/master/smith_waterman.py and try to understand how the alignment function works.



## Exercise 2.2: Find character-based line-to-line alignments with diacritics
Now we use the Smith-Waterman algorithm to find all alignments automatically.
Run the following code and inspect the output.

Where does it differ from the expert's results? Where does it find good or bad alignments?

Tweak the scores for match, mismatch and gaps and try to find a good setting.

In [7]:
smith_waterman = SmithWaterman(match_score=2,
                               mismatch_score=-1,
                               gap_score=-1,
                               n_max_alignments=1,
                               min_score_treshold=0)
line_form_func = lambda line:line.text # we use the lower-case text with diacritics
alignments_outfile = 'alignments_character-based_with_diacritics'
target_lines = lines_chr_pat
source_lines_dict = {lines_medea[0].work_id:lines_medea}
alignments = find_line_to_line_alignments(target_lines, source_lines_dict, smith_waterman.align, 3, line_form_func)
save_alignments_html(alignments, target_lines, source_lines_dict, alignments_outfile, line_form_func)
HTML(filename='./'+ alignments_outfile + '.html')

searching all verses for alignments...
0/70 (  0%)
10/70 ( 14%)
20/70 ( 29%)
30/70 ( 43%)
40/70 ( 57%)
50/70 ( 71%)
60/70 ( 86%)


Target Line ID,Target Line No.,Target Line Text,Source Line Text,Source Line No.,Source Text ID,Alignment Score
0,1,εἴθ ὤφελ ἐν λειμῶνι μηδ ἕρπειν ὄφις,εἴθ ὤφελ ἀργοῦς μὴ διαπτάσθαι σκάφος,1,Medea,18
1,2,μηδ ἐν νάπαισι τοῦδ ὑφεδρεύειν δράκων,μηδ ἐν νάπαισι πηλίου πεσεῖν ποτε,3,Medea,27
2,3,ἀγκυλομήτης οὐ γὰρ ἂν πλευρᾶς φῦμα,πελίαι μετῆλθον οὐ γὰρ ἂν δέσποιν ἐμὴ,6,Medea,22
3,4,μήτηρ γένους δύστηνος ἠπατημένη,μήδεια δ ἡ δύστηνος ἠτιμασμένη,20,Medea,21
4,5,τόλμημα τολμᾶν παντάτολμον ἀνέτλη,οἷον πατρώιας μὴ ἀπολείπεσθαι χθονός,35,Medea,9
5,6,ἔρνους ἔρωτι θυμὸν ἐκπεπληγμένη,ἔρωτι θυμὸν ἐκπλαγεῖσ ἰάσονος,8,Medea,31
6,7,θεώσεως πεισθεῖσα τυχεῖν αὐτόθεν,τμηθεῖσα πεύκη μηδ ἐρετμῶσαι χέρας,4,Medea,12
7,8,οὐδ ἂν φαγεῖν πείσασα καρποῦ τὸν πόσιν,οὐδ ἂν κτανεῖν πείσασα πελιάδας κόρας,9,Medea,34
8,9,τοῦ μηδὲ συμφέροντος αὐτίκα σφίσι,κακῶν νέα γὰρ φροντὶς οὐκ ἀλγεῖν φιλεῖ,48,Medea,14
9,10,λειμῶνος ἐξῴκιστο τοῦ πανολβίου,γήμας κρέοντος παῖδ ὃς αἰσυμνᾶι χθονός,19,Medea,10


## Exercise 2.3: Find character-based line-to-line alignments without diacritics
Now we change the input strings to be text without diacritics (have a look at the line_form_func and Exercise 1.2 as well as [the data structure](https://github.com/nicoschmidt/DigitalCentones/blob/e51e1100201cc08971d1c742adcc44275c5359e1/alignment.py#L75)).

Run the following code and inspect the output.

Where does it differ from the previous results? Where does it find good or bad alignments?

Tweak the scores for match, mismatch and gaps and try to find a good setting.

In [8]:
smith_waterman = SmithWaterman(match_score=2,
                               mismatch_score=-1,
                               gap_score=-1,
                               n_max_alignments=1,
                               min_score_treshold=0)
line_form_func = lambda line:line.text_no_diacritics
alignments_outfile = 'alignments_character-based_without_diacritics'
target_lines = lines_chr_pat
source_lines_dict = {lines_medea[0].work_id:lines_medea}
alignments = find_line_to_line_alignments(target_lines, source_lines_dict, smith_waterman.align, 3, line_form_func)
save_alignments_html(alignments, target_lines, source_lines_dict, alignments_outfile, line_form_func, 4)
HTML(filename='./'+ alignments_outfile + '.html')

searching all verses for alignments...
0/70 (  0%)
10/70 ( 14%)
20/70 ( 29%)
30/70 ( 43%)
40/70 ( 57%)
50/70 ( 71%)
60/70 ( 86%)


Target Line ID,Target Line No.,Target Line Text,Source Line Text,Source Line No.,Source Text ID,Alignment Score
0,1,ειθ ωφελ εν λειμωνι μηδ ερπειν οφις,ειθ ωφελ αργους μη διαπτασθαι σκαφος,1,Medea,18
1,2,μηδ εν ναπαισι τουδ υφεδρευειν δρακων,μηδ εν ναπαισι πηλιου πεσειν ποτε,3,Medea,35
2,3,αγκυλομητης ου γαρ αν πλευρας φυμα,πελιαι μετηλθον ου γαρ αν δεσποιν εμη,6,Medea,23
3,4,μητηρ γενους δυστηνος ηπατημενη,μηδεια δ η δυστηνος ητιμασμενη,20,Medea,27
4,5,τολμημα τολμαν παντατολμον ανετλη,τον παντα συντηκουσα δακρυοις χρονον,25,Medea,15
5,6,ερνους ερωτι θυμον εκπεπληγμενη,ερωτι θυμον εκπλαγεισ ιασονος,8,Medea,32
6,7,θεωσεως πεισθεισα τυχειν αυτοθεν,ουδ αν κτανειν πεισασα πελιαδας κορας,9,Medea,13
7,8,ουδ αν φαγειν πεισασα καρπου τον ποσιν,ουδ αν κτανειν πεισασα πελιαδας κορας,9,Medea,37
8,9,του μηδε συμφεροντος αυτικα σφισι,εχθραν τις αυτη καλλινικον αισεται,45,Medea,15
9,10,λειμωνος εξωκιστο του πανολβιου,ανδρων αριστεων οι το παγχρυσον δερος,5,Medea,14


## Exercise 2.4: Find word-based line-to-line alignments without diacritics
Instead of treating the lines as sequences of characters, we can also treat them as sequences of words and do the matching at the word level.

Run the following code and inspect the output.

Where does it differ from the character-based results? Where does it find good or bad alignments? What about processing speed?

Tweak the scores for match, mismatch and gaps and try to find a good setting.
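
As a hint for the question about processing speed: the dynamic-programming table that Smith-Waterman fills has one cell per pair of sequence elements, so its size depends on whether a line is split into characters or into words. The sketch below only counts cells for one pair of lines from this notebook; it does not measure actual run time, and the helper name `table_cells` is made up for this example.

```python
# Chr. Pat. line 2 and Medea line 3, without diacritics
target = 'μηδ εν ναπαισι τουδ υφεδρευειν δρακων'
source = 'μηδ εν ναπαισι πηλιου πεσειν ποτε'

def table_cells(a, b):
    """Number of cells in the (len(a)+1) x (len(b)+1) scoring table."""
    return (len(a) + 1) * (len(b) + 1)

char_cells = table_cells(target, source)                        # elements are characters
word_cells = table_cells(target.split(' '), source.split(' '))  # elements are words
print('character-based cells:', char_cells)
print('word-based cells:     ', word_cells)
```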

In [9]:
smith_waterman = SmithWaterman(match_score=2,
                               mismatch_score=-1,
                               gap_score=-1,
                               n_max_alignments=1,
                               min_score_treshold=0)
line_form_func = lambda line:line.text_no_diacritics.split(' ')
alignments_outfile = 'alignments_word-based_without_diacritics'
target_lines = lines_chr_pat
source_lines_dict = {lines_medea[0].work_id:lines_medea}
alignments = find_line_to_line_alignments(target_lines, source_lines_dict, smith_waterman.align, 3, line_form_func)
save_alignments_html(alignments, target_lines, source_lines_dict, alignments_outfile, lambda line:line.text_raw.split(' '), 4)
HTML(filename='./'+ alignments_outfile + '.html')

searching all verses for alignments...
0/70 (  0%)
10/70 ( 14%)
20/70 ( 29%)
30/70 ( 43%)
40/70 ( 57%)
50/70 ( 71%)
60/70 ( 86%)


Target Line ID,Target Line No.,Target Line Text,Source Line Text,Source Line No.,Source Text ID,Alignment Score
0,1,"Εἴθ᾿ ὤφελ’ ἐν λειμῶνι μηδ’ ἕρπειν ὄφις,",Εἴθ’ ὤφελ’ Ἀργοῦς μὴ διαπτάσθαι σκάφος,1.0,Medea,4.0
1,2,μηδ’ ἐν νάπαισι τοῦδ’ ὑφεδρεύειν δράκων,μηδ’ ἐν νάπαισι Πηλίου πεσεῖν ποτε,3.0,Medea,6.0
2,3,"ἀγκυλομήτης· οὐ γὰρ ἂν πλευρᾶς φῦμα,",Πελίαι μετῆλθον. οὐ γὰρ ἂν δέσποιν’ ἐμὴ,6.0,Medea,6.0
3,4,"μήτηρ γένους δύστηνος ἠπατημένη,",,,,
4,5,"τόλμημα τολμᾶν παντάτολμον ἀνέτλη,",,,,
5,6,"ἔρνους ἔρωτι θυμὸν ἐκπεπληγμένη,",ἔρωτι θυμὸν ἐκπλαγεῖσ’ Ἰάσονος·,8.0,Medea,4.0
6,7,"θεώσεως πεισθεῖσα τυχεῖν αὐτόθεν,",,,,
7,8,οὐδ’ ἂν φαγεῖν πείσασα καρποῦ τὸν πόσιν,οὐδ’ ἂν κτανεῖν πείσασα Πελιάδας κόρας,9.0,Medea,5.0
8,9,τοῦ μηδὲ συμφέροντος αὐτίκα σφίσι,,,,
9,10,"λειμῶνος ἐξῴκιστο τοῦ πανολβίου,",,,,


# 3) Advanced exercises and extensions

Now that you have the basic ideas and a feeling for how the algorithm behaves and how its parameters affect the output, here are some suggestions for improving the software further.




## Exercise 3.1: incorporating morphological information
In some cases, words quoted in the Christus Patiens have been changed in tense, case, etc. to fit their new context. Such word forms can differ substantially between the original line in the source tragedy (Medea) and the Christus Patiens, so the alignment misses them.

For example, see the first word of line 51 Chr.Pat. (βοᾷ) / line 21 Medea (βοᾶι).

Here the word is the same and only the spelling differs.
To resolve such cases we can incorporate morphological information by mapping every word form to its lemma (βοαω in this case).

Have a look at the format of the morphology database files from [this repository](https://github.com/gcelano/MorpheusGreekUnicode): 
```
<d>
  <token>
    <word-form-id>1</word-form-id>
    <word-form>ἀάατον</word-form>
    <word-form-without-diacritics>ααατον</word-form-without-diacritics>
    <lemma>ἀάατος</lemma>
    <lemma-without-diacritics>ααατος</lemma-without-diacritics>
    <morphological-analysis>a-s---fa-</morphological-analysis>
    <lemma-id>1</lemma-id>
    <translation>not to be injured, inviolable</translation>
    <dialect/>
  </token>
<t>
  <i>1</i>
  <f>ἀάατον</f>
  <b>ααατον</b>
  <l>ἀάατος</l>
  <e>ααατος</e>
  <p>a-s---fa-</p>
  <d>1</d>
  <s>not to be injured, inviolable</s>
  <a/>
</t>
```

In the following, we load the database and build a dictionary that maps every word form to its corresponding lemma(s).
We apply this map to all lines in both texts and run the alignment algorithm on the lemmatized texts.
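
The idea can be sketched on the single database entry shown above. This is an illustration, not the code of `load_morphology`: it assumes that the `<t>` entries are wrapped in a root `<d>` element, it uses only the `<b>` (word form without diacritics) and `<e>` (lemma without diacritics) fields, and the helper names `build_form_to_lemma` and `lemmatize` are hypothetical.

```python
import xml.etree.ElementTree as ET

# the example entry from above, wrapped in an assumed root element <d>
sample = '''<d>
<t><i>1</i><f>ἀάατον</f><b>ααατον</b><l>ἀάατος</l><e>ααατος</e>
<p>a-s---fa-</p><d>1</d><s>not to be injured, inviolable</s><a/></t>
</d>'''

def build_form_to_lemma(xml_text):
    """Map each word form (without diacritics) to a lemma (without diacritics)."""
    root = ET.fromstring(xml_text)
    form_to_lemma = {}
    for entry in root.iter('t'):
        form = entry.findtext('b')
        lemma = entry.findtext('e')
        # a form can belong to several lemmas; this sketch keeps the first one
        form_to_lemma.setdefault(form, lemma)
    return form_to_lemma

def lemmatize(words, form_to_lemma):
    """Replace every known word form by its lemma; unknown words stay as they are."""
    return [form_to_lemma.get(w, w) for w in words]

form_to_lemma = build_form_to_lemma(sample)
print(lemmatize(['ααατον', 'ερνους'], form_to_lemma))
```

Keeping unknown words unchanged means that a word missing from the database can still match an identical form in the other text.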



In [10]:
# download files if not present
!([ ! -f MorpheusGreek1-319492.xml ]) && wget https://raw.githubusercontent.com/gcelano/MorpheusGreekUnicode/master/MorpheusGreek1-319492.xml
!([ ! -f MorpheusGreek319493-638984.xml ]) && wget https://raw.githubusercontent.com/gcelano/MorpheusGreekUnicode/master/MorpheusGreek319493-638984.xml
!([ ! -f MorpheusGreek638985-958476.xml ]) && wget https://raw.githubusercontent.com/gcelano/MorpheusGreekUnicode/master/MorpheusGreek638985-958476.xml
morph_xml_fnames = ['MorpheusGreek1-319492.xml',
                    'MorpheusGreek319493-638984.xml',
                    'MorpheusGreek638985-958476.xml']

# load data base
morph_dict, max_id = load_morphology(morph_xml_fnames, False)
print('{} lemmas loaded'.format(max_id))

# add all words of Christus Patiens and Medea that were not in the data base
m = max_id
max_id = add_lemma_ids(lines_chr_pat, morph_dict, max_id, False)
print('{} lemmas added by Christus Patiens'.format(max_id - m))
m = max_id
max_id = add_lemma_ids(lines_medea, morph_dict, max_id, False)
print('{} lemmas added by Medea'.format(max_id - m))
print('{} total lemmas'.format(max_id))

# print some lines
print('Christus Patiens:', lines_chr_pat[:3],'\n\n')
print('Medea:', lines_medea[:3], '\n\n')



36488 lemmas loaded
34 lemmas added by Christus Patiens
17 lemmas added by Medea
36539 total lemmas
Christus Patiens: [work_id:            urn:cts:greekLit:tlg2022.tlg003.opp-grc1
idx:                0
no:                 1
speaker:            ΘΕΟΤΟΚΟΣ.
text_raw:           Εἴθ᾿ ὤφελ’ ἐν λειμῶνι μηδ’ ἕρπειν ὄφις,
text:               εἴθ ὤφελ ἐν λειμῶνι μηδ ἕρπειν ὄφις
text_no_diacritics: ειθ ωφελ εν λειμωνι μηδ ερπειν οφις
text_lemma:         εζομαι οφειλω ειμι λειμωνιας μηδε ερπω οφις
lemma_ids:          [[9599, 9656, 9721, 9722, 9906, 9907, 11143, 15762], [23943], [9722, 9897, 10843, 11174, 11175, 15762], [19394, 19395, 19397], [21088, 21098, 21104], [13244], [23956]], work_id:            urn:cts:greekLit:tlg2022.tlg003.opp-grc1
idx:                1
no:                 2
speaker:            ΘΕΟΤΟΚΟΣ.
text_raw:           μηδ’ ἐν νάπαισι τοῦδ’ ὑφεδρεύειν δράκων
text:               μηδ ἐν νάπαισι τοῦδ ὑφεδρεύειν δράκων
text_no_diacritics: μηδ εν ναπαισι τουδ υφεδρευειν δρακων
text_lemma

You can see that the **text_lemma** field is now filled in for all lines.

Now run the Smith-Waterman algorithm on the lemmatized texts:

In [11]:
smith_waterman = SmithWaterman(match_score=2,
                               mismatch_score=-1,
                               gap_score=-1,
                               n_max_alignments=1,
                               min_score_treshold=0)
line_form_func = lambda line:line.text_lemma.split(' ')
alignments_outfile = 'alignments_word-based_without_diacritics_morphology'
target_lines = lines_chr_pat
source_lines_dict = {lines_medea[0].work_id:lines_medea}
alignments = find_line_to_line_alignments(target_lines, source_lines_dict, smith_waterman.align, 3, line_form_func)
save_alignments_html(alignments, target_lines, source_lines_dict, alignments_outfile, lambda line:line.text_raw.split(' '), 4)
HTML(filename='./'+ alignments_outfile + '.html')

searching all verses for alignments...
0/70 (  0%)
10/70 ( 14%)
20/70 ( 29%)
30/70 ( 43%)
40/70 ( 57%)
50/70 ( 71%)
60/70 ( 86%)


Target Line ID,Target Line No.,Target Line Text,Source Line Text,Source Line No.,Source Text ID,Alignment Score
0,1,"Εἴθ᾿ ὤφελ’ ἐν λειμῶνι μηδ’ ἕρπειν ὄφις,",Εἴθ’ ὤφελ’ Ἀργοῦς μὴ διαπτάσθαι σκάφος,1.0,Medea,4.0
1,2,μηδ’ ἐν νάπαισι τοῦδ’ ὑφεδρεύειν δράκων,μηδ’ ἐν νάπαισι Πηλίου πεσεῖν ποτε,3.0,Medea,6.0
2,3,"ἀγκυλομήτης· οὐ γὰρ ἂν πλευρᾶς φῦμα,",Πελίαι μετῆλθον. οὐ γὰρ ἂν δέσποιν’ ἐμὴ,6.0,Medea,6.0
3,4,"μήτηρ γένους δύστηνος ἠπατημένη,",,,,
4,5,"τόλμημα τολμᾶν παντάτολμον ἀνέτλη,",,,,
5,6,"ἔρνους ἔρωτι θυμὸν ἐκπεπληγμένη,",ἔρωτι θυμὸν ἐκπλαγεῖσ’ Ἰάσονος·,8.0,Medea,4.0
6,7,"θεώσεως πεισθεῖσα τυχεῖν αὐτόθεν,",,,,
7,8,οὐδ’ ἂν φαγεῖν πείσασα καρποῦ τὸν πόσιν,οὐδ’ ἂν κτανεῖν πείσασα Πελιάδας κόρας,9.0,Medea,5.0
8,9,τοῦ μηδὲ συμφέροντος αὐτίκα σφίσι,,,,
9,10,"λειμῶνος ἐξῴκιστο τοῦ πανολβίου,",,,,


## Exercise 3.1a: Find the bug
This is a tough one and requires some knowledge of the Greek language.

Have a look at Chr.Pat. line 6 / Medea line 8. Which word is not detected as a match, and why?


In [12]:
def print_lemma_forms(line):
    # print the line record, then for each lemma in the line its ID and all word forms mapping to it
    print(line, end='\n\n')
    for word in line.text_lemma.split(' '):
        lemma_id = morph_dict[word][0]
        print('word "{}": lemma ID {}, all forms mapping to this lemma: {}'.format(word, lemma_id, morph_dict[lemma_id][1]))

print_lemma_forms(lines_chr_pat[5])  # Chr.Pat. line 6

print('\n\n')

print_lemma_forms(lines_medea[7])  # Medea line 8


work_id:            urn:cts:greekLit:tlg2022.tlg003.opp-grc1
idx:                5
no:                 6
speaker:            ΘΕΟΤΟΚΟΣ.
text_raw:           ἔρνους ἔρωτι θυμὸν ἐκπεπληγμένη,
text:               ἔρνους ἔρωτι θυμὸν ἐκπεπληγμένη
text_no_diacritics: ερνους ερωτι θυμον εκπεπληγμενη
text_lemma:         ερνους ερως θυμον εκπλησσω
lemma_ids:          [[36491], [13280, 13287], [15521, 15526], [10267]]

word "ερνους": lemma ID 36491, all forms mapping to this lemma: ['ερνους']
word "ερως": lemma ID 13080, all forms mapping to this lemma: ['ερωσα', 'ερωντ', 'ερωντε', 'αρασαν', 'ερωσαν', 'αρασαι', 'ερασαι', 'ερως', 'ερωντας', 'ερωσι', 'ερωσιν', 'ερωντων', 'ερωντες', 'αραντα', 'ερωντα', 'ερωμενων', 'ερωμενους', 'ερωμενοις', 'ερωμενοι', 'εραντι', 'ερωντι', 'ερωντος', 'αραν', 'ανερων', 'εραν', 'ερων', 'ερωμενην', 'ερωμενη', 'ερωμενης', 'ερωμενον', 'ερωωμενον', 'ερωμενω', 'τωραμενω', 'ερωμενου', 'ερωμενος', 'ερασθαι', 'ερωμεν', 'ερωμεθα', 'ηραμεθα', 'ηρων', 'ηραν', 'ενηραμην', 'ηραμην', 

## Exercise 3.2: Word count statistics
Modify the alignment scoring by introducing a weighted match score scheme in which frequent (but potentially uninteresting) words get lower match scores and rare (but potentially interesting) words get higher match scores.
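
As a possible starting point, the sketch below derives a weight for each word from the number of lines it occurs in (an inverse-document-frequency style weight, chosen here for illustration only) and uses it as the match score in a small word-based Smith-Waterman scorer. The function names `word_weights` and `weighted_sw_score` are hypothetical and not part of the `SmithWaterman` class. The three sample lines are Medea lines from this notebook without diacritics.

```python
import math
from collections import Counter

def word_weights(lines):
    """Map each word to a weight that shrinks as the word occurs in more lines."""
    doc_freq = Counter(w for line in lines for w in set(line))
    return {w: math.log(1 + len(lines) / df) for w, df in doc_freq.items()}

def weighted_sw_score(a, b, weights, mismatch_score=-1.0, gap_score=-1.0):
    """Best local alignment score where a match on word w earns weights[w]."""
    H = [[0.0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                step = weights.get(a[i - 1], 1.0)  # unseen words get a neutral weight
            else:
                step = mismatch_score
            H[i][j] = max(0.0, H[i - 1][j - 1] + step,
                          H[i - 1][j] + gap_score, H[i][j - 1] + gap_score)
            best = max(best, H[i][j])
    return best

corpus = [line.split(' ') for line in [
    'ουδ αν κτανειν πεισασα πελιαδας κορας',
    'πελιαι μετηλθον ου γαρ αν δεσποιν εμη',
    'ερωτι θυμον εκπλαγεισ ιασονος',
]]
weights = word_weights(corpus)
print('αν:', weights['αν'], ' ερωτι:', weights['ερωτι'])
```

With only 48 source lines the frequency estimates are coarse, so weights computed from a larger corpus would likely be more informative.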


## Exercise 3.3: Alternative alignment algorithms
Create an alternative alignment function, e.g. by just counting the words that two lines have in common, ignoring word order. Does this improve the results? In which situations?
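
A minimal sketch of such an order-free comparison, assuming each line is given as a list of words: the helper names `shared_word_count` and `best_source_line` are invented for this example and are not part of the DigitalCentones code. The sample lines are taken from the outputs above.

```python
from collections import Counter

def shared_word_count(a_words, b_words):
    """Count the words two lines have in common, ignoring their order."""
    # the multiset intersection keeps each word min(count in a, count in b) times
    overlap = Counter(a_words) & Counter(b_words)
    return sum(overlap.values())

def best_source_line(target_words, source_lines):
    """Return (index, score) of the source line sharing the most words with the target."""
    scores = [shared_word_count(target_words, s) for s in source_lines]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]

target = 'ερνους ερωτι θυμον εκπεπληγμενη'.split(' ')
sources = [line.split(' ') for line in [
    'ειθ ωφελ αργους μη διαπτασθαι σκαφος',
    'ερωτι θυμον εκπλαγεισ ιασονος',
]]
print(best_source_line(target, sources))
```

Unlike Smith-Waterman, this score does not change when the quoted words are reordered, but it also cannot tell a contiguous quotation from scattered common words.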