<a href="https://colab.research.google.com/github/nicoschmidt/DigitalCentones/blob/master/SmartTexts2019/Practicum.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Smart Texts 2019 Practicum May 27th:
# Christus Patiens and the opportunities for Digital Reading

Welcome to the Smart Texts practicum! Today we will investigate the possibilities of computer-aided intertextual reading.
In this session you will learn some techniques for working with a digital version of the Greek cento Christus Patiens from XXX by Gregorius Nazianzenus. The digital version, Johann Georg Brambs's edition of 1885, can be read in the Scaife Viewer: https://scaife.perseus.org/reader/urn:cts:greekLit:tlg2022.tlg003.opp-grc1:1.1-1.30/

It is available on GitHub (along with many other texts!) in a repository of the ***OpenGreekAndLatin*** project: https://github.com/OpenGreekAndLatin/First1KGreek/blob/master/data/tlg2022/tlg003/tlg2022.tlg003.opp-grc1.xml



# Analysing Cento Christus Patiens using alignment algorithms

- get code and files from GitHub
- import code

In [6]:
!([ -d DigitalCentones ]) && rm -r DigitalCentones
!git clone https://github.com/nicoschmidt/DigitalCentones

from DigitalCentones.smith_waterman import SmithWaterman
from DigitalCentones.alignment import *
from IPython.display import HTML

Cloning into 'DigitalCentones'...
remote: Enumerating objects: 45, done.

# Load Christus Patiens and Medea text files
- copy the XML file of Christus Patiens from the GitHub repository ***First1KGreek*** of the ***OpenGreekAndLatin*** project
- parse the XML file
- remove the preface lines
- load the first 48 lines of Medea as plain text (no XML available yet :( )

In [7]:
# get Christus Patiens xml file (if not present)
!([ ! -f tlg2022.tlg003.opp-grc1.xml ]) && wget https://raw.githubusercontent.com/OpenGreekAndLatin/First1KGreek/master/data/tlg2022/tlg003/tlg2022.tlg003.opp-grc1.xml
lines_chr_pat = load_xml('tlg2022.tlg003.opp-grc1.xml')
for i in range(30, len(lines_chr_pat)):
    lines_chr_pat[i].no = -29+i # hack to re-number lines
lines_chr_pat = lines_chr_pat[30:]
print('Christus Patiens:', lines_chr_pat[:3],'\n\n')

lines_medea = load_plain_text('DigitalCentones/data/Medea-48.txt', 'Medea', 'Medea', 'Euripides')
print('Medea:', lines_medea[:3], '\n\n')

Christus Patiens: [work_id:            urn:cts:greekLit:tlg2022.tlg003.opp-grc1
idx:                30
no:                 1
speaker:            ΘΕΟΤΟΚΟΣ.
text_raw:           Εἴθ᾿ ὤφελ’ ἐν λειμῶνι μηδ’ ἕρπειν ὄφις,
text:               εἴθ ὤφελ ἐν λειμῶνι μηδ ἕρπειν ὄφις
text_no_diacritics: ειθ ωφελ εν λειμωνι μηδ ερπειν οφις
text_lemma:         
lemma_ids:          [], work_id:            urn:cts:greekLit:tlg2022.tlg003.opp-grc1
idx:                31
no:                 2
speaker:            ΘΕΟΤΟΚΟΣ.
text_raw:           μηδ’ ἐν νάπαισι τοῦδ’ ὑφεδρεύειν δράκων
text:               μηδ ἐν νάπαισι τοῦδ ὑφεδρεύειν δράκων
text_no_diacritics: μηδ εν ναπαισι τουδ υφεδρευειν δρακων
text_lemma:         
lemma_ids:          [], work_id:            urn:cts:greekLit:tlg2022.tlg003.opp-grc1
idx:                32
no:                 3
speaker:            ΘΕΟΤΟΚΟΣ.
text_raw:           ἀγκυλομήτης· οὐ γὰρ ἂν πλευρᾶς φῦμα,
text:               ἀγκυλομήτης οὐ γὰρ ἂν πλευρᾶς φῦμα
text_no_diacritics: αγκ

---

**Print all characters with diacritics**



In [0]:
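# numpy and unicodedata may already be in scope through the star import above;
# importing them explicitly keeps this cell runnable on its own
import unicodedata
import numpy as np

# print all characters in the corpus: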
alphabet = np.unique([c for lines in [lines_chr_pat, lines_medea] for line in lines for c in line.text])
print('ID, Character, Name')
for a in alphabet:
    print('{} "{}" {}'.format(hex(ord(a)), a, unicodedata.name(a, 'not defined')))

ID, Character, Name
0x20 " " SPACE
0xe1 "á" LATIN SMALL LETTER A WITH ACUTE
0x390 "ΐ" GREEK SMALL LETTER IOTA WITH DIALYTIKA AND TONOS
0x3ac "ά" GREEK SMALL LETTER ALPHA WITH TONOS
0x3ad "έ" GREEK SMALL LETTER EPSILON WITH TONOS
0x3ae "ή" GREEK SMALL LETTER ETA WITH TONOS
0x3af "ί" GREEK SMALL LETTER IOTA WITH TONOS
0x3b1 "α" GREEK SMALL LETTER ALPHA
0x3b2 "β" GREEK SMALL LETTER BETA
0x3b3 "γ" GREEK SMALL LETTER GAMMA
0x3b4 "δ" GREEK SMALL LETTER DELTA
0x3b5 "ε" GREEK SMALL LETTER EPSILON
0x3b6 "ζ" GREEK SMALL LETTER ZETA
0x3b7 "η" GREEK SMALL LETTER ETA
0x3b8 "θ" GREEK SMALL LETTER THETA
0x3b9 "ι" GREEK SMALL LETTER IOTA
0x3ba "κ" GREEK SMALL LETTER KAPPA
0x3bb "λ" GREEK SMALL LETTER LAMDA
0x3bc "μ" GREEK SMALL LETTER MU
0x3bd "ν" GREEK SMALL LETTER NU
0x3be "ξ" GREEK SMALL LETTER XI
0x3bf "ο" GREEK SMALL LETTER OMICRON
0x3c0 "π" GREEK SMALL LETTER PI
0x3c1 "ρ" GREEK SMALL LETTER RHO
0x3c2 "ς" GREEK SMALL LETTER FINAL SIGMA
0x3c3 "σ" GREEK SMALL LETTER SIGMA
0x3c4 "τ" GREEK SMALL LETT

**Print all characters without diacritics**

In [0]:
alphabet_no_diacritics = np.unique([c for lines in [lines_chr_pat, lines_medea] for line in lines for c in line.text_no_diacritics])
for a in alphabet_no_diacritics:
    print('{} "{}" {}'.format(hex(ord(a)), a, unicodedata.name(a, 'not defined')))


0x20 " " SPACE
0x61 "a" LATIN SMALL LETTER A
0x3b1 "α" GREEK SMALL LETTER ALPHA
0x3b2 "β" GREEK SMALL LETTER BETA
0x3b3 "γ" GREEK SMALL LETTER GAMMA
0x3b4 "δ" GREEK SMALL LETTER DELTA
0x3b5 "ε" GREEK SMALL LETTER EPSILON
0x3b6 "ζ" GREEK SMALL LETTER ZETA
0x3b7 "η" GREEK SMALL LETTER ETA
0x3b8 "θ" GREEK SMALL LETTER THETA
0x3b9 "ι" GREEK SMALL LETTER IOTA
0x3ba "κ" GREEK SMALL LETTER KAPPA
0x3bb "λ" GREEK SMALL LETTER LAMDA
0x3bc "μ" GREEK SMALL LETTER MU
0x3bd "ν" GREEK SMALL LETTER NU
0x3be "ξ" GREEK SMALL LETTER XI
0x3bf "ο" GREEK SMALL LETTER OMICRON
0x3c0 "π" GREEK SMALL LETTER PI
0x3c1 "ρ" GREEK SMALL LETTER RHO
0x3c2 "ς" GREEK SMALL LETTER FINAL SIGMA
0x3c3 "σ" GREEK SMALL LETTER SIGMA
0x3c4 "τ" GREEK SMALL LETTER TAU
0x3c5 "υ" GREEK SMALL LETTER UPSILON
0x3c6 "φ" GREEK SMALL LETTER PHI
0x3c7 "χ" GREEK SMALL LETTER CHI
0x3c8 "ψ" GREEK SMALL LETTER PSI
0x3c9 "ω" GREEK SMALL LETTER OMEGA
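The notebook does not show how `text_no_diacritics` is derived from `text_raw`. The following is only a minimal sketch of one common approach, assuming Unicode NFD decomposition followed by dropping everything that is not a cased letter or a space. The function name `strip_diacritics` is hypothetical and is not part of DigitalCentones.

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    """Lower-case text, drop diacritics and punctuation, collapse spaces.

    Hypothetical helper for illustration; not the DigitalCentones code.
    """
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize('NFD', text.lower())
    # Keep cased letters (categories Ll/Lu) and whitespace only; combining
    # marks (Mn), apostrophes and punctuation are dropped.
    kept = [c for c in decomposed
            if unicodedata.category(c) in ('Ll', 'Lu') or c.isspace()]
    # Collapse runs of whitespace left behind by removed characters.
    return ' '.join(''.join(kept).split())

print(strip_diacritics('μηδ’ ἐν νάπαισι τοῦδ’ ὑφεδρεύειν δράκων'))
```

For line 2 of Christus Patiens this prints `μηδ εν ναπαισι τουδ υφεδρευειν δρακων`, the same string as the `text_no_diacritics` field shown earlier. It also turns the stray `á` into a plain Latin `a`, which is consistent with the two character listings above.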


# Load all matches found by Tuilier

In [0]:
# Input file is a text file with one header row and remaining rows structured as
# '<target_line_number>,<source_work_number>,<source_work_line_number>...\n'
line_to_line_map = np.loadtxt('DigitalCentones/data/ChristusPatiens_Tulliers_matches.csv', int, delimiter=',', skiprows=1, usecols=[0,1,2])

# These are all the source works in which Tuilier found matches
source_work_names = ['Agamemnon',
                     'Prometheus',
                     'Alkestis',
                     'Andromache',
                     'Bakchen',
                     'Hekabe',
                     'Helena',
                     'Hippolytos',
                     'Iphigenie in Aulis',
                     'Iphigenie in Tauris',
                     'Medea',
                     'Orestes',
                     'Phönikerinnen',
                     'Rhesos',
                     'Troerinnen',
                     'Ilias',
                     'Alexandra']

# store the map as a dictionary: keys are target line numbers,
# values are tuples (<source_work_name>, <source_line_number>)
line_to_line_map = {target_line_no: (source_work_names[source_work_id - 1], source_line_id)
                    for target_line_no, source_work_id, source_line_id in line_to_line_map}
line_to_line_map

{1: ('Medea', 1),
 2: ('Medea', 3),
 3: ('Medea', 6),
 4: ('Medea', 20),
 5: ('Hekabe', 1123),
 6: ('Medea', 8),
 8: ('Medea', 9),
 14: ('Medea', 10),
 15: ('Medea', 11),
 17: ('Prometheus', 1027),
 25: ('Alexandra', 151),
 32: ('Medea', 14),
 33: ('Medea', 15),
 34: ('Medea', 13),
 37: ('Medea', 16),
 38: ('Medea', 17),
 39: ('Agamemnon', 764),
 40: ('Troerinnen', 605),
 41: ('Troerinnen', 620),
 42: ('Troerinnen', 621),
 43: ('Medea', 20),
 45: ('Prometheus', 1027),
 46: ('Medea', 25),
 47: ('Medea', 26),
 50: ('Hippolytos', 450),
 51: ('Medea', 21),
 52: ('Medea', 22),
 53: ('Medea', 34),
 54: ('Medea', 35),
 55: ('Medea', 36),
 56: ('Medea', 56),
 57: ('Medea', 57),
 58: ('Medea', 58),
 59: ('Medea', 59),
 61: ('Hekabe', 736),
 65: ('Agamemnon', 611),
 66: ('Agamemnon', 612),
 71: ('Agamemnon', 587),
 72: ('Agamemnon', 588),
 73: ('Agamemnon', 589),
 75: ('Agamemnon', 593),
 76: ('Agamemnon', 591),
 77: ('Troerinnen', 747),
 78: ('Troerinnen', 748),
 79: ('Agamemnon', 594),
 80: ('
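Once the map is a dictionary, simple questions about the cento can be answered with a few lines of standard Python. The sketch below uses only the first ten entries printed above, copied by hand into a hypothetical variable `sample_map`. It counts how often each source work occurs and lists the target lines without a recorded source.

```python
from collections import Counter

# First entries of the printed map: target line -> (source work, source line).
# Copied by hand for illustration; the real map is much longer.
sample_map = {
    1: ('Medea', 1),
    2: ('Medea', 3),
    3: ('Medea', 6),
    4: ('Medea', 20),
    5: ('Hekabe', 1123),
    6: ('Medea', 8),
    8: ('Medea', 9),
    14: ('Medea', 10),
    15: ('Medea', 11),
    17: ('Prometheus', 1027),
}

# How many target lines draw on each source work?
matches_per_work = Counter(work for work, _ in sample_map.values())
for work, count in matches_per_work.most_common():
    print(work, count)

# Target lines 1..17 for which the sample records no source line.
unmatched = [no for no in range(1, 18) if no not in sample_map]
print(unmatched)
```

The same two expressions should work unchanged on the full `line_to_line_map`, provided the range is adjusted to the number of target lines.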

In [0]:
# find matches by searching in given line-to-line map
smith_waterman = SmithWaterman(match_score=2,
                               mismatch_score=-1,
                               gap_score=-1,
                               n_max_alignments=1,
                               min_score_treshold=0)
alignments_outfile = 'alignments_word-based_Toullier'
target_lines = lines_chr_pat
source_lines_dict = {lines_medea[0].work_id:lines_medea}
alignments = find_line_to_line_alignments(target_lines, source_lines_dict, smith_waterman.align, 3, lambda line:line.text_no_diacritics.split(' '), line_to_line_map)
save_alignments_html(alignments, target_lines, source_lines_dict, alignments_outfile, lambda line:line.text_raw.split(' '), 4)

searching all verses for alignments...
0/2612 (  0%)
10/2612 (  0%)
...

In [0]:
HTML(filename='./'+ alignments_outfile + '.html')

| Target Line ID | Target Line No. | Target Line Text | Source Line Text | Source Line No. | Source Text ID | Alignment Score |
|---|---|---|---|---|---|---|
| 30 | 1 | Εἴθ᾿ ὤφελ’ ἐν λειμῶνι μηδ’ ἕρπειν ὄφις, | Εἴθ’ ὤφελ’ Ἀργοῦς μὴ διαπτάσθαι σκάφος | 1.0 | Medea | 4.0 |
| 31 | 2 | μηδ’ ἐν νάπαισι τοῦδ’ ὑφεδρεύειν δράκων | μηδ’ ἐν νάπαισι Πηλίου πεσεῖν ποτε | 3.0 | Medea | 6.0 |
| 32 | 3 | ἀγκυλομήτης· οὐ γὰρ ἂν πλευρᾶς φῦμα, | Πελίαι μετῆλθον. οὐ γὰρ ἂν δέσποιν’ ἐμὴ | 6.0 | Medea | 6.0 |
| 33 | 4 | μήτηρ γένους δύστηνος ἠπατημένη, | | | | |
| 34 | 5 | τόλμημα τολμᾶν παντάτολμον ἀνέτλη, | | | | |
| 35 | 6 | ἔρνους ἔρωτι θυμὸν ἐκπεπληγμένη, | ἔρωτι θυμὸν ἐκπλαγεῖσ’ Ἰάσονος· | 8.0 | Medea | 4.0 |
| 36 | 7 | θεώσεως πεισθεῖσα τυχεῖν αὐτόθεν, | | | | |
| 37 | 8 | οὐδ’ ἂν φαγεῖν πείσασα καρποῦ τὸν πόσιν | οὐδ’ ἂν κτανεῖν πείσασα Πελιάδας κόρας | 9.0 | Medea | 5.0 |
| 38 | 9 | τοῦ μηδὲ συμφέροντος αὐτίκα σφίσι | | | | |
| 39 | 10 | λειμῶνος ἐξῴκιστο τοῦ πανολβίου, | | | | |
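To see where such alignment scores can come from, here is a minimal sketch of the textbook Smith-Waterman recurrence with a linear gap penalty, using the same scores as the word-based cell above (match 2, mismatch -1, gap -1). It is an illustration only and makes no claim about how `DigitalCentones.smith_waterman` is implemented. The function name `smith_waterman_score` is hypothetical, and the diacritic-free word lists were normalised by hand from the verses in the output above.

```python
def smith_waterman_score(a, b, match_score=2, mismatch_score=-1, gap_score=-1):
    """Best local alignment score between two sequences (strings or lists).

    Textbook recurrence with a linear gap penalty; illustrative sketch only.
    """
    # prev[j] is the best score of a local alignment ending at the previous
    # element of a and at b[j-1]; a score never drops below 0.
    prev = [0] * (len(b) + 1)
    best = 0
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            diag = prev[j - 1] + (match_score if x == y else mismatch_score)
            up = prev[j] + gap_score        # x aligned to a gap
            left = curr[j - 1] + gap_score  # y aligned to a gap
            curr.append(max(0, diag, up, left))
        best = max(best, max(curr))
        prev = curr
    return best

# Christus Patiens 1 against Medea 1 (word-based, without diacritics)
target_1 = 'ειθ ωφελ εν λειμωνι μηδ ερπειν οφις'.split(' ')
source_1 = 'ειθ ωφελ αργους μη διαπτασθαι σκαφος'.split(' ')
score_1 = smith_waterman_score(target_1, source_1)

# Christus Patiens 8 against Medea 9
target_8 = 'ουδ αν φαγειν πεισασα καρπου τον ποσιν'.split(' ')
source_8 = 'ουδ αν κτανειν πεισασα πελιαδας κορας'.split(' ')
score_8 = smith_waterman_score(target_8, source_8)

print(score_1, score_8)
```

This prints `4 5`. In the first pair two adjacent words match (2 + 2). In the second, two matches, one mismatch and a third match give 2 + 2 - 1 + 2. Both values agree with the scores listed for these verses, although agreement on two examples does not show that the library uses exactly this recurrence. Because the function only compares elements for equality, passing two strings instead of two word lists gives a character-based alignment.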


# Find character-based line-to-line alignments with diacritics


In [0]:
string_mapping_function = lambda s:s # identity
smith_waterman = SmithWaterman(match_score=5,
                               mismatch_score=-1,
                               gap_score=-1,
                               n_max_alignments=1,
                               min_score_treshold=0,
                               string_mapping_function=string_mapping_function)
line_form_func = lambda line:line.text
alignments_outfile = 'alignments_character-based_with_diacritics'
target_lines = lines_chr_pat
source_lines_dict = {lines_medea[0].work_id:lines_medea}
alignments = find_line_to_line_alignments(target_lines, source_lines_dict, smith_waterman.align, 3, line_form_func)
save_alignments_html(alignments, target_lines, source_lines_dict, alignments_outfile, line_form_func)
HTML(filename='./'+ alignments_outfile + '.html')

searching all verses for alignments...
0/2612 (  0%)
10/2612 (  0%)
20/2612 (  1%)
30/2612 (  1%)
40/2612 (  2%)
50/2612 (  2%)
60/2612 (  2%)


KeyboardInterrupt: ignored

# Find character-based line-to-line alignments without diacritics

In [0]:
smith_waterman = SmithWaterman(match_score=4,
                               mismatch_score=-1,
                               gap_score=-1,
                               n_max_alignments=1,
                               min_score_treshold=0)
line_form_func = lambda line:line.text_no_diacritics
alignments_outfile = 'alignments_character-based_without_diacritics'
target_lines = lines_chr_pat
source_lines_dict = {lines_medea[0].work_id:lines_medea}
alignments = find_line_to_line_alignments(target_lines, source_lines_dict, smith_waterman.align, 3, line_form_func)
save_alignments_html(alignments, target_lines, source_lines_dict, alignments_outfile, line_form_func, 4)

# Find word-based line-to-line alignments without diacritics

In [0]:
smith_waterman = SmithWaterman(match_score=2,
                               mismatch_score=-1,
                               gap_score=-1,
                               n_max_alignments=1,
                               min_score_treshold=0)
line_form_func = lambda line:line.text_no_diacritics.split(' ')
alignments_outfile = 'alignments_word-based_without_diacritics'
target_lines = lines_chr_pat
source_lines_dict = {lines_medea[0].work_id:lines_medea}
alignments = find_line_to_line_alignments(target_lines, source_lines_dict, smith_waterman.align, 3, line_form_func)
save_alignments_html(alignments, target_lines, source_lines_dict, alignments_outfile, lambda line:line.text_raw.split(' '), 4)

# Incorporating morphological information
- load the morphological database from GitHub
- map all words to their lemmatized form
- find alignments using the lemmas
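The idea behind these steps can be sketched with a toy example. The table `toy_lemmas` below is hand-made for illustration and is not taken from the Morpheus data. The helper `to_lemma_ids` is likewise hypothetical and only assumes that every word is replaced by an integer id for its lemma, with unknown words receiving a fresh id.

```python
# Hand-made toy table: word form (without diacritics) -> lemma.
# Hypothetical data for illustration; NOT taken from Morpheus.
toy_lemmas = {
    'εκπεπληγμενη': 'εκπλησσω',
    'εκπλαγεισ': 'εκπλησσω',
    'ερωτι': 'ερως',
    'θυμον': 'θυμος',
}

def to_lemma_ids(words, form_to_lemma, lemma_to_id):
    """Map each word to an integer lemma id; unseen lemmas get a fresh id."""
    ids = []
    for word in words:
        # Words missing from the table stand in as their own lemma.
        lemma = form_to_lemma.get(word, word)
        if lemma not in lemma_to_id:
            lemma_to_id[lemma] = len(lemma_to_id)
        ids.append(lemma_to_id[lemma])
    return ids

lemma_to_id = {}
# Christus Patiens 6 and Medea 8, normalised by hand
target_ids = to_lemma_ids('ερνους ερωτι θυμον εκπεπληγμενη'.split(' '), toy_lemmas, lemma_to_id)
source_ids = to_lemma_ids('ερωτι θυμον εκπλαγεισ ιασονος'.split(' '), toy_lemmas, lemma_to_id)
print(target_ids)
print(source_ids)
```

On the surface forms only two words of these verses coincide. On the lemma ids three consecutive tokens coincide, because both participles map to the same lemma. An aligner fed with the id lists would therefore score this pair higher than one fed with the raw words.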

In [3]:
# download files if not present
!([ ! -f MorpheusGreek1-319492.xml ]) && wget https://raw.githubusercontent.com/gcelano/MorpheusGreekUnicode/master/MorpheusGreek1-319492.xml
!([ ! -f MorpheusGreek319493-638984.xml ]) && wget https://raw.githubusercontent.com/gcelano/MorpheusGreekUnicode/master/MorpheusGreek319493-638984.xml
!([ ! -f MorpheusGreek638985-958476.xml ]) && wget https://raw.githubusercontent.com/gcelano/MorpheusGreekUnicode/master/MorpheusGreek638985-958476.xml
morph_xml_fnames = ['MorpheusGreek1-319492.xml',
                    'MorpheusGreek319493-638984.xml',
                    'MorpheusGreek638985-958476.xml']

# load the morphological database
morph_dict, max_id = load_morphology(morph_xml_fnames, False)
print('{} lemmas loaded'.format(max_id))
m = max_id
max_id = add_lemma_ids(lines_chr_pat, morph_dict, max_id, False)
print('{} lemmas added by Christus Patiens'.format(max_id - m))
m = max_id
max_id = add_lemma_ids(lines_medea, morph_dict, max_id, False)
print('{} lemmas added by Medea'.format(max_id - m))
print('{} total lemmas'.format(max_id))



NameError: ignored

In [2]:
# prints a message only if the first Morpheus file is missing
! ([ ! -f MorpheusGreek1-319492.xml ]) && echo "MorpheusGreek1-319492.xml does not exist"


In [0]:
!ls -la MorpheusGreek1-319492.xml

-rw-r--r-- 1 root root 66474209 May 19 11:21 MorpheusGreek1-319492.xml
