# Repetitions in Homeric speeches: get data

The DICES search interface is found at: http://dices.ub.uni-rostock.de/app/speeches/search/ 

## TODO

- ✅ get all speeches for Od. and Il. from Dices API
- ✅ filter speeches for relevant tags (giuramenti, istruzioni, preghiere)
- ✅ create an initial dataframe with relevant data
- ✅ add text fetched from Perseus via CTS URN
- ✅ add lemmatised version of text via CLTK
- ✅ save data to JSON lines format suitable for processing with passim

## Dices tags

| Dices tag      | Expansion |
| ----------- | ----------- |
| cha         | Challenge |
| com | Command |
| con | Consolation|
| del | Deliberation|
| des | Desire and Wish|
| exh | Exhortation and Self-Exhortation|
| far | Farewell|
| gre | Greeting and Reception|
| inf | Information and Description|
| ins | Instruction|
| inv | Invitation |
| lam | Lament |
| lau | Praise and Laudation |
| mes | Message|
| nar | Narration|
| ora | Prophecy, Oracular Speech, and Interpretation|
| per | Persuasion|
| pra | Prayer|
| que | Question|
| req | Request|
| res | Reply to Question|
| tau | Taunt|
| thr | Threat|
| und | Undefined|
| vit | Vituperation|
| vow | Promise and Oath|
| war | Warning|

## Retrieving all Homeric speeches from Dices

In [1]:
from dicesapi import DicesAPI
import pandas as pd

# create a connection to DICES
api = DicesAPI(logfile='data/log/dices.log')

In [2]:
od_speeches = api.getSpeeches(work_title='Odyssey', progress=True)
print('Got', len(od_speeches), 'speeches')

Got 673 speeches


In [3]:
il_speeches = api.getSpeeches(work_title='Iliad', progress=True)
print('Got', len(il_speeches), 'speeches')

Got 698 speeches


In [5]:
speeches = il_speeches + od_speeches

In [6]:
len(speeches)

1371

## Filter speeches

The `tag` information is contained in the `dicesapi.Speech` class, yet a bid hidden. 

In [124]:
def filter_speeches_by_tag(speeches_list, tags_to_keep):
    filtered_speeches = []
    for speech in speeches_list:
        matching_tags = list(set([t['type'] for t in speech._attributes['tags']]) & set(tags_to_keep))
        if len(matching_tags) > 0:
            filtered_speeches.append(speech)
    return filtered_speeches

In [125]:
# ⚠️ check with Ombretta which tags exactly correspond to "discorso di istruzione, discorso di recapito"
speeches_subset = filter_speeches_by_tag(speeches, ['vow', 'ins', 'pra', 'mes', 'nar'])

In [126]:
len(speeches_subset)

221

## Fetch text of speeches from Perseus

In [11]:
from dicesapi.text import CtsAPI
from dicesapi.jupyter import NotebookPBar
cts = CtsAPI()

In [12]:
def download_speech_text(speeches):
    '''Download the text of the speeches from CTS server, append to speech objects'''
    pbar = NotebookPBar(max=len(speeches), prefix='Downloading text')

    for s in speeches:
        if not hasattr(s, 'passage') or s.passage is None:
            s.passage = cts.getPassage(s)
        pbar.update()
    return speeches

In [13]:
speeches_subset = speeches
len(speeches_subset)

1371

In [14]:
speeches_subset = download_speech_text(speeches_subset)

HBox(children=(IntProgress(value=0, bar_style='info', max=1371), Label(value='Downloading text 0/1371')))

In [15]:
speeches_subset[-1].passage.text

'τίπτε μοι, Ἑρμεία χρυσόρραπι, εἰλήλουθας αἰδοῖός τε φίλος τε; πάρος γε μὲν οὔ τι θαμίζεις. αὔδα ὅ τι φρονέεις· τελέσαι δέ με θυμὸς ἄνωγεν, εἰ δύναμαι τελέσαι γε καὶ εἰ τετελεσμένον ἐστίν. ἀλλʼ ἕπεο προτέρω, ἵνα τοι πὰρ ξείνια θείω.'

In [16]:
len(speeches_subset)

1371

## Process speeches with CLTK

In [17]:
def parse_speech_text(speeches):
    '''Run CLTK NLP pipeline to parse all the speeches'''
    
    pbar = NotebookPBar(max=len(speeches), prefix='Running NLP')

    for s in speeches:
        if not hasattr(s, 'passage') or s.passage is None:
            print('no passage:', s)
        elif not hasattr(s.passage, 'cltk') or s.passage.cltk is None:
            s.passage.runCltkPipeline(remove_punct=True)
        pbar.update()
    return speeches



In [18]:
speeches_subset = parse_speech_text(speeches_subset)

HBox(children=(IntProgress(value=0, bar_style='info', max=1371), Label(value='Running NLP 0/1371')))

no passage: <Speech 931: Odyssey 10.456-10.465>


In [19]:
s = speeches_subset[0]
s.passage.text

'ὦ φίλʼ, ἐπεί σε πρῶτα κιχάνω τῷδʼ ἐνὶ χώρῳ, χαῖρέ τε καὶ μή μοί τι κακῷ νόῳ ἀντιβολήσαις, ἀλλὰ σάω μὲν ταῦτα, σάω δʼ ἐμέ· σοὶ γὰρ ἐγώ γε εὔχομαι ὥς τε θεῷ καί σευ φίλα γούναθʼ ἱκάνω. καί μοι τοῦτʼ ἀγόρευσον ἐτήτυμον, ὄφρʼ ἐῢ εἰδῶ· τίς γῆ, τίς δῆμος, τίνες ἀνέρες ἐγγεγάασιν; ἦ πού τις νήσων εὐδείελος, ἦέ τις ἀκτὴ κεῖθʼ ἁλὶ κεκλιμένη ἐριβώλακος ἠπείροιο;'

In [21]:
# get all pos tags in order to know what to keep and what to exclude
set([str(w.pos) for w in s.passage.cltk for s in speeches_subset])

{'adjective',
 'adverb',
 'coordinating_conjunction',
 'determiner',
 'interjection',
 'noun',
 'numeral',
 'pronoun',
 'subordinating_conjunction',
 'verb'}

In [22]:
keep_pos_tags = ['verb', 'pronoun', 'noun', 'proper_noun', 'adjective', 'adverb']
" ".join([w.lemma for w in s.passage.cltk if str(w.pos) in keep_pos_tags])

'φίλη σύ πρῶτος κιχάνω οὗδος χῶρος χαίρω μή ἐγώ τὶς κακός νόος ἀντιβολέω σύ μέν οὗτος σῴζω δʼ ἐγώ σύ γάρ ἐγώ γε εὔχομαι ὡς θεός σύ φίλος γούνα ἱκάνω ἐγώ οὗτος ἀγορεύω τυμόω ἐγώ οἶδα τίς γῆ τίς δῆμος τίς ἀνήρ ἐγγάγω ἦ πού τὶς νῆσος εὐδείελος ἐγώ ἀκτή κεῖος ἅλς κλίνω ἐριβάλλω ἀπείρω'

## Speeches to Dataframe

In [23]:
len(speeches_subset)

1371

In [32]:
speeches_subset[0].getSpkrString()

'Odysseus'

In [35]:
speeches_subset[0].getAddrString()

'Athena'

In [45]:
def make_speeches_table(speeches) -> pd.DataFrame:

    keep_pos_tags = ['verb', 'pronoun', 'noun', 'proper_noun']

    return pd.DataFrame(dict(
        speech_id = speech.id,
        group = f"speech-{'-'.join(sorted([tag['type'] for tag in speech._attributes['tags']]))}",
        length = len([w for w in speech.passage.text]) if speech.passage else None,
        passage_urn = speech.urn,
        speaker = speech.getSpkrString(),
        addressee = speech.getAddrString(),
        lemmatised_text = " ".join([w.lemma for w in speech.passage.cltk]) if speech.passage else None,
        lemmatised_filtered_text = " ".join([w.lemma for w in speech.passage.cltk if str(w.pos) in keep_pos_tags]) if speech.passage else None,
        text = speech.passage.text if speech.passage else None,
        label = f"{speech.author.name}, {speech.work.title} {speech.urn.split(':')[-1]}",
        dices_tags = '|'.join(sorted([tag['type'] for tag in speech._attributes['tags']]))
    ) for speech in speeches)

In [46]:
speeches_df = make_speeches_table(speeches_subset)

In [47]:
speeches_df

Unnamed: 0,speech_id,group,length,passage_urn,speaker,addressee,lemmatised_text,lemmatised_filtered_text,text,label,dices_tags
0,995,speech-que,354.0,urn:cts:greekLit:tlg0012.tlg002.perseus-grc2:1...,Odysseus,Athena,ὦ φίλη ἐπεί σύ πρῶτος κιχάνω οὗδος εἷς χῶρος χ...,φίλη σύ κιχάνω χῶρος χαίρω ἐγώ νόος ἀντιβολέω ...,"ὦ φίλʼ, ἐπεί σε πρῶτα κιχάνω τῷδʼ ἐνὶ χώρῳ, χα...","Homer, Odyssey 13.228-13.235",que
1,996,speech-inf-res,576.0,urn:cts:greekLit:tlg0012.tlg002.perseus-grc2:1...,Athena,Odysseus,νήπιος εἰς ὦ ξένος ἤ τηλόθεν εἰλήλουθος εἰ δή ...,ξένος γαῖα ἀναιρέω νώνυμος οἶδα μιν ἠμεῖς ὅσος...,"νήπιός εἰς, ὦ ξεῖνʼ, ἢ τηλόθεν εἰλήλουθας, εἰ ...","Homer, Odyssey 13.237-13.249",inf|res
2,997,speech-lam-nar,1372.0,urn:cts:greekLit:tlg0012.tlg002.perseus-grc2:1...,Odysseus,Athena,πυνθάνομαι Ἰθάκης γε καί ἐν Κρήτη εὐρεῖος τηλό...,πυνθάνομαι Ἰθάκης Κρήτη Πόντος εἰληλόω χρῆμα λ...,"πυνθανόμην Ἰθάκης γε καὶ ἐν Κρήτῃ εὐρείῃ, τηλο...","Homer, Odyssey 13.256-13.286",lam|nar
3,998,speech-del,897.0,urn:cts:greekLit:tlg0012.tlg002.perseus-grc2:1...,Athena,Odysseus,κερδαλής κʼ εἰμί καί ἐπίκλοπος ὅς σύ παρέρχομα...,ἐπίκλοπος ὅς σύ παρέρχομαι δόλος θεός ἀντιάζω ...,κερδαλέος κʼ εἴη καὶ ἐπίκλοπος ὅς σε παρέλθοι ...,"Homer, Odyssey 13.291-13.310",del
4,999,speech-del,757.0,urn:cts:greekLit:tlg0012.tlg002.perseus-grc2:1...,Odysseus,Athena,ἀργαλέων σύ τεάομαι γιγνώσκω βροτός ἀντιάζω κα...,σύ τεάομαι γιγνώσκω βροτός ἀντιάζω μαγος ἐπίστ...,"ἀργαλέον σε, θεά, γνῶναι βροτῷ ἀντιάσαντι, καὶ...","Homer, Odyssey 13.312-13.328",del
...,...,...,...,...,...,...,...,...,...,...,...
1366,809,speech-res,87.0,urn:cts:greekLit:tlg0012.tlg002.perseus-grc2:4...,dream,Penelope,οὐ μέν σύ ἐκεῖνος γε διηνεκέως ἀγορεύω ζῶ ὅς γ...,σύ ἀγορεύω ζῶ ὅς γι θνῄσκω ἀνεμώλιον βάζω,"οὐ μέν τοι κεῖνόν γε διηνεκέως ἀγορεύσω, ζώει ...","Homer, Odyssey 4.836-4.837",res
1367,810,speech-per-req,610.0,urn:cts:greekLit:tlg0012.tlg002.perseus-grc2:5...,Athena,Zeus,Ζεύς πατήρ ἠδʼ ἄλλος μάκαρος θεός ἀείν εἰμί μή...,πατήρ θεός πρόφρον σκηπτοῦχος βασιλεύς φρήν οἶ...,"Ζεῦ πάτερ ἠδʼ ἄλλοι μάκαρες θεοὶ αἰὲν ἐόντες, ...","Homer, Odyssey 5.7-5.20",per|req
1368,811,speech-ins,270.0,urn:cts:greekLit:tlg0012.tlg002.perseus-grc2:5...,Zeus,Athena,τέκνον ἐμός ποῖος σύ ἔπος φεύγω ἕρκος ὀδούς οὐ...,τέκνον ποῖος σύ ἔπος φεύγω ἕρκος ὀδούς βουλεύω...,"τέκνον ἐμόν, ποῖόν σε ἔπος φύγεν ἕρκος ὀδόντων...","Homer, Odyssey 5.22-5.27",ins
1369,812,speech-com,613.0,urn:cts:greekLit:tlg0012.tlg002.perseus-grc2:5...,Zeus,Hermes,Ἑρμεία σύ γάρ ὅστις ὁ τʼ ἄλλος πέρ ἄγγελος ἐγώ...,Ἑρμεία σύ ὅστις ἄγγελος ἐγώ νύμφη λέγω βουλή ν...,"Ἑρμεία, σὺ γὰρ αὖτε τά τʼ ἄλλα περ ἄγγελός ἐσσ...","Homer, Odyssey 5.29-5.42",com


## Convert speeches to Passim format

```json
{
    "id": "...",
    "group": "...",
    "text": "...",
}

In [48]:
import json

def speeches_to_passim(speeches_df : pd.DataFrame, json_path : str, lemmatised=False) -> None:

    # transform DataFrame to passim-amenable JSON
    docs = [
        {
            "id" : idx,
            "group": row.group,
            "label": row.label,
            "dices_tags": row.dices_tags,
            "text": row.lemmatised_filtered_text if lemmatised else row.text,
            "raw_text": row.text if lemmatised else None,
            "passage_urn": row.passage_urn
        }
        for idx, row in speeches_df.iterrows()
    ]

    # write passim docs to JSON file
    with open(json_path, 'w', encoding='utf-8') as f:
        for d in docs:
            print(json.dumps(d, ensure_ascii=False), file=f)



In [149]:
# speeches_to_passim(speeches_df, 'test_input.json')

In [53]:
passim_input_path = 'data/input/homeric_speeches_lemmatised.json'

In [54]:
speeches_to_passim(speeches_df[~speeches_df.text.isna()], passim_input_path, lemmatised=True)

## Run Passim

In [161]:
passim_output_path = "data/passim/out_cluster/"
passim_logfile = "data/log/passim.log"

In [162]:
!rm -r {passim_output_path}

In [124]:
# -a : Minimum length of alignment (default: 50)
!seriatim -a 5 {passim_input_path} {passim_output_path} >& {passim_logfile}

^C


## Read Passim's output

In [7]:
import glob, itertools, json
import pandas as pd

# Read one JSON record per line
def read_jsonl_file(f):
  res = []
  for line in f:
    res.append(json.loads(line))
  return res

def read_jsonl(d):
  return list(itertools.chain.from_iterable([read_jsonl_file(open(f)) for f in glob.glob(d + '/*.json')]))

In [4]:
def passim_output_to_dataframe(output_path, dataframe_path, columns_to_keep=['cluster', 'id', 'label', 'dices_tags', 'text', 'raw_text']) -> pd.DataFrame:
    tr_clusters = pd.DataFrame(read_jsonl(output_path))
    print(f'There are {tr_clusters.cluster.unique().size} text reuse clusters in {output_path}')
    output_df = tr_clusters[columns_to_keep]
    output_df.to_csv(dataframe_path)
    return output_df

## Experiments

| Experiment ID     | Passim parameters | Explanation  | \# of extracted clusters |
| ----------- | ----------- |----------|------------|
| `exp0`      | `-n 1 --min-match 1 -a 5` | reused passages consist of at least 1 shared n-grams of size 1 (uni-gram) | 1066 |
| `exp1`| `-n 3 --max-repeat 100`|reused passages consist of at least 5 shared n-grams of size 3 (uni-gram)|23|
| `exp2`|`-n 3 --min-match 3 --max-repeat 100 -w 1`|reused passages consist of at least 3 shared n-grams of size 3 (tri-grams)|24|
| `exp3`|`-n 3 --min-match 3 --max-repeat 100 -a 10`|reused passages consist of at least 3 shared n-grams of size 3, and the aligned passage should be at least 10 characters long (default is `20`)|41|
| `exp4`|`-n 2 --min-match 2 --max-repeat 100 -a 10`|reused passages consist of at least 2 shared n-grams of size 2 (bi-grams), and the aligned passage should be at least 10 characters long (default is `20`)|212|
| `exp5`|`-n 3 --min-match 1 --max-repeat 100 -a 10`|reused passages consist of at least 1 shared n-grams of size 3 (tri-grams), and the aligned passage should be at least 10 characters long (default is `20`)|41|
| `exp6`|`-n 4 --min-match 1 --max-repeat 100 -a 10`|reused passages consist of at least 1 shared n-grams of size 4 (bi-grams), and the aligned passage should be at least 10 characters long (default is `20`)|36|

### Experiment 0 (exp0)

In [205]:
!cat data/passim/exp0/memo.txt

/home/romanell/Documents/impresso/passim/bin/passim -n 1 --min-match 1 -a 5 --max-repeat 100 -w 1 data/input/input_lemmatised.json data/passim/out_cluster


In [207]:
output_path = 'data/passim/exp0/out.json/'
tsv_path = 'data/output/passim_clusters_exp0.csv'
tr_clusters = passim_output_to_dataframe(output_path, tsv_path)

There are 1066 text reuse clusters in data/passim/exp0/out.json/


In [208]:
tr_clusters

Unnamed: 0,cluster,id,label,dices_tags,text,raw_text
0,0,0,"Homer, Iliad 1.37-1.42",pra|req,κλίνω ἐγώ ὅς Χρύση ἀμφιβαίνω Κίλλας Τενέδοιος ...,"κλῦθί μευ ἀργυρότοξʼ, ὃς Χρύσην ἀμφιβέβηκας Κί..."
1,0,1,"Homer, Iliad 10.278-10.282",pra|req,κλίνω ἐγώ αἴγιοχος Ζεύς τέκος ὅς ἐγώ πόνος παρ...,"κλῦθί μευ αἰγιόχοιο Διὸς τέκος, ἥ τέ μοι αἰεὶ ..."
2,0,2,"Homer, Iliad 10.284-10.294",pra|req,κλύω ἐγώ Ζεύς τέκος Ἀτρυτώνη σπεῖον ἐγώ πατήρ ...,κέκλυθι νῦν καὶ ἐμεῖο Διὸς τέκος Ἀτρυτώνη· σπε...
3,0,4,"Homer, Odyssey 1.32-1.43",lam|nar,ὅς πόπος οἷος νυ θεός βροτοί αἰτιόω ἡμεῖς φημί...,"ὢ πόποι, οἷον δή νυ θεοὺς βροτοὶ αἰτιόωνται· ἐ..."
4,0,6,"Homer, Odyssey 1.158-1.177",nar|que,φίλη ἐγώ νεμεσέω ὅστις λέγω μέλει κίθαρις ἀοιδ...,"ξεῖνε φίλʼ, ἦ καί μοι νεμεσήσεαι ὅττι κεν εἴπω..."
...,...,...,...,...,...,...
3777,8589936020,123,"Homer, Odyssey 3.14-3.20",ins,σύ χρή ἐς αἰδώς ἀβαιρέω τοὔνω πόντος ἐπέπλως ὄ...,"Τηλέμαχʼ, οὐ μέν σε χρὴ ἔτʼ αἰδοῦς, οὐδʼ ἠβαιό..."
3778,8589936020,180,"Homer, Odyssey 17.124-17.146",inf|nar,σύ ἐγώ κρύπτω ἔπος ἐπίκειμαι μιν ὅς γι νῆσος ὁ...,"ὢ πόποι, ἦ μάλα δὴ κρατερόφρονος ἀνδρὸς ἐν εὐν..."
3779,8589936118,109,"Homer, Odyssey 14.462-14.506",nar,ἄγω ἀρτύνω ἡγέομαι Ὀδυσεύς Ἀτρεΐδης Μενέλαος ὁ...,"κέκλυθι νῦν, Εὔμαιε καὶ ἄλλοι πάντες ἑταῖροι, ..."
3780,8589936118,145,"Homer, Iliad 19.155-19.183",com|del|ins,ἄγω λαός σκεδάζω δεῖπνον ἀνόχω ἕπλω δῶρον ἄναξ...,"μὴ δʼ οὕτως, ἀγαθός περ ἐών, θεοείκελʼ Ἀχιλλεῦ..."


### Experiment 1 (exp1)

In [183]:
!cat data/passim/exp1/memo.txt

/home/romanell/Documents/impresso/passim/bin/passim -n 3 --max-repeat 100 data/input/input_lemmatised.json data/passim/exp1


In [212]:
output_path = 'data/passim/exp1/out.json/'
tsv_path = 'data/output/passim_clusters_exp1.csv'
tr_clusters_exp1 = passim_output_to_dataframe(output_path, tsv_path)

There are 23 text reuse clusters in data/passim/exp1/out.json/


### Experiment 2 (exp2)

In [191]:
!cat data/passim/exp2/memo.txt

/home/romanell/Documents/impresso/passim/bin/passim -n 3 --min-match 3 --max-repeat 100 -w 1 data/input/input_lemmatised.json data/passim/exp2


In [213]:
output_path = 'data/passim/exp2/out.json/'
tsv_path = 'data/output/passim_clusters_exp2.csv'
tr_clusters_exp2 = passim_output_to_dataframe(output_path, tsv_path)

There are 24 text reuse clusters in data/passim/exp2/out.json/


### Experiment 3 (exp3)

In [192]:
!cat data/passim/exp3/memo.txt

/home/romanell/Documents/impresso/passim/bin/passim -n 3 --min-match 3 --max-repeat 100 -a 10 data/input/input_lemmatised.json data/passim/exp3



In [214]:
output_path = 'data/passim/exp3/out.json/'
tsv_path = 'data/output/passim_clusters_exp3.csv'
tr_clusters_exp3 = passim_output_to_dataframe(output_path, tsv_path)

There are 41 text reuse clusters in data/passim/exp3/out.json/


### Experiment 4 (exp 4)

In [202]:
!cat data/passim/exp4/memo.txt

/home/romanell/Documents/impresso/passim/bin/passim -n 2 --min-match 2 --max-repeat 100 -a 10 data/input/input_lemmatised.json data/passim/exp4


In [210]:
output_path = 'data/passim/exp4/out.json/'
tsv_path = 'data/output/passim_clusters_exp4.csv'
tr_clusters_exp4 = passim_output_to_dataframe(output_path, tsv_path)

There are 212 text reuse clusters in data/passim/exp4/out.json/


### Experiment 5 (exp5)

In [8]:
!cat data/passim/exp5/memo.txt

/home/romanell/Documents/impresso/passim/bin/passim -n 3 --min-match 1 --max-repeat 100 -a 10 data/input/input_lemmatised.json data/passim/exp5



In [9]:
output_path = 'data/passim/exp5/out.json/'
tsv_path = 'data/output/passim_clusters_exp5.csv'
tr_clusters_exp5 = passim_output_to_dataframe(output_path, tsv_path)

There are 41 text reuse clusters in data/passim/exp5/out.json/


## Experiment 5 (exp5)

In [11]:
!cat data/passim/exp6/memo.txt

/home/romanell/Documents/impresso/passim/bin/passim -n 4 --min-match 1 --max-repeat 100 -a 10 data/input/input_lemmatised.json data/passim/exp6



In [10]:
output_path = 'data/passim/exp6/out.json/'
tsv_path = 'data/output/passim_clusters_exp6.csv'
tr_clusters_exp6 = passim_output_to_dataframe(output_path, tsv_path)

There are 36 text reuse clusters in data/passim/exp6/out.json/
