# Repetitions in Homeric speeches: get data

The DICES search interface is found at: http://dices.ub.uni-rostock.de/app/speeches/search/ 

## TODO

- ✅ get all speeches for Od. and Il. from Dices API
- ✅ filter speeches for relevant tags (oaths, instructions, prayers)
- ✅ create an initial dataframe with relevant data
- ✅ add text fetched from Perseus via CTS URN
- (optional for now) add lemmatised version of text via Pirra
- ✅ save data to JSON lines format suitable for processing with passim

## Dices tags

| Dices tag      | Expansion |
| ----------- | ----------- |
| cha         | Challenge |
| com | Command |
| con | Consolation|
| del | Deliberation|
| des | Desire and Wish|
| exh | Exhortation and Self-Exhortation|
| far | Farewell|
| gre | Greeting and Reception|
| inf | Information and Description|
| ins | Instruction|
| inv | Invitation |
| lam | Lament |
| lau | Praise and Laudation |
| mes | Message|
| nar | Narration|
| ora | Prophecy, Oracular Speech, and Interpretation|
| per | Persuasion|
| pra | Prayer|
| que | Question|
| req | Request|
| res | Reply to Question|
| tau | Taunt|
| thr | Threat|
| und | Undefined|
| vit | Vituperation|
| vow | Promise and Oath|
| war | Warning|
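
The table above can also be kept as a lookup dictionary, which is handy for turning tag strings such as `pra|req` into readable names. This is only a sketch: `DICES_TAGS` and `expand_tags` are hypothetical names introduced here, not part of `dicesapi`, and the mapping is copied by hand from the table.

```python
# Tag codes and expansions, copied from the table above.
DICES_TAGS = {
    "cha": "Challenge",
    "com": "Command",
    "con": "Consolation",
    "del": "Deliberation",
    "des": "Desire and Wish",
    "exh": "Exhortation and Self-Exhortation",
    "far": "Farewell",
    "gre": "Greeting and Reception",
    "inf": "Information and Description",
    "ins": "Instruction",
    "inv": "Invitation",
    "lam": "Lament",
    "lau": "Praise and Laudation",
    "mes": "Message",
    "nar": "Narration",
    "ora": "Prophecy, Oracular Speech, and Interpretation",
    "per": "Persuasion",
    "pra": "Prayer",
    "que": "Question",
    "req": "Request",
    "res": "Reply to Question",
    "tau": "Taunt",
    "thr": "Threat",
    "und": "Undefined",
    "vit": "Vituperation",
    "vow": "Promise and Oath",
    "war": "Warning",
}


def expand_tags(dices_tags, sep="|"):
    """Turn a string such as 'pra|req' into a list of readable tag names."""
    codes = [code.strip() for code in dices_tags.split(sep)]
    # Unknown codes are reported rather than silently dropped.
    return [DICES_TAGS.get(code, f"unknown tag: {code}") for code in codes]


print(expand_tags("pra|req"))  # ['Prayer', 'Request']
```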

## Retrieving all Homeric speeches from Dices

In [157]:
from dicesapi import DicesAPI
import pandas as pd

# create a connection to DICES
api = DicesAPI(logfile='data/log/dices.log')

In [120]:
od_speeches = api.getSpeeches(work_title='Odyssey', progress=True)
print('Got', len(od_speeches), 'speeches')

Got 673 speeches


In [121]:
il_speeches = api.getSpeeches(work_title='Iliad', progress=True)
print('Got', len(il_speeches), 'speeches')

Got 698 speeches


In [122]:
speeches = il_speeches + od_speeches

In [123]:
len(speeches)

1371

## Filter speeches

The `tag` information is contained in the `dicesapi.Speech` class, but it is a bit hidden: it sits in the private `_attributes` dictionary, under the key `tags`.

In [124]:
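def filter_speeches_by_tag(speeches_list, tags_to_keep):
    '''Keep only the speeches that carry at least one of the given Dices tags'''
    tags_to_keep = set(tags_to_keep)
    filtered_speeches = []
    for speech in speeches_list:
        # each tag is a dict; its code (e.g. 'pra') is stored under 'type'
        speech_tags = {t['type'] for t in speech._attributes['tags']}
        if speech_tags & tags_to_keep:
            filtered_speeches.append(speech)
    return filtered_speeches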

In [125]:
# ⚠️ check with Ombretta which tags exactly correspond to "instruction speech" and "message-delivery speech" (discorso di istruzione, discorso di recapito)
speeches_subset = filter_speeches_by_tag(speeches, ['vow', 'ins', 'pra', 'mes', 'nar'])

In [126]:
len(speeches_subset)

221
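
To see what the filter does without contacting DICES, it can be tried on a few stand-in objects. The sketch below assumes only what the filtering code relies on, namely that `_attributes['tags']` is a list of dictionaries with a `type` key; `fake_speech` is a hypothetical helper, not a `dicesapi` class, and the filter is restated in compact form so that the block runs on its own.

```python
from types import SimpleNamespace


def filter_speeches_by_tag(speeches_list, tags_to_keep):
    """Keep speeches that carry at least one of the requested tags."""
    wanted = set(tags_to_keep)
    return [
        s for s in speeches_list
        if wanted & {t["type"] for t in s._attributes["tags"]}
    ]


def fake_speech(speech_id, *tag_types):
    """Hypothetical stand-in for a speech, with only the fields used here."""
    return SimpleNamespace(
        id=speech_id,
        _attributes={"tags": [{"type": t} for t in tag_types]},
    )


toy_speeches = [
    fake_speech(1, "pra", "req"),  # kept: has 'pra'
    fake_speech(2, "lam"),         # dropped: no matching tag
    fake_speech(3, "que", "nar"),  # kept: has 'nar'
]

kept = filter_speeches_by_tag(toy_speeches, ["vow", "ins", "pra", "mes", "nar"])
print([s.id for s in kept])  # [1, 3]
```

A speech is kept as soon as one of its tags matches, so speeches with mixed tags such as `pra|req` end up in the subset as well.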

## Fetch text of speeches from Perseus

In [127]:
from dicesapi.text import CtsAPI
from dicesapi.jupyter import NotebookPBar
cts = CtsAPI()

In [128]:
def download_speech_text(speeches):
    '''Download the text of the speeches from CTS server, append to speech objects'''
    pbar = NotebookPBar(max=len(speeches), prefix='Downloading text')

    for s in speeches:
        if not hasattr(s, 'passage') or s.passage is None:
            s.passage = cts.getPassage(s)
        pbar.update()
    return speeches

In [129]:
speeches_subset = download_speech_text(speeches_subset)


In [130]:
speeches_subset[-1].passage.text

'οὕτως οὔ τίς οἱ νεμεσήσεται οὐδʼ ἀπιθήσει Ἀργείων, ὅτε κέν τινʼ ἐποτρύνῃ καὶ ἀνώγῃ.'

In [131]:
len(speeches_subset)

221

## Process speeches with CLTK

In [132]:
def parse_speech_text(speeches):
    '''Run CLTK NLP pipeline to parse all the speeches'''
    
    pbar = NotebookPBar(max=len(speeches), prefix='Running NLP')

    for s in speeches:
        if not hasattr(s, 'passage') or s.passage is None:
            print('no passage:', s)
        elif not hasattr(s.passage, 'cltk') or s.passage.cltk is None:
            s.passage.runCltkPipeline(remove_punct=True)
        pbar.update()
    return speeches



In [133]:
speeches_subset = parse_speech_text(speeches_subset)


In [134]:
s = speeches_subset[0]
s.passage.text

'κλῦθί μευ ἀργυρότοξʼ, ὃς Χρύσην ἀμφιβέβηκας Κίλλάν τε ζαθέην Τενέδοιό τε ἶφι ἀνάσσεις, Σμινθεῦ εἴ ποτέ τοι χαρίεντʼ ἐπὶ νηὸν ἔρεψα, ἢ εἰ δή ποτέ τοι κατὰ πίονα μηρίʼ ἔκηα ταύρων ἠδʼ αἰγῶν, τὸ δέ μοι κρήηνον ἐέλδωρ· τίσειαν Δαναοὶ ἐμὰ δάκρυα σοῖσι βέλεσσιν.'

In [135]:
keep_pos_tags = ['verb', 'pronoun', 'noun', 'proper_noun']
" ".join([w.lemma for w in s.passage.cltk if str(w.pos) in keep_pos_tags])

'κλίνω ἐγώ ὅς Χρύση ἀμφιβαίνω Κίλλας Τενέδοιος ἶφις ἀνίσσημι Σμινθεύς σύ χαρίω ναός ῥέπω σύ πίων μηρίνω καάω ταῦρος ἤδος αἰγός ἐγώ κρήνον ἐέλδω τίνω δάκρυον βέλεσσις'

## Speeches to Dataframe

In [136]:
len(speeches_subset)

221

In [144]:
def make_speeches_table(speeches) -> pd.DataFrame:
    '''Build one row per speech: identifiers, Dices tags, raw and lemmatised text'''

    keep_pos_tags = ['verb', 'pronoun', 'noun', 'proper_noun']

    def sorted_tags(speech):
        # tag codes (e.g. 'pra') are stored in the private `_attributes` dict
        return sorted(tag['type'] for tag in speech._attributes['tags'])

    return pd.DataFrame([dict(
        speech_id = speech.id,
        group = f"speech-{'-'.join(sorted_tags(speech))}",
        length = len(speech.passage.text),  # length in characters
        passage_urn = speech.urn,
        lemmatised_text = " ".join([w.lemma for w in speech.passage.cltk]),
        lemmatised_filtered_text = " ".join([w.lemma for w in speech.passage.cltk if str(w.pos) in keep_pos_tags]),
        text = speech.passage.text,
        label = f"{speech.author.name}, {speech.work.title} {speech.urn.split(':')[-1]}",
        dices_tags = '|'.join(sorted_tags(speech))
    ) for speech in speeches])
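
The `label` column relies on the last colon-separated component of the CTS URN being the passage reference. A minimal sketch of that decomposition follows, assuming the usual `urn:cts:<namespace>:<work>:<passage>` layout. `split_cts_urn` is a hypothetical helper, not part of `dicesapi`, and the example URN is assembled by hand from the Perseus Iliad prefix and the passage `1.37-1.42`, so treat it as illustrative.

```python
def split_cts_urn(urn):
    """Split a CTS URN into namespace, work and passage (with start and end)."""
    parts = urn.split(":")
    if len(parts) != 5 or parts[:2] != ["urn", "cts"]:
        raise ValueError(f"not a CTS URN with a passage component: {urn!r}")
    _, _, namespace, work, passage = parts
    # A passage is either a single reference ('23.770') or a range ('1.37-1.42').
    start, _, end = passage.partition("-")
    return {
        "namespace": namespace,
        "work": work,
        "passage": passage,
        "start": start,
        "end": end or start,
    }


# Illustrative URN (assumption: assembled by hand, see the paragraph above).
example = split_cts_urn("urn:cts:greekLit:tlg0012.tlg001.perseus-grc2:1.37-1.42")
print(example["work"], example["start"], example["end"])
```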

In [145]:
speeches_df = make_speeches_table(speeches_subset)

In [146]:
speeches_df

Unnamed: 0,speech_id,group,length,passage_urn,lemmatised_text,lemmatised_filtered_text,text,label,dices_tags
0,3,speech-pra-req,256,urn:cts:greekLit:tlg0012.tlg001.perseus-grc2:1...,κλίνω ἐγώ ἀργυρός ὅς Χρύση ἀμφιβαίνω Κίλλας τε...,κλίνω ἐγώ ὅς Χρύση ἀμφιβαίνω Κίλλας Τενέδοιος ...,"κλῦθί μευ ἀργυρότοξʼ, ὃς Χρύσην ἀμφιβέβηκας Κί...","Homer, Iliad 1.37-1.42",pra|req
1,270,speech-pra-req,217,urn:cts:greekLit:tlg0012.tlg001.perseus-grc2:1...,κλίνω ἐγώ αἴγιοχος Ζεύς τέκος ὅς τε ἐγώ ἀεί ἐν...,κλίνω ἐγώ αἴγιοχος Ζεύς τέκος ὅς ἐγώ πόνος παρ...,"κλῦθί μευ αἰγιόχοιο Διὸς τέκος, ἥ τέ μοι αἰεὶ ...","Homer, Iliad 10.278-10.282",pra|req
2,271,speech-pra-req,475,urn:cts:greekLit:tlg0012.tlg001.perseus-grc2:1...,κλύω νῦν καί ἐγώ Ζεύς τέκος Ἀτρυτώνη σπεῖον ἐγ...,κλύω ἐγώ Ζεύς τέκος Ἀτρυτώνη σπεῖον ἐγώ πατήρ ...,κέκλυθι νῦν καὶ ἐμεῖο Διὸς τέκος Ἀτρυτώνη· σπε...,"Homer, Iliad 10.284-10.294",pra|req
3,644,speech-pra-req,44,urn:cts:greekLit:tlg0012.tlg001.perseus-grc2:2...,κλίνω θεά ἀγαθός ἐγώ ἐπίρροθος ἔρχομαι ποδοῖι,κλίνω θεά ἐγώ ἐπίρροθος ἔρχομαι ποδοῖι,"κλῦθι θεά, ἀγαθή μοι ἐπίρροθος ἐλθὲ ποδοῖιν.","Homer, Iliad 23.770-23.770",pra|req
4,699,speech-lam-nar,531,urn:cts:greekLit:tlg0012.tlg002.perseus-grc2:1...,ὅς πόπος οἷος δή νυ θεός βροτοί αἰτιόω ἐκ ἡμεῖ...,ὅς πόπος οἷος νυ θεός βροτοί αἰτιόω ἡμεῖς φημί...,"ὢ πόποι, οἷον δή νυ θεοὺς βροτοὶ αἰτιόωνται· ἐ...","Homer, Odyssey 1.32-1.43",lam|nar
...,...,...,...,...,...,...,...,...,...
216,243,speech-del-nar-per,7409,urn:cts:greekLit:tlg0012.tlg001.perseus-grc2:9...,εἰ μέν δή νόστος γε μετά φρής φαίδιμʼ Ἀχιλλῆς ...,νόστος φρής φαίδιμʼ Ἀχιλλῆς βάλλω ἀμύνω ναῦς θ...,εἰ μὲν δὴ νόστόν γε μετὰ φρεσὶ φαίδιμʼ Ἀχιλλεῦ...,"Homer, Iliad 9.434-9.605",del|nar|per
217,1236,speech-lau-nar,1211,urn:cts:greekLit:tlg0012.tlg002.perseus-grc2:2...,χαίρω πατήρ ὦ ξένος γίγνομαι σύ εἰς πέρ ὀπίσσω...,χαίρω πατήρ ξένος γίγνομαι σύ ὀπίσσω ὄλβος ἔχω...,"χαῖρε, πάτερ ὦ ξεῖνε· γένοιτό τοι ἔς περ ὀπίσσ...","Homer, Odyssey 20.199-20.225",lau|nar
218,1237,speech-vow,367,urn:cts:greekLit:tlg0012.tlg002.perseus-grc2:2...,βουκόω ἐπεί οὔτε κακός οὔτʼ ἄφρων φῶς ἔοικα γι...,βουκόω φῶς ἔοικα γιγνώσκω ὅς σύ φρήν ἵκω τοῦνο...,"βουκόλʼ, ἐπεὶ οὔτε κακῷ οὔτʼ ἄφρονι φωτὶ ἔοικα...","Homer, Odyssey 20.227-20.234",vow
219,248,speech-mes-res,699,urn:cts:greekLit:tlg0012.tlg001.perseus-grc2:9...,Ἀτρεΐδης κύδιστος ἄναξ ἀνήρ Ἀγάμεμνος ἐκεῖνος ...,Ἀτρεΐδης ἄναξ ἀνήρ Ἀγάμεμνος γι ἐθέλω σβέννυμι...,Ἀτρεΐδη κύδιστε ἄναξ ἀνδρῶν Ἀγάμεμνον κεῖνός γ...,"Homer, Iliad 9.677-9.692",mes|res


## Convert speeches to Passim format

```json
{
    "id": "...",
    "group": "...",
    "text": "..."
}
```

In [151]:
import json

def speeches_to_passim(speeches_df : pd.DataFrame, json_path : str, lemmatised=False) -> None:

    # transform DataFrame to passim-amenable JSON
    docs = [
        {
            "id" : idx,
            "group": row.group,
            "label": row.label,
            "dices_tags": row.dices_tags,
            "text": row.lemmatised_filtered_text if lemmatised else row.text,
            "raw_text": row.text if lemmatised else None,
            "passage_urn": row.passage_urn
        }
        for idx, row in speeches_df.iterrows()
    ]

    # write passim docs to JSON file
    with open(json_path, 'w', encoding='utf-8') as f:
        for d in docs:
            print(json.dumps(d, ensure_ascii=False), file=f)
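
# Before handing the file to passim it can be worth reading it back and
# checking that every line is a JSON object with the three fields listed in
# the format above. This is only a sketch: `check_passim_input` and
# `REQUIRED_FIELDS` are hypothetical names introduced here, and the demo
# writes a toy file to a temporary directory instead of touching `data/input/`.

```python
import json
import os
import tempfile

REQUIRED_FIELDS = ("id", "group", "text")


def check_passim_input(json_path):
    """Return the number of records; raise ValueError on an incomplete one."""
    count = 0
    with open(json_path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            missing = [k for k in REQUIRED_FIELDS if k not in record]
            if missing:
                raise ValueError(f"line {line_no}: missing fields {missing}")
            count += 1
    return count


# Toy records in the same shape as the documents written above.
toy_docs = [
    {"id": 0, "group": "speech-pra-req", "text": "κλῦθι θεά, ἀγαθή μοι ἐπίρροθος ἐλθὲ ποδοῖιν."},
    {"id": 1, "group": "speech-vow", "text": "..."},
]

with tempfile.TemporaryDirectory() as tmp:
    toy_path = os.path.join(tmp, "toy_input.json")
    with open(toy_path, "w", encoding="utf-8") as f:
        for d in toy_docs:
            print(json.dumps(d, ensure_ascii=False), file=f)
    n_records = check_passim_input(toy_path)

print(n_records)  # 2
```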



In [149]:
# speeches_to_passim(speeches_df, 'test_input.json')

In [159]:
passim_input_path = 'data/input/input_lemmatised.json'

In [160]:
speeches_to_passim(speeches_df, passim_input_path, lemmatised=True)

## Run Passim

In [161]:
passim_output_path = "data/passim/out_cluster/"
passim_logfile = "data/log/passim.log"

In [162]:
# remove the output of previous runs (-f: no error if the directory does not exist yet)
!rm -rf {passim_output_path}

In [124]:
# -a : Minimum length of alignment (default: 50)
!seriatim -a 5 {passim_input_path} {passim_output_path} >& {passim_logfile}

^C


## Read Passim's output

In [120]:
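import glob, itertools, json
import pandas as pd

# Read one JSON record per line; the file is closed as soon as it has been read
def read_jsonl_file(path):
    with open(path, encoding='utf-8') as f:
        return [json.loads(line) for line in f]

# Read all the JSON lines files in directory `d` into a single list of records
def read_jsonl(d):
    return list(itertools.chain.from_iterable(read_jsonl_file(path) for path in glob.glob(d + '/*.json')))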

In [121]:
output_path = 'out_cluster/out.json/'
tr_clusters = pd.DataFrame(read_jsonl(output_path))

In [122]:
print(f'There are {tr_clusters.cluster.unique().size} text reuse clusters in {output_path}')

There are 10 text reuse clusters in out_cluster/out.json/


In [123]:
tr_clusters[:2]

Unnamed: 0,uid,cluster,begin,end,boiler,src,size,pboiler,dices_tags,group,id,label,passage_urn,text
0,1158869053340225277,1,176,261,False,"[{'uid': -2005171627764057613, 'begin': 147, '...",4,0.0,pra | req,speeches,speech_107,"Homer, Odyssey 9.528-9.535",urn:cts:greekLit:tlg0012.tlg002.perseus-grc2:9...,οἱ μοῖρʼ ἐστὶ φίλους τʼ ἰδέειν καὶ ἱκέσθαι οἶ...
1,-2005171627764057613,1,142,225,False,[],4,0.0,res | ins,speeches,speech_78,"Homer, Odyssey 4.472-4.480",urn:cts:greekLit:tlg0012.tlg002.perseus-grc2:4...,ι πρὶν μοῖρα φίλους τʼ ἰδέειν καὶ ἱκέσθαι οἶκο...
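
To get a quick overview of which speeches share a reused passage, the records can be grouped by their `cluster` value. The sketch below works on plain dictionaries with `cluster` and `label` keys, as in the two rows shown above; `labels_by_cluster` is a hypothetical helper and only those two rows are used as toy input.

```python
from collections import defaultdict


def labels_by_cluster(records):
    """Map each cluster id to the labels of the passages it contains."""
    clusters = defaultdict(list)
    for record in records:
        clusters[record["cluster"]].append(record["label"])
    return dict(clusters)


# Toy input: the two rows displayed above, reduced to the fields needed here.
toy_records = [
    {"cluster": 1, "id": "speech_107", "label": "Homer, Odyssey 9.528-9.535"},
    {"cluster": 1, "id": "speech_78", "label": "Homer, Odyssey 4.472-4.480"},
]

for cluster_id, labels in labels_by_cluster(toy_records).items():
    print(cluster_id, labels)
```

With the full output, the same function could be called on the list of records before it is turned into a DataFrame.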


In [98]:
columns_to_keep = ['cluster', 'id', 'label', 'dices_tags', 'text']
df = tr_clusters[columns_to_keep]

In [104]:
df.to_csv('passim_clusters.csv')