# Repetitions in Homeric speeches: get data

The DICES search interface is found at: http://dices.ub.uni-rostock.de/app/speeches/search/ 

## TODO

- ✅ get all speeches for Od. and Il. from Dices API
- ✅ filter speeches for relevant tags (oaths, instructions, prayers)
- ✅ create an initial dataframe with relevant data
- ✅ add text fetched from Perseus via CTS URN
- (optional for now) add lemmatised version of text via Pirra
- ✅ save data to JSON lines format suitable for processing with passim

## Dices tags

| Dices tag      | Expansion |
| ----------- | ----------- |
| cha         | Challenge |
| com | Command |
| con | Consolation|
| del | Deliberation|
| des | Desire and Wish|
| exh | Exhortation and Self-Exhortation|
| far | Farewell|
| gre | Greeting and Reception|
| inf | Information and Description|
| ins | Instruction|
| inv | Invitation |
| lam | Lament |
| lau | Praise and Laudation |
| mes | Message|
| nar | Narration|
| ora | Prophecy, Oracular Speech, and Interpretation|
| per | Persuasion|
| pra | Prayer|
| que | Question|
| req | Request|
| res | Reply to Question|
| tau | Taunt|
| thr | Threat|
| und | Undefined|
| vit | Vituperation|
| vow | Promise and Oath|
| war | Warning|
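
For later steps it can be handy to have the table above available as a lookup in code. The following is a minimal sketch: `DICES_TAGS` and `expand_tags` are illustrative names, not part of `dicesapi`.

```python
# Tag codes and their expansions, copied from the table above.
DICES_TAGS = {
    "cha": "Challenge",
    "com": "Command",
    "con": "Consolation",
    "del": "Deliberation",
    "des": "Desire and Wish",
    "exh": "Exhortation and Self-Exhortation",
    "far": "Farewell",
    "gre": "Greeting and Reception",
    "inf": "Information and Description",
    "ins": "Instruction",
    "inv": "Invitation",
    "lam": "Lament",
    "lau": "Praise and Laudation",
    "mes": "Message",
    "nar": "Narration",
    "ora": "Prophecy, Oracular Speech, and Interpretation",
    "per": "Persuasion",
    "pra": "Prayer",
    "que": "Question",
    "req": "Request",
    "res": "Reply to Question",
    "tau": "Taunt",
    "thr": "Threat",
    "und": "Undefined",
    "vit": "Vituperation",
    "vow": "Promise and Oath",
    "war": "Warning",
}


def expand_tags(codes):
    """Return the human-readable expansion for each tag code.

    Unknown codes are passed through unchanged rather than raising.
    """
    return [DICES_TAGS.get(code, code) for code in codes]


print(expand_tags(["ins", "com"]))  # ['Instruction', 'Command']
```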

## Retrieving all Homeric speeches from Dices

In [1]:
from dicesapi import DicesAPI
import pandas as pd

# create a connection to DICES
api = DicesAPI(logfile='dices.log')

In [2]:
od_speeches = api.getSpeeches(work_title='Odyssey', progress=True)
print('Got', len(od_speeches), 'speeches')

Got 673 speeches


In [3]:
il_speeches = api.getSpeeches(work_title='Iliad', progress=True)
print('Got', len(il_speeches), 'speeches')

Got 698 speeches


In [4]:
speeches = il_speeches + od_speeches

In [5]:
len(speeches)

1371

## Filter speeches

In [20]:
unique_tags = set()

for speech in speeches:
    for tag in speech._attributes['tags']:
        unique_tags.add(tag['type'])

print(unique_tags)

{'req', 'ora', 'per', 'gre', 'tau', 'lam', 'que', 'vow', 'pra', 'war', 'und', 'con', 'nar', 'inv', 'lau', 'exh', 'ins', 'mes', 'inf', 'res', 'thr', 'com', 'del', 'cha', 'far', 'des', 'vit'}


The tag information is stored on each `dicesapi.Speech` object, although it is a bit hidden: it lives in the private `_attributes['tags']` list.
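
Because the tags sit in a private attribute, a small accessor can keep the rest of the notebook readable. The sketch below assumes only what the cells in this notebook show, namely that `speech._attributes['tags']` is a list of dicts with a `'type'` key. `get_tag_types` and the `FakeSpeech` stand-in are hypothetical names used for illustration, not part of `dicesapi`.

```python
from collections import Counter


def get_tag_types(speech):
    """Return the list of tag codes attached to a speech."""
    return [tag['type'] for tag in speech._attributes['tags']]


class FakeSpeech:
    """Stand-in mimicking the attribute layout of a DICES speech."""

    def __init__(self, tag_types):
        self._attributes = {
            'tags': [{'type': t, 'doubt': False} for t in tag_types]
        }


# Count how often each tag occurs across a few demo speeches.
demo = [FakeSpeech(['ins', 'com']), FakeSpeech(['pra']), FakeSpeech(['pra', 'req'])]
tag_counts = Counter(t for s in demo for t in get_tag_types(s))
print(tag_counts)
```

With the real data, the same counter could be built over `speeches` to see how frequent each speech type is before filtering.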

In [8]:
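def filter_speeches_by_tag(speeches_list, tags_to_keep):
    """Keep the speeches that carry at least one of the given tags."""
    tags_to_keep = set(tags_to_keep)
    filtered_speeches = []
    for speech in speeches_list:
        speech_tags = {tag['type'] for tag in speech._attributes['tags']}
        # Append each speech once, even if several of its tags match.
        if speech_tags & tags_to_keep:
            filtered_speeches.append(speech)
    return filtered_speeches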

In [18]:
speeches_subset = filter_speeches_by_tag(speeches, ['vow', 'ins', 'pra'])

In [19]:
len(speeches_subset)

123

## Fetch text of speeches from Perseus

In [26]:
from dicesapi.text import CtsAPI
cts = CtsAPI()

In [36]:
for speech in speeches_subset:
    text = cts.getPassage(speech)
    speech.passage = text

In [37]:
speeches_subset[0]._attributes['tags']

[{'type': 'ins', 'doubt': False}, {'type': 'com', 'doubt': False}]

In [39]:
speeches_subset[-1].passage.text

'ὦ γέρον αἰεί τοι μῦθοι φίλοι ἄκριτοί εἰσιν, ὥς ποτʼ ἐπʼ εἰρήνης· πόλεμος δʼ ἀλίαστος ὄρωρεν. ἤδη μὲν μάλα πολλὰ μάχας εἰσήλυθον ἀνδρῶν, ἀλλʼ οὔ πω τοιόνδε τοσόνδέ τε λαὸν ὄπωπα· λίην γὰρ φύλλοισιν ἐοικότες ἢ ψαμάθοισιν ἔρχονται πεδίοιο μαχησόμενοι προτὶ ἄστυ. Ἕκτορ σοὶ δὲ μάλιστʼ ἐπιτέλλομαι, ὧδε δὲ ῥέξαι· πολλοὶ γὰρ κατὰ ἄστυ μέγα Πριάμου ἐπίκουροι, ἄλλη δʼ ἄλλων γλῶσσα πολυσπερέων ἀνθρώπων· τοῖσιν ἕκαστος ἀνὴρ σημαινέτω οἷσί περ ἄρχει, τῶν δʼ ἐξηγείσθω κοσμησάμενος πολιήτας.'

In [40]:
len(speeches_subset)

123
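
Fetching this many passages over the network can fail part-way through. A defensive variant of the fetching loop above might look like the following sketch. It relies only on the `getPassage(speech)` call used above; `fetch_passages` and the stub classes are hypothetical, and catching a bare `Exception` is an assumption, since the error types the CTS client may raise are not documented here.

```python
def fetch_passages(speeches, cts):
    """Attach a passage to each speech; return the speeches that failed."""
    failed = []
    for speech in speeches:
        try:
            passage = cts.getPassage(speech)
        except Exception as err:  # assumption: exact error types unknown
            print(f'Could not fetch passage: {err}')
            passage = None
        speech.passage = passage
        if passage is None:
            failed.append(speech)
    return failed


# Stubs so the sketch runs without network access.
class StubSpeech:
    def __init__(self, urn):
        self.urn = urn
        self.passage = None


class StubCts:
    def getPassage(self, speech):
        if speech.urn.endswith('bad'):
            raise ValueError(f'no text for {speech.urn}')
        return f'text of {speech.urn}'


stub_speeches = [StubSpeech('urn:ok'), StubSpeech('urn:bad')]
stub_failed = fetch_passages(stub_speeches, StubCts())
print(len(stub_failed), 'of', len(stub_speeches), 'passages missing')
```

In the notebook this would be called as `fetch_passages(speeches_subset, cts)`, and any speeches returned could be retried or dropped before building the Passim input.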

## Convert speeches to Passim format

```json
{
    "id": "...",
    "group": "...",
    "text": "..."
}
```

In [81]:
docs = [
    {
        "id": f"speech_{n+1}",
        "group": "speeches",
        "passage_urn": speech.urn,
        "text": speech.passage.text,
        "label": f"{speech.author.name}, {speech.work.title} {speech.urn.split(':')[-1]}",
        "dices_tags": " | ".join([tag['type'] for tag in speech._attributes['tags']])
    }
    for n, speech in enumerate(speeches_subset)
]

In [82]:
docs[0]

{'id': 'speech_1',
 'group': 'speeches',
 'passage_urn': 'urn:cts:greekLit:tlg0012.tlg001.perseus-grc2:16.492-16.502',
 'text': 'Γλαῦκε πέπον πολεμιστὰ μετʼ ἀνδράσι νῦν σε μάλα χρὴ αἰχμητήν τʼ ἔμεναι καὶ θαρσαλέον πολεμιστήν· νῦν τοι ἐελδέσθω πόλεμος κακός, εἰ θοός ἐσσι. πρῶτα μὲν ὄτρυνον Λυκίων ἡγήτορας ἄνδρας πάντῃ ἐποιχόμενος Σαρπηδόνος ἀμφιμάχεσθαι· αὐτὰρ ἔπειτα καὶ αὐτὸς ἐμεῦ πέρι μάρναο χαλκῷ. σοὶ γὰρ ἐγὼ καὶ ἔπειτα κατηφείη καὶ ὄνειδος ἔσσομαι ἤματα πάντα διαμπερές, εἴ κέ μʼ Ἀχαιοὶ τεύχεα συλήσωσι νεῶν ἐν ἀγῶνι πεσόντα. ἀλλʼ ἔχεο κρατερῶς, ὄτρυνε δὲ λαὸν ἅπαντα. ὣς ἄρα μιν εἰπόντα τέλος θανάτοιο κάλυψεν',
 'label': 'Homer, Iliad 16.492-16.502',
 'dices_tags': 'ins | com'}

In [83]:
import json
with open('in.json', 'w', encoding='utf-8') as f:
  for d in docs:
    print(json.dumps(d, ensure_ascii=False), file=f)
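
Before handing the file to Passim, it may be worth checking that every line parses and carries the expected fields. This is a minimal sketch: `check_jsonl` is a hypothetical helper, and the required field set is taken from the JSON template in this section rather than from Passim's documentation.

```python
import json
import os
import tempfile

REQUIRED_FIELDS = ('id', 'group', 'text')  # assumption: fields from the template


def check_jsonl(path, required=REQUIRED_FIELDS):
    """Return (line_number, problem) pairs found in a JSON lines file."""
    problems = []
    seen_ids = set()
    with open(path, encoding='utf-8') as f:
        for n, line in enumerate(f, start=1):
            try:
                doc = json.loads(line)
            except json.JSONDecodeError:
                problems.append((n, 'invalid JSON'))
                continue
            if not isinstance(doc, dict):
                problems.append((n, 'not a JSON object'))
                continue
            for field in required:
                if not doc.get(field):
                    problems.append((n, f'missing or empty {field}'))
            if doc.get('id') in seen_ids:
                problems.append((n, 'duplicate id'))
            seen_ids.add(doc.get('id'))
    return problems


# Demo on a temporary file with one good and one faulty record.
demo_docs = [
    {'id': 'speech_1', 'group': 'speeches', 'text': 'μῆνιν ἄειδε θεά'},
    {'id': 'speech_1', 'group': 'speeches', 'text': ''},
]
with tempfile.NamedTemporaryFile('w', suffix='.json', delete=False,
                                 encoding='utf-8') as tmp:
    for record in demo_docs:
        print(json.dumps(record, ensure_ascii=False), file=tmp)

demo_problems = check_jsonl(tmp.name)
os.remove(tmp.name)
print(demo_problems)  # [(2, 'missing or empty text'), (2, 'duplicate id')]
```

Running `check_jsonl('in.json')` on the real file should return an empty list if all speeches were fetched successfully.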

## Run Passim

In [124]:
# -a : Minimum length of alignment (default: 50)
!seriatim -a 5 in.json out_cluster >& out_cluster.err


## Read Passim's output

In [120]:
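import glob
import itertools
import json

import pandas as pd

def read_jsonl_file(path):
    """Read one JSON record per line from the file at `path`."""
    # The context manager closes the file once it has been read.
    with open(path, encoding='utf-8') as f:
        return [json.loads(line) for line in f if line.strip()]

def read_jsonl(d):
    """Read and concatenate the records of all *.json files in directory `d`."""
    paths = sorted(glob.glob(d + '/*.json'))
    return list(itertools.chain.from_iterable(read_jsonl_file(p) for p in paths))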

In [121]:
output_path = 'out_cluster/out.json/'
tr_clusters = pd.DataFrame(read_jsonl(output_path))

In [122]:
print(f'There are {tr_clusters.cluster.unique().size} text reuse clusters in {output_path}')

There are 10 text reuse clusters in out_cluster/out.json/


In [123]:
tr_clusters[:2]

|   | uid | cluster | begin | end | boiler | src | size | pboiler | dices_tags | group | id | label | passage_urn | text |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1158869053340225277 | 1 | 176 | 261 | False | [{'uid': -2005171627764057613, 'begin': 147, '... | 4 | 0.0 | pra \| req | speeches | speech_107 | Homer, Odyssey 9.528-9.535 | urn:cts:greekLit:tlg0012.tlg002.perseus-grc2:9... | οἱ μοῖρʼ ἐστὶ φίλους τʼ ἰδέειν καὶ ἱκέσθαι οἶ... |
| 1 | -2005171627764057613 | 1 | 142 | 225 | False | [] | 4 | 0.0 | res \| ins | speeches | speech_78 | Homer, Odyssey 4.472-4.480 | urn:cts:greekLit:tlg0012.tlg002.perseus-grc2:4... | ι πρὶν μοῖρα φίλους τʼ ἰδέειν καὶ ἱκέσθαι οἶκο... |
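
To see which speeches share a repetition, the records can be grouped by their `cluster` value. The sketch below does this in plain Python on a list of dicts shaped like the rows above; `group_by_cluster` is a hypothetical helper, and the two demo records are abbreviated copies of the rows shown.

```python
from collections import defaultdict


def group_by_cluster(records):
    """Map each cluster id to the labels of the speeches it contains."""
    clusters = defaultdict(list)
    for rec in records:
        clusters[rec['cluster']].append(rec['label'])
    return dict(clusters)


demo_records = [
    {'cluster': 1, 'id': 'speech_107', 'label': 'Homer, Odyssey 9.528-9.535'},
    {'cluster': 1, 'id': 'speech_78', 'label': 'Homer, Odyssey 4.472-4.480'},
]
for cluster_id, labels in group_by_cluster(demo_records).items():
    print(cluster_id, '->', '; '.join(labels))
# 1 -> Homer, Odyssey 9.528-9.535; Homer, Odyssey 4.472-4.480
```

With the real output, the helper could be fed `tr_clusters.to_dict('records')`.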


In [98]:
columns_to_keep = ['cluster', 'id', 'label', 'dices_tags', 'text']
df = tr_clusters[columns_to_keep]

In [104]:
df.to_csv('passim_clusters.csv')