# Repetitions in Homeric speeches: get data

The DICES search interface is found at: http://dices.ub.uni-rostock.de/app/speeches/search/ 

## Retrieving all Homeric speeches from Dices

In [4]:
from dicesapi import DicesAPI
import pandas as pd

# create a connection to DICES
api = DicesAPI(
    dices_api='http://dices.ub.uni-rostock.de/api/',
    logfile='data/log/dices.log'
)

In [3]:
DicesAPI?

[0;31mInit signature:[0m
[0mDicesAPI[0m[0;34m([0m[0;34m[0m
[0;34m[0m    [0mdices_api[0m[0;34m=[0m[0;34m'http://csa20211203-005.uni-rostock.de/api/'[0m[0;34m,[0m[0;34m[0m
[0;34m[0m    [0mcts_api[0m[0;34m=[0m[0;34m'https://scaife-cts.perseus.org/api/cts'[0m[0;34m,[0m[0;34m[0m
[0;34m[0m    [0mlogfile[0m[0;34m=[0m[0;32mNone[0m[0;34m,[0m[0;34m[0m
[0;34m[0m    [0mlogdetail[0m[0;34m=[0m[0;36m2[0m[0;34m,[0m[0;34m[0m
[0;34m[0m    [0mprogress_class[0m[0;34m=[0m[0;32mNone[0m[0;34m,[0m[0;34m[0m
[0;34m[0m[0;34m)[0m[0;34m[0m[0;34m[0m[0m
[0;31mDocstring:[0m      a connection to the DICES API
[0;31mInit docstring:[0m
The __init__ function is called when a class is instantiated. 
It initializes the attributes of the class, and it can take arguments that get passed to it by its parent class. 
In this case, we are using the __init__ function to initialize some attributes in our Dices object.

    self: Refer to the object in

In [5]:
od_speeches = api.getSpeeches(work_title='Odyssey', progress=True)
print('Got', len(od_speeches), 'speeches')

Got 673 speeches


In [6]:
il_speeches = api.getSpeeches(work_title='Iliad', progress=True)
print('Got', len(il_speeches), 'speeches')

Got 698 speeches


In [7]:
speeches = il_speeches + od_speeches

In [8]:
len(speeches)

1371

In [35]:
speeches[0]._attributes['tags']
speeches[0].author
speeches[0].work
speeches[0].cluster

<SpeechCluster 2019: Odyssey 3.69 ff.>

## Filter speeches

The `tag` information is contained in the `dicesapi.Speech` class, yet a bid hidden. 

In [124]:
def filter_speeches_by_tag(speeches_list, tags_to_keep):
    filtered_speeches = []
    for speech in speeches_list:
        matching_tags = list(set([t['type'] for t in speech._attributes['tags']]) & set(tags_to_keep))
        if len(matching_tags) > 0:
            filtered_speeches.append(speech)
    return filtered_speeches

In [125]:
# ⚠️ check with Ombretta which tags exactly correspond to "discorso di istruzione, discorso di recapito"
speeches_subset = filter_speeches_by_tag(speeches, ['vow', 'ins', 'pra', 'mes', 'nar'])

In [126]:
len(speeches_subset)

221

## Fetch text of speeches from Perseus

In [11]:
from dicesapi.text import CtsAPI
from dicesapi.jupyter import NotebookPBar
cts = CtsAPI()

In [20]:
test = speeches[-1]

In [21]:
test.urn

'urn:cts:greekLit:tlg0012.tlg001.perseus-grc2:1.106-1.120'

In [22]:
cts.getPassage(test).text

'μάντι κακῶν οὐ πώ ποτέ μοι τὸ κρήγυον εἶπας· αἰεί τοι τὰ κάκʼ ἐστὶ φίλα φρεσὶ μαντεύεσθαι, ἐσθλὸν δʼ οὔτέ τί πω εἶπας ἔπος οὔτʼ ἐτέλεσσας· καὶ νῦν ἐν Δαναοῖσι θεοπροπέων ἀγορεύεις ὡς δὴ τοῦδʼ ἕνεκά σφιν ἑκηβόλος ἄλγεα τεύχει, οὕνεκʼ ἐγὼ κούρης Χρυσηΐδος ἀγλάʼ ἄποινα οὐκ ἔθελον δέξασθαι, ἐπεὶ πολὺ βούλομαι αὐτὴν οἴκοι ἔχειν· καὶ γάρ ῥα Κλυταιμνήστρης προβέβουλα κουριδίης ἀλόχου, ἐπεὶ οὔ ἑθέν ἐστι χερείων, οὐ δέμας οὐδὲ φυήν, οὔτʼ ἂρ φρένας οὔτέ τι ἔργα. ἀλλὰ καὶ ὧς ἐθέλω δόμεναι πάλιν εἰ τό γʼ ἄμεινον· βούλομʼ ἐγὼ λαὸν σῶν ἔμμεναι ἢ ἀπολέσθαι· αὐτὰρ ἐμοὶ γέρας αὐτίχʼ ἑτοιμάσατʼ ὄφρα μὴ οἶος Ἀργείων ἀγέραστος ἔω, ἐπεὶ οὐδὲ ἔοικε· λεύσσετε γὰρ τό γε πάντες ὅ μοι γέρας ἔρχεται ἄλλῃ.'

In [10]:
def download_speech_text(speeches):
    '''Download the text of the speeches from CTS server, append to speech objects'''
    pbar = NotebookPBar(max=len(speeches), prefix='Downloading text')

    for s in speeches:
        if not hasattr(s, 'passage') or s.passage is None:
            s.passage = cts.getPassage(s)
        pbar.update()
    return speeches

In [11]:
speeches_subset = speeches
len(speeches_subset)

1371

In [12]:
speeches_subset = download_speech_text(speeches_subset)

HBox(children=(IntProgress(value=0, bar_style='info', max=1371), Label(value='Downloading text 0/1371')))

In [13]:
speeches_subset[-1].passage.text

'τοιγὰρ ἐγώ τοι, ξεῖνε, μάλʼ ἀτρεκέως ἀγορεύσω. μήτηρ μέν τέ μέ φησι τοῦ ἔμμεναι, αὐτὰρ ἐγώ γε οὐκ οἶδʼ· οὐ γάρ πώ τις ἑὸν γόνον αὐτὸς ἀνέγνω. ὡς δὴ ἐγώ γʼ ὄφελον μάκαρός νύ τευ ἔμμεναι υἱὸς ἀνέρος, ὃν κτεάτεσσιν ἑοῖς ἔπι γῆρας ἔτετμε. νῦν δʼ ὃς ἀποτμότατος γένετο θνητῶν ἀνθρώπων, τοῦ μʼ ἔκ φασι γενέσθαι, ἐπεὶ σύ με τοῦτʼ ἐρεείνεις.'

In [14]:
len(speeches_subset)

1371

## Process speeches with CLTK

In [15]:
def parse_speech_text(speeches):
    '''Run CLTK NLP pipeline to parse all the speeches'''
    
    pbar = NotebookPBar(max=len(speeches), prefix='Running NLP')

    for s in speeches:
        if not hasattr(s, 'passage') or s.passage is None:
            print('no passage:', s)
        elif not hasattr(s.passage, 'cltk') or s.passage.cltk is None:
            s.passage.runCltkPipeline(remove_punct=True)
        pbar.update()
    return speeches



In [16]:
speeches_subset = parse_speech_text(speeches_subset)

HBox(children=(IntProgress(value=0, bar_style='info', max=1371), Label(value='Running NLP 0/1371')))

no passage: <Speech 931: Odyssey 10.456-10.465>


In [17]:
s = speeches_subset[0]
s.passage.text

'ὢ πόποι αἰγιόχοιο Διὸς τέκος οὐκέτι νῶϊ ὀλλυμένων Δαναῶν κεκαδησόμεθʼ ὑστάτιόν περ; οἵ κεν δὴ κακὸν οἶτον ἀναπλήσαντες ὄλωνται ἀνδρὸς ἑνὸς ῥιπῇ, ὃ δὲ μαίνεται οὐκέτʼ ἀνεκτῶς Ἕκτωρ Πριαμίδης, καὶ δὴ κακὰ πολλὰ ἔοργε.'

In [21]:
# get all pos tags in order to know what to keep and what to exclude
set([str(w.pos) for w in s.passage.cltk for s in speeches_subset])

{'adjective',
 'adverb',
 'coordinating_conjunction',
 'determiner',
 'interjection',
 'noun',
 'numeral',
 'pronoun',
 'subordinating_conjunction',
 'verb'}

In [22]:
keep_pos_tags = ['verb', 'pronoun', 'noun', 'proper_noun', 'adjective', 'adverb']
" ".join([w.lemma for w in s.passage.cltk if str(w.pos) in keep_pos_tags])

'φίλη σύ πρῶτος κιχάνω οὗδος χῶρος χαίρω μή ἐγώ τὶς κακός νόος ἀντιβολέω σύ μέν οὗτος σῴζω δʼ ἐγώ σύ γάρ ἐγώ γε εὔχομαι ὡς θεός σύ φίλος γούνα ἱκάνω ἐγώ οὗτος ἀγορεύω τυμόω ἐγώ οἶδα τίς γῆ τίς δῆμος τίς ἀνήρ ἐγγάγω ἦ πού τὶς νῆσος εὐδείελος ἐγώ ἀκτή κεῖος ἅλς κλίνω ἐριβάλλω ἀπείρω'

## Speeches to Dataframe

In [24]:
len(speeches_subset)

1371

In [25]:
def make_speeches_table(speeches) -> pd.DataFrame:

    keep_pos_tags = ['verb', 'pronoun', 'noun', 'proper_noun', 'adjective', 'adverb']

    return pd.DataFrame(dict(
        speech_id = speech.id,
        group = f"speech-{'-'.join(sorted([tag['type'] for tag in speech._attributes['tags']]))}",
        length = len([w for w in speech.passage.text]) if speech.passage else None,
        passage_urn = speech.urn,
        speaker = speech.getSpkrString(),
        addressee = speech.getAddrString(),
        lemmatised_text = " ".join([w.lemma for w in speech.passage.cltk]) if speech.passage else None,
        lemmatised_filtered_text = " ".join([w.lemma for w in speech.passage.cltk if str(w.pos) in keep_pos_tags]) if speech.passage else None,
        text = speech.passage.text if speech.passage else None,
        label = f"{speech.author.name}, {speech.work.title} {speech.urn.split(':')[-1]}",
        dices_tags = '|'.join(sorted([tag['type'] for tag in speech._attributes['tags']]))
    ) for speech in speeches)

In [26]:
speeches_df = make_speeches_table(speeches_subset)

In [27]:
speeches_df

Unnamed: 0,speech_id,group,length,passage_urn,speaker,addressee,lemmatised_text,lemmatised_filtered_text,text,label,dices_tags
0,223,speech-que,215.0,urn:cts:greekLit:tlg0012.tlg001.perseus-grc2:8...,Hera,Athena,ὅς πόπος αἴγιοχος Ζεύς τέκος οὐκέτι νῶϊ ὀλλύω ...,ὅς πόπος αἴγιοχος Ζεύς τέκος οὐκέτι νῶϊ ὀλλύω ...,ὢ πόποι αἰγιόχοιο Διὸς τέκος οὐκέτι νῶϊ ὀλλυμέ...,"Homer, Iliad 8.352-8.356",que
1,605,speech-del,463.0,urn:cts:greekLit:tlg0012.tlg001.perseus-grc2:2...,Andromache,servants,δεῦτε δύω ἐγώ ἕπεσθος ὁράω ὅστις ἔργον τύω αἰδ...,ἐγώ ἕπεσθος ὁράω ὅστις ἔργον τύω αἰδοία ἑκυρός...,"δεῦτε δύω μοι ἕπεσθον, ἴδωμʼ ὅτινʼ ἔργα τέτυκτ...","Homer, Iliad 22.450-22.459",del
2,224,speech-del-des-lam,995.0,urn:cts:greekLit:tlg0012.tlg001.perseus-grc2:8...,Athena,Hera,καί λίαν οὗτος γε μένος θυμός τʼ ὀλείσειμι χεί...,λίαν οὗτος γε μένος θυμός ὀλείσειμι χείρ Ἀργεῖ...,καὶ λίην οὗτός γε μένος θυμόν τʼ ὀλέσειε χερσὶ...,"Homer, Iliad 8.358-8.380",del|des|lam
3,606,speech-lam-ora,1691.0,urn:cts:greekLit:tlg0012.tlg001.perseus-grc2:2...,Andromache,Hector,Ἕκτορ ἐγώ δύστηνος εἰ ἆρα γείνω αἶμαι ἀμφότερο...,Ἕκτορ ἐγώ δύστηνος εἰ ἆρα γείνω αἶμαι ἀμφότερο...,Ἕκτορ ἐγὼ δύστηνος· ἰῇ ἄρα γεινόμεθʼ αἴσῃ ἀμφό...,"Homer, Iliad 22.477-22.514",lam|ora
4,225,speech-com,450.0,urn:cts:greekLit:tlg0012.tlg001.perseus-grc2:8...,Zeus,Iris,βάσκυς εἶμι Ἶρος ταχύς πάλιν τρέπω μηδʼ ἐάω ἄν...,βάσκυς εἶμι Ἶρος ταχύς πάλιν τρέπω ἐάω ἔρχομαι...,"βάσκʼ ἴθι Ἶρι ταχεῖα, πάλιν τρέπε μηδʼ ἔα ἄντη...","Homer, Iliad 8.399-8.408",com
...,...,...,...,...,...,...,...,...,...,...,...
1366,46,speech-com-del,322.0,urn:cts:greekLit:tlg0012.tlg001.perseus-grc2:2...,Odysseus,Greeks,δαιμόνις ἀτρέμας ἧμαι καί ἄλλος μῦθος ἀκούω ὁ ...,δαιμόνις ἀτρέμας ἧμαι ἄλλος μῦθος ἀκούω ὁ σύ φ...,"δαιμόνιʼ ἀτρέμας ἧσο καὶ ἄλλων μῦθον ἄκουε, οἳ...","Homer, Iliad 2.200-2.206",com|del
1367,39,speech-del-inf-nar,893.0,urn:cts:greekLit:tlg0012.tlg001.perseus-grc2:2...,Agamemnon,leaders of the Greeks (Greek soldiers),κλύω φίλος θεῖός ἐγώ ἐνύπνιον ἔρχομαι ὄνειρος ...,κλύω φίλος θεῖός ἐγώ ἐνύπνιον ἔρχομαι ὄνειρος ...,κλῦτε φίλοι· θεῖός μοι ἐνύπνιον ἦλθεν ὄνειρος ...,"Homer, Iliad 2.56-2.75",del|inf|nar
1368,45,speech-com-per-war,350.0,urn:cts:greekLit:tlg0012.tlg001.perseus-grc2:2...,Odysseus,leaders of the Greeks (Greek soldiers),δαιμόνις οὐ σύ ἔοικα κακός ὡς δειδίζω ἀλλʼ αὐτ...,δαιμόνις οὐ σύ ἔοικα κακός ὡς δειδίζω αὐτός κά...,"δαιμόνιʼ οὔ σε ἔοικε κακὸν ὣς δειδίσσεσθαι, ἀλ...","Homer, Iliad 2.190-2.197",com|per|war
1369,705,speech-inf-que-res,1514.0,urn:cts:greekLit:tlg0012.tlg002.perseus-grc2:1...,Athena,Telemachus,τοιγάρ ἐγώ σύ οὗτος μαγος ἀτρεκέως ἀγορεύω Μέν...,ἐγώ σύ οὗτος μαγος ἀτρεκέως ἀγορεύω Μέντη Ἀγχι...,τοιγὰρ ἐγώ τοι ταῦτα μάλʼ ἀτρεκέως ἀγορεύσω. Μ...,"Homer, Odyssey 1.179-1.212",inf|que|res


In [33]:
speeches_df.to_pickle('data/homeric_speeches_df.pickle')

## Convert speeches to Passim format

```json
{
    "id": "...",
    "group": "...",
    "text": "...",
}

In [4]:
import pandas as pd
import sys
sys.path.append('.')
from lib.utils import *

In [2]:
speeches_df = pd.read_pickle('data/homeric_speeches_df.pickle')

In [8]:
speeches_df.head()

Unnamed: 0,speech_id,group,length,passage_urn,speaker,addressee,lemmatised_text,lemmatised_filtered_text,text,label,dices_tags
0,223,speech-que,215.0,urn:cts:greekLit:tlg0012.tlg001.perseus-grc2:8...,Hera,Athena,ὅς πόπος αἴγιοχος Ζεύς τέκος οὐκέτι νῶϊ ὀλλύω ...,ὅς πόπος αἴγιοχος Ζεύς τέκος οὐκέτι νῶϊ ὀλλύω ...,ὢ πόποι αἰγιόχοιο Διὸς τέκος οὐκέτι νῶϊ ὀλλυμέ...,"Homer, Iliad 8.352-8.356",que
1,605,speech-del,463.0,urn:cts:greekLit:tlg0012.tlg001.perseus-grc2:2...,Andromache,servants,δεῦτε δύω ἐγώ ἕπεσθος ὁράω ὅστις ἔργον τύω αἰδ...,ἐγώ ἕπεσθος ὁράω ὅστις ἔργον τύω αἰδοία ἑκυρός...,"δεῦτε δύω μοι ἕπεσθον, ἴδωμʼ ὅτινʼ ἔργα τέτυκτ...","Homer, Iliad 22.450-22.459",del
2,224,speech-del-des-lam,995.0,urn:cts:greekLit:tlg0012.tlg001.perseus-grc2:8...,Athena,Hera,καί λίαν οὗτος γε μένος θυμός τʼ ὀλείσειμι χεί...,λίαν οὗτος γε μένος θυμός ὀλείσειμι χείρ Ἀργεῖ...,καὶ λίην οὗτός γε μένος θυμόν τʼ ὀλέσειε χερσὶ...,"Homer, Iliad 8.358-8.380",del|des|lam
3,606,speech-lam-ora,1691.0,urn:cts:greekLit:tlg0012.tlg001.perseus-grc2:2...,Andromache,Hector,Ἕκτορ ἐγώ δύστηνος εἰ ἆρα γείνω αἶμαι ἀμφότερο...,Ἕκτορ ἐγώ δύστηνος εἰ ἆρα γείνω αἶμαι ἀμφότερο...,Ἕκτορ ἐγὼ δύστηνος· ἰῇ ἄρα γεινόμεθʼ αἴσῃ ἀμφό...,"Homer, Iliad 22.477-22.514",lam|ora
4,225,speech-com,450.0,urn:cts:greekLit:tlg0012.tlg001.perseus-grc2:8...,Zeus,Iris,βάσκυς εἶμι Ἶρος ταχύς πάλιν τρέπω μηδʼ ἐάω ἄν...,βάσκυς εἶμι Ἶρος ταχύς πάλιν τρέπω ἐάω ἔρχομαι...,"βάσκʼ ἴθι Ἶρι ταχεῖα, πάλιν τρέπε μηδʼ ἔα ἄντη...","Homer, Iliad 8.399-8.408",com


In [9]:
speeches_to_passim(speeches_df[~speeches_df.text.isna()], 'data/input/homeric_speeches_lemmatised.json', lemmatised=True)

In [10]:
speeches_to_passim(speeches_df[~speeches_df.text.isna()], 'data/input/homeric_speeches_raw.json', lemmatised=False)