# Processing Greek corpora for the riddle solver

<div><br/><img src="delphic_sibyl.png" style="width: 50%; border: 1px solid #ddd; padding: 5px" alt="Michelangelo’s Delphic Sibyl, Sistine Chapel" /></div>
<div><center><i>Michelangelo's Delphic Sibyl, Sistine Chapel</i></center><br/></div>

The [Pseudo-Sibylline](https://en.wikipedia.org/wiki/Sibylline_Oracles) oracles
are hexametric poems written in Ancient Greek. These *oracula* were mainly
composed between 150 BC and 700 AD and are collected in twelve distinct extant
books. They circulated widely and were quite famous among the Judaeo-Christian
community of that time.

They should not, however, be confused with the earlier
[Sibylline books](https://en.wikipedia.org/wiki/Sibylline_Books). The Sibylline
books contained religious and ceremonial advice that was consulted by selected
priests and curators of the Roman state when it was in deep political trouble.
The original collection of the Sibylline books was destroyed through various
accidents and deliberate actions over the course of history.

The Pseudo-Sibylline oracles, on the other hand, contain a Jewish narrative of
human history contrasted with Greek mythology and with the chronology of the
other great ancient empires. Another purpose of the material is to support
evolving Christian doctrine and the interpretation of the prophecies. The
prophecies were mostly grounded in Jewish literature, but surprisingly some
pagan world events also came to be interpreted as signs of the coming Messiah.
The Sibyl, as a prophetess and a child of Noah in the Pseudo-Sibylline lore, is
a unique character who crosses the usual borders between several ancient
religions and art.

Good introductions to the Pseudo-Sibylline oracles can be found in these two
books:

1. Sibylline Oracles in [The Old Testament Pseudepigrapha, Volume 
   I](https://books.google.fi/books?id=TNdeolWctsQC) by J. J. Collins
2. Part 1 in [The Book Three of the Sibylline Oracles and Its Social 
   Setting](https://books.google.fi/books?id=Zqh8ZQZqnWYC) by Rieuwerd Buitenwerf

Some passages in the Pseudo-Sibylline oracles contain cryptic puzzles that refer
to persons, cities, countries, and epithets of God, for example. These secretive
references are often very general in nature, pointing only to the first letter
of the subject and its numerical value. Solving them requires not so much
mathematical or cryptographic skill in the modern sense as a proper knowledge
of the context, both textual and historical.

Most of the alphanumeric riddles in the oracles can be regarded as already
solved by various researchers. Some of the riddles are still problematic and
open to better proposals. Better yet, a few of these open riddles are complex
and specific enough that one may try to solve them with modern programming tools.

As an independent researcher not affiliated with any organization, my sole
motivation and purpose in chapters one and two is to provide a reusable and
testable method for processing and analysing ancient corpora, especially for
detecting alphanumeric patterns in text.

## Natural language processing

A programmatic approach to solving the riddles requires a huge Greek text
corpus: the bigger, the better. I will download and preprocess the available
open source Greek corpora, which is quite a daunting task for many reasons. My
programming language of choice is [Python](http://python.org), because it has
plenty of good and stable open source libraries for this kind of work. Python is
also widely used in academic and scientific circles and is well suited to
research projects.

I have left most of the overly technical details of these chapters for
enthusiasts to read straight from the commented code in the
[functions.py](https://git.io/vAS2Z) script. Collecting most of the procedures
into a separate script also keeps this document more concise.

At the end of the first chapter, I'll have a word database containing hundreds
of thousands of unique Greek words extracted from corpora of naturally written
language. These words can then be used by the riddle solver in the second
chapter.

Note that, rather than just reading them, you can also run this and the
following chapters interactively in your local
[Jupyter notebook](https://jupyter.org/) installation, if you prefer. That way
you can test and verify the procedure, or alter the parameters and try solving
the riddles with your own settings.

You can download independent Jupyter notebooks for [processing corpora](https://git.io/vASwM), 
[solving riddles](https://git.io/vASrY), and [analysing results](https://).

You may also run the code directly from a
[Python shell](https://www.python.org/shell/) environment.

## Required components

The first subtask is to get a large body of raw Ancient Greek text to work with.
In this chapter I have implemented an importer interface, using the
[tqdm](https://github.com/tqdm/tqdm) library, to the
[Perseus](http://www.perseus.tufts.edu/hopper/opensource/download) and
[First1KGreek](http://opengreekandlatin.github.io/First1KGreek/) open source
data sources.

I'm using my own [Abnum](https://github.com/markomanninen/abnum3) library to
remove accents from the Greek words, to remove non-alphabetical characters from
the corpora, and to calculate the isopsephical value of the Greek words. The
[Greek_accentuation](https://github.com/jtauber/greek-accentuation) library is
used to split words into syllables. This is required because the riddles I'm
most interested in contain specific information about the syllables of the
words. The [Pandas](http://pandas.pydata.org/) library is used as an API
(application programming interface) to the collected database. The
[Plotly](https://plot.ly/) library and online infographic service are used for
the visual presentation of the statistics.
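To make the idea of accent removal and isopsephy concrete, here is a minimal, self-contained sketch. It is not the `Abnum` implementation: the `simplify` and `isopsephy` helper names are hypothetical, and the letter values are simply the conventional Greek alphabetic numerals, with stigma/digamma, qoppa, and sampi added.

```python
import unicodedata

# Conventional Greek alphabetic numeral values (an assumption of this sketch,
# not copied from Abnum), including the archaic numeral letters.
GREEK_VALUES = {
    "Α": 1, "Β": 2, "Γ": 3, "Δ": 4, "Ε": 5, "Ϛ": 6, "Ζ": 7, "Η": 8, "Θ": 9,
    "Ι": 10, "Κ": 20, "Λ": 30, "Μ": 40, "Ν": 50, "Ξ": 60, "Ο": 70, "Π": 80,
    "Ϙ": 90, "Ϟ": 90, "Ρ": 100, "Σ": 200, "Τ": 300, "Υ": 400, "Φ": 500,
    "Χ": 600, "Ψ": 700, "Ω": 800, "Ϡ": 900,
}


def simplify(word):
    """Strip accents and breathings, then convert the word to uppercase."""
    decomposed = unicodedata.normalize("NFD", word)
    plain = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return plain.upper()


def isopsephy(word):
    """Sum the numeral values of the letters; unknown characters count as 0."""
    return sum(GREEK_VALUES.get(ch, 0) for ch in simplify(word))


print(simplify("καὶ"), isopsephy("καὶ"))        # ΚΑΙ 31
print(simplify("Ἰησοῦς"), isopsephy("Ἰησοῦς"))  # ΙΗΣΟΥΣ 888
```

Accents are stripped before uppercasing so that characters such as an alpha with iota subscript do not expand into two letters during case conversion.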

You can install these libraries by uncommenting and running the next install lines in 
the Jupyter notebook:

In [1]:
import sys

#!{sys.executable} -m pip install tqdm abnum
#!{sys.executable} -m pip install pandas plotly
#!{sys.executable} -m pip install greek_accentuation

For your convenience, my environment is the following:

In [2]:
print("Python %s" % sys.version)

Python 3.6.1 |Anaconda 4.4.0 (64-bit)| (default, May 11 2017, 13:25:24) [MSC v.1900 64 bit (AMD64)]


Note that `Python 3.4+` is required for all examples to work properly. For other
ways of installing libraries maintained on PyPI, please consult:
https://packaging.python.org/tutorials/installing-packages/

### Downloading corpora

I'm going to use the `Perseus` and `OpenGreekAndLatin` corpora for the study by
combining them into a single raw text file and a database of unique words.

The next code snippets will download hundreds of megabytes of Greek text to the
local computer for quicker access. The `tqdm` based downloader requires a stable
internet connection to work properly.

Alternatively, you can download the source zip files with a browser and place
them in the same directory as the Jupyter notebook, or in the directory where
Python is run in shell mode. The zip files must then be renamed to `perseus.zip`
and `first1k.zip`.

1\. Download packed zip files from their GitHub repositories:

In [3]:
from functions import download_with_indicator, perseus_zip_file, first1k_zip_file
# download perseus files
fs = "https://github.com/PerseusDL/canonical-greekLit/archive/master.zip"
download_with_indicator(fs, perseus_zip_file)
# download first1k files
fs = "https://github.com/OpenGreekAndLatin/First1KGreek/archive/master.zip"
download_with_indicator(fs, first1k_zip_file)

2\. Unzip files to the corresponding directories:

In [4]:
from functions import perseus_zip_dir, first1k_zip_dir, unzip
# first argument is the zip source, second is the destination dir
unzip(perseus_zip_file, perseus_zip_dir)
unzip(first1k_zip_file, first1k_zip_dir)

3\. Copy only the suitable Greek text XML files from `perseus_zip_dir` and
`first1k_zip_dir` to the temporary work directories. The original repositories
contain a lot of files that the riddle solver does not need; these are skipped
in this process.

In [5]:
from functions import copy_corpora, joinpaths, perseus_tmp_dir, first1k_tmp_dir
# the important files reside in the data directory of the repositories
for item in [[joinpaths(perseus_zip_dir,
              ["canonical-greekLit-master", "data"]), perseus_tmp_dir],
             [joinpaths(first1k_zip_dir,
              ["First1KGreek-master", "data"]), first1k_tmp_dir]]:
    copy_corpora(*item)

greek_text_perseus_tmp already exists. Either remove it and run again, or just use the old one.
greek_text_first1k_tmp already exists. Either remove it and run again, or just use the old one.


The output may differ depending on whether the files have already been downloaded.
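The unzip and copy steps above can be sketched with the standard library alone. This is only an illustration, not the code in `functions.py`: the `unzip_archive` and `copy_greek_xml` names are hypothetical, and selecting files by a `grc` marker in the file name is an assumption based on names such as `tlg0019.tlg007.perseus-grc2.xml`. The real selection rule may differ.

```python
import shutil
import zipfile
from pathlib import Path


def unzip_archive(zip_path, dest_dir):
    """Extract every member of the zip archive into dest_dir."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)


def copy_greek_xml(src_dir, dest_dir, marker="grc"):
    """Copy XML files whose name contains the marker into one flat directory.

    The marker is an assumed naming convention, not a documented rule.
    Returns the names of the copied files.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in sorted(Path(src_dir).rglob("*.xml")):
        if marker in path.name:
            shutil.copy2(path, dest / path.name)
            copied.append(path.name)
    return copied


# Hypothetical usage, assuming perseus.zip sits in the working directory:
# unzip_archive("perseus.zip", "perseus_unzipped")
# copy_greek_xml("perseus_unzipped/canonical-greekLit-master/data", "perseus_tmp")
```

Flattening everything into one directory keeps the sketch short; files with identical names in different source directories would overwrite each other.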

### Collecting files

When the files have been downloaded and copied, it is time to read them into
RAM (random-access memory). At this point the file paths are collected into the
`greek_corpora_x` variable, which is iterated over in the later steps.

In [6]:
from functions import init_corpora, perseus_dir, first1k_dir
# collect files and initialize data dictionary
greek_corpora_x = init_corpora([[perseus_tmp_dir, perseus_dir], [first1k_tmp_dir, first1k_dir]])
print(len(greek_corpora_x), "files found")

1705 files found


The actual number of files found may grow over time, because the Greek corpora
repositories are constantly maintained and new texts are added by volunteer
contributors.

### Processing files

The next step is to extract the Greek content from the downloaded and selected
XML source files. In NLP this task usually takes a lot of effort. The Python
[NLTK](https://www.nltk.org/) and [CLTK](https://github.com/cltk/cltk) libraries
would be useful at this point, but in my case I'm only interested in Greek words,
that is, text content that falls within a certain
[Greek Unicode](https://en.wikipedia.org/wiki/Greek_alphabet#Greek_in_Unicode)
letter block. Thus I'm able to simplify this part by removing all other
characters from the source files. Again, details can be found in the
[functions.py](https://git.io/vAS2Z) script.
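The "keep only Greek letters" idea can be sketched as follows. This is a rough illustration under my own assumptions, not the actual routine in `functions.py`: the `is_greek_letter` and `extract_greek_words` names are hypothetical, and the code ranges are the Unicode "Greek and Coptic" (U+0370 to U+03FF) and "Greek Extended" (U+1F00 to U+1FFF) blocks.

```python
import unicodedata


def is_greek_letter(ch):
    """True for letters in the Greek and Coptic or Greek Extended blocks."""
    code = ord(ch)
    in_greek_block = 0x0370 <= code <= 0x03FF or 0x1F00 <= code <= 0x1FFF
    return in_greek_block and ch.isalpha()


def extract_greek_words(text):
    """Replace everything except Greek letters with spaces, then split."""
    # Compose first so that accents typed as combining marks stay attached
    # to their base letter instead of splitting the word.
    composed = unicodedata.normalize("NFC", text)
    cleaned = "".join(ch if is_greek_letter(ch) else " " for ch in composed)
    return cleaned.split()


sample = "<l>ὦ ξύμμαχοι γυναῖκες, ἐκθεῖτ ἔνδοθεν</l> (Lysistrata)"
print(extract_greek_words(sample))
```

XML tags, Latin text, digits, and punctuation all fall outside the two blocks and are dropped. Note that the first block also contains Coptic letters, which this sketch would let through.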

The extracted content is saved to directories based on the `corpora/author/work`
structure. A simplified uncial conversion is also made at the same time, so that
the final data contains only plain uppercase words separated by spaces. This is
pretty much the format written by the ancient Greeks, except that they did not
even use spaces to separate individual words and phrases.

<div><br/><img src="P47.png" style="width: 75%; border: 0px solid #ddd; padding: 5px" alt="Papyrus 47, Uncial Greek text without spaces" /></div>
<div><center><i>Papyrus 47, Uncial Greek text without spaces. Rev 13:17-</i></center><br/></div>

This will take several minutes, depending on whether you have already run it
once and have the previous temporary directories available. The old processed
corpora files are removed first, then they are recreated by calling the
`process_greek_corpora` function.

In [7]:
from functions import remove, all_greek_text_file, perseus_greek_text_file, \
                      first1k_greek_text_file, process_greek_corpora
# remove old processed temporary files
try:
    remove(all_greek_text_file)
    remove(perseus_greek_text_file)
    remove(first1k_greek_text_file)
except OSError:
    pass
# process the greek corpora and load the data into RAM
# one could use a filter to process only selected files here...
#greek_corpora = process_greek_corpora(list(filter(lambda x: "aristot.nic.eth_gk.xml" in x['file'], greek_corpora_x)))
greek_corpora = process_greek_corpora(greek_corpora_x)

## Statistics

After the files have been downloaded and preprocessed, I'm going to output the
size of them:

In [8]:
from functions import get_file_size

print("Size of the all raw text: %s MB" % get_file_size(all_greek_text_file))
print("Size of the perseus raw text: %s MB" % get_file_size(perseus_greek_text_file))
print("Size of the first1k raw text: %s MB" % get_file_size(first1k_greek_text_file))

Size of the all raw text: 347.85 MB
Size of the perseus raw text: 107.5 MB
Size of the first1k raw text: 240.35 MB


Then, I will calculate other statistics of the saved text files to compare their
content:

In [9]:
from functions import get_stats

ccontent1, chars1, lwords1 = get_stats(perseus_greek_text_file)
ccontent2, chars2, lwords2 = get_stats(first1k_greek_text_file)
ccontent3, chars3, lwords3 = get_stats(all_greek_text_file)

Corpora: perseus_greek_text_files.txt
Letters: 51411752
Words in total: 9900720
Unique words: 423428

Corpora: first1k_greek_text_files.txt
Letters: 114403613
Words in total: 23215849
Unique words: 670149

Corpora: all_greek_text_files.txt
Letters: 165815365
Words in total: 33116569
Unique words: 833817



## Letter statistics

I'm using the `DataFrame` class from the `Pandas` library to handle tabular data
and to show basic letter statistics for each corpus and for their combination.
Python's native `Counter` class is used to count the unique elements in a given
sequence. The sequence in this case is the raw Greek text stripped of all
special characters and spaces, and the elements are the letters of the Greek
alphabet.

This will take some time to process too:

In [10]:
from functions import Counter, DataFrame
# perseus dataframe
df = DataFrame([[k, v] for k, v in Counter(ccontent1).items()])
df[2] = df[1].apply(lambda x: round(x*100/chars1, 2))
a = df.sort_values(1, ascending=False)
# first1k dataframe
df = DataFrame([[k, v] for k, v in Counter(ccontent2).items()])
df[2] = df[1].apply(lambda x: round(x*100/chars2, 2))
b = df.sort_values(1, ascending=False)
# perseus + first1k dataframe
df = DataFrame([[k, v] for k, v in Counter(ccontent3).items()])
df[2] = df[1].apply(lambda x: round(x*100/chars3, 2))
c = df.sort_values(1, ascending=False)

The first column is the letter, the second column is the count of the letter,
and the third column is the percentage of the letter relative to all letters.
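The same three columns can be reproduced on a toy string with the standard library alone. This is only a small sketch of the counting and percentage logic; the `letter_table` helper is a hypothetical name and returns plain tuples instead of a `DataFrame`.

```python
from collections import Counter


def letter_table(text):
    """Return (letter, count, percent) rows, most frequent letter first."""
    letters = text.replace(" ", "")
    total = len(letters)
    counts = Counter(letters)
    return [
        (letter, count, round(count * 100 / total, 2))
        for letter, count in counts.most_common()
    ]


for row in letter_table("ΚΑΙ ΤΑ ΤΩΝ"):
    print(row)
```

Spaces are removed before counting so that the percentages refer to letters only, as in the tables of this chapter.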

In [11]:
from functions import display_side_by_side
# show tables side by side to save some vertical space
display_side_by_side(Perseus=a, First1K=b, Perseus_First1K=c)

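Perseus:

| Letter | Count | Percent |
|--------|-------|---------|
| Α | 5636525 | 10.96 |
| Ε | 4934559 | 9.6 |
| Ο | 4928002 | 9.59 |
| Ι | 4872354 | 9.48 |
| Ν | 4537851 | 8.83 |
| Τ | 3924588 | 7.63 |
| Σ | 3824160 | 7.44 |
| Υ | 2407552 | 4.68 |
| Ρ | 1977236 | 3.85 |
| Η | 1885144 | 3.67 |

First1K:

| Letter | Count | Percent |
|--------|-------|---------|
| Α | 12595597 | 11.01 |
| Ο | 11190640 | 9.78 |
| Ι | 11143471 | 9.74 |
| Ε | 10786804 | 9.43 |
| Ν | 9826695 | 8.59 |
| Τ | 9506378 | 8.31 |
| Σ | 8213894 | 7.18 |
| Υ | 5498854 | 4.81 |
| Η | 4394923 | 3.84 |
| Ρ | 4302171 | 3.76 |

Perseus_First1K:

| Letter | Count | Percent |
|--------|-------|---------|
| Α | 18232122 | 11.0 |
| Ο | 16118642 | 9.72 |
| Ι | 16015825 | 9.66 |
| Ε | 15721363 | 9.48 |
| Ν | 14364546 | 8.66 |
| Τ | 13430966 | 8.1 |
| Σ | 12038054 | 7.26 |
| Υ | 7906406 | 4.77 |
| Η | 6280067 | 3.79 |
| Ρ | 6279407 | 3.79 |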


The `First1K` corpus contains mathematical texts in Greek, which explains why the rarely used digamma (Ϛ = 6), qoppa (Ϟ/Ϙ = 90), and sampi (Ϡ = 900) letters are included in the table. You can find other interesting differences too, such as the relative occurrence of E/T, K/Π, and M/Λ, which are probably explained by the different text genres included in the corpora.

#### Bar chart

The next chart will show visually which are the most used letters and the least
used letters in the available Ancient Greek corpora.

<img src="stats.png" />


The vowels, together with the consonants `N`, `S`, and `T`, stand out as the most used letters. The least used letters are `Z`, `Chi`, and `Psi`.

#### Optional live chart

Uncomment the next part to output a new fresh graph from Plotly:

In [12]:
#from plotly.offline import init_notebook_mode
#init_notebook_mode(connected=False)

# for the first time set plotly service credentials, then you can comment the next line
#import plotly
#plotly.tools.set_credentials_file(username='MarkoManninen', api_key='xyz')

# use tables and graphs...
#import plotly.tools as tls
# embed plotly graphs
#tls.embed("https://plot.ly/~MarkoManninen/8/")

### Unique words database

Now it is time to collect the unique Greek words into the database and show
some notable features of the word statistics. I'm reusing data from the
`greek_corpora` variable that is already in memory. Running the next code will
take a minute or two depending on the processor speed of your computer:

In [13]:
from functions import syllabify, Abnum, greek, vowels

# greek abnum object for calculating isopsephical value
g = Abnum(greek)

# let's count the unique word statistics from the parsed greek corpora rather than the plain text file
# it would be pretty daunting to find out the occurrences of all 800000+ unique words from the text
# file that is hundreds of megabytes in size!
unique_word_stats = {}
for item in greek_corpora:
    for word, cnt in item['uwords'].items():
        if word not in unique_word_stats:
            unique_word_stats[word] = 0
        unique_word_stats[word] += cnt

# init dataframe
df = DataFrame([[k, v] for k, v in unique_word_stats.items()])
# add column for the occurrence percentage of the word
df[2] = df[1].apply(lambda x: round(x*100/lwords3, 2))
# add column for the length of the word
df[3] = df[0].apply(lambda x: len(x))
# add isopsephy column
df[4] = df[0].apply(lambda x: g.value(x))
# add syllabified column
df[5] = df[0].apply(lambda x: syllabify(x))
# add length of the syllables column
df[6] = df[5].apply(lambda x: len(x))
# count vowels in the word as a column
df[7] = df[0].apply(lambda x: sum(list(x.count(c) for c in vowels)))
# count consonants in the word as a column
df[8] = df[0].apply(lambda x: len(x) - sum(list(x.count(c) for c in vowels)))

### Store database

This is the single most important part of the chapter. I'm saving all the
simplified unique words to a CSV file that can be used as a database for the
riddle solver. After this you may proceed to the
[riddle solver](https://git.io/vASrY) Jupyter notebook in interactive mode, if
you prefer.

In [14]:
from functions import csv_file_name
# save dataframe to CSV file
df.to_csv(csv_file_name, header=False, index=False, encoding='utf-8')

Note that the stored words are not stems or base forms but include the words in
all the inflected forms found in the corpora. Due to the nature of machine
processed texts, one should also expect corrupted words and other noise in the
results. Programming tools are good for extracting interesting content and
filtering data that would be impossible for a human to handle because of its
enormous size. But the results still need verification and interpretation, and
the procedures can be fine-tuned and developed in many ways.
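As a rough idea of how the stored CSV could be queried later, the sketch below reads the file back with the standard `csv` module and filters words by isopsephical value and syllable count. The column order follows the dataframe columns 0 to 8 built above, and the file has no header row. The `COLUMNS` labels, the `find_words` helper, and the file name in the usage comment are hypothetical; the riddle solver itself may work quite differently.

```python
import csv

# Labels for the nine unnamed columns, in the order the dataframe was built.
COLUMNS = ["word", "count", "percent", "length", "isopsephy",
           "syllables", "syllable_count", "vowels", "consonants"]


def find_words(csv_path, isopsephy, syllable_count=None):
    """Return the words with the given isopsephy and optional syllable count."""
    matches = []
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle, fieldnames=COLUMNS):
            if int(row["isopsephy"]) != isopsephy:
                continue
            if (syllable_count is not None
                    and int(row["syllable_count"]) != syllable_count):
                continue
            matches.append(row["word"])
    return matches


# Hypothetical usage; the real path is held by csv_file_name in functions.py:
# find_words("unique_words.csv", 888, syllable_count=3)
```

With `pandas` the same filter is a boolean mask on a dataframe; the plain `csv` version is used here only to keep the sketch free of dependencies.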

### Most repeated words

To confirm that the task succeeded, I will show the total number of unique
words and the five most repeated words in the database:

In [15]:
from functions import display_html
# use to_html and index=False to hide index column and output table
words = df.sort_values(1, ascending=False).head(n=5).iloc[:,0:3]
words.columns = ['Word', 'Count', 'Percent']
print("Total records: %s" % len(df))
display_html(words.to_html(index=False), raw=True)

Total records: 833817


| Word | Count | Percent |
|------|-------|---------|
| ΚΑΙ | 1781716 | 5.38 |
| ΔΕ | 778652 | 2.35 |
| ΤΟ | 671023 | 2.03 |
| ΤΩΝ | 487065 | 1.47 |
| Η | 483443 | 1.46 |


ΚΑΙ ("and") is by far the most repeated word.

Out of curiosity, let's also see the longest words in the database:

In [16]:
from functions import HTML
# load result to the temporary variable for later usage
l = df.sort_values(3, ascending=False).head(n=20)
l = l[[0, 1, 3]]
l.columns = ['Word', 'Count', 'Length']
# output table
HTML(l.to_html(index=False))

| Word | Count | Length |
|------|-------|--------|
| ΠΑΡΕΓΕΝΟΜΕΝΟΜΕΝΟΣΗΝΚΑΙΕΤΙΕΚΤΗΣΛΕΣΒΟΥΟΥΦΑΜΕΝ | 1 | 43 |
| ΛΛΗΣΤΗΣΑΝΩΘΕΝΘΕΡΜΟΤΗΤΟΣΑΤΜΙΔΟΥΜΕΝΟΝΦΕΡΕΤΑΙ | 1 | 42 |
| ΕΜΟΥΟΙΑΠΕΦΕΥΓΑΧΕΙΡΑΣΛΥΠΗΣΑΣΜΕΝΟΥΔΕΝΑΟΥΔΕΝ | 1 | 41 |
| ΠΥΡΟΒΡΟΜΟΛΕΥΚΕΡΕΒΙΝΘΟΑΚΑΝΘΙΔΟΜΙΚΡΙΤΡΙΑΔΥ | 1 | 40 |
| ΔΥΝΑΤΟΝΔΕΤΟΑΙΤΙΑΙΗΣΓΕΝΕΣΕΩΣΚΑΙΤΗΣΦΘΟΡΑΣ | 1 | 39 |
| ΠΥΡΒΡΟΜΟΛΕΥΚΕΡΕΒΙΝΘΟΑΚΑΝΘΟΥΜΙΚΤΡΙΤΥΑΔΥ | 1 | 38 |
| ΚΑΙΙΚΕΛΗΧΡΥΣΗΑΦΡΟΔΙΤΗΚΑΙΟΙΣΕΚΟΣΜΗΣΕ | 1 | 35 |
| ΚΑΙΤΟΝΑΡΙΣΤΑΡΧΟΝΑΣΜΕΝΩΣΤΗΝΓΡΑΦΗΝΤΟΥ | 1 | 35 |
| ΕΝΝΕΑΚΑΙΕΙΚΟΣΙΚΑΙΕΠΤΑΚΟΣΙΟΠΛΑΣΙΑΚΙΣ | 1 | 35 |
| ΑΡΣΕΝΙΚΩΝΟΝΟΜΑΤΩΝΣΤΟΙΧΕΙΑΕΣΤΙΠΕΝΤΕ | 1 | 34 |


How about finding out which words have the biggest isopsephical values?

In [17]:
# sort by the isopsephy column
m = df.sort_values(4, ascending=False).head(n=20)
m = m[[0, 1, 4]]
m.columns = ['Word', 'Count', 'Isopsephy']
# output table
HTML(m.to_html(index=False))

| Word | Count | Isopsephy |
|------|-------|-----------|
| ΛΕΟΝΤΑΤΥΦΛΩΣΩΝΣΚΩΛΩΨΔΕΤΟΥ | 1 | 6865 |
| ΟΡΘΡΟΦΟΙΤΟΣΥΚΟΦΑΝΤΟΔΙΚΟΤΑΛΑΙΠΩΡΩΝ | 1 | 5186 |
| ΒΡΥΣΩΝΟΘΡΑΣΥΜΑΧΕΙΟΛΗΨΙΚΕΡΜΑΤΩΝ | 2 | 5122 |
| ΟΡΘΟΦΟΙΤΟΣΥΚΟΦΑΝΤΟΔΙΚΟΤΑΛΑΙΠΩΡΩΝ | 2 | 5086 |
| ΓΛΩΣΣΟΤΟΜΗΘΕΝΤΩΝΧΡΙΣΤΙΑΝΩΝ | 1 | 5056 |
| ΚΑΙΤΟΝΑΡΙΣΤΑΡΧΟΝΑΣΜΕΝΩΣΤΗΝΓΡΑΦΗΝΤΟΥ | 1 | 4969 |
| ΑΡΣΕΝΙΚΩΝΟΝΟΜΑΤΩΝΣΤΟΙΧΕΙΑΕΣΤΙΠΕΝΤΕ | 1 | 4768 |
| ΛΛΗΣΤΗΣΑΝΩΘΕΝΘΕΡΜΟΤΗΤΟΣΑΤΜΙΔΟΥΜΕΝΟΝΦΕΡΕΤΑΙ | 1 | 4754 |
| ΕΠΙΣΚΟΠΩΚΩΝΣΤΑΝΤΙΝΟΥΠΟΛΕΩΣ | 1 | 4701 |
| ΚΩΔΩΝΟΦΑΛΑΡΑΧΡΩΜΕΝΟΥΣ | 1 | 4642 |


Next, let's see what share of the whole word base the least repeated words take:

In [18]:
# length of the words database
le = len(df)
# group words by occurrence and count grouped items, list the first 10 items
for x, y in df.groupby([1, 2]).count()[:10].T.items():
    print("words repeating %s time(s): " % x[0], round(100*y[0]/le, 2), "%")

words repeating 1 time(s):  45.03 %
words repeating 2 time(s):  15.83 %
words repeating 3 time(s):  7.46 %
words repeating 4 time(s):  4.83 %
words repeating 5 time(s):  3.32 %
words repeating 6 time(s):  2.49 %
words repeating 7 time(s):  1.92 %
words repeating 8 time(s):  1.58 %
words repeating 9 time(s):  1.28 %
words repeating 10 time(s):  1.11 %


Words that occur one to four times make up over 73% of the unique words in the database. Words that occur only once form the largest group, 45.03% of the unique words.
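The grouping behind these percentages is a "frequency of frequencies" count. A small sketch of the same calculation on a toy dictionary, using only the standard library (the `occurrence_shares` name is hypothetical):

```python
from collections import Counter


def occurrence_shares(word_counts, limit=10):
    """Return (times, percent of unique words) pairs for the lowest counts."""
    total = len(word_counts)
    groups = Counter(word_counts.values())
    return [
        (times, round(100 * groups[times] / total, 2))
        for times in sorted(groups)[:limit]
    ]


toy_counts = {"ΚΑΙ": 5, "ΔΕ": 2, "ΛΟΓΟΣ": 1, "ΣΟΦΙΑ": 1}
for times, share in occurrence_shares(toy_counts):
    print("words repeating %s time(s): %s %%" % (times, share))
```

Here the percentages are taken over the number of unique words, not over the running text, which is also how the figures above should be read.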

Finally, for cross checking the data processing algorithm, I want to know in which texts the longest words occur:

In [21]:
from functions import search_words_from_corpora
# using already instantiated l variable
words = list(y[0] for x, y in l.T.items())
search_words_from_corpora(words, [perseus_dir, first1k_dir], 3)

 + Aristophanes, Lysistrata (tlg0019.tlg007.perseus-grc2.xml) =>    

   ----- ΣΠΕΡΜΑΓΟΡΑΙΟΛΕΚΙΘΟΛΑΧΑΝΟΠΩΛΙΔΕΣ (1) -----
   ὦ ξύμμαχοι γυναῖκες ἐκθεῖτ ἔνδοθεν ὦ σπερμαγοραιολεκιθολαχανοπώλιδες ὦ σκοροδοπανδοκευτριαρτοπώλιδες

 + Aristophanes, Wasps (tlg0019.tlg004.perseus-grc1.xml) =>    

   ----- ΟΡΘΡΟΦΟΙΤΟΣΥΚΟΦΑΝΤΟΔΙΚΟΤΑΛΑΙΠΩΡΩΝ (1) -----
   ς ἀκούειν ἡδἔ εἰ καὶ νῦν ἐγὼ τὸν πατέρ ὅτι βούλομαι τούτων ἀπαλλαχθέντα τῶν ὀρθροφοιτοσυκοφαντοδικοταλαιπώρων τρόπων ζῆν βίον γενναῖον ὥσπερ Μόρυχος αἰτίαν ἔχω ταῦτα δρᾶν ξυνωμότης ὢν καὶ φρονῶν

 + Athenaeus, Deipnosophistae (tlg0008.tlg001.perseus-grc3.xml) =>    

   ----- ΠΥΡΒΡΟΜΟΛΕΥΚΕΡΕΒΙΝΘΟΑΚΑΝΘΟΥΜΙΚΤΡΙΤΥΑΔΥ (1) -----
   τις ἃ Ζανὸς καλέοντι τρώγματ ἔπειτ ἐπένειμεν ἐνκατακνακομιγὲς πεφρυγμένον πυρβρομολευκερεβινθοακανθουμικτριτυαδυ βρῶμα τοπανταναμικτον ἀμπυκικηροιδηστίχας παρεγίνετο τούτοις

 + Athenaeus, TheDeipnosophists (tlg0008.tlg001.perseus-grc4.xml) =>    

   ----- ΠΥΡΟΒΡΟΜΟΛΕΥΚΕΡΕΒΙΝΘΟΑΚΑΝΘΙΔΟΜΙΚΡΙΤΡΙΑΔΥ (1) -----
   ἐπεί γ ἐπένε

As a short explanation: [Aristophanes](https://en.wikipedia.org/wiki/Aristophanes) was a Greek comic playwright and a wordsmith of a kind. Mathematical texts are also filled with long compound words, for fractions for example.

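That's all for the Greek corpora processing and basic statistics. One could further investigate the basic stats, and categorize and compare individual texts as well. As a final cross-check, the next cell runs the same search for the words with the biggest isopsephical values: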

In [22]:
words = list(y[0] for x, y in m.T.items())
search_words_from_corpora(words, [perseus_dir, first1k_dir], 3)

 + Appian, TheCivilWars (tlg0551.tlg017.perseus-grc2.xml) =>    

   ----- ΣΥΝΥΠΟΧΩΡΟΥΝΤΩΝ (1) -----
   καὶ ἡ σύνταξις ἤδη παρελέλυτο ὀξύτερον ὑπεχώρουν καί τῶν ἐπιτεταγμένων σφίσι δευτέρων καὶ τρίτων συνυποχωρούντων μισγόμενοι πάντες ἀλλήλοις ἀκόσμως ἐθλίβοντο ὑπὸ σφῶν καὶ τῶν πολεμίων ἀπαύστως αὐτοῖς ἐπικειμένων

 + Aristophanes, Wasps (tlg0019.tlg004.perseus-grc1.xml) =>    

   ----- ΟΡΘΡΟΦΟΙΤΟΣΥΚΟΦΑΝΤΟΔΙΚΟΤΑΛΑΙΠΩΡΩΝ (1) -----
   ς ἀκούειν ἡδἔ εἰ καὶ νῦν ἐγὼ τὸν πατέρ ὅτι βούλομαι τούτων ἀπαλλαχθέντα τῶν ὀρθροφοιτοσυκοφαντοδικοταλαιπώρων τρόπων ζῆν βίον γενναῖον ὥσπερ Μόρυχος αἰτίαν ἔχω ταῦτα δρᾶν ξυνωμότης ὢν καὶ φρονῶν

 + Athenaeus, Deipnosophistae (tlg0008.tlg001.perseus-grc3.xml) =>    

   ----- ΒΡΥΣΩΝΟΘΡΑΣΥΜΑΧΕΙΟΛΗΨΙΚΕΡΜΑΤΩΝ (1) -----
   τῶν ἐξ Ἀκαδημίας τις ὑπὸ Πλάτωνα καὶ Βρυσωνοθρασυμαχειοληψικερμάτων πληγεὶς ἀνάγκῃ ληψολιγομίσθῳ τέχνῃ σ

 + Athenaeus, TheDeipnosophists (tlg0008.tlg001.perseus-grc4.xml) =>    

   ----- ΒΡΥΣΩΝΟΘΡΑΣΥΜΑΧΕΙΟΛΗΨΙΚΕΡΜΑΤΩΝ (1) -----
   Βρυσωνοθ

## The [MIT](http://choosealicense.com/licenses/mit/) License

Copyright &copy; 2018 Marko Manninen