# Processing Greek corpora for the riddle solver

<div><br/><img src="delphic_sibyl.png" style="width: 65%; border: 1px solid #ddd; padding: 5px" alt="Michelangelo’s Delphic Sibyl, Sistine Chapel" /></div>
<div><center><i>Michelangelo’s Delphic Sibyl, Sistine Chapel</i></center><br/></div>

The [Pseudo-Sibylline](https://en.wikipedia.org/wiki/Sibylline_Oracles) oracles contain hexametric poems written in Ancient Greek. These oracles were mainly composed between 150 BC and 200 AD and survive in twelve extant books. They circulated widely and were quite famous among the Judaeo-Christian community at that time. They should not, however, be confused with the earlier [Sibylline books](https://en.wikipedia.org/wiki/Sibylline_Books). The Sibylline books contained advice on religious ceremonies and were consulted by selected priests and curators in the Roman empire when it was in deep political trouble. The collection of the original Sibylline books was destroyed by various accidental events and deliberate actions in the course of history.

The Pseudo-Sibylline oracles, on the other hand, contain a Jewish narrative of human history, contrasted with Greek mythology and with the chronology of the great ancient empires. Another intention of the material is to support Christian doctrine and its interpretation of the prophecies. The prophecies were mostly grounded in Jewish literature, but surprisingly some pagan events also came to be interpreted as signs of the coming Messiah.

Some of the material in the Pseudo-Sibylline oracles contains cryptic puzzles, often referring to persons, cities, countries, and epithets of God, for example. These secretive references are often very general in nature, pointing only to the first letter of the subject and its numerical value. Solving them requires proper knowledge of the context, not only the inner textual context but also the historical one.
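
To make the numerical side of these riddles concrete, here is a minimal sketch of isopsephy, that is, summing the numerical values of the letters of a word. The table follows the standard Greek alphabetic numeral system. The `GREEK_VALUES` table and the `isopsephy` function are names of my own for illustration only; they are not the API of the Abnum library used later in this document.

```python
# Standard Greek alphabetic numerals: units, tens and hundreds.
# The archaic signs for 6, 90 and 900 are included for completeness.
GREEK_VALUES = {
    "Α": 1, "Β": 2, "Γ": 3, "Δ": 4, "Ε": 5, "Ϛ": 6, "Ζ": 7, "Η": 8, "Θ": 9,
    "Ι": 10, "Κ": 20, "Λ": 30, "Μ": 40, "Ν": 50, "Ξ": 60, "Ο": 70, "Π": 80,
    "Ϙ": 90, "Ϟ": 90,
    "Ρ": 100, "Σ": 200, "Τ": 300, "Υ": 400, "Φ": 500, "Χ": 600, "Ψ": 700,
    "Ω": 800, "Ϡ": 900,
}


def isopsephy(word):
    """Sum the letter values of an uppercase Greek word without diacritics."""
    return sum(GREEK_VALUES[letter] for letter in word)


print(isopsephy("ΚΑΙ"))  # 20 + 1 + 10 = 31
print(isopsephy("ΤΩΝ"))  # 300 + 800 + 50 = 1150
```

A riddle that gives only a first letter and a total value therefore narrows the candidates down to the words that start with that letter and sum to that value.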

Most of the alphanumeric riddles in the oracles have already been solved by researchers. Some of the riddles are still problematic and open to better proposals. Better yet, a few of these open riddles are specific enough that one may try to solve them with modern programming tools.

## Natural language processing

A programmatic approach to solving the riddles requires a huge Greek text corpus: the bigger it is, the better. I will download and preprocess the available open source Greek corpora, which is quite a daunting task for many reasons. I have left most of the details of this part for enthusiasts to read straight from the commented code: [https://git.io/vAS2Z](https://github.com/markomanninen/grcriddles/blob/master/functions.py). In the end, I'll have a word database containing hundreds of thousands of unique Greek words extracted from naturally written language corpora. The words can then be used in the riddle solver.

Note that, rather than just reading them, you can also run this and the following chapters interactively in your local Jupyter notebook installation if you prefer. That way you can verify the procedure, alter the parameters, and try solving the riddles with your own settings.

You can download these independent Jupyter notebooks from:

- Processing Greek corpora: https://github.com/markomanninen/grcriddles/blob/master/processing.ipynb
- Riddle solver: https://github.com/markomanninen/grcriddles/blob/master/solver.ipynb



## Collecting Greek Corpora

The first thing is to get a big raw Ancient Greek text to operate with. The [CLTK](https://github.com/cltk/cltk) library provides an importer for the [Perseus](http://www.perseus.tufts.edu/hopper/opensource/download) and the [First1KGreek](http://opengreekandlatin.github.io/First1KGreek/) open source data sources.

I'm using the [Abnum](https://github.com/markomanninen/abnum3) library to strip diacritics from the Greek words, remove non-alphabetical characters, and calculate the isopsephical value of the words. The [Greek_accentuation](https://github.com/jtauber/greek-accentuation) library is used to split words into syllables. This is required because a few of the riddles contain specific information about syllables. The [Pandas](http://pandas.pydata.org/) library is used as an API to the collected database and the [Plotly](https://plot.ly/) library is used for the visual presentation of the statistics.
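
To illustrate what stripping diacritics and removing non-alphabetical characters means in practice, here is a minimal sketch that relies only on the standard library `unicodedata` module. The `simplify` function is a hypothetical helper of my own and is not how Abnum does it internally; it only shows the general idea of decomposing the characters and dropping the combining marks.

```python
import unicodedata


def simplify(word):
    """Return the word as plain uppercase Greek letters without diacritics."""
    # NFD splits precomposed characters into a base letter plus combining marks
    decomposed = unicodedata.normalize("NFD", word)
    # drop accents, breathings, iota subscripts and other combining marks
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # uppercase after stripping (final sigma becomes Σ) and keep only
    # letters from the basic Greek block, which removes punctuation and digits
    return "".join(ch for ch in base.upper()
                   if "\u0370" <= ch <= "\u03ff" and ch.isalpha())


print(simplify("μῆνιν"), simplify("ἄειδε,"), simplify("λόγος"))
```

Uppercasing is done only after the marks are removed, because uppercasing a letter with an iota subscript directly can expand it into two letters.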

You can install these libraries by uncommenting the next lines:

In [1]:
import sys

#!{sys.executable} -m pip install cltk abnum
#!{sys.executable} -m pip install pandas plotly
#!{sys.executable} -m pip install greek_accentuation

For your convenience, my environment is the following:

In [2]:
print("Python %s" % sys.version)

Python 3.6.1 |Anaconda 4.4.0 (64-bit)| (default, May 11 2017, 13:25:24) [MSC v.1900 64 bit (AMD64)]


Note that `Python 3.4+` is required for all libraries to work properly.

#### List CLTK corpora

Let's see what corpora are available for download:

In [3]:
from cltk.corpus.utils.importer import CorpusImporter
corpus_importer = CorpusImporter('greek')
print(', '.join(corpus_importer.list_corpora))

greek_software_tlgu, greek_text_perseus, phi7, tlg, greek_proper_names_cltk, greek_models_cltk, greek_treebank_perseus, greek_lexica_perseus, greek_training_set_sentence_cltk, greek_word2vec_cltk, greek_text_lacus_curtius, greek_text_first1kgreek


I'm going to use the `greek_text_perseus` and `greek_text_first1kgreek` corpora for the study, and combine them into a single raw text file and a database of unique words.

### Download corpora

I have collected a large part of the procedures used here in the [functions](functions.py) script to keep this notebook document more concise.

The next code snippet will download hundreds of megabytes of Greek text to your local computer for quicker access:

In [4]:
# import corpora
for corpus in ["greek_text_perseus", "greek_text_first1kgreek"]:
    try:
        corpus_importer.import_corpus(corpus)
    except Exception as e:
        print(e)

C:\Users\phtep\cltk_data\greek\text\greek_text_first1kgreek


Next I will copy only the suitable Greek text files from `greek_text_first1kgreek` to the working directory `greek_text_tlg`. The Perseus corpus is pretty good as it is.

Note that one can download and extract `greek_text_first1kgreek` directly from https://github.com/OpenGreekAndLatin/First1KGreek/zipball/master. It may have the most recent and complete set of files. If you wish to use it, extract the package directly to `~\cltk_data\greek\text\greek_text_tlg`.

In [5]:
from functions import copy_corpora

for item in [["greek_text_first1kgreek", "greek_text_tlg"],
             ["greek_text_perseus", "greek_text_prs"]]:
    copy_corpora(*item)

C:\Users\phtep\cltk_data\greek\text\greek_text_tlg already exists, lets roll on!
C:\Users\phtep\cltk_data\greek\text\greek_text_prs already exists, lets roll on!


The Perseus Greek source text is written in beta code, so I will need a converter script for it. I found a suitable one at https://github.com/epilanthanomai/hexameter but had to make a small fix to it, so I'm using my own version of the [betacode](betacode.py) script.
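
For readers unfamiliar with beta code: it writes Greek with ASCII characters, using Latin letters for the Greek letters and symbols such as `)`, `(`, `/`, `\`, `=`, `+`, `|` and `*` for breathings, accents, diaeresis, iota subscript and capitals. The following is a minimal sketch, not the [betacode](betacode.py) script itself, that converts beta code straight into the simplified uppercase form by mapping the letters and dropping every marker. The function name `beta_to_plain_greek` is hypothetical.

```python
# The 24 Greek capitals in alphabetical order, built from their code points
GREEK_CAPITALS = "".join(chr(c) for c in range(0x0391, 0x03AA) if c != 0x03A2)
# Beta code letters in the same order (Q = theta, C = xi, Y = psi, W = omega)
BETA_LETTERS = "ABGDEZHQIKLMNCOPRSTUFXYW"
BETA_TO_GREEK = dict(zip(BETA_LETTERS, GREEK_CAPITALS))


def beta_to_plain_greek(text):
    """Convert beta code to uppercase Greek, dropping all diacritic markers."""
    out = []
    for ch in text.upper():
        if ch in BETA_TO_GREEK:
            out.append(BETA_TO_GREEK[ch])
        elif ch.isspace():
            out.append(" ")
        # markers such as ) ( / \ = + | * and punctuation are skipped
    return "".join(out)


print(beta_to_plain_greek("mh=nin a)/eide qea/"))
```

A full converter also has to produce the accented lowercase characters, final sigma and the archaic letters, which is why a dedicated script is used for the real processing.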

### Process files

The next step is to find the Greek text nodes in the provided XML source files. I have to specify a tag table to find the main text lines in the source files so that only Greek texts are processed. The XML files have a lot of English and Latin phrases that need to be stripped out.

The extracted content is saved to author/work based directories. A simplified uncial conversion is made at the same time so that the final output file contains only plain words separated by spaces. Pretty much in the format written by the ancient Greeks, by the way.
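
The idea of a tag table can be sketched with the standard library `xml.etree.ElementTree` module. This is only an illustration under my own assumptions: the `TEXT_TAGS` and `SKIP_TAGS` sets, the function names and the tiny sample document are hypothetical, and the real source files use richer markup (including XML namespaces) than this example handles.

```python
import xml.etree.ElementTree as ET

# Hypothetical tag table: elements whose text counts as main Greek text,
# and elements that are skipped entirely (editorial notes, headings)
TEXT_TAGS = {"l", "p"}
SKIP_TAGS = {"note", "head"}


def collect_text(node, inside, out):
    """Walk the tree and gather text found inside TEXT_TAGS elements."""
    inside = inside or node.tag in TEXT_TAGS
    if inside and node.text:
        out.append(node.text)
    for child in node:
        if child.tag not in SKIP_TAGS:
            collect_text(child, inside, out)
        # the tail is the text that follows the child inside the current node
        if inside and child.tail:
            out.append(child.tail)


def extract_main_text(xml_string):
    out = []
    collect_text(ET.fromstring(xml_string), False, out)
    # normalize whitespace so that words are separated by single spaces
    return " ".join(" ".join(out).split())


sample = ("<text><head>Book 1</head><body>"
          "<l>μῆνιν ἄειδε <note>cf. Latin gloss</note>θεά</l>"
          "<p>ἄνδρα μοι ἔννεπε</p></body></text>")
print(extract_main_text(sample))
```

The heading and the note are dropped, while the Greek text that follows the note inside the same line is kept.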

#### Collect text files

In [6]:
from functions import init_corpora

# init corpora list
corporas = ["greek_text_prs", "greek_text_tlg"]

greek_corpora_x = init_corpora(corporas)
print("%s files found" % len(greek_corpora_x))

1311 files found


#### Process text files

This will take several minutes, depending on whether you have already run it once and have the temporary directories available:

In [7]:
from functions import remove, all_greek_text_file, perseus_greek_text_file, first1k_greek_text_file, process_greek_corpora

# remove old temp files
try:
    remove(all_greek_text_file)
    remove(perseus_greek_text_file)
    remove(first1k_greek_text_file)
except OSError:
    pass

# one could use a filter to process only selected files here...
#greek_corpora = process_greek_corpora(list(filter(lambda x: "aristot.nic.eth_gk.xml" in x['file'], greek_corpora_x)))
greek_corpora = process_greek_corpora(greek_corpora_x)

## Statistics

When files are downloaded and preprocessed, I can get the size of the text files:

In [8]:
from functions import get_file_size

print("Size of the all raw text: %s MB" % get_file_size(all_greek_text_file))
print("Size of the perseus raw text: %s MB" % get_file_size(perseus_greek_text_file))
print("Size of the first1k raw text: %s MB" % get_file_size(first1k_greek_text_file))
#Size of the all raw text: 604.88 MB
#Size of the perseus raw text: 79.74 MB
#Size of the first1k raw text: 525.13 MB

Size of the all raw text: 636.32 MB
Size of the perseus raw text: 79.05 MB
Size of the first1k raw text: 557.26 MB


I will calculate other statistics of the saved text files to cross-check their content:

In [9]:
from functions import get_stats

ccontent1, chars1, lwords1 = get_stats(perseus_greek_text_file)
ccontent2, chars2, lwords2 = get_stats(first1k_greek_text_file)
ccontent3, chars3, lwords3 = get_stats(all_greek_text_file)

Corpora: perseus_greek_text_files.txt
Letters: 37819229
Words in total: 7256179
Unique words: 349068

Corpora: first1k_greek_text_files.txt
Letters: 264628135
Words in total: 55077114
Unique words: 687212

Corpora: all_greek_text_files.txt
Letters: 302447364
Words in total: 62333293
Unique words: 852917
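
The three figures reported for each corpus can be reproduced for any simplified text with a few lines of plain Python. This is a sketch of the idea only; the `text_stats` helper is hypothetical and is not the `get_stats` function imported above, which also returns the file content.

```python
def text_stats(text):
    """Return (letters, words in total, unique words) for a simplified text."""
    words = text.split()
    letters = sum(len(word) for word in words)
    return letters, len(words), len(set(words))


letters, total, unique = text_stats("ΚΑΙ ΤΟ ΚΑΙ ΔΕ")
print("Letters: %s" % letters)
print("Words in total: %s" % total)
print("Unique words: %s" % unique)
```

Comparing these numbers between runs is a cheap way to notice if a processing change has dropped or duplicated text.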



## Letter statistics

I'm using Pandas library to handle tabular data and show basic letter statistics.

In [10]:
from functions import Counter, DataFrame

#### Calculate statistics

This will take some time too:

In [11]:
# perseus dataframe
df = DataFrame([[k, v] for k, v in Counter(ccontent1).items()])
df[2] = df[1].apply(lambda x: round(x*100/chars1, 2))
a = df.sort_values(1, ascending=False)
# first1k dataframe
df = DataFrame([[k, v] for k, v in Counter(ccontent2).items()])
df[2] = df[1].apply(lambda x: round(x*100/chars2, 2))
b = df.sort_values(1, ascending=False)
# perseus + first1k dataframe
df = DataFrame([[k, v] for k, v in Counter(ccontent3).items()])
df[2] = df[1].apply(lambda x: round(x*100/chars3, 2))
c = df.sort_values(1, ascending=False)
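
The same letter table can be sketched without Pandas, which may help when checking the numbers on a small sample. The `letter_table` function below is a hypothetical helper of my own; it returns the letter, its count and its percentage of all letters, sorted from the most to the least frequent.

```python
from collections import Counter


def letter_table(text):
    """Return (letter, count, percent) rows sorted by descending count."""
    letters = [ch for ch in text if ch.isalpha()]
    total = len(letters)
    return [(letter, count, round(count * 100 / total, 2))
            for letter, count in Counter(letters).most_common()]


for row in letter_table("ΚΑΙ ΤΟ ΚΑΙ"):
    print(row)
```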

#### Show letter statistics

The first column is the letter, the second column is the count of the letter, and the third column is the percentage of that letter relative to all letters.

Show tables side by side to save some vertical space:

In [12]:
from functions import display_side_by_side

display_side_by_side(Perseus=a, First1K=b, Perseus_First1K=c)

Letter,Count,Percent
Α,4146310,10.96
Ε,3648132,9.65
Ο,3635084,9.61
Ι,3583138,9.47
Ν,3384638,8.95
Τ,2882746,7.62
Σ,2802701,7.41
Υ,1764393,4.67
Ρ,1426240,3.77
Η,1382239,3.65

Letter,Count,Percent
Α,28522930,10.78
Ο,25211915,9.53
Ι,24140242,9.12
Ν,23736011,8.97
Ε,23490544,8.88
Τ,22869810,8.64
Σ,19918937,7.53
Υ,12066974,4.56
Ρ,10421878,3.94
Η,9847076,3.72

Letter,Count,Percent
Α,32669240,10.8
Ο,28846999,9.54
Ι,27723380,9.17
Ε,27138676,8.97
Ν,27120649,8.97
Τ,25752556,8.51
Σ,22721638,7.51
Υ,13831367,4.57
Ρ,11848118,3.92
Η,11229315,3.71


The `First1K` corpus contains mathematical texts in Greek, which explains why the rarely used digamma (Ϛ = 6), qoppa (Ϟ/Ϙ = 90), and sampi (Ϡ = 900) letters are included in the table. You can find other interesting differences too, such as the relative frequencies of Ε/Τ, Κ/Π, and Μ/Λ, which are probably explained by the different text genres included in the corpora.

#### Plotly bar chart for letter stats

The next chart will show visually which are the most used letters and the least used letters in the available Ancient Greek corpora.

<img src="stats.png" />


Vowels together with the consonants `N`, `S`, and `T` pop up as the most used letters. The least used letters are `Z`, `Chi`, and `Psi`.

Uncomment the next part to output a fresh graph from Plotly:

In [13]:
#from plotly.offline import init_notebook_mode
#init_notebook_mode(connected=False)

# for the first time set plotly service credentials, then you can comment the next line
#import plotly
#plotly.tools.set_credentials_file(username='MarkoManninen', api_key='xyz')

# use tables and graphs...
#import plotly.tools as tls
# embed plotly graphs
#tls.embed("https://plot.ly/~MarkoManninen/8/")

Then it is time to store unique Greek words to the database and show some specialties of the word statistics. This will take a minute or two:

In [14]:
from functions import syllabify, Abnum, greek, vowels

# greek abnum object for calculating isopsephical value
g = Abnum(greek)

# let's count the unique word statistics from the parsed greek corpora rather than the plain text file
# it would be pretty daunting to find the occurrences of all 800000+ unique words from the text
# file that is over 600 MB big!
unique_word_stats = {}
for item in greek_corpora:
    for word, cnt in item['uwords'].items():
        if word not in unique_word_stats:
            unique_word_stats[word] = 0
        unique_word_stats[word] += cnt

# init dataframe
df = DataFrame([[k, v] for k, v in unique_word_stats.items()])
# add column for the occurrence percentage of the word
df[2] = df[1].apply(lambda x: round(x*100/lwords3, 2))
# add column for the length of the word
df[3] = df[0].apply(lambda x: len(x))
# add isopsephy column
df[4] = df[0].apply(lambda x: g.value(x))
# add syllabified column
df[5] = df[0].apply(lambda x: syllabify(x))
# add length of the syllables column
df[6] = df[5].apply(lambda x: len(x))
# count vowels in the word
df[7] = df[0].apply(lambda x: sum(list(x.count(c) for c in vowels)))
# count consonants in the word
df[8] = df[0].apply(lambda x: len(x) - sum(list(x.count(c) for c in vowels)))
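
As a sanity check on the column definitions, the length, vowel and consonant columns can be computed for a single word in plain Python. In this sketch `GREEK_VOWELS` is my assumption of what the imported `vowels` constant contains, and `word_features` is a hypothetical helper rather than part of the functions script.

```python
# Assumption: the seven Greek vowels in simplified uppercase form
GREEK_VOWELS = "ΑΕΗΙΟΥΩ"


def word_features(word):
    """Return the length, vowel count and consonant count of a word."""
    n_vowels = sum(1 for ch in word if ch in GREEK_VOWELS)
    return {"length": len(word),
            "vowels": n_vowels,
            "consonants": len(word) - n_vowels}


print(word_features("ΚΑΙ"))
print(word_features("ΤΩΝ"))
```

Counting the vowels once and deriving the consonants from the word length avoids scanning the word twice.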

### Save unique words database

This is the single most important part of the document. I'm saving all simplified unique words as a csv file that can be used as a database for the riddle solver. After this you may proceed to the [riddle solver](Isopsephical riddles in the Greek Pseudo Sibylline hexameter poetry.ipynb) Jupyter notebook document in interactive mode if you prefer.

In [15]:
from functions import csv_file_name
df.to_csv(csv_file_name, header=False, index=False, encoding='utf-8')
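
To give an idea of how the saved file can serve as a database, here is a minimal sketch that reads rows in the same column order and filters them by first letter, length and isopsephical value. The column names in `COLUMNS` and the `load_words` and `find_words` functions are hypothetical, since the file is saved without a header. The sample rows are the five most frequent words listed in this document; in practice one would open the saved CSV file instead of the in-memory sample.

```python
import csv
import io

# Hypothetical names for the nine unnamed columns of the saved file
COLUMNS = ["word", "count", "percent", "length", "isopsephy",
           "syllables", "syllable_count", "vowels", "consonants"]
INT_COLUMNS = ("count", "length", "isopsephy",
               "syllable_count", "vowels", "consonants")

SAMPLE = """ΚΑΙ,3489609,5.6,3,31,[ΚΑΙ],1,2,1
ΔΕ,1430133,2.29,2,9,[ΔΕ],1,1,1
ΤΟ,1355647,2.17,2,370,[ΤΟ],1,1,1
ΤΟΥ,989407,1.59,3,770,[ΤΟΥ],1,2,1
ΤΩΝ,958932,1.54,3,1150,[ΤΩΝ],1,1,2
"""


def load_words(handle):
    """Read header-less CSV rows into dictionaries with typed columns."""
    rows = []
    for record in csv.reader(handle):
        row = dict(zip(COLUMNS, record))
        for key in INT_COLUMNS:
            row[key] = int(row[key])
        rows.append(row)
    return rows


def find_words(rows, first_letter=None, length=None, isopsephy=None):
    """Return the words that satisfy every criterion that was given."""
    return [r["word"] for r in rows
            if (first_letter is None or r["word"].startswith(first_letter))
            and (length is None or r["length"] == length)
            and (isopsephy is None or r["isopsephy"] == isopsephy)]


rows = load_words(io.StringIO(SAMPLE))
print(find_words(rows, isopsephy=770))
print(find_words(rows, first_letter="Τ", length=3))
```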

For confirmation, I will show the five most repeated words in the database:

In [16]:
from functions import display_html
# use to_html and index=False to hide index column
display_html(df.sort_values(1, ascending=False).head(n=5).to_html(index=False), raw=True)

0,1,2,3,4,5,6,7,8
ΚΑΙ,3489609,5.6,3,31,[ΚΑΙ],1,2,1
ΔΕ,1430133,2.29,2,9,[ΔΕ],1,1,1
ΤΟ,1355647,2.17,2,370,[ΤΟ],1,1,1
ΤΟΥ,989407,1.59,3,770,[ΤΟΥ],1,2,1
ΤΩΝ,958932,1.54,3,1150,[ΤΩΝ],1,1,2


For curiosity, let's also see the longest words in the database:

In [17]:
from functions import HTML
l = df.sort_values(3, ascending=False).head(n=20)
HTML(l.to_html(index=False))

0,1,2,3,4,5,6,7,8
ΑΛΛΗΣΤΗΣΑΝΩΘΕΝΘΕΡΜΤΗΤΟΣΑΤΜΙΔΟΜΕΝΟΝΦΡΕΤΑΙ,3,0.0,40,4280,"[ΑΛ, ΛΗ, ΣΤΗ, ΣΑ, ΝΩ, ΘΕΝ, ΘΕΡΜ, ΤΗ, ΤΟ, ΣΑΤ, ...",16,17,23
ΔΥΝΑΤΟΝΔΕΤΟΑΙΤΑΙΗΣΓΕΝΣΕΩΣΚΑΙΤΗΣΦΘΟΡΑΣ,3,0.0,37,4466,"[ΔΥ, ΝΑ, ΤΟΝ, ΔΕ, ΤΟ, ΑΙ, ΤΑΙ, ΗΣ, ΓΕΝ, ΣΕ, Ω,...",15,18,19
ΕΝΝΕΑΚΑΙΔΕΚΑΕΤΗΡΙΕΝΝΕΑΚΑΙΔΕΚΑΕΤΗΡΔΟΣ,2,0.0,36,1454,"[ΕΝ, ΝΕ, Α, ΚΑΙ, ΔΕ, ΚΑ, Ε, ΤΗ, ΡΙ, ΕΝ, ΝΕ, Α,...",18,20,16
ΕΜΟΥΙΑΠΦΕΥΓΑΧΕΙΡΑΣΛΥΠΣΑΣΜΕΝΟΥΔΝΑΟΥΔΝ,3,0.0,36,4486,"[Ε, ΜΟΥΙ, ΑΠ, ΦΕΥ, ΓΑ, ΧΕΙ, ΡΑΣ, ΛΥΠ, ΣΑ, ΣΜΕ,...",13,19,17
ΣΙΑΛΟΙΟΡΑΧΙΝΤΕΘΑΛΥΙΑΝΑΛΟΙΦΗΕΥΤΡΑΦΟΥΣ,4,0.0,36,4553,"[ΣΙ, Α, ΛΟΙ, Ο, ΡΑ, ΧΙΝ, ΤΕ, ΘΑ, ΛΥΙ, Α, ΝΑ, Λ...",16,21,15
ΕΝΝΕΑΚΑΙΕΙΚΟΣΙΚΑΙΕΠΤΑΚΟΣΙΟΠΛΑΣΙΑΚΙΣ,1,0.0,35,1796,"[ΕΝ, ΝΕ, Α, ΚΑΙ, ΕΙ, ΚΟ, ΣΙ, ΚΑΙ, Ε, ΠΤΑ, ΚΟ, ...",17,20,15
ΚΑΙΟΣΑΑΛΛΑΤΩΝΤΟΙΟΥΤΩΝΠΡΟΣΔΙΟΡΙΖΜΕΘΑ,2,0.0,35,4220,"[ΚΑΙ, Ο, ΣΑ, ΑΛ, ΛΑ, ΤΩΝ, ΤΟΙ, ΟΥ, ΤΩΝ, ΠΡΟΣ, ...",15,18,17
ΟΡΘΡΟΦΟΙΤΟΣΥΚΟΦΑΝΤΟΔΙΚΟΤΑΛΑΙΠΩΡΩΝ,1,0.0,33,5186,"[ΟΡ, ΘΡΟ, ΦΟΙ, ΤΟ, ΣΥ, ΚΟ, ΦΑΝ, ΤΟ, ΔΙ, ΚΟ, ΤΑ...",14,16,17
ΤΕΤΤΑΡΑΚΟΝΤΑΚΑΙΠΕΝΤΑΚΙΣΧΙΛΙΟΣΤΟΝ,1,0.0,32,3485,"[ΤΕΤ, ΤΑ, ΡΑ, ΚΟΝ, ΤΑ, ΚΑΙ, ΠΕΝ, ΤΑ, ΚΙ, ΣΧΙ, ...",13,14,18
ΚΑΙΙΚΛΗΧΡΥΣΗΑΦΡΟΔΤΗΚΑΙΟΙΣΕΚΣΜΗΣΕ,3,0.0,32,3179,"[ΚΑΙ, Ι, ΚΛΗ, ΧΡΥ, ΣΗ, Α, ΦΡΟΔ, ΤΗ, ΚΑΙ, ΟΙ, Σ...",13,16,16


How about finding out which words have the biggest isopsephical values?

In [18]:
HTML(df.sort_values(4, ascending=False).head(n=20).to_html(index=False))

0,1,2,3,4,5,6,7,8
ΟΡΘΡΟΦΟΙΤΟΣΥΚΟΦΑΝΤΟΔΙΚΟΤΑΛΑΙΠΩΡΩΝ,1,0.0,33,5186,"[ΟΡ, ΘΡΟ, ΦΟΙ, ΤΟ, ΣΥ, ΚΟ, ΦΑΝ, ΤΟ, ΔΙ, ΚΟ, ΤΑ...",14,16,17
ΓΛΩΣΣΟΤΟΜΗΘΕΝΤΩΝΧΡΙΣΤΙΑΝΩΝ,3,0.0,26,5056,"[ΓΛΩΣ, ΣΟ, ΤΟ, ΜΗ, ΘΕΝ, ΤΩΝ, ΧΡΙ, ΣΤΙ, Α, ΝΩΝ]",10,10,16
ΣΙΑΛΟΙΟΡΑΧΙΝΤΕΘΑΛΥΙΑΝΑΛΟΙΦΗΕΥΤΡΑΦΟΥΣ,4,0.0,36,4553,"[ΣΙ, Α, ΛΟΙ, Ο, ΡΑ, ΧΙΝ, ΤΕ, ΘΑ, ΛΥΙ, Α, ΝΑ, Λ...",16,21,15
ΤΟΙΧΩΡΥΧΟΥΝΤΩΝ,1,0.0,14,4550,"[ΤΟΙ, ΧΩ, ΡΥ, ΧΟΥΝ, ΤΩΝ]",5,7,7
ΕΜΟΥΙΑΠΦΕΥΓΑΧΕΙΡΑΣΛΥΠΣΑΣΜΕΝΟΥΔΝΑΟΥΔΝ,3,0.0,36,4486,"[Ε, ΜΟΥΙ, ΑΠ, ΦΕΥ, ΓΑ, ΧΕΙ, ΡΑΣ, ΛΥΠ, ΣΑ, ΣΜΕ,...",13,19,17
ΔΥΝΑΤΟΝΔΕΤΟΑΙΤΑΙΗΣΓΕΝΣΕΩΣΚΑΙΤΗΣΦΘΟΡΑΣ,3,0.0,37,4466,"[ΔΥ, ΝΑ, ΤΟΝ, ΔΕ, ΤΟ, ΑΙ, ΤΑΙ, ΗΣ, ΓΕΝ, ΣΕ, Ω,...",15,18,19
ΤΩΟΡΘΩΕΚΑΣΤΑΘΕΩΡΩΝ,4,0.0,18,4370,"[ΤΩ, ΟΡ, ΘΩ, Ε, ΚΑ, ΣΤΑ, ΘΕ, Ω, ΡΩΝ]",9,9,9
ΣΥΝΥΠΟΧΩΡΟΥΝΤΩΝ,1,0.0,15,4370,"[ΣΥ, ΝΥ, ΠΟ, ΧΩ, ΡΟΥΝ, ΤΩΝ]",6,7,8
ΟΡΘΟΦΟΙΤΟΣΥΚΟΦΑΝΤΟΔΙΚΟΤΑΛΑΙΠΡΩΝ,8,0.0,31,4286,"[ΟΡ, ΘΟ, ΦΟΙ, ΤΟ, ΣΥ, ΚΟ, ΦΑΝ, ΤΟ, ΔΙ, ΚΟ, ΤΑ,...",13,15,16
ΑΛΛΗΣΤΗΣΑΝΩΘΕΝΘΕΡΜΤΗΤΟΣΑΤΜΙΔΟΜΕΝΟΝΦΡΕΤΑΙ,3,0.0,40,4280,"[ΑΛ, ΛΗ, ΣΤΗ, ΣΑ, ΝΩ, ΘΕΝ, ΘΕΡΜ, ΤΗ, ΤΟ, ΣΑΤ, ...",16,17,23


What percentage of the whole word base do the least repeated words take?

In [19]:
le = len(df)
for x, y in df.groupby([1, 2]).count()[:10].T.items():
    print("words repeating %s time(s): " % x[0], round(100*y[0]/le, 2), "%")

words repeating 1 time(s):  13.7 %
words repeating 2 time(s):  13.93 %
words repeating 3 time(s):  15.7 %
words repeating 4 time(s):  12.19 %
words repeating 5 time(s):  3.78 %
words repeating 6 time(s):  4.86 %
words repeating 7 time(s):  2.57 %
words repeating 8 time(s):  3.5 %
words repeating 9 time(s):  2.22 %
words repeating 10 time(s):  1.74 %


Words that repeat 1-4 times make up about 55.5% of the unique words in the database. Words repeating three times form the largest single group, with 15.7% of the words.
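
The grouping above can also be expressed without a dataframe by counting how many words share each repetition count. This is a sketch under my own naming; `repetition_shares` is a hypothetical helper that takes a word-to-count mapping such as `unique_word_stats`.

```python
from collections import Counter


def repetition_shares(word_counts, top=10):
    """Map repetition count to the percentage of unique words having it."""
    by_repetitions = Counter(word_counts.values())
    total = len(word_counts)
    smallest = sorted(by_repetitions.items())[:top]
    return {reps: round(100 * n / total, 2) for reps, n in smallest}


example = {"ΚΑΙ": 5, "ΔΕ": 2, "ΤΟ": 1, "ΤΟΥ": 1}
for reps, share in repetition_shares(example).items():
    print("words repeating %s time(s): " % reps, share, "%")
```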

Finally, for cross checking the data processing algorithm, I want to know in which texts the longest words occur:

In [20]:
from functions import listdir, get_content, path
# using already instantiated l variable I'm collecting the plain text words
words = list(y[0] for x, y in l.T.items())

def has_words(data):
    a = {}
    for x in words:
        # partial match is fine here. data should be split to words for exact match
        # but it will take more processing time. for shorter words it might be more useful however
        if x in data:
            a[x] = data.count(x)
    return a

def has_content(f):
    content = get_content(f)
    a = has_words(content)
    if a:
        print(" - %s => \r\n   %s\r\n" % (f, ', '.join(list("%s: %s" % (k, v) for k, v in a.items()))))

# iterate all corporas and see if selected words occur in the text
for corp in corporas:
    for b in filter(path.isdir, map(lambda x: path.join(corp, x), listdir(corp))):
        for c in filter(path.isfile, map(lambda x: path.join(b, x), listdir(b))):
            has_content(c)

 - greek_text_prs\Aristophanes\Simplified_Lysistrata.txt => 
   ΣΠΕΡΜΑΓΟΡΑΙΟΛΕΚΙΘΟΛΑΧΑΝΟΠΩΛΙΔΕΣ: 1, ΣΚΟΡΟΔΟΠΑΝΔΟΚΕΥΤΡΙΑΡΤΟΠΩΛΙΔΕΣ: 1

 - greek_text_prs\Aristophanes\Simplified_Wasps.txt => 
   ΟΡΘΡΟΦΟΙΤΟΣΥΚΟΦΑΝΤΟΔΙΚΟΤΑΛΑΙΠΩΡΩΝ: 1

 - greek_text_prs\Plato\Simplified_Laws.txt => 
   ΤΕΤΤΑΡΑΚΟΝΤΑΚΑΙΠΕΝΤΑΚΙΣΧΙΛΙΟΣΤΟΝ: 1

 - greek_text_prs\Plato\Simplified_Republic.txt => 
   ΕΝΝΕΑΚΑΙΕΙΚΟΣΙΚΑΙΕΠΤΑΚΟΣΙΟΠΛΑΣΙΑΚΙΣ: 1

 - greek_text_tlg\AlexanderOfAphrodisias\Simplified_InAristotelisTopicorumLibrosOctoCommentaria.txt => 
   ΟΤΙΤΟΥΜΗΔΙΑΠΡΟΤΡΩΝΟΡΖΕΣΘΑΙΤΡΕΙΣ: 2

 - greek_text_tlg\Ammonius\Simplified_InAristotelisLibrumDeInterpretationeCommentarius.txt => 
   ΚΑΙΟΣΑΑΛΛΑΤΩΝΤΟΙΟΥΤΩΝΠΡΟΣΔΙΟΡΙΖΜΕΘΑ: 2

 - greek_text_tlg\ApolloniusDyscolus\Simplified_DeConstructione.txt => 
   ΠΑΡΥΦΙΣΤΑΜΕΝΟΥΠΡΑΓΜΑΤΟΣΚΟΙΝΩΣ: 3

 - greek_text_tlg\Artemidorus\Simplified_Onirocriticon.txt => 
   ΑΥΤΟΜΑΤΟΙΔΕΟΙΘΕΟΙΑΠΑΛΛΑΣΣΟΜΕΝΟΙ: 3

 - greek_text_tlg\ChroniconPaschale\Simplified_ChroniconPaschale.txt => 
   ΕΝΝΕΑΚΑΙΔΕΚΑΕΤΗΡΙΕ
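
The comment in `has_words` mentions that a partial match is used instead of an exact one. The difference matters mostly for short words, as this small sketch shows; the two helper functions are hypothetical and only illustrate the trade-off.

```python
def count_partial(word, text):
    """Count substring occurrences, also inside longer words."""
    return text.count(word)


def count_exact(word, text):
    """Count occurrences as a whole, space separated word."""
    return text.split().count(word)


text = "ΤΟ ΤΟΥ ΤΩΝ ΤΟ"
print(count_partial("ΤΟ", text))  # also matches the beginning of ΤΟΥ
print(count_exact("ΤΟ", text))
```

For the very long words searched above, an accidental match inside another word is unlikely, so the faster substring search is a reasonable choice there.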

For a small explanation: [Aristophanes](https://en.wikipedia.org/wiki/Aristophanes) was a Greek comic playwright and a word expert of a kind. Mathematical texts are also filled with long compound words, for fractions for example.

So that's all for the Greek corpora processing and basic statistics. One could further investigate the basic stats, and categorize and compare individual texts as well.

## The [MIT](http://choosealicense.com/licenses/mit/) License

Copyright &copy; 2018 Marko Manninen