# Processing Greek corpora for the riddle solver

<div><br/><img src="delphic_sibyl.png" style="width: 65%; border: 1px solid #ddd; padding: 5px" alt="Michelangelo’s Delphic Sibyl, Sistine Chapel" /></div>
<div><center><i>Michelangelo’s Delphic Sibyl, Sistine Chapel</i></center><br/></div>

The [Pseudo-Sibylline](https://en.wikipedia.org/wiki/Sibylline_Oracles) oracles are hexametric poems written in Ancient Greek. These *oracula* were mainly composed between 150 BC and 200 AD and survive as twelve distinct books. They circulated widely and were quite famous in the Judaeo-Christian community of the time. They should not, however, be confused with the earlier [Sibylline books](https://en.wikipedia.org/wiki/Sibylline_Books). The Sibylline books contained religious and ceremonial advice that selected priests and curators of the Roman state consulted when the state was in deep political trouble. The original collection of the Sibylline books was destroyed through various accidents and deliberate actions over the course of history.

The Pseudo-Sibylline oracles, on the other hand, contain a Jewish narrative of human history, contrasted with Greek mythology and with the chronology of the other great ancient empires. Another intention of the material is to support the evolving Christian doctrine and interpretation of the prophecies. The prophecies were mostly grounded in Jewish literature, but, surprisingly, some pagan world events also came to be interpreted as signs of the coming Messiah.

Some of the material in the Pseudo-Sibylline oracles contains cryptic puzzles that refer to persons, cities, countries, and epithets of God, for example. These secretive references are often very general in nature, pointing only to the first letter of the subject and its numerical value. Solving them requires proper knowledge of the context, both the inner textual context and the historical one.

Most of the alphanumeric riddles in the oracles have already been solved by various researchers. Some of the riddles are still problematic and open to better proposals. Better yet, a few of these open riddles are specific enough that one may try to solve them with modern programmable tools.

## Natural language processing

A programmatic approach to solving the riddles requires a huge Greek text corpus. The bigger it is, the better. I will download and preprocess the available open source Greek corpora, which is quite a daunting task for many reasons. My programming language of choice is [Python](http://python.org). I have left most of the details of this part for enthusiasts to read straight from the commented code in the [functions.py](https://git.io/vAS2Z) script. Collecting a large part of the procedures into a separate script also keeps this document more concise.

At the end of this first chapter, I will have a word database containing hundreds of thousands of unique Greek words extracted from natural-language corpora. These words can then be used in the riddle solver.

Note that, rather than just reading them, you can also run this and the following chapters interactively in your local [Jupyter notebook](https://jupyter.org/) installation if you prefer. That means you may test and verify the procedure, or alter the parameters and try solving the riddles with your own settings.

You can download these independent Jupyter notebooks for [processing corpora](https://git.io/vASwM), [solving riddles](https://git.io/vASrY), and [analysing results]().

## Required components

The first task is to get a big raw Ancient Greek text to operate on. In this chapter, I have implemented an importer interface, using the [tqdm](https://github.com/tqdm/tqdm) library, to the [Perseus](http://www.perseus.tufts.edu/hopper/opensource/download) and the [First1KGreek](http://opengreekandlatin.github.io/First1KGreek/) open source data sources.

I'm using my own [Abnum](https://github.com/markomanninen/abnum3) library to strip diacritics from the Greek words, to remove non-alphabetical characters, and to calculate the isopsephical value of the words. The [Greek_accentuation](https://github.com/jtauber/greek-accentuation) library is used to split words into syllables. This is required because a few of the riddles contain specific information about the syllables of the words. The [Pandas](http://pandas.pydata.org/) library is used as an API (application programming interface) to the collected database, and the [Plotly](https://plot.ly/) library and online infographic service are used for the visual presentation of the statistics.

You can install these libraries by uncommenting the next lines:

In [1]:
import sys

#!{sys.executable} -m pip install tqdm abnum
#!{sys.executable} -m pip install pandas plotly
#!{sys.executable} -m pip install greek_accentuation

For your convenience, my environment is the following:

In [2]:
print("Python %s" % sys.version)

Python 3.6.1 |Anaconda 4.4.0 (64-bit)| (default, May 11 2017, 13:25:24) [MSC v.1900 64 bit (AMD64)]


Note that `Python 3.4+` is required for all examples to work properly.

#### Downloading corpora

I'm going to use `greek_text_perseus` and `greek_text_first1k` corpora for the
study by combining them into a single raw text file and unique words database.

The next code snippets will download hundreds of megabytes of Greek text to a
local computer for quicker access.

a. Download packed zip files from their GitHub repositories:

In [3]:
from functions import download_with_indicator, perseus_zip_file, first1k_zip_file
# download perseus files
fs = "https://github.com/PerseusDL/canonical-greekLit/archive/master.zip"
download_with_indicator(fs, perseus_zip_file)
# download first1k files
fs = "https://github.com/OpenGreekAndLatin/First1KGreek/archive/master.zip"
download_with_indicator(fs, first1k_zip_file)

b. Unzip files to the corresponding directories:

In [4]:
from functions import perseus_zip_dir, first1k_zip_dir, unzip
# first argument is the zip source, second is the destination dir
unzip(perseus_zip_file, perseus_zip_dir)
unzip(first1k_zip_file, first1k_zip_dir)

c. Copy only the suitable Greek text XML files from `perseus_zip_dir` and `first1k_zip_dir` to the temporary work directories. The original repositories contain many files that the riddle solver does not need; these are skipped in this process.

In [5]:
from functions import copy_corpora, joinpaths, perseus_tmp_dir, first1k_tmp_dir
# the important files reside in the data directory of the repositories
for item in [[joinpaths(perseus_zip_dir,
              ["canonical-greekLit-master", "data"]), perseus_tmp_dir],
             [joinpaths(first1k_zip_dir,
              ["First1KGreek-master", "data"]), first1k_tmp_dir]]:
    copy_corpora(*item)

greek_text_perseus_tmp already exists. Either remove it and run again, or just use the old one.
greek_text_first1k_tmp already exists. Either remove it and run again, or just use the old one.


Depending on whether the files have already been downloaded, the output may differ.
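Steps b and c are implemented in `functions.py`. As a rough, standard-library-only sketch of the same idea (not the actual implementation), extraction and filtering could look like the following. The helper names `unzip_once` and `copy_greek_xml`, and the rule of keeping only XML files whose name contains `grc`, are assumptions made for illustration; the real selection rules may differ.

```python
import os
import shutil
import zipfile


def unzip_once(zip_path, dest_dir):
    """Extract zip_path into dest_dir, unless dest_dir already exists."""
    if os.path.isdir(dest_dir):
        print("%s already exists, skipping extraction." % dest_dir)
        return False
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
    return True


def copy_greek_xml(src_dir, dest_dir, marker="grc"):
    """Copy XML files whose name contains `marker` (an assumed naming rule).

    The directory layout below src_dir is preserved under dest_dir.
    Returns the sorted list of copied paths, relative to src_dir.
    """
    copied = []
    for root, _dirs, files in os.walk(src_dir):
        for name in files:
            if name.endswith(".xml") and marker in name:
                source = os.path.join(root, name)
                relative = os.path.relpath(source, src_dir)
                target = os.path.join(dest_dir, relative)
                os.makedirs(os.path.dirname(target), exist_ok=True)
                shutil.copyfile(source, target)
                copied.append(relative)
    return sorted(copied)
```

Returning early when the destination directory exists mirrors the "already exists" messages shown above, so a second run does not repeat the slow extraction.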

### Collecting files

Once the files have been downloaded and copied, it is time to read them into memory. At this point the file paths are collected into the `greek_corpora_x` variable, which is used by later iterators.

In [8]:
from functions import init_corpora, perseus_dir, first1k_dir
# collect files and initialize data dictionary
greek_corpora_x = init_corpora([[perseus_tmp_dir, perseus_dir], [first1k_tmp_dir, first1k_dir]])
print(len(greek_corpora_x), "files found")

1699 files found


#### Processing files

The next step is to extract the Greek content from the provided XML source files.

The extracted content is saved to directories organized by corpus, author, and work. A simplified uncial conversion is made at the same time, so that the final output file contains only plain uppercase words separated by spaces, pretty much in the format written by the ancient Greeks. It is worth noting that the stored words are not stems; they appear in whatever inflected forms occur in the texts.

This will take several minutes, depending on whether you have already run it once and still have the previous temporary directories available. Old processed corpora files are removed first, then they are recreated by calling the `process_greek_corpora` function.

In [9]:
from functions import remove, all_greek_text_file, perseus_greek_text_file, \
                      first1k_greek_text_file, process_greek_corpora
# remove old processed temporary files
try:
    remove(all_greek_text_file)
    remove(perseus_greek_text_file)
    remove(first1k_greek_text_file)
except OSError:
    pass
# process the greek corpora and load the data into memory
# one could use a filter to process only selected files here...
#greek_corpora = process_greek_corpora(list(filter(lambda x: "aristot.nic.eth_gk.xml" in x['file'], greek_corpora_x)))
greek_corpora = process_greek_corpora(greek_corpora_x)
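The simplified uncial conversion described above is done by the Abnum library and `functions.py`. The following is only a minimal standard-library sketch of the idea, under the assumption that Unicode decomposition is enough to separate the diacritics from the base letters. The function name `to_plain_uppercase` is hypothetical, and this simplified version keeps only the 24 standard capital letters, so the numeral letters (digamma, qoppa, sampi) are dropped.

```python
import unicodedata


def to_plain_uppercase(text):
    """Strip diacritics and punctuation; return uppercase words split by spaces."""
    # Decompose accented letters into base letter + combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks (accents, breathings, iota subscript).
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Uppercase after stripping; the final sigma becomes a capital sigma.
    upper = stripped.upper()
    # Replace everything outside GREEK CAPITAL ALPHA..OMEGA with a space.
    kept = "".join(ch if "\u0391" <= ch <= "\u03a9" else " " for ch in upper)
    # Collapse runs of whitespace into single spaces.
    return " ".join(kept.split())


print(to_plain_uppercase("ὦ ξύμμαχοι γυναῖκες"))
# → Ω ΞΥΜΜΑΧΟΙ ΓΥΝΑΙΚΕΣ
```

The marks are stripped before uppercasing because the combining iota subscript would otherwise be uppercased into a full capital iota and end up inside the word.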

## Statistics

After the files have been downloaded and preprocessed, I will output their sizes:

In [10]:
from functions import get_file_size

print("Size of the all raw text: %s MB" % get_file_size(all_greek_text_file))
print("Size of the perseus raw text: %s MB" % get_file_size(perseus_greek_text_file))
print("Size of the first1k raw text: %s MB" % get_file_size(first1k_greek_text_file))

Size of the all raw text: 346.5 MB
Size of the perseus raw text: 107.5 MB
Size of the first1k raw text: 239.0 MB


Then, I will calculate other statistics of the saved text files to compare their content:

In [11]:
from functions import get_stats

ccontent1, chars1, lwords1 = get_stats(perseus_greek_text_file)
ccontent2, chars2, lwords2 = get_stats(first1k_greek_text_file)
ccontent3, chars3, lwords3 = get_stats(all_greek_text_file)

Corpora: perseus_greek_text_files.txt
Letters: 51411752
Words in total: 9900720
Unique words: 423428

Corpora: first1k_greek_text_files.txt
Letters: 113763150
Words in total: 23084445
Unique words: 667503

Corpora: all_greek_text_files.txt
Letters: 165174902
Words in total: 32985165
Unique words: 831308
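For readers who want to see what these three figures mean, here is a minimal sketch of how they can be derived from a space-separated text. The function name `text_stats` is hypothetical; the real `get_stats` in `functions.py` also returns the text content itself and prints the summary shown above.

```python
def text_stats(text):
    """Return (letters, total words, unique words) for space-separated text."""
    words = text.split()
    letters = sum(len(word) for word in words)
    return letters, len(words), len(set(words))


letters, total, unique = text_stats("ΚΑΙ ΤΟ ΚΑΙ ΔΕ")
print("Letters: %s" % letters)        # → Letters: 10
print("Words in total: %s" % total)   # → Words in total: 4
print("Unique words: %s" % unique)    # → Unique words: 3
```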



## Letter statistics

I'm using the `DataFrame` class from the `Pandas` library to handle tabular data and show basic letter statistics for each corpus and for their combination. Python's native `Counter` class is used to count unique elements in a given sequence. The sequence in this case is the raw Greek text stripped of all special characters and spaces, and the elements are the letters of the Greek alphabet.

This will take some time to process too:

In [12]:
from functions import Counter, DataFrame
# perseus dataframe
df = DataFrame([[k, v] for k, v in Counter(ccontent1).items()])
df[2] = df[1].apply(lambda x: round(x*100/chars1, 2))
a = df.sort_values(1, ascending=False)
# first1k dataframe
df = DataFrame([[k, v] for k, v in Counter(ccontent2).items()])
df[2] = df[1].apply(lambda x: round(x*100/chars2, 2))
b = df.sort_values(1, ascending=False)
# perseus + first1k dataframe
df = DataFrame([[k, v] for k, v in Counter(ccontent3).items()])
df[2] = df[1].apply(lambda x: round(x*100/chars3, 2))
c = df.sort_values(1, ascending=False)
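The three blocks above repeat the same count, percentage, and sort steps. As an illustration of what each block computes, here is a minimal sketch that uses only `Counter` and plain tuples instead of a `DataFrame`. The function name `letter_table` is hypothetical.

```python
from collections import Counter


def letter_table(text):
    """Return (letter, count, percent) rows, most frequent letter first."""
    letters = [ch for ch in text if not ch.isspace()]
    total = len(letters)
    if total == 0:
        return []
    counts = Counter(letters)
    # most_common() sorts by count, descending.
    return [(letter, n, round(n * 100 / total, 2))
            for letter, n in counts.most_common()]


for row in letter_table("ΚΑΙ ΤΑ"):
    print(row)
```

Wrapping the steps in one function would also let the Perseus, First1K, and combined tables be produced with three one-line calls.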

#### Show letter statistics

The first column is the letter, the second column is the count of the letter, and the third column is the letter's percentage of all letters.

In [13]:
from functions import display_side_by_side
# show tables side by side to save some vertical space
display_side_by_side(Perseus=a, First1K=b, Perseus_First1K=c)

Perseus

Letter,Count,Percent
Α,5636525,10.96
Ε,4934559,9.6
Ο,4928002,9.59
Ι,4872354,9.48
Ν,4537851,8.83
Τ,3924588,7.63
Σ,3824160,7.44
Υ,2407552,4.68
Ρ,1977236,3.85
Η,1885144,3.67

First1K

Letter,Count,Percent
Α,12525389,11.01
Ο,11124835,9.78
Ι,11083598,9.74
Ε,10727970,9.43
Ν,9775036,8.59
Τ,9454047,8.31
Σ,8165588,7.18
Υ,5463093,4.8
Η,4369601,3.84
Ρ,4279639,3.76

Perseus_First1K

Letter,Count,Percent
Α,18161914,11.0
Ο,16052837,9.72
Ι,15955952,9.66
Ε,15662529,9.48
Ν,14312887,8.67
Τ,13378635,8.1
Σ,11989748,7.26
Υ,7870645,4.77
Ρ,6256875,3.79
Η,6254745,3.79


The `First1K` corpus contains mathematical texts in Greek, which explains why the rarely used letters digamma (Ϛ = 6), qoppa (Ϟ/Ϙ = 90), and sampi (Ϡ = 900) are included in the table. You can find other interesting differences too, such as the relative frequencies of E/T, K/Π, and M/Λ, which are probably explained by the different text genres included in the corpora.

#### Plotly bar chart for letter stats

The next chart shows visually which letters are used the most and the least in the available Ancient Greek corpora.

<img src="stats.png" />


The vowels, together with the consonants `N`, `S`, and `T`, stand out as the most used letters. The least used letters are `Z`, `Chi`, and `Psi`.

#### Optional live chart

Uncomment the next part to output a fresh graph from Plotly:

In [14]:
#from plotly.offline import init_notebook_mode
#init_notebook_mode(connected=False)

# for the first time, set the plotly service credentials; after that you can comment out the next lines
#import plotly
#plotly.tools.set_credentials_file(username='MarkoManninen', api_key='xyz')

# use tables and graphs...
#import plotly.tools as tls
# embed plotly graphs
#tls.embed("https://plot.ly/~MarkoManninen/8/")

### Unique words database

Now it is time to collect the unique Greek words into the database and show certain peculiarities of the word statistics. I'm reusing the data from the `greek_corpora` variable, which is already in memory. Running the next code will take a minute or two, depending on the processor speed of your computer:

In [15]:
from functions import syllabify, Abnum, greek, vowels

# greek abnum object for calculating isopsephical value
g = Abnum(greek)

# let's count the unique word statistics from the parsed greek corpora rather than from the plain text file;
# it would be pretty daunting to find the occurrences of all 800000+ unique words in a text
# file that is hundreds of megabytes in size!
unique_word_stats = {}
for item in greek_corpora:
    for word, cnt in item['uwords'].items():
        if word not in unique_word_stats:
            unique_word_stats[word] = 0
        unique_word_stats[word] += cnt

# init dataframe
df = DataFrame([[k, v] for k, v in unique_word_stats.items()])
# add column for the occurrence percentage of the word
df[2] = df[1].apply(lambda x: round(x*100/lwords3, 2))
# add column for the length of the word
df[3] = df[0].apply(lambda x: len(x))
# add isopsephy column
df[4] = df[0].apply(lambda x: g.value(x))
# add syllabified column
df[5] = df[0].apply(lambda x: syllabify(x))
# add length of the syllables column
df[6] = df[5].apply(lambda x: len(x))
# count vowels in the word as a column
df[7] = df[0].apply(lambda x: sum(list(x.count(c) for c in vowels)))
# count consonants in the word as a column
df[8] = df[0].apply(lambda x: len(x) - sum(list(x.count(c) for c in vowels)))
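To make the isopsephy and vowel/consonant columns concrete, here is a minimal stand-alone sketch that does not depend on the Abnum library. The letter values follow the standard Greek numeral system (including the Ϛ = 6, Ϟ = 90, and Ϡ = 900 values mentioned earlier); the names `GREEK_VALUES`, `VOWELS`, `isopsephy`, and `vowel_consonant_counts` are hypothetical and are not Abnum's API.

```python
# Standard Greek numeral values for plain uppercase letters.
GREEK_VALUES = {
    "Α": 1, "Β": 2, "Γ": 3, "Δ": 4, "Ε": 5, "Ϛ": 6, "Ζ": 7, "Η": 8, "Θ": 9,
    "Ι": 10, "Κ": 20, "Λ": 30, "Μ": 40, "Ν": 50, "Ξ": 60, "Ο": 70, "Π": 80,
    "Ϟ": 90, "Ρ": 100, "Σ": 200, "Τ": 300, "Υ": 400, "Φ": 500, "Χ": 600,
    "Ψ": 700, "Ω": 800, "Ϡ": 900,
}

# The seven Greek vowels; this set is an assumption about the `vowels` helper.
VOWELS = "ΑΕΗΙΟΥΩ"


def isopsephy(word):
    """Sum the numeral values of the letters in a plain uppercase word."""
    return sum(GREEK_VALUES[letter] for letter in word)


def vowel_consonant_counts(word):
    """Return (number of vowels, number of consonants) in the word."""
    vowels = sum(1 for letter in word if letter in VOWELS)
    return vowels, len(word) - vowels


print(isopsephy("ΚΑΙ"))               # → 31  (20 + 1 + 10)
print(isopsephy("ΤΩΝ"))               # → 1150  (300 + 800 + 50)
print(vowel_consonant_counts("ΤΩΝ"))  # → (1, 2)
```

A word containing a character outside the table raises a `KeyError`, which is a deliberate choice in this sketch: it surfaces text that was not fully normalized instead of silently skipping letters.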

### Store database

This is the single most important part of the chapter. I'm saving all the simplified unique words to a CSV file that can be used as a database for the riddle solver. After this, you may proceed to the [riddle solver](https://git.io/vASrY) Jupyter notebook document in interactive mode if you prefer.

In [16]:
from functions import csv_file_name
# save dataframe to CSV file
df.to_csv(csv_file_name, header=False, index=False, encoding='utf-8')
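Because the file is written without a header row, whoever reads it back has to supply the column meanings. Here is a minimal sketch using only the standard library. The column names in `COLUMNS` and the function name `read_word_rows` are hypothetical labels for the nine columns built above, and the sketch assumes the syllable lists were serialized as Python list literals such as `['ΚΑΙ']`.

```python
import ast
import csv
import io

# Hypothetical names for the nine positional columns of the database.
COLUMNS = ["word", "count", "percent", "length", "isopsephy",
           "syllables", "syllable_count", "vowels", "consonants"]
INT_COLUMNS = ("count", "length", "isopsephy",
               "syllable_count", "vowels", "consonants")


def read_word_rows(handle):
    """Parse header-less CSV rows from an open text handle into dicts."""
    rows = []
    for record in csv.reader(handle):
        row = dict(zip(COLUMNS, record))
        for key in INT_COLUMNS:
            row[key] = int(row[key])
        row["percent"] = float(row["percent"])
        # Assumed format: the list was stored as its Python repr.
        row["syllables"] = ast.literal_eval(row["syllables"])
        rows.append(row)
    return rows


sample = io.StringIO("ΚΑΙ,1775585,5.38,3,31,['ΚΑΙ'],1,2,1\n")
print(read_word_rows(sample)[0]["isopsephy"])  # → 31
```

With a real file, the same function could be called on `open(csv_file_name, encoding="utf-8", newline="")`.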

To confirm that the task succeeded, I will show the total number of unique words and the five most repeated words in the database:

In [31]:
from functions import display_html
# use to_html and index=False to hide index column and output table
words = df.sort_values(1, ascending=False).head(n=5)
print("Total records: %s" % len(df))
display_html(words.to_html(index=False), raw=True)

Total records: 831308


0,1,2,3,4,5,6,7,8
ΚΑΙ,1775585,5.38,3,31,[ΚΑΙ],1,2,1
ΔΕ,776055,2.35,2,9,[ΔΕ],1,1,1
ΤΟ,668822,2.03,2,370,[ΤΟ],1,1,1
ΤΩΝ,485441,1.47,3,1150,[ΤΩΝ],1,1,2
Η,481623,1.46,1,8,[Η],1,1,0


As expected, ΚΑΙ ("and") is by far the most common word.

As a curiosity, let's also see the longest words in the database:

In [19]:
from functions import HTML
# load result to the temporary variable for later usage
l = df.sort_values(3, ascending=False).head(n=20)
# output table
HTML(l.to_html(index=False))

0,1,2,3,4,5,6,7,8
ΠΑΡΕΓΕΝΟΜΕΝΟΜΕΝΟΣΗΝΚΑΙΕΤΙΕΚΤΗΣΛΕΣΒΟΥΟΥΦΑΜΕΝ,1,0.0,43,3554,"[ΠΑ, ΡΕ, ΓΕ, ΝΟ, ΜΕ, ΝΟ, ΜΕ, ΝΟ, ΣΗΝ, ΚΑΙ, Ε, ...",19,22,21
ΛΛΗΣΤΗΣΑΝΩΘΕΝΘΕΡΜΟΤΗΤΟΣΑΤΜΙΔΟΥΜΕΝΟΝΦΕΡΕΤΑΙ,1,0.0,42,4754,"[Λ, ΛΗ, ΣΤΗ, ΣΑ, ΝΩ, ΘΕΝ, ΘΕΡ, ΜΟ, ΤΗ, ΤΟ, ΣΑΤ...",18,19,23
ΕΜΟΥΟΙΑΠΕΦΕΥΓΑΧΕΙΡΑΣΛΥΠΗΣΑΣΜΕΝΟΥΔΕΝΑΟΥΔΕΝ,1,0.0,41,4579,"[Ε, ΜΟΥ, ΟΙ, Α, ΠΕ, ΦΕΥ, ΓΑ, ΧΕΙ, ΡΑΣ, ΛΥ, ΠΗ,...",18,24,17
ΠΥΡΟΒΡΟΜΟΛΕΥΚΕΡΕΒΙΝΘΟΑΚΑΝΘΙΔΟΜΙΚΡΙΤΡΙΑΔΥ,1,0.0,40,2798,"[ΠΥ, ΡΟ, ΒΡΟ, ΜΟ, ΛΕΥ, ΚΕ, ΡΕ, ΒΙΝ, ΘΟ, Α, ΚΑΝ...",18,19,21
ΔΥΝΑΤΟΝΔΕΤΟΑΙΤΙΑΙΗΣΓΕΝΕΣΕΩΣΚΑΙΤΗΣΦΘΟΡΑΣ,1,0.0,39,4481,"[ΔΥ, ΝΑ, ΤΟΝ, ΔΕ, ΤΟ, ΑΙ, ΤΙ, ΑΙ, ΗΣ, ΓΕ, ΝΕ, ...",17,20,19
ΠΥΡΒΡΟΜΟΛΕΥΚΕΡΕΒΙΝΘΟΑΚΑΝΘΟΥΜΙΚΤΡΙΤΥΑΔΥ,1,0.0,38,3704,"[ΠΥΡ, ΒΡΟ, ΜΟ, ΛΕΥ, ΚΕ, ΡΕ, ΒΙΝ, ΘΟ, Α, ΚΑΝ, Θ...",16,18,20
ΚΑΙΤΟΝΑΡΙΣΤΑΡΧΟΝΑΣΜΕΝΩΣΤΗΝΓΡΑΦΗΝΤΟΥ,1,0.0,35,4969,"[ΚΑΙ, ΤΟ, ΝΑ, ΡΙ, ΣΤΑΡ, ΧΟ, ΝΑ, ΣΜΕ, ΝΩ, ΣΤΗΝ,...",13,15,20
ΚΑΙΙΚΕΛΗΧΡΥΣΗΑΦΡΟΔΙΤΗΚΑΙΟΙΣΕΚΟΣΜΗΣΕ,1,0.0,35,3264,"[ΚΑΙ, Ι, ΚΕ, ΛΗ, ΧΡΥ, ΣΗ, Α, ΦΡΟ, ΔΙ, ΤΗ, ΚΑΙ,...",16,19,16
ΕΝΝΕΑΚΑΙΕΙΚΟΣΙΚΑΙΕΠΤΑΚΟΣΙΟΠΛΑΣΙΑΚΙΣ,1,0.0,35,1796,"[ΕΝ, ΝΕ, Α, ΚΑΙ, ΕΙ, ΚΟ, ΣΙ, ΚΑΙ, Ε, ΠΤΑ, ΚΟ, ...",17,20,15
ΑΡΣΕΝΙΚΩΝΟΝΟΜΑΤΩΝΣΤΟΙΧΕΙΑΕΣΤΙΠΕΝΤΕ,1,0.0,34,4768,"[ΑΡ, ΣΕ, ΝΙ, ΚΩ, ΝΟ, ΝΟ, ΜΑ, ΤΩΝ, ΣΤΟΙ, ΧΕΙ, Α...",15,17,17


How about finding out which words have the biggest isopsephical values?

In [32]:
# sort by the isopsephy column
m = df.sort_values(4, ascending=False).head(n=20)
# output table
HTML(m.to_html(index=False))

0,1,2,3,4,5,6,7,8
ΛΕΟΝΤΑΤΥΦΛΩΣΩΝΣΚΩΛΩΨΔΕΤΟΥ,1,0.0,25,6865,"[ΛΕ, ΟΝ, ΤΑ, ΤΥ, ΦΛΩ, ΣΩΝ, ΣΚΩ, ΛΩΨ, ΔΕ, ΤΟΥ]",10,11,14
ΟΡΘΡΟΦΟΙΤΟΣΥΚΟΦΑΝΤΟΔΙΚΟΤΑΛΑΙΠΩΡΩΝ,1,0.0,33,5186,"[ΟΡ, ΘΡΟ, ΦΟΙ, ΤΟ, ΣΥ, ΚΟ, ΦΑΝ, ΤΟ, ΔΙ, ΚΟ, ΤΑ...",14,16,17
ΒΡΥΣΩΝΟΘΡΑΣΥΜΑΧΕΙΟΛΗΨΙΚΕΡΜΑΤΩΝ,2,0.0,30,5122,"[ΒΡΥ, ΣΩ, ΝΟ, ΘΡΑ, ΣΥ, ΜΑ, ΧΕΙ, Ο, ΛΗ, ΨΙ, ΚΕΡ...",13,14,16
ΟΡΘΟΦΟΙΤΟΣΥΚΟΦΑΝΤΟΔΙΚΟΤΑΛΑΙΠΩΡΩΝ,2,0.0,32,5086,"[ΟΡ, ΘΟ, ΦΟΙ, ΤΟ, ΣΥ, ΚΟ, ΦΑΝ, ΤΟ, ΔΙ, ΚΟ, ΤΑ,...",14,16,16
ΓΛΩΣΣΟΤΟΜΗΘΕΝΤΩΝΧΡΙΣΤΙΑΝΩΝ,1,0.0,26,5056,"[ΓΛΩΣ, ΣΟ, ΤΟ, ΜΗ, ΘΕΝ, ΤΩΝ, ΧΡΙ, ΣΤΙ, Α, ΝΩΝ]",10,10,16
ΚΑΙΤΟΝΑΡΙΣΤΑΡΧΟΝΑΣΜΕΝΩΣΤΗΝΓΡΑΦΗΝΤΟΥ,1,0.0,35,4969,"[ΚΑΙ, ΤΟ, ΝΑ, ΡΙ, ΣΤΑΡ, ΧΟ, ΝΑ, ΣΜΕ, ΝΩ, ΣΤΗΝ,...",13,15,20
ΑΡΣΕΝΙΚΩΝΟΝΟΜΑΤΩΝΣΤΟΙΧΕΙΑΕΣΤΙΠΕΝΤΕ,1,0.0,34,4768,"[ΑΡ, ΣΕ, ΝΙ, ΚΩ, ΝΟ, ΝΟ, ΜΑ, ΤΩΝ, ΣΤΟΙ, ΧΕΙ, Α...",15,17,17
ΛΛΗΣΤΗΣΑΝΩΘΕΝΘΕΡΜΟΤΗΤΟΣΑΤΜΙΔΟΥΜΕΝΟΝΦΕΡΕΤΑΙ,1,0.0,42,4754,"[Λ, ΛΗ, ΣΤΗ, ΣΑ, ΝΩ, ΘΕΝ, ΘΕΡ, ΜΟ, ΤΗ, ΤΟ, ΣΑΤ...",18,19,23
ΕΠΙΣΚΟΠΩΚΩΝΣΤΑΝΤΙΝΟΥΠΟΛΕΩΣ,1,0.0,26,4701,"[Ε, ΠΙ, ΣΚΟ, ΠΩ, ΚΩΝ, ΣΤΑΝ, ΤΙ, ΝΟΥ, ΠΟ, ΛΕ, ΩΣ]",11,12,14
ΚΩΔΩΝΟΦΑΛΑΡΑΧΡΩΜΕΝΟΥΣ,1,0.0,21,4642,"[ΚΩ, ΔΩ, ΝΟ, ΦΑ, ΛΑ, ΡΑ, ΧΡΩ, ΜΕ, ΝΟΥΣ]",9,10,11


What percentage of the whole word base do the least repeated words take up?

In [21]:
# length of the words database
le = len(df)
# group words by occurrence and count grouped items, list the first 10 items
for x, y in df.groupby([1, 2]).count()[:10].T.items():
    print("words repeating %s time(s): " % x[0], round(100*y[0]/le, 2), "%")

words repeating 1 time(s):  44.95 %
words repeating 2 time(s):  15.86 %
words repeating 3 time(s):  7.48 %
words repeating 4 time(s):  4.84 %
words repeating 5 time(s):  3.32 %
words repeating 6 time(s):  2.5 %
words repeating 7 time(s):  1.92 %
words repeating 8 time(s):  1.59 %
words repeating 9 time(s):  1.28 %
words repeating 10 time(s):  1.11 %


Words that occur one to four times make up over 73% of the unique words in the database. Words that occur only once form the largest group by far, at almost 45%.
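The grouping behind these percentages can also be expressed without a `DataFrame`. The following minimal sketch takes a plain word-to-count dictionary like `unique_word_stats` and returns the share of unique words for each repetition count. The function name `repetition_shares` is hypothetical.

```python
from collections import Counter


def repetition_shares(word_counts, top=10):
    """Return (times repeated, percent of unique words) for the lowest counts."""
    total = len(word_counts)
    # How many distinct words occur exactly n times, for each n.
    groups = Counter(word_counts.values())
    return [(times, round(100 * groups[times] / total, 2))
            for times in sorted(groups)[:top]]


toy_counts = {"ΚΑΙ": 3, "ΔΕ": 1, "ΤΟ": 1, "ΤΩΝ": 2}
print(repetition_shares(toy_counts))
# → [(1, 50.0), (2, 25.0), (3, 25.0)]
```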

Finally, to cross-check the data processing algorithm, I want to know in which texts the longest words occur:

In [28]:
from functions import search_words_from_corpora
# using the already instantiated l variable, I'm collecting the plain text words
words = list(y[0] for x, y in l.T.items())
search_words_from_corpora(words, [perseus_dir, first1k_dir])

 + Aristophanes, Lysistrata (tlg0019.tlg007.perseus-grc2.xml) =>    

   ----- ΣΠΕΡΜΑΓΟΡΑΙΟΛΕΚΙΘΟΛΑΧΑΝΟΠΩΛΙΔΕΣ (1) -----
   ὦ ξύμμαχοι γυναῖκες ἐκθεῖτ ἔνδοθεν ὦ σπερμαγοραιολεκιθολαχανοπώλιδες ὦ σκοροδοπανδοκευτριαρτοπώλιδες

 + Aristophanes, Wasps (tlg0019.tlg004.perseus-grc1.xml) =>    

   ----- ΟΡΘΡΟΦΟΙΤΟΣΥΚΟΦΑΝΤΟΔΙΚΟΤΑΛΑΙΠΩΡΩΝ (1) -----
   ς ἀκούειν ἡδἔ εἰ καὶ νῦν ἐγὼ τὸν πατέρ ὅτι βούλομαι τούτων ἀπαλλαχθέντα τῶν ὀρθροφοιτοσυκοφαντοδικοταλαιπώρων τρόπων ζῆν βίον γενναῖον ὥσπερ Μόρυχος αἰτίαν ἔχω ταῦτα δρᾶν ξυνωμότης ὢν καὶ φρονῶν

 + Athenaeus, Deipnosophistae (tlg0008.tlg001.perseus-grc3.xml) =>    

   ----- ΠΥΡΒΡΟΜΟΛΕΥΚΕΡΕΒΙΝΘΟΑΚΑΝΘΟΥΜΙΚΤΡΙΤΥΑΔΥ (1) -----
   τις ἃ Ζανὸς καλέοντι τρώγματ ἔπειτ ἐπένειμεν ἐνκατακνακομιγὲς πεφρυγμένον πυρβρομολευκερεβινθοακανθουμικτριτυαδυ βρῶμα τοπανταναμικτον ἀμπυκικηροιδηστίχας παρεγίνετο τούτοις

   ----- ΒΡΥΣΩΝΟΘΡΑΣΥΜΑΧΕΙΟΛΗΨΙΚΕΡΜΑΤΩΝ (1) -----
   τῶν ἐξ Ἀκαδημίας τις ὑπὸ Πλάτωνα καὶ Βρυσωνοθρασυμαχειοληψικερμάτων πληγεὶς ἀνάγκῃ ληψολιγομίσθῳ

As a brief explanation: [Aristophanes](https://en.wikipedia.org/wiki/Aristophanes) was a Greek comic playwright and a word expert of a kind. Mathematical texts are also filled with long compound words, for fractions for example.

So that's all for the Greek corpora processing and basic statistics. One could further investigate the basic stats, and categorize and compare individual texts as well.

In [34]:
words = list(y[0] for x, y in m.T.items())
search_words_from_corpora(words, [perseus_dir, first1k_dir])

 + Appian, TheCivilWars (tlg0551.tlg017.perseus-grc2.xml) =>    

   ----- ΣΥΝΥΠΟΧΩΡΟΥΝΤΩΝ (1) -----
   καὶ ἡ σύνταξις ἤδη παρελέλυτο ὀξύτερον ὑπεχώρουν καί τῶν ἐπιτεταγμένων σφίσι δευτέρων καὶ τρίτων συνυποχωρούντων μισγόμενοι πάντες ἀλλήλοις ἀκόσμως ἐθλίβοντο ὑπὸ σφῶν καὶ τῶν πολεμίων ἀπαύστως αὐτοῖς ἐπικειμένων

 + Aristophanes, Wasps (tlg0019.tlg004.perseus-grc1.xml) =>    

   ----- ΟΡΘΡΟΦΟΙΤΟΣΥΚΟΦΑΝΤΟΔΙΚΟΤΑΛΑΙΠΩΡΩΝ (1) -----
   ς ἀκούειν ἡδἔ εἰ καὶ νῦν ἐγὼ τὸν πατέρ ὅτι βούλομαι τούτων ἀπαλλαχθέντα τῶν ὀρθροφοιτοσυκοφαντοδικοταλαιπώρων τρόπων ζῆν βίον γενναῖον ὥσπερ Μόρυχος αἰτίαν ἔχω ταῦτα δρᾶν ξυνωμότης ὢν καὶ φρονῶν

 + Athenaeus, Deipnosophistae (tlg0008.tlg001.perseus-grc3.xml) =>    

   ----- ΒΡΥΣΩΝΟΘΡΑΣΥΜΑΧΕΙΟΛΗΨΙΚΕΡΜΑΤΩΝ (1) -----
   τῶν ἐξ Ἀκαδημίας τις ὑπὸ Πλάτωνα καὶ Βρυσωνοθρασυμαχειοληψικερμάτων πληγεὶς ἀνάγκῃ ληψολιγομίσθῳ τέχνῃ σ

 + Athenaeus, TheDeipnosophists (tlg0008.tlg001.perseus-grc4.xml) =>    

   ----- ΒΡΥΣΩΝΟΘΡΑΣΥΜΑΧΕΙΟΛΗΨΙΚΕΡΜΑΤΩΝ (1) -----
   Βρυσωνοθ

## The [MIT](http://choosealicense.com/licenses/mit/) License

Copyright &copy; 2018 Marko Manninen