<img align="right" src="images/tf.png" width="128"/>
<img align="right" src="images/dans.png"/>
<img align="right" src="images/logo.png"/>

# Tutorial

This notebook gets you started with using
[Text-Fabric](https://annotation.github.io/text-fabric/) for coding in the Athenaeus corpus.

Familiarity with the underlying
[data model](https://annotation.github.io/text-fabric/tf/about/datamodel.html)
is recommended.

## Installing Text-Fabric

See the [installation instructions](https://annotation.github.io/text-fabric/tf/about/install.html).

## Tip
If you start computing with this tutorial, first copy its parent directory to somewhere else,
outside your repository.
If you pull changes from the repository later, your work will not be overwritten.
Where you put your tutorial directory is up to you.
It will work from any directory.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import collections

In [3]:
from tf.app import use

## Corpus data

Text-Fabric will fetch the Athenaeus corpus for you.

It will fetch the newest version by default, but you can get other versions as well.

The data will be stored in the `text-fabric-data` directory in your home directory.


# Features
The data of the corpus is organized in features.
They are *columns* of data.
Think of the text as a gigantic spreadsheet, where row 1 corresponds to the
first word, row 2 to the second word, and so on, for all 265,146 words.

Each piece of information about the words, including the text of the words, constitutes a column in that spreadsheet.

Instead of putting that information in one big table, the data is organized in separate columns.
We call those columns **features**.
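
As a rough mental model only (this is not a description of how Text-Fabric stores data internally), each feature can be pictured as its own mapping from row number to value. The sketch below uses the feature names `plain` and `lemma` from this corpus, but the values, the variable names and the helper `row` are invented for illustration.

```python
# Toy illustration: every feature is its own column, keyed by row (word) number.
# The values are invented for this sketch; they are not corpus data.
plain_column = {1: "the", 2: "guests", 3: "dined"}
lemma_column = {1: "the", 2: "guest", 3: "dine"}


def row(n):
    """Collect the values of all columns for row n, like one spreadsheet row."""
    return {"plain": plain_column.get(n), "lemma": lemma_column.get(n)}


for n in sorted(plain_column):
    print(n, row(n))
```

Keeping the columns separate means that a computation touching only `lemma` never needs to look at the other columns.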

# Incantation

The simplest way to get going is the *incantation* in the cell below.

You can control which version you get with checkout specifiers:

* for the most recent commit, use `hot`;
* for the latest release, use `latest`;
* if you have cloned the repos (TF app and data), use `clone`;
* if you do not want or need to upgrade, leave out the checkout specifiers.

In [6]:
A = use("pthu/athenaeus", hoist=globals())

The requested TF-app is not available offline
	~/text-fabric-data/pthu/athenaeus/app not found
rate limit is 5000 requests per hour, with 4955 left for this hour
	connecting to online GitHub repo pthu/athenaeus ... connected
	app/__init__.py...downloaded
	app/config.yaml...downloaded
	app/static...directory
		app/static/logo.png...downloaded
	OK


The requested data is not available offline
	~/text-fabric-data/pthu/athenaeus/Athenaeus/Deipnosophistae/tf not found
rate limit is 5000 requests per hour, with 4939 left for this hour
	connecting to online GitHub repo pthu/athenaeus ... connected
	downloading https://github.com/pthu/athenaeus/releases/download/v1.1/Athenaeus-Deipnosophistae-tf-1.1.zip ... 
	unzipping ... 
	saving data


This is Text-Fabric 9.2.3
Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html

25 features found and 0 ignored
   |     0.19s T otype                from ~/text-fabric-data/pthu/athenaeus/Athenaeus/Deipnosophistae/tf/1.1
   |     2.09s T oslots               from ~/text-fabric-data/pthu/athenaeus/Athenaeus/Deipnosophistae/tf/1.1
   |     0.06s T _sentence            from ~/text-fabric-data/pthu/athenaeus/Athenaeus/Deipnosophistae/tf/1.1
   |     1.49s T norm                 from ~/text-fabric-data/pthu/athenaeus/Athenaeus/Deipnosophistae/tf/1.1
   |     1.46s T lemma                from ~/text-fabric-data/pthu/athenaeus/Athenaeus/Deipnosophistae/tf/1.1
   |     0.01s T chapter              from ~/text-fabric-data/pthu/athenaeus/Athenaeus/Deipnosophistae/tf/1.1
   |     1.51s T plain                from ~/text-fabric-data/pthu/athenaeus/Athenaeus/Deipnosophistae/tf/1.1
   |     1.58s T main                 from ~/text-fabric-data/pthu/athenaeus/Athenaeus/Deipnoso

You can see which features have been loaded, and if you click on a feature name, you find its documentation.
If you hover over a name, you see where the feature is located on your system.

Edge features are marked by **_bold italic_** formatting.

There are ways to tweak the set of features that is loaded. You can load more features or fewer.

# Counting

In [7]:
A.indent(reset=True)
A.info("Counting nodes ...")

i = 0
for n in N.walk():
    i += 1

A.info("{} nodes".format(i))

  0.00s Counting nodes ...
  0.04s 305350 nodes


# Node types

In [8]:
F.otype.slotType

'word'

In [9]:
F.otype.all

('_book',
 'head',
 'book',
 'hi',
 'cit',
 'num',
 'add',
 'chapter',
 'pb',
 'p',
 'quote',
 'bibl',
 'l',
 '_sentence',
 'word')

In [10]:
C.levels.data

(('_book', 265146.0, 265147, 265147),
 ('head', 265146.0, 287484, 287484),
 ('book', 17676.4, 285974, 285988),
 ('hi', 3114.9358974358975, 287485, 287562),
 ('cit', 1586.0179640718563, 287317, 287483),
 ('num', 949.7745454545454, 298830, 299104),
 ('add', 335.9391634980989, 279921, 280709),
 ('chapter', 199.65813253012047, 285989, 287316),
 ('pb', 171.18463524854744, 300676, 302224),
 ('p', 168.78421387651179, 299105, 300675),
 ('quote', 84.7364043506078, 302225, 305350),
 ('bibl', 51.0862462006079, 280710, 285973),
 ('l', 23.49968935830301, 287563, 298829),
 ('_sentence', 17.94801326744737, 265148, 279920),
 ('word', 1, 1, 265146))

In [11]:
for (typ, av, start, end) in C.levels.data:
    print(f"{end - start + 1:>7} {typ}s")

      1 _books
      1 heads
     15 books
     78 his
    167 cits
    275 nums
    789 adds
   1328 chapters
   1549 pbs
   1571 ps
   3126 quotes
   5264 bibls
  11267 ls
  14773 _sentences
 265146 words
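
The counts above follow from the node ranges: nodes of one type are numbered consecutively, so a type with first node `start` and last node `end` has `end - start + 1` nodes. The sketch below replays that arithmetic without Text-Fabric, on three rows copied from the `C.levels.data` output above. The helper name `node_count` is invented for this sketch.

```python
# Rows copied from the C.levels.data output above:
# (node type, average slots per node, first node, last node).
levels = (
    ("book", 17676.4, 285974, 285988),
    ("chapter", 199.65813253012047, 285989, 287316),
    ("word", 1, 1, 265146),
)


def node_count(start, end):
    """Number of nodes in an inclusive, contiguous node range."""
    return end - start + 1


for (typ, avg, start, end) in levels:
    count = node_count(start, end)
    # Average slots per node times the node count gives the slots summed over all nodes of the type.
    print(f"{count:>7} {typ}s, {round(avg * count)} slots summed over all nodes")
```

For books and chapters the summed slots equal the total number of words, which suggests that both types cover the whole text.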


# Feature statistics

There are no linguistic features, as far as I can see, but there is `lemma`.

# Word matters

## Top 20 frequent words

In [12]:
for (w, amount) in F.lemma.freqList("word")[0:20]:
    print(f"{amount:>5} {w}")

24734 ὅς,ὁ
12995 καί
11725 δέ
 5710 ἐν,εἰς,εἰμί
 3139 φημί
 2917 οὗτος
 2696 αὐτός
 2457 ὅς,ὁ,τίς,τις
 2408 εἰμί
 2055 οὐ
 2007 γάρ
 1934 σύ,τις,τεός,τε
 1932 ὅς,ὁ,τίς,τις,τῷ
 1856 ὅς,ὁ,τίς
 1828 μέν
 1807 ὅς,ὡς
 1742 περί
 1390 ἐπί
 1321 λέγω1,λέγω
 1312 εἶμι,εἰς,εἰμί


## Hapaxes

In [13]:
hapaxes1 = sorted(lx for (lx, amount) in F.lemma.freqList("word") if amount == 1)
len(hapaxes1)

11197

In [14]:
for lx in hapaxes1[0:20]:
    print(lx)

*isgreek
*p
*ʼαγκυλητους
*ʼαδεσθαι
*ʼαδυφωνον
*ʼακουσομεθα
*ʼαναξαρχον
*ʼανδρομαχον
*ʼανθος
*ʼαντιφωντος
*ʼαπο
*ʼαποδιδωσι
*ʼαρπασθηναι
*ʼαφι
*ʼβρενθιν
*ʼβυζαντιους
*ʼγενη
*ʼγλαυκου
*ʼγραφει
*ʼγραφων


### Small occurrence base

The occurrence base of a word is the set of books in which the word occurs.
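
To make the definition concrete, here is a toy version on invented data, assuming each book is given as a plain list of lemmas. The names `toy_books` and `occurrence_base` are hypothetical and exist only in this sketch.

```python
import collections

# Toy corpus: book label -> lemmas occurring in it (invented for this sketch).
toy_books = {
    "1": ["καί", "δέ", "φημί"],
    "2": ["καί", "γάρ"],
    "3": ["καί", "δέ"],
}

# Map every lemma to the set of books it occurs in.
occurrence_base = collections.defaultdict(set)
for (book, lemmas) in toy_books.items():
    for lemma in lemmas:
        occurrence_base[lemma].add(book)

for lemma in sorted(occurrence_base):
    print(lemma, sorted(occurrence_base[lemma]))
```

Using a set per lemma means repeated occurrences within one book do not inflate the occurrence base.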

In [15]:
occurrenceBase = collections.defaultdict(set)

A.indent(reset=True)
A.info("compiling occurrence base ...")
for s in F.otype.s("book"):
    book = F.book.v(s)
    for w in L.d(s, otype="word"):
        occurrenceBase[F.lemma.v(w)].add(book)
A.info("done")
A.info(f"{len(occurrenceBase)} entries")

  0.00s compiling occurrence base ...
  0.17s done
  0.17s 23436 entries


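An overview of how many words have an occurrence base of each size.
Because there are only 15 books, there are only 15 possible sizes, so the first ten and the last ten rows printed below overlap.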

In [16]:
occurrenceSize = collections.Counter()

for (w, books) in occurrenceBase.items():
    occurrenceSize[len(books)] += 1

occurrenceSize = sorted(
    occurrenceSize.items(),
    key=lambda x: (-x[1], x[0]),
)

for (size, amount) in occurrenceSize[0:10]:
    print(f"books {size:>4} : {amount:>5} words")
print("...")
for (size, amount) in occurrenceSize[-10:]:
    print(f"books {size:>4} : {amount:>5} words")

books    1 : 12905 words
books    2 :  3623 words
books    3 :  1879 words
books    4 :  1178 words
books    5 :   779 words
books    6 :   572 words
books    7 :   435 words
books    8 :   375 words
books   15 :   346 words
books   10 :   296 words
...
books    6 :   572 words
books    7 :   435 words
books    8 :   375 words
books   15 :   346 words
books   10 :   296 words
books    9 :   283 words
books   11 :   216 words
books   12 :   206 words
books   13 :   172 words
books   14 :   171 words


Let's call a word *private* if its occurrence base is a single book.

In [17]:
privates = {w for (w, base) in occurrenceBase.items() if len(base) == 1}
len(privates)

12905

### Peculiarity of books

As a final exercise with books, let's make a list of all books, and show their

* total number of distinct words (lemmas)
* number of private words
* the percentage of private words: a measure of the peculiarity of the book
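
The three quantities in the list above can be sketched on toy data as follows. This is a simplified illustration, not the notebook's own cell: the data is invented, the function name `peculiarity` is hypothetical, and books without private words are kept in the result with 0% instead of being set aside.

```python
# Toy corpus: book label -> set of distinct lemmas (invented for this sketch).
toy_books = {
    "1": {"καί", "δέ", "φημί"},
    "2": {"καί", "γάρ"},
    "3": {"καί", "δέ"},
}


def peculiarity(books):
    """Return (book, #lemmas, #private, %private) rows, most peculiar first."""
    # Collect, per lemma, the books that contain it.
    owners = {}
    for (book, lemmas) in books.items():
        for lemma in lemmas:
            owners.setdefault(lemma, set()).add(book)
    # A lemma is private when exactly one book contains it.
    privates = {lemma for (lemma, base) in owners.items() if len(base) == 1}
    rows = []
    for (book, lemmas) in books.items():
        if not lemmas:
            continue  # skip empty books to avoid dividing by zero
        own = len(lemmas & privates)
        rows.append((book, len(lemmas), own, 100 * own / len(lemmas)))
    return sorted(rows, key=lambda e: (-e[3], -e[1], e[0]))


for (book, total, own, pct) in peculiarity(toy_books):
    print(f"{book:<5} {total:>4} {own:>4} {pct:>5.1f}%")
```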

In [18]:
bookList = []

empty = set()
ordinary = set()

for d in F.otype.s("book"):
    book = F.book.v(d)
    words = {F.lemma.v(w) for w in L.d(d, otype="word")}
    a = len(words)
    if not a:
        empty.add(book)
        continue
    o = len({w for w in words if w in privates})
    if not o:
        ordinary.add(book)
        continue
    p = 100 * o / a
    bookList.append((book, a, o, p))

bookList = sorted(bookList, key=lambda e: (-e[3], -e[1], e[0]))

print(f"Found {len(empty):>4} empty books")
print(f"Found {len(ordinary):>4} ordinary books (i.e. without private words)")

Found    0 empty books
Found    0 ordinary books (i.e. without private words)


In [19]:
print(
    "{:<20}{:>5}{:>5}{:>5}\n{}".format(
        "book",
        "#all",
        "#own",
        "%own",
        "-" * 35,
    )
)

for x in bookList[0:20]:
    print("{:<20} {:>4} {:>4} {:>4.1f}%".format(*x))
print("...")
for x in bookList[-20:]:
    print("{:<20} {:>4} {:>4} {:>4.1f}%".format(*x))

book                 #all #own %own
-----------------------------------
3                    4706 1099 23.4%
14                   4764 1091 22.9%
15                   3812  840 22.0%
13                   4936 1077 21.8%
11                   4581  999 21.8%
7                    4446  949 21.3%
5                    4064  831 20.4%
4                    4767  954 20.0%
2                    3779  742 19.6%
9                    4101  798 19.5%
1                    3704  692 18.7%
6                    4442  825 18.6%
10                   4319  752 17.4%
12                   4155  712 17.1%
8                    3471  544 15.7%
...
3                    4706 1099 23.4%
14                   4764 1091 22.9%
15                   3812  840 22.0%
13                   4936 1077 21.8%
11                   4581  999 21.8%
7                    4446  949 21.3%
5                    4064  831 20.4%
4                    4767  954 20.0%
2                    3779  742 19.6%
9                    4101  798 19.5%

# Next steps

By now you have an impression of how to compute with the Athenaeus corpus.
While this is still the beginning, I hope you already sense the power of unlimited programmatic access
to all the bits and bytes in the data set.

Here are a few directions for unleashing that power.

**(in progress, not all of the tutorials below exist yet!)**

* **[search](search.ipynb)** turbocharge your hand-coding with search templates

CC-BY Dirk Roorda