## How many words fit in a dactylic hexameter?

Homeric poetry is composed in dactylic hexameter: lines of six metrical feet, each a dactyl or a spondee. How many words make up a single hexameter?


This notebook loads a text of the *Iliad* and counts the words in each line. It then introduces a library named `histoutils` that defines some convenient methods for working with histograms (that is, counts of occurrences of some item).


### Load an *Iliad* text

The first cell below just loads two long texts: don't worry yet about how it works, but notice that at the end, we've created a named value called `iliad` and a named value called `scholia`. Each is a `Vector` of Strings, with one String per line of the source file.

In [13]:
import scala.io.Source
val iliadUrl = "https://raw.githubusercontent.com/neelsmith/summer2020nbs/master/data/iliad-dipl.txt"
val scholiaUrl = "https://raw.githubusercontent.com/neelsmith/summer2020nbs/master/data/scholia-dipl.txt"

val iliad = Source.fromURL(iliadUrl).getLines.toVector
val scholia = Source.fromURL(scholiaUrl).getLines.toVector

// sanity check: visually inspect a sample of 10 lines
println(iliad.take(10))



Vector(Μῆνιν ἄειδε θεὰ Πηληϊάδεω Ἀχιλῆος, οὐλομένην· ἡ μυρί' Ἀχαιοῖς ἄλγε' ἔθηκεν·, πολλὰς δ' ἰφθίμους ψυχὰς Ἄϊδι προΐαψεν, ἡρώων· αὐτοὺς δὲ ἑλώρια τεῦχε κύνεσσιν, οἰωνοῖσί τε πᾶσι· Διὸς δ' ἐτελείετο βουλή·, ἐξ οὗ δὴ τὰ πρῶτα διαστήτην ἐρίσαντε, Ἀτρείδης τε ἄναξ ἀνδρῶν καὶ δῖος Ἀχιλλεύς ·, τίς τάρ σφωε θεῶν ἔριδι ξυνἕηκε μάχεσθαι·, Λητοῦς καὶ Διὸς υἱός· ὁ γὰρ βασιλῆϊ χολωθεὶς, νοῦσον ἀνὰ στρατὸν ὦρσε κακήν· ὀλέκοντο δὲ λαοὶ .)


import scala.io.Source

iliadUrl: String = "https://raw.githubusercontent.com/neelsmith/summer2020nbs/master/data/iliad-dipl.txt"
scholiaUrl: String = "https://raw.githubusercontent.com/neelsmith/summer2020nbs/master/data/scholia-dipl.txt"
iliad: Vector[String] = Vector(
  "\u039c\u1fc6\u03bd\u03b9\u03bd \u1f04\u03b5\u03b9\u03b4\u03b5 \u03b8\u03b5\u1f70 \u03a0\u03b7\u03bb\u03b7\u03ca\u1f71\u03b4\u03b5\u03c9 \u1f08\u03c7\u03b9\u03bb\u1fc6\u03bf\u03c2",
  "\u03bf\u1f50\u03bb\u03bf\u03bc\u1f73\u03bd\u03b7\u03bd\u00b7 \u1f21 \u03bc\u03c5\u03c1\u03af' \u1f08\u03c7\u03b1\u03b9\u03bf\u1fd6\u03c2 \u1f04\u03bb\u03b3\u03b5' \u1f14\u03b8\u03b7\u03ba\u03b5\u03bd\u00b7",
  "\u03c0\u03bf\u03bb\u03bb\u1f70\u03c2 \u03b4' \u1f30\u03c6\u03b8\u03af\u03bc\u03bf\u03c5\u03c2 \u03c8\u03c5\u03c7\u1f70\u03c2 \u1f0c\u03ca\u03b4\u03b9 \u03c0\u03c1\u03bf\u0390\u03b1\u03c

So let's count "words".  We could spend a long time deciding what a word is, but let's keep it simple today by:

1. throwing away some punctuation characters
2. then splitting the long text on whitespace
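
As a minimal sketch of these two steps applied to a single line (the second line of the sample printed above), the following cell strips punctuation and then splits on whitespace. The names `sampleLine`, `punctuation`, `stripped` and `sampleWords` are chosen here for illustration and are not used elsewhere in the notebook. The Greek high stop is written as the escape `\u00b7` so that the character is unambiguous.

```scala
// Illustrative sketch: these names are hypothetical, not part of the notebook's data.
// Second line of the Iliad sample; the escape \u00b7 is the high stop (middle dot).
val sampleLine = "οὐλομένην\u00b7 ἡ μυρί' Ἀχαιοῖς ἄλγε' ἔθηκεν\u00b7"

// Step 1: throw away punctuation characters.
val punctuation = "[,\u00b7\\.:~]"
val stripped = sampleLine.replaceAll(punctuation, "")

// Step 2: split the remaining text on runs of whitespace.
val sampleWords = stripped.split("\\s+").toVector

println(sampleWords.size)
```

Note that the elision marks in `μυρί'` and `ἄλγε'` are left in place, so elided forms count as words in their own right.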



In [2]:
// We'll learn more later about what's going on here: the list of
// characters inside square brackets is actually a *regular expression*
val iliadNoPunct = iliad.map(ln => ln.replaceAll("[,·\\.:~]", ""))
// The expression [ \n]+ means "one or more occurrences of a space or newline character"
val iliadWordsByLine = iliadNoPunct.map(ln => ln.split("[ \n]+").toVector)

// Print the first 10 lines, split into words, to see if they look right...
println(iliadWordsByLine.take(10))


Vector(Vector(Μῆνιν, ἄειδε, θεὰ, Πηληϊάδεω, Ἀχιλῆος), Vector(οὐλομένην, ἡ, μυρί', Ἀχαιοῖς, ἄλγε', ἔθηκεν), Vector(πολλὰς, δ', ἰφθίμους, ψυχὰς, Ἄϊδι, προΐαψεν), Vector(ἡρώων, αὐτοὺς, δὲ, ἑλώρια, τεῦχε, κύνεσσιν), Vector(οἰωνοῖσί, τε, πᾶσι, Διὸς, δ', ἐτελείετο, βουλή), Vector(ἐξ, οὗ, δὴ, τὰ, πρῶτα, διαστήτην, ἐρίσαντε), Vector(Ἀτρείδης, τε, ἄναξ, ἀνδρῶν, καὶ, δῖος, Ἀχιλλεύς), Vector(τίς, τάρ, σφωε, θεῶν, ἔριδι, ξυνἕηκε, μάχεσθαι), Vector(Λητοῦς, καὶ, Διὸς, υἱός, ὁ, γὰρ, βασιλῆϊ, χολωθεὶς), Vector(νοῦσον, ἀνὰ, στρατὸν, ὦρσε, κακήν, ὀλέκοντο, δὲ, λαοὶ))


iliadNoPunct: Vector[String] = Vector(
  "\u039c\u1fc6\u03bd\u03b9\u03bd \u1f04\u03b5\u03b9\u03b4\u03b5 \u03b8\u03b5\u1f70 \u03a0\u03b7\u03bb\u03b7\u03ca\u1f71\u03b4\u03b5\u03c9 \u1f08\u03c7\u03b9\u03bb\u1fc6\u03bf\u03c2",
  "\u03bf\u1f50\u03bb\u03bf\u03bc\u1f73\u03bd\u03b7\u03bd \u1f21 \u03bc\u03c5\u03c1\u03af' \u1f08\u03c7\u03b1\u03b9\u03bf\u1fd6\u03c2 \u1f04\u03bb\u03b3\u03b5' \u1f14\u03b8\u03b7\u03ba\u03b5\u03bd",
  "\u03c0\u03bf\u03bb\u03bb\u1f70\u03c2 \u03b4' \u1f30\u03c6\u03b8\u03af\u03bc\u03bf\u03c5\u03c2 \u03c8\u03c5\u03c7\u1f70\u03c2 \u1f0c\u03ca\u03b4\u03b9 \u03c0\u03c1\u03bf\u0390\u03b1\u03c8\u03b5\u03bd",
  "\u1f21\u03c1\u1f7d\u03c9\u03bd \u03b1\u1f50\u03c4\u03bf\u1f7a\u03c2 \u03b4\u1f72 \u1f11\u03bb\u1f7d\u03c1\u03b9\u03b1 \u03c4\u03b5\u1fe6\u03c7\u03b5 \u03ba\u1f7b\u03bd\u03b5\u03c3\u03c3\u03b9\u03bd",
  "\u03bf\u1f30\u03c9\u03bd\u03bf\u1fd6\u03c3\u03af \u03c4\u03b5 \u03c0\u1fb6\u03c3\u0

In [4]:
val wordCountsByIliadLine =  iliadWordsByLine.map(words => words.size).toVector

wordCountsByIliadLine: Vector[Int] = Vector(
  5,
  6,
  6,
  6,
  7,
  7,
  7,
  7,
  8,
  8,
  5,
  8,
  7,
  6,
  7,
  6,
  6,
  7,
  7,
  8,
  5,
  6,
  7,
  6,
  8,
  8,
  7,
  9,
  9,
  7,
  6,
  9,
  9,
  7,
  8,
  6,
  6,
  7,
...

Group word counts into lists, then find the size of each list.
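
To see what this grouping does before running it on the whole poem, here is a small sketch on a made-up vector of five word counts. The names `sampleCounts`, `sampleGrouped` and `sampleTallies` are invented for this illustration.

```scala
// Illustrative sketch on a tiny, made-up vector of per-line word counts.
val sampleCounts = Vector(5, 6, 6, 7, 6)

// groupBy collects equal values together: each key maps to the list of its occurrences.
val sampleGrouped = sampleCounts.groupBy(count => count)

// The size of each list is the number of times that value occurred.
val sampleTallies = sampleGrouped.map { case (count, occurrences) => (count, occurrences.size) }

// Sort by key so the result prints in a predictable order.
println(sampleTallies.toVector.sortBy(_._1))
```

The value 6 appears three times in the sample, so its list has three entries and its tally is 3; 5 and 7 each have a tally of 1.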

In [5]:
val wcGrouped = wordCountsByIliadLine.groupBy(count => count).toVector

[36mwcGrouped[39m: [32mVector[39m[([32mInt[39m, [32mVector[39m[[32mInt[39m])] = [33mVector[39m(
  (
    [32m5[39m,
    [33mVector[39m(
      [32m5[39m,
      [32m5[39m,
      [32m5[39m,
      [32m5[39m,
      [32m5[39m,
      [32m5[39m,
      [32m5[39m,
      [32m5[39m,
      [32m5[39m,
      [32m5[39m,
      [32m5[39m,
      [32m5[39m,
      [32m5[39m,
      [32m5[39m,
      [32m5[39m,
      [32m5[39m,
      [32m5[39m,
      [32m5[39m,
      [32m5[39m,
      [32m5[39m,
      [32m5[39m,
      [32m5[39m,
      [32m5[39m,
      [32m5[39m,
      [32m5[39m,
      [32m5[39m,
      [32m5[39m,
      [32m5[39m,
      [32m5[39m,
      [32m5[39m,
      [32m5[39m,
      [32m5[39m,
      [32m5[39m,
      [32m5[39m,
      [32m5[39m,
...

In [6]:
val wcCounts = wcGrouped.map{ case (wordcount, occurrences) => (wordcount, occurrences.size)}


wcCounts: Vector[(Int, Int)] = Vector(
  (5, 1194),
  (10, 536),
  (1, 31),
  (6, 3556),
  (9, 1652),
  (2, 27),
  (12, 15),
  (7, 4810),
  (3, 24),
  (11, 114),
  (8, 3411),
  (4, 269)
)

In [8]:
wcCounts.sortBy(_._2)


res7: Vector[(Int, Int)] = Vector(
  (12, 15),
  (3, 24),
  (2, 27),
  (1, 31),
  (11, 114),
  (4, 269),
  (10, 536),
  (5, 1194),
  (9, 1652),
  (8, 3411),
  (6, 3556),
  (7, 4810)
)
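
Sorting by the second element of each pair puts the most frequent word count last. As a rough sketch, the same tallies can also be summarized numerically. The pairs below are copied from the `wcCounts` output; the names `tallies`, `totalLines`, `mostCommon`, `mostCommonShare` and `meanWords` are introduced here for illustration only. Because the tokenization is deliberately simple, the figures should be read as approximate.

```scala
// Tallies copied from the wcCounts output: (words per line, number of lines).
val tallies = Vector(
  (5, 1194), (10, 536), (1, 31), (6, 3556), (9, 1652), (2, 27),
  (12, 15), (7, 4810), (3, 24), (11, 114), (8, 3411), (4, 269)
)

// Total number of lines counted.
val totalLines = tallies.map(_._2).sum

// The most common number of words per line (the mode), with its share of all lines.
val (mostCommon, mostCommonLines) = tallies.maxBy(_._2)
val mostCommonShare = 100.0 * mostCommonLines / totalLines

// Mean words per line, weighting each word count by how often it occurs.
val meanWords = tallies.map { case (words, lines) => words * lines }.sum.toDouble / totalLines

println(f"$totalLines lines; most common: $mostCommon words ($mostCommonShare%.1f%%); mean: $meanWords%.2f")
```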

In [11]:
println(iliadWordsByLine.filter(_.size == 1))

Vector(Vector(), Vector(), Vector(), Vector(), Vector(), Vector(), Vector(), Vector(νῦν δ' έτι καὶ μᾶλλον νοέω φρεσὶ τιμήσασθαι), Vector(τείχεος ἐξελθεῖν ἄλλοι δ' ἔντοσθε μένουσι), Vector(ἠθεῖ' ῆ μὲν πολλὰ πατὴρ καὶ πότνια μήτηρ), Vector(ἂλλ' ἐμὸς ἔνδοθι θυμὸς ἐτείρετο πένθεϊ λυγρῷ), Vector(νῦν δ' ἰ̈θὺς μεμαῶτε μαχώμεθα μη δέ τι δούρων), Vector(οἱ δ' ὅτε δὴ σχεδὸν ἦσαν ἐπ ἀλλήλοισιν ἰ̈όντες), Vector(ἂλλ' άγε δεῦρο θεοὺς ἐπιδώμεθα τοὶ γὰρ ἄριστοι), Vector(ὡς οὐκ έστι λέουσι καὶ ἀνδράσιν ὅρκια πιστὰ), Vector(οὐδὲ λύκοι τε καὶ ἄρνες ὁμόφρονα θυμὸν ἔχουσιν), Vector(ἀλλὰ κακὰ φρονέουσι διαμπερὲς ἀλλήλοισιν), Vector(παντοίης ἀρετῆς μιμνήσκεο νῦν σε μάλα χρὴ), Vector(ἔγχει ἐμῷ δαμάᾳ νῦν δ' ἀθρόα πάντ' ἀποτίσεις), Vector(ἐννῆμαρ δὴ νεῖκος ἐν ἀθανάτοισιν όρωρεν), Vector(κλέψαι δ' ὀτρύνεσκον ἐΰσκοπον ἀργεϊφόντην), Vector(αῖψα μάλ' ἐς στρατὸν ἐλθὲ καὶ υἱέϊ σῶ ἐπίτειλον), Vector(εὗρ' αδινὰ στενάχοντα φίλοι δ' ἀμφ' αὐτὸν ἑταῖροι), Vector(ἐσσυμένως ἐπένοντο καὶ ἐντύνοντο ἄριστον), Vector(τοῖσι δ' ὄϊ
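
Several of these one-"word" entries are clearly whole lines, which suggests that the split pattern did not match whatever separates their words. The empty-looking `Vector()` entries are presumably lines that were empty after punctuation was removed, since splitting an empty string still yields a single empty string. One possible explanation for the unsplit lines, not confirmed by anything in this notebook, is that they use a whitespace character other than the plain space, such as a tab. The sketch below uses a made-up string (`sample`, with a tab inserted deliberately) to show how one might list the code points of the separator characters in a suspicious line.

```scala
// Hypothetical sample: two words separated by a tab instead of a space.
val sample = "νῦν\tδ'"

// The notebook's pattern only matches spaces and newlines, so nothing is split.
val naiveSplit = sample.split("[ \n]+").toVector
println(naiveSplit.size)

// List the code points of any whitespace or space-separator characters present.
val separators = sample.toVector
  .filter(ch => Character.isWhitespace(ch) || Character.isSpaceChar(ch))
  .map(ch => f"U+${ch.toInt}%04X")
  .distinct
println(separators)

// An empty string still yields one (empty) element when split.
println("".split("[ \n]+").toVector.size)
```

Running the same check on one of the real unsplit lines would show which character, if any, needs to be added to the split pattern.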

## Introducing the `histoutils` library


The next two cells configure this Jupyter notebook to find the `histoutils` library in a personal repository on the widely used site `bintray.com`.

In [14]:
val personalBintray = coursierapi.MavenRepository.of("https://dl.bintray.com/neelsmith/maven")
interp.repositories() ++= Seq(personalBintray)


personalBintray: coursierapi.MavenRepository = MavenRepository(https://dl.bintray.com/neelsmith/maven)

In [15]:
import $ivy.`edu.holycross.shot::histoutils:2.3.0`

Downloading https://repo1.maven.org/maven2/edu/holycross/shot/histoutils_2.12/2.3.0/histoutils_2.12-2.3.0.pom
Downloaded https://repo1.maven.org/maven2/edu/holycross/shot/histoutils_2.12/2.3.0/histoutils_2.12-2.3.0.pom
Downloading https://repo1.maven.org/maven2/edu/holycross/shot/histoutils_2.12/2.3.0/histoutils_2.12-2.3.0.pom.sha1
Downloaded https://repo1.maven.org/maven2/edu/holycross/shot/histoutils_2.12/2.3.0/histoutils_2.12-2.3.0.pom.sha1
Downloading https://dl.bintray.com/neelsmith/maven/edu/holycross/shot/histoutils_2.12/2.3.0/histoutils_2.12-2.3.0.pom
Downloaded https://dl.bintray.com/neelsmith/maven/edu/holycross/shot/histoutils_2.12/2.3.0/histoutils_2.12-2.3.0.pom
Downloading https://repo1.maven.org/maven2/org/wvlet/airframe/airframe-log_2.12/20.5.2/airframe-log_2.12-20.5.2.pom
Downloaded https://repo1.maven.org/maven2/org/wvlet/airframe/airframe-log_2.12/20.5.2/airframe-log_2.12-20.5.2.pom
Downloading https://repo1.maven.org/maven2/org/scala-lang/scala-library/2.12.11/scala-

import $ivy.$

### Standard Scala

The remaining cells import the `histoutils` library, convert each pair of word count and number of occurrences into a `Frequency`, and create a `Histogram` object from those frequencies.

In [16]:
import edu.holycross.shot.histoutils._

val wcFrequencies =
  wcCounts.map{ case (words, occurs) => Frequency(words, occurs)}

import edu.holycross.shot.histoutils._

wcFrequencies: Vector[Frequency[Int]] = Vector(
  Frequency(5, 1194),
  Frequency(10, 536),
  Frequency(1, 31),
  Frequency(6, 3556),
  Frequency(9, 1652),
  Frequency(2, 27),
  Frequency(12, 15),
  Frequency(7, 4810),
  Frequency(3, 24),
  Frequency(11, 114),
  Frequency(8, 3411),
  Frequency(4, 269)
)

In [17]:
val iliadWordsHisto = Histogram(wcFrequencies)

iliadWordsHisto: Histogram[Int] = Histogram(
  Vector(
    Frequency(5, 1194),
    Frequency(10, 536),
    Frequency(1, 31),
    Frequency(6, 3556),
    Frequency(9, 1652),
    Frequency(2, 27),
    Frequency(12, 15),
    Frequency(7, 4810),
    Frequency(3, 24),
    Frequency(11, 114),
    Frequency(8, 3411),
    Frequency(4, 269)
  )
)
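
The output above shows only that a `Histogram` wraps a `Vector` of `Frequency` values, each pairing an item with a count. To illustrate the idea of a histogram as a data structure, here is a minimal stand-in written in plain Scala. The classes `SimpleFrequency` and `SimpleHistogram`, and the methods `total` and `sortedByCount`, are hypothetical names invented for this sketch; they are not the `histoutils` API, whose actual methods are not shown in this notebook.

```scala
// Hypothetical stand-ins for illustration only: NOT the histoutils API.
case class SimpleFrequency[T](item: T, count: Int)

case class SimpleHistogram[T](frequencies: Vector[SimpleFrequency[T]]) {
  // Total number of observations across all items.
  def total: Int = frequencies.map(_.count).sum
  // Frequencies ordered from most to least common.
  def sortedByCount: Vector[SimpleFrequency[T]] = frequencies.sortBy(f => -f.count)
}

// Three of the tallies from wcCounts, wrapped in the stand-in classes.
val sketch = SimpleHistogram(Vector(
  SimpleFrequency(6, 3556),
  SimpleFrequency(7, 4810),
  SimpleFrequency(8, 3411)
))

println(sketch.sortedByCount.map(_.item))
println(sketch.total)
```

Wrapping the pairs in a dedicated type, rather than leaving them as bare tuples, gives operations such as sorting and totalling a single obvious home.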