# How similar/how different?

## Comparing the five extant texts of the Gettysburg Address

This notebook looks at word lengths and word frequencies.



## Load texts from GitHub

In [None]:
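import scala.io.Source
// Short names of the five versions; each corresponds to a .txt file in the repository.
val names = Vector("bancroft", "bliss", "everett", "hay", "nicolay")
val baseUrl = "https://raw.githubusercontent.com/neelsmith/gettysburg/master/texts/"
val urls = names.map(name => baseUrl + name + ".txt")
// Download each text and rejoin its lines into a single string.
val texts = urls.map(u => Source.fromURL(u).getLines().mkString("\n"))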

We can zip the names together with the text content and create a map, so that we can retrieve text content by name.

In [None]:
val textsMap = (names zip texts).toMap
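
As a small illustration of the zip-then-map pattern, here is a sketch using made-up stand-in data (`sampleNames` and `sampleTexts` are hypothetical and are not part of the notebook's state). It also shows the difference between looking up a key directly and looking it up with `get`.

```scala
// Hypothetical stand-in data; the notebook itself uses `names` and `texts`.
val sampleNames = Vector("a", "b", "c")
val sampleTexts = Vector("text of a", "text of b", "text of c")

// zip pairs elements by position; toMap turns the pairs into key -> value lookups.
val sampleMap = (sampleNames zip sampleTexts).toMap

// Direct lookup throws NoSuchElementException if the key is missing;
// get returns an Option instead.
val found = sampleMap("b")
val missing = sampleMap.get("z")
```

Using `get` is the safer choice whenever the key comes from user input rather than from the `names` vector itself.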

## Word length

We'll use Bancroft's text in this example.

We'll split (tokenize) the text on runs of one or more whitespace characters. For counting purposes we don't care about case, so we'll build the word list by converting every word to lower case. Note that splitting on whitespace alone leaves punctuation attached to the neighboring word.


In [None]:
val words = textsMap("bancroft").split("[\\s]+").toVector.map(w => w.toLowerCase)
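
To make the behavior of whitespace tokenization concrete, here is a minimal sketch on a short sample string (the string and the names `sample`, `rawTokens` and `cleanTokens` are illustrative only). The last step, stripping non-letters, is an assumption about what one might want and is not something the notebook does.

```scala
// Sample string for illustration only; the notebook tokenizes the full Bancroft text.
val sample = " of the people, by the people,\nfor the   People"

// A leading space would yield an empty first token, so trim before splitting.
val rawTokens = sample.trim.split("\\s+").toVector.map(_.toLowerCase)

// Optional extra step (an assumption, not part of the notebook):
// keep only letters so that "people," and "people" count as the same word.
val cleanTokens = rawTokens.map(_.filter(_.isLetter)).filter(_.nonEmpty)
```

Keeping only letters is a blunt tool: it would also remove apostrophes and hyphens inside words, so whether it is appropriate depends on the text being counted.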

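## Clustering Vector contents with `groupBy`

`groupBy` applies a function to every element of a collection and gathers the elements into a map keyed by the function's result. Here the key is each word's length.
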
In [None]:
val groupedBySize = words.groupBy(w => w.size)

In [None]:
val sortedKeys = groupedBySize.keySet.toVector.sorted.reverse

In [None]:
sortedKeys.map { k =>
  k + " characters long: " + groupedBySize(k).size + " word(s)."
}
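
The shape of the grouped result is easier to see on a handful of words. The sketch below uses a hypothetical word list (`demoWords`) and sorts the keys with a reversed ordering, which is one alternative to calling `sorted.reverse`.

```scala
// Hypothetical word list, small enough to inspect by eye.
val demoWords = Vector("of", "the", "people", "by", "the", "people")

// Each key is a word length; each value holds the words of that length,
// in the order they occurred (duplicates included).
val demoBySize = demoWords.groupBy(_.size)

// Sort the lengths from longest to shortest in one step.
val longestFirst = demoBySize.keys.toVector.sorted(Ordering[Int].reverse)

val report = longestFirst.map { k =>
  s"$k characters long: ${demoBySize(k).size} word(s)."
}
```

Because duplicates are kept, the group sizes count word occurrences (tokens), not distinct words.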

## Word frequency


In [None]:
val groupedByWord = words.groupBy(w => w).toVector


In [None]:
val frequencies = groupedByWord.map{ case (word, wordList) => (word, wordList.size)}

In [None]:
val mostToFewest = frequencies.sortBy{ case (w,freq) => freq }.reverse
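
The same three steps (group by the word itself, count each group, sort by count) can be traced on a tiny example. This is a sketch with a hypothetical word list (`sampleWords`); it also adds a tiebreaker on the word, since the iteration order of a `Map` is not guaranteed and words with equal counts could otherwise appear in any order.

```scala
// Hypothetical word list for illustration.
val sampleWords = Vector("of", "the", "people", "by", "the", "people", "for", "the", "people")

// Group identical words together, then replace each group with its size.
val tally = sampleWords
  .groupBy(identity)
  .toVector
  .map { case (word, occurrences) => (word, occurrences.size) }

// Sort by descending count; break ties alphabetically so the result is deterministic.
val ranked = tally.sortBy { case (word, freq) => (-freq, word) }
```

Negating the count in the sort key gives descending order directly, so no separate `reverse` is needed.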

## Options for plotting in a Jupyter notebook environment

Just because it's fun.

## Configuration


In [None]:
// Make plotly libraries available to this notebook:
import $ivy.`org.plotly-scala::plotly-almond:0.7.1`

In [None]:
// Import plotly libraries, and set display defaults suggested for use in Jupyter notebooks:
import plotly._, plotly.element._, plotly.layout._, plotly.Almond._
repl.pprinter() = repl.pprinter().copy(defaultHeight = 3)

In [None]:
// Split the (word, count) pairs into parallel vectors for the x and y axes.
val items = mostToFewest.map { case (word, _) => word }
val counts = mostToFewest.map { case (_, freq) => freq }

val zipf = Vector(
  Bar(x = items, y = counts)
)
plot(zipf)
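
A bar chart with one bar per distinct word can get crowded. One possible refinement, sketched here with made-up data, is to keep only the most frequent entries before plotting. The pairs in `sampleFreqs` and the cutoff `topN` are hypothetical and are not counts from any version of the address.

```scala
// Hypothetical (word, count) pairs, already sorted from most to fewest.
val sampleFreqs = Vector(("alpha", 5), ("beta", 4), ("gamma", 2), ("delta", 1), ("epsilon", 1))

// Assumed cutoff; pick whatever number of bars stays readable.
val topN = 3

// take keeps the first topN pairs; unzip splits them into two parallel vectors.
val (topItems, topCounts) = sampleFreqs.take(topN).unzip
```

In the notebook, the same `take` and `unzip` applied to `mostToFewest` would yield two vectors that could be passed to `Bar` in place of `items` and `counts`.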