# Build simplified ASCII text corpus

This notebook illustrates how to generate a citable corpus with Greek
text represented in a simplified ASCII form that is useful for computing character-level edit distance.
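
As a rough illustration of why a simplified form helps, the sketch below computes a standard Levenshtein edit distance in plain Scala. `editDistance` is a hypothetical helper written for this example, not a function from the libraries used in this notebook. Two spellings that differ only in accentuation are one edit apart in their original form, but identical once reduced to the same simplified string.

```scala
// Minimal Levenshtein distance, counting single-character insertions,
// deletions and substitutions. Illustrative helper only.
def editDistance(a: String, b: String): Int = {
  // prev(j): distance between the first i-1 characters of a
  // and the first j characters of b
  var prev = Array.range(0, b.length + 1)
  for (i <- 1 to a.length) {
    val curr = new Array[Int](b.length + 1)
    curr(0) = i
    for (j <- 1 to b.length) {
      val cost = if (a(i - 1) == b(j - 1)) 0 else 1
      curr(j) = math.min(
        math.min(curr(j - 1) + 1, prev(j) + 1),
        prev(j - 1) + cost
      )
    }
    prev = curr
  }
  prev(b.length)
}

// Acute vs. grave accent on the final vowel counts as one edit.
val accented = editDistance("θεά", "θεὰ")
// The same word in a simplified form compares as identical.
val simplified = editDistance("qea", "qea")
```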

Nodes in the new edition are distinguished by their version-level identifier, which is formed by appending the string `_simpleascii` to the version identifier.
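
To make the naming convention concrete, here is a minimal sketch that works on the URN as a plain string. `withVersion` is a hypothetical helper for illustration only; the notebook itself relies on the `addVersion` method of `CtsUrn`, and this sketch ignores exemplar-level identifiers.

```scala
// Replace (or add) the version part of a CTS URN's work component.
// Plain-string illustration; not the xcite library's implementation.
def withVersion(urn: String, version: String): String = {
  // limit -1 keeps a trailing empty passage component, e.g. "...e3:"
  val parts = urn.split(":", -1)
  require(parts.length >= 4, s"Not a CTS URN: $urn")
  // Work component is textgroup.work[.version]; keep the first two parts.
  val work = parts(3).split("\\.").take(2) :+ version
  parts.updated(3, work.mkString(".")).mkString(":")
}

val original = "urn:cts:greekLit:tlg0012.tlg001.e3:9.1"
val simplifiedUrn = withVersion(original, "e3_simpleascii")
// simplifiedUrn is "urn:cts:greekLit:tlg0012.tlg001.e3_simpleascii:9.1"
```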

This notebook creates simplified corpora for the *Iliad* text and *scholia* of book 9 in Upsilon 1.1, but the same `asciiCorpus` function can be used to convert any citable corpus of Greek text.


In [None]:
val personalRepo = coursierapi.MavenRepository.of("https://dl.bintray.com/neelsmith/maven")
interp.repositories() ++= Seq(personalRepo)

In [None]:
import $ivy.`edu.holycross.shot.cite::xcite:4.3.0`
import $ivy.`edu.holycross.shot::ohco2:10.20.3`
import $ivy.`edu.holycross.shot::greek:5.5.1`
import $ivy.`edu.holycross.shot.mid::orthography:2.0.0`

In [None]:
import edu.holycross.shot.cite._
import edu.holycross.shot.ohco2._
import edu.holycross.shot.greek._
import edu.holycross.shot.mid.orthography._


// Source files for corpora in the project's GitHub repository
// 
//val venetusAIliadUrl = "https://raw.githubusercontent.com/neelsmith/summer2020nbs/master/data/vaIliad-2020i.cex"
//val venetusAScholiaUrl = "https://raw.githubusercontent.com/neelsmith/summer2020nbs/master/data/hmt-2020i-noIliad.cex"
val twins9Url = "https://raw.githubusercontent.com/neelsmith/transmission-evolution/master/data/texts/diplomatic/twins9corpus.cex"



## Convert a corpus to simplified ASCII form

- use the `LiteraryGreekString` object to tokenize the text and filter it to keep only lexical tokens
- make a `LiteraryGreekString` from each lexical token, and drop accents, breathings and diaeresis
- recompose individual tokens into a single stripped-down string for each citable passage
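
The three steps above can be sketched without the `greek` library, using only the Java standard library. This is a stand-in under stated assumptions, not the behavior of `LiteraryGreekString`: whitespace splitting replaces real tokenization, Unicode decomposition replaces `stripBreathingAccent`, and the Beta Code-style letter table (`betaMap`) is an assumed mapping that may differ in detail from the library's own ASCII encoding. The names `betaMap`, `simplifyToken` and `simplifyPassage` are hypothetical.

```scala
import java.text.Normalizer
import java.util.Locale

// Assumed Beta Code-style transliteration table (illustrative only).
val betaMap: Map[Char, Char] = Map(
  'α' -> 'a', 'β' -> 'b', 'γ' -> 'g', 'δ' -> 'd', 'ε' -> 'e', 'ζ' -> 'z',
  'η' -> 'h', 'θ' -> 'q', 'ι' -> 'i', 'κ' -> 'k', 'λ' -> 'l', 'μ' -> 'm',
  'ν' -> 'n', 'ξ' -> 'c', 'ο' -> 'o', 'π' -> 'p', 'ρ' -> 'r', 'σ' -> 's',
  'ς' -> 's', 'τ' -> 't', 'υ' -> 'u', 'φ' -> 'f', 'χ' -> 'x', 'ψ' -> 'y',
  'ω' -> 'w'
)

// Decompose so accents and breathings become separate combining marks,
// lower-case, then keep only base letters found in the table.
// Punctuation and diacritics are dropped as a side effect.
def simplifyToken(token: String): String = {
  val decomposed = Normalizer.normalize(token, Normalizer.Form.NFD)
  val kept: String = decomposed.toLowerCase(Locale.ROOT).filter(ch => betaMap.contains(ch))
  kept.map(ch => betaMap(ch))
}

// Split on whitespace, simplify each token, discard tokens that end up
// empty (e.g. bare punctuation), and rejoin into one string per passage.
def simplifyPassage(text: String): String =
  text.split("\\s+").map(simplifyToken).filter(_.nonEmpty).mkString(" ")

val sample = simplifyPassage("Μῆνιν ἄειδε, θεά")
// sample is "mhnin aeide qea"
```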


In [None]:
// Convert a single CitableNode to simplified ASCII form.
// Siglum is a String to use for the version ID of the
// nodes of this corpus.
def curateNode(cn: CitableNode, siglum: String) : CitableNode = {
  if (cn.text.isEmpty){
    println("EMPTY TEXT: " + cn.urn)
    cn
  } else {

    val lexTokens = LiteraryGreekString.tokenizeNode(cn).filter(_.tokenCategory == Some(LexicalToken))
    val lgs = lexTokens.map(tkn => LiteraryGreekString(tkn.text).toLower.stripBreathingAccent.ascii)
    val simpleAscii = lgs.mkString(" ")
    CitableNode(cn.urn.addVersion(s"${siglum}_simpleascii"),simpleAscii)
  }
}


// Convert a corpus to simplified ASCII form.
// Siglum is a String to use for the version ID of the
// nodes of this corpus.
def asciiCorpus(c: Corpus, siglum: String) : Corpus = {
  Corpus(c.nodes.map(n => curateNode(n, siglum)))
}

In [None]:
// Create the source corpus
val twins9 = CorpusSource.fromUrl(twins9Url)
// Select the Iliad and scholia subsets by URN; each is converted to ASCII below
val upsilon9iliad = twins9 ~~ CtsUrn("urn:cts:greekLit:tlg0012.tlg001.e3:")
val upsilon9scholia = twins9 ~~ CtsUrn("urn:cts:greekLit:tlg5026.e3.hmt:")


//val venetusAscholia = CorpusSource.fromUrl(venetusAScholiaUrl)
//val venetusAiliad = CorpusSource.fromUrl(venetusAIliadUrl)


In [None]:
//val upsilon9iliad_ascii = asciiCorpus(upsilon9iliad, "e3")
val upsilon9scholia_ascii = asciiCorpus(upsilon9scholia, "e3")


In [None]:
val upsilon9iliad_ascii = asciiCorpus(upsilon9iliad, "e3")
