# Language Classification (Dataset API version)
### A naive approach to language classification.
This notebook explores the language classification problem by looking at the frequency distribution of the characters used in sample text.
Using the letter frequency we build simple models that will allow us to classify the language of new sentences.

For our exploration, we will use a dataset composed of treaties and other official documents of the European Commission. These documents are available in each language spoken in the European Union, which makes them ideal for building datasets of equivalent quality for each language.

## First, we will load some sample data and explore the character distribution in order to build up some intuition.

In [ ]:
val notebooksFolder = sys.env("NOTEBOOKS_DIR")
val baseFolder = s"$notebooksFolder/languageclassification/data"

notebooksFolder: String = /home/maasg/playground/sparkfun/spark-notebooks
baseFolder: String = /home/maasg/playground/sparkfun/spark-notebooks/languageclassification/data


## We import the implicits available in the Spark Session
They provide helpful conversions for `Column` operations and the implicit `Encoder` evidence required by typed `Dataset` operations.

In [ ]:
val spark = sparkSession // we need a stable identifier to import the implicits
import spark.implicits._

spark: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@2b10db5d
import spark.implicits._


### We load the English dataset to explore the data

In [ ]:
val en = sparkSession.read.textFile(baseFolder + "/en")

en: org.apache.spark.sql.Dataset[String] = [value: string]


In [ ]:
en

res11: org.apache.spark.sql.Dataset[String] = [value: string]


### We remove every character that is not a letter and convert the rest to lowercase

In [ ]:
import java.lang.Character
val cleanedLetters = en.flatMap(str => str.filter(ch => Character.isAlphabetic(ch)).map(ch => Character.toLowerCase(ch).toString))

import java.lang.Character
cleanedLetters: org.apache.spark.sql.Dataset[String] = [value: string]
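To see what this cleaning step does to a single line, here is a minimal sketch that applies the same filter and map to a plain `String`, without Spark. The helper name `lettersOf` and the sample sentence are made up for illustration and are not part of the notebook.

```scala
// Hypothetical helper: same per-line logic as the cell above, on a plain String.
def lettersOf(line: String): Seq[String] =
  line
    .filter(ch => Character.isAlphabetic(ch))          // drop digits, spaces, punctuation
    .map(ch => Character.toLowerCase(ch).toString)     // one lowercase letter per element

// Made-up sample line, not taken from the dataset.
val sampleLetters = lettersOf("Art. 5 (EU), 2016!")
println(sampleLetters.mkString)                        // prints "arteu"
```

Because `Character.isAlphabetic` works on Unicode code points, accented letters such as `à` or `á` pass the filter too, which matters for the non-English datasets.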


In [ ]:
cleanedLetters

res17: org.apache.spark.sql.Dataset[String] = [value: string]


### We count the total number of characters in our dataset
We will need this later to obtain relative frequency values.

In [ ]:
val total = cleanedLetters.count()

total: Long = 1670805


### ...and the frequency of each alphabetic character in the dataset
Note that the frequency is relative to the total count. This will enable us to compare different languages later on.
This is where the `Dataset` API and its SQL-like dialect turn out to be very handy.

As a reminder, this is what it looks like in the RDD API:
```
val freq = cleanedLetters.keyBy(char => char.toString.toLowerCase).countByKey.map{case (k,v) => (k.toString, v.toDouble/total)}
val freqOrdered = freq.toList.sortBy{case (k,v)=> k}
```

In [ ]:
val freq = cleanedLetters.groupBy($"value").agg(count($"value")/total as "freq").sort($"value".asc)


freq: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [value: string, freq: double]
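The relative-frequency computation can be checked by hand on a tiny input. The sketch below mirrors the group, count, divide and sort steps with plain Scala collections rather than Spark. The word `banana` and the names `sample`, `sampleTotal` and `sampleFreq` are illustrative assumptions, not part of the notebook.

```scala
// Made-up input: one single-letter String per element, like cleanedLetters.
val sample: Seq[String] = "banana".map(_.toString)

// Plays the role of `total` in the notebook.
val sampleTotal: Double = sample.size.toDouble

// Group identical letters, divide each count by the total, sort by letter.
val sampleFreq: Seq[(String, Double)] =
  sample
    .groupBy(identity)
    .map { case (letter, occurrences) => letter -> (occurrences.size / sampleTotal) }
    .toSeq
    .sortBy(_._1)

sampleFreq.foreach { case (letter, f) => println(s"$letter -> $f") }
```

Here `a` occurs 3 times out of 6, `b` once and `n` twice, so the frequencies add up to 1. That normalization is what makes distributions from datasets of different sizes comparable.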


### We can now look at what the frequency distribution looks like

In [ ]:
freq.take(30)

res33: Array[org.apache.spark.sql.Row] = Array([a,0.07892123856464399], [b,0.013841232220396755], [c,0.04509203647343646], [d,0.030138166931509062], [e,0.12519294591529234], [f,0.026006625548762423], [g,0.014792270791624396], [h,0.044525243819595946], [i,0.08238364141835822], [j,0.002664583838329428], [k,0.002872866672053292], [l,0.040070504936243305], [m,0.024439716184713356], [n,0.07904632796765632], [o,0.08121653933283657], [p,0.025622379631375296], [q,0.0011192209743207616], [r,0.06635484092997088], [s,0.05571326396557348], [t,0.10015710989612792], [u,0.02913805022130051], [v,0.00852882293265821], [w,0.0074197766944676365], [x,0.0023743046016740433], [y,0.011944541702951571], [z,2.1187391706393026E-4], [à,4.189597230077717E-6], [á,1.0174736130188742E-5], [ã,2.3940555600444098E-6], [...
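A raw array of rows is hard to read, so one possible way to get a quick picture is a text bar chart. The sketch below is plain Scala and only an illustration: the helper `bar`, the scale of 200 and the selection of letters are assumptions, and the values are rounded from the output above.

```scala
// Hypothetical helper: one '#' per 1/scale of relative frequency.
def bar(freq: Double, scale: Int = 200): String =
  "#" * math.round(freq * scale).toInt

// A few English frequencies, rounded to four decimals from the output above.
val someFreqs: Seq[(String, Double)] = Seq(
  "a" -> 0.0789,
  "e" -> 0.1252,
  "q" -> 0.0011,
  "t" -> 0.1002,
  "z" -> 0.0002
)

someFreqs.foreach { case (letter, f) => println(s"$letter ${bar(f)}") }
```

Even in this crude form, the dominance of `e` and `t` and the near absence of `q` and `z` in English are easy to spot.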