# Language Classification
### A naive approach to language classification.
This notebook explores the language classification problem by looking at the frequency distribution of the characters used in sample text.
Using the letter frequency we build simple models that will allow us to classify the language of new sentences.

For our exploration, we will use a dataset composed of treaties and other official documents of the European Commission. These are available in each language spoken in the European Union, which makes them ideal for building datasets of equivalent quality for each language.

## First, we will load some sample data and explore the character distribution in order to build up some intuition.

In [ ]:
val notebooksFolder = sys.env("NOTEBOOKS_DIR")
val baseFolder = s"$notebooksFolder/languageclassfication/data"

notebooksFolder: String = /home/maasg/playground/sparkfun/spark-notebooks
baseFolder: String = /home/maasg/playground/sparkfun/spark-notebooks/languageclassfication/data


### We load the English dataset to explore the data

In [ ]:
val en = sparkSession.sparkContext.textFile(baseFolder + "/en")

en: org.apache.spark.rdd.RDD[String] = /home/maasg/playground/sparkfun/spark-notebooks/languageclassfication/data/en MapPartitionsRDD[20] at textFile at <console>:70


### And we strip all characters other than letters from the text

In [ ]:
val cleanedLetters = en.flatMap(str => str).filter(java.lang.Character.isAlphabetic(_)).map(java.lang.Character.toLowerCase(_))

cleanedLetters: org.apache.spark.rdd.RDD[Char] = MapPartitionsRDD[23] at map at <console>:72


#### We can use cells to interactively test our filter method to be sure we are getting the results that we expect
The API parity between Scala collections and Spark lets us easily test the same logic on a local collection, such as a string in this case.

In [ ]:
val sample = "Erwägung protección jurídica ci-après à l’aide Gemäß 987..."
sample.filter(java.lang.Character.isAlphabetic(_)).map(java.lang.Character.toLowerCase(_))


sample: String = Erwägung protección jurídica ci-après à l’aide Gemäß 987...
res31: String = erwägungprotecciónjurídicaciaprèsàlaidegemäß


### We count the total number of letters in our dataset
We will need this later to obtain relative frequency values

In [ ]:
val total = cleanedLetters.count()

total: Long = 1670805


### ...and the frequency of each alphabetic character in the dataset
Note that the frequency is relative to the total count - this will enable us to compare different languages later on

In [ ]:
val freq = cleanedLetters.keyBy(char => char.toString.toLowerCase).countByKey.map{case (k,v) => (k.toString, v.toDouble/total)}

freq: scala.collection.Map[String,Double] = Map(e -> 0.12519294591529234, л -> 2.3940555600444098E-6, ς -> 5.386625010099922E-6, έ -> 1.1970277800222049E-6, s -> 0.05571326396557348, б -> 1.1970277800222049E-6, x -> 0.0023743046016740433, č -> 4.189597230077717E-6, α -> 6.583652790122127E-6, ā -> 2.9925694500555123E-6, n -> 0.07904632796765632, ε -> 6.583652790122127E-6, п -> 2.3940555600444098E-6, ω -> 1.1970277800222049E-6, ä -> 4.189597230077717E-6, ļ -> 5.985138900111024E-7, j -> 0.002664583838329428, y -> 0.011944541702951571, š -> 2.3940555600444098E-6, φ -> 1.1970277800222049E-6, μ -> 1.7955416700333072E-6, а -> 6.583652790122127E-6, t -> 0.10015710989612792, ó -> 2.9925694500555123E-6, в -> 5.985138900111024E-7, é -> 2.1546500040399688E-5, u -> 0.02913805022130051, ή -> 5.985138...
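
In symbols, with notation introduced here only for illustration: if $n(c)$ is the number of occurrences of letter $c$ in the cleaned text and $N$ is the total number of letters, the value stored in `freq` for $c$ is

$$
f(c) = \frac{n(c)}{N}, \qquad N = \sum_{c'} n(c')
$$

For instance, with $N = 1670805$ the value reported for `e`, about 0.1252, corresponds to roughly 209,000 occurrences. Since the values of $f$ sum to 1 for any corpus, distributions computed from corpora of different sizes can be compared directly.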

#### We are interested in the frequency of each letter, alphabetically ordered

In [ ]:
val freqOrdered = freq.toList.sortBy{case (k,v)=> k}

freqOrdered: List[(String, Double)] = List((a,0.07892123856464399), (b,0.013841232220396755), (c,0.04509203647343646), (d,0.030138166931509062), (e,0.12519294591529234), (f,0.026006625548762423), (g,0.014792270791624396), (h,0.044525243819595946), (i,0.08238364141835822), (j,0.002664583838329428), (k,0.002872866672053292), (l,0.040070504936243305), (m,0.024439716184713356), (n,0.07904632796765632), (o,0.08121653933283657), (p,0.025622379631375296), (q,0.0011192209743207616), (r,0.06635484092997088), (s,0.05571326396557348), (t,0.10015710989612792), (u,0.02913805022130051), (v,0.00852882293265821), (w,0.0074197766944676365), (x,0.0023743046016740433), (y,0.011944541702951571), (z,2.1187391706393026E-4), (à,4.189597230077717E-6), (á,1.0174736130188742E-5), (ã,2.3940555600444098E-6), (ä,4....

### We can now visualize what the frequency distribution looks like

In [ ]:
freqOrdered

res37: List[(String, Double)] = List((a,0.07892123856464399), (b,0.013841232220396755), (c,0.04509203647343646), (d,0.030138166931509062), (e,0.12519294591529234), (f,0.026006625548762423), (g,0.014792270791624396), (h,0.044525243819595946), (i,0.08238364141835822), (j,0.002664583838329428), (k,0.002872866672053292), (l,0.040070504936243305), (m,0.024439716184713356), (n,0.07904632796765632), (o,0.08121653933283657), (p,0.025622379631375296), (q,0.0011192209743207616), (r,0.06635484092997088), (s,0.05571326396557348), (t,0.10015710989612792), (u,0.02913805022130051), (v,0.00852882293265821), (w,0.0074197766944676365), (x,0.0023743046016740433), (y,0.011944541702951571), (z,2.1187391706393026E-4), (à,4.189597230077717E-6), (á,1.0174736130188742E-5), (ã,2.3940555600444098E-6), (ä,4.189597...

### Probably better to keep only the relevant part: the first 30 entries

In [ ]:
freqOrdered.take(30)


res39: List[(String, Double)] = List((a,0.07892123856464399), (b,0.013841232220396755), (c,0.04509203647343646), (d,0.030138166931509062), (e,0.12519294591529234), (f,0.026006625548762423), (g,0.014792270791624396), (h,0.044525243819595946), (i,0.08238364141835822), (j,0.002664583838329428), (k,0.002872866672053292), (l,0.040070504936243305), (m,0.024439716184713356), (n,0.07904632796765632), (o,0.08121653933283657), (p,0.025622379631375296), (q,0.0011192209743207616), (r,0.06635484092997088), (s,0.05571326396557348), (t,0.10015710989612792), (u,0.02913805022130051), (v,0.00852882293265821), (w,0.0074197766944676365), (x,0.0023743046016740433), (y,0.011944541702951571), (z,2.1187391706393026E-4), (à,4.189597230077717E-6), (á,1.0174736130188742E-5), (ã,2.3940555600444098E-6), (ä,4.189597...

## Let's put together these initial steps to process other languages and compare the distributions

In [ ]:
import java.lang.Character
import org.apache.spark.rdd.RDD
  
def letterFreq(rdd: RDD[String]): Seq[(String, Double)] = {
  val cleanedChars = rdd.flatMap(str => str).collect{case char if Character.isAlphabetic(char) => Character.toLowerCase(char)}
  val total = cleanedChars.count
  val freq = cleanedChars.keyBy(identity).countByKey()
  // here we transform characters to String to help us with limited support for 'char' in Spark Datasets
  val ordered = freq.map{case (k,v) => (k.toString, v.toDouble/total)}.toList.sortBy{case (k,v)=> k}
  ordered
}



import java.lang.Character
import org.apache.spark.rdd.RDD
letterFreq: (rdd: org.apache.spark.rdd.RDD[String])Seq[(String, Double)]


In [ ]:
val es = sparkSession.sparkContext.textFile(baseFolder + "/es")

es: org.apache.spark.rdd.RDD[String] = /home/maasg/playground/sparkfun/spark-notebooks/languageclassfication/data/es MapPartitionsRDD[39] at textFile at <console>:74


In [ ]:
val esLetterFreq = letterFreq(es).take(30)

esLetterFreq: Seq[(String, Double)] = List((a,0.10811093644933893), (b,0.011199629499334257), (c,0.055396417899813415), (d,0.056432327964339314), (e,0.12497161127711401), (f,0.007402220332114659), (g,0.008447036662644004), (h,0.002932388081634834), (i,0.07469295647953117), (j,0.004328445277675108), (k,4.1748121891156525E-5), (l,0.05847687265375555), (m,0.023087268047434772), (n,0.07069237311910795), (o,0.08934988711307841), (p,0.028785051723139814), (q,0.004136403916975788), (r,0.06678196570196962), (s,0.0724157355907749), (t,0.05165856938649187), (u,0.03707344552259742), (v,0.006658547120826858), (w,1.5585965506031768E-5), (x,0.0015324343942180521), (y,0.006487101500260509), (z,0.0020684802793005017), (º,9.462907628662146E-6), (à,5.566416252154203E-7), (á,0.0061186047443679), (ã,2.2265...

### Let's compare the frequency distribution of English vs Spanish

In [ ]:
BarChart(esLetterFreq)

res53: notebook.front.widgets.charts.BarChart[Seq[(String, Double)]] = <BarChart widget>


In [ ]:
BarChart(freqOrdered.take(30))

res55: notebook.front.widgets.charts.BarChart[List[(String, Double)]] = <BarChart widget>


### Let's create a classifier for 6 common European languages

In [ ]:
val languages = Seq("en", "es", "it", "de", "fr", "nl")

languages: Seq[String] = List(en, es, it, de, fr, nl)


In [ ]:
val langDataset = languages.map{lang => (lang, sparkContext.textFile(baseFolder + s"/$lang"))}

langDataset: Seq[(String, org.apache.spark.rdd.RDD[String])] = List((en,/home/maasg/playground/sparkfun/spark-notebooks/languageclassfication/data/en MapPartitionsRDD[95] at textFile at <console>:76), (es,/home/maasg/playground/sparkfun/spark-notebooks/languageclassfication/data/es MapPartitionsRDD[97] at textFile at <console>:76), (it,/home/maasg/playground/sparkfun/spark-notebooks/languageclassfication/data/it MapPartitionsRDD[99] at textFile at <console>:76), (de,/home/maasg/playground/sparkfun/spark-notebooks/languageclassfication/data/de MapPartitionsRDD[101] at textFile at <console>:76), (fr,/home/maasg/playground/sparkfun/spark-notebooks/languageclassfication/data/fr MapPartitionsRDD[103] at textFile at <console>:76), (nl,/home/maasg/playground/sparkfun/spark-notebooks/languagecl...

In [ ]:
val langLetterFreq = langDataset.map{case (lang, rdd) => (lang, letterFreq(rdd))}


langLetterFreq: Seq[(String, Seq[(String, Double)])] = List((en,List((a,0.07892123856464399), (b,0.013841232220396755), (c,0.04509203647343646), (d,0.030138166931509062), (e,0.12519294591529234), (f,0.026006625548762423), (g,0.014792270791624396), (h,0.044525243819595946), (i,0.08238364141835822), (j,0.002664583838329428), (k,0.002872866672053292), (l,0.040070504936243305), (m,0.024439716184713356), (n,0.07904632796765632), (o,0.08121653933283657), (p,0.025622379631375296), (q,0.0011192209743207616), (r,0.06635484092997088), (s,0.05571326396557348), (t,0.10015710989612792), (u,0.02913805022130051), (v,0.00852882293265821), (w,0.0074197766944676365), (x,0.0023743046016740433), (y,0.011944541702951571), (z,2.1187391706393026E-4), (à,4.189597230077717E-6), (á,1.0174736130188742E-5), (ã,2.3...

#### This case class provides a schema for the dataset, which helps us plot it.

In [ ]:
case class LanguageLetterFrequency(lang: String, letter: String, freq: Double)

defined class LanguageLetterFrequency


In [ ]:
val langByLetterFreq = langLetterFreq.flatMap{case (lang, freqList) => freqList.take(30).map{case (letter, freq) => LanguageLetterFrequency(lang, letter, freq)}}

langByLetterFreq: Seq[LanguageLetterFrequency] = List(LanguageLetterFrequency(en,a,0.07892123856464399), LanguageLetterFrequency(en,b,0.013841232220396755), LanguageLetterFrequency(en,c,0.04509203647343646), LanguageLetterFrequency(en,d,0.030138166931509062), LanguageLetterFrequency(en,e,0.12519294591529234), LanguageLetterFrequency(en,f,0.026006625548762423), LanguageLetterFrequency(en,g,0.014792270791624396), LanguageLetterFrequency(en,h,0.044525243819595946), LanguageLetterFrequency(en,i,0.08238364141835822), LanguageLetterFrequency(en,j,0.002664583838329428), LanguageLetterFrequency(en,k,0.002872866672053292), LanguageLetterFrequency(en,l,0.040070504936243305), LanguageLetterFrequency(en,m,0.024439716184713356), LanguageLetterFrequency(en,n,0.07904632796765632), LanguageLetterFreque...

In [ ]:
CustomPlotlyChart(langByLetterFreq,
                  layout="{title: 'Language Frequency Distribution Comparison'}",
                  dataOptions="""{type: 'bar', splitBy: 'lang'}""",
                  dataSources="{x: 'letter', y: 'freq'}")

res68: notebook.front.widgets.charts.CustomPlotlyChart[Seq[LanguageLetterFrequency]] = <CustomPlotlyChart widget>


# Naive Language Prediction

## We first need to apply the same process to the text to evaluate
Again, API parity between Spark and Scala to the rescue

In [ ]:
def sentenceFreq(str: String):List[(String, Double)] = {
  val cleaned = str.collect{case char if Character.isAlphabetic(char) => Character.toLowerCase(char)}
  val total = cleaned.size
  val freq = cleaned.groupBy(identity).map{case (letter, group) => (letter, group.size)}
  val ordered = freq.map{case (k,v) => (k.toString, v.toDouble/total)}.toList.sortBy{case (k,v)=> k}
  ordered
}



sentenceFreq: (str: String)List[(String, Double)]


In [ ]:
sentenceFreq("estamos haciendo un modelo para prediccion de lenguage")

res244: List[(String, Double)] = List((a,0.10638297872340426), (c,0.06382978723404255), (d,0.0851063829787234), (e,0.14893617021276595), (g,0.0425531914893617), (h,0.02127659574468085), (i,0.06382978723404255), (l,0.0425531914893617), (m,0.0425531914893617), (n,0.0851063829787234), (o,0.10638297872340426), (p,0.0425531914893617), (r,0.0425531914893617), (s,0.0425531914893617), (t,0.02127659574468085), (u,0.0425531914893617))


## Our classifier will use Euclidean distance between the sample and each of our pre-calculated models
The closer the sample is to a model, the higher the likelihood that we have a match

In [ ]:
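def classify(str: String): Seq[(String, Double)] = {
  val strFreq = sentenceFreq(str).toMap
  langLetterFreq.map { case (lang, refFreq) =>
    val freqMap = refFreq.toMap
    // sum of squared differences over the letters present in the sample
    val score = strFreq.map { case (letter, freq) =>
      // a letter the model has never seen counts as frequency 0 instead of throwing
      val modelFreq = freqMap.getOrElse(letter, 0.0)
      (freq - modelFreq) * (freq - modelFreq)
    }.sum
    (lang, Math.sqrt(score))
  }.sortBy(_._2)
}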
  

classify: (str: String)Seq[(String, Double)]
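
The `classify` function above sums squared differences only over the letters that occur in the sample. As a point of comparison, here is a minimal sketch of a Euclidean distance taken over every letter seen in either distribution. The helper name `euclideanDistance` and the two toy maps are assumptions made for this illustration; they are not part of the original notebook.

```scala
// Hypothetical helper, for illustration only
def euclideanDistance(sampleFreq: Map[String, Double], modelFreq: Map[String, Double]): Double = {
  // consider every letter that appears in either distribution
  val letters = sampleFreq.keySet ++ modelFreq.keySet
  // toSeq avoids collapsing equal squared differences, which mapping over a Set would do
  val squaredSum = letters.toSeq.map { letter =>
    val diff = sampleFreq.getOrElse(letter, 0.0) - modelFreq.getOrElse(letter, 0.0)
    diff * diff
  }.sum
  math.sqrt(squaredSum)
}

// toy distributions (made up)
val toySample = Map("a" -> 0.5, "b" -> 0.5)
val toyModel  = Map("a" -> 0.5, "c" -> 0.5)
val toyDistance = euclideanDistance(toySample, toyModel)
```

Here the `c` term contributes to the distance even though `c` never occurs in the sample, so a model is also penalized for letters it expects but the sample lacks. Whether that helps on short sentences would need to be measured.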


In [ ]:
classify("estamos haciendo un modelo para clasificar el lenguage usado en un documento")

res72: Seq[(String, Double)] = List((es,0.08238430605362929), (fr,0.11604577757832807), (it,0.12226442259056455), (en,0.1267000540758025), (nl,0.14883738620927084), (de,0.1525148741851856))


In [ ]:
classify("wij maken een model om de taal te bepalen die in een document gebruikt wordt")

res74: Seq[(String, Double)] = List((nl,0.07504696135635662), (de,0.10206072502408552), (fr,0.11896964686997333), (en,0.13026324061404027), (es,0.13912996402195157), (it,0.1538555940616276))


In [ ]:
classify("language")

res76: Seq[(String, Double)] = List((it,0.3036339326924917), (es,0.30591839606321103), (nl,0.3199615720241069), (de,0.3202349895470698), (en,0.32109956126123407), (fr,0.32133700290867645))


In [ ]:
classify("these little sentences")

res78: Seq[(String, Double)] = List((nl,0.2190859315831615), (de,0.22205888191068252), (fr,0.22548803266578976), (en,0.23355825718020554), (es,0.25317339037675696), (it,0.2637927804402644))


In [ ]:
classify("are too small")

res80: Seq[(String, Double)] = List((it,0.18531944139887466), (es,0.1934450899958764), (en,0.21982178935422475), (fr,0.22420064355561053), (nl,0.2658023366151163), (de,0.2732915571268681))


In [ ]:
classify("thus too little")

res90: Seq[(String, Double)] = List((en,0.2596796104106738), (it,0.2768858204608613), (fr,0.28314036779773477), (es,0.2969931139658706), (de,0.31659343865450057), (nl,0.32414700650827355))
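
One way to see why such short inputs are hard to classify: with only a handful of letters, every relative frequency is a coarse multiple of 1/n. The sketch below restates the cleaning rule of `sentenceFreq` in a hypothetical helper, `localFreq`, so that it runs on its own.

```scala
// Hypothetical helper, same cleaning rule as sentenceFreq
def localFreq(text: String): Map[Char, Double] = {
  val letters = text.filter(c => Character.isAlphabetic(c)).map(c => Character.toLowerCase(c))
  letters.groupBy(identity).map { case (c, occurrences) =>
    (c, occurrences.length.toDouble / letters.length)
  }
}

// "language" has 8 letters, so every frequency is a multiple of 1/8
val shortFreq = localFreq("language")
val freqOfA = shortFreq('a')
val freqOfG = shortFreq('g')
```

Both `a` and `g` occur twice in eight letters, giving 0.25 each, whereas the English model built earlier has roughly 0.079 for `a` and 0.015 for `g`. A single word is therefore far from every model, and the ranking between languages is decided by small differences.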


# Let's evaluate our classification model

## We need a test dataset

We want to see how our model performs with sentences of different lengths. We'll randomly sample the original text in order to create a labeled sample. 

Note that we are keeping only those samples where the sentence size matches the desired length.

In [ ]:
val sampleDataByLanguage = langDataset.map { case (lang, rdd) =>
  // keep letters and whitespace only, lowercased
  val sample = rdd.sample(withReplacement = false, fraction = 0.7)
    .map(str => str.collect { case char if Character.isAlphabetic(char) || Character.isWhitespace(char) => Character.toLowerCase(char) })
  sample.flatMap { str =>
    // random target length between 1 and 30 words
    val sentenceLength = scala.util.Random.nextInt(30) + 1
    // note: \W is ASCII-only, so accented letters also act as separators here
    val sampledWords = str.split("\\W").take(sentenceLength)
    // keep the line only if it has enough words to reach the target length
    if (sampledWords.size == sentenceLength) {
      Some((lang, sentenceLength, sampledWords.mkString(" ")))
    } else {
      None
    }
  }
}

sampleDataByLanguage: Seq[org.apache.spark.rdd.RDD[(String, Int, String)]] = List(MapPartitionsRDD[176] at flatMap at <console>:83, MapPartitionsRDD[179] at flatMap at <console>:83, MapPartitionsRDD[182] at flatMap at <console>:83, MapPartitionsRDD[185] at flatMap at <console>:83, MapPartitionsRDD[188] at flatMap at <console>:83, MapPartitionsRDD[191] at flatMap at <console>:83)
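
One detail of the cell above is worth checking before trusting the word counts. In Java regular expressions, `\W` matches anything outside `[a-zA-Z_0-9]` unless Unicode character classes are enabled, so accented letters are treated as separators. The sketch below compares that split with one on runs of non-letters; the sample phrase and the alternative pattern are choices made for this illustration, not something the notebook uses.

```scala
val phrase = "protección jurídica"

// \W is ASCII-only by default: 'ó' and 'í' split the words
val asciiSplit = phrase.split("\\W").toList

// splitting on runs of non-letters keeps accented words whole
val letterSplit = phrase.split("[^\\p{L}]+").toList
```

With the first pattern the phrase breaks into `protecci`, `n`, `jur` and `dica`, so a "word" in the test set may be a fragment and the accented letters themselves are dropped. With the second it stays as `protección` and `jurídica`.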


In [ ]:
val testDataset = sparkContext.union(sampleDataByLanguage).coalesce(8).toDF("language", "sentenceSize", "sentence")

testDataset: org.apache.spark.sql.DataFrame = [language: string, sentenceSize: int ... 1 more field]


In [ ]:
<h3>Our test dataset contains {testDataset.count()} records</h3>

res101: scala.xml.Elem = <h3>Our test dataset contains 48844 records</h3>


#### We can easily use shell functions
In this case, we clean up the target save dir to avoid issues when writing our dataset

In [ ]:
:sh rm -rf /tmp/testDataset


import sys.process._




### We save our test dataset for potential use later on...

In [ ]:
testDataset.write.parquet("/tmp/testDataset")

### Let's do some sanity checks on our test data

In [ ]:
testDataset.select($"language", $"sentenceSize").groupBy($"sentenceSize").count().orderBy($"sentenceSize").collect()

res117: Array[org.apache.spark.sql.Row] = Array([1,4530], [2,4052], [3,3894], [4,3615], [5,3456], [6,3240], [7,3106], [8,2997], [9,2930], [10,2722], [11,2541], [12,2299], [13,2158], [14,1897], [15,1557], [16,1232], [17,898], [18,668], [19,434], [20,263], [21,139], [22,101], [23,49], [24,28], [25,13], [26,6], [27,5], [28,1], [29,2])


### Use a UDF to apply the classify function to our dataset


In [ ]:
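// udf, when, lit and sum (used from here on) live in org.apache.spark.sql.functions
import org.apache.spark.sql.functions.{lit, sum, udf, when}

val classifyUDF = {
  val func: String => String = str => classify(str).head._1
  udf(func)
}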

classifyUDF: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,StringType,Some(List(StringType)))


### And we apply the classifier to the test dataset

In [ ]:
val hitAndMiss = testDataset.withColumn("lang_classification", classifyUDF($"sentence"))

hitAndMiss: org.apache.spark.sql.DataFrame = [language: string, sentenceSize: int ... 2 more fields]


#### Let's see some of the results

In [ ]:
hitAndMiss.sample(false, 0.001)

res122: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [language: string, sentenceSize: int ... 2 more fields]


## Let's evaluate the overall hit ratio
AKA Model Accuracy

In [ ]:
val hitCol = when($"language" === $"lang_classification", 1L).otherwise(0L)
val hits = hitAndMiss.withColumn("hit", hitCol).withColumn("counter", lit(1))

hitCol: org.apache.spark.sql.Column = CASE WHEN (language = lang_classification) THEN 1 ELSE 0 END
hits: org.apache.spark.sql.DataFrame = [language: string, sentenceSize: int ... 4 more fields]


In [ ]:
val hitRatio = hits.select(sum($"hit")/sum($"counter")).head.getDouble(0)

hitRatio: Double = 0.5528181242466956


## We learnt that our method performs badly on short texts
### How does the performance relate to text length (measured in words, not characters)?

In [ ]:
val hitRatioPerLength = hits.groupBy($"sentenceSize").agg(sum($"hit")/sum($"counter")).orderBy($"sentenceSize")

hitRatioPerLength: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [sentenceSize: int, (sum(hit) / sum(counter)): double]


# Model accuracy vs sentence length

In [ ]:
val ds = hitRatioPerLength.as[(Long, Double)]

ds: org.apache.spark.sql.Dataset[(Long, Double)] = [sentenceSize: int, (sum(hit) / sum(counter)): double]


In [ ]:
ds.collect

res134: Array[(Long, Double)] = Array((1,0.28963077603360604), (2,0.3569254185692542), (3,0.39979418574736303), (4,0.4559386973180077), (5,0.4881656804733728), (6,0.5452586206896551), (7,0.5520734409623298), (8,0.6035404141616566), (9,0.6347177848775293), (10,0.6615384615384615), (11,0.7100660707345511), (12,0.7177759056444819), (13,0.7181571815718157), (14,0.7434386716657739), (15,0.7503201024327785), (16,0.7451612903225806), (17,0.7407407407407407), (18,0.7348484848484849), (19,0.7562814070351759), (20,0.749034749034749), (21,0.7364864864864865), (22,0.6962025316455697), (23,0.6428571428571429), (24,0.5769230769230769), (25,0.5), (26,0.8), (27,0.75), (28,0.5), (29,1.0))
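
The ratios for the longest sentences rest on very few samples: the sanity check above counted 13 sentences of 25 words and six or fewer for each length beyond that. A minimal sketch of how the sample count could be carried alongside each ratio, using plain Scala collections and made-up records (the `Outcome` class and its data are hypothetical):

```scala
// Hypothetical record type and data, for illustration only
case class Outcome(sentenceSize: Int, hit: Boolean)

val outcomes = Seq(
  Outcome(1, hit = false), Outcome(1, hit = true), Outcome(1, hit = false), Outcome(1, hit = false),
  Outcome(2, hit = true), Outcome(2, hit = true),
  Outcome(29, hit = true)
)

// for each sentence size keep (hit ratio, number of samples)
val ratioWithSupport: Map[Int, (Double, Int)] =
  outcomes.groupBy(_.sentenceSize).map { case (size, group) =>
    (size, (group.count(_.hit).toDouble / group.size, group.size))
  }
```

A ratio of 1.0 backed by a single sample, as for size 29 in this toy data, says little about the classifier, so reading the two numbers together avoids over-interpreting the tail of the curve.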

