# $\chi^2$ selection of most discriminative terms per category

Some implementation details for the Spark version:
* Instead of Hadoop's distributed cache, all larger read-only values are shipped to the executors as broadcast variables. This applies to the stop words from `stopwords.txt` and to the outputs of the intermediate overall term frequency and category frequency "stages".
* Most functionality was implemented without `groupByKey` (the final merge job is the exception). This might be why the Spark implementation is faster.
* The input file is read using the Spark JSON parser and the resulting DataFrame is converted into an `RDD[Row]`.
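
The `groupByKey` point can be illustrated with a small sketch. It assumes a local SparkSession (the name `localSpark` and the toy pairs are made up for illustration) and counts the values per key twice: once by grouping all values, and once with `aggregateByKey`, which pre-aggregates within each partition before the shuffle. Both give the same counts; only the amount of shuffled data differs.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration; inside the notebook getOrCreate() returns the existing one.
val localSpark = SparkSession.builder().master("local[*]").appName("groupByKey-sketch").getOrCreate()
val pairs = localSpark.sparkContext.parallelize(Seq(("dog", "Pet"), ("dog", "Toys"), ("cat", "Pet")))

// groupByKey moves every single value through the shuffle before counting.
val viaGroup = pairs.groupByKey().mapValues(_.size).collect().toMap

// aggregateByKey counts inside each partition first and shuffles only the partial counts.
val viaAggregate = pairs.aggregateByKey(0)((n, _) => n + 1, _ + _).collect().toMap
```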

In [1]:
import scala.io.Source

Intitializing Scala interpreter ...

Spark Web UI available at http://localhost:8088/proxy/application_1587827373944_3898
SparkContext available as 'sc' (version = 2.4.0-cdh6.3.2, master = yarn, app id = application_1587827373944_3898)
SparkSession available as 'spark'


import scala.io.Source


In [2]:
val reviewsFile = "hdfs:///scratch/amazon-reviews/full/reviews_devset.json"
val outputPath = "hdfs:///scratch/e0test/dic2/output_rdd"

reviewsFile: String = hdfs:///scratch/amazon-reviews/full/reviews_devset.json
outputPath: String = hdfs:///scratch/e0test/dic2/output_rdd


In [3]:
val reviewsDF = spark.read.json(reviewsFile)
val reviews = reviewsDF.select("category", "reviewText").rdd

reviewsDF: org.apache.spark.sql.DataFrame = [asin: string, category: string ... 8 more fields]
reviews: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[6] at rdd at <console>:28


In [4]:
reviews.first()

res0: org.apache.spark.sql.Row = [Patio_Lawn_and_Garde,This was a gift for my other husband.  He's making us things from it all the time and we love the food.  Directions are simple, easy to read and interpret, and fun to make.  We all love different kinds of cuisine and Raichlen provides recipes from everywhere along the barbecue trail as he calls it. Get it and just open a page.  Have at it.  You'll love the food and it has provided us with an insight into the culture that produced it. It's all about broadening horizons.  Yum!!]


In [5]:
val stopwords = Source.fromFile("stopwords.txt").getLines().toSet

stopwords: scala.collection.immutable.Set[String] = Set(serious, latterly, looks, particularly, used, down, regarding, entirely, it's, regardless, moreover, please, ourselves, able, that's, behind, for, despite, maybe, viz, further, corresponding, any, wherein, across, name, allows, there's, this, haven't, instead, in, ought, myself, have, your, off, once, i'll, are, is, his, oh, why, rd, knows, too, among, course, greetings, somewhat, everyone, seen, likely, said, try, already, soon, nobody, got, given, using, less, am, consider, hence, than, accordingly, isn't, four, didn't, anyhow, want, three, forth, whereby, himself, specify, yes, throughout, inasmuch, but, you're, whether, sure, below, co, best, plus, becomes, what, unto, different, would, although, elsewhere, causes, another, cer...

In [6]:
val stopwordsVar = sc.broadcast(stopwords)

stopwordsVar: org.apache.spark.broadcast.Broadcast[scala.collection.immutable.Set[String]] = Broadcast(4)


In [8]:
def preprocess(line: org.apache.spark.sql.Row, stopwords: org.apache.spark.broadcast.Broadcast[Set[String]]): Set[(String, String)] = {
    // Regex alternation of delimiters: whitespace, digits and punctuation.
    val delimiters = """\s|\d|\.|!|\?|,|;|:|\(|\)|\[|]|\{|}|-|_|"|`|~|#|&|\*|%|\$|\|/"""
    val category = line(0).asInstanceOf[String]
    val text = line(1).asInstanceOf[String]
    // A Set keeps each token once per review, so later counts are document frequencies.
    val unigrams = text.split(delimiters).toSet
    // Stop words are removed before lowercasing, so this match is case-sensitive.
    val tokens = unigrams -- stopwords.value
    for (token <- tokens) yield {
        val tokenLower = token.toLowerCase()
        (category, tokenLower)
    }
}

val preprocessedReviews = reviews.flatMap(x => preprocess(x, stopwordsVar))

preprocess: (line: org.apache.spark.sql.Row, stopwords: org.apache.spark.broadcast.Broadcast[Set[String]])Set[(String, String)]
preprocessedReviews: org.apache.spark.rdd.RDD[(String, String)] = MapPartitionsRDD[7] at flatMap at <console>:43
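
The `preprocess` cell removes stop words before lowercasing and keeps the empty strings that `split` produces between adjacent delimiters. The following is a sketch of a variant that avoids both, not the notebook's actual code. The function name `tokenize`, the sample sentence and the exact delimiter set (including `|`, `\` and `/` as separate delimiters) are assumptions.

```scala
// Hypothetical variant of the tokenisation step; name and delimiter set are assumptions.
def tokenize(text: String, stopwords: Set[String]): Set[String] = {
    // A character class instead of an alternation; "+" collapses runs of delimiters.
    val delimiters = """[\s\d.!?,;:()\[\]{}\-_"`~#&*%$|\\/]+"""
    text.split(delimiters)
        .map(_.toLowerCase)             // lowercase first so "The" matches the stop word "the"
        .filter(_.nonEmpty)             // a leading delimiter yields an empty first token
        .filterNot(stopwords.contains)  // drop stop words
        .toSet                          // each term at most once per review
}

val tokens = tokenize("The dog's BARKING!! The end.", Set("the"))
```

Switching to this variant would change the token set, so the recorded outputs in the following cells would no longer match exactly.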


In [9]:
preprocessedReviews.first()

res1: (String, String) = (Patio_Lawn_and_Garde,make)


In [10]:
val swappedReviews = preprocessedReviews.map(pair => pair.swap)

swappedReviews: org.apache.spark.rdd.RDD[(String, String)] = MapPartitionsRDD[8] at map at <console>:27


In [11]:
swappedReviews.first()

res2: (String, String) = (make,Patio_Lawn_and_Garde)


In [12]:
val termFrequencies = swappedReviews.aggregateByKey(0)((n, v) => n + 1, (a, b) => a + b)

termFrequencies: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[9] at aggregateByKey at <console>:27


In [13]:
termFrequencies.first()

res3: (String, Int) = (vecindad,1)


In [14]:
val categories = reviews.map(x => (x(0), 1))
val categoryFrequencies = categories.reduceByKey((a, b) => a + b)

categories: org.apache.spark.rdd.RDD[(Any, Int)] = MapPartitionsRDD[10] at map at <console>:27
categoryFrequencies: org.apache.spark.rdd.RDD[(Any, Int)] = ShuffledRDD[11] at reduceByKey at <console>:28


In [15]:
categoryFrequencies.first()

res4: (Any, Int) = (Pet_Supplie,1235)


In [16]:
val termFrequenciesVar = sc.broadcast(termFrequencies.collect().toMap)
val categoryFrequenciesVar = sc.broadcast(categoryFrequencies.collect().toMap)
val totalRecordsVar = sc.broadcast(reviews.count()) // could also be counted with an accumulator, Spark's analogue of a Hadoop counter

termFrequenciesVar: org.apache.spark.broadcast.Broadcast[scala.collection.immutable.Map[String,Int]] = Broadcast(12)
categoryFrequenciesVar: org.apache.spark.broadcast.Broadcast[scala.collection.immutable.Map[Any,Int]] = Broadcast(14)
totalRecordsVar: org.apache.spark.broadcast.Broadcast[Long] = Broadcast(16)
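
The cell above notes that the record count could be obtained differently. A minimal sketch of that alternative uses a `LongAccumulator`, Spark's closest analogue to a Hadoop counter. It assumes a local session and toy data (`localSpark`, `sketchContext`, `toyReviews` and `recordCounter` are illustrative names). The update happens inside an action (`foreach`), where Spark applies each task's update only once; inside a transformation, re-executed tasks could be counted twice.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration; inside the notebook getOrCreate() returns the existing one.
val localSpark = SparkSession.builder().master("local[*]").appName("accumulator-sketch").getOrCreate()
val sketchContext = localSpark.sparkContext

val recordCounter = sketchContext.longAccumulator("records")
val toyReviews = sketchContext.parallelize(Seq("first review", "second review", "third review"))

// Increment once per record while the action walks over the data.
toyReviews.foreach(_ => recordCounter.add(1L))

// Read the total on the driver after the action has finished.
val counted: Long = recordCounter.sum
```

An accumulator only pays off when the count can piggyback on a pass over the data that is needed anyway; as a separate pass it does the same work as `count()`.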


`categoryTermReduce` is the main function of the job. It is applied to each `(category, [term])` pair and looks up the broadcast term frequencies and category frequencies, which acts as a map-side join. From these counts it computes the $\chi^2$ value of every term in the category and returns a `(category, [(term, $\chi^2$)])` pair that holds the 200 highest-scoring terms, sorted by descending $\chi^2$.
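
For reference, the standard $\chi^2$ statistic for a term $t$ and a category $c$ is $\chi^2_{tc} = \frac{N (AD - BC)^2}{(A+B)(A+C)(B+D)(C+D)}$. Here $A$ is the number of reviews in $c$ that contain $t$, $B$ the number of reviews outside $c$ that contain $t$, $C$ the number of reviews in $c$ without $t$, $D$ the number of reviews outside $c$ without $t$, and $N = A + B + C + D$. The sketch below is a stand-alone helper with made-up counts, not the notebook's code; the name `chiSquare` and the convention of returning 0 for an empty margin are assumptions.

```scala
// Hypothetical helper: chi-square from the four cells of the 2x2 contingency table.
def chiSquare(a: Long, b: Long, c: Long, d: Long): Double = {
    val n = (a + b + c + d).toDouble
    // Work in Double so the products cannot overflow Int or Long.
    val numerator = n * math.pow(a.toDouble * d - b.toDouble * c, 2)
    val denominator = (a + b).toDouble * (a + c) * (b + d) * (c + d)
    // Assumption: a table with an empty row or column scores 0.
    if (denominator == 0.0) 0.0 else numerator / denominator
}

// Toy counts: A = 10, B = 20, C = 30, D = 40.
val score = chiSquare(10, 20, 30, 40)

// Term and category are independent here, so the score is 0.
val independent = chiSquare(10, 10, 10, 10)
```

The notebook's function leaves out the factor $N$. Since $N$ is the same for every term and category, this scales all values equally and does not change the ranking.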

In [17]:
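def categoryTermReduce(category: Any,
                       terms: Iterable[Any],
                       categoryFrequencies: org.apache.spark.broadcast.Broadcast[Map[Any,Int]],
                       termFrequencies: org.apache.spark.broadcast.Broadcast[Map[String,Int]],
                       totalRecords: org.apache.spark.broadcast.Broadcast[Long]): (Any, Iterable[(String, Double)]) = {
    // Each review contributes a term at most once, so these are document counts.
    val termFrequenciesInCategory = terms.groupBy(identity).mapValues(_.size)
    val x2map = for (x <- termFrequenciesInCategory) yield {
        val term = x._1.toString
        val A = x._2                                 // reviews in the category that contain the term
        val G = categoryFrequencies.value(category)  // reviews in the category
        val E = termFrequencies.value(term)          // reviews that contain the term
        val F = totalRecords.value - E               // reviews that do not contain the term
        val C = G - A                                // reviews in the category without the term
        val B = E - A                                // reviews outside the category with the term
        val D = F - C                                // reviews outside the category without the term
        // Multiply as Double: the Int/Long products can overflow on large inputs.
        // The constant factor N of the standard statistic is left out; it does not change the ranking.
        val x2 = math.pow(A.toDouble * D - B.toDouble * C, 2) / ((A + B).toDouble * (A + C) * (B + D) * (C + D))
        (term, x2)
    }

    // Keep the 200 terms with the highest score, in descending order.
    val x2top = x2map.toList.sortBy(x => -x._2).take(200)
    (category, x2top)
}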

val result = preprocessedReviews.groupByKey().map(x => categoryTermReduce(x._1, x._2, categoryFrequenciesVar, termFrequenciesVar, totalRecordsVar))

categoryTermReduce: (category: Any, terms: Iterable[Any], categoryFrequencies: org.apache.spark.broadcast.Broadcast[Map[Any,Int]], termFrequencies: org.apache.spark.broadcast.Broadcast[Map[String,Int]], totalRecords: org.apache.spark.broadcast.Broadcast[Long])(Any, Iterable[(String, Double)])
result: org.apache.spark.rdd.RDD[(Any, Iterable[(String, Double)])] = MapPartitionsRDD[13] at map at <console>:58


The output is a list of (term, $\chi^2$) tuples for each category. This could be further transformed and formatted into a nicer string representation, but for now it is simply written to a text file as is.
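
One possible string representation is a single line per category, with the category name followed by space-separated `term:score` entries. This is only a sketch; the function name `formatCategory` and the line layout are assumptions, not a format the notebook prescribes.

```scala
// Hypothetical formatter; the "category term:score ..." layout is an assumption.
def formatCategory(category: String, terms: Seq[(String, Double)]): String =
    (category +: terms.map { case (term, x2) => s"$term:$x2" }).mkString(" ")

// Toy input with made-up scores.
val line = formatCategory("Pet_Supplie", Seq(("dog", 0.5), ("cat", 0.25)))
```

Applied to the job's output, this would amount to mapping each `(category, terms)` pair through `formatCategory(category.toString, terms.toSeq)` before saving.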

In [18]:
result.first()

res5: (Any, Iterable[(String, Double)]) = (Pet_Supplie,List((dog,0.11808064730315776), (dogs,0.0693459289377741), (cat,0.05555264367426396), (cats,0.05181526766625477), (litter,0.021968767375091326), (pet,0.02176846068868463), (puppy,0.020849370279927015), (leash,0.018338556982611172), (collar,0.018183899560217088), (vet,0.018089186938466333), (treats,0.015990441639137484), (chew,0.013506403042664305), (fleas,0.011293069645448874), (food,0.010075133881935329), (lab,0.008638117840062135), (chewed,0.00850943133729101), (terrier,0.007847700838820342), (barking,0.007395087831878466), (tank,0.006430340680335897), (chewer,0.006376894461889176), (kitten,0.006003710080323079), (paw,0.005990132121567097), (dog's,0.005790685132380752), (crate,0.0057155866159485065), (aquarium,0.005715586615948506...

In [19]:
result.saveAsTextFile(outputPath)