# Amazon-reviews predictions with Spark ML

In this notebook we demonstrate how to use Apache Spark to process a large text corpus of Amazon reviews and how to make predictions with Spark ML. The notebook proceeds in four steps:
 1. Read the data and transform it to an RDD
 2. Preprocess with RDDs and calculate chi-squared values for each token
 3. Construct a Spark ML pipeline for preprocessing and feature extraction
 4. Train and validate a text classification model

First, we create a SparkSession with the following Spark configuration.

In [1]:
import org.apache.spark.sql.SparkSession
import org.apache.spark.SparkConf

val conf = new SparkConf()
      .setMaster("yarn")
      .set("spark.executor.memory", "4g")
      .set("spark.driver.memory", "4g")
      .set("spark.driver.maxResultSize", "2g")
      .set("spark.executor.instances", "5")
      .set("spark.executor.cores", "4")
      .set("spark.default.parallelism", "20")

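// Initialize the SparkSession. Note: it is bound to the name `sc`, which shadows
// the SparkContext that the kernel provides under the same name.
val sc = SparkSession.builder.config(conf).getOrCreate()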

Intitializing Scala interpreter ...

Spark Web UI available at http://captain01.os.hpc.tuwien.ac.at:9999/proxy/application_1715326141961_2701
SparkContext available as 'sc' (version = 3.2.3, master = yarn, app id = application_1715326141961_2701)
SparkSession available as 'spark'


import org.apache.spark.sql.SparkSession
import org.apache.spark.SparkConf
conf: org.apache.spark.SparkConf = org.apache.spark.SparkConf@71347310
sc: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@507ceebf


## Read data and transform to RDD

Next, we set global variables, load the Amazon reviews dataset, and load the stop words. We also define helper functions that write results to the local file system.

Set global variables

In [2]:
val K = 75
val file_path_stopwords = "../data/stopwords.txt"
val file_path_reviews = "hdfs:///user/dic24_shared/amazon-reviews/full/reviews_devset.json"
// val file_path_reviews = "hdfs:///user/dic24_shared/amazon-reviews/full/reviewscombined.json"

val tokenizePattern = "[^a-zA-Z<>^|]+"

K: Int = 75
file_path_stopwords: String = ../data/stopwords.txt
file_path_reviews: String = hdfs:///user/dic24_shared/amazon-reviews/full/reviews_devset.json
tokenizePattern: String = [^a-zA-Z<>^|]+
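
To make the effect of the tokenization pattern concrete, here is a minimal sketch that applies it to a made-up sentence. The names `delimiterPattern`, `sampleReview` and `sampleTokens` are assumptions introduced for illustration only. Because the pattern is a negated character class, every run of characters other than letters and `<`, `>`, `^`, `|` acts as a delimiter.

```scala
// Same value as tokenizePattern above: runs of characters that are NOT
// letters or one of < > ^ | are treated as token delimiters.
val delimiterPattern = "[^a-zA-Z<>^|]+"

// Hypothetical review text, used only for illustration.
val sampleReview = "Great product, works 100% as advertised!"

// Lowercase, split on the delimiter pattern, and drop tokens shorter than
// two characters (this also removes the empty string that a leading
// delimiter would produce).
val sampleTokens: Seq[String] = sampleReview
  .toLowerCase
  .split(delimiterPattern)
  .filter(_.length > 1)
  .toSeq
```

Here `sampleTokens` contains `great`, `product`, `works`, `as` and `advertised`; digits and punctuation never end up inside a token. Stop words such as `as` are removed in a separate step.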


Load the Amazon review dataset

In [3]:
%%time
val df = sc.read.json(file_path_reviews).select("category", "reviewText")

Time: 7.644007444381714 seconds.



df: org.apache.spark.sql.DataFrame = [category: string, reviewText: string]


Load the stopwords

In [4]:
import scala.io.Source.fromFile

val stopWords = fromFile(file_path_stopwords).getLines.toArray

import scala.io.Source.fromFile
stopWords: Array[String] = Array(a, aa, able, about, above, absorbs, accord, according, accordingly, across, actually, after, afterwards, again, against, ain, album, album, all, allow, allows, almost, alone, along, already, also, although, always, am, among, amongst, an, and, another, any, anybody, anyhow, anyone, anything, anyway, anyways, anywhere, apart, app, appear, appreciate, appropriate, are, aren, around, as, aside, ask, asking, associated, at, available, away, awfully, b, baby, bb, be, became, because, become, becomes, becoming, been, before, beforehand, behind, being, believe, below, beside, besides, best, better, between, beyond, bibs, bike, book, books, both, brief, bulbs, but, by, c, came, camera, can, cannot, cant, car, case, cause, causes, ...


### Helpers

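This helper function writes an RDD of per-category top terms to the local file system: one line per category listing its top terms with their chi-squared values, followed by a final line containing all selected terms merged in alphabetical order.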

In [5]:
import java.io.PrintWriter
import org.apache.spark.rdd.RDD
import scala.collection.immutable.TreeSet

def writeRDDToFile(rdd: RDD[(String, Seq[(String, Double)])], filePath: String): Unit = {
    var mergedTerms = TreeSet[String]()
    val writer = new PrintWriter(filePath)
    
    val collectedData = rdd.sortByKey().collect()

    for ((category, topk) <- collectedData) {
        val topKStr = topk.map { case (term, chiSquared) => 
            mergedTerms += term
            s"$term:$chiSquared"
        }.mkString(" ")
        writer.println(s"<$category> $topKStr")
    }
    
    writer.print(mergedTerms.mkString(" "))
    
    writer.close()
}


import java.io.PrintWriter
import org.apache.spark.rdd.RDD
import scala.collection.immutable.TreeSet
writeRDDToFile: (rdd: org.apache.spark.rdd.RDD[(String, Seq[(String, Double)])], filePath: String)Unit


This helper function writes a string array to the local file system.

In [6]:
import java.io.PrintWriter

def writeArrToFile(arr: Array[String], filePath: String) = {
    val writer = new PrintWriter(filePath)
    
    writer.println(arr.mkString(" "))
    writer.close()
}

import java.io.PrintWriter
writeArrToFile: (arr: Array[String], filePath: String)Unit


## Calculate Chi-Square
This approach first counts the number of documents per category and per term, and then calculates the chi-squared value for each term in each category.

In [7]:
val rdd = df.rdd

rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[8] at rdd at <console>:32
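
As a reference for the statistic used in this section, here is a small sketch of the 2×2 chi-squared formula on its own. The function name `chiSquare2x2` and the example counts are assumptions made for illustration. With a = documents in the category that contain the term, b = documents outside the category that contain it, c = documents in the category without it, and d = all remaining documents, the value is N(ad - bc)² / ((a+b)(a+c)(b+d)(c+d)), where N = a + b + c + d.

```scala
// Chi-squared statistic for a 2x2 contingency table of document counts.
// Double arithmetic is used throughout so that products of large counts
// cannot overflow an Int.
def chiSquare2x2(a: Long, b: Long, c: Long, d: Long): Double = {
  val n = (a + b + c + d).toDouble
  val diff = a.toDouble * d - b.toDouble * c
  val denominator = (a + b).toDouble * (a + c) * (b + d) * (c + d)
  // Returning 0.0 for a degenerate table is a choice made for this sketch.
  if (denominator == 0.0) 0.0 else n * diff * diff / denominator
}

// Hypothetical counts: 100 documents in total.
val exampleChiSquared = chiSquare2x2(a = 30, b = 10, c = 20, d = 40)
```

For these counts the numerator is 100 · 1000² and the denominator is 40 · 50 · 50 · 60, which gives a value of about 16.67.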


The following cell defines all functions needed to compute the chi-squared values.

In [8]:
%%time
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

def preprocessing(row: Row): Seq[((String, Option[String]), Int)] = {
  val category = row.getString(0)
  val reviewText = row.getString(1)

  val terms = reviewText
    .toLowerCase()
    .split(tokenizePattern)
    .filter(token => token.length > 1 && !stopWords.contains(token))
    .toSet

  val counts = Seq(((category, None), 1)) ++ terms.map(token => ((category, Some(token)), 1))
  counts.toSeq
}

def tokenToKey(row: ((String, Option[String]), Int)): (Option[String], (String, Int)) = {
  val ((category, token), count) = row
  (token, (category, count))
}

def tokenSum(row: (Option[String], Iterable[(String, Int)])): Iterable[(String, (Option[String], Int, Int))] = {
  val (token, values) = row
  val counts = values.groupBy(_._1).mapValues(_.map(_._2).sum)
  val n_t = counts.values.sum

  counts.map { case (category, count) => (category, (token, count, n_t)) }
}

def chiSquared(row: (String, Iterable[(Option[String], Int, Int)])): (String, Seq[(String, Double)]) = {
  val (category, values) = row
  val counts = values.map { case (token, count, n_t) => token -> (count, n_t) }.toMap
  val (n_c, n) = counts.getOrElse(None, (0, counts.values.map(_._2).sum))

  val results = counts
    .collect {
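      case (Some(token), (a, n_t)) =>
        val b = n_t - a
        val c = n_c - a
        val d = n - a - b - c
        // Use Double arithmetic: the Int product a * d can overflow on large corpora
        val diff = a.toDouble * d - b.toDouble * c
        val chiSquaredValue = n.toDouble * diff * diff / ((a + b).toDouble * (a + c) * (b + d) * (c + d))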
      
        (token, chiSquaredValue)
    }
    .toSeq
    .sortBy(-_._2)
    .take(K)

  (category, results)
}

Time: 0.8895168304443359 seconds.



import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
preprocessing: (row: org.apache.spark.sql.Row)Seq[((String, Option[String]), Int)]
tokenToKey: (row: ((String, Option[String]), Int))(Option[String], (String, Int))
tokenSum: (row: (Option[String], Iterable[(String, Int)]))Iterable[(String, (Option[String], Int, Int))]
chiSquared: (row: (String, Iterable[(Option[String], Int, Int)]))(String, Seq[(String, Double)])


The functions defined above are now chained with generic RDD transformations to compute the chi-squared values and select the top K terms per category.

In [9]:
%%time
val topTermsPerCategory = rdd
  .flatMap(preprocessing)
  .reduceByKey(_ + _)
  .map(tokenToKey)
  .groupByKey()
  .flatMap(tokenSum)
  .groupByKey()
  .map(chiSquared)
  .sortByKey()

Time: 33.28363609313965 seconds.



topTermsPerCategory: org.apache.spark.rdd.RDD[(String, Seq[(String, Double)])] = ShuffledRDD[18] at sortByKey at <console>:46
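
The counting logic behind this chain can be followed more easily on a toy corpus. The sketch below mirrors the steps with plain Scala collections instead of RDDs; the corpus and all names (`toyDocs`, `docsPerPair`, `toyChiSquared`, ...) are hypothetical and only meant to show where the counts a, b, c and d come from.

```scala
// Hypothetical toy corpus: (category, distinct terms of one review).
val toyDocs: Seq[(String, Set[String])] = Seq(
  ("Book",  Set("story", "author")),
  ("Book",  Set("story", "plot")),
  ("Music", Set("album", "sound")),
  ("Music", Set("sound", "story"))
)

val totalDocs = toyDocs.size

// Documents per category (the role of the None token in the RDD version).
val docsPerCategory: Map[String, Int] =
  toyDocs.groupBy(_._1).map { case (category, docs) => category -> docs.size }

// Documents per (category, term): flatMap followed by reduceByKey on RDDs.
val docsPerPair: Map[(String, String), Int] =
  toyDocs
    .flatMap { case (category, terms) => terms.toSeq.map(term => (category, term)) }
    .groupBy(identity)
    .map { case (pair, occurrences) => pair -> occurrences.size }

// Documents per term over all categories (n_t in the RDD version).
val docsPerTerm: Map[String, Int] =
  docsPerPair.toSeq
    .groupBy { case ((_, term), _) => term }
    .map { case (term, entries) => term -> entries.map(_._2).sum }

// Chi-squared value of one term for one category, built from the counts above.
def toyChiSquared(category: String, term: String): Double = {
  val a = docsPerPair.getOrElse((category, term), 0).toDouble
  val b = docsPerTerm(term) - a
  val c = docsPerCategory(category) - a
  val d = totalDocs - a - b - c
  totalDocs * math.pow(a * d - b * c, 2) / ((a + b) * (a + c) * (b + d) * (c + d))
}
```

In this corpus `sound` occurs in both Music reviews and nowhere else, so `toyChiSquared("Music", "sound")` is 4.0, while `story` also appears outside Book and scores lower for that category.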


In [10]:
%%time
writeRDDToFile(topTermsPerCategory, "../output_rdd.txt")

Time: 1.2687962055206299 seconds.



## Datasets/DataFrames: Spark ML and Pipelines

In this section, we build a first Spark ML pipeline and apply it to the Amazon review dataset to obtain the 2000 features selected by ChiSqSelector. These terms are written to a file and used later for evaluation.

First, we create all the necessary transformers.

In [7]:
import org.apache.spark.ml.feature.{StringIndexer, RegexTokenizer, StopWordsRemover}
import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel}
import org.apache.spark.ml.feature.{HashingTF, IDF}

val indexer = new StringIndexer()
    .setInputCol("category")
    .setOutputCol("category_index")

val tokenizer = new RegexTokenizer()
    .setInputCol("reviewText")
    .setOutputCol("raw_terms")
    .setMinTokenLength(2)
    .setPattern(tokenizePattern)
    .setToLowercase(true)

val remover = new StopWordsRemover()
    .setInputCol(tokenizer.getOutputCol)
    .setOutputCol("terms")
    .setStopWords(stopWords)

val countVectorizer = new CountVectorizer()
    .setInputCol(remover.getOutputCol)
    .setOutputCol("raw_features")
    .setMinDF(1)

val idf = new IDF()
    .setInputCol(countVectorizer.getOutputCol)
    .setOutputCol("features")

import org.apache.spark.ml.feature.{StringIndexer, RegexTokenizer, StopWordsRemover}
import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel}
import org.apache.spark.ml.feature.{HashingTF, IDF}
indexer: org.apache.spark.ml.feature.StringIndexer = strIdx_72d2940458eb
tokenizer: org.apache.spark.ml.feature.RegexTokenizer = RegexTokenizer: uid=regexTok_d6afe954f67f, minTokenLength=2, gaps=true, pattern=[^a-zA-Z<>^|]+, toLowercase=true
remover: org.apache.spark.ml.feature.StopWordsRemover = StopWordsRemover: uid=stopWords_e61df3ed420d, numStopWords=596, locale=en_US, caseSensitive=false
countVectorizer: org.apache.spark.ml.feature.CountVectorizer = cntVec_2ffc01fba4d9
idf: org.apache.spark.ml.feature.IDF = idf_c827c0fd494c
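
For intuition about what the CountVectorizer and IDF stages compute together, here is a minimal sketch of TF-IDF weighting in plain Scala. It assumes the smoothed inverse document frequency ln((m + 1) / (df + 1)) described in the Spark ML documentation, where m is the number of documents and df the number of documents containing the term. The function names and the example numbers are assumptions for illustration.

```scala
// Smoothed inverse document frequency: ln((m + 1) / (df + 1)).
def smoothedIdf(numDocs: Long, docFreq: Long): Double =
  math.log((numDocs + 1.0) / (docFreq + 1.0))

// TF-IDF weight of a term that occurs termCount times in one document.
def tfIdf(termCount: Int, numDocs: Long, docFreq: Long): Double =
  termCount * smoothedIdf(numDocs, docFreq)

// Hypothetical corpus of 4 documents.
val idfRare = smoothedIdf(numDocs = 4, docFreq = 1)       // ln(5 / 2)
val idfEverywhere = smoothedIdf(numDocs = 4, docFreq = 4) // ln(5 / 5)
```

A term that occurs in every document gets an IDF of zero and therefore contributes nothing to the feature vector, whereas rare terms are weighted up.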


Next, we create the chi-squared feature selector (ChiSqSelector)

In [8]:
import org.apache.spark.ml.feature.{ChiSqSelector, ChiSqSelectorModel}

val selector = new ChiSqSelector()
  .setNumTopFeatures(2000)
  .setFeaturesCol(idf.getOutputCol)
  .setLabelCol("category_index")
  .setOutputCol("selectedFeatures")

import org.apache.spark.ml.feature.{ChiSqSelector, ChiSqSelectorModel}
selector: org.apache.spark.ml.feature.ChiSqSelector = chiSqSelector_e102c3526f1a


Lastly, we create the pipeline that chains all transformers and selects the top 2000 features

In [9]:
import org.apache.spark.ml.{Pipeline, PipelineModel}

val pipeline = new Pipeline()
    .setStages(Array(tokenizer, remover, countVectorizer, idf, indexer, selector))

import org.apache.spark.ml.{Pipeline, PipelineModel}
pipeline: org.apache.spark.ml.Pipeline = pipeline_3f2c1deaf497


With the pipeline defined, we fit it to the data we want to transform

In [10]:
%%time
val model = pipeline.fit(df)

Time: 50.5444974899292 seconds.



model: org.apache.spark.ml.PipelineModel = pipeline_3f2c1deaf497


Afterward, we extract the vocabulary and the indices of the selected features so that the indices can be mapped back to terms

In [11]:
val vocabulary = model.stages(2).asInstanceOf[CountVectorizerModel].vocabulary
val selectedFeatures = model.stages.last.asInstanceOf[ChiSqSelectorModel].selectedFeatures

vocabulary: Array[String] = Array(great, good, love, time, work, recommend, back, easy, make, bought, made, find, buy, price, put, reading, quality, people, works, quot, years, nice, characters, long, series, lot, found, author, day, bit, feel, makes, thing, perfect, fit, end, set, loved, things, thought, music, small, hard, give, year, world, size, worth, pretty, times, sound, written, light, real, big, amazon, part, bad, highly, money, excellent, purchased, happy, high, enjoyed, problem, family, interesting, wanted, character, job, review, purchase, man, watch, days, enjoy, place, home, stars, short, writing, play, cover, top, fan, full, fine, color, side, order, wonderful, amazing, point, fact, reviews, ordered, stories, favorite, easily, needed, battery, screen, water, dvd, beautifu...


Sort the terms in ascending order

In [12]:
import scala.util.Sorting.quickSort

val top2000terms = selectedFeatures.map(i => vocabulary(i))
quickSort(top2000terms)

import scala.util.Sorting.quickSort
top2000terms: Array[String] = Array(access, accessories, account, acid, acoustic, act, acted, acting, action, actions, actor, actors, adapter, adapters, addicted, addicting, addictive, adjust, adjustable, adjustment, admit, adorable, ads, adult, adults, adventure, adventures, advertised, advice, age, ages, agree, air, albums, alive, alpha, amazing, amazon, america, american, amp, amusing, analysis, ancient, android, angle, animals, animated, animation, anime, answers, antenna, appeal, apple, applied, apply, applying, approach, apps, arch, arm, arrangements, arrived, art, artist, artists, asleep, aspects, assemble, assembled, assembly, asus, atmosphere, attach, attached, attention, attractive, audience, audio, author, authors, auto, automatically, awar...
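
The index-to-term mapping used here can be illustrated with a tiny, made-up vocabulary. The arrays `toyVocabulary` and `toySelected` below are hypothetical stand-ins for the fitted model's vocabulary and the selector's feature indices.

```scala
// Hypothetical vocabulary: position i holds the term for feature index i.
val toyVocabulary: Array[String] = Array("great", "good", "love", "battery", "screen")

// Hypothetical feature indices as a selector might return them.
val toySelected: Array[Int] = Array(4, 0, 3)

// Look up each index in the vocabulary, then sort the terms alphabetically.
val toyTerms: Array[String] = toySelected.map(i => toyVocabulary(i)).sorted
```

The result is `battery`, `great`, `screen`: the selector only knows column indices, so the vocabulary of the fitted CountVectorizerModel is needed to recover readable terms.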


Store the terms in the local file system

In [13]:
%%time
writeArrToFile(top2000terms, "../output_ds.txt")

Time: 0.2079153060913086 seconds.



## Text Classification

In [14]:
val seed = 12041500
val fraction = 0.1

seed: Int = 12041500
fraction: Double = 0.1


### Pre-processing
First, we preprocess the data by tokenizing the reviews and removing stop words. The TF-IDF weights are computed later, inside the classification pipeline.

In [15]:
import org.apache.spark.ml.{Pipeline, PipelineModel}

val pipelinePreprocessing = new Pipeline()
    .setStages(Array(tokenizer, remover))

import org.apache.spark.ml.{Pipeline, PipelineModel}
pipelinePreprocessing: org.apache.spark.ml.Pipeline = pipeline_a9b6b4373fe7


Subsample the dataset (10% of the reviews) to keep model training time manageable

In [16]:
val sampledDF = df.sample(withReplacement = false, fraction = fraction, seed = seed)

sampledDF: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [category: string, reviewText: string]


Create an 80/20 train/test split

In [17]:
val Array(training, test) = sampledDF.randomSplit(Array(0.8, 0.2), seed = seed)

training: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [category: string, reviewText: string]
test: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [category: string, reviewText: string]


Fit the preprocessing pipeline (tokenization, stopword removal) on the training set

In [18]:
val preprocessingModel = pipelinePreprocessing.fit(training)

preprocessingModel: org.apache.spark.ml.PipelineModel = pipeline_a9b6b4373fe7


Apply preprocessing pipeline on training and test set

In [19]:
val preprocessedTrainining = preprocessingModel.transform(training).select(remover.getOutputCol, "category")
val preprocessedTest = preprocessingModel.transform(test).select(remover.getOutputCol, "category")

preprocessedTrainining: org.apache.spark.sql.DataFrame = [terms: array<string>, category: string]
preprocessedTest: org.apache.spark.sql.DataFrame = [terms: array<string>, category: string]


The preprocessed training and test sets are persisted with the storage level MEMORY_AND_DISK (kept in memory, spilling to disk when memory is insufficient) so that they are not recomputed by every subsequent computation

In [20]:
import org.apache.spark.storage.StorageLevel

val persistedProcessedTraining = preprocessedTrainining.persist(StorageLevel.MEMORY_AND_DISK)
val persistedProcessedTest = preprocessedTest.persist(StorageLevel.MEMORY_AND_DISK)

import org.apache.spark.storage.StorageLevel
persistedProcessedTraining: preprocessedTrainining.type = [terms: array<string>, category: string]
persistedProcessedTest: preprocessedTest.type = [terms: array<string>, category: string]


### Training classifier

Now that the data is split into train and test sets and already persisted, this section will implement the feature extraction pipeline as well as the grid search. Following this, the grid search will be applied to the training set and validated on the validation set, where each individual model's parameters will be stored locally for further investigation. Finally, the best-performing model from the grid search will be applied to the test set, and the persisted training and test data will be unpersisted.

Define L2 normalizer

In [21]:
import org.apache.spark.ml.feature.Normalizer

val normalizer = new Normalizer()
  .setInputCol(selector.getOutputCol)
  .setOutputCol("normFeatures")
  .setP(2.0)

import org.apache.spark.ml.feature.Normalizer
normalizer: org.apache.spark.ml.feature.Normalizer = Normalizer: uid=normalizer_55c2eb37dc15, p=2.0
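
As a quick illustration of what L2 normalization does to a single feature vector, here is a plain-Scala sketch. The function `l2Normalize` is a hypothetical helper, not the Spark implementation; leaving an all-zero vector unchanged is a choice made for this sketch.

```scala
// Scale a vector to unit Euclidean length (p = 2).
def l2Normalize(values: Array[Double]): Array[Double] = {
  val norm = math.sqrt(values.map(v => v * v).sum)
  // An all-zero vector has no direction, so it is returned unchanged.
  if (norm == 0.0) values.clone() else values.map(_ / norm)
}

// Example: the vector (3, 4) has length 5.
val unitVector = l2Normalize(Array(3.0, 4.0))
```

Here `unitVector` is approximately (0.6, 0.8). After normalization, long and short reviews are compared by the direction of their feature vectors rather than by their magnitude.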


Define the estimator: a linear SVM wrapped in One-vs-Rest for multi-class classification

In [22]:
import org.apache.spark.ml.classification.{LinearSVC, OneVsRest}

val lsvc = new LinearSVC()

val classifier = new OneVsRest()
    .setClassifier(lsvc)
    .setFeaturesCol(normalizer.getOutputCol)
    .setLabelCol(indexer.getOutputCol)

import org.apache.spark.ml.classification.{LinearSVC, OneVsRest}
lsvc: org.apache.spark.ml.classification.LinearSVC = linearsvc_55ffc87a2423
classifier: org.apache.spark.ml.classification.OneVsRest = oneVsRest_117e1f890a1e
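
The One-vs-Rest reduction can be sketched in a few lines of plain Scala: one binary problem per class, and at prediction time the class whose binary model is most confident wins. The helper names and the scores below are assumptions for illustration, not the Spark implementation.

```scala
// Relabel a multi-class target for the binary problem "class k vs. rest".
def binarize(labels: Seq[Double], k: Double): Seq[Double] =
  labels.map(label => if (label == k) 1.0 else 0.0)

// Pick the class whose binary model reports the highest score.
def predictOneVsRest(scoresPerClass: Map[Double, Double]): Double =
  scoresPerClass.maxBy(_._2)._1

// Hypothetical labels and per-class scores.
val binaryLabels = binarize(Seq(0.0, 2.0, 1.0, 2.0), k = 2.0)
val predictedClass = predictOneVsRest(Map(0.0 -> -0.3, 1.0 -> 1.2, 2.0 -> 0.4))
```

With these inputs `binaryLabels` is 0, 1, 0, 1 and `predictedClass` is 1.0. With 22 categories, as in this dataset, the reduction trains 22 binary classifiers per parameter combination.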


Create the pipeline for feature extraction and classification

In [23]:
val pipelineClassifier = new Pipeline()
    .setStages(Array(countVectorizer, idf, indexer, selector, normalizer, classifier))

pipelineClassifier: org.apache.spark.ml.Pipeline = pipeline_d324b5769e4e


Define parameter grid

In [24]:
import org.apache.spark.ml.tuning.ParamGridBuilder

val paramGrid = new ParamGridBuilder()
    .addGrid(lsvc.maxIter, Array(10, 50))
    .addGrid(lsvc.regParam, Array(0.001, 0.01, 0.1))
    .addGrid(lsvc.standardization, Array(false, true))
    .addGrid(selector.numTopFeatures, Array(20, 2000))
    .build()

import org.apache.spark.ml.tuning.ParamGridBuilder
paramGrid: Array[org.apache.spark.ml.param.ParamMap] =
Array({
	linearsvc_55ffc87a2423-maxIter: 10,
	chiSqSelector_e102c3526f1a-numTopFeatures: 20,
	linearsvc_55ffc87a2423-regParam: 0.001,
	linearsvc_55ffc87a2423-standardization: false
}, {
	linearsvc_55ffc87a2423-maxIter: 50,
	chiSqSelector_e102c3526f1a-numTopFeatures: 20,
	linearsvc_55ffc87a2423-regParam: 0.001,
	linearsvc_55ffc87a2423-standardization: false
}, {
	linearsvc_55ffc87a2423-maxIter: 10,
	chiSqSelector_e102c3526f1a-numTopFeatures: 20,
	linearsvc_55ffc87a2423-regParam: 0.001,
	linearsvc_55ffc87a2423-standardization: true
}, {
	linearsvc_55ffc87a2423-maxIter: 50,
	chiSqSelector_e102c3526f1a-numTopFeatures: 20,
	linearsvc_55ffc87a2423-regParam: 0.001,
	linearsvc_55ffc87a2423-...
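
The grid is the Cartesian product of all listed values. The following sketch enumerates the same combinations with a plain for-comprehension; the case class `GridPoint` is a hypothetical container, and the iteration order does not necessarily match the order produced by ParamGridBuilder.

```scala
// Hypothetical container for one parameter combination.
final case class GridPoint(
  maxIter: Int,
  regParam: Double,
  standardization: Boolean,
  numTopFeatures: Int
)

// Every combination of the values used in the grid above.
val gridPoints: Seq[GridPoint] = for {
  maxIter         <- Seq(10, 50)
  regParam        <- Seq(0.001, 0.01, 0.1)
  standardization <- Seq(false, true)
  numTopFeatures  <- Seq(20, 2000)
} yield GridPoint(maxIter, regParam, standardization, numTopFeatures)
```

This yields 2 · 3 · 2 · 2 = 24 combinations, each of which is trained and evaluated once during the grid search.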


Define evaluator and set metric to F1-Score

In [25]:
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator 

val evaluater = new MulticlassClassificationEvaluator()
    .setLabelCol(indexer.getOutputCol)
    .setMetricName("f1")

import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
evaluater: org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator = MulticlassClassificationEvaluator: uid=mcEval_fb9c3f850193, metricName=f1, metricLabel=0.0, beta=1.0, eps=1.0E-15
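
The `f1` metric of the multiclass evaluator is a weighted F1 score: the F1 score of each class, weighted by the share of true instances of that class. A minimal plain-Scala sketch of that computation follows; `weightedF1` and the toy predictions are assumptions for illustration, not the Spark implementation.

```scala
// Weighted F1 over (prediction, label) pairs.
def weightedF1(predictionAndLabels: Seq[(Double, Double)]): Double = {
  val total = predictionAndLabels.size.toDouble
  val labels = predictionAndLabels.map(_._2).distinct
  labels.map { label =>
    val tp = predictionAndLabels.count { case (p, l) => p == label && l == label }.toDouble
    val fp = predictionAndLabels.count { case (p, l) => p == label && l != label }.toDouble
    val fn = predictionAndLabels.count { case (p, l) => p != label && l == label }.toDouble
    val precision = if (tp + fp == 0) 0.0 else tp / (tp + fp)
    val recall = if (tp + fn == 0) 0.0 else tp / (tp + fn)
    val f1 = if (precision + recall == 0) 0.0 else 2 * precision * recall / (precision + recall)
    // Weight each class by its share of the true labels.
    f1 * (tp + fn) / total
  }.sum
}

// Hypothetical predictions for two classes, three true instances each.
val toyPairs = Seq((0.0, 0.0), (0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (1.0, 1.0), (0.0, 1.0))
val toyF1 = weightedF1(toyPairs)
```

In this toy example each class has precision and recall of 2/3, so `toyF1` is about 0.667.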


Define Grid Search

In [26]:
import org.apache.spark.ml.tuning.TrainValidationSplit
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator 

val trainValidationSplit = new TrainValidationSplit()
    .setEstimator(pipelineClassifier)
    .setEvaluator(evaluater)
    .setEstimatorParamMaps(paramGrid)
    .setTrainRatio(0.8)
    .setSeed(seed)
    .setParallelism(20)

import org.apache.spark.ml.tuning.TrainValidationSplit
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
trainValidationSplit: org.apache.spark.ml.tuning.TrainValidationSplit = tvs_5803524faddd


Perform grid search on the persisted training data

In [27]:
%%time
val model = trainValidationSplit.fit(persistedProcessedTraining)

Time: 1773.4690883159637 seconds.



model: org.apache.spark.ml.tuning.TrainValidationSplitModel = TrainValidationSplitModel: uid=tvs_5803524faddd, bestModel=pipeline_d324b5769e4e, trainRatio=0.8


#### Best classifier

In [28]:
val predictions = model.transform(persistedProcessedTest)

predictions: org.apache.spark.sql.DataFrame = [terms: array<string>, category: string ... 7 more fields]


In [29]:
println(s"F1-Score = ${evaluater.evaluate(predictions)}")

F1-Score = 0.47262102005903917


In [51]:
import org.apache.spark.ml.classification.{OneVsRestModel, LinearSVCModel}

val bestModel = model.bestModel.asInstanceOf[PipelineModel]
val bestClassifier = bestModel.stages.last.asInstanceOf[OneVsRestModel]
val bestBinaryClassifierModel = bestClassifier.models.head.asInstanceOf[LinearSVCModel]
val bestSelector = bestModel.stages(3).asInstanceOf[ChiSqSelectorModel]

println(s"Best binary classifier parameters with a F1-Score = ${evaluater.evaluate(predictions)}:\n" +
  s"  LVC = maxIter: ${bestBinaryClassifierModel.getMaxIter}, regParam: ${bestBinaryClassifierModel.getRegParam}, standardization: ${bestBinaryClassifierModel.getStandardization}\n" +
  s"  ChiSqSelector = topNumFeatures: ${bestSelector.getNumTopFeatures}")

Best binary classifier parameters with a F1-Score = 0.47262102005903917:
  LVC = maxIter: 50, regParam: 0.001, standardization: false
  ChiSqSelector = topNumFeatures: 2000


import org.apache.spark.ml.classification.{OneVsRestModel, LinearSVCModel}
bestModel: org.apache.spark.ml.PipelineModel = pipeline_d324b5769e4e
bestClassifier: org.apache.spark.ml.classification.OneVsRestModel = OneVsRestModel: uid=oneVsRest_117e1f890a1e, classifier=linearsvc_55ffc87a2423, numClasses=22, numFeatures=2000
bestBinaryClassifierModel: org.apache.spark.ml.classification.LinearSVCModel = LinearSVCModel: uid=linearsvc_55ffc87a2423, numClasses=2, numFeatures=2000
bestSelector: org.apache.spark.ml.feature.ChiSqSelectorModel = ChiSqSelectorModel: uid=chiSqSelector_e102c3526f1a, numSelectedFeatures=2000


#### Evaluate Grid Search Models 

The following helper function merges the parameter map and validation metrics of the trainValidationSplit model, then writes them to the local file system.

In [32]:
import java.io.PrintWriter

def evaluateGridSearch(paramMaps: Array[org.apache.spark.ml.param.ParamMap], validationMetrics: Array[Double], filePath: String) = {
    val paramsAndMetrics = paramMaps.zip(validationMetrics)
    val writer = new PrintWriter(filePath)
    
    // Header without spaces after the commas, matching the format of the data rows
    writer.println("maxIter,NumTopFeatures,regParam,standardization,f1-score")

    paramsAndMetrics.foreach { case (paramMap, metric) =>
        val maxIter = paramMap.get(lsvc.maxIter).head
        val NumTopFeatures = paramMap.get(selector.numTopFeatures).head
        val regParam = paramMap.get(lsvc.regParam).head
        val standardization = paramMap.get(lsvc.standardization).head

        writer.println(s"$maxIter,$NumTopFeatures,$regParam,$standardization,${metric}")
    }

    writer.close()
}

import java.io.PrintWriter
evaluateGridSearch: (paramMaps: Array[org.apache.spark.ml.param.ParamMap], validationMetrics: Array[Double], filePath: String)Unit
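
The core of this helper is pairing each parameter combination with its validation metric via `zip`. The sketch below shows that pairing on made-up values; the parameter descriptions and scores are purely illustrative and are not results from this notebook.

```scala
// Hypothetical parameter descriptions and validation scores (not real results).
val toyParamNames: Array[String] = Array("maxIter=10", "maxIter=50", "maxIter=100")
val toyMetrics: Array[Double] = Array(0.41, 0.47, 0.44)

// Pair every parameter combination with its metric, position by position.
val paired: Array[(String, Double)] = toyParamNames.zip(toyMetrics)

// The best combination is the pair with the highest metric.
val bestPair: (String, Double) = paired.maxBy(_._2)

// CSV lines as they would be written to a file: header first, then one row per pair.
val csvLines: Seq[String] =
  "params,f1-score" +: paired.map { case (params, metric) => s"$params,$metric" }.toSeq
```

Because `zip` pairs by position, this relies on the metrics being reported in the same order as the parameter maps.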


Retrieve the parameter maps and validation metrics from the fitted model and write them to the local file system.

In [36]:
val paramMaps = model.getEstimatorParamMaps
val validationMetrics = model.validationMetrics

evaluateGridSearch(paramMaps, validationMetrics, "../grid_search_evaluation.csv")

paramMaps: Array[org.apache.spark.ml.param.ParamMap] =
Array({
	linearsvc_55ffc87a2423-maxIter: 10,
	chiSqSelector_e102c3526f1a-numTopFeatures: 20,
	linearsvc_55ffc87a2423-regParam: 0.001,
	linearsvc_55ffc87a2423-standardization: false
}, {
	linearsvc_55ffc87a2423-maxIter: 50,
	chiSqSelector_e102c3526f1a-numTopFeatures: 20,
	linearsvc_55ffc87a2423-regParam: 0.001,
	linearsvc_55ffc87a2423-standardization: false
}, {
	linearsvc_55ffc87a2423-maxIter: 10,
	chiSqSelector_e102c3526f1a-numTopFeatures: 20,
	linearsvc_55ffc87a2423-regParam: 0.001,
	linearsvc_55ffc87a2423-standardization: true
}, {
	linearsvc_55ffc87a2423-maxIter: 50,
	chiSqSelector_e102c3526f1a-numTopFeatures: 20,
	linearsvc_55ffc87a2423-regParam: 0.001,
	linearsvc_55ffc87a2423-standardization: true
}, {
	linearsvc_55ffc87a2423-...


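### Final Model

Note that the stages of bestModel are already fitted transformers, so Pipeline.fit passes them through unchanged instead of retraining them. TrainValidationSplit has also already refit the best parameter combination on the complete training set, which is why the score in this section is identical to the one reported for the best classifier.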

In [34]:
val bestModelPipeline = new Pipeline().setStages(bestModel.stages)
val bestModelFull = bestModelPipeline.fit(persistedProcessedTraining)

bestModelPipeline: org.apache.spark.ml.Pipeline = pipeline_244a9c59d8c7
bestModelFull: org.apache.spark.ml.PipelineModel = pipeline_244a9c59d8c7


In [35]:
val predictions = bestModelFull.transform(persistedProcessedTest)
println(s"F1-Score full model = ${evaluater.evaluate(predictions)}")

F1-Score full model = 0.47262102005903917


predictions: org.apache.spark.sql.DataFrame = [terms: array<string>, category: string ... 7 more fields]


### Clean up

In [52]:
persistedProcessedTraining.unpersist()
persistedProcessedTest.unpersist()

res23: persistedProcessedTest.type = [terms: array<string>, category: string]
