## Classification Example in Spark
This example demonstrates how to train a simple logistic regression model and a Random Forest classification model in Spark, tune their hyper-parameters, and evaluate the models using cross-validation. We will use the StumbleUpon Evergreen dataset from this Kaggle competition: https://www.kaggle.com/c/stumbleupon. Unfortunately, the data use agreement does not allow me to share this dataset outside of the competition platform. To download it, create an account on Kaggle, go to https://www.kaggle.com/c/stumbleupon/data, click Download All, and accept the data use agreement. Once you have downloaded the data, copy the train.tsv file to your HDFS. This tab-delimited file is the data we will be working with.

"StumbleUpon is a user-curated web content discovery engine that recommends relevant, high quality pages and media to its users, based on their interests. While some pages we recommend, such as news articles or seasonal recipes, are only relevant for a short period of time, others maintain a timeless quality and can be recommended to users long after they are discovered. In other words, pages can either be classified as "ephemeral" or "evergreen"." (Source: Kaggle)

The dataset provided by StumbleUpon for the Kaggle competition has information on over 7.3K webpages, including the boilerplate (a JSON object containing the title, keywords, and body of a webpage), some other metadata about the webpage, and a user-defined label that indicates whether the webpage is evergreen.
Our goal is to build a logistic regression model as well as a random forest model to predict whether a webpage is evergreen.

The dataset is not big; however, the program we write here is scalable and can be run on big data. The goal of this notebook is to learn how to build and train Logistic Regression and Random Forest models in Spark, tune their hyper-parameters, and evaluate them using cross-validation.

As before, let's first configure our Spark shell on YARN:

In [1]:
%%init_spark
launcher.master="yarn"
launcher.num_executors=6
launcher.executor_cores=2
launcher.executor_memory='2000m'
launcher.packages=["com.github.master:spark-stemming_2.10:0.2.0"]
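
As a rough sanity check on the executor settings above, the following sketch works out how much memory Spark asks YARN for per executor. It assumes the documented Spark 2.3 default for the memory overhead (10% of the executor memory, with a 384 MB minimum) and that no overhead setting is overridden elsewhere; the variable names are illustrative only.

```scala
// Executor heap size taken from launcher.executor_memory = '2000m'
val executorMemoryMb = 2000

// Assumed Spark defaults for the per-executor memory overhead
val overheadFactor = 0.10
val minOverheadMb = 384

// Overhead is the larger of the fixed minimum and the fraction of the heap
val overheadMb = math.max(minOverheadMb, (executorMemoryMb * overheadFactor).toInt)

// Total memory requested from YARN for one executor container
val requestedMb = executorMemoryMb + overheadMb

println(s"overhead = $overheadMb MB, container request = $requestedMb MB")
```

Under these assumptions the request is 2384 MB per executor. YARN may round this up to its own allocation granularity, which would be consistent with the 2.4 GB container limit reported in the RandomForest log output later in this notebook.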


Now let's read train.tsv from HDFS, cache it, print its schema to see what attributes it has, and view a sample of the rows.

In [2]:
val df=spark.read.option("header","true").option("delimiter","\t").option("inferschema", "true").option("escape","\"").csv("/hadoop-user/data/stumbleupon/train.tsv")

df.cache()
df.printSchema()
print("count of records is: "+df.count)
df.show(3)

Intitializing Scala interpreter ...

Spark Web UI available at http://bd-hm:8088/proxy/application_1541175619604_0006
SparkContext available as 'sc' (version = 2.3.1, master = yarn, app id = application_1541175619604_0006)
SparkSession available as 'spark'


2018-11-03 10:02:22 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2018-11-03 10:02:25 WARN  Client:66 - Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
2018-11-03 10:02:30 WARN  Client:66 - Same path resource file:///home/administrator/.ivy2/jars/com.github.master_spark-stemming_2.10-0.2.0.jar added multiple times to distributed cache.
2018-11-03 10:03:48 WARN  Utils:66 - Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
root
 |-- url: string (nullable = true)
 |-- urlid: integer (nullable = true)
 |-- boilerplate: string (nullable = true)
 |-- alchemy_category: string (nullable = true)
 |-- alchemy_category_score: string (nullable = true)
 |-- avglinksize: double (nullable = true)
 |-- commonlinkratio_1: double (nullable = true)
 |--

df: org.apache.spark.sql.DataFrame = [url: string, urlid: int ... 25 more fields]


## Extracting and Transforming Features
The only column we will use as a predictor in this lab is the boilerplate column. The top winners of this competition found that the remaining columns have no correlation with the outcome. Note that a complete data analysis cycle includes visualization and feature selection steps before building the model. However, we do not cover feature selection in this course, because it requires more statistical background as well as exploratory data analysis and visualization skills, which are outside the scope of this class.

The boilerplate feature is a JSON object with three attributes: title, url, and body. We extract all three, filter out the webpages with an empty body, and concatenate the body, title, and url (which is a set of keywords) into a single text column that we use as the predictor.

In [3]:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._  
import scala.util.parsing.json.JSON

/* get_json_object is a built-in Spark SQL function that extracts an attribute from a JSON column.
 * boilerplateDF has four columns: body, title, url, and label (the outcome variable).
 * For a list of all Spark SQL functions, refer to: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$
 */
val boilerplateDF = df.select(get_json_object($"boilerplate", "$.body").alias("body"),
                              get_json_object($"boilerplate", "$.title").alias("title"),
                              get_json_object($"boilerplate", "$.url").alias("url"),
                              $"label")
                                        
/* Filter out the webpages with a null or empty body, then concatenate body, title, and url with a space in between.
 * concat_ws is a built-in Spark SQL function that concatenates multiple strings using a given separator.
 * Note: a comparison with null (trim(body) != null) always evaluates to NULL in SQL, so the null check uses IS NOT NULL.
 */
val boilerplate = boilerplateDF.filter("body is not null and trim(body) != ''").select(concat_ws(" ", $"body", $"title", $"url").alias("boilerplate"), $"label")


import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import scala.util.parsing.json.JSON
boilerplateDF: org.apache.spark.sql.DataFrame = [body: string, title: string ... 2 more fields]
boilerplate: org.apache.spark.sql.DataFrame = [boilerplate: string, label: int]


### Building TFIDF vectors
Now it is time to extract features from the boilerplate text. We tokenize the text, remove punctuation and stop words, apply stemming, create a bag of words, and finally compute TFIDF vectors. Since we want to add all these stages to a pipeline later, we remove punctuation with a SQLTransformer instead of a standalone SQL statement.

A SQLTransformer lets us write a SQL statement that transforms the data and add it as a stage to a pipeline.
Currently the only SQL syntax supported by SQLTransformer is "SELECT ... FROM __THIS__ ...", where "__THIS__" represents the underlying table of the input dataset. The select clause specifies the fields, constants, and expressions to display in the output and can be any select clause that Spark SQL supports. We can also use Spark SQL built-in functions and UDFs to operate on the selected columns. Here we use a SQLTransformer with our removePuncUDF to remove punctuation from the tokenized boilerplate.
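
Returning to the TFIDF step mentioned above, the sketch below computes TF-IDF weights by hand for three toy documents. It assumes the smoothed IDF formula documented for Spark ML's IDF, idf(t) = log((m + 1) / (df(t) + 1)), where m is the number of documents and df(t) is the number of documents containing term t, and it treats terms that appear in fewer than minDocFreq documents as having an IDF of 0. The documents and helper names are made up for illustration.

```scala
// Three toy documents, already tokenized and stemmed (made-up data)
val toyDocs: Seq[Seq[String]] = Seq(
  Seq("cake", "recip", "cake"),
  Seq("news", "cake"),
  Seq("news", "sport")
)

// m: total number of documents
val m = toyDocs.size

// Document frequency: in how many documents does each term appear?
val docFreq: Map[String, Int] =
  toyDocs.flatMap(_.distinct).groupBy(identity).map { case (t, occ) => t -> occ.size }

// Smoothed IDF; terms below minDocFreq get an IDF of 0
def idfWeight(term: String, minDocFreq: Int = 0): Double = {
  val df = docFreq.getOrElse(term, 0)
  if (df < minDocFreq) 0.0 else math.log((m + 1.0) / (df + 1.0))
}

// TF-IDF of one document: raw term count times IDF
def tfidfWeights(doc: Seq[String], minDocFreq: Int = 0): Map[String, Double] =
  doc.groupBy(identity).map { case (t, occ) => t -> occ.size * idfWeight(t, minDocFreq) }

println(tfidfWeights(toyDocs.head))
println(tfidfWeights(toyDocs.head, minDocFreq = 2))
```

In the first document, "cake" occurs twice but appears in two of the three documents, so its weight is 2 * log(4/3); "recip" occurs once in a single document and gets log(4/2). With minDocFreq = 2, the weight of "recip" drops to 0.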

In [4]:
import org.apache.spark.ml.feature._

val tokenizer = new RegexTokenizer().setMinTokenLength(3).setToLowercase(true).setInputCol("boilerplate").setOutputCol("boilerplate_words")

//Defining a udf to remove punctuations from a sequence of words
import org.apache.spark.sql.functions.udf

def removePunc(words: Seq[String]): Seq[String] =
  words.map(_.replaceAll("\\p{Punct}", " "))

//val removePuncUDF=udf(removePunc(_:Seq[String]))
spark.udf.register("removePuncUDF",removePunc(_:Seq[String]) )

//use the removePuncUDF to remove all punctuation from boilerplate_words
val puncRemover = new SQLTransformer().setStatement("SELECT removePuncUDF(boilerplate_words) as boilerplate, label from __THIS__ ")

val stopWordRemover=new StopWordsRemover().setInputCol("boilerplate").setOutputCol("filtered_boilerplate")

import org.apache.spark.mllib.feature.Stemmer
val stemmer = new Stemmer().setInputCol("filtered_boilerplate").setOutputCol("stemmed_boilerplate")

val vectorizer = new CountVectorizer().setInputCol("stemmed_boilerplate").setOutputCol("boilerplate_BOW")

val tfidf = new IDF().setInputCol("boilerplate_BOW").setOutputCol("boilerplate_TFIDF")




import org.apache.spark.ml.feature._
tokenizer: org.apache.spark.ml.feature.RegexTokenizer = regexTok_31b27d646572
import org.apache.spark.sql.functions.udf
removePunc: (words: Seq[String])Seq[String]
puncRemover: org.apache.spark.ml.feature.SQLTransformer = sql_b6c539e2b149
stopWordRemover: org.apache.spark.ml.feature.StopWordsRemover = stopWords_fb58939e9617
import org.apache.spark.mllib.feature.Stemmer
stemmer: org.apache.spark.mllib.feature.Stemmer = stemmer_82acf48c4fe9
vectorizer: org.apache.spark.ml.feature.CountVectorizer = cntVec_344831b604dd
tfidf: org.apache.spark.ml.feature.IDF = idf_ef564f38cc51
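
A quick check of what removePunc does to individual tokens may be useful here. The sketch below applies the same replaceAll call to a few made-up tokens. The second variant, which deletes punctuation and drops empty tokens, is only an assumption about what one might prefer; it is not what the pipeline in this notebook uses.

```scala
// Same logic as removePunc in the cell above: each punctuation character becomes a space
def removePuncDemo(words: Seq[String]): Seq[String] =
  words.map(_.replaceAll("\\p{Punct}", " "))

// Made-up tokens, as a whitespace-splitting tokenizer might produce them
val sampleTokens = Seq("100%", "e-mail", "cake", "...")

val replaced = removePuncDemo(sampleTokens)
// replaced == Seq("100 ", "e mail", "cake", "   ")

// Hypothetical alternative: delete punctuation and drop tokens that become empty
val stripped = sampleTokens.map(_.replaceAll("\\p{Punct}", "")).filter(_.nonEmpty)
// stripped == Seq("100", "email", "cake")

println(replaced)
println(stripped)
```

Because punctuation is replaced by a space rather than removed, tokens can keep embedded or trailing spaces, which appears to match entries such as `100 ` in the stemmed_boilerplate column of the output further down.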


## Building, Tuning, and Evaluating a Logistic Regression model
Now we create a logistic regression model using the LogisticRegression class in Spark, setting its features column to the boilerplate TFIDF vector and its label column to the user-defined label that indicates whether a webpage is evergreen. We use BinaryClassificationEvaluator with AUC (area under the ROC curve) to evaluate the model. A parameter grid tries different values for three hyper-parameters: regParam and elasticNetParam, which are the lambda and alpha parameters of elastic-net regularization, respectively, and minDocFreq, the minimum number of documents a term must appear in to receive a non-zero IDF weight. 5-fold cross-validation is used to tune the hyper-parameters.
Finally, we create a pipeline of all the preprocessing stages followed by the cross-validated logistic regression stage and fit it to the training data. Then we evaluate the model on the test data. The AUC we get from fitting a logistic regression to the boilerplate text is about 0.867.
Be patient when you run this code; cross-validation takes a while to complete. You can leave it running and come back to it a couple of hours later.
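
To make the roles of regParam and elasticNetParam concrete, the sketch below evaluates the elastic-net penalty in the form given in the Spark ML documentation, regParam * (elasticNetParam * ||w||_1 + (1 - elasticNetParam) / 2 * ||w||_2^2), on a made-up weight vector. The function name and the weights are illustrative and not part of the notebook.

```scala
// Elastic-net penalty added to the logistic loss:
// regParam scales the whole penalty, elasticNetParam mixes L1 (1.0) and L2 (0.0)
def elasticNetPenalty(weights: Seq[Double], regParam: Double, elasticNetParam: Double): Double = {
  val l1 = weights.map(w => math.abs(w)).sum     // ||w||_1
  val l2Squared = weights.map(w => w * w).sum    // ||w||_2^2
  regParam * (elasticNetParam * l1 + (1.0 - elasticNetParam) / 2.0 * l2Squared)
}

// Made-up coefficient vector
val w = Seq(0.5, -2.0, 0.0)

val ridge = elasticNetPenalty(w, regParam = 0.5, elasticNetParam = 0.0)  // pure L2: 1.0625
val mixed = elasticNetPenalty(w, regParam = 0.5, elasticNetParam = 0.5)  // half and half: 1.15625
val lasso = elasticNetPenalty(w, regParam = 0.5, elasticNetParam = 1.0)  // pure L1: 1.25

println(s"ridge = $ridge, mixed = $mixed, lasso = $lasso")
```

The grid in this section therefore covers ridge, lasso, and an even mix of the two, each at three penalty strengths.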

In [6]:
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.tuning._
import org.apache.spark.ml.evaluation._
import org.apache.spark.ml.feature._
val lr = new LogisticRegression().setLabelCol("label").setFeaturesCol("boilerplate_TFIDF")
val paramGrid =new ParamGridBuilder()
             .addGrid(lr.regParam, Array(0.01, 0.5, 2.0))
             .addGrid(lr.elasticNetParam, Array(0.0, 0.5, 1.0))
             .addGrid(tfidf.minDocFreq, Array(5,10))
             .build()
val evaluator = new BinaryClassificationEvaluator().setRawPredictionCol("rawPrediction").setLabelCol("label").setMetricName("areaUnderROC")
val cv = new CrossValidator().setEstimator(lr).setEvaluator(evaluator).setEstimatorParamMaps(paramGrid).setNumFolds(5)


val pipeline = new Pipeline().setStages(Array(tokenizer,puncRemover,stopWordRemover, stemmer, vectorizer, tfidf,cv))

val Array(training,testing)=boilerplate.randomSplit(Array(0.8,0.2),111)


//Fit the training data to the pipeline
val pipelineModel = pipeline.fit(training)

// Make predictions.
val predictions = pipelineModel.transform(testing)

// Select example rows to display.
predictions.select("label", "prediction", "probability", "stemmed_boilerplate").show(5)

val AUC = evaluator.evaluate(predictions)
println(s"Area under ROC curve(AUC) for LR on test data = $AUC")




2018-11-03 14:48:59 WARN  BlockManager:66 - Asked to remove block broadcast_46437, which does not exist
2018-11-03 14:53:11 WARN  BlockManager:66 - Asked to remove block broadcast_48936, which does not exist
+-----+----------+--------------------+--------------------+
|label|prediction|         probability| stemmed_boilerplate|
+-----+----------+--------------------+--------------------+
|    0|       0.0|[0.64249486683892...|[100 , godina haj...|
|    0|       0.0|[0.62265139839694...|[cat, ineffici, d...|
|    1|       1.0|[0.17175367401133...|[light, cake, soa...|
|    0|       0.0|[0.61628265425505...|[memeri, object, ...|
|    0|       0.0|[0.63563644784605...|[australian, news...|
+-----+----------+--------------------+--------------------+
only showing top 5 rows

Area under ROC curve(AUC) for LR on test data = 0.867149775276725


import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.tuning._
import org.apache.spark.ml.evaluation._
import org.apache.spark.ml.feature._
lr: org.apache.spark.ml.classification.LogisticRegression = logreg_3ea37684d78f
paramGrid: Array[org.apache.spark.ml.param.ParamMap] =
Array({
	logreg_3ea37684d78f-elasticNetParam: 0.0,
	idf_ef564f38cc51-minDocFreq: 5,
	logreg_3ea37684d78f-regParam: 0.01
}, {
	logreg_3ea37684d78f-elasticNetParam: 0.5,
	idf_ef564f38cc51-minDocFreq: 5,
	logreg_3ea37684d78f-regParam: 0.01
}, {
	logreg_3ea37684d78f-elasticNetParam: 1.0,
	idf_ef564f38cc51-minDocFreq: 5,
	logreg_3ea37684d78f-regParam: 0.01
}, {
	logreg_3ea37684d78f-elasticNetParam: 0.0,
	idf_ef564f38cc51-minDocFreq: 10,...
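
To give some intuition for the AUC value reported above, it can be read as the probability that a randomly chosen evergreen page receives a higher score than a randomly chosen non-evergreen page. The sketch below computes it that way on six made-up (score, label) pairs. It is a brute-force illustration and not how BinaryClassificationEvaluator computes the metric internally.

```scala
// Made-up (predicted score for class 1, true label) pairs
val scored: Seq[(Double, Int)] =
  Seq((0.9, 1), (0.8, 1), (0.7, 0), (0.6, 1), (0.4, 0), (0.2, 0))

// AUC as the fraction of (positive, negative) pairs ranked correctly; ties count as half
def pairwiseAuc(data: Seq[(Double, Int)]): Double = {
  val pos = data.collect { case (s, 1) => s }
  val neg = data.collect { case (s, 0) => s }
  val wins = for (p <- pos; n <- neg) yield
    if (p > n) 1.0 else if (p == n) 0.5 else 0.0
  wins.sum / (pos.size.toLong * neg.size)
}

val toyAuc = pairwiseAuc(scored)
println(s"AUC on the toy data = $toyAuc")
```

Here eight of the nine positive/negative pairs are ordered correctly (only the positive with score 0.6 is ranked below the negative with score 0.7), so the AUC is 8/9.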

## Building, Tuning, and Evaluating a RandomForest model

Now we create a Random Forest model using RandomForestClassifier in Spark to predict the label for this dataset. The pipeline is very similar to the one we created for logistic regression, except that the estimator is set to RandomForestClassifier and the hyper-parameters tuned are maxDepth (the maximum depth of each tree in the forest) and numTrees (the total number of trees in the RandomForest model). Unfortunately, our tiny cluster runs out of memory when I try to run this code segment. Nevertheless, I leave the code segment here for you to see how RandomForest is run on Spark.
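
One detail of this setup is worth a sketch. In both pipelines in this notebook the CrossValidator wraps only the final classifier, so the IDF stage is fitted once, before cross-validation starts. As far as I can tell from how Spark copies parameters onto an estimator, grid entries for tfidf.minDocFreq are then ignored by the classifier rather than applied to the IDF stage. A possible alternative, shown below under that assumption, is to make the whole pipeline the estimator of the CrossValidator. The stage names are hypothetical, and the feature pipeline is shortened (no punctuation removal, stop-word removal, or stemming) to keep the example small.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.{CountVectorizer, IDF, RegexTokenizer}
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// Hypothetical, shortened feature stages
val tok = new RegexTokenizer().setInputCol("boilerplate").setOutputCol("words")
val bow = new CountVectorizer().setInputCol("words").setOutputCol("bow")
val idfStage = new IDF().setInputCol("bow").setOutputCol("features")
val forest = new RandomForestClassifier().setLabelCol("label").setFeaturesCol("features")

// The classifier is the last stage of the pipeline ...
val fullPipeline = new Pipeline().setStages(Array(tok, bow, idfStage, forest))

// ... so the grid can mix parameters of the feature stages and the classifier
val fullGrid = new ParamGridBuilder()
  .addGrid(forest.maxDepth, Array(2, 5))
  .addGrid(forest.numTrees, Array(5, 20))
  .addGrid(idfStage.minDocFreq, Array(5, 10))
  .build()

// The whole pipeline is the estimator that gets refitted per fold and per parameter map
val cvFull = new CrossValidator()
  .setEstimator(fullPipeline)
  .setEvaluator(new BinaryClassificationEvaluator().setMetricName("areaUnderROC"))
  .setEstimatorParamMaps(fullGrid)
  .setNumFolds(2)

// Fitting would be: val cvModel = cvFull.fit(training)
println(s"${fullGrid.length} parameter combinations")
```

Fitting cvFull refits every feature stage for each fold and parameter combination, so it costs more time and memory than the pipelines in this notebook.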

In [None]:
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.tuning._
import org.apache.spark.ml.evaluation._
import org.apache.spark.ml.feature._

val rf = new RandomForestClassifier().setLabelCol("label").setFeaturesCol("boilerplate_TFIDF")
val paramGrid =new ParamGridBuilder()
             .addGrid(rf.maxDepth, Array(2, 5))
             .addGrid(rf.numTrees, Array(5, 20))
             .addGrid(tfidf.minDocFreq, Array(5,10))
             .build()

val evaluator = new BinaryClassificationEvaluator().setRawPredictionCol("rawPrediction").setLabelCol("label").setMetricName("areaUnderROC")


val cv_rf = new CrossValidator().setEstimator(rf).setEvaluator(evaluator).setEstimatorParamMaps(paramGrid).setNumFolds(2)

val pipeline_rf = new Pipeline().setStages(Array(tokenizer,puncRemover,stopWordRemover, stemmer, vectorizer, tfidf,cv_rf))

val Array(training,testing)=boilerplate.randomSplit(Array(0.8,0.2),111)

//Fit the training data to the pipeline
val pipelineModel_rf = pipeline_rf.fit(training)

// Make predictions.
val predictions = pipelineModel_rf.transform(testing)
val AUC = evaluator.evaluate(predictions)
println(s"Area under ROC curve(AUC) for RF on test data = $AUC")

2018-11-03 15:16:02 WARN  BlockManagerMasterEndpoint:66 - No more replicas available for rdd_11_0 !
2018-11-03 15:16:02 WARN  BlockManagerMasterEndpoint:66 - No more replicas available for rdd_11_5 !
2018-11-03 15:16:02 WARN  BlockManagerMasterEndpoint:66 - No more replicas available for rdd_87852_3 !
2018-11-03 15:16:02 WARN  BlockManagerMasterEndpoint:66 - No more replicas available for rdd_87891_4 !
2018-11-03 15:16:02 WARN  BlockManagerMasterEndpoint:66 - No more replicas available for rdd_87894_4 !
2018-11-03 15:16:02 WARN  BlockManagerMasterEndpoint:66 - No more replicas available for rdd_87894_0 !
2018-11-03 15:16:02 WARN  BlockManagerMasterEndpoint:66 - No more replicas available for rdd_87891_1 !
2018-11-03 15:16:02 WARN  BlockManagerMasterEndpoint:66 - No more replicas available for rdd_87894_1 !
2018-11-03 15:16:02 WARN  BlockManagerMasterEndpoint:66 - No more replicas available for rdd_87891_5 !
2018-11-03 15:16:02 WARN  BlockManagerMasterEndpoint:66 - No more replicas avai

2018-11-03 15:16:21 WARN  TaskSetManager:66 - Lost task 1.1 in stage 40190.0 (TID 156554, bd-s1, executor 2): FetchFailed(BlockManagerId(1, bd-hm, 34698, None), shuffleId=18816, mapId=1, reduceId=1, message=
org.apache.spark.shuffle.FetchFailedException: Failed to connect to bd-hm/10.92.132.60:34698
	at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:523)
	at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:454)
	at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:61)
	at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at org.ap

2018-11-03 15:26:00 WARN  BlockManagerMasterEndpoint:66 - No more replicas available for rdd_88182_5 !
2018-11-03 15:26:00 WARN  BlockManagerMasterEndpoint:66 - No more replicas available for rdd_87894_3 !
2018-11-03 15:26:00 WARN  BlockManagerMasterEndpoint:66 - No more replicas available for rdd_11_5 !
2018-11-03 15:26:00 WARN  BlockManagerMasterEndpoint:66 - No more replicas available for rdd_87891_5 !
2018-11-03 15:26:00 WARN  BlockManagerMasterEndpoint:66 - No more replicas available for rdd_11_3 !
2018-11-03 15:26:00 WARN  BlockManagerMasterEndpoint:66 - No more replicas available for rdd_87894_5 !
2018-11-03 15:26:00 WARN  BlockManagerMasterEndpoint:66 - No more replicas available for rdd_87891_3 !
2018-11-03 15:26:00 ERROR YarnScheduler:70 - Lost executor 9 on bd-hm: Container killed by YARN for exceeding memory limits. 2.4 GB of 2.4 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
2018-11-03 15:26:00 WARN  YarnSchedulerBackend$YarnSchedulerEndpoin

2018-11-03 15:30:09 WARN  TaskSetManager:66 - Lost task 1.0 in stage 40340.0 (TID 157218, bd-s1, executor 4): ExecutorLostFailure (executor 4 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 2.4 GB of 2.4 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
2018-11-03 15:30:09 WARN  TaskSetManager:66 - Lost task 3.0 in stage 40340.0 (TID 157220, bd-s1, executor 4): ExecutorLostFailure (executor 4 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 2.4 GB of 2.4 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
2018-11-03 15:30:09 WARN  TaskSetManager:66 - Lost task 1.1 in stage 40340.0 (TID 157224, bd-hm, executor 10): FetchFailed(null, shuffleId=18861, mapId=-1, reduceId=1, message=
org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 18861
	at org.apache.spark.MapOutputTracker$$anon

2018-11-03 15:32:31 WARN  YarnSchedulerBackend$YarnSchedulerEndpoint:66 - Requesting driver to remove executor 2 for reason Container killed by YARN for exceeding memory limits. 2.4 GB of 2.4 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
2018-11-03 15:32:31 WARN  TaskSetManager:66 - Lost task 0.1 in stage 40368.0 (TID 157356, bd-hm, executor 11): FetchFailed(null, shuffleId=18870, mapId=-1, reduceId=0, message=
org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 18870
	at org.apache.spark.MapOutputTracker$$anonfun$convertMapStatuses$2.apply(MapOutputTracker.scala:867)
	at org.apache.spark.MapOutputTracker$$anonfun$convertMapStatuses$2.apply(MapOutputTracker.scala:863)
	at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
	at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.sc

2018-11-03 15:39:48 WARN  BlockManagerMasterEndpoint:66 - No more replicas available for rdd_88344_5 !
2018-11-03 15:39:48 WARN  BlockManagerMasterEndpoint:66 - No more replicas available for rdd_88341_3 !
2018-11-03 15:39:48 WARN  BlockManagerMasterEndpoint:66 - No more replicas available for rdd_88344_3 !
2018-11-03 15:39:48 WARN  BlockManagerMasterEndpoint:66 - No more replicas available for rdd_11_3 !
2018-11-03 15:39:48 WARN  YarnSchedulerBackend$YarnSchedulerEndpoint:66 - Requesting driver to remove executor 10 for reason Container killed by YARN for exceeding memory limits. 2.4 GB of 2.4 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
2018-11-03 15:39:48 ERROR YarnScheduler:70 - Lost executor 10 on bd-hm: Container killed by YARN for exceeding memory limits. 2.4 GB of 2.4 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
2018-11-03 15:39:48 WARN  TaskSetManager:66 - Lost task 0.0 in stage 40456.0 (TID 157717, bd-hm, exe

2018-11-03 15:40:07 WARN  TaskSetManager:66 - Lost task 2.1 in stage 40456.0 (TID 157723, bd-s1, executor 13): FetchFailed(BlockManagerId(10, bd-hm, 47549, None), shuffleId=18894, mapId=3, reduceId=2, message=
org.apache.spark.shuffle.FetchFailedException: Failed to connect to bd-hm/10.92.132.60:47549
	at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:523)
	at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:454)
	at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:61)
	at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at org.

2018-11-03 15:42:24 WARN  YarnSchedulerBackend$YarnSchedulerEndpoint:66 - Requesting driver to remove executor 12 for reason Container killed by YARN for exceeding memory limits. 2.5 GB of 2.4 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
2018-11-03 15:42:24 ERROR YarnScheduler:70 - Lost executor 12 on bd-s1: Container killed by YARN for exceeding memory limits. 2.5 GB of 2.4 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
2018-11-03 15:42:24 WARN  TaskSetManager:66 - Lost task 0.0 in stage 40484.0 (TID 157855, bd-s1, executor 12): ExecutorLostFailure (executor 12 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 2.5 GB of 2.4 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
2018-11-03 15:42:24 WARN  TaskSetManager:66 - Lost task 4.0 in stage 40484.0 (TID 157859, bd-s1, executor 12): ExecutorLostFailure (executor 12 exited caused by one of

2018-11-03 15:42:43 WARN  TaskSetManager:66 - Lost task 1.1 in stage 40484.0 (TID 157864, bd-s1, executor 13): FetchFailed(BlockManagerId(12, bd-s1, 42342, None), shuffleId=18903, mapId=3, reduceId=1, message=
org.apache.spark.shuffle.FetchFailedException: Failed to connect to bd-s1/10.92.132.61:42342
	at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:523)
	at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:454)
	at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:61)
	at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at org.

2018-11-03 15:47:30 ERROR YarnScheduler:70 - Lost executor 3 on bd-s2: Container killed by YARN for exceeding memory limits. 2.5 GB of 2.4 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
2018-11-03 15:47:30 WARN  TaskSetManager:66 - Lost task 0.0 in stage 40540.0 (TID 158119, bd-s2, executor 3): ExecutorLostFailure (executor 3 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 2.5 GB of 2.4 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
2018-11-03 15:47:30 WARN  TaskSetManager:66 - Lost task 4.0 in stage 40540.0 (TID 158123, bd-s2, executor 3): ExecutorLostFailure (executor 3 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 2.5 GB of 2.4 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
2018-11-03 15:47:30 WARN  YarnSchedulerBackend$YarnSchedulerEndpoint:66 - Requesting driver to remove 

2018-11-03 15:50:37 WARN  BlockManagerMasterEndpoint:66 - No more replicas available for rdd_11_0 !
2018-11-03 15:50:37 WARN  BlockManagerMasterEndpoint:66 - No more replicas available for rdd_11_1 !
2018-11-03 15:50:37 WARN  BlockManagerMasterEndpoint:66 - No more replicas available for rdd_11_3 !
2018-11-03 15:50:37 WARN  BlockManagerMasterEndpoint:66 - No more replicas available for rdd_88791_0 !
2018-11-03 15:50:37 WARN  BlockManagerMasterEndpoint:66 - No more replicas available for rdd_88791_1 !
2018-11-03 15:50:37 WARN  YarnSchedulerBackend$YarnSchedulerEndpoint:66 - Requesting driver to remove executor 14 for reason Container killed by YARN for exceeding memory limits. 2.4 GB of 2.4 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
2018-11-03 15:50:37 ERROR YarnScheduler:70 - Lost executor 14 on bd-hm: Container killed by YARN for exceeding memory limits. 2.4 GB of 2.4 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
201

2018-11-03 15:54:23 WARN  BlockManagerMasterEndpoint:66 - No more replicas available for rdd_11_0 !
2018-11-03 15:54:23 WARN  BlockManagerMasterEndpoint:66 - No more replicas available for rdd_88791_5 !
2018-11-03 15:54:23 WARN  BlockManagerMasterEndpoint:66 - No more replicas available for rdd_88791_0 !
2018-11-03 15:54:23 WARN  BlockManagerMasterEndpoint:66 - No more replicas available for rdd_88794_0 !
2018-11-03 15:54:23 WARN  YarnSchedulerBackend$YarnSchedulerEndpoint:66 - Requesting driver to remove executor 13 for reason Container killed by YARN for exceeding memory limits. 2.4 GB of 2.4 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
2018-11-03 15:54:23 ERROR YarnScheduler:70 - Lost executor 13 on bd-s1: Container killed by YARN for exceeding memory limits. 2.4 GB of 2.4 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
2018-11-03 15:54:23 WARN  TaskSetManager:66 - Lost task 1.0 in stage 40612.0 (TID 158439, bd-s1, exe

2018-11-03 15:57:01 WARN  BlockManagerMasterEndpoint:66 - No more replicas available for rdd_88791_3 !
2018-11-03 15:57:01 WARN  BlockManagerMasterEndpoint:66 - No more replicas available for rdd_11_1 !
2018-11-03 15:57:01 WARN  BlockManagerMasterEndpoint:66 - No more replicas available for rdd_88791_1 !
2018-11-03 15:57:01 WARN  BlockManagerMasterEndpoint:66 - No more replicas available for rdd_88794_1 !
2018-11-03 15:57:01 ERROR YarnScheduler:70 - Lost executor 18 on bd-s2: Container killed by YARN for exceeding memory limits. 2.4 GB of 2.4 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
2018-11-03 15:57:01 WARN  TaskSetManager:66 - Lost task 0.0 in stage 40634.0 (TID 158534, bd-s2, executor 18): ExecutorLostFailure (executor 18 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 2.4 GB of 2.4 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
2018-11-03 15:57:01 WARN  Task