# Predict tags on StackOverflow with linear models

In this assignment you will learn how to predict tags for posts from [StackOverflow](https://stackoverflow.com). To solve this task you will use a multilabel classification approach.

### Libraries

In this task you will need the following libraries:
- [Numpy](http://www.numpy.org): a package for scientific computing.
- [Pandas](https://pandas.pydata.org): a library providing high-performance, easy-to-use data structures and data analysis tools for Python.
- [scikit-learn](http://scikit-learn.org/stable/index.html): a tool for data mining and data analysis.
- [NLTK](http://www.nltk.org): a platform to work with natural language.

### Data

The following cell will download all data required for this assignment into the folder `week1/data`.

In [1]:
import sys
sys.path.append("..")


### Grading
We will create a grader instance below and use it to collect your answers. Note that these outputs are stored locally inside the grader and are uploaded to the platform only after you run the submitting function in the last part of this assignment. If you want to make a partial submission, you can run that cell at any time.

In [45]:
from pyspark.sql import SparkSession
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.functions import *
from pyspark.ml import Pipeline
from pyspark.ml.feature import CountVectorizer
from pyspark.ml.feature import HashingTF, IDF, Tokenizer, RegexTokenizer, StopWordsRemover
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler
from pyspark.ml.classification import RandomForestClassifier, LogisticRegression, OneVsRest
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.feature import Word2Vec

In [46]:
spark = SparkSession.builder.appName('NLP').getOrCreate()

### Text preprocessing

For this and most of the following assignments you will need to use a list of stop words. It can be downloaded from *nltk*:

In this task you will deal with a dataset of post titles from StackOverflow. The data is split into three sets: *train*, *validation* and *test*. All corpora (except for *test*) contain titles of the posts and corresponding tags (100 tags are available). The *test* set is provided for Coursera's grading and doesn't contain answers. Load the corpora (the cells below use Spark's CSV reader rather than *pandas*) and look at the data:

In [47]:
train = spark.read.csv('data/train.tsv',inferSchema=True,header=True,sep='\t')
#validation=spark.read.csv('data/validation.tsv',inferSchema=True,header=True,sep='\t')
#test = spark.read.csv('data/test.tsv',inferSchema=True,header=True,sep='\t')

In [48]:
# Lowercase the title and replace punctuation with spaces; keep '#' and '+' so that names like C# and C++ survive
train = train.withColumn('title', lower(regexp_replace(train.title, '[^A-Za-z0-9#+ ]+', ' ')))
#validation = validation.withColumn('title', lower(regexp_replace(validation.title, '[^A-Za-z0-9#+ ]+', ' ')))
#test = test.withColumn('title', lower(regexp_replace(test.title, '[^A-Za-z0-9#+ ]+', ' ')))


In [49]:
# Replace null tags with the placeholder "['NoTag']"
train = train.withColumn('tags_1', when( train.tags.isNull(), "['NoTag']").otherwise(train.tags))
#validation = validation.withColumn('tags_1', when( validation.tags.isNull(), "['NoTag']").otherwise(validation.tags))
#train=train.na.fill('NA')
train.select('tags','tags_1').show(truncate=True, n=290)


+--------------------+--------------------+
|                tags|              tags_1|
+--------------------+--------------------+
|               ['r']|               ['r']|
|    ['php', 'mysql']|    ['php', 'mysql']|
|              ['c#']|              ['c#']|
|['javascript', 'j...|['javascript', 'j...|
|            ['java']|            ['java']|
|   ['ruby-on-rails']|   ['ruby-on-rails']|
|['ruby', 'ruby-on...|['ruby', 'ruby-on...|
|            ['ruby']|            ['ruby']|
|['java', 'spring'...|['java', 'spring'...|
|['php', 'codeigni...|['php', 'codeigni...|
|   ['java', 'class']|   ['java', 'class']|
|['javascript', 'j...|['javascript', 'j...|
|['javascript', 'j...|['javascript', 'j...|
|  ['c++', 'eclipse']|  ['c++', 'eclipse']|
|      ['javascript']|      ['javascript']|
|  ['python', 'list']|  ['python', 'list']|
|['ios', 'objectiv...|['ios', 'objectiv...|
|['ios', 'json', '...|['ios', 'json', '...|
|      ['c#', 'xaml']|      ['c#', 'xaml']|
|   ['c#', 'asp.net']|   ['c#', 

In [50]:
train.filter("tags is NULL").show()

+--------------------+----+---------+
|               title|tags|   tags_1|
+--------------------+----+---------+
| n  or  n  or std...|null|['NoTag']|
|python   comma in...|null|['NoTag']|
|remove escape cha...|null|['NoTag']|
|can r paste  outp...|null|['NoTag']|
|how do you write ...|null|['NoTag']|
|c# hex byte 0x09 ...|null|['NoTag']|
|how to get rid of...|null|['NoTag']|
|cannot run progra...|null|['NoTag']|
|using  n  in scan...|null|['NoTag']|
|an error  +  is a...|null|['NoTag']|
|remove the last  ...|null|['NoTag']|
|how to remove tho...|null|['NoTag']|
|how to convert a ...|null|['NoTag']|
|how can i remove ...|null|['NoTag']|
|what does  u001b ...|null|['NoTag']|
|php error log out...|null|['NoTag']|
|array of uibutton...|null|['NoTag']|
|how to find strin...|null|['NoTag']|
|json loads jsonst...|null|['NoTag']|
|what is the diffe...|null|['NoTag']|
+--------------------+----+---------+
only showing top 20 rows



In [51]:
#validation.select('tags','tags_1').show(truncate=True, n=290)

In [52]:
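# Pipeline: tokenize titles and tags, then remove stop words.
# Note: the "\\W" pattern splits on every non-word character, including '#' and '+',
# so "c#" and "c++" both end up as the token "c".
regexTokenizer = RegexTokenizer(inputCol="title", outputCol="words", pattern="\\W")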
regexTokenizer2 = RegexTokenizer(inputCol="tags_1", outputCol="tags_2", pattern="[^A-Za-z0-9#+]+")

remover = StopWordsRemover(inputCol="words", outputCol="filtered")

train_trans = Pipeline(stages=[regexTokenizer,regexTokenizer2,remover]).fit(train).transform(train)
#val_trans = Pipeline(stages=[regexTokenizer,regexTokenizer2, remover]).fit(validation).transform(validation)
#test_trans = Pipeline(stages=[regexTokenizer, remover]).fit(test).transform(test)

As you can see, the *title* column contains the titles of the posts and the *tags* column contains the tags. Note that the number of tags per post is not fixed; a post can have as many as necessary.
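Because the raw *tags* column stores each tag list as a string such as `['php', 'mysql']`, one way to check how many tags a post has is to parse that string in plain Python. This is a minimal sketch, assuming every non-null value is a valid Python list literal; `parse_tags` is a hypothetical helper and not part of the notebook.

```python
import ast

def parse_tags(raw):
    """Parse a tag string like "['php', 'mysql']" into a list of tags."""
    if raw is None:
        # Mirror the notebook's placeholder for missing tags.
        return ["NoTag"]
    return list(ast.literal_eval(raw))

samples = ["['r']", "['php', 'mysql']", None]
tag_counts = [len(parse_tags(s)) for s in samples]
print(tag_counts)
```

The varying lengths are what makes this a multilabel problem rather than a plain multiclass one.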

In [66]:
from pyspark.sql.types import IntegerType

countTokens = udf(lambda words: len(words), IntegerType())
countw=train_trans.select("title", "words").withColumn("tokens", countTokens(col("words")))

In [71]:
#count number of tokens per title
countw.sort(desc('tokens')).show()

+--------------------+--------------------+------+
|               title|               words|tokens|
+--------------------+--------------------+------+
|how to merge a ja...|[how, to, merge, ...|    30|
|python   how to c...|[python, how, to,...|    29|
|i am using struts...|[i, am, using, st...|    29|
|how to create arr...|[how, to, create,...|    29|
|javascript window...|[javascript, wind...|    29|
|how do i implemen...|[how, do, i, impl...|    28|
|when an object ha...|[when, an, object...|    28|
|how do i sense if...|[how, do, i, sens...|    28|
|how do i create a...|[how, do, i, crea...|    28|
|in c  how do i ch...|[in, c, how, do, ...|    28|
|is it possible to...|[is, it, possible...|    28|
|binary and linear...|[binary, and, lin...|    28|
|how to make my pa...|[how, to, make, m...|    28|
| to make that inp...|[to, make, that, ...|    28|
|how to make a rec...|[how, to, make, a...|    27|
| i am new to java...|[i, am, new, to, ...|    27|
|how to calculate ...|[how, to,

In [72]:
#Word TO VEC
word2Vec = Word2Vec(vectorSize=30, minCount=1,inputCol="filtered", outputCol="features")
model = word2Vec.fit(train_trans)

result = model.transform(train_trans)

In [73]:
result.select('filtered','features').show(n=2, truncate=False)

+-----------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|filtered                                                         |features                                                                                                                                                                                                                                            

In [10]:
#BAG OF WORDS

DICT_SIZE = 24000
# Input data: each row is a bag of words with an ID.

# fit a CountVectorizerModel from the corpus TRAIN DATASET
cv = CountVectorizer(inputCol="filtered", outputCol="features", vocabSize=DICT_SIZE, minDF=1.0)
model = cv.fit(train_trans)
result = model.transform(train_trans)

#VALIDATION DATASET
#cvv = CountVectorizer(inputCol="filtered", outputCol="features", vocabSize=DICT_SIZE, minDF=2.0)
#modelv = cvv.fit(dataset2)
#resultv = modelv.transform(dataset2)

#result.select('filtered','features').show(truncate=False, n=2)
#result.select('features','filtered').show()

In [11]:
result.select('filtered','features').show(truncate=False, n=2)

+-----------------------------------------------------------------+-------------------------------------------------------------------------+
|filtered                                                         |features                                                                 |
+-----------------------------------------------------------------+-------------------------------------------------------------------------+
|[draw, stacked, dotplot, r]                                      |(23540,[94,627,2797,12459],[1.0,1.0,1.0,1.0])                            |
|[mysql, select, records, datetime, field, less, specified, value]|(23540,[13,32,85,118,258,631,677,1135],[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0])|
+-----------------------------------------------------------------+-------------------------------------------------------------------------+
only showing top 2 rows



In [12]:
#TF-IDF 

DICT_SIZE = 24000
hashingTF = HashingTF(inputCol="filtered", outputCol="rawFeatures", numFeatures=DICT_SIZE)
featurizedData = hashingTF.transform(train_trans)#TRAIN
#featurizedDatav = hashingTF.transform(dataset2)#VALIDATION
# alternatively, CountVectorizer can also be used to get term frequency vectors

idf = IDF(inputCol="rawFeatures", outputCol="features")
idfModel = idf.fit(featurizedData)#TRAIN
#idfModelv = idf.fit(featurizedDatav)#VALIDATION
rescaledData = idfModel.transform(featurizedData)#TRAIN
#rescaledDatav = idfModelv.transform(featurizedDatav)#VALIDATION

rescaledData.select("filtered", "features").show(truncate=False, n=2)

+-----------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|filtered                                                         |features                                                                                                                                                                                            |
+-----------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[draw, stacked, dotplot, r]                                      |(24000,[1570,8167,8935,9797],[4.643921014254523,6.314438433654403,8.334881634572284,8.804885263818019])                                   

In [13]:
# Concatenate all tags to create a single class.
# This doesn't work: it produces more than 7000 different classes and the program runs out of memory.

#y_train=train_trans.withColumn('tags_3',concat_ws('_', 'tags_2'))
#y_val=val_trans.withColumn('tags_3',concat_ws('_', 'tags_2'))

In [14]:
#y_train.select('tags_3').show(truncate=False, n=300)

For a more comfortable usage, initialize *X_train*, *X_val*, *X_test*, *y_train*, *y_val*.

In [15]:
#X_train=train_trans.select('filtered')
#X_val =val_trans.select('filtered')
#X_test=test_trans.select('filtered')

One of the best-known difficulties when working with natural data is that it's unstructured. For example, if you use it "as is" and extract tokens just by splitting the titles on whitespace, you will see that there are many "weird" tokens like *3.5?*, *"Flip*, etc. To avoid such problems, it's usually useful to prepare the data first. In this task you'll write a function that will also be used in the other assignments.
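The kind of preparation described above can be sketched in plain Python. The regular expression is the same one the notebook passes to `regexp_replace`; the stop-word set is a tiny hypothetical stand-in for a real list, and this `text_prepare` is only an illustration, not the graded implementation.

```python
import re

# Keep letters, digits, '#', '+' and spaces; everything else becomes a space.
BAD_SYMBOLS_RE = re.compile(r"[^A-Za-z0-9#+ ]+")
# Hypothetical, deliberately tiny stop-word set for illustration only.
STOP_WORDS = {"how", "to", "a", "in", "the", "is", "do", "i"}

def text_prepare(title):
    """Lowercase a title, drop unwanted symbols and remove stop words."""
    text = BAD_SYMBOLS_RE.sub(" ", title).lower()
    return [token for token in text.split() if token not in STOP_WORDS]

print(text_prepare("How to draw a stacked dotplot in R?"))
print(text_prepare("C# vs C++"))
```

Splitting on whitespace after the substitution keeps `c#` and `c++` intact, which a tokenizer that splits on every non-word character would not.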

**Task 1 (TextPrepare).** Implement the function *text_prepare* following the instructions. After that, run the function *test_test_prepare* to test it on tiny cases and submit it to Coursera.

In [80]:
# Create a column holding a single tag per row. Posts with several tags are
# duplicated once per tag, but the number of classes drops drastically, to 100.

# `result` is whichever feature dataset was built last (Word2Vec, BOW or TF-IDF); use the appropriate one.

y_train=result.withColumn('tags_1',explode(split('tags',', ')))
y_train=y_train.withColumn('tags_1',regexp_replace(y_train.tags_1, '[^A-Za-z0-9#+ ]+', ''))

#y_val=val_trans.withColumn('tags_1',explode(split('tags',', ')))
#y_val=y_val.withColumn('tags_1',regexp_replace(y_val.tags_1, '[^A-Za-z0-9#+ ]+', ''))
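The effect of `explode(split(...))` followed by the clean-up regex can be pictured with a small pure-Python sketch. `explode_tags` is a hypothetical helper that assumes the tag string has the `['a', 'b']` shape seen in the data.

```python
import re

def explode_tags(title, raw_tags):
    """Return one (title, tag) row per tag in a string like "['php', 'mysql']"."""
    rows = []
    for piece in raw_tags.split(", "):
        # Same character class as the notebook: this strips brackets and quotes,
        # and also hyphens and dots inside tag names.
        tag = re.sub(r"[^A-Za-z0-9#+ ]+", "", piece)
        rows.append((title, tag))
    return rows

rows = explode_tags("mysql select all records", "['php', 'mysql']")
print(rows)
print(explode_tags("some title", "['ruby-on-rails']"))
```

The second call shows why tags such as `ruby-on-rails` appear as `rubyonrails` in the later outputs.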


In [17]:
#y_val.filter("tags is NULL").select('tags','tags_1').show()


In [18]:
#y_train.filter("tags is NULL").select('tags', 'tags_1').show(n=200)

In [19]:
#y_train.filter('tags_1 like "NoTag"').select('filtered','tags','tags_1','tags_2').show()

In [20]:
#gdf=y_train.groupBy("tags_3")
#sorted(gdf.agg({"*": "count"}).collect())

    #.orderBy(col("count").desc()) \
    #.show(n=5)

In [21]:
#y_val.groupBy("tags_1") \
#    .count() \
#    .orderBy(col("count").desc()) \
#    .show(n=5)

In [81]:
# Create a numeric `label` column from the tag strings
"""
StringIndexer encodes a string column of labels to a column of label indices. The indices are in [0, numLabels), 
ordered by label frequencies, so the most frequent label gets index 0. The unseen labels will be put at index numLabels 
if user chooses to keep them. If the input column is numeric, we cast it to string and index the string values. 
When downstream pipeline components such as Estimator or Transformer make use of this string-indexed label, you must set 
the input column of the component to this string-indexed column name. In many cases, you can set the input column with 
setInputCol.
"""
label_stringIdx = StringIndexer(inputCol = "tags_1", outputCol = "label")

dataset = label_stringIdx.fit(y_train).transform(y_train)
#dataset2 = label_stringIdx.fit(y_val).transform(y_val)

#dataset.select("tags_3").distinct().count().show()
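The frequency-based ordering described in the docstring above can be illustrated without Spark. This is a rough sketch: `fit_string_index` is a hypothetical helper, and breaking ties alphabetically is an assumption made here for determinism.

```python
from collections import Counter

def fit_string_index(labels):
    """Map each label to an index, with the most frequent label at 0.0."""
    counts = Counter(labels)
    # Sort by descending frequency; ties are broken alphabetically (assumption).
    ordered = sorted(counts, key=lambda lab: (-counts[lab], lab))
    return {lab: float(i) for i, lab in enumerate(ordered)}

tags = ["javascript", "c#", "javascript", "java", "c#", "javascript"]
index = fit_string_index(tags)
print(index)
```

With this ordering, a low label index in the later outputs simply means a frequent tag.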

In [23]:
from pyspark.ml.feature import OneHotEncoderEstimator

In [24]:
# One-hot encoding is not suitable here: the classifier expects a numeric label, not a vector.
"""
encoder = OneHotEncoderEstimator(inputCols=["label_cat"],
                                 outputCols=["label"])
model = encoder.fit(dataset)
encoded = model.transform(dataset)
encoded.show()
"""

+--------------------+--------------------+------------+--------------------+--------------------+--------------------+--------------------+--------------------+---------+---------------+
|               title|                tags|      tags_1|               words|              tags_2|            filtered|         rawFeatures|            features|label_cat|          label|
+--------------------+--------------------+------------+--------------------+--------------------+--------------------+--------------------+--------------------+---------+---------------+
|how to draw a sta...|               ['r']|           r|[how, to, draw, a...|                 [r]|[draw, stacked, d...|(24000,[1570,8167...|(24000,[1570,8167...|     24.0|(99,[24],[1.0])|
|mysql select all ...|    ['php', 'mysql']|         php|[mysql, select, a...|        [php, mysql]|[mysql, select, r...|(24000,[585,4768,...|(24000,[585,4768,...|      3.0| (99,[3],[1.0])|
|mysql select all ...|    ['php', 'mysql']|       mysql|[mys

In [25]:
#dataset.select("tags_3").distinct().count()

In [82]:
dataset.select("tags_1").distinct().count()
#dataset2.select('tags_1','label').show(25)

100

In [83]:
dataset.select('features','tags_1','label').show(25)

+--------------------+------------+-----+
|            features|      tags_1|label|
+--------------------+------------+-----+
|[0.08730552755150...|           r| 24.0|
|[0.34796910732984...|         php|  3.0|
|[0.34796910732984...|       mysql| 14.0|
|[-0.0311543645026...|          c#|  1.0|
|[0.06817806246025...|  javascript|  0.0|
|[0.06817806246025...|      jquery|  5.0|
|[0.05053890030831...|        java|  2.0|
|[-0.0329051437671...| rubyonrails| 11.0|
|[0.15660998384867...|        ruby| 16.0|
|[0.15660998384867...|rubyonrails3| 49.0|
|[0.15660998384867...|        json| 18.0|
|[0.10343911964446...|        ruby| 16.0|
|[-0.0068720909766...|        java|  2.0|
|[-0.0068720909766...|      spring| 31.0|
|[-0.0068720909766...|   springmvc| 54.0|
|[0.17184817365237...|         php|  3.0|
|[0.17184817365237...| codeigniter| 44.0|
|[-0.0761974222170...|        java|  2.0|
|[-0.0761974222170...|       class| 65.0|
|[0.07230409234762...|  javascript|  0.0|
|[0.07230409234762...|      jquery

As you might notice, we transform the data into a sparse representation to store the useful information efficiently. There are many [types](https://docs.scipy.org/doc/scipy/reference/sparse.html) of such representations; however, sklearn algorithms can work only with the [csr](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html#scipy.sparse.csr_matrix) matrix, so we will use this one.

**Task 3 (BagOfWords).** For the 11th row in *X_train_mybag* find how many non-zero elements it has. In this task the answer (variable *non_zero_elements_count*) should be a number, e.g. 20.
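To make the sparse idea concrete, here is a small sketch that builds a bag-of-words vector in the `(size, indices, values)` form shown in the Spark output earlier and counts its non-zero entries. The vocabulary and the `to_sparse` helper are hypothetical and chosen only for illustration.

```python
from collections import Counter

def to_sparse(tokens, vocab):
    """Return (size, indices, values) for the tokens that are in the vocabulary."""
    counts = Counter(vocab[t] for t in tokens if t in vocab)
    indices = sorted(counts)
    values = [float(counts[i]) for i in indices]
    return (len(vocab), indices, values)

# Hypothetical five-word vocabulary mapping each word to a column index.
vocab = {"draw": 0, "stacked": 1, "dotplot": 2, "r": 3, "mysql": 4}
size, indices, values = to_sparse(["draw", "r", "r", "python"], vocab)

# Only the stored entries are non-zero, so counting them is just a length.
non_zero_elements_count = len(indices)
print((size, indices, values), non_zero_elements_count)
```

Out-of-vocabulary tokens such as `python` in this toy example are silently dropped, which is also how a fitted vocabulary behaves on unseen words.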

#### TF-IDF

The second approach extends the bag-of-words framework by taking into account the total frequencies of words in the corpora. It penalizes words that are too frequent and provides a better feature space.
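As a worked illustration of the penalty, the sketch below computes TF-IDF weights by hand on three toy documents. It assumes the smoothed inverse document frequency `log((m + 1) / (df + 1))`, which is the formula documented for Spark's `IDF`; other libraries use slightly different variants, and the helper names are hypothetical.

```python
import math

def idf_weights(docs):
    """Compute log((m + 1) / (df + 1)) for every term in a list of token lists."""
    m = len(docs)
    df = {}
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    return {term: math.log((m + 1) / (n + 1)) for term, n in df.items()}

def tfidf(doc, idf):
    """Multiply raw term counts by the IDF weight of each term."""
    tf = {}
    for term in doc:
        tf[term] = tf.get(term, 0) + 1
    return {term: count * idf[term] for term, count in tf.items()}

docs = [["mysql", "select"], ["mysql", "join"], ["mysql", "php"]]
idf = idf_weights(docs)
weights = tfidf(["select", "select", "mysql"], idf)
print(idf)
print(weights)
```

A word that occurs in every document, like `mysql` here, gets weight zero, while rarer words keep a positive weight.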

Implement the function *tfidf_features* using the class [TfidfVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) from *scikit-learn*. Use the *train* corpus to train the vectorizer. Don't forget to take a look at the arguments that you can pass to it. We suggest that you filter out words that are too rare (occurring in fewer than 5 titles) and too frequent (occurring in more than 90% of the titles). Also, use bigrams along with unigrams in your vocabulary.

Implement the function *train_classifier* for training a classifier. In this task we suggest using the One-vs-Rest approach, which is implemented in the [OneVsRestClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OneVsRestClassifier.html) class. In this approach *k* classifiers (one per tag) are trained. As the base classifier, use [LogisticRegression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html). It is one of the simplest methods, but it often performs well enough in text classification tasks. Training might take some time, because the number of classifiers to train is large.
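The One-vs-Rest idea can be sketched independently of any library: each class gets its own 0/1 target vector, and at prediction time the per-class scores are compared. The helpers below are hypothetical and train nothing; they only show the data reshaping and the single-label decision rule (picking the highest score), which is an assumption that differs from a multilabel setup where each classifier is thresholded on its own.

```python
def one_vs_rest_targets(labels, classes):
    """For each class, build a 0/1 target vector: 1 where the label equals that class."""
    return {c: [1 if lab == c else 0 for lab in labels] for c in classes}

def predict_one_vs_rest(scores):
    """Pick the class whose binary classifier reports the highest score."""
    return max(scores, key=scores.get)

targets = one_vs_rest_targets(["java", "php", "java"], ["java", "php"])
print(targets)

# Hypothetical scores from two already-trained binary classifiers.
best = predict_one_vs_rest({"java": 0.2, "php": 0.7})
print(best)
```

With 100 tags this means 100 binary problems, which is why training takes a while.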

In [87]:
train_set=dataset.select('features','label')
#train_set.withColumnRenamed('Label_vec', 'label')
#train_val=rescaledDatav.select('features','label')
train,test=train_set.randomSplit([0.7,0.3])


In [88]:
#train_val.filter("features is NULL or label is NULL").show()

In [89]:
train.show()

+--------------------+-----+
|            features|label|
+--------------------+-----+
|[-0.3263096362352...|  6.0|
|[-0.3237328912530...|  1.0|
|[-0.3237328912530...| 55.0|
|[-0.3236532732844...|  2.0|
|[-0.3236532732844...| 15.0|
|[-0.3195231917003...|  1.0|
|[-0.3195231917003...|  9.0|
|[-0.3183089345499...|  0.0|
|[-0.3183089345499...|  2.0|
|[-0.3182319621555...|  1.0|
|[-0.3172013629227...|  0.0|
|[-0.3172013629227...| 93.0|
|[-0.3145520538091...|  1.0|
|[-0.3124032041523...|  3.0|
|[-0.3092142727691...| 40.0|
|[-0.3064487703765...|  3.0|
|[-0.3005962437018...| 40.0|
|[-0.2976422578096...| 34.0|
|[-0.2976422578096...| 59.0|
|[-0.2974332298229...|  0.0|
+--------------------+-----+
only showing top 20 rows



In [90]:
print(train.count())
print(test.count())

135797
58375


In [91]:
lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8, family="multinomial")
lrModel = lr.fit(train)
predictions = lrModel.transform(test)
predictions.filter(predictions['prediction'] == 0) \
    .select("features","label","probability","label","prediction") \
    .orderBy("probability", ascending=False) \
    .show(n = 10, truncate = 30)

+--------+-----+-----------+-----+----------+
|features|label|probability|label|prediction|
+--------+-----+-----------+-----+----------+
+--------+-----+-----------+-----+----------+



In [92]:
# No metricName is given, so the evaluator reports its default metric (F1), not accuracy.
evaluator = MulticlassClassificationEvaluator(predictionCol="prediction")
evaluator.evaluate(predictions)

0.01751846519256916

In [93]:
# instantiate the base classifier.
lr = LogisticRegression(maxIter=10, tol=1E-6, fitIntercept=True)

# instantiate the One Vs Rest Classifier.
ovr = OneVsRest(classifier=lr)

# train the multiclass model.
ovrModel = ovr.fit(train)

# score the model on test data.
predictions = ovrModel.transform(test)

# obtain evaluator.
evaluator = MulticlassClassificationEvaluator(metricName="accuracy")



In [94]:
# compute the classification error on test data.
accuracy = evaluator.evaluate(predictions)
print("Test Error = %g" % (1.0 - accuracy))

Test Error = 0.686852
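For reference, the test error reported above is simply one minus the fraction of rows whose prediction equals the label. A tiny sketch with a handful of illustrative `(label, prediction)` pairs; the `accuracy` helper is hypothetical and the pairs are not the full test set.

```python
def accuracy(pairs):
    """Fraction of (label, prediction) pairs where both values agree."""
    correct = sum(1 for label, prediction in pairs if label == prediction)
    return correct / len(pairs)

# Illustrative pairs only; the real evaluation runs over the whole test split.
pairs = [(55.0, 59.0), (7.0, 51.0), (51.0, 51.0), (19.0, 2.0), (40.0, 40.0)]
acc = accuracy(pairs)
print("Test Error = %g" % (1.0 - acc))
```

Because each multi-tag post was duplicated once per tag, a single prediction per row can match at most one of a post's tags, which puts a ceiling on this accuracy.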


In [95]:
predictions.printSchema()

root
 |-- features: vector (nullable = true)
 |-- label: double (nullable = false)
 |-- prediction: double (nullable = true)



In [96]:
predictions.show()

+--------------------+-----+----------+
|            features|label|prediction|
+--------------------+-----+----------+
|[-0.3263096362352...| 55.0|      59.0|
|[-0.3195231917003...|  7.0|      51.0|
|[-0.3195231917003...| 51.0|      51.0|
|[-0.3182319621555...| 19.0|       2.0|
|[-0.3172013629227...| 40.0|      40.0|
|[-0.3145520538091...| 55.0|      59.0|
|[-0.3092142727691...|  0.0|       2.0|
|[-0.3092142727691...|  7.0|       2.0|
|[-0.3005962437018...|  0.0|      40.0|
|[-0.2976422578096...|  1.0|      59.0|
|[-0.2976422578096...|  9.0|      59.0|
|[-0.2976422578096...| 10.0|      59.0|
|[-0.2960290387272...| 15.0|       2.0|
|[-0.2751255961401...|  2.0|      15.0|
|[-0.2751255961401...| 92.0|      15.0|
|[-0.2749468266653...|  2.0|       2.0|
|[-0.2721032863482...|  5.0|       5.0|
|[-0.2682026606053...|  3.0|       2.0|
|[-0.2676488929428...|  8.0|       6.0|
|[-0.2676488929428...| 15.0|       6.0|
+--------------------+-----+----------+
only showing top 20 rows



In [None]:

# Train a Random Forest model.
rf = RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=12,  maxDepth=5, maxMemoryInMB=1024)

# Chain RF in a Pipeline
pipeline = Pipeline(stages=[rf])

# Train the model through the pipeline (previously the pipeline was built but never used).
model = pipeline.fit(train)

# Make predictions.
predictions = model.transform(test)


In [97]:
from pyspark.ml.classification import MultilayerPerceptronClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

In [98]:
# specify layers for the neural network:
# input layer of size 30 (the Word2Vec vector size), two hidden layers of size 400 and 200,
# and an output layer of size 100 (one unit per tag)
layers = [30, 400 , 200, 100]

In [99]:
# create the trainer and set its parameters
trainer = MultilayerPerceptronClassifier(maxIter=100, layers=layers, blockSize=1024, seed=1234)


In [100]:
# train the model
model = trainer.fit(train)


In [101]:
# score the test set (accuracy is computed in a later cell)
result = model.transform(test)

In [102]:
result.show()

+--------------------+-----+--------------------+--------------------+----------+
|            features|label|       rawPrediction|         probability|prediction|
+--------------------+-----+--------------------+--------------------+----------+
|[-0.3263096362352...| 55.0|[3.39891411795353...|[0.01361930924709...|       6.0|
|[-0.3195231917003...|  7.0|[5.63922809927329...|[0.30675189374613...|       0.0|
|[-0.3195231917003...| 51.0|[5.63922809927329...|[0.30675189374613...|       0.0|
|[-0.3182319621555...| 19.0|[2.34687689881737...|[0.02212588483745...|       2.0|
|[-0.3172013629227...| 40.0|[6.43084434313172...|[0.24612627622754...|       0.0|
|[-0.3145520538091...| 55.0|[2.43267761346920...|[0.00838062899082...|       6.0|
|[-0.3092142727691...|  0.0|[5.15608536870696...|[0.16421502124290...|       2.0|
|[-0.3092142727691...|  7.0|[5.15608536870696...|[0.16421502124290...|       2.0|
|[-0.3005962437018...|  0.0|[5.64968250296259...|[0.21119549024560...|       0.0|
|[-0.29764225780

In [103]:
predictionAndLabels = result.select("prediction", "label")
evaluator = MulticlassClassificationEvaluator(metricName="accuracy")
print("Test set accuracy = " + str(evaluator.evaluate(predictionAndLabels)))

Test set accuracy = 0.2947665952890792
