# PySpark Implementation
In this notebook, we train a Naive Bayes (NB) model on the training set after preprocessing the data with our heuristics.

### Imports

In [119]:
import csv, os, sys
from pyspark.rdd import RDD
from pyspark.sql import DataFrame
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.classification import NaiveBayes
from pyspark.ml.feature import RegexTokenizer, StopWordsRemover, CountVectorizer, OneHotEncoder, StringIndexer, VectorAssembler

In [3]:
# Initialize a spark session.
def init_spark():
    spark = SparkSession \
        .builder \
        .appName("Python Spark SQL basic example") \
        .config("spark.some.config.option", "some-value") \
        .getOrCreate()
    return spark

In [4]:
spark = init_spark()

### Loading training and validation data into DataFrames

In [5]:
filename_train = "../dataset/train.csv"
filename_test = "../dataset/valid.csv"

train_rdd = spark.read.csv(filename_train, header=True, multiLine=True, inferSchema=True, escape='"', quote='"')
test_rdd = spark.read.csv(filename_test, header=True, multiLine=True, inferSchema=True, escape='"', quote='"')
train_rdd.show(5)

+--------+--------------------+--------------------+--------------------+-------------------+--------+
|      Id|               Title|                Body|                Tags|       CreationDate|       Y|
+--------+--------------------+--------------------+--------------------+-------------------+--------+
|34552656|Java: Repeat Task...|<p>I'm already fa...|      <java><repeat>|2016-01-01 00:21:59|LQ_CLOSE|
|34553034|Why are Java Opti...|<p>I'd like to un...|    <java><optional>|2016-01-01 02:03:20|      HQ|
|34553174|Text Overlay Imag...|<p>I am attemptin...|<javascript><imag...|2016-01-01 02:48:24|      HQ|
|34553318|Why ternary opera...|<p>The question i...|<swift><operators...|2016-01-01 03:30:17|      HQ|
|34553755|hide/show fab wit...|<p>I'm using cust...|<android><materia...|2016-01-01 05:21:48|      HQ|
+--------+--------------------+--------------------+--------------------+-------------------+--------+
only showing top 5 rows



## Preprocessing data
In this part, we preprocess the data with three heuristics: tokenization first, stop-word removal second, and stemming third. Stemming is still a TODO and is not implemented below.

### Build training & testing dataframes

In [118]:
# Combine Title and Body into a single "Question" column and keep Y as "Output".
# limit(5) keeps a small sample for quick iteration; remove it to use the full dataset.
training = train_rdd.rdd \
    .map(lambda x: (x["Title"]+" "+x["Body"], x["Y"])) \
    .toDF(["Question", "Output"]) \
    .limit(5)

testing = test_rdd.rdd \
    .map(lambda x: (x["Title"]+" "+x["Body"], x["Y"])) \
    .toDF(["Question", "Output"]) \
    .limit(5)
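
The lambda above joins `Title` and `Body` with `+`, which raises a `TypeError` in Python if either value is `None`. Whether this dataset contains nulls is not shown here, so the following is only a sketch of a null-tolerant variant. `build_example` is a hypothetical helper name, and the sample strings are made up.

```python
from typing import Optional, Tuple


def build_example(title: Optional[str], body: Optional[str], label: str) -> Tuple[str, str]:
    """Join title and body into one question string, skipping missing parts."""
    parts = [part for part in (title, body) if part]
    return " ".join(parts), label


# Made-up sample values, not rows from the dataset.
print(build_example("Sample title", "<p>Sample body</p>", "HQ"))
print(build_example("Title only", None, "LQ_CLOSE"))
```

Under that assumption, the helper could replace the inline expression as `.map(lambda x: build_example(x["Title"], x["Body"], x["Y"]))`.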

### Heuristics

In [108]:
# HEURISTIC 1 - Tokenize the words
regexTokenizer = RegexTokenizer(inputCol="Question", outputCol="words", pattern="\\W")

# HEURISTIC 2 - Remove the stopwords
# Note: setStopWords replaces the remover's default English stop-word list with this custom list.
add_stopwords = ["the", "a", "be", "of", "and", "to", "why"]
stopwordsRemover = StopWordsRemover(inputCol="words", outputCol="filtered").setStopWords(add_stopwords)

# HEURISTIC 3 - Stem words
# Use this library (TODO) - https://github.com/master/spark-stemming
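
To make the first two heuristics concrete, here is a plain-Python approximation of what the tokenizer and stop-word remover do to one string. It is a sketch that assumes the Spark defaults (lowercasing, splitting on the pattern, dropping empty tokens, case-insensitive stop-word matching). It does not call Spark, and the sample text is made up.

```python
import re

# Same custom list as add_stopwords above.
STOPWORDS = {"the", "a", "be", "of", "and", "to", "why"}


def tokenize(text):
    """Lowercase, split on every non-word character, and drop empty tokens."""
    return [tok for tok in re.split(r"\W", text.lower(), flags=re.ASCII) if tok]


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Keep only the tokens that are not in the stop-word set."""
    return [tok for tok in tokens if tok not in stopwords]


sample = "<p>Why repeat the task?</p>"  # made-up text with HTML, like the Body column
words = tokenize(sample)
filtered = remove_stopwords(words)
print(words)     # ['p', 'why', 'repeat', 'the', 'task', 'p']
print(filtered)  # ['p', 'repeat', 'task', 'p']
```

Note that the HTML tags survive as stray `p` tokens, so stripping markup before tokenizing may be worth considering.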

In [112]:
# Build bag-of-words (BoW) models: one CountVectorizer and one label indexer per heuristic
countVectors_h1 = CountVectorizer(inputCol="words", outputCol="features")
indexed_features_h1 = StringIndexer(inputCol = "Output", outputCol = "label")

countVectors_h2 = CountVectorizer(inputCol="filtered", outputCol="features")
indexed_features_h2 = StringIndexer(inputCol = "Output", outputCol = "label")
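
The two stages above can be illustrated in plain Python. This sketch assumes the default behaviours: the vocabulary is ordered by corpus term frequency, and labels are indexed by descending frequency. The alphabetical tie-breaking in the vocabulary is an assumption of this sketch, and the token lists are made up. The label list mirrors the five `Output` values in the sample.

```python
from collections import Counter


def fit_vocabulary(docs):
    """Map each term to an index, most frequent first (ties alphabetical here)."""
    counts = Counter(tok for doc in docs for tok in doc)
    ordered = sorted(counts, key=lambda term: (-counts[term], term))
    return {term: idx for idx, term in enumerate(ordered)}


def to_sparse(doc, vocab):
    """Return (vocab_size, indices, counts), mirroring a sparse vector's layout."""
    counts = Counter(tok for tok in doc if tok in vocab)
    by_index = sorted((vocab[term], float(n)) for term, n in counts.items())
    return len(vocab), [i for i, _ in by_index], [n for _, n in by_index]


def fit_label_index(labels):
    """Map each label to a float index, most frequent label first."""
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(idx) for idx, label in enumerate(ordered)}


docs = [["java", "repeat", "task"], ["java", "optional"], ["java", "task"]]  # made up
vocab = fit_vocabulary(docs)
print(vocab)                      # {'java': 0, 'task': 1, 'optional': 2, 'repeat': 3}
print(to_sparse(docs[0], vocab))  # (4, [0, 1, 3], [1.0, 1.0, 1.0])
print(fit_label_index(["LQ_CLOSE", "HQ", "HQ", "HQ", "HQ"]))  # {'HQ': 0.0, 'LQ_CLOSE': 1.0}
```

The `(size, indices, values)` triple has the same shape as the `features` column in the pipeline output. The label mapping agrees with the `label` column there, where `HQ` is 0.0 and `LQ_CLOSE` is 1.0.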

### Pipeline

In [113]:
# Pipeline with Heuristic 1
pipeline_h1 = Pipeline(stages=[regexTokenizer, countVectors_h1, indexed_features_h1])

# Pipeline with Heuristic 2
pipeline_h2 = Pipeline(stages=[regexTokenizer, stopwordsRemover, countVectors_h2, indexed_features_h2])

# Pipeline with Heuristic 3 (disabled until the stemming stage is implemented)
# pipeline_h3 = Pipeline(stages=[regexTokenizer, stopwordsRemover, stemRemover, countVectors, label_stringIdx])

In [117]:
pipeline_h1_fit = pipeline_h1.fit(training)
data_h1 = pipeline_h1_fit.transform(training)
data_h1.show()

pipeline_h2_fit = pipeline_h2.fit(training)
data_h2 = pipeline_h2_fit.transform(training)
data_h2.show()

# pipeline_h3_fit = pipeline_h3.fit(training)
# data_h3 = pipeline_h3_fit.transform(training)
# data_h3.show()

+--------------------+--------+--------------------+--------------------+-----+
|            Question|  Output|               words|            features|label|
+--------------------+--------+--------------------+--------------------+-----+
|Java: Repeat Task...|LQ_CLOSE|[java, repeat, ta...|(321,[0,5,6,7,8,1...|  1.0|
|Why are Java Opti...|      HQ|[why, are, java, ...|(321,[0,5,7,10,26...|  0.0|
|Text Overlay Imag...|      HQ|[text, overlay, i...|(321,[0,1,2,3,4,5...|  0.0|
|Why ternary opera...|      HQ|[why, ternary, op...|(321,[0,1,5,6,7,8...|  0.0|
|hide/show fab wit...|      HQ|[hide, show, fab,...|(321,[0,1,5,7,8,1...|  0.0|
+--------------------+--------+--------------------+--------------------+-----+

+--------------------+--------+--------------------+--------------------+--------------------+-----+
|            Question|  Output|               words|            filtered|            features|label|
+--------------------+--------+--------------------+--------------------+----

## Model training
In this part, we train an NB classifier on the features produced by each preprocessing pipeline.