# DiC Assignment 2

**Group 6:**
 Theresa Mayer
 Theresa Bruckner
 Jan Tölken
 Can Kenan Kandil 
 Thomas Klar


# Imports and Spark Session Creation

In [1]:
from pyspark.ml import Pipeline
from pyspark.ml.feature import RegexTokenizer, StopWordsRemover, IDF, ChiSqSelector, IndexToString, StringIndexer, CountVectorizer
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Assignment2").getOrCreate()

## Part 1

## Part 2

We want to convert the review texts to a classic vector space representation with TF-IDF-weighted features, using the
Spark DataFrame/Dataset API to build a transformation pipeline. The primary goal of this part is to
prepare the pipeline for Part 3.

We start by loading the data into a Spark DataFrame.

In [2]:
path = "reviews_devset.json"  # alternative HDFS path: "hdfs:///user/dic25_shared/amazon-reviews/full/reviews_devset.json"
input_file = spark.read.format("json").load(path).select("category", "reviewText")

In [3]:
input_file.show(n=5)

+--------------------+--------------------+
|            category|          reviewText|
+--------------------+--------------------+
|Patio_Lawn_and_Garde|This was a gift f...|
|Patio_Lawn_and_Garde|This is a very ni...|
|Patio_Lawn_and_Garde|The metal base wi...|
|Patio_Lawn_and_Garde|For the most part...|
|Patio_Lawn_and_Garde|This hose is supp...|
+--------------------+--------------------+


### Label Encoding

As the first step, we perform label encoding to convert the category strings into numeric indices. To map the indices back to category strings if necessary, we also create an _IndexToString_ transformer.

In [4]:
indexer = StringIndexer(inputCol="category", outputCol="label")
indexModel = indexer.fit(input_file)
input_file_1 = indexModel.transform(input_file)

In [5]:
reindexer = IndexToString(inputCol=indexer.getOutputCol(), outputCol="category_reindexed")
reindexer.transform(input_file_1).show(n=5)

+--------------------+--------------------+-----+--------------------+
|            category|          reviewText|label|  category_reindexed|
+--------------------+--------------------+-----+--------------------+
|Patio_Lawn_and_Garde|This was a gift f...| 18.0|Patio_Lawn_and_Garde|
|Patio_Lawn_and_Garde|This is a very ni...| 18.0|Patio_Lawn_and_Garde|
|Patio_Lawn_and_Garde|The metal base wi...| 18.0|Patio_Lawn_and_Garde|
|Patio_Lawn_and_Garde|For the most part...| 18.0|Patio_Lawn_and_Garde|
|Patio_Lawn_and_Garde|This hose is supp...| 18.0|Patio_Lawn_and_Garde|
+--------------------+--------------------+-----+--------------------+


### Tokenization
As the next step, we tokenize the reviews into words. To do this, we split at whitespace, tabs, digits, and all the symbols given in the regex pattern below. The tokenizer also performs case folding (_toLowercase=True_) and drops tokens with only one character (_minTokenLength=2_).

In [6]:
tokenizer = RegexTokenizer(inputCol='reviewText', 
                           outputCol='tokens', 
                           pattern=r"[ \t\d(){}\[\].!?;:,\-=\"~#@&*%€$§\\'\n\r\/]+", 
                           minTokenLength=2, 
                           toLowercase=True)

In [7]:
input_2 = tokenizer.transform(input_file_1)
input_2.show(n=5)

+--------------------+--------------------+-----+--------------------+
|            category|          reviewText|label|              tokens|
+--------------------+--------------------+-----+--------------------+
|Patio_Lawn_and_Garde|This was a gift f...| 18.0|[this, was, gift,...|
|Patio_Lawn_and_Garde|This is a very ni...| 18.0|[this, is, very, ...|
|Patio_Lawn_and_Garde|The metal base wi...| 18.0|[the, metal, base...|
|Patio_Lawn_and_Garde|For the most part...| 18.0|[for, the, most, ...|
|Patio_Lawn_and_Garde|This hose is supp...| 18.0|[this, hose, is, ...|
+--------------------+--------------------+-----+--------------------+
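The splitting behaviour of the tokenizer can be tried out without a Spark session. The following is a minimal sketch, assuming that Python's `re` module treats the delimiter pattern the same way as the Java regex engine used by Spark; the helper `tokenize` and the sample sentence are made up for illustration and are not part of the notebook.

```python
import re

# Same delimiter pattern as in the RegexTokenizer cell above.
PATTERN = re.compile(r"[ \t\d(){}\[\].!?;:,\-=\"~#@&*%€$§\\'\n\r\/]+")

def tokenize(text, min_token_length=2):
    """Hypothetical helper: lowercase, split on delimiters, drop short tokens."""
    pieces = PATTERN.split(text.lower())
    # Empty strings and one-character tokens fall below the minimum length.
    return [p for p in pieces if len(p) >= min_token_length]

tokens = tokenize("This was a gift for my husband, 100% worth it!")
print(tokens)
```

The single-character token "a" and the digits are removed, which matches the `[this, was, gift, ...]` prefix visible in the first output row above.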


### Stopword Removal
The next transformer filters the stopwords out of the tokens. We use the same stopword list as in the previous exercise.

In [8]:
stopword_file = "stopwords.txt"
with open(stopword_file, 'r', encoding='utf-8') as f:
    # Strip surrounding whitespace (including the trailing newline) from each line
    stopwords = [line.strip() for line in f]

In [9]:
stopword_remover = StopWordsRemover(inputCol=tokenizer.getOutputCol(), 
                                    outputCol="tokens_nostop",
                                    stopWords=stopwords)

In [10]:
input_3 = stopword_remover.transform(input_2)
input_3.select("tokens", "tokens_nostop").show(n=5)

+--------------------+--------------------+
|              tokens|       tokens_nostop|
+--------------------+--------------------+
|[this, was, gift,...|[gift, husband, m...|
|[this, is, very, ...|[nice, spreader, ...|
|[the, metal, base...|[metal, base, hos...|
|[for, the, most, ...|[part, works, pre...|
|[this, hose, is, ...|[hose, supposed, ...|
+--------------------+--------------------+


### TF-IDF Calculation
The calculation of the TF-IDF vectors from the list of tokens is performed in two steps. The first uses _CountVectorizer_ to calculate term frequencies, outputting sparse vectors. We cap the vocabulary at 60,000 tokens, which might be unnecessarily large. The _CountVectorizer_ also allows us to access the vocabulary at the end to extract the most important words from it. The second step uses _IDF_ to scale the term frequency vectors with the inverse document frequencies; with _minDocFreq=4_, terms that occur in fewer than four documents receive a weight of zero.

Note that both are Estimators, not Transformers, and thus require fitting. In the Pipeline, these Estimators are fitted on the training data when calling _pipeline.fit()_, and each fitted stage then transforms the training data for the stages that follow. The test data is only transformed, by calling _transform()_ on the fitted _PipelineModel_, since the Estimators are already fitted. This also ensures that there is no data leakage from the test data.
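The fit/transform split can be illustrated with a toy example in plain Python. This is only a sketch: the classes `ToyCountVectorizer` and `ToyCountVectorizerModel` are hypothetical stand-ins and do not reproduce Spark's implementation. The point is that the vocabulary is learned from the training documents only, so a term that appears only in the test data cannot influence the features.

```python
from collections import Counter

class ToyCountVectorizer:
    """Hypothetical stand-in for an Estimator: fit() learns a vocabulary."""

    def __init__(self, vocab_size):
        self.vocab_size = vocab_size

    def fit(self, docs):
        counts = Counter(t for doc in docs for t in doc)
        vocab = [t for t, _ in counts.most_common(self.vocab_size)]
        return ToyCountVectorizerModel(vocab)

class ToyCountVectorizerModel:
    """Hypothetical stand-in for the fitted Transformer."""

    def __init__(self, vocabulary):
        self.vocabulary = vocabulary
        self._index = {t: i for i, t in enumerate(vocabulary)}

    def transform(self, docs):
        # Terms outside the learned vocabulary are silently dropped.
        return [
            dict(Counter(self._index[t] for t in doc if t in self._index))
            for doc in docs
        ]

train = [["hose", "leaks"], ["hose", "works"]]
test = [["hose", "kinks"]]

model = ToyCountVectorizer(vocab_size=10).fit(train)  # fitted on train only
test_vectors = model.transform(test)                  # test is only transformed
print(model.vocabulary, test_vectors)
```

Here "kinks" occurs only in the test document, so it has no index in the fitted vocabulary and does not appear in the transformed output.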

In [11]:
tf = CountVectorizer(inputCol=stopword_remover.getOutputCol(), 
                      outputCol="tf_output", 
                      vocabSize=60_000)

In [12]:
idf = IDF(inputCol=tf.getOutputCol(), 
          outputCol="tfidf_output",
          minDocFreq=4)

In [13]:
tfmodel = tf.fit(input_3)
input_4 = tfmodel.transform(input_3)
input_4.select("tokens_nostop", "tf_output").show(n=5)

+--------------------+--------------------+
|       tokens_nostop|           tf_output|
+--------------------+--------------------+
|[gift, husband, m...|(60000,[2,3,7,8,3...|
|[nice, spreader, ...|(60000,[0,1,3,21,...|
|[metal, base, hos...|(60000,[4,10,29,1...|
|[part, works, pre...|(60000,[1,3,4,9,1...|
|[hose, supposed, ...|(60000,[12,32,42,...|
+--------------------+--------------------+


In [14]:
idfModel = idf.fit(input_4)
input_5 = idfModel.transform(input_4)
input_5.select("tf_output", "tfidf_output").show(n=5)

+--------------------+--------------------+
|           tf_output|        tfidf_output|
+--------------------+--------------------+
|(60000,[2,3,7,8,3...|(60000,[2,3,7,8,3...|
|(60000,[0,1,3,21,...|(60000,[0,1,3,21,...|
|(60000,[4,10,29,1...|(60000,[4,10,29,1...|
|(60000,[1,3,4,9,1...|(60000,[1,3,4,9,1...|
|(60000,[12,32,42,...|(60000,[12,32,42,...|
+--------------------+--------------------+
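To make the weighting concrete, here is a small sketch in plain Python. It assumes the smoothed formula given in the Spark ML documentation, idf(t) = log((m + 1) / (df(t) + 1)) with m documents and document frequency df(t), and raw counts as term frequencies. The function `tfidf` and the three toy documents are illustrative only.

```python
import math
from collections import Counter

def tfidf(docs, min_doc_freq=0):
    """Return one {term: tf * idf} dict per tokenized document."""
    m = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    idf = {
        # Terms below min_doc_freq get weight 0, mirroring minDocFreq.
        t: math.log((m + 1) / (n + 1)) if n >= min_doc_freq else 0.0
        for t, n in df.items()
    }
    return [{t: c * idf[t] for t, c in Counter(doc).items()} for doc in docs]

docs = [["hose", "leaks", "hose"], ["hose", "works"], ["nice", "spreader"]]
weights = tfidf(docs)
filtered = tfidf(docs, min_doc_freq=2)
print(weights[0])
```

In the first document "hose" occurs twice but is present in two of the three documents, so its weight is 2 * log(4/3); "leaks" occurs once in a single document and gets log(4/2). With `min_doc_freq=2` the weight of "leaks" becomes zero.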


### Selection of top 2000 features

In [15]:
chisq = ChiSqSelector(featuresCol=idf.getOutputCol(),
                      labelCol="label",
                      outputCol="features",
                      numTopFeatures=2000)

In [16]:
chisqModel = chisq.fit(input_5)
input_6 = chisqModel.transform(input_5)
input_6.select("features").show(n=5)

+--------------------+
|            features|
+--------------------+
|(2000,[2,3,7,8,35...|
|(2000,[0,1,3,21,3...|
|(2000,[4,10,174,3...|
|(2000,[1,3,4,9,10...|
|(2000,[12,29,101,...|
+--------------------+
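_ChiSqSelector_ ranks features by a chi-square test of independence between each feature and the label. The sketch below computes the Pearson chi-square statistic in plain Python for a simplified case, a binary "term present" flag against the category. This is an assumption made for readability: Spark treats every distinct feature value as its own category, and the function `chi_square` and the toy data are not part of the notebook.

```python
from collections import Counter

def chi_square(presence, labels):
    """Pearson chi-square statistic for a term-presence flag vs. class labels."""
    n = len(labels)
    observed = Counter(zip(presence, labels))
    row_totals = Counter(presence)
    col_totals = Counter(labels)
    stat = 0.0
    for r, r_total in row_totals.items():
        for c, c_total in col_totals.items():
            # Expected count if term presence and label were independent.
            expected = r_total * c_total / n
            stat += (observed[(r, c)] - expected) ** 2 / expected
    return stat

labels = ["garden", "garden", "book", "book"]
hose = [1, 1, 0, 0]  # appears only in garden reviews
good = [1, 0, 1, 0]  # appears equally often in both categories

print(chi_square(hose, labels), chi_square(good, labels))
```

A term that is concentrated in one category ("hose") yields a large statistic, while a term spread evenly across categories ("good") yields zero and would be ranked last.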


### Pipeline Creation

In [17]:
def get_pipeline(n_features=2000):
    chisq.setNumTopFeatures(n_features)
    pipeline = Pipeline(stages=[
        indexer,
        tokenizer,
        stopword_remover,
        tf,
        idf,
        chisq
    ])
    return pipeline

In [18]:
pipeline = get_pipeline(n_features=2000)
preprocessing_pipeline = pipeline.fit(input_file)
preprocessing_pipeline.transform(input_file).select("label", "features").show(n=5)

+-----+--------------------+
|label|            features|
+-----+--------------------+
| 18.0|(2000,[2,3,7,8,35...|
| 18.0|(2000,[0,1,3,21,3...|
| 18.0|(2000,[4,10,174,3...|
| 18.0|(2000,[1,3,4,9,10...|
| 18.0|(2000,[12,29,101,...|
+-----+--------------------+


### Export most important tokens to file

In [19]:
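def get_top_terms_from_pipeline(pipeline_model, output_path="output_ds.txt"):
    # Stage 3 is the fitted CountVectorizer, stage 5 the fitted ChiSqSelector.
    vocab = pipeline_model.stages[3].vocabulary
    selected = pipeline_model.stages[5].selectedFeatures
    # selectedFeatures holds vocabulary indices; map them back to tokens.
    top_words = " ".join(sorted(vocab[i] for i in selected))

    with open(output_path, "w", encoding="utf-8") as f:
        f.write(top_words)

    return len(selected)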

In [20]:
get_top_terms_from_pipeline(preprocessing_pipeline)

2000

## Part 3

To develop the SVM, you can use this DataFrame for testing:

In [21]:
input_test = preprocessing_pipeline.transform(input_file).select("label", "features")
input_test.show()

+-----+--------------------+
|label|            features|
+-----+--------------------+
| 18.0|(2000,[2,3,7,8,35...|
| 18.0|(2000,[0,1,3,21,3...|
| 18.0|(2000,[4,10,174,3...|
| 18.0|(2000,[1,3,4,9,10...|
| 18.0|(2000,[12,29,101,...|
| 18.0|(2000,[0,3,4,8,11...|
| 18.0|(2000,[18,112,175...|
| 18.0|(2000,[6,21,32,36...|
| 18.0|(2000,[3,4,5,6,40...|
| 18.0|(2000,[6,8,38,78,...|
| 18.0|(2000,[1,13,226],...|
| 18.0|(2000,[5,17,33,40...|
| 18.0|(2000,[1,11,28,35...|
| 18.0|(2000,[40,144,339...|
| 18.0|(2000,[0,3,7,9,11...|
| 18.0|(2000,[8,26,57,80...|
| 18.0|(2000,[1,15,120,1...|
| 18.0|(2000,[2,3,221,26...|
| 18.0|(2000,[4,10,16,20...|
| 18.0|(2000,[0,18,30,42...|
+-----+--------------------+


Once you have a working SVM, you can create an end-to-end pipeline like this:

In [22]:
def get_model_pipeline(n_features=2000, add_more_arguments_here=None):
    chisq.setNumTopFeatures(n_features)
    # Placeholder: set the model's hyperparameters here.
    
    model_pipeline = Pipeline(stages=[
        indexer,
        tokenizer,
        stopword_remover,
        tf,
        idf,
        chisq,
        # add all new transformers/estimators here 
    ])
    return model_pipeline

Then you can do end-to-end testing with train and test set similar to this:

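# Hold out 20% of the reviews for testing (split ratio and seed are example values).
trainset, testset = input_file.randomSplit([0.8, 0.2], seed=42)
model_pipeline = get_model_pipeline(n_features=2000)
model = model_pipeline.fit(trainset)
output = model.transform(testset)
# calculate_metrics is a placeholder for the evaluation code still to be written.
calculate_metrics(output)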