# Baseline Model

---

This model ignores the review text and bases its predictions solely on the marginal probabilities of the labels.

Always predicting the most common wine variety is the best we can do if we completely ignore the text in the `description`.
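
As a minimal illustration of what predicting from marginal probabilities means, the sketch below uses a small made-up list of varieties (not the actual dataset) and plain Python instead of Spark. The names `toy_varieties` and `marginal_probabilities` are hypothetical and exist only for this example.

```python
from collections import Counter

# Toy labels, invented for illustration only (not taken from the dataset).
toy_varieties = ["pinot noir", "chardonnay", "pinot noir", "riesling",
                 "pinot noir", "chardonnay"]

def marginal_probabilities(labels):
    """Return the relative frequency of each label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

probs = marginal_probabilities(toy_varieties)
# The baseline prediction is the label with the highest marginal probability.
baseline_prediction = max(probs, key=probs.get)
print(baseline_prediction, probs[baseline_prediction])
```

Whatever the review says, the baseline always returns `baseline_prediction`.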

---

In [1]:
import os
from IPython.display import Markdown, display
from pyspark.ml import Pipeline
from pyspark.ml.pipeline import PipelineModel
from pyspark.ml.feature import StringIndexer, IndexToString
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.sql import SparkSession

In [2]:
%matplotlib inline
def warn(string):
    display(Markdown('<span style="color:red">'+string+'</span>'))
def info(string):
    display(Markdown('<span style="color:blue">'+string+'</span>'))

In [3]:
spark = SparkSession.builder \
    .master("local[*]") \
    .config("spark.driver.memory", "20g") \
    .config("spark.executor.cores", "16") \
    .config("spark.driver.maxResultSize", "1g") \
    .appName("jojoSparkSession") \
    .getOrCreate()
    # .config("spark.default.parallelism", "16") \
sc = spark.sparkContext

In [4]:
# load the dataset
reviews_sdf = spark.read.parquet('data/reviews_cleaned')
reviews_sdf.show()

+-----+--------------------+------------------+
|index|         description|           variety|
+-----+--------------------+------------------+
|    0|aromas include tr...|       white blend|
|    1|this is ripe and ...|    portuguese red|
|    2|tart and snappy, ...|        pinot gris|
|    3|pineapple rind, l...|          riesling|
|    4|much like the reg...|        pinot noir|
|    7|this dry and rest...|    gewürztraminer|
|    8|savory dried thym...|    gewürztraminer|
|    9|this has great de...|        pinot gris|
|   10|soft, supple plum...|cabernet sauvignon|
|   11|this is a dry win...|    gewürztraminer|
|   12|slightly reduced,...|cabernet sauvignon|
|   14|building on 150 y...|        chardonnay|
|   15|zesty orange peel...|          riesling|
|   16|baked plum, molas...|            malbec|
|   17|raw black-cherry ...|            malbec|
|   18|desiccated blackb...| tempranillo blend|
|   19|red fruit aromas ...|          meritage|
|   20|ripe aromas of da...|         red

## Feature Creation

Since we are not using any features for this model, there is nothing to do here.

## Model Definition

We want to predict the label based on the frequency of labels in our training dataset.

The steps are:

1. Convert the wine variety column to a categorical label.
2. Identify the most common label and use it as prediction.

_See the [README.md](./README.md#Model-Definition) for further details._

We will implement both steps using Spark Transformers and Estimators.

For **step 1** we can readily use pyspark's [StringIndexer](https://spark.apache.org/docs/3.1.1/ml-features.html#stringindexer).

**Step 2** needs some more work on our side.
We implement a custom estimator `MarginalMaximizer` by subclassing pyspark's [Estimator](https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.ml.Estimator.html).
The implementation is found in [utils/Estimators.py](./utils/Estimators.py).
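
The actual implementation lives in `utils/Estimators.py` and is not reproduced here. To illustrate the fit/transform pattern such an estimator follows, here is a Spark-free sketch in plain Python. `ToyMarginalModel` and `ToyMarginalMaximizer` are hypothetical names, and modelling rows as dictionaries is an assumption made purely for illustration.

```python
from collections import Counter

class ToyMarginalModel:
    """Fitted model: assigns the same stored label to every row."""
    def __init__(self, most_common_label, output_col):
        self.most_common_label = most_common_label
        self.output_col = output_col

    def transform(self, rows):
        # Return new rows with the constant prediction added.
        return [{**row, self.output_col: self.most_common_label} for row in rows]

class ToyMarginalMaximizer:
    """Estimator: learns the most frequent value of input_col."""
    def __init__(self, input_col, output_col):
        self.input_col = input_col
        self.output_col = output_col

    def fit(self, rows):
        counts = Counter(row[self.input_col] for row in rows)
        most_common_label, _ = counts.most_common(1)[0]
        return ToyMarginalModel(most_common_label, self.output_col)

# Toy training rows (hypothetical): label 0.0 is the most frequent one.
train = [{"label": 0.0}, {"label": 1.0}, {"label": 0.0}]
model = ToyMarginalMaximizer("label", "prediction").fit(train)
predictions = model.transform([{"label": 1.0}, {"label": 2.0}])
```

With its default ordering (`frequencyDesc`), `StringIndexer` gives index 0 to the most frequent label, so a majority-class predictor placed after it should end up predicting `0.0` for every row.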

In [5]:
# initiate our string indexer
si = StringIndexer(inputCol='variety', outputCol='label', handleInvalid='keep')

# import our custom estimator
from utils.Estimators import MarginalMaximizer
# and initiate it
mm = MarginalMaximizer(inputCol='label', outputCol='prediction')

# all that is left is to set up a pipeline:
marginal_ppl = Pipeline(stages=[si, mm])

## Model Training

Now we can start 'training' our baseline model.

But first, split our data into a training and a test set:

In [6]:
df_train, df_test = reviews_sdf.randomSplit([0.9, 0.1])

In [7]:
%%time
# ###
# This is simply to get an idea of how long fitting takes.
# Unlike transform(), fit() runs eagerly (the StringIndexer has to
# count the labels), so no extra action is needed to force execution.
# ###
marginal_model = marginal_ppl.fit(df_train)

CPU times: user 16.2 ms, sys: 12.5 ms, total: 28.7 ms
Wall time: 5.29 s


## Model Evaluation

With `marginal_model` fitted to the training data, we can now check its performance on both the training and the test set:

In [8]:
pred_train = marginal_model.transform(df_train)
pred_train.show()
# nbr_labels = int(pred_train.agg({'label': 'max'}).collect()[0]['max(label)'])
# info(f'We are dealing with {nbr_labels} different labels.')

+-----+--------------------+------------------+-----+----------+
|index|         description|           variety|label|prediction|
+-----+--------------------+------------------+-----+----------+
|    0|aromas include tr...|       white blend| 15.0|       0.0|
|    1|this is ripe and ...|    portuguese red| 14.0|       0.0|
|    2|tart and snappy, ...|        pinot gris| 19.0|       0.0|
|    3|pineapple rind, l...|          riesling|  5.0|       0.0|
|    4|much like the reg...|        pinot noir|  0.0|       0.0|
|    7|this dry and rest...|    gewürztraminer| 27.0|       0.0|
|    9|this has great de...|        pinot gris| 19.0|       0.0|
|   10|soft, supple plum...|cabernet sauvignon|  2.0|       0.0|
|   11|this is a dry win...|    gewürztraminer| 27.0|       0.0|
|   12|slightly reduced,...|cabernet sauvignon|  2.0|       0.0|
|   14|building on 150 y...|        chardonnay|  1.0|       0.0|
|   15|zesty orange peel...|          riesling|  5.0|       0.0|
|   16|baked plum, molas.

We use **accuracy** as the evaluation metric, since it gives the fraction of labels that were guessed correctly.
In this baseline model the predicted label is the same for all samples, namely the most common label in the training data.
It follows that the accuracy equals the marginal probability of this label in the evaluated dataset.

Now we define our evaluation metrics:

In [9]:
accuracy = MulticlassClassificationEvaluator(labelCol='label', predictionCol='prediction',
                                             metricName='accuracy')
f1 = MulticlassClassificationEvaluator(labelCol='label', predictionCol='prediction',
                                       metricName='f1')

In [10]:
info('How well is the marginal prediction doing?:')
info(f'Accuracy: **{round(accuracy.evaluate(pred_train) * 100, 2)}%**')
info(f'F1 score: **{round(f1.evaluate(pred_train), 4)}**')

<span style="color:blue">How well is the marginal prediction doing?:</span>

<span style="color:blue">Accuracy: **11.35%**</span>

<span style="color:blue">F1 score: **0.0231**</span>

In [11]:
info('How well are we doing on the test dataset?:')
pred_test = marginal_model.transform(df_test)
info(f'Accuracy: **{round(accuracy.evaluate(pred_test) * 100, 2)}%**')
info(f'F1 score: **{round(f1.evaluate(pred_test), 4)}**')

<span style="color:blue">How well are we doing on the test dataset?:</span>

<span style="color:blue">Accuracy: **11.3%**</span>

<span style="color:blue">F1 score: **0.023**</span>

Now we have a baseline:

**An accuracy of around 11% can be obtained without considering the text in the `description` column.**

Any model that performs better than that is able to extract some information from the review text.
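
The reported accuracy and F1 score are consistent with each other. The sketch below assumes that Spark's `f1` metric is the support-weighted average of the per-class F1 scores and that the model predicts a single class for every sample. Under those assumptions only the majority class has a non-zero F1: its precision equals its share `p` of the data and its recall is 1. The function name `constant_predictor_metrics` is hypothetical.

```python
def constant_predictor_metrics(p):
    """Accuracy and support-weighted F1 when every sample is assigned
    the majority class, whose share of the data is p."""
    accuracy = p                    # only majority-class samples are correct
    precision, recall = p, 1.0      # values for the majority class
    f1_majority = 2 * precision * recall / (precision + recall)
    weighted_f1 = p * f1_majority   # all other classes contribute F1 = 0
    return accuracy, weighted_f1

# 0.1135 is the training accuracy reported above.
acc, weighted_f1 = constant_predictor_metrics(0.1135)
print(round(weighted_f1, 4))
```

For the training share of 0.1135 this gives a weighted F1 of about 0.0231, matching the value reported for the training set.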

## Deployment

Since this is a rather simple model, we save it here directly in a deployable form.

Have a look at [reviewed_grapes.model_deployment.pyspark.v1.ipynb](reviewed_grapes.model_deployment.pyspark.v1.ipynb) for details on the deployment process.
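
The cell below swaps the `StringIndexer` for an `IndexToString` stage. One plausible motivation (an assumption, not stated in this notebook) is that at inference time there is no `variety` column to index, while the numeric prediction still has to be mapped back to a variety name. The plain-Python sketch below shows that mapping. The `toy_labels` list is hypothetical, although its order agrees with the indices visible in the evaluation output (`pinot noir` is 0.0, `chardonnay` is 1.0, `cabernet sauvignon` is 2.0).

```python
# Hypothetical label list, ordered as a StringIndexer would store it
# (position in the list == numeric index).
toy_labels = ["pinot noir", "chardonnay", "cabernet sauvignon"]

def index_to_string(prediction, labels):
    """Map a numeric class index (stored as a float) back to its label."""
    return labels[int(prediction)]

# The baseline always predicts index 0.0, i.e. the most frequent variety.
predicted_variety = index_to_string(0.0, toy_labels)
print(predicted_variety)
```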


In [12]:
# remove the StringIndexer
s_first = marginal_model.stages.pop(0)
# get the labels from the string indexer
labels = s_first.labels
# now we construct an index->string transformer
s_last = marginal_model.stages[-1]
its = IndexToString(inputCol=s_last.getOutputCol(),
                    outputCol='_prediction',
                    labels=labels)
marginal_model.stages.append(its)
# the model is ready for deployment
name = 'reviewed_grapes/fitted_models/MarginalModel'
marginal_model.write().overwrite().save(name)

---