# Model Tuning Quiz
Use this Jupyter notebook to find the answer to the quiz question in the previous section. The answer key is in the next part of the lesson.

In [20]:
from pyspark.sql import SparkSession
from pyspark.ml.feature import CountVectorizer, IDF, RegexTokenizer, StringIndexer
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.pipeline import Pipeline
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder, TrainValidationSplit
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# TODOs:
# 1) import any other libraries you might need
# 2) run the cells below to read the dataset
# 3) follow the steps below to find the answer to the quiz question

In [2]:
spark = SparkSession.builder \
    .master("local") \
    .appName("Creating Features") \
    .getOrCreate()

In [3]:
stack_overflow_data = 'Train_onetag_small.json'

In [4]:
df = spark.read.json(stack_overflow_data)
df.persist()

DataFrame[Body: string, Id: bigint, Tags: string, Title: string, oneTag: string]

# Question
What is the accuracy of the best model trained with the parameter grid described in Step 3 below (keeping all other parameters at their default values), computed on the 10% of untouched data?

### Step 1. Train Test Split
As a first step, split the data set into 90% training data and set aside the remaining 10% for testing. Set the random seed to `42`.

In [6]:
# TODO: write your code for this step

In [25]:
# train test split
train, test = df.randomSplit([0.9, 0.1], seed=42)
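
`randomSplit` assigns rows to the splits at random according to the weights, so the resulting sizes are only approximately 90% and 10%, and the seed is what makes the split repeatable. The plain-Python sketch below illustrates that idea on a hypothetical list of 1,000 row ids. `toy_random_split` is an illustrative helper, not part of PySpark, and it does not reproduce Spark's actual sampling.

```python
import random


def toy_random_split(rows, train_weight=0.9, seed=42):
    """Send each row to train with probability train_weight (illustration only)."""
    rng = random.Random(seed)  # fixed seed -> same split on every run
    train_part, test_part = [], []
    for row in rows:
        (train_part if rng.random() < train_weight else test_part).append(row)
    return train_part, test_part


rows = list(range(1000))  # hypothetical row ids
train_rows, test_rows = toy_random_split(rows)
print(len(train_rows), len(test_rows))  # roughly 900 and 100
```

Calling `toy_random_split(rows)` again with the same seed returns the same two lists, which is the property the quiz relies on when it fixes the seed at `42`.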

### Step 2. Build Pipeline

In [None]:
# TODO: write your code for this step

In [10]:
# transformers
regexTokenizer = RegexTokenizer(inputCol="Body", outputCol="words", pattern="\\W")
cv = CountVectorizer(inputCol="words", outputCol="TF", vocabSize=10000)
idf = IDF(inputCol="TF", outputCol="features")
indexer = StringIndexer(inputCol="oneTag", outputCol="label")

In [13]:
# estimators
lr = LogisticRegression(maxIter=10, regParam=0.0, elasticNetParam=0)

In [16]:
# pipeline
pipeline = Pipeline(stages=[regexTokenizer, cv, idf, indexer, lr])
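
To make the first two stages concrete, here is a rough plain-Python sketch of what splitting on `\W` and counting terms does to one made-up post body. `toy_tokenize` and the sample text are hypothetical, and the sketch assumes `RegexTokenizer`'s default behaviour of lowercasing the input and treating the pattern as a separator. Java and Python regex engines may also disagree on non-ASCII text.

```python
import re
from collections import Counter


def toy_tokenize(text):
    """Lowercase, split on non-word characters, and drop empty strings."""
    return [tok for tok in re.split(r"\W", text.lower()) if tok]


body = "<p>How do I parse JSON?</p>"  # hypothetical post body
tokens = toy_tokenize(body)
term_counts = Counter(tokens)  # raw term frequencies, as a count vector would hold

print(tokens)
print(term_counts["p"])
```

HTML tag names such as `p` survive as tokens here, because only the angle brackets and slash act as separators.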

### Step 3. Tune Model
Using the 90% training split, find the most accurate logistic regression model with 3-fold cross-validation over the following parameter grid:

- CountVectorizer vocabulary size: `[1000, 5000]`
- LogisticRegression regularization parameter: `[0.0, 0.1]`
- LogisticRegression maximum number of iterations: `[10]`
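
A parameter grid is the Cartesian product of the listed values, and k-fold cross-validation fits one model per combination per fold. The sketch below uses plain Python (the variable names are illustrative) to enumerate the combinations and count those fits. It does not count the final refit of the winning combination on the full training split.

```python
from itertools import product

vocab_sizes = [1000, 5000]
reg_params = [0.0, 0.1]
max_iters = [10]
num_folds = 3

# every (vocabSize, regParam, maxIter) combination in the grid
grid = list(product(vocab_sizes, reg_params, max_iters))
for combo in grid:
    print(combo)

fits_during_cv = len(grid) * num_folds
print(len(grid), fits_during_cv)  # 4 combinations, 12 fits
```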

In [None]:
# TODO: write your code for this step

In [18]:
# set up param grid to iterate over (maxIter stays at 10, as set on lr above)
paramGrid = ParamGridBuilder() \
    .addGrid(cv.vocabSize, [1000, 5000]) \
    .addGrid(lr.regParam, [0.0, 0.1]) \
    .build()

In [21]:
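# set up cross-validator to tune parameters and pick the best combination
# note: MulticlassClassificationEvaluator() scores with F1 by default;
# pass metricName="accuracy" to select on accuracy instead
crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=MulticlassClassificationEvaluator(),
                          numFolds=3)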

In [30]:
cvModel = crossval.fit(train)  # train model
results = cvModel.transform(test)  # apply model on test data

In [None]:
results.head()

### Step 4. Compute Accuracy of Best Model

In [31]:
# TODO: write your code for this step

In [36]:
cvModel.avgMetrics  # average cross-validation metric for each parameter combination

[0.30365027071399764,
 0.2324627282468778,
 0.3658579734838215,
 0.28637512190152187]
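
Each entry in `avgMetrics` is the evaluator's score averaged over the three folds for one parameter combination, listed in the same order as the param maps given to the `CrossValidator`. The snippet below copies the four values printed above into a plain list (the variable names are illustrative) and finds the position of the largest one, which is the combination cross-validation would prefer for a larger-is-better metric.

```python
avg_metrics = [
    0.30365027071399764,
    0.2324627282468778,
    0.3658579734838215,
    0.28637512190152187,
]

# index of the highest average score across the grid
best_index = max(range(len(avg_metrics)), key=avg_metrics.__getitem__)
print(best_index, avg_metrics[best_index])
```

In the notebook itself, iterating over `zip(paramGrid, cvModel.avgMetrics)` should show which settings each score belongs to.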

In [37]:
print(results.filter(results.label == results.prediction).count())  # check how many were predicted correctly

3892


In [38]:
results.count()  # total number of rows in the test set

9919

In [40]:
results.filter(results.label == results.prediction).count() / results.count()  # fraction predicted correctly (accuracy)

0.392378263937897
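
The figure above is simply the number of correct predictions divided by the number of test rows. As a quick check of the arithmetic, the sketch below recomputes it from the two counts printed earlier. The `accuracy` helper is a hypothetical convenience function, not part of the notebook.

```python
def accuracy(num_correct, num_total):
    """Fraction of rows whose prediction equals the label."""
    if num_total <= 0:
        raise ValueError("num_total must be positive")
    return num_correct / num_total


acc = accuracy(3892, 9919)  # counts taken from the cells above
print(round(acc, 4))  # 0.3924
```

An alternative that avoids counting by hand would be `MulticlassClassificationEvaluator(metricName="accuracy").evaluate(results)`, which should report the same fraction on the test predictions.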