# Table of Contents
## I. Setting Up Environment
1. Setting up Spark Session and Loading Data

## II. Preprocessing
2. Non-empty reviews

3. Selecting 4 styles of beer to classify

4. Cleaning numeric fields

## III. Exploratory Analysis and Visualizations
5. Summary statistics for each style of beer and for entire dataset

## IV. Modeling
6. Creating Pipelines and Cross Validators

7. Baseline Model for Entire Dataset (Unbalanced Classes)

    7.1 CV TF-IDF Parameters
    
8. Baseline Model for Sampled Dataset (Balanced Classes)

    8.1 CV TF-IDF Parameters
    
    8.2 Choose Best parameters for Baseline ML Models
    
10. Baseline for Machine Learning Models with Balanced data and TF-IDF Parameters

    10.1 Naive Bayes Base Model (Should be same as 8.1)
    
    10.2 Logistic Regression Base Model
    
    10.3 Random Forest Base Model
    
11. CV ML Models

    11.1 Best Naive Bayes Parameters
    
    11.2 Best Logistic Regression Parameters
    
    11.3 Best Random Forest Parameters

## V. Next Steps (might be able to finish)

12. Stemming or Lemmatization

13. N-Grams

## VI. Next Steps (probably won't be able to finish)
14. Deep Learning

15. Recommendation System

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------

# I. Setting Up Environment

## 1. Setting Up Spark Session and Loading Data

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml.feature import HashingTF, IDF, Tokenizer, StopWordsRemover, StringIndexer
from pyspark.ml import Pipeline
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.classification import NaiveBayes, LogisticRegression, RandomForestClassifier
import os
import re
import string
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

spark = SparkSession \
    .builder \
    .master("local") \
    .appName("Exploratory_Analysis") \
    .config("spark.executor.memory", '12g') \
    .config("spark.executor.cores", '5') \
    .config('spark.cores.max', '5') \
    .config('spark.driver.memory', '12g') \
    .getOrCreate()

sc = spark.sparkContext

In [2]:
beers = spark.read.format('csv'). \
    option("header", "true"). \
    option("inferSchema", "true"). \
    load("/home/aaron/BigData135/datasets/beers.csv")

In [3]:
reviews = spark.read.format('csv'). \
    option("header", "true"). \
    option("inferSchema", "true"). \
    load("/home/aaron/BigData135/datasets/reviews.csv")

In [45]:
beers.printSchema()

root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- brewery_id: string (nullable = true)
 |-- state: string (nullable = true)
 |-- country: string (nullable = true)
 |-- style: string (nullable = true)
 |-- availability: string (nullable = true)
 |-- abv: string (nullable = true)
 |-- notes: string (nullable = true)
 |-- retired: string (nullable = true)



In [46]:
reviews.printSchema()

root
 |-- beer_id: integer (nullable = true)
 |-- username: string (nullable = true)
 |-- date: timestamp (nullable = true)
 |-- text: string (nullable = true)
 |-- look: string (nullable = true)
 |-- smell: string (nullable = true)
 |-- taste: string (nullable = true)
 |-- feel: string (nullable = true)
 |-- overall: string (nullable = true)
 |-- score: string (nullable = true)



In [47]:
beers.show(5)

+------+--------------------+----------+-----+-------+--------------------+------------+----+--------------------+-------+
|    id|                name|brewery_id|state|country|               style|availability| abv|               notes|retired|
+------+--------------------+----------+-----+-------+--------------------+------------+----+--------------------+-------+
|202522|      Olde Cogitator|      2199|   CA|     US|English Oatmeal S...|    Rotating| 7.3|No notes at this ...|      f|
| 82352|Konrads Stout Rus...|     18604| null|     NO|Russian Imperial ...|    Rotating|10.4|No notes at this ...|      f|
|214879|      Scottish Right|     44306|   IN|     US|        Scottish Ale|  Year-round|   4|No notes at this ...|      t|
|320009|MegaMeow Imperial...|      4378|   WA|     US|American Imperial...|      Winter| 8.7|Every time this year|      f|
|246438|     Peaches-N-Cream|     44617|   PA|     US|  American Cream Ale|    Rotating| 5.1|No notes at this ...|      f|
+------+--------

In [48]:
beers.count()

358873

In [49]:
#How many beers in each style
beers.groupBy('style').count().sort('count', ascending = False).show(20)

+--------------------+-----+
|               style|count|
+--------------------+-----+
|        American IPA|44719|
|American Pale Ale...|22159|
|American Imperial...|18336|
|      Belgian Saison|18166|
|   American Wild Ale|12972|
|American Imperial...|11180|
|     American Porter|10168|
|American Amber / ...| 9748|
|      American Stout| 9103|
|Fruit and Field Beer| 7729|
| American Blonde Ale| 7089|
|  American Brown Ale| 7008|
|   German Hefeweizen| 6019|
|     Belgian Witbier| 5613|
|American Pale Whe...| 5266|
|     Berliner Weisse| 5036|
|      German Pilsner| 4748|
|    Belgian Pale Ale| 4523|
|Russian Imperial ...| 4426|
|English Sweet / M...| 4192|
+--------------------+-----+
only showing top 20 rows



In [50]:
reviews.show(5)

+-------+---------------+-------------------+--------------------+--------------------+--------------------+------+--------------------+-----------------+------------------+
|beer_id|       username|               date|                text|                look|               smell| taste|                feel|          overall|             score|
+-------+---------------+-------------------+--------------------+--------------------+--------------------+------+--------------------+-----------------+------------------+
| 271781|   bluejacket74|2017-03-17 00:00:00|   750 ml bottle,...|                   4|                   4|     4|                4.25|                4|              4.03|
| 125646|        _dirty_|2017-12-21 00:00:00|                    |                 4.5|                 4.5|   4.5|                 4.5|              4.5|               4.5|
| 125646|        CJDUBYA|2017-12-21 00:00:00|                    |                4.75|                4.75|  4.75|               

In [51]:
reviews.count()

9073128

# II. Preprocessing

## 2. Non-Empty Reviews

In [52]:
#Count Non-empty reviews
(reviews.filter(reviews['text'] != '\xa0\xa0')).count()

2987993

In [4]:
non_empty_reviews = reviews.filter(reviews['text'] != '\xa0\xa0')

In [5]:
non_empty_reviews = non_empty_reviews.withColumn('text', F.regexp_replace('text', "\\.|\xa0|!|,|:", ""))\
                                    .withColumn('text', F.trim(F.col('text')))
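
To see what the pattern in the cell above strips, the same substitution can be tried on a single string with Python's `re` module. This is only a rough stand-in: the helper name `clean_text` and the sample string are made up for illustration, and `str.strip()` removes all surrounding whitespace, whereas Spark's `F.trim` is documented as trimming spaces.

```python
import re

# Same alternation as the regexp_replace call above:
# period, non-breaking space, exclamation mark, comma, colon
PUNCT_PATTERN = re.compile(r"\.|\xa0|!|,|:")

def clean_text(text):
    """Remove the listed characters, then trim surrounding whitespace."""
    return PUNCT_PATTERN.sub("", text).strip()

# Hypothetical review text, padded with the two non-breaking spaces
# that mark the start of the raw reviews
sample = "\xa0\xa0750 ml bottle, 2016 vintage. Pours hazy gold!"
print(clean_text(sample))
```

Note that a review consisting only of `'\xa0\xa0'` would be reduced to an empty string by this step, which is why the empty reviews are filtered out first.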

In [53]:
non_empty_reviews.show(10)

+-------+---------------+-------------------+--------------------+--------------------+--------------------+------+--------------------+-----------------+------------------+
|beer_id|       username|               date|                text|                look|               smell| taste|                feel|          overall|             score|
+-------+---------------+-------------------+--------------------+--------------------+--------------------+------+--------------------+-----------------+------------------+
| 271781|   bluejacket74|2017-03-17 00:00:00|750 ml bottle 201...|                   4|                   4|     4|                4.25|                4|              4.03|
| 125646|GratefulBeerGuy|2017-12-20 00:00:00|" 0% 16 oz can Fu...| bloomin' like a ...| totally unfilter...| thick| all-white clumps...| mellon and mango| grainy earthiness|
| 125646|       LukeGude|2017-12-20 00:00:00|Classic TH NEIPA ...|                4.25|                 4.5|  4.25|               

In [62]:
#Counts how many beers have non-empty reviews
non_empty_reviews.agg(F.countDistinct("beer_id")).show()

+-----------------------+
|count(DISTINCT beer_id)|
+-----------------------+
|                 210311|
+-----------------------+



In [6]:
beerStyles = beers.select("id","style")

In [64]:
beerStyles.show(5)

+------+--------------------+
|    id|               style|
+------+--------------------+
|202522|English Oatmeal S...|
| 82352|Russian Imperial ...|
|214879|        Scottish Ale|
|320009|American Imperial...|
|246438|  American Cream Ale|
+------+--------------------+
only showing top 5 rows



In [7]:
beerStyles = beerStyles.withColumnRenamed('id', 'beer_id')

In [8]:
mainDF = non_empty_reviews.join(beerStyles, "beer_id")

In [67]:
mainDF.select('beer_id').distinct().count()

210294

In [68]:
#Counts how many reviews for each style of beer
mainDF.groupBy('style').count().sort('count', ascending = False).show(truncate = False)

+------------------------+------+
|style                   |count |
+------------------------+------+
|American IPA            |301774|
|American Imperial IPA   |212697|
|American Imperial Stout |150160|
|American Pale Ale (APA) |126489|
|Belgian Saison          |91000 |
|Russian Imperial Stout  |86117 |
|American Porter         |71189 |
|American Wild Ale       |63393 |
|American Amber / Red Ale|62818 |
|Fruit and Field Beer    |58342 |
|Belgian Strong Dark Ale |53097 |
|Belgian Witbier         |46545 |
|Belgian Strong Pale Ale |45732 |
|Belgian Tripel          |45686 |
|American Brown Ale      |44774 |
|American Strong Ale     |43575 |
|German Hefeweizen       |42930 |
|American Stout          |41879 |
|American Barleywine     |40873 |
|American Adjunct Lager  |39404 |
+------------------------+------+
only showing top 20 rows



In [116]:
#Counts how many unique beers of each style are reviewed
mainDF.select('beer_id', 'style').distinct().groupBy('style').count().sort('count', ascending = False).show(truncate = False)

+------------------------+-----+
|style                   |count|
+------------------------+-----+
|American IPA            |24380|
|American Pale Ale (APA) |12216|
|American Imperial IPA   |11517|
|Belgian Saison          |9744 |
|American Wild Ale       |7390 |
|American Imperial Stout |7016 |
|American Porter         |5889 |
|American Amber / Red Ale|5573 |
|American Stout          |4782 |
|Fruit and Field Beer    |4471 |
|American Brown Ale      |3682 |
|German Hefeweizen       |3395 |
|American Blonde Ale     |3329 |
|Belgian Witbier         |3036 |
|German Pilsner          |3024 |
|Russian Imperial Stout  |2946 |
|American Pale Wheat Ale |2793 |
|Belgian Pale Ale        |2630 |
|Berliner Weisse         |2586 |
|English Bitter          |2545 |
+------------------------+-----+
only showing top 20 rows



As we can see, American Imperial Stout has the 3rd most reviews but only the 6th most unique beers, which means its reviews are concentrated in relatively few beers. Belgian Saison, by contrast, has the 5th most reviews and the 4th most unique reviewed beers, so we will choose it as the fourth style for classification.
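
The concentration argument can be checked with a little arithmetic: dividing the review count by the number of unique reviewed beers gives the average number of reviews per beer for each style. The sketch below simply copies the counts from the two tables above; the variable names are ours.

```python
# (reviews, unique reviewed beers) copied from the two tables above
style_counts = {
    "American IPA": (301774, 24380),
    "American Imperial IPA": (212697, 11517),
    "American Imperial Stout": (150160, 7016),
    "American Pale Ale (APA)": (126489, 12216),
    "Belgian Saison": (91000, 9744),
}

# Average number of reviews per unique beer for each style
reviews_per_beer = {
    style: reviews / beers for style, (reviews, beers) in style_counts.items()
}

for style, ratio in sorted(reviews_per_beer.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{style:<26}{ratio:6.1f}")
```

A high ratio means a style's reviews come from comparatively few beers, so a classifier could end up learning individual beers rather than the style.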

## 3. Selecting Targets, Top 4 most reviewed styles (with most unique beers)

In [9]:
styleTargets = ['American IPA', 'American Pale Ale (APA)', 'American Imperial IPA', 'Belgian Saison']

In [118]:
mainDF.filter(mainDF['style'].isin(styleTargets)).count()

731960

In [10]:
mainStyles = mainDF.filter(mainDF['style'].isin(styleTargets))

In [140]:
mainStyles.groupBy('style').count().sort('count', ascending = False).show(truncate = False)

+-----------------------+------+
|style                  |count |
+-----------------------+------+
|American IPA           |301774|
|American Imperial IPA  |212697|
|American Pale Ale (APA)|126489|
|Belgian Saison         |91000 |
+-----------------------+------+



In [141]:
mainStyles.select('beer_id', 'style').distinct().groupBy('style').count().sort('count', ascending = False).show(truncate = False)

+-----------------------+-----+
|style                  |count|
+-----------------------+-----+
|American IPA           |24380|
|American Pale Ale (APA)|12216|
|American Imperial IPA  |11517|
|Belgian Saison         |9744 |
+-----------------------+-----+



## 4. Clean Numeric Features

In [120]:
mainStyles.printSchema()

root
 |-- beer_id: integer (nullable = true)
 |-- username: string (nullable = true)
 |-- date: timestamp (nullable = true)
 |-- text: string (nullable = true)
 |-- look: string (nullable = true)
 |-- smell: string (nullable = true)
 |-- taste: string (nullable = true)
 |-- feel: string (nullable = true)
 |-- overall: string (nullable = true)
 |-- score: string (nullable = true)
 |-- style: string (nullable = true)



In [11]:
#Keep only rows whose rating columns are numeric (casting a non-numeric string yields null)
model_df = mainStyles.filter(F.col("look").cast("int").isNotNull())\
            .filter(F.col("smell").cast("int").isNotNull())\
            .filter(F.col("taste").cast("int").isNotNull())\
            .filter(F.col("feel").cast("int").isNotNull())\
            .filter(F.col("overall").cast("int").isNotNull())\
            .filter(F.col("score").cast("int").isNotNull())
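
The filter above relies on a cast producing null whenever a rating column holds text instead of a number, as in the misaligned rows seen in the `non_empty_reviews` output. A plain-Python sketch of the same idea follows; `to_float_or_none` and the two sample rows are illustrative only, and Python's `float()` does not accept exactly the same strings as a Spark cast.

```python
RATING_COLS = ["look", "smell", "taste", "feel", "overall", "score"]

def to_float_or_none(value):
    """Return the value as a float, or None if it is missing or not numeric."""
    if value is None:
        return None
    try:
        return float(value.strip())
    except ValueError:
        return None

# One well-formed row and one row where review text spilled into the rating columns
rows = [
    {"look": "4", "smell": "4", "taste": "4", "feel": "4.25", "overall": "4", "score": "4.03"},
    {"look": " bloomin' like a ...", "smell": " totally unfilter...", "taste": " thick",
     "feel": " all-white clumps...", "overall": " mellon and mango", "score": " grainy earthiness"},
]

# Keep a row only if every rating column parses as a number
clean_rows = [
    row for row in rows
    if all(to_float_or_none(row[col]) is not None for col in RATING_COLS)
]
print(len(clean_rows), "of", len(rows), "rows kept")
```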

In [12]:
#Cast numerical columns to float now that text and NA rows are removed
model_df = model_df.withColumn('look', model_df['look'].cast("float"))\
        .withColumn('smell', model_df['smell'].cast("float"))\
        .withColumn('taste', model_df['taste'].cast("float"))\
        .withColumn('feel', model_df['feel'].cast("float"))\
        .withColumn('overall', model_df['overall'].cast("float"))\
        .withColumn('score', model_df['score'].cast("float"))

In [13]:
model_df = model_df.drop("username", "date")

In [124]:
# Last check for NA's
model_df.select([F.count(F.when(F.isnan(c) | F.col(c).isNull(), c)).alias(c) for c in model_df.columns]).show()

+-------+----+----+-----+-----+----+-------+-----+-----+
|beer_id|text|look|smell|taste|feel|overall|score|style|
+-------+----+----+-----+-----+----+-------+-----+-----+
|      0|   0|   0|    0|    0|   0|      0|    0|    0|
+-------+----+----+-----+-----+----+-------+-----+-----+



In [125]:
model_df.printSchema()

root
 |-- beer_id: integer (nullable = true)
 |-- text: string (nullable = true)
 |-- look: float (nullable = true)
 |-- smell: float (nullable = true)
 |-- taste: float (nullable = true)
 |-- feel: float (nullable = true)
 |-- overall: float (nullable = true)
 |-- score: float (nullable = true)
 |-- style: string (nullable = true)



# III. Exploratory Analysis and Visualization

## 5. Comparison of Means for Each Style and Summary Statistics for Entire Data

In [126]:
model_df.groupBy('style')\
        .agg(F.mean('look'), F.mean('smell'), F.mean('taste'), F.mean('feel'), F.mean('overall'),
            F.mean('score')).show(truncate = True)

+--------------------+------------------+------------------+------------------+------------------+------------------+------------------+
|               style|         avg(look)|        avg(smell)|        avg(taste)|         avg(feel)|      avg(overall)|        avg(score)|
+--------------------+------------------+------------------+------------------+------------------+------------------+------------------+
|        American IPA| 3.986425988960963|3.9190130602160207|3.9318579356773973|3.9012531309383935|3.9446924219012685| 3.932736264468442|
|American Imperial...| 4.107345298701574|4.0981487760042805|4.1012994856092435| 4.058349711549159|4.0595800643120885| 4.089419074101148|
|American Pale Ale...|3.8744497214357474|3.7821011051237554|3.8132181021097815|3.7911087770572656|3.8725111882363685|3.8200986432859327|
|      Belgian Saison|3.9922546618708705|3.9455913114461367|3.9475109053340476|3.9248201576490396| 3.955256243463177|3.9501903062389685|
+--------------------+------------------+

In [127]:
model_df.describe().show()

+-------+-----------------+--------------------+------------------+------------------+------------------+------------------+------------------+-----------------+--------------+
|summary|          beer_id|                text|              look|             smell|             taste|              feel|           overall|            score|         style|
+-------+-----------------+--------------------+------------------+------------------+------------------+------------------+------------------+-----------------+--------------+
|  count|           641356|              641356|            641356|            641356|            641356|            641356|            641356|           641356|        641356|
|   mean|86016.15441190229|            145855.5| 4.003611878582253|3.9516130666899505|3.9633885080984665| 3.931568037096402|3.9674755517996245|3.961756469417888|          null|
| stddev|86104.03033777044|   995375.9829575884|0.4889811619131977|0.5641361351138886|0.5939057486341173|0.54288064

# IV. Modeling

## 6. Setting Up Pipelines 

Note: the HashingTF and IDF stages below use the best parameters found by the TF-IDF cross-validation in parts 7 and 8 (numFeatures = 100000, minDocFreq = 100).

In [104]:
tokenizer = Tokenizer(inputCol = "text", outputCol = "words")
stopRem = StopWordsRemover(inputCol = 'words', outputCol = 'filtered')
hashingTF = HashingTF(inputCol = "filtered", outputCol = "rawFeatures", numFeatures = 100000)
idf = IDF(inputCol = "rawFeatures", outputCol = "features", minDocFreq = 100)
stringIdx = StringIndexer(inputCol = 'style', outputCol = 'label')
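
For intuition about what `numFeatures` and `minDocFreq` control, here is a small pure-Python sketch of the hashing trick followed by IDF weighting. It is not Spark's implementation: `zlib.crc32` stands in for Spark's hash function, the toy documents are invented, and the helper names are ours. The IDF formula `log((m + 1) / (df + 1))`, with a weight of 0 for terms appearing in fewer than `minDocFreq` documents, follows the Spark MLlib documentation.

```python
import math
import zlib
from collections import Counter

def hashed_tf(tokens, num_features):
    """Count tokens per bucket; crc32 stands in for Spark's hash function."""
    counts = Counter()
    for token in tokens:
        counts[zlib.crc32(token.encode("utf-8")) % num_features] += 1
    return counts

def idf_weights(doc_tfs, min_doc_freq):
    """log((m + 1) / (df + 1)) per bucket, or 0.0 if df is below min_doc_freq."""
    m = len(doc_tfs)
    doc_freq = Counter()
    for tf in doc_tfs:
        doc_freq.update(tf.keys())  # each bucket counts once per document
    return {
        bucket: math.log((m + 1) / (df + 1)) if df >= min_doc_freq else 0.0
        for bucket, df in doc_freq.items()
    }

# Toy "reviews", already tokenized and stop-word filtered
docs = [
    ["hoppy", "citrus", "pine"],
    ["hoppy", "funky", "dry"],
    ["hoppy", "citrus", "bitter"],
]
doc_tfs = [hashed_tf(doc, num_features=1000) for doc in docs]
idf = idf_weights(doc_tfs, min_doc_freq=2)

# TF-IDF vector per document, stored sparsely as {bucket: weight}
tfidf = [{bucket: count * idf[bucket] for bucket, count in tf.items()} for tf in doc_tfs]
print(tfidf[0])
```

A smaller `num_features` makes unrelated words more likely to share a bucket, while a larger `min_doc_freq` zeroes out more rare terms; both effects are what the TF-IDF grid search trades off.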

In [105]:
evaluator = MulticlassClassificationEvaluator(predictionCol = "prediction")
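
The evaluator above is created without a `metricName`, so it should fall back to Spark's default, `f1`, which is an F-measure averaged over classes and weighted by how often each class occurs in the true labels. The scores reported later in this notebook are therefore weighted F1 values rather than plain accuracy. A plain-Python sketch of that metric, with a hypothetical helper name and toy labels, looks like this:

```python
from collections import Counter

def weighted_f1(labels, predictions):
    """Per-class F1, averaged with weights equal to each class's share of the true labels."""
    support = Counter(labels)
    total = len(labels)
    score = 0.0
    for cls, count in support.items():
        pairs = list(zip(labels, predictions))
        tp = sum(1 for y, p in pairs if y == cls and p == cls)
        fp = sum(1 for y, p in pairs if y != cls and p == cls)
        fn = sum(1 for y, p in pairs if y == cls and p != cls)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        score += f1 * count / total
    return score

# Toy example: four reviews, two classes, one mistake
labels = [0, 0, 1, 1]
predictions = [0, 0, 1, 0]
print(weighted_f1(labels, predictions))
```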

In [106]:
nb = NaiveBayes()
pipelineNB = Pipeline(stages=[tokenizer, stopRem, hashingTF, idf, stringIdx, nb])

In [82]:
lr = LogisticRegression(featuresCol = 'features', labelCol = 'label', maxIter=10, regParam=0.01)
pipelineLogReg = Pipeline(stages=[tokenizer, stopRem, hashingTF, idf, stringIdx, lr])

In [83]:
rf = RandomForestClassifier(labelCol = "label", featuresCol = "features", numTrees = 20, maxDepth = 10, maxBins = 32)
pipelineRF = Pipeline(stages = [tokenizer, stopRem, hashingTF, idf, stringIdx, rf])

In [84]:
paramGridTFIDF = ParamGridBuilder()\
    .addGrid(hashingTF.numFeatures, [1000, 10000, 50000, 100000])\
    .addGrid(idf.minDocFreq, [100, 1000, 10000])\
    .build()

In [85]:
paramGridNB = ParamGridBuilder()\
    .addGrid(nb.smoothing, [0.001, 0.01, 0.1, 0.5, 1, 1.5, 1.75])\
    .build()

In [None]:
#Candidate values are not chosen yet; fill in both lists before fitting cvLogReg
paramGridLogReg = ParamGridBuilder()\
    .addGrid(lr.maxIter, [])\
    .addGrid(lr.regParam, [])\
    .build()

In [None]:
#Candidate values are not chosen yet; fill in both lists before fitting cvRF
paramGridRF = ParamGridBuilder()\
    .addGrid(rf.numTrees, [])\
    .addGrid(rf.maxDepth, [])\
    .build()

In [86]:
cvTFIDF = CrossValidator(estimator = pipelineNB,
                              estimatorParamMaps = paramGridTFIDF,
                              evaluator = MulticlassClassificationEvaluator(predictionCol = "prediction"),
                              numFolds = 5)

In [87]:
cvNB = CrossValidator(estimator = pipelineNB,
                         estimatorParamMaps = paramGridNB,
                         evaluator = MulticlassClassificationEvaluator(predictionCol = "prediction"),
                         numFolds = 5)

In [None]:
cvLogReg = CrossValidator(estimator = pipelineLogReg,
                         estimatorParamMaps = paramGridLogReg,
                         evaluator = MulticlassClassificationEvaluator(predictionCol = "prediction"),
                         numFolds = 5)

In [None]:
cvRF = CrossValidator(estimator = pipelineRF,
                         estimatorParamMaps = paramGridRF,
                         evaluator = MulticlassClassificationEvaluator(predictionCol = "prediction"),
                         numFolds = 5)
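
Before running these cross validators on several hundred thousand reviews, it helps to estimate how many pipeline fits each one triggers. The sketch below assumes the usual k-fold behaviour, where every parameter combination is fitted once per fold and the best combination is then refitted once on the full training set; `count_fits` is our own helper name.

```python
from itertools import product

def count_fits(param_values, num_folds):
    """Number of pipeline fits: every grid point on every fold, plus one final refit."""
    grid = list(product(*param_values))
    return len(grid) * num_folds + 1

# Grids copied from paramGridTFIDF and paramGridNB above
tfidf_grid = [[1000, 10000, 50000, 100000], [100, 1000, 10000]]
nb_grid = [[0.001, 0.01, 0.1, 0.5, 1, 1.5, 1.75]]

tfidf_fits = count_fits(tfidf_grid, num_folds=5)
nb_fits = count_fits(nb_grid, num_folds=5)
print("TF-IDF grid fits:", tfidf_fits)
print("Naive Bayes grid fits:", nb_fits)
```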

## 7. Baseline Model for Entire Dataset (Unbalanced Classes)
- Unbalanced
- No subset/sample of data
- No stemming or lemmatization
- No N-Grams
- No TF-IDF parameter tuning
- Use default Naive Bayes

In [21]:
data1 = model_df.select('text', 'style')

In [139]:
data1.count()

641356

In [142]:
data1.groupBy('style').count().sort('count', ascending = False).show()

+--------------------+------+
|               style| count|
+--------------------+------+
|        American IPA|264697|
|American Imperial...|188767|
|American Pale Ale...|109490|
|      Belgian Saison| 78402|
+--------------------+------+
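
To make the imbalance concrete, the class shares can be computed directly from the counts in the table above. The sketch only restates those counts, using the full style names from the earlier tables.

```python
# Review counts per style, copied from the table above
style_counts = {
    "American IPA": 264697,
    "American Imperial IPA": 188767,
    "American Pale Ale (APA)": 109490,
    "Belgian Saison": 78402,
}

total = sum(style_counts.values())
shares = {style: count / total for style, count in style_counts.items()}

for style, share in shares.items():
    print(f"{style:<26}{share:6.1%}")
```

The largest class is more than three times the size of the smallest, which is the motivation for also fitting the models on a balanced sample.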



In [22]:
trainDat, testDat = data1.randomSplit([0.8, 0.2], seed = 69)

In [23]:
tfidfModBase = pipelineNB.fit(trainDat)

In [24]:
predTfidfBase = tfidfModBase.transform(testDat)

In [26]:
evaluator.evaluate(predTfidfBase)

0.6332633327575419

In [29]:
cvTfidfBase = cvTFIDF.fit(trainDat)

In [30]:
predCvTfidfBase = cvTfidfBase.transform(testDat)

In [31]:
evaluator.evaluate(predCvTfidfBase)

0.6849851845318615

In [39]:
cvTfidfBase.avgMetrics

[0.6488106384442446,
 0.6488102554991351,
 0.6502629046568504,
 0.6689535237033796,
 0.6736880706207696,
 0.6659764256897451,
 0.6777045229293912,
 0.6815448256042428,
 0.6685998753329425,
 0.6814338977629593,
 0.6852826632759372,
 0.6696686955531148]

In [40]:
cvTfidfBase.extractParamMap()

{Param(parent='CrossValidatorModel_54ecc3b5de0e', name='seed', doc='random seed.'): 2507911880642868994,
 Param(parent='CrossValidatorModel_54ecc3b5de0e', name='estimator', doc='estimator to be cross-validated'): Pipeline_d24f79d6757b,
 Param(parent='CrossValidatorModel_54ecc3b5de0e', name='estimatorParamMaps', doc='estimator param maps'): [{Param(parent='HashingTF_d6479245810a', name='numFeatures', doc='number of features.'): 1000,
   Param(parent='IDF_c596d013591a', name='minDocFreq', doc='minimum number of documents in which a term should appear for filtering'): 100},
  {Param(parent='HashingTF_d6479245810a', name='numFeatures', doc='number of features.'): 1000,
   Param(parent='IDF_c596d013591a', name='minDocFreq', doc='minimum number of documents in which a term should appear for filtering'): 1000},
  {Param(parent='HashingTF_d6479245810a', name='numFeatures', doc='number of features.'): 1000,
   Param(parent='IDF_c596d013591a', name='minDocFreq', doc='minimum number of documents 

### Best TF-IDF Parameters for Unbalanced Dataset
1. numFeatures = 100000, minDocFreq = 1000 (mean CV score 0.6853)
2. numFeatures = 50000, minDocFreq = 1000 (0.6815), narrowly ahead of numFeatures = 100000, minDocFreq = 100 (0.6814)
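
Reading the best combination off `avgMetrics` by eye is error-prone, so the scores can be paired with their grid points and sorted. This sketch assumes the scores are listed in the same order as the parameter maps shown by `extractParamMap()`, that is, with `numFeatures` varying slowest and `minDocFreq` fastest; the metric values are copied from the output above.

```python
from itertools import product

num_features = [1000, 10000, 50000, 100000]
min_doc_freq = [100, 1000, 10000]

# avgMetrics for the unbalanced dataset, copied from the output above
avg_metrics = [
    0.6488106384442446, 0.6488102554991351, 0.6502629046568504,
    0.6689535237033796, 0.6736880706207696, 0.6659764256897451,
    0.6777045229293912, 0.6815448256042428, 0.6685998753329425,
    0.6814338977629593, 0.6852826632759372, 0.6696686955531148,
]

# Pair each (numFeatures, minDocFreq) combination with its mean CV score
grid = list(product(num_features, min_doc_freq))
ranked = sorted(zip(grid, avg_metrics), key=lambda pair: pair[1], reverse=True)

for (n_feat, min_df), score in ranked[:3]:
    print(f"numFeatures={n_feat:<7} minDocFreq={min_df:<6} score={score:.4f}")
```

The same pairing can be applied to the `avgMetrics` of any other cross validator in this notebook by swapping in its grid values and scores.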

## 8. Baseline Model for Sampled Dataset (Balanced Classes)

In [41]:
dataSampled = data1.sampleBy("style", fractions={"American IPA": 75000/264697,
                                                "American Imperial IPA": 75000/188767,
                                                "American Pale Ale (APA)": 75000/109490,
                                                "Belgian Saison": 75000/78402}, seed = 69)

In [42]:
dataSampled.count()

299780

In [43]:
dataSampled.groupBy("style").count().show()

+--------------------+-----+
|               style|count|
+--------------------+-----+
|        American IPA|74787|
|American Imperial...|74850|
|American Pale Ale...|75127|
|      Belgian Saison|75016|
+--------------------+-----+



In [44]:
TrainDatSample, HoldoutDatSample = dataSampled.randomSplit([0.8, 0.2], seed = 420)

In [45]:
baseTFidfModSamp = pipelineNB.fit(TrainDatSample)

In [46]:
predBaseTfidfModSamp = baseTFidfModSamp.transform(HoldoutDatSample)

In [47]:
evaluator.evaluate(predBaseTfidfModSamp)

0.6541101716668796

In [48]:
cvTfidfBaseSamp = cvTFIDF.fit(TrainDatSample)

In [49]:
predCvTfidfBaseSamp = cvTfidfBaseSamp.transform(HoldoutDatSample)

In [50]:
evaluator.evaluate(predCvTfidfBaseSamp)

0.7071848554410634

In [51]:
cvTfidfBaseSamp.avgMetrics

[0.6743677921406365,
 0.6744102723533023,
 0.6551742432845589,
 0.7032835655826926,
 0.7077131500897303,
 0.6599031564311474,
 0.7095983630103647,
 0.7082906562946909,
 0.6603426427815116,
 0.7107182156399433,
 0.7085748914274623,
 0.660564046513989]

In [52]:
cvTfidfBaseSamp.extractParamMap()

{Param(parent='CrossValidatorModel_54ecc3b5de0e', name='seed', doc='random seed.'): 2507911880642868994,
 Param(parent='CrossValidatorModel_54ecc3b5de0e', name='estimator', doc='estimator to be cross-validated'): Pipeline_d24f79d6757b,
 Param(parent='CrossValidatorModel_54ecc3b5de0e', name='estimatorParamMaps', doc='estimator param maps'): [{Param(parent='HashingTF_d6479245810a', name='numFeatures', doc='number of features.'): 1000,
   Param(parent='IDF_c596d013591a', name='minDocFreq', doc='minimum number of documents in which a term should appear for filtering'): 100},
  {Param(parent='HashingTF_d6479245810a', name='numFeatures', doc='number of features.'): 1000,
   Param(parent='IDF_c596d013591a', name='minDocFreq', doc='minimum number of documents in which a term should appear for filtering'): 1000},
  {Param(parent='HashingTF_d6479245810a', name='numFeatures', doc='number of features.'): 1000,
   Param(parent='IDF_c596d013591a', name='minDocFreq', doc='minimum number of documents 

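### Best TF-IDF Parameters for Balanced (Sampled) Dataset
1. numFeatures = 100000, minDocFreq = 100 (mean CV score 0.7107)
2. numFeatures = 50000, minDocFreq = 100 (0.7096), followed by numFeatures = 100000, minDocFreq = 1000 (0.7086)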

## 10. Baseline for Machine Learning Models with Balanced Data and TF-IDF Parameters

In [107]:
nbModSamp = pipelineNB.fit(TrainDatSample)

In [108]:
predNbModSamp = nbModSamp.transform(HoldoutDatSample)

In [109]:
evaluator.evaluate(predNbModSamp)

0.7071848554410634

In [23]:
logModSamp = pipelineLogReg.fit(TrainDatSample)

In [24]:
predLogModSamp = logModSamp.transform(HoldoutDatSample)

In [25]:
evaluator.evaluate(predLogModSamp)

0.7357661508465958

In [24]:
rfModSamp = pipelineRF.fit(TrainDatSample)

In [25]:
predRfModSamp = rfModSamp.transform(HoldoutDatSample)

In [26]:
evaluator.evaluate(predRfModSamp)

0.567530442615334

## 11. Cross-Validating ML Models

In [88]:
nbCvMod = cvNB.fit(TrainDatSample)

In [89]:
predNbCvMod = nbCvMod.transform(HoldoutDatSample)

In [90]:
evaluator.evaluate(predNbCvMod)

0.7072347800265846

In [91]:
nbCvMod.avgMetrics

[0.7109425488303297,
 0.7109481126714816,
 0.7109632899735919,
 0.7109107918503748,
 0.7107182156399433,
 0.7105412947559459,
 0.7104673478633691]
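
Pairing these scores with the smoothing values from `paramGridNB` shows which value did best and how much the choice matters. The sketch assumes the metrics are in the same order as the grid values, which matches the parameter maps printed by `extractParamMap()`.

```python
# Smoothing candidates from paramGridNB and the avgMetrics output above
smoothing_values = [0.001, 0.01, 0.1, 0.5, 1, 1.5, 1.75]
avg_metrics = [
    0.7109425488303297, 0.7109481126714816, 0.7109632899735919,
    0.7109107918503748, 0.7107182156399433, 0.7105412947559459,
    0.7104673478633691,
]

# Best smoothing value and the gap between the best and worst scores
best_smoothing, best_score = max(zip(smoothing_values, avg_metrics), key=lambda pair: pair[1])
spread = max(avg_metrics) - min(avg_metrics)

print("best smoothing:", best_smoothing)
print("best mean CV score:", round(best_score, 4))
print("spread across grid:", round(spread, 4))
```

The gap between the best and worst smoothing values is well under 0.001, so on this data Naive Bayes is largely insensitive to the smoothing parameter.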

In [92]:
nbCvMod.extractParamMap()

{Param(parent='CrossValidatorModel_1980b0c176e3', name='seed', doc='random seed.'): 2507911880642868994,
 Param(parent='CrossValidatorModel_1980b0c176e3', name='estimator', doc='estimator to be cross-validated'): Pipeline_94cfe90cdd3a,
 Param(parent='CrossValidatorModel_1980b0c176e3', name='estimatorParamMaps', doc='estimator param maps'): [{Param(parent='NaiveBayes_d58a45717363', name='smoothing', doc='The smoothing parameter, should be >= 0, default is 1.0'): 0.001},
  {Param(parent='NaiveBayes_d58a45717363', name='smoothing', doc='The smoothing parameter, should be >= 0, default is 1.0'): 0.01},
  {Param(parent='NaiveBayes_d58a45717363', name='smoothing', doc='The smoothing parameter, should be >= 0, default is 1.0'): 0.1},
  {Param(parent='NaiveBayes_d58a45717363', name='smoothing', doc='The smoothing parameter, should be >= 0, default is 1.0'): 0.5},
  {Param(parent='NaiveBayes_d58a45717363', name='smoothing', doc='The smoothing parameter, should be >= 0, default is 1.0'): 1.0},
 