# Project 2: What kind of wine is this?
## SGD route 1/5

Construct a system to sort wine tasting notes by wine variety: Pinot Noir, Cabernet Sauvignon, Chardonnay, Syrah, Riesling, Zinfandel, Merlot, or Sauvignon Blanc.

## Deliverables


1. Using the data available at s3://ling583/wine-train.parquet and s3://ling583/wine-test.parquet, construct a classifier that can predict wine variety labels on the basis of review texts. Try out different methods and see what works best. Evaluate your best model using the test data.

2. Find the words that your model is using to predict labels (either by looking at the model coefficients or by using a tool like LIME). What aspects of review texts is your model most sensitive to? Is there evidence of overfitting?

3. For Reuters texts, we found we could greatly increase the F1 score/accuracy by excluding items that the model was most uncertain about. How many test examples would we have to exclude to achieve better than 0.85 F1 for this task?

4. Another way to improve accuracy is to change the labels. Use a confusion matrix to examine the patterns of errors and propose a new labeling scheme. For example, if the model consistently labels “merlot” as “riesling” and vice versa, you might want to create a new label “merlot/riesling”. Is it possible to get better than 0.85 F1 using your classifier trained on a different set of labels?
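
Deliverables 3 and 4 both come down to rescoring predictions after a transformation: dropping the least-confident items, or collapsing confusable labels into one. The sketch below illustrates both on toy data using only the standard library. The helper names (`macro_f1`, `f1_after_excluding`, `merge_labels`) and the toy labels and confidence values are hypothetical and not part of the assignment; `macro_f1` is a plain re-implementation of macro-averaged F1 for illustration, not the scikit-learn function used later in the notebook.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-label F1 over every label seen in either list."""
    pairs = list(zip(y_true, y_pred))
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        scores.append(2 * tp / (2 * tp + fp + fn))
    return sum(scores) / len(scores)


def f1_after_excluding(y_true, y_pred, confidence, n_exclude):
    """Drop the n_exclude least-confident items, then score what remains."""
    order = sorted(range(len(y_true)), key=lambda i: confidence[i])
    keep = order[n_exclude:]
    return macro_f1([y_true[i] for i in keep], [y_pred[i] for i in keep])


def merge_labels(labels, groups):
    """Rename labels according to groups; unlisted labels pass through."""
    mapping = {old: new for new, olds in groups.items() for old in olds}
    return [mapping.get(label, label) for label in labels]


# Toy data (hypothetical): the two errors are merlot/riesling swaps,
# and both errors carry the lowest confidence scores.
y_true = ["merlot", "merlot", "riesling", "riesling", "syrah", "syrah"]
y_pred = ["riesling", "merlot", "merlot", "riesling", "syrah", "syrah"]
confidence = [0.1, 0.9, 0.2, 0.8, 0.7, 0.6]

full_f1 = macro_f1(y_true, y_pred)
excluded_f1 = f1_after_excluding(y_true, y_pred, confidence, n_exclude=2)

groups = {"merlot/riesling": ["merlot", "riesling"]}
merged_f1 = macro_f1(merge_labels(y_true, groups), merge_labels(y_pred, groups))

print(full_f1, excluded_f1, merged_f1)
```

For the real task, one plausible confidence score is the gap between the two largest values in each row of the fitted pipeline's `decision_function` output. That choice is an assumption; other uncertainty measures would work with the same loop over `n_exclude`. Note also that deliverable 4 asks for a classifier retrained on the merged labels, whereas this sketch only rescores existing predictions.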

In [2]:
import numpy as np
import pandas as pd
from cytoolz import *  # provides identity(), used below as the CountVectorizer analyzer
from tqdm.auto import tqdm

tqdm.pandas()
import multiprocessing as mp

In [3]:
train = pd.read_parquet("s3://ling583/wine-train.parquet", storage_options={"anon": True})
test = pd.read_parquet("s3://ling583/wine-test.parquet", storage_options={"anon": True})

In [4]:
import spacy

nlp = spacy.load("en_core_web_sm", 
                 exclude=["tagger", "parser", "ner", "lemmatizer", "attribute_ruler"],)

def tokenize(text):
    doc = nlp.tokenizer(text)
    return [t.norm_ for t in doc if not (t.is_space or t.is_punct or t.like_num)]

In [5]:
with mp.Pool() as p:
    train["tokens"] = pd.Series(p.imap(tokenize, tqdm(train["review_text"]), chunksize=100))
    test["tokens"] = pd.Series(p.imap(tokenize, tqdm(test["review_text"]), chunksize=100))

  0%|          | 0/130497 [00:00<?, ?it/s]

  0%|          | 0/32625 [00:00<?, ?it/s]

In [6]:
train["wine_variant"].value_counts()

Pinot Noir            38471
Cabernet Sauvignon    30234
Chardonnay            19443
Syrah                 13704
Riesling               9683
Zinfandel              8327
Merlot                 5522
Sauvignon Blanc        5113
Name: wine_variant, dtype: int64

---

## Baseline SGD classifier

In [7]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import classification_report, f1_score
from sklearn.pipeline import make_pipeline

In [8]:
baseline = make_pipeline(CountVectorizer(analyzer=identity), SGDClassifier())
baseline.fit(train["tokens"], train["wine_variant"])
base_predicted = baseline.predict(test["tokens"])
print(classification_report(test["wine_variant"], base_predicted))

                    precision    recall  f1-score   support

Cabernet Sauvignon       0.72      0.78      0.75      7558
        Chardonnay       0.82      0.85      0.83      4861
            Merlot       0.70      0.36      0.48      1381
        Pinot Noir       0.73      0.89      0.80      9618
          Riesling       0.83      0.77      0.80      2421
   Sauvignon Blanc       0.80      0.68      0.73      1278
             Syrah       0.77      0.52      0.62      3426
         Zinfandel       0.84      0.53      0.65      2082

          accuracy                           0.76     32625
         macro avg       0.78      0.67      0.71     32625
      weighted avg       0.76      0.76      0.75     32625



----

## Hyperparameter search

Find a good set of hyperparameters for a TF-IDF + SGDClassifier model using randomized search.

In [8]:
import mlflow
from dask_ml.model_selection import RandomizedSearchCV
from logger import log_search
from scipy.stats.distributions import loguniform, randint, uniform

In [1]:
from warnings import simplefilter
simplefilter(action="ignore", category=FutureWarning)

In [13]:
from dask.distributed import Client

client = Client("tcp://127.0.0.1:37943")
client

Client   Scheduler: tcp://127.0.0.1:37943   Dashboard: http://127.0.0.1:8787/status
Cluster  Workers: 4   Cores: 4   Memory: 16.62 GB


In [15]:
mlflow.set_experiment("Project_2_SGD")
sgd = make_pipeline(
    CountVectorizer(analyzer=identity), TfidfTransformer(), SGDClassifier()
)
# Skeleton classifier

INFO: 'Project_2_SGD' does not exist. Creating a new experiment


In [20]:
# Ranges tried in the first run (the search below narrows them):
#   countvectorizer__min_df    1, 20
#   countvectorizer__max_df    0.5, 0.5
#   tfidftransformer__use_idf  True, False
#   sgdclassifier__alpha       1e-6, 1e-2
    
#%%time

search = RandomizedSearchCV(
    sgd,
    {
        "countvectorizer__min_df": randint(5, 10),
        "countvectorizer__max_df": uniform(0.5, 0.5),
        "tfidftransformer__use_idf": [True],
        "sgdclassifier__alpha": loguniform(1e-6, 1e-4),
    },
    n_iter=50,
    scoring="f1_macro",
)
search.fit(train["tokens"], train["wine_variant"])
log_search(search)
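
The SciPy distributions in the search space are easy to misread: `randint(5, 10)` excludes its upper bound, `uniform(0.5, 0.5)` takes a location and a scale and so covers 0.5 to 1.0, and `loguniform(1e-6, 1e-4)` is uniform in the logarithm of the value. As a rough illustration, the standard-library sketch below draws 50 candidate settings with the same ranges. The function `sample_params` is hypothetical, and this is not how `RandomizedSearchCV` samples internally.

```python
import math
import random


def sample_params(rng):
    """Draw one candidate setting with the same ranges as the search above."""
    return {
        # randint(5, 10): integers 5..9, upper bound excluded
        "min_df": rng.randrange(5, 10),
        # uniform(loc=0.5, scale=0.5): floats between 0.5 and 1.0
        "max_df": rng.uniform(0.5, 1.0),
        # loguniform(1e-6, 1e-4): uniform in log space between the bounds
        "alpha": math.exp(rng.uniform(math.log(1e-6), math.log(1e-4))),
    }


rng = random.Random(0)  # fixed seed so the draws are repeatable
draws = [sample_params(rng) for _ in range(50)]

for params in draws[:3]:
    print(params)
```

Sampling `alpha` in log space spreads the candidates evenly across orders of magnitude, which is usually what is wanted for a regularization strength.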

---

## Compare optimized model to baseline

In [16]:
sgd = make_pipeline(
    CountVectorizer(analyzer=identity, min_df=5, max_df=0.7763),
    TfidfTransformer(use_idf=True),
    SGDClassifier(alpha=1.14e-5),
)
sgd.fit(train["tokens"], train["wine_variant"])
predicted = sgd.predict(test["tokens"])
print(classification_report(test["wine_variant"], predicted))

                    precision    recall  f1-score   support

Cabernet Sauvignon       0.71      0.80      0.75      7558
        Chardonnay       0.81      0.86      0.84      4861
            Merlot       0.80      0.34      0.47      1381
        Pinot Noir       0.76      0.87      0.82      9618
          Riesling       0.81      0.78      0.80      2421
   Sauvignon Blanc       0.82      0.67      0.74      1278
             Syrah       0.73      0.55      0.63      3426
         Zinfandel       0.81      0.53      0.64      2082

          accuracy                           0.76     32625
         macro avg       0.78      0.68      0.71     32625
      weighted avg       0.76      0.76      0.75     32625



In [17]:
base_f1 = f1_score(test["wine_variant"], base_predicted, average="macro")
sgd_f1 = f1_score(test["wine_variant"], predicted, average="macro")

In [18]:
print(f"Base F1 score: {base_f1}")
print(f"SGD F1 score:  {sgd_f1}")
print(f"Difference:    {sgd_f1 - base_f1}") 

Base F1 score: 0.7068352414730442
SGD F1 score:  0.7100334194346126
Difference:    0.0031981779615684047


In [19]:
(sgd_f1 - base_f1) / (1 - base_f1)
# Proportional error reduction: the fraction of the baseline's remaining error (1 - F1) that the tuned model removes.

0.010909148758664115
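
To make the arithmetic explicit, the short sketch below recomputes the error reduction from the two macro F1 scores printed above. The function name `error_reduction` is hypothetical.

```python
def error_reduction(base_score, new_score):
    """Share of the baseline's remaining error (1 - score) that the new model removes."""
    return (new_score - base_score) / (1 - base_score)


# Macro F1 values copied from the output above
base_f1 = 0.7068352414730442
sgd_f1 = 0.7100334194346126

reduction = error_reduction(base_f1, sgd_f1)
print(f"{reduction:.2%} of the baseline's error is removed")
```

In other words, the tuned model removes only about 1% of the baseline's error on macro F1, even though the paired tests that follow find the difference statistically significant.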

In [20]:
from scipy.stats import binom_test, wilcoxon

In [21]:
# predicted holds the tuned SGD predictions; base_predicted holds the baseline predictions
# test["wine_variant"] is the correct answer
# Each comparison is True where a prediction matches the label and False otherwise
diff = (predicted == test["wine_variant"]).astype(int) - (base_predicted == test["wine_variant"]).astype(int)
# If baseline and SGD are both right or both wrong, we get 0
# If baseline was wrong (0) and SGD was right (1), we get 1
# If baseline was right (1) and SGD was wrong (0), we get -1

print(f"SGD and baseline agreed {sum(diff == 0)} times")
print(f"SGD was right, and baseline was wrong {sum(diff == 1)} times")
print(f"Baseline was right, and SGD was wrong {sum(diff == -1)} times")

SGD and baseline agreed 31145 times
SGD was right, and baseline was wrong 830 times
Baseline was right, and SGD was wrong 650 times


In [22]:
# Consider only the examples where exactly one classifier was right. If the two classifiers were
# equally good, each such example would have a 50/50 chance of favoring either one. We run a
# one-sided binomial test to see whether the observed split is consistent with that assumption.

binom_test([sum(diff == 1), sum(diff == -1)], alternative="greater")

# The result, approximately 1.6e-06, is far below the conventional 0.05 significance level.
# Under a true 50/50 split, an outcome at least this favorable to SGD would be extremely unlikely.
# This indicates that the tuned SGD classifier really is better than the baseline.

1.5939857540412747e-06
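
For readers who want to see what the one-sided binomial test computes, here is a minimal standard-library sketch using the win counts reported above (830 and 650). The function `binom_p_greater` is hypothetical. It sums the exact upper tail of a Binomial(n, 0.5) distribution, which should closely agree with the p-value printed above. Note also that recent SciPy releases no longer provide `binom_test`; `scipy.stats.binomtest` is its replacement.

```python
from fractions import Fraction
from math import comb


def binom_p_greater(wins, losses):
    """Exact P(X >= wins) for X ~ Binomial(wins + losses, 0.5)."""
    n = wins + losses
    tail = sum(comb(n, k) for k in range(wins, n + 1))
    # Exact integer arithmetic avoids underflow; convert to float at the end
    return float(Fraction(tail, 2 ** n))


# Counts copied from the output above: SGD-only wins vs. baseline-only wins
p_value = binom_p_greater(830, 650)
print(p_value)
```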

In [23]:
# Wilcoxon signed-rank test on the same differences, as a cross-check on the binomial test above.
# Every nonzero difference is +1 or -1, so only the signs carry information here.
wilcoxon(diff, alternative="greater")

WilcoxonResult(statistic=614615.0, pvalue=1.4422506000125995e-06)

-----

## Save model

In [24]:
import cloudpickle

In [26]:
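# This version takes raw review text as input: the preprocessor is a no-op and tokenization
# happens inside the vectorizer, so the saved model can be applied directly to unprocessed strings.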
sgd = make_pipeline(
    CountVectorizer(preprocessor=identity, tokenizer=tokenize, min_df=5, max_df=0.7763),
    TfidfTransformer(use_idf=True),
    SGDClassifier(alpha=1.14e-5),
)
sgd.fit(train["review_text"], train["wine_variant"])
predicted = sgd.predict(test["review_text"])
print(classification_report(test["wine_variant"], predicted))

                    precision    recall  f1-score   support

Cabernet Sauvignon       0.69      0.83      0.75      7558
        Chardonnay       0.82      0.85      0.84      4861
            Merlot       0.82      0.33      0.48      1381
        Pinot Noir       0.77      0.87      0.82      9618
          Riesling       0.80      0.79      0.80      2421
   Sauvignon Blanc       0.84      0.66      0.74      1278
             Syrah       0.75      0.54      0.63      3426
         Zinfandel       0.86      0.51      0.64      2082

          accuracy                           0.76     32625
         macro avg       0.80      0.67      0.71     32625
      weighted avg       0.77      0.76      0.76     32625



In [27]:
# The built-in pickle module does not work with this pipeline, so we use cloudpickle.
# The with-block ensures the file is closed once the model has been written.
with open("sgd.model", "wb") as f:
    cloudpickle.dump(sgd, f)