# Project 2: What kind of wine is this?
## SGD route 5/5
In this notebook I try three different ways of combining the wine variants. I first ran each step for all three combinations before moving on to the next step, but that made it very difficult to keep track of what was going on. The notebook is now structured so that each combination goes through every step before the next combination starts.

**TL;DR:** Combining Merlot and Riesling gave the largest relative error reduction over its baseline after tuning (about 3.8%), but combining Merlot with Cabernet Sauvignon gave the highest macro-averaged F1 score (0.753), which is still fairly low.

## Imports

In [29]:
import multiprocessing as mp

import numpy as np
import pandas as pd
from cytoolz import identity  # used as the CountVectorizer analyzer for pre-tokenized text
from tqdm.auto import tqdm

tqdm.pandas()

In [32]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import classification_report, f1_score
from sklearn.pipeline import make_pipeline

In [50]:
import mlflow
from dask_ml.model_selection import RandomizedSearchCV
from logger import log_search
from scipy.stats.distributions import loguniform, randint, uniform
from warnings import simplefilter
simplefilter(action="ignore", category=FutureWarning)

In [51]:
from dask.distributed import Client

client = Client("tcp://127.0.0.1:38747")
client

Client: Scheduler tcp://127.0.0.1:38747, Dashboard http://127.0.0.1:8787/status
Cluster: Workers 4, Cores 4, Memory 16.62 GB


---

## Load data

In [30]:
train = pd.read_parquet("s3://ling583/wine-train.parquet", storage_options={"anon": True})
test = pd.read_parquet("s3://ling583/wine-test.parquet", storage_options={"anon": True})

In [31]:
import spacy

nlp = spacy.load("en_core_web_sm", 
                 exclude=["tagger", "parser", "ner", "lemmatizer", "attribute_ruler"],)

def tokenize(text):
    """Return normalized tokens, dropping whitespace, punctuation, and number-like tokens."""
    doc = nlp.tokenizer(text)
    return [t.norm_ for t in doc if not (t.is_space or t.is_punct or t.like_num)]

---

# Combine variants
Three combinations are tested below.

**Merlot/Cabernet** - Represents the highest error point in the confusion matrix.

**Merlot/Riesling** - Suggested in the assignment prompt, and I'm curious.

**Syrah/Cabernet** - Represents the second-highest error point in the confusion matrix.
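
The confusion matrix behind these choices is not reproduced in this notebook. As a minimal sketch of how confused pairs could be ranked, the snippet below counts misclassifications per unordered label pair. The labels are made up for illustration, `most_confused_pairs` is a hypothetical helper, and summing both error directions is an assumption; ranking single cells of the matrix would skip the sorting step.

```python
from collections import Counter

def most_confused_pairs(y_true, y_pred, top=3):
    """Rank unordered label pairs by how often one is mistaken for the other."""
    errors = Counter()
    for truth, guess in zip(y_true, y_pred):
        if truth != guess:
            # Sort so that A->B and B->A errors count toward the same pair.
            errors[tuple(sorted((truth, guess)))] += 1
    return errors.most_common(top)

# Toy labels for illustration only; these are NOT the project's predictions.
y_true = ["Merlot", "Merlot", "Cabernet Sauvignon", "Syrah", "Syrah", "Riesling"]
y_pred = ["Cabernet Sauvignon", "Cabernet Sauvignon", "Merlot",
          "Cabernet Sauvignon", "Syrah", "Riesling"]

print(most_confused_pairs(y_true, y_pred))
```

With real predictions, `sklearn.metrics.confusion_matrix` would give the same counts in matrix form.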

In [34]:
# Original variants
train["wine_variant"].value_counts()

Pinot Noir            38471
Cabernet Sauvignon    30234
Chardonnay            19443
Syrah                 13704
Riesling               9683
Zinfandel              8327
Merlot                 5522
Sauvignon Blanc        5113
Name: wine_variant, dtype: int64

## Combine Merlot and Cabernet Sauvignon

In [33]:
train_cabmer = train.copy()
test_cabmer = test.copy()

In [35]:
m = train_cabmer['wine_variant'].isin(['Merlot', 'Cabernet Sauvignon'])
train_cabmer['wine_variant'] = train_cabmer['wine_variant'].mask(m, 'Merlot/Cabernet')

n = test_cabmer['wine_variant'].isin(['Merlot', 'Cabernet Sauvignon'])
test_cabmer['wine_variant'] = test_cabmer['wine_variant'].mask(n, 'Merlot/Cabernet')
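
The same mask-and-relabel step is repeated for every combination in this notebook. One way to factor it out is sketched below on a tiny stand-in frame. `merge_variants` is a hypothetical helper, not something the notebook defines, and it assumes the label column holds plain strings as it does here.

```python
import pandas as pd

def merge_variants(df, variants, label, column="wine_variant"):
    """Return a copy of df with every value in `variants` relabeled as `label`."""
    out = df.copy()
    out[column] = out[column].mask(out[column].isin(variants), label)
    return out

# Tiny stand-in frame; the real data has many more rows and columns.
toy = pd.DataFrame({"wine_variant": ["Merlot", "Syrah", "Cabernet Sauvignon", "Merlot"]})
merged = merge_variants(toy, ["Merlot", "Cabernet Sauvignon"], "Merlot/Cabernet")
print(merged["wine_variant"].value_counts())
```

Returning a copy keeps the original frame untouched, which matches the `train.copy()` pattern used for each combination.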

In [36]:
train_cabmer["wine_variant"].value_counts()

Pinot Noir         38471
Merlot/Cabernet    35756
Chardonnay         19443
Syrah              13704
Riesling            9683
Zinfandel           8327
Sauvignon Blanc     5113
Name: wine_variant, dtype: int64

In [37]:
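# Merlot/Cabernet combination
with mp.Pool() as p:
    # Reuse each frame's own index so tokens stay aligned row-for-row even if the index is not 0..n-1.
    train_cabmer["tokens"] = pd.Series(list(p.imap(tokenize, tqdm(train_cabmer["review_text"]), chunksize=100)), index=train_cabmer.index)
    test_cabmer["tokens"] = pd.Series(list(p.imap(tokenize, tqdm(test_cabmer["review_text"]), chunksize=100)), index=test_cabmer.index)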


### Merlot/Cabernet Baseline

In [38]:
baseline_mc = make_pipeline(CountVectorizer(analyzer=identity), SGDClassifier())
baseline_mc.fit(train_cabmer["tokens"], train_cabmer["wine_variant"])
base_mc_predicted = baseline_mc.predict(test_cabmer["tokens"])
print(classification_report(test_cabmer["wine_variant"], base_mc_predicted))

                 precision    recall  f1-score   support

     Chardonnay       0.82      0.85      0.84      4861
Merlot/Cabernet       0.75      0.82      0.79      8939
     Pinot Noir       0.77      0.86      0.81      9618
       Riesling       0.82      0.77      0.80      2421
Sauvignon Blanc       0.85      0.65      0.73      1278
          Syrah       0.79      0.50      0.62      3426
      Zinfandel       0.69      0.57      0.63      2082

       accuracy                           0.78     32625
      macro avg       0.79      0.72      0.74     32625
   weighted avg       0.78      0.78      0.77     32625



### Merlot/Cabernet Hyperparameter search

In [53]:
mlflow.set_experiment("Project_2_Merlot/Cabernet")
sgd = make_pipeline(
    CountVectorizer(analyzer=identity), TfidfTransformer(), SGDClassifier()
)
# Skeleton classifier

INFO: 'Project_2_Merlot/Cabernet' does not exist. Creating a new experiment


In [55]:
%%time

search = RandomizedSearchCV(
    sgd,
    {
        "countvectorizer__min_df": randint(1, 20),
        "countvectorizer__max_df": uniform(0.5, 0.5),
        "tfidftransformer__use_idf": [True, False],
        "sgdclassifier__alpha": loguniform(1e-6, 1e-2),
    },
    n_iter=50,
    scoring="f1_macro",
)
search.fit(train_cabmer["tokens"], train_cabmer["wine_variant"])
log_search(search)

CPU times: user 10.9 s, sys: 1.38 s, total: 12.2 s
Wall time: 6min 18s
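
The search space above uses three `scipy.stats` distributions whose arguments are easy to misread: `randint(1, 20)` excludes the upper bound, `uniform(0.5, 0.5)` takes `loc` and `scale` and so covers 0.5 to 1.0, and `loguniform(1e-6, 1e-2)` spreads samples evenly across orders of magnitude. A quick check of those ranges follows; the sample size and seed are arbitrary choices, not values from the search.

```python
from scipy.stats import loguniform, randint, uniform

n, seed = 1000, 0  # arbitrary sample size and seed, for illustration only

min_df = randint(1, 20).rvs(size=n, random_state=seed)         # integers 1..19
max_df = uniform(0.5, 0.5).rvs(size=n, random_state=seed)      # floats in [0.5, 1.0]
alpha = loguniform(1e-6, 1e-2).rvs(size=n, random_state=seed)  # floats in [1e-6, 1e-2]

print(min_df.min(), min_df.max())
print(max_df.min(), max_df.max())
print(alpha.min(), alpha.max())
```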


### Merlot/Cabernet Compare to baseline

In [57]:
sgd_mc = make_pipeline(
    CountVectorizer(analyzer=identity, min_df=3, max_df=0.78472),
    TfidfTransformer(use_idf=True),
    SGDClassifier(alpha=8.32e-6),
)
sgd_mc.fit(train_cabmer["tokens"], train_cabmer["wine_variant"])
predicted_mc = sgd_mc.predict(test_cabmer["tokens"])
print(classification_report(test_cabmer["wine_variant"], predicted_mc))

                 precision    recall  f1-score   support

     Chardonnay       0.85      0.84      0.84      4861
Merlot/Cabernet       0.73      0.86      0.79      8939
     Pinot Noir       0.79      0.85      0.82      9618
       Riesling       0.81      0.80      0.80      2421
Sauvignon Blanc       0.84      0.67      0.74      1278
          Syrah       0.78      0.52      0.62      3426
      Zinfandel       0.86      0.52      0.65      2082

       accuracy                           0.78     32625
      macro avg       0.81      0.72      0.75     32625
   weighted avg       0.79      0.78      0.78     32625



In [59]:
base_mc_f1 = f1_score(test_cabmer["wine_variant"], base_mc_predicted, average="macro")
sgd_mc_f1 = f1_score(test_cabmer["wine_variant"], predicted_mc, average="macro")

In [60]:
print(f"Base F1 score MC: {base_mc_f1}")
print(f"SGD F1 score MC:  {sgd_mc_f1}")
print(f"Difference:       {sgd_mc_f1 - base_mc_f1}") 

Base F1 score MC: 0.7446297670452787
SGD F1 score MC:  0.7528789038185238
Difference:       0.008249136773245125
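
`average="macro"` takes the unweighted mean of the per-class F1 scores, so a small class such as Sauvignon Blanc counts as much as Pinot Noir. As a rough check, averaging the rounded per-class values from the tuned Merlot/Cabernet report above comes close to the score printed here; the small gap comes from the two-decimal rounding in the report.

```python
from statistics import mean

# Per-class F1 from the tuned Merlot/Cabernet classification report (rounded to 2 decimals).
per_class_f1 = {
    "Chardonnay": 0.84,
    "Merlot/Cabernet": 0.79,
    "Pinot Noir": 0.82,
    "Riesling": 0.80,
    "Sauvignon Blanc": 0.74,
    "Syrah": 0.62,
    "Zinfandel": 0.65,
}

macro_f1 = mean(per_class_f1.values())
print(f"{macro_f1:.3f}")
```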


In [71]:
mc_improvement = (sgd_mc_f1 - base_mc_f1) / (1 - base_mc_f1)
# Relative error reduction: the share of the baseline's remaining error (1 - F1) that tuning removed.
mc_improvement

0.03230265594309791
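
The quantity computed here is a relative error reduction: the F1 gain divided by the headroom the baseline left below a perfect score of 1. A small sketch with the Merlot/Cabernet scores printed above; `error_reduction` is a hypothetical helper name, not part of the notebook.

```python
def error_reduction(base_f1, tuned_f1):
    """Fraction of the baseline's remaining error (1 - F1) removed by tuning."""
    return (tuned_f1 - base_f1) / (1 - base_f1)

base, tuned = 0.7446297670452787, 0.7528789038185238
print(f"{error_reduction(base, tuned):.2%}")
```

A value near 0.032 means tuning closed about 3.2% of the gap between the baseline and a perfect score. It is not a 3.2-point rise in F1.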

## Combine Merlot and Riesling

In [39]:
train_merrie = train.copy()
test_merrie = test.copy()

In [40]:
m = train_merrie['wine_variant'].isin(['Merlot', 'Riesling'])
train_merrie['wine_variant'] = train_merrie['wine_variant'].mask(m, 'Merlot/Riesling')

n = test_merrie['wine_variant'].isin(['Merlot', 'Riesling'])
test_merrie['wine_variant'] = test_merrie['wine_variant'].mask(n, 'Merlot/Riesling')

In [41]:
test_merrie["wine_variant"].value_counts()

Pinot Noir            9618
Cabernet Sauvignon    7558
Chardonnay            4861
Merlot/Riesling       3802
Syrah                 3426
Zinfandel             2082
Sauvignon Blanc       1278
Name: wine_variant, dtype: int64

In [42]:
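# Merlot/Riesling combination
with mp.Pool() as p:
    # Reuse each frame's own index so tokens stay aligned row-for-row even if the index is not 0..n-1.
    train_merrie["tokens"] = pd.Series(list(p.imap(tokenize, tqdm(train_merrie["review_text"]), chunksize=100)), index=train_merrie.index)
    test_merrie["tokens"] = pd.Series(list(p.imap(tokenize, tqdm(test_merrie["review_text"]), chunksize=100)), index=test_merrie.index)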


### Merlot/Riesling Baseline

In [52]:
baseline_mr = make_pipeline(CountVectorizer(analyzer=identity), SGDClassifier())
baseline_mr.fit(train_merrie["tokens"], train_merrie["wine_variant"])
base_mr_predicted = baseline_mr.predict(test_merrie["tokens"])
print(classification_report(test_merrie["wine_variant"], base_mr_predicted))

                    precision    recall  f1-score   support

Cabernet Sauvignon       0.75      0.75      0.75      7558
        Chardonnay       0.72      0.89      0.80      4861
   Merlot/Riesling       0.80      0.59      0.68      3802
        Pinot Noir       0.79      0.84      0.81      9618
   Sauvignon Blanc       0.86      0.64      0.73      1278
             Syrah       0.60      0.63      0.61      3426
         Zinfandel       0.82      0.52      0.64      2082

          accuracy                           0.75     32625
         macro avg       0.76      0.70      0.72     32625
      weighted avg       0.75      0.75      0.74     32625



### Merlot/Riesling Hyperparameter search

In [63]:
mlflow.set_experiment("Project_2_Merlot/Riesling")
sgd = make_pipeline(
    CountVectorizer(analyzer=identity), TfidfTransformer(), SGDClassifier()
)
# Skeleton classifier

INFO: 'Project_2_Merlot/Riesling' does not exist. Creating a new experiment


In [64]:
%%time

search = RandomizedSearchCV(
    sgd,
    {
        "countvectorizer__min_df": randint(1, 20),
        "countvectorizer__max_df": uniform(0.5, 0.5),
        "tfidftransformer__use_idf": [True, False],
        "sgdclassifier__alpha": loguniform(1e-6, 1e-2),
    },
    n_iter=50,
    scoring="f1_macro",
)
search.fit(train_merrie["tokens"], train_merrie["wine_variant"])
log_search(search)

CPU times: user 10.9 s, sys: 1.42 s, total: 12.4 s
Wall time: 6min 7s


### Merlot/Riesling Compare to baseline

In [76]:
sgd_mr = make_pipeline(
    CountVectorizer(analyzer=identity, min_df=2, max_df=0.7862),
    TfidfTransformer(use_idf=True),
    SGDClassifier(alpha=1.68e-5),
)
sgd_mr.fit(train_merrie["tokens"], train_merrie["wine_variant"])
predicted_mr = sgd_mr.predict(test_merrie["tokens"])
print(classification_report(test_merrie["wine_variant"], predicted_mr))

                    precision    recall  f1-score   support

Cabernet Sauvignon       0.69      0.82      0.75      7558
        Chardonnay       0.79      0.88      0.83      4861
   Merlot/Riesling       0.81      0.59      0.68      3802
        Pinot Noir       0.78      0.86      0.82      9618
   Sauvignon Blanc       0.84      0.66      0.74      1278
             Syrah       0.74      0.55      0.63      3426
         Zinfandel       0.84      0.52      0.64      2082

          accuracy                           0.76     32625
         macro avg       0.78      0.70      0.73     32625
      weighted avg       0.77      0.76      0.75     32625



In [77]:
base_mr_f1 = f1_score(test_merrie["wine_variant"], base_mr_predicted, average="macro")
sgd_mr_f1 = f1_score(test_merrie["wine_variant"], predicted_mr, average="macro")

In [78]:
print(f"Base F1 score MR: {base_mr_f1}")
print(f"SGD F1 score MR:  {sgd_mr_f1}")
print(f"Difference:       {sgd_mr_f1 - base_mr_f1}") 

Base F1 score MR: 0.717510943778003
SGD F1 score MR:  0.7281148654199932
Difference:       0.0106039216419902


In [79]:
mr_improvement = (sgd_mr_f1 - base_mr_f1) / (1 - base_mr_f1)
# Relative error reduction: the share of the baseline's remaining error (1 - F1) that tuning removed.
mr_improvement

0.03753745997741234

## Combine Syrah and Cabernet Sauvignon

In [44]:
train_syrcab = train.copy()
test_syrcab = test.copy()

In [45]:
m = train_syrcab['wine_variant'].isin(['Syrah', 'Cabernet Sauvignon'])
train_syrcab['wine_variant'] = train_syrcab['wine_variant'].mask(m, 'Syrah/Cabernet')

n = test_syrcab['wine_variant'].isin(['Syrah', 'Cabernet Sauvignon'])
test_syrcab['wine_variant'] = test_syrcab['wine_variant'].mask(n, 'Syrah/Cabernet')

In [46]:
train_syrcab["wine_variant"].value_counts()

Syrah/Cabernet     43938
Pinot Noir         38471
Chardonnay         19443
Riesling            9683
Zinfandel           8327
Merlot              5522
Sauvignon Blanc     5113
Name: wine_variant, dtype: int64

In [47]:
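# Syrah/Cabernet combination
with mp.Pool() as p:
    # Reuse each frame's own index so tokens stay aligned row-for-row even if the index is not 0..n-1.
    train_syrcab["tokens"] = pd.Series(list(p.imap(tokenize, tqdm(train_syrcab["review_text"]), chunksize=100)), index=train_syrcab.index)
    test_syrcab["tokens"] = pd.Series(list(p.imap(tokenize, tqdm(test_syrcab["review_text"]), chunksize=100)), index=test_syrcab.index)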


### Syrah/Cabernet Baseline

In [48]:
baseline_sc = make_pipeline(CountVectorizer(analyzer=identity), SGDClassifier())
baseline_sc.fit(train_syrcab["tokens"], train_syrcab["wine_variant"])
base_sc_predicted = baseline_sc.predict(test_syrcab["tokens"])
print(classification_report(test_syrcab["wine_variant"], base_sc_predicted))

                 precision    recall  f1-score   support

     Chardonnay       0.83      0.84      0.84      4861
         Merlot       0.79      0.35      0.48      1381
     Pinot Noir       0.80      0.83      0.82      9618
       Riesling       0.80      0.79      0.80      2421
Sauvignon Blanc       0.88      0.64      0.74      1278
 Syrah/Cabernet       0.76      0.87      0.81     10984
      Zinfandel       0.86      0.52      0.65      2082

       accuracy                           0.79     32625
      macro avg       0.82      0.69      0.73     32625
   weighted avg       0.80      0.79      0.79     32625



### Syrah/Cabernet Hyperparameter search

In [73]:
mlflow.set_experiment("Project_2_Syrah/Cabernet")
sgd = make_pipeline(
    CountVectorizer(analyzer=identity), TfidfTransformer(), SGDClassifier()
)
# Skeleton classifier

INFO: 'Project_2_Syrah/Cabernet' does not exist. Creating a new experiment


In [74]:
%%time

search = RandomizedSearchCV(
    sgd,
    {
        "countvectorizer__min_df": randint(1, 20),
        "countvectorizer__max_df": uniform(0.5, 0.5),
        "tfidftransformer__use_idf": [True, False],
        "sgdclassifier__alpha": loguniform(1e-6, 1e-2),
    },
    n_iter=50,
    scoring="f1_macro",
)
search.fit(train_syrcab["tokens"], train_syrcab["wine_variant"])
log_search(search)

CPU times: user 10.8 s, sys: 1.52 s, total: 12.3 s
Wall time: 6min 12s


### Syrah/Cabernet Compare to baseline

In [80]:
sgd_sc = make_pipeline(
    CountVectorizer(analyzer=identity, min_df=3, max_df=0.8699),
    TfidfTransformer(use_idf=True),
    SGDClassifier(alpha=1.011e-5),
)
sgd_sc.fit(train_syrcab["tokens"], train_syrcab["wine_variant"])
predicted_sc = sgd_sc.predict(test_syrcab["tokens"])
print(classification_report(test_syrcab["wine_variant"], predicted_sc))

                 precision    recall  f1-score   support

     Chardonnay       0.84      0.85      0.84      4861
         Merlot       0.87      0.33      0.48      1381
     Pinot Noir       0.81      0.83      0.82      9618
       Riesling       0.81      0.79      0.80      2421
Sauvignon Blanc       0.86      0.66      0.74      1278
 Syrah/Cabernet       0.75      0.89      0.81     10984
      Zinfandel       0.90      0.50      0.64      2082

       accuracy                           0.80     32625
      macro avg       0.84      0.69      0.73     32625
   weighted avg       0.81      0.80      0.79     32625



In [81]:
base_sc_f1 = f1_score(test_syrcab["wine_variant"], base_sc_predicted, average="macro")
sgd_sc_f1 = f1_score(test_syrcab["wine_variant"], predicted_sc, average="macro")

In [82]:
print(f"Base F1 score SC: {base_sc_f1}")
print(f"SGD F1 score SC:  {sgd_sc_f1}")
print(f"Difference:       {sgd_sc_f1 - base_sc_f1}") 

Base F1 score SC: 0.7323657986702568
SGD F1 score SC:  0.7346136619153828
Difference:       0.002247863245126047


In [103]:
sc_improvement = (sgd_sc_f1 - base_sc_f1) / (1 - base_sc_f1)
# Relative error reduction: the share of the baseline's remaining error (1 - F1) that tuning removed.
sc_improvement

0.00839901340694693

---

## Bring it all together

In [104]:
print(f"Combining Merlot and Cabernet Sauvignon netted us a {mc_improvement*100:.3f}% improvement, with a Macro Average F1 score of {sgd_mc_f1:.3f}.")
print(f"Combining Merlot and Riesling netted us a {mr_improvement*100:.3f}% improvement, with a Macro Average F1 score of {sgd_mr_f1:.3f}.")
print(f"Combining Syrah and Cabernet Sauvignon netted us a {sc_improvement*100:.3f}% improvement, with a Macro Average F1 score of {sgd_sc_f1:.3f}.")

Combining Merlot and Cabernet Sauvignon netted us a 3.230% improvement, with a Macro Average F1 score of 0.753.
Combining Merlot and Riesling netted us a 3.754% improvement, with a Macro Average F1 score of 0.728.
Combining Syrah and Cabernet Sauvignon netted us a 0.840% improvement, with a Macro Average F1 score of 0.735.


As the output above shows, combining Merlot and Riesling gave the largest relative error reduction over its baseline (3.754%), but combining Merlot with Cabernet Sauvignon gave the highest macro-averaged F1 score (0.753), which is still fairly low.
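
The two rankings disagree because they measure different things: one is the gain relative to each combination's own baseline, the other is the final score. A short sketch using the rounded figures from the summary output makes the split explicit.

```python
# (relative error reduction in %, tuned macro F1), copied from the summary output above.
results = {
    "Merlot/Cabernet": (3.230, 0.753),
    "Merlot/Riesling": (3.754, 0.728),
    "Syrah/Cabernet": (0.840, 0.735),
}

best_gain = max(results, key=lambda name: results[name][0])
best_f1 = max(results, key=lambda name: results[name][1])

print(f"Largest error reduction: {best_gain}")
print(f"Highest macro F1:        {best_f1}")
```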