<p style="font-size:30px; text-align:center; line-height:120%">
    <br> 
        <b>
        COMS 4995 Applied ML
            Homework 4 
        <br></br>
            Predicting Wine Quality: Task 2
        <br></br>
        </b> 
    <br> 
</p>
<p style="font-size:18px; text-align:left; line-height:120%">
    <br> 
        <b>
        Kirit Dhillon, Sagar Lal
        </b>
    <br> 
        <b>
        Uni: ksd2142, sl3946
        </b>
</p>

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

### Data Loading and Exploration

In [2]:
data = pd.read_csv("winemag-data-130k-v2.csv")
# Remove uninformative columns like "Taster Name" and "Taster Twitter Handle"
data = data.drop(['taster_name', 'taster_twitter_handle'], axis=1)

In [3]:
from sklearn.model_selection import train_test_split
X_trainval, X_test, y_trainval, y_test = train_test_split(data['description'], data['points'], stratify= data['points'])

In [4]:
print("X_trainval: \t", X_trainval.shape, y_trainval.shape)
print("X_test: \t", X_test.shape, y_test.shape)

X_trainval: 	 (97478,) (97478,)
X_test: 	 (32493,) (32493,)


### Setting up BOW

In [5]:
from sklearn.feature_extraction.text import CountVectorizer

bow_vect = CountVectorizer(stop_words="english", max_features=1000)

X_trainval_bow = bow_vect.fit_transform(X_trainval)
X_test_bow = bow_vect.transform(X_test)

In [6]:
# Debugging
print("[BoW Text] X_trainval: \t", X_trainval_bow.shape)
print("[BoW Text] X_test:\t", X_test_bow.shape)

[BoW Text] X_trainval: 	 (97478, 1000)
[BoW Text] X_test:	 (32493, 1000)
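
As a toy illustration of what the bag-of-words step produces, the sketch below runs `CountVectorizer` on two made-up review snippets. The snippets and `max_features=3` are assumptions chosen only to keep the output small; the notebook itself uses `max_features=1000` on the full descriptions.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up snippets standing in for wine descriptions
docs = ["ripe cherry and ripe plum", "crisp green apple"]

# Same settings as the notebook, except for a much smaller vocabulary cap
toy_vect = CountVectorizer(stop_words="english", max_features=3)
toy_bow = toy_vect.fit_transform(docs)

# One row per document, one column per retained vocabulary term
print(toy_bow.shape)

# vocabulary_ maps each retained term to its column index
ripe_col = toy_vect.vocabulary_["ripe"]
print(toy_bow[0, ripe_col])  # count of "ripe" in the first snippet
```

The result is a sparse matrix of raw term counts; stop words such as "and" are dropped before counting.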


### Pre-trained Word Embeddings Dataset

- We decided to use a pre-trained Doc2Vec model called "Associated Press News DBOW". DBOW (distributed bag of words) is the Doc2Vec analogue of the skip-gram model in word2vec: the paragraph vectors are obtained by training a neural network to predict words randomly sampled from a paragraph, given that paragraph's vector. (Source: https://ibm.ent.box.com/s/9ebs3c759qqo1d8i7ed323i6shv2js7e)
- We picked this over a word2vec model because it's more efficient and computationally cheaper for a large corpus like our wine review dataset.

In [8]:
import gensim.models.keyedvectors as word2vec
import gensim.models as g

In [9]:
model = g.Doc2Vec.load('./apnews_dbow/doc2vec.bin')

  "C extension not loaded, training will be slow. "


In [12]:
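# Infer a Doc2Vec vector for every review so the text can be fed to a regressor
def preprocess(model, dataset):
    # Split each review on single spaces; infer_vector expects a list of tokens
    list_dataset = list(dataset.str.split(" ", expand=False))
    print("Successful splitting")
    print(list_dataset[1])  # show one tokenized review as a sanity check
    w2v_dataset = []
    for j, tokens in enumerate(list_dataset, start=1):
        if j % 1000 == 0:
            print(j)  # progress indicator
        w2v_dataset.append(model.infer_vector(tokens))
    return w2v_dataset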

In [None]:
w2v_x_test = preprocess(model, X_test)
print("Completed X_test preprocessing")
# w2v_x_trainval is required by the cells below, so it must be computed too
w2v_x_trainval = preprocess(model, X_trainval)
print("Completed X_trainval preprocessing")

Successful splitting
['Soft,', 'ripe', 'and', 'round,', 'this', 'has', 'a', 'slightly', 'liquorous', 'bite', 'to', 'the', 'aromatics.', 'Pear', 'and', 'nectarine', 'fruit', 'is', 'accented', 'with', 'a', 'dash', 'of', 'cinnamon.', 'The', 'wine', 'was', '20%', 'barrel', 'fermented.']
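
The sample above shows that splitting on spaces leaves punctuation and capitalization attached to the tokens ('Soft,', 'aromatics.'). Such tokens may not match entries in a pre-trained vocabulary. The sketch below shows one possible normalizing tokenizer; the function name `tokenize` and the choice to lowercase are assumptions, since we have not verified how the AP News model's training text was tokenized.

```python
import re

def tokenize(text):
    """Lowercase the text and keep runs of letters, digits, apostrophes and %."""
    return re.findall(r"[a-z0-9'%]+", text.lower())

review = "Soft, ripe and round, this has a slightly liquorous bite."

# Plain space splitting keeps the trailing comma and the capital letter
print(review.split(" ")[:2])

# The normalizing tokenizer strips both
print(tokenize(review)[:2])
```

If used, `tokenize` would be applied to each description in place of `str.split(" ")` before calling `infer_vector`.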


In [63]:
# Convert lists to Pandas dataframe
w2v_x_test = pd.DataFrame(w2v_x_test)
w2v_x_trainval = pd.DataFrame(w2v_x_trainval)

In [46]:
# Debugging: check the shapes of the embedded datasets
print("[w2v] X_trainval: \t", w2v_x_trainval.shape)
print("[w2v] X_test: \t", w2v_x_test.shape)

(97478, 300)

### Run Models on Doc2Vec Features

In [51]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from xgboost import XGBRegressor

In [64]:
%%time
ridge_pipeline = Pipeline(steps=[('regressor', Ridge())])

param_grid = {'regressor__alpha': [0.01, 0.1, 1, 10, 100]}
grid = GridSearchCV(ridge_pipeline, param_grid, cv=3, return_train_score=True)
grid.fit(w2v_x_trainval, y_trainval)

print(("Doc2Vec Ridge score with GridSearchCV %.2f"
       %grid.score(w2v_x_test, y_test)))
print("Best param:", grid.best_params_)

Doc2Vec Ridge score with GridSearchCV 0.45
Best param: {'regressor__alpha': 10}
CPU times: user 12.9 s, sys: 3.99 s, total: 16.9 s
Wall time: 11.7 s
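
The number printed by `grid.score` for a Ridge regressor is the coefficient of determination (R²), not a classification accuracy. The sketch below computes R² by hand on made-up point scores to show how to read it; the helper name `r2_score_manual` and the sample values are illustrative only.

```python
import numpy as np

def r2_score_manual(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

y_true = [88, 90, 92, 94]  # made-up wine point scores

# Perfect predictions give the maximum score
perfect = r2_score_manual(y_true, y_true)

# Always predicting the mean (91) explains none of the variance
mean_only = r2_score_manual(y_true, [91, 91, 91, 91])

print(perfect, mean_only)
```

Read this way, a score of 0.45 means the model explains a little under half of the variance in the point scores on the test set.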


In [None]:
%%time
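xgb_pipeline = Pipeline(steps=[('regressor', XGBRegressor())])

# reg_alpha / reg_lambda are the L1 / L2 regularization parameters in
# XGBRegressor's scikit-learn interface
param_grid = {
    "regressor__max_depth": [4],
    "regressor__reg_alpha": [0],
    "regressor__reg_lambda": [0.5],
}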
grid = GridSearchCV(xgb_pipeline, param_grid, cv=3, return_train_score=True)
grid.fit(w2v_x_trainval, y_trainval)

print(("XGBoost with GridSearchCV %.2f"
       %grid.score(w2v_x_test, y_test)))
print("Best param:", grid.best_params_)

#### Analysis
- We tried the pre-trained Doc2Vec embeddings as features for two models: Ridge and XGBRegressor.
- Both perform significantly worse than the models in Task 1 (approx. 24% worse).
- We believe that pre-trained word embeddings are not appropriate for featurization for this task because ________

### Combine Doc2Vec with BOW

In [61]:
X_trainval_bow = pd.DataFrame(X_trainval_bow.toarray())
X_test_bow = pd.DataFrame(X_test_bow.toarray())

In [67]:
full_X_trainval = pd.concat([X_trainval_bow, w2v_x_trainval], axis=1)
full_X_test = pd.concat([X_test_bow, w2v_x_test],axis=1)
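
Converting the bag-of-words matrices with `.toarray()` stores every zero explicitly, which is costly for a 97478 x 1000 matrix. As a possible alternative, the sketch below joins a sparse count matrix and a dense embedding matrix with `scipy.sparse.hstack` so the result stays sparse. The small arrays are made-up stand-ins for the real features, not the notebook's data.

```python
import numpy as np
from scipy import sparse

# Made-up stand-ins: 2 documents, 3 BoW columns, 2 embedding dimensions
toy_bow = sparse.csr_matrix(np.array([[1, 0, 2],
                                      [0, 1, 0]]))
toy_emb = np.array([[0.1, 0.2],
                    [0.3, 0.4]])

# Stack column-wise; the dense block is wrapped so hstack returns a sparse matrix
toy_full = sparse.hstack([toy_bow, sparse.csr_matrix(toy_emb)], format="csr")

print(toy_full.shape)
print(toy_full.toarray())
```

scikit-learn's `Ridge` accepts sparse input, so a matrix built this way could be passed to `grid.fit` directly. Stacking by position also avoids relying on the two DataFrames sharing the same row index.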

In [68]:
%%time
ridge_pipeline = Pipeline(steps=[('regressor', Ridge())])

param_grid = {'regressor__alpha': [0.01, 0.1, 1, 10, 100]}
grid = GridSearchCV(ridge_pipeline, param_grid, cv=3, return_train_score=True)
grid.fit(full_X_trainval, y_trainval)

print(("Baseline Ridge score with GridSearchCV %.2f"
       %grid.score(full_X_test, y_test)))
print("Best param:", grid.best_params_)

Baseline Ridge score with GridSearchCV 0.64
Best param: {'regressor__alpha': 10}
CPU times: user 2min 2s, sys: 37.7 s, total: 2min 40s
Wall time: 2min 2s


#### Analysis
- Combining Doc2Vec with BoW for featurization on Ridge performs very well (test score 0.64), second only to the text-based N-gram Ridge with TF-IDF and stemming, which it trails by about 0.04. Note that this score is R², not accuracy, since Ridge is a regressor.
- While this improves on Doc2Vec-only featurization (0.45), we find it's still better to avoid pre-trained embeddings altogether.