### Applied Machine Learning
# HW4

### Shuai Hao (sh3831), Eugene M. Joseph (emj2152)

## Predict wine quality from review texts
This notebook explores models built on feature vectors derived from GloVe, Word2Vec, and FastText word embeddings, using [the wine reviews data](https://www.kaggle.com/zynicide/wine-reviews) from Kaggle. The data were scraped on November 22nd, 2017. For this task, only wines from the US are used.

## Task 2 Word Vectors [50pts]
Use a pretrained word-embedding (word2vec, glove or fasttext) for featurization instead of the bag-of-words model. Does this improve classification? How about combining the embedded words with the BoW model?

First, let's load all our dependencies.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re
from operator import itemgetter
from category_encoders import TargetEncoder
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder, LabelEncoder, StandardScaler, Normalizer
from sklearn.preprocessing import PolynomialFeatures, FunctionTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import StratifiedKFold, train_test_split, cross_val_score, GridSearchCV, KFold
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.compose import make_column_transformer
from sklearn.metrics import accuracy_score, f1_score
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.impute import SimpleImputer, KNNImputer
# The experimental enable_* imports must run before the estimators they enable are imported
from sklearn.experimental import enable_iterative_imputer, enable_hist_gradient_boosting
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, HistGradientBoostingRegressor
import seaborn as sns
import spacy

In [3]:
nlp = spacy.load("en_core_web_lg",disable=["tagger", "parser", "ner"])

In [4]:
df_all = pd.read_csv("winemag-data-130k-v2.csv", index_col=0)

In [5]:
# .copy() gives df its own data, so adding a column below does not raise SettingWithCopyWarning
df = df_all[df_all["country"] == "US"].copy()

In [11]:
df['all_text'] = df['title'].astype(str) + " " + df['description'].astype(str)


**To recap, here is our best model from part 1**

In [6]:
y = df['points']
X = df.drop(columns=['country','points', 'taster_twitter_handle'])

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [8]:
categorical = X_train.dtypes == 'object'

In [9]:
preprocess_continuous = make_pipeline(
    SimpleImputer(strategy='median'),
    StandardScaler()
)

preprocess_target = make_pipeline(
    TargetEncoder(),
)

preprocess_dummy = make_pipeline(
    SimpleImputer(strategy='constant', fill_value="NA"),
    OneHotEncoder(handle_unknown='ignore')
)

preprocess_word = make_pipeline(
    CountVectorizer(ngram_range=(1, 2), min_df = 1),
    TfidfTransformer()
)

In [10]:
preprocess = make_column_transformer(
    (preprocess_continuous, ~categorical),
    (preprocess_target, ['designation','region_1','variety','winery']),
    (preprocess_dummy, ['province','region_2','taster_name']),
    (preprocess_word, 'all_text'),
)

In [11]:
model_linear_regression = make_pipeline(preprocess, LinearRegression())

In [12]:
scores = cross_val_score(model_linear_regression, X_train, y_train)
np.mean(scores)

0.7672466043879365

In [13]:
model_linear_regression.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('columntransformer',
                 ColumnTransformer(n_jobs=None, remainder='drop',
                                   sparse_threshold=0.3,
                                   transformer_weights=None,
                                   transformers=[('pipeline-1',
                                                  Pipeline(memory=None,
                                                           steps=[('simpleimputer',
                                                                   SimpleImputer(add_indicator=False,
                                                                                 copy=True,
                                                                                 fill_value=None,
                                                                                 missing_values=nan,
                                                                                 strategy='median',
                                           

In [14]:
model_linear_regression.score(X_test, y_test)

0.775435040275117

### First, let's try building a model that uses only the Word2Vec features from spaCy

In [15]:
y = df['points']
X = df['all_text']

In [16]:
docs_w2v = [nlp(d).vector for d in X]
X_w2v = np.vstack(docs_w2v)

In [17]:
X_w2v.shape

(54504, 300)

In [18]:
X_train, X_test, y_train, y_test = train_test_split(X_w2v, y, random_state=1)

In [13]:
scores = cross_val_score(LinearRegression(), X_train, y_train)
np.mean(scores)

0.5884998394991272

In [15]:
model_linear_regression = LinearRegression().fit(X_train, y_train)

In [16]:
model_linear_regression.score(X_test, y_test)

0.5928605747883822

The Word2Vec features scored lower than our best model from part 1 (cross-validation R² of 0.588 versus 0.767, test R² of 0.593 versus 0.775). That being said, this model is significantly smaller, with only 300 dense features, and much faster to run.
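
For reference, both `cross_val_score` and `.score` report the coefficient of determination (R²) by default for a scikit-learn regressor, so the numbers above are fractions of variance explained rather than accuracies. A minimal sketch of that computation, using made-up values that are not taken from the wine data:

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # squared error of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # squared error of predicting the mean
    return 1.0 - ss_res / ss_tot

# Illustrative values only
toy_points = [88, 90, 92, 94]
toy_preds = [89, 90, 91, 95]
toy_r2 = r_squared(toy_points, toy_preds)  # 1 - 3/20 = 0.85
```

A model that always predicts the mean of the targets scores 0 on this scale, and a perfect model scores 1.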

### Let's try again using GloVe embeddings
The GloVe word embeddings we picked were trained on Wikipedia and Gigaword. We downloaded the 300-dimensional pre-trained word vectors from [the GloVe website](http://nlp.stanford.edu/data/glove.6B.zip).

In [119]:
X = df['title'].astype(str) + " " +df['description'].astype(str)

In [120]:
y = df['points']

In [121]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [122]:
glove_path = "D:/Natural-Language-Processing/hw2/resources/glove.6B.300d.txt"

In [123]:
df_glove = pd.read_csv(glove_path, sep=" ", quoting=3, header=None, index_col=0)
glove = {key: val.values for key, val in df_glove.T.items()}
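
Reading the vectors with `pd.read_csv` works, but with its default settings pandas may interpret tokens such as `null` or `nan` as missing values. As an alternative, a plain line-by-line reader could look like the sketch below. `load_text_vectors` is a hypothetical helper, not part of any library. It assumes one token per line followed by space-separated floats, and it skips a leading `count dim` header line such as the one fastText `.vec` files start with.

```python
import io
import numpy as np

def load_text_vectors(lines, dim):
    """Parse 'token v1 v2 ... v_dim' lines into a {token: vector} dict (hypothetical helper)."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        # Skips a 'count dim' header and any malformed line with the wrong number of values
        if len(values) != dim:
            continue
        vectors[word] = np.array([float(v) for v in values], dtype=np.float32)
    return vectors

# Toy input standing in for a real vector file; the first line mimics a fastText header
toy_file = io.StringIO("3 2\nwine 0.1 0.2\noak 0.3 0.4\nnull 0.5 0.6\n")
toy_vectors = load_text_vectors(toy_file, dim=2)
```

With a real file this would be called as `load_text_vectors(open(glove_path, encoding="utf-8"), dim=300)`, where the encoding is an assumption about the downloaded file.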

In [125]:
glove_docs_train = []
for doc in X_train:
    # Keep the vector of every whitespace-separated token that is in the GloVe vocabulary
    doc_embeddings = [glove[word] for word in doc.split(" ") if word in glove]
    # Average the token vectors into a single 300-dimensional document vector
    glove_docs_train.append(np.asarray(doc_embeddings).mean(axis=0))
X_train_vector = np.vstack(glove_docs_train)

In [126]:
X_train_vector.shape

(40878, 300)
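
The loop above matches whitespace-separated tokens exactly as they appear in the review. Because the `glove.6B` vocabulary is lowercase, capitalized words and tokens with attached punctuation (for example `Cabernet` or `cherry,`) generally find no match and are dropped, and a review with no matches at all would not yield a valid 300-dimensional row. The sketch below is one possible variant, not the code used for the scores reported here. `mean_embedding` and the regex tokenizer are illustrative assumptions, and the toy vectors stand in for the real `glove` dictionary.

```python
import re
import numpy as np

TOKEN_RE = re.compile(r"\w+")  # assumption: runs of word characters count as tokens

def mean_embedding(text, vectors, dim):
    """Average the vectors of all known tokens; fall back to zeros if none are known."""
    tokens = TOKEN_RE.findall(text.lower())
    found = [vectors[t] for t in tokens if t in vectors]
    if not found:
        return np.zeros(dim)
    return np.mean(found, axis=0)

# Toy 2-dimensional vectors standing in for the real embedding dictionary
toy_vectors = {"ripe": np.array([1.0, 0.0]), "cherry": np.array([0.0, 1.0])}
toy_docs = ["Ripe cherry, oak.", "???"]
toy_matrix = np.vstack([mean_embedding(d, toy_vectors, dim=2) for d in toy_docs])
```

Falling back to a zero vector is only one option; dropping such reviews or imputing the mean document vector would be equally defensible.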

In [127]:
scores = cross_val_score(LinearRegression(), X_train_vector, y_train)
np.mean(scores)

0.4704118742016389

In [128]:
model_linear_regression = LinearRegression().fit(X_train_vector, y_train)

In [129]:
glove_docs_test = []
for doc in X_test:
    # Same featurization as for the training set: average the vectors of known tokens
    doc_embeddings = [glove[word] for word in doc.split(" ") if word in glove]
    glove_docs_test.append(np.asarray(doc_embeddings).mean(axis=0))
X_test_vector = np.vstack(glove_docs_test)

In [130]:
model_linear_regression.score(X_test_vector, y_test)

0.4634534419856774

The score using GloVe features is lower than with the Word2Vec features (test R² of 0.463 versus 0.593). Now let's try again with FastText embeddings.

### Let's try once again using FastText embeddings
The FastText word embeddings we used were trained on Common Crawl. We downloaded the 300-dimensional pre-trained word vectors from [the FastText website](https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M.vec.zip).

In [7]:
X = df['title'].astype(str) + " " +df['description'].astype(str)

In [8]:
y = df['points']

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [110]:
fasttext_path = "D:/Natural-Language-Processing/hw2/resources/crawl-300d-2M.vec"

In [111]:
df_fasttext = pd.read_csv(fasttext_path, sep=" ", quoting=3, header=None, index_col=0)
fasttext = {key: val.values for key, val in df_fasttext.T.items()}

In [113]:
fasttext_docs_train = []
for doc in X_train:
    # Keep the vector of every whitespace-separated token that is in the FastText vocabulary
    doc_embeddings = [fasttext[word] for word in doc.split(" ") if word in fasttext]
    # Average the token vectors into a single 300-dimensional document vector
    fasttext_docs_train.append(np.asarray(doc_embeddings).mean(axis=0))
X_train_vector = np.vstack(fasttext_docs_train)

In [114]:
X_train_vector.shape

(40878, 300)

In [115]:
scores = cross_val_score(LinearRegression(), X_train_vector, y_train)
np.mean(scores)

0.5544029996406648

In [116]:
model_linear_regression = LinearRegression().fit(X_train_vector, y_train)

In [117]:
fasttext_docs_test = []
for doc in X_test:
    # Same featurization as for the training set: average the vectors of known tokens
    doc_embeddings = [fasttext[word] for word in doc.split(" ") if word in fasttext]
    fasttext_docs_test.append(np.asarray(doc_embeddings).mean(axis=0))
X_test_vector = np.vstack(fasttext_docs_test)

In [118]:
model_linear_regression.score(X_test_vector, y_test)

0.5496359611002772

FastText did better than GloVe (test R² of 0.550 versus 0.463), but Word2Vec is still the best of the three (0.593).  
Now let's try combining our BOW features with the Word2Vec embeddings.

### Word2Vec + BOW

In [20]:
y = df['points']
X = df[['all_text']]

In [21]:
X.shape

(54504, 1)

In [22]:
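# Caution: join aligns on index labels. X keeps df_all's row labels (2, 3, 4, 10, ...), while
# this new frame is labelled 0..54503; passing index=X.index to pd.DataFrame would align by position.
X_full = X.join(pd.DataFrame(X_w2v, columns = ["w2v_%d" % (i) for i in range(300)]))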

In [23]:
X.shape

(54504, 1)

In [24]:
X_full.shape

(54504, 301)
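
`DataFrame.join` matches rows by index label, not by position. Here `X` keeps the row labels of `df_all` after the US filter (2, 3, 4, 10, ...), while a frame built from `X_w2v` without an `index` argument is labelled 0 to 54503, so it is worth checking that each review ends up next to its own embedding. The toy example below illustrates the behaviour and one way to align by position; the frames and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Three "reviews" with non-contiguous labels, as after filtering a larger frame
text = pd.DataFrame({"all_text": ["a", "b", "c"]}, index=[2, 3, 10])
emb = np.array([[0.1], [0.2], [0.3]])  # one embedding row per review, in order
names = ["w2v_%d" % i for i in range(emb.shape[1])]

# Default RangeIndex (0, 1, 2): only label 2 matches, and it picks up the third row
by_label = text.join(pd.DataFrame(emb, columns=names))

# Reusing text.index pairs each review with its own embedding row
by_position = text.join(pd.DataFrame(emb, index=text.index, columns=names))
```

In `by_label`, the review labelled 2 receives 0.3 and the other two receive NaN, whereas `by_position` contains 0.1, 0.2, 0.3 in order. Building the column list from the same comprehension, rather than slicing `X_full.columns` by position, also keeps every embedding column regardless of how many other columns the frame has.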

In [25]:
X_full.head()

|  | all_text | w2v_0 | w2v_1 | w2v_2 | w2v_3 | w2v_4 | w2v_5 | w2v_6 | w2v_7 | w2v_8 | ... | w2v_290 | w2v_291 | w2v_292 | w2v_293 | w2v_294 | w2v_295 | w2v_296 | w2v_297 | w2v_298 | w2v_299 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | Rainstorm 2013 Pinot Gris (Willamette Valley) ... | -0.059305 | 0.178837 | -0.103683 | -0.081007 | 0.114283 | 0.169241 | 0.038487 | -0.107372 | -0.093617 | ... | -0.075751 | 0.118462 | -0.121482 | -0.000904 | -0.095376 | -0.048598 | 0.007584 | -0.155806 | -0.029952 | 0.00113 |
| 3 | St. Julian 2013 Reserve Late Harvest Riesling ... | -0.123089 | 0.166517 | -0.015784 | -0.14936 | 0.170512 | 0.12151 | 0.061644 | 0.007022 | -0.008923 | ... | -0.056513 | 0.216762 | -0.054715 | -0.029836 | -0.140443 | -0.084087 | 0.016938 | -0.170598 | -0.004161 | 0.075681 |
| 4 | Sweet Cheeks 2012 Vintner's Reserve Wild Child... | -0.119515 | 0.158756 | -0.012404 | -0.112647 | 0.152178 | 0.069793 | -0.013483 | -0.062202 | -0.027927 | ... | -0.062307 | 0.10963 | -0.10726 | -0.113588 | -0.17141 | -0.097073 | 0.051592 | -0.213532 | -0.068529 | 0.031298 |
| 10 | Kirkland Signature 2011 Mountain Cuvée Caberne... | -0.021261 | 0.142845 | -0.031244 | -0.162984 | 0.210119 | 0.11259 | 0.08968 | 0.026466 | -0.100519 | ... | -0.053804 | 0.095881 | -0.018253 | -0.053935 | -0.128425 | -0.008587 | 0.071964 | -0.138455 | -0.017377 | 0.001073 |
| 12 | Louis M. Martini 2012 Cabernet Sauvignon (Alex... | -0.086759 | 0.226323 | -0.055962 | -0.272232 | 0.188816 | 0.227645 | -0.039569 | -0.014617 | -0.131085 | ... | -0.071923 | 0.197889 | -0.136351 | -0.076265 | -0.269704 | -0.116376 | 0.105994 | -0.301777 | -0.055862 | -0.049494 |


In [26]:
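# Caution: X_full has 301 columns here (all_text + 300 embeddings), so the positional slice
# 11:311 selects w2v_10 .. w2v_299 and leaves out w2v_0 .. w2v_9.
w2v_col_names = X_full.columns[11:311].values.tolist()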

In [27]:
w2v_col_names

['w2v_10',
 'w2v_11',
 'w2v_12',
 'w2v_13',
 'w2v_14',
 'w2v_15',
 'w2v_16',
 'w2v_17',
 'w2v_18',
 'w2v_19',
 'w2v_20',
 'w2v_21',
 'w2v_22',
 'w2v_23',
 'w2v_24',
 'w2v_25',
 'w2v_26',
 'w2v_27',
 'w2v_28',
 'w2v_29',
 'w2v_30',
 'w2v_31',
 'w2v_32',
 'w2v_33',
 'w2v_34',
 'w2v_35',
 'w2v_36',
 'w2v_37',
 'w2v_38',
 'w2v_39',
 'w2v_40',
 'w2v_41',
 'w2v_42',
 'w2v_43',
 'w2v_44',
 'w2v_45',
 'w2v_46',
 'w2v_47',
 'w2v_48',
 'w2v_49',
 'w2v_50',
 'w2v_51',
 'w2v_52',
 'w2v_53',
 'w2v_54',
 'w2v_55',
 'w2v_56',
 'w2v_57',
 'w2v_58',
 'w2v_59',
 'w2v_60',
 'w2v_61',
 'w2v_62',
 'w2v_63',
 'w2v_64',
 'w2v_65',
 'w2v_66',
 'w2v_67',
 'w2v_68',
 'w2v_69',
 'w2v_70',
 'w2v_71',
 'w2v_72',
 'w2v_73',
 'w2v_74',
 'w2v_75',
 'w2v_76',
 'w2v_77',
 'w2v_78',
 'w2v_79',
 'w2v_80',
 'w2v_81',
 'w2v_82',
 'w2v_83',
 'w2v_84',
 'w2v_85',
 'w2v_86',
 'w2v_87',
 'w2v_88',
 'w2v_89',
 'w2v_90',
 'w2v_91',
 'w2v_92',
 'w2v_93',
 'w2v_94',
 'w2v_95',
 'w2v_96',
 'w2v_97',
 'w2v_98',
 'w2v_99',
 'w2v_100'

In [28]:
X_train, X_test, y_train, y_test = train_test_split(X_full, y, random_state=1)

In [29]:
preprocess_glove = make_pipeline(
    SimpleImputer(strategy='median'),
)

preprocess_word = make_pipeline(
    CountVectorizer(ngram_range=(1, 2), min_df = 1),
    TfidfTransformer()
)

In [30]:
preprocess = make_column_transformer(
    (preprocess_glove, w2v_col_names),
    (preprocess_word, 'all_text'))

In [31]:
model_linear_regression = make_pipeline(preprocess, LinearRegression())

In [32]:
X_train.shape

(40878, 301)

In [33]:
y_train.shape

(40878,)

In [34]:
scores = cross_val_score(model_linear_regression, X_train, y_train)
np.mean(scores)

0.7496046199731424

In [35]:
model_linear_regression.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('columntransformer',
                 ColumnTransformer(n_jobs=None, remainder='drop',
                                   sparse_threshold=0.3,
                                   transformer_weights=None,
                                   transformers=[('pipeline-1',
                                                  Pipeline(memory=None,
                                                           steps=[('simpleimputer',
                                                                   SimpleImputer(add_indicator=False,
                                                                                 copy=True,
                                                                                 fill_value=None,
                                                                                 missing_values=nan,
                                                                                 strategy='median',
                                           

In [36]:
model_linear_regression.score(X_test, y_test)

0.7613436995899436

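The cross-validation and test scores actually decreased slightly when combining the Word2Vec and BOW features (cross-validation R² of 0.750 and test R² of 0.761, versus 0.767 and 0.775 for our best model from part 1). Note that this model uses only the review text, without the price and categorical features of the part 1 model.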
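
One way to avoid the intermediate `X_full` frame is to compute both representations from the raw text inside a single pipeline. The sketch below is an illustration under stated assumptions rather than the model evaluated above: `MeanEmbeddingVectorizer` is a hypothetical transformer, the two-dimensional `toy_vectors` stand in for real pretrained vectors, and the texts and scores are invented.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline, make_union

class MeanEmbeddingVectorizer(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: maps each text to the mean of its known token vectors."""

    def __init__(self, vectors, dim):
        self.vectors = vectors
        self.dim = dim

    def fit(self, X, y=None):
        return self  # nothing to learn; the vectors are pretrained

    def transform(self, X):
        rows = []
        for text in X:
            found = [self.vectors[t] for t in text.lower().split() if t in self.vectors]
            rows.append(np.mean(found, axis=0) if found else np.zeros(self.dim))
        return np.vstack(rows)

# Invented toy data
toy_vectors = {"ripe": np.array([1.0, 0.0]), "thin": np.array([0.0, 1.0])}
toy_texts = ["ripe cherry fruit", "thin and sour", "ripe oak", "thin finish"]
toy_points = [92, 84, 90, 85]

# Dense embedding columns and sparse TF-IDF columns are stacked side by side
features = make_union(
    MeanEmbeddingVectorizer(toy_vectors, dim=2),
    TfidfVectorizer(ngram_range=(1, 2)),
)
toy_model = make_pipeline(features, LinearRegression())
toy_model.fit(toy_texts, toy_points)
toy_features = features.fit_transform(toy_texts)
```

Because both feature blocks are derived from the same input rows, there is no separate index to keep in sync, and cross-validation re-fits the TF-IDF vocabulary on each training fold.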

### Word2Vec + BOW + Other available features

In [37]:
y = df['points']
X = df.drop(columns=['country','points', 'taster_twitter_handle'])

In [38]:
X.shape

(54504, 11)

In [39]:
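# Caution: join aligns on index labels. X keeps df_all's row labels, while this new frame is
# labelled 0..54503; passing index=X.index to pd.DataFrame would align by position.
X_full = X.join(pd.DataFrame(X_w2v, columns = ["w2v_%d" % (i) for i in range(300)]))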

In [40]:
X_train, X_test, y_train, y_test = train_test_split(X_full, y, random_state=1)

In [41]:
preprocess_continuous = make_pipeline(
    SimpleImputer(strategy='median'),
)

preprocess_glove = make_pipeline(
    SimpleImputer(strategy='median'),
    StandardScaler()
)

preprocess_target = make_pipeline(
    TargetEncoder(),
)

preprocess_dummy = make_pipeline(
    SimpleImputer(strategy='constant', fill_value="NA"),
    OneHotEncoder(handle_unknown='ignore')
)

preprocess_word = make_pipeline(
    CountVectorizer(ngram_range=(1, 2), min_df = 1),
    TfidfTransformer()
)

In [42]:
preprocess = make_column_transformer(
    (preprocess_glove, w2v_col_names),
    (preprocess_continuous, ['price']),
    (preprocess_target, ['designation','region_1','variety','winery']),
    (preprocess_dummy, ['province','region_2','taster_name']),
    (preprocess_word, 'all_text'))

In [43]:
model_linear_regression = make_pipeline(preprocess, LinearRegression())

In [44]:
X_train.shape

(40878, 311)

In [45]:
scores = cross_val_score(model_linear_regression, X_train, y_train)
np.mean(scores)

0.7672678103397905

In [46]:
model_linear_regression.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('columntransformer',
                 ColumnTransformer(n_jobs=None, remainder='drop',
                                   sparse_threshold=0.3,
                                   transformer_weights=None,
                                   transformers=[('pipeline-1',
                                                  Pipeline(memory=None,
                                                           steps=[('simpleimputer',
                                                                   SimpleImputer(add_indicator=False,
                                                                                 copy=True,
                                                                                 fill_value=None,
                                                                                 missing_values=nan,
                                                                                 strategy='median',
                                           

In [47]:
model_linear_regression.score(X_test, y_test)

0.7753824251271151

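The model improved slightly after adding the other available features (cross-validation R² of 0.7673, test R² of 0.7754), but it is essentially tied with our best model from part 1 (0.7672 and 0.7754), so the embeddings did not yield a meaningful improvement.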