**Applied Machine Learning - Homework 4 - Task2**

Amaury Sudrie (UNI: AS5961)
Maxime Tchibozo (UNI: MT3390)

Foreword: Some of the methods used in this notebook are highly computationally and memory intensive. To run this code, we used Google Colab notebooks, and we encourage you to do the same.

In [0]:
from google.colab import drive
drive.mount('/content/drive')

import os
os.chdir("drive/My Drive/AML/")

In [0]:
import pandas as pd
import numpy as np
import scipy as sp
import scipy.sparse  # ensure the sparse submodule is loaded for sp.sparse.hstack
import matplotlib.pyplot as plt

In [0]:
df = pd.read_csv('winemag-data-130k-v2.csv')

In [0]:
df.head()

Unnamed: 0.1,Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
1,1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos
2,2,US,"Tart and snappy, the flavors of lime flesh and...",,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm
3,3,US,"Pineapple rind, lemon pith and orange blossom ...",Reserve Late Harvest,87,13.0,Michigan,Lake Michigan Shore,,Alexander Peartree,,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian
4,4,US,"Much like the regular bottling from 2012, this...",Vintner's Reserve Wild Child Block,87,65.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks


## **Question 2.1**
Use a pretrained word-embedding (word2vec, glove or fasttext) for featurization instead of the bag-of-words model. Does this improve classification?

We will download the Google News pre-trained model.

It includes word vectors for a vocabulary of 3 million words and phrases, trained on roughly 100 billion words from a Google News dataset. Each vector has 300 dimensions.

It can be downloaded at the following link.
https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit
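
To give a rough sense of why loading this model is memory intensive, the sketch below estimates the size of the raw embedding matrix from the figures quoted above. It assumes each component is stored as a 4-byte float32 (an assumption about the storage format, not something stated in this notebook) and ignores the overhead of the vocabulary index.

```python
# Rough memory estimate for the pre-trained embedding matrix.
VOCAB_SIZE = 3_000_000   # words and phrases, as quoted above
VECTOR_DIM = 300         # dimensions per vector, as quoted above
BYTES_PER_VALUE = 4      # assumption: float32 storage

def embedding_matrix_gb(vocab_size, dim, bytes_per_value):
    """Size of a dense vocab_size x dim matrix in gigabytes (10**9 bytes)."""
    return vocab_size * dim * bytes_per_value / 1e9

size_gb = embedding_matrix_gb(VOCAB_SIZE, VECTOR_DIM, BYTES_PER_VALUE)
print(f"Approximate size of the vectors alone: {size_gb:.1f} GB")
```

Under that assumption the vectors alone take a few gigabytes of RAM, which is consistent with the recommendation to run the notebook on Google Colab.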

In [0]:
from gensim.models import Word2Vec, KeyedVectors
from gensim.parsing.preprocessing import remove_stopwords

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline


In [0]:
model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

In [0]:
X_train, X_test, y_train, y_test = train_test_split(df[["description", "designation", "title"]],df["points"])

We preprocess the text features to ensure uniformity as we embed the documents to the feature space.

In [0]:
def embed(string, model):
  # Remove stopwords
  text = [remove_stopwords(string)]

  # Tokenize the text (CountVectorizer lowercases and keeps each distinct token once)
  vect = CountVectorizer()
  try:
    vect.fit(text)
    # get_feature_names_out() requires scikit-learn >= 1.0; older versions use get_feature_names()
    tokens = vect.get_feature_names_out()
  except ValueError:
    # If the vocabulary for a document is empty (perhaps because of stopword removal), we encode it as a zero vector
    tokens = []

  # Embed each token with the model and compute the mean vector of the entire document
  embedded_vect = np.zeros((300,))
  for token in tokens:
    try:
      embedded_vect += model.get_vector(token)
    except KeyError:
      pass  # tokens not in the Google model contribute a zero vector
  if len(tokens) > 0:
    embedded_vect /= len(tokens)

  return pd.Series(embedded_vect)


Once each word of a document has been embedded in the feature space, we take the vector of the document to be the mean of all these word vectors.

This is a very simple assumption, and it discards all information about word order: two documents are encoded by the same vector as long as they contain the same set of words.
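
The order-invariance of the mean vector can be checked on a toy example. The sketch below uses a hypothetical 3-dimensional embedding table (toy_vectors) and a simplified stand-in for the embed function (mean_embedding); neither comes from the Google News model.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings, for illustration only.
toy_vectors = {
    "dry":   np.array([1.0, 0.0, 2.0]),
    "not":   np.array([0.0, 3.0, 1.0]),
    "sweet": np.array([2.0, 1.0, 0.0]),
}

def mean_embedding(tokens, vectors, dim=3):
    """Average the vectors of the distinct tokens; unknown tokens count as zeros."""
    unique_tokens = sorted(set(tokens))  # mirrors a vocabulary of distinct words
    total = np.zeros(dim)
    for token in unique_tokens:
        total += vectors.get(token, np.zeros(dim))
    return total / len(unique_tokens) if unique_tokens else total

doc_a = "dry not sweet".split()
doc_b = "sweet not dry".split()

vec_a = mean_embedding(doc_a, toy_vectors)
vec_b = mean_embedding(doc_b, toy_vectors)
print(np.allclose(vec_a, vec_b))  # the two word orders give the same vector
```

Both token lists contain the same three words, so the averaged vector is identical even though the two phrases read differently.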

The embed function can be applied directly to a DataFrame column to extract the 300 embedded features associated with that column. We therefore compute three DataFrames of 300 features each, representing **description**, **designation** and **title** respectively in the word-embedding space.

We will then apply the base model from Task 1 to this new dataset. 

In [0]:
description_embed_train = X_train.description.fillna('missing').apply(lambda x:embed(str(x),model))
designation_embed_train = X_train.designation.fillna('missing').apply(lambda x:embed(str(x),model))
title_embed_train = X_train.title.fillna('missing').apply(lambda x: embed(str(x),model))

description_embed_test = X_test.description.fillna('missing').apply(lambda x:embed(str(x),model))
designation_embed_test = X_test.designation.fillna('missing').apply(lambda x:embed(str(x),model))
title_embed_test = X_test.title.fillna('missing').apply(lambda x: embed(str(x),model))


In [0]:
X1_train = pd.concat([description_embed_train,designation_embed_train,title_embed_train],axis=1,ignore_index=True)
print(np.shape(X1_train))
X1_test = pd.concat([description_embed_test,designation_embed_test,title_embed_test],axis=1,ignore_index=True)
print(np.shape(X1_test))

(97478, 900)
(32493, 900)


In [0]:
model_LR = LinearRegression()
scores = cross_val_score(model_LR, X1_train, y_train)
model_LR.fit(X1_train, y_train)
print("Score Linear Regression", np.mean(scores))

Score Linear Regression 0.5605135324326721


In [0]:
print("Score on test", model_LR.score(X1_test,y_test))

Score on test 0.5628322699546746


Word embedding using Word2Vec on only **description**, **designation** and **title** gives a better model (R² of about 0.56 on both train cross-validation and test) than the baseline model (R² of 0.43 on train cross-validation and 0.43 on test), which used all the features except **description**, **designation** and **title**.

However, it does not beat the simple Bag-of-Words featurization (R² of 0.72 on train cross-validation and 0.72 on test) or the Tf-idf featurization (R² of about 0.77 on train cross-validation).
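
The scores compared here come from the default scoring of scikit-learn regressors, which is the coefficient of determination R² rather than a classification accuracy. As a minimal sketch of what that number measures, the helper below (r2_score_manual, a hypothetical name) computes R² by hand on made-up point values.

```python
import numpy as np

def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # variance around the mean
    return 1.0 - ss_res / ss_tot

points = [85, 87, 90, 92, 96]  # hypothetical wine scores

perfect = r2_score_manual(points, points)
mean_only = r2_score_manual(points, [np.mean(points)] * len(points))
print(perfect, mean_only)
```

A model that always predicts the mean score gets R² = 0 and a perfect model gets R² = 1, so a value of 0.56 means that roughly 56% of the variance in points is explained.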

There are a few important caveats here. First, the Google pre-trained Word2Vec model was trained on Google News text and might not be well adapted to wine vocabulary. Second, the embedded dataset contains only 900 features (three text columns times the 300 dimensions of the pre-trained embedding space), whereas the Bag-of-Words model has 4000 dimensions.

This difference in feature space size is particularly important for the Linear Regression model, where adding non-collinear dimensions greatly improves the fit.

We should also note that we used a simple method to embed documents, taking the mean vector of all their words. Replacing this assumption with an embedding method designed for documents rather than single words could give better results. Doc2Vec is an example of a different way to embed documents while keeping some of the structure related to word order.
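
The effect of feature count on a least-squares fit can be illustrated with synthetic data. The sketch below is not the wine dataset: it fits ordinary least squares on random features that are unrelated to a random target (all names and sizes are hypothetical) and reports the in-sample R² as more columns are used.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 200
y = rng.normal(size=n_samples)          # hypothetical target, pure noise
X = rng.normal(size=(n_samples, 150))   # hypothetical features unrelated to y

def train_r2(X, y):
    """In-sample R^2 of an ordinary least squares fit with an intercept."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ coef
    return 1.0 - residuals @ residuals / np.sum((y - y.mean()) ** 2)

# Nested feature sets: each fit uses a superset of the previous columns.
for k in (10, 50, 150):
    print(k, round(train_r2(X[:, :k], y), 3))
```

The in-sample R² never decreases as columns are added, even though these features carry no signal. This only concerns the training fit; a cross-validated or test score can stagnate or drop when the extra dimensions are uninformative.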


## **Question 2.1**

How about combining the embedded words with the BoW model?

X_train contains the original description, designation and title features.
X1_train contains the Word2Vec encoded features.
We concatenate the original and Word2Vec features, and vectorize the original text features with a Bag-of-Words model followed by a Tf-idf transform.

In [0]:
X2_train = pd.concat([X_train,X1_train],axis=1,ignore_index=True)
X2_train = X2_train.rename(columns={0:"description",1:"designation",2:"title"})
X2_train = X2_train.fillna('missing')
X2_train.head()


Unnamed: 0,description,designation,title,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,...,863,864,865,866,867,868,869,870,871,872,873,874,875,876,877,878,879,880,881,882,883,884,885,886,887,888,889,890,891,892,893,894,895,896,897,898,899,900,901,902
51229,"Made in the popular style, this Chardonnay sho...",Light Horse,Jamieson Ranch 2012 Light Horse Chardonnay (Ca...,0.004538,0.008447,-0.034229,0.128094,0.040683,-0.029161,0.080147,-0.143331,0.079052,0.154678,0.00816,-0.136755,-0.059067,0.032295,-0.073273,0.151741,-0.011379,0.114426,-0.039024,-0.063713,-0.038864,0.175204,0.071632,-0.006744,-0.028549,-0.093174,-0.060565,0.090363,-0.047951,0.006431,-0.020082,0.071411,0.040188,0.075432,-0.040283,0.012189,0.062436,...,-0.084124,-0.050842,0.055878,-0.000355,0.140416,-0.073469,-0.11935,-0.129953,-0.075474,0.034319,-0.010931,0.001808,-0.003069,0.06829,0.064218,0.046735,-0.03615,-0.139579,-0.068346,0.072021,0.036761,0.016593,0.125035,0.004813,-0.06536,-0.029084,-0.026751,0.042794,-0.003963,0.111921,-0.010254,0.012442,-0.006627,0.126831,0.057199,-0.026437,-0.036133,-0.014439,0.037354,-0.001011
128703,Really beautiful. Walks right up to the tartne...,Dutton Ranch-Sanchietti Vineyard,Dutton-Goldfield 2007 Dutton Ranch-Sanchietti ...,0.015193,0.062552,0.023916,0.111283,-0.036816,-0.009146,0.095547,-0.107268,-0.024742,0.17627,0.007625,-0.111667,-0.083593,-0.003375,-0.128272,0.077013,-0.101976,0.158099,0.003989,-0.160057,-0.071316,0.067333,0.055727,-0.041492,0.041793,-0.048252,-0.069274,0.124249,-0.063909,-0.019145,-0.025027,0.06126,0.055101,0.015242,-0.099138,-0.035468,0.005484,...,-0.121436,-0.100146,0.020435,0.021822,0.165112,-0.002246,-0.163538,-0.118799,-0.100287,-0.081067,-0.048442,-0.004248,-0.04292,0.02876,0.102533,0.084558,0.069409,-0.122241,-0.041772,0.100562,0.036133,-0.030676,0.139648,-0.068652,0.035107,-0.019983,-0.05499,0.048804,-0.089868,0.178711,-0.074121,-0.044608,-0.067029,0.087268,0.056378,-0.1021,-0.010669,-0.032373,0.130957,0.079175
26320,Fresh-cut grass and gooseberry dominate on the...,missing,Apriori 2014 Sauvignon Blanc (Napa Valley),0.035358,0.047158,0.029969,0.058912,-0.053372,-0.023125,0.029922,-0.116879,0.043968,0.21285,-0.048026,-0.147433,-0.035764,0.0233,-0.121549,0.077498,-0.003098,0.150106,0.010394,-0.055104,0.011552,0.115344,0.079547,-0.048332,0.002214,-0.043152,-0.080736,0.11294,0.034655,0.053327,-0.015981,0.013526,0.011946,-0.000834,-0.001953,-0.051216,0.011668,...,-0.049723,-0.003988,0.012939,0.007487,0.036791,-0.051188,-0.150736,-0.055155,-0.099681,-0.112528,-0.130371,-0.05424,-0.057373,0.059408,0.153727,0.037354,-0.031209,-0.096649,-0.13505,0.114624,-0.134481,-0.001912,0.149902,-0.019613,0.080302,0.023275,-0.043416,0.00946,-0.110921,0.144938,0.06665,-0.065069,0.021729,0.113057,0.001506,-0.049845,-0.009766,-0.102132,0.099447,0.098226
86520,"Soft, sweet and robust, with upfront cherry pi...",missing,Greg Norman California Estates 2011 Cabernet S...,0.006704,0.004667,-0.001961,0.123652,0.032538,-0.052241,0.104507,-0.082836,0.010951,0.208471,-0.034405,-0.08224,-0.051816,-0.031211,-0.155997,0.128997,-0.073486,0.11073,-0.090595,-0.129211,-0.119261,0.053972,0.099284,-0.037384,0.033061,-0.065063,-0.106211,0.111762,-0.057752,-0.006688,-0.012605,0.037445,0.055527,-0.022207,-0.090266,-0.027528,0.072769,...,-0.025608,-0.012899,0.08214,-0.048991,-0.040827,-0.093316,-0.125956,-0.044542,-0.006402,-0.13482,-0.120524,0.036458,-0.007582,0.138563,0.088881,0.047404,-0.050646,-0.044881,-0.105876,0.167508,-0.017666,0.020725,0.154894,-0.04193,0.05759,-0.018656,-0.073676,0.003916,-0.033088,0.104709,0.116374,-0.012207,0.043457,0.09139,0.004293,-0.116333,-0.045844,-0.130046,0.036187,0.089949
103889,"This is an ebullient wine, full of herb, grape...",missing,Lava Cap 2016 Sauvignon Blanc (El Dorado),0.030413,-0.000655,0.031645,0.097782,0.008959,-0.029075,0.025143,-0.085933,0.00645,0.167956,-0.07179,-0.111694,-0.050964,0.031629,-0.108321,0.182315,-0.040001,0.241294,0.071578,-0.080978,-0.014779,0.123214,0.064564,-0.063891,0.018709,-0.093374,-0.039122,0.113301,-0.071429,0.026878,-0.033732,0.030345,0.045631,0.022943,-0.093063,-0.019426,-0.003678,...,0.059396,0.067836,-0.006383,-0.005136,-0.056608,-0.046392,-0.153041,-0.077515,-0.003627,-0.173619,-0.172799,-0.030169,-0.005511,0.049395,0.172355,0.174159,-0.18464,-0.12888,-0.062919,0.008048,-0.112113,0.092041,0.113281,-0.03418,0.043326,-0.059047,-0.020752,0.041574,-0.028338,0.089704,0.099854,-0.108538,0.011719,0.025426,-0.010393,0.069188,0.106445,-0.044085,0.117676,0.163644


In [0]:
tfidf1 = make_pipeline(CountVectorizer(min_df=2, max_features=2000), TfidfTransformer())
tfidf2 = make_pipeline(CountVectorizer(min_df=2, max_features=1000, max_df=0.95), TfidfTransformer())
tfidf3 = make_pipeline(CountVectorizer(min_df=2, max_features=1000, max_df=0.95), TfidfTransformer())

BoW_train = sp.sparse.hstack([
    tfidf1.fit_transform(X2_train["description"]),
    tfidf2.fit_transform(X2_train["designation"]),
    tfidf3.fit_transform(X2_train["title"]),
    description_embed_train,
    designation_embed_train,
    title_embed_train
])

BoW_train

<97478x4900 sparse matrix of type '<class 'numpy.float64'>'
	with 86834902 stored elements in COOrdinate format>

This sparse matrix holds about 87 million stored elements, largely because the 900 Word2Vec columns are dense. For this reason, the following Linear Regression code takes a long time to execute.
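
A quick back-of-the-envelope calculation puts this size in context. The figures below are copied from the matrix summary above; the per-element byte count is an assumption (one float64 value plus two 32-bit indices per stored element in COO format).

```python
# Figures copied from the sparse matrix summary.
N_ROWS, N_COLS = 97_478, 4_900
STORED = 86_834_902
N_EMBED_COLS = 900   # 3 text columns x 300 Word2Vec dimensions

total_cells = N_ROWS * N_COLS
density = STORED / total_cells
embed_cells = N_ROWS * N_EMBED_COLS   # upper bound on non-zeros from the dense block

# Assumption: COO keeps one float64 value and two int32 indices per element.
BYTES_PER_ELEMENT = 8 + 4 + 4
approx_gb = STORED * BYTES_PER_ELEMENT / 1e9

print(f"density: {density:.1%}")
print(f"dense embedding block: {embed_cells:,} cells")
print(f"approximate COO memory: {approx_gb:.2f} GB")
```

The dense embedding block alone has about as many cells as the matrix has stored elements, so the matrix is sparse in name only once the Word2Vec columns are stacked in.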

In [0]:
LR = LinearRegression()
scores = cross_val_score(LR, BoW_train, y_train)

print("Score Linear Regression", np.mean(scores))

Score Linear Regression 0.7374600040355238


Using both the Word2Vec embedding and the Bag-of-Words model yields a better score (R² of about 0.74 on cross-validation) than simple Bag-of-Words. However, this score is only a few percentage points better than simple Bag-of-Words with Tf-idf, yet the model is much more computationally intensive. This method could be useful if we want the best possible score regardless of computational cost.

For general applications, however, it is more efficient to use Bag-of-Words with Tf-idf and the non-text features than Tf-idf and Word2Vec features.