# Sentiment Analysis using pre-trained embeddings

We use GloVe embeddings pre-trained on Twitter (2B tweets, 27B tokens, 1.2M vocab, uncased; 25d, 50d, 100d, and 200d vectors) as features for sentiment analysis. The embeddings are available at https://nlp.stanford.edu/projects/glove/.

## Data Cleaning
We first import the data and clean it by removing punctuation and stopwords, then tokenizing and lemmatizing each review.

In [3]:
%matplotlib inline
import spacy
import pandas as pd
import numpy as np
import re
import string

# NOTE: adjust the number of imported rows here if needed
reviews = pd.read_csv('./parsed/new_data.csv')

idx1 = [True if i == 1 else False for i in reviews['stars']]
idx5 = [True if i == 5 else False for i in reviews['stars']]
y = reviews['stars'].tolist()
y = [-1 if star in [1,2] else 1 for star in y]
reviews = pd.Series(reviews['text'])
reviews = reviews.tolist()

%matplotlib inline
import pandas as pd
import numpy as np
import spacy
import re
import string

# NOTE: adjust the number of imported rows here if needed (nrows=None reads everything)
reviews = pd.read_csv('./parsed/review_db.gz', compression='gzip', nrows=None, 
                      skiprows=[1871819, 3304881], dtype={'stars': np.int8, 'text': str})
idx1 = [True if i == 1 else False for i in reviews['stars']]
idx5 = [True if i == 5 else False for i in reviews['stars']]
# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
reviews = pd.concat([reviews[idx1].sample(n=10000), reviews[idx5].sample(n=10000)])
y = reviews['stars'].tolist()
reviews = pd.Series(reviews.iloc[:,1])
reviews = reviews.tolist()

In [5]:
# load spacy en
nlp = spacy.load('en')

# load stopwords
# inspired by https://stackoverflow.com/a/3510894
# read_csv's squeeze argument was removed in pandas 2.0; squeeze the single column afterwards
stopwords = pd.read_csv("./nlp/stopwords.txt", header=None, encoding="UTF-8").squeeze("columns")
stopwords = np.asarray(stopwords)
pattern = "|".join(stopwords)
stopwords = re.compile("\\b(" + pattern + ")\\W", re.I)
punctuation = re.compile(r'[^\w_ ]+', re.I)

# clean the text (drop punctuation, remove stopwords, lemmatize, lowercase)
x = []
for review in reviews:
    review = punctuation.sub("", review)
    review = stopwords.sub("", review)
    parsed_text = nlp(review)
    intermediate_text = []
    for token in parsed_text:
        intermediate_text.append(token.lemma_.lower())
    words = " ".join(intermediate_text)
    # NOTE: reviews that end up empty are skipped, but y is not filtered here,
    # so x and y stay aligned only if no review is dropped
    if words != '':
        x.append(words)

# print the first few cleaned reviews
for i, review in enumerate(x[:5]):
    print(i)
    print(review)
    
# clear memory
reviews = None

0
useless staff order correct reasonable time frame management need rehire staff
1
go boyfriend week order sizzle plate special fish   assorted seafood boil season salty compare near yonge college fish really salty teriyaki sauce will not go location hot like really try hard kitchen salty season shrimp devein bring boyfriend server eye boyfriend nerve gently touch handing debit machine lol hope help
2
go fountain attach recess order burger look pretty good absolutely cover sauce personally awful list menu burger like completely swim stuff take bite burger like taste sauce basically mayonnaise sandwich server return later ask alright explain sauce bad quietly apologized take plate away return minute later billin service mediocre inattentive unpleasant hurry despite table restaurant end pay leave unsatisfiedwill back
3
agree employee tim horton location bit rude today 10 pm   jan 16 2016 go tim horton order hot chocolate white female extremely rude racist give attitude handle drink rough
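
The stopword pattern used in the cleaning cell ends in `\W`, so it only matches a stopword that is followed by a non-word character. The sketch below illustrates that behaviour on a toy sentence and compares it with a variant that closes on a word boundary instead. The stop list `stop_list` and both pattern names are hypothetical and chosen only for illustration; the notebook reads its own list from `./nlp/stopwords.txt`.

```python
import re

# Hypothetical stop-word list (the notebook loads its list from a file)
stop_list = ["the", "a", "is", "and"]

# Same construction as the cleaning cell: word boundary, alternation, one non-word char
pattern_trailing = re.compile("\\b(" + "|".join(stop_list) + ")\\W", re.I)

# Variant (an assumption, not the notebook's pattern): close on a word boundary
pattern_boundary = re.compile(
    r"\b(?:" + "|".join(map(re.escape, stop_list)) + r")\b", re.I
)

text = "The food is great and the"

# The trailing "\W" consumes the following space, but misses a stopword at end of string
trailing = pattern_trailing.sub("", text)

# The boundary variant leaves the spaces behind, so collapse the whitespace afterwards
boundary = " ".join(pattern_boundary.sub("", text).split())

print(repr(trailing))
print(repr(boundary))
```

With the original pattern the final "the" survives because nothing follows it. Whether that matters depends on how often reviews end in a stopword.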

## Creating features
We then turn the Yelp reviews into vectors with the help of word embeddings. These embeddings are vectors pre-trained with the GloVe model on Twitter. We load them and, for each review, take the average of the embeddings of its individual tokens and assign it to the review. Hence, each Yelp review now has a feature vector built from the embeddings.

Reference: Cedric De Boom, Steven Van Canneyt, Thomas Demeester, Bart Dhoedt. Representation learning for very short texts using weighted word embedding aggregation. Pattern Recognition Letters; arXiv:1607.00570. See especially Tables 1 and 2. https://arxiv.org/pdf/1607.00570
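
The pooling idea can be sketched on toy data as follows. This is a minimal illustration, not the notebook's exact procedure: the embedding values, the helper `pool_review`, and its arguments are hypothetical, and the sketch pools only over tokens found in the vocabulary, whereas the notebook cells keep a zero row for every unknown token.

```python
import numpy as np

# Toy 3-dimensional embeddings (made-up values standing in for GloVe vectors)
toy_embeddings = {
    "good": np.array([1.0, 0.0, 2.0], dtype="float32"),
    "food": np.array([3.0, -2.0, 0.0], dtype="float32"),
}


def pool_review(review, embeddings, dim):
    """Return (mean, max|min|mean) features over the in-vocabulary tokens."""
    vectors = [embeddings[w] for w in review.split() if w in embeddings]
    if not vectors:
        # No known token at all: fall back to zero vectors
        return np.zeros(dim), np.zeros(3 * dim)
    stacked = np.vstack(vectors)
    mean = stacked.mean(axis=0)
    combined = np.concatenate([stacked.max(axis=0), stacked.min(axis=0), mean])
    return mean, combined


# "unknownword" is not in the toy vocabulary and is ignored by this sketch
mean_vec, combined_vec = pool_review("good food unknownword", toy_embeddings, dim=3)
print(mean_vec)
print(combined_vec)
```

The mean gives one vector of the embedding dimension per review, while the concatenated max, min, and mean give three times that dimension, which matches the 50/150 and 100/300 column counts used in the cells that follow.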

###  50 dimensional embeddings

In [6]:
# load GloVe
# Pre-process the embeddings
embeddings_index = {}

# We will use the 50-dimensional embedding vectors
with open("./nlp/glove.twitter.27B.50d.txt", encoding='UTF-8') as f:
    # Each row represents a word vector
    for line in f:
        values = line.split()
        # The first part is word
        word = values[0]
        # The rest are the embedding vector
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))

Found 1193514 word vectors.
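
For readers without the GloVe file at hand, the text layout that the loading cell relies on can be illustrated with an in-memory sample. The two lines below are made up; the only assumption taken from the cell above is that each line holds a token followed by its space-separated coefficients.

```python
import io

import numpy as np

# Two made-up lines in the assumed layout: token, then its coefficients
sample = io.StringIO("hello 0.1 -0.2 0.3\nworld 0.4 0.5 -0.6\n")

index = {}
for line in sample:
    # First field is the token, the remaining fields form the vector
    token, *coefs = line.rstrip().split(" ")
    index[token] = np.asarray(coefs, dtype="float32")

print(len(index), index["hello"].shape)
```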


In [30]:
# build per-review features by pooling the token embeddings:
# ave_matrix holds the mean; fmatrix holds max, min and mean concatenated
fmatrix = np.zeros((len(x), 150))
ave_matrix = np.zeros((len(x), 50))
for j, review in enumerate(x):
    words = review.split()
    embedding_matrix = np.zeros((len(words), 50))
    i = 0
    for word in words:
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector
            i += 1
    # rows for out-of-vocabulary words stay zero and are included in the pooling
    fmatrix[j] = np.concatenate([embedding_matrix.max(0), embedding_matrix.min(0), embedding_matrix.mean(0)])
    ave_matrix[j] = embedding_matrix.mean(0)

# clear memory
# x = None

###  100 dimensional embeddings

In [38]:
# load GloVe
# Pre-process the embeddings
embeddings_index = {}

# We will use the 100-dimensional embedding vectors
with open("./nlp/glove.twitter.27B.100d.txt", encoding='UTF-8') as f:
    # Each row represents a word vector
    for line in f:
        values = line.split()
        # The first part is word
        word = values[0]
        # The rest are the embedding vector
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))

Found 1193514 word vectors.


In [39]:
# build per-review features by pooling the token embeddings:
# ave_matrix100 holds the mean; fmatrix100 holds max, min and mean concatenated
fmatrix100 = np.zeros((len(x), 300))
ave_matrix100 = np.zeros((len(x), 100))
for j, review in enumerate(x):
    words = review.split()
    embedding_matrix = np.zeros((len(words), 100))
    i = 0
    for word in words:
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector
            i += 1
    # rows for out-of-vocabulary words stay zero and are included in the pooling
    fmatrix100[j] = np.concatenate([embedding_matrix.max(0), embedding_matrix.min(0), embedding_matrix.mean(0)])
    ave_matrix100[j] = embedding_matrix.mean(0)

# clear memory
# x = None

## Model Selection
We now train several models using the review embeddings as the feature matrix. For each model family we run a cross-validated grid search over its hyper-parameters, then compare the best cross-validation scores across families to select the final model.
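
As a small, self-contained illustration of this procedure, the sketch below runs a grid search for a single model family on synthetic data. The data from `make_classification`, the grid values, and the variable names are assumptions made for the example; they are not the notebook's features or its search grid.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the review feature matrix and labels
X, labels = make_classification(n_samples=200, n_features=10, random_state=11)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, random_state=11
)

# Cross-validated search over the regularization strength C
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": np.logspace(-2, 2, 5)},
    cv=5,
)
search.fit(X_train, y_train)

# best_score_ is the mean cross-validation accuracy of the best setting;
# score() evaluates the refit best estimator on held-out data
held_out_accuracy = search.score(X_test, y_test)
print(search.best_params_, search.best_score_, held_out_accuracy)
```

The cross-validation score is used for choosing between settings, while the held-out score is kept for the final check, which mirrors the split between the model selection cells and the test error cell.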

###  50 dimensional embeddings

#### Mean

In [None]:
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB

# split data into testing and training
xtrain, xtest, ytrain, ytest = train_test_split(ave_matrix, y, test_size=0.25, random_state=11)

models = [
         "Naive Bayes",
         "SVM",
         "Logistic Regression",
         "Random Forest",
         "Adaboost Forest"
        ]

classifiers = [
    GaussianNB(),
    SVC(),
    LogisticRegression(),
    RandomForestClassifier(),
    AdaBoostClassifier()
]

parameters = [
              {},                                                         # Gaussian Naive Bayes
              {'C': np.logspace(-4, 3, 5),                                # SVM
               'degree': range(1, 4),
               'kernel': ['poly']},
              {'C': np.logspace(-4, 3, 15)},                              # Logistic Regression
              {'n_estimators': np.linspace(100, 1000, 5).astype('int')},  # Random Forest
              {'n_estimators': np.linspace(100, 1500, 5).astype('int'),    # Adaboost Forest
               'learning_rate': np.linspace(0.001, 2, 3)},
             ]
optimal = {}
for model, clf, params in zip(models, classifiers, parameters):
    gscv = GridSearchCV(clf, param_grid=params, cv=10, n_jobs=-1)
    gscv = gscv.fit(xtrain, ytrain)
    score = gscv.best_score_
    optimal[model] = {'clf':gscv.best_estimator_, 'score':score}
    print("{} score: {}".format(model, score))

In [18]:
optimal["Adaboost Forest"]['clf']

AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=0.50075, n_estimators=537, random_state=None)

#### Max, Min, Mean

In [40]:
%time
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB

# split data into testing and training
xtrain, xtest, ytrain, ytest = train_test_split(fmatrix, y, test_size=0.25, random_state=11)

models = [
         "Naive Bayes",
         "SVM",
         "Logistic Regression",
         "Random Forest",
         "Adaboost Forest"
        ]

classifiers = [
    GaussianNB(),
    SVC(),
    LogisticRegression(),
    RandomForestClassifier(),
    AdaBoostClassifier()
]

parameters = [
              {},                                                         # Gaussian Naive Bayes
              {'C': np.logspace(-4, 3, 5),                                # SVM
               'degree': range(1, 4),
               'kernel': ['poly']},
              {'C': np.logspace(-4, 3, 15)},                              # Logistic Regression
              {'n_estimators': np.linspace(100, 1000, 5).astype('int')},  # Random Forest
              {'n_estimators': np.linspace(100, 1500, 5).astype('int'),    # Adaboost Forest
               'learning_rate': np.linspace(0.001, 2, 3)},
             ]
optimal = {}
for model, clf, params in zip(models, classifiers, parameters):
    gscv = GridSearchCV(clf, param_grid=params, cv=10, n_jobs=-1)
    gscv = gscv.fit(xtrain, ytrain)
    score = gscv.best_score_
    optimal[model] = {'clf':gscv.best_estimator_, 'score':score}
    print("{} score: {}".format(model, score))

Wall time: 0 ns
Naive Bayes score: 0.6897333333333333
SVM score: 0.8569333333333333
Logistic Regression score: 0.8556
Random Forest score: 0.8326666666666667
Adaboost Forest score: 0.8498666666666667


In [18]:
optimal["Adaboost Forest"]['clf']

AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=0.50075, n_estimators=537, random_state=None)

###  100 dimensional embeddings

#### Mean

In [None]:
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB

# split data into testing and training
xtrain, xtest, ytrain, ytest = train_test_split(ave_matrix100, y, test_size=0.25, random_state=11)

models = [
         "Naive Bayes",
         "SVM",
         "Logistic Regression",
         "Random Forest",
         "Adaboost Forest"
        ]

classifiers = [
    GaussianNB(),
    SVC(),
    LogisticRegression(),
    RandomForestClassifier(),
    AdaBoostClassifier()
]

parameters = [
              {},                                                         # Gaussian Naive Bayes
              {'C': np.logspace(-4, 3, 5),                                # SVM
               'degree': range(1, 4),
               'kernel': ['poly']},
              {'C': np.logspace(-4, 3, 15)},                              # Logistic Regression
              {'n_estimators': np.linspace(100, 1000, 5).astype('int')},  # Random Forest
              {'n_estimators': np.linspace(100, 1500, 5).astype('int'),    # Adaboost Forest
               'learning_rate': np.linspace(0.001, 2, 3)},
             ]
optimal = {}
for model, clf, params in zip(models, classifiers, parameters):
    gscv = GridSearchCV(clf, param_grid=params, cv=10, n_jobs=-1)
    gscv = gscv.fit(xtrain, ytrain)
    score = gscv.best_score_
    optimal[model] = {'clf':gscv.best_estimator_, 'score':score}
    print("{} score: {}".format(model, score))

In [18]:
optimal["Adaboost Forest"]['clf']

AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=0.50075, n_estimators=537, random_state=None)

#### Max, Min, Mean

In [40]:
%time
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB

# split data into testing and training
xtrain, xtest, ytrain, ytest = train_test_split(fmatrix100, y, test_size=0.25, random_state=11)

models = [
         "Naive Bayes",
         "SVM",
         "Logistic Regression",
         "Random Forest",
         "Adaboost Forest"
        ]

classifiers = [
    GaussianNB(),
    SVC(),
    LogisticRegression(),
    RandomForestClassifier(),
    AdaBoostClassifier()
]

parameters = [
              {},                                                         # Gaussian Naive Bayes
              {'C': np.logspace(-4, 3, 5),                                # SVM
               'degree': range(1, 4),
               'kernel': ['poly']},
              {'C': np.logspace(-4, 3, 15)},                              # Logistic Regression
              {'n_estimators': np.linspace(100, 1000, 5).astype('int')},  # Random Forest
              {'n_estimators': np.linspace(100, 1500, 5).astype('int'),    # Adaboost Forest
               'learning_rate': np.linspace(0.001, 2, 3)},
             ]
optimal = {}
for model, clf, params in zip(models, classifiers, parameters):
    gscv = GridSearchCV(clf, param_grid=params, cv=10, n_jobs=-1)
    gscv = gscv.fit(xtrain, ytrain)
    score = gscv.best_score_
    optimal[model] = {'clf':gscv.best_estimator_, 'score':score}
    print("{} score: {}".format(model, score))

Wall time: 0 ns
Naive Bayes score: 0.6897333333333333
SVM score: 0.8569333333333333
Logistic Regression score: 0.8556
Random Forest score: 0.8326666666666667
Adaboost Forest score: 0.8498666666666667


In [18]:
optimal["Adaboost Forest"]['clf']

AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=0.50075, n_estimators=537, random_state=None)

## Calculate test error
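We now evaluate the best model from the grid search on the held-out test set. It is the model family with the highest cross-validation score, refit with its best hyper-parameters. Note that `optimal`, `xtrain`, and `xtest` refer to whichever model selection cell was run last.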

In [None]:
# get the best model
bestModel = max(optimal, key=lambda name: optimal[name]['score'])
bestClf = optimal[bestModel]['clf']

# calculate the predictions
predtest = bestClf.predict(xtest)
predtrain = bestClf.predict(xtrain)

# report training and test accuracy
print('Training accuracy is: ' + str(accuracy_score(ytrain, predtrain)))
print('Test accuracy is: ' + str(accuracy_score(ytest, predtest)))