## Sentiment Analysis

(a) Download Sentiment Labelled Sentences Data Set. There are three data files under the root folder. yelp_labelled.txt, amazon_cells_labelled.txt and imdb_labelled.txt. Parse each file with the specifications in readme.txt. Are the labels balanced? If not, what’s the ratio between the two labels? Explain how you process these files.

In [1]:
import pandas as pd
import re

In [2]:
## read yelp
yelp = []
for line in open('sentiment labelled sentences/yelp_labelled.txt'):
    line = line.strip('\n')
    review = line[:-2]
    label = line[-1]
    yelp.append([review, int(label)])
yelp = pd.DataFrame(yelp, columns = ['Review', 'Sent'])

In [3]:
## read amazon
amazon = []
for line in open('sentiment labelled sentences/amazon_cells_labelled.txt'):
    line = line.strip('\n')
    review = line[:-2]
    label = line[-1]
    amazon.append([review, int(label)])
amazon = pd.DataFrame(amazon, columns = ['Review', 'Sent'])

In [4]:
## read imdb
imdb = []
for line in open('sentiment labelled sentences/imdb_labelled.txt'):
    line = line.strip('\n')
    review = line[:-2]
    label = line[-1]
    imdb.append([review, int(label)])
imdb = pd.DataFrame(imdb, columns = ['Review', 'Sent'])

Check label ratio:

In [5]:
r_y = (yelp.Sent == 1).sum()/ (len(yelp) - (yelp.Sent == 1).sum())
print('The ratio between positive and negative reviews in yelp are ', r_y)

The ratio between positive and negative reviews in yelp are  1.0


In [6]:
r_a = (amazon.Sent == 1).sum()/ (len(amazon) - (amazon.Sent == 1).sum())
print('The ratio between positive and negative reviews in amazon are ', r_a)

The ratio between positive and negative reviews in amazon are  1.0


In [7]:
r_i = (imdb.Sent == 1).sum()/ (len(imdb) - (imdb.Sent == 1).sum())
print('The ratio between positive and negative reviews in imdb are ', r_i)

The ratio between positive and negative reviews in imdb are  1.0


So, the labels in each file are balanced.

(b) Pick your preprocessing strategy. Since these sentences are online reviews, they may con- tain significant amounts of noise and garbage. You may or may not want to do one or all of the following. Explain the reasons for each of your decision (why or why not).

In [8]:
yelp.head(10)

Unnamed: 0,Review,Sent
0,Wow... Loved this place.,1
1,Crust is not good.,0
2,Not tasty and the texture was just nasty.,0
3,Stopped by during the late May bank holiday of...,1
4,The selection on the menu was great and so wer...,1
5,Now I am getting angry and I want my damn pho.,0
6,Honeslty it didn't taste THAT fresh.),0
7,The potatoes were like rubber and you could te...,0
8,The fries were great too.,1
9,A great touch.,1


In [9]:
import string
import re
from nltk import *
from nltk.stem.snowball import SnowballStemmer as SBS
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

In [10]:
##Lemmatization 
def lem(s):
    ps = SBS('english')
    ls_stem = [ps.stem(word) for word in s.split()]
    return ' '.join(ls_stem)

In [11]:
## remove the certain list of words in the string
def rmv(s, ls):
    return ' '.join(filter(lambda s: s not in ls, s.split()))

In [12]:
def preprocess(df):
    data = df
    
    ## remove punctuation
    punc = '[' + string.punctuation + ']'
    data.Review = data.Review.apply(lambda x: re.sub(punc, '', x))
    
    ## lowercase
    data.Review = data.Review.apply(lambda x: x.lower())
    
    ## remove stopwords
    sw = stopwords.words('english')
    data.Review = data.Review.apply(lambda x: rmv(x, sw))
    
    ## Lemmatization 
    data.Review = data.Review.apply(lambda x: lem(x))
    
    return data
yelp = preprocess(yelp)
amazon = preprocess(amazon)
imdb = preprocess(imdb)
yelp.head()

Unnamed: 0,Review,Sent
0,wow love place,1
1,crust good,0
2,tasti textur nasti,0
3,stop late may bank holiday rick steve recommen...,1
4,select menu great price,1


Reason: <br>
* Lowercase: Since the uppercase words and lowercase words have the exactly same meaning in a sentence.
* Lemmatization: The words such as run, running, ran, have similar meaning. It is easier to train a model if all words are lemmatized. 
* Strip punctuation: punctuations have very small influence on the sentiment of a sentence. 
* Strip stop words: Getting rid of stops words can help the model to better understand important words such as adj, nouns. 

(c) Split training and testing set. In this assignment, for each file, please use the first 400 in- stances for each label as the training set and the remaining 100 instances as testing set. In total, there are 2400 reviews for training and 600 reviews for testing.

In [13]:
def mysplit(df):
    df_1 = df[df.Sent == 1]
    df_0 = df[df.Sent == 0]
    train1 = df_1[:400]
    test1 = df_1[400:]
    train0 = df_0[:400]
    test0 = df_0[400:]
    return train1, test1, train0, test0

In [14]:
a_train1, a_test1, a_train0, a_test0 = mysplit(amazon)
y_train1, y_test1, y_train0, y_test0 = mysplit(yelp)
i_train1, i_test1, i_train0, i_test0 = mysplit(imdb)

In [15]:
trains = [a_train1, a_train0, y_train1, y_train0, i_train1, i_train0]
tests = [a_test1, a_test0, y_test1, y_test0, i_test1, i_test1]
train = pd.concat(trains)
test = pd.concat(tests)

train.reset_index(inplace = True)
train.drop('index',axis = 1, inplace = True)
test.reset_index(inplace = True)
test.drop('index',axis = 1, inplace = True)

In [16]:
train.shape, test.shape

((2400, 2), (600, 2))

(d) Bag of Words model. Extract features and then represent each review using bag of words model, i.e., every word in the review becomes its own element in a feature vector. In order to do this, first, make one pass through all the reviews in the training set (Explain why we can’t use testing set at this point) and build a dictionary of unique words. Then, make another pass through the review in both the training set and testing set and count up the occurrences of each word in your dictionary. The ith element of a review’s feature vector is the number of occurrences of the i th dictionary word in the review. Implement the bag of words model and report feature vectors of any two reviews in the training set.

In [17]:
import numpy as np
## dictionary
dic = np.array(list(set(' '.join(train.Review).split())))
print('There are ', len(dic), ' unique words in the training data')

There are  3629  unique words in the training data


* The reason we can't use test data for creating dictionary is that we need use training data to train a model and training data may not have all the words that test data has.

In [18]:
def bag(s):
    a = s.split()
    v = np.sum([dic == w for w in a],axis=0)
    return v

In [19]:
## transfer reviews
train['vector'] = train.Review.apply(bag)
test['vector'] = test.Review.apply(bag)

In [20]:
## two reviews and feature vectors
for z in [5,10]:
    print(train.Review[z])
    print(train.vector[z])

impress go origin batteri extend batteri
[0 0 0 ..., 0 0 0]
bought use kindl fire absolut love
[0 0 0 ..., 0 0 0]



(e) Pick your post processing strategy. Since the vast majority of English words will not appear in most of the reviews, most of the feature vector elements will be 0. This suggests that we need a postprocessing or normalization strategy that combats the huge variance of the elements in the feature vector. You may want to use one of the following strategies. Whatever choices you make, explain why you made the decision.

* Using log-normalization since it can reduce the variance. 

In [21]:
train['log vector'] = train.vector.apply(lambda x: np.log(x+1))
test['log vector'] = test.vector.apply(lambda x: np.log(x+1))

(f) Sentiment prediction. Train a logistic regression model (you can use existing packages here) on the training set and test on the testing set. Report the classification accuracy and confu- sion matrix. Inspecting the weight vector of the logistic regression, what are the words that play the most important roles in deciding the sentiment of the reviews? Repeat this with a Naive Bayes classifier and compare performance.

In [22]:
from sklearn.linear_model import LogisticRegression as logit
from sklearn.model_selection import GridSearchCV 
from sklearn.metrics import confusion_matrix

In [23]:
x_train = np.vstack(train['log vector'])
y_train = train.Sent
x_test = np.vstack(test['log vector'])
y_test = test.Sent

##### Logistic Regression

In [24]:
paras = {'C': [0.1, 0.5, 1, 5, 10, 50]
        }
clf = GridSearchCV(logit(penalty = 'l2'), paras, cv = 5, n_jobs = -1, scoring = 'accuracy' )
clf.fit(x_train, y_train)

GridSearchCV(cv=5, error_score='raise',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'C': [0.1, 0.5, 1, 5, 10, 50]}, pre_dispatch='2*n_jobs',
       refit=True, return_train_score=True, scoring='accuracy', verbose=0)

In [25]:
logit_prediction = clf.predict(x_test)
print('Accuracy: ', clf.best_estimator_.score(x_test, y_test))

Accuracy:  0.81


In [26]:
print('Confusion Matrix:')
confusion_matrix(y_test, logit_prediction)

Confusion Matrix:


array([[176,  24],
       [ 90, 310]])

In [50]:
l = logit(penalty = 'l2', C = 5)
l.fit(x_train, y_train)
ind = abs(l.coef_)
z = dic[np.argsort(ind)]
z[0][-10:]

In [56]:
z = dic[np.argsort(ind)]
z[0][-10:]

array(['beauti', 'fantast', 'worst', 'amaz', 'excel', 'bad', 'delici',
       'poor', 'love', 'great'], 
      dtype='<U32')

##### Naive Bayes:

In [27]:
from sklearn.naive_bayes import BernoulliNB

In [28]:
BNB = BernoulliNB()
BNB.fit(x_train, y_train)

BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)

In [29]:
BNB_prediction = BNB.predict(x_test)
print('Accuracy: ', BNB.score(x_test, y_test))

Accuracy:  0.785


In [30]:
print('Confusion Matrix:')
confusion_matrix(y_test, BNB_prediction)

Confusion Matrix:


array([[168,  32],
       [ 97, 303]])


(g) N-gram model. Similar to the bag of words model, but now you build up a dictionary of n- grams, which are contiguous sequences of words. For example, “Alice fell down the rabbit hole” would then map to the 2-grams sequence: ["Alice fell", "fell down", "down the", "the rabbit", "rabbit hole"], and all five of those symbols would be members of the n-gram dictio- nary. Try n = 2, repeat (d)-(g) and report your results.

Dictionary:

In [31]:
def two_gram(s):
    s_s = s.split()
    return [s_s[i] + ' ' + s_s[i+1] for i in range(len(s_s)-1)]

In [32]:
train['two_gram'] = train.Review.apply(two_gram)
test['two_gram'] = test.Review.apply(two_gram)

In [33]:
import itertools 

dic_w = np.array(list(set(itertools.chain(*train.two_gram))))

Feature vector:

In [34]:
## two-gram feature
def two_g(s):
    if s == []:
        return np.array([0]*len(dic_w))
    else:
        v = np.sum([dic_w == w for w in s],axis=0)
        return v

In [35]:
train['vector_2g'] = train.two_gram.apply(two_g)
test['vector_2g'] = test.two_gram.apply(two_g)

Log-normalization:

In [36]:
train['vector_2g_log'] = train.vector_2g.apply(lambda x: np.log(x+1))
test['vector_2g_log'] = test.vector_2g.apply(lambda x: np.log(x+1))

Split the data:

In [37]:
x_train_l = np.vstack(train['vector_2g_log'])
y_train_l = train.Sent
x_test_l = np.vstack(test['vector_2g_log'])
y_test_l = test.Sent

##### Logistic regression

In [38]:
paras = {'C': [0.1, 0.5, 1, 2, 5, 10, 50]}
clf = GridSearchCV(logit(penalty='l2'), paras, cv = 5, n_jobs = -1)
clf.fit(x_train_l, y_train_l)

GridSearchCV(cv=5, error_score='raise',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'C': [0.1, 0.5, 1, 2, 5, 10, 50]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)

In [39]:
logit_prediction = clf.predict(x_test_l)
print('Accuracy: ', clf.best_estimator_.score(x_test_l, y_test_l))

Accuracy:  0.55


In [40]:
print('Confusion Matrix:')
confusion_matrix(y_test_l, logit_prediction)

Confusion Matrix:


array([[176,  24],
       [246, 154]])

##### Naive Bayes:

In [41]:
BNB = BernoulliNB()
BNB.fit(x_train_l, y_train_l)

BNB_prediction = BNB.predict(x_test_l)
print('Accuracy: ', BNB.score(x_test_l, y_test_l))

print('Confusion Matrix:')
confusion_matrix(y_test_l, BNB_prediction)

Accuracy:  0.548333333333
Confusion Matrix:


array([[179,  21],
       [250, 150]])


(h) PCA for bag of words model. The features in the bag of words model have large redundancy. Implement PCA to reduce the dimension of features calculated in (e) to 10, 50 and 100 re- spectively. Using these lower-dimensional feature vectors and repeat (f), (g). Report corre- sponding clustering and classification results. (Note: You should implement PCA yourself, but you can use numpy.svd or some other SVD package. Feel free to double-check your PCA implementation against an existing one)

##### My PCA:

In [129]:
## feature matrix
fm = np.vstack(train['log vector'])
print(fm.shape)

(2400, 3629)


In [141]:
def sub_m(df):
    mean_vector = np.mean(df,axis=1).reshape(-1,1)

    ## substract mean
    return df - mean_vector
fm_sub_m = sub_m(fm)

In [142]:
## covariance matrix
def get_cm(c):                  ## input matrix (samples x observations)
    m = c.T

    num_f, num_o = m.shape

    ## find mean
    mean_vector = np.mean(m,axis=1).reshape(-1,1)
    
    ## substract mean
    fm_sub_m = m - mean_vector
    
    ## find covariance 
    covariance_m = fm_sub_m.dot(fm_sub_m.T)/(num_o - 1)
    
    return covariance_m

cm = get_cm(fm)

In [143]:
## diagnonalization

def diagonalize(m):
    
    evalues, evectors = np.linalg.eigh(m)
    idx = evalues.argsort()[::-1]
    evalues = np.diag(evalues[idx])
    evectors = evectors[:,idx]
    
    return evalues, evectors

In [144]:
E, P = diagonalize(cm)

In [145]:
P.shape

(3629, 3629)

In [146]:
##principal component

## dimension reduction
x_train10 = fm_sub_m.dot(P[:,:10])
x_train50 = fm_sub_m.dot(P[:,:50])
x_train100 = fm_sub_m.dot(P[:,:100])

In [147]:
##dim reduction for test data
fm_test = np.vstack(test['log vector'])
test_mean = np.mean(fm_test,axis=1).reshape(-1,1)
    
## substract mean
fm_test_sub_m = fm_test - test_mean

x_test10 = fm_test_sub_m.dot(P[:,:10])
x_test50 = fm_test_sub_m.dot(P[:,:50])
x_test100 = fm_test_sub_m.dot(P[:,:100])


##### Logistic Regression

In [148]:
for i in [10, 50,100]:
    
    exec('x_train_pc = x_train' + str(i))
    exec('x_test_pc = x_test' + str(i))

    paras = {'C': [0.1, 0.5, 1, 5, 10, 50]
            }
    clf = GridSearchCV(logit(penalty = 'l2'), paras, cv = 5, n_jobs = -1, scoring = 'accuracy' )
    clf.fit(x_train_pc, y_train)

    logit_prediction = clf.predict(x_test_pc)

    print('Performance with PC ' + str(i) +':')
    print('Accuracy: ', clf.best_estimator_.score(x_test_pc, y_test_l))
    print('Confusion Matrix: \n', confusion_matrix(y_test_l, logit_prediction))
    print('\n')

Performance with PC 10:
Accuracy:  0.548333333333
Confusion Matrix: 
 [[174  26]
 [245 155]]


Performance with PC 50:
Accuracy:  0.675
Confusion Matrix: 
 [[170  30]
 [165 235]]


Performance with PC 100:
Accuracy:  0.696666666667
Confusion Matrix: 
 [[158  42]
 [140 260]]




##### Naive Bayes

In [149]:
for i in [10, 50, 100]:
    
    exec('x_train_pc = x_train' + str(i))
    exec('x_test_pc = x_test' + str(i))
    
    BNB = BernoulliNB()
    BNB.fit(x_train_pc, y_train)
    BNB_prediction = BNB.predict(x_test_pc)
    
    print('Performance with PC ' + str(i) +':')
    print('Accuracy: ', BNB.score(x_test_pc, y_test))
    print('Confusion Matrix: \n', confusion_matrix(y_test, BNB_prediction))
    print('\n')

Performance with PC 10:
Accuracy:  0.585
Confusion Matrix: 
 [[120  80]
 [169 231]]


Performance with PC 50:
Accuracy:  0.606666666667
Confusion Matrix: 
 [[129  71]
 [165 235]]


Performance with PC 100:
Accuracy:  0.638333333333
Confusion Matrix: 
 [[129  71]
 [146 254]]




##### PCA for two grams:

In [150]:
## feature matrix
fm2 = np.vstack(train.vector_2g_log)

fm2 = sub_m(fm2)

## covariance matrix
def get_cm(c):                  ## input matrix (samples x observations)
    m = c.T

    num_f, num_o = m.shape

    ## find mean
    mean_vector = np.mean(m,axis=1).reshape(-1,1)
    
    ## substract mean
    fm_sub_m = m - mean_vector
    
    ## find covariance 
    covariance_m = fm_sub_m.dot(fm_sub_m.T)/(num_o - 1)
    
    return covariance_m

cm2 = get_cm(fm2)

In [151]:
## diagnonalization
def diagonalize(m):
    
    evalues, evectors = np.linalg.eigh(m)
    idx = evalues.argsort()[::-1]
    evalues = np.diag(evalues[idx])
    evectors = evectors[:,idx]
    
    return evalues, evectors

E2, P2 = diagonalize(cm2)

##principal component

## dimension reduction
x_2_train10 = fm2.dot(P2[:,:10])
x_2_train50 = fm2.dot(P2[:,:50])
x_2_train100 = fm2.dot(P2[:,:100])


##dim reduction for test data
fm2_test = np.vstack(test.vector_2g_log)
fm2_test = sub_m(fm2_test)

x_2_test10 = fm2_test.dot(P2[:,:10])
x_2_test50 = fm2_test.dot(P2[:,:50])
x_2_test100 = fm2_test.dot(P2[:,:100])

##### Logistic Regression

In [152]:
for i in [10, 50,100]:
    
    exec('x_2_train_pc = x_2_train' + str(i))
    exec('x_2_test_pc = x_2_test' + str(i))

    paras = {'C': [0.1, 0.5, 1, 5, 10, 50]
            }
    clf = GridSearchCV(logit(penalty = 'l2'), paras, cv = 5, n_jobs = -1, scoring = 'accuracy' )
    clf.fit(x_2_train_pc, y_train)

    logit_prediction = clf.predict(x_2_test_pc)

    print('Performance with PC ' + str(i) +':')
    print('Accuracy: ', clf.best_estimator_.score(x_2_test_pc, y_test_l))
    print('Confusion Matrix: \n', confusion_matrix(y_test, logit_prediction))
    print('\n')

Performance with PC 10:
Accuracy:  0.341666666667
Confusion Matrix: 
 [[198   2]
 [393   7]]


Performance with PC 50:
Accuracy:  0.573333333333
Confusion Matrix: 
 [[ 58 142]
 [114 286]]


Performance with PC 100:
Accuracy:  0.595
Confusion Matrix: 
 [[ 52 148]
 [ 95 305]]




##### Naive Bayes

In [153]:
for i in [10, 50, 100]:
    
    exec('x_2_train_pc = x_2_train' + str(i))
    exec('x_2_test_pc = x_2_test' + str(i))
    
    BNB = BernoulliNB()
    BNB.fit(x_2_train_pc, y_train)
    BNB_prediction = BNB.predict(x_2_test_pc)
    
    print('Performance with PC ' + str(i) +':')
    print('Accuracy: ', BNB.score(x_2_test_pc, y_test))
    print('Confusion Matrix: \n', confusion_matrix(y_test, BNB_prediction))
    print('\n')

Performance with PC 10:
Accuracy:  0.371666666667
Confusion Matrix: 
 [[194   6]
 [371  29]]


Performance with PC 50:
Accuracy:  0.426666666667
Confusion Matrix: 
 [[172  28]
 [316  84]]


Performance with PC 100:
Accuracy:  0.436666666667
Confusion Matrix: 
 [[161  39]
 [299 101]]




(i)

In [57]:
l = logit(penalty = 'l2', C = 5)
l.fit(x_train, y_train)
ind = abs(l.coef_)
z = dic[np.argsort(ind)]
z[0][-10:]

array(['beauti', 'fantast', 'worst', 'amaz', 'excel', 'bad', 'delici',
       'poor', 'love', 'great'], 
      dtype='<U32')

In [58]:
l1 = logit(penalty = 'l2', C = 5)
l1.fit(x_train_l, y_train_l)
ind = abs(l1.coef_)
z1 = dic_w[np.argsort(ind)]
z1[0][-10:]

array(['good price', 'realli good', 'easi use', 'food good',
       'great product', 'one best', 'wast time', 'great phone',
       'high recommend', 'work great'], 
      dtype='<U39')