Data source: https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences

Data Set Information:

This dataset was created for the paper 'From Group to Individual Labels using Deep Features', Kotzias et al., KDD 2015. It contains sentences labelled with positive or negative sentiment. 

Details:      
Sentence score is either 1 (for positive) or 0 (for negative) 
The sentences come from three different websites/fields: 
- imdb.com 
- amazon.com 
- yelp.com 

For each website, there are 500 positive and 500 negative sentences, selected randomly from larger datasets of reviews. 

### Summary:
- [Balanced Labels](#labels)
- [Preprocessing Sentences](#pre)
- [Split Training and Testing Set](#split)
- [Postprocessing](#post)
- [Logistic Regression vs Naive Bayes](#models)
- [N-gram Model](#2)
- [PCA for Bag of Words](#pca)

In [87]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import normalize
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import GaussianNB
from nltk.stem import WordNetLemmatizer 
from sklearn.decomposition import PCA
from nltk.corpus import stopwords
import pandas as pd
import numpy as np
import operator
import string
import nltk
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/melissajin/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/melissajin/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/melissajin/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [3]:
#load data sets
imdb = pd.read_csv("imdb_labelled.txt", sep="\t",header =None)
amazon = pd.read_csv("amazon_cells_labelled.txt", sep="\t",header =None)
yelp = pd.read_csv("yelp_labelled.txt", sep="\t",header =None)
print(len(imdb),len(amazon),len(yelp))

748 1000 1000
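
The IMDb file yields 748 rows rather than the 1,000 the dataset description promises. One plausible cause, which is an assumption and was not verified against the file here, is that some sentences begin with an unbalanced double quote, which the default CSV quoting treats as the start of a field that may span several lines. The sketch below reproduces that effect on a small made-up sample and shows that `quoting=csv.QUOTE_NONE` keeps one row per line.

```python
import csv
import io

import pandas as pd

# Made-up sample in the same tab-separated layout; the second line opens
# a double quote that is only closed at the end of the third sentence.
sample = (
    'A plain sentence.\t1\n'
    '"An opening quote with no partner on this line.\t0\n'
    'The partner quote ends this sentence."\t1\n'
)

# Default quoting: the quoted field may swallow the line break.
default_rows = pd.read_csv(io.StringIO(sample), sep="\t", header=None)

# QUOTE_NONE treats double quotes as ordinary characters.
raw_rows = pd.read_csv(io.StringIO(sample), sep="\t", header=None,
                       quoting=csv.QUOTE_NONE)

print(len(default_rows), len(raw_rows))
```

With `QUOTE_NONE` the quote characters stay in the text, so a later punctuation-stripping step would still be needed to remove them.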


In [4]:
imdb.head()

Unnamed: 0,0,1
0,"A very, very, very slow-moving, aimless movie ...",0
1,Not sure who was more lost - the flat characte...,0
2,Attempting artiness with black & white and cle...,0
3,Very little music or anything to speak of.,0
4,The best scene in the movie was when Gerardo i...,1


#### <a id ="labels">Balanced Labels</a>

In [5]:
#Count the positive and negative labels and print the ratio positive/negative
def labelcount(dataset):    
    positive = 0
    for i in range(0,len(dataset)):
        if dataset[1][i] == 1:
            positive +=1
    negative = len(dataset) - positive
    ratio = positive/negative
    print("The ratio (positive/negative) is: " + str(ratio))
labelcount(imdb)
labelcount(amazon)
labelcount(yelp)

The ratio (positive/negative) is: 1.0662983425414365
The ratio (positive/negative) is: 1.0
The ratio (positive/negative) is: 1.0
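
The same ratio can be computed without an explicit loop. This is a minimal sketch on a made-up frame that mirrors the notebook's layout (column 0 holds the sentence, column 1 the label); the toy data and the variable names are illustrative assumptions.

```python
import pandas as pd

# Toy frame with the same layout as the notebook: column 0 is the
# sentence, column 1 is the label (1 = positive, 0 = negative).
toy = pd.DataFrame({0: ["good", "bad", "great", "awful", "fine"],
                    1: [1, 0, 1, 0, 1]})

counts = toy[1].value_counts()           # number of rows per label
ratio = counts.loc[1] / counts.loc[0]    # positive / negative
print(counts.to_dict(), ratio)
```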


#### <a id = "pre">Preprocessing Sentences</a>

In [6]:
#convert all data to lowercase (astype(str) also turns the labels into the strings '0' and '1')
imdb = imdb.apply(lambda x: x.astype(str).str.lower())
amazon = amazon.apply(lambda x: x.astype(str).str.lower())
yelp = yelp.apply(lambda x: x.astype(str).str.lower())

In [9]:
#Tokenize every sentence and lemmatize each token
lemmatizer = WordNetLemmatizer()
for df in (imdb, amazon, yelp):
    #assign the whole column at once instead of cell by cell
    df[0] = df[0].apply(lambda s: [lemmatizer.lemmatize(w) for w in nltk.word_tokenize(s)])

In [10]:
#strip stop words
stop_words = set(stopwords.words('english'))
for df in (imdb, amazon, yelp):
    df[0] = df[0].apply(lambda words: [word for word in words if word not in stop_words])

In [11]:
#strip punctuation
exclude = set(string.punctuation)
for df in (imdb, amazon, yelp):
    #remove punctuation characters inside each token
    df[0] = df[0].apply(lambda words: [''.join(p for p in word if p not in exclude) for word in words])
    #drop tokens that became empty
    df[0] = df[0].apply(lambda words: [word for word in words if word])
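
The cleaning steps above (lowercase, drop stop words, strip punctuation, drop empty tokens) can be sketched as a single function. To stay self-contained, this version is a simplification: it splits on whitespace instead of calling `nltk.word_tokenize`, skips lemmatization, and uses a tiny hypothetical stop-word set rather than NLTK's English list.

```python
import string

# Hypothetical, tiny stop-word list for illustration only; the notebook
# uses NLTK's English list instead.
STOP_WORDS = {"the", "a", "was", "and", "is"}

# Translation table that deletes every ASCII punctuation character.
STRIP_PUNCT = str.maketrans("", "", string.punctuation)

def clean(sentence):
    """Lowercase, split on whitespace, drop stop words, strip punctuation."""
    tokens = sentence.lower().split()
    tokens = [t for t in tokens if t not in STOP_WORDS]
    tokens = [t.translate(STRIP_PUNCT) for t in tokens]
    return [t for t in tokens if t]      # drop tokens that became empty

print(clean("The food was great, and the service... wow!"))
# ['food', 'great', 'service', 'wow']
```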

#### <a id="split">Split Training and Testing Set</a>

In [38]:
x_train,y_train,x_test,y_test=[],[],[],[]
x_train.extend(imdb[0][:400])
x_train.extend(amazon[0][:400])
x_train.extend(yelp[0][:400]) 
y_train.extend(imdb[1][:400])
y_train.extend(amazon[1][:400])
y_train.extend(yelp[1][:400]) 
x_test.extend(imdb[0][400:500])
x_test.extend(amazon[0][400:500])
x_test.extend(yelp[0][400:500]) 
y_test.extend(imdb[1][400:500])
y_test.extend(amazon[1][400:500])
y_test.extend(yelp[1][400:500]) 

In [39]:
print(len(x_train),len(y_train),len(x_test),len(y_test))

1200 1200 300 300


In [41]:
x_train = [",".join(ele) for ele in x_train]
x_test = [",".join(ele) for ele in x_test]

In [64]:
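ctV = CountVectorizer()
#build a dictionary of unique words for training set
#toarray() returns a plain ndarray; recent scikit-learn versions reject the np.matrix that todense() returns
x_train_bag = ctV.fit_transform(x_train).toarray()

#reuse the fitted vocabulary for the testing set (get_feature_names() is no longer available in recent scikit-learn versions)
x_test_bag = ctV.transform(x_test).toarray()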

In [65]:
print(x_train_bag.shape,x_test_bag.shape)

(1200, 2986) (300, 2986)
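
Both matrices have 2,986 columns because the test sentences are encoded with the vocabulary learned from the training set. A minimal sketch of that behaviour on made-up sentences: words that appear only in the test data (here "wine") have no column and are ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer

train = ["good food", "bad service", "good service"]   # made-up sentences
test = ["good wine", "bad food"]

vectorizer = CountVectorizer()
train_bag = vectorizer.fit_transform(train)   # learns the vocabulary
test_bag = vectorizer.transform(test)         # reuses it; "wine" is ignored

print(sorted(vectorizer.vocabulary_))   # ['bad', 'food', 'good', 'service']
print(test_bag.toarray())               # rows: [0 0 1 0] and [1 1 0 0]
```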


#### <a id="post">Postprocessing</a>

In [69]:
#normalize the training set by L2 norm: x^ = x / ||x||
#The counts vary a lot from sentence to sentence; scaling each row to unit L2 norm reduces the effect of that variance

x_train_bag_norm = normalize(x_train_bag)

#normalize the data by using L2 norm for testing set 
x_test_bag_norm = normalize(x_test_bag)
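
As a worked example of x^ = x / ||x||, the sketch below normalizes two made-up count rows with NumPy. It illustrates the formula only; `sklearn.preprocessing.normalize` additionally leaves all-zero rows unchanged, whereas the plain division here would produce NaN for such a row.

```python
import numpy as np

counts = np.array([[3.0, 4.0, 0.0],
                   [1.0, 0.0, 0.0]])      # made-up count rows

norms = np.linalg.norm(counts, axis=1, keepdims=True)   # ||x|| for each row
unit_rows = counts / norms

print(unit_rows)   # first row becomes [0.6, 0.8, 0.0]
```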

#### <a id="models">Logistic Regression vs Naive Bayes</a>

In [72]:
lrl2 = LogisticRegression()
lrl2.fit(x_train_bag_norm, y_train)
print("Logistic Regression normalized by L2 norm accuracy:{:.2f}".format(lrl2.score(x_test_bag_norm ,y_test)))

Logistic Regression normalized by L2 norm accuracy:0.74


In [53]:
print(len(x_train_bag_norm),len(y_train),len(x_test_bag_norm),len(y_test))

1200 1200 300 300


In [73]:
gaussian = GaussianNB()
gaussian.fit(x_train_bag_norm, y_train)
print("Naive Bayes Classifier with Gaussian assumption normalized by L2 norm accuracy:{:.2f}"
      .format(gaussian.score(x_test_bag_norm ,y_test)))

bernoulli = BernoulliNB()
bernoulli.fit(x_train_bag_norm, y_train)
print("Naive Bayes Classifier with Bernoulli assumption normalized by L2 norm accuracy:{:.2f}"
      .format(bernoulli.score(x_test_bag_norm ,y_test)))

Naive Bayes Classifier with Gaussian assumption normalized by L2 norm accuracy:0.69
Naive Bayes Classifier with Bernoulli assumption normalized by L2 norm accuracy:0.72


The Logistic Regression model (0.74) performs slightly better than either Naive Bayes classifier (0.69 with the Gaussian assumption, 0.72 with the Bernoulli assumption).

In [88]:
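#vocabulary_ maps each term to its column index (alphabetical order), not to a count,
#so this lists the ten terms with the highest index, i.e. the last ten alphabetically
sorted_vocab = sorted(ctV.vocabulary_.items(), key=operator.itemgetter(1), reverse=True)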
print("Dictionary with 10 highest values:")   
for word in sorted_vocab[:10]:
    print(word)

Dictionary with 10 highest values:
('zombiez', 2985)
('zombiestudents', 2984)
('zillion', 2983)
('zero', 2982)
('yun', 2981)
('yummy', 2980)
('yum', 2979)
('yukon', 2978)
('yucky', 2977)
('youthful', 2976)


#### <a id ="2">N-gram Model</a>

In [94]:
ctV2 = CountVectorizer(ngram_range=(2, 2))
#build a 2-gram dictionary of unique word pairs for training set
#toarray() returns a plain ndarray; recent scikit-learn versions reject the np.matrix that todense() returns
x_train_bag2 = ctV2.fit_transform(x_train).toarray()

#reuse the fitted 2-gram vocabulary for the testing set
x_test_bag2 = ctV2.transform(x_test).toarray()

#postprocessing
x_train_bag_norm2 = normalize(x_train_bag2)

#normalize the data by using L2 norm for testing set 
x_test_bag_norm2 = normalize(x_test_bag2)

In [96]:
lrl2_2 = LogisticRegression()
lrl2_2.fit(x_train_bag_norm2, y_train)
print("Logistic Regression normalized by L2 norm accuracy:{:.2f}".format(lrl2_2.score(x_test_bag_norm2 ,y_test)))

gaussian2 = GaussianNB()
gaussian2.fit(x_train_bag_norm2, y_train)
print("Naive Bayes Classifier with Gaussian assumption normalized by L2 norm accuracy:{:.2f}"
      .format(gaussian2.score(x_test_bag_norm2 ,y_test)))

bernoulli2 = BernoulliNB()
bernoulli2.fit(x_train_bag_norm2, y_train)
print("Naive Bayes Classifier with Bernoulli assumption normalized by L2 norm accuracy:{:.2f}"
      .format(bernoulli2.score(x_test_bag_norm2 ,y_test)))

Logistic Regression normalized by L2 norm accuracy:0.41
Naive Bayes Classifier with Gaussian assumption normalized by L2 norm accuracy:0.59
Naive Bayes Classifier with Bernoulli assumption normalized by L2 norm accuracy:0.59
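
One way to see why 2-grams can hurt is to measure how many test n-grams were ever seen in training. The sketch below does this on a deliberately extreme, made-up example; `seen_fraction` is a hypothetical helper, and the numbers say nothing about the real dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer

train = ["good food great service", "bad food slow service"]   # made up
test = ["great food bad service"]

def seen_fraction(ngram_range):
    """Share of test n-grams that also occur in the training vocabulary."""
    vec = CountVectorizer(ngram_range=ngram_range).fit(train)
    analyze = vec.build_analyzer()        # same tokenization as the vectorizer
    grams = [g for doc in test for g in analyze(doc)]
    return sum(g in vec.vocabulary_ for g in grams) / len(grams)

print(seen_fraction((1, 1)), seen_fraction((2, 2)))   # 1.0 0.0
```

Every test word is known, but none of the test word pairs is, so the 2-gram representation of the test sentence would be an all-zero row.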


In [97]:
sorted_vocab2 = sorted(ctV2.vocabulary_.items(), key=operator.itemgetter(1), reverse=True)
print("Dictionary with 10 highest values:")   
for word in sorted_vocab2[:10]:
    print(word)

Dictionary with 10 highest values:
('zombiez part', 7181)
('zombiestudents back', 7180)
('zillion time', 7179)
('zero taste', 7178)
('zero star', 7177)
('yun fat', 7176)
('yummy try', 7175)
('yum yum', 7174)
('yum sauce', 7173)
('yukon gold', 7172)


#### <a id="pca">PCA for Bag of Words</a>

In [100]:
#PCA Features = 10
pca = PCA(n_components=10)

x_trainpca = pca.fit_transform(x_train_bag_norm) 
x_testpca = pca.transform(x_test_bag_norm) 

lr_pca = LogisticRegression() 
lr_pca.fit(x_trainpca, y_train)
print("Logistic Regression normalized by L2 norm accuracy:{:.2f}".format(lr_pca.score(x_testpca ,y_test)))

gaussian_pca = GaussianNB()
gaussian_pca.fit(x_trainpca, y_train)
print("Naive Bayes Classifier with Gaussian assumption normalized by L2 norm accuracy:{:.2f}"
      .format(gaussian_pca.score(x_testpca ,y_test)))

bernoulli_pca = BernoulliNB()
bernoulli_pca.fit(x_trainpca, y_train)
print("Naive Bayes Classifier with Bernoulli assumption normalized by L2 norm accuracy:{:.2f}"
      .format(bernoulli_pca.score(x_testpca ,y_test)))

Logistic Regression normalized by L2 norm accuracy:0.55
Naive Bayes Classifier with Gaussian assumption normalized by L2 norm accuracy:0.51
Naive Bayes Classifier with Bernoulli assumption normalized by L2 norm accuracy:0.52


In [101]:
#PCA Features = 50
pca = PCA(n_components=50)

x_trainpca = pca.fit_transform(x_train_bag_norm) 
x_testpca = pca.transform(x_test_bag_norm) 

lr_pca = LogisticRegression() 
lr_pca.fit(x_trainpca, y_train)
print("Logistic Regression normalized by L2 norm accuracy:{:.2f}".format(lr_pca.score(x_testpca ,y_test)))

gaussian_pca = GaussianNB()
gaussian_pca.fit(x_trainpca, y_train)
print("Naive Bayes Classifier with Gaussian assumption normalized by L2 norm accuracy:{:.2f}"
      .format(gaussian_pca.score(x_testpca ,y_test)))

bernoulli_pca = BernoulliNB()
bernoulli_pca.fit(x_trainpca, y_train)
print("Naive Bayes Classifier with Bernoulli assumption normalized by L2 norm accuracy:{:.2f}"
      .format(bernoulli_pca.score(x_testpca ,y_test)))

Logistic Regression normalized by L2 norm accuracy:0.67
Naive Bayes Classifier with Gaussian assumption normalized by L2 norm accuracy:0.62
Naive Bayes Classifier with Bernoulli assumption normalized by L2 norm accuracy:0.59


In [102]:
#PCA Features = 100
pca = PCA(n_components=100)

x_trainpca = pca.fit_transform(x_train_bag_norm) 
x_testpca = pca.transform(x_test_bag_norm) 

lr_pca = LogisticRegression() 
lr_pca.fit(x_trainpca, y_train)
print("Logistic Regression normalized by L2 norm accuracy:{:.2f}".format(lr_pca.score(x_testpca ,y_test)))

gaussian_pca = GaussianNB()
gaussian_pca.fit(x_trainpca, y_train)
print("Naive Bayes Classifier with Gaussian assumption normalized by L2 norm accuracy:{:.2f}"
      .format(gaussian_pca.score(x_testpca ,y_test)))

bernoulli_pca = BernoulliNB()
bernoulli_pca.fit(x_trainpca, y_train)
print("Naive Bayes Classifier with Bernoulli assumption normalized by L2 norm accuracy:{:.2f}"
      .format(bernoulli_pca.score(x_testpca ,y_test)))

Logistic Regression normalized by L2 norm accuracy:0.69
Naive Bayes Classifier with Gaussian assumption normalized by L2 norm accuracy:0.65
Naive Bayes Classifier with Bernoulli assumption normalized by L2 norm accuracy:0.60
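
Accuracy rises as more components are kept. A way to quantify how much information each setting retains is the cumulative explained variance ratio. The sketch below uses a random matrix as a stand-in for the bag-of-words data, so its numbers are purely illustrative; on the real data, `x_train_bag_norm` would take the place of `X`.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.random((200, 300))    # random stand-in for the bag-of-words matrix

pca = PCA(n_components=100).fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Share of the total variance kept by the first k components
for k in (10, 50, 100):
    print(k, round(float(cumulative[k - 1]), 3))
```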


Bag of words with logistic regression performs best (0.74 accuracy). It preserves every word feature, since every single word is captured. The 2-gram model is worse (0.41 to 0.59) because it introduces unnecessary features (7,182 versus 2,986) that are much sparser, and it is more biased. PCA does not work well (at most 0.69 with 100 components) because reducing the dimension also loses some information.    
Regarding language usage in online reviews, we notice that people tend to use similar keywords for each label; refer to the top 20 words for the positive and negative labels in parts f and g.