# Thumbs up? Sentiment Classification using Machine Learning Techniques
<ul>
    <li>by Bo Pang, Lillian Lee and Shivakumar Vaithyanathan</li>
    <li>recreated by <b>Jan Kristoffer Cheng</b> and <b>Johansson Tan</b></li>
</ul>

This Jupyter Notebook tries to recreate the results of the paper by Pang, Lee and Vaithyanathan on sentiment classification. They used movie reviews from IMDb as their corpus and classified each review as having either a positive or a negative sentiment. To perform this binary classification, they built several models using different features and machine learning techniques. The techniques they used were Naive Bayes, Maximum Entropy, and SVM, but this project uses only Naive Bayes and SVM.

In [1]:
%reload_ext autoreload
%autoreload 2

<h3>Reading the corpus</h3>

The corpus is readily available online. It comes in several versions, each with cleaner data than the one before. The results in the paper were obtained with version 0.9, but we used version 1.0.

When extracted, the zip file contains two folders, neg and pos, each holding 700 text files of the corresponding category. The class <i>FileReader</i> reads all the files under a given path. It also separates punctuation marks from the words so that they can be easily distinguished later on.
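
As a rough illustration of this reading step, the sketch below loads every file in a folder and puts spaces around punctuation marks. The names `separate_punctuation` and `read_texts` and the punctuation set are assumptions made for this example; the actual <i>FileReader</i> may work differently.

```python
import os
import re

# Hypothetical punctuation set; the real FileReader may use a different one.
PUNCTUATION = r'[.,!?;:()"]'

def separate_punctuation(text):
    """Put spaces around punctuation so each mark becomes its own token."""
    spaced = re.sub(f"({PUNCTUATION})", r" \1 ", text)
    # Collapse the extra whitespace introduced by the padding.
    return " ".join(spaced.split())

def read_texts(path):
    """Read every file in `path` (sorted by name) and return the cleaned texts."""
    texts = []
    for name in sorted(os.listdir(path)):
        with open(os.path.join(path, name), encoding="utf-8", errors="ignore") as f:
            texts.append(separate_punctuation(f.read()))
    return texts
```

Apostrophes are deliberately left out of the punctuation set here so that contractions such as "don't" stay in one piece for the negation step.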

In [2]:
from file_reader import FileReader

negPath = 'mix20_rand700_tokens_cleaned/tokens/neg/'
posPath = 'mix20_rand700_tokens_cleaned/tokens/pos/'

fileReader = FileReader()

negatives = fileReader.getTexts(negPath)
positives = fileReader.getTexts(posPath)
allTexts = negatives + positives

print('Negative:', len(negatives))
print('Positive:', len(positives))
print('Total:', len(allTexts))

N = len(negatives)

Negative: 700
Positive: 700
Total: 1400


<h3>Appending tags for negations</h3>

The class <i>TextNegator</i> appends <b>--n</b> to every word between a negation and the next punctuation mark. The output of this function is used for the unigram features. As an example, consider the sentence: <b>I don't like the movie. I didn't enjoy at all.</b> In the input below, the punctuation marks are already split from the words, since <i>FileReader</i> does this when reading the files.

In [31]:
from features import TextNegator

texts = ["I don't like the movie . I didn't enjoy at all ."]
textNegator = TextNegator()
textNegator.getNegated(texts)

["I don't like--n the--n movie--n . I didn't enjoy--n at--n all--n ."]

The lists of negations and punctuation marks used by <i>TextNegator</i> are inside the features file.
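
The tagging rule can be sketched in a few lines. This is only an approximation of what <i>TextNegator</i> does: the function name `negate` and the word lists below are assumptions, not the lists from the features file.

```python
# Hypothetical word lists; the real ones live in the features file.
NEGATIONS = {"not", "no", "never", "cannot"}
PUNCTUATION = {".", ",", "!", "?", ";", ":"}

def is_negation(token):
    lowered = token.lower()
    return lowered in NEGATIONS or lowered.endswith("n't")

def negate(text):
    """Append --n to every token between a negation and the next punctuation mark."""
    negating = False
    output = []
    for token in text.split():
        if token in PUNCTUATION:
            negating = False              # punctuation closes the negated span
            output.append(token)
        elif is_negation(token):
            negating = True               # the negation word itself stays untagged
            output.append(token)
        elif negating:
            output.append(token + "--n")
        else:
            output.append(token)
    return " ".join(output)

print(negate("I don't like the movie . I didn't enjoy at all ."))
# I don't like--n the--n movie--n . I didn't enjoy--n at--n all--n .
```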

Both the negative and the positive texts are processed by <i>TextNegator</i> for later use.

In [4]:
negatedNegatives = textNegator.getNegated(negatives)
negatedPositives = textNegator.getNegated(positives)

<h3>Preparing libraries</h3>

Before anything else, different libraries such as numpy and sklearn should be imported. They will be utilized in building the models later on. 

In [7]:
import numpy as np
from sklearn.model_selection import KFold
from sklearn.naive_bayes import BernoulliNB
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score
from k_fold import KFoldBatcher

nFold = 3
nPerFold = int(N/nFold)
print(nPerFold)

kfold = KFold(nFold)
results = {'features':[], 'nFeatures': [],'nb': [], 'svm': []}

Similar to the paper, we used 3-fold cross validation, with each fold having 233 texts from each category. The class <i>KFoldBatcher</i> splits the feature matrices and class labels into batches for cross validation. The variable <i>results</i> holds the outputs of each experiment: the features used, the average number of features, and the average accuracies of both Naive Bayes and SVM.
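
One way such a batcher could work is sketched below. The function `make_fold`, the label convention (0 for negative, 1 for positive), and the contiguous fold boundaries are assumptions for illustration; <i>KFoldBatcher</i> itself may be implemented differently.

```python
import numpy as np

def make_fold(features_negative, features_positive, n_fold, i):
    """Return (trainX, trainY, testX, testY) for fold i of n_fold.

    Assumptions: both classes have the same number of rows, label 0 means
    negative and 1 means positive, and fold i holds out rows
    [i * n_per_fold, (i + 1) * n_per_fold) of each class.
    """
    n_rows = len(features_negative)
    n_per_fold = n_rows // n_fold
    held_out = np.zeros(n_rows, dtype=bool)
    held_out[i * n_per_fold:(i + 1) * n_per_fold] = True

    testX = np.concatenate((features_negative[held_out], features_positive[held_out]))
    trainX = np.concatenate((features_negative[~held_out], features_positive[~held_out]))
    n_test, n_train = int(held_out.sum()), int((~held_out).sum())
    testY = np.array([0] * n_test + [1] * n_test)
    trainY = np.array([0] * n_train + [1] * n_train)
    return trainX, trainY, testX, testY
```

With 700 rows per class and 3 folds, this sketch holds out 233 rows per class in each fold, and the one leftover row per class always stays in the training set.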

<h3>Unigrams frequency</h3>

```python
class UnigramFeature:
    def __init__(self):
        self.unigrams = []
    def process(self, negatedTexts): ...
    def get(self, negatedTexts, type='pres'): ...
```

<i>UnigramFeature.process(negatedTexts)</i> collects all the unigrams that appear at least 4 times in the training data and stores them in <i>self.unigrams</i>.

<i>UnigramFeature.get(negatedTexts, type='freq')</i> returns the unigrams' frequencies of the negated texts as a numpy array.
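
The two steps can be approximated as follows. This is a minimal sketch, not the code in the features file: `build_vocabulary` and `vectorize` are hypothetical names, and the ordering of the vocabulary by descending count is an assumption.

```python
from collections import Counter
import numpy as np

def build_vocabulary(train_texts, min_count=4):
    """Keep the unigrams that occur at least min_count times in the training texts."""
    counts = Counter(token for text in train_texts for token in text.split())
    return [token for token, count in counts.most_common() if count >= min_count]

def vectorize(texts, vocabulary, kind="pres"):
    """Return a (documents x unigrams) matrix of counts ('freq') or 0/1 flags ('pres')."""
    index = {token: column for column, token in enumerate(vocabulary)}
    matrix = np.zeros((len(texts), len(vocabulary)), dtype=np.int64)
    for row, text in enumerate(texts):
        for token in text.split():
            column = index.get(token)
            if column is not None:        # tokens outside the vocabulary are ignored
                matrix[row, column] += 1
    if kind == "pres":
        matrix = (matrix > 0).astype(np.int64)
    return matrix
```

Building the vocabulary from the training folds only, as the notebook does, keeps test-set words from influencing which features exist.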

In [9]:
from features import UnigramFeature

nbAccuracy = 0
svmAccuracy = 0
nFeatures = 0
i = 0
for trainIndex, testIndex in kfold.split(negatives):
    unigramFeature = UnigramFeature()
    unigramFeature.process([negatedNegatives[index] for index in trainIndex] + [negatedPositives[index] for index in trainIndex])
    nFeatures += len(unigramFeature.unigrams)
    
    featuresNegative = unigramFeature.get(negatedNegatives, type='freq')
    featuresPositive = unigramFeature.get(negatedPositives, type='freq')
    
    kfoldBatcher = KFoldBatcher(nFold, featuresNegative, featuresPositive)
    trainX = kfoldBatcher.getTrainX(i)
    trainY = kfoldBatcher.getTrainY(i)
    
    testX = kfoldBatcher.getTestX(i)
    testY = kfoldBatcher.getTestY(i)
    
    nb = BernoulliNB()
    nb.fit(trainX, trainY)
    nbAccuracy += accuracy_score(nb.predict(testX), testY)

    svm = LinearSVC()
    svm.fit(trainX, trainY)
    svmAccuracy += accuracy_score(svm.predict(testX), testY)
    
    i += 1
    
nbAccuracy /= nFold
svmAccuracy /= nFold
nFeatures = int(nFeatures/nFold)
results['features'].append('unigrams')
results['nFeatures'].append(nFeatures)
results['nb'].append(nbAccuracy)
results['svm'].append(svmAccuracy)

print('Features: unigrams frequency')
print('Number of Features:', nFeatures)
print('Naive Bayes Accuracy:', nbAccuracy)
print('SVM Accuracy:', svmAccuracy)

Features: unigrams frequency
Number of Features: 12660
Naive Bayes Accuracy: 0.778254649499
SVM Accuracy: 0.782546494993


<h3>Unigrams presence</h3>

In contrast to the previous section, the models here use the presence of each unigram rather than its frequency. Thus, each feature has a value of either 0 or 1.

<i>UnigramFeature.get(negatedTexts, type='pres')</i> returns the unigrams' presence in the negated texts as a numpy array.

In [10]:
unigramFeaturesNegative = []
unigramFeaturesPositive = []

nbAccuracy = 0
svmAccuracy = 0
nFeatures = 0
i = 0
for trainIndex, testIndex in kfold.split(negatives):
    unigramFeature = UnigramFeature()
    unigramFeature.process([negatedNegatives[index] for index in trainIndex] + [negatedPositives[index] for index in trainIndex])
    nFeatures += len(unigramFeature.unigrams)
    
    featuresNegative = unigramFeature.get(negatedNegatives, type='pres')
    featuresPositive = unigramFeature.get(negatedPositives, type='pres')
    
    unigramFeaturesNegative.append(featuresNegative)
    unigramFeaturesPositive.append(featuresPositive)

    kfoldBatcher = KFoldBatcher(nFold, featuresNegative, featuresPositive)
    
    trainX = kfoldBatcher.getTrainX(i)
    trainY = kfoldBatcher.getTrainY(i)
    
    testX = kfoldBatcher.getTestX(i)
    testY = kfoldBatcher.getTestY(i)
    
    nb = BernoulliNB()
    nb.fit(trainX, trainY)
    nbAccuracy += accuracy_score(nb.predict(testX), testY)

    svm = LinearSVC()
    svm.fit(trainX, trainY)
    svmAccuracy += accuracy_score(svm.predict(testX), testY)
    
    i += 1
    
nbAccuracy /= nFold
svmAccuracy /= nFold
nFeatures = int(nFeatures/nFold)
results['features'].append('unigrams')
results['nFeatures'].append(nFeatures)
results['nb'].append(nbAccuracy)
results['svm'].append(svmAccuracy)

print('Features: unigrams presence')
print('Number of Features:', nFeatures)
print('Naive Bayes Accuracy:', nbAccuracy)
print('SVM Accuracy:', svmAccuracy)

Features: unigrams presence
Number of Features: 12660
Naive Bayes Accuracy: 0.778254649499
SVM Accuracy: 0.792560801144
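
The Naive Bayes accuracy is the same for frequency and presence features. This is expected with scikit-learn's `BernoulliNB`: its `binarize` parameter defaults to `0.0`, so every count above zero is mapped to 1 before fitting. The check below uses a made-up toy matrix, not the corpus.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Toy count matrix (rows = documents, columns = unigrams); the values are made up.
counts = np.array([[3, 0, 1], [0, 2, 0], [5, 1, 0], [0, 0, 4]])
labels = np.array([1, 0, 1, 0])
presence = (counts > 0).astype(int)

nb_counts = BernoulliNB().fit(counts, labels)      # binarize=0.0 by default
nb_presence = BernoulliNB().fit(presence, labels)

# Both models learn identical per-feature log probabilities.
same = np.allclose(nb_counts.feature_log_prob_, nb_presence.feature_log_prob_)
print(same)  # True
```

A model that actually uses the counts, such as `MultinomialNB`, would be needed for the frequency features to make a difference for Naive Bayes.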


<h3>Bigrams</h3>

```python
class BigramFeature:
    def __init__(self):
        self.bigrams = []
    def process(self, texts): ...
    def get(self, texts): ...
```

<i>BigramFeature.process(texts)</i> collects the 16165 most frequent bigrams among those that appear at least 7 times in the training data and stores them in <i>self.bigrams</i>.

<i>BigramFeature.get(texts)</i> returns the presence of the bigrams of the texts as a numpy array.
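
A simplified version of the bigram selection might look like this. The helper names are hypothetical, and breaking ties by first occurrence is an assumption of the sketch rather than something the features file is known to do.

```python
from collections import Counter

def extract_bigrams(text):
    """Return the adjacent token pairs of a whitespace-tokenized text."""
    tokens = text.split()
    return list(zip(tokens, tokens[1:]))

def top_bigrams(train_texts, limit=16165, min_count=7):
    """Keep at most `limit` bigrams, most frequent first, that occur >= min_count times."""
    counts = Counter(b for text in train_texts for b in extract_bigrams(text))
    frequent = [(b, c) for b, c in counts.most_common() if c >= min_count]
    return [b for b, _ in frequent[:limit]]

def bigram_presence(text, bigrams):
    """Return a 0/1 list marking which of `bigrams` occur in `text`."""
    present = set(extract_bigrams(text))
    return [1 if b in present else 0 for b in bigrams]
```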

In [17]:
from features import BigramFeature

bigramFeaturesNegative = []
bigramFeaturesPositive = []

nbAccuracy = 0
svmAccuracy = 0
nFeatures = 0
i = 0
for trainIndex, testIndex in kfold.split(negatives):
    bigramFeature = BigramFeature()
    bigramFeature.process([negatives[index] for index in trainIndex] + [positives[index] for index in trainIndex])
    nFeatures += len(bigramFeature.bigrams)
    
    featuresNegative = bigramFeature.get(negatives)
    featuresPositive = bigramFeature.get(positives)
    
    bigramFeaturesNegative.append(featuresNegative)
    bigramFeaturesPositive.append(featuresPositive)

    kfoldBatcher = KFoldBatcher(nFold, featuresNegative, featuresPositive)
    
    trainX = kfoldBatcher.getTrainX(i)
    trainY = kfoldBatcher.getTrainY(i)
    
    testX = kfoldBatcher.getTestX(i)
    testY = kfoldBatcher.getTestY(i)
    
    nb = BernoulliNB()
    nb.fit(trainX, trainY)
    nbAccuracy += accuracy_score(nb.predict(testX), testY)

    svm = LinearSVC()
    svm.fit(trainX, trainY)
    svmAccuracy += accuracy_score(svm.predict(testX), testY)
    
    i += 1
    
nbAccuracy /= nFold
svmAccuracy /= nFold
nFeatures = int(nFeatures/nFold)
results['features'].append('bigrams')
results['nFeatures'].append(nFeatures)
results['nb'].append(nbAccuracy)
results['svm'].append(svmAccuracy)

print('Features: bigrams')
print('Number of Features:', nFeatures)
print('Naive Bayes Accuracy:', nbAccuracy)
print('SVM Accuracy:', svmAccuracy)

Features: bigrams
Number of Features: 16165
Naive Bayes Accuracy: 0.748927038627
SVM Accuracy: 0.752503576538


<h3>Unigrams and Bigrams</h3>

The models here use both unigrams and bigrams, which amounts to concatenating the two feature matrices column-wise before training.
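
On toy matrices (made-up values), the column-wise concatenation looks like this; the combined feature count is simply the number of columns of the result.

```python
import numpy as np

unigram_block = np.array([[1, 0, 1], [0, 1, 1]])   # 2 documents x 3 unigrams
bigram_block = np.array([[0, 1], [1, 1]])          # 2 documents x 2 bigrams

# axis=1 joins the columns, so each document keeps one row.
combined = np.concatenate((unigram_block, bigram_block), axis=1)
n_features = combined.shape[1]

print(combined.shape)  # (2, 5)
```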

In [19]:
nbAccuracy = 0
svmAccuracy = 0
nFeatures = 0
for i in range(nFold):
    print(i)
    featuresNegative = np.concatenate((unigramFeaturesNegative[i], bigramFeaturesNegative[i]), axis=1)
    featuresPositive = np.concatenate((unigramFeaturesPositive[i], bigramFeaturesPositive[i]), axis=1)
    nFeatures += featuresNegative.shape[1]  # count the combined unigram and bigram columns

    kfoldBatcher = KFoldBatcher(nFold, featuresNegative, featuresPositive)
    
    trainX = kfoldBatcher.getTrainX(i)
    trainY = kfoldBatcher.getTrainY(i)
    
    testX = kfoldBatcher.getTestX(i)
    testY = kfoldBatcher.getTestY(i)
    
    nb = BernoulliNB()
    nb.fit(trainX, trainY)
    nbAccuracy += accuracy_score(nb.predict(testX), testY)

    svm = LinearSVC()
    svm.fit(trainX, trainY)
    svmAccuracy += accuracy_score(svm.predict(testX), testY)
    
nbAccuracy /= nFold
svmAccuracy /= nFold
nFeatures = int(nFeatures/nFold)
results['features'].append('unigrams+bigrams')
results['nFeatures'].append(nFeatures)
results['nb'].append(nbAccuracy)
results['svm'].append(svmAccuracy)

print('Features: unigrams+bigrams presence')
print('Number of Features:', nFeatures)
print('Naive Bayes Accuracy:', nbAccuracy)
print('SVM Accuracy:', svmAccuracy)

0


IndexError: list index out of range

<h3>Unigrams + POS</h3>

Given a list of texts, the class <i>POSTagger</i> returns the part-of-speech tag sequence of each text using the nltk library. As an example, consider the sentence: <b>The movie was really great! I didn't expect that plot twist!</b>

In [33]:
from features import POSTagger

texts = ["The movie was really great ! I didn't expect that plot twist !"]
posTagger = POSTagger()
posTagger.getPOS(texts)

[['DT',
  'NN',
  'VBD',
  'RB',
  'JJ',
  '.',
  'PRP',
  'VBP',
  'VB',
  'IN',
  'NN',
  'NN',
  '.']]

Process the POS sequences of the corpus for later use.

In [None]:
posNegatives = posTagger.getPOS(negatives)
posPositives = posTagger.getPOS(positives)

```python
class UnigramPOSFeature:
    def __init__(self):
        self.unigrams = []
    def process(self, negatedTexts, posOfTexts): ...
    def get(self, negatedTexts, posOfTexts): ...
```

<i>UnigramPOSFeature.process(negatedTexts, posOfTexts)</i> collects the unique unigram and POS combinations that appear at least 4 times in the training data and stores them in <i>self.unigrams</i>.

<i>UnigramPOSFeature.get(negatedTexts, posOfTexts)</i> returns the presence of these unigram and POS combinations in the texts as a numpy array.
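
The pairing of words with tags can be sketched as below. The `_` separator and the helper names are assumptions for this example; the sketch also assumes one POS tag per whitespace-separated token, as in the <i>POSTagger</i> output shown earlier.

```python
from collections import Counter

def pair_with_pos(negated_text, pos_tags, separator="_"):
    """Join each token with its POS tag, e.g. 'great' + 'JJ' -> 'great_JJ'."""
    tokens = negated_text.split()
    if len(tokens) != len(pos_tags):
        raise ValueError("expected one POS tag per token")
    return [token + separator + tag for token, tag in zip(tokens, pos_tags)]

def frequent_pairs(negated_texts, pos_of_texts, min_count=4):
    """Keep the unigram/POS pairs seen at least min_count times in training."""
    counts = Counter()
    for text, tags in zip(negated_texts, pos_of_texts):
        counts.update(pair_with_pos(text, tags))
    return [pair for pair, count in counts.items() if count >= min_count]

pairs = pair_with_pos("The movie was really great !", ["DT", "NN", "VBD", "RB", "JJ", "."])
print(pairs)
# ['The_DT', 'movie_NN', 'was_VBD', 'really_RB', 'great_JJ', '!_.']
```

Treating the same word with different tags as different features lets the model separate, for example, "love" used as a verb from "love" used as a noun.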

In [23]:
from features import UnigramPOSFeature

nbAccuracy = 0
svmAccuracy = 0
nFeatures = 0
i = 0
for trainIndex, testIndex in kfold.split(negatives):
    unigramPOSFeature = UnigramPOSFeature()
    negatedTextsTrain = [negatedNegatives[index] for index in trainIndex] + [negatedPositives[index] for index in trainIndex]
    posTextsTrain = [posNegatives[index] for index in trainIndex] + [posPositives[index] for index in trainIndex]
    unigramPOSFeature.process(negatedTextsTrain, posTextsTrain)
    nFeatures += len(unigramPOSFeature.unigrams)
    
    featuresNegative = unigramPOSFeature.get(negatedNegatives, posNegatives)
    featuresPositive = unigramPOSFeature.get(negatedPositives, posPositives)

    kfoldBatcher = KFoldBatcher(nFold, featuresNegative, featuresPositive)
    
    trainX = kfoldBatcher.getTrainX(i)
    trainY = kfoldBatcher.getTrainY(i)
    
    testX = kfoldBatcher.getTestX(i)
    testY = kfoldBatcher.getTestY(i)
    
    nb = BernoulliNB()
    nb.fit(trainX, trainY)
    nbAccuracy += accuracy_score(nb.predict(testX), testY)

    svm = LinearSVC()
    svm.fit(trainX, trainY)
    svmAccuracy += accuracy_score(svm.predict(testX), testY)
    
    i += 1
    
nbAccuracy /= nFold
svmAccuracy /= nFold
nFeatures = int(nFeatures/nFold)
results['features'].append('unigrams+POS')
results['nFeatures'].append(nFeatures)
results['nb'].append(nbAccuracy)
results['svm'].append(svmAccuracy)

print('Features: unigrams+POS')
print('Number of Features:', nFeatures)
print('Naive Bayes Accuracy:', nbAccuracy)
print('SVM Accuracy:', svmAccuracy)

Features: unigrams+POS
Number of Features: 13707
Naive Bayes Accuracy: 0.776824034335
SVM Accuracy: 0.79113018598


<h3>Adjectives</h3>

In [21]:
from features import AdjectiveFeature

In [22]:
nbAccuracy = 0
svmAccuracy = 0
nFeatures = 0
i = 0
for trainIndex, testIndex in kfold.split(negatives):
    adjFeature = AdjectiveFeature()
    textsTrain = [negatives[index] for index in trainIndex] + [positives[index] for index in trainIndex]
    posTextsTrain = [posNegatives[index] for index in trainIndex] + [posPositives[index] for index in trainIndex]
    adjFeature.process(textsTrain, posTextsTrain)
    nFeatures += len(adjFeature.adjectives)
    
    featuresNegative = adjFeature.get(negatives)
    featuresPositive = adjFeature.get(positives)

    kfoldBatcher = KFoldBatcher(nFold, featuresNegative, featuresPositive)
    
    trainX = kfoldBatcher.getTrainX(i)
    trainY = kfoldBatcher.getTrainY(i)
    
    testX = kfoldBatcher.getTestX(i)
    testY = kfoldBatcher.getTestY(i)
    
    nb = BernoulliNB()
    nb.fit(trainX, trainY)
    nbAccuracy += accuracy_score(nb.predict(testX), testY)

    svm = LinearSVC()
    svm.fit(trainX, trainY)
    svmAccuracy += accuracy_score(svm.predict(testX), testY)
    
    i += 1
    
nbAccuracy /= nFold
svmAccuracy /= nFold
nFeatures = int(nFeatures/nFold)
results['features'].append('adjectives')
results['nFeatures'].append(nFeatures)
results['nb'].append(nbAccuracy)
results['svm'].append(svmAccuracy)

print('Features: adjectives')
print('Number of Features:', nFeatures)
print('Naive Bayes Accuracy:', nbAccuracy)
print('SVM Accuracy:', svmAccuracy)

Features: adjectives
Number of Features: 11261
Naive Bayes Accuracy: 0.777539341917
SVM Accuracy: 0.76251788269


<h3>Top 2633 Unigrams</h3>

In [24]:
nbAccuracy = 0
svmAccuracy = 0
nFeatures = 0
for i in range(nFold):
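    # Keep the first 2633 feature columns for every document; this assumes the
    # columns are ordered from most to least frequent unigram.
    featuresNegative = unigramFeaturesNegative[i][:, :2633]
    featuresPositive = unigramFeaturesPositive[i][:, :2633]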

    kfoldBatcher = KFoldBatcher(nFold, featuresNegative, featuresPositive)
    
    trainX = kfoldBatcher.getTrainX(i)
    trainY = kfoldBatcher.getTrainY(i)
    
    testX = kfoldBatcher.getTestX(i)
    testY = kfoldBatcher.getTestY(i)
    
    nb = BernoulliNB()
    nb.fit(trainX, trainY)
    nbAccuracy += accuracy_score(nb.predict(testX), testY)

    svm = LinearSVC()
    svm.fit(trainX, trainY)
    svmAccuracy += accuracy_score(svm.predict(testX), testY)
    
nbAccuracy /= nFold
svmAccuracy /= nFold
nFeatures = 2633
results['features'].append('top 2633 unigrams')
results['nFeatures'].append(nFeatures)
results['nb'].append(nbAccuracy)
results['svm'].append(svmAccuracy)

print('Features: top 2633 unigrams')
print('Number of Features:', nFeatures)
print('Naive Bayes Accuracy:', nbAccuracy)
print('SVM Accuracy:', svmAccuracy)

Features: top 2633 unigrams
Number of Features: 2633
Naive Bayes Accuracy: 0.778254649499
SVM Accuracy: 0.792560801144
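
Restricting a (documents by features) matrix to its first 2633 features means slicing the column axis. A plain `[:2633]` slice acts on the rows instead, and with only 700 documents it returns the whole matrix unchanged, which would reproduce the unigram presence accuracies exactly. The toy example below (made-up data) shows the difference; which columns count as "top" depends on how the unigrams are ordered, which is not shown in this notebook.

```python
import numpy as np

features = np.arange(20).reshape(4, 5)   # 4 documents x 5 features (toy data)
k = 3

row_slice = features[:k]         # first k documents, all 5 features
column_slice = features[:, :k]   # all 4 documents, first k features
oversized = features[:10]        # slicing past the end returns every row

print(row_slice.shape)      # (3, 5)
print(column_slice.shape)   # (4, 3)
print(oversized.shape)      # (4, 5)
```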


<h3>Unigrams + position</h3>

In [25]:
from features import PositionTagger

positionTagger = PositionTagger()
positionNegatives = positionTagger.getPositions(negatedNegatives)
positionPositives = positionTagger.getPositions(negatedPositives)

In [26]:
from features import UnigramPositionFeature

In [27]:
nbAccuracy = 0
svmAccuracy = 0
nFeatures = 0
i = 0
for trainIndex, testIndex in kfold.split(negatives):
    unigramPositionFeature = UnigramPositionFeature()
    negatedTextsTrain = [negatedNegatives[index] for index in trainIndex] + [negatedPositives[index] for index in trainIndex]
    positionTextsTrain = [positionNegatives[index] for index in trainIndex] + [positionPositives[index] for index in trainIndex]
    unigramPositionFeature.process(negatedTextsTrain, positionTextsTrain)
    nFeatures += len(unigramPositionFeature.unigrams)
    
    featuresNegative = unigramPositionFeature.get(negatedNegatives, positionNegatives)
    featuresPositive = unigramPositionFeature.get(negatedPositives, positionPositives)

    kfoldBatcher = KFoldBatcher(nFold, featuresNegative, featuresPositive)
    
    trainX = kfoldBatcher.getTrainX(i)
    trainY = kfoldBatcher.getTrainY(i)
    
    testX = kfoldBatcher.getTestX(i)
    testY = kfoldBatcher.getTestY(i)
    
    nb = BernoulliNB()
    nb.fit(trainX, trainY)
    nbAccuracy += accuracy_score(nb.predict(testX), testY)

    svm = LinearSVC()
    svm.fit(trainX, trainY)
    svmAccuracy += accuracy_score(svm.predict(testX), testY)
    
    i += 1
    
nbAccuracy /= nFold
svmAccuracy /= nFold
nFeatures = int(nFeatures/nFold)
results['features'].append('unigrams+position')
results['nFeatures'].append(nFeatures)
results['nb'].append(nbAccuracy)
results['svm'].append(svmAccuracy)

print('Features: unigrams+position')
print('Number of Features:', nFeatures)
print('Naive Bayes Accuracy:', nbAccuracy)
print('SVM Accuracy:', svmAccuracy)

Features: unigrams+position
Number of Features: 16842
Naive Bayes Accuracy: 0.774678111588
SVM Accuracy: 0.786123032904


In [None]:
unigramPositionFeature.unigrams

In [None]:
print(negatives[0])
negates = textNegator.getNegated(negatives[:2])
print(negates[0])

In [28]:
print(results)

{'features': ['unigrams', 'unigrams', 'bigrams', 'bigrams', 'bigrams', 'adjectives', 'unigrams+POS', 'top 2633 unigrams', 'unigrams+bigrams'], 'svm': [0.78254649499284701, 0.79256080114449212, 0.5, 0.5, 0.75250357653791122, 0.76251788268955645, 0.79113018597997131, 0.79256080114449212, 0.78612303290414876], 'nFeatures': [12660, 12660, 16165, 16165, 16165, 11261, 13707, 2633, 16842], 'nb': [0.77825464949928469, 0.77825464949928469, 0.5, 0.5, 0.74892703862660948, 0.77753934191702434, 0.77682403433476388, 0.77825464949928469, 0.77467811158798272]}
