# Thumbs up? Sentiment Classification using Machine Learning Techniques
<ul>
    <li>by Bo Pang, Lillian Lee and Shivakumar Vaithyanathan</li>
    <li>recreated by <b>Jan Kristoffer Cheng</b> and <b>Johansson Tan</b></li>
</ul>

This Jupyter Notebook tries to recreate the results of the paper by Pang, Lee and Vaithyanathan on sentiment classification. They used movie reviews from IMDb as their corpus and classified each review as having either a positive or a negative sentiment. For this binary classification task, they built models from different features and machine learning techniques: Naive Bayes, Maximum Entropy, and SVM. This project uses only Naive Bayes and SVM.

In [1]:
%reload_ext autoreload
%autoreload 2

<h3>Reading the corpus</h3>

The corpus is readily available online. Several versions exist, each with cleaner data than the one before. The paper's results were obtained with version 0.9, but we used version 1.0.

When extracted, the zip file contains two folders, neg and pos, each holding 700 text files of the corresponding category. The class <i>FileReader</i> reads all the files under a given path. It also separates punctuation from the words so that the two are easy to distinguish later on.
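
As a rough illustration of this preprocessing step, the sketch below splits a text on spaces and detaches punctuation found at the end of a word. The function name `separate_trailing_punctuation` and the punctuation set are assumptions made for this example; the actual logic lives in <i>FileReader</i> and may differ.

```python
# Hypothetical helper for illustration, not the actual FileReader code.
PUNCTUATION = ".,!?;:"  # assumed set of punctuation marks


def separate_trailing_punctuation(text):
    """Split on whitespace and detach punctuation from the end of each word."""
    tokens = []
    for word in text.split():
        trailing = []
        # Peel punctuation characters off the end of the word, one at a time.
        while len(word) > 1 and word[-1] in PUNCTUATION:
            trailing.insert(0, word[-1])
            word = word[:-1]
        tokens.append(word)
        tokens.extend(trailing)
    return " ".join(tokens)


print(separate_trailing_punctuation("I don't like the movie. I didn't enjoy at all."))
```

Punctuation inside a word, such as the apostrophe in "don't", is left untouched because only the end of each word is inspected.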

In [2]:
from file_reader import FileReader

negPath = 'mix20_rand700_tokens_cleaned/tokens/neg/'
posPath = 'mix20_rand700_tokens_cleaned/tokens/pos/'

fileReader = FileReader()

negatives = fileReader.getTexts(negPath)
positives = fileReader.getTexts(posPath)
allTexts = negatives + positives

print('Negative:', len(negatives))
print('Positive:', len(positives))
print('Total:', len(allTexts))

N = len(negatives)

Negative: 700
Positive: 700
Total: 1400


<h3>Appending tags for negations</h3>

The class <i>TextNegator</i> appends <b>--n</b> to every word between a negation and the next punctuation mark. Its output is used for the unigram features. As an example, consider the sentence: <b>I don't like the movie. I didn't enjoy at all.</b> In the code, the punctuation of this example is already separated from the words, as <i>FileReader</i> would have done.

In [3]:
from features import TextNegator

texts = ["I don't like the movie . I didn't enjoy at all ."]
textNegator = TextNegator()
textNegator.getNegated(texts)

["I don't like--n the--n movie--n . I didn't enjoy--n at--n all--n ."]

The lists of negations and punctuation marks used by <i>TextNegator</i> are defined in the features file.
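
The tagging rule can be sketched in a few lines. The `NEGATIONS` and `PUNCTUATION` sets and the function name `negate` are assumptions chosen for this example; the lists actually used by <i>TextNegator</i> are in the features file and are likely longer.

```python
# Hypothetical, shortened lists; the real ones are defined in features.py.
NEGATIONS = {"not", "no", "never", "don't", "didn't", "isn't", "wasn't"}
PUNCTUATION = {".", ",", "!", "?", ";", ":"}


def negate(text):
    """Append --n to every token between a negation and the next punctuation."""
    negating = False
    output = []
    for token in text.split():
        if token in PUNCTUATION:
            negating = False          # punctuation ends the negated span
            output.append(token)
        elif negating:
            output.append(token + "--n")
        else:
            output.append(token)
            if token.lower() in NEGATIONS:
                negating = True       # tag the tokens that follow
    return " ".join(output)


print(negate("I don't like the movie . I didn't enjoy at all ."))
```

The negation word itself is left untagged, which matches the example output shown above.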

Both the negative and the positive texts are processed by <i>TextNegator</i> for later use.

In [4]:
negatedNegatives = textNegator.getNegated(negatives)
negatedPositives = textNegator.getNegated(positives)

<h3>Preparing libraries</h3>

Before anything else, different libraries such as numpy and sklearn should be imported. They will be utilized in building the models later on. 

In [5]:
import numpy as np
from sklearn.model_selection import KFold
from sklearn.naive_bayes import BernoulliNB
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score
from k_fold import KFoldBatcher

nFold = 3
nPerFold = int(N/nFold)
print(nPerFold)

kfold = KFold(nFold)
results = []

233


Similar to the paper, we used 3-fold cross validation with each fold having 233 texts from each category. The class <i>KFoldBatcher</i> splits the dimensions and classes into batches for cross validation. The variable <i>results</i> will hold the different outputs such as features used, average number of features, and average accuracies of both Naive Bayes and SVM.
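
The implementation of <i>KFoldBatcher</i> is not shown in this notebook, so the following is only a sketch of what such a batcher might do. It assumes contiguous folds, equal class sizes, labels 0 (negative) and 1 (positive), and that leftover texts stay in the training set; the name `make_fold` is hypothetical.

```python
import numpy as np


def make_fold(features_neg, features_pos, n_fold, i):
    """Return (train_x, train_y, test_x, test_y) with fold i held out.

    Hypothetical stand-in for KFoldBatcher, for illustration only.
    """
    n_texts = len(features_neg)
    n_per_fold = n_texts // n_fold

    # Mark the rows of fold i as test rows (same rows for both classes).
    test_mask = np.zeros(n_texts, dtype=bool)
    test_mask[i * n_per_fold:(i + 1) * n_per_fold] = True

    train_x = np.concatenate((features_neg[~test_mask], features_pos[~test_mask]))
    test_x = np.concatenate((features_neg[test_mask], features_pos[test_mask]))

    n_train = n_texts - n_per_fold
    train_y = np.array([0] * n_train + [1] * n_train)
    test_y = np.array([0] * n_per_fold + [1] * n_per_fold)
    return train_x, train_y, test_x, test_y


# Toy data: 7 texts per class, 2 features each.
neg = np.zeros((7, 2), dtype=int)
pos = np.ones((7, 2), dtype=int)
train_x, train_y, test_x, test_y = make_fold(neg, pos, n_fold=3, i=0)
print(train_x.shape, test_x.shape)
```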

<h3>Unigrams frequency</h3>

```python
class UnigramFeature:
    def __init__(self):
        self.unigrams = []
    def process(self, negatedTexts): ...  # implemented in features.py
    def get(self, negatedTexts, type='pres'): ...  # implemented in features.py
```

<i>UnigramFeature.process(negatedTexts)</i> collects all the unigrams that appear at least 4 times in the training data and stores them in <i>self.unigrams</i>.

<i>UnigramFeature.get(negatedTexts, type='freq')</i> returns the unigrams' frequencies of the negated texts as a numpy array.
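
The two steps can be sketched as follows. This is not the code from the features file: the function names are hypothetical, and the sketch assumes that "at least 4 times" refers to the total number of occurrences across the training texts.

```python
from collections import Counter

import numpy as np


def build_vocabulary(texts, min_count=4):
    """Keep the unigrams that occur at least min_count times in the texts."""
    counts = Counter(token for text in texts for token in text.split())
    return sorted(token for token, count in counts.items() if count >= min_count)


def vectorize(texts, vocabulary, kind="pres"):
    """Return one row per text and one column per vocabulary entry."""
    index = {token: column for column, token in enumerate(vocabulary)}
    matrix = np.zeros((len(texts), len(vocabulary)), dtype=int)
    for row, text in enumerate(texts):
        for token in text.split():
            if token in index:
                matrix[row, index[token]] += 1   # frequency count
    if kind == "pres":
        matrix = (matrix > 0).astype(int)        # presence: 0 or 1
    return matrix


train = ["good good good good bad", "bad bad bad fine"]
vocabulary = build_vocabulary(train)
print(vocabulary)
print(vectorize(train, vocabulary, kind="freq"))
```

Here "fine" occurs only once and is dropped from the vocabulary, while "good" and "bad" each occur 4 times and are kept.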

In [6]:
from features import UnigramFeature

nbAccuracy = 0
svmAccuracy = 0
nFeatures = 0
i = 0
for trainIndex, testIndex in kfold.split(negatives):
    unigramFeature = UnigramFeature()
    unigramFeature.process([negatedNegatives[index] for index in trainIndex] + [negatedPositives[index] for index in trainIndex])
    nFeatures += len(unigramFeature.unigrams)
    
    featuresNegative = unigramFeature.get(negatedNegatives, type='freq')
    featuresPositive = unigramFeature.get(negatedPositives, type='freq')
    
    kfoldBatcher = KFoldBatcher(nFold, featuresNegative, featuresPositive)
    trainX = kfoldBatcher.getTrainX(i)
    trainY = kfoldBatcher.getTrainY(i)
    
    testX = kfoldBatcher.getTestX(i)
    testY = kfoldBatcher.getTestY(i)
    
    nb = BernoulliNB()
    nb.fit(trainX, trainY)
    nbAccuracy += accuracy_score(nb.predict(testX), testY)

    svm = LinearSVC()
    svm.fit(trainX, trainY)
    svmAccuracy += accuracy_score(svm.predict(testX), testY)
    
    i += 1
    
nbAccuracy /= nFold
svmAccuracy /= nFold
nFeatures = int(nFeatures/nFold)
result = {
    'features': 'unigrams', 
    'nFeatures': nFeatures, 
    'freqPres': 'freq', 
    'nb': nbAccuracy, 
    'svm': svmAccuracy
}
results.append(result)

print('Number of Features:', nFeatures)
print('Naive Bayes Accuracy:', nbAccuracy)
print('SVM Accuracy:', svmAccuracy)

Number of Features: 12660
Naive Bayes Accuracy: 0.778254649499
SVM Accuracy: 0.782546494993


<h3>Unigrams presence</h3>

In contrast to the previous section, the models here use the presence of the unigrams rather than their frequency. Thus, each feature has a value of either 0 or 1.

<i>UnigramFeature.get(negatedTexts, type='pres')</i> returns the unigrams' presence of the negated texts as a numpy array.
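
To make the difference concrete, the sketch below turns a small, made-up frequency matrix into a presence matrix with NumPy. As a side note, scikit-learn's `BernoulliNB` binarizes its input by default (`binarize=0.0`), so it sees the same 0/1 values for both feature types; this is consistent with the identical Naive Bayes accuracies reported in this section and the previous one.

```python
import numpy as np

# Made-up frequency counts: 2 texts (rows), 3 unigrams (columns).
freq = np.array([[0, 3, 1],
                 [2, 0, 0]])

# Presence keeps only whether a unigram occurs at all.
pres = (freq > 0).astype(int)
print(pres)
```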

In [7]:
unigramFeaturesNegative = []
unigramFeaturesPositive = []

nbAccuracy = 0
svmAccuracy = 0
nFeatures = 0
i = 0
for trainIndex, testIndex in kfold.split(negatives):
    unigramFeature = UnigramFeature()
    unigramFeature.process([negatedNegatives[index] for index in trainIndex] + [negatedPositives[index] for index in trainIndex])
    nFeatures += len(unigramFeature.unigrams)
    
    featuresNegative = unigramFeature.get(negatedNegatives, type='pres')
    featuresPositive = unigramFeature.get(negatedPositives, type='pres')
    
    unigramFeaturesNegative.append(featuresNegative)
    unigramFeaturesPositive.append(featuresPositive)

    kfoldBatcher = KFoldBatcher(nFold, featuresNegative, featuresPositive)
    
    trainX = kfoldBatcher.getTrainX(i)
    trainY = kfoldBatcher.getTrainY(i)
    
    testX = kfoldBatcher.getTestX(i)
    testY = kfoldBatcher.getTestY(i)
    
    nb = BernoulliNB()
    nb.fit(trainX, trainY)
    nbAccuracy += accuracy_score(nb.predict(testX), testY)

    svm = LinearSVC()
    svm.fit(trainX, trainY)
    svmAccuracy += accuracy_score(svm.predict(testX), testY)
    
    i += 1
    
nbAccuracy /= nFold
svmAccuracy /= nFold
nFeatures = int(nFeatures/nFold)
result = {
    'features': 'unigrams', 
    'nFeatures': nFeatures, 
    'freqPres': 'pres', 
    'nb': nbAccuracy, 
    'svm': svmAccuracy
}
results.append(result)

print('Number of Features:', nFeatures)
print('Naive Bayes Accuracy:', nbAccuracy)
print('SVM Accuracy:', svmAccuracy)

Number of Features: 12660
Naive Bayes Accuracy: 0.778254649499
SVM Accuracy: 0.792560801144


<h3>Bigrams</h3>

```python
class BigramFeature:
    def __init__(self):
        self.bigrams = []
    def process(self, texts): ...  # implemented in features.py
    def get(self, texts): ...  # implemented in features.py
```

<i>BigramFeature.process(texts)</i> collects the top 16165 bigrams that appear at least 7 times in the training data and stores them in <i>self.bigrams</i>.

<i>BigramFeature.get(texts)</i> returns the presence of the bigrams of the texts as a numpy array.
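
A minimal sketch of the bigram selection, assuming bigrams are pairs of adjacent space-separated tokens and that "top" means most frequent. The function names are hypothetical and the real <i>BigramFeature</i> may break ties or order its bigrams differently.

```python
from collections import Counter


def extract_bigrams(text):
    """Return the pairs of adjacent tokens in a text."""
    tokens = text.split()
    return list(zip(tokens, tokens[1:]))


def top_bigrams(texts, min_count=7, limit=16165):
    """Keep up to `limit` of the most frequent bigrams seen at least min_count times."""
    counts = Counter(bigram for text in texts for bigram in extract_bigrams(text))
    frequent = [bigram for bigram, count in counts.most_common() if count >= min_count]
    return frequent[:limit]


print(extract_bigrams("not a good movie"))
print(top_bigrams(["a b a b a b"], min_count=3, limit=1))
```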

In [8]:
from features import BigramFeature

bigramFeaturesNegative = []
bigramFeaturesPositive = []

nbAccuracy = 0
svmAccuracy = 0
nFeatures = 0
i = 0
for trainIndex, testIndex in kfold.split(negatives):
    bigramFeature = BigramFeature()
    bigramFeature.process([negatives[index] for index in trainIndex] + [positives[index] for index in trainIndex])
    nFeatures += len(bigramFeature.bigrams)
    
    featuresNegative = bigramFeature.get(negatives)
    featuresPositive = bigramFeature.get(positives)
    
    bigramFeaturesNegative.append(featuresNegative)
    bigramFeaturesPositive.append(featuresPositive)

    kfoldBatcher = KFoldBatcher(nFold, featuresNegative, featuresPositive)
    
    trainX = kfoldBatcher.getTrainX(i)
    trainY = kfoldBatcher.getTrainY(i)
    
    testX = kfoldBatcher.getTestX(i)
    testY = kfoldBatcher.getTestY(i)
    
    nb = BernoulliNB()
    nb.fit(trainX, trainY)
    nbAccuracy += accuracy_score(nb.predict(testX), testY)

    svm = LinearSVC()
    svm.fit(trainX, trainY)
    svmAccuracy += accuracy_score(svm.predict(testX), testY)
    
    i += 1
    
nbAccuracy /= nFold
svmAccuracy /= nFold
nFeatures = int(nFeatures/nFold)
result = {
    'features': 'bigrams', 
    'nFeatures': nFeatures, 
    'freqPres': 'pres', 
    'nb': nbAccuracy, 
    'svm': svmAccuracy
}
results.append(result)

print('Number of Features:', nFeatures)
print('Naive Bayes Accuracy:', nbAccuracy)
print('SVM Accuracy:', svmAccuracy)

Number of Features: 16165
Naive Bayes Accuracy: 0.745350500715
SVM Accuracy: 0.752503576538


<h3>Unigrams and Bigrams</h3>

The models here utilize both unigrams and bigrams, which is essentially concatenating both features before training.

In [9]:
nbAccuracy = 0
svmAccuracy = 0
nFeatures = 0
for i in range(nFold):
    featuresNegative = np.concatenate((unigramFeaturesNegative[i], bigramFeaturesNegative[i]), axis=1)
    featuresPositive = np.concatenate((unigramFeaturesPositive[i], bigramFeaturesPositive[i]), axis=1)
    nFeatures += unigramFeaturesNegative[i].shape[1] + bigramFeaturesNegative[i].shape[1]

    kfoldBatcher = KFoldBatcher(nFold, featuresNegative, featuresPositive)
    
    trainX = kfoldBatcher.getTrainX(i)
    trainY = kfoldBatcher.getTrainY(i)
    
    testX = kfoldBatcher.getTestX(i)
    testY = kfoldBatcher.getTestY(i)
    
    nb = BernoulliNB()
    nb.fit(trainX, trainY)
    nbAccuracy += accuracy_score(nb.predict(testX), testY)

    svm = LinearSVC()
    svm.fit(trainX, trainY)
    svmAccuracy += accuracy_score(svm.predict(testX), testY)
    
nbAccuracy /= nFold
svmAccuracy /= nFold
nFeatures = int(nFeatures/nFold)
result = {
    'features': 'unigrams+bigrams', 
    'nFeatures': nFeatures, 
    'freqPres': 'pres', 
    'nb': nbAccuracy, 
    'svm': svmAccuracy
}
results.append(result)

print('Number of Features:', nFeatures)
print('Naive Bayes Accuracy:', nbAccuracy)
print('SVM Accuracy:', svmAccuracy)

Number of Features: 28825
Naive Bayes Accuracy: 0.776824034335
SVM Accuracy: 0.808297567954


<h3>Unigrams + POS</h3>

The class <i>POSTagger</i> returns the part-of-speech sequence using the nltk library given a string. As an example, consider the sentence: <b>The movie was really great! I didn't expect that plot twist!</b>

In [10]:
from features import POSTagger

texts = ["The movie was really great ! I didn't expect that plot twist !"]
posTagger = POSTagger()
posTagger.getPOS(texts)

[['DT',
  'NN',
  'VBD',
  'RB',
  'JJ',
  '.',
  'PRP',
  'VBP',
  'VB',
  'IN',
  'NN',
  'NN',
  '.']]

Process the POS sequences of the corpus for later use.

In [11]:
posNegatives = posTagger.getPOS(negatives)
posPositives = posTagger.getPOS(positives)

```python
class UnigramPOSFeature:
    def __init__(self):
        self.unigrams = []
    def process(self, negatedTexts, posOfTexts): ...  # implemented in features.py
    def get(self, negatedTexts, posOfTexts): ...  # implemented in features.py
```

<i>UnigramPOSFeature.process(negatedTexts, posOfTexts)</i> collects the unique unigram and POS combinations that appear at least 4 times in the training data and stores them in <i>self.unigrams</i>.

<i>UnigramPOSFeature.get(negatedTexts, posOfTexts)</i> returns the presence of the unigrams of the texts as a numpy array.
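
One way to picture a unigram and POS combination is a token joined with its tag, so that the same word with different tags becomes two distinct features. The sketch below assumes the tag sequence is aligned with the space-separated tokens; the `/` separator and the name `tag_unigrams` are hypothetical.

```python
def tag_unigrams(text, tags):
    """Pair each space-separated token with its POS tag."""
    return [token + "/" + tag for token, tag in zip(text.split(), tags)]


# "love" as a verb and as a noun yields two different features.
print(tag_unigrams("I love this movie", ["PRP", "VBP", "DT", "NN"]))
print(tag_unigrams("a love story", ["DT", "NN", "NN"]))
```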

In [12]:
from features import UnigramPOSFeature

nbAccuracy = 0
svmAccuracy = 0
nFeatures = 0
i = 0
for trainIndex, testIndex in kfold.split(negatives):
    unigramPOSFeature = UnigramPOSFeature()
    negatedTextsTrain = [negatedNegatives[index] for index in trainIndex] + [negatedPositives[index] for index in trainIndex]
    posTextsTrain = [posNegatives[index] for index in trainIndex] + [posPositives[index] for index in trainIndex]
    unigramPOSFeature.process(negatedTextsTrain, posTextsTrain)
    nFeatures += len(unigramPOSFeature.unigrams)
    
    featuresNegative = unigramPOSFeature.get(negatedNegatives, posNegatives)
    featuresPositive = unigramPOSFeature.get(negatedPositives, posPositives)

    kfoldBatcher = KFoldBatcher(nFold, featuresNegative, featuresPositive)
    
    trainX = kfoldBatcher.getTrainX(i)
    trainY = kfoldBatcher.getTrainY(i)
    
    testX = kfoldBatcher.getTestX(i)
    testY = kfoldBatcher.getTestY(i)
    
    nb = BernoulliNB()
    nb.fit(trainX, trainY)
    nbAccuracy += accuracy_score(nb.predict(testX), testY)

    svm = LinearSVC()
    svm.fit(trainX, trainY)
    svmAccuracy += accuracy_score(svm.predict(testX), testY)
    
    i += 1
    
nbAccuracy /= nFold
svmAccuracy /= nFold
nFeatures = int(nFeatures/nFold)
result = {
    'features': 'unigrams+POS', 
    'nFeatures': nFeatures, 
    'freqPres': 'pres', 
    'nb': nbAccuracy, 
    'svm': svmAccuracy
}
results.append(result)

print('Number of Features:', nFeatures)
print('Naive Bayes Accuracy:', nbAccuracy)
print('SVM Accuracy:', svmAccuracy)

Number of Features: 13707
Naive Bayes Accuracy: 0.776824034335
SVM Accuracy: 0.79113018598


<h3>Adjectives</h3>

```python
class AdjectiveFeature:
    def __init__(self):
        self.adjectives = []
    def process(self, texts, posOfTexts): ...  # implemented in features.py
    def get(self, texts): ...  # implemented in features.py
```

<i>AdjectiveFeature.process(texts, posOfTexts)</i> collects the adjectives that appear in the training data and stores them in <i>self.adjectives</i>.

<i>AdjectiveFeature.get(texts)</i> returns the presence of the adjectives of the texts as a numpy array.
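
A sketch of how adjectives could be collected from the tagged texts, assuming Penn Treebank tags (where adjective tags start with `JJ`) and tag sequences aligned with the space-separated tokens. The function name `collect_adjectives` is hypothetical.

```python
def collect_adjectives(texts, pos_of_texts):
    """Return the sorted set of tokens tagged as adjectives (JJ, JJR, JJS)."""
    adjectives = set()
    for text, tags in zip(texts, pos_of_texts):
        for token, tag in zip(text.split(), tags):
            if tag.startswith("JJ"):
                adjectives.add(token)
    return sorted(adjectives)


texts = ["The movie was really great ! I didn't expect that plot twist !"]
tags = [["DT", "NN", "VBD", "RB", "JJ", ".", "PRP", "VBP", "VB", "IN", "NN", "NN", "."]]
print(collect_adjectives(texts, tags))
```

The tags used here are the ones <i>POSTagger</i> returned for the same sentence earlier in the notebook.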

In [13]:
from features import AdjectiveFeature

In [14]:
nbAccuracy = 0
svmAccuracy = 0
nFeatures = 0
i = 0
for trainIndex, testIndex in kfold.split(negatives):
    adjFeature = AdjectiveFeature()
    textsTrain = [negatives[index] for index in trainIndex] + [positives[index] for index in trainIndex]
    posTextsTrain = [posNegatives[index] for index in trainIndex] + [posPositives[index] for index in trainIndex]
    adjFeature.process(textsTrain, posTextsTrain)
    nFeatures += len(adjFeature.adjectives)
    
    featuresNegative = adjFeature.get(negatives)
    featuresPositive = adjFeature.get(positives)

    kfoldBatcher = KFoldBatcher(nFold, featuresNegative, featuresPositive)
    
    trainX = kfoldBatcher.getTrainX(i)
    trainY = kfoldBatcher.getTrainY(i)
    
    testX = kfoldBatcher.getTestX(i)
    testY = kfoldBatcher.getTestY(i)
    
    nb = BernoulliNB()
    nb.fit(trainX, trainY)
    nbAccuracy += accuracy_score(nb.predict(testX), testY)

    svm = LinearSVC()
    svm.fit(trainX, trainY)
    svmAccuracy += accuracy_score(svm.predict(testX), testY)
    
    i += 1
    
nbAccuracy /= nFold
svmAccuracy /= nFold
nFeatures = int(nFeatures/nFold)
result = {
    'features': 'adjectives', 
    'nFeatures': nFeatures, 
    'freqPres': 'pres', 
    'nb': nbAccuracy, 
    'svm': svmAccuracy
}
results.append(result)

print('Number of Features:', nFeatures)
print('Naive Bayes Accuracy:', nbAccuracy)
print('SVM Accuracy:', svmAccuracy)

Number of Features: 11261
Naive Bayes Accuracy: 0.777539341917
SVM Accuracy: 0.76251788269


<h3>Top 2633 unigrams</h3>

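This model is meant to use only the top 2633 unigrams as features. Note that the slice <i>[:2633]</i> in the cell below applies to the rows (texts) of the feature arrays rather than to their columns (features), so with 700 texts per class all unigram features are still used. This explains why the accuracies are identical to those of the unigram presence model.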
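
A toy array (not the real features) sketches how NumPy treats a plain slice versus a column slice on a 2-D array with one row per text and one column per feature:

```python
import numpy as np

features = np.arange(12).reshape(3, 4)  # 3 texts (rows), 4 features (columns)

rows = features[:2]        # first 2 texts, all 4 features
cols = features[:, :2]     # all 3 texts, first 2 features
oversized = features[:100] # a row slice longer than the array keeps every row

print(rows.shape)       # (2, 4)
print(cols.shape)       # (3, 2)
print(oversized.shape)  # (3, 4)
```

Restricting the number of features therefore needs the column form, for example `[:, :2633]`. Whether the first columns correspond to the most frequent unigrams depends on how <i>UnigramFeature</i> orders its vocabulary, which is not shown in this notebook.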

In [15]:
nbAccuracy = 0
svmAccuracy = 0
nFeatures = 0
for i in range(nFold):
    featuresNegative = unigramFeaturesNegative[i][:2633]
    featuresPositive = unigramFeaturesPositive[i][:2633]

    kfoldBatcher = KFoldBatcher(nFold, featuresNegative, featuresPositive)
    
    trainX = kfoldBatcher.getTrainX(i)
    trainY = kfoldBatcher.getTrainY(i)
    
    testX = kfoldBatcher.getTestX(i)
    testY = kfoldBatcher.getTestY(i)
    
    nb = BernoulliNB()
    nb.fit(trainX, trainY)
    nbAccuracy += accuracy_score(nb.predict(testX), testY)

    svm = LinearSVC()
    svm.fit(trainX, trainY)
    svmAccuracy += accuracy_score(svm.predict(testX), testY)
    
nbAccuracy /= nFold
svmAccuracy /= nFold
nFeatures = 2633
result = {
    'features': 'top 2633 unigrams', 
    'nFeatures': nFeatures, 
    'freqPres': 'pres', 
    'nb': nbAccuracy, 
    'svm': svmAccuracy
}
results.append(result)

print('Number of Features:', nFeatures)
print('Naive Bayes Accuracy:', nbAccuracy)
print('SVM Accuracy:', svmAccuracy)

Number of Features: 2633
Naive Bayes Accuracy: 0.778254649499
SVM Accuracy: 0.792560801144


<h3>Unigrams + position</h3>

The class <i>PositionTagger</i> returns the position sequence given a string, considering the first quarter, middle and last quarter. As an example, consider the sentence: <b>The movie was really great! I didn't expect that plot twist!</b>

In [16]:
from features import PositionTagger

texts = ["The movie was really great ! I didn't expect that plot twist !"]
positionTagger = PositionTagger()
positionTagger.getPositions(texts)

[[0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2]]

As seen in the example, splitting the sentence on spaces yields 13 tokens, since punctuation marks count as tokens. The output is consistent with a quarter being 3 tokens (13 divided by 4, rounded down): the first 3 tokens get position tag 0, the last 3 tokens get position tag 2, and the 7 tokens in between get position tag 1.
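
The sketch below reproduces the tags shown for the example sentence. It assumes that a quarter is the number of space-separated tokens divided by four and rounded down; the function name `position_tags` is hypothetical and <i>PositionTagger</i> may be implemented differently.

```python
def position_tags(text):
    """Tag each token as first quarter (0), middle (1), or last quarter (2)."""
    n_tokens = len(text.split())
    quarter = n_tokens // 4
    tags = []
    for i in range(n_tokens):
        if i < quarter:
            tags.append(0)
        elif i >= n_tokens - quarter:
            tags.append(2)
        else:
            tags.append(1)
    return tags


sentence = "The movie was really great ! I didn't expect that plot twist !"
print(position_tags(sentence))
```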

Process the position sequences of the corpus for later use.

In [17]:
positionNegatives = positionTagger.getPositions(negatedNegatives)
positionPositives = positionTagger.getPositions(negatedPositives)

```python
class UnigramPositionFeature:
    def __init__(self):
        self.unigrams = []
    def process(self, negatedTexts, positionsOfTexts): ...  # implemented in features.py
    def get(self, negatedTexts, positionsOfTexts): ...  # implemented in features.py
```

<i>UnigramPositionFeature.process(negatedTexts, positionsOfTexts)</i> collects the unigram and position combinations that appear at least 4 times in the training data and stores them in <i>self.unigrams</i>.

<i>UnigramPositionFeature.get(negatedTexts, positionsOfTexts)</i> returns the presence of the unigrams of the texts as a numpy array.

In [18]:
from features import UnigramPositionFeature

In [19]:
nbAccuracy = 0
svmAccuracy = 0
nFeatures = 0
i = 0
for trainIndex, testIndex in kfold.split(negatives):
    unigramPositionFeature = UnigramPositionFeature()
    negatedTextsTrain = [negatedNegatives[index] for index in trainIndex] + [negatedPositives[index] for index in trainIndex]
    positionTextsTrain = [positionNegatives[index] for index in trainIndex] + [positionPositives[index] for index in trainIndex]
    unigramPositionFeature.process(negatedTextsTrain, positionTextsTrain)
    nFeatures += len(unigramPositionFeature.unigrams)
    
    featuresNegative = unigramPositionFeature.get(negatedNegatives, positionNegatives)
    featuresPositive = unigramPositionFeature.get(negatedPositives, positionPositives)

    kfoldBatcher = KFoldBatcher(nFold, featuresNegative, featuresPositive)
    
    trainX = kfoldBatcher.getTrainX(i)
    trainY = kfoldBatcher.getTrainY(i)
    
    testX = kfoldBatcher.getTestX(i)
    testY = kfoldBatcher.getTestY(i)
    
    nb = BernoulliNB()
    nb.fit(trainX, trainY)
    nbAccuracy += accuracy_score(nb.predict(testX), testY)

    svm = LinearSVC()
    svm.fit(trainX, trainY)
    svmAccuracy += accuracy_score(svm.predict(testX), testY)
    
    i += 1
    
nbAccuracy /= nFold
svmAccuracy /= nFold
nFeatures = int(nFeatures/nFold)
result = {
    'features': 'unigrams+position', 
    'nFeatures': nFeatures, 
    'freqPres': 'pres', 
    'nb': nbAccuracy, 
    'svm': svmAccuracy
}
results.append(result)

print('Number of Features:', nFeatures)
print('Naive Bayes Accuracy:', nbAccuracy)
print('SVM Accuracy:', svmAccuracy)

Number of Features: 16842
Naive Bayes Accuracy: 0.774678111588
SVM Accuracy: 0.786123032904


<h3>Results</h3>

This block of code is for displaying the results in table format for better presentation. The results from this recreation and the original paper will be compared.

In [20]:
from IPython.display import HTML, display
html = '<center><b>Table 1 Average three-fold cross validation accuracies</b>'

html += '''
    <table>
        <tr>
            <th></th>
            <th>Features</th>
            <th># of features</th>
            <th>frequency or presence?</th>
            <th>NB</th>
            <th>SVM</th>
        </tr>
'''

for i, result in enumerate(results):
    # One <tr>...</tr> row per experiment; accuracies are shown in percent.
    cells = [
        '(' + str(i + 1) + ')',
        result['features'],
        str(result['nFeatures']),
        result['freqPres'],
        str(round(result['nb'] * 100, 1)),
        str(round(result['svm'] * 100, 1)),
    ]
    html += '<tr>' + ''.join('<td>' + cell + '</td>' for cell in cells) + '</tr>'

html += '</table></center>'
    
display(HTML(html))

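<center><b>Table 1 Average three-fold cross validation accuracies</b></center>
<table>
    <tr>
        <th></th>
        <th>Features</th>
        <th># of features</th>
        <th>frequency or presence?</th>
        <th>NB</th>
        <th>SVM</th>
    </tr>
    <tr>
        <td>(1)</td>
        <td>unigrams</td>
        <td>12660</td>
        <td>freq</td>
        <td>77.8</td>
        <td>78.3</td>
    </tr>
    <tr>
        <td>(2)</td>
        <td>unigrams</td>
        <td>12660</td>
        <td>pres</td>
        <td>77.8</td>
        <td>79.3</td>
    </tr>
    <tr>
        <td>(3)</td>
        <td>bigrams</td>
        <td>16165</td>
        <td>pres</td>
        <td>74.5</td>
        <td>75.3</td>
    </tr>
    <tr>
        <td>(4)</td>
        <td>unigrams+bigrams</td>
        <td>28825</td>
        <td>pres</td>
        <td>77.7</td>
        <td>80.8</td>
    </tr>
    <tr>
        <td>(5)</td>
        <td>unigrams+POS</td>
        <td>13707</td>
        <td>pres</td>
        <td>77.7</td>
        <td>79.1</td>
    </tr>
    <tr>
        <td>(6)</td>
        <td>adjectives</td>
        <td>11261</td>
        <td>pres</td>
        <td>77.8</td>
        <td>76.3</td>
    </tr>
    <tr>
        <td>(7)</td>
        <td>top 2633 unigrams</td>
        <td>2633</td>
        <td>pres</td>
        <td>77.8</td>
        <td>79.3</td>
    </tr>
    <tr>
        <td>(8)</td>
        <td>unigrams+position</td>
        <td>16842</td>
        <td>pres</td>
        <td>77.5</td>
        <td>78.6</td>
    </tr>
</table>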


<center><b>Table 2 Average three-fold cross validation accuracies from the paper</b></center>
<table>
    <tr>
        <th></th>
        <th>Features</th>
        <th># of features</th>
        <th>frequency or presence?</th>
        <th>NB</th>
        <th>SVM</th>
    </tr>
    <tr>
        <td>(1)</td>
        <td>unigrams</td>
        <td>16165</td>
        <td>freq</td>
        <td>78.7</td>
        <td>72.8</td>
    </tr>
    <tr>
        <td>(2)</td>
        <td>unigrams</td>
        <td>16165</td>
        <td>pres</td>
        <td>81.0</td>
        <td>82.9</td>
    </tr>
    <tr>
        <td>(3)</td>
        <td>bigrams</td>
        <td>16165</td>
        <td>pres</td>
        <td>77.3</td>
        <td>77.1</td>
    </tr>
    <tr>
        <td>(4)</td>
        <td>unigrams+bigrams</td>
        <td>32330</td>
        <td>pres</td>
        <td>80.6</td>
        <td>82.7</td>
    </tr>
    <tr>
        <td>(5)</td>
        <td>unigrams+POS</td>
        <td>16695</td>
        <td>pres</td>
        <td>81.5</td>
        <td>81.9</td>
    </tr>
    <tr>
        <td>(6)</td>
        <td>adjectives</td>
        <td>2633</td>
        <td>pres</td>
        <td>77.0</td>
        <td>75.1</td>
    </tr>
    <tr>
        <td>(7)</td>
        <td>top 2633 unigrams</td>
        <td>2633</td>
        <td>pres</td>
        <td>80.3</td>
        <td>81.4</td>
    </tr>
    <tr>
        <td>(8)</td>
        <td>unigrams+position</td>
        <td>22430</td>
        <td>pres</td>
        <td>81.0</td>
        <td>81.6</td>
    </tr>
</table>

The results above show the accuracies in percent. Table 1 shows the results of the recreation, while Table 2 shows the results from the paper. The recreated results differ from those of the original paper but stay close to them: every accuracy is within 4 percentage points of the paper's, except for SVM on unigram frequencies, where the recreation scores 78.3 against the paper's 72.8. The most likely reason for the differences is the preprocessing, namely tokenization (including punctuation detection), text negation, and POS tagging.

For tokenization, the texts were split only on spaces, and punctuation was detected only at the end of a word. For text negation, the paper does not explicitly list the negation words it used, so we compiled our own list (see the NEGATIONS variable in features.py). As for POS tagging, different tools were used: we used the NLTK library, while the paper used Oliver Mason's QTag program.

The researchers of the study intended to show that Naive Bayes performs well in sentiment classification. Support Vector Machines (SVM) almost always outperform Naive Bayes (NB) in other classification problems, but not necessarily in sentiment classification. The results show that the performance of SVM and NB is on par.

Further study can be done on sentiment classification to increase the performance of the models. Some recommendations are having cleaner data and combining the different features and see if it 