In [None]:
import math
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

First, I started off my journey with parsing the data properly.
The code I used to parse the data can be viewed below:

In [None]:
lines = open("all_sentiment_shuffled.txt", "r", encoding="utf8")
x = []
y = []
first_column = []
third_column = []
for num, line in enumerate(lines):
    a = line.rstrip("\n").split(" ", 3)
    if a[1] == "neg":
        y.append(0)
    elif a[1] == "pos":
        y.append(1)
    x.append(a[0] + " " + a[2] + " " + a[3])

If I want to not care about the first column and third column in the data, I can replace the 12th line with the following one:

In [None]:
x.append(a[3])

So it can be re-executed as a whole:

In [None]:
lines = open("all_sentiment_shuffled.txt", "r", encoding="utf8")
x = []
y = []
first_column = []
third_column = []
for num, line in enumerate(lines):
    a = line.rstrip("\n").split(" ", 3)
    if a[1] == "neg":
        y.append(0)
    elif a[1] == "pos":
        y.append(1)
    x.append(a[3])

Data is splitted to intended proportions.

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.20)

Then, we group the positive labels and negative labels:

In [None]:
pos_x_train = []
neg_x_train = []
for i, j in zip(x_train, y_train):
    if j == 0:
        neg_x_train.append(i)
    elif j == 1:
        pos_x_train.append(i)

Then we work out their posterior probabilities with the following two lines.

In [None]:
posterior_pos_x = len(pos_x_train) / (len(pos_x_train) + len(neg_x_train))
posterior_neg_x = len(neg_x_train) / (len(pos_x_train) + len(neg_x_train))

Then, we can leverage sklearn to perform feature extraction phase with the following line.

In [None]:
count_vectorizer = CountVectorizer(ngram_range=(1, 1),  # to use bigrams ngram_range=(2,2)
                                   stop_words='english',
                                   analyzer='word')

If we want to get the form of them where TF-IDF algorithms had run. Then we can likewise the following line:

In [None]:
tfid_vectorizer = TfidfVectorizer(ngram_range=(1, 1),  # to use bigrams ngram_range=(2,2)
                                  stop_words='english',
                                  analyzer='word')

These two objects do all for us. The only thing that remains for us is to change the parameters to see the varying results when they are plugged into our Naive Bayes implementation.

I've implemented Naive Bayes as below:

In [3]:
def naive_bayes(l_count_vectorizer):
    pos_x_train_ar = l_count_vectorizer.fit_transform(pos_x_train).toarray()
    pos_indices = l_count_vectorizer.vocabulary_
    pos_freq = np.sum(pos_x_train_ar, axis=0)
    sum_pos_freq = np.sum(pos_freq)
    neg_x_train_ar = l_count_vectorizer.fit_transform(neg_x_train).toarray()
    neg_indices = l_count_vectorizer.vocabulary_
    neg_freq = np.sum(neg_x_train_ar, axis=0)
    sum_neq_freq = np.sum(neg_freq)
    alpha = 1
    predictions = []
    for r in x_test:
        row = l_count_vectorizer.fit_transform([r]).toarray()[0]
        pos_prob = math.log(posterior_pos_x)
        neg_prob = math.log(posterior_neg_x)
        for vocab, freq in zip(l_count_vectorizer.vocabulary_.keys(), row):
            if vocab in pos_indices:
                pos_prob += math.log((pos_freq[pos_indices[vocab]] + alpha) / (sum_pos_freq + alpha * len(pos_freq))) * freq
            else:
                pos_prob += math.log(alpha / (sum_pos_freq + alpha * len(pos_freq))) * freq

            if vocab in neg_indices:
                neg_prob += math.log((neg_freq[neg_indices[vocab]] + alpha) / (sum_neq_freq + alpha * len(neg_freq))) * freq
            else:
                neg_prob += math.log(alpha / (sum_neq_freq + alpha * len(neg_freq))) * freq

        if pos_prob > neg_prob:
            predictions.append(1)
        else:
            predictions.append(0)
    return predictions

I can get predictions from my naive bayes implementation as follows.

In [None]:
predictions = naive_bayes(count_vectorizer)

In [None]:
tfid_predictions = naive_bayes(tfid_vectorizer)

We can assess the results by means of the functions provided by sklearn.

In [None]:
print(confusion_matrix(y_test, predictions))
print(accuracy_score(y_test, predictions))

In [None]:
print(confusion_matrix(y_test, tfid_predictions))
print(accuracy_score(y_test, tfid_predictions))

Now let's dissect the gotten results:

Let's start with the case in which we benefit from only the unigrams.

In [None]:
# count-vectorizer
[[387  90]
 [114 409]]
accuracy_rate = 0.796

# tf-idf vectorizer
[[396  81]
 [ 90 433]]
accuracy_rate = 0.829

The takeaway from the results above is that TF-IDF increases the success rate of the model.

Let's see what happens if we set the n_grams to bigrams.

In [None]:
# count-vectorizer
[[351 117]
 [155 377]]
accuracy_rate = 0.728

# tf-idf vectorizer
[[400  68]
 [217 315]]
accuracy_rate = 0.715

It reduces the success rate for both of them, notably the TF-IDF vectorizer is affected mostly.

If we use unigrams and bigrams together, then we'll get the best estimation result as can be verified below.

In [None]:
# count-vectorizer
[[394  98]
 [ 66 442]]
accuracy_rate = 0.836

# tf-idf vectorizer
[[405  87]
 [ 67 441]]
accuracy_rate = 0.846

Now let's try to not use stop words filtering.

In [None]:
# count-vectorizer
[[413  70]
 [101 416]]
accuracy_rate = 0.829

# tf-idf vectorizer
[[420  63]
 [103 414]]
accuracy_rate = 0.834

The remarkable inference is that when not using filtering out, we get a higher success rate.

Now let's end our writing with the last configuration containing too the first and third columns as training data.

In [None]:
# count-vectorizer
[[416  78]
 [ 87 419]]
accuracy_rate = 0.835

# tf-idf vectorizer
[[446  48]
 [125 381]]
accuracy_rate = 0.827

It doesn't give rise to a meaningful change in the success rate.

Naive Bayes is a substantially fast algorithm that assumes the features are independent of each other. Despite the assumption seems not feasible, it really works well than expected. Being fast and easy to implement makes it a really good baseline performance algorithm. That is, when you want to develop a model, you may want to implement the first Naive Bayes to compare your actual algorithm against it in the future to decide how good your model is. And lastly, I want to emphasize that TF-IDF implementation works usually better than the normal count-vectorizer, but occasionally it results in relatively bad results. Thereby, we can infer that TF-IDF is more susceptible to the training data changes.