# Naive Bayes Text Classification
In this notebook, you will first see a simple Naive Bayes (NB) classifier trained on a tiny toy corpus to classify texts as 'sports' or 'not sports'. You are then asked to adapt the NB classifier to perform sentiment analysis.

In [None]:
# tiny toy corpus: sports and non-sports sentences
sports = ['A great game', 'Very clean match','A clean but forgettable game']
non_sports = ['The election was over','It was a close election']

In [None]:
# build the vocabulary: lowercase every token and collect the tokens per class
sport_words = []
non_sport_words = []
for sent in sports:
    sport_words += [ww.lower() for ww in sent.split()]
for sent in non_sports:
    non_sport_words += [ww.lower() for ww in sent.split()]

all_words = sport_words + non_sport_words
vocab = list(set(all_words))

print(all_words)
print(len(vocab), vocab)

print('sport token nums', len(sport_words))
print('sport type nums', len(set(sport_words)))
print('non-sport token nums', len(non_sport_words))
print('non-sport type nums', len(set(non_sport_words)))

In [None]:
# get the prior distribution
prior_sport = len(sports) / (len(sports) + len(non_sports))
prior_non_sport = len(non_sports) / (len(sports) + len(non_sports))

In [None]:
# get the per-class word frequencies, which will be used later to compute the likelihood
from nltk import FreqDist
sport_fd = FreqDist(sport_words)
non_sport_fd = FreqDist(non_sport_words)

print(sport_fd['close'])
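
`FreqDist` is a subclass of Python's `collections.Counter`, so looking up a word that never occurred in a class returns 0 rather than raising an error. The sketch below uses a plain `Counter` on the sports sentences to illustrate that behaviour; it is a stand-in for illustration, not the NLTK object itself.

```python
from collections import Counter

# Same toy sports sentences as above, lowercased and split on whitespace.
sports = ['A great game', 'Very clean match', 'A clean but forgettable game']
sport_words = [ww.lower() for sent in sports for ww in sent.split()]

sport_counts = Counter(sport_words)

print(sport_counts['game'])   # 'game' appears in two of the sports sentences
print(sport_counts['close'])  # 'close' never appears in the sports sentences
```

A zero count would turn the whole product of likelihoods into zero, which is why the classifier adds 1 to every count before normalising.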

In [None]:
# NB classifier with add-one (Laplace) smoothing
import numpy as np
def predict_class(words):
    sport_likelihood = []
    non_likelihood = []
    for ww in words:
        # smoothed P(word | class): (count + 1) / (class token count + vocabulary size)
        sport_likelihood.append((sport_fd[ww]+1.)/(len(sport_words)+len(vocab)))
        non_likelihood.append((non_sport_fd[ww]+1.)/(len(non_sport_words)+len(vocab)))
    print(sport_likelihood)
    print(non_likelihood)
    # sum log-likelihoods instead of multiplying probabilities to avoid underflow
    s_loglhd = np.sum([np.log(l) for l in sport_likelihood])
    n_loglhd = np.sum([np.log(l) for l in non_likelihood])
    print(s_loglhd, n_loglhd)
    # unnormalised log-posterior of each class: log prior + log-likelihood
    sprob = np.log(prior_sport)+s_loglhd
    nprob = np.log(prior_non_sport)+n_loglhd
    if sprob > nprob:
        return 'sport'
    else:
        return 'non_sport'
    
print(predict_class('a very interesting game'.split()))
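
For reference, the decision rule that `predict_class` implements can be written out as follows. The symbols are notation introduced here, not names from the code: $N_c$ is the number of tokens in class $c$ (`len(sport_words)` or `len(non_sport_words)`) and $|V|$ is the vocabulary size (`len(vocab)`). The worked numbers assume the toy corpus above, which has 11 sports tokens, 9 non-sports tokens and 14 vocabulary types.

```latex
\hat{c} = \arg\max_{c} \Big[ \log P(c) + \sum_{i=1}^{n} \log P(w_i \mid c) \Big],
\qquad
P(w \mid c) = \frac{\mathrm{count}(w, c) + 1}{N_c + |V|}

% Worked example for "a very interesting game":
P(\text{sport}) \prod_i P(w_i \mid \text{sport})
  = \frac{3}{5} \cdot \frac{3}{25} \cdot \frac{2}{25} \cdot \frac{1}{25} \cdot \frac{3}{25}
  \approx 2.76 \times 10^{-5}

P(\text{non\_sport}) \prod_i P(w_i \mid \text{non\_sport})
  = \frac{2}{5} \cdot \frac{2}{23} \cdot \frac{1}{23} \cdot \frac{1}{23} \cdot \frac{1}{23}
  \approx 2.86 \times 10^{-6}
```

The sports score is larger, so the sentence is labelled `'sport'`. The unseen word "interesting" contributes only its smoothed count of 1 in both classes.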

## Exercise: NB-based Sentiment Analysis
*Sentiment analysis* is probably the most commercially important application of text classification. It takes a customer review and predicts the overall sentiment of the review. Here we use the NLTK movie review corpus to train an NB-based sentiment analyzer.

In [None]:
# obtain the data (if the corpus is missing, run nltk.download('movie_reviews') once)
from nltk.corpus import movie_reviews
import random
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)

print('document num', len(documents))
print('labels:', set([dd[1] for dd in documents]))
print(documents[0][0], documents[0][1])

In [None]:
# split the data into train, dev-test and test

train_data = documents[:1200]
dev_data = documents[1200:1600]
test_data = documents[1600:]
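
Because the shuffle above is unseeded, each run produces a different split, which can make dev-set comparisons between model variants noisy. One option, sketched here on placeholder documents rather than the real corpus, is to shuffle with a fixed seed before slicing. The seed value 42, the `toy_documents` list and its size of 2,000 items are assumptions for illustration.

```python
import random
from collections import Counter

# Hypothetical stand-in for the (words, label) pairs built from movie_reviews.
toy_documents = [(['doc', str(i)], 'pos' if i % 2 == 0 else 'neg')
                 for i in range(2000)]

# A dedicated Random instance with a fixed seed gives the same order every run.
rng = random.Random(42)
rng.shuffle(toy_documents)

# Same 1200 / 400 / remainder slicing as in the cell above.
train_data = toy_documents[:1200]
dev_data = toy_documents[1200:1600]
test_data = toy_documents[1600:]

# Report the size and label balance of each split.
for name, split in [('train', train_data), ('dev', dev_data), ('test', test_data)]:
    print(name, len(split), Counter(label for _, label in split))
```

Checking the label balance per split is a cheap sanity test: a badly skewed dev set would make accuracy harder to interpret.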

In [None]:
# build the prior probability of pos and neg (based on train_data)
prior_pos = ...
prior_neg = ...
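
As a hint for this step, one possible way to estimate class priors is to count documents per label and normalise, mirroring the toy example earlier. The sketch below runs on a hypothetical miniature training set (`toy_train`), not on `train_data`, and the `priors` dictionary is an illustrative name rather than part of the exercise template.

```python
from collections import Counter

# Hypothetical miniature training set in the same (words, label) format.
toy_train = [
    (['great', 'film'], 'pos'),
    (['dull', 'plot'], 'neg'),
    (['loved', 'it'], 'pos'),
    (['wonderful', 'cast'], 'pos'),
]

# Count documents per label, then divide by the total number of documents.
label_counts = Counter(label for _, label in toy_train)
total_docs = len(toy_train)
priors = {label: count / total_docs for label, count in label_counts.items()}

print(priors)
```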

In [None]:
# build the vocabulary based on train_data
# you may investigate whether to remove stopwords and punctuation and
# whether to apply lemmatization/stemming, and compare their performance on the dev-test set

vocab = ...
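
To make the preprocessing choices concrete, here is a minimal sketch of filtering stopwords and punctuation while collecting a vocabulary. The two toy documents, the tiny `toy_stopwords` set and the `keep_token` helper are all assumptions for illustration; a real experiment would likely use a fuller stopword list and the actual training data.

```python
import string

# Hypothetical tokenised training documents in the same (words, label) format.
toy_train = [
    (['the', 'film', 'is', 'great', '!'], 'pos'),
    (['the', 'plot', 'is', 'dull', '.'], 'neg'),
]

# Assumed, deliberately tiny stopword list.
toy_stopwords = {'the', 'is', 'a', 'an', 'of'}

def keep_token(token):
    """Return True for tokens that are neither stopwords nor pure punctuation."""
    if token in toy_stopwords:
        return False
    if all(ch in string.punctuation for ch in token):
        return False
    return True

# Vocabulary with no filtering versus vocabulary after filtering.
full_vocab = {tok.lower() for words, _ in toy_train for tok in words}
filtered_vocab = {tok for tok in full_vocab if keep_token(tok)}

print(sorted(full_vocab))
print(sorted(filtered_vocab))
```

Whichever filter is chosen, the same filter has to be applied to the input text at prediction time, otherwise the counts and the vocabulary size used for smoothing no longer match.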

In [None]:
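# for each class (pos and neg), keep the frequency of each word type in order to compute the likelihood;
# pos_words and neg_words are the token lists you need to build from the pos and neg documents in train_data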
from nltk import FreqDist
pos_fd = FreqDist(pos_words)
neg_fd = FreqDist(neg_words)
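
Movie reviews are much longer than the toy sentences, so multiplying per-word likelihoods directly is risky. The sketch below assumes a hypothetical 1000-token review in which every token has likelihood 1e-4, and compares the raw product with the sum of logs that the toy classifier uses.

```python
import math

# Assumed 1000-token review where every token has likelihood 1e-4.
token_likelihoods = [1e-4] * 1000

# Multiplying the raw probabilities underflows the float range.
raw_product = math.prod(token_likelihoods)

# Summing the logs keeps a finite score that can still be compared across classes.
log_score = sum(math.log(p) for p in token_likelihoods)

print(raw_product)
print(log_score)
```

If both classes underflow to zero, the comparison between them is meaningless, so working in log space is the safer default for this exercise.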

In [None]:
# build the class prediction function
def predict_sentiment(input_text):
    pass

# evaluate your model's performance on the dev-test set
dev_pred_labels = []
dev_true_labels = [ll for (dd,ll) in dev_data]
for tt,_ in dev_data:
    dev_pred_labels.append(predict_sentiment(tt))

from sklearn.metrics import accuracy_score, precision_recall_fscore_support
print('acc', accuracy_score(dev_true_labels, dev_pred_labels))
print(precision_recall_fscore_support(dev_true_labels, dev_pred_labels, average=None, labels=['pos', 'neg']))

# develop different models (with and without stopwords/punctuation/stemming/lemmatization),
# and select the best model by its performance on the dev-test set;
# the selected best model will be applied to test data in the next step
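
With `average=None`, `precision_recall_fscore_support` returns four arrays (precision, recall, F-score, support), each ordered as in `labels`. To show what those numbers mean, the sketch below recomputes accuracy and per-class precision, recall and F1 by hand on six hypothetical labels; the `per_class_scores` helper is illustrative and not part of scikit-learn.

```python
# Hypothetical gold and predicted labels for six dev documents.
true_labels = ['pos', 'pos', 'pos', 'neg', 'neg', 'neg']
pred_labels = ['pos', 'pos', 'neg', 'neg', 'pos', 'neg']

def per_class_scores(true, pred, label):
    """Return (precision, recall, f1), treating `label` as the target class."""
    tp = sum(1 for t, p in zip(true, pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(true, pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(true, pred) if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

accuracy = sum(t == p for t, p in zip(true_labels, pred_labels)) / len(true_labels)
print('acc', accuracy)
for label in ['pos', 'neg']:
    print(label, per_class_scores(true_labels, pred_labels, label))
```

Looking at precision and recall per class, rather than accuracy alone, shows whether a model variant is biased towards predicting one sentiment.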

In [None]:
# test the performance of the best model on the test set
test_pred_labels = []
test_true_labels = [ll for (dd,ll) in test_data]
for tt,_ in test_data:
    test_pred_labels.append(predict_sentiment(tt))

from sklearn.metrics import accuracy_score, precision_recall_fscore_support
print('acc', accuracy_score(test_true_labels, test_pred_labels))
print(precision_recall_fscore_support(test_true_labels, test_pred_labels, average=None, labels=['pos', 'neg']))
