# Sentiment Analysis - NLU

## Version: **Multinomial Naive Bayes**
Student: Francesco Laiti

---

This notebook contains the source code to build, train and evaluate a Naive Bayes-based sentiment analysis model using the scikit-learn library.

This version serves as the baseline.

## Prerequisites

Import the libraries the notebook needs and download the NLTK datasets it uses.

In [1]:
import nltk
from nltk.corpus import movie_reviews
from nltk.corpus import subjectivity
import numpy

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_validate

nltk.download('movie_reviews')
nltk.download('subjectivity')

Declare the global constants used in this notebook.

In [2]:
N_SPLIT = 5 # number of cross-validation folds (same as the StratifiedKFold default)
N_AFTER_COMMA = 1 # decimal places shown in the reported scores

### Utility functions

In [3]:
def lol2str(doc): # lol: list of lists (a document as a list of tokenized sentences)
    return " ".join(w for sent in doc for w in sent)

def low2str(sent): # low: list of words (a single tokenized sentence)
    return " ".join(sent)
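
As a quick illustration of what the two helpers produce, the sketch below redefines them so it runs on its own and applies them to hand-made tokens. The toy sentences are invented for this example and do not come from the NLTK corpora.

```python
def lol2str(doc):
    # Flatten a document (list of tokenized sentences) into one string.
    return " ".join(w for sent in doc for w in sent)

def low2str(sent):
    # Join a single tokenized sentence into one string.
    return " ".join(sent)

# Hypothetical toy input, made up for illustration only.
toy_sent = ["a", "great", "movie"]
toy_doc = [["a", "great", "movie"], ["i", "loved", "it"]]

sent_text = low2str(toy_sent)
doc_text = lol2str(toy_doc)

print(sent_text)  # a great movie
print(doc_text)   # a great movie i loved it
```

`low2str` matches the sentence-level subjectivity corpus, while `lol2str` matches the paragraph-level movie reviews, where each document is a list of sentences.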

## Dataset

In [4]:
def get_movie_review_data():
    neg = movie_reviews.paras(categories='neg')
    pos = movie_reviews.paras(categories='pos')
    return neg, pos

def get_subjectivity_data():
    subj = subjectivity.sents(categories='subj')
    obj = subjectivity.sents(categories='obj')
    return subj, obj

In [5]:
def filter_subj_sents(classifier, vectorizer, data):
    corpus = [low2str(d) for d in data]
    vectors = vectorizer.transform(corpus)
    filter_sents = classifier.predict(vectors)

    sbj_sents = [d for d, estimate in zip(data, filter_sents) if estimate == 0] # 0 is assigned to subj

    return sbj_sents
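
To make the filtering step concrete, here is a minimal sketch that trains a tiny subjectivity classifier on four invented sentences and then filters a two-sentence document. The training sentences, the `toy_` names and the sample document are assumptions made for this example; only the label convention (0 = subjective, 1 = objective) is taken from the notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

def filter_subj_sents(classifier, vectorizer, data):
    # Keep only the sentences that the classifier labels as subjective (0).
    corpus = [" ".join(sent) for sent in data]
    predictions = classifier.predict(vectorizer.transform(corpus))
    return [sent for sent, label in zip(data, predictions) if label == 0]

# Invented training data: label 0 = subjective, label 1 = objective.
toy_train = [
    ["i", "loved", "this", "wonderful", "film"],
    ["i", "hated", "this", "awful", "film"],
    ["the", "director", "was", "born", "in", "1960"],
    ["the", "plot", "follows", "a", "detective", "in", "1960"],
]
toy_labels = [0, 0, 1, 1]

toy_vectorizer = CountVectorizer()
toy_classifier = MultinomialNB()
toy_vectors = toy_vectorizer.fit_transform([" ".join(s) for s in toy_train])
toy_classifier.fit(toy_vectors, toy_labels)

# A document is a list of tokenized sentences.
toy_doc = [
    ["i", "loved", "the", "wonderful", "acting"],
    ["the", "detective", "was", "born", "in", "1960"],
]
kept = filter_subj_sents(toy_classifier, toy_vectorizer, toy_doc)
print(kept)
```

With this toy setup only the first, opinionated sentence should survive the filter, which is the behaviour the polarity experiment relies on when it discards objective sentences from each review.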

In [6]:
subj, obj = get_subjectivity_data()
neg, pos  = get_movie_review_data()

## Training & Evaluation

In [7]:
def train_subjectivity(subj, obj):
    vectorizer = CountVectorizer()
    classifier = MultinomialNB()

    corpus = [low2str(d) for d in subj] + [low2str(d) for d in obj]
    labels = numpy.array([0] * len(subj) + [1] * len(obj))
    
    vectors = vectorizer.fit_transform(corpus)

    scores = cross_validate(classifier, vectors, labels, cv=StratifiedKFold(n_splits=N_SPLIT) , scoring=['accuracy', 'f1'])
    test_accuracy = numpy.array(scores['test_accuracy'])*100
    test_f1 = numpy.array(scores['test_f1'])*100
    print('Naive Bayes subjectivity classification')
    print('\taccuracy:', round(test_accuracy.mean(), N_AFTER_COMMA), '+-', round(test_accuracy.std(), N_AFTER_COMMA), \
            '%\n\tF1-score:', round(test_f1.mean(), N_AFTER_COMMA), '+-', round(test_f1.std(), N_AFTER_COMMA), '%')
    classifier.fit(vectors, labels)

    return classifier, vectorizer

def train_polarity(neg, pos):
    vectorizer = CountVectorizer()
    classifier = MultinomialNB()

    corpus = [lol2str(d) for d in neg] + [lol2str(d) for d in pos]
    labels = numpy.array([0] * len(neg) + [1] * len(pos))

    vectors = vectorizer.fit_transform(corpus)

    scores = cross_validate(classifier, vectors, labels, cv=StratifiedKFold(n_splits=N_SPLIT) , scoring=['accuracy', 'f1'])
    test_accuracy = numpy.array(scores['test_accuracy'])*100
    test_f1 = numpy.array(scores['test_f1'])*100
    print('Naive Bayes polarity classification')
    print('\taccuracy:', round(test_accuracy.mean(), N_AFTER_COMMA), '+-', round(test_accuracy.std(), N_AFTER_COMMA), \
            '%\n\tF1-score:', round(test_f1.mean(), N_AFTER_COMMA), '+-', round(test_f1.std(), N_AFTER_COMMA), '%')
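
Both training functions fit `CountVectorizer` on the whole corpus before cross-validating, so the vocabulary of every fold also includes words from its held-out documents. One possible variation, which is not part of this notebook, is to wrap the vectorizer and the classifier in a scikit-learn pipeline so that the vocabulary is rebuilt inside each fold. The sketch below shows the idea on invented toy texts; the corpus, the `toy_` names and the choice of two folds are assumptions made only to keep the example small.

```python
import numpy
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented toy corpus: 4 negative (0) and 4 positive (1) texts.
toy_corpus = [
    "a dull and boring film",
    "awful plot and terrible acting",
    "boring terrible waste of time",
    "the worst dull movie",
    "a wonderful and moving film",
    "great plot and brilliant acting",
    "brilliant wonderful use of time",
    "the best great movie",
]
toy_labels = numpy.array([0] * 4 + [1] * 4)

# The pipeline refits CountVectorizer on the training part of each fold.
pipeline = make_pipeline(CountVectorizer(), MultinomialNB())

scores = cross_validate(
    pipeline,
    toy_corpus,
    toy_labels,
    cv=StratifiedKFold(n_splits=2),
    scoring=['accuracy', 'f1'],
)
print('accuracy per fold:', scores['test_accuracy'])
print('F1 per fold:', scores['test_f1'])
```

On corpora as large as the ones used here the difference is likely to be small, so the numbers reported in this notebook remain a reasonable baseline.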

## Train subjectivity classifier

In [8]:
classifier, vectorizer = train_subjectivity(subj, obj)

Naive Bayes subjectivity classification
	accuracy: 92.0 +- 0.7 %
	F1-score: 91.9 +- 0.6 %


## Train polarity classifier without sentence filtering

In [9]:
train_polarity(neg, pos)

Naive Bayes polarity classification
	accuracy: 81.4 +- 1.4 %
	F1-score: 81.1 +- 1.6 %


## Train polarity classifier on subjectivity-filtered sentences

In [10]:
neg_filtered = [filter_subj_sents(classifier, vectorizer, d) for d in neg]
pos_filtered = [filter_subj_sents(classifier, vectorizer, d) for d in pos]

train_polarity(neg_filtered, pos_filtered)

Naive Bayes polarity classification
	accuracy: 84.1 +- 2.1 %
	F1-score: 83.9 +- 2.2 %
