A class project for CS585: Introduction to Natural Language Processing. A Naive Bayes classifier for sentiment analysis of movie reviews.


Naive Bayes Bag-of-Words Sentiment Classifier

A project for CS585 - Introduction to Natural Language Processing

Assignment Description

Starter code


Instructor: Brendan T. O'Connor


Trains a Naive Bayes classifier to predict the sentiment (positive or negative) of a movie review. The assignment code has been cleaned up and streamlined to make it easier to read and use. As a result, the complete solution to the assignment is not here, only the parts I deemed most relevant to share.

Instructor Implementations

  • tokenize_doc
  • train
  • report_statistics_after_training

Modifications to Instructor Implementations

  • __init__: Added feature_extractor member that defaults to tokenize_doc
  • tokenize_and_update_model: Switched to use feature_extractor member rather than tokenize_doc
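
As a rough illustration of the feature_extractor change described above, the sketch below shows a model that stores a tokenizer as a member and calls it when updating counts. The names SketchModel, simple_tokenize, tokenize_without_stopwords, and STOPWORDS are hypothetical stand-ins, not the repository's code, and the stopword set is a tiny illustrative sample.

```python
from collections import Counter

# Tiny illustrative stopword set (an assumption, not the project's list).
STOPWORDS = {"the", "a", "an", "was", "is", "and"}


def simple_tokenize(doc):
    """Lowercase, split on whitespace, and count tokens (bag of words)."""
    return Counter(doc.lower().split())


def tokenize_without_stopwords(doc):
    """Same as simple_tokenize, but drop tokens found in STOPWORDS."""
    return Counter(t for t in doc.lower().split() if t not in STOPWORDS)


class SketchModel:
    """Hypothetical model that delegates tokenization to a member."""

    def __init__(self, feature_extractor=simple_tokenize):
        # Any callable mapping a document string to {token: count} works.
        self.feature_extractor = feature_extractor
        self.word_counts = {"pos": Counter(), "neg": Counter()}

    def tokenize_and_update_model(self, doc, label):
        # Use the configured extractor rather than a hard-coded tokenizer.
        bow = self.feature_extractor(doc)
        self.word_counts[label].update(bow)
```

Passing a different callable, for example SketchModel(feature_extractor=tokenize_without_stopwords), swaps the tokenizer without touching the training code.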

Implementations I provided

  • tokenize_doc_stopwords
  • tokenize_doc_stopwords_custom
  • tokenize_doc_stopwords_and_stemming
  • update_model
  • p_word_given_label
  • log_likelihood
  • p_word_given_label_and_psuedocount
  • log_prior
  • unnormalized_log_posterior
  • classify
  • likelihood_ratio
  • evaluate_classifier_accuracy
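
The functions listed above follow the usual structure of a multinomial Naive Bayes classifier with add-alpha (pseudocount) smoothing. The sketch below is one minimal way those pieces could fit together. TinyNaiveBayes is a hypothetical class, its signatures are assumptions, and it may differ from the repository's implementation (for example in how the vocabulary size or token counts are tracked).

```python
import math
from collections import Counter


class TinyNaiveBayes:
    """Illustrative two-class bag-of-words model (not the repository's class)."""

    LABELS = ("pos", "neg")

    def __init__(self):
        self.word_counts = {label: Counter() for label in self.LABELS}
        self.doc_counts = Counter()
        self.vocab = set()

    def update_model(self, bow, label):
        # bow maps each token to its count within a single document.
        self.word_counts[label].update(bow)
        self.doc_counts[label] += 1
        self.vocab.update(bow)

    def p_word_given_label_and_pseudocount(self, word, label, alpha):
        # (count(word, label) + alpha) / (tokens(label) + alpha * |V|)
        numerator = self.word_counts[label][word] + alpha
        total_tokens = sum(self.word_counts[label].values())
        return numerator / (total_tokens + alpha * len(self.vocab))

    def log_likelihood(self, bow, label, alpha):
        # Sum of log P(word | label), weighted by each word's count.
        return sum(
            count * math.log(
                self.p_word_given_label_and_pseudocount(word, label, alpha)
            )
            for word, count in bow.items()
        )

    def log_prior(self, label):
        # Fraction of training documents carrying this label.
        return math.log(self.doc_counts[label] / sum(self.doc_counts.values()))

    def unnormalized_log_posterior(self, bow, label, alpha):
        return self.log_prior(label) + self.log_likelihood(bow, label, alpha)

    def classify(self, bow, alpha):
        # Choose the label with the larger unnormalized log posterior.
        return max(
            self.LABELS,
            key=lambda label: self.unnormalized_log_posterior(bow, label, alpha),
        )

    def likelihood_ratio(self, word, alpha):
        # Values above 1 suggest the word is more typical of positive reviews.
        pos = self.p_word_given_label_and_pseudocount(word, "pos", alpha)
        neg = self.p_word_given_label_and_pseudocount(word, "neg", alpha)
        return pos / neg
```

Working in log space avoids numerical underflow when multiplying many small probabilities, and the pseudocount keeps unseen words from producing a zero probability.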


To train a Naive Bayes classifier on the large_movie_review_dataset data using a feature extractor that applies stemming and removes both standard and custom stopwords:

python nb_sentiment_classify.py

This command trains the model with every pseudocount from 1 to 25 (inclusive), plots pseudocount against accuracy, and reports the best pseudocount along with its accuracy.
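
The pseudocount search can be pictured as a simple sweep. The sketch below assumes a caller-supplied evaluate function that returns accuracy for a given pseudocount. sweep_pseudocounts and evaluate are hypothetical names, and plotting is left out to keep the example dependency-free.

```python
def sweep_pseudocounts(evaluate, pseudocounts=range(1, 26)):
    """Evaluate each pseudocount and return all results plus the best one.

    evaluate is an assumed callable mapping a pseudocount to an accuracy.
    On ties, the smallest pseudocount with the top accuracy is returned.
    """
    accuracies = [(alpha, evaluate(alpha)) for alpha in pseudocounts]
    best_alpha, best_accuracy = max(accuracies, key=lambda pair: pair[1])
    return accuracies, best_alpha, best_accuracy
```

The returned list of (pseudocount, accuracy) pairs can be handed to any plotting library to reproduce a pseudocount vs accuracy graph.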


from nb_sentiment_classify import NaiveBayes

# Initialize model with default feature extractor
nb = NaiveBayes()

# Train model on large_movie_review_dataset

# Evaluate accuracy given a pseudocount (1 used in this example)