IMDB Sentiment Analysis

COMP 551 project 2.

IMDb is one of the most popular online databases for movies and personalities, a platform where millions of users read and write movie reviews. This makes it a large and diverse dataset for sentiment analysis. In this project, we were tasked with implementing different classification models to predict the sentiment of IMDb reviews as positive or negative, using only the text each review contains. The goal is to find the model with the highest accuracy and best generalization. We trained models using multiple combinations of text features and hyper-parameter settings for both the classifiers and the features, which we found could significantly impact performance. Every model was evaluated with k-fold cross-validation to ensure consistent performance. Our best-performing model was the Naïve Bayes–Support Vector Machine (NBSVM) classifier with bag-of-words features, which achieved an accuracy score of 91.880 on the final test set.
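To illustrate the evaluation protocol above, the k-fold split can be sketched in plain Python. This is a hand-rolled illustration of the idea, not the project's actual code (in practice, scikit-learn's `KFold` provides the same behavior):

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs covering all samples in k folds."""
    # Distribute samples as evenly as possible across the k folds.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    indices = list(range(n_samples))
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]          # held-out fold
        train_idx = indices[:start] + indices[start + size:]  # remaining folds
        yield train_idx, val_idx
        start += size
```

Each model is fit on the training indices and scored on the held-out fold; averaging the k scores gives a more stable accuracy estimate than a single split.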

Project Structure

  • src -> contains source code.
    • data_loader.py: utility functions to load the dataset.
    • nlp_processing.py: custom CountVectorizer for preprocessing the data.
    • naive_bayes.py: implementation of Bernoulli Naïve Bayes.
    • nbsvm.py: implementation of a variant combining SVM and NB [1], [2].
    • models.ipynb: Jupyter notebook with most models implemented.
    • lstm_imdb.ipynb: Jupyter notebook with an LSTM neural network. Note: we could not run this model due to GPU limitations, but we aim to train it and test its accuracy.
  • data -> this folder is created dynamically and is where datasets and models are stored.
  • test -> sanity check of the Bernoulli Naïve Bayes implementation.
  • data_load.sh: script to download the dataset.
  • make_submission.sh: script to submit results to Kaggle.
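For reference, the core idea behind the NBSVM variant in nbsvm.py ([1], [2]) is the Naïve Bayes log-count ratio, which re-weights bag-of-words features before a linear classifier is fit. A minimal sketch of that technique follows; it is an illustration, not the repository's implementation, and assumes binary labels and a binary document-term matrix:

```python
import numpy as np

def log_count_ratio(X, y, alpha=1.0):
    """Naive Bayes log-count ratio r as used by NBSVM-style models.

    X: binary document-term matrix of shape (n_docs, n_terms)
    y: labels in {0, 1}
    alpha: Laplace smoothing parameter
    """
    p = alpha + X[y == 1].sum(axis=0)  # smoothed positive-class term counts
    q = alpha + X[y == 0].sum(axis=0)  # smoothed negative-class term counts
    # Log ratio of normalized counts; positive values indicate
    # terms associated with the positive class.
    return np.log((p / p.sum()) / (q / q.sum()))
```

In the NBSVM formulation, each document vector is scaled elementwise by r before training the SVM, and a parameter like `beta` (seen in the pipeline below under Reproducibility) interpolates between the NB and SVM weights.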

Installing

Create environment

conda create -n newenvironment --file requirements.txt

Download data

Ensure your Kaggle API token (kaggle.json) is in ~/.kaggle. You can get a new token under My Profile -> API -> Create New API Token.

pip install kaggle --upgrade

Download data set

sh data_load.sh

Running

Once you've downloaded the data, you can reproduce the experiments using the provided notebook.

At the end, you can make a submission with the best model found by executing sh make_submission.sh

Note: stemming requires the Punkt Sentence Tokenizer; it is downloaded automatically.

Reproducibility

The model reported on Kaggle can be found in the Best model in leaderboard section of the provided notebook.

The model is:

```python
import nltk
from sklearn.pipeline import Pipeline

# LemmaCountVectorizer and NBSVM are defined in src/nlp_processing.py
# and src/nbsvm.py respectively (see Project Structure above).
from src.nlp_processing import LemmaCountVectorizer
from src.nbsvm import NBSVM

best_pipeline = Pipeline([
    ('vect', LemmaCountVectorizer(analyzer='word', binary=False, decode_error='strict',
            encoding='utf-8', input='content', lowercase=True, max_df=6000, max_features=None,
            min_df=2, ngram_range=(1, 3), preprocessing=True, preprocessor=None,
            strip_accents='unicode', token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=nltk.word_tokenize,
            vocabulary=None, stem=False)),
    ('clf', NBSVM(beta=0.31925992753471094, alpha=1, C=0.40531603281740625, fit_intercept=False))
])
best_pipeline.fit(X_train, y_train)
```
