IMDB Sentiment Analysis

COMP 551 project 2.

IMDb is one of the most popular online databases for movies and personalities, a platform where millions of users read and write movie reviews. This makes it a large and diverse dataset for sentiment analysis. In this project, we were tasked with implementing different classification models to predict the sentiment of IMDb reviews as positive or negative, using only the text each review contains. The goal is to find the model with the highest accuracy and best generalization. We trained models using multiple combinations of text features and hyper-parameter settings for both the classifiers and the features, which we found could significantly impact performance. Every model was evaluated with k-fold cross-validation to ensure consistent performance. Our best-performing model was the Naïve Bayes–Support Vector Machine (NBSVM) classifier with bag-of-words features, which achieved an accuracy score of 91.880 on the final test set.
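To illustrate the evaluation protocol above, the k-fold split can be sketched in plain Python. This is a hand-rolled illustration of the idea, not the project's actual code (in practice, scikit-learn's `KFold` provides the same behavior):

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs covering all samples in k folds."""
    # Distribute samples as evenly as possible across the k folds.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    indices = list(range(n_samples))
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]          # held-out fold
        train_idx = indices[:start] + indices[start + size:]  # remaining folds
        yield train_idx, val_idx
        start += size
```

Each model is fit on the training indices and scored on the held-out fold; averaging the k scores gives a more stable accuracy estimate than a single split.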

Project Structure

  • src -> contains source code.
    • data_loader.py: utility functions to load the dataset.
    • nlp_processing.py: custom CountVectorizer for preprocessing the data.
    • naive_bayes.py: implementation of Bernoulli Naïve Bayes.
    • nbsvm.py: implementation of a variant combining SVM and NB [1], [2].
    • models.ipynb: Jupyter notebook with most models implemented.
    • lstm_imdb.ipynb: Jupyter notebook with an LSTM neural network. Note: we could not run this model due to GPU limitations, but we aim to train it and test its accuracy.
  • data -> this folder is created dynamically and is where datasets and models are stored.
  • test -> sanity check of the Bernoulli Naïve Bayes implementation.
  • data_load.sh: script to download the dataset.
  • make_submission.sh: script to submit results to Kaggle.
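For reference, the core idea behind the NBSVM variant in nbsvm.py ([1], [2]) is the Naïve Bayes log-count ratio, which re-weights bag-of-words features before a linear classifier is fit. A minimal sketch of that technique follows; it is an illustration, not the repository's implementation, and assumes binary labels and a binary document-term matrix:

```python
import numpy as np

def log_count_ratio(X, y, alpha=1.0):
    """Naive Bayes log-count ratio r as used by NBSVM-style models.

    X: binary document-term matrix of shape (n_docs, n_terms)
    y: labels in {0, 1}
    alpha: Laplace smoothing parameter
    """
    p = alpha + X[y == 1].sum(axis=0)  # smoothed positive-class term counts
    q = alpha + X[y == 0].sum(axis=0)  # smoothed negative-class term counts
    # Log ratio of normalized counts; positive values indicate
    # terms associated with the positive class.
    return np.log((p / p.sum()) / (q / q.sum()))
```

In the NBSVM formulation, each document vector is scaled elementwise by r before training the SVM, and a parameter like `beta` (seen in the pipeline below under Reproducibility) interpolates between the NB and SVM weights.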

Installing

Create environment

conda create -n newenvironment --file requirements.txt

Download data

Ensure your Kaggle API token (kaggle.json) is in ~/.kaggle. You can get a new token under My Profile -> API -> Create New API Token.

pip install kaggle --upgrade

Download data set

sh data_load.sh

Running

Once you've downloaded the data, you can reproduce the experiments using the provided notebook.

At the end, you can make a submission with the best model found by executing sh make_submission.sh

Note: stemming requires the Punkt Sentence Tokenizer; it is downloaded automatically.

Reproducibility

The model reported on Kaggle can be found in the Best model in leaderboard section of the provided notebook.

The model is:

```python
import nltk
from sklearn.pipeline import Pipeline

# LemmaCountVectorizer and NBSVM are defined in src/nlp_processing.py
# and src/nbsvm.py respectively (see Project Structure above).
from src.nlp_processing import LemmaCountVectorizer
from src.nbsvm import NBSVM

best_pipeline = Pipeline([
    ('vect', LemmaCountVectorizer(analyzer='word', binary=False, decode_error='strict',
            encoding='utf-8', input='content', lowercase=True, max_df=6000, max_features=None,
            min_df=2, ngram_range=(1, 3), preprocessing=True, preprocessor=None,
            strip_accents='unicode', token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=nltk.word_tokenize,
            vocabulary=None, stem=False)),
    ('clf', NBSVM(beta=0.31925992753471094, alpha=1, C=0.40531603281740625, fit_intercept=False))
])
best_pipeline.fit(X_train, y_train)
```
