Sentiment Analysis of Product-Based Reviews Using Machine Learning Approaches
This project performs sentiment classification of online product reviews using several Machine Learning classifiers. Sentiment is analyzed at the document level, that is, one label per review.
The data are online product reviews collected from amazon.com. The Amazon reviews dataset spans a period of 18 years and includes ~35 million reviews up to March 2013. Each review includes product and user information, a rating, and a plaintext review. For more information, please refer to the following paper: J. McAuley and J. Leskovec. The final dataset is constructed by randomly taking 200,000 samples for each review score from 1 to 5, for a total of 1,000,000 samples.
The project compares the performance of four Machine Learning classifiers: Multinomial Naïve Bayes, Logistic Regression, Linear SVC and Random Forest. The best-performing classifier was chosen as the standard model for classifying product reviews in the future. A user review given as input is classified by the chosen model into one of two sentiment classes, Positive or Negative, based on the sentiment orientation of the opinions it contains.
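The balanced sampling step described above (an equal number of randomly chosen reviews per score) could look roughly like the sketch below. It is not taken from main.py: the function name `balanced_sample`, the `score` and `text` keys, and the toy data are all assumptions made for illustration, and the real project draws 200,000 reviews per score rather than 3.

```python
import random
from collections import defaultdict


def balanced_sample(reviews, per_score, seed=0):
    """Randomly pick `per_score` reviews for each score from 1 to 5.

    `reviews` is assumed to be a list of dicts with a "score" key
    (hypothetical layout, not necessarily the one used in main.py).
    """
    # Group the reviews by their score.
    by_score = defaultdict(list)
    for review in reviews:
        by_score[review["score"]].append(review)

    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sample = []
    for score in range(1, 6):
        bucket = by_score[score]
        if len(bucket) < per_score:
            raise ValueError(
                f"only {len(bucket)} reviews with score {score}, "
                f"need {per_score}"
            )
        # Sample without replacement within each score.
        sample.extend(rng.sample(bucket, per_score))

    rng.shuffle(sample)  # avoid leaving the scores in sorted blocks
    return sample


# Toy data: 10 made-up reviews per score.
toy_reviews = [
    {"score": s, "text": f"review {i} with score {s}"}
    for s in range(1, 6)
    for i in range(10)
]
toy_sample = balanced_sample(toy_reviews, per_score=3)
print(len(toy_sample))
```

Sampling the same number of reviews per score keeps every score equally represented in the data the classifiers are trained and evaluated on.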
Make sure the following dependencies are installed and set up on your system first:
- Unix/Linux Operating System (Recommended but not necessary)
- Python 3.6+
- Anaconda Distribution 5.2+
- NLTK Toolkit 3.3+
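A quick way to check part of this list from Python is sketched below. The helper `check_environment` is hypothetical and not part of the project; it only checks the interpreter version and whether a module can be found, so it does not verify the Anaconda or NLTK version numbers.

```python
import importlib.util
import sys


def check_environment(min_python=(3, 6), required_modules=("nltk",)):
    """Return a list of human-readable problems (empty if none found)."""
    problems = []
    if sys.version_info < min_python:
        found = ".".join(str(part) for part in sys.version_info[:3])
        wanted = ".".join(str(part) for part in min_python)
        problems.append(f"Python {wanted}+ required, found {found}")
    for name in required_modules:
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(name) is None:
            problems.append(f"missing module: {name}")
    return problems


problems = check_environment()
print("OK" if not problems else "\n".join(problems))
```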
Some hardware requirements should also be fulfilled to run this project smoothly:
- At least 8GB RAM
- At least 50GB of usable Hard Disk space
First, download the project as a zip archive and extract it to your desired location, or clone the repository using:
$ git clone https://github.com/pranitbose/sentiment-analysis.git
Download the dataset using the link provided in dataset_link.txt within the datasets directory. Move the downloaded dataset, or whichever dataset you want to use, into the datasets directory. If you are using your own dataset, you have to change the filenames in the source code of main.py to match it. Many lines of the source code are commented out on purpose, and the state of the project is pickled wherever necessary to save computing resources and speed up execution by not repeating the same steps. A boolean variable named do_pickle is provided in main.py to switch pickling on or off for the entire file by changing its value in only one place in _main_.
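The idea of pickling intermediate state behind a single switch can be sketched as follows. Only the name do_pickle comes from main.py; the helper `load_or_compute`, the file name `step.pickle` and the toy `expensive_step` are assumptions for illustration and may differ from how main.py actually does it.

```python
import os
import pickle
import tempfile

do_pickle = True  # single switch, mirroring the flag described above


def load_or_compute(path, compute, use_pickle=do_pickle):
    """Return the cached result at `path`, or compute and cache it."""
    if use_pickle and os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)  # reuse the state saved by an earlier run
    result = compute()
    if use_pickle:
        with open(path, "wb") as f:
            pickle.dump(result, f)  # save the state for later runs
    return result


calls = []  # records how often the expensive step really runs


def expensive_step():
    calls.append(1)
    return {"tokens": ["great", "product"]}


with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, "step.pickle")
    first = load_or_compute(cache_path, expensive_step)   # computes and saves
    second = load_or_compute(cache_path, expensive_step)  # loads from disk

print(len(calls))
```

Only unpickle files you created yourself, since loading a pickle from an untrusted source can execute arbitrary code.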
You only need to execute main.py in your terminal to run this project. For the first run, enable the flags such as do_fetch_data, do_preprocess_data and others in the source code of main.py. Read the source code carefully before you do so. You should also enable pickling. The first run will then generate a number of pickled files and various graph plots. On subsequent runs, enable or disable the flags in the source code as per your requirements.
This project is licensed under the terms of the MIT license.