This project seeks to predict the rating of a text review using NLP algorithms and the Amazon Customer Reviews dataset.
- Create a model capable of predicting customer satisfaction from information available in the form of comments and articles.
The dataset contains the customer review text with accompanying metadata, which consists of three main components:
-
A collection of reviews written on the Amazon.com marketplace and associated metadata from 1995 to 2015. This is intended to facilitate the study of the properties (and evolution) of customer reviews, potentially including how people evaluate and express their experiences with scale products. (130M + customer reviews)
-
A collection of multi-lingual product reviews from different Amazon marketplaces intended to facilitate analysis of customer perceptions of the same products and broader consumer preferences in all languages and countries. (More than 200K customer reviews in 5 countries)
-
A collection of reviews that have been identified as non-compliant concerning Amazon policies. This is intended to provide a baseline data set for research on detecting promotional or biased reviews. (several thousand customer comments). This part of the data set is distributed separately and is available on request; Please contact the email address below if you are interested in obtaining this dataset.
To run this project's code you can leverage Docker with the instructions below:
-
For first time usage, you need to build the Docker image:
docker compose build .
-
After that you can start the container with:
docker compose up -d
-
Go to Docker desktop, to retrieve your jupyterlab access URL.
- Click the container name, to open the logs.
- Retrieve the jupyterlab URL, it should look like this:
http://127.0.0.1:8888/lab?token=4150032f3603c85febf54b8b40bb761a2eb46e2fb593d5dc
-
To stop the container, run:
docker compose down
- The model can still be optimized by improving its hyperparameters.
- By balancing the dataset we can expect better results.
- The training dataset was delimited for reasons of computational resources, we can estimate that with a greater number of records and columns of the DTM, the model will have a greater predictive power.
- I consider that the model obtained a good performance, taking into account the previous limitations.