# Automated Customer Reviews Project

## Executive Summary

This business case outlines the development of an NLP model to automate the processing of customer feedback for a retail company.

The goal is to evaluate how traditional ML solutions (Naive Bayes, SVM, Random Forest, etc.) compare against a Deep Learning solution (e.g., a Transformer from Hugging Face) when classifying the sentiment of a user review as positive, negative, or neutral.

## Problem Statement

The company receives thousands of text reviews every month, making it challenging to categorize, analyze, and visualize them manually. An automated system can save time, reduce costs, and provide real-time insights into customer sentiment.
Automatically classifying a review as positive, negative, or neutral is important because, often:
- Users don't leave a score along with their review
- Scores from different users cannot be compared directly (for one user a 4 might be great, while for another a 4 means "not a 5" and is actually bad)

## Project Goals

- The ML/AI system should be able to classify customers' reviews (the textual content of the reviews) as positive, neutral, or negative.
- You should be able to compare which solution yields better results:
  - One that reads the text with a Language Model and classifies into "Positive", "Negative" or "Neutral"
  - One that transforms reviews into tabular data and classifies them using traditional Machine Learning techniques
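
To compare the two solutions, both need to be scored against the same held-out labels with the same metric. The sketch below assumes macro-averaged F1 is an acceptable criterion (the source does not name a metric); `macro_f1` and the toy predictions are hypothetical and invented purely for illustration, not results from the dataset.

```python
from collections import Counter

LABELS = ("negative", "neutral", "positive")

def macro_f1(y_true, y_pred, labels=LABELS):
    """Average the per-class F1 scores, giving each class equal weight."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy labels (assumption: made up for the example)
y_true = ["positive", "positive", "neutral", "negative"]
pred_a = ["positive", "positive", "positive", "negative"]  # hypothetical model A
pred_b = ["positive", "neutral", "neutral", "negative"]    # hypothetical model B
print(macro_f1(y_true, pred_a), macro_f1(y_true, pred_b))
```

Macro averaging weights the three classes equally, which may be preferable to plain accuracy if one class dominates the data.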

## Data Details
The publicly available, downsized dataset of Amazon customer reviews from the Amazon online marketplace is used; it can be found [here](https://www.kaggle.com/datasets/datafiniti/consumer-reviews-of-amazon-products/data).

To obtain the three sentiment classes, transform all the scores with the following logic:
- Scores of 1,2 or 3: Negative
- Scores of 4: Neutral
- Scores of 5: Positive
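
The mapping above can be sketched as a small standalone function. `score_to_label` is a hypothetical helper name, and returning `None` for missing or out-of-range scores is an assumption about how such cases might be handled.

```python
import math

def score_to_label(score):
    """Map a 1-5 star rating to a sentiment label; return None if it cannot be mapped."""
    if score is None or (isinstance(score, float) and math.isnan(score)):
        return None  # missing rating: no label can be derived
    if score in (1, 2, 3):
        return "negative"
    if score == 4:
        return "neutral"
    if score == 5:
        return "positive"
    return None  # anything outside 1-5 is left unlabeled (assumption)

print([score_to_label(s) for s in (1, 3.0, 4, 5.0, float("nan"))])
```

Integer and float ratings are treated alike because `3.0 == 3` in Python.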

<hr/>

## Part 1: Traditional NLP & ML approach

### Data Preprocessing

#### Data Cleaning

We begin by downloading the dataset from Kaggle, loading it into a dataframe, and then exploring the data.

In [19]:
import kagglehub
from pathlib import Path
import os

try:
  path0  # reuse the path if the dataset was already downloaded in this session
except NameError:
  # Download latest version
  path0 = kagglehub.dataset_download("datafiniti/consumer-reviews-of-amazon-products")

print("Path to dataset files:", path0)
path = os.path.join(path0, '1429_1.csv')

print("Path to csv:", path)

Path to dataset files: /root/.cache/kagglehub/datasets/datafiniti/consumer-reviews-of-amazon-products/versions/5
Path to csv: /root/.cache/kagglehub/datasets/datafiniti/consumer-reviews-of-amazon-products/versions/5/1429_1.csv


In [33]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv(path, low_memory=False)
df.head()

Unnamed: 0,id,name,asins,brand,categories,keys,manufacturer,reviews.date,reviews.dateAdded,reviews.dateSeen,...,reviews.doRecommend,reviews.id,reviews.numHelpful,reviews.rating,reviews.sourceURLs,reviews.text,reviews.title,reviews.userCity,reviews.userProvince,reviews.username
0,AVqkIhwDv8e3D1O-lebb,"All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi,...",B01AHB9CN2,Amazon,"Electronics,iPad & Tablets,All Tablets,Fire Ta...","841667104676,amazon/53004484,amazon/b01ahb9cn2...",Amazon,2017-01-13T00:00:00.000Z,2017-07-03T23:33:15Z,"2017-06-07T09:04:00.000Z,2017-04-30T00:45:00.000Z",...,True,,0.0,5.0,http://reviews.bestbuy.com/3545/5620406/review...,This product so far has not disappointed. My c...,Kindle,,,Adapter
1,AVqkIhwDv8e3D1O-lebb,"All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi,...",B01AHB9CN2,Amazon,"Electronics,iPad & Tablets,All Tablets,Fire Ta...","841667104676,amazon/53004484,amazon/b01ahb9cn2...",Amazon,2017-01-13T00:00:00.000Z,2017-07-03T23:33:15Z,"2017-06-07T09:04:00.000Z,2017-04-30T00:45:00.000Z",...,True,,0.0,5.0,http://reviews.bestbuy.com/3545/5620406/review...,great for beginner or experienced person. Boug...,very fast,,,truman
2,AVqkIhwDv8e3D1O-lebb,"All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi,...",B01AHB9CN2,Amazon,"Electronics,iPad & Tablets,All Tablets,Fire Ta...","841667104676,amazon/53004484,amazon/b01ahb9cn2...",Amazon,2017-01-13T00:00:00.000Z,2017-07-03T23:33:15Z,"2017-06-07T09:04:00.000Z,2017-04-30T00:45:00.000Z",...,True,,0.0,5.0,http://reviews.bestbuy.com/3545/5620406/review...,Inexpensive tablet for him to use and learn on...,Beginner tablet for our 9 year old son.,,,DaveZ
3,AVqkIhwDv8e3D1O-lebb,"All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi,...",B01AHB9CN2,Amazon,"Electronics,iPad & Tablets,All Tablets,Fire Ta...","841667104676,amazon/53004484,amazon/b01ahb9cn2...",Amazon,2017-01-13T00:00:00.000Z,2017-07-03T23:33:15Z,"2017-06-07T09:04:00.000Z,2017-04-30T00:45:00.000Z",...,True,,0.0,4.0,http://reviews.bestbuy.com/3545/5620406/review...,I've had my Fire HD 8 two weeks now and I love...,Good!!!,,,Shacks
4,AVqkIhwDv8e3D1O-lebb,"All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi,...",B01AHB9CN2,Amazon,"Electronics,iPad & Tablets,All Tablets,Fire Ta...","841667104676,amazon/53004484,amazon/b01ahb9cn2...",Amazon,2017-01-12T00:00:00.000Z,2017-07-03T23:33:15Z,"2017-06-07T09:04:00.000Z,2017-04-30T00:45:00.000Z",...,True,,0.0,5.0,http://reviews.bestbuy.com/3545/5620406/review...,I bought this for my grand daughter when she c...,Fantastic Tablet for kids,,,explore42


In [34]:
display(df.info())
print()

null_counts = df.isnull().sum()

# Print the null counts for each column
for column, count in null_counts.items():
    if count > 0:  # Only print columns that have nulls
        print(f"{column:<20} has {count:<7} null values.")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34660 entries, 0 to 34659
Data columns (total 21 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   id                    34660 non-null  object 
 1   name                  27900 non-null  object 
 2   asins                 34658 non-null  object 
 3   brand                 34660 non-null  object 
 4   categories            34660 non-null  object 
 5   keys                  34660 non-null  object 
 6   manufacturer          34660 non-null  object 
 7   reviews.date          34621 non-null  object 
 8   reviews.dateAdded     24039 non-null  object 
 9   reviews.dateSeen      34660 non-null  object 
 10  reviews.didPurchase   1 non-null      object 
 11  reviews.doRecommend   34066 non-null  object 
 12  reviews.id            1 non-null      float64
 13  reviews.numHelpful    34131 non-null  float64
 14  reviews.rating        34627 non-null  float64
 15  reviews.sourceURLs 

None


name                 has 6760    null values.
asins                has 2       null values.
reviews.date         has 39      null values.
reviews.dateAdded    has 10621   null values.
reviews.didPurchase  has 34659   null values.
reviews.doRecommend  has 594     null values.
reviews.id           has 34659   null values.
reviews.numHelpful   has 529     null values.
reviews.rating       has 33      null values.
reviews.text         has 1       null values.
reviews.title        has 6       null values.
reviews.userCity     has 34660   null values.
reviews.userProvince has 34660   null values.
reviews.username     has 7       null values.


We observe that there are many columns with null values. We will begin by removing the ones that are mostly null. Then we will decide what to do with the others.
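
As a quick sanity check, the null counts printed above can be compared against the row count to see which columns a 50% cut-off (the value used in the next cell) would remove. This is only a sketch: the numbers are copied by hand from the output above, and the variable names are hypothetical.

```python
# Row count and null counts copied from the df.info() / null-count output above
n_rows = 34660
null_counts = {
    "name": 6760,
    "asins": 2,
    "reviews.date": 39,
    "reviews.dateAdded": 10621,
    "reviews.didPurchase": 34659,
    "reviews.doRecommend": 594,
    "reviews.id": 34659,
    "reviews.numHelpful": 529,
    "reviews.rating": 33,
    "reviews.text": 1,
    "reviews.title": 6,
    "reviews.userCity": 34660,
    "reviews.userProvince": 34660,
    "reviews.username": 7,
}

min_fraction_present = 0.5  # a column is kept if at least half of its values are non-null
mostly_null = [
    col for col, nulls in null_counts.items()
    if (n_rows - nulls) < min_fraction_present * n_rows
]
print(mostly_null)
```

Four columns fall below the cut-off, which matches the drop from 21 to 17 columns reported after the next cell.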

In [36]:
# Remove columns where fewer than 50% of the values are non-null
df = df.dropna(thresh=df.shape[0]*0.5, axis=1)
df.info()
print()
null_counts = df.isnull().sum()

# Print the null counts for each column
for column, count in null_counts.items():
    if count > 0:  # Only print columns that have nulls
        print(f"{column:<20} has {count:<7} null values.")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34660 entries, 0 to 34659
Data columns (total 17 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   id                   34660 non-null  object 
 1   name                 27900 non-null  object 
 2   asins                34658 non-null  object 
 3   brand                34660 non-null  object 
 4   categories           34660 non-null  object 
 5   keys                 34660 non-null  object 
 6   manufacturer         34660 non-null  object 
 7   reviews.date         34621 non-null  object 
 8   reviews.dateAdded    24039 non-null  object 
 9   reviews.dateSeen     34660 non-null  object 
 10  reviews.doRecommend  34066 non-null  object 
 11  reviews.numHelpful   34131 non-null  float64
 12  reviews.rating       34627 non-null  float64
 13  reviews.sourceURLs   34660 non-null  object 
 14  reviews.text         34659 non-null  object 
 15  reviews.title        34654 non-null 

Since we will be dealing with the reviews, we will drop the rest of the columns and keep only the title, text, and rating. Then we will merge the title and the text, and transform the ratings into just three values: negative (1-3), neutral (4), or positive (5). Since labels are needed at least for training, we will also remove the rows without labels (just 33 reviews).

In [92]:
# select only the columns that are useful for the analysis (title, reviews, and ratings). also drop the rows with missing ratings
reviews = df[['reviews.text', 'reviews.title', 'reviews.rating']]
reviews = reviews.dropna(subset=['reviews.rating'])
reviews.info()

<class 'pandas.core.frame.DataFrame'>
Index: 34627 entries, 0 to 34659
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   reviews.text    34626 non-null  object 
 1   reviews.title   34621 non-null  object 
 2   reviews.rating  34627 non-null  float64
dtypes: float64(1), object(2)
memory usage: 1.1+ MB


In [93]:
# Merge the titles with the text of the reviews
reviews['text'] = '(' + reviews['reviews.title'] + ') ' + reviews['reviews.text']

# Map ratings to negative (1-3), neutral (4), and positive (5); the ratings are stored as floats
reviews['rating'] = reviews['reviews.rating'].map({1.0: 'negative', 2.0: 'negative', 3.0: 'negative', 4.0: 'neutral', 5.0: 'positive'})

# Drop the other columns
reviews = reviews[['text', 'rating']]

# Drop rows where the merged text is missing
# (concatenating with a missing title or a missing review text yields NaN)
reviews = reviews.dropna(subset=['text'])

reviews.info()

<class 'pandas.core.frame.DataFrame'>
Index: 34620 entries, 0 to 34659
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    34620 non-null  object
 1   rating  34620 non-null  object
dtypes: object(2)
memory usage: 811.4+ KB


##### Fixing text
For traditional NLP and ML models, we need to properly clean the text. We will:
- Remove special characters, punctuation, and unnecessary whitespace from the text data.
- Convert text to lowercase to ensure consistency in word representations.
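
The two steps above can be sketched with the standard library alone. `basic_clean` is a hypothetical, deliberately simplified helper for illustration; it does not handle HTML or encoded characters.

```python
import re

def basic_clean(text):
    """Lowercase the text, keep letters only, and collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)      # replace digits, punctuation, symbols with spaces
    return re.sub(r"\s+", " ", text).strip()   # collapse runs of whitespace

print(basic_clean("(Good!!!)  I've had my Fire HD 8 two weeks now"))
```

Replacing unwanted characters with a space rather than deleting them keeps neighbouring words from being glued together.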

In [64]:
import re
from bs4 import BeautifulSoup

def clean_text(text):
    """Strip markup and non-letter characters, then normalize whitespace and case."""
    # Step 1: Remove inline JavaScript/CSS
    text = re.sub(r'<(script|style).*?>.*?</\1>', '', text, flags=re.DOTALL)

    # Step 2: Remove HTML comments
    text = re.sub(r'<!--.*?-->', '', text, flags=re.DOTALL)

    # Step 3: Remove remaining HTML tags using BeautifulSoup
    soup = BeautifulSoup(text, 'html.parser')
    text = soup.get_text(separator=' ')  # Extract text and separate with spaces

    # Step 4: Remove encodings like =XX (two hexadecimal digits)
    text = re.sub(r'=[0-9A-Fa-f]{2}', ' ', text)

    # Step 5: Remove a leading bytes-literal prefix (b' or b"), if present.
    # str.lstrip('b') would also strip the first letter of words such as "bought".
    text = re.sub(r'^\s*b(?=[\'"])', '', text)

    # Step 6: Replace special characters and numbers with spaces
    text = re.sub(r'[^a-zA-Z\s]', ' ', text)

    # Step 7: Remove standalone single characters
    text = re.sub(r'\b\w\b', '', text)

    # Step 8: Collapse whitespace and convert to lowercase
    text = re.sub(r'\s+', ' ', text).strip()
    return text.lower()


In [94]:
# Apply the clean_text function and overwrite the 'text' column
review_nlp = reviews.copy()

# Does this go now, or after tokenizing?
# review_nlp['text'] = review_nlp['text'].apply(clean_text)

# and see the first 5 texts
print(review_nlp.values[:5,0])

['(Kindle) This product so far has not disappointed. My children love to use it and I like the ability to monitor control what content they see with ease.'
 '(very fast) great for beginner or experienced person. Bought as a gift and she loves it'
 '(Beginner tablet for our 9 year old son.) Inexpensive tablet for him to use and learn on, step up from the NABI. He was thrilled with it, learn how to Skype on it already...'
 "(Good!!!) I've had my Fire HD 8 two weeks now and I love it. This tablet is a great value.We are Prime Members and that is where this tablet SHINES. I love being able to easily access all of the Prime content as well as movies you can download and watch laterThis has a 1280/800 screen which has some really nice look to it its nice and crisp and very bright infact it is brighter then the ipad pro costing $900 base model. The build on this fire is INSANELY AWESOME running at only 7.7mm thick and the smooth glossy feel on the back it is really amazing to hold its like th

#### Tokenization and Lemmatization

In [84]:
import nltk
from nltk.corpus import stopwords

dler = nltk.downloader.Downloader()
dler._update_index()
# Workaround: mark panlex_lite as already installed so the index treats it as present
dler._status_cache['panlex_lite'] = 'installed'
dler.download('stopwords')
stop = stopwords.words('english')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [87]:
import string
from nltk.corpus import stopwords

from nltk.stem.snowball import SnowballStemmer
from nltk.stem import WordNetLemmatizer
dler.download('punkt_tab')  # tokenizer models required by word_tokenize
from nltk.tokenize import word_tokenize

stop = stopwords.words('english')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


In [95]:
# Tokenize
review_nlp['text'] = review_nlp['text'].apply(word_tokenize)
review_nlp.head()

Unnamed: 0,text,rating
0,"[(, Kindle, ), This, product, so, far, has, no...",positive
1,"[(, very, fast, ), great, for, beginner, or, e...",positive
2,"[(, Beginner, tablet, for, our, 9, year, old, ...",positive
3,"[(, Good, !, !, !, ), I, 've, had, my, Fire, H...",neutral
4,"[(, Fantastic, Tablet, for, kids, ), I, bought...",positive


In [96]:
# Remove punctuation
# Compile the punctuation pattern once
pattern = re.compile('[%s]' % re.escape(string.punctuation))

# Strip punctuation from each token and drop tokens that become empty
def remove_punctuation_from_tokens(tokenized_sentence):
    stripped = (pattern.sub('', token) for token in tokenized_sentence)
    return [token for token in stripped if token]

# Apply the function to the 'text' column of the DataFrame
review_nlp['text'] = review_nlp['text'].apply(remove_punctuation_from_tokens)

# Check the result
print(review_nlp.head())

                                                text    rating
0  [Kindle, This, product, so, far, has, not, dis...  positive
1  [very, fast, great, for, beginner, or, experie...  positive
2  [Beginner, tablet, for, our, 9, year, old, son...  positive
3  [Good, I, ve, had, my, Fire, HD, 8, two, weeks...   neutral
4  [Fantastic, Tablet, for, kids, I, bought, this...  positive


In [101]:
# Lemmatize
import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
dler.download('averaged_perceptron_tagger_eng')
dler.download('wordnet')

# Initialize the WordNet lemmatizer
wordnet_lemma = WordNetLemmatizer()

# Function to get word POS for lemmatization
def get_wordnet_pos(word):
    """Map POS tag to the first character lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0]  # Get the POS tag for the word
    tag_dict = {"J": wordnet.ADJ,  # Adjective
                "N": wordnet.NOUN,  # Noun
                "V": wordnet.VERB,  # Verb
                "R": wordnet.ADV}   # Adverb

    return tag_dict.get(tag, wordnet.NOUN)  # Default to NOUN if unknown

# Function to lemmatize a tokenized sentence
def lemmatize_sentence(sentence):
    return [wordnet_lemma.lemmatize(word, get_wordnet_pos(word)) for word in sentence]

[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to /root/nltk_data...


In [102]:
# Apply the lemmatization function to the 'text' column
review_nlp['text'] = review_nlp['text'].apply(lemmatize_sentence)

# Check the result
print(review_nlp.head())

                                                text    rating
0  [Kindle, This, product, so, far, have, not, di...  positive
1  [very, fast, great, for, beginner, or, experie...  positive
2  [Beginner, tablet, for, our, 9, year, old, son...  positive
3  [Good, I, ve, have, my, Fire, HD, 8, two, week...   neutral
4  [Fantastic, Tablet, for, kid, I, bought, this,...  positive


In [None]:
# remove stopwords
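
The stopword-removal step is left empty above. One possible way to fill it in is sketched below, under the assumption that tokens are compared in lowercase, since the tokens still carry their original capitalization while the NLTK list is lowercase. `remove_stopwords` and `demo_stop_words` are hypothetical names; the tiny stopword set only keeps the example self-contained, and in the notebook `set(stop)` would be passed instead.

```python
def remove_stopwords(tokens, stop_words):
    """Drop tokens whose lowercase form appears in stop_words."""
    return [token for token in tokens if token.lower() not in stop_words]

# Tiny illustrative stopword set (assumption); the notebook would use set(stop)
demo_stop_words = {"this", "so", "have", "not", "for", "or", "i", "my"}
tokens = ["Kindle", "This", "product", "so", "far", "have", "not"]
print(remove_stopwords(tokens, demo_stop_words))
```

In the notebook this could be applied with `review_nlp['text'].apply(lambda t: remove_stopwords(t, set(stop)))`. The NLTK English list includes negations such as "not", so it may be worth checking whether removing them hurts sentiment classification.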