# Sentiment Analysis: Customer Feedback

# Notebook 1: Preprocessing Data

In this notebook, two customer feedback datasets are preprocessed to prepare them for training two models:

- The Naive Bayes Classifier (NBC)
- The Recurrent Neural Network (RNN)

## Setup

In [1]:
# libraries to work with data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import pickle

# libraries for word normalization
import re
import nltk
from nltk.stem.porter import PorterStemmer
import spacy

In [2]:
# download NLTK stopwords
nltk.download('stopwords')
from nltk.corpus import stopwords

# load spaCy pre-trained model (load it only once for efficiency)
nlp = spacy.load("en_core_web_sm")

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ASUS\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Loading Datasets

There are two datasets used in this project:

1. The given training dataset, which contains historical restaurant reviews.
2. The Kaggle open-source dataset, prepared for customer sentiment analysis.

These datasets come in different file formats and have different structures. They are standardized and merged into a single dataframe used for training.

In [3]:
df1 = pd.read_csv('./datasets/a1_RestaurantReviews_HistoricDump.tsv', delimiter = '\t')
df1

Unnamed: 0,Review,Liked
0,Wow... Loved this place.,1
1,Crust is not good.,0
2,Not tasty and the texture was just nasty.,0
3,Stopped by during the late May bank holiday of...,1
4,The selection on the menu was great and so wer...,1
...,...,...
895,I want to first say our server was great and w...,1
896,The pizza selections are good.,1
897,"I had strawberry tea, which was good.",1
898,Highly unprofessional and rude to a loyal patron!,0


In [4]:
df2 = pd.read_csv('./datasets/Customer_Sentiment.csv', delimiter=',')
df2

Unnamed: 0,customer_id,gender,age_group,region,product_category,purchase_channel,platform,customer_rating,review_text,sentiment,response_time_hours,issue_resolved,complaint_registered
0,1,male,60+,north,automobile,online,flipkart,1,very disappointed with the quality.,negative,46,yes,yes
1,2,other,46-60,central,books,online,swiggy instamart,5,fast delivery and great packaging.,positive,5,yes,no
2,3,female,36-45,east,sports,online,facebook marketplace,1,very disappointed with the quality.,negative,38,yes,yes
3,4,female,18-25,central,groceries,online,zepto,2,product stopped working after few days.,negative,16,yes,yes
4,5,female,18-25,east,electronics,online,croma,3,neutral about the quality.,neutral,15,yes,no
...,...,...,...,...,...,...,...,...,...,...,...,...,...
24995,24996,female,36-45,south,beauty,online,lenskart,1,very disappointed with the quality.,negative,40,yes,yes
24996,24997,other,60+,central,automobile,online,flipkart,5,"amazing experience, highly recommend!",positive,25,yes,no
24997,24998,male,18-25,south,beauty,online,ajio,4,fast delivery and great packaging.,positive,9,yes,no
24998,24999,female,26-35,central,automobile,online,snapdeal,5,great value for money.,positive,65,no,no


The second dataset has more columns than necessary. Only the two columns that correspond to the first dataset (`review_text` and `sentiment`) are kept, and all other columns are dropped.

In [5]:
df2 = df2[['review_text', 'sentiment']]
df2

Unnamed: 0,review_text,sentiment
0,very disappointed with the quality.,negative
1,fast delivery and great packaging.,positive
2,very disappointed with the quality.,negative
3,product stopped working after few days.,negative
4,neutral about the quality.,neutral
...,...,...
24995,very disappointed with the quality.,negative
24996,"amazing experience, highly recommend!",positive
24997,fast delivery and great packaging.,positive
24998,great value for money.,positive


The second dataset is currently using three categories (positive, negative, and neutral), while the first dataset only uses two categories (positive and negative, represented as 1 and 0). Therefore, the categories in the second dataset need to be represented as numeric values. The assigned values must not contradict the first dataset's assignment. Hence, the categories will be mapped as follows:

- negative &#8594; 0
- positive &#8594; 1
- neutral &#8594; 2
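
The mapping above can also be written as a dictionary lookup. The sketch below is illustrative only: `SENTIMENT_TO_LABEL` and `encode_sentiment` are hypothetical names that do not appear in the notebook, and raising an error on an unknown label is an assumed design choice, so that a misspelled category cannot silently turn into a missing value.

```python
# Hypothetical mapping that mirrors the list above; 0 and 1 agree with the first dataset.
SENTIMENT_TO_LABEL = {"negative": 0, "positive": 1, "neutral": 2}

def encode_sentiment(sentiment):
    """Return the numeric label for a sentiment string, or raise on unknown input."""
    try:
        return SENTIMENT_TO_LABEL[sentiment]
    except KeyError:
        raise ValueError(f"Unexpected sentiment label: {sentiment!r}") from None

labels = [encode_sentiment(s) for s in ["negative", "positive", "neutral"]]
print(labels)  # [0, 1, 2]
```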

In [6]:
# function to convert sentiment labels into numeric categories
def categorize_sentiment(sentiment):
    if sentiment == "negative":
        return 0
    elif sentiment == "positive":
        return 1
    elif sentiment == "neutral":
        return 2

# apply conversion (assign returns a new dataframe, so no in-place write on a column subset)
df2 = df2.assign(sentiment=df2['sentiment'].apply(categorize_sentiment))
df2

Unnamed: 0,review_text,sentiment
0,very disappointed with the quality.,0
1,fast delivery and great packaging.,1
2,very disappointed with the quality.,0
3,product stopped working after few days.,0
4,neutral about the quality.,2
...,...,...
24995,very disappointed with the quality.,0
24996,"amazing experience, highly recommend!",1
24997,fast delivery and great packaging.,1
24998,great value for money.,1


## Joining Datasets

The two datasets are combined, but this requires the column names to match. Here, the columns of the first dataset are renamed to match the second dataset's columns, but the reverse can also be done.

In [7]:
# rename the columns of the first dataset
df1 = df1.rename(columns={'Review': 'review_text', 'Liked': 'sentiment'})

# combine two datasets
df = pd.concat([df1, df2], ignore_index=True)
df

Unnamed: 0,review_text,sentiment
0,Wow... Loved this place.,1
1,Crust is not good.,0
2,Not tasty and the texture was just nasty.,0
3,Stopped by during the late May bank holiday of...,1
4,The selection on the menu was great and so wer...,1
...,...,...
25895,very disappointed with the quality.,0
25896,"amazing experience, highly recommend!",1
25897,fast delivery and great packaging.,1
25898,great value for money.,1


## Main Preprocessing Steps

There are three types of preprocessing. First, **invalid entries** within the dataframe must be removed.

The second type of preprocessing focuses on **word normalization**, as sentiment analysis emphasizes the meaning conveyed by words rather than irrelevant details. There are many elements in a sentence, such as punctuation, stop words, and certain numbers, that do not influence sentiment. These can be removed or adjusted to help the model focus on the important aspects of the sentence.

Therefore, each sentence in the dataset is preprocessed using the following steps:

1. **Convert to lowercase**
    - `HE is not EnJoYiNG.\n` &#8594; `he is not enjoying.\n`
2. **Remove newline characters**
    - `he is not enjoying.\n` &#8594; `he is not enjoying.`
3. **Remove punctuation**
    - `he is not enjoying.` &#8594; `he is not enjoying`
4. **Remove stop words**, except for "not", which signals negation and can reverse the sentiment
    - `he is not enjoying` &#8594; `not enjoying`
5. **Apply stemming** using the *Porter Stemmer*
    - `not enjoying` &#8594; `not enjoy`
6. **Remove digits** unless they are related to sentiment
    - `It costs $300` &#8594; `It costs`
    - `10 out of 10 service` &#8594; `10 out of 10 service`
7. **Apply lemmatization** (optional, as it can be computationally expensive for large datasets)
    - `the dishes were cold` &#8594; `the dish be cold`
    - `we had a good time` &#8594; `we have a good time`
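
Steps 1 to 4 can be traced on the example sentence with the standard library alone. This is a minimal sketch rather than the notebook's implementation: `DEMO_STOPWORDS` is a tiny hand-written set assumed here for illustration (the notebook uses NLTK's full English list), and stemming and lemmatization are left out because they need NLTK and spaCy.

```python
import re

# Tiny illustrative stop-word set (an assumption, not NLTK's list); "not" is deliberately absent.
DEMO_STOPWORDS = {"he", "is", "the", "a"}

def normalize_demo(text):
    """Apply steps 1 to 4 of the list above to a single string."""
    text = text.lower()                        # step 1: convert to lowercase
    text = text.replace("\n", " ")             # step 2: replace newlines with spaces
    text = re.sub(r"[^a-z0-9\s]", "", text)    # step 3: drop punctuation
    words = [w for w in text.split() if w not in DEMO_STOPWORDS]  # step 4: drop stop words
    return " ".join(words)

print(normalize_demo("HE is not EnJoYiNG.\n"))  # not enjoying
```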

The third type of preprocessing deals with **limiting the number of words** in a sentence.

A sentence can have an arbitrary number of words. In simple models like NBC, the length of individual sentences is less important. Deep learning models (like RNNs, CNNs, or transformers), however, are typically trained on batches of fixed-length inputs. Sentences must therefore be padded (with placeholder tokens) or truncated (by removing words) so that they all have the same length.
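
Padding and truncation can be sketched on a list of tokens. The function name `pad_or_truncate`, the `<pad>` placeholder, and the choice to pad and cut at the end of the sentence are all assumptions made for this illustration; deep learning libraries offer their own utilities with configurable behavior.

```python
def pad_or_truncate(tokens, max_len, pad_token="<pad>"):
    """Return a list of exactly max_len tokens."""
    if len(tokens) >= max_len:
        # truncate: keep only the first max_len tokens
        return tokens[:max_len]
    # pad: append placeholder tokens until the target length is reached
    return tokens + [pad_token] * (max_len - len(tokens))

print(pad_or_truncate(["crust", "not", "good"], 5))
# ['crust', 'not', 'good', '<pad>', '<pad>']
print(pad_or_truncate(["not", "tasti", "textur", "nasti"], 3))
# ['not', 'tasti', 'textur']
```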

The cutoff is a design choice. One option is to take a high percentile of the sentence-length distribution in the dataset, such as the 90th, as the maximum sentence length (the maximum number of words a sentence can have). Only the longest sentences are then truncated, so little information is lost, while all inputs share a manageable fixed length.
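
A percentile-based cutoff can be computed as follows. The list `sentence_lengths` is made-up data for illustration only. The sketch uses the standard library's `statistics.quantiles` with `method="inclusive"`, which interpolates linearly between observed values, the same scheme pandas' `quantile` applies by default.

```python
import statistics

# Hypothetical word counts per sentence, for illustration only.
sentence_lengths = [3, 3, 4, 9, 4, 2, 4, 4, 3, 19, 5]

# quantiles(n=10) returns the nine decile cut points; index 8 is the 90th percentile.
p90 = statistics.quantiles(sentence_lengths, n=10, method="inclusive")[8]

# number of sentences that would be truncated with this cutoff
longer = sum(1 for n in sentence_lengths if n > p90)

print(p90, longer)  # 9.0 1
```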

### Preprocessing: Removing Invalid Entries

In [8]:
# remove invalid entries
df = df.dropna()
df

Unnamed: 0,review_text,sentiment
0,Wow... Loved this place.,1
1,Crust is not good.,0
2,Not tasty and the texture was just nasty.,0
3,Stopped by during the late May bank holiday of...,1
4,The selection on the menu was great and so wer...,1
...,...,...
25895,very disappointed with the quality.,0
25896,"amazing experience, highly recommend!",1
25897,fast delivery and great packaging.,1
25898,great value for money.,1


### Preprocessing: Word Normalization
#### (a). Helper Functions

For lemmatization, the `spaCy` library is used. Its pre-trained English model must be downloaded once by running the following command in the terminal; otherwise `spacy.load("en_core_web_sm")` fails:

```bash
python -m spacy download en_core_web_sm
```
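
Whether the model is already available can be checked before loading it. The sketch below assumes the model was installed as a regular Python package, which is how `spacy download` installs it; the helper name `spacy_model_installed` is hypothetical.

```python
import importlib.util

def spacy_model_installed(model_name="en_core_web_sm"):
    """Return True if a package with this name is visible to the current interpreter."""
    return importlib.util.find_spec(model_name) is not None

if not spacy_model_installed():
    print("Model missing; run: python -m spacy download en_core_web_sm")
```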

In [9]:
# step 1
def to_lowercase(text):
    """
    Convert text to lowercase.
    
    Args:
        text (str): Input text to convert
        
    Returns:
        str: Text in lowercase
    """
    if text == '':
        return text
    return text.lower()

###########################################

# step 2
def remove_newlines(text):
    """
    Replace newline and carriage-return characters with spaces.
    
    Args:
        text (str): Input text that may contain newline characters
        
    Returns:
        str: Text with newline characters replaced by spaces
    """
    if text == '':
        return text
    return text.replace('\n', ' ').replace('\r', ' ')

###########################################

# step 3
def remove_punctuation(text):
    """
    Remove punctuation from text.
    
    Args:
        text (str): Input text that may contain punctuation
        
    Returns:
        str: Text with punctuation removed
    """
    if text == '':
        return text
    # remove all non-alphanumeric, non-whitespace characters
    return re.sub(r'[^a-zA-Z0-9\s]', '', text)

###########################################

# step 4
def remove_stopwords(text):
    """
    Remove stop words from text, except 'not'.
    
    Args:
        text (str): Input text
        
    Returns:
        str: Text with stop words removed (except 'not')
    """
    if text == '':
        return text
    
    # get English stopwords and remove 'not' from the stopwords set
    all_stopwords = set(stopwords.words('english'))
    all_stopwords.discard('not')
    
    # split text into words
    words = text.split()
    
    # filter out stopwords
    filtered_words = [word for word in words if word.lower() not in all_stopwords]
    
    # join words back into text
    return ' '.join(filtered_words)

###########################################

# step 5
def apply_stemming(text):
    """
    Apply Porter Stemmer to each word in the text.
    
    Args:
        text (str): Input text
        
    Returns:
        str: Text with words stemmed
    """
    if text == '':
        return text
    
    ps = PorterStemmer()
    
    # split text into words
    words = text.split()
    
    # apply stemming to each word
    stemmed_words = [ps.stem(word) for word in words]
    
    # join words back into text
    return ' '.join(stemmed_words)


###########################################

# step 6
def remove_digits(text):
    """
    Remove digits from text unless they are related to sentiment 
    (e.g., "10 out of 10", "5 stars", "rating 4").
    
    Args:
        text (str): Input text
        
    Returns:
        str: Text with non-sentiment-related digits removed
    """
    if text == '':
        return text
    
    # keywords that indicate sentiment-related numbers
    sentiment_keywords = ['out of', 'stars', 'star', 'rating', 'ratings', 
                         'score', 'scores', 'mark', 'marks', 'points', 'point']
    
    # helper that decides whether a matched number should be kept
    def should_keep_digit(match):
        # get the matched number and its position
        start_pos = match.start()
        end_pos = match.end()
        
        # get context around the number (20 characters before and after)
        context_start = max(0, start_pos - 20)
        context_end = min(len(text), end_pos + 20)
        context = text[context_start:context_end].lower()
        
        # check if any sentiment keyword is in the context
        for keyword in sentiment_keywords:
            if keyword in context:
                return True
        
        # if no sentiment keyword found, remove the number
        return False
    
    # replace numbers: keep if sentiment-related, remove otherwise
    # - \b is a word boundary, so only standalone numbers match
    # - \d+ matches one or more digits
    # - r'\b\d+\b' matches "123" in "hello 123 hello", but nothing in "hello12 3hello"
    # - m.group() returns the text that matched the pattern
    # - re.sub(pattern, repl, text) calls repl for each match and substitutes its return value
    result = re.sub(r'\b\d+\b', lambda m: m.group() if should_keep_digit(m) else '', text)
    
    # clean up multiple spaces that might be created
    result = re.sub(r'\s+', ' ', result)
    
    return result.strip()

###########################################

# step 7
def apply_lemmatization(text, nlp):
    """
    Apply lemmatization using spaCy to the text.
    
    Args:
        text (str): Input text
        nlp: spaCy pre-trained language model
        
    Returns:
        str: Text with words lemmatized
    """
    if text == '':
        return text
        
    # process text with spaCy
    doc = nlp(text)
    
    # extract lemmatized tokens
    lemmatized_tokens = [token.lemma_ for token in doc]
    
    # join tokens back into text
    return ' '.join(lemmatized_tokens)

###########################################

# MASTER FUNCTION that applies all preprocessing steps
def preprocess_text(text, nlp):
    """
    Apply all preprocessing steps to a text.
    
    Args:
        text (str): Input text
        nlp: spaCy pre-trained language model
        
    Returns:
        str: Fully preprocessed text
    """
    text = to_lowercase(text)
    text = remove_newlines(text)
    text = remove_punctuation(text)
    text = remove_stopwords(text)
    text = apply_stemming(text)
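    # note: digits are handled after stop-word removal and stemming, so context
    # keywords such as 'out of' or 'rating' may already be removed or altered here
    text = remove_digits(text)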
    text = apply_lemmatization(text, nlp)
    return text
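
The word-boundary pattern and the function-valued replacement used for digit handling can be tried in isolation. In this sketch, `keep_if_small` is a hypothetical stand-in for the keyword-context check: it keeps numbers up to 10 and drops larger ones, purely to show how `re.sub` calls the function once per match.

```python
import re

DIGITS = re.compile(r"\b\d+\b")

print(DIGITS.findall("hello 123 hello"))  # ['123']
print(DIGITS.findall("hello12 3hello"))   # []

def keep_if_small(match):
    """Hypothetical rule: keep numbers up to 10, remove anything larger."""
    return match.group() if int(match.group()) <= 10 else ""

raw = DIGITS.sub(keep_if_small, "10 10 service cost 300")
# collapse the whitespace left behind by removed numbers
cleaned = re.sub(r"\s+", " ", raw).strip()
print(cleaned)  # 10 10 service cost
```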

#### (b). Applying Preprocessing

In [10]:
# apply the MASTER FUNCTION
df['review_text'] = df['review_text'].apply(lambda x: preprocess_text(x, nlp))
df

Unnamed: 0,review_text,sentiment
0,wow love place,1
1,crust not good,0
2,not tasti textur nasti,0
3,stop late may bank holiday rick steve recommen...,1
4,select menu great price,1
...,...,...
25895,disappoint qualiti,0
25896,amaz experi highli recommend,1
25897,fast deliveri great packag,1
25898,great valu money,1
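
Many review strings in the second dataset appear to be identical, as the repeated rows in the earlier outputs suggest. When preprocessing is slow, each distinct string can be processed once and the result reused. The sketch below illustrates the idea with `functools.lru_cache`; `slow_preprocess` is a hypothetical stand-in for the real preprocessing function, and the call counter exists only to make the saving visible.

```python
from functools import lru_cache

calls = 0

def slow_preprocess(text):
    """Stand-in for an expensive preprocessing function (hypothetical)."""
    global calls
    calls += 1
    return text.lower().rstrip(".")

@lru_cache(maxsize=None)
def cached_preprocess(text):
    # the cache key is the raw string, so duplicates skip slow_preprocess
    return slow_preprocess(text)

reviews = [
    "very disappointed with the quality.",
    "fast delivery and great packaging.",
    "very disappointed with the quality.",
]
cleaned = [cached_preprocess(r) for r in reviews]
print(calls)  # 2
```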


### Preprocessing: Limiting the Number of Words in Each Sentence

In [11]:
# create a new temporary column to calculate sentence lengths
df['sentence_length'] = df['review_text'].apply(lambda x: len(x.split()))
df

Unnamed: 0,review_text,sentiment,sentence_length
0,wow love place,1,3
1,crust not good,0,3
2,not tasti textur nasti,0,4
3,stop late may bank holiday rick steve recommen...,1,9
4,select menu great price,1,4
...,...,...,...
25895,disappoint qualiti,0,2
25896,amaz experi highli recommend,1,4
25897,fast deliveri great packag,1,4
25898,great valu money,1,3


In [12]:
# check distribution statistics
print(df['sentence_length'].describe())
print("=" * 40)

print(f"90th percentile: {df['sentence_length'].quantile(0.90)}")
print(f"95th percentile: {df['sentence_length'].quantile(0.95)}")
print(f"99th percentile: {df['sentence_length'].quantile(0.99)}")
print(f"Max length: {df['sentence_length'].max()}")
print("=" * 40)

print(f"Sentences longer than 90th percentile: {(df['sentence_length'] > df['sentence_length'].quantile(0.90)).sum()}")
print(f"Sentences longer than 95th percentile: {(df['sentence_length'] > df['sentence_length'].quantile(0.95)).sum()}")
print(f"Sentences longer than 99th percentile: {(df['sentence_length'] > df['sentence_length'].quantile(0.99)).sum()}")

count    25900.000000
mean         3.412317
std          1.090908
min          1.000000
25%          3.000000
50%          4.000000
75%          4.000000
max         19.000000
Name: sentence_length, dtype: float64
90th percentile: 4.0
95th percentile: 4.0
99th percentile: 7.0
Max length: 19
Sentences longer than 90th percentile: 513
Sentences longer than 95th percentile: 513
Sentences longer than 99th percentile: 247


RNNs require sequences of equal length, but the longest sentence here has only 19 words after preprocessing. No sentence is therefore truncated, which preserves all information. Padding is applied separately for the RNN, after the text has been encoded.

In [13]:
# no truncation is applied; remove the temporary column
df = df.drop('sentence_length', axis=1)
df

Unnamed: 0,review_text,sentiment
0,wow love place,1
1,crust not good,0
2,not tasti textur nasti,0
3,stop late may bank holiday rick steve recommen...,1
4,select menu great price,1
...,...,...
25895,disappoint qualiti,0
25896,amaz experi highli recommend,1
25897,fast deliveri great packag,1
25898,great valu money,1


## Saving Dataset

In [14]:
df.to_pickle('./datasets/cleaned_final_dataset.pkl')

This notebook was prepared by `La Wun Nannda`.