# Sentiment Analysis: Customer Feedback

# Notebook 1.2: Data Preparation - Aggregation Pretest

This notebook is a pretest and experimentation environment for data processing techniques related to aggregated prediction. Its purpose is to validate and refine aggregation approaches that let sentiment analysis models handle variable-length sentences.

A sentence is split into smaller chunks of at most a fixed number of words, the model predicts each chunk separately, and the resulting probabilities are averaged to produce the final aggregated prediction.

## Setup

In [1]:
# libraries to work with data
import numpy as np
import pandas as pd
import joblib

# libraries for machine learning
import tensorflow as tf
from tensorflow.keras.models import load_model


## Loading Datasets

In [2]:
df_train = pd.read_pickle('../datasets/final_training_dataset.pkl')
df_train

Unnamed: 0,review_text,sentiment
0,wow love place,1
1,crust not good,0
2,not tasti textur nasti,0
3,stop late may bank holiday rick steve recommen...,1
4,select menu great price,1
...,...,...
25895,disappoint qualiti,0
25896,amaz experi highli recommend,1
25897,fast deliveri great packag,1
25898,great valu money,1


In [3]:
df_test = pd.read_pickle('../datasets/final_testing_dataset.pkl')
df_test

Unnamed: 0,review_text,sentiment
0,fantast spot even quit cocktail swell host yel...,1
1,love love love calamari good spici endless lis...,1
2,love place stiff martini cocktail cheap drink ...,1
3,everyth great cocktail bar great locat ambianc...,1
4,come pirat game around 530ish even get lucki t...,1
...,...,...
4316,wife catch show golden nugget hear good thing ...,0
4317,dumb show ever seen never laugh minut realiz w...,0
4318,girlfriend go show absolut terriblenot funni n...,0
4319,restroom look like bombard improvis shack amid...,0


## Revisiting Dataset Distributions

In [4]:
# create a new temporary column to calculate sentence lengths
df_train['sentence_length'] = df_train['review_text'].apply(lambda x: len(x.split()))
df_test['sentence_length'] = df_test['review_text'].apply(lambda x: len(x.split()))

In [5]:
# check distribution statistics for training dataset
print(df_train['sentence_length'].describe())
print("=" * 40)

print(f"90th percentile: {df_train['sentence_length'].quantile(0.90)}")
print(f"95th percentile: {df_train['sentence_length'].quantile(0.95)}")
print(f"99th percentile: {df_train['sentence_length'].quantile(0.99)}")
print(f"Max length: {df_train['sentence_length'].max()}")
print("=" * 40)

print(f"Sentences longer than 90th percentile: {(df_train['sentence_length'] > df_train['sentence_length'].quantile(0.90)).sum()}")
print(f"Sentences longer than 95th percentile: {(df_train['sentence_length'] > df_train['sentence_length'].quantile(0.95)).sum()}")
print(f"Sentences longer than 99th percentile: {(df_train['sentence_length'] > df_train['sentence_length'].quantile(0.99)).sum()}")

count    25900.000000
mean         3.412317
std          1.090908
min          1.000000
25%          3.000000
50%          4.000000
75%          4.000000
max         19.000000
Name: sentence_length, dtype: float64
90th percentile: 4.0
95th percentile: 4.0
99th percentile: 7.0
Max length: 19
Sentences longer than 90th percentile: 513
Sentences longer than 95th percentile: 513
Sentences longer than 99th percentile: 247


In [6]:
# check distribution statistics for testing dataset
print(df_test['sentence_length'].describe())
print("=" * 40)

print(f"90th percentile: {df_test['sentence_length'].quantile(0.90)}")
print(f"95th percentile: {df_test['sentence_length'].quantile(0.95)}")
print(f"99th percentile: {df_test['sentence_length'].quantile(0.99)}")
print(f"Max length: {df_test['sentence_length'].max()}")
print("=" * 40)

print(f"Sentences longer than 90th percentile: {(df_test['sentence_length'] > df_test['sentence_length'].quantile(0.90)).sum()}")
print(f"Sentences longer than 95th percentile: {(df_test['sentence_length'] > df_test['sentence_length'].quantile(0.95)).sum()}")
print(f"Sentences longer than 99th percentile: {(df_test['sentence_length'] > df_test['sentence_length'].quantile(0.99)).sum()}")

count    4321.000000
mean       65.400139
std        57.899580
min         0.000000
25%        27.000000
50%        49.000000
75%        85.000000
max       472.000000
Name: sentence_length, dtype: float64
90th percentile: 134.0
95th percentile: 177.0
99th percentile: 299.0000000000009
Max length: 472
Sentences longer than 90th percentile: 432
Sentences longer than 95th percentile: 215
Sentences longer than 99th percentile: 44


The training dataset consists mostly of short sentences, with a median of 4 words and a maximum of 19 words, while the testing dataset has a median of 49 words and includes sentences of up to 472 words. Breaking longer test sentences into chunks whose length is close to that of the training sentences reduces this mismatch, which is expected to help the models generalize to longer sentences.
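To get a feel for the scale of this mismatch, the number of chunks a sentence produces is the ceiling of its word count divided by the chunk size. The sketch below assumes a chunk size of 4 (the default of the `chunk_sentence` helper defined later in this notebook) and plugs in the test-set statistics printed above; `chunks_needed` is a hypothetical helper name used only for illustration.

```python
import math

def chunks_needed(n_words, chunk_size=4):
    # number of chunks when a sentence is cut into pieces of at most chunk_size words
    return math.ceil(n_words / chunk_size)

# word counts taken from the test-set statistics printed above
for label, n_words in [("median", 49), ("90th percentile", 134),
                       ("99th percentile", 299), ("max", 472)]:
    print(f"{label}: {n_words} words -> {chunks_needed(n_words)} chunks")
```

Under this assumption, the longest test sentence would be scored as 118 separate model inputs.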

Therefore, **a new approach accepts variable sentence lengths**. Each sentence is split into chunks (phrases) of a set length, and the model makes a prediction for each chunk. The per-class probabilities are then averaged across the chunks, giving an aggregated prediction that represents the overall probability distribution over the classes. Averaging probabilities in this way resembles soft voting in ensemble methods and makes the prediction less dependent on any single chunk.
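The averaging step can be illustrated with made-up numbers. In the sketch below, the three rows of `chunk_probs` are hypothetical per-chunk probabilities for three classes, not outputs of the models in this notebook; the column-wise mean gives the aggregated distribution and `argmax` gives the predicted class.

```python
import numpy as np

# hypothetical probabilities for 3 chunks (rows) over 3 classes (columns)
chunk_probs = np.array([
    [0.70, 0.20, 0.10],
    [0.40, 0.50, 0.10],
    [0.10, 0.80, 0.10],
])

# average each class probability across the chunks
avg_probs = chunk_probs.mean(axis=0)

# the aggregated prediction is the class with the highest average probability
predicted_class = int(avg_probs.argmax())

print(avg_probs)
print(predicted_class)
```

Because every row sums to 1, the averaged vector also sums to 1, so it can still be read as a probability distribution.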

## Aggregated Prediction: Defining Helper Functions

In [7]:
# load 2 models for example
nbc = joblib.load('../models/sentiment_analysis_nbc_model.joblib')
rnn = load_model("../models/sentiment_analysis_rnn_textvectorization_model.keras")

In [8]:
nbc

GridSearchCV parameters:
parameter,value
estimator,Pipeline(step...inomialNB())])
param_grid,"{'classifier__alpha': [0.1, 0.5, ...], 'vectorizer__max_features': [500, 800, ...]}"
scoring,"{'accuracy': 'accuracy', 'f1_macro': 'f1_macro'}"
n_jobs,-1
refit,'f1_macro'
cv,5
verbose,2
pre_dispatch,'2*n_jobs'
error_score,
return_train_score,True

vectorizer parameters:
parameter,value
input,'content'
encoding,'utf-8'
decode_error,'strict'
strip_accents,
lowercase,True
preprocessor,
tokenizer,
stop_words,
token_pattern,'(?u)\\b\\w\\w+\\b'
ngram_range,"(1, ...)"

classifier (MultinomialNB) parameters:
parameter,value
alpha,0.1
force_alpha,True
fit_prior,True
class_prior,

In [9]:
rnn

<Sequential name=sequential, built=True>

In [10]:
# function to convert a long sentence into a combination of shorter phrases
def chunk_sentence(sentence, chunk_size=4):
    # split the sentence into a list of words
    words = sentence.split()

    # calculate how many full chunks we can make
    total_words = len(words)
    full_chunks_count = total_words // chunk_size  # number of full chunks of size chunk_size

    # create full chunks (exactly 'chunk_size' words each); each chunk's words are joined into a single string
    full_chunks = [' '.join(words[i:i+chunk_size]) for i in range(0, full_chunks_count * chunk_size, chunk_size)]

    # handle remaining words (if any) and join them into a single string
    # (checking the word list avoids appending an empty phrase when the length is an exact multiple of chunk_size)
    remaining_words = words[full_chunks_count * chunk_size:]
    remaining = [' '.join(remaining_words)] if remaining_words else []

    # combine the full chunks and remaining
    all_phrases = full_chunks + remaining

    # an empty sentence still yields one (empty) phrase, so there is always something to predict
    return all_phrases if all_phrases else ['']

# example
s = "This is a test sentence to see how one can chunk the words properly into smaller parts"
p = chunk_sentence(s)

print(f"The following sentence:\n  {s}\n\nis converted into:\n  {p}")

The following sentence:
  This is a test sentence to see how one can chunk the words properly into smaller parts

is converted into:
  ['This is a test', 'sentence to see how', 'one can chunk the', 'words properly into smaller', 'parts']


In [11]:
# function to get average probabilities of a split sentence
def get_avg_probs(phrases, model, model_name, total_classes=3):
    # initialize array to store probabilities
    all_probs = np.zeros((len(phrases), total_classes))

    # predict each chunk of a sentence
    for i, phrase in enumerate(phrases):

        # get probability for each class
        if model_name == "nbc":
            y_pred_probs = model.predict_proba([phrase])  # wrap in a list to preserve the expected input shape
        elif model_name == "rnn":
            y_pred_probs = model.predict(tf.convert_to_tensor([phrase]))
        else:
            # fail early instead of hitting an undefined variable below
            raise ValueError(f"Unsupported model_name: {model_name!r}")

        # save the single row of probabilities in the array
        all_probs[i] = y_pred_probs[0]

    # compute the average probability for each class
    avg_probs = all_probs.mean(axis=0)

    return avg_probs

# example
s = "The food is good but the service is bad"
p = chunk_sentence(s)
ap = get_avg_probs(p, nbc, "nbc")
# ap = get_avg_probs(p, rnn, "rnn")
print(f"The following sentence:\n  '{s}'\n\nis converted into:\n  {p}\n\nEach chunk is predicted by the model, and the probabilities are aggregated to give the final class probabilities:\n  {ap}")

The following sentence:
  'The food is good but the service is bad'

is converted into:
  ['The food is good', 'but the service is', 'bad']

Each chunk is predicted by the model, and the probabilities are aggregated to give the final class probabilities:
  [0.50856594 0.42392956 0.0675045 ]


In [12]:
# function to make aggregated prediction
def get_aggregated_prediction(sentence, model, model_name, chunk_size=4, total_classes=3):
    phrases = chunk_sentence(sentence, chunk_size)
    avg_probs = get_avg_probs(phrases, model, model_name, total_classes)
    final_class = int(avg_probs.argmax())

    return final_class

# example
s = "I love that i hate it"
pred = get_aggregated_prediction(s, nbc, model_name="nbc")

print(f"The model's prediction for '{s}' is {pred}.")

The model's prediction for 'I love that i hate it' is 1.


## Trying Aggregated Prediction with Helper Functions

In [13]:
s = "The food is good but the service is bad"

In [14]:
nbc.predict_proba([s])  # wrap in a list to preserve the expected input shape

array([[7.88918129e-01, 2.11081626e-01, 2.45290778e-07]])

In [15]:
a = tf.convert_to_tensor([s])  # wrap in a list to preserve the expected input shape
rnn.predict(a).argmax()

1/1 ━━━━━━━━━━━━━━━━━━━━ 1s 813ms/step


np.int64(1)

## Summary and Next Steps

This notebook implemented and tried out three helper functions for aggregated prediction: `chunk_sentence`, `get_avg_probs`, and `get_aggregated_prediction`. The aggregation approach handles variable-length sentences, and the functions support both the Naive Bayes Classifier and the RNN models. As a next step, these functions will be moved into a Python script that supports additional machine learning models, enables batch processing, and integrates with the main prediction pipeline.
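As a rough sketch of what batch processing could look like, the function below chunks every sentence, sends all chunks to the model in a single `predict_proba` call, and averages the rows that belong to each sentence. The names `chunk_words`, `predict_batch`, and `DummyModel` are hypothetical; `DummyModel` is only a stand-in so the example runs without the saved models, and the sketch assumes a scikit-learn style `predict_proba` that accepts a list of strings.

```python
import numpy as np

def chunk_words(sentence, chunk_size=4):
    # split into chunks of at most chunk_size words; keep one empty chunk for empty input
    words = sentence.split()
    chunks = [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]
    return chunks or [""]

def predict_batch(sentences, model, chunk_size=4):
    if not sentences:
        return np.array([], dtype=int)

    # flatten all chunks and remember which sentence each chunk came from
    chunks, owners = [], []
    for idx, sentence in enumerate(sentences):
        for chunk in chunk_words(sentence, chunk_size):
            chunks.append(chunk)
            owners.append(idx)

    probs = np.asarray(model.predict_proba(chunks))  # one call for all chunks
    owners = np.asarray(owners)

    # sum the chunk probabilities per sentence, then divide by the chunk counts
    sums = np.zeros((len(sentences), probs.shape[1]))
    np.add.at(sums, owners, probs)
    counts = np.bincount(owners, minlength=len(sentences))
    avg_probs = sums / counts[:, None]
    return avg_probs.argmax(axis=1)

class DummyModel:
    # stand-in classifier (assumption): class 1 if the chunk contains "good", else class 0
    def predict_proba(self, texts):
        return np.array([[0.1, 0.9] if "good" in t else [0.8, 0.2] for t in texts])

preds = predict_batch(["good food good price good staff", "bad service"], DummyModel())
print(preds)
```

Grouping all chunks into one call avoids a separate model invocation per chunk, which is likely to matter for the long test sentences. A Keras model would need `predict` in place of `predict_proba`.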

This notebook was prepared by `La Wun Nannda`.