[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/jeljov/NAP2025/blob/main/Mobile_App_Reviews_Classification_Word_Embeddings.ipynb)

# Word embeddings for document (review) classification

This notebook exemplifies text classification using a pre-trained word embeddings model ([Word2Vec](https://en.wikipedia.org/wiki/Word2vec)) for transforming textual data into numerical vectors (to serve as the input for classification algorithms).<br>

The dataset used in the example is the same as the one used in the previous two classes. It originates from the paper [Listening to the Crowd for the Release Planning of Mobile Apps](https://ieeexplore.ieee.org/abstract/document/8057860) and is available from [this web page](https://dibt.unimol.it/report/others/clap/) with the supplementary materials for the paper.

### The initial setup: downloading the required models, loading packages, setting up some useful variables

Up to now we’ve been using spaCy’s smallest English language pipeline (_en_core_web_sm_), which provides vocabulary, syntax, and entities, but not vectors. To take advantage of spaCy's built-in word vector model, we’ll need a larger pipeline, namely either _en_core_web_md_ or _en_core_web_lg_. An overview of all the English NLP pipelines is available at [https://spacy.io/models/en](https://spacy.io/models/en).

Since neither the medium nor the large English spaCy pipeline is directly available in Colab, we will first download the one we will use (_en_core_web_lg_):

In [None]:
!python -m spacy download en_core_web_lg

Note: if running this notebook locally, comment out the line in the cell above and run the following command in Terminal or Command Prompt: <br>
`python -m spacy download en_core_web_lg`

In [None]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split, GridSearchCV

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import RobustScaler
from sklearn.compose import ColumnTransformer

# modules for oversampling and pipeline building
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

from sklearn import metrics
from sklearn.metrics import ConfusionMatrixDisplay

# sklearn modules that implement classification algorithms
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

import spacy

from collections import defaultdict

RAND_STATE = 1

# spacy pipeline to use for text preprocessing and word vectors
SPACY_MODEL = "en_core_web_lg"

### Explore spaCy's pre-trained word embeddings model (Word2Vec)

Load the spaCy pipeline with pre-trained Word2Vec word embeddings (vectors):

In [None]:
spacy_model = spacy.load(SPACY_MODEL)

The loaded NLP pipeline includes a vocabulary of 343K unique words represented with 300-dimensional vectors.

Let's see what a word embedding looks like:

In [None]:
world_vec = spacy_model.vocab['world'].vector

In [None]:
print(f"Vectors dimension: {len(world_vec)}")

In [None]:
print('World: ', world_vec)

Before moving further, let's just check how we access words in the vocabulary of a spaCy NLP pipeline:

In [None]:
type(spacy_model.vocab['world'])

Each unique word in spaCy's vocabulary is represented as an instance of the [`Lexeme`](https://spacy.io/api/lexeme) class.

A Lexeme holds context-free lexical attributes, such as whether the word is a number, whether it is a stop word, its string, and its word vector:

In [None]:
lexeme = spacy_model.vocab['happy']
print("Properties of the word \"happy\" as a Lexeme")
print(lexeme.text)
print(lexeme.like_num)
print(lexeme.is_stop)
print(lexeme.has_vector)

On the other hand, `Lexeme` lacks contextual linguistic information - such as part-of-speech and lemma, which are available from a [`Token`](https://spacy.io/api/token):

In [None]:
doc = spacy_model("I'm so happy!")
print("Properties of the word \"happy\" as a Token")
token = doc[3]
print(token.lemma_)
print(token.pos_)
print(token.text)
print(token.lex.is_stop) # access to the lexeme from the token

### Word similarity

Having words represented as vectors, we can use linear algebra and vector space models to analyze the relationship between words.

We will start by examining word similarity, by computing the ***cosine similarity*** of word vectors. Cosine similarity is a widely used measure of semantic similarity of words. Its value ranges from -1 to 1, though it is usually used in the non-negative space [0, 1] where 0 means absence of similarity and 1 means extremely similar or identical.
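As a quick illustration of this value range, the sketch below computes cosine similarity for a few hand-made 2D vectors. The vectors are toy values, not real embeddings, and the helper name `cosine_similarity` is only illustrative.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors given as lists of numbers."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# vectors pointing in the same direction (different lengths do not matter)
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
# perpendicular vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])
# vectors pointing in opposite directions
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])

print(same_direction, orthogonal, opposite)
```

The three cases correspond to the upper end, the middle, and the lower end of the [-1, 1] range, respectively.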

In spaCy, we can use the built-in similarity function (which implements cosine similarity) to calculate word similarity based on their vectors:

In [None]:
snow = spacy_model.vocab['snow']
winter = spacy_model.vocab['winter']
summer = spacy_model.vocab['summer']


print(f"Similarity of snow and winter: {snow.similarity(winter)}")
print(f"Similarity of snow and summer: {snow.similarity(summer)}")
print(f"Similarity of winter and summer: {summer.similarity(winter)}")

In [None]:
spain = spacy_model.vocab['Spain']
portugal = spacy_model.vocab['Portugal']

print(f"Similarity of Spain and Portugal: {spain.similarity(portugal)}")
print(f"Similarity of Spain and snow: {spain.similarity(snow)}")

### Word analogy

Word embeddings are well known for their good performance on the analogies task, that is, modeling analogy relationships between words.
The most famous example is:<br>
_Man is to Woman as King is to Queen_ <br>

This can also be written as: <br>
Man - Woman = King - Queen

With some simple arithmetic, the above can also be written as follows: <br>
King + Woman – Man = Queen <br>

and can be interpreted as follows: adding the vectors associated with the words *king* and *woman* while subtracting *man* results in a vector that is the most similar to the vector for *queen*. In other words, by subtracting the concept of man from the concept of King we get a representation of the "royalty". Then, if we add the concept of woman, the concept we obtain is closest to the word "queen".
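To make the arithmetic concrete, here is a minimal sketch with made-up 3-dimensional vectors. The dimensions can loosely be read as "royalty", "femaleness", and "fruit"; all the values and the word "apple" are assumptions chosen for illustration, not actual embeddings.

```python
import math

# hypothetical toy vectors: [royalty, femaleness, fruit]
toy_vectors = {
    "king":  [0.9, 0.1, 0.0],
    "man":   [0.1, 0.1, 0.0],
    "woman": [0.1, 0.9, 0.0],
    "queen": [0.9, 0.9, 0.0],
    "apple": [0.0, 0.1, 0.9],
}

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# king - man + woman, computed element by element
result = [k - m + w for k, m, w in
          zip(toy_vectors["king"], toy_vectors["man"], toy_vectors["woman"])]

# compare the result only with words that were not used in the computation
candidates = {word: vec for word, vec in toy_vectors.items()
              if word not in ("king", "man", "woman")}
closest = max(candidates, key=lambda word: cosine_similarity(result, candidates[word]))

print(closest)
```

With real embeddings the result is rarely an exact match, which is why we look for the most similar vector rather than testing for equality.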

In [None]:
def get_vector(word:str):
  return spacy_model.vocab[word].vector

king_vec = get_vector('king')
queen_vec = get_vector('queen')
man_vec = get_vector('man')
woman_vec = get_vector('woman')

res_vec = king_vec - man_vec + woman_vec

Function for computing cosine similarity of two vectors
($\cos(A,B) = \frac{A \cdot B}{\|A\| \, \|B\|}$):

In [None]:
from numpy.linalg import norm

cosine = lambda v1, v2: np.dot(v1, v2) / (norm(v1) * norm(v2))

And another one for determining topK most similar words to the given word (i.e., its word vector):

In [None]:
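def most_similar_words(word_vec, words_to_exclude=None, topk=5):
    # build the exclusion set (stop words plus the given words) without
    # modifying the list passed in by the caller
    excluded = set(spacy.lang.en.stop_words.STOP_WORDS)
    if words_to_exclude:
        excluded.update(words_to_exclude)

    # iterate over the keys of the vectors table rather than over the vocabulary itself,
    # since the vocabulary holds only the lexemes that have been looked up so far
    lexemes = (spacy_model.vocab[key] for key in spacy_model.vocab.vectors.keys())

    # select all words eligible for a comparison
    all_words = [w for w in lexemes if
                w.has_vector and (w.text not in excluded) and w.is_lower and (len(w.text) > 3)]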

    # sort the selected (eligible) words by similarity
    candidates = sorted(all_words, key=lambda w: cosine(word_vec, w.vector), reverse=True)

    return candidates[:topk]

In [None]:
print('Similarity between queen and result:', cosine(res_vec, queen_vec))

In [None]:
print("Words most similar to the resulting vector:", [c.text for c in most_similar_words(res_vec, words_to_exclude=['king','man','woman'])])

We can try the same for <br>
*Paris is to France as Madrid is to ?* <br>
or <br>
Paris - France = Madrid - ?

In [None]:
paris_vec = get_vector('paris')
france_vec = get_vector('france')
madrid_vec = get_vector('madrid')
spain_vec = get_vector('spain')

res_vec_2 = madrid_vec - paris_vec + france_vec

print(f"Similarity of spain and the resulting vector: {cosine(res_vec_2, spain_vec)}")

In [None]:
print("Words most similar to the resulting vector:", [c.text for c in most_similar_words(res_vec_2, words_to_exclude=['france','madrid','paris'])])

You may want to try the same for:
* France - Paris + Rome =
* France - french + english =
* December - November + June =
* Man - Woman + Sister =

#### Visualization of Word Embeddings

Visualisation is often used to get a better understanding of how word vectors model relations among words.

Since word vectors are high dimensional - Word2Vec has 300 dimensions and more recent models have a few thousand dimensions - we cannot visualize them directly.
A typical approach to overcoming that challenge is to apply a dimensionality reduction algorithm and visualize the transformed data.
Next time, we will use [t-SNE](https://lvdmaaten.github.io/tsne/), a popular dimensionality reduction technique, to reduce the word vectors to 2D and explore the word relations to see if we can find some patterns visually.
<br>
For now, we will explore an interactive visualization of word embeddings, which can be found here:
[https://projector.tensorflow.org/](https://projector.tensorflow.org/)
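To give a feel for what "reducing to 2D" means in code, here is a minimal sketch. It uses PCA from sklearn as a simpler stand-in for t-SNE, and random vectors as a stand-in for real word embeddings; both are assumptions made only to keep the example small.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)

# stand-in for real word vectors: 50 random 300-dimensional vectors
toy_word_vectors = rng.normal(size=(50, 300))

# project the 300-dimensional vectors onto the 2 directions of largest variance
pca = PCA(n_components=2)
points_2d = pca.fit_transform(toy_word_vectors)

print(points_2d.shape)
```

Each row of `points_2d` holds the x and y coordinates of one word, so the rows can be drawn as a scatter plot with the words as point labels.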

## Use word embeddings for text classification

### Load and prepare the data

We use the mobile apps reviews dataset that was downloaded from the above-linked web page and stored locally.

In [None]:
from google.colab import files

data_files = files.upload()
data_file_path = list(data_files.keys())[0]

If running the code locally, comment out the code in the cell above and uncomment the code in the following cell:

In [None]:
# from pathlib import Path
#
# data_file_path = Path.cwd() / 'data' / 'mobile_apps_reviews.csv'

In [None]:
data = pd.read_csv(data_file_path)
data.head()

In [None]:
data.info()

In [None]:
data.category.unique()

We will re-code the outcome variable as in the previous classes, in particular:

* functional: BUG, FEATURE
* non-functional: PERFORMANCE, USABILITY, ENERGY, SECURITY
* irrelevant: OTHER


In [None]:
def category_to_cls(cat):
  if cat == "OTHER": return cat
  if cat in ['BUG', 'FEATURE']: return "FUNCTIONAL"
  return "NON-FUNCTIONAL"

data['cls'] = data.category.apply(category_to_cls)
data.cls.value_counts(normalize=True)

Add review length, as in the previous classes

In [None]:
data['review_len'] = data.review.apply(len)

In [None]:
data.head()

Keep the review, review length, and rating columns (as in the previous classes) and split the data set into training and test sets

In [None]:
X = data[['review','review_len','rating']]
y = data['cls']

# do label encoding, not to work with strings
y_encodings, y_levels = pd.factorize(y)

X_train, X_test, y_train, y_test = train_test_split(X, y_encodings, test_size = 0.2, random_state = RAND_STATE, stratify=y)

### Use word embeddings for document representation

We have a way to numerically represent each word - the embeddings provide a vector for every word in our vocabulary.<br>
However, to do the classification, we need to have a numerical representation of complete reviews (documents). So, for each review, we need a vector that would represent that review and that vector should be based on embeddings of the words in the review. There are several options to do that; here, we apply a very simple method: we average the embeddings of the words forming the review.<br>
Though simple, this method often works well for short texts such as the reviews we are working with.
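The averaging step can be sketched on a toy example as follows. The 3-dimensional "embeddings" and the function name `average_embedding` are hypothetical; the actual implementation used in this notebook is the `ReviewVectorizer` class defined further down.

```python
import numpy as np

# hypothetical 3-dimensional "embeddings" (the real ones have 300 dimensions)
toy_embeddings = {
    "app":     np.array([1.0, 0.0, 2.0]),
    "crashes": np.array([0.0, 3.0, 1.0]),
    "often":   np.array([2.0, 3.0, 0.0]),
}

def average_embedding(words, embeddings):
    """Mean of the vectors of the words that have an embedding; None if no word has one."""
    vectors = [embeddings[w] for w in words if w in embeddings]
    if not vectors:
        return None
    return np.mean(vectors, axis=0)

# "zzz" has no embedding and is therefore ignored
review_vec = average_embedding(["app", "crashes", "often", "zzz"], toy_embeddings)
print(review_vec)
```

Note that the resulting review vector has the same number of dimensions as the word vectors, regardless of how many words the review contains.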

#### Creating a custom transformer class

We will create a class - `ReviewVectorizer` - that transforms every review into a vector representation, as described above, using a pre-trained Word2Vec model (that is integrated in the spaCy NLP pipeline). This transformation is preceded by text 'cleaning' and reducing it to semantics-bearing words, so that only 'the most important' words are used for building a text (review) representation.

Note that the `ReviewVectorizer` class is defined as a subclass of sklearn's `BaseEstimator` and `TransformerMixin`. This is required to make our
transformer component compatible with sklearn tools such as `GridSearchCV` and `Pipeline`.

When creating a custom transformer by inheriting from `BaseEstimator` and `TransformerMixin`, we are primarily required to implement the following methods:
* `__init__(self, param1, param2, ...)`. The role of `__init__` is to initialize the transformer. **Important**: all hyperparameters of our custom transformer that can be set by the user must be defined as explicitly named input arguments of the `__init__` method (not collected through `*args` or `**kwargs`) and stored as instance attributes.
* `fit(self, X, y=None)`. The purpose of this method is to learn any necessary parameters from the training data, $X$; for a transformer, this usually involves calculating statistics like the mean or standard deviation. It must return self.
* `transform(self, X)`. The role of this method is to apply the transformation to the input data, $X$, using the parameters learned in `fit`. It must return the transformed data.


If you are interested in learning more, the sklearn documentation (in particular, [this page](https://scikit-learn.org/stable/developers/develop.html#rolling-your-own-estimator)) provides detailed information.
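Before turning to the review vectorizer, the three methods can be illustrated with a much smaller transformer. The class `MeanCenterer` and its `scale` hyperparameter are hypothetical and serve only to show the required structure.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class MeanCenterer(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: subtracts column means, then multiplies by a constant."""

    def __init__(self, scale=1.0):
        # hyperparameters are stored under the same name as the input argument
        self.scale = scale

    def fit(self, X, y=None):
        # learn the required statistics from the training data
        self.means_ = np.asarray(X, dtype=float).mean(axis=0)
        return self

    def transform(self, X):
        # apply the transformation using the parameters learned in fit
        return (np.asarray(X, dtype=float) - self.means_) * self.scale

X_toy = np.array([[1.0, 10.0], [3.0, 30.0]])
centerer = MeanCenterer(scale=2.0)

X_centered = centerer.fit_transform(X_toy)  # fit_transform is provided by TransformerMixin
print(X_centered)
print(centerer.get_params())                # get_params is provided by BaseEstimator
```

Because the hyperparameter is a named `__init__` argument, `get_params` can discover it, which is what allows tools such as `GridSearchCV` to set it.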



In [None]:
class ReviewVectorizer(BaseEstimator, TransformerMixin):

    def __init__(self, spacy_pipeline):

        # spacy language model (NLP pipeline) to be used
        self.spacy_pipeline = spacy_pipeline

        # dimensionality of vectors
        self.D = len(self.spacy_pipeline.vocab['word'].vector) if self.spacy_pipeline else None


    def spacy_tokenizer(self, doc):
        # Create token objects with linguistic annotations
        tokens = self.spacy_pipeline(doc)

        # remove punctuation, spaces, numbers, and stopwords
        tokens = [token for token in tokens if ((not token.is_punct) and (not token.like_num) and (not token.is_space) and (not token.is_stop))]

        # Lemmatize each token and convert each token into lowercase
        words = [token.lemma_.lower() for token in tokens]

        return words


    def fit(self, X, y=None):
        return self


    def transform(self, X):

        # dictionary of review words that could not be matched against the spacy model
        # key is the review ordinal number, value is a list of unmatched words
        unmatched_words = defaultdict(list)

        # indices of documents that could not be represented as vectors due to
        # their words not having vectors in the used word vector model
        dropped_indices = []

        # create a zero-matrix with the number of rows corresponding to the number of documents and
        # number of columns corresponding to the dimensions of the spacy's word vectors
        X_transformed = np.zeros((len(X), self.D))

        for i, text in enumerate(X):
            words = self.spacy_tokenizer(text)
            vecs = []
            for word in words:
                # vocabulary lookup does not raise an error for unknown words; it returns
                # a lexeme with an all-zero vector, so we check explicitly for a vector
                lexeme = self.spacy_pipeline.vocab[word]
                if lexeme.has_vector:
                    vecs.append(lexeme.vector)
                else:
                    unmatched_words[i].append(word)
                    print(f"No vector for word: {word}")
            if len(vecs) > 0:
                vecs = np.array(vecs)
                X_transformed[i] = vecs.mean(axis=0)
            else:
                dropped_indices.append(i)

        self.report_unmatched_words_stats(dropped_indices, unmatched_words, len(X))

        return X_transformed


    @staticmethod
    def report_unmatched_words_stats(dropped_docs_list, unmatched_words_dict, n_docs):
      from statistics import mean, stdev

      # counts are computed only over the reviews that have at least one unmatched word
      unmatched_counts = [len(unmatched) for unmatched in unmatched_words_dict.values()]
      if len(unmatched_counts) > 0:
        # stdev requires at least two data points
        sd = stdev(unmatched_counts) if len(unmatched_counts) > 1 else 0.0
        print(f"Average number of unmatched words per review (and SD): {mean(unmatched_counts):.4f} ({sd:.4f})")

      dropped_cnt = len(dropped_docs_list)
      if dropped_cnt > 0:
        print(f"Number and proportion of reviews with no words found: {dropped_cnt} ({dropped_cnt / n_docs:.4f})")


    # def fit_transform(self, X, y=None, **fit_params):
    #     return self.transform(X)


### Train classifiers on the document embeddings

Use the `ReviewVectorizer` class to transform reviews into vectors using a word embeddings model.

Use [`RobustScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html) to rescale the two additional features (using the median and inter-quartile range) so that all the features are in roughly the same value range.
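As a small illustration of this kind of rescaling, the sketch below applies the formula (value - median) / IQR by hand to a toy feature with one outlier and compares the result with the output of `RobustScaler` used with its default settings. The values are made up.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# toy values of a single feature (e.g., review length), including one outlier
lengths = np.array([[10.0], [20.0], [30.0], [40.0], [1000.0]])

# manual computation: subtract the median, divide by the inter-quartile range
median = np.median(lengths, axis=0)
q1, q3 = np.percentile(lengths, [25, 75], axis=0)
manual = (lengths - median) / (q3 - q1)

# the same transformation done by RobustScaler with default settings
scaled = RobustScaler().fit_transform(lengths)

print(manual.ravel())
print(scaled.ravel())
```

Since the median and the quartiles are barely affected by the outlier, the remaining values stay in a narrow range around zero.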

Then we can use the resulting feature set to build a classifier.

In [None]:
# Load models required for the ReviewVectorizer; if not already loaded
# spacy_model = spacy.load(SPACY_MODEL)

# Instantiate the review vectoriser
vectorizer = ReviewVectorizer(spacy_pipeline=spacy_model)

# Instantiate a transformer to rescale (normalise) the other two features (rating and review length)
normaliser = RobustScaler()

# Apply transformers to all columns
column_transformer = ColumnTransformer(
    transformers=[
        ('embeddings', vectorizer, 'review'),
        ('other_features', normaliser, ['review_len','rating']),
        ])


We will use the **Random Forest** classification method to build a classifier. The rationale: in a [comparative analysis](https://dl.acm.org/doi/10.5555/2627435.2697065) of 179 general-purpose classification algorithms on 121 different datasets, Random Forest was one of the top-performing algorithms.

For a quick introduction to Random Forest, watch [this video tutorial](https://www.youtube.com/watch?v=J4Wdy0Wc_xQ).

We are also using the [SMOTE resampling method](https://www.analyticsvidhya.com/blog/2020/10/overcoming-class-imbalance-using-smote-techniques/) to address the problem of class imbalance (recall that the three review categories (classes) are not equally represented). **SMOTE** stands for *Synthetic Minority Oversampling Technique*; it is a technique for increasing the size of a minority class (and thus balancing the class distribution) by synthesizing new instances of that class from the existing instances. In short, SMOTE works as follows:

A random instance from the minority class is first chosen. Then k of the nearest neighbors for that instance are found (typically k=5). A randomly selected neighbor is chosen and a synthetic instance is created at a randomly selected point between the two instances in feature space.
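The steps described above can be sketched with NumPy as follows. This is a simplified illustration and not the implementation used by `imblearn`; the function name `synthesize_instance` and the toy data are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# toy minority-class instances in a 2D feature space
minority = np.array([[1.0, 1.0], [1.5, 1.2], [2.0, 1.0], [1.2, 1.8], [5.0, 5.0]])

def synthesize_instance(X_minority, k, rng):
    """Create one synthetic instance from a random minority instance and one of its k nearest neighbours."""
    # 1) choose a random instance from the minority class
    base = X_minority[rng.integers(len(X_minority))]
    # 2) find its k nearest neighbours (position 0 is the instance itself,
    #    assuming there are no duplicate instances)
    distances = np.linalg.norm(X_minority - base, axis=1)
    neighbour_idx = np.argsort(distances)[1:k + 1]
    # 3) choose one of the neighbours at random
    neighbour = X_minority[rng.choice(neighbour_idx)]
    # 4) create a new point at a random position between the two instances
    gap = rng.random()
    return base + gap * (neighbour - base)

synthetic = np.array([synthesize_instance(minority, k=2, rng=rng) for _ in range(10)])
print(synthetic)
```

Every synthetic instance lies on the line segment between two existing minority instances, so the new instances stay within the region already occupied by the minority class.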

Finally, note that we are using `Pipeline` from `imblearn.pipeline` (not from `sklearn.pipeline` as in the previous classes), since this kind of pipeline properly handles class balancing methods (it applies them to the training set, but not to the test set).

In [None]:
rf_classifier = RandomForestClassifier(random_state=RAND_STATE, n_estimators=100)

smote = SMOTE(random_state=RAND_STATE)

rf_pipe = Pipeline(
    [
        ('vectorizer', column_transformer),
        ('resampling', smote),
        ('classifier', rf_classifier)
     ],
     verbose=True)

rf_pipe.fit(X_train, y_train)

#### Evaluate the Random Forest classifier

In [None]:
ConfusionMatrixDisplay.from_estimator(
        rf_pipe,
        X_test,
        y_test,
        display_labels=y_levels,
        cmap='Greens',
    );

In [None]:
rf_predictions = rf_pipe.predict(X_test)

perf_rf = metrics.classification_report(y_true = y_test, y_pred = rf_predictions)
print(perf_rf)

In [None]:
bacc_rf = metrics.balanced_accuracy_score(y_test, rf_predictions)
print(f"Balanced accuracy: {bacc_rf:0.4f}" )

This is not much better than the performance we've had with the BoW model.
Fine-tuning of RF might help improve the results.

We will explore if another classification algorithm - one often used in text classification - can lead to better performance.

### Create and evaluate an SVM classifier

This time, we will use a Support Vector Machines (SVM) classifier. This type of classifier has proven successful in text classification tasks; it is generally known to deal well with high dimensional data (as is the case here). It also works well with (relatively) small datasets.
For a quick introduction to SVM, watch this short [video tutorial](https://www.youtube.com/watch?v=N1vOgolbjSc).

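We will use the simplest version of this classifier, which is SVM with a linear kernel (that is, the classes are separated by a linear boundary in the original feature space, with no non-linear transformation of the features). Here, it is created with sklearn's [SVC](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html) class by setting `kernel='linear'`; sklearn's [LinearSVC](https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html) class is an alternative implementation of a linear SVM.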

In [None]:
svm_classifier = SVC(kernel='linear',
                     C=1, # default value of the hyper-parameter
                     random_state=RAND_STATE)

svm_pipe = Pipeline(
    [
        ('vectorizer', column_transformer),
        ('resampling', smote),
        ('classifier', svm_classifier)
     ],
     verbose=True
)

svm_pipe.fit(X_train, y_train)

Make predictions on the test set and compute evaluation measures

In [None]:
ConfusionMatrixDisplay.from_estimator(svm_pipe, X_test, y_test, display_labels=y_levels, cmap='Greens');

In [None]:
svm_predictions = svm_pipe.predict(X_test)

perf_svm = metrics.classification_report(y_test, svm_predictions)
print(perf_svm)

In [None]:
bacc_svm = metrics.balanced_accuracy_score(y_test, svm_predictions)
print(f"Balanced accuracy: {bacc_svm:0.4f}" )

### Improve the SVM model through hyper-parameter tuning

We can try to improve the classification model through hyper-parameter tuning. In particular, we'll examine a range of values for the hyper-parameter **C**. With this hyper-parameter, we specify how much we want to penalize misclassified points - the higher the **C** value, the higher the penalty, but also the higher the risk of overfitting. We will do the tuning through grid search over the specified range of parameter C values.<br>

Note: for GridSearchCV, hyper-parameter names have to be preceded by the name of the corresponding step as given in the pipeline, followed by a double underscore; in this case, the name of the step is `classifier`, so the hyper-parameter C gets the prefix `classifier__`.
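The naming rule can be tried out on a small, self-contained example. The sketch below uses synthetic data and a plain sklearn `Pipeline` with a `StandardScaler` step; these are assumptions made to keep the example short, and the step names are arbitrary labels.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# small synthetic dataset, used only for illustration
X_toy, y_toy = make_classification(n_samples=60, n_features=5, random_state=1)

toy_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', SVC(kernel='linear')),
])

# '<step name>__<parameter name>' addresses a parameter of a pipeline step
toy_grid = GridSearchCV(toy_pipe, param_grid={'classifier__C': [0.1, 1, 10]}, cv=3)
toy_grid.fit(X_toy, y_toy)

print(toy_grid.best_params_)
```

Calling `toy_pipe.get_params().keys()` lists all the parameter names that can be used in the grid.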

In [None]:
%%time

param_grid = dict(classifier__C=[0.01, 0.1, 1, 10, 100])
grid_search = GridSearchCV(svm_pipe,
                           param_grid=param_grid,
                           scoring='recall_macro', # we want to boost recall
                           verbose=2)
# grid_search.set_params(estimator__max_iter=5000)
grid_search.fit(X_train, y_train)

In [None]:
print('best parameters: ', grid_search.best_params_)
print('best scores: ', grid_search.best_score_)

Re-train the SVM classifier with the optimal C value

In [None]:
best_c = grid_search.best_params_['classifier__C']
svm_pipe.set_params(classifier__C=best_c)
svm_pipe.fit(X_train,y_train)

Evaluate the optimised model

In [None]:
ConfusionMatrixDisplay.from_estimator(svm_pipe, X_test, y_test, display_labels=y_levels, cmap='Greens');

In [None]:
svm_opt_predictions = svm_pipe.predict(X_test)

perf_svm_opt = metrics.classification_report(y_test, svm_opt_predictions)
print(perf_svm_opt)

In [None]:
bacc_svm_opt = metrics.balanced_accuracy_score(y_test, svm_opt_predictions)
print(f"Balanced accuracy: {bacc_svm_opt:0.4f}" )

### Compare all the models

In [None]:
def get_model_metrics(y_true, y_pred, model_name):

    # compute P, R, F1
    report = metrics.classification_report(y_true, y_pred, output_dict=True)

    # compute balanced accuracy
    b_acc = metrics.balanced_accuracy_score(y_true, y_pred)

    # extract metrics for both macro and weighted averages
    macro_avg = report['macro avg']
    weighted_avg = report['weighted avg']

    # Create the results dictionary
    results = {
        'Model': model_name,
        'Macro Precision': macro_avg['precision'],
        'Macro Recall': macro_avg['recall'],
        'Macro F1-Score': macro_avg['f1-score'],
        'Weighted Precision': weighted_avg['precision'],
        'Weighted Recall': weighted_avg['recall'],
        'Weighted F1-Score': weighted_avg['f1-score'],
        'Balanced Accuracy': b_acc
    }

    return results

Create also a dummy classifier to serve as the general baseline.

In [None]:
from sklearn.dummy import DummyClassifier

dummy_clf = DummyClassifier(strategy='stratified', random_state=RAND_STATE) # generates predictions by respecting the training set’s class distribution

dummy_clf.fit(X_train, y_train)
dummy_pred = dummy_clf.predict(X_test)

In [None]:
model_predictions = {
    'W2V_RF': rf_predictions,
    'W2V_LinSVM': svm_predictions,
    'W2V_OptLinSVM': svm_opt_predictions,
    'DUMMY': dummy_pred
}

all_results = list()

for model_name, model_pred in model_predictions.items():
  model_res = get_model_metrics(y_test, model_pred, model_name)
  all_results.append(model_res)

all_results_df = pd.DataFrame(all_results)
all_results_df.set_index("Model", inplace=True)
all_results_df = all_results_df.apply(lambda x: round(x, 4))

print("Model Performance Comparison:")
print(all_results_df)