# NLP Topic Modelling on PureGym Reviews

**Author:** Matt Cocker  
**Date:** [July 2025]  
**Project Type:** NLP | Topic Modelling | Customer Insight

---

## About this Notebook

This notebook applies topic modelling and emotion analysis to customer reviews from PureGym to uncover key drivers of member behaviour and sentiment. It supports business insight generation by identifying dominant themes and emotional tones in user feedback.

The notebook uses traditional LDA (Latent Dirichlet Allocation) for topic modelling, and its results will be compared with those in a separate notebook. Preprocessing is performed in the separate notebook, and the processed data is loaded from `.pkl` files for efficiency.

**Key Tasks:**
- Load cleaned review data
- Apply topic modelling
- Visualise topic distributions

*Note: The original dataset used in this notebook is under NDA and not accessible.  Any data loading or Drive mounting steps have been replaced with placeholders.*

# Imports

We start by force-reinstalling pinned versions of the required libraries that are compatible with Gensim.

In [None]:
!pip install -q --force-reinstall --no-cache-dir numpy==1.24.4 gensim==4.3.2 pyLDAvis==3.3.1

  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
  error: subprocess-exited-with-error

  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> See above for output.

Now we'll import the necessary libraries.

In [None]:
# --- Standard libraries
import os
import pickle

# --- Analysis libraries
import numpy as np
import nltk
from nltk.stem import WordNetLemmatizer
from gensim import corpora
from gensim.models import LdaModel
import pyLDAvis
import pyLDAvis.gensim_models

# --- Data & NLTK setup
# from google.colab import drive
# drive.mount('/content/drive')
# Drive access removed for this public version – data loading uses placeholder below

nltk.download('wordnet', quiet=True)
nltk.download('omw-1.4', quiet=True)

# Importing Data

We load the list of negative reviews, built from the merged datasets in the part 1 notebook, using pickle.

In [None]:
# ⚠️ NDA Notice:
# The original data used in this project is proprietary and cannot be shared.
# Below is an example placeholder to illustrate where data would be loaded.

# with open("data/merged_neg_reviews.pkl", "rb") as f:
#     merged_neg_reviews = pickle.load(f)

print("Data loading step skipped due to confidentiality.")


Merged Dataset of Negative Reviews:
[['many', 'students', 'two', 'local', 'colleges', 'go', 'leave', 'rubbish', 'changing', 'rooms', 'sit', 'like', 'canteen', 'going', 'years', 'cancel', 'membership', 'go', 'group', 'disgusting', 'students', 'hanging', 'around', 'machines', 'messing', 'around', 'like', 'school', 'crowded', 'ceo', 'supports', 'genocide', 'civilians', 'israel', 'disgusting', 'people'], ['current', 'member', 'quite', 'dirty', 'often', 'theres', 'no', 'soap', 'bathroom', 'zero', 'airflow', 'like', 'sauna', 'also', 'often', 'overcrowded', 'anytime', 'pm', 'good', 'thing', 'location', 'bring', 'buddy', 'thing'], ['way', 'hot', 'even', 'workout', 'no', 'windows', 'open', 'ac', 'barely', 'works', 'staff', 'no', 'near', 'friendly', 'always', 'rude', 'especially', 'men', 'clients', 'mean', 'work'], ['no', 'access', 'wc', 'empty', 'no', 'assistance', 'gain', 'access', 'fault', 'forgot', 'pin', 'didnt', 'see', 'stay', 'enable', 'assistance'], ['year', 'finally', 'leaving', 'gutte
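Since the loading cell is only a placeholder here, the following is a minimal sketch of how such a loader might look, assuming the pickle holds a list of token lists as in the output above. The helper name `load_tokenised_reviews` is hypothetical and not part of the original notebook.

```python
import pickle
from pathlib import Path


def load_tokenised_reviews(path):
    """Load a pickled list of token lists and check its basic shape.

    Hypothetical helper: only unpickle files from a trusted source,
    because pickle can execute arbitrary code while loading.
    """
    with Path(path).open("rb") as f:
        reviews = pickle.load(f)

    if not isinstance(reviews, list):
        raise TypeError("expected a list of reviews")
    for i, tokens in enumerate(reviews):
        is_token_list = isinstance(tokens, list) and all(
            isinstance(tok, str) for tok in tokens
        )
        if not is_token_list:
            raise TypeError(f"review {i} is not a list of string tokens")
    return reviews
```

With access to the data, this could be called as `merged_neg_reviews = load_tokenised_reviews("data/merged_neg_reviews.pkl")`, which fails early if the file does not have the shape the later cells expect.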

# Preprocessing

The earlier preprocessing already lower-cased and tokenised the text and removed punctuation, numbers and stopwords, so the only step left here is to lemmatise the cleaned reviews.

In [None]:
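# Lemmatise the cleaned reviews - we will use the negative reviews from the Top 30 Locations

# Set up the lemmatiser
lemmatizer = WordNetLemmatizer()

def lemmatise_text(text):
    """Lemmatise each token in a tokenised review."""
    # lemmatize() treats every token as a noun by default (pos='n'),
    # so most verb forms (e.g. 'going') are left unchanged.
    return [lemmatizer.lemmatize(token) for token in text]

# merged_neg_reviews is the list of token lists from the data loading step
neg_merged_reviews_lemma = [lemmatise_text(tokens) for tokens in merged_neg_reviews]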

In [None]:
# Create a dictionary representation of the documents.
dictionary = corpora.Dictionary(neg_merged_reviews_lemma)

# Filter out words that occur in fewer than 2 documents or in more than 50% of the documents.
dictionary.filter_extremes(no_below=2, no_above=0.5)

# Create corpus (bag-of-words)
corpus = [dictionary.doc2bow(text) for text in neg_merged_reviews_lemma]
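To make the filtering and bag-of-words steps concrete, here is a small pure-Python sketch of the same idea. It is not Gensim's implementation: the helper names `build_vocab` and `to_bow` and the toy documents are invented for illustration, and Gensim assigns token ids differently.

```python
from collections import Counter


def build_vocab(docs, no_below=2, no_above=0.5):
    """Keep tokens whose document frequency lies within the given bounds."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each token once per document

    kept = sorted(
        tok for tok, df in doc_freq.items()
        if no_below <= df <= no_above * n_docs
    )
    return {tok: idx for idx, tok in enumerate(kept)}


def to_bow(doc, vocab):
    """Convert a token list to sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(vocab[tok] for tok in doc if tok in vocab)
    return sorted(counts.items())


# Invented toy documents, not taken from the review data
toy_docs = [
    ["shower", "cold", "shower"],
    ["shower", "dirty", "floor"],
    ["machine", "broken", "dirty"],
    ["machine", "broken", "staff"],
]

vocab = build_vocab(toy_docs, no_below=2, no_above=0.5)
bows = [to_bow(doc, vocab) for doc in toy_docs]
print(vocab)
print(bows)
```

Tokens that appear in only one toy document ("cold", "floor", "staff") are dropped, and each remaining document becomes a list of id and count pairs, which is the shape of input that the LDA model consumes.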

# Modelling

Now we'll build the LDA model, setting it to generate 10 topics.

In [None]:
# Instantiate the LDA model with num_topics = 10

lda_model = LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=10,
    random_state=19,
    passes=20,
    alpha='auto',   # Learn an asymmetric document-topic prior from the data instead of using the default symmetric one
    per_word_topics=True
)

# Visualisation

Now we'll visualise the output using pyLDAvis.

In [None]:
# Prepare visualisation
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim_models.prepare(lda_model, corpus, dictionary)

# Display it
vis

With only 10 topics, some topics come out generic, with no clear theme. The inter-topic distances are also sometimes counterintuitive: Topic 5 (cleanliness) overlaps with Topic 1 (equipment maintenance) but sits far from Topic 8 (cold showers). Topic 8 is itself a mixed topic, grouping cold showers together with parking.
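For context on those distances: by default pyLDAvis positions topics using the Jensen-Shannon divergence between their topic-term distributions, projected to two dimensions. The sketch below computes that divergence in plain Python. The three distributions are invented toy values over a five-word vocabulary, not output from the model above.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded above by ln(2)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Invented topic-term distributions (each sums to 1)
cleanliness = [0.4, 0.3, 0.2, 0.1, 0.0]
equipment = [0.3, 0.3, 0.2, 0.1, 0.1]
showers = [0.0, 0.0, 0.1, 0.2, 0.7]

near = js_divergence(cleanliness, equipment)
far = js_divergence(cleanliness, showers)
print(f"cleanliness vs equipment: {near:.3f}")
print(f"cleanliness vs showers:   {far:.3f}")
```

Topics that share much of their high-probability vocabulary get a small divergence and are drawn close together, whatever their themes look like to a human reader. For a trained Gensim model, the rows of `lda_model.get_topics()` could be passed to `js_divergence` in the same way.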

Restricting the corpus to a more specific subset of negative reviews, such as angry reviews, may produce more distinct clusters with clearer themes. Further tuning of the LDA model may also improve the visualisation.
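One common way to guide such tuning is to score each topic's top words with a coherence measure and compare settings such as `num_topics`. Below is a minimal pure-Python sketch of a UMass-style coherence score, using invented toy documents. It illustrates the idea only; Gensim's own `CoherenceModel` would be the more likely choice in practice, and its aggregation differs from this sketch.

```python
import math
from itertools import combinations


def umass_coherence(top_words, docs):
    """UMass-style coherence for one topic.

    top_words must be ordered from most to least probable in the topic.
    Higher (less negative) scores mean the words co-occur more often.
    """
    doc_sets = [set(doc) for doc in docs]

    def doc_count(*words):
        # Number of documents containing all of the given words
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    for i, j in combinations(range(len(top_words)), 2):
        higher, lower = top_words[i], top_words[j]  # i < j, so 'higher' ranks first
        denom = doc_count(higher)
        if denom == 0:
            raise ValueError(f"'{higher}' does not occur in any document")
        score += math.log((doc_count(higher, lower) + 1) / denom)
    return score


# Invented toy documents, not taken from the review data
toy_docs = [
    ["shower", "cold", "shower"],
    ["shower", "dirty", "floor"],
    ["machine", "broken", "dirty"],
    ["machine", "broken", "staff"],
]

coherent = umass_coherence(["machine", "broken"], toy_docs)
mixed = umass_coherence(["machine", "cold"], toy_docs)
print(f"machine + broken: {coherent:.3f}")
print(f"machine + cold:   {mixed:.3f}")
```

A topic whose top words rarely appear in the same reviews, like the cold showers and parking mix noted above, would be expected to score lower than a topic with a single clear theme.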