## Summary of the Method and Output

An LDA topic model with 10 topics was fitted to 126,724 Google Play reviews of the Bumble dating app. Each topic groups reviews around a recurring theme or issue, such as technical issues, the dating experience, the user interface, security, or account management. The product team can use these themes to prioritize the pain points that matter most for user experience and engagement on the platform.

These results suggest that user experience is a key area for improvement, in particular messaging, swiping and the user interface, and privacy and account management. The analysis also points to concerns about fraudulent activity and security, and to technical performance problems that get in the way of a smooth experience.

Overall, the topic model gives the product team a structured view of where to improve the app's design and functionality to better meet user needs and preferences.

## Data Preprocessing

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import string
import re
import contractions
from sklearn.feature_extraction.text import TfidfVectorizer
from textblob import TextBlob
import requests, time
from nltk.sentiment import SentimentIntensityAnalyzer


nltk.download('vader_lexicon')
nltk.download("stopwords")
nltk.download("wordnet")
nltk.download('punkt')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/ijeonghyeon/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/ijeonghyeon/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/ijeonghyeon/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/ijeonghyeon/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [2]:
df = pd.read_csv("bumble_google_play_reviews.csv")

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 126724 entries, 0 to 126723
Data columns (total 10 columns):
 #   Column                Non-Null Count   Dtype 
---  ------                --------------   ----- 
 0   reviewId              126724 non-null  object
 1   userName              126724 non-null  object
 2   userImage             126724 non-null  object
 3   content               126715 non-null  object
 4   score                 126724 non-null  int64 
 5   thumbsUpCount         126724 non-null  int64 
 6   reviewCreatedVersion  105867 non-null  object
 7   at                    126724 non-null  object
 8   replyContent          82340 non-null   object
 9   repliedAt             82340 non-null   object
dtypes: int64(2), object(8)
memory usage: 9.7+ MB


In [4]:
print(df.duplicated().sum())

0


In [5]:
print(df.isnull().sum())

reviewId                    0
userName                    0
userImage                   0
content                     9
score                       0
thumbsUpCount               0
reviewCreatedVersion    20857
at                          0
replyContent            44384
repliedAt               44384
dtype: int64
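
The null counts above show 9 reviews with no `content`. Because `text_preprocessing` calls `str(text)`, such a review would enter the corpus as the literal token `nan`. A minimal sketch of filtering them out first, written in plain Python so it runs without the dataset (the helper name `drop_missing_reviews` is hypothetical; in pandas, `df.dropna(subset=["content"])` does the same job):

```python
import math

def drop_missing_reviews(values):
    """Return only the entries that contain real review text."""
    kept = []
    for value in values:
        # A missing cell usually shows up as None or as float('nan')
        if value is None:
            continue
        if isinstance(value, float) and math.isnan(value):
            continue
        kept.append(value)
    return kept

sample = ["Great", float("nan"), "Why am i blocked?", None]
cleaned = drop_missing_reviews(sample)
print(cleaned)            # ['Great', 'Why am i blocked?']
print(str(float("nan")))  # nan (what str() turns a missing review into)
```

With only 9 missing rows out of 126,724 the effect on the topics is likely negligible, but the filter keeps a spurious `nan` token out of the dictionary.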


In [6]:
text = df["content"]

text[:10]

0                                                Great
1                                    Why am i blocked?
2                                            good work
3                                              cool ap
4                                                  Wow
5    I've been on it for a day and let me tell you ...
6                                              Useless
7    I am serious about dating, but even when Bumbl...
8    Just rating for the most godawful ads you have...
9    20 people like me apparently. been a week of d...
Name: content, dtype: object

In [7]:
# Negation cues. Punctuation is stripped before tokenizing, so "n't" never
# appears as a separate token; it is kept here only for completeness.
negation_cues = ["not", "n't", "never", "no", "none", "neither", "nor"]

# Build the stop-word set once instead of on every call
stop_words = set(stopwords.words('english'))

def text_preprocessing(text):
    # Cast to string (a missing review becomes "nan") and convert to lowercase
    text = str(text).lower()
    # Remove punctuation
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Tokenize the text
    text_tokens = nltk.word_tokenize(text)
    # Mark the token that follows a negation cue with a "NOT_" prefix
    negated = False
    for i, token in enumerate(text_tokens):
        if token in negation_cues:
            negated = True
        elif negated:
            text_tokens[i] = "NOT_" + token
            negated = False
    # Remove stop words (this also drops cues such as "not" and "no")
    filtered_text = [word for word in text_tokens if word not in stop_words]
    # Join the filtered words back into a string
    text = ' '.join(filtered_text)
    # Expand contractions. Apostrophes are already gone at this point, so only
    # forms the library recognizes without an apostrophe can be expanded.
    text = contractions.fix(text)
    return text

# Apply the preprocessing to every review
processed_text = pd.Series([text_preprocessing(review) for review in text])

lemmatizer = WordNetLemmatizer()

# Define a function that takes a sentence as input and returns a list of lemmas
def lemmatize_nltk(sentence):
    tokens = nltk.word_tokenize(sentence)
    # Perform part-of-speech tagging on the tokens 
    pos_tags = nltk.pos_tag(tokens)
    lemmas = []
    for token, tag in pos_tags:
        # Map the POS tag to the corresponding WordNet POS tag
        tag = get_wordnet_pos(tag)
        if tag:
            lemma = lemmatizer.lemmatize(token, tag)
        else:
            lemma = lemmatizer.lemmatize(token)
        lemmas.append(lemma)
    return lemmas

# Define a function that maps NLTK POS tags to WordNet POS tags
def get_wordnet_pos(tag):
    if tag.startswith('N'):
        return 'n'
    elif tag.startswith('V'):
        return 'v'
    elif tag.startswith('J'):
        return 'a'
    elif tag.startswith('R'):
        return 'r'
    else:
        return None

lemmatized_words = []
for sentence in processed_text:
    lemmas = lemmatize_nltk(sentence)
    lemmatized_words.append(lemmas)
    
# Convert the list of lemmatized words to a Series
lemmatised_text = pd.Series(lemmatized_words)

lemmatised_text.head()

lemmatised_corpus = []

for doc in lemmatised_text:
    sentence = " ".join(doc)
    lemmatised_corpus.append(sentence)

In [8]:
reviews = lemmatised_corpus
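
The negation step inside `text_preprocessing` is easier to follow in isolation. The sketch below reproduces the same rule (prefix the single token that follows a cue with `NOT_`) as a standalone function; `mark_negation` and `NEGATION_CUES` are illustrative names, not part of the notebook, and plain `str.split` stands in for the NLTK tokenizer:

```python
NEGATION_CUES = {"not", "n't", "never", "no", "none", "neither", "nor"}

def mark_negation(tokens):
    """Prefix the token that follows a negation cue with 'NOT_'."""
    marked = []
    negated = False
    for token in tokens:
        if token.lower() in NEGATION_CUES:
            # Remember the cue; the next non-cue token gets the prefix
            negated = True
            marked.append(token)
        elif negated:
            marked.append("NOT_" + token)
            negated = False
        else:
            marked.append(token)
    return marked

print(mark_negation("the app does not work at all".split()))
# ['the', 'app', 'does', 'not', 'NOT_work', 'at', 'all']
```

Only one token after each cue is marked, so "not very good" yields `NOT_very` rather than `NOT_good`. This is probably also where vocabulary entries such as `NOT_a` in the topic output come from.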

## Topic Modelling

In [9]:
from gensim.corpora import Dictionary

tokenized_corpus = [nltk.word_tokenize(doc) for doc in lemmatised_corpus]
dictionary = Dictionary(tokenized_corpus)
dictionary.filter_extremes(no_below=2, no_above=0.5)
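
To make the effect of `no_below=2` and `no_above=0.5` concrete, the following pure-Python sketch approximates the document-frequency rule: a token is kept only if it occurs in at least `no_below` documents and in no more than a fraction `no_above` of all documents. The function name and the toy documents are made up for illustration, and gensim's own implementation has further options (such as `keep_n`) that are not modelled here.

```python
from collections import Counter

def filter_by_document_frequency(tokenized_docs, no_below=2, no_above=0.5):
    """Return the set of tokens that survive a document-frequency filter."""
    num_docs = len(tokenized_docs)
    doc_freq = Counter()
    for doc in tokenized_docs:
        doc_freq.update(set(doc))  # count each token once per document
    max_docs = no_above * num_docs
    return {tok for tok, df in doc_freq.items() if no_below <= df <= max_docs}

docs = [
    ["app", "crash", "login"],
    ["app", "swipe", "crash"],
    ["app", "fake", "profile"],
    ["app", "profile", "great"],
]
print(sorted(filter_by_document_frequency(docs)))
# ['crash', 'profile']
```

Here `app` is dropped because it appears in every document, and `login`, `swipe`, `fake` and `great` are dropped because each appears in only one.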

In [10]:
from gensim.models import LdaModel

# create a bag of words corpus
bow_corpus = [dictionary.doc2bow(doc) for doc in tokenized_corpus]

# build the LDA model with 10 topics
num_topics = 10
lda_model = LdaModel(corpus=bow_corpus,
                     id2word=dictionary,
                     num_topics=num_topics,
                     random_state=42,
                     update_every=1,
                     chunksize=100,
                     passes=10,
                     alpha='auto',
                     per_word_topics=True)

# print the topics and their top 10 keywords
for idx, topic in lda_model.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx+1, topic))

Topic: 1 
Words: 0.141*"app" + 0.122*"not" + 0.068*"do" + 0.040*"message" + 0.036*"time" + 0.032*"work" + 0.031*"make" + 0.030*"first" + 0.024*"guy" + 0.023*"girl"
Topic: 2 
Words: 0.105*"woman" + 0.073*"great" + 0.069*"tinder" + 0.053*"well" + 0.053*"date" + 0.040*"concept" + 0.038*"love" + 0.032*"apps" + 0.031*"easy" + 0.029*"men"
Topic: 3 
Words: 0.129*"swipe" + 0.064*"right" + 0.048*"day" + 0.038*"show" + 0.036*"pretty" + 0.036*"much" + 0.026*"leave" + 0.025*"yet" + 0.019*"android" + 0.018*"already"
Topic: 4 
Words: 0.051*"would" + 0.037*"keep" + 0.035*"give" + 0.028*"every" + 0.026*"connection" + 0.026*"hour" + 0.025*"fix" + 0.024*"actually" + 0.024*"let" + 0.022*"thing"
Topic: 5 
Words: 0.171*"profile" + 0.089*"fake" + 0.071*"lot" + 0.043*"could" + 0.041*"picture" + 0.039*"sign" + 0.038*"though" + 0.025*"fun" + 0.024*"best" + 0.020*"location"
Topic: 6 
Words: 0.107*"account" + 0.105*"back" + 0.056*"delete" + 0.041*"bad" + 0.040*"change" + 0.036*"NOT_a" + 0.034*"anything" + 0.034*
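
The topic strings printed above are convenient to read but awkward to compare or plot. As a small sketch, a regular expression can turn one such string into `(word, weight)` pairs; `parse_topic_string` is a hypothetical helper, and the input below is copied from the first three terms of Topic 5:

```python
import re

# Matches one term of the form 0.171*"profile"
TERM_PATTERN = re.compile(r'([0-9.]+)\*"([^"]+)"')

def parse_topic_string(topic):
    """Turn '0.141*"app" + 0.122*"not"' into [('app', 0.141), ('not', 0.122)]."""
    return [(word, float(weight)) for weight, word in TERM_PATTERN.findall(topic)]

topic_5 = '0.171*"profile" + 0.089*"fake" + 0.071*"lot"'
print(parse_topic_string(topic_5))
# [('profile', 0.171), ('fake', 0.089), ('lot', 0.071)]
```

A trailing weight without a quoted word, as in a truncated output line, simply produces no pair.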

In [11]:
import pyLDAvis.gensim_models as gensimvis
import pyLDAvis

# create the visualization
lda_visualization = gensimvis.prepare(lda_model, bow_corpus, dictionary)

# display the visualization
pyLDAvis.display(lda_visualization)



### Additional analysis and insights for each topic

- Topic 1: The top keywords ("app", "not", "message", "work", "time", "first") suggest that users are experiencing issues with the app's messaging functionality, as well as other technical problems. This could be a critical issue for the business, as messaging is a core feature of most dating apps and technical issues could drive users away from the platform. Improving messaging and technical performance could be a key priority for the product team.


- Topic 2: The top keywords ("woman", "great", "tinder", "date", "concept", "love", "easy") suggest that users are discussing the dating experience itself, often in comparison with Tinder and other apps, and may be looking for specific features to support their dating goals. For example, they may be interested in features that make it easy to find and connect with other users, or to filter potential matches by certain criteria. Understanding user needs and preferences in this area could help the product team develop new features or improve existing ones.


- Topic 3: This topic suggests that users may be frustrated with the app's swiping functionality or other user interface elements. This could be a key area for improvement, as the app's user interface is a critical component of the user experience. The product team may need to conduct user testing or research to better understand user needs and preferences, and use this information to inform the design and development of the app's user interface.


- Topic 4: This topic suggests that users may be experiencing connection issues or other technical problems that are preventing them from using the app effectively. Addressing these issues could be a key priority for the product team, as they could significantly impact user retention and engagement. The team may need to conduct technical audits or engage with users to identify and address these issues.


- Topic 5: This topic suggests that users are concerned about fake profiles and other fraudulent activity on the platform. This could be a critical issue for the business, as users are likely to abandon the platform if they do not feel that their safety and privacy are being protected. The product team may need to develop new security features or improve existing ones to better address these concerns.


- Topic 6: This topic suggests that users may be concerned about account management and privacy issues. The product team may need to improve the app's privacy settings or develop new features to help users better manage their accounts and data. This could be a critical area of focus for the business, as privacy concerns are a key driver of user trust and engagement.


- Topic 7: This topic suggests that users may be interested in social media integration or other advanced features that enhance the user experience. The product team may need to conduct user research or testing to better understand user needs and preferences in this area, and use this information to inform the design and development of new features.


- Topic 8: This topic suggests that users have general feedback or comments about the app, and may be interested in a range of different features or functionality. The product team may need to engage with users to better understand their needs and preferences, and use this information to guide product development and optimization.


- Topic 9: This topic suggests that users are generally satisfied with the app's performance and features, but may have minor complaints or issues. The product team may need to prioritize bug fixes or other optimizations to ensure that the app continues to meet user expectations and remains competitive in the market.


- Topic 10: This topic suggests that users may be experiencing problems with app updates or other technical issues. The product team may need to prioritize technical support or other resources to help users resolve these problems and use the app effectively. Improving user support could be critical for retention and engagement.
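
To turn the interpretations above into priorities, it helps to know how many reviews each topic dominates. The sketch below assumes each review's topic distribution is available as a list of `(topic_id, probability)` pairs, which is the shape gensim's `LdaModel.get_document_topics` returns by default; the helper names and the example distributions are hypothetical and are not results from this dataset.

```python
from collections import Counter

def dominant_topic(doc_topics):
    """Return the topic id with the highest probability for one document."""
    return max(doc_topics, key=lambda pair: pair[1])[0]

def topic_shares(all_doc_topics):
    """Return the share of documents dominated by each topic."""
    counts = Counter(dominant_topic(d) for d in all_doc_topics if d)
    total = sum(counts.values())
    return {topic: count / total for topic, count in counts.items()}

# Hypothetical distributions for four reviews (topic ids are 0-based)
example = [
    [(0, 0.7), (4, 0.3)],
    [(4, 0.9), (5, 0.1)],
    [(0, 0.6), (2, 0.4)],
    [(5, 0.8), (0, 0.2)],
]
print(topic_shares(example))
# {0: 0.5, 4: 0.25, 5: 0.25}
```

Note that the printed topic list numbers topics from 1 (`idx+1`), so topic id 0 here corresponds to "Topic 1" in the list above.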