![wordcloud](wordcloud.png)

This notebook applies NLP techniques and K-means clustering to sort negative Google Play Store reviews into categories.

## The Data

The dataset contains a sample of Google Play Store reviews and their scores (from 1 to 5). A summary and preview are provided below.

### reviews.csv

| Column     | Description              |
|------------|--------------------------|
| `'content'` | Content (text) of each review. |
| `'score'` | Score assigned to the review by the user as an integer (from 1 to 5). |

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

In [2]:
# Download necessary files from NLTK:
# punkt -> Tokenization
# stopwords -> Stop words removal
nltk.download("punkt")
nltk.download("stopwords")

[nltk_data] Downloading package punkt to /home/repl/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /home/repl/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [3]:
# Load the reviews dataset and preview it
reviews = pd.read_csv("reviews.csv")
reviews.head()

|   | content | score |
|---|---------|-------|
| 0 | I cannot open the app anymore | 1 |
| 1 | I have been begging for a refund from this app... | 1 |
| 2 | Very costly for the premium version (approx In... | 1 |
| 3 | Used to keep me organized, but all the 2020 UP... | 1 |
| 4 | Dan Birthday Oct 28 | 1 |


In [5]:
# Filter for negative reviews (score of 1 or 2) before preprocessing
negative_reviews_tmp = reviews.loc[reviews["score"].isin([1, 2]), "content"]

print(negative_reviews_tmp)

0                            I cannot open the app anymore
1        I have been begging for a refund from this app...
2        Very costly for the premium version (approx In...
3        Used to keep me organized, but all the 2020 UP...
4                                      Dan Birthday Oct 28
                               ...                        
11940    I loved it until I realized that the very feat...
11941    Gave it a test run and tried out the notificat...
11942    Looks great but since installing, my device on...
11943    This app looked good until I had to purchase i...
11944                                             It's OK!
Name: content, Length: 4850, dtype: object


In [8]:
# Build the stop-word set once; calling stopwords.words() for every token is slow
english_stop_words = set(stopwords.words("english"))


def preprocess_text(text):
    # Tokenize the text
    tokens = word_tokenize(text)

    # Keep alphabetic tokens that are not stop words
    filtered_tokens = [
        token
        for token in tokens
        if token.isalpha() and token.lower() not in english_stop_words
    ]

    return " ".join(filtered_tokens)

negative_reviews_cleaned = negative_reviews_tmp.apply(preprocess_text)
negative_reviews_cleaned

0                                         open app anymore
1                 begging refund app month nobody replying
2        costly premium version approx Indian Rupees pe...
3        Used keep organized UPDATES made mess things c...
4                                         Dan Birthday Oct
                               ...                        
11940    loved realized feature got download first plac...
11941    Gave test run tried notifications hear thing A...
11942    Looks great since installing device lasts half...
11943    app looked good purchase get week view everyti...
11944                                                   OK
Name: content, Length: 4850, dtype: object
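
To make the filtering logic easier to inspect without the NLTK downloads, here is a minimal sketch of the same idea. It is not the notebook's method: the regex tokenizer stands in for `word_tokenize`, and `STOP_WORDS` is a tiny hypothetical list rather than NLTK's full English list, so results on real reviews will differ.

```python
import re

# Hypothetical, deliberately tiny stop-word set for illustration only;
# the notebook itself uses NLTK's stopwords.words("english").
STOP_WORDS = {"i", "have", "been", "for", "a", "from", "this", "the"}


def simple_preprocess(text, stop_words=STOP_WORDS):
    # Regex tokenizer (an assumption, standing in for word_tokenize):
    # keeps runs of ASCII letters and drops digits and punctuation.
    tokens = re.findall(r"[A-Za-z]+", text)

    # Case-insensitive stop-word check; the original casing is preserved
    return " ".join(t for t in tokens if t.lower() not in stop_words)


print(simple_preprocess("I have been begging for a refund from this app"))
# prints: begging refund app
```

Membership tests against a `set` are constant time on average, which matters when the check runs once per token across thousands of reviews.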

In [9]:

preprocessed_reviews = pd.DataFrame({"review": negative_reviews_cleaned})
preprocessed_reviews.head()


|   | review |
|---|--------|
| 0 | open app anymore |
| 1 | begging refund app month nobody replying |
| 2 | costly premium version approx Indian Rupees pe... |
| 3 | Used keep organized UPDATES made mess things c... |
| 4 | Dan Birthday Oct |


In [10]:

# Vectorize the cleaned reviews using TF-IDF
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(preprocessed_reviews["review"])



In [12]:

# Apply K-means clustering
clust_kmeans = KMeans(n_clusters=5, random_state=500)
pred_labels = clust_kmeans.fit_predict(tfidf_matrix)

# Store the predicted labels and display them
categories = pred_labels.tolist()
preprocessed_reviews["category"] = categories
preprocessed_reviews["category"]



0        0
1        0
2        3
3        2
4        2
        ..
11940    1
11941    2
11942    0
11943    0
11944    2
Name: category, Length: 4850, dtype: int64
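
The notebook fixes the number of clusters at five without showing how that value was chosen. One common sanity check, sketched here as an assumption rather than as part of the original analysis, is to compare silhouette scores across several values of k. The helper `silhouette_by_k` and the toy reviews are hypothetical, and `n_init=10` is set explicitly so the behaviour does not depend on the scikit-learn version's default.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score


def silhouette_by_k(matrix, k_values, random_state=500):
    """Fit K-means for each k and return {k: silhouette score}."""
    scores = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        labels = model.fit_predict(matrix)
        # Silhouette ranges from -1 to 1; higher means tighter,
        # better-separated clusters
        scores[k] = float(silhouette_score(matrix, labels))
    return scores


# Invented toy reviews (not taken from reviews.csv)
toy_reviews = [
    "app crashes every time open",
    "app crashes after update",
    "app keeps crashing startup",
    "crashes constantly since update",
    "premium version too expensive",
    "premium subscription costly",
    "refund premium purchase please",
    "expensive subscription refund",
]
toy_matrix = TfidfVectorizer().fit_transform(toy_reviews)
toy_scores = silhouette_by_k(toy_matrix, k_values=[2, 3])
print(toy_scores)
```

On the real data the same helper could be called as `silhouette_by_k(tfidf_matrix, range(2, 9))`. Silhouette scores on sparse TF-IDF vectors are often low in absolute terms, so the comparison between values of k is more informative than any single score.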

In [13]:


# Get the feature names (terms) from the vectorizer
terms = vectorizer.get_feature_names_out()

# Collect the top term for each cluster
topic_terms_list = []

for cluster in range(clust_kmeans.n_clusters):
    # Get indices of reviews in the current cluster
    cluster_indices = [i for i, label in enumerate(categories) if label == cluster]

    # Sum the tf-idf scores for each term in the cluster
    cluster_tfidf_sum = tfidf_matrix[cluster_indices].sum(axis=0)
    cluster_term_freq = np.asarray(cluster_tfidf_sum).ravel()

    # Get the index of the term with the highest summed TF-IDF weight
    top_term_index = cluster_term_freq.argsort()[::-1][0]

    # Append a record for the topic_terms DataFrame with three fields:
    # - category: label / cluster assigned from K-means
    # - term: the identified top term
    # - frequency: summed TF-IDF weight of the term within the category
    topic_terms_list.append(
        {
            "category": cluster,
            "term": terms[top_term_index],
            "frequency": cluster_term_freq[top_term_index],
        }
    )


topic_terms = pd.DataFrame(topic_terms_list)


print(topic_terms)


   category      term   frequency
0         0       app  186.525216
1         1   version   63.738669
2         2      good   52.935519
3         3   premium   55.750426
4         4  calendar   70.971649
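
A single top term such as `app` or `good` says little about what a cluster is about. The sketch below generalises the same summed-weight idea to the top `n_terms` per cluster. The function name, the toy weights, and the toy labels are all hypothetical. It expects a dense array, so a sparse TF-IDF matrix would need converting first (for example with `.toarray()`), or the sparse row-sum approach used in the cell above.

```python
import numpy as np


def top_terms_per_cluster(weights, labels, terms, n_terms=3):
    """Return {cluster: [(term, summed_weight), ...]}, highest weight first."""
    weights = np.asarray(weights)
    labels = np.asarray(labels)
    result = {}
    for cluster in np.unique(labels):
        # Sum each term's weight over the rows assigned to this cluster
        summed = weights[labels == cluster].sum(axis=0)
        # Indices of the n_terms largest summed weights, in descending order
        top = np.argsort(summed)[::-1][:n_terms]
        result[int(cluster)] = [(terms[i], float(summed[i])) for i in top]
    return result


# Invented toy data: 4 documents, 4 terms, 2 clusters
toy_terms = ["app", "crash", "premium", "refund"]
toy_weights = np.array(
    [
        [0.9, 0.4, 0.0, 0.0],
        [0.7, 0.6, 0.0, 0.1],
        [0.1, 0.0, 0.8, 0.5],
        [0.0, 0.0, 0.6, 0.7],
    ]
)
toy_labels = [0, 0, 1, 1]

toy_top = top_terms_per_cluster(toy_weights, toy_labels, toy_terms, n_terms=2)
for cluster, pairs in toy_top.items():
    print(cluster, [term for term, _ in pairs])
```

In the toy data, cluster 0 is led by `app` and `crash`, and cluster 1 by `premium` and `refund`.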
