![wordcloud](wordcloud.png)

As a Data Scientist working for a mobile app company, you regularly apply product analytics to understand user behavior, uncover patterns, and identify which features work well and which do not. Recently, the number of negative reviews on Google Play has increased, and the app's rating has been dropping as a result. The team has asked you to analyze the situation and make sense of the negative reviews.

It's up to you to apply K-means clustering from scikit-learn and NLP techniques through NLTK to sort text data from negative reviews in the Google Play Store into categories!

## The Data

A dataset has been shared with a sample of reviews and their respective scores (from 1 to 5) in the Google Play Store. A summary and preview are provided below.

### reviews.csv

| Column     | Description              |
|------------|--------------------------|
| `'content'` | Content (text) of each review. |
| `'score'` | Score assigned to the review by the user as an integer (from 1 to 5). |

In [6]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

In [7]:
# Download necessary files from NLTK:
# punkt -> Tokenization
# stopwords -> Stop words removal
nltk.download("punkt")
nltk.download("stopwords")

[nltk_data] Downloading package punkt to /home/repl/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /home/repl/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [8]:
reviews = pd.read_csv("reviews.csv")

print(reviews.head())

                                             content  score
0                      I cannot open the app anymore      1
1  I have been begging for a refund from this app...      1
2  Very costly for the premium version (approx In...      1
3  Used to keep me organized, but all the 2020 UP...      1
4                                Dan Birthday Oct 28      1


To reveal the main topics from app reviews, you'll perform these tasks:

Preprocess the negative reviews (reviews with a score of 1 or 2) by tokenizing the text and removing stop words and non-alphabetic tokens. Save the results in a pandas DataFrame called `preprocessed_reviews`.

In [9]:
negative_reviews_tmp = reviews[(reviews["score"] == 1) | (reviews["score"] == 2)]["content"]

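# Build the stop-word set once; a set lookup is much faster than
# calling stopwords.words("english") again for every token.
stop_words = set(stopwords.words("english"))


def preprocess_text(text):
    """Tokenize a review and drop stop words and non-alphabetic tokens."""
    tokens = word_tokenize(text)

    filtered_tokens = [
        token
        for token in tokens
        if token.isalpha() and token.lower() not in stop_words
    ]

    return " ".join(filtered_tokens)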


negative_reviews_cleaned = negative_reviews_tmp.apply(preprocess_text)

preprocessed_reviews = pd.DataFrame({"review": negative_reviews_cleaned})
print(preprocessed_reviews.head())

                                              review
0                                   open app anymore
1           begging refund app month nobody replying
2  costly premium version approx Indian Rupees pe...
3  Used keep organized UPDATES made mess things c...
4                                   Dan Birthday Oct
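
The cleaned output shows the effect of the `isalpha()` filter: "Dan Birthday Oct 28" loses the token "28". The snippet below is a minimal sketch of the same filtering logic. To run without NLTK data, it assumes a plain whitespace split instead of `word_tokenize` and a tiny hand-written stop-word set. `DEMO_STOP_WORDS` and `demo_preprocess` are hypothetical names used only for illustration.

```python
# Hypothetical stand-ins: a tiny stop-word set and a whitespace tokenizer,
# used only so the example runs without downloading NLTK data.
DEMO_STOP_WORDS = {"i", "the", "for", "a", "from", "this", "have", "been"}


def demo_preprocess(text):
    """Keep alphabetic tokens that are not in the demo stop-word set."""
    tokens = text.split()
    kept = [
        token
        for token in tokens
        if token.isalpha() and token.lower() not in DEMO_STOP_WORDS
    ]
    return " ".join(kept)


# "28" is not alphabetic, so it is dropped.
print(demo_preprocess("Dan Birthday Oct 28"))

# With a whitespace split, "times!!!" stays glued to its punctuation and is
# dropped as a whole, which is one reason to prefer a real tokenizer.
print(demo_preprocess("The app crashed 3 times!!!"))
```

Because a whitespace split does not separate punctuation from words, this stand-in is stricter than `word_tokenize` and its results on real reviews may differ from the output above.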


Vectorize the cleaned negative reviews using TF-IDF and store the matrix in a variable called `tfidf_matrix`.

In [10]:
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(preprocessed_reviews["review"])
print(tfidf_matrix)

  (0, 292)	0.7274120131461655
  (0, 316)	0.26084007427013295
  (0, 4041)	0.6346922236686016
  (1, 4954)	0.515334875276096
  (1, 3882)	0.47123926546563444
  (1, 3738)	0.32475682037507164
  (1, 4826)	0.3572537676591596
  (1, 585)	0.515334875276096
  (1, 316)	0.11703093798205348
  (2, 2135)	0.19146110029239458
  (2, 3357)	0.1435439718198807
  (2, 3632)	0.18717758003062893
  (2, 6431)	0.1140959667437744
  (2, 6607)	0.22676740734635442
  (2, 309)	0.3240018335942553
  (2, 1736)	0.20540435544097566
  (2, 625)	0.32463845282391557
  (2, 6766)	0.18457571967920094
  (2, 4240)	0.22676740734635442
  (2, 5140)	0.2902416713612349
  (2, 2942)	0.3090627238189522
  (2, 347)	0.3090627238189522
  (2, 6489)	0.2507216262520351
  (2, 4463)	0.26135698889158343
  (2, 1265)	0.2835241469753328
  :	:
  (4847, 5447)	0.15485015875850797
  (4847, 4437)	0.24552433334322854
  (4847, 1203)	0.2115800923452725
  (4847, 3847)	0.16388869773897538
  (4847, 5353)	0.20271133338548133
  (4847, 2996)	0.22543680593982587
  (4847
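
Each printed line has the form `(row, column)  weight`: the review index, the vocabulary index of a term, and that term's TF-IDF weight in the review. To make the weights less opaque, here is a minimal pure-Python sketch of the calculation on an invented toy corpus. It assumes `TfidfVectorizer`'s default settings as documented by scikit-learn (raw term counts, smoothed IDF, L2 row normalization) and is an illustration, not the library's implementation.

```python
import math
from collections import Counter

# Toy corpus (invented for illustration, not taken from reviews.csv).
docs = ["app crash crash", "app refund", "calendar sync app"]

tokenized = [doc.split() for doc in docs]
vocab = sorted({term for tokens in tokenized for term in tokens})
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term.
doc_freq = {term: sum(term in tokens for tokens in tokenized) for term in vocab}

# Smoothed IDF: ln((1 + n) / (1 + df)) + 1.
idf = {term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1 for term in vocab}


def tfidf_row(tokens):
    """Return the L2-normalized TF-IDF weights of one document."""
    counts = Counter(tokens)
    raw = {term: counts[term] * idf[term] for term in counts}
    norm = math.sqrt(sum(weight ** 2 for weight in raw.values()))
    return {term: weight / norm for term, weight in raw.items()}


# Print entries in the same (row, column) layout as the sparse matrix.
for doc_id, tokens in enumerate(tokenized):
    for term, weight in sorted(tfidf_row(tokens).items()):
        print((doc_id, vocab.index(term)), term, round(weight, 4))
```

In this toy corpus, "app" appears in every document and therefore gets the lowest IDF, which is why common words contribute little weight to any single review.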

Apply K-means clustering to `tfidf_matrix` to group the reviews into five categories. Store the predicted labels in a list called `categories`.

In [15]:
clust_kmeans = KMeans(n_clusters=5, random_state=500)
pred_labels = clust_kmeans.fit_predict(tfidf_matrix)

# Store the predicted labels in a list variable called categories
categories = pred_labels.tolist()
preprocessed_reviews["category"] = categories

print(categories[0:5])

[0, 0, 3, 2, 2]
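
Conceptually, K-means alternates between two steps: assign every point to its nearest centroid, then move each centroid to the mean of its assigned points. The sketch below shows that loop (Lloyd's algorithm) in NumPy on invented 2-D points. It is a simplified illustration: `lloyd_kmeans` is a hypothetical helper, it starts from randomly chosen data points, and it is not how scikit-learn's `KMeans` is implemented internally.

```python
import numpy as np


def lloyd_kmeans(X, n_clusters, n_iter=20, seed=0):
    """Plain Lloyd's algorithm: assign points, then recompute centroids."""
    rng = np.random.default_rng(seed)
    # Start from randomly chosen distinct data points as centroids.
    centroids = X[rng.choice(len(X), size=n_clusters, replace=False)]
    labels = None
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every centroid.
        distances = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = distances.argmin(axis=1)
        # Stop once the assignments no longer change.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        for k in range(n_clusters):
            members = X[labels == k]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[k] = members.mean(axis=0)
    return labels, centroids


# Two well-separated groups of toy points (invented for illustration).
X_demo = np.array([
    [0.0, 0.1], [0.1, 0.0], [0.0, 0.0],
    [5.0, 5.1], [5.1, 5.0], [5.0, 5.0],
])
demo_labels, demo_centroids = lloyd_kmeans(X_demo, n_clusters=2)
print(demo_labels.tolist())
```

The label numbers themselves are arbitrary: which group ends up as cluster 0 depends on the initialization, so the category IDs in `categories` only say which reviews belong together, not anything about their order or importance.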


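For each unique cluster label, find the most frequent term. Store the results in a pandas DataFrame called `topic_terms` with at least three columns to store the label assigned from K-means, the identified term, and its frequency. In the code below, "frequency" is measured as the sum of a term's TF-IDF weights over all reviews in the cluster, not as a raw word count.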

In [16]:
terms = vectorizer.get_feature_names_out()

topic_terms_list = []

for cluster in range(clust_kmeans.n_clusters):

    # Row positions of the reviews assigned to this cluster
    cluster_indices = [i for i, label in enumerate(categories) if label == cluster]

    # Sum the TF-IDF weights of each term over the cluster's reviews
    cluster_tfidf_sum = tfidf_matrix[cluster_indices].sum(axis=0)
    cluster_term_freq = np.asarray(cluster_tfidf_sum).ravel()

    # Index of the term with the largest summed weight
    top_term_index = cluster_term_freq.argmax()

    topic_terms_list.append(
        {
            "category": cluster,
            "term": terms[top_term_index],
            "frequency": cluster_term_freq[top_term_index],
        }
    )

topic_terms = pd.DataFrame(topic_terms_list)

print(topic_terms)

   category      term   frequency
0         0       app  186.525216
1         1   version   63.738669
2         2      good   52.935519
3         3   premium   55.750426
4         4  calendar   70.971649
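
A single top term can be too generic to describe a cluster (for example, "app" in category 0). One possible extension is to look at several top terms per cluster. The sketch below does this on a small invented dense matrix. `top_terms` and all `demo_*` names are hypothetical. With the real sparse `tfidf_matrix`, the summed row would first need `np.asarray(...).ravel()`, as in the cell above.

```python
import numpy as np

# Toy stand-ins (invented): 4 documents x 3 terms of TF-IDF-like weights.
demo_terms = np.array(["app", "premium", "calendar"])
demo_matrix = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.0, 0.2],
    [0.1, 0.7, 0.6],
    [0.0, 0.9, 0.3],
])
demo_labels = np.array([0, 0, 1, 1])


def top_terms(matrix, labels, terms, cluster, n=2):
    """Return the n terms with the largest summed weight in one cluster."""
    summed = matrix[labels == cluster].sum(axis=0)
    # argsort is ascending, so reverse it and keep the first n indices.
    order = np.argsort(summed)[::-1][:n]
    return [(str(terms[i]), float(summed[i])) for i in order]


for cluster in np.unique(demo_labels):
    print(cluster, top_terms(demo_matrix, demo_labels, demo_terms, cluster))
```

Reading two or three terms together usually gives a clearer idea of what a cluster is about than the top term alone.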
