![wordcloud](wordcloud.png)

As a data scientist at a mobile app company, you regularly use product analytics to understand user behavior, uncover patterns, and identify which features work well and which do not. Recently, the number of negative reviews on Google Play has increased, and the app's rating has dropped as a result. The team has asked you to analyze the situation and make sense of the negative reviews.

It's up to you to apply K-means clustering from scikit-learn and NLP techniques through NLTK to sort text data from negative reviews in the Google Play Store into categories!

## The Data

A dataset has been shared with a sample of reviews and their respective scores (from 1 to 5) in the Google Play Store. A summary and preview are provided below.

### reviews.csv

| Column     | Description              |
|------------|--------------------------|
| `'content'` | Content (text) of each review. |
| `'score'` | Score assigned to the review by the user as an integer (from 1 to 5). |

In [101]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

In [102]:
# Download necessary files from NLTK:
# punkt -> Tokenization
# stopwords -> Stop words removal
nltk.download("punkt")
nltk.download("stopwords")

[nltk_data] Downloading package punkt to /home/repl/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /home/repl/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [103]:
# Load the reviews dataset and preview it
reviews = pd.read_csv("reviews.csv")
reviews.head()

Unnamed: 0,content,score
0,I cannot open the app anymore,1
1,I have been begging for a refund from this app...,1
2,Very costly for the premium version (approx In...,1
3,"Used to keep me organized, but all the 2020 UP...",1
4,Dan Birthday Oct 28,1


In [104]:
# Count how many reviews received each score
reviews['score'].value_counts()


5    2879
4    2775
1    2506
2    2344
3    1991
Name: score, dtype: int64

In [105]:
from nltk.stem import PorterStemmer

# Keep only the negative reviews (scores of 1 or 2)
negative_reviews = reviews[reviews['score'].isin([1, 2])]

stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    """Lowercase, tokenize, drop non-alphabetic tokens and stop words, then stem."""
    tokens = [token for token in word_tokenize(text.lower()) if token.isalpha()]
    filtered_stemmed_tokens = [stemmer.stem(token) for token in tokens if token not in stop_words]
    return ' '.join(filtered_stemmed_tokens)

# Store the cleaned text in a new DataFrame that keeps the original index
preprocessed_reviews = pd.DataFrame({
    'content': negative_reviews['content'].apply(preprocess_text)
})
preprocessed_reviews.head()

Unnamed: 0,content
0,open app anymor
1,beg refund app month nobodi repli
2,costli premium version approx indian rupe per ...
3,use keep organ updat made mess thing cud u lea...
4,dan birthday oct


In [106]:
# Vectorize the preprocessed reviews with TF-IDF (rows = reviews, columns = terms)
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(preprocessed_reviews['content'])
print(tfidf_matrix.shape)

(4850, 4986)
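
To make the weighting behind `TfidfVectorizer` more concrete, the sketch below recomputes TF-IDF by hand for a toy corpus. It assumes scikit-learn's default settings (raw term counts, smoothed IDF of the form `ln((1 + n) / (1 + df)) + 1`, and L2-normalized rows) and simple whitespace tokenization. The three-document corpus and the `tfidf_rows` helper are made up for illustration and are not part of the notebook.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Hand-rolled TF-IDF, assuming scikit-learn's default settings."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    rows = []
    for tokens in tokenized:
        # Term frequency times IDF for every term in this document
        weights = {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        # L2-normalize the row; fall back to 1.0 for an empty document
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({term: w / norm for term, w in weights.items()})
    return rows

# Toy corpus of already-preprocessed reviews (hypothetical)
docs = ["app crash open", "app remind late", "calendar sync app"]
rows = tfidf_rows(docs)
print(rows[0])
```

In this toy corpus, `app` appears in every document, so it receives the smallest weight in each row, while rarer terms such as `crash` are weighted more heavily.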


In [107]:
# Cluster the TF-IDF vectors into 5 groups and attach each review's cluster label
km = KMeans(n_clusters=5)
km.fit(tfidf_matrix)
categories = km.labels_.tolist()
preprocessed_reviews['cluster'] = categories
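
K-means starts from randomly chosen centroids, so cluster labels can differ from one run to the next. A minimal sketch, assuming repeatable results are wanted: fix `random_state` and set `n_init` explicitly. The toy corpus, the `fit_labels` helper, and the seed value 42 are arbitrary choices for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus of already-preprocessed reviews (hypothetical)
toy_reviews = [
    "app crash open",
    "app crash start",
    "remind late notif",
    "remind miss notif",
]
toy_matrix = TfidfVectorizer().fit_transform(toy_reviews)

def fit_labels(matrix, seed):
    """Fit K-means with a fixed seed and return the cluster labels as a list."""
    model = KMeans(n_clusters=2, n_init=10, random_state=seed)
    return model.fit_predict(matrix).tolist()

# Two independent fits with the same seed should produce identical labels
first_run = fit_labels(toy_matrix, seed=42)
second_run = fit_labels(toy_matrix, seed=42)
print(first_run == second_run)
```

The cluster numbers themselves are still arbitrary; a fixed seed only makes the same numbering come out on every run.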

In [108]:
from collections import Counter

def most_freq_term(texts):
    """Return the most frequent term across a group of reviews and its count."""
    term_count = Counter(' '.join(texts).split())
    return term_count.most_common(1)[0]

# One row per cluster: the most frequent stemmed term and how often it occurs
rows = []
for cluster, texts in preprocessed_reviews.groupby('cluster')['content']:
    term, count = most_freq_term(texts)
    rows.append({'cluster': cluster, 'most_frequent_term': term, 'term_count': count})

topic_terms = pd.DataFrame(rows, columns=['cluster', 'most_frequent_term', 'term_count'])

print(topic_terms)

   cluster most_frequent_term  term_count
0        0                app        2159
1        1                app         573
2        2               good          95
3        3           calendar         486
4        4             remind         456
