In this notebook, we use an extractive algorithm (PageRank) to summarize the reviews associated to a cluster we used in the LDA notebook.

In [1]:
import pandas as pd
import nltk
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import networkx as nx

nltk.download("punkt")

[nltk_data] Downloading package punkt to /home/ivo/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [2]:
df = pd.read_csv("../data/clusterized_dataframe.csv")

In [3]:
cluster = df[df["cluster_num"]==5]

In [4]:
cluster_reviews_joined = " ".join(cluster["Reviews"])
# some reviews may contain important sentences.
sentences = cluster_reviews_joined.split(".")
# we could also have used the embeddings created by sentence transformers, instead of
# CountVectorizer.
vectorizer = CountVectorizer(stop_words="english")
sentence_vectors = vectorizer.fit_transform(cluster["Reviews"])
similarity_matrix = cosine_similarity(sentence_vectors)

In [5]:
graph = nx.from_numpy_array(similarity_matrix)

In [6]:
scores = nx.pagerank(graph)

In [7]:
num_sentences = 5
top_sentence_indices = sorted(
    range(len(scores)), 
    key=lambda i: scores[i], 
    reverse=True
)[:num_sentences]

summary = [sentences[i] for i in top_sentence_indices]

In [8]:
summary

[' This oil is great, and keeps my car running happy',
 ' These are great, everyone should have them in the toolbox',
 ' Great, versatile oil filter',
 ' I hope I can easily get the filter off to do my own changes',
 "  It works only to loosen, not to tighten - and of course you'd never want to tighten an oil filter with a tool anyway"]

It seems in this cluster we have reviews for an oil filter, and for a brand of oil, and also for something to tighten an oil filter.
The general sentiment seems to be positive.