# Analysis of product reviews

## The goals

I wanted to learn about LangChain and came up with this little project to actually use the library for a somewhat realistic use case.

The goal of the project is to automatically extract useful insights from customer reviews on a single product.

My learning goals are:
- basic use of LangChain
  - structured output (simple tool use)
  - extraction (of information from raw text)
  - clustering
- interactive plot for visualization
- hosting of the plot (while maintaining interactivity)


## The roadmap

The roadmap I came up with to achieve the above goals:

1. Choose a product on Amazon
2. Gather some negative reviews (ideally >20)
3. Extract critique points from each review
4. Cluster all extracted critique points into broader categories
5. Show the number of critique points for each cluster over time
6. Host the dynamic plot

## Let's go

### Step 1: Choose a product

Simple enough. Open the Amazon front page and click on the first item you see. I won't share which product this turned out to be.

### Step 2: Gather reviews

This was the first roadblock. Amazon really doesn't want you to scrape their pages automatically (understandably). For any real business case one should definitely go through an official API. But for a quick toy project? I ended up opening the first 3 review pages manually in a browser (filtered for negative reviews), hit CTRL+S and went from there. Each page contains 10 reviews, so I ended up with 30 negative reviews buried deep inside a vast corpus of HTML. Fortunately, HTML parsing is easily done with `beautifulsoup`.

After parsing and some post-processing I ended up with this kind of structure for each review:

In [13]:
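import ast
from pathlib import Path


# The file holds a Python dict literal (single-quoted strings), so parse it
# directly; swapping quotes to get JSON breaks on any apostrophe in the text.
ast.literal_eval(Path('ReviewFetcher/raw_reviews/review_0.txt').read_text())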

{'title': '3,0 von 5 Sternen\nToll, aber!',
 'text': 'Meine Kinder lieben den Würfel, leider ist er vermutlich nicht dafür gebaut, wirklich lange und ständig verwendet zu werden. Die geklebte Folie löst sich nun ab.',
 'rating': '3,0 von 5 Sternen',
 'date': '25. Mai 2024'}
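The raw fields are still in Amazon's German display format, while the example call further down uses a clean title and an ISO date. A minimal sketch of how that post-processing might look, assuming the field formats shown above; the helper names `clean_title`, `parse_rating` and `parse_german_date` are made up for illustration and are not the project's actual code:

```python
import re
from datetime import date

# German month names as they appear in the scraped date strings.
GERMAN_MONTHS = {
    "Januar": 1, "Februar": 2, "März": 3, "April": 4, "Mai": 5, "Juni": 6,
    "Juli": 7, "August": 8, "September": 9, "Oktober": 10, "November": 11,
    "Dezember": 12,
}


def clean_title(title: str) -> str:
    """Drop the rating line that precedes the actual title, if present."""
    return title.split("\n", 1)[-1].strip()


def parse_rating(text: str) -> float:
    """Turn a string like '3,0 von 5 Sternen' into the float 3.0."""
    match = re.match(r"(\d+),(\d+) von 5 Sternen", text.strip())
    if match is None:
        raise ValueError(f"Unrecognised rating format: {text!r}")
    return float(f"{match.group(1)}.{match.group(2)}")


def parse_german_date(text: str) -> date:
    """Turn a string like '25. Mai 2024' into a date object."""
    match = re.fullmatch(r"(\d{1,2})\.\s+(\w+)\s+(\d{4})", text.strip())
    if match is None:
        raise ValueError(f"Unrecognised date format: {text!r}")
    day, month_name, year = match.groups()
    return date(int(year), GERMAN_MONTHS[month_name], int(day))


print(clean_title("3,0 von 5 Sternen\nToll, aber!"))
print(parse_rating("3,0 von 5 Sternen"))
print(parse_german_date("25. Mai 2024").isoformat())
```

Parsing the month by name avoids depending on a German locale being installed on the machine.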

### Step 3: Extract critique points

Now for the fun part: I get to use LLMs. When I first tried this, right when GPT-3 became available via API, I had lots of trouble convincing it to output valid JSON. And only valid JSON, no "certainly, here is ..." prefix. Nowadays this seems to be solved, and elegantly at that. Basically, you define a simple tool that's tailored to producing structured output. The expected schema is defined via Pydantic:

In [14]:
from langchain_core.pydantic_v1 import BaseModel, Field


class Review(BaseModel):
    """Collection of issues mentioned in a review"""
    critique_points: list[str] = Field(description="List of issues mentioned in the review")

Remember when your colleagues nag you to add proper comments to your code? Well, this time it's actually not optional at all. The docstring of the class and the description of the field are both used by the model to determine what information to extract and how to match that with the fields in the class. The explanations in this example are rather short, but enough for my simple use case.

Defining an LLM with that expected output is then simply:

In [18]:
from utils import load_openai_key
from langchain_openai import ChatOpenAI

load_openai_key()

llm = ChatOpenAI(temperature=0, model="gpt-3.5-turbo-0125").with_structured_output(Review)

Now all that's left is a prompt which can be re-used for every one of the reviews. In LangChain this is called a prompt template. Similar to f-strings, you define a text with placeholders, which are populated later:

In [19]:
from langchain_core.prompts import ChatPromptTemplate

prompt_msg = ChatPromptTemplate.from_template("""
    Extract all the issues mentioned in the review and return them as a list. Only extract concrete issues, not general sentiments. I.e. "The battery life is too short" is an issue, but "I don't like the design" is not.
    Or "not made to last long" is not an issue, but "(because) screws are loose" is an issue.

    Review:
    {input}
""")

Putting it all together with LangChain's fancy chain syntax (the `|` operator) gives us a callable object that takes a review as input, populates the prompt template, invokes the LLM and returns the information structured in the `Review` class from above:

In [20]:
chain = prompt_msg | llm

An example call:

In [21]:
example_review = "Title: Toll, aber!\nText: Meine Kinder lieben den Würfel, leider ist er vermutlich nicht dafür gebaut, wirklich lange und ständig verwendet zu werden. Die geklebte Folie löst sich nun ab.\nRating: 3/5\nDate: 2024-05-25"

chain.invoke({"input": example_review})

Review(critique_points=['Die geklebte Folie löst sich ab'])

Works like a charm. Now repeat that for every review and save each response under the same (arbitrary) review ID as the original, so the creation date can be matched up later, and this step is done.
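A minimal sketch of that loop, under a few assumptions that are mine rather than the project's: the raw files are named `review_<id>.txt` and hold a Python dict literal, the results go to hypothetical `issues_<id>.json` files, and `chain` is any object with an `invoke` method that returns something with a `critique_points` attribute (such as the chain built above).

```python
import ast
import json
from pathlib import Path
from typing import Any


def format_review(review: dict[str, str]) -> str:
    """Render one review dict as the plain-text block passed to the chain."""
    return (
        f"Title: {review['title']}\n"
        f"Text: {review['text']}\n"
        f"Rating: {review['rating']}\n"
        f"Date: {review['date']}"
    )


def extract_all(chain: Any, raw_dir: Path, out_dir: Path) -> int:
    """Run the chain on every review_<id>.txt and save issues_<id>.json.

    Returns the number of reviews processed.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    processed = 0
    for path in sorted(raw_dir.glob("review_*.txt")):
        # The review ID is taken from the file name, e.g. review_7.txt -> 7.
        review_id = int(path.stem.split("_")[1])
        review = ast.literal_eval(path.read_text(encoding="utf-8"))
        result = chain.invoke({"input": format_review(review)})
        payload = {
            "id": review_id,
            "date": review["date"],
            "issues": list(result.critique_points),
        }
        out_path = out_dir / f"issues_{review_id}.json"
        out_path.write_text(json.dumps(payload, ensure_ascii=False), encoding="utf-8")
        processed += 1
    return processed
```

Passing the chain in as a parameter keeps the loop testable with a stub, so no API calls are needed while debugging the file handling.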

### Step 4: Cluster customer issues

Now that we have a long list of customer issues, we want to form clusters of issues with similar meaning.

The exact task I set was
> Given the list of all issues, find clusters to group those issues

In particular, I'm neither telling the system what the clusters are nor how many there are. In a real business case, this might be information that's already available.

This step doesn't necessarily require an API call; you could just do it in the ChatGPT web interface (where I tested it). However, since we're pretending this system would be run for thousands of products, this step should be automated in code. This also gives me the opportunity to test a slightly more involved structured output:

In [22]:
from langchain_core.pydantic_v1 import BaseModel, Field

class IssuesWithId(BaseModel):
    id: int = Field(..., description="ID of the issue")
    issues: list[str] = Field(..., description="list of issues from that critique that were associated with this cluster. Not all issues mentioned might belong to the same cluster")
    def __str__(self):
        return f"{self.id}: {self.issues}"

class ClusterRequest(BaseModel):
    name: str = Field(..., description="Name of the cluster")
    description: str = Field(..., description="Brief description of what defines the cluster")
    ids: list[IssuesWithId] = Field(..., description="List of issues with their respective IDs")
    def __str__(self):
        return f"{self.name}: {self.description}\n{self.ids}"

class ClustersResponse(BaseModel):
    clusters: list[ClusterRequest] = Field(..., description="List of clusters")
    def __str__(self):
        return f"{self.clusters}"

With the input being a list of `[id, [issues]]` pairs, the expected output is a list of clusters. Each cluster should have:
- name
- brief description
- list of all review ids with the corresponding issues that match the cluster
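The `[id, [issues]]` input described above could be assembled roughly like this. It is only a sketch: the JSON serialisation, the choice to drop reviews without extracted issues, and the names `build_cluster_input` and `CLUSTER_PROMPT` are my assumptions. Only the first sentence of the prompt is the task statement quoted earlier.

```python
import json


def build_cluster_input(issues_by_review: dict[int, list[str]]) -> str:
    """Serialise {review_id: [issues]} as a JSON list of [id, [issues]] pairs."""
    pairs = [
        [review_id, issues]
        for review_id, issues in sorted(issues_by_review.items())
        if issues  # assumption: reviews without any extracted issue are skipped
    ]
    return json.dumps(pairs, ensure_ascii=False)


# Hypothetical prompt wording around the task statement.
CLUSTER_PROMPT = (
    "Given the list of all issues, find clusters to group those issues.\n\n"
    "Issues as [review_id, [issues]] pairs:\n{pairs}"
)

example = {
    0: ["Die geklebte Folie löst sich ab"],
    8: ["nicht stabil", "ziemlich schnell hinüber"],
    5: [],
}
prompt_text = CLUSTER_PROMPT.format(pairs=build_cluster_input(example))
print(prompt_text)
```

Keeping the review ID next to each list of issues is what lets the response be joined back to the review dates later.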

In my brief tests I had to use `gpt-4o` for this task. The `gpt-3.5` model often only gave me a single cluster.

In [None]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(temperature=0, model="gpt-4o").with_structured_output(ClustersResponse)

Invoking the LLM with the list of all extracted issues, we get the clusters:
- Durability Issues
- Price Issues
- Size Issues
- Quality Issues
- Usability Issues

along with the list of reviews that fall under each category.

### Step 5: Visualization per cluster over time

Flattening the issue lists and merging back the creation date of the corresponding review, we end up with this dataframe:

In [23]:
import pandas as pd

clusters_with_dates = pd.read_csv('clusters_with_dates.csv')
clusters_with_dates.head()

| Unnamed: 0 | Cluster Name | Review ID | Issue | date | hover_text |
|---|---|---|---|---|---|
| 0 | Durability Issues | 0 | Die geklebte Folie löst sich ab | 2024-05-25 | Die geklebte Folie löst sich ab (25.05.24) |
| 1 | Durability Issues | 2 | Es war nach wenigen Minuten kaputt | 2024-05-05 | Es war nach wenigen Minuten kaputt (05.05.24) |
| 2 | Durability Issues | 4 | am nächsten Tag schon gerissen | 2024-04-10 | am nächsten Tag schon gerissen (10.04.24) |
| 3 | Durability Issues | 8 | nicht stabil | 2024-03-21 | nicht stabil (21.03.24) |
| 4 | Durability Issues | 8 | ziemlich schnell hinüber | 2024-03-21 | ziemlich schnell hinüber (21.03.24) |
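The flattening step that produces rows like those shown above could be sketched as follows. It assumes the cluster response is available as nested dicts mirroring the Pydantic models from Step 4 (for instance what `.dict()` on the response would give) and that the dates are kept in a `{review_id: date}` mapping. The function name `flatten_clusters` is made up for illustration.

```python
def flatten_clusters(clusters: list[dict], dates: dict[int, str]) -> list[dict]:
    """Turn nested clusters into one row per (cluster, review, issue)."""
    rows = []
    for cluster in clusters:
        for entry in cluster["ids"]:
            for issue in entry["issues"]:
                rows.append({
                    "Cluster Name": cluster["name"],
                    "Review ID": entry["id"],
                    "Issue": issue,
                    # Merge the creation date back in via the review ID.
                    "date": dates[entry["id"]],
                })
    return rows


clusters = [
    {
        "name": "Durability Issues",
        "description": "Product breaks or wears out quickly",  # illustrative text
        "ids": [
            {"id": 0, "issues": ["Die geklebte Folie löst sich ab"]},
            {"id": 8, "issues": ["nicht stabil", "ziemlich schnell hinüber"]},
        ],
    },
]
dates = {0: "2024-05-25", 8: "2024-03-21"}

rows = flatten_clusters(clusters, dates)
for row in rows:
    print(row)
```

A list of dicts like this can be handed straight to `pd.DataFrame(rows)` to get the tabular form.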


As for the actual visualization code... Well, let's just say that was almost entirely done by ChatGPT. I'm quite confident in my matplotlib skills, but I've never done anything with an interactive plotting library. ChatGPT suggested using plotly, which I had at least heard of before. The code looks readable and I can understand it; that's enough for me for now.

If you want to dig in, feel free:

In [None]:
import pandas as pd
import plotly.express as px
import locale

# Set locale to English
locale.setlocale(locale.LC_TIME, 'en_US.UTF-8')

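# Join the clustered issues with the review dates
# (`issues` and `dates_df` are assumed to be DataFrames sharing a 'Review ID' column)
data = pd.merge(issues, dates_df, on='Review ID')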

# Convert 'date' column to datetime if not already
data['date'] = pd.to_datetime(data['date'])

# Group by cluster and calendar month and count the issues
# ('ME' = month end; pandas versions before 2.2 use 'M' instead)
monthly_issues = data.groupby(['Cluster Name', pd.Grouper(key='date', freq='ME')]).size().reset_index(name='Issue Count')

# Create a string for the hover data combining issue descriptions
data['hover_text'] = data['Issue'] + " (" + data['date'].dt.strftime('%d.%m.%y') + ")"
hover_text_df = data.groupby(['Cluster Name', pd.Grouper(key='date', freq='ME')])['hover_text'].apply(lambda x: '<br>'.join(x)).reset_index()

# Merge hover_text back to monthly_issues
monthly_issues = monthly_issues.merge(hover_text_df, on=['date', 'Cluster Name'], how='left')

# Plot using plotly
fig = px.bar(monthly_issues, x='date', y='Issue Count', color='Cluster Name', barmode='group',
             custom_data=['hover_text'],
             title='Monthly Number of New Issues for Each Cluster')

# Update hover template to show issues only
fig.update_traces(hovertemplate='%{customdata}')

fig.update_layout(
    xaxis_title='Month',
    yaxis_title='Number of Issues',
    width=1000,   # width of the plot in pixels
    bargap=0.1,   # gap between bars (0 = no gap, 1 = full gap)
)

fig.write_html("plot.html")


fig.show()
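For anyone unfamiliar with `pd.Grouper`: grouping by `freq='ME'` puts every date into its calendar month (labelled by the month's last day) and then counts per cluster. A plain-Python sketch of the same bucketing, using the dates from the dataframe preview, might make that clearer. It is an illustration of the idea, not the code that produced the plot, and `monthly_counts` is a made-up name.

```python
from collections import Counter
from datetime import date


def monthly_counts(rows: list[tuple[str, date]]) -> Counter:
    """Count issues per (cluster name, year, month) bucket."""
    counts: Counter = Counter()
    for cluster_name, issue_date in rows:
        counts[(cluster_name, issue_date.year, issue_date.month)] += 1
    return counts


rows = [
    ("Durability Issues", date(2024, 5, 25)),
    ("Durability Issues", date(2024, 5, 5)),
    ("Durability Issues", date(2024, 4, 10)),
    ("Durability Issues", date(2024, 3, 21)),
    ("Durability Issues", date(2024, 3, 21)),
]

counts = monthly_counts(rows)
for (cluster_name, year, month), n in sorted(counts.items()):
    print(f"{cluster_name} {year}-{month:02d}: {n}")
```

Each bucket becomes one bar in the grouped bar chart, with the joined `hover_text` of its issues shown on hover.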

## Visualization

<div style="width: 100%; overflow: hidden;">
    <iframe src="plot.html" style="width: 100%; height: 600px; border: none;"></iframe>
</div>
