# Setup

In [1]:
from pathlib import Path
import os

In [2]:
review_paths = list(Path('raw_reviews').glob('*.txt'))
review_paths[0]

PosixPath('raw_reviews/review_0.txt')

In [3]:
with Path('openai.key').open() as f:
    os.environ['OPENAI_API_KEY'] = f.read().strip()

# Extraction

In [4]:
import openai

openai.api_key = os.environ['OPENAI_API_KEY']

In [5]:
def get_completion(prompt, model="gpt-3.5-turbo"):
    messages = [{"role": "user", "content": prompt}]
    response = openai.ChatCompletion.create(
        model=model,
        messages=messages,
        temperature=0, # this is the degree of randomness of the model's output
    )
    return response.choices[0].message["content"]

In [6]:
template = """The text in triple backticks is a product review written in German. It is given in the format

title: title of the review
text: main text of the review
rating: rating out of 5, 5 being best

From these three pieces of information, extract problems with the product that the manufacturer can use to improve it. Translate all parts of your reply to English. Add the information to the existing list of problems that is given below, following this JSON format:

    [
        {
            "topic": "Parts",
            "text": ["<problem 1>",  ...]
        },
        {
            "topic": "Quality",
            "text": ["<problem 1>",  ...]
        }
    ]

Do not overwrite problems that are already in the list, only add new ones.
Extract the topics from the review, the ones above are only examples. Add new topics if necessary. Only include topics for which at least one problem is found.
Make the topics and text as short as possible, but still understandable.
Don't copy passages word for word from the review, but try to generalize.
It should not be obvious that the text is from a customer review.
Don't repeat the same problem multiple times. If a problem fits multiple of the above categories, only add it to one of them (doesn't matter which one).
Only include problems if they are clearly mentioned in the review. Don't add problems that are only implied. For example, if the review says "seems stable at first" but does not mention any problems, don't add "unstable" as a problem.

Here is an example of how a review text translates to a problem:

review example 1: The painted attachment caused problems with the nut, requiring a separate screw to be used.
problem 1: (Topic is "Parts") "painted attachment causes problem with the nut, sometimes requiring a separate screw to be used."


Only return valid JSON as output, nothing else. The output must load with Python's json.loads() without error. I.e. do not have any text outside the enclosing square brackets, do not add any comments, do not add any trailing commas, etc.
Here is the review:
"""

In [7]:
import json
import re

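def extract_json(response):
    # First try to parse the whole response as JSON.
    try:
        return json.loads(response)
    except json.JSONDecodeError:
        # Fall back to the contents of the first triple-backtick block.
        match = re.search(r"```(.*?)```", response, re.DOTALL)

        if match:
            # Raises JSONDecodeError if that block is not valid JSON.
            return json.loads(match.group(1))

        print("Failed to extract json from response:", response)
        return None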

In [8]:
def load_review(idx: int):
    return review_paths[idx].read_text()

In [9]:
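def build_prompt(template, review, problems):
    # Append the review and the current problem list, each in triple backticks.
    prompt = template + f"```{review}```"
    prompt += f"\n json list of problems: ```{problems}```"
    # Flatten the prompt to a single line.
    prompt = prompt.replace('\n', ' ')
    # Make the Python repr of `problems` look like JSON. Note that this also
    # rewrites every apostrophe in the template and the review.
    prompt = prompt.replace("'", '"')
    return prompt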
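
The quote replacement in `build_prompt` exists to make the Python repr of `problems` look like JSON, but it is applied to the whole prompt. A minimal sketch of an alternative, assuming the same three inputs, serializes the list with `json.dumps` and leaves the rest of the text alone. The name `build_prompt_json` and the sample review are made up for illustration and are not part of the original notebook; unlike `build_prompt`, this variant also keeps the newlines.

```python
import json

FENCE = "`" * 3  # triple backtick delimiter used around the review and the list


def build_prompt_json(template, review, problems):
    """Hypothetical variant of build_prompt that serializes problems as JSON."""
    # json.dumps emits valid, double-quoted JSON, so no global quote
    # replacement is needed; ensure_ascii=False keeps umlauts readable.
    problems_json = json.dumps(problems, ensure_ascii=False)
    prompt = template + FENCE + review + FENCE
    prompt += "\n json list of problems: " + FENCE + problems_json + FENCE
    return prompt


# Invented sample inputs, only to show that apostrophes survive.
example = build_prompt_json(
    "Here is the review: ",
    "title: Don't buy\ntext: It broke\nrating: 1",
    [{"topic": "Quality", "text": ["doesn't last"]}],
)
print(example)
```

Whether keeping the newlines changes the model's behavior is untested here; the flattening step can be added back if it turns out to matter.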

In [10]:
problems = []
for i in range(10):
    print(f"Review {i}")
    prompt = build_prompt(template, load_review(i), problems)
    response = get_completion(prompt)
    if (new_problems := extract_json(response)) is not None:
        problems = new_problems

Review 0
Review 1
Review 2
Review 3


JSONDecodeError: Expecting value: line 1 column 1 (char 0)
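
The run stops at review 3 because a single unparsable response raises out of the loop. A hedged sketch of a loop that records the failing index and keeps the last good list instead: `accumulate_problems`, `complete`, and `parse` are hypothetical names, with `complete` standing in for building the prompt and calling `get_completion`, and `parse` standing in for `extract_json`.

```python
import json


def accumulate_problems(reviews, complete, parse):
    """Fold reviews into one problem list, skipping responses that fail to parse.

    complete(review, problems) returns the model response as a string and
    parse(response) returns a list or None. Both are caller-supplied
    stand-ins (assumptions, not part of the original notebook).
    """
    problems, failed = [], []
    for i, review in enumerate(reviews):
        response = complete(review, problems)
        try:
            new_problems = parse(response)
        except json.JSONDecodeError:
            new_problems = None
        if new_problems is None:
            failed.append(i)  # keep the previous list and remember the index
        else:
            problems = new_problems
    return problems, failed


# Fake stand-ins so the sketch runs without an API call.
fake_responses = iter(['[{"topic": "Quality", "text": ["breaks quickly"]}]', "not json"])
problems, failed = accumulate_problems(
    ["review a", "review b"],
    complete=lambda review, problems: next(fake_responses),
    parse=json.loads,
)
print(problems, failed)
```

The failed indices can then be retried or inspected by hand, which is what the next cell does for `response`.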

In [11]:
response

'```title: Schlechte Qualität, Finger weg! \ntext: Ich habe das Produkt gekauft und war sehr enttäuscht. Die Qualität ist sehr schlecht und es hat nicht lange gehalten. Finger weg! \nrating: 1,0 von 5 Sternen``` \n\njson list of problems: \n```\n[\n  {\n    "topic": "Quality",\n    "text": [\n      "poor quality, product did not last long."\n    ]\n  }\n]\n```'
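
This response holds two fenced blocks: the echoed review first, then the JSON list. Matching only the first block hands the review text to `json.loads`, which explains the `JSONDecodeError`. A minimal sketch that tries the whole response and then every fenced block, returning the first candidate that parses: `extract_json_any_block` is a hypothetical name, and stripping a leading `json` language tag is an assumption about how the model might label its fence.

```python
import json
import re

FENCE = "`" * 3  # triple backtick
BLOCK_RE = re.compile(FENCE + r"(.*?)" + FENCE, re.DOTALL)


def extract_json_any_block(response):
    """Return the first parsable JSON payload found in response, or None."""
    # Candidates: the whole response, then each fenced block in order.
    candidates = [response] + BLOCK_RE.findall(response)
    for candidate in candidates:
        text = candidate.strip()
        # Assumption: the model may tag its fence with the word "json".
        if text.startswith("json"):
            text = text[len("json"):]
        try:
            return json.loads(text)
        except json.JSONDecodeError:
            continue
    return None


# Shortened stand-in with the same shape as the failing response.
sample = (
    FENCE + "title: x" + FENCE
    + " \n\njson list of problems: \n"
    + FENCE + '\n[{"topic": "Quality", "text": ["poor quality"]}]\n' + FENCE
)
print(extract_json_any_block(sample))
```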

In [12]:
problems

[{'topic': 'Parts', 'text': ['no problems mentioned.']},
 {'topic': 'Quality', 'text': ['no problems mentioned.']},
 {'topic': 'Assembly',
  'text': ['assembly instructions could be clearer, especially for beginners.']},
 {'topic': 'Price', 'text': ['good price-performance ratio.']}]
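
The list contains placeholder entries such as 'no problems mentioned.' even though the prompt asks for topics only when a problem is found. A small post-processing sketch, assuming the placeholder wording stays the same across responses, drops those texts and any topic left empty. `drop_placeholder_entries` and its `placeholders` parameter are hypothetical. It does not catch positive statements like the 'Price' entry, which string matching cannot reliably detect.

```python
def drop_placeholder_entries(problems, placeholders=("no problems mentioned.",)):
    """Remove placeholder texts and any topic left without problems."""
    cleaned = []
    for entry in problems:
        # Compare case-insensitively and ignore surrounding whitespace.
        texts = [t for t in entry["text"] if t.strip().lower() not in placeholders]
        if texts:
            cleaned.append({"topic": entry["topic"], "text": texts})
    return cleaned


# Two entries copied from the output above.
sample_problems = [
    {"topic": "Parts", "text": ["no problems mentioned."]},
    {"topic": "Assembly",
     "text": ["assembly instructions could be clearer, especially for beginners."]},
]
print(drop_placeholder_entries(sample_problems))
```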