In this notebook we preprocess records from the JSON file downloaded from: https://data.dws.informatik.uni-mannheim.de/largescaleproductcorpus/data/v2/goldstandards/all_gs.json.gz
Our goal is to process each record with an LLM to obtain a better representation, and then to create triplets and fiveplets of similar products according to the hierarchy.

In [1]:
import pandas as pd
import numpy as np
import re
import json
import pickle

Below we load the raw product data from the unzipped JSON file downloaded above.

In [2]:
file_path = 'all_gs.json'
json_data = []
with open(file_path, 'r') as file:
    for i in file.readlines():
        json_data.append(json.loads(i))

for item in json_data:
    # Access individual items in each JSON object
    id_left = item['id_left']
    title_left = item['title_left']
    description_left = item['description_left']

    # Access KeyValuePairs
    key_value_pairs_left = item['keyValuePairs_left']
    category_left = key_value_pairs_left['category']
    sub_category_left = key_value_pairs_left['sub category']

    # Similarly, access fields from the right side
    id_right = item['id_right']
    title_right = item['title_right']
    description_right = item['description_right']

    # Access KeyValuePairs for the right side
    key_value_pairs_right = item['keyValuePairs_right']
    category_right = key_value_pairs_right['category']
    sub_category_right = key_value_pairs_right['sub category']

    # Additional processing or printing can be done here
    print(f"Left Item ID: {id_left}, Right Item ID: {id_right}")
    print(f"Left Title: {title_left}, Right Title: {title_right}")
    print(f"Left Description: {description_left}, Right Description: {description_right}")
    print(f"Left Category: {category_left}, Right Category: {category_right}")
    print("------")
    break

Left Item ID: 5923646, Right Item ID: 16920267
Left Title: null , 417772 b21 hp xeon 5130 2 0ghz dl140 g3 new wholesale price, Right Title: 417772 b21 hp xeon 5130 2 0ghz dl140 g3 , null
Left Description: description intel xeon 5130 dl140 g3 2 00ghz 2 core 4mb 65w full processor option kitpart number s manufacturer part 417772 b21, Right Description: description intel xeon 5130 dl140 g3 2 00ghz core 4mb 65w full processor option kitpart number s manufacturer part 417772 b21
Left Category: proliant processor, Right Category: proliant processor
------
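
The loop above indexes `keyValuePairs_left` directly, so it assumes the field is always a dictionary that contains both keys. The following is a minimal sketch of a more defensive accessor, under the assumption (not confirmed by the source) that some records may carry `null` or missing values. `get_nested` is a hypothetical helper name, and the two sample records are made up for illustration.

```python
import json

def get_nested(record, outer_key, inner_key, default=None):
    """Return record[outer_key][inner_key], or default if either level is missing or None."""
    outer = record.get(outer_key)
    if not isinstance(outer, dict):
        return default
    return outer.get(inner_key, default)

# Made-up JSON Lines sample that mimics the field names used above.
sample_lines = [
    '{"id_left": 1, "keyValuePairs_left": {"category": "proliant processor"}}',
    '{"id_left": 2, "keyValuePairs_left": null}',
]
records = [json.loads(line) for line in sample_lines]
categories = [get_nested(r, "keyValuePairs_left", "category", default="") for r in records]
print(categories)  # ['proliant processor', '']
```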


Below we connect to hugchat, which lets us query an LLM. A Hugging Face (HF) email and password are required.

In [3]:
from hugchat import hugchat
from hugchat.login import Login

def login():
  email = "PLACEHOLDER@gmail.com" #EMAIL FOR HUGGING FACE
  password = "PLACEHOLDER"        #PASSWORD FOR HUGGING FACE
  sign = Login(email, password)
  cookies = sign.login()
  cookie_path_dir = "./cookies_snapshot"
  sign.saveCookiesToDir(cookie_path_dir)
  return hugchat.ChatBot(cookies=cookies.get_dict())
CHATBOT = login()

def query_wrapper(text):
  id = CHATBOT.new_conversation()
  CHATBOT.change_conversation(id)
  return CHATBOT.query(text)

def poc(similarity, title_left, description_left, title_right, description_right, prompt):
  query_left = query_wrapper(prompt + description_left)
  query_right = query_wrapper(prompt + description_right)
  return similarity.similarity(title_left+str(query_left), title_right+str(query_right))
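
The `login` function above keeps the credentials in the source code. One possible alternative, sketched here under stated assumptions, is to read them from environment variables. The variable names `HF_EMAIL` and `HF_PASSWORD` and the helper `read_credentials` are hypothetical and are not part of hugchat.

```python
import os

def read_credentials(env=os.environ):
    """Return (email, password) from the environment, or raise if either is missing."""
    email = env.get("HF_EMAIL")        # hypothetical variable name
    password = env.get("HF_PASSWORD")  # hypothetical variable name
    if not email or not password:
        raise RuntimeError("Set HF_EMAIL and HF_PASSWORD before logging in.")
    return email, password
```

The returned pair could then be passed to `Login(email, password)` in place of the placeholders.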

Below is a prompt that we used to create new representations for products.

In [4]:
model_prompt = "Given a product title and description, generate a meaningful text representation that captures the essence of the product for effective similarity search. Consider relevant features, attributes, and contextual information to ensure the generated representation reflects the product's unique characteristics, allowing for accurate comparisons in a similarity search algorithm. Do not answer, just create a representation.\n\nTEXT TO REPRESENT:\n"

Below is a helper class to save each representation.

In [5]:
class Created:
    def __init__(self,id,llm_output):
        self.id=id
        self.llm_output=llm_output

    def __eq__(self, other):
        return self.id == other
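
Because `__eq__` compares the stored id with a bare value, a membership test such as `some_id in list_of_created` works directly on ids. The sketch below repeats the class so that it runs on its own. The set `done_ids` is an illustrative alternative that gives constant-time lookups and is not something the notebook uses.

```python
class Created:
    def __init__(self, id, llm_output):
        self.id = id
        self.llm_output = llm_output

    def __eq__(self, other):
        # Compare against a bare id, mirroring the class above.
        return self.id == other

done = [Created(5923646, "text a"), Created(16920267, "text b")]
print(5923646 in done)  # True
print(42 in done)       # False

# Illustrative alternative: keep the ids in a set for O(1) membership tests.
done_ids = {c.id for c in done}
print(16920267 in done_ids)  # True
```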

Below is the function that calls the LLM with the prompt followed by the product details. It then saves the result to a file in the representations directory.

In [6]:
ALL_DATA=[]
def LLM_call(i, directory):
    CHATBOT.delete_all_conversations()
    id = CHATBOT.new_conversation()
    CHATBOT.change_conversation(id)
    # Use the title and description of the requested side; treat missing values as empty strings.
    title = i['title_'+directory] or ""
    description = i['description_'+directory] or ""
    result=str(query_wrapper(model_prompt + title + "\n" + description))
    ALL_DATA.append(Created(i["id_"+directory],result))
    with open("representations/"+str(i["id_"+directory])+'.txt', 'w') as f:
        f.write(result)
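
When assembling the query text, note that a conditional expression binds more loosely than `+`, so `a + b if c else d` is parsed as `(a + b) if c else d`. A minimal sketch of a prompt builder that avoids this pitfall follows; `build_prompt` is a hypothetical helper and the record is made up.

```python
def build_prompt(prompt, record, side):
    """Concatenate prompt, title and description for one side ('left' or 'right')."""
    title = record.get("title_" + side) or ""
    description = record.get("description_" + side) or ""
    return prompt + title + "\n" + description

record = {"title_left": "hp xeon 5130", "description_left": None}
print(repr(build_prompt("PROMPT:\n", record, "left")))  # 'PROMPT:\nhp xeon 5130\n'
```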

Below we call the LLM using hugchat and preprocess the data.

In [7]:
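while True:
    try:
        for j,i in enumerate(json_data):
            if i["id_left"] not in ALL_DATA:
                LLM_call(i,"left")
            if i["id_right"] not in ALL_DATA:
                LLM_call(i,"right")
        break  # every record was processed, so stop retrying
    except Exception:
        # A call failed (e.g. a network error): restart the pass; finished ids are skipped.
        continue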
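
The retry logic can also be made resumable across sessions by checking which output files already exist, and bounded by a maximum number of attempts. This is only a sketch under stated assumptions: `process_all`, `flaky_call`, and `max_attempts` are hypothetical names, and `flaky_call` is a stand-in for the real LLM call.

```python
import os
import tempfile

def process_all(records, call, out_dir, max_attempts=3):
    """Call `call(record, side)` for every side whose output file is missing.

    Each failing call is retried up to `max_attempts` times.
    Returns the list of (id, side) pairs that still failed.
    """
    failed = []
    for record in records:
        for side in ("left", "right"):
            path = os.path.join(out_dir, f"{record['id_' + side]}.txt")
            if os.path.exists(path):
                continue  # already generated in a previous run
            for attempt in range(max_attempts):
                try:
                    text = call(record, side)
                except Exception:
                    continue  # transient error: try again
                with open(path, "w") as f:
                    f.write(text)
                break
            else:
                # No attempt succeeded for this side.
                failed.append((record["id_" + side], side))
    return failed

# Stand-in for the LLM call: fails once, then succeeds.
calls = {"n": 0}
def flaky_call(record, side):
    calls["n"] += 1
    if calls["n"] == 1:
        raise ConnectionError("simulated failure")
    return "representation of " + str(record["id_" + side])

with tempfile.TemporaryDirectory() as tmp:
    leftover = process_all([{"id_left": 1, "id_right": 2}], flaky_call, tmp)
    written = sorted(os.listdir(tmp))
print(leftover, written)  # [] ['1.txt', '2.txt']
```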

Below is a function that strips non-ASCII characters, which cannot be used in further training. It also drops the first two lines and the last line of each file.

In [7]:
def handle_unicode(f):
    return re.sub(r"[^\x00-\x7F]+", "", "".join(f.readlines()[2:-1]).strip())
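
To make the behaviour concrete, the sketch below applies the same function to an in-memory file. The sample text is made up; the source does not say why the first two lines and the last line are discarded.

```python
import io
import re

def handle_unicode(f):
    return re.sub(r"[^\x00-\x7F]+", "", "".join(f.readlines()[2:-1]).strip())

# Five lines: the first two and the last one are dropped by the slice [2:-1].
fake_file = io.StringIO("header\n\nXeon\u2122 5130\ncaf\u00e9 edition\nfooter\n")
cleaned = handle_unicode(fake_file)
print(repr(cleaned))  # 'Xeon 5130\ncaf edition'
```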

Below is a function that opens a generated file and returns its cleaned contents.

In [8]:
def handle_files(name):
    with open("./data/representations/"+str(name)+".txt","r") as f:
        return handle_unicode(f)

Below we iterate over the data from the original JSON and combine each positive pair with a negative product to create a triplet, according to the "label" attribute.

In [None]:
triplets = []
base=None
positive=None
negative=None
for i in json_data:
    if i["label"]=="1":
        base=handle_files(i["id_left"])
        positive=handle_files(i["id_right"])
        for j in json_data:
            if j["label"]=="0":
                if j["id_left"]==i["id_left"] or j["id_left"]==i["id_right"]:
                    negative=handle_files(j["id_right"])
                    triplets.append([base,positive,negative])
                if j["id_right"]==i["id_left"] or j["id_right"]==i["id_right"]:
                    negative=handle_files(j["id_left"])
                    triplets.append([base,positive,negative])
np.save("triplets.npy",np.array(triplets))
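
The nested loop above scans every record once per positive pair. A possible alternative, sketched here on made-up ids rather than file contents, first indexes the negative pairs by product id. `build_triplets` is a hypothetical name, and the order of the resulting triplets may differ from the loop above.

```python
from collections import defaultdict

def build_triplets(pairs):
    """Return (base_id, positive_id, negative_id) tuples from labelled pairs."""
    # Map each product id to the ids it is paired with under label "0".
    negatives = defaultdict(list)
    for p in pairs:
        if p["label"] == "0":
            negatives[p["id_left"]].append(p["id_right"])
            negatives[p["id_right"]].append(p["id_left"])

    triplets = []
    for p in pairs:
        if p["label"] == "1":
            base, positive = p["id_left"], p["id_right"]
            # A negative of either member of the positive pair completes a triplet.
            for member in (base, positive):
                for negative in negatives.get(member, []):
                    triplets.append((base, positive, negative))
    return triplets

toy_pairs = [
    {"id_left": "a", "id_right": "b", "label": "1"},
    {"id_left": "a", "id_right": "c", "label": "0"},
    {"id_left": "d", "id_right": "b", "label": "0"},
]
print(build_triplets(toy_pairs))  # [('a', 'b', 'c'), ('a', 'b', 'd')]
```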

Below we iterate over the data from the original JSON again. As with triplets, we combine each positive pair with a negative record. Additionally, we add a copy of the product and a product chosen by category (in the code, a random record whose `category_left` differs from that of the base product). The fiveplets will be used to calculate the metric we propose.

In [10]:
fiveplets = []
base=None
positive=None
negative=None
for i in json_data:
    if i["label"]=="1":
        base=handle_files(i["id_left"])
        positive=handle_files(i["id_right"])
        for j in np.random.permutation(json_data):
            if i["category_left"]!=j["category_left"]:
                category = handle_files(j["id_left"])
                break
        for j in json_data:
            if j["label"]=="0":
                if j["id_left"]==i["id_left"] or j["id_left"]==i["id_right"]:
                    negative=handle_files(j["id_right"])
                    fiveplets.append([base,base,positive,negative,category])
                if j["id_right"]==i["id_left"] or j["id_right"]==i["id_right"]:
                    negative=handle_files(j["id_left"])
                    fiveplets.append([base,base,positive,negative,category])
np.save("fiveplets.npy",np.array(fiveplets))
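
The category-based member of a fiveplet is picked by scanning a random permutation until a record with a different `category_left` appears. The sketch below isolates that step and returns `None` when no such record exists, a case the loop above does not handle explicitly. `pick_other_category` is a hypothetical helper and the records are made up.

```python
import random

def pick_other_category(record, records, rng):
    """Return a random record whose category_left differs from record's, or None."""
    candidates = [r for r in records if r["category_left"] != record["category_left"]]
    return rng.choice(candidates) if candidates else None

toy = [
    {"id_left": 1, "category_left": "cpu"},
    {"id_left": 2, "category_left": "cpu"},
    {"id_left": 3, "category_left": "shoes"},
]
rng = random.Random(0)  # fixed seed so the sketch is repeatable
picked = pick_other_category(toy[0], toy, rng)
print(picked["id_left"])  # 3, the only record outside the "cpu" category
```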
