# Semantic Matching with the PaLM API

**Learning Objectives**
  1. Learn how to identify matching product descriptions
  2. Learn how to design a prompt for semantic matching
  3. Learn how to evaluate the performance of a prompt for semantic matching
  4. Learn how to query the Vertex PaLM API
  
Semantic matching is the problem of classifying a pair of entities $(x_1, x_2)$ as a good match or not. This classification setup is very flexible: it covers general information retrieval (where the first entity can be a textual query and the second a paragraph, for instance), entity resolution, and fuzzy matching of database records. In this notebook, we focus on matching textual descriptions of retail products. More specifically:
  
  
**Use case description:** An online retail company scours the web to compare prices of products in their inventory with those offered by their competitors. Their first priority is to implement a model that compares the information on two product webpages and outputs a classification indicating whether the different product descriptions on the webpages correspond to an identical product, which we will refer to as a 'match'. We use the [Amazon-Google Products dataset](https://dbs.uni-leipzig.de/file/Amazon-GoogleProducts.zip) in the [entity resolution benchmark](https://dbs.uni-leipzig.de/research/projects/object_matching/benchmark_datasets_for_entity_resolution) created by Leipzig University. 

**Model description:** This notebook illustrates how to use the PaLM API in Vertex AI to match product descriptions. The [Amazon-GoogleProducts dataset](https://dbs.uni-leipzig.de/file/Amazon-GoogleProducts.zip) contains information about products scraped from the Amazon and Google websites. It includes each product's title, description, price, and manufacturer, although this information is worded differently on the two websites. In this notebook, we focus solely on building a model that uses the product titles. The idea is straightforward: we create a prompt that asks the PaLM API whether two product titles match. Although the information about the products is limited to these titles, we still achieve an accuracy close to 100% on the test set.

**Evaluation method:** To avoid overfitting the prompt design to our dataset, we first split our sample of paired descriptions into an evaluation set (evalset) of 60 examples and a test set (testset) of 60 examples. The choice of 60 examples aligns with the current quota of 60 requests per minute for the PaLM API in Vertex AI. We then design the prompt using the evalset and report the model's accuracy on the testset. Both the evaluation and test sets are roughly balanced.

## Setup

In [200]:
import random

import pandas as pd
import vertexai
from vertexai.language_models import TextGenerationModel

pd.options.display.max_colwidth = 1000

vertexai.init()

## Exploring the dataset

We use a [dataset of product information](https://dbs.uni-leipzig.de/file/Amazon-GoogleProducts.zip) scraped from Google and Amazon websites. The dataset is part of a [benchmark for semantic matching and entity resolution](https://dbs.uni-leipzig.de/research/projects/object_matching/benchmark_datasets_for_entity_resolution) from Leipzig University. It contains 3 tables which are included in this repo:

```python
../data/Amazon.csv.gz
../data/GoogleProducts.csv.gz
../data/Amzon_GoogleProducts_perfectMapping.csv.gz
```

The first table contains product information listed on Amazon, including the product title, description, manufacturer, and price:

In [201]:
amazon = pd.read_csv("../data/Amazon.csv.gz")
amazon.columns = [
    "idAmazon",
    "amazon_title",
    "amazon_description",
    "amazon_manufacturer",
    "amazon_price",
]
amazon.head()

Unnamed: 0,idAmazon,amazon_title,amazon_description,amazon_manufacturer,amazon_price
0,b000jz4hqo,clickart 950 000 - premier image pack (dvd-rom),,broderbund,0.0
1,b0006zf55o,ca international - arcserve lap/desktop oem 30pk,oem arcserve backup v11.1 win 30u for laptops and desktops,computer associates,0.0
2,b00004tkvy,noah's ark activity center (jewel case ages 3-8),,victory multimedia,0.0
3,b000g80lqo,peachtree by sage premium accounting for nonprofits 2007,peachtree premium accounting for nonprofits 2007 is the affordable easy to use accounting solution that provides you with donor/grantor management. if you're like most nonprofit organizations you're constantly striving to maximize each and every dollar of your annual operating budget. financial reporting by programs and funds advanced operational reporting and the rock-solid core accounting features that have made peachtree the choice of hundreds of thousands of small businesses. the result is an accounting solution tailor-made for the challenges of operating a nonprofit organization. keep an audit trail to record and report on any changes made to your transactions improve data integrity with prior period locking archive your organization's data for snap shots of your data before you closed your year set up individual user profiles with password protection peachtree restore wizard restores all backed-up data files plus web transactions and customized forms includes all standard acc...,sage software,599.99
4,b0006se5bq,singing coach unlimited,singing coach unlimited - electronic learning products (win me nt 2000 xp),carry-a-tune technologies,99.99


The second table contains the same information for products scraped from the Google website:

In [202]:
google = pd.read_csv("../data/GoogleProducts.csv.gz")
google.columns = [
    "idGoogleBase",
    "google_title",
    "google_description",
    "google_manufacturer",
    "google_price",
]
google.head()

Unnamed: 0,idGoogleBase,google_title,google_description,google_manufacturer,google_price
0,http://www.google.com/base/feeds/snippets/11125907881740407428,learning quickbooks 2007,learning quickbooks 2007,intuit,38.99
1,http://www.google.com/base/feeds/snippets/11538923464407758599,superstart! fun with reading & writing!,fun with reading & writing! is designed to help kids learn to read and write better through exercises puzzle-solving creative writing decoding and more!,,8.49
2,http://www.google.com/base/feeds/snippets/11343515411965421256,qb pos 6.0 basic software,qb pos 6.0 basic retail mngmt software. for retailers who need basic inventory sales and customer tracking.,intuit,637.99
3,http://www.google.com/base/feeds/snippets/12049235575237146821,math missions: the amazing arcade adventure (grades 3-5),save spectacle city by disrupting randall underling's plan to drive all the stores out of business and take over the city. solve real-world math challenges in the uniquely entertaining stores and make theme successful again.,,12.95
4,http://www.google.com/base/feeds/snippets/12244614697089679523,production prem cs3 mac upgrad,adobe cs3 production premium mac upgrade from production studio premium or standard,adobe software,805.99


The last table maps product entries on the two websites that correspond to the same product, possibly described differently on each website:

In [203]:
matching = pd.read_csv("../data/Amzon_GoogleProducts_perfectMapping.csv.gz")
matching.head()

Unnamed: 0,idAmazon,idGoogleBase
0,b000jz4hqo,http://www.google.com/base/feeds/snippets/18441480711193821750
1,b00004tkvy,http://www.google.com/base/feeds/snippets/18441110047404795849
2,b000g80lqo,http://www.google.com/base/feeds/snippets/18441188461196475272
3,b0006se5bq,http://www.google.com/base/feeds/snippets/18428750969726461849
4,b00021xhzw,http://www.google.com/base/feeds/snippets/18430621475529168165


From this raw data, we have pre-generated for you an eval and test set:

```python
../data/product_matching_eval.csv
../data/product_matching_test.csv
```


We will use the eval set to design the prompt and the test set to evaluate the performance of the PaLM API with this prompt. This way, the performance we report will be closer to the real performance on never-before-seen pairs of product descriptions.


To generate the eval and test splits, we used the function in the cell below. It takes a sample (whose size is controlled by `SAMPLE_SIZE`) of matching product IDs from the `matching` dataframe and joins it with the product information contained in the `google` and `amazon` dataframes. It then extracts pairs of matching Google and Amazon descriptions and creates a table with columns `description_1` (Google), `description_2` (Amazon), and a target column named `match`, whose value is set to `yes` since we only have matching pairs so far.
To create pairs of non-matching descriptions, we permute the second description column while keeping the first fixed, and concatenate this new dataframe of non-matching descriptions with the dataframe of matching ones. We then shuffle the resulting table and split it into two equal-sized dataframes, which we save to disk as our eval and test splits.

Observe that we only use the `title` column as the product description, so the raw data contains much more information than we use. Nevertheless, we will see that PaLM achieves an accuracy close to 100% on the test set. Remarkable!

**Note:** Uncomment the last line if you want to re-generate the eval and test set on a different sample of the data.

In [225]:
SAMPLE_SIZE = 60


def generate_test_and_eval_sets(sample_size=SAMPLE_SIZE):
    sample = matching.sample(sample_size)

    # Join the product information to the df of matching ID's
    matched_products = sample.merge(
        right=amazon, how="left", on="idAmazon"
    ).merge(right=google, how="left", on="idGoogleBase")

    google_descriptions = list(matched_products["google_title"])
    amazon_descriptions = list(matched_products["amazon_title"])

    # Create the dataframe of matching descriptions
    matching_descriptions = pd.DataFrame(
        {
            "description_1": google_descriptions,
            "description_2": amazon_descriptions,
            "match": "yes",
        }
    )

    # Create the dataframe of not matching descriptions
    amazon_descriptions_perm = [
        amazon_descriptions[i - 1] for i in range(len(amazon_descriptions))
    ]
    not_matching_descriptions = pd.DataFrame(
        {
            "description_1": google_descriptions,
            "description_2": amazon_descriptions_perm,
            "match": "no",
        }
    )

    full_dataset = pd.concat(
        [matching_descriptions, not_matching_descriptions], axis=0
    ).sample(len(matching_descriptions) * 2)

    evalset = full_dataset[:sample_size]
    testset = full_dataset[sample_size : 2 * sample_size]
    evalset.to_csv("../data/product_matching_eval.csv", index=False)
    testset.to_csv("../data/product_matching_test.csv", index=False)


# Uncomment to generate a different data sample
# generate_test_and_eval_sets()
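
The cyclic shift used above to build the non-matching pairs can be looked at in isolation. The sketch below is only an illustration: `shift_by_one` is a hypothetical helper name and the titles are made up. It shows that each first description ends up paired with the second description of a different row.

```python
def shift_by_one(items):
    """Cyclic right shift: position i receives the element from position i - 1."""
    return [items[i - 1] for i in range(len(items))]


# Toy titles, made up for illustration (not taken from the dataset).
google_titles = ["g-word processor", "g-antivirus", "g-photo editor"]
amazon_titles = ["a-word processor", "a-antivirus", "a-photo editor"]

shifted = shift_by_one(amazon_titles)
negative_pairs = list(zip(google_titles, shifted))

for google_title, amazon_title in negative_pairs:
    print(f"{google_title!r} vs {amazon_title!r} -> no")
```

Note that the shift only guarantees that each pair comes from two different rows. If the sample happened to contain the same product twice, a pair labeled `no` could still describe the same product.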

The next cell loads the eval and test datasets that we pre-generated. Each of the two CSV files contains 60 pairs of product descriptions. Each pair is labeled with a `match` value of `yes` if the descriptions describe the same product and `no` otherwise. `description_1` comes from the product title on Google, while `description_2` comes from the product title on Amazon.

In [206]:
evalset = pd.read_csv("../data/product_matching_eval.csv")
testset = pd.read_csv("../data/product_matching_test.csv")

Let's have a quick look at a few entries in this dataset:

In [207]:
evalset.head()

Unnamed: 0,description_1,description_2,match
0,topics entertainment instant immersion hawaiian audio,adobe photoshop elements 4.0 (mac),no
1,corel corporation corel wordperfect office x3 pro dvd,corel wordperfect office x3 professional edition,yes
2,microsoft office and windows training (win 98 me nt 2000 xp),crystal 3d impact! pro 1.25 (jewel case),no
3,sierrahome hse hallmark card studio special edition win 98 me 2000 xp,cinescore professional soundtrack edition,no
4,intego inc internet security barrier x4 antispam edition,internet security barrier x4 antispam edition,yes


To check that both splits are roughly balanced, we count the number of instances of each class in each split:

In [208]:
evalset.match.value_counts()

no     33
yes    27
Name: match, dtype: int64

In [209]:
testset.match.value_counts()

yes    33
no     27
Name: match, dtype: int64
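
One way to read these counts is to ask how well a trivial classifier that always answers with the majority class would do. The short sketch below computes that baseline from the test set counts printed above; the variable names are illustrative.

```python
# Class counts copied from the test set output above.
test_counts = {"yes": 33, "no": 27}

total = sum(test_counts.values())
majority_label = max(test_counts, key=test_counts.get)
majority_baseline = test_counts[majority_label] / total

print(f"Majority class: {majority_label}")
print(f"Majority-class baseline accuracy: {majority_baseline:.2%}")
```

With a baseline of 55%, any accuracy well above that level reflects actual matching ability rather than class imbalance.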

# Model implementation

We start by instantiating a client for the `text-bison@001` version of PaLM 2, a large language model (LLM) developed by Google.

In [210]:
generation_model = TextGenerationModel.from_pretrained("text-bison@001")

Using this client, we implement in the next cell a simple function that takes two product descriptions and a parameterized prompt as input, and outputs `yes` or `no` depending on whether the PaLM model thinks the product descriptions are matching.

In [211]:
def are_products_matching(d1, d2, prompt):
    prompt = prompt.format(desc1=d1, desc2=d2)
    answer = generation_model.predict(
        prompt, temperature=0.2, max_output_tokens=1024, top_k=40, top_p=0.8
    )
    return answer.text
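
To exercise this kind of function without spending API quota, one option is to pass the model in as an argument and substitute a stand-in object during testing. The sketch below is a minimal illustration under that assumption: `FakeModel`, `FakeResponse`, and `are_products_matching_with` are hypothetical names, and the stand-in only mimics the `predict(...).text` shape used above.

```python
from dataclasses import dataclass


@dataclass
class FakeResponse:
    """Hypothetical stand-in exposing only the `.text` attribute."""

    text: str


class FakeModel:
    """Hypothetical stand-in for the model client; it never calls the API."""

    def __init__(self, canned_answer):
        self.canned_answer = canned_answer
        self.prompts = []  # Records every prompt received, for inspection.

    def predict(self, prompt, **kwargs):
        self.prompts.append(prompt)
        return FakeResponse(text=self.canned_answer)


def are_products_matching_with(model, d1, d2, prompt):
    """Same logic as the function above, with the model passed in explicitly."""
    filled_prompt = prompt.format(desc1=d1, desc2=d2)
    answer = model.predict(
        filled_prompt, temperature=0.2, max_output_tokens=1024, top_k=40, top_p=0.8
    )
    return answer.text


fake = FakeModel(canned_answer="yes")
template = "Same product?\nDescription 1: {desc1}\nDescription 2: {desc2}\n"
result = are_products_matching_with(fake, "title a", "title b", prompt=template)
print(result)
print(fake.prompts[0])
```

The recorded prompt makes it easy to confirm that both descriptions were substituted into the template before any real request is sent.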

The next cell allows us to rapidly test on the evaluation set whether a given prompt seems to be working for this use case.

In [212]:
PROMPT = """
Are the following two descriptions describing the same product?
Answer only with "yes" or "no".

Description 1: {desc1}
Description 2: {desc2}
"""

index = random.randint(0, len(evalset) - 1)

d1 = evalset.iloc[index]["description_1"]
d2 = evalset.iloc[index]["description_2"]
ground_truth = evalset.iloc[index]["match"]
prediction = are_products_matching(d1, d2, prompt=PROMPT)


print(
    f"""
Are the following two descriptions describing the same product?

Description 1: {d1}

Description 2: {d2}

MODEL ANSWER: {prediction}
GROUND TRUTH: {ground_truth}
"""
)


Are the following two descriptions describing the same product?

Description 1: microsoft windows xp pro sp2 retail

Description 2: microsoft windows xp professional full version with sp2

MODEL ANSWER: yes
GROUND TRUTH: yes
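
Another prompt variant worth trying on the eval set is a few-shot prompt, which shows the model a handful of labeled pairs before the pair to classify. The sketch below is only one possible way to build such a template: `build_few_shot_prompt` and `escape_braces` are hypothetical helpers, and whether this variant improves on the simple prompt has not been measured here.

```python
def escape_braces(text):
    """Escape curly braces so that str.format leaves the text untouched."""
    return text.replace("{", "{{").replace("}", "}}")


def build_few_shot_prompt(examples):
    """Return a prompt template with worked examples and two placeholders.

    `examples` is a list of (description_1, description_2, label) tuples.
    """
    lines = [
        "Are the following two descriptions describing the same product?",
        'Answer only with "yes" or "no".',
        "",
    ]
    for desc1, desc2, label in examples:
        lines.append(f"Description 1: {escape_braces(desc1)}")
        lines.append(f"Description 2: {escape_braces(desc2)}")
        lines.append(f"Answer: {label}")
        lines.append("")
    # Plain strings: these placeholders are filled later by str.format.
    lines.append("Description 1: {desc1}")
    lines.append("Description 2: {desc2}")
    lines.append("Answer:")
    return "\n".join(lines)


# Two labeled pairs taken from the first rows of the eval set shown earlier.
examples = [
    (
        "intego inc internet security barrier x4 antispam edition",
        "internet security barrier x4 antispam edition",
        "yes",
    ),
    (
        "topics entertainment instant immersion hawaiian audio",
        "adobe photoshop elements 4.0 (mac)",
        "no",
    ),
]

FEW_SHOT_PROMPT = build_few_shot_prompt(examples)
print(FEW_SHOT_PROMPT.format(desc1="first title", desc2="second title"))
```

The resulting template uses the same `{desc1}` and `{desc2}` placeholders as the simple prompt. Rows used as in-prompt examples should be left out when scoring the eval set, since the model has already seen their labels.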



# Model Analysis

We now analyze the performance of our model on the test set.

Large language models may sometimes output something other than "yes" or "no", even when the prompt explicitly asks for one of these two answers. This can happen because a safety filter was triggered or because the model did not understand the question. Therefore, we first need to determine the proportion of requests that our model fails to answer. In the run shown below, no request failed (0%). Results can vary from run to run, and if failures do appear, further prompt engineering or model tuning could help reduce their number.

The second aspect to consider is the performance of the model on the requests that succeeded (i.e., those whose output was actually `yes` or `no`). Since the test set is roughly balanced, accuracy is a meaningful metric. In the run shown below, the accuracy is 100%: every example in the test set received a prediction equal to the ground truth.

## Scoring the test set

To simplify evaluation, we implement a function in the next cell that returns a copy of a dataset with an added `predictions` column. This column contains the predictions received from the PaLM API:

In [213]:
def apply_prompt(prompt, dataset):
    scored_dataset = dataset.copy()
    scored_dataset["predictions"] = scored_dataset.apply(
        lambda row: are_products_matching(
            row.description_1, row.description_2, prompt=prompt
        ),
        axis=1,
    )
    return scored_dataset

Let's apply this function to our `testset` using our simple prompt that we designed on the `evalset`:

In [214]:
scored_testset = apply_prompt(prompt=PROMPT, dataset=testset)

Since the PaLM API is capped at 60 requests per minute, we save the scored dataset to disk so that we can analyze it offline if needed.

In [215]:
scored_testset.to_csv("scored_testset.csv", index=False)
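
When scoring more rows than the per-minute quota allows, the calls have to be spaced out. The sketch below shows one simple way to do that, by enforcing a minimum interval between calls. `throttled_map` is a hypothetical helper, not part of the Vertex AI SDK, and it does not retry failed requests.

```python
import time


def throttled_map(
    func, items, max_per_minute=60, sleep=time.sleep, clock=time.monotonic
):
    """Apply `func` to each item, spacing calls to respect a per-minute cap."""
    min_interval = 60.0 / max_per_minute
    results = []
    last_call = None
    for item in items:
        if last_call is not None:
            # Wait only for the part of the interval that has not yet elapsed.
            wait = min_interval - (clock() - last_call)
            if wait > 0:
                sleep(wait)
        last_call = clock()
        results.append(func(item))
    return results


# Stand-in for an API call; a high cap keeps this demo fast.
answers = throttled_map(
    lambda pair: "yes", [("a", "b"), ("c", "d")], max_per_minute=6000
)
print(answers)
```

The `sleep` and `clock` parameters exist only so that the timing logic can be checked with fake functions instead of real waiting.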

Here are the first predictions of our model. Most examples are typically classified correctly, but requests can fail, resulting in predictions other than `yes` or `no`. We therefore analyze failed requests separately and compute the accuracy only for the requests that succeeded:

In [216]:
scored_testset.head(10)

Unnamed: 0,description_1,description_2,match,predictions
0,myob firstedge,firstedge v 2.0 (mac),yes,yes
1,adobe cs3 design premium upsell,adobe creative suite cs3 design premium [mac],yes,yes
2,tlc arthur's kindergarten learning system (pc) encore,child safe,no,no
3,pxl smartscale for mac/win,onone software pxl smartscale ( windows and macintosh ),yes,yes
4,corel ulead dvd workshop 2 authoring software,ulead systems dvd workshop 2,yes,yes
5,emedia rock guitar method (win 95 98 me nt 2000 xp/mac 8.0),emedia rock guitar method,yes,yes
6,epson s041886 story teller 8 x 10 20 pages,epson storyteller photo book creator - 8 x 10 (20 pages),yes,yes
7,diskeeper 2007 professional,iremember digital scrapbooking 2.0,no,no
8,tri synergy inc test & improve your memory - scientific brain training,internet security barrier x4 antispam edition,no,no
9,punch software 42100 punch! home design architectural series 18 (small box),elementary school success deluxe 2006,no,no


### Failed predictions

The cell below counts the failed predictions, that is, predictions that are anything other than `yes` or `no`. There are several possible causes for such behavior. PaLM, like all LLMs, has been trained to predict the most likely next word in a sequence of words. Therefore, although our prompt explicitly instructs PaLM to answer with `yes` or `no`, certain product descriptions may confuse it, resulting in a different answer. Another issue is that the language in the product descriptions may trigger a safety filter, which then replaces PaLM's raw answer with some standard warning text. These safety filters can be triggered by words in the product descriptions that are too evocative of health or medical issues, for instance, or of violence, among many [other safety settings](https://developers.generativeai.google/guide/safety_setting).

In [217]:
# A prediction fails when it is not exactly "yes" or "no" (this also flags
# missing values).
failed_predictions_mask = ~scored_testset.predictions.isin(["yes", "no"])

failed_predictions_mask.value_counts()

False    60
Name: predictions, dtype: int64

In [218]:
proportion_of_failed_predictions = failed_predictions_mask.sum() / len(
    failed_predictions_mask
)
print(
    "Proportion of failed requests:",
    round(proportion_of_failed_predictions * 100, 1),
    "%",
)

Proportion of failed requests: 0.0 %
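
When flagging failed predictions, it matters whether an answer is compared exactly or only checked at its start. A start-of-string regex check (the behavior of `re.match`, which pandas `Series.str.match` also follows) accepts answers such as "not sure", while an exact comparison rejects them. The sketch below contrasts the two on made-up answers; `normalize` and `is_failed` are hypothetical helpers.

```python
import re

VALID_ANSWERS = {"yes", "no"}


def normalize(answer):
    """Lowercase and trim whitespace and trailing punctuation; None becomes ''."""
    if answer is None:
        return ""
    return answer.strip().lower().rstrip(".!")


def is_failed(answer):
    """A prediction fails when its normalized form is not exactly yes or no."""
    return normalize(answer) not in VALID_ANSWERS


# Made-up raw answers, for illustration only.
raw_answers = ["yes", " No.", "not sure", "", None, "yes, they match"]

for raw in raw_answers:
    prefix_ok = bool(re.match("yes|no", normalize(raw)))
    exact_ok = not is_failed(raw)
    print(f"{raw!r:20} prefix check ok={prefix_ok!s:5} exact check ok={exact_ok}")
```

Light normalization before the exact comparison avoids counting answers such as " No." as failures, at the cost of a small assumption about which variations are harmless.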


The next cell displays the failed requests so that they can be examined. Some terms in the descriptions may trigger safety filters, which would require further investigation. In the run shown here, no request failed, so the table is empty:

In [219]:
scored_testset[failed_predictions_mask]

Unnamed: 0,description_1,description_2,match,predictions


## Accuracy on succeeded predictions

Let us now compute the model accuracy on the requests that succeeded. First, we remove all the failed requests from the test set:

In [220]:
scored_testset_with_predictions = scored_testset[~failed_predictions_mask]

Then, we compute the number of correct answers:

In [221]:
is_prediction_correct = (
    scored_testset_with_predictions.match
    == scored_testset_with_predictions.predictions
)
is_prediction_correct.value_counts()

True    60
dtype: int64

The accuracy of our model is the number of correct answers divided by the number of predictions that succeeded. In the run shown here, all 60 predictions are correct, so the accuracy is 100%.


In [222]:
model_accuracy = is_prediction_correct.sum() / len(is_prediction_correct)
print("Model accuracy:", round(model_accuracy, 3))

Model accuracy: 1.0
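
Accuracy computed only on successful requests can look better than the end-to-end behavior when some requests fail. The sketch below reports both figures, together with the failure rate, on made-up labels and predictions. `summarize` is a hypothetical helper and the data is for illustration only.

```python
def summarize(labels, predictions, valid_answers=("yes", "no")):
    """Return failure rate, accuracy on answered pairs, and end-to-end accuracy."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    if not labels:
        raise ValueError("at least one example is required")
    total = len(labels)
    # Keep only the pairs for which the model gave a valid answer.
    answered = [
        (label, pred)
        for label, pred in zip(labels, predictions)
        if pred in valid_answers
    ]
    correct = sum(label == pred for label, pred in answered)
    return {
        "failure_rate": (total - len(answered)) / total,
        "accuracy_on_answered": correct / len(answered) if answered else float("nan"),
        "end_to_end_accuracy": correct / total,
    }


# Made-up labels and predictions, for illustration only.
labels = ["yes", "no", "yes", "no", "yes"]
predictions = ["yes", "no", "", "no", "no"]
print(summarize(labels, predictions))
```

In this toy case one request out of five fails and one answered pair is wrong, so the accuracy on answered pairs (75%) is higher than the end-to-end accuracy (60%), which counts the failed request as an error.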


For our small test set of 60 examples, this means that the model made no errors in this run. The next cell builds a mask that selects the incorrect predictions:

In [223]:
uncorrect_predictions_mask = (
    scored_testset_with_predictions.match
    != scored_testset_with_predictions.predictions
)

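Inspecting the rows selected by this mask shows the misclassified pairs, if any. When an error does appear, it is worth checking the pair by hand, since the model may in fact have been correct. In the run shown here, the table is empty: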

In [224]:
scored_testset_with_predictions[uncorrect_predictions_mask]

Unnamed: 0,description_1,description_2,match,predictions


Copyright 2023 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License