There are several versions of this demo. This version uses online predictions with parallelism (multiple cores/threads), which got the jobs for 100k listings done in 25 minutes. We also tried the batch prediction approach, which is the one used for the final demo. **This notebook is for online predictions and parallelism.**

# Data Preprocessing


*   Embeddings
*   Questions and Answers for Category 1
*   Questions and Answers for Category 2
*   Questions and Answers for Category 3



## Import Libraries

In [None]:
import os
import json
import tqdm
import requests
import vertexai
import threading
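import concurrent.futures  # "import concurrent" alone does not load the futures submodule
import pandas as pd
from urllib.parse import urlparse
from google.cloud import bigquery, storage
from google.cloud import discoveryengine_v1 as discoveryengine
from vertexai.preview.generative_models import grounding, Tool
from requests.exceptions import RequestException, MissingSchema
# Classes used in later cells
from vertexai.generative_models import GenerationConfig, GenerativeModel, SafetySetting
from vertexai.language_models import TextEmbeddingInput, TextEmbeddingModel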

## Vertex AI Q&A Preload Preprocessing

- We tried two approaches to create the questions. The first uses online predictions, with multiple cores/threads calling the API concurrently and in batches to speed up content generation. It took 25 minutes to generate 100k Q&A.
- While running this demo we were also testing batch predictions.
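
The online approach described above can be sketched as a thread pool that fans prompts out in fixed-size batches. This is a minimal sketch, not the code used for the 100k run: `generate_parallel`, `batched`, and `fake_generate` are hypothetical names, and the worker count and batch size are illustrative.

```python
import concurrent.futures
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence, size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def generate_parallel(
    prompts: Sequence[str],
    generate: Callable[[str], str],
    max_workers: int = 8,
    batch_size: int = 100,
) -> List[str]:
    """Call `generate` on every prompt with a thread pool, one batch at a time.

    Results keep the input order because executor.map preserves it.
    """
    results: List[str] = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        for batch in batched(prompts, batch_size):
            results.extend(executor.map(generate, batch))
    return results


def fake_generate(prompt: str) -> str:
    # Hypothetical stand-in for a model call; replace with the real API call.
    return prompt.upper()


answers = generate_parallel(["mug", "lamp", "rug"], fake_generate, max_workers=2, batch_size=2)
print(answers)
```

Threads suit this workload because each call spends most of its time waiting on the network, so the worker count mostly trades throughput against API quota.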

In [None]:
df = bigquery.Client(project="vtxdemos").query("select * from `demos_us.home_and_living_listings_100k`").to_dataframe()

### Get Images from Listings Public URL and Store them in Google Cloud Storage
- A Google Cloud Global Load Balancer with CDN enabled was used.
- Cloudflare was used for TLS (HTTPS).

In [None]:
public_gcs_link = []
private_gcs_link = []
public_cdn_link = []
bucket_name = "vtxdemos-fstoresearch-datasets"  # Replace with your bucket name
suffix = "g-100k"  # Replace with your suffix

storage_client = storage.Client()
bucket = storage_client.bucket(bucket_name)

total_images = len(df)
pbar = tqdm.tqdm(total=total_images, desc="Copying Images")
update_frequency = 1000  # Update tqdm every 1000 images

def process_image(args):
    index, row = args
    url = row["image_url"]
    if url is None:
        return None, None, None

    try:
        parsed_url = urlparse(url)
        filename = os.path.basename(parsed_url.path)
        response = requests.get(url, timeout=30)  # Timeout keeps a stalled download from blocking a worker thread
        response.raise_for_status()  # Raise HTTPError for bad responses (4xx or 5xx)
        blob = bucket.blob(f"etsy-{suffix}/" + filename)
        blob.upload_from_string(response.content, content_type='image/jpeg')
        public_gcs_url = blob.public_url
        private_gcs_url = f"gs://{bucket_name}/etsy-{suffix}/{filename}"
        public_cdn_url = f"https://gcpetsy.sonrobots.net/etsy-{suffix}/{filename}"  # Replace with your CDN URL
        if index % update_frequency == 0:
            pbar.update(update_frequency)  # Update tqdm less frequently
        return public_gcs_url, private_gcs_url, public_cdn_url
    except (requests.exceptions.RequestException, MissingSchema) as e:
        print(f"Error processing URL {url} at index {index}: {e}")
        return None, None, None

with concurrent.futures.ThreadPoolExecutor(max_workers=40) as executor:
    results = list(executor.map(process_image, df.iterrows()))

for public_url, private_url, cdn_url in results:
    public_gcs_link.append(public_url)
    private_gcs_link.append(private_url)
    public_cdn_link.append(cdn_url)

df["public_gcs_link"] = public_gcs_link
df["private_gcs_link"] = private_gcs_link
df["public_cdn_link"] = public_cdn_link

pbar.close()
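
The cell above uploads every file with `content_type='image/jpeg'`. If some listings point at PNG or other formats, the content type could instead be guessed from the file name. A minimal sketch using the standard library's `mimetypes`; `guess_content_type` is a hypothetical helper, not part of the notebook.

```python
import mimetypes
import os
from urllib.parse import urlparse


def guess_content_type(url: str, default: str = "image/jpeg") -> str:
    """Guess a MIME type from the file extension in the URL path.

    Falls back to `default` when the extension is missing or unknown.
    """
    filename = os.path.basename(urlparse(url).path)
    content_type, _encoding = mimetypes.guess_type(filename)
    return content_type or default


print(guess_content_type("https://example.com/images/lamp.png?size=large"))
print(guess_content_type("https://example.com/images/no_extension"))
```

The result could be passed as the `content_type` argument of `blob.upload_from_string`.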

## Set Variables and Initialize

In [None]:
project_id = "vtxdemos"
location = "us-central1"

In [None]:
vertexai.init(project=project_id, location=location)

In [None]:
embeddings_model_name = "text-embedding-004"
emb_model = TextEmbeddingModel.from_pretrained(embeddings_model_name)

In [None]:
safety_settings = [
    SafetySetting(
        category=SafetySetting.HarmCategory.HARM_CATEGORY_HATE_SPEECH,
        threshold=SafetySetting.HarmBlockThreshold.OFF
    ),
    SafetySetting(
        category=SafetySetting.HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT,
        threshold=SafetySetting.HarmBlockThreshold.OFF
    ),
    SafetySetting(
        category=SafetySetting.HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT,
        threshold=SafetySetting.HarmBlockThreshold.OFF
    ),
    SafetySetting(
        category=SafetySetting.HarmCategory.HARM_CATEGORY_HARASSMENT,
        threshold=SafetySetting.HarmBlockThreshold.OFF
    ),
]

## Embeddings

*Take a 2k sample. Only about 1,000 of the 2,000 rows have `materials` populated.*

In [None]:
df = bigquery.Client(project=project_id).query("select * from `vtxdemos.demos_us.etsy_10k` limit 2000").to_dataframe()
df['combined_text'] = df['title'].fillna('None') + ' Description: ' + df['description'].fillna('None')  + ' Tags: ' + df['tags'].fillna('None')  + ' Attributes: ' + df['attributes'].fillna('None')
df = df.drop(["min_price_usd", "max_price_usd", "pct_discount", "variations"], axis=1)
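
To make the shape of `combined_text` concrete, here is the same concatenation on a single plain-Python record. It is only an illustration: `combine_text` is a hypothetical helper, and it treats `None` the way the cell above treats missing values via `fillna('None')`.

```python
def combine_text(row: dict) -> str:
    """Build the combined text for one listing, using 'None' for missing fields."""

    def field(name: str) -> str:
        value = row.get(name)
        return "None" if value is None else str(value)

    return (
        field("title")
        + " Description: " + field("description")
        + " Tags: " + field("tags")
        + " Attributes: " + field("attributes")
    )


example = {"title": "Ceramic mug", "description": None, "tags": "gift", "attributes": "handmade"}
print(combine_text(example))
```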

*Join with another table to get the CDN Link*

In [None]:
# Join CDN -> New DF
cdn_df = bigquery.Client(project="vtxdemos").query("select * from `vtxdemos.demos_us.etsy-10k-full`").to_dataframe()
cdn_df["listing_id"] = cdn_df["listing_id"].astype(int)
_ = pd.merge(cdn_df[["listing_id", "public_cdn_link"]], df, on="listing_id", how="right")
_.dropna(inplace=True)

In [None]:
import time

start_time = time.time()
def get_embedding(text: str):
  inputs = [TextEmbeddingInput(text, "SEMANTIC_SIMILARITY")]
  return emb_model.get_embeddings(inputs)[0].values

emb_list = []
listing_id = []
error_count = 0

for num, (index, row) in enumerate(_.iterrows()):
  try:
    # Embed first so a failed call cannot leave listing_id and emb_list misaligned
    embedding = get_embedding(row["combined_text"])
    listing_id.append(int(row["listing_id"]))
    emb_list.append(embedding)
  except Exception as e:
    print(f"Error processing row {num}: {e}")
    error_count += 1
    continue  # Skip to the next row

print(time.time() - start_time)
print(f"Total errors encountered: {error_count}")
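
The loop above skips a row on the first error. For transient API failures, retrying with exponential backoff may recover some of those rows. A minimal sketch, assuming a hypothetical `with_retries` helper; the attempt count and delays are illustrative, and `flaky` only simulates a call that fails twice.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def with_retries(
    call: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call`, retrying on any exception with exponential backoff."""
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise  # Out of attempts: surface the last error to the caller
            sleep(base_delay * 2 ** attempt)  # Waits base_delay, then 2x, 4x, ...
    raise ValueError("attempts must be at least 1")


calls = {"count": 0}


def flaky() -> List[float]:
    # Simulated embedding call that fails on the first two attempts
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("transient error")
    return [0.1, 0.2]


delays: List[float] = []
vector = with_retries(flaky, attempts=3, base_delay=0.5, sleep=delays.append)
print(vector, delays)
```

In the loop above, the call could be wrapped as `with_retries(lambda: get_embedding(row["combined_text"]))`.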

In [None]:
emb_df = pd.DataFrame({"listing_id": listing_id, "embedding": emb_list})
emb_df.to_pickle("gs://vtxdemos-datasets-private/marketplace/embeddings.pkl")

### Vector Search Using ScaNN

In [None]:
emb_df = pd.read_pickle("gs://vtxdemos-datasets-private/marketplace/embeddings.pkl")

In [None]:
# Vector Retrieval Engine Using ScaNN

import scann
import numpy as np

img = np.array(emb_df["embedding"].tolist())
k = int(np.sqrt(emb_df.shape[0]))

# Search roughly 1/20 of the leaves, but always at least one
leave_search = max(1, k // 20)

searcher = scann.scann_ops_pybind.builder(img, num_neighbors=5, distance_measure="squared_l2").tree(
    num_leaves=k, num_leaves_to_search=leave_search, training_sample_size=emb_df.shape[0]).score_ah(
    2, anisotropic_quantization_threshold=0.2).reorder(5).build()
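
The tree parameters in the cell above follow a simple rule: the number of leaves is about the square root of the dataset size, and about 1/20 of those leaves are searched per query. A small sketch of that arithmetic, with a hypothetical helper name and a floor of one leaf so small datasets still search something:

```python
import math
from typing import Tuple


def scann_tree_params(num_vectors: int, search_divisor: int = 20) -> Tuple[int, int]:
    """Return (num_leaves, num_leaves_to_search) for a dataset of `num_vectors`."""
    num_leaves = max(1, math.isqrt(num_vectors))  # Integer square root of the dataset size
    num_leaves_to_search = max(1, num_leaves // search_divisor)  # Never search zero leaves
    return num_leaves, num_leaves_to_search


for size in (100, 1000, 2000):
    print(size, scann_tree_params(size))
```

Searching more leaves generally raises recall at the cost of latency, so the divisor is a tuning knob rather than a fixed constant.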

In [None]:
# Testing
texts = ["Ceramic coffee mug with the words: &quot;Fabu..."]
inputs = [TextEmbeddingInput(text, "RETRIEVAL_DOCUMENT") for text in texts]
embeddings = emb_model.get_embeddings(inputs)[0].values  # text_emb_model is not defined until the Category 3 section

neighbors, distances = searcher.search(embeddings, final_num_neighbors=10)

In [None]:
all_extracted_data = _.iloc[neighbors, :]  # ScaNN returns row positions, and the index of _ has gaps after dropna
all_extracted_data

## Category 1

In [None]:
response_schema_cat1 = {
    "type": "OBJECT",
    "properties": {
        "questions_cat1": {
            "type": "ARRAY",
            "items": {
                "type": "STRING"
            }
        },
        "answers_cat1": {
            "type": "ARRAY",
            "items": {
                "type": "STRING"
            }
        }
    }
}

# Questions Category 1
system_instructions_cat_1 = """
You are a helpful product expert for Etsy, proactively prompting customers to discover the breadth of Etsy.
Etsy is an e-commerce company with an emphasis on selling of handmade or vintage items and craft supplies.
You will be provided product information and you need to generate exactly 4 questions using this product information.
Questions should be interesting and exciting, but very short.

<Instructions>
  - 4 questions MUST be related to the product that customers usually ask about that product.
      Questions should be directly relevant to the product, addressing typical customer inquiries about its features, specifications, or usage as suggested by product details.
      Make questions very short and the created questions MUST have answers within product information.
      Do not ask very explicit questions.
  - Create 4 answers to these questions by looking up product information (context).
</Instructions>

<Rules>
Be concise, clear and smart in your questions; they will be shown as button recommendations for the customer.
</Rules>

"""

In [None]:
context_model = GenerativeModel(
    "gemini-1.5-flash-001",
    generation_config=GenerationConfig(temperature=1, response_mime_type="application/json", response_schema=response_schema_cat1),
    system_instruction=system_instructions_cat_1
)

In [None]:
def generate_gem_1(prompt: str):
  try:
    return json.loads(context_model.generate_content(prompt, safety_settings=safety_settings).text)
  except Exception as e:
    print(e)
    return {"questions_cat1": None, "answers_cat1": None}
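
`response_schema_cat1` does not constrain array lengths or mark fields as required, so a response can parse as valid JSON and still have mismatched question and answer counts. A minimal sketch of a stricter parser, assuming a hypothetical `parse_qa` helper and an expected count of 4 taken from the instructions in the prompt:

```python
import json
from typing import List, Optional, Tuple


def parse_qa(
    raw: str, q_key: str, a_key: str, expected: int = 4
) -> Optional[Tuple[List[str], List[str]]]:
    """Parse a JSON response and return (questions, answers), or None if malformed."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(payload, dict):
        return None
    questions = payload.get(q_key)
    answers = payload.get(a_key)
    if not isinstance(questions, list) or not isinstance(answers, list):
        return None
    if len(questions) != expected or len(answers) != expected:
        return None  # Counts must match the number of buttons to render
    return questions, answers


good = json.dumps({"questions_cat1": ["a?", "b?", "c?", "d?"], "answers_cat1": ["1", "2", "3", "4"]})
bad = json.dumps({"questions_cat1": ["a?"], "answers_cat1": []})
print(parse_qa(good, "questions_cat1", "answers_cat1"))
print(parse_qa(bad, "questions_cat1", "answers_cat1"))
```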

In [None]:
q_cat_1 = []
a_cat_1 = []
listing_id_1 = []

for index, row in df.iterrows():
  listing_id_1.append(row["listing_id"])
  re = generate_gem_1(row["combined_text"])
  q_cat_1.append(re["questions_cat1"])
  a_cat_1.append(re["answers_cat1"])

In [None]:
df1 = pd.DataFrame({"listing_id": listing_id_1, "q_cat_1": q_cat_1, "a_cat_1": a_cat_1})
df1.to_pickle("gs://vtxdemos-datasets-private/marketplace/cat1.pkl")

## Category 2

In [None]:
response_schema_cat2 = {
    "type": "OBJECT",
    "properties": {
        "questions_cat2": {
            "type": "ARRAY",
            "items": {
                "type": "STRING"
            },
            "min_items": 4,
            "max_items": 4
        },
        "answers_cat2": {
            "type": "ARRAY",
            "items": {
                "type": "STRING"
            },
            "min_items": 4,
            "max_items": 4
        },
    },
    "required": ["questions_cat2", "answers_cat2"]
}

system_instructions_cat_2 = """
You are a helpful product expert for Etsy, proactively prompting customers to discover the breadth of Etsy.
Etsy is an e-commerce company with an emphasis on selling of handmade or vintage items and craft supplies.
You will be provided product information and you need to generate exactly 4 questions using this product information.
Questions should be interesting and exciting, but very short.

<Instructions>
  - 4 questions should be associated with this product information but completely beyond the explicit product details, exploring potential applications, key features to consider, material properties, historical context, or broader industry standards
      These questions should pique the customer's interest and encourage them to explore the product.
      Should be very general questions for which you can search in Google Search to provide needed information
      DO not ask questions about product availability or prices.
  - Create 4 answers to these questions by using Google Search.
</Instructions>

<Rules>
Be concise, clear and smart in your questions; they will be shown as button recommendations for the customer.
</Rules>

"""

tools = [
    Tool.from_google_search_retrieval(
        google_search_retrieval=grounding.GoogleSearchRetrieval()
    ),
]

ground_model = GenerativeModel(
    "gemini-1.5-flash-001",
    generation_config=GenerationConfig(
        temperature=1.1,
        response_mime_type="application/json",
        response_schema=response_schema_cat2,
        max_output_tokens=4000),
    tools=tools,
    system_instruction=system_instructions_cat_2
)

In [None]:
def generate_gem_2(prompt: str):
  try:
    return json.loads(ground_model.generate_content(prompt, safety_settings=safety_settings).text)
  except Exception as e:
    print(e)
    return {"questions_cat2": None, "answers_cat2": None}

In [None]:
q_cat_2 = []
a_cat_2 = []
listing_id_2 = []

for index, row in _.iterrows():
  listing_id_2.append(row["listing_id"])
  re = generate_gem_2(row["combined_text"])
  q_cat_2.append(re["questions_cat2"])
  a_cat_2.append(re["answers_cat2"])

In [None]:
df2 = pd.DataFrame({"listing_id": listing_id_2, "q_cat_2": q_cat_2, "a_cat_2": a_cat_2})
df2.to_pickle("gs://vtxdemos-datasets-private/marketplace/cat2.pkl")

## Category 3

In [None]:
_.reset_index(inplace=True)

In [None]:
## Utility Class
import pandas as pd
import json

model_name = "gemini-1.5-flash-001"
embeddings_model_name = "text-embedding-004"

model = GenerativeModel(
    model_name=model_name,
    generation_config={"temperature": 1.1}
)

text_emb_model = TextEmbeddingModel.from_pretrained(embeddings_model_name)

class Gemini:
    def __init__(self):
        self.retrieval_results = None
        self.dataframe = None
        self.response_schema = {
            "type": "object",
            "properties": {
                "questions": {
                    "type": "array",
                    "items": {
                        "type": "string"
                    },
                    "min_items": 4  # Integer, not a string
                }
            }
        }

        self.prompt_template_generate_questions = """
              You are an expert in helping Etsy customers find products that complement their current selections.
              Etsy specializes in handmade, vintage items, and craft supplies. Given the provided product information:

              your task is to create 4 questions that would lead to the discovery of complementary products.

              <Instructions>
            - Generate 4 questions that would help a customer find products that pair well with this item.
              For example, if the product is a red shirt, a complementary question might be,
              "What pants would match with this shirt?"
              For example, if the product is a pair of shoes, a complementary question might be,
              "What socks would go well with these shoes?"
            - List the questions in plain text without headers, numbers, or hyphens.
            </Instructions>
            """

        self.rephraser_prompt_cat3 = """
        Your task is to rephrase the user's question {question_cat3} to explicitly mention what product it is referring to.
        Use the information provided in the product information {product_details}. The result should be a single line question.
        """

        self.prompt_after_rag = """
        You are Etsy product expert. You are helping users explore the breadth of Etsy. Your task is to answer to user's question: {question_cat3}.

        - Your goal is to recommend matching products: {matching_products}
        - Use ONLY the matching products to answer the question
        - Customer is currently looking at product mentioned in {product_details}. Keep this context in mind
        - If there is no matching products, respond with a generic message
        - Condense the response into a clear and concise summary, using bullet points whenever appropriate.
        - Be kind always and reply as descriptive as needed
        """

    def generate_questions(self, listing_info):
        try:
            response = model.generate_content(
                [self.prompt_template_generate_questions, "\n\n<Product Information>\n", listing_info],
                generation_config=GenerationConfig(
                    response_mime_type="application/json", response_schema=self.response_schema, temperature=1.1
                ),
                safety_settings=safety_settings
            )
            return response
        except Exception as e:
            print(f"Error in generate_questions: {e}")
            return 'error'

    def generate_answers(self, question_cat3, matching_products, product_details):
      try:
        prompt = [self.prompt_after_rag, f"question_cat3: {question_cat3}" , f"matching_products: {matching_products}\n", f"product_details: {product_details}"]
        response = model.generate_content(prompt, safety_settings=safety_settings)
        return response
      except Exception as e:
        print(f"Error in generate_answers: {e}")
        return 'error'

    def search_for_item_information(self, query):
        texts = [query]
        inputs = [TextEmbeddingInput(text, "RETRIEVAL_DOCUMENT") for text in texts]
        embeddings = text_emb_model.get_embeddings(inputs)[0].values

        neighbors, distances = searcher.search(embeddings, final_num_neighbors=10)

        all_extracted_data = _.loc[neighbors, :]
        self.dataframe = pd.DataFrame(all_extracted_data)

    def run(self, content):
        recommendations_list = []
        try:
            re = self.generate_questions(content).text
        except Exception as e:
            re = 'error'
            print(f"Error in run: {e}")
        if re != "error":
            question_cat3 = json.loads(re)["questions"]
            for num, question in enumerate(question_cat3):
                recommendations = {
                    "rephrased_question": "",
                    "answer": "",
                    "rec_titles": [],
                    "rec_prices": [],
                    "rec_descriptions": [],
                    "tags": [],
                    "materials": [],
                    "attributes": [],
                    "category": [],
                    "combined_text": [],
                    "public_cdn_link": [],
                }

                rephraser_contents_cat3 = [
                    self.rephraser_prompt_cat3,
                    question,
                    content,
                ]
                #rephrased_query = model.generate_content(rephraser_contents_cat3, safety_settings=safety_settings, generation_config={"temperature": 1.1}).text
                self.search_for_item_information(question)
                answer = self.generate_answers(question, self.dataframe, content).text

                recommendations["rephrased_question"] = question
                recommendations["answer"] = answer

                # Optimization: Access DataFrame rows only once
                recs = [self.dataframe.iloc[i] for i in range(5)] # Get first 5 recommendations

                for rec in recs:
                    recommendations["rec_titles"].append(rec["title"])
                    recommendations["rec_prices"].append(int(rec["price_usd"])) # Convert to native Python int
                    recommendations["rec_descriptions"].append(rec["description"])
                    recommendations["tags"].append(rec["tags"])
                    recommendations["materials"].append(rec["materials"])
                    recommendations["attributes"].append(rec["attributes"])
                    recommendations["category"].append(rec["category"])
                    recommendations["combined_text"].append(rec["combined_text"])
                    recommendations["public_cdn_link"].append(rec["public_cdn_link"])

                recommendations_list.append(recommendations)

        return recommendations_list
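
The prompt templates in the class contain placeholders such as `{question_cat3}` and `{product_details}`, but the methods pass the values as separate list items, so the placeholders reach the model unfilled. One option is to fill them with `str.format` before the call. A minimal sketch, assuming a hypothetical `build_rephraser_prompt` helper; the template text is copied from `rephraser_prompt_cat3`.

```python
REPHRASER_TEMPLATE = (
    "Your task is to rephrase the user's question {question_cat3} to explicitly "
    "mention what product it is referring to.\n"
    "Use the information provided in the product information {product_details}. "
    "The result should be a single line question."
)


def build_rephraser_prompt(question: str, product_details: str) -> str:
    """Fill the template placeholders with the actual question and product text."""
    return REPHRASER_TEMPLATE.format(
        question_cat3=question,
        product_details=product_details,
    )


prompt = build_rephraser_prompt(
    "What would pair well with this?",
    "Ceramic coffee mug Description: Handmade stoneware",
)
print(prompt)
```

Only braces in the template are interpreted by `str.format`, so product text that itself contains braces is inserted unchanged.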

In [None]:
g = Gemini()

In [None]:
error_count = 0
error_listings = []
listing_ids = []
cat_3_questions_list = []

for num, (index, row) in enumerate(_.iterrows()):  # Iterate through the entire DataFrame
    try:
        print(num+1)
        results = g.run(row["combined_text"])
        print(results)
        if results:  # g.run returns an empty list when question generation fails
            cat_3_questions = []
            for result in results:  # Iterate through the dictionaries in the output
                cat_3_questions.append(json.dumps(result)) # Convert dictionary to JSON string
            listing_ids.append(str(row['listing_id']))  # Convert listing_id to string
            cat_3_questions_list.append(cat_3_questions)
        else:
            error_count += 1
            error_listings.append(str(row['listing_id']))  # Convert listing_id to string
            print(f"Error occurred for listing_id: {row['listing_id']}")

    except Exception as e:
        error_count += 1
        error_listings.append(str(row['listing_id']))  # Convert listing_id to string
        print(f"Error occurred for listing_id: {row['listing_id']}: {e}")

final_df = pd.DataFrame({"listing_id": listing_ids, "cat_3_questions": cat_3_questions_list})  # Create the new DataFrame

print(f"Total errors encountered: {error_count}")
print(f"Listings with errors: {error_listings}")

In [None]:
final_df.to_pickle('gs://vtxdemos-datasets-private/marketplace/cat3.pkl')
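
Each entry of `cat_3_questions` is a list of JSON strings, one per generated question, so a consumer has to decode them before use. A minimal sketch of that step, with a hypothetical `decode_cat3` helper and made-up sample data:

```python
import json
from typing import Any, Dict, List


def decode_cat3(encoded_questions: List[str]) -> List[Dict[str, Any]]:
    """Turn the JSON strings stored for one listing back into dictionaries."""
    return [json.loads(item) for item in encoded_questions]


# Made-up sample in the same shape the loop above produces for one listing
sample = [
    json.dumps({"rephrased_question": "What pairs with this mug?", "rec_titles": ["Coaster set"]}),
]
decoded = decode_cat3(sample)
print(decoded[0]["rec_titles"])
```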

## Titles

In [None]:
generate_title_prompt = '''Based on the product details attached, generate a concise and engaging product title that highlights the key attributes, unique features, and selling points of the item.
Focus on creating a title that will attract potential buyers and clearly communicate the most important details about the product.

Return only one title that is less than 20 words and is in plain text. Return just the title and nothing else.

'''

In [None]:
gen_ai_model = GenerativeModel(model_name="gemini-1.5-flash-001", generation_config={"temperature": 1.1})

In [None]:
generated_titles = []
for i in range(len(df)):
  product_details = df['combined_text'][i]
  contents = [
    generate_title_prompt,
    product_details,
  ]

  response = gen_ai_model.generate_content(contents, stream=False,
                                           safety_settings=safety_settings)
  generated_titles.append(response.text)

In [None]:
df['generated_titles'] = generated_titles
df.head()
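
The title prompt asks for plain text under 20 words, but nothing in the loop enforces it. A light post-processing step could do so. A minimal sketch, assuming a hypothetical `clean_title` helper that keeps the first non-empty line, strips surrounding quotes and asterisks, and caps the word count:

```python
def clean_title(raw: str, max_words: int = 19) -> str:
    """Normalize a generated title to one plain-text line of at most `max_words` words."""
    lines = [line.strip() for line in raw.splitlines() if line.strip()]
    first_line = lines[0] if lines else ""
    text = first_line.strip("*\"' ")  # Drop markdown emphasis and wrapping quotes
    return " ".join(text.split()[:max_words])


print(clean_title('**"Handmade Ceramic Mug"**\n'))
```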

## Description Trimming

In [None]:
generate_description_prompt = '''Based on the product details provided, generate a concise yet informative description that highlights the key features, benefits, and unique aspects of the item.
Ensure the description is clear, accurate, and includes as much relevant information as possible while remaining easy to understand.

Return only one description that fits within a single paragraph and is in plain text. Return just the description and nothing else.
'''

In [None]:
generated_descriptions = []
for i in range(len(df)):
  product_details = df['combined_text'][i]
  contents = [
    generate_description_prompt,
    product_details,
  ]
  response = gen_ai_model.generate_content(contents, stream=False,
                                           safety_settings=safety_settings)
  generated_descriptions.append(response.text)

In [None]:
df['generated_descriptions'] = generated_descriptions
df.head()

## Query Recommendations

In [None]:
generate_query_prompt = '''You are an expert UX developer, and you are tasked with building
a listing text recommendation system for a product listing.

The text recommendation will be based on the product title and description. Be concise and short
since the text will be shown in the search text window for the user.

The output should be short and concise, no more than 5 words to represent the listing. Remember, this is the query recommendation.

Constraints:
- Avoid any markdown, asterisks, number symbols, punctuation, or special characters.
- Output (only 1) is in plain text.

'''

In [None]:
generated_queries = []
for i in range(len(df)):
  product_details = df['combined_text'][i]
  contents = [
    generate_query_prompt,
    product_details,
  ]
  response = gen_ai_model.generate_content(contents, stream=False,
                                           safety_settings=safety_settings)
  generated_queries.append(response.text)

In [None]:
df['generated_queries'] = generated_queries
df.head()
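
The query prompt asks for at most 5 words with no punctuation or special characters. Since the model may not always comply, the output could be sanitized before it is stored. A minimal sketch with a hypothetical `clean_query` helper:

```python
import re


def clean_query(raw: str, max_words: int = 5) -> str:
    """Strip punctuation and markdown symbols, then keep at most `max_words` words."""
    # Replace anything that is not a letter, digit, underscore, or whitespace
    text = re.sub(r"[^\w\s]", " ", raw)
    return " ".join(text.split()[:max_words])


print(clean_query("**Rustic oak coffee table, handmade!**\n"))
```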