# Use Embeddings to Cluster Products Based on Descriptions

In this lab, you're going to generate the embeddings of a list of product descriptions.

Embeddings represent the use of language in a numerical space, so you can use distance measurements between these embeddings to calculate similarity between them.
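To make the idea of "distance as similarity" concrete, here is a minimal sketch of cosine similarity, one common way to compare embedding vectors. The three short vectors and their names are invented for illustration; real embeddings from the model have hundreds of dimensions.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors, invented for illustration only.
guitar = [0.9, 0.1, 0.0]
ukulele = [0.8, 0.2, 0.1]
shampoo = [0.0, 0.2, 0.9]

sim_close = cosine_similarity(guitar, ukulele)
sim_far = cosine_similarity(guitar, shampoo)
print(f"guitar vs ukulele: {sim_close:.3f}")
print(f"guitar vs shampoo: {sim_far:.3f}")
```

Vectors that point in similar directions score close to 1, while unrelated ones score close to 0.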

## Setup
To access Google Cloud's embedding models, you need to set up the client to access Vertex AI. Start by installing one of the latest versions of the Python client library.

All of the imports will be done up front to keep the notebook tidy.


### Installs
Make sure we have a new enough version of the Vertex AI Python client library to use a text embedding model.

In [1]:
!pip3 install --quiet "google-cloud-aiplatform>=1.40"

After the install, we need to restart the kernel to use the new library version.

In [2]:
import IPython

app = IPython.Application.instance()
app.kernel.do_shutdown(True)

{'status': 'ok', 'restart': True}

### Imports
Import all of the modules needed for the notebook.

Most modules in the Python client library belong to the package `google.cloud`, but Vertex AI is an exception.



In [1]:
import vertexai
from vertexai.language_models import TextEmbeddingModel

from sklearn.cluster import KMeans

import google.auth
from google.cloud import storage
from google.cloud import aiplatform

import math
import numpy as np
import pandas as pd

from tqdm.auto import tqdm

from typing import Generator, List, Optional, Tuple
import functools
import time
from concurrent.futures import ThreadPoolExecutor

import pickle
from urllib.request import urlopen

from collections import Counter

### User Authentication
If you are using Colab, uncomment this cell and run it to authenticate as your user for Google Cloud.

Shortcut: Select the cell contents and press `Ctrl/Cmd + /`.

In [2]:
# from google.colab import auth
# auth.authenticate_user()

Run the cell below to retrieve the credentials for initializing access to Vertex AI.

In [3]:
credentials, _ = google.auth.default()

### Client setup

Configure your project ID in `MY_PROJECT`.

If you want to use a specific Vertex AI location, set it as `VERTEX_LOCATION`. We'll default to us-central1.

In [5]:
MY_PROJECT = 'qwiklabs-gcp-04-192b6fce6166'
VERTEX_LOCATION = 'us-central1'

In [6]:
vertexai.init(
    project=MY_PROJECT,
    location=VERTEX_LOCATION,
    #credentials set above
    credentials=credentials
)

## Data
The product description data we are using was gathered for the paper

**Justifying recommendations using distantly-labeled reviews and fine-grained aspects**\
Jianmo Ni, Jiacheng Li, Julian McAuley\
_Empirical Methods in Natural Language Processing (EMNLP)_, 2019\
[pdf](http://cseweb.ucsd.edu/~jmcauley/pdfs/emnlp19a.pdf)

This [dataset](https://cseweb.ucsd.edu/~jmcauley/datasets/amazon_v2/index.html) includes many additional fields that you're not going to use, including reviews from users. To keep this lab more focused, the data has already been parsed and cleaned. We've taken a sample of 2,000 products each across 5 categories, for a total of 10,000 products. The clustering model you'll train will be used to attempt to rediscover those categories!

The cleaned data only includes the name of the item and the description. This is formatted into a CSV file, which is hosted on Cloud Storage.

The five original categories are as follows:
* all beauty
* appliances
* musical instruments
* pantry
* software

In [7]:
URL = 'https://storage.googleapis.com/cloud-training/specialized-training/model_garden/products.csv.gz'
products_df = pd.read_csv(URL)

You can preview the data in the following cell. Verify that there are indeed 10,000 rows of data consisting of two columns: name and description.

In [8]:
products_df

Unnamed: 0,name,description
0,"NAILTEK CITRA Formula 3 Protection for Dry, Br...","NAILTEK CITRA Formula #3 Protection for Dry, B..."
1,Wildberry Incense Sticks: Vanilla,"Vanilla incense is smooth, creamy and delectable."
2,Wonder Pro Professional Red Rubber Sponge #010...,Wonder Pro Professional Red Sponges-2 pack
3,Ronson 2.75 oz(78 Gram) Butane Multi-Fill/12 Pk.,Fills up to 20 disposable lighters per bottle.
4,"NUKSIT 10% Sulfur Ointment - Large tub 4oz, Po...",This 4 oz size is an excellent value. Swiss fo...
...,...,...
9995,Mayo Clinic Family Health 3.0 (PC CD Jewel Case),Mayo Clinic one of the most trusted names in m...
9996,SecurErase 8: Permanently Erase Your Hard Drive,SecurErase overwrites the data that sits unuse...
9997,Finale 2008-2009 Tutorial DVD,156 videos with a running time of over 8.5 hou...
9998,Putt-Putt Joins the Circus - PC/Mac,Step Right Up! Discover the Circus with Putt-...


In [9]:
len(products_df)

10000

## Embeddings Model


With the product information loaded, you can now get the embeddings for the description. Start by loading the model. Check the model card in Model Garden for how to get started, or view the API documentation.

[Embeddings for Text model card in Model Garden](https://console.cloud.google.com/vertex-ai/publishers/google/model-garden/textembedding-gecko)

[TextEmbeddingModel API](https://cloud.google.com/python/docs/reference/aiplatform/latest/vertexai.language_models.TextEmbeddingModel)

[TextEmbeddingModel code](https://github.com/googleapis/python-aiplatform/blob/bbffb0d5bfe0509399c801d849311a6201caa633/vertexai/language_models/_language_models.py#L2105)

In [10]:
model = TextEmbeddingModel.from_pretrained("text-embedding-004")

Confirm that we are using the model from Model Garden that we expect, which is the TextEmbeddingModel.

In [11]:
type(model)

vertexai.language_models.TextEmbeddingModel

Let's try calling the model with a couple of example sentences, just to see the basic process. We'll print the first 15 dimensions of each vector to see what the result looks like.

In [12]:
embeddings = model.get_embeddings(["Dinner in New York City", "Dinner in Paris"])
for embedding in embeddings:
    vector = embedding.values
    print(vector[:15])
    print('\n-----\n')

[0.044721949845552444, -0.04241640120744705, 0.05587426573038101, -0.029217351227998734, 0.02876843884587288, -0.01948491856455803, 0.008314436301589012, 0.016210688278079033, 0.014452790841460228, 0.02895222045481205, -0.002439456759020686, 0.002695783507078886, 0.046391382813453674, -0.08300202339887619, -0.03702150285243988]

-----

[0.002556765917688608, -0.07282502949237823, 0.016707293689250946, 0.0009703844552859664, 0.03553098440170288, -0.03356371074914932, -0.00041769977542571723, -0.003981488291174173, 0.00450397003442049, 0.05094684287905693, 0.01801416277885437, -0.0369516983628273, 0.04498067870736122, -0.04002223536372185, -0.042615052312612534]

-----



In [13]:
print(f'Response type: {type(embeddings)}')
print(f'Each response type: {type(embeddings[0])}')

Response type: <class 'list'>
Each response type: <class 'vertexai.language_models.TextEmbedding'>


The Python client puts the responses from the model into a list in order of the requested sentences.

In [14]:
len(embeddings)

2

Each TextEmbedding holds a list of floating point values, one for each of the 768 dimensions this model uses to represent the meaning of text. Later, we'll convert these lists of floats to numpy ndarrays for easier use.

In [15]:
print(f'Values: {len(embeddings[0].values)}')
print(f'Value type: {type(embeddings[0].values[0])}')

Values: 768
Value type: <class 'float'>


According to the [documentation](https://cloud.google.com/vertex-ai/docs/generative-ai/embeddings/get-text-embeddings#get_text_embeddings_for_a_snippet_of_text), the API has a limit of 5 input texts (product descriptions in our case) per API call. With a batch size of 5 and 10,000 product descriptions to embed, we'll need to make 2,000 calls to the API. It's time to turn the basic example into a function for more utility.
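The arithmetic behind that estimate can be sketched as follows. The rate of 20 calls per minute is an assumption taken from the default used later in this notebook; substitute your own project's quota.

```python
import math

num_descriptions = 10_000   # products in this lab's sample
batch_size = 5              # input texts per API call, per the limit above
api_calls_per_minute = 20   # assumed quota; check your own project's limit

num_calls = math.ceil(num_descriptions / batch_size)
minutes_needed = num_calls / api_calls_per_minute

print(f"API calls needed: {num_calls}")
print(f"Minimum time at {api_calls_per_minute} calls/min: {minutes_needed:.0f} minutes")
```

At that rate, the full dataset takes well over an hour, which is why the lab later provides precomputed embeddings.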

#### Function to call the model
This function does what our basic example does, which is call the `get_embeddings` method to convert our text. It adds very basic error handling, which should be expanded for production use.

In [16]:
# Define an embedding method that uses the model
def encode_texts_to_embeddings(sentences: List[str]) -> List[Optional[List[float]]]:
    try:
        embeddings = model.get_embeddings(sentences)
        return [embedding.values for embedding in embeddings]
    except Exception:
        return [None for _ in range(len(sentences))]
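
One way the error handling could be expanded is to retry failed calls with exponential backoff before giving up. This is only a sketch: `call_with_retries`, its parameters, and the `flaky_embed` stand-in are hypothetical names invented for this example, not part of the Vertex AI client library.

```python
import time
from typing import Callable, List, Optional, Sequence


def call_with_retries(
    embed_fn: Callable[[Sequence[str]], List[List[float]]],
    sentences: Sequence[str],
    max_attempts: int = 3,
    base_delay_seconds: float = 1.0,
) -> List[Optional[List[float]]]:
    """Call embed_fn, retrying with exponential backoff before giving up."""
    for attempt in range(max_attempts):
        try:
            return list(embed_fn(sentences))
        except Exception as exc:  # narrow this to specific error types in real code
            if attempt == max_attempts - 1:
                print(f"Giving up after {max_attempts} attempts: {exc!r}")
                break
            # Wait 1x, 2x, 4x, ... the base delay between attempts.
            time.sleep(base_delay_seconds * (2 ** attempt))
    # Same failure convention as encode_texts_to_embeddings: one None per input.
    return [None for _ in sentences]


# Hypothetical stand-in for the model call: fails once, then succeeds.
attempts = {"count": 0}


def flaky_embed(sentences: Sequence[str]) -> List[List[float]]:
    attempts["count"] += 1
    if attempts["count"] == 1:
        raise RuntimeError("simulated transient error")
    return [[0.0, 1.0] for _ in sentences]


result = call_with_retries(flaky_embed, ["a", "b"], base_delay_seconds=0.01)
print(result)
```

In the notebook, you could pass a small function that wraps `model.get_embeddings` and returns the `.values` of each result in place of `flaky_embed`.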

#### Define two more helper functions for converting text to embeddings

- generate_batches:  This method splits `sentences` into batches of 5 before sending to the embedding API.
- encode_text_to_embedding_batched: This method calls `generate_batches` to handle batching and then calls the embedding API via `encode_texts_to_embeddings`. It also handles rate-limiting using `time.sleep`. For production use cases, you would want a more sophisticated rate-limiting mechanism that takes retries into account.

In [17]:
# Generator function to yield batches of descriptions
def generate_batches(
    descriptions: List[str], batch_size: int
) -> Generator[List[str], None, None]:
    for i in range(0, len(descriptions), batch_size):
        yield descriptions[i : i + batch_size]


def encode_text_to_embedding_batched(
    descriptions: List[str], api_calls_per_minute: int = 20, batch_size: int = 5
) -> Tuple[List[bool], np.ndarray]:

    embeddings_list: List[List[float]] = []

    # Prepare the batches using a generator
    batches = generate_batches(descriptions, batch_size)

    seconds_per_job = 60 / api_calls_per_minute

    with ThreadPoolExecutor() as executor:
        futures = []
        for batch in tqdm(
            batches, total=math.ceil(len(descriptions) / batch_size), position=0
        ):
            futures.append(
                executor.submit(functools.partial(encode_texts_to_embeddings), batch)
            )
            time.sleep(seconds_per_job)

        for future in futures:
            embeddings_list.extend(future.result())

    # One flag per input description: True if its embedding came back
    is_successful = [embedding is not None for embedding in embeddings_list]
    embeddings_list_successful = np.squeeze(
        np.stack([embedding for embedding in embeddings_list if embedding is not None])
    )
    return is_successful, embeddings_list_successful
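
Note that the returned array only contains rows for successful requests, so if any request fails, row `i` of the array no longer lines up with row `i` of the input. The sketch below shows one way to use the `is_successful` mask to keep the two aligned. The toy descriptions and vectors are invented for illustration.

```python
from itertools import compress
from typing import List, Optional

# Toy stand-ins, invented for illustration: the second request "failed".
descriptions = ["lip balm", "drip bowl", "guitar strings"]
embeddings_list: List[Optional[List[float]]] = [[0.1, 0.2], None, [0.3, 0.4]]

# Same mask the helper builds: True where an embedding came back.
is_successful = [embedding is not None for embedding in embeddings_list]

# Keep only the descriptions that have a matching row in the embeddings array.
kept_descriptions = list(compress(descriptions, is_successful))
kept_embeddings = [e for e in embeddings_list if e is not None]

for text, vector in zip(kept_descriptions, kept_embeddings):
    print(text, vector)
```

With pandas, indexing the DataFrame with the same boolean list (for example `products_df[is_successful]`) should give the equivalent filtered rows.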

#### Generate Embeddings

To generate the embeddings, we will call the batch helper function. However, the Qwiklabs environment has a low rate quota, so this process will take a long time to complete. Luckily, the embeddings for a given text won't substantially change no matter when you do the conversion, as long as you use the same model! We can compute the embeddings once, store them, then load them whenever we need them again.

The below cell is the code for generating the embeddings. It is commented out since you won't be using it, but it does show you the process for your own use later! Make sure to change the `api_calls_per_minute` parameter based on [your own quota rate limit!](https://console.cloud.google.com/iam-admin/quotas?referrer=search&pageState=(%22allQuotasTable%22:(%22f%22:%22%255B%257B_22k_22_3A_22_22_2C_22t_22_3A10_2C_22v_22_3A_22_5C_22base_model_3Atextembedding-gecko_5C_22_22_2C_22s_22_3Atrue%257D%255D%22)))

In [18]:
# descriptions = products_df['description'].values.tolist()
# response = encode_text_to_embedding_batched(descriptions, api_calls_per_minute=20)
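
If you do run the generation yourself, one way to apply the "compute once, store, load later" idea is a pickle round trip like the sketch below. The toy tuple and the demo file name are assumptions for illustration; the real response is much larger.

```python
import os
import pickle
import tempfile

# Toy stand-in for the (is_successful, embeddings) tuple; values are invented.
response_to_save = ([True, True], [[0.1, 0.2], [0.3, 0.4]])

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name used only for this demo.
    cache_path = os.path.join(tmp_dir, "embeddings-response-demo.pkl")

    # Write once...
    with open(cache_path, "wb") as f:
        pickle.dump(response_to_save, f)

    # ...and read back whenever the embeddings are needed again.
    with open(cache_path, "rb") as f:
        restored = pickle.load(f)

print(restored == response_to_save)
```

Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.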

The full response is hosted on Cloud Storage so we can import it and pick up right where the previous cell would have left us.

In [19]:
!gsutil cp -r gs://partner-genai-bucket/genai042/embeddings-response.pkl .

Copying gs://partner-genai-bucket/genai042/embeddings-response.pkl...
- [1 files][ 58.6 MiB/ 58.6 MiB]                                                
Operation completed over 1 objects/58.6 MiB.                                     


In [20]:
file_path = 'embeddings-response.pkl'

# Open the file in binary mode and load the data
with open(file_path, 'rb') as file:
    response = pickle.load(file)

The `response` created by the `encode_text_to_embedding_batched` function is a Tuple of (List[bool], ndarray).

The first tuple item represents whether the API call was successful or not. There should be an equal number of responses to the size of the original product description list (10,000). These should all be True to indicate we have embeddings for every description.

In [21]:
print('Item 0')
print(f'\ttype: {type(response[0])}')
print(f'\tvalues: {len(response[0])}')
# Test to make sure all responses are True.
print(f'All embeddings requests completed successfully: {all(response[0])}')

Item 0
	type: <class 'list'>
	values: 10000
All embeddings requests completed successfully: True


The second tuple item is a numpy ndarray of the 768-dimension embedding vectors provided by the model for each product description.

In [22]:
print('Item 1')
print(f'\ttype: {type(response[1])}')
print(f'\tshape: {response[1].shape}')
print(f'\tvector data type: {response[1].dtype}')

Item 1
	type: <class 'numpy.ndarray'>
	shape: (10000, 768)
	vector data type: float64


Even though the giant string of floating point numbers doesn't mean anything to us as humans, it's still interesting to see.

In [23]:
print(f'Original text: \n{products_df["description"][0]}\nEmbedding vector:')
print(response[1][0])

Original text: 
NAILTEK CITRA Formula #3 Protection for Dry, Brittle Nails .47oz  Conditions brittle dry nails, replenishes the natural moisture of the nails and provides the hydration necessary to resist stress and enhance survivability.
Embedding vector:
[ 1.68416835e-03  2.64479499e-02  3.66069041e-02 -5.85494712e-02
 -4.18350846e-02 -2.53428388e-02 -2.47018486e-02 -3.58675458e-02
  5.83306560e-03 -3.72460415e-03  5.53904437e-02  5.05214296e-02
  2.63037439e-02  3.30709517e-02 -9.58084688e-03 -1.33588593e-02
  8.42376333e-03  4.07868475e-02 -1.50383309e-01  5.11450805e-02
  3.67602259e-02  3.07591213e-03  5.02366163e-02  4.37612422e-02
 -3.92399468e-02  2.02372279e-02  6.84936577e-03 -4.06099629e-04
 -3.27024385e-02 -7.20532686e-02  1.49083668e-02  2.95301061e-02
 -7.84689374e-03 -1.34938024e-02  8.97528306e-02  1.72084793e-02
 -1.23068672e-02  7.31257023e-03 -4.16205972e-02 -5.06948344e-02
 -3.90185323e-03 -6.15222305e-02 -1.38300401e-03 -1.54975476e-02
 -2.87686139e-02  6.25725687

## Clustering Model
With the embeddings computed, you can use those to create categories representing similar products. Notice that the dataset you loaded is not labeled for which category each product belongs to. However, those product categories do exist in the source data.

You'll use a K-Means model to find similarities in the embedding vectors. Using scikit-learn, the [KMeans model](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html) directly accepts an ndarray, which is exactly the format our helper functions produced for the embedding vectors!

As mentioned, the sample data came from 5 product categories, so we're going to have the model learn 5 clusters. With the small dataset we're using, training the clustering model will take almost no time.

In [24]:
embeddings = response[1]
kmeans = KMeans(n_clusters=5, n_init="auto").fit(embeddings)

Fitting the model also stores the predicted cluster for each vector used in training in the `labels_` attribute.

In [25]:
len(kmeans.labels_)

10000

We can sanity check that by calling the `predict` function directly, which is also how you'd cluster future products now that we've trained the model!

In [26]:
predictions = kmeans.predict(embeddings)

In [27]:
kmeans.labels_[:20]

array([3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
      dtype=int32)

In [28]:
predictions[:20]

array([3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
      dtype=int32)
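
Conceptually, predicting with a fitted K-Means model means assigning each vector to the cluster whose centroid is closest in Euclidean distance (scikit-learn exposes the fitted centroids as `kmeans.cluster_centers_`). Here is a minimal pure-Python sketch of that assignment step, using invented 2-dimensional centroids and points rather than the real 768-dimension data.

```python
import math
from typing import List, Sequence


def nearest_centroid(
    vector: Sequence[float], centroids: Sequence[Sequence[float]]
) -> int:
    """Return the index of the centroid with the smallest Euclidean distance."""
    distances = [math.dist(vector, centroid) for centroid in centroids]
    return distances.index(min(distances))


# Toy 2-dimensional centroids and points, invented for illustration.
centroids = [[0.0, 0.0], [10.0, 10.0]]
points = [[1.0, -1.0], [9.0, 11.0], [4.0, 4.0]]

assignments: List[int] = [nearest_centroid(p, centroids) for p in points]
print(assignments)
```

The cluster numbers themselves are arbitrary labels; only which points share a label is meaningful.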

#### Evaluate the clusters
Let's attach the predicted clusters to the original data and see some example products from each cluster to see how well the model performed.

As a reminder, the data has 2,000 products from each of 5 categories.

In [29]:
products_df['category'] = predictions

In [30]:
products_df.groupby('category').count()['name']

category
0    1590
1    1877
2    2119
3    2453
4    1961
Name: name, dtype: int64

We can see that the model did not recreate the categories perfectly, as there are not exactly 2,000 products in each cluster. However, the performance isn't terrible for having spent about one second training.

What you should also know about the input data is that it was sorted by category. Each block of 2,000 rows is one product category. Let's see how well the model predicts each of the different categories.

In [31]:
# Create named slices for easier reference
beauty = slice(2000)
appliance = slice(2000,4000)
instrument = slice(4000,6000)
pantry = slice(6000,8000)
software = slice(8000,10000)

beauty_cnts = Counter(predictions[beauty])
appliance_cnts = Counter(predictions[appliance])
instrument_cnts = Counter(predictions[instrument])
pantry_cnts = Counter(predictions[pantry])
software_cnts = Counter(predictions[software])

product_pred_cnts = [beauty_cnts, appliance_cnts, instrument_cnts, pantry_cnts, software_cnts]
for product_cnt in product_pred_cnts:
    print(product_cnt)

Counter({3: 1803, 2: 89, 0: 82, 1: 16, 4: 10})
Counter({1: 1849, 2: 82, 3: 60, 0: 6, 4: 3})
Counter({2: 1896, 3: 69, 4: 29, 1: 5, 0: 1})
Counter({0: 1499, 3: 482, 2: 11, 4: 7, 1: 1})
Counter({4: 1912, 2: 41, 3: 39, 1: 6, 0: 2})


With the set of predictions and the counts per category, we can now label the clusters based on the highest number of predictions. With many clustering tasks, we won't have a source of ground truth like in this example, so figuring out names for your clusters will be up to you. In those cases, you'd analyze the features of the objects that your model places in the same cluster.

In [32]:
def most_likely_cluster(counted_dict):
    return max(counted_dict, key=counted_dict.get)

product_category_list = ['all_beauty', 'appliance', 'instrument', 'pantry', 'software']

product_clusters = []
for product in product_pred_cnts:
    product_clusters.append(most_likely_cluster(product))

keys = [0, 1, 2, 3, 4] # simple list of index values as the products are inserted
row_names = dict(zip(keys, product_category_list))
cluster_map = dict(zip(product_clusters, product_category_list))
print(row_names) # ascending alpha order
print(cluster_map) # predictions

{0: 'all_beauty', 1: 'appliance', 2: 'instrument', 3: 'pantry', 4: 'software'}
{3: 'all_beauty', 1: 'appliance', 2: 'instrument', 0: 'pantry', 4: 'software'}
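
The majority-vote labeling only works cleanly if every category claims a different cluster. If two categories shared the same majority cluster, `dict(zip(...))` would silently keep only the last one. A small check like this sketch can catch that case; the cluster numbers are copied from the printed output above.

```python
from collections import Counter

# Majority cluster per category, copied from the printed output above.
product_category_list = ["all_beauty", "appliance", "instrument", "pantry", "software"]
product_clusters = [3, 1, 2, 0, 4]

# Count how many categories claim each cluster as their majority cluster.
cluster_usage = Counter(product_clusters)
duplicates = [cluster for cluster, n in cluster_usage.items() if n > 1]

if duplicates:
    print(f"Clusters claimed by more than one category: {duplicates}")
else:
    cluster_map = dict(zip(product_clusters, product_category_list))
    print(f"One-to-one mapping: {cluster_map}")
```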


Add all of that information to a new DataFrame for easier visualization. The prediction counts get added in as raw information, then we'll update row and column names to match the product categories.

In [33]:
correctness_df = pd.DataFrame.from_dict(product_pred_cnts)
correctness_df.rename(index=row_names, inplace=True)
correctness_df.rename(columns=cluster_map, inplace=True)
correctness_df.sort_index(axis=0, inplace=True)
correctness_df.sort_index(axis=1, inplace=True)

In [34]:
correctness_df.head()

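| actual (rows) / predicted (columns) | all_beauty | appliance | instrument | pantry | software |
|---|---|---|---|---|---|
| all_beauty | 1803 | 16 | 89 | 82 | 10 |
| appliance | 60 | 1849 | 82 | 6 | 3 |
| instrument | 69 | 5 | 1896 | 1 | 29 |
| pantry | 482 | 1 | 11 | 1499 | 7 |
| software | 39 | 6 | 41 | 2 | 1912 |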


From this confusion matrix, we can see which categories have the most recognizable patterns of descriptions. Down the diagonal is the count of correct predictions.
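
To put a number on the diagonal, here is a short sketch that computes the overall share of correct predictions and the per-category share from the counts in the table above.

```python
# Confusion matrix copied from the table above: rows are actual categories,
# columns are predicted categories, both in the same order.
categories = ["all_beauty", "appliance", "instrument", "pantry", "software"]
confusion = [
    [1803, 16, 89, 82, 10],
    [60, 1849, 82, 6, 3],
    [69, 5, 1896, 1, 29],
    [482, 1, 11, 1499, 7],
    [39, 6, 41, 2, 1912],
]

# Correct predictions sit on the diagonal.
correct = sum(confusion[i][i] for i in range(len(categories)))
total = sum(sum(row) for row in confusion)
print(f"Overall agreement: {correct}/{total} = {correct / total:.2%}")

# Share of each actual category that landed in its own cluster.
for i, name in enumerate(categories):
    share = confusion[i][i] / sum(confusion[i])
    print(f"{name:>12}: {share:.1%}")
```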

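With this data, the model predicted software and instruments very well: 1,912 and 1,896 of the 2,000 products in those categories landed in the right cluster. The biggest single confusion was pantry products placed in the beauty cluster (482). The model also placed a fair number of beauty products (89) and appliances (82) in the instrument cluster. You can see which descriptions were confusing by selecting those products from the original dataframe.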

In [35]:
instrument_category = product_clusters[2]

In [36]:
products_df.iloc[beauty].query(f'category == {instrument_category}')

Unnamed: 0,name,description,category
41,Souvenir Palestine Metal Flag Map Keychain Pal...,Metal Palestine Flag Map Colors Key chain,2
62,PIBBS Wood Appliance Holder (Model,Professional Features:-Wooden appliance holder...,2
83,Blue Lamb Skin Leather Cigarette Case with 2 Z...,Lambskin leather cigarette case with 2 zipper ...,2
106,New Dc Comics The Joker Batman Western Mens Me...,Batman Movie Joker Character Officially Licens...,2
137,Yo-Kai Watch Series 1 Komajiro Medal [Loose],"1.75"" in diameter disc. Loose Yo-kai Medal",2
...,...,...,...
1905,Ultra (Box of 10) Corn Plane Blades,Ultra (Box of 10) Corn Plane Blades,2
1913,"color trak Color Trak Caddy, Pink, Black and Y...","Coloratura wood color brushes, 3 pack.",2
1921,"1907 NCS019 Streeterville Shear Case, 0.277 Pound",NCS019 Streeter vile Shear Case.,2
1969,JENNING MOUTH GAG 6,"JENNING MOUTH GAG 6"" G.S INSTRUMENTS",2


We can compare those descriptions with the descriptions from instruments that were correctly predicted.

In [37]:
products_df.iloc[instrument].query(f'category == {instrument_category}')

Unnamed: 0,name,description,category
4000,Oscar Schmidt OR6CEB-O-U Acoustic Electric Res...,The OR6CE is a biscuit resonator guitar with c...,2
4001,"Vacuum Tube Set for Fender Bandmaster VM Head,...",(2)T-12AX7-S-JJ (1)T-6L6GC-JJ-MP (Apex Matched...,2
4002,Liverpool Double Star Drum Stick/Mallet Comb S...,Liverpool Drum Sticks Mallet / Drum Stick Combo,2
4003,Didgeridoo Store Decorative Didgeridoo Midnigh...,Didgeridoo Store Decorative Didgeridoo Midnigh...,2
4004,WellieSTR (A Pair) Real Leather Accordion/Acco...,(A Pair) Real Leather Accordion/Accordian Bell...,2
...,...,...,...
5995,MBT Lighting PAR56 Par Can - Black,Par56 Par Can. Made by MBT Lighting.,2
5996,Blast King I49BMIC20B Female XLR to 1/4-Inch P...,"Blast King MICROPHONE CABLE, FEMALE XLR TO 1/4"" P",2
5997,Kona Guitars KA15T 10-Watt Amplifier with Buil...,Kona 10 Watt Amplifier with Built-in Tuner and...,2
5998,BEHRINGER STUDIO CONDENSER MICROPHONE T-47,Vacuum Tube Condenser Microphone,2


They don't look that similar to me, but they apparently look similar to the model at the embedding vector level!

### Save the model
Having the model available in the notebook is all well and good, but right now, it only lives in memory and is only accessible from within the notebook. That's not very useful. We can do better. First, let's save the model locally on the machine to prevent us from losing it during a reboot. We can use Pickle to export the model to disk.

In [38]:
model_local_file = 'model.pkl'
with open(model_local_file, 'wb') as model_file:
    pickle.dump(kmeans, model_file)

At least now the model is persisted to local storage, but it's still not very available. A much better place to store it is Cloud Storage!

We'll default to using a Cloud Storage bucket name based on our project ID. You can set this to any Cloud Storage bucket you have access to.

In [39]:
GCS_BUCKET=f"{MY_PROJECT}-productcluster"

In [40]:
gcs_client = storage.Client()

The following cell will create the bucket. If you set the variable to an existing bucket, skip the following cell or it will throw an error.

In [41]:
gcs_client.create_bucket(GCS_BUCKET)

<Bucket: qwiklabs-gcp-04-192b6fce6166-productcluster>

Now we can upload the model to Cloud Storage.

In [42]:
bucket = gcs_client.get_bucket(GCS_BUCKET)
gcs_model_path = 'embeddings_product_cluster'
blob = bucket.blob(f'{gcs_model_path}/{model_local_file}')
blob.upload_from_filename(model_local_file)

To reload your model for future use in a notebook, simply retrieve the pickled model file from Cloud Storage, then use pickle to load it again! The cell below reuses the local copy that is still on disk.

In [43]:
with open(model_local_file, 'rb') as f:
    reloaded_kmeans = pickle.load(f)

reloaded_kmeans.predict(embeddings[:10])

array([3, 3, 3, 3, 3, 3, 3, 3, 3, 3], dtype=int32)

Since we still have the original in memory, we can easily show the predictions from the reloaded model are the same as the original.

In [44]:
kmeans.predict(embeddings[:10])

array([3, 3, 3, 3, 3, 3, 3, 3, 3, 3], dtype=int32)

### Model Registry
Being in Cloud Storage means you're going to have 11 nines (99.999999999%) of durability, so you know the model won't be lost, and now you can download it in other notebooks to use.

However, we can still do better!

You can't use the model directly from Cloud Storage to make predictions. You'd have to spin up a notebook or other server, retrieve the model, then load the model back into memory to use it again. That's a lot of work, and it doesn't let you use this model for any kind of real-time predictions.

Instead, wouldn't it be great if we could directly send embeddings to our model and not have to worry about the computing behind it? By adding the model to Model Registry, we can unlock the full power of Vertex AI to run our model at scale!

According to the [documentation](https://cloud.google.com/vertex-ai/docs/training/exporting-model-artifacts#scikit-learn), Vertex AI expects the model to be named `model.pkl` when importing to Model Registry, which is why we named it that above. To register the model, you pass in the directory (really, key prefix) for where Model Registry can find the pickled model in Cloud Storage. You also specify a serving container image, which your model will be injected into to run on Vertex AI.

In [46]:
model_display_name = 'cluster_products'
artifact_directory_uri = f'gs://{GCS_BUCKET}/{gcs_model_path}'
serving_container_image = 'us-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.1-2:latest'

vai_model = aiplatform.Model.upload(
            display_name=model_display_name,
            artifact_uri=artifact_directory_uri,
            serving_container_image_uri=serving_container_image)

vai_model.wait()

Creating Model
Create Model backing LRO: projects/291712904174/locations/us-central1/models/411310807276584960/operations/3549415571038666752
Model created. Resource name: projects/291712904174/locations/us-central1/models/411310807276584960@1
To use this Model in another session:
model = aiplatform.Model('projects/291712904174/locations/us-central1/models/411310807276584960@1')


Now you can easily send batch predictions to your model, or publish it to an Endpoint for real-time applications!

### Use the model
Finally, with your model no longer in danger of being lost, and being substantially more usable in Model Registry, let's turn back to our local copy of the model to predict a few more products.

Here are a few more cleaned-up product descriptions from the original dataset that were not included in the sample, one from each category.

In [47]:
new_products = ['Kinetronics StaticWisk Brush-7/8',
                'Range Kleen GE/Hotpoint Large Porcelain Drip Bowl 8"',
                'Game of Thrones (Theme from the HBO series) - EASY PIANO Sheet Music Single',
                'CADBURY Chocolate Candy Bar, English Toffee, 5.4 Ounce',
                'Star Trek: The Game Show']
new_descriptions = ['Fine quality anti-static lens brush. Made from a special blend of soft, natural hair and a conductive fiber. To use the brush, simply sweep the lens. The brush has a resistivity of 10-1 and will dissipate any static charge and release the dust.',
                    'Are your burner reflector bowls beyond rescue? Reasonably priced and easy to clean, replacement bowls will have your stove looking spiffy in a jiffy!',
                    'EASY Piano version of the popular theme song from the HBO series.',
                    'Enjoy the rich taste of premium milk chocolate with the satisfying crunch of English toffee. Make any moment more delicious with an Cadbury milk chocolate with English toffee pouch.',
                    'Star Trek: The Game Show'
]

First, generate the text embeddings.

In [48]:
new_embeddings = model.get_embeddings(new_descriptions)

The returned embeddings are in a TextEmbedding class. Convert that to an ndarray of floats to pass to the clustering model.

In [49]:
new_embeddings_nd = np.squeeze(np.stack([embedding.values for embedding in new_embeddings if embedding is not None]))
new_predictions = kmeans.predict(new_embeddings_nd)

In [50]:
new_predictions_clusters = [cluster_map[x] for x in new_predictions]
new_predictions_clusters

['instrument', 'appliance', 'instrument', 'pantry', 'software']

How did your model do?

When I ran this, the model got 4 of 5 correct. Not bad! It placed the beauty product (the lens brush) in the instrument cluster, which was the most common wrong cluster for beauty products in my confusion matrix.

## Congratulations!

In this lab, you've converted a set of product descriptions into a vector representation using a text embedding model. With that contextual information, you trained a clustering model to predict the category of the product based on the description. You then persisted the model to Cloud Storage, and most powerfully, added it to Model Registry. It's now ready to use in your Vertex AI pipelines!
