## Lesson 3: Visualizing Embeddings

#### Project environment setup

- Load credentials and the relevant Python libraries.

In [None]:
import os
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file
PROJECT_ID = os.environ['PROJECT_ID']
REGION = os.environ['REGION']
print(f"PROJECT_ID: {PROJECT_ID}")
print(f"REGION: {REGION}")

## Embeddings capture meaning

In [None]:
in_1 = "Missing flamingo discovered at swimming pool"

in_2 = "Sea otter spotted on surfboard by beach"

in_3 = "Baby panda enjoys boat ride"


in_4 = "Breakfast themed food truck beloved by all!"

in_5 = "New curry restaurant aims to please!"


in_6 = "Python developers are wonderful people"

in_7 = "TypeScript, C++ or Java? All are great!"


input_text_lst_news = [in_1, in_2, in_3, in_4, in_5, in_6, in_7]

In [None]:
in_1 = "These apples are always crisp and delicious! My family loves them."
in_2 = "I bought these apples, and they were mushy and tasted bad. Not happy at all."
in_3 = "The cereal selection here is fantastic. I can always find my favorite brands."
in_4 = "The cereal I bought was stale. It was like eating cardboard."

in_5 = "Just discovered the new organic section at my local grocery store! 🌿 So excited to shop for healthier options. Thanks, @GreenGroceryCo! #HealthyEating #OrganicFood"
in_6 = "Visited @SuperMart today for my weekly fruit haul. Their produce section is always on point! 🍎🥭🍇 #FreshProduce #SuperMartLove"
in_7 = "Seriously disappointed with my recent trip to @BudgetMart. Out of stock on half the items I needed. 😡 #PoorService #BudgetMartFail"

in_8 = "In a recent survey of 1,000 consumers, 72% indicated a growing preference for organic and locally sourced products when shopping for groceries."
in_9 = "Our market research shows that plant-based meat alternatives have seen a 35% increase in sales over the past year, indicating a rising trend in health-conscious choices."
in_10 = "Analysis of competitor pricing strategies reveals that Grocery Chain A consistently offers lower prices on staple products, attracting cost-conscious shoppers."
in_11 = "Our research indicates that urban millennials are the fastest-growing segment of online grocery shoppers, with a 42% increase in the past two years."

in_12 = "I've noticed some safety hazards in the back storage area that need attention. It's crucial to address these issues to ensure the safety of our team and prevent accidents."
in_13 = "Our team has been working exceptionally hard lately, and it would be motivating to receive more recognition for our efforts, perhaps through an 'Employee of the Month' program."
in_14 = "I'd like to see more opportunities for professional development and training. It would benefit both employees and the company as we stay updated on industry trends."
in_15 = "Improving communication between shifts and departments would help streamline operations and reduce misunderstandings. Clearer communication channels are essential."


input_text_lst_news = [
    in_1,
    in_2,
    in_3,
    in_4,
    in_5,
    in_6,
    in_7,
    in_8,
    in_9,
    in_10,
    in_11,
    in_12,
    in_13,
    in_14,
    in_15,
]

In [None]:
import numpy as np
from vertexai.language_models import TextEmbeddingModel

embedding_model = TextEmbeddingModel.from_pretrained("textembedding-gecko@001")

- Get embeddings for all pieces of text.
- Store them in a 2D NumPy array (one row for each embedding).

In [None]:
embeddings = []
for input_text in input_text_lst_news:
    emb = embedding_model.get_embeddings([input_text])[0].values
    embeddings.append(emb)

embeddings_array = np.array(embeddings)

In [None]:
print("Shape: " + str(embeddings_array.shape))
print(embeddings_array)
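
To make the array layout concrete, here is a minimal sketch that stacks placeholder vectors the same way. `fake_embedding` is a hypothetical stand-in for the model call (it returns random numbers, not real embeddings), and the dimension of 768 is the one this lesson states for the model.

```python
import numpy as np

EMBEDDING_DIM = 768  # dimension this lesson states for the embedding model


def fake_embedding(text, dim=EMBEDDING_DIM):
    """Hypothetical stand-in for the model call: returns a list of floats."""
    rng = np.random.default_rng(len(text))  # seeded so the sketch is repeatable
    return rng.normal(size=dim).tolist()


texts = ["first sentence", "a second, longer sentence", "third"]

# One list of floats per text, stacked into a 2D array (one row per text).
rows = [fake_embedding(t) for t in texts]
stacked = np.array(rows)

print(stacked.shape)  # (3, 768)
```

Each row is one embedding and each column is one dimension, which is the layout the PCA and similarity steps later in the lesson work with.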

#### Reduce embeddings from 768 to 2 dimensions for visualization
- We'll use principal component analysis (PCA).
- You can learn more about PCA in [this video](https://www.coursera.org/learn/unsupervised-learning-recommenders-reinforcement-learning/lecture/73zWO/reducing-the-number-of-features-optional) from the Machine Learning Specialization. 

In [None]:
from sklearn.decomposition import PCA

# Perform PCA for 2D visualization
PCA_model = PCA(n_components=2)
PCA_model.fit(embeddings_array)
new_values = PCA_model.transform(embeddings_array)
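
For intuition about what the projection computes, here is a rough NumPy-only sketch of PCA using the singular value decomposition. It is not scikit-learn's implementation: `pca_2d` is a hypothetical helper, the data is random placeholder data, and the signs of the resulting components may differ from those returned by `PCA`.

```python
import numpy as np


def pca_2d(X):
    """Project the rows of X onto its first two principal components."""
    X = np.asarray(X, dtype=float)
    X_centered = X - X.mean(axis=0)  # PCA works on mean-centered data
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:2].T


rng = np.random.default_rng(0)
toy_embeddings = rng.normal(size=(15, 768))  # placeholder data, not real embeddings
projected = pca_2d(toy_embeddings)

print(projected.shape)  # (15, 2)
```

The first column carries the most variance and the second column the next most, which is why two components are a reasonable choice for a scatter plot.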

In [None]:
import pandas as pd
text = pd.DataFrame(input_text_lst_news, columns=["text"])
text

In [None]:
list(text.columns)

In [None]:
from utils import umap_plot
umap_plot(emb=embeddings_array, text=text)

In [None]:
print("Shape: " + str(new_values.shape))
print(new_values)

In [None]:
import matplotlib.pyplot as plt
import mplcursors
%matplotlib ipympl

from utils import plot_2D
plot_2D(new_values[:,0], new_values[:,1], input_text_lst_news)

#### Embeddings and Similarity
- Plot a heat map to compare the embeddings of sentences that are similar and sentences that are dissimilar.

In [None]:
in_1 = """He couldn’t desert 
          his post at the power plant."""

in_2 = """The power plant needed 
          him at the time."""

in_3 = """Cacti are able to 
          withstand dry environments."""

in_4 = """Desert plants can 
          survive droughts."""

input_text_lst_sim = [in_1, in_2, in_3, in_4]

In [None]:
embeddings = []
for input_text in input_text_lst_sim:
    emb = embedding_model.get_embeddings([input_text])[0].values
    embeddings.append(emb)

embeddings_array = np.array(embeddings)

In [None]:
from utils import plot_heatmap

y_labels = input_text_lst_sim

# Plot the heatmap
plot_heatmap(embeddings_array, y_labels=y_labels, title="Embeddings Heatmap")

Note: the heat map can't show everything at once because there are 768 columns. To adjust the heat map with your mouse:
- Hover your mouse over the heat map. Buttons will appear on the left of the heat map. Click the button with the vertical and horizontal double arrows (they look like axes).
- Left-click and drag to move the heat map left and right.
- Right-click and drag up to zoom in.
- Right-click and drag down to zoom out.

#### Compute cosine similarity
- The `cosine_similarity` function expects 2D arrays, which is why we wrap each embedding list inside another list.
- You can verify that sentences 1 and 2 have a higher similarity than sentences 1 and 4, even though sentences 1 and 4 both contain the words "desert" and "plant".

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

In [None]:
def compare(embeddings, idx1, idx2):
    # Wrap each embedding in a list so cosine_similarity receives 2D inputs.
    return cosine_similarity([embeddings[idx1]], [embeddings[idx2]])
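
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. Below is a minimal NumPy sketch of that formula on toy 3-dimensional vectors rather than real embeddings. `cosine` is a hypothetical helper, not part of scikit-learn.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity of two 1D vectors: dot product over product of norms."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


v1 = [1.0, 2.0, 3.0]
v2 = [2.0, 4.0, 6.0]   # points in the same direction as v1
v3 = [3.0, 0.0, -1.0]  # orthogonal to v1 (dot product is 0)

print(cosine(v1, v2))  # close to 1: same direction
print(cosine(v1, v3))  # close to 0: unrelated directions
```

Unlike `compare`, which returns a 1x1 array because it passes 2D inputs to scikit-learn, this helper returns a plain float.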

In [None]:
print(in_1)
print(in_2)
print(compare(embeddings, 0, 1))

In [None]:
print(in_1)
print(in_4)
print(compare(embeddings, 0, 3))
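
To compare every sentence with every other one in a single step, you can compute the full similarity matrix. The sketch below does this with plain NumPy on toy 2-dimensional vectors. `pairwise_cosine` is a hypothetical helper, and with the lesson's data you would pass `embeddings_array` instead of `toy`.

```python
import numpy as np


def pairwise_cosine(X):
    """Return a matrix whose [i, j] entry is the cosine similarity of rows i and j."""
    X = np.asarray(X, dtype=float)
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)  # scale each row to length 1
    return unit @ unit.T


toy = np.array([
    [1.0, 0.0],
    [1.0, 1.0],
    [0.0, 1.0],
    [-1.0, 0.0],
])
sim = pairwise_cosine(toy)

print(np.round(sim, 3))
```

The diagonal is 1 because each vector is identical to itself, and the matrix is symmetric, so only the entries above the diagonal need to be read.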