#### Getting Started With Text Embeddings

#### Project environment setup

Load credentials and the relevant Python libraries.
If you were running this notebook locally, you would first install the Vertex AI SDK with the command below. In this classroom, it is already installed.
!pip install google-cloud-aiplatform

In [2]:
from utils import authenticate
credentials, PROJECT_ID = authenticate() # Get credentials and project ID

ImportError: cannot import name 'authenticate' from 'utils' (/Users/dovcohen/opt/anaconda3/envs/LLM_3.10/lib/python3.10/site-packages/utils/__init__.py)

In [1]:
! pip install utils

Collecting utils
  Downloading utils-1.0.1-py2.py3-none-any.whl (21 kB)
Installing collected packages: utils
Successfully installed utils-1.0.1
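
The traceback above shows `utils` resolving to a package inside `site-packages`, which matches the PyPI `utils` package installed by the `pip install utils` cell, and that package evidently has no `authenticate`. Presumably the course supplies its own `utils.py` next to the notebook; that is an assumption, not something this text confirms. A minimal sketch for checking which file an import would load, where `module_origin` is a hypothetical helper name:

```python
import importlib.util

def module_origin(name):
    """Return the file a top-level `import name` would load, or None if not found."""
    spec = importlib.util.find_spec(name)
    return spec.origin if spec is not None else None

# Example with a standard-library module; in the notebook, pass "utils" instead.
print(module_origin("json"))
```

If the printed path for `utils` points into `site-packages` rather than the notebook's folder, the import is picking up the installed package instead of a local helper file.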


In [None]:
REGION = 'us-central1'
# Import and initialize the Vertex AI Python SDK

import vertexai
vertexai.init(project = PROJECT_ID, 
              location = REGION, 
              credentials = credentials)

#### Use the embeddings model
Import and load the model.

In [None]:
from vertexai.language_models import TextEmbeddingModel
embedding_model = TextEmbeddingModel.from_pretrained(
    "textembedding-gecko@001")

#### Generate a word embedding

In [None]:
embedding = embedding_model.get_embeddings(
    ["life"])

The returned object is a list containing a single TextEmbedding object.
The TextEmbedding.values field stores the embedding vector as a Python list of floats.

In [3]:
vector = embedding[0].values
print(f"Length = {len(vector)}")
print(vector[:10])

NameError: name 'embedding' is not defined

#### Sentence embedding

In [None]:
embedding = embedding_model.get_embeddings(
    ["What is the meaning of life?"])
vector = embedding[0].values
print(f"Length = {len(vector)}")
print(vector[:10])


#### Similarity

- Calculate the cosine similarity between two sentence embeddings. Cosine similarity ranges from -1 to 1 in general; for the sentences here, expect a number between 0 and 1.
- Try out your own sentences and check if the similarity calculations match your intuition.

- Note: the embedding values (a Python list) are wrapped in another list because the `cosine_similarity` function expects either a 2D NumPy array or a list of lists.
```Python
vec_1 = [emb_1[0].values]
```
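
To make the similarity score less of a black box, here is a minimal sketch of the cosine formula in plain Python. The helper name `cosine_sim` and the toy 3-dimensional vectors are illustrative assumptions; they stand in for the real 768-dimensional embeddings and are not the scikit-learn implementation.

```python
import math

def cosine_sim(u, v):
    """Cosine similarity of two equal-length vectors given as plain lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors: same direction, perpendicular, and opposite direction.
same_direction = cosine_sim([1.0, 2.0, 2.0], [2.0, 4.0, 4.0])
orthogonal = cosine_sim([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
opposite = cosine_sim([1.0, 2.0, 2.0], [-1.0, -2.0, -2.0])
print(same_direction, orthogonal, opposite)
```

Unlike this helper, scikit-learn's `cosine_similarity` compares every row of its first argument with every row of its second and returns a 2D array, which is why a single pair of sentences prints as a nested `[[...]]` value.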

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
emb_1 = embedding_model.get_embeddings(
    ["What is the meaning of life?"]) # 42!

emb_2 = embedding_model.get_embeddings(
    ["How does one spend their time well on Earth?"])

emb_3 = embedding_model.get_embeddings(
    ["Would you like a salad?"])

vec_1 = [emb_1[0].values]
vec_2 = [emb_2[0].values]
vec_3 = [emb_3[0].values]

print(cosine_similarity(vec_1,vec_2)) 
print(cosine_similarity(vec_2,vec_3))
print(cosine_similarity(vec_1,vec_3))



#### From word to sentence embeddings
- One possible way to calculate sentence embeddings from word embeddings is to take the average of the word embeddings.
- This ignores word order and context, so two sentences with different meanings but the same set of words end up with the same sentence embedding.
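
The order-insensitivity of averaging can be seen without calling the model. The sketch below uses made-up 2-dimensional vectors (`toy_vectors` and `mean_embedding` are hypothetical names, and the numbers are not real embeddings) and averages them component by component, which is the same operation as a NumPy `mean(axis=0)`.

```python
# Hypothetical 2-dimensional "word embeddings"; real ones come from the model.
toy_vectors = {
    "kids": [1.0, 0.0],
    "play": [0.0, 2.0],
    "park": [4.0, 4.0],
}

def mean_embedding(words):
    """Average the word vectors component by component."""
    rows = [toy_vectors[w] for w in words]
    return [sum(column) / len(rows) for column in zip(*rows)]

# Same words, different order.
sentence_a = mean_embedding(["kids", "play", "park"])
sentence_b = mean_embedding(["play", "kids", "park"])
print(sentence_a, sentence_b)
print(sentence_a == sentence_b)
```

Because addition does not care about the order of its terms, any two word lists with the same contents produce the same average, whatever the sentences originally meant.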

In [None]:
in_1 = "The kids play in the park."
in_2 = "The play was for kids in the park."
# remove stop words
in_pp_1 = ["kids", "play", "park"]
in_pp_2 = ["play", "kids", "park"]

# Generate one embedding for each word. So this is a list of three lists.
embeddings_1 = [emb.values for emb in embedding_model.get_embeddings(in_pp_1)]

# Use numpy to convert this list of lists into a 2D array of 3 rows and 768 columns.
import numpy as np
emb_array_1 = np.stack(embeddings_1)
print(emb_array_1.shape)

embeddings_2 = [emb.values for emb in embedding_model.get_embeddings(in_pp_2)]
emb_array_2 = np.stack(embeddings_2)
print(emb_array_2.shape)




- Take the average embedding across the 3 word embeddings.
- You'll get a single embedding of length 768.
- Check that averaging the word embeddings results in two sentence embeddings that are identical.

In [None]:
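emb_1_mean = emb_array_1.mean(axis=0)  # average over the 3 word rows
emb_2_mean = emb_array_2.mean(axis=0)
print(emb_1_mean.shape)
print(emb_2_mean.shape)
# The two averages should match, since both lists contain the same words.
print(np.allclose(emb_1_mean, emb_2_mean))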


#### Get sentence embeddings from the model.
- These sentence embeddings account for word order and context.
- Verify that the sentence embeddings are not the same.

In [None]:
print(in_1)
print(in_2)
embedding_1 = embedding_model.get_embeddings([in_1])
embedding_2 = embedding_model.get_embeddings([in_2])

# Note: these values differ markedly from the single-word embeddings above,
# because the model takes sentence structure into account.
vector_1 = embedding_1[0].values
print(vector_1[:4])
vector_2 = embedding_2[0].values
print(vector_2[:4])