## Gecko Embedding Exploration

References:
- https://cloud.google.com/vertex-ai/docs/generative-ai/embeddings/get-text-embeddings#generative-ai-get-text-embedding-drest
- https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/community/generative_ai/text_embedding_api_cloud_next_new_models.ipynb

### Summary
- Google now returns the token count in the response (see the `statistics` field).
- Different model versions (001 vs. latest) produce different vectors for the same content.
- With the same input and the same model version, different `task_type` values (SEMANTIC_SIMILARITY vs. CLASSIFICATION) produced identical vectors in these tests.
- TODO: The performance of textembedding-gecko-multilingual across multiple languages has not been tested yet.
- Question: What is the benefit of specifying a task type, and what is its use case?

In [14]:
from vertexai.language_models import TextEmbeddingModel, TextEmbeddingInput

- 001 model with single input

In [15]:
model_001 = TextEmbeddingModel.from_pretrained("textembedding-gecko@001")
embeddings_001 = model_001.get_embeddings(["What is life?"])
print(f"Embedding values:{embeddings_001[0].values}")
print(f"Embedding statistics:{embeddings_001[0].statistics}")
print(f"Embedding Other data {embeddings_001[0]._prediction_response.deployed_model_id}, {embeddings_001[0]._prediction_response.model_version_id}, {embeddings_001[0]._prediction_response.model_resource_name}, {embeddings_001[0]._prediction_response.explanations}")

Embedding values:[0.010562753304839134, 0.04915031045675278, -0.022224493324756622, 0.0208794716745615, 0.024389723315835, 0.010366306640207767, 0.023919280618429184, 0.022391626611351967, -0.031569067388772964, 0.023535897955298424, -0.017047161236405373, -0.014345862902700901, 0.044956106692552567, 0.027327297255396843, -0.03314697742462158, -0.028214626014232635, -0.035373710095882416, -0.05229683220386505, 0.017105583101511, -0.03780610114336014, -0.07891207933425903, -0.01173518318682909, -0.01629730500280857, -0.04353305324912071, 0.013023999519646168, -0.10904901474714279, -0.0341256819665432, -0.0025329082272946835, -0.036971937865018845, -0.027775181457400322, 0.02332289144396782, 0.0052000475116074085, 0.005503748077899218, 0.0047489493153989315, -0.029920609667897224, 0.07563772797584534, 0.0007565636187791824, 0.03501711040735245, 0.02154686115682125, -0.000812096637673676, 0.06169590726494789, -0.024313345551490784, 0.03736764192581177, -0.0005869767046533525, -0.02287

- 001 model with multiple inputs

In [16]:
embedding_candidates = ["What is life?", "Hello World"]
embedding_inputs = [
    TextEmbeddingInput(
        text=embedding_candidate,
    )
    for embedding_candidate in embedding_candidates
]
embeddings_001 = model_001.get_embeddings(embedding_inputs)


In [17]:
for embedding in embeddings_001:
    print(len(embedding.values))

768
768


- 001 model with a different language (Mandarin)

In [18]:
embeddings_mandarin = model_001.get_embeddings(["歡迎"])
len(embeddings_mandarin[0].values)

768

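- 001 model with `title` and the SEMANTIC_SIMILARITY task: the request fails. The API documentation describes `title` as valid only with `task_type="RETRIEVAL_DOCUMENT"`, which likely explains the error.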

In [19]:
embedding_candidates = ["What is life?"]
embedding_inputs = [
    TextEmbeddingInput(
        text=embedding_candidate,
        task_type="SEMANTIC_SIMILARITY",
        title="test"
    )
    for embedding_candidate in embedding_candidates
]
model_001_semantic_similarity_embeddings = model_001.get_embeddings(embedding_inputs)
model_001_semantic_similarity_embeddings[0]

InvalidArgument: 400 Request contains an invalid argument.
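
To avoid this 400 before sending a request, the `title` restriction can be checked client-side. The following is a minimal sketch, assuming the documented rule that `title` is accepted only with the `RETRIEVAL_DOCUMENT` task type; `build_embedding_input_kwargs` is a hypothetical helper, not part of the SDK.

```python
from typing import Optional

# Assumption: per the Vertex AI documentation, `title` is only valid
# together with this task type.
TITLE_TASK_TYPE = "RETRIEVAL_DOCUMENT"


def build_embedding_input_kwargs(
    text: str,
    task_type: Optional[str] = None,
    title: Optional[str] = None,
) -> dict:
    """Hypothetical helper: build keyword arguments for TextEmbeddingInput.

    Raises ValueError when `title` is combined with a task type other than
    RETRIEVAL_DOCUMENT, so the mistake surfaces before the API call.
    """
    if title is not None and task_type != TITLE_TASK_TYPE:
        raise ValueError(
            f"title requires task_type={TITLE_TASK_TYPE!r}, got {task_type!r}"
        )
    kwargs = {"text": text}
    if task_type is not None:
        kwargs["task_type"] = task_type
    if title is not None:
        kwargs["title"] = title
    return kwargs
```

The result can be unpacked into the input class used in this notebook, for example `TextEmbeddingInput(**build_embedding_input_kwargs("What is life?", "RETRIEVAL_DOCUMENT", "test"))`. Whether the 001 model accepts that combination was not tested here.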

- 001 model with SEMANTIC_SIMILARITY task

In [20]:
embedding_candidates = ["What is life?"]
embedding_inputs = [
    TextEmbeddingInput(
        text=embedding_candidate,
        task_type="SEMANTIC_SIMILARITY",
    )
    for embedding_candidate in embedding_candidates
]
model_001_semantic_similarity_embeddings = model_001.get_embeddings(embedding_inputs)
print(len(model_001_semantic_similarity_embeddings[0].values))

768


- 001 model with CLASSIFICATION task

In [21]:
embedding_candidates = ["What is life?"]
embedding_inputs = [
    TextEmbeddingInput(
        text=embedding_candidate,
        task_type="CLASSIFICATION"
    )
    for embedding_candidate in embedding_candidates
]
model_001classification_embeddings = model_001.get_embeddings(embedding_inputs)
print(len(model_001classification_embeddings[0].values))

768


- latest model with SEMANTIC_SIMILARITY task

In [22]:
model_latest = TextEmbeddingModel.from_pretrained("textembedding-gecko@latest")
embedding_candidates = ["What is life?"]
embedding_inputs = [
    TextEmbeddingInput(
        text=embedding_candidate,
        task_type="SEMANTIC_SIMILARITY",
    )
    for embedding_candidate in embedding_candidates
]
model_latest_semantic_similarity_embeddings = model_latest.get_embeddings(embedding_inputs)
model_latest_semantic_similarity_embeddings[0]

TextEmbedding(values=[0.006262324284762144, -0.03588619828224182, -0.008898020721971989, -0.018727445974946022, 0.003050940576940775, -0.004488206934183836, 0.012328890152275562, -0.01102829072624445, -0.015658358111977577, 0.004517422057688236, 0.0002527018659748137, -0.0006728838779963553, 0.03949910029768944, -0.05747256800532341, -0.005315796006470919, -0.02534700743854046, 0.022190725430846214, 0.029335815459489822, 0.010807173326611519, -0.010541905649006367, -0.03893468901515007, 0.008656367659568787, 0.011143505573272705, 0.009616359136998653, 0.01080895122140646, -0.005980856250971556, 0.023684805259108543, -0.08767802268266678, -0.02541523613035679, 0.07665596902370453, -0.06426955759525299, 0.029208937659859657, -0.07110679894685745, 0.017246857285499573, 0.019630523398518562, -0.029843712225556374, 0.0643225908279419, 0.007723214570432901, -0.012490224093198776, -0.00775933125987649, -0.004055703524500132, -0.016546033322811127, -0.07438856363296509, -0.02252141572535038, -

- latest model with CLASSIFICATION task

In [23]:
embedding_candidates = ["What is life?"]
embedding_inputs = [
    TextEmbeddingInput(
        text=embedding_candidate,
        task_type="CLASSIFICATION"
    )
    for embedding_candidate in embedding_candidates
]
model_latest_classification_embeddings = model_latest.get_embeddings(embedding_inputs)
model_latest_classification_embeddings[0]

TextEmbedding(values=[0.006262324284762144, -0.03588619828224182, -0.008898020721971989, -0.018727445974946022, 0.003050940576940775, -0.004488206934183836, 0.012328890152275562, -0.01102829072624445, -0.015658358111977577, 0.004517422057688236, 0.0002527018659748137, -0.0006728838779963553, 0.03949910029768944, -0.05747256800532341, -0.005315796006470919, -0.02534700743854046, 0.022190725430846214, 0.029335815459489822, 0.010807173326611519, -0.010541905649006367, -0.03893468901515007, 0.008656367659568787, 0.011143505573272705, 0.009616359136998653, 0.01080895122140646, -0.005980856250971556, 0.023684805259108543, -0.08767802268266678, -0.02541523613035679, 0.07665596902370453, -0.06426955759525299, 0.029208937659859657, -0.07110679894685745, 0.017246857285499573, 0.019630523398518562, -0.029843712225556374, 0.0643225908279419, 0.007723214570432901, -0.012490224093198776, -0.00775933125987649, -0.004055703524500132, -0.016546033322811127, -0.07438856363296509, -0.02252141572535038, -

- gecko-multilingual model with basic input

In [24]:
model_multilingual = TextEmbeddingModel.from_pretrained("textembedding-gecko-multilingual@latest")
embedding_candidates = ["What is life?"]
embedding_inputs = [
    TextEmbeddingInput(
        text=embedding_candidate,
    )
    for embedding_candidate in embedding_candidates
]
model_multilingual_embeddings = model_multilingual.get_embeddings(embedding_inputs)
model_multilingual_embeddings[0]

TextEmbedding(values=[0.029735861346125603, -0.06782633811235428, 0.08411906659603119, 0.022100111469626427, -0.041305381804704666, -0.015591509640216827, -0.012811971828341484, -0.01671595126390457, 0.0005775117897428572, 0.05338100716471672, 0.031389880925416946, 0.03406531736254692, 0.03461538627743721, -0.04565431550145149, 0.051861535757780075, -0.013456015847623348, 0.011313442140817642, -0.00816854927688837, -0.06532882153987885, -0.03192742168903351, -0.06341762840747833, 0.025468239560723305, -0.0367497093975544, -0.0034741307608783245, 0.012664465233683586, 0.02007066085934639, -0.010000479407608509, -0.00651200208812952, 0.003697033040225506, 0.0012131963158026338, -0.01266973651945591, -0.01601383090019226, 0.005917493719607592, -0.01801498420536518, -0.007637878879904747, 0.04341084882616997, 0.041808828711509705, -0.018606532365083694, 0.04603789001703262, 0.034295059740543365, 0.0003546519437804818, -0.0984773188829422, -0.00868113711476326, -0.012932011857628822, 0.0064

- gecko-multilingual model with SEMANTIC_SIMILARITY task

In [25]:
model_multilingual = TextEmbeddingModel.from_pretrained("textembedding-gecko-multilingual@latest")
embedding_candidates = ["What is life?"]
embedding_inputs = [
    TextEmbeddingInput(
        text=embedding_candidate,
        task_type="SEMANTIC_SIMILARITY",
    )
    for embedding_candidate in embedding_candidates
]
model_multilingual_semantic_similarity_embeddings = model_multilingual.get_embeddings(embedding_inputs)
model_multilingual_semantic_similarity_embeddings[0]

TextEmbedding(values=[0.029735861346125603, -0.06782633811235428, 0.08411906659603119, 0.022100111469626427, -0.041305381804704666, -0.015591509640216827, -0.012811971828341484, -0.01671595126390457, 0.0005775117897428572, 0.05338100716471672, 0.031389880925416946, 0.03406531736254692, 0.03461538627743721, -0.04565431550145149, 0.051861535757780075, -0.013456015847623348, 0.011313442140817642, -0.00816854927688837, -0.06532882153987885, -0.03192742168903351, -0.06341762840747833, 0.025468239560723305, -0.0367497093975544, -0.0034741307608783245, 0.012664465233683586, 0.02007066085934639, -0.010000479407608509, -0.00651200208812952, 0.003697033040225506, 0.0012131963158026338, -0.01266973651945591, -0.01601383090019226, 0.005917493719607592, -0.01801498420536518, -0.007637878879904747, 0.04341084882616997, 0.041808828711509705, -0.018606532365083694, 0.04603789001703262, 0.034295059740543365, 0.0003546519437804818, -0.0984773188829422, -0.00868113711476326, -0.012932011857628822, 0.0064

- gecko-multilingual model with CLASSIFICATION task

In [26]:
embedding_candidates = ["What is life?"]
embedding_inputs = [
    TextEmbeddingInput(
        text=embedding_candidate,
        task_type="CLASSIFICATION"
    )
    for embedding_candidate in embedding_candidates
]
model_multilingual_classification_embeddings = model_multilingual.get_embeddings(embedding_inputs)
model_multilingual_classification_embeddings[0]

TextEmbedding(values=[0.029735861346125603, -0.06782633811235428, 0.08411906659603119, 0.022100111469626427, -0.041305381804704666, -0.015591509640216827, -0.012811971828341484, -0.01671595126390457, 0.0005775117897428572, 0.05338100716471672, 0.031389880925416946, 0.03406531736254692, 0.03461538627743721, -0.04565431550145149, 0.051861535757780075, -0.013456015847623348, 0.011313442140817642, -0.00816854927688837, -0.06532882153987885, -0.03192742168903351, -0.06341762840747833, 0.025468239560723305, -0.0367497093975544, -0.0034741307608783245, 0.012664465233683586, 0.02007066085934639, -0.010000479407608509, -0.00651200208812952, 0.003697033040225506, 0.0012131963158026338, -0.01266973651945591, -0.01601383090019226, 0.005917493719607592, -0.01801498420536518, -0.007637878879904747, 0.04341084882616997, 0.041808828711509705, -0.018606532365083694, 0.04603789001703262, 0.034295059740543365, 0.0003546519437804818, -0.0984773188829422, -0.00868113711476326, -0.012932011857628822, 0.0064

- gecko-multilingual model with Japanese

In [27]:
embedding_candidates_japan = ["おはよう"]
embedding_inputs = [
    TextEmbeddingInput(
        text=embedding_candidate,
    )
    for embedding_candidate in embedding_candidates_japan
]
model_multilingual_japan_embeddings = model_multilingual.get_embeddings(embedding_inputs)
model_multilingual_japan_embeddings[0]

TextEmbedding(values=[-0.03492202237248421, -0.041208844631910324, 0.05854329839348793, 0.038779277354478836, -0.048455070704221725, -0.06114573776721954, 0.011971796862781048, 0.012985477223992348, 0.0026919322554022074, 0.0009371112682856619, 0.008863835595548153, -0.010660471394658089, 0.03545922413468361, -0.03754797950387001, -0.0072441683150827885, 0.024671565741300583, 0.04591364786028862, -0.0751197338104248, -0.05430164933204651, -0.006816242355853319, -0.008268974721431732, 0.037058621644973755, -0.05970405787229538, -0.030203768983483315, 0.01144502405077219, 0.020802689716219902, 0.03298156335949898, -0.02404038794338703, -0.04815303906798363, 0.020097976550459862, -0.006498627830296755, -0.01324621494859457, -0.01628614403307438, 0.03562841936945915, 0.013170063495635986, 0.02435014769434929, -0.010233160108327866, -0.041123077273368835, 0.019655995070934296, 0.023688310757279396, 0.009076820686459541, -0.058518681675195694, -0.043890971690416336, -0.023930883035063744, 0.

- gecko-multilingual model with Mandarin

In [28]:
embedding_candidates_mandarin = ["早安"]
embedding_inputs = [
    TextEmbeddingInput(
        text=embedding_candidate,
    )
    for embedding_candidate in embedding_candidates_mandarin
]
model_multilingual_mandarin_embeddings = model_multilingual.get_embeddings(embedding_inputs)
model_multilingual_mandarin_embeddings[0]

TextEmbedding(values=[-0.017030561342835426, -0.04270736873149872, 0.030717065557837486, 0.042703643441200256, -0.041937120258808136, -0.036839306354522705, -0.04115978628396988, 0.02055630460381508, 0.014107135124504566, -0.026743678376078606, 0.03809565305709839, -0.021275440230965614, 0.022431762889027596, -0.006750966422259808, -0.007904269732534885, 0.04307803511619568, 0.025212742388248444, -0.045962750911712646, -0.03296659141778946, 0.03635518252849579, -0.01559001486748457, 0.07360853254795074, -0.0780860111117363, -0.0024883162695914507, -0.017533330246806145, 0.04102452099323273, 0.028426803648471832, -0.028394397348165512, -0.023510286584496498, 0.029791606590151787, -0.026356862857937813, -0.029393943026661873, 0.01389541756361723, 0.019301695749163628, 0.02726704068481922, 0.006292343605309725, -0.007120843045413494, -0.001864145277068019, 0.0030720082577317953, 0.03957497701048851, 0.04478058964014053, -0.04056449607014656, -0.022092927247285843, -0.0006899595027789474, 
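
As a starting point for the multilingual TODO in the summary, the two greetings above ("おはよう" and "早安", both meaning "good morning") can be compared with cosine similarity. The following is a minimal sketch in plain Python; `cosine_similarity` is a helper introduced here for illustration, not an SDK function, and no score was measured in this notebook.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

With the variables from the cells above, the call would be `cosine_similarity(model_multilingual_japan_embeddings[0].values, model_multilingual_mandarin_embeddings[0].values)`. The comparison is only meaningful between vectors from the same model, since different models embed into different spaces.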

- Compare the 001 and latest models on the same content to see whether the output vectors are the same

In [29]:
model_001_semantic_similarity_embeddings[0].values == model_latest_semantic_similarity_embeddings[0].values

False

- Compare the 001 model across different tasks to see whether the output vectors are the same

In [30]:
model_001_semantic_similarity_embeddings[0].values == model_001classification_embeddings[0].values

True

- Compare the latest model across different tasks to see whether the output vectors are the same

In [31]:
model_latest_semantic_similarity_embeddings[0].values == model_latest_classification_embeddings[0].values

True

- Compare the gecko-multilingual model across different tasks to see whether the output vectors are the same

In [32]:
model_multilingual_semantic_similarity_embeddings[0].values == model_multilingual_classification_embeddings[0].values

True

- Compare the latest and gecko-multilingual models on the same content to see whether the output vectors are the same

In [33]:
model_latest_semantic_similarity_embeddings[0].values == model_multilingual_semantic_similarity_embeddings[0].values

False
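
The `==` checks above compare the lists element by element and require exact floating-point equality. That is enough to show the responses were identical here, but a tolerance-based check is more robust if a backend ever returns values that differ only by rounding. The following is a minimal sketch; `vectors_close` and its default tolerance are assumptions chosen for illustration.

```python
import math
from typing import Sequence


def vectors_close(
    a: Sequence[float],
    b: Sequence[float],
    abs_tol: float = 1e-6,  # assumed tolerance, tune as needed
) -> bool:
    """Return True when both vectors have the same length and every pair
    of components differs by at most `abs_tol`."""
    if len(a) != len(b):
        return False
    return all(
        math.isclose(x, y, rel_tol=0.0, abs_tol=abs_tol) for x, y in zip(a, b)
    )
```

For example, `vectors_close(model_latest_semantic_similarity_embeddings[0].values, model_latest_classification_embeddings[0].values)` would replace the corresponding `==` comparison.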