# What are embeddings?
- Embeddings are a fundamental concept in natural language processing: words, phrases, or entire documents are represented in numerical form.
- More fundamentally, embedding models map text onto a multi-dimensional space, or vector space, and the numbers output by the model represent the location of the text in that space.
- Similar pieces of text or words, like "teacher" and "student", are mapped closer together in the space, and dissimilar words are mapped further apart.
- This ability to place similar text close together and dissimilar text far apart means that embedding models can be used to capture the **semantic meaning** of text. (By semantic meaning, we mean that the context and intent of the text are captured, not just the individual words.)
- "Which way is it to the supermarket?" and "Could I have directions to the shop?" have only two words in common ("to" and "the") but are semantically very similar.
- Since semantically similar texts are embedded closer together in the vector space, measuring distance allows us to measure similarity.
- There are many different metrics for measuring similarity in higher dimensions; we will use cosine distance, where a smaller distance means the texts are more similar.
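
To make the distance idea concrete, here is a minimal sketch of cosine distance in plain Python. The three-dimensional vectors are made-up stand-ins for real embeddings (which have far more dimensions), and the function name `cosine_distance` is our own choice, not part of any library.

```python
import math


def cosine_distance(a, b):
    """Return 1 minus the cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Toy vectors invented for illustration only (not real model output)
teacher = [0.9, 0.8, 0.1]
student = [0.8, 0.9, 0.2]
volcano = [0.1, 0.2, 0.9]

print(cosine_distance(teacher, student))  # related words: small distance
print(cosine_distance(teacher, volcano))  # unrelated words: larger distance
```

Cosine distance depends only on the angle between two vectors, not on their lengths, so it compares the direction of the embeddings rather than their magnitude.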

In [5]:
from openai import OpenAI
client = OpenAI()

completion = client.chat.completions.create(
  model="gpt-3.5-turbo",
  messages=[
    {"role": "system", "content": "You are a poetic assistant, skilled in explaining complex programming concepts with creative flair."},
    {"role": "user", "content": "Compose a poem that explains the concept of recursion in programming."}
  ]
)

print(completion.choices[0].message)

ChatCompletionMessage(content='In the realm of code, a concept pure and true,\nLies recursion, like a cycle that continues through.\n\nA function calls itself, a waltz of delight,\nDiving deep into layers, in the still of the night.\n\nLike a mirror reflecting, an image in time,\nRecursion repeats, a magical chime.\n\nEach iteration a step, a story untold,\nUnraveling mysteries, a network unfold.\n\nAn elegant dance of patterns, a loop divine,\nIn this world of programming, recursion shines.\n\nThrough branches and loops, like a mystical thread,\nRecursion weaves magic, where logic is spread.\n\nSo embrace this concept, let it be your guide,\nIn the wondrous world of code, let recursion abide.', role='assistant', function_call=None, tool_calls=None)


In [7]:
from openai import OpenAI
client = OpenAI()

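# Request an embedding for a single input string
response = client.embeddings.create(
    model="text-embedding-ada-002",
    input="Hanan Ather is an employee at Statistics Canada",
)

# Convert the response object to a plain dictionary for inspection
response_dict = response.model_dump()
print(response_dict)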

{'data': [{'embedding': [-0.026683859527111053, -0.0042109414935112, -0.0014845837140455842, -0.015069322660565376, -0.003150644712150097, 0.015028994530439377, -0.023296285420656204, 0.015378505922853947, -0.00044193040230311453, -0.004967096261680126, 0.019182804971933365, 0.008663852699100971, -0.020002812147140503, -0.019612973555922508, 2.2198917577043176e-05, 0.00060828443383798, 0.01816115528345108, -0.017287377268075943, -0.006099647842347622, -0.010512230917811394, -0.017341148108243942, 0.006620554719120264, 0.017730988562107086, 0.012394215911626816, -0.0055484953336417675, -0.003096873639151454, 0.005491363350301981, -0.017999842762947083, 0.019787728786468506, -0.0022483000066131353, 0.009927471168339252, -0.008684016764163971, -0.021011019125580788, -0.027302226051688194, -0.02755763754248619, -0.012797498144209385, 0.00806565023958683, 0.0084420470520854, 0.014101444743573666, -0.0028549041599035263, 0.025057286024093628, -0.00905369222164154, 0.000664575956761837, 0.002

The embedding model outputs 1536 numbers to represent the input string, which we can confirm by checking the length of the embedding:

In [27]:
len(response_dict['data'][0]['embedding'])

1536

In [29]:
def create_embeddings(texts):
    """Return one embedding (a list of floats) for each string in texts."""
    response = client.embeddings.create(
        model="text-embedding-ada-002",
        input=texts,
    )
    response_dict = response.model_dump()
    return [data['embedding'] for data in response_dict['data']]

In [31]:
print(create_embeddings(["Statistics Canada", "Treasury Board of Canada Secretariat"]))

[[-0.020564641803503036, -0.019531775265932083, 0.01048757042735815, -0.04089093208312988, -0.007322760298848152, 0.011222494766116142, -0.022868730127811432, -0.015082502737641335, -0.005498691461980343, -0.018088409677147865, 0.011838242411613464, 0.010262458585202694, -0.009309043176472187, -0.025980571284890175, -0.007792847231030464, 0.022458231076598167, 0.022444989532232285, -0.0360444001853466, 0.010593505576252937, -0.009342147037386894, 0.0007427867967635393, 0.0062203737907111645, 0.024590173736214638, -0.007269793190062046, 0.007435316685587168, 0.01592998392879963, 0.002858591265976429, -0.015771081671118736, 0.008481425233185291, -0.006021745502948761, -0.004538654815405607, -0.019346389919519424, -0.023848628625273705, -0.03530285507440567, -0.012957180850207806, -0.007680291309952736, 0.007024818100035191, -0.009527534246444702, -0.007216825615614653, -0.012089838273823261, 0.03554121032357216, 0.016843672841787338, 0.002567269839346409, -0.002908248221501708, 0.0235440
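
Once each text has a vector, comparing texts comes down to comparing vectors. The sketch below ranks texts by cosine distance to a query vector. To keep it runnable without an API key it uses short placeholder vectors rather than real model output, and it assumes each vector lines up with the text at the same position; `rank_by_distance` is a hypothetical helper name.

```python
import math


def cosine_distance(a, b):
    """Return 1 minus the cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def rank_by_distance(query_vector, texts, vectors):
    """Return (text, distance) pairs sorted from nearest to farthest."""
    scored = [
        (text, cosine_distance(query_vector, vector))
        for text, vector in zip(texts, vectors)
    ]
    return sorted(scored, key=lambda pair: pair[1])


texts = ["Statistics Canada", "Treasury Board of Canada Secretariat"]
# Placeholder vectors, NOT real embeddings; in practice these would be
# the lists returned by an embedding model for `texts` and for the query.
vectors = [[0.2, 0.9, 0.1], [0.7, 0.3, 0.6]]
query_vector = [0.25, 0.85, 0.15]

ranking = rank_by_distance(query_vector, texts, vectors)
for text, distance in ranking:
    print(f"{distance:.4f}  {text}")
```

With real embeddings, the same function could be given the output of `create_embeddings(texts)` as `vectors` and the embedding of a query string as `query_vector`.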