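# How to convert text to vectors with an embedding API and search for similar content

This document walks through converting text to vectors with an embedding API and running a semantic search for similar content.  
The example below vectorizes 53 sample documents from Wikipedia and then searches for the documents most similar to a query.  
Reference: https://learn.microsoft.com/en-us/azure/ai-services/openai/tutorials/embeddings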

In [1]:
import os
import re
import pandas as pd
import numpy as np
import tiktoken
from openai import AzureOpenAI
from dotenv import load_dotenv
load_dotenv()

client = AzureOpenAI(
    azure_endpoint = os.getenv("AZURE_OPENAI_ENDPOINT"),
    api_key        = os.getenv("AZURE_OPENAI_API_KEY"),
    api_version    = os.getenv("OPENAI_API_VERSION")
)
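
The client setup above reads three settings from the environment, which `load_dotenv()` fills in from a `.env` file. If one of them is missing, `os.getenv` silently returns `None`, so a fail-fast check can save some debugging. The following is a minimal sketch; the helper name `missing_settings` is hypothetical and not part of any library.

```python
import os

# The three variables read by the AzureOpenAI client setup above.
REQUIRED_SETTINGS = (
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_KEY",
    "OPENAI_API_VERSION",
)

def missing_settings(env=None, required=REQUIRED_SETTINGS):
    """Return the names of required settings that are unset or empty.

    `missing_settings` is a hypothetical helper used only for illustration.
    """
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]

missing = missing_settings()
if missing:
    print("Missing settings:", ", ".join(missing))
```

Passing a plain dictionary as `env` makes the check easy to try without touching the real environment.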

Read the file to be vectorized (./data/wiki_data.csv) and inspect it with pandas.

In [2]:
df_wiki_data=pd.read_csv(os.path.join(os.getcwd(),'data/wiki_data.csv'))
df_wiki_data

Unnamed: 0,id,url,title,text
0,1,https://simple.wikipedia.org/wiki/April,April,April is the fourth month of the year in the J...
1,2,https://simple.wikipedia.org/wiki/August,August,August (Aug.) is the eighth month of the year ...
2,6,https://simple.wikipedia.org/wiki/Art,Art,Art is a creative activity that expresses imag...
3,8,https://simple.wikipedia.org/wiki/A,A,A or a is the first letter of the English alph...
4,9,https://simple.wikipedia.org/wiki/Air,Air,Air refers to the Earth's atmosphere. Air is a...
5,12,https://simple.wikipedia.org/wiki/Autonomous%2...,Autonomous communities of Spain,Spain is divided in 17 parts called autonomous...
6,13,https://simple.wikipedia.org/wiki/Alan%20Turing,Alan Turing,"Alan Mathison Turing OBE FRS (London, 23 June ..."
7,14,https://simple.wikipedia.org/wiki/Alanis%20Mor...,Alanis Morissette,"Alanis Nadine Morissette (born June 1, 1974) i..."
8,17,https://simple.wikipedia.org/wiki/Adobe%20Illu...,Adobe Illustrator,Adobe Illustrator is a computer program for ma...
9,18,https://simple.wikipedia.org/wiki/Andouille,Andouille,Andouille is a type of pork sausage. It is spi...


In [3]:
# Silence the chained-assignment warning raised when writing to a filtered DataFrame:
# https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#evaluation-order-matters
pd.options.mode.chained_assignment = None

def normalize_text(s):
    """Clean up whitespace and stray punctuation in the input text s."""
    # Collapse every run of whitespace (including newlines) into a single space
    s = re.sub(r'\s+', ' ', s).strip()
    # Remove " ," together with the character before it (the "." is an unescaped wildcard)
    s = re.sub(r". ,", "", s)
    # Collapse doubled periods
    s = s.replace("..", ".")
    s = s.replace(". .", ".")
    return s.strip()

df_wiki_data['text'] = df_wiki_data["text"].apply(normalize_text)
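
To see what the cleanup does, it can help to run it on a couple of short strings. The sketch below repeats the same steps so that it runs on its own; the sample strings are made up for illustration. The second sample shows a side effect of the unescaped `.` in the pattern `". ,"`: the character in front of `" ,"` is removed as well.

```python
import re

def normalize_text(s):
    # Same steps as the notebook cell above, repeated so this example is self-contained.
    s = re.sub(r'\s+', ' ', s).strip()
    s = re.sub(r". ,", "", s)
    s = s.replace("..", ".")
    s = s.replace(". .", ".")
    return s.strip()

# Made-up sample with repeated spaces, a newline, and doubled periods.
print(normalize_text("April  is the\nfourth month..  It has 30 days. ."))
# → April is the fourth month. It has 30 days.

# The wildcard "." also swallows the letter before " ,".
print(normalize_text("Paris , France"))
# → Pari France
```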

To use the Embedding API provided by Azure OpenAI, keep only the documents whose text is shorter than 8,192 tokens.

In [4]:
tokenizer = tiktoken.get_encoding("cl100k_base")
df_wiki_data['n_tokens'] = df_wiki_data["text"].apply(lambda x: len(tokenizer.encode(x)))
df_wiki_data = df_wiki_data[df_wiki_data.n_tokens<8192]
len(df_wiki_data)
df_wiki_data

Unnamed: 0,id,url,title,text,n_tokens
0,1,https://simple.wikipedia.org/wiki/April,April,April is the fourth month of the year in the J...,3902
1,2,https://simple.wikipedia.org/wiki/August,August,August (Aug.) is the eighth month of the year ...,2179
2,6,https://simple.wikipedia.org/wiki/Art,Art,Art is a creative activity that expresses imag...,1149
3,8,https://simple.wikipedia.org/wiki/A,A,A or a is the first letter of the English alph...,401
4,9,https://simple.wikipedia.org/wiki/Air,Air,Air refers to the Earth's atmosphere. Air is a...,607
5,12,https://simple.wikipedia.org/wiki/Autonomous%2...,Autonomous communities of Spain,Spain is divided in 17 parts called autonomous...,460
6,13,https://simple.wikipedia.org/wiki/Alan%20Turing,Alan Turing,"Alan Mathison Turing OBE FRS (London, 23 June ...",1138
7,14,https://simple.wikipedia.org/wiki/Alanis%20Mor...,Alanis Morissette,"Alanis Nadine Morissette (born June 1, 1974) i...",987
8,17,https://simple.wikipedia.org/wiki/Adobe%20Illu...,Adobe Illustrator,Adobe Illustrator is a computer program for ma...,94
9,18,https://simple.wikipedia.org/wiki/Andouille,Andouille,Andouille is a type of pork sausage. It is spi...,131
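
The filter above drops any document with 8,192 tokens or more. An alternative, which this notebook does not use, is to split a long token sequence into chunks that each fit under the limit and embed the chunks separately. The sketch below assumes a limit of 8,191 tokens per chunk to match the `< 8192` filter; `chunk_tokens` is a hypothetical helper, and the integer list stands in for the output of `tokenizer.encode(text)`.

```python
def chunk_tokens(tokens, max_tokens=8191):
    """Split a token list into consecutive chunks of at most max_tokens items.

    `chunk_tokens` is a hypothetical helper used only for illustration.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    return [tokens[i:i + max_tokens] for i in range(0, len(tokens), max_tokens)]

# Stand-in token ids; in the notebook they would come from tokenizer.encode(text).
tokens = list(range(20000))
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])  # → [8191, 8191, 3618]
```

Each chunk could then be turned back into text with `tokenizer.decode(chunk)` before it is sent to the embedding API.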


Inspect how the text of a document is split into individual tokens.

In [5]:
sample_encode = tokenizer.encode(df_wiki_data.text[0]) 
decode = tokenizer.decode_tokens_bytes(sample_encode)
decode

[b'April',
 b' is',
 b' the',
 b' fourth',
 b' month',
 b' of',
 b' the',
 b' year',
 b' in',
 b' the',
 b' Julian',
 b' and',
 b' Greg',
 b'orian',
 b' calendars',
 b',',
 b' and',
 b' comes',
 b' between',
 b' March',
 b' and',
 b' May',
 b'.',
 b' It',
 b' is',
 b' one',
 b' of',
 b' four',
 b' months',
 b' to',
 b' have',
 b' ',
 b'30',
 b' days',
 b'.',
 b' April',
 b' always',
 b' begins',
 b' on',
 b' the',
 b' same',
 b' day',
 b' of',
 b' week',
 b' as',
 b' July',
 b',',
 b' and',
 b' additionally',
 b',',
 b' January',
 b' in',
 b' leap',
 b' years',
 b'.',
 b' April',
 b' always',
 b' ends',
 b' on',
 b' the',
 b' same',
 b' day',
 b' of',
 b' the',
 b' week',
 b' as',
 b' December',
 b'.',
 b' April',
 b"'s",
 b' flowers',
 b' are',
 b' the',
 b' Sweet',
 b' Pe',
 b'a',
 b' and',
 b' Daisy',
 b'.',
 b' Its',
 b' birth',
 b'stone',
 b' is',
 b' the',
 b' diamond',
 b'.',
 b' The',
 b' meaning',
 b' of',
 b' the',
 b' diamond',
 b' is',
 b' innocence',
 b'.',
 b' The',
 ...]

In [6]:
len(decode)

3902
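
The token list above is a list of `bytes` objects rather than strings, because a single token can end in the middle of a multi-byte UTF-8 character. To rebuild readable text it is therefore safer to concatenate the bytes first and decode once at the end. The sketch below uses the first eight tokens copied from the output above.

```python
# First eight tokens copied from the decode output above.
token_bytes = [b'April', b' is', b' the', b' fourth', b' month', b' of', b' the', b' year']

# Concatenate the raw bytes first, then decode the whole sequence once.
text = b"".join(token_bytes).decode("utf-8")
print(text)  # → April is the fourth month of the year
```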

Generate a vector for each text with the embedding API and add it as a new column, `content_vector`.

In [7]:
def generate_embeddings(text, model="text-embedding-ada-002"):
    # `model` must be the deployment name you chose when you deployed
    # the text-embedding-ada-002 (Version 2) model.
    return client.embeddings.create(input=[text], model=model).data[0].embedding

df_wiki_data['content_vector'] = df_wiki_data["text"].apply(
    lambda x: generate_embeddings(x, model='text-embedding-ada-002')
)
df_wiki_data

Unnamed: 0,id,url,title,text,n_tokens,content_vector
0,1,https://simple.wikipedia.org/wiki/April,April,April is the fourth month of the year in the J...,3902,"[-0.011139873415231705, -0.01703229732811451, ..."
1,2,https://simple.wikipedia.org/wiki/August,August,August (Aug.) is the eighth month of the year ...,2179,"[0.0012400866253301501, 0.002778931986540556, ..."
2,6,https://simple.wikipedia.org/wiki/Art,Art,Art is a creative activity that expresses imag...,1149,"[-0.007485614158213139, 0.010930377058684826, ..."
3,8,https://simple.wikipedia.org/wiki/A,A,A or a is the first letter of the English alph...,401,"[0.023238468915224075, -0.023404547944664955, ..."
4,9,https://simple.wikipedia.org/wiki/Air,Air,Air refers to the Earth's atmosphere. Air is a...,607,"[0.021120715886354446, 0.013117602095007896, -..."
5,12,https://simple.wikipedia.org/wiki/Autonomous%2...,Autonomous communities of Spain,Spain is divided in 17 parts called autonomous...,460,"[0.015525435097515583, 0.00884158257395029, 0...."
6,13,https://simple.wikipedia.org/wiki/Alan%20Turing,Alan Turing,"Alan Mathison Turing OBE FRS (London, 23 June ...",1138,"[-0.0017295910511165857, -0.010941265150904655..."
7,14,https://simple.wikipedia.org/wiki/Alanis%20Mor...,Alanis Morissette,"Alanis Nadine Morissette (born June 1, 1974) i...",987,"[-0.012498506344854832, -0.03015793114900589, ..."
8,17,https://simple.wikipedia.org/wiki/Adobe%20Illu...,Adobe Illustrator,Adobe Illustrator is a computer program for ma...,94,"[-0.015608406625688076, -0.02825835719704628, ..."
9,18,https://simple.wikipedia.org/wiki/Andouille,Andouille,Andouille is a type of pork sausage. It is spi...,131,"[0.002429477171972394, 0.00839332677423954, 0...."
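
The cell above sends one request per document. Because the `input` argument of `client.embeddings.create` is a list, several texts can also be sent in one request. The sketch below shows one possible batching loop. `embed_in_batches` and `fake_embed_batch` are hypothetical names, the batch size of 16 is an arbitrary placeholder (check the limits of your own deployment), and the stand-in embedder is used so the example runs without network access.

```python
def embed_in_batches(texts, embed_batch, batch_size=16):
    """Embed texts in batches while preserving the input order.

    embed_batch is any callable that takes a list of strings and returns
    one vector per string, in the same order. Hypothetical helper.
    """
    vectors = []
    for start in range(0, len(texts), batch_size):
        vectors.extend(embed_batch(texts[start:start + batch_size]))
    return vectors

# Assumed wiring to the client from this notebook (not executed here):
# def azure_embed_batch(batch, model="text-embedding-ada-002"):
#     response = client.embeddings.create(input=batch, model=model)
#     ordered = sorted(response.data, key=lambda item: item.index)
#     return [item.embedding for item in ordered]

def fake_embed_batch(batch):
    # Stand-in embedder: a one-dimensional "vector" holding the text length.
    return [[float(len(t))] for t in batch]

vectors = embed_in_batches(["a", "bb", "ccc"], fake_embed_batch, batch_size=2)
print(vectors)  # → [[1.0], [2.0], [3.0]]
```

The result could be assigned to the DataFrame with `df_wiki_data['content_vector'] = vectors`, since the order matches the order of the input texts.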


In [8]:
# Save the data to a CSV file(data/wiki_data_embeddings.csv)
df_wiki_data.to_csv(os.path.join(os.getcwd(),'data/wiki_data_embeddings.csv'), index=False)
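
One thing to keep in mind when reloading this CSV: each list in `content_vector` is written as its text form, so `pd.read_csv` returns it as a string rather than a list of floats. The sketch below shows one way to convert it back; `parse_vector` is a hypothetical helper, and the sample values are made up.

```python
import json

def parse_vector(cell):
    """Turn the text form of a float list (as stored in the CSV) back into a list of floats.

    `parse_vector` is a hypothetical helper used only for illustration.
    """
    return [float(x) for x in json.loads(cell)]

# What a Python list of floats looks like once it has been written out as text.
stored = str([0.1, -0.25, 3e-05])
print(parse_vector(stored))  # → [0.1, -0.25, 3e-05]

# Assumed usage when reading the saved file back:
# df = pd.read_csv('data/wiki_data_embeddings.csv',
#                  converters={"content_vector": parse_vector})
```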

Run a few queries and examine the results to see how the similarity scores relate each query to the documents.

In [9]:
from IPython.display import display

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def get_embedding(text, model="text-embedding-ada-002"):
    # `model` must be the deployment name you chose when you deployed
    # the text-embedding-ada-002 (Version 2) model.
    return client.embeddings.create(input=[text], model=model).data[0].embedding

def search_docs(df, user_query, top_n=3, to_print=True):
    # Embed the query with the same model that was used for the documents
    embedding = get_embedding(user_query, model="text-embedding-ada-002")
    # Score every document against the query (adds or overwrites the "similarities" column)
    df["similarities"] = df.content_vector.apply(lambda x: cosine_similarity(x, embedding))

    # Keep the top_n most similar documents
    res = df.sort_values("similarities", ascending=False).head(top_n)
    if to_print:
        display(res)
    return res


# The queries are written in Korean; the documents are in English.
res = search_docs(df_wiki_data, "4월에 대해서 알려줘.", top_n=4)  # "Tell me about April."
res = search_docs(df_wiki_data, "예술의 종류를 구분해줘.", top_n=4)  # "Classify the types of art."
res = search_docs(df_wiki_data, "웹 페이지에서 정보를 검색하기 위해서 필요한 도구는 무엇이야?", top_n=4)  # "What tool do I need to search for information on web pages?"
res = search_docs(df_wiki_data, "4월과 8월의 차이를 표로 그려줘", top_n=4)  # "Draw a table of the differences between April and August."

Unnamed: 0,id,url,title,text,n_tokens,content_vector,similarities
0,1,https://simple.wikipedia.org/wiki/April,April,April is the fourth month of the year in the J...,3902,"[-0.011139873415231705, -0.01703229732811451, ...",0.775084
1,2,https://simple.wikipedia.org/wiki/August,August,August (Aug.) is the eighth month of the year ...,2179,"[0.0012400866253301501, 0.002778931986540556, ...",0.724088
24,48,https://simple.wikipedia.org/wiki/Astronomy,Astronomy,Astronomy (from the Greek astron (ἄστρον) mean...,2564,"[0.01999729871749878, 0.01131510641425848, 0.0...",0.689076
6,13,https://simple.wikipedia.org/wiki/Alan%20Turing,Alan Turing,"Alan Mathison Turing OBE FRS (London, 23 June ...",1138,"[-0.0017295910511165857, -0.010941265150904655...",0.688383


Unnamed: 0,id,url,title,text,n_tokens,content_vector,similarities
2,6,https://simple.wikipedia.org/wiki/Art,Art,Art is a creative activity that expresses imag...,1149,"[-0.007485614158213139, 0.010930377058684826, ...",0.792432
25,49,https://simple.wikipedia.org/wiki/Architecture,Architecture,Architecture is designing the structures of bu...,1017,"[0.01003633439540863, 0.008842614479362965, -0...",0.744362
24,48,https://simple.wikipedia.org/wiki/Astronomy,Astronomy,Astronomy (from the Greek astron (ἄστρον) mean...,2564,"[0.01999729871749878, 0.01131510641425848, 0.0...",0.742285
35,62,https://simple.wikipedia.org/wiki/Animal,Animal,Animals (or Metazoa) are living creatures with...,585,"[0.002394610783085227, -0.001476755365729332, ...",0.724641


Unnamed: 0,id,url,title,text,n_tokens,content_vector,similarities
42,76,https://simple.wikipedia.org/wiki/Browser,Browser,"A browser is a name given to any animal, usual...",43,"[-0.0192558616399765, -0.004261700436472893, 0...",0.725765
23,47,https://simple.wikipedia.org/wiki/Atom,Atom,Atoms are very small pieces of matter. There a...,3223,"[0.0009149739635176957, 0.02547287568449974, -...",0.685806
38,69,https://simple.wikipedia.org/wiki/Boot%20device,Boot device,A boot device is used to start a computer. It ...,292,"[-0.010595838539302349, -0.011619153432548046,...",0.684541
16,32,https://simple.wikipedia.org/wiki/Abbreviation,Abbreviation,An abbreviation is a shorter way to write a wo...,365,"[0.007967128418385983, 0.018927432596683502, 0...",0.684352


Unnamed: 0,id,url,title,text,n_tokens,content_vector,similarities
1,2,https://simple.wikipedia.org/wiki/August,August,August (Aug.) is the eighth month of the year ...,2179,"[0.0012400866253301501, 0.002778931986540556, ...",0.758668
0,1,https://simple.wikipedia.org/wiki/April,April,April is the fourth month of the year in the J...,3902,"[-0.011139873415231705, -0.01703229732811451, ...",0.746919
12,22,https://simple.wikipedia.org/wiki/Addition,Addition,"In mathematics, addition, represented by the s...",801,"[0.002222324488684535, 0.017958177253603935, 0...",0.680137
3,8,https://simple.wikipedia.org/wiki/A,A,A or a is the first letter of the English alph...,401,"[0.023238468915224075, -0.023404547944664955, ...",0.679039
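
The ranking step can be followed by hand on a toy example. The sketch below uses made-up three-dimensional vectors in place of real embeddings and only the standard library; the titles are borrowed from the sample data purely as labels, and `top_n` is a hypothetical helper. It computes the same cosine similarity as above and sorts the documents by score.

```python
import math

def cosine_similarity(a, b):
    # dot(a, b) / (|a| * |b|), written out without NumPy
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_n(query_vec, doc_vecs, n=2):
    """Return the n (title, score) pairs with the highest similarity to query_vec."""
    scored = [(title, cosine_similarity(vec, query_vec)) for title, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:n]

# Made-up vectors for illustration; real embeddings have far more dimensions.
docs = {
    "April": [1.0, 0.0, 0.0],
    "August": [0.8, 0.6, 0.0],
    "Art": [0.0, 0.0, 1.0],
}
query = [1.0, 0.0, 0.0]

for title, score in top_n(query, docs):
    print(title, round(score, 3))
# → April 1.0
# → August 0.8
```

A vector pointing in the same direction as the query scores 1, an orthogonal one scores 0, and everything else falls in between, which is why the closely related documents appear at the top of each result table.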
