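### Embeddings
* Embeddings are at the core of recommender systems. An embedding converts complex data such as natural language (text) or images into a sequence of numbers (a vector) that a computer can work with.
* Each word or sentence is mapped to a point in a coordinate space. Points that lie close together are treated as closely related in meaning, so the stored coordinates carry semantic information.
* An embedding model learns these coordinates during training, so it holds its own position for every input it is given.
* The coordinates live in a high-dimensional space (for example, 1536 dimensions for `text-embedding-3-small`).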
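
To make the "nearby points are related" idea concrete, here is a minimal sketch using hand-picked 3-dimensional vectors. The values in `toy_vectors` are hypothetical and chosen only for illustration; real embeddings come from a model and have far more dimensions.

```python
import math

# Hypothetical 3-dimensional "embeddings", hand-picked for illustration only.
toy_vectors = {
    "bread": [0.9, 0.1, 0.0],
    "toast": [0.8, 0.3, 0.1],
    "laptop": [0.0, 0.2, 0.9],
}

# math.dist returns the Euclidean distance between two points.
dist_bread_toast = math.dist(toy_vectors["bread"], toy_vectors["toast"])
dist_bread_laptop = math.dist(toy_vectors["bread"], toy_vectors["laptop"])

# A smaller distance means the two points sit closer together in the space.
print(f"bread <-> toast : {dist_bread_toast:.3f}")
print(f"bread <-> laptop: {dist_bread_laptop:.3f}")
```

In this toy space "bread" sits much closer to "toast" than to "laptop", which is the kind of relationship a trained embedding model is expected to capture.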


In [1]:
from dotenv import load_dotenv

load_dotenv()

True

In [None]:
from openai import OpenAI
import pandas as pd

client = OpenAI()

text = "내가 오늘 점심을...."
response = client.embeddings.create(
    model="text-embedding-3-small",
    input=[text]
)
# response.data holds one item per input; check the length of its embedding list.
# 1536 is the number of dimensions (coordinates) in the vector.
print(len(response.data[0].embedding))

pd.Series(response.data[0].embedding).head()

1536


0    0.037724
1    0.008514
2   -0.044167
3   -0.017805
4    0.006443
dtype: float64

In [5]:
df = pd.read_csv("fine_food_reviews_1k.csv")
df.head()

Unnamed: 0.1,Unnamed: 0,Time,ProductId,UserId,Score,Summary,Text
0,0,1351123200,B003XPF9BO,A3R7JR3FMEBXQB,5,where does one start...and stop... with a tre...,Wanted to save some to bring to my Chicago fam...
1,1,1351123200,B003JK537S,A3JBPC3WFUT5ZP,1,Arrived in pieces,"Not pleased at all. When I opened the box, mos..."
2,2,1351123200,B000JMBE7M,AQX1N6A51QOKG,4,"It isn't blanc mange, but isn't bad . . .",I'm not sure that custard is really custard wi...
3,3,1351123200,B004AHGBX4,A2UY46X0OSNVUQ,3,These also have SALT and it's not sea salt.,I like the fact that you can see what you're g...
4,4,1351123200,B001BORBHO,A1AFOYZ9HSM2CZ,5,Happy with the product,My dog was suffering with itchy skin. He had ...


In [6]:
import tiktoken

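# Load the tokenizer.
# Note: the embedding models may use a different encoding than gpt-5-nano,
# so treat these counts as approximate.
gpt5nano_encoding = tiktoken.encoding_for_model("gpt-5-nano")

# Count the tokens in each review text and store the result in a new column.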
df['n_tokens'] = df['Text'].apply(lambda x : len(gpt5nano_encoding.encode(x)))

df['n_tokens'].describe()

count    1000.000000
mean       83.818000
std        71.905308
min        22.000000
25%        38.000000
50%        59.000000
75%       104.000000
max       614.000000
Name: n_tokens, dtype: float64
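
The longest review here is 614 tokens, which is comfortably short. Embedding models do reject inputs over their token limit, though, so it can help to separate out over-long texts before calling the API. The sketch below is one way to do that. `split_by_token_limit` is a hypothetical helper, the whitespace tokenizer is only a stand-in, and the real limit should be taken from the current OpenAI documentation rather than from this example.

```python
def split_by_token_limit(texts, encode, max_tokens):
    """Split texts into (ok, too_long) based on len(encode(text))."""
    ok, too_long = [], []
    for text in texts:
        if len(encode(text)) <= max_tokens:
            ok.append(text)
        else:
            too_long.append(text)
    return ok, too_long

# Stand-in tokenizer for illustration: whitespace splitting.
# In the notebook you would pass gpt5nano_encoding.encode instead.
ok, too_long = split_by_token_limit(
    ["good bread", "this review is far too long for the limit"],
    encode=str.split,
    max_tokens=3,
)
print(ok)
print(too_long)
```

Texts that end up in `too_long` can then be truncated or summarized before embedding, depending on what the application needs.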

In [None]:
# Embed the entire dataset

def text_to_embedding(texts):

    # Replace newlines with spaces as preprocessing; this can slightly improve embedding quality.
    texts = [text.replace('\n', ' ') for text in texts]

    response = client.embeddings.create(
        model='text-embedding-3-large',
        input=texts
    )

    # Extract just the embedding vectors from the response
    return [data.embedding for data in response.data]

df['embedding'] = text_to_embedding(df['Text'].tolist())
df['embedding'].head()

0    [0.0012975726276636124, 0.006300435867160559, ...
1    [-0.005645995028316975, -0.020648211240768433,...
2    [0.002446384634822607, 0.004002631641924381, -...
3    [0.021143564954400063, -0.016434499993920326, ...
4    [-0.01443126704543829, -0.003087373683229089, ...
Name: embedding, dtype: object
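
Sending all 1,000 reviews in a single request works for this dataset, but the API limits how many inputs and tokens one request may carry, so larger datasets usually need batching. The following is a minimal sketch of that pattern. `embed_in_batches` and `fake_embed` are hypothetical names, and the batch size of 100 is an arbitrary assumption, not a recommended value.

```python
def embed_in_batches(texts, embed_fn, batch_size=100):
    """Call embed_fn on consecutive slices and concatenate the results in order."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    vectors = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        vectors.extend(embed_fn(batch))
    return vectors

# Fake embedder for a dry run without API calls: one 1-dimensional "vector" per text.
def fake_embed(batch):
    return [[float(len(text))] for text in batch]

vectors = embed_in_batches(["a", "bb", "ccc", "dddd", "eeeee"], fake_embed, batch_size=2)
print(vectors)
```

In the notebook, the same helper could wrap the real function, for example `df['embedding'] = embed_in_batches(df['Text'].tolist(), text_to_embedding)`.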

In [10]:
%pip install scikit-learn




In [12]:
# Semantic search (ranked by cosine similarity)
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def get_similar_texts(query_text, df, top_k=7):
    # Embed the user's query as well
    query_vector = text_to_embedding([query_text])[0]

    # Convert the stored vectors to a 2-D NumPy array for easier computation
    embeddings = np.array(df['embedding'].tolist())

    # cosine_similarity expects 2-D input, so wrap query_vector in a list
    cos_sim = cosine_similarity([query_vector], embeddings)

    # Attach the scores to a copy so the caller's DataFrame is not modified
    scored = df.assign(cos_sim=cos_sim[0])

    return scored.sort_values(by='cos_sim', ascending=False)[['Text', 'cos_sim']].head(top_k)
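
For intuition about what `get_similar_texts` computes, the ranking step can be sketched without scikit-learn or pandas. This is a dependency-free illustration using tiny 2-dimensional vectors, not a replacement for the function above. `top_k_similar` and the sample `docs` are hypothetical.

```python
import heapq
import math

def top_k_similar(query_vec, doc_vecs, k=3):
    """Return [(index, cosine_similarity)] for the k most similar vectors."""
    def norm(v):
        return math.sqrt(sum(x * x for x in v))

    q_norm = norm(query_vec)
    scored = []
    for idx, vec in enumerate(doc_vecs):
        dot = sum(q * d for q, d in zip(query_vec, vec))
        # Cosine similarity: dot product divided by the product of the lengths.
        scored.append((idx, dot / (q_norm * norm(vec))))
    # Keep only the k highest-scoring entries, best first.
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])

docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
ranking = top_k_similar([1.0, 0.0], docs, k=2)
print(ranking)
```

The vector pointing in the same direction as the query ranks first, the diagonal one second, and the opposite one would rank last, which mirrors how `sort_values(by='cos_sim', ascending=False)` orders the reviews.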

In [None]:
# Search test: returns the reviews whose embeddings are closest in meaning to the query.
search_result = get_similar_texts("bread", df)
search_result

Unnamed: 0,Text,cos_sim
730,This mix makes a good bread or can also be use...,0.377646
951,Makes very good break sticks.. Also can be use...,0.376412
373,These are a great substitute for bread if you ...,0.356632
925,Made a batch of sourdough rye and it got rave ...,0.312238
186,compared to the usual white bread bun you migh...,0.302224
873,Bought this with my new Oster Belgium waffle m...,0.287595
408,This is a good grocery for us.<br /><br />If y...,0.282888
