In [1]:
! pip install ko-sentence-transformers

Collecting ko-sentence-transformers
  Downloading ko_sentence_transformers-0.3.tar.gz (11 kB)
Collecting sentence-transformers
  Downloading sentence_transformers-2.7.0-py3-none-any.whl (171 kB)
Collecting huggingface-hub>=0.15.1
  Downloading huggingface_hub-0.23.0-py3-none-any.whl (401 kB)
Collecting transformers<5.0.0,>=4.34.0
  Downloading transformers-4.40.2-py3-none-any.whl (9.0 MB)
Collecting fsspec>=2023.5.0
  Downloading fsspec-2024.3.1-py3-none-any.whl (171 kB)
Collecting tokenizers<0.20,>=0.19
  Downloading tokenizers-0.19.1-cp39-none-win_amd64.whl (2.2 MB)
Collecting safetensors>=0.4.1
  Downloading safetensors-0.4.3-cp39-none-win_amd64.whl (287 kB)
Building wheels for collected packages: ko-sentence-transformers
  Building wheel for ko-sentence-transformers (setup.py): started
  Building wheel for ko-sentence-transformers (setup.py): finished with status 'done'
  Created wheel for ko-sentence-transformers: filename=ko_sentence_transformers-0.3-py3-none-any.whl size=9680 sh
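
The cell that follows ranks every corpus entry against a query by cosine similarity (through `util.pytorch_cos_sim`) and keeps the five highest scores. As a rough illustration of what that ranking computes, here is a minimal pure-Python sketch on toy vectors. The helper names `cosine_similarity` and `top_k_similar` are hypothetical and are not part of sentence-transformers; the real embeddings come from the model, not from hand-written lists.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_similar(query_vec, corpus_vecs, k):
    """Return (index, score) pairs for the k corpus vectors most similar to the query."""
    scored = [(i, cosine_similarity(query_vec, vec)) for i, vec in enumerate(corpus_vecs)]
    # heapq.nlargest avoids sorting the whole list when k is small
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Toy 2-D "embeddings", made up purely for illustration
toy_corpus = [[0.0, 1.0], [1.0, 0.1], [1.0, 1.0]]
for index, score in top_k_similar([1.0, 0.0], toy_corpus, k=2):
    print(index, "(Score: %.4f)" % score)
```

Cosine similarity depends only on the direction of the vectors, not their length, so a score near 1 means the two embeddings point the same way and a score near 0 means they are unrelated.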

In [3]:
from sentence_transformers import SentenceTransformer, util
import numpy as np

embedder = SentenceTransformer("jhgan/ko-sbert-sts")

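# Corpus with example sentences (English glosses in comments)
corpus = [
    'AI',
    '자동 경찰 신고',  # automatic police report
    'Face Detect (AI)',
    '도로, 인도',  # road, sidewalk
    '인원 수 제한',  # headcount limit
    'AI 측정',  # AI measurement
    '사고 후 관리',  # post-accident management
    '자전거, 전동 킥보드',  # bicycle, electric kick scooter
    '교통사고',  # traffic accident
    '뺑소니',  # hit-and-run
    '신호 표지판',  # traffic signal signs
    '교통 체증',  # traffic congestion
    '신호 시간 조절',  # signal timing adjustment
    '데이터 수집',  # data collection
    'Aduanced traffic control',  # "Advanced" is misspelled in the input data
    '속도에 따른 방지턱 경고 생성',  # generate speed-bump warnings based on speed
    '구간 속도 제한',  # section speed limit
    'Regulation',
    '인원수 트래킹',  # headcount tracking
    '기준 인원 초과'  # headcount limit exceeded
]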

corpus_embeddings = embedder.encode(corpus, convert_to_tensor=True)

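# Query sentences (English glosses in comments):
queries = [
    '경제',  # economy
    '사람',  # people
    '환경',  # environment
    '교통',  # traffic
    '삶',  # life
    '정부',  # government
    '기술',  # technology
    '편의 시설',  # amenities
    '에너지',  # energy
    '재해 대처'  # disaster response
]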

# For each query, find the top_k corpus entries with the highest cosine similarity
top_k = min(5, len(corpus))
for query in queries:
    query_embedding = embedder.encode(query, convert_to_tensor=True)
    # Similarity of the query to every corpus entry, converted to a CPU NumPy array
    cos_scores = util.pytorch_cos_sim(query_embedding, corpus_embeddings)[0]
    cos_scores = cos_scores.cpu().numpy()

    # np.argpartition with kth=range(top_k) places the top_k highest scores first,
    # in descending order, without fully sorting the remaining entries
    top_results = np.argpartition(-cos_scores, range(top_k))[:top_k]

    print("\n\n======================\n\n")
    print("Query:", query)
    print("\nTop %d most similar sentences in corpus:" % top_k)

    for idx in top_results:
        print(corpus[idx].strip(), "(Score: %.4f)" % cos_scores[idx])







Query: 경제

Top 5 most similar sentences in corpus:
Regulation (Score: 0.3666)
신호 표지판 (Score: 0.2493)
자동 경찰 신고 (Score: 0.2446)
교통사고 (Score: 0.2329)
뺑소니 (Score: 0.2149)




Query: 사람

Top 5 most similar sentences in corpus:
AI (Score: 0.4513)
AI 측정 (Score: 0.4297)
Regulation (Score: 0.4265)
데이터 수집 (Score: 0.4144)
인원수 트래킹 (Score: 0.4118)




Query: 환경

Top 5 most similar sentences in corpus:
신호 표지판 (Score: 0.2913)
AI 측정 (Score: 0.2678)
신호 시간 조절 (Score: 0.2607)
Regulation (Score: 0.2603)
AI (Score: 0.2561)




Query: 교통

Top 5 most similar sentences in corpus:
교통 체증 (Score: 0.7444)
신호 표지판 (Score: 0.5760)
구간 속도 제한 (Score: 0.5585)
Aduanced traffic control (Score: 0.5220)
속도에 따른 방지턱 경고 생성 (Score: 0.4608)




Query: 삶

Top 5 most similar sentences in corpus:
Regulation (Score: 0.4117)
AI 측정 (Score: 0.3806)
AI (Score: 0.3645)
데이터 수집 (Score: 0.3511)
인원수 트래킹 (Score: 0.3458)




Query: 정부

Top 5 most similar sentences in corpus:
Regulation (Score: 0.3979)
AI (Score: 0.3712)
신호 표지판 (Score: 0.30