<a href="https://colab.research.google.com/github/howard-haowen/NLP-demos/blob/main/find_similar_texts_with_sentence_transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Download the dataset

In [1]:
!wget http://od.cdc.gov.tw/pr/CDC_chatbox.csv -O faq.csv

--2023-02-09 14:21:12--  http://od.cdc.gov.tw/pr/CDC_chatbox.csv
Resolving od.cdc.gov.tw (od.cdc.gov.tw)... 35.229.205.172
Connecting to od.cdc.gov.tw (od.cdc.gov.tw)|35.229.205.172|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://od.cdc.gov.tw/pr/CDC_chatbox.csv [following]
--2023-02-09 14:21:14--  https://od.cdc.gov.tw/pr/CDC_chatbox.csv
Connecting to od.cdc.gov.tw (od.cdc.gov.tw)|35.229.205.172|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 600208 (586K) [text/csv]
Saving to: ‘faq.csv’


2023-02-09 14:21:15 (729 KB/s) - ‘faq.csv’ saved [600208/600208]



# Load the dataset as dataframe

In [2]:
import pandas as pd

In [3]:
df = pd.read_csv('faq.csv')
df.shape

(2332, 7)

In [4]:
TEXT_COL = 'Question'
COLS_TO_SHOW = [TEXT_COL, 'Disease', 'Answer1', 'Answer2']
df

Unnamed: 0,No,Disease,Question,Answer1,Answer2,Answer3,Answer4
0,1,A型肝炎,什麼是A型肝炎？,A型肝炎是由A型肝炎病毒感染所造成的急性肝臟發炎,A型肝炎主要為經口（糞口）感染，簡單來說就是食用遭A型肝炎病毒汙染的食物或水而感染，所以前往...,,
1,2,A型肝炎,A型肝炎會有什麼症狀？,這問題非常好！感染A型肝炎後，可能會出現發燒、全身倦怠不適、食慾不振、嘔吐及腹部不舒服症狀，...,症狀的嚴重程度通常隨著年齡增加而增加。大部分6歲以下的小朋友感染後沒有出現症狀或症狀輕微，而...,,
2,3,A型肝炎,罹患A型肝炎的嚴重性？,A型肝炎的致死率低（約千分之三），造成死亡的情形多半為猛爆性肝炎，通常發生於老年患者或有慢性...,,,
3,4,A型肝炎,A型肝炎有什麼併發症,A型肝炎感染很嚴重時，可能造成急性肝衰竭，僅有少數病例會因猛爆性肝炎而死亡,,,
4,5,A型肝炎,A型肝炎有哪些高風險族群,像比如說前往A型肝炎流行地區（例如非洲、南美洲、中國大陸、東南亞及南亞地區等）旅遊或工作的人,還有特殊職業如廚師及餐飲食品從業人員、醫療照護者、嬰幼兒保育工作者,那患有慢性肝病、血友病、曾經移植肝臟的病人、靜脈藥癮者、男男間性行為者也要特別注意！,
...,...,...,...,...,...,...,...
2327,2330,愛滋病,孕婦好像感染了愛滋病怎麼辦？,應儘速至愛滋病指定醫院接受篩檢，若確診感染應接受藥物治療，降低傳染給寶寶的風險,,,
2328,2331,淋病,孕婦要如何預防淋病？,避免不安全性行為及其他感染風險行為等，才能有效預防感染淋病,,,
2329,2332,淋病,孕婦好像感染了淋病怎麼辦？,勿自行至藥局買藥或誤信偏方，應儘速就醫並告知醫師妊娠狀態，以利醫師評估治療方式,,,
2330,2333,梅毒,孕婦要如何預防梅毒？,避免不安全性行為，並配合孕婦產檢時程進行梅毒篩檢，才能有效避免感染,,,


# Load a pretrained model

In [5]:
!pip install -q -U sentence-transformers

In [6]:
from sentence_transformers import SentenceTransformer

In [7]:
embed_model = SentenceTransformer("shibing624/text2vec-base-chinese")
# English glosses: "How do I become a member?" / "What is the procedure for joining?"
sentences = ['如何加入會員', '入會辦法是什麼']
sentence_embeddings = embed_model.encode(sentences)

print("Sentence embeddings:")
print(sentence_embeddings)



Sentence embeddings:
[[ 0.59205085  0.27223465  0.08369207 ... -0.31279552  0.13715449
   0.35445943]
 [ 0.00216622 -0.1309313   0.16524468 ... -0.24950682 -0.56418824
   0.5346472 ]]


In [8]:
from numpy import dot
from numpy.linalg import norm

a = sentence_embeddings[0]
b = sentence_embeddings[1]
cos_sim = dot(a, b)/(norm(a)*norm(b))
print(cos_sim)

0.60111433
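The cosine similarity above compares one pair of vectors. The same formula extends to whole batches by normalizing each row to unit length and taking a matrix product. The following is a minimal sketch that assumes only NumPy; `cosine_similarity_matrix` is a hypothetical helper name, and the toy vectors stand in for real sentence embeddings.

```python
import numpy as np

def cosine_similarity_matrix(x: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Return pairwise cosine similarities between the rows of x and the rows of y."""
    # Scale every row to unit length; the dot product of unit vectors is their cosine
    x_unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    y_unit = y / np.linalg.norm(y, axis=1, keepdims=True)
    return x_unit @ y_unit.T

# Toy 2-D vectors (hypothetical data, not real embeddings)
toy = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]])
sims = cosine_similarity_matrix(toy, toy)
print(np.round(sims, 3))
```

Passing `sentence_embeddings` for both arguments would give a 2 x 2 matrix whose off-diagonal entries should match the pairwise value computed above.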


In [9]:
texts = df[TEXT_COL]
texts[3]

'A型肝炎有什麼併發症'

In [10]:
embeddings = embed_model.encode(texts)

In [11]:
len(embeddings)

2332

In [12]:
len(embeddings[0])

768

# Embed the corpus and build an embedding index

In [13]:
!pip install -q faiss-cpu

In [14]:
import faiss
import numpy as np

def create_index_embeddings(embed_arrays: np.ndarray, index_arrays: np.ndarray):
    """Build a Faiss index over the embeddings, keyed by the given IDs."""

    # Step 1: Cast to float32, the dtype Faiss expects
    embeddings = embed_arrays.astype("float32")

    # Step 2: Instantiate a flat (exact-search) index that scores by inner product (IP)
    # Note: IP equals cosine similarity only if the vectors are L2-normalized
    index = faiss.IndexFlatIP(embeddings.shape[1])

    # Step 3: Wrap the index in IndexIDMap so that custom IDs can be stored
    index = faiss.IndexIDMap(index)

    # Step 4: Add vectors and their IDs
    index.add_with_ids(embeddings, index_arrays)

    return index, embeddings

In [15]:
type(embeddings)

numpy.ndarray

In [16]:
doc_ids = df.index.to_numpy()
type(doc_ids)

numpy.ndarray

In [17]:
fs_index, fs_embeddings = create_index_embeddings(embeddings, doc_ids)
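Because `faiss.IndexFlatIP` ranks by raw inner product, its ranking matches cosine similarity only when the vectors have unit length. The sketch below illustrates the difference with plain NumPy as a stand-in for the Faiss search; `l2_normalize`, `top_k_inner_product`, and the toy vectors are hypothetical and not part of the notebook. In Faiss itself, `faiss.normalize_L2` normalizes a float32 array in place and could be applied before adding vectors to the index.

```python
import numpy as np

def l2_normalize(vectors: np.ndarray) -> np.ndarray:
    """Scale each row to unit L2 norm (guarding against zero-length rows)."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)

def top_k_inner_product(corpus: np.ndarray, query: np.ndarray, k: int):
    """Brute-force inner-product search: return the top-k row ids and their scores."""
    scores = corpus @ query
    order = np.argsort(-scores)[:k]  # highest score first
    return order, scores[order]

# Toy corpus: row 0 is long but points away from the query direction
corpus = np.array([[10.0, 0.0], [1.0, 1.0], [1.0, 3.0]], dtype="float32")
query = np.array([1.0, 1.0], dtype="float32")

# Raw inner product favors the long vector
raw_ids, raw_scores = top_k_inner_product(corpus, query, k=2)

# After normalization, inner product is cosine similarity
unit_query = l2_normalize(query[None, :])[0]
unit_ids, unit_scores = top_k_inner_product(l2_normalize(corpus), unit_query, k=2)

print("raw ranking: ", raw_ids, raw_scores)
print("unit ranking:", unit_ids, unit_scores)
```

Whether normalization matters in practice depends on how much the embedding norms vary across the corpus; comparing the two rankings on a few queries is a cheap way to check.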

# Search similar texts by a user query

In [18]:
def search_by_user_query(query: str,  # User query text
                         embed_model=embed_model,  # SentenceTransformer model
                         index=fs_index,  # Faiss index
                         df=df,  # Corpus as a dataframe
                         topK=10):  # Number of results to return

    # Embed the query with the same model used for the corpus
    # Note: Wrap the query in a list so that encode() returns a 2D array, as above
    embeddings = embed_model.encode([query])

    # Convert the embeddings to float32, the dtype Faiss expects
    embeddings = np.array(embeddings).astype("float32")

    # Get the scores (D) and the IDs (I) of the topK nearest neighbors
    # Note: With IndexFlatIP, D holds inner products, so higher means more similar
    D, I = index.search(embeddings, k=topK)

    # Look up the matching rows; the IDs are the dataframe index labels
    results_df = df.loc[I.flatten(), COLS_TO_SHOW]

    return results_df


In [19]:
query = "破傷風有哪些症狀"  # "What are the symptoms of tetanus?"
search_by_user_query(query)

Unnamed: 0,Question,Disease,Answer1,Answer2
643,破傷風會有什麼症狀？,破傷風,最常見之初症狀為腹部僵硬及肌肉痙攣，典型的破傷風痙攣現象為「角弓反張」(opisthoton...,1.疼痛性肌肉收縮開始為下顎肌與頸部肌，其次為軀幹肌\n 2.開口障礙，吞嚥困難，四肢僵硬強...
645,破傷風有什麼危險徵兆,破傷風,無法張開嘴、吞嚥及呼吸困難,
642,什麼是破傷風？,破傷風,破傷風由破傷風桿菌之外毒素（exotoxin）所引起，其特徵為痛性之肌肉收縮（最初在咬肌及頸...,
671,破傷風要怎麼治療？,破傷風,1.肌肉注射破傷風免疫球蛋白(TIG)，並取少量局部注射於傷口周圍。\n 2.抗生素治療；抗...,4.支持性療法最重要，包括維持患者呼吸道之暢通，暗室中照顧，必要時可以肌肉鬆弛劑保持患者之鎮...
646,破傷風有什麼併發症,破傷風,喉痙攣、骨折、肺栓塞、吸入性肺炎、呼吸困難而導致死亡,
647,破傷風有哪些高風險族群,破傷風,沒有接種過疫苗或距離最後一次破傷風疫苗接種超過 10 ?者。建議在工作中接觸土壤、污物、動物...,
2042,破傷風用藥有哪些種類？,破傷風,1.肌肉注射破傷風免疫球蛋白（TIG），並取少量局部注射於傷口周圍。\n2.口服（或靜注）m...,
651,破傷風流行地區有哪些,破傷風,破傷風病例通常會發生在農業區或低度開發地區，因為該等地區較易與動物之排泄物接觸，或預防接種情...,
2035,罹患破傷風要注意什麼？,破傷風,請儘速至附近醫療院所就醫，接受肌肉注射破傷風免疫球蛋白(TIG)，抗生素需持續治療10～14...,
690,新生兒破傷風會有什麼症狀？,新生兒破傷風,典型特徵是嬰兒出生幾天(3至28天，通常6天)後，吸吮動作和哭泣情形由正常漸漸轉變為吸奶困難...,
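The results above are ranked, but the similarity scores returned by the search are discarded. One possible extension is to attach them as an extra column. The sketch below assumes pandas and NumPy only; `toy_df`, `D`, and `I` are hypothetical stand-ins for the corpus dataframe and for the two arrays that `index.search` returns. It also drops IDs equal to -1, which Faiss uses as a placeholder when fewer than `k` neighbors are found.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the corpus and for the (scores, ids) pair from index.search
toy_df = pd.DataFrame(
    {"Question": ["q0", "q1", "q2"], "Disease": ["d0", "d1", "d2"]}
)
D = np.array([[0.91, 0.55, 0.0]], dtype="float32")  # scores, best first
I = np.array([[2, 0, -1]], dtype="int64")           # ids; -1 marks an empty slot

# Keep only the slots that hold a real id
ids = I.flatten()
scores = D.flatten()
valid = ids != -1

# Select the rows in ranked order and attach the score to each one
results = toy_df.loc[ids[valid]].assign(score=scores[valid])
print(results)
```

Showing the score next to each hit makes it easier to pick a cutoff below which results are treated as irrelevant.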
