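### Assignment Requirements
Use the data below to build a question-answering bot that accepts a user's question and returns the answer whose question is most similar:

1. Download the PCHOME FAQ (https://raw.githubusercontent.com/ywchiu/tibame_tm/master/data/pchome_qa.xlsx) and save it to the working environment.
2. Use Pandas to read the file into a DataFrame named qa.
3. Segment the question column into words and store the segmented text in a corpus named corpus (data type: List).
4. Use sklearn to convert the corpus into a TF-IDF matrix named tfidf (TfidfVectorizer).
5. Segment the user's question 「請問要如何查詢我的訂單」 ("How do I check my order?"), convert it to a vector, and use cosine similarity to return the most likely answer.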


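### Grading Criteria
1. The code runs without errors (20%)
2. The code reads the file into a DataFrame named qa (20%)
3. The code segments the question column and stores the segmented text in the corpus (20%)
4. The code uses sklearn to convert the data into a TF-IDF matrix named tfidf (20%)
5. Given the user input 「請問要如何查詢我的訂單」, the code uses cosine similarity to return the most likely answer (20%)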

In [170]:
! wget https://raw.githubusercontent.com/ywchiu/tibame_tm/master/data/pchome_qa.xlsx

--2022-10-28 09:03:03--  https://raw.githubusercontent.com/ywchiu/tibame_tm/master/data/pchome_qa.xlsx
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10143 (9.9K) [application/octet-stream]
Saving to: ‘pchome_qa.xlsx.3’


2022-10-28 09:03:03 (64.5 MB/s) - ‘pchome_qa.xlsx.3’ saved [10143/10143]



## User question input

In [171]:
user_question = input("Please type your question.\n")

Please type your question.
網路交易注意事項


## Open the Excel file

In [172]:
# Solution 2: read the Excel file directly from the URL
import pandas
url = "https://raw.githubusercontent.com/ywchiu/tibame_tm/master/data/pchome_qa.xlsx"
qa = pandas.read_excel(url, index_col = 0)
# print(qa)

## Jieba text segmentation

In [173]:
import jieba
import numpy

# Clean the text and word segmentation
qa_ary = qa.to_numpy()
# print(qa_ary)

q_ary = []
a_ary = []
jieba_q_ary = []
jieba_a_ary = []

for q_and_a in qa_ary:
  q = "".join(q_and_a[0].strip().split())
  a = "".join(q_and_a[1].strip().split())
  jieba_q = " ".join([w for w in jieba.lcut(q) if len(w) >= 2])
  jieba_a = " ".join([w for w in jieba.lcut(a) if len(w) >= 2])
  q_ary.append(q)
  a_ary.append(a)
  jieba_q_ary.append(jieba_q)
  jieba_a_ary.append(jieba_a)
# print(jieba_q_ary)
# print(jieba_a_ary)


#Jieba for user question
user_question = "".join(user_question.strip().split())
jieba_user_question = " ".join([w for w in jieba.lcut(user_question) if len(w) >= 2])
# print(user_question)
# print(jieba_user_question)
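
The same clean-up and segmentation steps are applied to the questions, the answers, and the user's input. A minimal sketch of factoring them into one helper follows. The names `tokenize`, `cut`, and `min_len` are hypothetical, introduced only for illustration, and the default segmenter is assumed to be `jieba.lcut` as in the cell above.

```python
def tokenize(text, cut=None, min_len=2):
    """Remove all whitespace, segment the text, and keep tokens with at least min_len characters.

    `tokenize`, `cut`, and `min_len` are hypothetical names for this sketch.
    """
    if cut is None:
        # Imported lazily so that a different segmenter can be passed in instead.
        import jieba
        cut = jieba.lcut
    cleaned = "".join(text.strip().split())
    # Join the surviving tokens with spaces, the format the vectorizers expect.
    return " ".join(w for w in cut(cleaned) if len(w) >= min_len)
```

With this helper, `tokenize(user_question)` should give the same space-separated string that the cell above stores in `jieba_user_question`.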

## Vectorize the corpus and calculate the cosine similarity

In [174]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

t_vec = TfidfVectorizer()
t_vec.fit(jieba_q_ary)
# print(t_vec.get_feature_names_out())

q_ary_vector = t_vec.transform(jieba_q_ary).toarray()
# print(q_ary_vector)
user_question_vector = t_vec.transform([jieba_user_question]).toarray()
# print(user_question_vector)

cos_sim_matrix = cosine_similarity(q_ary_vector, user_question_vector)
print("cosine_similarity_matrix:\n", cos_sim_matrix)
answer_index = numpy.argmax(cos_sim_matrix)
print("answer_index: ", answer_index)

print("User's question: ", user_question)
print("Most similar question: ", q_ary[answer_index])
print("Answer: ", a_ary[answer_index])


cosine_similarity_matrix:
 [[0.        ]
 [0.        ]
 [0.        ]
 [0.        ]
 [0.18085505]
 [0.        ]
 [0.        ]
 [0.        ]
 [0.        ]
 [0.        ]
 [0.        ]
 [0.        ]
 [0.        ]
 [0.        ]
 [1.        ]
 [0.42993157]
 [0.        ]
 [0.        ]
 [0.        ]
 [0.        ]
 [0.        ]
 [0.        ]
 [0.        ]
 [0.        ]
 [0.        ]
 [0.        ]
 [0.        ]
 [0.        ]]
answer_index:  14
User's question:  網路交易注意事項
Most similar question:  網路交易注意事項
Answer:  由於網路詐騙案件層出不窮,手法也不斷更新，PChome商店街在此特別提醒您，商店街的店家與PChome工作人員，均不會要求消費者至提款機操作任何功能，請小心勿上當。如果接獲不明人士來信或來電，應立即撥打165防詐騙專線查詢或透過「PChome商店街服務中心」查證。PChome商店街與您一起努力維護網路交易安全！
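
When the user's question shares no vocabulary with any stored question, every similarity is 0 and `numpy.argmax` returns index 0, so the first answer would be returned regardless of what was asked. A minimal sketch of a guard follows. `pick_answer` and `min_score` are hypothetical names, and the cutoff of 0.1 is an arbitrary illustration, not a tuned value.

```python
def pick_answer(scores, min_score=0.1):
    """Return the index of the highest score, or None if no score reaches min_score.

    `pick_answer` and `min_score` are hypothetical; the default cutoff is only an example.
    """
    scores = [float(s) for s in scores]
    if not scores:
        return None
    # Index of the first maximum, matching the behavior of argmax.
    best = max(range(len(scores)), key=scores.__getitem__)
    return best if scores[best] >= min_score else None
```

It could be called as `pick_answer(cos_sim_matrix.ravel())`, with the caller printing a fallback message when the result is `None`.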


## Vectorize the corpus and calculate the Jaccard similarity

In [175]:
from sklearn.feature_extraction.text import CountVectorizer
from numpy import bitwise_and, bitwise_or

t_vec = CountVectorizer()
t_vec.fit(jieba_q_ary)
# print(t_vec.get_feature_names_out())

q_ary_vector = t_vec.transform(jieba_q_ary).toarray()
# print(q_ary_vector)
user_question_vector = t_vec.transform([jieba_user_question]).toarray()
# print(user_question_vector)

q_ary_vector_bool = (q_ary_vector > 0).astype(int)
user_question_vector_bool = numpy.tile((user_question_vector > 0).astype(int), (len(q_ary_vector_bool), 1))

# print(q_ary_vector_bool.shape)
# print(user_question_vector_bool.shape)

# print(bitwise_and(q_ary_vector_bool, user_question_vector_bool)[0])
# print(bitwise_or(q_ary_vector_bool, user_question_vector_bool)[0])

def jaccard_sim(vec1, vec2):
  return numpy.sum(bitwise_and(vec1, vec2), axis=1) / numpy.sum(bitwise_or(vec1, vec2), axis=1)

jaccard_sim_matrix = jaccard_sim(q_ary_vector_bool, user_question_vector_bool)
# print(jaccard_sim_matrix.shape)
print("jaccard_similarity_matrix:\n", jaccard_sim_matrix)
answer_index = numpy.argmax(jaccard_sim_matrix)
print("answer_index: ", answer_index)

print("User's question: ", user_question)
print("Most similar question: ", q_ary[answer_index])
print("Answer: ", a_ary[answer_index])

jaccard_similarity_matrix:
 [0.         0.         0.         0.         0.1        0.
 0.         0.         0.         0.         0.         0.
 0.         0.         1.         0.28571429 0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.        ]
answer_index:  14
User's question:  網路交易注意事項
Most similar question:  網路交易注意事項
Answer:  由於網路詐騙案件層出不窮,手法也不斷更新，PChome商店街在此特別提醒您，商店街的店家與PChome工作人員，均不會要求消費者至提款機操作任何功能，請小心勿上當。如果接獲不明人士來信或來電，應立即撥打165防詐騙專線查詢或透過「PChome商店街服務中心」查證。PChome商店街與您一起努力維護網路交易安全！
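
The vectorized computation above amounts to intersection size over union size on the sets of tokens present in each text. A minimal pure-Python sketch of the same formula follows. `jaccard_tokens` is a hypothetical helper, and returning 0.0 for an empty union is an assumption made here to avoid dividing by zero. Results can differ from the cell above when the user's question contains tokens outside the fitted vocabulary, because the vectorizer drops those tokens while this sketch counts them in the union.

```python
def jaccard_tokens(a, b):
    """Jaccard similarity between two space-separated token strings.

    `jaccard_tokens` is a hypothetical helper for illustration.
    """
    set_a, set_b = set(a.split()), set(b.split())
    union = set_a | set_b
    if not union:
        # Assumption: two empty token sets are treated as having no similarity.
        return 0.0
    return len(set_a & set_b) / len(union)
```

For example, `jaccard_tokens(jieba_user_question, jieba_q_ary[14])` compares the user's tokens with those of the matched question.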
