<a href="https://colab.research.google.com/github/panghanwu/tibame_project/blob/main/fuzzy_search_oop.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fuzzy Search
---

### Required packages
- gensim
- py2neo
- numpy

In [1]:
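# Download the pretrained fastText Chinese model (cc.zh.300) and decompress it
!wget https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.zh.300.bin.gz
!gunzip cc.zh.300.bin.gz

# Import FastText from gensim
from gensim.models.fasttext import FastText

# Load the fastText model as ft_model
# The model is large (about 4.2 GB compressed), so loading takes a while
# Note: load_fasttext_format is deprecated in newer gensim releases;
# gensim.models.fasttext.load_facebook_model is the replacement there
ft_model = FastText.load_fasttext_format('cc.zh.300.bin')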

--2020-11-27 15:46:13--  https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.zh.300.bin.gz
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 172.67.9.4, 104.22.75.142, 104.22.74.142, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|172.67.9.4|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4478681770 (4.2G) [application/octet-stream]
Saving to: ‘cc.zh.300.bin.gz’


2020-11-27 15:51:54 (12.5 MB/s) - ‘cc.zh.300.bin.gz’ saved [4478681770/4478681770]



In [2]:
# py2neo is a Python client library for Neo4j
# It is not preinstalled on Colab, so install it separately
!pip install py2neo

Collecting py2neo
  Downloading https://files.pythonhosted.org/packages/4f/86/4cb8118794ab5965335bc8f3315c414a05cbbe5d9f978f8fcbed1bc819af/py2neo-2020.1.1-py2.py3-none-any.whl (185kB)

In [3]:
import numpy as np
import py2neo as neo

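# Fuzzy search
class FuzzySearch():
    def __init__(self, description_list, node_list, gender=None):
        # Required arguments:
        # 1. description_list: list of recognized description words
        # 2. node_list: list of all product nodes from Neo4j
        # 3. gender: string (or None)

        # gender must be 'male', 'female', or None
        assert gender in ['male', 'female', None]
        self.sex = gender
        self.des = description_list

        # Convert the description words into 300-dimensional word vectors with ft_model
        self.vec = np.zeros(300)
        for d in self.des:
            # Sum all the word vectors
            self.vec += ft_model.wv[d]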

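        # Filter the product node list by gender:
        # keep nodes whose serial number 'sn' starts with 'M' (male) or 'F' (female)
        if self.sex == 'male':
            self.product = [x for x in node_list if x['sn'][0]=='M']
        elif self.sex == 'female':
            self.product = [x for x in node_list if x['sn'][0]=='F']
        else:
            self.product = node_list

        # Create an empty matrix as a container
        self.pro_vec = np.empty((len(self.product),300))
        # Load each product's word vector from the product node list
        for i, n in enumerate(self.product):
            str_vec = n['vector']
            # Parse the space-separated string and store it in the container
            self.pro_vec[i] = np.fromstring(str_vec, sep=' ')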
    
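    # Find the product semantically closest to the description
    def match(self):
        # Compute the cosine of the angle between the description vector
        # and every product vector
        dot  = np.dot(self.vec, self.pro_vec.T)
        norm = np.linalg.norm(self.vec) * np.linalg.norm(self.pro_vec, axis=1)
        cos  = dot / norm
        # Index of the product with the smallest angle (largest cosine)
        recom = np.argmax(cos)

        # Return the product's serial number, name, and image URL
        # as a tuple
        return (self.product[recom]['sn'],
                self.product[recom]['name'],
                self.product[recom]['image_url'])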
    
    def __len__(self):
        return len(self.product)
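
The matching step can be tried without the multi-gigabyte model or a Neo4j server. The following is a minimal sketch, assuming made-up 3-dimensional vectors in place of the 300-dimensional fastText ones; `toy_words`, `toy_products`, and `cosine_scores` are hypothetical names for illustration and are not part of the notebook.

```python
import numpy as np

# Hypothetical 3-dimensional "word vectors" standing in for ft_model.wv
toy_words = {
    'short': np.array([1.0, 0.0, 0.0]),
    'plain': np.array([0.0, 1.0, 0.0]),
}

# Hypothetical product vectors, one row per product
toy_products = np.array([
    [0.0, 0.0, 1.0],   # product 0: unrelated direction
    [2.0, 2.0, 0.0],   # product 1: same direction as 'short' + 'plain'
    [1.0, 0.0, 1.0],   # product 2: partly related
])

def cosine_scores(query, matrix):
    """Cosine similarity between one query vector and each row of matrix."""
    dot = matrix @ query
    norm = np.linalg.norm(query) * np.linalg.norm(matrix, axis=1)
    return dot / norm

# Sum the word vectors, as FuzzySearch.__init__ does
query = toy_words['short'] + toy_words['plain']
scores = cosine_scores(query, toy_products)
best = int(np.argmax(scores))

# Cosine ignores vector length, so rescaling the query
# (for example averaging instead of summing) keeps the same scores
scores_mean = cosine_scores(query / 2, toy_products)
```

Because the cosine depends only on direction, summing the word vectors gives the same ranking as averaging them, which is why the class can skip dividing by the number of words.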



### Example

In [4]:
# Connect to the Neo4j server (this instance expires in two days)
print('Connecting to the server...')

# Server address and password
sever_link = 'bolt://100.25.221.22:42993'
pws = 'farm-distributions-battles'

# Load the graph database as fashion_map
fashion_map = neo.Graph(sever_link, password=pws)
# Fetch all product nodes into node_list
node_list = list(neo.NodeMatch(fashion_map, labels=frozenset(['Product'])))

print('Done.')

Connecting to the server...
Done.
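
`FuzzySearch` reads each product node's `vector` property as a space-separated string of numbers. A minimal sketch of that parsing step, assuming a hypothetical node represented as a plain dict with a shortened 4-value vector (real nodes hold 300 values); `fake_node` and `parse_vector` are illustrative names only:

```python
import numpy as np

# Hypothetical stand-in for a py2neo node; the field values are made up
fake_node = {'sn': 'MU00', 'name': 'example item', 'vector': '0.5 -1.25 3 0'}

def parse_vector(text):
    """Turn a whitespace-separated string of numbers into a float array."""
    return np.array(text.split(), dtype=float)

vec = parse_vector(fake_node['vector'])
```

Checking `vec.shape` against `(300,)` before filling the product matrix would catch malformed entries early.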


In [10]:
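# A made-up recognition description for testing
# (patchwork, very hot, solid color, short sleeves)
text = ['拼接','很熱','素色','短袖']

# FuzzySearch is a class; the search only runs when .match() is called
# Arguments, in order: description, product node list, gender
FuzzySearch(text, node_list, gender='male').match()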

('MU15',
 '刷毛格紋寬版襯衫',
 'https://im.uniqlo.com/images/tw/gu/pc/goods/329789/item/16_329789.jpg')
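
The serial number in this result starts with `M` because `gender='male'` restricts the candidates to nodes whose `sn` begins with `M`. A minimal sketch of that filter, assuming hypothetical nodes represented as plain dicts; the serial numbers below are made up and only follow the same prefix convention, and `PREFIX` and `filter_by_gender` are illustrative names:

```python
# Hypothetical product nodes; only the 'sn' field matters here
nodes = [
    {'sn': 'MU01'}, {'sn': 'FF02'}, {'sn': 'MU03'}, {'sn': 'FF04'},
]

# Map each gender to the serial-number prefix checked in FuzzySearch
PREFIX = {'male': 'M', 'female': 'F'}

def filter_by_gender(node_list, gender=None):
    """Keep nodes matching the gender prefix; None keeps every node."""
    if gender is None:
        return list(node_list)
    prefix = PREFIX[gender]
    return [n for n in node_list if n['sn'].startswith(prefix)]

male = [n['sn'] for n in filter_by_gender(nodes, 'male')]
female = [n['sn'] for n in filter_by_gender(nodes, 'female')]
everyone = filter_by_gender(nodes)
```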

In [12]:
# The return value is a tuple,
# so it can be unpacked into separate variables
sn, name, url = FuzzySearch(text, node_list, gender='female').match()
# Note that gender is now 'female', so the recommendation differs
sn

'FF19'