# ANN via Annoy

In this notebook, we use a public dataset of Japanese company names and addresses to try approximate nearest neighbor (ANN) search with Annoy, a library developed by Spotify.

Before importing the libraries below, you may need to install fastText by following the steps [here](https://fasttext.cc/docs/en/support.html#building-fasttext-python-module).

In [1]:
import annoy
import pandas as pd
import numpy as np
import fasttext.util

## Load the dataset

> NOTE: Skip to "Load the cache file" if you have already saved an npz cache.

Here we use a public dataset containing all company names and addresses in Japan.  You can download the dataset from [here](https://info.gbiz.go.jp/hojin/DownloadTop).

In [2]:
data = pd.read_csv('./resource/company_address.csv', dtype=str)


In [3]:
# convert NaN to empty string
data = data.fillna("")
data.sample(5)

Unnamed: 0,法人番号,法人名,郵便番号,g1,g2,g3,g4,rest
1732163,3110002029379,有限会社森山新聞店,9400044,新潟県,長岡市,住吉,2-4-14,
3615002,8020005013987,特定非営利活動法人日本セラエクサ協会,2360052,神奈川県,横浜市金沢区,富岡西,4-76-16,
262208,9410003002340,トヨタ觀光不動産開發合資会社,140311,秋田県,仙北市,角館町田町上丁,88,
800718,1010501033728,株式会社川口,1100016,東京都,台東区,台東,4-6-5,
2899321,7480002010455,有限会社幸徳運輸,7790312,徳島県,鳴門市,大麻町東馬詰字諏訪の元,41-1,


In [4]:
data.columns

Index(['法人番号', '法人名', '郵便番号', 'g1', 'g2', 'g3', 'g4', 'rest'], dtype='object')

In [5]:
fields_for_index = ['法人名', '郵便番号', 'g1', 'g2', 'g3', 'g4', 'rest']    
# try a small set
data.sample(5)[fields_for_index].values

array([['有限会社スパイス', '1140001', '東京都', '北区', '東十条', '3-5-12', ''],
       ['八幡神社', '3491142', '埼玉県', '加須市', '杓子木', '148', ''],
       ['南信木材工業株式会社', '3991201', '長野県', '下伊那郡天龍村', '平岡', '1415', ''],
       ['株式会社彩コーポレーション', '1450071', '東京都', '大田区', '田園調布', '2-15-7',
        '田園調布ヒルズ502号'],
       ['株式会社スマイルチンタイ', '8570052', '長崎県', '佐世保市', '松浦町', '2-17', '']],
      dtype=object)

### Experiment
Previously I tested AnnoyIndex with the `features` in separate columns (by flattening the array).
The results were poor: I observed that fields with shorter text became disproportionately significant when matching.

So here I merge all the columns into one sentence before embedding, with the following rules:
- Put a space between the organization name (`法人名`) and the address.
- Put no space between the address parts `g1`, `g2`, and `g3`; put a space before `g4` and before the final column `rest`.

In [6]:
data["concat_name"] = data['法人名'] +  " " + data["g1"] + data["g2"] + data["g3"] + " " + data["g4"] + " " + data["rest"]
data["concat_name"].sample(5).values

array(['有限会社岳野建設 長崎県西海市大瀬戸町瀬戸西濱郷 411-19 ', '株式会社中岡 大阪府大阪市平野区加美西 1-15-4 ',
       '株式会社エス・ケーリース 東京都調布市仙川町 2-7-7 ', '有限会社丸田運輸 佐賀県佐賀市大和町大字尼寺 1891-1 ',
       '有限会社十八番 東京都西多摩郡瑞穂町むさし野 2-48-14 '], dtype=object)
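The sample output above ends with a trailing space whenever `rest` is empty. A minimal sketch of one way to avoid that, assuming a hypothetical helper `join_fields` (not part of the original notebook) that applies the same spacing rules but skips empty parts:

```python
def join_fields(name, g1, g2, g3, g4, rest):
    """Join one record into a single sentence, skipping empty parts.

    Hypothetical helper for illustration; it follows the same spacing
    rules as the pandas expression in the cell above.
    """
    # g1, g2, and g3 are glued together without spaces.
    area = g1 + g2 + g3
    # Keep only non-empty parts so no leading or trailing space is produced.
    parts = [p for p in (name, area, g4, rest) if p]
    return " ".join(parts)


print(join_fields("株式会社川口", "東京都", "台東区", "台東", "4-6-5", ""))
```

On the existing column, `data["concat_name"].str.strip()` would also remove the trailing space without changing the rest of the pipeline.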

In [7]:
# Here is a subset of the converted data
# NOTE: remove `.sample(1000)` if you want to try the whole dataset
features = data["concat_name"].sample(1000).values

In [8]:
# Delete the DataFrame to save memory
del data

-----
## Convert to Embedding

AnnoyIndex expects numeric vectors as input, so we first need to convert the text dataset into embeddings.
Here we use the word vectors pre-trained by fastText, available [here](https://fasttext.cc/docs/en/crawl-vectors.html).

In [9]:
# Download the pre-trained Japanese model (skipped if the file already exists)
fasttext.util.download_model('ja', if_exists='ignore')

'cc.ja.300.bin'

In [10]:
# Load the pre-trained model
ft = fasttext.load_model('cc.ja.300.bin')

In [11]:
# First 10 dimensions of the embedding of the first sample
ft.get_word_vector(features[0])[:10]

array([-0.0113956 , -0.00107074,  0.00674794, -0.00688771,  0.00272137,
       -0.0007362 , -0.01064106,  0.00073996, -0.00454364, -0.00703756],
      dtype=float32)

fastText even provides a nearest-neighbor function :P

In [12]:
ft.get_nearest_neighbors('こんにちは')


[(0.9167303442955017, 'こんばんは'),
 (0.9152794480323792, 'こんにちわ'),
 (0.8549860715866089, 'こんばんわ'),
 (0.7946215271949768, 'はじめまして'),
 (0.7212551236152649, 'おはよう'),
 (0.6740288734436035, 'どーも'),
 (0.6339334845542908, 'こんち'),
 (0.6219503283500671, 'どうも'),
 (0.6091426014900208, 'みなさん'),
 (0.5997785925865173, 'ゃにゃちは')]

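Now we convert all text features into embeddings.
Each sentence is passed to `get_word_vector` as one string, so fastText builds its vector from the string's character n-grams rather than from a vocabulary lookup.
This might take a few minutes.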

In [13]:
features_vec = np.zeros((features.shape[0], ft.get_dimension()))

for i, sentence in enumerate(features):
    features_vec[i] = ft.get_word_vector(sentence)

In [14]:
print(features.shape)
# (sample, vector dim)
print(features_vec.shape)

(1000,)
(1000, 300)
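The fastText output shown earlier has `dtype=float32`, while `np.zeros` defaults to `float64`, which doubles the memory of `features_vec`. That is negligible for 1000 rows but could matter on the whole dataset. A minimal sketch of the same loop with an explicit `float32` buffer follows; `embed_all` and `fake_embed` are hypothetical names, and `fake_embed` is only a stand-in for `ft.get_word_vector` so the example runs without the model file:

```python
import numpy as np


def embed_all(sentences, embed_fn, dim):
    """Stack one embedding per sentence into an (n, dim) float32 array."""
    out = np.zeros((len(sentences), dim), dtype=np.float32)
    for i, sentence in enumerate(sentences):
        out[i] = embed_fn(sentence)
    return out


def fake_embed(sentence, dim=4):
    """Hypothetical stand-in for ft.get_word_vector (random values)."""
    rng = np.random.default_rng(len(sentence))
    return rng.standard_normal(dim).astype(np.float32)


vecs = embed_all(["東京都台東区", "大阪府堺市南区"], fake_embed, dim=4)
print(vecs.shape, vecs.dtype)
```

With the real model, the call would look like `embed_all(features, ft.get_word_vector, ft.get_dimension())`.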


Store the converted arrays in an npz file so we don't need to repeat this step every time.

In [15]:

np.savez('sample_1000.npz', features=features, features_vec=features_vec)

-------
## Load the cache file

If you saved the npz file before, you can start from here directly.


In [19]:
# allow_pickle=True is needed because `features` was saved as an object array
with np.load("sample_1000.npz", allow_pickle=True) as data:
    print("Keys in npz file: ", data.files)
    features = data["features"]
    features_vec = data["features_vec"]
print(features.shape)
print(features_vec.shape)

Keys in npz file:  ['features', 'features_vec']
(1000,)
(1000, 300)
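The load cell needs `allow_pickle=True` because `features` comes from pandas as an object array. If loading pickled data is a concern, one option is to cast the strings to a fixed-width unicode dtype before saving. A minimal sketch with toy arrays and a temporary file path (both are stand-ins, not the notebook's real data):

```python
import os
import tempfile

import numpy as np

# Toy stand-ins for `features` and `features_vec`.
features = np.array(
    ["東京都台東区台東 4-6-5", "新潟県長岡市住吉 2-4-14"], dtype=object
)
features_vec = np.zeros((2, 3), dtype=np.float32)

path = os.path.join(tempfile.mkdtemp(), "sample.npz")
# Casting the object array to a unicode dtype avoids pickling on save.
np.savez(path, features=features.astype(str), features_vec=features_vec)

# The default allow_pickle=False is enough now.
with np.load(path) as cache:
    loaded = cache["features"]
    loaded_vec = cache["features_vec"]

print(loaded.dtype.kind, loaded.shape, loaded_vec.shape)
```

The trade-off is that a fixed-width unicode array reserves space for the longest string in every element, so it can use more memory than the object array when lengths vary a lot.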


Take a look at a few samples:

In [22]:
print(features[:5])
# First 5 rows (the original `[:5][:10]` sliced rows twice, which is the same as `[:5]`)
print(features_vec[:5])

['北海道網走郡美幌町字栄町 4-10-6 ' '東京都江東区海辺 16-10 ' '大阪府堺市南区和田東 999-1 '
 '愛知県名古屋市千種区城木町 1-13 ' '鹿児島県志布志市志布志町安楽 2581-8 ']
[[-1.09495688e-02  3.20577458e-03 -4.78808099e-04 ... -7.76627660e-03
  -2.34671105e-02  1.96876237e-03]
 [-3.60717007e-04  1.98141905e-03 -3.94161325e-03 ... -5.49455080e-03
   4.95103840e-03  9.80593497e-04]
 [-5.71642828e-04 -6.75229821e-04  1.11593818e-02 ... -8.25786963e-03
  -4.05592255e-05  2.39052693e-03]
 [-2.50220438e-03  7.16901198e-03  1.20953145e-02 ... -7.61051057e-03
  -7.72130862e-03  6.91708410e-04]
 [-1.11861143e-03  3.19462339e-03  7.88116176e-03 ... -4.67671221e-03
  -5.70234470e-03  2.03545089e-03]]


-------------
## Build Annoy Index

Define the number of trees in the forest. Annoy builds each tree by recursively splitting the data with random hyperplanes, so more trees generally give more accurate results at the cost of a larger index and a longer build. The number of trees is a hyperparameter; you may need to experiment with different values to find a good setting for your dataset:

In [102]:
# Adjust it to optimize the search performance
n_trees = 1000
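One way to judge whether a given `n_trees` is large enough is to compare the approximate results with exact brute-force neighbors and measure the overlap. A minimal NumPy-only sketch follows; `exact_neighbors` and `recall_at_k` are hypothetical helper names, and random vectors stand in for `features_vec`:

```python
import numpy as np


def exact_neighbors(vectors, query, k):
    """Return the indices of the k rows closest to `query` (Euclidean)."""
    dists = np.linalg.norm(vectors - query, axis=1)
    return np.argsort(dists)[:k].tolist()


def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact neighbors that the approximate search found."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


# Random data stands in for `features_vec`.
rng = np.random.default_rng(0)
vectors = rng.standard_normal((1000, 300)).astype(np.float32)
query = vectors[0]

exact = exact_neighbors(vectors, query, k=10)

# Identical result lists give a recall of 1.0.
print(recall_at_k(exact, exact))
# A list that shares only 5 of the 10 ids gives 0.5.
print(recall_at_k(exact[:5] + [-1] * 5, exact))
```

In this notebook, the approximate ids would come from `index.get_nns_by_vector(query, 10)` once the index is built; averaging the recall over a handful of queries for several `n_trees` values gives a rough picture of the trade-off.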

Here we go! We build the whole forest from the dataset. This can take a while.

In [147]:
# Note: shape[1] is the dimension of the embedding
index = annoy.AnnoyIndex(features_vec.shape[1], metric='euclidean')

for i, row in enumerate(features_vec):
    index.add_item(i, row)

index.build(n_trees)

True

Save the ann forest!

In [47]:
index.save('test_1000.ann')

True

In [49]:
index = annoy.AnnoyIndex(features_vec.shape[1], metric='euclidean')
index.load('test_1000.ann') # super fast, will just mmap the file

True

-----
## Test


In [104]:
print(features[:5])

['愛知県岡崎市赤渋町字野中 11-3 ' '愛知県岡崎市伝馬通 5-7 ' '大阪府大阪市東淀川区西淡路 1-3-32 '
 '埼玉県志木市柏町 5-4-6 ホワイト20C号室' '石川県加賀市黒瀬町 354-3 ']


In [156]:
# 愛知県岡崎市赤渋町字野中 11-3
# 埼玉県志木市柏町 5-4-6 ホワイト20C号室
test_emb = ft.get_word_vector("埼玉県志木市  20C号室")
print(test_emb.shape)
# Get the indices of the 10 nearest neighbors
results, dists = index.get_nns_by_vector(test_emb, 10, search_k=-1, include_distances=True)
print(results)
print(dists)
# show the original label
for i, ind in enumerate(results):
    print(f"Distance : {dists[i]:0.4f} for entity: ", features[ind])

(300,)
[757, 151, 661, 299, 983, 382, 451, 623, 221, 878]
[0.11101184040307999, 0.11249707639217377, 0.11250942945480347, 0.11417366564273834, 0.11479569226503372, 0.11494060605764389, 0.11550846695899963, 0.11552218347787857, 0.11562547832727432, 0.11582881957292557]
Distance : 0.1110 for entity:  神奈川県横浜市瀬谷区阿久和西 4-21-1 ・1棟301号
Distance : 0.1125 for entity:  東京都千代田区神田佐久間町 3-31-3 
Distance : 0.1125 for entity:  京都府京都市西京区桂春日町 75-76 合地エスパシオ離宮208号室
Distance : 0.1142 for entity:  神奈川県横浜市中区太田町 2-32-1 ビラアペックス横浜関内6B
Distance : 0.1148 for entity:  埼玉県さいたま市浦和区木崎 4-10-11 セジュール立葉II103
Distance : 0.1149 for entity:  東京都千代田区神田佐久間町 1-14 第2東ビル717号室
Distance : 0.1155 for entity:  東京都千代田区飯田橋 2-6-6 ヒューリック飯田橋ビル6階
Distance : 0.1155 for entity:  神奈川県横浜市都筑区茅ケ崎中央 21-12 森村ビル2F
Distance : 0.1156 for entity:  福岡県北九州市若松区ひびきの 1-8 事業化支援センター502号室
Distance : 0.1158 for entity:  長野県北佐久郡軽井沢町大字発地 1623 
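All ten distances above fall within a narrow band around 0.11, and the raw fastText components are on the order of 1e-2, so differences in vector length may be influencing the Euclidean ranking. One thing that might be worth trying is to L2-normalize the vectors before indexing, or to build the index with `metric='angular'`, which the Annoy README describes as the Euclidean distance of normalized vectors. This is an assumption to test, not a confirmed fix. The sketch below uses random stand-in data and a hypothetical helper `l2_normalize` to show the relationship between the two distances:

```python
import numpy as np


def l2_normalize(vectors, eps=1e-12):
    """Scale each row to unit length; eps guards against zero vectors."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.maximum(norms, eps)


# Random data stands in for `features_vec`.
rng = np.random.default_rng(0)
vectors = rng.standard_normal((5, 300))
unit = l2_normalize(vectors)

u, v = unit[0], unit[1]
cosine = float(u @ v)
euclid = float(np.linalg.norm(u - v))

# For unit vectors, Euclidean distance equals sqrt(2 - 2 * cosine).
print(np.isclose(euclid, np.sqrt(2 - 2 * cosine)))
```

After normalization, ranking by Euclidean distance is equivalent to ranking by cosine similarity, so vector length no longer affects which neighbors come first.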
