
Doraemon Himitsu Dogu Japanese hybrid search based on Elasticsearch ANN x multi match


hurutoriya/doraemon-himitsu-dogu-search


Doraemon Himitsu Dogu Japanese semantic search

A Python-based Japanese semantic search engine for Doraemon's Himitsu Dogu (secret gadgets), built on Elasticsearch's approximate nearest neighbor (ANN) search feature.

Japanese: Elasticsearch の近似近傍探索機能を使ったドラえもんのひみつ道具自然言語検索エンジン

Key technology

Dataset

I built a Himitsu Dogu dataset based on this site: ひみつ道具カタログ (Himitsu Dogu Catalog).

Indexing phase

graph LR;
text-->|structure as JSON|json
json-->|description text|HuggingFace
json-->Elasticsearch
HuggingFace-->|encode into feature vectors|Elasticsearch
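The indexing flow above can be sketched in Python as follows. This is a minimal illustration, not the repository's actual `indexer.py`: the index name `himitsu_dogu`, the field names `name`, `description`, and `vector`, and the vector dimension of 768 are all assumptions made for the example. It only builds the index mapping and the bulk actions as plain dictionaries, so it runs without an Elasticsearch server.

```python
from typing import Dict, Iterator, List, Sequence

INDEX_NAME = "himitsu_dogu"  # hypothetical index name
VECTOR_DIMS = 768            # assumption: must match the encoder's output size


def build_mapping(dims: int = VECTOR_DIMS) -> Dict:
    """Mapping with kuromoji-analyzed text fields and an ANN-indexed vector."""
    return {
        "properties": {
            # Text fields for the multi match side (field names are assumed).
            "name": {"type": "text", "analyzer": "kuromoji"},
            "description": {"type": "text", "analyzer": "kuromoji"},
            # Dense vector field for the ANN side.
            "vector": {
                "type": "dense_vector",
                "dims": dims,
                "index": True,
                "similarity": "cosine",
            },
        }
    }


def build_bulk_actions(
    docs: List[Dict],
    vectors: Sequence[Sequence[float]],
    index: str = INDEX_NAME,
) -> Iterator[Dict]:
    """Pair each structured JSON document with its feature vector."""
    if len(docs) != len(vectors):
        raise ValueError("docs and vectors must have the same length")
    for doc_id, (doc, vector) in enumerate(zip(docs, vectors)):
        yield {
            "_index": index,
            "_id": doc_id,
            "_source": {**doc, "vector": list(vector)},
        }
```

With a running cluster, the mapping could be passed when creating the index and the generated actions handed to `elasticsearch.helpers.bulk`. The `kuromoji` analyzer requires the Elasticsearch `analysis-kuromoji` plugin to be installed.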

Search phase

graph LR;
Query-->|"強くなる (become stronger)"|B["HuggingFace(encoder)"]
subgraph Streamlit
    B
end
subgraph Elasticsearch
    A[multi match]
    ANN
    A-->hybrid
    ANN-->hybrid
end
B-->|"[1.2, ... 0.3]"|ANN
B-->|"kuromoji analyzer for Japanese"|A
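As a rough sketch of the search flow above, the request body for a hybrid search might be assembled like this. The field names `name`, `description`, and `vector` are hypothetical, and this README does not state how the two result sets are merged; the example assumes the Elasticsearch 8.x search API, which accepts a `query` and a top-level `knn` section in the same request.

```python
from typing import Dict, Sequence


def build_hybrid_search_body(
    query_text: str,
    query_vector: Sequence[float],
    k: int = 10,
    num_candidates: int = 100,
) -> Dict:
    """Combine a multi match query with an ANN (kNN) search in one request."""
    if num_candidates < k:
        raise ValueError("num_candidates must be greater than or equal to k")
    return {
        # Lexical side: matched against kuromoji-analyzed text fields
        # (field names are assumptions for this sketch).
        "query": {
            "multi_match": {
                "query": query_text,
                "fields": ["name", "description"],
            }
        },
        # Vector side: approximate nearest neighbor search on the
        # encoded query vector.
        "knn": {
            "field": "vector",
            "query_vector": list(query_vector),
            "k": k,
            "num_candidates": num_candidates,
        },
        "size": k,
    }
```

In the app, `query_vector` would come from encoding the user's query with the same HuggingFace model used at indexing time, and the resulting body could be sent with the Elasticsearch Python client's `search` method.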

How to set up

# Run this in the background...
$ make run-es
make run-es
es01
[+] Building 0.1s (6/6) FINISHED
 => [internal] load build definition from Dockerfile                                                                                                                           0.0s
 => => transferring dockerfile: 223B
...

$ make build-index
Get the certification for ElasticSearch
Make structured data from raw data
poetry run python doraemon_himitsu_dogu_search/preprocess.py
Run sentens vectorizer
poetry run python doraemon_himitsu_dogu_search/sentents_bert_vectorizer.py
Start BERT encode
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 41/41 [03:02<00:00,  4.45s/it]
End BERT encode
Start serialization as numpy file
End serialization
Run Elasticsearch indexing job
poetry run python doraemon_himitsu_dogu_search/indexer.py

$ make run-app
Get the certification for ElasticSearch
Running the web app for Doraemon himitsu dogu search
poetry run streamlit run doraemon_himitsu_dogu_search/app.py

  You can now view your Streamlit app in your browser.

  Local URL: http://localhost:8501

Related Posts
