The purpose of this repo is to test fuzzy search in Postgres and in the Qdrant vector database.
The test names were generated to include several spelling variants of the same name, e.g.:
- John Smith
- John Shmith
- ...
We are testing three methods of fuzzy search in Postgres:
- Trigrams (Similarity)
- Soundex
- Levenshtein distance
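To make the trigram method more concrete, here is a minimal Python sketch of how a trigram similarity score can be computed. It approximates the behaviour of pg_trgm for plain ASCII names (lowercasing, padding each word with two leading spaces and one trailing space, then comparing the sets of three-character substrings). The function names `trigrams` and `trigram_similarity` are made up for this illustration and are not part of this repo. Postgres itself does the real work.

```python
import re


def trigrams(text):
    """Return the set of trigrams for a string, roughly the way pg_trgm builds them."""
    grams = set()
    # Assumption: words are runs of ASCII letters and digits, compared case-insensitively.
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = "  " + word + " "  # two leading spaces, one trailing space
        for i in range(len(padded) - 2):
            grams.add(padded[i:i + 3])
    return grams


def trigram_similarity(a, b):
    """Shared trigrams divided by all distinct trigrams (a value between 0 and 1)."""
    grams_a, grams_b = trigrams(a), trigrams(b)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


if __name__ == "__main__":
    for candidate in ["John Smith", "John Shmith", "Jane Doe"]:
        score = trigram_similarity("John Smith", candidate)
        print(f"{candidate}: {score:.2f}")
```

A threshold such as 0.4 then decides which candidates count as a match: close misspellings share most of their trigrams, while unrelated names share few or none.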
Using the Qdrant vector database we are testing the "distance between names".
The names are first embedded using the open-source all-mpnet-base-v2 model available here.
Because of the way the embedding works, the results list the very similar names first,
followed by other names from the same language group.
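The idea of a "distance between names" can be sketched in a few lines of Python. This is an illustration only: it assumes cosine similarity as the metric (the metric configured in this repo's Qdrant collection is not stated here), and the helpers `cosine_similarity`, `rank_names` and `embed_names` are hypothetical names. `embed_names` additionally assumes the sentence-transformers package is installed and downloads the model on first use. The demo at the bottom uses made-up toy vectors so it runs without the model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_names(query_vector, candidates):
    """Sort (name, score) pairs from most to least similar to the query vector.

    candidates is a dict mapping a name to its embedding vector.
    """
    scored = [(name, cosine_similarity(query_vector, vec)) for name, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def embed_names(names):
    """Assumption: embeds names with sentence-transformers; not called in the demo below."""
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-mpnet-base-v2")
    return model.encode(names).tolist()


if __name__ == "__main__":
    # Toy 2-dimensional vectors standing in for real embeddings.
    toy = {"name A": [1.0, 0.0], "name B": [0.0, 1.0]}
    print(rank_names([1.0, 0.1], toy))
```

A vector database performs the same kind of ranking, but over an index, so it does not have to compare the query against every stored vector.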
- Remove the .gitkeep files from postgres-data
- Run the following command:
      docker-compose up
  Please note that the first run will take a while, as there are a few gigabytes of data to download. The first upsert to Qdrant will also take a while, as the model has to be downloaded and the index has to be built.
- Open http://localhost:5000/ in your browser
- Set up your environment by clicking the buttons in the first row
- Set up Qdrant by clicking the buttons in the second row (upserting embeddings may take a long time because the model data has to be downloaded first)
- Test search using the "Test search" page
When you change your requirements.txt file, you'll need to rebuild your Docker image to install the new Python packages.
Here are the steps:
- Stop the running Docker containers with the following command:
      docker-compose down
- Then rebuild and start your Docker containers:
      docker-compose up --build
CREATE EXTENSION pg_trgm;

usage:

SELECT
    *
FROM names
WHERE SIMILARITY(combined_name, 'John Smith') > 0.4;

CREATE EXTENSION fuzzystrmatch;

usage:

SELECT
    *
FROM names
WHERE SOUNDEX(combined_name) = SOUNDEX('John Smith');

SELECT
    *
FROM names
ORDER BY SIMILARITY(
    METAPHONE(combined_name, 10),
    METAPHONE('John Smith', 10)
) DESC
LIMIT 5;

SELECT
    *,
    LEVENSHTEIN(combined_name, 'John Smith')
FROM names
ORDER BY LEVENSHTEIN(combined_name, 'John Smith') ASC
LIMIT 5;
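
For readers who want to see what the Levenshtein query computes, here is a small Python sketch of the edit distance with unit costs for insertions, deletions and substitutions, plus a helper that mimics the ORDER BY ... LIMIT 5 pattern on an in-memory list. The function names `levenshtein` and `closest_names` are invented for this example. In the repo the computation is done by the fuzzystrmatch extension.

```python
def levenshtein(source, target):
    """Minimum number of single-character insertions, deletions or substitutions."""
    if len(source) < len(target):
        source, target = target, source
    # previous[j] holds the distance between the processed prefix of source and target[:j].
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def closest_names(names, query, limit=5):
    """Return up to `limit` (name, distance) pairs, smallest distance first."""
    scored = [(name, levenshtein(name, query)) for name in names]
    return sorted(scored, key=lambda pair: pair[1])[:limit]


if __name__ == "__main__":
    sample = ["Jon Smith", "John Shmith", "Jane Doe", "John Smith"]
    for name, distance in closest_names(sample, "John Smith"):
        print(distance, name)
```

Unlike the trigram score, this distance is an absolute count of edits, so the same value means more for short names than for long ones.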