pdulak/fuzzySearchTesting

Fuzzy Search with Postgres and Qdrant vector database - comparison

This repo tests and compares fuzzy search in Postgres and in the Qdrant vector database.

The names were generated to contain various versions of the same name, e.g.:

  • John Smith
  • John Shmith
  • ...

We are testing three methods of fuzzy search in Postgres:

  • Trigrams (Similarity)
  • Soundex
  • Levenshtein distance

With the Qdrant vector database we are testing the "distance between names". The names are first embedded using the open-source all-mpnet-base-v2 model. Because of the way the embedding works, the results list the very similar names first, followed by other names from the same language group.
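The "distance between names" idea can be sketched without Qdrant or the model. The example below is only an illustration: the three-dimensional vectors are made-up stand-ins for real embeddings, and cosine similarity is an assumed metric (this README does not say which distance the Qdrant collection uses).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings"; real model output has far more dimensions.
TOY_VECTORS = {
    "John Smith": [0.90, 0.10, 0.05],
    "John Shmith": [0.88, 0.12, 0.07],
    "Jan Kowalski": [0.10, 0.85, 0.30],
}


def nearest_names(query, vectors, limit=5):
    """Return the other names ordered from most to least similar to query."""
    query_vector = vectors[query]
    others = [name for name in vectors if name != query]
    others.sort(
        key=lambda name: cosine_similarity(query_vector, vectors[name]),
        reverse=True,
    )
    return others[:limit]


print(nearest_names("John Smith", TOY_VECTORS))
```

With these toy vectors, the misspelled variant ranks ahead of the unrelated name, which mirrors the ordering a vector search is expected to produce.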

How to run

  1. Remove .gitkeep files from postgres-data
  2. Run the following command:

    docker-compose up

Please note that the first run will take a while, as there are a few gigabytes of data to download. The first upsert to Qdrant will also take a while, as the model has to be downloaded and the index has to be built.

  3. Open http://localhost:5000/ in your browser
  4. Set up your environment by clicking the buttons in the first row
  5. Set up Qdrant by clicking the buttons in the second row - upserting embeddings may take a long time because the model data has to be downloaded first
  6. Test the search using the "Test search" page

Developer notes

New packages in requirements.txt

When you change your requirements.txt file, you'll need to rebuild your Docker image to install the new Python packages.

Here are the steps:

  1. Stop the running Docker containers with the following command:

    docker-compose down
  2. Then rebuild and start your Docker containers:

    docker-compose up --build

Postgres-related notes

Trigrams:

CREATE EXTENSION pg_trgm;

usage:

SELECT
	*
FROM names
WHERE SIMILARITY(combined_name, 'John Smith') > 0.4 ;
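To see roughly what the 0.4 threshold means, here is a minimal Python sketch of trigram similarity. It assumes the commonly described pg_trgm behaviour (lowercase the text, pad each word with two leading spaces and one trailing space, then divide the number of shared trigrams by the number of distinct trigrams overall). The function names are hypothetical and the real extension may differ in details.

```python
import re


def trigrams(text):
    """Set of 3-character substrings of each padded, lowercased word."""
    grams = set()
    # Words are runs of letters/digits; everything else is a separator.
    for word in re.findall(r"[^\W_]+", text.lower()):
        padded = "  " + word + " "
        for i in range(len(padded) - 2):
            grams.add(padded[i:i + 3])
    return grams


def trigram_similarity(left, right):
    """Shared trigrams divided by all distinct trigrams (0.0 to 1.0)."""
    a, b = trigrams(left), trigrams(right)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


names = ["John Smith", "John Shmith", "Jane Doe"]
matches = [n for n in names if trigram_similarity(n, "John Smith") > 0.4]
print(matches)
```

In this sketch "John Shmith" shares 9 of 14 distinct trigrams with "John Smith" (about 0.64), so it passes the 0.4 threshold, while "Jane Doe" shares only one trigram and is filtered out.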

Phonetic matching:

CREATE EXTENSION fuzzystrmatch;

usage:

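-- Exact match on the Soundex code
SELECT
	*
FROM names
WHERE SOUNDEX(combined_name) = SOUNDEX('John Smith');

-- Closest Metaphone codes, ranked by trigram similarity (also requires pg_trgm)
SELECT
	*
FROM names
ORDER BY SIMILARITY(
	METAPHONE(combined_name, 10),
	METAPHONE('John Smith', 10)
) DESC
LIMIT 5;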
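For intuition about what the Soundex comparison matches, the sketch below implements a simplified Soundex in Python: keep the first letter, map the remaining consonants to digits, skip vowels and repeated codes, and pad to four characters. This is an illustration only; fuzzystrmatch may treat some edge cases (for example H and W) differently.

```python
# Consonant groups and their Soundex digits; all other letters count as "0".
SOUNDEX_CODES = {}
for group, digit in (
    ("BFPV", "1"),
    ("CGJKQSXZ", "2"),
    ("DT", "3"),
    ("L", "4"),
    ("MN", "5"),
    ("R", "6"),
):
    for letter in group:
        SOUNDEX_CODES[letter] = digit


def soundex(text):
    """Simplified 4-character Soundex code; non-letters are ignored."""
    letters = [c for c in text.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    result = letters[0]
    previous = SOUNDEX_CODES.get(letters[0], "0")
    for letter in letters[1:]:
        code = SOUNDEX_CODES.get(letter, "0")
        # Skip vowels ("0") and codes that repeat the preceding letter's code.
        if code != "0" and code != previous:
            result += code
            if len(result) == 4:
                break
        previous = code
    return result.ljust(4, "0")


print(soundex("Smith"), soundex("Shmith"))
print(soundex("John Smith"), soundex("John Shmith"))
```

Because the code is only four characters long, applying it to a full name means the first name uses up most of the code. In this sketch both "John Smith" and "John Shmith" give J525, and the letters after the "m" of the surname no longer contribute.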

Levenshtein distance:

-- Five names with the fewest single-character edits from 'John Smith'
SELECT
	*,
	LEVENSHTEIN(combined_name, 'John Smith')
FROM names
ORDER BY LEVENSHTEIN(combined_name, 'John Smith') ASC
LIMIT 5;
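The Levenshtein distance counts the minimum number of single-character insertions, deletions and substitutions needed to turn one string into another. A minimal Python sketch of the standard dynamic-programming algorithm, with a hypothetical `closest_names` helper that mimics the ORDER BY ... LIMIT query, might look like this:

```python
def levenshtein(source, target):
    """Minimum number of single-character edits from source to target."""
    # previous[j] is the distance between the processed prefix of source
    # and the first j characters of target.
    previous = list(range(len(target) + 1))
    for i, source_char in enumerate(source, start=1):
        current = [i]
        for j, target_char in enumerate(target, start=1):
            cost = 0 if source_char == target_char else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def closest_names(names, query, limit=5):
    """Names ordered by ascending edit distance to query."""
    return sorted(names, key=lambda name: levenshtein(name, query))[:limit]


names = ["Jane Doe", "Jon Smyth", "John Shmith", "John Smith"]
print(closest_names(names, "John Smith", limit=3))
```

Like this sketch, the comparison is case-sensitive, so lowercasing both sides first may be worth considering when the data has inconsistent capitalization.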
