Damerau-Levenshtein Distance UDF for MySQL - Supports upper bounding for fast searching and UTF-8 case insensitive throught iconv.
Switch branches/tags
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Failed to load latest commit information.
bin
lib
src
.gitignore
LICENSE
Makefile
README
test.sh

README

Damerau-Levenshtein Distance UDF for MySQL
Supports upper bounding for fast searching and UTF-8 case
insensitive throught iconv.

Copyright (C) 2013 Diego Torres <diego dot torres at gmail dot com>


Implementing

    https://github.com/torvalds/linux/blob/8a72f3820c4d14b27ad5336aed00063a7a7f1bef/tools/perf/util/levenshtein.c


Redistribute as you wish, but leave this information intact.

Compiling

    Run make.    It will try to autodetect architecture and generate 32 or x86_64 bit
    version. Enable testing with "make debug". An executable instead of
    a library will be built. Execute it to run some tests regarding
    memory allocation and levenshtein core calculations.

    Errors you may get when compiling:

In file included from ./src/damlevlim.c:16:0:
./src/damlevlim.h:26:23: fatal error: my_global.h: No such file or directory
#include <my_global.h>
                   ^
compilation terminated.
Makefile:30: recipe for target 'mysqldamlevlim' failed
make: *** [mysqldamlevlim] Error 1

    Development libraries from mysql are missing.
    Do "sudo apt-get install libmysqlclient-dev"

/usr/include/gnu/stubs.h:7:27: fatal error: gnu/stubs-32.h: No such file or directory

    You're missing the 32 bit libc dev package. This is answered here:

http://stackoverflow.com/questions/7412548/gnu-stubs-32-h-no-such-file-or-directory

    - On Ubuntu it's called libc6-dev-i386 - do "sudo apt-get install"
    libc6-dev-i386. See below for extra instructions for Ubuntu 12.04.
    - On Red Hat distros, the package name is glibc-devel.i686 (Thanks to
    David Gardner's comment)
    - On CentOS 5.8, the package name is glibc-devel.i386 (Thanks to JimKleck's comment)
    - On CentOS 6.3, the package name is glibc-devel.i686.
    - On SLES it's called glibc-devel-32bit - do zypper in glibc-devel-32bit

Loading UDFs in MySQL STEP 1

    Copy compiled udf to plugin dir. To know where exactly, ask MySQL:

    mysql> SHOW VARIABLES LIKE 'plugin_dir';

    and use result directory.

    To modify plugin_dir to point to a custom location:

    my.cnf: plugin_dir = /usr/lib/mysql/plugin

    Don't forget to restart MySQL!!


Loading UDFs in MySQL STEP 2

    Bind the library to an UDF inside MySQL (only do it once,
    survives MySQL reboot). Depending your MySQL binaries (32
    or 64 bits), insert the right library.

    mysql> DROP FUNCTION IF EXISTS damlevlim;

    mysql> CREATE FUNCTION damlevlim RETURNS INT SONAME 'mysqldamlevlim.so';

    Check if UDF was loaded OK.

    mysql> SELECT * FROM mysql.func;
    +-----------+-----+-------------------+----------+
    | name      | ret | dl                | type     |
    +-----------+-----+-------------------+----------+
    | damlevlim |   2 | mysqldamlevlim.so | function |
    +-----------+-----+-------------------+----------+
    1 row in set (0.00 sec)


Using damlevlim

    mysql> select damlevlim();
    ERROR: DAMLEVLIM() requires three arguments (string, string, int)

    First string is compared against second string. Compare only the first
    int chars (upper bound).

    mysql> select damlevlim('coche', 'çóche', 5);
    +--------------------------------+
    | damlevlim('coche', 'çòche', 5) |
    +--------------------------------+
    |                              0 |
    +--------------------------------+
    1 row in set (0.00 sec)


Further reading about MySql, UTF8 and charset encoding

    http://www.oreillynet.com/onlamp/blog/2006/01/turning_mysql_data_in_latin1_t.html


How to search for multibyte characters in a column

    Use MySql CHAR_LENGTH function to find rows with multi-byte characters:

    mysql> SELECT name FROM users WHERE LENGTH(name) != CHAR_LENGTH(name);


How to search for strange multibyte characters in a column

    (don't remember source of original tip)
    Use MySql HEX and PHP bin2hex functions

    mysql> SELECT name, HEX(name) FROM users;

    Get the result back into PHP, and run a bin2hex on the string, compare
    it to MySQL’s hex of that same string.