A string matching library similar to a NaiveBayes classifier, but optimized for use cases where you have many possible matches.
This is especially useful if you have two large lists of names/titles/descriptions to match with each other.
Pull in this library from hex.pm. Then in your project you can do the following.
matcher = Bayesic.Trainer.new()
|> Bayesic.train(["it","was","the","best","of","times"], "novel")
|> Bayesic.train(["tonight","on","the","seven","o'clock"], "news")
|> Bayesic.finalize()
Bayesic.classify(matcher, ["the","best","of"])
# => %{"novel" => 1.0, "news" => 0.667}
Bayesic.classify(matcher, ["the","time"])
# => %{"novel" => 0.667, "news" => 0.667}
This library uses the basic idea of Bayes Theorem.
It records which tokens it has seen for each possible classification. Later when you pass a set of tokens and ask for the most likely classification it looks for all potential matches and then ranks them by considering the probabily of any given match according to the tokens that it sees.
Tokens which exist in many records (ie not very unique) have a smaller impact on the probability of a match and more unique tokens have a larger impact.
Performance varies a lot depending on the size and type of your data.
I have a built in a benchmark that you can run via mix run benchmarks/training_and_classifying.exs
.
This benchmark loads in 60k movies and tokenizes their titles by finding all the words (downcased) in the title of the movie.
We benchmark the time it takes to train on that dataset and also the time it takes to do a specific classification as well as a more generic classification.
Currently this benchmark shows the following results on my laptop:
Name ips average deviation median 99th %
match 1 word 92.36 K 10.83 μs ±3286.36% 0 μs 0 μs
match 3 words 37.77 K 26.48 μs ±2157.82% 0 μs 0 μs
training 0.00086 K 1158191.81 μs ±26.79% 1262370.70 μs 1550091.70 μs
This means it takes ~1.2sec to train the classifier on 60k titles and 10 - 26µs to do a classification of tokens on that classifier.
I don't know, but you can pretty easily test it using the benchmarks/training_and_matching.exs
script in this project.
Just generate 2 CSV files:
- The first file should have 2 columns
source_string
andsource_id
- The second file should have 2 columns
match_string
andsource_id
Then run mix run benchmarks/test_your_data.exs path/to/first_file.csv path/to/second_file.csv
.
The benchmark contains a sample tokenizer that breaks strings into words, removes punctuation, throws away single-letter words and downcases. You can replace the
tokenizer
function in the benchmark to try other forms of tokenization.
This will benchmark how long it takes to train a matcher with the data in your first file and it will also benchmark how long it takes to attempt to classify all of the entries in your second file.
The reported time for matching is the time to match all of the rows in your second file.
I use this in a project where I have ~10k possible matches and currently this libray trains the matcher in ~48ms and each attempt to classify takes ~38µs.
For my use case 26k
matches per second is "fast enough".