trying shingling / resemblance / simhash / sketching to do some data deduping
Ruby Erlang Java C++ Other
Latest commit 437c7bc Aug 21, 2015 @matpalm Update LICENSE
cpp minor mods, found from testing with larger (20e9) datasets May 16, 2009
erl per postcode processing Jul 12, 2009
hadoop use writable stringpair instead of tab seperated string Feb 8, 2012
pig first cut of pig only version May 4, 2010
ruby fully working. untested with larger dataset, expect mem problems with… Jul 8, 2009
LICENSE Update LICENSE Aug 21, 2015
README.textile Correct syntax for compare.rb call Aug 12, 2013
compare.rb instead of rotating bits run multiple random universal hashes, across… May 17, 2009
connected_components.rb include calculation of representative id from sketch duplicate sets Jul 9, 2009
convert_points_to_distances.rb add fastmap projection from distances to points; including some helpe… Oct 17, 2008
cpp_resem_per_postcode.rb per postcode processing Jul 12, 2009
determine_jaccard.rb process per sketch, everything done in single proc, sketch in common … Jun 11, 2009
examine_result.rb fix path bug so examine_results can load ruby/read_data.rb Jan 6, 2014
explode_ids.rb helper util for exploding dup_id files Jul 9, 2009
export_to_ggobi.rb ggobi exporter May 29, 2009
fastmap.rb wip May 12, 2009
filter_under.rb bug fix for combo.ids Jul 9, 2009
find_dups.rb per postcode processing Jul 12, 2009
freq.rb include calculation of representative id from sketch duplicate sets Jul 9, 2009
histo.rb minor mods, found from testing with larger (20e9) datasets May 16, 2009
mean_sqr_err.rb correct error calc Oct 20, 2008
per_postcode.rb per postcode processing Jul 12, 2009
plot.3d.rb gnuplot script generator Oct 23, 2008
plot_histo.sh minor mods, found from testing with larger (20e9) datasets May 16, 2009
reduce_clique.rb fully working. untested with larger dataset, expect mem problems with… Jul 8, 2009
result_to_dot.rb per postcode processing Jul 12, 2009
sans_phone_heading.rb per postcode processing Jul 12, 2009
sec_to_time.rb process per sketch, everything done in single proc, sketch in common … Jun 11, 2009
split.rb per postcode processing Jul 12, 2009
test.data fully working. untested with larger dataset, expect mem problems with… Jul 8, 2009
test.rb first cut of pig only version May 4, 2010
time_to_sec.rb process per sketch, everything done in single proc, sketch in common … Jun 11, 2009

README.textile

see matpalm.com/resemblance for a proper walkthrough
instructions here are more for replicating the results on the above project page

measuring exact resemblance (jaccard coeff against shingles)

ruby

run the ruby version and output all pairs with resemblance > 0.5


> cat test.data | ./ruby/shingle.rb coeff 0.5 > result
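as a rough illustration of what "jaccard coeff against shingles" means, here is a minimal sketch. it is not the code in ruby/shingle.rb; the function names, the character shingle length of 3 and the array-of-phrases input are all assumptions made for the example.

```ruby
require 'set'

# Break a string into the set of its overlapping character n-grams (shingles).
# The shingle length of 3 is an illustrative assumption.
def shingles(text, n = 3)
  chars = text.downcase.chars
  return Set[text.downcase] if chars.size <= n
  (0..chars.size - n).map { |i| chars[i, n].join }.to_set
end

# Jaccard coefficient: size of the intersection over size of the union.
def resemblance(a, b)
  union = (a | b).size
  return 0.0 if union.zero?
  (a & b).size.to_f / union
end

# Brute force comparison of every pair, keeping those above min_resemblance.
# Returns triples of [index_a, index_b, resemblance].
def all_pairs(phrases, min_resemblance)
  sets = phrases.map { |p| shingles(p) }
  results = []
  sets.each_index do |i|
    ((i + 1)...sets.size).each do |j|
      r = resemblance(sets[i], sets[j])
      results << [i, j, r] if r > min_resemblance
    end
  end
  results
end
```

the nested loop is the order n squared cost that the simhash and sketching heuristics further down try to avoid.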

cpp

run the cpp version, outputting resemblances above 0 (ie all of them)

it writes one output file per core (N files for N cores), so the results need to be collated

> cat test.data | cpp/bin/Release/resemblance 0
> cat resemblance.*.out > result

munging

examine a result file

ie see phrases used for comparison rather than just raw result (eg 1 3 0.88)

> cat test.data | ./examine_result.rb result

generate a histogram of resemblance values

> cat result | ./histo.rb | sort -n > histo.dat
> ./plot_histo.sh histo.dat histo.100.png
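the bucketing step can be pictured roughly as below. this is only a sketch and not the contents of histo.rb: it assumes result lines look like the `1 3 0.88` example above, and the function name and the default of 100 buckets are made up for illustration.

```ruby
# Count resemblance values per bucket.
# Each line is assumed to end with the resemblance, eg "1 3 0.88".
def histogram(lines, buckets = 100)
  counts = Hash.new(0)
  lines.each do |line|
    value = line.split.last.to_f
    # a resemblance of exactly 1.0 is folded into the top bucket
    bucket = [(value * buckets).floor, buckets - 1].min
    counts[bucket] += 1
  end
  counts
end
```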

measuring resemblance (distance)

ruby

> cat test.data | ./ruby/shingle.rb distance 0 > distances.original

cpp

change the code, because i haven't yet (ie wip!)

converting from distances to points

project distances into 2 dimensional space

then check mean square error for projected distances versus original distances

(similarly, change 2 to whatever dimensionality you want)

> cat distances.original | ./fastmap.rb 2 > points.2d
> cat points.2d | ./convert_points_to_distances.rb > distances.projected.2d
> ./mean_sqr_err.rb distances.original distances.projected.2d
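the check in the last two steps (points back to distances, then mean square error against the originals) can be sketched as follows. this is not the repo's code: the `a b distance` line format, the hash-of-points input and all function names are assumptions for the example.

```ruby
# Parse lines assumed to look like "a b distance" into { [a, b] => distance }.
# Pair keys are sorted so "1 2" and "2 1" refer to the same pair.
def parse_distances(lines)
  lines.each_with_object({}) do |line, acc|
    a, b, d = line.split
    acc[[a, b].sort] = d.to_f
  end
end

# Euclidean distance between every pair of projected points.
# points is assumed to be { id => [x, y, ...] }.
def pairwise_distances(points)
  ids = points.keys.sort
  ids.combination(2).each_with_object({}) do |(a, b), acc|
    acc[[a, b]] = Math.sqrt(points[a].zip(points[b]).sum { |x, y| (x - y)**2 })
  end
end

# Mean of the squared differences over the pairs present in both inputs.
def mean_square_error(original, projected)
  shared = original.keys & projected.keys
  return 0.0 if shared.empty?
  shared.sum { |k| (original[k] - projected[k])**2 } / shared.size
end
```

a lower error means the projection preserved the original distances better, which is the reason to retry with more dimensions.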

plot 2d and 3d data with gnuplot

gnuplot> plot 'points.2d' with dots, 'points.2d' with labels
gnuplot> splot 'points.3d' with dots, 'points.3d' with labels
gnuplot> splot 'points.11d' with dots, 'points.11d' with labels # good luck with this one! ;)

simhash heuristic

compare simhash against the brute force, order n squared, compare-everything approach (considering only pairs with resemblance above 0.5)

> cat test.data | ./ruby/shingle.rb coeff 0.5 | sort -nr -k3 > shingling.result
> cat test.data | ./ruby/simhash.rb | sort -nr -k3 > simhash.result
> ./compare.rb shingling.result simhash.result 0.5
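for reference, the textbook simhash idea looks roughly like this. it is a generic sketch and may differ from what ruby/simhash.rb does: the md5-based 64 bit hash, the function names and the use of hamming distance as the comparison are all assumptions here.

```ruby
require 'digest'

# Stable 64 bit hash of a string: the first 8 bytes of its MD5 digest.
# (An assumption for the example; any well mixed 64 bit hash would do.)
def hash64(str)
  Digest::MD5.digest(str).unpack1('Q>')
end

# Simhash: every feature votes +1 or -1 on each bit position according to
# its own hash, and the fingerprint keeps the bits with a positive total.
def simhash(features, bits = 64)
  weights = Array.new(bits, 0)
  features.each do |f|
    h = hash64(f)
    bits.times { |i| weights[i] += h[i] == 1 ? 1 : -1 }
  end
  weights.each_with_index.sum { |w, i| w > 0 ? (1 << i) : 0 }
end

# Number of bit positions in which two fingerprints differ.
def hamming(a, b)
  (a ^ b).to_s(2).count('1')
end
```

similar feature sets tend to give fingerprints a small hamming distance apart, so candidate duplicates can be found without comparing every pair of shingle sets.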

sketching heuristic

compare sketching against the brute force, order n squared, compare-everything approach (considering only pairs with resemblance above 0.5)
use a 64 bit hash, calculate 10 shingles and cut off at MAX/2

> cat test.data | ./ruby/shingle.rb coeff 0.5 | sort -nr -k3 > shingling.result
> cat test.data | ./ruby/sketch.rb 64 10 2 | sort -nr -k3 > sketch.result
> ./compare.rb shingling.result sketch.result 0.5
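one common way to build such a sketch is min-hashing with several hash functions, shown below as a hedged illustration. it is not ruby/sketch.rb: the multiply-add hash family, the fixed seed and the function names are assumptions, and the MAX/2 cutoff from the command above is not modelled.

```ruby
require 'digest'

# Stable 64 bit base hash: the first 8 bytes of the MD5 digest.
def base_hash(str)
  Digest::MD5.digest(str).unpack1('Q>')
end

# k seeded hash functions of the form (a * x + b) mod 2^64, with a odd.
# A simple stand-in for a family of random universal hashes.
def hash_family(k, seed = 42)
  rng = Random.new(seed)
  Array.new(k) { [rng.rand(1 << 64) | 1, rng.rand(1 << 64)] }
end

# The sketch keeps, for each hash function, the minimum hash over all shingles.
def sketch(shingles, family)
  mask = (1 << 64) - 1
  family.map do |a, b|
    shingles.map { |s| (a * base_hash(s) + b) & mask }.min
  end
end

# Estimated resemblance: fraction of positions where the two sketches agree.
def estimate(sketch_a, sketch_b)
  sketch_a.zip(sketch_b).count { |x, y| x == y }.to_f / sketch_a.size
end
```

with 10 hash functions the estimate can only take values in steps of 0.1, so more functions give a finer estimate at the cost of larger sketches.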

sketching in erlang

> cd erl && rake
> ln -s ../test.data . # todo: make path configurable
> erl -noshell -pa erl/ebin -s main main > erl.result