Skip to content

himanshudas/html-distance

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1 Commit
 
 
 
 
 
 
 
 

Repository files navigation

html-distance

html-distance is a go library for computing the proximity of the HTML pages. The implementation similiarity fingerprint is Charikar's simhash.

We used BK Tree (Burkhard and Keller) for verifying if a fingerprint is closed to a set of fingerprint within a defined proximity distance.

Distance is the hamming distance of the fingerprints. Since fingerprint is of size 64 (inherited from hash/fnv), Similiarity is defined as 1 - d / 64.

In normal scenario, similarity > 95% (i.e. d>3) could be considered as duplicated html pages.

Command Line Interface

Usage of html-distance:

    go run html-distance.go

Example

$ go run html-distance.go

Enter first url:
http://www.google.com
Enter second url:
http://www.facebook.com

Fetching http://www.google.com, Got 200

Fetching http://www.facebook.com, Got 200

Fingerprint1     100011001000101110010101111011000101011101100101111000000010
Fingerprint2  101101000011101100001010011010011000001010011101111100100100100

Feature Distance is 28.
Shingle factor is 2.
HTML Similarity is 56.25%

Credits

  • The html-distance implementation is missing from the gryffin project:
  https://github.com/yahoo/gryffin
  • This is proof of concept implementation for html-distance and could be used independent of gryffin project.

Talks and Slides

Licence

Code licensed under the BSD-style license. See LICENSE file for terms.

About

html-distance implementation in golang

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages