Local Sensitivity Hashing Using the ‘Trend Micro’ ‘TLSH’ Implementation
‘Trend Micro’ provides an open source library (https://github.com/trendmicro/tlsh/) for locality-sensitive hashing. Methods are provided to compute and compare hashes from character/byte streams.
- Jonathan Oliver, Chun Cheng and Yanggui Chen, “TLSH - A Locality Sensitive Hash” 4th Cybercrime and Trustworthy Computing Workshop, Sydney, November 2013
- Jonathan Oliver, Scott Forman and Chun Cheng, “Using Randomization to Attack Similarity Digests” Applications and Techniques in Information Security. Springer Berlin Heidelberg, 2014. 199-210.
- Jonathan Oliver and Jayson Pryde’s Trend Micro Blog Post
- File input utilities
- File input DSL verb
- Docs
- Tests
- toString() method
- Reference class-backed DSL
The following functions are implemented:
“Simple” interface (quick and dirty hashing):
- tlsh_simple_hash: Compute TLSH hash for a character or raw vector and return hash fingerprint
- tlsh_simple_diff: Compute the difference between two character hashes
DSL: (WIP)
- tlsh: Create a new ‘tlsh’ object
- tlsh_reset: Clear content and hash computation from a ‘tlsh’ object fingerprint
- tlsh_update: Update the ‘tlsh’ object with content
- tlsh_finalize: Finalize a ‘tlsh’ object hash
- tlsh_is_valid: Test if a ‘tlsh’ hash object is valid
- tlsh_hash: Retrieve the hex-encoded hash string for a ‘tlsh’ object
- tlsh_dist: Compute distance between two TLSH objects
- tlsh_stats: Return a data frame of lvalue and q1/2 ratios from a ‘tlsh’ object
TODO: Document DSL
devtools::install_github("hrbrmstr/tlsh")
library(tlsh)
library(tidyverse)
# current version
packageVersion("tlsh")
## [1] '0.1.0'
The examples below use four files bundled with the package:
- index.html is a static copy of a blog main page with a bunch of <div>s with article snippets
- index1.html is the same file as index.html with a changed cache timestamp at the end
- index2.html is the same file as index.html with one article snippet removed
- RMacOSX-FAQ.html is the CRAN ‘R for Mac OS X FAQ’
doc1 <- as.character(xml2::read_html(system.file("extdat", "index.html", package="tlsh")))
doc2 <- as.character(xml2::read_html(system.file("extdat", "index1.html", package="tlsh")))
doc3 <- as.character(xml2::read_html(system.file("extdat", "index2.html", package="tlsh")))
doc4 <- as.character(xml2::read_html(system.file("extdat", "RMacOSX-FAQ.html", package="tlsh")))
# generate hashes
(h1 <- tlsh_simple_hash(doc1))
## [1] "B253F9F3168DC8354B2363E2A585771CD25A803BCEA099C1FBED54ACA790EB5B137346"
(h2 <- tlsh_simple_hash(doc2))
## [1] "6153E8F3168DC8355B2363E2A585771CD26A803BCEA099C1FBED44AC9790EB5B137346"
(h3 <- tlsh_simple_hash(doc3))
## [1] "6443E8F3168DC8355B6262F2A9C5771CD25A802BCEA099C1FBED54AC9780FF4A137346"
(h4 <- tlsh_simple_hash(doc4))
## [1] "B8B3A52F93C0233E0F1216576F192FA812FD5C7EA3802188B557C67F8712D9A47666BB"
# compute distance
tlsh_simple_diff(h1, h2)
## [1] 7
tlsh_simple_diff(h1, h3)
## [1] 18
tlsh_simple_diff(h1, h4)
## [1] 334
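To view every pairwise distance at once, the four hashes above can be compared in one pass. This is a minimal sketch that relies only on tlsh_simple_diff() taking two hash strings, as described in the function list; the labels index, index1, index2 and faq and the variable names hashes, pairs and pairwise are introduced here purely for illustration.

```r
library(tlsh)

# Hash strings copied from the output above; the names are illustrative labels
hashes <- c(
  index  = "B253F9F3168DC8354B2363E2A585771CD25A803BCEA099C1FBED54ACA790EB5B137346",
  index1 = "6153E8F3168DC8355B2363E2A585771CD26A803BCEA099C1FBED44AC9790EB5B137346",
  index2 = "6443E8F3168DC8355B6262F2A9C5771CD25A802BCEA099C1FBED54AC9780FF4A137346",
  faq    = "B8B3A52F93C0233E0F1216576F192FA812FD5C7EA3802188B557C67F8712D9A47666BB"
)

# Every unordered pair of labels, one pair per column
pairs <- combn(names(hashes), 2)

# One row per pair with the distance reported by tlsh_simple_diff()
pairwise <- data.frame(
  a = pairs[1, ],
  b = pairs[2, ],
  distance = mapply(
    function(a, b) tlsh_simple_diff(hashes[[a]], hashes[[b]]),
    pairs[1, ], pairs[2, ]
  ),
  row.names = NULL
)

pairwise
```

Sorting pairwise by distance puts the near-duplicate pages first and the unrelated FAQ document last, which is the pattern the individual calls above already show.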
doc1 <- as.character(xml2::read_html(system.file("extdat", "index.html", package="tlsh")))
tlsh() %>%
  tlsh_update(doc1) %>%
  tlsh_finalize() -> x
tlsh_hash(x)
## [1] "B253F9F3168DC8354B2363E2A585771CD25A803BCEA099C1FBED54ACA790EB5B137346"
tlsh_is_valid(x)
## [1] TRUE
tlsh_stats(x)
## # A tibble: 1 x 3
##   l_value q1_ratio q2_ratio
##     <int>    <int>    <int>
## 1      53       15        9
doc2 <- charToRaw(as.character(xml2::read_html(system.file("extdat", "index1.html", package="tlsh"))))
tlsh() %>%
  tlsh_update(doc2) %>%
  tlsh_finalize() -> y
tlsh_dist(x, y)
## [1] 7
tlsh_reset(x)
tlsh_reset(y)
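The DSL steps above can be wrapped in a small helper when the input is a file on disk. The sketch below is an assumption-laden illustration, not part of the package: file_tlsh() is a hypothetical name, and it reads the file's raw bytes with base R's readBin() and passes them to tlsh_update(), which the example above shows accepting a raw vector. Because it hashes the bytes as stored rather than the text re-serialized by xml2::read_html(), the resulting hashes and distances may differ from those shown earlier.

```r
library(tlsh)
library(magrittr)  # provides %>%

# Hypothetical helper (not exported by 'tlsh'): build a finalized
# 'tlsh' object from the raw bytes of a file
file_tlsh <- function(path) {
  # Read the whole file as a raw vector
  bytes <- readBin(path, what = "raw", n = file.size(path))

  tlsh() %>%
    tlsh_update(bytes) %>%
    tlsh_finalize()
}

a <- file_tlsh(system.file("extdat", "index.html", package = "tlsh"))
b <- file_tlsh(system.file("extdat", "index2.html", package = "tlsh"))

# Only compare objects that produced a valid hash
if (tlsh_is_valid(a) && tlsh_is_valid(b)) {
  file_distance <- tlsh_dist(a, b)
  print(file_distance)
}
```

Checking tlsh_is_valid() before calling tlsh_dist() is a defensive choice made here so that an input that does not yield a valid hash is skipped rather than compared.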
Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.