Solr Similar Text
=================

This notebook tests using the _ssdeep_ fuzzy hashes stored in Solr to find similar items.

Originally, I'd hoped that the ability to search for matches within a given edit distance would make this very powerful. However, the Lucene engine only supports a maximum edit distance of 2 characters (see [fuzzy search](https://lucene.apache.org/solr/guide/6_6/the-standard-query-parser.html#TheStandardQueryParser-FuzzySearches)), so in practice only very close matches are found this way.
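To get a feel for how tight that limit is, here is a minimal sketch of an edit-distance check. It uses plain Levenshtein distance, which is only an approximation of what Lucene does (Lucene's fuzzy matching can also count a transposition of two adjacent characters as a single edit). The function names and the second hash value are made up for illustration; only `cKCxEn` comes from the query shown further down.

```python
def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance, computed row by row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


# Upper limit accepted by the Lucene fuzzy operator (~).
MAX_FUZZY_EDITS = 2


def within_fuzzy_range(a: str, b: str) -> bool:
    """True if b could match a under the maximum fuzzy edit distance."""
    return edit_distance(a, b) <= MAX_FUZZY_EDITS


# 'cKAxFm' is a hypothetical variant that differs in three positions.
print(edit_distance("cKCxEn", "cKAxFm"))
print(within_fuzzy_range("cKCxEn", "cKAxFm"))
```

Three changed characters in a six-character chunk are already enough to fall outside the fuzzy range, which is roughly why only near-identical items turn up.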

To make this work, we'd need to do something clever like shingle the values so we could look for items with the most nearly-matching chunks.
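As a rough sketch of the shingling idea (not something implemented in this notebook), each hash chunk could be split into overlapping character n-grams and candidates ranked by how many n-grams they share. Everything here is an assumption for illustration: the function names, the shingle size of 3, and the use of Jaccard similarity as the score. In Solr itself this would presumably be done at index time with an n-gram analysed field rather than in Python.

```python
def shingles(value: str, size: int = 3) -> set:
    """Return the set of overlapping character n-grams in value."""
    if len(value) < size:
        # Too short to shingle: treat the whole value as one shingle.
        return {value} if value else set()
    return {value[i:i + size] for i in range(len(value) - size + 1)}


def shingle_overlap(a: str, b: str, size: int = 3) -> float:
    """Jaccard similarity between the shingle sets of a and b."""
    set_a, set_b = shingles(a, size), shingles(b, size)
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def rank_by_overlap(query: str, candidates: list, size: int = 3) -> list:
    """Sort candidates by shingle overlap with query, best match first."""
    scored = [(shingle_overlap(query, c, size), c) for c in candidates]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Hypothetical hash chunks, purely for illustration.
for score, candidate in rank_by_overlap("abcdefgh", ["abcdefxy", "zzzzzzzz", "abcdefgh"]):
    print("%.2f  %s" % (score, candidate))
```

Unlike the fuzzy operator, this kind of score degrades gradually, so items that differ by more than two characters can still be ranked as partial matches.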

In [18]:
import requests
import pandas as pd

# Pick an example that has an NGINX error:
url = 'http://www.eshbal.org.il/favicon.ico'

def run_q(params):
    headers = {'content-type': "application/json" }

    r = requests.post("http://solr.api.wa.bl.uk/solr/fc/select", params=params, headers=headers)

    if r.status_code != 200:
        raise Exception(r.text)
    else:
        return r.json()

# Get the most recent version:
params = {
    'q': 'url:"%s"' % url,
    'sort': 'crawl_date desc',
    'rows': 1,
    'wt': 'json'
}
sr = run_q(params)
docs = sr['response']['docs']

# Build a fuzzy (~) query clause for each ssdeep hash field:
ssdeep_qs = []
for doc in docs:
    for key in doc:
        if key.startswith('ssdeep_hash_bs'):
            ssdeep_qs.append('%s:%s~ ' % (key, doc[key]))
ssdeep_q = ' OR '.join(ssdeep_qs)

# Use that to search for other documents within the maximum edit distance (2!)
print("Searching using: " + ssdeep_q)
params = {
    'q': ssdeep_q,
    'sort': 'crawl_date desc',
    'rows': 1000,
    'wt': 'json'
}
sr = run_q(params)

# Show the results as a dataframe:
df = pd.DataFrame(sr['response']['docs'])
df

Searching using: ssdeep_hash_bs_3:cKCxEn~  OR ssdeep_hash_bs_6:7CxE~ 


Unnamed: 0,id,source_file,source_file_offset,url_type,warc_key_id,warc_ip,source_file_path,resourcename,content_type_ext,host,...,keywords,links_hosts,links_hosts_surts,links_domains,links_public_suffixes,elements_used,parse_error,collection,collections,wct_subjects
0,20181229115414/rDiO5iwm7ArdtN7QcYm6Xg==,BL-20181229090529907-00712-63~ukwa-h3-pulse-mo...,198566829,[normal],<urn:uuid:2a89c539-ff20-40e3-8e90-dca2805dfc22>,13.33.50.86,BL-20181229090529907-00712-63~ukwa-h3-pulse-mo...,data.rdf,rdf,legislation.gov.uk,...,,,,,,,,,,
1,20181223133358//XXXgJzPPQ65bS/J5GhdcA==,BL-20181223090529572-00592-63~ukwa-h3-pulse-mo...,480356269,[normal],<urn:uuid:7e690b87-6b67-4613-a95c-c5308c762664>,13.33.50.25,BL-20181223090529572-00592-63~ukwa-h3-pulse-mo...,data.xml,xml,legislation.gov.uk,...,,,,,,,,,,
2,20181223003143/aBc3jrTnlFMrumw2ClkXdw==,BL-20181222210530328-00584-63~ukwa-h3-pulse-mo...,386339952,[normal],<urn:uuid:ae080a83-4ef3-4eb2-b2cc-f3a23a4c9acd>,13.33.50.103,BL-20181222210530328-00584-63~ukwa-h3-pulse-mo...,data.xml,xml,legislation.gov.uk,...,,,,,,,,,,
3,20181220141849/Jc7ZxeYMdcbm/Aw2GyWBrw==,BL-20181220090529706-00531-63~ukwa-h3-pulse-mo...,608417734,[normal],<urn:uuid:f6e6e16c-1c8c-4ad2-b97f-b5b55cffd393>,13.33.50.92,BL-20181220090529706-00531-63~ukwa-h3-pulse-mo...,data.xml,xml,legislation.gov.uk,...,,,,,,,,,,
4,20181219141149/tm18rgZ+YwC8+ztva/4jPA==,BL-20181219090534592-00513-63~ukwa-h3-pulse-mo...,629838987,[normal],<urn:uuid:e71444aa-37a5-492a-ae29-1d95fffaf32b>,13.33.50.89,BL-20181219090534592-00513-63~ukwa-h3-pulse-mo...,data.rdf,rdf,legislation.gov.uk,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
978,20130806065036/feF7Y40bSv96YjqB7Ewf2g==,BL-20130806043302288-00057-22625~opera~8443.wa...,468644255,[normal],<urn:uuid:54e715f2-e0ff-49a6-8e97-b9528b8584bd>,31.222.175.237,BL-20130806043302288-00057-22625~opera~8443.wa...,feed,,teamgb.com,...,,,,,,,,,,"[Society & Culture, Sports and Recreation]"
979,20130729201807/AWZcisx4DaiR1NwUStkl9A==,BL-20130729092106588-00002-22625~opera~8443.wa...,75916190,[normal],<urn:uuid:6503c6ae-daa6-496f-931d-bfe3b4ebede6>,176.58.107.98,BL-20130729092106588-00002-22625~opera~8443.wa...,,,onthewight.com,...,,,,,,,,,,
980,20130725202712/HwxxnMmSlsRRnQYtfzXZ/Q==,BL-20130725130600102-00000-22625~opera~8443.wa...,934351233,[normal],<urn:uuid:5a21000b-0943-42d0-9181-39cc5a900071>,193.132.104.92,BL-20130725130600102-00000-22625~opera~8443.wa...,01_Bhagwanji_Lakhani.mp3,mp3,movinghere.org.uk,...,,,,,,,,,,
981,20130725172530/tb+jXZScaGlxSvPBgm99hQ==,BL-20130725131719572-00001-22625~opera~8443.wa...,525465808,[normal],<urn:uuid:306a9660-e32a-4265-bb97-448852545d9e>,193.132.104.92,BL-20130725131719572-00001-22625~opera~8443.wa...,paddy.mp3,mp3,movinghere.org.uk,...,,,,,,,,,,
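
One caveat about the query built in the cell above: the hash values are interpolated into the query string as-is. ssdeep hashes use a base64-style alphabet that includes `+` and `/`, both of which are special characters in the Lucene query syntax, so some hashes may need escaping before they can be searched. The following is a minimal sketch of how that might be handled; `escape_term` and `fuzzy_clause` are hypothetical helpers, not part of the notebook or of any Solr client library.

```python
import re

# Characters (and the two-character operators && and ||) that have
# special meaning in the Lucene/Solr standard query parser.
LUCENE_SPECIAL = re.compile(r'([+\-!(){}\[\]^"~*?:\\/]|&&|\|\|)')


def escape_term(value: str) -> str:
    """Backslash-escape query-syntax characters in a single term."""
    return LUCENE_SPECIAL.sub(r'\\\1', value)


def fuzzy_clause(field: str, value: str, max_edits: int = 2) -> str:
    """Build a 'field:value~N' fuzzy clause with the value escaped."""
    return '%s:%s~%d' % (field, escape_term(value), max_edits)


# The first value is from the query above; the second is made up
# to show the escaping.
print(fuzzy_clause('ssdeep_hash_bs_3', 'cKCxEn'))
print(fuzzy_clause('ssdeep_hash_bs_3', 'cK+x/n'))
```

Writing the edit distance explicitly (`~2`) also makes the limit visible in the query itself.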
