### Read docs

In [1]:
import pandas as pd

In [2]:
tcdf = pd.read_csv('TechCrunch.csv', encoding = "ISO-8859-1")
# vbdf = pd.read_csv('VentureBeat.csv', encoding='utf-8', encoding_errors='ignore')

In [3]:
tcdf.head()

Unnamed: 0,title,url,date
0,Gaming firm Razer seeks to raise over $600M in...,https://techcrunch.com/2017/07/01/razer-hong-k...,1-Jul-17
1,Mendel.ai nabs $2 million to match cancer pati...,https://techcrunch.com/2017/07/01/mendel-ai-na...,1-Jul-17
2,These cities in Californias East Bay are raki...,https://techcrunch.com/2017/07/01/these-east-b...,1-Jul-17
3,A walk around Station F with Emmanuel Macron,https://techcrunch.com/2017/07/01/a-walk-aroun...,1-Jul-17
4,Crunch Report | Facebook Helps You Find Wi-Fi,https://techcrunch.com/2017/06/30/crunch-repor...,30-Jun-17


In [4]:
def chunk_sentence(sent):
    """Swap the two halves of a sentence to create a near-duplicate."""
    words = sent.split()
    mid = len(words) // 2
    # Move the second half of the words in front of the first half.
    return ' '.join(words[mid:] + words[:mid])

In [5]:
import random
titles = tcdf.title.values.tolist()
# Add a half-swapped copy of every title so each one has a known near-duplicate.
titles_sf = [chunk_sentence(sent) for sent in titles]
titles += titles_sf
random.shuffle(titles)
len(titles)

24788

In [6]:
import pickle

In [7]:
with open('titles.pkl', 'wb') as fout:
    pickle.dump(titles, fout)

### Shingling

In [8]:
from tqdm import tqdm

In [9]:
from shingling import Shingles
shingler = Shingles(k=10)
docs = [shingler.shingling(title) for title in tqdm(titles)]

100%|██████████| 24788/24788 [00:03<00:00, 6275.20it/s]


In [10]:
len(docs)

24788
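
The `shingling` module is not shown here, so the following is only a minimal sketch of what k-shingling might look like, assuming character-level shingles (an assumption; the real `Shingles` class may tokenize differently). The names `char_shingles` and `jaccard` are hypothetical helpers, not part of that module. It illustrates why a title and its half-swapped copy still share most of their shingles.

```python
def char_shingles(text, k=10):
    """Return the set of all length-k character substrings of text.

    Assumption: character-level shingles; texts shorter than k
    yield a single shingle containing the whole text.
    """
    if len(text) < k:
        return {text}
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


original = "Brand new Vellum picks a fight for prettier (e)books"
swapped = "a fight for prettier (e)books Brand new Vellum picks"

# Only the shingles that span the swap boundary differ between the two.
sim = jaccard(char_shingles(original), char_shingles(swapped))
print(f"Jaccard similarity: {sim:.2f}")
```

Under this assumption, only shingles that straddle the point where the halves were swapped are lost, so the similarity stays high but below 1.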

### MinHash

In [11]:
from corpus import Corpus
cp = Corpus(docs)

Remap token to index from 0 -> len(token)
Number of shingles = 660871


In [12]:
from min_hashing import MinHasher
mhasher = MinHasher(cp, k=200)

Generate 200 random bucker hashers


In [13]:
signatures = mhasher.pseudo_perm_hasher()
len(signatures), len(signatures[0])

2019-07-05 18:21:04.649476  - Processing shingle id 0.
2019-07-05 18:21:32.517449  - Processing shingle id 100000.
2019-07-05 18:21:55.309940  - Processing shingle id 200000.
2019-07-05 18:22:17.096038  - Processing shingle id 300000.
2019-07-05 18:22:38.191594  - Processing shingle id 400000.
2019-07-05 18:22:58.900027  - Processing shingle id 500000.
2019-07-05 18:23:19.365758  - Processing shingle id 600000.


(24788, 200)
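
The `min_hashing` module is not shown, so here is a rough sketch of how MinHash signatures can be computed from integer shingle ids, assuming linear hash functions of the form `(a * x + b) % prime` as stand-ins for random permutations. The function names and the choice of prime are assumptions made for illustration and may differ from what `MinHasher.pseudo_perm_hasher` actually does.

```python
import random

PRIME = 2_147_483_647  # 2**31 - 1, larger than any shingle id used here


def make_hash_params(num_hashes, seed=0, prime=PRIME):
    """Draw (a, b) pairs for hash functions h(x) = (a * x + b) % prime."""
    rng = random.Random(seed)
    return [(rng.randrange(1, prime), rng.randrange(0, prime))
            for _ in range(num_hashes)]


def minhash_signature(shingle_ids, hash_params, prime=PRIME):
    """Signature = minimum hash value over the set, one entry per hash function.

    shingle_ids must be a non-empty collection of ints below prime.
    """
    return [min((a * x + b) % prime for x in shingle_ids)
            for a, b in hash_params]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions where the two signatures agree."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


params = make_hash_params(200)
doc_a = set(range(0, 60))    # shingle ids 0..59
doc_b = set(range(20, 80))   # shingle ids 20..79, true Jaccard = 40 / 80
estimate = estimate_jaccard(minhash_signature(doc_a, params),
                            minhash_signature(doc_b, params))
print(f"Estimated Jaccard: {estimate:.2f} (true value 0.50)")
```

The fraction of matching positions approximates the Jaccard similarity of the underlying shingle sets, and the estimate tightens as the number of hash functions grows.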

### LSH

In [14]:
from local_sensitive_hashing import LocalSensitiveHashing

In [15]:
lsh = LocalSensitiveHashing(signatures, row_p_band=2)

In [16]:
doc_to_docs = lsh.hashing()

In [17]:
len(doc_to_docs)

24592
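
The `local_sensitive_hashing` module is not shown, so the following is a sketch of the standard banding technique, assuming `row_p_band` means "signature rows per band" and that `hashing()` returns a mapping from each document id to the ids sharing at least one band with it. `lsh_candidates` and `candidate_probability` are hypothetical names used only for this illustration.

```python
from collections import defaultdict


def lsh_candidates(signatures, rows_per_band=2):
    """Map each doc id to the set of doc ids that match it in at least one band."""
    doc_to_docs = defaultdict(set)
    num_rows = len(signatures[0])
    for start in range(0, num_rows, rows_per_band):
        # Documents whose signature slice is identical land in the same bucket.
        buckets = defaultdict(list)
        for doc_id, sig in enumerate(signatures):
            buckets[tuple(sig[start:start + rows_per_band])].append(doc_id)
        for members in buckets.values():
            if len(members) > 1:
                for doc_id in members:
                    doc_to_docs[doc_id].update(m for m in members if m != doc_id)
    return dict(doc_to_docs)


def candidate_probability(s, rows_per_band, num_bands):
    """Probability that two docs with Jaccard similarity s share at least one band."""
    return 1 - (1 - s ** rows_per_band) ** num_bands


toy_signatures = [[1, 2, 3, 4], [1, 2, 9, 9], [7, 7, 7, 7]]
print(lsh_candidates(toy_signatures, rows_per_band=2))

# 200 hash values split into bands of 2 rows gives 100 bands.
for s in (0.1, 0.3, 0.5):
    print(s, round(candidate_probability(s, rows_per_band=2, num_bands=100), 3))
```

With 2 rows per band and 100 bands, even pairs with a similarity around 0.1 become candidates roughly 63% of the time, so a setting like this favors recall and can return loosely related titles alongside true near-duplicates. Documents that match nothing get no entry, which would be consistent with `len(doc_to_docs)` being smaller than the number of titles.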

In [18]:
keys = list(doc_to_docs.keys())

In [19]:
keys[:10]

[5, 255, 148, 370, 76, 414, 887, 915, 804, 1112]

In [20]:
for docid in keys[:10]:
    print("Doc id", docid)
    print("\t", titles[docid])
    print("Similar docs")
    for adj in doc_to_docs[docid]:
        print("\t", titles[adj])
    print()

Doc id 5
	 a fight for prettier (e)books Brand new Vellum picks
Similar docs
	 Brand new Vellum picks a fight for prettier (e)books
	 from The Last Jedi director Rian Johnson Brand new Star Wars trilogy coming
	 You can now pre-order a brand new, working Street Fighter II SNES cartridge
	 Watch BlackBerry unveil a brand new phone live right here

Doc id 255
	 Brand new Vellum picks a fight for prettier (e)books
Similar docs
	 a fight for prettier (e)books Brand new Vellum picks

Doc id 148
	 security startup Observable Networks Cisco acquires network
Similar docs
	 Security startup CryptoMove fragments data and moves it around to keep it secure
	 CEO Talks Cisco Acquisition Crunch Report | AppDynamics
	 Cloud security startup ProtectWise raises another $25 million
	 security startup harvest.ai for around $20M Sources: Amazon quietly acquired AI
	 Sources: Amazon quietly acquired AI security startup harvest.ai for around $20M
	 McAfee acquires cloud security startup Skyhigh Networks