
### LSH and Random
#### LSH Data

https://www.kaggle.com/code/paulrohan2020/location-sensitive-hashing-for-cosine-similarity

https://www.kaggle.com/datasets/patrickgomes/machine-learning-papers-semantic-scholar

#### Random Projections Data

https://www.kaggle.com/datasets/sihuihe/lbp-random-projections

In [None]:
!pip install datasketch

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
# Imports
import pandas as pd
import numpy as np

import re
from datasketch import MinHash, MinHashLSHForest
from sklearn.metrics.pairwise import cosine_similarity

from google.colab import drive

In [None]:
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
PERMUTATIONS = 64
N_RECOMMEND = 5
TEST_IDX = 67

In [None]:
# Read data
df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/ML_papers.csv')

In [None]:
df.head()

Unnamed: 0.1,Unnamed: 0,Name + autors,Fields,Date,Abstract,Citations,PDF link
0,0,TensorFlow%3A A system for large scale machine...,Computer Science,27 May 2016,TensorFlow is a machine learning system that o...,8989,https://arxiv.org/pdf/1605.08695.pdf
1,1,TensorFlow%3A Large Scale Machine Learning on ...,Computer Science,14 March 2016,This paper describes the TensorFlow interface ...,8114,https://arxiv.org/pdf/1603.04467.pdf
2,2,Data Mining Practical Machine Learning Tools a...,Computer Science,2014,,9893,https://doi.org/10.1016/c2009-0-19715-5
3,3,Machine learning a probabilistic perspective M...,Computer Science,24 August 2012,A comprehensive and self-contained introductio...,5925,
4,4,Scikit learn%3A Machine Learning in Python Ped...,Computer Science,1 February 2011,Scikit-learn is a Python module integrating a ...,30040,https://arxiv.org/pdf/1201.0490.pdf


In [None]:
df.shape

(946, 7)

In [None]:
df.isna().sum()

Unnamed: 0         0
Name + autors      0
Fields            16
Date               2
Abstract           0
Citations          8
PDF link         112
dtype: int64

In [None]:
# Select only required columns
df = df[['Name + autors', 'Fields', 'Abstract']]

# Rename columns
df = df.rename(columns={'Name + autors':'paper_name_authors', 'Fields':'fields', 'Abstract':'paper_abstract'})

In [None]:
# Drop rows with nan or empty cells
df.replace('   ', np.nan, inplace=True)
df.dropna(inplace=True)

df.reset_index(inplace=True, drop=True)

In [None]:
df['text'] = df['fields'] + ' ' + df['paper_abstract']

In [None]:
df.head()

Unnamed: 0,paper_name_authors,fields,paper_abstract,text
0,TensorFlow%3A A system for large scale machine...,Computer Science,TensorFlow is a machine learning system that o...,Computer Science TensorFlow is a machine learn...
1,TensorFlow%3A Large Scale Machine Learning on ...,Computer Science,This paper describes the TensorFlow interface ...,Computer Science This paper describes the Tens...
2,Machine learning a probabilistic perspective M...,Computer Science,A comprehensive and self-contained introductio...,Computer Science A comprehensive and self-cont...
3,Scikit learn%3A Machine Learning in Python Ped...,Computer Science,Scikit-learn is a Python module integrating a ...,Computer Science Scikit-learn is a Python modu...
4,Fashion MNIST%3A a Novel Image Dataset for Mac...,"Computer Science, Mathematics","We present Fashion-MNIST, a new dataset compri...","Computer Science, Mathematics We present Fashi..."


### Locality Sensitive Hashing

In [None]:
def getShingles(text):
  text = re.sub(r'[^\w\s]','',text)
  text_lower = text.lower()
  tokens = text_lower.split()

  return tokens

In [None]:
def createSignature(texts):
  # Build one MinHash signature per document in `texts`
  minhash = []

  for text in texts:
    # Get tokens which are the shingles (every word is one shingle)
    tokens = getShingles(text)

    # Create a MinHash object that stores the signature
    m = MinHash(num_perm=PERMUTATIONS)
    for token in tokens:
        m.update(token.encode('utf8'))
    minhash.append(m)

  return minhash
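
The signatures above come from `datasketch`. As a rough illustration of what a MinHash signature estimates, the sketch below compares the exact Jaccard similarity of two word-level shingle sets with the fraction of matching positions in two toy signatures. It is not how `datasketch` is implemented: the helper names (`tokenize`, `jaccard`, `toy_minhash`, `estimate_jaccard`), the salted SHA-1 hashing, and the two sample sentences are all assumptions made for this example.

```python
import hashlib
import re

def tokenize(text):
    """Strip punctuation, lowercase, and split into a set of word shingles."""
    return set(re.sub(r'[^\w\s]', '', text).lower().split())

def jaccard(a, b):
    """Exact Jaccard similarity of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def toy_minhash(tokens, num_perm=64):
    """Toy signature: the minimum salted hash value for each of num_perm salts.

    Assumes `tokens` is non-empty.
    """
    signature = []
    for seed in range(num_perm):
        salt = str(seed).encode('utf8')
        signature.append(min(
            int.from_bytes(hashlib.sha1(salt + t.encode('utf8')).digest()[:8], 'big')
            for t in tokens
        ))
    return signature

def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions on which the two signatures agree."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

a = tokenize('Deep learning for medical image analysis.')
b = tokenize('Machine learning for medical image retrieval')

exact = jaccard(a, b)  # 4 shared words out of 8 distinct words
approx = estimate_jaccard(toy_minhash(a), toy_minhash(b))
print(exact, approx)
```

With only 64 hash functions the estimate is noisy, which is the trade-off controlled by `PERMUTATIONS` in the notebook: more permutations give a tighter estimate at the cost of larger signatures.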

In [None]:
def buildForest(minhash):
  minhash_forest = MinHashLSHForest(num_perm=PERMUTATIONS)
    
  # Build forest of all minhashes
  for i,m in enumerate(minhash):
    minhash_forest.add(i,m)
        
  # Create index on forest
  minhash_forest.index()
      
  return minhash_forest

In [None]:
def getSimilarRows(minhash_forest, text):
  # Get shingles from input text
  tokens = getShingles(text)

  # Create minhash of test string
  m = MinHash(num_perm=PERMUTATIONS)
  for s in tokens:
    m.update(s.encode('utf8'))
      
  # Query minhash forest and get index of similar records
  idx = np.array(minhash_forest.query(m, N_RECOMMEND))

  return idx

In [None]:
# Get similar rows 

minhash_signatures = createSignature(df['text'])
signatures_forest = buildForest(minhash_signatures)

test_string = df.iloc[TEST_IDX]['text']
similar_papers_idx = getSimilarRows(signatures_forest, test_string)

In [None]:
print('Test paper')
print()
print(df.iloc[TEST_IDX])

Test paper

paper_name_authors    Machine Learning Methods for Histopathological...
fields                                       Computer Science, Medicine
paper_abstract        We introduce the application of digital pathol...
text                  Computer Science, Medicine We introduce the ap...
Name: 67, dtype: object


In [None]:
print('Similar papers')
df[df.index.isin(similar_papers_idx)]

Similar papers


Unnamed: 0,paper_name_authors,fields,paper_abstract,text
67,Machine Learning Methods for Histopathological...,"Computer Science, Medicine",We introduce the application of digital pathol...,"Computer Science, Medicine We introduce the ap..."
322,Deep Learning Applications in Medical Image Ke...,Computer Science,The tremendous success of machine learning alg...,Computer Science The tremendous success of mac...
453,Support vector machine learning for image retr...,Computer Science,A novel method of relevance feedback is presen...,Computer Science A novel method of relevance f...
625,On Kernel Target Alignment Cristianini Shawe T...,Computer Science,"We introduce the notion of kernel-alignment, a...",Computer Science We introduce the notion of ke...
657,Evaluation of a Tree based Pipeline Optimizati...,Computer Science,We introduce the concept of tree-based pipelin...,Computer Science We introduce the concept of t...


### Random Projections

In [None]:
# Read dataset
proj = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/lbp_trainning.csv')
proj.columns = [i for i in range(30)]

In [None]:
# Split dataset into train and test
split = int(0.8*len(proj))
train = proj[:split]
test = proj[split:]

In [None]:
TEST_QUERY_ID = 14

In [None]:
def binary(n):
  total = 1 << n
  comb = []
  for i in range(total):
    b = bin(i)[2:]
    b = '0' * (n - len(b)) + b
    b = [int(i) for i in b]
    comb.append(b)

  return comb

In [None]:
# Initializations
buckets = {}
counter = 0

nbits = 8
d = train.shape[1]
plane_norms = np.random.rand(d, nbits) - .5
hashes = binary(nbits)

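# Create one empty bucket per possible bit pattern
# (loop variable named `bits` to avoid shadowing the built-in `hash`)
for bits in hashes:
  hash_code = ''.join([str(i) for i in bits])
  buckets[hash_code] = []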

# convert to numpy array
hashes = np.stack(hashes)

In [None]:
def getDirection(vector):
  # calculate dot product between vector and plane 
  direction = np.dot(vector, plane_norms)

  # Determine if vector lies on positive or negative side of plane
  direction = direction > 0
  binary_hash = direction.astype(int)

  return binary_hash
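
To make the sign test concrete, here is a small plain-Python sketch of the same idea on hand-picked 2-D data. The function name `hyperplane_hash`, the three fixed normals, and the sample vectors are assumptions for illustration; the notebook instead draws random normals and stores them as the columns of `plane_norms`.

```python
def hyperplane_hash(vector, normals):
    """Return one bit per hyperplane: 1 if the dot product is positive, else 0."""
    bits = []
    for normal in normals:
        dot = sum(v * n for v, n in zip(vector, normal))
        bits.append(1 if dot > 0 else 0)
    return bits

# Hypothetical hyperplane normals, one per row (the notebook uses one per column).
normals = [(1.0, 0.0), (0.0, 1.0), (1.0, -1.0)]

near_a = hyperplane_hash((2.0, 1.0), normals)    # dot products: 2.0, 1.0, 1.0
near_b = hyperplane_hash((2.2, 0.9), normals)    # dot products: 2.2, 0.9, 1.3
far = hyperplane_hash((-1.0, 3.0), normals)      # dot products: -1.0, 3.0, -4.0

print(near_a, near_b, far)
```

The two vectors that point in nearly the same direction receive the same bit pattern and would land in the same bucket, while the third differs in two of the three bits.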

In [None]:
def hashVector(vector, vec_id):
  # Hash the vector and store its id in the matching bucket
  binary_hash = getDirection(vector)
  binary_hash = ''.join(binary_hash.astype(str))

  # add to buckets dictionary
  buckets[binary_hash].append(vec_id)

In [None]:
def getDistance(hashed_vector):
  # get hamming distance between query vector and all buckets in hashes
  hamming_dist = np.count_nonzero(hashed_vector != hashes, axis=1).reshape(-1, 1)
  hamming_dist = np.concatenate((hashes, hamming_dist), axis=1)

  # sort by hamming distance
  hamming_dist = hamming_dist[hamming_dist[:, -1].argsort()]

  return hamming_dist

In [None]:
def getTopK(vector, k):
  binary_hash = getDirection(vector)
  hamming_distance = getDistance(binary_hash)

  vec_ids = []
  for row in hamming_distance:
      str_hash = ''.join(row[:-1].astype(str))
      bucket_ids = buckets[str_hash]
      vec_ids.extend(bucket_ids)
      if len(vec_ids) >= k:
          vec_ids = vec_ids[:k]
          break

  return vec_ids
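
The lookup above visits buckets in order of increasing Hamming distance from the query's hash until it has collected `k` ids. A minimal plain-Python sketch of that search order follows; the names `hamming`, `buckets_by_distance`, `top_k`, and the contents of `toy_buckets` are made up for this example and use tuples of bits as keys rather than the notebook's strings.

```python
from itertools import product

def hamming(a, b):
    """Number of positions at which two bit sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def buckets_by_distance(query_bits):
    """All bit patterns of the query's length, nearest to the query first."""
    codes = product((0, 1), repeat=len(query_bits))
    return sorted(codes, key=lambda code: hamming(code, query_bits))

def top_k(query_bits, buckets, k):
    """Collect ids from the nearest buckets until at least k are found."""
    ids = []
    for code in buckets_by_distance(query_bits):
        ids.extend(buckets.get(code, []))
        if len(ids) >= k:
            break
    return ids[:k]

# Hypothetical buckets: bit pattern -> ids of the vectors hashed to it.
toy_buckets = {
    (1, 1, 1): [0, 1],
    (1, 1, 0): [2],
    (0, 1, 0): [3],
    (0, 0, 0): [4],
}

print(top_k((1, 1, 1), toy_buckets, 3))
```

Ids inside one bucket are not ranked against each other, so the result is approximate: a vector in a neighbouring bucket can be closer to the query than one that shares its bucket.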

In [None]:
# Hash every training vector, using its row position as the id
for i in range(len(train)):
    hashVector(train.iloc[i], i)

In [None]:
top_10 = getTopK(test.iloc[TEST_QUERY_ID], k=10)

In [None]:
cos = cosine_similarity(train, [test.iloc[TEST_QUERY_ID]])
np.mean(cos)

0.8973284191027858

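The mean cosine similarity between the test query and the training vectors is about 0.90. Note that this value averages over the whole training set, not only the `top_10` ids returned by the random-projection lookup, so it serves as a baseline and does not by itself show how similar the retrieved vectors are.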
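
One way to check retrieval quality is to compare the mean cosine similarity of the retrieved ids against the mean over all candidates. The sketch below does this in plain Python on toy 2-D data; `cosine`, `mean_cosine`, `toy_train`, `toy_query`, and `toy_top` are hypothetical names and values, not the notebook's data or results.

```python
import math

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def mean_cosine(vectors, query, ids=None):
    """Mean cosine similarity to the query over all vectors, or only over `ids`."""
    chosen = vectors if ids is None else [vectors[i] for i in ids]
    return sum(cosine(v, query) for v in chosen) / len(chosen)

toy_train = [(1.0, 0.0), (1.0, 1.0), (0.0, 1.0), (-1.0, 0.0)]
toy_query = (1.0, 0.0)
toy_top = [0, 1]  # ids a lookup might have returned for this query

baseline = mean_cosine(toy_train, toy_query)            # every candidate
retrieved = mean_cosine(toy_train, toy_query, toy_top)  # retrieved ids only
print(baseline, retrieved)
```

In the notebook the equivalent comparison would be `cosine_similarity(train.iloc[top_10], [test.iloc[TEST_QUERY_ID]])` against the all-rows figure; a retrieved mean clearly above the baseline would support the claim that the lookup finds similar vectors.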


