
This notebook provides a complete, executable workflow for a similarity search system.

## System Overview

```
Dataset → CMS Construction → MinHash Signatures → LSH Index → Query Processing
```



## Prerequisites

Ensure you have the required packages installed.

In [1]:
# Install required packages 
!pip install numpy pandas scikit-learn


## Step 1: Build Count-Min Sketch (CMS)

Construct CMS representations for each column in the dataset.
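
For intuition about what a per-column CMS holds, here is a minimal sketch of the standard Count-Min Sketch data structure. `SimpleCountMinSketch`, its SHA-256 row hashing, and the sample column are illustrative assumptions; the project's own `CountMinSketch` in `utils/cms_utils.py` may be implemented differently.

```python
import hashlib


class SimpleCountMinSketch:
    """Illustrative Count-Min Sketch; not the project's CountMinSketch class."""

    def __init__(self, width=2000, depth=5):
        self.width = width
        self.depth = depth
        # One row of counters per hash function.
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # Derive a per-row hash by salting the item with the row number.
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Collisions only inflate counters, so the minimum is the tightest estimate.
        return min(self.table[row][self._index(item, row)] for row in range(self.depth))


# Hypothetical column values, used only for demonstration.
column = ["Brooklyn", "Queens", "Brooklyn", "Bronx", "Brooklyn"]
cms = SimpleCountMinSketch()
for value in column:
    cms.add(value)
print(cms.estimate("Brooklyn"))  # never below the true count of 3
```

The estimate can overcount when values collide but never undercounts, which is why the query takes the minimum across rows.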



In [5]:

# Build CMS for all columns. Each `!` line runs in its own subshell, so a separate
# `!export` is lost (hence "No module named 'utils'"); set PYTHONPATH inline instead.
!PYTHONPATH=/home/sadiya/Desktop/sketchjoin:$PYTHONPATH python3 preprocessing/cms_construction.py \
    --dataset_path /home/sadiya/Desktop/nyc_selected \
    --dataset_name nyc

# Report success only if the script exited cleanly (IPython sets _exit_code after `!`).
status = "completed" if _exit_code == 0 else "failed"
print(f"Count-Min Sketch construction {status}")


## Step 2: Build MinHash Signatures

Generate MinHash signatures from the CMS representations built in Step 1.
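
The notebook does not show how `minhash_construction.py` derives signatures from a CMS, so the following is only a sketch of standard set-based MinHash, included to show what a signature is and how two signatures approximate Jaccard similarity. The function names, the seeded SHA-256 hashing, the signature length of 128, and the sample columns are all assumptions for illustration.

```python
import hashlib


def stable_hash(value, seed):
    """Deterministic 64-bit hash of a value under a given seed."""
    digest = hashlib.sha256(f"{seed}:{value}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(values, num_hashes=128):
    """Return one minimum hash value per seeded hash function."""
    distinct = set(values)
    if not distinct:
        raise ValueError("cannot build a signature for an empty column")
    return [min(stable_hash(v, seed) for v in distinct) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of positions where two signatures agree."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


# Hypothetical columns; their exact Jaccard similarity is 3/5.
col_a = ["Brooklyn", "Queens", "Bronx", "Manhattan"]
col_b = ["Brooklyn", "Queens", "Bronx", "Staten Island"]
sig_a = minhash_signature(col_a)
sig_b = minhash_signature(col_b)
print(estimate_jaccard(sig_a, sig_b))  # an approximation of 0.6
```

Longer signatures tighten the estimate at the cost of more hashing and storage.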

In [None]:

# Build MinHash signatures
!python preprocessing/minhash_construction.py \
    --dataset_path ./nyc_cleaned \
    --dataset_name nyc



## Step 3: Build LSH Index

Construct a Locality-Sensitive Hashing (LSH) index for fast approximate nearest-neighbor search.
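
As a rough picture of how such an index works, the sketch below uses the common banding scheme: each signature is split into bands, and two columns become candidates if any band matches exactly. `BandedLSHIndex` and the toy signatures are hypothetical; `index/lsh_index.py` may organize its index differently.

```python
from collections import defaultdict


class BandedLSHIndex:
    """Illustrative banded LSH index over fixed-length signatures."""

    def __init__(self, num_bands, rows_per_band):
        self.num_bands = num_bands
        self.rows_per_band = rows_per_band
        # One bucket table per band: band tuple -> set of keys.
        self.buckets = [defaultdict(set) for _ in range(num_bands)]

    def _bands(self, signature):
        if len(signature) != self.num_bands * self.rows_per_band:
            raise ValueError("signature length must equal num_bands * rows_per_band")
        for b in range(self.num_bands):
            start = b * self.rows_per_band
            yield b, tuple(signature[start:start + self.rows_per_band])

    def insert(self, key, signature):
        for b, band in self._bands(signature):
            self.buckets[b][band].add(key)

    def query(self, signature):
        """Return keys that share at least one band with the query signature."""
        candidates = set()
        for b, band in self._bands(signature):
            candidates |= self.buckets[b].get(band, set())
        return candidates


# Toy signatures, chosen by hand for demonstration.
index = BandedLSHIndex(num_bands=2, rows_per_band=2)
index.insert("col_a", [1, 2, 3, 4])
index.insert("col_b", [1, 2, 9, 9])
index.insert("col_c", [7, 7, 7, 7])
print(sorted(index.query([1, 2, 5, 6])))  # ['col_a', 'col_b']
```

A query only touches the buckets its own bands hash to, so lookup cost depends on the number of bands rather than the number of indexed columns.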



In [None]:

# Build the LSH index
!python index/lsh_index.py \
    --dataset_path ./nyc_cleaned \
    --dataset_name nyc



## Step 4: Run Query and Evaluate Results

Search the index for columns similar to the query column (here, the `location` column of `query.csv`).
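
The output format of `discovery/SketchJoin.py` is not shown in this notebook, so the following is only one possible way to evaluate results: compare the returned column names against exact Jaccard similarity computed on the raw values. The helper names, the sample columns, and the assumption that results arrive as a list of column names are all hypothetical.

```python
def jaccard(a, b):
    """Exact Jaccard similarity between two collections of values."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def precision_recall(returned, query_values, columns, threshold=0.7):
    """Compare returned column names against exact Jaccard ground truth."""
    relevant = {name for name, values in columns.items()
                if jaccard(query_values, values) >= threshold}
    returned = set(returned)
    hits = len(returned & relevant)
    precision = hits / len(returned) if returned else 1.0
    recall = hits / len(relevant) if relevant else 1.0
    return precision, recall


# Hypothetical dataset columns and query values.
columns = {
    "borough": ["Brooklyn", "Queens", "Bronx", "Manhattan"],
    "city": ["Brooklyn", "Queens", "Bronx"],
    "agency": ["NYPD", "DOT"],
}
query_values = ["Brooklyn", "Queens", "Bronx", "Manhattan"]
print(precision_recall(["borough", "agency"], query_values, columns))  # (0.5, 0.5)
```

Here `borough` and `city` are the true matches at a threshold of 0.7, so returning `borough` and `agency` yields one hit out of two in both directions.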

In [None]:

# Query the index for columns similar to the `location` column of query.csv
!python discovery/SketchJoin.py \
    --query_file query.csv \
    --query_column location \
    --dataset_path ./nyc_cleaned \
    --dataset_name nyc


## Configuration and Tuning

To adjust system parameters, modify the following files:

### `utils/cms_utils.py`
```python
CMS_WIDTH = 2000  # Number of counters per row
CMS_DEPTH = 5     # Number of hash functions (rows)
```
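
When tuning these two values, the textbook Count-Min Sketch analysis gives a rough guide: with width `w` and depth `d`, an estimate exceeds the true count by more than `(e / w) * N` with probability at most `exp(-d)`, where `N` is the total count inserted. The sketch below assumes that standard analysis (including suitably independent hash functions); the helper names are illustrative and not part of the project.

```python
import math


def cms_guarantees(width, depth):
    """Standard Count-Min Sketch bounds for a given table shape."""
    epsilon = math.e / width   # overestimate is at most epsilon * N ...
    delta = math.exp(-depth)   # ... with probability at least 1 - delta
    return epsilon, delta


def cms_dimensions(epsilon, delta):
    """Smallest width and depth that meet a target (epsilon, delta)."""
    width = math.ceil(math.e / epsilon)
    depth = math.ceil(math.log(1 / delta))
    return width, depth


# Bounds implied by the defaults above (CMS_WIDTH = 2000, CMS_DEPTH = 5).
epsilon, delta = cms_guarantees(2000, 5)
print(f"epsilon ~ {epsilon:.5f}, delta ~ {delta:.4f}")
print(cms_dimensions(0.01, 0.01))  # (272, 5)
```

Increasing the width reduces the size of overestimates, while increasing the depth reduces how often a large overestimate occurs.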

### `utils/utils.py`
```python
ERROR = 0.05                        # Approximation error bound
PROBABILITY_OF_ERROR_MINHASH = 0.1  # MinHash error probability
THRESHOLD = 0.7                     # Similarity threshold (0.0 to 1.0)
PROBABILITY_OF_ERROR_LSH = 0.05     # LSH false negative rate
```
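
How the project turns these constants into a signature length and an LSH layout is not shown here. One common interpretation, sketched below as an assumption, sizes the MinHash signature with a Hoeffding bound and checks a banding layout against the false-negative target at the similarity threshold. The function names and the example layout of 20 bands with 5 rows each are hypothetical.

```python
import math


def minhash_length(error, prob_error):
    """Hoeffding bound: hashes needed so the estimate is within `error`
    of the true Jaccard with probability at least 1 - prob_error."""
    return math.ceil(math.log(2 / prob_error) / (2 * error ** 2))


def candidate_probability(similarity, bands, rows):
    """Probability that two signatures share at least one identical band."""
    return 1 - (1 - similarity ** rows) ** bands


def false_negative_rate(threshold, bands, rows):
    """Chance that a pair exactly at the threshold is missed by the index."""
    return 1 - candidate_probability(threshold, bands, rows)


# ERROR = 0.05 and PROBABILITY_OF_ERROR_MINHASH = 0.1 from the settings above.
print(minhash_length(0.05, 0.1))  # 600

# THRESHOLD = 0.7 with a hypothetical layout of 20 bands x 5 rows.
print(false_negative_rate(0.7, bands=20, rows=5))
```

Under this reading, a layout is acceptable when its false-negative rate at `THRESHOLD` stays below `PROBABILITY_OF_ERROR_LSH`; more bands lower that rate but admit more false-positive candidates.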
