
imshashank/dm5


Data Mining Project for CSE OSU Class

By Shashank Agarwal (agarwal.202@osu.edu) and Anurag Kalra (kalra.25@osu.edu)

All code is in the code5 folder, and the reports are in the report5 folder in PDF and DOCX format.

## Instructions

We use the feature vector created in the first assignment, which is stored in the file 'feature_matrix.pytext'.

Procedure:

1) Using the feature vector created in Lab 1, we create a Jaccard similarity matrix for all documents. This is our true similarity baseline.
2) We create a signature for each document using minhashing (for k = 16, 32, 64, 128).
3) Using the signatures created in step 2, we create an estimated Jaccard similarity matrix for all documents.
4) We compare the estimated Jaccard similarity matrix with the baseline Jaccard matrix to see how accuracy varies with k.
5) We repeat steps 2 to 4 for each value of k.
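The procedure above can be sketched in a few lines of Python. This is a minimal illustration, not the code in the code5 folder: it assumes each document is a non-empty set of integer feature indices, and the helper names (`make_hash_funcs`, `signature`, `estimated_jaccard`, `mse_for_k`), the toy documents, and the choice of linear hash functions modulo a prime are all assumptions made for the example.

```python
import random

PRIME = 2_147_483_647  # assumed modulus, larger than any feature index used here


def jaccard(a, b):
    """Exact Jaccard similarity of two sets (step 1)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def make_hash_funcs(k, seed=0):
    """Return k (a, b) pairs defining h(x) = (a * x + b) % PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(k)]


def signature(feature_ids, hash_funcs):
    """Minhash signature (step 2): the minimum hash value per hash function.

    Assumes feature_ids is non-empty.
    """
    return [min((a * x + b) % PRIME for x in feature_ids) for a, b in hash_funcs]


def estimated_jaccard(sig1, sig2):
    """Estimated similarity (step 3): fraction of positions where signatures agree."""
    return sum(1 for u, v in zip(sig1, sig2) if u == v) / len(sig1)


def mse_for_k(docs, k, seed=0):
    """Mean squared error between exact and estimated similarity (step 4)."""
    funcs = make_hash_funcs(k, seed)
    sigs = [signature(d, funcs) for d in docs]
    errors = []
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            true_sim = jaccard(docs[i], docs[j])
            est_sim = estimated_jaccard(sigs[i], sigs[j])
            errors.append((true_sim - est_sim) ** 2)
    return sum(errors) / len(errors)


# Hypothetical toy documents, each a set of feature indices.
docs = [{0, 1, 2, 3}, {2, 3, 4, 5}, {0, 1, 2, 3, 4}, {7, 8, 9}]

# Step 5: repeat for each signature length k.
for k in (16, 32, 64, 128):
    print(k, mse_for_k(docs, k))
```

The error is expected to shrink on average as k grows, although on a toy input like this one the trend can be noisy for a single seed.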

===========================================================================================================================

The Code:

  1. We converted the feature vector in file "feature_matrix.pytext" to a jaccard similarity matrix saved in file "jaccard_dist2.pytext".

Because the file jaccard_dist2.pytext is over 4 GB, it is not included with the report, but it can be generated by running the file below. code_file: jaccard.py

  2. The code for loading the Jaccard similarity matrix file into memory, instead of computing it again, is in the file 'minhash_load.py'.

  3. minhash: We use two approaches: a) compute the Jaccard similarity on the fly and then compute minhash and MSE; b) compute the Jaccard similarity, save it to a file, and load that file for the various values of k.
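Approach (b) amounts to computing the baseline matrix once and reusing it for every k. The sketch below illustrates that idea only. The on-disk format of the '.pytext' files is not described here, so the use of `pickle`, the function names `jaccard_matrix` and `load_or_compute`, the cache file name, and the toy documents are all assumptions.

```python
import os
import pickle
import tempfile


def jaccard_matrix(docs):
    """Exact pairwise Jaccard similarity for a list of sets."""
    n = len(docs)
    sim = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            union = len(docs[i] | docs[j])
            value = len(docs[i] & docs[j]) / union if union else 1.0
            sim[i][j] = sim[j][i] = value
    return sim


def load_or_compute(path, docs):
    """Load the cached matrix if it exists; otherwise compute and save it."""
    if os.path.exists(path):
        with open(path, "rb") as fh:
            return pickle.load(fh)
    sim = jaccard_matrix(docs)
    with open(path, "wb") as fh:
        pickle.dump(sim, fh)
    return sim


# Hypothetical toy documents and a temporary cache location.
docs = [{0, 1, 2}, {1, 2, 3}, {4, 5}]
cache_path = os.path.join(tempfile.mkdtemp(), "jaccard_cache.pkl")

first = load_or_compute(cache_path, docs)   # computes the matrix and writes the cache
second = load_or_compute(cache_path, docs)  # reads the matrix back from disk
```

For a matrix in the multi-gigabyte range, a plain list of lists would likely be too memory-hungry; this layout is chosen only to keep the example dependency-free.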

===========================================================================================================================

Executing the Code:

To run minhash enter: make minhash

Note: the default 'k' is 16 in the makefile; the value can be changed in the file 'minhash.py' on line 11.

To generate the Jaccard similarity file using approach (b), enter: make jaccard

To run minhash using the saved Jaccard similarity file (approach (b)), enter: make minhash_load

===========================================================================================================================

Contributions:

Shashank Agarwal and Anurag Kalra both implemented the "minhash" clustering and optimizations. The work overlaps and is hard to separate.

P.S. We had to remove the "jaccard_dist2.pytext" (Jaccard similarity matrix) file because the folder exceeded the allowed submission size.
