duplicatated-pages-detection

Implements the simhash technique to estimate the number of duplicate pages in a given dataset. University project for the Information Retrieval course (Spring 2015).
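
As a rough illustration of the general simhash idea (not the code in this repository), the sketch below builds a 64-bit fingerprint from the words of a page and compares two fingerprints by Hamming distance. The class name `SimHashSketch`, the FNV-1a token hash, the word-level tokenization, and the sample strings are all assumptions made for this example; the project's actual choices live in `DataHasher.java` and `proj.m`.

```java
import java.nio.charset.Charset;

// Hypothetical sketch of simhash; not taken from DataHasher.java.
public class SimHashSketch {
    private static final Charset UTF8 = Charset.forName("UTF-8");

    // 64-bit FNV-1a hash of one token (assumed stand-in for the token hash).
    static long fnv1a64(String token) {
        long hash = 0xcbf29ce484222325L;          // FNV offset basis
        byte[] bytes = token.getBytes(UTF8);
        for (int i = 0; i < bytes.length; i++) {
            hash ^= (bytes[i] & 0xff);
            hash *= 0x100000001b3L;               // FNV prime
        }
        return hash;
    }

    // Each token votes +1 or -1 on every bit position; the fingerprint
    // keeps the bits whose vote total is positive.
    static long simhash(String text) {
        int[] votes = new int[64];
        String[] tokens = text.toLowerCase().split("\\W+");
        for (int t = 0; t < tokens.length; t++) {
            if (tokens[t].length() == 0) {
                continue;                          // skip empty leading token
            }
            long h = fnv1a64(tokens[t]);
            for (int bit = 0; bit < 64; bit++) {
                if (((h >>> bit) & 1L) == 1L) {
                    votes[bit]++;
                } else {
                    votes[bit]--;
                }
            }
        }
        long fingerprint = 0L;
        for (int bit = 0; bit < 64; bit++) {
            if (votes[bit] > 0) {
                fingerprint |= (1L << bit);
            }
        }
        return fingerprint;
    }

    // Number of bit positions in which two fingerprints differ.
    static int hammingDistance(long a, long b) {
        return Long.bitCount(a ^ b);
    }

    public static void main(String[] args) {
        long a = simhash("the quick brown fox jumps over the lazy dog");
        long b = simhash("the quick brown fox jumped over the lazy dog");
        long c = simhash("an unrelated page about matrix factorization");
        System.out.println("near-duplicate distance: " + hammingDistance(a, b));
        System.out.println("unrelated distance:      " + hammingDistance(a, c));
    }
}
```

Two pages would then be treated as duplicates when the distance falls below a chosen threshold; the threshold is a tuning parameter that this sketch does not fix. The code avoids language features newer than Java 6, so it should compile with the same `-source 1.6 -target 1.6` flags used for `DataHasher.java` on a compiler that still supports them.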

The final report (in Greek) can be found here.

Prerequisites

  • Matlab 2012b+
  • Matlab: Statistics and Machine Learning Toolbox
  • Java 1.6 (the version required by Matlab 2012b)

How to run

The main program is proj.m.

  1. In DataHasher.java, insert the path to your Desktop on lines 45 and 48.
  2. Compile with javac -source 1.6 -target 1.6 DataHasher.java.
  3. In Matlab, run which classpath.txt to locate the file, then add to it the path of the directory containing DataHasher.class.
  4. Run proj.m and choose whether the input comes from a .csv file or from an online source.

Authors

Course website

ECE328 Information Retrieval