
itscarrot/PageRank-VSM


The collection consists of 10,000 Wikipedia pages.

Dataset

The dataset is composed of three folders: “content”, “anchor” and “url”.

• content: This folder contains 10,000 documents. Each file is named after its document id (docID).
• id2url: This folder contains a single file listing the URLs of the 10,000 documents. The first column is the docID and the second column is the URL. The two columns are separated by a space.
• anchor: This folder contains 10,000 files, one per docID, each holding the anchor text of that document. Each file has two columns: the first is the anchor text and the second is the hyperlink of that anchor text. The two columns are separated by a tab.
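The column layouts described above can be read with a few lines of Java. The following is only a minimal sketch, not the repository's own code: the class name `DatasetFormats`, the method names, and the sample lines in `main` are hypothetical and chosen for illustration. It assumes the id2url file has one `docID URL` pair per line separated by a space, and that each anchor file has one `anchor text<TAB>hyperlink` pair per line.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class; the names here are not taken from the repository.
final class DatasetFormats {

    // Parses id2url lines of the form "<docID> <URL>" (split on the first space).
    static Map<String, String> parseId2Url(List<String> lines) {
        Map<String, String> idToUrl = new LinkedHashMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            int space = trimmed.indexOf(' ');
            if (space < 0) {
                continue; // skip blank or malformed lines
            }
            String docId = trimmed.substring(0, space);
            String url = trimmed.substring(space + 1).trim();
            idToUrl.put(docId, url);
        }
        return idToUrl;
    }

    // Parses anchor lines of the form "<anchor text>\t<hyperlink>".
    // Splitting on the last tab keeps spaces inside the anchor text intact.
    static List<String[]> parseAnchors(List<String> lines) {
        List<String[]> anchors = new ArrayList<>();
        for (String line : lines) {
            int tab = line.lastIndexOf('\t');
            if (tab < 0) {
                continue; // skip lines without a tab separator
            }
            String text = line.substring(0, tab).trim();
            String link = line.substring(tab + 1).trim();
            anchors.add(new String[] {text, link});
        }
        return anchors;
    }

    public static void main(String[] args) {
        // Made-up sample lines, used only to exercise the parsers.
        Map<String, String> urls = parseId2Url(Arrays.asList(
                "1 http://en.wikipedia.org/wiki/PageRank",
                "2 http://en.wikipedia.org/wiki/Vector_space_model"));
        for (Map.Entry<String, String> e : urls.entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }

        List<String[]> anchors = parseAnchors(Arrays.asList(
                "vector space model\thttp://en.wikipedia.org/wiki/Vector_space_model"));
        for (String[] a : anchors) {
            System.out.println("[" + a[0] + "] -> " + a[1]);
        }
    }
}
```

In practice the lines would come from the actual files, for example via `java.nio.file.Files.readAllLines`, with the charset matching however the dataset was saved.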

Required Lucene libraries (version 4.6.1):

• lucene-core-4.6.1.jar
• lucene-analyzers-common-4.6.1.jar
• lucene-queryparser-4.6.1.jar

About

PageRank+VSM
