Read all text files under a directory, calculate the TF-IDF score of each term in each file, and write the results to an output file; you can also process the results however you like.
String indexPath = "./index"; // path to store the Lucene index files
String docsPath = "path/to/your/files/directory";
String outFilePath = "./tf_idf_output.txt";
String encoding = "UTF-8"; // encoding of your files, e.g. UTF-8 or ISO-8859-1
WordVector wordVector = new WordVector(indexPath, docsPath, true, encoding);
// get the result: document -> (term -> TF-IDF score)
Map<String, HashMap> documentsScores = wordVector.TFIDFScore();
// write to file
Utils.write2DMapToFile(outFilePath, documentsScores);
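To show what a TF-IDF score represents, here is a minimal, self-contained sketch in plain Java. It is only an illustration of the general tf * ln(N/df) weighting, not WordVector's or Lucene's actual implementation (Lucene's formula differs in smoothing and normalization details); `TfIdfSketch` and its tiny in-memory corpus are hypothetical.

```java
import java.util.*;

public class TfIdfSketch {
    // tf(t, d) = raw count of t in d; idf(t) = ln(N / df(t)), N = number of docs
    static Map<String, Map<String, Double>> tfIdf(Map<String, List<String>> docs) {
        int n = docs.size();
        // document frequency: in how many docs does each term appear?
        Map<String, Integer> df = new HashMap<>();
        for (List<String> terms : docs.values())
            for (String t : new HashSet<>(terms)) df.merge(t, 1, Integer::sum);

        Map<String, Map<String, Double>> scores = new HashMap<>();
        for (Map.Entry<String, List<String>> e : docs.entrySet()) {
            // term frequency within this document
            Map<String, Integer> tf = new HashMap<>();
            for (String t : e.getValue()) tf.merge(t, 1, Integer::sum);

            Map<String, Double> docScores = new HashMap<>();
            for (Map.Entry<String, Integer> te : tf.entrySet()) {
                double idf = Math.log((double) n / df.get(te.getKey()));
                docScores.put(te.getKey(), te.getValue() * idf);
            }
            scores.put(e.getKey(), docScores);
        }
        return scores;
    }

    public static void main(String[] args) {
        Map<String, List<String>> docs = new HashMap<>();
        docs.put("a.txt", Arrays.asList("cat", "cat", "dog"));
        docs.put("b.txt", Arrays.asList("dog", "bird"));
        Map<String, Map<String, Double>> s = tfIdf(docs);
        // "dog" occurs in both documents, so idf = ln(2/2) = 0
        System.out.println(s.get("a.txt").get("dog")); // 0.0
        // "cat" occurs only in a.txt: tf = 2, score = 2 * ln(2) ≈ 1.386
        System.out.println(s.get("a.txt").get("cat"));
    }
}
```

A term common to every document gets score 0, while a term concentrated in one document scores highest there, which is exactly the "important terms" ranking the tool produces.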
You can also run it with Maven from the command line, passing arguments:
- Run command
mvn compile
in the ./WordVector/ directory, where pom.xml is located:
$ mvn compile
- Run command
mvn exec:java -Dexec.mainClass="GenTFIDF"
with no arguments to see the usage info:
$ mvn exec:java -Dexec.mainClass="GenTFIDF"
- Run
GenTFIDF
with command-line arguments like this:
$ mvn exec:java -Dexec.mainClass="GenTFIDF" -Dexec.args="-docs DOCS_PATH [-o OUTPUT_FILE] [-e TEXT_FILE_ENCODING]"
- Check the output file to see the TF-IDF score of each term in each document. The terms in each document are sorted by score in descending order, so the most important terms for that document in the collection (corpus) appear first.
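The descending sort mentioned above can be reproduced for any per-document score map with standard Java collections. This is a hypothetical helper for post-processing the output, not part of WordVector's API:

```java
import java.util.*;
import java.util.stream.*;

public class SortScores {
    // Return a term -> score map whose iteration order is descending by score.
    static LinkedHashMap<String, Double> sortDescending(Map<String, Double> scores) {
        return scores.entrySet().stream()
                .sorted(Map.Entry.<String, Double>comparingByValue(Comparator.reverseOrder()))
                .collect(Collectors.toMap(
                        Map.Entry::getKey, Map.Entry::getValue,
                        (a, b) -> a,            // merge function (keys are unique here)
                        LinkedHashMap::new));   // preserves the sorted order
    }

    public static void main(String[] args) {
        Map<String, Double> scores = new HashMap<>();
        scores.put("lucene", 0.9);
        scores.put("the", 0.1);
        scores.put("index", 0.5);
        System.out.println(sortDescending(scores).keySet()); // [lucene, index, the]
    }
}
```

A `LinkedHashMap` is used because a plain `HashMap` would discard the insertion (sorted) order.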
- Fork it!
- Create your feature branch:
git checkout -b my-new-feature
- Commit your changes:
git commit -am 'Add some feature'
- Push to the branch:
git push origin my-new-feature
- Submit a pull request :D