GitHub - nitishgupta/document-representations: learn word and document representations for various prediction tasks



Folders 
1. data - Contains the datasets
2. python_scripts - Contains the Python scripts for preprocessing and for computing entity similarity from the embeddings
3. categorization - Contains the Python code for document categorization, plus code to compute the P/R/F1 scores from the prediction files stored in Output/Dataset/Model/prediction/model-val.dat (F1 on the validation data) or Output/Dataset/Model/prediction/model-test.dat (F1 on the test set).
4. output - Contains embedding and prediction outputs for all the datasets.
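
As an illustration of the P/R/F1 scoring mentioned for the categorization folder, here is a minimal Python sketch. It is not the repository's own scoring code: the function name is hypothetical, micro-averaging is an assumption, and the inputs are assumed to be per-document label sets already read from the gold and prediction files (the on-disk format of the model-val.dat and model-test.dat files is not described here, so no parsing is shown).

```python
def micro_precision_recall_f1(gold, predicted):
    """Micro-averaged precision, recall and F1 over per-document label sets.

    gold, predicted: equal-length sequences, one collection of labels per
    document. This input layout is an assumption, not the repository's format.
    """
    tp = fp = fn = 0
    for gold_labels, pred_labels in zip(gold, predicted):
        gold_labels, pred_labels = set(gold_labels), set(pred_labels)
        tp += len(gold_labels & pred_labels)   # labels predicted correctly
        fp += len(pred_labels - gold_labels)   # labels predicted but wrong
        fn += len(gold_labels - pred_labels)   # gold labels that were missed

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1
```

Calling it with two documents, for example gold labels [{"a", "b"}, {"c"}] and predictions [{"a"}, {"c", "d"}], counts two true positives, one false positive and one false negative.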

The code for learning document embeddings is in vocab-weight.cpp.
It can be run with:
$ ./vocab-weight -train-directory data/Biology/ -output-directory output/Biology/weight-update-epoch50/ -debug 0 -min-count 5 -embed-size 100 -window-size 2 -nthreads 30 -negative-samples 5 -updateweights 1 -epoch 200
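
To repeat the run above for other datasets, the command line can be assembled from a script. The following is a small Python sketch under stated assumptions: the helper name and its parameters are hypothetical, the data/<dataset>/ and output/<dataset>/<run>/ layout is inferred from the example command, and only the flags shown in that command are used with the same default values.

```python
import shlex


def build_vocab_weight_command(dataset, run_name, epochs=200, embed_size=100,
                               window_size=2, nthreads=30, negative_samples=5,
                               min_count=5, update_weights=1, debug=0):
    """Return the ./vocab-weight argument list for one dataset.

    Hypothetical helper: flag names and defaults are copied from the
    example command; the directory layout is an assumption.
    """
    return [
        "./vocab-weight",
        "-train-directory", f"data/{dataset}/",
        "-output-directory", f"output/{dataset}/{run_name}/",
        "-debug", str(debug),
        "-min-count", str(min_count),
        "-embed-size", str(embed_size),
        "-window-size", str(window_size),
        "-nthreads", str(nthreads),
        "-negative-samples", str(negative_samples),
        "-updateweights", str(update_weights),
        "-epoch", str(epochs),
    ]


def as_shell_string(command):
    """Render the argument list as a single shell-quoted line."""
    return shlex.join(command)
```

The returned list can be passed to subprocess.run(command, check=True) from the repository root once the vocab-weight binary has been built; as_shell_string is only there to print the command for inspection.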