Skip to content
BoV based text representation generator
Python
Branch: master
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Type Name Latest commit message Commit time
Failed to load latest commit information.
BoV.py
LICENSE
README.md

README.md

BoV

Bag of Vectors (BoV) is a text representation technique based in vector space model. More precisally, this BoV based script use a pre-trained word embedding model to generate an unique vector representation to each document, calculating the arithmetic mean of database words's vector representations found in model. The output is a matrix, where rows are the documents ids, and columns are the dimensions values to each document, i.e., the centroid generated from model term vectors.

Generating a BoV based text representation matrix:

python3 BoV.py --n_gram 1 --model models/Google/GoogleVectors_300.txt --input in/db/ --output out/BoV/txt/

Converting a Doc-Dimension matrix to Arff file (Weka):

python3 Bag2Arff.py --token - --input out/Bag/txt/ --output out/Bag/arff/

Related scripts

Assumptions

These scripts expect a database folder following an specific hierarchy like shown below:

in/db/                 (main directory)
---> class_1/          (class_1's directory)
---------> file_1      (text file)
---------> file_2      (text file)
---------> ...
---> class_2/          (class_2's directory)
---------> file_1      (text file)
---------> file_2      (text file)
---------> ...
---> ...

Observations

All generated files use TAB character as a separator.

Requirements installation (Linux)

Python 3 + PIP installation as super user:

apt-get install python3 python3-pip

NLTK installation as normal user:

pip3 install -U nltk

See more

Project page on LABIC website: http://sites.labic.icmc.usp.br/MSc-Thesis_Antunes_2018

You can’t perform that action at this time.