GitHub - p-karisani/RST: A classification model

This repository contains the Python code for the paper below:
Neural Networks Against (and For) Self-Training: Classification with Small Labeled and Large Unlabeled Sets. Payam Karisani. ACL Findings 2023.

Input
The input files (train, test, or unlabeled) should contain one document per line. Each line should have 4 attributes (tab separated):

A unique document id (integer)
An integer label:
- The number 1 for the first class
- The number 3 for the second class
- The number 4 for the third class
- The number 5 for the fourth class (I know!! the numbering is weird)
- If the document is unlabeled, this column is ignored
Domain (string): a keyword describing the topic of the document
Document body (string)

See the file “sample.train” for a sample input.
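
The four-column layout described above can be read with a few lines of Python. This is a minimal sketch, not the repository's own loader: the names `Document`, `parse_line`, and `LABEL_TO_INDEX` are hypothetical, the mapping of the labels 1, 3, 4, 5 to 0-based class indices is an assumption made for illustration, and the example line is invented.

```python
from typing import NamedTuple, Optional

# Label values listed above for the four classes. Mapping them to
# 0-based indices is an assumption made for this sketch only.
LABEL_TO_INDEX = {1: 0, 3: 1, 4: 2, 5: 3}


class Document(NamedTuple):
    doc_id: int
    label: Optional[int]  # 0-based class index, or None for unlabeled input
    domain: str
    body: str


def parse_line(line: str, labeled: bool = True) -> Document:
    """Parse one tab-separated input line into a Document."""
    # Split on the first three tabs only, so any tab inside the body survives.
    doc_id, raw_label, domain, body = line.rstrip("\n").split("\t", 3)
    # For unlabeled files the label column is ignored, as described above.
    label = LABEL_TO_INDEX[int(raw_label)] if labeled else None
    return Document(int(doc_id), label, domain, body)


# Invented example line: id 17, label 3 (second class), domain "sports".
example = parse_line("17\t3\tsports\tThe match ended in a draw.")
print(example)
```

Passing `labeled=False` skips the label column entirely, so an unlabeled file can hold any placeholder value there.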

Training and Evaluation
Below is an example command to run the code. It tells the code to train a model on a subset of the documents in the training and unlabeled sets and to evaluate it on the test set. The F1 measure is printed at the end of the execution.

python -m RST.src.MainThread --cmd bert_mine \
--itr 1 \
--model_path /user/desktop/bert-base-uncased/ \
--train_path /user/desktop/data/data.train \
--test_path /user/desktop/data/data.test \
--unlabeled_path /user/desktop/data/data.unlabeled \
--output_dir /user/desktop/output \
--device 0 \
--seed 666 \
--train_sample 500 \
--unlabeled_sample 10000
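
When the same command has to be repeated with different sample sizes or seeds, it can be assembled from Python instead of edited by hand. The sketch below only rebuilds the example command as an argument list; `build_command` is a hypothetical helper, not part of the repository, and the paths are the placeholder paths from the example.

```python
import shlex


def build_command(train_sample, unlabeled_sample, seed=666, device=0):
    """Return the example command as an argument list (hypothetical helper)."""
    return [
        "python", "-m", "RST.src.MainThread",
        "--cmd", "bert_mine",
        "--itr", "1",
        "--model_path", "/user/desktop/bert-base-uncased/",
        "--train_path", "/user/desktop/data/data.train",
        "--test_path", "/user/desktop/data/data.test",
        "--unlabeled_path", "/user/desktop/data/data.unlabeled",
        "--output_dir", "/user/desktop/output",
        "--device", str(device),
        "--seed", str(seed),
        "--train_sample", str(train_sample),
        "--unlabeled_sample", str(unlabeled_sample),
    ]


command = build_command(train_sample=500, unlabeled_sample=10000)
# Print the command as a single shell-quoted string for inspection.
print(shlex.join(command))
# To launch it: import subprocess; subprocess.run(command, check=True)
```

Keeping the command as a list avoids shell-quoting problems if a path contains spaces.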

The arguments are explained below:

“--itr”: The number of iterations to run the experiment with different random seeds
“--model_path”: The path to the Hugging Face pretrained BERT model
“--train_path”: The path to the train file
“--test_path”: The path to the test file
“--unlabeled_path”: The path to the unlabeled file
“--output_dir”: A directory to be used for temporary files
“--device”: GPU identifier
“--seed”: Random seed
“--train_sample”: The number of documents to sample from the original labeled set to be used as the training data
“--unlabeled_sample”: The number of unlabeled documents to sample from the unlabeled set to be used in the model
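
To make the roles of “--seed”, “--train_sample”, and “--unlabeled_sample” concrete, here is a minimal sketch of drawing a reproducible subset from a list of documents. How the repository actually samples is not described here, so this is only an illustration under the assumption of uniform sampling without replacement; `sample_documents` and the toy data are hypothetical.

```python
import random


def sample_documents(documents, sample_size, seed):
    """Draw a reproducible subset of documents (illustrative only)."""
    # A dedicated generator keeps the draw independent of global random state.
    rng = random.Random(seed)
    if sample_size >= len(documents):
        # Nothing to subsample: return a copy of the full set.
        return list(documents)
    # Uniform sampling without replacement.
    return rng.sample(documents, sample_size)


# Toy stand-in for a labeled set of 2000 documents.
labeled = [f"doc-{i}" for i in range(2000)]
train_subset = sample_documents(labeled, 500, seed=666)
print(len(train_subset))
```

With a fixed seed the same subset is returned on every call, which is what makes repeated experiments comparable.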

Notes

The code uses the Hugging Face pretrained BERT model.
The hyperparameters are set to the default values reported in the paper. If you need to change them, they are defined at the beginning of the method "__run_training_with_unlabeled_mine()" in the class “EPretrainProj”.
The batch size is set to 32.
