This is the Python code for the paper below:
Neural Networks Against (and For) Self-Training: Classification with Small Labeled and Large Unlabeled Sets, Payam Karisani. ACL Findings 2023. Link
Input
The input files (train, test, or unlabeled) should contain one document per line. Each line should have four tab-separated attributes:
- A unique document id (integer)
- An integer label:
- The number 1 for the first class
- The number 3 for the second class
- The number 4 for the third class
- The number 5 for the fourth class (I know, the numbering is odd)
- If the document is unlabeled, this column is ignored
- Domain (string): a keyword describing the topic of the document
- Document body (string)
See the file “sample.train” for a sample input.
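For reference, the snippet below sketches a reader for this format. The function name and return type are illustrative, not part of the repository; the field order follows the list above.

from typing import List, Tuple

def read_documents(path: str) -> List[Tuple[int, int, str, str]]:
    """Read one tab-separated document per line: id, label, domain, body."""
    documents = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            # Split into at most 4 fields so tabs inside the body are preserved
            doc_id, label, domain, body = line.split("\t", 3)
            documents.append((int(doc_id), int(label), domain, body))
    return documents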
Training and Evaluation
Below is an example command for running the code. It tells the code to sample a subset of the documents from the training and unlabeled sets, train a model, and evaluate it on the test set; the F1 measure is printed at the end of the execution.
python -m RST.src.MainThread --cmd bert_mine \
--itr 1 \
--model_path /user/desktop/bert-base-uncased/ \
--train_path /user/desktop/data/data.train \
--test_path /user/desktop/data/data.test \
--unlabeled_path /user/desktop/data/data.unlabeled \
--output_dir /user/desktop/output \
--device 0 \
--seed 666 \
--train_sample 500 \
--unlabeled_sample 10000
The arguments are explained below:
- “--itr”: The number of iterations to run the experiment with different random seeds
- “--model_path”: The path to the pretrained Hugging Face BERT model
- “--train_path”: The path to the train file
- “--test_path”: The path to the test file
- “--unlabeled_path”: The path to the unlabeled file
- “--output_dir”: A directory to be used for temporary files
- “--device”: GPU identifier
- “--seed”: Random seed
- “--train_sample”: The number of documents to sample from the original labeled set to be used as the training data
- “--unlabeled_sample”: The number of unlabeled documents to sample from the unlabeled set to be used in the model
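To make the two sampling arguments concrete, the sketch below shows one plausible way a subset could be drawn deterministically from a corpus. This is an assumption about how “--train_sample”, “--unlabeled_sample”, and “--seed” interact, not the repository's actual code.

import random

def sample_documents(documents, sample_size, seed):
    # Fixing the seed means repeated runs draw the same subset
    rng = random.Random(seed)
    return rng.sample(documents, min(sample_size, len(documents)))

corpus = [f"doc-{i}" for i in range(20000)]    # stand-in for the unlabeled set
subset = sample_documents(corpus, 10000, 666)  # --unlabeled_sample 10000, --seed 666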
Notes
- The code uses the Hugging Face pretrained BERT model: Link
- The hyper-parameters are set to the default values reported in the paper. If you need to change them, they are defined at the beginning of the method "__run_training_with_unlabeled_mine()" in the class “EPretrainProj”.
- The batch size is set to 32.
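As an illustration of these fixed settings, loading the checkpoint passed via “--model_path” and batching documents at size 32 with the Hugging Face transformers library might look like the sketch below. This is a minimal example under assumed inputs, not the repository's actual training loop; the 4-class label set follows the Input section.

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_PATH = "/user/desktop/bert-base-uncased/"  # value of --model_path
BATCH_SIZE = 32                                  # fixed batch size, per the note above

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_PATH, num_labels=4)

texts = ["an example document body"] * 64        # hypothetical documents
enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
loader = torch.utils.data.DataLoader(
    list(zip(enc["input_ids"], enc["attention_mask"])), batch_size=BATCH_SIZE
)
with torch.no_grad():
    for input_ids, attention_mask in loader:
        # One forward pass per batch of 32 documents
        logits = model(input_ids=input_ids, attention_mask=attention_mask).logits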