p-karisani/CEPC

This repository contains the Python code for the paper below:
Multiple-Source Domain Adaptation via Coordinated Domain Encoders and Paired Classifiers, Payam Karisani. AAAI 2022. Link

Prerequisites

  • Python (>= 3.7.0)
  • NumPy (>= 1.21)
  • PyTorch (>= 1.8)

Input
The input file should contain one document per line. Each line should have four tab-separated attributes:

  1. A unique document id (integer)
  2. A binary label (integer):
    • The number 1 for negative documents
    • The number 3 for positive documents
  3. Domain (string): a keyword specifying the domain of the document
  4. Document body (string)

See the file “sample.data” for a sample input.
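As a rough illustration of this format, the snippet below builds one such line and parses it back into its four fields. The id, label, domain, and text values are made up; only the tab-separated layout follows the description above:

# One hypothetical input line: id, label (1 = negative, 3 = positive), domain, body
line = "42\t3\tflu\tI have been feeling feverish since yesterday."

doc_id, label, domain, body = line.rstrip("\n").split("\t", 3)
doc_id, label = int(doc_id), int(label)
print(doc_id, label, domain)   # 42 3 flu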

Training and Evaluation
Below you can see an example command to run the code. This command tells the code to read the input data, separate the documents by domain, iteratively treat one domain as the target domain and the rest as the source domains, train the model, test it on the held-out target domain, and print the results (F1, Precision, Recall, Accuracy). A rough sketch of this leave-one-domain-out protocol follows the argument list below.

The code is run for the specified number of iterations, and the average results are printed at the end of the execution.

python -m CEPC.src.MainThread --cmd da_m_mine1 \
--itr 5 \
--model_path /user/desktop/bert-base-uncased/ \
--data_path /user/desktop/data/sample.data \
--output_dir /user/desktop/output \
--device 0 \
--seed 666 

The arguments are explained below:

  • “--itr”: The number of iterations to run the experiment with different random seeds
  • “--model_path”: The path to the pretrained Hugging Face BERT model
  • “--data_path”: The path to the input data file
  • “--output_dir”: A directory to be used for temporary files (the results are printed on screen only)
  • “--device”: GPU identifier
  • “--seed”: Random seed
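For illustration only, the leave-one-domain-out protocol described above can be sketched as follows. The file parsing mirrors the Input section; train_model and evaluate_model are hypothetical placeholders and are not the actual entry points of this repository:

from collections import defaultdict

# Sketch of the leave-one-domain-out evaluation loop described above.
# The file parsing mirrors the "Input" section; train_model and
# evaluate_model are hypothetical placeholders, not functions of this repo.

def read_documents(path):
    with open(path, encoding="utf-8") as f:
        for line in f:
            doc_id, label, domain, body = line.rstrip("\n").split("\t", 3)
            yield int(doc_id), int(label), domain, body

def leave_one_domain_out(path, iterations):
    by_domain = defaultdict(list)
    for doc in read_documents(path):
        by_domain[doc[2]].append(doc)          # group documents by domain

    scores = []
    for seed in range(iterations):
        for target, target_docs in by_domain.items():
            # All remaining domains act as the labeled source domains.
            source_docs = [d for dom, docs in by_domain.items()
                           if dom != target for d in docs]
            # model = train_model(source_docs, seed=seed)        # hypothetical
            # scores.append(evaluate_model(model, target_docs))  # F1, Precision, Recall, Accuracy
    # The actual code averages the metrics over all iterations and
    # target domains and prints them at the end of the run.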

Datasets

  • You can find the Illness dataset here: Link
  • The Crisis and Tuning datasets are in the directory "sandoogh". These are meta-datasets collected from data published by other researchers. Please make sure to cite the original articles if you intend to use them; see the paper for the references.

Notes

  • A prerequisite of the algorithm is a grid search to obtain the hyper-parameters; the values cannot be set manually (see the paper). The grid search may make the code look a bit slow, especially if your dataset is large. During development I used a caching module and a few programming tricks to speed things up. The code here does not include the caching module, because it would not make your experiments faster.
  • The code uses the pretrained Hugging Face BERT model: Link
  • The batch size is set to 50.
