One reason for the positive impact of Pretrained Language Models (PLMs) in NLP tasks is their ability to encode semantic types, such as ‘European City’ or ‘Woman’. While previous work has analyzed such information in the context of interpretability, it is not clear how to use types to steer the PLM output. For example, in a cloze statement, it is desirable to steer the model to generate a token that satisfies a user-specified type, e.g., predict a date rather than a location.
In this work, we introduce Type Embeddings (TEs), an input embedding that promotes desired types in a PLM. Our proposal is to define a type by a small set of word examples. We empirically study the ability of TEs both in representing types and in steering masked predictions in BERT without changes to the prompt text. Finally, using the LAMA datasets, we show that TEs substantially improve the precision of extracting facts from PLMs.
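As a rough illustration of the idea (not the repo's actual implementation, which is in `src/GetTypeVecs.py`), the sketch below builds a type vector from a handful of example words by averaging their BERT input embeddings and adds it at the `[MASK]` position, nudging the prediction toward the desired type while leaving the prompt text untouched. The simple mean and the unscaled addition are assumptions made for illustration only.

```python
# Illustrative sketch only: the paper/repo may construct and inject
# type vectors differently (see src/GetTypeVecs.py).
import torch
from transformers import BertTokenizer, BertForMaskedLM

tok = BertTokenizer.from_pretrained("bert-base-cased")
model = BertForMaskedLM.from_pretrained("bert-base-cased")
model.eval()

# A "type" defined by a small set of example words
# (assumed to be single wordpieces in the BERT vocabulary).
examples = ["Paris", "London", "Berlin", "Rome", "Madrid"]

with torch.no_grad():
    emb_table = model.get_input_embeddings().weight             # [vocab, hidden]
    ex_ids = tok.convert_tokens_to_ids(examples)
    type_vec = emb_table[ex_ids].mean(dim=0)                    # assumption: simple mean

    prompt = "Mozart was born in [MASK]."
    enc = tok(prompt, return_tensors="pt")
    inputs_embeds = model.get_input_embeddings()(enc["input_ids"])

    # Add the type vector at the [MASK] position to steer the prediction
    # toward the desired type; the prompt text itself is unchanged.
    mask_pos = (enc["input_ids"][0] == tok.mask_token_id).nonzero(as_tuple=True)[0]
    inputs_embeds[0, mask_pos] += type_vec

    logits = model(inputs_embeds=inputs_embeds,
                   attention_mask=enc["attention_mask"]).logits
    print(tok.decode(logits[0, mask_pos].argmax(dim=-1)))
```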
This repo contains the required code for running the experiments of the associated paper.
```bash
git clone https://github.com/MhmdSaiid/TypeEmbedding
cd TypeEmbedding
virtualenv TE -p $(which python3)
source TE/bin/activate
pip install -r requirements.txt
```
The datasets and type embeddings used for the experiments can be prepared by running:
```bash
bash exps/prepare_data.sh
```
- Similarity (exps/Similarity.ipynb)
- SV distribution (exps/Type_Eigen_Analysys.ipynb)
- Adversarial metrics (Under Construction)
- Adversarial Accuracy
bash exps/PushTest.sh
- Adversarial Sensitivity
bash exps/SensTest.sh
- Layerwise Classification (Under Construction)
bash exps/LayerwiseTest.sh
- TCAV Sensitivity (exps/LW_TCAV.ipynb) (Under Construction)
- Intrinsic Experiments (Section X.X)
bash exps/runItr.sh
python src/print_avg.py --res_dir "results/ProcessedDatasets"
- Run Extrinsic Experiments (Section X.X) (Under Construction)
- Type Switch (exps/Type_Switch.ipynb)
- Sampling Variations
- Sampling Method
bash exps/SampleMethod.sh
- Number of Samples
bash exps/SampleNum.sh
- GPT-2 examples (exps/GPT2+Concept.ipynb)
- De-toxification (Under Construction)
You can generate your own Type Embeddings using the following script:
```bash
python src/GetTypeVecs.py --model_arch 'bert-base-cased'\
                          --path 'data/KG Samples/'\
                          --seed 0\
                          --num_samples 10
```
- `--model_arch`: model architecture according to HuggingFace's Transformers library
- `--path`: folder containing samples for types in CSV format
- `--seed`: seed value for reproducibility
- `--num_samples`: number of samples used
- `--sample_type`: sampling method. Choose between Uniform Random Sampling (Unif), Random Weighted Sampling (Weighted), Most important samples (Top), and Least important samples (Bot)
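For example, to build TEs using Random Weighted Sampling over the same type samples:

```bash
python src/GetTypeVecs.py --model_arch 'bert-base-cased'\
                          --path 'data/KG Samples/'\
                          --seed 0\
                          --num_samples 10\
                          --sample_type Weighted
```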
For any inquiries, feel free to contact us or raise an issue on GitHub.
You can cite our work:
@inproceedings{saeed-etal-2022-TE,
title = {You Are My Type! Type Embeddings for Pre-trained Language Models},
author = {Saeed, Mohammed and Papotti, Paolo},
booktitle = {Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing},
month = dec,
year = {2022},
address = {Online and Abu Dhabi, UAE},
publisher = {Association for Computational Linguistics},
}