The experiment has two parts. The code and data are saved in `ENRON/` and `LAMA/`, respectively.
The Enron email dataset is downloaded from http://www.cs.cmu.edu/~enron/ and extracted to `ENRON/enron/maildir/`.
Some data files are too large to upload. Here is how to prepare them:
- `ENRON/enron/parsed_emails.pkl`: run the scripts in `mailparser.ipynb`
- `ENRON/enron_count/cooccur.pkl`: run `python ENRON/enron_count/cooccurrence.py`
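The parsing step can be sketched roughly as follows. This is a minimal sketch, not the actual notebook logic: the field names and the single-message example are illustrative assumptions, and `mailparser.ipynb` may keep different fields.

```python
import email
import pickle

def parse_email(raw):
    """Parse one raw RFC 822 message into a plain dict.

    Illustrative sketch; the real notebook may extract different fields.
    """
    msg = email.message_from_string(raw)
    return {
        "from": msg.get("From", ""),
        "to": msg.get("To", ""),
        "subject": msg.get("Subject", ""),
        # Enron maildir messages are plain text, so multipart handling is skipped
        "body": msg.get_payload() if not msg.is_multipart() else "",
    }

raw = """From: alice@enron.com
To: bob@enron.com
Subject: meeting

See you at 3pm."""

parsed = parse_email(raw)

# Serialize the parsed messages, mirroring the role of parsed_emails.pkl
with open("parsed_emails.pkl", "wb") as f:
    pickle.dump([parsed], f)
```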
The prediction script is reused from this repo: https://github.com/jeffhj/LM_PersonalInfoLeak. Some prediction results are uploaded under `ENRON/final_result_pkl/`.
`ENRON/analysis-email*.py` are the analysis scripts used in the experiments.
The LAMA dataset is downloaded from https://dl.fbaipublicfiles.com/LAMA/data.zip. We also used a Wikipedia dump to extract contexts; the preprocessing script is `LAMA/wikidump-prepare.py`.
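The dump preprocessing can be sketched as pulling `(title, text)` pairs out of the dump XML. This is a simplified sketch under stated assumptions: real dumps are namespaced and far too large for `fromstring`, so `LAMA/wikidump-prepare.py` likely streams the file instead (e.g. with `iterparse` or a dedicated extractor).

```python
import xml.etree.ElementTree as ET

def iter_pages(xml_text):
    """Yield (title, text) pairs from a Wikipedia-dump-style XML string.

    Assumes the standard <page>/<title>/<revision>/<text> layout without
    namespaces; a real dump would be processed as a stream instead.
    """
    root = ET.fromstring(xml_text)
    for page in root.iter("page"):
        title = page.findtext("title", "")
        text = page.findtext(".//text", "")
        yield title, text

# Tiny hand-written example in the dump's general shape
sample = """<mediawiki>
  <page>
    <title>Paris</title>
    <revision><text>Paris is the capital of France.</text></revision>
  </page>
</mediawiki>"""

pages = list(iter_pages(sample))
```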
- Prepare prompts: `prompt-prepare.py`
- Prepare contexts: `LAMA/find_occurrence.py`, `LAMA/occurrence_agg.py`, `LAMA/extract_context.py`
- Predict: `LAMA/pred.py`
- Predict (with context): `LAMA/pred_context.py`
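The context-preparation steps above can be sketched as a single pass: find the passages that mention a subject entity, then cut a token window around the first mention. This is an illustrative sketch only; the matching rules and window size are assumptions, and the real logic is split across the three scripts listed above.

```python
def find_occurrences(passages, entity):
    """Return indices of passages that mention the entity (case-insensitive)."""
    needle = entity.lower()
    return [i for i, p in enumerate(passages) if needle in p.lower()]

def extract_context(passage, entity, window=5):
    """Return up to `window` tokens on each side of the first mention."""
    tokens = passage.split()
    lowered = [t.lower().strip(".,") for t in tokens]
    for i, t in enumerate(lowered):
        if t == entity.lower():
            return " ".join(tokens[max(0, i - window): i + window + 1])
    return ""

passages = [
    "Paris is the capital of France.",
    "The Seine flows through Paris.",
    "Berlin is the capital of Germany.",
]

hits = find_occurrences(passages, "Paris")
contexts = [extract_context(passages[i], "Paris") for i in hits]
```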
`LAMA/analysis*.py` are the analysis scripts used in the experiments.