This is the official implementation repository of the NAACL 2024 Findings paper "GenTKG: Generative Forecasting on Temporal Knowledge Graph with Large Language Models".
This work fine-tunes the large language model Llama-2-7B with PEFT and uses it for temporal knowledge graph (TKG) forecasting. The training and evaluation data are obtained by TLR retrieval, and the model weights trained with FIT are stored on Google Drive.
Download the code and go to the folder:
git clone https://github.com/mayhugotong/GenTKG.git
cd GenTKG
Create an environment:
conda create -n gtkg python=3.8
conda activate gtkg
pip install -r requirements.txt
pip install git+https://github.com/huggingface/peft.git
Download data and models from Google Drive and then unzip and save them in folders "data" and "model".
You can use gdown to do it (the --fuzzy option lets gdown extract the file ID from a sharing link; the URLs are quoted so the shell does not interpret the "?"):
pip install gdown
gdown --fuzzy "https://drive.google.com/file/d/1C63Ugg_Xc1MGgeToiYNM0X4i35CJUEWA/view?usp=sharing"
unzip data.zip -d .
gdown --fuzzy "https://drive.google.com/file/d/1mpzlfKLuh3cHvox8UpP1RkPUHKeN4_eL/view?usp=sharing"
unzip model.zip -d .
gdown --fuzzy "https://drive.google.com/file/d/145avybZXtlTrshVBJ22B6KSJPnK5nQVS/view?usp=sharing"
unzip model_backup.zip -d .
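If you prefer to pass gdown a direct download URL, the file ID can be pulled out of a sharing link first. The following is a minimal sketch; the helper names `drive_file_id` and `direct_download_url` are made up for illustration and are not part of this repository or of gdown.

```python
import re


def drive_file_id(share_url: str) -> str:
    """Return the file ID embedded in a Google Drive sharing link."""
    match = re.search(r"/file/d/([A-Za-z0-9_-]+)", share_url)
    if match is None:
        raise ValueError(f"no Drive file ID found in {share_url!r}")
    return match.group(1)


def direct_download_url(share_url: str) -> str:
    """Build the uc?id=... form of the link from a sharing link."""
    return f"https://drive.google.com/uc?id={drive_file_id(share_url)}"


# The data archive link from the instructions above.
data_link = "https://drive.google.com/file/d/1C63Ugg_Xc1MGgeToiYNM0X4i35CJUEWA/view?usp=sharing"
print(direct_download_url(data_link))
```

The printed URL can then be passed to gdown without the --fuzzy option.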
Before anything else, you may want to convert the datasets from ids to lexical form (words). For example, for the train file of icews14:
python ./data_utils/id_words.py --file_to_convert ./data/icews14/train.txt --path_output ./data/processed_new/icews14/train.txt --dataset icews14 --period 24
Conversion parameters:
- -f --file_to_convert, input path of the file to convert.
- -o --path_output, output path of the converted file.
- -d --dataset, icews14, icews18, GDELT or YAGO.
- -p --period, default 1; set it to 24 for icews14/18, where timestamps increase in steps of 24.
Datasets containing all facts of train, valid and test are also provided (all_facts.txt).
By default, any new datasets you create are written to ./data/processed_new/ .
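To make the id-to-lexical conversion concrete, here is a rough sketch of the idea. It is not the code in id_words.py. It assumes each fact is a whitespace-separated quadruple of ids (subject, relation, object, timestamp), that the *2id.json files map names to ids, and that the period is used to turn raw timestamps into consecutive step indices. The entity and relation names are toy values.

```python
from typing import Dict, Tuple


def invert(name_to_id: Dict[str, int]) -> Dict[int, str]:
    """Turn a name -> id dictionary into an id -> name dictionary."""
    return {idx: name for name, idx in name_to_id.items()}


def ids_to_words(
    line: str,
    id2entity: Dict[int, str],
    id2relation: Dict[int, str],
    period: int = 1,
) -> Tuple[str, str, str, int]:
    """Convert one id quadruple into lexical form (assumed layout: s r o t)."""
    subj, rel, obj, ts = (int(field) for field in line.split()[:4])
    # Assumption: dividing by the period yields a consecutive time index.
    return id2entity[subj], id2relation[rel], id2entity[obj], ts // period


# Toy dictionaries standing in for entity2id.json and relation2id.json.
entity2id = {"Entity_A": 0, "Entity_B": 1}
relation2id = {"relation_x": 0}

fact = ids_to_words("0\t0\t1\t48", invert(entity2id), invert(relation2id), period=24)
print(fact)
```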
The rule learning part is adapted from the TLogic rule learning code. It runs on the lexical datasets (although it simply converts them back into ids). By default it only reads datasets in ./data/ , not ./data/processed_new/ . You can produce rule banks other than the provided ones by running, e.g. for icews14:
cd data_utils/rules_learning
python3 learn.py -d icews14 -l 1 2 3 -n 200 -p 15 -s 12
Rules learning parameters:
- -d --dataset, dataset name.
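- -l --rule_lengths, lengths of the rule chains to learn. (only length=1 is used)
- -n --num_walks, number of walks to sample during rule learning.
- --transition_distr, transition distribution; default: exp.
- -p --num_processes, number of processes, for acceleration.
- -s --seed, random seed.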
You will get a rule bank file similar to "060723022344_r[1,2,3]_n200_exp_s12_rules.json" under the ./output/ folder.
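The file name appears to encode the run settings (a time stamp, then the rule lengths, number of walks, transition distribution and seed). As an illustration only, the sketch below parses such a name with a regular expression. The pattern is inferred from the single example name above and the helper `parse_rule_bank_name` is hypothetical, so adjust it if your file names differ.

```python
import re
from typing import List, NamedTuple


class RuleBankName(NamedTuple):
    stamp: str
    rule_lengths: List[int]
    num_walks: int
    transition_distr: str
    seed: int


# Pattern inferred from "060723022344_r[1,2,3]_n200_exp_s12_rules.json".
PATTERN = re.compile(
    r"^(?P<stamp>\d+)_r\[(?P<lengths>[\d,\s]+)\]_n(?P<walks>\d+)"
    r"_(?P<distr>[a-z]+)_s(?P<seed>\d+)_rules\.json$"
)


def parse_rule_bank_name(name: str) -> RuleBankName:
    """Split a rule bank file name into its settings."""
    match = PATTERN.match(name)
    if match is None:
        raise ValueError(f"unexpected rule bank file name: {name!r}")
    lengths = [int(part) for part in match.group("lengths").split(",")]
    return RuleBankName(
        stamp=match.group("stamp"),
        rule_lengths=lengths,
        num_walks=int(match.group("walks")),
        transition_distr=match.group("distr"),
        seed=int(match.group("seed")),
    )


info = parse_rule_bank_name("060723022344_r[1,2,3]_n200_exp_s12_rules.json")
print(info)
```

Because the name contains square brackets, it is safest to quote it when passing it on the command line.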
Find the file name of the rule bank JSON (in ./output) and run the retrieval from the repository root (the GenTKG folder):
cd ../..  # only needed if you are still in data_utils/rules_learning
python3 ./data_utils/retrieve.py --name_of_rules_file name_rules.json --dataset icews14
An example for icews18 (the file name is quoted because it contains square brackets):
python ./data_utils/retrieve.py --name_of_rules_file "060723022344_r[1]_n200_exp_s12_rules.json" --dataset icews18
By default, the following files are created:
- data/processed_new/{dataset}/[train, valid, test]/history_facts/history_facts_{dataset}.txt [A]
- data/processed_new/{dataset}/[train, valid, test]/history_facts/history_facts_{dataset}_idx_fine_tune_all.txt
- data/processed_new/{dataset}/[train, valid, test]/test_answers/test_answers_{dataset}.txt [B]
For training, you need to convert the history_facts files into a LoRA JSON file:
python3 ./data_utils/create_json_train.py --dir_of_trainset 'the_full_trainset_to_convert (see [A])' --dir_of_answers 'the_test_answers (see [B])' --dir_of_entities2id 'the_json_of_entities2id (see [C])' --path_save 'better_the_same_as_the_trainset_before_converting'
An example for icews18:
python ./data_utils/create_json_train.py --dir_of_trainset './data/processed_new/icews18/train/history_facts/history_facts_icews18.txt' --dir_of_answers './data/processed_new/icews18/train/test_answers/test_answers_icews18.txt' --dir_of_entities2id './data/icews18/entity2id.json' --path_save './data/processed_new/icews18/train'
Basic training:
python3 main.py --OUTPUT_DIR "your_output_directory" --DATA_PATH "path_of_dataset_file"
Example for training:
python3 main.py --OUTPUT_DIR "./model/output_model_icews14_1024" --DATA_PATH "./data/processed/train/icews14/icews14_1024.json"
Training parameters (in config.py):
- MICRO_BATCH_SIZE, Per device train batch size.
- BATCH_SIZE, batch size.
- EPOCHS, Training epochs.
- WARMUP_STEPS, Warmup steps.
- LEARNING_RATE, Training learning rate.
- CONTEXT_LEN, Truncation length of context (in json).
- TARGET_LEN, Truncation length of target (in json).
- TEXT_LEN, Truncation length of text (in txt).
- LORA_R, Lora low rank.
- LORA_ALPHA, Lora Alpha.
- LORA_DROPOUT, Lora dropout.
- MODEL_NAME, Model name.
- LOGGING_STEPS, Logging steps in training.
- LOAD_BEST_MODEL_AT_END, set 1 to save the best checkpoint.
- OUTPUT_DIR, Output dir.
- DATA_PATH, Input dir of trainset.
- DATA_TYPE, File type of the input trainset.
If you want to use a logging platform such as WandB, you may need these:
- REPORT_TO, logging to e.g. wandb.
- PROJ_NAME, Project name for e.g. wandb.
- RUN_NAME, Run name for e.g. wandb.
- SAVE_STEPS, Save the model according to steps.
- SAVE_TOTAL_LIMIT, The number of the checkpoint you will save (Excluding the final one).
- W_RESUME, set 1 to enable WANDB_RESUME.
- W_ID, set 1 to enable WANDB_RESUME.
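The relation between BATCH_SIZE and MICRO_BATCH_SIZE is not spelled out above. A common convention in LoRA fine-tuning scripts is that the first is the effective batch size, the second is the per-device batch size, and their ratio gives the number of gradient accumulation steps. Whether config.py and main.py follow this convention is an assumption, so check the code before relying on it. The values below are placeholders.

```python
def gradient_accumulation_steps(batch_size: int, micro_batch_size: int) -> int:
    """Number of micro-batches accumulated before one optimizer step.

    Assumes batch_size is the effective batch size and micro_batch_size
    is the per-device batch size (a convention, not confirmed for this repo).
    """
    if micro_batch_size <= 0:
        raise ValueError("micro_batch_size must be positive")
    if batch_size % micro_batch_size != 0:
        raise ValueError("batch_size should be a multiple of micro_batch_size")
    return batch_size // micro_batch_size


# Placeholder values, not the defaults of this repository.
BATCH_SIZE = 128
MICRO_BATCH_SIZE = 4
print(gradient_accumulation_steps(BATCH_SIZE, MICRO_BATCH_SIZE))
```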
Basic test:
python3 inference.py --LORA_CHECKPOINT_DIR "path of model checkpoint" --output_file "your output directory" --input_file "path of history_facts file" --test_ans_file "path of test_answers file"
Example for testing:
python3 inference.py --LORA_CHECKPOINT_DIR "./model/icews14" --output_file "./results/prediction_icews14.txt" --input_file "./data/processed/eval/history_facts/history_facts_icews14.txt" --test_ans_file "./data/processed/eval/test_answers/test_ans_icews14.csv"
Testing parameters (in eval_utils.py):
- LORA_CHECKPOINT_DIR, the path of model checkpoint.
- output_file, path of the output file for the predictions.
- input_file, path of history_facts file.
- test_ans_file, path of test_answers file.
If you want to begin from a certain i-th question (like resuming):
- begin, the index of the question to start from.
- last_metric, the path for the saved metric file. It will read the results from it.
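The resume logic can be pictured as follows. This is a simplified sketch rather than the code in inference.py or eval_utils.py: it assumes the saved metric boils down to a count of correct answers so far, and `predict` is a stand-in for the model call.

```python
from typing import Callable, Sequence


def evaluate(
    questions: Sequence[str],
    answers: Sequence[str],
    predict: Callable[[str], str],
    begin: int = 0,
    hits_so_far: int = 0,
) -> float:
    """Score questions[begin:], carrying over hits from an earlier run."""
    hits = hits_so_far
    for index in range(begin, len(questions)):
        if predict(questions[index]) == answers[index]:
            hits += 1
    # Accuracy over the whole test set, not only the resumed part.
    return hits / len(questions) if questions else 0.0


# Toy data and a toy predictor that maps "q3" to "a3".
toy_questions = ["q0", "q1", "q2", "q3"]
toy_answers = ["a0", "a1", "a2", "a3"]
toy_predict = lambda question: "a" + question[1:]

full_run = evaluate(toy_questions, toy_answers, toy_predict)
resumed = evaluate(toy_questions, toy_answers, toy_predict, begin=2, hits_so_far=1)
print(full_run, resumed)
```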
The repository contains code for TLR, few-shot instruction tuning of Llama 2, and inference. Learned rule banks are also provided. The layout is:
Root
|--data_utils/
|    |--rules_learning/ (codes from [Tlogic](https://github.com/liu-yushan/TLogic))
|    |--basic.py (utils for data reading/writing etc)
|    |--create_json_train.py (convert dataset into lora json format)
|    |--id_words.py (convert between id and lexical entities, relations and timestamps)
|    |--retrieve.py (data reading/writing and so on for retrieving)
|    |--TLR.py (retrieve history according to rules)
|--llama2_ori_repo/ (In-context Learning codes for llama2; imported in evaler.py)
|--minimal20b/ (In-context Learning codes for gpt-neox; imported in evaler.py)
|--output/ (contains rules banks from Tlogic rules learning)
|--results/ (stores inference results; empty)
|--config.py
|--eval_utils.py
|--evaler.py
|--inference.py (inference)
|--main.py (training)
|--neox.py (gpt-neox inference)
|--utils.py
The structure of the data folder should be similar to this:
Datasets
|--processed/
|    |--train/ (trainsets for GenTKG; JSON files)
|    |    |--icews14/
|    |    |    |--icews14.json (full set)
|    |    |    |--icews14_16.json (sampled set)
|    |    |    ...
|    |    |    |--icews14_1024.json (sampled set)
|    |    |--icews18/
|    |    |    ...
|    |--eval/
|    |    |--history_facts/
|    |    |    |--history_facts_icews14.txt
|    |    |    |--history_facts_icews18.txt
|    |    |    |--history_facts_GDELT.txt
|    |    |    |--history_facts_YAGO.txt
|    |    |--test_answers/
|    |    |    |--test_ans_icews14.csv
|    |    |    |--test_ans_icews18.csv
|    |    |    |--test_ans_GDELT.txt
|    |    |    |--test_ans_YAGO.txt
|--original/ (original datasets mainly for rule-based models)
|    |--icews14/
|    |    |--all_facts.txt
|    |    |--train.txt
|    |    |--valid.txt
|    |    |--test.txt
|    |    |--stat.txt
|    |    |--entity2id.json (JSON as dictionary format; for GenTKG) [C]
|    |    |--relation2id.json (JSON as dictionary format; for GenTKG)
|    |    |--ts2id.json (JSON as dictionary format; for GenTKG)
|    |--icews18/
|    |    |--all_facts.txt
|    |    |--train.txt
|    |    |--valid.txt
|    |    |--test.txt
|    |    |--stat.txt
|    |    |--entity2id.json
|    |    |--relation2id.json
|    |    |--ts2id.json
|    ...
Each training JSON file holds records of the following form:
{"context":question1, "target":answer1}{"context":question2, "target":answer2}...
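If the records really are stored back to back as shown, a plain json.load call will not read them. The sketch below reads such a stream with the standard library's `JSONDecoder.raw_decode`; it also tolerates a regular JSON list of flat records. The exact on-disk layout of the provided training files is an assumption here, and `iter_records` is an illustrative helper, not part of this repository.

```python
import json
from typing import Dict, Iterator


def iter_records(text: str) -> Iterator[Dict[str, str]]:
    """Yield JSON objects that are concatenated or wrapped in a flat list."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # Skip whitespace and list punctuation between records.
        while pos < len(text) and text[pos] in " \t\r\n,[]":
            pos += 1
        if pos >= len(text):
            break
        record, pos = decoder.raw_decode(text, pos)
        yield record


# Toy content mimicking the record layout shown above.
toy_text = '{"context": "question1", "target": "answer1"}{"context": "question2", "target": "answer2"}'
records = list(iter_records(toy_text))
print(len(records), records[0]["target"])
```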
The evaluation files (history_facts and test_ans) are formatted as follows:
history_facts:
history1.1
history1.2
history1.3
...
query1
history2.1
history2.2
history2.3
...
query2
...
...
test_ans:
query_answer1
query_answer2
query_answer3
...
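To show how the two evaluation files line up, here is a small sketch that pairs each history block with its answer and emits records in the "context"/"target" form used for training. It assumes that blocks in history_facts are separated by a blank line and that the last line of each block is the query. The real separator may differ, so treat this as an illustration rather than the logic of create_json_train.py.

```python
from typing import Dict, List, Tuple


def split_blocks(text: str) -> List[Tuple[List[str], str]]:
    """Split history_facts text into (history lines, query) pairs.

    Assumption: blocks are separated by one blank line.
    """
    blocks = []
    for chunk in text.strip().split("\n\n"):
        lines = [line for line in chunk.splitlines() if line.strip()]
        if lines:
            blocks.append((lines[:-1], lines[-1]))
    return blocks


def to_records(history_text: str, answers: List[str]) -> List[Dict[str, str]]:
    """Pair the i-th block with the i-th answer."""
    blocks = split_blocks(history_text)
    if len(blocks) != len(answers):
        raise ValueError("number of queries and answers differ")
    return [
        {"context": "\n".join(history + [query]), "target": answer}
        for (history, query), answer in zip(blocks, answers)
    ]


# Toy inputs following the layout described above.
toy_history = "history1.1\nhistory1.2\nquery1\n\nhistory2.1\nquery2\n"
toy_answers = ["query_answer1", "query_answer2"]
toy_records = to_records(toy_history, toy_answers)
print(toy_records[0])
```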
Please cite our work as follows if you find it helpful.
@inproceedings{liao2024gentkg,
title={GenTKG: Generative Forecasting on Temporal Knowledge Graph with Large Language Models},
author={Liao, Ruotong and Jia, Xu and Li, Yangzhe and Ma, Yunpu and Tresp, Volker},
booktitle={Findings of the Association for Computational Linguistics: NAACL 2024},
pages={4303--4317},
year={2024}
}