This repository provides the code for the HetTransformer model.
This is the intended directory structure after completing the data collection and processing steps.
.
├── data/ # Datasets and their processing scripts
| ├── FakeNewsNet/ # Cloned FakeNewsNet repository
| ├── PHEME/ # Collected and unzipped PHEME dataset
| ├── processed_data/ # Pre-processed data
| | ├── FakeNewsNet/ # Pre-processed FakeNewsNet data
| | └── PHEME/ # Pre-processed PHEME data
| ├── rwr_results/ # Generated RWR neighbors
| ├── README.md # Data pre-processing instructions
| └── ... # Data pre-processing scripts
├── figures_and_tables/ # Figures and tables in this README.md
├── models/ # Experiment-related scripts
| ├── train_and_evaluation/ # The model training and evaluation code
| ├── para_sensitivity/ # Parameter sensitivity code
| ├── data_splits/ # Train-val-test split used
| ├── best_models/ # Reserved directory where the training scripts save users' trained models
| └── pre-trained/ # The pre-trained models
├── README.md # Reproduction instructions
└── requirements.txt # Dependencies
Run the following command to create the required directories.
mkdir data/processed_data data/rwr_results
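If you also want to create the pre-processed dataset sub-directories shown in the tree above up front (the pre-processing scripts may create them for you as well), the following works:
mkdir -p data/processed_data/FakeNewsNet data/processed_data/PHEME data/rwr_results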
This repository is coded in python==3.8.5.
Please run the following command to install the other requirements from requirements.txt.
pip install -r requirements.txt
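If you prefer an isolated environment, one option (assuming python3.8 is on your PATH; the environment name venv is arbitrary) is:
python3.8 -m venv venv
source venv/bin/activate
pip install -r requirements.txt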
Three datasets, PolitiFact, GossipCop, and PHEME, are used. While collecting the first two takes many days, the last can be done in minutes.
To comply with the Twitter Developer Policy, the Twitter datasets cannot be shared. Hence, each developer must crawl their own copy of FakeNewsNet for the PolitiFact and GossipCop datasets.
First of all, run the following to get a copy of FakeNewsNet under the data/ directory.
cd data
git clone https://github.com/KaiDMML/FakeNewsNet
cd ..
Then, please follow the steps in the FakeNewsNet repository to collect the data.
!! Due to Twitter API rate limits, it may take more than 20 days to collect a complete copy of FakeNewsNet if you only have one Twitter API key. To verify the collection, you may follow the instructions in data/README.md.
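As a rough sanity check while the crawl is running (a sketch only; the authoritative verification steps are in data/README.md, and the file layout below is assumed from FakeNewsNet's default configuration), you can count how many news articles have been collected so far:
find data/FakeNewsNet -name "news content.json" | wc -l    # number of crawled news articles; adjust the path/filename to your crawler configuration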
Run the following on the command line to collect PHEME under data/, unzip it, and rename it.
cd data
wget -O PHEME.tar.bz2 "https://ndownloader.figshare.com/files/6453753"
tar -vxf PHEME.tar.bz2
mv pheme-rnr-dataset PHEME
cd ..
The compressed file is only about 25 MB and can be downloaded in around 3 minutes.
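To confirm the extraction succeeded, list the contents of data/PHEME; you should see the event directories of the PHEME rumour/non-rumour dataset (e.g. charliehebdo, ferguson, germanwings-crash, ottawashooting, sydneysiege):
ls data/PHEME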
Data pre-processing includes image fetching, image encoding, text encoding, graph construction, and the extraction of other features. The details are described in data/README.md.
We also provide a processed version of the three datasets via this Google Drive link.
You can download them and place them under data/processed_data/FakeNewsNet/GossipCop/batch/, data/processed_data/FakeNewsNet/PolitiFact/batch/, and data/processed_data/PHEME/batch/ respectively, in line with the data paths described in the train_and_evaluation scripts.
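For example, in case these directories do not already exist in your checkout, the three batch/ directories can be created with the command below; then move the downloaded files into the matching directory:
mkdir -p data/processed_data/FakeNewsNet/PolitiFact/batch data/processed_data/FakeNewsNet/GossipCop/batch data/processed_data/PHEME/batch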
After generating the batch files following steps 1 and 2 in data/processed_data/FakeNewsNet/PolitiFact/batch/, data/processed_data/FakeNewsNet/GossipCop/batch/, and data/processed_data/PHEME/batch/ respectively, download and place the n_neighbors.txt files following the instructions in models/README.md.
You can download the pre-trained models following the instructions in models/pre-trained/README.md.
You can also load the pre-trained models in models/pre-trained/ following the evaluation scripts described under models/train_and_evaluation/.
Run the scripts in models/train_and_evaluation/ to train the model and get the evaluation results. If you do so, you may need to rename the pre-trained models or change the PATH in the train_and_evaluation code, since the default setting is to overwrite the models saved under models/pre-trained/.