- python2.7
- tensorflow 0.8-0.12
- ujson
- numpy
- scipy (for retrieval_map.py, geo_test.py)
- matplotlib (for geo_test.py)
- Validation (development) and test tweet data are provided (under data/)
- For training tweets, use the training downloader script from WNUT Workshop 2016. The script fetches the training tweet metadata from the official API.
- Pre-trained models can be downloaded here (2017-ijcnlp-deepgeo/deepgeo_models.tgz; 1.6GB)
- There are a total of 12 models:
- deepgeo_RXXX: R=XXX, sigma=0.0, alpha=0.0
- deepgeo_RXXX_noise: R=XXX, sigma=0.1, alpha=0.0
- deepgeo_RXXX_loss: R=XXX, sigma=0.0, alpha=0.1
- These are the models presented in Table 6 in the paper. R is the dimension of the representation, sigma is the standard deviation of the Gaussian noise, and alpha is the scaling factor for the additional loss term l
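
The naming scheme above can be decoded mechanically. The following is a minimal sketch, assuming the `XXX` placeholder is replaced by the integer value of R in the actual directory names; `describe_model` is a hypothetical helper and is not part of this repository.

```python
import re

# Assumption: model directories are named deepgeo_R<digits>, optionally
# followed by _noise or _loss, as in the list above.
_NAME_RE = re.compile(r"^deepgeo_R(\d+)(?:_(noise|loss))?$")


def describe_model(name):
    """Map a model name to the hyper-parameters implied by its suffix."""
    match = _NAME_RE.match(name)
    if match is None:
        raise ValueError("unrecognised model name: %r" % name)
    variant = match.group(2)
    return {
        "R": int(match.group(1)),                    # representation dimension
        "sigma": 0.1 if variant == "noise" else 0.0,  # Gaussian noise std dev
        "alpha": 0.1 if variant == "loss" else 0.0,   # extra loss scaling
    }
```

For instance, `describe_model("deepgeo_R400_noise")` would return R=400, sigma=0.1, alpha=0.0; the value 400 is only a made-up placeholder here, not a claim about which sizes ship in the archive.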
python geo_train.py
- Configurations are all defined in config.py
- The default values are the optimal hyper-parameter settings used in the paper
- Note that the first epoch can take a long time to finish (potentially 6+ hours), but subsequent epochs should run fairly quickly. The slow start is due to network initialisation.
- On a single K80 GPU, it takes around 25-30 hours to train 10 epochs on the full training data.
usage: geo_test.py [-h] -m MODEL_DIR [-d INPUT_DOC] [-l INPUT_LABEL]
                   [--predict] [--save_rep SAVE_REP] [--save_label SAVE_LABEL]
                   [--save_mat SAVE_MAT] [--print_attn] [--print_time]

Given trained model, perform various test inferences

optional arguments:
  -h, --help            show this help message and exit
  -m MODEL_DIR, --model_dir MODEL_DIR
                        directory of the saved model
  -d INPUT_DOC, --input_doc INPUT_DOC
                        input file containing the test documents
  -l INPUT_LABEL, --input_label INPUT_LABEL
                        input file containing the test labels
  --predict             classify test instances and compute accuracy
  --save_rep SAVE_REP   save representation (thresholded and converted to
                        binary) of test instances
  --save_label SAVE_LABEL
                        save label of test instances
  --save_mat SAVE_MAT   save representation (floats) and label of test
                        instances in MAT format
  --print_attn          print attention on text span
  --print_time          print time, offset and usertime distribution for
                        popular locations

python geo_test.py -m MODEL_DIR -d data/test/data.tweet.json -l data/test/label.tweet.json --predict
python geo_test.py -m MODEL_DIR -d data/test/data.tweet.json -l data/test/label.tweet.json --print_attn
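
The `--save_rep` option listed in the usage text stores representations that have been thresholded and converted to binary. A minimal sketch of that kind of thresholding is shown below. The 0.5 cut-off and the `binarise` helper are assumptions made for illustration; the threshold actually used by geo_test.py is not stated here.

```python
def binarise(representation, threshold=0.5):
    """Turn a sequence of floats into a list of 0/1 bits.

    Values strictly above the (assumed) threshold become 1, all others 0.
    """
    return [1 if value > threshold else 0 for value in representation]
```

A float vector such as `[0.1, 0.9, 0.5, 0.51]` would become `[0, 1, 0, 1]` under this assumed cut-off.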
- Example script is given in: compute_map.sh
- The idea is to first generate binary code representations for the train and test data, and then use retrieval_map.py to compute Hamming distances and mean average precision (MAP)
- In the example script, we use the validation data as the train data, as it is much smaller
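
The retrieval evaluation described above can be sketched in plain Python as follows. This is an illustrative sketch, not the code in retrieval_map.py: the function names are hypothetical, a retrieved item is assumed to be relevant when its label equals the query label, and ties in distance are broken by input order.

```python
def hamming_distance(a, b):
    """Count the positions at which two equal-length bit sequences differ."""
    if len(a) != len(b):
        raise ValueError("codes must have the same length")
    return sum(x != y for x, y in zip(a, b))


def average_precision(query_code, query_label, db_codes, db_labels):
    """Average precision of one query against a database of binary codes."""
    # Rank database items by Hamming distance (sorted() is stable, so
    # items at the same distance keep their input order).
    order = sorted(range(len(db_codes)),
                   key=lambda i: hamming_distance(query_code, db_codes[i]))
    hits = 0
    precision_sum = 0.0
    for rank, idx in enumerate(order, start=1):
        if db_labels[idx] == query_label:  # assumed relevance criterion
            hits += 1
            precision_sum += hits / float(rank)
    return precision_sum / hits if hits else 0.0


def mean_average_precision(query_codes, query_labels, db_codes, db_labels):
    """Mean of the per-query average precision values."""
    scores = [average_precision(code, label, db_codes, db_labels)
              for code, label in zip(query_codes, query_labels)]
    return sum(scores) / len(scores) if scores else 0.0
```

In this setting the "database" plays the role of the train data (the validation set in the example script) and the queries are the test instances.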
Lau, Jey Han, Lianhua Chi, Khoi-Nguyen Tran and Trevor Cohn (to appear). End-to-end Network for Twitter Geolocation Prediction and Hashing. In Proceedings of the 8th International Joint Conference on Natural Language Processing (IJCNLP 2017), Taipei, Taiwan.