
The result seems strange in my experiments #46

Closed
cooelf opened this issue Mar 12, 2018 · 18 comments


cooelf commented Mar 12, 2018

Thanks for your code and instructions!

I'm using the code (with no modifications) to run some experiments on the CoNLL-2003 dataset (English). The F1 scores on testa and testb are about 91% and 87%, which is not consistent with the reported 91% on the test set.

I have tried to optimize the hyper-parameters, but the F1 score only reaches 88.8% at most. I'm wondering if it could be due to the environment, like the Python version (3.6.4), TensorFlow version (tensorflow-gpu==1.3.0), or CUDA (8.0 with cudnn 5.1).

Could you share your environment for comparison, or give some insight into this result?

Thanks

@cooelf cooelf changed the title The results seem to be strange in my experiments The result seems strange in my experiments Mar 12, 2018

emrekgn commented Mar 26, 2018

I am wondering about this too. What are your hyperparameters?

I've been trying for some time to get the same results (F1: 90.94%) as reported for Lample et al.'s LSTM-CRF model. This is roughly what my (hyper)parameters look like:

dim_word = 100
dim_char = 25
nepochs = 100
dropout = 0.5
batch_size = 10
lr_method = "sgd"
lr= 0.01
lr_decay = 1.0 # original work does not use decay either!
clip = 5.0 # gradient clipping
hidden_size_char = 25
hidden_size_lstm = 100
# I also replace numeric with zero as stated in the original implementation of Lample.

I'm getting approximately 88.5% F1 for this setting.

The only difference I see compared to Lample's original implementation is the singleton replacement (with 0.5 probability) used to train the UNK token, but IMO this should not make a huge difference, right?
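
A minimal sketch of that singleton trick (assuming singletons is the set of words that occur exactly once in the training data and "$UNK$" is the unknown-word token; names are illustrative, not the exact implementation):

import random

def replace_singletons(words, singletons, p=0.5):
    # Words seen only once in training are swapped for the UNK token
    # with probability p, so the UNK embedding actually gets trained.
    return ["$UNK$" if w in singletons and random.random() < p else w
            for w in words]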

Any help would be appreciated.
Thanks.

@jayavardhanr

Firstly, thanks for sharing the code and detailed instructions.

I have been facing similar issues. I tried the same parameters as mentioned in the paper, but they only give a test F1 score of around 87.

I also tried tuning the hyper-parameters with different learning methods, learning rates, decays, and momentum values. The best result achieved with the code is 88.5 F1.

It would be great if you could share the hyperparameters with which you were able to reproduce the results in the paper.

My Environment Details:
Python 2.7
Tensorflow-gpu 1.2.0
CUDA 8.0.44

Thanks


cooelf commented Apr 9, 2018

I tried the following setting and the test F1 score is 90.02:

# embeddings
dim_word = 300
dim_char = 100    
# training
train_embeddings = False
nepochs          = 50
dropout          = 0.3
batch_size       = 50
lr_method        = "adam"
lr               = 0.005
lr_decay         = 0.9
clip             = 5 # if negative, no clipping
nepoch_no_imprv  = 7

# model hyperparameters
hidden_size_char = 100 # lstm on chars
hidden_size_lstm = 300 # lstm on word embeddings

My Environment Details:
Python 3.6
Tensorflow-gpu 1.3.0
CUDA 8.0.61 with cudnn 5.1

@jayavardhanr

@cooelf Thanks for the reply.
Did you use glove.840B.300d or word2vec 300d for word embeddings?


cooelf commented Apr 9, 2018

@jayavardhanr I simply used the glove.6B.300d word embeddings. It's quite small, actually. My partner tried the code with glove.840B.300d on a similar task, which showed a big improvement (+3.8%) over glove.6B.300d.

From my previous experiments, Adam also seems to work better than SGD. Maybe you can try that embedding together with the parameters above.

Looking forward to your feedback!

@jayavardhanr

@cooelf Thanks for the details. I tried the hyper-parameters you mentioned and achieved an F1 score of 90.10 on the test set.

Thanks again.

@Jonida88

@cooelf @jayavardhanr Hey guys, maybe someone can help me. I'm trying to run the model myself. I've been following these steps: 1. model/data_utils.py, 2. config.py, and then build_data.py, but the reference says to run build_data.py first and then config.py. Which order should I use? Also, when I run data_utils.py it doesn't iterate over the CoNLL dataset, but it doesn't show any error either, so I don't know what I'm doing wrong. I would really appreciate your help.


jayavardhanr commented Apr 12, 2018

  1. You need to download the CoNLL data and place it in the appropriate location. You can find the data here: https://github.com/synalp/NER/tree/master/corpus/CoNLL-2003

  2. Make this change in model/config.py:

'''
Initial code (lines 73 to 78):

# filename_dev = "data/coNLL/eng/eng.testa.iob"
# filename_test = "data/coNLL/eng/eng.testb.iob"
# filename_train = "data/coNLL/eng/eng.train.iob"

filename_dev = filename_test = filename_train = "data/test.txt" # test

Changed Code:

filename_dev = "data/coNLL/eng/eng.testa.iob"
filename_test = "data/coNLL/eng/eng.testb.iob"
filename_train = "data/coNLL/eng/eng.train.iob"

#filename_dev = filename_test = filename_train = "data/test.txt" # test

'''

The author provides test.txt, which will be used if you don't change this part of the code.
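
Regarding the order of steps: model/config.py and model/data_utils.py are just modules; you don't run them directly. If I understand the repo layout correctly, you edit model/config.py as above, then run build_data.py (it builds the vocab and the trimmed embeddings), and then train.py.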


luto65 commented Apr 12, 2018

I had to remove the ".iob" extension from the downloaded files... did you do that too?

@jayavardhanr

@luto65 Yes, forgot to mention that.


luto65 commented Apr 12, 2018

Using the defaults (without touching the installation) on macOS, I got the following on the CoNLL dataset:
acc 97.91 - f1 89.54

Impressive! Congrats!

@Jonida88

Hi @luto65 and @jayavardhanr, thank you very much for your help. Does anyone have an idea why I am getting the error described in issue 3 (the issue I opened)? I have tried many other approaches but I always get the same error. Thanks again in advance.

@ShengleiH

Hi @jayavardhanr, I have a question about the 'build data' part. I found that in 'build_data.py' the author builds the vocabulary using all of the 'train', 'dev' and 'test' data. But in my view, the vocabulary should be built on the train set only. Maybe I missed something; can you give me some advice? Thanks a lot!


sbmaruf commented Apr 16, 2018

Hi!
The vocab is fine with train, test, and dev. Here's the reason:

  1. You are not actually using the labels of the dev and test sets.

  2. Assume you were not using dev and test. Now you get an unknown word from dev. You look up the word's embedding in GloVe, word2vec, or fastText (or initialize it randomly), find it, add it to your vocabulary, and look it up from there. It's as if you had encountered an unknown word at runtime and processed it then, since the pretrained embeddings are always available to you. There's no harm in it.

Now, doing this procedure at runtime would be hard to track. Instead, you take all the words from train, test, and dev as the vocabulary at the beginning of training. The procedure is equivalent.
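
A minimal sketch of what this amounts to (illustrative names, not the repo's exact API):

def build_vocab(datasets, glove_vocab):
    # Collect every word seen in train/dev/test; labels are ignored,
    # so no label information leaks from dev or test.
    words = set()
    for dataset in datasets:              # e.g. [train, dev, test]
        for sentence, _tags in dataset:
            words.update(sentence)
    # Keep only words that also have a pretrained GloVe vector.
    return words & glove_vocab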

@ShengleiH

@sbmaruf Hi, thank you~ Can I always use the 'UNK' embedding for unknown words in the dev/test set during evaluation? I mean, I don't want to assign these unknown words their corresponding GloVe embeddings.


sbmaruf commented Apr 19, 2018

@ShengleiH Sorry for being late.

I don't see any problem with doing this at evaluation time, since during training you only train the model on tokens from the train set. If you are using pretrained embeddings, this is also what the original author (@glample) of the paper did.

There is no need to fall back to < UNK >: during evaluation you simply look up the embeddings of the dev and test words and pass them to your model. Remember that you haven't trained the model on them (dev or test), so there is no problem; using their pretrained embeddings doesn't mean you are training your model on them.

Remember that at training time your model never sees data tagged < UNK >. If you can give two tokens that would both be treated as < UNK > at dev or test time different embeddings, there is no problem. That said, if you are not using pretrained embeddings (i.e., initializing the embeddings from a random distribution), there should still not be any problem, although the original author (@glample) of the paper did use the < UNK > tag in that case.

I would also like to have some input from @guillaumegenthial in this regard.

@guillaumegenthial (Owner)

Because we're using pre-trained embeddings, we can keep the vectors of all the words present in the train, test, and dev sets. (Ideally we would keep all GloVe vectors, but that's unnecessary for our experiment.) Also, at training time, your model does see the UNK word (not all words in the training set are in the GloVe vocab!).
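
Concretely, the word-to-id lookup falls back to UNK for anything outside the vocab, so UNK is hit during training too. A rough sketch (illustrative; the actual helper in model/data_utils.py handles more cases):

def word_to_id(word, vocab):
    # Normalize: lowercase, and collapse digits to a single token.
    word = word.lower()
    if word.isdigit():
        word = "$NUM$"
    # Train-set words missing from the GloVe vocab fall back to UNK,
    # so the UNK embedding is trained, not just used at test time.
    return vocab.get(word, vocab["$UNK$"])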

@guillaumegenthial (Owner)

Also, if you use the IOBES tagging scheme and GloVe 6B, you should get results similar to the paper. I wrote a new version of the code that achieves higher scores: https://github.com/guillaumegenthial/tf_ner/
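
For anyone who wants to try IOBES, a minimal sketch of converting IOB2 tags (assuming well-formed input; this helper is illustrative, not part of the repo):

def iob_to_iobes(tags):
    # Single-token entities become S-, and the last token of a
    # multi-token entity becomes E-; everything else is unchanged.
    iobes = []
    for i, tag in enumerate(tags):
        if tag == "O":
            iobes.append(tag)
            continue
        nxt = tags[i + 1] if i + 1 < len(tags) else "O"
        continues = nxt == "I-" + tag[2:]
        if tag.startswith("B-"):
            iobes.append(tag if continues else "S-" + tag[2:])
        else:  # "I-"
            iobes.append(tag if continues else "E-" + tag[2:])
    return iobes

# e.g. ["B-PER", "I-PER", "O", "B-LOC"] -> ["B-PER", "E-PER", "O", "S-LOC"]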
