
Datasets in .pkl format? #1

Open
jazzsaxmafia opened this issue May 25, 2015 · 40 comments

Comments

@jazzsaxmafia

Hello, thank you for sharing this great project.

I would like to run the code, but it seems the project does not include the datasets it uses. I can download the Flickr or COCO datasets myself, but I do not know how the data was preprocessed into those .pkl files.

Can I possibly get the data as it is used in the project?

Thank you.

@kelvinxu
Owner

Hey, thanks for your question. Unfortunately, the preprocessed datasets are still quite large, so we have no resources to host all of them at the moment. What we can do, however, is add some preprocessing instructions so that you can extract the same features using an open source tool. We will try to do so in the next few days.

@jnhwkim

jnhwkim commented May 27, 2015

@jazzsaxmafia @kelvinxu I've also encountered the same problem.

@leo-zhou

@kelvinxu Could you provide some basic information about the .pkl files? For example, what they contain and how they are structured. Thank you very much. Preprocessing instructions would be even nicer, if they won't take too long.

@jnhwkim

jnhwkim commented May 28, 2015

@leo-zhou Just for reference, here is my tentative guess:

data.pkl -> two sequential dumps: first cap, then feat
cap -> [[sentence, feature #], ...]
feat -> scipy.sparse.csr_matrix(vgg_conv4) with shape [N, L x D]

dictionary.pkl -> {word: word #, ...}
(word # starts from 2 (= frequency rank + 1); 0 and 1 are reserved by the program.)

Updated based upon other comments.

@kyunghyuncho
Collaborator

@leo-zhou
@jnhwkim is correct, except that feat is saved as a sparse matrix of shape [N, 14 * 14 * 512]. In the dictionary (dictionary.pkl), 0 and 1 are reserved for the end-of-caption token and an unknown word.
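If it helps, the two-dump layout described above can be sketched with toy data (the captions, shapes, and filename below are made up for illustration, not the real dataset):

```python
import pickle  # cPickle under Python 2
import numpy as np
import scipy.sparse as sp

# Toy stand-ins: two captions pointing at one image; the real features
# are conv5 activations of dimension 14 * 14 * 512 = 100352 per image.
cap = [["a dog runs", 0], ["a cat sits", 0]]
feat = sp.csr_matrix(np.zeros((1, 14 * 14 * 512), dtype=np.float32))

# data.pkl holds two sequential dumps: the captions first, then the features.
with open("data.pkl", "wb") as f:
    pickle.dump(cap, f, protocol=pickle.HIGHEST_PROTOCOL)
    pickle.dump(feat, f, protocol=pickle.HIGHEST_PROTOCOL)

# Reading mirrors the two dumps in the same order.
with open("data.pkl", "rb") as f:
    cap_loaded = pickle.load(f)
    feat_loaded = pickle.load(f)
```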

@jnhwkim

jnhwkim commented May 28, 2015

@kyunghyuncho Oh, I got it. So that's why you've used ff.todense() in coco.py:38. Thanks!

@leo-zhou

@jnhwkim @kyunghyuncho Thanks a lot !

@jazzsaxmafia
Author

Thank you very much. I think that was enough for me to set up the data myself.

@kelvinxu
Owner

@jnhwkim, one very minor addition to prevent confusion: dictionary.pkl doesn't load a list but a Python dictionary of the form {word: word #}. This is probably what you meant. Thanks!

@jnhwkim

jnhwkim commented May 28, 2015

@kelvinxu Yes, you're right. To prevent confusion, I'll update my comment.

@samim23

samim23 commented Jul 23, 2015

Any news on the preprocessing instructions, or even an upload of the preprocessed datasets? Great library, but a bit more documentation would be welcome.

@kelvinxu
Owner

Hey samim23,

The feature extraction procedure is described in the paper (you should extract conv5_4), but I agree that it should also be reproduced somewhere here in the repo.

@asampat3090

Has anyone gotten the dataset conversion working? If so, it would be great if you could share the code. Will be trying this myself as well.

@cxj273

cxj273 commented Sep 17, 2015

@asampat3090 I saw you have implemented the dataset conversion code. Can you reproduce the results in Kelvin's paper? Thanks.

@ffmpbgrnn

Hey guys, anyone succeeded in generating the pkl file? Any link would be very helpful! Thank you.

@asampat3090

@cxj273 I haven't actually tried; I'll try this weekend. @ffmpbgrnn check out my code - I have a generator for flickr_30k, though I haven't documented it much.

@ffmpbgrnn

@asampat3090 I will have a look. Many thanks! :-)

@flipvrijn

@asampat3090 Would your code actually work, though? The image ids here refer to the whole image collection, whereas here you index into a subset of the image features using an index that is meant for the whole collection. Or am I missing something?

I'm trying to port your code to the COCO dataset.

@cxj273

cxj273 commented Oct 10, 2015

@asampat3090 From my understanding, line 54 is wrong: you can't get all the training captions using the training image idx. Correct me if I'm wrong.

@xlhdh

xlhdh commented Nov 4, 2015

Hi, can I ask how large those .pkl files are? I tried to make them for the MSCOCO dataset, and the features from VGG for the training set alone take around 75GB. I stored them in scipy.sparse.csr_matrix. According to coco.py, it seems they all get loaded into memory together, so I was wondering if there is anything I was missing...

@kelvinxu
Owner

kelvinxu commented Nov 4, 2015

@xlhdh It should be around 15 GB. They are all loaded into memory at once, but we unsparsify them one batch at a time. Are you unsparsifying them all at once?
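A rough sketch of that batching pattern, with toy shapes standing in for the real ~15 GB matrix (this is illustrative, not the repo's actual loader):

```python
import numpy as np
import scipy.sparse as sp

# Toy sparse feature matrix standing in for the real one.
N, D = 10, 14 * 14 * 512
feat = sp.csr_matrix((N, D), dtype=np.float32)

batch_size = 4
for start in range(0, N, batch_size):
    idx = list(range(start, min(start + batch_size, N)))
    # Only this slice is densified; the full matrix stays sparse in memory.
    batch = np.asarray(feat[idx].todense())
```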

@xlhdh

xlhdh commented Nov 4, 2015

@kelvinxu The original weights were around 15 GB, but once I pickled them they got to around 75 GB, and they were csr_matrix from top to toe. I'll look at it again to see if there's a bug!

@kyunghyuncho
Collaborator

It's likely because you didn't use protocol=cPickle.HIGHEST_PROTOCOL as an argument to cPickle.dump.

- K

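The size difference is easy to demonstrate. In this toy comparison (made-up matrix; exact byte counts vary), protocol 0 (the old cPickle default) blows up a sparse matrix relative to the binary protocol:

```python
import pickle
import numpy as np
import scipy.sparse as sp

mat = sp.csr_matrix(np.random.rand(100, 1000).astype(np.float32))

# Protocol 0 serializes array payloads as escaped ASCII text;
# the highest binary protocol stores the raw bytes.
ascii_bytes = len(pickle.dumps(mat, protocol=0))
binary_bytes = len(pickle.dumps(mat, protocol=pickle.HIGHEST_PROTOCOL))
```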

@xlhdh

xlhdh commented Nov 6, 2015

@kyunghyuncho Thank you, I used the highest protocol (I had thought that was the default) and it worked! The only thing I wasn't able to do was dump the image features to disk all at once, so I had to read several files and assemble them in memory.

@asampat3090

@cxj273 @gamer13 sorry for the delay; I'm not sure I quite understood the issue. I suppose there might be a mismatch between the "features" and "caps" variables in "prepare_data" here, but if I understand correctly, you're saying we would need to re-index all of the image ids? If so, did you have any success doing that? I'm still trying to figure it out.

UPDATE: I believe I have reindexed it so that the features are referenced properly. Does anyone else have working code?

@intuinno

@asampat3090 Thank you for sharing your script. I had trouble running this model and your code was very helpful. I am still struggling, but here are my suggestions for your code.

Suggestions
  • It seems like the capgen.py train function requires the dictionary to contain 'A', which means your vectorizer code should use the following options:
    • vectorizer = CountVectorizer(analyzer=str.split, lowercase=False).fit(captions)
    • Or you could lowercase the sentences in the caps file, which would be a better approach for the data-sparseness problem.
  • You used conv5_3 for feature extraction; however, according to the paper, conv5_4 gives better features.

Thanks.

@frajem

frajem commented Jan 28, 2016

@kyunghyuncho , @kelvinxu
Hello,
Using the default parameters for COCO with soft attention, I get much lower results on the test set than what was published: BLEU-1=0.545, METEOR=0.164, CIDEr=0.274.
The only difference I see is that early stopping is done on NLL. Can this cause such big gaps?
Also, my coco_align.train.pkl is about 6 GB, not 15 GB.
Thanks!

@rayz0620

I noticed that in the function prepare_data(), line 40 of flickr30k.py, the code sets all words with an id larger than n_words to 1 (UNK). Therefore, when we create the dictionary, we should assign ids in descending order of word frequency, i.e. smaller ids for more frequent words.
@asampat3090 In your make_flickr_data.py you used CountVectorizer from scikit-learn, which assigns word ids in occurrence order. This might be why there are so many UNKs in the training data.
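A small sketch of that id scheme (toy captions; `Counter.most_common` gives the descending-frequency order, and ids start at 2 because 0 and 1 are reserved):

```python
from collections import Counter

captions = ["a dog runs", "a dog sits", "a cat sits"]

# Count word frequencies over all captions, then assign ids by descending
# frequency so that truncating at n_words keeps the most frequent words.
# Ids start at 2: 0 marks the end of a caption, 1 marks the unknown word.
counts = Counter(w for c in captions for w in c.split())
worddict = {w: i + 2 for i, (w, _) in enumerate(counts.most_common())}
```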

@frajem

frajem commented Feb 3, 2016

Yes, the dictionary has ids in descending frequency order.
Any idea why I'm getting much lower metrics on COCO (see my comment above)?
Thanks.

@rowanz

rowanz commented Feb 16, 2016

Hey all, I've created a script that appears to work for preprocessing. The source is
here. It does everything besides create the word-ID dictionary.

@frajem

frajem commented Feb 16, 2016

Thanks @rowanz
What metric values do you get, for example for COCO with soft attention?

@intuinno

Hello, thank you @rowanz. I also struggled with the preprocessing and created a repo for anybody who needs it. You can check it out here

@ericclei

Hi @intuinno, I'm trying to run your prepare_caffe_and_dictionary_coco.ipynb. Could you please explain what the file dataset_coco.json is?

@Lorne0 Lorne0 mentioned this issue May 7, 2016
@Lorne0

Lorne0 commented May 13, 2016

I forked @intuinno's work and added some code and a simple doc in README.md (no need for dataset_coco.json).
https://github.com/Lorne0/arctic-captions
Hope it's helpful.

@Litchiware

Just run this one-line script to generate dictionary.pkl:

```shell
cat flickr8k/Flickr8k_text/Flickr8k.token.txt | awk -F '\t' '{print $2}' | awk '{for(i=1;i<=NF;i++) print $i}' | sort | uniq -c | sort -nr | awk '{print $2,NR+1}' | python -c "import sys; import cPickle as pkl; pkl.dump(dict((w, int(i)) for w, i in (line.strip('\n').split(' ') for line in sys.stdin)), open('features/dictionary.pkl', 'wb'))"
```

(The `awk '{print $2,NR+1}'` step assigns ids starting at 2 in descending frequency order; the `int(i)` cast stores numeric ids, whereas the original one-liner stored them as strings, which integer comparisons downstream would likely choke on.)

@athenspeterlong

Hello @Lorne0, thank you so much for your code; it helped me a lot in reproducing the project.
Looking into the code, I have a question: in preprocess.sh, why crop to 224x224 instead of staying at 256x256?
Thanks

@vyouman

vyouman commented May 22, 2016

Hi, @athenspeterlong. Because the pretrained CNN requires 224*224 input, we have to crop the images first before feeding them to the CNN.
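For what it's worth, a minimal center-crop sketch in NumPy (the repo's actual preprocessing may resize or crop differently):

```python
import numpy as np

def center_crop(img, size=224):
    """Center-crop an H x W x C image array to size x size, the input
    shape the pretrained VGG network expects."""
    h, w = img.shape[:2]
    top = (h - size) // 2
    left = (w - size) // 2
    return img[top:top + size, left:left + size]

img = np.zeros((256, 256, 3), dtype=np.uint8)  # toy 256x256 image
cropped = center_crop(img)
```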

@dipanjan06

Hi @intuinno, thank you for sharing the preprocessing code. I am using the Flickr8k dataset and was able to build the necessary .pkl files and dictionary using prepare_flickr8k.py.
Now I am trying to run the train function via evaluate_flickr8k.py, but I am getting "coo_matrix object does not support indexing" at flickr8k.py line 16.

Any idea why this is happening?

Thanks
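In case it helps, the usual fix is to convert the matrix before indexing; a toy sketch (not the repo's actual data):

```python
import numpy as np
import scipy.sparse as sp

feat = sp.coo_matrix(np.eye(4, dtype=np.float32))

# coo_matrix does not support row indexing; convert to CSR (or CSC) first.
row = feat.tocsr()[0]
```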

@jetsmith

jetsmith commented Sep 8, 2016

@Lorne0, I tried to reproduce your results using your code. When I run prepare_model_coco.py, I get this error:

val (5000, 100352)
train (5000, 100352)
Traceback (most recent call last):
  File "prepare_model_coco.py", line 70, in
    result = np.empty((numImage, 100352))
MemoryError

I don't know why it happens.
thanks
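One way around the allocation, sketched with hypothetical shapes (the real numImage for COCO is far larger), is a disk-backed memmap written one chunk at a time instead of one huge in-memory array:

```python
import numpy as np

# np.empty((numImage, 100352)) in float64 needs ~0.8 MB per image, i.e.
# tens of GB of RAM for the full dataset. A memmap keeps it on disk.
numImage, D = 50, 14 * 14 * 512
result = np.lib.format.open_memmap("feats.npy", mode="w+",
                                   dtype=np.float32, shape=(numImage, D))
result[0] = 1.0  # rows can be written one image (or one batch) at a time
result.flush()
```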

@xxxyyyzzzz

@rowanz Thanks for your preprocessing code using .pkl files. In your code I can see that the training images include the restval set as well. Is that what the author @kelvinxu recommends? Can you clarify the reason behind it?
It would be really helpful. Thanks in advance
