Hey!
I recently started working on the competition, and thank you so much to the Google AI Research team for open-sourcing such datasets for us to use and learn from.
While going through the dataset (both train and validation), it seems some of the links no longer exist or for some reason cannot be read. Many URLs can be parsed but cannot be opened with Image from the pillow package in Python. I will post some of the scripts and output below to give everyone a better idea of what I am seeing. Please feel free to correct me if I am doing something wrong. I hope this also helps anyone else facing these errors.
I am using the requests library to fetch the URLs and the pillow library to decode the images.
The code is very primitive since I am still in the exploration stage.
from PIL import Image
import requests

train_file = 'data/Train%2FGCC-training.tsv'  # path to the training TSV

with open(train_file, 'r') as f:
    train_read = f.readlines()

sample_train = train_read[:10000]

# each TSV line is: caption <tab> image URL; map URL -> caption
train_map = {
    line.split("\t")[1].rstrip("\n"): line.split("\t")[0] for line in sample_train
}
links = list(train_map)

not_read = 0  # keep a count of images that were not possible to read

# loop over the links and read whichever are possible
for link in links:
    try:
        im = Image.open(requests.get(link, stream=True, timeout=10).raw)
    except Exception:
        print(link)
        not_read += 1
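For anyone extending this, a slightly more defensive sketch of the same check (the timeout value and the helper name fetch_image are my own choices, not from the dataset docs) that treats HTTP errors and undecodable image bytes the same way, instead of a bare except:

```python
import io

import requests
from PIL import Image


def fetch_image(url, timeout=10):
    """Try to download and decode one image; return a PIL Image or None."""
    try:
        resp = requests.get(url, timeout=timeout)
        resp.raise_for_status()  # treat 4xx/5xx responses as failures
        return Image.open(io.BytesIO(resp.content))
    except (requests.RequestException, OSError):
        # OSError also covers PIL's UnidentifiedImageError and truncated data
        return None
```

Reading the full body into a BytesIO (rather than passing the raw stream) makes the decode failure happen here, where it can be caught, instead of later when the image is first used.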
Hi,
The owner of an image might choose to remove it at any time, so we do expect to lose some train/dev images over time. But that should be a very small fraction (approximately 0.5% in your case). Given that we have over 3M images for training, this should not be a problem.
However, the test set for Conceptual Captions (hosted in the competition server) is fixed and will not vary over time.
Hi!
Thank you so much for your response. It's great that you have everything covered on your side. Looking forward to some amazing insights and results from this dataset.
Thanks for clearing that up once again.
Hi, @sharma-piyush
I tried to download the whole CC dataset using the VL-BERT author's script,
but I could only get about 630k images, which is roughly 20% of the total.
Is there any way to download all 3.3M images for research purposes?
Thanks for reading.
I also tried downloading with the VL-BERT script, and I could only get about 340k images. @Gyubin, were you able to download the majority of the images? If you have a link to the 630k images you downloaded, that would be great.
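Not a substitute for the VL-BERT script, but a minimal parallel-fetch sketch of the bulk download (the worker count, timeout, and function names are my own choices) that simply skips URLs that fail rather than stopping, which is what a full 3.3M-image crawl needs:

```python
import concurrent.futures
import io

import requests
from PIL import Image


def try_fetch(url, timeout=10):
    """Return (url, bytes) on success, (url, None) on any failure."""
    try:
        resp = requests.get(url, timeout=timeout)
        resp.raise_for_status()
        Image.open(io.BytesIO(resp.content))  # sanity-check that it decodes
        return url, resp.content
    except (requests.RequestException, OSError):
        return url, None


def fetch_all(urls, workers=16):
    """Fetch URLs in parallel; return a dict of url -> bytes for successes."""
    out = {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        for url, data in pool.map(try_fetch, urls):
            if data is not None:
                out[url] = data
    return out
```

With dead links scattered through the TSV, counting the difference between len(urls) and len(fetch_all(urls)) gives the loss rate directly.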
Here are some of the links that did not work.
Out of a sample of 10,000 links, at least 51 could not be read.
Looking forward to hearing more from you guys.
Thanks!