A lot of the links from the training/validation set do not exist or cannot be read. #5

karansomaiah · 2018-10-12T01:04:09Z

Hey!
I recently started working on the competition. Thank you so much to the Google AI Research team for open-sourcing data sets like this for us to use and learn from.
While going through the data set (both train and validation), I found that some of the links do not exist or for some reason cannot be read. A lot of URLs can be requested, but the response cannot be read using Image from the Pillow package in Python. I will post my script and some of its output to give everyone a better idea of what I am seeing. Please feel free to correct me if I am doing something wrong. I hope this helps anyone else facing these errors.

I am using the requests library to fetch the URLs and the Pillow library in Python to open the responses as images.
The code is very primitive since I am still in the exploration stage, so for now it only prints and counts the links that cannot be read.

from PIL import Image
import requests

train_file = 'data/Train%2FGCC-training.tsv'  # train file
with open(train_file, 'r') as f:
    train_read = f.readlines()

sample_train = train_read[:10000]

# Each line is "caption<TAB>url"; map url -> caption
train_map = {}
for line in sample_train:
    fields = line.rstrip("\n").split("\t")
    train_map[fields[1]] = fields[0]
links = list(train_map)

not_read = 0  # count of images that could not be read

# loop over the links and read whichever possible
for link in links:
    try:
        # timeout keeps one dead host from hanging the whole loop
        response = requests.get(link, stream=True, timeout=10)
        response.raise_for_status()  # treat 4xx/5xx responses as failures
        im = Image.open(response.raw)
    except Exception:  # not a bare except, so Ctrl+C still stops the run
        print(link)
        not_read += 1
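To see why links fail, and not only how many, the loop above could also record the type of each exception. The following is a minimal sketch under that assumption, not the script that was actually run: the names `fetch_and_decode` and `tally_failures` are hypothetical, and the 10-second timeout is an arbitrary choice. It reads the whole body into memory and calls `load()` so that truncated or non-image responses are caught as well.

```python
import io
from collections import Counter


def fetch_and_decode(link, timeout=10):
    """Download one URL and fully decode it; raises on any failure."""
    # Imported lazily so the tallying logic below can be used without them.
    import requests
    from PIL import Image

    response = requests.get(link, timeout=timeout)
    response.raise_for_status()  # 4xx/5xx responses become exceptions
    image = Image.open(io.BytesIO(response.content))
    image.load()  # force a full decode, not just a header read


def tally_failures(links, fetch=fetch_and_decode):
    """Return (failed_links, Counter of exception class names)."""
    failed = []
    reasons = Counter()
    for link in links:
        try:
            fetch(link)
        except Exception as exc:  # keep going, but remember why it failed
            failed.append(link)
            reasons[type(exc).__name__] += 1
    return failed, reasons
```

Calling `tally_failures(links)` on the sampled links would then give both the list of unreadable URLs and a breakdown by failure type (for example connection errors versus responses that are not valid images).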

Here are some of the links that did not work.

https://cdn.mantelligence.com/wp-content/uploads/2017/08/Questions-to-Ask-a-Girl-to-Get-to-Know-Her-What-do-you-want-most-out-of-life.jpg
http://duro6.com/weather/images/gallery3_lightning_rainbow_shot.jpg
http://image.dailyfreeman.com/storyimage/DF/20170505/NEWS/170509808/AR/0/AR-170509808.jpg&maxh=400&maxw=667
https://cdn.bravehunters.com/wp-content/uploads/2017/09/Guide-to-Living-in-a-Tent-800x416.jpg
http://www.saltandpinephoto.com/wp-content/uploads/2016/06/Bride-and-Groom-Walking-through-the-Forest.jpg
http://blog.visitmo.com/wp-content/uploads/2014/03/12506026093_092d091fc2_b.jpg
https://lynismael.com/wp-content/uploads/2014/07/Belwood-Lake-Conservation-wedding-sara-ayron-_0011(pp_w768_h534).jpg
http://www.eurasianet.org/sites/default/files/imagecache/galleria_fullscreen/060613_0.jpg
https://www.bailiwickexpress.com/files/cache/88ec9331c05013c55b49024a551341ac_f587432.jpg
https://i2-prod.mirror.co.uk/incoming/article1443634.ece/ALTERNATES/s615/%C2%A3%C2%A3%C2%A3%20%20Police%20car%20driving%20straight%20into%20a%20road%20of%20freshly%20layed%20cement
http://www.nerjarob.com/nature/wp-content/uploads/Cormorants-in-tree-sized.jpg
http://grantbaldwin.com/wp-content/uploads/2015/11/ScottVoelker.jpeg
https://drawinglics.com/view/186698/how-to-draw-flowers-and-leaves-in-a-vase-9-steps-with-pictures-image-titled-draw-flowers-and-leaves-in-a-vase-step-9bullet1.jpg

From a sample of 10000, I was able to get at least 51 links that did not work.
Looking forward to hearing more from you guys.
Thanks!

sharma-piyush · 2018-10-24T00:14:03Z

Hi,
The owner of an image might choose to remove it at any time, so we do expect to lose some train/dev images over time. But that should be a very small fraction (approx. 0.5% in your case). Given that we have over 3M images for training, this should not be a problem.
However, the test set for Conceptual Captions (hosted in the competition server) is fixed and will not vary over time.
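The figures in this thread can be checked with a little arithmetic. The sketch below assumes the 10,000-line sample is representative of the whole file (it was the first 10,000 lines, not a random draw) and takes "over 3M" as exactly 3,000,000 purely for illustration; the variable names are made up for this example.

```python
failed_in_sample = 51          # unreadable links reported above
sample_size = 10_000           # lines sampled from the training file
training_set_size = 3_000_000  # assumption: "over 3M" rounded down

# Fraction of the sample that could not be read
failure_rate = failed_in_sample / sample_size

# Scale the same rate up to the full training set (integer arithmetic)
expected_unreadable = failed_in_sample * training_set_size // sample_size
expected_remaining = training_set_size - expected_unreadable

print(f"failure rate: {failure_rate:.2%}")
print(f"expected unreadable images: {expected_unreadable:,}")
print(f"expected readable images: {expected_remaining:,}")
```

Under these assumptions the loss is roughly half a percent, which still leaves close to three million readable training images at the time of the report.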

karansomaiah · 2018-10-25T21:44:57Z

Hi!
Thank you so much for your response. It's awesome that you have everything covered on your side. I look forward to some amazing insights and results from this dataset.
Thanks again for clearing that up.

Gyubin · 2020-04-28T03:16:50Z

Hi, @sharma-piyush
I tried to download the whole CC dataset using the VL-BERT authors' script.
But I could get only 630k images, which is about 20% of the total.
Is there any way to download all 3.3M images for research purposes?
Thanks for reading.

Best regards,
Gyubin Son

gsrivas4 · 2020-09-09T06:10:58Z

I also tried downloading with the VL-BERT script, and I could only download 340k images. @Gyubin, were you able to download the majority of the images? If you have a link to the 630k images that you downloaded, that would be great.

sharma-piyush closed this as completed Oct 24, 2018
