Help wanted - training sets / images and text description #7

Open
johndpope opened this issue Jan 7, 2021 · 7 comments

Comments

@johndpope

It’s possible we could use the Wikipedia extract (a ~40 GB offline file) and scrape the images/text from it. Unless MS COCO is easier, or ???

https://levelup.gitconnected.com/two-simple-ways-to-scrape-text-from-wikipedia-in-python-9ce07426579b

@abhi1nandy2

abhi1nandy2 commented Jan 7, 2021

https://www.w3resource.com/python-exercises/web-scraping/web-scraping-exercise-8.php - check this out. That example only extracts .jpg image links, but Wikipedia in general uses three formats - .jpg, .gif, .png. Use all three as regexes.
Sample -

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

html = urlopen(<URL OF WIKIPEDIA PAGE>)
bs = BeautifulSoup(html, 'html.parser')
# raw strings with escaped dots so each pattern matches a literal extension
images1 = bs.find_all('img', {'src': re.compile(r'\.jpg')})
images2 = bs.find_all('img', {'src': re.compile(r'\.gif')})
images3 = bs.find_all('img', {'src': re.compile(r'\.png')})
images = images1 + images2 + images3
for image in images:
    print(image['src'])

Please check if this works. Similarly, I guess you would be able to extract the image description by taking the next HTML element adjacent to the <a> of the image.
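
Something like this might work for the captions (untested sketch, continuing from the loop above; it assumes Wikipedia's usual thumbnail markup, where the caption sits in a <figcaption> or a div with class 'thumbcaption' inside the same container as the image - worth verifying against an actual page):

for image in images:
    # climb to the enclosing thumbnail container, then look for its caption
    container = image.find_parent(['figure', 'div'])
    caption = None
    if container is not None:
        # both selectors are guesses at Wikipedia's markup, not a fixed API
        caption = container.find('figcaption') or container.find('div', {'class': 'thumbcaption'})
    if caption is not None:
        print(image['src'], '->', caption.get_text(strip=True))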

@johndpope

johndpope commented Jan 7, 2021

I wonder if just using the alt text might work:
https://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style/Images

[[File:Siberian Husky pho.jpg|thumb|alt=A white dog in a harness playfully nuzzles a young boy |A [[Siberian Husky]] used as a pack animal]]
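
A rough, untested sketch of pulling the filename, alt text, and caption out of one of those [[File:...]] links (parse_file_link is just an illustrative name; finding complete links in a full dump would also need bracket-aware scanning, since captions can nest [[links]]):

import re

# flatten nested wiki links: [[A|B]] -> B, [[Siberian Husky]] -> Siberian Husky
NESTED_LINK = re.compile(r'\[\[(?:[^|\]]*\|)?([^\]]*)\]\]')

def parse_file_link(link):
    body = link[len('[[File:'):-2]      # drop '[[File:' and the closing ']]'
    body = NESTED_LINK.sub(r'\1', body)
    parts = [p.strip() for p in body.split('|')]
    filename = parts[0]
    alt = next((p[len('alt='):] for p in parts if p.startswith('alt=')), None)
    caption = parts[-1] if len(parts) > 1 else None
    return filename, alt, caption

sample = ('[[File:Siberian Husky pho.jpg|thumb|alt=A white dog in a harness '
          'playfully nuzzles a young boy |A [[Siberian Husky]] used as a pack animal]]')
print(parse_file_link(sample))
# -> ('Siberian Husky pho.jpg',
#     'A white dog in a harness playfully nuzzles a young boy',
#     'A Siberian Husky used as a pack animal')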

There are these data dumps here:
https://meta.wikimedia.org/wiki/Data_dump_torrents#English_Wikipedia

@abhi1nandy2

This seems a better 'alt'ernative :P. Have you tried using it?

@johndpope

No - I won't get to it; need help.
mega.nz gives you 12 GB of space, if someone manages to get anywhere and needs a place to host the data.

@josephcappadona

josephcappadona commented Jan 11, 2021

Here are 78k image-caption pairs (3.8 GB zipped) that I scraped recently as part of a project. A smaller sample of the data with ~1k pairs (only 58 MB, useful for testing) can be found here.

I'll take a look at extracting something out of the Wikipedia data (I also found this Wikimedia archive), but if someone gets to it before me then please do share 😉 There's also probably some stuff that could be extracted from sites like the-eye.eu and archive.org. There's data everywhere; it usually only needs a bit of cleaning.

@skywo1f

skywo1f commented Feb 4, 2021


I found it helps if you remove all .txt files with size zero. The VQ-VAE will then point out which images are bad (JPGs that weren't actually JPGs), and you can delete them.
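
For the zero-size files, a quick pathlib sketch (assuming the usual one-caption-.txt-per-image layout; 'data' is a placeholder for your dataset folder):

from pathlib import Path

# delete empty caption files so every remaining .txt actually has text in it
for txt in Path('data').rglob('*.txt'):
    if txt.stat().st_size == 0:
        txt.unlink()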

Also run

find . -name "*.jpg" -type f | xargs --no-run-if-empty identify -format '%f\n' 1>ok.txt 2>errors.txt

in your data folder to find bad JPGs. (Don't forget to delete your ok.txt and errors.txt files before you train.)

Another filter that will find "odd" images is

find . -type f \( -iname "*.jpg" -o -iname "*.jpeg" \) | xargs jpeginfo -c | grep -E "WARNING|ERROR" | cut -d " " -f 1
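
If you'd rather stay in Python, a Pillow-based check catches roughly the same bad files (untested sketch; 'data' is a placeholder path):

from pathlib import Path
from PIL import Image

for path in Path('data').rglob('*.jpg'):
    try:
        with Image.open(path) as im:
            im.load()        # force a full decode, not just a header read
    except Exception as exc:
        print(path, exc)     # candidate for deletion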
