Help wanted - training sets / images and text description #7

Open
johndpope opened this issue Jan 7, 2021 · 7 comments

Comments

@johndpope

It’s possible we could use the Wikipedia extract (a ~40 GB offline file) and scrape the images/text from it. Unless MS COCO is easier, or ???

https://levelup.gitconnected.com/two-simple-ways-to-scrape-text-from-wikipedia-in-python-9ce07426579b

@abhi1nandy2

abhi1nandy2 commented Jan 7, 2021

https://www.w3resource.com/python-exercises/web-scraping/web-scraping-exercise-8.php - check this out. That example only extracts .jpg image links, but Wikipedia in general uses three formats - .jpg, .gif, .png. Use all three as regexes.
Sample -

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

html = urlopen(<URL OF WIKIPEDIA PAGE>)
bs = BeautifulSoup(html, 'html.parser')
# raw strings with escaped dots so each pattern matches a literal extension
images1 = bs.find_all('img', {'src': re.compile(r'\.jpg')})
images2 = bs.find_all('img', {'src': re.compile(r'\.gif')})
images3 = bs.find_all('img', {'src': re.compile(r'\.png')})
images = images1 + images2 + images3
for image in images:
    print(image['src'])

Please check if this works. Similarly, I guess you would be able to extract the image description by taking the next HTML element adjacent to the <a> of the image.
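
Something like this might work for the captions (untested sketch, continuing from the loop above; it assumes Wikipedia's usual thumbnail markup, where the caption sits in a <figcaption> or a div with class 'thumbcaption' inside the same container as the image - worth verifying against an actual page):

for image in images:
    # climb to the enclosing thumbnail container, then look for its caption
    container = image.find_parent(['figure', 'div'])
    caption = None
    if container is not None:
        # both selectors are guesses at Wikipedia's markup, not a fixed API
        caption = container.find('figcaption') or container.find('div', {'class': 'thumbcaption'})
    if caption is not None:
        print(image['src'], '->', caption.get_text(strip=True))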

@johndpope

johndpope commented Jan 7, 2021

I wonder if just using the alt text might work:
https://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style/Images

[[File:Siberian Husky pho.jpg|thumb|alt=A white dog in a harness playfully nuzzles a young boy |A [[Siberian Husky]] used as a pack animal]]
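
A rough, untested sketch of pulling the filename, alt text, and caption out of one of those [[File:...]] links (parse_file_link is just an illustrative name; finding complete links in a full dump would also need bracket-aware scanning, since captions can nest [[links]]):

import re

# flatten nested wiki links: [[A|B]] -> B, [[Siberian Husky]] -> Siberian Husky
NESTED_LINK = re.compile(r'\[\[(?:[^|\]]*\|)?([^\]]*)\]\]')

def parse_file_link(link):
    body = link[len('[[File:'):-2]      # drop '[[File:' and the closing ']]'
    body = NESTED_LINK.sub(r'\1', body)
    parts = [p.strip() for p in body.split('|')]
    filename = parts[0]
    alt = next((p[len('alt='):] for p in parts if p.startswith('alt=')), None)
    caption = parts[-1] if len(parts) > 1 else None
    return filename, alt, caption

sample = ('[[File:Siberian Husky pho.jpg|thumb|alt=A white dog in a harness '
          'playfully nuzzles a young boy |A [[Siberian Husky]] used as a pack animal]]')
print(parse_file_link(sample))
# -> ('Siberian Husky pho.jpg',
#     'A white dog in a harness playfully nuzzles a young boy',
#     'A Siberian Husky used as a pack animal')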

There are these data dumps here:
https://meta.wikimedia.org/wiki/Data_dump_torrents#English_Wikipedia

@abhi1nandy2

This seems a better 'alt'ernative :P. Have you tried using it?

@johndpope

No - I won't get to it; need help.
mega.nz gives you 12 GB of space, if someone manages to get anywhere and needs a place to host the data.

@josephcappadona

josephcappadona commented Jan 11, 2021

Here are 78k image-caption pairs (3.8 GB zipped) that I scraped recently as part of a project. A smaller sample of the data with ~1k pairs (only 58 MB, useful for testing) can be found here.

I'll take a look at extracting something out of the Wikipedia data (I also found this Wikimedia archive), but if someone gets to it before me then please do share 😉 There's also probably some stuff that could be extracted from sites like the-eye.eu and archive.org. There's data everywhere; it usually only needs a bit of cleaning.

@skywo1f

skywo1f commented Feb 4, 2021


I found it helps if you remove all .txt files with size zero. The VQ-VAE will then point out which images are bad (JPGs that weren't actually JPGs), and you can delete them.
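
For the zero-size files, a quick pathlib sketch (assuming the usual one-caption-.txt-per-image layout; 'data' is a placeholder for your dataset folder):

from pathlib import Path

# delete empty caption files so every remaining .txt actually has text in it
for txt in Path('data').rglob('*.txt'):
    if txt.stat().st_size == 0:
        txt.unlink()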

Also run

find . -name "*.jpg" -type f | xargs --no-run-if-empty identify -format '%f\n' 1>ok.txt 2>errors.txt

in your data folder to find bad JPGs. (Don't forget to delete your ok.txt and errors.txt files before you train.)

Another filter that will find "odd" images is

find . -type f \( -iname "*.jpg" -o -iname "*.jpeg" \) | xargs jpeginfo -c | grep -E "WARNING|ERROR" | cut -d " " -f 1
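
If you'd rather stay in Python, a Pillow-based check catches roughly the same bad files (untested sketch; 'data' is a placeholder path):

from pathlib import Path
from PIL import Image

for path in Path('data').rglob('*.jpg'):
    try:
        with Image.open(path) as im:
            im.load()        # force a full decode, not just a header read
    except Exception as exc:
        print(path, exc)     # candidate for deletion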
