wikimedia-transp


Dataset of transparent images from Wikimedia.

Requirements

  • Python 3.8+

License

TODO: determine the license based on the licensing of the images (I presume they will be CC, so CC0 if possible).

  • Note: this remains to be determined, and the dataset will probably be split up accordingly.

This library is MIT licensed (a permissive license).

Usage (TBD)

The intended usage is to provide a simple interface to a scraped dataset of images from Wikimedia.

  • Scraping of the 1% sample can be carried out as:

    from wikitransp.scraper import scrape_images
    scrape_images(sample=True)

    and the full dataset can be scraped instead by toggling the sample flag to False (see the sketch after this list).

    Currently this only filters the dataset by size and transparency (this step remains too slow to proceed further and will be parallelised with async HTTP requests).

    The filtered TSV (in future, the images themselves) is then saved to the default directory within the package (src/wikitransp/data/store); a save_dir argument to scrape_images() is planned but not yet implemented.

  • If you cancel after a while (KeyboardInterrupt) or the run finishes, you'll get a nice summary. For example, here is the summary after running scrape_images(sample=True):

  • Note: log event timings have become unreliable since moving to async (to be fixed later!)

        HaltFinished ⠶ Successful completion in 29 minutes and 30.99 seconds.
        LogNotify ⠶ See /home/louis/dev/wikitransp/src/wikitransp/logs/wit_v1.train.all-1percent_sample_PNGs_with_alpha.log for full log.
    
        BonVoyage ⠶ Thank you for scraping with Wikitransp :^)-|-<
        ----------------------------------------------------------
        Init                     : n=1
        CheckPng                 : n=8181
        InternalLogException     : n=15
        PrePngStreamAsyncFetcher : n=1
        FetchIteration           : n=8
        PngStream                : n=7615, μ=605.6515, min=0.1247, max=1753.2464
        PngSuccess               : n=7615
        PngDone                  : n=7615
        GarbageCollect           : n=7615, μ=0.0128, min=0.0048, max=0.0347
        BanURL                   : n=52
        RoutineException         : n=8
        HaltFinished             : n=1
        ----------------------------------------------------------
    

    BanURLException indicates a suggestion to add the URL to the banned URL list (usually due to a 404, but check these manually).

  • Data access will look something like this (I expect):

    from wikitransp.dataset import all_images, large_images, medium_images, small_images
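
Putting the above together, here is a minimal sketch of both scraping modes (the planned save_dir argument is omitted since it isn't implemented yet):

    from wikitransp.scraper import scrape_images

    # Scrape the 1% sample (quicker run)
    scrape_images(sample=True)

    # Scrape the full dataset by toggling the sample flag to False
    scrape_images(sample=False)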

Guide to filtering images

  • I'm currently filtering the dataset for PNGs with semitransparency rather than all PNGs with transparency: the latter would include 'stickers' with potentially only two alpha levels (0 or 255), which would presumably be less effective to train on for alpha decompositing (see the sketch below).
  • The dataset source is WIT, the Wikipedia Image-Text dataset from Google Research.
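
As a rough illustration of that criterion (not the actual filter used in the scraper, which checks the PNGs as they are fetched), a local PNG could be checked for intermediate alpha values with Pillow and NumPy along these lines; the has_semitransparency helper is hypothetical:

    from PIL import Image
    import numpy as np

    def has_semitransparency(path: str) -> bool:
        """Return True if the PNG has alpha values strictly between 0 and 255.

        Images whose alpha channel only contains 0 and/or 255 are 'stickers'
        (hard cut-outs) and would be excluded under this criterion.
        """
        with Image.open(path) as img:
            # Convert to RGBA so paletted/greyscale PNGs also expose an alpha channel
            alpha = np.asarray(img.convert("RGBA"))[:, :, 3]
        return bool(((alpha > 0) & (alpha < 255)).any())

A fully opaque image converts to an all-255 alpha channel and is rejected by the same test, so only genuinely semitransparent PNGs pass.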
