wikimedia-transp

Badges: Documentation · CI Status · Coverage · Checked with mypy · Code style: black

Dataset of transparent images from Wikimedia.

Requirements

  • Python 3.8+
  • cuDF (GPU-accelerated replacement for pandas)

To install:

# Create and activate a conda environment, install cuDF from the RAPIDS channels, then install this package
conda create -n wikitransp
conda activate wikitransp
conda install -c rapidsai -c nvidia -c numba -c conda-forge cudf=21.08 python=3.8 cudatoolkit=11.2
pip install .
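
As a quick smoke test that cuDF installed correctly (a minimal sketch, not part of this package; it assumes a CUDA-capable GPU is visible to the environment):

    # Minimal cuDF smoke test: build a tiny DataFrame on the GPU and summarise it
    import cudf

    df = cudf.DataFrame({"width": [640, 1280, 1920], "height": [480, 720, 1080]})
    print(df.describe())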

License

TODO: determine the dataset license based on the licensing of the images (presumably CC, so CC0 if possible).

  • Note: this remains to be determined, and the dataset will probably be split up accordingly.

This library is MIT licensed (a permissive license).

Usage (TBD)

The intended usage is to provide a simple interface to a scraped dataset of images from Wikimedia.

  • Scraping of the 1% sample can be carried out as:

    from wikitransp.scraper import scrape_images
    scrape_images(sample=True)

    The full dataset can be scraped instead by setting the sample flag to False.

    Currently this only filters the dataset by size and transparency (this step is still too slow to go further, and will be parallelised with async HTTP requests).

    This saves the filtered TSV (in future, the images themselves) to the default directory within the package (src/wikitransp/data/store); alternatively, a save_dir argument could be passed to scrape_images() [not implemented for now]. A sketch of the full-dataset call appears at the end of this section.

  • If you cancel after a while (KeyboardInterrupt) or the run finishes, you'll get a nice summary; for example, here is the summary after the above scrape_images(sample=True) call:

  • Note: log event timings have become unreliable after moving to async (to fix later!)

        HaltFinished ⠶ Successful completion in 29 minutes and 30.99 seconds.
        LogNotify ⠶ See /home/louis/dev/wikitransp/src/wikitransp/logs/wit_v1.train.all-1percent_sample_PNGs_with_alpha.log for full log.
    
        BonVoyage ⠶ Thank you for scraping with Wikitransp :^)-|-<
        ----------------------------------------------------------
        Init                     : n=1
        CheckPng                 : n=8181
        InternalLogException     : n=15
        PrePngStreamAsyncFetcher : n=1
        FetchIteration           : n=8
        PngStream                : n=7615, μ=605.6515, min=0.1247, max=1753.2464
        PngSuccess               : n=7615
        PngDone                  : n=7615
        GarbageCollect           : n=7615, μ=0.0128, min=0.0048, max=0.0347
        BanURL                   : n=52
        RoutineException         : n=8
        HaltFinished             : n=1
        ----------------------------------------------------------
    

    BanURLException indicates a suggestion to add URLs to the banned URL list (usually due to a 404, but check these manually).

  • Data access will look something like this (I expect):

    from wikitransp.dataset import all_images, large_images, medium_images, small_images
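
If that interface materialises, iterating one of those splits might look something like the sketch below. Note this is speculative: the record attributes (url, width, height) and the idea that each split is an iterable of records are assumptions for illustration, not the implemented API.

    # Hypothetical usage sketch (not the implemented API): the attribute names
    # below (url, width, height) are placeholders for illustration only
    from wikitransp.dataset import medium_images

    for image in medium_images:
        if image.width >= 512:
            print(image.url, image.width, image.height)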
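
For completeness, the full-dataset call referred to in the first bullet above is just the same function with the sample flag flipped; the save_dir override is shown commented out since it is not implemented yet.

    from wikitransp.scraper import scrape_images

    # Scrape the full dataset rather than the 1% sample
    scrape_images(sample=False)

    # Once a save_dir argument is implemented, the default output location
    # (src/wikitransp/data/store) could be overridden, e.g.:
    # scrape_images(sample=False, save_dir="/path/to/output")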

Guide to filtering images

  • I'm currently filtering the dataset for PNGs with semitransparency, rather than any PNG with transparency: the latter would include 'stickers' with potentially only two alpha levels (0 or 255), which would presumably be less effective to train on for alpha decompositing (see the sketch after this list).
  • The dataset source is WIT, the Wikipedia Image-Text dataset from Google Research.
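
To illustrate the semitransparency criterion described above, here is a hedged sketch using Pillow (not the package's own filtering code): a PNG passes if its alpha channel contains values strictly between 0 and 255.

    # Sketch of the criterion only, not wikitransp's actual filter:
    # a PNG is "semitransparent" if any alpha value lies strictly between 0 and 255
    from PIL import Image

    def has_semitransparency(path: str) -> bool:
        with Image.open(path) as img:
            if img.mode != "RGBA":
                img = img.convert("RGBA")
            alpha = img.getchannel("A")
            return any(0 < value < 255 for value in alpha.getdata())

    # e.g. has_semitransparency("some_image.png") -> True for soft/feathered alpha,
    # False for fully opaque images or binary 0/255 "sticker" masks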