## CURSE OF DIMENSIONALITY
**Philipp Schmitt, 2020**

This notebook contains a workflow to download diagrams from www.arxiv.org, an open-access archive of research papers popular in the machine learning community.

**Careful: re-running this notebook will overwrite the diagrams/ folder and image-list.js file, altering the work.**

## Scraping Diagrams

In [42]:
import requests
from xml.etree import ElementTree
import os
import urllib.request
import tarfile
import glob
import shutil
from PIL import Image, ImageStat
from wand.image import Image as WandImage
import wand.color

In [53]:
# Define a query for the arxiv.org search. A few examples:
# "all:latent+space"
# "all:curse+of+dimensionality"
# "all:high+dimensional+space"
# "1209.4915" (This one will return just a single paper. Useful for testing)
query = "all:'curse of dimensionality'"

# Set http request params
start = 0
max_results = 100
url = "http://export.arxiv.org/api/query?search_query=%s&start=%d&max_results=%d" % (query, start, max_results)

# Make request and store response xml tree
response = requests.get(url)
tree = ElementTree.fromstring(response.content)

print('entries found:', len(tree.findall('{http://www.w3.org/2005/Atom}entry')))

entries found: 100
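
The cell above interpolates the query directly into the URL and reads the paper IDs later, inside the download loop. A minimal sketch of the same two steps as small functions, assuming you would rather let `urllib.parse.urlencode` escape the query. The names `build_query_url` and `entry_ids` and the sample XML are hypothetical and not part of the original notebook.

```python
from urllib.parse import urlencode
from xml.etree import ElementTree

ATOM = '{http://www.w3.org/2005/Atom}'
API_URL = 'http://export.arxiv.org/api/query'

def build_query_url(query, start=0, max_results=100):
    """Return the API URL with the query string percent-encoded."""
    params = {'search_query': query, 'start': start, 'max_results': max_results}
    return API_URL + '?' + urlencode(params)

def entry_ids(xml_bytes):
    """Return the last path segment of each entry's <id>, as the notebook does."""
    tree = ElementTree.fromstring(xml_bytes)
    return [e.find(ATOM + 'id').text.split('/')[-1]
            for e in tree.findall(ATOM + 'entry')]

# Hypothetical response, trimmed to the one field used here
sample = b'''<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><id>http://arxiv.org/abs/1209.4915v1</id></entry>
</feed>'''

print(build_query_url('all:latent space', max_results=5))
print(entry_ids(sample))
```

Keeping the parsing in its own function makes it possible to check the ID extraction against a saved response without sending a request.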


In [48]:
# Settings
valid_filetypes = ['png', 'jpg', 'jpeg', 'gif', 'pdf']
# folders in which diagrams will be exported. Relative to Jupyter directory
download_folder = 'downloads/'
dest_folder = 'diagrams/'
# I filter diagrams by mean grayscale brightness (0 = black, 255 = white).
# That way I mostly get figures on a white background.
brightness_thresh = 200
# create a list of filenames
file_list = []
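
To make the brightness filter concrete, here is a minimal pure-Python sketch of the idea. It stands in for the `Image.convert('L')` and `ImageStat.Stat(im).mean[0]` calls used in the notebook, so Pillow's rounding may differ slightly. The helper names and the two tiny "images" are hypothetical.

```python
def luma(r, g, b):
    """Approximate the grayscale value of an RGB pixel (ITU-R 601-2 weights)."""
    return (r * 299 + g * 587 + b * 114) / 1000

def mean_brightness(pixels):
    """Average grayscale value (0-255) over a list of (r, g, b) tuples."""
    values = [luma(*p) for p in pixels]
    return sum(values) / len(values)

brightness_thresh = 200  # same value as in the settings above

# Hypothetical 10-pixel images: a mostly white figure and a dark photo
figure = [(255, 255, 255)] * 9 + [(0, 0, 0)]
photo = [(40, 60, 80)] * 10

for name, pixels in [('figure', figure), ('photo', photo)]:
    b = mean_brightness(pixels)
    print(name, round(b, 1), 'keep' if b >= brightness_thresh else 'drop')
```

A figure that is nine-tenths white still clears the threshold of 200, while a dark photograph falls far below it.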

In [None]:
# Create output folder if it doesn't exist yet
if not os.path.exists(dest_folder):
    os.makedirs(dest_folder)
        
# Iterate through document tree
for item in tree.findall('{http://www.w3.org/2005/Atom}entry'):
    id = item.find('{http://www.w3.org/2005/Atom}id').text.split('/')[-1]
    src_url = 'http://arxiv.org/e-print/%s' % id
    
    # Store the paper and all assets in a downloads folder first.
    folder = download_folder + '%s/' % id
    file = folder + 'download.tar.gz'
    
    # Skip download if that paper was already downloaded
    # (could have been part of an earlier query)
    if os.path.exists(folder):
        print('skipped', id)

    # Nope, doesn't exist. Download the paper then!
    else:
        os.makedirs(folder)
        try:
            urllib.request.urlretrieve(src_url, file)
        except Exception:
            print(id, 'had an error')
            continue
        print('downloaded', id)

        # extract compressed file
        try:
            tar = tarfile.open(file, "r:gz")
            tar.extractall(folder)
            tar.close()
        except: 
            try:
                tar = tarfile.open(file, "r:")
                tar.extractall(folder)
                tar.close()
            except:
                pass
        
        # iterate over files to select keepers
        for i, filename in enumerate(glob.iglob(folder + '**', recursive=True)):
            
            # first, check if file has approved file type
            if filename.split('.')[-1] in valid_filetypes:
                # extension including the dot, e.g. '.jpeg' (the last 4 characters would drop the dot)
                filetype = os.path.splitext(filename)[1]
                fn = filename
                im = 0
    
                # convert PDF images to JPG format
                if filetype == '.pdf':
                    with WandImage(filename=fn, resolution=200) as img:
                        img.background_color = wand.color.Color('white')
                        img.alpha_channel='remove'
                        img.save(filename="temp.jpg")
                        im = Image.open("temp.jpg").convert('L')
                        filetype = '.jpg'
                        filename = "temp.jpg"
                else:
                    im = Image.open(filename).convert('L')
                
                # filter by brightness threshold
                brightness = ImageStat.Stat(im).mean[0]
                if brightness_thresh <= brightness:
                    # generate output filename + rename
                    file_out = id + '_%d' % i + filetype
                    os.rename(filename, dest_folder + file_out)
                    # add file name to list
                    if file_out not in file_list :
                        file_list.append(file_out)
                    
        # delete content in download folder; preserve empty folder to check for duplicates later
        contents = [os.path.join(folder, i) for i in os.listdir(folder)]
        for path in contents:
            if os.path.isfile(path) or os.path.islink(path):
                os.remove(path)
            else:
                shutil.rmtree(path)
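
The extraction step in the loop first tries a gzip-compressed tar and then falls back to an uncompressed one. A minimal sketch of an alternative, assuming it is acceptable to let `tarfile` detect the compression itself through the `'r:*'` mode. The helper `extract_archive` and the small test archives are hypothetical and not part of the original notebook.

```python
import io
import os
import tarfile
import tempfile

def extract_archive(path, dest):
    """Extract a tar archive, letting tarfile detect the compression.

    Returns True on success and False if the file is not a readable tar archive.
    """
    os.makedirs(dest, exist_ok=True)
    try:
        with tarfile.open(path, 'r:*') as tar:
            tar.extractall(dest)
        return True
    except tarfile.ReadError:
        return False

# Build two small hypothetical archives to exercise the helper
workdir = tempfile.mkdtemp()
results = {}
for mode, name in [('w:gz', 'a.tar.gz'), ('w', 'b.tar')]:
    path = os.path.join(workdir, name)
    data = b'fake image bytes'
    with tarfile.open(path, mode) as tar:
        info = tarfile.TarInfo('figure.png')
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
    results[name] = extract_archive(path, os.path.join(workdir, name + '_out'))

print(results)
```

Catching only `tarfile.ReadError` keeps unrelated problems, such as a full disk, visible instead of silently passing. On Python 3.12 and later, `extractall` also accepts a `filter` argument that restricts what an archive may write, which is worth considering for downloaded files.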

In [52]:
# Now, export a .js file with a list of image filenames.
with open("image-list.js", "w") as file:
    file.write('const images = ["')
    file.write('","'.join(file_list))
    file.write('"];')
"File image-list.js created"

'File image-list.js created'
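
The export cell assembles the JavaScript array by joining strings. A minimal sketch of the same export, assuming `json.dumps` is an acceptable way to produce the array literal. The function name `render_image_list` and the example file names are hypothetical.

```python
import json

def render_image_list(filenames):
    """Return the contents of image-list.js for the given file names."""
    return 'const images = ' + json.dumps(list(filenames)) + ';'

print(render_image_list(['1209.4915v1_3.jpg', '1209.4915v1_7.png']))
print(render_image_list([]))
```

With this approach an empty list produces `const images = [];`, whereas joining strings by hand writes an array containing one empty string. `json.dumps` also escapes any quotes or backslashes that happen to appear in a file name.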

## Post-Processing
I went through the `diagrams` folder and manually deleted duplicates and diagrams I didn't like. Then I compressed all image files for the web using [ImageOptim](https://imageoptim.com/mac).

## Giving credit

**All diagrams belong to the respective authors and their publications**. Folders in `downloads/` are named with the arxiv.org publication ID. Image files in `diagrams/` are as well, with the ID ending before the `_n` suffix and file extension (for example `_12.jpg`). Use the ID to look up the paper on www.arxiv.org.
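
Looking up a paper from an image file name can be sketched as follows. This is a minimal example, assuming file names follow the `<id>_<n>.<ext>` pattern produced by the scraping loop. The helpers `paper_id` and `paper_url` and the example file name are hypothetical.

```python
def paper_id(image_name):
    """Strip the trailing `_n.<ext>` that the scraping loop appends to the ID."""
    return image_name.rsplit('_', 1)[0]

def paper_url(image_name):
    """Build the abstract page URL for the paper an image came from."""
    return 'https://arxiv.org/abs/' + paper_id(image_name)

print(paper_url('1209.4915v1_12.jpg'))
```

Because the notebook keeps only the last path segment of each entry ID, older arXiv identifiers that include a category prefix would lose that prefix, so this lookup may not work for them.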

On *web scraping*: I've touched on the politics and violence of scraping images for datasets [in this essay](https://humans-of.ai/editorial/). And in journalism, The Markup makes a case for [Why Web Scraping Is Vital to Democracy](https://themarkup.org/news/2020/12/03/why-web-scraping-is-vital-to-democracy).