# BFICTDS: Baby's First Image Classification Training Data Set

An opensource book.

This notebook downloads images from the Open Images dataset and arranges them into the pages of the BFICTDS book, output as a PDF. The title and image classes are easily customizable.


In [None]:
import glob
import matplotlib.pyplot as plt
import numpy as np
from PIL import Image, ImageEnhance
from matplotlib.backends.backend_pdf import PdfPages
import os
import pandas as pd
import csv
import json

## 1) Download images from Open Images dataset

### Download/load OID images and class lists

We use the human-verified image labels with the full image-level class list (rather than the reduced class list for the box-labeled set). Download information is available at https://storage.googleapis.com/openimages/web/download.html.

In [None]:
#OID v6
OID_data_path = "OID_data_v6/"
if (not os.path.exists(OID_data_path)):
    os.makedirs(OID_data_path)
    
label_csv_path = OID_data_path+"oidv6-train-annotations-human-imagelabels.csv"
remote_label_csv_path = "https://storage.googleapis.com/openimages/v6/oidv6-train-annotations-human-imagelabels.csv"
label_names_csv_path = OID_data_path+"oidv6-class-descriptions.csv"
remote_label_names_csv_path = "https://storage.googleapis.com/openimages/v6/oidv6-class-descriptions.csv"

if not os.path.exists(label_csv_path):
    ! wget "$remote_label_csv_path" -O "$label_csv_path"
if not os.path.exists(label_names_csv_path):
    ! wget "$remote_label_names_csv_path" -O "$label_names_csv_path"
    
# load the label data from disk into Pandas DataFrames -- can take a minute due to large file size
label_df_v6 = pd.read_csv(label_csv_path)
label_names_df_v6 = pd.read_csv(label_names_csv_path,names=['key','name'],index_col=1) # this maps human-readable label names to their key value


        
display(label_df_v6)
display(label_names_df_v6)
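
The `wget` calls above rely on the notebook's shell escape. As a rough alternative, here is a minimal sketch of the same download-if-missing step using only the Python standard library. The helper name `download_if_missing` and the `.part` temporary suffix are my own choices for illustration and are not part of the notebook.

```python
import os
import urllib.request


def download_if_missing(url, dest_path):
    """Fetch url to dest_path unless the file already exists.

    Returns True if a download happened, False if the file was already there.
    """
    if os.path.exists(dest_path):
        return False
    parent = os.path.dirname(dest_path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    # Download to a temporary name first, so an interrupted transfer does not
    # leave a partial file that would be mistaken for a complete one next time.
    tmp_path = dest_path + ".part"
    urllib.request.urlretrieve(url, tmp_path)
    os.replace(tmp_path, dest_path)
    return True
```

In the notebook this would be called as `download_if_missing(remote_label_csv_path, label_csv_path)`, and likewise for the class-descriptions file.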

### Select image classes
Image classes can be picked out from the list above. They can also be found online using the Open Images dataset explorer.
Here's a useful code snippet to find classes starting with a given letter, in case you want to do an A-Z thing:

In [None]:
# find all image classes that start with a given letter, e.g. for the letter X:
label_names_df_v6[label_names_df_v6.index.str.startswith('X')]
# ... none of these turn out to be very good classes for a children's book :) 
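
If you are planning an A-Z book, it can help to see at a glance which letters have no candidate classes at all. Here is a small sketch that works on any iterable of class names. The function names and the toy list are hypothetical, and unlike the `startswith('X')` filter above it ignores case.

```python
import string
from collections import defaultdict


def classes_by_letter(names):
    """Group class names by their upper-cased first letter."""
    groups = defaultdict(list)
    for name in names:
        if name:  # skip empty strings
            groups[name[0].upper()].append(name)
    return groups


def missing_letters(names):
    """Return the letters A-Z for which no class name was found."""
    groups = classes_by_letter(names)
    return [letter for letter in string.ascii_uppercase if letter not in groups]


# Toy stand-in for the real class list
sample_names = ["Apple", "Bird", "bat", "Zebra"]
print(classes_by_letter(sample_names)["B"])
print(missing_letters(sample_names))
```

With the real data you could pass `label_names_df_v6.index` as `names`.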

In [None]:
# This is the list of classes we will download images for:
classes = ['Apple', 
           'Bird', 
           'Cat', 
           'Dog',
           'Elephant',
           'Fish',
           'Giraffe',
           'Helicopter',
           'Ice cream',
           'Jellyfish',
           'Kangaroo',
           'Lemon',
           'Motorcycle',
           'Noodle',
           'Orange',
           'Pear',
           'Quilt',
           'Reptile',
           'Snowplow',
           'Train',
           'Umbrella',
           'Vegetable',
           'Waffle',
           # no suitable X classes in dataset :(
           'Yak',
           'Zebra']

### Download image examples in selected classes

In [None]:
# function to obtain images from s3 based on their human-readable class name
def get_images_v6(label_name,limit=-1,base_path=OID_data_path):
    # limit = number of images to download (-1 for all available)
    
    s3_path = "s3://open-images-dataset/train/"
    
    label_key = label_names_df_v6.loc[label_name,'key']
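    image_names = label_df_v6.query("LabelName==@label_key and Confidence==1")['ImageID']
    if limit >= 0:
        # only truncate for a real limit: slicing with [:-1] would silently drop the last image
        image_names = image_names.head(limit)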
    print(image_names)
    
    if (not os.path.exists(base_path + label_name)):
        os.makedirs(base_path + label_name)
        
    for img in image_names:
        file_name = img+".jpg"
        ! aws s3 --no-sign-request cp "$s3_path$file_name" "$base_path$label_name/"

# Download up to 100 examples in each image class
for c in classes:
    get_images_v6(c,100)

## 2) Curate Images (by hand)

The images downloaded in the previous step are stored in the OID_data_v6/<classname> folders.
It's useful to go through them by hand to remove misclassified images, images featuring people, drawings and advertisements, and images featuring another object class. The book-rendering cell in section 3 reads its images from a "curated_images" folder, so I recommend copying all the images into that folder and then deleting any you want to remove using your file manager.

A fun improvement to this project would be to check the other image-level labels of each image (for competing object classes, etc.) and do some automatic filtering. You could also use the bounding-box dataset to crop images to specific objects.
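
As a starting point for that kind of automatic filtering, here is a minimal sketch that keeps only images positively labeled with the target class and not positively labeled with any competing class. It is an illustration under assumptions, not something the notebook does: the function name, the `(image_id, label_key, confidence)` row layout, and the label keys in the toy data are all hypothetical.

```python
def filter_competing(label_rows, target_key, competing_keys):
    """Return sorted image IDs that have target_key but none of competing_keys.

    label_rows: iterable of (image_id, label_key, confidence) tuples.
    Only rows with confidence == 1 (positive labels) are considered.
    """
    competing_keys = set(competing_keys)
    has_target = set()
    has_competitor = set()
    for image_id, label_key, confidence in label_rows:
        if float(confidence) != 1:
            continue  # ignore negative labels
        if label_key == target_key:
            has_target.add(image_id)
        elif label_key in competing_keys:
            has_competitor.add(image_id)
    return sorted(has_target - has_competitor)


# Toy rows with made-up label keys
rows = [
    ("img1", "/m/apple", 1),
    ("img1", "/m/person", 1),  # img1 also shows a person, so it is dropped
    ("img2", "/m/apple", 1),
    ("img2", "/m/person", 0),  # negative label, so img2 is kept
    ("img3", "/m/apple", 0),   # not actually an apple
]
print(filter_competing(rows, "/m/apple", ["/m/person"]))
```

With the real data, the rows could come from `label_df_v6[['ImageID','LabelName','Confidence']].itertuples(index=False)`, although a plain Python loop over the full label file will be slow.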

## 3) Generate pages and print-ready PDF

Each page type is rendered by a separate function.
We combine these with matplotlib's PdfPages backend to render a complete PDF of the book.

### Page rendering functions

In [None]:
# Front Cover
def render_cover(
    im_path,
    size=15,
    figsize=(8.5,8.5),
    dpi = 300,
    font="Monospace",
    title="Baby's First\nImage Classification\nTraining Data Set",
    subtitle="an opensource book\nhttps://github.com/ksocks/BFICTDS\nillustrated by the open images dataset"
    ):
    # render the front cover
    # size is the dimension of the image grid, e.g. size=15 is a 15*15 grid

    # sample without replacement: raises ValueError if fewer than size*size images are found
    im_files = np.random.choice(glob.glob(im_path+"*/*.jpg"),size*size,replace=False)
    fig, ax = plt.subplots(size,size, figsize=figsize, dpi=dpi)
    fig.patch.set_facecolor('black')

    ind = 0
    for f in im_files[:size**2]:
        ax[ind%size, int(ind/size)].set_axis_off()
        im = Image.open(f)
        enhancer = ImageEnhance.Brightness(im)
        ax[ind%size, int(ind/size)].imshow(enhancer.enhance(0.4))
        ind += 1
     
    fig.text(0.5,0.5,title.upper(),size=50,fontname=font,color='white',horizontalalignment='center',verticalalignment='center',weight='bold')
    fig.text(0.5,0.1,subtitle,size=20,fontname=font,color='white',horizontalalignment='center',verticalalignment='center',weight='bold')

    plt.tight_layout()
    plt.show()
    
    return fig

# render lowres sample page:
render_cover(im_path=OID_data_path,dpi=50);

In [None]:
# Back Cover
def render_back_cover(
    im_path,
    size=15,
    figsize=(8.5,8.5),
    dpi = 300,
    ):
    # render the back cover
    # size is the dimension of the image grid, e.g. size=15 is a 15*15 grid
    
    im_files = np.random.choice(glob.glob(im_path+"*/*.jpg"),size*size,replace=False)
    fig, ax = plt.subplots(size,size, figsize=figsize, dpi=dpi)
    #fig.patch.set_facecolor('black')

    ind = 0
    for f in im_files[:size**2]:
        ax[ind%size, int(ind/size)].set_axis_off()
        im = Image.open(f)
        ax[ind%size, int(ind/size)].imshow(im)
        ind += 1

    plt.tight_layout()
    plt.show()
    
    return fig
    
# render lowres sample page:
render_back_cover(im_path=OID_data_path,dpi=50);

In [None]:
# Front Matter
def render_front_matter(
     figsize=(8.5,8.5),
     dpi=300
     ):
    # ---------
    # Front Matter
    # ---------

    copyright_text="""
    This work is licensed under a creative commons 
    Attribution-NonCommercial-ShareAlike 4.0 International license (CC BY-NC-SA 4.0),
    which can be found here https://creativecommons.org/licenses/by-nc-sa/4.0/

    First date of publication, August 2020.
    
    ISBN: 978-1-71666-838-8
    Imprint: Lulu.com

    This is an open source book. The contents and code to 
    generate or modify the book are available on github:
    https://github.com/ksocks/BFICTDS

    The images reproduced in this work were obtained from the Open Images Dataset v6
    https://storage.googleapis.com/openimages/web/index.html
    The individual images are available under a creative commons license,
    and specific attribution data can be found on this book's github page.
    """

    fig = plt.figure(figsize=figsize,dpi=dpi,facecolor='white')
    plt.text(0.1,0.6, copyright_text,fontname='arial',size=8,color='black')
    plt.tight_layout()
    plt.gca().get_xaxis().set_visible(False)
    plt.gca().get_yaxis().set_visible(False)
    fig.gca().spines["top"].set_visible(False)
    fig.gca().spines["left"].set_visible(False)
    fig.gca().spines["bottom"].set_visible(False)
    fig.gca().spines["right"].set_visible(False)
    
    plt.show()
    
    return fig
    
# render lowres sample page:
render_front_matter(dpi=50);

In [None]:
# Introduction
def render_intro(
     figsize=(8.5,8.5),
     dpi=300,
     font="Monospace"
     ):
    
    # ---------
    # Intro
    # ---------
    
    poem_text="""
    > Dear little neural net,
    > 
    > You are making sense of a world of sights and sounds,
    > Where your labels are not always certain,
    > And your examples sometimes confound.
    > 
    > But, you are a little miracle
    > With the power to represent anything you wish
    > And with some parental supervision 
    > And a push or two from a teacher 
    > Through the flat gradients
    > And local minima of life
    > Your vision will become clear.
    > 
    > Yours truly,
    >
    > A neural net with a few more 
    > training epochs under its belt"""
    
    fig = plt.figure(figsize=figsize,dpi=dpi,facecolor='white')
    plt.text(0.1,0.3, poem_text,fontname=font,size=12,color='black')
    plt.tight_layout()
    plt.gca().get_xaxis().set_visible(False)
    plt.gca().get_yaxis().set_visible(False)
    fig.gca().spines["top"].set_visible(False)
    fig.gca().spines["left"].set_visible(False)
    fig.gca().spines["bottom"].set_visible(False)
    fig.gca().spines["right"].set_visible(False)
        
    plt.show()
    
    return fig

# render lowres sample page:
render_intro(dpi=50);

In [None]:
# Body pages for each image class
def render_body_page(
    im_path,
    im_class,
    figsize=(8.5,8.5),
    dpi=300,
    font="Monospace"):
    
    im_files = glob.glob(im_path+im_class+"/*.jpg")

    # this creates a simple adaptive grid size depending on the number of available images:
    if (len(im_files)>=9):
        size = 3
    else:
        print("not enough images in class {}".format(im_class))
        return
    if (len(im_files)>=16):
        size = 4
    if (len(im_files)>=25):# up to 5x5 grid depending on number of successfully downloaded images
        size = 5

    fig, ax = plt.subplots(size,size, figsize=figsize, dpi=dpi)

    ind = 0
    for f in im_files[:size**2]:
        ax[ind%size, int(ind/size)].set_axis_off()
        ax[ind%size, int(ind/size)].imshow(Image.open(f))
        ind += 1

    plt.subplots_adjust(left=0.1, bottom=0.1, right=0.9, top=0.85, wspace=0.1, hspace=0.1)
    fig.text(0.5,0.97," ") #whitespace margin hack
    fig.text(0.5,0.03," ") #whitespace margin hack

    fig.text(0.5,0.87,im_class.upper(),size=40,fontname=font,weight='bold',verticalalignment='bottom',horizontalalignment='center')
    
    plt.show()
    
    return fig

# render lowres sample pages:
render_body_page(im_path=OID_data_path,im_class="Apple",dpi=50);
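
The grid-size rule in `render_body_page` can be read as "use the largest square grid that the class can fill completely". Here is a minimal sketch of that rule as a standalone function. The name `choose_grid_size` and the `sizes` parameter are hypothetical and only used for illustration.

```python
def choose_grid_size(n_images, sizes=(3, 4, 5)):
    """Return the largest side s in sizes with s*s <= n_images, or None."""
    for s in sorted(sizes, reverse=True):
        if n_images >= s * s:
            return s
    return None  # not enough images even for the smallest grid


for n in (8, 9, 16, 24, 25, 100):
    print(n, choose_grid_size(n))
```

Returning `None` mirrors the early return in `render_body_page` for classes with fewer than 9 images.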



### Render the complete book to a pdf

In [None]:
# Output options

font = "Monospace" # main book font
figsize = (8.5,8.5) # <-- 8.5 in x 8.5 in is optimized for publishing on LULU, and the pages have been tweaked to fit this.
dpi = 300

im_path = "curated_images/" # where to look for image class subfolders, change if you copy them to a different directory after using the downloader

# render book.pdf
with PdfPages('book.pdf') as pdf:
    
    # ----------------
    # Cover
    # ----------------
    fig = render_cover(im_path=im_path,figsize=figsize,dpi=dpi,font=font)
    pdf.savefig(fig,facecolor=fig.get_facecolor())  # saves the current figure into a pdf page
    plt.close()
    
    # ---------
    # Front Matter
    # ---------

    fig = render_front_matter(figsize=figsize,dpi=dpi)
    pdf.savefig(fig,facecolor=fig.get_facecolor())  # saves the current figure into a pdf page
    plt.close()
    
    # ---------
    # Intro
    # ---------
    
    fig = render_intro(figsize=figsize,dpi=dpi,font=font)
    pdf.savefig(fig,facecolor=fig.get_facecolor())  # saves the current figure into a pdf page
    plt.close()
 
    
    # ---------
    # Book body
    # ---------
    
    for c in classes[:]:
        fig = render_body_page(im_path=im_path,im_class=c,figsize=figsize,dpi=dpi,font=font)
        if fig is None:
            continue  # render_body_page returns None when a class has fewer than 9 images
        pdf.savefig(fig,facecolor=fig.get_facecolor())  # saves the figure into a pdf page
        plt.close()

        
    # ----------------
    # Back Cover
    # ----------------
    fig = render_back_cover(im_path=im_path,figsize=figsize,dpi=dpi)
    pdf.savefig(fig,facecolor=fig.get_facecolor())  # saves the current figure into a pdf page
    plt.close()


## 4) (optional) Prepare cover files for Lulu.com (by hand)

Lulu requires the front and back covers to be provided in a separate file that follows a given template format. If you want to change the covers, you can for example use the provided "lulu covers.svg" and the PDF import capabilities of the [Inkscape editor](https://inkscape.org/) to paste in your new cover images, then regenerate a PDF to use on Lulu.

## 5) (optional) Obtain attribution information for selected images

Note that this step involves downloading the complete label information, which is quite a large file (2.3GB), to get attribution information for the individual images selected.

In [None]:
# download the csv with attribution info 
attribution_csv_path = OID_data_path + "oidv6-train-images-with-labels-with-rotation.csv"
remote_attribution_csv_path = "https://storage.googleapis.com/openimages/v6/oidv6-train-images-with-labels-with-rotation.csv"

if not os.path.exists(attribution_csv_path):
    ! wget "$remote_attribution_csv_path" -O "$attribution_csv_path"
    

In [None]:
# get list of images used in the book
# note this actually includes some images that aren't rendered onto the pages, since we ignored the grid size logic

imgs_used = []
im_path = "curated_images/"
imgs=glob.glob(im_path+"*/*.jpg")
for i in imgs:
    # the image ID is the file name without its directory or extension
    imgs_used.append(os.path.splitext(os.path.basename(i))[0])
print(imgs_used)
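
If you want to keep track of which class each image belongs to, the same IDs can be collected per class folder. This is a small sketch using `pathlib`. The function name `image_ids` is hypothetical, and it assumes the `<root>/<classname>/<ImageID>.jpg` layout used in this notebook.

```python
from pathlib import Path


def image_ids(root):
    """Map each class subfolder of root to the sorted image IDs inside it."""
    ids = {}
    for path in sorted(Path(root).glob("*/*.jpg")):
        # path.parent.name is the class folder, path.stem is the image ID
        ids.setdefault(path.parent.name, []).append(path.stem)
    return ids
```

Calling `image_ids("curated_images/")` returns a dictionary such as `{"Apple": [...], "Bird": [...]}`. Files that are not `.jpg` or that sit directly in the root folder are ignored.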

In [None]:
# get csv column headers
with open(attribution_csv_path,"r",newline='') as f:
    r = csv.reader(f)
    # print the header row, then the first data row as an example
    print(next(r))
    print(next(r))

In [None]:
# get attribution info by brute force scanning file - takes about 15 minutes to go through the ~7 million entries
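attribution = dict()
imgs_used_set = set(imgs_used)  # set membership tests are much faster than scanning a list
with open(attribution_csv_path,"r",newline='') as f:
    r = csv.reader(f)
    n = 0
    for row in r:
        if row[0] in imgs_used_set:
            attribution[row[0]]=[row[2],row[4],row[6],row[7]]
        if n % 100000 == 0:
            # printing every single row slows the scan down considerably
            print("scanning row {}".format(n),end='\r')
        n+=1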
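
The scan can also be wrapped in a function that stops as soon as every requested image has been found. The sketch below is one way to do that, under a few assumptions: the function name `collect_attribution` is hypothetical, the default column positions are the ones used in the cell above, the early exit assumes each image ID appears only once in the file, and the column names in the toy header are my guess (the header-printing cell above shows the real ones).

```python
import csv
import io


def collect_attribution(csv_file, wanted_ids, columns=(2, 4, 6, 7)):
    """Scan an open CSV file and return {image_id: [selected column values]}."""
    wanted = set(wanted_ids)  # constant-time membership test per row
    found = {}
    reader = csv.reader(csv_file)
    next(reader, None)  # skip the header row
    for row in reader:
        if row and row[0] in wanted:
            found[row[0]] = [row[i] for i in columns]
            if len(found) == len(wanted):
                break  # everything located, no need to read the rest
    return found


# Toy file with an assumed header and made-up rows
sample = io.StringIO(
    "ImageID,Subset,OriginalURL,OriginalLandingURL,License,AuthorProfileURL,Author,Title\n"
    "aaa,train,http://example.com/a.jpg,http://example.com/a,by,http://example.com/alice,Alice,Apple\n"
    "bbb,train,http://example.com/b.jpg,http://example.com/b,by,http://example.com/bob,Bob,Bird\n"
)
print(collect_attribution(sample, ["bbb"]))
```

With the real file this would be called as `collect_attribution(f, imgs_used)` inside the `with open(...)` block.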

In [None]:
# print attribution info and dump to file

print(attribution)
with open('attribution.json', 'w') as file:
    json.dump(attribution, file)