# Explorer Notebook
A scratch notebook for small experiments: checking which of Airbnb's target classes exist in Open Images, and trying out the download script.

In [2]:
import pandas as pd

In [68]:
# The subset of classes Airbnb is most concerned with
subset = ["Toilet",
          "Swimming_pool",
          "Bed",
          "Billiard_table",
          "Sink",
          "Fountain",
          "Oven",
          "Ceiling_fan",
          "Television",
          "Microwave_oven",
          "Gas_stove",
          "Refrigerator",
          "Kitchen_&_dining_room_table",
          "Washing_machine",
          "Bathtub",
          "Stairs",
          "Fireplace",
          "Pillow",
          "Mirror",
          "Shower",
          "Couch",
          "Countertop",
          "Coffeemaker",
          "Dishwasher",
          "Sofa_bed",
          "Tree_house",
          "Towel",
          "Porch",
          "Wine_rack",
          "Jacuzzi"]

len(subset)

30

## Start exploring the class names in Open Images
Downloaded the class descriptions from Open Images: `!wget https://storage.googleapis.com/openimages/2018_04/class-descriptions-boxable.csv`

This file contains all of the codenames for the classes which have bounding box labels in Open Images.
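
The file has no header row: each line is a class ID followed by a human-readable name. A minimal sketch of turning it into a name-to-ID lookup with only the standard library follows. `load_name_to_id` is a hypothetical helper, and the three rows are copied from tables shown in this notebook rather than read from disk.

```python
import csv
import io

# Three rows copied from tables shown in this notebook; the real file is
# assumed to use the same two-column, headerless layout.
sample_csv = "/m/011k07,Tortoise\n/m/0b_rs,Swimming pool\n/m/02wv84t,Gas stove\n"


def load_name_to_id(handle):
    """Map each human-readable class name to its Open Images ID."""
    return {name: class_id for class_id, name in csv.reader(handle)}


name_to_id = load_name_to_id(io.StringIO(sample_csv))
print(name_to_id["Swimming pool"])  # /m/0b_rs
```

For the real file, pass `open("class-descriptions-boxable.csv", newline="")` instead of the `io.StringIO` wrapper.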

In [None]:
#!wget https://storage.googleapis.com/openimages/2018_04/class-descriptions-boxable.csv

In [69]:
# All the classes in Open Images
classes = pd.read_csv("class-descriptions-boxable.csv", names=["ID", "Name"])
classes

Unnamed: 0,ID,Name
0,/m/011k07,Tortoise
1,/m/011q46kg,Container
2,/m/012074,Magpie
3,/m/0120dh,Sea turtle
4,/m/01226z,Football
...,...,...
596,/m/0qmmr,Wheelchair
597,/m/0wdt60w,Rugby ball
598,/m/0xfy,Armadillo
599,/m/0xzly,Maracas


In [70]:
# Flag the rows whose name exactly matches one of the subset classes
classes["match"] = classes["Name"].isin(subset)
classes

Unnamed: 0,ID,Name,match
0,/m/011k07,Tortoise,False
1,/m/011q46kg,Container,False
2,/m/012074,Magpie,False
3,/m/0120dh,Sea turtle,False
4,/m/01226z,Football,False
...,...,...,...
596,/m/0qmmr,Wheelchair,False
597,/m/0wdt60w,Rugby ball,False
598,/m/0xfy,Armadillo,False
599,/m/0xzly,Maracas,False


In [71]:
classes.match.value_counts()

False    581
True      20
Name: match, dtype: int64

In [72]:
# Where do they match up?
matches = classes[classes["match"]]["Name"].tolist()
matches

['Sink',
 'Towel',
 'Stairs',
 'Fountain',
 'Oven',
 'Couch',
 'Shower',
 'Pillow',
 'Bathtub',
 'Bed',
 'Fireplace',
 'Refrigerator',
 'Porch',
 'Mirror',
 'Jacuzzi',
 'Television',
 'Coffeemaker',
 'Toilet',
 'Countertop',
 'Dishwasher']

In [73]:
# Where are they different?
missing_classes = list(set(subset)-set(matches))
missing_classes  # Airbnb classes with no exact name match in Open Images

['Kitchen_&_dining_room_table',
 'Swimming_pool',
 'Gas_stove',
 'Ceiling_fan',
 'Washing_machine',
 'Tree_house',
 'Billiard_table',
 'Microwave_oven',
 'Sofa_bed',
 'Wine_rack']

In [74]:
# Are there similar versions of these classes in the descriptions I could use?
classes[classes["Name"].str.contains("pool")]

Unnamed: 0,ID,Name,match
444,/m/0b_rs,Swimming pool,False


In [75]:
classes[classes["Name"].str.contains("stove")]

Unnamed: 0,ID,Name,match
197,/m/02wv84t,Gas stove,False
270,/m/04169hn,Wood-burning stove,False


In [76]:
classes[classes["Name"].str.contains("stove")]["Name"].tolist()

['Gas stove', 'Wood-burning stove']

In [77]:
# Get the individual words from each string of missing classes
strings = [i.split('_') for i in missing_classes]
strings = [item for sublist in strings for item in sublist]
strings

['Kitchen',
 '&',
 'dining',
 'room',
 'table',
 'Swimming',
 'pool',
 'Gas',
 'stove',
 'Ceiling',
 'fan',
 'Washing',
 'machine',
 'Tree',
 'house',
 'Billiard',
 'table',
 'Microwave',
 'oven',
 'Sofa',
 'bed',
 'Wine',
 'rack']

In [78]:
# Now find every Open Images name that contains one of these words (substring match)
more_matches = []
for string in strings:
    more_matches.append(classes[classes["Name"].str.contains(string)]["Name"].tolist())
more_matches = list(set([item for sublist in more_matches for item in sublist]))
more_matches

['Mechanical fan',
 'Tree',
 'Bathroom accessory',
 'Table tennis racket',
 'Mushroom',
 'Infant bed',
 'Kitchenware',
 'Tennis racket',
 'Kitchen utensil',
 'Kitchen appliance',
 'Spice rack',
 'Wine glass',
 'Gas stove',
 'Bathroom cabinet',
 'Billiard table',
 'Kitchen & dining room table',
 'Washing machine',
 'Tree house',
 'Kitchen knife',
 'Dog bed',
 'Lighthouse',
 'Wine rack',
 'Wood-burning stove',
 'Ceiling fan',
 'Swimming pool',
 'Wine',
 'Sewing machine',
 'Sofa bed',
 'Coffee table',
 'Microwave oven',
 'Vegetable']
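
The substring search also returns names such as `Mushroom` (contains "room") and `Vegetable` (contains "table"). One way to avoid these false positives is to match whole words only. This is a minimal sketch, not what the notebook ran: `whole_word_matches` is a hypothetical helper and `names` is a hand-picked sample from the output above.

```python
import re

# A hand-picked sample of names from the output above.
names = [
    "Mushroom",
    "Vegetable",
    "Coffee table",
    "Billiard table",
    "Kitchen & dining room table",
]


def whole_word_matches(word, names):
    """Return the names that contain `word` as a complete word."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b")
    return [name for name in names if pattern.search(name)]


print(whole_word_matches("table", names))
# ['Coffee table', 'Billiard table', 'Kitchen & dining room table']
print(whole_word_matches("room", names))
# ['Kitchen & dining room table']
```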

In [80]:
# Replace the underscores with spaces
missing_classes_no_space = [i.replace("_", " ") for i in missing_classes]
missing_classes_no_space

['Kitchen & dining room table',
 'Swimming pool',
 'Gas stove',
 'Ceiling fan',
 'Washing machine',
 'Tree house',
 'Billiard table',
 'Microwave oven',
 'Sofa bed',
 'Wine rack']

In [82]:
# Find the actual missing classes
actual_missing_classes = list(set(missing_classes_no_space) - set(more_matches))
actual_missing_classes

[]

Turns out none of the classes are missing from Open Images! The only difference is the naming convention: Airbnb's class names use underscores ("_") where Open Images uses spaces. This is a simple fix we can implement later.

Let's remove the underscores from our `subset` list and play with that to start downloading classes.

In [83]:
subset_no_underscore = [i.replace("_", " ") for i in subset]
subset_no_underscore

['Toilet',
 'Swimming pool',
 'Bed',
 'Billiard table',
 'Sink',
 'Fountain',
 'Oven',
 'Ceiling fan',
 'Television',
 'Microwave oven',
 'Gas stove',
 'Refrigerator',
 'Kitchen & dining room table',
 'Washing machine',
 'Bathtub',
 'Stairs',
 'Fireplace',
 'Pillow',
 'Mirror',
 'Shower',
 'Couch',
 'Countertop',
 'Coffeemaker',
 'Dishwasher',
 'Sofa bed',
 'Tree house',
 'Towel',
 'Porch',
 'Wine rack',
 'Jacuzzi']
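
With the underscores replaced, an exact comparison is enough and the substring search is no longer needed. This is a minimal sketch that uses a handful of names from this notebook in place of the full `classes` table; `normalise` and `unmatched` are hypothetical names.

```python
def normalise(name):
    """Replace underscores with spaces so Airbnb-style names match Open Images."""
    return name.replace("_", " ")


# A few Airbnb-style names and a few Open Images names taken from this notebook.
airbnb_names = ["Swimming_pool", "Gas_stove", "Jacuzzi"]
open_images_names = {"Swimming pool", "Gas stove", "Jacuzzi", "Tortoise"}

# Anything left in this list has no exact counterpart in Open Images.
unmatched = [n for n in airbnb_names if normalise(n) not in open_images_names]
print(unmatched)  # []
```

On the full table, the equivalent check would be `classes["Name"].isin(subset_no_underscore)`.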

Okay, we'll start with a small class (small as in, there are likely not many examples). Let's use `Jacuzzi` first.

In [85]:
!python3 downloadOI.py --classes 'Jacuzzi' --mode train

Class 0 : Jacuzzi
grep: ./train-annotations-bbox.csv: No such file or directory
Annotation Count : 0
Number of images to be downloaded : 0
0it [00:00, ?it/s]
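
A small pre-flight check can report which files are absent before the downloader is launched. The sketch below is not part of `downloadOI.py`. `missing_files` is a hypothetical helper, and the file list is an assumption based on the four files fetched later in this notebook.

```python
from pathlib import Path

# Assumed list of inputs: the four files fetched later in this notebook.
REQUIRED_FILES = [
    "class-descriptions-boxable.csv",
    "train-annotations-bbox.csv",
    "validation-annotations-bbox.csv",
    "test-annotations-bbox.csv",
]


def missing_files(directory=".", required=REQUIRED_FILES):
    """Return the required file names that are not present in `directory`."""
    base = Path(directory)
    return [name for name in required if not (base / name).is_file()]


print(missing_files())  # an empty list means every file is in place
```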


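The `grep` error above shows that the script expects `train-annotations-bbox.csv` in the working directory. Get the files we need from Open Images (class descriptions and the bounding-box annotations for each split):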

In [86]:
!wget https://storage.googleapis.com/openimages/2018_04/class-descriptions-boxable.csv
 
!wget https://storage.googleapis.com/openimages/2018_04/train/train-annotations-bbox.csv
 
!wget https://storage.googleapis.com/openimages/2018_04/validation/validation-annotations-bbox.csv
 
!wget https://storage.googleapis.com/openimages/2018_04/test/test-annotations-bbox.csv 

--2020-02-20 06:04:12--  https://storage.googleapis.com/openimages/2018_04/class-descriptions-boxable.csv
Resolving storage.googleapis.com (storage.googleapis.com)... 216.58.203.112, 2404:6800:4006:807::2010
Connecting to storage.googleapis.com (storage.googleapis.com)|216.58.203.112|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 11255 (11K) [text/csv]
Saving to: ‘class-descriptions-boxable.csv.1’


2020-02-20 06:04:13 (75.0 MB/s) - ‘class-descriptions-boxable.csv.1’ saved [11255/11255]

--2020-02-20 06:04:13--  https://storage.googleapis.com/openimages/2018_04/train/train-annotations-bbox.csv
Resolving storage.googleapis.com (storage.googleapis.com)... 172.217.167.112, 2404:6800:4006:803::2010
Connecting to storage.googleapis.com (storage.googleapis.com)|172.217.167.112|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1194033454 (1.1G) [text/csv]
Saving to: ‘train-annotations-bbox.csv’


2020-02-20 06:04:39 (45.6 MB/s) - ‘train-annot

In [87]:
!python3 downloadOI.py --classes 'Jacuzzi' --mode train

Class 0 : Jacuzzi
Annotation Count : 103
Number of images to be downloaded : 102
100%|█████████████████████████████████████████| 102/102 [00:24<00:00,  4.15it/s]
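
The run reports 103 annotations but 102 images. One plausible reading is that an image can carry more than one box for the same class. This is an assumption, because the internals of `downloadOI.py` are not shown here. The sketch below counts boxes and distinct images for one label. The column names `ImageID` and `LabelName` are assumed to match the annotation CSV, the image IDs are made up, and `count_boxes_and_images` is a hypothetical helper.

```python
import csv
import io

# Made-up rows; only the two columns used here are included.
sample = """ImageID,LabelName
img_a,/m/0b_rs
img_a,/m/0b_rs
img_b,/m/0b_rs
img_c,/m/011k07
"""


def count_boxes_and_images(handle, label_id):
    """Return (number of boxes, number of distinct images) for one label."""
    boxes = 0
    images = set()
    for row in csv.DictReader(handle):
        if row["LabelName"] == label_id:
            boxes += 1
            images.add(row["ImageID"])
    return boxes, len(images)


print(count_boxes_and_images(io.StringIO(sample), "/m/0b_rs"))  # (3, 2)
```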


In [90]:
!python3 downloadOI.py --classes 'Toilet,Bathtub' --mode validation

Class 0 : Toilet
Class 1 : Bathtub
Annotation Count : 55
Number of images to be downloaded : 45
100%|███████████████████████████████████████████| 45/45 [00:10<00:00,  4.13it/s]
