# Building an Inverse Image Search Service

Transfer learning lets us reuse what a network has already learned on one task to solve another. In this notebook we'll use a pre-trained network to build an inverse image search engine or, put more plainly, a "search by example" service.

Like most deep learning projects, this one begins with data acquisition.

We'll extract the images from Wikipedia and pass them through a pre-trained network to compile an "embedding" dictionary, which will later let us fetch similar images using a simple nearest neighbor search.
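To make that last step concrete, here is a minimal sketch of a nearest neighbor lookup over an embedding dictionary. The file names and three-dimensional vectors are made up for illustration; real embeddings from a pre-trained network have far more dimensions. The brute-force `nearest` helper is a hypothetical stand-in for a proper index such as scikit-learn's `NearestNeighbors`.

```python
import math

# Made-up embeddings: each image name maps to a tiny feature vector.
embeddings = {
    'dog.jpg': [1.0, 0.0, 0.0],
    'wolf.jpg': [0.9, 0.1, 0.0],
    'car.jpg': [0.0, 1.0, 0.0],
    'truck.jpg': [0.0, 0.9, 0.2],
}


def nearest(query, embeddings, k=2):
    """Return the names of the k images closest to query (Euclidean distance)."""
    ranked = sorted(embeddings, key=lambda name: math.dist(query, embeddings[name]))
    return ranked[:k]


# A query vector that sits close to the "dog-like" images.
neighbors = nearest([1.0, 0.05, 0.0], embeddings)
print(neighbors)
```

Sorting every vector by distance is fine for a handful of images, but it scans the whole dictionary on each query, which is why a dedicated index is worth using on a larger corpus.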

## Prerequisites

Let's import the libraries we'll need.

In [1]:
%matplotlib inline

from sklearn.neighbors import NearestNeighbors
import matplotlib.pyplot as plt
import requests
import random
import os
import base64
from sklearn.decomposition import TruncatedSVD
from keras.models import Model
from hashlib import md5
import pickle
from urllib.parse import unquote
from PIL import Image
from io import BytesIO
from IPython.display import HTML, Image as IPImage, display
import numpy as np
from tqdm import tqdm
from keras.applications.inception_v3 import InceptionV3, preprocess_input
from keras.preprocessing import image

Using TensorFlow backend.


## Getting Images from Wikipedia

Wikipedia is a great source of knowledge in many formats, including visual data (i.e., images). Since we want our search service to be as flexible as possible, we want to gather as broad a corpus of images as we can, and Wikipedia is a great place to do so.

The catch is that many images in Wikipedia depict concrete instances of things, which is not exactly what we need here. We want a picture that represents a dog as a species rather than a specific dog, such as Pluto.

Wikipedia is paired with Wikidata, whose structure is based on triples of the form {subject, relation, object}. Wikidata encodes a great number of predicates, many of them about entities described in Wikipedia. One of these predicates, "instance of", is identified by P31. Using it, we can find the classes at the object end of this relation and then fetch their images through the "image" predicate, P18.

Here's a query expressed in SPARQL, Wikidata's query language, to get this information:

In [2]:
query = '''SELECT DISTINCT ?pic
WHERE
{
    ?item wdt:P31 ?class . 
    ?class wdt:P18 ?pic
}'''
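To see how the two triple patterns in the query combine, here is a small sketch that evaluates the same logic over a toy, in-memory triple store. Every identifier and file name in it is invented for illustration; only the predicate names P31 and P18 come from the query above.

```python
# Toy triple store: (subject, predicate, object). All entries are made up.
triples = [
    ('pluto', 'P31', 'dog'),        # Pluto is an instance of dog
    ('lassie', 'P31', 'dog'),       # a second instance of the same class
    ('dog', 'P18', 'Dog.jpg'),      # picture of the class "dog"
    ('herbie', 'P31', 'car'),
    ('car', 'P18', 'Car.jpg'),      # picture of the class "car"
    ('pluto', 'P18', 'Pluto.jpg'),  # picture of an instance, not of a class
]

# First pattern (?item wdt:P31 ?class): collect everything that has instances.
classes = {obj for _, pred, obj in triples if pred == 'P31'}

# Second pattern (?class wdt:P18 ?pic): keep pictures whose subject is a class.
# Building a set removes the duplicates caused by classes with many instances.
pics = sorted({obj for subj, pred, obj in triples
               if pred == 'P18' and subj in classes})
print(pics)
```

The picture of Pluto is left out because nothing in the store is an instance of Pluto, which is exactly the behavior we want: pictures of classes rather than of individuals.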

Let's send this query to the Wikidata query service and unfold the resulting JSON into a list of image page URLs.

In [3]:
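url = 'https://query.wikidata.org/bigdata/namespace/wdq/sparql'

# Ask the query service for JSON and fail loudly on an HTTP error.
response = requests.get(url, params={'query': query, 'format': 'json'})
response.raise_for_status()
data = response.json()

# Each binding holds the URL of one image page under the ?pic variable.
pages_urls = [result['pic']['value'] for result in data['results']['bindings']]

print(f'Number of fetched pages urls: {len(pages_urls)}')
# Show a small random sample (guarding against fewer than 10 results).
print(random.sample(pages_urls, min(10, len(pages_urls))))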
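The list comprehension above relies on the layout of a SPARQL JSON response, where each row is a "binding" keyed by variable name. The sketch below parses a small hand-written sample in that layout; the `example.org` URLs are placeholders, not real query results.

```python
import json

# Hand-written sample in the shape of a SPARQL JSON response.
# The URLs are placeholders and do not come from Wikidata.
sample = json.loads('''
{
  "head": {"vars": ["pic"]},
  "results": {"bindings": [
    {"pic": {"type": "uri", "value": "http://example.org/Dog.jpg"}},
    {"pic": {"type": "uri", "value": "http://example.org/Car.jpg"}}
  ]}
}
''')

# One entry per binding: look up the variable, then its "value" field.
urls = [binding['pic']['value'] for binding in sample['results']['bindings']]
print(urls)
```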

The results are URLs of the image pages, not of the images themselves.

In [4]:
print(len(pages_urls))