In [2]:
import turicreate

In [14]:
image_train = turicreate.SFrame('image_train_data/')
image_test = turicreate.SFrame('image_test_data/')

# Exercise

**1. Computing summary statistics of the data**
 - Sketch summaries are techniques for computing summary statistics of data very quickly. In GraphLab Create, SFrames and SArrays include a method, .sketch_summary(), which computes such summary statistics. This notebook uses Turi Create, and the cell below builds the same kind of summary with turicreate.Sketch.
 - Using the training data, compute the sketch summary of the ‘label’ column and interpret the results. What’s the least common category in the training data?

In [18]:
sketch = turicreate.Sketch(image_train['label'])
sketch


+------------------+-------+----------+
|       item       | value | is exact |
+------------------+-------+----------+
|      Length      |  2005 |   Yes    |
| # Missing Values |   0   |   Yes    |
| # unique values  |   4   |    No    |
+------------------+-------+----------+

Most frequent items:
+------------+-------+
|   value    | count |
+------------+-------+
|    cat     |  509  |
|    dog     |  509  |
| automobile |  509  |
|    bird    |  478  |
+------------+-------+
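
To answer the question about the least common category, the counts in the table above can be compared directly. The following is a minimal sketch in plain Python; the dictionary `label_counts` is a hand-copied stand-in for the sketch output, not something Turi Create returns in this form.

```python
# Counts copied by hand from the "Most frequent items" table above.
label_counts = {'cat': 509, 'dog': 509, 'automobile': 509, 'bird': 478}

# The least common category is the key with the smallest count.
least_common = min(label_counts, key=label_counts.get)

# Sanity check: the counts should add up to the reported Length.
total = sum(label_counts.values())

print(least_common, label_counts[least_common], total)
```

The total agrees with the `Length` row of the summary, which suggests the four listed categories cover the whole training set.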


**2. Creating category-specific image retrieval models**

In most retrieval tasks, the data we have is unlabeled, so we call these unsupervised learning problems. However, this image dataset has labels, and we will use them to create one model for each of the 4 image categories: {‘dog’, ‘cat’, ‘automobile’, ‘bird’}. To start, follow these steps:

 - Split the SFrame with the training data into 4 different SFrames. Each of these will contain data for 1 of the 4 categories above. Hint: if you use a logical filter to select the rows where the ‘label’ column equals ‘dog’, you can create an SFrame with only the data for images labeled ‘dog’.

 - As in the image retrieval notebook you downloaded, create a nearest neighbor model using 'deep_features' as the features, but this time create one such model for each category, using the matching subset of the training data. Call the model built on the ‘dog’ data dog_model, the one built on the ‘cat’ data cat_model, and so on. You then have a nearest neighbors model that can find the nearest ‘dog’ to any image you give it (dog_model), one that can find the nearest ‘cat’ (cat_model), and so on.
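
The two steps above (split by label, then search within one category) can be illustrated without Turi Create. This is only a sketch under stated assumptions: the rows in `toy_train` are made-up 2-D vectors standing in for 'deep_features', the helpers `split_by_label` and `query` are hypothetical names, and brute-force Euclidean distance is assumed as the metric. The notebook itself relies on `turicreate.nearest_neighbors.create` for the real models.

```python
import math

# Made-up stand-ins for rows of image_train: (id, label, feature vector).
toy_train = [
    (1, 'dog', [0.0, 1.0]),
    (2, 'dog', [1.0, 1.0]),
    (3, 'cat', [5.0, 5.0]),
    (4, 'cat', [6.0, 5.0]),
]

def split_by_label(rows):
    """Group (id, features) pairs by label, mirroring the logical-filter step."""
    groups = {}
    for row_id, label, features in rows:
        groups.setdefault(label, []).append((row_id, features))
    return groups

def query(reference_rows, features, k=5):
    """Return up to k (id, distance) pairs, closest first (brute-force Euclidean)."""
    scored = [(row_id, math.dist(features, ref)) for row_id, ref in reference_rows]
    return sorted(scored, key=lambda pair: pair[1])[:k]

per_category = split_by_label(toy_train)

# One "model" per category is simply that category's reference rows.
nearest_dog = query(per_category['dog'], [0.0, 0.0], k=1)
nearest_cat = query(per_category['cat'], [0.0, 0.0], k=1)
print(nearest_dog, nearest_cat)
```

Each per-category search can only return items from its own category, which is the property that Tasks 3 and 4 rely on.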

In [20]:
dog_train = image_train[image_train['label'] == 'dog']
cat_train = image_train[image_train['label'] == 'cat']
auto_train = image_train[image_train['label'] == 'automobile']
bird_train = image_train[image_train['label'] == 'bird']

In [24]:
# create nearest neighbours model for each category

In [25]:
dog_model = turicreate.nearest_neighbors.create(dog_train, features=['deep_features'], label='id')

In [26]:
cat_model = turicreate.nearest_neighbors.create(cat_train, features=['deep_features'], label='id')

In [23]:
automobile_model  = turicreate.nearest_neighbors.create(auto_train, features=['deep_features'], label='id')

In [27]:
bird_model = turicreate.nearest_neighbors.create(bird_train, features=['deep_features'], label='id')

In [30]:
# five nearest 'cat' neighbours of the first test image
cat_model.query(image_test[0:1])

query_label,reference_label,distance,rank
0,16289,34.62371920804245,1
0,45646,36.00687992842462,2
0,32139,36.52008134363789,3
0,25713,36.754850252057054,4
0,331,36.87312281675268,5


In [31]:
def get_images_from_ids(query_result):
    return image_train.filter_by(query_result['reference_label'],'id')

In [34]:
cat_image = image_train[image_train['id']==16289]

In [36]:
# five nearest 'dog' neighbours of the same first test image
dog_model.query(image_test[0:1])

query_label,reference_label,distance,rank
0,16976,37.464262878423774,1
0,13387,37.56668321685285,2
0,35867,37.60472670789396,3
0,44603,37.70655851529755,4
0,6094,38.51132549073972,5


In [37]:
dog_image = image_train[image_train['id']==16976]

**3. Try a simple example of nearest-neighbors classification**

When you queried the nearest neighbors model, the distance column in Task 2 showed the computed distance between the input and each of the retrieved neighbors. In this task, you will use these distances for classification, using a nearest-neighbors classifier.

 - For the first image in the test data (image_test[0:1]), compute the mean distance between this image and its five nearest neighbors that are labeled ‘cat’ in the training data (similar to what you did in the previous question).
 - For the first image in the test data (image_test[0:1]), compute the mean distance between this image and its five nearest neighbors that are labeled ‘dog’ in the training data (similar to what you did in the previous question).

On average, is the first image in the test data closer to its five nearest neighbors in the ‘cat’ data or in the ‘dog’ data?

In [38]:
cat_model.query(image_test[0:1])['distance'].mean()

36.15573070978294

In [39]:
dog_model.query(image_test[0:1])['distance'].mean()

37.77071136184156
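
As a cross-check, both means can be recomputed from the distances printed in Task 2, without calling the models again. The sketch below is plain Python; the list names and the `mean` helper are introduced here for illustration only.

```python
# Distances copied from the two query outputs in Task 2.
cat_distances_first_image = [34.62371920804245, 36.00687992842462, 36.52008134363789,
                             36.754850252057054, 36.87312281675268]
dog_distances_first_image = [37.464262878423774, 37.56668321685285, 37.60472670789396,
                             37.70655851529755, 38.51132549073972]

def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)

cat_mean = mean(cat_distances_first_image)
dog_mean = mean(dog_distances_first_image)

# The category with the smaller mean distance is the closer one on average.
closer_category = 'cat' if cat_mean < dog_mean else 'dog'
print(cat_mean, dog_mean, closer_category)
```

Up to floating-point rounding, the recomputed means agree with the two values above, so on average the first test image is closer to its five nearest ‘cat’ neighbors than to its five nearest ‘dog’ neighbors.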

**4. Compute nearest neighbors accuracy**

In [40]:
# split the test data on the cat, dog, automobile, and bird labels
image_test_automobile = image_test.filter_by(['automobile'],'label')
image_test_cat = image_test.filter_by(['cat'],'label')
image_test_dog = image_test.filter_by(['dog'],'label')
image_test_bird = image_test.filter_by(['bird'],'label')

In [41]:
# finds one neighbor (i.e., k=1) to the dog test images (image_test_dog) in the cat portion of the training data
dog_cat_neighbors = cat_model.query(image_test_dog, k=1)

In [42]:
dog_dog_neighbors = dog_model.query(image_test_dog, k=1)

In [43]:
dog_automobile_neighbors = automobile_model.query(image_test_dog, k=1)

In [44]:
dog_bird_neighbors = bird_model.query(image_test_dog, k=1)

In [45]:
dog_distances = turicreate.SFrame({'dog_automobile': dog_automobile_neighbors['distance'],
                              'dog_bird': dog_bird_neighbors['distance'],
                              'dog_cat': dog_cat_neighbors['distance'],
                              'dog_dog': dog_dog_neighbors['distance']
                             })

In [46]:
dog_distances.head()

dog_automobile,dog_bird,dog_cat,dog_dog
41.95797614571203,41.75386473035126,36.419607706754384,33.47735903726335
46.00213318067788,41.3382958924861,38.83532688735542,32.84584956840554
42.9462290692388,38.615759085289056,36.97634108541546,35.03970731890584
41.68660600484793,37.08922699538214,34.575007291446106,33.90103276968193
39.22696649347584,38.27228869398105,34.77882479101661,37.48492509092564
40.58451176980721,39.146208923590486,35.11715782924591,34.94516534398124
45.10673529610854,40.52304010596232,40.60958309132649,39.095727834463545
41.32211409739762,38.19479183926956,39.90368673062214,37.76961310322034
41.82446549950164,40.156713166131446,38.067470016821176,35.10891446032838
45.497692940110376,45.55979626027668,42.72587329506032,43.242283258453455


In [47]:
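def is_dog_correct(row):
    # Correct (1) when the nearest 'dog' neighbour is at least as close as the
    # nearest neighbour from every other category; otherwise incorrect (0).
    if row['dog_dog'] <= min(row.values()):
        return 1
    else:
        return 0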

In [48]:
dog_distances.apply(is_dog_correct)

dtype: int
Rows: 1000
[1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, ... ]
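
The rule can be checked by hand on the ten rows printed by `dog_distances.head()`. The sketch below rebuilds those rows as plain dictionaries (the names `columns`, `head_values`, `head_rows`, and `head_flags` are introduced here for illustration) and applies the same comparison as `is_dog_correct`.

```python
# First ten rows of dog_distances, copied from the head() output above.
columns = ['dog_automobile', 'dog_bird', 'dog_cat', 'dog_dog']
head_values = [
    [41.95797614571203, 41.75386473035126, 36.419607706754384, 33.47735903726335],
    [46.00213318067788, 41.3382958924861, 38.83532688735542, 32.84584956840554],
    [42.9462290692388, 38.615759085289056, 36.97634108541546, 35.03970731890584],
    [41.68660600484793, 37.08922699538214, 34.575007291446106, 33.90103276968193],
    [39.22696649347584, 38.27228869398105, 34.77882479101661, 37.48492509092564],
    [40.58451176980721, 39.146208923590486, 35.11715782924591, 34.94516534398124],
    [45.10673529610854, 40.52304010596232, 40.60958309132649, 39.095727834463545],
    [41.32211409739762, 38.19479183926956, 39.90368673062214, 37.76961310322034],
    [41.82446549950164, 40.156713166131446, 38.067470016821176, 35.10891446032838],
    [45.497692940110376, 45.55979626027668, 42.72587329506032, 43.242283258453455],
]
head_rows = [dict(zip(columns, values)) for values in head_values]

def is_dog_correct(row):
    # Same rule as in the notebook: 'dog' wins when its distance is the row minimum.
    return 1 if row['dog_dog'] <= min(row.values()) else 0

head_flags = [is_dog_correct(row) for row in head_rows]
print(head_flags)
```

In the fifth and tenth rows the nearest ‘cat’ neighbor is closer than the nearest ‘dog’ neighbor, which is why those two entries are 0 in the output above.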

In [49]:
dog_distances.apply(is_dog_correct).sum()

678

Hint: To make sure your code is working correctly, if you were to do steps d) and e) in this question to count the number of correctly classified ‘cat’ images in the test data, instead of ‘dog’, the result would be 548.

In [50]:
cat_distances = turicreate.SFrame({'cat_automobile': automobile_model.query(image_test_cat, k=1)['distance'],
                                 'cat_bird': bird_model.query(image_test_cat, k=1)['distance'],
                                 'cat_cat': cat_model.query(image_test_cat, k=1)['distance'],
                                 'cat_dog': dog_model.query(image_test_cat, k=1)['distance'],
                                })

In [51]:
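def is_cat_correct(row):
    # Correct (1) when the nearest 'cat' neighbour is at least as close as the
    # nearest neighbour from every other category; otherwise incorrect (0).
    if row['cat_cat'] <= min(row.values()):
        return 1
    else:
        return 0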

In [52]:
cat_distances.apply(is_cat_correct).sum()

548
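
Since `is_dog_correct` and `is_cat_correct` differ only in the column they compare, one possible refactoring is a small factory that takes the column name. This is a sketch, not part of the original exercise: `make_is_correct` and `example_row` are hypothetical, and the distances in `example_row` are made up.

```python
def make_is_correct(own_column):
    """Build a row function that returns 1 when own_column holds the row minimum."""
    def is_correct(row):
        return 1 if row[own_column] <= min(row.values()) else 0
    return is_correct

# Made-up distances for one test image, keyed like the cat_distances columns.
example_row = {'cat_automobile': 40.0, 'cat_bird': 39.0, 'cat_cat': 35.0, 'cat_dog': 36.0}

is_cat_correct_generic = make_is_correct('cat_cat')
print(is_cat_correct_generic(example_row))
```

Because the returned function takes a single row dictionary, it should be usable with `cat_distances.apply(...)` in the same way as `is_cat_correct`.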

In [53]:
dog_distances.apply(is_dog_correct).sum()/float(len(dog_distances))

0.678