## Sketch Understanding exercise

Complete this worksheet and submit it, including its outputs and any supporting code that lives outside the worksheet.

In this exercise you will:

- work with high-level feature representations of photos and sketches, extracted from a modern CNN pre-trained for image categorization on the ImageNet photo dataset (VGG-19; Simonyan & Zisserman, 2014)
- use the matched photo-sketch dataset from: _Sangkloy, P., Burnell, N., Ham, C., & Hays, J. (2016). The sketchy database: learning to retrieve badly drawn bunnies. ACM Transactions on Graphics (TOG), 35(4), 119._
- compute a "Representational Dissimilarity Matrix" (RDM; Kriegeskorte, 2008) for each image domain (i.e., photos, sketches) using features from different layers of the CNN
- apply clustering to find groups of visually similar object categories
- compare RDMs between image domains for different layers
- practice using common methods from `sklearn`/`scipy`
- practice using `pandas` to manipulate dataframes and design your own custom functions to analyze high-dimensional data

Note that the dataset for this exercise is around 3GB in size. You may experience slowdowns when loading and working with data of this size in a Jupyter notebook running on your local machine. If these slowdowns are prohibitive, this course has access to shared machine learning & data science compute resources through https://datahub.ucsd.edu. If you plan to use the datahub resources, let the instructor know so that we can troubleshoot any issues.
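
To make the idea of an RDM concrete before working with the real features, here is a minimal sketch on random toy data. The array `toy_features` and its shape are assumptions made purely for illustration; they are not part of the dataset or of the worksheet's required solution.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Hypothetical toy data: 5 "categories", each described by 20 feature dimensions
rng = np.random.default_rng(0)
toy_features = rng.normal(size=(5, 20))

# pdist returns the condensed vector of pairwise distances between rows;
# squareform expands it into a symmetric square matrix with a zero diagonal
toy_rdm = squareform(pdist(toy_features, metric='correlation'))
print(toy_rdm.shape)
```

Each entry `toy_rdm[i, j]` holds the correlation distance (1 minus the Pearson correlation) between the feature vectors of rows `i` and `j`.
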

In [None]:
## general
import numpy as np
import os, sys
import pandas as pd

## plotting
import matplotlib
from matplotlib import pylab, mlab, pyplot
%matplotlib inline
from IPython.core.pylabtools import figsize, getfigs
plt = pyplot
import matplotlib as mpl
mpl.rcParams['pdf.fonttype'] = 42

import seaborn as sns
sns.set_context('talk')
sns.set_style('darkgrid')

## sklearn
import sklearn
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn import svm
from sklearn import linear_model, datasets, neighbors
from sklearn.cluster import AffinityPropagation, KMeans
from sklearn import metrics

## scipy
import scipy.spatial.distance as dist
from scipy import stats

## notebook UI helper (clears a cell's output between progress messages)
from IPython.display import clear_output

### define paths

In [None]:
## directory & file hierarchy
proj_dir = os.path.abspath(os.getcwd())
feature_dir = os.path.join(proj_dir,'features')

### helper functions

In [None]:
def normalize(X):
    '''
    z-score normalization to center and re-scale embeddings
    '''
    X = X - X.mean(0)
    X = X / np.maximum(X.std(0), 1e-5)
    return X

def flatten(x):
    '''
    flatten a list of lists
    '''
    return [item for sublist in x for item in sublist]

def preprocess_meta(M):
    '''
    input: pandas dataframe with a column named 'path'
    output: copy of pandas dataframe with additional 'category' and 'fname' columns, parsed from 'path'
            'fname' stands for filename
    '''    
    #############################################################################
    # TODO: Fill in this function according to docstring.                       #    
    #############################################################################
        
    return M

def load_features(feature_dir,
                  layer_name='FC6',
                  data_type='photo',
                  normed=True):
    '''
    load in features (.npy) and metadata (.csv) for particular layer of VGG
    data type: 'photo' or 'sketch'
    normed: boolean indicating whether to z-score features within feature dimension
    '''
    F = np.load(os.path.join(feature_dir,'FEATURES_VGG_{}_{}.npy'.format(layer_name,data_type)))
    M = pd.read_csv(os.path.join(feature_dir,'METADATA_{}.csv'.format(data_type)))
    M = preprocess_meta(M)
    if normed:
        F = normalize(F)
    return F, M

def extract_features_only(DF):
    '''
    input: dataframe with both feature indices and metadata columns
    output: dataframe with only feature indices (numerical)
    '''
    num_feats = len([i for i in list(DF.columns) if type(i) is not str]) ## only numeric columns extracted
    return DF[list(np.arange(num_feats))]

def visualize_matrix(D,obj_list=None):
    '''
    generate visualization of custom distance matrix
    '''
    sns.set_style('dark')
    fig = plt.figure(figsize=(16,16))
    plt.matshow(D,cmap='viridis',fignum=0) ## fignum=0 draws into the figure created above instead of opening a second one
    
    ## plot params
    plt.colorbar(fraction=0.045)
    plt.tick_params(
        axis='x',          # changes apply to the x-axis
        which='both',      # both major and minor ticks are affected
        bottom=False,      # ticks along the bottom edge are off
        top=False,         # ticks along the top edge are off
        labelbottom=False) # labels along the bottom edge are off    
    
    ## add object labels, if passed to func
    if obj_list is not None:
        plt.xticks(range(len(D)), obj_list, fontsize=9,rotation='vertical')
        plt.yticks(range(len(D)), obj_list, fontsize=9)  
        
def apply_clustering(DF, n_clusters=4):
    '''
    apply KMeans clustering to feature vectors and add cluster indices to dataframe
    '''
    F = extract_features_only(DF)
    #############################################################################
    # TODO: Apply KMeans clustering with n_clusters, then add new column to DF  #
    # called `cluster_inds` that contains the cluster indices.                  #
    #############################################################################    
    return DF

def get_common_cluster_inds(Pmean,Smean, n_clusters=14):
    '''
    input: class-mean feature representation dataframes for photos (Pmean) and sketches (Smean) 
    purpose: apply clustering to photo feature matrix, and use to get common cluster indices that are then 
             added to both Pmean and Smean
    output: Pmean and Smean with additional column named 'common_cluster_inds'
    '''
    _Pmean = apply_clustering(Pmean, n_clusters=n_clusters)
    #############################################################################
    # TODO: Fill in this function according to docstring.                       #    
    #############################################################################    
    
    return Pmean, Smean                        

def get_ordered_distance_matrix(DF, 
                                metric='correlation', 
                                viz=True):
    '''
    input:
        DF is a dataframe containing feature columns and metadata
        metric: pick a distance metric from options available from scipy.spatial.distance
                e.g., correlation, cosine, cityblock, euclidean
        viz is boolean flag to control whether we visualize matrix or not
    '''
    #############################################################################
    # TODO: Fill in this function. Compute distance matrix using pdist and      #
    # squareform from scipy.spatial.distance. Make sure that categories are     #    
    # ordered in same way for both photo and sketch domains in distance matrix  #
    # that is passed to visualize matrix below.                                 #
    #############################################################################        
    
    if viz==True:
        if obj_list is not None:
            visualize_matrix(D,obj_list=obj_list)
        else:
            visualize_matrix(D)
    return D

def get_upper_triangle(D):
    #############################################################################
    # TODO: Return only values (inds) in upper triangle of square matrix.       #
    #############################################################################
    return D[inds]

def evaluate_rdm_similarity(D1,D2):
    '''
    input: two distance matrices
    output: r, Spearman rank correlation between values in the upper triangle of these two matrices
    '''
    #############################################################################
    # TODO: Fill in function according to docstring, using get_upper_triangle.  #
    #############################################################################        
    return r 
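
The comparison described in the `evaluate_rdm_similarity` docstring can be illustrated on toy data. The sketch below is one possible approach under stated assumptions, not the worksheet's reference solution: the arrays `base` and `noisy` are hypothetical, and `np.triu_indices` is just one of several ways to select the upper triangle.

```python
import numpy as np
from scipy import stats
from scipy.spatial.distance import pdist, squareform

# Hypothetical toy data: 6 categories, plus a slightly perturbed copy
rng = np.random.default_rng(1)
base = rng.normal(size=(6, 30))
noisy = base + 0.1 * rng.normal(size=base.shape)

rdm_a = squareform(pdist(base, metric='correlation'))
rdm_b = squareform(pdist(noisy, metric='correlation'))

# k=1 skips the diagonal, so each pair of categories is counted exactly once
rows, cols = np.triu_indices(rdm_a.shape[0], k=1)
rho, pval = stats.spearmanr(rdm_a[rows, cols], rdm_b[rows, cols])
print(len(rows), np.round(rho, 3))
```

Only the upper triangle is used because a distance matrix is symmetric and its diagonal is zero by construction; including those entries would count every pair twice and add uninformative values.
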

### Explore "FC6" feature representation of matched photos and sketches 

In [None]:
## load in features and metadata
PF,PM = load_features(feature_dir,layer_name='FC6',data_type='photo') ## photos
SF,SM = load_features(feature_dir,layer_name='FC6',data_type='sketch') ## sketches

In [None]:
## concatenate feature matrix and metadata along columns
P = pd.concat([pd.DataFrame(PF),PM],axis=1)
S = pd.concat([pd.DataFrame(SF),SM],axis=1)

In [None]:
## get category-mean feature vectors for each image domain
#############################################################################
# TODO: Using groupby from pandas, compute mean feature vectors for each    #
# category for P and S, and assign to variables: Pmean & Smean, resp.       #
#############################################################################
Pmean = P##
Smean = S##
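
As a hedged illustration of the `groupby` pattern this step refers to, here is a sketch on a small made-up dataframe. The dataframe `toy` and its values are hypothetical; whether to call `reset_index` depends on how your later functions expect to find the `category` column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: integer column labels (like the feature columns) plus metadata
toy = pd.DataFrame(np.arange(12, dtype=float).reshape(4, 3))
toy['category'] = ['cat', 'cat', 'dog', 'dog']

# numeric_only=True skips any remaining string-valued metadata columns;
# reset_index turns the group key back into an ordinary column
toy_mean = toy.groupby('category').mean(numeric_only=True).reset_index()
print(toy_mean)
```

The result has one row per category, with each feature column holding the mean over that category's rows.
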

In [None]:
## get common cluster inds, based on clustering applied to one of these image domains
Pmean, Smean = get_common_cluster_inds(Pmean,Smean,n_clusters=14)
#############################################################################
# TODO: Play around with different values of n_clusters                     #
#############################################################################
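
To build intuition for how `n_clusters` shapes the result, here is a small sketch of `KMeans` on two well-separated toy blobs. The arrays and parameter values (`n_init=10`, `random_state=0`) are assumptions chosen for illustration, not settings prescribed by this worksheet.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: two tight, well-separated groups of 5 points each
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=0.0, scale=0.1, size=(5, 4))
blob_b = rng.normal(loc=5.0, scale=0.1, size=(5, 4))
toy_X = np.vstack([blob_a, blob_b])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(toy_X)
labels = km.labels_

# Sorting rows by cluster label places members of the same cluster next to each other
order = np.argsort(labels, kind='stable')
print(labels, order)
```

Ordering the rows and columns of a distance matrix by a shared set of cluster labels is what makes block structure visible and keeps the photo and sketch matrices comparable.
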

In [None]:
## get ordered distance matrix (ordered on common cluster inds)
PD = get_ordered_distance_matrix(Pmean,viz=True,metric='correlation')
SD = get_ordered_distance_matrix(Smean,viz=True,metric='correlation')
#############################################################################
# TODO: Play around with different choices of distance metric.              #
#############################################################################


In [None]:
if 'PD' in locals():
    r = evaluate_rdm_similarity(PD,SD)
    print('FC6 Similarity between Photo and Sketch Domains = ', np.round(r,4))

### Generalize to other layers

In [None]:
def analyze_layer_append_result(R=None,layer_name='FC6',viz=False):
    '''
    input: R = dictionary to store results of cross-domain similarity analysis. 
               keys are layer names, values are RDM similarities            
           layer_name = string in ['P1','P2','P3','P4','P5','FC6','FC7']
           viz = boolean flag to control whether to display matrices or not
    output: R = same dictionary with additional layer result appended    
    '''

    #############################################################################
    # TODO: Fill in this function with the necessary steps to apply all of the  #
    # analysis steps from above to an arbitrary layer, by name. These steps     #
    # should yield two ordered distance matrices, to be passed into             #
    # evaluate_rdm_similarity.                                                  #    
    #############################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    
    ## load in features and metadata

    ## concatenate feature matrix and metadata along columns

    ## get category-mean feature vectors for each image domain

    ## get common cluster inds, based on clustering applied to one of these image domains

    ## get ordered distance matrix (ordered on common cluster inds)
    pass

    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****        
        
    r = evaluate_rdm_similarity(PD,SD)
    if R is None:
        R = dict()
    R[layer_name] = r    
    return R


def analyze_all_layers(R=None,layer_list=['P1','P2','P3','P4','P5','FC6','FC7']):
    '''
    iterate over all layers, calling func analyze_layer_append_result
    '''
    if R is None:
        R = dict()
    for i,layer_name in enumerate(layer_list):
        print('Analyzing layer {} ...'.format(layer_name))
        R = analyze_layer_append_result(R,layer_name=layer_name,viz=False)
        clear_output(wait=True)
    return R

In [None]:
R = analyze_all_layers()

### bundle cross-domain similarity numbers into dataframe

In [None]:
#############################################################################
# TODO: Convert R into a dataframe called SIM that has the following        #
# columns: `layer` and `similarity`, where the similarity values are the    #
# correlation between photo-sketch RDMs for each layer of VGG.              #
#############################################################################
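
One way to turn a dictionary into a two-column dataframe is sketched below. The dictionary `toy_R` holds made-up placeholder numbers for illustration only; they are not results from this analysis.

```python
import pandas as pd

# Placeholder values, NOT real results
toy_R = {'P1': 0.10, 'FC6': 0.50}

toy_SIM = pd.DataFrame({'layer': list(toy_R.keys()),
                        'similarity': list(toy_R.values())})
print(toy_SIM)
```

Python dictionaries preserve insertion order, so the rows appear in the order in which the layers were added.
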

In [None]:
## inspect SIM
if 'SIM' in locals():
    display(SIM) ## a bare expression inside an if-block is not rendered automatically

### generate visualization

In [None]:
#############################################################################
# TODO: Generate lineplot of cross-domain RDM similarity by layer using     #
# seaborn.lineplot. Make sure that all of your axes are labeled and are     #
# scaled appropriately. Save figure out as a PNG/PDF image.                 #
#############################################################################