Version 1.1.0

# The task

In this assignment you will implement features based on nearest neighbours.

A KNN classifier (or regressor) is a very powerful model when the features are homogeneous, and it is very common practice to use KNN as a first-level model. In this homework we will extend the KNN model and compute more features based on nearest neighbors and their distances.
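
To make the idea concrete, here is a minimal sketch of what "nearest neighbors and their distances" means for a single query point. It uses brute-force Euclidean distance in plain NumPy rather than the `NearestNeighbors` class used later in this notebook; the function name `k_nearest` and the toy data are made up for illustration.

```python
import numpy as np

def k_nearest(train_X, train_y, x, k):
    """Return (distances, labels) of the k training rows closest to x."""
    # Euclidean distance from x to every training row
    dists = np.sqrt(((train_X - x) ** 2).sum(axis=1))
    # Indices of the k smallest distances, nearest first
    order = np.argsort(dists, kind="stable")[:k]
    return dists[order], train_y[order]

# Hypothetical toy data: four points with integer class labels
train_X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0]])
train_y = np.array([0, 1, 1, 2])
dists, labels = k_nearest(train_X, train_y, np.array([0.0, 0.1]), k=3)
```

Every feature in this assignment is derived from these two vectors: the sorted neighbor distances and the labels of those neighbors.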

You will implement a number of features that were among the key features that led the instructors to prizes in the [Otto](https://www.kaggle.com/c/otto-group-product-classification-challenge) and [Springleaf](https://www.kaggle.com/c/springleaf-marketing-response) competitions. Of course, the list of features you will implement can be extended; in fact, in the competitions the list was at least 3 times larger. So when solving a real competition, do not hesitate to make up your own features.

You can optionally implement multicore feature computation. Nearest neighbours are expensive to compute, so it is preferable to have a parallel version of the algorithm. Knowing how to use `multiprocessing`, `joblib`, and similar tools is a useful skill, and in this homework you will have a chance to see the benefits of a parallel algorithm.
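
As a rough illustration of the parallel pattern, the sketch below maps a per-row function over a list of rows with `multiprocessing.Pool`. The function `row_feature` is a hypothetical stand-in for a per-object feature computation, not the code of this assignment.

```python
from multiprocessing import Pool

def row_feature(row):
    """Toy per-row feature: the sum of squares of one row."""
    return sum(v * v for v in row)

def compute_serial(rows):
    # Baseline: one row after another in the current process
    return [row_feature(r) for r in rows]

def compute_parallel(rows, n_jobs):
    # Pool.map preserves input order, so results line up with rows
    with Pool(n_jobs) as pool:
        return pool.map(row_feature, rows)

if __name__ == "__main__":
    rows = [[1, 2], [3, 4], [5, 6]]
    # Both versions must agree; only the wall-clock time differs
    assert compute_parallel(rows, n_jobs=2) == compute_serial(rows)
```

The mapped function lives at module level because worker processes receive it by pickling, and the `if __name__ == "__main__":` guard matters on platforms that start workers by launching a fresh interpreter.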

# Check your versions

Some functions we use here are not present in old versions of the libraries, so make sure you have up-to-date software. 

In [None]:
import numpy as np
import pandas as pd 
import sklearn
import scipy.sparse 

for p in [np, pd, sklearn, scipy]:
    print (p.__name__, p.__version__)

The versions should be at least:

    numpy 1.13.1
    pandas 0.20.3
    sklearn 0.19.0
    scipy 0.19.1
   
**IMPORTANT!** The results with `scipy=1.0.0` will be different! Make sure you use _exactly_ version `0.19.1`.
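
If you would rather have the version check fail loudly than read the printout by eye, one option is to compare versions as integer tuples. This is only a sketch: `version_tuple` and `meets_minimum` are hypothetical helpers, and the parsing handles plain dotted versions with an optional trailing suffix, nothing more.

```python
def version_tuple(version):
    """Turn '0.19.1' into (0, 19, 1); a non-numeric suffix such as 'rc1' is dropped."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            # Stop at the first piece with no leading digits, e.g. 'dev0'
            break
        parts.append(int(digits))
    return tuple(parts)

def meets_minimum(installed, minimum):
    # Tuples compare element by element, so (0, 9, 0) < (0, 19, 0)
    return version_tuple(installed) >= version_tuple(minimum)

ok = meets_minimum("0.20.3", "0.20.3")
```

Comparing the raw strings would rank `'0.9.0'` above `'0.19.0'`, which is why the pieces are converted to integers first. An exact pin, such as the `scipy` requirement above, can be checked with `==` on the tuples.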

# Load data

Load features and labels. These features are actually OOF (out-of-fold) predictions of linear models.

In [None]:
from pathlib import Path

features_data_path = Path('.').absolute().parent.joinpath('readonly', 'KNN_features_data')

train_path = features_data_path.joinpath('X.npz')
train_labels = features_data_path.joinpath('Y.npy')

test_path = features_data_path.joinpath('X_test.npz')
test_labels = features_data_path.joinpath('Y_test.npy')

In [None]:
# Train data
X = scipy.sparse.load_npz(train_path)
Y = np.load(train_labels)

# Test data
X_test = scipy.sparse.load_npz(test_path)
Y_test = np.load(test_labels)

# Out-of-fold features we loaded above were generated with n_splits=4 and skf seed 123
# So it is better to use seed 123 for generating KNN features as well 
skf_seed = 123
n_splits = 4

The world's shortest EDA:

In [None]:
Y[:5]

Below you need to implement features based on nearest neighbors.
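
The first feature described in the class below is the fraction of each class among the K nearest neighbors. A minimal sketch of that computation, assuming integer labels in `0..n_classes-1`, uses the `minlength` argument of `np.bincount` so the result always has one entry per class. The function name `class_fractions` and the toy labels are illustrative only.

```python
import numpy as np

def class_fractions(neighbor_labels, k, n_classes):
    """Fraction of each class among the k nearest neighbors."""
    # minlength pads with zeros for classes that do not occur among the neighbors
    counts = np.bincount(neighbor_labels[:k], minlength=n_classes)
    return counts / counts.sum()

# Hypothetical neighbor labels, ordered from nearest to farthest
labels = np.array([0, 0, 2, 1, 2])
fracs = class_fractions(labels, k=3, n_classes=4)
```

With `k=3` only the labels `[0, 0, 2]` are counted, so class 0 gets two thirds, class 2 gets one third, and the remaining classes get zero.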

In [None]:
import sys

def parallel_class_method_call(parameters):
    """
    NOTE: This is actually only needed in python2
    
    A helper for calling class methods in parallel

    Parameters
    ----------
    parameters : list or tuple
        A list containing the following elements:
        class_name : str
            The name of the class
        class_state : dict
            The __dict__ attribute (i.e. the state) of the class
        method_name : str
            Name of the class method to call
        args
            The positional arguments passed to the function to be called
        kwargs
            The keyword arguments passed to the function to be called

    Returns
    -------
    result
        The result of the call

    References
    ----------
    https://stackoverflow.com/questions/44185770/call-multiprocessing-in-class-method-python
    http://qingkaikong.blogspot.com/2016/12/python-parallel-method-in-class.html
    """

    class_name, class_state, method_name, args, kwargs = parameters

    # Get our class type
    cls = getattr(sys.modules[__name__], class_name)
    # Create a new instance without invoking __init__
    instance = cls.__new__(cls)
    # Apply the passed state to the new instance
    instance.__dict__ = class_state
    # Get the requested method
    method = getattr(instance, method_name)

    # Properly format the arguments
    args = args if isinstance(args, (list, tuple)) else (args,)
    if kwargs is None:
        kwargs = dict()

    # Call the function
    result = method(*args, **kwargs)

    return result
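
The core trick in the helper above is rebuilding an object from its saved `__dict__` without running `__init__`. The toy sketch below shows that mechanism in isolation; the `Scaler` class and `call_from_state` function are hypothetical and exist only for this illustration.

```python
class Scaler:
    def __init__(self, factor):
        self.factor = factor

    def scale(self, value):
        return self.factor * value

def call_from_state(cls, state, method_name, *args):
    # Allocate an instance without running __init__ ...
    instance = cls.__new__(cls)
    # ... then restore its attributes from the saved state
    instance.__dict__.update(state)
    return getattr(instance, method_name)(*args)

original = Scaler(factor=3)
# Copy the state, as a worker process would receive it after pickling
result = call_from_state(Scaler, dict(original.__dict__), "scale", 7)
```

Because only plain data (a class reference, a dict, a method name, and arguments) crosses the process boundary, this pattern also works where bound methods cannot be pickled.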

In [None]:
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.neighbors import NearestNeighbors
from multiprocessing import Pool

import numpy as np


class NearestNeighborsFeats(BaseEstimator, ClassifierMixin):
    '''
        This class should implement KNN features extraction 
    '''
    def __init__(self, n_jobs, k_list, metric, n_classes=None, n_neighbors=None, eps=1e-6):
        self.n_jobs = n_jobs
        self.k_list = k_list
        self.metric = metric
        
        if n_neighbors is None:
            self.n_neighbors = max(k_list) 
        else:
            self.n_neighbors = n_neighbors
            
        self.eps = eps
        
        # NOTE: Temporary variable (notice trailing _)
        # NOTE: If None, this will be set in self.fit()
        self.n_classes_ = n_classes
    
    def fit(self, X, y):
        '''
            Sets up the train set and self.NN object
        '''
        # Create a NearestNeighbors (NN) object. We will use it in `predict` function 
        self.NN = NearestNeighbors(n_neighbors=max(self.k_list), 
                                   metric=self.metric, 
                                   n_jobs=1, 
                                   algorithm='brute' if self.metric=='cosine' else 'auto')
        self.NN.fit(X)
        
        # Store labels 
        self.y_train = y
        
        # Save how many classes we have
        self.n_classes = np.unique(y).shape[0] if self.n_classes_ is None else self.n_classes_
        
        
    def predict(self, X):       
        '''
            Produces KNN features for every object of a dataset X
        '''
        if self.n_jobs == 1:
            test_feats = []
            for i in range(X.shape[0]):
                test_feats.append(self.get_features_for_one(X[i:i+1]))
        else:
            '''
                 *Make it parallel*
                     The number of worker processes should be controlled by `self.n_jobs`  
                     
                     
                     You can use whatever you want to do it
                     For Python 3 the simplest option would be to use 
                     `multiprocessing.Pool` (but don't use `multiprocessing.dummy.Pool` here)
                     You may try to use `joblib`, but you will most likely encounter an error 
                     that you will need to look up (and eventually it will work slowly)
                     
                     For Python 2 I also suggest using `multiprocessing.Pool` 
                     You will need to use a hint from this blog 
                     http://qingkaikong.blogspot.ru/2016/12/python-parallel-method-in-class.html
                     I could not get `joblib` working at all for this code 
                     (but in general `joblib` is very convenient)
                     
            '''
            
            # YOUR CODE GOES HERE
            with Pool(self.n_jobs) as p:
                test_feats = p.map(parallel_class_method_call,
                                   self.prepare_parallel_map_call('get_features_for_one',
                                                                  [X[i:i+1] for i in range(X.shape[0])]))
             
            # assert False, 'You need to implement it for n_jobs > 1'
            
        return np.vstack(test_feats)
            
    def prepare_parallel_map_call(self, class_method_name, args):
        """
        NOTE: This is actually only needed in python2

        Prepares a parallel map call
        
        Parameters
        ----------
        class_method_name : str
            Name of the class method to be called
        args : list or tuple
            The elements to be used in the map call
            
        Yields
        ------
        parallel_call_arguments : list
            Arguments passed to the parallel_class_method_call function
        """
        
        for arg in args:
            parallel_call_arguments = [self.__class__.__name__, self.__dict__, class_method_name, arg, None]
            yield parallel_call_arguments
        
    def get_features_for_one(self, x):
        '''
            Computes KNN features for a single object `x`
        '''

        # NOTE: kneighbors(x) finds the K-neighbors of a point
        #       Returns distance and index of nearest point as
        #       (array([[first_closest_distance, second_closest_distance]]),
        #        array([[first_closest_index, second_closest_index]])) 
        NN_output = self.NN.kneighbors(x)
        
        # Vector of size `n_neighbors`
        # Stores indices of the neighbors
        # NOTE: These are the indices
        neighs = NN_output[1][0]
        
        # Vector of size `n_neighbors`
        # Stores distances to corresponding neighbors
        # NOTE: These are the distances
        neighs_dist = NN_output[0][0] 

        # Vector of size `n_neighbors`
        # Stores labels of corresponding neighbors
        # NOTE: Slicing by indices
        neighs_y = self.y_train[neighs] 
        
        ## ========================================== ##
        ##              YOUR CODE BELOW
        ## ========================================== ##
        
        # We will accumulate the computed features here
        # Eventually it will be a list of lists or np.arrays
        # and we will use np.hstack to concatenate those
        return_list = [] 
        
        
        ''' 
            1. Fraction of objects of every class.
               It is basically the KNNClassifier's predictions.

               Take a look at `np.bincount` function, it can be very helpful
               Note that the values should sum up to one
        '''
        # NOTE: k_list is the list of the 'k' (i.e. number of nearest neighbors) in kNN
        for k in self.k_list:
            # YOUR CODE GOES HERE
            
            # NOTE: We are after finding the fraction of the objects in every class
            #       I.e. if neighs_y is ['cat', 'cat', 'dog'], and the possible classes are 
            #       'cat', 'dog', 'parrot', the fractions would be [0.67, 0.33, 0]
            
            # NOTE: We need to know the unique classes and be able to map between labels and numbers
            #       From the EDA, we know that Y contains only integers
            #       Since np.bincount([3]) returns array([0, 0, 0, 1]), np.bincount(neighs_y)
            #       only returns counts up to the highest class present in neighs_y,
            #       so the result may be shorter than n_classes
            
            # Initialize feats so that it has the same length as n_classes
            feats = np.zeros(self.n_classes)
            # Get the fraction bincount (up to the desired k), and store it in feats
            frac_bincount = np.bincount(neighs_y[:k])/len(neighs_y[:k])
            feats[:len(frac_bincount)] = frac_bincount
            
            assert len(feats) == self.n_classes
            return_list += [feats]
        
        
        '''
            2. Same label streak: the largest number N, 
               such that N nearest neighbors have the same label.
               
               What can help you: `np.where`
        '''
        
        # YOUR CODE GOES HERE
        
        # NOTE: I.e. how many of the nearest neighbors, starting from the closest, share the same label
        #       if neighs_y == [5,3,1], the label streak would be 1
        #       if neighs_y == [5,5,1], the label streak would be 2, as two 5s follow each other
        #       if neighs_y == [3,5,5], the label streak would be 1, as the streak must start at the nearest neighbor
        
        streak = 1
        # NOTE: No need to loop over the first element, as that is what we are comparing against
        for label in neighs_y[1:]:
            if label == neighs_y[0]:
                streak += 1
            else:
                break
                
        feats = [streak]         
        
        assert len(feats) == 1
        return_list += [feats]
        
        '''
            3. Minimum distance to objects of each class
               Find the first instance of a class and take its distance as features.
               
               If there are no neighboring objects of some classes, 
               Then set distance to that class to be 999.

               `np.where` might be helpful
        '''
        feats = []
        for c in range(self.n_classes):  
            # YOUR CODE GOES HERE
            
            # np.where returns a one-element tuple in this case (as neighs_y is 1D)
            # Hence, we unwrap the tuple with the [0]
            findings = np.where(neighs_y == c)[0]
            
            # If np.where found anything, select the first instance in neighs_dist, else use 999 as distance
            if len(findings) > 0:
                first_ind = findings[0]
                feats.append(neighs_dist[first_ind])
            else:
                feats.append(999)
        
        assert len(feats) == self.n_classes
        return_list += [feats]
        
        '''
            4. Minimum *normalized* distance to objects of each class
               As 3. but we normalize (divide) the distances
               by the distance to the closest neighbor.
               
               If there are no neighboring objects of some classes, 
               Then set distance to that class to be 999.
               
               Do not forget to add self.eps to denominator.
        '''
        feats = []
        for c in range(self.n_classes):
            # YOUR CODE GOES HERE
        
            findings = np.where(neighs_y == c)[0]
            
            # If np.where found anything, select the first instance in neighs_dist, else use 999 as distance
            if len(findings) > 0:
                first_ind = findings[0]
                feats.append(neighs_dist[first_ind]/(neighs_dist[0]+self.eps))
            else:
                feats.append(999)        
        
        assert len(feats) == self.n_classes
        return_list += [feats]
        
        '''
            5. 
               5.1 Distance to Kth neighbor
                   Think of this as of quantiles of a distribution
               5.2 Distance to Kth neighbor normalized by 
                   distance to the first neighbor
               
               feat_51, feat_52 are answers to 5.1. and 5.2.
               should be scalars
               
               Do not forget to add self.eps to denominator.
        '''
        for k in self.k_list:
            # YOUR CODE GOES HERE
            
            # NOTE: Use k-1 because k counts neighbors from 1 while array indices start at 0
            feat_51 = neighs_dist[k-1]
            feat_52 = neighs_dist[k-1]/(neighs_dist[0]+self.eps)
            
            return_list += [[feat_51, feat_52]]
        
        '''
            6. Mean distance to neighbors of each class for each K from `k_list` 
                   For each class select the neighbors of that class among K nearest neighbors 
                   and compute the average distance to those objects
                   
                   If there are no objects of a certain class among K neighbors, set mean distance to 999
                   
               You can use `np.bincount` with appropriate weights
               Don't forget, that if you divide by something, 
               You need to add `self.eps` to denominator.
        '''
        for k in self.k_list:
            
            # YOUR CODE GOES IN HERE

            feats = []
            for c in range(self.n_classes):
                findings = np.where(neighs_y[:k] == c)[0]
                if len(findings) > 0:
                    # NOTE: Because of the epsilon in the denominator, using np.mean as below
                    #       would give a non-zero deviation in the sanity check
                    # feats.append(np.mean(neighs_dist[findings]))
                    feats.append(np.sum(neighs_dist[findings])/(len(neighs_dist[findings]) + self.eps))
                else:
                    feats.append(999)
            
            assert len(feats) == self.n_classes
            return_list += [feats]    
        
        # merge
        knn_feats = np.hstack(return_list)
        
        assert knn_feats.shape == (239,) or knn_feats.shape == (239, 1)
        return knn_feats
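
The docstrings above hint at `np.where` for the same-label streak and at `np.bincount` with weights for the per-class mean distance. The sketch below shows one way those hints could be used, as an alternative to the explicit loops; it assumes integer labels in `0..n_classes-1`, and the function names are made up for illustration.

```python
import numpy as np

def same_label_streak(neighbor_labels):
    """Length of the run of leading neighbors that share the nearest neighbor's label."""
    differs = np.where(neighbor_labels != neighbor_labels[0])[0]
    # If no label differs, the streak covers every neighbor
    return int(differs[0]) if len(differs) > 0 else len(neighbor_labels)

def mean_distance_per_class(neighbor_labels, neighbor_dists, k, n_classes,
                            eps=1e-6, missing=999):
    """Mean distance to the neighbors of each class among the k nearest."""
    counts = np.bincount(neighbor_labels[:k], minlength=n_classes)
    # With weights, bincount sums the distances per class instead of counting
    sums = np.bincount(neighbor_labels[:k], weights=neighbor_dists[:k],
                       minlength=n_classes)
    means = sums / (counts + eps)
    # Classes absent from the k neighbors get the sentinel value
    means[counts == 0] = missing
    return means

streak = same_label_streak(np.array([5, 5, 1]))
means = mean_distance_per_class(np.array([0, 0, 2]), np.array([1.0, 3.0, 2.0]),
                                k=3, n_classes=3)
```

Both functions avoid a Python loop over classes, which matters when they are called once per object and per value of K.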

## Sanity check

To make sure you've implemented everything correctly, we provide you with the correct features for the first 50 objects.

In [None]:
# A list of K values for KNN (K counts neighbors starting from one)
k_list = [3, 8, 32]

# Load correct features
first50_path = features_data_path.joinpath('knn_feats_test_first50.npy')
true_knn_feats_first50 = np.load(first50_path)

# Create instance of our KNN feature extractor
NNF = NearestNeighborsFeats(n_jobs=1, k_list=k_list, metric='minkowski')

# Fit on train set
NNF.fit(X, Y)

# Get features for test
test_knn_feats = NNF.predict(X_test[:50])

# !!!!!!!!!!NOTE!!!!!!!!!!
# The following lines have the [44:45] bug:
# This should be zero
# print ('Deviation from ground truth features: %f' % np.abs(test_knn_feats - true_knn_feats_first50[44:45]).sum())
# deviation =np.abs(test_knn_feats - true_knn_feats_first50[44:45]).sum(0)

# NOTE: We change them to this
# This should be zero
print ('Deviation from ground truth features: %f' % np.abs(test_knn_feats - true_knn_feats_first50).sum())

deviation = np.abs(test_knn_feats - true_knn_feats_first50).sum(0)

for m in np.where(deviation > 1e-3)[0]: 
    p = np.where(np.array([87, 88, 117, 146, 152, 239]) > m)[0][0]
    print ('There is a problem in feature %d, which is a part of section %d.' % (m, p + 1))
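
The boundary list `[87, 88, 117, 146, 152, 239]` in the check above is the running total of the six feature-section widths. The sketch below rebuilds it, assuming 29 classes and three values of K. The class count is inferred from the boundaries themselves rather than stated anywhere in this notebook, and the helper names are hypothetical.

```python
import numpy as np

def section_boundaries(n_classes, n_ks):
    """Cumulative end index (exclusive) of each of the six feature sections."""
    widths = [
        n_ks * n_classes,  # 1. class fractions, one block per K
        1,                 # 2. same-label streak
        n_classes,         # 3. minimum distance per class
        n_classes,         # 4. normalized minimum distance per class
        2 * n_ks,          # 5. distance to K-th neighbor, raw and normalized
        n_ks * n_classes,  # 6. mean distance per class, one block per K
    ]
    return np.cumsum(widths)

def section_of(feature_index, boundaries):
    # The first boundary strictly greater than the index marks its section (1-based)
    return int(np.searchsorted(boundaries, feature_index, side="right")) + 1

bounds = section_boundaries(n_classes=29, n_ks=3)
```

If the sanity check reports a problem in, say, feature 150, `section_of(150, bounds)` points to section 5, the distances to the K-th neighbor.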

Now implement parallel computations and compute features for the train and test sets. 

## Get features for test

Now compute features for the whole test set.

In [None]:
data_path = Path('.').absolute().joinpath('data')
data_path.mkdir(exist_ok=True)

for metric in ['minkowski', 'cosine']:
    print (metric)
    
    # Create instance of our KNN feature extractor
    NNF = NearestNeighborsFeats(n_jobs=4, k_list=k_list, metric=metric)
    
    # Fit on train set
    NNF.fit(X, Y)

    # Get features for test
    test_knn_feats = NNF.predict(X_test)
    
    # Dump the features to disk
    np.save(data_path.joinpath(f'knn_feats_{metric}_test.npy'), test_knn_feats)

## Get features for train

Compute features for train, using out-of-fold strategy.
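
Out-of-fold computation matters for KNN features because a training object would otherwise find itself as its own nearest neighbor at distance zero. The sketch below illustrates the strategy with a plain, non-stratified K-fold split and a 1-nearest-neighbor label as the only "feature". It is a simplification of what `cross_val_predict` with `StratifiedKFold` does, and `oof_nearest_label` and the toy data are made up for illustration.

```python
import numpy as np

def oof_nearest_label(X, y, n_splits, seed):
    """Out-of-fold 1-NN label for every row: each row only sees the other folds."""
    rng = np.random.RandomState(seed)
    folds = np.array_split(rng.permutation(len(X)), n_splits)
    oof = np.full(len(X), -1)
    for held_out in folds:
        train_idx = np.setdiff1d(np.arange(len(X)), held_out)
        # Distances from each held-out row to every row of the remaining folds
        diffs = X[held_out][:, None, :] - X[train_idx][None, :, :]
        dists = np.sqrt((diffs ** 2).sum(axis=2))
        # Label of the closest row that is NOT in the held-out fold
        oof[held_out] = y[train_idx][dists.argmin(axis=1)]
    return oof

# Hypothetical toy data: two well-separated clusters on a line
X_demo = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])
y_demo = np.array([0, 0, 0, 1, 1, 1])
oof = oof_nearest_label(X_demo, y_demo, n_splits=3, seed=123)
```

Every row is filled exactly once, from a model that never saw it, which is the same guarantee the train features below rely on.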

In [None]:
# Unlike in other homework, we will not implement OOF predictions ourselves
# but use sklearn's `cross_val_predict`
from sklearn.model_selection import cross_val_predict
from sklearn.model_selection import StratifiedKFold

# We will use two metrics for KNN
for metric in ['minkowski', 'cosine']:
    print (metric)
    
    # Set up splitting scheme, use StratifiedKFold
    # use skf_seed and n_splits defined above with shuffle=True
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=skf_seed)
    
    # Create instance of our KNN feature extractor
    # n_jobs can be larger than the number of cores
    NNF = NearestNeighborsFeats(n_jobs=4, k_list=k_list, metric=metric)
    
    # Get KNN features out-of-fold: use cross_val_predict with the right parameters
    preds = cross_val_predict(NNF, X, y=Y, cv=skf)
    
    # Save the features
    np.save(data_path.joinpath(f'knn_feats_{metric}_train.npy'), preds)

# Submit

If you made the above cells work, just run the following cell to produce a number to submit.

In [None]:
s = 0
for metric in ['minkowski', 'cosine']:
    knn_feats_train = np.load(data_path.joinpath(f'knn_feats_{metric}_train.npy'))
    knn_feats_test = np.load(data_path.joinpath(f'knn_feats_{metric}_test.npy'))
    
    s += knn_feats_train.mean() + knn_feats_test.mean()
    
answer = np.floor(s)
print (answer)

Submit!

In [None]:
from honor_grader import Grader

grader = Grader()

grader.submit_tag('statistic', answer)

STUDENT_EMAIL = ''
STUDENT_TOKEN = ''
grader.status()

grader.submit(STUDENT_EMAIL, STUDENT_TOKEN)