## KNN: A good place to start for supervised classification
#### Lew Sears

I will write a KNN classification algorithm from scratch and then check its results against the *scikit-learn* implementation.

In [1]:
#Mine/yours/everyone's favorite libraries
import pandas as pd
import numpy as np

Before we go any further, remind yourself how to write a class. It's quite simple, but many new programmers have trouble getting started. Our algorithm will be written as a class so we can store information and streamline the workflow, like the estimators we know from scikit-learn.

In [2]:
#Just recall some class things:
class lews_simple_class:
    '''This is a nonsense class'''
    
    def __init__(self, num):
        self.num = num
        
    def multiply(self, a):
        return self.num * a

In [3]:
practice_class = lews_simple_class(5)
practice_class.multiply(2)

10

-----

### Distance Function

Perhaps the workhorse of a KNN classification algorithm is a streamlined distance function. So we begin by writing a function that takes an *n*-dimensional point and a DataFrame with *n* features and calculates the Euclidean distance from the point to every row. When we're working with a lot of data points, we want every function to be as fast as possible. You're thinking what I'm thinking: we should take advantage of NumPy's vectorized array operations, which run in compiled C code rather than in a Python loop. If you don't care about the details, just remember that NumPy is FAST.

In [5]:
def numpy_distance(point, df):
    '''Given a point and a pandas DataFrame, compute the Euclidean distance from the point to every row'''
    try:
        #Broadcasting subtracts the point from every row at once
        return np.sqrt(np.sum((np.array(point) - np.array(df))**2, axis=1))
    except Exception:
        #Fall through to the diagnostic messages below
        pass

    #It helps to add some error messages in case something goes wrong
    try:
        if len(point) != len(df.columns):
            return "Error: The dimensions of your point and DataFrame don't match!"
    except Exception:
        pass

    return "User Error: Please review input critera."

Let's try a quick example with a simple DataFrame and point.

In [6]:
X = pd.DataFrame({'col1': [1,2,3], 'col2': [1,2,3]})
point = [0,0]
numpy_distance(point, X)

array([1.41421356, 2.82842712, 4.24264069])
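
As a sanity check, the same numbers can be reproduced with `np.linalg.norm`. This is only a sketch for comparison, not part of the original notebook; the helper name `broadcast_distance` is made up for illustration and leaves out the error handling.

```python
import numpy as np
import pandas as pd

def broadcast_distance(point, df):
    # Hypothetical helper: same formula as numpy_distance, minus the error handling
    diffs = np.asarray(point) - df.to_numpy()   # broadcasting gives shape (n_rows, n_features)
    return np.sqrt((diffs ** 2).sum(axis=1))

X = pd.DataFrame({'col1': [1, 2, 3], 'col2': [1, 2, 3]})
point = [0, 0]

by_hand = broadcast_distance(point, X)
by_norm = np.linalg.norm(X.to_numpy() - np.asarray(point), axis=1)

# Both routes should agree up to floating point error
print(np.allclose(by_hand, by_norm))
```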

Looks good! You'll notice we have some `try` and `except` statements. The idea is straightforward: if the computation fails, check the likely reasons why and give whoever is using the function a heads-up about what they can change. This is a good habit on any team sharing code. Let's see what happens if the dimensions of the two inputs don't match, or if you put in some nonsense:

In [7]:
X = pd.DataFrame({'col1': [1,2,3], 'col2': [1,2,3]})
point = [0,0,0]
numpy_distance(point, X)

"Error: The dimensions of your point and DataFrame don't match!"

In [8]:
numpy_distance([0,'apple'], 'cat')

'User Error: Please review input critera.'
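
Returning a message string keeps the notebook output friendly, but the caller then has to check whether they got an array or a string back. One possible alternative, sketched here under the assumption that raising is acceptable for your use case, is to raise a `ValueError` instead. The name `strict_distance` is hypothetical and not used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

def strict_distance(point, df):
    # Hypothetical variant of numpy_distance that raises instead of returning a string
    point = np.asarray(point, dtype=float)   # raises ValueError for entries like 'apple'
    data = np.asarray(df, dtype=float)
    if point.ndim != 1 or data.ndim != 2:
        raise ValueError("Expected a 1-d point and a 2-d table of features.")
    if point.shape[0] != data.shape[1]:
        raise ValueError("The dimensions of your point and DataFrame don't match!")
    return np.sqrt(np.sum((point - data) ** 2, axis=1))

X = pd.DataFrame({'col1': [1, 2, 3], 'col2': [1, 2, 3]})
try:
    strict_distance([0, 0, 0], X)
except ValueError as err:
    message = str(err)
    print(message)
```

The trade-off is that a bad input now stops execution unless the caller catches the exception, which is usually what you want inside a larger pipeline.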

----

#### The KNN Algorithm

In [71]:
class KNNClassifier:
    
    #initialize the hyperparameter k
    def __init__(self, k):
        #bool is a subclass of int, so reject it explicitly
        if isinstance(k, bool) or not isinstance(k, (int, np.integer)) or k < 1:
            raise ValueError("k-Value Error:\n-------------\n k must be a positive integer")
        self.k = k
        
    #Fit the training data.
    #You should recall that KNN doesn't actually calculate anything to fit. It just creates a copy of the data.
    def fit(self, X_train, y_train):
        '''Makes a copy of training data and the target to train knn'''
        if len(X_train) != len(y_train):
            raise ValueError("Dimensionality Error: \n---------------------\n Training data and training target dimensions don't match.")
        
        #Filter out non numeric rows that may occur in the training data
        #Careful the output may not be the same size if you have messy data
        #(on pandas 2.1+ applymap is deprecated in favor of DataFrame.map)
        numeric_rows = X_train.applymap(np.isreal).all(axis=1).to_numpy()
        X_train_filtered = X_train[numeric_rows]
        #Use the positional mask so a non-default index on X_train can't misalign the target
        y_train_filtered = [val for val, keep in zip(list(y_train), numeric_rows) if keep]
        self.train_data = X_train_filtered
        self.train_target = y_train_filtered
    
    #This is the function we wrote earlier, kept on the class as a static method
    @staticmethod
    def numpy_distance(point, df):
        '''Given a point and a pandas DataFrame, compute the Euclidean distance from the point to every row'''
        
        #No extra cleaning is needed here: fit already drops the non numeric rows
        try: 
            return np.sqrt(np.sum((np.array(point) - np.array(df))**2, axis=1))
        except Exception:
            pass

        #It helps to add some error messages in case something goes wrong
        try:
            if len(point) != len(df.columns):
                return "Error: The dimensions of your point and DataFrame don't match!"
        except Exception:
            pass

        return "User Error: Please review input critera."
    
    def predict_fast(self, x_test):
        '''Classify unseen data using the k-nearest points in the train data'''
        
        # First, Make a list of distances:
        distances = numpy_distance(x_test, self.train_data)
        distances_index = distances.argsort()
        
        
        #Now pick the k-closest points (distances_index is sorted nearest first):
        k_nearest = [self.train_target[i] for i in distances_index[:self.k]]
        
        #Count the unique values
        counts = np.unique(k_nearest, return_counts=True)
        
        #Find all of the max value classes:
        max_values = counts[0][np.where(counts[1] == max(counts[1]))[0]]
        return np.random.choice(max_values,1)[0]
    
    
    #After fitting the model, we make predictions on unseen test data
    def predict_tie_break(self, x_test):
        '''Classify unseen data using the k-nearest points in the train data'''
        
        # First, Make a list of distances:
        distances = numpy_distance(x_test, self.train_data)
        distances_index = distances.argsort()
        
        
        #Now pick the k-closest points (distances_index is sorted nearest first):
        k_nearest = [self.train_target[i] for i in distances_index[:self.k]]
        
        #Count the unique values
        counts = np.unique(k_nearest, return_counts=True)
        
        #Find all of the max value classes:
        max_values = counts[0][np.where(counts[1] == max(counts[1]))[0]]
        
        if len(max_values) == 1:
            return max_values[0]
        
        #What if we have a tie situation?
        #For this situation, we will iteratively remove the farthest neighbor from consideration until there is a unique max
        new_k = self.k - 1
        while new_k > 0:
            #Same vote as above, but over the new_k nearest points only:
            k_nearest = [self.train_target[i] for i in distances_index[:new_k]]
            counts = np.unique(k_nearest, return_counts=True)
            max_values = counts[0][np.where(counts[1] == max(counts[1]))[0]]
            if len(max_values) == 1:
                return max_values[0]
            #Still tied: drop one more neighbor (a single neighbor can never tie, so this terminates)
            new_k -= 1
    
    #A different tie-breaker
    def predict_imbalanced(self, x_test):
        '''If you are working with imbalanced data and want to give priority to minority class,
        this prediction function always gives any ties to the minority class.'''
        
        # First, Make a list of distances:
        distances = numpy_distance(x_test, self.train_data)
        distances_index = distances.argsort()
        
        
        #Now pick the k-closest points (distances_index is sorted nearest first):
        k_nearest = [self.train_target[i] for i in distances_index[:self.k]]
        
        #Count the unique values
        counts = np.unique(k_nearest, return_counts=True)
        
        #Find all of the max value classes:
        max_values = counts[0][np.where(counts[1] == max(counts[1]))[0]]
        
        if len(max_values) == 1:
            return max_values[0]
        
        #If we have a tie situation, just pick the smallest class
        return max_values[np.array([self.train_target.count(x) for x in max_values]).argmin()]   

We created three prediction methods, all essentially the same except for the tie-breaker. The choice of how to break ties becomes less important as the training data gets massive. It could be argued that, for the sake of speed, one could just leave that part out, in which case *predict_fast* (which picks one of the tied classes at random) is a good choice. A little more robust, *predict_tie_break* runs a loop that removes the farthest of the *k* points until a single class has the most votes.
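
To make the shrinking-*k* idea concrete, here is a small standalone sketch. It assumes the neighbor labels are already sorted nearest-first; the helper name `shrink_k_vote` is hypothetical and exists only for this illustration.

```python
import numpy as np

def shrink_k_vote(labels_by_distance, k):
    # Hypothetical helper: labels_by_distance must be sorted nearest-first
    for current_k in range(k, 0, -1):
        classes, counts = np.unique(labels_by_distance[:current_k], return_counts=True)
        winners = classes[counts == counts.max()]
        if len(winners) == 1:
            # A single neighbor can never tie, so the loop always returns
            return winners[0], current_k

# With k=4 the vote is 2 vs 2; dropping the farthest point leaves a, b, b
labels = ['a', 'b', 'b', 'a']
label, used_k = shrink_k_vote(labels, k=4)
print(label, used_k)
```

Here the tie at *k* = 4 is resolved at *k* = 3, where `'b'` holds two of the three votes.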

An interesting case to consider is data with a large class imbalance. There it can make sense to give the tie to the smaller class, since we often want to skew the classification model in the minority's favor. That's what my *predict_imbalanced* does. With other classification algorithms there are more nuanced ways to accomplish this, but KNN is simple and that's why everybody likes it.
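
The minority-class rule can also be sketched outside the class. The helper name `minority_tie_break` and the toy labels below are invented for this example and are not part of the notebook's data.

```python
from collections import Counter

def minority_tie_break(k_nearest_labels, train_labels):
    # Hypothetical helper: ties among the k nearest go to the rarest class in training
    votes = Counter(k_nearest_labels)
    top = max(votes.values())
    tied = [label for label, count in votes.items() if count == top]
    train_counts = Counter(train_labels)
    return min(tied, key=lambda label: train_counts[label])

# Toy imbalanced training target: 2 'fraud' rows against 8 'ok' rows
train_labels = ['fraud'] * 2 + ['ok'] * 8

tie_result = minority_tie_break(['ok', 'fraud'], train_labels)          # 1 vs 1 tie
clear_result = minority_tie_break(['ok', 'ok', 'fraud'], train_labels)  # no tie
print(tie_result, clear_result)
```

Note that the rule only kicks in on a tie; a clear majority among the neighbors still wins.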

Not my quote, but from a PhD who has published papers on this matter:

*Developments in learning from imbalanced data have been mainly motivated by numerous real-life applications in which we face the problem of uneven data representation. In such cases the minority class is usually the more important one and hence we require methods to improve its recognition rates. <br />  
-Bartosz Krawczyk*

------

#### Some Examples: