Well, the logistic regression parameter estimation was a doozy. Let's try something simpler for a change: an algorithm that is not so much grounded in theory but is very flexible, universal (it can be used for both value and class prediction) and easy to implement: K-nearest neighbours, aka KNN.

The idea of KNN is that similar objects are close to each other in the feature space. Using a distance metric, we can calculate the distance between any two objects and determine which objects are closest to a given one. The metric is typically Euclidean distance, although others such as Manhattan distance or cosine similarity can also be used.
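
To make the three metrics mentioned above concrete, here is a minimal sketch in NumPy. The function names and the two example vectors are made up for illustration, and the cosine version is written as a distance (one minus the similarity), which is one common convention rather than the only one.

```python
import numpy as np

def euclidean(a, b):
    # straight-line distance
    return np.sqrt(np.sum((a - b) ** 2))

def manhattan(a, b):
    # sum of absolute coordinate differences
    return np.sum(np.abs(a - b))

def cosine_distance(a, b):
    # 1 - cosine similarity; assumes neither vector is all zeros
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

print(euclidean(a, b))        # 5.0
print(manhattan(a, b))        # 7.0
print(cosine_distance(a, b))  # small positive number: the vectors point in similar directions
```

Note that cosine distance ignores vector length, so it can rank neighbours differently from the other two.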

The algorithm steps go like this:

1. Take the point we want to predict, $(X_p, y_p)$, and calculate its distance to every training point: $d(X_p; X_i) \; \forall i$
2. Sort the distances from smallest to largest and use the $K$ nearest points to calculate $y_p$. Various aggregates can be used, but the mean or median for continuous targets and the mode for categorical ones are the usual choices
3. Assess the fit of $y_p$. The success of the algorithm relies mostly on a single parameter, $K$, which we can vary to arrive at the best fit
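
The rest of this post focuses on classification, so here is a small sketch of the value-prediction variant from step 2, where the prediction is the mean of the $K$ nearest targets. The function name `knn_regress` and the toy data are hypothetical and only serve as an illustration.

```python
import numpy as np

def knn_regress(X_train, y_train, x_query, k):
    # step 1: distance from the query point to every training point
    distances = np.sqrt(np.sum((X_train - x_query) ** 2, axis=1))
    # step 2: indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # aggregate the neighbours' targets with the mean
    return np.mean(y_train[nearest])

# toy data: y equals x, with one far-away point
X_toy = np.array([[1.0], [2.0], [3.0], [10.0]])
y_toy = np.array([1.0, 2.0, 3.0, 10.0])

print(knn_regress(X_toy, y_toy, np.array([2.1]), k=3))  # 2.0
```

Swapping `np.mean` for `np.median` gives the median variant, which is less sensitive to a single extreme neighbour.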

Of course, the model can be improved further, for example by changing the distance metric, scaling the input values $X_i$ or dealing with outliers. But let's stick to the basics for now.
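
Scaling is not used anywhere else in this post, but since distances are dominated by whichever feature has the largest range, a rough sketch of one common approach (standardizing with the training mean and standard deviation) may be useful. The helper name `standardize` and the toy arrays are assumptions for illustration.

```python
import numpy as np

def standardize(X_train, X_other):
    # statistics come from the training data only, then get reused for new points
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    sigma[sigma == 0] = 1.0  # avoid dividing by zero for constant columns
    return (X_train - mu) / sigma, (X_other - mu) / sigma

# the second feature is 1000x larger and would dominate Euclidean distance
X_a = np.array([[1.0, 1000.0], [2.0, 2000.0], [3.0, 3000.0]])
X_b = np.array([[2.0, 2000.0]])

X_a_scaled, X_b_scaled = standardize(X_a, X_b)
print(X_a_scaled.std(axis=0))  # both columns now have unit spread
print(X_b_scaled)              # [[0. 0.]] because X_b sits exactly at the training mean
```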

Let's start with a function for the Euclidean distance between two points $X_1$ and $X_2$ with $n$ features:
$d(X_1; X_2) = \sqrt{\sum_{i=1}^{n} (X_{1i}-X_{2i})^2}$

In [1]:
import numpy as np
#calculate distance function
def calc_euclidean_distance(p1, p2):
    return(np.sqrt(np.sum((p1-p2)**2)))
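
One caveat with the function above: it relies on `p1 - p2` being an element-wise subtraction, which works for NumPy arrays but raises a `TypeError` for plain Python lists. A possible variant that coerces its inputs first is sketched below; the name `calc_euclidean_distance_safe` and the example points are hypothetical.

```python
import numpy as np

def calc_euclidean_distance_safe(p1, p2):
    # coerce inputs so plain lists work as well as arrays
    p1 = np.asarray(p1, dtype=float)
    p2 = np.asarray(p2, dtype=float)
    return np.sqrt(np.sum((p1 - p2) ** 2))

# two made-up points with four features each, given as plain lists
a = [5.0, 3.0, 1.9, 0.5]
b = [5.1, 3.5, 1.4, 0.2]

d = calc_euclidean_distance_safe(a, b)
# cross-check against NumPy's built-in vector norm
print(np.isclose(d, np.linalg.norm(np.array(a) - np.array(b))))  # True
```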

Then we just look for the $K$ smallest distances and take the most common class among those neighbours.

In [2]:
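def predict(X_train, y_train, x_pred, k):
    # x_pred: 2-D array with one row per point to classify
    # (mode comes from scipy.stats, imported in the next cell before predict is called)
    y_preds = np.array([])
    for j in range(0, x_pred.shape[0]):
        distances = np.array([])
        for i in range(0, X_train.shape[0]):
            distances = np.append(distances, calc_euclidean_distance(x_pred[j], X_train.iloc[i].values))
        # once all distances are known, take the k nearest and vote
        idx = np.argsort(distances)[:k]
        y_pred = mode(y_train[idx])[0]
        y_preds = np.append(y_preds, y_pred)

    return(y_preds)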

Let's do a small example for the famous Iris dataset

In [3]:
import pandas as pd
from sklearn import datasets
from scipy.stats import mode

iris = datasets.load_iris()

df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['target'] = iris.target
X = df.drop('target', axis=1)
y = df.target

#print(df.tail())

#now create some values we'll want to predict
x_new1=[5, 3, 1.9, 0.5]
x_new2=[ 6.0, 2.8, 5.1, 2.1   ]
x_new = np.vstack([x_new1, x_new2])

In [4]:
predict(X, y, x_new, k=5)

array([0., 2.])

And we're set. Since the algorithm was not very hard, let's put it in a class to appear more classy. We'll initialize the KNN with some $K$ and reuse the functions we already defined, but adjust `predict` to work with pandas DataFrames.

In [5]:
class knn_model:
    def __init__(self, K):
        self.K = K

    @staticmethod
    def calc_euclidean_distance(p1, p2):
        return(np.sqrt(np.sum((p1-p2)**2)))

    def predict(self, X_train, y_train, x_pred):
        # y_train needs a default 0..n-1 index so that positions from argsort match its labels
        y_preds = np.array([])
        for j, rj in x_pred.iterrows():
            distances = np.array([])
            for i, r in X_train.iterrows():
                distances = np.append(distances, self.calc_euclidean_distance(rj, r))
            # vote once all distances to this point are known
            idx = np.argsort(distances)[:self.K]
            y_pred = mode(y_train[idx])[0]
            y_preds = np.append(y_preds, y_pred)

        return(y_preds)
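
As an aside, the nested loops and repeated `np.append` calls can be replaced with NumPy broadcasting. The sketch below is an alternative formulation, not the class used in the rest of this post; the function name `knn_predict_vectorized` and the toy data are hypothetical. Ties are broken in favour of the smallest label.

```python
import numpy as np

def knn_predict_vectorized(X_train, y_train, X_pred, k):
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    X_pred = np.atleast_2d(np.asarray(X_pred, dtype=float))
    # diffs has shape (n_pred, n_train, n_features)
    diffs = X_pred[:, None, :] - X_train[None, :, :]
    distances = np.sqrt(np.sum(diffs ** 2, axis=2))
    # positions of the k nearest training points for each query row
    nearest = np.argsort(distances, axis=1)[:, :k]
    preds = []
    for row in y_train[nearest]:
        # majority vote; np.unique sorts labels, argmax picks the first maximum
        values, counts = np.unique(row, return_counts=True)
        preds.append(values[np.argmax(counts)])
    return np.array(preds)

# two well-separated toy clusters
X_toy = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                  [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict_vectorized(X_toy, y_toy, [[0.05, 0.05], [4.9, 5.0]], k=3))  # [0 1]
```

The trade-off is memory: the intermediate `diffs` array grows with the number of query points times the number of training points times the number of features.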

The model assigns our two imaginary iris plants to classes 0 and 2, so the functions seem to work. But let's see how the model would perform in a more realistic setting: let's do a simple train/test split and see how it performs for various $K$.

In [6]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=369)

In [7]:
#let's check whether the classes are imbalanced; we're in luck because they are nicely balanced,
#so we can use simple accuracy as the measure of success
y_train.value_counts()

2    44
0    38
1    38
Name: target, dtype: int64

In [119]:
accuracies=list()
for k in range(2, 10):
    model2 = knn_model(K=k)
    result = model2.predict(X_train, y_train.reset_index(drop=True), X_test)
    acc=np.mean(y_test == result)
    accuracies.append(acc)

In [120]:
accuracies

[0.9333333333333333,
 0.9333333333333333,
 0.9333333333333333,
 0.9666666666666667,
 0.9666666666666667,
 0.9666666666666667,
 0.9666666666666667,
 0.9666666666666667]

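The model is very accurate for every $K$ we tried: with 30 test points, these scores correspond to 28 or 29 correct predictions. Then again, the problem is probably not that hard.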
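
To read a "best" $K$ off these results, one simple option is to take the first $K$ with the highest accuracy. The sketch below hard-codes the accuracies printed above; the variable names `ks`, `n_test` and `best_k` are introduced here for illustration, and `n_test = 30` assumes the 20% split of the 150 iris rows.

```python
import numpy as np

ks = list(range(2, 10))
accuracies = [0.9333333333333333, 0.9333333333333333, 0.9333333333333333,
              0.9666666666666667, 0.9666666666666667, 0.9666666666666667,
              0.9666666666666667, 0.9666666666666667]

n_test = 30  # assumed: 20% of the 150 iris rows
# convert accuracies back to counts of correctly classified test points
correct = [round(a * n_test) for a in accuracies]
# np.argmax returns the first maximum, so ties go to the smallest K
best_k = ks[int(np.argmax(accuracies))]

print(correct)  # [28, 28, 28, 29, 29, 29, 29, 29]
print(best_k)   # 5
```

Since the gap between the best and worst $K$ is a single test point on one random split, this choice should be treated as tentative; repeating the split or using cross-validation would give a steadier picture.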