# Machine Learning in Python - Roll your Own Estimator Example

This notebook demonstrates how scikit-learn can be extended to include new models by implementing the **EducatedGuessClassifier**. 

The **EducatedGuessClassifier** is a **very naive** classification algorithm that calculates the distribution across classes in a training dataset and, when asked to make a prediction, returns a random class selected according to that distribution. It only works for categorical target features.

The EducatedGuessClassifier is very simple:
* **Training:** Calculate the distribution across the target levels in the training dataset and store it as a map.
* **Prediction:** When a new prediction is requested, draw a random value from the distribution learned from the training dataset.
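
The two steps above can be sketched with the standard library alone, before any scikit-learn machinery is involved. This is only an illustrative sketch: the helper names `fit_distribution` and `educated_guess` are hypothetical and are not part of the class defined later in the notebook.

```python
import random
from collections import Counter

def fit_distribution(labels):
    """Return {class: relative frequency} for the training labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}

def educated_guess(distribution, n_queries, rng):
    """Draw one class per query, weighted by the learned distribution."""
    classes = list(distribution)
    weights = list(distribution.values())
    return rng.choices(classes, weights=weights, k=n_queries)

rng = random.Random(42)  # seeded only so the sketch is repeatable
distribution = fit_distribution([1, 2, 2, 2])
guesses = educated_guess(distribution, n_queries=5, rng=rng)

print(distribution)
print(guesses)
```

Note that the descriptive features never enter the calculation: only the target labels are used in training, and only the number of query instances matters at prediction time.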

**NOTE THAT THE EDUCATEDGUESSCLASSIFIER IS A TERRIBLE MODEL AND IS ONLY USED AS A VERY SIMPLE DEMONSTRATION OF HOW TO IMPLEMENT AN ML ALGORITHM IN SCIKIT-LEARN**

## Import Packages Etc

In [4]:
from IPython.display import display, HTML, Image

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import random

from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.utils.validation import check_X_y, check_array, check_is_fitted
from sklearn.utils.multiclass import unique_labels
from sklearn import metrics


%matplotlib inline
#%qtconsole

## Define EducatedGuessClassifier

Define and test out the EducatedGuessClassifier class. To build a scikit-learn classifier we extend the **BaseEstimator** and **ClassifierMixin** classes and implement the **`__init__`**, **fit**, **predict**, and **predict_proba** methods.

### Define the EducatedGuessClassifier Class

In [17]:
# Create a new classifier which is based on the scikit-learn BaseEstimator and ClassifierMixin classes
class EducatedGuessClassifier(BaseEstimator, ClassifierMixin):
    """The EducatedGuessClassifier is a very naive classification
    algorithm that calculates the distribution across classes in a
    training dataset and when asked to make a prediction returns a
    random class selected according to that distribution. The
    EducatedGuessClassifier only works for categorical target
    features. 
    - Training: Calculate the distribution across the target
    levels in the training dataset and store it as a map.
    - Prediction: When a new prediction is requested, draw a random value
    from the distribution learned from the training dataset.

    Parameters
    ----------
    add_noise : bool, optional (default=False)
        Whether or not a little bit of noise should be added to the
        distribution.

    Attributes
    ----------
    classes_ : array of shape = [n_classes] 
               The class labels (single output problem).
    distribution_: dict
               A dictionary of the probability of each class.

    Examples
    --------
    >>> from sklearn.datasets import load_iris
    >>> from sklearn.model_selection import cross_val_score
    >>> clf = EducatedGuessClassifier()
    >>> iris = load_iris()
    >>> cross_val_score(clf, iris.data, iris.target, cv=10)

    """
    
    # Constructor for the classifier object
    def __init__(self, add_noise = False):
        self.add_noise = add_noise

    # The fit function to train a classifier
    def fit(self, X, y):
        """Fit the classifier by estimating the class distribution of y.

        Parameters
        ----------
        X : array-like, shape = [n_samples, n_features]
            The training input samples. They are validated but otherwise
            ignored, because only the target distribution is learned.
        y : array-like, shape = [n_samples]
            The target values (class labels) as integers or strings.

        Returns
        -------
        self : object
        """
            
        # Check that X and y have correct shape
        X, y = check_X_y(X, y)

        # Count the number of occurrences of each class in the target vector (uses the NumPy unique function, which returns the unique values and their counts)
        unique, counts = np.unique(y, return_counts=True)
        
        # Store the classes seen during fit
        self.classes_ = unique

        # Normalise the counts to sum to 1
        dist = counts/sum(counts)
            
        # If the add_noise attribute is true, add a little noise to the distribution
        if self.add_noise:
            for i in range(len(dist)):
                dist[i] = dist[i] + dist[i] * random.uniform(-0.25, 0.25)
            # Renormalise the distribution
            dist = dist / sum(dist)
            
        # Create a new dictionary of classes and their normalised frequencies (the distribution)
        self.distribution_ = dict(zip(unique, dist))
        
        # Return the classifier
        return self

    # The predict function to make a set of predictions for a set of query instances
    def predict(self, X):
        """Predict class labels of the input samples X.

        Parameters
        ----------
        X : array-like, shape = [n_samples, n_features]
            The input samples. Only the number of rows is used, because
            each prediction is drawn at random from the learned distribution.

        Returns
        -------
        p : array of shape = [n_samples]
            The predicted class labels of the input samples.
        """
        
        # Check that fit has been called by confirming that the distribution_ dictionary has been set up
        check_is_fitted(self, ['distribution_'])

        # Validate the input array (must be a non-empty, finite, 2D numeric array)
        X = check_array(X)

        # Initialise an empty list to store the predictions made
        predictions = list()
        
        # Iterate through the query instances in the query dataset 
        for instance in X:
            
            # Generate a random class according to the learned distribution
            pred = random.choices(list(self.distribution_.keys()), list(self.distribution_.values()))
            
            predictions.append(pred[0])
            
        return np.array(predictions)
    
    
    # The predict_proba function to generate class probability scores for a set of query instances
    def predict_proba(self, X):
        """Predict class probabilities of the input samples X.
        Parameters
        ----------
        X : array-like matrix of shape = [n_samples, n_features]
            The input samples. 
        Returns
        -------
        p : array of shape = [n_samples, n_labels].
            The predicted class label probabilities of the input samples. 
        """

        # Check that fit has been called by confirming that the distribution_ dictionary has been set up
        check_is_fitted(self, ['distribution_'])

        # Validate the input array (must be a non-empty, finite, 2D numeric array)
        X = check_array(X)

        # Initialise an array to store the prediction scores generated
        predictions = np.zeros((len(X), len(self.classes_)))

        # Iterate through the query instances in the query dataset 
        for idx, instance in enumerate(X):
            
            # Generate a random class according to the learned distribution
            pred = random.choices(list(self.distribution_.keys()), list(self.distribution_.values()))[0]

            # Always give the predicted class a probability of 0.9 and spread the remaining
            # probability mass equally over all other classes (assumes at least two classes)
            predictions[idx, :] = 0.1/(len(self.classes_) - 1)
            predictions[idx, list(self.classes_).index(pred)] = 0.9
            
        return predictions

### Test the EducatedGuessClassifier

Do a simple test of the EducatedGuessClassifier

In [18]:
a = np.array([[1,23,3,4], [5,6,7,8], [7,5,6,2], [4,9,12,43]])
y = np.array([1, 2, 2, 2])

In [19]:
my_model = EducatedGuessClassifier()

In [20]:
my_model.fit(a, y) 

EducatedGuessClassifier(add_noise=False)

In [21]:
my_model.distribution_

{1: 0.25, 2: 0.75}

In [22]:
q = np.array([[2,15,6,21], [8,9,7,6]])

In [23]:
my_model.predict(q)

array([2, 2])
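
The array above is just one possible outcome: rerunning the cell may return a different combination of classes, because each prediction is a fresh random draw. Since the class draws from the module-level `random` generator, seeding it with `random.seed` should make the draws repeatable. The sketch below shows the idea on the same two-class distribution, using only the standard library (it does not call the classifier itself).

```python
import random

classes, weights = [1, 2], [0.25, 0.75]

random.seed(0)  # seed the module-level generator that random.choices draws from
first = random.choices(classes, weights=weights, k=2)

random.seed(0)  # same seed, so the same sequence of draws is replayed
second = random.choices(classes, weights=weights, k=2)

# Without reseeding, the generator simply continues, so this draw may differ
unseeded = random.choices(classes, weights=weights, k=2)

print(first, second, unseeded)
```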

In [24]:
my_model.predict_proba(q)

array([[0.1, 0.9],
       [0.1, 0.9]])
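
Each row of this output is built the same way: the randomly drawn class receives 0.9 and the remaining 0.1 is shared equally among the other classes, so the scores do not reflect the learned distribution itself. A small standard-library sketch of that rule follows; `proba_row` is a hypothetical helper written for illustration, and it assumes at least two classes.

```python
import math

def proba_row(classes, predicted):
    """Score 0.9 for the drawn class and split 0.1 equally over the others."""
    rest = 0.1 / (len(classes) - 1)  # assumes len(classes) >= 2
    return [0.9 if cls == predicted else rest for cls in classes]

two_class_row = proba_row([1, 2], predicted=2)
ten_class_row = proba_row(list(range(10)), predicted=3)

print(two_class_row)
print(ten_class_row)
```

With a single class the divisor would be zero, so a production-quality estimator would need to handle that case explicitly.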

Fit a model to the iris dataset

In [25]:
from sklearn.datasets import load_iris
iris = load_iris()

clf = EducatedGuessClassifier()
clf.fit(iris.data, iris.target)
clf.distribution_

{0: 0.3333333333333333, 1: 0.3333333333333333, 2: 0.3333333333333333}

Do a simple Iris cross validation experiment

In [40]:
clf = EducatedGuessClassifier()
cross_val_score(clf, iris.data, iris.target, cv=10)

array([0.33333333, 0.06666667, 0.46666667, 0.2       , 0.33333333,
       0.26666667, 0.4       , 0.4       , 0.13333333, 0.2       ])
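
The fold scores scatter around one third, which is what random guessing should give on three balanced classes. If guesses are drawn independently of the true label, the expected accuracy is the sum over classes of P(guess = c) × P(true = c). The sketch below works through that arithmetic, assuming the training and test class distributions are identical; `expected_accuracy` is a hypothetical helper.

```python
import math

def expected_accuracy(guess_dist, true_dist):
    """Probability that an independent random guess matches the true class."""
    return sum(p * true_dist.get(cls, 0.0) for cls, p in guess_dist.items())

iris_dist = {0: 1 / 3, 1: 1 / 3, 2: 1 / 3}
skewed_dist = {1: 0.25, 2: 0.75}

iris_expected = expected_accuracy(iris_dist, iris_dist)        # 3 * (1/3)^2
skewed_expected = expected_accuracy(skewed_dist, skewed_dist)  # 0.25^2 + 0.75^2

print(iris_expected, skewed_expected)
```

Each fold here holds only 15 instances, so individual fold scores can sit well above or below the expected value.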

Fit a model to the iris dataset with noise added to the distribution

In [41]:
clf = EducatedGuessClassifier(add_noise = True)
clf.fit(iris.data, iris.target)
clf.distribution_

{0: 0.4199488077842891, 1: 0.29607943035981144, 2: 0.28397176185589945}
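
The noisy distribution above comes from scaling each class probability by a random factor between 0.75 and 1.25 and then renormalising so that the values sum to 1 again. A standard-library sketch of that step, mirroring the logic in `fit` but operating on a plain dictionary (`perturb` is a hypothetical name):

```python
import math
import random

def perturb(distribution, rng, scale=0.25):
    """Scale each probability by a factor in [1 - scale, 1 + scale], then renormalise."""
    noisy = {cls: p * (1 + rng.uniform(-scale, scale)) for cls, p in distribution.items()}
    total = sum(noisy.values())
    return {cls: p / total for cls, p in noisy.items()}

rng = random.Random(1)  # seeded only so the sketch is repeatable
uniform = {0: 1 / 3, 1: 1 / 3, 2: 1 / 3}
noisy = perturb(uniform, rng)

print(noisy)
```

For three balanced classes the perturbed probabilities stay roughly between 0.23 and 0.46, which is consistent with the values shown above.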

In [42]:
from sklearn.datasets import load_iris
clf = EducatedGuessClassifier(add_noise = True)
iris = load_iris()
cross_val_score(clf, iris.data, iris.target, cv=10)

array([0.4       , 0.46666667, 0.26666667, 0.13333333, 0.4       ,
       0.33333333, 0.26666667, 0.33333333, 0.26666667, 0.33333333])

## Load & Partition Data

### Setup - IMPORTANT

Take only a sample of the dataset for fast testing

In [43]:
data_sampling_rate = 0.1

Set up the number of folds for all grid searches (should be 5 - 10)

In [44]:
cv_folds = 10

### Load & Partition Data

Load the dataset and explore it.

In [45]:
dataset = pd.read_csv('../lab2-fashion_mnist/fashionmnist/fashion-mnist_train.csv')
dataset = dataset.sample(frac=data_sampling_rate)  # take a sample from the dataset so everything runs smoothly
num_classes = 10
classes = {0: "T-shirt/top", 1:"Trouser", 2: "Pullover", 3:"Dress", 4:"Coat", 5:"Sandal", 6:"Shirt", 7:"Sneaker", 8:"Bag", 9:"Ankle boot"}
display(dataset.head())

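| | label | pixel1 | pixel2 | pixel3 | pixel4 | pixel5 | pixel6 | pixel7 | pixel8 | pixel9 | ... | pixel775 | pixel776 | pixel777 | pixel778 | pixel779 | pixel780 | pixel781 | pixel782 | pixel783 | pixel784 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 32516 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 88 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 32940 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 50973 | 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 7606 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 31788 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |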


Isolate the descriptive features we are interested in

In [46]:
X = dataset[dataset.columns.difference(["label"])]
Y = np.array(dataset["label"])

In [47]:
X = X/255

In [48]:
# Hold out 30% of the data as the test set
X_train_plus_valid, X_test, y_train_plus_valid, y_test = train_test_split(
    X, Y, random_state=0, train_size=0.7)

# Split the remaining 70% so that training is 50% and validation is 20% of the full sample
X_train, X_valid, y_train, y_valid = train_test_split(
    X_train_plus_valid, y_train_plus_valid, random_state=0, train_size=0.5/0.7)
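
The `train_size` values in the two splits combine multiplicatively. The arithmetic below is a quick check of the resulting overall proportions, and the variable names are illustrative only.

```python
import math

first_split_train = 0.7          # share kept for training + validation
second_split_train = 0.5 / 0.7   # share of that remainder used for training

train_share = first_split_train * second_split_train
valid_share = first_split_train * (1 - second_split_train)
test_share = 1 - first_split_train

print(train_share, valid_share, test_share)
```

So the data ends up split roughly 50% training, 20% validation, and 30% test.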



## Train and Evaluate a Simple Model

In [49]:
my_model = EducatedGuessClassifier(add_noise = True)
my_model.fit(X_train, y_train)

EducatedGuessClassifier(add_noise=True)

In [50]:
# Make a set of predictions for the training data
y_pred = my_model.predict(X_train)

# Print performance details
accuracy = metrics.accuracy_score(y_train, y_pred) # , normalize=True, sample_weight=None
print("Accuracy: " +  str(accuracy))
print(metrics.classification_report(y_train, y_pred))

# Print confusion matrix
# print(metrics.confusion_matrix(y_train, y_pred))

# Print nicer homemade confusion matrix
print("Confusion Matrix")
display(pd.crosstab(y_train, y_pred, rownames=['True'], colnames=['Predicted'], margins=True))

Accuracy: 0.1
              precision    recall  f1-score   support

           0       0.08      0.08      0.08       293
           1       0.12      0.09      0.10       317
           2       0.10      0.08      0.09       296
           3       0.09      0.09      0.09       329
           4       0.10      0.11      0.11       296
           5       0.09      0.08      0.09       296
           6       0.12      0.14      0.13       260
           7       0.11      0.11      0.11       304
           8       0.09      0.12      0.10       315
           9       0.10      0.10      0.10       294

   micro avg       0.10      0.10      0.10      3000
   macro avg       0.10      0.10      0.10      3000
weighted avg       0.10      0.10      0.10      3000

Confusion Matrix


| True \ Predicted | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | All |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 22 | 26 | 29 | 39 | 31 | 23 | 31 | 23 | 46 | 23 | 293 |
| 1 | 24 | 28 | 20 | 38 | 39 | 30 | 27 | 36 | 43 | 32 | 317 |
| 2 | 30 | 17 | 25 | 33 | 36 | 24 | 29 | 33 | 36 | 33 | 296 |
| 3 | 21 | 21 | 32 | 31 | 46 | 28 | 42 | 34 | 47 | 27 | 329 |
| 4 | 35 | 20 | 26 | 29 | 33 | 35 | 36 | 26 | 31 | 25 | 296 |
| 5 | 21 | 23 | 21 | 38 | 30 | 25 | 41 | 27 | 39 | 31 | 296 |
| 6 | 27 | 20 | 21 | 30 | 20 | 27 | 37 | 27 | 37 | 14 | 260 |
| 7 | 35 | 24 | 22 | 32 | 24 | 26 | 29 | 34 | 44 | 34 | 304 |
| 8 | 31 | 32 | 25 | 39 | 33 | 32 | 27 | 32 | 37 | 27 | 315 |
| 9 | 30 | 13 | 32 | 36 | 37 | 31 | 22 | 32 | 33 | 28 | 294 |


In [51]:
# Make a set of predictions for the test data
y_pred = my_model.predict(X_test)

# Print performance details
accuracy = metrics.accuracy_score(y_test, y_pred) # , normalize=True, sample_weight=None
print("Accuracy: " +  str(accuracy))
print(metrics.classification_report(y_test, y_pred))

# Print confusion matrix
# print(metrics.confusion_matrix(y_test, y_pred))

# Print nicer homemade confusion matrix
print("Confusion Matrix")
pd.crosstab(np.array(y_test), y_pred, rownames=['True'], colnames=['Predicted'], margins=True)

Accuracy: 0.10333333333333333
              precision    recall  f1-score   support

           0       0.08      0.08      0.08       168
           1       0.10      0.08      0.09       178
           2       0.08      0.07      0.08       148
           3       0.11      0.12      0.11       182
           4       0.12      0.15      0.13       182
           5       0.13      0.11      0.12       206
           6       0.08      0.09      0.08       176
           7       0.08      0.09      0.09       177
           8       0.14      0.14      0.14       207
           9       0.09      0.09      0.09       176

   micro avg       0.10      0.10      0.10      1800
   macro avg       0.10      0.10      0.10      1800
weighted avg       0.10      0.10      0.10      1800

Confusion Matrix


| True \ Predicted | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | All |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 13 | 17 | 23 | 11 | 20 | 21 | 20 | 19 | 13 | 11 | 168 |
| 1 | 19 | 15 | 10 | 19 | 27 | 18 | 19 | 15 | 20 | 16 | 178 |
| 2 | 14 | 15 | 11 | 8 | 19 | 7 | 22 | 18 | 22 | 12 | 148 |
| 3 | 20 | 17 | 10 | 21 | 24 | 13 | 19 | 23 | 17 | 18 | 182 |
| 4 | 17 | 21 | 11 | 14 | 27 | 20 | 11 | 23 | 19 | 19 | 182 |
| 5 | 21 | 13 | 16 | 19 | 20 | 22 | 25 | 16 | 30 | 24 | 206 |
| 6 | 13 | 13 | 15 | 24 | 25 | 16 | 15 | 20 | 18 | 17 | 176 |
| 7 | 16 | 12 | 13 | 23 | 25 | 20 | 14 | 16 | 22 | 16 | 177 |
| 8 | 8 | 13 | 23 | 30 | 20 | 15 | 21 | 27 | 30 | 20 | 207 |
| 9 | 17 | 13 | 10 | 16 | 25 | 21 | 16 | 17 | 25 | 16 | 176 |


## Do a Cross Validation Experiment With Our Model

In [52]:
my_model = EducatedGuessClassifier()
scores = cross_val_score(my_model, X_train_plus_valid, y_train_plus_valid, cv=cv_folds, n_jobs=-1)
print(scores)

[0.08705882 0.12056738 0.12322275 0.08788599 0.09285714 0.09785203
 0.09785203 0.09330144 0.08633094 0.12019231]


## Do a Grid Search Over the add_noise Parameter

In [53]:
# Set up the parameter grid to search
param_grid = [
 {'add_noise': [False, True]}
]

# Perform the search
my_tuned_model = GridSearchCV(EducatedGuessClassifier(), param_grid, cv=cv_folds, verbose = 2, n_jobs=-1)
my_tuned_model.fit(X_train_plus_valid, y_train_plus_valid)

# Print details
print("Best parameters set found on development set:")
print(my_tuned_model.best_params_)
print(my_tuned_model.best_score_)


Fitting 10 folds for each of 2 candidates, totalling 20 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 out of  20 | elapsed:    0.1s remaining:    0.3s
[Parallel(n_jobs=-1)]: Done  16 out of  20 | elapsed:    0.2s remaining:    0.0s


Best parameters set found on development set:
{'add_noise': False}
0.10071428571428571


[Parallel(n_jobs=-1)]: Done  20 out of  20 | elapsed:    0.3s finished
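
Conceptually, the search scores each candidate parameter setting by its mean cross-validated accuracy and keeps the best one. The sketch below imitates that loop with the standard library on a toy set of ten balanced classes. It is not how `GridSearchCV` is implemented, the helper names are hypothetical, and the interleaved folds are a simplification.

```python
import random
from collections import Counter

def fit_distribution(labels, add_noise, rng):
    """Relative class frequencies, optionally perturbed by up to 25% and renormalised."""
    counts = Counter(labels)
    dist = {cls: n / len(labels) for cls, n in counts.items()}
    if add_noise:
        dist = {cls: p * (1 + rng.uniform(-0.25, 0.25)) for cls, p in dist.items()}
        total = sum(dist.values())
        dist = {cls: p / total for cls, p in dist.items()}
    return dist

def cv_accuracy(labels, add_noise, n_folds, rng):
    """Mean accuracy over n_folds interleaved folds of the label list."""
    scores = []
    for fold in range(n_folds):
        test = labels[fold::n_folds]
        train = [y for i, y in enumerate(labels) if i % n_folds != fold]
        dist = fit_distribution(train, add_noise, rng)
        guesses = rng.choices(list(dist), weights=list(dist.values()), k=len(test))
        scores.append(sum(g == t for g, t in zip(guesses, test)) / len(test))
    return sum(scores) / n_folds

rng = random.Random(0)  # seeded only so the sketch is repeatable
labels = [i % 10 for i in range(1000)]  # toy stand-in: 10 balanced classes
rng.shuffle(labels)

# Score every candidate value of add_noise and keep the one with the best mean accuracy
results = {noise: cv_accuracy(labels, noise, n_folds=10, rng=rng) for noise in (False, True)}
best = max(results, key=results.get)

print(results, "best add_noise:", best)
```

Both settings should score close to 0.1 on ten balanced classes, so which one is selected as "best" is largely down to chance.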


In [54]:
# Make a set of predictions for the test data
y_pred = my_tuned_model.predict(X_test)

# Print performance details
accuracy = metrics.accuracy_score(y_test, y_pred) # , normalize=True, sample_weight=None
print("Accuracy: " +  str(accuracy))
print(metrics.classification_report(y_test, y_pred))

# Print confusion matrix
print("Confusion Matrix")
pd.crosstab(np.array(y_test), y_pred, rownames=['True'], colnames=['Predicted'], margins=True)

Accuracy: 0.1
              precision    recall  f1-score   support

           0       0.06      0.07      0.07       168
           1       0.11      0.10      0.11       178
           2       0.07      0.09      0.08       148
           3       0.14      0.15      0.15       182
           4       0.08      0.08      0.08       182
           5       0.13      0.12      0.12       206
           6       0.11      0.09      0.10       176
           7       0.05      0.05      0.05       177
           8       0.13      0.12      0.13       207
           9       0.11      0.11      0.11       176

   micro avg       0.10      0.10      0.10      1800
   macro avg       0.10      0.10      0.10      1800
weighted avg       0.10      0.10      0.10      1800

Confusion Matrix


| True \ Predicted | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | All |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 12 | 20 | 16 | 18 | 22 | 12 | 16 | 18 | 16 | 18 | 168 |
| 1 | 19 | 18 | 26 | 15 | 21 | 21 | 9 | 12 | 19 | 18 | 178 |
| 2 | 14 | 20 | 13 | 15 | 9 | 19 | 16 | 20 | 13 | 9 | 148 |
| 3 | 22 | 6 | 17 | 27 | 24 | 19 | 16 | 18 | 17 | 16 | 182 |
| 4 | 23 | 16 | 23 | 15 | 15 | 19 | 14 | 18 | 17 | 22 | 182 |
| 5 | 21 | 18 | 20 | 20 | 24 | 25 | 17 | 23 | 25 | 13 | 206 |
| 6 | 13 | 13 | 22 | 19 | 25 | 15 | 16 | 14 | 18 | 21 | 176 |
| 7 | 19 | 12 | 17 | 19 | 29 | 16 | 17 | 9 | 21 | 18 | 177 |
| 8 | 20 | 21 | 18 | 19 | 11 | 29 | 17 | 23 | 25 | 24 | 207 |
| 9 | 22 | 20 | 10 | 22 | 13 | 20 | 11 | 21 | 17 | 20 | 176 |


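Demo the predict_proba function. Note that the cell below was run in a fresh kernel session (`In [1]`), where `my_tuned_model` had not been defined, so it raised a `NameError`; rerun the grid search cell first.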

In [1]:
# Make a set of predictions for the test data
y_pred = my_tuned_model.predict_proba(X_test)
_ = pd.DataFrame(y_pred).hist(figsize = (10,10))

NameError: name 'my_tuned_model' is not defined