 As this is a case study individual assignment, I agree and acknowledge that all code modified in this notebook is my own. I have not and will not collaborate with anyone on this assignment. If I have questions, I will ask the instructor or TAs.

In [2]:
# Please provide your name and agreement to the above statement as variables `name` (string) and `agree` (boolean)
# YOUR CODE HERE
name = "Lewis Blake"
agree = True

In [3]:
conditions = [isinstance(name, str), isinstance(agree, bool), agree]
for condition in conditions:
    if not condition:
        raise ValueError("Student has not agreed to work on this assignment alone and without collaboration or has not provided their name")

# Naive Classification Approaches


In this case study, we will build a custom classifier that takes some naive classification approaches. The approaches we'll take are:

- Guessing one class at all times
- Guessing the most common class at all times
- Guessing randomly based on the distribution of the classes
- Guessing randomly based on an equal chance of the classes


We're going to build a class, `NaiveClassifier`, that can fit and predict based on the above approaches. We will then try it out on a few datasets and see what results we get. This should help you understand the minimal performance you should expect out of your machine learning models.

The way `NaiveClassifier` should work is that we instantiate it with an `approach` and an optional `value` depending on the method.

Examples:

- always predict class 1 would be: `clf = NaiveClassifier(approach="always", value=1)`
- always predict most common class would be: `clf = NaiveClassifier(approach="most")`
- predict based on class distribution: `clf = NaiveClassifier(approach="distribution")`
- predict based on equal class distribution: `clf = NaiveClassifier(approach="equal")`

Note that as a naive classifier, this classifier is not using any of the features about the data. It is only using the patterns that it detects in the labels.

In addition, recall that the way a supervised learning classifier works is that it is:

- Initialized with its hyperparameters
- When running `.fit()`, it learns its model parameters, such as the weight coefficients in linear regression
- When running `.predict()`, it applies the model parameters it learned to new data, such as plugging values of $x$ into the equation $y=mx+b$
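
The three stages above can be sketched with a toy one-feature linear model. This is only an illustration under stated assumptions, not part of the assignment: the class name `TinyLinearModel` and its `fit_intercept` hyperparameter are hypothetical, and the closed-form least-squares fit stands in for whatever a real library does.

```python
class TinyLinearModel:
    """Toy model for y = m*x + b (hypothetical, for illustration only)."""

    def __init__(self, fit_intercept=True):
        # Hyperparameter: chosen by the user before any data is seen
        self.fit_intercept = fit_intercept

    def fit(self, X, y):
        # Model parameters m and b are learned from the data (least squares)
        n = len(X)
        mean_x = sum(X) / n if self.fit_intercept else 0.0
        mean_y = sum(y) / n if self.fit_intercept else 0.0
        sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(X, y))
        sxx = sum((xi - mean_x) ** 2 for xi in X)
        self.m = sxy / sxx
        self.b = mean_y - self.m * mean_x
        return self

    def predict(self, X):
        # Apply the learned parameters to new values of x
        return [self.m * xi + self.b for xi in X]


model = TinyLinearModel().fit([0, 1, 2, 3], [1, 3, 5, 7])
print(model.m, model.b)
print(model.predict([4]))
```

The `NaiveClassifier` built later in this notebook follows the same pattern, except that what it learns in `.fit()` comes only from the labels.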

In [4]:
import numpy as np
import scipy
from scipy import stats
import sklearn
from sklearn.metrics import accuracy_score, classification_report
import random

In [6]:
def select_most_common(labels):
    """Select the most common value in an iterable of labels
    
    Args:
        labels (iterable): An iterable of integers representing the labels of a dataset
    
    Returns:
        int: The most common element in the iterable
    """
    return max(set(labels), key = labels.count)

In [7]:
assert select_most_common([1,2,2,3,4,5]) == 2
assert select_most_common([1,1,1,1,1,1,2,2,2]) == 1

In this next portion, we will generate data following a distribution. Let's go over a simple example of how to do that. To simulate a fair coin flip, we know that 50% of the time we should get heads and 50% of the time we should get tails. We can simulate a fair coin programmatically by generating a *uniform* random number between 0 and 1 and deciding that if it is less than 0.5 we get heads, and if it is greater than or equal to 0.5 we get tails. We can represent the distribution of heads and tails as `[0.5, 0.5]`, where heads is class 0 and tails is class 1. In the next cell, you will write a function that generates data / predicts from a distribution represented this way, which can be of any length.
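
As a minimal sketch of the coin-flip idea described above (the function name `flip_fair_coin` is hypothetical, and a local `random.Random` instance is used so the global random state is left untouched):

```python
import random


def flip_fair_coin(rng):
    """Return 0 (heads) or 1 (tails) with equal probability."""
    x = rng.random()  # uniform random number in [0, 1)
    return 0 if x < 0.5 else 1


rng = random.Random(0)  # local generator with a fixed seed
flips = [flip_fair_coin(rng) for _ in range(10_000)]
heads_share = flips.count(0) / len(flips)
print(f"share of heads: {heads_share:.3f}")
```

With this many flips the share of heads should land close to 0.5, though not exactly on it.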

In [8]:
def predict_from_distribution(distribution):
    """Return one possible value given a distribution
    
    Args:
        distribution (list): A list of probabilities of each class index
        
    Returns:
        int: The index of the predicted class.
    """
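    # Draw one uniform number in [0, 1) and find which cumulative bin it falls in
    x = random.random()
    cumulative = 0.0
    for i, probability in enumerate(distribution):
        cumulative += probability
        if x < cumulative:
            return i
    # Fallback in case rounding leaves the total slightly below 1
    return len(distribution) - 1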

In [9]:
# Example predictions
# On average you should see about five 0s, one 1, and four 2s in 10 results; small samples vary.
# You can print x inside the function to check that it is being calculated correctly
[predict_from_distribution([0.5, 0.1, 0.4]) for i in range(10)]

[1, 2, 1, 0, 2, 0, 1, 2, 0, 2]
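
With only 10 draws, the counts can differ noticeably from 5/1/4, as the output above shows. A larger sample makes the match clearer. The sketch below is a standalone version built on cumulative thresholds with `itertools.accumulate` and `bisect`; the helper name `sample_index` is hypothetical and is not the assignment's function.

```python
import bisect
import itertools
import random
from collections import Counter


def sample_index(distribution, rng):
    """Draw one class index using cumulative thresholds."""
    thresholds = list(itertools.accumulate(distribution))  # running totals
    i = bisect.bisect_right(thresholds, rng.random())
    # Guard against rounding leaving the last threshold slightly below 1
    return min(i, len(distribution) - 1)


rng = random.Random(0)
draws = [sample_index([0.5, 0.1, 0.4], rng) for _ in range(20_000)]
counts = Counter(draws)
shares = [counts[k] / len(draws) for k in range(3)]
print(shares)
```

The printed shares should sit near 0.5, 0.1, and 0.4 respectively.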

In [198]:
class NaiveClassifier:
    """A Naive Classifier that predicts classes using simple approaches.
    """
    
    def __init__(self, approach, value=None):
        """Initialize the NaiveClassifier
        
        Args:
            approach (str): One of "always", "most", "distribution", "equal"
            value (int, optional): Defaults to None. The value of the class to select if approach is "always"
        """
        assert approach in ["always", "most", "distribution", "equal"]
        self.approach = approach
        self.value = value

    def fit(self,X,y):
        """Fit to data and labels
        
        Args:
            X (iterable): The features of the data
            y (iterable): The labels of the data
        """
        if self.approach == "always":
            # If the user does not supply an initial value, default to class 0
            if self.value is None:
                self.value = 0
        elif self.approach == "most":
            # Store the most common label (as a one-element list) using select_most_common()
            self.most_common = [select_most_common(y)]
        elif self.approach == "distribution":
            # Share of each class; assumes y is a list with labels 0..k-1
            self.dist = [y.count(i) / len(y) for i in range(len(set(y)))]
        elif self.approach == "equal":
            # Only the distinct labels are needed to guess uniformly
            self.unique_labels = np.unique(y)

    def predict(self,X):
        """Predict the labels of a new set of datapoints
        
        Args:
            X (iterable): The data to predict
        """
        if self.approach == "always":
            # "always" repeats self.value for every datapoint
            return [self.value] * len(X)
        elif self.approach == "most":
            # self.most_common is a one-element list, so this repeats it len(X) times
            return self.most_common * len(X)
        elif self.approach == "distribution":
            return [predict_from_distribution(self.dist) for _ in range(len(X))]
        elif self.approach == "equal":
            return [random.choice(self.unique_labels) for _ in range(len(X))]

Let's create a few datasets that we'll use to analyze how a predictor would work with each of those approaches. Here are all the datasets we'll create:

- 2 classes equally distributed
- 2 classes with 0 at 90% and 1 at 10%
- 3 classes equally distributed
- 3 classes with 0 at 90%, 1 at 9% and 2 at 1%
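
One way to build label sets like these from a list of class proportions is sketched below. It uses only the standard library; the helper name `make_labels` is hypothetical, and it assumes the proportions multiply out to whole counts that sum to `n`.

```python
import random


def make_labels(n, proportions, rng):
    """Build a shuffled list of labels; class k gets round(proportions[k] * n) entries."""
    counts = [round(p * n) for p in proportions]
    # Assumption: the rounded counts add up to n (true for the proportions used here)
    labels = [k for k, c in enumerate(counts) for _ in range(c)]
    rng.shuffle(labels)
    return labels


rng = random.Random(0)
example = make_labels(1000, [0.90, 0.09, 0.01], rng)
example_counts = [example.count(k) for k in range(3)]
print(example_counts)
```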

In [184]:
# We will create the labels for each of the listed datasets with length n
# Create the listed datasets as binary_equal, binary_unequal, trinary_equal and trinary_unequal
n = 15000
features = np.zeros((n,3))

binary_equal = np.concatenate( ([0]*7500, [1]*7500), axis = None)
binary_unequal = np.concatenate( ([0]*13500, [1]*1500), axis = None)
trinary_equal = np.concatenate( ([0]*5000, [1]*5000, [2]*5000), axis = None)
trinary_unequal = np.concatenate( ([0]*13500, [1]*1350, [2]*150), axis = None)

np.random.shuffle(binary_equal)
np.random.shuffle(binary_unequal)
np.random.shuffle(trinary_equal)
np.random.shuffle(trinary_unequal)


In [68]:
assert np.all(np.bincount(binary_equal) == np.array([7500,7500]))
assert np.all(np.bincount(binary_unequal) == np.array([13500,1500]))
assert np.all(np.bincount(trinary_equal) == np.array([5000,5000,5000]))
assert np.all(np.bincount(trinary_unequal) == np.array([13500,1350,150]))

In [69]:
datasets = [{
    "name": "Binary Classification Equally Distributed",
    "labels": binary_equal
},{
    "name": "Binary Classification 90:10",
    "labels": binary_unequal
},{
    "name": "3-Class Classification Equally Distributed",
    "labels": trinary_equal
},{
    "name": "3-Class Classification 90:9:1",
    "labels": trinary_unequal
}]

# Testing

Let's now test our naive classifiers on the above datasets. We will train and test on the full dataset. Because the model is not really a machine learning algorithm and this is just for educational purposes, that is not an issue here. We are only using this approach to learn what the naive model would have predicted, even on the data it trained on.
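
Before running the classifiers, it can help to work out what accuracy to expect. Assuming each prediction is made independently of the true label, the chance of a correct guess is the sum over classes of P(true = k) * P(predicted = k). The helper below is a hypothetical sketch of that arithmetic for the 90:10 binary dataset, not part of the assignment.

```python
def expected_accuracy(true_dist, pred_dist):
    """Probability that an independent guess matches the true label."""
    return sum(t * p for t, p in zip(true_dist, pred_dist))


true_dist = [0.9, 0.1]  # the 90:10 binary dataset

always_zero_acc = expected_accuracy(true_dist, [1.0, 0.0])  # always predict 0
distribution_acc = expected_accuracy(true_dist, true_dist)  # guess by class frequency
equal_acc = expected_accuracy(true_dist, [0.5, 0.5])        # guess uniformly

print(always_zero_acc, distribution_acc, equal_acc)
```

These come out to roughly 0.9, 0.82, and 0.5. On an imbalanced dataset, always guessing the majority class gives the highest accuracy of the three.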

In [202]:
# Create three classifiers that always predict 0, 1, and 2
# Name them always_zero, always_one and always_two respectively

always_zero = NaiveClassifier(approach="always", value = 0)
always_one = NaiveClassifier(approach="always", value = 1)
always_two = NaiveClassifier(approach="always", value = 2)

In [203]:
assert always_zero.approach=="always"
assert always_zero.value == 0
assert always_one.approach=="always"
assert always_one.value == 1
assert always_two.approach=="always"
assert always_two.value == 2

In [204]:
# Create a classifier that predicts the most frequent class
# Name it most_est
most_est = NaiveClassifier(approach="most")

In [205]:
assert most_est.approach=="most"
most_est.fit([0,0,0,0,0], [0,1,1,1,0])
assert most_est.predict([0,0,0]) == [1, 1, 1]

In [206]:
# Create a classifier that predicts based on the distribution of the classes
# Name it dist_est
dist_est = NaiveClassifier(approach="distribution")

In [207]:
assert dist_est.approach == "distribution"
dist_est.fit([0,0,0,0,0], [0,0,1,1,1])
random.seed(0)
assert sum(dist_est.predict([0,0,0,0,0])) == 4

In [208]:
# Create a classifier that predicts any of the classes with equal probability
# Name it equal_est
equal_est = NaiveClassifier(approach="equal")

In [209]:
assert equal_est.approach == "equal"
equal_est.fit([0,0,0,0,0], [0,1,1,1,1])
random.seed(0)
assert sum(equal_est.predict([0,0,0,0])) == 3

In [210]:
estimators = [
    {
        "name": "Always Zero",
        "estimator": always_zero
    },
    {
        "name": "Always One",
        "estimator": always_one
    },
    {
        "name": "Always Two",
        "estimator": always_two
    },
    {
        "name": "Most Common",
        "estimator": most_est
    },
    {
        "name": "Distribution Based",
        "estimator": dist_est
    },
    {
        "name": "Equally",
        "estimator": equal_est
    }
]

In [213]:
# For each dataset, apply each estimator and save the predictions as pred
for dataset in datasets:
    name = dataset["name"]
    labels = dataset["labels"]
    print("="*20)
    print(f"{name}")
    print("="*20)
    for est in estimators:
        estimator_name = est["name"]
        print("-"*20)
        print(f"Estimating with {estimator_name}")
        print("-"*20)
        # YOUR CODE HERE
        estimator = est["estimator"]
        estimator.fit(list(features), list(labels))
        # Predict from the features; only their count matters to a naive classifier
        pred = estimator.predict(features)
        print(f"Produced an accuracy score of {accuracy_score(labels, pred)} and the following report")
        print(classification_report(labels, pred))

Binary Classification Equally Distributed
--------------------
Estimating with Always Zero
--------------------
Produced an accuracy score of 0.5 and the following report
             precision    recall  f1-score   support

          0       0.50      1.00      0.67      7500
          1       0.00      0.00      0.00      7500

avg / total       0.25      0.50      0.33     15000

--------------------
Estimating with Always One
--------------------
Produced an accuracy score of 0.5 and the following report
             precision    recall  f1-score   support

          0       0.00      0.00      0.00      7500
          1       0.50      1.00      0.67      7500

avg / total       0.25      0.50      0.33     15000

--------------------
Estimating with Always Two
--------------------
Produced an accuracy score of 0.0 and the following report
             precision    recall  f1-score   support

          0       0.00      0.00      0.00      7500
          1       0.00      0.00    



Produced an accuracy score of 0.5056 and the following report
             precision    recall  f1-score   support

          0       0.51      0.50      0.50      7500
          1       0.51      0.51      0.51      7500

avg / total       0.51      0.51      0.51     15000

Binary Classification 90:10
--------------------
Estimating with Always Zero
--------------------
Produced an accuracy score of 0.9 and the following report
             precision    recall  f1-score   support

          0       0.90      1.00      0.95     13500
          1       0.00      0.00      0.00      1500

avg / total       0.81      0.90      0.85     15000

--------------------
Estimating with Always One
--------------------
Produced an accuracy score of 0.1 and the following report
             precision    recall  f1-score   support

          0       0.00      0.00      0.00     13500
          1       0.10      1.00      0.18      1500

avg / total       0.01      0.10      0.02     15000

--------

In [217]:
# Please describe your conclusions based on the above results
# You must write at least 300 characters
# This portion is worth 100 points (20% of CS)
# Save your answer to conclusions
# YOUR CODE HERE
conclusions = "Just because a given metric is high on a test data set does not mean that the model is actually a good representation of the data. Testing the model on a test data set is very important. If, for example, the F1-score is very different between the training and the test data, this is an indication that the model is suboptimal. Moreover, it is important to understand the spread of the data (i.e., how many labels we have for each class) to assess model validity."
print(conclusions)

Just because a given metric is high on a test data set does not mean that the model is actually a good representation of the data. Testing the model on a test data set is very important. If, for example, the F1-score is very different between the training and the test data, this is an indication that the model is suboptimal. Moreover, it is important to understand the spread of the data (i.e., how many labels we have for each class) to assess model validity.


In [218]:
assert len(conclusions) > 300

## Feedback

In [220]:
def feedback():
    """Provide feedback on the contents of this exercise
    
    Returns:
        string
    """
    return "This Case Study could have been a lot easier if there had been more initial direction about the expectations. For instance, the requirement that we use one package over another, or write our own functions in place of built-in ones, was not clearly specified at the start and resulted in a lot of wasted time. The overall goal of this Case Study, however, is important, and this was a good exercise in measuring baseline performance of machine learning models."