# Experiments on Version 2
## of Adaptive Dropout Probabilities via Tsetlin Machine-inspired Majority Voting

In this notebook I introduce the new version, and continue experiments on "tsetlin" dropout.

For background and history, check out the notebook titled "Initial Experiments". 

Version 2! This version addresses some of the problems that the first draft had, namely speed and variance. As a refresher, the initial version assigned each unit/input an individual dropout probability that was updated after each gradient update. The probability of being included in the network would decrease as the network's accuracy increased, and vice-versa. Specifically, for each classification in a batch, the probabilities of the units that were used in the classification were updated by the following formulae:

$$
p_{t+1} = p_t - \alpha (p_t - 1) \exp(-p_t)
$$

for an incorrect classification, and

$$
p_{t+1} = p_t - \alpha \, p_t \exp(p_t - 1)
$$

for a correct classification, where $p_t$ is a unit's probability of being included at step $t$ and $\alpha$ is the step size (`step_discount` in the code below). This remains true for version 2. However, instead of updating the probabilities for every classification in the batch, version 2 counts the correct and incorrect classifications and updates the probabilities once, in the direction of the majority (on an exact tie, nothing changes). This cuts the work to one update per batch rather than one per sample, and reduces variance, while keeping the spirit of the original version. The key philosophy is to adapt the dropout probabilities w.r.t. network performance, and this version still does that! If I really wanted to lean into the Tsetlin Machine inspiration, I could add a hyperparameter that changes the threshold from a simple majority to some other ratio, but for now, let's look at this version.
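To make the arithmetic concrete, here is a minimal plain-Python sketch of the two formulas and the majority vote applied to a single probability. It is only an illustration, not the notebook's implementation: the function names `update_inclusion_prob` and `majority_vote`, the batch of flags, and the values `0.5` and `0.1` are all made up for the example.

```python
import math


def update_inclusion_prob(p, alpha, majority_correct):
    """Apply one update to an inclusion probability p (hypothetical helper)."""
    if majority_correct:
        # Correct majority: p shrinks, so the unit gets dropped more often.
        return p - alpha * p * math.exp(p - 1)
    # Incorrect majority: (p - 1) is negative, so subtracting the term raises p.
    return p - alpha * (p - 1) * math.exp(-p)


def majority_vote(correct_flags):
    """Positive if most predictions were correct, negative if most were wrong, 0 on a tie."""
    return 2 * sum(correct_flags) - len(correct_flags)


flags = [True, True, False, True]  # hypothetical batch: 3 of 4 correct
vote = majority_vote(flags)        # 2*3 - 4 = 2, so the majority was correct
p_old = 0.5
# A tie (vote == 0) leaves the probability untouched.
p_new = p_old if vote == 0 else update_inclusion_prob(p_old, 0.1, vote > 0)
print(vote, p_old, p_new)
```

Because the batch was mostly correct, `p_new` ends up slightly below `p_old`; flipping the majority to incorrect would push it slightly above instead.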

In code, it looks like this: 
Note the addition of the clipping (tor

In [1]:
import torch
import numpy as np
import pandas as pd
from torch import nn
import matplotlib.pyplot as plt


class TsetlinUnitDropout(nn.Module):  # Drops units rather than weights
    def __init__(self, in_size, init_prob, step_discount, clip=False, clip_min=None, clip_max=None):
        super().__init__()
        assert 0 < init_prob <= 1  # inclusion probability; 0 would make forward() divide by zero
        self.discount = step_discount
        self.pmin = clip_min
        self.pmax = clip_max
        self.clip = clip
        self.probabilities = torch.nn.Parameter(torch.tensor(np.full((1, in_size), init_prob, dtype="float32")),
                                                requires_grad=False)  # Should these be no grad parameters?
        self.not_dropped = []  # this one too? Also is there something better than empty list?

    def forward(self, inp):
        assert inp.shape[1] == self.probabilities.shape[1]
        dev = self.probabilities.device
        comparator = torch.rand(inp.shape[1], device=dev)
        # Compare random numbers to the probability of being included
        mask = (comparator < self.probabilities).float()
        # Assign not_dropped indices of prob tensor whose unit WAS included in network (These will be updated at step)
        self.not_dropped = mask.nonzero(as_tuple=True)  # To be indexed with
        return mask * inp / self.probabilities  # inverted dropout scaling

    def tsetlin_update(self, correct_list):  # This implementation penalizes (increases dropout prob) if correct
        batch_size = len(correct_list)
        # Majority vote: positive if most of the batch was correct, negative if most was wrong, 0 on a tie
        corr_distance = 2 * correct_list.sum() - batch_size
        p = self.probabilities[self.not_dropped]  # inclusion probs of the units used in the last forward pass
        if corr_distance > 0:  # decrease chance of inclusion (increase dropout prob) if correct
            self.probabilities[self.not_dropped] -= p * self.discount / torch.exp(1 - p)

        elif corr_distance < 0:  # update term is negative (-= - = +), i.e. increase chance of inclusion (decrease dropout prob)
            self.probabilities[self.not_dropped] -= self.discount * (p - 1) / torch.exp(p)

        if self.clip:
            torch.clamp_(self.probabilities, self.pmin, self.pmax)
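
`tsetlin_update` expects one correct/incorrect flag per sample in the batch. As a rough sketch of how such a tensor might be built from a classifier's outputs (the random `logits` and `labels`, the batch size of 8, and the name `dropout_layer` in the closing comment are assumptions for illustration, not part of the notebook):

```python
import torch

torch.manual_seed(0)
logits = torch.randn(8, 3)          # hypothetical batch: 8 samples, 3 classes
labels = torch.randint(0, 3, (8,))  # hypothetical ground-truth labels

# One boolean per sample: True where the predicted class matches the label.
correct = logits.argmax(dim=1) == labels

# The same majority statistic that tsetlin_update computes internally.
corr_distance = 2 * correct.sum().item() - len(correct)
print(correct.tolist(), corr_distance)

# In a training loop, after the gradient step, this would be followed by
# something like: dropout_layer.tsetlin_update(correct)
```

A boolean tensor works here because both `len()` and `.sum()` are defined on it, which is all the update method needs.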