# Model Uncertainty Estimation

Hasktorch 0.2.0.0

Wouldn't it be nice if the model also told us which of its predictions are unreliable? The good news is that this is possible, even on new, completely unseen data.
It is also simple to implement in practice.
A canonical example is a medical setting. By measuring model uncertainty,
a doctor can learn how reliable an AI-assisted diagnosis of their patient is.
This allows the doctor to make a better-informed decision about whether to trust
the model, and potentially save someone's life.

Today we build upon [Day 7](https://penkovsky.com/neural-networks/day7) and we continue our journey with Hasktorch:

1. We will introduce a Dropout layer.
1. We will compute on a graphics processing unit (GPU).
1. We will also show how to load and save models.
1. We will train with the [Adam](https://penkovsky.com/neural-networks/day2) optimizer.
1. And finally we will talk about model uncertainty estimation.

## Dropout Layer

Neural networks, like any other model with many parameters, tend to overfit. By overfitting I mean
"[fail to fit to additional data or predict future observations reliably](https://en.wikipedia.org/wiki/Overfitting)". Let us consider a classical example below.

<center>
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/1/19/Overfitting.svg/480px-Overfitting.svg.png" width="300" />
</center>
<!-- ![](https://upload.wikimedia.org/wikipedia/commons/1/19/Overfitting.svg) -->

The green line is a decision boundary created by an overfitted model.
We see that the model tries to memorize every possible data point.
However, it fails to generalize. To ameliorate the situation, we perform
so-called *regularization*, a technique that helps to prevent overfitting.
In the image above, the black line is the decision boundary of a regularized model.

One of the regularization techniques for artificial neural networks is called
[dropout](https://www.jmlr.org/papers/volume15/srivastava14a/srivastava14a.pdf)
or [dilution](https://en.wikipedia.org/wiki/Dilution_(neural_networks)).
Its principle of operation is quite simple.
During neural network training, we randomly
disconnect a fraction of the neurons, each with some probability.
It turns out that training with dropout results in more reliable
neural network models.
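To make the principle tangible, here is a minimal sketch of dropout on a plain list of activations. It is not how Hasktorch implements the layer: the generator `uniforms` is a hypothetical stand-in for a real random number generator, and `dropoutList` is a name I introduce only for this illustration. I assume the common "inverted" variant, in which the surviving activations are scaled by `1 / (1 - p)` so that the expected output stays the same.

```haskell
import Data.List (unfoldr)

-- Hypothetical helper: a tiny linear congruential generator that yields
-- pseudo-random numbers in [0, 1). It stands in for a real RNG.
uniforms :: Int -> [Double]
uniforms = unfoldr step
  where
    step s =
      let s' = (1103515245 * s + 12345) `mod` 2147483648
      in Just (fromIntegral s' / 2147483648, s')

-- Inverted dropout on a list of activations: each value is zeroed with
-- probability p, and the survivors are scaled by 1 / (1 - p).
dropoutList :: Double -> Int -> [Double] -> [Double]
dropoutList p seed xs = zipWith mask (uniforms seed) xs
  where
    mask u x
      | u < p     = 0
      | otherwise = x / (1 - p)

main :: IO ()
main = do
  let xs = replicate 10 1.0
  print (dropoutList 0.5 42 xs)   -- every entry is either 0 or 2
  print (dropoutList 0.0 42 xs)   -- p = 0 leaves the input unchanged
```

Passing the seed explicitly keeps this sketch pure; a different seed gives a different mask.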

## A Neural Network with Dropout

The data structures `MLP` and `MLPSpec` remain unchanged.

In [1]:
{-# LANGUAGE DeriveAnyClass #-}
{-# LANGUAGE DeriveGeneric #-}
{-# LANGUAGE FlexibleContexts #-}
{-# LANGUAGE MultiParamTypeClasses #-}
{-# LANGUAGE RecordWildCards #-}
{-# LANGUAGE ScopedTypeVariables #-}
{-# LANGUAGE TypeApplications #-} 

import Control.Monad ( forM_, forM, when, (<=<) )
import Control.Monad.Cont ( ContT (..) )
import GHC.Generics
import Pipes hiding ( (~>) )
import qualified Pipes.Prelude as P
import Text.Printf ( printf
                   , PrintfArg )
import Torch
import Torch.Serialize
import Torch.Typed.Vision ( initMnist, MnistData )
import qualified Torch.Vision as V
import Torch.Lens ( HasTypes (..)
                  , over 
                  , types )
import Prelude hiding ( exp )

data MLP = MLP
  { fc1 :: Linear,
    fc2 :: Linear,
    fc3 :: Linear
  }
  deriving (Generic, Show, Parameterized)

data MLPSpec = MLPSpec
  { i :: Int,
    h1 :: Int,
    h2 :: Int,
    o :: Int
  }
  deriving (Show, Eq)
  
instance Randomizable MLPSpec MLP where
  sample MLPSpec {..} =
    MLP
      <$> sample (LinearSpec i h1)
      <*> sample (LinearSpec h1 h2)
      <*> sample (LinearSpec h2 o)

-- Left-to-right function composition: (f ~> g) x = g (f x)
(~>) :: (a -> b) -> (b -> c) -> a -> c
f ~> g = g . f

However, we will need to modify the `mlp` network to include
a Dropout layer. If we inspect the type
`dropout :: Double -> Bool -> Tensor -> IO Tensor`,
we see that it accepts three arguments:
a `Double` probability of dropout,
a `Bool` that turns this layer on or off,
and a data `Tensor`.
Typically, we turn dropout on during training
and off during inference.

However, the biggest distinction between, say, the `relu`
function and `dropout` is that `relu` is a *pure* function,
i.e. it does not have any side effects.
This means that every time we call a pure function with the same arguments,
the result will be the same.
This is not the case with `dropout`, which relies on an
(external) random number generator and therefore returns
a new result each time.
That is why its outcome is an `IO Tensor`.

One has to pay particular attention to such `IO` functions
because they can change the state of the external world.
This can be printing text on the screen,
deleting a file, or launching missiles.
Typically, we prefer to keep functions pure whenever possible,
as purity makes a program easier to reason about:
it is child's play to refactor (reorganize) a program
consisting only of pure functions.

I find the so-called *do-notation* to be the most natural
way to combine both pure functions and those with side-effects.
The pure equations can be grouped under `let` keyword(s),
while the side-effects are summoned with a special `<-` notation.
This is how we integrate `dropout` in `mlp`.
Note that now the outcome of `mlp` also becomes an `IO Tensor`.
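Here is a small sketch of the same pattern that does not depend on Hasktorch. The names `relu'`, `noisy`, and `layer` are hypothetical and exist only for this illustration; `noisy` uses a mutable counter as a stand-in for an external random number generator.

```haskell
import Data.IORef (IORef, newIORef, readIORef, writeIORef)

-- A pure function: the same input always gives the same output.
relu' :: Double -> Double
relu' = max 0

-- An effectful function: it reads and updates a counter that lives
-- outside of the function, so two calls may return different results.
noisy :: IORef Int -> Double -> IO Double
noisy counter x = do
  n <- readIORef counter
  writeIORef counter (n + 1)
  return (if even n then x else 0)

layer :: IORef Int -> Double -> IO Double
layer counter x0 = do
  let x1 = relu' x0          -- pure step, bound with `let`
  x2 <- noisy counter x1     -- effectful step, bound with `<-`
  let x3 = 2 * x2            -- pure again
  return x3

main :: IO ()
main = do
  counter <- newIORef 0
  a <- layer counter 3
  b <- layer counter 3
  print (a, b)   -- (6.0,0.0): same input, different results
```

Because `layer` calls an effectful function, its own result type has to be `IO Double`, just as `mlp` returns an `IO Tensor`.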

In [2]:
mlp :: MLP -> Bool -> Tensor -> IO Tensor
mlp MLP {..} isStochastic x0 = do
  -- This subnetwork encapsulates the composition
  -- of pure functions
  let sub1 =
          linear fc1
          ~> relu

          ~> linear fc2
          ~> relu

  -- The dropout is applied to the output
  -- of the subnetwork
  x1 <- dropout
          0.1   -- Dropout probability
          isStochastic  -- Activate Dropout when in stochastic mode
          (sub1 x0)  -- Apply dropout to
                     -- the output of `relu` in layer 2
              
  -- Another linear layer
  let x2 = linear fc3 x1
  
  -- Finally, logSoftmax, which is numerically more stable
  -- compared to simple log(softmax(x2))
  return $ logSoftmax (Dim 1) x2
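The remark about numerical stability can be illustrated on plain lists. The sketch below is my own and says nothing about how the Torch library computes `logSoftmax` internally; it only contrasts the naive formula with the usual log-sum-exp trick. Both function names are hypothetical.

```haskell
-- Naive version: exponentiate first, then take the logarithm.
naiveLogSoftmax :: [Double] -> [Double]
naiveLogSoftmax xs = map (\x -> log (exp x / s)) xs
  where s = sum (map exp xs)

-- Stable version: subtract the maximum before exponentiating
-- (the log-sum-exp trick), so that `exp` never overflows.
stableLogSoftmax :: [Double] -> [Double]
stableLogSoftmax xs = map (\x -> x - m - logSum) xs
  where
    m = maximum xs
    logSum = log (sum (map (\x -> exp (x - m)) xs))

main :: IO ()
main = do
  let small = [1, 2, 3]
      large = [1000, 1001, 1002]
  print (naiveLogSoftmax small)
  print (stableLogSoftmax small)   -- agrees with the naive version up to rounding
  print (naiveLogSoftmax large)    -- exp 1000 overflows to Infinity, giving NaN
  print (stableLogSoftmax large)   -- same values as for `small`
```

Shifting all inputs by a constant does not change the softmax, which is why `large` and `small` give the same stable result.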

## Computing on a GPU

To transfer data onto a GPU, we use `toDevice :: ... => Device -> a -> a`.
Below are helper functions that traverse data structures containing tensors
(e.g. `MLP`) and move those tensors between devices.

In [3]:
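-- Move every tensor inside a data structure to the given device.
-- Note: the `dtype'` argument is accepted but not used here,
-- so tensors keep their original element type.
toLocalModel :: forall a. HasTypes a Tensor => Device -> DType -> a -> a
toLocalModel device' dtype' = over (types @Tensor @a) (toDevice device')

-- Move every tensor inside a data structure back to the CPU.
fromLocalModel :: forall a. HasTypes a Tensor => a -> a
fromLocalModel = over (types @Tensor @a) (toDevice (Device CPU 0))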

Below is a shortcut that transfers data to the `cuda:0` device, assuming the `Float` type.

In [4]:
toLocalModel' = toLocalModel (Device CUDA 0) Float 

The training loop is almost the same as in the previous post, except for two changes.
First, we move the training data to the GPU with `toLocalModel'`
(assuming that the model has already been moved to the GPU).
Second, `predic <- mlp model isTrain input` is now an `IO` action.

In [5]:
trainLoop :: Optimizer o => MLP -> LearningRate -> o -> ListT IO (Tensor, Tensor) -> IO MLP
trainLoop model lr optimizer = P.foldM step begin done . enumerateData
  where
    isTrain = True
    step :: MLP -> ((Tensor, Tensor), Int) -> IO MLP
    step model args = do
      let ((input, label), iter) = toLocalModel' args
      predic <- mlp model isTrain input
      let loss = nllLoss' label predic
      -- Print loss every 100 batches
      when (iter `mod` 100 == 0) $ do
        putStrLn
          $ printf "Batch: %d | Loss: %.2f" iter (asValue loss :: Float)
      (newParam, _) <- runStep model optimizer loss lr
      return newParam
    done = pure
    begin = pure model
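The shape of this loop, a monadic fold whose state is the model, can be tried out without Pipes or Hasktorch. Below is a minimal sketch with `foldM` from `Control.Monad`. The toy objective, the learning rate, and the names `lossAndGrad`, `step`, and `trainToy` are assumptions made for this illustration only.

```haskell
import Control.Monad (foldM, when)

-- Toy objective (an assumption for illustration): fit w so that w * x is
-- close to y. Returns the squared error and its derivative w.r.t. w.
lossAndGrad :: Double -> (Double, Double) -> (Double, Double)
lossAndGrad w (x, y) =
  let err = w * x - y
  in (err * err, 2 * err * x)

-- One step of the fold: takes the current state and an enumerated sample,
-- and returns the updated state inside IO (so that it can print).
step :: Double -> Double -> ((Double, Double), Int) -> IO Double
step lr w (sample, iter) = do
  let (loss, grad) = lossAndGrad w sample
  -- Print the loss every 100 samples
  when (iter `mod` 100 == 0) $
    putStrLn ("Batch: " ++ show iter ++ " | Loss: " ++ show loss)
  return (w - lr * grad)

trainToy :: Double -> [(Double, Double)] -> IO Double
trainToy w0 samples = foldM (step 0.1) w0 (zip samples [0 ..])

main :: IO ()
main = do
  let samples = take 300 (cycle [(1, 2), (2, 4), (0.5, 1)])  -- y = 2 * x
  w <- trainToy 0 samples
  print w   -- close to 2
```

As in `trainLoop`, the fold threads the updated parameters from one batch to the next, and the step lives in `IO` so that it can report progress.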

We slightly modify the `train` function to use the Adam optimizer, created with `mkAdam`:

1. `0` is the iteration number, which is then increased by the optimizer.
2. We provide the `beta1` and `beta2` values.
3. `flattenParameters net0` is needed to get the shapes of the momenta of the trained parameters. See also [Day 2](https://penkovsky.com/neural-networks/day2) for more details.
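To see why the optimizer needs an iteration counter, two beta values, and per-parameter momenta, here is a sketch of one Adam update for a single scalar parameter. This is the textbook update rule written in plain Haskell, not Hasktorch's implementation; the type `AdamState`, the function `adamStep`, and the constant `eps = 1e-8` are assumptions made for this illustration.

```haskell
-- State kept by Adam for one scalar parameter: the iteration counter
-- and the first and second moment estimates.
data AdamState = AdamState
  { iter :: Int
  , m1   :: Double
  , m2   :: Double
  } deriving (Show)

-- One Adam update for a scalar parameter `w` with gradient `g`.
adamStep :: Double -> Double -> Double -> AdamState -> Double -> Double
         -> (Double, AdamState)
adamStep lr beta1 beta2 (AdamState t m v) w g = (w', AdamState t' m' v')
  where
    eps  = 1e-8
    t'   = t + 1
    m'   = beta1 * m + (1 - beta1) * g
    v'   = beta2 * v + (1 - beta2) * g * g
    mHat = m' / (1 - beta1 ^ t')      -- bias correction uses the counter
    vHat = v' / (1 - beta2 ^ t')
    w'   = w - lr * mHat / (sqrt vHat + eps)

main :: IO ()
main = do
  -- Minimize f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
  let grad w = 2 * (w - 3)
      go (w, st) _ = adamStep 0.1 0.9 0.999 st w (grad w)
      (wFinal, stFinal) = foldl go (0, AdamState 0 0 0) [1 .. 500 :: Int]
  print wFinal          -- approaches 3
  print (iter stFinal)  -- 500
```

For a tensor-valued parameter, the two moments have the same shape as the parameter itself, which is why the optimizer is initialized from the flattened parameters.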

<!--
We also reduced the learning rate to 1e-5 because of Adam instability.
I guess that with batch normalization this won't be an issue anymore.
-->

In [6]:
train :: V.MNIST IO -> Int -> MLP -> IO MLP
train trainMnist epochs net0 =
    foldLoop net0 epochs $ \net' _ ->
      runContT (streamFromMap dsetOpt trainMnist)
      $ trainLoop net' 1e-5 optimizer . fst
  where
    dsetOpt = datasetOpts workers
    workers = 2
    -- Adam optimizer
    optimizer = mkAdam 0 beta1 beta2 (flattenParameters net0)
    beta1 = 0.9
    beta2 = 0.999

Here is a function to get model accuracy:

In [7]:
accuracy :: MLP -> ListT IO (Tensor, Tensor) -> IO Float
accuracy net = P.foldM step begin done . enumerateData
  where
    step :: (Int, Int) -> ((Tensor, Tensor), Int) -> IO (Int, Int)
    step (ac, total) args = do
      let ((input, labels), _) = toLocalModel' args
      -- Compute predictions
      predic <- let stochastic = False
                in argmax (Dim 1) RemoveDim 
                     <$> mlp net stochastic input
    
      let correct = asValue
                        -- Sum those elements
                        $ sumDim (Dim 0) RemoveDim Int64
                        -- Find correct predictions
                        $ predic `eq` labels
                        
      let batchSize = head $ shape predic
      return (ac + correct, total + batchSize)
      
    -- When done folding, compute the accuracy
    done (ac, total) = pure $ fromIntegral ac / fromIntegral total
    
    -- Initial errors and totals
    begin = pure (0, 0)
    
testAccuracy :: V.MNIST IO -> MLP -> IO Float
testAccuracy testStream net = do
    runContT (streamFromMap (datasetOpts 2) testStream) $ accuracy net . fst
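The counting logic of `accuracy` can be checked on plain lists. The following sketch mirrors the fold over batches that accumulates a pair of correct and total counts. The names `argmaxList` and `accuracyList` and the toy scores are hypothetical.

```haskell
import Data.List (foldl', maximumBy)
import Data.Ord (comparing)

-- Index of the largest element, i.e. the predicted class.
argmaxList :: [Double] -> Int
argmaxList xs = fst (maximumBy (comparing snd) (zip [0 ..] xs))

-- Fold over batches, accumulating (correct, total) counts.
-- Each batch pairs per-sample class scores with integer labels.
accuracyList :: [([[Double]], [Int])] -> Double
accuracyList batches = fromIntegral correct / fromIntegral total
  where
    (correct, total) = foldl' step (0 :: Int, 0 :: Int) batches
    step (ac, n) (scores, labels) =
      let preds = map argmaxList scores
          hits  = length (filter id (zipWith (==) preds labels))
      in (ac + hits, n + length labels)

main :: IO ()
main = do
  let batch1 = ([[0.1, 0.9], [0.8, 0.2]], [1, 0])  -- both correct
      batch2 = ([[0.3, 0.7], [0.6, 0.4]], [0, 0])  -- one correct
  print (accuracyList [batch1, batch2])  -- 0.75
```

Accumulating counts rather than per-batch accuracies keeps the result correct even when the last batch is smaller than the others.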

Below we provide the MLP specification: number of neurons in each layer.

In [8]:
spec = MLPSpec 784 300 50 10  

## Saving and Loading the Model

Before we can save the model, we have to make the weight tensors dependent first:

In [9]:
save' :: MLP -> FilePath -> IO ()
save' net = save (map toDependent . flattenParameters $ net)

The inverse is true for loading a model. We also replace the
parameters in a newly generated model with the ones we
have just loaded:

In [10]:
load' :: FilePath -> IO MLP
load' fpath = do
  params <- mapM makeIndependent <=< load $ fpath
  net0 <- sample spec
  return $ replaceParameters net0 params
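The pattern used by `save'` and `load'` (flatten the parameters, write them out, read them back, and put them into a freshly created model) can be sketched with a toy record and plain text files. Everything in this sketch is hypothetical, including the `Toy` type, the helper names, and the file name `toy-weights.txt`; the serialization via `show` and `read` only imitates the idea and is not the format Hasktorch uses.

```haskell
-- A toy model with two parameter groups (hypothetical, for illustration).
data Toy = Toy { weights :: [Double], bias :: [Double] }
  deriving (Show, Eq)

-- Flatten all parameters into one list, in a fixed order.
flatten :: Toy -> [[Double]]
flatten (Toy w b) = [w, b]

-- Put loaded parameters into an existing model of the same layout.
replace :: Toy -> [[Double]] -> Toy
replace _ [w, b] = Toy w b
replace t _      = t   -- unexpected layout: keep the model unchanged

saveToy :: FilePath -> Toy -> IO ()
saveToy path = writeFile path . show . flatten

loadToy :: FilePath -> IO Toy
loadToy path = do
  params <- read <$> readFile path
  let fresh = Toy [0, 0] [0]      -- plays the role of `sample spec`
  return (replace fresh params)

main :: IO ()
main = do
  let trained = Toy [0.5, -1.25] [2]
  saveToy "toy-weights.txt" trained
  restored <- loadToy "toy-weights.txt"
  print (restored == trained)   -- True
```

The order in which parameters are flattened must match the order in which they are replaced, otherwise weights end up in the wrong layers.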

Finally, load the MNIST data:

In [11]:
(trainData, testData) <- initMnist "data"



To train a new model, remove `{-` and `-}` to uncomment the lines below.

In [12]:
{-
-- A train "loader"
trainMnistStream = V.MNIST { batchSize = 256, mnistData = trainData }
net0 <- toLocalModel' <$> sample spec

epochs = 5
net' <- train trainMnistStream epochs net0

-- Saving the trained model
save' net' "weights.bin"
-}

Load a pretrained model:

In [13]:
net <- load' "weights.bin"



Verify the model's accuracy:

In [14]:
-- A test "loader"
testMnistStream = V.MNIST { batchSize = 1000, mnistData = testData }

ac <- testAccuracy testMnistStream net
putStrLn $ "Accuracy " ++ show ac

Accuracy 0.9245

The accuracy is not outstanding, but it can be improved by introducing [batch norm](https://penkovsky.com/neural-networks/day4), [convolutional layers](https://penkovsky.com/neural-networks/day5), and by training longer. For our discussion of model uncertainty estimation, this accuracy is good enough.

## Predictive Entropy

The model uncertainty is calculated as:

$$\mathbb{H}(y|\mathbf{x}) = -\sum_c p(y = c|\mathbf{x}) \log p(y = c|\mathbf{x}),$$

where $y$ is the label, $\mathbf{x}$ is the input image, $c$ is the class, and $p$ is the probability.

We call $\mathbb{H}$ the [predictive entropy](https://towardsdatascience.com/2-easy-ways-to-measure-your-image-classification-models-uncertainty-1c489fefaec8). It is the very dropout
technique that helps us to estimate those uncertainties.
All we need to do is to collect several predictions in the stochastic mode
(i.e. with dropout enabled), average them,
and apply the formula above.
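To make the recipe concrete: with $T$ stochastic forward passes, the class probability that enters the formula is estimated by averaging the individual softmax outputs. The symbols $T$ and $\hat{\omega}_t$ (the dropout mask sampled in pass $t$) are my own notation and do not appear in the code.

$$p(y = c|\mathbf{x}) \approx \frac{1}{T} \sum_{t=1}^{T} p(y = c|\mathbf{x}, \hat{\omega}_t)$$

Two limiting cases help to read the numbers. If all the probability mass sits on a single class, then $\mathbb{H} = 0$. If the mass is spread evenly over $C$ classes, then

$$\mathbb{H} = -\sum_{c=1}^{C} \frac{1}{C} \log \frac{1}{C} = \log C,$$

which, with the natural logarithm, gives $\log 2 \approx 0.69$ for an even split between two classes and $\log 10 \approx 2.30$ for ten equally likely digits.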

In [15]:
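predictiveEntropy :: Tensor -> Float
predictiveEntropy predictions =
  let -- Tiny constant that keeps the logarithm finite for zero probabilities
      epsilon = 1e-45
      -- Average the softmax outputs over the stochastic forward passes
      a = meanDim (Dim 0) RemoveDim Float predictions
      b = Torch.log $ a + epsilon
  in asValue $ negate $ sumAll $ a * b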
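The same computation can be written for plain lists, which makes it easy to check by hand. This is a sketch with hypothetical names (`meanProbs`, `entropyOfMean`); instead of adding a small epsilon, it skips zero probabilities, which contribute nothing to the sum.

```haskell
import Data.List (transpose)

-- Mean class probabilities over several stochastic forward passes.
-- Each inner list holds the softmax output of one pass.
meanProbs :: [[Double]] -> [Double]
meanProbs passes = map mean (transpose passes)
  where mean xs = sum xs / fromIntegral (length xs)

-- Predictive entropy of the averaged distribution (natural logarithm).
-- Zero probabilities contribute nothing, so they are skipped.
entropyOfMean :: [[Double]] -> Double
entropyOfMean passes =
  sum [negate (p * log p) | p <- meanProbs passes, p > 0]

main :: IO ()
main = do
  let confident = [[1, 0, 0], [1, 0, 0]]   -- every pass agrees
      undecided = [[1, 0, 0], [0, 1, 0]]   -- the passes disagree
  print (entropyOfMean confident)   -- zero: no uncertainty
  print (entropyOfMean undecided)   -- log 2, about 0.693
```

Note that each pass in `undecided` is fully confident on its own; the uncertainty only becomes visible after averaging over the passes.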

## Visualizing Softmax Predictions

To get a better feeling for what the model outputs look like,
it would be nice to visualize the softmax output
as a histogram or a bar chart.

In [16]:
-- Barchart inspired by https://github.com/morishin/ascii-horizontal-barchart/blob/master/src/chart.js
bar :: Floating a => RealFrac a => PrintfArg a => [String] -> [a] -> IO ()
bar lab xs = forM_ ys putStrLn
  where
    ys = let lab' = map (appendSpaces maxLen . Prelude.take maxLabelLen) lab
         in zipWith3 (printf "%s %s %.2f") lab' (showBar xs) xs
    appendSpaces maxN s = let l = length s
                          in s ++ replicate (maxN - l) ' '
    maxLen = Prelude.min maxLabelLen $ _findmax . map length $ lab
    maxLabelLen = 15
    
showBar :: Floating a => RealFrac a => [a] -> [String]
showBar xs =
  let maxVal = _findmax xs
      maxBarLen = 50
  in map (drawBar maxBarLen maxVal) xs

-- | Formats a bar string
--
-- >>> drawBar 100 100 1
-- "▉"
-- >>> drawBar 100 100 1.5
-- "▉▋"
-- >>> drawBar 100 100 2
-- "▉▉"
drawBar :: Floating a => RealFrac a => a -> a -> a -> String
drawBar maxBarLen maxValue value = bar1
  where 
    barLength = value * maxBarLen / maxValue
    wholeNumberPart = Prelude.floor barLength
    fractionalPart = barLength - fromIntegral wholeNumberPart
    
    bar0 = replicate wholeNumberPart $ _frac _maxFrac
    bar1 = if fractionalPart > 0
      then bar0 ++ [_frac $ Prelude.floor $ fractionalPart * (_maxFrac + 1)]
      else bar0 ++ ""
      
    _frac 0 = '▏'
    _frac 1 = '▎'
    _frac 2 = '▍'
    _frac 3 = '▋'
    _frac 4 = '▊'
    _frac _ = '▉'

    _maxFrac = 5
      
_findmax = foldr1 (\x y -> if x >= y then x else y)

For instance:

In [17]:
bar ["apples", "oranges", "kiwis"] [50, 100, 25]

apples  ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉ 50.00
oranges ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉ 100.00
kiwis   ▉▉▉▉▉▉▉▉▉▉▉▉▋ 25.00

Now we would like to display an image, its predictive entropy, and the softmax output,
followed by the prediction and the ground truth.
To transform logSoftmax into softmax, we use the following identity:

$$e^{\ln(\mathrm{softmax}(x))} = \mathrm{softmax}(x),$$

that is `softmax = exp . logSoftmax`.
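A quick check of this identity on plain lists is sketched below (`argmaxList` is a hypothetical helper). It also shows that, because `exp` is monotonic, the most likely class can be read from either representation.

```haskell
import Data.List (maximumBy)
import Data.Ord (comparing)

-- Index of the largest element, i.e. the most likely class.
argmaxList :: [Double] -> Int
argmaxList xs = fst (maximumBy (comparing snd) (zip [0 ..] xs))

main :: IO ()
main = do
  let logProbs = map log [0.1, 0.7, 0.2]   -- what a logSoftmax layer returns
      probs    = map exp logProbs          -- back to probabilities
  print (sum probs)                        -- 1 up to rounding
  print (argmaxList logProbs)              -- 1
  print (argmaxList probs)                 -- 1, the same class
```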

In [18]:
displayImage :: MLP -> (Tensor, Tensor) -> IO ()
displayImage model (testImg, testLabel) = do
  let repeatN = 20
      stochastic = True
  preds <- forM [1..repeatN] $ \_ -> exp  -- logSoftmax -> softmax
                                     <$> mlp model stochastic testImg
  pred0 <- mlp model (not stochastic) testImg
  let entropy = predictiveEntropy $ Torch.cat (Dim 0) preds
  -- Select only images with high entropy
  when (entropy > 0.9) $ do
      V.dispImage testImg
      putStr "Entropy "
      print entropy
      -- exp . logSoftmax = softmax
      bar (map show [0..9]) (asValue $ flattenAll $ exp pred0 :: [Float])
      putStrLn $ "Model        : " ++ (show . argmax (Dim 1) RemoveDim . exp $ pred0)
      putStrLn $ "Ground Truth : " ++ show testLabel

We show only those images the model is uncertain about, i.e. those with entropy > 0.9.

In [19]:
testMnistStream = V.MNIST {batchSize = 1, mnistData = testData}
forM_ [0 .. 200] $ displayImage (fromLocalModel net) <=< getItem testMnistStream

              
              
     +%       
     %        
     *        
    #-  +%%=  
    %  %%  %  
    % %+   #  
    % %    *  
    %  % :%   
    #*:=%#    
     -%=.     
              
              
Entropy 1.044228
0 ▉▏ 0.01
1 ▏ 0.00
2 ▋ 0.01
3 ▏ 0.00
4 ▉ 0.01
5 ▍ 0.00
6 ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉ 0.70
7 ▏ 0.00
8 ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▎ 0.21
9 ▉▉▉▋ 0.05
Model        : Tensor Int64 [1] [ 6]
Ground Truth : Tensor Int64 [1] [ 6]
              
              
      .#%#.   
    %%+:      
     %        
     %..      
    ##-#%.    
         -%   
          :%  
           +  
    -     .%  
    @%+*%%+   
              
              
Entropy 1.2909155
0 ▏ 0.00
1 ▏ 0.00
2 ▍ 0.00
3 ▉▉▉▉▉▉▉▉ 0.07
4 ▏ 0.00
5 ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▍ 0.44
6 ▏ 0.00
7 ▍ 0.00
8 ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉ 0.47
9 ▉▏ 0.01
Model        : Tensor Int64 [1] [ 8]
Ground Truth : Tensor Int64 [1] [ 5]
              
              
              


Reflecting on the softmax outputs above, we can state that

1. The softmax output alone is not enough to estimate the model uncertainty. We can observe wrong predictions even when the margin between the top and second-best guess is large.
2. Sometimes the prediction and the ground truth coincide. So why is the entropy high? We need to inspect such cases in more detail.

To illustrate the last point, let us take a closer look at a case with high entropy. By running several realizations of the stochastic model, we can check whether the model has any "doubt", i.e. whether it selects different answers.

In [20]:
displayImage' :: MLP -> (Tensor, Tensor) -> IO ()
displayImage' model (testImg, testLabel) = do
  let repeatN = 10
  pred' <- forM [1..repeatN] $ \_ -> exp  -- logSoftmax -> softmax
                                     <$> mlp model True testImg
  pred0 <- mlp model False testImg
  let entropy = predictiveEntropy $ Torch.cat (Dim 0) pred'

  V.dispImage testImg
  putStr "Entropy "
  print entropy
  forM_ pred' ( \pred ->
      putStrLn "" 
      >> bar (map show [0..9]) (asValue $ flattenAll pred :: [Float]) )
  putStrLn $ "Model        : " ++ (show . argmax (Dim 1) RemoveDim . exp $ pred0)
  putStrLn $ "Ground Truth : " ++ show testLabel
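Besides plotting every realization, the disagreement between stochastic passes can be summarized as a vote count: how often did each class win? The sketch below works on plain lists. The helper names and the probabilities in `main` are made up for illustration and are not outputs of the model.

```haskell
import Data.List (group, maximumBy, sort)
import Data.Ord (comparing)

-- Index of the largest element, i.e. the winning class of one pass.
argmaxList :: [Double] -> Int
argmaxList xs = fst (maximumBy (comparing snd) (zip [0 ..] xs))

-- Count how often each class wins across the stochastic passes.
-- Returns (class, number of wins) pairs, sorted by class.
votes :: [[Double]] -> [(Int, Int)]
votes passes =
  [ (c, length cs) | cs@(c : _) <- group (sort (map argmaxList passes)) ]

main :: IO ()
main = do
  -- Made-up softmax outputs of four passes over three classes
  let passes =
        [ [0.90, 0.10, 0.00]
        , [0.74, 0.20, 0.06]
        , [0.04, 0.86, 0.10]
        , [0.30, 0.12, 0.58]
        ]
  print (votes passes)   -- [(0,2),(1,1),(2,1)]
```

A single class collecting all the votes corresponds to the low-entropy case, while votes spread over several classes indicate that the model is in doubt.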

In [21]:
(displayImage' (fromLocalModel net) <=< getItem testMnistStream) 11

              
              
     +%       
     %        
     *        
    #-  +%%=  
    %  %%  %  
    % %+   #  
    % %    *  
    %  % :%   
    #*:=%#    
     -%=.     
              
              
Entropy 1.1085687

0 ▎ 0.00
1 ▏ 0.00
2 ▏ 0.00
3 ▏ 0.00
4 ▏ 0.00
5 ▏ 0.00
6 ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉ 0.90
7 ▏ 0.00
8 ▉▉▉▉▉▍ 0.10
9 ▏ 0.00

0 ▋ 0.01
1 ▏ 0.00
2 ▎ 0.00
3 ▏ 0.00
4 ▋ 0.01
5 ▎ 0.00
6 ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉ 0.74
7 ▏ 0.00
8 ▉▉▉▉▉▉▉▉▉▉▉▉▉▍ 0.20
9 ▉▉▋ 0.04

0 ▋ 0.01
1 ▏ 0.00
2 ▏ 0.00
3 ▎ 0.01
4 ▉▉▉▏ 0.05
5 ▏ 0.00
6 ▉▉▎ 0.04
7 ▏ 0.00
8 ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉ 0.86
9 ▉▎ 0.02

0 ▋ 0.01
1 ▏ 0.00
2 ▎ 0.00
3 ▏ 0.00
4 ▋ 0.01
5 ▎ 0.00
6 ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉ 0.74
7 ▏ 0.00
8 ▉▉▉▉▉▉▉▉▉▉▉▉▉▍ 0.20
9 ▉▉▋ 0.04

0 ▉▉▉▉▍ 0.04
1 ▏ 0.00
2 ▎ 0.00
3 ▏ 0.00
4 ▉▉▉▉▉▉▉▉▉▉▏ 0.09
5 ▉▏ 0.01
6 ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▋ 0.30
7 ▏ 0.00
8 ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▍ 0.12
9 ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉

Wow! The model sometimes "sees" digit 6, sometimes digit 8, and sometimes digit 9!
For contrast, here is what predictions with low entropy typically look like.

In [22]:
(displayImage' (fromLocalModel net) <=< getItem testMnistStream) 0

              
              
              
              
   #%%*****   
      ::: %   
         %:   
        :%    
        #:    
       :%     
       %.     
      #=      
     :%.      
     =#       
Entropy 4.8037423e-4

0 ▏ 0.00
1 ▏ 0.00
2 ▏ 0.00
3 ▏ 0.00
4 ▏ 0.00
5 ▏ 0.00
6 ▏ 0.00
7 ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉ 1.00
8 ▏ 0.00
9 ▏ 0.00

0 ▏ 0.00
1 ▏ 0.00
2 ▏ 0.00
3 ▏ 0.00
4 ▏ 0.00
5 ▏ 0.00
6 ▏ 0.00
7 ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉ 1.00
8 ▏ 0.00
9 ▏ 0.00

0 ▏ 0.00
1 ▏ 0.00
2 ▏ 0.00
3 ▏ 0.00
4 ▏ 0.00
5 ▏ 0.00
6 ▏ 0.00
7 ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉ 1.00
8 ▏ 0.00
9 ▏ 0.00

0 ▏ 0.00
1 ▏ 0.00
2 ▏ 0.00
3 ▏ 0.00
4 ▏ 0.00
5 ▏ 0.00
6 ▏ 0.00
7 ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉ 1.00
8 ▏ 0.00
9 ▏ 0.00

0 ▏ 0.00
1 ▏ 0.00
2 ▏ 0.00
3 ▏ 0.00
4 ▏ 0.00
5 ▏ 0.00
6 ▏ 0.00
7 ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉ 1.00
8 ▏ 0.00
9 ▏ 0.00

0 ▏ 0.00
1 ▏ 0.00
2 ▏ 0.00
3 ▏ 0.00
4 ▏ 0.00
5 ▏ 0.00
6 ▏ 0.00
7 ▉▉▉

The model always "sees" digit 7. Note that the results we have provided are model-dependent. Therefore, we also share our model for reproducibility. However, each realization of the
stochastic model may still differ, especially in those cases where the
entropy is high.

## Summary

I hope you are now convinced that model uncertainty estimation is an invaluable tool. This simple technique is essential when applying deep learning to real-life decision making.

This post also elaborates on how to use the Hasktorch library in practice. Notably, it is very straightforward to run computations on a GPU. Overall, Hasktorch can be used for real-world deep learning. The code is well-structured and relies on the mature Torch library.

On the other hand, it would be desirable to capture high-level patterns so that the user does not need to think about low-level concepts such as dependent and independent tensors. The end user should be able to simply write `save net "weights.bin"` and `mynet <- load "weights.bin"` without any indirection. The same reasoning applies to `trainLoop`: the user should not need to reinvent it every time. Eventually, a higher-level package on top of Hasktorch should capture the best practices, similar to [PyTorch Lightning](https://www.pytorchlightning.ai/) or [fast.ai](https://github.com/fastai/fastai).

Now it is your turn: explore image recognition with the [AlexNet](https://github.com/hasktorch/hasktorch/blob/master/examples/alexNet/AlexNet.hs) convolutional network and have fun!

## Learn More

* [Improving neural networks by preventing co-adaptation of feature detectors](https://arxiv.org/pdf/1207.0580.pdf)
* [Dropout: A Simple Way to Prevent Neural Networks from Overfitting](https://www.jmlr.org/papers/volume15/srivastava14a/srivastava14a.pdf)
* [Tutorial: Dropout as Regularization and Bayesian Approximation](https://xuwd11.github.io/Dropout_Tutorial_in_PyTorch/)
* [Two Simple Ways To Measure Your Model’s Uncertainty](https://towardsdatascience.com/2-easy-ways-to-measure-your-image-classification-models-uncertainty-1c489fefaec8)
* [Uncertainty in Deep Learning, Yarin Gal](http://mlg.eng.cam.ac.uk/yarin/thesis/thesis.pdf)
* [AlexNet example in Hasktorch](https://github.com/hasktorch/hasktorch/tree/master/examples/alexNet)