In [None]:
from notebook.services.config import ConfigManager
cm = ConfigManager()
cm.update('livereveal', {
        'width': 1920,
        'height': 1080,
        'scroll': True,
})

# Week 9 (Monday), AST 8581 / PHYS 8581 / CSCI 8581 / STAT 8581: Big Data in Astrophysics

### Michael Coughlin <cough052@umn.edu>

With contributions totally ripped off from Dima Duev (Weights and Biases), Lex Fridman (MIT), Michael Steinbach (UMN), Nico Adams (UMN), and Jie Ding (UMN)



# Where are we headed?

Foundations of Data and Probability -> Statistical frameworks (Frequentist vs Bayesian) -> Estimating underlying distributions -> Analysis of Time series (periodicity) -> Analysis of Time series (variability) -> Analysis of Time series (stochastic processes) -> Gaussian Processes -> Decision Trees / Regression -> Dimensionality Reduction -> Principal Component Analysis -> Clustering / Density Estimation / Anomaly Detection -> Supervised Learning -> <b> Deep Learning </b> -> Introduction to Databases - SQL -> Introduction to Databases - NoSQL -> Introduction to Multiprocessing -> Introduction to GPUs -> Unit Testing

<img src='img/fig-dl.png' width=600>

<img src='img/dl6.png' width=600>

<img src='img/dl7.png' width=600>

<img src='img/dl8.png' width=600>

<img src='img/dl13.png' width=600>

<img src='img/dl11.png' width=500>

<img src='img/dl12.png' width=500>

# Recipe / Practical advice from one of the DL godfathers, Andrej Karpathy

Highly recommended: [Andrej Karpathy's blog post](http://karpathy.github.io/2019/04/25/recipe/)

- Neural net training is a leaky abstraction

```bash
>>> your_data = # plug your awesome dataset here
>>> model = SuperCrossValidator(SuperDuper.fit, your_data, ResNet50, SGDOptimizer)
# conquer world here
```

- Neural net training fails silently

Lots of ways to screw things up -> many paths to pain and suffering


## Become one with the data

- probably, the most important and time consuming step
- visualize as much as you can
- check normalizations
    
The neural net is effectively a compressed/compiled version of your dataset, so you'll be able to look at your network's (mis)predictions and understand where they might be coming from. If your network gives you a prediction that doesn't seem consistent with what you've seen in the data, something is off.
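
A quick numeric pass over the arrays can complement the plots. The sketch below is only an illustration: the helper name `summarize_dataset` and the synthetic `X` and `y` arrays are assumptions made for the example, not part of the course data.

```python
import numpy as np

def summarize_dataset(X, y):
    """Return basic sanity statistics for a feature matrix X and binary labels y."""
    feature_means = np.nanmean(X, axis=0)   # per-feature mean, ignoring NaNs
    feature_stds = np.nanstd(X, axis=0)     # per-feature spread, ignoring NaNs
    return {
        "n_samples": X.shape[0],
        "n_features": X.shape[1],
        "n_nonfinite": int(np.sum(~np.isfinite(X))),   # NaN or inf entries
        "mean_range": (float(feature_means.min()), float(feature_means.max())),
        "std_range": (float(feature_stds.min()), float(feature_stds.max())),
        "positive_fraction": float(np.mean(y == 1)),   # class balance
    }

# Synthetic stand-in data: un-normalized features and roughly 10% positives
rng = np.random.default_rng(42)
X = rng.normal(loc=5.0, scale=2.0, size=(1000, 4))
X[3, 1] = np.nan                      # inject one bad value on purpose
y = (rng.random(1000) < 0.1).astype(int)

stats = summarize_dataset(X, y)
print(stats)
```

Means far from 0, standard deviations far from 1, or any non-finite entries are hints that the normalization step deserves a second look before training.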

#### "If writing your neural net code was like training one, you’d want to use a very small learning rate and guess and then evaluate the full test set after every iteration."

## Set up the end-to-end training/evaluation skeleton

Start out as simple as possible, e.g. a linear classifier, or a very tiny ConvNet. We’ll want to train it, visualize the losses, any other metrics (e.g. accuracy), model predictions, and perform a series of ablation experiments with explicit hypotheses along the way.

### Tips & tricks for this stage:

- **fix random seed**. Always use a fixed random seed to guarantee that when you run the code twice you will get the same outcome. This removes a factor of variation and will help keep you sane.
- **simplify**. Make sure to disable any unnecessary fanciness. As an example, definitely turn off any data augmentation at this stage. Data augmentation is a regularization strategy that we may incorporate later, but for now it is just another opportunity to introduce some dumb bug.
- **add significant digits to your eval**. When plotting the test loss run the evaluation over the entire (large) test set. Do not just plot test losses over batches and then rely on smoothing them in Tensorboard. We are in pursuit of correctness and are very willing to give up time for staying sane.
- **verify loss @ init**. Verify that your loss starts at the correct loss value. E.g. if you initialize your final layer correctly you should measure -log(1/n_classes) on a softmax at initialization. The same default values can be derived for L2 regression, Huber losses, etc.
- **init well.** Initialize the final layer weights correctly. E.g. if you are regressing some values that have a mean of 50 then initialize the final bias to 50. If you have an imbalanced dataset of a ratio 1:10 of positives:negatives, set the bias on your logits such that your network predicts probability of 0.1 at initialization. Setting these correctly will speed up convergence and eliminate “hockey stick” loss curves where in the first few iterations your network is basically just learning the bias.
- **human baseline.** Monitor metrics other than loss that are human interpretable and checkable (e.g. accuracy). Whenever possible evaluate your own (human) accuracy and compare to it. Alternatively, annotate the test data twice and for each example treat one annotation as prediction and the second as ground truth.
- **input-independent baseline.** Train an input-independent baseline (e.g. easiest is to just set all your inputs to zero). This should perform worse than when you actually plug in your data without zeroing it out. Does it? i.e. does your model learn to extract any information out of the input at all?
- **overfit one batch.** Overfit a single batch of only a few examples (e.g. as little as two). To do so we increase the capacity of our model (e.g. add layers or filters) and verify that we can reach the lowest achievable loss (e.g. zero). I also like to visualize in the same plot both the label and the prediction and ensure that they end up aligning perfectly once we reach the minimum loss. If they do not, there is a bug somewhere and we cannot continue to the next stage.
- **verify decreasing training loss.** At this stage you will hopefully be underfitting on your dataset because you’re working with a toy model. Try to increase its capacity just a bit. Did your training loss go down as it should?
- **visualize just before the net.** The unambiguously correct place to visualize your data is immediately before your y_hat = model(x) (or sess.run in tf). That is - you want to visualize exactly what goes into your network, decoding that raw tensor of data and labels into visualizations. This is the only “source of truth”. I can’t count the number of times this has saved me and revealed problems in data preprocessing and augmentation.
- **visualize prediction dynamics.** I like to visualize model predictions on a fixed test batch during the course of training. The “dynamics” of how these predictions move will give you incredibly good intuition for how the training progresses. Many times it is possible to feel the network “struggle” to fit your data if it wiggles too much in some way, revealing instabilities. Very low or very high learning rates are also easily noticeable in the amount of jitter.
- **use backprop to chart dependencies.** Your deep learning code will often contain complicated, vectorized, and broadcasted operations. A relatively common bug I’ve come across a few times is that people get this wrong (e.g. they use view instead of transpose/permute somewhere) and inadvertently mix information across the batch dimension. It is a depressing fact that your network will typically still train okay because it will learn to ignore data from the other examples. One way to debug this (and other related problems) is to set the loss to be something trivial like the sum of all outputs of example i, run the backward pass all the way to the input, and ensure that you get a non-zero gradient only on the i-th input. The same strategy can be used to e.g. ensure that your autoregressive model at time t only depends on 1..t-1. More generally, gradients give you information about what depends on what in your network, which can be useful for debugging.
- **generalize a special case.** This is a bit more of a general coding tip but I’ve often seen people create bugs when they bite off more than they can chew, writing a relatively general functionality from scratch. I like to write a very specific function to what I’m doing right now, get that to work, and then generalize it later making sure that I get the same result. Often this applies to vectorizing code, where I almost always write out the fully loopy version first and only then transform it to vectorized code one loop at a time.
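
Two of the checks above ("verify loss @ init" and "init well") reduce to short calculations. The following is a minimal NumPy sketch, not tied to any framework; `n_classes = 10`, the batch of 256 random labels, and the target probability `p0 = 0.1` are assumptions chosen for illustration.

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    """Mean cross-entropy for integer class labels under softmax(logits)."""
    shifted = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n_classes = 10
rng = np.random.default_rng(0)
labels = rng.integers(0, n_classes, size=256)

# A final layer that outputs all-zero logits predicts a uniform distribution,
# so the loss should equal -log(1/n_classes) regardless of the labels.
logits_at_init = np.zeros((labels.size, n_classes))
loss_at_init = softmax_cross_entropy(logits_at_init, labels)
expected_loss = -np.log(1.0 / n_classes)

# Bias that makes a single sigmoid output start at a target probability p0:
# invert the sigmoid, bias = log(p0 / (1 - p0)).
p0 = 0.1
initial_bias = np.log(p0 / (1.0 - p0))
initial_prob = sigmoid(initial_bias)

print(loss_at_init, expected_loss)
print(initial_bias, initial_prob)
```

If the loss measured on the first batch of a real model is far from `expected_loss`, the final layer initialization or the label encoding is a good first suspect.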

## Building a network


In the first example below, our input data are simply vectors, and our labels are scalars (1s and 0s): this is the easiest setup you will ever encounter. A type of
network that performs well on such a problem would be a simple stack of fully-connected (`Dense`) layers with `relu` activations: `Dense(16,
activation='relu')`

The argument being passed to each `Dense` layer (16) is the number of "hidden units" of the layer. What's a hidden unit? It's a dimension
in the representation space of the layer. Each such `Dense` layer with a `relu` activation implements
the following chain of tensor operations:

`output = relu(dot(input, W) + b)`

Having 16 hidden units means that the weight matrix `W` will have shape `(input_dimension, 16)`, i.e. the dot product with `W` will project the
input data onto a 16-dimensional representation space (and then we would add the bias vector `b` and apply the `relu` operation). You can
intuitively understand the dimensionality of your representation space as "how much freedom you are allowing the network to have when
learning internal representations". Having more hidden units (a higher-dimensional representation space) allows your network to learn more
complex representations, but it makes your network more computationally expensive and may lead to learning unwanted patterns (patterns that
will improve performance on the training data but not on the test data).
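
To make the shapes concrete, here is a small NumPy sketch of a single `Dense` layer with a `relu` activation. It assumes the batch runs along the first axis, so the product is written `dot(inputs, W)` with `W` of shape `(input_dimension, units)`; the sizes and the random weights are arbitrary illustrative choices.

```python
import numpy as np

def relu(x):
    """Zero out negative values elementwise."""
    return np.maximum(x, 0.0)

def dense_relu(inputs, W, b):
    """One Dense layer: inputs (batch, input_dim), W (input_dim, units), b (units,)."""
    return relu(np.dot(inputs, W) + b)

rng = np.random.default_rng(0)
batch_size, input_dimension, units = 4, 10, 16

inputs = rng.normal(size=(batch_size, input_dimension))
W = rng.normal(scale=0.1, size=(input_dimension, units))   # weight matrix
b = np.zeros(units)                                        # bias vector

output = dense_relu(inputs, W, b)
print(output.shape)   # (4, 16): each sample is now a 16-dimensional representation
```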

There are two key architecture decisions to be made about such a stack of dense layers:

* How many layers to use.
* How many "hidden units" to choose for each layer.

In this problem, we will use two intermediate layers with 16 hidden units each,
and a third layer which will output the scalar prediction.
The intermediate layers will use `relu` as their "activation function",
and the final layer will use a sigmoid activation so as to output a probability
(a score between 0 and 1, indicating how likely the sample is to have the target "1").
A `relu` (rectified linear unit) is a function meant to zero-out negative values,
while a sigmoid "squashes" arbitrary values into the `[0, 1]` interval, thus outputting something that can be interpreted as a probability.
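
The forward pass of such a stack can be written out in a few lines of NumPy. This is a sketch under stated assumptions: `input_dimension = 20`, the small random weights, and the helper names `init_layer` and `forward` are hypothetical and only serve to show how `relu` layers feed a final sigmoid.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_layer(rng, n_in, n_out):
    """Small random weights and zero biases for one dense layer."""
    return rng.normal(scale=0.1, size=(n_in, n_out)), np.zeros(n_out)

def forward(x, params):
    """Apply relu on every hidden layer and a sigmoid on the final layer."""
    *hidden, last = params
    for W, b in hidden:
        x = relu(x @ W + b)
    W, b = last
    return sigmoid(x @ W + b)

rng = np.random.default_rng(1)
input_dimension = 20                      # hypothetical input size
sizes = [input_dimension, 16, 16, 1]      # two hidden layers of 16 units, one output
params = [init_layer(rng, n_in, n_out) for n_in, n_out in zip(sizes[:-1], sizes[1:])]

x = rng.normal(size=(5, input_dimension))
probabilities = forward(x, params)
print(probabilities.shape)   # (5, 1): one score per sample
```

Because the last operation is a sigmoid, every output can be read as a probability for the target "1", whatever the values of the hidden activations.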

Here's what our network looks like:

![3-layer network](https://s3.amazonaws.com/book.keras.io/img/ch3/3_layer_network.png)

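And here's a Keras implementation of this kind of network, written with the functional API. Note that the layer sizes in the code (209 input features, then 256 and 16 hidden units) differ from the figure: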

In [None]:
from keras import models
from keras import layers
from keras import Input
from keras import Model

features_input = Input(shape=(209,), name='features')
dense = layers.Dense(256, activation='relu')(features_input)
dense = layers.Dense(16, activation='relu')(dense)
dense = layers.Softmax()(dense)
dense = layers.Dense(1, activation='sigmoid')(dense)

model = Model(inputs=[features_input], outputs=dense)

## Overfitting

To find a good model takes two stages: first get a model large enough that it can overfit (i.e. focus on training loss) and then regularize it appropriately (give up some training loss to improve the validation loss).

### Tips & tricks for this stage:

- **picking the model**. To reach a good training loss you’ll want to choose an appropriate architecture for the data. When it comes to choosing this my #1 advice is: Don’t be a hero. I’ve seen a lot of people who are eager to get crazy and creative in stacking up the lego blocks of the neural net toolbox in various exotic architectures that make sense to them. Resist this temptation strongly in the early stages of your project. I always advise people to simply find the most related paper and copy paste their simplest architecture that achieves good performance. E.g. if you are classifying images don’t be a hero and just copy paste a ResNet-50 for your first run. You’re allowed to do something more custom later and beat this.
- **adam is safe**. In the early stages of setting baselines I like to use Adam with a learning rate of 3e-4. In my experience Adam is much more forgiving to hyperparameters, including a bad learning rate. For ConvNets a well-tuned SGD will almost always slightly outperform Adam, but the optimal learning rate region is much more narrow and problem-specific. (Note: If you are using RNNs and related sequence models it is more common to use Adam. At the initial stage of your project, again, don’t be a hero and follow whatever the most related papers do.)
- **complexify only one at a time**. If you have multiple signals to plug into your classifier I would advise that you plug them in one by one and every time ensure that you get a performance boost you’d expect. Don’t throw the kitchen sink at your model at the start. There are other ways of building up complexity - e.g. you can try to plug in smaller images first and make them bigger later, etc.
- **do not trust learning rate decay defaults**. If you are re-purposing code from some other domain always be very careful with learning rate decay. Not only would you want to use different decay schedules for different problems, but - even worse - in a typical implementation the schedule will be based on the current epoch number, which can vary widely simply depending on the size of your dataset. E.g. ImageNet would decay by 10 on epoch 30. If you’re not training ImageNet then you almost certainly do not want this. If you’re not careful your code could secretly be driving your learning rate to zero too early, not allowing your model to converge. In my own work I always disable learning rate decays entirely (I use a constant LR) and tune this all the way at the very end.
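
The point about epoch-based schedules can be illustrated with a little arithmetic. The sketch below assumes a step schedule that divides the learning rate by 10 every 30 epochs; the function names, the base learning rate, the batch size, and the two dataset sizes are hypothetical values picked for the example.

```python
import math

def step_decay(epoch, base_lr=0.1, drop=0.1, epochs_per_drop=30):
    """Multiply the learning rate by `drop` every `epochs_per_drop` epochs."""
    return base_lr * drop ** (epoch // epochs_per_drop)

def updates_before_first_drop(n_samples, batch_size, epochs_per_drop=30):
    """Number of gradient updates performed before the first decay step."""
    return math.ceil(n_samples / batch_size) * epochs_per_drop

lr_epoch_29 = step_decay(29)   # still the base learning rate
lr_epoch_30 = step_decay(30)   # ten times smaller

# Same schedule, two hypothetical dataset sizes
large = updates_before_first_drop(n_samples=1_000_000, batch_size=256)
small = updates_before_first_drop(n_samples=5_000, batch_size=256)

print(lr_epoch_29, lr_epoch_30)
print(large, small)   # 117210 600
```

With the small dataset the learning rate is cut after only 600 updates, about 200 times earlier (in optimization steps) than with the large one, even though the schedule is nominally the same.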


## Regularize

Now to regularize the model and gain some validation accuracy by giving up some of the training accuracy.

### Tips & tricks for this stage:

- **get more data**. First, the by far best and preferred way to regularize a model in any practical setting is to add more real training data. It is a very common mistake to spend a lot of engineering cycles trying to squeeze juice out of a small dataset when you could instead be collecting more data. As far as I’m aware adding more data is pretty much the only guaranteed way to monotonically improve the performance of a well-configured neural network almost indefinitely. The other would be ensembles (if you can afford them), but that tops out after ~5 models.
- **data augment**. The next best thing to real data is half-fake data - try out more aggressive data augmentation.
- **creative augmentation**. If half-fake data doesn’t do it, fake data may also do something. People are finding creative ways of expanding datasets; For example, domain randomization, use of simulation, clever hybrids such as inserting (potentially simulated) data into scenes, or even GANs.
- **pretrain**. It rarely ever hurts to use a pretrained network if you can, even if you have enough data.
- **stick with supervised learning**. Do not get over-excited about unsupervised pretraining. Unlike what that blog post from 2008 tells you, as far as I know, no version of it has reported strong results in modern computer vision (though NLP seems to be doing pretty well with BERT and friends these days, quite likely owing to the more deliberate nature of text, and a higher signal to noise ratio).
- **smaller input dimensionality**. Remove features that may contain spurious signal. Any added spurious input is just another opportunity to overfit if your dataset is small. Similarly, if low-level details don’t matter much try to input a smaller image.
- **smaller model size**. In many cases you can use domain knowledge constraints on the network to decrease its size. As an example, it used to be trendy to use Fully Connected layers at the top of backbones for ImageNet but these have since been replaced with simple average pooling, eliminating a ton of parameters in the process.
- **decrease the batch size**. Due to the normalization inside batch norm smaller batch sizes somewhat correspond to stronger regularization. This is because the batch empirical mean/std are more approximate versions of the full mean/std so the scale & offset “wiggles” your batch around more.
- **drop**. Add dropout. Use dropout2d (spatial dropout) for ConvNets. Use this sparingly/carefully because dropout does not seem to play nice with batch normalization.
- **weight decay**. Increase the weight decay penalty.
- **early stopping**. Stop training based on your measured validation loss to catch your model just as it’s about to overfit.
- **try a larger model**. I mention this last and only after early stopping but I’ve found a few times in the past that larger models will of course overfit much more eventually, but their “early stopped” performance can often be much better than that of smaller models.
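
The early stopping rule from the list above can be sketched in plain Python. The function name, the `patience` and `min_delta` parameters, and the list of validation losses are all illustrative assumptions; in a Keras workflow the `keras.callbacks.EarlyStopping` callback plays this role.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (best_epoch, stop_epoch) for a sequence of per-epoch validation losses.

    Training stops at the first epoch where the loss has failed to improve by
    more than `min_delta` for `patience` consecutive epochs. If that never
    happens, stop_epoch is simply the last epoch.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_losses) - 1

# Hypothetical validation curve: improves, flattens, then degrades
val_losses = [0.90, 0.70, 0.55, 0.50, 0.48, 0.49, 0.51, 0.53, 0.56, 0.60]
best_epoch, stop_epoch = early_stopping_epoch(val_losses, patience=3)
print(best_epoch, stop_epoch)   # 4 7
```

The weights to keep are those from `best_epoch`, not from `stop_epoch`, which is why checkpointing the best model during training matters.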

Finally, to gain additional confidence that your network is a reasonable classifier, I like to visualize the network’s first-layer weights and ensure you get nice edges that make sense. If your first layer filters look like noise then something could be off. Similarly, activations inside the net can sometimes display odd artifacts and hint at problems.

## Tune

You should now be “in the loop” with your dataset exploring a wide model space for architectures that achieve low validation loss. A few tips and tricks for this step:

### Tips & tricks for this stage:

- **random over grid search**. For simultaneously tuning multiple hyperparameters it may sound tempting to use grid search to ensure coverage of all settings, but keep in mind that it is best to use random search instead. Intuitively, this is because neural nets are often much more sensitive to some parameters than others. In the limit, if a parameter a matters but changing b has no effect then you’d rather sample a more thoroughly than at a few fixed points multiple times.
- **hyper-parameter optimization**. There is a large number of fancy Bayesian hyper-parameter optimization toolboxes around and a few of my friends have also reported success with them, but my personal experience is that the state of the art approach to exploring a nice and wide space of models and hyperparameters is to use an intern :). Just kidding.
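
The argument for random search is easy to check numerically. In this sketch both strategies get the same budget of 16 trials over two hyperparameters `a` and `b`, each rescaled to the unit interval; the helper names and the budget are assumptions made for the illustration.

```python
import random

def grid_points(n_per_axis):
    """Regular grid on the unit square with n_per_axis values per hyperparameter."""
    values = [(i + 0.5) / n_per_axis for i in range(n_per_axis)]
    return [(a, b) for a in values for b in values]

def random_points(n_points, seed=0):
    """Independent uniform samples on the unit square."""
    rng = random.Random(seed)
    return [(rng.random(), rng.random()) for _ in range(n_points)]

grid = grid_points(4)       # 4 x 4 = 16 trials
rand = random_points(16)    # 16 trials

# How many distinct values of the important parameter `a` did each strategy try?
distinct_a_grid = len({a for a, _ in grid})
distinct_a_random = len({a for a, _ in rand})
print(distinct_a_grid, distinct_a_random)   # 4 16
```

If only `a` matters, the grid has effectively spent 16 trials to probe 4 settings, while random search probed 16.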

## Squeeze out the juice

Once you find the best types of architectures and hyper-parameters you can still use a few more tricks to squeeze out the last pieces of juice out of the system:

### Tips & tricks for this stage:

- **ensembles**. Model ensembles are a pretty much guaranteed way to gain 2% of accuracy on anything. If you can’t afford the computation at test time look into distilling your ensemble into a network using dark knowledge.
- **leave it training**. I’ve often seen people tempted to stop the model training when the validation loss seems to be leveling off. In my experience networks keep training for an unintuitively long time. One time I accidentally left a model training during the winter break and when I got back in January it was SOTA (“state of the art”).
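
Why averaging helps can be seen in an idealized setting. The sketch below assumes five hypothetical models whose errors are independent Gaussian noise around the truth; real ensemble members have correlated errors, so the gain in practice is smaller than what this toy example shows.

```python
import numpy as np

rng = np.random.default_rng(0)
n_models, n_samples = 5, 10_000

truth = rng.normal(size=n_samples)
# Each hypothetical model predicts the truth plus its own independent noise
predictions = truth + rng.normal(scale=1.0, size=(n_models, n_samples))

# Mean squared error of each model on its own
individual_mse = ((predictions - truth) ** 2).mean(axis=1)

# Ensemble: average the predictions of all models, then score the average
ensemble_prediction = predictions.mean(axis=0)
ensemble_mse = ((ensemble_prediction - truth) ** 2).mean()

print(individual_mse)
print(ensemble_mse)
```

With independent errors the ensemble error shrinks roughly as one over the number of models, which is also why the benefit saturates once the members stop disagreeing in useful ways.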


## Overfitting and underfitting

Overfitting happens in every single machine learning 
problem. Learning how to deal with overfitting is essential to mastering machine learning.

The fundamental issue in machine learning is the tension between optimization and generalization. "Optimization" refers to the process of 
adjusting a model to get the best performance possible on the training data (the "learning" in "machine learning"), while "generalization" 
refers to how well the trained model would perform on data it has never seen before. The goal of the game is to get good generalization, of 
course, but you do not control generalization; you can only adjust the model based on its training data.

At the beginning of training, optimization and generalization are correlated: the lower your loss on training data, the lower your loss on 
test data. While this is happening, your model is said to be _under-fit_: there is still progress to be made; the network hasn't yet 
modeled all relevant patterns in the training data. But after a certain number of iterations on the training data, generalization stops 
improving, validation metrics stall then start degrading: the model is then starting to over-fit, i.e. it is starting to learn patterns 
that are specific to the training data but that are misleading or irrelevant when it comes to new data.

To prevent a model from learning misleading or irrelevant patterns found in the training data, _the best solution is of course to get 
more training data_. A model trained on more data will naturally generalize better. When that is no longer possible, the next best solution 
is to modulate the quantity of information that your model is allowed to store, or to add constraints on what information it is allowed to 
store. If a network can only afford to memorize a small number of patterns, the optimization process will force it to focus on the most 
prominent patterns, which have a better chance of generalizing well.

The process of fighting overfitting in this way is called _regularization_.

## Fighting overfitting: Reducing the network's size


The simplest way to prevent overfitting is to reduce the size of the model, i.e. the number of learnable parameters in the model (which is 
determined by the number of layers and the number of units per layer). In deep learning, the number of learnable parameters in a model is 
often referred to as the model's "capacity". Intuitively, a model with more parameters will have more "memorization capacity" and therefore 
will be able to easily learn a perfect dictionary-like mapping between training samples and their targets, a mapping without any 
generalization power. For instance, a model with 500,000 binary parameters could easily be made to learn the class of every digit in the 
MNIST training set: we would only need 10 binary parameters for each of the 50,000 digits. Such a model would be useless for classifying 
new digit samples. Always keep this in mind: deep learning models tend to be good at fitting to the training data, but the real challenge 
is generalization, not fitting.

On the other hand, if the network has limited memorization resources, it will not be able to learn this mapping as easily, and thus, in 
order to minimize its loss, it will have to resort to learning compressed representations that have predictive power regarding the targets 
-- precisely the type of representations that we are interested in. At the same time, keep in mind that you should be using models that have 
enough parameters that they won't be underfitting: your model shouldn't be starved for memorization resources. There is a compromise to be 
found between "too much capacity" and "not enough capacity".

Unfortunately, there is no magical formula to determine what the right number of layers is, or what the right size for each layer is. You 
will have to evaluate an array of different architectures (on your validation set, not on your test set, of course) in order to find the 
right model size for your data. The general workflow to find an appropriate model size is to start with relatively few layers and 
parameters, and start increasing the size of the layers or adding new layers until you see diminishing returns with regard to the 
validation loss.
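
When comparing architectures it helps to know how many learnable parameters each one has. Here is a minimal sketch for stacks of fully-connected layers; the function name is hypothetical, and the two size lists (a 209-feature input followed by either 16/16/1 or 256/16/1 units) are illustrative choices.

```python
def dense_parameter_count(layer_sizes):
    """Number of weights and biases in a stack of fully-connected layers.

    layer_sizes = [input_dim, units_1, ..., units_last]. Each layer contributes
    n_in * n_out weights plus n_out biases.
    """
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

small = dense_parameter_count([209, 16, 16, 1])
wide = dense_parameter_count([209, 256, 16, 1])
print(small, wide)   # 3649 57889
```

Widening only the first hidden layer from 16 to 256 units multiplies the parameter count by roughly 16, because that layer sees the full input dimension.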

## Adding weight regularization


You may be familiar with the _Occam's Razor_ principle: given two explanations for something, the explanation most likely to be correct is the 
"simplest" one, the one that makes the least amount of assumptions. This also applies to the models learned by neural networks: given some 
training data and a network architecture, there are multiple sets of weights values (multiple _models_) that could explain the data, and 
simpler models are less likely to overfit than complex ones.

A "simple model" in this context is a model where the distribution of parameter values has less entropy (or a model with fewer 
parameters altogether, as we saw in the section above). Thus a common way to mitigate overfitting is to put constraints on the complexity 
of a network by forcing its weights to only take small values, which makes the distribution of weight values more "regular". This is called 
"weight regularization", and it is done by adding to the loss function of the network a _cost_ associated with having large weights. This 
cost comes in two flavors:

* L1 regularization, where the cost added is proportional to the _absolute value of the weights coefficients_ (i.e. to what is called the 
"L1 norm" of the weights).
* L2 regularization, where the cost added is proportional to the _square of the value of the weights coefficients_ (i.e. to what is called 
the squared "L2 norm" of the weights). L2 regularization is also called _weight decay_ in the context of neural networks. Don't let the different 
name confuse you: for plain gradient descent, weight decay is mathematically the same as L2 regularization (with adaptive optimizers such as Adam the two can differ).
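
Both penalties are short to write down in NumPy. This sketch uses a hypothetical 2x2 weight matrix and a penalty strength of 0.001; it also shows why the L2 penalty earns the name "weight decay" under plain gradient descent, where one step on the penalty alone shrinks every weight by the same factor.

```python
import numpy as np

def l1_penalty(weights, strength):
    """strength * sum of absolute values of the weights."""
    return strength * np.abs(weights).sum()

def l2_penalty(weights, strength):
    """strength * sum of squared weights."""
    return strength * np.square(weights).sum()

weights = np.array([[0.5, -2.0],
                    [0.0,  1.0]])
strength = 0.001

l1 = l1_penalty(weights, strength)   # 0.001 * 3.5
l2 = l2_penalty(weights, strength)   # 0.001 * 5.25

# Gradient of the L2 penalty is 2 * strength * weights, so one plain
# gradient-descent step on the penalty rescales all weights uniformly.
learning_rate = 0.1
sgd_step = weights - learning_rate * (2 * strength * weights)
decay_factor = 1 - 2 * learning_rate * strength

print(l1, l2)
print(np.allclose(sgd_step, decay_factor * weights))
```

Note how the L2 penalty is dominated by the largest weight (-2.0 contributes 4.0 of the 5.25), whereas the L1 penalty treats magnitudes linearly.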

In Keras, weight regularization is added by passing _weight regularizer instances_ to layers as keyword arguments. Let's add L2 weight 
regularization to the network defined above:

In [None]:
from keras import models
from keras import layers
from keras import Input
from keras import Model
from keras import regularizers

features_input = Input(shape=(209,), name='features')
dense = layers.Dense(256, activation='relu', kernel_regularizer=regularizers.l2(0.001))(features_input)
dense = layers.Dense(16, activation='relu', kernel_regularizer=regularizers.l2(0.001))(dense)
dense = layers.Softmax()(dense)
dense = layers.Dense(1, activation='sigmoid')(dense)

model = Model(inputs=[features_input], outputs=dense)


`l2(0.001)` means that every coefficient in the weight matrix of the layer will add `0.001 * weight_coefficient_value ** 2` to the total loss of 
the network. Note that because this penalty is _only added at training time_, the loss for this network will be much higher at training 
than at test time.

Here's the impact of our L2 regularization penalty:

## Adding dropout


Dropout is one of the most effective and most commonly used regularization techniques for neural networks, developed by Hinton and his 
students at the University of Toronto. Dropout, applied to a layer, consists of randomly "dropping out" (i.e. setting to zero) a number of 
output features of the layer during training. Let's say a given layer would normally have returned a vector `[0.2, 0.5, 1.3, 0.8, 1.1]` for a 
given input sample during training; after applying dropout, this vector will have a few zero entries distributed at random, e.g. `[0, 0.5, 
1.3, 0, 1.1]`. The "dropout rate" is the fraction of the features that are being zeroed-out; it is usually set between 0.2 and 0.5. At test 
time, no units are dropped out, and instead the layer's output values are scaled down by a factor equal to the fraction of units kept (one minus the dropout rate), so as to 
balance for the fact that more units are active than at training time.

Consider a Numpy matrix containing the output of a layer, `layer_output`, of shape `(batch_size, features)`. At training time, we would be 
zero-ing out at random a fraction of the values in the matrix:

In [None]:
# FOR DEMONSTRATION PURPOSES ONLY: NOT MEANT TO BE RUN
# At training time: we drop out 50% of the units in the output
layer_output *= np.random.randint(0, high=2, size=layer_output.shape)


At test time, we would be scaling the output down by the fraction of units kept. Here we scale by 0.5 (because we were previously dropping half the 
units):

In [None]:
# At test time:
layer_output *= 0.5


Note that this process can be implemented by doing both operations at training time and leaving the output unchanged at test time, which is 
often the way it is implemented in practice:

In [None]:
# At training time:
layer_output *= np.random.randint(0, high=2, size=layer_output.shape)
# Note that we are scaling *up* rather than scaling *down* in this case
layer_output /= 0.5


This technique may seem strange and arbitrary. Why would this help reduce overfitting? Geoff Hinton has said that he was inspired, among 
other things, by a fraud prevention mechanism used by banks -- in his own words: _"I went to my bank. The tellers kept changing and I asked 
one of them why. He said he didn’t know but they got moved around a lot. I figured it must be because it would require cooperation 
between employees to successfully defraud the bank. This made me realize that randomly removing a different subset of neurons on each 
example would prevent conspiracies and thus reduce overfitting"_.

The core idea is that introducing noise in the output values of a layer can break up happenstance patterns that are not significant (what 
Hinton refers to as "conspiracies"), which the network would start memorizing if no noise was present. 
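
The variant that does both operations at training time (often called inverted dropout) can be written as a small runnable NumPy function. This is a sketch under assumptions: the function name, the all-ones `layer_output`, and the rate of 0.5 are illustrative, and Keras' own `Dropout` layer is not implemented this way line for line.

```python
import numpy as np

def inverted_dropout(layer_output, rate, rng):
    """Training-time dropout: zero a fraction `rate` of entries, rescale the rest.

    Dividing by the keep probability keeps the expected activation unchanged,
    so nothing needs to be done at test time.
    """
    keep_prob = 1.0 - rate
    mask = rng.random(layer_output.shape) < keep_prob   # True where a unit is kept
    return layer_output * mask / keep_prob

rng = np.random.default_rng(0)
layer_output = np.ones((1000, 200))        # stand-in for (batch_size, features)
dropped = inverted_dropout(layer_output, rate=0.5, rng=rng)

fraction_zeroed = np.mean(dropped == 0.0)  # close to the dropout rate
mean_activation = dropped.mean()           # close to the original mean of 1.0

print(fraction_zeroed, mean_activation)
```

About half of the entries are zero and the survivors are doubled, so the average activation seen by the next layer stays close to what it will be at test time.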

In Keras you can introduce dropout in a network via the `Dropout` layer, which gets applied to the output of the layer right before it, e.g.:

In [None]:
# Sequential-API form: `model` must be a `keras.Sequential` instance, not a functional `Model`
model.add(layers.Dropout(0.5))

In [None]:
from keras import models
from keras import layers
from keras import Input
from keras import Model
from keras import regularizers

features_input = Input(shape=(209,), name='features')
dense = layers.Dense(256, activation='relu')(features_input)
dense = layers.Dropout(0.35)(dense)
dense = layers.Dense(16, activation='relu')(dense)
dense = layers.Dropout(0.25)(dense)
dense = layers.Softmax()(dense)

dense = layers.Dense(1, activation='sigmoid')(dense)

model = Model(inputs=[features_input], outputs=dense)


Lastly, we need to pick a loss function and an optimizer. Since we are facing a binary classification problem and the output of our network
is a probability (we end our network with a single-unit layer with a sigmoid activation), it is best to use the `binary_crossentropy` loss.
It isn't the only viable choice: you could use, for instance, `mean_squared_error`. But crossentropy is usually the best choice when you
are dealing with models that output probabilities. Crossentropy is a quantity from the field of Information Theory, that measures the "distance"
between probability distributions, or in our case, between the ground-truth distribution and our predictions.
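
For intuition, binary crossentropy can be computed by hand in NumPy. The sketch below is not the Keras implementation; the function name mirrors the Keras loss only for readability, and the clipping constant `eps` and the toy labels and predictions are assumptions.

```python
import numpy as np

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean of -[y log(p) + (1 - y) log(1 - p)] over all samples."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)   # avoid log(0)
    return float(-np.mean(y_true * np.log(y_pred)
                          + (1.0 - y_true) * np.log(1.0 - y_pred)))

y_true = np.array([1.0, 0.0, 1.0, 0.0])

# Always predicting 0.5 costs log(2) per sample
uninformative = binary_crossentropy(y_true, np.full(4, 0.5))
# Confident and right: small loss
confident_right = binary_crossentropy(y_true, np.array([0.9, 0.1, 0.9, 0.1]))
# Confident and wrong: large loss
confident_wrong = binary_crossentropy(y_true, np.array([0.1, 0.9, 0.1, 0.9]))

print(uninformative, confident_right, confident_wrong)
```

The asymmetry between the last two values is the useful property: confident mistakes are penalized far more heavily than they would be under a squared-error loss on the same probabilities.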

Here's the step where we configure our model with the `rmsprop` optimizer and the `binary_crossentropy` loss function. Note that we will
also monitor accuracy during training.

In [None]:
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['accuracy'])

We are passing our optimizer, loss function and metrics as strings, which is possible because `rmsprop`, `binary_crossentropy` and
`accuracy` are packaged as part of Keras. Sometimes you may want to configure the parameters of your optimizer, or pass a custom loss
function or metric function. The former can be done by passing an optimizer class instance as the `optimizer` argument:

In [None]:
from keras import optimizers

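lr = 3e-4
beta_1 = 0.9
beta_2 = 0.999
epsilon = 1e-7
amsgrad = False  # boolean flag selecting the AMSGrad variant of Adam
# `learning_rate` is the current argument name (`lr` is a deprecated alias);
# the `decay` argument is omitted because a value of 0.0 means no decay
optimizer = optimizers.Adam(learning_rate=lr, beta_1=beta_1, beta_2=beta_2,
                            epsilon=epsilon, amsgrad=amsgrad)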

In [None]:
from keras import metrics

metrics = [metrics.TruePositives(name='tp'),
           metrics.FalsePositives(name='fp'),
           metrics.TrueNegatives(name='tn'),
           metrics.FalseNegatives(name='fn'),
           metrics.BinaryAccuracy(name='accuracy'),
           metrics.Precision(name='precision'),
           metrics.Recall(name='recall'),
           metrics.AUC(name='auc'),]

The latter can be done by passing function objects as the `loss` or `metrics` arguments:

In [None]:
from keras import losses

model.compile(optimizer=optimizer,
              loss=losses.binary_crossentropy,
              metrics=metrics)

In [None]:
model.summary()  # prints the summary itself; wrapping it in print() would also print "None"

# In-class warm-up: Classifying Type Ia Supernovae Spectra

In this problem, we will be using Deep Learning to differentiate thermonuclear supernovae (SNe Ia) from other transients, based on very low-resolution (R$\sim100$) spectra from the SED Machine (SEDM) on the Palomar 60 inch telescope; this telescope and observing system has a dedicated program studying bright transients in our local Universe (https://arxiv.org/abs/1910.12973). A significant fraction of those are SNe Ia, which are explosions that occur in binary systems where one of the stars is a white dwarf. The other star can be anything from a giant star to an even smaller white dwarf. The goal of the problem is fully automated classification of SNe Ia with a very low false-positive rate (FPR) so that human intervention can be greatly reduced in large-scale SN classification efforts.

Historically, classifications have been based on manual matching of observed spectra to spectral templates, along with careful inspection of each obtained spectrum. This makes classification of thousands of SNe a very time-consuming endeavor. In this problem, we will use a deep-learning based method optimized to identify SNe Ia using SEDM spectra (and potentially determine their redshifts without any human interaction). The data set below comes from SEDM data obtained by the Bright Transient Survey between 2018 March to 2020 March.

### Load in data

In [None]:
%matplotlib inline

import numpy as np
import pandas as pd
from astropy.table import Table
from astropy.coordinates import Angle
from astropy.time import Time
import astropy.units as u
import matplotlib.pyplot as plt

INFILE='./data/SEDM_ML_sample.txt'
names = ('name','type','z','z_host','JD_spec','JD_max','phase','specfile','secure_flag')
df = pd.read_csv(INFILE, names=names)
T = Table.from_pandas(df)

T=T[T["secure_flag"]==1] # This selects the published 2018 sample as the validation set

def clean_list(tab):
    # Drop unclassified, peculiar, uncertain, duplicate, and ambiguous entries.
    # The '?' types are listed with and without a backslash to cover both spellings.
    excluded = ['-', 'Iax', 'duplicate', 'ambiguous',
                'Ia?', 'Ic?', 'Ib/c?', r'Ia\?', r'Ic\?', r'Ib/c\?']
    for bad_type in excluded:
        tab = tab[tab["type"] != bad_type]

    # Binary label: 1 for any type containing "Ia", 0 otherwise
    tab["type"] = [1 if "Ia" in row["type"] else 0 for row in tab]

    return tab

T = clean_list(T)
T=T[T["JD_spec"]-Time('2020-03-01T00:00:00', format='isot', scale='utc').jd<0]

### Pre-process data

In [None]:
from scipy.interpolate import interp1d
from statsmodels.nonparametric.smoothers_lowess import lowess

wls=np.arange(3800,9150,25.6)

def preprocess_list(tab):

    Signals = np.zeros((len(tab), len(wls)))

    print('Preprocessing spectra')
    for ii, row in enumerate(tab):
        if np.mod(ii,500) == 0:
            print('Analyzing %d/%d' % (ii, len(tab)))

        spectrafile = './data/spectra/%s' % row["specfile"]
        spec = np.loadtxt(spectrafile)
        f = interp1d(spec[:,0], spec[:,1], fill_value="extrapolate")
        ispec = f(wls)
        ispec = ispec / np.median(ispec)
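        # smooth continuum estimate from a LOWESS fit
        vals = lowess(ispec, wls, frac=0.3)
        smoothed_ispec = vals[:,1]

        # continuum-divided spectrum; computed here but not stored below
        spe = ispec/smoothed_ispec
        spe[~np.isfinite(spe)]=0
        # the median-normalized spectrum is what is kept as the network input
        Signals[ii,:]=ispec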

    return Signals

Spectra = preprocess_list(T)
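
The core of the preprocessing (resample onto a common wavelength grid, then divide by the median) can be sketched on a synthetic spectrum. Everything about the fake spectrum below is made up, and `np.interp` stands in for `scipy.interpolate.interp1d`. Unlike the `fill_value="extrapolate"` option used above, `np.interp` holds the edge values rather than extrapolating.

```python
import numpy as np

grid = np.arange(3800, 9150, 25.6)   # same common wavelength grid as above

# synthetic, irregularly sampled spectrum (made-up shape and flux scale)
raw_wl = 3700 + 5600 * np.linspace(0, 1, 300) ** 1.3
raw_flux = 2e-16 * (1 + 0.3 * np.sin(raw_wl / 400.0))

resampled = np.interp(grid, raw_wl, raw_flux)    # put the spectrum on the grid
normalized = resampled / np.median(resampled)    # remove the overall flux scale
```

After this step every spectrum has the same length and a median of one, so the network sees spectral shape rather than absolute brightness.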

### Plot up some spectra

In [None]:

cnt1 = 0
cnt0 = 0
for ii, row in enumerate(T):
    if (row["type"] == 1) and (cnt1 < 10):
        plt.plot(wls, Spectra[ii,:], 'k-')
        cnt1 = cnt1 + 1
    if (row["type"] == 0) and (cnt0 < 10):
        plt.plot(wls, Spectra[ii,:], 'b--')
        cnt0 = cnt0 + 1
plt.xlim([4000, 10000])
plt.ylim([-1,5])

## Data set preparation

We now prepare our data: we identify the positive and negative examples, split the data into training, validation, and test sets, and wrap each split in a `tf.data.Dataset` ready for training.

In [None]:
from sklearn.model_selection import train_test_split
import tensorflow as tf

def threshold(a, t: float = 0.5):
    b = np.zeros_like(a)
    b[np.array(a) > t] = 1
    return b

shuffle_buffer_size = 256
batch_size = 25
epochs=100
random_state=42

test_size = 0.1
val_size = 0.1

target = np.asarray(list(map(int, threshold(T["type"], t=0.5))))
target = np.expand_dims(target, axis=1)
neg, pos = np.bincount(target.flatten())
total = neg + pos

print(f'Examples:\n  Total: {total}\n  Positive: {pos} ({100 * pos / total:.2f}% of total)\n')

df = pd.DataFrame(Spectra)

w_pos = np.where(np.rint(target) == 1)[0]
index_pos = df.loc[w_pos].index
w_neg = np.where(np.rint(target) == 0)[0]
index_neg = df.loc[w_neg].index

ds_indexes = index_pos.to_list() + index_neg.to_list()

# Train/validation/test split (we will use an 81% / 9% / 10% data split by default):
train_indexes, test_indexes = train_test_split(ds_indexes, shuffle=True,
                                               test_size=test_size, random_state=random_state)
train_indexes, val_indexes = train_test_split(train_indexes, shuffle=True,
                                              test_size=val_size, random_state=random_state)


# compute per-feature (per-wavelength-bin) norms over the full data set:
norms = np.linalg.norm(df.loc[ds_indexes, :], axis=0)

# scale every spectrum by the feature norms (broadcast across rows)
df = df / norms

# make tf.data.Dataset's:
train_dataset = tf.data.Dataset.from_tensor_slices(({'features': df.loc[train_indexes].values}, target[train_indexes]))
val_dataset = tf.data.Dataset.from_tensor_slices(({'features': df.loc[val_indexes].values}, target[val_indexes]))
test_dataset = tf.data.Dataset.from_tensor_slices(({'features': df.loc[test_indexes].values}, target[test_indexes]))

train_dataset = train_dataset.shuffle(shuffle_buffer_size).batch(batch_size).repeat(epochs)
val_dataset = val_dataset.batch(batch_size).repeat(epochs)
test_dataset = test_dataset.batch(batch_size)
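
The 81% / 9% / 10% figures follow from applying a 10% split twice. Here is a NumPy-only sketch of that arithmetic on a made-up data set of 1000 examples. It mimics the two-stage `train_test_split` call above, but it is not scikit-learn's implementation, and the rounding may differ for sizes that do not divide evenly.

```python
import numpy as np

n_total = 1000      # made-up data set size
test_size = 0.1
val_size = 0.1

rng = np.random.default_rng(42)
indexes = rng.permutation(n_total)

# first cut: hold out the test set
n_test = int(round(test_size * n_total))
test_idx, rest = indexes[:n_test], indexes[n_test:]

# second cut: take the validation set out of what remains
n_val = int(round(val_size * len(rest)))
val_idx, train_idx = rest[:n_val], rest[n_val:]

fractions = (len(train_idx) / n_total, len(val_idx) / n_total, len(test_idx) / n_total)
```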

## Callbacks

Callbacks are utilities that Keras executes at given stages of the training procedure. Callbacks can help you prevent overfitting, visualize training progress, debug your code, save checkpoints, generate logs, write TensorBoard summaries, etc.

In [None]:
import os
import datetime
from tqdm.keras import TqdmCallback
from keras import callbacks

callbacks_list = []

# early stopping
        
# halt training if no gain in <validation loss> over <patience> epochs
monitor = 'val_loss'
patience = 10
restore_best_weights = True
early_stopping_callback = callbacks.EarlyStopping(monitor=monitor,
                                                  patience=patience,
                                                  restore_best_weights=restore_best_weights)
callbacks_list.append(early_stopping_callback)

# logs for TensorBoard:
log_tag = f'{datetime.datetime.now().strftime("%Y%m%d-%H%M%S")}'
logdir_tag = os.path.join('logs', log_tag)
tensorboard_callback = callbacks.TensorBoard(os.path.join(logdir_tag, log_tag),
                                            histogram_freq=1)
callbacks_list.append(tensorboard_callback)

callbacks_list.append(TqdmCallback(verbose=1))
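
The patience rule behind early stopping can be sketched in a few lines of plain Python. This is a simplified stand-in for the idea, not the Keras `EarlyStopping` code, and the validation losses are invented.

```python
def early_stopping_epoch(val_losses, patience=10):
    """Return (stop_epoch, best_epoch), both 0-based, under a simple patience rule."""
    best_loss = float('inf')
    best_epoch = -1
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # improvement: remember it and reset the counter
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch

val_losses = [0.9, 0.7, 0.6, 0.65, 0.62, 0.61, 0.63]   # made-up validation losses
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=3)
```

Here training would stop at epoch 5, three epochs after the best validation loss at epoch 2. With `restore_best_weights=True`, the weights from that best epoch are the ones kept.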

We will now train our model for up to 100 epochs (each epoch is one pass over the batches drawn from `train_dataset`), in mini-batches of 25
samples; early stopping may end training sooner. At the same time we will monitor loss and accuracy on the validation samples that we set apart. This is done by passing the
validation data as the `validation_data` argument:

In [None]:
# the datasets are already batched, so batch_size must not be passed to fit()
steps_per_epoch_train = len(train_indexes) // batch_size - 1
steps_per_epoch_val = len(val_indexes) // batch_size - 1

history = model.fit(train_dataset,
                    epochs=epochs,
                    steps_per_epoch=steps_per_epoch_train,
                    validation_data=val_dataset,
                    validation_steps=steps_per_epoch_val,
                    callbacks=callbacks_list)

On CPU, this will take less than a second per epoch -- training is over in less than a minute. At the end of every epoch, there is a slight pause
as the model computes its loss and accuracy on the samples of the validation data.

Note that the call to `model.fit()` returns a `History` object. This object has a member `history`, which is a dictionary containing data
about everything that happened during training. Let's take a look at it:

In [None]:
history_dict = history.history
history_dict.keys()

It contains one entry per quantity that was being monitored (the loss plus each metric passed to `compile`), for both training and validation. Let's use Matplotlib to plot the
training and validation loss side by side, as well as the training and validation accuracy:

In [None]:
import matplotlib.pyplot as plt

loss = history.history['loss']
val_loss = history.history['val_loss']

epoch = range(1, len(loss) + 1)

# "bo" is for "blue dot"
plt.plot(epoch, loss, 'bo', label='Training loss')
# b is for "solid blue line"
plt.plot(epoch, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()

plt.show()

In [None]:
plt.clf()   # clear figure
acc_values = history_dict['accuracy']
val_acc_values = history_dict['val_accuracy']

plt.plot(epoch, acc_values, 'bo', label='Training acc')
plt.plot(epoch, val_acc_values, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()

plt.show()


The dots are the training loss and accuracy, while the solid lines are the validation loss and accuracy. Note that your own results may vary
slightly due to a different random initialization of your network.

Let's evaluate it on our test data:

In [None]:
results = model.evaluate(test_dataset)

In [None]:
results

Our fairly naive approach achieves an accuracy of 88%. With state-of-the-art approaches, one should be able to get close to 95%.

In [None]:
predictions = model.predict(test_dataset)
labels = np.concatenate([y for x, y in test_dataset], axis=0)

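# stack predicted probabilities and true labels side by side, then round to 0/1
pt = np.vstack((predictions.T, labels.T)).T
pt_thresholded = np.rint(pt)
# True wherever the thresholded prediction disagrees with the label
w = np.logical_xor(pt_thresholded[:, 0], pt_thresholded[:, 1])

# number of test examples, number misclassified, misclassified fraction
print(len(w), np.sum(w), np.sum(w)/len(w))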

idx = np.where(labels == 0)[0]
a_heights, a_bins = np.histogram(predictions[idx],bins=20)
idx = np.where(labels == 1)[0]
b_heights, b_bins = np.histogram(predictions[idx], bins=a_bins)

width = (a_bins[1] - a_bins[0])/3

plt.bar(a_bins[:-1], a_heights, width=width, facecolor='cornflowerblue', label="Non-Ia")
plt.bar(b_bins[:-1]+width, b_heights, width=width, facecolor='seagreen', label="Ia")
plt.legend()
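
Since the stated goal is a very low false-positive rate, it helps to see how the decision threshold trades FPR against recall. Below is a NumPy sketch on made-up scores. The helper `false_positive_rate` is hypothetical, and the numbers are not results from the model above.

```python
import numpy as np

def false_positive_rate(labels, scores, threshold):
    """Fraction of true negatives that score above the threshold."""
    labels = np.asarray(labels).astype(bool)
    predicted_pos = np.asarray(scores) > threshold
    negatives = ~labels
    return np.sum(predicted_pos & negatives) / np.sum(negatives)

labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])                          # toy labels
scores = np.array([0.05, 0.20, 0.40, 0.70, 0.35, 0.60, 0.80, 0.95])  # toy outputs

fpr_default = false_positive_rate(labels, scores, 0.5)
fpr_strict = false_positive_rate(labels, scores, 0.75)
recall_strict = np.mean(scores[labels == 1] > 0.75)   # price paid in missed positives
```

Raising the threshold from 0.5 to 0.75 removes the one false positive in this toy set, at the cost of recovering only half of the true positives.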

## Conclusions


Here's what you should take away from this example:

* There's usually quite a bit of preprocessing you need to do on your raw data in order to be able to feed it -- as tensors -- into a neural
network. Here, the spectra were interpolated onto a common wavelength grid and normalized -- but there are other preprocessing options too.
* Stacks of `Dense` layers with `relu` activations can solve a wide range of problems, and you will
likely use them frequently.
* In a binary classification problem (two output classes), your network should end with a `Dense` layer with 1 unit and a `sigmoid` activation,
i.e. the output of your network should be a scalar between 0 and 1, encoding a probability.
* With such a scalar sigmoid output, on a binary classification problem, the loss function you should use is `binary_crossentropy`.
* An adaptive optimizer such as `rmsprop` or `adam` (used above) is generally a good enough choice of optimizer, whatever your problem. That's one less thing for you to worry
about.
* As they get better on their training data, neural networks eventually start _overfitting_ and end up obtaining increasingly worse results on
never-before-seen data. Make sure to always monitor performance on data that is outside of the training set.


The most common ways to prevent overfitting in neural networks:

* Getting more training data.
* Reducing the capacity of the network.
* Adding weight regularization.
* Adding dropout.
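
The last two items can be illustrated without any deep-learning library. Below is a NumPy sketch of inverted dropout and of an L2 weight penalty. The function names and toy inputs are made up, and this shows the idea rather than how Keras implements its `Dropout` layer or regularizers.

```python
import numpy as np

def dropout(activations, rate, rng):
    """Zero a random fraction `rate` of activations and rescale the survivors."""
    keep = rng.random(activations.shape) >= rate
    return activations * keep / (1.0 - rate)

def l2_penalty(weights, strength):
    """Extra loss term that grows with the squared size of the weights."""
    return strength * np.sum(weights ** 2)

rng = np.random.default_rng(0)
acts = np.ones((1000, 64))                     # toy activations
dropped = dropout(acts, rate=0.5, rng=rng)
zero_fraction = np.mean(dropped == 0)          # close to the dropout rate
penalty = l2_penalty(np.array([1.0, -2.0, 3.0]), strength=0.01)
```

Rescaling by `1 / (1 - rate)` keeps the expected activation unchanged, so nothing needs to be adjusted at test time when dropout is switched off.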


# In-class exercise: Train a classifier on a set of gravitational-wave noise transients

In [None]:
# Load the training set, which is stored as a pickle file
trainingset_df = pd.read_pickle("data/50_images_each_class.pkl")

# randomly shuffle the training data
trainingset_df = trainingset_df.sample(n=len(trainingset_df))

# convert class names to integers

classes = dict(enumerate(sorted(trainingset_df.true_label.unique())))
classes = dict((str(v),k) for k,v in classes.items())

trainingset_df["true_label_integer"] = trainingset_df.true_label.apply(lambda x: classes[x])

# We need to take the numpy array and reshape it into
# (N samples, rows (height) of each sample, columns (width) of each sample, number of channels)
img_rows =140
img_cols = 170
reshape_order = (-1, img_rows, img_cols, 1)
training_data = np.vstack(trainingset_df["1.0.png"]).reshape(reshape_order)

# stack the integer labels into a column vector (sparse integer labels, not one-hot)
trainingset_labels = np.vstack(trainingset_df.true_label_integer.values)
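
The label encoding and reshaping steps can be tried on a tiny stand-in data set. The class names below are hypothetical placeholders and the images are all zeros. The sketch also assumes each image is stored as a flat array of `img_rows * img_cols` values, which may not match the pickle file exactly.

```python
import numpy as np

names = ["Blip", "Whistle", "Blip", "Koi_Fish", "Whistle"]   # hypothetical class labels

# map each class name to an integer, in sorted order
class_to_int = {name: i for i, name in enumerate(sorted(set(names)))}
int_labels = np.array([class_to_int[name] for name in names]).reshape(-1, 1)

# reshape flat images into (N samples, rows, columns, channels)
img_rows, img_cols = 140, 170
flat_images = np.zeros((len(names), img_rows * img_cols))
images = flat_images.reshape(-1, img_rows, img_cols, 1)
```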

In [None]:
# Display an example of each class
fig, axes =  plt.subplots(5, 5, figsize=(16,16))

for class_name, ax in zip(trainingset_df.true_label.unique(), axes.flatten()):
    # positional index of the first example of this class
    class_idx = np.where(trainingset_df["true_label"] == class_name)[0][0]
    ax.imshow(training_data[class_idx, :, :, 0])
    ax.set_title(class_name)

## Build your own model: how well can you do?

In [None]:
# Create a model
from tensorflow.keras import layers, models
import tensorflow as tf

model = models.Sequential()
model.add(layers.Conv2D(16, (5, 5), activation='relu', input_shape=training_data.shape[1:]))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(16, (5, 5), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(32, (5, 5), activation='relu'))
model.add(layers.Flatten())
model.add(layers.Dense(32, activation='relu'))
model.add(layers.Dense(trainingset_df.true_label.unique().size, activation='softmax'))

model.summary()
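
The shapes reported by `model.summary()` can be checked by hand. The sketch below assumes the Keras defaults used above: `Conv2D` with no padding and stride 1, and `MaxPooling2D` with stride equal to the pool size. The helper names are made up.

```python
def conv_out(size, kernel):
    """Output length of an unpadded, stride-1 convolution."""
    return size - kernel + 1

def pool_out(size, pool):
    """Output length of non-overlapping pooling (any remainder is dropped)."""
    return size // pool

h, w = 140, 170
h, w = conv_out(h, 5), conv_out(w, 5)    # Conv2D(16, (5, 5))    -> 136 x 166
h, w = pool_out(h, 2), pool_out(w, 2)    # MaxPooling2D((2, 2))  -> 68 x 83
h, w = conv_out(h, 5), conv_out(w, 5)    # Conv2D(16, (5, 5))    -> 64 x 79
h, w = pool_out(h, 2), pool_out(w, 2)    # MaxPooling2D((2, 2))  -> 32 x 39
h, w = conv_out(h, 5), conv_out(w, 5)    # Conv2D(32, (5, 5))    -> 28 x 35

flat_features = h * w * 32               # Flatten over the 32 output channels
dense_params = flat_features * 32 + 32   # weights plus biases of Dense(32)
```

Almost all of the parameters sit in the first `Dense` layer after `Flatten`. More pooling, or a global pooling layer, is a common way to shrink the model.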

In [None]:
# the final layer already applies softmax, so the loss receives probabilities, not logits
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
              metrics=['accuracy'])


history = model.fit(training_data, trainingset_labels, epochs=50,
                    validation_split=0.1, batch_size=100,)
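
The `from_logits` flag tells the loss whether it should apply a softmax itself. Here is a NumPy sketch of softmax and sparse categorical cross-entropy, showing what changes when softmax ends up being applied twice. The functions are hand-written stand-ins for the Keras ones, and the scores are made up.

```python
import numpy as np

def softmax(logits):
    """Turn raw scores into probabilities that sum to one along the last axis."""
    z = logits - np.max(logits, axis=-1, keepdims=True)   # for numerical stability
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

def sparse_categorical_crossentropy(labels, probs, eps=1e-7):
    """Mean negative log-probability assigned to the true (integer) class."""
    probs = np.clip(probs, eps, 1.0)
    picked = probs[np.arange(len(labels)), labels]
    return float(np.mean(-np.log(picked)))

logits = np.array([[2.0, 0.5, -1.0],
                   [0.0, 0.0, 0.0]])      # made-up raw scores for 3 classes
labels = np.array([0, 2])                 # integer class labels

probs = softmax(logits)
loss = sparse_categorical_crossentropy(labels, probs)
# applying softmax to values that are already probabilities flattens them
loss_double = sparse_categorical_crossentropy(labels, softmax(probs))
```

A second softmax squeezes the probabilities towards uniform, so the loss is larger and the gradients weaker than they should be. The output layer and the loss should apply softmax exactly once between them.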

In [None]:
from matplotlib import pyplot
pyplot.plot(history.history['accuracy'], label='accuracy')
pyplot.plot(history.history['val_accuracy'], label = 'val_accuracy')
pyplot.xlabel('Epoch')
pyplot.ylabel('Accuracy')
pyplot.ylim([0, 1])
pyplot.legend(loc='lower right')

#test_loss, test_acc = model.evaluate(test_images,  test_labels, verbose=2)
