## Regularization for sparsity

**Zeroing out coefficients can help with performance, especially with large models and sparse inputs**

|Action|Impact|
|:-|:-|
|Fewer coefficients to store/load|Reduce memory use and model size|
|Fewer multiplications needed|Increase prediction speed|

L2 regularization only makes weights small, not zero

Feature crosses lead to lots of input nodes, so having zero weights is especially important

Minimizing the L0-norm (the count of non-zero weights) is an NP-hard, non-convex optimization problem, so the L1-norm is used as a convex substitute that still drives weights to zero

![](L1_norm.png)
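The difference between "small" and "exactly zero" can be sketched in a few lines. This is a minimal illustration, not the course's implementation: it applies only the penalty step to a single weight, and it assumes a proximal (soft-thresholding) update for L1, which is one common way to obtain exact zeros. The function names and the strengths used are hypothetical.

```python
def l2_shrink(weight, learning_rate, l2_strength):
    """One gradient step on the penalty l2_strength * w**2 alone."""
    return weight - learning_rate * 2.0 * l2_strength * weight


def l1_soft_threshold(weight, learning_rate, l1_strength):
    """One proximal step on the penalty l1_strength * |w| alone (assumed update rule)."""
    shrink = learning_rate * l1_strength
    if weight > shrink:
        return weight - shrink
    if weight < -shrink:
        return weight + shrink
    return 0.0  # weights inside the threshold snap to exactly zero


w_l2 = w_l1 = 0.5
for _ in range(100):
    w_l2 = l2_shrink(w_l2, learning_rate=0.1, l2_strength=0.1)
    w_l1 = l1_soft_threshold(w_l1, learning_rate=0.1, l1_strength=0.1)

print(w_l2)  # shrinks geometrically, stays non-zero
print(w_l1)  # reaches exactly 0.0
```

The L2 step multiplies the weight by a constant factor, so it approaches zero without reaching it. The L1 step subtracts a constant amount, so a weight with no data gradient to hold it up lands on zero and stays there.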

**Elastic nets combine the feature selection of L1 regularization with the generalizability of L2 regularization**

This way, you get the benefits of sparsity for features with very poor predictive power, while keeping decent and strong features with smaller weights so that the model generalizes well.
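As a rough sketch, the elastic net penalty is just the weighted sum of the two penalties. The function name and the two strength parameters below are illustrative assumptions, not names from the course material.

```python
def elastic_net_penalty(weights, l1_strength, l2_strength):
    """Weighted sum of the L1 penalty (sparsity) and the L2 penalty (small weights)."""
    l1_term = sum(abs(w) for w in weights)
    l2_term = sum(w * w for w in weights)
    return l1_strength * l1_term + l2_strength * l2_term


# Hypothetical weights: one already zeroed out, one large, one moderate.
penalty = elastic_net_penalty([0.0, -2.0, 0.5], l1_strength=0.1, l2_strength=0.01)
print(penalty)
```

Setting `l2_strength` to zero recovers pure L1 regularization, and setting `l1_strength` to zero recovers pure L2.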

#### L1 Regularization Quiz

What does L1 regularization tend to do to the parameter weights of a model's low-predictive features?

**C. Have zero values**

### Lab: L1 Regularization

https://goo.gl/281mPF

Try training with and without L1 regularization. What’s the difference?

regularization=L1 dataset=circle
![](regularization=L1_dataset=circle.png)

### Lab Solution: L1 Regularization
![](lab_L1.png)

## Logistic regression

### Logistic Regression

![](logistic_regression.png)

**The output of Logistic Regression is a calibrated probability estimate**

The sigmoid function is the cumulative distribution function of the logistic probability distribution. Its inverse, the quantile function of that distribution, is the logit, which models the log odds.

This is useful because it lets us cast binary classification problems as probabilistic problems
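The relationship between the sigmoid and the log odds can be checked numerically. The sketch below uses only the standard library; the function names are chosen here for illustration.

```python
import math


def sigmoid(x):
    """Map a real-valued score (log odds) to a probability in (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)  # avoids overflow for very negative x
    return z / (1.0 + z)


def logit(p):
    """Map a probability in (0, 1) back to its log odds."""
    return math.log(p / (1.0 - p))


print(sigmoid(0.0))         # even odds
print(logit(0.8))           # log of 4-to-1 odds
print(logit(sigmoid(1.5)))  # round trip recovers the score, up to rounding
```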

![](cross-entropy.png)
![](regularization_is_important.png)
![](logistic_regression_regularization_quiz.png)

**Often we do both regularization and early stopping to counteract overfitting**

In many real-world problems, the probability is not enough; we need to make a binary decision

![](ROC_curve.png)
To create the curve, we pick each possible decision threshold and re-evaluate the true positive rate and the false positive rate. Each threshold yields a single point; evaluating many thresholds traces out the curve.
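The threshold sweep can be sketched directly. This is a simple quadratic-time illustration under the assumption that an example is predicted positive when its score is at or above the threshold; the labels and scores are made-up data.

```python
def roc_points(labels, scores):
    """Return (false_positive_rate, true_positive_rate) pairs, one per threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    # Start above every score (nothing predicted positive), then try each distinct score.
    thresholds = [float("inf")] + sorted(set(scores), reverse=True)
    points = []
    for threshold in thresholds:
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


labels = [1, 1, 0, 1, 0, 0]              # hypothetical ground truth
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]  # hypothetical model probabilities
points = roc_points(labels, scores)
for fpr, tpr in points:
    print(round(fpr, 3), round(tpr, 3))
```

The curve always starts at (0, 0), where nothing is predicted positive, and ends at (1, 1), where everything is.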

Each model would create a different ROC curve. How can we use these curves to compare the relative performance of our models when we don't know exactly which decision threshold we want to use?
![](AUC.png)
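The area under the ROC curve has a threshold-free interpretation: it equals the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one. The sketch below computes it that way by comparing every positive/negative pair; the data is made up and the pairwise loop is chosen for clarity rather than speed.

```python
def auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


labels = [1, 1, 0, 1, 0, 0]               # hypothetical ground truth
model_a = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]  # ranks one negative above one positive
model_b = [0.9, 0.8, 0.3, 0.6, 0.4, 0.2]  # ranks every positive above every negative
print(auc(labels, model_a))
print(auc(labels, model_b))
```

A higher value means better ranking across all thresholds, so the second hypothetical model would be preferred here.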

#### Logistic Regression predictions should be unbiased

**average of predictions == average of observations**

Look for bias in slices of the data; this can guide improvements
![](bucketed_bias.png)
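A minimal sketch of a per-slice bias check follows. It assumes the slices are equal-width buckets of the predicted probability, which is only one possible way to slice the data, and the four examples are made up.

```python
def bias_by_bucket(labels, predictions, num_buckets=5):
    """Return {bucket_index: mean(prediction) - mean(label)} for non-empty buckets."""
    totals = [[0.0, 0.0, 0] for _ in range(num_buckets)]
    for y, p in zip(labels, predictions):
        b = min(int(p * num_buckets), num_buckets - 1)  # p == 1.0 goes in the last bucket
        totals[b][0] += p
        totals[b][1] += y
        totals[b][2] += 1
    report = {}
    for b, (pred_sum, label_sum, count) in enumerate(totals):
        if count:
            report[b] = pred_sum / count - label_sum / count
    return report


labels = [0, 0, 1, 1]
predictions = [0.1, 0.3, 0.7, 0.9]
overall_bias = sum(predictions) / len(predictions) - sum(labels) / len(labels)
report = bias_by_bucket(labels, predictions, num_buckets=2)
print(overall_bias)  # close to zero: unbiased on average
print(report)        # low bucket over-predicts, high bucket under-predicts
```

The overall averages match, yet each bucket is off by about 0.2 in opposite directions, which is the kind of hidden bias that slicing is meant to expose.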

## Introduction to Neural Networks

### Neural Networks

Combine features as an alternative to feature crossing
![](adding_non_linearity.png)

#### Non-linearity Quiz

Why is it important to add non-linear activation functions to neural networks?

**Stops the layers from collapsing back into just a linear model**
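The collapse can be verified with a tiny example: two stacked linear layers give the same output as one layer whose weights are the product of the two matrices, while inserting a non-linearity between them breaks that equivalence. The weights and input below are arbitrary values picked for illustration, and ReLU is used as the example non-linearity.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def relu_layer(m):
    """Apply max(0, v) to every entry."""
    return [[max(0.0, v) for v in row] for row in m]


x = [[1.0, 2.0]]                # one example with two features
w1 = [[2.0, 0.0], [1.0, -1.0]]  # first layer, 2x2
w2 = [[1.0], [3.0]]             # second layer, 2x1

two_layers = matmul(matmul(x, w1), w2)              # (x W1) W2
one_layer = matmul(x, matmul(w1, w2))               # x (W1 W2): a single linear layer
with_relu = matmul(relu_layer(matmul(x, w1)), w2)   # non-linearity in between

print(two_layers, one_layer, with_relu)
```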

Our favorite non-linearity is the Rectified Linear Unit

There are many different ReLU variants
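For reference, here is a small sketch of ReLU next to two widely used variants, Leaky ReLU and ELU. The notes do not say which variants are meant, so this selection and the default `alpha` values are assumptions made for illustration.

```python
import math


def relu(x):
    """Zero for negative inputs, identity for positive inputs."""
    return max(0.0, x)


def leaky_relu(x, alpha=0.01):
    """Keeps a small slope for negative inputs so the gradient never fully vanishes."""
    return x if x > 0 else alpha * x


def elu(x, alpha=1.0):
    """Smoothly saturates towards -alpha for very negative inputs."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


for value in (-2.0, 0.5):
    print(value, relu(value), leaky_relu(value), elu(value))
```

All three agree for positive inputs and differ only in how they treat negative ones.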


#### Neural Network Complexity Quiz

Neural networks can be arbitrarily complex. To increase hidden dimensions, I can add _______. To increase function composition, I can add _______. If I have multiple labels per example, I can add _______.

**Neurons, layers, outputs**

### Lab: Neural Networks Playground

https://goo.gl/2eig4q

https://goo.gl/wXbGDW

https://goo.gl/i9r55D

## Training Neural Networks

![](DNNRegressor.png)
![](three_common_failure.png)

#### Gradient Descent Debugging Quiz

Which of these is good advice if my model is experiencing exploding gradients?

A. Lower the learning rate

B. Add weight regularization

C. Add gradient clipping

D. Add batch normalization

E. See a doctor

F. Both C and D

G. A, C, D

**H. A, B, C, D**
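Of the remedies listed, gradient clipping is the easiest to show in isolation. The sketch below assumes clipping by global norm, which rescales the whole gradient vector when its length exceeds a limit; other clipping schemes exist, and the function name and numbers are illustrative.

```python
import math


def clip_by_global_norm(gradients, max_norm):
    """Rescale the gradient vector so its L2 norm is at most max_norm."""
    global_norm = math.sqrt(sum(g * g for g in gradients))
    if global_norm <= max_norm:
        return list(gradients), global_norm
    scale = max_norm / global_norm
    return [g * scale for g in gradients], global_norm


clipped, norm_before = clip_by_global_norm([30.0, -40.0], max_norm=5.0)
print(norm_before)  # norm of the original, exploding gradient
print(clipped)      # same direction, length capped at max_norm
```

Rescaling preserves the direction of the update and only limits the step size.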


![](dropout.png)

Dropout simulates ensemble learning

#### Dropout Quiz

Dropout acts as another form of ______. It forces data to flow down ______ paths so that there is a more even spread. It also simulates ______ learning. Don’t forget to scale the dropout activations by the inverse of the ______. We remove dropout during ______.

**Regularization, multiple, ensemble, keep probability, inference**
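The scaling and the inference behavior from the quiz can be sketched as follows. This is a minimal "inverted dropout" illustration on a plain list of activations, not a framework implementation; the function signature and the `rng` parameter are assumptions made so the example is reproducible.

```python
import random


def dropout(activations, keep_prob, training, rng=random):
    """Zero each activation with probability 1 - keep_prob and rescale the survivors."""
    if not training:
        return list(activations)  # dropout is removed during inference
    # Dividing by keep_prob keeps the expected activation unchanged.
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]


rng = random.Random(0)  # fixed seed for a repeatable run
train_out = dropout([1.0] * 10000, keep_prob=0.8, training=True, rng=rng)
infer_out = dropout([1.0, 2.0], keep_prob=0.8, training=False)

print(sum(train_out) / len(train_out))  # close to 1.0 thanks to the rescaling
print(infer_out)                        # unchanged
```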



## Multi-class Neural Networks

Use one softmax loss for all possible classes
```
logits = tf.matmul(...)  # shape = [batch_size, num_classes]
labels = ...             # one-hot encoding over the num_classes classes
                         # shape = [batch_size, num_classes]
loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits_v2(  # shape = [batch_size]
        labels=labels, logits=logits)  # pass by keyword so the two are not swapped
)
```

```
logits = tf.matmul(...)  # shape = [batch_size, num_classes]
labels = ...             # class index in [0, num_classes)
                         # shape = [batch_size]
loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(  # shape = [batch_size]
        labels=labels, logits=logits)  # this op requires keyword arguments
)
```
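To make the relationship between the two label formats concrete, here is a framework-free sketch for a single example. It is not how TensorFlow implements these ops; the helper names are hypothetical and the logits are made up. With a one-hot label, both forms give the same loss.

```python
import math


def log_softmax(logits):
    """Numerically stable log of the softmax probabilities."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum for z in logits]


def softmax_cross_entropy(logits, label_distribution):
    """Labels are a probability distribution over classes (one-hot or soft)."""
    return -sum(y * lp for y, lp in zip(label_distribution, log_softmax(logits)))


def sparse_softmax_cross_entropy(logits, class_index):
    """Labels are a single class index in [0, num_classes)."""
    return -log_softmax(logits)[class_index]


logits = [2.0, 1.0, 0.1]
dense_loss = softmax_cross_entropy(logits, [0.0, 1.0, 0.0])
sparse_loss = sparse_softmax_cross_entropy(logits, 1)
soft_loss = softmax_cross_entropy(logits, [0.5, 0.5, 0.0])  # only the dense form accepts this

print(dense_loss, sparse_loss, soft_loss)
```

The sparse form saves memory because it never materializes the one-hot matrix, but it cannot express soft labels.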

Use softmax only when classes are mutually exclusive.
"Multi-Class, Single-Label Classification". An example may be a member of only one class

There are also multi-class problems where examples may **belong to more than one class**. In that case, apply a sigmoid cross-entropy to each class independently:
```
tf.nn.sigmoid_cross_entropy_with_logits(
    labels=labels, logits=logits)  # shape = [batch_size, num_classes]
```
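A framework-free sketch of the per-class sigmoid loss follows, using the numerically stable form `max(x, 0) - x * z + log(1 + exp(-|x|))` for a logit `x` and label `z`. The helper names and the example values are illustrative assumptions.

```python
import math


def sigmoid_cross_entropy(logit, label):
    """Binary cross-entropy on one logit, written to avoid overflow in exp."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


def multi_label_loss(logits, labels):
    """One independent loss per class; an example may carry several 1.0 labels."""
    return [sigmoid_cross_entropy(x, z) for x, z in zip(logits, labels)]


# Hypothetical example that belongs to classes 0 and 2 at the same time.
losses = multi_label_loss([3.0, -2.0, 0.0], [1.0, 0.0, 1.0])
print(losses)
```

Because each class is scored on its own, the predicted probabilities are not forced to sum to one, which is what allows several classes to be active together.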

**If you have hundreds or thousands of classes, loss computation can become a significant bottleneck.** The full softmax has to evaluate every output node for every example
![](nce.png)

#### Softmax Quiz

For our classification output, if we have both mutually exclusive labels and probabilities, we should use ______. If the labels are mutually exclusive, but the probabilities aren’t, we should use ______. If our labels aren’t mutually exclusive, we should use ______.

I. tf.nn.sigmoid_cross_entropy_with_logits

II. tf.nn.sparse_softmax_cross_entropy_with_logits

III. tf.nn.softmax_cross_entropy_with_logits_v2

**III, II, I**