Currently at: https://developers.google.com/machine-learning/crash-course/reducing-loss/stochastic-gradient-descent

# Content

Crash course: https://developers.google.com/machine-learning/crash-course/
        

(source:  https://www.tensorflow.org/tutorials/keras/)

# Summary

## 0. Intro to ML


## 1. Framing
- Objectives
    - Refresh the fundamental machine learning terms.
    - Explore various uses of machine learning.
- Labels
- Features
- Examples
- Models
- Regression vs. classification
    
    
## 2. Descending into ML
- Objectives
    - Refresh your memory on line fitting.
    - Relate weights and biases in machine learning to slope and offset in line fitting.
    - Understand "loss" in general and squared loss in particular.
- Linear Regression
- Training and Loss
    - Training
    - Mean square error (MSE)
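
A minimal sketch of the MSE item above, assuming a one-feature linear model `y' = b + w * x`. The data and the helper names (`predict`, `mean_squared_error`) are made up for illustration and are not taken from the course.

```python
def mean_squared_error(labels, predictions):
    """Average of the squared differences between labels and predictions."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    squared_errors = [(y - y_hat) ** 2 for y, y_hat in zip(labels, predictions)]
    return sum(squared_errors) / len(squared_errors)


def predict(x, weight, bias):
    """Linear model: weight plays the role of slope, bias the role of offset."""
    return bias + weight * x


# Hypothetical data lying exactly on the line y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

# The true line gives zero loss; a wrong bias is off by 1 on every example.
perfect = mean_squared_error(ys, [predict(x, 2.0, 1.0) for x in xs])
worse = mean_squared_error(ys, [predict(x, 2.0, 0.0) for x in xs])
print(perfect, worse)
```

Training amounts to searching for the weight and bias that make this number as small as possible.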


## 3. Reducing Loss
- Objectives
    - Discover how to train a model using an iterative approach.
    - Understand full gradient descent and some variants, including mini-batch gradient descent and stochastic gradient descent.
    - Experiment with learning rate.
- An Iterative Approach
- Gradient Descent: pick a starting value for w (it can be random), calculate the gradient of the loss curve at that point, then take a step in the direction of the negative gradient
- Learning Rate (also sometimes called step size)
- Optimizing Learning Rate
- Stochastic Gradient Descent
    - batch: The set of examples used in one iteration (that is, one gradient update) of model training.
    - batch size: The number of examples in a batch. For example, the batch size of SGD is 1, while the batch size of a mini-batch is usually between 10 and 1,000.
    - Stochastic gradient descent (SGD) takes the idea of estimating the gradient from a small random sample to the extreme: it uses only a single example (a batch size of 1) per iteration. The term "stochastic" indicates that the one example comprising each batch is chosen at random.
    - Mini-batch stochastic gradient descent (mini-batch SGD) is a compromise between full-batch iteration and SGD. A mini-batch is typically between 10 and 1,000 examples, chosen at random.
- Playground Exercise
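
The three variants above differ only in batch size, which can be sketched with one function. This is an illustrative sketch under stated assumptions: squared loss, a one-feature linear model, a fixed starting point of 0 instead of a random one, and made-up noise-free data. The function name and default hyperparameters are hypothetical.

```python
import random


def sgd_linear_regression(xs, ys, learning_rate=0.05, batch_size=1,
                          steps=2000, seed=0):
    """Fit y = w * x + b by (mini-batch) gradient descent on squared loss."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0  # starting point (assumption: zeros rather than random)
    n = len(xs)
    for _ in range(steps):
        # One iteration: draw a random batch, compute the gradient, update once.
        batch = rng.sample(range(n), batch_size)
        grad_w = 0.0
        grad_b = 0.0
        for i in batch:
            error = (w * xs[i] + b) - ys[i]
            grad_w += 2 * error * xs[i]
            grad_b += 2 * error
        # Step against the average gradient, scaled by the learning rate.
        w -= learning_rate * grad_w / batch_size
        b -= learning_rate * grad_b / batch_size
    return w, b


# Hypothetical data on the line y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

w_sgd, b_sgd = sgd_linear_regression(xs, ys, batch_size=1)          # SGD
w_mini, b_mini = sgd_linear_regression(xs, ys, batch_size=2)        # mini-batch
w_full, b_full = sgd_linear_regression(xs, ys, batch_size=len(xs))  # full batch
print(w_sgd, b_sgd, w_mini, b_mini, w_full, b_full)
```

On this toy data all three runs should end close to w = 2 and b = 1. Raising `learning_rate` much further makes the updates overshoot, which is the behavior the learning-rate exercises explore.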

## 4. First Steps with TF: toolkit
- Programming example
    - Create a synthetic feature that is the ratio of two other features
    - Use this new feature as an input to a linear regression model
    - Improve the effectiveness of the model by identifying and clipping (removing) outliers out of the input data
- Objectives
    - Learn how to create and modify tensors in TensorFlow.
    - Learn the basics of pandas.
    - Develop linear regression code with one of TensorFlow's high-level APIs.
    - Experiment with learning rate.
- Figure 1. TensorFlow toolkit hierarchy.
- tf.estimator API
- graph: Nodes in the graph represent operations. Edges are directed and represent passing the result of an operation (a Tensor) as an operand to another operation. Use TensorBoard to visualize a graph.
- steps: The total number of training iterations. One step calculates the loss from one batch and uses that value to modify the model's weights once.
- Others
    - epoch: A full training pass over the entire data set such that each example has been seen once. Thus, an epoch represents N / batch size training iterations, where N is the total number of examples. Equivalently, it is one forward pass and one backward pass over all the training examples.
    - iteration: A single update of a model's weights during training. An iteration consists of computing the gradients of the loss with respect to the parameters on a single batch of data.
    - batch size: The number of examples (chosen at random) used in a single step, for instance the number of images per step. For example, the batch size for SGD is 1.
    - For instance, if you have 20,000 images and a batch size of 100, then one epoch contains 20,000 / 100 = 200 steps (iterations).
    - As an example, if you have 2,000 images and use a batch size of 10 an epoch consists of 2,000 images / (10 images / step) = 200 steps.
- Total number of examples processed during training = batch_size * steps (an example is counted again each time it is revisited)
- Number of examples processed in each period = batch_size * steps / periods
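
The arithmetic relating batch size, steps, epochs, and periods can be checked with a few helper functions. The function names are made up for this note, and rounding a final partial batch up to a full step is an assumption that the definitions above do not address.

```python
def steps_per_epoch(num_examples, batch_size):
    """Steps needed to see every example once (partial last batch rounds up)."""
    return -(-num_examples // batch_size)  # ceiling division


def examples_processed(batch_size, steps):
    """Total number of examples fed to the model over all steps."""
    return batch_size * steps


def examples_per_period(batch_size, steps, periods):
    """Examples fed to the model in each reporting period."""
    return batch_size * steps / periods


print(steps_per_epoch(20_000, 100))       # 200
print(steps_per_epoch(2_000, 10))         # 200
print(examples_processed(100, 500))       # 50000
print(examples_per_period(100, 500, 10))  # 5000.0
```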

## 5. Generalization
- Objective
    - Develop intuition about overfitting.
    - Determine whether a model is good or not.
    - Divide a data set into a training set and a test set.
- DEF generalization: Refers to your model's ability to make correct predictions on new, previously unseen data as opposed to the data used to train the model.


## 6. Training and Test Sets
- Objectives
    - Examine the benefits of dividing a data set into a training set and a test set.
    

## 7. Validation
- Programming example
    - Use multiple features, instead of a single feature, to further improve the effectiveness of a model
    - Debug issues in model input data
    - Use a test data set to check if a model is overfitting the validation data
- Objectives
    - Understand the importance of a validation set in a partitioning scheme.
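
A partitioning scheme with a validation set can be sketched as a shuffled three-way split. The 70/15/15 proportions and the `split_dataset` helper are assumptions for illustration, not values prescribed by the course.

```python
import random


def split_dataset(examples, train_fraction=0.7, validation_fraction=0.15, seed=0):
    """Shuffle, then cut into training, validation, and test partitions."""
    shuffled = list(examples)
    # Shuffle first so that no partition inherits an ordering in the source data.
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    n_validation = round(len(shuffled) * validation_fraction)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_validation]
    test = shuffled[n_train + n_validation:]  # whatever remains
    return train, validation, test


train, validation, test = split_dataset(range(100))
print(len(train), len(validation), len(test))  # 70 15 15
```

The intended workflow is to tune on the validation set and touch the test set only for a final check, so that the test result still reflects performance on unseen data.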
    
    
## 8. Representation
- Programming example: Feature Sets
    - Create a minimal set of features that performs just as well as a more complex feature set
- Objectives
    - Map fields from logs and protocol buffers into useful ML features.
    - Determine which qualities comprise great features.
    - Handle outlier features.
    - Investigate the statistical properties of a data set.
    - Train and evaluate a model with tf.estimator.
- DEF representation: The process of mapping data to useful features.

## 9. Feature Crosses
- Programming examples
    - Improve a linear regression model by adding synthetic features (this is a continuation of the previous exercise)
    - Use an input function to convert pandas DataFrame objects to Tensors and invoke the input function in fit() and predict() operations
    - Use the FTRL optimization algorithm for model training
    - Create new synthetic features through one-hot encoding, binning, and feature crosses
- Objectives
    - Build an understanding of feature crosses.
    - Implement feature crosses in TensorFlow.
- DEF feature cross: A synthetic feature that encodes nonlinearity in the feature space by multiplying two or more input features together.
- Linear learners scale well to massive data. Using feature crosses on massive data sets is one efficient strategy for learning highly complex models. Neural networks provide another strategy.
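
Binning, one-hot encoding, and crossing can be sketched in plain Python. The latitude and longitude boundaries below are hypothetical, and the helpers are illustrative stand-ins rather than the TensorFlow feature-column API used in the exercise.

```python
def bucketize(value, boundaries):
    """Return the index of the bin that value falls into."""
    for index, boundary in enumerate(boundaries):
        if value < boundary:
            return index
    return len(boundaries)


def one_hot(index, depth):
    """Vector of zeros with a single 1.0 at position index."""
    return [1.0 if i == index else 0.0 for i in range(depth)]


def cross(a, b):
    """Cross two vectors: every pairwise product, flattened into one vector."""
    return [x * y for x in a for y in b]


# Hypothetical bin boundaries: 3 latitude bins, 2 longitude bins.
lat_boundaries = [34.0, 38.0]
lon_boundaries = [-120.0]

lat_vec = one_hot(bucketize(36.5, lat_boundaries), 3)    # middle latitude bin
lon_vec = one_hot(bucketize(-122.0, lon_boundaries), 2)  # western longitude bin
crossed = cross(lat_vec, lon_vec)
print(crossed)  # [0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
```

The crossed vector has one slot per (latitude bin, longitude bin) pair, so a linear model can learn a separate weight for each grid cell, something it cannot do from the two features separately.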

## 10. Regularization for Simplicity
- Objectives
    - Learn about trade-offs between complexity and generalizability.
    - Experiment with L2 regularization.
- L2 regularization: A type of regularization that penalizes weights in proportion to the sum of the squares of the weights.
    - Encourages weight values toward 0 (but not exactly 0)
    - Encourages the mean of the weights toward 0, with a normal (bell-shaped or Gaussian) distribution.
- lambda: tune the overall impact of the regularization term by multiplying its value by a scalar known as lambda
    - The ideal value of lambda produces a model that generalizes well to new, previously unseen data. 
    - Strong L2 regularization values tend to drive feature weights closer to 0. Lower learning rates (with early stopping) often produce the same effect because the steps away from 0 aren't as large.
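
The L2 term and the role of lambda can be sketched as follows. The weights, the data loss of 1.0, and the function names are made up for illustration; `lam` is used because `lambda` is a reserved word in Python.

```python
def l2_penalty(weights):
    """Sum of the squares of the weights."""
    return sum(w ** 2 for w in weights)


def regularized_loss(data_loss, weights, lam):
    """Training objective: data loss plus lambda times the L2 penalty."""
    return data_loss + lam * l2_penalty(weights)


# Hypothetical weights: the single large weight dominates the penalty.
weights = [0.2, 0.5, 5.0, 1.0, 0.25, 0.75]
penalty = l2_penalty(weights)

no_regularization = regularized_loss(1.0, weights, lam=0.0)
strong_regularization = regularized_loss(1.0, weights, lam=0.1)
print(penalty, no_regularization, strong_regularization)
```

Because the penalty grows with the square of each weight, minimizing the combined objective pushes large weights down hardest, and a larger lambda makes that pull stronger.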


## 11. Logistic Regression
- Objectives
    - Understand logistic regression.
    - Explore loss and regularization functions for logistic regression.
- **Linear vs. Logistic Regression**: In linear regression, the outcome (dependent variable) is continuous. It can have any one of an infinite number of possible values. In logistic regression, the outcome (dependent variable) has only a limited number of possible values.
    - For instance, if X contains the area in square feet of houses, and Y contains the corresponding sale price of those houses, you could use linear regression to predict selling price as a function of house size. While the selling price cannot literally take any value, there are so many possible values that a linear regression model would be chosen.
    - If, instead, you wanted to predict, based on size, whether a house would sell for more than 200K, you would use logistic regression. The possible outputs are either Yes, the house will sell for more than 200K, or No, the house will not.
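
The "more than 200K" example can be sketched as a logistic model that turns a linear score into a probability. The weight, bias, and feature scaling below are hypothetical, and log loss is averaged over examples here as an assumption.

```python
import math


def sigmoid(z):
    """Squash any real-valued score into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(features, weights, bias):
    """Logistic regression: sigmoid applied to a linear combination."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


def log_loss(labels, probabilities):
    """Mean log loss for labels in {0, 1} and probabilities strictly in (0, 1)."""
    total = 0.0
    for y, p in zip(labels, probabilities):
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)


# Hypothetical model: house size in thousands of square feet.
weights, bias = [1.5], -3.0
sizes = [[1.0], [2.0], [3.0]]
sold_above_200k = [0, 1, 1]  # made-up labels

probabilities = [predict_probability(x, weights, bias) for x in sizes]
loss = log_loss(sold_above_200k, probabilities)
print(probabilities, loss)
```

The output is a probability rather than a price. Comparing it against a threshold such as 0.5 yields the Yes/No answer.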


## 12. Classification
- **Skipped some parts**
- Programming example
    - Reframe the median house value predictor (from the preceding exercises) as a binary classification model
    - Compare the effectiveness of logistic regression vs linear regression for a binary classification problem
- Objectives
    - Evaluating the accuracy and precision of a logistic regression model.
    - Understanding ROC Curves and AUCs.
- Accuracy: What proportion of all predictions was correct?
    - Precision: What proportion of positive identifications was actually correct?
    - Recall: What proportion of actual positives was identified correctly?
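
The three metrics above can be computed from confusion-matrix counts. The labels and predictions are made up, and returning 0.0 when a denominator is zero is a convention chosen for this sketch.

```python
def confusion_counts(labels, predictions):
    """Count true/false positives and negatives for binary labels in {0, 1}."""
    tp = fp = tn = fn = 0
    for y, p in zip(labels, predictions):
        if p == 1 and y == 1:
            tp += 1
        elif p == 1 and y == 0:
            fp += 1
        elif p == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def accuracy(tp, fp, tn, fn):
    """Proportion of all predictions that were correct."""
    return (tp + tn) / (tp + fp + tn + fn)


def precision(tp, fp):
    """Proportion of positive predictions that were actually positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Proportion of actual positives that were predicted positive."""
    return tp / (tp + fn) if (tp + fn) else 0.0


labels      = [1, 1, 1, 0, 0, 0, 0, 1]
predictions = [1, 0, 1, 0, 1, 0, 0, 0]
tp, fp, tn, fn = confusion_counts(labels, predictions)
print(accuracy(tp, fp, tn, fn), precision(tp, fp), recall(tp, fn))
```

Here accuracy is 5/8, precision is 2/3, and recall is 2/4, which shows that the three numbers can disagree on the same set of predictions.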
    

## 13. Regularization: Sparsity
- **Skipped some parts**
- Programming example
    - Calculate the size of a model
    - Apply L1 regularization to reduce the size of a model by increasing sparsity
- Objectives
    - Learn how to drive uninformative coefficient values to exactly 0, in order to save RAM.
    - Learn about other kinds of regularization besides L2.
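
Why L1 produces exact zeros while L2 only shrinks weights can be illustrated with one update step of each. This sketch swaps in a soft-thresholding (proximal) step for L1, which is a different technique from the optimizer used in the programming exercise. The weights and hyperparameters are made up.

```python
def l2_shrink(w, learning_rate, lam):
    """Gradient step on lam * w**2: scales the weight toward 0."""
    return w - learning_rate * 2 * lam * w


def l1_soft_threshold(w, learning_rate, lam):
    """Proximal step on lam * |w|: subtracts a fixed amount, clamping at 0."""
    threshold = learning_rate * lam
    if w > threshold:
        return w - threshold
    if w < -threshold:
        return w + threshold
    return 0.0  # small weights are set to exactly zero


weights = [0.8, -0.05, 0.02, -1.2]  # hypothetical weights
after_l1 = [l1_soft_threshold(w, learning_rate=0.1, lam=1.0) for w in weights]
after_l2 = [l2_shrink(w, learning_rate=0.1, lam=1.0) for w in weights]

zeros_l1 = sum(1 for w in after_l1 if w == 0.0)
zeros_l2 = sum(1 for w in after_l2 if w == 0.0)
print(after_l1, zeros_l1)
print(after_l2, zeros_l2)
```

The two small weights become exactly 0 under the L1 step and can be dropped from the stored model, while the L2 step leaves every weight nonzero.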


## 14. Intro to Neural Nets
- Programming example
    - Define a neural network (NN) and its hidden layers using the TensorFlow DNNRegressor class
    - Train a neural network to learn nonlinearities in a dataset and achieve better performance than a linear regression model
- Objectives
    - Develop some intuition about neural networks, particularly about: hidden layers, activation functions
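
A hidden layer with an activation function can be sketched as a hand-wired forward pass. The weights below are chosen by hand (not learned) so that the network computes the absolute value of its input, a function no purely linear model can represent. All names and values are illustrative.

```python
def relu(z):
    """Rectified linear unit activation."""
    return max(0.0, z)


def dense(inputs, weights, biases, activation=None):
    """One fully connected layer: a weighted sum per neuron, then activation."""
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        z = bias + sum(w * x for w, x in zip(neuron_weights, inputs))
        outputs.append(activation(z) if activation else z)
    return outputs


# Hand-picked parameters: hidden layer computes relu(x) and relu(-x).
HIDDEN_WEIGHTS = [[1.0], [-1.0]]
HIDDEN_BIASES = [0.0, 0.0]
OUTPUT_WEIGHTS = [[1.0, 1.0]]
OUTPUT_BIASES = [0.0]


def forward(x):
    """Input -> hidden layer (ReLU) -> linear output."""
    hidden = dense([x], HIDDEN_WEIGHTS, HIDDEN_BIASES, activation=relu)
    return dense(hidden, OUTPUT_WEIGHTS, OUTPUT_BIASES)[0]


print(forward(3.0), forward(-2.0))  # 3.0 2.0
```

Removing the `relu` call collapses the two layers into a single linear function, which is why the activation function is what gives the hidden layer its extra expressive power.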
    
## 15. Training Neural Networks
- Objectives
    - Develop some intuition around backpropagation.
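
For intuition, backpropagation on the smallest possible network (one ReLU hidden unit, squared loss) is the chain rule applied from the output back to each parameter. The parameter values are made up, and the finite-difference check is included only to confirm the hand-derived gradients.

```python
def forward(params, x):
    """y_hat = w2 * relu(w1 * x + b1) + b2; also return the intermediates."""
    w1, b1, w2, b2 = params
    z = w1 * x + b1
    h = max(0.0, z)
    y_hat = w2 * h + b2
    return z, h, y_hat


def loss(params, x, y):
    """Squared loss for a single example."""
    return (forward(params, x)[2] - y) ** 2


def backward(params, x, y):
    """Gradients of the loss w.r.t. [w1, b1, w2, b2] via the chain rule."""
    w1, b1, w2, b2 = params
    z, h, y_hat = forward(params, x)
    d_y_hat = 2 * (y_hat - y)               # dL/dy_hat
    d_w2 = d_y_hat * h                      # y_hat = w2 * h + b2
    d_b2 = d_y_hat
    d_h = d_y_hat * w2                      # propagate into the hidden unit
    d_z = d_h * (1.0 if z > 0 else 0.0)     # derivative of ReLU
    d_w1 = d_z * x                          # z = w1 * x + b1
    d_b1 = d_z
    return [d_w1, d_b1, d_w2, d_b2]


def numerical_gradient(params, x, y, eps=1e-6):
    """Central finite differences, used only to check backward()."""
    grads = []
    for i in range(len(params)):
        plus = list(params)
        minus = list(params)
        plus[i] += eps
        minus[i] -= eps
        grads.append((loss(plus, x, y) - loss(minus, x, y)) / (2 * eps))
    return grads


params = [0.5, 0.1, -1.5, 0.2]  # hypothetical w1, b1, w2, b2
analytic = backward(params, x=2.0, y=1.0)
numeric = numerical_gradient(params, x=2.0, y=1.0)
print(analytic)
print(numeric)
```

The two gradient lists should agree to several decimal places. In a real network the same backward sweep is repeated layer by layer, reusing each layer's intermediate values from the forward pass.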