# Clayton (cemellina@gmail.com) is a Machine Learning Expert at Yahoo
- Object recognition models
- Face recognition for celebrities
- Smart Cropping
- Aesthetic Prediction

## Structure of the Course
- What is a NN?
- How do we train a NN?
- What is it doing?
- What are the parts of a complete model?
- Tomorrow: a CNN and an RNN!
- Today: an image recognition system


## Metanotes about Running Workshops
- Jupyter is great
- Tons of time is wasted in Environment setup - do everything you can to avoid individual setup

## Installation Notes
- `pip install scikit-learn`, _not_ `pip install skit_learn`
- change the backend from *theano* to *tensorflow* in ~/.keras/keras.json
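
A minimal sketch of the backend switch, assuming `keras.json` is a flat JSON file with a `"backend"` key. The helper name `set_keras_backend` is hypothetical and not part of Keras; editing the file by hand works just as well.

```python
import json
from pathlib import Path


def set_keras_backend(config_path, backend="tensorflow"):
    """Set the "backend" key in a Keras JSON config, keeping other keys."""
    path = Path(config_path).expanduser()
    # Start from the existing settings if the file is already there.
    config = json.loads(path.read_text()) if path.exists() else {}
    config["backend"] = backend
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(config, indent=4))
    return config


# Usage (left commented out so the sketch has no side effects):
# set_keras_backend("~/.keras/keras.json", "tensorflow")
```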

***

### Session 1
- the parts of a neuron: inputs, weights, summation, bias, activation function, and output
- an **Epoch** is a full pass through the training data
- **model = data + structure + loss + optimizer** (the master formula)
- a **Batch** is a group of inputs that we process at the same time
- batch size matters
- an **Epoch** can have multiple **Batches**
- **optimizers** often set the learning rate (lr)
- **Deep Learning** is *Deep* because there are lots of layers
- 2 layers make a model *Deep*
- parameters: weights and biases (theta)
- *Layers* that aren't at the bounds are called **hidden** because their values are never directly observed (they are neither the inputs nor the outputs)
- Deep models are also sometimes called *Multilayer Perceptrons* (MLP)
- all MLPs are _fully connected_
- if you can draw a decent line between two "blobs" the dataset is said to be *Linearly Separable*
- weights are linear - activation functions are non-linear
- **RELU** (rectified linear unit) is the most popular activation function
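
The Session 1 vocabulary can be sketched in plain Python. This is an illustration only, not the workshop's code: the function names `relu`, `neuron`, and `batches` and all the numbers are made up for the example.

```python
def relu(z):
    """Rectified linear unit: keep positive values, clamp negatives to 0."""
    return max(0.0, z)


def neuron(inputs, weights, bias, activation=relu):
    """One unit: weighted sum of the inputs, plus a bias, through an activation."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias  # linear part
    return activation(z)                                    # non-linear part


def batches(data, batch_size):
    """Split one epoch's worth of data into consecutive batches."""
    return [data[i:i + batch_size] for i in range(0, len(data), batch_size)]


# 1.0*0.5 + 2.0*1.0 + 0.25 = 2.75, which ReLU passes through unchanged.
active = neuron([1.0, 2.0], [0.5, 1.0], bias=0.25)

# 1.0*0.5 + 2.0*(-1.0) + 0.25 = -1.25, which ReLU clamps to 0.0.
inactive = neuron([1.0, 2.0], [0.5, -1.0], bias=0.25)

# One epoch over 10 examples with batch size 4 takes 3 batches (4, 4, 2).
epoch = batches(list(range(10)), batch_size=4)
```

The weights and bias here are the parameters (theta) that training would adjust; the batch size only changes how many examples are processed per step.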

### Session 2
- **Optimization**
- *Loss Surfaces* are often highly irregular
- *Learning Rate* (Alpha) is the step size or epsilon in your gradient descent algorithm
- Learning Rate is more important than Batch Size
- If the main knob you can tune is the Learning Rate, then the main plot you care to look at is the Loss Plot
- Backprop is an algorithm for efficiently calculating gradients for gradient descent
- Backprop is the chain rule in action
- Multinomial vs Binomial crossentropy (different kinds of loss functions)
- Mean squared error is probably the best loss function for regression type problems
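- *Validation Data* was treated as the same thing as Test Data here: held-out data the model does not train on (strictly, a test set is kept apart for the final evaluation only)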
- Deep Learning seems to have a bigger problem with overfitting than other techniques in ML
- The 3 best things to improve your models
    - Get more Data
    - Weight regularization (maybe because it helps you ignore outliers in the dataset - unclear)
    - Dropout
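
To make the learning rate, the loss plot, and the chain rule concrete, here is a small gradient descent sketch for a one-variable linear model with mean squared error. It is an assumption-laden toy, not the workshop's code: the data, the `train` function, and the chosen learning rate are all hypothetical.

```python
def mse(preds, targets):
    """Mean squared error, a common loss for regression."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets)


def train(xs, ys, lr=0.05, epochs=500):
    """Fit y = w*x + b by full-batch gradient descent."""
    w, b = 0.0, 0.0
    losses = []  # one value per epoch: this is what a loss plot shows
    n = len(xs)
    for _ in range(epochs):
        preds = [w * x + b for x in xs]
        losses.append(mse(preds, ys))
        # Chain rule: dL/dw = dL/dpred * dpred/dw, with dpred/dw = x.
        grad_w = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / n
        # Same idea for the bias, where dpred/db = 1.
        grad_b = sum(2 * (p - y) for p, y in zip(preds, ys)) / n
        # The learning rate is the step size along the negative gradient.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b, losses


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # generated from y = 2x + 1
w, b, losses = train(xs, ys)
```

With this data `w` and `b` end up close to 2 and 1. Rerunning with a much larger `lr` makes the recorded losses grow instead of shrink, which is the kind of thing the loss plot is meant to reveal.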

### Session 3
- **TensorFlow**
- **Broadcasting**: the rules for multiplying and adding tensors of different shapes (tensors, matrices, vectors, and scalars)
- **Graph**: Collections of Tensors and operations on those Tensors (like functions)
- **Placeholders**: the *arguments* to a graph (function)
- **Variables**: are parameters (like weights and biases - things you can move up and down) but they're not *Placeholders* (arguments)
- **Session**: Actually executes the model after construction (maybe important to distributed computation?)
- Use Caffe in Production (it's not a good tool to prototype or learn Deep Learning, but it is a good tool to deploy models)
- **Convolutional Neural Networks**
- *Classification* vs *Detection* (the image contains a cat vs here is the cat)
- *Convolve* a filter with an image
- Properties of Convolution to think about
    - Convolution has shared parameters
    - Convolution is a *sparse* operation, not everything connects to everything else
    - (it seems to me like convolution is about keeping structure and a kind of micro-scale ensembling)
    - Weights are 4 dimensional at every layer (for images: filter height, filter width, input channels, output channels)
    - **Stride** is a measure of how many *pixels* to step when incrementing across the source
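    - **Padding**: surrounding the input with zeros so the filter also fits at the borders, including when the stride does not divide the input evenly
    - Not *Padding* is called **Valid Padding**; **Same Padding** adds enough zeros that (at stride 1) the output is the same size as the input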
    - The number of *filters* you use in a CNN layer roughly corresponds to the number of units in an MLP layer
    - A **Pooling** Layer also has a size (like a Convolutional layer, say 2x2) and works like a reducer: it may output the max, for example, or the average
    - Pooling can be used to decrease the output dimensionality
    - A **Flatten** Layer will take an input tensor and "unroll it" to make a vector
    - It's very common to use several Convolutional layers, then flatten the result and hand it off to softmax for output; that pattern is what is usually referred to as a CNN
- Keras has some terminology duplication
    - border mode = padding
    - filter = kernel
    - stride = subsample
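
The convolution, padding, pooling, and flatten notes above can be sketched without any library. This is a toy for a single-channel image and a single square filter, with hypothetical names and values; real layers add channels, many filters, and a bias. Like most deep learning libraries, it slides the filter without flipping it.

```python
def pad2d(image, pad):
    """Surround a 2-D list with `pad` rows and columns of zeros."""
    width = len(image[0]) + 2 * pad
    top = [[0] * width for _ in range(pad)]
    bottom = [[0] * width for _ in range(pad)]
    body = [[0] * pad + list(row) + [0] * pad for row in image]
    return top + body + bottom


def conv2d(image, kernel, stride=1, pad=0):
    """Slide one shared kernel over the (optionally zero-padded) image."""
    image = pad2d(image, pad)
    k = len(kernel)
    out = []
    for i in range(0, len(image) - k + 1, stride):
        row = []
        for j in range(0, len(image[0]) - k + 1, stride):
            # Sparse: each output only sees a k x k patch of the input.
            row.append(sum(image[i + a][j + b] * kernel[a][b]
                           for a in range(k) for b in range(k)))
        out.append(row)
    return out


def max_pool(image, size=2):
    """Reduce each non-overlapping size x size window to its maximum."""
    return [[max(image[i + a][j + b]
                 for a in range(size) for b in range(size))
             for j in range(0, len(image[0]) - size + 1, size)]
            for i in range(0, len(image) - size + 1, size)]


def flatten(image):
    """Unroll a 2-D list into a flat vector."""
    return [v for row in image for v in row]


image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
kernel = [[1, 0, 0],
          [0, 1, 0],
          [0, 0, 1]]  # hypothetical 3x3 filter that sums one diagonal

unpadded = conv2d(image, kernel)       # no padding: 4x4 shrinks to 2x2
padded = conv2d(image, kernel, pad=1)  # one ring of zeros: stays 4x4
pooled = max_pool(padded)              # 2x2 max pooling: 4x4 becomes 2x2
vector = flatten(pooled)               # 2x2 becomes a vector of length 4
```

Keras calls these two border modes `valid` (no padding) and `same` (padded so the size is kept at stride 1). The same nine kernel weights are reused at every position, which is the shared-parameter property from the list.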


### Session 4
- Feature Maps
- **t-SNE** is a non-linear dimensionality reduction technique
- *PCA* (Principal Component Analysis) is a linear dimensionality reduction technique
- **RNNs**
- make good sequential classifiers
- Text is a great example - it's all sequential
- Timeseries another
- One-hot encode the vocabulary
- Mapping each word to a learned dense vector instead is called an *embedding*
- RNNs have a hard time remembering: the problem of **Long-Term Dependencies**
- **LSTMs** (Long Short-Term Memory)
- *Element-Wise*: applying an activation function (for example the sigmoid) separately to each element in a vector
- **Forget Gates** allow models to throw away information they don't think they need anymore
- LSTMs commonly use *Tanh* activation functions
- Future Learning
    - Do projects, Play
    - Find a friend
    - Take advantage of resources
        - Keras has a great set of examples
        - Keras documentation has lots of links to good papers
        - There's a Udacity course for TensorFlow that's pretty good
        - Stanford course on Deep Learning for Computer Vision
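
Going back to the RNN and LSTM notes in this session, the sketch below shows one-hot encoding, an embedding lookup, and an element-wise forget gate. Everything in it is illustrative: the three-word vocabulary, the embedding values, and the gate inputs are invented, and a real LSTM computes its gate from learned weights.

```python
import math


def one_hot(index, size):
    """Vector of zeros with a single 1.0 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def sigmoid(z):
    """Squash a number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


vocab = {"the": 0, "cat": 1, "sat": 2}  # hypothetical vocabulary
cat = one_hot(vocab["cat"], len(vocab))

# Hypothetical embedding table: one dense 2-d vector per word.
embedding = [[0.1, 0.3],
             [0.7, -0.2],
             [0.0, 0.5]]

# Multiplying a one-hot vector by the table just selects one row.
cat_vector = [sum(cat[i] * embedding[i][d] for i in range(len(vocab)))
              for d in range(2)]

# Forget gate: an element-wise sigmoid gives a keep-fraction per cell entry.
gate_inputs = [0.0, 10.0, -10.0]  # made-up pre-activation values
cell = [2.0, 2.0, 2.0]
gate = [sigmoid(z) for z in gate_inputs]
new_cell = [g * c for g, c in zip(gate, cell)]  # halve, mostly keep, mostly forget
```

A gate value near 1 keeps that entry of the cell state and a value near 0 throws it away, which is how the model drops information it no longer needs.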

---

## Links
- [TensorFlow Playground](http://playground.tensorflow.org/#activation=tanh&batchSize=10&dataset=circle&regDataset=reg-plane&learningRate=0.03&regularizationRate=0&noise=0&networkShape=4,2&seed=0.73305&showTestData=false&discretize=false&percTrainData=50&x=true&y=true&xTimesY=false&xSquared=false&ySquared=false&cosX=false&sinX=false&cosY=false&sinY=false&collectStats=false&problem=classification&initZero=false)
- [Activation Functions](https://en.wikipedia.org/wiki/Activation_function#Comparison_of_activation_functions)
- [ICML Papers](http://icml.cc/2016/?page_id=1649)
- [TensorBoard](https://www.tensorflow.org/versions/r0.9/how_tos/summaries_and_tensorboard/index.html)

## Personal Notes
- There's no real complexity here, it's all just jargon
- Just because you can use equations to represent something, doesn't mean that equations are the best way to represent it. Such is Machine Learning...
- Though, writing down an equation with Greek letters is a good way to intimidate people
- I now think I understand everything from the Party last Saturday
- I can use these Jupyter notebooks with Morgan
- Anyone can do this
- Science is hard, Machine Learning is easy
- It kind of seems like ML libraries were created by mathematicians, not engineers. They're recapitulating a ton of stuff, which in practice looks a lot like needlessly renaming things