Add notes from nnml
jaidevd committed Jul 7, 2016
1 parent f38e767 commit 13e4700
Showing 3 changed files with 149 additions and 0 deletions.
39 changes: 39 additions & 0 deletions notes/nnml/nnml_lecture2.md
@@ -0,0 +1,39 @@
Perceptron Learning
===================

* If the output is correct, leave the weights alone.
* If the output is incorrectly negative, add the input vector to the weight vector.
* If the output is incorrectly positive, subtract the input vector from the weight vector (a sketch of all three cases follows).
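
A minimal NumPy sketch of the update rule, assuming a threshold at zero and the bias folded into the weight vector (names are illustrative):

```python
import numpy as np

def perceptron_step(w, x, target):
    """One perceptron update for a single training case.

    w      : weight vector (bias folded in as an extra component)
    x      : input vector (with a trailing 1 for the bias)
    target : desired binary output, 0 or 1
    """
    y = 1 if np.dot(w, x) >= 0 else 0  # current prediction
    if y == target:
        return w                        # correct: leave the weights alone
    if target == 1:                     # output was incorrectly negative
        return w + x                    # add the input vector
    return w - x                        # output was incorrectly positive: subtract it
```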

Weight Space
-------------

* For a training case whose correct answer is positive, the weight vector must lie on the same side of the plane defined by that case (the plane through the origin, perpendicular to the input vector) as the direction in which the input vector points.

Why the learning works
-----------------------

Let $\mathbf{w}$ be the current weight vector and $\mathbf{w}^{*}$ any _feasible_ weight vector, i.e. one that gets every training case right.

The hopeful claim is that every update shrinks the squared distance $\|\mathbf{w}^{*} - \mathbf{w}\|^{2}$, i.e. brings the current weight vector closer to every feasible weight vector.

There are infinitely many feasible vectors, and the hopeful claim is not quite true: a feasible vector lying just on the correct side of one of the constraint planes can end up further away after an update.

So we define "generously feasible" weight vectors: vectors that get every training case right by a margin at least as large as the length of the input vector.

The squared distance to all generously feasible weights is always decreased by at least the squared length of the input sample, i.e. the update vector.

Informal proof
--------------
* Each time the perceptron makes a mistake, the current weight vector moves closer to the generously feasible region.
* Specifically, its squared distance to every generously feasible weight vector decreases by at least the squared length of the input vector.
* So if the input vectors are not infinitesimally small, the weight vector will eventually reach this region, provided it exists. A sketch of the key step follows.
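
Reading "generously feasible" for a positive case as $\mathbf{w}^{*T}\mathbf{x} \geq \|\mathbf{x}\|^{2}$, suppose the current vector $\mathbf{w}$ gets that case wrong, so $\mathbf{w}^{T}\mathbf{x} < 0$ and the update is $\mathbf{w} + \mathbf{x}$. Then

$$
\|\mathbf{w}^{*} - (\mathbf{w} + \mathbf{x})\|^{2} = \|\mathbf{w}^{*} - \mathbf{w}\|^{2} - 2(\mathbf{w}^{*} - \mathbf{w})^{T}\mathbf{x} + \|\mathbf{x}\|^{2} \leq \|\mathbf{w}^{*} - \mathbf{w}\|^{2} - \|\mathbf{x}\|^{2}
$$

because $(\mathbf{w}^{*} - \mathbf{w})^{T}\mathbf{x} = \mathbf{w}^{*T}\mathbf{x} - \mathbf{w}^{T}\mathbf{x} \geq \|\mathbf{x}\|^{2}$. The incorrectly positive case is symmetric.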

Limitations of Perceptrons
-----------------------

* If unlimited handcrafted feature engineering is allowed, a perceptron can do almost anything. In the long run, though, you need to _learn_ features.
* Handcrafting features almost never generalizes.
* ANN research came to a halt once perceptrons were shown to be fundamentally limited (the _Group Invariance Theorem_).
83 changes: 83 additions & 0 deletions notes/nnml/nnml_lecture3.md
@@ -0,0 +1,83 @@
Learning Weights of a Linear Neuron
=====================

* In a perceptron, the weights always get closer to a good set of weights, i.e. to a generously feasible set of weights.
* In a linear neuron, the outputs are getting closer to the target outputs

* MLPs should never have been called "multilayer perceptrons": their learning is quite different.

For MLPs, the "proof" of learning (there may be no convergence to speak of) consists of showing that the output values get closer to the target values, rather than the weights getting closer to a good set of weights.

In perceptrons, the outputs may get further away from the target outputs even while the weights are getting closer to a good set of weights.

* The simplest case is a linear neuron with a squared error measure. Its output is

$y = \sum_{i}w_{i}x_{i} = \mathbf{w}^{T}\mathbf{x}$

Why don't we just solve it analytically?

* We'd like to understand what neurons are doing, and they might not be solving a symbolic equation.
* We want a method that generalizes to multiple layers

Delta Rule: $ \Delta w_{i} = \epsilon x_{i}(t - y)$

Derivation of Delta Rule
------------------------

The error is

$$ E = \frac{1}{2}\sum_{n\in \text{training}} (t_{n} - y_{n})^{2} $$

Differentiating $E$ with respect to the weights:

$$ \frac{\partial E}{\partial w_{i}} = -\sum_{n}x_{i}^{n}(t_{n} - y_{n}) $$

Multiplying both sides by $-\epsilon$ gives the batch delta rule:

$$
\Delta w_{i} = -\epsilon\frac{\partial E}{\partial w_{i}} = \sum_{n}\epsilon x_{i}^{n}(t_{n} - y_{n})
$$
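
A minimal NumPy sketch of this batch update, with one training case per row of `X` (the learning rate and epoch count are illustrative):

```python
import numpy as np

def train_linear_neuron(X, t, w, epsilon=0.01, n_epochs=100):
    """Batch delta rule for a linear neuron y = Xw."""
    for _ in range(n_epochs):
        y = X @ w                        # outputs for all training cases
        w = w + epsilon * X.T @ (t - y)  # Delta w = sum_n epsilon * x^n * (t_n - y_n)
    return w
```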

Learning can be very slow if two input dimensions are highly correlated. Why is this, and how can it be verified?

E.g. if we always get the same quantities of ketchup, chips and fish, we can't tell how much each contributes to the total bill; the training cases carry no information that separates the individual weights. When the quantities are almost always equal, learning is very slow.

* The online delta rule is similar to the perceptron learning algorithm; a per-case sketch follows.
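
A sketch of a single online update, which has the same shape as the perceptron step but uses $\epsilon(t - y)$ in place of a fixed $\pm 1$ factor:

```python
import numpy as np

def online_delta_step(w, x, t, epsilon=0.01):
    """One online delta-rule update for a single training case."""
    y = np.dot(w, x)                  # linear neuron output
    return w + epsilon * (t - y) * x  # Delta w_i = epsilon * x_i * (t - y)
```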


Error Surface of a Linear Neuron
===============================

* For perceptrons we did not need an error surface; reasoning directly in weight space was enough.
* For linear neurons, the error surface is always a quadratic bowl (see the expression below).
* With the perceptron and the online delta rule, the weights move perpendicular to the constraint planes defined by the training cases; with batch learning on linear neurons, they move perpendicular to the contour lines of the error surface.
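
Substituting $y_{n} = \mathbf{w}^{T}\mathbf{x}^{n}$ into the batch error above makes the bowl explicit: $E$ is quadratic in the weights, so its contours in weight space are ellipses.

$$ E = \frac{1}{2}\sum_{n}(t_{n} - \mathbf{w}^{T}\mathbf{x}^{n})^{2} $$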

Learning the Weights of a Logistic Output Neuron
====================================

The output is the logistic function applied to the _logit_ $z$:

$$ y = \frac{1}{1 + e^{-z}}$$

where $z = b +\sum_{i}x_{i}w_{i}$

Therefore

$$ \frac{\partial z}{\partial w_{i}} = x_{i}$$

and

$$ \frac{\partial z}{\partial x_{i}} = w_{i}$$

Now,

$$ \frac{dy}{dz} = y(1-y)$$
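
This is just the derivative of the logistic function:

$$ \frac{dy}{dz} = \frac{e^{-z}}{(1 + e^{-z})^{2}} = \frac{1}{1 + e^{-z}}\cdot\left(1 - \frac{1}{1 + e^{-z}}\right) = y(1-y) $$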

Therefore

$$ \frac{\partial y}{\partial w_{i}} = \frac{\partial z}{\partial w_{i}} \times \frac{dy}{dz} = x_{i}y(1-y)$$
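
Combining this with a squared error $E = \frac{1}{2}(t - y)^{2}$ (the same error measure as for the linear neuron) gives the full gradient; a minimal NumPy sketch:

```python
import numpy as np

def logistic_neuron_gradient(w, b, x, t):
    """Gradient of E = 0.5 * (t - y)**2 for one logistic neuron and one case."""
    z = b + np.dot(w, x)             # the logit
    y = 1.0 / (1.0 + np.exp(-z))     # logistic output
    delta = (y - t) * y * (1.0 - y)  # dE/dz = dE/dy * dy/dz
    dE_dw = delta * x                # dE/dw_i = x_i * y(1-y) * (y - t)
    dE_db = delta
    return dE_dw, dE_db
```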


THE FUCKING BACKPROP ALGORITHM
==============================

* Instead of trying to specify target activities for the hidden units, compute error derivatives with respect to the hidden activities and propagate them backwards through the network; a sketch is below.
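
A minimal NumPy sketch of this idea for a single hidden layer of logistic units feeding a linear output with squared error (the architecture and names are illustrative assumptions):

```python
import numpy as np

def backprop_one_hidden(x, t, W1, b1, W2, b2):
    """Backprop for one case: logistic hidden layer, linear output,
    squared error E = 0.5 * sum((t - y)**2)."""
    # Forward pass
    z1 = W1 @ x + b1
    h = 1.0 / (1.0 + np.exp(-z1))  # hidden activities
    y = W2 @ h + b2                # linear outputs

    # Backward pass: propagate error derivatives, not desired activities
    dE_dy = y - t
    dE_dW2 = np.outer(dE_dy, h)
    dE_db2 = dE_dy
    dE_dh = W2.T @ dE_dy           # derivative w.r.t. hidden activities
    dE_dz1 = dE_dh * h * (1.0 - h) # back through the logistic nonlinearity
    dE_dW1 = np.outer(dE_dz1, x)
    dE_db1 = dE_dz1
    return dE_dW1, dE_db1, dE_dW2, dE_db2
```
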
27 changes: 27 additions & 0 deletions notes/nnml/nnml_lecture4.md
@@ -0,0 +1,27 @@
A relational learning task
-----------------------

* Given a large set of triples that come from some family trees, figure out the regularities, e.g.
* (x has-mom y) & (y has-husband z) => (x has-father z)


The Softmax Output Function
---------------------------

* Drawbacks of the squared error measure:
  - If the desired output is 1 and the actual output is 1e-9, there is almost no gradient for the logistic unit to fix the error.
  - If we are assigning probabilities to mutually exclusive classes, the outputs should sum to 1, and squared error does not enforce this.
* The softmax forces the outputs to represent a probability distribution across discrete alternatives:

$$ y_{i} = \frac{e^{z_{i}}}{\sum_{j\in \text{group}}e^{z_{j}}} $$

and

$$ \frac{\partial y_{i}}{\partial z_{i}} = y_{i}(1 - y_{i}) $$

* Cost function: negative log probability of the right answers

$$ C = -\sum_{j}t_{j} \log y_{j} $$

* The steepness of $dC/dy$ exactly balances the flatness of $dy/dz$, so the gradient with respect to the logits takes a very simple form:

$$ \frac{\partial C}{\partial z_{i}} = y_{i} - t_{i} $$
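
A minimal NumPy sketch of the softmax output, the cross-entropy cost, and this gradient (the max-shift is just a standard numerical-stability trick):

```python
import numpy as np

def softmax_cross_entropy(z, t):
    """Softmax outputs, cross-entropy cost, and gradient w.r.t. the logits."""
    z = z - z.max()                  # shift logits for numerical stability
    y = np.exp(z) / np.exp(z).sum()  # y_i = e^{z_i} / sum_j e^{z_j}
    C = -np.sum(t * np.log(y))       # C = -sum_j t_j * log(y_j)
    dC_dz = y - t                    # steepness of dC/dy cancels flatness of dy/dz
    return y, C, dC_dz
```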
