
Commit: upload notes
zhudotexe committed Apr 28, 2020
1 parent d7548f0 commit 90f90ca
Showing 21 changed files with 366 additions and 1 deletion.
Binary file added _static/linearmodels/ex1.png
Binary file added _static/linearmodels/ex2.png
Binary file added _static/naivebayes/ex1.png
Binary file added _static/perceptron/ex1.png
Binary file added _static/perceptron/ex2.png
Binary file added _static/perceptron/ex3.png
Binary file added _static/perceptron/proof1.png
Binary file added _static/perceptron/proof2.png
Binary file added _static/perceptron/proof3.png
Binary file added _static/perceptron/proof4.png
Binary file added _static/perceptron/proof5.png
Binary file added _static/perceptron/proof6.png
Binary file added _static/perceptron/proof7.png
Binary file added _static/perceptron/proof8.png
5 changes: 5 additions & 0 deletions index.rst
@@ -13,6 +13,11 @@ Welcome to cse142-notes's documentation!
intro
regression
prob
inst
naivebayes
perceptron
linearmodels
tree



97 changes: 97 additions & 0 deletions inst.rst
@@ -0,0 +1,97 @@
Instance-Based Learning
=======================
aka nearest neighbor methods, non-parametric, lazy, memory-based, or case-based learning

In instance-based learning, there is no parametric model to fit; examples include k-NN, some density estimation
methods, and locally weighted linear regression.

Nearest Neighbor
----------------

- instances :math:`\mathbf{x}` are vectors of real numbers
- store the *m* training examples :math:`(\mathbf{x} ^{(1)}, y ^{(1)}), \dots, (\mathbf{x} ^{(m)}, y ^{(m)})`
- to predict on new :math:`\mathbf{x}`, find the stored :math:`\mathbf{x} ^{(i)}` closest to :math:`\mathbf{x}` and predict :math:`y ^{(i)}`
- definition of *closest*: the stored example with the minimum squared distance
- different metrics for distance can be used
- Voronoi diagram: can detail the decision boundaries in 2D space
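
A minimal NumPy sketch of this prediction rule (the function name and array layout are illustrative, not from the
notes):

.. code:: py

    import numpy as np

    def nn_predict(X_train, y_train, x):
        """Predict the label of x by copying the label of the closest stored example."""
        dists = np.sum((X_train - x) ** 2, axis=1)  # squared distance to every stored example
        return y_train[np.argmin(dists)]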

Note: it's important to use the right distance metric! If different dimensions have very different scales (e.g. a
dimension ranging over 0-1 vs. one ranging over 1k-1M), the smaller-scale feature becomes effectively irrelevant to
the distance. Adding irrelevant features is similarly problematic, as are highly correlated attributes.

.. note::
Example.

Let :math:`x_1 \in [0, 1]` determine class: :math:`y = 1 \iff x_1 > 0.3`

Consider predicting the datapoint :math:`(0, 0)` given the data:

- :math:`(0.1, x_2)` labeled 0
- :math:`(0.5, x'_2)` labeled 1
- where :math:`x_2, x'_2` are random draws from :math:`[0, 1]`

What is the probability of mistake?

If :math:`0.1^2 + x_2^2 > 0.5^2 + x_2'^2` (i.e. the point labeled 1 is the closer of the two), then
:math:`(0, 0)` will be misclassified.

Therefore the probability of a mistake is :math:`P(0.1^2 + x_2^2 > 0.5^2 + x_2'^2)`. Since :math:`x_2, x_2'`
are drawn uniformly from :math:`[0, 1]`:

.. math::
P(\text{mistake}) & = \int_{x=0}^1 P(0.1^2 + x_2^2 > 0.5^2 + x_2'^2 \mid x_2 = x) f_{x_2}(x) dx \\
& = \int_{x=0}^1 P(x_2'^2 < x^2 - 0.24) dx \\
& = \int_{x=0}^1 P(x_2' < \sqrt{x^2 - 0.24}) dx \\
& = \int_{\sqrt{0.24}}^1 \sqrt{x^2 - 0.24} \, dx \\
& \approx 0.275
There are some tricks, though:

- normalize attributes (e.g. mean 0, var 1 gaussian distribution)
- use a "mutual information" component :math:`w_j` on the *j* th component
- :math:`dist(x, x') = \sum_j w_j (x_j - x'_j)^2`
- :math:`w_j = I(x_j, y)`
- Mahalanobis distance - a covariance matrix
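
A small sketch of the first two tricks (standardization plus a weighted distance; the weights :math:`w_j` are taken
as given):

.. code:: py

    import numpy as np

    def standardize(X):
        """Normalize each attribute to mean 0, variance 1."""
        return (X - X.mean(axis=0)) / X.std(axis=0)

    def weighted_dist(x, x2, w):
        """Weighted squared distance: sum_j w_j * (x_j - x'_j)^2."""
        return np.sum(w * (x - x2) ** 2)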

**Curse of Dimensionality**

As the number of attributes goes up, so does the "volume" of the space - you need exponentially more points to cover
the training space.


K-d Trees
^^^^^^^^^

We can greatly speed up the nearest-neighbor search by organizing the training examples into a tree:

- like a BST, but organized around dimensions of the input space
- each internal node tests a single dimension against a threshold (typically the median)
- can use highest variance dimension or cycle through dimensions
- growing a good tree can be expensive
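
In practice one would usually reach for an existing implementation rather than growing a tree by hand; a usage
sketch with SciPy (assuming SciPy is available):

.. code:: py

    import numpy as np
    from scipy.spatial import cKDTree

    rng = np.random.default_rng(0)
    X_train = rng.random((1000, 3))          # 1000 stored examples in 3 dimensions
    tree = cKDTree(X_train)                  # building the tree is the expensive part
    dist, idx = tree.query([0.5, 0.5, 0.5])  # nearest stored example to the query point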

Noise
^^^^^
Noise causes a problem in NN - if the nearest neighbor is noisy, there will be a misprediction.

So how do we make it robust against noise?

K-Nearest Neighbors
-------------------
In k-NN, we find the *k* closest stored points and predict the majority vote of their labels.

By the law of large numbers, as the number of training points and *k* both go to infinity (with *k* growing more
slowly than the number of points), the k-NN error rate approaches the best achievable (Bayes) error rate.
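
A minimal sketch of the majority-vote rule (illustrative names):

.. code:: py

    import numpy as np
    from collections import Counter

    def knn_predict(X_train, y_train, x, k=3):
        """Predict by majority vote among the k closest stored examples."""
        dists = np.sum((X_train - x) ** 2, axis=1)
        nearest = np.argsort(dists)[:k]
        return Counter(y_train[nearest]).most_common(1)[0][0]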

Nonparametric Regression
------------------------

- sometimes called "smoothing models"
- emphasize nearby points, e.g.
- predict nearest neighbor
- predict with distance-weighted average of labels
- predict with locally weighted linear regression
- divide into *h* bins, linreg on each bin
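
A sketch of the distance-weighted-average predictor from the list above (inverse-distance weighting is one common
choice, not necessarily the one used in lecture):

.. code:: py

    import numpy as np

    def weighted_average_predict(X_train, y_train, x, eps=1e-8):
        """Predict a real value as a distance-weighted average of the stored labels."""
        dists = np.sqrt(np.sum((X_train - x) ** 2, axis=1))
        w = 1.0 / (dists + eps)   # closer points get larger weight
        return np.sum(w * y_train) / np.sum(w)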

.. note::
For both k-NN and bins, the choice of *k* and *h* is important - when they are small, there is little bias
but high variance (undersmoothing); when they are large, there is large bias but little variance (oversmoothing).
61 changes: 61 additions & 0 deletions linearmodels.rst
@@ -0,0 +1,61 @@
Linear Models
=============

If your data is linearly separable, perceptron will find you a separating hyperplane.

But what if my data isn't linearly separable?

- perceptron will find a hyperplane that makes some errors
- what about a hyperplane that makes a *minimal* number of errors?

Minimum Error Hyperplane
------------------------

The error of a linear model :math:`(\mathbf{w}, b)` for an instance :math:`(\mathbf{x_n}, y_n)` is:

.. math::
\mathbf{1} [y_n (\mathbf{w} \cdot \mathbf{x_n} + b) \leq 0]
where :math:`\mathbf{1}[\cdot]` is an indicator function that returns 1 on an incorrect prediction and 0 on a correct one
(this is the 0-1 loss)

Based on this, we can write an objective whose minimizer is the minimum-error hyperplane:

.. math::
\min_{\mathbf{w}, b} \sum_n \mathbf{1} [y_n (\mathbf{w} \cdot \mathbf{x_n} + b) \leq 0]
This is ERM: **empirical risk minimization**.

But there are problems:

- the loss function is not convex
- it is not differentiable at the decision boundary, and elsewhere its gradient is zero (so gradients give no useful direction)

Alternatives to 0-1 Loss
^^^^^^^^^^^^^^^^^^^^^^^^
We want an upper bound on the 0-1 loss that is convex, so that minimization is easy; since it is an upper bound,
pushing it down also pushes down the real objective.

Given :math:`y, a` (label, activation):

- 0/1: :math:`l^{(0/1)}(y, a) = 1[ya \leq 0]`
- hinge: :math:`l^{(hin)}(y, a) = \max\{0, 1-ya\}`
- logistic: :math:`l^{(log)}(y, a) = \frac{1}{\log 2} \log(1 + \exp[-ya])`
- exponential: :math:`l^{(exp)}(y, a) = \exp[-ya]`

.. image:: _static/linearmodels/ex1.png
:width: 450

These are all convex functions and can be minimized using (stochastic) gradient descent - except that hinge loss is not differentiable at :math:`ya = 1`.
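
Written out in code (assuming :math:`y \in \{-1, +1\}` and activation *a*; a small illustrative sketch):

.. code:: py

    import numpy as np

    def zero_one(y, a):    return float(y * a <= 0)
    def hinge(y, a):       return max(0.0, 1.0 - y * a)
    def logistic(y, a):    return np.log(1.0 + np.exp(-y * a)) / np.log(2)
    def exponential(y, a): return np.exp(-y * a)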

Sub-gradient Descent
^^^^^^^^^^^^^^^^^^^^
How do we minimize a non-differentiable function?

- apply GD anyway, where it exists
- at non-diff points, use a sub-gradient
- the sub-gradient of :math:`f(z)` at a point :math:`z'` is the set of slopes of all lines that touch :math:`f(z)` at :math:`z'` and lie on or below :math:`f(z)` everywhere
- at differentiable points, the only sub-gradient is the gradient

.. image:: _static/linearmodels/ex2.png
:width: 450
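
A minimal sketch of sub-gradient descent on the (unregularized, summed) hinge-loss objective - the step size and
epoch count are arbitrary illustrative choices:

.. code:: py

    import numpy as np

    def hinge_subgrad_descent(X, y, eta=0.1, epochs=100):
        """Minimize sum_n max(0, 1 - y_n (w·x_n + b)) by sub-gradient descent."""
        w, b = np.zeros(X.shape[1]), 0.0
        for _ in range(epochs):
            for x_n, y_n in zip(X, y):
                if y_n * (w @ x_n + b) < 1:   # loss is active: sub-gradient is -y_n * x_n
                    w += eta * y_n * x_n
                    b += eta * y_n
                # otherwise the sub-gradient is 0, so no update
        return w, b
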
48 changes: 48 additions & 0 deletions naivebayes.rst
@@ -0,0 +1,48 @@
Naive Bayes
===========

TL;DR: predict the most probable label given the features

.. math::
& \arg \max_y P(y | \mathbf{x}) \\
& = \arg \max_y P(\mathbf{x} | y) \frac{P(y)}{P(\mathbf{x})} \\
& = \arg \max_y P(\mathbf{x} | y) P(y)
Naive independence assumption: the attributes are conditionally independent given *y*, i.e.

.. math::
P(\mathbf{x} | y) = \prod_j P(x_j | y)
So, we predict the label *y* that maximizes

.. math::
P(y) \prod_j P(x_j | y)
This uses a *generative* model: pick *y* then generate **x** based on *y*

To implement Naive Bayes, we need to **estimate**:

- :math:`P(y)` distribution
- for each class *y*, for each feature :math:`x_j`, need :math:`P(x_j | y)` distributions

all of these estimated distributions are one-dimensional - together they form the model

.. image:: _static/naivebayes/ex1.png
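
A sketch of how the estimated pieces combine at prediction time, assuming the prior :math:`P(y)` and the per-feature
conditionals :math:`P(x_j | y)` have already been estimated (the dictionary layout here is an illustrative
assumption):

.. code:: py

    import numpy as np

    def nb_predict(x, prior, cond):
        """prior[y] = P(y); cond[y][j][v] = P(x_j = v | y)."""
        best_y, best_score = None, -np.inf
        for y, p_y in prior.items():
            # sum logs instead of multiplying probabilities, to avoid underflow
            score = np.log(p_y) + sum(np.log(cond[y][j][v]) for j, v in enumerate(x))
            if score > best_score:
                best_y, best_score = y, score
        return best_y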

Issues
^^^^^^

- conditional independence is optimistic
- what if an attribute-value pair is not in the training set?
- Laplace smoothing / dummy data
- continuous features: use gaussian or other density?
- attributes for text classification?
- bag of words model

NB for Text
^^^^^^^^^^^

- let :math:`V` be the vocabulary (all words/symbols in training docs)
- for each class :math:`y`, let :math:`Docs_y` be the concatenation of all docs labelled *y*
- for each word :math:`w` in :math:`V`, let :math:`\#w(Docs_y)` be the number of times :math:`w` occurs in :math:`Docs_y`
- set :math:`P(w | y) = \frac{\#w(Docs_y) + 1}{|V| + \sum_w \#w(Docs_y)}` (Laplace smoothing)
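
A small sketch of training this text model with Laplace smoothing (the tokenized-document input format is my own
simplification):

.. code:: py

    from collections import Counter

    def train_nb_text(docs, labels):
        """docs: list of token lists; labels: list with the class of each doc."""
        vocab = {w for doc in docs for w in doc}
        prior, cond = {}, {}
        for y in set(labels):
            prior[y] = labels.count(y) / len(labels)
            # concatenate all docs labelled y and count word occurrences
            counts = Counter(w for doc, lab in zip(docs, labels) if lab == y for w in doc)
            total = sum(counts.values())
            # P(w|y) = (#w(Docs_y) + 1) / (|V| + sum_w #w(Docs_y))
            cond[y] = {w: (counts[w] + 1) / (len(vocab) + total) for w in vocab}
        return prior, cond
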
141 changes: 141 additions & 0 deletions perceptron.rst
@@ -0,0 +1,141 @@
Perceptron
==========

Perceptron is a linear, online classification model

Given a training set of (instance, label) pairs, it learns a linear decision boundary (a hyperplane) - we assume binary labels for now

It's inspired by neurons: activation is a function of its inputs and weights. For example, the weighted sum activation:

.. math::
activation = \sum_{i=1}^D w_ix_i
Then, prediction can be something like ``a > 0 ? 1 : -1``.

Additionally, we can add a bias term to account for a non-zero intercept:

.. math::
a = [\sum_{i=1}^D w_ix_i] + b
Linear Boundary
---------------

- a ``D-1`` dimensional hyperplane separates a ``D`` dimensional space into two half-spaces: positive and negative
- this linear boundary has the form :math:`\mathbf{w} \cdot \mathbf{x} = 0`
- defined by **w**: the vector (often normalized to unit length) normal to the hyperplane
- :math:`\text{proj}_w x` is how far away *x* is from the decision boundary
- when **w** is normalized to a unit vector, :math:`\mathbf{w} \cdot \mathbf{x} = \text{proj}_w x`.

.. image:: _static/perceptron/ex1.png
:width: 450

**With Bias**

- When a bias is added, the linear boundary becomes :math:`\mathbf{w} \cdot \mathbf{x} + b = 0`
- this can be converted to the more general form :math:`\mathbf{w} \cdot \mathbf{x} = 0` by appending *b* to **w** as an extra weight and an always-1 feature to **x**
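
A small sketch of that conversion (illustrative names):

.. code:: py

    import numpy as np

    def absorb_bias(X, w, b):
        """Fold the bias into the weights: w·x + b == w_aug·x_aug for every row of X."""
        X_aug = np.hstack([X, np.ones((X.shape[0], 1))])  # append an always-1 feature
        w_aug = np.append(w, b)                           # append b as one more weight
        return X_aug, w_aug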

Prediction
----------
Pretty simple:

.. code:: py

    import numpy as np

    def prediction(w, x, b):
        return np.sign(w @ x + b)  # +1 or -1 (0 exactly on the boundary)
Training
--------

This is an error-driven model:

1. initialize model to some weights and biases
2. for each instance in training set:
1. use current **w** and *b* to predict a label :math:`\hat{y}`
2. if :math:`\hat{y} = y` do nothing
3. otherwise update **w** and *b* to do better
3. goto 2

.. image:: _static/perceptron/ex2.png
:width: 500

The update, in slightly simpler notation:

.. math::
\mathbf{w} & = \mathbf{w} + y \mathbf{x} \\
b & = b + y
So what does it do? Let's look at the new activation after an update where a positive was incorrectly predicted as a negative label:

.. image:: _static/perceptron/ex3.png

So for the given example, the activation increases by :math:`\sum_{i=1}^D x_i^2 + 1`, which is positive, bringing
the prediction closer to correct for that one sample.

We can also control the learning rate easily using a term :math:`\eta`:

.. math::
\mathbf{w} = \mathbf{w} + y \eta \mathbf{x}
Caveats
^^^^^^^

- the order of the training instances is important!
- e.g. all positives followed by all negatives is bad
- recommended to permute the training data after each iteration

Example
-------

.. code:: text

    x1  x2  y   w·x   w (after update, if any)
    ------------------------------------------
                      <0, 0>
     1   3  +     0   <1, 3>
     2   3  -    11   <-1, 0>
    -3   1  +     3   <-1, 0>
     1  -1  -    -1   <-1, 0>
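
A sketch of this procedure end to end (labels assumed to be in :math:`\{-1, +1\}`; the per-epoch shuffle follows the
caveat above, and the function name and defaults are illustrative). Run on the four examples above with
``shuffle=False``, it produces the same sequence of weight vectors as the trace (which omits the bias term):

.. code:: py

    import numpy as np

    def train_perceptron(X, y, epochs=10, eta=1.0, shuffle=True, seed=0):
        """Error-driven perceptron training; labels y are in {-1, +1}."""
        rng = np.random.default_rng(seed)
        w, b = np.zeros(X.shape[1]), 0.0
        for _ in range(epochs):
            order = rng.permutation(len(y)) if shuffle else np.arange(len(y))
            mistakes = 0
            for i in order:
                if y[i] * (X[i] @ w + b) <= 0:   # wrong (or on the boundary): update
                    w += eta * y[i] * X[i]
                    b += eta * y[i]
                    mistakes += 1
            if mistakes == 0:                     # a full pass with no updates: converged
                break
        return w, b
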
Convergence
-----------
We can define convergence as making a full pass through the training data without any updates.

If the training data is linearly separable, perceptron will converge - if not, it will never converge.

How long perceptron takes to converge depends on how *easy* the dataset is - roughly, how well separated the two
classes are (i.e. the larger the *margin*, the easier the dataset, where the *margin* is the distance from the
hyperplane to the closest datapoint)

Proof
^^^^^

**Overview**

.. image:: _static/perceptron/proof1.png

**Steps**

.. image:: _static/perceptron/proof2.png

**Simplification 1**

.. image:: _static/perceptron/proof3.png

**Simplification 2**

.. image:: _static/perceptron/proof4.png

**Simplification 3**

.. image:: _static/perceptron/proof5.png

**Analysis Setup**

.. image:: _static/perceptron/proof6.png

.. image:: _static/perceptron/proof7.png

**Finishing Up**

.. image:: _static/perceptron/proof8.png

2 changes: 1 addition & 1 deletion regression.rst
@@ -378,5 +378,5 @@ We can extend logistic regression to multiple classes:
.. note::
Class :math:`\theta_K` is actually redundant, since :math:`p(class = K | \mathbf{x}) = 1 - \sum_{k=1}^{K-1} p(class = k | \mathbf{x})`.



13 changes: 13 additions & 0 deletions tree.rst
@@ -0,0 +1,13 @@
Decision Trees
==============

Let's take the example of whether or not to play tennis given 4 features - a binary classification question
based on discrete features.

To construct a tree, pick a feature and split on it, then recursively build the subtrees top-down.

Entropy
-------

Entropy of a set of examples S relative to a binary classification task is:
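
For a set with fraction :math:`p_\oplus` of positive examples and :math:`p_\ominus = 1 - p_\oplus` of negative
examples, the standard definition (stated here for completeness) is:

.. math::
H(S) = -p_\oplus \log_2 p_\oplus - p_\ominus \log_2 p_\ominus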
