
Commit: upload notes
zhudotexe committed Apr 28, 2020
1 parent d7548f0 commit 90f90ca
Showing 21 changed files with 366 additions and 1 deletion.
Binary file added _static/linearmodels/ex1.png
Binary file added _static/linearmodels/ex2.png
Binary file added _static/naivebayes/ex1.png
Binary file added _static/perceptron/ex1.png
Binary file added _static/perceptron/ex2.png
Binary file added _static/perceptron/ex3.png
Binary file added _static/perceptron/proof1.png
Binary file added _static/perceptron/proof2.png
Binary file added _static/perceptron/proof3.png
Binary file added _static/perceptron/proof4.png
Binary file added _static/perceptron/proof5.png
Binary file added _static/perceptron/proof6.png
Binary file added _static/perceptron/proof7.png
Binary file added _static/perceptron/proof8.png
5 changes: 5 additions & 0 deletions index.rst
@@ -13,6 +13,11 @@ Welcome to cse142-notes's documentation!
intro
regression
prob
inst
naivebayes
perceptron
linearmodels
tree



97 changes: 97 additions & 0 deletions inst.rst
@@ -0,0 +1,97 @@
Instance-Based Learning
=======================
aka nearest neighbor methods, non-parametric, lazy, memory-based, or case-based learning

In instance-based learning, there is no parametric model to fit; examples include k-NN, some density estimation
methods, and locally weighted linear regression.

Nearest Neighbor
----------------

- instances :math:`\mathbf{x}` are vectors of real numbers
- store the *m* training examples :math:`(\mathbf{x} ^{(1)}, y ^{(1)}), \dots, (\mathbf{x} ^{(m)}, y ^{(m)})`
- to predict on new :math:`\mathbf{x}`, find the stored :math:`\mathbf{x} ^{(i)}` closest to :math:`\mathbf{x}` and predict :math:`y ^{(i)}`
- definition of *closest*: the stored example with the minimum squared distance
- different metrics for distance can be used
- Voronoi diagram: can detail the decision boundaries in 2D space
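
A minimal NumPy sketch of this prediction rule (the function name and array layout are illustrative, not from the
notes):

.. code:: py

    import numpy as np

    def nn_predict(X_train, y_train, x):
        """Predict the label of x by copying the label of the closest stored example."""
        dists = np.sum((X_train - x) ** 2, axis=1)  # squared distance to every stored example
        return y_train[np.argmin(dists)]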

Note: it's important to use the right distance metric! If different dimensions have very different scales (e.g. a
dimension ranging over 0-1 vs. one ranging over 1k-1M), the smaller-scale feature becomes effectively irrelevant to
the distance. Adding irrelevant features is similarly problematic, as are highly correlated attributes.

.. note::
Example.

Let :math:`x_1 \in [0, 1]` determine class: :math:`y = 1 \iff x_1 > 0.3`

Consider predicting the datapoint :math:`(0, 0)` given the data:

- :math:`(0.1, x_2)` labeled 0
- :math:`(0.5, x'_2)` labeled 1
- where :math:`x_2, x'_2` are random draws from :math:`[0, 1]`

What is the probability of mistake?

If :math:`0.1^2 + x_2^2 > 0.5^2 + x_2'^2` (i.e. the point labeled 1 is the closer of the two), then
:math:`(0, 0)` will be misclassified.

Therefore the probability of a mistake is :math:`P(0.1^2 + x_2^2 > 0.5^2 + x_2'^2)`. Since :math:`x_2, x_2'`
are drawn uniformly from :math:`[0, 1]`:

.. math::
P(\text{mistake}) & = \int_{x=0}^1 P(0.1^2 + x_2^2 > 0.5^2 + x_2'^2 \mid x_2 = x) f_{x_2}(x) dx \\
& = \int_{x=0}^1 P(x_2'^2 < x^2 - 0.24) dx \\
& = \int_{x=0}^1 P(x_2' < \sqrt{x^2 - 0.24}) dx \\
& = \int_{\sqrt{0.24}}^1 \sqrt{x^2 - 0.24} \, dx \\
& \approx 0.275
There are some tricks, though:

- normalize attributes (e.g. mean 0, var 1 gaussian distribution)
- use a "mutual information" component :math:`w_j` on the *j* th component
- :math:`dist(x, x') = \sum_j w_j (x_j - x'_j)^2`
- :math:`w_j = I(x_j, y)`
- Mahalanobis distance - a covariance matrix
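
A small sketch of the first two tricks (standardization plus a weighted distance; the weights :math:`w_j` are taken
as given):

.. code:: py

    import numpy as np

    def standardize(X):
        """Normalize each attribute to mean 0, variance 1."""
        return (X - X.mean(axis=0)) / X.std(axis=0)

    def weighted_dist(x, x2, w):
        """Weighted squared distance: sum_j w_j * (x_j - x'_j)^2."""
        return np.sum(w * (x - x2) ** 2)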

**Curse of Dimensionality**

As the number of attributes goes up, so does the "volume" of the space - you need exponentially more points to cover
the training space.


K-d Trees
^^^^^^^^^

We can greatly speed up the nearest-neighbor search by organizing the training examples into a tree:

- like a BST, but organized around dimensions of the input space
- each internal node tests a single dimension against a threshold (typically the median)
- can use highest variance dimension or cycle through dimensions
- growing a good tree can be expensive
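
In practice one would usually reach for an existing implementation rather than growing a tree by hand; a usage
sketch with SciPy (assuming SciPy is available):

.. code:: py

    import numpy as np
    from scipy.spatial import cKDTree

    rng = np.random.default_rng(0)
    X_train = rng.random((1000, 3))          # 1000 stored examples in 3 dimensions
    tree = cKDTree(X_train)                  # building the tree is the expensive part
    dist, idx = tree.query([0.5, 0.5, 0.5])  # nearest stored example to the query point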

Noise
^^^^^
Noise causes a problem in NN - if the nearest neighbor is noisy, there will be a misprediction.

So how do we make it robust against noise?

K-Nearest Neighbors
-------------------
In k-NN, we find the *k* closest stored points and predict the majority vote of their labels.

By the law of large numbers, as the number of training points and *k* both go to infinity (with *k* growing more
slowly than the number of points), the k-NN error rate approaches the best achievable (Bayes) error rate.
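
A minimal sketch of the majority-vote rule (illustrative names):

.. code:: py

    import numpy as np
    from collections import Counter

    def knn_predict(X_train, y_train, x, k=3):
        """Predict by majority vote among the k closest stored examples."""
        dists = np.sum((X_train - x) ** 2, axis=1)
        nearest = np.argsort(dists)[:k]
        return Counter(y_train[nearest]).most_common(1)[0][0]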

Nonparametric Regression
------------------------

- sometimes called "smoothing models"
- emphasize nearby points, e.g.
- predict nearest neighbor
- predict with distance-weighted average of labels
- predict with locally weighted linear regression
- divide into *h* bins, linreg on each bin
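
A sketch of the distance-weighted-average predictor from the list above (inverse-distance weighting is one common
choice, not necessarily the one used in lecture):

.. code:: py

    import numpy as np

    def weighted_average_predict(X_train, y_train, x, eps=1e-8):
        """Predict a real value as a distance-weighted average of the stored labels."""
        dists = np.sqrt(np.sum((X_train - x) ** 2, axis=1))
        w = 1.0 / (dists + eps)   # closer points get larger weight
        return np.sum(w * y_train) / np.sum(w)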

.. note::
For both k-NN and bins, the choice of *k* and *h* is important - when they are small, there is little bias
but high variance (undersmoothing); when they are large, there is large bias but little variance (oversmoothing).
61 changes: 61 additions & 0 deletions linearmodels.rst
@@ -0,0 +1,61 @@
Linear Models
=============

If your data is linearly separable, perceptron will find you a separating hyperplane.

But what if my data isn't linearly separable?

- perceptron will find a hyperplane that makes some errors
- what about a hyperplane that makes a *minimal* number of errors?

Minimum Error Hyperplane
------------------------

The error of a linear model :math:`(\mathbf{w}, b)` for an instance :math:`(\mathbf{x_n}, y_n)` is:

.. math::
\mathbf{1} [y_n (\mathbf{w} \cdot \mathbf{x_n} + b) \leq 0]
where :math:`\mathbf{1}[\cdot]` is an indicator function that returns 1 on an incorrect prediction and 0 on a correct one
(this is the 0-1 loss)

Based on this, we can write an objective whose minimizer is the minimum-error hyperplane:

.. math::
\min_{\mathbf{w}, b} \sum_n \mathbf{1} [y_n (\mathbf{w} \cdot \mathbf{x_n} + b) \leq 0]
This is ERM: **empirical risk minimization**.

But there are problems:

- the loss function is not convex
- it is not differentiable at the decision boundary, and elsewhere its gradient is zero (so gradients give no useful direction)

Alternatives to 0-1 Loss
^^^^^^^^^^^^^^^^^^^^^^^^
We want an upper bound on the 0-1 loss that is convex, so that minimization is easy; since it is an upper bound,
pushing it down also pushes down the real objective.

Given :math:`y, a` (label, activation):

- 0/1: :math:`l^{(0/1)}(y, a) = 1[ya \leq 0]`
- hinge: :math:`l^{(hin)}(y, a) = \max\{0, 1-ya\}`
- logistic: :math:`l^{(log)}(y, a) = \frac{1}{\log 2} \log(1 + \exp[-ya])`
- exponential: :math:`l^{(exp)}(y, a) = \exp[-ya]`

.. image:: _static/linearmodels/ex1.png
:width: 450

These are all convex functions and can be minimized using (stochastic) gradient descent - except that hinge loss is not differentiable at :math:`ya = 1`.
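
Written out in code (assuming :math:`y \in \{-1, +1\}` and activation *a*; a small illustrative sketch):

.. code:: py

    import numpy as np

    def zero_one(y, a):    return float(y * a <= 0)
    def hinge(y, a):       return max(0.0, 1.0 - y * a)
    def logistic(y, a):    return np.log(1.0 + np.exp(-y * a)) / np.log(2)
    def exponential(y, a): return np.exp(-y * a)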

Sub-gradient Descent
^^^^^^^^^^^^^^^^^^^^
How do we minimize a non-differentiable function?

- apply GD anyway, where it exists
- at non-diff points, use a sub-gradient
- the sub-gradient of :math:`f(z)` at a point :math:`z'` is the set of slopes of all lines that touch :math:`f(z)` at :math:`z'` and lie on or below :math:`f(z)` everywhere
- at differentiable points, the only sub-gradient is the gradient

.. image:: _static/linearmodels/ex2.png
:width: 450
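
A minimal sketch of sub-gradient descent on the (unregularized, summed) hinge-loss objective - the step size and
epoch count are arbitrary illustrative choices:

.. code:: py

    import numpy as np

    def hinge_subgrad_descent(X, y, eta=0.1, epochs=100):
        """Minimize sum_n max(0, 1 - y_n (w·x_n + b)) by sub-gradient descent."""
        w, b = np.zeros(X.shape[1]), 0.0
        for _ in range(epochs):
            for x_n, y_n in zip(X, y):
                if y_n * (w @ x_n + b) < 1:   # loss is active: sub-gradient is -y_n * x_n
                    w += eta * y_n * x_n
                    b += eta * y_n
                # otherwise the sub-gradient is 0, so no update
        return w, b
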
48 changes: 48 additions & 0 deletions naivebayes.rst
@@ -0,0 +1,48 @@
Naive Bayes
===========

TL;DR: predict the most probable label given the features

.. math::
& \arg \max_y P(y | \mathbf{x}) \\
& = \arg \max_y P(\mathbf{x} | y) \frac{P(y)}{P(\mathbf{x})} \\
& = \arg \max_y P(\mathbf{x} | y) P(y)
Naive independence assumption: the attributes are conditionally independent given *y*, i.e.

.. math::
P(\mathbf{x} | y) = \prod_j P(x_j | y)
So, we predict the label *y* that maximizes

.. math::
P(y) \prod_j P(x_j | y)
This uses a *generative* model: pick *y* then generate **x** based on *y*

To implement Naive Bayes, we need to **estimate**:

- :math:`P(y)` distribution
- for each class *y*, for each feature :math:`x_j`, need :math:`P(x_j | y)` distributions

all of these estimated distributions are one-dimensional - together they form the model

.. image:: _static/naivebayes/ex1.png
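
A sketch of how the estimated pieces combine at prediction time, assuming the prior :math:`P(y)` and the per-feature
conditionals :math:`P(x_j | y)` have already been estimated (the dictionary layout here is an illustrative
assumption):

.. code:: py

    import numpy as np

    def nb_predict(x, prior, cond):
        """prior[y] = P(y); cond[y][j][v] = P(x_j = v | y)."""
        best_y, best_score = None, -np.inf
        for y, p_y in prior.items():
            # sum logs instead of multiplying probabilities, to avoid underflow
            score = np.log(p_y) + sum(np.log(cond[y][j][v]) for j, v in enumerate(x))
            if score > best_score:
                best_y, best_score = y, score
        return best_y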

Issues
^^^^^^

- conditional independence is optimistic
- what if an attribute-value pair is not in the training set?
- Laplace smoothing / dummy data
- continuous features: use gaussian or other density?
- attributes for text classification?
- bag of words model

NB for Text
^^^^^^^^^^^

- let :math:`V` be the vocabulary (all words/symbols in training docs)
- for each class :math:`y`, let :math:`Docs_y` be the concatenation of all docs labelled *y*
- for each word :math:`w` in :math:`V`, let :math:`\#w(Docs_y)` be the number of times :math:`w` occurs in :math:`Docs_y`
- set :math:`P(w | y) = \frac{\#w(Docs_y) + 1}{|V| + \sum_w \#w(Docs_y)}` (Laplace smoothing)
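
A small sketch of training this text model with Laplace smoothing (the tokenized-document input format is my own
simplification):

.. code:: py

    from collections import Counter

    def train_nb_text(docs, labels):
        """docs: list of token lists; labels: list with the class of each doc."""
        vocab = {w for doc in docs for w in doc}
        prior, cond = {}, {}
        for y in set(labels):
            prior[y] = labels.count(y) / len(labels)
            # concatenate all docs labelled y and count word occurrences
            counts = Counter(w for doc, lab in zip(docs, labels) if lab == y for w in doc)
            total = sum(counts.values())
            # P(w|y) = (#w(Docs_y) + 1) / (|V| + sum_w #w(Docs_y))
            cond[y] = {w: (counts[w] + 1) / (len(vocab) + total) for w in vocab}
        return prior, cond
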
141 changes: 141 additions & 0 deletions perceptron.rst
@@ -0,0 +1,141 @@
Perceptron
==========

Perceptron is a linear, online classification model

Given a training set of (instance, label) pairs, it learns a linear decision boundary (a hyperplane) - we assume binary labels for now

It's inspired by neurons: activation is a function of its inputs and weights. For example, the weighted sum activation:

.. math::
activation = \sum_{i=1}^D w_ix_i
Then, prediction can be something like ``a > 0 ? 1 : -1``.

Additionally, we can add a bias term to account for a non-zero intercept:

.. math::
a = [\sum_{i=1}^D w_ix_i] + b
Linear Boundary
---------------

- a ``D-1`` dimensional hyperplane separates a ``D`` dimensional space into two half-spaces: positive and negative
- this linear boundary has the form :math:`\mathbf{w} \cdot \mathbf{x} = 0`
- defined by **w**: the vector (often normalized to unit length) normal to the hyperplane
- :math:`\text{proj}_w x` is how far away *x* is from the decision boundary
- when **w** is normalized to a unit vector, :math:`\mathbf{w} \cdot \mathbf{x} = \text{proj}_w x`.

.. image:: _static/perceptron/ex1.png
:width: 450

**With Bias**

- When a bias is added, the linear boundary becomes :math:`\mathbf{w} \cdot \mathbf{x} + b = 0`
- this can be converted to the more general form :math:`\mathbf{w} \cdot \mathbf{x} = 0` by appending *b* to **w** as an extra weight and an always-1 feature to **x**
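
A small sketch of that conversion (illustrative names):

.. code:: py

    import numpy as np

    def absorb_bias(X, w, b):
        """Fold the bias into the weights: w·x + b == w_aug·x_aug for every row of X."""
        X_aug = np.hstack([X, np.ones((X.shape[0], 1))])  # append an always-1 feature
        w_aug = np.append(w, b)                           # append b as one more weight
        return X_aug, w_aug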

Prediction
----------
Pretty simple:

.. code:: py

    import numpy as np

    def prediction(w, x, b):
        return np.sign(w @ x + b)  # +1 or -1 (0 exactly on the boundary)
Training
--------

This is an error-driven model:

1. initialize model to some weights and biases
2. for each instance in training set:
1. use current **w** and *b* to predict a label :math:`\hat{y}`
2. if :math:`\hat{y} = y` do nothing
3. otherwise update **w** and *b* to do better
3. goto 2

.. image:: _static/perceptron/ex2.png
:width: 500

The update, in slightly simpler notation:

.. math::
\mathbf{w} & = \mathbf{w} + y \mathbf{x} \\
b & = b + y
So what does it do? Let's look at the new activation after an update where a positive was incorrectly predicted as a negative label:

.. image:: _static/perceptron/ex3.png

So for the given example, the activation increases by :math:`\sum_{i=1}^D x_i^2 + 1`, which is positive, bringing
the prediction closer to correct for that one sample.

We can also control the learning rate easily using a term :math:`\eta`:

.. math::
\mathbf{w} = \mathbf{w} + y \eta \mathbf{x}
Caveats
^^^^^^^

- the order of the training instances is important!
- e.g. all positives followed by all negatives is bad
- recommended to permute the training data after each iteration

Example
-------

.. code:: text

    x1  x2  y   w·x   w (after update, if any)
    ------------------------------------------
                      <0, 0>
     1   3  +     0   <1, 3>
     2   3  -    11   <-1, 0>
    -3   1  +     3   <-1, 0>
     1  -1  -    -1   <-1, 0>
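
A sketch of this procedure end to end (labels assumed to be in :math:`\{-1, +1\}`; the per-epoch shuffle follows the
caveat above, and the function name and defaults are illustrative). Run on the four examples above with
``shuffle=False``, it produces the same sequence of weight vectors as the trace (which omits the bias term):

.. code:: py

    import numpy as np

    def train_perceptron(X, y, epochs=10, eta=1.0, shuffle=True, seed=0):
        """Error-driven perceptron training; labels y are in {-1, +1}."""
        rng = np.random.default_rng(seed)
        w, b = np.zeros(X.shape[1]), 0.0
        for _ in range(epochs):
            order = rng.permutation(len(y)) if shuffle else np.arange(len(y))
            mistakes = 0
            for i in order:
                if y[i] * (X[i] @ w + b) <= 0:   # wrong (or on the boundary): update
                    w += eta * y[i] * X[i]
                    b += eta * y[i]
                    mistakes += 1
            if mistakes == 0:                     # a full pass with no updates: converged
                break
        return w, b
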
Convergence
-----------
We can define convergence as making a full pass through the training data without any updates.

If the training data is linearly separable, perceptron will converge - if not, it will never converge.

How long perceptron takes to converge depends on how *easy* the dataset is - roughly, how well separated the two
classes are (i.e. the larger the *margin*, the easier the dataset, where the *margin* is the distance from the
hyperplane to the closest datapoint)

Proof
^^^^^

**Overview**

.. image:: _static/perceptron/proof1.png

**Steps**

.. image:: _static/perceptron/proof2.png

**Simplification 1**

.. image:: _static/perceptron/proof3.png

**Simplification 2**

.. image:: _static/perceptron/proof4.png

**Simplification 3**

.. image:: _static/perceptron/proof5.png

**Analysis Setup**

.. image:: _static/perceptron/proof6.png

.. image:: _static/perceptron/proof7.png

**Finishing Up**

.. image:: _static/perceptron/proof8.png

2 changes: 1 addition & 1 deletion regression.rst
@@ -378,5 +378,5 @@ We can extend logistic regression to multiple classes:
.. note::
Class :math:`\theta_K` is actually redundant, since :math:`p(class = K | \mathbf{x}) = 1 - \sum_{k=1}^{K-1} p(class = k | \mathbf{x})`.



13 changes: 13 additions & 0 deletions tree.rst
@@ -0,0 +1,13 @@
Decision Trees
==============

Let's take the example of whether or not to play tennis given 4 features - a binary classification question
based on discrete features.

To construct a tree, pick a feature and split on it, then recursively build the subtrees top-down.

Entropy
-------

Entropy of a set of examples S relative to a binary classification task is:
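
For a set with fraction :math:`p_\oplus` of positive examples and :math:`p_\ominus = 1 - p_\oplus` of negative
examples, the standard definition (stated here for completeness) is:

.. math::
H(S) = -p_\oplus \log_2 p_\oplus - p_\ominus \log_2 p_\ominus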
