Instance-Based Learning
=======================
Also known as nearest neighbor methods, non-parametric, lazy, memory-based, or case-based learning.

In instance-based learning, there is no parametric model to fit,
e.g. K-NN, some density estimation, locally weighted linear regression.

Nearest Neighbor
----------------

- instances :math:`\mathbf{x}` are vectors of real numbers
- store the *m* training examples :math:`(\mathbf{x} ^{(1)}, y ^{(1)}) .. (\mathbf{x} ^{(m)}, y ^{(m)})`
- to predict on a new :math:`\mathbf{x}`, find the stored :math:`\mathbf{x} ^{(i)}` closest to :math:`\mathbf{x}` and predict :math:`y ^{(i)}`
- definition of *closest*: the stored sample minimizing the squared distance
- different distance metrics can also be used
- a Voronoi diagram can show the resulting decision boundaries in 2D space
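
A minimal sketch of 1-NN prediction under these definitions (``X`` and ``y`` are illustrative names for the stored training matrix, row per example, and its labels):

.. code:: py

    import numpy as np

    def nn_predict(X, y, x_new):
        """Predict the label of x_new as the label of its nearest stored example."""
        # squared Euclidean distance from x_new to every stored example
        dists = np.sum((X - x_new) ** 2, axis=1)
        return y[np.argmin(dists)]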

Note: it's important to use the right distance metric! If different dimensions have different scales (e.g. a dimension
from 0-1 vs. a dimension from 1k-1M), the smaller feature can become irrelevant. Similarly, adding irrelevant features
is problematic, as are highly correlated attributes.

.. note::
    Example.

    Let :math:`x_1 \in [0, 1]` determine the class: :math:`y = 1 \iff x_1 > 0.3`.

    Consider predicting the datapoint :math:`(0, 0)` (whose true label is 0) given the data:

    - :math:`(0.1, x_2)` labeled 0
    - :math:`(0.5, x'_2)` labeled 1
    - where :math:`x_2, x'_2` are random draws from :math:`[0, 1]`

    What is the probability of a mistake?

    If :math:`0.1^2 + x_2^2 > 0.5^2 + (x'_2)^2`, the label-1 point is closer, so :math:`(0, 0)` will be misclassified.

    Therefore, the probability of a mistake is :math:`P(0.1^2 + x_2^2 > 0.5^2 + (x'_2)^2)`:

    .. math::
        & = \int_{x=0}^1 P(0.1^2 + x_2^2 > 0.5^2 + (x'_2)^2 \mid x_2 = x) \, f_{x_2}(x) \, dx \\
        & = \int_{x=0}^1 P((x'_2)^2 < x^2 - 0.24) \cdot 1 \, dx \\
        & = \int_{x=0}^1 P(x'_2 < \sqrt{x^2 - 0.24}) \, dx \\
        & = \int_{\sqrt{0.24}}^1 \sqrt{x^2 - 0.24} \, dx \\
        & \approx 0.275

    (The probability inside the integral is 0 whenever :math:`x^2 < 0.24`, which gives the lower limit of the last integral.)
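
A quick Monte Carlo check of the example above:

.. code:: py

    import numpy as np

    rng = np.random.default_rng(0)
    x2 = rng.uniform(size=1_000_000)    # second coordinate of the label-0 point
    x2p = rng.uniform(size=1_000_000)   # second coordinate of the label-1 point
    # mistake iff the label-1 point (0.5, x2') is closer to (0, 0)
    print(np.mean(0.1**2 + x2**2 > 0.5**2 + x2p**2))   # ~0.275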

There are some tricks, though (see the sketch below):

- normalize attributes (e.g. to a mean-0, variance-1 gaussian distribution)
- use a "mutual information" weight :math:`w_j` on the *j* th component
- :math:`dist(x, x') = \sum_j w_j (x_j - x'_j)^2`
- :math:`w_j = I(x_j, y)`
- Mahalanobis distance - uses a covariance matrix to account for correlations between attributes
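
A minimal sketch of the first two tricks, assuming the examples are stored row-wise in a NumPy array:

.. code:: py

    import numpy as np

    def standardize(X):
        """Rescale each attribute to mean 0, variance 1."""
        return (X - X.mean(axis=0)) / X.std(axis=0)

    def weighted_sq_dist(x, x_prime, w):
        """Weighted squared distance: sum_j w_j * (x_j - x'_j)^2."""
        return np.sum(w * (x - x_prime) ** 2)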

**Curse of Dimensionality**

As the number of attributes goes up, so does the "volume" - you need exponentially more points to cover the
training space.

K-d Trees
^^^^^^^^^

We can greatly speed up nearest neighbor search by organizing the examples into a tree (see the sketch below):

- like a BST, but organized around dimensions
- each node tests a single dimension against a threshold (the median)
- can split on the highest-variance dimension, or cycle through the dimensions
- growing a good tree can be expensive
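
A sketch of growing such a tree, cycling through dimensions and splitting at the median (names are illustrative):

.. code:: py

    import numpy as np

    class KDNode:
        def __init__(self, point, label, dim, left, right):
            self.point, self.label = point, label
            self.dim = dim                      # the dimension this node tests
            self.left, self.right = left, right

    def build_kdtree(points, labels, depth=0):
        """Recursively split on the median of one dimension, cycling through dims."""
        if len(points) == 0:
            return None
        dim = depth % points.shape[1]           # cycle through the dimensions
        order = np.argsort(points[:, dim])
        points, labels = points[order], labels[order]
        mid = len(points) // 2                  # median is the threshold
        return KDNode(points[mid], labels[mid], dim,
                      build_kdtree(points[:mid], labels[:mid], depth + 1),
                      build_kdtree(points[mid + 1:], labels[mid + 1:], depth + 1))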

Noise
^^^^^
Noise causes a problem in NN - if the nearest neighbor is noisy, there will be a misprediction.

So how do we make it robust against noise?

K-Nearest Neighbors
-------------------
In K-NN, we find the closest K points and predict the majority vote of their labels.

By the law of large numbers, as the number of data points and K both go to infinity (with K growing more
slowly than the data), the error of this classifier approaches the best achievable (Bayes) error rate.
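
Extending the earlier 1-NN sketch to a majority vote over the K closest points:

.. code:: py

    from collections import Counter

    import numpy as np

    def knn_predict(X, y, x_new, k=3):
        """Predict by majority vote among the k nearest stored examples."""
        dists = np.sum((X - x_new) ** 2, axis=1)   # squared distances
        nearest = np.argsort(dists)[:k]            # indices of the k closest
        return Counter(y[nearest]).most_common(1)[0][0]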

Nonparametric Regression
------------------------

- sometimes called "smoothing models"
- emphasize nearby points (see the sketch below), e.g.

  - predict the nearest neighbor's label
  - predict with a distance-weighted average of labels
  - predict with locally weighted linear regression
  - divide into *h* bins, run a linear regression on each bin

.. note::
    Both for kNN and bins, choosing *k* and *h* is important - when they are small, there is little bias
    but high variance (undersmoothing). When they are large, there is large bias but little variance (oversmoothing).
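
A sketch of the distance-weighted average approach, using a gaussian kernel (the kernel and bandwidth choice are assumptions, not fixed by these notes):

.. code:: py

    import numpy as np

    def weighted_average_predict(X, y, x_new, h=1.0):
        """Predict a real value as a distance-weighted average of training labels."""
        sq_dists = np.sum((X - x_new) ** 2, axis=1)
        weights = np.exp(-sq_dists / (2 * h ** 2))   # nearby points weigh more
        return np.sum(weights * y) / np.sum(weights)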

Linear Models
=============

If your data is linearly separable, perceptron will find you a separating hyperplane.

But what if my data isn't linearly separable?

- perceptron will find a hyperplane that makes some errors
- what about a hyperplane that makes a *minimal* number of errors?

Minimum Error Hyperplane
------------------------

The error of a linear model :math:`(\mathbf{w}, b)` on an instance :math:`(\mathbf{x_n}, y_n)` is:

.. math::
    \mathbf{1} [y_n (\mathbf{w} \cdot \mathbf{x_n} + b) \leq 0]

where :math:`\mathbf{1}` is an indicator function that returns 1 on an incorrect prediction and 0 on a correct one
(0-1 loss).

Based on this, we can define an objective whose minimizer is the minimum-error hyperplane:

.. math::
    \min_{\mathbf{w}, b} \sum_n \mathbf{1} [y_n (\mathbf{w} \cdot \mathbf{x_n} + b) \leq 0]

This is ERM: **empirical risk minimization**.

But there are problems:

- the loss function is not convex
- it is not differentiable

Alternatives to 0-1 Loss
^^^^^^^^^^^^^^^^^^^^^^^^
We want an upper bound on 0-1 loss that is convex, so that minimization is easy; minimizing the upper
bound also pushes down the real objective.

Given :math:`y, a` (label, activation):

- 0/1: :math:`l^{(0/1)}(y, a) = \mathbf{1}[ya \leq 0]`
- hinge: :math:`l^{(hin)}(y, a) = \max\{0, 1-ya\}`
- logistic: :math:`l^{(log)}(y, a) = \frac{1}{\log 2} \log(1 + \exp[-ya])`
- exponential: :math:`l^{(exp)}(y, a) = \exp[-ya]`

.. image:: _static/linearmodels/ex1.png
    :width: 450

These surrogates are all convex and can be minimized using SGD - though hinge loss is not differentiable at :math:`ya = 1`.
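
The four losses as code, transcribing the formulas above directly:

.. code:: py

    import numpy as np

    def loss_01(y, a):
        return float(y * a <= 0)                        # 0/1 (not convex)

    def loss_hinge(y, a):
        return max(0.0, 1 - y * a)                      # hinge

    def loss_log(y, a):
        return np.log(1 + np.exp(-y * a)) / np.log(2)   # logistic

    def loss_exp(y, a):
        return np.exp(-y * a)                           # exponential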

Sub-gradient Descent
^^^^^^^^^^^^^^^^^^^^
How do we minimize a non-differentiable function?

- apply GD anyway, where the gradient exists
- at non-differentiable points, use a sub-gradient
- the sub-gradient of :math:`f(z)` at a point :math:`z'` is the set of (slopes of) all lines that touch :math:`f(z)` at :math:`z'` but lie below :math:`f(z)`
- at differentiable points, the sub-gradient is just the gradient

.. image:: _static/linearmodels/ex2.png
    :width: 450
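
A sketch of sub-gradient descent on total hinge loss (the step size and epoch count are arbitrary; at the kink :math:`ya = 1` we take the zero sub-gradient, a common convention):

.. code:: py

    import numpy as np

    def hinge_subgradient_descent(X, y, eta=0.1, epochs=100):
        """Minimize sum_n max{0, 1 - y_n (w . x_n + b)} via sub-gradients."""
        w, b = np.zeros(X.shape[1]), 0.0
        for _ in range(epochs):
            for x_n, y_n in zip(X, y):
                if y_n * (w @ x_n + b) < 1:   # loss is active: gradient is -y_n x_n
                    w += eta * y_n * x_n
                    b += eta * y_n
                # otherwise the sub-gradient is 0: no update
        return w, b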

Naive Bayes
===========

TL;DR: predict the most likely label given the features.

.. math::
    & \arg \max_y P(y | \mathbf{x}) \\
    & = \arg \max_y P(\mathbf{x} | y) \frac{P(y)}{P(\mathbf{x})} \\
    & = \arg \max_y P(\mathbf{x} | y) P(y)

Naive independence assumption: the attributes are conditionally independent given *y*, i.e.

.. math::
    P(\mathbf{x} | y) = \prod_j P(x_j | y)

So, we predict the label *y* that maximizes

.. math::
    P(y) \prod_j P(x_j | y)

This is a *generative* model: pick *y*, then generate **x** based on *y*.

To implement naive bayes, we need to **estimate**:

- the :math:`P(y)` distribution
- for each class *y*, for each feature :math:`x_j`, the :math:`P(x_j | y)` distribution

All of these distributions are one-dimensional - their combination makes up the model.

.. image:: _static/naivebayes/ex1.png

Issues
^^^^^^

- conditional independence is optimistic
- what if an attribute-value pair is not in the training set? use Laplace smoothing / dummy data
- continuous features: use a gaussian or some other density
- attributes for text classification? use a bag-of-words model

NB for Text
^^^^^^^^^^^

- let :math:`V` be the vocabulary (all words/symbols in the training docs)
- for each class :math:`y`, let :math:`Docs_y` be the concatenation of all docs labelled *y*
- for each word :math:`w` in :math:`V`, let :math:`\#w(Docs_y)` be the number of times :math:`w` occurs in :math:`Docs_y`
- set :math:`P(w | y) = \frac{\#w(Docs_y) + 1}{|V| + \sum_{w'} \#w'(Docs_y)}` (Laplace smoothing)
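
A sketch of these estimates for a text classifier (``docs`` is a list of token lists; log-probabilities avoid underflow; all names are illustrative):

.. code:: py

    import math
    from collections import Counter

    def train_nb(docs, labels):
        """Estimate P(y) and Laplace-smoothed log P(w | y) from tokenized docs."""
        vocab = {w for doc in docs for w in doc}
        prior, word_logprob = {}, {}
        for c in set(labels):
            docs_c = [w for doc, l in zip(docs, labels) if l == c for w in doc]
            counts = Counter(docs_c)
            denom = len(vocab) + len(docs_c)   # |V| + sum_w #w(Docs_y)
            prior[c] = labels.count(c) / len(labels)
            word_logprob[c] = {w: math.log((counts[w] + 1) / denom) for w in vocab}
        return prior, word_logprob

    def predict_nb(doc, prior, word_logprob):
        """argmax_y log P(y) + sum_j log P(x_j | y), ignoring out-of-vocab words."""
        def score(c):
            return math.log(prior[c]) + sum(word_logprob[c].get(w, 0.0) for w in doc)
        return max(prior, key=score)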

Perceptron
==========

Perceptron is a linear, online classification model.

Given a training set of pairs, it learns a linear decision boundary (a hyperplane) - we assume labels are binary for now.

It's inspired by neurons: activation is a function of the inputs and weights. For example, the weighted-sum activation:

.. math::
    a = \sum_{i=1}^D w_ix_i

Then, prediction can be something like ``a > 0 ? 1 : -1``.

Additionally, we can add a bias term to account for a non-zero intercept:

.. math::
    a = \left[ \sum_{i=1}^D w_ix_i \right] + b

Linear Boundary
---------------

- a ``D-1`` dimensional hyperplane separates a ``D`` dimensional space into two half-spaces: positive and negative
- this linear boundary has the form :math:`\mathbf{w} \cdot \mathbf{x} = 0`
- it is defined by **w**: the normal vector of the hyperplane (often normalized to unit length), orthogonal to every vector lying in the hyperplane
- :math:`\text{proj}_{\mathbf{w}} \mathbf{x}` is how far away **x** is from the decision boundary
- when **w** is normalized to a unit vector, :math:`\mathbf{w} \cdot \mathbf{x} = \text{proj}_{\mathbf{w}} \mathbf{x}`

.. image:: _static/perceptron/ex1.png
    :width: 450

**With Bias**

- when a bias is added, the linear boundary becomes :math:`\mathbf{w} \cdot \mathbf{x} + b = 0`
- this can be converted to the more general form :math:`\mathbf{w} \cdot \mathbf{x} = 0` by appending *b* to **w** and an always-1 feature to **x**

Prediction
----------
Pretty simple:

.. code:: py

    import numpy as np

    def prediction(w, x, b):
        # sign of the activation: +1 or -1 (0 exactly on the boundary)
        return np.sign(w @ x + b)

Training
--------

This is an error-driven model:

1. initialize the model to some weights and biases
2. for each instance in the training set:

   1. use the current **w** and *b* to predict a label :math:`\hat{y}`
   2. if :math:`\hat{y} = y`, do nothing
   3. otherwise, update **w** and *b* to do better

3. repeat step 2 until done

.. image:: _static/perceptron/ex2.png
    :width: 500

The update, in slightly simpler notation:

.. math::
    \mathbf{w} & = \mathbf{w} + y \mathbf{x} \\
    b & = b + y

So what does it do? Let's look at the new activation after an update where a positive example was incorrectly predicted as negative:

.. image:: _static/perceptron/ex3.png

For that example, the activation increases by the positive amount :math:`\sum_{i=1}^D x_i^2 + 1`, bringing
the prediction closer to correct for that one sample.

We can also control the learning rate with a term :math:`\eta`:

.. math::
    \mathbf{w} = \mathbf{w} + \eta y \mathbf{x}

Caveats
^^^^^^^

- the order of the training instances is important!
- e.g. all positives followed by all negatives is bad
- it's recommended to permute the training data after each iteration (as in the sketch below)
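
Putting the update rule and the shuffling advice together (a sketch; ``max_epochs`` guards against non-separable data):

.. code:: py

    import numpy as np

    def train_perceptron(X, y, eta=1.0, max_epochs=100):
        """Online perceptron training; labels y are +1/-1."""
        rng = np.random.default_rng(0)
        w, b = np.zeros(X.shape[1]), 0.0
        for _ in range(max_epochs):
            mistakes = 0
            for i in rng.permutation(len(X)):    # permute each iteration
                if y[i] * (X[i] @ w + b) <= 0:   # misprediction
                    w += eta * y[i] * X[i]       # w = w + eta y x
                    b += eta * y[i]              # b = b + eta y
                    mistakes += 1
            if mistakes == 0:                    # a full pass with no updates
                break
        return w, b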

Example
-------

.. code:: text

    x1  x2  y   wx   w (after update, if any)
    -------------------------------------------
                     <0, 0>
     1   3  +    0   <1, 3>
     2   3  -   11   <-1, 0>
    -3   1  +    3   <-1, 0>
     1  -1  -   -1   <-1, 0>

Convergence
-----------
We define convergence as making a full pass through the training data without any updates.

If the training data is linearly separable, perceptron will converge - if not, it will never converge.

How long perceptron takes to converge depends on how *easy* the dataset is - roughly, how separated the two
classes are from each other (the higher the *margin*, the easier the dataset, where the *margin* is the
distance from the hyperplane to the closest datapoint).

Proof
^^^^^

**Overview**

.. image:: _static/perceptron/proof1.png

**Steps**

.. image:: _static/perceptron/proof2.png

**Simplification 1**

.. image:: _static/perceptron/proof3.png

**Simplification 2**

.. image:: _static/perceptron/proof4.png

**Simplification 3**

.. image:: _static/perceptron/proof5.png

**Analysis Setup**

.. image:: _static/perceptron/proof6.png

.. image:: _static/perceptron/proof7.png

**Finishing Up**

.. image:: _static/perceptron/proof8.png

Decision Trees
==============

Let's take the example of whether or not to play tennis given 4 features - a binary classification question
based on discrete features.

To construct a tree, pick a feature and split on it - then recursively build the tree top-down.

Entropy
-------

The entropy of a set of examples *S*, relative to a binary classification task, is:
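
.. math::
    Entropy(S) = -p_{\oplus} \log_2 p_{\oplus} - p_{\ominus} \log_2 p_{\ominus}

where :math:`p_{\oplus}` and :math:`p_{\ominus}` are the proportions of positive and negative examples in *S*.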