
To construct, pick a feature and split on it - then recursively build the tree top down

Algorithm
---------

ID3 = Iterative Dichotomizer 3

.. image:: _static/tree/ex1.png

How do we choose the "best" feature?

- Our goal is to build a tree that is as small as possible - Occam's Razor

  - i.e. select a hypothesis that makes minimal assumptions

- finding the minimal tree is NP-hard
- this algorithm uses a greedy heuristic, so it cannot guarantee optimality
- the main heuristic is the definition of "best"
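
A minimal sketch of the greedy top-down construction (not full ID3 with all its details), assuming examples are
``(feature_dict, label)`` pairs and ``best_feature`` is a placeholder for the "best" criterion defined below:

.. code-block:: python

    from collections import Counter

    def id3(examples, features, best_feature):
        """Greedy top-down decision tree construction (no pruning)."""
        labels = [label for _, label in examples]
        # base cases: the node is pure, or there are no features left to split on
        if len(set(labels)) == 1 or not features:
            return Counter(labels).most_common(1)[0][0]  # leaf: majority label
        # greedily pick a feature, split on it, and recurse into each branch
        feature = best_feature(examples, features)
        tree = {feature: {}}
        for value in {x[feature] for x, _ in examples}:
            subset = [(x, y) for x, y in examples if x[feature] == value]
            remaining = [f for f in features if f != feature]
            tree[feature][value] = id3(subset, remaining, best_feature)
        return tree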

"Best"
-------

Example
^^^^^^^

Take the dataset:

.. code-block:: text

    (A=0,B=0), -: 50 examples
    (A=0,B=1), -: 50 examples
    (A=1,B=0), -: 0 examples
    (A=1,B=1), +: 100 examples

In this case, it's easy to tell we should split on A.

But what about:

.. image:: _static/tree/ex2.png

We prefer the right tree (split on A) because statistically it requires fewer decisions to get to a prediction. But
how do we quantify this?

We should choose features that split the examples into "relatively pure" subsets. This idea is formalized as information gain.

Entropy
^^^^^^^

Entropy of a set of examples S relative to a binary classification task is: (also denoted :math:`H(S)`)

.. math::

    Entropy(S) = -p_+ \log_2(p_+) - p_- \log_2(p_-)

where :math:`p_+` is the proportion of examples in S that belong to the positive class and :math:`p_-` the proportion
that belong to the negative class.

Generalized to K classes:

.. math::

    Entropy(S) = \sum_{i=1}^K -p_i \log_2(p_i)

The lower the entropy, the less uncertainty there is in the dataset.

**Ex.** When all examples belong to one class, entropy is 0. When the examples are split evenly between the two classes
(in the binary case), entropy is 1.
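
A small sketch of the K-class entropy formula above, in Python (assuming the class labels of S are given as a plain
list):

.. code-block:: python

    import math
    from collections import Counter

    def entropy(labels):
        """Entropy of a set of examples, given their class labels."""
        counts = Counter(labels)
        total = len(labels)
        return -sum((c / total) * math.log2(c / total) for c in counts.values())

    entropy(["+"] * 100)              # == 0.0 (all examples in one class)
    entropy(["+"] * 50 + ["-"] * 50)  # == 1.0 (even binary split)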

Information Gain
^^^^^^^^^^^^^^^^
The information gain of an attribute :math:`a` is the expected reduction in entropy caused by partitioning on this
attribute:

.. math::

    Gain(S, a) = Entropy(S) - \sum_{v \in values(a)} \frac{|S_v|}{|S|}Entropy(S_v)

where :math:`S_v` is the subset of S for which the value of attribute :math:`a` is :math:`v`.

This subtracts from the entropy of S the average entropy of the partitions, where each partition's entropy is weighted
by its "contribution" to the size of S (the fraction of examples that fall into it).

Partitions of low entropy (low uncertainty, i.e. relatively pure/class-imbalanced splits) result in high gain.

So this algorithm (ID3) makes statistics-based decisions that use *all* the data.
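
A sketch of the gain computation, reusing the ``entropy`` helper from the sketch above and applied to the small (A, B)
dataset from the earlier example; splitting on A gives a gain of 1.0 versus roughly 0.31 for B:

.. code-block:: python

    def information_gain(examples, attribute):
        """Expected reduction in entropy from partitioning on *attribute*.

        *examples* is a list of (feature_dict, label) pairs; entropy() is the
        helper defined in the previous sketch.
        """
        labels = [y for _, y in examples]
        total = len(examples)
        remainder = 0.0
        for value in {x[attribute] for x, _ in examples}:
            subset = [y for x, y in examples if x[attribute] == value]
            remainder += (len(subset) / total) * entropy(subset)
        return entropy(labels) - remainder

    data = (
        [({"A": 0, "B": 0}, "-")] * 50
        + [({"A": 0, "B": 1}, "-")] * 50
        + [({"A": 1, "B": 1}, "+")] * 100
    )
    information_gain(data, "A")  # 1.0   -> split on A
    information_gain(data, "B")  # ~0.31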

Overfitting
-----------

- Learning a tree that perfectly classifies the training data may not produce the tree that generalizes best beyond the training data

  - there may be noise
  - we might be making decisions based on very little data

- causes:

  - too much variance in the training data
  - we split on features that are irrelevant

Pruning
^^^^^^^
Pruning is a way to reduce overfitting

- remove leaves and assign majority label of parent to all items
- prune the children of S if:

  - all children are leaves

- how much to prune?

  - prune if the accuracy on the *validation set* does not decrease upon assigning the most frequent class label to all items at S

2 basic approaches:

- post-pruning: grow the full tree then remove nodes that don't have sufficient evidence
- pre-pruning: stop growing the tree at some point during construction when it is determined that there is not enough data to make reliable choices

Note that not all leaves in a pruned tree have to be pure. How do we decide the label at a non-pure leaf?
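
One possible sketch of reduced-error post-pruning under the criterion above, assuming trees are the nested dicts built
by the ID3 sketch earlier (``predict`` here is a small hypothetical helper for that representation):

.. code-block:: python

    from collections import Counter

    def predict(tree, x, default):
        """Walk a nested-dict tree; fall back to *default* for unseen values."""
        while isinstance(tree, dict):
            feature = next(iter(tree))
            tree = tree[feature].get(x[feature], default)
        return tree

    def prune(tree, training, validation):
        """Reduced-error post-pruning, bottom-up.

        A subtree is replaced by the majority label of the training examples
        that reach it if accuracy on the validation examples that reach it
        does not decrease.
        """
        if not isinstance(tree, dict) or not validation:
            return tree
        majority = Counter(y for _, y in training).most_common(1)[0][0]
        feature = next(iter(tree))
        # prune the children first, routing examples down the corresponding branch
        for value, child in list(tree[feature].items()):
            train_v = [(x, y) for x, y in training if x[feature] == value]
            valid_v = [(x, y) for x, y in validation if x[feature] == value]
            tree[feature][value] = prune(child, train_v or training, valid_v)
        # collapse this node into a leaf if validation accuracy does not drop
        leaf_correct = sum(y == majority for _, y in validation)
        tree_correct = sum(predict(tree, x, majority) == y for x, y in validation)
        return majority if leaf_correct >= tree_correct else tree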

Random Forests
^^^^^^^^^^^^^^
Another way to avoid overfitting

- learn a collection of decision trees instead of a single one
- predict the label by aggregating all the trees in the forest
- voting for classification, averaging for regression (see the sketch below)
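
A tiny sketch of the aggregation step, assuming a list of already-trained trees that each expose a (hypothetical)
``predict`` method:

.. code-block:: python

    from collections import Counter
    from statistics import mean

    def forest_classify(trees, x):
        """Classification: majority vote over the per-tree predictions."""
        votes = Counter(tree.predict(x) for tree in trees)
        return votes.most_common(1)[0][0]

    def forest_regress(trees, x):
        """Regression: average the per-tree numeric predictions."""
        return mean(tree.predict(x) for tree in trees)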

Continuous Attributes
---------------------
TLDR:

- for a continuous *target* (regression), split on the feature that gives the largest drop in MSE
- for continuous *features*, discretize them into discrete values (small, medium, large, etc.)
- what thresholds should we use for these splits?

  - e.g. sort the examples on the feature; for each ordered pair with different labels, consider a candidate threshold
    between them (see the sketch below)
  - at the midpoint or on an endpoint?
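
A sketch of the candidate-threshold idea from the list above, using midpoints between consecutive examples with
different labels (midpoint vs. endpoint is a design choice):

.. code-block:: python

    def candidate_thresholds(values, labels):
        """Sort examples on one continuous feature and return the midpoints
        between consecutive examples whose labels differ."""
        pairs = sorted(zip(values, labels))
        thresholds = []
        for (v1, y1), (v2, y2) in zip(pairs, pairs[1:]):
            if y1 != y2 and v1 != v2:
                thresholds.append((v1 + v2) / 2)
        return thresholds

    candidate_thresholds([48, 60, 72, 80, 90], ["-", "-", "+", "+", "-"])
    # -> [66.0, 85.0]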

Missing Values
--------------
Options:

- throw away the instance
- fill in the most common value of the feature (see the sketch below)
- fill in with all possible values of the feature
- guess based on correlations with other features/labels
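
A sketch of the "most common value" option from the list above (simple per-feature mode imputation; the other
strategies are equally valid):

.. code-block:: python

    from collections import Counter

    def fill_most_common(rows, feature, missing=None):
        """Replace missing entries of *feature* with its most common observed value."""
        observed = [row[feature] for row in rows if row[feature] is not missing]
        mode = Counter(observed).most_common(1)[0][0]
        return [
            {**row, feature: mode} if row[feature] is missing else row
            for row in rows
        ]

    rows = [{"outlook": "sunny"}, {"outlook": None}, {"outlook": "rain"}, {"outlook": "sunny"}]
    fill_most_common(rows, "outlook")  # the None entry becomes "sunny"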

Other Issues
------------

- noisy labels

  - e.g. two identical instances with conflicting labels

- attributes with different costs

  - change the splitting criterion so that low-cost attributes are preferred (one example formula below)
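
One variant that appears in the literature (an illustration here, not from the lecture) divides the squared gain by the
attribute's cost, so that expensive attributes need a proportionally larger gain to be chosen:

.. math::

    Gain_{cost}(S, a) = \frac{Gain(S, a)^2}{Cost(a)}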
