# Supervised Learning

## Classification vs Regression

There are two types of supervised learning: classification and regression.
- Classification: taking some input and mapping it to a discrete label.
- Regression: mapping input to a continuous-valued output, e.g. mapping pictures of Michael to the length of his hair.

The difference is whether the output being mapped to is discrete or continuous.

## Terminology

- Instances: Input (Vectors, sets of). Can be credit score, pixels.
- Concept: Function that maps inputs to outputs (like a concept of 'what defines maleness').
- Target concept: ANSWER. The specific function we're trying to find out of all the possible concepts.
- Hypothesis class: The set of all possible concepts you're willing to entertain. Could be all possible functions, but then it might be hard to figure out which function is best given finite data.
    - By focusing on classification, we have already restricted our hypothesis class.

- Sample (Training set): Set of inputs, each paired with its correct output (label).
- Candidate: A concept you think might be the target concept.
- Testing set: Determine if the candidate concept does a good job or not by testing it on the testing set. (Apply the candidate concept to the inputs and check its predictions against the labels.)

- Testing set needs to be different from the training set else it's cheating.
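
The terminology above can be tied together with a minimal sketch. Everything here is an illustrative assumption: the toy instances (integers), the target concept ("x is at least 50"), and the simple threshold rule used to pick a candidate.

```python
import random

def accuracy(concept, examples):
    """Fraction of (instance, label) pairs where concept(instance) matches the label."""
    correct = sum(1 for x, label in examples if concept(x) == label)
    return correct / len(examples)

# Toy sample: instances are integers, the (assumed) target concept is x >= 50.
sample = [(x, x >= 50) for x in range(0, 100, 5)]

# Hold out 5 of the 20 examples; the candidate never sees them during training.
rng = random.Random(0)
rng.shuffle(sample)
training_set, testing_set = sample[:15], sample[15:]

# Hypothetical candidate concept: threshold at the smallest positive training instance.
threshold = min(x for x, label in training_set if label)

def candidate(x):
    return x >= threshold

print("training accuracy:", accuracy(candidate, training_set))
print("testing accuracy:", accuracy(candidate, testing_set))
```

The training accuracy is perfect by construction, which is why it says little on its own; only the held-out testing set gives an honest estimate.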

# Decision Trees

Example: on a date, deciding whether or not to go into a certain restaurant.
- Some features have to do with the restaurant and some have to do with things external to the restaurant (whether or not you're hungry)
- Some irrelevant features (number of cars parked across the country)

Consider the representation of a DT:
- Ask a series of questions and depending on the answers move from the root of the tree (top) along different paths down the tree.
- Leaves of the tree contain ANSWERs (output). Nodes have attributes (features).
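
One way to picture this representation is as nested dictionaries. This is only a sketch: the attribute names (`hungry`, `occupied`) and outputs (`enter`, `leave`) are made up for the restaurant example, not taken from the lecture.

```python
# Internal nodes are dicts: {"attribute": name, "branches": {answer: subtree}}.
# Leaves are plain strings holding the output.
tree = {
    "attribute": "hungry",
    "branches": {
        "no": "leave",
        "yes": {
            "attribute": "occupied",
            "branches": {"full": "leave", "some": "enter", "none": "leave"},
        },
    },
}

def predict(node, instance):
    """Walk from the root to a leaf, following the instance's answers."""
    while isinstance(node, dict):
        answer = instance[node["attribute"]]
        node = node["branches"][answer]
    return node

print(predict(tree, {"hungry": "yes", "occupied": "some"}))  # enter
print(predict(tree, {"hungry": "no", "occupied": "some"}))   # leave
```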

### Algorithm
Thoughts:
- 20 questions example. Think about the ordering of questions.
- Goal in asking questions was to **further** narrow down possibilities as much as possible.
- That is, the usefulness of each question depends on the answers you have to the previous questions.
- DT vs 20 questions: with DT, can build entire flowchart at the start vs 20 questions asking interactively.

Recipe:
1. Pick the best attribute
    - Best: splitting the data roughly in half (say)
2. Ask the question
3. Follow the path of the answer
4. Go to 1

Repeat until you reach an answer.

If an attribute splits the data in half but doesn't change the distribution of labels in each half, it is arguably a bad split: it doesn't help narrow down the answer and only **contributes to overfitting**.

### Decision Trees: Expressiveness
e.g. Boolean A AND B.

- A?
    - F -> leaf: No
    - T -> B?
        - F -> leaf: No
        - T -> leaf: Yes

The same if you switch A and B around, because AND is commutative: A and B play the same role in the function.

OR and XOR (exclusive OR) can be represented too.
- Each tree is a representation of a truth table.

### Size of DTs

For AND and OR, we need two nodes. For XOR we need three nodes. Scaling up to n attributes:

1. n-OR: If any of the n attributes is true, n-OR is true.
    - n nodes. Size of DT is linear, O(n).

2. n-XOR: Parity, e.g. pick odd parity. If the number of attributes that are true is odd, then True. Else False.
    - 2^n - 1 nodes. Size of DT is exponential, O(2^n).
    - Each sub-tree is itself a smaller version of XOR.
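
The node counts for n-OR and n-XOR can be checked with a small sketch. The tuple encoding of trees and the helper names (`build_or`, `build_xor`, `count_nodes`) are assumptions made for illustration; only internal (attribute) nodes are counted.

```python
from itertools import product

# A tree is either a bool leaf or a tuple (attribute_index, false_subtree, true_subtree).

def build_or(n, i=0):
    """n-OR: test attributes one at a time; any True answer ends at a True leaf."""
    if i == n:
        return False
    return (i, build_or(n, i + 1), True)

def build_xor(n, i=0, parity=False):
    """n-XOR (odd parity): every path has to test all n attributes."""
    if i == n:
        return parity
    return (i, build_xor(n, i + 1, parity), build_xor(n, i + 1, not parity))

def evaluate(tree, x):
    """Follow the answers in x from the root down to a leaf."""
    while not isinstance(tree, bool):
        i, if_false, if_true = tree
        tree = if_true if x[i] else if_false
    return tree

def count_nodes(tree):
    """Count internal (attribute) nodes only."""
    if isinstance(tree, bool):
        return 0
    return 1 + count_nodes(tree[1]) + count_nodes(tree[2])

for n in range(1, 6):
    or_tree, xor_tree = build_or(n), build_xor(n)
    # Check both trees against the functions they are meant to represent.
    for x in product([False, True], repeat=n):
        assert evaluate(or_tree, x) == any(x)
        assert evaluate(xor_tree, x) == (sum(x) % 2 == 1)
    print(n, count_nodes(or_tree), count_nodes(xor_tree))
```

The second column grows as n and the third as 2^n - 1, matching the linear and exponential sizes above.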

-> Prefer problems that look like **any** questions (n-OR) over ones that look like **parity** questions (n-XOR).

-> Can feature engineer to solve this. **The hardest problem is coming up with a good representation.**





Exactly how expressive is a decision tree?
- i.e. how many decision trees do we have to look at?
- e.g. n boolean attributes and output is boolean.

- Ordering the attributes (nodes): n! possible orderings.
- Truth table: 2^n rows.
    - How many ways are there to fill in the outputs? 2^n cells to fill, each with one of 2 values, so 2^(2^n).

n = 6 -> 2^(2^6) = 2^64, which is of the order of 10^19.
- Decision trees are expressive.
- Need a smart way to search the space of DTs.
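
The count above is easy to verify numerically. A short sketch (the function name `count_boolean_functions` is just an illustrative choice) that also brute-forces the n = 2 case:

```python
from itertools import product

def count_boolean_functions(n):
    """Number of distinct ways to fill the output column of a 2^n-row truth table."""
    return 2 ** (2 ** n)

# Brute-force check for n = 2: enumerate every possible output column.
rows = list(product([False, True], repeat=2))
columns = set(product([False, True], repeat=len(rows)))
assert len(columns) == count_boolean_functions(2)

for n in range(1, 7):
    print(n, count_boolean_functions(n))
```

For n = 6 this prints 18446744073709551616, roughly 1.8 x 10^19.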

## ID3: Algorithm

Loop until the problem is solved:
- A <- best attribute
- Assign A as the decision attribute for NODE
- For each value of A, create a descendant of NODE
- Sort training examples to leaves
- If examples are perfectly classified, STOP
- Else iterate over the leaves, finding the best attribute for the examples sorted to each leaf
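
The loop above can be sketched recursively. This is a minimal illustration rather than a reference implementation: it assumes categorical attributes stored in dicts, uses information gain to pick the best attribute, and adds a majority-vote fallback for when attributes run out (a guard not listed in the steps above). The toy data is made up.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy in bits of a list of labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def gain(examples, attribute):
    """Entropy of the labels minus the weighted entropy after splitting on attribute."""
    labels = [label for _, label in examples]
    remainder = 0.0
    for value in {x[attribute] for x, _ in examples}:
        subset = [label for x, label in examples if x[attribute] == value]
        remainder += len(subset) / len(examples) * entropy(subset)
    return entropy(labels) - remainder

def id3(examples, attributes):
    labels = [label for _, label in examples]
    # Stop when the examples are perfectly classified or no attributes remain.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: gain(examples, a))
    remaining = [a for a in attributes if a != best]
    branches = {}
    for value in {x[best] for x, _ in examples}:
        subset = [(x, label) for x, label in examples if x[best] == value]
        branches[value] = id3(subset, remaining)
    return {"attribute": best, "branches": branches}

# Toy data: the label is A AND B; C is an irrelevant attribute.
data = [
    ({"A": a, "B": b, "C": c}, a and b)
    for a in (False, True) for b in (False, True) for c in (False, True)
]
tree = id3(data, ["A", "B", "C"])
print(tree)
```

On this toy data the irrelevant attribute C has zero gain at every step, so it never appears in the learned tree.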

### Finding the Best attribute: Information gain.
Gain(S,A) = Entropy(S) - the expected (average) entropy over the subsets of examples that share a particular value of A.

$$Gain(S,A) = Entropy(S) - \sum_v \frac{|S_v|}{|S|}Entropy(S_v)$$

- S is the collection of training examples you're looking at
- A is the attribute
- S_v is the subset of S for which A has value v

The best attribute is the one that maximises Gain(S,A).

**Info Gain: Reduction in randomness of data based on knowing value of attribute.**

**Entropy: A measure of randomness.** 
- E.g. a fair coin has entropy 1 (bit): going into the flip, you have no basis for guessing heads over tails.

#### Formula for Entropy
$$Entropy(S) = -\sum_v p(v)\log_2 p(v)$$
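
As a quick worked check, assuming base-2 logarithms (so entropy is measured in bits) and a hypothetical biased coin that lands heads 90% of the time:

$$Entropy(\text{fair coin}) = -\left(\tfrac{1}{2}\log_2\tfrac{1}{2} + \tfrac{1}{2}\log_2\tfrac{1}{2}\right) = 1$$

$$Entropy(\text{biased coin}) = -\left(0.9\log_2 0.9 + 0.1\log_2 0.1\right) \approx 0.469$$

The more predictable the outcome, the lower the entropy; a coin that always lands heads has entropy 0.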

c.f. randomised optimisation later for more details.

Previously we said we preferred splits that make the resulting subsets less random (lower entropy). Equivalently, we want the split with the largest information gain.
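
A small worked example, using made-up data: suppose S has 4 examples, 2 positive and 2 negative, so Entropy(S) = 1. Attribute A splits S into {+, +} and {-, -}; attribute B splits it into {+, -} and {+, -}.

$$Gain(S,A) = 1 - \left(\tfrac{2}{4}\cdot 0 + \tfrac{2}{4}\cdot 0\right) = 1$$

$$Gain(S,B) = 1 - \left(\tfrac{2}{4}\cdot 1 + \tfrac{2}{4}\cdot 1\right) = 0$$

Both attributes split the data in half, but only A reduces the randomness of the labels, so A is the better attribute.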

## ID3 Bias: Inductive Bias

**Two kinds of bias we worry about when thinking about algorithms that search through a hypothesis space:**
- Restriction Bias: The hypothesis set that you care about (e.g. all decision trees; we don't consider quadratic equations...).
- Preference Bias: Which hypotheses from this hypothesis set we prefer -> at the heart of inductive bias.

Inductive bias of ID3 algorithm
- Since decisions are made top-down, it is more likely to choose trees that have **good splits near the top**, even when another tree represents the same function that we care about.
- **Correct over incorrect**: Prefers ones that model the data better to ones that model the data worse.
- Prefers **shorter trees** to longer ones. Comes naturally from preference for good splits at the top.

## DTs: Other Considerations
1. What if we had **continuous attributes**?
    - Use ranges or '<20?' style threshold splits; can binary search for the threshold.
2. When do we stop?
    - You might think 'when everything is classified correctly', BUT that may never happen if there's **noise** :(
    - Or when we've run out of attributes (doesn't help when we have continuous attributes, since those can be split on repeatedly).
    - Avoid overfitting (we overfit by having a tree that's too big, which violates Occam's Razor).
        - Cross-validation?
        - Stop expanding the tree once you reach a certain accuracy on a validation set.
    - **Pruning** -> smaller tree. (vid 28)
        - Leaves may then hold mixed examples, so need to have **votes on output**.
3. Regression
    - Q: What are the splitting criteria?
        - Try to measure how mixed up things are using **variance**.
    - What would you do with leaves? (Output) -> Average? Local linear fit?
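
The continuous-attribute and regression points can be combined in one sketch: choosing an 'x < threshold?' split by minimising the weighted variance of the outputs on each side, then using the average as the leaf output. The data and the helper name `best_threshold` are assumptions for illustration; averaging is just one of the leaf options mentioned above.

```python
from statistics import mean, pvariance

def best_threshold(xs, ys):
    """Return (threshold, weighted_variance) for the best 'x < threshold?' split."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        # Weighted average of the output variance on each side of the split.
        score = (len(left) * pvariance(left) + len(right) * pvariance(right)) / len(pairs)
        if best is None or score < best[1]:
            best = (threshold, score)
    return best

# Toy data: outputs jump from about 1 to about 5 between x = 30 and x = 40.
xs = [10, 20, 30, 40, 50, 60]
ys = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
threshold, score = best_threshold(xs, ys)
print(threshold)  # 35.0

# Leaf outputs: here, the mean of the training outputs that reach each leaf.
left_output = mean(y for x, y in zip(xs, ys) if x < threshold)
right_output = mean(y for x, y in zip(xs, ys) if x >= threshold)
print(left_output, right_output)
```

Variance plays the role that entropy plays for classification: a good split leaves each side with outputs that are close to one another.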

## Conclusion
- Representation
- ID3: A top-down learning algorithm
- Expressiveness of DTs
- Bias of ID3 (Inductive Bias)
- 'Best attributes' (deciding on splits): maximum information gain
- Dealing with overfitting e.g. using pruning.
