
# Introduction to Machine Learning
## Non-probabilistic Models for Classification
### July 2nd, 2019
### Instructors: Melanie Pradier Fernandez (Harvard), Weiwei Pan (Harvard), Javier Zazo Ruiz (Harvard)


## How would you parametrize an elliptical decision boundary?

<img src="./fig/fig1.png" style='height:300px;'>

We can say that the decision boundary is given by a ***quadratic function*** of the input:
$$
w_1x^2_1 + w_2x^2_2 + w_3 = 0
$$
We can fit such a decision boundary using logistic regression with degree 2 polynomial features.
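
As a minimal sketch of this idea, the cell below fits logistic regression with and without degree 2 polynomial features. The toy data (`x_toy`, `y_toy`), the ellipse used to label it, and the regularization setting `C=1000.0` are assumptions made for illustration; they are not the data shown in the figure.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical toy data: points inside an ellipse are class 1, points outside are class 0
rng = np.random.default_rng(0)
x_toy = rng.uniform(-3, 3, size=(500, 2))
y_toy = (x_toy[:, 0] ** 2 / 4 + x_toy[:, 1] ** 2 < 1).astype(int)

# Expand (x1, x2) into the degree 2 features (x1, x2, x1^2, x1*x2, x2^2),
# then fit an ordinary logistic regression on the expanded features
poly_logistic = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LogisticRegression(C=1000.0, max_iter=5000),
)
poly_logistic.fit(x_toy, y_toy)

# Compare training accuracy against logistic regression on the raw inputs
poly_accuracy = poly_logistic.score(x_toy, y_toy)
linear_accuracy = LogisticRegression().fit(x_toy, y_toy).score(x_toy, y_toy)
print(f"degree 2 features: {poly_accuracy:.2f}, linear features: {linear_accuracy:.2f}")
```

The model is still linear in its weights; only the features are quadratic, which is why the boundary in the original input space can be an ellipse.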

## How would you parametrize an arbitrary complex decision boundary?

<img src="./fig/fig2.png" style='height:300px;'>

It's not easy to think of a function $g(x)$ that can capture this decision boundary.

**GOAL:** Find models that can capture *arbitrarily complex* decision boundaries.

# Decision Trees

## Two Interpretations of Decision Trees
People in every walk of life have long been using ***decision trees*** (flow charts) for differentiating between classes of objects and phenomena:
<img src="./fig/fig3.png" alt="" style="height: 300px;"/>
Every flow chart tree corresponds to a partition of the input space by axis-aligned lines or (hyper)planes.

Conversely, every such partition can be written as a flow chart tree.
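
To see the two interpretations side by side, here is a small sketch that fits a tree to data labelled by an axis-aligned partition and prints the tree as a text flow chart. The toy data and the feature names `x1`, `x2` are assumptions for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy data: class 1 only in the upper-right quadrant of the unit square
rng = np.random.default_rng(0)
x_toy = rng.uniform(0, 1, size=(200, 2))
y_toy = ((x_toy[:, 0] > 0.5) & (x_toy[:, 1] > 0.5)).astype(int)

# A depth 2 tree: each internal node is one axis-aligned cut of the input space
small_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(x_toy, y_toy)

# Print the fitted tree as a flow chart of threshold tests
flow_chart = export_text(small_tree, feature_names=["x1", "x2"])
print(flow_chart)
```

Each printed test such as `x1 <= ...` is one cut of the feature space, and each leaf is one region of the resulting partition.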

## Fitting Decision Trees

We fit decision trees to training data using a **greedy algorithm**: we perform one 'best cut' at a time:
1. Start with an empty decision tree (undivided feature space)
2. Choose the 'optimal' predictor on which to split and choose the 'optimal' threshold value for splitting.
3. Recurse on each new node until some stopping condition is met

Typically, we measure optimality by the **purity** (in terms of observed classes) of each region defined by the tree.

We usually stop when a pre-defined maximum depth is reached.
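
One 'best cut' of the greedy algorithm can be sketched as follows. This is an illustrative implementation, not how `sklearn` does it internally: it assumes Gini impurity as the purity measure and tries the midpoints between consecutive observed values as thresholds. The helper names `gini_impurity` and `best_split` are hypothetical.

```python
import numpy as np

def gini_impurity(labels):
    """Gini impurity of a set of class labels: 1 - sum_k p_k^2 (0 means pure)."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    proportions = counts / counts.sum()
    return 1.0 - np.sum(proportions ** 2)

def best_split(x, y):
    """Search every predictor and threshold for the lowest weighted impurity."""
    best = (None, None, np.inf)  # (predictor index, threshold, weighted impurity)
    n_samples, n_predictors = x.shape
    for j in range(n_predictors):
        values = np.unique(x[:, j])
        # candidate thresholds: midpoints between consecutive observed values
        for threshold in (values[:-1] + values[1:]) / 2:
            left = y[x[:, j] <= threshold]
            right = y[x[:, j] > threshold]
            impurity = (
                len(left) * gini_impurity(left) + len(right) * gini_impurity(right)
            ) / n_samples
            if impurity < best[2]:
                best = (j, threshold, impurity)
    return best

# Tiny example: the first predictor separates the classes, the second does not
x_demo = np.array([[1.0, 5.0], [2.0, 1.0], [3.0, 4.0], [4.0, 2.0]])
y_demo = np.array([0, 0, 1, 1])
predictor, threshold, impurity = best_split(x_demo, y_demo)
print(predictor, threshold, impurity)
```

Fitting a full tree amounts to calling a routine like `best_split` recursively on the two resulting regions until the stopping condition is met.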

## Decision Tree Implementation in `sklearn`

In [None]:
#import the decision tree model
from sklearn.tree import DecisionTreeClassifier as DecisionTree

#instantiate a tree of depth 5
tree = DecisionTree(max_depth=5, criterion='gini')

#fit your tree to data
tree.fit(x_train, y_train)

#predict new labels using the fitted tree
tree.predict(x_test)

## For Practitioners

A decision tree model, as implemented in `sklearn`, has several hyper-parameters. Two important ones are:

1. stopping condition (usually the max depth, but it can also be the minimum number of observations per region defined by the tree)
2. splitting condition (which criterion of purity, in terms of observed classes, is used to determine an 'optimal' split)
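
As a sketch of how these two hyper-parameters appear in `sklearn`, the cell below builds one tree stopped by `max_depth` and one stopped by `min_samples_leaf`, each with a different `criterion`. The toy data and the specific values (3, 20) are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data with a non-linear class boundary
rng = np.random.default_rng(0)
x_toy = rng.normal(size=(300, 2))
y_toy = (x_toy[:, 0] * x_toy[:, 1] > 0).astype(int)

# Stopping condition: maximum depth; splitting condition: Gini impurity
depth_limited = DecisionTreeClassifier(max_depth=3, criterion="gini", random_state=0)
# Stopping condition: at least 20 observations per leaf; splitting condition: entropy
leaf_limited = DecisionTreeClassifier(min_samples_leaf=20, criterion="entropy", random_state=0)

for model in (depth_limited, leaf_limited):
    model.fit(x_toy, y_toy)
    print(model.get_depth(), model.get_n_leaves())

# Number of training observations that land in each leaf of the second tree
leaf_counts = np.bincount(leaf_limited.apply(x_toy))
leaf_counts = leaf_counts[leaf_counts > 0]
```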

## Exercise: Compare Decision Tree to Logistic Regression 

**Application:** Remote sensing, i.e. classification of locations in satellite images.

## Deep vs Shallow Trees: The Bias Variance Trade Off

<img src="./fig/fig4.png" alt="" style="height: 500px;"/>

What are the advantages and disadvantages of shallow trees versus deep trees?

## Deep vs Shallow Trees: The Bias Variance Trade Off
If you randomly draw 50 samples from our toy data with a non-linear boundary and fit a decision tree on each sample individually, their decision boundaries will look like the following:

<img src="./fig/fig5.png" alt="" style="height: 300px;"/>

What does this say about the variance of very deep trees? How does this impact the generalization of deep trees?
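
One way to put a number on this question is to fit many trees on independent samples and measure how much their predictions disagree at fixed points. The sketch below does this for shallow and fully grown trees. The noisy elliptical toy data, the 15% label noise, and the helper names `draw_sample` and `prediction_spread` are all assumptions; this is not the data behind the figure.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def draw_sample(n):
    """Hypothetical noisy toy data with an elliptical (non-linear) class boundary."""
    x = rng.uniform(-3, 3, size=(n, 2))
    y = (x[:, 0] ** 2 / 6.25 + x[:, 1] ** 2 < 1).astype(int)
    flip = rng.random(n) < 0.15  # flip 15% of the labels to simulate noise
    return x, np.where(flip, 1 - y, y)

# Fixed points at which the fitted trees are compared
x_grid = rng.uniform(-3, 3, size=(2000, 2))

def prediction_spread(max_depth, n_trees=50):
    """Variance of predicted labels across trees, averaged over the fixed points."""
    predictions = np.array([
        DecisionTreeClassifier(max_depth=max_depth, random_state=0)
        .fit(*draw_sample(200))
        .predict(x_grid)
        for _ in range(n_trees)
    ])
    return predictions.var(axis=0).mean()

shallow_spread = prediction_spread(max_depth=2)
deep_spread = prediction_spread(max_depth=None)  # None: grow until leaves are pure
print(f"shallow: {shallow_spread:.3f}, deep: {deep_spread:.3f}")
```

A larger spread means the fitted boundary depends more heavily on which sample happened to be drawn.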

## Bagging to the Rescue

Yesterday, we saw that when we created ensembles of models with high variance and averaged their predictions, then the averaged prediction generalized better. This method is called **bagging**.

Just as we can ensemble polynomials, we can ensemble deep decision trees. Applying bagging to decision trees gives us a model called a **random forest**.

# Ensemble Methods: Bagging (Random Forest)

## Random Forests
As we've seen:

1. a shallow decision tree is unable to capture complex decision boundaries (high bias)
2. a deep decision tree is overly sensitive to the noise in the data leading to overfitting (high variance)

A compromise for reducing variance is to fit a large number of sensitive models (deep trees) on the training data and then average the results (reducing the variance).

<img src="./fig/fig6.jpg" alt="" style="height: 300px;"/>

A **random forest** is the 'averaged model' of a collection of deep decision trees, each learned on a subset of the training data (each branch in each tree is also trained on a randomized subset of input dimensions to reduce correlation between trees).
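
The definition above can be sketched by hand. The cell below assumes binary labels coded as 0/1, draws each tree's subset as a bootstrap sample (rows drawn with replacement), and combines trees by majority vote; `sklearn`'s own random forest averages predicted class probabilities instead. The helper names `fit_bagged_trees` and `predict_bagged` and the toy data are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_bagged_trees(x, y, n_trees=25, seed=0):
    """Fit deep trees, each on a bootstrap sample (drawn with replacement) of the data."""
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(n_trees):
        rows = rng.integers(0, len(x), size=len(x))  # bootstrap row indices
        # max_features="sqrt": each split considers a random subset of input dimensions
        tree = DecisionTreeClassifier(
            max_features="sqrt", random_state=int(rng.integers(1 << 31))
        )
        trees.append(tree.fit(x[rows], y[rows]))
    return trees

def predict_bagged(trees, x):
    """Average the trees' 0/1 predictions and threshold at 0.5 (majority vote)."""
    votes = np.mean([tree.predict(x) for tree in trees], axis=0)
    return (votes >= 0.5).astype(int)

# Hypothetical noisy training data with a circular boundary, and clean test labels
rng = np.random.default_rng(1)
x_tr = rng.uniform(-2, 2, size=(400, 2))
y_tr = (x_tr[:, 0] ** 2 + x_tr[:, 1] ** 2 < 2).astype(int)
y_tr = np.where(rng.random(400) < 0.15, 1 - y_tr, y_tr)
x_te = rng.uniform(-2, 2, size=(1000, 2))
y_te = (x_te[:, 0] ** 2 + x_te[:, 1] ** 2 < 2).astype(int)

bagged_predictions = predict_bagged(fit_bagged_trees(x_tr, y_tr), x_te)
bagged_accuracy = np.mean(bagged_predictions == y_te)
print(f"bagged accuracy: {bagged_accuracy:.2f}")
```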

## Random Forest Implementation in `sklearn`

In [4]:
#import the random forest model
from sklearn.ensemble import RandomForestClassifier

#instantiate a forest with 1000 trees, each of depth at most 1000
forest = RandomForestClassifier(n_estimators=1000, max_depth=1000)

#fit your forest to data
forest.fit(x_train, y_train)

#predict new labels using the fitted forest
forest.predict(x_test)

## For Practitioners:

A random forest, as implemented in `sklearn`, has a number of hyper-parameters that affect model performance. Some important hyper-parameters are:

1. number of trees in the forest `n_estimators`
2. max depth of trees `max_depth`
3. hyper-parameters that determine the subsets of input dimensions used to learn each branch (e.g. `max_features`) and the subsets of training data used to learn each tree (e.g. `bootstrap`, `max_samples`)
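
A sketch of how these hyper-parameters can be set explicitly is shown below. The synthetic data from `make_classification` and the particular values are assumptions for illustration; `oob_score=True` is added only to get a quick accuracy estimate from the rows each tree did not see.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical stand-in data
x_toy, y_toy = make_classification(n_samples=300, n_features=10, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,     # number of trees in the forest
    max_depth=None,       # grow each tree until its leaves are pure
    max_features="sqrt",  # number of input dimensions considered at each split
    bootstrap=True,       # each tree is fit on a bootstrap sample of the rows
    max_samples=0.8,      # size of each bootstrap sample, as a fraction of the data
    oob_score=True,       # estimate accuracy on the rows left out of each sample
    random_state=0,
)
forest.fit(x_toy, y_toy)
print(forest.oob_score_)
```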

## Exercise: Compare Deep Trees to Random Forests

**Application:** Remote sensing, i.e. classification of locations in satellite images.

# kNN Classifiers

## kNN for Classification
As we mentioned before, you can use the k-Nearest-Neighbours model to perform classification instead of regression.


<img src="./fig/fig7.png" alt="" style="height: 300px;"/>
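
The prediction rule of a kNN classifier is simple enough to sketch directly. The cell below assumes Euclidean distance and non-negative integer class labels, and breaks ties in favor of the smaller label; the function name `knn_predict` is hypothetical.

```python
import numpy as np

def knn_predict(x_train, y_train, x_new, k=3):
    """Classify each row of x_new by majority vote among its k nearest training points."""
    # pairwise Euclidean distances, shape (n_new, n_train)
    distances = np.linalg.norm(x_new[:, None, :] - x_train[None, :, :], axis=2)
    # indices of the k closest training points for each new point
    nearest = np.argsort(distances, axis=1)[:, :k]
    neighbor_labels = y_train[nearest]
    # most common label in each row (ties go to the smaller label)
    return np.array([np.bincount(row).argmax() for row in neighbor_labels])

# Two well-separated clusters, one per class
x_demo = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_demo = np.array([0, 0, 0, 1, 1, 1])
demo_predictions = knn_predict(x_demo, y_demo, np.array([[0.5, 0.5], [5.5, 5.5]]))
print(demo_predictions)
```

Note that there is no fitting step beyond storing the training data; all the work happens at prediction time.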

## Decision Boundaries of kNN Classifiers

What do the decision boundaries of kNN classifiers look like? Can kNN classifiers capture non-linear decision boundaries?
<table>
    <tr><td><img src="./fig/fig1.png" style='height:300px;'></td><td><img src="./fig/fig2.png" style='height:300px;'></td>
    </tr>
</table>

## kNN Classification Implementation in `sklearn`

In [None]:
#import the kNN classification model
from sklearn.neighbors import KNeighborsClassifier

#instantiate a kNN classifier with 3 neighbors
knn = KNeighborsClassifier(n_neighbors=3)

#fit your classifier to data
knn.fit(x_train, y_train)

#predict new labels using the fitted classifier
knn.predict(x_test)

# Model Selection

## An Embarrassment of Riches

Now we have seen four types of classifiers: logistic regression, decision trees, random forests, and kNN. Each type can be customized in many ways (e.g. we can choose many different polynomials for the logistic regression boundary).

***Question:*** which model should we use?

***Answer:*** your choice of model should depend on the task and the dataset. You must
1. choose models based on sensible evaluation metrics
2. choose models using a proper data-splitting procedure (train/validation/test)
3. choose models that best solve your real-life task!
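
A minimal sketch of such a procedure is given below. It assumes accuracy as the evaluation metric and a 60/20/20 split, and uses synthetic stand-in data from `make_classification`; for a real task, both the metric and the split should be chosen to suit the problem.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in data; replace with the data for your task
x_all, y_all = make_classification(n_samples=600, n_features=8, random_state=0)

# 60% train, 20% validation, 20% test
x_rest, x_test, y_rest, y_test = train_test_split(x_all, y_all, test_size=0.2, random_state=0)
x_train, x_val, y_train, y_val = train_test_split(x_rest, y_rest, test_size=0.25, random_state=0)

candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "kNN": KNeighborsClassifier(n_neighbors=3),
}

# Fit on the training set, compare on the validation set
val_scores = {
    name: model.fit(x_train, y_train).score(x_val, y_val)
    for name, model in candidates.items()
}
best_name = max(val_scores, key=val_scores.get)

# Touch the test set once, only for the selected model
test_score = candidates[best_name].score(x_test, y_test)
print(best_name, round(test_score, 3))
```

Keeping the test set out of the comparison is what makes the final score an honest estimate of performance on new data.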

## Exercise: Compare All Classification Models on a Real Task

**Application:** Automated cancer diagnosis - classify biopsy results as cancerous or non-cancerous.