## Side notes
_(code snippets, summaries, resources, etc.)_
- Provides extra explanation: [_Introduction to Boosting_ PDF by Udacity](https://www.evernote.com/shard/s37/nl/1033921335/ea429564-f35b-4e92-81c0-9a7a1860fc06/) (Evernote)
    - Some notes below come from the PDF
- See also sections 10.1, 10.3 and 10.5 of Ch. 10 in
    - [Elements of Statistical Learning: Data Mining, Inference, and Prediction 2nd Ed. by Trevor Hastie, Robert Tibshirani  & Jerome Friedman (2013)](https://www.evernote.com/shard/s37/nl/1033921335/9dbcbee9-a0b0-4aad-a95b-0acf6cebaf6d/) (Evernote)

# Ensemble Learning: Boosting

## Summary of topics covered

![summary](ensemble_learning_boosting_images/boosting_summary.png)

- Boosting is agnostic to the learner, so long as it is a weak learner.
- Looked at what error really means with respect to some underlying distribution $D$
- In practice, as a boosting algorithm runs for more rounds and lowers its bias, its variance often does not increase, but rather _decreases_ as well. (Sounds too good to be true, but it isn't!)

## AdaBoost algorithm

- An example of additive expansion
- Originally designed for classification tasks (focus here), but can also be applied to regression
- AdaBoost is an _agnostic_ learner
    - Only requirement is that the base learner must consistently (with high probability) achieve greater performance than random guessing.

## Choosing a Weak Learner

__Definition:__ Weak learner
- Formal definition: $\forall_{D}$  $P_{D}[h(x) \neq c(x)] \leqslant \frac{1}{2} - \varepsilon$
    - i.e. for every distribution $D$ it has an expected error less than one half, so it always does better than chance.
    - $\varepsilon$, often used in ML, is some small positive number

So long as you can consistently beat random guessing, any true boosting algorithm will be able to increase the accuracy of the final ensemble.

What weak learner you should choose is then a trade-off between 3 factors:

1. The bias of the model.
    - A lower bias is almost always better, but you don't want to pick something that will overfit (yes, boosting can and does overfit)
2. The training time for the weak learner.
    - Generally we want to be able to train a weak learner quickly, as we are going to be building a few hundred (or thousand) of them.
3. The prediction time for our weak learner.
    - If we use a model that has a slow prediction rate, our ensemble of them is going to be a few hundred times slower!

The classic weak learner is a decision tree. 
- By changing the maximum depth of the tree, you can control all 3 factors. 
- This makes them incredibly popular for boosting. 
- What you should be using depends on your individual problem, but decision trees are a good starting point.

NOTE: So long as the algorithm supports weighted data instances, any algorithm can be used for boosting. E.g. "A guest speaker at my University was boosting 5 layer deep neural networks for his work in computational biology."

From [StackOverflow](http://stackoverflow.com/questions/20435717/what-is-a-weak-learner): What is a weak learner?
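
The requirement that a weak learner support weighted data instances can be sketched with a depth-1 decision tree (a "stump") on one-dimensional data. This is a minimal illustration under stated assumptions, not code from the course or the linked answer: the names `train_stump` and `stump_predict`, the toy data, and the choice of midpoint thresholds are all hypothetical.

```python
def train_stump(xs, ys, weights):
    """Fit a one-feature threshold classifier (a depth-1 tree) on weighted data.

    xs: list of floats, ys: labels in {-1, +1}, weights: non-negative, summing to 1.
    Returns (threshold, polarity, weighted_error).
    """
    best = None
    points = sorted(set(xs))
    # Candidate thresholds: one below the smallest point, then midpoints between neighbours.
    candidates = [points[0] - 1.0] + [(a + b) / 2.0 for a, b in zip(points, points[1:])]
    for threshold in candidates:
        for polarity in (+1, -1):
            # Predict `polarity` to the right of the threshold and `-polarity` to the left,
            # then add up the weights of the examples this rule gets wrong.
            error = sum(w for x, y, w in zip(xs, ys, weights)
                        if (polarity if x > threshold else -polarity) != y)
            if best is None or error < best[2]:
                best = (threshold, polarity, error)
    return best


def stump_predict(stump, x):
    """Apply a stump returned by train_stump to a single value x."""
    threshold, polarity, _ = stump
    return polarity if x > threshold else -polarity


xs = [1.0, 2.0, 3.0, 4.0]
ys = [-1, -1, +1, -1]

uniform = [0.25] * 4
print(train_stump(xs, ys, uniform))

# Put most of the weight on the third point and the stump moves to get it right.
skewed = [0.1, 0.1, 0.7, 0.1]
print(train_stump(xs, ys, skewed))
```

With uniform weights the cheapest rule is to predict -1 everywhere, but once the third point carries most of the weight the stump is forced to classify it correctly. That sensitivity to the weights is what boosting relies on.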

## Bagging ensemble learning technique (example)

![ensemble learning example](ensemble_learning_boosting_images/ensemble_learning_example.png)

Bagging a.k.a. bootstrap aggregation

__Key:__
- Red data points are training data; green are testing data
- Dotted lines are components of the final regression
- Red line is average of third order polynomials run on random subsets
- Blue line is the result of a regression run once on all training data (for comparison)

__Method:__
- Pick 5 random subsets of 5 example points each (random with replacement)
- A 3rd order polynomial regression is trained on each subset
- Finally, the 5 regressions are averaged to produce a polynomial (also 3rd order)

__Result:__
- Averaging the regressions does a better job of discovering the underlying structure of the data
    - lessens the likelihood of over-fitting by not being misled by any individual data point (same reason for doing cross-validation)
    - don't get trapped by data that is wrong due to noise
    - "Averages out all the variances of the differences"
- In practice, the bagging technique is particularly effective at avoiding over-fitting
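
The method above can be sketched in a few lines. To stay within the standard library, this sketch swaps the 3rd order polynomial for a least-squares straight line, so it illustrates the bagging procedure and not the exact model in the figure. The data, the seed, and the names `fit_line` and `bagged_fit` are hypothetical.

```python
import random


def fit_line(points):
    """Least-squares line through (x, y) points; returns (slope, intercept)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in points)
    if var_x == 0:  # every sampled x is identical: fall back to a flat line
        return 0.0, mean_y
    slope = sum((x - mean_x) * (y - mean_y) for x, y in points) / var_x
    return slope, mean_y - slope * mean_x


def bagged_fit(points, n_models=5, subset_size=5, seed=0):
    """Train n_models lines on bootstrap samples and average their coefficients."""
    rng = random.Random(seed)
    fits = []
    for _ in range(n_models):
        # Sample with replacement, so a point can appear more than once.
        subset = [rng.choice(points) for _ in range(subset_size)]
        fits.append(fit_line(subset))
    # For linear models, averaging coefficients equals averaging predictions.
    slope = sum(s for s, _ in fits) / n_models
    intercept = sum(b for _, b in fits) / n_models
    return slope, intercept


# Hypothetical noisy data scattered around y = 2x + 1.
data = [(0, 1.2), (1, 2.7), (2, 5.3), (3, 6.8), (4, 9.4), (5, 10.9), (6, 13.1), (7, 15.2)]
slope, intercept = bagged_fit(data)
print("bagged line: y = %.2f x + %.2f" % (slope, intercept))
```

Each individual line can be pulled around by whichever noisy points land in its subset; the average is less sensitive to any single point.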

## Boosting technique

- Instead of choosing subsets randomly, emphasize the "hardest" examples
    - Done by reweighting the examples each round and combining the learners with a weighted vote

__Definition of error:__
- $P_{D}[h(x) \neq c(x)]$
- i.e. the probability given the underlying distribution that the hypothesis will disagree with the true concept on some particular instance $x$
- Depends on the distributions of different types of data points
- Not the number of distinct possible mistakes but how often these mistakes occur across the distribution of the data
- More common examples would be more important to learn
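
The difference between counting mistakes and measuring error under $D$ can be shown on a tiny finite instance space. This is an illustrative sketch: the four instances, their probabilities, and both hypotheses are made up for the example, and `error_under_distribution` is a hypothetical helper name.

```python
def error_under_distribution(distribution, hypothesis, concept):
    """Return P_D[h(x) != c(x)] for a finite instance space.

    distribution maps each instance x to its probability under D.
    """
    return sum(p for x, p in distribution.items() if hypothesis(x) != concept(x))


# Hypothetical instance space: four points with unequal probabilities.
D = {1: 0.5, 2: 0.25, 3: 0.125, 4: 0.125}


def concept(x):
    """True labelling: even numbers are positive."""
    return x % 2 == 0


def h_a(x):
    """Wrong only on instance 1, which is the most common one."""
    return x % 2 == 0 or x == 1


def h_b(x):
    """Wrong only on instance 4, which is rare."""
    return x == 2


print(error_under_distribution(D, h_a, concept))  # 0.5
print(error_under_distribution(D, h_b, concept))  # 0.125
```

Both hypotheses make exactly one distinct mistake, yet `h_a` has four times the error of `h_b` because its mistake falls on an instance that is drawn far more often.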

## Boosting in pseudocode

![boosting in pseudocode](ensemble_learning_boosting_images/boosting_in_pseudocode.png)

## Boosting formula

![boosting formula 1](ensemble_learning_boosting_images/boosting_formula_1.png)

- $z_{t}$ is "whatever normalization constant at time $t$ in order to make it all work out to be a distribution"
- What happens to $D_{t}(i)$ when the hypothesis agrees with the label on example $i$? The answer is that it depends: if some other examples disagree, then this example agreeing will decrease $D_{t}(i)$, i.e. the weight the distribution gives to example $i$ at time $t$.
- And vice versa, if there is at least one example that agrees, an example that disagrees will increase $D_{t}(i)$
- Mathematically represents the idea that if a particular example is wrong, it will be weighted higher, i.e. it is presumed to be harder.
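
A small numeric round makes the reweighting concrete. This sketch assumes the pictured formula is the standard AdaBoost update, $D_{t+1}(i) = D_{t}(i)\, e^{-\alpha_{t} y_{i} h_{t}(x_{i})} / z_{t}$ with $\alpha_{t} = \frac{1}{2} \ln \frac{1 - \epsilon_{t}}{\epsilon_{t}}$, which is also what the example code further down computes. The function name `update_distribution` and the four-example data set are hypothetical.

```python
import math


def update_distribution(D, labels, predictions, alpha):
    """One AdaBoost reweighting step; labels and predictions are in {-1, +1}."""
    # Agreeing examples (y * h = +1) shrink by e^-alpha, disagreeing ones grow by e^+alpha.
    unnormalized = [d * math.exp(-alpha * y * h)
                    for d, y, h in zip(D, labels, predictions)]
    z = sum(unnormalized)  # normalization constant z_t
    return [u / z for u in unnormalized]


D = [0.25, 0.25, 0.25, 0.25]
labels = [+1, +1, -1, -1]
predictions = [+1, +1, -1, +1]  # only the last example is misclassified

# Weighted error of this hypothesis under D (0.25 here), and its vote weight alpha.
epsilon = sum(d for d, y, h in zip(D, labels, predictions) if y != h)
alpha = 0.5 * math.log((1 - epsilon) / epsilon)

new_D = update_distribution(D, labels, predictions, alpha)
print([round(d, 3) for d in new_D])  # roughly [0.167, 0.167, 0.167, 0.5]
```

After the update the single misclassified example carries half of the total weight, so the next weak learner cannot do better than chance without getting it right.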

![boosting formula 2](ensemble_learning_boosting_images/boosting_formula_2.png)
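
Assuming the second formula is the usual AdaBoost final hypothesis, $H(x) = \mathrm{sgn}\left(\sum_{t} \alpha_{t} h_{t}(x)\right)$, the weighted vote can be sketched as follows. The three weak hypotheses and their $\alpha$ values are invented for illustration, and `final_hypothesis` is a hypothetical name.

```python
def final_hypothesis(weighted_hypotheses, x):
    """Sign of the alpha-weighted sum of the weak hypotheses' votes on x."""
    total = sum(alpha * h(x) for alpha, h in weighted_hypotheses)
    return 1 if total >= 0 else -1


# Three hypothetical weak hypotheses on the real line, each voting -1 or +1.
ensemble = [
    (0.42, lambda x: 1 if x > 2 else -1),
    (0.65, lambda x: 1 if x > 5 else -1),
    (0.92, lambda x: -1 if x > 8 else 1),
]

for x in (1, 4, 6, 9):
    print(x, final_hypothesis(ensemble, x))
```

At `x = 1` the two lower-weight hypotheses together outvote the single highest-weight one, which shows that the ensemble is not simply following its best member.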



## Boosting Example: Three Little Boxes

![boosting example 1](ensemble_learning_boosting_images/boosting_example_1.png)

![boosting example 2](ensemble_learning_boosting_images/boosting_example_2.png)

## When Boosting Overfits

![boosting overfitting quiz](ensemble_learning_boosting_images/boosting_overfitting_quiz.png)

- If the underlying learner overfits, and keeps overfitting on every round of boosting, the boosted ensemble will overfit too
    - e.g. an artificial neural network (A.N.N.) is already prone to overfitting because it has so many parameters

### Example code of algorithm
From [forked GitHub repo](https://github.com/mdlynch37/AdaBoost)

In [20]:
import sys
sys.path.append('/Users/mdlynch37/projects/coding/data_science_algorithms/Classifiers')
from AdaBoost import usage, AdaBoost
# %run AdaBoost/AdaBoost.py

In [22]:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
#
#  AdaBoost.py
#  
#  Copyright 2015 Overxflow 
#  
import sys
sys.path.append('/Users/mdlynch37/projects/coding/data_science_algorithms/Classifiers')

from Classifiers import K_nearest_neighbour
from Classifiers import Multinomial
from Classifiers import Perceptron
from Classifiers import KernelPerceptron
from math import log,floor,e
from os import system,name
try: import cPickle as pickle
except ImportError: import pickle
from sys import argv

def cls(): system(['clear','cls'][name == 'nt'])
    
def header():
	print """
   _   _   _   _   _   _   _   _  
  / \ / \ / \ / \ / \ / \ / \ / \ 
 ( A | d | a | B | o | o | s | t )
  \_/ \_/ \_/ \_/ \_/ \_/ \_/ \_/ 
          By Overxfl0w13
"""

def footer(result_file=None): print "[END] Process finished without saving results.\n" if result_file==None else "[END] Process finished, saved classified in file "+result_file+". \n"

def usage():
	print """
    Usage: AdaBoost.py train_data_file iterations [classify] [test_data_file] [output_file] \n\n\
    \ttrain_data_file -> Name of file with train data\n\
    \titerations      -> process iterations\n\
    \tclassify        -> Optional [YES-NO], specifies if you want to classify test data\n\
    \ttest_data_file  -> Optional, only if you want to classify, specifies name of file with test data\n\
    \toutput_file     -> Optional, specifies destination file\n"
    """
def AdaBoost(samples,M):
	weight_samples    = [1.0/len(samples) for sample in samples]
	classifiers       = [K_nearest_neighbour,Multinomial,Perceptron,KernelPerceptron]
	classifiers_error = [0 for x in classifiers]
	final_classifier  = []
	for it in xrange(M):
		best_classifier       = K_nearest_neighbour # Random #
		index_best_classifier = 0 # Random #
		index_sample          = 0
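		computed_classes = [[] for classifier in classifiers]
		# Reset the weighted errors so each round is scored on the current weights only #
		classifiers_error = [0 for x in classifiers]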
		for sample in samples:
			cclass = sample[1]
			sample = sample[0]
			index_classifier = 0
			for classifier in classifiers:
				computed_class = classifier.classify(samples,sample)
				computed_classes[index_classifier].append(computed_class)
				if computed_class != cclass: classifiers_error[index_classifier] += weight_samples[index_sample]
				index_classifier += 1
			index_sample += 1
		# Pick the best classifier (lowest weighted error) #
		min_error = min(classifiers_error)
		index_best_classifier = classifiers_error.index(min_error)
		best_classifier = classifiers[index_best_classifier]
		# Compute the weight (alpha) of the chosen classifier; the tiny constant avoids division by zero #
		alpha_best_classifier = (1.0/2)*log((1-min_error)/(min_error+(1.0/10**20)))
		# Add the chosen classifier to the ensemble for the current iteration #
		final_classifier.append((alpha_best_classifier,best_classifier))
		# Stop if the error is > 0.5 (no better than chance) or 0 (already perfect) #
		if min_error>0.5 or min_error==0:  print "[!] Min error with only 1 classifier.\n";  return final_classifier
		# Recompute the sample weights #
		index_sample = 0
		for sample in samples:
			cclass = sample[1]
			sample = sample[0]
			weight_samples[index_sample] = weight_samples[index_sample]*(e**(-cclass*alpha_best_classifier*computed_classes[index_best_classifier][index_sample]))
			index_sample += 1 # Without this increment only the first weight is ever updated #
		# Normalize the sample weights so they sum to 1 #
		total_weights = sum(weight_samples)
		weight_samples = [float(w)/total_weights for w in weight_samples]
	return final_classifier

def load_data(filename):
	try:
		with open(filename,'rb') as fd: obj = pickle.load(fd) # The with block closes the file #
		return obj
	except IOError as ie: print "[-] File",filename," doesn't exist.\n"; exit(0)
	
def save_object(object,dest):
	with open(dest,'wb') as fd: pickle.dump(object,fd,pickle.HIGHEST_PROTOCOL) # The with block closes the file #
	
def classify_boost(final_classifier,samples,sample):
	val = 0
	for item in final_classifier: val += item[0]*item[1].classify(samples,sample) # Sum the alpha-weighted votes #
	return -1 if val<0 else 1

def classify_file(final_classifier,samples,test_samples,output_file):
	with open(output_file,"w") as fd:	
		fd.write("""   _   _   _   _   _   _   _  
  / \ / \ / \ / \ / \ / \ / \ 
 ( R | e | s | u | l | t | s )
  \_/ \_/ \_/ \_/ \_/ \_/ \_/\r\n\r\n\r\n""")
		fd.write(stringify_classifier(final_classifier)+"\r\n")	
		for sample in test_samples: fd.write("Sample "+str(sample)+" classified in: "+str(classify_boost(final_classifier,samples,sample))+"\r\n")
    	
def stringify_classifier(final_classifier):  
	st = " -> "
	for item in final_classifier: st += str(item[0])+"*"+item[1].__str__()+"(x)+"
	return st[:-1]
	
def __str__(final_classifier):
	print "Classifier\n".center(80)
	print "----------\n".center(80)
	st = stringify_classifier(final_classifier)
	print "\n"+st+"\n"
	
# if __name__ == "__main__":
    
# 	cls()
# 	header()
# 	if len(argv)<3: usage();exit(0)
# 	if len(argv)!=3:
# 		if len(argv)!=6 or argv[3].lower() not in ["yes","no"]: usage();exit()
# 	train_data_file = argv[1]
# 	iterations      = int(argv[2])
# 	if len(argv)>3:
# 		classify        = argv[3]
# 		test_data_file  = argv[4]
# 		output_file     = argv[5]
# 	train_samples = load_data(train_data_file)
# 	final_classifier = AdaBoost(train_samples,iterations)
# 	if len(argv)>=3 and argv[3].lower()=="yes":
# 		test_samples  = load_data(test_data_file) # Test with same train data, ... VERY OPTIMISTIC!! #
# 		classify_file(final_classifier,train_samples,test_samples,output_file)
# 		__str__(final_classifier)
# 		footer(output_file)
# 	else: 
# 		__str__(final_classifier)
# 		footer()