# Lesson 8

This lesson surveys gradient boosting and neural networks at a high level.

**Goal**

After this lesson, you should:
1. Understand the basic concepts around *gradient boosting techniques*.
1. Understand the basic concepts surrounding *neural networks*.

In [1]:
import numpy as np
import pandas as pd
import sklearn as skl
import matplotlib.pyplot as plt
import seaborn as sea

# control the plot size
plt.rcParams['figure.figsize'] = [10,5]

## Next Steps

The knowledge you've gained in this course is a good stepping stone towards learning more about machine learning and data science. I would recommend taking the time to go through this free online course by Andrew Ng. It explores the topics we've covered in more detail, examines the mathematics a little more closely, and offers insight into the world of machine learning:

https://www.coursera.org/learn/machine-learning

You can also hone your skills through kaggle.com competitions and by reading kaggle.com kernels.

## Gradient Boosting

We've seen one example of an *ensemble* model, specifically the Random Forest Classifier. This model takes a group of decision trees and looks at the average prediction to determine a result.

This technique is known as **bagging**: we build lots of independent models and then "average" their responses. In this instance, we choose many different models with many different parameters and samples in order to *reduce error by reducing variance*.
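
As a rough illustration of bagging, the sketch below trains several decision trees on bootstrap samples of the wine dataset (the same dataset used later in this lesson) and combines them by majority vote. The number of trees, the seeds, and the variable names are illustrative assumptions. Note that sklearn's `RandomForestClassifier` does more than this: it also considers a random subset of features at each split.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
n_models = 25  # illustrative choice
all_votes = []
for _ in range(n_models):
    # bootstrap sample: draw training rows with replacement
    idx = rng.integers(0, len(X_train), size=len(X_train))
    tree = DecisionTreeClassifier(random_state=0).fit(X_train[idx], y_train[idx])
    all_votes.append(tree.predict(X_test))

all_votes = np.array(all_votes)  # shape: (n_models, n_test_samples)

# "average" the independent models with a majority vote per test sample
bagged_pred = np.array([np.bincount(col, minlength=3).argmax() for col in all_votes.T])
bagged_accuracy = (bagged_pred == y_test).mean()
print(bagged_accuracy)
```

Each tree sees a slightly different sample, so the individual errors are partly independent and tend to cancel out in the vote.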

There is also a technique known as **boosting**: we make a prediction using a specific model, and we use the results of that model to further reduce the error in our next model. In other words, rather than creating many independent models, we ask each subsequent model to learn from the previous generation's mistakes.

Unlike bagging methods, boosting methods reduce bias *and* variance, but they are more prone to overfitting. However, as long as we choose our stopping condition carefully, we can minimize the potential of a boosting method to overfit our sample.
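
One possible stopping condition is early stopping on a held-out validation set. The sketch below uses sklearn's `n_iter_no_change` and `validation_fraction` parameters of `GradientBoostingClassifier`. The specific values (500 rounds, patience of 10, 20% validation) and the choice of the wine dataset are assumptions for illustration only.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

gbc_es = GradientBoostingClassifier(
    n_estimators=500,         # upper bound on the number of boosting rounds
    learning_rate=0.1,
    max_depth=1,
    validation_fraction=0.2,  # part of the training data held out internally
    n_iter_no_change=10,      # stop once 10 rounds bring no validation improvement
    random_state=0,
)
gbc_es.fit(X_train, y_train)

# n_estimators_ is the number of rounds actually fitted
print(gbc_es.n_estimators_, gbc_es.score(X_test, y_test))
```

If `n_estimators_` comes out well below 500, the model stopped adding trees once the validation score stalled.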

![bagging v boosting](https://bradzzz.gitbooks.io/ga-seattle-dsi/content/dsi/dsi_06_trees_methods/3.1-lesson/assets/images/BoostingVSBagging.png)

**Gradient Boosting** is a technique for supervised learning which builds better models by minimizing the error from previous "worse" models. It generally proceeds as follows:

1. apply a simple model to your samples
1. analyze that model for errors
1. isolate the larger errors (which indicate samples not captured by our model)
1. fit a new "better" model against those larger errors
1. assign weights to all your models and combine them (weighted mean)
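
The steps above can be sketched from scratch for a regression problem. This is a minimal sketch under stated assumptions, not a reference implementation: it uses squared error, fits each new model to the residuals of *all* samples (large errors simply dominate the fit), and uses one fixed `learning_rate` in place of per-model weights. The synthetic data and all names are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# synthetic 1-D regression data (illustrative only)
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 6, size=200)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(0, 0.1, size=200)

learning_rate = 0.1
n_rounds = 100

# step 1: start from a very simple model, the mean of the targets
prediction = np.full_like(y, y.mean())
initial_mse = np.mean((y - prediction) ** 2)

stumps = []
for _ in range(n_rounds):
    # steps 2-3: the residuals are the errors of the current ensemble
    residuals = y - prediction
    # step 4: fit a new small model (a depth-1 tree) against those errors
    stump = DecisionTreeRegressor(max_depth=1).fit(X, residuals)
    # step 5: add the new model to the ensemble with a fixed weight
    prediction += learning_rate * stump.predict(X)
    stumps.append(stump)

final_mse = np.mean((y - prediction) ** 2)
print(initial_mse, final_mse)
```

For squared error, the residuals are the negative gradient of the loss with respect to the current predictions, which is where the "gradient" in gradient boosting comes from.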

A very popular library, XGBoost, implements the gradient boosting algorithm. It is often among the best-performing algorithms, and it is frequently used throughout academia and industry to improve predictive performance. XGBoost is a very well-tuned library, but we will be using sklearn's gradient boosting implementation, simply because it's an API with which we've become familiar.

> let's work through GB on the board, using our bánh mì health example.

In [2]:
from sklearn.ensemble import GradientBoostingClassifier
gbc = GradientBoostingClassifier(n_estimators=100,
                                 learning_rate=1.0, 
                                 max_depth=1, 
                                 random_state=23)
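
The classifier above is only constructed, not trained. One way to try it out is sketched below, assuming the wine dataset that appears later in this lesson and an arbitrary train/test split. `staged_predict` is sklearn's generator of predictions after each boosting stage; the variable names are illustrative.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=23)

gbc = GradientBoostingClassifier(n_estimators=100,
                                 learning_rate=1.0,
                                 max_depth=1,
                                 random_state=23)
gbc.fit(X_train, y_train)

# test accuracy after 1, 2, ..., 100 boosting stages
staged_accuracy = [accuracy_score(y_test, stage_pred)
                   for stage_pred in gbc.staged_predict(X_test)]
print(staged_accuracy[0], staged_accuracy[-1])
```

Plotting `staged_accuracy` against the stage number is a quick way to see whether later stages still help or whether the model has started to overfit.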

## Neural Networks

Neural networks define a class of models which (1) behave well in higher-dimensional spaces and (2) build predictions from isolated "computations".

We represent our hypothesis as follows:

Consider a set of inputs, $\{x_1,...,x_i\}$:

![logistic unit](https://www.safaribooksonline.com/library/view/designing-machine-learning/9781785882951/graphics/B05198_06_01.jpg)

Note: the function defined by our hypothesis is a logistic function.
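
A single logistic unit can be written in a few lines of numpy. This is a minimal sketch: the input values, weights, and bias are made-up numbers chosen so the weighted sum is exactly zero, and the function names are illustrative.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def logistic_unit(x, weights, bias):
    """Weighted sum of the inputs plus a bias, passed through the logistic function."""
    return sigmoid(np.dot(weights, x) + bias)

x = np.array([1.0, 2.0, 3.0])         # inputs x_1, x_2, x_3
weights = np.array([1.0, -1.0, 0.5])  # one weight per input
bias = -0.5

# weighted sum: 1.0 - 2.0 + 1.5 - 0.5 = 0, and sigmoid(0) = 0.5
activation = logistic_unit(x, weights, bias)
print(activation)
```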

The above diagram depicts a single "neuron" in our network. We can combine these units to form a larger network:

![network](https://www.neuraldesigner.com/images/deep_neural_network.png)

The yellow circles are the **input layer**, the orange circles are the **output layer**, and the blue circles are defined as various levels of **hidden layers**.

The hidden layers define logistic functions which take a combination of parameters/weights and output a new "mini" hypothesis. This methodology, of taking our inputs and using various activation functions to form a hypothesis, is known as **forward propagation**.
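
Forward propagation amounts to applying one layer after another. The sketch below assumes a tiny network with 4 inputs, 5 hidden units, and 3 outputs, with random weights; the layer sizes and names are illustrative, and no training takes place here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, layers):
    """Pass x through each (W, b) layer in turn, applying the logistic activation."""
    activation = x
    for W, b in layers:
        activation = sigmoid(W @ activation + b)
    return activation

rng = np.random.default_rng(0)
layer_sizes = [4, 5, 3]  # input layer, one hidden layer, output layer

# one weight matrix and one bias vector per pair of consecutive layers
layers = [(rng.normal(size=(n_out, n_in)), rng.normal(size=n_out))
          for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

x = rng.normal(size=4)
output = forward(x, layers)
print(output)
```

Each hidden unit's output becomes an input to the next layer, which is how the "mini" hypotheses are combined into the final one.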

In addition to forward propagation, there is **back propagation**: after a forward pass, we measure the error at the output layer and propagate it backward through the network, using the error found in later layers to refine the weights of earlier layers.
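
To make the backward pass concrete, here is a minimal sketch for a network with one hidden layer, logistic activations, and squared error (all of these are assumptions made for the example). It computes the gradients by back propagation and checks one of them against a numerical finite-difference estimate.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_and_grads(W1, b1, W2, b2, x, y):
    # forward pass
    a1 = sigmoid(W1 @ x + b1)   # hidden layer activations
    a2 = sigmoid(W2 @ a1 + b2)  # output layer activations
    loss = 0.5 * np.sum((a2 - y) ** 2)

    # backward pass: error at the output, then pushed back to the hidden layer
    delta2 = (a2 - y) * a2 * (1 - a2)
    delta1 = (W2.T @ delta2) * a1 * (1 - a1)
    grads = {"W1": np.outer(delta1, x), "b1": delta1,
             "W2": np.outer(delta2, a1), "b2": delta2}
    return loss, grads

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 2)), rng.normal(size=3)
W2, b2 = rng.normal(size=(1, 3)), rng.normal(size=1)
x, y = np.array([0.5, -1.0]), np.array([1.0])

loss, grads = loss_and_grads(W1, b1, W2, b2, x, y)

# numerical check: nudge one weight and see how much the loss changes
eps = 1e-6
W1_shifted = W1.copy()
W1_shifted[0, 0] += eps
loss_shifted, _ = loss_and_grads(W1_shifted, b1, W2, b2, x, y)
numeric_grad = (loss_shifted - loss) / eps

print(grads["W1"][0, 0], numeric_grad)
```

Training then repeatedly subtracts a small multiple of these gradients from the weights, which is ordinary gradient descent.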

These combinations of "new features", derived from our initial feature set, allow us to come up with interesting nonlinear hypotheses. Let's examine an example on the board.

Note: we often call $x_0$ the bias. We need to feed in bias to achieve better fitting to our sample data. I won't discuss this here, but it's something to keep in mind and explore if you are interested.
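
A standard textbook illustration of a nonlinear hypothesis (not necessarily the one worked on the board) is XNOR, which no single logistic unit can represent. The hand-picked weights below are assumptions chosen for the example. Each row starts with the weight applied to the bias input $x_0 = 1$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# each row: [weight for bias x_0, weight for x_1, weight for x_2]
hidden_weights = np.array([
    [-30.0,  20.0,  20.0],  # approximately x_1 AND x_2
    [ 10.0, -20.0, -20.0],  # approximately (NOT x_1) AND (NOT x_2)
])
# [weight for bias, weight for hidden unit 1, weight for hidden unit 2]
output_weights = np.array([-10.0, 20.0, 20.0])  # approximately a_1 OR a_2

def xnor_network(x1, x2):
    x = np.array([1.0, x1, x2])              # prepend the bias input x_0 = 1
    hidden = sigmoid(hidden_weights @ x)
    hidden_with_bias = np.concatenate(([1.0], hidden))
    return sigmoid(output_weights @ hidden_with_bias)

outputs = {(x1, x2): round(float(xnor_network(x1, x2)))
           for x1 in (0, 1) for x2 in (0, 1)}
print(outputs)  # {(0, 0): 1, (0, 1): 0, (1, 0): 0, (1, 1): 1}
```

The bias weights (-30, 10, -10) set each unit's threshold; without them, every unit would output 0.5 whenever all of its inputs are zero.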

In [3]:
from sklearn.datasets import load_wine

wine_data = load_wine()

raw = pd.DataFrame(wine_data.data,columns=wine_data.feature_names)
raw['target'] = wine_data.target

raw.columns = raw.columns.map(lambda x: x.lower())

print(wine_data.DESCR)
raw.head()

Wine Data Database

Notes
-----
Data Set Characteristics:
    :Number of Instances: 178 (50 in each of three classes)
    :Number of Attributes: 13 numeric, predictive attributes and the class
    :Attribute Information:
 		- 1) Alcohol
 		- 2) Malic acid
 		- 3) Ash
		- 4) Alcalinity of ash  
 		- 5) Magnesium
		- 6) Total phenols
 		- 7) Flavanoids
 		- 8) Nonflavanoid phenols
 		- 9) Proanthocyanins
		- 10)Color intensity
 		- 11)Hue
 		- 12)OD280/OD315 of diluted wines
 		- 13)Proline
        	- class:
                - class_0
                - class_1
                - class_2
		
    :Summary Statistics:
    
                                   Min   Max   Mean     SD
    Alcohol:                      11.0  14.8    13.0   0.8
    Malic Acid:                   0.74  5.80    2.34  1.12
    Ash:                          1.36  3.23    2.36  0.27
    Alcalinity of Ash:            10.6  30.0    19.5   3.3
    Magnesium:                    70.0 162.0    99.7  14.3
    Total Phenols:     

| | alcohol | malic_acid | ash | alcalinity_of_ash | magnesium | total_phenols | flavanoids | nonflavanoid_phenols | proanthocyanins | color_intensity | hue | od280/od315_of_diluted_wines | proline | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14.23 | 1.71 | 2.43 | 15.6 | 127.0 | 2.8 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065.0 | 0 |
| 1 | 13.2 | 1.78 | 2.14 | 11.2 | 100.0 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.4 | 1050.0 | 0 |
| 2 | 13.16 | 2.36 | 2.67 | 18.6 | 101.0 | 2.8 | 3.24 | 0.3 | 2.81 | 5.68 | 1.03 | 3.17 | 1185.0 | 0 |
| 3 | 14.37 | 1.95 | 2.5 | 16.8 | 113.0 | 3.85 | 3.49 | 0.24 | 2.18 | 7.8 | 0.86 | 3.45 | 1480.0 | 0 |
| 4 | 13.24 | 2.59 | 2.87 | 21.0 | 118.0 | 2.8 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735.0 | 0 |


In [4]:
raw.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| alcohol | 178.0 | 13.000618 | 0.811827 | 11.03 | 12.3625 | 13.05 | 13.6775 | 14.83 |
| malic_acid | 178.0 | 2.336348 | 1.117146 | 0.74 | 1.6025 | 1.865 | 3.0825 | 5.8 |
| ash | 178.0 | 2.366517 | 0.274344 | 1.36 | 2.21 | 2.36 | 2.5575 | 3.23 |
| alcalinity_of_ash | 178.0 | 19.494944 | 3.339564 | 10.6 | 17.2 | 19.5 | 21.5 | 30.0 |
| magnesium | 178.0 | 99.741573 | 14.282484 | 70.0 | 88.0 | 98.0 | 107.0 | 162.0 |
| total_phenols | 178.0 | 2.295112 | 0.625851 | 0.98 | 1.7425 | 2.355 | 2.8 | 3.88 |
| flavanoids | 178.0 | 2.02927 | 0.998859 | 0.34 | 1.205 | 2.135 | 2.875 | 5.08 |
| nonflavanoid_phenols | 178.0 | 0.361854 | 0.124453 | 0.13 | 0.27 | 0.34 | 0.4375 | 0.66 |
| proanthocyanins | 178.0 | 1.590899 | 0.572359 | 0.41 | 1.25 | 1.555 | 1.95 | 3.58 |
| color_intensity | 178.0 | 5.05809 | 2.318286 | 1.28 | 3.22 | 4.69 | 6.2 | 13.0 |


In [5]:
from sklearn.model_selection import train_test_split

X = raw.drop('target', axis=1)
y = raw['target']

X_train, X_test, y_train, y_test = train_test_split(X, y)

In [6]:
# we need to help the model converge by normalizing our data
from sklearn.preprocessing import StandardScaler

ss = StandardScaler()
ss.fit(X_train)

# we now transform our data such that we have mean: 0 and s.d: 1
# this helps us compare our features in a "unit-less" manner
X_train = ss.transform(X_train)
X_test = ss.transform(X_test)
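
To see what the scaler does, the sketch below repeats the fit on a fresh split and checks the per-feature mean and standard deviation of the transformed training data. The split seed and variable names are illustrative assumptions. The scaler is fitted on the training data only, so the test set will be close to, but not exactly, mean 0 and standard deviation 1.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# learn each feature's mean and standard deviation from the training data only
ss = StandardScaler().fit(X_train)
X_train_scaled = ss.transform(X_train)
X_test_scaled = ss.transform(X_test)  # reuses the training statistics

train_means = X_train_scaled.mean(axis=0)
train_stds = X_train_scaled.std(axis=0)
print(np.round(train_means, 6))
print(np.round(train_stds, 6))
```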

In [7]:
from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(hidden_layer_sizes=(13,20,25,20,5),max_iter=1000)
mlp.fit(X_train,y_train)

pred = mlp.predict(X_test)
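
To connect `hidden_layer_sizes` back to the network diagram, we can inspect the shapes of the fitted weight matrices through the `coefs_` attribute. The sketch below refits the same architecture on a fresh split; the seeds and variable names are illustrative assumptions.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

ss = StandardScaler().fit(X_train)
X_train = ss.transform(X_train)
X_test = ss.transform(X_test)

mlp = MLPClassifier(hidden_layer_sizes=(13, 20, 25, 20, 5),
                    max_iter=1000,
                    random_state=0)
mlp.fit(X_train, y_train)

# one weight matrix per pair of consecutive layers: (units in, units out)
weight_shapes = [w.shape for w in mlp.coefs_]
print(weight_shapes)
```

The first matrix maps the 13 input features to the first hidden layer of 13 units, and the last maps the final hidden layer of 5 units to the 3 wine classes.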

In [8]:
from sklearn.metrics import classification_report
print(classification_report(y_test,pred))

             precision    recall  f1-score   support

          0       0.92      1.00      0.96        12
          1       0.95      0.91      0.93        22
          2       0.91      0.91      0.91        11

avg / total       0.93      0.93      0.93        45

