# Generalizing linear regression: Perceptrons


In part 1 of the session we developed a model for predicting house prices using linear regression. As the name suggests, linear regression aims to find the linear fit that best describes the training data. However, **not all relationships are linear**.

Note that for *nonlinear data* it does not matter how much training data we collect: linear regression cannot capture the dynamics of the system, as demonstrated in the following:

In [None]:
import numpy as np
import pandas as pd
 
import matplotlib.pyplot as plt
from matplotlib import rcParams
import seaborn as sns
%matplotlib inline

plt.style.use('seaborn-whitegrid')

import warnings
warnings.filterwarnings("ignore") 

### When you bring a plastic spoon to a knife fight 

Suppose we have a small dataset and want to construct a model from it.

Based on the known data (red points), a linear model (blue line) can be constructed; suppose it gives satisfactory results on the chosen evaluation metrics.

<img src="./snippet.jpg" >



By constructing a model, we aim to learn the function that describes the relation between $x$ and $y$: $$y = f(x).$$


Our **goal is to learn** $f$.

#### More data arrives for the show
<img src="./data.jpg" >


### Using linear regression to fit this data

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error as mse
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20

# Getting the data from a function in another (python) file 
from util_model_evals import MoreData, linear_performance

X,Y = MoreData(N=100)

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.33, random_state = 5)
linear = LinearRegression()

linear.fit(X_train, Y_train)

Y_pred_train = linear.predict(X_train)
Y_pred_test = linear.predict(X_test)

In [None]:
# plotting the data
fig2 = plt.figure(figsize=(5, 5))
fig2.clf()
ax2 = fig2.add_subplot(111)
ax2.scatter(X_train,Y_train,color='r',marker='o')
ax2.plot(X_train,Y_pred_train,color='b')
ax2.set_xlabel('x');
ax2.set_ylabel('y');
ax2.set_title('Training set');
#fig2.savefig('./images/X_train.jpg')

fig3= plt.figure(figsize=(5, 5))
fig3.clf()
ax3 = fig3.add_subplot(111)
ax3.scatter(X_test,Y_test,color='r',marker='o')
ax3.plot(X_test,Y_pred_test,color='b')
ax3.set_xlabel('x');
ax3.set_ylabel('y');
ax3.set_title('Test set');
#fig3.savefig('./images/X_test.jpg')

### When in doubt train on a larger dataset

In most machine learning applications, training the model on more data results in more accurate predictions. In this instance, however, the linear model cannot *learn the nonlinearity* of the underlying data, so it does not reap the benefit of a larger training set.

In [None]:
train_sizes = [160,320,640,1280,2560,5120,10240,20480,40960]

mse_list = linear_performance(train_sizes,X_test,Y_test,linear)

fig4= plt.figure()
fig4.clf()
ax4 = fig4.add_subplot(111)
ax4.plot(train_sizes, mse_list)
ax4.set_xlabel('Train set size');
ax4.set_ylabel('MSE');

No matter how large the training set, the linear model is not able to represent the dynamics of the given data *correctly*.

To be fair, the data that we are trying to fit is described by the wildly oscillating function $$y =  \sin \Big(\frac{1}{x}\Big),$$ which even a state-of-the-art machine learning model might have problems learning.
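The plateau in the MSE can be reproduced without the helper module. The following is a minimal sketch, not the code behind `linear_performance`: the sampling range $[-0.1, 0.1]$, the sample sizes and the names `sample_data` and `linear_fit_mse` are all assumptions made for illustration. The straight line is fitted with `np.polyfit`, which performs an ordinary least-squares fit.

```python
import numpy as np

def sample_data(n, rng):
    """Draw n points from y = sin(1/x); the range [-0.1, 0.1] is an assumption."""
    x = rng.uniform(-0.1, 0.1, size=n)
    return x, np.sin(1.0 / x)

def linear_fit_mse(n_train, x_test, y_test, rng):
    """Fit a straight line on n_train samples and return its test-set MSE."""
    x_train, y_train = sample_data(n_train, rng)
    slope, intercept = np.polyfit(x_train, y_train, deg=1)  # least-squares line
    y_pred = slope * x_test + intercept
    return np.mean((y_test - y_pred) ** 2)

rng = np.random.default_rng(0)
x_test, y_test = sample_data(2000, rng)

sizes = [160, 640, 2560, 10240]
mse_values = [linear_fit_mse(n, x_test, y_test, rng) for n in sizes]
for n, err in zip(sizes, mse_values):
    print(f"train size {n:6d}: test MSE = {err:.3f}")
```

Under these assumptions the test error barely moves as the training set grows, because no straight line can follow the oscillations.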

In [None]:
## define f_x as an inline function using the lambda keyword
#f_x = lambda x:np.sin(np.pi*x)*np.exp(-2.*np.pi*x)   
f_x = lambda x:np.sin(1./x)
# plotting the function 
Xs = np.linspace(-0.1,0.1,501)
Ys = f_x(Xs) 

fig1 = plt.figure()
fig1.clf()
ax1 = fig1.add_subplot(111)
ax1.plot(Xs,Ys)
ax1.set_xlabel('x');
ax1.set_ylabel('y');

## Perceptrons improving linear regression 

### History 
To enhance the capability of linear models in learning nonlinear distributions, *Frank Rosenblatt*, working at the Cornell Aeronautical Laboratory, proposed in 1957 a family of artificial neural networks (ANNs) for pattern classification and information storage. The algorithm came to be known as the **perceptron**. In its formulation, the perceptron was argued to be a simplified model of a biological *neuron*.

From the very beginning, perceptrons attracted a lot of controversy. Based on statements Rosenblatt made in 1958, *The New York Times* reported the perceptron to be *"the embryo of an electronic computer that [the Navy] expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence."*

Although the perceptron looked very promising, early researchers quickly discovered that it could not be trained to recognise many classes of patterns. In their 1969 book *Perceptrons*, Minsky and Papert showed that single-layer perceptrons are only capable of learning linearly separable patterns and therefore cannot learn an XOR function.

Perceptrons soon fell out of favour, until it was realized that *multilayer perceptrons*, or *feedforward neural networks*, have a greater ability to learn patterns than a single-layer perceptron; in particular, a multilayer perceptron can learn an XOR function. Still, the widely cited work of Minsky and Papert contributed to a decline of interest and funding in neural network research. The text was reprinted in 1987 as *"Perceptrons - Expanded Edition"*, where some errors in the original text are shown and corrected.

### Formulation 

A single layer perceptron (SLP) can be visualized as:

<img src="./perceptron.png" />

The perceptron maps an input $\mathbf{x}$ (an $m$-dimensional vector, $\mathbf{x} \in \mathbb{R}^m$) to an output $\hat{y}$, with $$\hat{y} = \sum_{i=0}^m w_i x_i,$$ where $x_i$ and $w_i$ denote the features of the input $\mathbf{x}$ and the weights associated with the neurons. Observe that the first neuron has the fixed value $x_0 = 1$ and the weight $w_0$, which translates to $$\hat{y} = \sum_{i=1}^{m}w_ix_i + b,$$
where $b:=w_0$ is commonly known as the bias. For a linear classification problem, adding a bias corresponds to shifting (translating) the decision boundary.
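To make the two equivalent forms concrete, here is a small NumPy sketch. The feature values, weights and bias are made up for illustration; only the formula comes from the text.

```python
import numpy as np

# Illustrative values only: m = 3 features, weights and bias chosen by hand.
x = np.array([2.0, -1.0, 0.5])   # features x_1..x_m
w = np.array([0.4, 0.3, -0.2])   # weights  w_1..w_m
b = 0.1                          # bias, b := w_0

# Form 1: weighted sum plus an explicit bias term.
y_hat_bias = np.dot(w, x) + b

# Form 2: prepend the constant input x_0 = 1 and the weight w_0 = b.
x_aug = np.concatenate(([1.0], x))
w_aug = np.concatenate(([b], w))
y_hat_aug = np.dot(w_aug, x_aug)

print(y_hat_bias, y_hat_aug)
```

Both forms give the same prediction, which is why the bias is often drawn as one more neuron whose input is fixed to 1.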

The basic idea of the perceptron is simply to give each input a relative score using the corresponding weight, thus **allowing information that is more relevant to be more dominant in the prediction.**

### So what's the big deal?
#### Multi-layer perceptron
In a **multi-layer perceptron** a number of hidden layers are inserted between the input and the output. It is observed that deeper (more hidden layers) and wider (more neurons per layer) networks are remarkably good at learning more complex representations.

By construction, multi-layer perceptrons are able to learn the nonlinearities present in the data. It is the nonlinear activations, such as $\tanh$, *ReLU (rectified linear unit)* and *sigmoid*, that make this possible: with purely linear (identity) activations, a stack of layers collapses to a single linear map.
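The XOR function mentioned in the history section makes a compact illustration. The sketch below compares the best linear least-squares fit with a tiny network that has one hidden layer of two ReLU neurons. The network weights are set by hand for illustration rather than learned, so this shows what such a network *can* represent, not how training finds it.

```python
import numpy as np

# The four XOR input patterns and their targets.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0], dtype=float)

# Best linear fit (with bias) in the least-squares sense.
A = np.hstack([np.ones((4, 1)), X])            # column of ones plays the role of x_0 = 1
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
linear_pred = A @ coef

# One hidden layer with two ReLU neurons; weights set by hand, not trained.
def relu(z):
    return np.maximum(z, 0.0)

W_hidden = np.array([[1.0, 1.0], [1.0, 1.0]])  # shape (inputs, hidden neurons)
b_hidden = np.array([0.0, -1.0])
w_out = np.array([1.0, -2.0])

hidden = relu(X @ W_hidden + b_hidden)
mlp_pred = hidden @ w_out

print("linear fit:", linear_pred)
print("two-layer ReLU network:", mlp_pred)
```

The linear fit predicts the same value for all four patterns, while the two-layer network reproduces the XOR targets exactly.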

#### How do I update the weights after each iteration of the optimization (like gradient descent)?
The most significant feature of neural networks is that information (the gradient of the loss function) can be propagated backwards from the output layer to the input layer. This is *backpropagation*, which in simple terms is the **chain rule**.
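As a rough illustration of the chain rule at work, the sketch below computes the gradients of a squared-error loss for a network with one $\tanh$ hidden layer and checks one of them against a finite difference. The network size, the random values and the learning rate are assumptions chosen for the example; this is not how `scikit-learn` implements it internally.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)            # one input sample with 3 features
y = 1.5                           # its target value
W1 = rng.normal(size=(4, 3))      # hidden-layer weights (4 neurons)
b1 = np.zeros(4)
w2 = rng.normal(size=4)           # output-layer weights
b2 = 0.0

def loss(W1, b1, w2, b2):
    h = np.tanh(W1 @ x + b1)
    y_hat = w2 @ h + b2
    return 0.5 * (y_hat - y) ** 2

# Forward pass, keeping the intermediate values.
h = np.tanh(W1 @ x + b1)
y_hat = w2 @ h + b2

# Backward pass: apply the chain rule layer by layer, output first.
d_y_hat = y_hat - y               # dL/dy_hat
grad_w2 = d_y_hat * h             # dL/dw2
grad_b2 = d_y_hat                 # dL/db2
d_h = d_y_hat * w2                # dL/dh
d_z = d_h * (1.0 - h ** 2)        # tanh'(z) = 1 - tanh(z)^2
grad_W1 = np.outer(d_z, x)        # dL/dW1
grad_b1 = d_z                     # dL/db1

# Check one entry against a central finite difference.
eps = 1e-6
W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[0, 0] += eps
W1_minus[0, 0] -= eps
numeric = (loss(W1_plus, b1, w2, b2) - loss(W1_minus, b1, w2, b2)) / (2 * eps)
print(grad_W1[0, 0], numeric)

# One gradient-descent update with an illustrative learning rate.
learning_rate = 0.1
W1 = W1 - learning_rate * grad_W1
b1 = b1 - learning_rate * grad_b1
w2 = w2 - learning_rate * grad_w2
b2 = b2 - learning_rate * grad_b2
```

Each gradient is the product of the gradient arriving from the layer above and a local derivative, which is why the computation runs from the output back towards the input.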

### Perceptrons for predicting Boston House prices 

In [None]:
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older version
from sklearn.datasets import load_boston

boston = load_boston()

# put the features into a DataFrame with named columns
bos = pd.DataFrame(boston.data)
bos.columns = boston.feature_names
bos.head()

In [None]:
# Populating the feature vector X and the label Y
bos['PRICE'] = boston.target

X = bos.drop('PRICE', axis = 1)
Y = bos['PRICE']  # label that we want to predict after learning the function f, Y~ f(X)

print(X.describe())

In [None]:
# splitting into train and test sets
test_train_ratio = 0.4
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = test_train_ratio, random_state = 1234)

In [None]:
print('Number of training samples = '+str(X_train.shape[0]))
print('Number of test samples = '+str(X_test.shape[0]))

### Recalling linear regression 


In [None]:
linear.fit(X_train, Y_train)

Y_pred_lin = linear.predict(X_test)
print('Mean square error for linear regression ='+str(mse(Y_test,Y_pred_lin)))

fig4= plt.figure(figsize=(5, 5))
fig4.clf()
ax4 = fig4.add_subplot(111)
ax4.scatter(Y_test, Y_pred_lin,color='b',marker='o')
ax4.plot(Y_test,Y_test,color='k')
ax4.set_xlabel('Actual price');
ax4.set_ylabel('Predicted price');
ax4.set_title('Test set');

### Exercise
Change the test/train ratio and observe how the MSE and the fit change.

*Hint*: Update the variable `test_train_ratio` and re-run the above blocks.

In [None]:
print('Number of features = '+str(X_train.shape[1]))

### Multi-layer perceptron: hyperparameter jargon


To demonstrate the multi-layer perceptron we use the built-in `MLPRegressor` class located in `sklearn.neural_network`; documentation: http://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPRegressor.html


### Hyperparameters

In [None]:
## set the hyperparameters_____________________________________________________________________
activation = 'identity'
structure = (23,5)  # hidden_layer_sizes: two hidden layers with 23 and 5 neurons
optimizer = 'adam'
learn_rate_type = 'adaptive' # {‘constant’, ‘invscaling’, ‘adaptive’}; only used when solver='sgd'
learn_rate = 1.0E-05
iter_max = 1000  # maximum number of iterations
validation_ratio = 0.25  # only used when early_stopping=True

##_____________________________________________________________________________________________  

In [None]:
from sklearn.neural_network import MLPRegressor

mlp = MLPRegressor(activation=activation, hidden_layer_sizes=structure,solver=optimizer,
                   learning_rate=learn_rate_type,learning_rate_init=learn_rate,
                   max_iter=iter_max,random_state=1234,
                   validation_fraction=validation_ratio)
mlp
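To see what the tuple passed to `hidden_layer_sizes` implies, the following sketch works out the weight-matrix shapes for the setting above. It assumes the 13 feature columns of the Boston data and a single output; after fitting, the entries of `mlp.coefs_` should have these shapes.

```python
n_features = 13        # number of feature columns in the Boston data
structure = (23, 5)    # same value as hidden_layer_sizes above
n_outputs = 1          # a single predicted price

layer_sizes = [n_features, *structure, n_outputs]
# One weight matrix per pair of consecutive layers, shaped (units in, units out).
weight_shapes = list(zip(layer_sizes[:-1], layer_sizes[1:]))
# Each layer also has one bias per output unit.
n_parameters = sum(n_in * n_out + n_out for n_in, n_out in weight_shapes)

print(weight_shapes)   # → [(13, 23), (23, 5), (5, 1)]
print(n_parameters)    # → 448
```

Each entry of the tuple is therefore the width of one hidden layer, and the length of the tuple is the number of hidden layers.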

In [None]:
mlp.fit(X_train, Y_train)

# plot the training loss recorded at each iteration
fig6= plt.figure(figsize=(5, 5))
fig6.clf()
ax6 = fig6.add_subplot(111)
Y_pred_mlp = mlp.predict(X_test)
ax6.plot(mlp.loss_curve_)
ax6.set_xlabel('Number of steps');
ax6.set_ylabel('Cost function');
ax6.set_title('Evolution of the cost function');

## Visualize the neural network

Our trained neural network can be visualized using the weights associated with each layer.

In [None]:
# retrieving the weights and biases
# coefs_[i] holds the weights connecting layer i to layer i+1 (layer 0 is the input)
hidden_layer = 1
weights = mlp.coefs_[hidden_layer]
bias = mlp.intercepts_[hidden_layer]

fig7= plt.figure(figsize=(5, 5))
fig7.clf()
ax7 = fig7.add_subplot(111)
ax7.imshow(np.transpose(weights), cmap=plt.get_cmap("gray"), aspect="auto")
ax7.set_xlabel('Neuron in layer '+ str(hidden_layer));
ax7.set_ylabel('Neuron in layer '+ str(hidden_layer+1));
ax7.set_title('Weights from layer '+str(hidden_layer)+' to layer '+str(hidden_layer+1));


#### How to interpret such a visualization?
First, on the gray scale large negative numbers are black, large positive numbers are white, and numbers near zero are gray. Each neuron takes its weighted input and applies the activation function to it. With a logistic activation, for example, the output is close to 0 for inputs much less than 0 and close to 1 for inputs much greater than 0 (with the `identity` activation no such squashing takes place). So, for instance, if a particular weight $w^{(l)}_{ij}$ is large and negative, neuron $i$ has its output pushed strongly towards zero by the input from neuron $j$ of the layer below. If a pixel is gray, neuron $i$ isn't very sensitive to the output of neuron $j$ in the layer below it.
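The effect of a single weight under a logistic activation can be checked numerically. This is a small sketch with made-up weight values that ignores the bias and all other inputs of neuron $i$.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

activation_below = 1.0                          # output of neuron j in the layer below
weights = np.array([-8.0, -0.05, 0.05, 8.0])    # illustrative values of w_ij
outputs = logistic(weights * activation_below)  # output of neuron i, other inputs ignored

for w, out in zip(weights, outputs):
    print(f"w_ij = {w:6.2f} -> output of neuron i = {out:.4f}")
```

The large negative weight (a black pixel) drives the output towards 0, the large positive weight (a white pixel) drives it towards 1, and the near-zero weights (gray pixels) leave it close to 0.5.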


In [None]:
Y_pred_mlp = mlp.predict(X_test)

print('')
print('Mean square error for multi-layer perceptron ='+str(mse(Y_test,Y_pred_mlp)))

fig5= plt.figure(figsize=(5, 5))
fig5.clf()
ax5 = fig5.add_subplot(111)
ax5.scatter(Y_test, Y_pred_mlp,color='b',marker='o')
ax5.plot(Y_test,Y_test,color='k')
ax5.set_xlabel('Actual price');
ax5.set_ylabel('Predicted price');
ax5.set_title('Test set');


## What does tuning mean?

As you might have guessed, tuning means choosing the set of hyperparameters that gives the best model, as judged by the chosen metrics.
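One common way to automate this search is a grid search with cross-validation. The sketch below is one possible approach, not a prescribed one: it uses `GridSearchCV` from `scikit-learn` on synthetic stand-in data, and the grid values, the `lbfgs` solver and the number of folds are assumptions picked to keep the example fast.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPRegressor

# Synthetic stand-in data so that the sketch is self-contained.
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Candidate hyperparameter values; every combination is tried.
param_grid = {
    "activation": ["identity", "relu"],
    "hidden_layer_sizes": [(10,), (23, 5)],
}
search = GridSearchCV(
    MLPRegressor(solver="lbfgs", max_iter=500, random_state=0),
    param_grid,
    scoring="neg_mean_squared_error",  # scikit-learn maximises scores, hence the sign
    cv=3,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(-search.best_score_)  # cross-validated MSE of the best combination
```

Because the score is cross-validated on the training data, the test set stays untouched until the final comparison.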


### Competition: Tune the hyperparameters to get the lowest mean squared error.


## Further reading 

Multi-layer perceptrons are the most basic form of feed-forward neural network. In the past five years, more complex neural network architectures have been applied to a variety of problems with very promising results, leading to the explosive growth of what is more commonly known as **deep learning**.