<a></a>
<div style="border-radius: 10px; border: 1px solid #0F9CF5; background-color: #232323; white-space: nowrap;">
    <p style="margin-top: -10px; margin-bottom: 0px; margin-left: 10px; font-size: 1.15em; padding: 10px; overflow: hidden;">
        <span style="color: orange; font-size: 2em;">&#9432;  </span>
        Click the <span style="color: orange;">Run All</span> <img style="max-height: 1.5em; border: 1px solid orange;" src="../img/RunAll.png" /> button in the toolbar above to run the code in this notebook 
    </p>
</div>

<a id="document-top"></a>
# BQuant Machine Learning Series Part 6 - Neural Networks

<a href='https://bloombergslides.com/view/mail?iID=bfmm22RM8tDTtX2FRzRt'>Video: Episode 6 - ML Series Video - Neural Networks</a>

In [None]:
import bql
bq = bql.Service()

import pandas as pd
import numpy as np

import math 
import matplotlib.pyplot as plt

# cache bql request on disk
import src.cache as cachereq
from src.shared import * ## Shared library for retrieving data via BQL for Machine Learning Series

%load_ext autoreload
%autoreload 2

### Initial set up - PLEASE READ
<font color='magenta'>The data is pre-cached on disk and is loaded automatically when you run the get_earnings_factors_nn() function. The query sources a significant amount of data from BQL, so to avoid running into data limit issues we strongly recommend that you do not modify the code below. You can examine the BQL code in folder src -> shared.py
</font>

In [None]:
cache = cachereq.CacheRequest(bq, {'cache_folder': 'data_neural_networks', 'cache_data_on_disk': True})

# src -> shared.py -> get_earnings_factors_nn()
data = get_earnings_factors_nn(cache=cache)
print(data.shape)
print("Quarterly data: 2010-12-31 - 2020-12-31 for SP500")

<h3>Earnings movement prediction</h3>

<h4>Forecast direction of next quarter earnings based on accounting information of the current quarter </h4>

#### Steps:
- Enhance data with additional information
- Preprocess the data
- Learn how to apply Neural Network algorithm on our dataset




In [None]:
data.head(7)

#### Enhance data:
- Compute the change in earnings per share: (Current Period EPS - Prior Period EPS)
- Assign 1 to a positive change in EPS and 0 otherwise
- Shift the label by -1: we will be using current financial data to predict the future change in earnings
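
The labeling steps above can be sketched on a toy series. The EPS numbers below are invented for illustration only; the column names simply mirror the ones used in this notebook.

```python
import pandas as pd

# Hypothetical EPS values for five consecutive quarters (illustrative only)
toy = pd.DataFrame({'EPS': [1.0, 1.2, 1.1, 1.5, 1.4]})

# Change in EPS: current period minus prior period
toy['change_in_EPS'] = toy['EPS'].diff()

# 1 for a positive change, 0 otherwise
toy['binary_change'] = (toy['change_in_EPS'] > 0).astype(int)

# Shift by -1 so each row carries the label of the NEXT quarter
toy['Future_change'] = toy['binary_change'].shift(-1)

print(toy)
```

The last row ends up with a missing `Future_change` because there is no following quarter to look at, which is why such rows are dropped later on.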


In [None]:
# Create binary column of positive and negative earnings changes
# Vectorized comparison; NaN > 0 evaluates to False, so missing changes map to 0
data['binary_change'] = (data['change_in_EPS'] > 0).astype(int)

# Shift date index by -1 so we are predicting future changes: 1 or 0
data['Future_change'] = data['binary_change'].shift(-1)

In [None]:
# Goal is to anticipate the sign of the future earnings change from the financial data of the current quarter.
# If the future earnings change is positive, we assign 1, otherwise 0, to the Future_change value of the current quarter
data[['EPS','change_in_EPS','Future_change']].head(6)

In [None]:
# Examine data 
data.describe()

In [None]:
# Replace infinity with nan
data = data.replace([np.inf, -np.inf], np.nan)

# Drop rows where change_in_EPS or Future_change is nan: they are of no use to us
data = data.dropna(subset = ['change_in_EPS', 'Future_change'])

# We no longer need these columns
data = data.drop(columns = ['EPS','change_in_EPS','binary_change'])

In [None]:
# Examine missing data
missing_column_data = 100*(data.isnull().sum() / data.shape[0]).round(3)
print('Percent of missing values per column:\n', missing_column_data)

In [None]:
# Drop 10 columns that have more than 35% of data missing
columns_to_drop = missing_column_data[missing_column_data > 35]
columns_to_drop

In [None]:
# Number of columns dropped, 10 
data = data.drop(columns = list(columns_to_drop.index))
print( f'New Dataframe shape : {data.shape}')

#### Preprocess data:
- Handle remaining missing values
- Minimize influence of outliers by performing Winsorization
- Standardize data 


Handle remaining missing data by replacing NaN by mean of the column

In [None]:
# Keep in mind that this is a naive way to handle missing values.
# This method can cause data leakage and does not factor in the covariance between features.
# For more robust methods, take a look at MICE or KNN

for col in data.columns:
    # Assign the result back; fillna(inplace=True) on a selected column triggers a chained-assignment warning in recent pandas
    data[col] = data[col].fillna(data[col].mean())
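
The comments in the cell above mention data leakage. One way to reduce it, sketched here with NumPy only, is to compute the column means on the training rows and reuse them for the test rows. The arrays `train` and `test` are made-up placeholders, not the notebook's data.

```python
import numpy as np

# Made-up feature matrices with missing values (illustrative only)
train = np.array([[1.0, 10.0],
                  [np.nan, 20.0],
                  [3.0, np.nan]])
test = np.array([[np.nan, 40.0],
                 [5.0, np.nan]])

# Column means learned from the training rows only, ignoring NaN
train_means = np.nanmean(train, axis=0)

# Fill NaN in both sets with the training means (broadcast per column)
train_filled = np.where(np.isnan(train), train_means, train)
test_filled = np.where(np.isnan(test), train_means, test)

print(train_means)
print(test_filled)
```

In this notebook that would mean splitting into train and test before imputing, which is a design choice rather than something the original code does.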

In [None]:
# First we need to split our data into train and test. 
from sklearn.model_selection import train_test_split

# Independent values/features
X = data.iloc[:,:-1].values
# Dependent values
y = data.iloc[:,-1].values

# Create test and train data sets, split data randomly into 20% test and 80% train
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

Winsorization transforms data by limiting extreme values, typically by setting all outliers to a specified percentile of data.
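
As a rough illustration of the idea, the sketch below clips a single made-up feature at its 10th and 90th percentiles using `np.percentile` and `np.clip`. This is percentile clipping, a close relative of what `scipy.stats.mstats.winsorize` does rather than an exact reproduction of it, and the wide limits are chosen only so the effect is visible on ten values.

```python
import numpy as np

# Illustrative training values with one extreme outlier
train_col = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 1000.0])
# Illustrative test values
test_col = np.array([0.0, 5.0, 2000.0])

# Percentile bounds estimated from the training values only
lower, upper = np.percentile(train_col, [10, 90])

# Values outside the bounds are pulled in to the bounds
train_wins = np.clip(train_col, lower, upper)
test_wins = np.clip(test_col, lower, upper)

print(lower, upper)
print(train_wins)
print(test_wins)
```

Reusing the training bounds on the test values keeps information from the test set out of the preprocessing step.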

In [None]:
from scipy.stats import mstats

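# Winsorize top 1% and bottom 1% of points in each feature column
# axis = 0 works column by column; the default axis = None would pool all features together
# Apply on X_train and X_test separately
X_train = mstats.winsorize(X_train, limits = [0.01, 0.01], axis = 0)
X_test = mstats.winsorize(X_test, limits = [0.01, 0.01], axis = 0)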

Standardize the data

$$z = \frac{x - \mu}{\sigma}$$

where $\mu$ is the mean and $\sigma$ is the standard deviation of the feature.

Standardization of a dataset is a common requirement for many machine learning estimators: they might behave badly if the individual features do not more or less look like standard normally distributed data (e.g. Gaussian with 0 mean and unit variance).
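
A small worked example of the formula, using made-up numbers. The mean and standard deviation are computed on the training rows and then reused for the test row; `np.std` with its default `ddof=0` is assumed here because that is the convention `StandardScaler` follows.

```python
import numpy as np

# Made-up data: two features on very different scales
train = np.array([[1.0, 100.0],
                  [2.0, 200.0],
                  [3.0, 300.0]])
test = np.array([[4.0, 400.0]])

# Statistics come from the training data only
mu = train.mean(axis=0)
sigma = train.std(axis=0)   # population standard deviation (ddof=0)

train_z = (train - mu) / sigma
test_z = (test - mu) / sigma

print(train_z.mean(axis=0), train_z.std(axis=0))
print(test_z)
```

After the transform both training features have mean 0 and unit variance, even though their raw scales differ by a factor of 100.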

In [None]:
# IMPORTANT: During testing, it is important to construct the test feature vectors using the means and standard deviations saved from
# the training data, rather than computing it from the test data. You must scale your test inputs using the saved means
# and standard deviations, prior to sending them to your Neural Networks library for classification.

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()

# Fit to training data and then transform it
X_train = sc.fit_transform(X_train)
# Perform standardization on testing data using mu and sigma from training data
X_test = sc.transform(X_test)

# Artificial Neural Networks

### A perceptron is a single-layer neural network; a multi-layer perceptron is what we call a neural network.

* First it sums values of each input x multiplied by weight w
* Weighted sum is passed through an activation function 
* Activation function "converts" output to binary output of 0 or 1
* Weights are measure of influence that each input has on the final output

<img src='img/perceptron.JPG'>
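
The steps above can be sketched as a few lines of NumPy. The weights and the bias term `b` are hand-picked, hypothetical values that make the unit behave like a logical AND; nothing here is learned from data.

```python
import numpy as np

def perceptron(x, w, b):
    """Weighted sum of the inputs followed by a step activation."""
    z = np.dot(w, x) + b
    return 1 if z > 0 else 0

# Hypothetical weights and bias (chosen by hand for illustration)
w = np.array([1.0, 1.0])
b = -1.5

inputs = [[0, 0], [0, 1], [1, 0], [1, 1]]
outputs = [perceptron(np.array(x), w, b) for x in inputs]

for x, out in zip(inputs, outputs):
    print(x, out)
```

Only the input `[1, 1]` pushes the weighted sum above zero, so it is the only one that switches the unit on.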

### What is an Activation Function ?

### Sigmoid function
* Activation function has "switch on" and "switch off" characteristic
* Moves from 0 to 1 depending on the input values of x
* Activation function adds non-linearity to the network


In [None]:
# The main reason we use the sigmoid function is that its output lies between 0 and 1. It is therefore especially useful for models that must output a probability, since a probability also lies between 0 and 1.
# The function is differentiable, which means we can find the slope of the sigmoid curve at any point.
# Four commonly used activation functions are sigmoid, hyperbolic tangent (tanh), ReLU and Softmax.

x = np.arange(-8, 8, 0.1)
f = 1 / (1 + np.exp(-x))
plt.plot(x, f)
plt.xlabel('x')
plt.ylabel('f(x)')
plt.title('Sigmoid function')
plt.show()
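
Because the sigmoid is differentiable, its slope has a closed form, s(z) * (1 - s(z)). The sketch below evaluates it at a few points and compares it with a finite-difference estimate; the helper name `sigmoid_slope` is introduced here for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_slope(z):
    """Derivative of the sigmoid: s(z) * (1 - s(z))."""
    s = sigmoid(z)
    return s * (1 - s)

z = np.array([-4.0, 0.0, 4.0])

# Central finite difference as an independent check of the formula
eps = 1e-6
numeric_slope = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)

print(sigmoid_slope(z))
print(numeric_slope)
```

The slope is largest at z = 0 and shrinks toward zero in both tails, which is where the curve flattens out.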

### Tanh function
* Maps values between -1 and 1
* tanh is also sigmoidal (s - shaped)


In [None]:
x = np.arange(-8, 8, 0.1)
f = np.tanh(x)
plt.plot(x, f)
plt.xlabel('x')
plt.ylabel('f(x)')
plt.title('Tanh function')
plt.show()

In [None]:
# Build sigmoid function for later use
# sigmoid(w*x + b) = 1/(1+e^-(wTx+b))
# z is (w*x + b)

def sigmoid(z):
    s = 1 / (1 + np.exp(-z))
    return s

##  Building blocks:

### Structure of ANN
<h5>Input Layer is where data enters the network</h5>  
<h5>Hidden Layers (two in the picture) are where weights (w) are applied to the inputs and the result is passed through an activation function such as sigmoid or ReLU</h5>  
<h5>Output Layer is where the network returns the outputs from the last layer</h5> 

<img src='img/nn_structure.jpg'>

<h2>The general methodology to build a Neural Network is to:</h2>  

1. Define the neural network structure ( # of input units,  # of hidden layers, etc). 
2. Initialize the model's parameters
3. Loop:
    - Implement forward propagation
    - Compute loss
    - Implement backward propagation to get the gradients
    - Update parameters (gradient descent)
    

<h4> 1 & 2 Define and Initialize model's parameters</h4> 

- n_x : size of the input layer
- n_h : size of the hidden layer
- n_y : size of the output layer

Initialize weights (w) with random values and bias (b) as zeros.
If we initialize all weights with 0, the derivative of the loss function will be the same for every w, so the hidden units stay identical to each other and cannot learn distinct features.
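
A quick way to see why random initialization matters is to compare first-layer gradients under different starting weights. The sketch below uses made-up data and a hypothetical helper, `first_layer_gradient`, with a tanh hidden layer and a sigmoid output; biases are zero and therefore left out.

```python
import numpy as np

def first_layer_gradient(W1, W2, X, Y):
    """Gradient of the cross entropy loss w.r.t. W1 (tanh hidden layer, sigmoid output)."""
    m = X.shape[1]
    A1 = np.tanh(W1 @ X)
    A2 = 1 / (1 + np.exp(-(W2 @ A1)))
    dZ2 = A2 - Y
    dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)
    return dZ1 @ X.T / m

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 5))                 # made-up inputs: 3 features, 5 samples
Y = np.array([[1.0, 0.0, 1.0, 1.0, 0.0]])   # made-up labels

# All-zero, constant and small random initializations
dW1_zero = first_layer_gradient(np.zeros((3, 3)), np.zeros((1, 3)), X, Y)
dW1_const = first_layer_gradient(np.full((3, 3), 0.5), np.full((1, 3), 0.5), X, Y)
dW1_rand = first_layer_gradient(rng.normal(size=(3, 3)) * 0.01,
                                rng.normal(size=(1, 3)) * 0.01, X, Y)

print(dW1_zero)
print(dW1_const)
print(dW1_rand)
```

Each row of the gradient belongs to one hidden unit. With zeros or any other constant the rows are identical, so the units would be updated in lockstep; with small random weights the rows differ.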

In [None]:
# Start with a basic network initialization

# Size of the input layer
n_x = 3
# Size of the hidden layer
n_h = 3
# Size of the output layer
n_y = 1


# W1 - weight matrix of shape (n_h, n_x)
W1 = np.random.randn(n_h,n_x) * 0.01

# b1 - bias vector of shape (n_h, 1)
b1 = np.zeros((n_h,1))

# W2 - weight matrix of shape (n_y, n_h)
W2 = np.random.randn(n_y,n_h) * 0.01
    
# b2 - bias vector of shape (n_y, 1)
b2 = np.zeros((n_y,1))

print("W1 = " + str(W1))
print("b1 = " + str(b1))
print("W2 = " + str(W2))
print("b2 = " + str(b2))

In [None]:
# Build function to store parameters for later use

def model_parameters(n_x, n_h, n_y): 
    W1 = np.random.randn(n_h, n_x) * 0.01
    b1 = np.zeros((n_h, 1))
    W2 = np.random.randn(n_y, n_h) * 0.01
    b2 = np.zeros((n_y, 1))
    # save to dictionary
    parameters = {'W1' : W1,
                  'b1' : b1,
                  'W2' : W2,
                  'b2' : b2}
    return parameters

### Forward propagation 
    
* Calculations in the model that take us from the input layer all the way to the output (how a NN makes predictions)
* Each independent feature x is passed to the 1st hidden layer, combined with a randomly initialized weight
* The 1st hidden layer applies an activation function, producing an output that becomes the input for the next hidden layer
* The next hidden layer repeats the step above and the calculation progresses forward
* The weights of a neuron can be thought of as weights between 2 layers 

<img src='img/forward_nn.JPG'>
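
It can help to trace the array shapes through one forward pass. The sketch below uses illustrative sizes and random inputs, with one column per sample, a tanh hidden layer and a sigmoid output.

```python
import numpy as np

rng = np.random.default_rng(0)
n_x, n_h, n_y, m = 3, 4, 1, 5   # illustrative sizes; m is the number of samples

X = rng.normal(size=(n_x, m))            # one column per sample
W1 = rng.normal(size=(n_h, n_x)) * 0.01
b1 = np.zeros((n_h, 1))
W2 = rng.normal(size=(n_y, n_h)) * 0.01
b2 = np.zeros((n_y, 1))

Z1 = W1 @ X + b1             # (n_h, m); b1 broadcasts across the m columns
A1 = np.tanh(Z1)             # (n_h, m)
Z2 = W2 @ A1 + b2            # (n_y, m)
A2 = 1 / (1 + np.exp(-Z2))   # (n_y, m); one probability per sample

for name, arr in [('X', X), ('Z1', Z1), ('A1', A1), ('Z2', Z2), ('A2', A2)]:
    print(name, arr.shape)
```

This is why the training data has to be transposed before it is fed to a network laid out this way: features run down the rows and samples across the columns.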

In [None]:
# Implement forward pass 
# parameters - dictionary of initial parameters
# X - input data

def forward_propagation(X, parameters):
    W1 = parameters["W1"]
    b1 = parameters["b1"]
    W2 = parameters["W2"]
    b2 = parameters["b2"]
    
    # Values from the picture above
    Z1 = np.dot(W1,X) + b1
    A1 = np.tanh(Z1)
    Z2 = np.dot(W2,A1) + b2
    # use previously built function sigmoid
    A2 = sigmoid(Z2)
    # save to dictionary
    fwd_pass_values = {"Z1" : Z1,
                       "A1" : A1,
                       "Z2" : Z2,
                       "A2" : A2}
    return A2, fwd_pass_values
    
    
    

Once the first forward pass has been completed and we have our prediction, how do we evaluate its accuracy? 

### Loss function
    
* It measures cost associated with an incorrect prediction
* Our goal is to find coefficients that minimize the loss function
* Cross entropy loss is used in classification problems 
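
To get a feel for the cross entropy loss, the sketch below evaluates it for a single prediction. The helper name `binary_cross_entropy` and the probabilities are chosen for illustration.

```python
import numpy as np

def binary_cross_entropy(p, y):
    """Loss for one sample: p is the predicted probability of class 1, y is the true label."""
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

# True label is 1 in both cases
loss_confident_right = binary_cross_entropy(0.9, 1)
loss_confident_wrong = binary_cross_entropy(0.1, 1)

print(loss_confident_right)
print(loss_confident_wrong)
```

A confident correct prediction costs little, while a confident wrong one is penalized heavily, which is what pushes the weights toward better probabilities.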

In [None]:
# Implement loss function
# cost = -(1/m) * Sum(y*log(a^[2](i)) + (1-y)*log(1-a^[2](i)))
# A2 - output of sigmoid 
# Y is a true output against which we'll be measuring the loss

def entropy_loss(A2, Y, parameters):
    m = Y.shape[1]
    log_prob = np.multiply(np.log(A2), Y) + np.multiply(np.log(1 - A2), 1 - Y)
    cost = -(1 / m) * np.sum(log_prob)
    # squeeze removes axes of length one from cost
    cost = float(np.squeeze(cost))
    return cost
    

### Backpropagation
* Mechanism for tuning the weights based on the loss function
* During training we want to find weights and biases that minimize the error (loss function)
* To measure how the loss function changes, we need to take the derivative of the loss function with respect to all the weights and biases.
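
One common sanity check for backpropagation is to compare the analytic derivatives with numerical ones. The sketch below does this for the first-layer weights of a tiny network on made-up data; the helper names `forward` and `loss` are introduced here and are not part of the notebook.

```python
import numpy as np

def forward(X, W1, b1, W2, b2):
    A1 = np.tanh(W1 @ X + b1)
    A2 = 1 / (1 + np.exp(-(W2 @ A1 + b2)))
    return A1, A2

def loss(X, Y, W1, b1, W2, b2):
    _, A2 = forward(X, W1, b1, W2, b2)
    return -np.mean(Y * np.log(A2) + (1 - Y) * np.log(1 - A2))

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 6))                     # made-up inputs
Y = (rng.random((1, 6)) > 0.5).astype(float)    # made-up labels
W1 = rng.normal(size=(2, 3)) * 0.5
b1 = np.zeros((2, 1))
W2 = rng.normal(size=(1, 2)) * 0.5
b2 = np.zeros((1, 1))

# Analytic gradient of the loss w.r.t. W1 via the chain rule
m = X.shape[1]
A1, A2 = forward(X, W1, b1, W2, b2)
dZ2 = A2 - Y
dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)
dW1 = dZ1 @ X.T / m

# Numerical gradient: nudge one weight at a time (central differences)
eps = 1e-6
dW1_num = np.zeros_like(W1)
for i in range(W1.shape[0]):
    for j in range(W1.shape[1]):
        W_plus, W_minus = W1.copy(), W1.copy()
        W_plus[i, j] += eps
        W_minus[i, j] -= eps
        dW1_num[i, j] = (loss(X, Y, W_plus, b1, W2, b2)
                         - loss(X, Y, W_minus, b1, W2, b2)) / (2 * eps)

print(np.max(np.abs(dW1 - dW1_num)))
```

If the two gradients disagree by more than a tiny amount, the backward pass most likely contains a mistake.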
    


In [None]:
# Implement function to measure derivatives
# Pass dictionary of  parameters, forward propagation values, input data and labeled data

def backward_propagation(parameters, fwd_pass_values, X, Y):
    m = X.shape[1]
    
    W1 = parameters["W1"]
    W2 = parameters["W2"]
    
    A1 = fwd_pass_values["A1"]
    A2 = fwd_pass_values["A2"]
    
    # Derivatives of loss func w.r.t parameters
    dZ2 = fwd_pass_values["A2"] - Y
    dW2 = 1 / m*np.dot(dZ2, fwd_pass_values["A1"].T)
    db2 = 1 / m*np.sum(dZ2, axis=1, keepdims=True)
    dZ1 = np.dot(W2.T, dZ2)*(1 - np.power(A1, 2))
    dW1 = 1 / m*np.dot(dZ1, X.T)
    db1 = 1 / m*np.sum(dZ1, axis=1, keepdims=True)
    
    gradients =       {"dW1" : dW1,
                       "db1" : db1,
                       "dW2" : dW2,
                       "db2" : db2}
    return gradients

Now that we have the derivatives, which measure the sensitivity of the loss function to changes in the parameters, how do we use them to update our weights and biases in order to decrease our loss?

### Gradient Descent
* Optimization algorithm used to find the values of parameters that minimize a cost function
* We can use it to iteratively update the weights, using all training samples on each iteration
* It takes into account the learning rate and the initial parameter values
* Learning rate controls the size of the step on each iteration
* parameter = parameter - learning rate * (derivative of loss function w.r.t parameter)
* The derivative (slope of the loss function) determines the direction and size of the change made to the parameter
* Ideally we want gradient descent to converge to the global optimum, where the derivative equals zero

<img src='img/gradient_nn.JPG'>
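
The update rule is easiest to see on a one-parameter problem. The sketch below minimizes the made-up function f(w) = (w - 3)^2, whose minimum is at w = 3; the starting point and learning rate are arbitrary choices.

```python
def grad(w):
    """Derivative of f(w) = (w - 3)^2."""
    return 2 * (w - 3)

w = 0.0               # arbitrary starting value
learning_rate = 0.1   # arbitrary step size

for step in range(100):
    # parameter = parameter - learning rate * derivative
    w = w - learning_rate * grad(w)

print(w)
```

Each step shrinks the distance to the minimum by a constant factor here. A much larger learning rate would overshoot and a much smaller one would need many more iterations.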

In [None]:
# parameters - dictionary with randomly initialized parameters 
# gradients - derivatives from backward_propagation function
# parameter = parameter - learning rate * (derivative of loss function w.r.t parameter)


def update_parameters(parameters, gradients, learning_rate = 1.1):
    W1 = parameters["W1"]
    b1 = parameters["b1"]
    W2 = parameters["W2"]
    b2 = parameters["b2"]
    
    dW1 = gradients["dW1"]
    db1 = gradients["db1"]
    dW2 = gradients["dW2"]
    db2 = gradients["db2"]
    
    W1 = W1 - learning_rate * dW1
    b1 = b1 - learning_rate * db1
    W2 = W2 - learning_rate * dW2
    b2 = b2 - learning_rate * db2
    
    parameters = {"W1": W1,
                  "b1": b1,
                  "W2": W2,
                  "b2": b2}
    
    return parameters
    

<h2>Combine functions above and build your first neural network model</h2>  

In [None]:
# Recall our dataset

print ('The shape of X_train: ' + str(X_train.shape))
print ('The shape of y_train: ' + str(y_train.shape))
print ('The shape of X_test: ' + str(X_test.shape))
print ('The shape of y_test: ' + str(y_test.shape))

In [None]:
# Reshape the data 

X_train_new = X_train.T
y_train_new = y_train.reshape(1, y_train.shape[0])
X_test_new = X_test.T
y_test_new = y_test.reshape(1, y_test.shape[0])

print ('The shape of X_train_new: ' + str(X_train_new.shape))
print ('The shape of y_train_new: ' + str(y_train_new.shape))
print ('The shape of X_test_new: ' + str(X_test_new.shape))
print ('The shape of y_test_new: ' + str(y_test_new.shape))

In [None]:
# size of input layer
n_x = X_train_new.shape[0]
# size of hidden layer
n_h = 4
# size of output layer
n_y = y_train_new.shape[0]

print("The size of the input layer is: n_x = " + str(n_x))
print("The size of the hidden layer is: n_h = " + str(n_h))
print("The size of the output layer is: n_y = " + str(n_y))

<h3>Use the model_parameters function to initialize parameters</h3>  

In [None]:
parameters = model_parameters(n_x, n_h, n_y)

print("W1 = " + str(parameters["W1"]))
print("b1 = " + str(parameters["b1"]))
print("W2 = " + str(parameters["W2"]))
print("b2 = " + str(parameters["b2"]))

<h3>Train Neural Network model</h3>  

In [None]:
# Number of iterations used in gradient descent for loop
num_iterations = 10000

for i in range(0, num_iterations):
    
    # Apply our forward propagation function
    A2, fwd_pass_values = forward_propagation(X_train_new, parameters)
    
    # Calculate cost associated with an incorrect prediction
    cost = entropy_loss(A2, y_train_new, parameters)
    
    # Apply backpropagation function to measure sensitivity of a loss function to parameters
    gradients = backward_propagation(parameters, fwd_pass_values, X_train_new, y_train_new)
    
    # Update parameters using Gradient descent 
    parameters = update_parameters(parameters, gradients)
    
    # Print cost for every 1000th iteration
    if i % 1000 == 0:
        print(i,cost)

<h3>Prediction</h3>  

Now that we have our updated parameters that minimize the entropy loss, use forward propagation to make a prediction

A2 is a vector of probabilities; recall that it is the output of sigmoid()

if A2 > 0.5 => 1, and 0 otherwise
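
A small sketch of the thresholding rule on made-up probabilities and labels (the names `A2_toy` and `y_toy` are placeholders):

```python
import numpy as np

# Made-up sigmoid outputs and true labels, shaped (1, m) like the network output
A2_toy = np.array([[0.2, 0.7, 0.5, 0.9]])
y_toy = np.array([[0, 1, 1, 1]])

# Strictly greater than 0.5 maps to class 1, everything else to class 0
pred_toy = (A2_toy > 0.5).astype(int)

# Share of predictions that match the labels
accuracy_toy = np.mean(pred_toy == y_toy)

print(pred_toy)
print(accuracy_toy)
```

Note that a probability of exactly 0.5 is assigned to class 0 because the comparison is strict.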


In [None]:
# Pass test data into forward_propagation function along with newly optimized parameters
A2, fwd_pass_values = forward_propagation(X_test_new, parameters)

predictions = (A2 > 0.5)

In [None]:
# Accuracy: share of test samples where the prediction matches the label
accuracy = np.mean(predictions == y_test_new) * 100
print('Accuracy: %d%%' % accuracy)

<h3>Neural Networks with scikit-learn </h3> 

In [None]:
# Import accuracy score
from sklearn.metrics import accuracy_score

#  Multi-layer Perceptron classifier contains one or more hidden layers and can learn non-linear functions. 
from sklearn.neural_network import MLPClassifier

# hidden_layer_sizes sets the number of hidden layers and the number of nodes in each of them
# max_iter is the maximum number of iterations; for the 'adam' solver this is the number of epochs
# activation sets the activation function for the hidden layers
# solver specifies the algorithm used for weight optimization

mlp = MLPClassifier(hidden_layer_sizes = (150,100,50), max_iter=300,activation = 'relu',solver = 'adam', random_state = 0)

# Train
mlp.fit(X_train,y_train)
# Predict 
y_pred = mlp.predict(X_test)
# Accuracy
accuracy = accuracy_score(y_test, y_pred)
print('Accuracy: {:.2f}'.format(accuracy))