# [Artificial Neural Networks](https://en.wikipedia.org/wiki/Artificial_neural_network)

- Artificial neural networks are computing systems vaguely inspired by the human brain.
- The field was opened by McCulloch and Pitts (1943), who created a computational model for neural networks.
- The network is built of neurons that are interconnected like a web.
- Each connection, like a synapse in a brain, can transmit a signal (a real number) to other neurons.
- Main types of neural networks:
  + [multilayer perceptron](https://en.wikipedia.org/wiki/Multilayer_perceptron) (old school but still useful)
  + [autoencoder](https://en.wikipedia.org/wiki/Autoencoder) (for dimension reduction and visualization)
  + [convolutional network](https://en.wikipedia.org/wiki/Convolutional_neural_network) (originally developed for image classification)
  + [recurrent network](https://en.wikipedia.org/wiki/Recurrent_neural_network) (developed for sequential data such as text; example: [LSTM](https://en.wikipedia.org/wiki/Long_short-term_memory))
  + [transformer](https://en.wikipedia.org/wiki/Transformer_(machine_learning_model)) (originally developed for machine translation)
  + competitive network (example: [GAN](https://en.wikipedia.org/wiki/Generative_adversarial_network))
  + ...

## Theory of the [Multilayer Perceptron](https://en.wikipedia.org/wiki/Multilayer_perceptron) in a Nutshell

<img src="../_img/mlp.jpg" width="320px">

- input: $x \in \mathbb{R}^{d \times 1}$<br>
- hidden layer: $h = \sigma(W^T x)$, where $W \in \mathbb{R}^{d \times K}$ and $\sigma$ is the [logistic sigmoid function](https://en.wikipedia.org/wiki/Logistic_function)<br>
- model output: $\hat{y} = \sigma(v^T h)$, where $v \in \mathbb{R}^{K \times 1}$<br>
- the parameters of the model are the matrix $W$ (hidden weights) and the vector $v$ (output weights)
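
The forward pass above can be sketched in a few lines of NumPy. The sizes `d = 4` and `K = 3` and the random weights are illustrative assumptions, not values from the exercise.

```python
import numpy as np

def sigmoid(t):
    # logistic sigmoid, applied elementwise
    return 1 / (1 + np.exp(-t))

d, K = 4, 3                      # illustrative sizes (assumed)
rs = np.random.RandomState(0)
x = rs.normal(size=d)            # one input vector
W = rs.normal(size=(d, K))       # hidden weights
v = rs.normal(size=K)            # output weights

h = sigmoid(W.T @ x)             # hidden layer activations, shape (K,)
yhat = sigmoid(v @ h)            # model output, a scalar in (0, 1)
print(h.shape, float(yhat))
```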

<hr>

- objective function: $CE(W, v) = \sum_{i=1}^n \left( -y_i\log(\hat{y}_i) - (1 - y_i)\log(1 - \hat{y}_i) \right)$<br>
- derivative by $v$: $\frac{d}{dv} CE(W, v) = \sum_{i=1}^n(\hat{y}_i - y_i) h_i$<br>
- derivative by $W$: $\frac{d}{dW} CE(W, v) = \sum_{i=1}^n x_i \varepsilon_i^T$, where $\varepsilon_i = (\hat{y}_i - y_i) v \odot h_i \odot(1 - h_i)$ is the backpropagated error
- the approximate minimization of $CE$ can be done e.g. by [stochastic gradient descent](https://en.wikipedia.org/wiki/Stochastic_gradient_descent)
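
One way to gain confidence in the two gradient formulas is to compare them with central finite differences on a single random example. This is a minimal sketch: the sizes, the random data and the helper `numeric_grad` are assumptions made for illustration.

```python
import numpy as np

def sigmoid(t):
    return 1 / (1 + np.exp(-t))

def ce(W, v, x, y):
    # cross-entropy of a single example (x, y)
    yhat = sigmoid(v @ sigmoid(W.T @ x))
    return -y * np.log(yhat) - (1 - y) * np.log(1 - yhat)

rs = np.random.RandomState(0)
d, K = 4, 3                                  # illustrative sizes (assumed)
x, y = rs.normal(size=d), 1.0
W, v = rs.normal(size=(d, K)), rs.normal(size=K)

# analytic gradients, following the formulas above
h = sigmoid(W.T @ x)
yhat = sigmoid(v @ h)
grad_v = (yhat - y) * h
eps = (yhat - y) * v * h * (1 - h)           # backpropagated error
grad_W = np.outer(x, eps)

def numeric_grad(f, theta, delta=1e-6):
    # hypothetical helper: central finite differences, one coordinate at a time
    g = np.zeros_like(theta)
    for idx in np.ndindex(theta.shape):
        step = np.zeros_like(theta)
        step[idx] = delta
        g[idx] = (f(theta + step) - f(theta - step)) / (2 * delta)
    return g

num_v = numeric_grad(lambda t: ce(W, t, x, y), v)
num_W = numeric_grad(lambda t: ce(t, v, x, y), W)
max_err_v = np.max(np.abs(grad_v - num_v))
max_err_W = np.max(np.abs(grad_W - num_W))
print(max_err_v, max_err_W)
```

Both maximum errors should be close to zero if the analytic formulas and the code agree.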


## The Phishing Websites Problem

The [Phishing Websites](https://archive.ics.uci.edu/ml/machine-learning-databases/00327/Training%20Dataset.arff) data set contains certain attributes of web sites. The target attribute is the last column. It specifies whether the site is legitimate (-1) or phishing (+1). Our goal will be to build an artificial neural network that predicts the value of the target attribute.

**Exercise 1**: Load the Phishing Websites data set to a data frame. Prepare the input matrix and the target vector.

In [1]:
# Load data.
import pandas as pd
from urllib.request import urlopen

url = 'https://archive.ics.uci.edu/'\
      'ml/machine-learning-databases/00327/Training%20Dataset.arff'
lines = urlopen(url).read().decode('utf-8').split('\r\n')
names = [l.split()[1] for l in lines if l.startswith('@att')]
skiprows = lines.index('@data') + 1
df = pd.read_csv(url, names=names, skiprows=skiprows)
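
The header parsing used above can be tried offline on a small in-memory ARFF snippet. The snippet below is a made-up toy (three attributes, two rows), not the real data set. It only illustrates how `names` and `skiprows` are derived.

```python
import io
import pandas as pd

# a tiny hypothetical ARFF snippet, not the real data set
arff = "\n".join([
    "@relation toy",
    "@attribute having_IP_Address { -1,1 }",
    "@attribute URL_Length { 1,0,-1 }",
    "@attribute Result { -1,1 }",
    "@data",
    "-1,1,-1",
    "1,0,1",
])
lines = arff.split("\n")

# the attribute name is the second token of each '@attribute' line
names = [l.split()[1] for l in lines if l.startswith("@att")]

# the data rows start right after the '@data' line
skiprows = lines.index("@data") + 1
toy = pd.read_csv(io.StringIO(arff), names=names, skiprows=skiprows)
print(toy)
```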

In [2]:
df.shape

(11055, 31)

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11055 entries, 0 to 11054
Data columns (total 31 columns):
 #   Column                       Non-Null Count  Dtype
---  ------                       --------------  -----
 0   having_IP_Address            11055 non-null  int64
 1   URL_Length                   11055 non-null  int64
 2   Shortining_Service           11055 non-null  int64
 3   having_At_Symbol             11055 non-null  int64
 4   double_slash_redirecting     11055 non-null  int64
 5   Prefix_Suffix                11055 non-null  int64
 6   having_Sub_Domain            11055 non-null  int64
 7   SSLfinal_State               11055 non-null  int64
 8   Domain_registeration_length  11055 non-null  int64
 9   Favicon                      11055 non-null  int64
 10  port                         11055 non-null  int64
 11  HTTPS_token                  11055 non-null  int64
 12  Request_URL                  11055 non-null  int64
 13  URL_of_Anchor                11055 non-null  int64
 ...

In [4]:
df.describe().T

                               count       mean       std   min   25%   50%   75%   max
having_IP_Address            11055.0   0.313795  0.949534  -1.0  -1.0   1.0   1.0   1.0
URL_Length                   11055.0  -0.633198  0.766095  -1.0  -1.0  -1.0  -1.0   1.0
Shortining_Service           11055.0   0.738761  0.673998  -1.0   1.0   1.0   1.0   1.0
having_At_Symbol             11055.0   0.700588  0.713598  -1.0   1.0   1.0   1.0   1.0
double_slash_redirecting     11055.0   0.741474  0.671011  -1.0   1.0   1.0   1.0   1.0
Prefix_Suffix                11055.0  -0.734962  0.678139  -1.0  -1.0  -1.0  -1.0   1.0
having_Sub_Domain            11055.0   0.063953  0.817518  -1.0  -1.0   0.0   1.0   1.0
SSLfinal_State               11055.0   0.250927  0.911892  -1.0  -1.0   1.0   1.0   1.0
Domain_registeration_length  11055.0  -0.336771  0.941629  -1.0  -1.0  -1.0   1.0   1.0
Favicon                      11055.0   0.628584  0.777777  -1.0   1.0   1.0   1.0   1.0


In [5]:
df.groupby('Result').size()

Result
-1    4898
 1    6157
dtype: int64

In [6]:
(df.groupby('Result').size() / len(df)) * 100

Result
-1    44.305744
 1    55.694256
dtype: float64

**Exercise 2**: Implement a multilayer perceptron classifier from scratch! Use stochastic gradient descent for training. Evaluate the model on the Phishing Websites data set using a 70%-30% train-test split!

In [7]:
# target vector (labels encoded as 0 and 1)
y = (df['Result'].values + 1) // 2

# input matrix
X = df[df.columns[:-1]].values

y.mean(), X.mean()

(0.5569425599276345, 0.30409769335142467)

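In [8]:
import numpy as np
from sklearn.metrics import log_loss

def sigmoid(t):
    # logistic sigmoid, applied elementwise
    return 1 / (1 + np.exp(-t))

n, d = X.shape                  # number of examples (rows) and features (columns)

random_state = 42
sigma = 0.01                    # initialization range
K = 10                          # number of hidden neurons
eta = 0.05                      # step size (learning rate)
E = 5                           # number of epochs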


# initialize model parameters
rs = np.random.RandomState(random_state)
W = rs.uniform(-sigma, sigma, (d, K))
v = rs.uniform(-sigma, sigma, K)

for ep in range(E):
    # display loss function at the start of the epoch:
    # sum_i (-y_i * log(yhat_i) - (1 - y_i) * log(1 - yhat_i))
    yhat = sigmoid(sigmoid(X @ W) @ v)
    ce = log_loss(y, yhat) * n
    print(ep, ce)
    
    
    for i in range(n):
        x = X[i]
        yi = y[i]

        # compute prediction
        h = sigmoid(W.T @ x)           # compute hidden layer activations
        yhat = sigmoid(v @ h)           # compute model output

        # compute gradient
        grad_v = (yhat - yi) * h
        eps = (yhat - yi) * v * h * (1 - h)
        grad_W = np.outer(x, eps)

        # take step to negative gradient direction
        v -= eta * grad_v
        W -= eta * grad_W

0 7660.533969274072
1 2864.494838766013
2 2436.7369798779664
3 2192.939110475737
4 2076.44849334465


(4.881324340437781, 16.95649897744621)

In [33]:
import numpy as np
from sklearn.metrics import log_loss
from sklearn.base import BaseEstimator

def sigmoid(t):
    return 1 / (1 + np.exp(-t))

class SimpleMLPClassifier(BaseEstimator):
    def __init__(self, random_state=42, sigma=0.01, K=10, eta=0.05, E=5):
        self.random_state = random_state
        self.sigma = sigma                  # initialization range
        self.K = K                          # number of hidden neurons
        self.eta = eta                      # step size (learning rate)
        self.E = E                          # number of epochs
    
    def fit(self, X, y):
        n, d = X.shape                      # number of examples (rows) and features (columns)
        self.classes_ = np.unique(y)        # sorted class labels, as scikit-learn expects
        
        # initialize model parameters
        rs = np.random.RandomState(self.random_state)
        self.W = rs.uniform(-self.sigma, self.sigma, (d, self.K))
        self.v = rs.uniform(-self.sigma, self.sigma, self.K)

        for ep in range(self.E):
            # display loss function at the start of the epoch:
            # sum_i (-y_i * log(yhat_i) - (1 - y_i) * log(1 - yhat_i))
            yhat = self._predict_proba(X)
            ce = log_loss(y, yhat) * n
            print(ep, ce)


            for i in range(n):
                x = X[i]
                yi = y[i]

                # compute prediction
                h = sigmoid(self.W.T @ x)           # compute hidden layer activations
                yhat = sigmoid(self.v @ h)           # compute model output

                # compute gradient
                grad_v = (yhat - yi) * h
                eps = (yhat - yi) * self.v * h * (1 - h)
                grad_W = np.outer(x, eps)

                # take step to negative gradient direction
                self.v -= self.eta * grad_v
                self.W -= self.eta * grad_W

        return self                         # scikit-learn convention: fit returns the estimator

    def _predict_proba(self, X):
        return sigmoid(sigmoid(X @ self.W) @ self.v)
    
    def predict_proba(self, X):
        yhat = self._predict_proba(X)
        return np.array([1 - yhat, yhat]).T
    
    def predict(self, X):
        return (self._predict_proba(X) > 0.5).astype('int')
    
    
cl = SimpleMLPClassifier()
cl.fit(X, y)
cl.predict(X)

0 7660.533969274072
1 2864.494838766013
2 2436.7369798779664
3 2192.939110475737
4 2076.44849334465


array([0, 1, 0, ..., 0, 0, 0])

In [34]:
cl.predict_proba(X)

array([[9.97347427e-01, 2.65257258e-03],
       [8.70060970e-02, 9.12993903e-01],
       [9.17052022e-01, 8.29479784e-02],
       ...,
       [8.73943262e-01, 1.26056738e-01],
       [9.99313105e-01, 6.86895175e-04],
       [9.99403519e-01, 5.96481458e-04]])

In [35]:
# implement train-test split
from sklearn.model_selection import ShuffleSplit, cross_val_score
ss = ShuffleSplit(1, test_size=0.3, random_state=42)

cl = SimpleMLPClassifier()
cross_val_score(cl, X, y, cv=ss, scoring='neg_log_loss')

0 5362.186153262046
1 1466.141647392933
2 1374.0695043638668
3 1292.2768450088029
4 1215.9266161274693


array([-0.1643323])
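
The per-epoch printout and the `cross_val_score` result use different conventions. The printout is the cross-entropy summed over the examples the model is fitted on. The `'neg_log_loss'` score is the mean log loss on the held-out examples with the sign flipped, so that larger is better. A small sketch with made-up probabilities shows how the two forms relate:

```python
import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([0, 1, 1, 0])
p = np.array([0.1, 0.8, 0.6, 0.3])     # toy predicted probabilities of class 1 (assumed)

mean_ce = log_loss(y_true, p)          # average cross-entropy per example
sum_ce = mean_ce * len(y_true)         # summed form, as in the CE formula

# the same sum, written out term by term
manual = np.sum(-y_true * np.log(p) - (1 - y_true) * np.log(1 - p))

neg_mean = -mean_ce                    # sign convention of the 'neg_log_loss' scorer
print(sum_ce, manual, neg_mean)
```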

**Exercise 3**: Compare the previous solution against scikit-learn's `MLPClassifier`!

In [40]:
from sklearn.neural_network import MLPClassifier
cl = MLPClassifier(
    hidden_layer_sizes=10, learning_rate_init=0.05, max_iter=5, 
    batch_size=1, random_state=42, solver='sgd', activation='logistic',
    alpha=0, momentum=0
    )
cross_val_score(cl, X, y, cv=ss, scoring='neg_log_loss')



array([-0.18149219])

In [41]:
from sklearn.neural_network import MLPClassifier
cl = MLPClassifier(
    hidden_layer_sizes=10, learning_rate_init=0.05, 
    batch_size=1, random_state=42, solver='sgd', activation='logistic',
    alpha=0, momentum=0,
    )
cross_val_score(cl, X, y, cv=ss, scoring='neg_log_loss')



array([-0.13387578])

**Exercise 4**: Optimize the hyperparameters of the neural network!

In [51]:
from sklearn.neural_network import MLPClassifier
cl = MLPClassifier(
    hidden_layer_sizes=400,
    max_iter=800,
    random_state=42,
    )
cross_val_score(cl, X, y, cv=ss, scoring='neg_log_loss')

array([-0.08753574])

In [None]:
from sklearn.neural_network import MLPClassifier
cl = MLPClassifier(
    hidden_layer_sizes=500,
    max_iter=800,
    random_state=42,
    )
cross_val_score(cl, X, y, cv=ss, scoring='neg_log_loss')
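
Instead of trying settings by hand, the search could be automated with `GridSearchCV`. The sketch below runs on a small synthetic data set so that it finishes quickly. The data, the grid values and `max_iter=200` are illustrative assumptions, not tuned choices for the phishing data.

```python
import warnings
from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.model_selection import GridSearchCV, ShuffleSplit
from sklearn.neural_network import MLPClassifier

# synthetic stand-in for the phishing data (an assumption, to keep this fast)
X_toy, y_toy = make_classification(n_samples=300, n_features=10, random_state=42)

# candidate hyperparameter values (illustrative only)
param_grid = {
    'hidden_layer_sizes': [(5,), (20,)],
    'alpha': [1e-4, 1e-2],
}

# same evaluation protocol as above: a single 70%-30% split
ss_toy = ShuffleSplit(1, test_size=0.3, random_state=42)
search = GridSearchCV(
    MLPClassifier(max_iter=200, random_state=42),
    param_grid, cv=ss_toy, scoring='neg_log_loss',
)

with warnings.catch_warnings():
    # short training runs may stop before convergence; silence that warning here
    warnings.simplefilter('ignore', ConvergenceWarning)
    search.fit(X_toy, y_toy)

print(search.best_params_, search.best_score_)
```

On the real data, `X_toy, y_toy` would be replaced by `X, y` and `ss_toy` by `ss`. With a single split the selected setting is tuned to that one test set, so its score is an optimistic estimate.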

In [42]:
MLPClassifier?