# Code

In [6]:
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
from sklearn.svm import LinearSVC

## Neural Networks 


- Importing dataset
- Splitting dataset into features and labels
- Splitting dataset into test set and training set

In [2]:
X, y = make_moons(n_samples=100, noise=0.25, random_state=3)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

- Defining a simple neural network, also called a **multilayer perceptron** (MLP)
- Parameters for `MLPClassifier`:
  - `solver` - algorithm used to fit the weights
    - `lbfgs` - fast for small datasets
    - `sgd`
    - `adam` (default) - good for large datasets
  - `hidden_layer_sizes` - number of hidden layers and the number of units in each
    - `[100]` is the default, which is more than this small dataset needs
    - `[10]` is 1 hidden layer with 10 units, `[10, 10]` is 2 hidden layers with 10 units each
    - The length of the list is the number of hidden layers; each number in the list is the number of units in that layer

In [3]:
mlp: MLPClassifier = MLPClassifier(solver='lbfgs', activation='tanh', random_state=0, hidden_layer_sizes=[10]).fit(X_train, y_train)
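To make the meaning of `hidden_layer_sizes` concrete, the sketch below fits a variant of the model with two hidden layers and inspects the shapes of the fitted weight matrices in `coefs_`. The choice of `[10, 10]` and `max_iter=1000` are assumptions made for illustration; they are not settings used elsewhere in these notes.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_moons(n_samples=100, noise=0.25, random_state=3)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Two hidden layers with 10 units each (illustrative choice);
# max_iter is raised from its default as an assumption, to give lbfgs room to converge
mlp_two_layers = MLPClassifier(solver='lbfgs', activation='tanh', random_state=0,
                               hidden_layer_sizes=[10, 10], max_iter=1000)
mlp_two_layers.fit(X_train, y_train)

# One weight matrix per pair of consecutive layers:
# input (2 features) -> hidden 1 (10) -> hidden 2 (10) -> output (1)
weight_shapes = [w.shape for w in mlp_two_layers.coefs_]
print("Weight shapes: ", weight_shapes)

test_accuracy = mlp_two_layers.score(X_test, y_test)
print("Accuracy on test set: ", test_accuracy)
```

The list has three entries because a network with two hidden layers has three sets of connections.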

## Linear Support Vector Machine (SVM)

## Kernel Support Vector Machine (SVM)

In [7]:
from sklearn.svm import SVC
from sklearn.datasets import load_breast_cancer

In [8]:
cancer = load_breast_cancer()  # returns a dict-like Bunch (with .data and .target), not an ndarray
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=0)

In [9]:
svc: SVC = SVC()
svc.fit(X_train, y_train)

- The model does not perform very well, and for other random states it may overfit quite substantially
- While SVMs often perform quite well, they are very sensitive to the settings of their parameters and to the scaling of the data
  - They require all the features to vary on a similar scale
  - In this dataset, however, the features are on vastly different scales
- Differing scales are somewhat of a problem for other models too (such as linear models)
  - For the kernel SVM, however, the effect is often devastating

In [11]:
print("Accuracy on training set: ", (svc.score(X_train, y_train)))
print("Accuracy on test set: ", (svc.score(X_test, y_test)))

Accuracy on training set:  0.903755868544601
Accuracy on test set:  0.9370629370629371


### Normalizing Data

- After normalizing the data, `SVC` is much more accurate

In [12]:
from sklearn.preprocessing import MinMaxScaler

In [13]:
scaler: MinMaxScaler = MinMaxScaler()
scaler.fit(X_train)
X_train_scaled: np.ndarray = scaler.transform(X_train)
X_test_scaled: np.ndarray = scaler.transform(X_test)
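As a rough illustration of what `MinMaxScaler` computes with its default settings, the NumPy sketch below rescales each feature with the minimum and range of the training set and then applies the same statistics to the test set. The toy arrays are hypothetical and only serve to show the arithmetic.

```python
import numpy as np

# Hypothetical toy data: two features on very different scales
toy_train = np.array([[1.0, 100.0], [3.0, 300.0], [5.0, 500.0]])
toy_test = np.array([[2.0, 600.0]])

# The statistics come from the training set only
train_min = toy_train.min(axis=0)
train_range = toy_train.max(axis=0) - train_min

# Each training feature now lies in [0, 1]
toy_train_scaled = (toy_train - train_min) / train_range
# The test set reuses the training statistics, so it may fall outside [0, 1]
toy_test_scaled = (toy_test - train_min) / train_range

print(toy_train_scaled)
print(toy_test_scaled)
```

Reusing the training statistics on the test set is why the scaler is fitted on `X_train` only and then used to transform both sets.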

In [14]:
svc.fit(X_train_scaled, y_train)

In [15]:
print("Accuracy on training set: ", (svc.score(X_train_scaled, y_train)))
print("Accuracy on test set: ", (svc.score(X_test_scaled, y_test)))

Accuracy on training set:  0.9835680751173709
Accuracy on test set:  0.972027972027972


- Normalizing the data has made a big improvement
- Now, it is possible to try increasing either `C` or `gamma` to fit a more complex model
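The role of `gamma` can be sketched with the RBF kernel, which is the default kernel of `SVC`: $k(x, x') = \exp(-\gamma \lVert x - x' \rVert^2)$. The helper `rbf_kernel_value` and the two points below are hypothetical, written only to show how the similarity between two fixed points changes with `gamma`.

```python
import math

def rbf_kernel_value(x1, x2, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x1, x2))
    return math.exp(-gamma * sq_dist)

# Two hypothetical points at distance 1 from each other
p, q = (0.0, 0.0), (1.0, 0.0)

for gamma in (0.1, 1.0, 10.0):
    # A larger gamma makes the similarity decay faster with distance
    print(gamma, rbf_kernel_value(p, q, gamma))
```

With a large `gamma`, only very close points count as similar, so the decision boundary can bend around individual samples, which corresponds to a more complex model.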

In [16]:
svc: SVC = SVC(C=100)
svc.fit(X_train_scaled, y_train)

- Increasing `C` led to overfitting: training accuracy rises to 1.0 while test accuracy drops slightly (from about 0.972 to 0.965)

In [17]:
print("Accuracy on training set: ", (svc.score(X_train_scaled, y_train)))
print("Accuracy on test set: ", (svc.score(X_test_scaled, y_test)))

Accuracy on training set:  1.0
Accuracy on test set:  0.965034965034965


## Multiclass Classification

- Apply the one-vs-rest method to a simple three-class classification dataset
- The dataset is 2D and each class is given by data sampled from a Gaussian distribution

In [18]:
from sklearn.datasets import make_blobs

In [19]:
X, y = make_blobs(random_state=42)

In [20]:
svm: LinearSVC = LinearSVC().fit(X, y)

In [22]:
print("Coefficient shape: ", svm.coef_.shape)
print("Intercept shape: ", svm.intercept_.shape)
print("Coefficient: ", svm.coef_)

Coefficient shape:  (3, 2)
Intercept shape:  (3,)
Coefficient:  [[-0.17492239  0.23141345]
 [ 0.47621217 -0.06937504]
 [-0.18913912 -0.20400513]]
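Each row of `coef_` together with the matching entry of `intercept_` defines one "class versus rest" linear scoring function. The sketch below, which refits the same model so that it runs on its own, computes the three scores by hand and picks the class with the highest score; the variable names `scores` and `manual_pred` are introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import LinearSVC

X, y = make_blobs(random_state=42)
svm = LinearSVC().fit(X, y)

# One score per class for every sample: shape (n_samples, n_classes)
scores = X @ svm.coef_.T + svm.intercept_

# One-vs-rest decision: the class whose scoring function is largest wins
manual_pred = svm.classes_[np.argmax(scores, axis=1)]

print("Scores shape: ", scores.shape)
print("Matches svm.predict: ", np.array_equal(manual_pred, svm.predict(X)))
```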


## Exercises

Print out the array `svm.coef_`, as in In [22]. Notice that the signs of the two numbers in rows 0 and 1 are opposite, while the signs of the two numbers in row 2 coincide. Explain briefly why.

In [23]:
print("Coefficient: ", svm.coef_)

Coefficient:  [[-0.17492239  0.23141345]
 [ 0.47621217 -0.06937504]
 [-0.18913912 -0.20400513]]


# Revision Questions

## Question 1
What is the model for neural networks? You may assume that the number of hidden layers is 2. Describe its parameters.

```
h[0] := tanh(w[0, 0] * x[0] + w[0, 1] * x[1] + w[0, 2] * x[2] + w[0, 3] * x[3] + b[0])
h[1] := tanh(w[1, 0] * x[0] + w[1, 1] * x[1] + w[1, 2] * x[2] + w[1, 3] * x[3] + b[1])
h[2] := tanh(w[2, 0] * x[0] + w[2, 1] * x[1] + w[2, 2] * x[2] + w[2, 3] * x[3] + b[2])
ŷ := v[0] * h[0] + v[1] * h[1] + v[2] * h[2] + b
```
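- Each hidden unit computes a weighted sum of the inputs, as a linear model does, and then applies the non-linear activation function $\tanh$, which makes the overall model non-linear
- The parameters are the input-to-hidden weights `w`, the hidden biases `b[0]`, `b[1]`, `b[2]`, the hidden-to-output weights `v` and the output bias `b`
- The equations above show a single hidden layer with three units and four inputs; a second hidden layer would apply the same pattern to `h`, with its own weights and biases, before the output is computed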
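The equations for `h` and `ŷ` can be written compactly with matrix operations. The NumPy sketch below assumes one hidden layer of three units over four inputs, as in the equations; the function name `mlp_forward` and the random weights are hypothetical and stand in for values that training would normally produce.

```python
import numpy as np

def mlp_forward(x, W, b_hidden, v, b_out):
    """Forward pass of a one-hidden-layer network with tanh activation."""
    h = np.tanh(W @ x + b_hidden)   # hidden units h[0], h[1], h[2]
    return float(v @ h + b_out)     # linear combination of the hidden units

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))         # w[i, j]: weight from input j to hidden unit i
b_hidden = rng.normal(size=3)       # b[0], b[1], b[2]
v = rng.normal(size=3)              # v[0], v[1], v[2]
b_out = 0.5                         # output bias b

x = np.array([1.0, -2.0, 0.5, 3.0])
print(mlp_forward(x, W, b_hidden, v, b_out))
```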

## Question 2
Give an advantage and three disadvantages of neural networks.

**Advantage**
- Neural networks are able to capture information contained in large amounts of data and build incredibly complex models.

**Disadvantages**
- They often take a long time to train
- They require careful preprocessing of the data
- They work best with homogeneous data, where all the features have similar meanings

## Question 3
What is a linear scoring function? How can it be used for classifying test samples into positive and negative?

- **Scoring Function** - a function that maps the input data to a real-valued output. The output of the scoring function is used to determine the class label of the input data
- **Linear Scoring Function** - a scoring function of the form $f(x) = w \cdot x + b$, where $w$ is a vector of weights and $b$ is the intercept
- The output of the scoring function can be used to classify samples into positive and negative:
  - If the output of the scoring function is greater than 0, the sample is classified as positive
  - If the output of the scoring function is less than 0, the sample is classified as negative

## Question 4
Suppose we have a linear scoring function with parameters $b = -1$ and $w = (-2, 1, 0, 3)$. The test sample is $x^* = (0, 2, -1, 1)$. Calculate the predicted label for $x^*$.

$w \cdot x^* = (-2)(0) + (1)(2) + (0)(-1) + (3)(1) = 5$

$f(x^*) = w \cdot x^* + b = 5 - 1 = 4 > 0$
- The sample is classified as positive
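The calculation can be checked with a few lines of plain Python. The helper names `linear_score` and `predict_label` are hypothetical, and treating a score of exactly 0 as negative is an assumption made only so that the function always returns a label.

```python
def linear_score(w, x, b):
    """Linear scoring function f(x) = w . x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def predict_label(score):
    # Assumption: a score of exactly 0 is treated as negative
    return "positive" if score > 0 else "negative"

w = (-2, 1, 0, 3)
b = -1
x_star = (0, 2, -1, 1)

score = linear_score(w, x_star, b)
print(score, predict_label(score))  # 4 positive
```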

## Question 5
How would you interpret the magnitude of a linear scoring function?

- If $f(x^*)$ is far from $0$, then $x^*$ lies far from the hyperplane, so we can be confident in its class assignment
- If $f(x^*)$ is close to $0$, then $x^*$ lies close to the hyperplane, so its class assignment is less certain

## Question 6
What is the margin of a given separating hyperplane?

- **Margin** - the smallest perpendicular distance from any training sample to the hyperplane
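For a hyperplane $w \cdot x + b = 0$, the perpendicular distance of a sample $x$ from it is $|w \cdot x + b| / \lVert w \rVert$, so the margin is the minimum of this quantity over the training samples. The sketch below uses a hypothetical hyperplane and three hypothetical points; the function name `margin` is introduced here for illustration.

```python
import numpy as np

def margin(X, w, b):
    """Smallest perpendicular distance from the rows of X to the hyperplane w . x + b = 0."""
    distances = np.abs(X @ w + b) / np.linalg.norm(w)
    return distances.min()

# Hypothetical hyperplane 3*x1 + 4*x2 = 0 and three sample points
w = np.array([3.0, 4.0])
b = 0.0
X_points = np.array([[1.0, 0.0], [0.0, 2.0], [-1.0, -1.0]])

# Distances are 0.6, 1.6 and 1.4, so the margin is about 0.6
print(margin(X_points, w, b))
```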

## Question 7
Define the maximum margin classifier.

- **Maximum Margin Hyperplane** - the separating hyperplane for which the margin is largest.
- **Maximum Margin Classifier** - classify a test sample based on which side of the maximum margin hyperplane it lies on

## Question 8
What is meant by the optimal separating hyperplane?

- **Optimal Separating Hyperplane** is the same as *maximum margin hyperplane*

## Question 9
How is the optimal separating hyperplane used for classification?

- The optimal separating hyperplane is the separating hyperplane with the largest margin between the positive and negative training samples
- A test sample is classified according to which side of this hyperplane it lies on: positive on one side, negative on the other

## Question 10
Define the notion of a support vector in the context of maximum margin classifiers.

- **Support Vector** - the training samples from each class that lie closest to the separating hyperplane, on the edge of the margin; they alone determine the position of the `slab`

## Question 11
How is the maximum margin classifier a special case of the soft margin classifier?

- The maximum margin classifier is the soft margin classifier for $C = \infty$, where $C$ is the tuning parameter that penalizes margin violations; with an infinite penalty, no violations are allowed.