# Workshop 9: ANNs
---

## 1) Neural Network Playground (Follow: Explore individually; Discuss as a Class)

First, go to Tensorflow's [Neural Network Playground](https://playground.tensorflow.org/). This website is an interactive and exploratory visualization of how the features, number of layers, training time, etc, influence the classification boundries of an ANN. Right now, we'll only worry ourselves with *classification* problems.

Play with the visualization, and then answer the following questions below.

### Scenarios

1. Using the default network topology, try training the network with the different activation functions (ReLU, Tanh, Sigmoid, Linear). What effect does the activation function have on the training time? What effect does the activation function have on the shape of the classification boundries?
2. Take a look at [this setup](https://playground.tensorflow.org/#activation=tanh&batchSize=10&dataset=xor&regDataset=reg-plane&learningRate=0.03&regularizationRate=0&noise=0&networkShape=4,2&seed=0.21855&showTestData=false&discretize=false&percTrainData=50&x=true&y=true&xTimesY=false&xSquared=false&ySquared=false&cosX=false&sinX=false&cosY=false&sinY=false&collectStats=false&problem=classification&initZero=false&hideText=false). Train until the classification boundry converges. This is one of the rare cases where the nodes in an ANN can be (semi) interpreted. What do the nodes in the first hidden layer represent? What about the second hidden layer? How do you think the ANN uses these learned "features" to make a decision?

### Exploration
For each of the following questions:
* Make a prediction before you begin exploring and testing.
* Explain why you think this scenario has this property.

**Questions**

3. Find a scenario where a simple model (fewer neurons) outperforms a complex model. (In regards to overfitting)
4. Find a scenario where no hidden layers perform well.
5. Find a scenario where a model with no hidden layers performs poorly no matter the features.
6. Find a scenario where it takes a lot of training time to get a correct solution.

1. Where the data is allocated in diffrent group
2. [Answer]
3. [Answer]
4. [Answer]
5. [Answer]
6. [Answer]

## 2) Training and Testing a Neural Network (Group)

For this problem, you'll be looking at a reduced subset of the [Credit Card Fraud Data](https://www.kaggle.com/mlg-ulb/creditcardfraud), which contains transactions made by credit cards in September 2013 by European cardholders, including some fradulent transactions.
 
There are two interesting properties about this dataset:

1) **The data only contains dimensionality reduced data from a PCA transformation.** Sometimes, due to privacy concerns, all of the features (and even the names of the features used) cannot be known. Therefore, you'll be trying to train a model of data that has been reduced in dimensions with uninterpretable features.

2) **The dataset is highly unbalanced.** The positive class (frauds) account for 0.172% of all transactions.

Knowing the data, what classification metrics (Precision, Recall, F1-Score) are most appropriate and why?

**Write your answer here.**

For this question, **you have enough experience to do the entire model pipeline yourself**. That means *loading the data, creating splits, scaling the data, training and tuning the model, and evaluating the model.*

In [1]:
#Import necessary libraries
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

random_state = 42

### Step 1: Load the data into a dataframe. Use `value_counts` to check the class balance.

In [2]:
df = pd.read_csv("./creditcard.csv")

In [4]:
df.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,164032.0,0.013026,0.77721,0.168464,-0.782449,0.631586,-0.531628,0.876275,-0.000646,-0.248065,...,-0.226443,-0.515073,0.029329,-0.409008,-0.497966,0.14707,0.244097,0.082947,4.49,0
1,63407.0,-0.227828,0.503434,0.960992,0.979314,0.074042,0.640817,0.374438,0.014293,0.09155,...,-0.102313,-0.032916,-0.353239,-0.947066,0.137538,0.735928,-0.02636,-0.006919,74.5,0
2,75822.0,1.458861,-0.942226,-0.302423,-1.401064,-1.020394,-0.308819,-1.165356,0.024556,-1.870639,...,-0.081561,0.082309,-0.223705,-0.656232,0.518888,0.010662,0.046806,0.04029,42.2,0
3,168855.0,2.141957,-0.997336,-0.738212,-0.929019,-0.77233,-0.241391,-0.942758,-0.106791,-0.001484,...,0.324429,0.973512,0.097843,0.537377,-0.068501,-0.111042,0.006144,-0.037058,39.99,0
4,67996.0,0.965124,-0.961507,-0.119976,-0.421448,-0.975116,-1.164778,0.272813,-0.443593,-1.284454,...,-0.655408,-1.954242,0.07651,0.399212,-0.064425,0.595953,-0.112873,0.050798,239.0,0


In [5]:
df['Class'].value_counts()

0    85284
1      158
Name: Class, dtype: int64

### Step 2: Partition the data into an X dataframe (features) and Y single-column dataframe (class)

In [6]:
df_features = df.iloc[:,0:-1]
df_labels =df["Class"]

In [7]:
df_features.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount
0,164032.0,0.013026,0.77721,0.168464,-0.782449,0.631586,-0.531628,0.876275,-0.000646,-0.248065,...,-0.035318,-0.226443,-0.515073,0.029329,-0.409008,-0.497966,0.14707,0.244097,0.082947,4.49
1,63407.0,-0.227828,0.503434,0.960992,0.979314,0.074042,0.640817,0.374438,0.014293,0.09155,...,0.124125,-0.102313,-0.032916,-0.353239,-0.947066,0.137538,0.735928,-0.02636,-0.006919,74.5
2,75822.0,1.458861,-0.942226,-0.302423,-1.401064,-1.020394,-0.308819,-1.165356,0.024556,-1.870639,...,-0.211826,-0.081561,0.082309,-0.223705,-0.656232,0.518888,0.010662,0.046806,0.04029,42.2
3,168855.0,2.141957,-0.997336,-0.738212,-0.929019,-0.77233,-0.241391,-0.942758,-0.106791,-0.001484,...,0.124308,0.324429,0.973512,0.097843,0.537377,-0.068501,-0.111042,0.006144,-0.037058,39.99
4,67996.0,0.965124,-0.961507,-0.119976,-0.421448,-0.975116,-1.164778,0.272813,-0.443593,-1.284454,...,0.049,-0.655408,-1.954242,0.07651,0.399212,-0.064425,0.595953,-0.112873,0.050798,239.0


In [8]:
df_labels.head()

0    0
1    0
2    0
3    0
4    0
Name: Class, dtype: int64

### Step 3: Create your train/test split. Use the provided random_state.

**Note**: You should use a `train_size` of 0.7, or 70%

In [10]:
from sklearn.model_selection import train_test_split
test_data_fraction = 0.3

X_train, X_test, Y_train, Y_test = train_test_split(df_features, df_labels, test_size=test_data_fraction)

In [14]:
X_train.shape

(59809, 30)

In [15]:
X_test.shape

(25633, 30)

In [16]:
Y_train.shape

(59809,)

In [17]:
Y_test.shape

(25633,)

### Step 4: Use a [MinMaxScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html) to standardize the data. 

Fit the scaler only the the training X features, and then apply it to both training and test X features. We do this because in practice, we wouldn't be able to see data in the test X, so it shouldn't affect feature transformation. We therefore only use X_train for feature transformation.

**Note**: Even though most of the features are already transformed using PCA (which would not require additional standardize), there is one other feature (time) that is not, so we should scale as a best practice.

In [18]:
from sklearn.preprocessing import MinMaxScaler

In [21]:
scalar = MinMaxScaler()
scalar.fit(X_train)

X_train = scalar.transform(X_train)
X_test =  scalar.transform(X_test)

### Step 5:  Train an MLP with default hyperparameters.

For the following, you'll be using sklearn's built in Multi-layer Perceptron classifier [MLPClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html).

Use the default hyperparams aside from `max_iter`. `max_iter` is how many iterations of training the ANN goes though until it manually stops. The default `max_iter=200` is too long for our data currently. 

**Use random_state as the random_states and max_iter=20**. The detault parameters will use a single hidden layer.



In [24]:
from sklearn.neural_network import MLPClassifier

In [25]:
# create the machine learning model
clf = MLPClassifier(random_state=1, max_iter=300).fit(X_train, Y_train)

In [31]:
# predict the probability of model
clf.predict_proba(X_test)

array([[  9.99863070e-01,   1.36930351e-04],
       [  9.99832765e-01,   1.67235180e-04],
       [  9.99679433e-01,   3.20567267e-04],
       ..., 
       [  9.99653258e-01,   3.46741521e-04],
       [  9.99814883e-01,   1.85117054e-04],
       [  9.99959859e-01,   4.01413249e-05]])

In [34]:
# 
predict = clf.predict(X_test)

In [33]:
clf.score(X_test, Y_test)

0.99906370694027236

In [36]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

In [38]:
confusion_matrix(Y_test, predict)

array([[25572,     8],
       [   16,    37]])

In [40]:
print(classification_report(Y_test, predict))

             precision    recall  f1-score   support

          0       1.00      1.00      1.00     25580
          1       0.82      0.70      0.76        53

avg / total       1.00      1.00      1.00     25633



If all went well, your model should have an accuracy of almost 100%. Use `classification_report` to explain what you think happened. Is the model performing well? If not, is it overfitting or underfitting? Remember that the classes in the problem are very imbalanced, but out main goal is to detect fraud (class 1).

**Note**: `classification_report` outputs Precision, Recall and F1 for both classes. Remember that how we calculate these metrics depends on which class we treat as the positive class. If we say Class 0 is the positive class, a FP means incorrectly predicting Class 0, but for Class 1 a FP is incorrectly predicting Class 1.

**Answer here**.

## 3) Hyperparameters (Group)

**Hyperparams**:

ANNs have *a lot* of hyperparams. This can include simple things such as the number of layers and nodes, up to tuning the learning rate and the gradient descent algorithm used. 

Unfortunately, there is no tried an true method for selecting hyperparams for a neural network. It requires a lot of experimentation and intution through experience. (In fact, one of the most successful methods in training neural networks is *Graduate Student Descent*, where you simply give the laborious process of tuning to a graduate student while you go and do more research!)

For now, the paramaters that you should explore are:

* `activation`: The activation function of the the ANN. Defaults to ReLU.
* `max_iter`: The ANN will train iterations until either the loss stops improving by a specified threshold, or `max_iters` is reached. Warning: the more you increase this, the more the training time will take! Patience is a virtue.
* `hidden_layer_sizes`: A tuple representing the structure of the hidden layers. For example, giving the tuple `(100,50)` means that there's two hidden layers: the first being of size 100, and the second being of size 50. The tuple (100,) would mean a single hidden layer of size 100.

**Try different permutations of these hyperprams and see how it affects the classification scores of your model.**

Answer the following questions:
1. What criteria did you use to determine which model hyperparameters performed "best"? Why? Justify your answer with respect to the problem: fraud detection.
2. What hyperparameters performed best. Why do you think they performed best?