## SI 670 Applied Machine Learning, Week 9: dimension reduction, deep learning

## Problem 1 (30 points)

For this problem, you will be comparing PCA (Principal Component Analysis) and MDS (Multi-Dimensional Scaling). We will be using our ever-familiar breast cancer dataset. We will also use `GradientBoostingClassifier` with `learning_rate=0.1` and `n_estimators=100` (i.e. the default values) as our classifier.

### For PCA:
In the model we want to use for PCA, we use `PCA` to transform our data and then use `GradientBoostingClassifier` to classify the data. What we need to figure out is the best value of `n_components` for `PCA`. To do so, we will apply cross-validation using `GridSearchCV` on a `Pipeline`. Please consider `n_components = 2, 3, 5` for the parameter grid, and please report the test score of the best model.
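
The parameter grid for a `Pipeline` addresses each step's parameters as `<step name>__<parameter>`. The following is a minimal sketch of that naming convention, assuming the steps are named `pca` and `clf` (the step names are a free choice, not something the assignment prescribes). It also checks that the classifier defaults match the values quoted above.

```python
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline

# Step names ("pca", "clf") are illustrative; any valid names work.
pipe = Pipeline([("pca", PCA()), ("clf", GradientBoostingClassifier())])

# get_params() exposes every tunable parameter under "<step>__<parameter>".
params = pipe.get_params()
pca_params = sorted(name for name in params if name.startswith("pca__"))
print(pca_params)

# The classifier defaults referred to in the problem statement.
print(params["clf__learning_rate"], params["clf__n_estimators"])
```

A grid such as `{"pca__n_components": [2, 3, 5]}` therefore varies only the `n_components` argument of the `pca` step and leaves the classifier at its defaults.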

### For MDS:
Note that `MDS` is a dimensionality reduction tool that preserves pairwise distances. Unlike PCA, which learns a linear transform that can be applied to future data, MDS can only learn embeddings for the data it is given. This is why `MDS` does not have a stand-alone `transform()` method and can only be applied to the data it was fit on. As a consequence, we cannot learn an MDS transform on some data and then apply it to other data.

Therefore, you should first apply `fit_transform()` from `MDS` to ALL data. Then split it into training and test, fit `GradientBoostingClassifier` on training data, and then compute score on test data. You should repeat this process for three possible values of the `n_components` parameter of `MDS`: `n_components = 2, 3, 5`. Once you have done so, please report the best score on test data.
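
The interface difference between the two estimators described above can be checked directly. This is a small sketch that only inspects which methods the scikit-learn classes expose; it does not fit anything.

```python
from sklearn.decomposition import PCA
from sklearn.manifold import MDS

pca = PCA(n_components=2)
mds = MDS(n_components=2)

# PCA learns a linear map, so it can transform data it has not seen.
pca_has_transform = hasattr(pca, "transform")
# MDS only produces an embedding for the data passed to fit_transform().
mds_has_transform = hasattr(mds, "transform")
mds_has_fit_transform = hasattr(mds, "fit_transform")

print(pca_has_transform, mds_has_transform, mds_has_fit_transform)
```

This is also why `MDS` cannot be dropped into the same `Pipeline` plus `GridSearchCV` setup as `PCA`: scoring on held-out folds would require a `transform()` step.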

(Please note that you do not need to scale the data for either of these two parts.)

In [1]:
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.manifold import MDS
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

def answer_one():
    # load features and labels, then hold out a test set (default split: 75% train, 25% test)
    (X_cancer, y_cancer) = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X_cancer, y_cancer, random_state=0)

    ######## Your code for PCA here
    # candidate values for the number of principal components
    param_grid = {"pca__n_components": [2, 3, 5]}
    # pipeline: PCA transform followed by the classifier (default hyperparameters)
    components = [('pca', PCA()), ('clf', GradientBoostingClassifier())]
    pipe = Pipeline(components)
    # cross-validated grid search over the pipeline
    grid = GridSearchCV(pipe, param_grid)
    # GridSearchCV behaves like an estimator: .fit runs the search and refits the best
    # pipeline on the full training set; .score evaluates that refit pipeline
    grid.fit(X_train, y_train)
    best_model_test_score_PCA = grid.score(X_test, y_test)
    # best value of n_components found by the search
    pca_best_n_components = grid.best_params_['pca__n_components']
    # print('The best PCA parameters (i.e. best model) for n_components:', pca_best_n_components)
    print('The BEST PCA test score is:', best_model_test_score_PCA)

    ######## Your code for MDS here
    # MDS cannot be fit on the training data and then applied to the test data,
    # because the embedding depends on the pairwise distances between all points.
    # So, for this question only, apply MDS to ALL data before splitting
    # (this does not work in real-world settings where test data is unseen).
    
    (X, y) = load_breast_cancer(return_X_y = True)
    
    n_components_list = [2, 3, 5]
    score_list = []
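    for n_components in n_components_list:

        # apply MDS to the original X each time; overwriting X here would make
        # later iterations embed an already-reduced dataset
        mds = MDS(n_components=n_components)
        X_mds = mds.fit_transform(X)

        # split train-test
        X_train, X_test, y_train, y_test = train_test_split(X_mds, y, test_size=0.25, random_state=0)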
        
        # fit GradientBoostingClassifier on training data
        clf = GradientBoostingClassifier()
        clf.fit(X_train, y_train)
        # compute score on test data
        this_score = clf.score(X_test, y_test)
        score_list.append(this_score)
        
    # report max score
    print(score_list)
    best_model_test_score_MDS = np.max(score_list)
    print('The BEST MDS test score is:', best_model_test_score_MDS)
  
    return best_model_test_score_PCA, best_model_test_score_MDS

answer_one()

The BEST PCA test score is: 0.9300699300699301
[0.9370629370629371, 0.9300699300699301, 0.916083916083916]
The BEST MDS test score is: 0.9370629370629371


(0.9300699300699301, 0.9370629370629371)

## Problem 2 (30 points)

### (a)
Leon and Claire have access to some data about viral infection in Raccoon City, where the target variable is the number of people who will get infected on a given day. What type of activation should they use on the final layer?

### (b)
Jill is training a multi-layer perceptron. While the error on the training data is low, the error on the test data is high. In order to improve generalization to test data, Jill increases the number of layers in the model. Is this likely to work?

#### Your answer for 2(a) here:
* The target is the number of people who will get infected in a day, so this is a regression problem: the model predicts a numerical value. The final activation should therefore be linear (which returns an unbounded numerical value) or ReLU (which returns a non-negative value, matching a count that cannot be negative).
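
The difference between the two candidate output activations can be sketched in a few lines of NumPy. The pre-activation values below are made up for illustration and do not come from any trained model.

```python
import numpy as np

def linear(z):
    """Identity activation: passes the pre-activation through unchanged."""
    return z

def relu(z):
    """ReLU activation: clips negative pre-activations to zero."""
    return np.maximum(0.0, z)

# Hypothetical pre-activations of the single output neuron.
z = np.array([-3.5, 0.0, 12.2])

linear_out = linear(z)  # may contain negative "counts"
relu_out = relu(z)      # always non-negative

print(linear_out)
print(relu_out)
```

With a linear output the network can predict a negative number of infections; ReLU rules that out at the cost of a zero gradient wherever the pre-activation is negative.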

#### Your answer for 2(b) here:
* No, this is unlikely to work. Low error on the training data together with high error on the test data means the MLP is overfitting. The basic remedy for overfitting is to decrease the complexity of the model, whereas adding layers increases it. Instead, Jill can make the network smaller by removing layers or reducing the number of neurons per layer.

* References:
    * https://towardsdatascience.com/deep-learning-which-loss-and-activation-functions-should-i-use-ac02f1c56aa8
    * https://www.analyticsvidhya.com/blog/2021/06/complete-guide-to-prevent-overfitting-in-neural-networks-part-1/
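
To make the complexity argument in 2(b) concrete, the sketch below counts the trainable parameters (weights plus biases) of a fully connected network. The layer sizes are hypothetical, since the problem does not say what Jill's network looks like: 30 inputs, hidden layers of 32 neurons, and one output.

```python
def count_dense_params(layer_sizes):
    """Count weights and biases in a fully connected network.

    layer_sizes lists the width of every layer, input first, output last.
    Each pair of consecutive layers contributes n_in * n_out weights
    and n_out biases.
    """
    return sum(
        n_in * n_out + n_out
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )

# Hypothetical architectures: one hidden layer versus five hidden layers.
shallow = count_dense_params([30, 32, 1])
deep = count_dense_params([30, 32, 32, 32, 32, 32, 1])

print(shallow, deep)
```

Every extra hidden layer adds more parameters that can fit noise in the training data, which is why adding layers tends to widen the gap between training and test error rather than close it.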

## Problem 3 (40 points)

You will be building a multi-layer perceptron using `keras` for classification of the breast cancer dataset.

Your perceptron should have three hidden layers, each with 32 neurons. The hidden layers should use `tanh` activation.

Since we are doing a classification problem with two classes, your output layer needs to have two neurons. You should use `sigmoid` activation for the output layer.

Please report the accuracy score of your model on the test data for 1000 epochs and batch size 200.

In [10]:
# ! pip install tensorflow --user

In [15]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import numpy as np
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

def answer_three():
    (X_cancer, y_cancer) = load_breast_cancer(return_X_y = True)

    X_train, X_test, y_train, y_test = train_test_split(X_cancer, y_cancer, random_state=0)
    # print(y_test)

    # Scale your data using standard scaler. Beware of leakage!
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    # Your code for the perceptron here
    network = Sequential()
    
    # hidden layers
    network.add(Dense(32, activation = 'tanh', input_shape = X_train_scaled[0].shape))
    network.add(Dense(32, activation = 'tanh'))
    network.add(Dense(32, activation = 'tanh'))
    
    # output layer: a single sigmoid neuron gives P(class 1), which is enough for two classes
    # (the problem statement asks for two output neurons; this solution uses the one-neuron form)
    network.add(Dense(1, activation = 'sigmoid'))

    ################## reshape labels y to shape (n_samples, 1) ################
    # print("before label reshape:", y_test)
    y_train = np.asarray(y_train).astype('float32').reshape((-1,1))
    y_test = np.asarray(y_test).astype('float32').reshape((-1,1))
    # print("after label reshape:", y_test)
    ##################################

    # compile with binary cross-entropy, which matches a single sigmoid output
    network.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics=['accuracy'])
    # train silently for the requested number of epochs and batch size
    network.fit(X_train_scaled, y_train, epochs = 1000, batch_size = 200, verbose = 0)
    # evaluate returns [loss, accuracy]; keep only the accuracy
    _, test_accuracy = network.evaluate(X_test_scaled, y_test)
    print(test_accuracy)

    return test_accuracy

answer_three()

0.9510489702224731


0.9510489702224731
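
The solution above uses one sigmoid output neuron rather than two output neurons. The following NumPy sketch shows why a single sigmoid output is sufficient for two classes: it yields the same class-1 probability as a two-way softmax whose class-0 logit is fixed at zero. The logit values are made up for illustration, and the softmax comparison is an aside that the assignment itself does not ask for.

```python
import numpy as np

def sigmoid(z):
    """Logistic function applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

def softmax(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Hypothetical output logits for three samples.
z = np.array([-2.0, 0.0, 1.5])

# One-neuron form: P(class 1) = sigmoid(z).
p_sigmoid = sigmoid(z)

# Two-neuron form with logits (0, z): P(class 1) is the second softmax column.
two_logits = np.stack([np.zeros_like(z), z], axis=-1)
p_softmax = softmax(two_logits)[:, 1]

print(np.allclose(p_sigmoid, p_softmax))
```

A two-neuron output layer would instead need labels with two columns (for example one-hot encoded) and a matching loss, whereas the one-neuron form works directly with 0/1 labels and `binary_crossentropy`.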