# Testing regressors

Which of the regressors will be the best to encode a face? Let's find out!

This notebook prexents a very naive approach, i.e. we will measure how good a regressor is by the MSE on a game face. The problem is the regressors will be used for real faces, which may casuse the regressor to not be able to infer any meaningful results if overfitted. 


My plan:
1. Get the datasets `X_faces`, `X_z` and `Y` from the training set.
2. Get regressors: M_LR, M_OMP, M_RF, M_MLP with default values (at least for now)
3. Train the regressors on `X_faces[:1000]` and `Y[:1000]`
4. Test the regressors on `X_faces[1000:]` and `Y[1000:]` 
5. Train the regressors on `X_z[:5000]` and `Y[:5000]`
6. Test the regressors on `X_z[5000:]` and `Y[5000:]`
7. Compare the results by creating a table with MSE, R2, and time of training and testing.


TODO:
 1. e2e tests
 2. cross-validation for finding best parameters of the best regressors

## 1. Preparation

In [25]:
import glob
import numpy as np
import matplotlib.pyplot as plt

from time import time
from tqdm import tqdm

from sklearn.linear_model import LinearRegression, OrthogonalMatchingPursuit
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor

from sklearn.preprocessing import Normalizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

from pathlib import Path
import json

In [3]:
def load_and_concatenate(folder_path):
    file_paths = glob.glob(folder_path + '/*.npy')
    arrays = [np.load(file_path) for file_path in file_paths]
    concatenated_array = np.concatenate(arrays)
    return concatenated_array


def load_data():
    permute = np.load("permuted_100000.npy")
    
    X_faces = np.load("/media/pawel/DATA/tmp/freddie_mercuries/naive_input.npy")[permute]
    X_z = load_and_concatenate("/media/pawel/DATA/tmp/freddie_mercuries/en_face/vectors")
    Y = np.load("./data/outvec.npy")[permute]
    
    return X_faces, X_z, Y


X_faces, X_z, Y = load_data()
print(X_faces.shape, X_z.shape, Y.shape)

(100000, 65536) (11000, 256) (100000, 5)


In [4]:
def get_regressors():
    regressors = [
        LinearRegression(),
        OrthogonalMatchingPursuit(),
        RandomForestRegressor(),
        MLPRegressor()
    ]
    return regressors

regressors = get_regressors()

## 2. Training



In [23]:
def train_and_validate(X, Y, regressors, X_test=None, Y_test=None):
    X_train, X_valid, Y_train, Y_valid = train_test_split(X, Y, test_size=0.2, random_state=42)
    
    normalizer = Normalizer()
    X_train = normalizer.fit_transform(X_train)
    X_valid = normalizer.transform(X_valid)
    X_test = normalizer.transform(X_test) if X_test is not None else None
    
    models = { f"{regressor.__class__.__name__}_{i}": {
        "scores": {
            "train": {},
            "valid": {},
            "test": {}
            }
        }      for i, regressor in enumerate(regressors) }
    
    for i, regressor in enumerate(regressors):
        name = f"{regressor.__class__.__name__}_{i}"
        print(f"Model: {name}")
        
        print(f"Training ...")
        t1 = time()
        regressor.fit(X_train, Y_train)
        models[name]["scores"]["train"]["time"] = time() - t1
        models[name]["model"] = regressor
        models[name]["scores"]["train"]["mse"] = mean_squared_error(Y_train, regressor.predict(X_train))
        models[name]["scores"]["train"]["r2"] = regressor.score(X_train, Y_train)
        
        print("Testing...")
        Y_pred = regressor.predict(X_valid)
        models[name]["scores"]['valid']["mse"] = mean_squared_error(Y_valid, Y_pred)
        models[name]["scores"]['valid']["r2"] = regressor.score(X_valid, Y_valid)

        print(f"MSE   train: {models[name]['scores']['train']['mse']:6f}   validation: {models[name]['scores']['valid']['mse']:6f}")
        print(f"R2    train: {models[name]['scores']['train']['r2']:6f}   validation: {models[name]['scores']['valid']['r2']:6f}")
        print(f"Training {name} took {models[name]['scores']['train']['time']:4f} s")
        
        if X_test is not None and Y_test is not None:
            print(f"Testing on provided data...")
            Y_pred = regressor.predict(X_test)
            models[name]["scores"]["test"]["mse"] = mean_squared_error(Y_test, Y_pred)
            models[name]["scores"]["test"]["r2"] = regressor.score(X_test, Y_test)
            print(f"MSE   test: {models[name]['scores']['test']['mse']:6f}")
            print(f"R2    test: {models[name]['scores']['test']['r2']:6f}")
        print("-----------------------")
        
    return models

In [28]:
def save_model(model, path):
    Path(path).parent.mkdir(parents=True, exist_ok=True)
    np.save(path, model)
    
def save_models(models, path):
    for name, model in models.items():
        save_model(model['model'], path + name + ".npy")
        
def save_results(results: dict, path):
    Path(path).parent.mkdir(parents=True, exist_ok=True)
    with open(path, 'w') as f:
        json.dump(results, f, default=lambda o: '<not serializable>')
    


### Faces

In [8]:
FACES_TEST_SIZE = 1000


X_face_train, X_face_test, Y_face_train, Y_face_test = X_faces[:FACES_TEST_SIZE], X_faces[FACES_TEST_SIZE:], Y[:FACES_TEST_SIZE], Y[FACES_TEST_SIZE:]
face_models = train_and_validate(X_face_train, Y_face_train, regressors, X_test=X_face_test[:1000], Y_test=Y_face_test[:1000])


Model: LinearRegression
Training ...
Testing...
MSE   train: 0.000000   validation: 215.349699
R2    train: 1.000000   validation: 0.931399
Training LinearRegression took 11.112163 s
Testing on provided data...
MSE   test: 195.589463
R2    test: 0.932260
-----------------------
Model: OrthogonalMatchingPursuit
Training ...


  out = _cholesky_omp(


Testing...
MSE   train: 0.000000   validation: 496.780953
R2    train: 1.000000   validation: 0.841044
Training OrthogonalMatchingPursuit took 98.066185 s
Testing on provided data...
MSE   test: 475.542824
R2    test: 0.835402
-----------------------
Model: RandomForestRegressor
Training ...
Testing...
MSE   train: 83.609110   validation: 594.817234
R2    train: 0.971118   validation: 0.796814
Training RandomForestRegressor took 2427.769744 s
Testing on provided data...
MSE   test: 593.777755
R2    test: 0.797492
-----------------------
Model: MLPRegressor
Training ...




Testing...
MSE   train: 2594.282733   validation: 2624.482599
R2    train: 0.106170   validation: 0.096522
Training MLPRegressor took 264.611069 s
Testing on provided data...
MSE   test: 2641.165863
R2    test: 0.099468
-----------------------


### Z

In [9]:
Z_TEST_SIZE = 10000
Z_TOTAL_SIZE = X_z.shape[0]

X_z_train, X_z_test, Y_z_train, Y_z_test = X_z[:Z_TEST_SIZE], X_z[Z_TEST_SIZE:], Y[:Z_TEST_SIZE], Y[Z_TEST_SIZE:Z_TOTAL_SIZE]
z_models = train_and_validate(X_z_train, Y_z_train, regressors, X_test=X_z_test, Y_test=Y_z_test)

Model: LinearRegression
Training ...
Testing...
MSE   train: 2626.518847   validation: 2764.595554
R2    train: 0.107148   validation: 0.040685
Training LinearRegression took 0.324778 s
Testing on provided data...
MSE   test: 1839.819431
R2    test: 0.378103
-----------------------
Model: OrthogonalMatchingPursuit
Training ...
Testing...
MSE   train: 2792.708585   validation: 2829.536494
R2    train: 0.050653   validation: 0.017830
Training OrthogonalMatchingPursuit took 0.033734 s
Testing on provided data...
MSE   test: 2453.016719
R2    test: 0.169728
-----------------------
Model: RandomForestRegressor
Training ...
Testing...
MSE   train: 410.717816   validation: 2863.176646
R2    train: 0.860381   validation: 0.006056
Training RandomForestRegressor took 198.296329 s
Testing on provided data...
MSE   test: 2842.678070
R2    test: 0.036918
-----------------------
Model: MLPRegressor
Training ...
Testing...
MSE   train: 2617.501513   validation: 2750.399628
R2    train: 0.110203   val



In [29]:
save_models(face_models, "./model/encoder/faces/")
save_models(z_models, "./model/encoder/z/")

save_results(face_models, "./results/encoder/faces.json")
save_results(z_models, "./results/encoder/z.json")

## Multiple training sizes

In [None]:
X_test, Y_test = X_face_test[:1000], Y_face_test[:1000]
for size in [200, 400, 600, 800, 1000]:
    X, Y = X_faces[:size], Y[:size]
    models = train_and_validate(X, Y, regressors, X_test=X_test, Y_test=Y_test)
    save_models(models, f"./model/encoder/faces/{size}/")
    save_results(models, f"./results/encoder/faces/{size}.json")

#### Bonus! Trying different MLPs


In [None]:

mlp_regressors = [
    MLPRegressor(hidden_layer_sizes=(128),),
    MLPRegressor(hidden_layer_sizes=(128, 64)),
    MLPRegressor(hidden_layer_sizes=(128, 64, 32)),
    MLPRegressor(hidden_layer_sizes=(128, 64, 32, 16)),
    MLPRegressor(hidden_layer_sizes=(128, 64, 32, 16, 8))
]

In [16]:
z_mlp_models = train_and_validate(X_z_train, Y_z_train, mlp_regressors, X_test=X_z_test, Y_test=Y_z_test)

Model: MLPRegressor_0
Training ...




Testing...
MSE   train: 2588.243542   validation: 2859.380607
R2    train: 0.115976   validation: 0.026510
Training MLPRegressor_0 took 28.593994 s
Testing on provided data...
MSE   test: 1837.888326
R2    test: 0.378845
-----------------------
Model: MLPRegressor_1
Training ...




Testing...
MSE   train: 2102.339593   validation: 3094.887056
R2    train: 0.281951   validation: -0.053670
Training MLPRegressor_1 took 57.899157 s
Testing on provided data...
MSE   test: 1964.919060
R2    test: 0.335591
-----------------------
Model: MLPRegressor_2
Training ...




Testing...
MSE   train: 1491.488160   validation: 3888.193603
R2    train: 0.490754   validation: -0.324628
Training MLPRegressor_2 took 86.235492 s
Testing on provided data...
MSE   test: 2833.105375
R2    test: 0.040396
-----------------------
Model: MLPRegressor_3
Training ...




Testing...
MSE   train: 1245.156741   validation: 4594.400672
R2    train: 0.573831   validation: -0.563941
Training MLPRegressor_3 took 85.641251 s
Testing on provided data...
MSE   test: 3506.701890
R2    test: -0.189512
-----------------------
Model: MLPRegressor_4
Training ...
Testing...
MSE   train: 1674.284581   validation: 4357.207539
R2    train: 0.428059   validation: -0.484756
Training MLPRegressor_4 took 53.888160 s
Testing on provided data...
MSE   test: 3710.253951
R2    test: -0.257127
-----------------------




In [17]:
mlp_bigger_alpha_regressors = [
    MLPRegressor(hidden_layer_sizes=(128, 64, 32, 16, 8), alpha=0.001),
    MLPRegressor(hidden_layer_sizes=(128, 64, 32, 16, 8), alpha=0.01),
    MLPRegressor(hidden_layer_sizes=(128, 64, 32, 16, 8), alpha=0.1),
    MLPRegressor(hidden_layer_sizes=(128, 64, 32, 16, 8), alpha=1),
    MLPRegressor(hidden_layer_sizes=(128, 64, 32, 16, 8), alpha=10)
]
    


z_mlp_bigger_alpha_models = train_and_validate(X_z_train, Y_z_train, mlp_bigger_alpha_regressors, X_test=X_z_test, Y_test=Y_z_test)

Model: MLPRegressor_0
Training ...




Testing...
MSE   train: 1621.199057   validation: 4370.266300
R2    train: 0.444777   validation: -0.480488
Training MLPRegressor_0 took 90.635764 s
Testing on provided data...
MSE   test: 3731.174181
R2    test: -0.264210
-----------------------
Model: MLPRegressor_1
Training ...




Testing...
MSE   train: 1785.886481   validation: 4208.526610
R2    train: 0.388666   validation: -0.428263
Training MLPRegressor_1 took 67.593153 s
Testing on provided data...
MSE   test: 3873.832012
R2    test: -0.313193
-----------------------
Model: MLPRegressor_2
Training ...




Testing...
MSE   train: 2380.988848   validation: 3299.848041
R2    train: 0.185796   validation: -0.118681
Training MLPRegressor_2 took 22.220021 s
Testing on provided data...
MSE   test: 3141.443045
R2    test: -0.064619
-----------------------
Model: MLPRegressor_3
Training ...




Testing...
MSE   train: 1625.143582   validation: 4271.744768
R2    train: 0.444300   validation: -0.444201
Training MLPRegressor_3 took 59.662424 s
Testing on provided data...
MSE   test: 3687.583618
R2    test: -0.252403
-----------------------
Model: MLPRegressor_4
Training ...
Testing...
MSE   train: 1416.860344   validation: 4712.738722
R2    train: 0.515396   validation: -0.601942
Training MLPRegressor_4 took 85.440482 s
Testing on provided data...
MSE   test: 4214.834446
R2    test: -0.427988
-----------------------




In [21]:
mlp_different_iter_regressors = [
    MLPRegressor(hidden_layer_sizes=(128,64), max_iter=200),
    MLPRegressor(hidden_layer_sizes=(128,64), max_iter=400),
    MLPRegressor(hidden_layer_sizes=(128,64), max_iter=600),
    MLPRegressor(hidden_layer_sizes=(128,64), max_iter=800),
    MLPRegressor(hidden_layer_sizes=(128,64), max_iter=1000)
]

z_mlp_different_iter_models = train_and_validate(X_z_train, Y_z_train, mlp_different_iter_regressors, X_test=X_z_test, Y_test=Y_z_test)

Model: MLPRegressor_0
Training ...




Testing...
MSE   train: 2075.482804   validation: 2999.711074
R2    train: 0.290986   validation: -0.021803
Training MLPRegressor_0 took 35.076074 s
Testing on provided data...
MSE   test: 1985.666946
R2    test: 0.328280
-----------------------
Model: MLPRegressor_1
Training ...




Testing...
MSE   train: 1027.253332   validation: 4199.907813
R2    train: 0.649038   validation: -0.431527
Training MLPRegressor_1 took 85.974538 s
Testing on provided data...
MSE   test: 3112.836865
R2    test: -0.052942
-----------------------
Model: MLPRegressor_2
Training ...




Testing...
MSE   train: 500.527108   validation: 5576.276629
R2    train: 0.828962   validation: -0.899456
Training MLPRegressor_2 took 146.529135 s
Testing on provided data...
MSE   test: 4544.358006
R2    test: -0.538690
-----------------------
Model: MLPRegressor_3
Training ...




Testing...
MSE   train: 361.038135   validation: 6368.318591
R2    train: 0.876629   validation: -1.170974
Training MLPRegressor_3 took 161.944777 s
Testing on provided data...
MSE   test: 5301.342213
R2    test: -0.792946
-----------------------
Model: MLPRegressor_4
Training ...




Testing...
MSE   train: 256.883746   validation: 7017.877939
R2    train: 0.912254   validation: -1.391625
Training MLPRegressor_4 took 175.791965 s
Testing on provided data...
MSE   test: 5820.026468
R2    test: -0.969657
-----------------------


In [22]:
mlp_different_iter_regularizedregressors = [
    MLPRegressor(hidden_layer_sizes=(128,64), max_iter=1000, alpha=0.001),
    MLPRegressor(hidden_layer_sizes=(128,64), max_iter=1000, alpha=0.01),
    MLPRegressor(hidden_layer_sizes=(128,64), max_iter=1000, alpha=0.1),
    MLPRegressor(hidden_layer_sizes=(128,64), max_iter=1000, alpha=1),
    MLPRegressor(hidden_layer_sizes=(128,64), max_iter=1000, alpha=10),
]

z_mlp_different_iter_models = train_and_validate(X_z_train, Y_z_train, mlp_different_iter_regressors, X_test=X_z_test, Y_test=Y_z_test)

Model: MLPRegressor_0
Training ...




Testing...
MSE   train: 2037.291320   validation: 3013.946276
R2    train: 0.306350   validation: -0.041283
Training MLPRegressor_0 took 61.948158 s
Testing on provided data...
MSE   test: 2008.732700
R2    test: 0.320692
-----------------------
Model: MLPRegressor_1
Training ...




Testing...
MSE   train: 798.760349   validation: 4483.733755
R2    train: 0.727572   validation: -0.547799
Training MLPRegressor_1 took 126.711411 s
Testing on provided data...
MSE   test: 3641.647301
R2    test: -0.231930
-----------------------
Model: MLPRegressor_2
Training ...




Testing...
MSE   train: 458.158243   validation: 5616.300629
R2    train: 0.843788   validation: -0.939679
Training MLPRegressor_2 took 143.417865 s
Testing on provided data...
MSE   test: 4696.370524
R2    test: -0.589742
-----------------------
Model: MLPRegressor_3
Training ...




Testing...
MSE   train: 361.631607   validation: 6427.622011
R2    train: 0.876739   validation: -1.220290
Training MLPRegressor_3 took 214.398274 s
Testing on provided data...
MSE   test: 5336.007526
R2    test: -0.806763
-----------------------
Model: MLPRegressor_4
Training ...




Testing...
MSE   train: 239.871411   validation: 6696.317033
R2    train: 0.918254   validation: -1.313555
Training MLPRegressor_4 took 224.950246 s
Testing on provided data...
MSE   test: 5584.159705
R2    test: -0.890250
-----------------------
