# Autoencoder approach: predicting the plant part

## Contents
- [1. Imports](#Imports)
- [2. Build models](#Build-models)
- [3. Approximation error](#Approximation-error)
- [4. Encode data with computed autoencoders](#Encode-data-with-computed-autoencoders)
- [5. Logistic regression classifier with encoded data](#Logistic-regression-classifier-with-encoded-data)
- [6. Gaussian Naive Bayes classifier with encoded data](#Gaussian-Naive-Bayes-classifier-with-encoded-data)
- [7. Hybrid Bayesian classifier with bnlearn](#Hybrid-Bayesian-classifier-with-bnlearn)



[Back to Chemfin](../Chemfin.ipynb)

### Imports
The first code cell contains all necessary imports.

Requires [NumPy](http://www.numpy.org/), [scikit-learn](http://scikit-learn.org/), [PyTorch](http://pytorch.org/) and [rpy2](https://rpy2.readthedocs.io).

[Back to contents](#Contents)

In [9]:
import sys
sys.path.append('../src/')  # make the project modules in ../src/ importable

import copy
import os
import time

import numpy as np
import torch
from torch import nn
from torch.autograd import Variable
from torch.utils.data import DataLoader

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
from sklearn.naive_bayes import GaussianNB

# project modules from ../src/
import autoencoder as ae
import bayesian_networks as bn
from computational_utils import reshape
from io_work import stringSplitByNumbers

# fix the seed so that runs are reproducible
random_state = 150
torch.manual_seed(random_state);


### Build models

The next cells build one autoencoder model per cross-validation (CV) split. The splits are read from the CV index files in data/ (cv_indices_parts.npz and cv_indices_parts_3.npz in the cells below).

The parameters to set are:

- sizes: list of integers giving the output size of each encoding layer
- batch_size: number of samples in each mini-batch used to compute one update
- nEpoch: list with the number of training epochs for each layer
- num_workers: number of worker processes used for data loading
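
To make the parameter list concrete, here is a minimal sketch of what the values used below imply. It assumes, as the description of sizes says, that each entry is the output width of one encoding layer, and that the last partial mini-batch is kept. The helpers `layer_shapes` and `updates_per_epoch` and the input width of 1000 are hypothetical; they are not part of the autoencoder module.

```python
import math


def layer_shapes(input_dim, sizes):
    """Return (n_in, n_out) for each encoding layer.

    Assumes each entry of `sizes` is the output width of one layer,
    so layer i takes the output of layer i-1 as its input.
    """
    dims = [input_dim] + list(sizes)
    return list(zip(dims[:-1], dims[1:]))


def updates_per_epoch(n_samples, batch_size):
    """Mini-batch updates in one pass over the data (last partial batch kept)."""
    return int(math.ceil(n_samples / float(batch_size)))


sizes = [400, 100, 25]
input_dim = 1000  # placeholder; the real value is T.shape[1] after unfolding

shapes = layer_shapes(input_dim, sizes)
# 1808 is the training-set size reported in the output of the first cell below
n_updates = updates_per_epoch(1808, 200)
```

With these values the encoder narrows the input to 400, then 100, then 25 features, and one epoch over 1808 training samples with batch_size = 200 takes 10 updates under the stated assumption.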

[Back to contents](#Contents)

In [10]:
data_dirname = '../data/'
model_dirname = '../models/autoencoder/'
filename_dataset = 'dataset_parts.npz'
filename_cv = 'cv_indices_parts.npz'

model_filename_prefix = 'parts_model_ae_'

df = np.load(data_dirname+filename_dataset)
T, labels = df['data'], df['label']
# unfold each sample into a row vector
T = reshape(T, [T.shape[0], -1])
# scale each sample (row) to unit Euclidean norm
T /= np.linalg.norm(T, axis=1, keepdims=True)
print 'full'

sizes = [400, 100, 25]
nEpoch = [1000, 1000, 1000]
batch_size = 200
num_workers = 4

df = np.load(data_dirname+filename_cv)
test_indices, train_indices = df['test_indices'], df['train_indices']



ae.buildAutoencoderModels(
    T, train_indices, test_indices, sizes, model_dirname, nEpoch,
    batch_size, num_workers, model_filename_prefix
)

full
(1) Errors on training set (1808 samples): 
min=8.201e-02 / mean=1.370e-01 / median=1.327e-01 / max=2.933e-01
(1) Errors on validation set (455 samples): 
min=9.331e-02 / mean=1.672e-01 / median=1.511e-01 / max=6.083e-01
(2) Errors on training set (1808 samples): 
min=4.771e-02 / mean=1.003e-01 / median=9.730e-02 / max=2.104e-01
(2) Errors on validation set (455 samples): 
min=6.020e-02 / mean=1.401e-01 / median=1.190e-01 / max=6.083e-01
(3) Errors on training set (1808 samples): 
min=6.476e-02 / mean=1.175e-01 / median=1.138e-01 / max=3.463e-01
(3) Errors on validation set (455 samples): 
min=6.751e-02 / mean=1.653e-01 / median=1.409e-01 / max=7.377e-01
(1) Errors on training set (1808 samples): 
min=7.969e-02 / mean=1.371e-01 / median=1.323e-01 / max=2.867e-01
(1) Errors on validation set (455 samples): 
min=7.995e-02 / mean=1.636e-01 / median=1.473e-01 / max=6.735e-01
(2) Errors on training set (1808 samples): 
min=4.471e-02 / mean=9.607e-02 / median=9.325e-02 / max=2.108e-01
(

OSError: [Errno 4] Interrupted system call
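
The log above reports the minimum, mean, median and maximum of a per-sample reconstruction error. How `ae.buildAutoencoderModels` defines that error is not shown in this notebook. The sketch below assumes a relative Euclidean error per sample; the functions `relative_errors` and `format_summary` and the toy data are hypothetical and only reproduce the layout of the log line.

```python
import numpy as np


def relative_errors(X, X_hat):
    """Per-sample relative error ||x - x_hat|| / ||x|| (assumed definition)."""
    return np.linalg.norm(X - X_hat, axis=1) / np.linalg.norm(X, axis=1)


def format_summary(errors):
    """Format the statistics in the same layout as the log lines above."""
    return 'min=%.3e / mean=%.3e / median=%.3e / max=%.3e' % (
        errors.min(), errors.mean(), np.median(errors), errors.max()
    )


# Toy data: 5 samples with 8 features, rows scaled to unit norm as in the cell above.
rng = np.random.RandomState(150)
X = rng.rand(5, 8)
X /= np.linalg.norm(X, axis=1, keepdims=True)

# Stand-in for an autoencoder reconstruction: every sample shrunk by 10%.
X_hat = 0.9 * X

errors = relative_errors(X, X_hat)
summary = format_summary(errors)
```

Because every toy reconstruction is off by exactly 10%, all four statistics in `summary` are 1.000e-01. On real data the gap between the median and the maximum shows how much a few poorly reconstructed samples dominate.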

In [8]:
data_dirname = '../data/'
model_dirname = '../models/autoencoder/'
filename_dataset = 'dataset_parts_3.npz'
filename_cv = 'cv_indices_parts_3.npz'

model_filename_prefix = 'parts3_model_ae_'

df = np.load(data_dirname+filename_dataset)
T, labels = df['data'], df['label']
# unfold each sample into a row vector
T = reshape(T, [T.shape[0], -1])
# scale each sample (row) to unit Euclidean norm
T /= np.linalg.norm(T, axis=1, keepdims=True)
print 'full'
    
sizes = [400, 100, 25]
nEpoch = [1000, 1000, 1000]
batch_size = 200
num_workers = 4

df = np.load(data_dirname+filename_cv)
test_indices, train_indices = df['test_indices'], df['train_indices']



ae.buildAutoencoderModels(
    T, train_indices[:5], test_indices[:5], sizes, model_dirname, nEpoch,
    batch_size, num_workers, model_filename_prefix
)

full
(1) Errors on training set (1809 samples): 
min=7.838e-02 / mean=1.364e-01 / median=1.327e-01 / max=2.908e-01
(1) Errors on validation set (454 samples): 
min=8.157e-02 / mean=1.702e-01 / median=1.514e-01 / max=6.846e-01
(2) Errors on training set (1809 samples): 
min=4.770e-02 / mean=1.009e-01 / median=9.897e-02 / max=2.056e-01
(2) Errors on validation set (454 samples): 
min=4.926e-02 / mean=1.443e-01 / median=1.213e-01 / max=6.839e-01
(3) Errors on training set (1809 samples): 
min=5.766e-02 / mean=1.206e-01 / median=1.172e-01 / max=2.390e-01
(3) Errors on validation set (454 samples): 
min=5.973e-02 / mean=1.743e-01 / median=1.455e-01 / max=7.904e-01
(1) Errors on training set (1809 samples): 
min=7.348e-02 / mean=1.373e-01 / median=1.325e-01 / max=3.088e-01
(1) Errors on validation set (454 samples): 
min=9.019e-02 / mean=1.651e-01 / median=1.476e-01 / max=5.086e-01
(2) Errors on training set (1809 samples): 
min=4.428e-02 / mean=1.046e-01 / median=1.024e-01 / max=2.211e-01
(

Process Process-145415:
Process Process-145413:
Process Process-145414:
Process Process-145416:
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
  File "/home/pavel/apd/lib/python2.7/multiprocessing/process.py", line 267, in _bootstrap
  File "/home/pavel/apd/lib/python2.7/multiprocessing/process.py", line 267, in _bootstrap
  File "/home/pavel/apd/lib/python2.7/multiprocessing/process.py", line 267, in _bootstrap
    self.run()
    self.run()
    self.run()
  File "/home/pavel/apd/lib/python2.7/multiprocessing/process.py", line 114, in run
Traceback (most recent call last):
    self._target(*self._args, **self._kwargs)
  File "/home/pavel/apd/lib/python2.7/multiprocessing/process.py", line 114, in run
  File "/home/pavel/apd/lib/python2.7/site-packages/torch/multiprocessing/queue.py", line 21, in recv
  File "/home/pavel/apd/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 50, in _worker_loop
    r = index_queue.

KeyboardInterrupt: 

### Logistic regression classifier with encoded data

The next cell trains an L1-regularized multinomial logistic regression classifier on the autoencoded data for each CV split. It records timings, accuracy, weighted F1 score and confusion matrices, and also evaluates each model on a second test set (test2).

[Back to contents](#Contents)

In [None]:
data_dirname = '../data/'
dirname_results = '../results/'
filename_results = 'autoencoder+LR_parts'
data_filename = 'autoencoded_dataset_parts.npz'
data_test2_filename = 'autoencoded_test2.npz'

filename_cv = 'cv_indices.npz'
df = np.load(data_dirname+filename_cv)
test_indices, train_indices = df['test_indices'], df['train_indices']

df = np.load(data_dirname+data_test2_filename)
X_test2, y_test2 = df['data'], df['label']
y_test2 = reshape(y_test2, [-1, 1])

df = np.load(data_dirname+data_filename)
X, y = df['data'], df['label']
y = reshape(y, [-1, 1])
colnames = ['identity'] + ['V%d' % (i) for i in xrange(X.shape[-1])]

tms = []
predict_train_all = []
predict_test_all = []
predict_test2_all = []

confusion_matrices = []
accuracies = []
f1s = []

# predicted class probabilities per split; the last column holds the true label
predicted_probas_test = []
predicted_probas_test2 = []
for k in xrange(len(train_indices)):
    print "CV %d / %d" % (k+1, len(train_indices))
    train_index = train_indices[k]
    test_index = test_indices[k]
    
    classifier = LogisticRegression(
        penalty='l1', dual=False, tol=0.0001, C=1000.0, fit_intercept=True,
        intercept_scaling=1, class_weight=None, random_state=None,
        solver='saga', max_iter=1000, multi_class='multinomial', verbose=0,
        warm_start=False, n_jobs=1
    )
    
    tic = time.clock()
    # y is a column vector; scikit-learn expects a 1-D label array
    classifier.fit(X[k][train_index], y[train_index].ravel())
    toc = time.clock()
    
    tms_loc = [toc-tic]
    
    tic = time.clock()
    predict_train = classifier.predict(X[k][train_index])
    toc = time.clock()
    tms_loc.append(toc-tic)
    acc_loc = [accuracy_score(y[train_index], predict_train)]
    f1_loc = [f1_score(y[train_index], predict_train, average='weighted')]
    tic = time.clock()
    predict_test = classifier.predict(X[k][test_index])
    toc = time.clock()
    acc_loc.append( accuracy_score(y[test_index], predict_test) )
    f1_loc.append(f1_score(y[test_index], predict_test, average='weighted') )
    confusion_matrices.append(confusion_matrix(y[test_index], predict_test))
    tms_loc.append(toc-tic)
    
    tmp = reshape(np.array(y[test_index]), [-1, 1])
    tmp = np.hstack([classifier.predict_proba(X[k][test_index]), tmp])
    predicted_probas_test.append( tmp.copy() )
    tmp = reshape(np.array(y_test2), [-1, 1])
    tmp = np.hstack([classifier.predict_proba(X_test2[k]), tmp])
    predicted_probas_test2.append( tmp.copy() )
    
    predict_test2 = classifier.predict(X_test2[k])
    acc_loc.append( accuracy_score(y_test2, predict_test2) )
    f1_loc.append(f1_score(y_test2, predict_test2, average='weighted') )
    
    accuracies.append(acc_loc)
    f1s.append(f1_loc)
    tms.append(tms_loc)
    predict_train_all.append( predict_train )
    predict_test_all.append( predict_test )
    predict_test2_all.append( predict_test2 )
    np.savez_compressed(
        dirname_results+filename_results, tms=tms, predict_train=predict_train_all,
        predict_test=predict_test_all, predict_test2=predict_test2_all, test_indices=test_indices,
        train_indices=train_indices, y_test2=y_test2.T, y=y, confusion_matrices=confusion_matrices,
        acc=accuracies, f1=f1s, predicted_probas_test=predicted_probas_test,
        predicted_probas_test2=predicted_probas_test2
    )
accuracies = np.array(accuracies)
f1s = np.array(f1s)
print "accuracies"
print np.median(accuracies, axis=0)
print "f1 measure"
print np.median(f1s, axis=0)
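
The arrays stored in predicted_probas_test and predicted_probas_test2 hold the class probabilities followed by the true label in the last column. A minimal sketch of how such an array can be split again, using made-up numbers rather than results from this notebook. It assumes the class labels are the integers 0..K-1 in column order; otherwise the column index has to be mapped through `classifier.classes_`.

```python
import numpy as np

# Made-up probabilities for 3 test samples and 4 classes (not real results).
proba = np.array([
    [0.7, 0.1, 0.1, 0.1],
    [0.2, 0.5, 0.2, 0.1],
    [0.1, 0.1, 0.2, 0.6],
])
y_true = np.array([0, 2, 3])

# Same layout as in the loop: probabilities first, true label as last column.
stacked = np.hstack([proba, y_true.reshape(-1, 1)])

# Split the array back into probabilities and labels.
probs = stacked[:, :-1]
true_labels = stacked[:, -1].astype(int)

# Most probable class per sample (column index, assumed equal to the label).
pred_labels = probs.argmax(axis=1)
accuracy = np.mean(pred_labels == true_labels)
```

Here the second sample is misclassified, so `accuracy` is 2/3. Keeping the true label next to the probabilities makes each saved array self-contained for later analysis, at the cost of storing the labels as floats.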