# Feature Extraction from CNN

In this notebook we extract the values that would normally be passed on to the final fully connected layer. We use the values at this penultimate stage as features for machine learning classifiers that are not neural networks. The values are computed for the training and test sets with a fully trained CNN with attention. The (dis)advantage of the CNN with attention is that it has half the feature count of the normal CNN. The CNN has been trained beforehand on the training set. To generate the features, it basically runs in test mode, which means `model.eval()` and `volatile=True` for the input `autograd.Variable`s, since PyTorch is used here. The generated values are saved as CSV files. Each feature column in a CSV file holds one result of the 1-max-pooling that is currently used, and the last column holds the label. The actual training of the ML classifiers happens in a different notebook.
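
As a rough illustration of the idea, the sketch below uses a hypothetical `TinyTextCNN` with a `log_features` flag that returns the 1-max-pooled values instead of passing them to the final layer. It is not the model used in this notebook: the class, its layer sizes and the input shape are assumptions made for the example. It also uses `torch.no_grad()`, which in PyTorch 0.4 and later takes over the role that `volatile=True` plays here.

```python
import torch
import torch.nn as nn


class TinyTextCNN(nn.Module):
    """Hypothetical stand-in for the notebook's CNN, not the real model."""

    def __init__(self, emb_dim=8, kernel_num=4, num_classes=2):
        super().__init__()
        self.conv = nn.Conv1d(emb_dim, kernel_num, kernel_size=3)
        self.fc = nn.Linear(kernel_num, num_classes)
        self.log_features = False  # mirrors the flag name used in this notebook

    def forward(self, replies):
        # replies: (batch, emb_dim, seq_len)
        maps = torch.relu(self.conv(replies))  # (batch, kernel_num, seq_len - 2)
        pooled = maps.max(dim=2)[0]            # 1-max-pooling: (batch, kernel_num)
        if self.log_features:
            return pooled                      # stop before the final layer
        return self.fc(pooled)


model = TinyTextCNN()
model.log_features = True
model.eval()  # switch layers such as dropout to inference behaviour

batch = torch.randn(5, 8, 20)  # made-up input: 5 replies, 20 positions each
with torch.no_grad():          # no autograd graph is built inside this block
    features = model(replies=batch)

print(features.shape)  # torch.Size([5, 4])
```

Each of the four output columns corresponds to one convolution kernel, which is why the feature count equals the kernel count.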

In [None]:

import json
import os
from pathlib import Path

path = os.path.realpath(os.path.join('..', '..'))
os.chdir(path)

import numpy as np
import pandas as pd
import torch
from tqdm import tqdm  # _notebook as tqdm
from torch.autograd.variable import Variable

from src.training.learning_session import LearningSession
from src.tools.config import Config

We would have to load the JSON config file of the CNN anyway, because the names of the saved network files depend on it. We therefore use the `LearningSession` class in an irregular way, which avoids duplicating whole functions and saves some work. The same goes for the preprocessed data used as CNN input. Be aware that the config file must contain the actual parameters of the training session you want to load your model from.

In [None]:
train_pipe_config_path = Path(Config.path.project_root_folder) / 'src' / 'strategies'
train_pipe_config_path = train_pipe_config_path / 'cnn' / 'attentive_cnn_config.json'
data_folder = Path(Config.path.data_folder)
train_file = data_folder / "features_train_i.csv"
test_file = data_folder / "features_test_i.csv"

with open(str(train_pipe_config_path)) as file:
    params = json.load(file)

logger_args = {
    'tensorboard_log_dir': Config.path.log_folder,
    'mongo_host': 'localhost',
    'mongo_port': Config.logging.port
}
params['logger']['args'].update(logger_args)
params['learning_session'].update({'cache_folder': Config.path.cache_folder})
ls = LearningSession(params)

In [None]:
%%time
result_dict = ls.data_factory.get_data()
train_datadict = result_dict['train_data']
test_datadict = result_dict['test_data']
reply_lengths = result_dict.get('reply_lengths', None)
ls.word_vectors = result_dict.get('word_vectors', None)
model = ls._load_saved_model(fold=1, mode="tr_full")

if model is None:
    raise RuntimeError("Loading of model has failed, there seems to be no model with the given parameters")

# We MUST change the flag for the feature extraction after loading
if not hasattr(model, 'log_features'):
    raise AttributeError("Model does not have log_features attribute, wrong model type!")
model.log_features = True

In [None]:
key = 'cv_iterator_factory'
args = {'reply_lengths': reply_lengths}
ls.args[key] = ls._update_args(key, args)
ls._load_cv_iterator_factory()
train_data_loader = ls._create_dataloader(data_dict=train_datadict, reply_lengths=None)
test_data_loader = ls._create_dataloader(data_dict=test_datadict, reply_lengths=None)
train_data_loader.shuffle = False
test_data_loader.shuffle = False

In [None]:
# The NumPy arrays get feat_num + 1 columns because of the label column
feat_num = params["model"]["args"]["hl1_kernel_num"]
features_train = np.zeros((196526, feat_num + 1), dtype='float32')
features_test = np.zeros((21836, feat_num + 1), dtype='float32')
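
The row counts above are hard-coded for this data set. A minimal sketch of sizing the table from the data instead is shown below. `dataset` is a made-up list standing in for the real samples, and `feat_num` is set by hand here. For a standard `torch.utils.data.DataLoader`, `len(loader.dataset)` returns the number of samples; whether the loaders built by `_create_dataloader` expose that attribute is an assumption.

```python
import numpy as np

# Hypothetical stand-ins: the notebook reads feat_num from the config and
# takes the samples from its data loaders.
feat_num = 4
dataset = [("some reply", 0), ("another reply", 1), ("third reply", 0)]

# One row per sample, feat_num feature columns plus one trailing label column
table = np.zeros((len(dataset), feat_num + 1), dtype="float32")
print(table.shape)  # (3, 5)
```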

In [None]:
has_cuda = torch.cuda.is_available()
column_names = [str(i) for i in range(feat_num)]
column_names.append("label")

In [None]:
if has_cuda:
    model.cuda()

## The functions we will need

In [None]:
def extract_features(dataloader, model, feat_num, feature_table):
    def _step(variable_dict, labels):
        features = model(**variable_dict)
        batch_size = len(variable_dict["replies"])
        feature_table[i:i + batch_size, :feat_num] = features.squeeze().data.cpu().numpy()
        feature_table[i:i + batch_size, feat_num] = labels.cpu().numpy()

    # i advances by the nominal batch size. Only the last batch will probably be smaller, which is no problem
    batch_size = dataloader.batch_sampler.batch_size
    model.eval()
    i = 0

    for data, labels in tqdm(dataloader):
        variable_dict = {k: Variable(v, volatile=True) for k, v in data.items()}
        if has_cuda:
            # Move the inputs to the GPU, where the model lives
            variable_dict = {k: v.cuda() for k, v in variable_dict.items()}
        _step(variable_dict, labels)
        i += batch_size
    return feature_table
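
The index bookkeeping in `extract_features` can be sketched in isolation. The loop advances `i` by the nominal batch size, while each write uses the actual length of the batch, so a shorter final batch still lands in the right rows. The helper `batch_slices` below is hypothetical and assumes that every batch except possibly the last one is full.

```python
def batch_slices(num_rows, batch_size):
    """Yield the (start, stop) row ranges the extraction loop writes to."""
    start = 0
    while start < num_rows:
        # The final batch may hold fewer than batch_size rows
        stop = min(start + batch_size, num_rows)
        yield start, stop
        start += batch_size


slices = list(batch_slices(num_rows=10, batch_size=4))
print(slices)  # [(0, 4), (4, 8), (8, 10)]
```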

**This will take some seconds:**

In [None]:
%%time
features_train = extract_features(train_data_loader, model, feat_num, features_train)

In [None]:
%%time
features_test = extract_features(test_data_loader, model, feat_num, features_test)

In [None]:
%%time
train_df = pd.DataFrame(data=features_train, dtype='float32', columns=column_names)
test_df = pd.DataFrame(data=features_test, dtype='float32', columns=column_names)
train_df.to_csv(str(train_file), index=False)
test_df.to_csv(str(test_file), index=False)
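
To make the resulting file layout concrete, the sketch below writes and re-reads a tiny CSV with the same column scheme: feature columns named `"0"` to `str(feat_num - 1)` followed by `"label"`. The values and `feat_num = 3` are made up, and the standard-library `csv` module is used only to keep the example self-contained. Because the table is stored as `float32`, the labels come back as floats and are converted to integers here.

```python
import csv
import io

feat_num = 3  # hypothetical; the notebook reads it from the config
column_names = [str(i) for i in range(feat_num)] + ["label"]

# Two made-up rows standing in for extracted features plus label
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(column_names)
writer.writerow([0.5, 1.25, 0.0, 1.0])
writer.writerow([0.1, 0.0, 2.0, 0.0])

# Read the rows back and split them into features and labels
buffer.seek(0)
rows = list(csv.DictReader(buffer))
X = [[float(row[name]) for name in column_names[:-1]] for row in rows]
y = [int(float(row["label"])) for row in rows]

print(X)  # [[0.5, 1.25, 0.0], [0.1, 0.0, 2.0]]
print(y)  # [1, 0]
```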

### Finished