# Precomputing dataset splits
This notebook shows how the dataset splits were precomputed, under different data split approaches and for different dataset versions.
The dataset versions handled are:
* v1: symbols dataset with only 10 classes, by topic (network, cryptography, disk, etc.)
* v2: symbols dataset with more than 100 classes, by topic and task (network_send, cryptography_encrypt, network_config, ...)
* v3: symbols dataset with a selection of 24 classes covering both topic and task
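
The v2 examples above suggest that a label joins a topic and a task with an underscore. As a minimal sketch under that assumption (the helper name `split_topic_task` is hypothetical and not part of the project modules), a topic-and-task label could be collapsed back to its topic like this:

```python
from typing import Optional, Tuple


def split_topic_task(label: str) -> Tuple[str, Optional[str]]:
    """Split 'topic_task' into (topic, task); task is None for topic-only labels.

    Assumption: the first underscore separates the topic from the task.
    """
    topic, separator, task = label.partition("_")
    return topic, (task if separator else None)


labels_v2 = ["network_send", "cryptography_encrypt", "network_config"]
topics = sorted({split_topic_task(label)[0] for label in labels_v2})
print(topics)  # ['cryptography', 'network']
```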

The data split approaches are:
* 1) split the dataset without modifying the class imbalance
* 2) remove minority classes, i.e. classes with fewer samples than a minimum threshold
* 3) undersample majority classes, i.e. classes with more samples than a threshold are undersampled
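
The three approaches listed above can be sketched in plain Python. This is only an illustration, not the code in `TFM_function_renaming_preprocess_dataset_splits`: the function names, the `test_fraction`, `min_count` and `max_count` parameters, and the toy data are all assumptions.

```python
import random
from collections import Counter, defaultdict


def plain_split(samples, labels, test_fraction=0.25, seed=0):
    """Approach 1: shuffle and split, leaving the class imbalance untouched."""
    order = list(range(len(labels)))
    random.Random(seed).shuffle(order)
    n_test = int(len(order) * test_fraction)
    test, train = order[:n_test], order[n_test:]
    return ([samples[i] for i in train], [samples[i] for i in test],
            [labels[i] for i in train], [labels[i] for i in test])


def remove_minority_classes(samples, labels, min_count):
    """Approach 2: drop every class with fewer than `min_count` samples."""
    counts = Counter(labels)
    kept = [(s, y) for s, y in zip(samples, labels) if counts[y] >= min_count]
    return [s for s, _ in kept], [y for _, y in kept]


def undersample_majority_classes(samples, labels, max_count, seed=0):
    """Approach 3: randomly cap every class at `max_count` samples."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, y in enumerate(labels):
        by_class[y].append(index)
    keep = []
    for indices in by_class.values():
        if len(indices) > max_count:
            indices = rng.sample(indices, max_count)
        keep.extend(indices)
    keep.sort()  # preserve the original sample order
    return [samples[i] for i in keep], [labels[i] for i in keep]


# Toy data: class "a" dominates and class "c" has a single sample.
labels = ["a"] * 6 + ["b"] * 3 + ["c"]
samples = list(range(len(labels)))

X2, y2 = remove_minority_classes(samples, labels, min_count=2)
X3, y3 = undersample_majority_classes(samples, labels, max_count=3)
print(sorted(Counter(y2).items()))  # [('a', 6), ('b', 3)]
print(sorted(Counter(y3).items()))  # [('a', 3), ('b', 3), ('c', 1)]
```

Seeding the random generator keeps the undersampled subset reproducible across runs, which matters when the splits are stored and reused.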


In [67]:
%load_ext autoreload
%autoreload 2
from TFM_function_renaming_baseline_models import *
from TFM_function_renaming_preprocess_dataset_splits import *
from TFM_function_renaming_nlp_models import *

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## Testing tfidf on v3 dataset

In [2]:
import pickle
split_dir = 'tmp/symbols_dataset_3_precomp_split_unchanged'
def load_pickle(name):
    # Load one precomputed artifact; the with-block closes the file handle.
    with open(f'{split_dir}/{name}.pickle', 'rb') as f:
        return pickle.load(f)

X_train, X_test = load_pickle('X_train'), load_pickle('X_test')
y_train, y_test = load_pickle('y_train'), load_pickle('y_test')
nclasses = load_pickle('nclasses')
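
The cell above only reads the precomputed files. For completeness, here is a hedged sketch of how such a split folder could be written in the first place. The helper `save_split` and the toy values are hypothetical; only the five file names are taken from the cell above.

```python
import pickle
import tempfile
from pathlib import Path

# File stems taken from the loading cell; everything else is illustrative.
SPLIT_PARTS = ("X_train", "X_test", "y_train", "y_test", "nclasses")


def save_split(folder, **parts):
    """Write each part of a split to '<folder>/<name>.pickle'."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    for name in SPLIT_PARTS:
        with open(folder / f"{name}.pickle", "wb") as handle:
            pickle.dump(parts[name], handle)


# Round trip on toy data inside a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "toy_split"
    save_split(target, X_train=[[0], [1]], X_test=[[2]],
               y_train=[0, 1], y_test=[1], nclasses=2)
    written = sorted(path.name for path in target.iterdir())
    with open(target / "nclasses.pickle", "rb") as handle:
        restored_nclasses = pickle.load(handle)

print(written)
print(restored_nclasses)  # 2
```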


In [19]:
dataset_version='v3'
features = 'document'
min_count=0
nlp_models_training_and_testing(
        X_train, X_test, 
        y_train, y_test,
        features,dataset_version,
        nclasses,results_folder='results/tfidf_params_hp_search_v3.json')
print_training_stats('v3',fileversion='results/tfidf_params_hp_search_v3.json')


cv_train_nn_nlp_models, nclasses= 24
Before unrolling:
{'mlp1': {'model': <class 'TFM_function_renaming_baseline_models.mlp1'>,
          'params_set': [{'d1': [20, 50, 100],
                          'd2': [10, 20],
                          'num_epochs': [150]}]},
 'mlp2': {'model': <class 'TFM_function_renaming_baseline_models.mlp2'>,
          'params_set': [{'d2': [50, 100, 150],
                          'd3': [5, 50, 200],
                          'num_epochs': [100, 200]}]}}
'mlp1'
{'model': <class 'TFM_function_renaming_baseline_models.mlp1'>,
 'params_set': [{'d1': [20, 50, 100], 'd2': [10, 20], 'num_epochs': [150]}]}
{'model': <class 'TFM_function_renaming_baseline_models.mlp1'>,
 'params_set': [{'d1': [20, 50, 100],
                 'd2': [10, 20],
                 'model_class': [<class 'TFM_function_renaming_baseline_models.mlp1'>],
                 'num_classes': [24],
                 'num_epochs': [150],
                 'preprocessor__tfidf__tvec__max_df': [0.8],
  

KeyboardInterrupt: 
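
The "Before unrolling" output above shows each model with lists of candidate values per hyperparameter. How the project code unrolls them is not shown in this notebook. As a sketch of the general idea, a grid of that shape can be expanded into one dictionary per combination with `itertools.product` (the function name `unroll_param_grid` is an assumption):

```python
from itertools import product


def unroll_param_grid(param_grid):
    """Expand {'name': [candidates, ...]} into one dict per combination."""
    names = sorted(param_grid)
    value_lists = [param_grid[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]


# The mlp1 grid printed above, without the preprocessor entries.
grid = {"d1": [20, 50, 100], "d2": [10, 20], "num_epochs": [150]}
combinations = unroll_param_grid(grid)
print(len(combinations))  # 6, i.e. 3 * 2 * 1
print(combinations[0])    # {'d1': 20, 'd2': 10, 'num_epochs': 150}
```

scikit-learn's `sklearn.model_selection.ParameterGrid` provides the same kind of expansion, if depending on it is acceptable.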

In [68]:
features = 'document and topo feats'
nlp_models_training_and_testing(
    X_train, X_test, y_train, y_test,
    features,dataset_version,
    nclasses,results_folder='results/tfidf_params_hp_search_v3.json',
    nlp_models=prepare_models_quick(),
    nn_models=prepare_nn_models_quick())


cv_train_nn_nlp_models, nclasses= 24
Before unrolling:
{'mlp1': {'model': <class 'TFM_function_renaming_baseline_models.mlp1'>,
          'params_set': [{'d1': [1], 'd2': [3], 'num_epochs': [2]}]}}
'mlp1'
{'model': <class 'TFM_function_renaming_baseline_models.mlp1'>,
 'params_set': [{'d1': [1], 'd2': [3], 'num_epochs': [2]}]}
{'model': <class 'TFM_function_renaming_baseline_models.mlp1'>,
 'params_set': [{'d1': [1],
                 'd2': [3],
                 'model_class': [<class 'TFM_function_renaming_baseline_models.mlp1'>],
                 'num_classes': [24],
                 'num_epochs': [2],
                 'preprocessor__tfidf__tvec__max_df': [0.8],
                 'preprocessor__tfidf__tvec__max_features': [100],
                 'preprocessor__tfidf__tvec__min_df': [0.1],
                 'preprocessor__tfidf__tvec__ngram_range': [(2, 3)]}]}
[{'d1': 1,
  'd2': 3,
  'model_class': <class 'TFM_function_renaming_baseline_models.mlp1'>,
  'num_classes': 24,
  'num_epochs'

  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)


Training  LogisticRegression
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

[10]

GridseachCV for  f1_micro


  'precision', 'predicted', average, warn_for)





Before save results, results_folder= results/tfidf_params_hp_search_v3.json 



{'f1_micro': {'0': {'f1-score': 0.5669663984855655,
                    'precision': 0.4795836669335468,
                    'recall': 0.6932870370370371,
                    'support': 864},
              '1': {'f1-score': 0.3212996389891697,
                    'precision': 0.3662551440329218,
                    'recall': 0.2861736334405145,
                    'support': 622},
              '10': {'f1-score': 0.20202020202020202,
                     'precision': 0.45454545454545453,
                     'recall': 0.12987012987012986,
                     'support': 77},
              '11': {'f1-score': 0.0,
                     'precision': 0.0,
                     'recall': 0.0,
                     'support': 41},
              '12': {'f1-score': 0.10555555555555557,
                     'precision': 0.34545454545454546,
                     'recall': 0.06229508196721312,
                     'su

KeyboardInterrupt: 
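
The per-class report above lists precision, recall, F1 and support for each class. As a sanity check, the F1 values can be recomputed from precision and recall, and the per-class values can be summarised with a support-weighted average. The sketch below uses only the four complete entries shown above. The helper names are hypothetical, and the weighted average is an illustration, not necessarily the aggregation used by `print_training_stats`.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def weighted_average(report, metric):
    """Average a per-class metric, weighting each class by its support."""
    total = sum(entry["support"] for entry in report.values())
    return sum(entry[metric] * entry["support"] for entry in report.values()) / total


# Complete entries copied from the output above (classes 0, 1, 10 and 11).
report = {
    "0": {"f1-score": 0.5669663984855655, "precision": 0.4795836669335468,
          "recall": 0.6932870370370371, "support": 864},
    "1": {"f1-score": 0.3212996389891697, "precision": 0.3662551440329218,
          "recall": 0.2861736334405145, "support": 622},
    "10": {"f1-score": 0.20202020202020202, "precision": 0.45454545454545453,
           "recall": 0.12987012987012986, "support": 77},
    "11": {"f1-score": 0.0, "precision": 0.0, "recall": 0.0, "support": 41},
}

for label, entry in report.items():
    recomputed = f1_from(entry["precision"], entry["recall"])
    print(label, abs(recomputed - entry["f1-score"]) < 1e-9)

print(round(weighted_average(report, "f1-score"), 4))
```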

## Check everything works (non-tfidf models)

In [21]:
features = 'x_topo_feats'
baseline_training_and_testing(
    X_train, X_test, y_train, y_test,
    features,dataset_version,
    nclasses,results_folder='results/tfidf_params_hp_search_v3.json',
    baseline_models=prepare_models_quick(),
    baseline_nn_models=prepare_nn_models_quick())

Training  LogisticRegression
GridseachCV for  f1_micro


  'precision', 'predicted', average, warn_for)


Training  RandomForestClassifier
GridseachCV for  f1_micro


  'precision', 'predicted', average, warn_for)



nn_train_models, nclasses= 24
n_X_cols for the nn:  10  and X shape  (23163, 10)
params_set
{'d1': [1], 'd2': [3], 'num_epochs': [2]}


In [22]:
features = 'code feats'
baseline_training_and_testing(
    X_train, X_test, y_train, y_test,
    features,dataset_version,
    nclasses,results_folder='results/tfidf_params_hp_search_v3.json',
    baseline_models=prepare_models_quick(),
    baseline_nn_models=prepare_nn_models_quick())

Training  LogisticRegression
GridseachCV for  f1_micro


  'precision', 'predicted', average, warn_for)


Training  RandomForestClassifier
GridseachCV for  f1_micro


  'precision', 'predicted', average, warn_for)



nn_train_models, nclasses= 24
n_X_cols for the nn:  7  and X shape  (23163, 7)
params_set
{'d1': [1], 'd2': [3], 'num_epochs': [2]}


## Parsing Results

In [64]:
print_training_stats('','',results_file='results/tfidf_params_hp_search_v3.json')

| Unnamed: 0 | model | parameters | data features | optimized score | avg score in cv | micro-precision | micro-recall | micro-f1 | support |
|---|---|---|---|---|---|---|---|---|---|
| 0 | LogisticRegression | C:1__max_iter:100__multi_class:ovr__penalty:l2... | code feats | f1_micro2019-09-07_19_41_46 | 0.181194 | 0.0235676 | 0.153458 | 0.04086 | 7722 |
| 1 | RandomForestClassifier | max_depth:8__n_estimators:16 | code feats | f1_micro2019-09-07_19_41_47 | 0.294737 | 0.190556 | 0.254856 | 0.171008 | 7722 |
| 2 | mlp1 | | document | f1_micro2019-09-07_19_30_18 | 0.275965 | [, , , ] | [, , , ] | [, , , ] | [, , , ] |
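
The `parameters` column above packs hyperparameters into a single string such as `max_depth:8__n_estimators:16`. A minimal sketch for turning such a string back into a dictionary follows. It assumes that `__` separates entries and `:` separates a name from its value, and that a token without `:` belongs to the next name, since names like `preprocessor__tfidf__tvec__max_df` contain `__` themselves. The function name `parse_parameters` and the last example string are hypothetical.

```python
def parse_parameters(text):
    """Parse 'name:value__name:value' into a dict of strings.

    Tokens without ':' are treated as a prefix of the following name,
    so double underscores inside a parameter name are preserved.
    """
    params = {}
    prefix = []
    for token in text.split("__"):
        if ":" in token:
            name, value = token.split(":", 1)
            params["__".join(prefix + [name])] = value
            prefix = []
        else:
            prefix.append(token)
    return params


print(parse_parameters("max_depth:8__n_estimators:16"))
# {'max_depth': '8', 'n_estimators': '16'}

# Illustrative string, built from a key name seen in the grid output.
print(parse_parameters("num_classes:24__preprocessor__tfidf__tvec__max_df:0.8"))
# {'num_classes': '24', 'preprocessor__tfidf__tvec__max_df': '0.8'}
```

Values stay as strings here. Strings truncated with `...` in the table cannot be parsed reliably and would need the untruncated values from the JSON results file.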


In [65]:
print_all_training_stats('','',results_file='results/tfidf_params_hp_search_v3.json')

| Unnamed: 0 | model | parameters | data features | optimized score | avg score in cv | micro-precision | micro-recall | micro-f1 | support |
|---|---|---|---|---|---|---|---|---|---|
| 0 | mlp1 | | document | f1_micro2019-09-07_19_30_18 | 0.275965 | | | | |
| 1 | mlp1 | d1:100__d2:3__num_classes:24__preprocessor__tf... | document | f1_micro2019-09-07_19_32_15 | 0.237762 | | | | |
| 2 | mlp1 | d1:10__d2:3__num_classes:24 | x_topo_feats | f1_micro2019-09-07_19_39_48 | 0.153458 | | | | |
| 3 | mlp1 | d1:7__d2:3__num_classes:24 | code feats | f1_micro2019-09-07_19_42_00 | 0.153328 | | | | |
| 4 | LogisticRegression | C:1__max_iter:100__multi_class:ovr__penalty:l2... | x_topo_feats | f1_micro2019-09-07_19_39_32 | 0.148729 | 0.0235676 | 0.153458 | 0.04086 | 7722.0 |
| 5 | LogisticRegression | C:1__max_iter:100__multi_class:ovr__penalty:l2... | code feats | f1_micro2019-09-07_19_41_46 | 0.181194 | 0.160886 | 0.180912 | 0.09906 | 7722.0 |
| 6 | RandomForestClassifier | max_depth:8__n_estimators:16 | x_topo_feats | f1_micro2019-09-07_19_39_34 | 0.252083 | 0.190556 | 0.254856 | 0.171008 | 7722.0 |
| 7 | RandomForestClassifier | max_depth:8__n_estimators:16 | code feats | f1_micro2019-09-07_19_41_47 | 0.294737 | 0.387746 | 0.305232 | 0.24593 | 7722.0 |
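
The `optimized score` column above holds values such as `f1_micro2019-09-07_19_41_46`, which look like a metric name followed directly by a run timestamp. That reading is an assumption, as is the helper name `split_run_field`. Under it, the two parts could be separated like this:

```python
import re
from datetime import datetime

# Metric name followed by a 'YYYY-MM-DD_HH_MM_SS' suffix (assumed layout).
RUN_FIELD = re.compile(
    r"^(?P<metric>.*?)(?P<stamp>\d{4}-\d{2}-\d{2}_\d{2}_\d{2}_\d{2})$"
)


def split_run_field(value):
    """Return (metric, datetime), or (value, None) if no timestamp suffix is found."""
    match = RUN_FIELD.match(value)
    if match is None:
        return value, None
    stamp = datetime.strptime(match.group("stamp"), "%Y-%m-%d_%H_%M_%S")
    return match.group("metric"), stamp


metric, stamp = split_run_field("f1_micro2019-09-07_19_41_46")
print(metric, stamp.isoformat())  # f1_micro 2019-09-07T19:41:46
```

With the timestamp parsed, rows from different runs can be ordered chronologically or filtered down to the most recent run per model.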
