## Training several SVMs

Training SVMs on large datasets (1 million examples) is not feasible in a reasonable time because of the algorithm's complexity (see notebook 15.0).

We therefore train several SVMs here on datasets of increasing size (10000, 20000, ...) in order to estimate the training time as a function of dataset size (in theory, a cubic function of the number of examples). This also gives us an idea of the performance of SVM-type models.
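As a rough sketch of how such timings could be used, the snippet below fits a power law $t = a \cdot n^k$ by linear regression in log-log space and extrapolates to a larger dataset. The function names and the timings are illustrative assumptions: the timings are generated from an exact cubic law and are not measurements from the runs in this notebook.

```python
import math


def fit_power_law(sizes, times):
    """Least-squares fit of log(t) = log(a) + k * log(n); returns (a, k)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in times]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    k = sxy / sxx                      # estimated exponent
    a = math.exp(y_mean - k * x_mean)  # estimated prefactor
    return a, k


def predict_time(a, k, n):
    """Extrapolate the fitted law to a dataset of n examples."""
    return a * n ** k


# Synthetic timings following an exact cubic law (NOT measured values).
sizes = [10_000, 20_000, 60_000]
times = [1e-12 * n ** 3 for n in sizes]

a, k = fit_power_law(sizes, times)
print(f"estimated exponent: {k:.2f}")
print(f"extrapolated time for 1,000,000 examples: {predict_time(a, k, 1_000_000):.3g}")
```

With real measurements, an estimated exponent noticeably below 3 would suggest that the cubic bound is pessimistic for this data.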

In addition to reducing the number of examples in our datasets, we also reduce the width of each example by limiting to 15 the number of atoms considered in the neighbourhood of a bond. If a bond has more than 15 neighbouring atoms, the example is not recorded in the training set.
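The filtering rule can be illustrated with a minimal sketch. The example structure (a dictionary with a `neighbours` list) and the function names are hypothetical and do not reflect the actual data preparation code.

```python
MAX_NEIGHBOURS = 15  # same value as "bond_max_neighbours" in the configurations


def keep_example(neighbour_atoms, max_neighbours=MAX_NEIGHBOURS):
    """Return True when the bond has at most `max_neighbours` nearby atoms."""
    return len(neighbour_atoms) <= max_neighbours


def filter_examples(examples, max_neighbours=MAX_NEIGHBOURS):
    """Drop every example whose bond has too many neighbouring atoms."""
    return [ex for ex in examples if keep_example(ex["neighbours"], max_neighbours)]


# Hypothetical examples with 12, 15 and 16 neighbouring atoms.
examples = [
    {"bond_id": 0, "neighbours": list(range(12))},
    {"bond_id": 1, "neighbours": list(range(15))},
    {"bond_id": 2, "neighbours": list(range(16))},
]
kept = filter_examples(examples)
print([ex["bond_id"] for ex in kept])
```

Only the last example exceeds the limit, so it is the only one discarded.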

### JSON configurations

In [None]:
{
  "paths":{
        "train_set_loc":"../../data/train_set_riken_v2_reduced.h5",
        "test_set_loc":"../../data/test_set_riken_v2_reduced.h5",
        "train_prepared_input_loc":"../../data/DIST_REL_C_SVM_01/train_set_prepared_input.h5",
        "test_prepared_input_loc":"../../data/DIST_REL_C_SVM_01/test_set_prepared_input.h5",
        "train_labels_loc":"../../data/DIST_REL_C_SVM_01/train_set_labels.h5",
        "test_labels_loc":"../../data/DIST_REL_C_SVM_01/test_set_labels.h5",
        "model_loc":"../../models/DIST_REL_C_SVM_01/DIST_REL_C_SVM_01.pkl",
        "bonds_lengths_loc":"/home/jleguy/data/stats/CC/CC_bonds_lengths_total_set.h5",
        "plots_dir":"tests/DIST_REL_C_SVM_01/"
  },
  "tasks":[
    {
      "prepare_model_data_IGNORED": {
        "selected_mols": {
          "mol_min_size": "2",
          "mol_max_size": "60",
          "max_anum": "9",
          "anum_1": "6",
          "anum_2": "6",
          "min_bond_size": "0",
          "max_bond_size": "1.6",
          "bond_max_neighbours":"15"
        },
        "params": {
          "wished_train_size": "10000",
          "wished_test_size": "50000",
          "pos_class": "True",
          "one_hot_anums": "True",
          "amasses": "True",
          "distances": "True",
          "distances_cut_off": "2",
          "batch_size": "10000",
          "distances_fun":"inv"
        }
      }
    },
    {
      "model_train":{
        "model_name":"DIST_REL_C_SVM_01",
        "model_type":"SVM",
        "params":{
          "kernel":"poly",
          "degree":"2",
          "epsilon":"0.1",
          "gamma":"auto",
          "coef0":"0",
          "shrinking":"True",
          "tol":"0.001",
          "cache_size":"500",
          "verbose":"True",
          "save_model": "True",
          "max_iter":"-1",
          "C":"1"
        }
      }
    },
    {
      "plot_predictions": {
        "params": {
          "model_name": "DIST_REL_C_SVM_01",
          "model_type": "SVM",
          "anum_1": "6",
          "anum_2": "6",
          "plot_error_distrib": "True",
          "plot_targets_error_distrib": "True",
          "plot_targets_predictions": "True",
          "asymb_1": "C",
          "asymb_2": "C",
          "batch_size": "1060",
          "display_plots":"True"
        }
      }
    }
  ]
}

In [None]:
{
  "paths":{
        "train_set_loc":"../../data/train_set_riken_v2_reduced.h5",
        "test_set_loc":"../../data/test_set_riken_v2_reduced.h5",
        "train_prepared_input_loc":"../../data/DIST_REL_C_SVM_02/train_set_prepared_input.h5",
        "test_prepared_input_loc":"../../data/DIST_REL_C_SVM_02/test_set_prepared_input.h5",
        "train_labels_loc":"../../data/DIST_REL_C_SVM_02/train_set_labels.h5",
        "test_labels_loc":"../../data/DIST_REL_C_SVM_02/test_set_labels.h5",
        "model_loc":"../../models/DIST_REL_C_SVM_02/DIST_REL_C_SVM_02.pkl",
        "bonds_lengths_loc":"/home/jleguy/data/stats/CC/CC_bonds_lengths_total_set.h5",
        "plots_dir":"tests/DIST_REL_C_SVM_02/"
  },
  "tasks":[
    {
      "prepare_model_data": {
        "selected_mols": {
          "mol_min_size": "2",
          "mol_max_size": "60",
          "max_anum": "9",
          "anum_1": "6",
          "anum_2": "6",
          "min_bond_size": "0",
          "max_bond_size": "1.6",
          "bond_max_neighbours":"15"

        },
        "params": {
          "wished_train_size": "20000",
          "wished_test_size": "50000",
          "pos_class": "True",
          "one_hot_anums": "True",
          "amasses": "True",
          "distances": "True",
          "distances_cut_off": "2",
          "batch_size": "10000",
          "distances_fun":"inv"
        }
      }
    },
    {
      "model_train":{
        "model_name":"DIST_REL_C_SVM_02",
        "model_type":"SVM",
        "params":{
          "kernel":"poly",
          "degree":"2",
          "epsilon":"0.1",
          "gamma":"auto",
          "coef0":"0",
          "shrinking":"True",
          "tol":"0.001",
          "cache_size":"500",
          "verbose":"True",
          "save_model": "True",
          "max_iter":"-1",
          "C":"1"
        }
      }
    },
    {
      "plot_predictions": {
        "params": {
          "model_name": "DIST_REL_C_SVM_02",
          "model_type": "SVM",
          "anum_1": "6",
          "anum_2": "6",
          "plot_error_distrib": "True",
          "plot_targets_error_distrib": "True",
          "plot_targets_predictions": "True",
          "asymb_1": "C",
          "asymb_2": "C",
          "batch_size": "1060",
          "display_plots":"True"
        }
      }
    }
  ]
}

In [None]:
{
  "paths":{
        "train_set_loc":"../../data/train_set_riken_v2_reduced.h5",
        "test_set_loc":"../../data/test_set_riken_v2_reduced.h5",
        "train_prepared_input_loc":"../../data/DIST_REL_C_SVM_03/train_set_prepared_input.h5",
        "test_prepared_input_loc":"../../data/DIST_REL_C_SVM_03/test_set_prepared_input.h5",
        "train_labels_loc":"../../data/DIST_REL_C_SVM_03/train_set_labels.h5",
        "test_labels_loc":"../../data/DIST_REL_C_SVM_03/test_set_labels.h5",
        "model_loc":"../../models/DIST_REL_C_SVM_03/DIST_REL_C_SVM_03.pkl",
        "bonds_lengths_loc":"/home/jleguy/data/stats/CC/CC_bonds_lengths_total_set.h5",
        "plots_dir":"tests/DIST_REL_C_SVM_03/"
  },
  "tasks":[
    {
      "prepare_model_data": {
        "selected_mols": {
          "mol_min_size": "2",
          "mol_max_size": "60",
          "max_anum": "9",
          "anum_1": "6",
          "anum_2": "6",
          "min_bond_size": "0",
          "max_bond_size": "1.6",
          "bond_max_neighbours":"15"

        },
        "params": {
          "wished_train_size": "60000",
          "wished_test_size": "50000",
          "pos_class": "True",
          "one_hot_anums": "True",
          "amasses": "True",
          "distances": "True",
          "distances_cut_off": "2",
          "batch_size": "10000",
          "distances_fun":"inv"
        }
      }
    },
    {
      "model_train":{
        "model_name":"DIST_REL_C_SVM_03",
        "model_type":"SVM",
        "params":{
          "kernel":"poly",
          "degree":"2",
          "epsilon":"0.1",
          "gamma":"auto",
          "coef0":"0",
          "shrinking":"True",
          "tol":"0.001",
          "cache_size":"500",
          "verbose":"True",
          "save_model": "True",
          "max_iter":"-1",
          "C":"1"
        }
      }
    },
    {
      "plot_predictions": {
        "params": {
          "model_name": "DIST_REL_C_SVM_03",
          "model_type": "SVM",
          "anum_1": "6",
          "anum_2": "6",
          "plot_error_distrib": "True",
          "plot_targets_error_distrib": "True",
          "plot_targets_predictions": "True",
          "asymb_1": "C",
          "asymb_2": "C",
          "batch_size": "1060",
          "display_plots":"True"
        }
      }
    }
  ]
}
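The three configurations above differ essentially in the model name (and the paths derived from it) and in `wished_train_size`. As a hedged sketch, the varying part could be generated rather than copied by hand. The helper names `make_paths` and `make_variable_part` are hypothetical and are not part of the original pipeline.

```python
import json


def make_paths(model_name):
    """Build the "paths" section; only the model name changes between runs."""
    data_dir = f"../../data/{model_name}"
    return {
        "train_set_loc": "../../data/train_set_riken_v2_reduced.h5",
        "test_set_loc": "../../data/test_set_riken_v2_reduced.h5",
        "train_prepared_input_loc": f"{data_dir}/train_set_prepared_input.h5",
        "test_prepared_input_loc": f"{data_dir}/test_set_prepared_input.h5",
        "train_labels_loc": f"{data_dir}/train_set_labels.h5",
        "test_labels_loc": f"{data_dir}/test_set_labels.h5",
        "model_loc": f"../../models/{model_name}/{model_name}.pkl",
        "bonds_lengths_loc": "/home/jleguy/data/stats/CC/CC_bonds_lengths_total_set.h5",
        "plots_dir": f"tests/{model_name}/",
    }


def make_variable_part(index, train_size):
    """Return only the fields that differ from one configuration to the next."""
    model_name = f"DIST_REL_C_SVM_{index:02d}"
    return {
        "paths": make_paths(model_name),
        "model_name": model_name,
        "wished_train_size": str(train_size),  # the configs store values as strings
    }


# Training sizes used by the three configurations above.
for index, train_size in enumerate([10000, 20000, 60000], start=1):
    print(json.dumps(make_variable_part(index, train_size), indent=2))
```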

## Analysis of the DIST_REL_C_SVM_03 results

#### Error statistics (pm)

```
Plotting DIST_REL_C_SVM_03
Dataset size : 51554
Mean error : 1.285404665873713
Median error : 0.5004685638983914
Standard deviation : 2.23017220759463
Min error : 2.8341046186142193e-05
Max error : 24.222475670396854
Relative error : 0.8907730317428133%
```
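For reference, statistics of this kind can be computed from targets and predictions as sketched below. This is an illustration under assumptions, not the plotting code that produced the log: errors are taken as absolute differences, the standard deviation is the population one, and the relative error is assumed to be the mean absolute error divided by the mean target. The sample values are made up.

```python
import statistics


def error_stats(targets, predictions):
    """Summarise absolute prediction errors (same unit as the inputs, e.g. pm)."""
    abs_errors = [abs(p - t) for t, p in zip(targets, predictions)]
    mean_error = statistics.fmean(abs_errors)
    return {
        "size": len(abs_errors),
        "mean": mean_error,
        "median": statistics.median(abs_errors),
        "std": statistics.pstdev(abs_errors),  # population standard deviation
        "min": min(abs_errors),
        "max": max(abs_errors),
        # Assumed definition: mean absolute error relative to the mean target.
        "relative_pct": 100 * mean_error / statistics.fmean(targets),
    }


# Made-up bond lengths in pm, for illustration only.
targets = [150.0, 140.0, 154.0, 133.0]
predictions = [151.0, 139.5, 154.0, 135.0]
stats = error_stats(targets, predictions)
for name, value in stats.items():
    print(f"{name} : {value}")
```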

#### Error distribution

![title](../figures/DIST_REL_C_SVM_03/DIST_REL_C_SVM_03_distrib_rmse_val.png)

#### Errors as a function of target distances

![title](../figures/DIST_REL_C_SVM_03/DIST_REL_C_SVM_03_distrib_rmse_dist.png)


#### Predictions as a function of target distances

![title](../figures/DIST_REL_C_SVM_03/DIST_REL_C_SVM_03_preds_targets.png)

