# Optimize feature selection
Many different features were calculated (49 in total). In this notebook we try different combinations of these features, to select the best combination. 
Also different weighting methods are selected for the "chemical neighbourhood score"

Different models trained are in order:

Run with 30 features for similar structures, without chemical neighbourhood score
Duplicate score to check how random forest behaves
Results training with cos scores 

# IMPORTANT NOTE:
This notebook will only run in the github branch add_cosine_to_features

### Load in scores

In [1]:
import os
from ms2query.utils import load_pickled_file
from matplotlib import pyplot as plt



In [4]:
training_scores, training_labels, validation_scores, validation_labels = load_pickled_file("C:/Users/jonge094/PycharmProjects/PhD_MS2Query/ms2query/data/libraries_and_models/gnps_15_12_2021/ms2q_training_data_with_additional_weigthing_scores.pickle")

In [13]:
print(training_scores.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 467600 entries, 0 to 467599
Data columns (total 56 columns):
 #   Column                                                   Non-Null Count   Dtype  
---  ------                                                   --------------   -----  
 0   query_precursor_mz                                       467600 non-null  float64
 1   precursor_mz_difference                                  467600 non-null  float64
 2   s2v_score                                                467600 non-null  float64
 3   ms2ds_score                                              467600 non-null  float64
 4   average_ms2ds_score_for_inchikey14                       467600 non-null  float64
 5   nr_of_spectra_with_same_inchikey14                       467600 non-null  int64  
 6   chemical_neighbourhood_score                             467600 non-null  float64
 7   average_tanimoto_score_for_chemical_neighbourhood_score  467600 non-null  float64
 8   nr_of_spectra_

# Test additional features
Different features are tested in different steps. The order of the tests is given below:
- Final selection of features
- (modified) cosine score
- Instrument types
- Method for taking average of multiple library structures
- Method for weighting average of multiple library structures

In [6]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

def train_random_forest(training_scores, validation_scores):
    # train rf using optimised parameters from below

    rf = RandomForestRegressor(n_estimators = 250, 
                               random_state = 42, 
                               max_depth = 5, 
                               verbose=1,
                               min_samples_leaf=50,
                               n_jobs=7)
    rf.fit(selection_of_training_scores, training_labels)

    # predict on train
    rf_train_predictions = rf.predict(selection_of_training_scores)
    mse_train_rf = mean_squared_error(training_labels, rf_train_predictions)
    print('Training MSE', mse_train_rf)

    # predict on test
    rf_predictions = rf.predict(selection_of_validation_scores)
    mse_rf = mean_squared_error(validation_labels, rf_predictions)
    print('Validation MSE', mse_rf)

    # get feature importances
    importances = list(rf.feature_importances_)
    feature_importances = [(feature, round(importance, 2)) for feature, importance in zip(selection_of_training_scores.columns, importances)]
    feature_importances = sorted(feature_importances, key = lambda x: x[1], reverse = True)
    [print('Variable: {:20} Importance: {}'.format(*pair)) for pair in feature_importances]

# Final selection of features
The model trained with the final selection of features

In [12]:
subselection_of_features = ["query_precursor_mz",
                            "precursor_mz_difference",
                            "s2v_score",
                            "ms2ds_score",
                            "chemical_neighbourhood_no_spectrum_nr_tanimoto_power0",
                            "average_tanimoto_score_for_chemical_neighbourhood_score"]
selection_of_training_scores = training_scores[subselection_of_features]
selection_of_validation_scores = validation_scores[subselection_of_features]
selection_of_training_scores.info()
train_random_forest(selection_of_training_scores, selection_of_validation_scores)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 467600 entries, 0 to 467599
Data columns (total 6 columns):
 #   Column                                                   Non-Null Count   Dtype  
---  ------                                                   --------------   -----  
 0   query_precursor_mz                                       467600 non-null  float64
 1   precursor_mz_difference                                  467600 non-null  float64
 2   s2v_score                                                467600 non-null  float64
 3   ms2ds_score                                              467600 non-null  float64
 4   chemical_neighbourhood_no_spectrum_nr_tanimoto_power0    467600 non-null  float64
 5   average_tanimoto_score_for_chemical_neighbourhood_score  467600 non-null  float64
dtypes: float64(6)
memory usage: 21.4 MB


  rf.fit(selection_of_training_scores, training_labels)
[Parallel(n_jobs=7)]: Using backend ThreadingBackend with 7 concurrent workers.
[Parallel(n_jobs=7)]: Done  36 tasks      | elapsed:   11.6s
[Parallel(n_jobs=7)]: Done 186 tasks      | elapsed:  1.0min
[Parallel(n_jobs=7)]: Done 250 out of 250 | elapsed:  1.5min finished
[Parallel(n_jobs=7)]: Using backend ThreadingBackend with 7 concurrent workers.
[Parallel(n_jobs=7)]: Done  36 tasks      | elapsed:    0.0s
[Parallel(n_jobs=7)]: Done 186 tasks      | elapsed:    0.6s
[Parallel(n_jobs=7)]: Done 250 out of 250 | elapsed:    1.0s finished
[Parallel(n_jobs=7)]: Using backend ThreadingBackend with 7 concurrent workers.
[Parallel(n_jobs=7)]: Done  36 tasks      | elapsed:    0.0s


Training MSE 0.02822048791511903
Validation MSE 0.025649444294042374
Variable: chemical_neighbourhood_no_spectrum_nr_tanimoto_power0 Importance: 0.61
Variable: precursor_mz_difference Importance: 0.18
Variable: query_precursor_mz   Importance: 0.14
Variable: s2v_score            Importance: 0.05
Variable: ms2ds_score          Importance: 0.02
Variable: average_tanimoto_score_for_chemical_neighbourhood_score Importance: 0.01


[Parallel(n_jobs=7)]: Done 186 tasks      | elapsed:    0.2s
[Parallel(n_jobs=7)]: Done 250 out of 250 | elapsed:    0.2s finished


# (modified) cosine score
The cosine and modified cosine score are added as features. 

In [11]:
subselection_of_features = ["query_precursor_mz",
                            "precursor_mz_difference",
                            "s2v_score",
                            "ms2ds_score",
                            "chemical_neighbourhood_no_spectrum_nr_tanimoto_power0",
                            "cosine_score",
                            "modified_cosine_score"]
selection_of_training_scores = training_scores[subselection_of_features]
selection_of_validation_scores = validation_scores[subselection_of_features]
selection_of_training_scores.info()
train_random_forest(selection_of_training_scores, selection_of_validation_scores)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 467600 entries, 0 to 467599
Data columns (total 7 columns):
 #   Column                                                 Non-Null Count   Dtype  
---  ------                                                 --------------   -----  
 0   query_precursor_mz                                     467600 non-null  float64
 1   precursor_mz_difference                                467600 non-null  float64
 2   s2v_score                                              467600 non-null  float64
 3   ms2ds_score                                            467600 non-null  float64
 4   chemical_neighbourhood_no_spectrum_nr_tanimoto_power0  467600 non-null  float64
 5   cosine_score                                           467600 non-null  float64
 6   modified_cosine_score                                  467600 non-null  float64
dtypes: float64(7)
memory usage: 25.0 MB


  rf.fit(selection_of_training_scores, training_labels)
[Parallel(n_jobs=7)]: Using backend ThreadingBackend with 7 concurrent workers.
[Parallel(n_jobs=7)]: Done  36 tasks      | elapsed:   11.2s
[Parallel(n_jobs=7)]: Done 186 tasks      | elapsed:  1.3min
[Parallel(n_jobs=7)]: Done 250 out of 250 | elapsed:  1.7min finished
[Parallel(n_jobs=7)]: Using backend ThreadingBackend with 7 concurrent workers.
[Parallel(n_jobs=7)]: Done  36 tasks      | elapsed:    0.1s
[Parallel(n_jobs=7)]: Done 186 tasks      | elapsed:    0.8s
[Parallel(n_jobs=7)]: Done 250 out of 250 | elapsed:    1.2s finished
[Parallel(n_jobs=7)]: Using backend ThreadingBackend with 7 concurrent workers.
[Parallel(n_jobs=7)]: Done  36 tasks      | elapsed:    0.0s


Training MSE 0.028244165570602768
Validation MSE 0.025764406064172858
Variable: chemical_neighbourhood_no_spectrum_nr_tanimoto_power0 Importance: 0.61
Variable: precursor_mz_difference Importance: 0.18
Variable: query_precursor_mz   Importance: 0.14
Variable: s2v_score            Importance: 0.05
Variable: ms2ds_score          Importance: 0.02
Variable: cosine_score         Importance: 0.0
Variable: modified_cosine_score Importance: 0.0


[Parallel(n_jobs=7)]: Done 186 tasks      | elapsed:    0.2s
[Parallel(n_jobs=7)]: Done 250 out of 250 | elapsed:    0.2s finished


### Cosine without other scores

In [25]:
subselection_of_features = ["query_precursor_mz",
                            "precursor_mz_difference",
                            "cosine_score",
                            "modified_cosine_score"]
selection_of_training_scores = training_scores[subselection_of_features]
selection_of_validation_scores = validation_scores[subselection_of_features]
selection_of_training_scores.info()
train_random_forest(selection_of_training_scores, selection_of_validation_scores)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 467600 entries, 0 to 467599
Data columns (total 4 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   query_precursor_mz       467600 non-null  float64
 1   precursor_mz_difference  467600 non-null  float64
 2   cosine_score             467600 non-null  float64
 3   modified_cosine_score    467600 non-null  float64
dtypes: float64(4)
memory usage: 14.3 MB


  rf.fit(selection_of_training_scores, training_labels)
[Parallel(n_jobs=7)]: Using backend ThreadingBackend with 7 concurrent workers.
[Parallel(n_jobs=7)]: Done  36 tasks      | elapsed:    5.5s
[Parallel(n_jobs=7)]: Done 186 tasks      | elapsed:   32.9s
[Parallel(n_jobs=7)]: Done 250 out of 250 | elapsed:   47.2s finished
[Parallel(n_jobs=7)]: Using backend ThreadingBackend with 7 concurrent workers.
[Parallel(n_jobs=7)]: Done  36 tasks      | elapsed:    0.1s
[Parallel(n_jobs=7)]: Done 186 tasks      | elapsed:    0.8s
[Parallel(n_jobs=7)]: Done 250 out of 250 | elapsed:    1.1s finished
[Parallel(n_jobs=7)]: Using backend ThreadingBackend with 7 concurrent workers.
[Parallel(n_jobs=7)]: Done  36 tasks      | elapsed:    0.0s


Training MSE 0.03624574130161057
Validation MSE 0.034595751645214
Variable: query_precursor_mz   Importance: 0.56
Variable: precursor_mz_difference Importance: 0.4
Variable: cosine_score         Importance: 0.03
Variable: modified_cosine_score Importance: 0.02


[Parallel(n_jobs=7)]: Done 186 tasks      | elapsed:    0.1s
[Parallel(n_jobs=7)]: Done 250 out of 250 | elapsed:    0.2s finished


[None, None, None, None]

# Instrument type
The instrument type of the query and library spectrum are given as features. 

In [15]:
subselection_of_features = ["query_precursor_mz",
                            "precursor_mz_difference",
                            "s2v_score",
                            "ms2ds_score",
                            "chemical_neighbourhood_no_spectrum_nr_tanimoto_power0",
                            "lib_instrument_orbitrap",
                            "lib_instrument_ion_trap",
                            "lib_instrument_tof",
                            "lib_instrument_quadrupole",
                            "query_instrument_orbitrap",
                            "query_instrument_ion_trap",
                            "query_instrument_tof",
                            "query_instrument_quadrupole"
                            ]
selection_of_training_scores = training_scores[subselection_of_features]
selection_of_validation_scores = validation_scores[subselection_of_features]
selection_of_training_scores.info()
train_random_forest(selection_of_training_scores, selection_of_validation_scores)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 467600 entries, 0 to 467599
Data columns (total 13 columns):
 #   Column                                                 Non-Null Count   Dtype  
---  ------                                                 --------------   -----  
 0   query_precursor_mz                                     467600 non-null  float64
 1   precursor_mz_difference                                467600 non-null  float64
 2   s2v_score                                              467600 non-null  float64
 3   ms2ds_score                                            467600 non-null  float64
 4   chemical_neighbourhood_no_spectrum_nr_tanimoto_power0  467600 non-null  float64
 5   lib_instrument_orbitrap                                467600 non-null  int64  
 6   lib_instrument_ion_trap                                467600 non-null  int64  
 7   lib_instrument_tof                                     467600 non-null  int64  
 8   lib_instrument_quadrupole         

  rf.fit(selection_of_training_scores, training_labels)
[Parallel(n_jobs=7)]: Using backend ThreadingBackend with 7 concurrent workers.
[Parallel(n_jobs=7)]: Done  36 tasks      | elapsed:    8.6s
[Parallel(n_jobs=7)]: Done 186 tasks      | elapsed:   57.0s
[Parallel(n_jobs=7)]: Done 250 out of 250 | elapsed:  1.3min finished
[Parallel(n_jobs=7)]: Using backend ThreadingBackend with 7 concurrent workers.
[Parallel(n_jobs=7)]: Done  36 tasks      | elapsed:    0.0s
[Parallel(n_jobs=7)]: Done 186 tasks      | elapsed:    0.7s
[Parallel(n_jobs=7)]: Done 250 out of 250 | elapsed:    1.1s finished
[Parallel(n_jobs=7)]: Using backend ThreadingBackend with 7 concurrent workers.
[Parallel(n_jobs=7)]: Done  36 tasks      | elapsed:    0.0s


Training MSE 0.028262590756230818
Validation MSE 0.025694134516425868
Variable: chemical_neighbourhood_no_spectrum_nr_tanimoto_power0 Importance: 0.61
Variable: precursor_mz_difference Importance: 0.18
Variable: query_precursor_mz   Importance: 0.14
Variable: s2v_score            Importance: 0.05
Variable: ms2ds_score          Importance: 0.02
Variable: lib_instrument_orbitrap Importance: 0.0
Variable: lib_instrument_ion_trap Importance: 0.0
Variable: lib_instrument_tof   Importance: 0.0
Variable: lib_instrument_quadrupole Importance: 0.0
Variable: query_instrument_orbitrap Importance: 0.0
Variable: query_instrument_ion_trap Importance: 0.0
Variable: query_instrument_tof Importance: 0.0
Variable: query_instrument_quadrupole Importance: 0.0


[Parallel(n_jobs=7)]: Done 186 tasks      | elapsed:    0.2s
[Parallel(n_jobs=7)]: Done 250 out of 250 | elapsed:    0.2s finished


# Compare average on structure vs average on spectra

In [15]:
subselection_of_features = ["query_precursor_mz",
                            "precursor_mz_difference",
                            "s2v_score",
                            "ms2ds_score",
                            "chemical_neighbourhood_no_spectrum_nr_tanimoto_power0",
                           "chemical_neighbourhood_tanimoto_0"]
selection_of_training_scores = training_scores[subselection_of_features]
selection_of_validation_scores = validation_scores[subselection_of_features]
selection_of_training_scores.info()
train_random_forest(selection_of_training_scores, selection_of_validation_scores)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 467600 entries, 0 to 467599
Data columns (total 6 columns):
 #   Column                                                 Non-Null Count   Dtype  
---  ------                                                 --------------   -----  
 0   query_precursor_mz                                     467600 non-null  float64
 1   precursor_mz_difference                                467600 non-null  float64
 2   s2v_score                                              467600 non-null  float64
 3   ms2ds_score                                            467600 non-null  float64
 4   chemical_neighbourhood_no_spectrum_nr_tanimoto_power0  467600 non-null  float64
 5   chemical_neighbourhood_tanimoto_0                      467600 non-null  float64
dtypes: float64(6)
memory usage: 21.4 MB


  rf.fit(selection_of_training_scores, training_labels)
[Parallel(n_jobs=7)]: Using backend ThreadingBackend with 7 concurrent workers.
[Parallel(n_jobs=7)]: Done  36 tasks      | elapsed:    8.9s
[Parallel(n_jobs=7)]: Done 186 tasks      | elapsed:   57.4s
[Parallel(n_jobs=7)]: Done 250 out of 250 | elapsed:  1.3min finished
[Parallel(n_jobs=7)]: Using backend ThreadingBackend with 7 concurrent workers.
[Parallel(n_jobs=7)]: Done  36 tasks      | elapsed:    0.1s
[Parallel(n_jobs=7)]: Done 186 tasks      | elapsed:    0.8s
[Parallel(n_jobs=7)]: Done 250 out of 250 | elapsed:    1.1s finished
[Parallel(n_jobs=7)]: Using backend ThreadingBackend with 7 concurrent workers.
[Parallel(n_jobs=7)]: Done  36 tasks      | elapsed:    0.0s


Training MSE 0.028238497618477378
Validation MSE 0.025736053101000977
Variable: chemical_neighbourhood_no_spectrum_nr_tanimoto_power0 Importance: 0.61
Variable: precursor_mz_difference Importance: 0.18
Variable: query_precursor_mz   Importance: 0.14
Variable: s2v_score            Importance: 0.05
Variable: ms2ds_score          Importance: 0.02
Variable: chemical_neighbourhood_tanimoto_0 Importance: 0.0


[Parallel(n_jobs=7)]: Done 186 tasks      | elapsed:    0.1s
[Parallel(n_jobs=7)]: Done 250 out of 250 | elapsed:    0.2s finished


[None, None, None, None, None, None]

In [16]:
subselection_of_features = ["query_precursor_mz",
                            "precursor_mz_difference",
                            "s2v_score",
                            "ms2ds_score",
                           "chemical_neighbourhood_tanimoto_0"]
selection_of_training_scores = training_scores[subselection_of_features]
selection_of_validation_scores = validation_scores[subselection_of_features]
selection_of_training_scores.info()
train_random_forest(selection_of_training_scores, selection_of_validation_scores)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 467600 entries, 0 to 467599
Data columns (total 5 columns):
 #   Column                             Non-Null Count   Dtype  
---  ------                             --------------   -----  
 0   query_precursor_mz                 467600 non-null  float64
 1   precursor_mz_difference            467600 non-null  float64
 2   s2v_score                          467600 non-null  float64
 3   ms2ds_score                        467600 non-null  float64
 4   chemical_neighbourhood_tanimoto_0  467600 non-null  float64
dtypes: float64(5)
memory usage: 17.8 MB


  rf.fit(selection_of_training_scores, training_labels)
[Parallel(n_jobs=7)]: Using backend ThreadingBackend with 7 concurrent workers.
[Parallel(n_jobs=7)]: Done  36 tasks      | elapsed:    7.7s
[Parallel(n_jobs=7)]: Done 186 tasks      | elapsed:   49.9s
[Parallel(n_jobs=7)]: Done 250 out of 250 | elapsed:  1.2min finished
[Parallel(n_jobs=7)]: Using backend ThreadingBackend with 7 concurrent workers.
[Parallel(n_jobs=7)]: Done  36 tasks      | elapsed:    0.0s
[Parallel(n_jobs=7)]: Done 186 tasks      | elapsed:    0.5s
[Parallel(n_jobs=7)]: Done 250 out of 250 | elapsed:    0.8s finished
[Parallel(n_jobs=7)]: Using backend ThreadingBackend with 7 concurrent workers.
[Parallel(n_jobs=7)]: Done  36 tasks      | elapsed:    0.0s


Training MSE 0.03014091876604286
Validation MSE 0.028607621186213544
Variable: chemical_neighbourhood_tanimoto_0 Importance: 0.54
Variable: precursor_mz_difference Importance: 0.18
Variable: query_precursor_mz   Importance: 0.17
Variable: s2v_score            Importance: 0.1
Variable: ms2ds_score          Importance: 0.02


[Parallel(n_jobs=7)]: Done 186 tasks      | elapsed:    0.1s
[Parallel(n_jobs=7)]: Done 250 out of 250 | elapsed:    0.2s finished


[None, None, None, None, None]

In [17]:
subselection_of_features = ["query_precursor_mz",
                            "precursor_mz_difference",
                            "s2v_score",
                            "ms2ds_score",
                            "chemical_neighbourhood_no_spectrum_nr_tanimoto_power0"]
selection_of_training_scores = training_scores[subselection_of_features]
selection_of_validation_scores = validation_scores[subselection_of_features]
selection_of_training_scores.info()
train_random_forest(selection_of_training_scores, selection_of_validation_scores)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 467600 entries, 0 to 467599
Data columns (total 5 columns):
 #   Column                                                 Non-Null Count   Dtype  
---  ------                                                 --------------   -----  
 0   query_precursor_mz                                     467600 non-null  float64
 1   precursor_mz_difference                                467600 non-null  float64
 2   s2v_score                                              467600 non-null  float64
 3   ms2ds_score                                            467600 non-null  float64
 4   chemical_neighbourhood_no_spectrum_nr_tanimoto_power0  467600 non-null  float64
dtypes: float64(5)
memory usage: 17.8 MB


  rf.fit(selection_of_training_scores, training_labels)
[Parallel(n_jobs=7)]: Using backend ThreadingBackend with 7 concurrent workers.
[Parallel(n_jobs=7)]: Done  36 tasks      | elapsed:    9.1s
[Parallel(n_jobs=7)]: Done 186 tasks      | elapsed:   38.9s
[Parallel(n_jobs=7)]: Done 250 out of 250 | elapsed:   51.5s finished
[Parallel(n_jobs=7)]: Using backend ThreadingBackend with 7 concurrent workers.
[Parallel(n_jobs=7)]: Done  36 tasks      | elapsed:    0.0s
[Parallel(n_jobs=7)]: Done 186 tasks      | elapsed:    0.5s
[Parallel(n_jobs=7)]: Done 250 out of 250 | elapsed:    0.7s finished


Training MSE 0.02826842554510124


[Parallel(n_jobs=7)]: Using backend ThreadingBackend with 7 concurrent workers.
[Parallel(n_jobs=7)]: Done  36 tasks      | elapsed:    0.0s
[Parallel(n_jobs=7)]: Done 186 tasks      | elapsed:    0.0s
[Parallel(n_jobs=7)]: Done 250 out of 250 | elapsed:    0.1s finished


Validation MSE 0.02571552323007786
Variable: chemical_neighbourhood_no_spectrum_nr_tanimoto_power0 Importance: 0.61
Variable: precursor_mz_difference Importance: 0.18
Variable: query_precursor_mz   Importance: 0.14
Variable: s2v_score            Importance: 0.05
Variable: ms2ds_score          Importance: 0.02


[None, None, None, None, None]

# Weigthing of averge of multiple library scores

### Adding additional features
The different weighting methods were calculated with the functions below, but are currently already stored in the file loaded in the first step of this notebook 

In [11]:
from tqdm.notebook import tqdm

def add_different_weighting(dataframe, power_tanimioto):
    new_score = []
    for index, row in tqdm(dataframe.iterrows()):
        total_weight = 0
        total_ms2ds = 0 
        for i in range(10):
            average_ms2ds = row["average_ms2deepscore_" + str(i)]
            tanimoto_score = row["tanimoto_score_structure_" + str(i)]
            nr_of_spectra = row["nr_of_spectra_structure_" + str(i)]
            weight = nr_of_spectra * tanimoto_score**power_tanimioto
            total_weight += weight
            weighted_score = average_ms2ds * weight
            total_ms2ds += weighted_score
        average = total_ms2ds/total_weight
        new_score.append(average)
    dataframe["chemical_neighbourhood_tanimoto_"+str(power_tanimioto)] = new_score


In [5]:
from tqdm.notebook import tqdm

def no_nr_of_spectra_filtering(dataframe, power_tanimioto):
    new_score = []
    for index, row in tqdm(dataframe.iterrows()):
        total_weight = 0
        total_ms2ds = 0 
        for i in range(10):
            average_ms2ds = row["average_ms2deepscore_" + str(i)]
            tanimoto_score = row["tanimoto_score_structure_" + str(i)]
            weight = tanimoto_score**power_tanimioto
            total_weight += weight
            weighted_score = average_ms2ds * weight
            total_ms2ds += weighted_score
        average = total_ms2ds/total_weight
        new_score.append(average)
    dataframe["chemical_neighbourhood_no_spectrum_nr_tanimoto_power"+str(power_tanimioto)] = new_score

In [None]:
no_nr_of_spectra_filtering(training_scores, 1)
no_nr_of_spectra_filtering(validation_scores, 1)
no_nr_of_spectra_filtering(training_scores, 2)
no_nr_of_spectra_filtering(validation_scores, 2)
no_nr_of_spectra_filtering(training_scores, 3)
no_nr_of_spectra_filtering(validation_scores, 3)
no_nr_of_spectra_filtering(training_scores, 4)
no_nr_of_spectra_filtering(validation_scores, 4)
no_nr_of_spectra_filtering(training_scores, 5)
no_nr_of_spectra_filtering(validation_scores, 5)
no_nr_of_spectra_filtering(training_scores, 0)
no_nr_of_spectra_filtering(validation_scores, 0)
add_different_weighting(training_scores, 0)
add_different_weighting(validation_scores, 0)

### Use all features in one model

In [19]:
subselection_of_features = ["query_precursor_mz",
                            "precursor_mz_difference",
                            "s2v_score",
                            "ms2ds_score",
                            "chemical_neighbourhood_no_spectrum_nr_tanimoto_power0",
                           "chemical_neighbourhood_no_spectrum_nr_tanimoto_power1",
                           "chemical_neighbourhood_no_spectrum_nr_tanimoto_power2",
                           "chemical_neighbourhood_no_spectrum_nr_tanimoto_power3",
                            "chemical_neighbourhood_no_spectrum_nr_tanimoto_power4",
                              "chemical_neighbourhood_no_spectrum_nr_tanimoto_power5"]
selection_of_training_scores = training_scores[subselection_of_features]
selection_of_validation_scores = validation_scores[subselection_of_features]
selection_of_training_scores.info()
train_random_forest(selection_of_training_scores, selection_of_validation_scores)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 467600 entries, 0 to 467599
Data columns (total 10 columns):
 #   Column                                                 Non-Null Count   Dtype  
---  ------                                                 --------------   -----  
 0   query_precursor_mz                                     467600 non-null  float64
 1   precursor_mz_difference                                467600 non-null  float64
 2   s2v_score                                              467600 non-null  float64
 3   ms2ds_score                                            467600 non-null  float64
 4   chemical_neighbourhood_no_spectrum_nr_tanimoto_power0  467600 non-null  float64
 5   chemical_neighbourhood_no_spectrum_nr_tanimoto_power1  467600 non-null  float64
 6   chemical_neighbourhood_no_spectrum_nr_tanimoto_power2  467600 non-null  float64
 7   chemical_neighbourhood_no_spectrum_nr_tanimoto_power3  467600 non-null  float64
 8   chemical_neighbourhood_no_spectrum

  rf.fit(selection_of_training_scores, training_labels)
[Parallel(n_jobs=7)]: Using backend ThreadingBackend with 7 concurrent workers.
[Parallel(n_jobs=7)]: Done  36 tasks      | elapsed:   24.5s
[Parallel(n_jobs=7)]: Done 186 tasks      | elapsed:  1.8min
[Parallel(n_jobs=7)]: Done 250 out of 250 | elapsed:  2.3min finished
[Parallel(n_jobs=7)]: Using backend ThreadingBackend with 7 concurrent workers.
[Parallel(n_jobs=7)]: Done  36 tasks      | elapsed:    0.0s
[Parallel(n_jobs=7)]: Done 186 tasks      | elapsed:    0.7s
[Parallel(n_jobs=7)]: Done 250 out of 250 | elapsed:    1.1s finished
[Parallel(n_jobs=7)]: Using backend ThreadingBackend with 7 concurrent workers.
[Parallel(n_jobs=7)]: Done  36 tasks      | elapsed:    0.0s


Training MSE 0.028228783148908698
Validation MSE 0.025983753544486
Variable: chemical_neighbourhood_no_spectrum_nr_tanimoto_power1 Importance: 0.46
Variable: precursor_mz_difference Importance: 0.18
Variable: query_precursor_mz   Importance: 0.14
Variable: chemical_neighbourhood_no_spectrum_nr_tanimoto_power0 Importance: 0.12
Variable: s2v_score            Importance: 0.05
Variable: chemical_neighbourhood_no_spectrum_nr_tanimoto_power2 Importance: 0.03
Variable: ms2ds_score          Importance: 0.02
Variable: chemical_neighbourhood_no_spectrum_nr_tanimoto_power3 Importance: 0.0
Variable: chemical_neighbourhood_no_spectrum_nr_tanimoto_power4 Importance: 0.0
Variable: chemical_neighbourhood_no_spectrum_nr_tanimoto_power5 Importance: 0.0


[Parallel(n_jobs=7)]: Done 186 tasks      | elapsed:    0.1s
[Parallel(n_jobs=7)]: Done 250 out of 250 | elapsed:    0.2s finished


### Use different weighting one by one

In [16]:
subselection_of_features = ["query_precursor_mz",
                            "precursor_mz_difference",
                            "s2v_score",
                            "ms2ds_score",
                            "chemical_neighbourhood_no_spectrum_nr_tanimoto_power0"]
selection_of_training_scores = training_scores[subselection_of_features]
selection_of_validation_scores = validation_scores[subselection_of_features]
selection_of_training_scores.info()
train_random_forest(selection_of_training_scores, selection_of_validation_scores)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 467600 entries, 0 to 467599
Data columns (total 5 columns):
 #   Column                                                 Non-Null Count   Dtype  
---  ------                                                 --------------   -----  
 0   query_precursor_mz                                     467600 non-null  float64
 1   precursor_mz_difference                                467600 non-null  float64
 2   s2v_score                                              467600 non-null  float64
 3   ms2ds_score                                            467600 non-null  float64
 4   chemical_neighbourhood_no_spectrum_nr_tanimoto_power0  467600 non-null  float64
dtypes: float64(5)
memory usage: 17.8 MB


  rf.fit(selection_of_training_scores, training_labels)
[Parallel(n_jobs=7)]: Using backend ThreadingBackend with 7 concurrent workers.
[Parallel(n_jobs=7)]: Done  36 tasks      | elapsed:    7.5s
[Parallel(n_jobs=7)]: Done 186 tasks      | elapsed:   48.7s
[Parallel(n_jobs=7)]: Done 250 out of 250 | elapsed:  1.1min finished
[Parallel(n_jobs=7)]: Using backend ThreadingBackend with 7 concurrent workers.
[Parallel(n_jobs=7)]: Done  36 tasks      | elapsed:    0.1s
[Parallel(n_jobs=7)]: Done 186 tasks      | elapsed:    0.8s
[Parallel(n_jobs=7)]: Done 250 out of 250 | elapsed:    1.1s finished
[Parallel(n_jobs=7)]: Using backend ThreadingBackend with 7 concurrent workers.
[Parallel(n_jobs=7)]: Done  36 tasks      | elapsed:    0.0s


Training MSE 0.02826842554510124
Validation MSE 0.02571552323007786
Variable: chemical_neighbourhood_no_spectrum_nr_tanimoto_power0 Importance: 0.61
Variable: precursor_mz_difference Importance: 0.18
Variable: query_precursor_mz   Importance: 0.14
Variable: s2v_score            Importance: 0.05
Variable: ms2ds_score          Importance: 0.02


[Parallel(n_jobs=7)]: Done 186 tasks      | elapsed:    0.1s
[Parallel(n_jobs=7)]: Done 250 out of 250 | elapsed:    0.2s finished


In [15]:
subselection_of_features = ["query_precursor_mz",
                            "precursor_mz_difference",
                            "s2v_score",
                            "ms2ds_score",
                            "chemical_neighbourhood_no_spectrum_nr_tanimoto_power1"]
selection_of_training_scores = training_scores[subselection_of_features]
selection_of_validation_scores = validation_scores[subselection_of_features]
selection_of_training_scores.info()
train_random_forest(selection_of_training_scores, selection_of_validation_scores)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 467600 entries, 0 to 467599
Data columns (total 5 columns):
 #   Column                                                 Non-Null Count   Dtype  
---  ------                                                 --------------   -----  
 0   query_precursor_mz                                     467600 non-null  float64
 1   precursor_mz_difference                                467600 non-null  float64
 2   s2v_score                                              467600 non-null  float64
 3   ms2ds_score                                            467600 non-null  float64
 4   chemical_neighbourhood_no_spectrum_nr_tanimoto_power1  467600 non-null  float64
dtypes: float64(5)
memory usage: 17.8 MB


  rf.fit(selection_of_training_scores, training_labels)
[Parallel(n_jobs=7)]: Using backend ThreadingBackend with 7 concurrent workers.
[Parallel(n_jobs=7)]: Done  36 tasks      | elapsed:    7.5s
[Parallel(n_jobs=7)]: Done 186 tasks      | elapsed:   48.0s
[Parallel(n_jobs=7)]: Done 250 out of 250 | elapsed:  1.1min finished
[Parallel(n_jobs=7)]: Using backend ThreadingBackend with 7 concurrent workers.
[Parallel(n_jobs=7)]: Done  36 tasks      | elapsed:    0.0s
[Parallel(n_jobs=7)]: Done 186 tasks      | elapsed:    0.8s
[Parallel(n_jobs=7)]: Done 250 out of 250 | elapsed:    1.1s finished
[Parallel(n_jobs=7)]: Using backend ThreadingBackend with 7 concurrent workers.
[Parallel(n_jobs=7)]: Done  36 tasks      | elapsed:    0.0s


Training MSE 0.028375345962102953
Validation MSE 0.026004592509868557
Variable: chemical_neighbourhood_no_spectrum_nr_tanimoto_power1 Importance: 0.61
Variable: precursor_mz_difference Importance: 0.18
Variable: query_precursor_mz   Importance: 0.14
Variable: s2v_score            Importance: 0.05
Variable: ms2ds_score          Importance: 0.02


[Parallel(n_jobs=7)]: Done 186 tasks      | elapsed:    0.1s
[Parallel(n_jobs=7)]: Done 250 out of 250 | elapsed:    0.2s finished


[None, None, None, None, None]

In [17]:
subselection_of_features = ["query_precursor_mz",
                            "precursor_mz_difference",
                            "s2v_score",
                            "ms2ds_score",
                            "chemical_neighbourhood_no_spectrum_nr_tanimoto_power2"]
selection_of_training_scores = training_scores[subselection_of_features]
selection_of_validation_scores = validation_scores[subselection_of_features]
selection_of_training_scores.info()
train_random_forest(selection_of_training_scores, selection_of_validation_scores)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 467600 entries, 0 to 467599
Data columns (total 5 columns):
 #   Column                                                 Non-Null Count   Dtype  
---  ------                                                 --------------   -----  
 0   query_precursor_mz                                     467600 non-null  float64
 1   precursor_mz_difference                                467600 non-null  float64
 2   s2v_score                                              467600 non-null  float64
 3   ms2ds_score                                            467600 non-null  float64
 4   chemical_neighbourhood_no_spectrum_nr_tanimoto_power2  467600 non-null  float64
dtypes: float64(5)
memory usage: 17.8 MB


  rf.fit(selection_of_training_scores, training_labels)
[Parallel(n_jobs=7)]: Using backend ThreadingBackend with 7 concurrent workers.
[Parallel(n_jobs=7)]: Done  36 tasks      | elapsed:   11.8s
[Parallel(n_jobs=7)]: Done 186 tasks      | elapsed:   56.4s
[Parallel(n_jobs=7)]: Done 250 out of 250 | elapsed:  1.3min finished
[Parallel(n_jobs=7)]: Using backend ThreadingBackend with 7 concurrent workers.
[Parallel(n_jobs=7)]: Done  36 tasks      | elapsed:    0.0s
[Parallel(n_jobs=7)]: Done 186 tasks      | elapsed:    0.7s
[Parallel(n_jobs=7)]: Done 250 out of 250 | elapsed:    1.0s finished
[Parallel(n_jobs=7)]: Using backend ThreadingBackend with 7 concurrent workers.
[Parallel(n_jobs=7)]: Done  36 tasks      | elapsed:    0.0s


Training MSE 0.02859209987438069
Validation MSE 0.026239289479840083
Variable: chemical_neighbourhood_no_spectrum_nr_tanimoto_power2 Importance: 0.61
Variable: precursor_mz_difference Importance: 0.17
Variable: query_precursor_mz   Importance: 0.15
Variable: s2v_score            Importance: 0.05
Variable: ms2ds_score          Importance: 0.02


[Parallel(n_jobs=7)]: Done 186 tasks      | elapsed:    0.1s
[Parallel(n_jobs=7)]: Done 250 out of 250 | elapsed:    0.2s finished


In [18]:
subselection_of_features = ["query_precursor_mz",
                            "precursor_mz_difference",
                            "s2v_score",
                            "ms2ds_score",
                            "chemical_neighbourhood_no_spectrum_nr_tanimoto_power3"]
selection_of_training_scores = training_scores[subselection_of_features]
selection_of_validation_scores = validation_scores[subselection_of_features]
selection_of_training_scores.info()
train_random_forest(selection_of_training_scores, selection_of_validation_scores)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 467600 entries, 0 to 467599
Data columns (total 5 columns):
 #   Column                                                 Non-Null Count   Dtype  
---  ------                                                 --------------   -----  
 0   query_precursor_mz                                     467600 non-null  float64
 1   precursor_mz_difference                                467600 non-null  float64
 2   s2v_score                                              467600 non-null  float64
 3   ms2ds_score                                            467600 non-null  float64
 4   chemical_neighbourhood_no_spectrum_nr_tanimoto_power3  467600 non-null  float64
dtypes: float64(5)
memory usage: 17.8 MB


  rf.fit(selection_of_training_scores, training_labels)
[Parallel(n_jobs=7)]: Using backend ThreadingBackend with 7 concurrent workers.
[Parallel(n_jobs=7)]: Done  36 tasks      | elapsed:   12.2s
[Parallel(n_jobs=7)]: Done 186 tasks      | elapsed:   55.6s
[Parallel(n_jobs=7)]: Done 250 out of 250 | elapsed:  1.2min finished
[Parallel(n_jobs=7)]: Using backend ThreadingBackend with 7 concurrent workers.
[Parallel(n_jobs=7)]: Done  36 tasks      | elapsed:    0.1s
[Parallel(n_jobs=7)]: Done 186 tasks      | elapsed:    1.0s
[Parallel(n_jobs=7)]: Done 250 out of 250 | elapsed:    1.4s finished
[Parallel(n_jobs=7)]: Using backend ThreadingBackend with 7 concurrent workers.
[Parallel(n_jobs=7)]: Done  36 tasks      | elapsed:    0.0s


Training MSE 0.02900090331946365
Validation MSE 0.02658879720884346
Variable: chemical_neighbourhood_no_spectrum_nr_tanimoto_power3 Importance: 0.61
Variable: query_precursor_mz   Importance: 0.17
Variable: precursor_mz_difference Importance: 0.17
Variable: s2v_score            Importance: 0.05
Variable: ms2ds_score          Importance: 0.0


[Parallel(n_jobs=7)]: Done 186 tasks      | elapsed:    0.1s
[Parallel(n_jobs=7)]: Done 250 out of 250 | elapsed:    0.2s finished


# Additional runs not structured

### Run on all scores

In [58]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# train rf using optimised parameters from below

rf = RandomForestRegressor(n_estimators = 250, 
                           random_state = 42, 
                           max_depth = 5, 
                           verbose=1,
                           min_samples_leaf=50,
                           n_jobs=-1)
rf.fit(training_scores, training_labels)

# predict on train
rf_train_predictions = rf.predict(training_scores)
mse_train_rf = mean_squared_error(training_labels, rf_train_predictions)
print('Training MSE', mse_train_rf)

# predict on test
rf_predictions = rf.predict(validation_scores)
mse_rf = mean_squared_error(validation_labels, rf_predictions)
print('Validation MSE', mse_rf)

# get feature importances
importances = list(rf.feature_importances_)
feature_importances = [(feature, round(importance, 2)) for feature, importance in zip(training_scores.columns, importances)]
feature_importances = sorted(feature_importances, key = lambda x: x[1], reverse = True)
[print('Variable: {:20} Importance: {}'.format(*pair)) for pair in feature_importances]

  rf.fit(training_scores, training_labels)
[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:  1.2min
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:  6.7min
[Parallel(n_jobs=-1)]: Done 250 out of 250 | elapsed:  9.4min finished
[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed:    0.2s
[Parallel(n_jobs=8)]: Done 184 tasks      | elapsed:    1.4s
[Parallel(n_jobs=8)]: Done 250 out of 250 | elapsed:    1.9s finished
[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.


Training MSE 0.028084835705205147


[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed:    0.0s
[Parallel(n_jobs=8)]: Done 184 tasks      | elapsed:    0.3s
[Parallel(n_jobs=8)]: Done 250 out of 250 | elapsed:    0.4s finished


Validation MSE 0.02558998386978749
Variable: chemical_neighbourhood_no_spectrum_nr_tanimoto_power1 Importance: 0.6
Variable: precursor_mz_difference Importance: 0.18
Variable: query_precursor_mz   Importance: 0.13
Variable: s2v_score            Importance: 0.05
Variable: ms2ds_score          Importance: 0.02
Variable: tanimoto_score_structure_5 Importance: 0.01
Variable: average_ms2deepscore_7 Importance: 0.01
Variable: average_ms2ds_score_for_inchikey14 Importance: 0.0
Variable: nr_of_spectra_with_same_inchikey14 Importance: 0.0
Variable: chemical_neighbourhood_score Importance: 0.0
Variable: average_tanimoto_score_for_chemical_neighbourhood_score Importance: 0.0
Variable: nr_of_spectra_for_chemical_neighbourhood_score Importance: 0.0
Variable: cosine_score         Importance: 0.0
Variable: modified_cosine_score Importance: 0.0
Variable: lib_instrument_orbitrap Importance: 0.0
Variable: lib_instrument_ion_trap Importance: 0.0
Variable: lib_instrument_tof   Importance: 0.0
Variable: li

[None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None]

In [59]:
training_scores.corr()

Unnamed: 0,query_precursor_mz,precursor_mz_difference,s2v_score,ms2ds_score,average_ms2ds_score_for_inchikey14,nr_of_spectra_with_same_inchikey14,chemical_neighbourhood_score,average_tanimoto_score_for_chemical_neighbourhood_score,nr_of_spectra_for_chemical_neighbourhood_score,cosine_score,...,average_ms2deepscore_8,tanimoto_score_structure_8,nr_of_spectra_structure_8,average_ms2deepscore_9,tanimoto_score_structure_9,nr_of_spectra_structure_9,chemical_neighbourhood_tanimoto_3,chemical_neighbourhood_no_spectrum_nr_tanimoto_power1,chemical_neighbourhood_tanimoto_4,chemical_neighbourhood_tanimoto_5
query_precursor_mz,1.0,0.376204,0.060726,0.280129,0.252732,-0.068031,0.367003,0.401451,-0.00688,-0.174724,...,0.345448,0.477739,0.051192,0.347443,0.484518,0.00449,0.323522,0.352519,0.303038,0.285893
precursor_mz_difference,0.376204,1.0,-0.159887,-0.049766,-0.06059,0.006943,-0.02921,0.105554,0.01343,-0.130538,...,-0.010312,0.106936,0.033204,-0.013854,0.108111,-0.006929,-0.044686,-0.039129,-0.05135,-0.056701
s2v_score,0.060726,-0.159887,1.0,0.29234,0.277627,-0.095746,0.220795,0.016752,-0.105255,0.256245,...,0.19103,0.046601,-0.026938,0.19092,0.047476,-0.038428,0.230656,0.250778,0.234396,0.237673
ms2ds_score,0.280129,-0.049766,0.29234,1.0,0.465841,0.021132,0.493936,0.258245,0.094478,0.426974,...,0.390889,0.305707,0.067034,0.395373,0.307129,0.05755,0.485797,0.461083,0.47937,0.473221
average_ms2ds_score_for_inchikey14,0.252732,-0.06059,0.277627,0.465841,1.0,-0.319088,0.77643,0.016046,-0.220868,0.112303,...,0.596604,0.168498,-0.037785,0.579684,0.167204,-0.0442,0.823672,0.789089,0.841307,0.85541
nr_of_spectra_with_same_inchikey14,-0.068031,0.006943,-0.095746,0.021132,-0.319088,1.0,-0.105165,0.331616,0.655639,0.028875,...,-0.084394,0.153997,0.219161,-0.049468,0.155055,0.128287,-0.144255,-0.138126,-0.162716,-0.178775
chemical_neighbourhood_score,0.367003,-0.02921,0.220795,0.493936,0.77643,-0.105165,1.0,0.341885,-0.121845,0.050612,...,0.778365,0.376794,-0.024232,0.766231,0.373542,-0.044909,0.97922,0.91676,0.959757,0.940077
average_tanimoto_score_for_chemical_neighbourhood_score,0.401451,0.105554,0.016752,0.258245,0.016046,0.331616,0.341885,1.0,0.295781,-0.053413,...,0.370047,0.858453,0.094969,0.383896,0.851443,0.042888,0.210808,0.303177,0.150899,0.101793
nr_of_spectra_for_chemical_neighbourhood_score,-0.00688,0.01343,-0.105255,0.094478,-0.220868,0.655639,-0.121845,0.295781,1.0,0.031652,...,-0.013231,0.314493,0.481313,0.007455,0.314768,0.425647,-0.173167,-0.077132,-0.193787,-0.210055
cosine_score,-0.174724,-0.130538,0.256245,0.426974,0.112303,0.028875,0.050612,-0.053413,0.031652,1.0,...,-0.014836,-0.067477,0.012557,-0.007853,-0.070049,0.02209,0.072754,0.025297,0.081594,0.088477


# Tune settings

In [21]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# train rf using optimised parameters from below

rf = RandomForestRegressor(n_estimators = 250, 
                           random_state = 42, 
                           max_depth = 5, 
                           verbose=1,
                           min_samples_leaf=50,
                           n_jobs=-1)
rf.fit(training_scores, training_labels)

# predict on train
rf_train_predictions = rf.predict(training_scores)
mse_train_rf = mean_squared_error(training_labels, rf_train_predictions)
print('Training MSE', mse_train_rf)

# predict on test
rf_predictions = rf.predict(validation_scores)
mse_rf = mean_squared_error(validation_labels, rf_predictions)
print('Validation MSE', mse_rf)

# get feature importances
importances = list(rf.feature_importances_)
feature_importances = [(feature, round(importance, 2)) for feature, importance in zip(training_scores.columns, importances)]
feature_importances = sorted(feature_importances, key = lambda x: x[1], reverse = True)
[print('Variable: {:20} Importance: {}'.format(*pair)) for pair in feature_importances]

  rf.fit(training_scores, training_labels)
[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:   57.5s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:  4.9min
[Parallel(n_jobs=-1)]: Done 250 out of 250 | elapsed:  6.7min finished
[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed:    0.1s
[Parallel(n_jobs=8)]: Done 184 tasks      | elapsed:    0.8s
[Parallel(n_jobs=8)]: Done 250 out of 250 | elapsed:    1.2s finished
[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.


Training MSE 0.029233031734963836


[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed:    0.0s
[Parallel(n_jobs=8)]: Done 184 tasks      | elapsed:    0.1s


Validation MSE 0.028655398385403413
Variable: chemical_neighbourhood_score Importance: 0.43
Variable: precursor_mz_difference Importance: 0.16
Variable: query_precursor_mz   Importance: 0.14
Variable: s2v_score            Importance: 0.11
Variable: average_ms2deepscore_9 Importance: 0.04
Variable: tanimoto_score_structure_7 Importance: 0.02
Variable: tanimoto_score_structure_9 Importance: 0.02
Variable: ms2ds_score          Importance: 0.01
Variable: average_ms2deepscore_1 Importance: 0.01
Variable: average_ms2deepscore_3 Importance: 0.01
Variable: average_ms2deepscore_5 Importance: 0.01
Variable: average_ms2deepscore_6 Importance: 0.01
Variable: average_ms2ds_score_for_inchikey14 Importance: 0.0
Variable: nr_of_spectra_with_same_inchikey14 Importance: 0.0
Variable: average_tanimoto_score_for_chemical_neighbourhood_score Importance: 0.0
Variable: nr_of_spectra_for_chemical_neighbourhood_score Importance: 0.0
Variable: cosine_score         Importance: 0.0
Variable: modified_cosine_score

[Parallel(n_jobs=8)]: Done 250 out of 250 | elapsed:    0.2s finished


[None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None]

# Run with 30 features for similar structures, without chemical neighbourhood score

In [31]:
columns_to_drop = ["average_ms2ds_score_for_inchikey14",
                       "nr_of_spectra_with_same_inchikey14",
                       "chemical_neighbourhood_score",
                       "average_tanimoto_score_for_chemical_neighbourhood_score",
                       "nr_of_spectra_for_chemical_neighbourhood_score",
                   "lib_instrument_orbitrap",
                       "lib_instrument_ion_trap",
                       "lib_instrument_tof",
                       "lib_instrument_quadrupole",
                       "query_instrument_orbitrap",
                       "query_instrument_ion_trap",
                       "query_instrument_tof",
                       "query_instrument_quadrupole"]
selected_training_scores = training_scores.drop(columns_to_drop, axis=1)
selected_validation_scores = validation_scores.drop(columns_to_drop, axis=1)

In [24]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# train rf using optimised parameters from below

rf = RandomForestRegressor(n_estimators = 250, 
                           random_state = 42, 
                           max_depth = 5, 
                           verbose=1,
                           min_samples_leaf=50,
                           n_jobs=-1)
rf.fit(selected_training_scores, training_labels)

# predict on train
rf_train_predictions = rf.predict(selected_training_scores)
mse_train_rf = mean_squared_error(training_labels, rf_train_predictions)
print('Training MSE', mse_train_rf)

# predict on test
rf_predictions = rf.predict(selected_validation_scores)
mse_rf = mean_squared_error(validation_labels, rf_predictions)
print('Validation MSE', mse_rf)

# get feature importances
importances = list(rf.feature_importances_)
feature_importances = [(feature, round(importance, 2)) for feature, importance in zip(selected_training_scores.columns, importances)]
feature_importances = sorted(feature_importances, key = lambda x: x[1], reverse = True)
[print('Variable: {:20} Importance: {}'.format(*pair)) for pair in feature_importances]

  rf.fit(selected_training_scores, training_labels)
[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:   42.9s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:  3.8min
[Parallel(n_jobs=-1)]: Done 250 out of 250 | elapsed:  5.2min finished
[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed:    0.1s
[Parallel(n_jobs=8)]: Done 184 tasks      | elapsed:    0.8s
[Parallel(n_jobs=8)]: Done 250 out of 250 | elapsed:    1.1s finished
[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed:    0.0s


Training MSE 0.029163197001385248
Validation MSE 0.027283918454377432
Variable: average_ms2deepscore_9 Importance: 0.37
Variable: precursor_mz_difference Importance: 0.2
Variable: query_precursor_mz   Importance: 0.16
Variable: average_ms2deepscore_5 Importance: 0.06
Variable: s2v_score            Importance: 0.05
Variable: average_ms2deepscore_1 Importance: 0.05
Variable: average_ms2deepscore_6 Importance: 0.04
Variable: average_ms2deepscore_2 Importance: 0.03
Variable: ms2ds_score          Importance: 0.01
Variable: average_ms2deepscore_3 Importance: 0.01
Variable: average_ms2deepscore_7 Importance: 0.01
Variable: average_ms2deepscore_8 Importance: 0.01
Variable: cosine_score         Importance: 0.0
Variable: modified_cosine_score Importance: 0.0
Variable: average_ms2deepscore_0 Importance: 0.0
Variable: tanimoto_score_structure_0 Importance: 0.0
Variable: nr_of_spectra_structure_0 Importance: 0.0
Variable: tanimoto_score_structure_1 Importance: 0.0
Variable: nr_of_spectra_structure_

[Parallel(n_jobs=8)]: Done 184 tasks      | elapsed:    0.1s
[Parallel(n_jobs=8)]: Done 250 out of 250 | elapsed:    0.2s finished


[None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None]

### Duplicate score to check how random forest behaves

In [27]:
selected_training_scores["average_ms2deepscore_9_duplicated"] = selected_training_scores["average_ms2deepscore_9"]
selected_training_scores.head()

Unnamed: 0,query_precursor_mz,precursor_mz_difference,s2v_score,ms2ds_score,cosine_score,modified_cosine_score,average_ms2deepscore_0,tanimoto_score_structure_0,nr_of_spectra_structure_0,average_ms2deepscore_1,...,average_ms2deepscore_7,tanimoto_score_structure_7,nr_of_spectra_structure_7,average_ms2deepscore_8,tanimoto_score_structure_8,nr_of_spectra_structure_8,average_ms2deepscore_9,tanimoto_score_structure_9,nr_of_spectra_structure_9,average_ms2deepscore_9_duplicated
0,357.206,17.027,0.342509,0.958357,0.843849,0.844019,0.738493,1.0,2,0.544468,...,0.654104,0.714286,1,0.779384,0.685567,2,0.664638,0.668224,2,0.664638
1,495.381,121.148,0.120013,0.871954,0.040108,0.040108,0.58293,1.0,23,0.576751,...,0.575334,0.83377,13,0.599541,0.825916,1,0.577323,0.806069,25,0.577323
2,373.165,1.068,0.234691,0.867854,0.183264,0.19098,0.79366,1.0,4,0.740528,...,0.360988,0.673967,1,0.594069,0.672654,2,0.32159,0.672294,1,0.32159
3,455.29,81.057,0.082091,0.865033,0.064454,0.065283,0.671383,1.0,1,0.653571,...,0.393941,0.51005,1,0.511891,0.506122,1,0.451606,0.50541,2,0.451606
4,373.165,1.068,0.280886,0.861416,0.131869,0.148776,0.79366,1.0,4,0.740528,...,0.360988,0.673967,1,0.594069,0.672654,2,0.32159,0.672294,1,0.32159


In [28]:
selected_validation_scores["average_ms2deepscore_9_duplicated"] = selected_training_scores["average_ms2deepscore_9"]
selected_validation_scores.head()

Unnamed: 0,query_precursor_mz,precursor_mz_difference,s2v_score,ms2ds_score,cosine_score,modified_cosine_score,average_ms2deepscore_0,tanimoto_score_structure_0,nr_of_spectra_structure_0,average_ms2deepscore_1,...,average_ms2deepscore_7,tanimoto_score_structure_7,nr_of_spectra_structure_7,average_ms2deepscore_8,tanimoto_score_structure_8,nr_of_spectra_structure_8,average_ms2deepscore_9,tanimoto_score_structure_9,nr_of_spectra_structure_9,average_ms2deepscore_9_duplicated
0,353.07,0.0,0.275057,0.994153,0.999314,0.999314,0.696601,1.0,12,0.536532,...,0.512984,0.666199,2,0.506112,0.65808,3,0.560865,0.648837,93,0.664638
1,353.07,0.0,0.795537,0.99091,0.999493,0.999493,0.696601,1.0,12,0.536532,...,0.512984,0.666199,2,0.506112,0.65808,3,0.560865,0.648837,93,0.577323
2,353.064,0.006,0.309851,0.988226,0.998043,0.998043,0.650752,1.0,119,0.65531,...,0.571004,0.931792,72,0.55056,0.928654,62,0.664738,0.928112,60,0.32159
3,353.064,0.006,0.188902,0.98687,0.998182,0.998182,0.650752,1.0,119,0.65531,...,0.571004,0.931792,72,0.55056,0.928654,62,0.664738,0.928112,60,0.451606
4,353.06,0.01,0.589266,0.986744,0.999215,0.999215,0.766091,1.0,81,0.710621,...,0.559293,0.621039,1,0.525179,0.620448,11,0.576588,0.620321,74,0.32159


In [29]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# train rf using optimised parameters from below

rf = RandomForestRegressor(n_estimators = 250, 
                           random_state = 42, 
                           max_depth = 5, 
                           verbose=1,
                           min_samples_leaf=50,
                           n_jobs=-1)
rf.fit(selected_training_scores, training_labels)

# predict on train
rf_train_predictions = rf.predict(selected_training_scores)
mse_train_rf = mean_squared_error(training_labels, rf_train_predictions)
print('Training MSE', mse_train_rf)

# predict on test
rf_predictions = rf.predict(selected_validation_scores)
mse_rf = mean_squared_error(validation_labels, rf_predictions)
print('Validation MSE', mse_rf)

# get feature importances
importances = list(rf.feature_importances_)
feature_importances = [(feature, round(importance, 2)) for feature, importance in zip(selected_training_scores.columns, importances)]
feature_importances = sorted(feature_importances, key = lambda x: x[1], reverse = True)
[print('Variable: {:20} Importance: {}'.format(*pair)) for pair in feature_importances]

  rf.fit(selected_training_scores, training_labels)
[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:   48.7s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:  4.1min
[Parallel(n_jobs=-1)]: Done 250 out of 250 | elapsed:  5.5min finished
[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed:    0.1s
[Parallel(n_jobs=8)]: Done 184 tasks      | elapsed:    0.8s
[Parallel(n_jobs=8)]: Done 250 out of 250 | elapsed:    1.1s finished
[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed:    0.0s


Training MSE 0.029163198449911065
Validation MSE 0.029619576135149716
Variable: precursor_mz_difference Importance: 0.2
Variable: average_ms2deepscore_9 Importance: 0.19
Variable: average_ms2deepscore_9_duplicated Importance: 0.18
Variable: query_precursor_mz   Importance: 0.16
Variable: average_ms2deepscore_5 Importance: 0.06
Variable: s2v_score            Importance: 0.05
Variable: average_ms2deepscore_1 Importance: 0.05
Variable: average_ms2deepscore_6 Importance: 0.04
Variable: average_ms2deepscore_2 Importance: 0.03
Variable: ms2ds_score          Importance: 0.01
Variable: average_ms2deepscore_3 Importance: 0.01
Variable: average_ms2deepscore_7 Importance: 0.01
Variable: average_ms2deepscore_8 Importance: 0.01
Variable: cosine_score         Importance: 0.0
Variable: modified_cosine_score Importance: 0.0
Variable: average_ms2deepscore_0 Importance: 0.0
Variable: tanimoto_score_structure_0 Importance: 0.0
Variable: nr_of_spectra_structure_0 Importance: 0.0
Variable: tanimoto_score_s

[Parallel(n_jobs=8)]: Done 184 tasks      | elapsed:    0.1s
[Parallel(n_jobs=8)]: Done 250 out of 250 | elapsed:    0.2s finished


[None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None]