# CEDAR's Value Recommender - Evaluation

This notebook describes the steps followed to evaluate [CEDAR's Value Recommender](https://github.com/metadatacenter/cedar-docs/wiki/CEDAR-Value-Recommender).

## Step 1: Datasets download
### NCBI BioSample
We downloaded the full content of the [NCBI BioSample database](https://www.ncbi.nlm.nih.gov/biosample/) from the [NCBI BioSample FTP repository](https://ftp.ncbi.nih.gov/biosample/). As a result, we obtained a ZIP file ([2018-03-09-biosample_set.xml.zip](data/samples/ncbi_samples/original/2018-03-09-biosample_set.xml.zip)) with a total of 7.8M samples in XML format. File size: 618MB compressed, 20.11GB uncompressed. These samples were downloaded on March 9, 2018.
If you want to download the most recent content of NCBI BioSample, you can run the following code:

In [None]:
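import os
import sys
import urllib.request

url = 'https://ftp.ncbi.nih.gov/biosample/biosample_set.xml.gz'
file_name = 'tmp/biosample_set.xml.gz'


def log_progress(block_count, block_size, total_size):
    """Report hook for urlretrieve: print how many bytes have been read so far."""
    sys.stdout.write('\rDownloaded %d of %d bytes' % (block_count * block_size, total_size))
    sys.stdout.flush()


os.makedirs(os.path.dirname(file_name), exist_ok=True)  # make sure tmp/ exists
urllib.request.urlretrieve(url, file_name, reporthook=log_progress)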

### EBI BioSamples
We wrote a script ([ebi_biosamples_1_download_split.py](scripts/ebi_biosamples_1_download_split.py)) to download all the biosample metadata from the [EBI BioSamples database](https://www.ebi.ac.uk/biosamples/) using the [EBI BioSamples API](https://www.ebi.ac.uk/biosamples/help/api.html), and to store the results as multiple JSON files. We obtained 4.1M samples stored as [412 JSON files](data/samples/ebi_samples/original) with 10K samples each (5.07GB in total). These samples were downloaded on March 9, 2018.

In [None]:
%run ./scripts/ebi_biosamples_1_download_split.py  # TODO: run it again; it currently does not work

## Step 2: Generation of template instances

### 2.1 Determine relevant attributes and create CEDAR templates

#### NCBI BioSample

For NCBI BioSample, we created a CEDAR template with all the attributes defined by the [NCBI BioSample Human Package v1.0](https://submit.ncbi.nlm.nih.gov/biosample/template/?package=Human.1.0&action=definition), which are: *biosample_accession, sample_name, sample_title, bioproject_accession, organism, isolate, age, biomaterial_provider, sex, tissue, cell_line, cell_subtype, cell_type, culture_collection, dev_stage, disease, disease_stage, ethnicity, health_state, karyotype, phenotype, population, race, sample_type, treatment, description*.

#### EBI BioSamples

The EBI BioSamples API's output format defines some top-level attributes and makes it possible to add new attributes that describe sample characteristics:
```
{
    "accession": "...",
    "name": "...",
    "releaseDate": "...",
    "updateDate": "...",
    "characteristics": { // key-value pairs (e.g., organism, age, sex, organismPart, etc.)
    	...
    },
    "organization": "...",
    "contact": "..."
}
```

Based on this format, we defined a metadata template containing 14 fields with general metadata about biological samples and some additional fields that capture specific characteristics of human samples: *accession, name, releaseDate, updateDate, organization, contact, organism, age, sex, organismPart, cellLine, cellType, diseaseState, ethnicity*.

We focused our analysis on the subset of fields that meet two key requirements: (1) they are present in both templates and, therefore, can be used to evaluate cross-template recommendations; and (2) they contain categorical values, that is, they represent information about discrete characteristics. We selected 6 fields that met these criteria. These fields are: *sex, organism part, cell line, cell type, disease, and ethnicity*. The names used to refer to these fields in both CEDAR's NCBI BioSample template and CEDAR's EBI BioSamples template are shown in the following table:

|Characteristic|NCBI BioSample attribute name|EBI BioSamples attribute name|
|---|---|---|
|sex|sex|sex|
|organism part|tissue|organismPart|
|cell line|cell_line|cellLine|
|cell type|cell_type|cellType|
|disease|disease|diseaseState|
|ethnicity|ethnicity|ethnicity|

### 2.2 Select samples

We filtered the samples based on two criteria:
* The sample is from "Homo sapiens" (organism=Homo sapiens).
* The sample has non-empty values for at least 3 of the 6 fields in the previous table.
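
The two criteria above can be sketched as a small predicate. This is a minimal illustration, assuming each sample has already been parsed into a flat dictionary; the function names, the dictionary layout, and the use of the EBI attribute names are assumptions made for the example and may differ from the actual filter scripts.

```python
# Hypothetical helpers illustrating the two selection criteria.
RELEVANT_ATTS = ['sex', 'organismPart', 'cellLine', 'cellType', 'diseaseState', 'ethnicity']
MIN_RELEVANT_ATTS = 3


def count_relevant(sample, relevant_atts=RELEVANT_ATTS):
    """Count the relevant attributes that have a non-empty value."""
    count = 0
    for att in relevant_atts:
        value = sample.get(att)
        if value is not None and str(value).strip():
            count += 1
    return count


def is_selected(sample):
    """Return True if the sample is human and has enough relevant attributes."""
    is_human = str(sample.get('organism', '')).strip().lower() == 'homo sapiens'
    return is_human and count_relevant(sample) >= MIN_RELEVANT_ATTS


sample = {'organism': 'Homo sapiens', 'sex': 'female', 'cellType': 'fibroblast',
          'diseaseState': 'normal', 'ethnicity': ''}
print(is_selected(sample))
```

The empty `ethnicity` value is not counted, so this sample passes with exactly three relevant attributes.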

#### NCBI BioSample

Script used: [ncbi_biosample_1_filter.py](scripts/ncbi_biosample_1_filter.py).

Of the 7.8M samples, 4,600,722 are *Homo sapiens* samples.

In [1]:
%run ./scripts/ncbi_biosample_1_filter.py

FileNotFoundError: [Errno 2] No such file or directory: 'data/samples/ncbi_samples/original/2018-03-09-biosample_set.xml'

Result: [data/samples/ncbi_samples/filtered/biosample_result_filtered.xml](data/samples/ncbi_samples/filtered/biosample_result_filtered.xml). It contains 157,653 *Homo sapiens* samples that have a minimum of 3 relevant attributes.
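
Because the uncompressed XML file is about 20GB, a filter of this kind would normally stream the file rather than load it whole. The sketch below shows one way to do that with `xml.etree.ElementTree.iterparse`. It is not the content of `ncbi_biosample_1_filter.py`: the element and attribute names (`BioSample`, `Description/Organism`, `taxonomy_name`, `Attribute`, `attribute_name`) are assumptions about the NCBI XML layout, and the demo document is invented for illustration.

```python
import io
import xml.etree.ElementTree as ET

NCBI_RELEVANT_ATTS = ['sex', 'tissue', 'cell_line', 'cell_type', 'disease', 'ethnicity']


def iter_selected_samples(xml_stream, min_relevant_atts=3):
    """Stream BioSample elements and yield (accession, values) for selected samples."""
    for _event, elem in ET.iterparse(xml_stream, events=('end',)):
        if elem.tag != 'BioSample':
            continue
        organism = elem.find('./Description/Organism')
        organism_name = organism.get('taxonomy_name') if organism is not None else None
        values = {}
        for att in elem.iter('Attribute'):
            name = att.get('attribute_name')
            text = (att.text or '').strip()
            if name in NCBI_RELEVANT_ATTS and text:
                values[name] = text
        accession = elem.get('accession')
        elem.clear()  # drop the element's children and attributes once processed
        if organism_name == 'Homo sapiens' and len(values) >= min_relevant_atts:
            yield accession, values


# Invented two-sample document, used only to exercise the function.
demo_xml = b"""<BioSampleSet>
  <BioSample accession="SAMN_DEMO_1">
    <Description><Organism taxonomy_name="Homo sapiens"/></Description>
    <Attributes>
      <Attribute attribute_name="sex">male</Attribute>
      <Attribute attribute_name="tissue">liver</Attribute>
      <Attribute attribute_name="disease">hepatitis</Attribute>
    </Attributes>
  </BioSample>
  <BioSample accession="SAMN_DEMO_2">
    <Description><Organism taxonomy_name="Homo sapiens"/></Description>
    <Attributes>
      <Attribute attribute_name="sex">female</Attribute>
    </Attributes>
  </BioSample>
</BioSampleSet>"""

selected = list(iter_selected_samples(io.BytesIO(demo_xml)))
print(selected)
```

Only the first demo sample is kept, since the second one has a single relevant attribute.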

#### EBI BioSamples

Script used: `ebi_biosamples_2_filter.py`

Input parameters (see `arm_constants.py`):
```
EBI_FILTER_INPUT_FOLDER = BASE_PATH + '/samples/ebi_samples/original'
EBI_FILTER_OUTPUT_FOLDER = BASE_PATH + '/samples/ebi_samples/filtered'
EBI_FILTER_MAX_SAMPLES_PER_FILE = 10000
EBI_FILTER_RELEVANT_ATTS = ['sex', 'organismPart', 'cellLine', 'cellType', 'diseaseState', 'ethnicity']
EBI_FILTER_MIN_RELEVANT_ATTS = 3
```

* Number of samples: 4,120,598
* Number of *Homo sapiens* samples: 1,381,843
* Number of *Homo sapiens* samples that have a minimum of 3 relevant attributes: 135,187

Result: Files at `/samples/ebi_samples/filtered` with 135,187 samples.






## 4. Generate CEDAR instances

With 135,000 instances and an 85%/15% split, the sets would contain:
* 135,000 total instances
* 114,750 training instances
* 20,250 testing instances

To be able to generate different samples for the training and testing sets, I pick:
* 120,000 total instances
* 102,000 training instances (85%)
* 18,000 testing instances (15%)

The sets are generated in this order:
1. EBI training and EBI testing.
2. NCBI training and NCBI testing, discarding the EBI ids.

The following sets must not share samples:
* EBI training must differ from EBI testing and from NCBI testing.
* NCBI training must differ from NCBI testing and from EBI testing.
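
The set sizes and the disjointness requirement can be sketched as follows. This is only an illustration: the helper `split_ids`, the random shuffling, and the seed are assumptions, and the actual scripts may select instances differently (for example, in file order).

```python
import random


def split_ids(candidate_ids, training_size, testing_size, excluded_ids=(), seed=0):
    """Pick disjoint training and testing id lists, skipping excluded ids."""
    excluded = set(excluded_ids)
    available = [i for i in candidate_ids if i not in excluded]
    if len(available) < training_size + testing_size:
        raise ValueError('Not enough ids left after exclusion')
    rng = random.Random(seed)
    rng.shuffle(available)
    training = available[:training_size]
    testing = available[training_size:training_size + testing_size]
    return training, testing


# Sizes used in this evaluation: 85% training, 15% testing.
total = 120000
training_size = round(total * 0.85)
testing_size = total - training_size
print(training_size, testing_size)

# Small demo: ids 0-49 play the role of ids already used by another set.
training, testing = split_ids(range(100), 10, 5, excluded_ids=range(50))
```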


STEP 1. Run `ebi_biosamples_3_to_cedar_instances.py` with the following input parameters (see `arm_constants.py`):
```
EBI_INSTANCES_TRAINING_SET_SIZE = 102000
EBI_INSTANCES_TESTING_SET_SIZE = 18000
EBI_INSTANCES_MAX_FILES_PER_FOLDER = 10000
EBI_INSTANCES_INPUT_PATH = EBI_FILTER_OUTPUT_FOLDER
EBI_INSTANCES_OUTPUT_BASE_PATH = BASE_PATH + '/cedar_instances/ebi_cedar_instances'
EBI_INSTANCES_TRAINING_BASE_PATH = EBI_INSTANCES_OUTPUT_BASE_PATH + '/training'
EBI_INSTANCES_TESTING_BASE_PATH = EBI_INSTANCES_OUTPUT_BASE_PATH + '/testing'
EBI_INSTANCES_EXCLUDE_IDS = False
EBI_INSTANCES_EXCLUDED_IDS_FILE_PATH = None
EBI_INSTANCES_OUTPUT_BASE_FILE_NAME = 'ebi_biosample_instance'
EBI_INSTANCES_EMPTY_BIOSAMPLE_INSTANCE_PATH = BASE_PATH + '/cedar_templates_and_reference_instances/ebi/ebi_biosample_instance_empty.json'
```

Output: EBI training and testing sets.
- folders 'training' and 'testing', with the corresponding instances
- files 'training_ids.txt' and 'testing_ids.txt', with the ids of the generated instances. These ids will be useful to avoid using the same instances when generating the NCBI training and test sets.

Number of generated files:
- `find ebi_cedar_instances/testing -name "*.json" | wc -l` returns 18000
- `find ebi_cedar_instances/training -name "*.json" | wc -l` returns 102000

STEP 2. Run 'ncbi_biosample_2_to_cedar_instances.py' with the following input parameters (see `arm_constants.py`):
```
NCBI_INSTANCES_TRAINING_SET_SIZE = 102000
NCBI_INSTANCES_TESTING_SET_SIZE = 0
NCBI_INSTANCES_MAX_FILES_PER_FOLDER = 10000
NCBI_INSTANCES_INPUT_PATH = NCBI_FILTER_OUTPUT_FILE
NCBI_INSTANCES_OUTPUT_BASE_PATH = BASE_PATH + '/cedar_instances/ncbi_cedar_instances'
NCBI_INSTANCES_TRAINING_BASE_PATH = NCBI_INSTANCES_OUTPUT_BASE_PATH + '/training'
NCBI_INSTANCES_TESTING_BASE_PATH = NCBI_INSTANCES_OUTPUT_BASE_PATH + '/testing'
NCBI_INSTANCES_EXCLUDE_IDS = True
NCBI_INSTANCES_EXCLUDED_IDS_FILE_PATH = BASE_PATH + '/cedar_instances/ebi_cedar_instances/testing_ids.txt'
NCBI_INSTANCES_OUTPUT_BASE_FILE_NAME = 'ncbi_biosample_instance'
NCBI_INSTANCES_EMPTY_BIOSAMPLE_INSTANCE_PATH = BASE_PATH + '/cedar_templates_and_reference_instances/ncbi/ncbi_biosample_instance_empty.json'
```
Note that the ids of the EBI testing instances are excluded from the NCBI training set.

Output: NCBI training set with samples that are not in the EBI testing set.

STEP 3.

3.1. Merge the following two files into a new file called `ncbi_ebi_training_ids.txt` and store it in `cedar_instances`:
- `cedar_instances/ebi_cedar_instances/training_ids.txt`
- `cedar_instances/ncbi_cedar_instances/training_ids.txt`

3.2. Run `ncbi_biosample_2_to_cedar_instances.py` with the following input parameters (see `arm_constants.py`):
```
NCBI_INSTANCES_TRAINING_SET_SIZE = 0
NCBI_INSTANCES_TESTING_SET_SIZE = 18000
NCBI_INSTANCES_MAX_FILES_PER_FOLDER = 10000
NCBI_INSTANCES_INPUT_PATH = NCBI_FILTER_OUTPUT_FILE
NCBI_INSTANCES_OUTPUT_BASE_PATH = BASE_PATH + '/cedar_instances/ncbi_cedar_instances'
NCBI_INSTANCES_TRAINING_BASE_PATH = NCBI_INSTANCES_OUTPUT_BASE_PATH + '/training'
NCBI_INSTANCES_TESTING_BASE_PATH = NCBI_INSTANCES_OUTPUT_BASE_PATH + '/testing'
NCBI_INSTANCES_EXCLUDE_IDS = True
NCBI_INSTANCES_EXCLUDED_IDS_FILE_PATH = BASE_PATH + '/cedar_instances/ncbi_ebi_training_ids.txt'
NCBI_INSTANCES_OUTPUT_BASE_FILE_NAME = 'ncbi_biosample_instance'
NCBI_INSTANCES_EMPTY_BIOSAMPLE_INSTANCE_PATH = BASE_PATH + '/cedar_templates_and_reference_instances/ncbi/ncbi_biosample_instance_empty.json'
```

Output: NCBI testing set with samples that are in neither the NCBI training set nor the EBI training set.
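
The merge in step 3.1 can be done by hand or with a few lines of Python. The sketch below assumes that each ids file contains one id per line; the helper name `merge_id_files` and the demo ids are hypothetical.

```python
import os
import tempfile


def merge_id_files(input_paths, output_path):
    """Write the sorted union of the ids found in input_paths, one id per line."""
    merged = set()
    for path in input_paths:
        with open(path, encoding='utf-8') as f:
            merged.update(line.strip() for line in f if line.strip())
    with open(output_path, 'w', encoding='utf-8') as f:
        for sample_id in sorted(merged):
            f.write(sample_id + '\n')
    return len(merged)


# Demo with two small temporary files standing in for the real ids files.
with tempfile.TemporaryDirectory() as tmp:
    ebi_path = os.path.join(tmp, 'ebi_training_ids.txt')
    ncbi_path = os.path.join(tmp, 'ncbi_training_ids.txt')
    out_path = os.path.join(tmp, 'ncbi_ebi_training_ids.txt')
    with open(ebi_path, 'w', encoding='utf-8') as f:
        f.write('ID_A\nID_B\n')
    with open(ncbi_path, 'w', encoding='utf-8') as f:
        f.write('ID_B\nID_C\n')
    merged_count = merge_id_files([ebi_path, ncbi_path], out_path)
    with open(out_path, encoding='utf-8') as f:
        merged_ids = f.read().split()

print(merged_count, merged_ids)
```

Using a set removes ids that appear in both files, which keeps the exclusion list free of duplicates.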

## 5. Generate annotated instances and mappings file

### 5.1. Extraction of unique values from CEDAR instances

Script: `cedar_annotator/1_unique_values_extractor.py`

Input parameters (see `cedar_annotator/annotation_constants.py`):
```
VALUES_EXTRACTION_INSTANCE_PATHS = [NCBI_INSTANCES_OUTPUT_BASE_PATH + '/training', NCBI_INSTANCES_OUTPUT_BASE_PATH + '/testing',
                  EBI_INSTANCES_OUTPUT_BASE_PATH + '/training', EBI_INSTANCES_OUTPUT_BASE_PATH + '/testing']
VALUES_EXTRACTION_OUTPUT_FILE_PATH = BASE_PATH + '/cedar_instances_annotated/unique_values/unique_values.txt'
```

* No. files processed: 240,000
* No. unique values identified: 26,166 (26,122 valid values)
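
As an illustration of what the extraction step does, the sketch below walks a parsed instance and collects its distinct values. It assumes that field values are stored under an `@value` key, as in CEDAR's JSON-LD instances; the helper name and the demo instance are hypothetical, and the real script may restrict the extraction to specific fields or normalize the values.

```python
def collect_values(node, found=None):
    """Recursively collect the non-empty '@value' strings of a parsed JSON instance."""
    if found is None:
        found = set()
    if isinstance(node, dict):
        value = node.get('@value')
        if isinstance(value, str) and value.strip():
            found.add(value.strip())
        for child in node.values():
            collect_values(child, found)
    elif isinstance(node, list):
        for child in node:
            collect_values(child, found)
    return found


# Invented instance fragment; empty and null values are ignored.
instance = {
    'sex': {'@value': 'male'},
    'tissue': {'@value': 'Liver'},
    'disease': {'@value': None},
    'cell_line': {'@value': ''},
}
unique_values = collect_values(instance)
print(sorted(unique_values))
```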

### 5.2. Annotation of unique values

Script: `cedar_annotator/2_unique_values_annotator.py`

Input parameters (see `cedar_annotator/annotation_constants.py`):
```
VALUES_ANNOTATION_INPUT_VALUES_FILE_PATH = VALUES_EXTRACTION_OUTPUT_FILE_PATH
VALUES_ANNOTATION_OUTPUT_FILE_PATH = BASE_PATH + '/cedar_instances_annotated/unique_values/unique_values_annotated.json'
VALUES_ANNOTATION_MAPPINGS_FILE_PATH = BASE_PATH + '/cedar_instances_annotated/unique_values/mappings.json'
VALUES_ANNOTATION_BIOPORTAL_API_KEY = '<my_BP_API_key>'
VALUES_ANNOTATION_VALUES_PER_ITERATION = 2000
VALUES_ANNOTATION_PREFERRED_ONTOLOGIES = ['EFO', 'DOID', 'OBI', 'CL', 'CLO', 'PATO', 'CHEBI', 'BFO', 'PR', 'CPT',
                                          'MEDDRA', 'UBERON','RXNORM', 'SNOMEDCT', 'FMA', 'LOINC', 'NDFRT', 'EDAM',
                                          'RCD', 'ICD10CM', 'SNMI', 'BTO', 'MESH', 'NCIT', 'OMIM']
VALUES_ANNOTATION_USE_NORMALIZED_VALUES = False
VALUES_ANNOTATION_NORMALIZED_VALUES_FILE_NAME = 'normalized_values.json'  # We assume that the file is stored in the current path
VALUES_ANNOTATION_LIMIT_TO_PREFERRED_ONTOLOGIES = False
```

* No. resulting URIs: 12,711
* No. values that were not annotated: 26,166 - 12,711 = 13,455
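
The annotation step processes the unique values in batches and queries BioPortal for each value. The sketch below shows the batching and how a request URL for the BioPortal Annotator endpoint could be built. Whether the script uses the Annotator endpoint or another BioPortal service is not stated here, so the endpoint choice, the helper names, and the placeholder key are assumptions; no request is actually sent.

```python
from urllib.parse import urlencode

ANNOTATOR_URL = 'https://data.bioontology.org/annotator'
VALUES_PER_ITERATION = 2000


def batches(values, size):
    """Split a list of values into consecutive batches of at most `size` items."""
    return [values[i:i + size] for i in range(0, len(values), size)]


def annotator_request_url(text, api_key, ontologies=None):
    """Build a BioPortal Annotator GET URL for one text value."""
    params = {'text': text, 'apikey': api_key}
    if ontologies:
        # Restricts the annotation to the given ontology acronyms.
        params['ontologies'] = ','.join(ontologies)
    return ANNOTATOR_URL + '?' + urlencode(params)


# 26,166 unique values split into iterations of 2,000 values.
n_batches = len(batches(list(range(26166)), VALUES_PER_ITERATION))
demo_url = annotator_request_url('liver', '<my_BP_API_key>', ['UBERON', 'FMA'])
print(n_batches)
print(demo_url)
```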

### 5.3. Annotation of CEDAR instances
This process uses the output of the previous steps to annotate all instances without making any calls to BioPortal.

Script: `cedar_annotator/3_cedar_instances_annotator.py`

Input parameters (see `cedar_annotator/annotation_constants.py`):
```
INSTANCES_ANNOTATION_INPUT_BASE_PATH = BASE_PATH + '/cedar_instances'
INSTANCES_ANNOTATION_OUTPUT_BASE_PATH = BASE_PATH + '/cedar_instances_annotated'
INSTANCES_ANNOTATION_INPUT_FOLDERS = [
    INSTANCES_ANNOTATION_INPUT_BASE_PATH + '/ncbi_cedar_instances/training',
    INSTANCES_ANNOTATION_INPUT_BASE_PATH + '/ncbi_cedar_instances/testing',
    INSTANCES_ANNOTATION_INPUT_BASE_PATH + '/ebi_cedar_instances/training',
    INSTANCES_ANNOTATION_INPUT_BASE_PATH + '/ebi_cedar_instances/testing'
]
INSTANCES_ANNOTATION_OUTPUT_SUFFIX = '_annotated'
INSTANCES_ANNOTATION_VALUES_ANNOTATED_FILE_PATH = VALUES_ANNOTATION_OUTPUT_FILE_PATH
INSTANCES_ANNOTATION_NCBI_EMPTY_INSTANCE_ANNOTATED_PATH = BASE_PATH + '/cedar_templates_and_reference_instances/ncbi/ncbi_biosample_instance_annotated_empty.json'
INSTANCES_ANNOTATION_EBI_EMPTY_INSTANCE_ANNOTATED_PATH = BASE_PATH + '/cedar_templates_and_reference_instances/ebi/ebi_biosample_instance_annotated_empty.json'
INSTANCES_ANNOTATION_NON_ANNOTATED_VALUES_FILE_NAME = 'non_annotated_values_report.txt'
INSTANCES_ANNOTATION_USE_NORMALIZED_VALUES = False
INSTANCES_ANNOTATION_NORMALIZED_VALUES_FILE_NAME = 'normalized_values.json'
```

Number of values per set. The figures in parentheses show the subtraction used to obtain each count.

|Set|No. total values|No. non-annotated values|
|---|---|---|
|NCBI training|336,351|47,877 (14%)|
|NCBI testing|58,529 (394,880 - 336,351)|8,877 (56,754 - 47,877) (15%)|
|EBI training|328,904 (723,784 - 394,880)|46,166 (102,920 - 56,754) (14%)|
|EBI testing|57,865 (781,649 - 723,784)|8,060 (110,980 - 102,920) (14%)|
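
The per-set counts above are differences between consecutive running totals. The following sketch reproduces that arithmetic and computes the percentage of non-annotated values for each set; the variable names are chosen for this example only.

```python
# Running totals taken from the figures above: (set, total values, non-annotated values)
cumulative = [
    ('NCBI training', 336351, 47877),
    ('NCBI testing', 394880, 56754),
    ('EBI training', 723784, 102920),
    ('EBI testing', 781649, 110980),
]

per_set = []
prev_total, prev_missing = 0, 0
for name, total, missing in cumulative:
    set_total = total - prev_total
    set_missing = missing - prev_missing
    per_set.append((name, set_total, set_missing, 100.0 * set_missing / set_total))
    prev_total, prev_missing = total, missing

for name, set_total, set_missing, pct in per_set:
    print(f'{name}: {set_total:,} values, {set_missing:,} non-annotated ({pct:.1f}%)')
```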

## 6. Generate Association Rules

Delete the current cedar-value-recommender index from Elasticsearch: `DELETE cedar-value-recommender`

Restart the cedar-value-recommender-server. The index will be created again, with the corresponding ES mappings.

Update the following file:
`/Users/marcosmr/Development/git_repos/CEDAR/cedar-valuerecommender-server/cedar-valuerecommender-server-core/src/main/java/org/metadatacenter/intelligentauthoring/valuerecommender/util/Constants.java`

* Set `READ_INSTANCES_FROM_CEDAR = false`.
* Update the variable `CEDAR_INSTANCES_PATH` with the full paths of the corresponding instances.

Apriori configuration:
```
public static final int APRIORI_MAX_NUM_RULES = 1000000;
public static int MIN_SUPPORTING_INSTANCES = 5; // The support will be dynamically calculated based on this value
public static final double MIN_CONFIDENCE = 0.3;
public static final double MIN_LIFT = 1.2;
public static final double MIN_LEVERAGE = 1.1;
public static final double MIN_CONVICTION = 1.1;
public static final int METRIC_TYPE_ID = 0; // 0 = Confidence | 1 = Lift | 2 = Leverage | 3 = Conviction
public static final String SUPPORT_METRIC_NAME = "Support";
public static final String CONFIDENCE_METRIC_NAME = "Confidence";
public static final String LIFT_METRIC_NAME = "Lift";
public static final String LEVERAGE_METRIC_NAME = "Leverage";
public static final String CONVICTION_METRIC_NAME = "Conviction";
public static final boolean VERBOSE_MODE = true;
```
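
For reference, the sketch below computes the metrics named in this configuration for a rule A -> B, using their standard textbook definitions. It is written in Python for illustration and is not the server's Java implementation. In particular, deriving the minimum support as `MIN_SUPPORTING_INSTANCES` divided by the number of instances is an assumption about what "dynamically calculated" means.

```python
def rule_metrics(n_total, n_antecedent, n_consequent, n_both):
    """Compute standard association-rule metrics for a rule A -> B from instance counts."""
    supp_a = n_antecedent / n_total
    supp_b = n_consequent / n_total
    support = n_both / n_total
    confidence = n_both / n_antecedent
    lift = confidence / supp_b
    leverage = support - supp_a * supp_b
    # Conviction is infinite when the rule always holds (confidence == 1).
    conviction = float('inf') if confidence == 1 else (1 - supp_b) / (1 - confidence)
    return {
        'Support': support,
        'Confidence': confidence,
        'Lift': lift,
        'Leverage': leverage,
        'Conviction': conviction,
    }


def min_support(min_supporting_instances, n_total):
    """Assumed derivation of the minimum support threshold from an instance count."""
    return min_supporting_instances / n_total


# Invented counts: 1,000 instances, A in 100, B in 200, A and B together in 80.
metrics = rule_metrics(1000, 100, 200, 80)
threshold = min_support(5, 102000)
print(metrics)
print(threshold)
```

With `METRIC_TYPE_ID = 0`, only the confidence threshold (`MIN_CONFIDENCE = 0.3`) would be used to select rules; the demo rule has a confidence of 0.8 and would pass it.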

For NCBI, POST to https://valuerecommender.metadatacenter.orgx/generate-rules:
{
	"templateIds" : [
		"https://repo.metadatacenter.orgx/templates/eef6f399-aa4e-4982-ab04-ad8e9635aa91"]	
}

For EBI:
{
	"templateIds" : [
		"https://repo.metadatacenter.orgx/templates/6b6c76e6-1d9b-4096-9702-133e25ecd140"]	
}
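
The requests above can also be sent from Python. The sketch below only builds the request object; the helper name is hypothetical, and the `Authorization: apiKey ...` header format is an assumption about how the server authenticates calls.

```python
import json
import urllib.request


def build_generate_rules_request(base_url, template_id, api_key=None):
    """Build a POST request for the generate-rules endpoint (the request is not sent)."""
    body = json.dumps({'templateIds': [template_id]}).encode('utf-8')
    headers = {'Content-Type': 'application/json'}
    if api_key:
        headers['Authorization'] = 'apiKey ' + api_key  # assumed header format
    return urllib.request.Request(base_url + '/generate-rules', data=body,
                                  headers=headers, method='POST')


request = build_generate_rules_request(
    'https://valuerecommender.metadatacenter.orgx',
    'https://repo.metadatacenter.orgx/templates/eef6f399-aa4e-4982-ab04-ad8e9635aa91')
print(request.get_method(), request.full_url)

# To actually send it (requires a running server):
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```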

### 6.1. Generate rules for the NCBI training set (free text)
Set `CEDAR_INSTANCES_PATH = "/Users/marcosmr/tmp/ARM_resources/EVALUATION/cedar_instances/ncbi_cedar_instances/training"`.

Create a backup of the generated rules in ES:
```json
POST _reindex
{
  "source": {
    "index": "cedar-value-recommender"
  },
  "dest": {
    "index": "cedar-value-recommender_backup-ncbi"
  }
}
```

* Number of rules generated: 52,192
* No. rules after filtering: 30,295
* Execution time: 5,682 seconds
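
Since the same `_reindex` call is repeated for every backup and restore, the request body can be generated with a small helper. This is a sketch: the function names are hypothetical, and the naming scheme (`_backup-<db>` plus an optional `-annotated` suffix) is inferred from the index names used in this notebook.

```python
import json

LIVE_INDEX = 'cedar-value-recommender'


def backup_index_name(db, annotated):
    """Return the backup index name for a database ('ncbi' or 'ebi')."""
    suffix = '-annotated' if annotated else ''
    return f'{LIVE_INDEX}_backup-{db}{suffix}'


def reindex_body(source_index, dest_index):
    """Build the JSON body of an Elasticsearch POST _reindex request."""
    return {'source': {'index': source_index}, 'dest': {'index': dest_index}}


# Backup: live index -> backup index. Restore: backup index -> live index.
backup = reindex_body(LIVE_INDEX, backup_index_name('ncbi', annotated=False))
restore = reindex_body(backup_index_name('ncbi', annotated=False), LIVE_INDEX)
print(json.dumps(backup, indent=2))
```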

### 6.2. Generate rules for the EBI training set (free text)
Set `CEDAR_INSTANCES_PATH = "/Users/marcosmr/tmp/ARM_resources/EVALUATION/cedar_instances/ebi_cedar_instances/training"`.

Create a backup of the generated rules in ES:
```json
POST _reindex
{
  "source": {
    "index": "cedar-value-recommender"
  },
  "dest": {
    "index": "cedar-value-recommender_backup-ebi"
  }
}
```

* Number of rules generated: 36,915
* No. rules after filtering: 24,983
* Execution time: 4,079 seconds

### 6.3. Generate rules for the NCBI training set (annotated)

Don't forget to put the `mappings.json` file into the appropriate resources folder of the value recommender server so that the rules can be created using those mappings.

Set `CEDAR_INSTANCES_PATH = "/Users/marcosmr/tmp/ARM_resources/EVALUATION/cedar_instances_annotated/ncbi_cedar_instances/training"`.

Create a backup of the generated rules in ES:
```json
POST _reindex
{
  "source": {
    "index": "cedar-value-recommender"
  },
  "dest": {
    "index": "cedar-value-recommender_backup-ncbi-annotated"
  }
}
```

* Number of rules generated: 18,223
* No. rules after filtering: 12,400
* Execution time: 1,293 seconds


### 6.4. Generate rules for the EBI training set (annotated)
Set `CEDAR_INSTANCES_PATH = "/Users/marcosmr/tmp/ARM_resources/EVALUATION/cedar_instances_annotated/ebi_cedar_instances/training"`.

Create a backup of the generated rules in ES:
```json
POST _reindex
{
  "source": {
    "index": "cedar-value-recommender"
  },
  "dest": {
    "index": "cedar-value-recommender_backup-ebi-annotated"
  }
}
```

* Number of rules generated: 16,838
* No. rules after filtering: 11,932
* Execution time: 1,087 seconds

The generated ARFF files are stored in a local temporary folder, and the specific path is logged. In my case, for the NCBI template, the path to the ARFF file is: `/var/folders/kk/7t15qjtd5cq0kpqnvm2mxp_00000gn/T//cedar-valuerecommender-server/arff-files/eef6f399-aa4e-4982-ab04-ad8e9635aa91.arff`


## 7. Perform evaluation


The baseline (most frequent values) is generated with an R script.
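
To make the idea of the baseline concrete, the sketch below computes the most frequent value of each field in a training set; a majority baseline would always recommend that value. It is a Python illustration and not the R script mentioned above. The function name and the demo instances are hypothetical.

```python
from collections import Counter, defaultdict


def most_frequent_values(instances, fields):
    """Return, for each field, the value that appears most often in the instances."""
    counters = defaultdict(Counter)
    for instance in instances:
        for field in fields:
            value = instance.get(field)
            if value:  # skip missing and empty values
                counters[field][value] += 1
    return {field: counters[field].most_common(1)[0][0]
            for field in fields if counters[field]}


# Invented training instances, flattened to field -> value dictionaries.
training = [
    {'sex': 'male', 'tissue': 'liver'},
    {'sex': 'female', 'tissue': 'liver'},
    {'sex': 'female'},
]
baseline = most_frequent_values(training, ['sex', 'tissue', 'ethnicity'])
print(baseline)
```

Fields with no values in the training data, such as `ethnicity` in the demo, get no baseline recommendation.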



### 7.1. NCBI training, NCBI testing (free text)
Restore backup of rules:
```json
POST _reindex
{
  "source": {
    "index": "cedar-value-recommender_backup-ncbi"
  },
  "dest": {
    "index": "cedar-value-recommender"
  }
}
```

Main parameters used (see `arm_constants.py`):
```
EVALUATION_TRAINING_DB = BIOSAMPLES_DB.NCBI
EVALUATION_TESTING_DB = BIOSAMPLES_DB.NCBI
EVALUATION_USE_ANNOTATED_VALUES = False
EVALUATION_EXTEND_URIS_WITH_MAPPINGS = False
EVALUATION_MAX_NUMBER_INSTANCES = 20000
EVALUATION_CEDAR_API_KEY = '<my_CEDAR_apiKey>'
```
Execution time:  8748.737908124924 seconds 

### 7.2. NCBI training, EBI testing (free text)

Main parameters used (see `arm_constants.py`):
```
EVALUATION_TRAINING_DB = BIOSAMPLES_DB.NCBI
EVALUATION_TESTING_DB = BIOSAMPLES_DB.EBI
EVALUATION_USE_ANNOTATED_VALUES = False
EVALUATION_EXTEND_URIS_WITH_MAPPINGS = False
EVALUATION_MAX_NUMBER_INSTANCES = 20000
EVALUATION_CEDAR_API_KEY = '<my_CEDAR_apiKey>'
```
Execution time:  10036.864538908005 seconds

### 7.3. EBI training, EBI testing (free text)

Restore backup of rules:
```json
POST _reindex
{
  "source": {
    "index": "cedar-value-recommender_backup-ebi"
  },
  "dest": {
    "index": "cedar-value-recommender"
  }
}
```

```
EVALUATION_TRAINING_DB = BIOSAMPLES_DB.EBI
EVALUATION_TESTING_DB = BIOSAMPLES_DB.EBI
EVALUATION_USE_ANNOTATED_VALUES = False
EVALUATION_EXTEND_URIS_WITH_MAPPINGS = False
EVALUATION_MAX_NUMBER_INSTANCES = 20000
EVALUATION_CEDAR_API_KEY = '<my_CEDAR_apiKey>'
```

Execution time:  10676.317619085312 seconds 


### 7.4. EBI training, NCBI testing (free text)

```
EVALUATION_TRAINING_DB = BIOSAMPLES_DB.EBI
EVALUATION_TESTING_DB = BIOSAMPLES_DB.NCBI
EVALUATION_USE_ANNOTATED_VALUES = False
EVALUATION_EXTEND_URIS_WITH_MAPPINGS = False
EVALUATION_MAX_NUMBER_INSTANCES = 20000
EVALUATION_CEDAR_API_KEY = '<my_CEDAR_apiKey>'
```

Execution time:  9867.484709978104 seconds 

### 7.5. NCBI training, NCBI testing (annotated)
```
EVALUATION_TRAINING_DB = BIOSAMPLES_DB.NCBI
EVALUATION_TESTING_DB = BIOSAMPLES_DB.NCBI
EVALUATION_USE_ANNOTATED_VALUES = True
EVALUATION_EXTEND_URIS_WITH_MAPPINGS = False
EVALUATION_MAX_NUMBER_INSTANCES = 20000
EVALUATION_CEDAR_API_KEY = '<my_CEDAR_apiKey>'
```

Execution time: 6836 seconds

### 7.6. NCBI training, EBI testing (annotated)

```
EVALUATION_TRAINING_DB = BIOSAMPLES_DB.NCBI
EVALUATION_TESTING_DB = BIOSAMPLES_DB.EBI
EVALUATION_USE_ANNOTATED_VALUES = True
EVALUATION_EXTEND_URIS_WITH_MAPPINGS = False
EVALUATION_MAX_NUMBER_INSTANCES = 20000
EVALUATION_CEDAR_API_KEY = '<my_CEDAR_apiKey>'
```

Execution time:  7860.066431045532 seconds 

### 7.7. EBI training, EBI testing (annotated)

Restore backup of rules:
```json
POST _reindex
{
  "dest": {
    "index": "cedar-value-recommender"
  },
  "source": {
    "index": "cedar-value-recommender_backup-ebi-annotated"
  }
}
```

```
EVALUATION_TRAINING_DB = BIOSAMPLES_DB.EBI
EVALUATION_TESTING_DB = BIOSAMPLES_DB.EBI
EVALUATION_USE_ANNOTATED_VALUES = True
EVALUATION_EXTEND_URIS_WITH_MAPPINGS = False
EVALUATION_MAX_NUMBER_INSTANCES = 20000
EVALUATION_CEDAR_API_KEY = '<my_CEDAR_apiKey>'
```

Execution time:  7926.45552110672 seconds 

### 7.8. EBI training, NCBI testing (annotated)

```
EVALUATION_TRAINING_DB = BIOSAMPLES_DB.EBI
EVALUATION_TESTING_DB = BIOSAMPLES_DB.NCBI
EVALUATION_USE_ANNOTATED_VALUES = True
EVALUATION_EXTEND_URIS_WITH_MAPPINGS = False
EVALUATION_MAX_NUMBER_INSTANCES = 20000
EVALUATION_CEDAR_API_KEY = '<my_CEDAR_apiKey>'
```

Execution time:  7181.8355939388275 seconds 


### 7.9. NCBI training, NCBI testing (annotated, using mappings)

Enable mappings in the value recommender server (constants file).


```
EVALUATION_TRAINING_DB = BIOSAMPLES_DB.NCBI
EVALUATION_TESTING_DB = BIOSAMPLES_DB.NCBI
EVALUATION_USE_ANNOTATED_VALUES = True
EVALUATION_EXTEND_URIS_WITH_MAPPINGS = True
EVALUATION_MAX_NUMBER_INSTANCES = 20000
EVALUATION_CEDAR_API_KEY = '<my_CEDAR_apiKey>'
```
Execution time:  6901.707034826279 seconds 

### 7.10. NCBI training, EBI testing (annotated, using mappings)

```
EVALUATION_TRAINING_DB = BIOSAMPLES_DB.NCBI
EVALUATION_TESTING_DB = BIOSAMPLES_DB.EBI
EVALUATION_USE_ANNOTATED_VALUES = True
EVALUATION_EXTEND_URIS_WITH_MAPPINGS = True
EVALUATION_MAX_NUMBER_INSTANCES = 20000
EVALUATION_CEDAR_API_KEY = '<my_CEDAR_apiKey>'
```

Execution time: (not recorded)

### 7.11. EBI training, EBI testing (annotated, using mappings)

```
EVALUATION_TRAINING_DB = BIOSAMPLES_DB.EBI
EVALUATION_TESTING_DB = BIOSAMPLES_DB.EBI
EVALUATION_USE_ANNOTATED_VALUES = True
EVALUATION_EXTEND_URIS_WITH_MAPPINGS = True
EVALUATION_MAX_NUMBER_INSTANCES = 20000
EVALUATION_CEDAR_API_KEY = '<my_CEDAR_apiKey>'
```

Execution time:  6688.167499065399 seconds 


### 7.12. EBI training, NCBI testing (annotated, using mappings)

```
EVALUATION_TRAINING_DB = BIOSAMPLES_DB.EBI
EVALUATION_TESTING_DB = BIOSAMPLES_DB.NCBI
EVALUATION_USE_ANNOTATED_VALUES = True
EVALUATION_EXTEND_URIS_WITH_MAPPINGS = True
EVALUATION_MAX_NUMBER_INSTANCES = 20000
EVALUATION_CEDAR_API_KEY = '<my_CEDAR_apiKey>'
```

Execution time:  7615.736355066299 seconds 
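
The twelve runs in this section follow a regular pattern: three value modes crossed with four training/testing combinations. The sketch below enumerates them in the order used above, which may help when scripting the runs. The dictionary layout is an assumption for illustration; in the actual workflow these values are set by hand in `arm_constants.py`.

```python
from itertools import product

MODES = [
    # (label, EVALUATION_USE_ANNOTATED_VALUES, EVALUATION_EXTEND_URIS_WITH_MAPPINGS)
    ('free text', False, False),
    ('annotated', True, False),
    ('annotated, using mappings', True, True),
]
DB_PAIRS = [('NCBI', 'NCBI'), ('NCBI', 'EBI'), ('EBI', 'EBI'), ('EBI', 'NCBI')]

runs = []
for (label, use_annotated, use_mappings), (training_db, testing_db) in product(MODES, DB_PAIRS):
    runs.append({
        'section': '7.%d' % (len(runs) + 1),
        'label': label,
        'EVALUATION_TRAINING_DB': training_db,
        'EVALUATION_TESTING_DB': testing_db,
        'EVALUATION_USE_ANNOTATED_VALUES': use_annotated,
        'EVALUATION_EXTEND_URIS_WITH_MAPPINGS': use_mappings,
    })

for run in runs:
    print(run['section'], run['EVALUATION_TRAINING_DB'], '->',
          run['EVALUATION_TESTING_DB'], '(%s)' % run['label'])
```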
