# CEDAR's Value Recommender - Evaluation

This notebook describes the steps followed to evaluate [CEDAR's Value Recommender](https://github.com/metadatacenter/cedar-docs/wiki/CEDAR-Value-Recommender).

The output of all the scripts will be saved to the "output" folder. The "data" folder contains source data files and the full outputs used to evaluate the system.

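The paths and settings used by the scripts are defined in the constants file (`scripts/constants.py`).

The cells below work on a workspace folder. Several long-running steps also include a shortcut cell that copies precomputed results to the workspace, so those steps can be skipped.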

## Step 1: Datasets download
### 1.a. NCBI BioSample
We downloaded the full content of the [NCBI BioSample database](https://www.ncbi.nlm.nih.gov/biosample/) from the [NCBI BioSample FTP repository](https://ftp.ncbi.nih.gov/biosample/) as a .gz file, which you can find in the folder [data/samples/ncbi_samples/original](data/samples/ncbi_samples/original). This file contains metadata about 7.8M NCBI samples. To begin, copy the file to the workspace folder:

In [8]:
%%time
# Copy the .gz file with the NCBI samples used in the evaluation to the workspace
from shutil import copy
import os
import scripts.constants as c

source_file_path = c.NCBI_SAMPLES_ORIGINAL_FILE_PATH
dest_path = os.path.join(c.WORKSPACE_FOLDER, c.NCBI_SAMPLES_ORIGINAL_PATH)
dest_file_name = c.NCBI_SAMPLES_FILE_DEST
print('Source file path: ' + source_file_path)
print('Destination path: ' + dest_path)
# Create the destination folder if it does not exist, then copy the file
os.makedirs(dest_path, exist_ok=True)
copy(source_file_path, os.path.join(dest_path, dest_file_name))

Source file path: data/samples/ncbi_samples/original/2018-03-09-biosample_set.xml.gz
Destination path: workspace/data/samples/ncbi_samples/original
CPU times: user 245 ms, sys: 902 ms, total: 1.15 s
Wall time: 1.4 s


Note that the NCBI samples file was downloaded on March 9, 2018. Alternatively, if you want to conduct the evaluation with the most recent NCBI samples, run the following cell:

In [7]:
# OPTIONAL: Download the most recent NCBI biosamples to the workspace
import os
import urllib.request
import scripts.util as util
import scripts.constants as c

url = c.NCBI_DOWNLOAD_URL
dest_path = os.path.join(c.WORKSPACE_FOLDER, c.NCBI_SAMPLES_ORIGINAL_PATH)
dest_file_name = c.NCBI_SAMPLES_FILE_DEST
print('Source URL: ' + url)
print('Destination path: ' + dest_path)
# Create the destination folder if it does not exist, then download the file
os.makedirs(dest_path, exist_ok=True)
urllib.request.urlretrieve(url, os.path.join(dest_path, dest_file_name), reporthook=util.log_progress)

Source URL: https://ftp.ncbi.nih.gov/biosample/biosample_set.xml.gz
Destination path: workspace/data/samples/ncbi_samples/original
...0%, 6 MB, 339 KB/s, 20 seconds passed

KeyboardInterrupt: 

### 1.b. EBI BioSamples
We wrote a script ([ebi_biosamples_1_download_split.py](scripts/ebi_biosamples_1_download_split.py)) to download the metadata of all samples in the [EBI BioSamples database](https://www.ebi.ac.uk/biosamples/) using the [EBI BioSamples API](https://www.ebi.ac.uk/biosamples/help/api.html). We stored the results as a ZIP file, [2018-03-09-ebi_samples.zip](data/samples/ebi_samples/original/2018-03-09-ebi_samples.zip), which contains 412 JSON files with metadata for a total of 4.1M samples. Extract the file to the workspace:

In [10]:
import zipfile, os
import scripts.constants as c

source_path = c.EBI_SAMPLES_ORIGINAL_FILE_PATH
dest_path = os.path.join(c.WORKSPACE_FOLDER, c.EBI_SAMPLES_ORIGINAL_PATH)
# Extract all the JSON files in the archive to the workspace
with zipfile.ZipFile(source_path, 'r') as zip_obj:
    zip_obj.extractall(dest_path)

Note that these EBI samples were downloaded on March 9, 2018. If you want to run the evaluation with the most recent EBI samples, you can run [ebi_biosamples_1_download_split.py](scripts/ebi_biosamples_1_download_split.py) again:

In [9]:
# OPTIONAL: download all the EBI samples from the EBI's API
%run ./scripts/ebi_biosamples_1_download_split

Downloading EBI samples to: ./workspace/data/samples/ebi_samples/original


KeyboardInterrupt: 

## Step 2: Generation of template instances

### 2.1. Determine relevant attributes and create CEDAR templates

#### 2.1.a. NCBI BioSample

For NCBI BioSample, we created a CEDAR template with all the attributes defined by the [NCBI BioSample Human Package v1.0](https://submit.ncbi.nlm.nih.gov/biosample/template/?package=Human.1.0&action=definition), which are: *biosample_accession, sample_name, sample_title, bioproject_accession, organism, isolate, age, biomaterial_provider, sex, tissue, cell_line, cell_subtype, cell_type, culture_collection, dev_stage, disease, disease_stage, ethnicity, health_state, karyotype, phenotype, population, race, sample_type, treatment, description*.

#### 2.1.b. EBI BioSamples

The EBI BioSamples API's output format defines some top-level attributes and makes it possible to add new attributes that describe sample characteristics:
```
{
    "accession": "...",
    "name": "...",
    "releaseDate": "...",
    "updateDate": "...",
    "characteristics": { // key-value pairs (e.g., organism, age, sex, organismPart, etc.)
    	...
    },
    "organization": "...",
    "contact": "..."
}
```

Based on this format, we defined a metadata template containing 14 fields with general metadata about biological samples and some additional fields that capture specific characteristics of human samples: *accession, name, releaseDate, updateDate, organization, contact, organism, age, sex, organismPart, cellLine, cellType, diseaseState, ethnicity*.

We focused our analysis on the subset of fields that meet two key requirements: (1) they are present in both templates and, therefore, can be used to evaluate cross-template recommendations; and (2) they contain categorical values, that is, they represent information about discrete characteristics. We selected 6 fields that met these criteria. These fields are: *sex, organism part, cell line, cell type, disease, and ethnicity*. The names used to refer to these fields in both CEDAR's NCBI BioSample template and CEDAR's EBI BioSamples template are shown in the following table:

|Characteristic|NCBI BioSample attribute name|EBI BioSamples attribute name|
|---|---|---|
|sex|sex|sex|
|organism part|tissue|organismPart|
|cell line|cell_line|cellLine|
|cell type|cell_type|cellType|
|disease|disease|diseaseState|
|ethnicity|ethnicity|ethnicity|
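The correspondence in the table can also be written down as a small lookup structure, which is convenient when comparing values across the two templates. The following is a minimal sketch: the attribute names come from the table, while the dictionary layout and the `translate_field` helper are illustrative assumptions and are not taken from the evaluation scripts.

```python
# Attribute names as listed in the table; the structure itself is hypothetical.
FIELD_NAMES = {
    "sex": {"ncbi": "sex", "ebi": "sex"},
    "organism part": {"ncbi": "tissue", "ebi": "organismPart"},
    "cell line": {"ncbi": "cell_line", "ebi": "cellLine"},
    "cell type": {"ncbi": "cell_type", "ebi": "cellType"},
    "disease": {"ncbi": "disease", "ebi": "diseaseState"},
    "ethnicity": {"ncbi": "ethnicity", "ebi": "ethnicity"},
}


def translate_field(name, source, target):
    """Return the `target` template's name for the field called `name` in `source`."""
    for names in FIELD_NAMES.values():
        if names[source] == name:
            return names[target]
    raise KeyError(f"{name!r} is not one of the six shared {source} fields")


print(translate_field("tissue", "ncbi", "ebi"))        # organismPart
print(translate_field("diseaseState", "ebi", "ncbi"))  # disease
```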

### 2.2. Select samples

We filtered the samples based on two criteria:
* The sample is from "Homo sapiens" (organism=Homo sapiens).
* The sample has non-empty values for at least 3 of the 6 fields in the previous table.
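The two criteria above can be sketched as a single predicate. This is only an illustration under stated assumptions: samples are represented as flat dictionaries, the organism comparison ignores case, and a value counts as empty only when it is missing or blank. The function name `is_selected` is hypothetical, and the actual filter scripts may apply additional rules.

```python
# The six fields of the previous table, using the NCBI attribute names.
RELEVANT_FIELDS = ["sex", "tissue", "cell_line", "cell_type", "disease", "ethnicity"]


def is_selected(sample, min_filled=3):
    """Return True if a sample (attribute -> value dict) meets both criteria."""
    organism = str(sample.get("organism") or "").strip().lower()
    if organism != "homo sapiens":
        return False
    # Count the relevant fields that have a non-blank value.
    filled = sum(1 for f in RELEVANT_FIELDS if str(sample.get(f) or "").strip())
    return filled >= min_filled


samples = [
    {"organism": "Homo sapiens", "sex": "female", "tissue": "liver", "disease": "hepatitis"},
    {"organism": "Homo sapiens", "sex": "male", "tissue": ""},
    {"organism": "Mus musculus", "sex": "male", "tissue": "brain", "cell_type": "neuron"},
]
print([is_selected(s) for s in samples])  # [True, False, False]
```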

#### 2.2.a. NCBI BioSample

Script used: [ncbi_biosample_1_filter.py](scripts/ncbi_biosample_1_filter.py). 

In [1]:
# Filter the NCBI samples
%run ./scripts/ncbi_biosample_1_filter.py

Input file: ./workspace/data/samples/ncbi_samples/original/biosample_set.xml.gz
Processing NCBI samples...
Processed samples: 5000
Selected samples: 0
Processed samples: 10000
Selected samples: 2
Processed samples: 15000
Selected samples: 59
Processed samples: 20000
Selected samples: 67
Processed samples: 25000
Selected samples: 67
Processed samples: 30000
Selected samples: 67
Processed samples: 35000
Selected samples: 67
Processed samples: 40000
Selected samples: 67
Processed samples: 45000
Selected samples: 67


KeyboardInterrupt: 

The result is an XML file with 157,653 samples ([biosample_result_filtered.xml](data/samples/ncbi_samples/filtered/biosample_result_filtered.xml)). 

<font color='blue'>**Shortcut:**</font> copy the precomputed NCBI filtered samples to the workspace:

In [3]:
# Shortcut: reuse existing filtered NCBI samples 
import os
from shutil import copyfile
import scripts.arm_constants as c

src = c.NCBI_FILTER_OUTPUT_FILE_PRECOMPUTED
dst = c.NCBI_FILTER_OUTPUT_FILE
if not os.path.exists(os.path.dirname(dst)):
    os.makedirs(os.path.dirname(dst))
copyfile(src, dst)

'./workspace/data/samples/ncbi_samples/filtered/biosample_result_filtered.xml'

#### 2.2.b. EBI BioSamples

For the EBI samples, we used the script [ebi_biosamples_2_filter.py](scripts/ebi_biosamples_2_filter.py):

In [1]:
# Filter the EBI samples
%run ./scripts/ebi_biosamples_2_filter.py

Processing file: 00001 ebi_biosamples_1to10000.json
Accumulated selected samples: 1029
Processing file: 00002 ebi_biosamples_10001to20000.json
Accumulated selected samples: 1897
Processing file: 00003 ebi_biosamples_20001to30000.json
Accumulated selected samples: 2555
Processing file: 00004 ebi_biosamples_30001to40000.json
Accumulated selected samples: 3434
Processing file: 00005 ebi_biosamples_40001to50000.json
Accumulated selected samples: 4372
Processing file: 00006 ebi_biosamples_50001to60000.json
Accumulated selected samples: 5147
Processing file: 00007 ebi_biosamples_60001to70000.json
Accumulated selected samples: 5611
Processing file: 00008 ebi_biosamples_70001to80000.json
Accumulated selected samples: 5810
Processing file: 00009 ebi_biosamples_80001to90000.json
Accumulated selected samples: 6336
Processing file: 00010 ebi_biosamples_90001to100000.json
Accumulated selected samples: 6567
Processing file: 00011 ebi_biosamples_100001to110000.json
Accumulated selected samples: 6746


KeyboardInterrupt: 

The result is a set of 14 JSON files with a total of 135,187 samples, which are available [in this folder](data/samples/ebi_samples/filtered/).

<font color='blue'>**Shortcut:**</font> copy the precomputed EBI filtered samples to the workspace:

In [4]:
# Shortcut: reuse existing filtered EBI samples 
import os
from shutil import copyfile
import scripts.arm_constants as c

src = c.EBI_FILTER_OUTPUT_FOLDER_PRECOMPUTED
dst = c.EBI_FILTER_OUTPUT_FOLDER
if not os.path.exists(dst):
    os.makedirs(dst)

for file_name in os.listdir(src):
    print(file_name)
    print(os.path.join(src, file_name))
    print(os.path.join(dst, file_name))
    copyfile(os.path.join(src, file_name), os.path.join(dst, file_name))

ebi_biosamples_filtered_3_20000to29999.json
./data/samples/ebi_samples/filtered/ebi_biosamples_filtered_3_20000to29999.json
./workspace/data/samples/ebi_samples/filtered/ebi_biosamples_filtered_3_20000to29999.json
ebi_biosamples_filtered_1_0to9999.json
./data/samples/ebi_samples/filtered/ebi_biosamples_filtered_1_0to9999.json
./workspace/data/samples/ebi_samples/filtered/ebi_biosamples_filtered_1_0to9999.json
ebi_biosamples_filtered_2_10000to19999.json
./data/samples/ebi_samples/filtered/ebi_biosamples_filtered_2_10000to19999.json
./workspace/data/samples/ebi_samples/filtered/ebi_biosamples_filtered_2_10000to19999.json
ebi_biosamples_filtered_4_30000to39999.json
./data/samples/ebi_samples/filtered/ebi_biosamples_filtered_4_30000to39999.json
./workspace/data/samples/ebi_samples/filtered/ebi_biosamples_filtered_4_30000to39999.json
ebi_biosamples_filtered_10_90000to99999.json
./data/samples/ebi_samples/filtered/ebi_biosamples_filtered_10_90000to99999.json
./workspace/data/samples/ebi_samp

### 2.3. Generate CEDAR instances

We transformed the NCBI and EBI samples obtained from the previous step to CEDAR template instances conforming to [CEDAR's JSON-based Template Model](https://metadatacenter.org/tools-training/outreach/cedar-template-model).
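As a rough illustration of this transformation, the sketch below wraps each field value of a flat sample dictionary in a `{"@value": ...}` object. It is a simplification under stated assumptions: the helper name `to_cedar_instance` and the template identifier are hypothetical, and a complete instance contains further properties (for example `@context` and provenance information) that are omitted here.

```python
import json


def to_cedar_instance(sample, field_names, template_id):
    """Wrap each field value of a flat sample dict in a {"@value": ...} object."""
    instance = {"schema:isBasedOn": template_id}
    for name in field_names:
        value = str(sample.get(name) or "").strip()
        instance[name] = {"@value": value or None}  # blank fields become null
    return instance


sample = {"sex": "female", "tissue": "liver", "cell_line": ""}
instance = to_cedar_instance(
    sample,
    ["sex", "tissue", "cell_line"],
    "https://example.org/templates/hypothetical-template-id",  # placeholder
)
print(json.dumps(instance, indent=2))
```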

For NCBI samples, we used the script [ncbi_biosample_2_to_cedar_instances.py](scripts/ncbi_biosample_2_to_cedar_instances.py):

In [1]:
%%time
# Generate CEDAR instances from NCBI samples
%run ./scripts/ncbi_biosample_2_to_cedar_instances.py

Reading file: ./workspace/data/samples/ncbi_samples/filtered/biosample_result_filtered.xml
Extracting all samples from file (no. samples: 157653)
Randomly picking 135187 samples
Generating CEDAR instances...
No. instances generated: 10000(7%)
No. instances generated: 20000(15%)
No. instances generated: 30000(22%)
No. instances generated: 40000(30%)
No. instances generated: 50000(37%)
No. instances generated: 60000(44%)
No. instances generated: 70000(52%)
No. instances generated: 80000(59%)
No. instances generated: 90000(67%)
No. instances generated: 100000(74%)
No. instances generated: 110000(81%)
No. instances generated: 120000(89%)
No. instances generated: 130000(96%)
Finished
CPU times: user 2min 30s, sys: 44.2 s, total: 3min 14s
Wall time: 4min 7s


CEDAR's NCBI instances will be saved to [workspace/data/cedar_instances/ncbi_cedar_instances](workspace/data/cedar_instances/ncbi_cedar_instances).

For EBI samples, we used the script [ebi_biosamples_3_to_cedar_instances.py](scripts/ebi_biosamples_3_to_cedar_instances.py):

In [2]:
%%time
# Generate CEDAR instances from EBI samples
%run ./scripts/ebi_biosamples_3_to_cedar_instances.py

Reading EBI biosamples from folder: ./workspace/data/samples/ebi_samples/filtered
Total no. samples: 135187
Generating CEDAR instances...
No. instances generated: 10000(7%)
No. instances generated: 20000(15%)
No. instances generated: 30000(22%)
No. instances generated: 40000(30%)
No. instances generated: 50000(37%)
No. instances generated: 60000(44%)
No. instances generated: 70000(52%)
No. instances generated: 80000(59%)
No. instances generated: 90000(67%)
No. instances generated: 100000(74%)
No. instances generated: 110000(81%)
No. instances generated: 120000(89%)
No. instances generated: 130000(96%)
Finished
CPU times: user 1min 43s, sys: 46.4 s, total: 2min 30s
Wall time: 3min 28s


CEDAR's EBI instances will be saved to [workspace/data/cedar_instances/ebi_cedar_instances](workspace/data/cedar_instances/ebi_cedar_instances).

## Step 3: Semantic annotation

We used the [NCBO Annotator](https://bioportal.bioontology.org/annotator) via the [NCBO BioPortal API](http://data.bioontology.org/documentation) to automatically annotate a total of 270,374 template instances (135,187 instances for each template).

### 3.1. Extraction of unique values from CEDAR instances

To avoid making multiple calls to the NCBO Annotator API for the same terms, we first extracted all the unique values in the CEDAR instances.
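The idea can be sketched as follows, assuming that each instance stores a field value under `@value` and that values are compared after trimming whitespace. The function name and the sample data are illustrative; the actual extractor script may normalize values differently.

```python
def unique_values(instances, field_names):
    """Collect the distinct non-empty values found in the given fields."""
    values = set()
    for instance in instances:
        for name in field_names:
            value = (instance.get(name) or {}).get("@value")
            if value is not None and str(value).strip():
                values.add(str(value).strip())
    return values


instances = [
    {"sex": {"@value": "female"}, "tissue": {"@value": "liver"}},
    {"sex": {"@value": "female"}, "tissue": {"@value": None}},
    {"sex": {"@value": "male"}, "tissue": {"@value": " liver "}},
]
print(sorted(unique_values(instances, ["sex", "tissue"])))  # ['female', 'liver', 'male']
```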

In [3]:
%%time
# Extract unique values from NCBI and EBI instances
%run ./scripts/cedar_annotator/1_unique_values_extractor.py

Extracting unique values from CEDAR instances...
No. instances processed: 10000
No. instances processed: 20000
No. instances processed: 30000
No. instances processed: 40000
No. instances processed: 50000
No. instances processed: 60000
No. instances processed: 70000
No. instances processed: 80000
No. instances processed: 90000
No. instances processed: 100000
No. instances processed: 110000
No. instances processed: 120000
No. instances processed: 130000
No. instances processed: 140000
No. instances processed: 150000
No. instances processed: 160000
No. instances processed: 170000
No. instances processed: 180000
No. instances processed: 190000
No. instances processed: 200000
No. instances processed: 210000
No. instances processed: 220000
No. instances processed: 230000
No. instances processed: 240000
No. instances processed: 250000
No. instances processed: 260000
No. instances processed: 270000
No. unique values extracted: 26556


We processed 270,374 instances and obtained 26,556 unique values (see [unique_values.txt](workspace/data/cedar_instances_annotated/unique_values/unique_values.txt)).


### 3.2. Annotation of unique values and generation of mappings

We invoked the NCBO Annotator for the unique values obtained from the previous step. Additionally, we took advantage of the output provided by the Annotator API to extract all the different term URIs that map to each term in BioPortal and store all these equivalences into a mappings file. 
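The sketch below illustrates the two parts of this step without making any network call: building a request URL for the Annotator and collecting the class URIs from a parsed response. It assumes that the response is a JSON list whose items carry the class URI under `annotatedClass["@id"]`. The helper names, the choice of the first URI as the reference term, and the response data are illustrative assumptions, not the logic of the actual script.

```python
from urllib.parse import urlencode

ANNOTATOR_URL = "http://data.bioontology.org/annotator"


def build_annotator_url(text, api_key):
    """Build the GET URL used to annotate a single value."""
    return ANNOTATOR_URL + "?" + urlencode({"text": text, "apikey": api_key})


def extract_class_uris(annotator_response):
    """Collect the distinct class URIs of a response, in order of appearance."""
    uris = []
    for annotation in annotator_response:
        uri = annotation.get("annotatedClass", {}).get("@id")
        if uri and uri not in uris:
            uris.append(uri)
    return uris


# Hand-written stand-in for a parsed response (not real Annotator output).
fake_response = [
    {"annotatedClass": {"@id": "http://example.org/ontoA/liver"}},
    {"annotatedClass": {"@id": "http://example.org/ontoB/liver"}},
    {"annotatedClass": {"@id": "http://example.org/ontoA/liver"}},
]
uris = extract_class_uris(fake_response)
mappings = {uris[0]: uris[1:]}  # assumption: first URI is the reference term
print(build_annotator_url("liver", "YOUR_API_KEY"))
print(mappings)
```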

Script used: [2_unique_values_annotator.py](scripts/cedar_annotator/2_unique_values_annotator.py)

Note that when running the following cell, you will be asked to enter your BioPortal API key. If you don't have one, follow [these instructions](https://bioportal.bioontology.org/help#Getting_an_API_key).

In [None]:
%%time
# Enter BioPortal API key
bp_api_key = input('Please enter your BioPortal API key and press Enter:')
# Annotate unique values and generate mappings file
%run ./scripts/cedar_annotator/2_unique_values_annotator.py --bioportal-api-key $bp_api_key

<font color='blue'>**Shortcut:**</font> if you don't have access to the NCBO Annotator or you don't want to wait for the annotation process to finish, copy the files with the annotated values to your workspace:

In [1]:
# Shortcut: reuse previously generated annotations for the unique values
import os
from shutil import copyfile
import scripts.cedar_annotator.annotation_constants as c

def my_copy(src, dst):
    if not os.path.exists(os.path.dirname(dst)):
        os.makedirs(os.path.dirname(dst))
    copyfile(src, dst)
    print(src + ' copied to ' + dst)

src1 = c.VALUES_ANNOTATION_OUTPUT_FILE_PATH_1_PRECOMPUTED
dst1 = c.VALUES_ANNOTATION_OUTPUT_FILE_PATH_1
src2 = c.VALUES_ANNOTATION_OUTPUT_FILE_PATH_2_PRECOMPUTED
dst2 = c.VALUES_ANNOTATION_OUTPUT_FILE_PATH_2

my_copy(src1, dst1)
my_copy(src2, dst2)

./data/cedar_instances_annotated/unique_values/unique_values_annotated_1.json copied to ./workspace/data/cedar_instances_annotated/unique_values/unique_values_annotated_1.json
./data/cedar_instances_annotated/unique_values/unique_values_annotated_2.json copied to ./workspace/data/cedar_instances_annotated/unique_values/unique_values_annotated_2.json


### 3.3. Annotation of CEDAR instances

This process uses the annotations generated in the previous step to annotate the values of the CEDAR instances without making any additional calls to the BioPortal API.

Script used: [3_cedar_instances_annotator.py](scripts/cedar_annotator/3_cedar_instances_annotator.py)
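A minimal sketch of this lookup-based annotation is shown below. It assumes that the annotations of the unique values are available as a dictionary from value to term URI, that plain values are stored under `@value`, and that annotated values are stored as `@id` plus `rdfs:label`. The function name, the URIs, and the data are hypothetical.

```python
def annotate_instance(instance, field_names, annotations):
    """Replace plain values with term URIs; return the new instance and the misses."""
    annotated = dict(instance)
    non_annotated = []
    for name in field_names:
        value = (instance.get(name) or {}).get("@value")
        if not value:
            continue  # nothing to annotate for empty fields
        uri = annotations.get(value)
        if uri is None:
            non_annotated.append(value)  # keep the original plain value
        else:
            annotated[name] = {"@id": uri, "rdfs:label": value}
    return annotated, non_annotated


annotations = {"liver": "http://example.org/onto/liver"}  # hypothetical URI
instance = {"tissue": {"@value": "liver"}, "disease": {"@value": "unknown thing"}}
annotated, misses = annotate_instance(instance, ["tissue", "disease"], annotations)
print(annotated)
print(misses)  # ['unknown thing']
```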

In [None]:
%%time
# Generate annotated CEDAR instances
%run ./scripts/cedar_annotator/3_cedar_instances_annotator.py

The settings used by this script are defined in `cedar_annotator/annotation_constants.py`:
```
INSTANCES_ANNOTATION_INPUT_BASE_PATH = BASE_PATH + '/cedar_instances'
INSTANCES_ANNOTATION_OUTPUT_BASE_PATH = BASE_PATH + '/cedar_instances_annotated'
INSTANCES_ANNOTATION_INPUT_FOLDERS = [
    INSTANCES_ANNOTATION_INPUT_BASE_PATH + '/ncbi_cedar_instances/training',
    INSTANCES_ANNOTATION_INPUT_BASE_PATH + '/ncbi_cedar_instances/testing',
    INSTANCES_ANNOTATION_INPUT_BASE_PATH + '/ebi_cedar_instances/training',
    INSTANCES_ANNOTATION_INPUT_BASE_PATH + '/ebi_cedar_instances/testing'
]
INSTANCES_ANNOTATION_OUTPUT_SUFFIX = '_annotated'
INSTANCES_ANNOTATION_VALUES_ANNOTATED_FILE_PATH = VALUES_ANNOTATION_OUTPUT_FILE_PATH
INSTANCES_ANNOTATION_NCBI_EMPTY_INSTANCE_ANNOTATED_PATH = BASE_PATH + '/cedar_templates_and_reference_instances/ncbi/ncbi_biosample_instance_annotated_empty.json'
INSTANCES_ANNOTATION_EBI_EMPTY_INSTANCE_ANNOTATED_PATH = BASE_PATH + '/cedar_templates_and_reference_instances/ebi/ebi_biosample_instance_annotated_empty.json'
INSTANCES_ANNOTATION_NON_ANNOTATED_VALUES_FILE_NAME = 'non_annotated_values_report.txt'
INSTANCES_ANNOTATION_USE_NORMALIZED_VALUES = False
INSTANCES_ANNOTATION_NORMALIZED_VALUES_FILE_NAME = 'normalized_values.json'
```

The following table shows, for each set of instances, the total number of values and the number of values that were not annotated. The counts for each set are differences between consecutive cumulative totals (for example, 394,880 - 336,351 = 58,529 total values for NCBI testing).

|Instance set|No. total values|No. non-annotated values|
|---|---|---|
|NCBI training|336,351|47,877 (14%)|
|NCBI testing|58,529|8,877 (15%)|
|EBI training|328,904|46,166 (14%)|
|EBI testing|57,865|8,060 (14%)|
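The per-set figures can be recomputed from the cumulative counters they were derived from. The snippet below is a small worked example of that arithmetic; the variable and function names are illustrative.

```python
# Cumulative counters, in processing order.
sets = ["NCBI training", "NCBI testing", "EBI training", "EBI testing"]
cumulative_total = [336351, 394880, 723784, 781649]
cumulative_non_annotated = [47877, 56754, 102920, 110980]


def per_set(cumulative):
    """Turn cumulative counters into per-set counts."""
    return [current - previous
            for previous, current in zip([0] + cumulative[:-1], cumulative)]


totals = per_set(cumulative_total)
non_annotated = per_set(cumulative_non_annotated)
for name, total, missing in zip(sets, totals, non_annotated):
    share = 100 * missing / total
    print(f"{name}: {missing:,} of {total:,} values not annotated ({share:.1f}%)")
```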

## 6. Generate Association Rules

Delete the current cedar-value-recommender index from Elasticsearch: `DELETE cedar-value-recommender`

Restart the cedar-value-recommender-server. The index will be created again, with the corresponding ES mappings.
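The index deletion can also be issued from Python instead of an Elasticsearch console. The sketch below only builds the request. It assumes an Elasticsearch instance at `http://localhost:9200` without authentication, which may not match your installation, and the helper name is hypothetical.

```python
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: local Elasticsearch, no authentication


def delete_index_request(index_name, es_url=ES_URL):
    """Build (but do not send) the HTTP request that deletes an index."""
    return urllib.request.Request(f"{es_url}/{index_name}", method="DELETE")


request = delete_index_request("cedar-value-recommender")
print(request.get_method(), request.full_url)
# To actually delete the index, send the request:
# urllib.request.urlopen(request)
```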

Update the following file:

`/Users/marcosmr/Development/git_repos/CEDAR/cedar-valuerecommender-server/cedar-valuerecommender-server-core/src/main/java/org/metadatacenter/intelligentauthoring/valuerecommender/util/Constants.java`

* Set `READ_INSTANCES_FROM_CEDAR = false`.
* Set the variable `CEDAR_INSTANCES_PATH` to the full path of the corresponding instances.

Apriori configuration:
```
public static final int APRIORI_MAX_NUM_RULES = 1000000;
public static int MIN_SUPPORTING_INSTANCES = 5; // The support will be dynamically calculated based on this value
public static final double MIN_CONFIDENCE = 0.3;
public static final double MIN_LIFT = 1.2;
public static final double MIN_LEVERAGE = 1.1;
public static final double MIN_CONVICTION = 1.1;
public static final int METRIC_TYPE_ID = 0; // 0 = Confidence | 1 = Lift | 2 = Leverage | 3 = Conviction
public static final String SUPPORT_METRIC_NAME = "Support";
public static final String CONFIDENCE_METRIC_NAME = "Confidence";
public static final String LIFT_METRIC_NAME = "Lift";
public static final String LEVERAGE_METRIC_NAME = "Leverage";
public static final String CONVICTION_METRIC_NAME = "Conviction";
public static final boolean VERBOSE_MODE = true;
```
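For reference, the metrics named in this configuration can be computed from simple instance counts using their standard definitions. The sketch below also shows one plausible reading of the "dynamically calculated" support, namely the minimum number of supporting instances divided by the total number of instances. That reading, the function names, and the example counts are assumptions and are not taken from the server code.

```python
def min_support(min_supporting_instances, n_instances):
    """Relative support threshold derived from an absolute instance count."""
    return min_supporting_instances / n_instances


def rule_metrics(n_total, n_antecedent, n_consequent, n_both):
    """Standard interestingness metrics for a rule A -> B, from instance counts."""
    p_a = n_antecedent / n_total
    p_b = n_consequent / n_total
    p_ab = n_both / n_total
    confidence = p_ab / p_a
    return {
        "support": p_ab,
        "confidence": confidence,
        "lift": confidence / p_b,
        "leverage": p_ab - p_a * p_b,
        # Conviction is unbounded when the rule always holds.
        "conviction": float("inf") if confidence == 1 else (1 - p_b) / (1 - confidence),
    }


# Made-up counts: 1,000 instances, A in 100, B in 200, A and B together in 50.
metrics = rule_metrics(n_total=1000, n_antecedent=100, n_consequent=200, n_both=50)
print(min_support(5, 1000))
print(metrics)
```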

For NCBI, send a POST request to `https://valuerecommender.metadatacenter.orgx/generate-rules` with the following body:

    {
      "templateIds": [
        "https://repo.metadatacenter.orgx/templates/eef6f399-aa4e-4982-ab04-ad8e9635aa91"
      ]
    }

For EBI, use the following body:

    {
      "templateIds": [
        "https://repo.metadatacenter.orgx/templates/6b6c76e6-1d9b-4096-9702-133e25ecd140"
      ]
    }
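The same request can be prepared from Python. The sketch below builds the POST request with the standard library and does not send it. The helper name is hypothetical, and any authentication headers that the server expects are not known from this notebook and are therefore omitted.

```python
import json
import urllib.request

GENERATE_RULES_URL = "https://valuerecommender.metadatacenter.orgx/generate-rules"
NCBI_TEMPLATE_ID = "https://repo.metadatacenter.orgx/templates/eef6f399-aa4e-4982-ab04-ad8e9635aa91"


def generate_rules_request(template_ids, url=GENERATE_RULES_URL):
    """Build (but do not send) the rule-generation request for some templates."""
    body = json.dumps({"templateIds": list(template_ids)}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = generate_rules_request([NCBI_TEMPLATE_ID])
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
# To actually trigger rule generation: urllib.request.urlopen(request)
```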

### 6.1. Generate rules for the NCBI training set (free text)
`CEDAR_INSTANCES_PATH = "/Users/marcosmr/tmp/ARM_resources/EVALUATION/cedar_instances/ncbi_cedar_instances/training"`

Create a backup of the generated rules in ES:
```json
POST _reindex
{
  "source": {
    "index": "cedar-value-recommender"
  },
  "dest": {
    "index": "cedar-value-recommender_backup-ncbi"
  }
}
```
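Because the same `_reindex` body is needed for every backup and restore operation, with only the index names changing, it can help to generate it rather than edit it by hand. The following sketch does that; the helper name is hypothetical.

```python
import json


def reindex_body(source_index, dest_index):
    """Return the body of a `POST _reindex` request as a dictionary."""
    return {"source": {"index": source_index}, "dest": {"index": dest_index}}


# Backup: copy the live index into a backup index.
backup = reindex_body("cedar-value-recommender", "cedar-value-recommender_backup-ncbi")
# Restore: the same operation with source and destination swapped.
restore = reindex_body(backup["dest"]["index"], backup["source"]["index"])

print(json.dumps(backup, indent=2))
print(json.dumps(restore, indent=2))
```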

* Number of rules generated: 52,192
* Number of rules after filtering: 30,295
* Execution time: 5,682 seconds

### 6.2. Generate rules for the EBI training set (free text)
`CEDAR_INSTANCES_PATH = "/Users/marcosmr/tmp/ARM_resources/EVALUATION/cedar_instances/ebi_cedar_instances/training"`

Create a backup of the generated rules in ES:
```json
POST _reindex
{
  "source": {
    "index": "cedar-value-recommender"
  },
  "dest": {
    "index": "cedar-value-recommender_backup-ebi"
  }
}
```

* Number of rules generated: 36,915
* Number of rules after filtering: 24,983
* Execution time: 4,079 seconds

### 6.3. Generate rules for the NCBI training set (annotated)

Before generating the rules, copy the `mappings.json` file to the appropriate resources folder of the value recommender server, so that the rules are created using those mappings.

`CEDAR_INSTANCES_PATH = "/Users/marcosmr/tmp/ARM_resources/EVALUATION/cedar_instances_annotated/ncbi_cedar_instances/training"`

Create a backup of the generated rules in ES:
```json
POST _reindex
{
  "source": {
    "index": "cedar-value-recommender"
  },
  "dest": {
    "index": "cedar-value-recommender_backup-ncbi-annotated"
  }
}
```

* Number of rules generated: 18,223
* Number of rules after filtering: 12,400
* Execution time: 1,293 seconds


### 6.4. Generate rules for the EBI training set (annotated)
`CEDAR_INSTANCES_PATH = "/Users/marcosmr/tmp/ARM_resources/EVALUATION/cedar_instances_annotated/ebi_cedar_instances/training"`

Create a backup of the generated rules in ES:
```json
POST _reindex
{
  "source": {
    "index": "cedar-value-recommender"
  },
  "dest": {
    "index": "cedar-value-recommender_backup-ebi-annotated"
  }
}
```

* Number of rules generated: 16,838
* Number of rules after filtering: 11,932
* Execution time: 1,087 seconds

The generated ARFF files are stored in a local temporary folder, whose path is written to the log. In my case, the path to the ARFF file for the NCBI template is: `/var/folders/kk/7t15qjtd5cq0kpqnvm2mxp_00000gn/T//cedar-valuerecommender-server/arff-files/eef6f399-aa4e-4982-ab04-ad8e9635aa91.arff`


## 7. Perform evaluation


The baseline consists of the most frequent values, which are generated with an R script.
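To make the idea of a most-frequent-values baseline concrete, here is a short sketch written in Python rather than R. It assumes that the baseline ranks the values of a field by how often they occur in the training instances. The function name, the `top_n` cutoff, and the data are illustrative and do not reproduce the R script.

```python
from collections import Counter


def most_frequent_values(instances, field, top_n=3):
    """Return the `top_n` most frequent non-empty values of a field."""
    counts = Counter(inst[field] for inst in instances if inst.get(field))
    return [value for value, _ in counts.most_common(top_n)]


training = [
    {"sex": "male"}, {"sex": "male"}, {"sex": "male"},
    {"sex": "female"}, {"sex": "female"},
    {"sex": ""}, {},
]
print(most_frequent_values(training, "sex"))  # ['male', 'female']
```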



### 7.1. NCBI training, NCBI testing (free text)
Restore backup of rules:
```json
POST _reindex
{
  "source": {
    "index": "cedar-value-recommender_backup-ncbi"
  },
  "dest": {
    "index": "cedar-value-recommender"
  }
}
```

Main parameters used (see `arm_constants.py`):
```
EVALUATION_TRAINING_DB = BIOSAMPLES_DB.NCBI
EVALUATION_TESTING_DB = BIOSAMPLES_DB.NCBI
EVALUATION_USE_ANNOTATED_VALUES = False
EVALUATION_EXTEND_URIS_WITH_MAPPINGS = False
EVALUATION_MAX_NUMBER_INSTANCES = 20000
EVALUATION_CEDAR_API_KEY = '<my_CEDAR_apiKey>'
```
Execution time:  8748.737908124924 seconds 

### 7.2. NCBI training, EBI testing (free text)

Main parameters used (see `arm_constants.py`):
```
EVALUATION_TRAINING_DB = BIOSAMPLES_DB.NCBI
EVALUATION_TESTING_DB = BIOSAMPLES_DB.EBI
EVALUATION_USE_ANNOTATED_VALUES = False
EVALUATION_EXTEND_URIS_WITH_MAPPINGS = False
EVALUATION_MAX_NUMBER_INSTANCES = 20000
EVALUATION_CEDAR_API_KEY = '<my_CEDAR_apiKey>'
```
Execution time:  10036.864538908005 seconds

### 7.3. EBI training, EBI testing (free text)

Restore backup of rules:
```json
POST _reindex
{
  "source": {
    "index": "cedar-value-recommender_backup-ebi"
  },
  "dest": {
    "index": "cedar-value-recommender"
  }
}
```

```
EVALUATION_TRAINING_DB = BIOSAMPLES_DB.EBI
EVALUATION_TESTING_DB = BIOSAMPLES_DB.EBI
EVALUATION_USE_ANNOTATED_VALUES = False
EVALUATION_EXTEND_URIS_WITH_MAPPINGS = False
EVALUATION_MAX_NUMBER_INSTANCES = 20000
EVALUATION_CEDAR_API_KEY = '<my_CEDAR_apiKey>'
```

Execution time:  10676.317619085312 seconds 


### 7.4. EBI training, NCBI testing (free text)

```
EVALUATION_TRAINING_DB = BIOSAMPLES_DB.EBI
EVALUATION_TESTING_DB = BIOSAMPLES_DB.NCBI
EVALUATION_USE_ANNOTATED_VALUES = False
EVALUATION_EXTEND_URIS_WITH_MAPPINGS = False
EVALUATION_MAX_NUMBER_INSTANCES = 20000
EVALUATION_CEDAR_API_KEY = '<my_CEDAR_apiKey>'
```

Execution time:  9867.484709978104 seconds 

### 7.5. NCBI training, NCBI testing (annotated)
```
EVALUATION_TRAINING_DB = BIOSAMPLES_DB.NCBI
EVALUATION_TESTING_DB = BIOSAMPLES_DB.NCBI
EVALUATION_USE_ANNOTATED_VALUES = True
EVALUATION_EXTEND_URIS_WITH_MAPPINGS = False
EVALUATION_MAX_NUMBER_INSTANCES = 20000
EVALUATION_CEDAR_API_KEY = '<my_CEDAR_apiKey>'
```

Execution time: 6836 seconds

### 7.6. NCBI training, EBI testing (annotated)

```
EVALUATION_TRAINING_DB = BIOSAMPLES_DB.NCBI
EVALUATION_TESTING_DB = BIOSAMPLES_DB.EBI
EVALUATION_USE_ANNOTATED_VALUES = True
EVALUATION_EXTEND_URIS_WITH_MAPPINGS = False
EVALUATION_MAX_NUMBER_INSTANCES = 20000
EVALUATION_CEDAR_API_KEY = '<my_CEDAR_apiKey>'
```

Execution time:  7860.066431045532 seconds 

### 7.7. EBI training, EBI testing (annotated)

Restore backup of rules:
```json
POST _reindex
{
  "dest": {
    "index": "cedar-value-recommender"
  },
  "source": {
    "index": "cedar-value-recommender_backup-ebi-annotated"
  }
}
```

```
EVALUATION_TRAINING_DB = BIOSAMPLES_DB.EBI
EVALUATION_TESTING_DB = BIOSAMPLES_DB.EBI
EVALUATION_USE_ANNOTATED_VALUES = True
EVALUATION_EXTEND_URIS_WITH_MAPPINGS = False
EVALUATION_MAX_NUMBER_INSTANCES = 20000
EVALUATION_CEDAR_API_KEY = '<my_CEDAR_apiKey>'
```

Execution time:  7926.45552110672 seconds 

### 7.8. EBI training, NCBI testing (annotated)

```
EVALUATION_TRAINING_DB = BIOSAMPLES_DB.EBI
EVALUATION_TESTING_DB = BIOSAMPLES_DB.NCBI
EVALUATION_USE_ANNOTATED_VALUES = True
EVALUATION_EXTEND_URIS_WITH_MAPPINGS = False
EVALUATION_MAX_NUMBER_INSTANCES = 20000
EVALUATION_CEDAR_API_KEY = '<my_CEDAR_apiKey>'
```

Execution time:  7181.8355939388275 seconds 


### 7.9. NCBI training, NCBI testing (annotated, using mappings)

Enable mappings in the value recommender server (constants file).


```
EVALUATION_TRAINING_DB = BIOSAMPLES_DB.NCBI
EVALUATION_TESTING_DB = BIOSAMPLES_DB.NCBI
EVALUATION_USE_ANNOTATED_VALUES = True
EVALUATION_EXTEND_URIS_WITH_MAPPINGS = True
EVALUATION_MAX_NUMBER_INSTANCES = 20000
EVALUATION_CEDAR_API_KEY = '<my_CEDAR_apiKey>'
```
Execution time:  6901.707034826279 seconds 

### 7.10. NCBI training, EBI testing (annotated, using mappings)

```
EVALUATION_TRAINING_DB = BIOSAMPLES_DB.NCBI
EVALUATION_TESTING_DB = BIOSAMPLES_DB.EBI
EVALUATION_USE_ANNOTATED_VALUES = True
EVALUATION_EXTEND_URIS_WITH_MAPPINGS = True
EVALUATION_MAX_NUMBER_INSTANCES = 20000
EVALUATION_CEDAR_API_KEY = '<my_CEDAR_apiKey>'
```

### 7.11. EBI training, EBI testing (annotated, using mappings)

```
EVALUATION_TRAINING_DB = BIOSAMPLES_DB.EBI
EVALUATION_TESTING_DB = BIOSAMPLES_DB.EBI
EVALUATION_USE_ANNOTATED_VALUES = True
EVALUATION_EXTEND_URIS_WITH_MAPPINGS = True
EVALUATION_MAX_NUMBER_INSTANCES = 20000
EVALUATION_CEDAR_API_KEY = '<my_CEDAR_apiKey>'
```

Execution time:  6688.167499065399 seconds 


### 7.12. EBI training, NCBI testing (annotated, using mappings)

```
EVALUATION_TRAINING_DB = BIOSAMPLES_DB.EBI
EVALUATION_TESTING_DB = BIOSAMPLES_DB.NCBI
EVALUATION_USE_ANNOTATED_VALUES = True
EVALUATION_EXTEND_URIS_WITH_MAPPINGS = True
EVALUATION_MAX_NUMBER_INSTANCES = 20000
EVALUATION_CEDAR_API_KEY = '<my_CEDAR_apiKey>'
```

Execution time:  7615.736355066299 seconds 
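The twelve runs in this section differ only in the training database, the testing database, and the two flags for annotated values and mappings. As a compact summary, the sketch below enumerates those combinations. It uses plain strings instead of the `BIOSAMPLES_DB` constants, and the function and key names are illustrative.

```python
from itertools import product

DATABASES = ["NCBI", "EBI"]
# (label, use annotated values, extend URIs with mappings)
MODES = [
    ("free text", False, False),
    ("annotated", True, False),
    ("annotated, using mappings", True, True),
]


def evaluation_runs(max_instances=20000):
    """Enumerate every combination of mode, training database and testing database."""
    runs = []
    for (label, annotated, mappings), training, testing in product(MODES, DATABASES, DATABASES):
        runs.append({
            "label": f"{training} training, {testing} testing ({label})",
            "EVALUATION_TRAINING_DB": training,
            "EVALUATION_TESTING_DB": testing,
            "EVALUATION_USE_ANNOTATED_VALUES": annotated,
            "EVALUATION_EXTEND_URIS_WITH_MAPPINGS": mappings,
            "EVALUATION_MAX_NUMBER_INSTANCES": max_instances,
        })
    return runs


runs = evaluation_runs()
for run in runs:
    print(run["label"])
```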
