# CEDAR's Value Recommender - Evaluation

This Jupyter notebook describes the steps followed to evaluate [CEDAR's Value Recommender](https://github.com/metadatacenter/cedar-docs/wiki/CEDAR-Value-Recommender), a recommendation system that suggests the most appropriate values for metadata fields.

We provide the scripts used to run the evaluation pipeline. All the resulting files will be stored in a local 'workspace' folder.


## Table of contents
* [Step 1. Datasets download](#s1)
    * [1.a. NCBI BioSample](#s1-a)
    * [1.b. EBI BioSamples](#s1-b)
* [Step 2: Generation of template instances](#s2)
    * [2.1. Determine relevant attributes and create CEDAR templates](#s2-1)
        * [2.1.a. NCBI BioSample](#s2-1-a)
        * [2.1.b. EBI BioSamples](#s2-1-b)
    * [2.2. Select samples](#s2-2)
        * [2.2.a. NCBI BioSample](#s2-2-a)
        * [2.2.b. EBI BioSamples](#s2-2-b)
    * [2.3. Generate CEDAR instances](#s2-3)
* [Step 3: Semantic annotation](#s3)
    * [3.1. Extraction of unique values from CEDAR instances](#s3-1)
    * [3.2. Annotation of unique values and generation of mappings](#s3-2)
    * [3.3. Annotation of CEDAR instances](#s3-3)
* [Step 4: Generation of experimental data sets](#s4)
* [Step 5: Training](#s5)
    * [5.1. Rules generated](#s5-results)
* [Step 6: Testing](#s6)
* [Step 7: Analysis of results](#s7)
* [Additional experiments](#additional-experiments)
    * [Additional experiment 1](#additional-experiment-1)
    * [Additional experiment 2](#additional-experiment-2)
* [Links](#links)
* [Contact](#contact)

# Data download
Download the data.zip file from https://drive.google.com/a/stanford.edu/file/d/1X8-K1DjRh4FAmRKuGed1XXKsA1iOSl5x/view?usp=sharing and uncompress it in the root folder of your repository.

## <a name="s1"></a>Step 1: Datasets download
### <a name="s1-a"></a>1.a. NCBI BioSample
We downloaded the full content of the [NCBI BioSample database](https://www.ncbi.nlm.nih.gov/biosample/) from the [NCBI BioSample FTP repository](https://ftp.ncbi.nih.gov/biosample/) as a .gz file, which you can find in the folder [data/samples/ncbi_samples/original](data/samples/ncbi_samples/original). This file contains metadata about 7.8M NCBI samples. To begin, copy the file to the workspace folder:

In [8]:
%%time
# Copy the .gz file with the NCBI samples used in the evaluation to your workspace folder
from shutil import copy
import os
import scripts.constants as c

source_file_path = c.NCBI_SAMPLES_ORIGINAL_FILE_PATH
dest_path = os.path.join(c.WORKSPACE_FOLDER, c.NCBI_SAMPLES_ORIGINAL_PATH)
print('Source file path: ' + source_file_path)
print('Destination path: ' + dest_path)
dest_file_name = c.NCBI_SAMPLES_FILE_DEST
# Create the destination folder if needed, then copy the file
os.makedirs(dest_path, exist_ok=True)
copy(source_file_path, os.path.join(dest_path, dest_file_name))

Source file path: data/samples/ncbi_samples/original/2018-03-09-biosample_set.xml.gz
Destination path: workspace/data/samples/ncbi_samples/original
CPU times: user 245 ms, sys: 902 ms, total: 1.15 s
Wall time: 1.4 s


Note that the NCBI samples file was downloaded on March 9, 2018. Alternatively, if you want to conduct the evaluation with the most recent NCBI samples, run the following cell:

In [None]:
# OPTIONAL: Download the most recent NCBI biosamples to the workspace
import os
import urllib.request
import scripts.util as util
import scripts.constants as c

url = c.NCBI_DOWNLOAD_URL
dest_path = os.path.join(c.WORKSPACE_FOLDER, c.NCBI_SAMPLES_ORIGINAL_PATH)
dest_file_name = c.NCBI_SAMPLES_FILE_DEST
print('Source URL: ' + url)
print('Destination path: ' + dest_path)
if not os.path.exists(dest_path):
    os.makedirs(dest_path)
urllib.request.urlretrieve(url, os.path.join(dest_path, dest_file_name), reporthook=util.log_progress)

### <a name="s1-b"></a>1.b. EBI BioSamples
We wrote a script ([ebi_biosamples_1_download_split.py](scripts/ebi_biosamples_1_download_split.py)) to download all samples metadata from the [EBI BioSamples database](https://www.ebi.ac.uk/biosamples/) using the [EBI BioSamples API](https://www.ebi.ac.uk/biosamples/help/api.html). We stored the results as a ZIP file [2018-03-09-ebi_samples.zip](data/samples/ebi_samples/original/2018-03-09-ebi_samples.zip) that contains 412 JSON files with metadata for 4.1M samples in total. Extract the file to the workspace:

In [2]:
import os
import zipfile
import scripts.constants as c

source_path = c.EBI_SAMPLES_ORIGINAL_FILE_PATH
dest_path = os.path.join(c.WORKSPACE_FOLDER, c.EBI_SAMPLES_ORIGINAL_PATH)
# Extract the ZIP file with the EBI samples to the workspace folder
with zipfile.ZipFile(source_path, 'r') as zip_obj:
    zip_obj.extractall(dest_path)

Note that these EBI samples were downloaded on March 9, 2018. If you want to run the evaluation with the most recent EBI samples, you can run [ebi_biosamples_1_download_split.py](scripts/ebi_biosamples_1_download_split.py) again:

In [None]:
# OPTIONAL: download all the EBI samples from the EBI's API
%run ./scripts/ebi_biosamples_1_download_split

## <a name="s2"></a>Step 2: Generation of template instances

### <a name="s2-1"></a>2.1. Determine relevant attributes and create CEDAR templates

#### <a name="s2-1-a"></a>2.1.a. NCBI BioSample

For NCBI BioSample, we created a CEDAR template with all the attributes defined by the [NCBI BioSample Human Package v1.0](https://submit.ncbi.nlm.nih.gov/biosample/template/?package=Human.1.0&action=definition), which are: *biosample_accession, sample_name, sample_title, bioproject_accession, organism, isolate, age, biomaterial_provider, sex, tissue, cell_line, cell_subtype, cell_type, culture_collection, dev_stage, disease, disease_stage, ethnicity, health_state, karyotype, phenotype, population, race, sample_type, treatment, description*.

#### <a name="s2-1-b"></a>2.1.b. EBI BioSamples

The EBI BioSamples API's output format defines some top-level attributes and makes it possible to add new attributes that describe sample characteristics:
```
{
    "accession": "...",
    "name": "...",
    "releaseDate": "...",
    "updateDate": "...",
    "characteristics": { // key-value pairs (e.g., organism, age, sex, organismPart, etc.)
    	...
    },
    "organization": "...",
    "contact": "..."
}
```

Based on this format, we defined a metadata template containing 14 fields with general metadata about biological samples and some additional fields that capture specific characteristics of human samples: *accession, name, releaseDate, updateDate, organization, contact, organism, age, sex, organismPart, cellLine, cellType, diseaseState, ethnicity*.

We focused our analysis on the subset of fields that meet two key requirements: (1) they are present in both templates and, therefore, can be used to evaluate cross-template recommendations; and (2) they contain categorical values, that is, they represent information about discrete characteristics. We selected 6 fields that met these criteria. These fields are: *sex, organism part, cell line, cell type, disease, and ethnicity*. The names used to refer to these fields in both CEDAR's NCBI BioSample template and CEDAR's EBI BioSamples template are shown in the following table:

|Characteristic|NCBI BioSample attribute name|EBI BioSamples attribute name|
|---|---|---|
|sex|sex|sex|
|organism part|tissue|organismPart|
|cell line|cell_line|cellLine|
|cell type|cell_type|cellType|
|disease|disease|diseaseState|
|ethnicity|ethnicity|ethnicity|
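
The correspondence in the table above can be captured in a small lookup structure. The following is only an illustrative sketch, not code from the evaluation scripts: the names `FIELD_NAMES` and `to_common_names` are hypothetical, and records are assumed to be flat dictionaries of attribute names and values.

```python
# Shared characteristic -> attribute name used in each CEDAR template
# (values taken from the table above).
FIELD_NAMES = {
    "sex": {"ncbi": "sex", "ebi": "sex"},
    "organism part": {"ncbi": "tissue", "ebi": "organismPart"},
    "cell line": {"ncbi": "cell_line", "ebi": "cellLine"},
    "cell type": {"ncbi": "cell_type", "ebi": "cellType"},
    "disease": {"ncbi": "disease", "ebi": "diseaseState"},
    "ethnicity": {"ncbi": "ethnicity", "ebi": "ethnicity"},
}


def to_common_names(record, source):
    """Rename the attributes of a flat record to the shared characteristic names.

    `record` is assumed to be a plain dict {attribute_name: value} and
    `source` is either "ncbi" or "ebi". Attributes outside the table are dropped.
    """
    return {
        characteristic: record[names[source]]
        for characteristic, names in FIELD_NAMES.items()
        if names[source] in record
    }


# Two made-up records that describe the same sample in each naming scheme
ncbi_record = {"sex": "female", "tissue": "liver", "sample_name": "S1"}
ebi_record = {"sex": "female", "organismPart": "liver"}
print(to_common_names(ncbi_record, "ncbi"))
print(to_common_names(ebi_record, "ebi"))
```

Normalizing both records to the same keys is what makes cross-template comparisons straightforward in this sketch.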

### <a name="s2-2"></a>2.2. Select samples

We filtered the samples based on two criteria:
* The sample is from "Homo sapiens" (organism=Homo sapiens).
* The sample has non-empty values for at least 3 of the 6 fields in the previous table.
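
As a rough illustration of the two criteria above (not the actual logic of the filter scripts), the selection can be sketched as a predicate over a sample. The sketch assumes that a sample is a flat dictionary keyed by the NCBI attribute names, and that the organism comparison ignores case and surrounding whitespace; `is_selected` and the constant names are hypothetical.

```python
# The six fields of interest, using the NCBI attribute names
RELEVANT_FIELDS = ("sex", "tissue", "cell_line", "cell_type", "disease", "ethnicity")
MIN_RELEVANT_FIELDS = 3


def is_selected(sample, relevant_fields=RELEVANT_FIELDS):
    """Return True if the sample is human and has at least 3 non-empty relevant fields."""
    organism = str(sample.get("organism") or "").strip().lower()
    if organism != "homo sapiens":
        return False
    # Count the relevant fields that have a non-empty value
    filled = sum(1 for name in relevant_fields if str(sample.get(name) or "").strip())
    return filled >= MIN_RELEVANT_FIELDS


# Made-up samples for illustration
human_ok = {"organism": "Homo sapiens", "sex": "male", "tissue": "lung", "disease": "asthma"}
human_sparse = {"organism": "Homo sapiens", "sex": "male", "tissue": "lung", "disease": ""}
mouse = {"organism": "Mus musculus", "sex": "male", "tissue": "lung", "disease": "asthma"}
print(is_selected(human_ok), is_selected(human_sparse), is_selected(mouse))
```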

#### <a name="s2-2-a"></a>2.2.a. NCBI BioSample

Script used: [ncbi_biosample_1_filter.py](scripts/ncbi_biosample_1_filter.py). 

In [1]:
# Filter the NCBI samples
%run ./scripts/ncbi_biosample_1_filter.py

Input file: ./workspace/data/samples/ncbi_samples/original/biosample_set.xml.gz
Processing NCBI samples...
Processed samples: 5000
Selected samples: 0
Processed samples: 10000
Selected samples: 2
Processed samples: 15000
Selected samples: 59
Processed samples: 20000
Selected samples: 67
Processed samples: 25000
Selected samples: 67
Processed samples: 30000
Selected samples: 67
Processed samples: 35000
Selected samples: 67
Processed samples: 40000
Selected samples: 67
Processed samples: 45000
Selected samples: 67


KeyboardInterrupt: 

The result is an XML file with 157,653 samples ([biosample_result_filtered.xml](data/samples/ncbi_samples/filtered/biosample_result_filtered.xml)). 

<font color='blue'>**Shortcut:**</font> copy the precomputed NCBI filtered samples to the workspace:

In [3]:
# Shortcut: reuse existing filtered NCBI samples 
import os
from shutil import copyfile
import scripts.arm_constants as c

src = c.NCBI_FILTER_OUTPUT_FILE_PRECOMPUTED
dst = c.NCBI_FILTER_OUTPUT_FILE
if not os.path.exists(os.path.dirname(dst)):
    os.makedirs(os.path.dirname(dst))
copyfile(src, dst)

'./workspace/data/samples/ncbi_samples/filtered/biosample_result_filtered.xml'

#### <a name="s2-2-b"></a>2.2.b. EBI BioSamples

In the case of the EBI samples, we used the script [ebi_biosamples_2_filter.py](scripts/ebi_biosamples_2_filter.py).

In [1]:
# Filter the EBI samples
%run ./scripts/ebi_biosamples_2_filter.py

Processing file: 00001 ebi_biosamples_1to10000.json
Accumulated selected samples: 1029
Processing file: 00002 ebi_biosamples_10001to20000.json
Accumulated selected samples: 1897
Processing file: 00003 ebi_biosamples_20001to30000.json
Accumulated selected samples: 2555
Processing file: 00004 ebi_biosamples_30001to40000.json
Accumulated selected samples: 3434
Processing file: 00005 ebi_biosamples_40001to50000.json
Accumulated selected samples: 4372
Processing file: 00006 ebi_biosamples_50001to60000.json
Accumulated selected samples: 5147
Processing file: 00007 ebi_biosamples_60001to70000.json
Accumulated selected samples: 5611
Processing file: 00008 ebi_biosamples_70001to80000.json
Accumulated selected samples: 5810
Processing file: 00009 ebi_biosamples_80001to90000.json
Accumulated selected samples: 6336
Processing file: 00010 ebi_biosamples_90001to100000.json
Accumulated selected samples: 6567
Processing file: 00011 ebi_biosamples_100001to110000.json
Accumulated selected samples: 6746


KeyboardInterrupt: 

Results: 14 JSON files with a total of 135,187 samples, which are available [in this folder](data/samples/ebi_samples/filtered/). 

<font color='blue'>**Shortcut:**</font> copy the precomputed EBI filtered samples to the workspace:

In [4]:
# Shortcut: reuse existing filtered EBI samples 
import os
from shutil import copyfile
import scripts.arm_constants as c

src = c.EBI_FILTER_OUTPUT_FOLDER_PRECOMPUTED
dst = c.EBI_FILTER_OUTPUT_FOLDER
if not os.path.exists(dst):
    os.makedirs(dst)

for file_name in os.listdir(src):
    print(file_name)
    print(os.path.join(src, file_name))
    print(os.path.join(dst, file_name))
    copyfile(os.path.join(src, file_name), os.path.join(dst, file_name))

ebi_biosamples_filtered_3_20000to29999.json
./data/samples/ebi_samples/filtered/ebi_biosamples_filtered_3_20000to29999.json
./workspace/data/samples/ebi_samples/filtered/ebi_biosamples_filtered_3_20000to29999.json
ebi_biosamples_filtered_1_0to9999.json
./data/samples/ebi_samples/filtered/ebi_biosamples_filtered_1_0to9999.json
./workspace/data/samples/ebi_samples/filtered/ebi_biosamples_filtered_1_0to9999.json
ebi_biosamples_filtered_2_10000to19999.json
./data/samples/ebi_samples/filtered/ebi_biosamples_filtered_2_10000to19999.json
./workspace/data/samples/ebi_samples/filtered/ebi_biosamples_filtered_2_10000to19999.json
ebi_biosamples_filtered_4_30000to39999.json
./data/samples/ebi_samples/filtered/ebi_biosamples_filtered_4_30000to39999.json
./workspace/data/samples/ebi_samples/filtered/ebi_biosamples_filtered_4_30000to39999.json
ebi_biosamples_filtered_10_90000to99999.json
./data/samples/ebi_samples/filtered/ebi_biosamples_filtered_10_90000to99999.json
./workspace/data/samples/ebi_samp

### <a name="s2-3"></a>2.3. Generate CEDAR instances

We transformed the NCBI and EBI samples obtained from the previous step to CEDAR template instances conforming to [CEDAR's JSON-based Template Model](https://metadatacenter.org/tools-training/outreach/cedar-template-model).

For NCBI samples, we used the script [ncbi_biosample_2_to_cedar_instances.py](scripts/ncbi_biosample_2_to_cedar_instances.py):

In [1]:
%%time
# Generate CEDAR instances from NCBI samples
%run ./scripts/ncbi_biosample_2_to_cedar_instances.py

Reading file: ./workspace/data/samples/ncbi_samples/filtered/biosample_result_filtered.xml
Extracting all samples from file (no. samples: 157653)
Randomly picking 135187 samples
Generating CEDAR instances...
No. instances generated: 10000(7%)
No. instances generated: 20000(15%)
No. instances generated: 30000(22%)
No. instances generated: 40000(30%)
No. instances generated: 50000(37%)
No. instances generated: 60000(44%)
No. instances generated: 70000(52%)
No. instances generated: 80000(59%)
No. instances generated: 90000(67%)
No. instances generated: 100000(74%)
No. instances generated: 110000(81%)
No. instances generated: 120000(89%)
No. instances generated: 130000(96%)
Finished
CPU times: user 2min 30s, sys: 44.2 s, total: 3min 14s
Wall time: 4min 7s


CEDAR's NCBI instances will be saved to [workspace/data/cedar_instances/ncbi_cedar_instances](workspace/data/cedar_instances/ncbi_cedar_instances).

For EBI samples, we used the script [ebi_biosamples_3_to_cedar_instances.py](scripts/ebi_biosamples_3_to_cedar_instances.py):

In [2]:
%%time
# Generate CEDAR instances from EBI samples
%run ./scripts/ebi_biosamples_3_to_cedar_instances.py

Reading EBI biosamples from folder: ./workspace/data/samples/ebi_samples/filtered
Total no. samples: 135187
Generating CEDAR instances...
No. instances generated: 10000(7%)
No. instances generated: 20000(15%)
No. instances generated: 30000(22%)
No. instances generated: 40000(30%)
No. instances generated: 50000(37%)
No. instances generated: 60000(44%)
No. instances generated: 70000(52%)
No. instances generated: 80000(59%)
No. instances generated: 90000(67%)
No. instances generated: 100000(74%)
No. instances generated: 110000(81%)
No. instances generated: 120000(89%)
No. instances generated: 130000(96%)
Finished
CPU times: user 1min 43s, sys: 46.4 s, total: 2min 30s
Wall time: 3min 28s


CEDAR's EBI instances will be saved to [workspace/data/cedar_instances/ebi_cedar_instances](workspace/data/cedar_instances/ebi_cedar_instances).

All the CEDAR instances used to evaluate the system are available at [data/cedar_instances](data/cedar_instances).

## <a name="s3"></a>Step 3: Semantic annotation

We used the [NCBO Annotator](https://bioportal.bioontology.org/annotator) via the [NCBO BioPortal API](http://data.bioontology.org/documentation) to automatically annotate a total of 270,374 template instances (135,187 instances for each template).

### <a name="s3-1"></a>3.1. Extraction of unique values from CEDAR instances

To avoid making multiple calls to the NCBO Annotator API for the same terms, we first extracted all the unique values in the CEDAR instances.
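
The idea can be sketched as follows. This is a simplified illustration rather than the code of the extraction script: it assumes that each instance is a dictionary in which a field stores its plain-text value under the `@value` key, and the function name `extract_unique_values` is hypothetical.

```python
def extract_unique_values(instances, field_names):
    """Collect the distinct, non-empty text values found in the given fields."""
    unique_values = set()
    for instance in instances:
        for name in field_names:
            field = instance.get(name)
            # Assumption: plain-text values are stored as {"@value": "..."}
            value = field.get("@value") if isinstance(field, dict) else None
            if isinstance(value, str) and value.strip():
                unique_values.add(value.strip())
    return unique_values


# Made-up instances: "liver" appears twice but is kept once
instances = [
    {"tissue": {"@value": "liver"}, "sex": {"@value": "female"}},
    {"tissue": {"@value": "liver "}, "sex": {"@value": None}},
]
print(sorted(extract_unique_values(instances, ["tissue", "sex"])))
```

With this approach, the number of Annotator calls depends on the number of distinct values rather than on the number of instances.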

In [3]:
%%time
# Extract unique values from NCBI and EBI instances
%run ./scripts/cedar_annotator/1_unique_values_extractor.py

Extracting unique values from CEDAR instances...
No. instances processed: 10000
No. instances processed: 20000
No. instances processed: 30000
No. instances processed: 40000
No. instances processed: 50000
No. instances processed: 60000
No. instances processed: 70000
No. instances processed: 80000
No. instances processed: 90000
No. instances processed: 100000
No. instances processed: 110000
No. instances processed: 120000
No. instances processed: 130000
No. instances processed: 140000
No. instances processed: 150000
No. instances processed: 160000
No. instances processed: 170000
No. instances processed: 180000
No. instances processed: 190000
No. instances processed: 200000
No. instances processed: 210000
No. instances processed: 220000
No. instances processed: 230000
No. instances processed: 240000
No. instances processed: 250000
No. instances processed: 260000
No. instances processed: 270000
No. unique values extracted: 26556


We processed 270,374 instances and obtained 26,556 unique values (see [unique_values.txt](workspace/data/cedar_instances_annotated/unique_values/unique_values.txt)).


### <a name="s3-2"></a>3.2. Annotation of unique values and generation of mappings

We invoked the NCBO Annotator for the unique values obtained from the previous step. Additionally, we took advantage of the output provided by the Annotator API to extract all the different term URIs that map to each term in BioPortal and store all these equivalences into a mappings file. 
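
A minimal sketch of this step is shown next. It is not the code of the annotation script: it only builds the request URL for the Annotator endpoint and collects the term URIs from an already parsed JSON response. It assumes that each annotation in the response exposes the matched class under `annotatedClass` with its URI in `@id`; the function names and the sample response (including its `example.org` URIs) are made up for illustration.

```python
from urllib.parse import urlencode

ANNOTATOR_ENDPOINT = "http://data.bioontology.org/annotator"


def build_annotator_url(text, api_key):
    """Build the GET URL used to annotate a single value."""
    return ANNOTATOR_ENDPOINT + "?" + urlencode({"text": text, "apikey": api_key})


def term_uris(annotator_response):
    """Return the distinct term URIs found in a parsed Annotator response."""
    uris = []
    for annotation in annotator_response:
        # Assumption: the matched class URI is stored under annotatedClass/@id
        uri = annotation.get("annotatedClass", {}).get("@id")
        if uri and uri not in uris:
            uris.append(uri)
    return uris


# Made-up response with hypothetical URIs, shaped like the assumed API output
sample_response = [
    {"annotatedClass": {"@id": "http://example.org/ontoA/liver"}},
    {"annotatedClass": {"@id": "http://example.org/ontoB/liver"}},
    {"annotatedClass": {"@id": "http://example.org/ontoA/liver"}},
]
print(build_annotator_url("liver cancer", "YOUR_API_KEY"))
print(term_uris(sample_response))
```

In this sketch, the URIs returned for the same value would be the candidates to record as equivalent in the mappings file.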

Script used: [2_unique_values_annotator.py](scripts/cedar_annotator/2_unique_values_annotator.py)

Note that when running the following cell, you will be asked to enter your BioPortal API key. If you don't have one, follow [these instructions](https://bioportal.bioontology.org/help#Getting_an_API_key).

In [None]:
%%time
# Enter your BioPortal API key
bp_api_key = input('Please, enter your BioPortal API key and press Enter: ')
# Annotate unique values and generate mappings file
%run ./scripts/cedar_annotator/2_unique_values_annotator.py --bioportal-api-key $bp_api_key

<font color='blue'>**Shortcut:**</font> if you don't have access to the NCBO Annotator or you don't want to wait for the annotation process to finish, copy the files with the annotated values to your workspace:

In [1]:
# Shortcut: reuse previously generated annotations for the unique values
import os
from shutil import copyfile
import scripts.cedar_annotator.annotation_constants as c

def my_copy(src, dst):
    if not os.path.exists(os.path.dirname(dst)):
        os.makedirs(os.path.dirname(dst))
    copyfile(src, dst)
    print (src + ' copied to ' + dst)

src1 = c.VALUES_ANNOTATION_OUTPUT_FILE_PATH_1_PRECOMPUTED
dst1 = c.VALUES_ANNOTATION_OUTPUT_FILE_PATH_1
src2 = c.VALUES_ANNOTATION_OUTPUT_FILE_PATH_2_PRECOMPUTED
dst2 = c.VALUES_ANNOTATION_OUTPUT_FILE_PATH_2

my_copy(src1, dst1)
my_copy(src2, dst2)

./data/cedar_instances_annotated/unique_values/unique_values_annotated_1.json copied to ./workspace/data/cedar_instances_annotated/unique_values/unique_values_annotated_1.json
./data/cedar_instances_annotated/unique_values/unique_values_annotated_2.json copied to ./workspace/data/cedar_instances_annotated/unique_values/unique_values_annotated_2.json


### <a name="s3-3"></a>3.3. Annotation of CEDAR instances

This process uses the annotations generated in the previous step to annotate the values of the CEDAR instances without making any additional calls to the BioPortal API. The resulting instances are saved to [workspace/data/cedar_instances_annotated](workspace/data/cedar_instances_annotated).
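
Conceptually, this is a dictionary lookup per field value. The sketch below is an assumption-laden illustration, not the annotator script itself: it assumes plain-text fields of the form `{"@value": ...}`, a precomputed dictionary from values to term URIs, and annotated fields represented with `@id` and `rdfs:label`. The function name and the `example.org` URI are hypothetical.

```python
def annotate_instance(instance, annotations):
    """Replace text values with term URIs using precomputed annotations.

    Returns the annotated instance and the number of values left without annotation.
    """
    annotated, not_annotated = {}, 0
    for name, field in instance.items():
        value = field.get("@value") if isinstance(field, dict) else None
        uri = annotations.get(value) if isinstance(value, str) else None
        if uri:
            # Assumed representation of an ontology term value
            annotated[name] = {"@id": uri, "rdfs:label": value}
        else:
            annotated[name] = field  # keep the original field untouched
            if value:
                not_annotated += 1
    return annotated, not_annotated


# Made-up annotations and instance
annotations = {"liver": "http://example.org/onto/liver"}
instance = {"tissue": {"@value": "liver"}, "disease": {"@value": "unknown condition"}}
result, missing = annotate_instance(instance, annotations)
print(result)
print(missing)
```

Counting the values without an annotation mirrors the "non annotated values" figures reported in the output of the cell below.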

Script: [3_cedar_instances_annotator.py](scripts/cedar_annotator/3_cedar_instances_annotator.py)

In [1]:
%%time
# Generate annotated CEDAR instances
%run ./scripts/cedar_annotator/3_cedar_instances_annotator.py

Processing instances folder: ./workspace/data/cedar_instances/ncbi_cedar_instances/training
No. annotated instances: 10000
No. annotated instances: 20000
No. annotated instances: 30000
No. annotated instances: 40000
No. annotated instances: 50000
No. annotated instances: 60000
No. annotated instances: 70000
No. annotated instances: 80000
No. annotated instances: 90000
No. annotated instances: 100000
No. annotated instances: 110000

No. total values: 379789
No. non annotated values: 55518 (15%)
Processing instances folder: ./workspace/data/cedar_instances/ncbi_cedar_instances/testing
No. annotated instances: 120000
No. annotated instances: 130000

No. total values: 446822
No. non annotated values: 65348 (15%)
Processing instances folder: ./workspace/data/cedar_instances/ebi_cedar_instances/training
No. annotated instances: 140000
No. annotated instances: 150000
No. annotated instances: 160000
No. annotated instances: 170000
No. annotated instances: 180000
No. annotated instances: 190000

All the CEDAR instances used to evaluate the system (both in plain text and annotated) are available at [data/cedar_instances](data/cedar_instances).

## <a name="s4"></a>Step 4: Generation of experimental data sets

When we generated the CEDAR instances (step 2.3) and the annotated CEDAR instances (step 3.3), we partitioned the resulting instances for each database (NCBI, EBI) into two datasets, with 85% of the data for training and the remaining 15% for testing.
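
An 85/15 partition of this kind can be sketched as follows. This is only an illustration: whether and how the original scripts shuffle the instances before splitting is an assumption here, and `split_instances` is a hypothetical name.

```python
import random


def split_instances(instances, training_fraction=0.85, seed=0):
    """Shuffle the instances and split them into training and testing sets."""
    shuffled = list(instances)
    # A fixed seed makes the partition reproducible (assumption, for illustration)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * training_fraction)
    return shuffled[:cut], shuffled[cut:]


# Example with 100 made-up instance identifiers
training, testing = split_instances(range(100))
print(len(training), len(testing))
```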

## <a name="s5"></a>Step 5: Training

We mined association rules from the training sets to discover the hidden relationships between metadata fields. We extracted the rules using a local installation of the CEDAR Workbench. We set up the Value Recommender service to read the instance files from a local folder by updating [its constants file](https://github.com/metadatacenter/cedar-valuerecommender-server/blob/master/cedar-valuerecommender-server-core/src/main/java/org/metadatacenter/intelligentauthoring/valuerecommender/util/Constants.java) as follows:

```Java
READ_INSTANCES_FROM_CEDAR = false; // Read training instances from a local folder
```

```Java
// Apriori configuration:
public static final int APRIORI_MAX_NUM_RULES = 1000000;
public static int MIN_SUPPORTING_INSTANCES = 5;
public static final double MIN_CONFIDENCE = 0.3;
public static final double MIN_LIFT = 1.2;
public static final double MIN_LEVERAGE = 1.1;
public static final double MIN_CONVICTION = 1.1;
// Metric types: 0 = Confidence | 1 = Lift | 2 = Leverage | 3 = Conviction
public static final int METRIC_TYPE_ID = 0; 
public static final String SUPPORT_METRIC_NAME = "Support";
public static final String CONFIDENCE_METRIC_NAME = "Confidence";
public static final String LIFT_METRIC_NAME = "Lift";
public static final String LEVERAGE_METRIC_NAME = "Leverage";
public static final String CONVICTION_METRIC_NAME = "Conviction";
public static final boolean VERBOSE_MODE = true;
```

You will have to run the rule extraction process four times, once for each training set. Before each execution, update the variable `CEDAR_INSTANCES_PATH` with the full path of the corresponding training set:
* Text-based values:
    * To extract the NCBI rules: `.../workspace/data/cedar_instances/ncbi_cedar_instances/training`
    * To extract the EBI rules: `.../workspace/data/cedar_instances/ebi_cedar_instances/training`
* Ontology-based values:
    * To extract the NCBI rules: `.../workspace/data/cedar_instances_annotated/ncbi_cedar_instances/training`
    * To extract the EBI rules: `.../workspace/data/cedar_instances_annotated/ebi_cedar_instances/training`

Internally, CEDAR's Value Recommender uses [WEKA's implementation of the Apriori algorithm](https://www.cs.waikato.ac.nz/ml/weka/) with a minimum support of 5 instances and a minimum confidence of 0.3. The final set of rules was indexed using Elasticsearch.
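
To make the thresholds concrete, the following sketch computes the standard association rule metrics (support, confidence, lift, leverage, and conviction) for a rule `antecedent -> consequent` from instance counts. It uses the textbook definitions and is not taken from the WEKA or CEDAR code; the function names and the counts in the example are made up.

```python
def rule_metrics(n_total, n_antecedent, n_consequent, n_both):
    """Compute standard metrics for a rule from instance counts."""
    support_a = n_antecedent / n_total
    support_c = n_consequent / n_total
    support = n_both / n_total
    confidence = n_both / n_antecedent
    lift = confidence / support_c
    leverage = support - support_a * support_c
    # Conviction is infinite for rules that always hold
    conviction = float("inf") if confidence == 1 else (1 - support_c) / (1 - confidence)
    return {
        "support": support,
        "confidence": confidence,
        "lift": lift,
        "leverage": leverage,
        "conviction": conviction,
    }


def passes_thresholds(n_both, confidence, min_supporting_instances=5, min_confidence=0.3):
    """Apply the minimum support (in instances) and minimum confidence."""
    return n_both >= min_supporting_instances and confidence >= min_confidence


# Made-up counts: 1000 instances, 50 match the antecedent,
# 100 match the consequent, and 20 match both
metrics = rule_metrics(n_total=1000, n_antecedent=50, n_consequent=100, n_both=20)
print(metrics)
print(passes_thresholds(20, metrics["confidence"]))
```

In this example the confidence is 20/50 = 0.4 and the rule is supported by 20 instances, so it would pass both thresholds.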

Update those constants, compile the `cedar-valuerecommender-server` project and start it locally. You can trigger the rule generation process from the command line using the following curl command:
```
curl --request POST \
  --url https://valuerecommender.metadatacenter.orgx/command/generate-rules/<TEMPLATE_ID> \
  --header 'authorization: apiKey <CEDAR_ADMIN_API_KEY>' \
  --header 'content-type: application/json' \
  --data '{}'
```

where `CEDAR_ADMIN_API_KEY` is the API key of the *cedar-admin* user in your local CEDAR system, and `TEMPLATE_ID` is the local identifier of the template that you want to extract rules for, that is, either the identifier of the NCBI BioSample template or the EBI BioSamples template.
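
If you prefer to trigger the process from Python, the curl command above can be approximated with the standard library. This is a hedged sketch: the helper name `build_generate_rules_request` is hypothetical, the placeholder values must be replaced with your own, and the template identifier is inserted into the URL as-is, which may need adjusting (for example, URL encoding) depending on your setup.

```python
import urllib.request


def build_generate_rules_request(base_url, template_id, admin_api_key):
    """Build a POST request equivalent to the curl command shown above."""
    url = f"{base_url}/command/generate-rules/{template_id}"
    return urllib.request.Request(
        url,
        data=b"{}",  # empty JSON body, as in the curl command
        method="POST",
        headers={
            "Authorization": f"apiKey {admin_api_key}",
            "Content-Type": "application/json",
        },
    )


# Placeholder values; replace them with your template identifier and API key
request = build_generate_rules_request(
    "https://valuerecommender.metadatacenter.orgx",
    "TEMPLATE_ID",
    "CEDAR_ADMIN_API_KEY",
)
print(request.get_method(), request.full_url)
# To actually send it: urllib.request.urlopen(request)
```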

### <a name="s5-results"></a>5.1. Rules generated

The following table shows the number of rules produced for each training set and type of metadata. It also provides a link to a .zip file with the produced rules. These files are also available in the [data/rules/](data/rules/) folder.

| Training set DB | Type of metadata | No. rules generated | No. rules after filtering | File name       |
|-----------------|------------------|---------------------|---------------------------|-----------------|
| NCBI            | Text-based       | 52,192              | 30,295                    | [ncbi-text-rules.zip](data/rules/ncbi-text-rules.zip)   |
| EBI             | Text-based       | 36,915              | 24,983                    | [ebi-text-rules.zip](data/rules/ebi-text-rules.zip)    |
| NCBI            | Ontology-based   | 18,223              | 12,400                    | [ncbi-ont-rules.zip](data/rules/ncbi-ont-rules.zip)    |
| EBI             | Ontology-based   | 16,838              | 11,932                    | [ebi-ont-rules.zip](data/rules/ebi-ont-rules.zip)     |

We extracted the rules from Elasticsearch using [elasticdump](https://www.npmjs.com/package/elasticdump). 

Commands used to export the rules and mappings from Elasticsearch to JSON format:

- elasticdump --input=http://localhost:9200/cedar-rules --output=./ncbi-text-mappings.json --type=mapping
- elasticdump --input=http://localhost:9200/cedar-rules --output=./ncbi-text-data.json --type=data

Commands used to import the rules and mappings into Elasticsearch:

- elasticdump --input=./ncbi-text-mappings.json --output=http://localhost:9200/cedar-rules --type=mapping
- elasticdump --input=./ncbi-text-data.json --output=http://localhost:9200/cedar-rules --type=data

#### Some useful Elasticsearch operations

Create an empty rules index in Elasticsearch: run the console command `cedarat rules-regenerateIndex`.

Create a backup of the generated rules in ES:
```json
POST _reindex
{
  "source": {
    "index": "cedar-rules"
  },
  "dest": {
    "index": "cedar-rules-backup"
  }
}
```

Restore the index backup:
```json
POST _reindex
{
  "source": {
    "index": "cedar-rules-backup"
  },
  "dest": {
    "index": "cedar-rules"
  }
}
```

## <a name="s6"></a>Step 6: Testing

In this step, we used the rules generated in the previous step to evaluate the performance of CEDAR's Value Recommender when predicting values from the test sets.

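Before running the evaluation:

* Extract the most common values, which are used for the baseline recommendations.
* Update the following settings to select the training database, the testing database, and the type of values used, for example:
    * `EVALUATION_TRAINING_DB = BIOSAMPLES_DB.NCBI`
    * `EVALUATION_TESTING_DB = BIOSAMPLES_DB.NCBI`
    * `EVALUATION_USE_ANNOTATED_VALUES = True`
* Update the template identifiers in the constants file:
    * `EVALUATION_NCBI_TEMPLATE_ID`
    * `EVALUATION_EBI_TEMPLATE_ID`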
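
A baseline based on the most common values can be sketched as follows. This is an illustration under assumptions, not the code of the evaluation scripts: instances are treated as flat dictionaries, the baseline always recommends the most frequent values of a field in the training set, and `most_common_values` is a hypothetical name.

```python
from collections import Counter


def most_common_values(instances, field_name, top_n=5):
    """Return the top_n most frequent non-empty values of a field."""
    counts = Counter(
        instance[field_name] for instance in instances if instance.get(field_name)
    )
    return [value for value, _ in counts.most_common(top_n)]


# Made-up training instances
training_instances = [
    {"sex": "female"},
    {"sex": "female"},
    {"sex": "male"},
    {"sex": ""},
]
print(most_common_values(training_instances, "sex"))
```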



In [None]:
%%time
# Enter your CEDAR API key
cedar_api_key = input('Please, enter your CEDAR API key and press Enter: ')
# Run evaluation
%run ./scripts/arm_evaluation_main.py --cedar-api-key $cedar_api_key

## <a name="s7"></a>Step 7: Analysis of results




Links to the CSV files: [[text-based]](data/results/text) [[ontology-based]](data/results/annotated)


Link to the R script
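
The plots in this section report the Mean Reciprocal Rank (MRR). As a reminder of how this metric is commonly computed, here is a minimal sketch using the standard definition. It is not the R script or the evaluation code; the function names and the example cases are made up, and values that are not found among the top recommendations are assumed to contribute 0.

```python
def reciprocal_rank(recommended, expected, top_n=5):
    """Return 1/rank of the expected value within the top_n recommendations, or 0."""
    for rank, value in enumerate(recommended[:top_n], start=1):
        if value == expected:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(cases, top_n=5):
    """Average the reciprocal rank over (recommended_values, expected_value) pairs."""
    if not cases:
        return 0.0
    return sum(reciprocal_rank(rec, exp, top_n) for rec, exp in cases) / len(cases)


# Made-up cases: expected value ranked first, ranked second, and not recommended
cases = [
    (["liver", "kidney"], "liver"),
    (["blood", "skin"], "skin"),
    (["lung"], "brain"),
]
print(mean_reciprocal_rank(cases))  # (1 + 0.5 + 0) / 3 = 0.5
```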

<img src="data/results/plot_MRR_2019_02_17_14_30_54.png" alt="Mean Reciprocal Rank" style="width: 800px;"/>

<img src="data/results/plot_MRR_per_field_2019_02_17_14_31_01.png" alt="Mean Reciprocal Rank per field" style="width: 800px;"/>

## <a name="additional-experiments"></a>Additional experiments:

### <a name="additional-experiment-1"></a>Additional experiment 1: Confidence vs Lift

We used the rules generated for the NCBI text-based training set and compared the following criteria for ranking the rules:

* confidence
* confidence + support
* lift
* confidence + lift

| 1st criterion | 2nd criterion | MRR (top 5) |
|---------------|---------------|-------------|
| confidence    | support       | 0.54        |
| confidence    | lift          | 0.53        |
| lift          | confidence    | 0.32        |
| lift          | support       | 0.32        |
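
Ranking with a first and a second criterion, as in the table above, amounts to sorting by the first metric and breaking ties with the second. The sketch below illustrates this with made-up rules; it is not the ranking code of the Value Recommender, and `rank_rules` is a hypothetical name.

```python
def rank_rules(rules, first_criterion, second_criterion):
    """Sort rules by the first criterion (descending), breaking ties with the second."""
    return sorted(
        rules,
        key=lambda rule: (rule[first_criterion], rule[second_criterion]),
        reverse=True,
    )


# Made-up rules with their metrics
rules = [
    {"id": "r1", "confidence": 0.8, "support": 0.01, "lift": 2.0},
    {"id": "r2", "confidence": 0.8, "support": 0.05, "lift": 1.5},
    {"id": "r3", "confidence": 0.6, "support": 0.10, "lift": 4.0},
]
print([rule["id"] for rule in rank_rules(rules, "confidence", "support")])
print([rule["id"] for rule in rank_rules(rules, "lift", "confidence")])
```

In the first call, `r1` and `r2` tie on confidence, so the higher support of `r2` places it first.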


### <a name="additional-experiment-2"></a>Additional experiment 2: All fields

Settings:

* NCBI template with all the fields (26)
* Min. confidence: 0.3
* Min. support: 10 instances
* Training set size: 5000 instances
* Testing set size: 500 instances


| No. fields | No. rules generated | No. rules after filtering | Rules generation time (s)  | Mean recommendation time (ms)  | MRR   |
|------------|---------------------|---------------------------|----------------------------|--------------------------------|-------|
| 6          | 775                 | 572                       | 5.10                       | 43.64                          | 0.318 |
| 26 (all)   | 233,363             | 30,559                    | 40.81                      | 44.79                          | 0.315 |



## <a name="links"></a>Links

* [CEDAR's Value Recommender documentation](https://github.com/metadatacenter/cedar-docs/wiki/CEDAR-Value-Recommender)
* [BioSample demo template](https://cedar.metadatacenter.org/instances/create/https://repo.metadatacenter.org/templates/6d9f4a83-a7ba-42be-a6af-f3cad7b2f7e3?folderId=https:%2F%2Frepo.metadatacenter.org%2Ffolders%2Fdc2ee55c-b891-4576-ba06-bfa3cf11143d)
* [Sets of rules generated during the evaluation](#s5-results)
* Evaluation results (.csv files) [[text-based]](data/results/text) [[ontology-based]](data/results/annotated)
* [CEDAR User Guide](https://metadatacenter.github.io/cedar-manual/) _(in progress)_

## <a name="contact"></a>Contact
For any questions about this notebook or about CEDAR's Value Recommender, please contact Marcos Martínez-Romero (marcosmr@stanford.edu).