Files:
- Dropbox Folder: Dropbox/01.INVESTIGACION/[2019]/2019-09-17/AUTOMATED METADATA CURATION
- Tool: PyCharm
- GitHub repo: cedar-experiments-curation
- Workspace: workspace folder in the current execution folder
- ARM Jupyter notebook: https://nbviewer.jupyter.org/github/metadatacenter/cedar-experiments-valuerecommender2019/blob/master/ValueRecommenderEvaluation.ipynb 


# _Automatic metadata curation_

[Stanford Center for Biomedical Informatics Research](https://bmir.stanford.edu/), 1265 Welch Road, Stanford University School of Medicine, Stanford, CA 94305-5479, USA

\* Correspondence: marcosmr@stanford.edu

## Purpose of this document

This document is a [Jupyter notebook](http://jupyter.org/) that describes ...

The scripts used to generate the results and figures in the paper are in the [scripts folder](./scripts). The results generated when running the code cells in this notebook will be saved to a local `workspace` folder.


## Table of contents
TODO

## <a name="s0"></a>Viewing and running this notebook

GitHub automatically generates a static online view of this notebook. However, GitHub's current rendering does not support some features, such as the anchor links that connect the 'Table of contents' to the different sections. A more reliable way to view the notebook online is [nbviewer](https://nbviewer.jupyter.org/), the official viewer of the Jupyter Notebook project. [Click here](https://nbviewer.jupyter.org/github/metadatacenter/cedar-experiments-curation/blob/master/AAA.ipynb) to open our notebook in nbviewer.

The interactive features of our notebook work neither on GitHub nor on nbviewer. For a fully interactive version, set up a Jupyter Notebook server locally and start it from the folder where you cloned the repository. For more information, see [Jupyter's official documentation](https://jupyter.org/install.html). Once your local Jupyter Notebook server is running, go to [http://localhost:8888/](http://localhost:8888/) and click on `AAA.ipynb` to open our notebook. You can also run the notebook on [Binder](https://mybinder.org/) by clicking [here](https://mybinder.org/v2/gh/metadatacenter/AAA).

## <a name="s1"></a>Step 1: Dataset download
On Sep 29, 2019, we downloaded the full content of the [NCBI BioSample database](https://www.ncbi.nlm.nih.gov/biosample/) from the [NCBI BioSample FTP repository](https://ftp.ncbi.nih.gov/biosample/) as a .gz file.


In [None]:
%%time
# Download the most recent NCBI biosamples to the workspace
import os
import urllib.request
import scripts.constants as c
import scripts.util as util

print('Source URL: ' + c.NCBI_DOWNLOAD_URL)
# Create the destination folder if it does not exist yet
os.makedirs(c.NCBI_SAMPLES_FOLDER_DEST, exist_ok=True)
dest_path = os.path.join(c.NCBI_SAMPLES_FOLDER_DEST, c.NCBI_SAMPLES_FILE_DEST)
print('Destination file: ' + dest_path)
# Download when the file is missing, or when the user agrees to overwrite it
if not os.path.exists(dest_path) or util.confirm("The destination file already exists. Do you want to overwrite it [y/n]?"):
    urllib.request.urlretrieve(c.NCBI_DOWNLOAD_URL, dest_path, reporthook=util.log_progress)

Source URL: https://ftp.ncbi.nih.gov/biosample/biosample_set.xml.gz
Destination file: workspace/samples/source/biosample_set.xml.gz
The destination file already exists. Do you want to overwrite it [y/n]?y
...19%, 179 MB, 12269 KB/s, 14 seconds passed
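
`util.log_progress` lives in the scripts folder and is not shown in this notebook. As a rough illustration of how a `reporthook` for `urllib.request.urlretrieve` can produce a line like the one above, here is a minimal sketch. `format_progress` and `log_progress_sketch` are hypothetical names, and the actual helper may be implemented differently.

```python
import sys
import time

_start_time = None  # set when the first block is reported


def format_progress(downloaded_bytes, total_bytes, elapsed_seconds):
    """Build a one-line progress message from raw counters."""
    # total_bytes can be <= 0 when the server does not report a file size
    percent = min(100, downloaded_bytes * 100 // total_bytes) if total_bytes > 0 else 0
    megabytes = downloaded_bytes // (1024 * 1024)
    speed_kb = int(downloaded_bytes / (1024 * elapsed_seconds)) if elapsed_seconds > 0 else 0
    return f"...{percent}%, {megabytes} MB, {speed_kb} KB/s, {int(elapsed_seconds)} seconds passed"


def log_progress_sketch(block_num, block_size, total_size):
    """Hypothetical reporthook: urlretrieve passes block count, block size, total size."""
    global _start_time
    if block_num == 0 or _start_time is None:
        _start_time = time.time()  # first call: start the clock, nothing to report yet
        return
    elapsed = time.time() - _start_time
    # Overwrite the same console line on every update
    sys.stdout.write("\r" + format_progress(block_num * block_size, total_size, elapsed))
    sys.stdout.flush()
```

Under these assumptions, passing `reporthook=log_progress_sketch` to `urlretrieve` would refresh a single status line as blocks arrive.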

## <a name="s2"></a>Step 2: Generation of template instances

### <a name="s2-1"></a>2.1. Determine relevant attributes and create CEDAR templates

We created a CEDAR template with all the attributes defined by the [NCBI BioSample Human Package v1.0](https://submit.ncbi.nlm.nih.gov/biosample/template/?package=Human.1.0&action=definition), which are: *biosample_accession, sample_name, sample_title, bioproject_accession, organism, isolate, age, biomaterial_provider, sex, tissue, cell_line, cell_subtype, cell_type, culture_collection, dev_stage, disease, disease_stage, ethnicity, health_state, karyotype, phenotype, population, race, sample_type, treatment, description*. The template is available in your workspace at [data/cedar_template/ncbi_template.json](data/cedar_template/ncbi_template.json).

We focused our analysis on the subset of fields that meet two key requirements: (1) they usually contain values; and (2) they contain categorical values, that is, they represent information about discrete characteristics. Six fields met these criteria: *sex, tissue, cell_line, cell_type, disease,* and *ethnicity*.
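
The notebook does not state how "usually contain values" was measured. One way to approximate that first requirement is to compute, for each field, the fraction of samples with a non-blank value. The sketch below assumes each sample is a plain dictionary keyed by attribute name; `fill_rates` is a hypothetical helper, not part of the project's scripts.

```python
from collections import Counter


def fill_rates(samples, fields):
    """Return, for each field, the fraction of samples with a non-blank value."""
    counts = Counter()
    total = 0
    for sample in samples:
        total += 1
        for field in fields:
            # Treat missing keys, None, and whitespace-only strings as empty
            if str(sample.get(field) or "").strip():
                counts[field] += 1
    return {field: (counts[field] / total if total else 0.0) for field in fields}


# Toy data for illustration only
demo_samples = [
    {"sex": "male", "age": ""},
    {"sex": "female", "age": "34"},
]
demo_rates = fill_rates(demo_samples, ["sex", "age", "race"])
```

Fields with a high fill rate would then be candidates for the second, manual check of whether their values are categorical.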

### <a name="s2-2"></a>2.2. Select samples

We filtered the samples based on two criteria:
* The sample is from "Homo sapiens" (organism=Homo sapiens).
* The sample has non-empty values for at least 3 of the 6 fields listed in the previous section.
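
The two criteria can be expressed as a single predicate. This is a minimal sketch, assuming each sample has already been flattened into a dictionary of attribute names and values, and that "non-empty" means a non-blank string. `is_selected` and the constants are hypothetical names; the logic in `step2_filtering.py` may differ.

```python
CATEGORICAL_FIELDS = ("sex", "tissue", "cell_line", "cell_type", "disease", "ethnicity")
MIN_FILLED_FIELDS = 3


def is_selected(sample):
    """Return True when the sample is human and has enough non-blank fields."""
    # Criterion 1: the organism must be Homo sapiens
    if sample.get("organism") != "Homo sapiens":
        return False
    # Criterion 2: at least 3 of the 6 selected fields must hold a value
    filled = sum(1 for field in CATEGORICAL_FIELDS if str(sample.get(field) or "").strip())
    return filled >= MIN_FILLED_FIELDS
```

For example, a human sample with values for *sex*, *tissue*, and *disease* passes, while the same sample with a blank *tissue* does not.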

#### <a name="s2-2-a"></a>2.2.a. NCBI BioSample

Script used: [step2_filtering.py](scripts/step2_filtering.py). 

In [None]:
# Filter the NCBI samples
%run ./scripts/step2_filtering.py

Execution results:
Finished processing NCBI samples
- Total samples processed: 11,625,524
- Total samples selected: 262,114

The result is an XML file with 262,114 samples ([biosample_filtered.xml](./workspace/data/samples/filtered/biosample_filtered.xml)). 
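
With more than 11 million samples in the source file, a filter of this kind would normally stream the XML instead of loading it whole. The sketch below shows one way to do that with `xml.etree.ElementTree.iterparse`. The element and attribute names (`BioSample`, `Description/Organism`, `taxonomy_name`, `Attribute`, `attribute_name`) are assumptions about the layout of `biosample_set.xml`, and `iter_samples` is a hypothetical helper rather than the code in `step2_filtering.py`.

```python
import io
import xml.etree.ElementTree as ET


def iter_samples(stream):
    """Yield one dict per <BioSample> element read from a binary stream."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag != "BioSample":
            continue
        organism = elem.find("./Description/Organism")
        sample = {"organism": organism.get("taxonomy_name") if organism is not None else None}
        for attr in elem.iter("Attribute"):
            name = attr.get("attribute_name")
            if name:
                sample[name] = (attr.text or "").strip()
        yield sample
        elem.clear()  # drop the element's children once the sample has been consumed


# Tiny in-memory document for illustration only
DEMO_XML = b"""<BioSampleSet>
  <BioSample accession="DEMO1">
    <Description><Organism taxonomy_name="Homo sapiens"/></Description>
    <Attributes>
      <Attribute attribute_name="sex">female</Attribute>
      <Attribute attribute_name="tissue">blood</Attribute>
    </Attributes>
  </BioSample>
  <BioSample accession="DEMO2">
    <Description><Organism taxonomy_name="Mus musculus"/></Description>
    <Attributes/>
  </BioSample>
</BioSampleSet>"""

demo_parsed = list(iter_samples(io.BytesIO(DEMO_XML)))
```

For the compressed download, the same generator could be fed a stream opened with `gzip.open(path, "rb")`.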

--------end of document-------



## Steps to reproduce

### Rules generation
Take the text instances from the ARM experiment and run the server with the `READ_INSTANCES_FROM_CEDAR` constant set to `false` and the `CEDAR_INSTANCES_PATH` constant set to the path of those instances.

### Rules ingestion

1. Delete the existing rules in Elasticsearch: `cedarat rules-regenerateIndex`
2. Download the NCBI (text-based) rules: https://drive.google.com/file/d/1ngCTGf4To1NZ1puRsB3aaCvtZIAERktY/view?usp=sharing
3. Extract the rules to a local folder. Then, import the 30,295 rules using:
    
    `elasticdump --input=./ncbi-text-rules-data.json --output=http://localhost:9200/cedar-rules --type=data`
    
    
4. Check that the rules have been imported correctly using Kibana (http://localhost:5601):
    
    `GET cedar-rules/_search`
    
5. Script used to generate recommendations: ...    
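
The import check in step 4 can also be scripted without Kibana. The sketch below, which is not part of the original workflow, queries Elasticsearch's `_count` endpoint for the `cedar-rules` index and compares the result with the 30,295 rules mentioned in step 3. The function names are hypothetical; the host and port are taken from the commands above.

```python
import json
import urllib.request

EXPECTED_RULES = 30295
COUNT_URL = "http://localhost:9200/cedar-rules/_count"


def parse_count(response_body):
    """Extract the document count from a _count API response body."""
    return int(json.loads(response_body)["count"])


def fetch_rule_count(url=COUNT_URL):
    """Ask the local Elasticsearch instance how many documents the index holds."""
    with urllib.request.urlopen(url) as response:
        return parse_count(response.read())


def rules_imported_ok(count, expected=EXPECTED_RULES):
    """Return True when the index holds exactly the expected number of rules."""
    return count == expected
```

With the local Elasticsearch instance running, `rules_imported_ok(fetch_rule_count())` should return `True` once the import has finished.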

Notes:
- I will use the rules generated from the training set. Then, I will use the test set to evaluate the curation process.

Other possible topics to mention:
- Configuration settings


## References

* ARM-evaluation Jupyter notebook: https://nbviewer.jupyter.org/github/metadatacenter/cedar-experiments-valuerecommender2019/blob/master/ValueRecommenderEvaluation.ipynb#s5-results

In [None]:
# Shell commands need a leading "!" when run from a notebook code cell
!elasticdump --input=./ncbi-text-data.json --output=http://localhost:9200/cedar-rules --type=data