# “dx extract_dataset” in Bash
<hr/>
***As-Is Software Disclaimer***

The content in this repository is delivered “As-Is”. Notwithstanding anything to the contrary, DNAnexus will have no warranty, support, liability or other obligations with respect to Materials provided hereunder.

<hr/>

This notebook demonstrates usage of the dx command `extract_dataset` for:
* Retrieval of Apollo-stored data, as referenced within entities and fields of a Dataset or Cohort object on the platform
* Retrieval of the underlying data dictionary files used to generate a Dataset object on the platform

<a href="https://github.com/dnanexus/OpenBio/blob/master/LICENSE.md">MIT License</a> applies to this notebook.

## Preparing your environment
### Launch spec:

* App name: JupyterLab with Python, R, Stata, ML
* Kernel: Bash
* Instance type: mem1_ssd1_v2_x2
* Cost: < $0.2
* Runtime: ≈ 10 min
* Data description: Input for this notebook is a v3.0 Dataset or Cohort object ID

### dxpy version
`extract_dataset` requires dxpy version >= 0.329.0. If you run the command from a local environment (that is, off the DNAnexus platform), you may also need to install pandas, for example with `pip3 install -U dxpy[pandas]`.

In [None]:
pip3 show dxpy
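
To compare the installed version against the 0.329.0 minimum programmatically rather than by eye, a minimal sketch along these lines may help. It assumes GNU `sort` with the `-V` (version sort) option is available. `version_ge` is a helper name chosen for illustration and is not part of dxpy.

```shell
# Hypothetical helper: succeed when version $1 is greater than or equal to version $2.
# Relies on `sort -V` (GNU coreutils) to order version strings.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

required="0.329.0"
# Read the installed version from pip metadata; stays empty if dxpy or pip3 is missing.
installed="$(pip3 show dxpy 2>/dev/null | awk '/^Version:/ {print $2}' || true)"

if [ -z "${installed}" ]; then
    echo "dxpy is not installed"
elif version_ge "${installed}" "${required}"; then
    echo "dxpy ${installed} satisfies >= ${required}"
else
    echo "dxpy ${installed} is older than ${required}; consider upgrading"
fi
```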

### 1. Assign environment variables

In [6]:
dx env

Auth token used		gjYgpQ5ppkkFVx8Q3jZv7F73z33F2Qx3fK5jFYyY

API server protocol	http

API server host		10.0.3.1

API server port		8124

Current workspace	project-G3fz4600F5X7FkJz6qyZfb3g

Current folder		None

Current user		None


In [8]:
# The referenced Dataset is private and is shown only as an example input. Supply a valid record ID that you have permission to access.
# Assign project-id of dataset
pid="project-G3fz4600F5X7FkJz6qyZfb3g"
# Assign dataset record-id
rid="record-G406j8j0x8kzxv3G08k64gVV"

# Assign joint dataset project-id:record-id
dataset="${pid}:${rid}"

echo "${dataset}"

project-G3fz4600F5X7FkJz6qyZfb3g:record-G406j8j0x8kzxv3G08k64gVV
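
Before calling `dx extract_dataset`, it can be useful to catch a mistyped identifier early. The sketch below checks only the textual shape of the value. The 24-character pattern is an assumption inferred from the IDs shown in this notebook, and `is_dataset_ref` is a hypothetical helper name. It does not verify that the record exists or that you have access to it.

```shell
# Hypothetical helper: succeed when $1 looks like "record-<24 chars>" or
# "project-<24 chars>:record-<24 chars>". The pattern is an assumption based
# on the example IDs in this notebook.
is_dataset_ref() {
    local id_re='^(project-[0-9A-Za-z]{24}:)?record-[0-9A-Za-z]{24}$'
    [[ "$1" =~ ${id_re} ]]
}

candidate="project-G3fz4600F5X7FkJz6qyZfb3g:record-G406j8j0x8kzxv3G08k64gVV"
if is_dataset_ref "${candidate}"; then
    echo "looks valid: ${candidate}"
else
    echo "unexpected format: ${candidate}" >&2
fi
```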


### 2. Call “dx extract_dataset” using a supplied dataset

In [9]:
dx extract_dataset ${dataset} -ddd --delimiter ","



#### Preview data in the three dictionary (*.csv) files

In [10]:
head -5 *.csv

==> apollo_ukbrap_synth_pheno_geno_100k.codings.csv <==

coding_name,code,meaning,concept,display_order,parent_code

data_coding_493,-121,Do not know,,1,

data_coding_493,-131,Sometimes,,2,

data_coding_493,-141,Often,,3,

data_coding_493,0,Rarely/never,,4,



==> apollo_ukbrap_synth_pheno_geno_100k.data_dictionary.csv <==

entity,name,type,primary_key_type,coding_name,concept,description,folder_path,is_multi_select,is_sparse_coding,linkout,longitudinal_axis_type,referenced_entity_field,relationship,title,units

participant,p22608_a24,integer,,data_coding_493,,,Online follow-up > Work environment > Employment history,,,http://biobank.ctsu.ox.ac.uk/crystal/field.cgi?id=22608,,,,Workplace very hot | Array 24,

participant,p2784_i1,integer,,data_coding_100349,,,UK Biobank Assessment Centre > Touchscreen > Sex-specific factors > Female-specific factors,,,http://biobank.ctsu.ox.ac.uk/crystal/field.cgi?id=2784,,,,Ever taken oral contraceptive pill | Instance 1,

participant,p102780_i4,intege
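
The codings file maps each raw `code` to a human-readable `meaning`, keyed by `coding_name`. As a rough illustration of how it might be used to decode a value, the sketch below looks up a meaning with `awk`. The sample rows are copied from the preview above. `lookup_meaning` is a hypothetical helper, and the naive comma split assumes that no field in the row contains a quoted comma.

```shell
# Sample rows copied from the codings preview, written to a temp file
# so the example runs without access to the dataset.
codings_file="$(mktemp)"
cat > "${codings_file}" <<'EOF'
coding_name,code,meaning,concept,display_order,parent_code
data_coding_493,-121,Do not know,,1,
data_coding_493,-131,Sometimes,,2,
data_coding_493,-141,Often,,3,
data_coding_493,0,Rarely/never,,4,
EOF

# Hypothetical helper: print the meaning for a (coding_name, code) pair.
# Codes are compared as strings; rows are split naively on commas.
lookup_meaning() {
    awk -F ',' -v name="$1" -v code="$2" \
        'NR > 1 && $1 == name && ($2 "") == (code "") { print $3 }' "${codings_file}"
}

lookup_meaning data_coding_493 -131
```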

### 3. Parse returned metadata and extract entity/field names

In [29]:
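# Build an array of "entity.field" names from the first two columns of the data dictionary (skipping the header row)
entity_field=()
while IFS="," read -r entity field
do
    entity_field+=("${entity}.${field}")
done < <(cut -d "," -f 1,2 *.data_dictionary.csv | tail -n +2)
echo ${entity_field[@]:0:10}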

participant.p22608_a24 participant.p2784_i1 participant.p102780_i4 participant.p41217 participant.p22704_a7 participant.p20011_i1_a18 participant.p20081_i2 participant.p3773_i0 participant.p20112_i2_a5 participant.p4237_i0_a14


In [21]:
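# Collect bare field names (without the entity prefix) in a separate array so that entity_field is not overwritten
field_names=()
while IFS="," read -r entity field
do
    field_names+=("${field}")
done < <(cut -d "," -f 1,2 *.data_dictionary.csv | tail -n +2)
echo ${field_names[@]:0:10}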

p22608_a24 p2784_i1 p102780_i4 p41217 p22704_a7 p20011_i1_a18 p20081_i2 p3773_i0 p20112_i2_a5 p4237_i0_a14
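
The data dictionary also carries a `type` column, so the same parsing approach can narrow the list to a subset of fields instead of collecting every row. The sketch below is one possible way to do that. The sample file is trimmed to the first three columns, the third row is a made-up example added only to show the filter, and `typed_fields` is an illustrative variable name. As with the cells above, the naive comma split assumes no quoted commas in those columns.

```shell
# Sample dictionary trimmed to the first three columns. The last row is
# hypothetical and exists only to demonstrate filtering.
dict_file="$(mktemp)"
cat > "${dict_file}" <<'EOF'
entity,name,type
participant,p22608_a24,integer
participant,p2784_i1,integer
participant,hypothetical_text_field,string
EOF

# Collect "entity.field" names for rows whose type column matches wanted_type.
wanted_type="integer"
typed_fields=()
while IFS= read -r line; do
    typed_fields+=("${line}")
done < <(awk -F ',' -v t="${wanted_type}" \
    'NR > 1 && $3 == t { print $1 "." $2 }' "${dict_file}")

echo "${typed_fields[@]}"
```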


### 4. Use the extracted entity and field names as input to “dx extract_dataset” to extract data

In [28]:
entity_field_input=$(IFS=, ; echo "${entity_field[*]}")
#echo ${entity_field_input}

In [13]:
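# Passing every field in the dictionary at once exceeds the operating system's argument-length limit, so this call fails
dx extract_dataset ${dataset} --fields ${entity_field_input} -o extracted_data.csv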

bash: /opt/conda/bin/dx: Argument list too long


: 126
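
The "Argument list too long" error comes from the operating system, which caps the size of the arguments passed to a new process. The exact limits are system-dependent: `getconf ARG_MAX` reports the total, and Linux typically also caps the length of any single argument. A minimal sketch for inspecting the joined string and falling back to a smaller selection is shown below. The array is a stand-in for the full field list, `join_by_comma` is a hypothetical helper, and `max_fields=2` is an arbitrary value chosen for illustration.

```shell
# Stand-in for the full array built from the data dictionary.
entity_field=(participant.p22608_a24 participant.p2784_i1 participant.p102780_i4 participant.p41217)

# Hypothetical helper: join all arguments with commas.
join_by_comma() {
    local IFS=','
    echo "$*"
}

all_fields="$(join_by_comma "${entity_field[@]}")"
echo "joined length: ${#all_fields} characters; ARG_MAX: $(getconf ARG_MAX)"

# Keep only the first max_fields entries (arbitrary value for illustration).
max_fields=2
subset="$(join_by_comma "${entity_field[@]:0:max_fields}")"
echo "${subset}"
```

The resulting `subset` string can then be passed to `--fields` in place of the full list.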

In [33]:
dx extract_dataset ${dataset} --fields "participant.eid,participant.p31,participant.p21022,participant.p22009_a1,participant.p22009_a2,participant.p41202" -o extracted_data.csv

#### Print data in the retrieved data file

In [34]:
head -3 extracted_data.csv

participant.p31,participant.p21022,participant.p22009_a1,participant.p22009_a2,participant.p41202

1,63,-13.0594,2.66713,"[""K297"",""I802"",""K29"",""Block K20-K31"",""Chapter XI"",""I80"",""Block I80-I89"",""Chapter IX""]"

0,58,-14.7965,6.12062,"[""I251"",""K409"",""I25"",""Block I20-I25"",""Chapter IX"",""K40"",""Block K40-K46"",""Chapter XI""]"


In [2]:
# Extract data for a Cohort

dx extract_dataset "record-G5Ky4Gj08KQYQ4P810fJ8qPp" --fields "participant.eid,participant.p31,participant.p21022,participant.p22009_a1,participant.p22009_a2,participant.p41202" -o female_coffee_data.csv

### 5. Upload extracted dictionaries and data back to the project

In [None]:
dx upload *.csv