# Perspective User: Bioinformatician
Use case: “A recent GWAS paper said genetic variant rs1361754 is important for heart health. I want to know whether that rs-number affects gene expression, i.e. has any eQTLs.”

#### Overview of Steps

#1) Use the GUID **https://doi.org/10.25491/cq8s-f809** to download the GTEx file “GTEx_Analysis_v7_eQTL.tar.gz”
#2) Extract the archive: tar xf GTEx_Analysis_v7_eQTL.tar.gz
#3) Convert the dbSNP variant id “rs1361754” to GTEx’s variant id “1_205801872_A_G_b37”
#4) Grep all significant eQTLs with the variant id “1_205801872_A_G_b37” from all tissue files
#5) Convert Gencode gene ids in the grep results to HGNC gene symbols
#6) Upload results to the cloud
#7) Create a Minid for the resulting file
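
The GTEx variant id used in these steps packs several fields into one string. As a minimal sketch, assuming the layout is chromosome, position, reference allele, alternate allele, and genome build separated by underscores (read off the example id `1_205801872_A_G_b37`), it can be split like this. The names `GtexVariant` and `parse_gtex_variant_id` are hypothetical helpers, not part of any GTEx library.

```python
from typing import NamedTuple


class GtexVariant(NamedTuple):
    """Fields of a GTEx v7 style variant id (assumed layout)."""
    chrom: str
    pos: int
    ref: str
    alt: str
    build: str


def parse_gtex_variant_id(variant_id: str) -> GtexVariant:
    """Split an id such as '1_205801872_A_G_b37' into its five fields."""
    parts = variant_id.split("_")
    if len(parts) != 5:
        raise ValueError(
            f"expected 5 underscore-separated fields, got {variant_id!r}"
        )
    chrom, pos, ref, alt, build = parts
    return GtexVariant(chrom, int(pos), ref, alt, build)


variant = parse_gtex_variant_id("1_205801872_A_G_b37")
print(variant.chrom, variant.pos, variant.ref, variant.alt, variant.build)
```

Parsing the id up front makes it easy to check that a converted rs-number points at the expected chromosome and build before searching the tissue files.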

# Accessing the File from the Identifier 
- Resolve the identifier to the landing service
- Find cloud locations from the landing page
- Find cloud locations from json-ld Identifier Metadata

## Landing Service

https://doi.org/10.25491/cq8s-f809

Resolves to the landing page

https://ors.datacite.org/doi:/10.25491/cq8s-f809

## Resolving Cloud Locations Through ORS

Obtain an access code through the OAuth2 flow:
https://ors.datacite.org/login

In [1]:
import requests
import json

get_response = requests.get(
    url = 'https://ors.datacite.org/doi:/10.25491/cq8s-f809',
    headers = {"Accept": "application/ld+json"}
)

In [3]:
json.loads(get_response.content.decode('utf-8')).get('contentUrl')

['https://storage.googleapis.com/gtex_analysis_v7/single_tissue_eqtl_data/GTEx_Analysis_v7_eQTL.tar.gz']
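
In this notebook `contentUrl` comes back as a list for the GTEx archive, while the record fetched at the end of the notebook shows it as a plain string. A small helper can hide that difference from downstream code. This is a sketch under that assumption; `content_urls` is a hypothetical name, not part of the ORS API.

```python
from typing import Any, Dict, List


def content_urls(metadata: Dict[str, Any]) -> List[str]:
    """Return the cloud locations in a metadata record as a list.

    Accepts a missing key, a single string, or a list of strings.
    """
    value = metadata.get("contentUrl")
    if value is None:
        return []
    if isinstance(value, str):
        return [value]
    return list(value)


# Both shapes seen in this notebook give a list.
print(content_urls({"contentUrl": "http://s3.amazonaws.com/dcppctest/rs1361754_Sig_eQTLs.tsv"}))
print(content_urls({"contentUrl": ["https://storage.googleapis.com/gtex_analysis_v7/single_tissue_eqtl_data/GTEx_Analysis_v7_eQTL.tar.gz"]}))
```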

## Running the Use Case

In [5]:
%%bash
# Download the Analysis Summary Results
curl -X GET https://storage.googleapis.com/gtex_analysis_v7/single_tissue_eqtl_data/GTEx_Analysis_v7_eQTL.tar.gz > GTEx_Analysis_v7_eQTL.tar.gz
tar xf GTEx_Analysis_v7_eQTL.tar.gz


In [35]:
%%bash
# Convert dbSNP variant id “rs1361754” to GTEx’s variant id “1_205801872_A_G_b37”
curl -k "https://gtexportal.org/rest/v1/reference/variant?format=tsv&snpId=rs1361754&datasetId=gtex_v7" > snp_reference.tsv
snp=`cat snp_reference.tsv | awk '{print $(NF-1)}' | tail -n 1`

# Shortcut for duration GTEx API is down
snp='1_205801872_A_G_b37'

# Grab all significant eQTLs with variant id “1_205801872_A_G_b37” from every tissue file
rs="rs1361754"
zgrep "$snp" GTEx_Analysis_v7_eQTL/*.v7.signif_variant_gene_pairs.txt.gz > ${rs}_Sig_eQTLs.tsv

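
The `zgrep` step can also be written in Python, which makes the tissue name and the columns available as values instead of one long text line. This is a minimal sketch, assuming the variant id is the first tab-separated column of each `*.v7.signif_variant_gene_pairs.txt.gz` file and that each file starts with a header row. `find_eqtls` is a hypothetical helper. Unlike `zgrep`, it matches the whole first column rather than any substring of the line.

```python
import glob
import gzip
import os

SUFFIX = ".v7.signif_variant_gene_pairs.txt.gz"


def find_eqtls(directory, variant_id):
    """Return (tissue, fields) pairs for rows whose first column is variant_id."""
    hits = []
    for path in sorted(glob.glob(os.path.join(directory, "*" + SUFFIX))):
        # The tissue name is the file name with the fixed suffix removed.
        tissue = os.path.basename(path)[: -len(SUFFIX)]
        with gzip.open(path, "rt") as handle:
            next(handle, None)  # skip the header row (assumed present)
            for line in handle:
                fields = line.rstrip("\n").split("\t")
                if fields[0] == variant_id:
                    hits.append((tissue, fields))
    return hits


# Prints nothing if the archive has not been extracted yet.
for tissue, fields in find_eqtls("GTEx_Analysis_v7_eQTL", "1_205801872_A_G_b37"):
    print(tissue, *fields, sep="\t")
```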


In [36]:
%%bash
rs="rs1361754"

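# Convert Gencode ids in the grep results to HGNC gene symbols.
# Start the output file with a header row: the column names from one tissue file, prefixed with a "tissue" column.
zcat < GTEx_Analysis_v7_eQTL/Heart_Left_Ventricle.v7.signif_variant_gene_pairs.txt.gz | head -n 1 | sed 's/^/tissue\t/'  > ${rs}_Sig_eQTLs_Standard_IDs.tsv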
cat <<EOF | perl - ${rs}_Sig_eQTLs.tsv

use strict;
use warnings;

my %s = ();
my %g;

\$" = "\t";

while(<>) {
  chomp;
  s#GTEx_Analysis_v7_eQTL/([A-Za-z0-9-_]+)\.v7\.signif_variant_gene_pairs\.txt\.gz:#\$1\t#;

  my @a = split(/\t/, \$_);


  # Map the GTEx variant id to its dbSNP rs-number, caching each answer under the original id
  my \$variant = \$a[1];
  if(!exists \$s{\$variant}) {
    my \$row = \`curl -k "https://gtexportal.org/rest/v1/reference/variant?format=tsv&variantId=\$variant&datasetId=gtex_v7" 2>/dev/null | tail -n 1\`;
    my @b   = split(/\t/, \$row);
    \$s{\$variant} = \$b[6];
  }
  \$a[1] = \$s{\$variant};

  # Map the Gencode gene id to its HGNC gene symbol, with the same caching
  my \$gene = \$a[2];
  if(!exists \$g{\$gene}) {
    my \$row = \`curl -k "https://gtexportal.org/rest/v1/reference/gene?format=tsv&gencodeVersion=v19&genomeBuild=GRCh37/hg19&geneId=\$gene" 2>/dev/null | tail -n 1\`;
    my @b   = split(/\t/, \$row);
    \$g{\$gene} = \$b[1];
  }
  \$a[2] = \$g{\$gene};

  print "@a\n";
}
EOF


Adipose_Subcutaneous		HAVANA	57284	282	379	0.492208	1.38088e-08	0.251831	0.0432154	5.29333e-05	8.27056e-36	3.03426e-30
Adipose_Subcutaneous		HAVANA	18996	282	379	0.492208	5.32804e-05	0.13746	0.0335557	6.34316e-05	1.29966e-05	0.0277763
Adipose_Subcutaneous		HAVANA	-17388	282	379	0.492208	5.42255e-31	-0.768671	0.0594839	6.27343e-05	2.30816e-57	6.48817e-51
Adipose_Visceral_Omentum		HAVANA	57284	230	295	0.471246	1.8797e-06	0.192055	0.0393847	3.51241e-05	2.54581e-22	1.10673e-17
Adipose_Visceral_Omentum		HAVANA	-17388	230	295	0.471246	2.98553e-30	-0.81516	0.0625338	3.78067e-05	9.3754e-49	1.71294e-42
Adrenal_Gland		HAVANA	-17388	122	160	0.457143	1.03308e-13	-0.649344	0.0785625	2.33999e-05	2.23639e-19	2.73137e-15
Artery_Aorta		HAVANA	57284	197	262	0.490637	2.02759e-09	0.350166	0.0559109	4.13894e-05	1.10871e-28	1.2337e-23
Artery_Aorta		HAVANA	-17388	197	262	0.490637	1.20771e-25	-0.850754	0.0710599	3.80813e-05	3.71202e-43	4.0522e-37
Artery_Coronary		HAVANA	57284	114	148	0.486842	5.87525e-06	0.37

Use of uninitialized value $a[1] in join or string at - line 36, <> line 1.
Use of uninitialized value $a[1] in join or string at - line 36, <> line 2.
Use of uninitialized value $a[1] in join or string at - line 36, <> line 3.
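
The Perl script keeps two hashes (`%s` and `%g`) so that each variant or gene id only needs one request to the GTEx API. The same caching pattern can be sketched in Python with the lookup function passed in, which also lets it be exercised while the API is unavailable. `make_cached_lookup` and `fake_fetch` are hypothetical names, and the stub below returns placeholder strings rather than real gene symbols.

```python
from typing import Callable, Dict


def make_cached_lookup(fetch: Callable[[str], str]) -> Callable[[str], str]:
    """Wrap fetch so that each identifier is fetched at most once."""
    cache: Dict[str, str] = {}

    def lookup(identifier: str) -> str:
        # Store the answer under the original identifier, not under the response.
        if identifier not in cache:
            cache[identifier] = fetch(identifier)
        return cache[identifier]

    return lookup


calls = []


def fake_fetch(identifier: str) -> str:
    """Stand-in for a remote request; records each call it receives."""
    calls.append(identifier)
    return "symbol-for-" + identifier


lookup = make_cached_lookup(fake_fetch)
for gene_id in ["gene-1", "gene-2", "gene-1", "gene-1"]:
    lookup(gene_id)

print(len(calls))  # two distinct ids, so two fetches
```

Keying the cache on the original id matters: keying it on the response would mean a repeated id is never found in the cache and is fetched again every time.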

## Uploading the File to the Cloud

In place of a permanent data repository or storage within a full stack, we will upload our results to a public bucket on AWS.

In [1]:
import boto3
s3_client = boto3.client("s3")

# create a public bucket
s3_client.create_bucket(Bucket='dcppctest', ACL='public-read')

# create public object
with open("rs1361754_Sig_eQTLs.tsv", "rb") as resultFile:
    s3_client.upload_fileobj(resultFile, "dcppctest", "rs1361754_Sig_eQTLs.tsv", 
                             ExtraArgs={'ACL':'public-read'})

## Minting A Minid

Metadata fields are mapped to schema.org as described in the KC2 Core metadata spec

In [9]:
# compute an MD5 checksum of the result file
import hashlib

md5 = hashlib.md5()
# read the file as bytes so the checksum matches the file exactly as stored
with open('rs1361754_Sig_eQTLs.tsv', 'rb') as analysis_file:
    md5.update(analysis_file.read())
    
analysis_md5 = md5.hexdigest()
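
The checksum cell above loads the whole file at once, which is fine for a small result table. For larger files, such as the 915M GTEx archive itself, the digest can be built up in fixed-size chunks instead. A minimal sketch follows; `md5_of_file` and its `chunk_size` default are illustrative choices, not part of the notebook.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading chunk_size bytes at a time."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Only runs when the result file from the earlier cells is present.
if os.path.exists("rs1361754_Sig_eQTLs.tsv"):
    print(md5_of_file("rs1361754_Sig_eQTLs.tsv"))
```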

In [12]:
analysis = {
    "@id": "ark:/13030/d3sodiumtest",
    "identifier": "ark:/13030/d3sodiumtest",
    "checksum": analysis_md5,
    "checksumMethod": "md5",
    "url": "https://ors.datacite.org/ark:/13030/d3sodiumtest",
    "dateCreated": "7/13/18",
    "contentUrl": ["http://s3.amazonaws.com/dcppctest/rs1361754_Sig_eQTLs.tsv"],
    "name": "Significant eQTLs of rs1361754",
    "author": "Max Levinson",
}
analysis

{'@id': 'ark:/13030/d3sodiumtest',
 'author': 'Max Levinson',
 'checksum': 'cd1c9c120df5460ae556c083a5b8ff89',
 'checksumMethod': 'md5',
 'contentUrl': ['http://s3.amazonaws.com/dcppctest/rs1361754_Sig_eQTLs.tsv'],
 'dateCreated': '7/13/18',
 'identifier': 'ark:/13030/d3sodiumtest',
 'name': 'Significant eQTLs of rs1361754',
 'url': 'https://ors.datacite.org/ark:/13030/d3sodiumtest'}

In [30]:
ACCESS = "?code=TEST"
response = requests.put(
    url = "https://ors.datacite.org/ark/put"+ACCESS+"&status=public",
    data = json.dumps(analysis)
)
json.loads(response.content.decode('utf-8'))

{'@id': 'ark:/13030/d3sodiumtest',
 'code': 200,
 'message': 'Succsessfully Updated all Identifier metadata ark:/13030/d3sodiumtest',
 'updated_keys': ['NIHdc.id',
  'NIHdc.identifier',
  'NIHdc.checksum',
  'NIHdc.checksumMethod',
  'NIHdc.url',
  'NIHdc.dateCreated',
  'NIHdc.contentUrl',
  'NIHdc.name',
  'NIHdc.author',
  '_target',
  '_status',
  '_profile']}

## Analysis now has a landing page!

https://n2t.net/ark:/13030/d3sodiumtest resolves to our landing page

https://ors.datacite.org/ark:/13030/d3sodiumtest

In [28]:
# get the location from the identifier
ors_get_ark = requests.get(
    url = "https://ors.datacite.org/ark:/13030/d3sodiumtest",
    headers = {"Accept": "application/ld+json"}
)

In [29]:
json.loads(ors_get_ark.content.decode('utf-8'))

{'@context': 'https://schema.org',
 '@id': 'https://n2t.net/ark:/13030/d3sodiumtest',
 'author': 'Max Levinson',
 'checksum': 'cd1c9c120df5460ae556c083a5b8ff89',
 'checksumMethod': 'md5',
 'contentUrl': 'http://s3.amazonaws.com/dcppctest/rs1361754_Sig_eQTLs.tsv',
 'dateCreated': '7/13/18',
 'identifier': 'https://n2t.net/ark:/13030/d3sodiumtest',
 'name': 'Significant eQTLs of rs1361754',
 'url': 'https://ors.datacite.org/ark:/13030/d3sodiumtest'}
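
Because the record carries both `checksum` and `checksumMethod`, anyone who later downloads the file from `contentUrl` can check that it is the file that was registered. A minimal sketch of that check, assuming the bytes have already been downloaded; `checksum_matches` is a hypothetical helper and the record below is a made-up example, not the one stored above.

```python
import hashlib
from typing import Any, Dict


def checksum_matches(metadata: Dict[str, Any], payload: bytes) -> bool:
    """Compare the digest of payload with the checksum stored in metadata."""
    method = str(metadata.get("checksumMethod", "")).lower()
    if method not in hashlib.algorithms_available:
        raise ValueError(f"unsupported checksum method: {method!r}")
    digest = hashlib.new(method, payload).hexdigest()
    return digest == str(metadata.get("checksum", "")).lower()


# Illustrative record built on the spot from example bytes.
example_bytes = b"example file contents\n"
example_record = {
    "checksum": hashlib.md5(example_bytes).hexdigest(),
    "checksumMethod": "md5",
}
print(checksum_matches(example_record, example_bytes))
print(checksum_matches(example_record, b"tampered contents\n"))
```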