## Data Trials for Bioinformatics Dupuytren's

 Here is the site for the API: https://www.ncbi.nlm.nih.gov/research/snpdelscore/api/
 
 I am getting all SNPs (single nucleotide polymorphisms) on Chromosome 14, since that chromosome is suspected to be
 associated with Dupuytren's disease.

 At first the download produced only a 2 KB file. I realized why: the results are paginated, and each response
 carries a "next" value holding the URL of the following page.
 The full download is still running, so I'll update the size once I know it.

 Each result contains the following data
 
   name    :   the name of the SNP (ex.  rs28973059)
 
   pos     :   the position of the SNP on the chromosome
   
   ref     :   the "normal" nucleotide
   
   alt     :   the "variant" nucleotide
   
   chr     :   the chromosome number (example format:  "chr14")
   
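   method  :   the prediction method (ex.  CAPE dsQTL, CAPE eQTL)
   
   tissue  :   tissue where found (ex.  H1 Cells)
   
   value   :   the prediction value for that method and tissue (exact meaning still to be confirmed)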
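
The fields listed above could be captured in a small typed record, which makes a missing or misspelled key fail early. This is only a sketch: the names `SnpScore` and `parse_result` are hypothetical, and the sample dictionary is copied from the first result printed further down.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SnpScore:
    """One entry of the API's `results` list (hypothetical container)."""
    name: str      # SNP identifier, e.g. "rs28973059"
    pos: int       # position on the chromosome
    ref: str       # reference ("normal") nucleotide
    alt: str       # alternate ("variant") nucleotide
    chr: str       # chromosome label, e.g. "chr14"
    method: str    # method that produced the value
    tissue: str    # tissue or cell type
    value: float   # numeric value reported for this method and tissue


def parse_result(record: dict) -> SnpScore:
    """Build a SnpScore from one result dict; raises KeyError on a missing field."""
    return SnpScore(**{field: record[field] for field in SnpScore.__dataclass_fields__})


sample = {"name": "rs28973059", "pos": 19000060, "ref": "C", "alt": "G",
          "chr": "chr14", "method": "CAPE dsQTL", "tissue": "H1 Cells", "value": 0.0184}
snp = parse_result(sample)
print(snp.name, snp.pos, snp.value)
```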

Necessary imports

In [13]:
import requests
import json
import pandas as pd



Create the variables for the API

In [24]:


# get all SNPs from chromosome 14
# (the chr filter is already part of the URL, so nothing more is appended)
base = 'https://www.ncbi.nlm.nih.gov/research/snpdelscore/api/snpdata/?chr=chr14&format=json'
url = base
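
A query string can also be assembled from a dictionary so that each parameter appears exactly once. A minimal sketch using only the standard library; `ENDPOINT` and `build_url` are names made up for this example, and the parameter names `chr` and `format` are taken from the URL above.

```python
from urllib.parse import urlencode

ENDPOINT = "https://www.ncbi.nlm.nih.gov/research/snpdelscore/api/snpdata/"


def build_url(chromosome: str, fmt: str = "json") -> str:
    """Return the endpoint followed by one copy of each query parameter."""
    return ENDPOINT + "?" + urlencode({"chr": chromosome, "format": fmt})


print(build_url("chr14"))
```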


Request the data and wait for it

In [9]:

# send the request
r = requests.get(base)



Get the json data

In [10]:
# get the data
json_data = r.json()


Print an example of the first page of data


In [11]:
print("Example of first page of data:  ")

for key, value in json_data.items():
    print(key + ":", value)

Example of first page of data:  
count: 58520665
next: https://www.ncbi.nlm.nih.gov/research/snpdelscore/api/snpdata/?chr=chr14&format=json&limit=10&offset=10
previous: None
results: [{'name': 'rs28973059', 'pos': 19000060, 'ref': 'C', 'alt': 'G', 'chr': 'chr14', 'method': 'CAPE dsQTL', 'tissue': 'H1 Cells', 'value': 0.0184}, {'name': 'rs28973059', 'pos': 19000060, 'ref': 'C', 'alt': 'G', 'chr': 'chr14', 'method': 'CAPE eQTL', 'tissue': 'H1 Cells', 'value': 0.7141}, {'name': 'rs28973059', 'pos': 19000060, 'ref': 'C', 'alt': 'G', 'chr': 'chr14', 'method': 'CAPE dsQTL', 'tissue': 'H1 BMP4 Derived Mesendoderm Cultured Cells', 'value': 0.0428}, {'name': 'rs28973059', 'pos': 19000060, 'ref': 'C', 'alt': 'G', 'chr': 'chr14', 'method': 'CAPE eQTL', 'tissue': 'H1 BMP4 Derived Mesendoderm Cultured Cells', 'value': 0.6402}, {'name': 'rs28973059', 'pos': 19000060, 'ref': 'C', 'alt': 'G', 'chr': 'chr14', 'method': 'CAPE dsQTL', 'tissue': 'H1 BMP4 Derived Trophoblast Cultured Cells', 'value': 0.157

Extract the data in a human-readable way for one example

In [26]:
print("There are " + str(json_data["count"]) + " entries.")
print("")
data = pd.DataFrame.from_records(json_data["results"])
data


There are 58520665 entries.



| | name | pos | ref | alt | chr | method | tissue | value |
|---|---|---|---|---|---|---|---|---|
| 0 | rs28973059 | 19000060 | C | G | chr14 | CAPE dsQTL | H1 Cells | 0.0184 |
| 1 | rs28973059 | 19000060 | C | G | chr14 | CAPE eQTL | H1 Cells | 0.7141 |
| 2 | rs28973059 | 19000060 | C | G | chr14 | CAPE dsQTL | H1 BMP4 Derived Mesendoderm Cultured Cells | 0.0428 |
| 3 | rs28973059 | 19000060 | C | G | chr14 | CAPE eQTL | H1 BMP4 Derived Mesendoderm Cultured Cells | 0.6402 |
| 4 | rs28973059 | 19000060 | C | G | chr14 | CAPE dsQTL | H1 BMP4 Derived Trophoblast Cultured Cells | 0.1579 |
| 5 | rs28973059 | 19000060 | C | G | chr14 | CAPE eQTL | H1 BMP4 Derived Trophoblast Cultured Cells | 0.3605 |
| 6 | rs28973059 | 19000060 | C | G | chr14 | CAPE dsQTL | H1 Derived Mesenchymal Stem Cells | 0.3209 |
| 7 | rs28973059 | 19000060 | C | G | chr14 | CAPE eQTL | H1 Derived Mesenchymal Stem Cells | 0.2348 |
| 8 | rs28973059 | 19000060 | C | G | chr14 | CAPE dsQTL | H1 Derived Neuronal Progenitor Cultured Cells | 0.2755 |
| 9 | rs28973059 | 19000060 | C | G | chr14 | CAPE eQTL | H1 Derived Neuronal Progenitor Cultured Cells | 0.4838 |


In [27]:

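# "next" is None (JSON null) on the last page, not the string "None"
while json_data["next"] is not None and len(data.index) < 30:
    # if the "next" value contains another API call, then keep going
    r = requests.get(json_data["next"])
    json_data = r.json()
    print("Next is " + str(json_data['next']))

    # add new results to the existing (DataFrame.append was removed in pandas 2.0)
    new_rows = pd.DataFrame.from_records(json_data["results"])
    data = pd.concat([data, new_rows], ignore_index=True, sort=False)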
    

Next is https://www.ncbi.nlm.nih.gov/research/snpdelscore/api/snpdata/?chr=chr14&format=json&limit=10&offset=20
Next is https://www.ncbi.nlm.nih.gov/research/snpdelscore/api/snpdata/?chr=chr14&format=json&limit=10&offset=30
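
The paging loop can also be written as a generator that follows each page's `next` URL until it is empty. This is a sketch, not the API's official client: `iter_results` and its `fetch` argument are hypothetical names, and the two fake pages below are made-up stand-ins so the example runs without network access.

```python
from typing import Callable, Iterator, Optional


def iter_results(first_url: str,
                 fetch: Callable[[str], dict],
                 max_rows: Optional[int] = None) -> Iterator[dict]:
    """Yield result dicts page by page, following each page's `next` URL."""
    url = first_url
    yielded = 0
    while url is not None and (max_rows is None or yielded < max_rows):
        page = fetch(url)
        for row in page["results"]:
            yield row
            yielded += 1
            if max_rows is not None and yielded >= max_rows:
                return
        url = page["next"]  # None on the last page


# Offline stand-in for the API: two fake pages keyed by URL (made-up data).
fake_pages = {
    "page1": {"next": "page2", "results": [{"name": "a"}, {"name": "b"}]},
    "page2": {"next": None, "results": [{"name": "c"}]},
}
rows = list(iter_results("page1", fake_pages.__getitem__))
print([r["name"] for r in rows])
```

Against the real API, `fetch` could be something like `lambda u: requests.get(u).json()`, and `max_rows` plays the role of the `len(data.index) < 30` cap.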


The first issue I see with the data is that the API returns multiple rows for the same SNP, apparently one per method and tissue.  For a given SNP, the only differences between those rows are the method, the tissue, and the prediction value.  So, I think I could remove the method, tissue, and value columns and then filter to distinct rows.

I am still considering whether the tissue column would be useful later.  Also, if rows for the same SNP disagree (in position, for example), the value might come in handy.  One option is to take the distinct rows first and use the value to decide which rows to keep when there is a discrepancy.  The other is to keep all rows with a discrepancy, in which case the value probably needs to be kept as well.
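
The discrepancy check described above could be sketched like this, before any columns are dropped. It uses plain dictionaries rather than the DataFrame so it runs on its own; `find_discrepancies` is a hypothetical helper, and the `rsEXAMPLE` rows are made up purely to show what a conflict would look like.

```python
from collections import defaultdict


def find_discrepancies(rows):
    """Map each SNP name to its distinct (pos, ref, alt, chr) tuples,
    keeping only names that have more than one such tuple."""
    seen = defaultdict(set)
    for row in rows:
        seen[row["name"]].add((row["pos"], row["ref"], row["alt"], row["chr"]))
    return {name: coords for name, coords in seen.items() if len(coords) > 1}


rows = [
    # two rows copied from the output above: same SNP, different method
    {"name": "rs28973059", "pos": 19000060, "ref": "C", "alt": "G", "chr": "chr14",
     "method": "CAPE dsQTL", "tissue": "H1 Cells", "value": 0.0184},
    {"name": "rs28973059", "pos": 19000060, "ref": "C", "alt": "G", "chr": "chr14",
     "method": "CAPE eQTL", "tissue": "H1 Cells", "value": 0.7141},
    # made-up SNP with conflicting positions
    {"name": "rsEXAMPLE", "pos": 1, "ref": "A", "alt": "T", "chr": "chr14",
     "method": "CAPE eQTL", "tissue": "H1 Cells", "value": 0.5},
    {"name": "rsEXAMPLE", "pos": 2, "ref": "A", "alt": "T", "chr": "chr14",
     "method": "CAPE dsQTL", "tissue": "H1 Cells", "value": 0.5},
]
conflicts = find_discrepancies(rows)
print(sorted(conflicts))
```

If the returned dictionary is empty, dropping method, tissue, and value and then deduplicating loses nothing about position or alleles.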


In [28]:
data = data.drop(columns=['tissue', 'value', 'method'])


| | name | pos | ref | alt | chr |
|---|---|---|---|---|---|
| 0 | rs28973059 | 19000060 | C | G | chr14 |
| 1 | rs28973059 | 19000060 | C | G | chr14 |
| 2 | rs28973059 | 19000060 | C | G | chr14 |
| 3 | rs28973059 | 19000060 | C | G | chr14 |
| 4 | rs28973059 | 19000060 | C | G | chr14 |
| 5 | rs28973059 | 19000060 | C | G | chr14 |
| 6 | rs28973059 | 19000060 | C | G | chr14 |
| 7 | rs28973059 | 19000060 | C | G | chr14 |
| 8 | rs28973059 | 19000060 | C | G | chr14 |
| 9 | rs28973059 | 19000060 | C | G | chr14 |


In [31]:
data = data.drop_duplicates()
print(data)
print("There are " + str(len(data.index)) + " entries.")

           name       pos ref alt    chr
0    rs28973059  19000060   C   G  chr14
10  rs200799719  19007013   T   A  chr14
11  rs201970796  19007014   C   T  chr14
12   rs28823376  19007199   C   T  chr14
13   rs28868990  19009012   C   G  chr14
14   rs28870189  19009827   C   T  chr14
15   rs28787113  19009874   A   C  chr14
16   rs28867444  19010114   T   C  chr14
17   rs28769305  19010845   T   G  chr14
18  rs567404177  19011159   G   A  chr14
19   rs28838588  19012416   T   C  chr14
20   rs59229897  19013465   T   A  chr14
21   rs28835071  19014236   C   T  chr14
22   rs28826949  19014674   T   G  chr14
23   rs28838035  19016546   T   C  chr14
24   rs28843138  19017334   G   T  chr14
25   rs28872628  19018235   T   G  chr14
26   rs28853679  19019260   T   C  chr14
27   rs28851941  19019261   G   A  chr14
28   rs71406421  19020534   G   A  chr14
29   rs28861970  19020582   C   T  chr14
There are 21 entries.


In [32]:
variant_ids = data["name"].tolist()
variant_ids

['rs28973059',
 'rs200799719',
 'rs201970796',
 'rs28823376',
 'rs28868990',
 'rs28870189',
 'rs28787113',
 'rs28867444',
 'rs28769305',
 'rs567404177',
 'rs28838588',
 'rs59229897',
 'rs28835071',
 'rs28826949',
 'rs28838035',
 'rs28843138',
 'rs28872628',
 'rs28853679',
 'rs28851941',
 'rs71406421',
 'rs28861970']

In [58]:
# serialize the ID list as a JSON array (double-quoted strings)
var_string = json.dumps(variant_ids)
var_string

'["rs28973059", "rs200799719", "rs201970796", "rs28823376", "rs28868990", "rs28870189", "rs28787113", "rs28867444", "rs28769305", "rs567404177", "rs28838588", "rs59229897", "rs28835071", "rs28826949", "rs28838035", "rs28843138", "rs28872628", "rs28853679", "rs28851941", "rs71406421", "rs28861970"]'

In [76]:
import sys

server = "https://rest.ensembl.org"
ext = "/variation/homo_sapiens"
headers = {"Content-Type": "application/json", "Accept": "application/json"}
# build the JSON request body directly from the list of IDs
thedata = json.dumps({"ids": variant_ids})
# ask Ensembl to include population data in the response
options = {
        'population_genotypes':'1',
        'pops':'1'
    }
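
Chromosome 14 has far more variants than the 21 used here, so a single request body listing every ID is unlikely to be practical. One option is to split the IDs into batches and send one POST per batch. This is a sketch under assumptions: `id_batches` is a hypothetical helper, and the batch size of 2 is only for illustration; the real per-request limit should be taken from the Ensembl REST documentation.

```python
import json


def id_batches(ids, batch_size):
    """Yield one JSON request body per batch of at most `batch_size` IDs."""
    for start in range(0, len(ids), batch_size):
        yield json.dumps({"ids": ids[start:start + batch_size]})


example_ids = ["rs28973059", "rs200799719", "rs201970796", "rs28823376", "rs28868990"]
bodies = list(id_batches(example_ids, batch_size=2))  # batch_size=2 only for illustration
print(len(bodies))
print(bodies[0])
```

Each body could then be passed as `data=` to `requests.post`, with the decoded responses merged into one dictionary.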

In [77]:
r = requests.post(server+ext, headers=headers, data=thedata, params=options)
 
if not r.ok:
  r.raise_for_status()
  sys.exit()
 
decoded = r.json()
print(repr(decoded))

{'rs28843138': {'most_severe_consequence': 'intergenic_variant', 'minor_allele': 'T', 'var_class': 'SNP', 'populations': [{'frequency': 0.6047, 'allele_count': 8158, 'population': 'gnomADg:amr', 'allele': 'G'}, {'population': 'gnomADg:amr', 'allele': 'T', 'frequency': 0.3953, 'allele_count': 5332}, {'population': 'gnomADg:mid', 'allele': 'T', 'allele_count': 102, 'frequency': 0.3643}, {'allele': 'G', 'population': 'gnomADg:mid', 'frequency': 0.6357, 'allele_count': 178}, {'frequency': 0.3651, 'allele_count': 1519, 'population': 'gnomADg:sas', 'allele': 'T'}, {'allele': 'G', 'population': 'gnomADg:sas', 'frequency': 0.6349, 'allele_count': 2641}, {'frequency': 0.8169, 'allele_count': 30980, 'population': 'gnomADg:afr', 'allele': 'G'}, {'population': 'gnomADg:afr', 'allele': 'T', 'allele_count': 6944, 'frequency': 0.1831}, {'allele_count': 675, 'frequency': 0.3717, 'allele': 'T', 'population': 'gnomADg:oth'}, {'population': 'gnomADg:oth', 'allele': 'G', 'frequency': 0.6283, 'allele_count
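
The nested response is hard to read as printed, so the population entries could be flattened into one row per variant, population, and allele. A minimal sketch, assuming every variant entry has the shape seen above; `flatten_populations` is a hypothetical helper, and the sample is an excerpt copied from the response (first two population entries only).

```python
def flatten_populations(decoded):
    """Turn {variant_id: {..., 'populations': [...]}} into a list of flat rows."""
    rows = []
    for variant_id, info in decoded.items():
        for pop in info.get("populations", []):
            rows.append({
                "variant": variant_id,
                "population": pop["population"],
                "allele": pop["allele"],
                "frequency": pop["frequency"],
                "allele_count": pop["allele_count"],
            })
    return rows


# Excerpt copied from the response printed above.
sample = {
    "rs28843138": {
        "most_severe_consequence": "intergenic_variant",
        "minor_allele": "T",
        "var_class": "SNP",
        "populations": [
            {"frequency": 0.6047, "allele_count": 8158, "population": "gnomADg:amr", "allele": "G"},
            {"population": "gnomADg:amr", "allele": "T", "frequency": 0.3953, "allele_count": 5332},
        ],
    }
}
flat = flatten_populations(sample)
for row in flat:
    print(row)
```

Passing the result to `pd.DataFrame(...)` gives a table that can be filtered by population or allele.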