# Get NMDC Metadata and Data Objects

This notebook explains, with example code, how to:

1. Filter NMDC metadata through API endpoints to obtain IDs and fetch attributes.
2. Download metadata and data objects to local files.
3. Fetch metadata or data object bytes and load them into in-memory Python objects.

## Dependencies

The following modules, constants, and helper functions are used by one or more use case cells below, so be sure to run this cell first:

In [1]:
from io import BytesIO
import json
from operator import itemgetter
from pathlib import Path
from pprint import pprint
import shutil
import subprocess
from tqdm.notebook import tqdm
from urllib.parse import parse_qsl, urlencode

import requests
from toolz import keyfilter, merge, concat

HOST = "https://api.microbiomedata.org"

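def get_json(path, host=HOST, **kwargs):
    """GET host + path and return the parsed JSON body, raising on HTTP errors."""
    r = requests.get(host + path, **kwargs)
    r.raise_for_status()
    return r.json()

def pick(allowlist, d):
    """Return a copy of dict `d` containing only the keys in `allowlist`."""
    return keyfilter(lambda k: k in allowlist, d)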

meta = itemgetter("meta")
results = itemgetter("results")
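
To make the helpers concrete, here is a small offline sketch of what `pick`, `meta`, and `results` return. The response dictionary is hand-made for illustration, and `pick_stdlib` is a hypothetical dict-comprehension stand-in for the `toolz`-based `pick`, so the sketch runs without a network connection or third-party packages:

```python
from operator import itemgetter

def pick_stdlib(allowlist, d):
    # Hypothetical stand-in for pick(): keep only the allow-listed keys.
    return {k: v for k, v in d.items() if k in allowlist}

meta = itemgetter("meta")
results = itemgetter("results")

# Hand-made response for illustration; not real API output.
fake_response = {
    "meta": {"count": 1, "page": 1, "per_page": 25},
    "results": [
        {"id": "nmdc:sty-11-076c9980", "ecosystem_type": "Soil", "name": "example"},
    ],
}

picked = [pick_stdlib(["id", "ecosystem_type"], r) for r in results(fake_response)]
print(meta(fake_response)["count"])
print(picked)
```

`meta` and `results` are plain key lookups, so they raise `KeyError` on a response that lacks those keys.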

## Filter/fetch metadata

### Use case: fetch metadata directly associated with an ID of unknown type

In [3]:
get_json("/nmdcschema/ids/nmdc:bsm-13-amrnys72")

{'id': 'nmdc:bsm-13-amrnys72',
 'name': 'Sand microcosm microbial communities from a hyporheic zone in Columbia River, Washington, USA - GW-RW T4_25-Nov-14',
 'description': 'Sterilized sand packs were incubated back in the ground and collected at time point T4.',
 'env_broad_scale': {'has_raw_value': 'ENVO:01000253',
  'term': {'id': 'ENVO:01000253'}},
 'env_local_scale': {'has_raw_value': 'ENVO:01000621',
  'term': {'id': 'ENVO:01000621'}},
 'env_medium': {'has_raw_value': 'ENVO:01000017',
  'term': {'id': 'ENVO:01000017'}},
 'type': 'nmdc:Biosample',
 'collection_date': {'has_raw_value': '2014-11-25'},
 'depth': {'has_raw_value': '0.5',
  'has_numeric_value': 0.5,
  'has_unit': 'meter'},
 'geo_loc_name': {'has_raw_value': 'USA: Columbia River, Washington'},
 'lat_lon': {'has_raw_value': '46.37228379 -119.2717467',
  'latitude': 46.37228379,
  'longitude': -119.2717467},
 'ecosystem': 'Engineered',
 'ecosystem_category': 'Artificial ecosystem',
 'ecosystem_type': 'Sand microcosm',
 '
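
The returned document nests most measurements under keys such as `has_raw_value` and `has_numeric_value`. As a minimal sketch, assuming you only need a flat record, the hypothetical helper `flatten_values` below prefers the numeric value when present and falls back to the raw value. The sample document is a trimmed copy of the output above:

```python
def flatten_values(doc):
    """Collapse nested value dicts into plain values (illustrative only)."""
    flat = {}
    for key, value in doc.items():
        if isinstance(value, dict) and "has_numeric_value" in value:
            flat[key] = value["has_numeric_value"]
        elif isinstance(value, dict) and "has_raw_value" in value:
            flat[key] = value["has_raw_value"]
        else:
            flat[key] = value  # already a plain value (or an unrecognized dict)
    return flat

# Trimmed copy of the biosample document shown above.
doc = {
    "id": "nmdc:bsm-13-amrnys72",
    "type": "nmdc:Biosample",
    "collection_date": {"has_raw_value": "2014-11-25"},
    "depth": {"has_raw_value": "0.5", "has_numeric_value": 0.5, "has_unit": "meter"},
    "ecosystem": "Engineered",
}

flat = flatten_values(doc)
print(flat)
```

This sketch discards units (`has_unit`), so keep the original document around if units matter for your analysis.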

### Use case: fetch metadata for an ID from a known [nmdc:Database](https://microbiomedata.github.io/nmdc-schema/Database/) collection.

In [4]:
get_json("/nmdcschema/biosample_set/nmdc:bsm-13-w2cwcx50")

{'id': 'nmdc:bsm-13-w2cwcx50',
 'name': 'Sand microcosm microbial communities from a hyporheic zone in Columbia River, Washington, USA - GW-RW T4_12-Aug-14',
 'description': 'Sterilized sand packs were incubated back in the ground and collected at time point T4.',
 'env_broad_scale': {'has_raw_value': 'ENVO:01000253',
  'term': {'id': 'ENVO:01000253'}},
 'env_local_scale': {'has_raw_value': 'ENVO:01000621',
  'term': {'id': 'ENVO:01000621'}},
 'env_medium': {'has_raw_value': 'ENVO:01000017',
  'term': {'id': 'ENVO:01000017'}},
 'type': 'nmdc:Biosample',
 'collection_date': {'has_raw_value': '2014-08-12'},
 'depth': {'has_raw_value': '0.5',
  'has_numeric_value': 0.5,
  'has_unit': 'meter'},
 'geo_loc_name': {'has_raw_value': 'USA: Columbia River, Washington'},
 'lat_lon': {'has_raw_value': '46.37228379 -119.2717467',
  'latitude': 46.37228379,
  'longitude': -119.2717467},
 'ecosystem': 'Engineered',
 'ecosystem_category': 'Artificial ecosystem',
 'ecosystem_type': 'Sand microcosm',
 '

### Use case: filter metadata from a known nmdc:Database collection using the MongoDB Query Language.

In [5]:
def get_json_mql(path, filter_):
    return get_json(path, params={"filter": json.dumps(filter_)})

def resources_count(json_response):
    return len(json_response["resources"])

resources_count(get_json_mql(
    "/nmdcschema/biosample_set",
    {"ecosystem": "Engineered"}
))

19
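
`get_json_mql` serializes the filter dictionary to JSON and passes it as the `filter` query parameter. The offline sketch below builds a comparable query string with the standard library (roughly what `requests` does with `params`) and confirms that it round-trips. The compound filter, with a `$gt` comparison on `depth.has_numeric_value`, is an illustrative assumption that uses standard MongoDB operator syntax; it has not been run against the API here:

```python
import json
from urllib.parse import parse_qsl, urlencode

# Illustrative filter; the $gt clause is an assumption, not taken from the notebook.
filter_ = {
    "ecosystem": "Engineered",
    "depth.has_numeric_value": {"$gt": 0.25},
}

# Percent-encode the JSON text so it is safe to place in a URL.
query_string = urlencode({"filter": json.dumps(filter_)})
print(query_string)

# Decoding the query string recovers the original filter dictionary.
decoded = json.loads(dict(parse_qsl(query_string))["filter"])
```

The same dictionary could then be passed as `get_json_mql("/nmdcschema/biosample_set", filter_)`.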

### Use case: filter metadata from studies, biosamples, data_objects, or any activities collection using a readable URL with a Solr-like query language.

In [6]:
def id_and_ecosystem_fields(doc):
    return pick(
        ["id"] + [f for f in doc if f.startswith("ecosystem")],
        doc)

print("\nStudies filter:\n")
json_response = get_json("/studies?filter=ecosystem_type:Soil")
pprint(meta(json_response))
pprint([id_and_ecosystem_fields(r) for r in results(json_response)])

print("\nData Objects filter and sort:\n")

json_response = get_json(
    "/data_objects?"
    "filter=description.search:GFF"
    "&"
    "sort=file_size_bytes:desc"
)
pprint(meta(json_response))
pprint([pick(
    ["description", "file_size_bytes", "id", "url"]
    , r
) for r in results(json_response)][:5])

print("\nActivities filter and sort:\n")

json_response = get_json(
    "/activities?"
    "filter=started_at_time:>2022-01-01"
    ","
    "execution_resource.search:NERSC"
    "&"
    "sort=ended_at_time:desc"
)
pprint(meta(json_response))
pprint([
    pick([
        "id",
        "started_at_time",
        "ended_at_time",
        "execution_resource",
        "type"],
        r
    ) for r in results(json_response)][:5]
)


Studies filter:

{'count': 3,
 'db_response_time_ms': 1,
 'mongo_filter_dict': {'ecosystem_type': 'Soil'},
 'mongo_sort_list': None,
 'page': 1,
 'per_page': 25}
[{'ecosystem': 'Environmental',
  'ecosystem_category': 'Terrestrial',
  'ecosystem_subtype': 'Unclassified',
  'ecosystem_type': 'Soil',
  'id': 'nmdc:sty-11-076c9980'},
 {'ecosystem': 'Environmental',
  'ecosystem_category': 'Terrestrial',
  'ecosystem_subtype': 'Meadow',
  'ecosystem_type': 'Soil',
  'id': 'nmdc:sty-11-dcqce727'},
 {'ecosystem': 'Environmental',
  'ecosystem_category': 'Terrestrial',
  'ecosystem_subtype': 'Unclassified',
  'ecosystem_type': 'Soil',
  'id': 'nmdc:sty-11-r2h77870'}]

Data Objects filter and sort:

{'count': 0,
 'db_response_time_ms': 233,
 'mongo_filter_dict': {'description': {'$regex': 'GFF'}},
 'mongo_sort_list': [['file_size_bytes', -1]],
 'page': 1,
 'per_page': 25}
[]

Activities filter and sort:

{'count': 7046,
 'db_response_time_ms': 3989,
 'mongo_filter_dict': {'execution_resource'
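
The query strings in this cell follow a pattern: filter clauses of the form `field:value` (or `field.search:value`, or `field:>value`) joined by commas, plus an optional `sort=field:direction` parameter. A hypothetical helper such as `build_query` below, inferred only from the examples in this cell rather than from the API reference, can assemble those strings:

```python
def build_query(path, filters, sort=None):
    """Assemble a readable query path from filter clauses (illustrative only)."""
    query = "filter=" + ",".join(filters)
    if sort is not None:
        field, direction = sort
        query += f"&sort={field}:{direction}"
    return f"{path}?{query}"

url_path = build_query(
    "/activities",
    ["started_at_time:>2022-01-01", "execution_resource.search:NERSC"],
    sort=("ended_at_time", "desc"),
)
print(url_path)
```

The resulting string matches the one concatenated by hand in the activities example and can be passed to `get_json`.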

## Download (meta)data

### Use case: download metadata of all biosamples for a study.

In [10]:
def write_jsonlines_file(path, all_results):
    with open(path, "w") as f:
        f.writelines([json.dumps(doc)+"\n" for doc in all_results])

cursor = "*"
all_results = []
while cursor is not None:
    json_response = get_json(
        f"/biosamples?filter=part_of:nmdc:sty-11-zs2syx06&cursor={cursor}"
    )
    m, rs = meta(json_response), results(json_response)
    cursor = m['next_cursor']
    print("fetched", len(rs), f"results out of {m['count']} total")
    all_results.extend(rs)

path = "~/biosamples_part_of_nmdc:sty-11-zs2syx06.jsonl"

write_jsonlines_file(
    Path(path).expanduser(),
    all_results
)

subprocess.check_output(
    f"head -1 {path}",
    shell=True,
)

fetched 25 results out of 60 total
fetched 25 results out of 60 total
fetched 10 results out of 60 total


b'{"add_date": "2015-02-26", "collection_date": {"has_raw_value": "2014-09-03"}, "depth": {"has_maximum_numeric_value": 0.4, "has_minimum_numeric_value": 0.3, "has_numeric_value": 0.3, "has_raw_value": "0.3 to 0.4 meters", "has_unit": "metre"}, "description": "Grasslands soil microbial communities from the Angelo Coastal Reserve, plot 9. There is a duplicate submission for this entry in NCBI. The NCBI identifiers for a duplicate are PRJNA449266 and SAMN08902854", "ecosystem": "Environmental", "ecosystem_category": "Terrestrial", "ecosystem_subtype": "Grasslands", "ecosystem_type": "Soil", "elev": 432, "env_broad_scale": {"has_raw_value": "grassland biome [ENVO:01000177]", "term": {"id": "ENVO:01000177"}}, "env_local_scale": {"has_raw_value": "biosphere reserve [ENVO:00000376]", "term": {"id": "ENVO:00000376"}}, "env_medium": {"has_raw_value": "grassland soil [ENVO:00005750]", "term": {"id": "ENVO:00005750"}}, "geo_loc_name": {"has_raw_value": "USA: California: Angelo Coastal Reserve"},
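
A JSON Lines file holds one JSON document per line, so it can be read back one line at a time. The sketch below round-trips two hand-made documents through a temporary file. `read_jsonlines_file` is a hypothetical counterpart to `write_jsonlines_file`, and the temporary path stands in for the home-directory path used above:

```python
import json
import tempfile
from pathlib import Path

def write_jsonlines_file(path, all_results):
    # Same behavior as the notebook's writer: one JSON document per line.
    with open(path, "w") as f:
        f.writelines(json.dumps(doc) + "\n" for doc in all_results)

def read_jsonlines_file(path):
    # Hypothetical reader: parse each non-empty line as one document.
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

docs = [{"id": "example:1"}, {"id": "example:2"}]  # illustrative documents

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "biosamples.jsonl"
    write_jsonlines_file(path, docs)
    loaded = read_jsonlines_file(path)

print(loaded)
```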

### Use case: download all data objects for a biosample

In [11]:
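def download_file(url, directory="~/"):
    """Stream the file at `url` into `directory` and return the local filename."""
    local_filename = url.split('/')[-1]
    with requests.get(url, stream=True) as r:
        r.raise_for_status()  # fail early instead of saving an error page to disk
        with open(Path(directory).expanduser() / local_filename, 'wb') as f:
            shutil.copyfileobj(r.raw, f)

    return local_filename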

id_biosample = "igsn:IEWFS000A"  # TODO: update this ID; it matched no activities in the run shown below
rs_ompro = results(get_json(f"/activities?filter=type:nmdc:OmicsProcessing,has_input:{id_biosample}"))
for id_ompro in tqdm([d["id"] for d in rs_ompro]):
    rs_act = results(get_json(f"/activities?filter=was_informed_by:{id_ompro}"))
    for data_object_ids, activity_type in [(d["has_output"], d["type"]) for d in rs_act]:
        for data_object_id in data_object_ids:
            do = results(get_json(f"/data_objects?filter=id:{data_object_id}"))[0]
            print(f'downloading biosample {id_biosample} > omics processing activity {id_ompro} '
                  f'> {activity_type} activity > data object {data_object_id} from {do["url"]}...')
            download_file(do["url"])

0it [00:00, ?it/s]
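
`download_file` takes the local filename from the text after the last `/` in the URL, which would also capture any query string. As a hedged alternative, the hypothetical helper `local_path_for` parses the URL first and uses only its path component. The URL shown is made up for illustration:

```python
from pathlib import Path, PurePosixPath
from urllib.parse import unquote, urlparse

def local_path_for(url, directory="~/"):
    """Return the local path a download of `url` would be written to (illustrative)."""
    # Keep only the last segment of the URL path, ignoring query and fragment.
    name = PurePosixPath(unquote(urlparse(url).path)).name
    if not name:
        raise ValueError(f"cannot derive a filename from {url!r}")
    return Path(directory).expanduser() / name

print(local_path_for("https://example.org/data/sample.gff?download=1", "/tmp"))
```

Raising on an empty filename avoids silently writing to the directory path itself when a URL ends in `/`.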

## Load (meta)data objects to in-memory Python objects

### Use case: load metadata of all biosamples for a study.

In [12]:
cursor = "*"
all_results = []
while cursor is not None:
    json_response = get_json(
        f"/biosamples?filter=part_of:nmdc:sty-11-zs2syx06&cursor={cursor}"
    )
    m, rs = meta(json_response), results(json_response)
    cursor = m['next_cursor']
    print("fetched", len(rs), f"results out of {m['count']} total")
    all_results.extend(rs)

pprint([pick(["id","lat_lon"], r) for r in all_results][:5])

fetched 25 results out of 60 total
fetched 25 results out of 60 total
fetched 10 results out of 60 total
[{'id': 'nmdc:bsm-11-04qjyv47',
  'lat_lon': {'has_raw_value': '39.7392 -123.6308',
              'latitude': 39.7392,
              'longitude': -123.6308}},
 {'id': 'nmdc:bsm-11-05082t91',
  'lat_lon': {'has_raw_value': '39.7392 -123.6308',
              'latitude': 39.7392,
              'longitude': -123.6308}},
 {'id': 'nmdc:bsm-11-2fjtje68',
  'lat_lon': {'has_raw_value': '39.7392 -123.6308',
              'latitude': 39.7392,
              'longitude': -123.6308}},
 {'id': 'nmdc:bsm-11-3zrd9503',
  'lat_lon': {'has_raw_value': '39.7392 -123.6308',
              'latitude': 39.7392,
              'longitude': -123.6308}},
 {'id': 'nmdc:bsm-11-4x1n6x51',
  'lat_lon': {'has_raw_value': '39.7392 -123.6308',
              'latitude': 39.7392,
              'longitude': -123.6308}}]
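
The cursor loop above appears in both the download and load use cases, so it can be factored into a generator. This is a minimal sketch, assuming (as the loop above does) that each response has `meta` and `results` keys and that `meta['next_cursor']` is `None` on the last page. `iter_results` and `fake_fetch` are hypothetical names, and the fake pages let the sketch run offline:

```python
def iter_results(fetch_page, cursor="*"):
    """Yield every result, following `next_cursor` until it is None."""
    while cursor is not None:
        response = fetch_page(cursor)
        yield from response["results"]
        cursor = response["meta"]["next_cursor"]

# Fake pages keyed by cursor, for illustration only.
PAGES = {
    "*": {"meta": {"next_cursor": "c1"}, "results": [1, 2]},
    "c1": {"meta": {"next_cursor": None}, "results": [3]},
}

def fake_fetch(cursor):
    return PAGES[cursor]

all_results = list(iter_results(fake_fetch))
print(all_results)
```

Against the live API, `fetch_page` could be `lambda c: get_json(f"/biosamples?filter=part_of:nmdc:sty-11-zs2syx06&cursor={c}")`, which reuses the request from the loop above.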


### Use case: load a data object

In [9]:
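def load_bytes(url):
    """Fetch `url` and return the response body as bytes."""
    with requests.get(url, stream=True) as r:
        r.raise_for_status()  # fail early on HTTP errors
        b = BytesIO()
        shutil.copyfileobj(r.raw, b)

    return b.getvalue()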

b = load_bytes(get_json("/nmdcschema/data_object_set/nmdc:4b649d353b2c2385ab042682ba516d14")["url"])

for line in b.decode('utf-8').split("\n"):
    print(line)

Index,m/z,Calibrated m/z,Calculated m/z,Peak Height,Resolving Power,S/N,Ion Charge,m/z Error (ppm),m/z Error Score,Isotopologue Similarity,Confidence Score,DBE,H/C,O/C,Heteroatom Class,Ion Type,Is Isotopologue,Mono Isotopic Index,Molecular Formula,C,H,O,S,N
3,888.4887543119316,888.4883146532127,888.4878482725782,3085144.7556615113,115985.22008880168,8.926135177919196,-1,-0.5249150401183273,0.21637236558431258,0.0,0.12982341935058755,21.0,1.290909090909091,0.12727272727272726,N1 S1 O7,de-protonated,0,,C55 H71 O7 S1 N1,55,71,7,1,1
4,885.3918562712611,885.3914165190189,885.3914239992979,3271768.2235989384,116390.47009957765,9.466085953040722,-1,0.008448555979629386,0.999603533623901,0.0,0.5997621201743406,16.0,1.3478260869565217,0.3695652173913043,O17,de-protonated,0,,C46 H62 O17,46,62,17,,
6,877.03548957106,877.0350496100808,877.035237742867,3413803.0920212218,93999.77607505002,9.877030182866147,-1,0.2145099513797522,0.7744236389320375,0.0,0.4646541833592225,32.0,0.5238095238095238,0.476
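
Because this data object is CSV text, the decoded bytes can be parsed into rows rather than printed. The sketch below uses the standard library `csv` module on a shortened sample (a subset of the columns from the first two data rows of the output above), standing in for the `b` returned by `load_bytes`:

```python
import csv
import io

# Shortened stand-in for the bytes returned by load_bytes(); values are copied
# from the output above, with most columns omitted.
sample = (
    b"Index,m/z,Ion Charge,Molecular Formula\n"
    b"3,888.4887543119316,-1,C55 H71 O7 S1 N1\n"
    b"4,885.3918562712611,-1,C46 H62 O17\n"
)

# DictReader maps each row to the header names from the first line.
rows = list(csv.DictReader(io.StringIO(sample.decode("utf-8"))))
mz_values = [float(row["m/z"]) for row in rows]
print(rows[0]["Molecular Formula"], mz_values)
```

`csv.DictReader` returns every field as a string, so numeric columns need an explicit conversion such as the `float` call shown.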