## Mapping Enum(ish) Measurement Technique values

Some repositories provide fairly consistent measurement technique values. For example, NCBI GEO uses `measurementTechnique` terms consistently. For repositories like these, NLP-based extraction of measurement techniques is unnecessary; we only need to map the existing values.

This notebook extracts the measurement technique values from repositories whose values are mostly consistent and maps them. See the corresponding GitHub issue: https://github.com/NIAID-Data-Ecosystem/nde-crawlers/issues/157
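
As a minimal sketch of what "mapping the values" could look like, the snippet below normalizes raw technique names and looks them up in a hand-curated table. The table entries, the `TERM_MAP` name, and the `normalize` and `map_technique` helpers are illustrative assumptions, not the mapping agreed on in the linked issue. Values without a match come back as `None` so they can be reviewed manually.

```python
# Hypothetical lookup table: keys are normalized raw values, values are the
# standardized labels to emit. The entries are placeholders for illustration.
TERM_MAP = {
    "atac-seq epigenetic profiling assay": "ATAC-seq",
    "example assay name": "Example standardized term",
}

def normalize(raw):
    """Lower-case and collapse whitespace so near-duplicates share one key."""
    return " ".join(str(raw).lower().split())

def map_technique(raw, term_map=TERM_MAP):
    """Return the mapped label, or None if the value still needs curation."""
    return term_map.get(normalize(raw))

raw_values = [
    "ATAC-seq  epigenetic profiling assay",  # extra whitespace on purpose
    "Example Assay Name",
    "Unknown assay",
]
mapped = {raw: map_technique(raw) for raw in raw_values}
unmapped = [raw for raw, label in mapped.items() if label is None]
print(mapped)
print(unmapped)
```

Keeping the unmapped values in a separate list makes it easy to see which terms still need an entry in the table.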

The repositories which will be handled by this notebook include:
- LINCS
- SRA

In [16]:
import os
import pandas as pd
import json
import requests
import math
import time  # needed for time.sleep() in the scroll loop

In [17]:
script_path = os.getcwd()
parent_path = os.path.abspath(os.path.join(script_path, os.pardir))
result_path = os.path.join(script_path,'results')

#### Generate MeasTechList for LINCS

In [3]:
%%time

## Perform the initial query

query_url = 'https://api-staging.data.niaid.nih.gov/v1/query?q=includedInDataCatalog.name:"LINCS"&fields=_id,measurementTechnique&fetch_all=true'
r = requests.get(query_url)
cleanr = json.loads(r.text)
hits = cleanr['hits']
print(len(cleanr['hits']))

424
CPU times: total: 141 ms
Wall time: 1.14 s


In [5]:
df1 = pd.DataFrame(cleanr['hits'])
total_hits = cleanr['total']
print(total_hits)
print(df1.head(n=2))

424
        _id       _ignored    _score  \
0  lds-1007  [all.keyword]  11.00529   
1  lds-1023  [all.keyword]  11.00529   

                                measurementTechnique  
0  {'description': 'Biochemical', 'name': 'KINOME...  
1  {'description': 'Biochemical', 'name': 'KINOME...  


In [9]:
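def popout_name(meastech):
    """Return the technique name(s) from a measurementTechnique value.

    A single dict yields a one-item list, a list of dicts yields one name
    per entry, and anything else (e.g. a missing value) is returned as-is.
    """
    measname = []
    if isinstance(meastech,dict):
        measname.append(meastech['name'])
    elif isinstance(meastech,list):
        for eachmeas in meastech:
            measname.append(eachmeas['name'])
    else:
        # Not a dict or list (for example NaN): pass the value through
        measname= meastech
    return measname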
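
To make the three branches concrete, here is a small self-contained check. The function body is repeated so the snippet runs on its own, and the sample records (`'assay A'`, `'assay B'`) are made-up values rather than real LINCS entries.

```python
def popout_name(meastech):
    # Same logic as the notebook cell above, repeated so the example runs alone.
    measname = []
    if isinstance(meastech, dict):
        measname.append(meastech['name'])
    elif isinstance(meastech, list):
        for eachmeas in meastech:
            measname.append(eachmeas['name'])
    else:
        measname = meastech
    return measname

# Made-up inputs covering the three shapes the function handles
single = popout_name({'description': 'Biochemical', 'name': 'assay A'})
multiple = popout_name([{'name': 'assay A'}, {'name': 'assay B'}])
missing = popout_name(None)

print(single)    # ['assay A']
print(multiple)  # ['assay A', 'assay B']
print(missing)   # None
```

Because dicts and lists both come back as lists, a later `explode` call produces one row per technique name, while missing values stay as a single row.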

In [10]:
df1['measname'] = df1['measurementTechnique'].apply(popout_name)
df2 = df1.explode('measname')

In [15]:
frequency_df = df2.groupby('measname').size().reset_index(name='counts')
print(len(frequency_df))
print(frequency_df.head(n=2))
frequency_df.to_csv(os.path.join(result_path,'LINCS_freq.tsv'),sep='\t',header=True)

32
                                            measname  counts
0                ATAC-seq epigenetic profiling assay       4
1  Aggregated small molecule biochemical target a...       1


#### Pull measTech for SRA

Note: the measurementTechnique data for SRA does not appear to be parsed/crawled at the moment.

In [None]:
%%time

## Perform the initial query

query_url = 'https://api-staging.data.niaid.nih.gov/v1/query?q=includedInDataCatalog.name:"NCBI+SRA"&fields=_id,measurementTechnique&fetch_all=true'
r = requests.get(query_url)
cleanr = json.loads(r.text)
hits = cleanr['hits']
print(len(cleanr['hits']))

In [None]:
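# Start from the first page returned by the SRA query above
df1 = pd.DataFrame(cleanr['hits'])
scroll_id = cleanr['_scroll_id']
i = 0
k = 3  # number of additional pages to fetch while testing
#k = math.ceil(cleanr['total']/1000)
while i < k:
    #r2 = requests.get(f'https://api.data.niaid.nih.gov/v1/query?scroll_id={scroll_id}')
    r2 = requests.get(f'https://api-staging.data.niaid.nih.gov/v1/query?scroll_id={scroll_id}')
    tmp = json.loads(r2.text)
    if not tmp.get('hits'):
        break  # no further pages to scroll through
    scroll_id = tmp['_scroll_id']
    tmpdf = pd.DataFrame(tmp['hits'])
    df1 = pd.concat((df1,tmpdf),ignore_index=True)
    #print(len(df1))
    i = i+1
    time.sleep(0.25)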
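
The scroll loop above can also be written as a reusable helper. The sketch below is one possible shape, not the notebook's actual implementation: `collect_hits` and its `fetch_page` argument are hypothetical names, and the API is replaced with a dictionary of fake pages so the example runs without network access. It assumes, as the loop does, that each page carries a `hits` list and a `_scroll_id` for the next request.

```python
def collect_hits(first_page, fetch_page, max_pages=3):
    """Gather hits from a first response plus up to max_pages scroll pages.

    fetch_page is any callable that takes a scroll id and returns the parsed
    JSON of the next page (an assumption, so the sketch needs no network).
    """
    hits = list(first_page.get('hits', []))
    scroll_id = first_page.get('_scroll_id')
    pages = 0
    while scroll_id and pages < max_pages:
        page = fetch_page(scroll_id)
        page_hits = page.get('hits', [])
        if not page_hits:
            break  # nothing left to scroll through
        hits.extend(page_hits)
        scroll_id = page.get('_scroll_id', scroll_id)
        pages += 1
    return hits

# Toy stand-in for the API: two further pages, then an empty one.
fake_pages = {
    's1': {'_scroll_id': 's2', 'hits': [{'_id': 'b'}]},
    's2': {'_scroll_id': 's3', 'hits': [{'_id': 'c'}]},
    's3': {'hits': []},
}
first = {'_scroll_id': 's1', 'total': 3, 'hits': [{'_id': 'a'}]}

all_hits = collect_hits(first, fake_pages.__getitem__)
print([h['_id'] for h in all_hits])  # ['a', 'b', 'c']
```

In the notebook, `fetch_page` could be a small function that requests the `scroll_id` URL used in the loop and returns the parsed JSON, with the same short sleep between calls.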