## Mapping Enum(ish) Measurement Technique values

Some repositories provide fairly consistent measurement technique values. For example, NCBI GEO uses measurementTechnique terms consistently. For these repositories, NLP extraction of the measurement techniques is unnecessary; we only need to map the existing values.

This notebook extracts the measurement technique values from repositories whose values are mostly consistent, so that those values can be mapped. See the corresponding GitHub issue: https://github.com/NIAID-Data-Ecosystem/nde-crawlers/issues/157

The repositories which will be handled by this notebook include:
- LINCS
- SRA
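
The mapping step described above can be sketched as a simple dictionary lookup. This is a minimal illustration only: `TERM_MAP`, `map_terms`, and every term in the table are hypothetical and are not the project's actual mapping.

```python
import pandas as pd

# Hypothetical lookup table: raw repository value -> standardized term.
# These entries are illustrative only, not the project's actual mapping.
TERM_MAP = {
    "rna-seq": "RNA sequencing",
    "rna seq": "RNA sequencing",
    "mass spec": "Mass spectrometry",
}

def map_terms(raw_terms: pd.Series, term_map: dict) -> pd.Series:
    """Normalize case and whitespace, then look each value up in term_map.

    Values with no entry in term_map come back as NaN so they can be
    reviewed by hand.
    """
    normalized = raw_terms.str.strip().str.lower()
    return normalized.map(term_map)

raw = pd.Series(["RNA-Seq", " rna seq ", "Mass Spec", "Unknown assay"])
mapped = map_terms(raw, TERM_MAP)
print(mapped.tolist())
```

Leaving unmapped values as NaN, rather than passing them through unchanged, makes it easy to list the terms that still need a manual decision.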

In [None]:
import os
import pandas as pd
import json
import requests
import math

In [None]:
script_path = os.getcwd()
parent_path = os.path.abspath(os.path.join(script_path, os.pardir))
result_path = os.path.join(script_path,'results')
os.makedirs(result_path, exist_ok=True)  # to_csv fails later if this folder is missing

#### Generate MeasTechList for LINCS and other smaller databases (<1000 hits)

In [None]:
repos = ["LINCS", "RADx+Data+Hub"]
repo_name = repos[1]

In [None]:
%%time

## Perform the initial query

query_url = f'https://api-staging.data.niaid.nih.gov/v1/query?q=includedInDataCatalog.name:"{repo_name}"&fields=_id,measurementTechnique&fetch_all=true'
r = requests.get(query_url)
cleanr = json.loads(r.text)
hits = cleanr['hits']
print(len(cleanr['hits']))

In [None]:
df1 = pd.DataFrame(cleanr['hits'])
total_hits = cleanr['total']
print(total_hits)
print(df1.head(n=2))

In [None]:
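def popout_name(meastech):
    """Return the technique name(s) held in a measurementTechnique value.

    The value may be a single dict, a list of dicts, or missing (NaN).
    """
    measname = []
    if isinstance(meastech,dict):
        measname.append(meastech['name'])
    elif isinstance(meastech,list):
        for eachmeas in meastech:
            measname.append(eachmeas['name'])
    else:
        # Missing or scalar values are passed through unchanged
        measname = meastech
    return measname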

In [None]:
df1['measname'] = df1.apply(lambda row: popout_name(row['measurementTechnique']),axis=1)
df2 = df1.explode('measname')

In [None]:
frequency_df = df2.groupby('measname').size().reset_index(name='counts')
print(len(frequency_df))
print(frequency_df.head(n=2))
frequency_df.to_csv(os.path.join(result_path,f'{repo_name}_freq.tsv'),sep='\t',header=True)
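
To make the explode-and-count step easier to follow, here is a small self-contained sketch on toy data. The records and technique names are made up; it only shows how a list-valued column is flattened and how rows with no technique drop out of the counts.

```python
import numpy as np
import pandas as pd

# Toy records (made up): one with two techniques, one with one, one with none
toy = pd.DataFrame({
    "_id": ["rec1", "rec2", "rec3"],
    "measname": [["assay A", "assay B"], ["assay A"], np.nan],
})

# explode() gives each list element its own row; the NaN row is kept as-is
exploded = toy.explode("measname")

# groupby() skips NaN keys by default, so rec3 does not appear in the counts
freq = exploded.groupby("measname").size().reset_index(name="counts")
print(freq)
```

Because `groupby` drops NaN keys by default, records without a measurement technique are not counted; passing `dropna=False` to `groupby` would keep them as their own group.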

#### Pull measTech for SRA, BioStudies, and other repos with lots of records

Note: the measurementTechnique data for SRA does not currently appear to be parsed/crawled.

In [None]:
repos = ["BioStudies", "NCBI+SRA", "NICHD+DASH","The+Database+of+Genotypes+and+Phenotypes"]
repo_name = repos[3]
print(repo_name)

In [None]:
%%time

## Perform the initial query

query_url = f'https://api-staging.data.niaid.nih.gov/v1/query?q=includedInDataCatalog.name:"{repo_name}"&fields=_id,measurementTechnique&fetch_all=true'
print(query_url)
r = requests.get(query_url)
cleanr = json.loads(r.text)
hits = cleanr['hits']
print(len(cleanr['hits']))
scroll_id = cleanr['_scroll_id']
total_hits = cleanr['total']
df1 = pd.DataFrame(cleanr['hits'])

In [None]:
# Page through the remaining results using the scroll_id from the initial query
#k = 3
k = math.ceil(total_hits/500)
for i in range(k):
    try:
        #r2 = requests.get(f'https://api.data.niaid.nih.gov/v1/query?scroll_id={scroll_id}')
        r2 = requests.get(f'https://api-staging.data.niaid.nih.gov/v1/query?scroll_id={scroll_id}')
        tmp = json.loads(r2.text)
        scroll_id = tmp['_scroll_id']
        tmpdf = pd.DataFrame(tmp['hits'])
        df1 = pd.concat((df1,tmpdf),ignore_index=True)
        #print(len(df1))
    except Exception as err:
        # Report the failure and continue with the next attempt
        print("attempt ", i, " failed: ", err)

df1['measList'] = df1.apply(lambda row: popout_name(row['measurementTechnique']), axis=1)
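# errors='ignore' avoids a KeyError when a column (e.g. '_ignored') is absent from the results
df2 = df1.drop(columns=['_score','measurementTechnique','_ignored'], errors='ignore').copy()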
print(df2.head(n=2))
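
As an alternative to computing the number of pages up front, the scroll loop could stop as soon as a page comes back without hits. The sketch below is hypothetical: `iter_scroll_pages` and `fetch_page` are illustrative names, and it assumes the API signals the end of a scroll with an empty or missing `hits` list, which should be checked against the API's actual behavior. The fetch function is passed in so the logic can be tried without network access.

```python
from typing import Callable, Iterator

def iter_scroll_pages(first_page: dict,
                      fetch_page: Callable[[str], dict],
                      max_pages: int = 10000) -> Iterator[list]:
    """Yield the hits of each page, starting with the initial response.

    Stops when a page has no hits or no '_scroll_id' (assumed end-of-scroll
    signals), or after max_pages pages as a safety limit.
    """
    page = first_page
    for _ in range(max_pages):
        hits = page.get("hits")
        if not hits:
            return
        yield hits
        scroll_id = page.get("_scroll_id")
        if scroll_id is None:
            return
        page = fetch_page(scroll_id)

# Stand-in for the API: maps a scroll_id to the page it would return
fake_pages = {
    "s1": {"hits": [{"_id": "c"}], "_scroll_id": "s2"},
    "s2": {"hits": []},
}
first = {"hits": [{"_id": "a"}, {"_id": "b"}], "_scroll_id": "s1"}

all_hits = [hit
            for hits in iter_scroll_pages(first, fake_pages.__getitem__)
            for hit in hits]
print(len(all_hits))
```

In the notebook, `fetch_page` could wrap the same `requests.get` call on the `scroll_id` URL used above, and `pd.DataFrame(all_hits)` would then take the place of the repeated `pd.concat` calls.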

In [None]:
df3 = df2.explode('measList')
print(len(df3))
df4 = df3.groupby('measList').size().reset_index(name='Counts')
df4.rename(columns={"measList":"measurementTechnique"}, inplace=True)
df4.sort_values('Counts',ascending=False, inplace=True)
df4.to_csv(os.path.join(result_path,f'{repo_name}_freq.tsv'),sep='\t',header=True)