# Investigating duplication in the DDE for de-duplication

Many records submitted via various portals in the Data Discovery Engine (DDE) describe data that were also submitted to some other repository, which means that there may be duplication between the DDE and other repositories. This notebook estimates how much overlap there actually is, which will help us determine the best path forward.



In [1]:
import os
import requests
import json
import pandas as pd

In [48]:
script_path = os.getcwd()
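result_path = os.path.join(script_path,'results')
# Make sure the output folder exists before the to_csv calls further down
os.makedirs(result_path, exist_ok=True)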
p1_path = os.path.abspath(os.path.join(script_path, os.pardir))
parent_path = os.path.abspath(os.path.join(p1_path, os.pardir))
correction_path = os.path.join(parent_path,'nde-metadata-corrections','correction_lists')
print(correction_path)

C:\Users\gtsueng\Anaconda3\envs\nde\nde-metadata-corrections\correction_lists


In [6]:
sysbioAPI= "https://api-staging.data.niaid.nih.gov/v1/query?&q=_exists_%3AsourceOrganization.name+AND+includedInDataCatalog.name%3A%22Data+Discovery+Engine%22&fields=sourceOrganization,identifier&size=500"
r = requests.get(sysbioAPI)
jr = json.loads(r.text)
print(jr.keys())

dict_keys(['took', 'total', 'max_score', 'hits'])


In [17]:
results = pd.DataFrame(jr['hits'])
results['sourceOrganization.name'] = [x[0]['name'] for x in results['sourceOrganization']]
clean_results = results.drop(['_ignored','_score','sourceOrganization'],axis=1)
print(clean_results.head(n=2))

                    _id   identifier sourceOrganization.name
0  dde_026a2652e9c13d8f    GSE168215   NIAID Systems Biology
1  dde_046fd968814fd76f  PRJNA423123   NIAID Systems Biology


In [19]:
sdPubFreq = clean_results.groupby('sourceOrganization.name').size().reset_index(name="counts")
sdPubFreq.sort_values(by="counts",ascending=False,inplace=True)
print(sdPubFreq)

  sourceOrganization.name  counts
1   NIAID Systems Biology     368
0     NIAID CREID Network      12


In [20]:
sdPubFreq.to_csv(os.path.join(result_path,'program_frequency.tsv'),sep='\t',header=True)
clean_results.to_csv(os.path.join(result_path,'program_id.tsv'),sep='\t', header=True)

In [26]:
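# Lowercase the identifiers: the `_id` values they are compared against below
# (e.g. gse168215) are lowercase
identifierlist = [x.lower() for x in clean_results['identifier']]

id2check = identifierlist[0]
print(id2check)

found = []      # identifiers with an exact `_id` match in the production API
not_found = []  # identifiers whose API response had no 'hits' key

for eachid in identifierlist:
    raw = requests.get(f"https://api.data.niaid.nih.gov/v1/query?q={eachid}&fields=_id")
    temp = json.loads(raw.text)
    try:
        hitlist = temp['hits']
        for eachhit in hitlist:
            if eachhit['_id'] == eachid:
                found.append(eachhit['_id'])
    except KeyError:
        # Only responses without the expected keys land here; identifiers that
        # return hits but no exact match are picked up in all_missing below
        not_found.append(eachid)

# Everything without an exact match, whether the lookup failed or only returned other hits
all_missing = list(set(not_found).union(set([x for x in identifierlist if x not in found])))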

print(len(found), len(not_found))

gse168215
220 8


In [34]:
found_records = pd.DataFrame({'lower_id': found})
found_records['url'] = 'https://data.niaid.nih.gov/resources?id=' + found_records['lower_id'].astype(str)
print(found_records.head(n=2))

      lower_id                                                url
0    gse168215  https://data.niaid.nih.gov/resources?id=gse168215
1  prjna423123  https://data.niaid.nih.gov/resources?id=prjna4...


In [37]:
clean_results['lower_id'] = [x.lower() for x in clean_results['identifier']]
clean_results['dde_url'] = 'https://data.niaid.nih.gov/resources?id=' + clean_results['_id'].astype(str)
print(clean_results.head(n=2))

                    _id   identifier sourceOrganization.name     lower_id  \
0  dde_026a2652e9c13d8f    GSE168215   NIAID Systems Biology    gse168215   
1  dde_046fd968814fd76f  PRJNA423123   NIAID Systems Biology  prjna423123   

                                             dde_url  
0  https://data.niaid.nih.gov/resources?id=dde_02...  
1  https://data.niaid.nih.gov/resources?id=dde_04...  


In [41]:
matched = clean_results.merge(found_records,on='lower_id',how='left').fillna('not found')


In [50]:
clean_matches = matched.loc[matched['url']!='not found']
clean_matches.to_csv(os.path.join(result_path,'dde_dups_found.tsv'),sep='\t',header=True)
id_url_1 = clean_matches[['_id','url']].copy()
id_url_2 = clean_matches[['lower_id','dde_url']].copy()
id_url_1.rename(columns={'url':'sameAs'},inplace=True)
id_url_2.rename(columns={'lower_id':'_id','dde_url':'sameAs'}, inplace=True)
ready_to_use = pd.concat((id_url_1,id_url_2),ignore_index=True)
ready_to_use.to_csv(os.path.join(correction_path,'auto_dde_dedup_list.tsv'),sep='\t',header=True)

missing = matched.loc[matched['url']=='not found']
missing.to_csv(os.path.join(result_path,'dde_not_found.tsv'),sep='\t',header=True)
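
As a rough way to see which repositories the unmatched identifiers might come from, the identifiers could be bucketed by prefix. The sketch below is only illustrative: `ID_PATTERNS` and `guess_repository` are hypothetical names, and the patterns are assumptions based solely on the identifier shapes that appear in this notebook, so anything unusual falls through to `unknown`.

```python
import re

# Hypothetical prefix patterns, based only on identifier examples seen in this notebook
ID_PATTERNS = [
    ("NCBI GEO", re.compile(r"^gse\d+$")),
    ("NCBI BioProject", re.compile(r"^prjna\d+$")),
    ("MassIVE", re.compile(r"^msv\d+$")),
    ("ProteomeXchange/PRIDE", re.compile(r"^pxd\d+$")),
    ("DOI", re.compile(r"^10\.\d{4,9}/\S+$")),
]


def guess_repository(identifier):
    """Return a best-guess repository label for an identifier string."""
    normalized = identifier.strip().lower()
    for label, pattern in ID_PATTERNS:
        if pattern.match(normalized):
            return label
    # Nothing matched: leave the identifier for manual review
    return "unknown"
```

Under these assumptions, something like `missing['lower_id'].map(guess_repository).value_counts()` would give a quick count per guessed repository.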

### Observations
1. There are one-to-many mapping issues, since one DDE entry may have multiple associated ids from other sources.
  * Example: https://data.niaid.nih.gov/resources?id=dde_5a3ffacf5c48e713 maps to both:
    * https://data.niaid.nih.gov/resources?id=gse72008 and
    * https://data.niaid.nih.gov/resources?id=gse71759

2. There are identifiers for records that simply have not been ingested into the system yet.
  * Examples:
    * https://data.niaid.nih.gov/resources?id=dde_f8b7126fb0a4d01d has id=msv000081916, which is a MassIVE id (in Staging, not in Production)
    * https://data.niaid.nih.gov/resources?id=dde_897dc290b8b6574a has id=10.7303/syn6115677, which resolves to Synapse.org
    * https://data.niaid.nih.gov/resources?id=dde_145239e4ad606459 has id=genbank_mf782680, which belongs to GenBank
    * id=qbca00000000 is an identifier from NCBI Nucleotide
    * id=pxd007774 is an identifier that can resolve via identifiers.org to the PRoteomics IDEntifications (PRIDE) database

3. Potential approaches for de-duplication:
  * Use `sameAs` to link the related records: they will continue to exist separately, but will link to one another
    * Pros:
      * Straightforward to do
      * Easy to implement, in spite of one-to-many mappings
      * No issues with messy merging
      * No issues with conflicting identifiers between repo and DDE
    * Cons:
      * Duplicate records
      * Source filter for NIAID SysBio in NDE won't show breakdown of where the records are additionally stored
  * Change the `_id` of the DDE record so that the metadata can be merged with the appropriate record
    * Pros:
      * Single record with slightly more complete metadata (NIAID SysBio is a minimal schema, so improvement won't necessarily be drastic)
      * Source filter for NIAID SysBio in NDE can show breakdown of where the records are additionally stored
    * Cons:
      * Issues with one-to-many mappings
      * Issues with metadata merging/resolution (which source to prioritize? how to reduce duplication between sources?)
      * Issues with conflicting identifiers between repo and DDE
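
To illustrate why the `sameAs` approach tolerates one-to-many mappings, here is a minimal sketch that collapses (DDE `_id`, repository url) pairs into one row per DDE record and flags the records that point to more than one target. The function name `build_sameas_table` and the `n_targets` column are hypothetical, the input is assumed to have `_id` and `url` columns like `clean_matches`, and the toy rows simply reuse the example URLs from observation 1.

```python
import pandas as pd


def build_sameas_table(matches):
    """Collapse (_id, url) pairs into one row per DDE record.

    `matches` is assumed to have the columns `_id` and `url`.
    """
    grouped = (
        matches.groupby("_id")["url"]
        .apply(lambda urls: sorted(set(urls)))  # unique targets per DDE record
        .reset_index(name="sameAs")
    )
    # Number of distinct repository records each DDE record points to
    grouped["n_targets"] = grouped["sameAs"].str.len()
    return grouped


# Toy data: one DDE record with two targets, one with a single target
toy = pd.DataFrame({
    "_id": [
        "dde_5a3ffacf5c48e713",
        "dde_5a3ffacf5c48e713",
        "dde_026a2652e9c13d8f",
    ],
    "url": [
        "https://data.niaid.nih.gov/resources?id=gse72008",
        "https://data.niaid.nih.gov/resources?id=gse71759",
        "https://data.niaid.nih.gov/resources?id=gse168215",
    ],
})

sameas = build_sameas_table(toy)
one_to_many = sameas[sameas["n_targets"] > 1]
print(one_to_many)
```

Because `sameAs` holds a list per record here, it would need to be serialized (for example, joined with a delimiter) before being written to a TSV; how the downstream correction pipeline expects multiple values is not something this sketch assumes.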
