# Investigating duplication issues in the DDE for de-duplication

Many records submitted via the various portals in the DDE describe data that was also submitted to some other repository, so records may be duplicated between the DDE and those repositories. This notebook estimates how much overlap there actually is, which will help us determine the best path forward.



In [1]:
import os
import requests
import json
import pandas as pd

In [23]:
# Query staging for DDE NIAID Systems Biology records that declare an sdPublisher.
sysbioAPI = "https://api-staging.data.niaid.nih.gov/v1/query?&q=_exists_%3AsdPublisher.name+AND+includedInDataCatalog.name%3A%22Data+Discovery+Engine%2C+NIAID+Systems+Biology%22&fields=sdPublisher,identifier&size=500"
r = requests.get(sysbioAPI)
r.raise_for_status()  # stop early on an HTTP error
jr = json.loads(r.text)
print(jr.keys())

dict_keys(['took', 'total', 'max_score', 'hits'])
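
Because the query requests `size=500`, it is worth confirming that the response was not cut off. A minimal sketch, assuming `total` reports the full number of matches for the query; the helper name `is_truncated` is hypothetical and not part of the original notebook:

```python
def is_truncated(response: dict) -> bool:
    """Return True when the API reports more matches than it returned."""
    return response.get("total", 0) > len(response.get("hits", []))


# Stand-in response with the same top-level keys as `jr` (values are made up).
example = {"took": 3, "total": 3, "max_score": 1.0, "hits": [{"_id": "a"}, {"_id": "b"}]}
print(is_truncated(example))
```

In the notebook this would be called as `is_truncated(jr)` right after the request.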


In [24]:
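results = pd.DataFrame(jr['hits'])
# Flatten sdPublisher to a plain name. This string replacement only cleans values of the
# form {'name': '...'}; lists and dicts with extra keys are left as raw strings.
results['sdPublisher.name'] = [str(x).replace("{'name': '", "").replace("'}", "") for x in results['sdPublisher']]
clean_results = results.drop(['_ignored', '_score', 'sdPublisher'], axis=1)
print(clean_results.head(n=2))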

                    _id   identifier sdPublisher.name
0  dde_66c0ab81b12cff8c  PRJNA521559              SRA
1  dde_046fd968814fd76f  PRJNA423123              SRA


In [28]:
sdPubFreq = clean_results.groupby('sdPublisher.name').size().reset_index(name="counts")
sdPubFreq.sort_values(by="counts",ascending=False,inplace=True)
print(sdPubFreq)

                                     sdPublisher.name  counts
10                                            MassIVE     110
2                                                 GEO      89
17                                                SRA      63
14                                               NCBI      26
3                                            GEO/NCBI      10
23  {'@type': 'Organization', 'name': 'CViSB Data ...       9
18                                            Synapse       6
5                                             GenBank       4
15                                              PRIDE       3
20  [{'@type': 'Organization', 'name': 'CViSB Data...       2
1                                            Figshare       2
0                               EMBL-EBI and Mendeley       2
7                              MTB Network Portal/GEO       2
25        {'@type': 'Organization', 'name': 'Figshare       2
21  [{'@type': 'Organization', 'name': 'GenBank, {...       1
24      
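
The bracketed and braced entries in the table above suggest that `sdPublisher` is sometimes a list of organizations, or a dict with extra keys such as `@type`, which the string replacement does not handle. A minimal sketch of a more tolerant extraction, assuming each value is a dict, a list of dicts, or a plain string; the helper `publisher_names` is hypothetical and the sample values are illustrative:

```python
def publisher_names(value):
    """Return a list of publisher names from a dict, a list of dicts, or a string."""
    if isinstance(value, dict):
        name = value.get("name")
        return [name] if name else []
    if isinstance(value, (list, tuple)):
        names = []
        for item in value:
            names.extend(publisher_names(item))
        return names
    if isinstance(value, str):
        return [value]
    return []  # None, NaN, or anything unexpected


# Illustrative inputs covering the three shapes seen in the output above.
samples = [
    {"name": "SRA"},
    {"@type": "Organization", "name": "Figshare"},
    [{"@type": "Organization", "name": "GenBank"}, {"name": "GEO"}],
]
for sample in samples:
    print(publisher_names(sample))
```

Applied to the notebook's data, this could be used as `results['sdPublisher'].apply(publisher_names)` followed by `explode()`, so that each publisher name is counted on its own row.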

In [31]:
sdPubFreq.to_csv('sysbio_sdPublisher_frequency.tsv',sep='\t',header=True)
clean_results.to_csv('sysbio_sdPublisher_id.tsv',sep='\t', header=True)

In [38]:
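# Lower-case the identifiers so they can be compared with production `_id` values.
identifierlist = [x.lower() for x in clean_results['identifier']]

id2check = identifierlist[0]
print(id2check)

found = []      # identifiers with a production record whose _id matches exactly
not_found = []  # identifiers whose response had no usable 'hits' entry

for eachid in identifierlist:
    raw = requests.get(f"https://api.data.niaid.nih.gov/v1/query?q={eachid}&fields=_id")
    temp = json.loads(raw.text)
    try:
        hitlist = temp['hits']
        for eachhit in hitlist:
            if eachhit['_id'] == eachid:
                found.append(eachid)
    except (KeyError, TypeError):
        not_found.append(eachid)

# Identifiers that returned hits but no exact _id match also count as missing.
found_set = set(found)
all_missing = list(set(not_found).union(x for x in identifierlist if x not in found_set))

print(len(found), len(not_found))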

prjna521559
214 6


In [39]:
print(all_missing)

['prjna307992_michigan', 'msv000081916', 'msv000079164', 'msv000081047', 'prjna638887', 'prjeb23261', 'msv000081921', 'srp148607', 'prjna238042', 'srp157243', '10.7303/syn6115677', 'gse122960\xa0', 'prjna422941', 'srp156489', 'dengue', 'msv000081920', 'jabetn00000000', 'cvisb-ebola-virus-seq', 'srp156948', 'msv000080896', 'msv000081845', 'srx2039176', 'prjna430883', 'prjna415307', 'msv000082344', 'prjna527265', 'sarscov2-virus-seq', 'msv000081918', 'msv000081894', '10.7303/syn6114189', 'growth & fitness data for prjna750080', 'gse104154\xa0', '10.25739/40y5-ce29', 'cvisb_30788396', 'rtpcr-vhf', 'gse89931', '\xa0prjna418452', 'prjeb24929', 'msv000081783', 'blood-chemistry-vhf', 'genbank_mf782679', 'msv000081889', 'msv000081892', 'msv000081930', 'phs002245.v1.p1', 'niaid natural history studies metadata', 'erp012810', 'msv000083221', 'gen bank cp022524', '10.7303/syn4935562', 'blood-counts-vhf', 'pxd026302', 'msv000080902', 'syn12179188', 'rapid-diagnostics-vhf', 'prjna384621', 'srp15785
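
Several of the missing identifiers above carry non-breaking spaces (`\xa0`) or contain spaces and `&`, so some of them may fail to match for formatting reasons rather than because the record is absent. A minimal sketch of normalizing an identifier and percent-encoding it before querying; the helpers `normalize_identifier` and `build_query_url` are hypothetical, and whether the API then matches the cleaned value is an assumption:

```python
from urllib.parse import quote

API_URL = "https://api.data.niaid.nih.gov/v1/query"


def normalize_identifier(identifier: str) -> str:
    """Lower-case and trim surrounding whitespace, including non-breaking spaces."""
    return identifier.replace("\xa0", " ").strip().lower()


def build_query_url(identifier: str) -> str:
    """Build the query URL with the normalized identifier percent-encoded."""
    encoded = quote(normalize_identifier(identifier), safe="")
    return f"{API_URL}?q={encoded}&fields=_id"


print(repr(normalize_identifier("gse122960\xa0")))
print(build_query_url("growth & fitness data for prjna750080"))
```

Free-text values such as `dengue` would still need manual review, since they are not repository identifiers at all.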

### Observations
1. There are one-to-many mapping issues, as one DDE entry may have multiple associated IDs from other sources.
  * Example: https://data.niaid.nih.gov/resources?id=dde_5a3ffacf5c48e713 maps to both:
    * https://data.niaid.nih.gov/resources?id=gse72008 and
    * https://data.niaid.nih.gov/resources?id=gse71759

2. There are identifiers for records that simply have not been ingested into the system yet.
  * Examples:
    * https://data.niaid.nih.gov/resources?id=dde_f8b7126fb0a4d01d has id=msv000081916 which is a MassIVE id (in Staging, not in Production)
    * https://data.niaid.nih.gov/resources?id=dde_897dc290b8b6574a has id=10.7303/syn6115677 which resolves to Synapse.org
    * https://data.niaid.nih.gov/resources?id=dde_145239e4ad606459 has id=genbank_mf782680 which belongs to GenBank
    * id=qbca00000000 is an identifier from NCBI Nucleotide
    * id=pxd007774 is an identifier that resolves via identifiers.org to the PRoteomics IDEntifications (PRIDE) database

3. Potential approaches for de-duplication:
  * use `sameAs` to link the related records -- they will continue to exist separately, but will link to one another
    * Pros:
      * straightforward to do
      * easy to implement, in spite of one-to-many mappings
      * no issues with messy merging
      * no issues with conflicting identifiers between repo and dde
    * Cons:
      * Duplicate records
      * Source filter for NIAID SysBio in NDE won't show breakdown of where the records are additionally stored
  * change the `_id` for the dde record so that the metadata can be merged with the appropriate record
    * Pros:
      * Single record with slightly more complete metadata (NIAID SysBio is a minimal schema, so improvement won't necessarily be drastic)
      * Source filter for NIAID SysBio in NDE can show breakdown of where the records are additionally stored
    * Cons:
      * Issues with one-to-many mappings
      * Issues with metadata merging/resolution (which source to prioritize? how to reduce duplication between sources?)
      * Issues with conflicting identifiers between repo and dde
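
To make the first approach concrete, the sketch below attaches `sameAs` links to a DDE record while leaving the record itself otherwise untouched, which is why one-to-many mappings pose no difficulty there. The record layout, the `add_same_as` helper, and the use of portal URLs as `sameAs` values are assumptions for illustration, not the NDE implementation:

```python
PORTAL_URL = "https://data.niaid.nih.gov/resources?id="


def add_same_as(record: dict, matched_ids: list) -> dict:
    """Return a copy of `record` with a sorted, de-duplicated `sameAs` list."""
    existing = record.get("sameAs", [])
    if isinstance(existing, str):
        existing = [existing]
    links = set(existing) | {PORTAL_URL + match.lower() for match in matched_ids}
    return {**record, "sameAs": sorted(links)}


# The one-to-many example from observation 1: one DDE entry, two GEO records.
dde_record = {"_id": "dde_5a3ffacf5c48e713"}
linked = add_same_as(dde_record, ["GSE72008", "GSE71759"])
print(linked["sameAs"])
```

The second approach has no equally simple sketch: changing the `_id` forces a choice of a single target record and a rule for resolving conflicting fields, which is where the listed cons come from.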
