## Duplicate record check

Determining the extent of the duplicate records issue

Zenodo versioning duplicates
1. Pull name and id fields for 50,000 Zenodo records
2. Check for duplicate names on unique ids
3. Calculate rate of duplication

OmicsDI/GEO duplicates
1. Pull name and id fields for 1,000 GEO records
2. Search OmicsDI for matching names

Zenodo/Dryad duplicates
See OmicsDI/GEO duplicates
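
The Zenodo steps in the outline above can be sketched on toy data as follows. The records and the `duplicate_name_rate` helper are hypothetical and only illustrate the idea; the helper counts records that share a name, which is a slightly different metric from the `% dups` value computed later in this notebook.

```python
import pandas as pd

def duplicate_name_rate(df):
    """Return the fraction of unique-id records whose name is shared with another record."""
    # Keep one row per id so that repeated pages do not inflate the count
    unique_ids = df.drop_duplicates(subset="_id")
    # Mark every row whose name appears more than once
    shared_name = unique_ids.duplicated(subset="name", keep=False)
    return shared_name.sum() / len(unique_ids)

# Made-up records for illustration only
toy = pd.DataFrame({
    "_id": ["zenodo_1", "zenodo_2", "zenodo_3", "zenodo_4"],
    "name": ["Dataset A", "Dataset A", "Dataset B", "Dataset C"],
})
rate = duplicate_name_rate(toy)
print(f"{rate:.0%} of records share a name with another record")
```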



In [10]:
import json
import requests
import pandas as pd
import time
from datetime import datetime

In [None]:
%%time
r = requests.get('https://api.data.niaid.nih.gov/v1/query?q=includedInDataCatalog.name:"Zenodo"&fields=name&fetch_all=true')
cleanr = json.loads(r.text)
hits = cleanr['hits']
#print(len(cleanr['hits']))
df1 = pd.DataFrame(cleanr['hits'])
scroll_id = cleanr['_scroll_id']

In [None]:
%%time
i = 0
while i < 10:
    r2 = requests.get(f'https://api.data.niaid.nih.gov/v1/query?scroll_id={scroll_id}')
    tmp = json.loads(r2.text)
    scroll_id = tmp['_scroll_id']
    tmpdf = pd.DataFrame(tmp['hits'])
    df1 = pd.concat((df1,tmpdf),ignore_index=True)
    print(len(df1))
    i = i+1
    time.sleep(0.5)
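
The fixed-count loop above could also be written as a generator that stops on its own. This is a minimal sketch under two assumptions that the notebook does not confirm: that the API signals the end of a scroll with an empty `hits` list, and that a missing `_scroll_id` also means there is nothing left to fetch. `scroll_hits` and `fake_fetch` are hypothetical names; `fake_fetch` stands in for the real HTTP call so the sketch runs offline.

```python
def scroll_hits(fetch_page, max_pages=None):
    """Yield hits page by page until a page is empty or max_pages is reached.

    fetch_page(scroll_id) must return a dict with 'hits' and, optionally, '_scroll_id'.
    """
    scroll_id = None
    pages = 0
    while max_pages is None or pages < max_pages:
        page = fetch_page(scroll_id)
        hits = page.get("hits", [])
        if not hits:
            break  # assumed end-of-scroll signal
        yield from hits
        scroll_id = page.get("_scroll_id")
        pages += 1
        if scroll_id is None:
            break

# Stand-in for the real API: three pages, the last one empty
_fake_pages = {
    None: {"hits": [{"_id": "a"}, {"_id": "b"}], "_scroll_id": "s1"},
    "s1": {"hits": [{"_id": "c"}], "_scroll_id": "s2"},
    "s2": {"hits": []},
}

def fake_fetch(scroll_id):
    return _fake_pages[scroll_id]

all_hits = list(scroll_hits(fake_fetch))
print(len(all_hits))
```

With the real API, `fetch_page` would wrap the `requests.get` calls used above and keep the `time.sleep` pause between pages.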

In [None]:
## Check for replicated records (id and name)

check_for_reps = df1.groupby(['_id','name']).size().reset_index(name='counts')
replicates = check_for_reps.loc[check_for_reps['counts']>1]
nonreps = check_for_reps.loc[check_for_reps['counts']==1]
print("original length: ",len(df1)," replicates: ",len(replicates))

## Check for duplicate/version records (name only)
check_for_dups = nonreps.groupby(['name']).size().reset_index(name='dup_counts')
duplicates = check_for_dups.loc[check_for_dups['dup_counts']>1]
nondups = check_for_dups.loc[check_for_dups['dup_counts']==1]

## Stats
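## "% dups" is relative to the non-replicated records; dividing by len(replicates) fails when there are no replicates
{"samples":len(df1),"replicates":len(replicates),"duplicates":len(duplicates),"% dups":len(duplicates)/len(nonreps)*100}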


In [11]:
def fetch_zenodo_records(record_limit):
    ## record_limit is the number of scroll pages requested after the initial
    ## query, not a count of individual records
    r = requests.get('https://api.data.niaid.nih.gov/v1/query?q=includedInDataCatalog.name:"Zenodo"&fields=name&fetch_all=true')
    r.raise_for_status()
    cleanr = r.json()
    df1 = pd.DataFrame(cleanr['hits'])
    scroll_id = cleanr['_scroll_id']
    i = 0
    while i < record_limit:
        r2 = requests.get(f'https://api.data.niaid.nih.gov/v1/query?scroll_id={scroll_id}')
        r2.raise_for_status()
        tmp = r2.json()
        scroll_id = tmp['_scroll_id']
        tmpdf = pd.DataFrame(tmp['hits'])
        df1 = pd.concat((df1,tmpdf),ignore_index=True)
        i = i+1
        time.sleep(0.5)
    return df1

def check_dups(df1):
    check_for_reps = df1.groupby(['_id','name']).size().reset_index(name='counts')
    replicates = check_for_reps.loc[check_for_reps['counts']>1]
    nonreps = check_for_reps.loc[check_for_reps['counts']==1]
    check_for_dups = nonreps.groupby(['name']).size().reset_index(name='dup_counts')
    duplicates = check_for_dups.loc[check_for_dups['dup_counts']>1]
    nondups = check_for_dups.loc[check_for_dups['dup_counts']==1]
    timecheck = datetime.now()
    run_info = timecheck.strftime("%Y-%m-%d")
    tmpdict = {"samples":len(df1),"replicates":len(replicates),
               "duplicates":len(duplicates),"unique records":len(nondups),
               "% dups":len(duplicates)/len(nonreps)*100,"run date":run_info}
    duplicates.to_csv(f"duplicates_{run_info}.tsv",sep='\t',header=True)
    return tmpdict

def get_zenodo_dup_stats(repetitions, record_limit):
    n = 0
    statlist = []
    while n < repetitions:
        print("now performing run #",n)
        df1 = fetch_zenodo_records(record_limit)
        tmpdict = check_dups(df1)
        tmpdict['run number'] = n
        statlist.append(tmpdict)
        time.sleep(300)
        n=n+1
    return statlist

In [12]:
%%time
repetitions = 1
record_limit = 49
statlist = get_zenodo_dup_stats(repetitions, record_limit)
statdf = pd.DataFrame(statlist)
statdf.to_csv('dup_stats.tsv',sep='\t')
print(statdf)

now performing run # 0
   samples  replicates  duplicates  unique records  % dups    run date  \
0    50000           0        2345           43819    4.69  2023-08-10   

   run number  
0           0  
CPU times: total: 3.77 s
Wall time: 6min 7s


## Checking Metadata differences between OMICS-DI and GEO

1. Compare lengths of names and descriptions
2. For duplicate records in this sample, pull 'species', 'measurementTechnique', and 'infectiousAgent' fields to compare the data from the two repos
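
Step 2 could look roughly like the sketch below, which reports which record of a pair has a non-empty value for each field. The helper `compare_field_presence` and both example records are hypothetical; the shape of the field values in the real API responses is an assumption.

```python
FIELDS = ["species", "measurementTechnique", "infectiousAgent"]

def compare_field_presence(geo_record, omics_record, fields=FIELDS):
    """For each field, report which record of the pair has a non-empty value."""
    result = {}
    for field in fields:
        in_geo = bool(geo_record.get(field))
        in_omics = bool(omics_record.get(field))
        if in_geo and in_omics:
            result[field] = "both"
        elif in_geo:
            result[field] = "GEO only"
        elif in_omics:
            result[field] = "OMICS-DI only"
        else:
            result[field] = "neither"
    return result

# Made-up records for illustration only
geo = {"_id": "GEO_GSE0000", "species": {"name": "Homo sapiens"}, "measurementTechnique": None}
omics = {"_id": "OMICSDI_GSE0000", "species": {"name": "Homo sapiens"}, "infectiousAgent": {"name": "HIV-1"}}
presence = compare_field_presence(geo, omics)
print(presence)
```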

In [4]:
import pandas as pd
from pandas import read_csv
import requests
import json
import time
import math
import os

In [13]:
parent_path = os.path.dirname(os.getcwd())
print(parent_path)
data_path = os.path.join(parent_path,'Pubtator_Check','data')

C:\Users\gtsueng\Anaconda3\envs\nde\nde_misc


In [14]:
df3 = read_csv(os.path.join(data_path,'citation_df_clean.tsv'),delimiter='\t',header=0,index_col=0)


In [15]:
print(df3.head(n=2))

                   _id                                        description  \
0  OMICSDI_PRJNA775608  Alveolar epithelial glycocalyx degradation med...   
1   OMICSDI_PRJNA74531  Streptococcus agalactiae STIR-CD-17 Genome seq...   

                                                name      pmid  
0  Alveolar epithelial glycocalyx degradation med...  34874923  
1                Streptococcus agalactiae STIR-CD-17  23105075  


In [16]:
#### Find duplicate records
## Since each record has a unique id, if we group by the name and citation pmid, we'll find duplicate records
df3['pmid'] = df3['pmid'].astype(str)
df3_counts = df3.groupby(['name','pmid']).size().reset_index(name='counts')
rep_subset = df3_counts.loc[df3_counts['counts']>1]
print(len(rep_subset))

57686


In [17]:
#### Check to see if the number of unique names matches that of the number of unique citation records
## Note, it does not. There are more unique names than pmids, therefore, some datasets cite the same pmid
unique_names = rep_subset['name'].unique().tolist()
unique_pmids = rep_subset['pmid'].unique().tolist()
print(len(unique_names),len(unique_pmids))

55654 44000
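
To see which pmids are cited under more than one name, one option is to count distinct names per pmid. The sketch below uses a made-up frame with the same `name` and `pmid` columns as `rep_subset`.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    "name": ["Study A", "Study B", "Study C"],
    "pmid": ["111", "111", "222"],
})
# Number of distinct dataset names citing each pmid
names_per_pmid = toy.groupby("pmid")["name"].nunique()
# pmids cited by more than one distinct name
shared_pmids = names_per_pmid[names_per_pmid > 1]
print(shared_pmids)
```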


In [18]:
#### Check to see if there are replicates (multiples of more than 2) 
print(df3_counts.sort_values('counts',ascending=False).head(n=5))

                                                     name      pmid  counts
136334                                       Mus musculus  34830319      48
186788  The relationship betweem bacterial community s...  28018299      43
14441                                Arabidopsis thaliana  15656970      38
101028                                       Homo sapiens  28888135      26
93001                                        Homo sapiens  12704389      25


In [19]:
#### Using only name and pmid can result in multiple replicates. These may need special handling
#### The issue of replicates may be due to both OMICS-DI ingestion of GEO and versioning
#### First address the duplicates only as these will likely be due to OMICS-DI ingestion of GEO


dup_freq_subset = rep_subset.loc[rep_subset['counts']<3].copy()
dup_subset = dup_freq_subset.merge(df3,on=['name','pmid'],how='left')
print(dup_subset.head(n=6))

                                                name      pmid  counts  \
0  'Bois noir' phytoplasma induces significant re...  19799775       2   
1  'Bois noir' phytoplasma induces significant re...  19799775       2   
2  (1) Murine CD4 T cells: naïve vs peptide treat...  21490154       2   
3  (1) Murine CD4 T cells: naïve vs peptide treat...  21490154       2   
4  (2-Benzimidazolyl)acetonitrile derivative for ...  30466066       2   
5  (2-Benzimidazolyl)acetonitrile derivative for ...  30466066       2   

                    _id                                        description  
0  OMICSDI_E-GEOD-10906  Transcriptional profiling of Vitis vinifera cv...  
1          GEO_GSE10906  Transcriptional profiling of Vitis vinifera cv...  
2      OMICSDI_GSE26908  (1) Transcriptional profiling of mouse CD4 T c...  
3          GEO_GSE26908  (1) Transcriptional profiling of mouse CD4 T c...  
4     OMICSDI_GSE115918  Natural chemical modifications to 5-formylurac...  
5         GEO_GSE11

In [20]:
#### Get pairs of ids
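## Sort the data frame by pmid (to get pairs), then by _id (to ensure ordering; 'GEO_' sorts before 'OMICSDI_')
## Generate one dataframe by dropping duplicates (subset pmid, keeping first)
## Generate second dataframe by dropping duplicates (subset pmid, keeping last)
## Merge the two to get pairs of data
## Note: de-duplicating on pmid alone keeps one row per pmid, so pairs whose names share a pmid can be lost in the merge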

dup_subset.sort_values(by=['pmid','_id'], inplace=True)
keep_first = dup_subset.drop_duplicates(subset='pmid',keep='first').copy()
keep_last = dup_subset.drop_duplicates(subset='pmid',keep='last').copy()
keep_first.rename(columns={'_id':'GEO_id','description':'GEO_desc'},inplace=True)
keep_last.rename(columns={'_id':'OMICS_id','description':'OMICS_desc'},inplace=True)
print(keep_first.head(n=5))
print("===================")
print(keep_last.head(n=5))
print("===================")
clean_dup_df = keep_first.merge(keep_last,on=['name','pmid','counts'],how='inner')
print(clean_dup_df.head(n=2))

                                              name      pmid  counts  \
49476                        Human Leukocytes SAGE  10419873       2   
49856   Human mammary epithelium and breast cancer  10430922       2   
22197                Diffuse large B-cell lymphoma  10676951       2   
65516      NCI cDNA microarray-human 60 cell lines  10700174       2   
104861           snf/swi mutants of S. cerevisiae.  10725359       2   

             GEO_id                                           GEO_desc  
49476   GEO_GSE5833  This series represents the human leukocyte SAG...  
49856     GEO_GSE53  Distinctive gene expression patterns in human ...  
22197     GEO_GSE60  Diffuse large B-cell lymphoma (DLBCL), the mos...  
65516   GEO_GSE2003  We used cDNA microarrays to explore the variat...  
104861    GEO_GSE21  The Saccharomyces cerevisiae Snf/Swi complex h...  
                                              name      pmid  counts  \
49477                        Human Leukocytes SAGE  10419
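
An alternative way to form pairs, sketched here on made-up rows, is to derive the source repository from the `_id` prefix and pivot on it, so that pairing does not depend on sort order or on pmids being unique to one name. This is not the notebook's method. Note that `DataFrame.pivot` raises a `ValueError` if a (name, pmid) group has two records from the same source, so such groups would need to be filtered out first.

```python
import pandas as pd

# Made-up rows; both studies deliberately cite the same pmid
toy = pd.DataFrame({
    "name": ["Study A", "Study A", "Study B", "Study B"],
    "pmid": ["111", "111", "111", "111"],
    "_id": ["GEO_GSE1", "OMICSDI_GSE1", "GEO_GSE2", "OMICSDI_E-GEOD-2"],
    "description": ["short", "short plus design", "abc", "abc"],
})
# Source repository = text before the first underscore in the id
toy["source"] = toy["_id"].str.split("_", n=1).str[0]
# One row per (name, pmid), one column per (field, source)
pairs = toy.pivot(index=["name", "pmid"], columns="source", values=["_id", "description"])
# Keep only groups that have both a GEO and an OMICSDI record
pairs = pairs.dropna()
print(pairs)
```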

In [21]:
def compare_desc_length(row):
    ## Label which description in a GEO/OMICS-DI pair is longer
    if row['GEO_desc_len'] > row['OMICS_desc_len']:
        return 'GEO longer'
    if row['GEO_desc_len'] < row['OMICS_desc_len']:
        return 'OMICS longer'
    if row['GEO_desc_len'] == row['OMICS_desc_len']:
        return 'same length'
    ## Reached only when a length is NaN (missing description)
    return 'not comparable'

In [22]:
## compare lengths of descriptions
clean_dup_df['GEO_desc_len'] = clean_dup_df['GEO_desc'].str.len()
clean_dup_df['OMICS_desc_len'] = clean_dup_df['OMICS_desc'].str.len()
clean_dup_df['compare'] = clean_dup_df.apply(compare_desc_length, axis=1)
print(clean_dup_df.head(n=2))

                                         name      pmid  counts       GEO_id  \
0                       Human Leukocytes SAGE  10419873       2  GEO_GSE5833   
1  Human mammary epithelium and breast cancer  10430922       2    GEO_GSE53   

                                            GEO_desc           OMICS_id  \
0  This series represents the human leukocyte SAG...    OMICSDI_GSE5833   
1  Distinctive gene expression patterns in human ...  OMICSDI_E-GEOD-53   

                                          OMICS_desc  GEO_desc_len  \
0  This series represents the human leukocyte SAG...           107   
1  Distinctive gene expression patterns in human ...          1496   

   OMICS_desc_len       compare  
0             502  OMICS longer  
1            1482    GEO longer  


In [23]:
print(clean_dup_df.iloc[0]['GEO_id'],clean_dup_df.iloc[0]['GEO_desc'])
print('================================')
print(clean_dup_df.iloc[0]['OMICS_id'],clean_dup_df.iloc[0]['OMICS_desc'])

GEO_GSE5833 This series represents the human leukocyte SAGE library collection. Keywords: Human Leukocytes, Blood, SAGE
OMICSDI_GSE5833 This series represents the human leukocyte SAGE library collection. Keywords: Human Leukocytes, Blood, SAGE Overall design: Leukocytes are classified as myelocytic or lymphocytic, and each class of leukocytes consists of several types of cells that have different phenotypes and different roles. To define the gene expression in these cells, we have performed SAGE using human leukocytes and have provided the gene database for these cells not only at the resting stage but also at the activated stage.
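
In the pair printed above, the OMICS-DI description is the GEO description followed by an "Overall design" section. Length alone does not capture that, so a containment check may be a useful complement. The sketch below is hypothetical (`describe_overlap` is not part of the notebook), and the OMICS-DI text is shortened for the example.

```python
def describe_overlap(geo_desc, omics_desc):
    """Classify how two descriptions of the same dataset relate textually."""
    geo_desc, omics_desc = geo_desc.strip(), omics_desc.strip()
    if geo_desc == omics_desc:
        return "identical"
    # An empty description is a prefix of anything, so it falls into these branches
    if omics_desc.startswith(geo_desc):
        return "OMICS extends GEO"
    if geo_desc.startswith(omics_desc):
        return "GEO extends OMICS"
    return "different"

geo_desc = ("This series represents the human leukocyte SAGE library collection. "
            "Keywords: Human Leukocytes, Blood, SAGE")
# Shortened version of the OMICS-DI text shown above
omics_desc = geo_desc + " Overall design: Leukocytes are classified as myelocytic or lymphocytic"
print(describe_overlap(geo_desc, omics_desc))
```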


### Summary of comparison of duplicate descriptions

In [24]:
summarydf = clean_dup_df.groupby('compare').size()
print(summarydf)

rep_freq_subset = rep_subset.loc[rep_subset['counts']>3].copy()
trip_freq_subset = rep_subset.loc[rep_subset['counts']==3].copy()

print("replicates (>3): ", len(rep_freq_subset))
print("triplicates (=3): ",len(trip_freq_subset))
print("duplicates (=2): ", len(dup_freq_subset))

compare
GEO longer      11858
OMICS longer    22661
same length       520
dtype: int64
replicates (>3):  378
triplicates (=3):  4757
duplicates (=2):  52551


Replicates and triplicates appear to arise mainly when a species name is used as the dataset name. Such datasets are likely to cite the same paper (PMID) describing the species, yet, judging by their descriptions, they may be entirely different datasets.
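
One crude way to flag names that look like a bare species name is a pattern for a two-word Latin binomial. This is only a sketch: the regular expression and the helper `looks_like_species_name` are assumptions, and the pattern will also match any other two-word title of the form "Capitalized lowercase", so a lookup against a taxonomy list would be more reliable.

```python
import re

# Capitalized genus followed by a lowercase species epithet, nothing else
BINOMIAL = re.compile(r"^[A-Z][a-z]+ [a-z]+$")

def looks_like_species_name(name):
    """Return True if the name has the shape of a two-word Latin binomial."""
    return bool(BINOMIAL.match(name.strip()))

for name in ["Mus musculus", "Homo sapiens", "Human Leukocytes SAGE", "Synechocystis sp. PCC 6803"]:
    print(name, looks_like_species_name(name))
```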

### Investigate source of triplicate records

In [26]:

trip_subset = trip_freq_subset.merge(df3,on=['name','pmid'],how='left')
trip_subset.sort_values(by=['pmid','name'],inplace=True)
#trip_subset.to_csv(os.path.join(data_path,'triplicates_by_name_and_pmid.tsv'), sep='\t',header=True)
print(trip_subset.head(n=21))

                                                    name      pmid  counts  \
10458                                  Rattus norvegicus  11158336       3   
10459                                  Rattus norvegicus  11158336       3   
10460                                  Rattus norvegicus  11158336       3   
10905                           Saccharomyces cerevisiae  12524544       3   
10906                           Saccharomyces cerevisiae  12524544       3   
10907                           Saccharomyces cerevisiae  12524544       3   
1245                              Bacillus_anthracis_CGH  12721629       3   
1246                              Bacillus_anthracis_CGH  12721629       3   
1247                              Bacillus_anthracis_CGH  12721629       3   
8391                                        Mus musculus  12819134       3   
8392                                        Mus musculus  12819134       3   
8393                                        Mus musculus  128191

### Identify a heuristic for omitting replicates based on name length or match to a species name

In [27]:
### Inspecting name lengths
rep_freq_subset['name_length'] = rep_freq_subset['name'].str.len()
rep_freq_subset.sort_values(by='name_length',ascending=True,inplace=True)
rep_name_mean = rep_freq_subset['name_length'].mean()
rep_name_min = rep_freq_subset['name_length'].min() 
rep_name_max = rep_freq_subset['name_length'].max()
print("replicates: ", "min: ", rep_name_min, "max: ", rep_name_max, "mean: ", rep_name_mean)

trip_freq_subset['name_length'] = trip_freq_subset['name'].str.len()
trip_freq_subset.sort_values(by='name_length',ascending=True,inplace=True)
trip_name_mean = trip_freq_subset['name_length'].mean()
trip_name_min = trip_freq_subset['name_length'].min() 
trip_name_max = trip_freq_subset['name_length'].max()
print("triplicates: ", "min: ", trip_name_min, "max: ", trip_name_max, "mean: ", trip_name_mean)

dup_freq_subset['name_length'] = dup_freq_subset['name'].str.len()
dup_freq_subset.sort_values(by='name_length',ascending=True,inplace=True)
dup_name_mean = dup_freq_subset['name_length'].mean()
dup_name_min = dup_freq_subset['name_length'].min() 
dup_name_max = dup_freq_subset['name_length'].max()
print("duplicates: ", "min: ", dup_name_min, "max: ", dup_name_max, "mean: ", dup_name_mean)

replicates:  min:  12 max:  176 mean:  74.33068783068784
triplicates:  min:  5 max:  249 mean:  92.59701492537313
duplicates:  min:  2 max:  255 mean:  91.43451123670339
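
The three near-identical blocks above could be folded into one helper. The sketch below is hypothetical (`name_length_summary` is not in the notebook) and uses made-up subsets; because it never assigns a new column to a slice, it also sidesteps pandas' copy warnings.

```python
import pandas as pd

def name_length_summary(subsets):
    """Return min, max and mean name length for each labelled subset."""
    rows = {}
    for label, df in subsets.items():
        lengths = df["name"].str.len()
        rows[label] = lengths.agg(["min", "max", "mean"])
    # One row per subset, columns min / max / mean
    return pd.DataFrame(rows).T

# Made-up subsets for illustration only
toy_subsets = {
    "replicates": pd.DataFrame({"name": ["Mus musculus", "Homo sapiens"]}),
    "duplicates": pd.DataFrame({"name": ["MP", "NBS", "Human Leukocytes SAGE"]}),
}
summary = name_length_summary(toy_subsets)
print(summary)
```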



In [28]:
print(rep_freq_subset.head(n=3))

print(trip_freq_subset.head(n=3))

print(dup_freq_subset.head(n=3))

                name      pmid  counts  name_length
135193  Mus musculus  28385888      14           12
99065   Homo sapiens  26385698       4           12
99099   Homo sapiens  26416749       4           12
          name      pmid  counts  name_length
180591   Tbx-2  34819350       3            5
16988   BCL11B  28232744       3            6
113934  Jmjd1c  26878175       3            6
       name      pmid  counts  name_length
119628   MP  12704389       2            2
636     58C  25159868       2            3
137522  NBS  25159868       2            3


In [29]:
cutoff_test = [25, 50, 75, 90]

for eachcutoff in cutoff_test:
    ## sort_values without inplace returns a new, sorted frame
    tmprepdf = rep_freq_subset.loc[rep_freq_subset['name_length']>eachcutoff].sort_values('name_length',ascending=True)
    print("reps at "+str(eachcutoff),": ",tmprepdf.head(n=3))
    tmptripdf = trip_freq_subset.loc[trip_freq_subset['name_length']>eachcutoff].sort_values('name_length',ascending=True)
    print("trips at "+str(eachcutoff),": ",tmptripdf.head(n=3))
    tmpdupdf = dup_freq_subset.loc[dup_freq_subset['name_length']>eachcutoff].sort_values('name_length',ascending=True)
    print("dups at "+str(eachcutoff),": ",tmpdupdf.head(n=3))

reps at 25 :                                  name      pmid  counts  name_length
176656    Synechocystis sp. PCC 6803  17910763       5           26
142794   Oryza sativa Japonica Group  23180784       7           27
39666   Corynebacterium glutamicum R  29148103       4           28
trips at 25 :                                name      pmid  counts  name_length
169229  Simplified ChIP-exo assays  30030442       3           26
212890  marine sediment metagenome  25916483       3           26
17088   BET bromodomain inhibition  28515341       3           26
dups at 25 :                                name      pmid  counts  name_length
10423   Alpha-factor block-release   9843569       2           26
136868  Mycobacterium tuberculosis  23222129       2           26
188602  Time-ChIP in ESCs and NSCs  27304074       2           26
reps at 50 :                                                       name      pmid  counts  \
197187  Transcriptional responses to macrophage phagoc...  3490
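
Printing the first three rows at each cutoff shows what kind of name survives, but not how many records each cutoff keeps. A small hypothetical helper such as `rows_above_cutoffs` below could report that; the frame here is made up for illustration.

```python
import pandas as pd

def rows_above_cutoffs(df, cutoffs):
    """Count rows whose name is longer than each cutoff."""
    lengths = df["name"].str.len()
    return {cutoff: int((lengths > cutoff).sum()) for cutoff in cutoffs}

# Made-up names with lengths 12, 26 and 60
toy = pd.DataFrame({"name": ["Mus musculus", "Synechocystis sp. PCC 6803", "x" * 60]})
counts = rows_above_cutoffs(toy, [25, 50, 75, 90])
print(counts)
```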


In [30]:
print(df3.head(n=2))
## tmprepdf is left over from the final loop iteration above (name_length > 90)
replicatesdf = df3.merge(tmprepdf, on=['name','pmid'], how='inner')
print(len(replicatesdf))

                   _id                                        description  \
0  OMICSDI_PRJNA775608  Alveolar epithelial glycocalyx degradation med...   
1   OMICSDI_PRJNA74531  Streptococcus agalactiae STIR-CD-17 Genome seq...   

                                                name      pmid  
0  Alveolar epithelial glycocalyx degradation med...  34874923  
1                Streptococcus agalactiae STIR-CD-17  23105075  
813
