## Inspect whether it is suitable/reasonable to use Pubtator species annotations from cited manuscripts for a dataset

Only a small subset of dataset records in NDE have values for the 'species' or 'infectiousAgent' fields. Many datasets have values for the 'citation' field. Pubtator allows [FTP downloads](https://ftp.ncbi.nlm.nih.gov/pub/lu/PubTatorCentral/) of taxonomic extractions/annotations by PMID. Can this approach be used to extrapolate this information for a dataset citing a PMID?

About the Pubtator file: each file contains five tab-separated columns, listed below by zero-based column index:
0. PMID: PubMed abstract identifier
1. Type: i.e., gene, disease, chemical, species, and mutation
2. Concept ID: Corresponding database identifier (e.g., NCBIGene ID, MESH ID)
3. Mentions: Bio-concept mentions corresponding to the PubMed abstract
4. Resource: Various manually annotated resources are included in the files (e.g., MeSH and gene2pubmed)
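
As a rough illustration of the layout above, the sketch below parses two made-up rows with pandas. The column names, PMIDs, mentions, and resource label are assumptions chosen for the example and are not taken from a real Pubtator export.

```python
import io
import pandas as pd

# Two made-up rows in the five-column, tab-separated layout described above.
# The PMIDs, mentions, and resource label are illustrative only.
sample = (
    "11111111\tSpecies\t10090\tmice|mouse\texample_resource\n"
    "22222222\tSpecies\t9606\thuman|patients\texample_resource\n"
)

# Hypothetical column names, one per column index 0-4
columns = ["pmid", "type", "concept_id", "mentions", "resource"]
pubtator_sample = pd.read_csv(io.StringIO(sample), sep="\t", names=columns, header=None)

# Mentions are pipe-delimited, so split them into one list per row
pubtator_sample["mention_list"] = pubtator_sample["mentions"].str.split("|")
print(pubtator_sample[["pmid", "concept_id", "mention_list"]])
```

The pipe-delimited mentions column is what makes the later text check against a dataset description possible.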


**Tested**
1. Pull all records with a citation pmid
2. Explode the dataframe so each row has a single pmid instead of a list of pmids
3. Load the Pubtator extracted species dataframe
4. Get the intersect of the dataframes
5. Quick manual inspection for accuracy
  * Observations from quick manual inspection:
    - Pubtator annotates the entire manuscript, which may include derived plant products used as reagents and mentioned in figures. This is a problematic source of error, as the taxonomic ID of a reagent is irrelevant to the dataset
6. Keep only taxa that have terms found in the dataset description (see potential improvements)
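
The steps listed above can be sketched end to end on toy data. Everything in the two small tables is made up for illustration. The column names follow the ones used later in this notebook, and the helper `mentioned` is a hypothetical stand-in for the text check in step 6.

```python
import pandas as pd

# Toy NDE-like records: 'citation' holds a list of {'pmid': ...} dicts (made-up values)
records = pd.DataFrame({
    "_id": ["rec_a", "rec_b"],
    "description": ["RNA-seq of mouse lung tissue", "Genome of a bacterial isolate"],
    "citation": [[{"pmid": "111"}, {"pmid": "222"}], [{"pmid": "333"}]],
})

# Toy Pubtator-like species table (made-up values)
species = pd.DataFrame({
    "pmid": ["111", "111", "333"],
    "taxid": [10090, 4081, 9606],
    "taxname": ["mice|mouse", "Lycopersicon esculentum", "human"],
})

# Steps 1-2: one row per (record, pmid) pair
flat = records.explode("citation", ignore_index=True)
flat["pmid"] = flat["citation"].apply(lambda c: str(c["pmid"]))
flat = flat.drop(columns="citation")

# Steps 3-4: intersect the two tables on pmid
merged = flat.merge(species, on="pmid", how="inner")

# Step 6: keep taxa with at least one mention string present in the description
def mentioned(row):
    return any(term in row["description"] for term in row["taxname"].split("|"))

kept = merged[merged.apply(mentioned, axis=1)]
print(kept[["_id", "pmid", "taxid"]])
```

In this toy case the tomato annotation and the unmatched "human" annotation are dropped, and only the mouse taxon survives for `rec_a`.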

**Potential improvements**
- Throw out any identified species that are not mentioned in the dataset name or description. Since Pubtator gives the actual text that was mapped to an NCBI Taxon ID, each taxon can be checked against the description and thrown out if none of its mention strings appear there.
- NCBI GEO and OMICS-DI contain duplicate records. Matching them on the name field and comparing the lengths of their description fields will show whether the GEO description is the better of the two.
  - If GEO does have better descriptions than its OMICS-DI duplicate, then a species may map successfully to the GEO version but fail to match the OMICS-DI version
  
**Potential other applications**
Pubtator also extracts diseases, but the annotations align to MeSH, which is not a true disease ontology. Pubtator disease annotations can similarly be downloaded, and the extracted disease terms for a PMID can be checked against the description of the dataset citing that PMID. Once the health conditions are obtained, they can be normalized using the Translator KPs.

In [2]:
import pandas as pd
from pandas import read_csv
import requests
import json
import time
import math

In [103]:
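def extract_pmid(pmid_dict):
    """Return the pmid as a string from a citation entry (a dict, or its string form)."""
    try:
        pmid = str(pmid_dict['pmid'])
    except (TypeError, KeyError):
        ## Fall back to string handling when the entry is a string such as '{pmid: 123}'
        pmid = pmid_dict.replace('{pmid: ','').replace('}','')
    return pmid

def confirm_result(row):
    """Return 'yes' if any Pubtator mention string appears verbatim in the description, else 'no'."""
    try:
        ## taxname is a pipe-delimited list of the text strings Pubtator mapped to the taxon
        taxalist = row['taxname'].split('|')
        ## The check is a case-sensitive substring match against the description only
        if any(eachtaxon in row['description'] for eachtaxon in taxalist):
            return 'yes'
        return 'no'
    except (AttributeError, TypeError, KeyError):
        ## taxname or description is missing (e.g. NaN), so no match is possible
        return 'no'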

### Fetch minimal metadata from only records with a citation pmid

In [74]:
%%time

## Perform the initial query

query_url = 'https://api.data.niaid.nih.gov/v1/query?q=_exists_:citation.pmid&fields=_id,name,description,citation.pmid&fetch_all=true'
r = requests.get(query_url)
cleanr = json.loads(r.text)
hits = cleanr['hits']
#print(len(cleanr['hits']))
df1 = pd.DataFrame(cleanr['hits'])
scroll_id = cleanr['_scroll_id']
total_hits = cleanr['total']

CPU times: total: 125 ms
Wall time: 1.07 s


In [15]:
%%time
## Scroll to get all the results

i = 0
#k = 3 
k = math.ceil(total_hits/1000)
while i < k:
    r2 = requests.get(f'https://api.data.niaid.nih.gov/v1/query?scroll_id={scroll_id}')
    tmp = json.loads(r2.text)
    scroll_id = tmp['_scroll_id']
    tmpdf = pd.DataFrame(tmp['hits'])
    df1 = pd.concat((df1,tmpdf),ignore_index=True)
    #print(len(df1))
    i = i+1
    time.sleep(0.25)

[progress output: cumulative row count printed after each 1,000-row page, from 2000 through 159000]


KeyError: '_scroll_id'
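
The `KeyError` above is raised when a response no longer carries a `_scroll_id`. A minimal sketch of a more defensive loop is shown below. It assumes that the end of the result set is signalled by a page with no `hits` or no `_scroll_id`, which has not been checked against the API documentation. The helper names `get_json` and `fetch_all_hits` are hypothetical, and the demonstration uses canned pages rather than the live API.

```python
import json
import time
from urllib.request import urlopen


def get_json(url):
    """Fetch a URL and decode the JSON body (standard library only)."""
    with urlopen(url) as resp:
        return json.load(resp)


def fetch_all_hits(first_url, scroll_url_template, fetch_json=get_json, pause=0.25):
    """Collect hits page by page until a page has no hits or no scroll id."""
    page = fetch_json(first_url)
    hits = list(page.get("hits", []))
    scroll_id = page.get("_scroll_id")
    while scroll_id:
        page = fetch_json(scroll_url_template.format(scroll_id=scroll_id))
        new_hits = page.get("hits", [])
        if not new_hits:
            break  # assumed end-of-results signal: a page without hits
        hits.extend(new_hits)
        scroll_id = page.get("_scroll_id")  # None ends the loop instead of raising KeyError
        time.sleep(pause)
    return hits


# Offline demonstration with canned pages (made-up payloads, no network access)
fake_pages = {
    "first": {"hits": [{"_id": "a"}, {"_id": "b"}], "_scroll_id": "s1", "total": 3},
    "scroll/s1": {"hits": [{"_id": "c"}], "_scroll_id": "s2"},
    "scroll/s2": {"success": False},
}
demo_hits = fetch_all_hits("first", "scroll/{scroll_id}",
                           fetch_json=fake_pages.__getitem__, pause=0)
print(len(demo_hits))
```

Against the real service, `first_url` would be the `query_url` from the initial query cell and the template would be the scroll URL used in the loop above, with `{scroll_id}` as the placeholder. The returned list can be passed to `pd.DataFrame` in one step, which also avoids a `pd.concat` call per page.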

In [16]:
## inspect the results
print(df1.head(n=3))

                   _id  _score                citation  \
0  OMICSDI_PRJNA775608     1.0  [{'pmid': '34874923'}]   
1   OMICSDI_PRJNA74531     1.0  [{'pmid': '23105075'}]   
2  OMICSDI_PRJNA754436     1.0  [{'pmid': '34793837'}]   

                                         description  \
0  Alveolar epithelial glycocalyx degradation med...   
1  Streptococcus agalactiae STIR-CD-17 Genome seq...   
2  CRISPR-Cas9 generated SARM1 knockout and epito...   

                                                name _ignored  
0  Alveolar epithelial glycocalyx degradation med...      NaN  
1                Streptococcus agalactiae STIR-CD-17      NaN  
2  CRISPR-Cas9 generated SARM1 knockout and epito...      NaN  


In [42]:
#### save the raw results (header=True writes the column names; header=0 is falsy and would omit them)
#df1.to_csv('data/citation_df_raw.tsv',sep='\t',header=True)

#### Clean up the results (since a single record may have multiple citations)
df2 = df1.explode('citation')
df3 = df2[['_id','citation','description','name']].copy()
df3.dropna(inplace=True)
print(len(df1),len(df2),len(df3))
df3['pmid'] = df3['citation'].apply(lambda x: extract_pmid(x))
df3.drop('citation',axis=1,inplace=True)
print(df3.head(n=2))

274648 284631 282321
                   _id                                        description  \
0  OMICSDI_PRJNA775608  Alveolar epithelial glycocalyx degradation med...   
1   OMICSDI_PRJNA74531  Streptococcus agalactiae STIR-CD-17 Genome seq...   

                                                name      pmid  
0  Alveolar epithelial glycocalyx degradation med...  34874923  
1                Streptococcus agalactiae STIR-CD-17  23105075  


In [43]:
#### save the clean results
#df3.to_csv('data/citation_df_clean.tsv',sep='\t',header=True)

### Use the citation pmids to pull species for those pmids from a pubtator export

In [None]:
df3 = read_csv('data/citation_df_clean.tsv',delimiter='\t',header=0,index_col=0)

In [61]:
pmidlist = df3['pmid'].unique().tolist()
pmidintlist = []
faillist = []
for pmid in pmidlist:
    try:
        pmidintlist.append(int(pmid))
    except:
        faillist.append(pmid)

print(len(pmidlist), len(pmidintlist),len(faillist))

123965 123964 1
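
The same split into parseable and unparseable PMIDs can be sketched without an explicit loop by using `pd.to_numeric` with `errors="coerce"`. The third value below is made up to stand in for the single failing entry. Note that `pd.to_numeric` is more permissive than `int()` (for example, it parses `"1e3"`), so the two approaches are not guaranteed to agree on every input.

```python
import pandas as pd

# The first two values appear in this notebook; the third is a made-up bad entry
pmids = pd.Series(["34874923", "23105075", "not-a-pmid"])

as_numbers = pd.to_numeric(pmids, errors="coerce")   # unparseable values become NaN
failed = pmids[as_numbers.isna()].tolist()           # original strings that did not parse
pmid_ints = as_numbers.dropna().astype("int64").tolist()

print(len(pmids), len(pmid_ints), len(failed))
```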


In [63]:
%%time
#### chunk the read of the input (turn it into a generator and iterate through it)
#### Only keep the species info from pubtator for pmids that have an NDE citation pmid match
speciesdf = pd.read_csv('data/species2pubtatorcentral.txt',delimiter='\t',
                        usecols=[0,2,3], names=['pmid','taxid','taxname'], header=None, chunksize=20000)
savedata = pd.DataFrame(columns=['pmid','taxid','taxname'])
for adf in speciesdf:
    tmpdf = adf.loc[adf['pmid'].isin(pmidlist)]
    tmpdf2 = adf.loc[adf['pmid'].isin(pmidintlist)]
    if len(tmpdf)>0:
        savedata = pd.concat((savedata,tmpdf),ignore_index=True)
    elif len(tmpdf2)>0:
        savedata = pd.concat((savedata,tmpdf2),ignore_index=True)

print(len(savedata))

675978
CPU times: total: 7min 18s
Wall time: 7min 21s
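
The chunked filter above can also be written so that matching chunks are collected in a list and concatenated once at the end, which avoids growing `savedata` with a `pd.concat` call per chunk. This is a sketch, not a measured speed-up. The function name `filter_species_chunks` is hypothetical, the sample rows are made up, and the PMIDs passed in are assumed to parse as integers (as in `pmidintlist`).

```python
import io
import pandas as pd


def filter_species_chunks(source, wanted_pmids, chunksize=20000):
    """Stream a Pubtator-style file and keep only rows whose pmid is wanted."""
    wanted = {int(p) for p in wanted_pmids}  # set membership keeps isin() cheap
    reader = pd.read_csv(source, sep="\t", usecols=[0, 2, 3],
                         names=["pmid", "taxid", "taxname"], header=None,
                         chunksize=chunksize)
    kept = [chunk[chunk["pmid"].isin(wanted)] for chunk in reader]
    return pd.concat(kept, ignore_index=True)  # single concat at the end


# Made-up five-column sample standing in for species2pubtatorcentral.txt
sample = (
    "111\tSpecies\t10090\tmouse\texample_resource\n"
    "222\tSpecies\t9606\thuman\texample_resource\n"
    "333\tSpecies\t4081\ttomato\texample_resource\n"
)
result = filter_species_chunks(io.StringIO(sample), ["111", "333"], chunksize=2)
print(result)
```

Comparing against a single set of integers also removes the need for the separate string and integer lookups (`pmidlist` and `pmidintlist`), provided the `pmid` column is read as integers.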


In [65]:
#### export the results so we don't have to do that again
savedata.to_csv('data/pmids_cited_taxa.tsv',sep='\t',header=True)

### Merge the NDE data (with citation pmids) with the Pubtator results 

In [91]:
#### Get rid of any duplication that was introduced as an artifact of the merging process
savedata['pmid'] = savedata['pmid'].astype(str)
merged_df = df3.merge(savedata,on='pmid',how='inner')
merged_df.drop_duplicates(keep='first',inplace=True)
print(len(merged_df))

1645100


In [92]:
#### Inspect the results
test_df = merged_df.head(n=20).copy()
print(test_df.head(n=2))

                   _id                                        description  \
0  OMICSDI_PRJNA775608  Alveolar epithelial glycocalyx degradation med...   
1  OMICSDI_PRJNA775608  Alveolar epithelial glycocalyx degradation med...   

                                                name      pmid  taxid  \
0  Alveolar epithelial glycocalyx degradation med...  34874923   4081   
1  Alveolar epithelial glycocalyx degradation med...  34874923  10090   

                               taxname  
0              Lycopersicon esculentum  
1  mice|Mice|Murine|Mouse|murine|mouse  


Pubtator identifies species in the full body of the paper for the PMID. This can include species that were used as reagents in the methodology (for example, Lycopersicon esculentum refers to some sort of tomato extract or protein that was mentioned in a figure). To ensure we only include species that were mentioned in the dataset, keep only taxa where at least one of the taxon names was mentioned in the record name or description. Note that `confirm_result`, as written, checks only the description field.
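
A variant of the text check that looks at both the name and the description could be sketched as follows. It is not the function used to produce the counts in this notebook: `taxon_in_record` is a hypothetical helper, the case-insensitive whole-word matching is a design assumption, and the two demo rows are made up.

```python
import re
import pandas as pd


def taxon_in_record(row, fields=("name", "description")):
    """Return True if any pipe-delimited mention appears as a whole word in the given fields."""
    mentions = row.get("taxname")
    if not isinstance(mentions, str):
        return False  # missing taxname (e.g. NaN) cannot match
    # Join only the fields that actually hold text
    text = " ".join(row[f] for f in fields if isinstance(row.get(f), str))
    for mention in mentions.split("|"):
        mention = mention.strip()
        if not mention:
            continue
        # Lookarounds keep 'human' from matching inside 'humanization'
        pattern = r"(?<!\w)" + re.escape(mention) + r"(?!\w)"
        if re.search(pattern, text, flags=re.IGNORECASE):
            return True
    return False


demo = pd.DataFrame({
    "name": ["Mouse lung RNA-seq", "Plasmid humanization screen"],
    "description": ["Transcriptome of treated animals", "No organism named here"],
    "taxname": ["mice|Mouse|mouse", "human"],
})
demo["text match?"] = demo.apply(taxon_in_record, axis=1)
print(demo[["name", "taxname", "text match?"]])
```

The first row matches through the name field alone, which a description-only check would miss. The second row is rejected because "human" only occurs inside a longer word.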

### Filter out species not mentioned in the record

In [98]:
#### Test the function for doing so

test_df['text match?'] = test_df.apply(lambda row: confirm_result(row), axis=1)
print(test_df.head(n=10))

                   _id                                        description  \
0  OMICSDI_PRJNA775608  Alveolar epithelial glycocalyx degradation med...   
1  OMICSDI_PRJNA775608  Alveolar epithelial glycocalyx degradation med...   
2  OMICSDI_PRJNA775608  Alveolar epithelial glycocalyx degradation med...   
3  OMICSDI_PRJNA775608  Alveolar epithelial glycocalyx degradation med...   
4  OMICSDI_PRJNA775608  Alveolar epithelial glycocalyx degradation med...   
5  OMICSDI_PRJNA775608  Alveolar epithelial glycocalyx degradation med...   
6        GEO_GSE186705  Acute Respiratory Distress Syndrome (ARDS) is ...   
7        GEO_GSE186705  Acute Respiratory Distress Syndrome (ARDS) is ...   
8        GEO_GSE186705  Acute Respiratory Distress Syndrome (ARDS) is ...   
9        GEO_GSE186705  Acute Respiratory Distress Syndrome (ARDS) is ...   

                                                name      pmid    taxid  \
0  Alveolar epithelial glycocalyx degradation med...  34874923     4081   
1 

In [104]:
#### Apply the function to determine if a record had a term that matched with at least one of the taxa names

merged_df['text match?'] = merged_df.apply(lambda row: confirm_result(row), axis=1)

In [105]:
#### Filter out the ones that didn't

good_df = merged_df.loc[merged_df['text match?']=='yes']
print(len(good_df))

233789


In [107]:
#### Inspect the resulting table
print(good_df.head(n=5))

                    _id                                        description  \
7         GEO_GSE186705  Acute Respiratory Distress Syndrome (ARDS) is ...   
8         GEO_GSE186705  Acute Respiratory Distress Syndrome (ARDS) is ...   
12   OMICSDI_PRJNA74531  Streptococcus agalactiae STIR-CD-17 Genome seq...   
13  OMICSDI_PRJNA754436  CRISPR-Cas9 generated SARM1 knockout and epito...   
14        GEO_GSE182091  The aim of this study was initially to determi...   

                                                 name      pmid  taxid  \
7   Alveolar epithelial glycocalyx degradation med...  34874923  10090   
8   Alveolar epithelial glycocalyx degradation med...  34874923   9606   
12                Streptococcus agalactiae STIR-CD-17  23105075   1311   
13  CRISPR-Cas9 generated SARM1 knockout and epito...  34793837  10090   
14  CRISPR-Cas9 generated SARM1 knockout and epito...  34793837  10090   

                                              taxname text match?  
7                 

In [109]:
#### Export the results so we don't have to do that again
good_df.to_csv('data/taxa_found.tsv',sep='\t',header=True)

## Checking Metadata differences between OMICS-DI and GEO

1. Compare lengths of names and descriptions
2. For duplicate records in this sample, pull 'species', 'measurementTechnique', and 'infectiousAgent' fields to compare the data from the two repos

In [3]:
df3 = read_csv('data/citation_df_clean.tsv',delimiter='\t',header=0,index_col=0)



In [3]:
print(df3.head(n=2))

                   _id                                        description  \
0  OMICSDI_PRJNA775608  Alveolar epithelial glycocalyx degradation med...   
1   OMICSDI_PRJNA74531  Streptococcus agalactiae STIR-CD-17 Genome seq...   

                                                name      pmid  
0  Alveolar epithelial glycocalyx degradation med...  34874923  
1                Streptococcus agalactiae STIR-CD-17  23105075  


In [4]:
#### Find duplicate records
## Since each record has a unique id, if we group by the name and citation pmid, we'll find duplicate records
df3['pmid'] = df3['pmid'].astype(str)
df3_counts = df3.groupby(['name','pmid']).size().reset_index(name='counts')
rep_subset = df3_counts.loc[df3_counts['counts']>1]
print(len(rep_subset))

57686


In [5]:
#### Check to see if the number of unique names matches that of the number of unique citation records
## Note, it does not. There are more unique names than pmids, therefore, some datasets cite the same pmid
unique_names = rep_subset['name'].unique().tolist()
unique_pmids = rep_subset['pmid'].unique().tolist()
print(len(unique_names),len(unique_pmids))

55654 44000


In [6]:
#### Check to see if there are replicates (multiples of more than 2) 
print(df3_counts.sort_values('counts',ascending=False).head(n=5))

                                                     name      pmid  counts
136334                                       Mus musculus  34830319      48
186788  The relationship betweem bacterial community s...  28018299      43
14441                                Arabidopsis thaliana  15656970      38
101028                                       Homo sapiens  28888135      26
93001                                        Homo sapiens  12704389      25


In [7]:
#### Using only name and pmid can result in multiple replicates. These may need special handling
#### The issue of replicates may be due to both OMICS-DI ingestion of GEO and versioning
#### First address the duplicates only as these will likely be due to OMICS-DI ingestion of GEO


dup_freq_subset = rep_subset.loc[rep_subset['counts']<3]
dup_subset = dup_freq_subset.merge(df3,on=['name','pmid'],how='left')
print(dup_subset.head(n=6))

                                                name      pmid  counts  \
0  'Bois noir' phytoplasma induces significant re...  19799775       2   
1  'Bois noir' phytoplasma induces significant re...  19799775       2   
2  (1) Murine CD4 T cells: naïve vs peptide treat...  21490154       2   
3  (1) Murine CD4 T cells: naïve vs peptide treat...  21490154       2   
4  (2-Benzimidazolyl)acetonitrile derivative for ...  30466066       2   
5  (2-Benzimidazolyl)acetonitrile derivative for ...  30466066       2   

                    _id                                        description  
0  OMICSDI_E-GEOD-10906  Transcriptional profiling of Vitis vinifera cv...  
1          GEO_GSE10906  Transcriptional profiling of Vitis vinifera cv...  
2      OMICSDI_GSE26908  (1) Transcriptional profiling of mouse CD4 T c...  
3          GEO_GSE26908  (1) Transcriptional profiling of mouse CD4 T c...  
4     OMICSDI_GSE115918  Natural chemical modifications to 5-formylurac...  
5         GEO_GSE11

In [8]:
#### Get pairs of ids
## Sort the data frame by pmid (to get pairs), then by _id (to ensure ordering; 'GEO_' ids sort before 'OMICSDI_' ids)
## Generate one dataframe by dropping duplicates (subset pmid, keeping first)
## Generate second dataframe by dropping duplicates (subset pmid, keeping last)
## Merge the two to get pairs of data

dup_subset.sort_values(by=['pmid','_id'], inplace=True)
keep_first = dup_subset.drop_duplicates(subset='pmid',keep='first').copy()
keep_last = dup_subset.drop_duplicates(subset='pmid',keep='last').copy()
keep_first.rename(columns={'_id':'GEO_id','description':'GEO_desc'},inplace=True)
keep_last.rename(columns={'_id':'OMICS_id','description':'OMICS_desc'},inplace=True)
print(keep_first.head(n=5))
print("===================")
print(keep_last.head(n=5))
print("===================")
clean_dup_df = keep_first.merge(keep_last,on=['name','pmid','counts'],how='inner')
print(clean_dup_df.head(n=2))

                                              name      pmid  counts  \
49476                        Human Leukocytes SAGE  10419873       2   
49856   Human mammary epithelium and breast cancer  10430922       2   
22197                Diffuse large B-cell lymphoma  10676951       2   
65516      NCI cDNA microarray-human 60 cell lines  10700174       2   
104861           snf/swi mutants of S. cerevisiae.  10725359       2   

             GEO_id                                           GEO_desc  
49476   GEO_GSE5833  This series represents the human leukocyte SAG...  
49856     GEO_GSE53  Distinctive gene expression patterns in human ...  
22197     GEO_GSE60  Diffuse large B-cell lymphoma (DLBCL), the mos...  
65516   GEO_GSE2003  We used cDNA microarrays to explore the variat...  
104861    GEO_GSE21  The Saccharomyces cerevisiae Snf/Swi complex h...  
                                              name      pmid  counts  \
49477                        Human Leukocytes SAGE  10419
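
An alternative way to pair the records is to split them by the source prefix of `_id` and join on both `name` and `pmid`. The sketch below assumes every `_id` starts with `GEO_` or `OMICSDI_`, as in the rows shown in this notebook, and uses made-up records. Because the join key includes the name, two different datasets that cite the same PMID each keep their own pair.

```python
import pandas as pd

# Made-up duplicate records: two datasets that cite the same pmid
dups = pd.DataFrame({
    "name": ["Study A", "Study A", "Study B", "Study B"],
    "pmid": ["111", "111", "111", "111"],
    "_id": ["OMICSDI_E-GEOD-1", "GEO_GSE1", "GEO_GSE2", "OMICSDI_GSE2"],
    "description": ["short", "a longer text", "same", "same"],
})

# Source repository is taken from the text before the first underscore
dups["source"] = dups["_id"].str.split("_", n=1).str[0]

geo = (dups[dups["source"] == "GEO"]
       .rename(columns={"_id": "GEO_id", "description": "GEO_desc"})
       .drop(columns="source"))
omics = (dups[dups["source"] == "OMICSDI"]
         .rename(columns={"_id": "OMICS_id", "description": "OMICS_desc"})
         .drop(columns="source"))

# Join on name and pmid so each GEO record meets its own OMICS-DI counterpart
pairs = geo.merge(omics, on=["name", "pmid"], how="inner")
print(pairs)
```

In this toy case both pairs survive, whereas dropping duplicates on `pmid` alone would keep only one first and one last row for PMID 111.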

In [9]:
def compare_desc_length(row):
    if row['GEO_desc_len'] > row['OMICS_desc_len']:
        compare_result = 'GEO longer'
    elif row['GEO_desc_len'] < row['OMICS_desc_len']:
        compare_result = 'OMICS longer'
    elif row['GEO_desc_len'] == row['OMICS_desc_len']:
        compare_result = 'same length'
    return compare_result

In [10]:
## compare lengths of descriptions
clean_dup_df['GEO_desc_len'] = clean_dup_df['GEO_desc'].str.len()
clean_dup_df['OMICS_desc_len'] = clean_dup_df['OMICS_desc'].str.len()
clean_dup_df['compare'] = clean_dup_df.apply(lambda row : compare_desc_length(row), axis = 1)
print(clean_dup_df.head(n=2))

                                         name      pmid  counts       GEO_id  \
0                       Human Leukocytes SAGE  10419873       2  GEO_GSE5833   
1  Human mammary epithelium and breast cancer  10430922       2    GEO_GSE53   

                                            GEO_desc           OMICS_id  \
0  This series represents the human leukocyte SAG...    OMICSDI_GSE5833   
1  Distinctive gene expression patterns in human ...  OMICSDI_E-GEOD-53   

                                          OMICS_desc  GEO_desc_len  \
0  This series represents the human leukocyte SAG...           107   
1  Distinctive gene expression patterns in human ...          1496   

   OMICS_desc_len       compare  
0             502  OMICS longer  
1            1482    GEO longer  
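
The row-wise comparison can also be sketched as a vectorized operation with `numpy.select`. The first two length pairs are the ones shown above. The last two rows are made up to cover the equal-length case and a missing description, which is labeled `unknown` here (an assumption; the original function does not define that case).

```python
import numpy as np
import pandas as pd

lengths = pd.DataFrame({
    "GEO_desc_len": [107, 1496, 50, np.nan],    # last two rows are made up
    "OMICS_desc_len": [502, 1482, 50, 20],
})

diff = lengths["GEO_desc_len"] - lengths["OMICS_desc_len"]
lengths["compare"] = np.select(
    [diff > 0, diff < 0, diff == 0],
    ["GEO longer", "OMICS longer", "same length"],
    default="unknown",  # reached when either length is NaN
)
print(lengths)
```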


In [11]:
print(clean_dup_df.iloc[0]['GEO_id'],clean_dup_df.iloc[0]['GEO_desc'])
print('================================')
print(clean_dup_df.iloc[0]['OMICS_id'],clean_dup_df.iloc[0]['OMICS_desc'])

GEO_GSE5833 This series represents the human leukocyte SAGE library collection. Keywords: Human Leukocytes, Blood, SAGE
OMICSDI_GSE5833 This series represents the human leukocyte SAGE library collection. Keywords: Human Leukocytes, Blood, SAGE Overall design: Leukocytes are classified as myelocytic or lymphocytic, and each class of leukocytes consists of several types of cells that have different phenotypes and different roles. To define the gene expression in these cells, we have performed SAGE using human leukocytes and have provided the gene database for these cells not only at the resting stage but also at the activated stage.


### Summary of comparison of duplicate descriptions

In [16]:
summarydf = clean_dup_df.groupby('compare').size()
print(summarydf)

rep_freq_subset = rep_subset.loc[rep_subset['counts']>3].copy()
trip_freq_subset = rep_subset.loc[rep_subset['counts']==3].copy()

print("replicates (>3): ", len(rep_freq_subset))
print("triplicates (=3): ",len(trip_freq_subset))
print("duplicates (=2): ", len(dup_freq_subset))

compare
GEO longer      11858
OMICS longer    22661
same length       520
dtype: int64
replicates (>3):  378
triplicates (=3):  4757
duplicates (=2):  52551


Replicates and triplicates seem to arise primarily when a species name is used as the dataset name. Such datasets are likely to cite the same PMID (a paper describing the species) and, judging by their descriptions, may be wholly different datasets.

### Investigate source of triplicate records

In [25]:

trip_subset = trip_freq_subset.merge(df3,on=['name','pmid'],how='left')
trip_subset.sort_values(by=['pmid','name'],inplace=True)
trip_subset.to_csv('data/triplicates_by_name_and_pmid.tsv', sep='\t',header=True)
print(trip_subset.head(n=21))

                                                   name      pmid  counts  \
663                                   Rattus norvegicus  11158336       3   
664                                   Rattus norvegicus  11158336       3   
665                                   Rattus norvegicus  11158336       3   
840                            Saccharomyces cerevisiae  12524544       3   
841                            Saccharomyces cerevisiae  12524544       3   
842                            Saccharomyces cerevisiae  12524544       3   
744                              Bacillus_anthracis_CGH  12721629       3   
745                              Bacillus_anthracis_CGH  12721629       3   
746                              Bacillus_anthracis_CGH  12721629       3   
153                                        Mus musculus  12819134       3   
154                                        Mus musculus  12819134       3   
155                                        Mus musculus  12819134       3   

Triplicates appear to have mostly the same cause as the replicates, with occasional true duplicates mixed in: when the name of a duplicate pair is short and simple, it can collide with unrelated records that share the same name.

### Identify a heuristic for omitting replicates based on name length or match to a species name

In [18]:
### Inspecting name lengths
rep_freq_subset['name_length'] = rep_freq_subset['name'].str.len()
rep_freq_subset.sort_values(by='name_length',ascending=True,inplace=True)
rep_name_mean = rep_freq_subset['name_length'].mean()
rep_name_min = rep_freq_subset['name_length'].min() 
rep_name_max = rep_freq_subset['name_length'].max()
print("replicates: ", "min: ", rep_name_min, "max: ", rep_name_max, "mean: ", rep_name_mean)

trip_freq_subset['name_length'] = trip_freq_subset['name'].str.len()
trip_freq_subset.sort_values(by='name_length',ascending=True,inplace=True)
trip_name_mean = trip_freq_subset['name_length'].mean()
trip_name_min = trip_freq_subset['name_length'].min() 
trip_name_max = trip_freq_subset['name_length'].max()
print("triplicates: ", "min: ", trip_name_min, "max: ", trip_name_max, "mean: ", trip_name_mean)

## dup_freq_subset was created as a slice of rep_subset; copy it before adding a column
dup_freq_subset = dup_freq_subset.copy()
dup_freq_subset['name_length'] = dup_freq_subset['name'].str.len()
dup_freq_subset.sort_values(by='name_length',ascending=True,inplace=True)
dup_name_mean = dup_freq_subset['name_length'].mean()
dup_name_min = dup_freq_subset['name_length'].min()
dup_name_max = dup_freq_subset['name_length'].max()
print("duplicates: ", "min: ", dup_name_min, "max: ", dup_name_max, "mean: ", dup_name_mean)

replicates:  min:  12 max:  176 mean:  74.33068783068784
triplicates:  min:  5 max:  249 mean:  92.59701492537313
duplicates:  min:  2 max:  255 mean:  91.43451123670339


In [19]:
print(rep_freq_subset.head(n=3))

print(trip_freq_subset.head(n=3))

print(dup_freq_subset.head(n=3))

                name      pmid  counts  name_length
135193  Mus musculus  28385888      14           12
99065   Homo sapiens  26385698       4           12
99099   Homo sapiens  26416749       4           12
          name      pmid  counts  name_length
180591   Tbx-2  34819350       3            5
16988   BCL11B  28232744       3            6
113934  Jmjd1c  26878175       3            6
       name      pmid  counts  name_length
119628   MP  12704389       2            2
636     58C  25159868       2            3
137522  NBS  25159868       2            3


In [20]:
cutoff_test = [25, 50, 75, 90]

for eachcutoff in cutoff_test:
    ## sort_values returns a new frame, so the filtered slices are never modified in place
    tmprepdf = rep_freq_subset.loc[rep_freq_subset['name_length']>eachcutoff].sort_values('name_length',ascending=True)
    print("reps at "+str(eachcutoff),": ",tmprepdf.head(n=3))
    tmptripdf = trip_freq_subset.loc[trip_freq_subset['name_length']>eachcutoff].sort_values('name_length',ascending=True)
    print("trips at "+str(eachcutoff),": ",tmptripdf.head(n=3))
    tmpdupdf = dup_freq_subset.loc[dup_freq_subset['name_length']>eachcutoff].sort_values('name_length',ascending=True)
    print("dups at "+str(eachcutoff),": ",tmpdupdf.head(n=3))

reps at 25 :                                  name      pmid  counts  name_length
176656    Synechocystis sp. PCC 6803  17910763       5           26
142794   Oryza sativa Japonica Group  23180784       7           27
39666   Corynebacterium glutamicum R  29148103       4           28
trips at 25 :                                name      pmid  counts  name_length
169229  Simplified ChIP-exo assays  30030442       3           26
212890  marine sediment metagenome  25916483       3           26
17088   BET bromodomain inhibition  28515341       3           26
dups at 25 :                                name      pmid  counts  name_length
10423   Alpha-factor block-release   9843569       2           26
136868  Mycobacterium tuberculosis  23222129       2           26
188602  Time-ChIP in ESCs and NSCs  27304074       2           26
reps at 50 :                                                       name      pmid  counts  \
197187  Transcriptional responses to macrophage phagoc...  3490

In [21]:
print(df3.head(n=2))
## tmprepdf is left over from the last iteration of the cutoff loop above (name_length > 90)
replicatesdf = df3.merge(tmprepdf, on=['name','pmid'], how='inner')
print(len(replicatesdf))

                   _id                                        description  \
0  OMICSDI_PRJNA775608  Alveolar epithelial glycocalyx degradation med...   
1   OMICSDI_PRJNA74531  Streptococcus agalactiae STIR-CD-17 Genome seq...   

                                                name      pmid  
0  Alveolar epithelial glycocalyx degradation med...  34874923  
1                Streptococcus agalactiae STIR-CD-17  23105075  
813
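
The "match to a species name" half of the heuristic could be approximated with a pattern check, sketched below. Both the regular expression and the 30-character cutoff are assumptions chosen for illustration, not values derived from the data. Most of the names come from the outputs above. The last one is included to show that short sentence-case titles can pass this crude test, so a lookup against a real list of taxon names would likely be more reliable.

```python
import re
import pandas as pd

# Hypothetical pattern: a capitalized genus-like word followed by a lowercase word
BINOMIAL_LIKE = re.compile(r"^[A-Z][a-z]+ [a-z]{2,}\.?(?: |$)")


def looks_like_species_name(name, max_len=30):
    """Flag short names that start like a Latin binomial (max_len is an arbitrary cutoff)."""
    if not isinstance(name, str):
        return False
    return len(name) <= max_len and bool(BINOMIAL_LIKE.match(name))


names = pd.Series([
    "Mus musculus",
    "Synechocystis sp. PCC 6803",
    "Oryza sativa Japonica Group",
    "Human Leukocytes SAGE",
    "Tbx-2",
    "Diffuse large B-cell lymphoma",  # not a species, but it passes this crude test
])
flags = names.apply(looks_like_species_name)
print(pd.DataFrame({"name": names, "species_like": flags}))
```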
