# Data Collection

## Problem Statement

The goal of this project is to collect and process data on retracted academic literature from the PubMed database. The data will be used to build an NLP model that identifies the subject areas most likely to be retracted after publication and predicts whether a paper will be retracted based on its body text. Model performance will be assessed with several metrics, including accuracy.

If time allows, a web-based application will be built that accepts the text of a paper and reports the likelihood that the paper will be retracted.

This project matters to researchers, who could check whether a paper they are writing needs editing to avoid common retraction reasons, such as plagiarism. Journal editors could also use the model to quickly flag newly submitted articles that need closer review before publication.

For the general public, this project shows which topics or scientific methods are more likely to be retracted. A healthy conversation about the dynamics of retraction can build greater trust between the general public and the scientific community.

## Data Collection Workflow

## Importing Libraries

In [2]:
from Bio import Entrez
import pandas as pd
import requests
import time
from bs4 import BeautifulSoup

## Accessing Journal Names

In [43]:
df = pd.read_csv('./journals_in_pmc.csv')
df.head()

Unnamed: 0,Journal title,NLM TA,pISSN,eISSN,Publisher,LOCATORplus ID,Latest issue,Earliest volume,Free access,Open access,Participation level,Deposit status,Journal URL
0,3 Biotech,3 Biotech,2190-572X,2190-5738,Springer,101565857,v.10(9);Sep 2020,v.1;2011,12 months,Some,Full,,http://www.ncbi.nlm.nih.gov/pmc/journals/1811/
1,3D Printing in Medicine,3D Print Med,,2365-6271,BioMed Central,101721758,v.5;Dec 2019,v.2;2016,Immediate,All,Full,,http://www.ncbi.nlm.nih.gov/pmc/journals/3516/
2,AACE Clinical Case Reports,AACE Clin Case Rep,,2376-0605,American Association of Clinical Endocrinologists,101670593,v.6(5);Sep-Oct 2020,v.5;2019,Immediate,No,Full,,http://www.ncbi.nlm.nih.gov/pmc/journals/3582/
3,The AAPS Journal,AAPS J,,1550-7416,American Association of Pharmaceutical Scientists,101223209,v.18(3);May 2016,v.6;2004,,Some,Full,No New Content,http://www.ncbi.nlm.nih.gov/pmc/journals/792/
4,AAPS PharmSci,AAPS PharmSci,,1522-1059,American Association of Pharmaceutical Scientists,100897065,v.6(2);Jun 2004,v.1;1999,Immediate,No,Full,Predecessor,http://www.ncbi.nlm.nih.gov/pmc/journals/989/


In [44]:
df['Participation level'].value_counts()

 Full              3104
 NIH Portfolio      448
Name: Participation level, dtype: int64

In [45]:
df = pd.concat((df, pd.get_dummies(df['Participation level'], prefix='participation')), axis=1)

In [46]:
df['participation_full'] = df['participation_ Full ']

In [47]:
df = df.drop(columns=['participation_ NIH Portfolio ', 'participation_ Full '])

In [48]:
df.loc[df['participation_full'] == 0].index

Int64Index([  17,   23,   24,   25,   26,   27,   28,   29,   30,   35,
            ...
            3396, 3408, 3415, 3427, 3456, 3464, 3476, 3478, 3479, 3544],
           dtype='int64', length=448)

In [49]:
df = df.drop(df.loc[df['participation_full'] == 0].index, axis=0)

In [50]:
print(df.shape)
df['participation_full'].value_counts()

(3104, 14)


1    3104
Name: participation_full, dtype: int64

In [None]:
df.to_csv('./journals_in_pmc_clean.csv')

In [157]:
df = pd.read_csv('./journals_in_pmc_clean.csv', index_col=False)
df = df.drop(columns=['Unnamed: 0'])
df.head()

Unnamed: 0,Journal title,NLM TA,pISSN,eISSN,Publisher,LOCATORplus ID,Latest issue,Earliest volume,Free access,Open access,Participation level,Deposit status,Journal URL,participation_full
0,3 Biotech,3 Biotech,2190-572X,2190-5738,Springer,101565857,v.10(9);Sep 2020,v.1;2011,12 months,Some,Full,,http://www.ncbi.nlm.nih.gov/pmc/journals/1811/,1
1,3D Printing in Medicine,3D Print Med,,2365-6271,BioMed Central,101721758,v.5;Dec 2019,v.2;2016,Immediate,All,Full,,http://www.ncbi.nlm.nih.gov/pmc/journals/3516/,1
2,AACE Clinical Case Reports,AACE Clin Case Rep,,2376-0605,American Association of Clinical Endocrinologists,101670593,v.6(5);Sep-Oct 2020,v.5;2019,Immediate,No,Full,,http://www.ncbi.nlm.nih.gov/pmc/journals/3582/,1
3,The AAPS Journal,AAPS J,,1550-7416,American Association of Pharmaceutical Scientists,101223209,v.18(3);May 2016,v.6;2004,,Some,Full,No New Content,http://www.ncbi.nlm.nih.gov/pmc/journals/792/,1
4,AAPS PharmSci,AAPS PharmSci,,1522-1059,American Association of Pharmaceutical Scientists,100897065,v.6(2);Jun 2004,v.1;1999,Immediate,No,Full,Predecessor,http://www.ncbi.nlm.nih.gov/pmc/journals/989/,1


In [15]:
df['Journal title'][388]

'BMC Blood Disorders'

## Pulling Data from PubMed/PMC - Retractions

### Pulling Data from PubMed - Retractions

In [None]:
#https://medium.com/@kliang933/scraping-big-data-from-public-research-repositories-e-g-pubmed-arxiv-2-488666f6f29b

In [16]:
Entrez.email = 'lmpack01@outlook.com'
handle = Entrez.esearch(db='pubmed',term='(hasretractionin) AND ("AAPS PharmSciTech"[Journal])', retmode='xml')
results = Entrez.read(handle)

28667474


In [17]:
results

{'Count': '5', 'RetMax': '5', 'RetStart': '0', 'IdList': ['28667474', '27511111', '23835739', '23800858', '18446488'], 'TranslationSet': [], 'TranslationStack': [{'Term': 'hasretractionin[All Fields]', 'Field': 'All Fields', 'Count': '7962', 'Explode': 'N'}, {'Term': '"AAPS PharmSciTech"[Journal]', 'Field': 'Journal', 'Count': '3320', 'Explode': 'N'}, 'AND'], 'QueryTranslation': 'hasretractionin[All Fields] AND "AAPS PharmSciTech"[Journal]'}

In [18]:
ids = ' , '.join(results['IdList'])
Entrez.email = 'lmpack01@outlook.com'
handle = Entrez.efetch(db='pubmed', id = ids, retmode='xml', rettype='full')
results_id = Entrez.read(handle)

28667474 , 27511111 , 23835739 , 23800858 , 18446488


In [19]:
str(results_id['PubmedArticle'][0]['MedlineCitation']['Article'])

"DictElement({'Language': ['eng'], 'ArticleDate': [DictElement({'Year': '2017', 'Month': '06', 'Day': '30'}, attributes={'DateType': 'Electronic'})], 'ELocationID': [StringElement('10.1208/s12249-017-0838-6', attributes={'EIdType': 'doi', 'ValidYN': 'Y'})], 'Journal': {'ISSN': StringElement('1530-9932', attributes={'IssnType': 'Electronic'}), 'JournalIssue': DictElement({'Volume': '19', 'Issue': '6', 'PubDate': {'Year': '2018', 'Month': '08'}}, attributes={'CitedMedium': 'Internet'}), 'Title': 'AAPS PharmSciTech'}, 'ArticleTitle': 'RETRACTED ARTICLE: Development and In Vitro-In Vivo Characterization of Chronomodulated Pulsatile Delivery Formulation of Terbutaline Sulphate by Box-Behnken Statistical Design.', 'Pagination': {'MedlinePgn': '2750'}, 'AuthorList': ListElement([DictElement({'AffiliationInfo': [{'Identifier': [], 'Affiliation': 'Ch. Devilal College of Pharmacy, Jagadhri, Yamunanagar, Haryana, India.'}], 'Identifier': [], 'LastName': 'Bajwa', 'ForeName': 'Prabhjot Singh', 'Ini

In [None]:
ls_num = []
ls_id = []
ls_doi = []
ls_language = []
ls_year = []
ls_month = []
ls_day = []
ls_volume = []
ls_issue = []
ls_journal = []
ls_title = []
ls_page = []
x=0
no_doi = 0

for i in range(0,len(df['NLM TA'])):
    Entrez.email = 'lmpack01@outlook.com'
    handle = Entrez.esearch(db='pubmed',term='(hasretractionin) AND ('+df['NLM TA'][i]+'[Journal])', retmode='xml', retmax=1000)
    results = Entrez.read(handle)
    
    ids = ' , '.join(results['IdList'])
    print(df['NLM TA'][i],[i])
    
    if len(ids)==0:
        pass
    else:
        Entrez.email = 'lmpack01@outlook.com'
        handle = Entrez.efetch(db='pubmed', id = ids, retmode='xml', rettype='full')
        results_id = Entrez.read(handle)
        
        x += len(results['IdList'])
        print(x)
        
        for j in range(0, len(results['IdList'])):
            ls_id.append(results['IdList'][j])
            try:
                doi = str(results_id['PubmedArticle'][j]['MedlineCitation']['Article']['ELocationID'][0])
                ls_doi.append(doi)
            except:
                ls_doi.append(None)
                no_doi += 1
                print(no_doi)
                
            try:
                language = str(results_id['PubmedArticle'][j]['MedlineCitation']['Article']['Language'][0])
                ls_language.append(language)
            except:
                ls_language.append(None)
                
            try:
                year = int(results_id['PubmedArticle'][j]['MedlineCitation']['Article']['ArticleDate'][0]['Year'])
                ls_year.append(year)
            except:
                ls_year.append(None)
            
            try:
                month = int(results_id['PubmedArticle'][j]['MedlineCitation']['Article']['ArticleDate'][0]['Month'])
                ls_month.append(month)
            except:
                ls_month.append(None)
                
            try:
                day = int(results_id['PubmedArticle'][j]['MedlineCitation']['Article']['ArticleDate'][0]['Day'])
                ls_day.append(day)
            except:
                ls_day.append(None)
                
            try:
                volume = int(results_id['PubmedArticle'][j]['MedlineCitation']['Article']['Journal']['JournalIssue']['Volume'])
                ls_volume.append(volume)
            except:
                ls_volume.append(None)
            
            try:
                issue = int(results_id['PubmedArticle'][j]['MedlineCitation']['Article']['Journal']['JournalIssue']['Issue'])
                ls_issue.append(issue)
            except:
                ls_issue.append(None)
                
            try:
                journal = str(results_id['PubmedArticle'][j]['MedlineCitation']['Article']['Journal']['Title'])
                ls_journal.append(journal)
            except:
                ls_journal.append(None)
                
            try:
                title = str(results_id['PubmedArticle'][j]['MedlineCitation']['Article']['ArticleTitle'])
                ls_title.append(title)
            except:
                ls_title.append(None)
                
            try:
                page = str(results_id['PubmedArticle'][j]['MedlineCitation']['Article']['Pagination']['MedlinePgn'])
                ls_page.append(page)
            except:
                ls_page.append(None)
        
    time.sleep(5)
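
The loop above repeats the same `try`/`except` pattern for every field. As a minimal sketch of one way to shorten it, the helper below walks a nested record along a path and returns a default when any step is missing. The names `safe_get`, `article`, and `row` are hypothetical and do not come from the notebook; the sample `article` is a plain-dict stand-in that reuses values from the record printed earlier, on the assumption that parsed Entrez records index the same way.

```python
from typing import Any


def safe_get(record: Any, path: tuple, default: Any = None, cast=None) -> Any:
    """Walk nested dicts/lists along `path`; return `default` if a step is missing."""
    current = record
    try:
        for key in path:
            current = current[key]
        # Optionally convert the leaf value (e.g. int for years and volumes).
        return cast(current) if cast is not None else current
    except (KeyError, IndexError, TypeError, ValueError):
        return default


# Plain-dict stand-in for one article record (hypothetical, for illustration only).
article = {
    "Language": ["eng"],
    "ArticleDate": [{"Year": "2017", "Month": "06", "Day": "30"}],
    "Journal": {
        "JournalIssue": {"Volume": "19", "Issue": "6"},
        "Title": "AAPS PharmSciTech",
    },
}

row = {
    "language": safe_get(article, ("Language", 0), cast=str),
    "year": safe_get(article, ("ArticleDate", 0, "Year"), cast=int),
    "volume": safe_get(article, ("Journal", "JournalIssue", "Volume"), cast=int),
    "doi": safe_get(article, ("ELocationID", 0), cast=str),  # missing key -> None
}
print(row)
```

Catching only the listed exception types, rather than using a bare `except`, lets unrelated errors such as a keyboard interrupt still surface.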

In [26]:
doi = pd.read_csv('./doi.csv')
doi['0'].isnull().sum()

827

In [158]:
data = pd.concat([pd.Series(ls_id), pd.Series(ls_doi), pd.Series(ls_language), pd.Series(ls_year), pd.Series(ls_month), pd.Series(ls_day), pd.Series(ls_volume), pd.Series(ls_issue), pd.Series(ls_journal), pd.Series(ls_title), pd.Series(ls_page)], axis=1)
data

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10
0,16353936,,eng,2005.0,10.0,19.0,7.0,3.0,The AAPS journal,Recent advances for the treatment of cocaine a...,E579-86
1,28667474,10.1208/s12249-017-0838-6,eng,2017.0,6.0,30.0,19.0,6.0,AAPS PharmSciTech,RETRACTED ARTICLE: Development and In Vitro-In...,2750
2,27511111,10.1208/s12249-016-0596-x,eng,2016.0,8.0,10.0,18.0,5.0,AAPS PharmSciTech,Study of the Transformations of Micro/Nano-cry...,1428-1437
3,23835739,10.1208/s12249-013-0001-y,eng,2013.0,7.0,9.0,14.0,3.0,AAPS PharmSciTech,Meloxicam taste-masked oral disintegrating tab...,1118-28
4,23800858,10.1208/s12249-013-9993-6,eng,2013.0,6.0,26.0,14.0,3.0,AAPS PharmSciTech,Design and formulation technique of a novel dr...,1045-54
...,...,...,...,...,...,...,...,...,...,...,...
2965,12619192,,eng,,,,44.0,1.0,Yonsei medical journal,Bilateral popliteal artery aneurysms with rupt...,159-62
2966,11371116,,eng,,,,42.0,2.0,Yonsei medical journal,Tic convulsif caused by cerebellopontine angle...,255-7
2967,22814263,10.3779/j.issn.1009-3419.2012.07.07,eng,,,,15.0,7.0,Zhongguo fei ai za zhi = Chinese journal of lu...,Lung cancer: microRNA and target database.,429-34
2968,20840810,10.3779/j.issn.1009-3419.2010.09.01,chi,,,,13.0,9.0,Zhongguo fei ai za zhi = Chinese journal of lu...,[Effect of Bufalin on proliferation and apopto...,841-5


In [159]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2970 entries, 0 to 2969
Data columns (total 11 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       2970 non-null   object 
 1   1       2143 non-null   object 
 2   2       2970 non-null   object 
 3   3       2202 non-null   float64
 4   4       2202 non-null   float64
 5   5       2202 non-null   float64
 6   6       2896 non-null   float64
 7   7       2189 non-null   float64
 8   8       2970 non-null   object 
 9   9       2970 non-null   object 
 10  10      2851 non-null   object 
dtypes: float64(5), object(6)
memory usage: 255.4+ KB


In [161]:
data.to_csv('./pubmed_data_retraction.csv')

In [None]:
data = pd.read_csv('./pubmed_data_retraction.csv')
data

In [None]:
data = data.dropna(axis=0, subset=['1'])
data

In [164]:
data['2'].value_counts()

eng    2139
chi       2
spa       1
fre       1
Name: 2, dtype: int64

In [165]:
print(data.loc[data['2']=='chi'].index)
print(data.loc[data['2']=='spa'].index)
print(data.loc[data['2']=='fre'].index)

Int64Index([2968, 2969], dtype='int64')
Int64Index([259], dtype='int64')
Int64Index([2243], dtype='int64')


In [166]:
data = data.drop([2968, 2969, 259, 2243], axis=0)
data['2'].value_counts()

eng    2139
Name: 2, dtype: int64

In [None]:
ls_index = []
for i in data['1']:
    if len(i) < 7:
        for j in range(0,len(data.loc[data['1']==i].index)):
            ls_index.append(data.loc[data['1']==i].index[j])
data = data.drop(ls_index)
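
The cleaning steps above (dropping rows without a DOI, keeping English records, and removing DOI strings shorter than 7 characters) can also be written as boolean masks, which avoids hard-coded index labels. This is a sketch on a toy frame, not the notebook's data: the column names `'1'` (DOI) and `'2'` (language) follow the notebook, and the short value `'1-8'` is invented for illustration.

```python
import pandas as pd

# Toy frame using the notebook's column names: '1' = DOI, '2' = language.
data = pd.DataFrame({
    "1": ["10.1208/s12249-017-0838-6", "1-8",
          "10.3779/j.issn.1009-3419.2010.09.01", None],
    "2": ["eng", "eng", "chi", "eng"],
})

has_doi = data["1"].notna()               # same effect as dropna(subset=['1'])
is_english = data["2"] == "eng"           # replaces dropping rows by index label
long_enough = data["1"].str.len() >= 7    # missing values compare as False

cleaned = data[has_doi & is_english & long_enough]
print(cleaned)
```

Because the masks are computed from the values themselves, the same lines work unchanged if the pull is repeated and the row indices shift.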

In [None]:
data = data.drop_duplicates()

In [None]:
data.to_csv('./pubmed_data_retraction_cleaned.csv')

In [2]:
data = pd.read_csv('./pubmed_data_retraction_cleaned.csv')
data.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,Unnamed: 0.1.1,0,1,2,3,4,5,6,7,8,9,10
0,0,1,1,28667474,10.1208/s12249-017-0838-6,eng,2017.0,6.0,30.0,19.0,6.0,AAPS PharmSciTech,RETRACTED ARTICLE: Development and In Vitro-In...,2750
1,1,2,2,27511111,10.1208/s12249-016-0596-x,eng,2016.0,8.0,10.0,18.0,5.0,AAPS PharmSciTech,Study of the Transformations of Micro/Nano-cry...,1428-1437
2,2,3,3,23835739,10.1208/s12249-013-0001-y,eng,2013.0,7.0,9.0,14.0,3.0,AAPS PharmSciTech,Meloxicam taste-masked oral disintegrating tab...,1118-28
3,3,4,4,23800858,10.1208/s12249-013-9993-6,eng,2013.0,6.0,26.0,14.0,3.0,AAPS PharmSciTech,Design and formulation technique of a novel dr...,1045-54
4,4,5,5,18446488,10.1208/s12249-008-9044-x,eng,2008.0,2.0,14.0,9.0,1.0,AAPS PharmSciTech,"The influence of sodium hyaluronate, L-leucine...",243-9


In [4]:
data.shape

(2106, 14)

In [103]:
data['1'].sample(n=25,replace = False)

212             10.1186/1471-2121-13-8
618           10.3389/fpsyg.2016.01298
1877      10.1371/journal.pone.0001444
895            10.1074/jbc.M111.260414
1439             10.1093/neuonc/nor116
1995        10.1038/s41598-019-38519-5
602           10.3389/fphar.2017.00871
511                10.7554/eLife.12248
2108           10.4103/2229-5070.72109
1953         10.1186/s12978-019-0732-7
1175    10.1523/JNEUROSCI.2613-09.2009
1428                10.1038/ncomms6446
98             10.4103/0256-4947.83211
1432                10.1038/ncomms1623
44           10.1107/S160053680706254X
599            10.3389/fonc.2013.00153
1488               10.2147/OTT.S124118
1            10.1208/s12249-016-0596-x
899            10.1074/jbc.M111.247726
2002        10.1038/s41598-017-10365-3
111            10.1074/jbc.M111.329078
1598      10.1371/journal.pone.0218664
1392              10.1128/MCB.01480-09
1340     10.1158/1535-7163.MCT-14-0672
1384              10.1128/MCB.00114-14
Name: 1, dtype: object

### Pulling Data from PMC - Retractions

In [47]:
Entrez.email = 'lmpack01@outlook.com'
handle = Entrez.esearch(db='pmc',term='10.11604/pamj.2013.16.18.2505', retmode='xml')
results = Entrez.read(handle)
results

{'Count': '1', 'RetMax': '1', 'RetStart': '0', 'IdList': ['3909696'], 'TranslationSet': [], 'TranslationStack': [{'Term': '10.11604/pamj.2013.16.18.2505[All Fields]', 'Field': 'All Fields', 'Count': '1', 'Explode': 'N'}, 'GROUP'], 'QueryTranslation': '10.11604/pamj.2013.16.18.2505[All Fields]'}

In [48]:
ids = ' , '.join(results['IdList'])
Entrez.email = 'lmpack01@outlook.com'
handle = Entrez.efetch(db='pmc', id = ids, retmode='xml', rettype='full')
text = handle.read()

In [49]:
soup = BeautifulSoup(text, "lxml")
soup.extract

<bound method PageElement.extract of <?xml version="1.0" ?><!DOCTYPE pmc-articleset PUBLIC "-//NLM//DTD ARTICLE SET 2.0//EN" "https://dtd.nlm.nih.gov/ncbi/pmc/articleset/nlm-articleset-2.0.dtd">
<html><body><pmc-articleset><article article-type="case-report" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">
<?properties open_access?>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">Pan Afr Med J</journal-id>
<journal-id journal-id-type="iso-abbrev">Pan Afr Med J</journal-id>
<journal-id journal-id-type="publisher-id">PAMJ</journal-id>
<journal-title-group>
<journal-title>The Pan African Medical Journal</journal-title>
</journal-title-group>
<issn pub-type="epub">1937-8688</issn>
<publisher>
<publisher-name>The African Field Epidemiology Network</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">24498467</article-id>
<article-id pub-id-type="pmc">3909696</article-id>
<article-id pub-id-type="pub

In [149]:
ls_keywords = []
for i in range(0,len(soup.find_all("kwd"))):
    ls_keywords.append(soup.find_all("kwd")[i].text)

print(ls_keywords)
#https://www.wiley.com/network/researchers/preparing-your-article/how-to-choose-effective-keywords-for-your-article

['Diabetic', 'urinary tract infection', 'Emphysematous cystitis', 'computed tomography', 'bladder catherization']


In [89]:
ls_l_name =[]
for k in range(0, len(soup.find_all("contrib"))):
    for i in soup.find_all("contrib")[k].find_all("name")[0].text.split('\n'):
        if i == '':
            pass
        else:
            ls_l_name.append(i)
        
print(ls_l_name)

['Ahsaini', 'Mustapha', 'Kassogue', 'Amadou', 'Tazi', 'Mohammed Fadl', 'Zaougui', 'Anas', 'Elammari', 'Jalal Edine', 'Khallouk', 'Abdelhak', 'El Fassi', 'Mohammed Jamal', 'Farih', 'My Hassan']


In [None]:
ls_total_text = []
ls_total_keywords = []
ls_total_abstract = []
ls_publisher = []
count = 2075
no_text = 0

for i in data['1'][2075:2106]:
    Entrez.email = 'lmpack01@outlook.com'
    handle = Entrez.esearch(db='pmc',term=i, retmode='xml')
    results = Entrez.read(handle)

    ids = ' , '.join(results['IdList'])
    Entrez.email = 'lmpack01@outlook.com'
    handle = Entrez.efetch(db='pmc', id = ids, retmode='xml', rettype='full')
    text = handle.read()

    soup = BeautifulSoup(text, "lxml")
    print(f"{i} [{count}]")
    
    try:
        ls_raw_text = []
        text = str()
        # Iterate over the tags directly; this also avoids reusing the outer loop variable `i`.
        for sec in soup.find_all("sec"):
            for p in sec.find_all("p"):
                ls_raw_text.append(str(p.text))

        text = ''.join(ls_raw_text)
        ls_total_text.append(text)
        
        if text=='':
            no_text += 1
            print(f'No text --> {no_text}')
    except:
        ls_total_text.append(None)
    
    try:
        ls_keywords = []
        for kwd in soup.find_all("kwd"):
            ls_keywords.append(kwd.text)
        ls_total_keywords.append(ls_keywords)
    except:
        ls_total_keywords.append(None)
    
    try:
        ls_abstract = []
        for i in soup.find_all("abstract")[0].text.split('\n'):
            if i == '':
                pass
            else:
                ls_abstract.append(i)
        ls_total_abstract.append(ls_abstract[0])
    except:
        ls_total_abstract.append(None)
    
    try:      
        ls_publisher.append(soup.find_all("publisher-name")[0].text)
    except:
        ls_publisher.append(None)
    
    count +=1
    time.sleep(3)
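
One thing to be aware of in the loop above: `find_all("sec")` returns nested sections as well as top-level ones, and `find_all("p")` searches all descendants, so a paragraph inside a nested `<sec>` is appended once for each enclosing section. The sketch below shows one way to collect each paragraph exactly once. It swaps in the standard library's `xml.etree.ElementTree` in place of BeautifulSoup, joins paragraphs with a space, and keeps the whole abstract instead of its first line, so it is not a drop-in replacement. The function names and the sample XML are hypothetical, and it is assumed (not verified here) that real PMC responses parse cleanly with `ET.fromstring`.

```python
import xml.etree.ElementTree as ET


def _clean(element):
    """Join all text under an element and collapse runs of whitespace."""
    return " ".join("".join(element.itertext()).split())


def _section_paragraphs(element, inside_sec, out):
    """Depth-first walk that records each <p> under a <sec> exactly once."""
    inside = inside_sec or element.tag == "sec"
    for child in element:
        if inside and child.tag == "p":
            out.append(_clean(child))
        else:
            _section_paragraphs(child, inside, out)


def extract_article_fields(xml_text):
    """Return body text, keywords, abstract, and publisher from article XML."""
    root = ET.fromstring(xml_text)
    paragraphs = []
    _section_paragraphs(root, False, paragraphs)
    abstract = root.find(".//abstract")
    publisher = root.find(".//publisher-name")
    return {
        "text": " ".join(paragraphs),
        "keywords": [_clean(k) for k in root.iter("kwd")],
        "abstract": _clean(abstract) if abstract is not None else None,
        "publisher": _clean(publisher) if publisher is not None else None,
    }


# Small hand-written sample (hypothetical), shaped like the XML printed earlier.
sample = """
<article>
  <front>
    <journal-meta>
      <publisher><publisher-name>The African Field Epidemiology Network</publisher-name></publisher>
    </journal-meta>
    <article-meta>
      <abstract><p>Short abstract.</p></abstract>
      <kwd-group><kwd>Diabetic</kwd><kwd>urinary tract infection</kwd></kwd-group>
    </article-meta>
  </front>
  <body>
    <sec><title>Intro</title><p>First.</p>
      <sec><p>Nested <italic>once</italic>.</p></sec>
    </sec>
  </body>
</article>
"""

fields = extract_article_fields(sample)
print(fields)
```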

In [154]:
print(pd.Series(data['1'][2075:2106]).shape)
print(pd.Series(ls_total_text).shape)
print(pd.Series(ls_total_keywords).shape)
print(pd.Series(ls_total_abstract).shape)
print(pd.Series(ls_publisher).shape)

(31,)
(31,)
(31,)
(31,)
(31,)


In [155]:
pd.Series(data['1'][2075:2106]).to_csv('./doi__2075_2106.csv')
pd.Series(ls_total_text).to_csv('./text__2075_2106.csv')
pd.Series(ls_total_keywords).to_csv('./keywords__2075_2106.csv')
pd.Series(ls_total_abstract).to_csv('./abstract__2075_2106.csv')
pd.Series(ls_publisher).to_csv('./publisher__2075_2106.csv')

In [None]:
#numbers that weren't complete after minimum 3 min: 82, 116, 124, 281, 292, 371, 372, 373, 388, 608, 623, 628
#637, 638, 1410, 1928
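
For requests that stall, as noted in the comment above, one option is to set a timeout and retry with an increasing delay. The following is only a sketch: `call_with_retries` is a hypothetical helper, `flaky` simulates a request that fails twice, and it is an assumption that the process-wide `socket.setdefaulttimeout` applies to the requests made through `Entrez`.

```python
import socket
import time

# Assumption: requests made via urllib pick up this process-wide default timeout.
socket.setdefaulttimeout(180)


def call_with_retries(func, *args, attempts=3, base_delay=1.0,
                      sleep=time.sleep, **kwargs):
    """Call func; on failure wait base_delay, 2*base_delay, ... and try again."""
    for attempt in range(1, attempts + 1):
        try:
            return func(*args, **kwargs)
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: let the caller see the error
            sleep(base_delay * 2 ** (attempt - 1))


# Simulated request that fails twice before succeeding (illustration only).
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise OSError("simulated timeout")
    return "ok"


delays = []
result = call_with_retries(flaky, attempts=4, base_delay=1.0, sleep=delays.append)
print(result, delays)
```

In the notebook's loops, a call such as `Entrez.efetch(...)` could be passed as `func` with its keyword arguments, and records that still fail after the last attempt could be logged for a later pass.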

## Pulling Data from PubMed/PMC - No Retractions

### Pulling Data from PubMed - No Retractions

In [None]:
ls_id = []
ls_doi = []
ls_language = []
ls_year = []
ls_month = []
ls_day = []
ls_volume = []
ls_issue = []
ls_journal = []
ls_title = []
ls_page = []
x=0
no_doi = 0

for i in range(0,len(df['NLM TA'])):
    Entrez.email = 'lmpack01@outlook.com'
    handle = Entrez.esearch(db='pubmed',term='('+df['NLM TA'][i]+'[Journal])', retmode='xml', retmax=2, sort='pub+date')
    results = Entrez.read(handle)
    
    ids = ' , '.join(results['IdList'])
    print(df['NLM TA'][i],[i])
    
    if len(ids)==0:
        pass
    else:
        Entrez.email = 'lmpack01@outlook.com'
        handle = Entrez.efetch(db='pubmed', id = ids, retmode='xml', rettype='full')
        results_id = Entrez.read(handle)
        
        x += len(results['IdList'])
        print(x)
        
        for j in range(0, len(results['IdList'])):
            ls_id.append(results['IdList'][j])
            try:
                doi = str(results_id['PubmedArticle'][j]['MedlineCitation']['Article']['ELocationID'][0])
                ls_doi.append(doi)
            except:
                ls_doi.append(None)
                no_doi += 1
                print(no_doi)
                
            try:
                language = str(results_id['PubmedArticle'][j]['MedlineCitation']['Article']['Language'][0])
                ls_language.append(language)
            except:
                ls_language.append(None)
                
            try:
                year = int(results_id['PubmedArticle'][j]['MedlineCitation']['Article']['ArticleDate'][0]['Year'])
                ls_year.append(year)
            except:
                ls_year.append(None)
            
            try:
                month = int(results_id['PubmedArticle'][j]['MedlineCitation']['Article']['ArticleDate'][0]['Month'])
                ls_month.append(month)
            except:
                ls_month.append(None)
                
            try:
                day = int(results_id['PubmedArticle'][j]['MedlineCitation']['Article']['ArticleDate'][0]['Day'])
                ls_day.append(day)
            except:
                ls_day.append(None)
                
            try:
                volume = int(results_id['PubmedArticle'][j]['MedlineCitation']['Article']['Journal']['JournalIssue']['Volume'])
                ls_volume.append(volume)
            except:
                ls_volume.append(None)
            
            try:
                issue = int(results_id['PubmedArticle'][j]['MedlineCitation']['Article']['Journal']['JournalIssue']['Issue'])
                ls_issue.append(issue)
            except:
                ls_issue.append(None)
                
            try:
                journal = str(results_id['PubmedArticle'][j]['MedlineCitation']['Article']['Journal']['Title'])
                ls_journal.append(journal)
            except:
                ls_journal.append(None)
                
            try:
                title = str(results_id['PubmedArticle'][j]['MedlineCitation']['Article']['ArticleTitle'])
                ls_title.append(title)
            except:
                ls_title.append(None)
                
            try:
                page = str(results_id['PubmedArticle'][j]['MedlineCitation']['Article']['Pagination']['MedlinePgn'])
                ls_page.append(page)
            except:
                ls_page.append(None)
        
    time.sleep(5)

In [161]:
data_no_retract = pd.concat([pd.Series(ls_id), pd.Series(ls_doi), pd.Series(ls_language), pd.Series(ls_year), pd.Series(ls_month), pd.Series(ls_day), pd.Series(ls_volume), pd.Series(ls_issue), pd.Series(ls_journal), pd.Series(ls_title), pd.Series(ls_page)], axis=1)
data_no_retract

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10
0,33014689,10.1007/s13205-020-02432-w,eng,2020.0,9.0,19.0,10.0,10.0,3 Biotech,AuNPs/CNF-modified DNA biosensor for early and...,446
1,33014688,10.1007/s13205-020-02435-7,eng,2020.0,9.0,19.0,10.0,10.0,3 Biotech,NF-κB signaling induces inductive expression o...,445
2,33006702,10.1186/s41205-020-00083-4,eng,2020.0,10.0,2.0,6.0,1.0,3D printing in medicine,Clinical applications of custom 3D printed imp...,29
3,32997313,10.1186/s41205-020-00081-6,eng,2020.0,9.0,30.0,6.0,1.0,3D printing in medicine,3D printing in critical care: a narrative review.,28
4,32984538,10.4158/ACCR-2020-0076,eng,2020.0,8.0,6.0,6.0,5.0,AACE clinical case reports,CONVERSION OF HYPOTHYROIDISM TO HYPERTHYROIDIS...,e279-e281
...,...,...,...,...,...,...,...,...,...,...,...
6111,32987453,2095-8137(2020)00-0001-08,eng,2020.0,9.0,28.0,,,Zoological research,Determining the level of extra-pair paternity ...,1-8
6112,32760460,10.6620/ZS.2020.59-14,eng,2020.0,4.0,28.0,59.0,,Zoological studies,Molecular Evaluation of the Fairy Shrimp Famil...,e14
6113,32760459,10.6620/ZS.2020.59-13,eng,2020.0,4.0,20.0,59.0,,Zoological studies,<i>Pontopolycope orientalis</i> sp. nov. (Crus...,e13
6114,32549998,10.1186/s40851-020-00162-8,eng,2020.0,6.0,15.0,6.0,,Zoological letters,"Correction to: Natural selection, selective br...",10


In [162]:
data_no_retract.to_csv('./pubmed_data_no_retraction.csv')

In [None]:
data_no_retract = pd.read_csv('./pubmed_data_no_retraction.csv')
data_no_retract

In [None]:
data_no_retract = data_no_retract.dropna(axis=0, subset=['1'])
data_no_retract

In [165]:
data_no_retract['2'].value_counts()

eng    5475
chi       7
por       6
spa       2
fre       1
Name: 2, dtype: int64

In [168]:
print(data_no_retract.loc[data_no_retract['2']=='chi'].index)
print(data_no_retract.loc[data_no_retract['2']=='por'].index)
print(data_no_retract.loc[data_no_retract['2']=='spa'].index)
print(data_no_retract.loc[data_no_retract['2']=='fre'].index)

Int64Index([2543, 2544, 4696, 4697, 6105, 6106, 6107], dtype='int64')
Int64Index([470, 471, 472, 473, 5450, 5467], dtype='int64')
Int64Index([530, 5454], dtype='int64')
Int64Index([1104], dtype='int64')


In [170]:
data_no_retract = data_no_retract.drop([2543, 2544, 4696, 4697, 6105, 6106, 6107, 470, 471, 472, 473, 5450, 5467, 530, 5454, 1104], axis=0)
data_no_retract['2'].value_counts()

eng    5475
Name: 2, dtype: int64

In [171]:
ls_index = []
for i in data_no_retract['1']:
    if len(i) < 7:
        for j in range(0,len(data_no_retract.loc[data_no_retract['1']==i].index)):
            ls_index.append(data_no_retract.loc[data_no_retract['1']==i].index[j])
data_no_retract = data_no_retract.drop(ls_index)

In [None]:
data_no_retract = data_no_retract.drop_duplicates()
data_no_retract

In [174]:
data_no_retract.to_csv('./pubmed_data_no_retraction_cleaned.csv')

### Pulling Data from PMC - No Retractions

In [175]:
data_no_retract = pd.read_csv('./pubmed_data_no_retraction_cleaned.csv')
print(data_no_retract.shape)
data_no_retract.head()

(5335, 13)


Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,0,1,2,3,4,5,6,7,8,9,10
0,0,0,33014689,10.1007/s13205-020-02432-w,eng,2020.0,9.0,19.0,10.0,10.0,3 Biotech,AuNPs/CNF-modified DNA biosensor for early and...,446
1,1,1,33014688,10.1007/s13205-020-02435-7,eng,2020.0,9.0,19.0,10.0,10.0,3 Biotech,NF-κB signaling induces inductive expression o...,445
2,2,2,33006702,10.1186/s41205-020-00083-4,eng,2020.0,10.0,2.0,6.0,1.0,3D printing in medicine,Clinical applications of custom 3D printed imp...,29
3,3,3,32997313,10.1186/s41205-020-00081-6,eng,2020.0,9.0,30.0,6.0,1.0,3D printing in medicine,3D printing in critical care: a narrative review.,28
4,4,4,32984538,10.4158/ACCR-2020-0076,eng,2020.0,8.0,6.0,6.0,5.0,AACE clinical case reports,CONVERSION OF HYPOTHYROIDISM TO HYPERTHYROIDIS...,e279-e281


In [None]:
ls_total_text = []
ls_total_keywords = []
ls_total_abstract = []
ls_publisher = []
count = 0
no_text = 0

for i in data_no_retract['1']:
    Entrez.email = 'lmpack01@outlook.com'
    handle = Entrez.esearch(db='pmc',term=i, retmode='xml')
    results = Entrez.read(handle)

    ids = ' , '.join(results['IdList'])
    Entrez.email = 'lmpack01@outlook.com'
    handle = Entrez.efetch(db='pmc', id = ids, retmode='xml', rettype='full')
    text = handle.read()

    soup = BeautifulSoup(text, "lxml")
    print(f"{i} [{count}]")
    
    try:
        ls_raw_text = []
        text = str()
        # Iterate over the tags directly; this also avoids reusing the outer loop variable `i`.
        for sec in soup.find_all("sec"):
            for p in sec.find_all("p"):
                ls_raw_text.append(str(p.text))

        text = ''.join(ls_raw_text)
        ls_total_text.append(text)
        
        if text=='':
            no_text += 1
            print(f'No text --> {no_text}')
    except:
        ls_total_text.append(None)
    
    try:
        ls_keywords = []
        for kwd in soup.find_all("kwd"):
            ls_keywords.append(kwd.text)
        ls_total_keywords.append(ls_keywords)
    except:
        ls_total_keywords.append(None)
    
    try:
        ls_abstract = []
        for i in soup.find_all("abstract")[0].text.split('\n'):
            if i == '':
                pass
            else:
                ls_abstract.append(i)
        ls_total_abstract.append(ls_abstract[0])
    except:
        ls_total_abstract.append(None)
    
    try:      
        ls_publisher.append(soup.find_all("publisher-name")[0].text)
    except:
        ls_publisher.append(None)
    
    count +=1
    time.sleep(3)

The first pull did not retrieve all of the data; the last DOI printed was 10.1371/journal.pone.0231127 (possibly index 4542).

In [183]:
pd.Series(data_no_retract['1'][0:4543]).to_csv('./doi__no_retraction__0_4543.csv')
pd.Series(ls_total_text).to_csv('./text__no_retraction__0_4543.csv')
pd.Series(ls_total_keywords).to_csv('./keywords__no_retraction__0_4543.csv')
pd.Series(ls_total_abstract).to_csv('./abstract__no_retraction__0_4543.csv')
pd.Series(ls_publisher).to_csv('./publisher__no_retraction__0_4543.csv')
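
Since the pull stopped partway and the lists were only written out afterwards, a long run can instead append each result to disk as it completes and skip finished DOIs on restart. The sketch below illustrates that idea with the standard `csv` module. `run_with_checkpoint`, `load_done`, `fake_fetch`, the file name, and the placeholder DOIs are all hypothetical; in the notebook, `fetch` would wrap the esearch/efetch/parse steps.

```python
import csv
import os
import tempfile


def load_done(path):
    """Return the set of DOIs already written to the checkpoint file."""
    if not os.path.exists(path):
        return set()
    with open(path, newline="", encoding="utf-8") as f:
        return {row["doi"] for row in csv.DictReader(f)}


def run_with_checkpoint(dois, fetch, path):
    """Fetch each DOI not yet in the file and append its result immediately."""
    done = load_done(path)
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["doi", "text"])
        if write_header:
            writer.writeheader()
        for doi in dois:
            if doi in done:
                continue  # already saved by an earlier run
            writer.writerow({"doi": doi, "text": fetch(doi)})
            f.flush()  # keep the file current in case the run is interrupted


def fake_fetch(doi):
    """Stand-in for the real PMC request and parsing."""
    return f"text for {doi}"


path = os.path.join(tempfile.mkdtemp(), "pmc_text.csv")
run_with_checkpoint(["10.1/a", "10.1/b"], fake_fetch, path)            # first, partial pull
run_with_checkpoint(["10.1/a", "10.1/b", "10.1/c"], fake_fetch, path)  # resumes at 10.1/c
print(sorted(load_done(path)))
```

With this layout the index of the last completed DOI never has to be reconstructed from printed output, because the file itself records what was finished.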

#### Second Pull

In [59]:
day = ['01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12', '13', '14', '15', '16',
                 '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31']
len(day)

31

In [37]:
day[:30]

['01',
 '02',
 '03',
 '04',
 '05',
 '06',
 '07',
 '08',
 '09',
 '10',
 '11',
 '12',
 '13',
 '14',
 '15',
 '16',
 '17',
 '18',
 '19',
 '20',
 '21',
 '22',
 '23',
 '24',
 '25',
 '26',
 '27',
 '28',
 '29',
 '30']

In [30]:
import numpy as np
np.random.choice(day, 12, replace = False)

array(['20', '03', '04', '02', '19', '10', '30', '08', '06', '02', '20',
       '01', '06', '09', '16', '22', '03', '06', '21', '28', '02', '28',
       '21', '03', '31', '06', '27', '13', '11', '21', '25', '30', '21',
       '15', '19', '19', '21', '29', '15', '15', '05', '30', '11', '28',
       '24', '22', '31', '02', '11', '02', '08', '14', '09', '09', '21',
       '16', '10', '09', '02', '26'], dtype='<U2')

In [60]:
start_month = ['01/01', '02/01', '03/01', '04/01', '05/01', '06/01', 
               '07/01', '08/01', '09/01', '10/01', '11/01', '12/01']
end_month = ['01/31', '02/28', '03/31', '04/30', '05/31', '06/30', 
            '07/31', '08/31', '09/30', '10/31', '11/30', '12/31']

month = ['01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12']
year = ['2015', '2016', '2017', '2018', '2019']
start_date = []
end_date = []

for i in year:
    for j in range(0,len(month)):
        # Compare the month string, not the integer index j, so each branch can actually match.
        if month[j] in ('01', '03', '05', '07', '08', '10', '12'):
            day_num = np.random.choice(day)
        elif month[j] == '02':
            day_num = np.random.choice(day[:28])
        else:
            day_num = np.random.choice(day[:30])
        start_date.append(i+'/'+start_month[j])
        end_date.append(i+'/'+month[j]+'/'+day_num)
        
print(len(start_date))

60
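
The date windows above run from the first of each month to a random day in that month. As a rough sketch of the same idea, the month lengths can be taken from the standard library's `calendar.monthrange` instead of hand-written lists, which also covers leap years. The function name `monthly_windows` and the seeded `random.Random` are assumptions made for illustration; they are not in the original notebook.

```python
import calendar
import random


def monthly_windows(years, rng=None):
    """Return (start, end) date strings, one pair per month of each year.

    The start is always the first of the month; the end is a random valid
    day of that month, formatted as YYYY/MM/DD.
    """
    rng = rng or random.Random()
    windows = []
    for year in years:
        for month in range(1, 13):
            # monthrange returns (weekday of day 1, number of days in the month).
            last_day = calendar.monthrange(year, month)[1]
            end_day = rng.randint(1, last_day)
            windows.append((f"{year}/{month:02d}/01",
                            f"{year}/{month:02d}/{end_day:02d}"))
    return windows


# Seeded only so that the sketch is repeatable.
windows = monthly_windows(range(2015, 2020), rng=random.Random(0))
print(len(windows))
```

Each pair can then be passed as `mindate` and `maxdate` in the same way as `start_date[i]` and `end_date[i]`.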


In [61]:
end_date

['2015/01/09',
 '2015/02/28',
 '2015/03/10',
 '2015/04/25',
 '2015/05/15',
 '2015/06/21',
 '2015/07/29',
 '2015/08/13',
 '2015/09/08',
 '2015/10/09',
 '2015/11/24',
 '2015/12/17',
 '2016/01/17',
 '2016/02/06',
 '2016/03/19',
 '2016/04/02',
 '2016/05/26',
 '2016/06/24',
 '2016/07/17',
 '2016/08/02',
 '2016/09/14',
 '2016/10/17',
 '2016/11/28',
 '2016/12/26',
 '2017/01/10',
 '2017/02/02',
 '2017/03/30',
 '2017/04/26',
 '2017/05/04',
 '2017/06/01',
 '2017/07/16',
 '2017/08/12',
 '2017/09/15',
 '2017/10/17',
 '2017/11/02',
 '2017/12/14',
 '2018/01/17',
 '2018/02/26',
 '2018/03/12',
 '2018/04/03',
 '2018/05/17',
 '2018/06/29',
 '2018/07/25',
 '2018/08/08',
 '2018/09/19',
 '2018/10/06',
 '2018/11/19',
 '2018/12/13',
 '2019/01/11',
 '2019/02/25',
 '2019/03/04',
 '2019/04/14',
 '2019/05/09',
 '2019/06/08',
 '2019/07/16',
 '2019/08/16',
 '2019/09/30',
 '2019/10/20',
 '2019/11/09',
 '2019/12/25']

In [62]:
ls_id = []
ls_doi = []
ls_language = []
ls_year = []
ls_month = []
ls_day = []
ls_volume = []
ls_issue = []
ls_journal = []
ls_title = []
ls_page = []
x=0
no_doi = 0

for i in range(0,len(start_date)):
    Entrez.email = 'lmpack01@outlook.com'
    handle = Entrez.esearch(db='pubmed',term='(PLoS One [Journal])', retmode='xml', retmax=167, mindate = start_date[i], maxdate = end_date[i])
    results = Entrez.read(handle)
    
    ids = ' , '.join(results['IdList'])

    
    if len(ids)==0:
        pass
    else:
        Entrez.email = 'lmpack01@outlook.com'
        handle = Entrez.efetch(db='pubmed', id = ids, retmode='xml', rettype='full')
        results_id = Entrez.read(handle)
        
        x += len(results['IdList'])
        print(x)
        
        for j in range(0, len(results['IdList'])):
            ls_id.append(results['IdList'][j])
            try:
                doi = str(results_id['PubmedArticle'][j]['MedlineCitation']['Article']['ELocationID'][0])
                ls_doi.append(doi)
            except:
                ls_doi.append(None)
                no_doi += 1
                print(no_doi)
                
            try:
                language = str(results_id['PubmedArticle'][j]['MedlineCitation']['Article']['Language'][0])
                ls_language.append(language)
            except:
                ls_language.append(None)
                
            try:
                # Use pub_* names so the year, month, and day lists defined earlier are not overwritten.
                pub_year = int(results_id['PubmedArticle'][j]['MedlineCitation']['Article']['ArticleDate'][0]['Year'])
                ls_year.append(pub_year)
                print(pub_year)
            except:
                ls_year.append(None)
            
            try:
                pub_month = int(results_id['PubmedArticle'][j]['MedlineCitation']['Article']['ArticleDate'][0]['Month'])
                ls_month.append(pub_month)
                print(pub_month)
            except:
                ls_month.append(None)
                
            try:
                pub_day = int(results_id['PubmedArticle'][j]['MedlineCitation']['Article']['ArticleDate'][0]['Day'])
                ls_day.append(pub_day)
                print(pub_day)
            except:
                ls_day.append(None)
                
            try:
                volume = int(results_id['PubmedArticle'][j]['MedlineCitation']['Article']['Journal']['JournalIssue']['Volume'])
                ls_volume.append(volume)
            except:
                ls_volume.append(None)
            
            try:
                issue = int(results_id['PubmedArticle'][j]['MedlineCitation']['Article']['Journal']['JournalIssue']['Issue'])
                ls_issue.append(issue)
            except:
                ls_issue.append(None)
                
            try:
                journal = str(results_id['PubmedArticle'][j]['MedlineCitation']['Article']['Journal']['Title'])
                ls_journal.append(journal)
            except:
                ls_journal.append(None)
                
            try:
                title = str(results_id['PubmedArticle'][j]['MedlineCitation']['Article']['ArticleTitle'])
                ls_title.append(title)
            except:
                ls_title.append(None)
                
            try:
                page = str(results_id['PubmedArticle'][j]['MedlineCitation']['Article']['Pagination']['MedlinePgn'])
                ls_page.append(page)
            except:
                ls_page.append(None)
        
    time.sleep(5)

167
2015
1
8
...
[Output truncated. After each batch the cell prints the running total of PMIDs fetched, followed by the ArticleDate year, month, and day of every article in that batch. Running totals visible in the export: 167, 1002, 1837, 2672, 3563, 4485, 5415, 6241, 7129, 7964, 8945. Some articles returned for the January 2015 window carry an ArticleDate of 2014/12/31.]
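
The metadata loop above repeats the same `try`/`except` pattern for every field. A small helper can express that pattern once. This is only a sketch: `get_nested` is a hypothetical name, and `sample` is a hand-made stand-in shaped like the keys used in the loop, not real Entrez output. It assumes the parsed record supports ordinary key and index lookup, as the loop above already relies on.

```python
def get_nested(record, path, cast=str):
    """Follow `path` (a sequence of keys and indexes) into `record`.

    Returns cast(value) when every step exists and the cast succeeds,
    otherwise None. Only lookup and conversion errors are caught.
    """
    current = record
    try:
        for key in path:
            current = current[key]
        return cast(current)
    except (KeyError, IndexError, TypeError, ValueError):
        return None


# Hand-made stand-in for one parsed article (illustrative values only).
sample = {
    "MedlineCitation": {
        "Article": {
            "ArticleDate": [{"Year": "2015", "Month": "01", "Day": "08"}],
            "Journal": {"JournalIssue": {"Volume": "10", "Issue": "Suppl 1"}},
            "ArticleTitle": "Example title",
        }
    }
}

article = ["MedlineCitation", "Article"]
pub_year = get_nested(sample, article + ["ArticleDate", 0, "Year"], int)
issue = get_nested(sample, article + ["Journal", "JournalIssue", "Issue"], int)
pages = get_nested(sample, article + ["Pagination", "MedlinePgn"])
print(pub_year, issue, pages)
```

Catching only lookup and conversion errors, rather than using a bare `except`, means that unrelated failures such as a keyboard interrupt or a network error still surface.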


In [63]:
data_no_retract = pd.concat([pd.Series(ls_id), pd.Series(ls_doi), pd.Series(ls_language), pd.Series(ls_year), pd.Series(ls_month), pd.Series(ls_day), pd.Series(ls_volume), pd.Series(ls_issue), pd.Series(ls_journal), pd.Series(ls_title), pd.Series(ls_page)], axis=1)
data_no_retract

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10
0,25569838,10.1371/journal.pone.0115528,eng,2015,1,8,10,1,PloS one,High incidence is not high exposure: what prop...,e0115528
1,25569796,10.1371/journal.pone.0115194,eng,2015,1,8,10,1,PloS one,Neurological abnormalities in full-term asphyx...,e0115194
2,25569682,10.1371/journal.pone.0117040,eng,2015,1,8,10,1,PloS one,Production of siderophores increases resistanc...,e0117040
3,25569558,10.1371/journal.pone.0116930,eng,2015,1,8,10,1,PloS one,Application of clinico-radiologic-pathologic d...,e0116930
4,25569428,10.1371/journal.pone.0116566,eng,2015,1,8,10,1,PloS one,Cinnamon ameliorates experimental allergic enc...,e0116566
...,...,...,...,...,...,...,...,...,...,...,...
9608,31856208,10.1371/journal.pone.0226734,eng,2019,12,19,14,12,PloS one,Statistical learning and the uncertainty of me...,e0226734
9609,31856207,10.1371/journal.pone.0226837,eng,2019,12,19,14,12,PloS one,Leishmania amazonensis resistance in murine ma...,e0226837
9610,31856205,10.1371/journal.pone.0226726,eng,2019,12,19,14,12,PloS one,Disparities in survival by stage after surgery...,e0226726
9611,31856204,10.1371/journal.pone.0227068,eng,2019,12,19,14,12,PloS one,Correction: Elevated levels of eEF1A2 protein ...,e0227068


In [65]:
data_no_retract[3].value_counts()

2019    1983
2016    1893
2018    1890
2015    1880
2017    1843
2014     124
Name: 3, dtype: int64

In [66]:
data_no_retract[4].value_counts()

12    959
5     930
10    918
7     835
9     835
3     823
2     804
1     768
11    743
8     724
6     668
4     606
Name: 4, dtype: int64

In [67]:
data_no_retract[5].value_counts()

14    806
12    642
13    490
23    477
31    474
15    463
28    424
22    421
11    420
6     385
8     334
7     324
18    317
24    310
5     303
10    301
19    269
1     253
9     243
16    238
3     236
25    219
27    212
17    199
4     172
2     169
26    167
20    156
21     98
29     91
Name: 5, dtype: int64

In [68]:
data_no_retract.to_csv('./pubmed_data_second_no_retraction.csv')

In [69]:
data_no_retract = pd.read_csv('./pubmed_data_second_no_retraction.csv')
data_no_retract = data_no_retract.dropna(axis=0, subset=['1'])
data_no_retract

Unnamed: 0.1,Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10
0,0,25569838,10.1371/journal.pone.0115528,eng,2015,1,8,10,1,PloS one,High incidence is not high exposure: what prop...,e0115528
1,1,25569796,10.1371/journal.pone.0115194,eng,2015,1,8,10,1,PloS one,Neurological abnormalities in full-term asphyx...,e0115194
2,2,25569682,10.1371/journal.pone.0117040,eng,2015,1,8,10,1,PloS one,Production of siderophores increases resistanc...,e0117040
3,3,25569558,10.1371/journal.pone.0116930,eng,2015,1,8,10,1,PloS one,Application of clinico-radiologic-pathologic d...,e0116930
4,4,25569428,10.1371/journal.pone.0116566,eng,2015,1,8,10,1,PloS one,Cinnamon ameliorates experimental allergic enc...,e0116566
...,...,...,...,...,...,...,...,...,...,...,...,...
9608,9608,31856208,10.1371/journal.pone.0226734,eng,2019,12,19,14,12,PloS one,Statistical learning and the uncertainty of me...,e0226734
9609,9609,31856207,10.1371/journal.pone.0226837,eng,2019,12,19,14,12,PloS one,Leishmania amazonensis resistance in murine ma...,e0226837
9610,9610,31856205,10.1371/journal.pone.0226726,eng,2019,12,19,14,12,PloS one,Disparities in survival by stage after surgery...,e0226726
9611,9611,31856204,10.1371/journal.pone.0227068,eng,2019,12,19,14,12,PloS one,Correction: Elevated levels of eEF1A2 protein ...,e0227068


In [70]:
data_no_retract['2'].value_counts()

eng    9613
Name: 2, dtype: int64

In [71]:
# Drop rows whose identifier is too short to be a DOI (fewer than 7 characters).
data_no_retract = data_no_retract[data_no_retract['1'].str.len() >= 7]
data_no_retract

Unnamed: 0.1,Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10
0,0,25569838,10.1371/journal.pone.0115528,eng,2015,1,8,10,1,PloS one,High incidence is not high exposure: what prop...,e0115528
1,1,25569796,10.1371/journal.pone.0115194,eng,2015,1,8,10,1,PloS one,Neurological abnormalities in full-term asphyx...,e0115194
2,2,25569682,10.1371/journal.pone.0117040,eng,2015,1,8,10,1,PloS one,Production of siderophores increases resistanc...,e0117040
3,3,25569558,10.1371/journal.pone.0116930,eng,2015,1,8,10,1,PloS one,Application of clinico-radiologic-pathologic d...,e0116930
4,4,25569428,10.1371/journal.pone.0116566,eng,2015,1,8,10,1,PloS one,Cinnamon ameliorates experimental allergic enc...,e0116566
...,...,...,...,...,...,...,...,...,...,...,...,...
9608,9608,31856208,10.1371/journal.pone.0226734,eng,2019,12,19,14,12,PloS one,Statistical learning and the uncertainty of me...,e0226734
9609,9609,31856207,10.1371/journal.pone.0226837,eng,2019,12,19,14,12,PloS one,Leishmania amazonensis resistance in murine ma...,e0226837
9610,9610,31856205,10.1371/journal.pone.0226726,eng,2019,12,19,14,12,PloS one,Disparities in survival by stage after surgery...,e0226726
9611,9611,31856204,10.1371/journal.pone.0227068,eng,2019,12,19,14,12,PloS one,Correction: Elevated levels of eEF1A2 protein ...,e0227068
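
The length-based filter above is a quick heuristic for discarding identifiers that are not DOIs. A slightly stricter check, sketched below, tests the general shape of a DOI: a `10.` prefix, a numeric registrant code, a slash, and a non-empty suffix. The pattern and the name `looks_like_doi` are assumptions chosen for illustration; the pattern is a common approximation and not a full DOI validator.

```python
import re

# Approximate DOI shape: "10.", a 4-9 digit registrant code, "/", then a suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def looks_like_doi(value):
    """Return True when `value` is a string shaped like a DOI."""
    return isinstance(value, str) and bool(DOI_PATTERN.match(value))


candidates = ["10.1371/journal.pone.0115528", "e0115528", "", None]
print([looks_like_doi(c) for c in candidates])
```

Applied to the DOI column, this would also reject longer non-DOI identifiers that a length check lets through.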


In [72]:
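# Deduplicate on the DOI column ('1'); the saved index column 'Unnamed: 0' is unique per row, so a whole-row comparison would never match.
data_no_retract = data_no_retract.drop_duplicates(subset=['1'])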
data_no_retract

Unnamed: 0.1,Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10
0,0,25569838,10.1371/journal.pone.0115528,eng,2015,1,8,10,1,PloS one,High incidence is not high exposure: what prop...,e0115528
1,1,25569796,10.1371/journal.pone.0115194,eng,2015,1,8,10,1,PloS one,Neurological abnormalities in full-term asphyx...,e0115194
2,2,25569682,10.1371/journal.pone.0117040,eng,2015,1,8,10,1,PloS one,Production of siderophores increases resistanc...,e0117040
3,3,25569558,10.1371/journal.pone.0116930,eng,2015,1,8,10,1,PloS one,Application of clinico-radiologic-pathologic d...,e0116930
4,4,25569428,10.1371/journal.pone.0116566,eng,2015,1,8,10,1,PloS one,Cinnamon ameliorates experimental allergic enc...,e0116566
...,...,...,...,...,...,...,...,...,...,...,...,...
9608,9608,31856208,10.1371/journal.pone.0226734,eng,2019,12,19,14,12,PloS one,Statistical learning and the uncertainty of me...,e0226734
9609,9609,31856207,10.1371/journal.pone.0226837,eng,2019,12,19,14,12,PloS one,Leishmania amazonensis resistance in murine ma...,e0226837
9610,9610,31856205,10.1371/journal.pone.0226726,eng,2019,12,19,14,12,PloS one,Disparities in survival by stage after surgery...,e0226726
9611,9611,31856204,10.1371/journal.pone.0227068,eng,2019,12,19,14,12,PloS one,Correction: Elevated levels of eEF1A2 protein ...,e0227068


In [78]:
ls_total_text = []
ls_total_keywords = []
ls_total_abstract = []
ls_publisher = []
count = 6166
no_text = 0

for i in data_no_retract['1'][6166:]:
    Entrez.email = 'lmpack01@outlook.com'
    handle = Entrez.esearch(db='pmc',term=i, retmode='xml')
    results = Entrez.read(handle)

    ids = ' , '.join(results['IdList'])
    Entrez.email = 'lmpack01@outlook.com'
    handle = Entrez.efetch(db='pmc', id = ids, retmode='xml', rettype='full')
    text = handle.read()

    soup = BeautifulSoup(text, "lxml")
    print(f"{i} [{count}]")
    
    try:
        # Join the text of every <p> found under each <sec> element.
        ls_raw_text = [str(p.text) for sec in soup.find_all("sec") for p in sec.find_all("p")]
        text = ''.join(ls_raw_text)
        ls_total_text.append(text)
        
        if text=='':
            no_text += 1
            print(f'No text --> {no_text}')
    except:
        ls_total_text.append(None)
    
    try:
        ls_keywords = []
        for i in range(0,len(soup.find_all("kwd"))):
            ls_keywords.append(soup.find_all("kwd")[i].text)
        ls_total_keywords.append(ls_keywords)
    except:
        ls_total_keywords.append(None)
    
    try:
        ls_abstract = []
        for i in soup.find_all("abstract")[0].text.split('\n'):
            if i == '':
                pass
            else:
                ls_abstract.append(i)
        ls_total_abstract.append(ls_abstract[0])
    except:
        ls_total_abstract.append(None)
    
    try:      
        ls_publisher.append(soup.find_all("publisher-name")[0].text)
    except:
        ls_publisher.append(None)
    
    count +=1
    time.sleep(3)
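
One detail of the text extraction in the loop above: it visits every `<sec>` element and collects every `<p>` beneath it, so a paragraph inside a nested section is collected once for each enclosing section. The sketch below shows one way to collect each paragraph only once. It uses the standard library's `xml.etree.ElementTree` on a small hand-written XML sample as a stand-in for the PMC response; the function name `section_paragraphs` and the sample are assumptions for illustration, not the notebook's method.

```python
import xml.etree.ElementTree as ET

# Hand-written sample with one nested <sec>; not real PMC output.
SAMPLE = (
    "<body>"
    "<sec><title>Intro</title><p>One.</p>"
    "<sec><p>Two <italic>nested</italic>.</p></sec></sec>"
    "<sec><p>Three.</p></sec>"
    "</body>"
)


def section_paragraphs(xml_text):
    """Return the text of each <p> that sits inside a <sec>, once, in document order."""
    root = ET.fromstring(xml_text)
    seen = set()
    paragraphs = []
    for sec in root.iter("sec"):
        for p in sec.iter("p"):
            if id(p) in seen:
                continue  # already collected through an enclosing <sec>
            seen.add(id(p))
            # itertext() includes text inside inline children such as <italic>.
            paragraphs.append("".join(p.itertext()).strip())
    return paragraphs


paragraphs = section_paragraphs(SAMPLE)
print(paragraphs)
```

Whether the repeated paragraphs matter depends on the downstream model; for word-frequency features they would overweight text from nested sections.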

10.1371/journal.pone.0193828 [6166]
10.1371/journal.pone.0193599 [6167]
10.1371/journal.pone.0193560 [6168]
10.1371/journal.pone.0193670 [6169]
10.1371/journal.pone.0193323 [6170]
10.1371/journal.pone.0193649 [6171]
10.1371/journal.pone.0193452 [6172]
10.1371/journal.pone.0193612 [6173]
10.1371/journal.pone.0193894 [6174]
10.1371/journal.pone.0193164 [6175]
10.1371/journal.pone.0193879 [6176]
10.1371/journal.pone.0193205 [6177]
10.1371/journal.pone.0193830 [6178]
10.1371/journal.pone.0193674 [6179]
10.1371/journal.pone.0193867 [6180]
10.1371/journal.pone.0193748 [6181]
10.1371/journal.pone.0193219 [6182]
10.1371/journal.pone.0193878 [6183]
10.1371/journal.pone.0193643 [6184]
10.1371/journal.pone.0193829 [6185]
10.1371/journal.pone.0193404 [6186]
10.1371/journal.pone.0193893 [6187]
10.1371/journal.pone.0193673 [6188]
10.1371/journal.pone.0193491 [6189]
10.1371/journal.pone.0193542 [6190]
10.1371/journal.pone.0193723 [6191]
10.1371/journal.pone.0193572 [6192]
[Progress output truncated in the export. Between indices 6419 and 7306 the "No text" counter rises from 4 to 63.]
10.1371/journal.pone.0214404 [8173]
10.1371/journal.pone.0214686 [8174]
10.1371/journal.pone.0213945 [8175]
10.1371/journal.pone.0214551 [8176]
10.1371/journal.pone.0213639 [8177]
10.1371/journal.pone.0206709 [8178]
10.1371/journal.pone.0214386 [8179]
10.1371/journal.pone.0208963 [8180]
10.1371/journal.pone.0213300 [8181]
10.1371/journal.pone.0206334 [8182]
10.1371/journal.pone.0214513 [8183]
10.1371/journal.pone.0213938 [8184]
10.1371/journal.pone.0215491 [8185]
10.1371/journal.pone.0215348 [8186]
10.1371/journal.pone.0215129 [8187]
10.1371/journal.pone.0215325 [8188]
10.1371/journal.pone.0215310 [8189]
10.1371/journal.pone.0215316 [8190]
10.1371/journal.pone.0213762 [8191]
10.1371/journal.pone.0215324 [8192]
10.1371/journal.pone.0215058 [8193]
10.1371/journal.pone.0215074 [8194]
10.1371/journal.pone.0215065 [8195]
10.1371/journal.pone.0215532 [8196]
10.1371/journal.pone.0215064 [8197]
10.1371/journal.pone.0215624 [8198]
No text --> 111
10.1371/journal.pone.0215329 [8199]
10.1371/jour

10.1371/journal.pone.0215964 [8399]
10.1371/journal.pone.0216071 [8400]
10.1371/journal.pone.0216039 [8401]
10.1371/journal.pone.0216062 [8402]
10.1371/journal.pone.0216235 [8403]
10.1371/journal.pone.0216281 [8404]
10.1371/journal.pone.0216237 [8405]
10.1371/journal.pone.0216410 [8406]
10.1371/journal.pone.0216229 [8407]
10.1371/journal.pone.0216392 [8408]
10.1371/journal.pone.0216469 [8409]
10.1371/journal.pone.0216404 [8410]
10.1371/journal.pone.0216363 [8411]
10.1371/journal.pone.0216249 [8412]
10.1371/journal.pone.0216181 [8413]
10.1371/journal.pone.0216556 [8414]
10.1371/journal.pone.0215872 [8415]
10.1371/journal.pone.0208871 [8416]
10.1371/journal.pone.0215591 [8417]
10.1371/journal.pone.0215898 [8418]
10.1371/journal.pone.0216344 [8419]
10.1371/journal.pone.0216079 [8420]
10.1371/journal.pone.0216208 [8421]
10.1371/journal.pone.0215855 [8422]
10.1371/journal.pone.0216220 [8423]
10.1371/journal.pone.0215779 [8424]
10.1371/journal.pone.0215915 [8425]
10.1371/journal.pone.0216088

10.1371/journal.pone.0219594 [8621]
10.1371/journal.pone.0219493 [8622]
10.1371/journal.pone.0219831 [8623]
10.1371/journal.pone.0219676 [8624]
10.1371/journal.pone.0219685 [8625]
10.1371/journal.pone.0219354 [8626]
10.1371/journal.pone.0219830 [8627]
10.1371/journal.pone.0219844 [8628]
10.1371/journal.pone.0219792 [8629]
10.1371/journal.pone.0219603 [8630]
10.1371/journal.pone.0219626 [8631]
10.1371/journal.pone.0219691 [8632]
10.1371/journal.pone.0219456 [8633]
10.1371/journal.pone.0219712 [8634]
10.1371/journal.pone.0219746 [8635]
10.1371/journal.pone.0219841 [8636]
10.1371/journal.pone.0219428 [8637]
10.1371/journal.pone.0218877 [8638]
10.1371/journal.pone.0219491 [8639]
10.1371/journal.pone.0219559 [8640]
10.1371/journal.pone.0218935 [8641]
10.1371/journal.pone.0219429 [8642]
10.1371/journal.pone.0218574 [8643]
10.1371/journal.pone.0218861 [8644]
10.1371/journal.pone.0218547 [8645]
10.1371/journal.pone.0217712 [8646]
10.1371/journal.pone.0215585 [8647]
10.1371/journal.pone.0218040

10.1371/journal.pone.0220966 [8844]
10.1371/journal.pone.0221063 [8845]
10.1371/journal.pone.0220920 [8846]
10.1371/journal.pone.0221052 [8847]
10.1371/journal.pone.0220985 [8848]
10.1371/journal.pone.0220840 [8849]
10.1371/journal.pone.0220863 [8850]
10.1371/journal.pone.0220461 [8851]
10.1371/journal.pone.0220809 [8852]
10.1371/journal.pone.0220913 [8853]
10.1371/journal.pone.0220682 [8854]
10.1371/journal.pone.0220577 [8855]
10.1371/journal.pone.0220703 [8856]
10.1371/journal.pone.0220749 [8857]
10.1371/journal.pone.0220439 [8858]
10.1371/journal.pone.0220535 [8859]
10.1371/journal.pone.0220737 [8860]
10.1371/journal.pone.0220386 [8861]
10.1371/journal.pone.0220198 [8862]
10.1371/journal.pone.0219201 [8863]
10.1371/journal.pone.0220728 [8864]
10.1371/journal.pone.0219754 [8865]
10.1371/journal.pone.0220503 [8866]
10.1371/journal.pone.0220761 [8867]
10.1371/journal.pone.0220750 [8868]
10.1371/journal.pone.0220780 [8869]
10.1371/journal.pone.0219490 [8870]
10.1371/journal.pone.0220646

10.1371/journal.pone.0222715 [9068]
10.1371/journal.pone.0221640 [9069]
10.1371/journal.pone.0218591 [9070]
10.1371/journal.pone.0219109 [9071]
10.1371/journal.pone.0222672 [9072]
10.1371/journal.pone.0215888 [9073]
10.1371/journal.pone.0222494 [9074]
10.1371/journal.pone.0219729 [9075]
10.1371/journal.pone.0216507 [9076]
10.1371/journal.pone.0221270 [9077]
10.1371/journal.pone.0222579 [9078]
10.1371/journal.pone.0210731 [9079]
10.1371/journal.pone.0222411 [9080]
10.1371/journal.pone.0222174 [9081]
10.1371/journal.pone.0221396 [9082]
10.1371/journal.pone.0222501 [9083]
10.1371/journal.pone.0222395 [9084]
10.1371/journal.pone.0222499 [9085]
10.1371/journal.pone.0221906 [9086]
10.1371/journal.pone.0222708 [9087]
10.1371/journal.pone.0222453 [9088]
10.1371/journal.pone.0222407 [9089]
10.1371/journal.pone.0221805 [9090]
10.1371/journal.pone.0217888 [9091]
10.1371/journal.pone.0222651 [9092]
10.1371/journal.pone.0222602 [9093]
10.1371/journal.pone.0222420 [9094]
10.1371/journal.pone.0222612

10.1371/journal.pone.0224904 [9294]
10.1371/journal.pone.0225081 [9295]
10.1371/journal.pone.0224880 [9296]
10.1371/journal.pone.0224876 [9297]
10.1371/journal.pone.0224898 [9298]
10.1371/journal.pone.0224888 [9299]
10.1371/journal.pone.0224831 [9300]
10.1371/journal.pone.0225066 [9301]
10.1371/journal.pone.0224900 [9302]
10.1371/journal.pone.0223723 [9303]
10.1371/journal.pone.0224818 [9304]
10.1371/journal.pone.0224616 [9305]
10.1371/journal.pone.0224804 [9306]
10.1371/journal.pone.0224829 [9307]
10.1371/journal.pone.0224451 [9308]
10.1371/journal.pone.0224760 [9309]
10.1371/journal.pone.0224500 [9310]
10.1371/journal.pone.0224331 [9311]
10.1371/journal.pone.0224609 [9312]
10.1371/journal.pone.0224820 [9313]
10.1371/journal.pone.0224756 [9314]
10.1371/journal.pone.0224727 [9315]
10.1371/journal.pone.0224507 [9316]
10.1371/journal.pone.0224677 [9317]
10.1371/journal.pone.0223847 [9318]
10.1371/journal.pone.0224362 [9319]
10.1371/journal.pone.0224485 [9320]
10.1371/journal.pone.0223834

10.1371/journal.pone.0225996 [9518]
10.1371/journal.pone.0226350 [9519]
10.1371/journal.pone.0226795 [9520]
10.1371/journal.pone.0226715 [9521]
10.1371/journal.pone.0226570 [9522]
10.1371/journal.pone.0226599 [9523]
10.1371/journal.pone.0226277 [9524]
10.1371/journal.pone.0226696 [9525]
10.1371/journal.pone.0226564 [9526]
10.1371/journal.pone.0226839 [9527]
10.1371/journal.pone.0226561 [9528]
10.1371/journal.pone.0226613 [9529]
10.1371/journal.pone.0226391 [9530]
10.1371/journal.pone.0226759 [9531]
10.1371/journal.pone.0226506 [9532]
10.1371/journal.pone.0226625 [9533]
10.1371/journal.pone.0218644 [9534]
10.1371/journal.pone.0226971 [9535]
10.1371/journal.pone.0226986 [9536]
10.1371/journal.pone.0226490 [9537]
10.1371/journal.pone.0226594 [9538]
10.1371/journal.pone.0226260 [9539]
10.1371/journal.pone.0226155 [9540]
10.1371/journal.pone.0226804 [9541]
10.1371/journal.pone.0226686 [9542]
10.1371/journal.pone.0226842 [9543]
10.1371/journal.pone.0226848 [9544]
10.1371/journal.pone.0218837

In [79]:
# Save the final slice (rows 6166 onward): DOIs plus each scraped field, one CSV per field.
pd.Series(data_no_retract['1'][6166:]).to_csv('./doi__no_retraction__6166_end.csv')
pd.Series(ls_total_text).to_csv('./text__no_retraction__6166_end.csv')
pd.Series(ls_total_keywords).to_csv('./keywords__no_retraction__6166_end.csv')
pd.Series(ls_total_abstract).to_csv('./abstract__no_retraction__6166_end.csv')
pd.Series(ls_publisher).to_csv('./publisher__no_retraction__6166_end.csv')
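
`Series.to_csv` writes the index by default, which is why the reloaded files carry an extra `Unnamed: 0` column. Below is a minimal sketch of one way to avoid that, assuming a hypothetical helper `save_chunk` and made-up sample strings. It writes to an in-memory buffer rather than a real file, so it is only an illustration and not the notebook's actual save step.

```python
import io

import pandas as pd


def save_chunk(values, path_or_buf, name):
    """Write one column of scraped values without the positional index.

    Hypothetical helper: `name` becomes the CSV header for the column.
    """
    pd.Series(values, name=name).to_csv(path_or_buf, index=False)


# Round trip through an in-memory buffer instead of a file on disk.
buf = io.StringIO()
save_chunk(["first article text", "second article text"], buf, name="text")
buf.seek(0)
reloaded = pd.read_csv(buf)
print(reloaded.columns.tolist())
```

Dropping the index keeps the files smaller, but it also removes the original row number. If chunks are later matched back to `data_no_retract` by position, keeping the index may be the safer choice.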

In [81]:
data_no_retract.shape

(9613, 12)

In [97]:
# Reload the three saved text chunks and stack them row-wise.
one = pd.read_csv('./text__no_retraction__0_5545.csv')
two = pd.read_csv('./text__no_retraction__5545_6166.csv')
three = pd.read_csv('./text__no_retraction__6166_end.csv')

total = pd.concat([one, two, three], axis=0)
total.shape

(9615, 2)

In [98]:
total['0'].isnull().sum()

399

In [104]:
total['0'].value_counts().head(6)

(XLSX)Click here for additional data file.
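
The most frequent value here is a supplementary-file caption, not article text, so some rows may hold no usable body text even though they are not null. Below is a minimal sketch for flagging such rows. The regular expression and the helper name `is_placeholder` are assumptions for illustration, and the sample series is made up.

```python
import re

import pandas as pd

# Assumed pattern: an optional "(XLSX)"-style format tag, then the stock caption.
PLACEHOLDER = re.compile(
    r"^\s*(\([A-Z0-9]+\))?\s*Click here for additional data file\.\s*$"
)


def is_placeholder(text):
    """True when the whole entry is only a supplementary-file caption."""
    return isinstance(text, str) and bool(PLACEHOLDER.match(text))


# Made-up sample: one caption-only row, one real sentence, one missing value.
sample = pd.Series([
    "(XLSX)Click here for additional data file.   ",
    "Osteoporosis is a skeletal disease.",
    None,
])
mask = sample.map(is_placeholder)
print(mask.tolist())
```

Rows flagged this way could be counted alongside the null rows when deciding how many articles actually have text available.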

In [107]:
one = pd.read_csv('./text__no_retraction__0_5545.csv')
one.tail()

Unnamed: 0.1,Unnamed: 0,0
5541,5541,"Winter air pollution in Ulaanbaatar, Mongolia ..."
5542,5542,Chronic infection with hepatitis C virus (HCV)...
5543,5543,California has one of the most highly engineer...
5544,5544,Thiol-dependent cathepsins are found in all li...
5545,5545,To estimate hepatitis C virus (HCV) viremic ra...


In [108]:
two = pd.read_csv('./text__no_retraction__5545_6166.csv')
two.head()

Unnamed: 0.1,Unnamed: 0,0
0,0,To estimate hepatitis C virus (HCV) viremic ra...
1,1,Pollinators are crucial in almost all terrestr...
2,2,The association of melanosis coli with the dev...
3,3,"In most plant species, repetitive DNA constitu..."
4,4,"In the last decades, there has been a great in..."


In [109]:
two.tail()

Unnamed: 0.1,Unnamed: 0,0
617,617,Bread wheat (Triticum aestivum L.) is one of t...
618,618,"Metabolic syndrome (MetS), defined as a comple..."
619,619,Competitive learning techniques are being succ...
620,620,"In South East Africa, about 100,000 years ago ..."
621,621,In the text of González-Fernández [1] can be f...


In [110]:
three = pd.read_csv('./text__no_retraction__6166_end.csv')
three.head()

Unnamed: 0.1,Unnamed: 0,0
0,0,In the text of González-Fernández [1] can be f...
1,1,Infantile spasms (IS) are the defining seizure...
2,2,The arylamine N-acetyltransferases are a famil...
3,3,Several biomarkers have been proposed for ultr...
4,4,Osteoporosis is a skeletal disease characteriz...
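
The previews above show the last row of the first text file repeated as the first row of the second, and the last row of the second repeated as the first row of the third. That would account for `total` having two more rows than `data_no_retract` (9615 versus 9613). Below is a minimal sketch of stacking chunks while skipping such repeats. It assumes a hypothetical helper `concat_without_boundary_repeats`, an overlap exactly one row wide, and text stored in column `"0"`. The frames are toy data.

```python
import pandas as pd


def concat_without_boundary_repeats(chunks, column="0"):
    """Concatenate chunks, dropping a chunk's first row when it repeats
    the last row of the chunk before it (hypothetical helper)."""
    kept = [chunks[0]]
    for prev, cur in zip(chunks, chunks[1:]):
        repeats = (
            len(prev) > 0
            and len(cur) > 0
            and prev[column].iloc[-1] == cur[column].iloc[0]
        )
        kept.append(cur.iloc[1:] if repeats else cur)
    return pd.concat(kept, axis=0, ignore_index=True)


# Toy chunks that overlap by one row at each boundary.
one = pd.DataFrame({"0": ["a", "b", "c"]})
two = pd.DataFrame({"0": ["c", "d", "e"]})
three = pd.DataFrame({"0": ["e", "f"]})
total = concat_without_boundary_repeats([one, two, three])
print(total["0"].tolist())
```

Because missing values never compare equal to each other, a boundary row with no text would not be treated as a repeat. In that case the saved index column would be a more reliable key than the text itself.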


In [105]:
one = pd.read_csv('./doi__no_retraction__0_5545.csv')
one.tail()

Unnamed: 0.1,Unnamed: 0,1
5540,5540,10.1371/journal.pone.0186821
5541,5541,10.1371/journal.pone.0186834
5542,5542,10.1371/journal.pone.0186898
5543,5543,10.1371/journal.pone.0187181
5544,5544,10.1371/journal.pone.0186869


In [106]:
two = pd.read_csv('./doi__no_retraction__5545_6166.csv')
two.head()

Unnamed: 0.1,Unnamed: 0,1
0,5545,10.1371/journal.pone.0187177
1,5546,10.1371/journal.pone.0187079
2,5547,10.1371/journal.pone.0186668
3,5548,10.1371/journal.pone.0187131
4,5549,10.1371/journal.pone.0186957
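
Unlike the text files, the DOI files appear to line up: the first file ends at saved index 5544 and the second starts at 5545. Below is a minimal sketch of checking this programmatically. It assumes the `Unnamed: 0` column holds the original row index, and it uses a hypothetical helper `boundaries_are_contiguous` with placeholder DOI strings.

```python
import pandas as pd


def boundaries_are_contiguous(chunks, index_column="Unnamed: 0"):
    """True when each chunk's first saved index directly follows the
    previous chunk's last saved index (hypothetical helper)."""
    return all(
        int(cur[index_column].iloc[0]) == int(prev[index_column].iloc[-1]) + 1
        for prev, cur in zip(chunks, chunks[1:])
    )


# Toy frames mirroring the boundary seen in the previews; DOI values are placeholders.
one = pd.DataFrame({"Unnamed: 0": [5543, 5544], "1": ["doi-a", "doi-b"]})
two = pd.DataFrame({"Unnamed: 0": [5545, 5546], "1": ["doi-c", "doi-d"]})
print(boundaries_are_contiguous([one, two]))
```

The same check would not work on the text files as saved, since their index column restarts at 0 in each chunk.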
