# Data Collection

## Importing Libraries

In [2]:
from Bio import Entrez
import pandas as pd
import numpy as np
import requests
import time
from bs4 import BeautifulSoup

Biopython (imported as Bio) is a free Python library developed for bioinformatics work. It includes a module for the Entrez Programming Utilities (imported as Entrez), a set of tools for querying the databases maintained by the National Center for Biotechnology Information (NCBI). These databases include both PubMed and PMC, the two databases used for this project. For more information about these libraries and functions, please see the README file in this repository.

## Data Collection Workflow

Part of the difficulty in pulling the data for this project is that published academic research is typically accessible only if a fee is paid. To avoid this problem, the full text of every article used in this project needed to be available from the PMC database, a free, full-text archive hosted, like PubMed, by the NCBI. Additionally, certain search terms were necessary to differentiate between retracted and non-retracted articles. However, these search terms were not usable when querying the PMC database.

Because of these complications, a multi-step process was used to pull the necessary data. For the retraction data, a CSV file made available by PMC was used to determine all of the journals that can be found within the PMC archive. These journal names were combined with the search term that limits results to retracted articles, and the combined terms were used to query the PubMed database. Information about each retracted article from the listed journals was placed into a dataframe, including the digital object identifier (DOI) of the article. A DOI is a unique string of characters that identifies a published journal article. These DOI values were then used as the search terms for querying the PMC database to retrieve the full text of each retracted article.

For the non-retracted data, initial research for the project showed that the journal PLOS ONE was responsible for a significant portion of the retracted journal articles. Additionally, this journal is not limited in scope, meaning that an article on nearly any STEM topic may be accepted. Because of this, all non-retracted articles were pulled from PLOS ONE to avoid influences introduced by querying on the metadata of the non-retracted articles. Thus, the PLOS ONE identifier was used as a search term for querying the PubMed database, together with a randomly chosen date range. As in the retraction data workflow, information about the articles was placed into a dataframe, including the DOI of each article. The DOI values were then used as the search terms for querying the PMC database to retrieve the full text of each PLOS ONE article that had not been retracted.

All of this information was saved into various CSV files throughout the querying process to be cleaned in a later notebook. For more information about the difference between the PubMed and PMC databases, please see the README file in this repository.
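The two PubMed search terms described in this workflow can be assembled with small helper functions. The sketch below is only an illustration: the function names are hypothetical, and the term strings mirror the ones passed to Entrez.esearch in the query cells of this notebook.

```python
def retraction_term(journal_abbrev):
    """Search term for retracted articles in one journal (NLM title abbreviation)."""
    return f"(hasretractionin) AND ({journal_abbrev}[Journal])"


def plos_one_term():
    """Search term that restricts a PubMed query to PLOS ONE."""
    return "(PLoS One [Journal])"


# Example with a journal abbreviation taken from the PMC journal list.
print(retraction_term("3 Biotech"))
print(plos_one_term())
```

Keeping the term construction in one place makes it easier to inspect the exact string that is sent to PubMed before a long query loop is started.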

## Accessing Journal Names

As described in the "Data Collection Workflow" section above, the journals that were part of the PMC database needed to be determined. The CSV file below contained this information.

In [None]:
df = pd.read_csv('./journals_in_pmc.csv')

In [44]:
df['Participation level'].value_counts()

 Full              3104
 NIH Portfolio      448
Name: Participation level, dtype: int64

Not all of the journals within the PMC database participated fully in releasing the text of each article. Those journals that did not participate fully needed to be removed so that querying time would not be wasted on articles that I would not be able to later access.

In [45]:
df = pd.concat((df, pd.get_dummies(df['Participation level'], prefix='participation')), axis=1)

In [46]:
df['participation_full'] = df['participation_ Full ']

In [47]:
df = df.drop(columns=['participation_ NIH Portfolio ', 'participation_ Full '])

In [48]:
df.loc[df['participation_full'] == 0].index

Int64Index([  17,   23,   24,   25,   26,   27,   28,   29,   30,   35,
            ...
            3396, 3408, 3415, 3427, 3456, 3464, 3476, 3478, 3479, 3544],
           dtype='int64', length=448)

In [49]:
df = df.drop(df.loc[df['participation_full'] == 0].index, axis=0)

In [None]:
df.to_csv('./journals_in_pmc_clean.csv')

In [157]:
df = pd.read_csv('./journals_in_pmc_clean.csv', index_col=False)
df = df.drop(columns=['Unnamed: 0'])
df.head()

| | Journal title | NLM TA | pISSN | eISSN | Publisher | LOCATORplus ID | Latest issue | Earliest volume | Free access | Open access | Participation level | Deposit status | Journal URL | participation_full |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 Biotech | 3 Biotech | 2190-572X | 2190-5738 | Springer | 101565857 | v.10(9);Sep 2020 | v.1;2011 | 12 months | Some | Full | | http://www.ncbi.nlm.nih.gov/pmc/journals/1811/ | 1 |
| 1 | 3D Printing in Medicine | 3D Print Med | | 2365-6271 | BioMed Central | 101721758 | v.5;Dec 2019 | v.2;2016 | Immediate | All | Full | | http://www.ncbi.nlm.nih.gov/pmc/journals/3516/ | 1 |
| 2 | AACE Clinical Case Reports | AACE Clin Case Rep | | 2376-0605 | American Association of Clinical Endocrinologists | 101670593 | v.6(5);Sep-Oct 2020 | v.5;2019 | Immediate | No | Full | | http://www.ncbi.nlm.nih.gov/pmc/journals/3582/ | 1 |
| 3 | The AAPS Journal | AAPS J | | 1550-7416 | American Association of Pharmaceutical Scientists | 101223209 | v.18(3);May 2016 | v.6;2004 | | Some | Full | No New Content | http://www.ncbi.nlm.nih.gov/pmc/journals/792/ | 1 |
| 4 | AAPS PharmSci | AAPS PharmSci | | 1522-1059 | American Association of Pharmaceutical Scientists | 100897065 | v.6(2);Jun 2004 | v.1;1999 | Immediate | No | Full | Predecessor | http://www.ncbi.nlm.nih.gov/pmc/journals/989/ | 1 |


Due to formatting issues within the CSV file (the participation values are padded with extra spaces), I had to clean the file carefully to remove the journals that did not have full participation in the PMC archive. Once the appropriate journals had been removed, I saved the dataframe as a new CSV file. The clean dataframe can be seen above.
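For comparison, the same filtering can be sketched without dummy columns by stripping the whitespace from the participation values and keeping only the rows marked as full. This is a minimal sketch on a toy dataframe, not the cells that were actually run; the journal named "Example Journal A" is made up for illustration.

```python
import pandas as pd

# Toy frame standing in for journals_in_pmc.csv; note the padded values.
journals = pd.DataFrame({
    "Journal title": ["3 Biotech", "Example Journal A", "AAPS PharmSci"],
    "Participation level": [" Full ", " NIH Portfolio ", " Full "],
})

# Strip the stray whitespace, then keep only fully participating journals.
level = journals["Participation level"].str.strip()
full_only = journals.loc[level == "Full"].reset_index(drop=True)

print(full_only["Journal title"].tolist())
```

Resetting the index here also keeps the row labels contiguous, which matters later if rows are looked up by position in a loop.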

## Pulling Retracted Articles Only

Entrez has several methods that can be called for querying of the PubMed and PMC databases. This [article](https://medium.com/@kliang933/scraping-big-data-from-public-research-repositories-e-g-pubmed-arxiv-2-488666f6f29b) was used as a resource for understanding how to properly utilize these functions for the querying calls that were necessary for data collection. As described in the "Data Collection Workflow" section above, information obtained from PubMed queries must be pulled to continue the workflow process.

### Pulling Data from PubMed

In [None]:
#lists that will become columns in the dataframe
ls_num = []
ls_id = []
ls_doi = []
ls_language = []
ls_year = []
ls_month = []
ls_day = []
ls_volume = []
ls_issue = []
ls_journal = []
ls_title = []
ls_page = []

#setting up counter to keep track of how many articles had been pulled from each journal
x=0

#setting up counter to keep track of the number of articles that did not have a DOI
no_doi = 0

#to query retracted articles for each journal
for i in range(0,len(df['NLM TA'])):
    
    #providing information for the query to be accepted
    Entrez.email = 'my_email_address'
    
    #providing search terms for querying the PubMed database
    handle = Entrez.esearch(db='pubmed',term='(hasretractionin) AND ('+df['NLM TA'][i]+'[Journal])', retmode='xml', retmax=1000)
    
    #formatting the results of the query: article IDs within the database
    results = Entrez.read(handle)
    ids = ' , '.join(results['IdList']) #joining the list of IDs together to be read in one large string
    print(df['NLM TA'][i],[i]) #printing the name of the journal and the index number of the journal
    
    
    #conditional in case there were no articles that met the search term criteria
    if len(ids)==0:
        pass
    
    else:
        
        #providing information for the query to be accepted
        Entrez.email = 'my_email_address'
        
        #providing IDs for querying
        handle = Entrez.efetch(db='pubmed', id = ids, retmode='xml', rettype='full')
        
        #formatting the results of the query: information about the article
        results_id = Entrez.read(handle)
        
        #counter to keep track of how many articles had been pulled from each journal
        x += len(results['IdList'])
        print(x)
        
        
        #to sort through the information about each article pulled from each journal
        for j in range(0, len(results['IdList'])):
            
            #adding the ID of each article to a list for the dataframe
            ls_id.append(results['IdList'][j])
            
            #each piece of information has a try/except statement in case the specific piece of information
            #is not given when the article is pulled
            #if the try statement runs, then the piece of information will be added to a list
            #if the try statement does not run, then a None value will be added to the list for that piece of information
            
            #extracting DOI of each article
            try:
                doi = str(results_id['PubmedArticle'][j]['MedlineCitation']['Article']['ELocationID'][0])
                ls_doi.append(doi)
            except:
                ls_doi.append(None)
                no_doi += 1
                print(no_doi) #printing how many times the article pulled does not have a DOI
                
            #extracting the language used in each article
            try:
                language = str(results_id['PubmedArticle'][j]['MedlineCitation']['Article']['Language'][0])
                ls_language.append(language)
            except:
                ls_language.append(None)
                
            #extracting the year each article was published  
            try:
                year = int(results_id['PubmedArticle'][j]['MedlineCitation']['Article']['ArticleDate'][0]['Year'])
                ls_year.append(year)
            except:
                ls_year.append(None)
            
            #extracting the month each article was published
            try:
                month = int(results_id['PubmedArticle'][j]['MedlineCitation']['Article']['ArticleDate'][0]['Month'])
                ls_month.append(month)
            except:
                ls_month.append(None)
            
            #extracting the day each article was published
            try:
                day = int(results_id['PubmedArticle'][j]['MedlineCitation']['Article']['ArticleDate'][0]['Day'])
                ls_day.append(day)
            except:
                ls_day.append(None)
            
            #extracting the volume number of the journal when each article was published
            try:
                volume = int(results_id['PubmedArticle'][j]['MedlineCitation']['Article']['Journal']['JournalIssue']['Volume'])
                ls_volume.append(volume)
            except:
                ls_volume.append(None)
            
            #extracting the issue number of the journal when each article was published
            try:
                issue = int(results_id['PubmedArticle'][j]['MedlineCitation']['Article']['Journal']['JournalIssue']['Issue'])
                ls_issue.append(issue)
            except:
                ls_issue.append(None)
            
            #extracting the journal name where each article was published
            try:
                journal = str(results_id['PubmedArticle'][j]['MedlineCitation']['Article']['Journal']['Title'])
                ls_journal.append(journal)
            except:
                ls_journal.append(None)
            
            #extracting the title of each article
            try:
                title = str(results_id['PubmedArticle'][j]['MedlineCitation']['Article']['ArticleTitle'])
                ls_title.append(title)
            except:
                ls_title.append(None)
            
            #extracting the pagination for each article (or page numbers)
            try:
                page = str(results_id['PubmedArticle'][j]['MedlineCitation']['Article']['Pagination']['MedlinePgn'])
                ls_page.append(page)
            except:
                ls_page.append(None)
    
    #added a sleep function to prevent 404 errors when pulling the data
    time.sleep(5)

To summarize the above script, each journal name is combined with the search term that restricts results to retracted articles from that journal. The first query returns the ID number of each matching article (an identifier specific to PubMed). These ID numbers are then used to fetch a detailed record for each article. Important pieces of information are extracted from each record and added to lists. If a piece of information is not present, a None value is added in its place and the script moves on to the next piece of information. The script prints the journal name, the index number of the journal in the "journals_in_pmc_clean.csv" file, the running total of retracted articles found so far, and the running count of articles that have no DOI associated with them. Because the query involves many steps, these print statements allow me to watch its progress.
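The repeated try/except blocks in the script could be condensed into one small lookup helper. The sketch below is a hypothetical alternative, not the code that was run: safe_get is a made-up name, and the toy record only imitates the nesting of one entry of results_id['PubmedArticle'] with invented values.

```python
def safe_get(record, path, cast=str):
    """Follow `path` (keys or list indices) into `record`; return None if any step is missing."""
    current = record
    try:
        for step in path:
            current = current[step]
        return cast(current)
    except (KeyError, IndexError, TypeError, ValueError):
        return None


# Toy record shaped like one parsed PubMed article (values are made up).
article = {
    "MedlineCitation": {
        "Article": {
            "Language": ["eng"],
            "Journal": {"JournalIssue": {"Volume": "12"}},
        }
    }
}

issue_path = ["MedlineCitation", "Article", "Journal", "JournalIssue"]
language = safe_get(article, ["MedlineCitation", "Article", "Language", 0])
volume = safe_get(article, issue_path + ["Volume"], cast=int)
issue = safe_get(article, issue_path + ["Issue"], cast=int)  # missing, so None

print(language, volume, issue)
```

Catching only the lookup and conversion errors, rather than using a bare except, means an unexpected problem such as a keyboard interrupt is not silently recorded as a missing value.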

In [None]:
data = pd.concat([pd.Series(ls_id), pd.Series(ls_doi), pd.Series(ls_language), pd.Series(ls_year), pd.Series(ls_month), pd.Series(ls_day), pd.Series(ls_volume), pd.Series(ls_issue), pd.Series(ls_journal), pd.Series(ls_title), pd.Series(ls_page)], axis=1)
data

In [161]:
data.to_csv('./pubmed_data_retraction.csv')

The lists created above were concatenated to form a dataframe. The dataframe therefore contained the PubMed ID number, DOI, language, year, month, day, volume, issue, journal name, title, and page numbers for each retracted article from the journals listed in the "journals_in_pmc_clean.csv" file. The dataframe was then saved to a CSV file as a backup.
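Because the lists are concatenated as unnamed Series, the columns end up labeled 0 through 10 (and '1', '2', and so on once the CSV is read back in). A possible alternative, sketched below with toy values, is to build the dataframe from a dictionary so each column has a descriptive name. The column names here are hypothetical, and the later cells in this notebook would need to be adjusted to use them.

```python
import pandas as pd

# Toy lists standing in for the ones filled by the query loop (values are made up).
ls_id = ["111", "222"]
ls_doi = ["10.1000/example.1", None]
ls_language = ["eng", "eng"]

# A dict of lists gives each column a descriptive name.
data_named = pd.DataFrame({
    "pubmed_id": ls_id,
    "doi": ls_doi,
    "language": ls_language,
})

print(data_named.columns.tolist())
```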

In [None]:
data = pd.read_csv('./pubmed_data_retraction.csv')

In [None]:
data = data.dropna(axis=0, subset=['1'])

In [164]:
data['2'].value_counts()

eng    2139
chi       2
spa       1
fre       1
Name: 2, dtype: int64

In [165]:
print(data.loc[data['2']=='chi'].index)
print(data.loc[data['2']=='spa'].index)
print(data.loc[data['2']=='fre'].index)

Int64Index([2968, 2969], dtype='int64')
Int64Index([259], dtype='int64')
Int64Index([2243], dtype='int64')


In [166]:
data = data.drop([2968, 2969, 259, 2243], axis=0)
data['2'].value_counts()

eng    2139
Name: 2, dtype: int64

The dataframe that contained the information about each retracted article needed to be cleaned. Articles that had no DOI could not be searched for in the PMC database to retrieve their full text, so they were dropped from the dataframe. English is the only language that I can read fluently, so any article that was not written in English was dropped from the dataframe as well.
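The two cleaning rules can also be expressed as a filter, which avoids looking up and typing row indices by hand. This is a minimal sketch on a toy dataframe that reuses the notebook's column labels ('1' for the DOI and '2' for the language); the DOI values are invented.

```python
import pandas as pd

# Toy frame using the notebook's column labels: '1' is the DOI, '2' the language.
articles = pd.DataFrame({
    "1": ["10.1000/example.1", None, "10.1000/example.3", "10.1000/example.4"],
    "2": ["eng", "eng", "chi", "eng"],
})

# Drop rows with no DOI, then keep only English-language rows.
articles = articles.dropna(subset=["1"])
articles = articles.loc[articles["2"] == "eng"]

print(articles.index.tolist())
```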

In [None]:
ls_index = []
for i in data['1']:
    if len(i) < 7:
        for j in range(0,len(data.loc[data['1']==i].index)):
            ls_index.append(data.loc[data['1']==i].index[j])
data = data.drop(ls_index)

Some of the articles had DOI values that were extremely short. Initial project research showed that these values caused problems when querying the PMC database. Because of this, any DOI value shorter than 7 characters was dropped from the dataframe.
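The length rule can be written as a vectorized string operation instead of a nested loop. The sketch below assumes the DOI column contains no missing values (they were dropped in an earlier step); the two short strings are made-up examples of problematic values, and the constant name is hypothetical.

```python
import pandas as pd

dois = pd.DataFrame({
    "1": ["10.1186/1471-2121-13-8", "10.1", "10.7554/eLife.12248", "e0123"],
})

MIN_DOI_LENGTH = 7  # same threshold as the loop above

# Keep only rows whose DOI string has at least MIN_DOI_LENGTH characters.
long_enough = dois["1"].str.len() >= MIN_DOI_LENGTH
dois = dois.loc[long_enough]

print(dois["1"].tolist())
```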

In [None]:
data = data.drop_duplicates()

In [None]:
data.to_csv('./datasets/pubmed_data_retraction_cleaned.csv')

In [None]:
data = pd.read_csv('./datasets/pubmed_data_retraction_cleaned.csv')

Duplicate rows in the dataframe were dropped. The cleaned dataframe was then saved to a CSV file as a backup for later use.

In [103]:
data['1'].sample(n=25,replace = False)

212             10.1186/1471-2121-13-8
618           10.3389/fpsyg.2016.01298
1877      10.1371/journal.pone.0001444
895            10.1074/jbc.M111.260414
1439             10.1093/neuonc/nor116
1995        10.1038/s41598-019-38519-5
602           10.3389/fphar.2017.00871
511                10.7554/eLife.12248
2108           10.4103/2229-5070.72109
1953         10.1186/s12978-019-0732-7
1175    10.1523/JNEUROSCI.2613-09.2009
1428                10.1038/ncomms6446
98             10.4103/0256-4947.83211
1432                10.1038/ncomms1623
44           10.1107/S160053680706254X
599            10.3389/fonc.2013.00153
1488               10.2147/OTT.S124118
1            10.1208/s12249-016-0596-x
899            10.1074/jbc.M111.247726
2002        10.1038/s41598-017-10365-3
111            10.1074/jbc.M111.329078
1598      10.1371/journal.pone.0218664
1392              10.1128/MCB.01480-09
1340     10.1158/1535-7163.MCT-14-0672
1384              10.1128/MCB.00114-14
Name: 1, dtype: object

A random sample was pulled from the DOI column of the dataframe. These DOIs were used to check manually that the articles returned by the query had indeed been retracted. All of the DOIs shown above were confirmed to belong to retracted articles, so it was assumed that the query was successful and the data collection workflow could proceed to the next step.
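To make a manual check like this repeatable, the sample can be drawn with a fixed seed and each DOI turned into a link through the doi.org resolver. This is a small sketch using four DOIs from the output above; the seed value and variable names are arbitrary choices, not part of the original workflow.

```python
import pandas as pd

doi_column = pd.Series([
    "10.1186/1471-2121-13-8",
    "10.3389/fpsyg.2016.01298",
    "10.1371/journal.pone.0001444",
    "10.7554/eLife.12248",
])

# A fixed random_state makes the same rows come back on every run.
spot_check = doi_column.sample(n=2, replace=False, random_state=0)

# Prefixing a DOI with the doi.org resolver gives a link to the article page.
links = ["https://doi.org/" + doi for doi in spot_check]
for link in links:
    print(link)
```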

### Pulling Data from PMC

In [None]:
#lists that will become columns in the dataframe
ls_total_text = []
ls_total_keywords = []
ls_total_abstract = []
ls_publisher = []

#setting up a counter to keep track of the index number of the DOI being pulled
count = 2075

#setting up a counter to keep track of the number of articles that did not have any text
no_text = 0

#to query a specific number of DOIs of retracted articles
for i in data['1'][2075:2106]:
    
    #providing information for the query to be accepted
    Entrez.email = 'my_email_address'
    
    #providing search terms for querying the PMC database
    handle = Entrez.esearch(db='pmc',term=i, retmode='xml')
    
    #formatting the results of the query: article IDs within the database
    results = Entrez.read(handle)
    ids = ' , '.join(results['IdList'])
    
    #providing information for the query to be accepted
    Entrez.email = 'my_email_address'
    
    #providing IDs for querying
    handle = Entrez.efetch(db='pmc', id = ids, retmode='xml', rettype='full')
    
    #formatting the results of the query: information about the article
    text = handle.read()

    #using BeautifulSoup to pull specific information from the article
    soup = BeautifulSoup(text, "lxml")
    
    #count to keep track of which DOI is currently being processed
    print(f"{i} [{count}]")
    
    #each piece of information has a try/except statement in case the specific piece of information
    #is not given when the article is pulled
    #if the try statement runs, then the piece of information will be pulled from the html and added to a list
    #if the try statement does not run, then a None value will be added to the list for that piece of information
    
    #extracting full text for each article
    try:
        #list to hold each section of each article
        ls_raw_text = []
        
        #string value that will hold the complete text
        text = str()
        
        #finding each section of each article
        for i in range(0,len(soup.find_all("sec"))):
            for j in range(0,len(soup.find_all("sec")[i].find_all("p"))): #indexing through each paragraph in each section
                ls_raw_text.append(str(soup.find_all("sec")[i].find_all("p")[j].text))
        
        #combining the different sections of the article into one large body of text
        for i in range(0,len(ls_raw_text)):
            text += ls_raw_text[i]
        ls_total_text.append(text)
        
        #count to keep track of the number of articles that had no text
        if text=='':
            no_text += 1
            print(f'No text --> {no_text}')
    except:
        ls_total_text.append(None)
    
    #extracting keywords list for each article
    try:
        ls_keywords = []
        for i in range(0,len(soup.find_all("kwd"))):
            ls_keywords.append(soup.find_all("kwd")[i].text)
        ls_total_keywords.append(ls_keywords)
    except:
        ls_total_keywords.append(None)
    
    #extracting the abstract for each article
    try:
        ls_abstract = []
        for i in soup.find_all("abstract")[0].text.split('\n'): #splitting on \n due to formatting from html
            if i == '':
                pass
            else:
                ls_abstract.append(i)
        ls_total_abstract.append(ls_abstract[0])
    except:
        ls_total_abstract.append(None)
    
    #extracting the publisher name for each article
    try:      
        ls_publisher.append(soup.find_all("publisher-name")[0].text)
    except:
        ls_publisher.append(None)
    
    count +=1
    
    #added a sleep function to prevent 404 errors when pulling the data
    time.sleep(3)

To summarize the above script, the DOI of each retracted article is used as a search term for querying the PMC database. The resulting ID value of the article (the ID value for the same article differs between the PMC and PubMed databases) is then used to fetch the article itself from the PMC database. This method pulls the full text, keywords list, abstract, and publisher for each retracted article. As the script runs, the current DOI, the index number of the DOI, and a count of how many articles have no text are printed so that I can track its progress and see when it finishes.
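The extraction step can be tried offline on a small XML fragment. The sketch below swaps BeautifulSoup for the standard-library xml.etree.ElementTree parser so that it runs without extra dependencies; the fragment is invented and only loosely imitates the tags the script searches for (sec, p, kwd, abstract, publisher-name), so real PMC records may need additional handling.

```python
import xml.etree.ElementTree as ET

# Made-up fragment that loosely imitates the tags the notebook searches for.
sample_xml = """
<article>
  <front>
    <publisher-name>Example Publisher</publisher-name>
    <abstract><p>Short abstract.</p></abstract>
    <kwd-group><kwd>retraction</kwd><kwd>text mining</kwd></kwd-group>
  </front>
  <body>
    <sec><title>Introduction</title><p>First paragraph.</p></sec>
    <sec><title>Methods</title><p>Second paragraph.</p><p>Third paragraph.</p></sec>
  </body>
</article>
"""

root = ET.fromstring(sample_xml)

# Paragraph text from every section, joined into one string.
paragraphs = ["".join(p.itertext()) for sec in root.iter("sec") for p in sec.iter("p")]
full_text = " ".join(paragraphs)

keywords = [kwd.text for kwd in root.iter("kwd")]

# Missing elements are recorded as None instead of raising an error.
abstract_node = root.find(".//abstract")
abstract = "".join(abstract_node.itertext()).strip() if abstract_node is not None else None

publisher_node = root.find(".//publisher-name")
publisher = publisher_node.text if publisher_node is not None else None

print(full_text)
print(keywords, abstract, publisher)
```

Unlike the script above, this sketch joins paragraphs with a space so that sentences at paragraph boundaries do not run together.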

In [155]:
pd.Series(data['1'][2075:2106]).to_csv('./doi__2075_2106.csv')
pd.Series(ls_total_text).to_csv('./text__2075_2106.csv')
pd.Series(ls_total_keywords).to_csv('./keywords__2075_2106.csv')
pd.Series(ls_total_abstract).to_csv('./abstract__2075_2106.csv')
pd.Series(ls_publisher).to_csv('./publisher__2075_2106.csv')

Ideally, the script would run through the entire DOI column. However, when running it, I often received 404 errors, or the script would stop abruptly with no error message. For this reason, I decided to break the DOI column into small sections so that I could closely monitor the results and save them frequently. Because of this, several CSV files were created, each labeled with the range of DOI index values used when pulling the information. These files were then combined into one complete CSV file in a notebook that will not be provided. The complete CSV file is used for the "Data Cleaning" notebook instead of the large number of individual CSV files.
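The batching described here can be sketched with two small helpers: one that produces the index ranges and one that builds a file name in the same pattern as the CSV files above. Both function names are hypothetical, and the batch size is an arbitrary example.

```python
def batch_ranges(total, batch_size):
    """Yield (start, stop) index pairs that cover range(total) in order."""
    for start in range(0, total, batch_size):
        yield start, min(start + batch_size, total)


def batch_filename(kind, start, stop):
    """File name in the notebook's pattern, e.g. './text__2075_2106.csv'."""
    return f"./{kind}__{start}_{stop}.csv"


ranges = list(batch_ranges(total=100, batch_size=31))
print(ranges)
print(batch_filename("text", 2075, 2106))
```

Generating the ranges programmatically removes the need to edit the slice bounds and the counter by hand for each section.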

## Pulling Non-Retracted Articles Only

As described in the "Data Collection Workflow" section above, information obtained from PubMed queries must be pulled to continue the workflow process. In this instance, the querying will be specifically for articles from the PLOS ONE journal that were not retracted.

### Pulling Data from PubMed

In [59]:
day = ['01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12', '13', '14', '15', '16',
                 '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31']
len(day)

31

In [60]:
start_month = ['01/01', '02/01', '03/01', '04/01', '05/01', '06/01', 
               '07/01', '08/01', '09/01', '10/01', '11/01', '12/01']
end_month = ['01/31', '02/28', '03/31', '04/30', '05/31', '06/30', 
            '07/31', '08/31', '09/30', '10/31', '11/30', '12/31']

month = ['01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12']
year = ['2015', '2016', '2017', '2018', '2019']
start_date = []
end_date = []

#determining a random date for each month between 2015-2019
for i in year:
    for j in range(0,len(month)):
        
        #for each month that has 31 days
        #(month[j] is compared because j itself is an integer index, not a month string)
        if month[j] in ('01', '03', '05', '07', '08', '10', '12'):
            day_num = np.random.choice(day)
            
        #for february (capped at 28 days, so leap days are never chosen)
        elif month[j] == '02':
            day_num = np.random.choice(day[:28])
            
        #for each month that has 30 days
        else:
            day_num = np.random.choice(day[:30])
        start_date.append(i+'/'+start_month[j])
        end_date.append(i+'/'+month[j]+'/'+day_num)
        
print(len(start_date))

60


Over the last five years of PLOS ONE volumes, approximately 150,000 articles were published. To avoid biasing the sample toward a particular publication date, a random day was chosen for each month between 2015 and 2019, and each query covered the span from the first of the month to that day. The above script generated this list of dates.
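One way to guarantee that every generated day is valid for its month is to ask the standard library for the month length. The sketch below is an alternative formulation under stated assumptions, not the cell that was run: it swaps np.random.choice for the standard-library random module, uses an arbitrary seed for reproducibility, and its variable names are hypothetical. Unlike the script above, it allows February 29 in a leap year.

```python
import calendar
import random

rng = random.Random(42)  # seeded so the windows are reproducible; the seed is arbitrary

start_dates = []
end_dates = []
for year in range(2015, 2020):
    for month in range(1, 13):
        # monthrange returns (weekday of first day, number of days in the month)
        days_in_month = calendar.monthrange(year, month)[1]
        day = rng.randint(1, days_in_month)
        start_dates.append(f"{year}/{month:02d}/01")
        end_dates.append(f"{year}/{month:02d}/{day:02d}")

print(len(start_dates), start_dates[0])
```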

In [None]:
#as this script is nearly identical to an earlier script, please check previous script
#for additional control flow comments

ls_id = []
ls_doi = []
ls_language = []
ls_year = []
ls_month = []
ls_day = []
ls_volume = []
ls_issue = []
ls_journal = []
ls_title = []
ls_page = []
x=0
no_doi = 0

for i in range(0,len(start_date)):
    Entrez.email = 'my_email_address'
    handle = Entrez.esearch(db='pubmed',term='(PLoS One [Journal])', retmode='xml', retmax=167, mindate = start_date[i], maxdate = end_date[i])
    results = Entrez.read(handle)
    
    ids = ' , '.join(results['IdList'])

    
    if len(ids)==0:
        pass
    else:
        Entrez.email = 'my_email_address'
        handle = Entrez.efetch(db='pubmed', id = ids, retmode='xml', rettype='full')
        results_id = Entrez.read(handle)
        
        x += len(results['IdList'])
        print(x)
        
        for j in range(0, len(results['IdList'])):
            ls_id.append(results['IdList'][j])
            try:
                doi = str(results_id['PubmedArticle'][j]['MedlineCitation']['Article']['ELocationID'][0])
                ls_doi.append(doi)
            except:
                ls_doi.append(None)
                no_doi += 1
                print(no_doi)
                
            try:
                language = str(results_id['PubmedArticle'][j]['MedlineCitation']['Article']['Language'][0])
                ls_language.append(language)
            except:
                ls_language.append(None)
                
            try:
                year = int(results_id['PubmedArticle'][j]['MedlineCitation']['Article']['ArticleDate'][0]['Year'])
                ls_year.append(year)
                print(year)
            except:
                ls_year.append(None)
            
            try:
                month = int(results_id['PubmedArticle'][j]['MedlineCitation']['Article']['ArticleDate'][0]['Month'])
                ls_month.append(month)
                print(month)
            except:
                ls_month.append(None)
                
            try:
                day = int(results_id['PubmedArticle'][j]['MedlineCitation']['Article']['ArticleDate'][0]['Day'])
                ls_day.append(day)
                print(day)
            except:
                ls_day.append(None)
                
            try:
                volume = int(results_id['PubmedArticle'][j]['MedlineCitation']['Article']['Journal']['JournalIssue']['Volume'])
                ls_volume.append(volume)
            except:
                ls_volume.append(None)
            
            try:
                issue = int(results_id['PubmedArticle'][j]['MedlineCitation']['Article']['Journal']['JournalIssue']['Issue'])
                ls_issue.append(issue)
            except:
                ls_issue.append(None)
                
            try:
                journal = str(results_id['PubmedArticle'][j]['MedlineCitation']['Article']['Journal']['Title'])
                ls_journal.append(journal)
            except:
                ls_journal.append(None)
                
            try:
                title = str(results_id['PubmedArticle'][j]['MedlineCitation']['Article']['ArticleTitle'])
                ls_title.append(title)
            except:
                ls_title.append(None)
                
            try:
                page = str(results_id['PubmedArticle'][j]['MedlineCitation']['Article']['Pagination']['MedlinePgn'])
                ls_page.append(page)
            except:
                ls_page.append(None)
        
    time.sleep(5)

To summarize the above script, the PLOS ONE search term is combined with each date range between 2015 and 2019 to pull articles from PLOS ONE. The first query returns the ID number of each article found, up to 167 per date range. These ID numbers are then used to fetch a detailed record for each article. Important pieces of information are extracted from each record and added to lists. If a piece of information is not present, a None value is added in its place and the script moves on to the next piece of information. The script prints the running total of articles pulled, the running count of articles with no DOI, and the year, month, and day of each article. Because the query involves many steps, these print statements allow me to watch its progress.

In [None]:
data_no_retract = pd.concat([pd.Series(ls_id), pd.Series(ls_doi), pd.Series(ls_language), pd.Series(ls_year), pd.Series(ls_month), pd.Series(ls_day), pd.Series(ls_volume), pd.Series(ls_issue), pd.Series(ls_journal), pd.Series(ls_title), pd.Series(ls_page)], axis=1)

In [65]:
data_no_retract[3].value_counts()

2019    1983
2016    1893
2018    1890
2015    1880
2017    1843
2014     124
Name: 3, dtype: int64

In [66]:
data_no_retract[4].value_counts()

12    959
5     930
10    918
7     835
9     835
3     823
2     804
1     768
11    743
8     724
6     668
4     606
Name: 4, dtype: int64

In [67]:
data_no_retract[5].value_counts()

14    806
12    642
13    490
23    477
31    474
15    463
28    424
22    421
11    420
6     385
8     334
7     324
18    317
24    310
5     303
10    301
19    269
1     253
9     243
16    238
3     236
25    219
27    212
17    199
4     172
2     169
26    167
20    156
21     98
29     91
Name: 5, dtype: int64

The information lists created were concatenated together to form a dataframe. Thus, the dataframe contained information about the PubMed ID number, DOI, language, year, month, day, volume, issue, journal name, title, and page numbers for each article that had not been retracted from PLOS ONE. The dataframe was then saved to a CSV file as a proofing measure. The distribution of the year, month, and day values was checked to ensure that the date had been randomized properly when querying.

In [68]:
data_no_retract.to_csv('./pubmed_data_second_no_retraction.csv')

In [None]:
data_no_retract = pd.read_csv('./pubmed_data_second_no_retraction.csv')
data_no_retract = data_no_retract.dropna(axis=0, subset=['1'])

In [70]:
data_no_retract['2'].value_counts()

eng    9613
Name: 2, dtype: int64

In [None]:
ls_index = []
for i in data_no_retract['1']:
    if len(i) < 7:
        for j in range(0,len(data_no_retract.loc[data_no_retract['1']==i].index)):
            ls_index.append(data_no_retract.loc[data_no_retract['1']==i].index[j])
data_no_retract = data_no_retract.drop(ls_index)

In [None]:
data_no_retract = data_no_retract.drop_duplicates()

The dataframe that contained the information about each non-retracted article needed to be cleaned. Articles that had no DOI could not be searched for in the PMC database to retrieve their full text, so they were dropped from the dataframe. All of the articles pulled were written in English, so none needed to be dropped on account of language. Some of the articles had DOI values that were extremely short. Initial project research showed that these values caused problems when querying the PMC database. Because of this, any DOI value shorter than 7 characters was dropped from the dataframe.

### Pulling Data from PMC

In [None]:
#as this script is nearly identical to an earlier script, please check previous script
#for additional control flow comments

ls_total_text = []
ls_total_keywords = []
ls_total_abstract = []
ls_publisher = []
count = 6166
no_text = 0

for i in data_no_retract['1'][6166:]:
    Entrez.email = 'my_email_address'
    handle = Entrez.esearch(db='pmc',term=i, retmode='xml')
    results = Entrez.read(handle)

    ids = ' , '.join(results['IdList'])
    Entrez.email = 'my_email_address'
    handle = Entrez.efetch(db='pmc', id = ids, retmode='xml', rettype='full')
    text = handle.read()

    soup = BeautifulSoup(text, "lxml")
    print(f"{i} [{count}]")
    
    try:
        ls_raw_text = []
        text = str()
        for i in range(0,len(soup.find_all("sec"))):
            for j in range(0,len(soup.find_all("sec")[i].find_all("p"))):
                ls_raw_text.append(str(soup.find_all("sec")[i].find_all("p")[j].text))

        for i in range(0,len(ls_raw_text)):
            text += ls_raw_text[i]
        ls_total_text.append(text)
        
        if text=='':
            no_text += 1
            print(f'No text --> {no_text}')
    except:
        ls_total_text.append(None)
    
    try:
        ls_keywords = []
        for i in range(0,len(soup.find_all("kwd"))):
            ls_keywords.append(soup.find_all("kwd")[i].text)
        ls_total_keywords.append(ls_keywords)
    except:
        ls_total_keywords.append(None)
    
    try:
        ls_abstract = []
        for i in soup.find_all("abstract")[0].text.split('\n'):
            if i == '':
                pass
            else:
                ls_abstract.append(i)
        ls_total_abstract.append(ls_abstract[0])
    except:
        ls_total_abstract.append(None)
    
    try:      
        ls_publisher.append(soup.find_all("publisher-name")[0].text)
    except:
        ls_publisher.append(None)
    
    count +=1
    time.sleep(3)

In [79]:
pd.Series(data_no_retract['1'][6166:]).to_csv('./doi__no_retraction__6166_end.csv')
pd.Series(ls_total_text).to_csv('./text__no_retraction__6166_end.csv')
pd.Series(ls_total_keywords).to_csv('./keywords__no_retraction__6166_end.csv')
pd.Series(ls_total_abstract).to_csv('./abstract__no_retraction__6166_end.csv')
pd.Series(ls_publisher).to_csv('./publisher__no_retraction__6166_end.csv')

To summarize the above script, the DOI of each non-retracted article is used as a search term for querying the PMC database. The resulting ID value of the article (the ID value for the same article differs between the PMC and PubMed databases) is then used to fetch the article itself from the PMC database. This method pulls the full text, keywords list, abstract, and publisher for each non-retracted article from PLOS ONE. As the script runs, the current DOI, the index number of the DOI, and a count of how many articles have no text are printed so that I can track its progress and see when it finishes.

As with the retracted articles, the script would ideally have run through the entire DOI column. However, when running it, I often received 404 errors, or the script would stop abruptly with no error message. For this reason, I decided to break the DOI column into three sections so that I could closely monitor the results and save them more frequently. Because of this, three different CSV files were created, each labeled with the range of DOI index values used when pulling the information.
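One possible way to make a long run more tolerant of intermittent errors is to wrap each request in a retry helper that waits a little longer after every failure. The sketch below is an assumption about what could help, not something that was used for this project: with_retries and flaky_fetch are hypothetical names, and the toy function merely simulates a request that fails twice. It would not help with runs that stop without raising an error.

```python
import time


def with_retries(func, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); on an exception wait and retry, doubling the delay each time."""
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            # Re-raise once the final attempt has failed.
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


# Toy stand-in for a network call that fails twice before succeeding.
calls = {"count": 0}


def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise IOError("temporary failure")
    return "ok"


# base_delay=0.0 keeps the demonstration instant; a real run would wait longer.
result = with_retries(flaky_fetch, attempts=5, base_delay=0.0)
print(result, calls["count"])
```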