# Feature Engineering

## Import Libraries

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import textstat
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer

## Read CSV Files

In [None]:
total = pd.read_csv('./total_data_plos_only_cleaned.csv')
total.head()

In [3]:
total = total.drop(columns=['Unnamed: 0', 'Unnamed: 0.1', 'Unnamed: 0.1.1'])

In [5]:
no_retract = pd.read_csv('./no_retraction_data_plos_only_cleaned.csv')
no_retract = no_retract.drop(columns=['Unnamed: 0', 'Unnamed: 0.1', 'Unnamed: 0.1.1'])

In [6]:
retract = pd.read_csv('./retraction_data_plos_only_cleaned.csv')
retract = retract.drop(columns=['Unnamed: 0', 'Unnamed: 0.1', 'Unnamed: 0.1.1'])

The last CSV files created in the "Data Cleaning" notebook were read and unnecessary columns were dropped.

## Feature Engineering

### Keywords

#### Retraction

The keywords lists in the retraction dataframe were unpacked and lemmatized to determine the most common keywords in retracted articles. This information will be explored in the "EDA" notebook.

In [10]:
#This script was influenced by https://stackoverflow.com/questions/40950791/remove-quotes-from-string-in-python
keywords_list = []
count = 0
for i in retract['keywords']:
    if i == []: #intended to skip articles with no keywords; each entry is a string (e.g. '[]'), so this comparison is never true
        pass
    else:
        for j in i.split(): #splitting each keyword string
            
            #removing all symbols from each split, lowercasing all remaining words, and adding to a list
            keywords_list.append(j.replace("'",'').replace('[','').replace(',','').replace(']','').replace('(','').replace(')','').replace('\\n', '').lower())
            
#output is a list that contains each occurrence of each keyword discussed in retracted articles

In [11]:
lemmatizer = WordNetLemmatizer()
ls_keywords = []
for i in keywords_list:
    ls_keywords.append(lemmatizer.lemmatize(i)) #lemmatizing list of keywords produced above to increase topic diversity
pd.Series(ls_keywords).value_counts().head(11)

            663
cell        186
antibody    116
cancer      112
response     89
disease      84
health       79
factor       71
theory       67
gene         66
heat         61
dtype: int64
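
The blank entry at the top of the counts above appears to come from articles whose keyword field is an empty list. As a minimal sketch, assuming each "keywords" value is the string form of a Python list (an assumption about the cleaned CSV, not something confirmed here), the lists could be parsed with `ast.literal_eval` before counting. The helper name `count_keyword_tokens` and the demo data are hypothetical, and lemmatization is left out.

```python
import ast
from collections import Counter

import pandas as pd


def count_keyword_tokens(keyword_column):
    """Count lowercase word tokens across stringified keyword lists."""
    counts = Counter()
    for raw in keyword_column:
        try:
            keywords = ast.literal_eval(raw)  # "['a', 'b']" -> ['a', 'b']
        except (ValueError, SyntaxError):
            continue  # skip values that are not valid list literals
        for keyword in keywords:
            # split multi-word keywords so counts are per word, as in the loop above
            counts.update(str(keyword).lower().split())
    return counts


# hypothetical demo data standing in for retract['keywords']
demo = pd.Series(["['Cancer cells', 'Antibodies']", "[]", "['cancer']"])
keyword_counts = count_keyword_tokens(demo)
```

With this approach an empty list contributes no tokens, so no blank keyword is counted.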

#### Keywords Binary

Keyword binary columns were created in the total dataframe and the retraction dataframe. If there is no keyword list in the article, then the value is set to 0. If there is a keyword list in the article, then the value is set to 1. This information will be explored in the "EDA" notebook.

In [13]:
keywords_binary = []
for i in total['keywords']:
    if len(i) != 2:
        keywords_binary.append(1)
    else:
        keywords_binary.append(0)
print(len(keywords_binary))

10619


In [14]:
total['keywords_binary'] = keywords_binary

In [15]:
keywords_binary = []
for i in retract['keywords']:
    if len(i) != 2:
        keywords_binary.append(1)
    else:
        keywords_binary.append(0)
print(len(keywords_binary))

retract['keywords_binary'] = keywords_binary

1537
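
The two loops above rely on an empty keyword list being stored as the two-character string "[]". A minimal vectorized sketch of the same check, using a hypothetical demo dataframe in place of the real data, might look like this:

```python
import pandas as pd

# hypothetical demo data; the real column is total['keywords']
demo = pd.DataFrame({"keywords": ["['cell', 'gene']", "[]", "['heat']"]})

# a string of length 2 is assumed to be the empty list "[]"
demo["keywords_binary"] = (demo["keywords"].str.len() != 2).astype(int)
```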


### Word Count

A word count column was created using the "clean_text" column in the total dataframe to represent the number of words in each article. This information will be explored in the "EDA" notebook.

In [16]:
list_words = []
for i in range(0, len(total['clean_text'])):
    list_words.append(len(total['clean_text'][i].split()))
total['num_words'] = list_words 

### Character Length

A character length column was created using the "clean_text" column in the total dataframe to represent the number of characters in each article. This information will be explored in the "EDA" notebook.

In [17]:
list_words = []
for i in range(0, len(total['clean_text'])):
    list_words.append(len(total['clean_text'][i]))
total['character_length'] = list_words 
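
Both counts above can also be computed with the pandas string accessor rather than an index loop. The sketch below uses a hypothetical two-row dataframe and assumes "clean_text" contains no missing values.

```python
import pandas as pd

# hypothetical demo data; the real column is total['clean_text']
demo = pd.DataFrame({"clean_text": ["mice were treated", "no animals used here"]})

# number of whitespace-separated words in each article
demo["num_words"] = demo["clean_text"].str.split().str.len()

# number of characters in each article, spaces included
demo["character_length"] = demo["clean_text"].str.len()
```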

### Animal Studies

I expected that certain topics, such as animal-based research, would be overseen by more regulatory bodies and would therefore be less likely to be retracted, because more people are verifying the methods and results of the project. Additionally, working with animals or animal-based products can be much more variable than working with mathematical models. Because of this, I wanted to explore words related to animals or animal studies.

In [19]:
animal_terms = ['IACUC', 'mouse', 'mice', 'rats', 'rat', 'hamster', 'hamsters', 'pigs', 'rabbits', 'rabbit', 
                'cat', 'cats', 'dog', 'dogs', 'ungulate', 'ungulates', 'pig', 'horse', 'donkey', 'goat',
               'bovine', 'porcine', 'murine', 'chicken', 'sheep', 'cow', 'cows', 'horses', 'goats']

To search for words related to animals, the target words must be decided upon prior to searching through the article. [Source 1](http://vetmed.tamu.edu/media/2005639/vadnais%20protein%20therapeutics%202017.pdf) and [source 2](https://www.ncbi.nlm.nih.gov/books/NBK218261/) were used as resources for determining appropriate animal study words. These words were put into the list above.

In [20]:
def animal_binary(dataframe):
    list_articles = []
    iacuc = []
    
    #passing through each article
    for i in range(0, len(dataframe['clean_text'])):
        count = 0
        
        #passing through each word in each article
        for j in dataframe['clean_text'][i].split():
            
            #comparing the word in the article to the list of animal terms
            for k in animal_terms:
                if j == k:
                    
                    #prevents the article from being repeatedly added to the lists
                    if i not in list_articles:
                        list_articles.append(i)
                        iacuc.append(1)
                        count = 1
                else:
                    pass
        
        #used for if the article does not contain any animal terms
        if count == 0:
            iacuc.append(0)
        else:
            pass
    
    #shows the number of articles that contained animal terms
    print(len(list_articles))
    
    #shows if there will be a reshape error when adding to the dataframe
    print(len(iacuc))

    dataframe['animal_binary'] = iacuc
    return

In [21]:
animal_binary(total)
animal_binary(retract)
animal_binary(no_retract)

4289
10619
791
1537
3498
9082


I created a function in the script above to identify if an article contained one of these animal terms. A column was created that contained binary values, where 0 indicated the article did not contain any of the animal terms while a 1 indicated the article contained at least one instance of one of the animal terms. This information will be explored in the "EDA" notebook.
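
The same membership test can be expressed with a set, which avoids the innermost loop over the term list. This is only a sketch: `has_animal_term` and the shortened term set are hypothetical names introduced for illustration, and matching is still exact and case-sensitive on whitespace-separated words.

```python
# shortened, hypothetical subset of the animal_terms list
ANIMAL_TERMS = {"IACUC", "mouse", "mice", "rat", "rats", "murine"}


def has_animal_term(text, terms=ANIMAL_TERMS):
    """Return 1 if any whitespace-separated word in text is an animal term, else 0."""
    return int(not terms.isdisjoint(text.split()))


flags = [has_animal_term(t) for t in ["the mice were housed", "a mathematical model"]]
```

Applied to a dataframe, this could be used as `dataframe['clean_text'].apply(has_animal_term)`.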

In [22]:
def list_of_animal_words(dataframe):
    list_articles = []
    list_words = []
    
    #passing through each article
    for i in range(0, len(dataframe['clean_text'])):
        count = 0
        iacuc = []
        
        #passing through each word in each article
        for j in dataframe['clean_text'][i].split():
            
            #comparing the word in the article to the list of animal terms
            for k in animal_terms:
                if j == k:
                    
                    #prevents the article from being repeatedly added to the lists
                    if i not in list_articles:
                        list_articles.append(i)
                    
                    #prevents an animal term from being repeatedly added to the list for each article
                    if j not in iacuc:
                        iacuc.append(k)
                    count = 1
                else:
                    pass
        
        #used for if the article does not contain any animal terms
        if count == 0:
            list_words.append([])
        else:
            list_words.append(iacuc)
    
    #shows the number of articles that contained animal terms
    print(len(list_articles))
    
    #shows if there will be a reshape error when adding to the dataframe
    print(len(list_words))
    dataframe['animal_words'] = list_words
    return

In [23]:
list_of_animal_words(total)
list_of_animal_words(retract)
list_of_animal_words(no_retract)

4289
10619
791
1537
3498
9082


I also wanted to look at the distribution of animal terms in each article, as techniques such as immunostaining can require several different animal-based products. Thus, I created a column of lists, where each list shows which animal terms appeared in the article. This information will be explored in the "EDA" notebook.
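
A compact sketch of the per-article list is shown here. The function name `animal_words_in` and the shortened term set are hypothetical; terms are kept in order of first appearance and each term is listed once.

```python
# shortened, hypothetical subset of the animal_terms list
ANIMAL_TERMS = {"mouse", "mice", "rat", "rats", "rabbit"}


def animal_words_in(text, terms=ANIMAL_TERMS):
    """Return the distinct animal terms found in text, in order of first appearance."""
    found = []
    for word in text.split():
        if word in terms and word not in found:
            found.append(word)
    return found


example = animal_words_in("mice and rat tissue from mice and rabbit serum")
```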

In [None]:
iacuc = ['IACUC']
mouse = ['mouse', 'mice']
rat = ['rat', 'rats']
murine = ['murine']
hamster = ['hamster', 'hamsters']
rabbit = ['rabbit', 'rabbits']
cat = ['cat', 'cats']
pig = ['pig', 'pigs', 'porcine']
dog = ['dog', 'dogs']
ungulate = ['ungulate', 'ungulates']
horse = ['horse', 'horses']
donkey = ['donkey']
goat = ['goat', 'goats']
cow = ['cow', 'cows', 'bovine']
chicken = ['chicken']
sheep = ['sheep']

In [24]:
#create a new animal dummy column
def animal_dummy(word_list, column_name, dataframe):
    column_list = []
    
    for i in dataframe['animal_words']: #each animal word list for each article
        count = 0
        for j in i: #each animal word
            for k in word_list: #each category of animal
                if j == k:
                    count = 1 #set count = 1 if one of the words in the animal category appears in the text
                    
        #1 if any word in this animal category appears in the article, otherwise 0
        column_list.append(count) 
    
    dataframe[column_name] = column_list
    return 

I needed to unpack the lists of animal terms, but wanted to combine words that were related into similar categories. I created the above function and then ran it for each category for each dataframe in the script below. In a way, this dummies the "animal_words" column. This information will be explored in the "EDA" notebook.

In [26]:
animal_dummy(iacuc, 'iacuc', retract)
animal_dummy(mouse, 'mouse', retract)
animal_dummy(rat, 'rat', retract)
animal_dummy(murine, 'murine', retract)
animal_dummy(hamster, 'hamster', retract)
animal_dummy(rabbit, 'rabbit', retract)
animal_dummy(cat, 'cat', retract)
animal_dummy(pig, 'pig', retract)
animal_dummy(dog, 'dog', retract)
animal_dummy(ungulate, 'ungulate', retract)
animal_dummy(horse, 'horse', retract)
animal_dummy(donkey, 'donkey', retract)
animal_dummy(goat, 'goat', retract)
animal_dummy(cow, 'cow', retract)
animal_dummy(chicken, 'chicken', retract)
animal_dummy(sheep, 'sheep', retract)

animal_dummy(iacuc, 'iacuc', no_retract)
animal_dummy(mouse, 'mouse', no_retract)
animal_dummy(rat, 'rat', no_retract)
animal_dummy(murine, 'murine', no_retract)
animal_dummy(hamster, 'hamster', no_retract)
animal_dummy(rabbit, 'rabbit', no_retract)
animal_dummy(cat, 'cat', no_retract)
animal_dummy(pig, 'pig', no_retract)
animal_dummy(dog, 'dog', no_retract)
animal_dummy(ungulate, 'ungulate', no_retract)
animal_dummy(horse, 'horse', no_retract)
animal_dummy(donkey, 'donkey', no_retract)
animal_dummy(goat, 'goat', no_retract)
animal_dummy(cow, 'cow', no_retract)
animal_dummy(chicken, 'chicken', no_retract)
animal_dummy(sheep, 'sheep', no_retract)

animal_dummy(iacuc, 'iacuc', total)
animal_dummy(mouse, 'mouse', total)
animal_dummy(rat, 'rat', total)
animal_dummy(murine, 'murine', total)
animal_dummy(hamster, 'hamster', total)
animal_dummy(rabbit, 'rabbit', total)
animal_dummy(cat, 'cat', total)
animal_dummy(pig, 'pig', total)
animal_dummy(dog, 'dog', total)
animal_dummy(ungulate, 'ungulate', total)
animal_dummy(horse, 'horse', total)
animal_dummy(donkey, 'donkey', total)
animal_dummy(goat, 'goat', total)
animal_dummy(cow, 'cow', total)
animal_dummy(chicken, 'chicken', total)
animal_dummy(sheep, 'sheep', total)
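
The repeated calls above could be driven by a single dictionary that maps each column name to its word list. The sketch below is an assumption about how that might be organized, not the notebook's method; `add_animal_dummies` and the two-category dictionary are hypothetical.

```python
import pandas as pd

# hypothetical subset of the category lists defined earlier
animal_categories = {
    "mouse": ["mouse", "mice"],
    "pig": ["pig", "pigs", "porcine"],
}


def add_animal_dummies(dataframe, categories):
    """Add one 0/1 column per category based on the 'animal_words' lists."""
    for column_name, word_list in categories.items():
        dataframe[column_name] = [
            int(any(word in word_list for word in words))
            for words in dataframe["animal_words"]
        ]
    return dataframe


# hypothetical demo data standing in for the 'animal_words' column
demo = pd.DataFrame({"animal_words": [["mice", "porcine"], [], ["pig"]]})
add_animal_dummies(demo, animal_categories)
```

Calling the function once per dataframe (retract, no_retract, total) would replace the 48 individual calls.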

The .value_counts outputs below show that the share of articles containing an animal term was approximately 13 percentage points higher among retracted articles (51.5%) than among non-retracted articles (38.5%). Thus, articles that were retracted were more likely to contain an animal term. It is possible that the variability involved in working with animal models correlates with an article being retracted.

In [27]:
#retracted articles
total['animal_binary'][:1537].value_counts(normalize=True)

1    0.514639
0    0.485361
Name: animal_binary, dtype: float64

In [28]:
#non-retracted articles
total['animal_binary'][1537:].value_counts(normalize=True)

0    0.614843
1    0.385157
Name: animal_binary, dtype: float64
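
The comparison above slices by row position, which assumes the first 1537 rows of the total dataframe are the retracted articles. If a label column were available, the two rates and their difference in percentage points could be computed with a groupby. The `retracted` column name and the demo values below are hypothetical.

```python
import pandas as pd

# hypothetical demo data; 'retracted' is an assumed label column
demo = pd.DataFrame({
    "retracted": [1, 1, 0, 0, 0],
    "animal_binary": [1, 0, 1, 0, 0],
})

# the mean of a 0/1 column is the share of rows equal to 1
rates = demo.groupby("retracted")["animal_binary"].mean()

# difference expressed in percentage points
difference_points = (rates.loc[1] - rates.loc[0]) * 100
```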

### Human Studies

As with the animal studies feature engineering, I expected that human-based research would be overseen by even more regulatory bodies and would therefore be less likely to be retracted, because more people are verifying the methods and results of the project. Additionally, working with people and human data can be even more variable. Because of this, I wanted to explore words related to human studies and data.

In [30]:
list_articles = []
irb = []

#passing through each article
for i in range(0, len(total['clean_text'])):
    count = 0
    word_count = 0
    patient_count = 0
    
    #passing through each word in each article
    for j in total['clean_text'][i].split():
        
        #human study/data terms
        if j == 'IRB' or j == 'case' or j == 'participants':
            
            #comparing the word in the article to the list of human terms
            if j =='IRB' or j == 'participants':
                
                #prevents the article from being repeatedly added to the lists
                if i not in list_articles:
                    list_articles.append(i)
                    irb.append(1)
                    count = 1
            else:
                #checks to see if the word after the occurrence of "case" is "study"
                try:
                    if total['clean_text'][i].split()[word_count+1] == 'study':
                        
                        #prevents the article from being repeatedly added to the lists
                        if i not in list_articles:
                            list_articles.append(i)
                            irb.append(1)
                            count = 1
                    else:
                        pass
                
                #"case" is the last word of the article, so there is no next word to check
                except IndexError:
                    pass            
        else:
            pass
        word_count += 1
    if count == 0:
        irb.append(0)
    else:
        pass

#shows the number of articles that contained human terms
print(len(list_articles))

#shows if there will be a reshape error when adding to the dataframe
print(len(irb))
total['irb_binary'] = irb

3214
10619


Words used to determine if an article uses human studies/data were "IRB," "case study," and "participants." The above script was used to create a binary column, where the presence of one of these phrases was indicated by 1 while none of these phrases appearing in the article is indicated by 0.
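
The matching rule described above (either single word, or "case" immediately followed by "study") can be sketched as a small function. `has_human_term` is a hypothetical name, and the function assumes the same exact, case-sensitive matching on whitespace-separated words.

```python
def has_human_term(text):
    """Return 1 if text contains 'IRB', 'participants', or the phrase 'case study'."""
    words = text.split()
    if "IRB" in words or "participants" in words:
        return 1
    # pair each word with the word that follows it to detect "case study"
    return int(any(a == "case" and b == "study" for a, b in zip(words, words[1:])))


flags = [
    has_human_term("this case study describes"),
    has_human_term("in any case the study"),
    has_human_term("approved by the IRB"),
    has_human_term("a special case"),
]
```

Pairing words with `zip` avoids indexing past the end of the list when "case" is the final word.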

Retracted articles contained human study/data terms approximately 15 percentage points less often than non-retracted articles (17.0% versus 32.5%). Thus, retracted articles were less likely to contain human study/data terms. Unlike the animal studies result, this is consistent with the initial expectation that more heavily regulated human research may be less likely to be retracted.

In [31]:
#retracted articles
total['irb_binary'][:1537].value_counts(normalize=True)

0    0.829538
1    0.170462
Name: irb_binary, dtype: float64

In [32]:
#non-retracted articles
total['irb_binary'][1537:].value_counts(normalize=True)

0    0.674961
1    0.325039
Name: irb_binary, dtype: float64

### Regulatory Binary

For research projects that have both animal studies and human studies, regulation is significantly more intense than only having one of those studies. Below is a script that combines the information of the "irb_binary" and "animal_binary" columns. If an article had both terms that relate to animal studies and human studies, then the value recorded is 2. If an article only had terms for one of those study types, then the value recorded is 1. Articles that make no mention of any terms related to animal or human studies are recorded as a 0.

In [33]:
regulatory = []

#passing through each article
for i in range(0, len(total['irb_binary'])):
    
    #combining irb binary and animal binary information
    if total['irb_binary'][i] == 1 or total['animal_binary'][i] == 1:
        if total['irb_binary'][i] == 1 and total['animal_binary'][i] == 1:
            regulatory.append(2)
        else:
            regulatory.append(1)
    else:
        regulatory.append(0)

#shows if there will be a reshape error when adding to the dataframe
len(regulatory)

10619
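
Because both inputs are 0/1 columns, the 0/1/2 value described above is equal to their sum. A minimal sketch on hypothetical demo data:

```python
import pandas as pd

# hypothetical demo data covering all four combinations
demo = pd.DataFrame({
    "irb_binary": [1, 1, 0, 0],
    "animal_binary": [1, 0, 1, 0],
})

# 2 = both study types, 1 = exactly one, 0 = neither
demo["regulatory"] = demo["irb_binary"] + demo["animal_binary"]
```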

In [34]:
total['reg_binary'] = regulatory
total = total.rename(columns={'reg_binary':'regulatory'})

#retracted articles
total['regulatory'][:1537].value_counts(normalize=True)

1    0.545869
0    0.384515
2    0.069616
Name: regulatory, dtype: float64

In [35]:
#non-retracted articles
total['regulatory'][1537:].value_counts(normalize=True)

1    0.581150
0    0.354327
2    0.064523
Name: regulatory, dtype: float64

There is very little difference between retracted and non-retracted articles for the regulatory binary value. The increase in regulation of having multiple regulated studies in one project does not seem to correlate to the article being retracted.

### Review Binary

Review articles look at other published literature and discuss different aspects of this literature on a broader scale than individual research projects. Because of this, I thought that review articles may be more prone to issues with plagiarism. Oftentimes, the introduction of these articles will say verbatim, "in this review" as one of the final lines to transition to the next section. The below script creates a binary column, where an article containing "this review" is indicated by 1 while an article that does not contain this phrase is indicated by 0.

In [36]:
list_articles = []
review = []

#passing through each article
for i in range(0, len(total['clean_text'])):
    count = 0
    word_count = 0
    
    #passing through each word in each article
    for j in total['clean_text'][i].split():
        
        #review term
        if j == 'review':
            
            #checks to see if the word before the occurrence of "review" is "this"
            if total['clean_text'][i].split()[word_count-1] == 'this':
                
                #prevents the article from being repeatedly added to the lists
                if i not in list_articles:
                    list_articles.append(i)
                    review.append(1)
                    count = 1
        
        #words other than "review" are skipped
        else:
            pass
        word_count += 1
    
    #used for if the article does not contain any review terms
    if count == 0:
        review.append(0)
    else:
        pass

#shows the number of articles that contained review terms
print(len(list_articles))

#shows if there will be a reshape error when adding to the dataframe
print(len(review))
total['review_binary'] = review

234
10619
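
A two-word phrase such as "this review" can also be flagged with a regular expression through `Series.str.contains`. The pattern below is an assumption chosen to mimic whole-word matching on whitespace-separated text; the demo data is hypothetical.

```python
import pandas as pd

# hypothetical demo data; the real column is total['clean_text']
demo = pd.DataFrame({
    "clean_text": ["in this review we discuss", "we review this topic", "this reviewer noted"]
})

# "this" and "review" as adjacent whole words, bounded by whitespace or string edges
pattern = r"(?:^|\s)this\s+review(?:\s|$)"
demo["review_binary"] = demo["clean_text"].str.contains(pattern, regex=True).astype(int)
```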


The phrase "this review" appears in approximately 3.9% of retracted articles compared to 1.9% of non-retracted articles, an increase of approximately 2 percentage points. Thus, retracted articles are more likely to be review articles than non-retracted articles. While this difference is small in absolute terms, only 234 articles in total were flagged as reviews, so the rate among retracted articles is roughly double the rate among non-retracted articles and is worth noting. It is possible that review papers correlate with the paper being retracted.

In [37]:
#retracted articles
total['review_binary'][:1537].value_counts(normalize=True)

0    0.960963
1    0.039037
Name: review_binary, dtype: float64

In [38]:
#non-retracted articles
total['review_binary'][1537:].value_counts(normalize=True)

0    0.980841
1    0.019159
Name: review_binary, dtype: float64

### Novel Ideas

Oftentimes, published literature will use the word "novel" to express that the ideas found within that article have never been published before. Research grants are often focused on funding new ideas, and new ideas often lead to more citations. However, new ideas are also difficult to bring to fruition, as laying the groundwork of a completely new project requires a vast amount of troubleshooting. Because of this, I wanted to explore words related to novel ideas. The below script creates a binary column, where an article containing the word "novel" more than once is indicated by 1 while an article not containing the word or only having one occurrence of the word is indicated by 0.

In [39]:
list_articles = []
novel_idea = []

#passing through each article
for i in range(0, len(total['clean_text'])):
    count = 0
    novel_count = 0
    
    #passing through each word in each article
    for j in total['clean_text'][i].split():
        
        #novel term
        if j == 'novel':
            
            #passing through each word again to see if it occurs a second time
            for k in total['clean_text'][i].split():
                if k == 'novel':
                    novel_count += 1
                    
            #novel occurs more than once in the article
            if novel_count > 1:
                
                #prevents the article from being repeatedly added to the lists
                if i not in list_articles:
                    list_articles.append(i)
                    novel_idea.append(1)
                    count = 1
        else:
            pass
    
    #used for if the article does not contain the novel term
    if count == 0:
        novel_idea.append(0)
    else:
        pass

#shows the number of articles that contained the novel term
print(len(list_articles))

#shows if there will be a reshape error when adding to the dataframe
print(len(novel_idea))
total['novel_idea'] = novel_idea

1565
10619
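
The "more than once" rule can be written directly with `list.count`. This is a sketch with a hypothetical function name, using the same exact match on whitespace-separated words.

```python
def novel_binary(text):
    """Return 1 if the word 'novel' occurs more than once in text, else 0."""
    return int(text.split().count("novel") > 1)


flags = [
    novel_binary("a novel method and a novel dataset"),
    novel_binary("a novel method"),
    novel_binary("an established method"),
]
```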


There is little difference between retracted and non-retracted articles for the novel idea binary value (approximately 17% versus 14%). The project being a novel idea does not seem to correlate strongly with the article being retracted.

In [40]:
total['novel_idea'][:1537].value_counts(normalize=True)

0    0.826936
1    0.173064
Name: novel_idea, dtype: float64

In [41]:
total['novel_idea'][1537:].value_counts(normalize=True)

0    0.85697
1    0.14303
Name: novel_idea, dtype: float64

### Text Readability

Text readability refers to how difficult a corpus is to read. Several measures for text readability exist, but these measures often use extremely similar equations or rely on a reference list of words (with any word not on that list increasing the readability score significantly). Because these articles are extremely jargon-heavy, I determined that the text readability measures I would use are the Flesch Reading Ease and Flesch-Kincaid Grade Level, found within the textstat library. More information about these equations can be found at [source 1](https://en.wikipedia.org/wiki/Flesch%E2%80%93Kincaid_readability_tests) and [source 2](https://pypi.org/project/textstat/).

The below function determines the value of these measures for each text and then adds the value to a new column corresponding to each measure.

In [44]:
#used to determine readability measures for each article
def readability(dataframe):
    flesch_reading_ease_value = []
    flesch_kincaid_grade_value = []
    
    for i in dataframe['clean_text']:
        flesch_reading_ease_value.append(textstat.flesch_reading_ease(i))
        flesch_kincaid_grade_value.append(textstat.flesch_kincaid_grade(i))

    dataframe['flesch_reading_ease'] = flesch_reading_ease_value
    dataframe['flesch_kincaid_grade'] = flesch_kincaid_grade_value
    
    #shows if there will be a reshape error when adding to the dataframe
    print(len(dataframe['flesch_reading_ease']))
    return

In [45]:
readability(retract)
readability(no_retract)
readability(total)

1537
9082
10619
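
For reference, both measures are linear functions of the average sentence length and the average number of syllables per word. The sketch below implements the standard published formulas from raw counts. The function names are hypothetical, and the values textstat returns may differ because it performs its own sentence and syllable counting.

```python
def flesch_reading_ease_from_counts(words, sentences, syllables):
    """Flesch Reading Ease: higher scores indicate easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade_from_counts(words, sentences, syllables):
    """Flesch-Kincaid Grade Level: approximate U.S. school grade."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


# hypothetical passage: 100 words, 5 sentences, 150 syllables
ease = flesch_reading_ease_from_counts(100, 5, 150)
grade = flesch_kincaid_grade_from_counts(100, 5, 150)
```

Longer sentences and more syllables per word lower the reading ease and raise the grade level, which is why jargon-heavy articles tend to score as difficult.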


All of the feature engineered columns created for each dataframe were saved as new CSV files.

In [46]:
retract.to_csv('./retract_feature_engineered_data.csv')
no_retract.to_csv('./no_retract_feature_engineered_data.csv')
total.to_csv('./total_feature_engineered_data.csv')
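
By default `to_csv` also writes the dataframe index, and reading that file back produces an extra "Unnamed: 0" column like the ones dropped at the top of this notebook. The sketch below shows the difference that `index=False` makes, using an in-memory buffer and hypothetical demo data instead of the real files.

```python
import io

import pandas as pd

demo = pd.DataFrame({"animal_binary": [1, 0]})

# default behaviour: the index is written as an unnamed first column
with_index = io.StringIO()
demo.to_csv(with_index)
with_index.seek(0)
columns_with_index = pd.read_csv(with_index).columns.tolist()

# index=False: only the data columns are written
without_index = io.StringIO()
demo.to_csv(without_index, index=False)
without_index.seek(0)
columns_without_index = pd.read_csv(without_index).columns.tolist()
```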