Removing the Abstract from Full Text Docs
----

# Setup and loading data

In [48]:
import pandas as pd
import numpy as np
import re

In [36]:
uncleaned_fulltext = pd.read_csv('../data/fulltext_uncleaned.csv')

In [37]:
def whitespace(col:pd.Series):
    # replace all newlines and tabs with spaces
    tmp = col.str.replace(r'[\t\n]+', ' ', regex=True)
    # remove all duplicate spaces
    tmp = tmp.str.replace(r' {2,}', ' ', regex=True)
    return tmp

df_inprocess = uncleaned_fulltext.copy()
df_inprocess.abstract = whitespace(df_inprocess.abstract)
df_inprocess.fulltext = whitespace(df_inprocess.fulltext)
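
As a quick sanity check of the normalization above, the following sketch applies the same two regex replacements to a toy Series. The sample strings are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

def whitespace(col: pd.Series) -> pd.Series:
    # replace runs of tabs and newlines with a single space
    tmp = col.str.replace(r'[\t\n]+', ' ', regex=True)
    # collapse runs of two or more spaces into one
    return tmp.str.replace(r' {2,}', ' ', regex=True)

# hypothetical sample text, not taken from fulltext_uncleaned.csv
sample = pd.Series(['Abstract\n\nThis  is\ta test.', 'no change'])
cleaned = whitespace(sample)
print(cleaned.tolist())
```

Normalizing both columns the same way matters because the later substring checks are exact: a single stray newline in either column would break a match.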

In [41]:
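def contains_abstract_whole(row):
    # True if the entire abstract appears verbatim in the full text
    return row['abstract'] in row['fulltext']


def contains_abstract_part(row):
    # True if the middle 20% of the abstract appears in the full text
    length = len(row['abstract'])
    one_tenth = length // 10
    start = length // 2 - one_tenth
    end = start + 2*one_tenth
    selection = row['abstract'][start:end]
    return selection in row['fulltext']


def contains_abstract_end(row):
    # True if the last 50 characters of the abstract appear in the full text
    selection = row['abstract'][-50:]
    return selection in row['fulltext']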

print('full abstract matches: ', df_inprocess.apply(contains_abstract_whole, axis=1).sum())
print('middle 20% abstract matches: ', df_inprocess.apply(contains_abstract_part, axis=1).sum())
print('end abstract matches: ', df_inprocess.apply(contains_abstract_end, axis=1).sum())
print('word \'Abstract\' appears: ', df_inprocess.fulltext.str.contains('A[Bb][Ss][Tt][Rr][Aa][Cc][Tt]', regex=True).sum())

full abstract matches:  2402
middle 20% abstract matches:  5160
end abstract matches:  7453
word 'Abstract' appears:  8124
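
Each `apply(..., axis=1)` call walks every row, and the same checks are recomputed in several cells of this notebook. A minimal sketch of computing each boolean mask once and combining them afterwards; the two-row frame and the `masks` name are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

# hypothetical toy frame standing in for df_inprocess
toy = pd.DataFrame({
    'abstract': ['alpha beta gamma delta', 'one two three four'],
    'fulltext': ['Title. Abstract alpha beta gamma delta Intro',
                 'Unrelated body text'],
})

def contains_abstract_whole(row):
    return row['abstract'] in row['fulltext']

def contains_abstract_end(row):
    return row['abstract'][-50:] in row['fulltext']

# compute each mask once, then reuse it in later filters
masks = pd.DataFrame({
    'whole': toy.apply(contains_abstract_whole, axis=1),
    'end': toy.apply(contains_abstract_end, axis=1),
    'heading': toy.fulltext.str.contains('A[Bb][Ss][Tt][Rr][Aa][Cc][Tt]', regex=True),
})

# rows where no check fires at all
match_none = ~masks.any(axis=1)
print(masks.sum())
print(match_none.tolist())
```

With the masks stored as columns, combinations such as "heading only" become plain boolean expressions over `masks` instead of repeated row-wise passes.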


In [45]:
match_none = (
    ~df_inprocess.apply(contains_abstract_whole, axis=1) &
    ~df_inprocess.apply(contains_abstract_part, axis=1) &
    ~df_inprocess.apply(contains_abstract_end, axis=1) &
    ~df_inprocess.fulltext.str.contains('A[Bb][Ss][Tt][Rr][Aa][Cc][Tt]', regex=True)
)

print(match_none.sum())
print(df_inprocess[match_none].iloc[2]['abstract'])
print(df_inprocess[match_none].iloc[2]['fulltext'])

1657
This article engages with the relationship between social theory, architectural theory and material culture. The article is a reply to an article in a previous volume of the journal in question (Smith, M. (2001) ‘Repetition and difference: Lefebvre, Le Corbusier and modernity’s (im)moral landscape’, Ethics, Place and Environment, 4(1), 31-34) and, consequently, is also a direct engagement with another academic's scholarship. It represents a critique of their work as well as a recasting of their ideas, arguing that the matter in question went beyond interpretative issues to a direct critique of another author's scholarship on both Le Corbusier and Lefebvre. A reply to my article from the author of the original article was carried in a later issue of the journal (Smith, M. (2002) ‘Ethical Difference(s): a Response to Maycroft on Le Corbusier and Lefebvre’, Ethics, Place and Environment, 5(3), 260-269).


In [43]:
match_only_heading = (
    ~df_inprocess.apply(contains_abstract_whole, axis=1) &
    ~df_inprocess.apply(contains_abstract_part, axis=1) &
    ~df_inprocess.apply(contains_abstract_end, axis=1) &
    df_inprocess.fulltext.str.contains('A[Bb][Ss][Tt][Rr][Aa][Cc][Tt]', regex=True)
)

print(match_only_heading.sum())
print(df_inprocess[match_only_heading].iloc[1]['abstract'])
print(df_inprocess[match_only_heading].iloc[1]['fulltext'])

1862
Laboratory animals should be provided with enrichment objects in their cages; however, it is first necessary to test whether the proposed enrichment objects provide benefits that increase the animals’ welfare. The two main paradigms currently used to assess proposed enrichment objects are the choice test, which is limited to determining relative frequency of choice, and consumer demand studies, which can indicate the strength of a preference but are complex to design. Here, we propose a third methodology: a runway paradigm, which can be used to assess the strength of an animal’s motivation for enrichment objects, is simpler to use than consumer demand studies, and is faster to complete than typical choice tests. Time spent with objects in a standard choice test was used to rank several enrichment objects in order to compare with the ranking found in our runway paradigm. The rats ran significantly more times, ran faster, and interacted longer with objects with which they had previo

In [47]:
match_only_middle = (
    ~df_inprocess.apply(contains_abstract_whole, axis=1) &
    df_inprocess.apply(contains_abstract_part, axis=1) &
    ~df_inprocess.apply(contains_abstract_end, axis=1)
)

print(match_only_middle.sum())

871


In [99]:
def find_abstract(row: pd.Series) -> pd.Series:
    result = row[['abstract', 'fulltext']].copy()
    abstract_length = len(row['abstract'])
    
    # does entire abstract match? Then just return that
    abstract_start = row['fulltext'].find(row['abstract'])
    if abstract_start > -1:
        # remove abstract and everything before it
        abstract_end = abstract_start + abstract_length
        result['fulltext'] = result['fulltext'][abstract_end:]
        return result
    
    # Find candidates for heading of abstract?
    # Need to account for appearance of "Abstract" in title? There are 4 such instances, maybe just drop?
    # abstract_pattern = re.compile(f'(?:Abstract)|(?:ABSTRACT)[.:]?')
    # heading_candidate = abstract_pattern.search(row['fulltext'])
    
    min_substring_len = 100

    # Can end match?
    substring = row['abstract'][-min_substring_len:]
    abstract_start = row['fulltext'].find(substring)
    if abstract_start > -1:
        # Find start
        abstract_end = abstract_start + min_substring_len
        
        next_step = min_substring_len+1
        substring = row['abstract'][-next_step:]
        next_start = row['fulltext'].find(substring, 0, abstract_end)
        # `and`, not `&`: bitwise & binds tighter than the comparisons
        while next_start > -1 and next_step <= abstract_length:
            abstract_start = next_start
            next_step += 1
            substring = row['abstract'][-next_step:]
            next_start = row['fulltext'].find(substring, 0, abstract_end)
        extracted_abstract = row['fulltext'][abstract_start:abstract_end]
        result['abstract'] = extracted_abstract
        result['fulltext'] = result['fulltext'][abstract_end:]
        # print(abstract_start, abstract_end)
        return result

    # Is there match in the middle?
    midpoint = abstract_length // 2
    substart = midpoint-min_substring_len//2
    subend = midpoint+min_substring_len//2
    substring = row['abstract'][substart:subend]
    abstract_start = row['fulltext'].find(substring)
    if abstract_start > -1:
        # Find end
        abstract_end = abstract_start + min_substring_len
        max_end = abstract_start + midpoint + min_substring_len//2
        substring = row['abstract'][substart:subend+1]
        while substring in row['fulltext'][:max_end]:
            abstract_end += 1
            subend += 1
            substring = row['abstract'][substart:subend+1]

        # Find start: extend the match backwards one character at a time,
        # stopping at the first character of the abstract
        while substart > 0:
            substring = row['abstract'][substart-1:subend]
            next_start = row['fulltext'].find(substring, 0, abstract_end)
            if next_start == -1:
                break
            abstract_start = next_start
            substart -= 1
        extracted_abstract = row['fulltext'][abstract_start:abstract_end]
        result['abstract'] = extracted_abstract
        result['fulltext'] = result['fulltext'][abstract_end:]
        return result

    # No heading, no match of abstract text: ?
    # Heading present but couldn't match abstract text: drop

    # All other scenarios: return NA
    # np.nan rather than np.NaN: the capitalized alias was removed in NumPy 2.0
    result['abstract'] = np.nan
    result['fulltext'] = np.nan
    return result

def remove_abstract(row: pd.Series) -> dict:
    abstract_length = len(row['abstract'])
    
    # does entire abstract match?
    abstract_start = row['fulltext'].find(row['abstract'])
    if abstract_start > -1:
        # remove abstract and everything before it
        abstract_end = abstract_start + abstract_length
        return {'fulltext':row['fulltext'][abstract_end:], 'removed_abstract':True}
    
    min_substring_len = 100

    # Can end match?
    substring = row['abstract'][-min_substring_len:]
    abstract_start = row['fulltext'].find(substring)
    if abstract_start > -1:
        # Remove everything before
        abstract_end = abstract_start + min_substring_len
        return {'fulltext':row['fulltext'][abstract_end:], 'removed_abstract':True}

    # Is there match in the middle?
    midpoint = abstract_length // 2
    substart = midpoint-min_substring_len//2
    subend = midpoint+min_substring_len//2
    substring = row['abstract'][substart:subend]
    abstract_start = row['fulltext'].find(substring)
    if abstract_start > -1:
        # Find end
        abstract_end = abstract_start + min_substring_len
        max_end = abstract_start + midpoint + min_substring_len//2
        substring = row['abstract'][substart:subend+1]
        while substring in row['fulltext'][:max_end]:
            abstract_end += 1
            subend += 1
            substring = row['abstract'][substart:subend+1]
        return {'fulltext':row['fulltext'][abstract_end:], 'removed_abstract':True}
   
    # Abstract text not found. If an "Abstract" heading is present, return an
    # empty string so the row can be dropped later; otherwise return the full
    # text unchanged.
    abstract_pattern = re.compile(r'(?:Abstract)|(?:ABSTRACT)')
    if abstract_pattern.search(row['fulltext']):
        return {'fulltext':'', 'removed_abstract':False}
    else:
        return {'fulltext':row['fulltext'][:], 'removed_abstract':False}
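
The cascade in `remove_abstract` (whole match, then tail match, then heading check) can be exercised on toy strings. The sketch below is a simplified stand-in, not the function above: it omits the middle-match branch, and `strip_abstract` and its `min_len` parameter are names introduced here for illustration, with a short window so that small examples work.

```python
import re

def strip_abstract(fulltext: str, abstract: str, min_len: int = 10) -> dict:
    # 1. whole abstract found: drop it and everything before it
    start = fulltext.find(abstract)
    if start > -1:
        return {'fulltext': fulltext[start + len(abstract):],
                'removed_abstract': True}
    # 2. only the tail of the abstract found: cut right after the tail
    tail = abstract[-min_len:]
    start = fulltext.find(tail)
    if start > -1:
        return {'fulltext': fulltext[start + len(tail):],
                'removed_abstract': True}
    # 3. no text match: blank the document if an "Abstract" heading is present
    if re.search(r'Abstract|ABSTRACT', fulltext):
        return {'fulltext': '', 'removed_abstract': False}
    # 4. no heading either: keep the full text unchanged
    return {'fulltext': fulltext, 'removed_abstract': False}

# hypothetical strings, one per branch 1 to 3
abstract = 'We study cats and dogs in detail.'
exact = strip_abstract('Title Abstract We study cats and dogs in detail. 1. Intro', abstract)
tail_only = strip_abstract('Title We examine cats and dogs in detail. 1. Intro', abstract)
heading_only = strip_abstract('Title Abstract Something else. 1. Intro', abstract)
print(exact, tail_only, heading_only, sep='\n')
```

The third case is the one that gets dropped downstream: a heading suggests an abstract is embedded in the text, but since it could not be located it cannot be cut out safely.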


In [88]:
df_inprocess.iloc[:1].apply(find_abstract, axis=1).iloc[0]

820 1482


abstract    ; knowledge that is produced and preserved wit...
fulltext     Key words. body; Bourdieu; carnival; creativi...
Name: 0, dtype: object

In [102]:
df_inprocess.head(20).apply(remove_abstract, axis=1, result_type='expand')

Unnamed: 0,fulltext,removed_abstract
0,Key words. body; Bourdieu; carnival; creativi...,True
1,Introduction: Studio based Pedagogy The studi...,True
2,2 There were different views on how to consol...,True
3,1. Introduction Notions of quality are of par...,True
4,"KEYWORDS Personal Digital Assistant, Head Mou...",True
5,"Keywords: animacy, progressive, genitive alte...",True
6,y. Replaced Elements 3 Simulation of Associati...,True
7,"Key Terms: Autobiographical Memory, Judgments...",True
8,"Key Words: Depoliticisation, Open Marxism Lab...",True
9,,False


In [110]:
cleaned_data = df_inprocess[['abstract']].copy()
cleaned_data['fulltext'] = df_inprocess.apply(remove_abstract, axis=1, result_type='expand')['fulltext']
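
The cell above keeps only the `fulltext` column of the expanded result. If the `removed_abstract` flag is also wanted downstream, one option is to join the whole expanded frame back on. This is a sketch under assumptions: `remove_abstract_stub` is a hypothetical whole-match-only stand-in for `remove_abstract`, and the two-row frame is invented for illustration.

```python
import pandas as pd

# hypothetical toy frame standing in for df_inprocess
toy = pd.DataFrame({
    'abstract': ['a b c', 'x y z'],
    'fulltext': ['a b c rest', 'Abstract other'],
})

def remove_abstract_stub(row):
    # simplified stand-in: whole-abstract match only
    start = row['fulltext'].find(row['abstract'])
    if start > -1:
        end = start + len(row['abstract'])
        return {'fulltext': row['fulltext'][end:], 'removed_abstract': True}
    return {'fulltext': '', 'removed_abstract': False}

# one pass over the rows; dict keys become columns
expanded = toy.apply(remove_abstract_stub, axis=1, result_type='expand')

# keep the original abstract next to both result columns
cleaned = toy[['abstract']].join(expanded)
cleaned = cleaned[cleaned.fulltext != '']
print(cleaned)
```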

In [114]:
cleaned_data = cleaned_data[cleaned_data.fulltext != '']
cleaned_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 9570 entries, 0 to 11839
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   abstract  9570 non-null   object
 1   fulltext  9570 non-null   object
dtypes: object(2)
memory usage: 224.3+ KB


In [None]:
cleaned_data.to_csv('../data/abstracts_fulltext_cleaned_exact.csv', index=False)