In [3]:
import json
import re
import pandas as pd

# Load and Prepare Data

In [5]:
# Load the JSON file containing the search results (a list of article records)
with open("../data/02_document_search_results.json", "r", encoding="utf-8") as file:
    data = json.load(file)

# Print count for verification
print(f"Total articles processed: {len(data)}")

Total articles processed: 512


In [6]:
# Regex patterns that flag sentences where authors begin describing their own work
patterns = [
    r"\b[Tt]his (article|work|study|paper|research|review|survey|chapter)\b",
    r"\b[Ii]n this (work|study|paper|research|review|survey|chapter)\b",
    r"\b[Ww]e (propose|introduce|present|develop|describe|demonstrate|report|discuss|analyze|examine|investigate|explore|evaluate|address|outline)\b",
    r"\b[Ii]n this (manuscript|article|contribution|approach|framework|investigation|analysis|implementation)\b",
    r"\b[Tt]he (article|paper|study|work|research|review|survey|manuscript|current study|present study|present work|current work)\b",
    r"\b[Oo]ur (work|study|paper|chapter|research|approach|framework|method|system|contribution|focus|aim|objective|goal)\b",
    r"\b[Tt]his (manuscript|contribution|investigation|analysis|implementation|approach|framework|method|system)\b",
    r"\b[Tt]he (purpose|aim|goal|objective) of this (paper|work|study|research|article|chapter|manuscript)\b",
    r"\b[Hh]ere(,)? we\b",
]
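
A quick way to sanity-check patterns like these is to run them against a few sentences. The sketch below copies two of the patterns above; the sample sentences and the `matches_any` helper are hypothetical and are not part of the notebook or its dataset.

```python
import re

# Two of the notebook's patterns, copied verbatim so the demo is self-contained.
demo_patterns = [
    r"\b[Tt]his (article|work|study|paper|research|review|survey|chapter)\b",
    r"\b[Hh]ere(,)? we\b",
]

# Hypothetical sentences, not taken from the dataset.
samples = [
    "Urban growth has accelerated in recent decades.",
    "To address this, this study develops a new model.",
    "Here, we report a benchmark.",
]


def matches_any(sentence, patterns):
    """Return True if any pattern is found anywhere in the sentence."""
    return any(re.search(p, sentence) for p in patterns)


flags = [matches_any(s, demo_patterns) for s in samples]
for sentence, flag in zip(samples, flags):
    print(f"{flag!s:5} | {sentence}")
```

Because `re.search` scans the whole sentence, a marker phrase triggers a match even when it appears mid-sentence, as in the second sample.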

In [8]:
def extract_introduction_text(text, patterns):
    """
    Extracts text from the beginning of the abstract until it reaches a sentence
    that matches one of the specified patterns (typically where authors start
    describing their specific work).

    If no pattern is found, returns the entire text. If the very first
    sentence matches, returns an empty string.

    Args:
        text (str): The text to process
        patterns (list): List of regex patterns to match

    Returns:
        str: Text until the first pattern match, or the entire text if no pattern is found
    """
    # Split text into sentences
    sentences = re.split(r"(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s", text)

    # Collect sentences until a pattern match is found
    result_sentences = []
    pattern_found = False

    for sentence in sentences:
        # Check if sentence matches any pattern
        if any(re.search(pattern, sentence) for pattern in patterns):
            pattern_found = True
            break

        result_sentences.append(sentence)

    # If no pattern was found, return the original text
    if not pattern_found:
        return text

    # Join the collected sentences back together with spaces
    result_text = " ".join(result_sentences)
    return result_text
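
To see how the sentence-splitting regex used above behaves, here is a minimal sketch on a made-up text. The regex is copied from the function; the constant name `SENTENCE_SPLIT` and the sample text are assumptions for illustration only.

```python
import re

# Same splitting regex as in extract_introduction_text.
SENTENCE_SPLIT = r"(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s"

# Hypothetical text, not taken from the dataset.
text = "Cities grow fast. Models, e.g. CNNs, help. Is that enough? We propose X."

sentences = re.split(SENTENCE_SPLIT, text)
for sentence in sentences:
    print(repr(sentence))
```

The first lookbehind keeps abbreviations such as "e.g." from ending a sentence, and the second does the same for short titles such as "Dr.". The regex splits only after "." or "?", so a sentence ending in "!" stays attached to the one that follows it.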

In [13]:
def extract_text_from_pattern_to_end(text, patterns):
    """
    Extracts text from the first sentence matching any of the specified patterns
    (typically where authors start describing their specific work) until the
    end of the text.

    If no pattern is found, returns an empty string.

    Args:
        text (str): The text to process
        patterns (list): List of regex patterns to match

    Returns:
        str: Text from the first pattern match to the end of the text, or empty string if no pattern is found
    """
    # Split text into sentences
    sentences = re.split(r"(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s", text)

    # Find the index of the first sentence matching any pattern
    start_idx = -1
    for i, sentence in enumerate(sentences):
        if any(re.search(pattern, sentence) for pattern in patterns):
            start_idx = i
            break

    # If no pattern was found, return an empty string
    if start_idx == -1:
        return ""

    # Join the sentences from the matching sentence until the end
    result_text = " ".join(sentences[start_idx:])
    return result_text
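
The two functions above split the text and scan for the first match separately. One possible alternative, sketched here under the assumption that both parts are always wanted together, is a single helper that returns them as a pair. The name `split_abstract`, the demo pattern list, and the sample abstract are hypothetical and do not appear in the notebook.

```python
import re

# Same splitting regex as in the two functions above.
SENTENCE_SPLIT = r"(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s"


def split_abstract(text, patterns):
    """Hypothetical helper: return (introduction, contribution) in one pass."""
    compiled = [re.compile(p) for p in patterns]
    sentences = re.split(SENTENCE_SPLIT, text)
    for i, sentence in enumerate(sentences):
        if any(c.search(sentence) for c in compiled):
            # Everything before the first match is the introduction;
            # the matching sentence and the rest form the contribution.
            return " ".join(sentences[:i]), " ".join(sentences[i:])
    # No match: mirror the originals (full text as introduction, empty contribution).
    return text, ""


# Hypothetical pattern and abstract, for illustration only.
demo_patterns = [r"\b[Ww]e (propose|introduce|present)\b"]
demo_abstract = (
    "Cities grow fast. Mapping them is hard. "
    "We propose a new model. It works well."
)
intro, contribution = split_abstract(demo_abstract, demo_patterns)
print("Introduction:", intro)
print("Contribution:", contribution)
```

Splitting once guarantees that the two parts come from the same sentence boundaries, at the cost of changing the call sites that currently use the separate functions.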



In [11]:
# Add the introduction text to each article record in data
for item in data:
    item["introduction"] = extract_introduction_text(item["abstract"], patterns)

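# Collect abstracts with non-empty introductions
abstracts_with_intros = [item for item in data if item["introduction"].strip()]
# When no pattern matches, the function returns the abstract unchanged,
# so equality with the abstract identifies the no-match cases
full_text_abstracts = [
    item for item in data if item["introduction"] == item["abstract"]
]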

# Statistics about abstracts where no pattern was found
print(f"Number of abstracts where no pattern was found: {len(full_text_abstracts)}")
print(
    f"Percentage with no pattern found: {len(full_text_abstracts) / len(data) * 100:.2f}%"
)

# Display the first few introductory texts (up to 500 chars) along with their DOIs
for i, item in enumerate(abstracts_with_intros[:3]):
    print(f"\nIntroduction {i+1} (DOI: {item['doi']}):")
    intro = item["introduction"]
    print(f"{intro[:500]}..." if len(intro) > 500 else intro)
    print("-" * 80)


Number of abstracts where no pattern was found: 54
Percentage with no pattern found: 10.55%

Introduction 1 (DOI: 10.1038/s41598-025-91206-6):
Building rooftop extraction has been applied in various fields, such as cartography, urban planning, automatic driving, and intelligent city construction. Automatic building detection and extraction algorithms using high spatial resolution aerial images can provide precise location and geometry information, significantly reducing time, costs, and labor. Recently, deep learning algorithms, especially convolution neural networks (CNNs) and Transformer, have robust local or global feature extractio...
--------------------------------------------------------------------------------

Introduction 2 (DOI: 10.1016/j.ijcce.2024.12.003):
Problem: Modernizing and standardizing place names and addresses is a key challenge in the development of smart cities.
--------------------------------------------------------------------------------

Introduction 3 (DO

In [14]:
# Add the contribution text to each article record in data
for item in data:
    item["contribution"] = extract_text_from_pattern_to_end(item["abstract"], patterns)

# Count abstracts with non-empty contributions
abstracts_with_contributions = [
    item for item in data if item["contribution"].strip()
]

# Display statistics about extracted text segments
print(f"Number of abstracts processed: {len(data)}")
print(
    f"Number of abstracts with contributions extracted: {len(abstracts_with_contributions)}"
)
print(
    f"Percentage with contributions extracted: {len(abstracts_with_contributions) / len(data) * 100:.2f}%"
)
print(
    f"Average length of contributions: {sum(len(item['contribution']) for item in abstracts_with_contributions) / max(1, len(abstracts_with_contributions)):.2f} characters"
)

# Display the first few contribution texts (up to 500 chars) along with their DOIs
for i, item in enumerate(abstracts_with_contributions[:3]):
    print(f"Article {i+1} (DOI: {item['doi']}):")
    contribution = item["contribution"]
    print(f"{contribution[:500]}..." if len(contribution) > 500 else contribution)
    print("-" * 80)

# Keep flat lists of contribution texts for any code that still expects them
contribution_texts = [item["contribution"] for item in data]
non_empty_contributions = [text for text in contribution_texts if text.strip()]

Number of abstracts processed: 512
Number of abstracts with contributions extracted: 458
Percentage with contributions extracted: 89.45%
Average length of contributions: 1021.56 characters
Article 1 (DOI: 10.1038/s41598-025-91206-6):
To address these issues, this study developed a multi-scale global perceptron network based on Transformer and CNN using novel encoder-decoders for enhancing contextual representation of buildings. Specifically, an improved multi-head-attention encoder is employed by constructing multi-scale tokens to enhance global semantic correlations. Meanwhile, the context refinement decoder is developed and synergistically uses high-level semantic representation and shallow features to restore spatial deta...
--------------------------------------------------------------------------------
Article 2 (DOI: 10.1016/j.ijcce.2024.12.003):
Purpose: This paper proposes a solution to address matching challenges, such as incomplete descriptions, reversed word order, and the d

## Save the Results

In [15]:
# Save the records, now including the introduction and contribution fields, to a new JSON file
with open("../data/04_document_search_results_with_intros_and_contributions.json", "w", encoding="utf-8") as file:
    json.dump(data, file, indent=4)
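
After saving, it can be worth confirming that the file reloads to the same records. The sketch below shows the idea on a made-up record written to a temporary directory. The record contents and file name are hypothetical, and `ensure_ascii=False` is an optional choice (not used in the cell above) that keeps non-ASCII characters readable in the saved file.

```python
import json
import os
import tempfile

# Hypothetical record shaped like the notebook's items (values are made up).
records = [
    {
        "doi": "10.0000/example",
        "abstract": "Café maps matter. We propose X.",
        "introduction": "Café maps matter.",
        "contribution": "We propose X.",
    }
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "results.json")
    # Write with an explicit encoding so the result does not depend on the platform default.
    with open(path, "w", encoding="utf-8") as file:
        json.dump(records, file, indent=4, ensure_ascii=False)
    # Reload and compare against the in-memory records.
    with open(path, "r", encoding="utf-8") as file:
        reloaded = json.load(file)

round_trip_ok = reloaded == records
has_new_keys = all(
    set(item) >= {"introduction", "contribution"} for item in reloaded
)
print(round_trip_ok, has_new_keys)
```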