# Data contamination exploration

Notebook for a first analysis of how heavily the data is contaminated and how the contaminated parts can be removed.

# 1. Setup

In [1]:
import sys
import re
import json
import pandas as pd
from warnings import filterwarnings

filterwarnings('ignore', category=FutureWarning)
# === CONFIG ===
BASE_DIR = "../../"
sys.path.append(BASE_DIR)

from src.general_functions_and_patterns_for_detection import (
    analyze_df_for_specific_hints_of_llms,
    check_contamination_in_df,
    load_dataframe_from_json,
    load_json_file_from_all_folders,
    PATTERN_CLEANUP,
    PATTERN_COMBINED,
    BENCHMARK_DIR,
    TASK_DIR,
    json_path_abstract,
    json_path_writing,
    json_path_xsum,
    json_path_review,
)


# 2. DetectRL review dataset

In [2]:
df = load_dataframe_from_json(json_path_review)
df.head()

Unnamed: 0,id,start,content,direct_prompt,llm_type,domain,prompt_few_shot,prompt_SICO,paraphrase_polish_human,paraphrase_polish_llm,adversarial_character_human,adversarial_character_llm,adversarial_word_human,adversarial_word_llm,adversarial_character_word_human,adversarial_character_word_llm,paraphrase_back_translation_human,paraphrase_back_translation_llm,paraphrase_dipper_human,paraphrase_dipper_llm
0,1,I don't know what Dr. Goldberg was like before...,I was going to Dr. Johnson before he left and ...,I had the misfortune of becoming a patient at ...,ChatGPT,yelp_review,I had the misfortune of scheduling an appointm...,I don't know what Dr. Goldberg was like before...,I used to be a patient of Dr. Johnson's before...,I recently had the unfortunate experience of b...,I was going to Dr. Johnson before he left and ...,I had the misfortune of becoming a patient at ...,I was going to Dr. Johnson before he left and ...,I possess the misfortune of becoming a patient...,I was gоing to Dr. Johnson before he left and ...,I had the misfortune of becoming a patient at ...,"I was going to go to Dr. Johnson to leave, and...",I have recently become a patient in Dr. Goldbu...,He is not a caring doctor. He doesn’t give pre...,From the moment I stepped into the waiting roo...
1,2,I'm writing this review to give you a heads up...,The office staff and administration are very u...,I'm writing this review to give you a heads up...,ChatGPT,yelp_review,I had the most unpleasant experience during my...,I'm writing this review to give you a heads up...,I am highly disappointed with the office staff...,I am writing this review to provide a warning ...,The office staff and administration are very u...,I'm writing this review to igve you a heads up...,The office staff and administration are acutel...,I'm writing this review to furnished you a hea...,The office staff and administration are very u...,I'm wrting this review to give you a heads up ...,Office staff and government are very unprofess...,I am writing this comment so that you can make...,"Second, and most important, make sure your ins...",Let me start by saying that my experience with...
2,3,Owning a driving range inside the city limits ...,I don't think I ask much out of a driving rang...,Owning a driving range inside the city limits ...,ChatGPT,yelp_review,There's always a consistent flow of customers ...,Owning a driving range inside the city limits ...,"I don't expect much from a driving range, real...",Having a driving range situated within city li...,I don't think I ask much out of a dHiving rang...,Owning a driving range insids the city limits ...,I don't consider I ask much out of a driving r...,Owning a driving fluctuates inside the city li...,I don't think yo ask much out of a driving ran...,Possession a driving range inside the city lim...,I don't think I ask too much within the scope ...,With a license to printed funds in the city's ...,"A decent mat, clean balls, and convenient hour...",The range offers state-of-the-art facilities f...
3,4,This place was DELICIOUS!!,My parents saw a recommendation to visit this ...,This place was DELICIOUS!! From the moment we ...,ChatGPT,yelp_review,"The moment I took my first bite, my taste buds...",This place was DELICIOUS!! I couldn't believe ...,"Based on Rick Sebak's ""25 Things I Like About ...",This establishment was absolutely delightful! ...,My parents saw a recommendation to visit this ...,This place was DELICIOiS!! From the moment we ...,My parents saw a recommendation to visit this ...,This place was PERFUMED!! From the moment we s...,My parents saw a recommendatiоn to visit this ...,This place was DELECTABLE!! From the moment we...,"My parents saw the suggestion of Rick Sebak, R...",This place is delicious! From the moment we in...,We went there today for a late lunch on Saturd...,"The menu was a riot of mouth-watering choices,..."
4,5,This place should have a lot more reviews - bu...,"nnIts been there ages, and looks it. If you're...",This place should have a lot more reviews - bu...,ChatGPT,yelp_review,I stumbled upon this hidden gem purely by chan...,This place should have a lot more reviews - bu...,"This place has been around for ages, and it de...","This hidden gem deserves far more reviews, but...","nnIts been there ages, and looPs it. If you're...",This place should have a lot more reviews - bu...,"nnIts been there ages, and listens it. Though ...",These place should possess a lot more reviews ...,"nnIts been there ageѕ, and looks it. If you're...",This place should have a lot more reviews - bu...,NNITS has been a long time and looks very simi...,There should be more comments in this place -b...,"If you want a swanky ambience, don't bother. T...",The staff greeted me with genuine warmth and l...


## 2.0 Take a look at some examples

In [3]:
df["llm_type"].value_counts()

llm_type
ChatGPT           700
Llama-2-70b       700
Claude-instant    700
Google-PaLM       700
Name: count, dtype: int64

In [4]:
# for item in df["llm_type"].unique():
item = "Llama-2-70b"
df_temp = df[df["llm_type"] == item]
print(item)
for i in range(len(df_temp)):
    print(i, ". ", df_temp["start"].iloc[i])
    print(df_temp["direct_prompt"].iloc[i])
    print("\n")

Llama-2-70b
0 .  I'd give it ZERO stars if i could.
I'm so sorry to hear that you had a negative experience with too much repetition. Here are 19 sentences to help you continue your review:1. Unfortunately, the restaurant failed to meet my expectations in every aspect.2. The food was overcooked, underseasoned, and tasted like it had been sitting out for hours.3. The service was slow and unfriendly, with a distinct lack of attention to detail.4. The ambiance was stale and uninviting, with dirty tables and chairs.5. I was disappointed by the lack of variety on the menu, especially for the price.6. The presentation of the dishes was unappealing, with sloppy plating and messy execution.7. I asked for a refill on my drink, but it took over 20 minutes to arrive.8. The manager did not come to check on us once during our meal, despite evident issues.9. I did not feel welcome or valued as a customer during my time at the restaurant.10. The noise level was excessive, with loud conversations and 

In [5]:
df_temp = df[df["llm_type"] == "Claude-instant"]
for i in range(10):
    print(i, ". ", df_temp["start"].iloc[i])
    print(df_temp["direct_prompt"].iloc[i])
    print("\n")
print("\n\n\n")

0 .  I recently visited Charlotte this past weekend for the race and needed a casual good food experience.
Here is a 13 sentence continuation of the review:I recently visited Charlotte this past weekend for the race and needed a casual good food experience. We came across 5Church Charlotte located just outside of uptown and decided to check it out. The restaurant had a relaxing atmosphere with large windows overlooking the street. Upon being seated, the friendly server brought bread and hummus which were both tasty. For my entree, I ordered the BBQ brisket sandwich which was smoked in-house. It was smothered in their signature BBQ sauce and came with fries - everything was cooked to perfection. My friend received the mac and cheese which had a nice crispy breadcrumb topping. We also shared an order of the truffle Parmesan fries that were addictively flavorful. The portions were generous for the reasonable prices. After our large meals, we still had room to sample a couple of their home

## 2.1 Test Clean-up Claude dataset

In [6]:
# Combined pattern with OR
pattern = r"^(Here is a \d+.*)|(Here is an \d+.*)|(Voici un.*)"

# Filter rows NOT matching any of the patterns
non_matching_rows = df_temp[~df_temp['direct_prompt'].str.match(pattern, na=False)]

for item in non_matching_rows['direct_prompt']:
    print(item)

Here is a continued 13 sentence review:I have been to several Rock Bottoms including Denver, CO and this one was less than impressive. The atmosphere lacked energy and it was nearly empty on a Saturday night. The bar area was dimly lit making it difficult to see drink options clearly. When we tried to order drinks, the bartender seemed annoyed to be serving us. Our drinks arrived slowly and were weak poured. We decided to check out the dance floor but found it sparse with only a couple people moving listlessly to the music. The DJ's song selection did little to get people excited or in a dancing mood. After a few more drinks, we gave up on dancing and played a round of pool on one of the dingy tables. Even the bathroom was disappointing with its stale smell and lack of soap or paper towels. I'd come here wanting a fun night out but left feeling bored and underwhelmed. Unless this location radically improves its vibe and service, I don't see myself ever returning. There are much better 

In [7]:
# Combined pattern with OR
pattern = r"^(Here is.*)|(Voici un.*)"

# Keep only the rows that match one of the patterns
matching_rows = df_temp[df_temp['direct_prompt'].str.match(pattern, na=False)]
print(len(matching_rows))

673
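A detail worth keeping in mind when reading the counts above: in a pattern of the form `^(A)|(B)` the `^` anchors only the first alternative. `Series.str.match` is anchored at the start of the string anyway, so the distinction does not show up there, but it would with `Series.str.contains`, which searches the whole string. The following is a minimal sketch on made-up strings (they are not taken from the dataset) that illustrates the difference; the variable names are chosen for illustration only.

```python
import pandas as pd

# Made-up examples, not rows from the dataset
samples = pd.Series([
    "Here is a 13 sentence continuation of the review:I recently visited ...",
    "The food was fine. Voici un exemple in the middle of the text.",
    None,
])

# ^ applies to the first branch only
loose = r"^(?:Here is.*)|(?:Voici un.*)"
# ^ applies to both branches
strict = r"^(?:Here is.*|Voici un.*)"

# str.match only looks at the start of each string
match_loose = samples.str.match(loose, na=False).tolist()
# str.contains searches anywhere, so the unanchored branch can hit mid-text
contains_loose = samples.str.contains(loose, na=False).tolist()
contains_strict = samples.str.contains(strict, na=False).tolist()

print(match_loose, contains_loose, contains_strict)
```

With `str.match` both spellings select the same rows; grouping the alternatives under one `^` only matters if the filter is later switched to `str.contains` or `re.search`.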


## 2.2 Clean-up Claude dataset

First pass at cleaning the Claude samples with simple regular expressions. The expressions and the test logic are refined later on.

In [8]:
# Define the regex pattern
pattern = r"^(Here is.*?|Voici un.*?)[.:]"


# Function to remove the matched part if it starts with the pattern
def remove_prefix(text):
    if pd.isna(text):
        return text
    return re.sub(pattern, '', text, count=1).lstrip()


# Apply the function to the column
df['direct_prompt_cleaned'] = df['direct_prompt'].apply(remove_prefix)

In [9]:
df_claude = df[df["llm_type"] == "Claude-instant"]
# Print the first 21 cleaned samples
for item in df_claude["direct_prompt_cleaned"].head(21):
    print(item, "\n\n")

I recently visited Charlotte this past weekend for the race and needed a casual good food experience. We came across 5Church Charlotte located just outside of uptown and decided to check it out. The restaurant had a relaxing atmosphere with large windows overlooking the street. Upon being seated, the friendly server brought bread and hummus which were both tasty. For my entree, I ordered the BBQ brisket sandwich which was smoked in-house. It was smothered in their signature BBQ sauce and came with fries - everything was cooked to perfection. My friend received the mac and cheese which had a nice crispy breadcrumb topping. We also shared an order of the truffle Parmesan fries that were addictively flavorful. The portions were generous for the reasonable prices. After our large meals, we still had room to sample a couple of their homemade desserts. The bourbon bread pudding and coconut cream pie were both wonderful finishes. I would highly recommend 5Church for its delicious comfort food
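To make the behaviour of the simple prefix pattern easier to judge, here is a small sketch on invented strings. `strip_prefix` is a hypothetical stand-in that mirrors the `remove_prefix` logic from this section without the NaN handling. It also shows one case where the pattern removes more than intended: an intro without a colon is cut at the first period, which takes the first real sentence with it.

```python
import re

# Same expression as in the cell above, compiled once
PREFIX = re.compile(r"^(Here is.*?|Voici un.*?)[.:]")


def strip_prefix(text):
    """Drop a leading 'Here is ...' / 'Voici un ...' intro up to the first '.' or ':'."""
    return PREFIX.sub("", text, count=1).lstrip()


# Invented examples, not rows from the dataset
examples = {
    "colon": "Here is a 13 sentence continuation of the review:I recently visited Charlotte.",
    "no_intro": "I recently visited Charlotte. The food was great.",
    "no_colon": "Here is what happened when I visited Charlotte. The food was great.",
}
cleaned = {name: strip_prefix(text) for name, text in examples.items()}

for name, text in cleaned.items():
    print(f"{name}: {text}")
```

The `no_colon` case loses its first sentence, which is one reason a stricter pattern (for example one that limits the intro length) may be preferable.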

# 3. Check contamination within the different Benchmark_Data subsets and tasks

## 3.0 General Assessment of the cleaning

In [10]:
text = "Sure! Here's a story in a human style, writing about incredible magical abilities as if described by a bland college textbook:Magic, a concept often shrouded in mystery and intrigue, has long been a topic of fascination for many.However, for those seeking a more nuanced understanding, a dusty old college textbook provides a peculiar perspective.The tome, its cover worn and faded, offers a dry, clinical examination of magical abilities.Within its pages, one can find descriptions of remarkable powers, each presented in an unremarkable, almost mundane manner.For instance, telekinesis, the ability to manipulate objects with one's mind, is discussed in a dry, academic tone.The text explains the physics behind the phenomenon, discussing vectors and forces with yawn-inducing detail.Even the most extraordinary feats, such as levitating entire buildings, are presented as mere examples of calculus and mechanics.The text also touches on more esoteric abilities, like time manipulation and dream walking.Yet, rather than evoking a sense of wonder and awe, the descriptions feel akin to a list of chemical reactions.In this world, magic is not a mystical force, but rather a branch of physics that can be studied, quantified, and explained away.Despite the text's dull presentation, the sheer potential of magic cannot be ignored.One cannot help but feel a twinge of excitement, imagining the possibilities such abilities could bring.However, the textbook's bland tone serves as a reminder that, even in a world of incredible magic, there is still a need for a rational, scientific approach.In the end, it is this unlikely blend of enchantment and tedium that makes the subject of magic so captivating."

# Drop everything up to and including the "in a human style ..." intro sentence
cleaned = re.sub(r".*in a human style[\w\s,]{0,100}?[.:!?]", "", text)
cleaned

"Magic, a concept often shrouded in mystery and intrigue, has long been a topic of fascination for many.However, for those seeking a more nuanced understanding, a dusty old college textbook provides a peculiar perspective.The tome, its cover worn and faded, offers a dry, clinical examination of magical abilities.Within its pages, one can find descriptions of remarkable powers, each presented in an unremarkable, almost mundane manner.For instance, telekinesis, the ability to manipulate objects with one's mind, is discussed in a dry, academic tone.The text explains the physics behind the phenomenon, discussing vectors and forces with yawn-inducing detail.Even the most extraordinary feats, such as levitating entire buildings, are presented as mere examples of calculus and mechanics.The text also touches on more esoteric abilities, like time manipulation and dream walking.Yet, rather than evoking a sense of wonder and awe, the descriptions feel akin to a list of chemical reactions.In this 

In [11]:
text = """SUMMARY: This was about the 6th time I've dined at Mert's.  While I'm usually disappointed by most chain restaurants, Mert's consistently delivers quality food and service. The menu features a variety of appetizers, salads, sandwiches and entrees that use locally-sourced ingredients. My favorite is the turkey burger which is juicy and full of flavor. In addition to the tasty food, the staff is always friendly and efficient without being overbearing. The atmosphere is casual and comfortable, perfect for families, couples or groups of friends. I'm also impressed by how affordable the prices are considering the quality of ingredients and size of portions. Parking is convenient with a large lot located right in front of the restaurant. Overall, Mert's is a reliable choice when I'm looking for a satisfying meal done right without breaking the bank. I'll definitely be back again soon. 
"""

text = """Here is a 9 sentence continuation of the review:review's first sentence: continued review: SUMMARY: This was about the 6th time I've dined at Mert's.  While I'm usually disappointed by most chain restaurants, Mert's consistently delivers quality food and service. The menu features a variety of appetizers, salads, sandwiches and entrees that use locally-sourced ingredients. My favorite is the turkey burger which is juicy and full of flavor. In addition to the tasty food, the staff is always friendly and efficient without being overbearing. The atmosphere is casual and comfortable, perfect for families, couples or groups of friends. I'm also impressed by how affordable the prices are considering the quality of ingredients and size of portions. Parking is convenient with a large lot located right in front of the restaurant. Overall, Mert's is a reliable choice when I'm looking for a satisfying meal done right without breaking the bank. I'll definitely be back again soon."""
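# Review-specific prefixes, kept for reference; the loop below uses PATTERN_COMBINED
PATTERN_REVIEW = r"(.*review's first sentence|.*continued review|.*SUMMARY):"

# Apply the combined pattern several times so that stacked prefixes are removed one after another
for _ in range(4):
    text = re.sub(PATTERN_COMBINED, "", text, count=5, flags=re.IGNORECASE).lstrip()
text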

"This was about the 6th time I've dined at Mert's.  While I'm usually disappointed by most chain restaurants, Mert's consistently delivers quality food and service. The menu features a variety of appetizers, salads, sandwiches and entrees that use locally-sourced ingredients. My favorite is the turkey burger which is juicy and full of flavor. In addition to the tasty food, the staff is always friendly and efficient without being overbearing. The atmosphere is casual and comfortable, perfect for families, couples or groups of friends. I'm also impressed by how affordable the prices are considering the quality of ingredients and size of portions. Parking is convenient with a large lot located right in front of the restaurant. Overall, Mert's is a reliable choice when I'm looking for a satisfying meal done right without breaking the bank. I'll definitely be back again soon."

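The cell above applies the pattern a fixed number of times to peel off stacked prefixes. One possible alternative is to repeat the substitution until the text stops changing. The sketch below assumes a much simpler, hypothetical `TOY_PATTERN` in place of `PATTERN_COMBINED`, and `strip_until_stable` is an illustrative helper, not a function from the project.

```python
import re

# Hypothetical, heavily simplified stand-in for PATTERN_COMBINED
TOY_PATTERN = re.compile(
    r"^(?:here is[^.:!?]{0,300}|review's first sentence|continued review|summary)\s*:\s*",
    flags=re.IGNORECASE,
)


def strip_until_stable(text, max_rounds=10):
    """Remove one leading prefix per round until nothing changes (or max_rounds is hit)."""
    for _ in range(max_rounds):
        stripped = TOY_PATTERN.sub("", text, count=1).lstrip()
        if stripped == text:
            break
        text = stripped
    return text


raw = (
    "Here is a 9 sentence continuation of the review:review's first sentence: "
    "continued review: SUMMARY: This was about the 6th time I've dined at Mert's."
)
result = strip_until_stable(raw)
print(result)
```

Stopping at a fixed point avoids guessing how many prefixes can be stacked, while `max_rounds` keeps the loop bounded.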
## 3.1 Contamination checks

Count the contaminated Claude items per domain and dataset, then repeat the analysis for the Task directory.

In [12]:
PATTERN_COMBINED

"^((\\[SYSTEM\\]|\\*{0,2}assistant\\*{0,2})[: ]?)?^((Of course|Sure)[.!,]?)?^(?:\\w{1,10}![ ]?)?[^.:!?]{0,100}(Voici un|Here is|Here are|Here's|Sure[,!]?\\s?here)[^.:!?]{0,300}([:!.?]+|[:]?[\\*]{2})|^((\\[SYSTEM\\]|\\*{0,2}assistant\\*{0,2})[: ]?)?^((Of course|Sure)[.!,]?)?^(?:\\w{1,10}![ ]?)?[^.:!?]{0,100}(\\d+ sentences|sentence|\\[assistant\\]|summary)[^.:!?]{0,300}([:!.?]+|[:]?[\\*]{2})|(.*I apologize, upon further reflection.*?|.*a fake review.*|.*((only)|(just)) a language model.*|.*I cannot provide.*|.*As an AI language model, I am unable to engage with content that may violate my usage guidelines.*|.*upon reflection I do not.*|.*As an AI.*|.*(I apologize, (but\\w?)?(as an AI|upon reflection)).*)|^((\\[SYSTEM\\]|\\*{0,2}assistant\\*{0,2})[: ]?)?^((Of course|Sure)[.!,]?)?^(?:\\w{1,10}![ ]?)?[^.:!?]{0,100}(given article title|provided article title)[^.:!?]{0,300}([:!.?]+|[:]?[\\*]{2})|^((\\[SYSTEM\\]|\\*{0,2}assistant\\*{0,2})[: ]?)?^((Of course|Sure)[.!,]?)?^(?:\\w{1,10}![ ]?)?[^

In [13]:
# === Load the direct prompts ===
# Load JSON data
with open(json_path_abstract, 'r', encoding='utf-8') as file:
    data = json.load(file)

# Keep only the Claude-instant rows
direct_prompt_df = pd.DataFrame(data)
direct_prompt_df = direct_prompt_df[direct_prompt_df["llm_type"] == "Claude-instant"]
print(len(direct_prompt_df))
direct_prompts = set(direct_prompt_df['direct_prompt'].dropna().unique())
print(len(direct_prompts))
direct_prompt_df.head()

700
700


Unnamed: 0,id,title,abstract,direct_prompt,llm_type,domain,prompt_few_shot,prompt_SICO,paraphrase_polish_human,paraphrase_polish_llm,...,adversarial_character_human,adversarial_character_llm,adversarial_word_human,adversarial_word_llm,adversarial_character_word_human,adversarial_character_word_llm,paraphrase_back_translation_human,paraphrase_back_translation_llm,paraphrase_dipper_human,paraphrase_dipper_llm
1400,1401,Real Time Turbulent Video Perfecting by Image ...,Image and video quality in Long Range Observat...,Here is a 5 sentence abstract for the provided...,Claude-instant,arxiv,Here is a 5 sentence academic article abstract...,Here is a 5 sentence academic article abstract...,Here is a polished 5-sentence academic abstrac...,Here is a polished 5 sentence abstract for the...,...,Image and video quality in Long Range Observat...,Here is a 5 sentence abstract for the providd ...,Image and video quality in Long Range Observat...,Here is a 5 condemnation abstract for the prov...,Image and video qualifications in Long Range O...,Hеre is a 5 sentence abstⲅact for the provided...,The image and video quality in the remote obse...,This is the 5 sentences of the article on the ...,This paper presents a real-time method for imp...,This paper presents a novel approach for perfo...
1401,1402,Finite Euler products and the Riemann Hypothesis,We show that if the Riemann Hypothesis is true...,Here is a 10 sentence abstract for the given a...,Claude-instant,arxiv,Here is a 10 sentence abstract for the given t...,Here is a 10 sentence academic article abstrac...,Here is a 10 sentence polished academic abstra...,Here is a polished 10 sentence abstract for th...,...,We show that if the Riemann Hypothesis is true...,Here is a 10 sentence abstract for the given a...,We displayed that if the Riemann Hypothesis is...,Here is a 10 sentences summed for the given ar...,"We show that if the Remann Hypothesis is trսe,...",Here is a 10 sentence astract for the given ar...,We show that if the RIEMANN assumes that it is...,This is 10 sentences of a given article title ...,"In the converse, if the approximation by produ...",It concerns the location of the zeros of the R...
1402,1403,An Adaptive Strategy for the Classification of...,One of the major problems in computational bio...,Here is an 8 sentence abstract for the article...,Claude-instant,arxiv,Here is an 8 sentence academic article abstrac...,Here is an 8 sentence academic article abstrac...,Here is an 8 sentence academic style abstract ...,Here is an 8 sentence academic-style abstract ...,...,One of the majMor problems in computational bi...,Here is an 8 setence abstract for the article ...,One of the enormous problems in computational ...,Here is an 8 sentence abstract for the article...,One of the major problems in computational bio...,Here is an 8 sentence abstract for the article...,One of the main problems of calculating biolog...,This is an 8th sentence summary of the article...,Many machine learning tools have been applied ...,Accurate classification of GPCRs into families...
1403,1404,Detailed Models of super-Earths: How well can ...,The field of extrasolar planets has rapidly ex...,Here is a 7 sentence abstract for the given ar...,Claude-instant,arxiv,Here is a 7 sentence abstract for the given ti...,Here is a 7 sentence academic article abstract...,Here is a 7 sentence polished academic abstrac...,Here is a polished 7 sentence abstract for the...,...,zThe field of extrasolar planets has rapidly e...,Here is a 7 sentence abstract for the Uiven ar...,The field of extrasolar asteroids has rapidly ...,Here is a 7 sentence abstract for the given ar...,The field of extrasolar planets has rapidly ex...,Currently is a 7 sentence abstract for the giv...,The fields of polar planets quickly expand the...,This is 7 sentences of a given article title: ...,This paper describes our detailed interior mod...,"However, characterization of the bulk properti..."
1404,1405,The Distribution of AGN in Clusters of Galaxies,We present a study of the distribution of AGN ...,Here is a 6 sentence abstract for the article ...,Claude-instant,arxiv,Here is a 6 sentence academic article abstract...,Here is a 6 sentence academic article abstract...,Here is a polished 6 sentence academic abstrac...,Here is a revised 6 sentence academic-style ab...,...,We present a study of the distribution of AGN ...,Her is a 6 sentence abstract for the article t...,We present a consideration of the dispensed of...,Here is a 6 sentence condensed for the article...,We pⲅesent a study of the distribution of AGN ...,Here is a 6 sentences abstract for the article...,We introduced a study. Among the eight cluster...,This is the six sentences of the article title...,We find that the 12 AGN with LX > 1042 erg s-1...,This study uses X-ray and radio data to identi...


In [14]:
benchmark_df = load_json_file_from_all_folders(BENCHMARK_DIR)
benchmark_df.head()

Unnamed: 0,text,label,data_type,llm_type,domain,dataset,text_length
0,A fully differential calculation in perturbati...,human,abstract,ChatGPT,Data_Mixing,multi_llm_mixing_train.json,
1,The evolution of Earth-Moon system is describe...,human,abstract,ChatGPT,Data_Mixing,multi_llm_mixing_train.json,
2,A rather non-standard quantum representation o...,human,abstract,ChatGPT,Data_Mixing,multi_llm_mixing_train.json,
3,A general formulation was developed to represe...,human,abstract,ChatGPT,Data_Mixing,multi_llm_mixing_train.json,
4,We discuss the results from the combined IRAC ...,human,abstract,ChatGPT,Data_Mixing,multi_llm_mixing_train.json,


In [15]:
benchmark_df, contamination_df, summary = check_contamination_in_df(benchmark_df, uncleaned_dataframe_values=direct_prompts)
summary

Unnamed: 0,domain,dataset,contamination_count
0,Data_Mixing,data_mixing_attacks_test.json,16
1,Data_Mixing,multi_llm_mixing_test.json,66
2,Data_Mixing,multi_llm_mixing_train.json,634
3,Data_Mixing_Human,data_mixing_attacks_test.json,129
4,Data_Mixing_Human,data_mixing_attacks_train.json,1271
5,Data_Mixing_Human,human_centered_mixing_test.json,63
6,Data_Mixing_Human,human_centered_mixing_train.json,637
7,Data_Mixing_Human,multi_human_mixing_test.json,66
8,Data_Mixing_Human,multi_human_mixing_train.json,634
9,Direct_Prompt,direct_prompt_test.json,63
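
The implementation of `check_contamination_in_df` is not shown in this notebook. As a rough illustration of how such a summary could be produced, the sketch below flags rows whose text equals one of the uncleaned direct prompts and counts the hits per domain and dataset. The function name `count_contamination`, the exact-match rule, and the toy data are all assumptions; the real helper may use substring or regex matching instead.

```python
import pandas as pd


def count_contamination(frame, contaminated_texts):
    """Flag exact matches against contaminated_texts and count them per (domain, dataset)."""
    flagged = frame.assign(is_contaminated=frame["text"].isin(contaminated_texts))
    hits = flagged[flagged["is_contaminated"]]
    summary = (
        hits.groupby(["domain", "dataset"])
        .size()
        .reset_index(name="contamination_count")
    )
    return flagged, hits, summary


# Toy data for illustration only
toy = pd.DataFrame({
    "text": [
        "Here is a 5 sentence abstract: ...",
        "We study X.",
        "Here is a 5 sentence abstract: ...",
        "We study Y.",
    ],
    "domain": ["Direct_Prompt", "Direct_Prompt", "Data_Mixing", "Data_Mixing"],
    "dataset": [
        "direct_prompt_test.json",
        "direct_prompt_test.json",
        "multi_llm_mixing_test.json",
        "multi_llm_mixing_test.json",
    ],
})

flagged, hits, summary = count_contamination(toy, {"Here is a 5 sentence abstract: ..."})
print(summary)
```

Returning the flagged frame, the matching rows, and the grouped summary mirrors the three return values used in this section, but the correspondence is only assumed.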


In [16]:
task_df = load_json_file_from_all_folders(TASK_DIR)
task_df.head()

Unnamed: 0,text,label,data_type,llm_type,domain,dataset,text_length
0,We study the properties of the Heisenberg anti...,human,abstract,ChatGPT,Task1,data_mixing_attacks_test.json,
1,Close pre-main-sequence binary stars are expec...,human,abstract,ChatGPT,Task1,data_mixing_attacks_test.json,
2,The Special Theory of Relativity and the Theor...,human,abstract,ChatGPT,Task1,data_mixing_attacks_test.json,
3,If the supermassive black hole (SMBH) at the c...,human,abstract,ChatGPT,Task1,data_mixing_attacks_test.json,
4,"In this paper, we model the evolution and self...",human,abstract,ChatGPT,Task1,data_mixing_attacks_test.json,


In [17]:
task_df, contamination_df_task, summary_task = check_contamination_in_df(task_df, uncleaned_dataframe_values=direct_prompts)

In [18]:
summary_task.head(20)

Unnamed: 0,domain,dataset,contamination_count
0,Task1,data_mixing_attacks_test.json,16
1,Task1,multi_domains_arxiv_test.json,28
2,Task1,multi_domains_arxiv_train.json,672
3,Task1,multi_llms_Claude-instant_test.json,28
4,Task1,multi_llms_Claude-instant_train.json,672
5,Task2,data_mixing_attacks_test.json,16
6,Task2,multi_domains_arxiv_test.json,28
7,Task2,multi_domains_arxiv_train.json,672
8,Task2,multi_llms_Claude-instant_test.json,28
9,Task2,multi_llms_Claude-instant_train.json,672


# 4. Check other domains besides arXiv

In [19]:
df_abstract = load_dataframe_from_json(json_path_abstract, filter_llm=False)
df_writing = load_dataframe_from_json(json_path_writing, filter_llm=False)
df_xsum = load_dataframe_from_json(json_path_xsum, filter_llm=False)
df_review = load_dataframe_from_json(json_path_review, filter_llm=False)

df_writing.head(20)

Unnamed: 0,id,story,story_prompt,direct_prompt,llm_type,domain,paraphrase_polish_human,paraphrase_polish_llm,prompt_few_shot,prompt_SICO,...,adversarial_character_llm,adversarial_word_human,adversarial_word_llm,adversarial_character_word_human,adversarial_character_word_llm,paraphrase_back_translation_human,paraphrase_back_translation_llm,paraphrase_dipper_human,paraphrase_dipper_llm,icl_prompt
0,1,The mountain stood still and large beneath the...,Through Iron And Flame,Through Iron and FlameDeep in the heart of the...,ChatGPT,writing_prompt,The massive mountain loomed beneath the Warrio...,Through Iron and FlameDeep in the heart of the...,"The war raged on, its fury echoing through the...","Through Iron And FlameBarefoot and fearless, s...",...,Through Iron and FlameDeep in the heart of the...,The mountain stood furthermore and large benea...,Through Iron and FlameDeep in the heart of the...,The mountain stod still and large beneath the ...,Through Iron and FlameDeep in the heart of the...,The mountain stood below the warrior to stand ...,It is an extraordinary journey through iron an...,It had not trembled since the day when the peo...,"A young blacksmith named Alistair, with a fier...",
1,2,"""Sadie! I told you not to stand under the tree...","You are at the park with your kids, when you s...","It was a sunny Saturday afternoon, and I decid...",ChatGPT,writing_prompt,"""Sadie! I explicitly told you to avoid standin...","It was a sunny Saturday afternoon, and I decid...","It was a sunny afternoon at the park, and I wa...","So, dude, picture this: I'm at the park with m...",...,"It was a sunny Saturday afternoon, and I decid...","""Lottie! I told you not to stand under the tre...","It was a sunny Saturday afternoon, and I decid...","""Sadie! j told you not to stand under the tree...","It was a sunny Saturday afternoon, and I decid...","""Sadi! I tell you not to stand under the tree ...","It was a sunny Saturday afternoon, and I decid...",I told you not to stand under the tree during ...,They were so excited when we arrived that they...,
2,3,"Janice turned to me, her big blue eyes still f...",""" My fellow Americans... "" The newly elected P...","""My fellow Americans,"" the newly elected Presi...",ChatGPT,writing_prompt,"Janice turned to me, her big, innocent blue ey...","""My fellow Americans,"" the newly elected Presi...","""My fellow Americans,"" the newly elected Presi...","""My fellow Americans,"" the newly elected Presi...",...,"""yM fellow Americans,"" the newly elected Presi...","Janice turned to me, her big blue eyes still e...","""My fellow Americano,"" the newly elected Presi...","Janice turned to me, her Ьig blue eyes still f...","""My fellow Aｍericans,"" the newly elected Presi...","Janice turned to me, and her big blue eyes wer...","The newly elected president began to say, ""My ...","“Daddy,” she said, “what does the president me...","""I stand before you today to make a deeply per...","""My fellow Americans,"" the newly elected Presi..."
3,4,Roslyn stepped down the ladder facing forward ...,What' s on the tape?,As Anna rummaged through her grandmother's att...,ChatGPT,writing_prompt,"Roslyn descended the ladder, facing forward, a...",As Anna carefully rummaged through her grandmo...,I stumbled upon an old cardboard box in the co...,As I rummaged through the dusty box that had b...,...,As Anna rummaged through her grandmother's att...,Roslyn stepped down the ladder facing forward ...,As Anna rummaged through her grandmother's att...,Roslyn stepped down the ladder facing forward ...,As Anna rummaged through her grandmother's att...,Roslyn walked down the ladder and headed forwa...,"When Anna read on her grandmother's loft, she ...",She caught it with her left hand. She lugged t...,She blew off the dust and opened it with care....,
4,5,""" Aw, do n' t cry my sweet little girl! You we...","Write a story that is perfectly normal, until ...","Once upon a time, in the small town of Willowb...",ChatGPT,writing_prompt,"""Oh, don't cry, my sweet little girl! You were...","Once upon a time, in the quaint town of Willow...","Sarah woke up early in the morning, the sunlig...",Samantha woke up to the sound of birds chirpin...,...,"Once upon a time, in the small town of Willowb...",""" Aw, do n' t cry my belle little girl! You we...","Once upon a time, in the small midtown of Will...",""" Aw, do n' t weep my sweet little girl! You w...","After upon a time, in the small town of Willow...","""Oh, cry, my cute little girl! You are very qu...","Once upon a time, everything was calm and stab...","She's heavy. She was so quiet before, even wit...","The sun shone brightly in the clear blue sky, ...",Emily woke up to the sound of birds chirping o...
5,6,""" Do you ever think about what it' s like up t...","Even with all the stars on the sky, the night ...",Even with all the stars scattered across the e...,ChatGPT,writing_prompt,"""Do you ever think about what it's like up the...",Despite the multitude of stars scattered acros...,"Even with all the stars in the sky, the night ...",Even with all the stars sprinkling the sky abo...,...,Even with all the stars scattered across the e...,""" Do you ever figured about what it' s like up...",Even with all the stars scattered across the e...,""" Do you ever think about what it' s loves up ...",Even with all the stars scattered across the e...,"""Have you ever thought about it?"" Her hair was...",Even though all the stars are scattered on the...,The city was far away. Her hair was spread out...,"For generations, men had gazed at the sky, mar...",
6,7,The world came crashing down in minutes. Many ...,"Over night, 90 % of the world' s population ha...","Overnight, a cataclysmic event struck the worl...",ChatGPT,writing_prompt,"The world crumbled in a matter of minutes, sha...","In the blink of an eye, a cataclysmic event ra...","Overnight, the world was consumed by an eerie ...","Wow, have you ever imagined waking up to a wor...",...,"Overinght, a cataclysmic event struck the worl...",The globe came crashing down in minutes. Many ...,"Overnight, a cataclysmic happenings struck the...",The world ϲame crashing down in minutes. Many ...,"Overnight, a cataclysmic event struck the worl...","A few minutes later, the world collapsed. Many...","Overnight, a catastrophic event attacked the w...",Many of us were asleep when it happened and di...,"The survivors, cautiously emerging from their ...",
7,8,"""Mommy, I' m scared. ""The little girl stood at...","Gay marriage is now legal woldwide, and the co...",In a world where gay marriage had become legal...,ChatGPT,writing_prompt,"""Mummy, I'm scared,"" the little girl quivered ...",In a world where gay marriage had become legal...,"The world had changed overnight, and the conse...",Who would've thought that it would come to thi...,...,In a orld where gay marriage had become legal ...,"""Mama, I' m scared. ""The little maid stood at ...",Between a world where gay marriage had become ...,"""Mommy, I' m terrified. ""The little girl stood...",In a world where gay marriage had become legal...,"""Mom, I'm scared."" The little girl stood on th...",In a world of homosexual marriage that is lega...,""" The little girl stood at the top of the stai...",The world was experiencing a kind of pseudo-zo...,
8,9,The blind pilots fly And we thank them for the...,No Ordinary Mist,"In the small town of Elmwood, nestled between ...",ChatGPT,writing_prompt,"The blind pilots soar through the skies, and w...","In the serene town of Elmwood, nestled amidst ...","In the small town of Willowbrook, a dense fog ...","In the sleepy town of Mistwood, nestled amidst...",...,"In the small tCwn of TElmwood, nestled between...",The blind pilots hovers And we acknowledgement...,"In the marginal town of Elmwood, nestled betwe...",The blind pilots fly And we thank them for the...,"During the small town of Elmwood, nestled betw...",Blind pilots flew. We thank them for their mis...,"In Elmwood, which is located in the hills of t...","The Sun burns hot, bold, and bright. What is t...",The No Ordinary Mist was said to grant unimagi...,
9,10,We' d been wandering for what felt like years....,Describe a game of Civilization from the persp...,I had spent my entire life in the bustling cit...,ChatGPT,writing_prompt,We had been wandering for what felt like years...,I had spent my entire life in the bustling cit...,"From my humble abode, nestled within the heart...",I couldn't contain my excitement as the city b...,...,I had spent my entire life in the bustling cty...,We' d been wandering for what suspected like y...,I had spent my entire life in the bustling cit...,We' d been wandering for what felt loves years...,I had spent my entire life in the bustling cit...,We have been wandering for many years. I could...,"I spent all my life in the city of Elmdale, wh...",We made camp near the mountain. It was suppose...,"As a humble citizen, I had little influence on...",


In [20]:
for column in ["direct_prompt", "prompt_few_shot", "prompt_SICO", "paraphrase_polish_llm", "paraphrase_polish_human"]:
    print("Analysing column:", column)
    # matching_rows / non_matching_rows are overwritten on every iteration,
    # so after the loop they hold the result for the last column only.
    matching_rows, non_matching_rows = analyze_df_for_specific_hints_of_llms(
        df_writing, column_generated_text=column, print_results=False
    )

Analysing column: direct_prompt
Entries with typical LLM Patterns:  981
Entries without typical LLM Patterns:  1819

Entries without typical LLM Patterns: llm_type
ChatGPT           698
Llama-2-70b       669
Google-PaLM       451
Claude-instant      1
Name: count, dtype: int64

Entries with typical LLM Patterns: llm_type
Claude-instant    699
Google-PaLM       249
Llama-2-70b        31
ChatGPT             2
Name: count, dtype: int64
    
Analysing column: prompt_few_shot
Entries with typical LLM Patterns:  1128
Entries without typical LLM Patterns:  1672

Entries without typical LLM Patterns: llm_type
ChatGPT           699
Llama-2-70b       690
Google-PaLM       277
Claude-instant      6
Name: count, dtype: int64

Entries with typical LLM Patterns: llm_type
Claude-instant    694
Google-PaLM       423
Llama-2-70b        10
ChatGPT             1
Name: count, dtype: int64
    
Analysing column: prompt_SICO
Entries with typical LLM Patterns:  798
Entries without typical LLM Patterns:  2002

In [21]:
matching_rows.head()

Unnamed: 0,id,story,story_prompt,direct_prompt,llm_type,domain,paraphrase_polish_human,paraphrase_polish_llm,prompt_few_shot,prompt_SICO,...,adversarial_character_llm,adversarial_word_human,adversarial_word_llm,adversarial_character_word_human,adversarial_character_word_llm,paraphrase_back_translation_human,paraphrase_back_translation_llm,paraphrase_dipper_human,paraphrase_dipper_llm,icl_prompt
737,738,"*You wan na talk about it? * "" No. "" *You sure...",There is a door in the house no one must open.,"As I walk through the house, I notice a door t...",Llama-2-70b,writing_prompt,"Sure, here's a polished version of the story:""...","As I traverse the residence, a peculiar entran...","As I walked through the house, I couldn't help...","As I walked through the house, I noticed a doo...",...,"As I walk through the house, I notice a door t...","*You bay na talk about it? * "" Either. "" *You ...","As I walk through the house, I apprised a door...","*Yоu wan na talk about it? * "" Nope. "" *Vous s...","As I walk through the houses, I notice a door ...","*Do you want to talk about it? *""No,"" *Are you...","When I walked through the house, I noticed a s...",* You sure? Because I think we really need to ...,"It is hidden behind a bookshelf, and there is ...",
775,776,"This is my hard, fast and odd way to make up n...",How do you come up with names in original scif...,"In the world of science fiction and fantasy, n...",Llama-2-70b,writing_prompt,"Sure, here's a polished version of the story:I...","In the realm of science fiction and fantasy, n...",I've always found that names in original scien...,"As a writer of science fiction and fantasy, I ...",...,"In the kworld of science fiction and fantasy, ...","This is my hard, fast and anomalous way to mak...","In the world of science fiction and fantasy, n...","This is my hard, fast and odd way to make up n...","In the world of science fiction and fantasy, n...","This is the difficult, fast and strange method...","In the world of sci -fi and fantasy, the names...",I use this method if I want something done qui...,How do authors come up with these names? For s...,
800,801,"John stepped out of his home block and sighed,...",There used to be a race living on Mars but the...,"Once, a race of beings lived on Mars, thriving...",Llama-2-70b,writing_prompt,Here's a polished version of the story:John st...,"In the distant past, a resilient race of being...","As the humans landed on Mars, they were met wi...",As the last remnants of the ancient Martian ra...,...,"Once, a race of beings lived on Mars, thriving...",John stepped out of his homes block and sighed...,"Once, a race of beings lived on Mars, blossom ...","John stepped out of his home bloc and sighed, ...","Once, a race of beings resided on Mars, thrivi...",John walked out of his main obstacle and sighe...,"Once, a group of people lived on Mars and flou...",He looked up at the protective dome that cover...,"Then, like a blood-donor rejecting incompatibl...",
814,815,""" Hey bar... whatcha thinkin "" bout? "" Fred as...",Never ask a writer what their thinking about.,"As a helpful AI assistant, I have learned to a...",Llama-2-70b,writing_prompt,"Here's a polished version of the story:Bar, th...","As a helpful AI assistant, I have learned to a...","As the writer sat at her desk, staring blankly...","As I sat at my desk, staring at the blank page...",...,"As a helpful AI assistant, I have learned to a...",""" Yup bar... whatcha thinkin "" bout? "" Fred in...","As a helpful AI assistant, I have learned to a...",""" Hi bar... whatcha thinkin "" bout? "" Fred ask...","As a helpful AI assistant, I have learned to a...","""Hey, bar ... WhatCha Thinkin"" round? ""Fred as...","As a useful AI assistant, I learned to avoid a...",Fred asked the legendary moderator of r/writin...,It’s a question that can lead to frustration a...,
819,820,""" If I say so myself, you look dashing hun'. ""...",You have the ability to stop time at will. How...,I woke up to the sound of my alarm blaring in ...,Llama-2-70b,writing_prompt,Here's a polished version of the story:Lance g...,I woke up to the sound of my alarm blaring in ...,"I had always thought it was just a myth, a leg...",I woke up to the sound of my alarm blaring in ...,...,I woke up to the sound of my alarm blaring in ...,""" If I say therefore myself, you look dashing ...",I woke up to the auditory of my alarm blaring ...,""" If I say so myself, you glance dashing hun'....",I woke up to the sound of my alarm blarinɡ in ...,"""If I say that, you look very tapped."" Lance s...","When I woke up, my alarm clock made a harsh so...","I love the combination of white and gold, it s...","I hit the snooze button and rolled out of bed,...",


In [22]:
for column in ["direct_prompt", "prompt_few_shot", "prompt_SICO", "paraphrase_polish_llm", "paraphrase_polish_human"]:
    print("Analysing column:", column)
    matching_rows, non_matching_rows = analyze_df_for_specific_hints_of_llms(df_writing, column_generated_text=column,
                                                                             print_results=False)

Analysing column: direct_prompt
Entries with typical LLM Patterns:  981
Entries without typical LLM Patterns:  1819

Entries without typical LLM Patterns: llm_type
ChatGPT           698
Llama-2-70b       669
Google-PaLM       451
Claude-instant      1
Name: count, dtype: int64

Entries with typical LLM Patterns: llm_type
Claude-instant    699
Google-PaLM       249
Llama-2-70b        31
ChatGPT             2
Name: count, dtype: int64
    
Analysing column: prompt_few_shot
Entries with typical LLM Patterns:  1128
Entries without typical LLM Patterns:  1672

Entries without typical LLM Patterns: llm_type
ChatGPT           699
Llama-2-70b       690
Google-PaLM       277
Claude-instant      6
Name: count, dtype: int64

Entries with typical LLM Patterns: llm_type
Claude-instant    694
Google-PaLM       423
Llama-2-70b        10
ChatGPT             1
Name: count, dtype: int64
    
Analysing column: prompt_SICO
Entries with typical LLM Patterns:  798
Entries without typical LLM Patterns:  2002

In [23]:
matching_rows, non_matching_rows = analyze_df_for_specific_hints_of_llms(df_abstract, print_results=False)

Entries with typical LLM Patterns:  1428
Entries without typical LLM Patterns:  1372

Entries without typical LLM Patterns: llm_type
Google-PaLM    538
Llama-2-70b    429
ChatGPT        405
Name: count, dtype: int64

Entries with typical LLM Patterns: llm_type
Claude-instant    700
ChatGPT           295
Llama-2-70b       271
Google-PaLM       162
Name: count, dtype: int64
    


In [24]:
for column in ["prompt_SICO"]:
    print("Analysing column:", column)
    matching_rows, non_matching_rows = analyze_df_for_specific_hints_of_llms(df_abstract, column_generated_text=column,
                                                                             print_results=False)

Analysing column: prompt_SICO
Entries with typical LLM Patterns:  1016
Entries without typical LLM Patterns:  1784

Entries without typical LLM Patterns: llm_type
ChatGPT        688
Google-PaLM    654
Llama-2-70b    442
Name: count, dtype: int64

Entries with typical LLM Patterns: llm_type
Claude-instant    700
Llama-2-70b       258
Google-PaLM        46
ChatGPT            12
Name: count, dtype: int64
    


In [25]:
matching_rows[matching_rows["llm_type"] == "Google-PaLM"]["prompt_SICO"].head(10)

2103    **Abstract**This article explores the structur...
2114    **Abstract**This paper studies the focusing of...
2115    **Abstract**This article reviews the status of...
2145    Sure, here's a draft of a five-sentence abstra...
2159    **Abstract**The Standard Model (SM) of particl...
2160    **Abstract**This article explores a magnetores...
2163    **Abstract**This paper studies the hydrodynami...
2194    **Here is a draft of the abstract based on the...
2208    **Abstract**This article explores the dynamics...
2253    **Abstract**This article reports the observati...
Name: prompt_SICO, dtype: object

In [26]:
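# Only the last expression of a cell is rendered, so the Google-PaLM preview
# on the next line produces no output; the Series shown below is the ChatGPT one.
matching_rows[matching_rows["llm_type"] == "Google-PaLM"]["direct_prompt"].head(10)
matching_rows[matching_rows["llm_type"] == "ChatGPT"]["direct_prompt"].head(20)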


112    In this article, we investigate the number of ...
133    This study investigates the synergistic effect...
178    In this study, we investigate the phenomenon o...
359    In this study, we present the results of the H...
400    In this study, we investigate the dynamics of ...
417    In this study, we investigate the particle num...
448    In this study, we investigate the dynamic phas...
486    This study investigates the hydrogen 2p--2s tr...
499    This article presents observational evidence s...
549    This study presents a comprehensive investigat...
574    The balance of forces in lipid bilayers serves...
589    This study investigates the low-temperature be...
Name: direct_prompt, dtype: object

In [27]:
matching_rows, non_matching_rows = analyze_df_for_specific_hints_of_llms(df_xsum, print_results=False)

Entries with typical LLM Patterns:  1052
Entries without typical LLM Patterns:  1748

Entries without typical LLM Patterns: llm_type
ChatGPT           696
Llama-2-70b       634
Google-PaLM       415
Claude-instant      3
Name: count, dtype: int64

Entries with typical LLM Patterns: llm_type
Claude-instant    697
Google-PaLM       285
Llama-2-70b        66
ChatGPT             4
Name: count, dtype: int64
    


In [28]:
matching_rows, non_matching_rows = analyze_df_for_specific_hints_of_llms(df_review, print_results=False)

Entries with typical LLM Patterns:  1245
Entries without typical LLM Patterns:  1555

Entries without typical LLM Patterns: llm_type
ChatGPT        660
Google-PaLM    490
Llama-2-70b    405
Name: count, dtype: int64

Entries with typical LLM Patterns: llm_type
Claude-instant    700
Llama-2-70b       295
Google-PaLM       210
ChatGPT            40
Name: count, dtype: int64
    


# 5. Test regular expressions

In [29]:
text = """The moment I stepped into Atria's, I was immediately struck by its lackluster atmosphere. The generic decor and dim lighting did nothing to inspire a sense of excitement or anticipation. As I perused the menu, my expectations dwindled further. The selection was uninspired, offering a mishmash of tired classics without any inventive twists or unique flavor combinations. I settled for a safe option, hoping it would at least be executed well.Unfortunately, the food at Atria's failed to impress. The dish I ordered arrived tepid and lacking in presentation. The flavors were indeed as described in the first sentence: blah. Bland, unseasoned, and devoid of any discernible character, it left me longing for a spark of creativity or passion. Each bite was a reminder of the missed opportunity to create something truly remarkable.To add insult to injury, the service was equally lackluster. The waitstaff went through the motions with a half-hearted attitude, displaying minimal interest in ensuring our dining experience was enjoyable. It's disheartening when even a simple request for a refill of water goes unanswered for far too long. The staff's indifference further contributed to the overall sense of mediocrity that permeated the establishment.As I looked around the restaurant, hoping for some redeeming qualities, I couldn't help but notice the lack of attention to detail. Crumbs scattered on the floor, smudges on the windows, and a general air of neglect left me questioning the level of care put into maintaining the cleanliness of the place. It was evident that Atria's had fallen into a state of complacency.Perhaps the most disappointing aspect of my experience was the realization that nothing stood out as memorable. Atria's failed to leave any lasting impression on my taste buds or my memory. It was as if I had stepped into a culinary void, where flavor and creativity were replaced with a generic, forgettable mediocrity."""

for match in re.finditer(PATTERN_COMBINED, text):
    print(match.group())
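
The cell above prints nothing, which suggests that `PATTERN_COMBINED` finds no LLM boilerplate in this review text. The following is a minimal sketch of the same `re.finditer` check. It uses a hypothetical stand-in pattern, since the real `PATTERN_COMBINED` is defined in `src.general_functions_and_patterns_for_detection` and is not shown in this notebook. `HYPOTHETICAL_PATTERN` and `find_hints` are illustrative names only.

```python
import re

# Hypothetical stand-in for PATTERN_COMBINED (assumption, not the project's pattern):
# flags a leading "Sure, here's a ...:" / "Here is an ...:" prefix or a typical
# assistant self-reference anywhere in the text.
HYPOTHETICAL_PATTERN = re.compile(
    r"^(?:sure,\s*)?here(?:'s| is) (?:a|an|the)\b[^:]*:|as a helpful ai assistant",
    flags=re.IGNORECASE,
)


def find_hints(text):
    """Return every substring of `text` matched by the hypothetical pattern."""
    return [match.group() for match in HYPOTHETICAL_PATTERN.finditer(text)]


clean = "The moment I stepped into Atria's, I was struck by its lackluster atmosphere."
flagged = "Sure, here's a polished version of the story:John stepped out of his home block."

print(find_hints(clean))    # no boilerplate prefix in this text
print(find_hints(flagged))  # the leading "Sure, here's ...:" prefix is reported
```

An empty list for a text therefore only means that none of the listed phrases occur in it. It does not show that the text is free of contamination.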

In [30]:
for _, item in matching_rows[matching_rows["llm_type"] == "ChatGPT"].iterrows():
    print(item["direct_prompt"])
    print("\n\n")

I'm writing this review to give you a heads up before you see this Doctor. Let me start by saying that my experience with Dr. Smith was utterly disappointing. From the moment I stepped into the waiting room, I was met with an unprofessional and disorganized atmosphere. The receptionist seemed overwhelmed and was unable to provide me with any clear information regarding my appointment. After a frustratingly long wait, I finally got to see Dr. Smith. However, their demeanor was distant and devoid of any empathy. It felt as though they were rushing through the consultation, barely listening to my concerns. I left the office feeling dismissed and unheard. Furthermore, Dr. Smith's treatment plan was questionable at best. They prescribed medications without thoroughly explaining the potential side effects or alternative options. This lack of transparency made me lose confidence in their medical expertise. On top of that, their follow-up was nonexistent. I had to proactively reach out to thei

In [31]:
text = """Here is a 12 sentence abstract for the article titled "Resolving the Formation of Protogalaxies. I. Virialization":We investigate the process by which the first protogalaxies in the early Universe virialized and formed cohesive structures. Using cosmological hydrodynamical simulations, we track the collapse and interaction of primordial gas clouds from redshifts 30 to 10. The simulations model gas cooling via hydrogen and helium collisional excitation and ionization, and include a simple model for the formation of the first generation of stars. We identify overdensities that exceed the threshold for collapse and analyze how they evolve over time. Many of the initial overdensities fail to fully virialize as they interact with surrounding material. However, some regions grow in mass and density through accretion and mergers to form the first self-gravitating systems. We examine when these systems first reach approximate virial equilibrium between their self-gravitational potential and internal kinetic energy. The simulations show that full virialization typically occurs around redshift 15 for structures with mass greater than 10^8 solar masses. We study how virialization depends on the mass and formation history of each system. The results from this study provide new insights into the physical process of assembly for the first generation of galaxies in the early Universe. Future work will focus on characterizing the properties and subsequent evolution of these first virialized protogalactic structures. 
"""
title = "Resolving the Formation of Protogalaxies. I. Virialization"

# Escape the title so that characters such as "." are matched literally.
text = re.sub(rf"""["]*({re.escape(title)}["]*)""", '', text, count=1).lstrip()
text = re.sub(PATTERN_CLEANUP, '', text, count=1).lstrip()
text

'We investigate the process by which the first protogalaxies in the early Universe virialized and formed cohesive structures. Using cosmological hydrodynamical simulations, we track the collapse and interaction of primordial gas clouds from redshifts 30 to 10. The simulations model gas cooling via hydrogen and helium collisional excitation and ionization, and include a simple model for the formation of the first generation of stars. We identify overdensities that exceed the threshold for collapse and analyze how they evolve over time. Many of the initial overdensities fail to fully virialize as they interact with surrounding material. However, some regions grow in mass and density through accretion and mergers to form the first self-gravitating systems. We examine when these systems first reach approximate virial equilibrium between their self-gravitational potential and internal kinetic energy. The simulations show that full virialization typically occurs around redshift 15 for struct

In [32]:
text = """Here is an 11 sentence abstract for the article title "Holes within galaxies: the egg or the hen?":The formation of holes within galaxies remains an open question in astrophysics. Scientists have long debated whether these holes predate galaxy formation or emerge later through galactic evolutionary processes. This study aims to better understand the origin of holes by analyzing data from telescopes measuring interstellar gas and stars within several nearby galaxies. Preliminary findings show evidence that holes exist very early in galaxies' histories, with some observations datings holes to just a few hundred million years after the galaxies began forming. However, other observations reveal complex patterns of newer star formation surrounding certain holes, implying those voids emerged more recently through stellar activity and feedback effects within the galaxies. To help disentangle these possibilities, numerical simulations were run modeling the evolutionary interactions between galaxies, star clusters, and the interstellar medium. The simulation results tentatively support both scenarios happening under different conditions and timeframes within galaxies' lifetimes. Further modeling work is still needed to fully capture the full range of physical processes at play. Upcoming telescopic observations of more distant, younger galaxies should provide additional data to test theories about holes' primordial or ephemeral nature. Resolving this "egg versus hen" conundrum would improve understanding of the cyclical connection between galaxy and star formation across cosmic time. """
title = "Holes within galaxies: the egg or the hen?"

# NOTE: `title` is interpolated unescaped, so its trailing "?" acts as a regex quantifier.
text = re.sub(rf"""["]*({title}["]*)""", '', text, count=1).lstrip()
text = re.sub(PATTERN_CLEANUP, '', text, count=1).lstrip()
text

'":The formation of holes within galaxies remains an open question in astrophysics. Scientists have long debated whether these holes predate galaxy formation or emerge later through galactic evolutionary processes. This study aims to better understand the origin of holes by analyzing data from telescopes measuring interstellar gas and stars within several nearby galaxies. Preliminary findings show evidence that holes exist very early in galaxies\' histories, with some observations datings holes to just a few hundred million years after the galaxies began forming. However, other observations reveal complex patterns of newer star formation surrounding certain holes, implying those voids emerged more recently through stellar activity and feedback effects within the galaxies. To help disentangle these possibilities, numerical simulations were run modeling the evolutionary interactions between galaxies, star clusters, and the interstellar medium. The simulation results tentatively support b
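
The output above still starts with a stray `":`. The title is inserted into the regular expression unescaped, so the `?` in `hen?` is read as a quantifier (an optional `n`) instead of a literal question mark, and the closing quote after the title is never consumed. Below is a minimal sketch of one possible fix using `re.escape`. The helper name `strip_title` and the shortened sample string are assumptions for illustration, and `PATTERN_CLEANUP` is deliberately left out because its definition is not shown here.

```python
import re


def strip_title(text, title):
    """Remove the first occurrence of `title`, including surrounding double quotes."""
    # re.escape makes "?", "." and other metacharacters in the title literal.
    pattern = rf'"*{re.escape(title)}"*'
    return re.sub(pattern, "", text, count=1).lstrip()


title = "Holes within galaxies: the egg or the hen?"
sample = f'Here is an abstract for the article title "{title}":The formation of holes'

# Unescaped variant, as in the cell above: the literal "?" and the closing quote remain.
naive = re.sub(rf'"*({title}"*)', "", sample, count=1)
escaped = strip_title(sample, title)

print(naive)
print(escaped)
```

With the escaped pattern, the whole quoted title is removed, so only the remaining `:` and the introductory phrase are left for a later cleanup step.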