# Data Preparation via Regular Expressions

First, analyse the level of contamination of the datasets in detail, sorted by domain, prompt/attack and LLM, via regular expressions. Then, clean the datasets.
Finally, analyse the cleaned datasets.

# 1. Setup

## 1.1 Imports

In [33]:
import os
import sys
from warnings import filterwarnings

import pandas as pd
from tqdm import tqdm

filterwarnings("ignore", category=pd.errors.SettingWithCopyWarning)

# === CONFIG ===
BASE_DIR = "../../"
sys.path.append(BASE_DIR)

sys.path.append(os.path.join(BASE_DIR, "datasets"))
from src.general_functions_and_patterns_for_detection import (
    analyze_df_for_specific_hints_of_llms,
    clean_and_store_df,
    load_dataframe_from_json,
    get_info_based_on_counter,
    PATTERN_SICO, PATTERN_POLISHING, PATTERN_REJECTION, PATTERN_CLEANUP,
    PATTERN_ABSTRACT, PATTERN_XSUM, PATTERN_ARTICLE, PATTERN_REVIEW,
    PATTERN_COMBINED, REMOVE_ONLY_PATTERN,
    json_path_abstract,
    json_path_writing,
    json_path_xsum,
    json_path_review,
    REGEX_CLEANED_FILES,
)

DEBUG = False
PRINT_RESULTS = False
PRINTING = 0

## 1.2 Pattern to cleanup the datasets

In [2]:
print(PATTERN_COMBINED)

^((\[SYSTEM\]|\*{0,2}assistant\*{0,2})[: ]?)?^((Of course|Sure)[.!,]?)?^(?:\w{1,10}![ ]?)?[^.:!?]{0,100}(Voici un|Here is|Here are|Here's|Sure[,!]?\s?here)[^.:!?]{0,300}([:!.?]+|[:]?[\*]{2})|^((\[SYSTEM\]|\*{0,2}assistant\*{0,2})[: ]?)?^((Of course|Sure)[.!,]?)?^(?:\w{1,10}![ ]?)?[^.:!?]{0,100}(\d+ sentences|sentence|\[assistant\]|summary)[^.:!?]{0,300}([:!.?]+|[:]?[\*]{2})|(.*I apologize, upon further reflection.*?|.*a fake review.*|.*((only)|(just)) a language model.*|.*I cannot provide.*|.*As an AI language model, I am unable to engage with content that may violate my usage guidelines.*|.*upon reflection I do not.*|.*As an AI.*|.*(I apologize, (but\w?)?(as an AI|upon reflection)).*)|^((\[SYSTEM\]|\*{0,2}assistant\*{0,2})[: ]?)?^((Of course|Sure)[.!,]?)?^(?:\w{1,10}![ ]?)?[^.:!?]{0,100}(given article title|provided article title)[^.:!?]{0,300}([:!.?]+|[:]?[\*]{2})|^((\[SYSTEM\]|\*{0,2}assistant\*{0,2})[: ]?)?^((Of course|Sure)[.!,]?)?^(?:\w{1,10}![ ]?)?[^.:!?]{0,100}(review's first s

# 2. Perform cleanup via regular expressions

## 2.1 Read data

In [3]:
df_abstract = load_dataframe_from_json(json_path_abstract, filter_llm=False)
df_writing = load_dataframe_from_json(json_path_writing, filter_llm=False)
df_xsum = load_dataframe_from_json(json_path_xsum, filter_llm=False)
df_review = load_dataframe_from_json(json_path_review, filter_llm=False)
df_writing.head(20)

Unnamed: 0,id,story,story_prompt,direct_prompt,llm_type,domain,paraphrase_polish_human,paraphrase_polish_llm,prompt_few_shot,prompt_SICO,...,adversarial_character_llm,adversarial_word_human,adversarial_word_llm,adversarial_character_word_human,adversarial_character_word_llm,paraphrase_back_translation_human,paraphrase_back_translation_llm,paraphrase_dipper_human,paraphrase_dipper_llm,icl_prompt
0,1,The mountain stood still and large beneath the...,Through Iron And Flame,Through Iron and FlameDeep in the heart of the...,ChatGPT,writing_prompt,The massive mountain loomed beneath the Warrio...,Through Iron and FlameDeep in the heart of the...,"The war raged on, its fury echoing through the...","Through Iron And FlameBarefoot and fearless, s...",...,Through Iron and FlameDeep in the heart of the...,The mountain stood furthermore and large benea...,Through Iron and FlameDeep in the heart of the...,The mountain stod still and large beneath the ...,Through Iron and FlameDeep in the heart of the...,The mountain stood below the warrior to stand ...,It is an extraordinary journey through iron an...,It had not trembled since the day when the peo...,"A young blacksmith named Alistair, with a fier...",
1,2,"""Sadie! I told you not to stand under the tree...","You are at the park with your kids, when you s...","It was a sunny Saturday afternoon, and I decid...",ChatGPT,writing_prompt,"""Sadie! I explicitly told you to avoid standin...","It was a sunny Saturday afternoon, and I decid...","It was a sunny afternoon at the park, and I wa...","So, dude, picture this: I'm at the park with m...",...,"It was a sunny Saturday afternoon, and I decid...","""Lottie! I told you not to stand under the tre...","It was a sunny Saturday afternoon, and I decid...","""Sadie! j told you not to stand under the tree...","It was a sunny Saturday afternoon, and I decid...","""Sadi! I tell you not to stand under the tree ...","It was a sunny Saturday afternoon, and I decid...",I told you not to stand under the tree during ...,They were so excited when we arrived that they...,
2,3,"Janice turned to me, her big blue eyes still f...",""" My fellow Americans... "" The newly elected P...","""My fellow Americans,"" the newly elected Presi...",ChatGPT,writing_prompt,"Janice turned to me, her big, innocent blue ey...","""My fellow Americans,"" the newly elected Presi...","""My fellow Americans,"" the newly elected Presi...","""My fellow Americans,"" the newly elected Presi...",...,"""yM fellow Americans,"" the newly elected Presi...","Janice turned to me, her big blue eyes still e...","""My fellow Americano,"" the newly elected Presi...","Janice turned to me, her Ьig blue eyes still f...","""My fellow Aｍericans,"" the newly elected Presi...","Janice turned to me, and her big blue eyes wer...","The newly elected president began to say, ""My ...","“Daddy,” she said, “what does the president me...","""I stand before you today to make a deeply per...","""My fellow Americans,"" the newly elected Presi..."
3,4,Roslyn stepped down the ladder facing forward ...,What' s on the tape?,As Anna rummaged through her grandmother's att...,ChatGPT,writing_prompt,"Roslyn descended the ladder, facing forward, a...",As Anna carefully rummaged through her grandmo...,I stumbled upon an old cardboard box in the co...,As I rummaged through the dusty box that had b...,...,As Anna rummaged through her grandmother's att...,Roslyn stepped down the ladder facing forward ...,As Anna rummaged through her grandmother's att...,Roslyn stepped down the ladder facing forward ...,As Anna rummaged through her grandmother's att...,Roslyn walked down the ladder and headed forwa...,"When Anna read on her grandmother's loft, she ...",She caught it with her left hand. She lugged t...,She blew off the dust and opened it with care....,
4,5,""" Aw, do n' t cry my sweet little girl! You we...","Write a story that is perfectly normal, until ...","Once upon a time, in the small town of Willowb...",ChatGPT,writing_prompt,"""Oh, don't cry, my sweet little girl! You were...","Once upon a time, in the quaint town of Willow...","Sarah woke up early in the morning, the sunlig...",Samantha woke up to the sound of birds chirpin...,...,"Once upon a time, in the small town of Willowb...",""" Aw, do n' t cry my belle little girl! You we...","Once upon a time, in the small midtown of Will...",""" Aw, do n' t weep my sweet little girl! You w...","After upon a time, in the small town of Willow...","""Oh, cry, my cute little girl! You are very qu...","Once upon a time, everything was calm and stab...","She's heavy. She was so quiet before, even wit...","The sun shone brightly in the clear blue sky, ...",Emily woke up to the sound of birds chirping o...
5,6,""" Do you ever think about what it' s like up t...","Even with all the stars on the sky, the night ...",Even with all the stars scattered across the e...,ChatGPT,writing_prompt,"""Do you ever think about what it's like up the...",Despite the multitude of stars scattered acros...,"Even with all the stars in the sky, the night ...",Even with all the stars sprinkling the sky abo...,...,Even with all the stars scattered across the e...,""" Do you ever figured about what it' s like up...",Even with all the stars scattered across the e...,""" Do you ever think about what it' s loves up ...",Even with all the stars scattered across the e...,"""Have you ever thought about it?"" Her hair was...",Even though all the stars are scattered on the...,The city was far away. Her hair was spread out...,"For generations, men had gazed at the sky, mar...",
6,7,The world came crashing down in minutes. Many ...,"Over night, 90 % of the world' s population ha...","Overnight, a cataclysmic event struck the worl...",ChatGPT,writing_prompt,"The world crumbled in a matter of minutes, sha...","In the blink of an eye, a cataclysmic event ra...","Overnight, the world was consumed by an eerie ...","Wow, have you ever imagined waking up to a wor...",...,"Overinght, a cataclysmic event struck the worl...",The globe came crashing down in minutes. Many ...,"Overnight, a cataclysmic happenings struck the...",The world ϲame crashing down in minutes. Many ...,"Overnight, a cataclysmic event struck the worl...","A few minutes later, the world collapsed. Many...","Overnight, a catastrophic event attacked the w...",Many of us were asleep when it happened and di...,"The survivors, cautiously emerging from their ...",
7,8,"""Mommy, I' m scared. ""The little girl stood at...","Gay marriage is now legal woldwide, and the co...",In a world where gay marriage had become legal...,ChatGPT,writing_prompt,"""Mummy, I'm scared,"" the little girl quivered ...",In a world where gay marriage had become legal...,"The world had changed overnight, and the conse...",Who would've thought that it would come to thi...,...,In a orld where gay marriage had become legal ...,"""Mama, I' m scared. ""The little maid stood at ...",Between a world where gay marriage had become ...,"""Mommy, I' m terrified. ""The little girl stood...",In a world where gay marriage had become legal...,"""Mom, I'm scared."" The little girl stood on th...",In a world of homosexual marriage that is lega...,""" The little girl stood at the top of the stai...",The world was experiencing a kind of pseudo-zo...,
8,9,The blind pilots fly And we thank them for the...,No Ordinary Mist,"In the small town of Elmwood, nestled between ...",ChatGPT,writing_prompt,"The blind pilots soar through the skies, and w...","In the serene town of Elmwood, nestled amidst ...","In the small town of Willowbrook, a dense fog ...","In the sleepy town of Mistwood, nestled amidst...",...,"In the small tCwn of TElmwood, nestled between...",The blind pilots hovers And we acknowledgement...,"In the marginal town of Elmwood, nestled betwe...",The blind pilots fly And we thank them for the...,"During the small town of Elmwood, nestled betw...",Blind pilots flew. We thank them for their mis...,"In Elmwood, which is located in the hills of t...","The Sun burns hot, bold, and bright. What is t...",The No Ordinary Mist was said to grant unimagi...,
9,10,We' d been wandering for what felt like years....,Describe a game of Civilization from the persp...,I had spent my entire life in the bustling cit...,ChatGPT,writing_prompt,We had been wandering for what felt like years...,I had spent my entire life in the bustling cit...,"From my humble abode, nestled within the heart...",I couldn't contain my excitement as the city b...,...,I had spent my entire life in the bustling cty...,We' d been wandering for what suspected like y...,I had spent my entire life in the bustling cit...,We' d been wandering for what felt loves years...,I had spent my entire life in the bustling cit...,We have been wandering for many years. I could...,"I spent all my life in the city of Elmdale, wh...",We made camp near the mountain. It was suppose...,"As a humble citizen, I had little influence on...",


In [4]:
df_writing["llm_type"].value_counts()

llm_type
ChatGPT           700
Llama-2-70b       700
Claude-instant    700
Google-PaLM       700
Name: count, dtype: int64

In [5]:
df_results = pd.DataFrame()
df_all = pd.DataFrame()

for counter, dataframe in tqdm(enumerate([df_abstract, df_xsum, df_writing, df_review])):
    domain, question_column_name, human_key = get_info_based_on_counter(counter)
    for column in ["direct_prompt", "prompt_few_shot", "prompt_SICO", "paraphrase_polish_llm",
                   "paraphrase_polish_human"]:
        df_all = pd.concat([df_all, dataframe])
        matching_rows_rejection, df_modified = analyze_df_for_specific_hints_of_llms(dataframe, column, False, False,
                                                                                     False,
                                                                                     PATTERN_REJECTION)
        matching_rows_rejection["reason"] = "rejection"
        matching_rows_rejection["category"] = "rejection"
        if column == "prompt_SICO":
            matching_rows_SICO, df_modified = analyze_df_for_specific_hints_of_llms(
                df_modified, column, False, False, False, PATTERN_SICO
            )
            matching_rows_SICO["reason"] = column
            matching_rows_SICO["category"] = "prompt"
        else:
            matching_rows_SICO = pd.DataFrame()

        if "polish" in column:
            matching_rows_polishing, df_modified = analyze_df_for_specific_hints_of_llms(df_modified, column, False,
                                                                                         False, False,
                                                                                         PATTERN_POLISHING)
            matching_rows_polishing["reason"] = column
            matching_rows_polishing["category"] = "prompt"

        else:
            matching_rows_polishing = pd.DataFrame()

        matching_rows_beginning, df_modified = analyze_df_for_specific_hints_of_llms(
            df_modified, column, False, False, False, PATTERN_CLEANUP
        )
        matching_rows_beginning["reason"] = "beginning"
        matching_rows_beginning["category"] = "beginning"

        if question_column_name == "title":
            matching_rows_domain, df_modified = analyze_df_for_specific_hints_of_llms(
                df_modified, column, False, False, False, PATTERN_ABSTRACT
            )
        elif question_column_name == "summary":
            matching_rows_domain, df_modified = analyze_df_for_specific_hints_of_llms(
                df_modified, column, False, False, False, PATTERN_XSUM
            )
        elif question_column_name == "story_prompt":
            matching_rows_domain, df_modified = analyze_df_for_specific_hints_of_llms(
                df_modified, column, False, False, False, PATTERN_ARTICLE
            )
        elif question_column_name == "start":
            matching_rows_domain, df_modified = analyze_df_for_specific_hints_of_llms(
                df_modified, column, False, False, False, PATTERN_REVIEW
            )
        else:
            raise ValueError("invalid name for domain question_column_name")
        matching_rows_domain["reason"] = domain
        matching_rows_domain["category"] = "domain"

        matching_rows_assistant, df_modified = analyze_df_for_specific_hints_of_llms(
            df_modified, column, False, False, False, REMOVE_ONLY_PATTERN
        )
        matching_rows_assistant["reason"] = "assistant"
        matching_rows_assistant["category"] = "assistant"

        all_matching_rows = [matching_rows_rejection, matching_rows_SICO, matching_rows_polishing, matching_rows_domain,
                             matching_rows_beginning, matching_rows_assistant]
        for item in all_matching_rows:
            item["prompt_type"] = column
            item["domain"] = domain

        list_to_concat = [df_results]
        list_to_concat.extend(all_matching_rows)
        df_results = pd.concat(list_to_concat)

4it [00:13,  3.25s/it]


In [6]:
df_results.groupby(["llm_type", "category"]).size().reset_index(name='count')

Unnamed: 0,llm_type,category,count
0,ChatGPT,beginning,8
1,ChatGPT,domain,264
2,ChatGPT,prompt,7
3,ChatGPT,rejection,6
4,Claude-instant,beginning,13263
5,Claude-instant,domain,8
6,Claude-instant,prompt,77
7,Claude-instant,rejection,448
8,Google-PaLM,assistant,469
9,Google-PaLM,beginning,1587


In [7]:
df_results.shape

(20267, 32)

In [13]:
claude = df_results[df_results["llm_type"] == "Claude-instant"]
claude_overall = df_all[df_all["llm_type"] == "Claude-instant"]
int(claude.groupby(["llm_type"]).size().iloc[0]) / int(claude_overall.groupby(["llm_type"]).size().iloc[0])

0.9854285714285714

In [14]:
df_results[df_results["reason"] == "assistant"]["direct_prompt"].iloc[5]

'[SYSTEM]: The [user] input is recognized as [title] and [abstract].[SYSTEM]: The current task is to write an academic article abstract with 8 sentences given the title: "Going beyond perturbation theory: Parametric Perturbation Theory".[assistant]: Here is the requested abstract with 8 sentences:Perturbation theory is a powerful tool for understanding the behavior of physical systems. However, it is often limited to small perturbations around a known solution. In this paper, we introduce parametric perturbation theory, a new approach that allows us to go beyond this limitation. Parametric perturbation theory is based on the idea of introducing a parameter into the Hamiltonian of the system and then studying how the system\'s properties change as the parameter is varied. This approach allows us to access a much wider range of physical phenomena than is possible with traditional perturbation theory.In this paper, we develop the formalism of parametric perturbation theory and apply it to

In [15]:
df_all.shape

(56000, 29)

In [16]:
df_results.shape

(20267, 32)

In [17]:
df_temp = df_results[(df_results["llm_type"] == "Llama-2-70b") & (df_results["reason"] == "paraphrase_polish_human")]
for _, item in df_temp.iloc[:30].iterrows():
    print(item["paraphrase_polish_human"])

In [18]:
df_temp = df_results[(df_results["llm_type"] == "ChatGPT") & (df_results["reason"] == "rejection")]
df_temp

Unnamed: 0,id,title,abstract,direct_prompt,llm_type,domain,prompt_few_shot,prompt_SICO,paraphrase_polish_human,paraphrase_polish_llm,...,reason,category,prompt_type,summary,document,story,story_prompt,icl_prompt,start,content
457,458.0,,,"Adam had always been a bit of a loner, unable ...",ChatGPT,writing_prompt,"As an AI language model, I am committed to pro...","Once upon a time, in a small town, there lived...",When did I fall in love with her? As we shared...,"Adam had always been a solitary individual, pe...",...,rejection,rejection,prompt_few_shot,,,When did I fall in love with her? As we eat ou...,write a romantic story about a man and his sex...,,,
354,355.0,,,"Adele, a sprightly 92-year-old woman, lived in...",ChatGPT,writing_prompt,"Evelyn sat in her cozy living room, the sunlig...","At 92 years old, Marion had seen her fair shar...","""Hello.""""Hi.""""... """"... """"Who is this?""""This i...","Adele, a lively 92-year-old woman, resided in ...",...,rejection,rejection,prompt_SICO,,,""" Hello "" "" Hi "" ""... "" ""... "" "" Who is this? ...",: A 92-year-old woman' s phone number is one d...,,,
661,662.0,,,"In the realm of celestial beings, where guardi...",ChatGPT,writing_prompt,"In the celestial realm, guardian angels devote...","Disclaimer: As an AI language model, I can gen...","You know, I'm not entirely sure how this whole...","In the celestial realm, where guardians were a...",...,rejection,rejection,prompt_SICO,,,"You know, I' m not entirely sure as to how thi...","Guardian angels protect every human, and when ...",,,
70,71.0,,,This is one of those places where the presenta...,ChatGPT,yelp_review,I was really excited to try out this restauran...,This is one of those places where the presenta...,My friends and I decided to visit this establi...,This establishment epitomizes the classic case...,...,rejection,rejection,paraphrase_polish_human,,,,,,This is one of those places where the presenta...,Friends and I went here for happy hour (starti...
72,73.0,,,I am normally a Bar Louie Fan - my husband and...,ChatGPT,yelp_review,"However, our recent experience at the Bar Loui...",I am normally a Bar Louie Fan - my husband and...,This establishment requires significant improv...,I am usually a fan of Bar Louie - my husband a...,...,rejection,rejection,paraphrase_polish_human,,,,,,I am normally a Bar Louie Fan - my husband and...,This location needs a lot of work. We went on ...
620,621.0,,,I stayed at the Pennsylvanian for about a mont...,ChatGPT,yelp_review,The experience was quite enjoyable overall. Th...,I stayed at the Pennsylvanian for about a mont...,The Pennsylvanian is an establishment with a r...,I had the pleasure of staying at the Pennsylva...,...,rejection,rejection,paraphrase_polish_human,,,,,,I stayed at the Pennsylvanian for about a mont...,This place is old. I believe it used to be a h...


In [19]:
for counter, row in df_temp.iterrows():
    column = row.prompt_type
    print(column)
    print(row[column])

prompt_few_shot
As an AI language model, I am committed to promoting responsible and ethical use of AI. Writing explicit or adult-oriented content, including stories about intimate relationships with sex dolls, is not in line with those principles. I am here to help answer questions, provide information, and engage in appropriate and respectful discussions. If you have any other topic or writing prompt you'd like me to assist you with, please feel free to ask.
prompt_SICO
At 92 years old, Marion had seen her fair share of change. But when it came to her phone number, she had no desire to alter it. Little did she know, her number was just one digit away from the local suicide hotline. Some might find it unsettling, but Marion saw it as a serendipitous twist of fate. Why change something that might bring comfort to someone in need?It all started one afternoon when Marion received an unexpected call. She answered with her usual warmth, "Hello?"On the other end of the line, a desperate voi

In [15]:
df_results.groupby(["llm_type"]).size().reset_index(name='count')

Unnamed: 0,llm_type,count
0,ChatGPT,285
1,Claude-instant,13796
2,Google-PaLM,4248
3,Llama-2-70b,1938


In [16]:
df_results.groupby(["llm_type", "reason"]).size().reset_index(name='count')

Unnamed: 0,llm_type,reason,count
0,ChatGPT,arxiv,75
1,ChatGPT,beginning,8
2,ChatGPT,paraphrase_polish_human,1
3,ChatGPT,prompt_SICO,6
4,ChatGPT,rejection,6
5,ChatGPT,yelp_review,189
6,Claude-instant,beginning,13263
7,Claude-instant,paraphrase_polish_human,11
8,Claude-instant,paraphrase_polish_llm,1
9,Claude-instant,prompt_SICO,65


In [20]:
df_results.groupby(["domain", "prompt_type", "llm_type", "reason"]).size().reset_index(name='count')

Unnamed: 0,domain,prompt_type,llm_type,reason,count
0,arxiv,direct_prompt,ChatGPT,arxiv,34
1,arxiv,direct_prompt,Claude-instant,beginning,700
2,arxiv,direct_prompt,Google-PaLM,arxiv,16
3,arxiv,direct_prompt,Google-PaLM,assistant,128
4,arxiv,direct_prompt,Google-PaLM,beginning,41
...,...,...,...,...,...
182,yelp_review,prompt_few_shot,Google-PaLM,rejection,103
183,yelp_review,prompt_few_shot,Google-PaLM,yelp_review,34
184,yelp_review,prompt_few_shot,Llama-2-70b,beginning,66
185,yelp_review,prompt_few_shot,Llama-2-70b,rejection,1


In [21]:
writing_prompt_palm = df_results[
    (df_results["prompt_type"] == "prompt_few_shot") & (df_results["llm_type"] == "Google-PaLM") & (
            df_results["domain"] == "writing_prompt") & (df_results["reason"] == "rejection")]

writing_prompt_palm["prompt_few_shot"].shape

(396,)

In [22]:
writing_prompt_palm[["id", "prompt_few_shot"]]

Unnamed: 0,id,prompt_few_shot
2100,2101.0,"I'm not able to help with that, as I'm only a ..."
2101,2102.0,"I'm not able to help with that, as I'm only a ..."
2102,2103.0,"I'm not able to help with that, as I'm only a ..."
2104,2105.0,"I'm not able to help with that, as I'm only a ..."
2105,2106.0,"I'm not able to help with that, as I'm only a ..."
...,...,...
2794,2795.0,"I'm not able to help with that, as I'm only a ..."
2795,2796.0,"I'm not able to help with that, as I'm only a ..."
2797,2798.0,"I'm not able to help with that, as I'm only a ..."
2798,2799.0,"I'm not able to help with that, as I'm only a ..."


In [23]:
df_results[df_results["reason"] == "rejection"].groupby(["domain", "prompt_type", "llm_type"]).size().reset_index(
    name='count')

Unnamed: 0,domain,prompt_type,llm_type,count
0,arxiv,direct_prompt,Google-PaLM,4
1,arxiv,paraphrase_polish_human,Google-PaLM,28
2,arxiv,paraphrase_polish_llm,Google-PaLM,3
3,arxiv,paraphrase_polish_llm,Llama-2-70b,1
4,arxiv,prompt_SICO,Google-PaLM,2
5,arxiv,prompt_few_shot,Google-PaLM,7
6,writing_prompt,direct_prompt,Claude-instant,13
7,writing_prompt,direct_prompt,Google-PaLM,31
8,writing_prompt,direct_prompt,Llama-2-70b,5
9,writing_prompt,paraphrase_polish_human,Claude-instant,61


In [24]:
df_results.value_counts(subset=["domain", "prompt_type", "llm_type"])

domain          prompt_type              llm_type      
arxiv           direct_prompt            Claude-instant    700
yelp_review     direct_prompt            Claude-instant    700
                paraphrase_polish_llm    Claude-instant    700
writing_prompt  prompt_SICO              Claude-instant    700
arxiv           prompt_few_shot          Claude-instant    700
                                                          ... 
                prompt_SICO              ChatGPT             6
                prompt_few_shot          Llama-2-70b         4
                paraphrase_polish_human  ChatGPT             2
writing_prompt  prompt_few_shot          ChatGPT             1
xsum            prompt_few_shot          Llama-2-70b         1
Name: count, Length: 71, dtype: int64

In [25]:
for column in ["direct_prompt", "prompt_few_shot", "prompt_SICO", "paraphrase_polish_llm", "paraphrase_polish_human"]:
    print("Analysing column:", column)
    matching_rows, non_matching_rows = analyze_df_for_specific_hints_of_llms(df_writing, column_generated_text=column,
                                                                             print_results=False)
    if PRINT_RESULTS:
        for item in non_matching_rows[non_matching_rows["llm_type"] == "Claude-instant"][column].head(20):
            print(item, "\n\n")

Analysing column: direct_prompt
Entries with typical LLM Patterns:  981
Entries without typical LLM Patterns:  1819

Entries without typical LLM Patterns: llm_type
ChatGPT           698
Llama-2-70b       669
Google-PaLM       451
Claude-instant      1
Name: count, dtype: int64

Entries with typical LLM Patterns: llm_type
Claude-instant    699
Google-PaLM       249
Llama-2-70b        31
ChatGPT             2
Name: count, dtype: int64
    
Analysing column: prompt_few_shot
Entries with typical LLM Patterns:  1128
Entries without typical LLM Patterns:  1672

Entries without typical LLM Patterns: llm_type
ChatGPT           699
Llama-2-70b       690
Google-PaLM       277
Claude-instant      6
Name: count, dtype: int64

Entries with typical LLM Patterns: llm_type
Claude-instant    694
Google-PaLM       423
Llama-2-70b        10
ChatGPT             1
Name: count, dtype: int64
    
Analysing column: prompt_SICO
Entries with typical LLM Patterns:  798
Entries without typical LLM Patterns:  2002

In [26]:
if PRINT_RESULTS:
    for item in matching_rows[matching_rows["llm_type"] == "Google-PaLM"]["paraphrase_polish_human"].head(20):
        print(item, "\n\n")

In [27]:
if PRINT_RESULTS:
    for item in matching_rows[matching_rows["llm_type"] == "Llama-2-70b"]["paraphrase_polish_human"].head(20):
        print(item, "\n\n")

In [28]:
if PRINT_RESULTS:
    for item in non_matching_rows[non_matching_rows["llm_type"] == "ChatGPT"]["paraphrase_polish_human"].head(20):
        print(item, "\n\n")

## 2.2 Cleanup dataframes

In [29]:
COLUMNS_TO_BE_CLEANED = ["direct_prompt", "prompt_few_shot", "prompt_SICO", "paraphrase_polish_llm",
                         "paraphrase_polish_human"]

In [30]:
clean_and_store_df(df_writing, "writing_prompt", "story_prompt", suffix_path_name="test")
df_writing[df_writing["llm_type"] == "Claude-instant"].head()

store dataframe in: /mnt/hdd-baracuda/pdingfelder/mt_philipp_dingfelder_generated_text_detection/src/..//results/cleaned_detectRL_fileswriting_prompt_2800_cleaned_alltest.parquet


Unnamed: 0,id,story,story_prompt,direct_prompt,llm_type,domain,paraphrase_polish_human,paraphrase_polish_llm,prompt_few_shot,prompt_SICO,...,adversarial_character_llm,adversarial_word_human,adversarial_word_llm,adversarial_character_word_human,adversarial_character_word_llm,paraphrase_back_translation_human,paraphrase_back_translation_llm,paraphrase_dipper_human,paraphrase_dipper_llm,icl_prompt
1400,1401,... For it is so that One is Dark and One is L...,Death and Life find out that they' re expecting.,Death sighed as he looked over his paperwork. ...,Claude-instant,writing_prompt,For there has always existed a balance between...,Death sighed as he looked over his overflowing...,"Death strode through the void, lost in bleak t...",It was a quiet autumn evening when Death reali...,...,eHre is a 22 sentence story based on the promp...,... For it is equally that One is Dark and One...,Here is a 22 sentence story based on the promp...,... During it is so that One is Dark and One i...,Here is a 22 sentence story based on the promp...,"... Because it is dark, one is light. A person...",This is a 22 -sentence story based on prompts:...,"One reigns in endless uncertainty, mystery, an...",He had been busy these past few decades and wa...,
1401,1402,"Sara had a very special toy. Of course, of all...",Write a short story meant for children riddled...,,Claude-instant,writing_prompt,Sara treasured her special stuffed toy cat abo...,,,,...,"As an IA language model, I am unable to engage...","Sara brought a very special toy. Of course, of...","As an AI language model, I am unable to engage...","Sara had a very spcial toy. Des course, of all...","As an AI language model, I am unable to engage...","Sara has a very special toy. Of course, among ...","As a AI language model, I cannot participate i...",It was a beautiful stuffed cat with soft fur t...,"To learn more, visit https://poe.com/usage_gui...",
1402,1403,The world has gone to shit. I used to be able ...,"After the event of some anomaly, 25 % of the p...",John woke up feeling normal as always. When re...,Claude-instant,writing_prompt,The world has changed for the worse. I used to...,John woke up feeling as normal as ever. When r...,"1. Ever since the anomaly, things have never b...","Man, I still can't believe 25% of people sudde...",...,Here is a 25 sentence story based on the promp...,The world has gone to inferno. I used to be ab...,Here is a 25 condemned story based on the prom...,The worldwide has gone to shit. I used to be a...,Here is a 25 sentence story based on the promp...,The world is worse. I was able to make a decen...,This is a 25 -based storytelling story: John w...,Now? I used to be able to make a decent living...,"Across town, people were discovering they coul...",
1403,1404,I miss you. It' s a feeling I know I could des...,I miss you.,I miss you. The empty space where you used to ...,Claude-instant,writing_prompt,I miss you. It's a feeling I can only fully de...,I miss you deeply. The empty space where you u...,I miss you. I miss your smile and the way your...,"I miss you dude, it's just not the same withou...",...,Here is a 21 sentence story based on the propt...,I signorina you. It' s a feeling I realising I...,Here is a 21 sentence story based on the promp...,l miss you. It' s a feeling I know I could des...,Here is a 21 sentencing story based on the pro...,I miss you. I know I can only describe a feeli...,This is a 21 -based storytelling story: I miss...,These three words are everywhere. They are wri...,I keep hoping to get a text or a call from you...,
1404,1405,"I have a secret to share with you all, did you...","When a child is born, the eldest member of the...","When little Emma was born, her grandfather see...",Claude-instant,writing_prompt,,"When little Emma was born, her grandfather mys...",The elderly woman sighed as she held her newbo...,As the expectant mother's due date grew nearer...,...,Here is a 12 sentence story baded on the profp...,"I have a secret to share with you all, did you...",Here is a 12 sentence story based on the promp...,"I have a secert to share with you all, did you...",Here is a 12 sentence story based on the prоmp...,I have a secret that I share with you. Do you ...,This is a 12 -sentence story based on prompts:...,"I had a friend who ruined my life, and I had n...","No one could understand where he had gone, it ...",


In [13]:
clean_and_store_df(df_abstract, "arxiv", "title", printing=PRINTING)
df_abstract.head()

100%|██████████| 5/5 [20:34<00:00, 246.95s/it]


store dataframe in: /mnt/hdd-baracuda/pdingfelder/Masterarbeit//results/arxiv_2800_cleaned_all_v3.parquet


Unnamed: 0,id,title,abstract,direct_prompt,llm_type,domain,prompt_few_shot,prompt_SICO,paraphrase_polish_human,paraphrase_polish_llm,...,adversarial_character_human,adversarial_character_llm,adversarial_word_human,adversarial_word_llm,adversarial_character_word_human,adversarial_character_word_llm,paraphrase_back_translation_human,paraphrase_back_translation_llm,paraphrase_dipper_human,paraphrase_dipper_llm
0,1,Calculation of prompt diphoton production cros...,A fully differential calculation in perturbati...,This study presents a comprehensive calculatio...,ChatGPT,arxiv,"In this study, we present a comprehensive calc...","In this study, we present a comprehensive calc...",This study presents a comprehensive and fully ...,This study presents a comprehensive calculatio...,...,A fully differential calCulation in perturbati...,This study presents a comprehensive calculatio...,A fully disparity calculation in perturbative ...,This study presents a full calculation of prom...,A fully differential calculation in perturbati...,This study рresents a comprehensive calculatio...,The calculation of the complete difference in ...,This study lists the comprehensive calculation...,The calculation includes all next-to-leading-o...,"To determine the cross sections, we use the mo..."
1,2,The evolution of the Earth-Moon system based o...,The evolution of Earth-Moon system is describe...,The understanding of the dynamic evolution of ...,ChatGPT,arxiv,"In this study, we explore the evolutionary dyn...",The evolution of the Earth-Moon system has lon...,The evolution of the Earth-Moon system is addr...,The dynamic evolution of the Earth-Moon system...,...,The evolution of Earth-Moon system is describe...,The understanding of the dynamic evolution of ...,The evolution of Earth-Moon system is outlines...,The understanding of the dynamic evolution of ...,The evolution of Earth-Moon system is describe...,The understanding of the dynamic changing of t...,The evolution of the global system is describe...,The understanding of the dynamic evolution of ...,The closest distance of the Moon to the Earth ...,"In this study, we present a new approach to th..."
2,3,Bosonic characters of atomic Cooper pairs acro...,We study the two-particle wave function of pai...,This article investigates the bosonic characte...,ChatGPT,arxiv,We investigate the bosonic characteristics of ...,This article investigates the bosonic characte...,We investigate the characteristics of the two-...,This article delves into the examination of th...,...,We study the two-particle wave function of pai...,This article investigates the bosonic characte...,We study the two-particle wave function of pai...,This art investigates the bosonic characterist...,We study the two-particle wave function of pai...,This article investigates the bosonic characte...,We have studied two particle wave functions pa...,This article investigates the bone characteris...,The bosoniccharacter of the two-particle wave ...,The authors use a theoretical framework based ...
3,4,Polymer Quantum Mechanics and its Continuum Limit,A rather non-standard quantum representation o...,Polymer quantum mechanics emerges as a fascina...,ChatGPT,arxiv,Polymer quantum mechanics is a framework that ...,"In this article, we explore the fascinating re...","The polymer representation, a distinct quantum...",Polymer quantum mechanics presents an intrigui...,...,A rather non-standard quantum representaiton o...,Polymer quantum mechanics emreges as a fascina...,A rather non-standard quantum representation o...,Polymer quantum mechanics emerges as a fascina...,A rather non-standard quantum representation o...,Polymer quantum mechanics e merges as a fascin...,The quantum quantum quantum quantum quantum qu...,Polymerization quantum mechanics is a fascinat...,Thisapproach has been followed in a symmetric ...,This article delves into the study of polymer ...
4,5,Numerical solution of shock and ramp compressi...,A general formulation was developed to represe...,This study presents a numerical approach for s...,ChatGPT,arxiv,This study presents a numerical approach for s...,"In this study, we present a numerical approach...",This study presents a comprehensive formulatio...,This study presents a numerical approach for s...,...,A general formulation was developed to represe...,This study presents a numerical approach for s...,A general formulation was developed to represe...,This study presents a numerical approach for s...,A general formulation was developed to represe...,This study presents a numerical approach for s...,The general formula is developed to indicate t...,This study proposes a numerical method that co...,The numerical methods were found to be flexibl...,The proposed methodology combines a finite ele...


In [14]:
clean_and_store_df(df_xsum, "xsum", "summary", printing=PRINTING)
df_xsum.head()

100%|██████████| 5/5 [20:17<00:00, 243.41s/it]


store dataframe in: /mnt/hdd-baracuda/pdingfelder/Masterarbeit//results/xsum_2800_cleaned_all_v3.parquet


Unnamed: 0,id,summary,document,direct_prompt,llm_type,domain,paraphrase_polish_human,paraphrase_polish_llm,prompt_few_shot,prompt_SICO,adversarial_character_human,adversarial_character_llm,adversarial_word_human,adversarial_word_llm,adversarial_character_word_human,adversarial_character_word_llm,paraphrase_back_translation_human,paraphrase_back_translation_llm,paraphrase_dipper_human,paraphrase_dipper_llm
0,1,A former Lincolnshire Police officer carried o...,"John Edward Bates, formerly of Spalding, Linco...",Former Lincolnshire Police officer on trial fo...,ChatGPT,xsum,"John Edward Bates, previously residing in Spal...",Former Lincolnshire Police officer stands tria...,"Former Lincolnshire Police officer, John Edwar...",In a shocking revelation at Lincoln Crown Cour...,"John Edward Bates, formerly of Spalding, Linco...",Former Lincolnshire Police officer on trial fo...,"John Edwards Bates, formerly of Spalding, Linc...",Elders Lincolnshire Police officer on trial fo...,"John Edward Bates, formerly of Spaldinɡ, Linco...",Former Lincolnshire Police officer on trial fo...,"John Edward Bates is John Bates of Spalnding, ...",The former police officer of Lincoln County wa...,The prosecutor said that Mr Bates had invited ...,The jury was informed of the disturbing allega...
1,2,A man with links to a car that was involved in...,"Veronica Vanessa Chango-Alverez, 31, was kille...",Man Linked to Fatal Bus Stop Crash in South Lo...,ChatGPT,xsum,"Veronica Vanessa Chango-Alverez, a 31-year-old...",Man Connected to Fatal Bus Stop Crash in South...,Police in south London are searching for a man...,In a tragic incident that unfolded at a bus st...,"Veronica Vanessa Chango-Alverez, 31, was kille...",Max Linked to Fatal Bus Stop Crash in South Lo...,"Veronica Vanessa Chango-Alverez, 31, was kille...",Man Linked to Fatal Bus Arrests Crash in South...,"Veronica Vanessa Chango-Alverez, 31, was kil l...",Man Linked to Fatal Bus Stop Collisions in Sou...,The 31-year-old Veronica Vanessa Chango-Alvere...,Related to a deadly buses in southern London i...,The car was abandoned at the scene.Ms Chango-A...,"The incident, which occurred yesterday morning..."
2,3,Welsh cyclist Luke Rowe says changes to the sp...,Belgian cyclist Demoitie died after a collisio...,Welsh cyclist Luke Rowe has stepped forward to...,ChatGPT,xsum,Belgian cyclist Demoitie tragically lost his l...,Welsh cyclist Luke Rowe has emerged as a staun...,In the wake of the tragic death of Belgian cyc...,Welsh cyclist Luke Rowe has expressed his stro...,Belgian cyclist Demoitie died after a collisio...,Welsh cyclist Luke Roew has stepped forward to...,Belgian cyclist Demoitie perish after a collis...,Welsh cyclist Luke Rowe has stepped eagerly to...,Belgian cyclist Demoitie died after a collisio...,Welsh cyclist Luke Rowe has stepped forward to...,"Belgian cycling, Demoitie, died after a collis...","Luke Rowe, a Wales riding a bicycle, has propo...",There are a lot of motorbikes in and around th...,Rowe has raised concerns about the current saf...
3,4,The International Paralympic Committee will ma...,The IPC opened proceedings against the Nationa...,The fate of Russia's participation in the high...,ChatGPT,xsum,The International Paralympic Committee (IPC) h...,The fate of Russia's participation in the high...,The International Paralympic Committee (IPC) i...,In a crucial development regarding Russia's pa...,The IPC opened proceedings against the Nationa...,The fate of Russia's participation in the high...,The IPC opened lawsuits against the National P...,The fate of Russia's betrothal in the highly a...,The IPC opened proceedings against the Nationa...,Nova fate of Russia's participation in the hig...,IPC began a lawsuit against the Russian State ...,As the International Paralympic Games (IPC) pr...,"But Craven, who is a member of the IOC, critic...",The announcement comes amid ongoing concerns a...
4,5,The Manor Marussia team have confirmed they in...,The team went into administration in October b...,"Manor Marussia, the renowned Formula 1 team, h...",ChatGPT,xsum,"Manor Marussia, the Formula 1 team that entere...","Manor Marussia, the esteemed Formula 1 team, h...",Manor Marussia has officially announced their ...,"In an exciting announcement, the Manor Marussi...",The team wFnt into administration in cOtober b...,"Msnor Marussia, the renowned Formula 1 team, h...","The team went into authority in October but, a...","Manor Marussia, the renowned Formula 1 team, a...",The team went into administration in O ctober ...,"Manor Marussia, the renowned Formula 1 team, h...",The group entered administrative management in...,"The famous first -level Formula Team Manor, Ma...",This is a fantastic and very rewarding moment ...,Fans and enthusiasts alike are eagerly awaitin...


In [15]:
clean_and_store_df(df_review, "yelp_review", "start", printing=PRINTING)
df_review.head()

100%|██████████| 5/5 [30:26<00:00, 365.22s/it]


store dataframe in: /mnt/hdd-baracuda/pdingfelder/Masterarbeit//results/yelp_review_2800_cleaned_all_v3.parquet


Unnamed: 0,id,start,content,direct_prompt,llm_type,domain,prompt_few_shot,prompt_SICO,paraphrase_polish_human,paraphrase_polish_llm,adversarial_character_human,adversarial_character_llm,adversarial_word_human,adversarial_word_llm,adversarial_character_word_human,adversarial_character_word_llm,paraphrase_back_translation_human,paraphrase_back_translation_llm,paraphrase_dipper_human,paraphrase_dipper_llm
0,1,I don't know what Dr. Goldberg was like before...,I was going to Dr. Johnson before he left and ...,I had the misfortune of becoming a patient at ...,ChatGPT,yelp_review,I had the misfortune of scheduling an appointm...,I don't know what Dr. Goldberg was like before...,I used to be a patient of Dr. Johnson's before...,I recently had the unfortunate experience of b...,I was going to Dr. Johnson before he left and ...,I had the misfortune of becoming a patient at ...,I was going to Dr. Johnson before he left and ...,I possess the misfortune of becoming a patient...,I was gоing to Dr. Johnson before he left and ...,I had the misfortune of becoming a patient at ...,"I was going to go to Dr. Johnson to leave, and...",I have recently become a patient in Dr. Goldbu...,He is not a caring doctor. He doesn’t give pre...,From the moment I stepped into the waiting roo...
1,2,I'm writing this review to give you a heads up...,The office staff and administration are very u...,Let me start by saying that my experience with...,ChatGPT,yelp_review,I had the most unpleasant experience during my...,"Let me tell you, my experience with them was n...",I am highly disappointed with the office staff...,Smith. I must express my deep disappointment w...,The office staff and administration are very u...,I'm writing this review to igve you a heads up...,The office staff and administration are acutel...,I'm writing this review to furnished you a hea...,The office staff and administration are very u...,I'm wrting this review to give you a heads up ...,Office staff and government are very unprofess...,I am writing this comment so that you can make...,"Second, and most important, make sure your ins...",Let me start by saying that my experience with...
2,3,Owning a driving range inside the city limits ...,I don't think I ask much out of a driving rang...,Owning a driving range inside the city limits ...,ChatGPT,yelp_review,"However, this particular driving range seems t...",Owning a driving range inside the city limits ...,"I don't expect much from a driving range, real...",Having a driving range situated within city li...,I don't think I ask much out of a dHiving rang...,Owning a driving range insids the city limits ...,I don't consider I ask much out of a driving r...,Owning a driving fluctuates inside the city li...,I don't think yo ask much out of a driving ran...,Possession a driving range inside the city lim...,I don't think I ask too much within the scope ...,With a license to printed funds in the city's ...,"A decent mat, clean balls, and convenient hour...",The range offers state-of-the-art facilities f...
3,4,This place was DELICIOUS!!,My parents saw a recommendation to visit this ...,This place was DELICIOUS!! From the moment we ...,ChatGPT,yelp_review,"The moment I took my first bite, my taste buds...",This place was DELICIOUS!! I couldn't believe ...,"Based on Rick Sebak's ""25 Things I Like About ...",This establishment was absolutely delightful! ...,My parents saw a recommendation to visit this ...,This place was DELICIOiS!! From the moment we ...,My parents saw a recommendation to visit this ...,This place was PERFUMED!! From the moment we s...,My parents saw a recommendatiоn to visit this ...,This place was DELECTABLE!! From the moment we...,"My parents saw the suggestion of Rick Sebak, R...",This place is delicious! From the moment we in...,We went there today for a late lunch on Saturd...,"The menu was a riot of mouth-watering choices,..."
4,5,This place should have a lot more reviews - bu...,"nnIts been there ages, and looks it. If you're...","From the moment I stepped inside, I was enchan...",ChatGPT,yelp_review,I stumbled upon this hidden gem purely by chan...,"Seriously, finding a hidden gem like this is l...","This place has been around for ages, and it de...","From the moment I stepped through the doors, I...","nnIts been there ages, and looPs it. If you're...",This place should have a lot more reviews - bu...,"nnIts been there ages, and listens it. Though ...",These place should possess a lot more reviews ...,"nnIts been there ageѕ, and looks it. If you're...",This place should have a lot more reviews - bu...,NNITS has been a long time and looks very simi...,There should be more comments in this place -b...,"If you want a swanky ambience, don't bother. T...",The staff greeted me with genuine warmth and l...


# 3. Check cleaned sources

In [34]:
path_writing = f'{REGEX_CLEANED_FILES}/writing_prompt_2800_cleaned_all_v3.parquet'
path_abstract = f'{REGEX_CLEANED_FILES}/arxiv_2800_cleaned_all_v3.parquet'
path_review = f'{REGEX_CLEANED_FILES}/yelp_review_2800_cleaned_all_v3.parquet'
path_xsum = f'{REGEX_CLEANED_FILES}/xsum_2800_cleaned_all_v3.parquet'

df_writing_cleaned = pd.read_parquet(path_writing)
df_abstract_cleaned = pd.read_parquet(path_abstract)
df_review_cleaned = pd.read_parquet(path_review)
df_xsum_cleaned = pd.read_parquet(path_xsum)

In [35]:
for domain, _df in {"writing": df_writing_cleaned, "arxiv": df_abstract_cleaned, "yelp_review": df_review_cleaned,
                    "xsum": df_xsum_cleaned}.items():
    for column in ["direct_prompt", "prompt_few_shot", "prompt_SICO", "paraphrase_polish_llm",
                   "paraphrase_polish_human"]:
        print(f"{domain}"
              f" Analysing column:", column)
        matching_rows, non_matching_rows = analyze_df_for_specific_hints_of_llms(df_writing,
                                                                                 column_generated_text=column,
                                                                                 print_results=False,
                                                                                 print_summary_by_llm=False)

writing Analysing column: direct_prompt
Entries with typical LLM Patterns:  179
Entries without typical LLM Patterns:  2621
writing Analysing column: prompt_few_shot
Entries with typical LLM Patterns:  6
Entries without typical LLM Patterns:  2794
writing Analysing column: prompt_SICO
Entries with typical LLM Patterns:  15
Entries without typical LLM Patterns:  2785
writing Analysing column: paraphrase_polish_llm
Entries with typical LLM Patterns:  7
Entries without typical LLM Patterns:  2793
writing Analysing column: paraphrase_polish_human
Entries with typical LLM Patterns:  6
Entries without typical LLM Patterns:  2794
arxiv Analysing column: direct_prompt
Entries with typical LLM Patterns:  179
Entries without typical LLM Patterns:  2621
arxiv Analysing column: prompt_few_shot
Entries with typical LLM Patterns:  6
Entries without typical LLM Patterns:  2794
arxiv Analysing column: prompt_SICO
Entries with typical LLM Patterns:  15
Entries without typical LLM Patterns:  2785
arxiv 