# ClearML Pipeline - Generate Essays
<img src="images/generate_essays_drawio_.png" width="1000" alt="Diagram of the essay generation pipeline">

<span style="display: inline-block;padding: 10px;background-color: #f4f3ee;border: 1px solid #FF1493;border-radius: 4px;margin-bottom: 10px;line-height: 1.5;color: #333;" class="tip"><b>NOTE: The attached dataset was created while this code was being developed. I estimate that about 90% of the 
essays were generated with these exact prompts.</b></span>

In [None]:
# ai_generated_essays = pd.read_pickle(f'{CFG.SCRATCH_PATH}/ai_generated.csv/ai_generated.pkl')
# ai_rewritten_essays = pd.read_pickle(f'{CFG.SCRATCH_PATH}/ai_rewritten.csv/ai_rewritten_essays.pkl')

# df_comb = pd.concat([ai_generated_essays, ai_rewritten_essays], axis=0)
# df_comb = df_comb.reset_index(drop=True)
# print(df_comb.head())

# df_comb.to_csv(f'{CFG.SCRATCH_PATH}/ai_generated_essays_llm_detect_kaggle.csv', index=False)


In [None]:
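%pip install clearml -q
%pip install nltk -q
# The pipeline components below also import these packages
%pip install pandas openai markdown beautifulsoup4 -q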


## **Generate**
<img src="images/generate_test_essays_clearml.png" width="1000" alt="Generate Test Essays pipeline in ClearML">

## **Framework for Generating Essays:**

A quick overview: what are some ways students might use LLMs to conceal the origin of their essays?

1. **Simple Topic Essay** 
Nothing fancy, this is for the student who just wants it done: 

> <span style="display: inline-block;padding: 5px;background-color: #f4f3ee;margin-bottom: 5px;line-height: 1.5;color: #333;" class="tip"> *Generate a quality and detailed middle or highschool essay that directly addresses the prompt:*  **+ prompt** </span>


2. **Getting Creative**
Next we need a prompt whose GPT-4 output is likely to be labeled 0 (human-written) by the baseline model, even though it is AI-generated.


> <span style="display: inline-block;padding: 5px;background-color: #f4f3ee;margin-bottom: 5px;line-height: 1.5;color: #333;" class="tip"> *Generate an essay that closely resembles a high-quality, B+ level essay written by a 8th to 12th grade high-school student. The essay should reflect a deep understanding of the topic, with coherent arguments and clear structure. However, to closely mimic human writing, include subtle imperfections typical of a student at this level. These may include - Occasional grammatical errors: Introduce minor grammatical mistakes that a student might make under exam conditions or in a final draft, such as slight misuse of commas,  or occasional awkward phrasing. - Varying sentence structure: Use a mix of simple, compound, and complex sentences, with some variation in fluency to reflect a student's developing writing style. - Personal touch: Include personal opinions, anecdotes, or hypothetical examples where appropriate, to give the essay a unique voice. - Argument depth: While the essay should be well-researched and informed, the depth of argument might not reach the sophistication of a more experienced writer. Arguments should be sound but might lack the nuance a more advanced writer would include. - Conclusion: Ensure the essay has a clear conclusion, but one that might not fully encapsulate all the complexities of the topic, as a student might struggle to tie all threads together neatly. Remember, the goal is to create a piece that balances high-quality content with the authentic imperfections of a human student writer. The essay should be on the following topic:* <b>**+ prompt**</b> </span>

3. **Rewrites Prompt**
If I were generating content for academia, I would write everything myself, including all substantive facts, and then ask for a clear rewrite. Does asking an LLM to rewrite and grammar-check a text make it AI-generated content? Strictly speaking, yes: the text comes back from the API. It may not be a 'fake' essay per se, but if you send a prompt to an LLM, no matter what the prompt is, the output is AI-generated text and should be labeled 1.

> <span style="display: inline-block;padding: 5px;background-color: #f4f3ee;margin-bottom: 5px;line-height: 1.5;color: #333;" class="tip"> *Rewrite the following student essay by enhancing its structure, vocabulary, and overall quality. It's important to keep the same tone. Do not change facts or opinons. Ensure that the content and meaning of the original essay are preserved. Keep the lengths the same within 50 words. :* **+ prompt**</span>
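
As a rough sketch of how these three instructions are used, each one is simply prepended to either the essay topic or the student's original essay. `build_prompts` is a hypothetical helper used only for illustration, and the two long instructions are abbreviated with `[...]` here rather than repeated in full.

```python
# Instruction strings taken from the prompts above; the long ones are abbreviated.
STANDARD_INSTRUCTION = (
    "Generate a quality and detailed middle or highschool essay "
    "that directly addresses the prompt: "
)
DETAILED_INSTRUCTION = (
    "Generate an essay that closely resembles a high-quality, B+ level essay "
    "[...] The essay should be on the following topic: "
)
REWRITE_INSTRUCTION = (
    "Rewrite the following student essay by enhancing its structure, "
    "vocabulary, and overall quality. [...] : "
)


def build_prompts(topic, student_essay):
    """Return the three prompt variants for one topic (hypothetical helper)."""
    return {
        "standard_prompt": STANDARD_INSTRUCTION + topic,
        "altered_prompt": DETAILED_INSTRUCTION + topic,
        # The rewrite variant is built from the student's text, not the topic.
        "rewrite_prompt": REWRITE_INSTRUCTION + student_essay,
    }


prompts = build_prompts(
    "Should schools require uniforms?",
    "I think uniforms are fine because everyone looks the same.",
)
for name, text in prompts.items():
    print(name, "->", text[:60])
```

Note that only the rewrite variant depends on an existing human-written essay, so it can only be produced for rows that already contain student text.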

<span style="display: inline-block;padding: 10px;background-color: #f4f3ee;border: 1px solid #FF1493;border-radius: 4px;margin-bottom: 10px;line-height: 1.5;color: #333;" class="tip">If you plan to execute this, input your ClearML account information or comment out the pipeline directives above each function. Note that imports are included within the functions, as ClearML constructs self-contained functions for each pipeline component.</span>
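
To see why the imports sit inside each function, here is a minimal sketch in plain Python. It does not use ClearML and is not a description of ClearML's internals; it only mimics running a function's source in a fresh namespace, which is the situation a self-contained component ends up in. The names `run_isolated`, `SELF_CONTAINED_SOURCE`, and `GLOBAL_IMPORT_SOURCE` are hypothetical.

```python
# A function that imports what it needs inside its own body.
SELF_CONTAINED_SOURCE = '''
def mean_length(texts):
    import statistics  # import travels with the function
    return statistics.mean(len(t) for t in texts)
'''

# The same function, but relying on an import made elsewhere in the notebook.
GLOBAL_IMPORT_SOURCE = '''
def mean_length(texts):
    return statistics.mean(len(t) for t in texts)
'''


def run_isolated(source, func_name, *args):
    """Execute function source in an empty namespace and call it (illustration only)."""
    namespace = {}
    exec(source, namespace)
    return namespace[func_name](*args)


print(run_isolated(SELF_CONTAINED_SOURCE, "mean_length", ["ab", "abcd"]))

try:
    run_isolated(GLOBAL_IMPORT_SOURCE, "mean_length", ["ab", "abcd"])
except NameError as error:
    # The notebook-level `import statistics` is not visible here.
    print("fails when isolated:", error)
```

The same reasoning applies to notebook-level variables: anything a component needs should be passed in as an argument or defined inside the function.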

## **ClearML Pipeline**

In [None]:
%env CLEARML_WEB_HOST=https://app.clear.ml
%env CLEARML_API_HOST=https://api.clear.ml
%env CLEARML_FILES_HOST=https://files.clear.ml
#%env CLEARML_API_ACCESS_KEY=
#%env CLEARML_API_SECRET_KEY=


import os
os.environ['OPENAI_API_KEY'] = ''

# ClearML imports
from clearml import Task, Dataset
from clearml.automation.controller import PipelineDecorator
from clearml import TaskTypes

# Model used for generation (no trailing comma, which would turn the value into a tuple)
model = "gpt-3.5-turbo-1106"
# model = "gpt-4-1106-preview"

### STEP ONE ####################
####################
####################
@PipelineDecorator.component(return_values=["unique_prompts_df"],name='Pull Unique Prompts', cache=True, task_type=TaskTypes.data_processing)
def process_csv_files(directory):
    import glob
    import pandas as pd
    import os
    
    # Get a list of CSV files in the specified directory
    csv_files = glob.glob(directory + '/*.csv')

    # Create an empty DataFrame
    df = pd.DataFrame()

    # Iterate over each CSV file
    for file in csv_files:
        if os.path.isfile(file):
            # Read the CSV file and append it to the DataFrame
            temp_df = pd.read_csv(file)
            df = pd.concat([df, temp_df])

    # Select the desired columns
    selected_columns = ['prompt_name', 'text', 'source', 'prompt', 'fold', 'label']
    df_selected = df[selected_columns]

    # Select rows where both 'prompt' and 'text' are not NaN
    selected_rows = df_selected[df_selected['prompt'].notnull() & df_selected['text'].notnull()]

    # Group the DataFrame by 'prompt' column and keep the first occurrence of each group
    unique_prompts_df = selected_rows.groupby('prompt').first().reset_index()[['prompt', 'source', 'text', 'label']]
    
    print(unique_prompts_df.columns.tolist())

    return unique_prompts_df

### STEP TWO ####################
####################
####################
@PipelineDecorator.component(return_values=["sample_prompts_df"],name='Append Instructions to Prompts', cache=True, task_type=TaskTypes.data_processing)
def append_instructions_to_prompts(unique_prompts_df):
    import pandas as pd
    import numpy as np

    ## THIS IS OUR STANDARD PROMPT FOR ESSAY GENERATION
    standard_instruction = "Generate a quality and detailed middle or highschool essay that directly addresses the prompt: "
    
    ## THIS IS THE PROMPT CREATED TO FOOL THE BASELINE MODEL
    detailed_instructions = '''Generate an essay that closely resembles a high-quality, B+ level essay written by a 8th to 12th grade 
    high-school student. The essay should reflect a deep understanding of the topic, with coherent arguments and clear structure. 
    However, to closely mimic human writing, include subtle imperfections typical of a student at this level. These may include 
    - Occasional grammatical errors: Introduce minor grammatical mistakes that a student might make under exam conditions or in a 
    final draft, such as slight misuse of commas,  or occasional awkward phrasing. 
    - Varying sentence structure: Use a mix of simple, compound, and complex sentences, with some variation in fluency to reflect a 
    student's developing writing style. 
    - Personal touch: Include personal opinions, anecdotes, or hypothetical examples where appropriate, to give the essay a unique voice. 
    - Argument depth: While the essay should be well-researched and informed, the depth of argument might not reach the sophistication of 
    a more experienced writer. Arguments should be sound but might lack the nuance a more advanced writer would include. 
    - Conclusion: Ensure the essay has a clear conclusion, but one that might not fully encapsulate all the complexities of the topic, 
    as a student might struggle to tie all threads together neatly. Remember, the goal is to create a piece that balances high-quality 
    content with the authentic imperfections of a human student writer. 
    The essay should be on the following topic:'''

    rewrite_instruction = "Rewrite the following student essay by enhancing its structure, vocabulary, and overall quality. It's important to keep the same tone. Do not change facts or opinons. Ensure that the content and meaning of the original essay are preserved. Keep the lengths the same within 50 words. : "
  
    def append_instructions(df):    
        df['standard_prompt'] = df['prompt'].apply(lambda x: standard_instruction + x)        
        df['altered_prompt'] = df['prompt'].apply(lambda x: detailed_instructions + x) 
        df['rewrite_prompt'] = df['text'].apply(lambda x: rewrite_instruction + x )    
        return df
    
    # Apply the function to your DataFrame
    sample_prompts_df = append_instructions(unique_prompts_df)
    print(sample_prompts_df.head())
    # Return the updated DataFrame
    return sample_prompts_df

 
### STEP THREE ####################
####################
####################
# Function to preprocess text
@PipelineDecorator.component(return_values=["df_essays"],
                             name='Clean Text', cache=True, task_type=TaskTypes.data_processing)
def pipeline_etl_clean_text(df):
    import logging
    import markdown
    from bs4 import BeautifulSoup
    import re
    from nltk.tokenize import word_tokenize
    from nltk.stem import WordNetLemmatizer
    import nltk
    import pandas as pd
    from pathlib import Path
    # Download the NLTK resources needed for tokenization and lemmatization
    try:
        nltk.download('punkt', quiet=True)
        nltk.download('wordnet', quiet=True)
    except Exception as e:
        logging.error(f"An error occurred while downloading NLTK packages: {e}")

    # Function to remove markdown formatting
    def remove_markdown(text):
        try:
            html = markdown.markdown(text)
            soup = BeautifulSoup(html, features="html.parser")
            return soup.get_text()
        except Exception as e:
            logging.error(f"Error in remove_markdown: {e}")
            return text

    # Function to remove 'Task' prefix from the prompt
    def remove_task_on_prompt(text):
        try:
            pattern = r'^(?:Task(?:\s*\d+)?\.?\s*)?'
            return re.sub(pattern, '', text)
        except Exception as e:
            logging.error(f"Error in remove_task_on_prompt: {e}")
            return text

    # Function to replace newline and carriage return characters
    def replace_newlines(text):
        try:
            return re.sub(r'[\n\r]+', ' ', text)
        except Exception as e:
            logging.error(f"Error in replace_newlines: {e}")
            return text

    # Function to remove extra whitespaces
    def remove_extra_whitespace(text):
        try:
            return ' '.join(text.split())
        except Exception as e:
            logging.error(f"Error in remove_extra_whitespace: {e}")
            return text

    # Function to remove punctuation except for specified characters
    def remove_punctuation_except(text, punctuation_to_retain):
        try:
            punctuation_to_remove = r'[^\w\s' + re.escape(punctuation_to_retain) + ']'
            return re.sub(punctuation_to_remove, '', text)
        except Exception as e:
            logging.error(f"Error in remove_punctuation_except: {e}")
            return text

    def remove_emojis_and_newlines(text):
        # Regex pattern for matching emojis
        emoji_pattern = re.compile("["
                            u"\U0001F600-\U0001F64F"  # emoticons
                            u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                            u"\U0001F680-\U0001F6FF"  # transport & map symbols
                            u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                            u"\U00002500-\U00002BEF"  # chinese characters
                            u"\U00002702-\U000027B0"
                            u"\U00002702-\U000027B0"
                            u"\U000024C2-\U0001F251"
                            u"\U0001f926-\U0001f937"
                            u"\U00010000-\U0010FFFF"
                            u"\u2640-\u2642"
                            u"\u2600-\u2B55"
                            u"\u200d"
                            u"\u23cf"
                            u"\u23e9"
                            u"\u231a"
                            u"\ufe0f"  # dingbats
                            u"\u3030"
                            "]+", flags=re.UNICODE)

        # Remove newline characters
        text = re.sub('\n+', ' ', text)
        # Remove emojis
        text = emoji_pattern.sub(r'', text)
        return text

    # Function to tokenize and lemmatize text
    def process_text(text):
        try:
            tokens = word_tokenize(text)
            lemmatizer = WordNetLemmatizer()
            lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens]
            return lemmatized_tokens
        except Exception as e:
            logging.error(f"Error in process_text: {e}")
            return []
    # Main preprocessing logic
    try:
        PUNCTUATION_TO_RETAIN = '.?!,'  # Punctuation characters to retain
        for index, row in df.iterrows():
            text = row['prompt']
            text = remove_markdown(text)
            text = replace_newlines(text)
            text = remove_extra_whitespace(text)
            text = remove_task_on_prompt(text)
            text = remove_punctuation_except(text, PUNCTUATION_TO_RETAIN)
            #text = remove_emojis_and_newlines(text)
            text = re.sub('\n+', '', text)
            text = re.sub(r'[A-Z]+_[A-Z]+', '', text)
            text = replace_newlines(text)
            # Remove occurrences of \n\n from the text
            # text = text.replace('\n\n', '')
            tokens = process_text(text)
            preprocessed_text = ' '.join(tokens)

            # Write the cleaned text back to the 'prompt' column
            df.at[index, 'prompt'] = preprocessed_text

        return df
    except Exception as e:
        logging.error(f"Error in pipeline_etl_clean_text: {e}")
        # Return the DataFrame (possibly only partly cleaned) so later steps still receive a DataFrame
        return df
    
### STEP FOUR ####################

@PipelineDecorator.component(return_values=["df"], name="Run Essay Generation", cache=True, task_type=TaskTypes.data_processing)
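def run_essay_generation(df, model="gpt-3.5-turbo-1106"):
    # `model` is a parameter because a self-contained component may not see notebook-level variables
    import pandas as pd
    from openai import OpenAI
    client = OpenAI()  # Reads the API key from the OPENAI_API_KEY environment variable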

    def generate_essay(client, prompt):
        response = client.chat.completions.create(
            model=model,
            #model = "gpt-4-1106-preview",
            messages=[
                {"role": "system", "content": 
                '''You are a high-school essay writer. The essays you genererate should score an A or a B. 
                 They are between 400 to 700 words long. Only output the essay. Nothing else. No headers. 
                 No tagging of any kind'''},
                
                {"role": "user", "content": prompt}
            ]
        )
        return response.choices[0].message.content
    # Process each row to generate essays for both prompts
    for index, row in df.iterrows():
        try:
            # Generate essay for the first prompt
            prompt1_essay = generate_essay(client, row['standard_prompt'])
            df.loc[index, 'standard_essay'] = prompt1_essay

            # Generate essay for the second prompt
            prompt2_essay = generate_essay(client, row['altered_prompt'])
            df.loc[index, 'altered_essay'] = prompt2_essay
            
            # Rewrite the student essay (only for Kaggle-supplied data)
            prompt3_essay = generate_essay(client, row['rewrite_prompt'])
            # Store the result in its own column instead of overwriting the prompt
            df.loc[index, 'rewrite_essay'] = prompt3_essay

            print(f"Processed essays for row {index}")
            #task.get_logger().report_text(f"Processed essays for row {index}")
            # Checkpoint: rewrite the full CSV after every row (this overwrites the file, it does not append)
            df.to_csv("scratch/gpt-3.5-turbo-1106.gen.iter.csv", index=False)

        except Exception as e:
            print(f"Error processing essays for row {index}: {e}")
            
    print("All essays processed and saved to CSV.")
    return df

@PipelineDecorator.component(name="Save Generated Essays", cache=True, task_type=TaskTypes.data_processing)
def save_generated_essays(df, new_dataset_name, description="", tags=None, file_name="dataset.pkl"):
    from pathlib import Path
    from clearml import Dataset
    import pandas as pd
    import logging

    logging.basicConfig(level=logging.DEBUG)
    try:
        print(df.head())
        # Pickle the DataFrame locally, then register the file as a ClearML dataset
        file_path = Path(file_name)
        pd.to_pickle(df, file_path)
        new_dataset = Dataset.create(dataset_project='LLM-detect-ai-gen-text-LIVE/datasets', dataset_name=new_dataset_name)
        new_dataset.add_files(str(file_path))
        if description:
            new_dataset.set_description(description)
        if tags:
            new_dataset.add_tags(tags)
        new_dataset.upload()
        new_dataset.finalize()
        print(f"New dataset '{new_dataset_name}' uploaded and finalized with description and tags.")
    except Exception as e:
        # Log at error level so a failed upload is not reported as a success
        logging.error(f"Failed to save dataset '{new_dataset_name}': {e}")
    
@PipelineDecorator.pipeline(name="Generate Test Essays", project="LLM-detect-ai-gen-text-LIVE/dev/generate-training-essays")
def executing_pipeline(directory, id=1):

    import pandas as pd

    # STEP ONE - Get unique prompts
    print("launch step one")
    unique_prompts = process_csv_files(directory)
    unique_prompts = unique_prompts.sample(6)  ## FOR TESTING ONLY

    # STEP TWO - Clean prompts
    print("launch step two")
    unique_prompts = pipeline_etl_clean_text(unique_prompts)

    # STEP THREE - Append instructions to the prompts
    print("launch step three")
    generated_prompts = append_instructions_to_prompts(unique_prompts)

    # STEP FOUR - Run the essay generation
    print("launch step four")
    generated_essays = run_essay_generation(generated_prompts.head())  # head() keeps at most the first five rows
    
    ## Save the generated essays
    print("launch step five")
    save_generated_essays(generated_essays, "train_essays", "GPT Generated Essays for training data", ["training", "unprocessed"], "generated_training_essays.pkl")
    
    generated_essays.to_csv(f"data/gpt-3.5-turbo-1106-{id}.csv", index=False)


if __name__ == "__main__":
    directory = 'data'
    PipelineDecorator.run_locally()
    executing_pipeline(directory, id=55)
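
The regex-based part of the Clean Text step can be tried on a single string without ClearML or NLTK. The sketch below is a simplified stand-in, not the pipeline component itself: `clean_prompt` is a hypothetical helper that applies the same patterns in the same order but skips the markdown stripping and the lemmatization, and adds a final whitespace collapse in place of the tokenize-and-join step.

```python
import re

PUNCTUATION_TO_RETAIN = '.?!,'  # same characters the pipeline keeps


def clean_prompt(text, punctuation_to_retain=PUNCTUATION_TO_RETAIN):
    """Apply the regex cleaning steps to one prompt (hypothetical helper)."""
    # Replace newlines and carriage returns with spaces
    text = re.sub(r'[\n\r]+', ' ', text)
    # Collapse runs of whitespace
    text = ' '.join(text.split())
    # Drop a leading "Task", "Task 1." or similar prefix
    text = re.sub(r'^(?:Task(?:\s*\d+)?\.?\s*)?', '', text)
    # Remove punctuation except the retained characters
    text = re.sub(r'[^\w\s' + re.escape(punctuation_to_retain) + ']', '', text)
    # Remove upper-case placeholders such as STUDENT_NAME
    text = re.sub(r'[A-Z]+_[A-Z]+', '', text)
    # Collapse any whitespace left behind by the removals
    return ' '.join(text.split())


raw = "Task 2. Explain why (in your view) school uniforms matter; use examples!\r\nBe specific."
print(clean_prompt(raw))
```

Running the helper on a few real prompts before launching the pipeline is a cheap way to check that the patterns do not strip more than intended.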


## OpenAI Assistant API **(unused)**

I did not use this in the end, but I love what the Assistant API is evolving into. The number of potential use cases is staggering.

In [None]:
# from openai import OpenAI

# # Initialize OpenAI Client
# client = OpenAI(api_key='')

# # Configuration settings
# CFG = { 
#     "MODEL": "gpt-3.5-turbo-1106",
#     "MAX_TOKENS": 500  
# }

# # Function to create or retrieve an assistant
# def create_assistant(client, model):
#     # Create a new assistant
#     assistant = client.beta.assistants.create(
#         model=model,
#         name="Essay Generation Assistant v.001",
#         description="An assistant for testing purposes",
#         instructions="Generate a 500 to 1000 word essay based on the prompt. The essay should score an A or a B"
#     )
#     return assistant

# # Function to post a message and retrieve the result
# def post_message_and_get_response(client, assistant_id, prompt):
#     # Create a new thread
#     thread = client.beta.threads.create()

#     # Send the prompt message to the thread
#     client.beta.threads.messages.create(
#         thread_id=thread.id,
#         role="user",
#         content=prompt
#     )

#     # Create a run for the thread
#     run = client.beta.threads.runs.create(
#         thread_id=thread.id,
#         assistant_id=assistant_id
#     )

#     # Wait for run to complete (may need to adjust based on API response time)
#     while True:
#         run_status = client.beta.threads.runs.retrieve(thread_id=thread.id, run_id=run.id)
#         if run_status.status in ['completed', 'failed']:
#             break

#     # Retrieve the response
#     messages = client.beta.threads.messages.list(thread_id=thread.id)
#     response_message = next((message for message in messages.data if message.role == 'assistant'), None)
#     return response_message.content if response_message else "No response received."

# # Function to generate essay based on sample prompt
# def generate_essay(sample_prompt):
#     # Create an assistant
#     assistant = create_assistant(client, CFG["MODEL"])

#     # Post the message and get the response
#     response = post_message_and_get_response(client, assistant.id, sample_prompt)

#     return response

# # Example usage
# # Sample prompt
# sample_prompt = 'Generate a quality and detailed middle or highschool essay that directly addresses the prompt: Research and evaluate the benefits of students attending school from home Examine how students with anxiety or depression may benefit from attending school from home Consider the impact that drama and rumors may have on student performance in a traditional school setting Analyze the potential advantages of distance learning on the student body Develop an argument in support of students attending school from home The essay should be well-structured, clear, and concise.' # Use a specific prompt string for testing
# essay = generate_essay(sample_prompt)
# print("Generated Essay:", essay)
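
The wait loop in the commented-out cell above polls the run status without pausing and never gives up. A minimal sketch of a gentler pattern follows, assuming only that some callable returns the current status as a string. `poll_until` is a hypothetical helper, and the status values in the example are fake stand-ins rather than real API responses.

```python
import time


def poll_until(fetch_status, terminal_states, interval=1.0, timeout=60.0, sleep=time.sleep):
    """Call fetch_status() until it returns a terminal state or the timeout passes."""
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        if status in terminal_states:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"status still {status!r} after {timeout} seconds")
        # Pause between checks instead of spinning
        sleep(interval)


# Example with a fake status source standing in for an API call.
_fake_statuses = iter(["queued", "in_progress", "completed"])
final_status = poll_until(lambda: next(_fake_statuses), {"completed", "failed"}, interval=0.01)
print(final_status)
```

With the client from the cell above, `fetch_status` could wrap the existing `client.beta.threads.runs.retrieve(thread_id=thread.id, run_id=run.id).status` call.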


In [None]:
# import pandas as pd
# from openai import OpenAI
# import time
# import logging

# # Initialize logging
# logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

# # Initialize OpenAI Client
# client = OpenAI()  # Reads the API key from the OPENAI_API_KEY environment variable

# # Configuration settings
# CFG = {
#     "MODEL": "gpt-3.5-turbo-1106",
#     "MAX_TOKENS_PER_ESSAY": 500,
#     "ESSAYS_PER_RUN": 1
# }

# # Load your DataFrame with prompts
# df = sample_prompts_df.head(1) ## use ours from above 

# # print(df['standard_prompt'].iloc[0])

# # Function to create or retrieve an assistant
# def get_or_create_assistant(client, model, get_premade_assistant, assistant_id=None):
#     if get_premade_assistant and assistant_id:
#         # Retrieve existing assistant
#         assistant = client.beta.assistants.retrieve(assistant_id)
#     else:
#         # Create a new assistant
#         assistant = client.beta.assistants.create(
#             model=model,
#             name="Essay Generation Assistant v.001",
#             description="An assistant to generate essays",
#             instructions="Generate a middle or high school student essay. Limit the essay to 1,000 tokens. The essay should score an A or a B"
#         )
#     return assistant

# # Function to create or retrieve a thread
# def get_or_create_thread(client, get_previous_thread, thread_id=None):
#     if get_previous_thread and thread_id:
#         # Retrieve existing thread
#         thread = client.beta.threads.retrieve(thread_id)
#     else:
#         # Create a new thread
#         thread = client.beta.threads.create()
#     return thread

# # Set flags for whether to use premade assistant and thread
# get_premade_assistant = False
# get_previous_thread = False
# assistant_id_to_use = ""  
# thread_id_to_use = "" 

# # Get or create assistant
# assistant = get_or_create_assistant(client, CFG["MODEL"], get_premade_assistant, assistant_id_to_use)

# # Process essays with limit based on ESSAYS_PER_RUN
# for index, row in df.head(CFG["ESSAYS_PER_RUN"]).iterrows():
#     try:
#         # Get or create thread
#         thread = get_or_create_thread(client, get_previous_thread, thread_id_to_use)

#         # Send standard prompt message to the thread
#         client.beta.threads.messages.create(
#             thread_id=thread.id,
#             role="user",
#             content=row['standard_prompt']
#         )

#         # Wait for the response
#         while True:
#             messages = client.beta.threads.messages.list(thread_id=thread.id)
#             if len(messages.data) > 1:
#                 break
#             time.sleep(1)  # Adjust sleep time as needed

#         # Retrieve and store the standard essay response
#         standard_essay_message = next((message for message in messages.data[::-1] if message.role == 'assistant'), None)
#         df.at[index, 'standard_essay'] = standard_essay_message.content if standard_essay_message else ''

#         # Send altered prompt message to the thread
#         client.beta.threads.messages.create(
#             thread_id=thread.id,
#             role="user",
#             content=row['altered_prompt']
#         )

#         # Wait for the response
#         while True:
#             messages = client.beta.threads.messages.list(thread_id=thread.id)
#             if len(messages.data) > 2:
#                 break
#             time.sleep(1)  # Adjust sleep time as needed

#         # Retrieve and store the altered essay response
#         altered_essay_message = next((message for message in messages.data[::-1] if message.role == 'assistant'), None)
#         df.at[index, 'altered_essay'] = altered_essay_message.content if altered_essay_message else ''

#         logging.info(f"Processed essay for prompt at index {index}")

#     except Exception as e:
#         logging.error(f"Error processing essay for prompt at index {index}: {e}")

# # Save DataFrame to CSV
# df.to_csv("generated_essays.csv", index=False)
# logging.info("All essays processed and saved to CSV.")
