# Augment the ASAG dataset
This notebook augments the ASAG dataset.

In [1]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive/')

Mounted at /content/drive/


In [2]:
# Import libraries
import pandas as pd
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel
import re

In [3]:
df = pd.read_csv('drive/MyDrive/ESL_Text_Classification/ASAG_cleaned.csv')
df = df.drop(['Unnamed: 0'], axis=1)

In [4]:
df.head()

,L1,question,answer,level,question_type,num_sentences
0,French,What are your daily habits? What time do you g...,everyday i get up at 8 a clock. I always turn ...,2,Paragraph writing,9.0
1,French,Describe your family.,"I have one sister, she is married and she has ...",2,Paragraph writing,5.0
2,French,Describe your family.,I have a mother and a father and they are stil...,2,Paragraph writing,3.0
3,French,Describe your hobbies.,I really like playing video games online with ...,2,Paragraph writing,3.0
4,French,Describe your family.,"I have a little family, i live with my father ...",2,Paragraph writing,4.0


In [5]:
df.level.value_counts()

level
3    71
2    50
4    38
5    13
Name: count, dtype: int64

Level 5 is the smallest class (13 of 172 samples), so we augment it by generating one new answer for each level 5 sample.
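
To put the imbalance in numbers, the sketch below computes each level's share of the data and its gap to the largest class. The counts are copied by hand from the `value_counts()` output above; `level_counts`, `shares` and `deficits` are illustrative names that do not appear elsewhere in this notebook.

```python
# Counts copied by hand from the value_counts() output above
level_counts = {2: 50, 3: 71, 4: 38, 5: 13}

total = sum(level_counts.values())
largest = max(level_counts.values())

# Share of the whole dataset and shortfall relative to the largest class
shares = {level: count / total for level, count in level_counts.items()}
deficits = {level: largest - count for level, count in level_counts.items()}

for level in sorted(level_counts):
    print(f"level {level}: {level_counts[level]:3d} samples, "
          f"{shares[level]:.1%} of data, {deficits[level]} short of the largest class")
```

Level 5 is 58 samples short of level 3, so a single generated answer per sample (13 extra rows) narrows the gap without closing it.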

In [6]:
# Select the level 5 samples
level_5 = df[df['level'] == 5]

In [7]:
# Load pre-trained GPT model and tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [8]:
def generate_augmented_answer(original_answer, temperature=0.7, top_p=0.9, max_repetitions=1, max_attempts=5):
    '''
    Prompts GPT-2 with the original answer and lets it continue writing in the same style.
    Top-k and nucleus (top-p) sampling are used to keep the continuation consistent in style.
    Drops the first half of the generated sentences (mostly the prompt itself) and returns
    the second half as a new paragraph, or None if no attempt yields at least three sentences.
    '''
    # Tokenize the original answer; truncate in tokens (not characters) to half the
    # context window so the model has room to generate a continuation
    input_ids = tokenizer.encode(original_answer, return_tensors="pt", max_length=tokenizer.model_max_length // 2, truncation=True)

    # Calculate the max length for generation (twice the prompt length, in tokens)
    max_length = min(input_ids.shape[1] * 2, tokenizer.model_max_length)

    try:
        # Retry a bounded number of times instead of recursing without limit
        for _ in range(max_attempts):
            # Generate text using the model with top-k sampling and nucleus sampling
            output = model.generate(input_ids, max_length=max_length, num_return_sequences=1, do_sample=True, temperature=temperature, top_p=top_p, top_k=50, pad_token_id=tokenizer.eos_token_id)

            # Decode the generated text
            augmented_answer = tokenizer.decode(output[0], skip_special_tokens=True)

            # Find sentence boundaries in the generated text
            sentences = re.split(r"(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s", augmented_answer)

            # Extract sentences from the second half of the generated text
            second_half_sentences = sentences[len(sentences) // 2:]

            # Collect unique sentences, stopping once max_repetitions repeats have been seen
            unique_sentences = set()
            kept_sentences = []
            repetitions = 0
            for sentence in second_half_sentences:
                if sentence not in unique_sentences:
                    kept_sentences.append(sentence)
                    unique_sentences.add(sentence)
                else:
                    repetitions += 1
                    if repetitions >= max_repetitions:
                        break

            # Accept the new paragraph only if it contains at least three sentences
            if len(kept_sentences) >= 3:
                return ' '.join(kept_sentences).strip()

        # No attempt produced a long enough paragraph
        return None
    except IndexError as e:
        print("Error:", e)
        return None
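
The function relies on `top_k` and `top_p` to restrict which tokens can be sampled at each step. The toy sketch below illustrates the idea on a hand-made next-token distribution. It is a simplified illustration, not the code that `transformers` runs internally; `top_k_top_p_filter` and the `toy` probabilities are made up for this example.

```python
def top_k_top_p_filter(probs, top_k=50, top_p=0.9):
    """Return {token: renormalised probability} after top-k, then nucleus filtering."""
    # Top-k step: rank tokens from most to least likely and keep at most top_k
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:top_k]

    # Nucleus step: keep the smallest prefix whose cumulative mass reaches top_p
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalise so the surviving probabilities sum to 1
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


# Hypothetical next-token distribution after a prompt such as "Then i go to"
toy = {"school": 0.5, "work": 0.3, "bed": 0.15, "Mars": 0.05}

filtered = top_k_top_p_filter(toy, top_k=3, top_p=0.9)
narrow = top_k_top_p_filter(toy, top_k=3, top_p=0.6)
print(filtered)
print(narrow)
```

Lower `top_p` values leave fewer candidate tokens, which makes the output more predictable and less varied.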

In [9]:
# Look at a sample to make sure it's working properly
sample_idx = 0
sample = df.answer[sample_idx]
print(df.answer[sample_idx])
generate_augmented_answer(sample)

everyday i get up at 8 a clock. I always turn my music on. To have a nice day. Firstly, i take my shower. In a second time I breakfast. Then i go to school. But that's depend about my timetable. I don't begin at 8h30 everday so sometimes I watch tv


"I'm a very good singer. I'm very good at singing. But I also have a lot of different things to do. I'm not a singer. I'm a really good singer. \nWhat are your biggest challenges as a singer? \nI'm a very good singer."

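In the sample output above, the last sentence repeats the first one but carries a leading newline, so an exact string comparison treats the two as different. A minimal sketch of a whitespace- and case-insensitive check follows, assuming that skipping repeats (rather than stopping at the first one) is acceptable. `dedupe_sentences` is a hypothetical helper and `generated` is a shortened copy of the output above.

```python
import re

# Same sentence-boundary pattern as in generate_augmented_answer
SENTENCE_BOUNDARY = re.compile(r"(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s")


def dedupe_sentences(text):
    """Split text into sentences and drop repeats, ignoring case and whitespace."""
    seen = set()
    kept = []
    for sentence in SENTENCE_BOUNDARY.split(text):
        cleaned = " ".join(sentence.split())  # collapse newlines and extra spaces
        key = cleaned.lower()
        if key and key not in seen:
            seen.add(key)
            kept.append(cleaned)
    return kept


generated = ("I'm a very good singer. I'm very good at singing. I'm not a singer. "
             "\nWhat are your biggest challenges as a singer? \nI'm a very good singer.")
kept = dedupe_sentences(generated)
print(kept)
```
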
In [10]:
# Apply augmentation to level 5
augmented_answers = []

for answer in level_5['answer']:
    augmented_answer = generate_augmented_answer(answer)
    augmented_answers.append(augmented_answer)

# Add augmented answers as a new column (assign returns a new DataFrame,
# which avoids pandas' SettingWithCopyWarning on the level_5 slice)
level_5 = level_5.assign(augmented_answer=augmented_answers)


In [11]:
# Append augmented samples back to the original DataFrame
df = pd.concat([df, level_5], ignore_index=True)

df.to_csv('drive/MyDrive/ESL_Text_Classification/ASAG_Augmented.csv')
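
Note that the concatenation above appends the level 5 rows with their original `answer` text; the generated text sits in a separate `augmented_answer` column, which is empty for every other row. If the generated text is meant to serve as additional level 5 examples, one option is to write it into `answer` before concatenating. The sketch below shows this on a toy frame. `build_augmented_rows` and the `is_augmented` flag are assumptions introduced for illustration, not part of the original notebook.

```python
import pandas as pd


def build_augmented_rows(source_rows, augmented_answers):
    """Return copies of source_rows whose 'answer' column holds the generated text."""
    rows = source_rows.copy()  # work on a copy, not a slice of the original frame
    rows["answer"] = list(augmented_answers)
    rows["is_augmented"] = True  # hypothetical flag to tell generated rows apart
    # generate_augmented_answer may return None; drop rows without generated text
    return rows.dropna(subset=["answer"])


# Toy data standing in for df and the generated answers
toy = pd.DataFrame({"answer": ["a", "b", "c"], "level": [5, 5, 4]})
toy_level_5 = toy[toy["level"] == 5]

new_rows = build_augmented_rows(toy_level_5, ["A new text.", None])
combined = pd.concat([toy.assign(is_augmented=False), new_rows], ignore_index=True)
print(combined)
```

Other columns copied from the source rows, such as `num_sentences`, would still describe the original answer and may need recomputing for the generated text.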