# Augment the PELIC Dataset
This notebook augments the level 2 paragraph answers, of which there are only 99. For each level 2 paragraph, GPT-2 continues the text in the same style as the original. The result (original plus continuation) is split into sentences, and the second half is kept as a new paragraph.

In [1]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive/')

Mounted at /content/drive/


In [2]:
# Import libraries
import pandas as pd
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel
import re

In [3]:
df = pd.read_csv('drive/MyDrive/ESL_Text_Classification/PELIC_cleaned.csv')

In [4]:
# Filter rows where level equals 2
level_2 = df[df['level'] == 2]

In [5]:
level_2.head()

,Unnamed: 0,level,L1,question_type,question,answer,num_sentences
651,1508,2,Arabic,Paragraph writing,Write about a city you visited in the past. U...,Beautiful City\nI visited Beirut because it is...,17.0
715,1603,2,Korean,Paragraph writing,Write about a city you visited in the past. U...,"2 mounth ago, I visited in Toronto in Canada w...",5.0
717,1605,2,Korean,Paragraph writing,Write about a city you visited in the past. U...,I went to Busan last summer vacation with my f...,10.0
720,1618,2,Korean,Paragraph writing,Write about a city you visited in the past. U...,I visited Las Vegas August this year.I went to...,10.0
729,1632,2,Korean,Paragraph writing,Write about a city you visited in the past. U...,"When i went to high school, i went to BuSan in...",8.0


In [6]:
level_2.info()

<class 'pandas.core.frame.DataFrame'>
Index: 99 entries, 651 to 4222
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Unnamed: 0     99 non-null     int64  
 1   level          99 non-null     int64  
 2   L1             99 non-null     object 
 3   question_type  99 non-null     object 
 4   question       99 non-null     object 
 5   answer         99 non-null     object 
 6   num_sentences  99 non-null     float64
dtypes: float64(1), int64(2), object(4)
memory usage: 6.2+ KB
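
To see how under-represented one level is relative to the others, the rows per level can be counted. This is a minimal sketch on a toy frame: only the `level` and `answer` column names come from the dataset, and the rows themselves are invented for illustration.

```python
import pandas as pd

# Toy stand-in for PELIC_cleaned.csv; these rows are invented for illustration
toy = pd.DataFrame({
    "level": [2, 3, 3, 4, 4, 4, 5, 5],
    "answer": ["a", "b", "c", "d", "e", "f", "g", "h"],
})

# Count rows per proficiency level, ordered by level
counts = toy["level"].value_counts().sort_index()
print(counts)

# The level with the fewest rows is the candidate for augmentation
smallest_level = counts.idxmin()
print("Smallest class:", smallest_level)
```

Running the same two lines on `df` instead of `toy` would show where level 2 stands among the real class sizes.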


In [7]:
# Load pre-trained GPT model and tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

In [11]:
def generate_augmented_answer(original_answer, temperature=0.7, top_p=0.9, max_repetitions=1):
    '''
    Continues a paragraph with GPT-2 in the same style as the original paragraph.
    Top-k and nucleus sampling with a moderate temperature keep the continuation close in style.
    Splits the result (original plus continuation) into sentences and returns the second half,
    stopping at a repeated sentence, as a new paragraph.
    '''
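    # Tokenize the original answer, truncating the prompt to half of GPT-2's
    # context window so there is room left to generate a continuation
    input_ids = tokenizer.encode(original_answer, return_tensors="pt", max_length=tokenizer.model_max_length // 2, truncation=True)

    # Calculate the max length for generation in tokens (twice the prompt length);
    # len(original_answer) would count characters, not tokens
    max_length = min(input_ids.shape[1] * 2, tokenizer.model_max_length)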

    try:
        # Generate text using the model with top-k sampling and nucleus sampling
        output = model.generate(input_ids, max_length=max_length, num_return_sequences=1, do_sample=True, temperature=temperature, top_p=top_p, top_k=50, pad_token_id=tokenizer.eos_token_id)

        # Decode the generated text
        augmented_answer = tokenizer.decode(output[0], skip_special_tokens=True)

        # Find sentence boundaries in the generated text
        sentences = re.split(r"(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s", augmented_answer)

        # Calculate the midpoint index
        midpoint_index = len(sentences) // 2

        # Extract sentences from the second half of the generated text
        second_half_sentences = sentences[midpoint_index:]

        # Keep track of unique sentences to detect repetition
        unique_sentences = set()

        # Construct the new paragraph, avoiding repetition
        new_paragraph = ''
        repetitions = 0
        for sentence in second_half_sentences:
            if sentence not in unique_sentences:
                new_paragraph += sentence + ' '
                unique_sentences.add(sentence)
            else:
                repetitions += 1
                if repetitions >= max_repetitions:
                    break

        # split('.') yields at least three pieces once the text contains two periods,
        # so this check accepts paragraphs of roughly two or more sentences
        if len(new_paragraph.split('.')) < 3:
            # Too short: regenerate (recursive retry with no attempt limit)
            return generate_augmented_answer(original_answer, temperature, top_p, max_repetitions)
        else:
            return new_paragraph.strip()
    except IndexError as e:
        print("Error:", e)
        return None
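
The `temperature`, `top_k` and `top_p` arguments passed to `model.generate` shape which tokens can be sampled at each step. The sketch below illustrates the idea on a plain list of logits. It is a simplified illustration, not the `transformers` implementation, and the function name and the five-token vocabulary are hypothetical.

```python
import math

def filter_next_token_probs(logits, temperature=0.7, top_k=50, top_p=0.9):
    """Return {token_index: probability} after temperature, top-k and top-p filtering."""
    # Temperature below 1 sharpens the distribution; above 1 flattens it
    scaled = [x / temperature for x in logits]

    # Softmax over the scaled logits (subtract the max for numerical stability)
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-k: keep only the k most probable tokens, then renormalise
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    mass_k = sum(probs[i] for i in ranked)

    # Top-p (nucleus): keep the smallest prefix whose cumulative mass reaches top_p
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i] / mass_k
        if cumulative >= top_p:
            break

    # Renormalise so the surviving probabilities sum to 1
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}

# Hypothetical logits for a five-token vocabulary
filtered = filter_next_token_probs([2.0, 1.0, 0.5, -1.0, -3.0], top_k=3, top_p=0.9)
print(filtered)
```

With these toy logits, only the two most likely tokens survive, which is why a low temperature combined with a tight nucleus tends to produce conservative continuations.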

In [13]:
sample_idx = 0
sample = df.answer[sample_idx]
print(sample)
generate_augmented_answer(sample)

I met my friend Nife while I was studying in a middle school. I was happy when I met him because he was a good student in our school. We continued the middle and high school to gather in the same school. We were studying in the different classes in the middle school; however, in the high school we were studying in the same class. We went to many places in the free time while we were studying in the high school. When we finished from the high school, I went to K.S University and he went to I.M University. While we were enjoying in academic life, we made many achievement in these universities. I graduated when Nife was studying in the last semester in the university. After that, I got a job. Fortunately, it was nearby my home. I worked two years then I got scholarship from ministry of high education in my country. When I came here to U.S, my friend Nife arrange some documents to study at grad school in Malaysia.


'When we finished from the high school, I went to K.S University and he went to I.M University. While we were enjoying in academic life, we made many achievement in these universities. I graduated when Nife was studying in the last semester in the university. After that, I got a job. However, I did not work in the university in the last semester in the university. I was happy when I went to college. When I went to college, I was at school in the same place.'
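
The output above starts midway through the original paragraph because the function keeps only the second half of the sentences and stops at a repeated sentence. The post-processing can be tried without loading GPT-2. The sketch below reuses the notebook's sentence-boundary regex; the helper name and the sample text are hypothetical.

```python
import re

# Same sentence-boundary pattern as in generate_augmented_answer
SENTENCE_BOUNDARY = re.compile(r"(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s")

def second_half_without_repeats(text, max_repetitions=1):
    """Keep the second half of the sentences, stopping once repeats reach the limit."""
    sentences = SENTENCE_BOUNDARY.split(text)
    second_half = sentences[len(sentences) // 2:]

    seen, kept, repetitions = set(), [], 0
    for sentence in second_half:
        if sentence not in seen:
            seen.add(sentence)
            kept.append(sentence)
        else:
            repetitions += 1
            if repetitions >= max_repetitions:
                break
    return " ".join(kept)

# Invented text standing in for a decoded GPT-2 output
no_repeat = "I went to Busan. We ate sushi. It was fun. We took pictures. It was fun. We went home."
with_repeat = "I went to Busan. We ate sushi. We took pictures. It was fun. It was fun. We went home."

print(second_half_without_repeats(no_repeat))
print(second_half_without_repeats(with_repeat))
```

In the second call the repeated sentence falls inside the second half, so everything after the first repeat is dropped. This is one way the function can end up returning very short paragraphs.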

In [14]:
# Apply augmentation to the level 2 rows
augmented_answers = []

for answer in level_2['answer']:
    augmented_answer = generate_augmented_answer(answer)
    augmented_answers.append(augmented_answer)

# Add augmented answers as a new column; assign() returns a new DataFrame,
# which avoids pandas' SettingWithCopyWarning on the filtered slice
level_2 = level_2.assign(augmented_answer=augmented_answers)


In [15]:
# Append augmented samples back to the original DataFrame
df = pd.concat([df, level_2], ignore_index=True)

df.to_csv('drive/MyDrive/ESL_Text_Classification/PELIC_Augmented.csv')

Check some of the augmented answers to see how similar they are to the original answers:

In [36]:
level_2.head(10)

,Unnamed: 0,level,L1,question_type,question,answer,num_sentences,augmented_answer
651,1508,2,Arabic,Paragraph writing,Write about a city you visited in the past. U...,Beautiful City\nI visited Beirut because it is...,17.0,She is a singer. I met my friend.
715,1603,2,Korean,Paragraph writing,Write about a city you visited in the past. U...,"2 mounth ago, I visited in Toronto in Canada w...",5.0,The water was so cold. \nSo I went to Niagara ...
717,1605,2,Korean,Paragraph writing,Write about a city you visited in the past. U...,I went to Busan last summer vacation with my f...,10.0,it was a shot period of trip but i was feel li...
720,1618,2,Korean,Paragraph writing,Write about a city you visited in the past. U...,I visited Las Vegas August this year.I went to...,10.0,I visited Las Vegas August this year.I went to...
729,1632,2,Korean,Paragraph writing,Write about a city you visited in the past. U...,"When i went to high school, i went to BuSan in...",8.0,I met him again and he said that I could work ...
730,1635,2,Korean,Paragraph writing,Write about a city you visited in the past. U...,"2 mounth ago, I visited in Toronto in Canada w...",16.0,We were very happy with them. I don`t remember...
743,1669,2,Chinese,Paragraph writing,Write about a city you visited in the past. U...,I visited LA in summer vacation of 1993.\nThis...,11.0,"Afterword, I went to Disney land with my fathe..."
744,1670,2,Arabic,Paragraph writing,Write about a city you visited in the past. U...,I went to New York three month ago. I saw a so...,9.0,I was a good student. I went to the World Cup.
748,1680,2,Arabic,Paragraph writing,Write about a city you visited in the past. U...,In the weekend I traveled to New Yourk city. W...,9.0,I do not want to leave New Yourk city. I want ...
753,1713,2,Arabic,Paragraph writing,Write about a city you visited in the past. U...,i visited michagan last year with my friends. ...,8.0,and it was good. so i went to pittsburgh.


In [31]:
print('Original answer: ', level_2.answer[651])
print('\nAugmented answer: ', level_2.augmented_answer[651])

Original answer:  Beautiful City
I visited Beirut because it is beautiful and historic. It is capital of Lebanon. I visited Beirut last year. I lived 10 days there. I traveled with my family. We traveled by bus. We visited same my family and my friends. We ate Lebanon food and drank. We went to visit an old Mosque(The mosque is 900 years old). We saw historic a home. We played snowball in the mountain. We went to visit an old big carve and museum. We went to show please. We shopped and bought clothes and gifts. I rode horse. I played with my friends. My family and I enjoyed because it is beautiful city.

Augmented answer:  She is a singer. I met my friend.


In [32]:
print('Original answer: ', level_2.answer[715])
print('\nAugmented answer: ', level_2.augmented_answer[715])

Original answer:  2 mounth ago, I visited in Toronto in Canada with my friends. We saw nightview in Toronto. It was a eautifull. And next day, we went to Niagara falls. There

Augmented answer:  The water was so cold. 
So I went to Niagara Falls. 
I was so cold.


In [33]:
print('Original answer: ', level_2.answer[717])
print('\nAugmented answer: ', level_2.augmented_answer[717])

Original answer:  I went to Busan last summer vacation with my friends.I played in the water in Hae woon dea beach and then i took pictures. after i played, i went to Ja gal ci market and ate sushi for dinner.next day, i went to Busan museum admisstion was very high but relics was little.
i saw gwangan large bridge at night.night view was very wanderful.and i got on aboard a ship that i was looking around hae woon to oruck do. i saw many islets. it was very wonderful. it was a shot period of trip but i was feel like it was along trip and i became a close friend,So we promoted friendship advance a friendly feeling.
It was good trip~

Augmented answer:  it was a shot period of trip but i was feel like it was along trip and i became a close friend,So we promoted friendship advance a friendly feeling. It was good trip~
(S) - A guy named Gwen from Gwanghwan (Gwanghwan) had a great trip in Busan.He was very happy about the trip. He was really happy about this trip and he wanted to give his f

In [34]:
print('Original answer: ', level_2.answer[720])
print('\nAugmented answer: ', level_2.augmented_answer[720])

Original answer:  I visited Las Vegas August this year.I went to their with my family.I walked with my family and took many good pictures in their.I saw the fountains water show in Las Vegas.The water show played with quiet music.That was very beautiful show,I like that very much.I ate some good food in their.There were many gambling houses in Las Vegas.But I'm 19,so I couldn't gamble.My family and I were very happy at that time.

Augmented answer:  I visited Las Vegas August this year.I went to their with my family.I walked with my family and took many good pictures in their.I saw the fountains water show in Las Vegas.The water show played with quiet music.That was very beautiful show,I like that very much.I ate some good food in their.There were many gambling houses in Las Vegas.But I'm 19,so I couldn't gamble.My family and I were very happy at that time.I took my children to the movies.I had a great time.I was very happy with my family and I enjoyed the movie.I went to the movies an

In [35]:
print('Original answer: ', level_2.answer[729])
print('\nAugmented answer: ', level_2.augmented_answer[729])

Original answer:  When i went to high school, i went to BuSan in korea with my sister. I went to BuSan by drove a car. My sister was good driver. I read a book while we went to in BuSan. I thought about my work. Then i decided to met him because he said me to help with my work. I was so happy after his words. So i got another work.

Augmented answer:  I met him again and he said that I could work for him. I went to BuSan and got a job. He told me to get a job with him.


In [37]:
print('Original answer: ', level_2.answer[730])
print('\nAugmented answer: ', level_2.augmented_answer[730])

Original answer:  2 mounth ago, I visited in Toronto in Canada with my friends. We saw nightview in Toronto. It was a beautifull. And next day, we went to Niagara falls. We heard that if you see the Niagara Falls, you have to go Canada side. So, we went to Canada Side, and We could see wonderfull falls. We took picture, and drove around by a car. Last, We entered the casino. I didn`t have a chance to visited casino. Sometime, We won, but usually we lost money. It was a good experience to us. It was fun. 
 After that, We invited Buffalo by Min-kyu`s friend. We had a great time in Buffalo. At night, we moved in Pittburgh. This trip was very tired, but we looked many things and studied Canada.

Augmented answer:  We were very happy with them. I don`t remember the hotel, but we had a great time there.


In [38]:
print('Original answer: ', level_2.answer[743])
print('\nAugmented answer: ', level_2.augmented_answer[743])

Original answer:  I visited LA in summer vacation of 1993.
This is my fist time to take airplan, so
I was very happy ! My uncle lives here for a long time. I stayed this city for five days . During this week , I went to many interesting places . I like this city a lot . Afterword , I went to Disney land with my father , we played inside all day and night .I bought some gifts and
ate food in here . My dream came ture . 
Until today , I still remember that I have been to Disney land . I talked by myself , I will come back USA again . Finally , I am studying in USA .

Augmented answer:  Afterword, I went to Disney land with my father, we played inside all day and night.I bought some gifts and
ate food in here. My dream came ture. 
Until today, I still remember that I have been to Disney land. I talked by myself, I will come back USA again. Finally, I am studying in USA. I can enjoy my vacation here
Now I will go back to my home in LA, my family and my friends will be here, and I will enjo

In [39]:
print('Original answer: ', level_2.answer[744])
print('\nAugmented answer: ', level_2.augmented_answer[744])

Original answer:  I went to New York three month ago. I saw a soccer match in stadium with my friend. I wore jeans and at shirt. I drank coffee and ate dinner in time acquire. I visited statue of liberty. I walked in Central park. I watched move in New York. I cooked dinner with my friend. I shopped in New York.

Augmented answer:  I was a good student. I went to the World Cup.


In [40]:
print('Original answer: ', level_2.answer[748])
print('\nAugmented answer: ', level_2.augmented_answer[748])

Original answer:  In the weekend I traveled to New Yourk city. When I arrived to New Yourk city I go to restaurant. I lived in hotel. Next day I went to library. After that I watched the movie. After the movie I ate noodles and I drank water. I slept at 11:00 P.M. I wok up at 9 : 00A.M. I wrote some words for my father. I saw the some beautiful skyscraper in New Yourk city. I like this city.

Augmented answer:  I do not want to leave New Yourk city. I want to do my best. I am not a coward.


In [41]:
print('Original answer: ', level_2.answer[753])
print('\nAugmented answer: ', level_2.augmented_answer[753])

Original answer:  i visited michagan last year with my friends. we saw a lot of good places. then we went to park and we played and we looked around to see how is it beatiful. after that we took pictures of each other. then, we went to restaurant and we ate arabic food. after that, i danced to arabic music too. i tried to stay long in there but i dont have money enough, so we came back to pittsburgh after i traveled around the US

Augmented answer:  and it was good. so i went to pittsburgh.
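
The manual comparison above could be complemented with a rough numeric score. The sketch below uses the standard library's `difflib.SequenceMatcher` for a character-level similarity ratio and a substring check to flag augmented answers that simply copy the original. The helper names are hypothetical, and the `copied` string is constructed for illustration from shortened excerpts of the answers shown above.

```python
from difflib import SequenceMatcher

def similarity(original, augmented):
    """Character-level similarity ratio in [0, 1]; 1.0 means identical strings."""
    return SequenceMatcher(None, original, augmented).ratio()

def contains_original(original, augmented):
    """True when the augmented text includes the whole original verbatim."""
    return original.strip() in augmented

# Shortened excerpts for illustration
original = "I visited Las Vegas August this year.I went to their with my family."
copied = original + "I had a great time."
fresh = "I was a good student. I went to the World Cup."

for name, candidate in [("copied", copied), ("fresh", fresh)]:
    score = round(similarity(original, candidate), 2)
    print(name, score, contains_original(original, candidate))
```

Applied row by row to `answer` and `augmented_answer`, a check like this might help spot rows where the augmented text is mostly the original, as in the Las Vegas example.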


Move the augmented answers to the answer column, and add a column to the dataframe that indicates whether the answer in that row is augmented.

In [42]:
# Step 1: Replace text in 'answer' column with text from 'augmented_answer' column
df['answer'] = df['augmented_answer'].fillna(df['answer'])

# Step 2: Add a column indicating original (0) or augmented (1) answer
df['is_augmented'] = df['augmented_answer'].notnull().astype(int)

# Step 3: Drop the 'augmented_answer' column
df.drop(columns=['augmented_answer'], inplace=True)
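
The three steps above can be traced on a toy frame. This is a minimal sketch with invented rows: the last row plays the role of an appended level 2 copy that carries a generated paragraph, and `fillna` does the replacement.

```python
import pandas as pd

# Toy frame: two original rows plus one appended copy with a generated answer
toy = pd.DataFrame({
    "answer": ["original A", "original B", "original B"],
    "augmented_answer": [None, None, "generated B"],
})

# Flag rows that carry a generated paragraph (1) versus original rows (0)
toy["is_augmented"] = toy["augmented_answer"].notnull().astype(int)

# Use the generated text where it exists, otherwise keep the original answer
toy["answer"] = toy["augmented_answer"].fillna(toy["answer"])

# The helper column is no longer needed
toy = toy.drop(columns=["augmented_answer"])
print(toy)
```

Note that a row whose generation returned `None` would keep its original text and be flagged 0, so it would remain in the data as a duplicate of the original row.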

In [45]:
df[df.is_augmented == 0].head()

,Unnamed: 0,level,L1,question_type,question,answer,num_sentences,is_augmented
0,0,4,Arabic,Paragraph writing,Write a paragraph about a relatioship that is...,I met my friend Nife while I was studying in a...,12.0,0
1,1,4,Thai,Paragraph writing,Write a paragraph about a relatioship that is...,"Ten years ago, I met a women on the train betw...",10.0,0
2,2,4,Turkish,Paragraph writing,"In five sentences or less, give instructions o...",In my country we usually don't use tea bags. F...,5.0,0
3,4,4,Korean,Paragraph writing,"In five sentences or less, give instructions o...","First, prepare a port, loose tea, and cup.\nSe...",5.0,0
4,6,4,Korean,Paragraph writing,"In five sentences or less, give instructions o...","First, prepare your cup, loose tea or bag tea,...",4.0,0


In [44]:
df[df.is_augmented == 1].head()

,Unnamed: 0,level,L1,question_type,question,answer,num_sentences,is_augmented
13849,1508,2,Arabic,Paragraph writing,Write about a city you visited in the past. U...,She is a singer. I met my friend.,17.0,1
13850,1603,2,Korean,Paragraph writing,Write about a city you visited in the past. U...,The water was so cold. \nSo I went to Niagara ...,5.0,1
13851,1605,2,Korean,Paragraph writing,Write about a city you visited in the past. U...,it was a shot period of trip but i was feel li...,10.0,1
13852,1618,2,Korean,Paragraph writing,Write about a city you visited in the past. U...,I visited Las Vegas August this year.I went to...,10.0,1
13853,1632,2,Korean,Paragraph writing,Write about a city you visited in the past. U...,I met him again and he said that I could work ...,8.0,1


In [46]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13948 entries, 0 to 13947
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Unnamed: 0     13948 non-null  int64  
 1   level          13948 non-null  int64  
 2   L1             13948 non-null  object 
 3   question_type  13948 non-null  object 
 4   question       13948 non-null  object 
 5   answer         13948 non-null  object 
 6   num_sentences  13948 non-null  float64
 7   is_augmented   13948 non-null  int64  
dtypes: float64(1), int64(3), object(4)
memory usage: 871.9+ KB


In [47]:
df = df.drop(['Unnamed: 0'], axis=1)

In [50]:
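# index=False keeps the row index out of the file, so no new 'Unnamed: 0' column appears on reload
df.to_csv('drive/MyDrive/ESL_Text_Classification/PELIC_augmented.csv', index=False)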