# Balance the PELIC Dataset
<p>This notebook cleans and balances the augmented PELIC dataset.</p>

In [84]:
# Import libraries
import pandas as pd
import re
import spacy

In [85]:
nlp = spacy.load('en_core_web_sm')

In [2]:
df = pd.read_csv('../data/PELIC_augmented.csv')
df = df.drop(['Unnamed: 0'], axis=1)

In [3]:
df.head()

Unnamed: 0,level,L1,question_type,question,answer,num_sentences,is_augmented
0,4,Arabic,Paragraph writing,Write a paragraph about a relatioship that is...,I met my friend Nife while I was studying in a...,12.0,0
1,4,Thai,Paragraph writing,Write a paragraph about a relatioship that is...,"Ten years ago, I met a women on the train betw...",10.0,0
2,4,Turkish,Paragraph writing,"In five sentences or less, give instructions o...",In my country we usually don't use tea bags. F...,5.0,0
3,4,Korean,Paragraph writing,"In five sentences or less, give instructions o...","First, prepare a port, loose tea, and cup.\nSe...",5.0,0
4,4,Korean,Paragraph writing,"In five sentences or less, give instructions o...","First, prepare your cup, loose tea or bag tea,...",4.0,0


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13948 entries, 0 to 13947
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   level          13948 non-null  int64  
 1   L1             13948 non-null  object 
 2   question_type  13948 non-null  object 
 3   question       13948 non-null  object 
 4   answer         13948 non-null  object 
 5   num_sentences  13948 non-null  float64
 6   is_augmented   13948 non-null  int64  
dtypes: float64(1), int64(2), object(4)
memory usage: 762.9+ KB


## Remove repeating sentences from augmented answers
<p>The function that used GPT2 to augment the level 2 answers did a great job, but it wasn't perfect. Some of the augmented answers contain sentences that repeat multiple times. Before balancing the dataset, we're going to remove repeating sentences from the augmented answers.</p>
<p>First, let's take a look at the longest augmented answers to get examples of the ones that have repeating sentences.</p>

In [5]:
# Filter the df for augmented answers and arrange them in descending order by length
augmented = df[df['is_augmented'] == 1].copy()
augmented['length'] = augmented['answer'].apply(len)
augmented_sorted = augmented.sort_values(by='length', ascending=False)
augmented_sorted.head(10)

Unnamed: 0,level,L1,question_type,question,answer,num_sentences,is_augmented,length
13852,2,Korean,Paragraph writing,Write about a city you visited in the past. U...,I visited Las Vegas August this year.I went to...,10.0,1,3341
13935,2,Chinese,Paragraph writing,"Write a paragraph and help me ""see"" your favor...",There are two other places in the Shedd Aquari...,12.0,1,3332
13908,2,Arabic,Paragraph writing,Write a paragraph about your life here in Pitt...,So my life changed completely. ---------------...,7.0,1,3249
13900,2,Vietnamese,Paragraph writing,Think about a person you know. Write about hi...,I have a sweet friend.She is a nurse.She often...,5.0,1,3100
13863,2,Arabic,Paragraph writing,"Someone gave the ELI $200,000. You are the le...",We are going to buy new chairs.The chairs are ...,12.0,1,3088
13899,2,Vietnamese,Paragraph writing,Think about a person you know. Write about hi...,I have a sweet friend.She is a nurse.She often...,5.0,1,3055
13915,2,Arabic,Paragraph writing,Write a paragraph about your life here in Pitt...,"Also, in my country I was visiting my big fami...",8.0,1,3033
13922,2,Chinese,Paragraph writing,Write a paragraph about your life here in Pitt...,The fruits in the U.S.A are more expensive tha...,12.0,1,2729
13871,2,Mongol,Paragraph writing,What did you do this morning? Write a paragra...,Third I didn't find my belt because my apartme...,11.0,1,2616
13901,2,Vietnamese,Paragraph writing,Think about a person you know. Write about hi...,I have a sweet friend.She is a nurse.She often...,5.0,1,2520


Let's test a function that removes repeating sentences using Regex.

In [6]:
sample_idx = 13852
sample = augmented_sorted.answer[sample_idx]
sample

"I visited Las Vegas August this year.I went to their with my family.I walked with my family and took many good pictures in their.I saw the fountains water show in Las Vegas.The water show played with quiet music.That was very beautiful show,I like that very much.I ate some good food in their.There were many gambling houses in Las Vegas.But I'm 19,so I couldn't gamble.My family and I were very happy at that time.I took my children to the movies.I had a great time.I was very happy with my family and I enjoyed the movie.I went to the movies and I enjoyed the movie too.I was very happy with my family and I enjoyed the movie too.I was very happy with my family and I enjoyed the movie too.I went to the movie and I enjoyed the movie too.I went to the movie and I enjoyed the movie too.I went to the movie and I enjoyed the movie too.I went to the movie and I enjoyed the movie too.I went to the movie and I enjoyed the movie too.I went to the movie and I enjoyed the movie too.I went to the movie

In [7]:
def split_into_sentences(text):
    # Match runs of characters that end in '.', '!' or '?'.
    # Note: every period counts as a boundary (so "U.S.A" is split), and any
    # trailing text without end punctuation is dropped.
    sentences = re.findall(r'[^.!?]+[.!?]', text)
    return sentences

def remove_repeating_sentence(text):
    # Split the text into sentences with the regex above
    sentences = split_into_sentences(text)

    # Keep track of every sentence seen so far, in order of first appearance
    unique_sentences = []

    # Iterate through the sentences
    for sentence in sentences:
        # Keep the sentence only if an identical one has not been kept already
        if sentence not in unique_sentences:
            unique_sentences.append(sentence)

    # Join the unique sentences into a single string
    cleaned_text = ' '.join(unique_sentences)
    return cleaned_text.strip()

In [8]:
remove_repeating_sentence(sample)

"I visited Las Vegas August this year. I went to their with my family. I walked with my family and took many good pictures in their. I saw the fountains water show in Las Vegas. The water show played with quiet music. That was very beautiful show,I like that very much. I ate some good food in their. There were many gambling houses in Las Vegas. But I'm 19,so I couldn't gamble. My family and I were very happy at that time. I took my children to the movies. I had a great time. I was very happy with my family and I enjoyed the movie. I went to the movies and I enjoyed the movie too. I was very happy with my family and I enjoyed the movie too. I went to the movie and I enjoyed the movie too."

It looks like this works well, so I'm going to apply it to the augmented answers in my dataframe.

In [9]:
# Apply the function to rows where "is_augmented" is equal to 1
df.loc[df['is_augmented'] == 1, 'answer'] = df[df['is_augmented'] == 1]['answer'].apply(remove_repeating_sentence)

Let's take a look at the longest augmented answers again and see if that worked.

In [10]:
augmented = df[df['is_augmented'] == 1].copy()
augmented['length'] = augmented['answer'].apply(len)
augmented_sorted = augmented.sort_values(by='length', ascending=False)
augmented_sorted.head(10)

Unnamed: 0,level,L1,question_type,question,answer,num_sentences,is_augmented,length
13935,2,Chinese,Paragraph writing,"Write a paragraph and help me ""see"" your favor...",There are two other places in the Shedd Aquari...,12.0,1,1308
13922,2,Chinese,Paragraph writing,Write a paragraph about your life here in Pitt...,The fruits in the U. S. A are more expensive t...,12.0,1,1185
13879,2,Arabic,Paragraph writing,Write a paragraph about your favorite vacation...,"I went to the hotel again on April 13, we went...",4.0,1,1153
13887,2,Mongol,Paragraph writing,Write a paragraph about your favorite vacation...,It was my first trip to another country. I wi...,7.0,1,1019
13940,2,Vietnamese,Paragraph writing,Which of the human achievements in Unit 7/2 co...,What's the difference between a public-private...,4.0,1,1005
13894,2,Vietnamese,Paragraph writing,Write a paragraph about your favorite vacation...,Posted by N. A. K. at 8:20 PM\nI'm not sure i...,3.0,1,847
13852,2,Korean,Paragraph writing,Write about a city you visited in the past. U...,I visited Las Vegas August this year. I went t...,10.0,1,695
13916,2,Chinese,Paragraph writing,Write a paragraph about your life here in Pitt...,"Then, I liked to eat fruit before, but I can't...",12.0,1,683
13892,2,Arabic,Paragraph writing,Write a paragraph about your favorite vacation...,"The other day, I was on a bus in the morning, ...",5.0,1,656
13900,2,Vietnamese,Paragraph writing,Think about a person you know. Write about hi...,I have a sweet friend. She is a nurse. She oft...,5.0,1,650


In [11]:
sample_idx = 13935
sample = augmented_sorted.answer[sample_idx]
sample

'There are two other places in the Shedd Aquarium that have lots of interesting and interesting things to do.  There is the Shedd Aquarium and the Garden Center, which have many activities and programs that you could not find in the Shedd Aquarium Store.  There are also some restaurants that have lots of different kinds of seafood, and they have a lot of activities and programs that you could not find in the Shedd Aquarium Store.  There are also several restaurants that offer some different types of seafood, and they also have a lot of different kinds of fish and plants from ocean like Tropical, Octopus and Anemones.  \nThere are two other places that have a very interesting and interesting place to eat, which is the Shedd Aquarium.  They have a lot of different kinds of seafood and plants.  They have a lot of different kinds of fish and plants from ocean like Tropical, Octopus and Anemones.  There are also a lot of other places that have lots of different kinds of fish and plants from

In [12]:
sample_idx = 13922
sample = augmented_sorted.answer[sample_idx]
sample

'The fruits in the U. S. A are more expensive than in Taiwan, especially bananas.  You could spend only 3 dollars for a bunch of bananas in my country, but the price of the same bananas cost me around 6 dollars here.  Finally, I think these conversions of life habits influence me or other foreign people very much, but I believe these habits will become the best of my experiences living abroad.  \nMy first trip to Taiwan in May of 2011 was in the mountains, so I was a little nervous about going there.  I was going to go to a restaurant that I liked, and there was a girl who was very kind and nice and always nice to me.  I asked her for a taxi to the restaurant, and she said I would pay for her taxi to the hotel.  So I was like, "I\'ll wait, please don\'t ask me too much. " I said, "I can\'t wait to see you. " She said, "You\'re going to the airport?  You\'re going to a hotel, and then you\'re going to a place where you can get a ride to the airport?  You can go there and see me? " I sai

<p>Although it's not perfect, it's done a pretty good job. The longest paragraph now has only a few repetitive sentences (they aren't exact repeats), and at 1,308 characters it is less than half as long as the longest paragraph was before the function was applied (3,341 characters).</p>
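The near-repeats that survive exact matching could be caught with a fuzzy comparison. The sketch below is not part of the original pipeline: the function name `remove_near_duplicate_sentences` and the `threshold` value of 0.9 are assumptions chosen for illustration, and it uses `difflib.SequenceMatcher` from the standard library to score how similar each sentence is to the ones already kept.

```python
import re
from difflib import SequenceMatcher

def remove_near_duplicate_sentences(text, threshold=0.9):
    """Drop sentences that closely match a sentence already kept.

    `threshold` is a hypothetical cut-off on SequenceMatcher's 0-1 ratio;
    it would need tuning on real data.
    """
    # Same sentence regex as above, but strip surrounding whitespace
    sentences = [s.strip() for s in re.findall(r'[^.!?]+[.!?]', text)]
    kept = []
    for sentence in sentences:
        # Compare case-insensitively against every sentence kept so far
        is_repeat = any(
            SequenceMatcher(None, sentence.lower(), prev.lower()).ratio() >= threshold
            for prev in kept
        )
        if not is_repeat:
            kept.append(sentence)
    return ' '.join(kept)

demo = ("I went to the movie and I enjoyed the movie too."
        "I went to the movies and I enjoyed the movie too."
        "I had a great time.")
cleaned = remove_near_duplicate_sentences(demo)
print(cleaned)
```

A lower threshold removes more sentences but risks discarding ones that are genuinely different, so it is worth inspecting a sample of the output before applying this to the whole dataframe.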
<p>We now need to add the length and the number of sentences to the level 2 data.</p>

In [83]:
level_2_selected = df[df.level == 2]

In [86]:
# Define a function to add the number of sentences per text
def num_sentences(df):
    # Work on a copy so the caller's DataFrame is not modified
    # (this also avoids the SettingWithCopyWarning)
    df = df.copy()
    # Iterate over rows in the DataFrame
    for index, row in df.iterrows():
        # Process the answer text with spaCy
        doc = nlp(row['answer'])
        # Count the sentences spaCy detects in the answer
        sentence_count = sum(1 for _ in doc.sents)
        # Store the count in the DataFrame
        df.loc[index, 'num_sentences'] = sentence_count
    return df

In [92]:
level_2_selected = num_sentences(level_2_selected)

In [93]:
level_2_selected['length'] = level_2_selected['answer'].apply(len)

## Balance the dataset by level

First, I want to add a column to this dataframe that indicates which dataset the data belongs to. This will be useful later when it's merged with other datasets.

In [16]:
df['dataset'] = 'PELIC'

In [17]:
df.dataset

0        PELIC
1        PELIC
2        PELIC
3        PELIC
4        PELIC
         ...  
13943    PELIC
13944    PELIC
13945    PELIC
13946    PELIC
13947    PELIC
Name: dataset, Length: 13948, dtype: object

In [18]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13948 entries, 0 to 13947
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   level          13948 non-null  int64  
 1   L1             13948 non-null  object 
 2   question_type  13948 non-null  object 
 3   question       13948 non-null  object 
 4   answer         13948 non-null  object 
 5   num_sentences  13948 non-null  float64
 6   is_augmented   13948 non-null  int64  
 7   dataset        13948 non-null  object 
dtypes: float64(1), int64(2), object(5)
memory usage: 871.9+ KB


In [19]:
df.level.value_counts()

level
4    6072
5    3928
3    3750
2     198
Name: count, dtype: int64

<p>Since the augmentation, the number of level 2 paragraphs has doubled; however, there are far more level 3, 4, and 5 paragraphs. For the model to be trained properly, the dataset will have to be balanced. Instead of randomly trimming paragraphs from each level, we're going to see if we can maintain a balance of questions. This way, we can maintain a variety of answers, and avoid a high number of paragraphs that answer the same question. Mainly, we don't want to end up randomly throwing away a whole set of answers to one of the questions.</p>
<p>There are also several answers in the dataframe that consist of more than one paragraph, even though the question type is "paragraph writing." This is simply due to the way that the data was collected and tagged. To maintain consistency with the other datasets being used for this project, we want to select only answers that contain one paragraph.</p>
<p>To meet these requirements, we're going to filter out the answers that contain more than one paragraph, and sort the remaining answers in descending order of length. Sorting by length keeps a greater volume of text data for training the model (the shortest answers won't provide enough input). Then, we're going to pass the answers through a function that iterates from the longest to the shortest answer, and adds an answer to a new dataframe only if its question is not yet in the new dataframe. This way, we'll get the longest paragraphs, and no repeating questions.</p>
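The selection described above can also be expressed with built-in pandas operations. This is a minimal sketch on made-up data, not the code used in this notebook: the `toy` dataframe and the helper name `longest_answer_per_question` are hypothetical, and it assumes that "more than one paragraph" means the answer contains a newline.

```python
import pandas as pd

# Made-up data for illustration only
toy = pd.DataFrame({
    'question': ['Q1', 'Q1', 'Q2', 'Q3', 'Q2'],
    'answer': ['a long answer here', 'short', 'two lines\nof text',
               'medium answer', 'another answer!'],
})
toy['length'] = toy['answer'].str.len()

def longest_answer_per_question(df, num_answers):
    # Drop answers with more than one paragraph (assumed: they contain a newline)
    single_paragraph = df[~df['answer'].str.contains('\n', regex=False)]
    # Longest answers first
    ranked = single_paragraph.sort_values(by='length', ascending=False)
    # Keep the first (longest) answer for each question, then take the top n
    return (ranked.drop_duplicates(subset='question', keep='first')
                  .head(num_answers)
                  .reset_index(drop=True))

selected = longest_answer_per_question(toy, 2)
print(selected)
```

Because no rows are assigned one at a time into an empty dataframe, this variant keeps the original column dtypes, whereas row-by-row assignment tends to leave every column as `object` (as the later `merged_df.info()` output shows).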

In [62]:
# Split the df by level, drop multi-paragraph answers (those containing a newline), and sort by length, longest first
level_3 = df[df.level == 3].copy()
level_3['length'] = level_3['answer'].apply(len)
level_3_sorted = level_3[~level_3['answer'].str.contains('\n')].sort_values(by='length', ascending=False)

level_4 = df[df.level == 4].copy()
level_4['length'] = level_4['answer'].apply(len)
level_4_sorted = level_4[~level_4['answer'].str.contains('\n')].sort_values(by='length', ascending=False)

level_5 = df[df.level == 5].copy()
level_5['length'] = level_5['answer'].apply(len)
level_5_sorted = level_5[~level_5['answer'].str.contains('\n')].sort_values(by='length', ascending=False)

In [63]:
# Define the function to select the answers we want
def select_unique_rows(df, num_answers):
    '''
    Iterates through the rows of the dataframe (assumed to be sorted)
    and appends each row to a new dataframe
    only if its question is not already in the new dataframe.
    Stops once num_answers rows have been collected.
    '''
    # Create an empty DataFrame to store unique rows
    unique_df = pd.DataFrame(columns=df.columns)

    # Iterate through the DataFrame
    for index, row in df.iterrows():
        # Check if the question is already in unique_df
        if row['question'] not in unique_df['question'].values:
            # Append the row to unique_df
            unique_df.loc[len(unique_df)] = row
            # Check if unique_df has reached the desired length
            if len(unique_df) == num_answers:
                break

    return unique_df

In [64]:
# Run the function on the level 3 data
level_3_selected = select_unique_rows(level_3_sorted, 198)

In [65]:
level_3_selected.head()

Unnamed: 0,level,L1,question_type,question,answer,num_sentences,is_augmented,dataset,length
0,3,Turkish,Paragraph writing,Write a paragraph explaining the information f...,Each year the number of the internet user is i...,21.0,0,PELIC,1954
1,3,Arabic,Paragraph writing,Make revisions to your essay as needed and sub...,The English Language Institute (ELI) is a good...,27.0,0,PELIC,1941
2,3,Korean,Paragraph writing,Autobiography: Write a paragraph telling about...,I was born October first nineteen seventy-eigh...,32.0,0,PELIC,1780
3,3,Arabic,Paragraph writing,You have been asked to write an article for th...,The ELI is a good please for study foreign la...,25.0,0,PELIC,1771
4,3,Arabic,Paragraph writing,"Use time order words like: first, next, later...",We have two holidays we celebrate in my countr...,24.0,0,PELIC,1723


In [76]:
len(level_3_selected.question.unique())

198

In [66]:
# Display the longest answer
level_3_selected.answer[0]

"Each year the number of the internet user is increasing incredibly. As you know today internet is an important part of the our life and it is required at school, at home, at work almost everywhere. It is really necessary to do something such as homework, bank transactions, business meeting and so on. At the same time, internet is the one of the important and efficient communication medium such as e-mail and MSN. Today each person has a computer and each person can use internet. Because it ensures benefit about time and cost. In addiction, many goverments have started to set up the new internet based projects such as e- goverment. If we look up the result of the last research about internet use rates by age of head of household, we can easily see growth in the internet user rate. According to the graph, there is at least 10 percent growth in the rate of internet user who are less than 35 in between 1998 and 2001 and there is a steady decrease between 2001 and 2002. And they are more ac

In [69]:
# Run the function on the level 4 data
level_4_selected = select_unique_rows(level_4_sorted, 198)
level_4_selected.head()

Unnamed: 0,level,L1,question_type,question,answer,num_sentences,is_augmented,dataset,length
0,4,Korean,Paragraph writing,Write a short summary of what you learned abou...,Everyone wants to be healthy and tries to keep...,42.0,0,PELIC,4454
1,4,Japanese,Paragraph writing,Make your corrections and submit your revised ...,There are four steps to make your weekend tri...,22.0,0,PELIC,2317
2,4,Japanese,Paragraph writing,Choose one of the situations on page 132 or 13...,I have three suggestions about your big confus...,17.0,0,PELIC,2118
3,4,Japanese,Paragraph writing,Write a paragraph about your week or weekend p...,How did you spend your three days' off? Some o...,27.0,0,PELIC,2100
4,4,Korean,Paragraph writing,Do pp.81-2 in your books. There are 10 steps ...,"These days, we always buy things with our mone...",26.0,0,PELIC,2005


In [77]:
len(level_4_selected.question.unique())

198

In [70]:
# Display the longest answer
level_4_selected.answer[0]

"Everyone wants to be healthy and tries to keep their weight properly. However, we already know it is not easy for us to maintain our body healthy. People are struggling to make their body healthy by their own ways. Fortunately, there is a trustful health planner, Pyramid. According to the food pyramid, there are two key factors to be healthy. One is nutrition and the other is a regular exercise. More specifically, the food guide pyramid can be classified by seven groups: grain, vegetables, fruits, milk, meat & beans, oils, discretionary calories, and physical activity. First, any food made from wheat, rice, oats cornmeal etc is a grain product. It can be divided into two subgroups, whole grains and refined grains. Whole grains contain the entire grain kernel such as the bran, germ, and endosperm. For example, there are whole-wheat flour, bulgur, oatmeal, whole cornmeal, and brown rice. On the other hand, refined grains have been milled, a process that removes the bran and germ. White 

In [71]:
# Run the function on the level 5 data
level_5_selected = select_unique_rows(level_5_sorted, 198)
level_5_selected.head()

Unnamed: 0,level,L1,question_type,question,answer,num_sentences,is_augmented,dataset,length
0,5,Arabic,Paragraph writing,Write ONE PARAGRAPH about a movie you have see...,The movie that I have seen was about a lady h...,40.0,0,PELIC,3865
1,5,Arabic,Paragraph writing,above,Media is a double edged sword because it ...,29.0,0,PELIC,3020
2,5,Japanese,Paragraph writing,Make a generalization about a person that you ...,"She is an introvert person, but she is not sh...",31.0,0,PELIC,2802
3,5,Arabic,Paragraph writing,Write a narrative paragraph. You can write a ...,(Indent)There are many events you participate...,26.0,0,PELIC,2328
4,5,Arabic,Paragraph writing,Revise the narrative paragraph you wrote for A...,(Indent)There are many events you participate...,24.0,0,PELIC,2317


In [78]:
len(level_5_selected.question.unique())

198

In [72]:
# Display the longest answer
level_5_selected.answer[0]

" The movie that I have seen was about a lady her name is Olivia. She was living with her half- sister ANON_NAME_0, who was older than Olivia. The events started when Olivia and ANON_NAME_0 were having dinner with Olivia's father, who got married another woman after his wife who Olivia's mother died. After that he had a daughter her name is Sally. During the dinner they were talking and the little girl whose name is sally was talking with Olivia, but from her responds was obvious that Olivia was jalousie of sally, because she was living with her father. One day while Olivia was at her school, where she spends a lot of time, received a call which was from ANON_NAME_0. She told her that her father and his wife were killed in a car accident, which made her felt unconscious. Suddenly, Olivia found herself responsible for half - sister; her father's lawyer told her that. She refused that and so ANON_NAME_0 did because Olivia was 21 year old she just knew how to take care herself, but she wa

### Merge all dataframes into a single dataframe

In [94]:
merged_df = pd.concat([level_2_selected, level_3_selected, level_4_selected, level_5_selected])

In [95]:
merged_df.head()

Unnamed: 0,level,L1,question_type,question,answer,num_sentences,is_augmented,dataset,length
651,2,Arabic,Paragraph writing,Write about a city you visited in the past. U...,Beautiful City\nI visited Beirut because it is...,17.0,0,PELIC,592
715,2,Korean,Paragraph writing,Write about a city you visited in the past. U...,"2 mounth ago, I visited in Toronto in Canada w...",5.0,0,PELIC,156
717,2,Korean,Paragraph writing,Write about a city you visited in the past. U...,I went to Busan last summer vacation with my f...,10.0,0,PELIC,621
720,2,Korean,Paragraph writing,Write about a city you visited in the past. U...,I visited Las Vegas August this year.I went to...,10.0,0,PELIC,415
729,2,Korean,Paragraph writing,Write about a city you visited in the past. U...,"When i went to high school, i went to BuSan in...",8.0,0,PELIC,315


In [96]:
merged_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 792 entries, 651 to 197
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   level          792 non-null    object
 1   L1             792 non-null    object
 2   question_type  792 non-null    object
 3   question       792 non-null    object
 4   answer         792 non-null    object
 5   num_sentences  792 non-null    object
 6   is_augmented   792 non-null    object
 7   dataset        792 non-null    object
 8   length         792 non-null    object
dtypes: object(9)
memory usage: 61.9+ KB


In [97]:
merged_df.level.value_counts()

level
2    198
3    198
4    198
5    198
Name: count, dtype: int64

In [98]:
merged_df.to_csv('../data/PELIC_balanced.csv')