# Compile and Clean the PELIC Dataset
<p>This notebook loads, compiles, and cleans the PELIC dataset.</p>
<p>The dataset is found here: <a href="https://github.com/ELI-Data-Mining-Group/PELIC-dataset/">https://github.com/ELI-Data-Mining-Group/PELIC-dataset/</a></p>

In [116]:
# Import libraries
import pandas as pd
import spacy

In [36]:
# I'm going to be using this to process the data later on, so I'll load it now.
nlp = spacy.load('en_core_web_sm')

In [18]:
# Define the path to the corpus files and load the CSV tables
path = '../data/PELIC-dataset/corpus_files/'
question = pd.read_csv(path + 'question.csv')
answer = pd.read_csv(path + 'answer.csv')
student_info = pd.read_csv(path + 'student_information.csv')
course = pd.read_csv(path + 'course.csv')

In [19]:
# Merge the DataFrames on 'question_id', 'anon_id', and 'course_id'
merged_df = pd.merge(answer, question, on='question_id', how='left')
merged_df.rename(columns={'stem': 'question'}, inplace=True)
merged_df = pd.merge(merged_df, student_info, on='anon_id', how='left')
merged_df.rename(columns={'native_language': 'L1'}, inplace=True)
merged_df = pd.merge(merged_df, course, on='course_id', how='left')
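
A left merge can silently duplicate rows if a key is repeated in the right-hand table, and it leaves NaN values where a key has no match. The following is a minimal sketch of how to check for both, using toy frames as stand-ins for `answer` and `question` (the values are made up for illustration). `validate` and `indicator` are standard `pd.merge` parameters.

```python
import pandas as pd

# Toy stand-ins for the answer and question tables (values are made up)
answer_toy = pd.DataFrame({'answer_id': [1, 2, 3], 'question_id': [10, 10, 99]})
question_toy = pd.DataFrame({'question_id': [10, 11],
                             'stem': ['Describe your city.', 'Write about a trip.']})

# validate='many_to_one' raises MergeError if question_id repeats in question_toy;
# indicator=True adds a '_merge' column showing where each row found a match
checked = pd.merge(answer_toy, question_toy, on='question_id', how='left',
                   validate='many_to_one', indicator=True)

# Rows tagged 'left_only' had no matching question
unmatched = checked[checked['_merge'] == 'left_only']
print(len(checked), len(unmatched))
```

The same two arguments could be added to each of the three merges above if the row count after merging ever looks suspicious.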

The dataset contains several question types, but I'm only interested in the paragraphs. To filter for them, I first need to map the question type IDs to readable labels.

In [20]:
# Mapping dictionary for question type
question_type_mapping = {
    1: 'Paragraph writing',
    2: 'Short answer',
    3: 'Multiple choice',
    4: 'Essay',
    5: 'Fill-in-the-blank',
    6: 'Sentence completion',
    7: 'Word bank',
    8: 'Chart',
    9: 'Word selection',
    10: 'Audio recording'
}

# Create the new 'question_type' column by mapping 'question_type_id' using the mapping dictionary
merged_df['question_type'] = merged_df['question_type_id'].map(question_type_mapping)
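
`Series.map` with a dictionary returns NaN for any ID that is not a key in the dictionary, so it can be worth checking for unmapped IDs. A small sketch on toy data (the mapping and IDs here are shortened, hypothetical examples, not the real table):

```python
import pandas as pd

# Hypothetical, shortened mapping and IDs, for illustration only
toy_mapping = {1: 'Paragraph writing', 2: 'Short answer'}
toy = pd.DataFrame({'question_type_id': [1, 2, 2, 7]})
toy['question_type'] = toy['question_type_id'].map(toy_mapping)

# IDs that are missing from the dictionary come back as NaN
unmapped_ids = toy.loc[toy['question_type'].isna(), 'question_type_id'].unique()
print(unmapped_ids)
```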

For the purposes of this project, I'm only interested in each student's level and first language (L1), the question type, the question, and the answer. I'll keep only these columns in my DataFrame.

In [22]:
# Specify the column names you want to include in the new DataFrame
desired_columns = ['level_id','L1','question_type','question','text']
# Create the new DataFrame with only the desired columns
df = merged_df[desired_columns].copy()
# Rename the columns
df = df.rename(columns={'level_id': 'level', 'text': 'answer'})

In [25]:
df.question_type.value_counts()

question_type
Short answer           22808
Paragraph writing      17037
Essay                   3496
Fill-in-the-blank       2036
Sentence completion      450
Word bank                247
Audio recording            1
Name: count, dtype: int64

For the purposes of this project, I'm only interested in paragraphs.

In [27]:
# Keep only paragraphs in the DataFrame
df = df[df.question_type == "Paragraph writing"]

In [40]:
# Drop NA values
df = df.dropna()

In [42]:
# Check if there are any answers that contain only whitespace
df.loc[df['answer'].str.isspace()].index

Index([], dtype='int64')
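
Note that `str.isspace()` returns False for an empty string, so a completely empty answer would not be caught by the check above. A minimal sketch of a slightly broader check, on made-up toy answers:

```python
import pandas as pd

# Toy answers (made up): one real answer, one of spaces, one empty, one of newline/tab
toy = pd.DataFrame({'answer': ['A full answer.', '   ', '', '\n\t']})

# isspace() flags whitespace-only strings but not the empty string
whitespace_only = toy['answer'].str.isspace()

# Stripping first and comparing to '' flags both cases
blank = toy['answer'].str.strip().eq('')

print(whitespace_only.tolist())
print(blank.tolist())
```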

In [47]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 16858 entries, 0 to 46203
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   level          16858 non-null  int64 
 1   L1             16858 non-null  object
 2   question_type  16858 non-null  object
 3   question       16858 non-null  object
 4   answer         16858 non-null  object
dtypes: int64(1), object(4)
memory usage: 790.2+ KB


Upon inspection of the answer column, some answers are still not full paragraphs (e.g., they contain only one or two words). I'm going to keep only the answers that contain at least 3 sentences.

In [117]:
# Define a function that adds the number of sentences in each answer
def num_sentences(df):
    # Work on a copy of the DataFrame to avoid the SettingWithCopyWarning
    df = df.copy()
    # Iterate over rows in the DataFrame
    for index, row in df.iterrows():
        # Process the answer text with spaCy
        doc = nlp(row['answer'])
        # Count the sentences that spaCy detects in the answer
        sentence_count = sum(1 for _ in doc.sents)
        # Store the count in the 'num_sentences' column
        df.loc[index, 'num_sentences'] = sentence_count
    return df

In [49]:
# Apply the function that adds a column to the df with the number of sentences
df = num_sentences(df)
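
Since this step is slow, one possible speed-up is to stream the texts through spaCy's `nlp.pipe`, which processes them in batches instead of one call per row. The sketch below is an alternative, not what this notebook ran: `add_sentence_counts` is a hypothetical helper name, the `batch_size` of 64 is an arbitrary choice, and the resulting column is integer rather than float.

```python
import pandas as pd

def add_sentence_counts(df, nlp, text_col='answer', batch_size=64):
    """Return a copy of df with an integer 'num_sentences' column."""
    out = df.copy()
    # nlp.pipe yields one Doc per text, in the same order as the input
    docs = nlp.pipe(out[text_col].tolist(), batch_size=batch_size)
    # Count the sentences in each Doc and assign the counts by position
    out['num_sentences'] = [sum(1 for _ in doc.sents) for doc in docs]
    return out

# Usage with the pipeline loaded earlier in this notebook:
# df = add_sentence_counts(df, nlp)
```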

In [57]:
# Keep only answers that contain at least 3 sentences
df = df[df.num_sentences >= 3]

In [59]:
# Check to make sure that there are no answers with only 1 or 2 sentences
df.num_sentences.unique()

array([ 12.,  10.,   5.,   4.,   7.,   6.,   3.,  13.,  11.,  15.,   9.,
        17.,  19.,  27.,  29.,  21.,  16.,   8.,  31.,  25.,  32.,  14.,
        20.,  18.,  26.,  39.,  30.,  24.,  28.,  23.,  85.,  40.,  22.,
        43.,  53.,  36.,  33.,  44.,  34.,  61.,  35.,  38.,  49.,  41.,
        46.,  51.,  52.,  48.,  45.,  42., 103.,  65.,  74.,  50.,  37.,
        64.,  66.,  78.,  75.,  69.,  71.,  70.,  80.,  67.,  60.,  54.,
        57.,  56.,  76.,  47.,  58., 110.,  93.,  73., 104.,  72., 119.,
        79.,  98.,  81., 105.,  59.,  77.,  86.])

In [64]:
# Check to see how many sentences the majority of the answers have
df.num_sentences.value_counts()

num_sentences
7.0     1322
6.0     1322
8.0     1267
5.0     1165
9.0     1126
        ... 
78.0       1
69.0       1
60.0       1
67.0       1
86.0       1
Name: count, Length: 84, dtype: int64

In [67]:
# Get a random sample of answers just to see what they look like
print(df.answer.sample(n=50, random_state=42))

31736    Two weeks ago, I went to The Mall at Robinson ...
18946    1.I've bought an IPOD for 1 month.\n2.I've liv...
41528     When I was a child I grew up with my parents ...
4225     I've already hung the balloons, but I haven't ...
6478     During the high school, we always shared toget...
5682     Writhing5T\n5/20/07\nTopic: What is a good par...
22979    Get The Knots Out\nAccording to Heather gilles...
17607     George Walker Bush was born in New Haven, Con...
24757    First time when my grandmother had a stroke he...
43257    On February 20th , 2012, I and my friends made...
23494    EX1) Sweden government nurtured free enterpris...
37123                               The green hand     ...
1191     Dear Lan:\n\n Do you remember last time you as...
11265    My kitchen is very small space in my house. As...
14414    My apartment caught on fire at the night of No...
26440     Think about vacation will be exciding about m...
25554    Lose an accent could be occured by stopping of.

In [68]:
# Check to see how much data we have left
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 13849 entries, 0 to 46203
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   level          13849 non-null  int64  
 1   L1             13849 non-null  object 
 2   question_type  13849 non-null  object 
 3   question       13849 non-null  object 
 4   answer         13849 non-null  object 
 5   num_sentences  13849 non-null  float64
dtypes: float64(1), int64(1), object(4)
memory usage: 757.4+ KB


In [61]:
# We need the dataset to be balanced. Let's see how much data we have for each level
df.level.value_counts()

level
4    6072
5    3928
3    3750
2      99
Name: count, dtype: int64
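
To put the imbalance in numbers, the counts above can be turned into percentages and a largest-to-smallest ratio. This is just a quick sketch that re-enters the printed counts by hand; on the real data, `df.level.value_counts(normalize=True)` would give the shares directly.

```python
import pandas as pd

# Counts copied from the value_counts() output above
level_counts = pd.Series({4: 6072, 5: 3928, 3: 3750, 2: 99}, name='count')

# Share of each level, as a percentage of all remaining answers
level_share = (level_counts / level_counts.sum() * 100).round(1)
print(level_share)

# Ratio between the largest and the smallest class
imbalance_ratio = level_counts.max() / level_counts.min()
print(round(imbalance_ratio, 1))
```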

There are very few level 2 (pre-intermediate) answers, so that level will need to be augmented. Because the num_sentences function takes a while to run, I'm going to save the current df to a CSV and augment the data in a separate notebook.

In [121]:
df.to_csv('../data/PELIC_cleaned.csv')
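
One thing to keep in mind when loading this file in the next notebook: by default `to_csv` also writes the index, which `read_csv` then reads back as an extra `Unnamed: 0` column. A small sketch on a toy frame (made-up values, written to an in-memory buffer instead of a file) showing the two options:

```python
import io
import pandas as pd

# Toy frame with a non-contiguous index, like df after filtering
toy = pd.DataFrame({'level': [4, 5], 'answer': ['First.', 'Second.']},
                   index=[0, 46203])

# Default: the index is written as an unnamed first column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()
print(cols_with_index)

# index=False: only the data columns are written
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()
print(cols_without_index)
```

If the original row labels are needed later, keeping the default and loading with `pd.read_csv(..., index_col=0)` restores them instead.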