# Compile the PELIC Dataset
This notebook compiles and cleans the PELIC dataset.

In [11]:
# Import libraries
import pandas as pd
import spacy
import matplotlib.pyplot as plt
import seaborn as sns

In [12]:
# I'm going to be using this to process the data later on, so I'll load it now.
nlp = spacy.load('en_core_web_sm')

In [19]:
# Define file path
path = '../data/PELIC-dataset/corpus_files/'
# Read the data
question = pd.read_csv(path + 'question.csv')
answer = pd.read_csv(path + 'answer.csv')
student_info = pd.read_csv(path + 'student_information.csv')
course = pd.read_csv(path + 'course.csv')
scores = pd.read_csv(path + 'test_scores.csv')

In [20]:
# Merge the DataFrames on 'question_id', 'anon_id', and 'course_id'
merged_df = pd.merge(answer, question, on='question_id', how='left')
merged_df = pd.merge(merged_df, student_info, on='anon_id', how='left')
merged_df = pd.merge(merged_df, course, on='course_id', how='left')
merged_df = pd.merge(merged_df, scores, on='anon_id', how='left')
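A left join repeats a row once for every matching key on the right-hand side, so joining on 'anon_id' can inflate the row count if a student has more than one row in the score table. The sketch below uses small made-up tables (not the PELIC files) to show the effect and how the `validate` argument of `pd.merge` can flag it. Whether `test_scores.csv` really holds several rows per student is an assumption here.

```python
import pandas as pd

# Toy stand-ins for the answer and test-score tables (hypothetical values).
answers = pd.DataFrame({"answer_id": [1, 2], "anon_id": ["aa1", "bb2"]})
scores = pd.DataFrame({
    "anon_id": ["aa1", "aa1", "bb2"],
    "semester": ["2006_fall", "2007_spring", "2006_fall"],
})

# A left join repeats an answer once per matching score row.
merged = pd.merge(answers, scores, on="anon_id", how="left")
print(len(answers), "->", len(merged))  # row count grows from 2 to 3

# validate="many_to_one" raises MergeError if the right-hand keys are not unique.
try:
    pd.merge(answers, scores, on="anon_id", how="left", validate="many_to_one")
    keys_unique = True
except pd.errors.MergeError:
    keys_unique = False
print("right-hand keys unique:", keys_unique)
```

Comparing `len()` before and after each merge is a cheap way to spot this kind of growth early.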

In [21]:
# Rename some columns to maintain consistency with other datasets
merged_df.rename(columns={'stem': 'question', 'text': 'answer', 'native_language': 'L1'}, inplace=True)

In [22]:
# Map the question types
question_type_mapping = {
    1: 'Paragraph writing',
    2: 'Short answer',
    3: 'Multiple choice',
    4: 'Essay',
    5: 'Fill-in-the-blank',
    6: 'Sentence completion',
    7: 'Word bank',
    8: 'Chart',
    9: 'Word selection',
    10: 'Audio recording'
}

# Create the new 'question_type' column by mapping 'question_type_id' using the mapping dictionary
merged_df['question_type'] = merged_df['question_type_id'].map(question_type_mapping)
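`Series.map` with a dictionary returns NaN for any value that has no key in the dictionary, and for values that are already missing. A minimal sketch, using made-up ids rather than the PELIC data, of how to list the ids that did not receive a label:

```python
import pandas as pd

# Hypothetical ids: 99 is deliberately absent from the mapping, and one id is missing.
ids = pd.Series([1.0, 4.0, 99.0, None], name="question_type_id")
mapping = {1: "Paragraph writing", 4: "Essay"}

labels = ids.map(mapping)

# Ids that are present but have no entry in the mapping.
unmapped = ids[labels.isna() & ids.notna()].unique()
print(labels.tolist())
print("ids without a label:", unmapped.tolist())
```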

In [24]:
# Look at the columns, their types, and which columns have null values
merged_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 47667 entries, 0 to 47666
Data columns (total 47 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   answer_id                   47667 non-null  int64  
 1   question_id                 47667 non-null  int64  
 2   anon_id                     47667 non-null  object 
 3   course_id                   47667 non-null  int64  
 4   version                     47667 non-null  int64  
 5   created_date                47667 non-null  object 
 6   text_len                    47667 non-null  int64  
 7   answer                      47664 non-null  object 
 8   tokens                      47667 non-null  object 
 9   tok_lem_POS                 47667 non-null  object 
 10  question_type_id            47538 non-null  float64
 11  question                    47241 non-null  object 
 12  allow_text                  47538 non-null  float64
 13  gender                      476

In [56]:
# Look at how many answers there are of each version
merged_df.version.value_counts()

version
1    42916
2     4154
3      597
Name: count, dtype: int64

The 'version' column indicates that there are different versions of the text; however, I can't find the difference between the versions. Instead of choosing a version, we're going to filter out duplicate answers.

In [60]:
# Find duplicate answer_id
duplicates = merged_df[merged_df.duplicated('answer_id', keep=False)]

# Display the duplicates
duplicates.head(6)

Unnamed: 0,answer_id,question_id,anon_id,course_id,version,created_date,text_len,answer,tokens,tok_lem_POS,...,semester_y,LCT_Form,LCT_Score,MTELP_Form,MTELP_I,MTELP_II,MTELP_III,MTELP_Conv_Score,Writing_Sample,question_type
24,25,10,gc5,114,1,2006-09-21 10:24:26,102,Last week I planned to go paintball match' but...,"['Last', 'week', 'I', 'planned', 'to', 'go', '...","[('Last', 'last', 'JJ'), ('week', 'week', 'NN'...",...,2006_fall,1.0,21.0,R,16.0,19.0,6.0,60.0,2.6,Paragraph writing
25,25,10,gc5,114,1,2006-09-21 10:24:26,102,Last week I planned to go paintball match' but...,"['Last', 'week', 'I', 'planned', 'to', 'go', '...","[('Last', 'last', 'JJ'), ('week', 'week', 'NN'...",...,2007_spring,1.0,24.0,R,15.0,16.0,7.0,57.0,3.0,Paragraph writing
454,492,44,gc5,102,1,2006-10-02 11:49:16,49,"screw, interval, concurrent, dence anticipate,...","['screw', ',', 'interval', ',', 'concurrent', ...","[('screw', 'screw', 'NN'), (',', ',', ','), ('...",...,2006_fall,1.0,21.0,R,16.0,19.0,6.0,60.0,2.6,Short answer
455,492,44,gc5,102,1,2006-10-02 11:49:16,49,"screw, interval, concurrent, dence anticipate,...","['screw', ',', 'interval', ',', 'concurrent', ...","[('screw', 'screw', 'NN'), (',', ',', ','), ('...",...,2007_spring,1.0,24.0,R,15.0,16.0,7.0,57.0,3.0,Short answer
1097,1140,107,gc5,114,1,2006-10-17 10:40:25,258,Limiting students on-line time may be seem a g...,"['Limiting', 'students', 'on-line', 'time', 'm...","[('Limiting', 'limit', 'VBG'), ('students', 's...",...,2006_fall,1.0,21.0,R,16.0,19.0,6.0,60.0,2.6,Paragraph writing
1098,1140,107,gc5,114,1,2006-10-17 10:40:25,258,Limiting students on-line time may be seem a g...,"['Limiting', 'students', 'on-line', 'time', 'm...","[('Limiting', 'limit', 'VBG'), ('students', 's...",...,2007_spring,1.0,24.0,R,15.0,16.0,7.0,57.0,3.0,Paragraph writing
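In the rows above, each pair shares the same 'version' value (1) but differs in 'semester_y' and the test-score columns, which suggests that the duplication may come from the score merge rather than from the 'version' column. The sketch below, on a toy frame with hypothetical values, shows one way to list the columns that vary within a duplicated 'answer_id'.

```python
import pandas as pd

# Toy frame mimicking the pattern above: one answer_id repeated with
# different score-side columns (values are hypothetical).
df = pd.DataFrame({
    "answer_id": [25, 25, 26],
    "version": [1, 1, 1],
    "answer": ["same text", "same text", "other text"],
    "semester_y": ["2006_fall", "2007_spring", "2006_fall"],
    "LCT_Score": [21.0, 24.0, 18.0],
})

dupes = df[df.duplicated("answer_id", keep=False)]

# For each answer_id, count distinct values per column; more than one means the column varies.
n_distinct = dupes.groupby("answer_id").nunique()
varying = [col for col in n_distinct.columns if (n_distinct[col] > 1).any()]
print(varying)
```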


In [70]:
# The duplicated rows above appear to share the same answer text; printing both copies confirms it
print(merged_df[merged_df.answer_id == 25].answer[24])
print('\n')
print(merged_df[merged_df.answer_id == 25].answer[25])

Last week I planned to go paintball match' but I lived a healthy problem. I had hard lumbago last days. Therfore I asked my brother could you go instead of me? He accepted my order and he went paintball area yesterday, but yesterday was rainy and cold. On the other hand the other players who all of arabic spoked between native language. My brother didn't understand nothing what they spoked. Even if he played this game, he hadn't a good time. He came to home with funny statue. All wears was coloured and wet. He slept very well last night.


Last week I planned to go paintball match' but I lived a healthy problem. I had hard lumbago last days. Therfore I asked my brother could you go instead of me? He accepted my order and he went paintball area yesterday, but yesterday was rainy and cold. On the other hand the other players who all of arabic spoked between native language. My brother didn't understand nothing what they spoked. Even if he played this game, he hadn't a good time. He came 

In [65]:
# Find unique answer values
unique_answers = merged_df['answer'].value_counts() == 1

# Get the index of unique answer values
unique_index = unique_answers[unique_answers].index

# Filter the DataFrame to keep only rows with unique answer values
unique_df = merged_df[merged_df['answer'].isin(unique_index)]

unique_df.head()

Unnamed: 0,answer_id,question_id,anon_id,course_id,version,created_date,text_len,answer,tokens,tok_lem_POS,...,semester_y,LCT_Form,LCT_Score,MTELP_Form,MTELP_I,MTELP_II,MTELP_III,MTELP_Conv_Score,Writing_Sample,question_type
0,1,5,eq0,149,1,2006-09-20 16:11:08,177,I met my friend Nife while I was studying in a...,"['I', 'met', 'my', 'friend', 'Nife', 'while', ...","[('I', 'I', 'PRP'), ('met', 'meet', 'VBD'), ('...",...,2006_spring,1.0,5.0,P,5.0,7.0,0.0,28.0,1.0,Paragraph writing
1,2,5,am8,149,1,2006-09-20 22:09:14,137,"Ten years ago, I met a women on the train betw...","['Ten', 'years', 'ago', ',', 'I', 'met', 'a', ...","[('Ten', 'ten', 'CD'), ('years', 'year', 'NNS'...",...,2006_spring,1.0,11.0,P,15.0,9.0,5.0,45.0,2.3,Paragraph writing
2,3,12,dk5,115,1,2006-09-21 10:16:17,63,In my country we usually don't use tea bags. F...,"['In', 'my', 'country', 'we', 'usually', 'do',...","[('In', 'in', 'IN'), ('my', 'my', 'PRP$'), ('c...",...,,,,,,,,,,Paragraph writing
3,4,13,dk5,115,1,2006-09-21 10:16:17,6,I organized the instructions by time.,"['I', 'organized', 'the', 'instructions', 'by'...","[('I', 'I', 'PRP'), ('organized', 'organize', ...",...,,,,,,,,,,Paragraph writing
4,5,12,ad1,115,1,2006-09-21 10:19:01,59,"First, prepare a port, loose tea, and cup.\nSe...","['First', ',', 'prepare', 'a', 'port', ',', 'l...","[('First', 'first', 'RB'), (',', ',', ','), ('...",...,2006_summer,1.0,23.0,R,22.0,30.0,14.0,81.0,2.0,Paragraph writing
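Note that filtering on `value_counts() == 1` removes every copy of a repeated text, not just the extra copies. The sketch below contrasts that with `drop_duplicates`, which keeps one row per 'answer_id'. It uses toy data with hypothetical values, and which behaviour is preferable depends on the goal of the analysis.

```python
import pandas as pd

# Toy data (hypothetical): answer_id 1 appears twice with identical text.
df = pd.DataFrame({
    "answer_id": [1, 1, 2],
    "answer": ["I like tea.", "I like tea.", "He went home."],
})

# Approach used above: keep only texts that occur exactly once.
counts = df["answer"].value_counts()
only_singletons = df[df["answer"].isin(counts[counts == 1].index)]

# Alternative: keep the first row for each answer_id.
first_per_id = df.drop_duplicates(subset="answer_id", keep="first")

print(len(only_singletons), len(first_per_id))
```

With the first approach, answer 1 disappears entirely; with the second, one copy of it survives.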


In [69]:
# Double check to make sure that there are no more duplicate answer_ids, now that we've removed
# the duplicate answers
# Find duplicate answer_id
duplicates = unique_df[unique_df.duplicated('answer_id', keep=False)]

# Display the duplicates
duplicates.head(6)

Unnamed: 0,answer_id,question_id,anon_id,course_id,version,created_date,text_len,answer,tokens,tok_lem_POS,...,semester_y,LCT_Form,LCT_Score,MTELP_Form,MTELP_I,MTELP_II,MTELP_III,MTELP_Conv_Score,Writing_Sample,question_type


Some columns contain a high number of null values, but we don't want to lose data when we drop NA values. First, we're going to filter out the data that we want to use, and then we're going to select the columns that we're interested in and put them into a new dataframe.

In [76]:
# Firstly, let's look at the class (course) types
unique_df.class_id.value_counts()
# w = writing
# r = reading
# g = grammar
# l = listening
# s = speaking

class_id
w    13098
r    12651
g    10267
l      889
s       37
Name: count, dtype: int64

In [96]:
# Look at the question types for each type of class
writing_question_types = unique_df.groupby(['class_id', 'question_type'])['allow_text'].value_counts().unstack(fill_value=0)
print(writing_question_types)

allow_text                    0.0   1.0
class_id question_type                 
g        Audio recording        0     1
         Essay                  0    55
         Fill-in-the-blank      0   423
         Paragraph writing      0  6410
         Sentence completion    0   423
         Short answer           0  2955
l        Essay                  0    39
         Paragraph writing      0   582
         Short answer          56   200
r        Essay                  0   104
         Fill-in-the-blank     11     0
         Paragraph writing      0  3423
         Short answer         233  8861
         Word bank              7     0
s        Short answer           0    37
w        Essay                  0  3136
         Paragraph writing      0  5190
         Short answer           0  4759


Every class type includes questions that allow open-text answers, so we can keep data from all class types. The short answer question type, however, includes both open-text answers and non-open-text answers (e.g., multiple choice). The simplest approach is to drop every answer that doesn't allow open text.

In [97]:
# Keep only open-text answers
unique_df = unique_df[unique_df.allow_text == 1]

In [99]:
unique_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 36598 entries, 0 to 47662
Data columns (total 47 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   answer_id                   36598 non-null  int64  
 1   question_id                 36598 non-null  int64  
 2   anon_id                     36598 non-null  object 
 3   course_id                   36598 non-null  int64  
 4   version                     36598 non-null  int64  
 5   created_date                36598 non-null  object 
 6   text_len                    36598 non-null  int64  
 7   answer                      36598 non-null  object 
 8   tokens                      36598 non-null  object 
 9   tok_lem_POS                 36598 non-null  object 
 10  question_type_id            36598 non-null  float64
 11  question                    36304 non-null  object 
 12  allow_text                  36598 non-null  float64
 13  gender                      36598 no

<p>Now we need to drop null values from the dataset while keeping as much of the information we're interested in as possible, for both X and y. Intuitively, 'level_id' would seem to be the best y; however, this is the level of the class that the student was in at the time of writing the answer, which doesn't necessarily indicate the level of the answer itself. Further, the class levels don't line up neatly with distinct CEFR levels. We may be better off using one of the Michigan Test of English Language Proficiency (MTELP) scores, or the 'Writing_Sample' column, as our y.</p>
<br>
<table>
    <tr>
        <th>level_id</th>
        <th>Level description</th>
        <th>CEFR level</th>
    </tr>
    <tr>
        <td>2</td>
        <td>Pre-Intermediate</td>
        <td>A2/B1</td>
    </tr>
    <tr>
        <td>3</td>
        <td>Intermediate</td>
        <td>B1</td>
    </tr>
    <tr>
        <td>4</td>
        <td>Upper-Intermediate</td>
        <td>B1+/B2</td>
    </tr>
    <tr>
        <td>5</td>
        <td>Advanced</td>
        <td>B2+/C1</td>
    </tr>
</table>
<br>
<table>
    <tr>
        <th>Column Name</th>
        <th>Description</th>
    </tr>
    <tr>
        <td>MTELP_I</td>
        <td>Grammar section</td>
    </tr>
    <tr>
        <td>MTELP_II</td>
        <td>Reading section</td>
    </tr>
    <tr>
        <td>MTELP_III</td>
        <td>Listening section</td>
    </tr>
    <tr>
        <td>MTELP_Conv_Score</td>
        <td>Total combined score</td>
    </tr>
    <tr>
        <td>Writing_Sample</td>
        <td>In-house writing test score (scale of 1-6)</td>
    </tr>
</table>
<br>
<p>Since we need to keep the test score columns, we're going to drop anything that we don't need that has fewer than 34,941 non-null values. The only column below that value that I would consider keeping is the age column; however, for the purposes of the current project, I don't think it will be relevant.</p>
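The threshold rule above can be sketched as follows. The frame and the cut-off of 3 are toy stand-ins (hypothetical values) for `unique_df` and the 34,941 figure.

```python
import pandas as pd

# Toy frame (hypothetical values) with one sparsely populated column.
df = pd.DataFrame({
    "answer": ["a", "b", "c", "d"],
    "Writing_Sample": [2.0, 3.0, None, 4.0],
    "age": [None, None, 25.0, None],
})

threshold = 3  # stand-in for the 34,941 cut-off mentioned above

# Count non-null values per column and keep the columns that meet the threshold.
non_null = df.notna().sum()
keep = non_null[non_null >= threshold].index.tolist()
print(non_null.to_dict())
print(keep)
```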

In [111]:
# Select columns to keep
columns_to_keep = ['answer_id', 'question_id', 'anon_id', 'course_id', 'version',
       'created_date', 'text_len', 'answer', 'tokens', 'tok_lem_POS',
       'question_type_id', 'question', 'allow_text', 'gender',
       'L1', 'class_id', 'level_id', 'section', 'MTELP_I', 'MTELP_II',
       'MTELP_III', 'MTELP_Conv_Score', 'Writing_Sample', 'question_type']

# Put them in a new df and drop rows with null values
df = unique_df[columns_to_keep].dropna()

# Check the result
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 34660 entries, 0 to 47662
Data columns (total 24 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   answer_id         34660 non-null  int64  
 1   question_id       34660 non-null  int64  
 2   anon_id           34660 non-null  object 
 3   course_id         34660 non-null  int64  
 4   version           34660 non-null  int64  
 5   created_date      34660 non-null  object 
 6   text_len          34660 non-null  int64  
 7   answer            34660 non-null  object 
 8   tokens            34660 non-null  object 
 9   tok_lem_POS       34660 non-null  object 
 10  question_type_id  34660 non-null  float64
 11  question          34660 non-null  object 
 12  allow_text        34660 non-null  float64
 13  gender            34660 non-null  object 
 14  L1                34660 non-null  object 
 15  class_id          34660 non-null  object 
 16  level_id          34660 non-null  int64  
 17

Now we're going to filter out any answers that don't contain at least one subject and one verb. This will be our bare minimum criterion for text length, which will conserve data from the lower level classes.

In [114]:
# First, check if there are any answers that contain only empty space
df.loc[df['answer'].str.isspace()].index

Index([], dtype='int64')

In [115]:
# Make sure that the answer column is a string
df['answer'] = df['answer'].astype('string')

In [116]:
# Define a function to filter the data for answers that contain at least one subject and verb
def contains_subject_and_verb(text):
    '''
    Checks to see if a document contains
    at least one subject and one verb
    '''
    doc = nlp(text)
    # Check if the text contains at least one subject and one verb
    return any(token.dep_ == "nsubj" for token in doc) and any(token.pos_ == "VERB" for token in doc)

def filter_rows_with_subject_and_verb(df):
    '''
    Keeps only the rows whose 'answer' contains
    at least one subject and one verb
    '''
    # Apply the contains_subject_and_verb function to each row in the 'answer' column
    mask = df['answer'].apply(contains_subject_and_verb)
    # Filter the DataFrame to keep only the rows where the condition is True
    return df[mask]
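One thing to keep in mind with this check: spaCy's English pipelines generally tag copular and auxiliary verbs as AUX rather than VERB, so a sentence such as "He is happy" may fail the test even though it has a subject and a verb. The sketch below illustrates the logic with hand-written stand-ins for spaCy tokens. The `Token` tuple, the `has_subject_and_verb` helper, and the tags assigned to each word are assumptions for illustration, not real model output.

```python
from collections import namedtuple

# Hand-written stand-ins for spaCy tokens; the tags below are assumptions
# about how a parser might label each sentence, not real model output.
Token = namedtuple("Token", ["text", "dep_", "pos_"])

def has_subject_and_verb(tokens, verb_tags=("VERB",)):
    """Same test as contains_subject_and_verb, with a configurable set of verb tags."""
    has_subject = any(t.dep_ == "nsubj" for t in tokens)
    has_verb = any(t.pos_ in verb_tags for t in tokens)
    return has_subject and has_verb

copular = [
    Token("He", "nsubj", "PRON"),
    Token("is", "ROOT", "AUX"),
    Token("happy", "acomp", "ADJ"),
]
lexical = [Token("He", "nsubj", "PRON"), Token("slept", "ROOT", "VERB")]

print(has_subject_and_verb(lexical))                   # True
print(has_subject_and_verb(copular))                   # False
print(has_subject_and_verb(copular, ("VERB", "AUX")))  # True
```

If copular sentences should be kept, accepting both tags is one option; it would change the number of rows that pass the filter.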

In [117]:
# Apply the function
df = filter_rows_with_subject_and_verb(df)

In [119]:
len(df)

29631

It looks like we eliminated 5,029 answers that don't contain at least one subject and one verb. Now, we're going to add columns for the number of sentences per text, the average sentence length, and the total number of tokens. These can be used in our y later on.

In [120]:
# Define a function to add the total number of tokens,
# the number of sentences per text,
# and the average sentence length per text to the dataframe
def sentence_length(df):
    '''
    Adds columns to the dataframe for the length of the answer,
    the number of sentences per answer,
    and the average sentence length per answer.
    '''
    # Load the English language model, if not already loaded at the top of the notebook
    # nlp = spacy.load("en_core_web_sm")
    # Create a copy of the DataFrame to avoid the SettingWithCopyWarning
    df = df.copy()
    # Iterate over rows in the DataFrame
    for index, row in df.iterrows():
        # Get the answer text from the DataFrame
        answer_text = row['answer']
        # Process the answer text with spaCy
        doc = nlp(answer_text)
        # Initialize variables to accumulate total tokens and count of sentences
        total_tokens = 0
        num_sentences = 0

        # Iterate over sentences and accumulate total tokens
        for sentence in doc.sents:
            num_tokens = len(sentence)
            total_tokens += num_tokens
            num_sentences += 1
            
        # Calculate the average sentence length
        avg_len = total_tokens / num_sentences
        # Add num_sentences and avg_len as new columns in the DataFrame
        df.loc[index, 'num_sentences'] = num_sentences
        df.loc[index, 'avg_sentence_length'] = avg_len
        df.loc[index, 'total_tokens'] = total_tokens

    return df

In [121]:
# Apply the function to the dataframe
df = sentence_length(df)

In [122]:
df.head()

Unnamed: 0,answer_id,question_id,anon_id,course_id,version,created_date,text_len,answer,tokens,tok_lem_POS,...,section,MTELP_I,MTELP_II,MTELP_III,MTELP_Conv_Score,Writing_Sample,question_type,num_sentences,avg_sentence_length,total_tokens
0,1,5,eq0,149,1,2006-09-20 16:11:08,177,I met my friend Nife while I was studying in a...,"['I', 'met', 'my', 'friend', 'Nife', 'while', ...","[('I', 'I', 'PRP'), ('met', 'meet', 'VBD'), ('...",...,M,5.0,7.0,0.0,28.0,1.0,Paragraph writing,12.0,16.083333,193.0
1,2,5,am8,149,1,2006-09-20 22:09:14,137,"Ten years ago, I met a women on the train betw...","['Ten', 'years', 'ago', ',', 'I', 'met', 'a', ...","[('Ten', 'ten', 'CD'), ('years', 'year', 'NNS'...",...,M,15.0,9.0,5.0,45.0,2.3,Paragraph writing,10.0,15.6,156.0
4,5,12,ad1,115,1,2006-09-21 10:19:01,59,"First, prepare a port, loose tea, and cup. Sec...","['First', ',', 'prepare', 'a', 'port', ',', 'l...","[('First', 'first', 'RB'), (',', ',', ','), ('...",...,Q,22.0,30.0,14.0,81.0,2.0,Paragraph writing,5.0,15.6,78.0
6,7,12,eg5,115,1,2006-09-21 10:19:02,39,"First, prepare your cup, loose tea or bag tea,...","['First', ',', 'prepare', 'your', 'cup', ',', ...","[('First', 'first', 'RB'), (',', ',', ','), ('...",...,Q,18.0,28.0,13.0,74.0,3.0,Paragraph writing,4.0,13.5,54.0
7,8,13,eg5,115,1,2006-09-21 10:19:02,35,"I organized the instructions by time, beacause...","['I', 'organized', 'the', 'instructions', 'by'...","[('I', 'I', 'PRP'), ('organized', 'organize', ...",...,Q,18.0,28.0,13.0,74.0,3.0,Paragraph writing,2.0,19.5,39.0
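The three new columns are related by avg_sentence_length = total_tokens / num_sentences. A small sketch of that arithmetic follows, with a guard for documents that have no sentences. The `sentence_stats` helper and the per-sentence counts are hypothetical; only the totals (78 tokens over 5 sentences) come from the third row above.

```python
def sentence_stats(sentence_lengths):
    """Return (num_sentences, total_tokens, avg_sentence_length) from per-sentence token counts."""
    num_sentences = len(sentence_lengths)
    total_tokens = sum(sentence_lengths)
    # Guard against an empty document, which would otherwise divide by zero.
    avg_len = total_tokens / num_sentences if num_sentences else 0.0
    return num_sentences, total_tokens, avg_len

# Hypothetical split of 78 tokens over 5 sentences, matching the totals of the third row above.
print(sentence_stats([20, 14, 16, 18, 10]))
print(sentence_stats([]))
```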


In [124]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 29631 entries, 0 to 47662
Data columns (total 27 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   answer_id            29631 non-null  int64  
 1   question_id          29631 non-null  int64  
 2   anon_id              29631 non-null  object 
 3   course_id            29631 non-null  int64  
 4   version              29631 non-null  int64  
 5   created_date         29631 non-null  object 
 6   text_len             29631 non-null  int64  
 7   answer               29631 non-null  string 
 8   tokens               29631 non-null  object 
 9   tok_lem_POS          29631 non-null  object 
 10  question_type_id     29631 non-null  float64
 11  question             29631 non-null  object 
 12  allow_text           29631 non-null  float64
 13  gender               29631 non-null  object 
 14  L1                   29631 non-null  object 
 15  class_id             29631 non-null  obje

In [125]:
df.to_csv('../data/PELIC_compiled.csv')
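By default `to_csv` also writes the index, which comes back as an 'Unnamed: 0' column when the file is read with `pd.read_csv`. A minimal sketch on a toy frame, using an in-memory buffer instead of a file, shows the difference that `index=False` makes:

```python
import io
import pandas as pd

df = pd.DataFrame({"answer_id": [1, 5]}, index=[0, 4])  # toy frame with a gapped index

# Default: the index is written as an unnamed first column.
buf = io.StringIO()
df.to_csv(buf)
buf.seek(0)
with_index = pd.read_csv(buf)

# index=False: only the data columns are written.
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
without_index = pd.read_csv(buf)

print(with_index.columns.tolist())
print(without_index.columns.tolist())
```

If the original row labels are needed later, reading the file back with `index_col=0` is an alternative to dropping the index on export.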