# Compile the ASAG Dataset

<p>The ASAG dataset is distributed as XML files. This notebook parses them with BeautifulSoup and compiles the records into a single pandas DataFrame.</p>
<p>In this dataset, the 'level' column refers to the level of the course the student was enrolled in when writing the text, whereas the 'grade_majority_vote' column holds the final grade assigned to the answer after it was graded by three TOEFL examiners. Because the majority-vote grade describes the answer itself rather than the course, we keep 'grade_majority_vote' as the target (y) value.</p>
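To make the idea of a majority vote over three examiner grades concrete, here is a minimal sketch. The corpus ships its own 'grade_majority_vote' value, which this notebook reads directly; the helper `majority_vote` below is hypothetical, and how the dataset resolves a three-way disagreement is not stated here, so the sketch simply returns `None` in that case.

```python
from collections import Counter


def majority_vote(grades):
    """Return the label given by more than half of the graders, or None."""
    if not grades:
        return None
    # most_common(1) yields the single most frequent label and its count
    label, count = Counter(grades).most_common(1)[0]
    return label if count > len(grades) / 2 else None


# Two of three examiners agree, so their label wins
two_agree = majority_vote(['A1', 'A2', 'A2'])

# No label reaches a majority (assumption: treated as undecided here)
no_majority = majority_vote(['A1', 'A2', 'B1'])
```

Applied to the three examiner grades of a row, this should agree with the stored majority-vote grade whenever at least two examiners gave the same label.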

In [17]:
# Import libraries
import pandas as pd
import os
from bs4 import BeautifulSoup as bs

In [18]:
path = '../data/cefr-asag-dataset-1.0.1/corpus/release-1.0/labelled/'

In [19]:
# Define a function that reads the CEFR data from the XML files into a pandas DataFrame
def create_asag_df(path):
    '''
    Read every XML file in the directory `path`
    and return the extracted CEFR data
    as a pandas DataFrame with one row per file.
    '''

    # Define DataFrame columns
    df_columns = ['file_name', 'age_participant','sex_participant', 'education', 'L1', 'sex_examiner1', 'sex_examiner2', 'sex_examiner3', 'setting', 'question', 'word_limit', 'level', 'answer', 'grade_examiner1', 'grade_examiner2', 'grade_examiner3', 'grade_majority_vote']

    # Create an empty list to store dictionaries
    all_data = []

    # Loop through XML files in the directory
    for filename in os.listdir(path):
        if filename.endswith('.xml'):
            file_path = os.path.join(path, filename)

            # Read the XML file as raw bytes so the parser can detect its encoding
            with open(file_path, 'rb') as file:
                contents = file.read()

            # Parse XML
            soup = bs(contents, 'xml')

            # Extract data from XML (each field falls back to None if its tag is missing)
            participant = soup.find('person', {'role': 'participant'})
            sex_participant = participant.get('sex') if participant else None
            age_participant = participant.get('age') if participant else None
            education = soup.find('education').get('type') if soup.find('education') else None
            L1 = soup.find('langKnown').get('tag') if soup.find('langKnown') else None
            sex_examiner1 = soup.find('person', {'xml:id': 'examiner.1'}).get('sex') if soup.find('person', {'xml:id': 'examiner.1'}) else None
            sex_examiner2 = soup.find('person', {'xml:id': 'examiner.2'}).get('sex') if soup.find('person', {'xml:id': 'examiner.2'}) else None
            sex_examiner3 = soup.find('person', {'xml:id': 'examiner.3'}).get('sex') if soup.find('person', {'xml:id': 'examiner.3'}) else None
            setting = soup.find('settingDesc').find('p').text if soup.find('settingDesc') else None
            question = soup.find('div', {'type': 'question'}).find('p').text if soup.find('div', {'type': 'question'}) else None
            word_limit = soup.find('note', {'type': 'word-limit'}).text if soup.find('note', {'type': 'word-limit'}) else None
            level = soup.find('label', {'type': 'level'}).find('span').text if soup.find('label', {'type': 'level'}) else None
            answer = soup.find('div', {'type': 'answer'}).find('p').text if soup.find('div', {'type': 'answer'}) else None
            grade_examiner1 = soup.find('label', {'corresp': '#examiner.1'}).find('span').text if soup.find('label', {'corresp': '#examiner.1'}) else None
            grade_examiner2 = soup.find('label', {'corresp': '#examiner.2'}).find('span').text if soup.find('label', {'corresp': '#examiner.2'}) else None
            grade_examiner3 = soup.find('label', {'corresp': '#examiner.3'}).find('span').text if soup.find('label', {'corresp': '#examiner.3'}) else None
            grade_majority_vote = soup.find('label', {'subtype': 'majority-vote'}).find('span').text if soup.find('label', {'subtype': 'majority-vote'}) else None

            # Append data to list as a dictionary
            all_data.append({
                'file_name': filename,
                'age_participant': age_participant,
                'sex_participant': sex_participant,
                'education': education,
                'L1': L1,
                'sex_examiner1': sex_examiner1,
                'sex_examiner2': sex_examiner2,
                'sex_examiner3': sex_examiner3,
                'setting': setting,
                'question': question,
                'word_limit': word_limit,
                'level': level,
                'answer': answer,
                'grade_examiner1': grade_examiner1,
                'grade_examiner2': grade_examiner2,
                'grade_examiner3': grade_examiner3,
                'grade_majority_vote': grade_majority_vote
            })

    # Create DataFrame from the list of dictionaries, using the column order defined above
    df_all = pd.DataFrame(all_data, columns=df_columns)

    return df_all
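
The extraction pattern in the function (look a tag up, then fall back to `None` when it is missing) can be sketched with two small helpers that perform each lookup only once. This sketch swaps in the standard library's `xml.etree.ElementTree` instead of BeautifulSoup so that it runs without extra dependencies. The sample XML is a hypothetical, heavily simplified stand-in for a corpus file (no namespaces), and `attr_or_none` and `text_or_none` are illustrative names, not part of the notebook.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for one corpus file (assumption: no namespaces)
SAMPLE_XML = """
<doc>
  <person role="participant" sex="F" age="19"/>
  <label type="level"><span>A1</span></label>
  <label subtype="majority-vote"><span>A2</span></label>
</doc>
"""


def attr_or_none(root, xpath, attribute):
    """Return an attribute of the first element matching xpath, or None."""
    element = root.find(xpath)
    return element.get(attribute) if element is not None else None


def text_or_none(root, xpath):
    """Return the text of the first element matching xpath, or None."""
    element = root.find(xpath)
    return element.text if element is not None else None


root = ET.fromstring(SAMPLE_XML)
record = {
    'sex_participant': attr_or_none(root, ".//person[@role='participant']", 'sex'),
    'age_participant': attr_or_none(root, ".//person[@role='participant']", 'age'),
    'level': text_or_none(root, ".//label[@type='level']/span"),
    'grade_majority_vote': text_or_none(root, ".//label[@subtype='majority-vote']/span"),
    # <education> is absent from the sample, so this lookup falls back to None
    'education': attr_or_none(root, './/education', 'type'),
}
```

Note that attribute values such as the age come back as strings, which is consistent with the `object` dtype that `df.info()` reports for 'age_participant'.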

In [20]:
# Call the function and display the resulting DataFrame
df = create_asag_df(path)
df.head()

| | file_name | age_participant | sex_participant | education | L1 | sex_examiner1 | sex_examiner2 | sex_examiner3 | setting | question | word_limit | level | answer | grade_examiner1 | grade_examiner2 | grade_examiner3 | grade_majority_vote |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0001.xml | 18 | M | higher-secondary | fr | F | F | F | collected in a university-level language learn... | What are your daily habits? What time do you g... | (at least 30 words) | A1 | everyday i get up at 8 a clock. I always turn ... | A1 | A2 | A2 | A2 |
| 1 | 0002.xml | 19 | F | higher-secondary | fr | F | F | F | collected in a university-level language learn... | Describe your family. | (at least 30 words) | A1 | My family is very small. I have a big borther.... | A1 | A1 | A1 | A1 |
| 2 | 0003.xml | 22 | F | lower-secondary | fr | F | F | F | collected in a university-level language learn... | Describe your family. | (at least 30 words) | A1 | My name is {name} | A1 | A1 | A1 | A1 |
| 3 | 0004.xml | 21 | F | higher-secondary | fr | F | F | F | collected in a university-level language learn... | Describe your hobbies. | (at least 30 words) | A1 | Hi my name is {name}, | A2 | A2 | A2 | A2 |
| 4 | 0005.xml | 18 | F | higher-secondary | fr | F | F | F | collected in a university-level language learn... | Describe your family. | (at least 30 words) | A1 | I have one sister, she is married and she has ... | A2 | A1 | A2 | A2 |


In [21]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 299 entries, 0 to 298
Data columns (total 17 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   file_name            299 non-null    object
 1   age_participant      299 non-null    object
 2   sex_participant      299 non-null    object
 3   education            299 non-null    object
 4   L1                   299 non-null    object
 5   sex_examiner1        299 non-null    object
 6   sex_examiner2        299 non-null    object
 7   sex_examiner3        299 non-null    object
 8   setting              299 non-null    object
 9   question             299 non-null    object
 10  word_limit           299 non-null    object
 11  level                299 non-null    object
 12  answer               299 non-null    object
 13  grade_examiner1      299 non-null    object
 14  grade_examiner2      299 non-null    object
 15  grade_examiner3      299 non-null    object
 16  grade_ma

In [22]:
# Replace levels with integers.
replacement_dict = {
    'A1': 1,
    'A2': 2,
    'B1': 3,
    'B2': 4,
    'C1': 5,
    'C2': 6
}

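# Replace the CEFR labels only in the level and grade columns, so that other
# columns (e.g. 'answer') are never altered by the replacement
level_columns = ['level', 'grade_examiner1', 'grade_examiner2', 'grade_examiner3', 'grade_majority_vote']
df[level_columns] = df[level_columns].replace(replacement_dict)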
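
The `df.info()` output further down still lists 'grade_examiner2' as `object` after this step. One possible explanation is that the column holds a label that is not a key of `replacement_dict`, so that label stays a string. The sketch below shows one way to list such leftovers. The helper `unmapped_labels` is hypothetical and the sample grades are invented for illustration, not taken from the dataset.

```python
CEFR_TO_INT = {'A1': 1, 'A2': 2, 'B1': 3, 'B2': 4, 'C1': 5, 'C2': 6}


def unmapped_labels(values, mapping):
    """Return the sorted distinct values that are not keys of mapping."""
    return sorted({value for value in values if value not in mapping})


# Invented sample grades: two of them have no entry in the mapping
sample_grades = ['A1', 'A2', 'B1+', 'A2', 'below A1']
leftovers = unmapped_labels(sample_grades, CEFR_TO_INT)
```

Because iterating over a pandas Series yields its values, a call such as `unmapped_labels(df['grade_examiner2'], replacement_dict)` made before the replacement would show which labels need an entry in the dictionary.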

In [23]:
# Define a mapping dictionary for language codes to language names
language_mapping = {
    'fr': 'French',
    'it': 'Italian',
    'ru': 'Russian',
    'es': 'Spanish',
    'sw': 'Swahili',
    'ar': 'Arabic',
    'kab': 'Kabyle',
    'fa': 'Persian',
    'nl': 'Dutch',
    'de': 'German',
    'bg': 'Bulgarian'
}

# Replace the language codes in 'L1' with language names (codes missing from the mapping become NaN)
df['L1'] = df['L1'].map(language_mapping)

In [24]:
# Rename 'level' to 'level_course' for clarity
df.rename(columns={'level': 'level_course'}, inplace=True)
# Rename the column 'grade_majority_vote' to 'level' to maintain consistency with PELIC
df.rename(columns={'grade_majority_vote': 'level'}, inplace=True)
# To maintain consistency with PELIC, add a question_type column
df['question_type'] = 'Paragraph writing'
# Add a column to indicate which dataset it comes from
df['dataset'] = 'ASAG'

In [25]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 299 entries, 0 to 298
Data columns (total 19 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   file_name        299 non-null    object
 1   age_participant  299 non-null    object
 2   sex_participant  299 non-null    object
 3   education        299 non-null    object
 4   L1               299 non-null    object
 5   sex_examiner1    299 non-null    object
 6   sex_examiner2    299 non-null    object
 7   sex_examiner3    299 non-null    object
 8   setting          299 non-null    object
 9   question         299 non-null    object
 10  word_limit       299 non-null    object
 11  level_course     299 non-null    int64 
 12  answer           299 non-null    object
 13  grade_examiner1  299 non-null    int64 
 14  grade_examiner2  299 non-null    object
 15  grade_examiner3  299 non-null    int64 
 16  level            299 non-null    int64 
 17  question_type    299 non-null    ob

In [26]:
df.to_csv('../data/ASAG_compiled.csv')
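
`DataFrame.to_csv` writes the row index as an unnamed first column unless `index=False` is passed, and reading such a file back with `pd.read_csv` produces a column called `Unnamed: 0`. The sketch below shows the difference on a toy DataFrame written to an in-memory buffer. The data is invented, and whether later notebooks rely on the index column is not known here.

```python
import io

import pandas as pd

# Toy stand-in for the compiled DataFrame (invented values)
toy = pd.DataFrame({'file_name': ['0001.xml', '0002.xml'], 'level': [2, 1]})

# Default behaviour: the index is written as an extra, unnamed first column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False: only the data columns are written
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_without_index = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))
print(list(reloaded_without_index.columns))
```

If the extra column is unwanted downstream, passing `index=False` when saving, or `index_col=0` when loading, are two ways to avoid it.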