<h1>Preprocessing and Exploratory Data Analysis</h1>

With the ultimate goal of training a BERT text classifier to identify the nationality/L1 of non-native writers of English, this project:

1. Unifies and preprocesses data from multiple corpora 
2. Explores each corpus and L1 category quantitatively
3. Examines limitations, design issues, and questions related to these findings

<h1> Unifying Data from Multiple Corpora </h1>

Corpora included:

1. ICLE
2. EFCAMDAT
3. PELIC

Access to the ETS Corpus of Non-Native Written English (TOEFL11) through the LDC is pending.

The following code extracts samples from each corpus and unifies the labels and samples into a single dataset. Brief descriptions of each corpus are also provided.

In [1]:
%%HTML
<script src="require.js"></script>

In [2]:
import os
import re
import json 

#plotting
import plotly.graph_objects as go
import matplotlib.pyplot as plt
import plotly.express as px
import plotly
import plotly.io as pio
pio.renderers.default='notebook'
import kaleido

#data handling
import pandas as pd
import numpy as np
import xml.etree.ElementTree as ET
import bs4
from bs4 import BeautifulSoup

pd.set_option('display.max_rows', 150)
pd.set_option('display.max_columns', 10)

In [3]:
# main directories
project_dir = "/Users/paulp/Library/CloudStorage/OneDrive-UniversityofEasternFinland/UEF/Thesis"
data_dir = os.path.join(project_dir, 'Data')

# relative corpus directories
ICLE_dir = os.path.join(data_dir, "ICLE/split_texts")
EFCAMDAT_dir = os.path.join(data_dir, 'EFCAMDAT')
PELIC = os.path.join(data_dir, 'PELIC/PELIC_compiled.csv')

os.chdir(data_dir)

<h2> ICLE </h2>

https://uclouvain.be/en/research-institutes/ilc/cecl/icle.html

<h3> Description </h3>

Version 2 of the International Corpus of Learner English from UC Louvain. Samples adhere closely to Atkins and Clear's (1992) corpus design criteria [ICLE]. Most samples in ICLE are argumentative essays collected from academic environments, representing a range of suggested topics as prompts.

The data available to UEF users in V2 does not represent the full range of L1s/nationalities of interest. Note also that nationality, not L1, stands in for the target variable.

In [4]:
files = os.scandir(ICLE_dir)
nationalities = {}
for a in files:
    b = re.split('-', a.name)[1]
    if b not in nationalities.keys():
        nationalities[b] = 1
    else:
        nationalities[b] += 1
nationalities

{'GE': 281,
 'CN': 757,
 'JP': 365,
 'SW': 255,
 'PO': 350,
 'FIN': 193,
 'TR': 255,
 'RU': 250,
 'SP': 186}

In [5]:
dataset = pd.DataFrame(data = None, columns = ['Corpus','Target','Text'])

In [6]:
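# fill dataframe with samples
files = os.scandir(ICLE_dir)
for b, a in enumerate(files):
    # the nationality code is the second hyphen-separated field of the filename
    target = re.split('-', a.name)[1]
    # the context manager closes the file even if reading fails
    with open(a) as c:
        text = c.read()
    dataset.loc[b, 'Target'] = target
    dataset.loc[b, 'Text'] = text
    dataset.loc[b, 'Corpus'] = 'ICLE'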

In [7]:
# Remove some L1s due to data sparsity
dataset = dataset[~dataset['Target'].isin(['SW', 'PO', 'FIN', 'TR'])]
len(dataset)

1839

<h2> EFCAMDAT </h2>

https://philarion.mml.cam.ac.uk/

<h3> Description </h3>

This corpus is a collaboration between EF Education First and the Department of Theoretical and Applied Linguistics at the University of Cambridge. The samples were collected from English Live, EF's online language school. Samples are sortable by nationality, level, and other provided variables. As in ICLE, nationality is assumed to correlate with L1.

<h3> Notes </h3>

At first, levels 10-16 were selected for this project; according to the corpus documentation, these correspond to CEFR level B2 and above [], which is consistent with the ICLE corpus. After initial exploration, however, the levels appeared inflated, perhaps because they reflect overall English competence rather than writing skill specifically. Ultimately, levels 12-16 were selected to filter out some of the lower-quality samples.

To address an under-representation of Spanish language data, Spanish was also sampled from a few Latin American countries. These varieties of Spanish may well impact the model's ability to pick up on 'general' characteristics of Spanish-influenced L2 English, but for now the increase in volume and balanced representation will be assumed a benefit rather than a drawback. 

In [8]:
# Process the XML file from EFCAMDAT
efcamdat = os.path.join(EFCAMDAT_dir, 'EF201403_selection1855.xml')
with open(efcamdat) as fp:
    soup = BeautifulSoup(fp, features='lxml-xml')

In [9]:

efcamdat_ds = pd.DataFrame(data=None, columns = ['Corpus', 'Target', 'Text'])
nationalities = {'cn':'CN', 
                 'de':'GE', 
                 'es':'SP',  
                 'jp':'JP', 
                 'ru':'RU',
                 'mx': 'SP',
                 'ar':'SP',
                 'co': 'SP',
                 've':'SP',
                 'kw':'AR',
                 'om':'AR',
                 'qa':'AR',
                 'sa':'AR',
                 'sy': 'AR'
                }

# Build the DataFrame: collect matching rows, then construct the frame once
rows = []
for s in soup.find_all('writing'):
    level = int(s.get('level'))
    # filter out lower level texts
    if level < 12:
        continue
    nationality = s.find_all('learner')[0].get('nationality')
    # keep only the nationalities mapped to a target label
    if nationality in nationalities:
        rows.append({'Corpus': 'EFCAM',
                     'Target': nationalities[nationality],
                     'Text': s.find_all('text')[0].text})
efcamdat_ds = pd.DataFrame(rows, columns=['Corpus', 'Target', 'Text'])

In [10]:
data = pd.concat([dataset, efcamdat_ds])
# convert the label columns of the combined frame (not the ICLE-only frame)
data['Target'] = pd.Categorical(data['Target'])
data['Corpus'] = pd.Categorical(data['Corpus'])

In [11]:
data.describe()

       Corpus Target                                          Text
count   12063  12063                                         12063
unique      2      6                                         12005
top     EFCAM     GE  \n You can learn how to be a good leader....
freq    10224   3889                                             6


<h2> PELIC </h2>

https://eli-data-mining-group.github.io/Pitt-ELI-Corpus/

<h3>Description</h3>

PELIC contains writing samples from students in the University of Pittsburgh English Language Institute, an intensive EAP program.

<h3> Notes </h3>

Because the data is longitudinal, only one writing sample per student was selected. This prevents the model from identifying the characteristics of individual writers rather than those of the target group, although the number of samples per student is relatively small in relation to the corpus size. Levels 4-5, corresponding to B1+, were selected. This may later be narrowed to level 5 to better reflect the composition of the other corpora.

In the case of PELIC, L1 (not nationality) is the variable label. Provided that the documentation of ICLE and EFCAMDAT is correct, it is reasonable to fuse nationality and L1 into a variable called 'Target' without significantly polluting the variable.


In [12]:
pelic_ds = pd.read_csv(PELIC)

pelic_nationality_map = {'Arabic':'AR', 
                         'Chinese':'CN', 
                         'Japanese':'JP', 
                         'Spanish':'SP',
                         'Russian':'RU',
                         'German':'GE'
                        }

In [13]:
# get one sample per learner (only texts with text_len > 170)
reduced = pelic_ds.query('text_len > 170').groupby("anon_id").sample(n=1, random_state=1)

# filter by level
reduced = reduced.filter(items=['level_id', 'L1', 'text'])
reduced = reduced.query("level_id >= 4")

# keep text and target, restricted to the L1s of interest
reduced = reduced.filter(items=['L1', 'text'])
reduced_pelic = reduced[reduced['L1'].isin(pelic_nationality_map.keys())].copy()

# add corpus label, rename columns, and map L1 names to target codes
reduced_pelic['Corpus'] = 'PELIC'
reduced_pelic = reduced_pelic.rename(columns={'L1':'Target', 'text':'Text'})
reduced_pelic['Target'] = reduced_pelic['Target'].map(pelic_nationality_map)

#append to main data
data = pd.concat([data, reduced_pelic])

In [14]:
data['Corpus'].value_counts()

EFCAM    10224
ICLE      1839
PELIC      553
Name: Corpus, dtype: int64

In [15]:
# run this to save the data generated above
data.to_csv('compiled_data_set.csv')

<h2> ETS Non-Native (TOEFL11) </h2>

Compiled in association with the University of Pennsylvania with the task of native language identification (NLI) in mind, the 12,100 TOEFL essay responses in TOEFL11 address many shortcomings of ICLE: topical imbalances, character encodings, and other cues that make ICLE less suitable are controlled for. 



In [16]:
# drop the 'Length' column only if a previous run saved it
data = pd.read_csv('compiled_data_set.csv', index_col=0).drop('Length', axis=1, errors='ignore')

In [None]:
index = pd.read_csv(os.path.join(data_dir, 'ETS_Corpus_of_Non-Native_Written_English/data/text/index.csv'))
ets_dir = os.path.join(data_dir, 'ETS_Corpus_of_Non-Native_Written_English/data/text/responses/original')
index['Language'].unique()

In [None]:
language_ids = {'DEU':'GE',
               'SPA':'SP',
               'ARA':'AR',
               'JPN':'JP',
               'ZHO':'CN'} # no Russian )-:
ids = language_ids.keys()

index.rename(columns = {'Score Level':'Score_Level'}, inplace = True)
index.query('Language in @ids', inplace = True)
index.query("Score_Level in ('medium', 'high')", inplace = True)

In [None]:
# read each response and collect rows, then append them to the main data in one step
rows = []
for _, j in index.iterrows():
    with open(os.path.join(ets_dir, j['Filename']), 'r') as file:
        text = file.read()
    rows.append({'Corpus': 'TOEFL11',
                 'Target': language_ids[j['Language']],
                 'Text': text})
data = pd.concat([data, pd.DataFrame(rows, columns=['Corpus', 'Target', 'Text'])])

In [None]:
data.query('Corpus == "TOEFL11"')

In [None]:
data.to_csv('compiled_data_set.csv')

<h1> Visualizing and Examining the Corpora </h1>

So far, the dataset contains three corpora with the sample counts noted above. More detail about the nature and distribution of the samples is needed, along with insight into how these may influence results and inform design. The code and visualizations below show:
1. the number of samples in each corpus corresponding to each target group
2. the distribution of sample lengths in tokens for each target in each subcorpus

Note that the zoom feature can be used to isolate specific distributions in the visualizations for more clarity.

Design-related questions are addressed both throughout and at the end of the section.
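
The per-corpus count for each target group can also be read off as a table before plotting. The following is a minimal sketch, assuming a DataFrame with the same `Corpus` and `Target` columns as `data`; the `toy` rows are invented purely for illustration and are not drawn from the corpora.

```python
import pandas as pd

# Toy stand-in for `data` (hypothetical rows, not real corpus samples)
toy = pd.DataFrame({
    'Corpus': ['ICLE', 'ICLE', 'EFCAM', 'EFCAM', 'PELIC'],
    'Target': ['GE', 'CN', 'GE', 'SP', 'CN'],
})

# Rows are target groups, columns are corpora, cells are sample counts
counts = pd.crosstab(toy['Target'], toy['Corpus'])
print(counts)
```

Running the same `pd.crosstab` call on `data` would give the numbers behind the stacked bar chart, which can be easier to cite than values read from the figure.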

In [None]:
# read the data from file if not generated above
data = pd.read_csv('masked_data_set.csv', index_col=0)

In [None]:
fig = px.bar(data, 
             x='Target', 
             color='Corpus', 
             opacity=0.8, 
             title = 'Number of Texts by Nationality Group'
            )

fig.update_traces(dict(marker_line_width=0)) #run this line if the visualization looks cloudy
fig.show(renderer='notebook')

In [None]:
fig.write_image(os.path.join(data_dir, "TARGET_COUNTS.png"))

In [None]:
# Calculate and Append text lengths using BERT tokenizer

from transformers import BertTokenizer

spec_tokens = ['<?>', '<*>', '<R>'] #one of the corpora uses these
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', 
                                          additional_special_tokens = spec_tokens)
data['Length'] = None

In [None]:
# find the length of each sample in tokens and append to the main dataframe
# (counts include the [CLS] and [SEP] tokens added by the tokenizer)
data['Length'] = data['Text'].apply(lambda x: len(tokenizer(x)['input_ids']))

In [None]:

fig = px.strip(data, 
                y="Length", 
                x="Target", 
                color="Corpus", 
                hover_data=None,
                title='Distribution of Text Lengths',
               range_y = [0,2500]
              )

fig.show()

In [None]:
fig.write_image(os.path.join(data_dir, "TEXT_LENGTHS.png"))

In [None]:
px.histogram(data,
             x='Length', 
             color = 'Corpus', 
             range_x = [0,1500],
             opacity=1.0,
             title= 'Distribution of Text Lengths Overall'
             )

In [None]:
px.histogram(data,
             x='Length', 
             color = 'Target', 
             range_x = [0,1200],
             opacity=1.0,
             title= 'Distribution of Text Lengths by Target'
             )

Notice the many tiny samples with length <= 50 in EFCAMDAT and PELIC. Most are uninformative entries indicating that the task was beyond the student's abilities or that the student ran out of time. Samples are therefore filtered at a threshold of 120 tokens, which makes the training samples more informative and training more efficient.

This threshold was chosen to minimize the number of excluded samples while also making sure the samples are substantial and worth training on. More implications of sample length regarding BERT models will be mentioned later and discussed more fully in the next stage of the project. 
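
The trade-off behind the threshold can be checked by computing the share of samples that survives each candidate cut-off. This is only a sketch: the `lengths` list is invented for illustration, and `share_kept` is a hypothetical helper rather than part of the notebook. On the real data, `data['Length']` would take the place of the list.

```python
# Hypothetical token counts, invented for illustration only
lengths = [12, 35, 48, 90, 130, 180, 240, 410, 650, 900]

def share_kept(lengths, threshold):
    """Fraction of samples strictly longer than `threshold` tokens."""
    kept = sum(1 for n in lengths if n > threshold)
    return kept / len(lengths)

# Compare a few candidate thresholds side by side
for threshold in (50, 120, 200):
    print(threshold, share_kept(lengths, threshold))
```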

In [None]:
# keep only samples longer than 120 tokens
data = data.query('Length > 120')

In [None]:
px.histogram(data,
             x='Length', 
             color = 'Corpus', 
             cumulative = True,
             barmode = 'overlay',
             histnorm = 'percent',
             range_x = [120,1500],
             opacity=0.4,
             title = 'Cumulative Distribution of Text Lengths'
             )

In [None]:
# run this to save the data generated above
data.to_csv('compiled_data_set.csv')

<h2> Findings, Impacts, and Decisions </h2>

<h3> Target Representation </h3>

There are some data imbalance issues, namely that Turkish is underrepresented. One option would be to find data from a separate Turkish learner corpus for inclusion. As can be seen above, however, corpora can vary greatly in composition, quality, and length of samples. Introducing a corpus that represents only one target group might have a confounding effect.

Another option is regularizing the model such that more prevalent target groups are not predicted arbitrarily: this approach 'punishes' the model for predicting German or Chinese or Arabic simply because they appear more frequently. 
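
One common way to implement this kind of penalty is to weight the loss for each class inversely to its frequency. The sketch below is an assumption about how that could be done, not the method adopted in this project: the label list is invented, and `balanced_class_weights` is a hypothetical helper using the heuristic n_samples / (n_classes * class_count).

```python
from collections import Counter

# Hypothetical label list, invented for illustration
labels = ['GE'] * 6 + ['CN'] * 3 + ['TR'] * 1

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}

# Rare classes receive larger weights than frequent ones
weights = balanced_class_weights(labels)
print(weights)
```

Weights like these could then be passed to a weighted loss function, so that an error on a rare class costs more than an error on a frequent one.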

A third option would be to drop Turkish from the data entirely. This would have the benefit of simplifying the classification problem, which is already quite complex, although it underscores a criticism of big data approaches to low-resource languages: although these are the languages in need of more research, they tend to be left out of data-heavy studies. Turkish is not resource-scarce, but far less Turkish data is at our disposal than for the other targets.

<h3> Sample Lengths </h3>

A principal design decision in BERT models is setting the maximum sample length in tokens. Although this can hypothetically be set as high or low as desired, it comes at a performance cost. The standard medium-sized pretrained BERT model has a max length of 512 tokens. If a training sample is shorter than the max length, padding tokens fill the remaining positions and an attention mask tells the model to ignore them. If it is longer than the max length, it is truncated, and the end of the sample is lost.
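
The handling of short and long samples described above can be sketched in plain Python. This is a simplified illustration, not the tokenizer's actual implementation: `pad_or_truncate` is a hypothetical helper, the token ids are made up, and a real BERT tokenizer would also keep the closing special token when truncating.

```python
def pad_or_truncate(input_ids, max_len, pad_id=0):
    """Return (ids, attention_mask), both of length max_len."""
    ids = input_ids[:max_len]        # tokens past max_len are lost
    mask = [1] * len(ids)            # 1 marks a real token
    n_pad = max_len - len(ids)       # 0 when the sample was truncated
    return ids + [pad_id] * n_pad, mask + [0] * n_pad

# A short sample is padded; a long sample loses its tail
short_ids, short_mask = pad_or_truncate([101, 7592, 102], max_len=5)
long_ids, long_mask = pad_or_truncate([101, 1, 2, 3, 4, 5, 102], max_len=5)
print(short_ids, short_mask)
print(long_ids, long_mask)
```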

Doubling the max length roughly quadruples the cost of self-attention, since attention weights have to be calculated for every pair of tokens. My machine can handle max_len = 1024, although a single training epoch takes about two hours. A max length of 256 trains faster, but clips quite a bit off of longer samples, leading to massive data loss. This decision will be explored in more detail at the next stage of the project.
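
The scaling argument can be made concrete by counting token pairs. This is a rough sketch that considers only the pairwise attention scores and ignores every other cost in the model; `attention_pairs` is a hypothetical helper introduced for illustration.

```python
def attention_pairs(max_len):
    """Number of token pairs scored by one full self-attention pass."""
    return max_len * max_len

# Compare candidate max lengths against the 512-token default
for max_len in (256, 512, 1024):
    ratio = attention_pairs(max_len) / attention_pairs(512)
    print(max_len, attention_pairs(max_len), ratio)
```

Relative to 512 tokens, a max length of 1024 scores four times as many pairs, while 256 scores a quarter as many.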


<h1> References </h1>

Blanchard, Daniel, et al. ETS Corpus of Non-Native Written English LDC2014T06. Web Download. Philadelphia: Linguistic Data Consortium, 2014.

The University of Pittsburgh English Language Institute Corpus (PELIC). (2022). PELIC. https://eli-data-mining-group.github.io/Pitt-ELI-Corpus/

Huang, Y., Murakami, A., Alexopoulou, T., & Korhonen, A. (2018). Dependency parsing of learner English. International Journal of Corpus Linguistics, 23(1), 28-54.

Geertzen, J., Alexopoulou, T., & Korhonen, A. (2013). Automatic linguistic annotation of large scale L2 databases: The EF-Cambridge Open Language Database (EFCAMDAT). Selected Proceedings of the 31st Second Language Research Forum (SLRF), Cascadilla Press, MA.

