This notebook builds the training sets for specific models. For the general model, all training data is grouped into three broad classes (lexical, discourse, grammatical). For the smaller fine-tuned models, only the examples with the relevant tags are picked out of the full dataset.

In [1]:
from collections import defaultdict, Counter
import pickle
from random import shuffle
import matplotlib.pyplot as plt

In [2]:
with open('small_train_eo.pickle', 'rb') as f:  # close the file handle after loading
    train = pickle.load(f)

In [3]:
train[:5]

[("it's → it is", 'Inappropriate_register'),
 ('stades → stadiums', 'Formational_affixes'),
 ('On the first → Firstly,', 'Linking_device'),
 ('they → students', 'lex_item_choice'),
 ('are → is', 'Agreement_errors')]

In [4]:
tags = {tag for _, tag in train}
len(tags)

19

In [5]:
lexical_tags = ['lex_item_choice', 'lex_part_choice']
discourse_tags = ['Absence_comp_sent', 'Absence_explanation', 'Inappropriate_register',
                  'Linking_device', 'Redundant_comp', 'Ref_device']
gram_tags = ["Verb_pattern", "Confusion_of_structures", "Voice",
             "Comparison_degree", "Formational_affixes", "Prepositions",
             "Category_confusion", "Agreement_errors", "Numerals", 
             "Tense_form", "Relative_clause"]

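The following cells reset the per-class lists and build the label mappings. The mappings are needed only when creating data for the smaller models; they should then be built from the tag list of the respective dataset (here, gram_tags).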

In [10]:
lexical_errors, discourse_errors, gram_errors = [], [], []

In [8]:
id2label = {n: tag for n, tag in enumerate(gram_tags)}
label2id = {tag: n for n, tag in id2label.items()}

The following cell is for creating data for smaller models:

In [34]:
for ex, tag in train:
    if tag in gram_tags:
        gram_errors.append((ex, label2id[tag]))

In [15]:
shuffle(gram_errors)

In [16]:
with open('split_train_gram_eo.pickle', 'wb') as f:  # close the file so the data is flushed to disk
    pickle.dump(gram_errors, f)

In [38]:
# data = pickle.load(open('split_train_gram_eo.pickle', 'rb'))
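The filtering and label-mapping steps for a smaller model can be bundled into one helper, so the same code serves any tag group. The following is a minimal sketch, not the notebook's own code: `build_subset` is a hypothetical name, and the sample pairs are the five examples printed near the top of the notebook.

```python
def build_subset(pairs, tags):
    """Keep only the pairs whose tag is in `tags` and replace the tag by its index.

    `build_subset` is a hypothetical helper, not part of the original notebook.
    """
    label2id = {tag: n for n, tag in enumerate(tags)}
    subset = [(ex, label2id[tag]) for ex, tag in pairs if tag in label2id]
    return subset, label2id


gram_tags = ["Verb_pattern", "Confusion_of_structures", "Voice",
             "Comparison_degree", "Formational_affixes", "Prepositions",
             "Category_confusion", "Agreement_errors", "Numerals",
             "Tense_form", "Relative_clause"]

# The first five training examples shown earlier in the notebook.
sample = [("it's → it is", 'Inappropriate_register'),
          ('stades → stadiums', 'Formational_affixes'),
          ('On the first → Firstly,', 'Linking_device'),
          ('they → students', 'lex_item_choice'),
          ('are → is', 'Agreement_errors')]

subset, label2id = build_subset(sample, gram_tags)
print(subset)
```

Passing `discourse_tags` or `lexical_tags` instead of `gram_tags` would produce the corresponding subset with its own label ids.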

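The following cell is for creating data for the general model. Re-run the cell that resets the three lists first; otherwise gram_errors may still hold the labeled pairs from the smaller-model cell: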

In [11]:
for ex, tag in train:
    if tag in lexical_tags:
        lexical_errors.append(ex)
    elif tag in discourse_tags:
        discourse_errors.append(ex)
    else:
        # every remaining tag is treated as grammatical, including any tag missing from gram_tags
        gram_errors.append(ex)
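Because the `else` branch sends every tag outside the lexical and discourse lists to the grammatical class, a typo or a new tag in the data would be misfiled without any warning. A small check like the one below can confirm that the tag groups cover the observed tags exactly. This is a sketch under assumptions: `check_tag_groups` is a hypothetical helper, and the groups and observed tags in the demo are toy values, not the full lists.

```python
def check_tag_groups(groups, observed_tags):
    """Compare tag groups against the tags found in the data.

    Returns three sets: tags listed in more than one place, tags present in
    the data but in no group, and tags listed in a group but absent from the data.
    `check_tag_groups` is a hypothetical helper, not part of the original notebook.
    """
    seen, overlapping = set(), set()
    for tag_list in groups.values():
        for tag in tag_list:
            if tag in seen:
                overlapping.add(tag)
            seen.add(tag)
    observed = set(observed_tags)
    unassigned = observed - seen   # would silently fall into the else branch
    unseen = seen - observed       # listed but never encountered
    return overlapping, unassigned, unseen


# Toy groups and observed tags for illustration only.
groups = {
    "lexical": ["lex_item_choice", "lex_part_choice"],
    "discourse": ["Linking_device", "Ref_device"],
    "gram": ["Voice", "Numerals"],
}
observed = {"lex_item_choice", "Linking_device", "Voice", "Tense_form"}

overlapping, unassigned, unseen = check_tag_groups(groups, observed)
print(overlapping, unassigned, unseen)
```

In the notebook itself, the call would take the three real tag lists and the `tags` set computed earlier; all three returned sets being empty means the groups partition the data's tags.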

In [13]:
len(gram_errors), len(discourse_errors), len(lexical_errors)

(20767, 10388, 14377)

In [36]:
# Class ids for the general model: 0 = lexical, 1 = discourse, 2 = grammatical
train = [(er, 0) for er in lexical_errors]
train.extend((er, 1) for er in discourse_errors)
train.extend((er, 2) for er in gram_errors)
train[0], train[-1]

(('they → students', 0), ('import → importation', 2))

In [37]:
len(train)

45532
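The total matches the sum of the three class sizes reported above, and the classes are clearly unbalanced. The share of each class can be computed as in the sketch below, which hard-codes the counts printed earlier; the names `counts` and `shares` are illustrative.

```python
from collections import Counter

# Class sizes as printed earlier in the notebook.
counts = Counter({"gram": 20767, "discourse": 10388, "lexical": 14377})

total = sum(counts.values())
shares = {name: n / total for name, n in counts.items()}

# Largest class first.
for name, n in counts.most_common():
    print(f"{name:10s} {n:6d} {shares[name]:.3f}")
```

With the labeled list available, the same counter could be built directly with `Counter(label for _, label in train)`.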

In [38]:
shuffle(train)

In [40]:
with open('split_train_eo.pickle', 'wb') as f:  # close the file so the data is flushed to disk
    pickle.dump(train, f)
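The notebook shuffles with an unseeded `shuffle`, so the order of the saved examples differs between runs. If a reproducible order is wanted, a seeded `random.Random` instance can be used instead. The sketch below is an assumption about how that could look, not the author's procedure: `shuffle_and_save`, the seed value, and the file name `demo.pickle` are all hypothetical.

```python
import os
import pickle
import tempfile
from random import Random


def shuffle_and_save(examples, path, seed=42):
    """Shuffle a copy of `examples` with a fixed seed and pickle it to `path`.

    Hypothetical helper; the seed value is arbitrary.
    """
    shuffled = list(examples)          # leave the caller's list untouched
    Random(seed).shuffle(shuffled)     # same seed gives the same order on every run
    with open(path, "wb") as f:
        pickle.dump(shuffled, f)
    return shuffled


examples = [("they → students", 0),
            ("On the first → Firstly,", 1),
            ("are → is", 2)]

# Write to a temporary directory and read the file back to verify the round trip.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.pickle")
    saved = shuffle_and_save(examples, path)
    with open(path, "rb") as f:
        loaded = pickle.load(f)

print(loaded == saved)
```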