# Data Wrangling Exercise
Purpose: Classify learners by CEFR

Phase 1: Wrangle the data

Notes from meeting with Scott:
1. Consider text length
2. Consider how representative each text is (e.g. of a given CEFR band). I am not sure if he was alluding to outliers or something else here.
3. Methods/Technologies to consider:

    a. Semantic spaces

    b. LSA (this was a strong suggestion)

    c. Word2Vec

## Information on EFCAMDAT

> EFCAMDAT consists of essays submitted to Englishtown, the online school of EF Education First, by language learners all over the world (Education First, 2012).  A full course in Englishtown spans 16 proficiency levels aligned with common standards such as TOEFL, IELTS and the Common European Framework of Reference for languages.

__[Overview of EFCAMDAT Data (2013)](https://corpus.mml.cam.ac.uk/faq/SLRF2013Geertzenetal.pdf)__

__[Study with recommendations for Dependency Parsing on this data set (2018)](https://corpus.mml.cam.ac.uk/faq/IJCL2018Huangetal.pdf)__



In [285]:
from lxml import etree
import re
import os
import unicodedata

print('Working Directory set to:', os.getcwd())

# Read one level's raw export into memory as a single string
test_file = os.path.join('Original Files', 'Level 4 EF_camdat.txt')
with open(test_file, "r") as file:
    data = file.read()

Working Directory set to: /home/jovyan/work/efcamdat


### One option is to manually replace each illegal character with its XML entity. The following script could help with that, but a full-featured text editor like Notepad++ might be a better fit for the job.
XML has five __[predefined entities](https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references#Predefined_entities_in_XML)__, covering &, ", ', <, and >. Of these, only & and < are never allowed unescaped in element text.

In [305]:
# Find illegal characters in <text> groups
# Regex makes three matching groups
# Middle group is illegal character
# Manual verification??

# illegal_character = '&'
# p = re.compile(r'<text>\n+(.+)(' + illegal_character + ')(.+)\s+<\/text>')
# matches = p.findall(data)

# for match in matches[:2]:
#     print(match[0])
#     print(match[1])
#     print(match[2])
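
As an alternative to manual editing, here is a minimal sketch of escaping the text blocks automatically with the standard library's `xml.sax.saxutils.escape`. The `sample` string and the `escape_text_blocks` helper are hypothetical and not part of the notebook. The sketch assumes that `<text>` elements are never nested and that the responses contain no entities that are already escaped (those would be escaped a second time).

```python
import re
from xml.sax.saxutils import escape

# Hypothetical miniature of the corpus file; the real data is read from disk.
sample = (
    "<writing><text>\n"
    "Tom & Jerry is < 10 minutes long\n"
    "</text></writing>"
)

# Match the body of each <text> element (non-greedy, across newlines).
TEXT_BLOCK = re.compile(r"(<text>)(.*?)(</text>)", re.DOTALL)

def escape_text_blocks(xml_string):
    """Escape &, < and > inside every <text> element, leaving the tags untouched."""
    return TEXT_BLOCK.sub(
        lambda m: m.group(1) + escape(m.group(2)) + m.group(3),
        xml_string,
    )

escaped_sample = escape_text_blocks(sample)
print(escaped_sample)
```

By default `escape` only handles `&`, `<` and `>`, which is enough for element text; quotes only need escaping inside attribute values.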

### Another option is to wrap the contents of every \<text\> element in a CDATA section (\<!\[CDATA\[ ... \]\]\>), so the parser treats them as literal text and the XML becomes well-formed
__[CDATA Sections in XML](https://www.tutorialspoint.com/xml/xml_cdata_sections.htm)__

In [286]:
cdata_blocked_off = re.sub(r'\s*</text>',r']]></text>', re.sub(r'<text>\s*',r'<text><![CDATA[', data))

# Optionally dump the wrapped XML to disk for inspection
# dump_file = os.path.join('Original Files', 'Level 2 fixing maybe.xml')
# with open(dump_file, "w") as f:
#     f.write(cdata_blocked_off)
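
To see why the CDATA wrapping helps, here is a small self-contained sketch. It uses the standard library's `xml.etree.ElementTree` instead of lxml so that it runs without extra packages. The `raw` string, its tag names, and the `wrap_text_in_cdata` helper are hypothetical stand-ins for the real file and the one-liner above.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the corpus file: the response contains a bare & and <.
raw = (
    "<writings><writing><text>\n"
    "  Fish & chips cost < 5 pounds  \n"
    "</text></writing></writings>"
)

def wrap_text_in_cdata(xml_string):
    """Wrap the body of every <text> element in a CDATA section."""
    opened = re.sub(r"<text>\s*", "<text><![CDATA[", xml_string)
    return re.sub(r"\s*</text>", "]]></text>", opened)

# The raw string is not well-formed, so parsing it fails.
try:
    ET.fromstring(raw)
    parsed_raw = True
except ET.ParseError:
    parsed_raw = False

# After wrapping, the same response parses and the text comes back verbatim.
wrapped = wrap_text_in_cdata(raw)
text = ET.fromstring(wrapped).find("writing/text").text

print(parsed_raw)
print(text)
```

One caveat: a CDATA section ends at the first `]]>`, so a response that happens to contain that sequence would still break the file.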

### This seems to work for some of the levels, but not all of them. Let's try to troubleshoot some more.
It seems all(?) of the remaining issues are control characters. Python's standard-library `unicodedata` module can handle this (thank you, StackOverflow). One way to double-check that this is working as expected would be to diff `cdata_blocked_off` against `controls_removed` (for example with `git diff`), to ensure that only control characters are being removed. Note that the filter below drops every character whose Unicode category starts with "C", which includes newlines and tabs as well as format characters, not only the unprintable bytes.

At first I thought Scott might have intentionally made this data difficult to work with, but this is way too crazy for anyone to have done it on purpose. One of the responses is literally just a string of control characters??

The `remove_control_characters` function could be narrowed in scope to just the text within the CDATA sections, but I am pretty sure there are no problem bytes anywhere else, so it would have the same effect.

In [287]:
def remove_control_characters(s):
    return "".join(ch for ch in s if unicodedata.category(ch)[0]!="C")

controls_removed = remove_control_characters(cdata_blocked_off)
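
The check described above (confirming that only control characters were removed) can also be done in Python instead of with a diff. This is a rough sketch: `removed_characters`, `remove_controls_keep_whitespace` and the `before` sample are hypothetical, and `remove_control_characters` is copied from the cell above so the snippet runs on its own.

```python
import unicodedata
from collections import Counter

def remove_control_characters(s):
    # Same filter as the notebook cell: drop every character in a "C*" category.
    return "".join(ch for ch in s if unicodedata.category(ch)[0] != "C")

def removed_characters(before, after):
    """Count the characters that are present in `before` but missing from `after`."""
    return Counter(before) - Counter(after)

def remove_controls_keep_whitespace(s, keep="\n\t"):
    """Hypothetical variant that leaves newlines and tabs in place."""
    return "".join(
        ch for ch in s if ch in keep or unicodedata.category(ch)[0] != "C"
    )

# Hypothetical sample: a tab, a newline, a NUL byte and a zero-width space.
before = "Mary\tgoes jogging.\nShe\x00 runs\u200b fast."
after = remove_control_characters(before)

removed = removed_characters(before, after)
for ch, count in sorted(removed.items()):
    print(repr(ch), unicodedata.category(ch), count)

# True when nothing outside the "C*" categories was dropped.
only_category_c = all(unicodedata.category(ch).startswith("C") for ch in removed)
print(only_category_c)
```

On the real data, `removed_characters(cdata_blocked_off, controls_removed)` would give the same kind of tally. The sample also shows the side effect on whitespace: the tab and newline are removed along with the NUL byte, which joins neighbouring words.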

In [307]:
root = etree.fromstring(bytes(controls_removed, encoding='utf8'))
print("The number of 'samples' in this level: {:,}".format(len(root[1])))
# Some of these samples are obviously useless...

In [309]:
# This is not DRY...
lvl_xml = root[1]
isample = 99
print('Writing ID:', lvl_xml[isample].attrib['id'])
print('Writing Level:', lvl_xml[isample].attrib['level'])
print('Writing Unit:', lvl_xml[isample].attrib['unit'])
print('Learner ID:', lvl_xml[isample][0].attrib['id'])
print('Learner Nationality:', lvl_xml[isample][0].attrib['nationality'])
print('Topic:', lvl_xml[isample][1].text)
print('Date:', lvl_xml[isample][2].text)
print('Grade:', lvl_xml[isample][3].text)
print('Text:', lvl_xml[isample][4].text)

Writing ID: C171892
Writing Level: 4
Writing Unit: 2
Learner ID: 22589833
Learner Nationality: cn
Topic: Describing routines
Date: 2012-05-27 10:35:40.370
Grade: 94
Text: Mary goes jogging in the evening every day. watches TV at 7am. does the housework in the evening. makes dinner at 7pm. washes  the dishes every day. does the ironing once a week. mops the floor in the morning. You  feed the dog at 7am in the morning. feed the dog  at 12  at noon. feed the dog at 6pm in the  afternoon. walk the dog in the afternoon. wash the dog once a week.


### This is looking good! Now we need to build a dataframe and extract the texts.
We have both writing IDs and learner IDs. A learner can appear on more than one writing, so do not make the learner ID the index (indexes should not have duplicate entries). The writing ID is the more natural index.
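
A possible first step toward the dataframe is to flatten each sample into a plain dictionary, which also removes the repetition in the print cell above. This sketch uses the standard library's `xml.etree.ElementTree` rather than lxml. `SAMPLE_XML` is a hypothetical two-sample stand-in, its tag names other than `<text>` are assumptions, and `writing_to_record` is a made-up helper. The child order (learner, topic, date, grade, text) follows the positional access used above.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical stand-in for root[1]; tag names other than <text> are assumptions.
SAMPLE_XML = """<writings>
  <writing id="W1" level="4" unit="2">
    <learner id="L1" nationality="cn"/>
    <topic>Describing routines</topic>
    <date>2012-05-27 10:35:40.370</date>
    <grade>94</grade>
    <text>Mary goes jogging in the evening every day.</text>
  </writing>
  <writing id="W2" level="4" unit="3">
    <learner id="L1" nationality="cn"/>
    <topic>Another topic</topic>
    <date>2012-06-01 09:00:00.000</date>
    <grade>88</grade>
    <text>A second response by the same learner.</text>
  </writing>
</writings>"""

def writing_to_record(writing):
    """Flatten one sample into a dict, relying on the child order seen above."""
    learner, topic, date, grade, text = list(writing)[:5]
    return {
        "writing_id": writing.attrib["id"],
        "level": writing.attrib["level"],
        "unit": writing.attrib["unit"],
        "learner_id": learner.attrib["id"],
        "nationality": learner.attrib["nationality"],
        "topic": topic.text,
        "date": date.text,
        "grade": grade.text,
        "text": text.text,
    }

lvl_xml = ET.fromstring(SAMPLE_XML)
records = [writing_to_record(writing) for writing in lvl_xml]

# Learner IDs that occur more than once, which is why they make a poor index.
learner_counts = Counter(record["learner_id"] for record in records)
repeated_learners = [lid for lid, n in learner_counts.items() if n > 1]
print(repeated_learners)
```

If pandas is available, `pandas.DataFrame.from_records(records, index="writing_id")` should turn these records into a dataframe indexed by writing ID.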



### We may want to consider running a spellchecker on these responses.

Maybe we...

1. Eliminate useless responses. I think we can assume that all nearly identical responses have language taken from the prompt. Even if these responses are not just echoes of the prompt, I do not think they can tell us very much about the writer, since there is very little variation between them. Like most things, I'm not sure about this.

    a. I think it could be fun to calculate the Levenshtein distance between responses, and eliminate responses that are too similar that way.
    
    b. I think the more robust method would be to compute TF-IDF vectors and compare them with cosine similarity, but that seems a little complex just to find responses with low variance.
    
2. Once we have a relatively useful subset of samples, we can split the workflow into two branches:

    a. Create `.spacy` files from the samples.
    
    b. Run a spellchecker, then create `.spacy` files from a copy of the samples. This might not buy us anything, but it just might make a difference.
    
    **Actually, I think the best way to do this would be through spaCy.** I don't know if spaCy has a built-in spell checker, but we could add a spell checker to a custom spaCy pipeline. This would have the important advantage of preserving the original, misspelled response as well as a reasonable prediction of the intended orthography. I am not sure how we could incorporate both the uncorrected and corrected texts into a single model, but I sort of like the workflow here anyway.
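
The Levenshtein idea from step 1a could look roughly like the sketch below. It is pure Python with no dependencies. The function names, the `responses` list and the 0.9 threshold are all hypothetical choices, not anything decided in the notes above.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]

def similarity(a, b):
    """Normalise the distance to a 0..1 score, where 1.0 means identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest

def near_duplicate_indices(texts, threshold=0.9):
    """Indices of every text that has at least one near-identical partner."""
    flagged = set()
    for i in range(len(texts)):
        for j in range(i + 1, len(texts)):
            if similarity(texts[i], texts[j]) >= threshold:
                flagged.update((i, j))
    return flagged

# Hypothetical responses: the first two differ by a single character.
responses = [
    "I get up at 7am every day.",
    "I get up at 8am every day.",
    "My brother plays football on Sundays.",
]
flagged = near_duplicate_indices(responses)
kept = [text for i, text in enumerate(responses) if i not in flagged]
print(kept)
```

The pairwise comparison is quadratic in the number of responses, so on a full level it would probably make sense to compare only responses that share a topic.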

### Errors I've noticed
1. Spaces before commas. I think this is **always** an error.
2. No space after commas. I think this is an error unless the character after the comma is a quotation mark.
3. Spaces before apostrophes/inverted commas. I think this is **always** an error.
4. Misspellings. I suspect we can improve the data automatically by spell checking, but we of course lose information about writers' spelling knowledge.
5. No space after periods at the end of a sentence. This one is tricky. Periods should be followed by a space when they separate sentences, but not when they separate numbers (69.00 dollars) or the letters of an acronym (U.S.A.).
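
Several of the spacing errors above can be counted with regular expressions before deciding whether to fix them. This is a sketch only: the pattern names, the `text` sample and the `count_spacing_errors` helper are hypothetical. It also makes three assumptions beyond the list above. Commas followed by a digit are skipped to protect numbers like 1,000. Only common contraction and possessive endings count as a space-before-apostrophe error. A missing space after a period is flagged only when a lowercase letter precedes it and a capitalised word follows.

```python
import re

# Hypothetical patterns; each one only flags a suspected error, it does not fix it.
PATTERNS = {
    # "apples ,pears": whitespace directly before a comma
    "space_before_comma": re.compile(r"\s+,"),
    # "apples,pears": comma not followed by whitespace, a quote, or a digit
    "no_space_after_comma": re.compile(r""",(?=[^\s"'\d])"""),
    # "don 't", "Mary 's": whitespace before a contraction or possessive ending
    "space_before_apostrophe": re.compile(r"\s+'(?:s|t|m|d|re|ve|ll)\b"),
    # "tea.She": lowercase letter, period, then a capitalised word
    "no_space_after_period": re.compile(r"(?<=[a-z])\.(?=[A-Z][a-z])"),
}

def count_spacing_errors(text):
    """Return how many times each suspected error pattern occurs in the text."""
    return {name: len(pattern.findall(text)) for name, pattern in PATTERNS.items()}

# Hypothetical response containing one instance of each error.
text = (
    "I like apples ,pears and tea.She don 't pay "
    "69.00 dollars in the U.S.A., he said."
)
counts = count_spacing_errors(text)
print(counts)
```

The period rule is deliberately conservative: it leaves 69.00 and U.S.A. alone, but it also misses sentences that start with a lowercase letter, which do occur in this data.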