# Data Wrangling Exercise
Purpose: Classify learners by CEFR

Phase 1: Wrangle some datums

Notes from meeting with Scott:
1. Consider text length
2. Consider how representative each text is (e.g. of a given CEFR band). I am not sure if he was alluding to outliers or something else here.
3. Methods/Technologies to consider:

    a. Semantic spaces

    b. LSA (this was a strong suggestion)

    c. Word2Vec

## Information on EFCAMDAT

> EFCAMDAT consists of essays submitted to Englishtown, the online school of EF Education First, by language learners all over the world (Education First, 2012).  A full course in Englishtown spans 16 proficiency levels aligned with common standards such as TOEFL, IELTS and the Common European Framework of Reference for languages.

__[Overview of EFCAMDAT Data (2013)](https://corpus.mml.cam.ac.uk/faq/SLRF2013Geertzenetal.pdf)__

__[Study with recommendations for Dependency Parsing on this data set (2018)](https://corpus.mml.cam.ac.uk/faq/IJCL2018Huangetal.pdf)__



In [1]:
from lxml import etree
import re
import os.path
import unicodedata
from IPython.display import display # Show me what's going on.
import pandas as pd

print('Working Directory set to:', os.getcwd())

test_file = os.path.join(os.pardir, 'Original Files', 'Level 16 EF_camdat.txt')
with open(test_file, "r") as file:
    data = file.read()

Working Directory set to: /home/jovyan/work/efcamdat/efcamdat-data-cleaning


### One option is to manually alter each illegal character into well-formatted XML. The following script could help with that, but a full-featured text editor like Notepad++ might be a better fit for the job.
__[Predefined characters in XML](https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references#Predefined_entities_in_XML)__ are mainly &, ", ', <, and >.

In [2]:
# illegal_character = '&'
# p = re.compile(r'<text>\n+(.+)(' + illegal_character + ')(.+)\s+<\/text>')
# matches = p.findall(data)

# for match in matches[:2]:
#     print(match[1])

### Another option is to wrap every \<TEXT\> block in \[\[CDATA\]\] tags, which might magically make the XML properly formatted
__[CDATA Sections in XML](https://www.tutorialspoint.com/xml/xml_cdata_sections.htm)__

In [9]:
def wrap_cdata(text_masquerading_as_xml):
    return re.sub(r'\s*</text>',r']]></text>', re.sub(r'<text>\s*',r'<text><![CDATA[', text_masquerading_as_xml))

# cdata_blocked_off = wrap_cdata(data)
# dump_file = os.path.join('Original Files', 'Level 2 fixing maybe.xml')
# with open(dump_file, "w") as f:
#     f.write(fixed_maybe)

### This seems to work for some of the levels, but not all of them. Let's try to troubleshoot some more.
Okay. It seems all(?) of the remaining issues are control characters. Python package unicodedata can handle this (thank you
StackOverflow). One way to double check that this is working as expected would be to look at a git difference between
cdata_blocked_off and controls_removed. This way we can ensure that only control characters are being removed.
At first I thought Scott might have intentionally made this data difficult to work with, but this is way
too crazy for anyone to have done it on purpose. One of the responses is literally just a string of control characters??

The remove_control_characters function could be narrowed in scope to just the text within CDATA tags, but I am pretty sure there are no problem bytes anywhere else... therefore it would have the same effect.

In [10]:
def remove_control_characters(s):
    return "".join(ch for ch in s if unicodedata.category(ch)[0]!="C")

# controls_removed = remove_control_characters(cdata_blocked_off)

In [11]:
def get_writings(cleanish_xml):
    return etree.fromstring(bytes(controls_removed, encoding='utf8'))[1]

# parsed_xml_writings = get_writings(controls_removed)
# print("The number of \'samples\' in this level {:,}:".format(len(parsed_xml_writings)))
#some of these samples are obviously useless...

The number of 'samples' in this level 88,138:


### This is looking good! Now we need to build a dataframe and extract the texts.
So we have writing IDs and Learner IDs, so do not make Learner ID the index (indexes should not have duplicate entries).


In [8]:
# Let's just hard code this...

def xml_framer(xml_root, cols):
    lol = [] # list of lists
    for sample in xml_root:
        l = []
        l.append(sample.attrib['id'])
        l.append(sample.attrib['level'])
        l.append(sample.attrib['unit'])
        l.append(sample[0].attrib['id'])
        l.append(sample[0].attrib['nationality'])
        l.append(sample[1].text)
        l.append(sample[1].attrib['id'])
        l.append(sample[2].text)
        l.append(sample[3].text)
        l.append(sample[4].text)
        lol.append(l)
    df = pd.DataFrame(lol, columns=cols)
    return df

col_labels = ['id','lvl','unit','author_id','author_nationality','topic','topic_id','date','grade','text']

# test_df = xml_framer(parsed_xml_writings, col_labels)
# display(test_df)

Unnamed: 0,id,lvl,unit,author_id,author_nationality,topic,topic_id,date,grade,text
0,C18336,4,2,19148961,cn,Describing routines,4010,2011-10-06 21:30:17.860,100,Sam Mop the floor at 8am every morning. Wash t...
1,C18337,4,3,19148961,cn,Writing a party invitation,3975,2011-10-08 09:53:36.983,87,"Dear Jim, I'm writing to you to invite you to ..."
2,C18338,4,4,19148961,cn,Writing about what you like doing,4056,2011-10-09 11:03:50.967,0,"Dear Sam, How are you? It's a long time since..."
3,C18339,4,1,19148961,cn,Writing about what you do,5322,2011-10-10 21:07:52.060,96,"Dear Teacher, My name is Joe. I worked in Nan..."
4,C18340,4,5,19148961,cn,Writing a description of your family,4241,2011-10-18 01:57:27.000,97,There are three people in my family. My father...
...,...,...,...,...,...,...,...,...,...,...
88133,U725013,4,7,24476712,de,Complaining about chores,5532,2012-09-27 07:09:22.733,96,"Hi Julia, on Monday I did ironing and made the..."
88134,U725034,4,1,22647544,ru,Writing about what you do,5322,2012-09-27 10:47:06.827,97,My name is Pavel and I'm an engineer. I like m...
88135,U725050,4,2,24885871,sa,Describing routines,4010,2012-09-27 11:00:01.530,92,I always get up at 6 o'clock. I eat my breakfa...
88136,U725070,4,2,24692474,sa,Describing routines,4010,2012-09-27 15:59:47.280,98,"Hi, I get up at 7 am every day.I eat breakfast..."


### Putting it all together
If you have the RAM, might as well...

I mean, this is a bit silly, but I am more comfortable with pandas dataframes than anything else.

In [19]:
df = pd.DataFrame(columns=col_labels)
for i in range(1,17):
    file = os.path.join(os.pardir, 'Original Files', 'Level', str(i), 'EF_camdat.txt')
    with open(test_file, "r") as file:
        data = file.read()
    df = df.append(xml_framer(get_writings(remove_control_characters(wrap_cdata(data))), col_labels))
display(df)

Unnamed: 0,id,lvl,unit,author_id,author_nationality,topic,topic_id,date,grade,text
0,C18336,4,2,19148961,cn,Describing routines,4010,2011-10-06 21:30:17.860,100,Sam Mop the floor at 8am every morning. Wash t...
1,C18337,4,3,19148961,cn,Writing a party invitation,3975,2011-10-08 09:53:36.983,87,"Dear Jim, I'm writing to you to invite you to ..."
2,C18338,4,4,19148961,cn,Writing about what you like doing,4056,2011-10-09 11:03:50.967,0,"Dear Sam, How are you? It's a long time since..."
3,C18339,4,1,19148961,cn,Writing about what you do,5322,2011-10-10 21:07:52.060,96,"Dear Teacher, My name is Joe. I worked in Nan..."
4,C18340,4,5,19148961,cn,Writing a description of your family,4241,2011-10-18 01:57:27.000,97,There are three people in my family. My father...
...,...,...,...,...,...,...,...,...,...,...
88133,U725013,4,7,24476712,de,Complaining about chores,5532,2012-09-27 07:09:22.733,96,"Hi Julia, on Monday I did ironing and made the..."
88134,U725034,4,1,22647544,ru,Writing about what you do,5322,2012-09-27 10:47:06.827,97,My name is Pavel and I'm an engineer. I like m...
88135,U725050,4,2,24885871,sa,Describing routines,4010,2012-09-27 11:00:01.530,92,I always get up at 6 o'clock. I eat my breakfa...
88136,U725070,4,2,24692474,sa,Describing routines,4010,2012-09-27 15:59:47.280,98,"Hi, I get up at 7 am every day.I eat breakfast..."


In [21]:
df.to_csv('all_levels.csv')

### We may want to consider running a spellchecker on these responses.

Maybe we...

1. Eliminate useless responses. I think we can assume that all nearly-identical responses have language taken from the prompt. Even if these responses are not just echos of the prompt, I do not think they can tell us very much about the writer, since there is very little variation between them. Like most things, I'm not sure about this.

    a. I think it could be fun to calculate levenshtein distance on the responses, and eliminate responses that are too similar that way.
    
    b. I think the more robust method would be to calculate the TF-IDF and cosine similarity, but that seems a little complex to just find responses with low variance.
    
    c. We could use a cutoff with the "Grade attribute". Irrelevant or incomplete responses are graded lower (maybe anything below 50? below 70?).
    
2. Once we have a relatively useful subset of samples, we can bifurcate

    a. Create .spacy on the samples
    
    b. Run spellchecker then create .spacy on a copy of the samples. This might not buy us anything, but it just might make a difference.
    
    **Actually, I think the best way to do this would be through spaCy.** I don't know if spaCy has an in-built spell checker, but we could add a spell checker to a custom spaCy pipeline. This would have the important advantage of preserving the original, misspelled response as well as a reasonable prediction of the intended orthography. I am not sure how we could incorporate both the uncorrected and corrected texts into a single model, but I sort of like the workflow here anyway.

### Errors I've noticed
1. Spaces before commas. I think this is **always** an error.
2. No space after commas. I think this is an error unless the character after the comma is a quotation mark.
3. Spaces before apostrophes/inverted commas. I think this is **always** an error.
4. Misspellings. I suspect we can improve the data automatically by spell checking, but we of course lose information about writers' spelling knowledge.
5. No space after periods (at end of sentence). This one is tricky. Periods should have a space when they are separating a sentence, but not when they are separating numbers (69.00 dollars) or acronyms (U.S.A.). 