# 01 Data wrangling
### Author: Hugo Salas; Dataset: TOEFL11

This notebook uses the essays from the TOEFL11 dataset and its additional features. It creates additional features and cleans the free text (essay) data. Specifically:
- It creates measures of essay length, unique words used, lexical diversity, mispelled words
- One-hot encodes categorical features (language, prompt, score)
- Tokenizes each essay
- Changes mispelled words by correctly spelled words
- Creates TF-IDF


## 1. Import modules and data
#### 1a) Relevant modules and globals

In [122]:
# Modules
import pandas as pd
import numpy as np
import json
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer, TfidfTransformer, TfidfVectorizer
import nltk 
from spellchecker import SpellChecker

# Globals
PATH_DATA = '../data/text/'
PATH_RESP = f'{PATH_DATA}/responses/'
RESP_JSON_OR = 'toefl11_resp_or.json'

#### 1b) Data in CSV format: info of each essay

In [123]:
#Import CSV with some features
toefl11_df = pd.read_csv(f'{PATH_DATA}index.csv')
print(toefl11_df.shape)
toefl11_df.head()

(12100, 4)


Unnamed: 0,Filename,Prompt,Language,Score Level
0,88.txt,P6,KOR,high
1,278.txt,P6,DEU,medium
2,348.txt,P1,TUR,high
3,666.txt,P2,ZHO,medium
4,733.txt,P6,TEL,medium


#### 1c) Data in txt: actual essays

In [124]:
# Import essays (free text) and put on dict
toefl11_responses_dict = {}
txt_file_lst = toefl11_df['Filename'].values

for txt_file in txt_file_lst:
    with open(f'{PATH_RESP}/tokenized/{txt_file}') as f:
        text = f.read()
        toefl11_responses_dict[txt_file] = text

# Save as JSON for future use
with open(f'{PATH_RESP}/toefl11_resp_tok.json', 'w') as fp:
    json.dump(toefl11_responses_dict, fp)

print(toefl11_responses_dict['88.txt'][:100]) #Print first 100 characters of one essay

Some people might think that traveling in a group led by a tour guide is a good way .
But , a group 


## 2. Compute relevant features
#### 2a. Misspelled words, essay word length, unique tokens per essay, tokenize 

In [125]:
spell = SpellChecker()
contractions = ["'ll", "'ve", "'d", "'s", "n't", "'re", "'t"]
acronyms = ["tv", "tvs", "gps"]

essay_len = []
uniq_token_len = []
mispelled_wds = []
toefl11_tokens_dict = {}
toefl11_tokens_corrected_dict = {}

for i, txt_file in enumerate(txt_file_lst):
    essay = toefl11_responses_dict[txt_file]
    # Tokenize
    token_lst = nltk.tokenize.word_tokenize(essay)
    # Grab relevant length indicators
    essay_len.append( len(token_lst) ) # Essay length
    uniq_token_len.append( len(set(token_lst)) ) # Unique tokens

    no_cont_acr = []
    token_lst_corrected = []
    for word in token_lst:
        # List without acronyms
        if word.lower() not in acronyms + contractions:
            no_cont_acr.append(word)
        # List with all tokens corrected
        if spell.unknown([word])==set() or word.lower() in acronyms + contractions:
            token_lst_corrected.append( word )
        else:
            token_lst_corrected.append( spell.correction(word) )

    # Get all mispelled words (don't include contractions or acronyms)
    misspelled = spell.unknown(no_cont_acr)
    mispelled_wds.append(len(misspelled))

    # Save tokenized versions of the essay
    toefl11_tokens_dict[txt_file] = token_lst
    toefl11_tokens_corrected_dict[txt_file] = token_lst_corrected

    if i%1000==0:
        print(f"Iterated over {i} essays")

Iterated over 0 essays
Iterated over 1000 essays
Iterated over 2000 essays
Iterated over 3000 essays
Iterated over 4000 essays
Iterated over 5000 essays
Iterated over 6000 essays
Iterated over 7000 essays
Iterated over 8000 essays
Iterated over 9000 essays
Iterated over 10000 essays
Iterated over 11000 essays
Iterated over 12000 essays


In [126]:
# Save as JSON files for future use
with open(f'{PATH_RESP}/toefl11_tokenized_tok.json', 'w') as fp:
    json.dump(toefl11_tokens_dict, fp)

with open(f'{PATH_RESP}/toefl11_tokenized_corrected_tok.json', 'w') as fp:
    json.dump(toefl11_tokens_corrected_dict, fp)

# Add vars to DF
toefl11_df['Essay length'] = essay_len
toefl11_df['Unique tokens'] = uniq_token_len
toefl11_df['Mispelled words'] = mispelled_wds

#Delete objects we don't need
del token_lst, token_lst_corrected, essay, essay_len, uniq_token_len, mispelled_wds, text, no_cont_acr, spell

#### 2b. Calculate lexical diversity indicators. Specifically:
- words: number of words (w)
- terms: number of unique terms (t)
- ttr: type-token ratio computed as t / w (Chotlos 1944, Templin 1957)
- rttr: root TTR computed as t / sqrt(w) (Guiraud 1954, 1960)
- cttr: corrected TTR computed as t / sqrt(2w) (Carrol 1964)
- Herdan: log(t) / log(w) (Herdan 1960, 1964)
- Summer: log(log(t)) / log(log(w)) Summer (1966)
- Dugast: (log(w) ** 2) / (log(w) - log(t) Dugast (1978)
- Maas: (log(w) - log(t)) / (log(w) ** 2) Maas (1972)

In [144]:
toefl11_df['ttr'] = toefl11_df['Unique tokens'] / toefl11_df['Essay length']
toefl11_df['rttr'] = toefl11_df['Unique tokens'] / np.sqrt(toefl11_df['Essay length'])
toefl11_df['cttr'] = toefl11_df['Unique tokens'] / np.sqrt(2*toefl11_df['Essay length'])
toefl11_df['herdan'] = np.log(toefl11_df['Unique tokens']) / np.log(toefl11_df['Essay length'])
toefl11_df['summer'] = np.log(np.log(toefl11_df['Unique tokens'])) / np.log(np.log(toefl11_df['Essay length']))
toefl11_df['dugast'] = np.log(toefl11_df['Unique tokens']**2) / (np.log(toefl11_df['Essay length']) - np.log(toefl11_df['Unique tokens']))
toefl11_df['maas'] =  (np.log(toefl11_df['Essay length']) - np.log(toefl11_df['Unique tokens'])) / np.log(toefl11_df['Essay length']**2)

#### 2c. One-hot encoding of categorical vars (language, prompt and score)

In [156]:
for col, pref in [('Score Level', 'score'), ('Language', 'lang'), ('Prompt', 'prompt')]:
    toefl11_df = toefl11_df.merge(pd.get_dummies(toefl11_df[col], prefix=pref), left_index=True, right_index=True)

#### 2d. Create quantiles for continuous variables

In [145]:
# Since it seems like the lexical diversity measure that is more correlated with the scores is ttr and rttr,
# we will only analyze ttr which also happens to be the most simple
toefl11_df[['ttr', 'rttr', 'cttr', 'herdan', 'summer', 'dugast', 'maas', 'score_high', 'score_low', 'score_medium']].corr()

Unnamed: 0,ttr,rttr,cttr,herdan,summer,dugast,maas,score_high,score_low,score_medium
ttr,1.0,0.334179,0.334179,0.951878,0.88419,0.875594,-0.951878,-0.028126,0.239812,-0.123693
rttr,0.334179,1.0,1.0,0.596301,0.717627,0.401372,-0.596301,0.430634,-0.392711,-0.164988
cttr,0.334179,1.0,1.0,0.596301,0.717627,0.401372,-0.596301,0.430634,-0.392711,-0.164988
herdan,0.951878,0.596301,0.596301,1.0,0.983942,0.856242,-1.0,0.108371,0.0707,-0.147963
summer,0.88419,0.717627,0.717627,0.983942,1.0,0.820049,-0.983942,0.183049,-0.032905,-0.154283
dugast,0.875594,0.401372,0.401372,0.856242,0.820049,1.0,-0.856242,0.060051,0.123963,-0.135102
maas,-0.951878,-0.596301,-0.596301,-1.0,-0.983942,-0.856242,1.0,-0.108371,-0.0707,0.147963
score_high,-0.028126,0.430634,0.430634,0.108371,0.183049,0.060051,-0.108371,1.0,-0.256323,-0.794776
score_low,0.239812,-0.392711,-0.392711,0.0707,-0.032905,0.123963,-0.0707,-0.256323,1.0,-0.382907
score_medium,-0.123693,-0.164988,-0.164988,-0.147963,-0.154283,-0.135102,0.147963,-0.794776,-0.382907,1.0


In [168]:
toefl11_df['mispelled_ratio'] = toefl11_df['Mispelled words']/toefl11_df['Essay length']
for col, pref in [('Essay length', 'essaylen_quint'), ('Mispelled words', 'misp_quint'), ('rttr', 'rttr_quint'), ('Unique tokens', 'un_tok_quint'), ('mispelled_ratio', 'misp_ratio_quint')] :
    toefl11_df = toefl11_df.merge(pd.get_dummies( pd.qcut(toefl11_df['rttr'], 5, labels=False), prefix = 'rttr_quint'), left_index=True, right_index=True)

#### 2e. Save DF

In [171]:
toefl11_df.to_pickle(f'{PATH_RESP}/toefl11_DF.pkl')

In [172]:
toefl11_df.columns

Index(['Filename', 'Prompt', 'Language', 'Score Level', 'Essay length',
       'Unique tokens', 'Mispelled words', 'ttr', 'rttr', 'cttr', 'herdan',
       'summer', 'dugast', 'maas', 'score_high', 'score_low', 'score_medium',
       'lang_ARA', 'lang_DEU', 'lang_FRA', 'lang_HIN', 'lang_ITA', 'lang_JPN',
       'lang_KOR', 'lang_SPA', 'lang_TEL', 'lang_TUR', 'lang_ZHO', 'prompt_P1',
       'prompt_P2', 'prompt_P3', 'prompt_P4', 'prompt_P5', 'prompt_P6',
       'prompt_P7', 'prompt_P8', 'rttr_quint_0_x', 'rttr_quint_1_x',
       'rttr_quint_2_x', 'rttr_quint_3_x', 'rttr_quint_4_x', 'rttr_quint_0_y',
       'rttr_quint_1_y', 'rttr_quint_2_y', 'rttr_quint_3_y', 'rttr_quint_4_y',
       'rttr_quint_0', 'rttr_quint_1', 'rttr_quint_2', 'rttr_quint_3',
       'rttr_quint_4'],
      dtype='object')

In [173]:
toefl11_df.head()

Unnamed: 0,Filename,Prompt,Language,Score Level,Essay length,Unique tokens,Mispelled words,ttr,rttr,cttr,...,rttr_quint_0_y,rttr_quint_1_y,rttr_quint_2_y,rttr_quint_3_y,rttr_quint_4_y,rttr_quint_0,rttr_quint_1,rttr_quint_2,rttr_quint_3,rttr_quint_4
0,88.txt,P6,KOR,high,416,163,0,0.391827,7.991733,5.651008,...,0,1,0,0,0,0,1,0,0,0
1,278.txt,P6,DEU,medium,339,129,3,0.380531,7.006318,4.954215,...,1,0,0,0,0,1,0,0,0,0
2,348.txt,P1,TUR,high,396,195,5,0.492424,9.799119,6.929023,...,0,0,0,0,1,0,0,0,0,1
3,666.txt,P2,ZHO,medium,402,166,2,0.412935,8.279327,5.854369,...,0,0,1,0,0,0,0,1,0,0
4,733.txt,P6,TEL,medium,362,149,8,0.411602,7.831266,5.537541,...,0,1,0,0,0,0,1,0,0,0


## 3. Create Term Frequency-Inverse Document Frequency (TF-IDF)
#### Will do for 1000 and 1500 features and 1-3 ngrams 

In [227]:
def construct_tfidf(max_features, max_df = 0.9, min_df = 0.01, ngram_range = (1,1)):
    cvect = CountVectorizer(lowercase=True, max_features=max_features, max_df=max_df, min_df=min_df , ngram_range= ngram_range)
    tfIdfTransformer = TfidfTransformer(use_idf=True)
    cvect_df = cvect.fit_transform([" ".join(toefl11_tokens_corrected_dict[txt_file]) for txt_file in txt_file_lst])
    cvect_tfidf_df = tfIdfTransformer.fit_transform(cvect_df)
    cvect_tfidf_df = pd.DataFrame(cvect_tfidf_df.toarray(),
                                columns=cvect.get_feature_names())
    return cvect_tfidf_df

In [240]:
#Create and save dataframes
tfidf = {}
for name, max, ngr in [("tfidf_1k", 1000, (1,1)), 
                       ("tfidf_15k", 1500, (1,1)), 
                       ("tfidf_1k_3ngr", 1000, (1,3)), 
                       ("tfidf_15k_3ngr", 1500, (1,3))]:
                       
    tfidf[name] = construct_tfidf(max, ngram_range = ngr)
    tfidf[name].to_pickle(f'{PATH_RESP}/toefl11_{name}.pkl')