# Towards new features

Unfortunately, tuning the parameters of both Random Forests and SVMs does not seem to yield large gains in classification performance. We might, however, be able to improve performance by extracting new features. Currently, the similarity score is calculated as in the paper. This similarity score has a number of oddities:
- The presence of a k-gram in a model is weighted equally, regardless of k. Each k-gram leads to two (k-1)-grams. This means that k-grams with higher k are assigned lower weights.
- The equal weighting does not take into account whether a word is specific to one of the language distributions. The paper shows, for example, that "prove that" is associated with Russian writers and "refute" with Dutch writers. The weighting scheme ignores such correlations. We could take them into account by constructing a language distribution per language and calculating the tf-idf of each word present.
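
One way to read the tf-idf suggestion in the last bullet is to treat the pooled text of each language as a single document. The following is a minimal sketch under that assumption; the function name `tfidf_per_language`, the toy counts and the smoothed idf formula are illustrative choices, not the weighting used in the paper.

```python
import math
from collections import Counter

def tfidf_per_language(counts_by_language):
    """Return {language: {term: tf-idf}}, treating each language as one document.

    counts_by_language: {language: Counter of term counts} (hypothetical input).
    """
    n_languages = len(counts_by_language)

    # Document frequency: in how many languages does each term occur?
    doc_freq = Counter()
    for counts in counts_by_language.values():
        doc_freq.update(counts.keys())

    weights = {}
    for language, counts in counts_by_language.items():
        total = sum(counts.values())
        weights[language] = {
            # Relative term frequency times a smoothed inverse document frequency.
            term: (count / total) * (math.log((1 + n_languages) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        }
    return weights

# Toy counts, made up for illustration.
toy_counts = {
    "RU": Counter({"prove that": 3, "the": 7}),
    "NL": Counter({"refute": 2, "the": 8}),
    "EN": Counter({"the": 10}),
}
weights = tfidf_per_language(toy_counts)
```

In this sketch a term that occurs in every language keeps an idf factor of exactly 1, while a term found in only one language gets a larger factor, so language-specific terms are weighted up relative to their raw frequency.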

Beyond these problems with the similarity score, one should note that the highest performance in L1 classification is typically reached by analyzing the n-grams directly, rather than similarity scores to n-gram models. Accuracies in the range of 80-90% have been reported for this task, which is conceptually more difficult than native versus non-native classification because it requires identifying the writer's actual native language (e.g. Jarvis/Bestgen/Pepper, Gebre/Zampieri/Wittenburg). Some interesting results:
- In Groningen, character n-grams in the range of 8-10 alone led to very high classification accuracies. We might want to see if we can use these.
- Somewhat unexpectedly, an ensemble of learners applied to different types of n-grams has been reported to perform better than a single learner applied to the same n-grams by itself.
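
To make the first result concrete, here is a small sketch of counting character n-grams of lengths 8 to 10 in a text. The helper `char_ngram_counts` and the sample sentence are hypothetical; the preprocessing of the Groningen system is not described here, so this only illustrates the feature type.

```python
from collections import Counter

def char_ngram_counts(text, n_min=8, n_max=10):
    """Count all character n-grams with n_min <= n <= n_max in `text`."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        # Slide a window of width n over the text.
        for start in range(len(text) - n + 1):
            counts[text[start:start + n]] += 1
    return counts

counts = char_ngram_counts("I actually refute that claim.")
```

With scikit-learn, `CountVectorizer(analyzer="char", ngram_range=(8, 10))` produces the same kind of features as a sparse document-term matrix.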

## Load the necessary stuff

In [1]:
# Import necessary packages
import pandas as pd
import time
import datetime
import numpy as np
import re
import sys
import random
import math
import gc
from functools import reduce
from scipy import sparse
from scipy.stats import norm
from nltk import FreqDist, ngrams, sent_tokenize, word_tokenize
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from sklearn.naive_bayes import MultinomialNB
from sklearn import svm
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, RandomForestRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score, KFold, ParameterGrid, RandomizedSearchCV
from sklearn.metrics import accuracy_score
from sklearn.multiclass import OneVsRestClassifier
import _pickle as pickle

def save_object(obj, filename):
    with open(filename, 'wb') as output:  # Overwrites any existing file.
        p = pickle.Pickler(output) 
        p.fast = True 
        p.dump(obj)
        
def load_object(filename):
    with open(filename, 'rb') as f:
        x = pickle.load(f)
    return(x)

In [2]:
# Load the training data. Scramble the rows (sometimes this is important for training). We also downsample non-native
# English s.t. we have a 1:1 balance. This is required for a fair comparison with the work by Al-Rfou.
print("Loading the training and validation data...")
training = pd.read_csv("python_data/train",sep="\t",error_bad_lines=False,encoding="utf-8")
training = training.sample(frac=1, random_state = 54021)
training['native'] = np.where(training['native_lang']=='EN', "native", "non-native")
training = pd.concat([training[training.native == "non-native"].sample(sum(training.native == "native"), random_state = 1810), training[training.native=="native"]])
training = training.sample(frac=1, random_state = 1318910)
training.native = training.native.astype('category')

# Load the validation data. Again, downsample such that it is balanced.
validation = pd.read_csv("python_data/development",sep="\t",error_bad_lines=False,encoding="utf-8")
validation['native'] = np.where(validation['native_lang']=='EN', "native", "non-native")
validation = pd.concat([validation[validation.native == "non-native"].sample(sum(validation.native == "native"), random_state = 1), validation[validation.native=="native"]])
validation.native = validation.native.astype('category')
print("Data loaded")

# Write data to CSV. We will compute features line by line as doing it in memory is impossible for 20 languages.
training.to_csv("python_data/training_tfidf")
validation.to_csv("python_data/validation_tfidf")

Loading the training and validation data...
Data loaded


## Incorporating tf-idf

We want to construct a tf-idf library for each language separately. Hence, we first derive a language distribution for all 20 languages. We end with the English language distribution. There the odds ratio is trivially 1, but we still include the lower bound such that the measure reflects the sample size of the gram counts a and c.

In [3]:
def sum_keys(d):
    return (0 if not isinstance(d, dict) else len(d) + sum(sum_keys(v) for v in d.values()))

def pruned_language_distribution(n, m, lowerfreqlimit, training, LANGUAGES):
    """Calculate the word n grams distribution up to n, the character n gram distribution up to m.
    @n: consider k-grams up to and including n for words, part of speech tags and word sizes.
    @m: consider k-grams up to and including m for characters. We assume m >= n.
    @lowerfreqlimit: number below which we consider words misspellings, odd words out or unique.
    @training: training data to retrieve the language distribution from.
    @LANGUAGES: languages based on which we classify.
    """
    
    language_dist = {}

    for language in LANGUAGES:
        language_dist[language] = {"words": dict(zip(range(1, n+1), [FreqDist() for i in range(1, n+1)])),
                               "tags": dict(zip(range(1, n+1), [FreqDist() for i in range(1, n+1)])),
                               "chars": dict(zip(range(1, m+1), [FreqDist() for i in range(1, m+1)])),
                               "w_sizes": dict(zip(range(1, n+1), [FreqDist() for i in range(1, n+1)]))}
        
    # Iterate first over k. This is required as we need to know the full k-1 distributions to see if we should add a 
    # k-gram to the dictionary.
    kmax = 0
    for k in range(1, n+1):
        print("Deriving the {}-gram distribution".format(k))
        for language, text, struc in training.itertuples(index=False):
            
            for sentence in sent_tokenize(text):
                
                # Get the necessary input structures for the ngrams-function. It is sentence for "chars".
                token=word_tokenize(sentence) 
                wordlens = [len(word) for word in token]
                
                # Note, for any gram, there exist 2 subgrams of all but the first and all of the last element. Let us
                # only update the dictionary if the total count of these subgrams exceeds the lower limit. This prevents
                # an unnecessary combinatorial explosion.
                for gram in ngrams(sentence,k):
                    if k == 1: 
                        language_dist[language]["chars"][k][gram] += 1
                    elif language_dist[language]["chars"][k-1].get(gram[1:],0)+language_dist[language]["chars"][k-1].get(gram[:-1],0) > 2*lowerfreqlimit:
                        language_dist[language]["chars"][k][gram] += 1
                        
                for gram in ngrams(token,k):
                    if k == 1:
                        language_dist[language]["words"][k][gram] += 1
                    elif language_dist[language]["words"][k-1].get(gram[1:],0)+language_dist[language]["words"][k-1].get(gram[:-1],0) > 2*lowerfreqlimit:
                        language_dist[language]["words"][k][gram] += 1
                        
                for gram in ngrams(wordlens,k):
                    if k == 1:
                        language_dist[language]["w_sizes"][k][gram] += 1
                    elif language_dist[language]["w_sizes"][k-1].get(gram[1:],0)+language_dist[language]["w_sizes"][k-1].get(gram[:-1],0) > 2*lowerfreqlimit:
                        language_dist[language]["w_sizes"][k][gram] += 1
                        
            # Now for the tokenized structures (tags)
            for sentence in sent_tokenize(struc):
                token=word_tokenize(sentence)
                for gram in ngrams(token,k):
                    if k == 1:
                        language_dist[language]["tags"][k][gram] += 1
                    elif language_dist[language]["tags"][k-1].get(gram[1:],0)+language_dist[language]["tags"][k-1].get(gram[:-1],0) > 2*lowerfreqlimit:
                        language_dist[language]["tags"][k][gram] += 1
                        
    # Also construct it for higher order k-grams for characters.
    for k in range(n+1, m+1):
        print("Deriving the {}-gram distribution for characters".format(k))
        for language, text, struc in training.itertuples(index=False):
            # Split the raw text into sentences, as in the loop above, before taking character n-grams.
            for sentence in sent_tokenize(text):
                for gram in ngrams(sentence,k):
                    if language_dist[language]["chars"][k-1].get(gram[1:],0)+language_dist[language]["chars"][k-1].get(gram[:-1],0) > 2*lowerfreqlimit:
                        language_dist[language]["chars"][k][gram] += 1
                           
    return language_dist

def construct_lodds_ratio_dict(fname, lb):
    """ 
    Compute the log-odds ratio of each n-gram in every non-English language relative to English, and return a
    dictionary with these values. We add a pseudocount of .5 to every count to prevent divisions by 0. Only grams
    whose pseudocount-adjusted count in the foreign language exceeds lb are kept; all others are dropped.
    @fname: File to load the language distribution from.
    @lb: lower bound on the gram count in the foreign language.
    """
    
    lang_dis = load_object(fname)
    
    n = len(lang_dis['EN']['words'])
    m = len(lang_dis['EN']['chars'])
    
    for lang in lang_dis.keys():
        if lang == 'EN':
            continue
        for gramtype in lang_dis[lang].keys():  
            for k in lang_dis[lang][gramtype].keys():
                b = sum(lang_dis[lang][gramtype][k].values()) + .5   #Total grams in foreign language
                d = sum(lang_dis['EN'][gramtype][k].values()) + .5   #Total grams in English
                for key in list(lang_dis[lang][gramtype][k].keys()):
                    
                    # Obtain the value by pop, i.e. delete key from dictionary.
                    a = lang_dis[lang][gramtype][k].pop(key,0) +.5   #Gram count for particular gram in foreign language
                    c = lang_dis['EN'][gramtype][k].get(key,0) +.5   #Gram count for particular gram in English
                    
                    if gramtype == "words" and "NNP" in key:
                        continue
                    
                    # If it occurs more often than the lower bound lb, store the log-odds ratio; otherwise the gram stays deleted.
                    if a > lb:
                        lang_dis[lang][gramtype][k][key] = math.log((a*d)/(b*c))  # Calculate the log-odds ratio 
                    
    # Remove English from the language dictionary.
    lang_dis["EN"].clear()

    return(lang_dis)

We first need to construct a language distribution for each language. Rather than downsampling every language to the size of the smallest one, we use the same training data as before. The log-odds ratio accounts for the class imbalance.
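
The pruning rule inside `pruned_language_distribution` keeps a k-gram only when its two (k-1)-subgrams are together frequent enough. A toy illustration of that check follows, using a plain `Counter` and a hypothetical helper `keep_gram` in place of the `FreqDist` structures; the counts are made up.

```python
from collections import Counter

def keep_gram(gram, lower_order_counts, lowerfreqlimit):
    """Mirror the check used for k > 1: the counts of the two (k-1)-subgrams
    (all but the last element, all but the first) must sum to more than
    2 * lowerfreqlimit."""
    left, right = gram[:-1], gram[1:]
    total = lower_order_counts.get(left, 0) + lower_order_counts.get(right, 0)
    return total > 2 * lowerfreqlimit

# Toy unigram counts (made up for illustration).
unigrams = Counter({("prove",): 30, ("that",): 50, ("zxq",): 1})

frequent = keep_gram(("prove", "that"), unigrams, lowerfreqlimit=10)  # 30 + 50 > 20
rare = keep_gram(("zxq", "zxq"), unigrams, lowerfreqlimit=10)         # 1 + 1 <= 20
mixed = keep_gram(("zxq", "that"), unigrams, lowerfreqlimit=10)       # 1 + 50 > 20
```

Because the test uses the sum of the two counts, a gram with one very frequent subgram passes even if its other subgram is rare, as the `mixed` case shows.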

In [None]:
# Derive the language distribution from the training data.
print("Deriving the language distribution from training data...")
start = time.time()
lang_dis = pruned_language_distribution(4,4,10,training[['native_lang','text_clean','text_structure']], training.native_lang.unique())
end = time.time()
print("Language distribution constructed in {} seconds".format(end-start))

# Save it and clear it to save memory.
save_object(lang_dis,"trained_lang_dis_20_lang_ll10")

We can let the odds ratio guide which terms enter our vocabulary. This ratio is defined as $\frac{(a/b)}{(c/d)}$, where $a$ and $c$ are the counts of a particular n-gram in the foreign and English language distribution, respectively. Since the count of a particular n-gram is negligible in comparison with the total number of n-grams, we can approximate $b$ and $d$ by the total number of n-grams in the foreign and English language distribution, respectively. We have explored using an asymptotic lower and upper bound on the odds ratio to select terms for the vocabulary used for the tokenizer. However, this resulted in a classification accuracy just short of 70%, which is not very good in comparison with the aggregate score. Therefore, let us be less conservative in selecting terms and allow for a bigger document-term matrix on which to classify. To keep things somewhat robust, we put a lower bound of 5 on the gram count.
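
As a worked illustration of this quantity, the sketch below computes the log-odds ratio for a single gram with the same pseudocount of 0.5 that `construct_lodds_ratio_dict` uses. The function name `log_odds_ratio` and the counts are made up for the example.

```python
import math

def log_odds_ratio(count_foreign, total_foreign, count_english, total_english, pseudo=0.5):
    """log((a/b) / (c/d)) with a pseudocount added to every count."""
    a = count_foreign + pseudo   # gram count in the foreign-language distribution
    b = total_foreign + pseudo   # total grams in the foreign-language distribution
    c = count_english + pseudo   # gram count in the English distribution
    d = total_english + pseudo   # total grams in the English distribution
    return math.log((a * d) / (b * c))

# A gram that is roughly twice as common in the foreign distribution.
over = log_odds_ratio(20, 1000, 10, 1000)
# A gram with the same relative frequency in both distributions.
neutral = log_odds_ratio(10, 1000, 10, 1000)
```

With the selection thresholds used in the notebook, a gram enters the vocabulary when its log-odds ratio lies outside the interval from `math.log(5/6)` to `math.log(6/5)`, so `over` would be selected and `neutral` would not.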

In [4]:
lodds_ratio = construct_lodds_ratio_dict("trained_lang_dis_20_lang_ll10", 5)
gc.collect()

0

In [5]:
training = pd.read_csv("python_data/training_tfidf")
validation = pd.read_csv("python_data/validation_tfidf")

from nltk.tokenize.treebank import TreebankWordDetokenizer as Detok
detokenizer = Detok()
word_gram_list = []
char_gram_list = []
struc_gram_list = []
for lang in lodds_ratio.keys():
    if lang == "EN":
        continue
    for gramtype in lodds_ratio[lang].keys():
        for k in lodds_ratio[lang][gramtype].keys():
            for key,v in lodds_ratio[lang][gramtype][k].items():
                if v>math.log(6/5) or v<math.log(5/6):
                    if gramtype == "words":
                        word_gram_list.append(key)
                    if gramtype == "chars":
                        char_gram_list.append(key)
                    if gramtype =="tags":
                        struc_gram_list.append(key)
word_gram_list = set([detokenizer.detokenize(gram) for gram in set(word_gram_list)])
struc_gram_list = set([detokenizer.detokenize(gram) for gram in set(struc_gram_list)])
char_gram_list = set([''.join(gram) for gram in set(char_gram_list)])

In [6]:
import gc
lodds_ratio.clear()
gc.collect()
print(len(word_gram_list))
print(len(char_gram_list))
print(len(struc_gram_list))

130249
80360
60487


In [7]:
from sklearn.feature_extraction.text import CountVectorizer

training = pd.read_csv("python_data/training_tfidf")
validation = pd.read_csv("python_data/validation_tfidf")

# create the transform for words, characters and structure.
word_vectorizer = CountVectorizer(ngram_range=(1, 4), vocabulary = word_gram_list, lowercase=False)
struc_vectorizer = CountVectorizer(ngram_range=(1, 4), vocabulary = struc_gram_list, lowercase=False)
char_vectorizer = CountVectorizer(ngram_range=(1, 4), vocabulary = char_gram_list, analyzer="char", lowercase=False)

In [8]:
word_vector = word_vectorizer.transform(training["text_clean"])
struc_vector = struc_vectorizer.transform(training["text_structure"])
char_vector = char_vectorizer.transform(training["text_clean"])
sparse.save_npz("word_vector",word_vector)
sparse.save_npz("struc_vector",struc_vector)
sparse.save_npz("char_vector",char_vector)

In [9]:
validation_word_vector = word_vectorizer.transform(validation["text_clean"])
validation_char_vector = char_vectorizer.transform(validation["text_clean"])
validation_struc_vector = struc_vectorizer.transform(validation["text_structure"])
sparse.save_npz("val_word_vector",validation_word_vector)
sparse.save_npz("val_struc_vector",validation_struc_vector)
sparse.save_npz("val_char_vector",validation_char_vector)

## Start classification based on word and structure vectors. 

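Here we restart the Python kernel to free memory. The import and data-loading cells at the top must be run again before the cells below, which rely on `training`, `validation` and the imported packages.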

In [3]:
training_grams = sparse.load_npz("word_vector.npz")
validation_grams = sparse.load_npz("val_word_vector.npz")
clf = MultinomialNB().fit(training_grams, training.native)
predicted = clf.predict(validation_grams)
accuracy = 1-sum(predicted != validation.native)/len(predicted)
print("Classifying based on word-grams gives NB accuracy of {}%".format(accuracy))
clf = SGDClassifier(loss="hinge",penalty="l2",alpha=1e-4, random_state=42, max_iter=500, tol=None)
clf.fit(training_grams, training.native)
predicted_word = clf.predict(validation_grams)
accuracy = 1-sum(predicted_word != validation.native)/len(predicted_word)
print("Classifying based on word-grams with SVM with gradient descent gives accuracy of {}%".format(accuracy*100))

Classifying based on word-grams gives NB accuracy of 0.6947756956274844%
Classifying based on word-grams with SVM with gradient descent gives accuracy of 71.99192378068017%


In [4]:
training_grams = sparse.load_npz("char_vector.npz")
validation_grams = sparse.load_npz("val_char_vector.npz")
clf = MultinomialNB().fit(training_grams, training.native)
predicted = clf.predict(validation_grams)
accuracy = 1-sum(predicted != validation.native)/len(predicted)
print("Classifying based on character grams gives NB accuracy of {}%".format(accuracy))
clf = SGDClassifier(loss="hinge",penalty="l2",alpha=1e-4, random_state=42, max_iter=500, tol=None)
clf.fit(training_grams, training.native)
predicted_char = clf.predict(validation_grams)
accuracy = 1-sum(predicted_char != validation.native)/len(predicted_char)
print("Classifying based on character grams with SVM with gradient descent gives accuracy of {}%".format(accuracy*100))

Classifying based on character grams gives NB accuracy of 0.6492838664899994%
Classifying based on character grams with SVM with gradient descent gives accuracy of 71.55656508297054%


In [5]:
training_grams = sparse.load_npz("struc_vector.npz")
validation_grams = sparse.load_npz("val_struc_vector.npz")
clf = MultinomialNB().fit(training_grams, training.native)
predicted = clf.predict(validation_grams)
accuracy = 1-sum(predicted != validation.native)/len(predicted)
print("Classifying based on structure gives NB accuracy of {}%".format(accuracy))
clf = SGDClassifier(loss="hinge",penalty="l2",alpha=1e-4, random_state=42, max_iter=500, tol=None)
clf.fit(training_grams, training.native)
predicted_struc = clf.predict(validation_grams)
accuracy = 1-sum(predicted_struc != validation.native)/len(predicted_struc)
print("Classifying based on structure with SVM with gradient descent gives accuracy of {}%".format(accuracy*100))

Classifying based on structure gives NB accuracy of 0.5784276610511705%
Classifying based on structure with SVM with gradient descent gives accuracy of 60.4012871474541%


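Clearly, a linear SVM with L2 regularization, trained by stochastic gradient descent, works better than multinomial Naive Bayes. Note that the Naive Bayes accuracies above are printed as fractions rather than percentages. Let us write all predictions to a file.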

In [6]:
predicted_struc = pd.DataFrame(predicted_struc, validation.index)
predicted_word = pd.DataFrame(predicted_word, validation.index)
predicted_char = pd.DataFrame(predicted_char, validation.index)
predicted_struc.columns = ["prediction_struc"]
predicted_char.columns = ["prediction_chars"]
predicted_word.columns = ["prediction_words"]

# Merge the predictions and validation dataframe.
dfs = [validation, predicted_struc, predicted_char, predicted_word]
df_final = reduce(lambda left,right: pd.merge(left,right, left_index= True, right_index = True), dfs)
df_final = df_final.drop(['text_original','text_structure'], axis=1)
df_final.to_csv("output_DTM_classifiers")

Inspection of df_final shows that the classifiers frequently disagree. It would be interesting to see what a majority vote would give. Let us postpone such an analysis to R.

In [7]:
df_final.head()

Unnamed: 0,native_lang,level_english,text_clean,native,prediction_struc,prediction_chars,prediction_words
23905,DE,3,"NNP. your question: ""you NNP picked a bad day ...",non-native,native,non-native,non-native
272,FR,5,"There is also a similar discussion, (caused by...",non-native,native,non-native,non-native
42283,NL,4,"It may be co-funded by an JJ network, but the ...",non-native,native,non-native,native
23101,PT,3,It could be worse! I actually refute the crap ...,non-native,native,non-native,non-native
23884,DA,3,NNP was proposed for deletion. This page is an...,non-native,non-native,non-native,native
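
Although the majority-vote analysis is postponed to R, the idea is simple enough to sketch in Python. The helper `majority_vote` below is hypothetical and works on plain label sequences; with three classifiers and two classes there is never a tie.

```python
from collections import Counter

def majority_vote(*prediction_lists):
    """Return, per row, the label predicted by most classifiers."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*prediction_lists)]

# Predictions copied from the five rows shown above.
struc = ["native", "native", "native", "native", "non-native"]
chars = ["non-native", "non-native", "non-native", "non-native", "non-native"]
words = ["non-native", "non-native", "native", "non-native", "native"]

voted = majority_vote(struc, chars, words)
```

On the full data, the same function could be called as `majority_vote(df_final.prediction_struc, df_final.prediction_chars, df_final.prediction_words)`, since pandas Series can be zipped row by row.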


## Appendix: Compute similarity scores against log-odds dictionary.

Here, we tried to compute a similarity score analogous to the one in the replication based on all 20 languages and odds ratios. This turns out to work pretty badly...

In [None]:
def compute_similarity_score_sum(dis_ngramdic, gramlist):
    """ This function computes the similarity scores for a comment based on the corresponding k-grams.
    Note that the comment is already tokenized into sentences.
    @dis_ngramdic: ngram dictionary as constructed by language_distribution for particular k.
    @gramlist: list of kgrams
    """
    score=0
    if gramlist:
        for gram in gramlist:
            score += dis_ngramdic.get(gram,1)
    return score

colnames = None


def compute_all_features(lang_dis, original_text, clean_text, structure_text):
    """ This function compares the sentences and structure to each of the languages distributions. It returns
    similarity scores to each language model. Also included are other features, such as the number of sentences
    per text, etc.
    @lang_dis: Language distribution of n-grams.
    @original_text: Original text, used for the word-length features.
    @clean_text: Text with proper nouns and demonyms substituted
    @structure_text: PoS structure retrieved by SENNA.
    """
    simscoredict=dict()
    
    # For each gramtype, first construct the list of which we can make n-grams.
    words_ps = list(word_tokenize(clean_text))
    struc_ps = list(word_tokenize(structure_text))
    wordlens_ps = [len(word) for word in word_tokenize(original_text) if word.isalpha()]
    
    # Now we should construct k-gram lists for each k and return the score. Let us store all grams in 
    for gramtype in lang_dis[list(lang_dis.keys())[0]].keys():
        
        # Select appropriate data type.
        if gramtype == "tags":
            ps = struc_ps
        elif gramtype =="words":
            ps = words_ps
        elif gramtype == "w_sizes":
            ps = wordlens_ps
        elif gramtype == "chars":
            ps = clean_text
            
        seq_len = len(ps) if len(ps) != 0 else 1
        
        # For each k, feed the ngrams function into the compute_similarity_score function. 
        for k in range(1,len(lang_dis[list(lang_dis.keys())[0]][gramtype])+1):
            for lang in lang_dis.keys():
                simscoredict[lang+'_'+gramtype+'_'+str(k)] = compute_similarity_score_sum(lang_dis[lang][gramtype][k], ngrams(ps,k))/seq_len
    
    # Set the other features they use in the paper.
    simscoredict["num_sentences"] = len(list(sent_tokenize(clean_text)))
    simscoredict["num_words"] = len(wordlens_ps)
    # Guard against texts without alphabetic words to avoid a division by zero.
    simscoredict["avg_wordlength"] = sum(wordlens_ps)/len(wordlens_ps) if wordlens_ps else 0
        
    global colnames
    if colnames is None:
        colnames = list(simscoredict.keys())
            
    return simscoredict.values()

In [None]:
import csv

print("Starting computing features against each language distribution")
start=time.time()

with open("python_data/training_tfidf") as infile, open("python_data/training_with_features_tfidf","w") as outfile:
    # Open csv reader and writer.
    r =  csv.reader(infile); w = csv.writer(outfile); next(r)
    header_written = False
    for row in r:
        features = compute_all_features(lodds_ratio,row[2],row[4],row[5])
        if not header_written:
            w.writerow(['','native_lang','native','level']+colnames)
            header_written = True
        w.writerow([row[0], row[1], row[3], row[6]]+ list(features))

with open("python_data/validation_tfidf") as infile, open("python_data/validation_with_features_tfidf","w") as outfile:
    # Open csv reader and writer.
    r =  csv.reader(infile); w = csv.writer(outfile); next(r)
    header_written = False
    for row in r:
        features = compute_all_features(lodds_ratio,row[2],row[4],row[5])
        if not header_written:
            w.writerow(['','native_lang','native','level']+colnames)
            header_written = True
        w.writerow([row[0], row[1], row[3], row[6]]+ list(features))

print("Features calculated in {} seconds".format(time.time()-start))