## Loading the data and a brief description

We start by importing the packages we need and loading the data, which we store in pandas data frames. The scripts aim to follow the same procedure as Al-Rfou, but have been reimplemented from scratch.

In [1]:
# Import necessary packages
import pandas as pd
import datetime
import numpy as np
import re
import sys
import random
import math
from nltk import FreqDist, ngrams, sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from sklearn import svm
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score, KFold, ParameterGrid
from sklearn.metrics import accuracy_score
from sklearn.multiclass import OneVsRestClassifier

In [None]:
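# Load the training data and shuffle the rows (row order can matter for training).
# Note: error_bad_lines was deprecated in pandas 1.3 and removed in 2.0; on newer versions use on_bad_lines="skip".
df = pd.read_csv("python_data/train",sep="\t",error_bad_lines=False,encoding="utf-8")
df = df.sample(frac=1, random_state = 54021)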

# Load some datasets we will need later on. E.g. English stopwords.
eng_stopwords = stopwords.words('english') 

Let us first take a brief look at the data. There are approximately 323K comments in the training data. For each comment we have the native language of the writer (`native_lang`), the self-reported English level of the writer (`level_english`: native, 1, 2, 3, 4, 5 or unknown), the original text of the comment (`text_original`), the text with demonyms and proper nouns replaced by PoS tags (`text_clean`), and finally the structure of the text as reported by SENNA (`text_structure`). Of the 323K comments, about 110K (roughly 34%) are from native speakers. Thus, the training data seem reasonably, though not perfectly, balanced between native and non-native writers.

In [2]:
df.describe()

| | native_lang | text_original | level_english | text_clean | text_structure |
|---|---|---|---|---|---|
| count | 323185 | 323185 | 323185 | 323185 | 323185 |
| unique | 20 | 317653 | 7 | 317560 | 317281 |
| top | EN | is being used on this article. I notice the im... | N | is being used on this article. I notice the im... | VBZ VBG VBN IN DT NN . PRP VBP DT NN NN VBZ IN... |
| freq | 110320 | 708 | 110320 | 708 | 708 |


In [3]:
df.native_lang.describe()

count     323185
unique        20
top           EN
freq      110320
Name: native_lang, dtype: object

In [4]:
df['native'] = np.where(df['native_lang']=='EN', "native", "non-native")

## Replication of Al-Rfou

The paper on which we have based our project uses similarity scores against word and character n-gram models as the features for subsequent classification. Let us construct such models too. Other literature has shown that, for character n-grams, increasing n seems to enhance classification. Thus, we will allow character n-gram models of order up to 10.

Note that the two important steps are (i) constructing n-gram models for each language and (ii) computing similarity scores against these distributions as the features. 

#### (i) `pruned_language_distribution` to construct n-gram models

A problem with (i) is that the construction of n-grams suffers from combinatorial explosion. This is problematic for our purposes, as we have no access to a computer with more than 8 GB of RAM. To limit the explosion, we do not construct higher-order n-grams whose constituent (n-1)-grams fall below a lower threshold `lowerfreqlimit`. Grams with a count of 1 do not contribute to the similarity score, because the score sums the base-2 logarithms of the counts and log2(1) = 0. Hence, for a strict replication of Al-Rfou, `lowerfreqlimit` should be set to 1. Where we run into trouble processing the data, we take the liberty of increasing this parameter a little.

One could argue that increasing this parameter will result in loss of information as it will not record misspellings, which may be indicative of non-native written text. However, note that the character n-grams capture typical non-native speaker mistakes, such that this loss of information is limited. 

The implementation of `pruned_language_distribution` may not seem very efficient, as it has to check for each n-gram whether the summed count of the two (n-1)-grams it contains exceeds `2*lowerfreqlimit`. We provide a small benchmark in the appendix that compares `pruned_language_distribution` against `language_distribution`, in which all n-grams are constructed exhaustively.
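
To make the pruning rule concrete, here is a minimal standalone sketch on toy strings. It uses `collections.Counter` instead of `FreqDist` and only handles character grams; the helper names `ngrams_of` and `pruned_counts` are hypothetical and not part of the notebook.

```python
from collections import Counter


def ngrams_of(seq, k):
    """Return all contiguous k-grams of seq as tuples."""
    return [tuple(seq[i:i + k]) for i in range(len(seq) - k + 1)]


def pruned_counts(sequences, max_k, lowerfreqlimit):
    """Count k-grams for k = 1..max_k, skipping rare higher-order grams."""
    counts = {k: Counter() for k in range(1, max_k + 1)}
    # k is the outer loop: the (k-1)-gram counts must be complete before
    # they can be used to decide which k-grams to keep.
    for k in range(1, max_k + 1):
        for seq in sequences:
            for gram in ngrams_of(seq, k):
                if k == 1:
                    counts[k][gram] += 1
                    continue
                prefix = counts[k - 1].get(gram[:-1], 0)
                suffix = counts[k - 1].get(gram[1:], 0)
                # Keep the gram only if its two subgrams are frequent enough.
                if prefix + suffix > 2 * lowerfreqlimit:
                    counts[k][gram] += 1
    return counts


toy = ["abab", "abac", "xyab"]
pruned = pruned_counts(toy, max_k=2, lowerfreqlimit=3)
print(sorted(pruned[2].items()))
```

On this toy input only the bigrams `ab` and `ba` survive; `ac`, `xy` and `ya` are never stored. Because the test is on the sum of the two subgram counts, a gram with one very frequent and one rare subgram can still pass for smaller values of `lowerfreqlimit`.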

In [5]:
def pruned_language_distribution(n, m, lowerfreqlimit, training, LANGUAGES):
    """Calculate the word n grams distribution up to n, the character n gram distribution up to m.
    @n: consider k-grams up to and including n for words, part of speech tags and word sizes.
    @m: consider k-grams up to and including m for characters. We assume m >= n.
    @lowerfreqlimit: number below which we consider words misspellings, odd words out or unique.
    @training: training data to retrieve the language distribution from.
    @LANGUAGES: languages based on which we classify.
    """
    
    language_dist = {}

    for language in LANGUAGES:
        language_dist[language] = {"words": dict(zip(range(1, n+1), [FreqDist() for i in range(1, n+1)])),
                               "tags": dict(zip(range(1, n+1), [FreqDist() for i in range(1, n+1)])),
                               "chars": dict(zip(range(1, m+1), [FreqDist() for i in range(1, m+1)])),
                               "w_sizes": dict(zip(range(1, n+1), [FreqDist() for i in range(1, n+1)]))}
        
    # Iterate over k in the outer loop. This is required as we need to know the full (k-1)-gram distributions to decide
    # whether we should add a k-gram to the dictionary.
    for k in range(1, n+1):
        for language, tokenized_sents, tokenized_struc in training.itertuples(index=False):
            for sentence in tokenized_sents:
                
                # Get the necessary input structures for the ngrams-function. It is sentence for "chars".          
                token=word_tokenize(sentence)
                wordlens = [len(word) for word in token]
                
                # Note, for any gram, there exist 2 subgrams of all but the first and all of the last element. Let us
                # only update the dictionary if the total count of these subgrams exceeds the lower limit. This prevents
                # an unnecessary combinatorial explosion.
                for gram in ngrams(sentence,k):
                    if k == 1: 
                        language_dist[language]["chars"][k][gram] += 1
                    elif language_dist[language]["chars"][k-1].get(gram[1:],0)+language_dist[language]["chars"][k-1].get(gram[:-1],0) > 2*lowerfreqlimit:
                        language_dist[language]["chars"][k][gram] += 1
                        
                for gram in ngrams(token,k):
                    if k == 1:
                        language_dist[language]["words"][k][gram] += 1
                    elif language_dist[language]["words"][k-1].get(gram[1:],0)+language_dist[language]["words"][k-1].get(gram[:-1],0) > 2*lowerfreqlimit:
                        language_dist[language]["words"][k][gram] += 1
                        
                for gram in ngrams(wordlens,k):
                    if k == 1:
                        language_dist[language]["w_sizes"][k][gram] += 1
                    elif language_dist[language]["w_sizes"][k-1].get(gram[1:],0)+language_dist[language]["w_sizes"][k-1].get(gram[:-1],0) > 2*lowerfreqlimit:
                        language_dist[language]["w_sizes"][k][gram] += 1
                        
            # Now for the tokenized structures (tags)
            for sentence in tokenized_struc:
                token=word_tokenize(sentence)
                for gram in ngrams(token,k):
                    if k == 1:
                        language_dist[language]["tags"][k][gram] += 1
                    elif language_dist[language]["tags"][k-1].get(gram[1:],0)+language_dist[language]["tags"][k-1].get(gram[:-1],0) > 2*lowerfreqlimit:
                        language_dist[language]["tags"][k][gram] += 1
                        
    # Also construct it for higher order k-grams for characters.
    for k in range(n+1, m+1):
        for language, tokenized_sents, tokenized_struc in training.itertuples(index=False):
            for sentence in tokenized_sents:
                for gram in ngrams(sentence,k):
                    if language_dist[language]["chars"][k-1].get(gram[1:],0)+language_dist[language]["chars"][k-1].get(gram[:-1],0) > 2*lowerfreqlimit:
                        language_dist[language]["chars"][k][gram] += 1
                           
    return language_dist

With the distribution of grams over the different languages, we can compute similarity scores. For each language, one can compute a similarity score against each k-gram model, which here results in nlang*(3n+m) features. The paper mentions that the scores are calculated as the sum of the base-2 logarithms of the counts in the model. Inspection of the scripts, however, shows that this is not the case: the sum is normalized by the length of the sequence. We will follow the script.
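
As a small worked example of the normalized score, the sketch below scores one toy comment against a made-up unigram model. The model counts and the helper name `similarity` are illustrative assumptions; the feature count uses the settings of the experiment further down (two classes, n = 4, m = 4).

```python
import math
from collections import Counter


def similarity(counts, grams, seq_len):
    """Sum of log2 counts of the grams in the model, divided by seq_len."""
    total = sum(math.log2(counts.get(g, 1)) for g in grams)
    return total / max(seq_len, 1)


# Hypothetical unigram model for one language.
model = Counter({("the",): 8, ("cat",): 2, ("sat",): 1})
comment = ["the", "cat", "sat", "quietly"]
grams = [(w,) for w in comment]

# (log2(8) + log2(2) + log2(1) + 0 for the unseen word) / 4 = (3 + 1 + 0 + 0) / 4
score = similarity(model, grams, seq_len=len(comment))
print(score)

# Number of similarity features: nlang * (3n + m).
n_lang, n, m = 2, 4, 4
n_features = n_lang * (3 * n + m)
print(n_features)
```

Unseen grams and grams with a count of 1 both contribute zero, which is why pruning singletons does not change the score.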

In [6]:
def compute_similarity_score(dis_ngramdic, gramlist):
    """ This function computes the similarity scores for a comment based on the corresponding k-grams.
    Note that the comment is already tokenized into sentences.
    @dis_ngramdic: ngram dictionary as constructed by language_distribution for particular k.
    @gramlist: list of kgrams
    """
    score=0
    if gramlist:
        for gram in gramlist:
            score += math.log2(dis_ngramdic.get(gram,1))
    return score

colnames = None

def compute_all_features(lang_dis, tokenized_sent, tokenized_struc):
    """ This function compares the tokenized sentences and tokenized structure to each of the languages distributions.
    It returns similarity scores to each language model. Also included are other features, such as the number of
    sentences, the number of words and the average word length of the comment.
    @lang_dis: Language distribution of n-grams.
    @tokenized_sent: comment text split into sentences by nltk.sent_tokenize.
    @tokenized_struc: PoS structure retrieved by SENNA, split into sentences by nltk.sent_tokenize.
    """
    simscoredict=dict()

    # For each gramtype, first construct the list of which we can make n-grams.
    wordlens_ps = []
    words_ps = []
    struc_ps = []
    
    for sentence in tokenized_sent:
        wordlens_ps.append([len(word) for word in word_tokenize(sentence) if word.isalpha()])
        words_ps.append([word for word in word_tokenize(sentence)])
    
    for sentence in tokenized_struc:
        struc_ps.append([word for word in word_tokenize(sentence)])
    
    # Now construct the k-gram lists for each k and store the resulting scores in simscoredict.
    for gramtype in lang_dis[list(lang_dis.keys())[0]].keys():
        
        # Select appropriate data type.
        if gramtype == "tags":
            ps = struc_ps
        elif gramtype =="words":
            ps = words_ps
        elif gramtype == "w_sizes":
            ps = wordlens_ps
        else:
            ps = tokenized_sent
            
        seq_len = max(sum([len(item) for item in ps]),1)
            
        # Construct for each k a gramlist.
        for k in range(1,len(lang_dis[list(lang_dis.keys())[0]][gramtype])+1):
            
            kgramlist = [gram for sentence in ps for gram in ngrams(sentence, k)]
            
            for lang in lang_dis.keys():
                simscoredict[lang+'_'+gramtype+'_'+str(k)]= compute_similarity_score(lang_dis[lang][gramtype][k], kgramlist)/seq_len 
        
    # Set the other features they use in the paper.
    simscoredict["num_sentences"] = len(tokenized_sent) if isinstance(tokenized_sent, list) else 0
    simscoredict["num_words"] = sum([len(wc) for wc in wordlens_ps])
    simscoredict["avg_wordlength"] = sum([sum(word) for word in wordlens_ps])/max(simscoredict["num_words"],1)
    
    global colnames
    if colnames is None:
        colnames = list(simscoredict.keys())
            
    return simscoredict.values()

Al-Rfou reports having used some further features as well. These include:
- "Relative frequency of each of the stop words mentioned in the comment"
- "Average number of sentences"
- "Size of the comments"

What these mean is not entirely clear. How should relative frequency be measured? Each comment has a fixed number of sentences, so over what is the number of sentences averaged? In what unit is the size of a comment measured? Since these features are ambiguously defined and are not reported to be important for the native vs non-native experiment, we exclude them.

In addition to the problems above, the paper does not detail how the models were constructed. It mentions that approximately 322K comments were used in the experiment and that the baseline is 1/(number of classes). The first number makes sense, as we have approximately 323K comments in total. However, about 110K of these are written by native English speakers, whereas the remaining 213K are written by non-native speakers. The paper does not mention whether the non-native comments should be downsampled to obtain a balanced problem. We will assume this is the case.

### Repetition of the non-native experiment using SVM classification

Note that the development/validation set is 7 times smaller than the training set. Any n-gram expected to be present in the random sample can be expected to be present 14 times in the training set. We downsample the training set somewhat, but it therefore seems reasonable to require any n-gram in the language distribution to occur at least 10 times. Computing the distribution takes about 1 hour, so it is worth dumping it into a pickle object that can be reloaded later.
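
A minimal sketch of how such a dump and reload could look is given below. The helper names `save_distribution` and `load_distribution`, the file name and the toy dictionary are assumptions for illustration; `FreqDist` subclasses `collections.Counter`, so a nested dictionary like `lang_dis` should pickle the same way.

```python
import os
import pickle
import tempfile
from collections import Counter


def save_distribution(dist, path):
    """Write the nested language distribution to disk."""
    with open(path, "wb") as fh:
        pickle.dump(dist, fh, protocol=pickle.HIGHEST_PROTOCOL)


def load_distribution(path):
    """Read a previously saved language distribution."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


# Toy stand-in with the same nesting as lang_dis: language -> gram type -> k -> counts.
toy_dist = {"native": {"chars": {1: Counter({("a",): 3, ("b",): 1})}}}

path = os.path.join(tempfile.gettempdir(), "lang_dis_example.pkl")
save_distribution(toy_dist, path)
restored = load_distribution(path)
print(restored == toy_dist)
```

Pickle files should only be loaded from a trusted source, since unpickling can execute arbitrary code.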

In [9]:
# Downsample the non-native languages so that the classes are balanced. Fix the random_state such that we get the appropriate lang_dis.
print("Loading the training and validation data...")
training = pd.concat([df[df.native == "non-native"].sample(sum(df.native == "native"), random_state = 1810), df[df.native=="native"]])
training = training.sample(frac=1, random_state = 1318910)
training.native = training.native.astype('category')

# Load the validation data and label it as native/non-native.
validation = pd.read_csv("python_data/development",sep="\t",error_bad_lines=False,encoding="utf-8")
validation['native'] = np.where(validation['native_lang']=='EN', "native", "non-native")

# Parameter choices
n = 4           # n-grams for words, PoS tags and word sizes.
m = 4           # m-grams for characters
lowerlim = 20   # lower limit on the number of wordcounts to consider words for bigrams, trigrams, etc. Needed to prevent memory issues.

# Tokenize sentences and structures for training data.
print("Tokenizing the sentences...")
training['tokenized_sents'] = training.apply(lambda row: sent_tokenize(row['text_clean']), axis=1)
training['tokenized_struc'] = training.apply(lambda row: sent_tokenize(row['text_structure']), axis=1)
validation['tokenized_sents'] = validation.apply(lambda row: sent_tokenize(row['text_clean']), axis=1)
validation['tokenized_struc'] = validation.apply(lambda row: sent_tokenize(row['text_structure']), axis=1)

# Drop the original texts as they are no longer needed.
training = training.drop(['text_clean', 'text_structure'], axis = 1)
validation = validation.drop(['text_clean', 'text_structure'], axis = 1)

# Derive the language distribution from the training data.
print("Deriving the language distribution from training data...")
lang_dis = pruned_language_distribution(n,m,lowerlim,training[['native','tokenized_sents','tokenized_struc']], training.native.unique())

# Use the language distribution to obtain features for training data and validation data.
print("Computing the features for training and validation data.")
features = training.apply(lambda row: compute_all_features(lang_dis,row['tokenized_sents'], row['tokenized_struc']), axis=1)
features = pd.DataFrame(features.to_frame()[0].values.tolist(), index=features.to_frame()[0].index, columns=colnames)
training = pd.merge(training, features, left_index=True, right_index=True)
features = validation.apply(lambda row: compute_all_features(lang_dis,row['tokenized_sents'], row['tokenized_struc']), axis=1)
features = pd.DataFrame(features.to_frame()[0].values.tolist(), index=features.to_frame()[0].index, columns=colnames)
validation = pd.merge(validation, features, left_index=True, right_index=True)

# Drop the tokenized sentences. Also clear the language distribution. They will no longer be needed.
training = training.drop(['tokenized_sents', 'tokenized_struc'], axis=1)
validation = validation.drop(['tokenized_sents', 'tokenized_struc'], axis=1)
lang_dis.clear()

# Save the feature tables so that they can be reloaded without recomputation.
training.to_csv("python_data/training_features_4_4_"+str(lowerlim))
validation.to_csv("python_data/validation_with_features_4_4_"+str(lowerlim))



In [2]:
# Train the SVC classifier with a linear kernel. This is the approach pursued in the paper.
training=pd.read_csv("python_data/training_features_4_4_20",index_col=0,header=0)
validation=pd.read_csv("python_data/validation_with_features_4_4_20",index_col=0,header=0)

linear = svm.LinearSVC(C=1.0, penalty="l1", dual=False)
linear.fit(training.iloc[:,training.columns.get_loc('non-native_words_1'):], training.native)
y_predicted = linear.predict(validation.iloc[:,validation.columns.get_loc('non-native_words_1'):])
validation['native'] = np.where(validation['native_lang']=='EN', "native", "non-native")
accur = accuracy_score(validation.native, y_predicted)
print(accur)

0.728654920313


We achieve an accuracy of 72.9% on the validation data, which is close to the 74.53% reported in the paper. Note that the accuracy of course depends on how the data are split into training and test sets; for another split, we found an accuracy of 73.0%. Let us investigate whether we can improve this accuracy by tuning parameters through grid search. We will also try a non-linear kernel, which may make sense as it is not at all guaranteed that the two classes are linearly separable. There is also the consideration of scaling. We did not apply it here because Al-Rfou does not seem to apply it, but it is generally considered a good idea for SVMs once one starts working with kernels.

#### Tuning C
Tuning C seems to have barely any effect: the scores hover around 0.73. We get a mean score of 0.73018 and a standard deviation of 0.00034; the standard error of the mean is of course even lower. In retrospect, this is not entirely unexpected. The parameter C controls the strength of regularization: increasing C weakens the regularization, which lowers the bias of the classifier at the cost of a higher variance. Since we have lots of training data, we can expect it to have little effect.

In [4]:
grid = [2**i for i in range(-5, 16, 1)]
scores = []
colnames = list(training)[4:]
for C_ in grid:
    linear = svm.LinearSVC(C=C_, penalty="l1", dual=False)
    linear.fit(training[colnames], training.native)
    y_predicted = linear.predict(validation[colnames])
    validation['native'] = np.where(validation['native_lang']=='EN', "native", "non-native")
    scores.append(accuracy_score(validation.native, y_predicted))
scores

[0.72923720589186736,
 0.72906467683150378,
 0.7288490155060493,
 0.72891371390368564,
 0.72897841230132199,
 0.72895684616877654,
 0.72923720589186736,
 0.72865492031314028,
 0.72899997843386743,
 0.72859022191550393,
 0.72919407362677646,
 0.72932347042204926,
 0.72878431710841296,
 0.72921563975932191,
 0.72902154456641288,
 0.7294313010847765,
 0.72932347042204926,
 0.72882744937350386,
 0.72887058163859475,
 0.72934503655459471,
 0.72880588324095841]
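
The spread of such a list of scores can be summarized with the standard library, as sketched below. For brevity the list contains only the first five values of the output above; the full `scores` list from the cell can be substituted.

```python
import math
import statistics

# First five validation accuracies from the grid above (C = 2**-5 ... 2**-1).
scores = [
    0.72923720589186736,
    0.72906467683150378,
    0.7288490155060493,
    0.72891371390368564,
    0.72897841230132199,
]

mean = statistics.mean(scores)
std = statistics.stdev(scores)           # sample standard deviation
sem = std / math.sqrt(len(scores))       # standard error of the mean

print("mean={:.5f} std={:.5f} sem={:.5f}".format(mean, std, sem))
```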

#### Bagging for non-linear kernels & grid-search to optimize parameters.

Support Vector Machines are a generalization of Support Vector Classifiers to non-linear decision boundaries. Such non-linear decision boundaries are constructed by a "kernel trick". In Hastie, it is reported that when classes are not linearly separable, non-linear kernels may deliver drastic improvements in prediction performance over linear kernels (but at the risk of overfitting as implied by the higher flexibility).

Our data are currently classified into two classes, being (i) native English speakers and (ii) non-native speakers. The second group obviously consists of many subgroups, and the separating hyperplanes between each of these subgroups with the native English speakers can be expected to be different. That the classes are linearly separable is therefore inherently not obvious and finding out whether non-linear kernels do better is an interesting problem.

However, a problem arises: the implementation of `SVC` in `sklearn` has, in the best-case scenario, complexity $O(n_{\text{features}} \cdot n_{\text{samples}}^2)$, leading to very long training times. Bagging is a neat way around this problem: reducing the number of samples per estimator by a factor `n_estimators` reduces the cost of fitting each estimator by a factor `n_estimators^2`, so that fitting all `n_estimators` estimators is still about `n_estimators` times cheaper than fitting a single SVC on all the data, and the estimators can moreover be fitted in parallel (inspired/suggested by https://stackoverflow.com/questions/31681373/making-svm-run-faster-in-python). Let us try this by bagging with 20 estimators.
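
The arithmetic behind this argument can be sketched as follows, assuming the best-case cost model above. The sample count is twice the 110,320 native comments (the balanced training set); the feature count of 35 (32 similarity scores plus 3 other features) is an assumption that cancels out in the ratios anyway. The function name `svc_fit_cost` is hypothetical.

```python
def svc_fit_cost(n_samples, n_features):
    """Relative best-case cost of fitting one SVC: n_features * n_samples**2."""
    return n_features * n_samples ** 2


n_samples = 220_640     # balanced training set: 2 * 110,320 comments
n_features = 35         # assumed; cancels out in the ratios below
n_estimators = 20

full = svc_fit_cost(n_samples, n_features)
per_estimator = svc_fit_cost(n_samples // n_estimators, n_features)
bagged_total = n_estimators * per_estimator

print(full / per_estimator)   # speed-up per estimator: n_estimators**2 = 400.0
print(full / bagged_total)    # speed-up for the whole ensemble: n_estimators = 20.0
```

These are ratios under the idealized cost model only; actual timings also depend on the kernel cache, the parameters and the number of parallel jobs.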

In [20]:
# Load data and timing package.
import time
training=pd.read_csv("python_data/training_features_4_4_20",index_col=0,header=0)
validation=pd.read_csv("python_data/validation_with_features_4_4_20",index_col=0,header=0)
validation['native'] = np.where(validation['native_lang']=='EN', "native", "non-native")
validation = pd.concat([validation[validation.native == "non-native"].sample(sum(validation.native == "native"), random_state = 1810), validation[validation.native=="native"]])

colnames = list(training)[4:]

# Do scaling. This is suggested by Hsu et al. The scaler is fitted on the training data only and then applied
# to both the training and the validation data.
scaler = StandardScaler()
training[colnames] = scaler.fit_transform(training[colnames])
validation[colnames] = scaler.transform(validation[colnames])

# Train SVCs by bagging. 20 estimators will yield approximately 11,000 samples per SVM. This number should be feasible
# according to the documentation. We can speed up things by multithreading (n_jobs = -1).
n_estimators = 20
start = time.time()
clf = BaggingClassifier(svm.SVC(kernel='rbf', C=2**5, gamma=2**-5, cache_size=2000), random_state = 1281, max_samples=1.0 / n_estimators, n_jobs = -1, n_estimators=n_estimators)
print("Fitting the classifier")
clf.fit(training[colnames],training.native)
print("Prediction out-of-sample")
y_predicted = clf.predict(validation[colnames])
end = time.time()
print("Bagging SVC", end - start, accuracy_score(validation.native, y_predicted))

# Reference run with a single SVC on the full training set. Kept disabled; it was not run for the output shown.
# import timeit
# start = timeit.default_timer()
# svc = svm.SVC(kernel='rbf',gamma=2**-5, C=2**5, cache_size=4000)
# svc.fit(training[colnames],training.native)
# y_predicted=svc.predict(validation[colnames])
# print(accuracy_score(validation.native, y_predicted))
# end = timeit.default_timer()
# print("Training and prediction completed in {} seconds".format(end-start))

Fitting the classifier
Prediction out-of-sample
Bagging SVC 75.9409556388855 0.729320461859



With training and prediction taking approximately 90 seconds per parameter setting (about 76 seconds in the run above), grid search becomes feasible. For this grid search, we follow the recommendations of Hsu et al. ("A Practical Guide to Support Vector Classification"). We refrain from selecting $(C,\gamma)$ by cross-validation, as we already have quite a large dataset, and score each pair on the held-out validation set instead.

In [3]:
# Load the data.
training=pd.read_csv("python_data/training_features_4_4_20",index_col=0,header=0)
validation=pd.read_csv("python_data/validation_with_features_4_4_20",index_col=0,header=0)
validation['native'] = np.where(validation['native_lang']=='EN', "native", "non-native")
validation = pd.concat([validation[validation.native == "non-native"].sample(sum(validation.native == "native"), random_state = 1810), validation[validation.native=="native"]])
colnames = list(training)[4:]

# Scale the data.
scaler = StandardScaler()
training[colnames] = scaler.fit_transform(training[colnames])
validation[colnames] = scaler.transform(validation[colnames])

# Define the grid and pre-allocate memory for scores. Set the seed for bagging such that results become comparable.
Cs=[2**i for i in range(-5,16,2)]
gammas = [2**i for i in range(-15, 3, 2)]
param_grid = {'C': Cs, 'gamma': gammas, 'kernel':['rbf'], 'cache_size':[500.0]}
scores = pd.DataFrame(columns=['C','gamma','kernel','score'])

# Iterate over the grid and save scores for each (C,gamma) pair.
itern = 1
n_estimators = 20
for g in ParameterGrid(param_grid):
    svc = svm.SVC()
    svc.set_params(**g)
    clf = BaggingClassifier(svc, random_state = 1281, max_samples=1.0 / n_estimators, n_jobs = 4, n_estimators=n_estimators)
    clf.fit(training[colnames],training.native)
    y_predicted = clf.predict(validation[colnames])
    g['score'] = [accuracy_score(validation.native, y_predicted)]
    holder  = pd.DataFrame(g)
    scores = pd.concat([scores, holder])
    if itern == 1 or itern % 10 == 0 or itern == len(Cs)*len(gammas):
        print("Iteration {}/{} completed".format(itern, len(Cs)*len(gammas)))
    itern += 1
        
scores.to_csv("results_gridsearch_SVM_rbf")

Iteration 1/99 completed
Iteration 10/99 completed
Iteration 20/99 completed
Iteration 30/99 completed
Iteration 40/99 completed
Iteration 50/99 completed
Iteration 60/99 completed
Iteration 70/99 completed
Iteration 80/99 completed
Iteration 90/99 completed
Iteration 99/99 completed


In [6]:
scores

| | C | cache_size | gamma | kernel | score |
|---|---|---|---|---|---|
| 0 | 0.03125 | 500.0 | 0.000031 | rbf | 0.500000 |
| 0 | 0.03125 | 500.0 | 0.000122 | rbf | 0.500000 |
| 0 | 0.03125 | 500.0 | 0.000488 | rbf | 0.514859 |
| 0 | 0.03125 | 500.0 | 0.001953 | rbf | 0.549593 |
| 0 | 0.03125 | 500.0 | 0.007812 | rbf | 0.571740 |
| 0 | 0.03125 | 500.0 | 0.031250 | rbf | 0.618052 |
| 0 | 0.03125 | 500.0 | 0.125000 | rbf | 0.589375 |
| 0 | 0.03125 | 500.0 | 0.500000 | rbf | 0.500000 |
| 0 | 0.03125 | 500.0 | 2.000000 | rbf | 0.500000 |
| 0 | 0.12500 | 500.0 | 0.000031 | rbf | 0.500000 |
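
Picking the best pair from such a table is a one-liner. The sketch below uses only the ten rows displayed above, copied into a plain list, so the result is the best of these rows and not necessarily of the full grid. With the pandas frame itself, `scores.sort_values("score", ascending=False)` would serve the same purpose.

```python
# (C, gamma, score) for the ten rows displayed above.
rows = [
    (0.03125, 0.000031, 0.500000),
    (0.03125, 0.000122, 0.500000),
    (0.03125, 0.000488, 0.514859),
    (0.03125, 0.001953, 0.549593),
    (0.03125, 0.007812, 0.571740),
    (0.03125, 0.031250, 0.618052),
    (0.03125, 0.125000, 0.589375),
    (0.03125, 0.500000, 0.500000),
    (0.03125, 2.000000, 0.500000),
    (0.12500, 0.000031, 0.500000),
]

# Select the row with the highest validation accuracy.
best_C, best_gamma, best_score = max(rows, key=lambda row: row[2])
print(best_C, best_gamma, best_score)
```

Since the validation set was balanced before scoring, an accuracy of 0.5 corresponds to chance level.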


#### Working with scaled data

### Trying to improve the classification
Although it is not obvious, Al-Rfou seems to be using the tuning parameter C=1. Although historically considered a nuisance parameter, it has become clear that C critically determines whether the classifier will overfit. Let us therefore do a grid search for an appropriate C. Cross-validation will not be pursued, since (i) it takes too long and we have ample data, and (ii) 5-fold cross-validation performed in a previous version of this script showed that the standard error on the MSE is very small (approximately two orders of magnitude smaller than the MSE itself).


## Problems
- The language model distribution is based on all training samples. This means that, in cross-validation, the features of the held-out fold already contain information about its class labels, which is a form of label leakage. Fixed, but maybe not with the most efficient implementation, as the model has to be reconstructed k times.
- Warnings. It should be possible to suppress them. They are not a problem for running the code, but they are ugly.
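
The first point can be illustrated with a small standalone sketch in which the language model is rebuilt from the training part of every fold. The helpers `kfold_indices` and `build_model` and the toy data are hypothetical; in the notebook itself one would presumably combine `KFold` from `sklearn.model_selection` with `pruned_language_distribution`.

```python
from collections import Counter


def kfold_indices(n_items, n_folds):
    """Yield (train_idx, test_idx) pairs for contiguous folds."""
    sizes = [n_items // n_folds + (1 if i < n_items % n_folds else 0)
             for i in range(n_folds)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_items) if i < start or i >= start + size]
        yield train, test
        start += size


def build_model(texts, labels):
    """Per-label character unigram counts (toy stand-in for the language model)."""
    model = {}
    for text, label in zip(texts, labels):
        model.setdefault(label, Counter()).update(text)
    return model


texts = ["aaa", "abb", "bbb", "bba", "aab", "bab"]
labels = ["x", "x", "y", "y", "x", "y"]

fold_models = []
for train_idx, test_idx in kfold_indices(len(texts), n_folds=3):
    # Build the model from the training part of this fold only, so that the
    # labels of the held-out comments never enter the n-gram counts.
    model = build_model([texts[i] for i in train_idx],
                        [labels[i] for i in train_idx])
    fold_models.append((train_idx, test_idx, model))

print(len(fold_models))
```

Features for the held-out comments of each fold would then be computed against that fold's model, at the price of constructing the model k times.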

## Appendix

### Benchmark language_distribution against pruned_language_distribution

Here we benchmark the function `pruned_language_distribution` against `language_distribution` for a small sample of the dataset. 

In [None]:
def language_distribution(n, m, training, LANGUAGES):
    """Calculate the word n grams distribution up to n, the character n gram distribution up to m.
    @n: consider k-grams up to and including n for words, part of speech tags and word sizes.
    @m: consider k-grams up to and including m for characters.
    @training: training data to retrieve the language distribution from.
    @LANGUAGES: languages based on which we classify.
    """
    
    language_dist = {}

    for language in LANGUAGES:
        language_dist[language] = {"words": dict(zip(range(1, n+1), [FreqDist() for i in range(1, n+1)])),
                               "tags": dict(zip(range(1, n+1), [FreqDist() for i in range(1, n+1)])),
                               "chars": dict(zip(range(1, m+1), [FreqDist() for i in range(1, m+1)])),
                               "w_sizes": dict(zip(range(1, n+1), [FreqDist() for i in range(1, n+1)]))}
    
    for language, tokenized_sents, tokenized_struc in training.itertuples(index=False):
    
        # Construct n grams counts from the tokenized sentences.
        for sentence in tokenized_sents:
            token = word_tokenize(sentence)     
            wordlens = [len(word) for word in token if word.isalpha()]   
            
            for k in range(1,n+1):  
                language_dist[language]["w_sizes"][k].update(ngrams(wordlens,k))
                language_dist[language]["words"][k].update(ngrams(token,k))
                
        # Construct n gram counts from sentences tokenized based on structure.
        for sentence in tokenized_struc:
            token = word_tokenize(sentence)
            for k in range(1,n+1):
                language_dist[language]["tags"][k].update(ngrams(token,k))

        # Construct character m-grams for tokenized sentences.
        for sentence in tokenized_sents:
            for k in range(1,m+1):
                language_dist[language]["chars"][k].update(ngrams(sentence,k))
    
    return language_dist

In [None]:
import timeit

def sum_keys(d):
    return (0 if not isinstance(d, dict) else len(d) + sum(sum_keys(v) for v in d.values()))

rand_sample = df.sample(20000)
rand_sample['tokenized_sents'] = rand_sample.apply(lambda row: sent_tokenize(row['text_clean']), axis=1)
rand_sample['tokenized_struc'] = rand_sample.apply(lambda row: sent_tokenize(row['text_structure']), axis=1)

start = timeit.default_timer()
dis1 = language_distribution(4, 9, rand_sample[['native','tokenized_sents','tokenized_struc']], rand_sample.native.unique())
end = timeit.default_timer()

dis2 = pruned_language_distribution(4, 9, 1, rand_sample[['native','tokenized_sents','tokenized_struc']], rand_sample.native.unique())
end2 = timeit.default_timer()

print("Unpruned, time: {} sec, size: {} items, \n Pruned, time: {} sec, size: {} items".format(end-start, sum_keys(dis1), end2-end, sum_keys(dis2)))

The benchmark shows that the pruned models are already somewhat better in terms of memory and take only about twice as long to construct. If we increase `lowerfreqlimit`, the memory savings obviously become much larger.