## Loading the data and a brief description

We start by importing packages we need and loading the data. We will use pandas to store the data. The scripts intend to follow the same procedure as Al-Rfou, but have been reimplemented from scratch.

In [1]:
# Import necessary packages
import pandas as pd
import time
import datetime
import numpy as np
import re
import sys
import random
import math
from nltk import FreqDist, ngrams, sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from sklearn import svm
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, RandomForestRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score, KFold, ParameterGrid, RandomizedSearchCV
from sklearn.metrics import accuracy_score
from sklearn.multiclass import OneVsRestClassifier
import pickle

def save_object(obj, filename):
    with open(filename, 'wb') as output:  # Overwrites any existing file.
        pickle.dump(obj, output, pickle.HIGHEST_PROTOCOL)
        
def load_object(filename):
    with open(filename, 'rb') as f:
        x = pickle.load(f)
    return(x)

In [2]:
# Load the training data. Scramble the rows (sometimes this is important for training)
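# Note: on_bad_lines="skip" replaces error_bad_lines=False, which was removed in pandas 2.0 (requires pandas >= 1.3).
df = pd.read_csv("python_data/train",sep="\t",on_bad_lines="skip",encoding="utf-8")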
df = df.sample(frac=1, random_state = 54021)
df['native'] = np.where(df['native_lang']=='EN', "native", "non-native")

# Load some datasets we will need later on. E.g. English stopwords.
eng_stopwords = stopwords.words('english') 

# Load the training data. Downsample non-English such that it is balanced.
print("Loading the training and validation data...")
training = pd.concat([df[df.native == "non-native"].sample(sum(df.native == "native"), random_state = 1810), df[df.native=="native"]])
training = training.sample(frac=1, random_state = 1318910)
training.native = training.native.astype('category')

# Load the validation data. Again, downsample such that it is balanced.
validation = pd.read_csv("python_data/development",sep="\t",on_bad_lines="skip",encoding="utf-8")  # requires pandas >= 1.3
validation['native'] = np.where(validation['native_lang']=='EN', "native", "non-native")
validation = pd.concat([validation[validation.native == "non-native"].sample(sum(validation.native == "native"), random_state = 1), validation[validation.native=="native"]])
validation.native = validation.native.astype('category')

Loading the training and validation data...


Let us first take a brief look at the data. We see that there are approximately 323K comments in the training data. For each comment, we have the native language of the writer (native_lang), the self-reported English level of the writer (native, 1, 2, 3, 4, 5 or unknown), the original text of the comment (text_original), the text with demonyms and proper nouns replaced by PoS tags (text_clean), and finally the structure of the text reported by SENNA (text_structure). Of the 323K comments, about 110K are from native speakers. We downsample to get balanced training and validation data, as the paper suggests this as its baseline.

In [3]:
df.describe()

Unnamed: 0,native_lang,text_original,level_english,text_clean,text_structure,native
count,323185,323185,323185,323185,323185,323185
unique,20,317653,7,317560,317281,2
top,EN,is being used on this article. I notice the im...,N,is being used on this article. I notice the im...,VBZ VBG VBN IN DT NN . PRP VBP DT NN NN VBZ IN...,non-native
freq,110320,708,110320,708,708,212865


In [4]:
df.native_lang.describe()

count     323185
unique        20
top           EN
freq      110320
Name: native_lang, dtype: object

In [5]:
# Drop the original dataframe. We will work only with downsampled training and validation data.
df = df.iloc[0:0]

## Replication of Al-Rfou

The paper on which we have based our project uses similarity scores to word and character n-gram models as the features for subsequent classification. Here, we too embark on the construction of such models. The two important steps are (i) constructing n-gram models for each language and (ii) computing similarity scores against these distributions as the features.

#### (i) `pruned_language_distribution` to construct n-gram models

Note that a problem with (i) is that the construction of n-grams suffers from combinatorial explosion. This is problematic for our purposes, as we have no access to a computer with more than 8 GB of RAM. To limit this explosion, we skip higher-order n-grams whose lower-order sub-grams do not meet a lower threshold `lowerfreqlimit`. Grams with a count of 1 do not contribute to the similarity score, since $\log_2 1 = 0$. Hence, for a strict replication of Al-Rfou, `lowerfreqlimit` should be set to 1. We have decided to increase this parameter to 10 so as to keep the n-gram models relatively sparse.

One could argue that increasing this parameter will result in loss of information as it will not record misspellings, which may be indicative of non-native written text. However, note that the character n-grams capture typical non-native speaker mistakes, such that this loss of information is limited. 

Note that the implementation of `pruned_language_distribution` may not seem very efficient, as it has to check for each n-gram whether the summed counts of the two (n-1)-grams it contains exceed twice `lowerfreqlimit`. We provide a little benchmark in the appendix to compare `pruned_language_distribution` against `language_distribution`, where all n-grams are constructed exhaustively. This is not pursued for the entire training set, as it results in a memory error on a PC with 8 GB of RAM, but it shows that (for smaller samples) preventing construction of leaves with $<10$ counts slows the construction down only by a factor of approximately 2.
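The pruning rule can be illustrated on a toy example. The sketch below is a simplification, not the function used in this notebook: it uses `collections.Counter` instead of NLTK's `FreqDist`, and `pruned_bigrams` is a hypothetical helper. It keeps a bigram only if the summed counts of its two unigrams exceed `2 * lowerfreqlimit`, mirroring the check in `pruned_language_distribution`.

```python
from collections import Counter

def pruned_bigrams(tokens, lowerfreqlimit):
    """Count bigrams whose two unigram counts sum to more than 2 * lowerfreqlimit.

    Hypothetical helper, for illustration only.
    """
    # The full unigram distribution must be known before bigrams are considered.
    unigrams = Counter((tok,) for tok in tokens)
    bigrams = Counter()
    for left, right in zip(tokens, tokens[1:]):
        # Sub-grams of a bigram: all but the first and all but the last element.
        support = unigrams.get((right,), 0) + unigrams.get((left,), 0)
        if support > 2 * lowerfreqlimit:
            bigrams[(left, right)] += 1
    return bigrams

tokens = "the cat saw the dog and the cat saw the bird".split()
kept = pruned_bigrams(tokens, lowerfreqlimit=2)
print(kept)
```

In this toy sentence, bigrams such as `("cat", "saw")` are never stored because their unigrams are too rare, while every bigram containing the frequent token `"the"` survives.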

In [6]:
def pruned_language_distribution(n, m, lowerfreqlimit, training, LANGUAGES):
    """Calculate the word n grams distribution up to n, the character n gram distribution up to m.
    @n: consider k-grams up to and including n for words, part of speech tags and word sizes.
    @m: consider k-grams up to and including m for characters. We assume m >= n.
    @lowerfreqlimit: number below which we consider words misspellings, odd words out or unique.
    @training: training data to retrieve the language distribution from.
    @LANGUAGES: languages based on which we classify.
    """
    
    language_dist = {}

    for language in LANGUAGES:
        language_dist[language] = {"words": dict(zip(range(1, n+1), [FreqDist() for i in range(1, n+1)])),
                               "tags": dict(zip(range(1, n+1), [FreqDist() for i in range(1, n+1)])),
                               "chars": dict(zip(range(1, m+1), [FreqDist() for i in range(1, m+1)])),
                               "w_sizes": dict(zip(range(1, n+1), [FreqDist() for i in range(1, n+1)]))}
        
    # Iterate first over k. This is required as we need to know the full k-1 distributions to see if we should add a 
    # k-gram to the dictionary.
    kmax = 0
    for k in range(1, n+1):
        for language, text, struc in training.itertuples(index=False):
            
            for sentence in sent_tokenize(text):
                
                # Get the necessary input structures for the ngrams-function. It is sentence for "chars".
                token=word_tokenize(sentence) 
                wordlens = [len(word) for word in token]
                
                # Note, for any gram, there exist 2 subgrams of all but the first and all of the last element. Let us
                # only update the dictionary if the total count of these subgrams exceeds the lower limit. This prevents
                # an unnecessary combinatorial explosion.
                for gram in ngrams(sentence,k):
                    if k == 1: 
                        language_dist[language]["chars"][k][gram] += 1
                    elif language_dist[language]["chars"][k-1].get(gram[1:],0)+language_dist[language]["chars"][k-1].get(gram[:-1],0) > 2*lowerfreqlimit:
                        language_dist[language]["chars"][k][gram] += 1
                        
                for gram in ngrams(token,k):
                    if k == 1:
                        language_dist[language]["words"][k][gram] += 1
                    elif language_dist[language]["words"][k-1].get(gram[1:],0)+language_dist[language]["words"][k-1].get(gram[:-1],0) > 2*lowerfreqlimit:
                        language_dist[language]["words"][k][gram] += 1
                        
                for gram in ngrams(wordlens,k):
                    if k == 1:
                        language_dist[language]["w_sizes"][k][gram] += 1
                    elif language_dist[language]["w_sizes"][k-1].get(gram[1:],0)+language_dist[language]["w_sizes"][k-1].get(gram[:-1],0) > 2*lowerfreqlimit:
                        language_dist[language]["w_sizes"][k][gram] += 1
                        
            # Now for the tokenized structures (tags)
            for sentence in sent_tokenize(struc):
                token=word_tokenize(sentence)
                for gram in ngrams(token,k):
                    if k == 1:
                        language_dist[language]["tags"][k][gram] += 1
                    elif language_dist[language]["tags"][k-1].get(gram[1:],0)+language_dist[language]["tags"][k-1].get(gram[:-1],0) > 2*lowerfreqlimit:
                        language_dist[language]["tags"][k][gram] += 1
                        
    # Also construct it for higher order k-grams for characters.
    for k in range(n+1, m+1):
        for language, text, struc in training.itertuples(index=False):
            # The text column holds raw strings, so split it into sentences as in the loop above.
            for sentence in sent_tokenize(text):
                for gram in ngrams(sentence,k):
                    if language_dist[language]["chars"][k-1].get(gram[1:],0)+language_dist[language]["chars"][k-1].get(gram[:-1],0) > 2*lowerfreqlimit:
                        language_dist[language]["chars"][k][gram] += 1
                           
    return language_dist

With the distribution of grams over the different languages, we can compute similarity scores. For each language, one can compute a similarity score against each k-gram model, which results here in nlang*(3n+m) features. The paper mentions that the scores are calculated as the sum of the base-2 logarithms of the counts in the model. Inspection of the scripts, however, shows that this is not the case: this sum is divided by the length of the sequence. We follow the script, as we obtain values of infinity if we do not apply this division.
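As a small worked example of the normalized score, the sketch below uses a toy unigram model with made-up counts (an assumption for illustration, not values from the data). The helper name `normalized_score` is hypothetical; it combines the log-sum and the division by sequence length in one place.

```python
import math

def normalized_score(model_counts, grams, seq_len):
    """Sum of log2 counts of the grams in the model, divided by the sequence length.

    Grams missing from the model default to a count of 1 and add log2(1) = 0.
    """
    total = sum(math.log2(model_counts.get(gram, 1)) for gram in grams)
    return total / seq_len

# Toy unigram model with made-up counts (assumption, not taken from the data).
toy_model = {("the",): 8, ("cat",): 4, ("sat",): 1}
tokens = ["the", "cat", "sat", "down"]
grams = [(tok,) for tok in tokens]

score = normalized_score(toy_model, grams, seq_len=len(tokens))
print(score)  # (3 + 2 + 0 + 0) / 4 = 1.25

# Number of similarity features for two classes with n = m = 4.
nlang, n, m = 2, 4, 4
n_features = nlang * (3 * n + m)
print(n_features)  # 32
```

Both the gram with count 1 and the unseen gram contribute nothing, which is why pruning rare grams changes the scores only through the higher-order grams it blocks.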

In [7]:
def compute_similarity_score(dis_ngramdic, gramlist):
    """ This function computes the similarity scores for a comment based on the corresponding k-grams.
    Note that the comment is already tokenized into sentences.
    @dis_ngramdic: ngram dictionary as constructed by language_distribution for particular k.
    @gramlist: list of kgrams
    """
    score=0
    if gramlist:
        for gram in gramlist:
            score += math.log2(dis_ngramdic.get(gram,1))
    return score

colnames = None

def compute_all_features(lang_dis, original_text, clean_text, structure_text):
    """ This function compares the sentences and structure to each of the languages distributions. It returns
    similarity scores to each language model. Also included are other features, such as the number of sentences
    per text, etc.
    @lang_dis: Language distribution of n-grams.
    @clean_text: Text with proper nouns and demonyms substituted
    @structure_text: PoS structure retrieved by SENNA.
    """
    simscoredict=dict()
    
    # For each gramtype, first construct the list of which we can make n-grams.
    words_ps = list(word_tokenize(clean_text))
    struc_ps = list(word_tokenize(structure_text))
    wordlens_ps = [len(word) for word in word_tokenize(original_text) if word.isalpha()]
    
    # Now we should construct k-gram lists for each k and return the score. Let us store all grams in 
    for gramtype in lang_dis[list(lang_dis.keys())[0]].keys():
        
        # Select appropriate data type.
        if gramtype == "tags":
            ps = struc_ps
        elif gramtype =="words":
            ps = words_ps
        elif gramtype == "w_sizes":
            ps = wordlens_ps
        elif gramtype == "chars":
            ps = clean_text
        
        # We need to normalize with the sequence length.
        seq_len = len(ps)

        # For each k, feed the ngrams function into the compute_similarity_score function. 
        for k in range(1,len(lang_dis[list(lang_dis.keys())[0]][gramtype])+1):
            for lang in lang_dis.keys():
                simscoredict[lang+'_'+gramtype+'_'+str(k)]= compute_similarity_score(lang_dis[lang][gramtype][k], ngrams(ps,k))/seq_len
    
    # Set the other features they use in the paper.
    simscoredict["num_sentences"] = len(list(sent_tokenize(clean_text)))
    simscoredict["num_words"] = len(wordlens_ps)
    simscoredict["avg_wordlength"] = sum(wordlens_ps)/len(wordlens_ps)
        
    global colnames
    if colnames is None:
        colnames = list(simscoredict.keys())
            
    return simscoredict.values()

Al-Rfou reports having used some other features too. These include:
- "Relative frequency of each of the stop words mentioned in the comment"
- "Average number of sentences"
- "Size of the comments"

What these mean is not unequivocally clear. How should relative frequency be measured? Each comment has a deterministic number of sentences, so the average of sentences over what? The size of the comments, measured in what way? Since such features are ambiguous in their definition and are not reported to be important for the native vs non-native experiment, we exclude them.

In addition to the problems above, the paper does not detail how the models were constructed. It mentions that approximately 322K comments were used in the experiment and that the baseline is 1/(number of classes). The first number makes sense, as we have approximately 323K comments in total. However, about 110K of these are from native English speakers, whereas the remaining 213K or so are from non-native speakers. It is not mentioned whether the non-native comments should be downsampled such that we have a balanced problem. We will assume this is the case.

### Repetition of the non-native experiment using SVM classification

Note that the development/validation set is about 7 times smaller than the training set. Let us impose the restriction that we only consider n-grams that occur at least 10 times in the entire training set. Computing this distribution takes about 1 hour, so we dump it in a pickle object so that it can be reloaded later.

In [8]:
# Set the parameters for the n-gram models. The downsampled training data and its fixed random_state determine the resulting lang_dis.
print("Setting parameters")
n = 4           # n-grams for words, PoS tags and word sizes.
m = 4           # m-grams for characters
lowerlim = 10   # lower limit on the number of wordcounts to consider words for bigrams, trigrams, etc. Needed to prevent memory issues.

Setting parameters


In [9]:
# Derive the language distribution from the training data.
print("Deriving the language distribution from training data...")
start = time.time()
lang_dis = pruned_language_distribution(n,m,lowerlim,training[['native','text_clean','text_structure']], training.native.unique())
end = time.time()
print("Language distribution constructed in {} seconds".format(end-start))

# Save it and clear it to save memory.
save_object(lang_dis,"trained_lang_dis")
lang_dis.clear()

Deriving the language distribution from training data...




Language distribution constructed in 3678.919601917267 seconds


In [10]:
# Use the language distribution to obtain features for training data and validation data.
print("Loading the language distribution")
if 'lang_dis' in globals() or 'lang_dis' in locals():
    lang_dis.clear()
lang_dis = load_object("trained_lang_dis")

print("Computing the features for training and validation data.")
start = time.time()
features = training.apply(lambda row: compute_all_features(lang_dis,row['text_original'],row['text_clean'], row['text_structure']), axis=1)
features = pd.DataFrame(features.to_frame()[0].values.tolist(), index=features.to_frame()[0].index, columns=colnames)
training = pd.merge(training, features, left_index=True, right_index=True)
features = validation.apply(lambda row: compute_all_features(lang_dis,row['text_original'],row['text_clean'], row['text_structure']), axis=1)
features = pd.DataFrame(features.to_frame()[0].values.tolist(), index=features.to_frame()[0].index, columns=colnames)
validation = pd.merge(validation, features, left_index=True, right_index=True)
end = time.time()
print("Finished computing features in {} seconds".format(end-start))

# Clean up stuff we no longer need.
lang_dis.clear()
features = features.iloc[0:0]
training = training.drop(['text_original', 'text_clean', 'text_structure'], axis=1)
validation = validation.drop(['text_original','text_clean','text_structure'], axis = 1)

# Write the training and validation including their features to file.
training.to_csv("python_data/training_features_4_4_"+str(lowerlim))
validation.to_csv("python_data/validation_with_features_4_4_"+str(lowerlim))

Loading the language distribution
Computing the features for training and validation data.
Finished computing features in 1874.8679361343384 seconds


In [11]:
# Load the data. 
training=pd.read_csv("python_data/training_features_4_4_10",index_col=0,header=0)
validation=pd.read_csv("python_data/validation_with_features_4_4_10",index_col=0,header=0)
colnames = training.columns[3:]    #First column native language, second English level, third if native.
training.native = training.native.astype('category')
validation.native = validation.native.astype('category')

In [12]:
# Train the linear SVC classifier. LinearSVC is used because svm.SVC does not accept the penalty and dual arguments.
linear = svm.LinearSVC(C=1.0, penalty="l1", dual=False)
linear.fit(training[colnames], training.native)
y_predicted = linear.predict(validation[colnames])
accur = accuracy_score(validation.native, y_predicted)
print(accur)

0.726985929712


We see that we can achieve an accuracy of 72.7% on the validation data, which is close to the 74% mentioned in the paper. Note that the accuracy of course depends on the split into training and testing data, so exact replication is out of reach. Let us investigate whether we can increase this accuracy by tuning parameters through a grid search.

#### Tuning C.
This seems to have barely any effect: scores hover around 0.727. In retrospect, this is not entirely unexpected. The parameter C controls the strength of the regularization, and increasing it lowers the bias of the classifier at the cost of a higher variance. Since we have lots of training data, we can expect it to have little effect.

In [13]:
grid = [2**i for i in range(-5, 16, 2)]
scores = []
colnames = list(training)[3:]
for C_ in grid:
    linear = svm.LinearSVC(C=C_, penalty="l1", dual=False)
    linear.fit(training[colnames], training.native)
    y_predicted = linear.predict(validation[colnames])
    validation['native'] = np.where(validation['native_lang']=='EN', "native", "non-native")
    scores.append(accuracy_score(validation.native, y_predicted))
scores

[0.72689128651649948,
 0.72711212063852604,
 0.72720676383368033,
 0.72764843207773366,
 0.72730140702883461,
 0.72761688434601557,
 0.72733295476055271,
 0.72708057290680805,
 0.72752224115086128,
 0.7267650955896271,
 0.72692283424821758]

#### Bagging for non-linear kernels & grid-search to optimize parameters.

Support Vector Machines are a generalization of Support Vector Classifiers to non-linear decision boundaries. Such non-linear decision boundaries are constructed by a "kernel trick". In Hastie, it is reported that when classes are not linearly separable, non-linear kernels may deliver drastic improvements in prediction performance over linear kernels (but at the risk of overfitting as implied by the higher flexibility).
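To make the kernel used below concrete, here is a minimal sketch of the radial basis function (RBF) kernel, $K(x, y) = \exp(-\gamma \lVert x - y \rVert^2)$, written in plain Python. The helper `rbf_kernel` and the two points are illustrative assumptions; in the notebook, `sklearn` evaluates the kernel internally.

```python
import math

def rbf_kernel(x, y, gamma):
    """Radial basis function kernel: exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

x = [0.0, 0.0]
y = [3.0, 4.0]   # squared Euclidean distance to x is 25

same = rbf_kernel(x, x, gamma=2**-5)
near = rbf_kernel(x, y, gamma=2**-5)
flat = rbf_kernel(x, y, gamma=2**-11)

print(same)  # identical points always give 1.0
print(near)  # exp(-25/32), roughly 0.458
print(flat)  # a smaller gamma makes the similarity decay more slowly
```

A larger `gamma` makes the kernel more local and the decision boundary more flexible, which is why `gamma` is tuned together with `C` later on.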

Our data are currently classified into two classes, being (i) native English speakers and (ii) non-native speakers. The second group obviously consists of many subgroups, and the separating hyperplanes between each of these subgroups with the native English speakers can be expected to be different. That the classes are linearly separable is therefore inherently not obvious and finding out whether non-linear kernels do better is an interesting problem.

However, a problem arises because the implementation of `SVC` in `sklearn` has, in the best-case scenario, complexity $O(n\_features*n\_samples^2)$, leading to very long training times. Bagging is a neat way around this problem: reducing the number of samples per estimator by a factor n_estimators reduces the cost of fitting each estimator by a factor n_estimators^2, which speeds up the calculations significantly (inspired/suggested by https://stackoverflow.com/questions/31681373/making-svm-run-faster-in-python). Let us try this by bagging with 20 estimators.
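The arithmetic behind this speed-up can be sketched as follows, assuming the training cost is proportional to n_samples^2 (the best case quoted above) and that each estimator sees a 1/n_estimators fraction of the data. The helper `relative_cost` is hypothetical, and the sample count is derived from the balanced training set (twice the 110,320 native comments).

```python
def relative_cost(n_samples, n_estimators):
    """Cost of bagged training relative to one SVC fitted on all samples.

    ASSUMPTION: cost is proportional to n_samples ** 2.
    Returns (cost per estimator, cost summed over all estimators).
    """
    full = n_samples ** 2
    per_estimator = (n_samples / n_estimators) ** 2
    return per_estimator / full, n_estimators * per_estimator / full

n_samples = 220_640   # balanced training set: 2 * 110,320 comments
per_est, total = relative_cost(n_samples, n_estimators=20)
print(per_est)  # 1/400 of the full cost per estimator
print(total)    # 1/20 of the full cost summed over all 20 estimators
```

Each estimator is thus 400 times cheaper to fit, and even without parallelism the ensemble as a whole costs about a twentieth of a single fit on all data under this assumption.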

In [14]:
# Load data and timing package.
import time

# Do scaling, as suggested by Hsu et al. The scaler is fitted on the training data and then applied to the validation data.
scaler = StandardScaler()
training[colnames] = scaler.fit_transform(training[colnames])
validation[colnames] = scaler.transform(validation[colnames])

# Train SVCs by bagging. 20 estimators will yield approximately 10.000 samples per SVM. This number should be feasible
# according to the documentation. We can speed up things by multithreading (n_jobs).
n_estimators = 20
start = time.time()
clf = BaggingClassifier(svm.SVC(kernel='rbf', C=2**5, gamma=2**-5, cache_size=2000), random_state = 1281, max_samples=1.0 / n_estimators, n_jobs = -1, n_estimators=n_estimators)
print("Fitting the classifier")
clf.fit(training[colnames],training.native)
print("Prediction out-of-sample")
y_predicted = clf.predict(validation[colnames])
end = time.time()
print("Bagging SVC trained in {} seconds reaching accuracy of {}".format(end - start, 100*accuracy_score(validation.native, y_predicted)))

Fitting the classifier
Prediction out-of-sample
Bagging SVC trained in 74.2062509059906 seconds reaching accuracy of 72.85317685658401


With fitting and prediction taking roughly 75 seconds, a grid search becomes feasible. For this grid search, we follow the recommendations by Hsu et al. ("A practical guide to Support Vector Classification"). We refrain from selecting $(C,\gamma)$ by cross-validation, as we have quite a large dataset already; instead, each pair is scored on the validation set.
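A quick back-of-the-envelope check of the size of this grid is sketched below. The runtime estimate assumes that every $(C,\gamma)$ pair takes about as long as the single bagged fit timed above, which is only a rough guide since training time varies with both parameters.

```python
# Exponentially growing grids for C and gamma, as used in the grid search cell.
Cs = [2**i for i in range(-5, 16, 2)]        # 2^-5, 2^-3, ..., 2^15
gammas = [2**i for i in range(-15, 3, 2)]    # 2^-15, 2^-13, ..., 2^1

n_pairs = len(Cs) * len(gammas)
print(n_pairs)  # 99 (C, gamma) pairs

# ASSUMPTION: each pair takes about as long as the timed run above (74.2 s).
seconds_per_fit = 74.2
estimated_hours = n_pairs * seconds_per_fit / 3600
print(estimated_hours)  # roughly 2 hours
```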

In [15]:
# Scale the data if this hasn't been done yet.
scaler = StandardScaler()
training[colnames] = scaler.fit_transform(training[colnames])
validation[colnames] = scaler.transform(validation[colnames])

# Define the grid and pre-allocate memory for scores.
Cs=[2**i for i in range(-5,16,2)]
gammas = [2**i for i in range(-15, 3, 2)]
param_grid = {'C': Cs, 'gamma': gammas, 'kernel':['rbf'], 'cache_size':[500.0]}
scores = pd.DataFrame(columns=['C','gamma','kernel','score'])

# Iterate over the grid and save scores for each (C,gamma) pair.
itern = 1
n_estimators = 20
for g in ParameterGrid(param_grid):
    svc = svm.SVC()
    svc.set_params(**g)
    clf = BaggingClassifier(svc, random_state = 1281, max_samples=1.0 / n_estimators, n_jobs = 4, n_estimators=n_estimators)
    clf.fit(training[colnames],training.native)
    y_predicted = clf.predict(validation[colnames])
    g['score'] = [accuracy_score(validation.native, y_predicted)]
    holder  = pd.DataFrame(g)
    scores = pd.concat([scores, holder])
    if (itern==1 or itern%10 == 0 or itern == 99):
        print("Iteration {}/{} completed".format(itern, len(Cs)*len(gammas)))
    itern += 1
        
scores.to_csv("results_gridsearch_SVM_rbf")

Iteration 1/99 completed
Iteration 10/99 completed
Iteration 20/99 completed
Iteration 30/99 completed
Iteration 40/99 completed
Iteration 50/99 completed
Iteration 60/99 completed
Iteration 70/99 completed
Iteration 80/99 completed
Iteration 90/99 completed
Iteration 99/99 completed


Inspection of the results of the grid search shows that we reach an accuracy of approximately 74.4% if we downsample the validation set such that it is balanced. We balanced the validation set because the training set is balanced too. Although not cross-validated, the highest accuracy was found for a gamma of 2^-11 and a C of 8. This results in approximately the same classification performance as in the paper.

In [16]:
# Train SVCs by bagging. 20 estimators will yield approximately 10.000 samples per SVM. This number should be feasible
# according to the documentation. We can speed up things by multithreading (n_jobs).
n_estimators = 20
clf = BaggingClassifier(svm.SVC(kernel='rbf', C=2**3, gamma=2**-11, cache_size=2000), random_state = 1281, max_samples=1.0 / n_estimators, n_jobs = 4, n_estimators=n_estimators)
print("Fitting the classifier")
clf.fit(training[colnames],training.native)
print("Prediction out-of-sample")
y_predicted = clf.predict(validation[colnames])
print("Bagging SVC score:",accuracy_score(validation.native, y_predicted))

Fitting the classifier
Prediction out-of-sample
Bagging SVC score: 0.746166950596


In [17]:
y_predicted = pd.DataFrame(y_predicted, validation.index)
y_predicted.columns = ["prediction"]
y_predicted = y_predicted.join(validation[["native_lang","level_english","native","num_words"]])
y_predicted.to_csv("output_SVM_RBF_classifier")

Bagging with LinearSVC shows that the increase in performance is likely due to the non-linear kernel, not to bagging itself. Let us also try decreasing the number of samples used for training each estimator. This might be a good idea to reduce the bias of the inflexible LinearSVC.

In [22]:
n_estimators = 20
max_samples=[2**i for i in range(-12,-4,2)]
scores = []
for max_sample in max_samples:
    clf = BaggingClassifier(svm.LinearSVC(C=2**3), random_state = 1281, max_samples=max_sample, n_jobs = 5, n_estimators=n_estimators)
    print("Fitting the classifier")
    clf.fit(training[colnames],training.native)
    print("Prediction out-of-sample")
    y_predicted = clf.predict(validation[colnames])
    scores.append(accuracy_score(validation.native, y_predicted))
    print("Bagging LinearSVC score for max_sample = {}:     {}".format(max_sample,accuracy_score(validation.native, y_predicted)))

Fitting the classifier
Prediction out-of-sample
Bagging LinearSVC score for max_sample = 0.000244140625:     0.7208025742949082
Fitting the classifier
Prediction out-of-sample
Bagging LinearSVC score for max_sample = 0.0009765625:     0.7208025742949082
Fitting the classifier
Prediction out-of-sample
Bagging LinearSVC score for max_sample = 0.00390625:     0.7208025742949082
Fitting the classifier
Prediction out-of-sample
Bagging LinearSVC score for max_sample = 0.015625:     0.7208025742949082


## Random Forests

We also want to try random forests for classification. An initial implementation reaches an accuracy of 73.0%, which is quite high already. Let us see if we can improve it by tuning the hyperparameters with a cross-validated random search. The cross-validation implementation was taken from https://github.com/WillKoehrsen/Machine-Learning-Projects/blob/master/random_forest_explained/Improving%20Random%20Forest%20Part%202.ipynb. Note that training a random forest is substantially faster than training the SVM classifier, so we can allow ourselves to try more parameter settings.
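The tuning further below relies on a randomized search, which evaluates only a random subset of the full parameter grid. The sketch below, in plain Python, counts how large the full grid is and how many fits a sample of 50 candidates with 3 folds requires. The grid values mirror those in the search cell; the sampling itself is only illustrative and is not how `RandomizedSearchCV` draws its candidates internally.

```python
from itertools import product
import random

# Grid values mirroring the ones used in the random search cell.
n_estimators = [200 + 50 * i for i in range(11)]                  # 200, 250, ..., 700
max_features = ["auto", "sqrt"]
max_depth = [int(10 + i * 100 / 9) for i in range(10)] + [None]   # 10, 21, ..., 110, None
min_samples_split = [20, 50, 100]
min_samples_leaf = [10, 20, 50, 100]

full_grid = list(product(n_estimators, max_features, max_depth,
                         min_samples_split, min_samples_leaf))
print(len(full_grid))   # 2904 combinations in the full grid

# Illustrative sampling of 50 candidates; with 3 folds each this gives 150 fits.
rng = random.Random(42)
candidates = rng.sample(full_grid, 50)
n_fits = len(candidates) * 3
print(n_fits)           # 150
```

Evaluating all 2904 combinations with 3 folds would take 8712 fits, so sampling 50 candidates covers less than 2% of the grid in exchange for a manageable runtime.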

In [23]:
rf = RandomForestClassifier(n_estimators=300, min_samples_leaf = 20, n_jobs = 4)
rf.fit(training[colnames],training.native)
y_predicted = rf.predict(validation[colnames])
print(accuracy_score(y_predicted,validation.native))

0.730456180201


In [24]:
from sklearn.model_selection import RandomizedSearchCV
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 700, num = 11)]
max_features = ['auto', 'sqrt']
max_depth = [int(x) for x in np.linspace(10, 110, num = 10)]
max_depth.append(None)
min_samples_split = [20, 50, 100]
min_samples_leaf = [10, 20, 50, 100]

# Create the grid.
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf}

# Allocate dataframe for the resulting scores.
scores_rf = pd.DataFrame(columns=['n_estimators','max_features','max_depth','min_samples_split','min_samples_leaf','bootstrap'])

# Start training and calculate accuracy for each set.
rf = RandomForestClassifier(random_state = 42)
rf_random = RandomizedSearchCV(estimator=rf, param_distributions=random_grid,
                              n_iter = 50, cv = 3, verbose=1, random_state=42, n_jobs=6)

# Fit the random search model
rf_random.fit(training[colnames], training.native);

# Return the best parameter set.
rf_random.best_params_

Fitting 3 folds for each of 50 candidates, totalling 150 fits


[Parallel(n_jobs=6)]: Done  38 tasks      | elapsed: 64.7min
[Parallel(n_jobs=6)]: Done 150 out of 150 | elapsed: 249.9min finished


{'max_depth': 76,
 'max_features': 'sqrt',
 'min_samples_leaf': 10,
 'min_samples_split': 20,
 'n_estimators': 250}

In [26]:
rf = RandomForestClassifier(n_estimators=250, min_samples_leaf = 10, min_samples_split = 20, max_depth = 76, max_features = 'sqrt', n_jobs = 4)
rf.fit(training[colnames],training.native)
y_predicted = rf.predict(validation[colnames])
print(accuracy_score(y_predicted,validation.native))

0.73118177803


## Open problems
- 

## Appendix

### Benchmark language_distribution against pruned_language_distribution

Here we benchmark the function `pruned_language_distribution` against `language_distribution` for a small sample of the dataset. 

In [None]:
def language_distribution(n, m, training, LANGUAGES):
    """Calculate the word n grams distribution up to n, the character n gram distribution up to m.
    @n: consider k-grams up to and including n for words, part of speech tags and word sizes.
    @m: consider k-grams up to and including m for characters.
    @training: training data to retrieve the language distribution from.
    @LANGUAGES: languages based on which we classify.
    """
    
    language_dist = {}

    for language in LANGUAGES:
        language_dist[language] = {"words": dict(zip(range(1, n+1), [FreqDist() for i in range(1, n+1)])),
                               "tags": dict(zip(range(1, n+1), [FreqDist() for i in range(1, n+1)])),
                               "chars": dict(zip(range(1, m+1), [FreqDist() for i in range(1, m+1)])),
                               "w_sizes": dict(zip(range(1, n+1), [FreqDist() for i in range(1, n+1)]))}
    
    for language, tokenized_sents, tokenized_struc in training.itertuples(index=False):
    
        # Construct n grams counts from the tokenized sentences.
        for sentence in tokenized_sents:
            token = word_tokenize(sentence)     
            wordlens = [len(word) for word in token if word.isalpha()]   
            
            for k in range(1,n+1):  
                language_dist[language]["w_sizes"][k].update(ngrams(wordlens,k))
                language_dist[language]["words"][k].update(ngrams(token,k))
                
        # Construct n gram counts from sentences tokenized based on structure.
        for sentence in tokenized_struc:
            token = word_tokenize(sentence)
            for k in range(1,n+1):
                language_dist[language]["tags"][k].update(ngrams(token,k))

        # Construct character m-grams for tokenized sentences.
        for sentence in tokenized_sents:
            for k in range(1,m+1):
                language_dist[language]["chars"][k].update(ngrams(sentence,k))
    
    return language_dist

In [None]:
import timeit

def sum_keys(d):
    return (0 if not isinstance(d, dict) else len(d) + sum(sum_keys(v) for v in d.values()))

# Note: this requires the full `df`, so run it before `df` is emptied in cell In [5].
rand_sample = df.sample(20000)
rand_sample['tokenized_sents'] = rand_sample.apply(lambda row: sent_tokenize(row['text_clean']), axis=1)
rand_sample['tokenized_struc'] = rand_sample.apply(lambda row: sent_tokenize(row['text_structure']), axis=1)

start = timeit.default_timer()
dis1 = language_distribution(4, 9, rand_sample[['native','tokenized_sents','tokenized_struc']], rand_sample.native.unique())
end = timeit.default_timer()

dis2 = pruned_language_distribution(4, 9, 1, rand_sample[['native','text_clean','text_structure']], rand_sample.native.unique())
end2 = timeit.default_timer()

print("Unpruned, time: {} sec, size: {} items, \n Pruned, time: {} sec, size: {} items".format(end-start, sum_keys(dis1), end2-end, sum_keys(dis2)))

It is evident that the pruned models are already somewhat better in terms of memory and take only twice as long to construct. If we increase `lowerfreqlimit`, this obviously becomes much better.