## Loading the data and a brief description

We start by importing the packages we need and loading the data, which we store in a pandas DataFrame. The structure is modeled on Worksheet 3 and the scripts from Wikitalk.

In [2]:
# Import necessary packages
import pandas as pd
import datetime
import numpy as np
import re
import sys
from nltk import FreqDist, ngrams, sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer

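# Load the training data (tab-separated). Malformed lines are skipped with a warning;
# on_bad_lines replaces error_bad_lines, which was removed in pandas 2.0.
df = pd.read_csv("python_data/train", sep="\t", on_bad_lines="warn", encoding="utf-8")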

# Load some datasets we will need later on. E.g. English stopwords.
eng_stopwords = stopwords.words('english') 

Let us first take a brief look at the data. There are approximately 323K comments in the training data. For each comment we have the native language of the writer (native_lang), the self-reported English level of the writer (level_english: native, 1, 2, 3, 4, 5 or unknown), the original text of the comment (text_original), the text with demonyms and proper nouns replaced by PoS tags (text_clean), and finally the structure of the text as reported by SENNA (text_structure). Of the 323K comments, about 110K (roughly 34%) are from native speakers. Thus, the training data seems sufficiently balanced.
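A quick way to put a number on the balance is to compute the share of each class directly. The sketch below is only an illustration on made-up data: the `toy` frame and its values are hypothetical, and only the column name `native_lang` and the `EN` code mirror the real data.

```python
import pandas as pd

# Hypothetical toy data; only the column name `native_lang` mirrors the real frame.
toy = pd.DataFrame({"native_lang": ["EN", "DE", "FR", "EN", "NL", "RU"]})

# Flag native speakers, then compute the share of each class.
is_native = toy["native_lang"].eq("EN")
labels = is_native.map({True: "native", False: "non-native"})
shares = labels.value_counts(normalize=True)

print(shares)
```

Applied to the real frame, the same call should reproduce the roughly one-third share of native speakers implied by the counts quoted above.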

In [3]:
df.describe()

| | native_lang | text_original | level_english | text_clean | text_structure |
|---|---|---|---|---|---|
| count | 323185 | 323185 | 323185 | 323185 | 323185 |
| unique | 20 | 317700 | 7 | 317603 | 317329 |
| top | EN | is being used on this article. I notice the im... | N | is being used on this article. I notice the im... | VBZ VBG VBN IN DT NN . PRP VBP DT NN NN VBZ IN... |
| freq | 110049 | 733 | 110049 | 733 | 733 |


In [4]:
df.native_lang.describe()

count     323185
unique        20
top           EN
freq      110049
Name: native_lang, dtype: object

In [5]:
df['native'] = np.where(df['native_lang']=='EN', "native", "non-native")
df.describe()

| | native_lang | text_original | level_english | text_clean | text_structure | native |
|---|---|---|---|---|---|---|
| count | 323185 | 323185 | 323185 | 323185 | 323185 | 323185 |
| unique | 20 | 317700 | 7 | 317603 | 317329 | 2 |
| top | EN | is being used on this article. I notice the im... | N | is being used on this article. I notice the im... | VBZ VBG VBN IN DT NN . PRP VBP DT NN NN VBZ IN... | non-native |
| freq | 110049 | 733 | 110049 | 733 | 733 | 213136 |


## Tokenizing text

The paper on which we have based our project uses similarity scores to word and character n-gram models as the features for subsequent classification. We will construct such models too. Other literature has shown that, for character n-grams, increasing n seems to improve classification. We will therefore construct character n-gram models for n up to 10.
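To make the notion of word and character n-gram counts concrete, here is a minimal sketch using only the standard library. It is a stand-in for the `ngrams` and `FreqDist` helpers from NLTK used in this notebook, not a replacement for them; the helper name `ngram_counts` and the example sentence are made up for illustration.

```python
from collections import Counter

def ngram_counts(sequence, n):
    """Count all contiguous n-grams (as tuples) in a sequence."""
    items = list(sequence)
    return Counter(tuple(items[i:i + n]) for i in range(len(items) - n + 1))

sentence = "the cat sat on the mat"

# Word bigrams: the sequence is the list of whitespace-separated tokens.
word_bigrams = ngram_counts(sentence.split(), 2)

# Character trigrams: the sequence is the raw string, spaces included.
char_trigrams = ngram_counts(sentence, 3)

print(word_bigrams.most_common(3))
print(char_trigrams[("t", "h", "e")])
```

Note that character n-grams here include spaces, so they also capture word boundaries. Larger n produces many more distinct n-grams, which is worth keeping in mind when going up to n = 10.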

We start by splitting each comment, and its PoS structure, into sentences.

In [22]:
# Apply the sentence tokenizer column-wise; no need to pass whole rows.
df['tokenized_sents'] = df['text_clean'].apply(sent_tokenize)
df['tokenized_struc'] = df['text_structure'].apply(sent_tokenize)

Next we count n-grams separately for native and non-native speakers, which gives us frequency distributions of each group's language use.

In [89]:
def language_distribution(n, m, training):
    """Count word, PoS-tag and word-length n-grams up to order n and character n-grams up to order m.

    The first column of `training` holds the group label; the result maps each label to its FreqDists.
    """

    language_dist = {}

    # Take the group labels from the data itself (the original relied on a global LANGUAGES list).
    for language in training.iloc[:, 0].unique():
        # One empty FreqDist per n-gram order for every feature type.
        language_dist[language] = {"words": {k: FreqDist() for k in range(1, n+1)},
                                   "tags": {k: FreqDist() for k in range(1, n+1)},
                                   "chars": {k: FreqDist() for k in range(1, m+1)},
                                   "w_sizes": {k: FreqDist() for k in range(1, n+1)}}
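    # itertuples yields plain tuples in column order, which avoids positional
    # indexing into a labelled row and is much faster than iterrows.
    for language, tokenized_sent, tokenized_struc in training.itertuples(index=False, name=None):

        # Each row holds the speaker's group label, the sentence-split clean text
        # and the sentence-split text structure, in that order.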

        # Construct n grams counts from the tokenized sentences.
        for sentence in tokenized_sent:
            token = word_tokenize(sentence)
            wordlens = [len(word) for word in token]            
            for k in range(1,n+1):
                language_dist[language]["words"][k].update(ngrams(token,k))
                language_dist[language]["w_sizes"][k].update(ngrams(wordlens,k))
                
        # Construct n gram counts from sentences tokenized based on structure.
        for sentence in tokenized_struc:
            token = word_tokenize(sentence)
            for k in range(1,n+1):
                language_dist[language]["tags"][k].update(ngrams(token,k))

        # Construct character m-grams for tokenized sentences.
        for sentence in tokenized_sent:
            for k in range(1,m+1):
                language_dist[language]["chars"][k].update(ngrams(sentence,k))
                
    return language_dist

z = language_distribution(4,10,df[['native','tokenized_sents','tokenized_struc']])


In [86]:
z["non-native"]

{'chars': {1: FreqDist({(' ',): 886,
            ('!',): 2,
            ('"',): 7,
            ('#',): 6,
            ('%',): 1,
            ("'",): 13,
            ('(',): 10,
            (')',): 9,
            (',',): 38,
            ('-',): 4,
            ('.',): 54,
            ('/',): 4,
            ('0',): 9,
            ('1',): 6,
            ('2',): 4,
            ('3',): 3,
            ('5',): 2,
            ('6',): 3,
            ('7',): 1,
            ('8',): 1,
            ('9',): 4,
            (':',): 12,
            ('?',): 5,
            ('A',): 5,
            ('B',): 4,
            ('D',): 3,
            ('E',): 2,
            ('H',): 3,
            ('I',): 23,
            ('J',): 23,
            ('L',): 4,
            ('M',): 2,
            ('N',): 166,
            ('O',): 4,
            ('P',): 84,
            ('R',): 4,
            ('S',): 9,
            ('T',): 12,
            ('U',): 4,
            ('V',): 1,
            ('W',): 2,
            ('Y',): 1,
         

One problem we run into is that #URL#, which we have used as a replacement for URLs, is interpreted by the NLTK tokenizer as the three tokens #, URL and #. This would distort the n-gram counts, so we should change this.
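One possible way to keep such placeholders intact is sketched below. This is an assumption about how the fix could look, not what the notebook does: it uses a simple regular-expression tokenizer from the standard library rather than NLTK's `word_tokenize`, so it will not split contractions and the like in the same way. The names `TOKEN_PATTERN` and `tokenize_keep_placeholders` are hypothetical.

```python
import re

# Match a #PLACEHOLDER# first, then ordinary words, then any single punctuation mark.
TOKEN_PATTERN = re.compile(r"#[A-Z]+#|\w+|[^\w\s]")

def tokenize_keep_placeholders(text):
    """Split text into tokens while keeping placeholders such as #URL# intact."""
    return TOKEN_PATTERN.findall(text)

tokens = tokenize_keep_placeholders("See #URL# for details, please.")
print(tokens)
```

An alternative that keeps the NLTK tokenizer would be to rewrite the placeholder into a purely alphabetic token before tokenizing, at the cost of having to pick a token that never occurs in the comments.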