A number of things were tested for this model

* A variety of different parameters to countVectorizer and tfidfVectorizor including
  * a number of different preprocessing steps
    * regular expressions to remove some oddities
    * an attempt to correct spellings and conjoined words
  * ngrams from (1,10) - code not here as it did not have any success
* Use of GridSearchCV to find the optimal alpha for each classifier
  
Interesting observations

* Depending on which method we use to score the classifier, AUC or F1, the choice of parameters differ. All the top scores for AUC are MultinomianNB using TfidfVectorizer, and all the top scores for F1 are BernoulliNB using CountVectorizer.  See https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/ for the details.
* Spell checking is very slow.  One model replacing the tokenizer with one that did a spell check on the words took over 24 hours to run.  Instead we decided to run the spell checker on the data and save it so that the effort was not repeated.  The result took 4 processing cores (4 threads doing a portion of the work each) 24 hours to complete.


Results:  
Best scores for AUC:

<table>
<tr><th> label </th><th> model </th><th> alpha </th><th> type </th><th> preprocessor </th><th> tokenizer </th><th> max_features </th><th> stop_words </th><th> lowercase </th><th> strip_accents </th><th> aucdev</th></tr>
<tr><td> toxic </td><td> bern </td><td> 0.5 </td><td> count </td><td> 0 </td><td> 0 </td><td>  </td><td> english </td><td> TRUE </td><td> unicode </td><td> 0.852321727</td></tr>
<tr><td> severe_toxic </td><td> bern </td><td> 2.0 </td><td> count </td><td> 0 </td><td> 0 </td><td> 4000 </td><td> english </td><td> TRUE </td><td> ascii </td><td> 0.942104059</td></tr>
<tr><td> obscene </td><td> bern </td><td> 10.0 </td><td> count </td><td> 0 </td><td> 0 </td><td> 4000 </td><td> english </td><td> TRUE </td><td> ascii </td><td> 0.893736805</td></tr>
<tr><td> threat </td><td> bern </td><td> 0.5 </td><td> count </td><td> 0 </td><td> 0 </td><td> 5000 </td><td> english </td><td> TRUE </td><td>  </td><td> 0.838735331</td></tr>
<tr><td> insult </td><td> bern </td><td> 10.0 </td><td> count </td><td> 0 </td><td> 0 </td><td> 4000 </td><td> english </td><td> TRUE </td><td> ascii </td><td> 0.878101445</td></tr>
<tr><td> identity_hate </td><td> bern </td><td> 2.0 </td><td> count </td><td> 0 </td><td> 0 </td><td> 6000 </td><td>  </td><td> TRUE </td><td> ascii </td><td> 0.82812169</td></tr>
<tr><td> </td></tr>
<tr><td> Average </td><td> &nbsp; </td><td> &nbsp;</td><td> &nbsp; </td><td> &nbsp; </td><td> &nbsp; </td><td> &nbsp; </td><td>  </td><td> &nbsp; </td><td> &nbsp; </td><td> 0.872187</td></tr>
<tr><td> </td></tr>
</table>

Best scores for F1:
<table>
<tr><th> label </th><th> model </th><th> alpha </th><th> type </th><th> preprocessor </th><th> tokenizer </th><th> max_features </th><th> stop_words </th><th> lowercase </th><th> strip_accents </th><th> f1dev </th></tr>
<tr><td> toxic </td><td> multi </td><td> 0.1 </td><td> tfidf </td><td> 0 </td><td> 0 </td><td> 6000 </td><td> english </td><td> TRUE </td><td> ascii </td><td> 0.950819672 </td></tr>
<tr><td> severe_toxic </td><td> multi </td><td> 0.5 </td><td> tfidf </td><td> 0 </td><td> 0 </td><td> 10000 </td><td> english </td><td> TRUE </td><td> ascii </td><td> 0.990763086 </td></tr>
<tr><td> obscene </td><td> multi </td><td> 0.5 </td><td> tfidf </td><td> 0 </td><td> 0 </td><td> 4000 </td><td> english </td><td> TRUE </td><td> ascii </td><td> 0.9735999 </td></tr>
<tr><td> threat </td><td> multi </td><td> 0.1 </td><td> tfidf </td><td> 0 </td><td> 0 </td><td> 4000 </td><td> english </td><td> TRUE </td><td>  </td><td> 0.996879421 </td></tr>
<tr><td> insult </td><td> multi </td><td> 0.1 </td><td> tfidf </td><td> 0 </td><td> 0 </td><td> 5000 </td><td>  </td><td> TRUE </td><td> ascii </td><td> 0.969231089 </td></tr>
<tr><td> identity_hate </td><td> multi </td><td> 0.5 </td><td> tfidf </td><td> 0 </td><td> 0 </td><td> 6000 </td><td> english </td><td> TRUE </td><td>
 ascii </td><td> 0.991303986 </td></tr>
<tr><td> Average </td><td> &nbsp; </td><td> &nbsp; </td><td> &nbsp; </td><td> &nbsp; </td><td> &nbsp; </td><td> &nbsp; </td><td> &nbsp; </td><td> &nbsp; </td><td> &nbsp; </td><td> 0.978766 </td></tr>  
</table>

In [1]:
import numpy as np
import pandas as pd

#sklearn imports
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.grid_search import GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer

#scipy imports
from scipy.sparse import hstack

#Visualization imports
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec 
import bokeh
#! pip install bokeh

# target classes
target_names = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']



In [2]:
# read frames localy through csv
train_df = pd.read_csv("../data/train.csv")
test_df = pd.read_csv("../data/test.csv")

# Random index generator for splitting training data
# Note: Each rerun of cell will create new splits.
randIndexCut = np.random.rand(len(train_df)) < 0.7

#S plit up data
test_data = test_df["comment_text"]
dev_data, dev_labels = train_df[~randIndexCut]["comment_text"], train_df[~randIndexCut][target_names]
train_data, train_labels = train_df[randIndexCut]["comment_text"], train_df[randIndexCut][target_names]

print('total training observations:', train_df.shape[0])
print('training data shape:', train_data.shape)
print('training label shape:', train_labels.shape)
print('dev label shape:', dev_labels.shape)
print('labels names:', target_names)

total training observations: 159571
training data shape: (111471,)
training label shape: (111471, 6)
dev label shape: (48100, 6)
labels names: ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']


In [3]:
# Imports etc. used in this analysis
import string
import datetime
import re
from collections import Counter

from enchant import DictWithPWL
from enchant.checker import SpellChecker
import difflib

from sklearn import metrics

punctuation = "[\!\?\"\#\$\%\&\(\)\*\+\,\.\/\:\;\<\=\>\?\@\[\]\^\_\`\{\|\}\~\']"

ModuleNotFoundError: No module named 'enchant'

In [4]:
# from http://norvig.com/spell-correct.html
# This is the Norvig spell checker and requires the storage of a "big.txt"
# file with a corpus of words that it uses for predictions

def words(text): return re.findall(r'\w+', text.lower())

WORDS = Counter(words(open('../data/big.txt').read()))

def norvig_P(word, N=sum(WORDS.values())): 
    "Probability of `word`."
    return WORDS[word] / N

def norvig_correction(word): 
    "Most probable spelling correction for word."
    return max(norvig_candidates(word), key=norvig_P)

def norvig_candidates(word): 
    "Generate possible spelling corrections for word."
    return (known([word]) or known(edits1(word)) or known(edits2(word)) or [word])

def known(words): 
    "The subset of `words` that appear in the dictionary of WORDS."
    return set(w for w in words if w in WORDS)

def edits1(word):
    "All edits that are one edit away from `word`."
    letters    = 'abcdefghijklmnopqrstuvwxyz'
    splits     = [(word[:i], word[i:])    for i in range(len(word) + 1)]
    deletes    = [L + R[1:]               for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R)>1]
    replaces   = [L + c + R[1:]           for L, R in splits if R for c in letters]
    inserts    = [L + c + R               for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edits2(word): 
    "All edits that are two edits away from `word`."
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))

In [5]:
# Functions to support finding and correcting spellings
# using pyenchant for spell checking
from pyenchant import DictWithPWL
from enchant.checker import SpellChecker
import difflib
# import splitter # not useful, does a worse job than my implementation

# mywords.txt currently contains:
# - list of firstnames and surnames gathered from internet searches
# http://www.birkenhoerdt.net/surnames-all.php?tree=1
# www.tysto.com/uk-us-spelling-list.html
my_dict=DictWithPWL('en_US', "../data/mywords.txt")
my_checker = SpellChecker(my_dict)

punctuation = "[\!\?\"\#\$\%\&\(\)\*\+\,\.\/\:\;\<\=\>\?\@\[\]\^\_\`\{\|\}\~\']"

# list of swear words correctly spelt courtesy of https://www.noswearing.com/

def my_preprocessor(textblock):
    # u -> you
    # c -> see
    # k -> okay
    return_words = textblock

#     return_words = re.sub(r"[^A-Za-z0-9,!?*.;\u2019´'\/]", " ", return_words)
    return_words = re.sub(r"[^A-Za-z0-9]", " ", return_words)
    return_words = re.sub(r","," ",return_words)
    return_words = re.sub(r"\.\.+"," ",return_words)
    return_words = re.sub(r"\."," ",return_words)
    return_words = re.sub(r"\("," ", return_words)
    return_words = re.sub(r"\)"," ", return_words)
    return_works = re.sub(r"\;", " ", return_words)
    return_words = re.sub(r":"," ", return_words)
    return_words = re.sub(r"´", "'", return_words)
    return_words = re.sub(r"`", "'", return_words)
    return_words = re.sub(r"''+", "'", return_words)
    return_words = re.sub(r" '", " ", return_words)
    return_words = re.sub(r"' ", " ", return_words)
    return_words = re.sub(r"\"", " ", return_words)
    return_words = re.sub(r"\/", " ", return_words)
    return_words = re.sub(r"\!\!+", "!!", return_words)
    return_words = re.sub(r"\?\?+", "?!", return_words)
    return_words = re.sub(r"\!", " !", return_words)
    return_words = re.sub(r"\?", " ?", return_words)
    return_words = re.sub(r"\b\d+\b", "999", return_words)
    # slang and abbreviations, need to be aware of capitolization and spaces
    return_words = re.sub(r"[Ww]on't", "will not", return_words)
    return_words = re.sub(r"n't", " not", return_words)
    return_words = re.sub(r"'s\b", " is", return_words)
    return_words = re.sub(r"\b[Aa]bt\b", "about", return_words)
    return return_words

def trysplit(word, verbose=False):
    split_candidates = []
    max_proba = 0.0
    for i in range(1,len(word)):
        # I will only allow single letters of 'a' and 'i', all others ignored.  Pyenchant allows for
        # any single letter to be a legitimate word, and so too does norvig.  The dictionary defines
        # them as nouns that represent the letter, however even though several can be used in slang
        # (e.g. k->okay, c->see, u->you) using them in conjoined words would make the splitting far
        # too difficult and also human understanding much more difficult #howucthisk, u c?
        if (len(word[:i]) != 1 or (word[:i].lower() == 'a' or word[:i].lower() == 'i')) and (
            len(word[i:]) != 1 or (word[i:].lower() == 'a' or word[i:].lower() == 'i')):
            if my_checker.check(word[:i]) and my_checker.check(word[i:]):
                norvig_score = norvig_P(word[:i]) + norvig_P(word[i:])
                if norvig_score > max_proba:
                    max_proba = norvig_score
                    split_candidates = [word[:i],word[i:]]
    for i in range(1,len(word)):
        for j in range(i+1,len(word)):        
            if (len(word[:i]) != 1 or (word[:i].lower() == 'a' or word[:i].lower() == 'i')) and (
                len(word[i:j]) != 1 or (word[i:j].lower() == 'a' or word[i:j].lower() == 'i')) and (
                len(word[i:]) != 1 or (word[i:].lower() == 'a' or word[i:].lower() == 'i')):
                
                if my_checker.check(word[:i]) and my_checker.check(word[i:j]) and my_checker.check(word[j:]):
                    norvig_score = norvig_P(word[:i]) + norvig_P(word[i:j]) + norvig_P(word[j:])
                    if norvig_score > max_proba:
                        max_proba = norvig_score
                        split_candidates = [word[:i],word[i:j],word[j:]]
    for i in range(1,len(word)):
        for j in range(i+1,len(word)):
            for k in range(j+1,len(word)):
                if (len(word[:i]) != 1 or (word[:i].lower() == 'a' or word[:i].lower() == 'i')) and (
                    len(word[i:j]) != 1 or (word[i:j].lower() == 'a' or word[i:j].lower() == 'i')) and (
                    len(word[j:k]) != 1 or (word[j:k].lower() == 'a' or word[j:k].lower() == 'i')) and (
                    len(word[k:]) != 1 or (word[k:].lower() == 'a' or word[k:].lower() == 'i')):
                    verbose and print("making it here with i=%s j=%s k=%s %s  max_proba=%d" %(word[:i],word[i:j],word[j:k],word[k:], max_proba))
                    verbose and print("lengths are %d %d %d %d" % (len(word[:i]), len(word[i:j]),len(word[j:k]),len(word[k:])))
                    if my_checker.check(word[:i]) and my_checker.check(word[i:j]) and my_checker.check(word[j:k]) and my_checker.check(word[k:]):
                        verbose and print('found words ' + word[i:] + ' ' + word[k:])
                        norvig_score = norvig_P(word[:i]) + norvig_P(word[i:j]) + norvig_P(word[j:k]) + norvig_P(word[k:])
                        if norvig_score > max_proba:
                            verbose and print("found higher probability %d with %s %s %s %s" % (norvig_score, word[:i], word[i:j], word[j:k], word[k:]))
                            max_proba = norvig_score
                            split_candidates = [word[:i],word[i:j],word[j:k],word[k:]]
    return split_candidates

def get_best_candidates(word):
    best_words = []
    best_ratio = 0
    a = set(my_checker.suggest(word))
    for b in a:
        if not '-' in b:
            tmp = difflib.SequenceMatcher(None, word, b).ratio()
            if tmp > best_ratio:
                best_words=[b]
                best_ratio=tmp
            elif tmp == best_ratio:
                best_words.append(b)
    return best_words
    
def fix_spellings(textblock, verbose=False):
    textblock = re.sub("[^A-Za-z0-9,!?*.;\\u2019\´\'\"\\\/] ", "", textblock)
    textblock = re.sub(r"\(\)", " ", textblock)
    textblock = re.sub(r'([a-zA-Z_ ])\1+', r'\1\1',textblock)
    words = textblock.split()
    return_list = []

    for word in words:
        if my_checker.check(word) or my_checker.check(word.lower()) or word in punctuation or\
            any(i.isdigit() or i == '_' for i in word) or (word[-1].lower() == 's' and my_checker.check(word[:-1].lower())):
            return_list.append(word)
        elif len(word) < 100:            
            candidates = get_best_candidates(word)
            if len(candidates) == 1:
                return_list.append(candidates.pop())
            elif len(candidates) > 1:
                # try another spell checker
                nv_candidates = norvig_candidates(word)
                tmp_set = set(nv_candidates).intersection(set(candidates))
                if len(tmp_set) == 1:
                    # only 1 overlap, should be correct
                    return_list.append(tmp_set.pop())
                elif len(nv_candidates) == 1 and next(iter(nv_candidates)) == word:
                        # this is suspicious, pyenchants' "suggest" method always returns something, however if
                        # norvigs method cannot find a suitable match within a short distance then it simply
                        # returns the orignal word.  This section is for potentially conjoined words
                        tmp_list=trysplit(word)

                        # If we get back a list of split words then use these
                        if len(tmp_list) != 0:
                            return_list.extend(tmp_list)
                        else:
                            return_list.append(word)
                else:
                    # arbitrary now, just going to use the first one found from pyenchant, even though
                    # I have seen norvig get the correct word sometimes when pyenchant gets it wrong
                    return_list.append(candidates[0])
            else:
                nv_candidates = norvig_candidates(word)
                if len(nv_candidates) > 0:
                    return_list.append(nv_candidates[0])
                else:
                    return_list.append(word)
        else:
            return_list.append(word)

    return ' '.join(return_list)

print('done')

ModuleNotFoundError: No module named 'pyenchant'

In [6]:
# Preprocessing functions:

def my_preprocessor_eng(textblock):
    """ This function is a simple set of regular expressions to remove/replace some punctuation
    and replace some abbreviations
    
    Args:
        textbloc (string): a string of words to run the expressions against
    Returns:
        a string of adjusted text
    """
    return_words = textblock
    return_words = re.sub(r"[^A-Za-z0-9]?!'`:´()", " ", return_words)
    return_words = re.sub(r","," ",return_words)
    return_words = re.sub(r"\.\.+"," ",return_words)
    return_words = re.sub(r"\."," ",return_words)
    return_words = re.sub(r"\("," ", return_words)
    return_words = re.sub(r"\)"," ", return_words)
    return_works = re.sub(r"\;", " ", return_words)
    return_words = re.sub(r":"," ", return_words)
    return_words = re.sub(r"´", "'", return_words)
    return_words = re.sub(r"`", "'", return_words)
    return_words = re.sub(r"''+", "'", return_words)
    return_words = re.sub(r" '", " ", return_words)
    return_words = re.sub(r"' ", " ", return_words)
    return_words = re.sub(r"\"", " ", return_words)
    return_words = re.sub(r"\/", " ", return_words)
    return_words = re.sub(r"\!\!+", "!!", return_words)
    return_words = re.sub(r"\?\?+", "?!", return_words)
    return_words = re.sub(r"\!", " !", return_words)
    return_words = re.sub(r"\?", " ?", return_words)
    return_words = re.sub(r"\b\d+\b", "999", return_words)
    # slang and abbreviations, need to be aware of capitolization and spaces
    return_words = re.sub(r"[Ww]on't", "will not", return_words)
    return_words = re.sub(r"n't", " not", return_words)
    return_words = re.sub(r"'s\b", " is", return_words)
    return_words = re.sub(r"\b[Aa]bt\b", "about", return_words)
    return return_words

In [7]:
import nltk
# These imports enable the use of NLTKPreprocessor in an sklearn Pipeline
from sklearn.base import BaseEstimator, TransformerMixin
from nltk.corpus import stopwords as sw
from nltk.corpus import wordnet as wn
from nltk.tokenize import punkt as punkt
from nltk import wordpunct_tokenize
from nltk import WordNetLemmatizer
from nltk import sent_tokenize
from nltk import pos_tag

nltk.download('stopwords')
nltk.download('punkt')


class NLTKPreprocessor(BaseEstimator, TransformerMixin):
    """Text preprocessor using NLTK tokenization and Lemmatization

    This class is to be used in an sklean Pipeline, prior to other processers like PCA/LSA/classification
    Attributes:
        lower: A boolean indicating whether text should be lowercased by preprocessor
                default: True
        strip: A boolean indicating whether text should be stripped of surrounding whitespace, underscores and '*'
                default: True
        stopwords: A set of words to be used as stop words and thus ignored during tokenization
                default: built-in English stop words
        punct: A set of punctuation characters that should be ignored
                default: None
        lemmatizer: An object that should be used to lemmatize tokens
    """

    def __init__(self, stopwords=None, punct=None,
                 lower=True, strip=True):
        self.lower      = lower
        self.strip      = strip
        self.stopwords  = stopwords or set(sw.words('english'))
        self.punct      = punct or set(string.punctuation)
        self.lemmatizer = WordNetLemmatizer()

    def fit(self, X, y=None):
        return self

    def inverse_transform(self, X):
        return [" ".join(doc) for doc in X]

    def transform(self, X):
        return [
            list(self.tokenize(doc)) for doc in X
        ]

    def tokenize(self, document):

        # Break the document into sentences
        for sent in sent_tokenize(str(document)):

            # Break the sentence into part of speech tagged tokens
            for token, tag in pos_tag(wordpunct_tokenize(sent)):
                # Apply preprocessing to the token
                token = token.lower() if self.lower else token
                token = token.strip() if self.strip else token
                token = token.strip('_') if self.strip else token
                token = token.strip('*') if self.strip else token

                # If stopword, ignore token and continue
                if token in self.stopwords:
                    continue

                # If punctuation, ignore token and continue
                if all(char in self.punct for char in token):
                    continue

                # Lemmatize the token and yield
                lemma = self.lemmatize(token, tag)
                
                # S
                yield lemma

    def lemmatize(self, token, tag):
        tag = {
            'N': wn.NOUN,
            'V': wn.VERB,
            'R': wn.ADV,
            'J': wn.ADJ
        }.get(tag[0], wn.NOUN)

        return self.lemmatizer.lemmatize(token, tag)

def identity(arg):
    """
    Simple identity function works as a passthrough.
    """
    return arg

nltkPreprocessor = NLTKPreprocessor()
print('%s: Converting training data with NLTK Preprocessor' %(str(datetime.datetime.now().time())))
nltkPreprocessor.fit(train_data)
train_preproc_data = nltkPreprocessor.transform(train_data)
print('%s: Converting dev data with NLTK Preprocessor' %(str(datetime.datetime.now().time())))
nltkPreprocessor.fit(dev_data)
dev_preproc_data = nltkPreprocessor.transform(dev_data)
print('%s: done' %(str(datetime.datetime.now().time())))

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/burgew/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /Users/burgew/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
12:23:18.319466: Converting training data with NLTK Preprocessor
12:32:36.947439: Converting dev data with NLTK Preprocessor
12:36:06.838004: done


In [8]:
# Calculation of scores on dev set and training set
def score_f1_auc_on_train_dev(dev_vector, train_vector, name, ctype='multi'):
    """This function creates a Naive Bayes classifier with the input vectors
    and then calculates both the AUC score and F1 score for the training and dev data.
    
    Args:
        dev_vector: the processed vector of dev data
        train_vector: the processed vector of training data
        name (string) : the label name to test
        ctype: multi, gaus or bern, choses between multinomial or bernoulli
    Returns:
        alpha: the best alpha value for this classifier
        f1scoredev: the F1 score for dev
        aucdev: the AUC score for dev
        f1scoretrain: the F1 score for training
        auctrain: the AUC score for training
    """
    alphas = {'alpha': [0.0, 0.0001, 0.001, 0.01, 0.1, 0.5, 1.0, 2.0, 10.0, 15.0, 20.0, 50.0, 100.0]}

    if ctype == 'multi':
        nb_class = MultinomialNB().fit(train_vector, train_labels[name])
    elif ctype == 'bern':
        nb_class = BernoulliNB().fit(train_vector, train_labels[name])
    elif ctype == 'gaus':
        nb_class = GaussianNB().fit(train_vector, train_labels[name])
    else:
        print('ctype = %s, error' % (ctype))
    
    # use this to generate the best fitting model
    clf = GridSearchCV(nb_class, param_grid = alphas, scoring='roc_auc')
    clf.fit(train_vector, train_labels[name])
    
    predicted_labels_dev = clf.predict(dev_vector)
    fpr, tpr, thresholds = metrics.roc_curve(dev_labels[name], predicted_labels_dev)
    
    predicted_labels_train = clf.predict(train_vector)
    fpr1, tpr1, thresholds1 = metrics.roc_curve(train_labels[name], predicted_labels_train)
    
    f1scoredev = metrics.f1_score(dev_labels[name],predicted_labels_dev,average='micro')
    f1scoretrain = metrics.f1_score(train_labels[name],predicted_labels_train,average='micro')
    
    aucdev = metrics.auc(fpr,tpr)
    auctrain = metrics.auc(fpr1,tpr1)
    
    return clf.best_params_, f1scoredev,aucdev,f1scoretrain,auctrain

In [9]:
import copy

vectors_all=pd.DataFrame(columns=['vectortrain', 'vectordata','type','preprocessor', 'tokenizer',
                                  'max_features', 'stop_words', 'lowercase', 'strip_accents' ])

# this set of loops works through the chosen parameters creating a count and tfidf vectorizer
# and using the preprocessor or not (4 per iteration).  These are stored in the vectors_all
# dataframe along with a list of the parameters that were used to create each on
print(str(datetime.datetime.now().time()))
index=1

for i in None, 4000, 5000, 6000, 10000:
    print('%s: Doing i = %s' %(str(datetime.datetime.now().time()), i))
    for x in None, 'english':
        for y in None, 'ascii', 'unicode':
#             for z in True, False:
            z=False  # always came out best in prior tests
            print(index)
            index +=1

#             vect = CountVectorizer(max_features=i, stop_words=x, strip_accents=y, lowercase=z)
#             vect_train = vect.fit_transform(train_data)
#             vect_dev = vect.transform(dev_data)
#             vectors_all.loc[vectors_all.shape[0]] = [vect_train, vect_dev, 'count', 0, 0, i, x, z, y]
            # Same but with the preprocessor
    
# Note that this vectorizer is created with a passthru tokenizer(identity), no preprocessor and no lowercasing
# This is to account for the NLTKPreprocessor already taking care of these.
            vect = CountVectorizer(tokenizer=identity, max_features=i, stop_words=x, 
                                   preprocessor=None,lowercase=False)
            vect_train= vect.fit_transform(train_preproc_data)
            vect_dev= vect.fit_transform(dev_preproc_data)
            vectors_all.loc[vectors_all.shape[0]] = [vect_train, vect_dev, 'count', 1, 0, i, x, z, y]
#             vect = TfidfVectorizer(max_features=i, stop_words=x, strip_accents=y, analyzer='word',lowercase=z)
#             vect_train = vect.fit_transform(train_data)
#             vect_dev = vect.transform(dev_data)
#             vectors_all.loc[vectors_all.shape[0]] = [vect_train, vect_dev, 'tfidf', 0, 0, i, x, z, y]
            # Same but with the preprocessor
#            vect = TfidfVectorizer(max_features=i, stop_words=x, analyzer='word',strip_accents=y, lowercase=z)

# Note that this vectorizer is created with a passthru tokenizer(identity), no preprocessor and no lowercasing
# This is to account for the NLTKPreprocessor already taking care of these.
            vect = TfidfVectorizer(ngram_range=(1,2), min_df=5, max_df=.7, max_features=i, analyzer='word',
                              tokenizer=identity, preprocessor=None, lowercase=False, stop_words=x)
#             vect_train= map(vect.fit_transform,train_preproc_data)
#             vect_dev= map(vect.fit_transform,dev_preproc_data)
            vect_train = vect.fit_transform(train_preproc_data)
            vect_dev = vect.transform(dev_preproc_data)
            vectors_all.loc[vectors_all.shape[0]] = [vect_train, vect_dev, 'tfidf', 1, 0, i, x, z, y]

print(str(datetime.datetime.now().time()))
print(vectors_all.shape)

12:47:43.904467
12:47:43.904943: Doing i = None
1
2
3
4
5
6
12:50:05.973100: Doing i = 4000
7
8
9
10
11
12
12:52:32.184922: Doing i = 5000
13
14
15
16
17
18
12:54:53.674513: Doing i = 6000
19
20
21
22
23
24
12:57:15.779264: Doing i = 10000
25
26
27
28
29
30
12:59:37.631506
(60, 9)


In [10]:
def calculate_f1auc_all_models (vectors_all):
    """This function takes a vector of type vectors_all (defined above) acts as a wrapper
    to send each vector to the score_f1_auc_on_train_dev function selecting first multinomialNB
    and after that Bernoulli.  It collects the resulting scores and the best alpha and stores
    the results in a dataframe which is returned once all the results are calculated
    
    Args:
        Vectors_all (datafram) : a dataframe defined above that stores the vector data in each row
    Returns
        dataframe: A dataframe of all the resulting scores and the details for each model
    """
    data_all=pd.DataFrame(columns=['vectorno', 'label', 'model','alpha', 'type', 'preprocessor', 'tokenizer', 
                                   'max_features', 'stop_words', 'lowercase', 'strip_accents',
                                   'f1dev','aucdev','f1train','auctrain'])
    for index,row in vectors_all.iterrows():
        for name in target_names:
            alpha, tmpf1dev,tmpaucdev,tmpf1train,tmpauctrain = score_f1_auc_on_train_dev(
                train_vector=row['vectortrain'],
                dev_vector=row[1],name=name, ctype='multi')
            data_all.loc[data_all.shape[0]] = [index,name,'multi',alpha,row['type'], 
                                row['preprocessor'], row['tokenizer'], row['max_features'], 
                                row['stop_words'], row['lowercase'], row['strip_accents'],
                                tmpf1dev,tmpaucdev,tmpf1train,tmpauctrain]
            alpha, tmpf1dev,tmpaucdev,tmpf1train,tmpauctrain = score_f1_auc_on_train_dev(
                train_vector=row[0],
                dev_vector=row[1],name=name, ctype='bern')
            data_all.loc[data_all.shape[0]] = [index,name,'bern',alpha,row['type'], 
                                row['preprocessor'], row['tokenizer'], row['max_features'], 
                                row['stop_words'], row['lowercase'], row['strip_accents'],
                                tmpf1dev,tmpaucdev,tmpf1train,tmpauctrain]
    print('done')
    return data_all
    

In [14]:
# separated out.  This took over 24 hours to run as the fix_spellings tokenizer
# seems to have been very slow.  As an alternative we will preprocess our data
# before generating the dataframe

print(str(datetime.datetime.now().time()))
count_vect_plain_pre_token10k = CountVectorizer(tokenizer=fix_spellings, max_features=10000, 
                                                strip_accents='ascii', lowercase=True)
print(str(datetime.datetime.now().time()))
X_train_counts_plain_pre_token10k = count_vect_plain_pre_token6k.fit_transform(train_data)
print(str(datetime.datetime.now().time()))
X_dev_counts_plain_pre_token10k = count_vect_plain_pre_token6k.transform(dev_data)
print(str(datetime.datetime.now().time()))

11:50:26.156608


NameError: name 'fix_spellings' is not defined

In [11]:
# insert the vectors from the previous cell into the large dataframe of vectors
vectors_all.loc[vectors_all.shape[0]] = [X_train_counts_plain_pre_token6k, 
                                         X_dev_counts_plain_pre_token6k,
                                         'count', 0, 1, 10000, None, True, 'ascii']

NameError: name 'X_train_counts_plain_pre_token6k' is not defined

In [12]:
# Finally pull everything together and look for the top results for
# both F1 and AUC scores

top_aucf1_results = pd.DataFrame(columns=['vectorno', 'label', 'model','alpha', 'type', 'preprocessor', 'tokenizer', 
                                   'max_features', 'stop_words', 'lowercase', 'strip_accents',
                                   'f1dev','aucdev','f1train','auctrain'])

# calculate the f1 and auc for all the models
resultdf = calculate_f1auc_all_models(vectors_all)

for label in target_names:
    df_tmp = resultdf.loc[resultdf['label'] == label]
    top_aucf1_results.loc[top_aucf1_results.shape[0]] = df_tmp.loc[df_tmp['aucdev'].idxmax()]
    top_aucf1_results.loc[top_aucf1_results.shape[0]] = df_tmp.loc[df_tmp['f1dev'].idxmax()]
    
print(top_aucf1_results)

  'setting alpha = %.1e' % _ALPHA_MIN)
  'setting alpha = %.1e' % _ALPHA_MIN)
  'setting alpha = %.1e' % _ALPHA_MIN)


ValueError: dimension mismatch

In [12]:
# copy and paste from the previous section as I don't want to overwrite the results there

top_aucf1_results = pd.DataFrame(columns=['vectorno', 'label', 'model','alpha', 'type', 'preprocessor', 'tokenizer', 
                                   'max_features', 'stop_words', 'lowercase', 'strip_accents',
                                   'f1dev','aucdev','f1train','auctrain'])
print('%s starting' %(str(datetime.datetime.now().time())))
# calculate the f1 and auc for all the models
resultdf = calculate_f1auc_all_models(vectors_all)

for label in target_names:
    df_tmp = resultdf.loc[resultdf['label'] == label]
    top_aucf1_results.loc[top_aucf1_results.shape[0]] = df_tmp.loc[df_tmp['aucdev'].idxmax()]
    top_aucf1_results.loc[top_aucf1_results.shape[0]] = df_tmp.loc[df_tmp['f1dev'].idxmax()]
    
print(top_aucf1_results)
print('%s ending' %(str(datetime.datetime.now().time())))

07:10:41.176595 starting
done
    vectorno          label  model            alpha   type  preprocessor  \
0         60          toxic   bern  {'alpha': 15.0}  count             0   
1         86          toxic  multi   {'alpha': 0.5}  tfidf             0   
2         40   severe_toxic   bern   {'alpha': 2.0}  count             0   
3         10   severe_toxic  multi  {'alpha': 0.01}  tfidf             0   
4         40        obscene   bern  {'alpha': 10.0}  count             0   
5         94        obscene  multi   {'alpha': 0.5}  tfidf             0   
6         32         threat   bern   {'alpha': 1.0}  count             0   
7         54         threat  multi   {'alpha': 0.1}  tfidf             0   
8         44         insult   bern  {'alpha': 10.0}  count             0   
9        102         insult  multi   {'alpha': 0.1}  tfidf             0   
10        60  identity_hate   bern   {'alpha': 1.0}  count             0   
11       110  identity_hate  multi   {'alpha': 0.1}  tfidf

In [29]:
# copy and paste from the previous section as I don't want to overwrite the results there

top_aucf1_results = pd.DataFrame(columns=['vectorno', 'label', 'model','alpha', 'type', 'preprocessor', 'tokenizer', 
                                   'max_features', 'stop_words', 'lowercase', 'strip_accents',
                                   'f1dev','aucdev','f1train','auctrain'])
print('%s starting' %(str(datetime.datetime.now().time())))
# calculate the f1 and auc for all the models
resultdf = calculate_f1auc_all_models(vectors_all)

for label in target_names:
    df_tmp = resultdf.loc[resultdf['label'] == label]
    top_aucf1_results.loc[top_aucf1_results.shape[0]] = df_tmp.loc[df_tmp['aucdev'].idxmax()]
    top_aucf1_results.loc[top_aucf1_results.shape[0]] = df_tmp.loc[df_tmp['f1dev'].idxmax()]
    
print(top_aucf1_results)
print('%s ending' %(str(datetime.datetime.now().time())))

18:47:22.163790 starting


TypeError: float() argument must be a string or a number, not 'map'

In [None]:
pd.DataFrame(top_aucf1_results).to_csv('f1auc_scores.csv')
pd.DataFrame(resultdf).to_csv('all_NB_results.csv')