## Exploring Naive Bayes Classifiers using [SMS Spam Collection Set from UCI](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection#)

#### Analysis follows Chapter 4 of *Machine Learning with R* by Brett Lantz (though of course here we use Python, not R)

#### Objective:  Use a classifier to predict whether an SMS message is spam or not, using accuracy (% correct) as the metric.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
import itertools

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score, accuracy_score
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.porter import PorterStemmer
from nltk.stem.lancaster import LancasterStemmer
from nltk.stem import WordNetLemmatizer

from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC

from sklearn.pipeline import Pipeline

from sklearn.base import BaseEstimator, TransformerMixin

warnings.filterwarnings("ignore")
%matplotlib
sns.set(style="white", color_codes=True)

Using matplotlib backend: MacOSX


### 1. Data loading and exploration

In [2]:
#data from https://archive.ics.uci.edu/ml/datasets/sms+spam+collection#
data = pd.read_csv('SMSSpamCollection', sep='\t', header=0, names=['Type','Text'])

In [3]:
data.head(5)

Unnamed: 0,Type,Text
0,ham,Ok lar... Joking wif u oni...
1,spam,Free entry in 2 a wkly comp to win FA Cup fina...
2,ham,U dun say so early hor... U c already then say...
3,ham,"Nah I don't think he goes to usf, he lives aro..."
4,spam,FreeMsg Hey there darling it's been 3 week's n...


In [4]:
data.shape

(5571, 2)

In [5]:
data.Type.value_counts()

ham     4824
spam     747
Name: Type, dtype: int64

Trying 3 different stemmers available in nltk package: Snowball, Porter, and Lancaster.

In [6]:
def stem_data(data, stemmer):
    return data.apply(lambda x: ' '.join([stemmer.stem(y) for y in x.split(' ')]))

data['Stemmed'] = stem_data(data['Text'], SnowballStemmer('english') )
data['Porter'] = stem_data(data['Text'], PorterStemmer() )
data['Lancaster'] = stem_data(data['Text'], LancasterStemmer() )

data.head()

Unnamed: 0,Type,Text,Stemmed,Porter,Lancaster
0,ham,Ok lar... Joking wif u oni...,ok lar... joke wif u oni...,Ok lar... joke wif u oni...,ok lar... jok wif u oni...
1,spam,Free entry in 2 a wkly comp to win FA Cup fina...,free entri in 2 a wkli comp to win fa cup fina...,free entri in 2 a wkli comp to win FA cup fina...,fre entry in 2 a wkly comp to win fa cup fin t...
2,ham,U dun say so early hor... U c already then say...,u dun say so earli hor... u c alreadi then say...,U dun say so earli hor... U c alreadi then say...,u dun say so ear hor... u c already then say...
3,ham,"Nah I don't think he goes to usf, he lives aro...","nah i don't think he goe to usf, he live aroun...","nah I don't think he goe to usf, he live aroun...","nah i don't think he goe to usf, he liv around..."
4,spam,FreeMsg Hey there darling it's been 3 week's n...,freemsg hey there darl it been 3 week now and ...,freemsg hey there darl it' been 3 week' now an...,freemsg hey ther darl it's been 3 week's now a...


Code the 'Type' field as 0 or 1 for the classifier, and drop from the main dataset:

In [7]:
y = data['Type'].map({'ham':0, 'spam':1})
X = data.drop(labels=['Type'], axis=1)

In [8]:
X.head(5)

Unnamed: 0,Text,Stemmed,Porter,Lancaster
0,Ok lar... Joking wif u oni...,ok lar... joke wif u oni...,Ok lar... joke wif u oni...,ok lar... jok wif u oni...
1,Free entry in 2 a wkly comp to win FA Cup fina...,free entri in 2 a wkli comp to win fa cup fina...,free entri in 2 a wkli comp to win FA cup fina...,fre entry in 2 a wkly comp to win fa cup fin t...
2,U dun say so early hor... U c already then say...,u dun say so earli hor... u c alreadi then say...,U dun say so earli hor... U c alreadi then say...,u dun say so ear hor... u c already then say...
3,"Nah I don't think he goes to usf, he lives aro...","nah i don't think he goe to usf, he live aroun...","nah I don't think he goe to usf, he live aroun...","nah i don't think he goe to usf, he liv around..."
4,FreeMsg Hey there darling it's been 3 week's n...,freemsg hey there darl it been 3 week now and ...,freemsg hey there darl it' been 3 week' now an...,freemsg hey ther darl it's been 3 week's now a...


### 2. Create test and training sets

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=7234)

### 3. Create initial model on training data

We'd like to investigate the effects of the following:

1. Using a stemmer vs. no stemmer
2. Using word-based vs. character-based n-grams
2. Using tf-idf representation or raw n-gram counts
3. Using different types of classifiers:  Naive Bayes, SGD, SVM

And all of these will be cross-validated using different ranges of parameters.

In [10]:
def GridSearchCV_results( vectorizer_list, classifier_list, field_list, X, y, score ):
    '''
    Runs the GridSearchCV function on given inputs:
        Lists of:
        - classifiers
        - vectorizers
        - input fields to classify
        X, y are pandas dataframes
        score is a string indicating the value to optimize, e.g. 'accuracy'
    Output: dictionary where key = tuple of parameters, value = score
    '''
    results = {}

    for v in vectorizer_list:
        for c in classifier_list:
        
            pipeline = Pipeline([ (v[0], v[1]), (c[0], c[1]) ])

            parameters = {}
            parameters.update(v[2])
            parameters.update(c[2])

            grid_search = GridSearchCV(pipeline, parameters, scoring=score, verbose=0, n_jobs=4)
            
            for field in field_list:
                results[(field, v[0], c[0])] = grid_search.fit(X[field],y)
                
    return results

In [11]:
# train parameters first on just the original Text
X_train_list = ['Text']

# test effects of different vectorizers and their parameters
vectorizer_list = [
    
    ('cv-word', CountVectorizer(analyzer='word'), {

        'cv-word__max_features': (None, 2000, 5000, 10000),
        'cv-word__ngram_range': ((1, 1), (1, 2)) })
    
    ,('cv-char', CountVectorizer(analyzer='char'), {

        'cv-char__max_features': (None, 2000),
        'cv-char__ngram_range': ((3, 3), (3, 4) )})
    
    ,('tfidf-word', TfidfVectorizer(analyzer='word'), {

        'tfidf-word__max_features': (None, 2000, 5000, 10000),
        'tfidf-word__ngram_range': ((1, 1), (1, 2)),
        'tfidf-word__smooth_idf': (True, False),
        'tfidf-word__norm': ('l2','l1', None)})
    
    ,('tfidf-char', TfidfVectorizer(analyzer='char'), {

        'tfidf-char__max_features': (None, 2000),
        'tfidf-char__ngram_range': ((3, 3), (3, 4)),
        'tfidf-char__smooth_idf': (True, False),
        'tfidf-char__norm': ('l2','l1', None)})
]

# test effects of different classifiers and their parameters
classifier_list = [
    
    # naive bayes
    ('mnb', MultinomialNB(), {'mnb__alpha': (1, .1, .01, .001, .0001, .00001)})
    
    # linear regression classifier
    ,('sgd', SGDClassifier(loss='log'), {
    'sgd__alpha': (0.01, 0.001, 0.0001, 0.00001, 0.000001),
    'sgd__penalty': ('none', 'l1', 'l2', 'elasticnet'),
    'sgd__l1_ratio': (.2, .5, .8)})
    
    # support vector machine
    ,('svm-linear', SVC(kernel='linear'), {'svm-linear__C': (1.0, 10.0, 100.0)})
    ,('svm-rbf', SVC(kernel='rbf'), {'svm-rbf__C': (1.0, 10.0, 100.0), 'svm-rbf__gamma': (1e-06, 1e-5, 1e-4)})
    ,('svm-poly', SVC(kernel='poly'), {'svm-poly__C': (1.0, 10.0, 100.0),
                                       'svm-poly__gamma': (1e-06, 1e-5, 1e-4), 'svm-poly__degree': (2, 3, 4)})
]

#run grid search on above parameters
grid_results = GridSearchCV_results( 
    vectorizer_list, classifier_list, X_train_list, X_train, y_train, score = 'accuracy' 
    )




In [12]:
for key, val in sorted(grid_results.items()): 
    print(str(key) + ': ' + str(val.best_score_))

('Text', 'cv-char', 'mnb'): 0.98999743524
('Text', 'cv-char', 'sgd'): 0.986919723006
('Text', 'cv-char', 'svm-linear'): 0.986150294947
('Text', 'cv-char', 'svm-poly'): 0.873044370351
('Text', 'cv-char', 'svm-rbf'): 0.986150294947
('Text', 'cv-word', 'mnb'): 0.987432675045
('Text', 'cv-word', 'sgd'): 0.986406770967
('Text', 'cv-word', 'svm-linear'): 0.984098486791
('Text', 'cv-word', 'svm-poly'): 0.872018466273
('Text', 'cv-word', 'svm-rbf'): 0.976147730187
('Text', 'tfidf-char', 'mnb'): 0.990510387279
('Text', 'tfidf-char', 'sgd'): 0.990253911259
('Text', 'tfidf-char', 'svm-linear'): 0.989228007181
('Text', 'tfidf-char', 'svm-poly'): 0.971787637856
('Text', 'tfidf-char', 'svm-rbf'): 0.987176199025
('Text', 'tfidf-word', 'mnb'): 0.988202103103
('Text', 'tfidf-word', 'sgd'): 0.988458579123
('Text', 'tfidf-word', 'svm-linear'): 0.98974095922
('Text', 'tfidf-word', 'svm-poly'): 0.931777378815
('Text', 'tfidf-word', 'svm-rbf'): 0.982816106694


The best accuracy (>99%) is achieved by both Naive Bayes and linear regression classifiers, using tfidf at the character level.  The worst is the SVM with the polynomial kernel.

Next steps:  try optimizing both Naive Bayes and linear regression to get even better accuracy.  But first, let's look at the SVM with the polynomial kernel just to see if it can be improved with parameter tuning, since it's so much less accurate than the others.

### 4. Additional parameter tuning for each algorithm

The SVM with the polynomial kernel has the worst results.  Before getting rid of it completely, let's try higher values of both C and gamma to see if that improves things.

In [13]:
# train parameters first on just the original Text
X_train_list = ['Text']

# test effects of different vectorizers and their parameters
vectorizer_list = [
    
    ('cv-word', CountVectorizer(analyzer='word'), {

        'cv-word__max_features': (None, 2000, 5000, 10000),
        'cv-word__ngram_range': ((1, 1), (1, 2)) })
    
    ,('cv-char', CountVectorizer(analyzer='char'), {

        'cv-char__max_features': (None, 2000),
        'cv-char__ngram_range': ((3, 3), (3, 4) )})
    
    ,('tfidf-word', TfidfVectorizer(analyzer='word'), {

        'tfidf-word__max_features': (None, 2000, 5000, 10000),
        'tfidf-word__ngram_range': ((1, 1), (1, 2)),
        'tfidf-word__smooth_idf': (True, False),
        'tfidf-word__norm': ('l2','l1', None)})
    
    ,('tfidf-char', TfidfVectorizer(analyzer='char'), {

        'tfidf-char__max_features': (None, 2000),
        'tfidf-char__ngram_range': ((3, 3), (3, 4)),
        'tfidf-char__smooth_idf': (True, False),
        'tfidf-char__norm': ('l2','l1', None)})
]

# test effects of different classifiers and their parameters
classifier_list = [
    
    # support vector machine
    ('svm-poly', SVC(kernel='poly'), {'svm-poly__C': (500.0, 1000.0, 5000.0),
                                       'svm-poly__gamma': (1e-03, 1e-2, 1e-1), 'svm-poly__degree': (2, 3, 4)})
]

#run grid search on above parameters
svm_poly_results = GridSearchCV_results( 
    vectorizer_list, classifier_list, X_train_list, X_train, y_train, score = 'accuracy' 
    )





In [14]:
for key, val in sorted(svm_poly_results.items()): 
    print(key[1] + ' ' + key[0] + ': ' + str(val.best_score_))
    print(val.best_params_)

cv-char Text: 0.975121826109
{'svm-poly__degree': 2, 'svm-poly__gamma': 0.001, 'svm-poly__C': 5000.0, 'cv-char__max_features': 2000, 'cv-char__ngram_range': (3, 4)}
cv-word Text: 0.96947935368
{'cv-word__max_features': 2000, 'svm-poly__degree': 2, 'cv-word__ngram_range': (1, 1), 'svm-poly__gamma': 0.01, 'svm-poly__C': 1000.0}
tfidf-char Text: 0.977943062324
{'tfidf-char__norm': 'l2', 'tfidf-char__smooth_idf': True, 'svm-poly__degree': 2, 'tfidf-char__max_features': 2000, 'svm-poly__gamma': 0.1, 'svm-poly__C': 500.0, 'tfidf-char__ngram_range': (3, 4)}
tfidf-word Text: 0.972557065914
{'tfidf-word__smooth_idf': True, 'svm-poly__degree': 2, 'tfidf-word__max_features': 2000, 'tfidf-word__norm': 'l2', 'svm-poly__gamma': 0.1, 'svm-poly__C': 1000.0, 'tfidf-word__ngram_range': (1, 1)}


Much better, but still not as good as the other algorithms.

Next, let's look at SGD:

In [15]:
for key, val in sorted(grid_results.items()): 
    if key[2] == 'sgd' :
        print(key[1] + ': ' + str(val.best_score_))
        print(val.best_params_)

cv-char: 0.986919723006
{'sgd__l1_ratio': 0.2, 'sgd__penalty': 'elasticnet', 'cv-char__max_features': None, 'sgd__alpha': 0.001, 'cv-char__ngram_range': (3, 3)}
cv-word: 0.986406770967
{'cv-word__max_features': 2000, 'sgd__l1_ratio': 0.2, 'cv-word__ngram_range': (1, 1), 'sgd__penalty': 'l2', 'sgd__alpha': 0.0001}
tfidf-char: 0.990253911259
{'sgd__l1_ratio': 0.5, 'tfidf-char__norm': 'l2', 'tfidf-char__smooth_idf': False, 'tfidf-char__ngram_range': (3, 3), 'sgd__alpha': 1e-06, 'tfidf-char__max_features': None, 'sgd__penalty': 'elasticnet'}
tfidf-word: 0.988458579123
{'sgd__l1_ratio': 0.2, 'tfidf-word__ngram_range': (1, 2), 'sgd__alpha': 1e-06, 'tfidf-word__norm': 'l2', 'tfidf-word__smooth_idf': False, 'sgd__penalty': 'elasticnet', 'tfidf-word__max_features': None}


Let's stick to L2 and try a wider range of alphas, and also fix a few other parameters:

In [16]:
for i in np.logspace(-8,-3, num=10):print('%.3g' % i) 

1e-08
3.59e-08
1.29e-07
4.64e-07
1.67e-06
5.99e-06
2.15e-05
7.74e-05
0.000278
0.001


In [17]:
# train parameters first on just the original Text
X_train_list = ['Text']

# test effects of different vectorizers and their parameters
vectorizer_list = [
    
    ('cv-word', CountVectorizer(analyzer='word'), {

        'cv-word__max_features': (None, 2000, 5000, 10000),
        'cv-word__ngram_range': ((1, 1), (1, 2)) })
    
    ,('cv-char', CountVectorizer(analyzer='char'), {

        'cv-char__max_features': (None, 2000),
        'cv-char__ngram_range': ((3, 3), (3, 4) )})
    
    ,('tfidf-word', TfidfVectorizer(analyzer='word'), {

        'tfidf-word__max_features': (None, 2000, 5000, 10000),
        'tfidf-word__ngram_range': ((1, 1), (1, 2)),
        'tfidf-word__smooth_idf': (True, False),
        'tfidf-word__norm': ('l2','l1', None)})
    
    ,('tfidf-char', TfidfVectorizer(analyzer='char'), {

        'tfidf-char__max_features': (None, 2000),
        'tfidf-char__ngram_range': ((3, 3), (3, 4)),
        'tfidf-char__smooth_idf': (True, False),
        'tfidf-char__norm': ('l2','l1', None)})
]

# test effects of different classifiers and their parameters
classifier_list = [

    
    # linear classifier
    ('sgd', SGDClassifier(loss='log', penalty='l2'), {
    'sgd__alpha': (1e-08,3.59e-08,1.29e-07,4.64e-07,1.67e-06,5.99e-06,2.15e-05,7.74e-05,0.000278,0.001)})
    
]

#run grid search on above parameters
sgd_results = GridSearchCV_results( 
    vectorizer_list, classifier_list, X_train_list, X_train, y_train, score = 'accuracy' 
    )



In [18]:
for key, val in sorted(sgd_results.items()): 
    print(key[1] + ': ' + str(val.best_score_))
    print(val.best_params_)

cv-char: 0.986663246986
{'cv-char__max_features': None, 'sgd__alpha': 0.001, 'cv-char__ngram_range': (3, 4)}
cv-word: 0.985637342908
{'cv-word__max_features': None, 'cv-word__ngram_range': (1, 1), 'sgd__alpha': 0.000278}
tfidf-char: 0.988458579123
{'tfidf-char__max_features': None, 'tfidf-char__norm': 'l2', 'tfidf-char__smooth_idf': True, 'tfidf-char__ngram_range': (3, 4), 'sgd__alpha': 1.67e-06}
tfidf-word: 0.987176199025
{'tfidf-word__norm': 'l2', 'tfidf-word__smooth_idf': True, 'tfidf-word__max_features': None, 'sgd__alpha': 4.64e-07, 'tfidf-word__ngram_range': (1, 2)}


Still just under 99% for each of these.  

MNB still seems to produce the best results so let's focus on that.  Let's fix the best vectorizer now and try optimizing:
- alpha
- different stemmers
- N-gram range

In [19]:
[i for i in np.logspace(-2,0, num=10)]

[0.01,
 0.016681005372000592,
 0.027825594022071243,
 0.046415888336127774,
 0.077426368268112694,
 0.12915496650148839,
 0.21544346900318834,
 0.35938136638046259,
 0.59948425031894093,
 1.0]

In [20]:
X_train_list = ['Text', 'Stemmed', 'Porter', 'Lancaster']

vectorizer_list = [
        ('tfidf-char', TfidfVectorizer(analyzer='char', norm='l2', smooth_idf=False), {
        
        'tfidf-char__max_features': (None, 2000, 5000, 10000, 20000),
        'tfidf-char__ngram_range': ((3, 4), (3, 5))})
]
classifier_list = [
    
    ('mnb', MultinomialNB(), {'mnb__alpha': (0.01,0.016,0.027,0.046,0.077,0.129,0.215,0.359,0.599,1.0)}),
    
    #('ada', AdaBoostClassifier(), {'ada__n_estimators': (10, 50, 100, 200)})
]

#run grid search on above parameters
mnb_results = GridSearchCV_results( 
    vectorizer_list, classifier_list, X_train_list, X_train, y_train, score = 'accuracy' 
    )

In [21]:
for key, val in sorted(mnb_results.items()): 
    print(key[1] +  ' ' + key[0] + ': ' + str(val.best_score_))
    print(val.best_params_)

tfidf-char Lancaster: 0.991792767376
{'tfidf-char__max_features': 10000, 'tfidf-char__ngram_range': (3, 4), 'mnb__alpha': 0.129}
tfidf-char Porter: 0.991792767376
{'tfidf-char__max_features': 10000, 'tfidf-char__ngram_range': (3, 4), 'mnb__alpha': 0.129}
tfidf-char Stemmed: 0.991792767376
{'tfidf-char__max_features': 10000, 'tfidf-char__ngram_range': (3, 4), 'mnb__alpha': 0.129}
tfidf-char Text: 0.991792767376
{'tfidf-char__max_features': 10000, 'tfidf-char__ngram_range': (3, 4), 'mnb__alpha': 0.129}


So the best alpha in this group (.129) gives 99.18% accuracy, a .13% increase.  The results are the same for all of the stemmers, so let's just use the original text for simplicity.

Finally, let's see whethet AdaBoost gives any further improvement:

In [22]:
X_train_list = ['Text']

vectorizer_list = [
        ('tfidf-char', TfidfVectorizer(analyzer='char'), {
        'tfidf-char__max_features': (10000,),
        'tfidf-char__ngram_range': ((3, 4),),
        'tfidf-char__norm': ('l2',)})
]
classifier_list = [
    
    ('mnb', MultinomialNB(), {'mnb__alpha': (0.129,)})
    
    ,('ada', AdaBoostClassifier(MultinomialNB(alpha=0.129)), 
        {'ada__n_estimators': (10, 50, 100, 200)})
]
adaboost_results = GridSearchCV_results(
    vectorizer_list, classifier_list, X_train_list, X_train, y_train, score = 'accuracy' 
)

In [23]:
for key, val in adaboost_results.items(): 
    print(key[2], str(val.best_score_))
    print(val.best_params_)

ada 0.980764298538
{'ada__n_estimators': 100, 'tfidf-char__max_features': 10000, 'tfidf-char__norm': 'l2', 'tfidf-char__ngram_range': (3, 4)}
mnb 0.991023339318
{'tfidf-char__max_features': 10000, 'tfidf-char__norm': 'l2', 'tfidf-char__ngram_range': (3, 4), 'mnb__alpha': 0.129}


So AdaBoost isn't better.  Looks like the Naive Bayes classifier give the best results on the training set, with an accuracy of 99.1%!

### 5. Run optimized classifier on test data

In [24]:
pipeline = Pipeline([('tfidf-char', TfidfVectorizer(analyzer='char', max_features=10000, ngram_range=(3, 4), norm='l2')) \
                            ,('mnb', MultinomialNB(alpha=0.129)) ])
pipeline.fit(X_train['Text'],y_train)
predictions = pipeline.predict(X_test['Text'])
print('Training set accuracy: %.4f' % pipeline.score(X_train['Text'], y_train))
print('Test set accuracy: %.4f' % pipeline.score(X_test['Text'], y_test))


Training set accuracy: 0.9931
Test set accuracy: 0.9844


Now the confusion matrix:

In [25]:
confusion_matrix(y_test, predictions)

array([[1418,    6],
       [  20,  228]])

Just for fun, let's see which items were misclassified:

In [26]:
pd.options.display.max_colwidth = 180
pd.DataFrame(data = {'Text of misclassified spam':X_test['Text'][predictions != y_test], 'Actual status':y_test[predictions != y_test]})


Unnamed: 0,Actual status,Text of misclassified spam
750,1,"Do you realize that in about 40 years, we'll have thousands of old ladies running around with tattoos?"
83,0,Yup next stop.
5045,0,"We have sent JD for Customer Service cum Accounts Executive to ur mail id, For details contact us"
4255,1,Block Breaker now comes in deluxe format with new features and great graphics from T-Mobile. Buy for just £5 by replying GET BBDELUXE and take the challenge
3889,0,Unlimited texts. Limited minutes.
1429,1,For sale - arsenal dartboard. Good condition but no doubles or trebles!
4297,1,thesmszone.com lets you send free anonymous and masked messages..im sending this message from there..do you see the potential for abuse???
3573,1,You won't believe it but it's true. It's Incredible Txts! Reply G now to learn truly amazing things that will blow your mind. From O2FWD only 18p/txt
3863,1,"Oh my god! I've found your number again! I'm so glad, text me back xafter this msgs cst std ntwk chg £1.50"
3301,1,RCT' THNQ Adrian for U text. Rgds Vatian


### 6. Summary

We were able to correctly classify about 98.4% of the test set by optimizing the parameters for a Naive Bayes Classifier.  This is slightly larger than the accuracy reported by Lantz ("almost 98%").

To arrive at the optimal classifier, we explored:

Different classification algorithms:
- Naive Bayes
- Linear Regression
- Support Vector Machine (with linear, polynomial, and RBF kernels)
- AdaBoost (on Naive Bayes)

Stemming algorithms:
- Original text
- Porter
- Snowball
- Lancaster

Vectorizers:
- Count (at both the char and word level)
- TF-IDF (at both the char and word level)

Regularization parameters:
- l2 vs l1 vs elastic
- alpha

The optimal parameter set was found to be:
- Naive Bayes, with alpha = 0.129
- Original text (no stemming)
- Count vectorizer (with 3- and 4-char n-grams)