# Linear Model Overview
This notebook generates non-neural-network predictions from term frequency-inverse document frequency (TF-IDF) features computed on both word-level and character-level n-grams. The features are scaled by Naive Bayes (NB) log-count ratios and then fed into a Logistic Regression model, following the approach Wang and Manning demonstrated with NB and Support Vector Machines (SVM): https://nlp.stanford.edu/pubs/sidaw12_simple_sentiment.pdf

On its own, this model generated a leaderboard score of 0.9800, which is approximately the top 60th percentile. However, it blended very well in an ensemble with the various RNN predictions because its approach is completely different.

This portion of the project was trained on my local machine, so the data loading and handling will be more traditional compared to the handling of data using Google's Datalab and Storage buckets.

In [51]:
# general use
import pandas as pd
import numpy as np
pd.options.mode.chained_assignment = None

# visualization
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
from my_plot import PrettyPlot
PrettyPlot(plt);

In [52]:
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [53]:
import os
os.environ['OMP_NUM_THREADS'] = '4'

# Data Loading
There are six binary classes to predict, listed in `list_classes` below. The training and testing comments are each a 1-d vector of text (words, sentences, or paragraphs) of varying lengths. Depending on the wording of the text, several categories can apply to the same comment, i.e. a comment can be toxic, severe_toxic, and obscene at the same time.

In [54]:
train = pd.read_csv('../input/train.csv')
test = pd.read_csv('../input/test.csv')

In [55]:
list_classes = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
y_train = train[list_classes].values

In [56]:
train_comments = train['comment_text']
test_comments = test['comment_text']

# Tokenizing Comments
The code below uses TF-IDF to create 1–3 gram word tokens, as well as 2–6 gram character tokens. The word and character tokens are created separately and then concatenated to form the final feature set. 20000 word features and 40000 character features were kept for modeling. Unicode accents were stripped during the transformation, and English stop words were removed from the word tokens. The fitting process was performed on both the training and testing datasets, which could potentially lead to overfitting due to data leakage, but was necessary to improve performance because these sets were drawn from slightly different distributions due to changes in the competition.

In [57]:
def vectorizer(transformer, train, test):
    # fit the vocabulary and IDF weights on train and test together
    transformer.fit(list(train) + list(test))
    
    # transform each split separately using the shared vocabulary
    train_features = transformer.transform(train)
    test_features = transformer.transform(test)
    
    return train_features, test_features

In [58]:
word_vectorizer = TfidfVectorizer(sublinear_tf=True,
                                  strip_accents='unicode',
                                  analyzer='word',
                                  token_pattern=r'\w{1,}',
                                  stop_words='english',
                                  ngram_range=(1, 3),
                                  max_features=20000)

# stop_words is omitted here: scikit-learn only applies it when analyzer='word'
char_vectorizer = TfidfVectorizer(sublinear_tf=True,
                                  strip_accents='unicode',
                                  analyzer='char',
                                  ngram_range=(2, 6),
                                  max_features=40000)

In [59]:
train_word_features, test_word_features = vectorizer(word_vectorizer, train_comments, test_comments)

In [60]:
train_char_features, test_char_features = vectorizer(char_vectorizer, train_comments, test_comments)

In [61]:
from scipy.sparse import hstack

X_train = hstack([train_word_features, train_char_features], format='csr')
X_test = hstack([test_word_features, test_char_features], format='csr')
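
The word/character pipeline above can be tried on a small scale. The following is a minimal sketch on a hypothetical five-sentence corpus, with shorter n-gram ranges and no `max_features` cap so the toy vocabulary stays small. The `toy_*` names are illustrative and not part of the notebook.

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus; the notebook uses train_comments / test_comments.
toy_train = ["you are a nice person", "you are an idiot", "what a lovely day"]
toy_test = ["such a nice day", "total idiot"]

# Smaller settings than the notebook, chosen only to keep the example readable.
toy_word = TfidfVectorizer(sublinear_tf=True, analyzer='word',
                           token_pattern=r'\w{1,}', ngram_range=(1, 2))
toy_char = TfidfVectorizer(sublinear_tf=True, analyzer='char',
                           ngram_range=(2, 3))

# Fit on train + test together, mirroring vectorizer() above.
toy_word.fit(toy_train + toy_test)
toy_char.fit(toy_train + toy_test)

# Transform each split, then place word and character columns side by side.
toy_X_train = hstack([toy_word.transform(toy_train),
                      toy_char.transform(toy_train)], format='csr')
toy_X_test = hstack([toy_word.transform(toy_test),
                     toy_char.transform(toy_test)], format='csr')

n_word = len(toy_word.vocabulary_)
n_char = len(toy_char.vocabulary_)
print(toy_X_train.shape, toy_X_test.shape, n_word, n_char)
```

The column count of the stacked matrix equals the word vocabulary size plus the character vocabulary size, and both splits share the same columns because both vectorizers were fitted once on the combined text.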

# Creating Model Functions
The four functions below provide the Bayesian-Logistic Regression model, as well as the cross-validation and train/predict code. The first function calculates, for each feature, a smoothed Naive Bayes estimate conditioned on the label being positive or negative. The log-ratio of these two estimates is then multiplied by the standard features, and the result is fed into the Logistic Regression model for the final output.

## Creating Bayesian-Logistic Regression Models

### Bayesian probabilities

In [62]:
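# smoothed per-feature estimate of p(X | y = y_i), for y_i = 1 or 0
def pr(y_i, y):
    # sum each feature over the rows of X_train whose label equals y_i
    p = X_train[y==y_i].sum(0)
    # add-one smoothing on both the feature sums and the class count
    return (p+1) / ((y==y_i).sum()+1)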
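
Written out, `pr` and the log-ratio taken in the next cell correspond to the expressions below. The symbols are introduced here for illustration: x_n is row n of `X_train`, y_n is its label, and N_c is the number of training rows with label c. The sum, division, and logarithm are all elementwise over the feature columns.

```latex
\begin{aligned}
p_c &= \frac{1 + \sum_{n:\, y_n = c} x_n}{1 + N_c}, \qquad c \in \{0, 1\} \\
r &= \log \frac{p_1}{p_0}
\end{aligned}
```

A feature that carries more TF-IDF weight in positive comments gets r > 0, and one that is more typical of negative comments gets r < 0.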

### Modeling on Bayesian outputs

In [86]:
from sklearn.linear_model import LogisticRegression

def get_mdl(estimator, y):
    r = np.log(pr(1,y) / pr(0,y))
    X_train_nb = X_train.multiply(r)
    
    return estimator.fit(X_train_nb, y), r
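
The scaling step can be followed by hand on a tiny example. The sketch below uses a hypothetical 6x3 feature matrix and a helper `toy_pr` that applies the same formula as `pr` but takes the matrix as an argument instead of reading the global `X_train`. All names in it are illustrative.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: 6 documents, 3 features, binary labels.
X_toy = csr_matrix(np.array([[1., 0., 1.],
                             [2., 0., 0.],
                             [1., 1., 0.],
                             [0., 2., 1.],
                             [0., 1., 0.],
                             [0., 3., 1.]]))
y_toy = np.array([1, 1, 1, 0, 0, 0])

def toy_pr(X, y_i, y):
    # Same smoothed estimate as pr(), with the feature matrix passed in.
    p = X[y == y_i].sum(0)
    return (p + 1) / ((y == y_i).sum() + 1)

# Log-count ratio: positive where a feature is more common in class 1.
r_toy = np.asarray(np.log(toy_pr(X_toy, 1, y_toy) / toy_pr(X_toy, 0, y_toy)))

# Scale each feature column by its ratio, then fit the logistic regression.
X_toy_nb = X_toy.multiply(r_toy).tocsr()
clf = LogisticRegression().fit(X_toy_nb, y_toy)

# Any data passed to the model must be scaled by the same r first.
probs = clf.predict_proba(X_toy.multiply(r_toy).tocsr())[:, 1]
print(np.round(r_toy, 3))
```

Here the class-1 feature sums are [4, 1, 1] and the class-0 sums are [0, 6, 2], so after smoothing the ratio is log([5, 2/7, 2/3]): the first feature is pushed up and the other two are pushed down before the linear model sees them.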

## Creating Cross-Validation and Train/Predict Functions

### Cross validation on Bayesian Outputs

In [16]:
from sklearn.model_selection import cross_val_score

def cross_val(estimator, y):
    y = y.values
    r = np.log(pr(1,y) / pr(0,y))
    X_train_nb = X_train.multiply(r)
    results = cross_val_score(estimator, X=X_train_nb, y=y, cv=3, scoring='roc_auc')
    
    return np.mean(results)
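
One caveat about `cross_val`: the ratio `r` is computed from all training labels before the folds are split, so each validation fold has already influenced the features it is scored on, and the resulting score may be somewhat optimistic. A possible alternative, sketched below under the assumption that recomputing `r` per fold is affordable, derives the ratio from the training rows of each fold only. The helpers `nb_ratio` and `cross_val_foldwise` are hypothetical and the data is synthetic.

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

def nb_ratio(X, y):
    # Smoothed log-count ratio, same formula as pr() but on the rows passed in.
    p = (X[y == 1].sum(0) + 1) / ((y == 1).sum() + 1)
    q = (X[y == 0].sum(0) + 1) / ((y == 0).sum() + 1)
    return np.asarray(np.log(p / q))

def cross_val_foldwise(estimator, X, y, n_splits=3, seed=0):
    # Hypothetical helper: r is recomputed from the training rows of each fold.
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for tr_idx, va_idx in skf.split(X, y):
        r = nb_ratio(X[tr_idx], y[tr_idx])
        estimator.fit(X[tr_idx].multiply(r).tocsr(), y[tr_idx])
        fold_preds = estimator.predict_proba(X[va_idx].multiply(r).tocsr())[:, 1]
        scores.append(roc_auc_score(y[va_idx], fold_preds))
    return float(np.mean(scores))

# Synthetic stand-in data (not the competition data).
rng = np.random.RandomState(0)
X_demo = sparse_random(200, 30, density=0.2, format='csr', random_state=rng)
y_demo = rng.randint(0, 2, size=200)
score = cross_val_foldwise(LogisticRegression(), X_demo, y_demo)
print(score)
```

On random labels like these the score carries no meaning beyond showing that the loop runs. On the real features, comparing it with the output of `cross_val` would indicate how much the shared ratio inflates the estimate.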

### Train and Create Predictions

In [45]:
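def train_and_predict(estimator):
    results = np.zeros((len(test), len(list_classes)))

    for i, label in enumerate(list_classes):
        print('fit', label)
        # fit on the NB-scaled features and scale the test set by the same ratio r
        model, r = get_mdl(estimator, y_train[:,i])
        X_test_nb = X_test.multiply(r)
        try:
            results[:,i] = model.predict_proba(X_test_nb)[:,1]
        except TypeError:
            # fallback for estimators that return a list of arrays
            results[:,i] = model.predict_proba(X_test_nb)[0][:,1]

    return results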

# Running Models
Cross-validation is performed below for each class separately. The per-class ROC AUC scores are then averaged and reported.

### Run Cross-validation

In [None]:
from sklearn.linear_model import LogisticRegression

scores = {}

lr = LogisticRegression(solver='sag')

for label in list_classes:
    print('fit', label)
    # cross_val already returns the mean ROC AUC over the folds
    scores[label] = cross_val(lr, train[label])
    
mean_scores = round(np.mean(list(scores.values())), 4)

In [3]:
mean_scores

0.9812

### Training Full Model and Creating Predictions
Finally, the model is trained on the entire training dataset and the test set is predicted. The results are converted to a .csv file for submission to the Kaggle website below.

In [46]:
lr_results = train_and_predict(lr)

fit toxic
fit severe_toxic
fit obscene
fit threat
fit insult
fit identity_hate


# Submission

In [88]:
test_id = test.id

def create_submission(filename, ids, results):
    id_series = pd.Series(ids, name='id')
    results_df = pd.DataFrame(results, columns=list_classes)
    combined = pd.concat([id_series, results_df], axis=1)
    combined.to_csv(filename,index=False)
    
    return combined
    
# lr_results holds the predictions returned by train_and_predict above
submission = create_submission('../submissions/submission_45.csv', test_id, lr_results)
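
Before uploading, it can help to confirm that the file has the expected columns and that every prediction is a valid probability. The sketch below performs that check on hypothetical ids and random probabilities, round-tripping through an in-memory buffer rather than the real submission path.

```python
import io
import numpy as np
import pandas as pd

classes = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# Hypothetical ids and probabilities standing in for test.id and the model output.
toy_ids = pd.Series(["a1", "b2", "c3"], name='id')
toy_probs = np.random.RandomState(0).rand(3, len(classes))

# Same layout as create_submission: id column followed by one column per class.
toy_sub = pd.concat([toy_ids, pd.DataFrame(toy_probs, columns=classes)], axis=1)

# Round-trip through an in-memory CSV instead of writing to disk.
buffer = io.StringIO()
toy_sub.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

columns_ok = list(reloaded.columns) == ['id'] + classes
in_range = bool(((reloaded[classes] >= 0) & (reloaded[classes] <= 1)).all().all())
no_missing = not reloaded.isnull().any().any()
print(columns_ok, in_range, no_missing)
```

The same three checks can be run on the real `submission` frame.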