# Logistic regression-based classifier


This notebook shows how to train and evaluate a simple logistic regression-based classifier. Thanks to High Flying Bird for providing the program; I merely translated it from French to English!

### Loading of the data

In [None]:
import pandas as pd
data = pd.read_csv('../input/train.csv')

### Comments vectorization
We build a vocabulary of at most 1000 unigrams and bigrams. A term is kept only if it appears in at least 4 comments (`min_df=4`) and in no more than 50% of the comments (`max_df=0.5`). Each comment is then projected into this space using its tf-idf scores.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(ngram_range=(1, 2),
                             max_df=0.5,
                             min_df=4,
                             max_features=1000)
vector_space_model = vectorizer.fit_transform(data['comment_text'].tolist())
n_comments = vector_space_model.shape[0]
print('%d comments total' % n_comments)
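
To make the `min_df` and `max_df` settings above concrete, here is a minimal pure-Python sketch of document-frequency filtering. It is only an illustration, not scikit-learn's implementation: the helper `filter_vocabulary` and the toy comments are hypothetical, and it handles unigrams only (no bigrams, no `max_features` cap, no tf-idf weighting).

```python
from collections import Counter

def filter_vocabulary(documents, min_df=2, max_df=0.5):
    """Keep terms found in at least `min_df` documents and in at most
    a fraction `max_df` of all documents (hypothetical helper)."""
    n_docs = len(documents)
    # Document frequency: number of documents containing the term,
    # not the total number of occurrences.
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc.lower().split()))
    return sorted(term for term, df in doc_freq.items()
                  if df >= min_df and df <= max_df * n_docs)

toy_comments = [
    "the cat sat",
    "the dog sat",
    "the bird flew",
    "the fish swam",
]
# "the" is in all 4 comments (too frequent); "sat" is in exactly 2 (kept);
# every other word is in a single comment (too rare).
print(filter_vocabulary(toy_comments, min_df=2, max_df=0.5))
```

Terms that are too rare carry little statistical signal, and terms that appear almost everywhere do not help to tell comments apart, which is the motivation for filtering on both ends.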

### Preparation of the data for the learning and evaluation of the classifier
We split the data into two subsets by position: the first third (33%) of the comments is used for training, and the remaining two thirds are used for evaluation.

In [None]:
training_set_size = int(n_comments * 0.33)
X = vector_space_model[:training_set_size,:]
# Note: the upper bound shape[0]-1 leaves the very last comment out of the evaluation set
Z = vector_space_model[training_set_size:vector_space_model.shape[0]-1,:]
print('%d comments for the estimation of the parameters and %d for the evaluation' % 
      (X.shape[0], Z.shape[0]))
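
The split above is purely positional, so it assumes that the order of the rows in the CSV file is unrelated to the labels. If that is in doubt, a shuffled split is a common alternative. The following is a minimal sketch under that assumption; `split_indices` and its parameters are hypothetical names, not part of the original program.

```python
import random

def split_indices(n_samples, train_fraction=0.33, seed=0):
    """Return (train_indices, eval_indices) after a seeded shuffle
    (hypothetical helper)."""
    indices = list(range(n_samples))
    # A fixed seed makes the split reproducible from run to run.
    random.Random(seed).shuffle(indices)
    n_train = int(n_samples * train_fraction)
    return indices[:n_train], indices[n_train:]

train_idx, eval_idx = split_indices(100)
print(len(train_idx), len(eval_idx))
```

The resulting index lists can then be used to select rows, for example `vector_space_model[train_idx]` for the features and `data['toxic'].iloc[train_idx]` for the labels.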

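### Training of the classifier for the "toxic" category
We estimate the parameters of a linear model with an L2 (ridge) penalty for the "toxic" category. Note that the code below uses `BayesianRidge`, which is a Bayesian ridge regression (a linear regression) rather than a logistic regression in the strict sense; its real-valued output is later used as a score and thresholded. Beforehand, we convert X from a sparse representation to a dense one.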

In [None]:
from sklearn import linear_model
X = X.toarray()
Y = data['toxic'][:training_set_size]
model = linear_model.BayesianRidge(verbose=True)
model.fit(X, Y)
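
The text of this section describes an L2-penalized logistic regression, whereas the cell above fits `BayesianRidge`. For comparison, here is a minimal pure-Python sketch of what an L2-penalized logistic regression does, written for one-dimensional inputs and trained by plain gradient descent. The function names, the toy data, and the hyperparameters are assumptions chosen for illustration. In practice, scikit-learn's `linear_model.LogisticRegression` (which applies an L2 penalty by default and exposes `predict_proba`) would be the usual choice.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic_l2(xs, ys, l2=0.1, learning_rate=0.5, n_steps=2000):
    """Fit p(y=1 | x) = sigmoid(w * x + b) by gradient descent.

    Minimizes the mean log-loss plus (l2 / 2) * w**2; the penalty is
    applied to the weight only, not to the bias (hypothetical helper).
    """
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(n_steps):
        # Prediction error for every example: p_i - y_i
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n + l2 * w
        grad_b = sum(errors) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

# Toy data: small inputs are class 0, large inputs are class 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0, 0, 1, 1]
w, b = fit_logistic_l2(xs, ys)
probabilities = [sigmoid(w * x + b) for x in xs]
labels = [int(p > 0.5) for p in probabilities]
print(labels)
```

Unlike the output of a ridge regression, the values in `probabilities` always lie strictly between 0 and 1, which makes the 1/2 threshold easier to interpret.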

### Evaluation of the classifier
We compute a score for each comment in the evaluation set. The output of the model's `predict` method is a real-valued score rather than a calibrated probability. Comments whose score is strictly greater than 1/2 are labelled "toxic". We summarize the results with a confusion matrix.

In [None]:
from sklearn.preprocessing import binarize
from sklearn.metrics import confusion_matrix
ground_truth = data['toxic'][training_set_size:vector_space_model.shape[0]-1]
prediction = model.predict(Z)
# `threshold` must be passed by keyword in recent scikit-learn versions
prediction = binarize(prediction.reshape(-1, 1), threshold=0.5).ravel()
confusion_matrix(ground_truth, prediction)
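
To help read the matrix returned above, here is a minimal pure-Python sketch of the same thresholding and counting on toy data. It follows the layout used by scikit-learn for binary labels: rows are the true classes, columns are the predicted classes, so the result is `[[TN, FP], [FN, TP]]`. The helper `confusion_counts` and the toy scores are hypothetical.

```python
def confusion_counts(y_true, scores, threshold=0.5):
    """Return [[TN, FP], [FN, TP]] for binary labels (hypothetical helper)."""
    matrix = [[0, 0], [0, 0]]
    for truth, score in zip(y_true, scores):
        # Strictly above the threshold counts as "toxic" (class 1).
        predicted = 1 if score > threshold else 0
        matrix[truth][predicted] += 1
    return matrix

y_true = [0, 0, 1, 1, 1]
scores = [0.1, 0.7, 0.5, 0.9, 0.2]
print(confusion_counts(y_true, scores))
```

Note that the score of exactly 0.5 is counted as non-toxic, since the comparison is strict.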

### Analysis of the predictions

In [None]:
toxic_ids = [i for i, c in enumerate(prediction) if c == 1]

In [None]:
comment_id = toxic_ids[0]
print('Content of the comment: \n%s\n' % data['comment_text'][training_set_size+comment_id])
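# predict returns a one-element array for a single row; take its only value before thresholding
print('Is this comment "toxic" according to the model?\n%s' % str(model.predict(Z[comment_id, :])[0] > 0.5))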