# HW 5 - TF-IDF Classifier

The goal is to build a classifier that can identify "toxic" comments from the [Toxic Comment Classification Challenge](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge).

The data is taken from https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/data

In [1]:
import numpy as np
import pandas as pd

from yaspin import yaspin
from yaspin.spinners import Spinners
from tqdm.notebook import trange, tqdm
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

import warnings

# suppress all warnings to keep the notebook output clean
warnings.filterwarnings("ignore")

In [2]:
class_names = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

# load data
train = pd.read_csv('data/train.csv').fillna(' ')

The standard approaches to text analysis are [Bag of words](https://en.wikipedia.org/wiki/Bag-of-words_model) and its modification, [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf).

They are implemented in `sklearn` as [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) and [TfidfVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html).

More details are available in [this notebook](https://github.com/udsclub/workshop/blob/master/notebooks/UDS-workshop-feature-extraction-and-engineering.ipynb).
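To make the difference between the two vectorizers concrete, here is a minimal sketch on a toy corpus. The three documents are invented for illustration and are not part of the challenge data. `CountVectorizer` stores raw term counts, while `TfidfVectorizer` (with its default settings) down-weights terms that occur in many documents and L2-normalizes each row.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus, invented for illustration only
docs = [
    "you are great",
    "you are awful",
    "awful awful comment",
]

# Bag of words: one column per term, values are raw counts
count_vec = CountVectorizer()
counts = count_vec.fit_transform(docs)

# TF-IDF: same columns, but counts are reweighted by inverse document frequency
tfidf_vec = TfidfVectorizer()
tfidf = tfidf_vec.fit_transform(docs)

# Terms ordered by their column index
terms = sorted(count_vec.vocabulary_, key=count_vec.vocabulary_.get)
print(terms)
print(counts.toarray())
print(np.round(tfidf.toarray(), 3))
```

In the first document, `great` appears in only one document while `you` appears in two, so `great` receives the larger TF-IDF weight even though both have a raw count of 1.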

In [3]:
train_text = train['comment_text']

In [4]:
# start with a basic CountVectorizer to look at the data, check the most frequent words, etc.
word_vectorizer = CountVectorizer()

In [5]:
with yaspin(Spinners.clock, text='Fitting') as spinner:
    word_vectorizer.fit(train_text)
    spinner.ok('✅ ')
    
with yaspin(Spinners.clock, text='Transforming') as spinner:
    train_word_features = word_vectorizer.transform(train_text)
    spinner.ok('✅ ')

# finding most frequent words
with yaspin(Spinners.clock, text='Calculating sum for each word') as spinner:
    sum_words = train_word_features.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in word_vectorizer.vocabulary_.items()]

    most_freq_word = None
    frequency = float('-inf')
    less_than_N = 0
    one_rep = 0
    for word, freq in words_freq:
        if freq == 1:
            one_rep += 1
        elif freq < 4:
            less_than_N += 1
        if freq > frequency:
            frequency = freq 
            most_freq_word = word

print(f'\nMost frequent word is: `{most_freq_word}`, it appears {frequency} times')
print(f'There are {one_rep} words that appear only once in the dataset')
print(f'Total number of features (words): {len(word_vectorizer.vocabulary_)}')

✅  Fitting 
✅  Transforming 
                                 
Most frequent word is: `the`, it appears 496796 times
There are 100140 words that appear only once in the dataset
Total number of features (words): 189775


Let's try to tune our CountVectorizer for better performance.

In [6]:
word_vectorizer = CountVectorizer(
    stop_words='english',  # drop common English words like `the`, `and`, etc.
    min_df=2,              # a word (or n-gram) must appear in at least 2 documents to be kept
    binary=True,           # binary features: 1 if the term is present, whatever its count
    ngram_range=(1, 2),    # also consider two-word phrases
    lowercase=True,        # lowercase everything
)

with yaspin(Spinners.clock, text='Fitting') as spinner:
    word_vectorizer.fit(train_text)
    spinner.ok('✅ ')
    
with yaspin(Spinners.clock, text='Transforming') as spinner:
    train_word_features = word_vectorizer.transform(train_text)
    spinner.ok('✅ ')

✅  Fitting 
✅  Transforming 


In [7]:
n = 5

# find the most frequent term
# and check that min_df is working (rarely used terms are removed)
with yaspin(Spinners.clock, text='Calculating sum for each word') as spinner:
    sum_words = train_word_features.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in word_vectorizer.vocabulary_.items()]

    most_freq_word = None
    frequency = float('-inf')
    less_than_N = 0
    one_rep = 0
    for word, freq in words_freq:
        if freq == 1:
            one_rep += 1
        elif freq < n:
            less_than_N += 1
        if freq > frequency:
            frequency = freq 
            most_freq_word = word

print(f'Now most frequent word is: `{most_freq_word}`, it appears {frequency} times')
print(f'There are {one_rep} words that appear only once in the dataset')
print(f'There are {less_than_N} words that appear less than {n} times in the dataset')
print(f'Total number of features (words and n-grams): {len(word_vectorizer.vocabulary_)}')

Now most frequent word is: `article`, it appears 32112 times
There are 0 words that appear only once in the dataset
There are 420966 words that appear less than 5 times in the dataset
Total number of features (words and n-grams): 553961
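The effect of the `stop_words`, `min_df`, `binary` and `ngram_range` options is easier to see on a tiny corpus. The sketch below uses three invented sentences (not challenge data). Note that `min_df` counts documents, not occurrences: `fish` occurs twice but in a single document, so it is dropped.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented for illustration only
docs = [
    "the cat sat on the mat",
    "the cat ate the fish fish",
    "a dog sat on the mat mat",
]

vec = CountVectorizer(
    stop_words='english',  # removes `the`, `on`, `a`
    min_df=2,              # keep terms found in at least 2 documents
    binary=True,           # presence/absence instead of counts
    ngram_range=(1, 2),    # unigrams and bigrams
)
X = vec.fit_transform(docs)

# Terms ordered by their column index
terms = sorted(vec.vocabulary_, key=vec.vocabulary_.get)
print(terms)
print(X.toarray())
```

Because stop words are removed before n-grams are built, `sat on the mat` yields the bigram `sat mat`, which survives `min_df=2` since it occurs in two documents. With `binary=True`, the repeated `mat` in the last sentence is still stored as 1.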


For classification we will use [LogisticRegression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).

In [8]:
classifier = LogisticRegression(
    penalty='l2',    # default regularization
    C=0.1,           # inverse of regularization strength (smaller C = stronger regularization)
    solver='lbfgs',  # default solver
)
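Since `C` is the inverse of the regularization strength, a smaller `C` should pull the learned weights closer to zero. The sketch below checks this on synthetic data from `make_classification`. The data, the grid of `C` values and `max_iter=1000` are assumptions made for illustration, not settings from this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data, used only to illustrate the effect of C
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

coef_norms = {}
for C in (0.01, 0.1, 10.0):
    # L2 penalty is the default; smaller C means a stronger penalty
    model = LogisticRegression(C=C, solver='lbfgs', max_iter=1000)
    model.fit(X_demo, y_demo)
    coef_norms[C] = float(np.linalg.norm(model.coef_))

for C, norm in coef_norms.items():
    print(f'C={C:<5} -> ||w|| = {norm:.3f}')
```

The weight norm grows as `C` increases, which is why a small value such as `C=0.1` is a reasonable starting point when the feature matrix has hundreds of thousands of sparse columns.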

Let's train one classifier for each class.

For validation we will use the [cross_val_score](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html) function from sklearn.

In [9]:
scores = []

for class_name in class_names:
    train_target = train[class_name]

    with yaspin(Spinners.clock, text=f'Training {class_name}') as spinner:
        cv_score = np.mean(cross_val_score(classifier, train_word_features, train_target, scoring='roc_auc'))
        spinner.ok(f'CV score for class {class_name} is {cv_score:.6f}')
    
    scores.append(cv_score)

print(f'\nTotal score is {np.mean(scores):.6f}')

CV score for class toxic is 0.958886 Training toxic
CV score for class severe_toxic is 0.975209 Training severe_toxic
CV score for class obscene is 0.977403 Training obscene
CV score for class threat is 0.976598 Training threat
CV score for class insult is 0.967526 Training insult
CV score for class identity_hate is 0.964972 Training identity_hate

Total score is 0.970099
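One possible variation, sketched here under stated assumptions: the cells above fit the vectorizer on the whole training set before cross-validation, so each validation fold contributes to the vocabulary. Wrapping the vectorizer and the classifier in a `Pipeline` makes `cross_val_score` refit both on each training fold. The sketch also swaps in `TfidfVectorizer`, which is an assumption based on the notebook title, and runs on a small invented corpus rather than `train.csv`, so the scores it prints say nothing about the real task.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Toy comments, invented for illustration only
clean = [
    "thanks for the helpful edit",
    "great article and well sourced",
    "please see the talk page",
    "i agree with this change",
    "good point about the references",
]
toxic = [
    "you are an idiot",
    "this is stupid garbage",
    "shut up you fool",
    "what a dumb idiot edit",
    "you stupid fool",
]
texts = (clean + toxic) * 2                       # 20 documents
labels = np.array(([0] * 5 + [1] * 5) * 2)        # 0 = clean, 1 = toxic

# The vectorizer is part of the model, so it is refit on every training fold
pipeline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(C=0.1, solver='lbfgs'),
)

fold_scores = cross_val_score(pipeline, texts, labels, scoring='roc_auc', cv=5)
print(f'Mean ROC AUC over {len(fold_scores)} folds: {fold_scores.mean():.3f}')
```

On the real data the same pipeline could be evaluated once per entry in `class_names`, passing `train_text` and `train[class_name]` in place of the toy `texts` and `labels`.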
