Hi guys,

This will be a very short example of how we can combine TF-IDF with a chi-squared (Chi2) test to find predictive features (and by that I mean filthy words). If you dare, read on...
# Data Import
We'll start by importing the data:

In [1]:
import pandas as pd

train = pd.read_csv('../input/train.csv', header = 0)
train.head()

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,22256635,"Nonsense? kiss off, geek. what I said is true...",1,0,0,0,0,0
1,27450690,"""\n\n Please do not vandalize pages, as you di...",0,0,0,0,0,0
2,54037174,"""\n\n """"Points of interest"""" \n\nI removed the...",0,0,0,0,0,0
3,77493077,Asking some his nationality is a Racial offenc...,0,0,0,0,0,0
4,79357270,The reader here is not going by my say so for ...,0,0,0,0,0,0


We'll just check if there are any empty fields:

In [2]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 95851 entries, 0 to 95850
Data columns (total 8 columns):
id               95851 non-null int64
comment_text     95851 non-null object
toxic            95851 non-null int64
severe_toxic     95851 non-null int64
obscene          95851 non-null int64
threat           95851 non-null int64
insult           95851 non-null int64
identity_hate    95851 non-null int64
dtypes: int64(7), object(1)
memory usage: 5.9+ MB


Let's see if we can get some insights into the data by checking some standard metrics on the target fields:

In [3]:
train.describe()

Unnamed: 0,id,toxic,severe_toxic,obscene,threat,insult,identity_hate
count,95851.0,95851.0,95851.0,95851.0,95851.0,95851.0,95851.0
mean,499435900000.0,0.096368,0.010068,0.053301,0.003182,0.049713,0.008492
std,289013600000.0,0.295097,0.099832,0.224635,0.05632,0.217352,0.091762
min,22256640.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,247343700000.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,500129700000.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,750108800000.0,0.0,0.0,0.0,0.0,0.0,0.0
max,999988200000.0,1.0,1.0,1.0,1.0,1.0,1.0


Looks like the mean value for the 'toxic' column is the highest (about 0.096). Since the labels are 0/1, this means that roughly 9.6% of the comments are labeled 'toxic', more than 'severe_toxic' or any other category. With the limited resources that the kernels provide, it would be best to focus only on predicting that column.

To do that, we'll further split our training set into 'train' and 'test' sets. This will help us at least partially evaluate our hypothesis.

In [4]:
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import words
from sklearn.model_selection import train_test_split

X, y = train[['comment_text']], train[['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.10, random_state = 42)
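
Since only about one comment in ten is toxic, a plain random split can leave the small test set with a slightly different share of toxic comments. The snippet below is a minimal sketch on made-up data (the `toy_*` names are hypothetical, not part of this kernel) showing how the `stratify` argument of `train_test_split` keeps the class ratio the same in both parts. The split above does not use it.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy labels: 10 "toxic" and 90 "clean" comments
toy_y = np.array([1] * 10 + [0] * 90)
toy_X = np.arange(100).reshape(-1, 1)

# stratify=toy_y keeps the 10% positive rate in both the train and test parts
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.10, random_state=42, stratify=toy_y)

print(toy_y_train.mean(), toy_y_test.mean())
```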

# The Vectorizer
We'll then instantiate a count vectorizer and create a matrix of all the tokens contained in each comment. The matrix will exclude all English stop words and count only words found in NLTK's English word list. This will have some consequences:

* Our algorithm will be optimized for English (other languages will be ignored)
* Our algorithm will not take into account purposefully misspelled obscenities

In [5]:
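# A fixed vocabulary restricts counting to the words in NLTK's word list.
# Note: max_df and min_df are ignored whenever a vocabulary is supplied.
vectorizer = CountVectorizer(stop_words = 'english',
                             lowercase = True,
                             max_df = 0.95,
                             min_df = 0.05,
                             vocabulary = set(words.words()))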

vectorized_text = vectorizer.fit_transform(X_train.comment_text)
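
To make the effect of a fixed vocabulary concrete, here is a small sketch on invented comments (the `toy_*` names and texts are assumptions for illustration only). It shows that only vocabulary words are counted, that a deliberately misspelled word slips through, and that `min_df` has no effect once a vocabulary is passed.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy comments, not taken from the competition data
toy_docs = ["you are a fool", "the article is fine", "fool fool idi0t"]
toy_vocab = ["article", "fool", "idiot"]

# min_df=0.9 would normally drop rare words, but it is ignored
# because the vocabulary is fixed up front
toy_vectorizer = CountVectorizer(lowercase=True, vocabulary=toy_vocab, min_df=0.9)
toy_counts = toy_vectorizer.fit_transform(toy_docs).toarray()

# Columns follow the order of toy_vocab: article, fool, idiot
print(toy_counts)
```

The misspelled 'idi0t' is never counted in the 'idiot' column, which is exactly the second consequence listed above.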

We'll now use our vectorized matrix and run TFIDF on it:

In [6]:
transformer = TfidfTransformer(smooth_idf = False)
tfidf = transformer.fit_transform(vectorized_text)

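(The cell prints a warning that points at the scikit-learn line `idf = np.log(float(n_samples) / df) + 1.0`. This is most likely a divide-by-zero warning: with `smooth_idf = False`, every vocabulary word that never occurs in the training comments has a document frequency `df` of 0.)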
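
The formula in that output line is the unsmoothed IDF weight, `ln(n_samples / df) + 1`. As a quick sanity check, the sketch below recomputes it by hand on a tiny made-up count matrix (the `toy_*` names and `manual_idf` are hypothetical) and compares it with what `TfidfTransformer` stores in `idf_`.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Hypothetical counts: 4 documents x 2 terms.
# Term 0 occurs in all 4 documents, term 1 in only 1 of them.
toy_counts = np.array([[1, 0],
                       [2, 0],
                       [1, 3],
                       [1, 0]])

toy_transformer = TfidfTransformer(smooth_idf=False)
toy_transformer.fit(toy_counts)

# Manual computation: document frequency, then ln(n / df) + 1
n_samples = toy_counts.shape[0]
df = (toy_counts > 0).sum(axis=0)
manual_idf = np.log(n_samples / df) + 1.0

print(toy_transformer.idf_)
print(manual_idf)
```

The rarer term gets the larger weight, and a term with `df = 0` would make the division blow up.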


Here comes the interesting part: we'll use the weighted matrix to select the 200 terms that are the best predictors of toxic comments. We can expect those to be quite obscene terms.

In [7]:
from sklearn.feature_selection import SelectKBest, chi2

ch2 = SelectKBest(chi2, k = 200)
best_features = ch2.fit_transform(tfidf, y_train.toxic)
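
As a rough illustration of what `SelectKBest(chi2, ...)` does, here is a sketch with three made-up terms and six made-up comments (all `toy_*` names and values are assumptions). The chi2 score is high when a term's total weight is split very unevenly between the classes. It measures dependence, not direction, so a term typical of clean comments can rank highly too.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Hypothetical TFIDF-like weights for 3 terms across 6 comments
toy_X = np.array([[0.9, 0.1, 0.0],
                  [0.8, 0.0, 0.1],
                  [0.7, 0.2, 0.0],
                  [0.0, 0.1, 0.1],
                  [0.0, 0.2, 0.0],
                  [0.0, 0.0, 0.1]])
toy_y = np.array([1, 1, 1, 0, 0, 0])  # first three comments are "toxic"
toy_terms = np.array(["insult", "article", "page"])

# Keep the single term whose weight depends most on the label
toy_selector = SelectKBest(chi2, k=1).fit(toy_X, toy_y)
toy_selected = toy_terms[toy_selector.get_support()]

print(toy_selector.scores_)
print(toy_selected)
```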

Fair warning: the next code snippet will display the distilled essence of online hatred. Scroll further only if you can stomach it... [Otherwise, jump directly to the next section.](#The-Analyzer)

In [8]:
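# Keep the vocabulary words whose column was selected by the chi2 test
# (newer scikit-learn releases rename this method to get_feature_names_out())
filth = [feature for feature, selected in
         zip(vectorizer.get_feature_names(), ch2.get_support())
         if selected]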

print(filth)

['add', 'anal', 'anus', 'arrogant', 'arse', 'article', 'ass', 'bag', 'ban', 'basement', 'bastard', 'bet', 'big', 'bitch', 'blah', 'block', 'bloody', 'blow', 'bout', 'boy', 'bully', 'bunch', 'burn', 'butt', 'cancer', 'chink', 'choke', 'chump', 'cock', 'commie', 'consensus', 'content', 'continue', 'cougar', 'coward', 'crap', 'crazy', 'cum', 'damn', 'dare', 'deletion', 'dick', 'die', 'dirty', 'discussion', 'disgrace', 'disgusting', 'dog', 'donkey', 'dont', 'douche', 'dude', 'dumb', 'dumbhead', 'eat', 'face', 'fag', 'fascist', 'fat', 'feces', 'filthy', 'fool', 'freak', 'fu', 'garbage', 'gay', 'geek', 'god', 'grow', 'ha', 'hairy', 'hate', 'hater', 'head', 'hell', 'help', 'hey', 'hoe', 'hole', 'homo', 'homosexual', 'hypocrite', 'idiot', 'idiotic', 'ignorant', 'ill', 'image', 'imbecile', 'impotent', 'information', 'ing', 'issue', 'jackass', 'jerk', 'kick', 'kike', 'kill', 'kiss', 'lame', 'liar', 'lick', 'licker', 'licking', 'life', 'link', 'links', 'list', 'listen', 'little', 'looser', 'loser

# The Analyzer
We'll now build a new count vectorizer. We'll call it the analyzer (think of it as the second of two polarizing filters) and it will vectorize our input again, this time counting only the predictive obscenities from above. This will give us a new matrix of n features, where n is the number of predictive words.


In [9]:
analyzer = CountVectorizer(lowercase = True,
                           vocabulary = filth)

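Now, let's define a function that vectorizes comment texts with the analyzer and weighs the vectors using TFIDF. Note that the transformer trained earlier expects the full vocabulary, so it cannot be applied to the 200-column matrix as is. The function therefore refits it on whatever frame it receives, which means the train and test frames each get their own IDF weights:

In [10]:
def get_features(frame):
    # Count only the selected words, then reweigh the counts with TFIDF
    counts = analyzer.fit_transform(frame.comment_text)
    weighted = transformer.fit_transform(counts)
    return pd.DataFrame(weighted.toarray(), index = frame.index)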
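
Because `get_features` calls `fit_transform` on every frame, the test comments end up weighted by IDF values computed on the test set itself. A more conventional setup fits the weights on the training frame only and reuses them for the test frame. The sketch below shows one way to do that on made-up data. The `toy_*` names, the `get_features_fitted` helper and its `fit` flag are hypothetical, and it uses `smooth_idf = True` (unlike the cell above) to avoid dividing by zero for words that never occur.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical stand-ins for X_train, X_test and the selected word list
toy_train = pd.DataFrame({"comment_text": ["you fool", "nice article", "fool fool liar"]})
toy_test = pd.DataFrame({"comment_text": ["what a liar"]}, index=[10])
toy_words = ["fool", "liar"]

toy_analyzer = CountVectorizer(lowercase=True, vocabulary=toy_words)
toy_tfidf = TfidfTransformer(smooth_idf=True)

def get_features_fitted(frame, fit=False):
    # A fixed vocabulary needs no fitting, so transform() is enough here
    counts = toy_analyzer.transform(frame.comment_text)
    # Learn IDF weights only when asked to (i.e. on the training frame)
    weighted = toy_tfidf.fit_transform(counts) if fit else toy_tfidf.transform(counts)
    return pd.DataFrame(weighted.toarray(), index=frame.index)

toy_train_features = get_features_fitted(toy_train, fit=True)
toy_test_features = get_features_fitted(toy_test)  # reuses the train IDF weights

print(toy_test_features)
```

Switching to this setup would change the features slightly, so the numbers reported further down would not be reproduced exactly.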

We'll also define a dictionary which will contain our input train and test data:

In [11]:
feature_frames = {}

for name, frame in (('train', X_train), ('test', X_test)):
    feature_frames[name] = get_features(frame)

feature_frames['train'].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 86265 entries, 84226 to 15795
Columns: 200 entries, 0 to 199
dtypes: float64(200)
memory usage: 132.3 MB



# Training
We can now train our algorithm of choice using the feature frames:

In [12]:
from sklearn.neighbors import KNeighborsClassifier

knc = KNeighborsClassifier(n_neighbors = 10)
knc.fit(feature_frames['train'], y_train.toxic)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=10, p=2,
           weights='uniform')
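
One thing to keep in mind before scoring: with uniform weights, a k-nearest-neighbors classifier estimates the probability of a class as the share of the k neighbors that carry that label. With `n_neighbors = 10` the predicted probabilities can therefore only be 0.0, 0.1, ..., 1.0. The sketch below checks this on random made-up data (all `toy_*` names are hypothetical).

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical random features; the label depends on the first column only
rng = np.random.RandomState(0)
toy_X = rng.rand(50, 3)
toy_y = (toy_X[:, 0] > 0.5).astype(int)

toy_knc = KNeighborsClassifier(n_neighbors=10).fit(toy_X, toy_y)

# Probability of class 1 for five new random points
toy_proba = toy_knc.predict_proba(rng.rand(5, 3))[:, 1]
print(toy_proba)
```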

# Log Loss and Conclusion
Finally, we assess our log loss:

In [13]:
from sklearn.metrics import log_loss

result = pd.DataFrame(knc.predict_proba(feature_frames['test']), index = feature_frames['test'].index)

result['actual'] = y_test.toxic
result['text'] = X_test.comment_text

print(log_loss(y_test.toxic, result[1]))

0.943145538092
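
To put that number in context, it helps to compare it with a trivial baseline that predicts the same probability for every comment. The arithmetic below is a back-of-the-envelope sketch. It assumes the share of toxic comments in our held-out split is close to the overall mean of 0.096368 from `train.describe()`, which we have not verified.

```python
import numpy as np

# Assumption: the held-out split has roughly the overall toxic rate
base_rate = 0.096368

# Expected log loss when every comment gets the probability base_rate
baseline = -(base_rate * np.log(base_rate)
             + (1 - base_rate) * np.log(1 - base_rate))
print(baseline)

# Penalty for one toxic comment that is given a probability of only 0.001
confident_miss = -np.log(0.001)
print(confident_miss)
```

If that assumption holds, the baseline comes out at roughly 0.32, which is well below our score. A plausible explanation is that k-nearest-neighbors often outputs a probability of exactly 0 or 1, and log loss punishes every confident miss very heavily.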


And here are some examples of predictions and their corresponding comments (again, viewer discretion is advised):

In [14]:
pd.set_option('display.max_colwidth', 100)
# True positives: comments labeled toxic that the model also scores above 0.5
result.loc[(result.actual == 1) & (result[1] > 0.5), [1, 'actual', 'text']][:10]

Unnamed: 0,1,actual,text
51076,0.7,1,"""\n\nSmash Lab, Part II\nYou really are a piece of work. This is not what """"some people in chat..."
45407,1.0,1,"Yeah, all you have to do is stop being such a bitch."
64130,0.7,1,"FUCK YOU, YOU FUCKING DIRTY KIKE. YOU SUPPORT THOSE WHO VANDALIZE PAGES, NOW IT IS TURNED ON YOU"
94993,1.0,1,"I have a valid reason, stop being a bitch."
10540,0.6,1,"""\nHe is a washout. Do you have any facts to prove that he is not a washout? Fine, in AR 635-2..."
48529,0.8,1,you are gay gay gay gay gay gay
28729,0.6,1,"""\nI haven't the slightest idea of what you talking about. But you probably need to screw myself..."
30274,0.7,1,"Tool Box\nFuck you piece of shit. You think your in control, fucking faggot? You think you can h..."
18387,0.9,1,new name \n\nPerezhilton has changed his name to dick sucker.
72992,1.0,1,hello \n\nwill you suck my dick for $5? please reply


# Afterword

* In a live system, such a model should use additional matching criteria for purposefully misspelled obscenities (e.g. 'id10t' instead of 'idiot')
* The model could be improved by using n-grams
* The model could be improved by using an ensemble of models
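
The first two ideas can be sketched in a few lines. The snippet below is only an illustration: `LEET_MAP` and `normalize` are hypothetical names, the substitution table is tiny and hand-picked, and a real system would need something far more thorough. It maps a few digit and symbol look-alikes back to letters before tokenizing, and asks `CountVectorizer` for word pairs in addition to single words.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical substitution table for common character look-alikes
LEET_MAP = str.maketrans({"0": "o", "1": "i", "3": "e", "4": "a", "@": "a"})

def normalize(text):
    # Lowercase first, then undo the look-alike substitutions
    return text.lower().translate(LEET_MAP)

# ngram_range=(1, 2) counts single words and pairs of adjacent words
ngram_vectorizer = CountVectorizer(preprocessor=normalize, ngram_range=(1, 2))
ngram_vectorizer.fit(["You ID10T", "piece of work"])

learned_terms = sorted(ngram_vectorizer.vocabulary_)
print(learned_terms)
```

Here 'ID10T' ends up counted as 'idiot', and 'piece of' becomes a feature of its own.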