Jigsaw's API, Perspective, serves toxicity models and others in a growing set of languages (see our documentation for the full list). Over the past year, the field has seen impressive multilingual capabilities from the latest model innovations, including few- and zero-shot learning. We're excited to learn whether these results "translate" (pun intended!) to toxicity classification. Your training data will be the English data provided for our previous two competitions and your test data will be Wikipedia talk page comments in several different languages.

I would like to thank my team members for their awesome contributions to the competition.

* Ashish Gupta (https://www.kaggle.com/roydatascience)
* Mukharbek Organokov (https://www.kaggle.com/muhakabartay)
* Firat Gonen (https://www.kaggle.com/frtgnn)
* Atharva (https://www.kaggle.com/atharvap329)
* Kirill Balakhonov (https://www.kaggle.com/kirill702b)

Please note: here I am testing jazivxt's kernel (https://www.kaggle.com/jazivxt/howling-with-wolf-on-l-genpresse) on my best submissions.

In [1]:
from sklearn import ensemble, metrics, model_selection
import numpy as np
import pandas as pd
import glob

data = {k.split('/')[-1][:-4]:k for k in glob.glob('/kaggle/input/**/**.csv')}
train = pd.read_csv(data['jigsaw-toxic-comment-train'], usecols=['id', 'comment_text', 'toxic'])
val = pd.read_csv(data['validation'], usecols=['comment_text', 'toxic'])
test = pd.read_csv(data['test'], usecols=['id', 'content'])
test.columns = ['id', 'comment_text']
test['toxic'] = 0.5

# Our team's second-best submission (non-normalized)
sub2 = pd.read_csv('../input/finalsubmission/submission-.9480.csv')

# Our team's best submission: an ensemble of the .9479 kernel (Ashish) and .9480 (best stable submission, normalized)
sub4 = pd.read_csv('../input/finalsubmission/submission-.9481.csv')

In [2]:
sub2.head(5)

   id     toxic
0   0   0.01151
1   1  0.018096
2   2  0.267872
3   3  0.010647
4   4  0.011961


In [3]:
%%time
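def f_experience(c, s):
    """Build per-word statistics ("experience") from comments c and toxicity labels s."""
    # Hyperparameters; 'influence' and 'harmony' are defined but not used below.
    it = {'memory':10,       # length of the rolling history kept per word
        'influence':0.5,
        'inference':0.5,     # scale applied to the word's comment frequency
        'interest':0.9,      # weight of the new label in the running average
        'sentiment':1e-10,   # small constant added to the denominator
        'harmony':0.5}
    
    exp = {}
    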
    
    for i in range(len(c)):
        words = set(str(c[i]).lower().split(' '))  # each word counts once per comment
        for w in words:
            try:
                exp[w]['influence'] = exp[w]['influence'][1:] + [s[i]] #need to normalize
                exp[w]['inference'] += 1
                exp[w]['interest'] = exp[w]['interest'][1:] + [(exp[w]['interest'][it['memory']-1] + (s[i] * it['interest']))/2]
                exp[w]['sentiment'] += s[i]
                #exp[w]['harmony']
            except KeyError:  # first time this word is seen
                m = [0. for m_ in range(it['memory'])]
                exp[w] = {}
                exp[w]['influence'] = m[1:] + [s[i]]
                exp[w]['inference'] = 1
                exp[w]['interest'] = m[1:] + [s[i] * it['interest'] / 2]
                exp[w]['sentiment'] = s[i]
                #exp[w]['harmony'] = 0
                
    for w in exp:
        exp[w]['sentiment'] /= exp[w]['inference'] + it['sentiment']
        exp[w]['inference'] /= len(c) * it['inference']

    return exp

exp = f_experience(train['comment_text'].values, train['toxic'].values)

CPU times: user 3min 1s, sys: 1.74 s, total: 3min 3s
Wall time: 3min 3s
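As a rough illustration of what `f_experience` stores, the `sentiment` entry reduces to the mean label over the comments that contain a word, with each comment counted once. The sketch below recomputes only that statistic on toy data; `word_sentiment`, `toy_comments`, and `toy_labels` are hypothetical names introduced for illustration and are not part of the kernel.

```python
from collections import defaultdict

def word_sentiment(comments, labels, eps=1e-10):
    """Mean label per word; a word is counted at most once per comment."""
    total = defaultdict(float)  # sum of labels over comments containing the word
    count = defaultdict(int)    # number of comments containing the word
    for text, label in zip(comments, labels):
        for w in set(str(text).lower().split(' ')):
            total[w] += label
            count[w] += 1
    # eps mirrors the small constant added to the denominator in f_experience
    return {w: total[w] / (count[w] + eps) for w in total}

# Toy data (assumed, for illustration only)
toy_comments = ["you are great", "you are awful", "great work"]
toy_labels = [0, 1, 0]

sent = word_sentiment(toy_comments, toy_labels)
print(sent["you"], sent["awful"], sent["great"])
```

Words that appear only in toxic comments score close to 1, words that appear only in clean comments score 0, and words shared between both fall in between.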


In [4]:
%%time
def features(df):
    df['len'] = df['comment_text'].map(len)
    df['wlen'] = df['comment_text'].map(lambda x: len(str(x).split(' ')))
    
    df['influence_sum'] = df['comment_text'].map(lambda x: np.sum([np.mean(exp[w]['influence']) if w in exp else 0 for w in str(x).lower().split(' ')]))
    df['influence_mean'] = df['comment_text'].map(lambda x: np.mean([np.mean(exp[w]['influence']) if w in exp else 0 for w in str(x).lower().split(' ')]))
    
    df['inference_sum'] = df['comment_text'].map(lambda x: np.sum([exp[w]['inference'] if w in exp else 0 for w in str(x).lower().split(' ')]))
    df['inference_mean'] = df['comment_text'].map(lambda x: np.mean([exp[w]['inference'] if w in exp else 0 for w in str(x).lower().split(' ')]))
    
    df['interest_sum'] = df['comment_text'].map(lambda x: np.sum([np.mean(exp[w]['interest']) if w in exp else 0 for w in str(x).lower().split(' ')]))
    df['interest_mean'] = df['comment_text'].map(lambda x: np.mean([np.mean(exp[w]['interest']) if w in exp else 0 for w in str(x).lower().split(' ')]))
    
    df['sentiment_sum'] = df['comment_text'].map(lambda x: np.sum([exp[w]['sentiment'] if w in exp else 0.5 for w in str(x).lower().split(' ')]))
    df['sentiment_mean'] = df['comment_text'].map(lambda x: np.mean([exp[w]['sentiment'] if w in exp else 0.5 for w in str(x).lower().split(' ')]))
    return df

val = features(val)
test = features(test)

CPU times: user 4min 18s, sys: 162 ms, total: 4min 18s
Wall time: 4min 18s
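The sentiment features above fall back to a neutral 0.5 for words that never occurred in the training comments, which matters here because the test comments are not in English. A minimal sketch of that lookup, assuming a toy word-score dictionary (`toy_scores` and `sentiment_features` are hypothetical names, not part of the kernel):

```python
import numpy as np

def sentiment_features(text, scores, default=0.5):
    """Return (sum, mean) of per-word scores; unseen words get `default`."""
    vals = [scores.get(w, default) for w in str(text).lower().split(' ')]
    return float(np.sum(vals)), float(np.mean(vals))

# Toy per-word scores (assumed, for illustration only)
toy_scores = {"awful": 1.0, "great": 0.0}

s_sum, s_mean = sentiment_features("Awful unseenword", toy_scores)
print(s_sum, s_mean)
```

A comment made up entirely of unseen words therefore gets a mean of 0.5, so these features carry little signal for languages absent from the training data.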


In [5]:
col = [c for c in val if c not in ['id', 'comment_text', 'toxic']]
x1, x2, y1, y2 = model_selection.train_test_split(val[col], val['toxic'], test_size=0.3, random_state=20)

model = ensemble.ExtraTreesClassifier(n_estimators=1000, max_depth=7, n_jobs=-1, random_state=20)
model.fit(x1, y1)
print(metrics.roc_auc_score(y2, model.predict_proba(x2)[:,1].clip(0.,1.)))

model.fit(val[col], val['toxic'])
test['toxic'] = model.predict_proba(test[col])[:,1].clip(0.,1.)
sub1 = test[['id', 'toxic']]

0.7103550873413887


In [6]:
# sub1 is a slice of test, so renaming it in place triggers SettingWithCopyWarning; assign the result instead
sub1 = sub1.rename(columns={'toxic':'toxic1'})
sub2 = sub2.rename(columns={'toxic':'toxic2'})
sub4 = sub4.rename(columns={'toxic':'toxic4'})

sub3 = sub1.merge(sub2,on='id').merge(sub4,on='id')


In [7]:
sub3.head(5)

   id    toxic1    toxic2    toxic4
0   0  0.136369   0.01151  0.073287
1   1  0.134002  0.018096   0.16656
2   2  0.150677  0.267872   0.50009
3   3  0.075362  0.010647  0.065305
4   4  0.030821  0.011961  0.111355


In [8]:
sub3['toxic'] = (sub3['toxic1'] * 0.1) + (sub3['toxic4'] * 0.9) #blend 1
sub3['toxic'] = (sub3['toxic2'] * 0.49) + (sub3['toxic'] * 0.51) #blend 2

sub3[['id', 'toxic']].to_csv('submission.csv', index=False)
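The two blend steps above collapse into a single weighted average. The sketch below works out the effective weight of each submission and checks it against the two-step form on the first row of `sub3` shown earlier; the helper names `two_step` and `one_step` are hypothetical.

```python
# Effective weights implied by blend 1 followed by blend 2
effective = {
    'toxic1': 0.51 * 0.1,  # ExtraTrees model on word-experience features
    'toxic2': 0.49,        # second-best submission
    'toxic4': 0.51 * 0.9,  # best submission
}

def two_step(t1, t2, t4):
    """Blend exactly as written in the notebook cell."""
    b = (t1 * 0.1) + (t4 * 0.9)      # blend 1
    return (t2 * 0.49) + (b * 0.51)  # blend 2

def one_step(t1, t2, t4):
    """Equivalent single weighted average."""
    return (effective['toxic1'] * t1
            + effective['toxic2'] * t2
            + effective['toxic4'] * t4)

# First row of sub3: toxic1, toxic2, toxic4
row = (0.136369, 0.01151, 0.073287)
print(effective, two_step(*row), one_step(*row))
```

The new model contributes only about 5% of the final score, while the two earlier submissions share the rest almost evenly.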

In [9]:
#Is it toxic :)
test = pd.DataFrame(['Howling with Wolf on Lügenpresse'], columns=['comment_text'])
test['id'] = test.index
test = features(test)
test['toxic'] = model.predict_proba(test[col])[:,1].clip(0.,1.)
test[['id', 'comment_text', 'toxic']].head()

   id                      comment_text     toxic
0   0  Howling with Wolf on Lügenpresse  0.205481
