# Capstone Project Solution2
## / Toxic Comment Classification /

- - -
<ul>
<li><a href="#prepare">I Environment Setup</a></li>
<li><a href="#vecing">II Vectorization</a></li>
<li><a href="#model">III Baseline Model</a></li>
<li><a href="#model2">IV Actual Model</a></li>
<li><a href="#conclusions">V Conclusions</a></li>
</ul>

<a id='intro'></a>

<center><a id='prepare'>I Environment Setup</a></center>

In [15]:
# Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Machine learning libraries
from sklearn.model_selection import train_test_split

# Deep learning libraries
# from sklearn.datasets import load_files       
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

from keras.models import Sequential

from keras.layers import Embedding, Flatten, Dense, Dropout
#from keras.layers import Flatten
#from keras.layers import Dense
#from keras.layers import Dropout
from keras.layers import BatchNormalization, Conv1D, MaxPooling1D

from keras.optimizers import Adam

from keras.callbacks import EarlyStopping, ModelCheckpoint, ReduceLROnPlateau

#from keras.utils import np_utils

#from glob import glob

# Show long text in DataFrame cells
pd.options.display.max_colwidth = 500

# Render plots inline
%matplotlib inline

In [2]:
# Load the data files
test = pd.read_csv('test.csv')
train = pd.read_csv('train.csv')

# Check the label columns
train.head(1)

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0000997932d777bf,"Explanation\nWhy the edits made under my username Hardcore Metallica Fan were reverted? They weren't vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don't remove the template from the talk page since I'm retired now.89.205.38.27",0,0,0,0,0,0


In [3]:
# get comment
train_comment = train.comment_text
test_comment = test.comment_text

In [4]:
# check data
print(train_comment[0])
print(train_comment.shape)
print(test_comment[0])
print(test_comment.shape)

Explanation
Why the edits made under my username Hardcore Metallica Fan were reverted? They weren't vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don't remove the template from the talk page since I'm retired now.89.205.38.27
(159571,)
Yo bitch Ja Rule is more succesful then you'll ever be whats up with you and hating you sad mofuckas...i should bitch slap ur pethedic white faces and get you to kiss my ass you guys sicken me. Ja rule is about pride in da music man. dont diss that shit on him. and nothin is wrong bein like tupac he was a brother too...fuckin white boys get things right next time.,
(153164,)


<center><a id='vecing'>II Vectorization</a></center>

In [5]:
# --- 6B 300D ---
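# In the solution2_test notebook, 300D scored clearly better than 50D,
# which suggests that choosing a suitable embedding dimension matters a lot

# Vectorize the words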

## prepare tokenizer
t = Tokenizer(filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', lower=True, split=' ')
## https://keras.io/preprocessing/text/
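## Consider stripping odd characters and lowercasing at this step
t.fit_on_texts(train_comment)
## This splits the input sentences into words
## With this data, fitting on 1000 sentences (train_short in the test notebook) yields 10007 words
## When printed, the output is cut off with ... after 1000 entries
## Fitting on the full input yields more than 210,000 words
vocab_size = len(t.word_index) + 1
## vocab_size follows from the word index above; +1 because index 0 is reserved for padding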

## integer encode the documents
encoded_train = t.texts_to_sequences(train_comment)
encoded_test = t.texts_to_sequences(test_comment)

## pad documents to a max length of 300 words
max_length = 300

## max_length is the number of words kept per comment
padded_train = pad_sequences(encoded_train, maxlen=max_length, padding='post')
padded_test = pad_sequences(encoded_test, maxlen=max_length, padding='post')
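
To make the padding step concrete, here is a small pure-Python sketch of what `pad_sequences(..., padding='post')` does to integer-encoded comments. The helper name `pad_post` and the toy sequences are hypothetical and not part of Keras. The sketch assumes the Keras default `truncating='pre'`, under which a comment longer than `maxlen` keeps its last `maxlen` tokens.

```python
def pad_post(sequences, maxlen, value=0):
    """Hypothetical stand-in for pad_sequences(padding='post', truncating='pre')."""
    padded = []
    for seq in sequences:
        # truncating='pre' drops tokens from the front, keeping the last maxlen
        kept = list(seq)[-maxlen:]
        # padding='post' appends the fill value after the real tokens
        padded.append(kept + [value] * (maxlen - len(kept)))
    return padded


# Toy integer-encoded comments (assumption, not real notebook data)
toy = [[6, 40, 33], [5, 672, 884, 16, 528, 2]]
result = pad_post(toy, maxlen=4)
print(result)
```

With `max_length = 300`, any comment longer than 300 tokens would therefore lose its opening words unless `truncating='post'` is passed explicitly.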

In [7]:
# check padded
print(padded_train.shape)
print(padded_train[99])

print(padded_test.shape)
print(padded_test[100])
## As shown, each comment is padded to 300 tokens; positions without a word are filled with 0

(159571, 300)
[    6    40    33    42   173   282   145    89    26    22     6    96
     5   672   884    16   528     2    33    57     4    18    57  2292
     8    39    73   371     4     6   361     2    16   101  1463    21
   567    37     6   292    18     5   672    50    96    48    11  1018
   439  3246 11844     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0 

In [106]:
# check word list
len(t.word_index)
## Note: this is the count after the tokenizer has tallied every word
## It is independent of max_length, because pad_sequences runs afterwards

210337

<center><a id='model'>III Baseline Model</a></center>

In [8]:
# Build the embedding layer (300D)

## Load the pretrained GloVe vectors
embeddings_index = dict()
with open('glove.6B.300d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
print('Loaded %s word vectors.' % len(embeddings_index))

## Build the embedding matrix
embedding_matrix = np.zeros((vocab_size, 300))
## Note: the second dimension must match the 300-dimensional GloVe vectors
for word, i in t.word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

Loaded 400000 word vectors.


In [9]:
# check embedding_matrix
embedding_matrix.shape

(210338, 300)
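
The tokenizer found 210337 distinct words while GloVe 6B provides 400000 vectors, so some comment words (misspellings in particular) will have no pretrained vector and keep an all-zero row. The sketch below counts that coverage on toy data. The dictionaries are small stand-ins for `t.word_index` and `embeddings_index`, and the words and 2-dimensional vectors are assumptions for illustration only.

```python
import numpy as np

# Toy stand-ins (assumptions) for t.word_index and embeddings_index
word_index = {'the': 1, 'edits': 2, 'succesful': 3}
embeddings_index = {
    'the': np.array([0.1, 0.2], dtype='float32'),
    'edits': np.array([0.3, 0.4], dtype='float32'),
}
embedding_dim = 2

# Row 0 stays zero because the tokenizer reserves index 0 for padding
embedding_matrix = np.zeros((len(word_index) + 1, embedding_dim))
found = 0
for word, i in word_index.items():
    vector = embeddings_index.get(word)
    if vector is not None:
        embedding_matrix[i] = vector
        found += 1

coverage = found / len(word_index)
print('%d of %d words covered (%.1f%%)' % (found, len(word_index), coverage * 100))
```

Running the same count on the real `t.word_index` would show how much of the vocabulary the frozen embedding layer can actually represent.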

In [71]:
# Build the model (basic)
# Note: this cell is no longer executed
# The first version used 50D embeddings
# 50 epochs take a long time

'''
model = Sequential()
## set up the embedding layer
e = Embedding(vocab_size, 50, weights=[embedding_matrix], input_length=50, trainable=False)
model.add(e)
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
## compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
## summarize the model
print(model.summary())
## fit the model
model.fit(padded_training, train.toxic, epochs=50, verbose=0)
## evaluate the model
loss, accuracy = model.evaluate(padded_training, train.toxic, verbose=0)
print('Accuracy: %f' % (accuracy*100))
'''

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_15 (Embedding)     (None, 50, 50)            10516900  
_________________________________________________________________
flatten_13 (Flatten)         (None, 2500)              0         
_________________________________________________________________
dense_13 (Dense)             (None, 1)                 2501      
Total params: 10,519,401
Trainable params: 2,501
Non-trainable params: 10,516,900
_________________________________________________________________
None
Accuracy: 93.387270


In [24]:
# Use the toxic label as a test

# Use the pretrained embedding
# The number of epochs affects accuracy
# Test with 5 epochs and with 20 epochs

# split train and val
padded_train_new, padded_train_val , target_train, target_val = train_test_split(padded_train, train.toxic, test_size=0.2, shuffle=True, random_state=42)
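
Note that the data is split twice: `train_test_split` holds out 20%, and `validation_split=0.2` in `model.fit` then sets aside a further 20% of what remains. The arithmetic below is a sketch of how the sample counts in this notebook's training logs ("Train on 102124 samples, validate on 25532 samples") can be reproduced. The rounding rules in the comments are my understanding of scikit-learn and Keras 2.x behaviour and are labeled as assumptions.

```python
import math

n_total = 159571          # rows in train.csv, from the shape printed earlier
test_size = 0.2           # train_test_split(test_size=0.2)
validation_split = 0.2    # model.fit(validation_split=0.2)

# Assumption: scikit-learn rounds the held-out count up
n_holdout = math.ceil(n_total * test_size)
n_remaining = n_total - n_holdout

# Assumption: Keras trains on the first int(n * (1 - validation_split)) rows
n_fit = int(n_remaining * (1 - validation_split))
n_val = n_remaining - n_fit

print(n_holdout, n_fit, n_val)
```

Under these assumptions only about 64% of the labeled comments are used to fit the weights; the rest is split between the Keras validation set and the held-out evaluation set.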

In [19]:
## model
model = Sequential()
## set up the embedding layer
e = Embedding(vocab_size, 300, weights=[embedding_matrix], input_length=300, trainable=False)
model.add(e)
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
## compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
## summarize the model
print(model.summary())
## fit the model
model.fit(padded_train_new, target_train, epochs=5, validation_split = 0.2, verbose=2)
## evaluate the model
loss, accuracy = model.evaluate(padded_train_val, target_val)
print('Accuracy: %f' % (accuracy*100))
### With no Tokenizer preprocessing: Accuracy: 95.334365
### With preprocessing the result rises slightly: Accuracy: 95.417087
### Results vary between runs because of randomness

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_7 (Embedding)      (None, 300, 300)          63101400  
_________________________________________________________________
flatten_7 (Flatten)          (None, 90000)             0         
_________________________________________________________________
dense_6 (Dense)              (None, 1)                 90001     
Total params: 63,191,401
Trainable params: 90,001
Non-trainable params: 63,101,400
_________________________________________________________________
None
Train on 102124 samples, validate on 25532 samples
Epoch 1/5
 - 38s - loss: 0.2243 - acc: 0.9310 - val_loss: 0.2231 - val_acc: 0.9273
Epoch 2/5
 - 34s - loss: 0.1494 - acc: 0.9507 - val_loss: 0.2220 - val_acc: 0.9369
Epoch 3/5
 - 36s - loss: 0.1282 - acc: 0.9566 - val_loss: 0.2356 - val_acc: 0.9375
Epoch 4/5
 - 36s - loss: 0.1143 - acc: 0.9606 - val_loss: 0.2629 - val_acc: 0.9265
E
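
The parameter counts in the two model summaries can be checked by hand. The sketch below redoes the arithmetic for an Embedding -> Flatten -> Dense(1) stack; the helper name `flatten_model_params` is hypothetical, and `vocab_size` is taken from the `(210338, 300)` shape printed earlier.

```python
vocab_size = 210338   # len(t.word_index) + 1, from the notebook output


def flatten_model_params(vocab_size, embedding_dim, input_length):
    """Parameter counts for Embedding -> Flatten -> Dense(1)."""
    embedding = vocab_size * embedding_dim      # frozen GloVe weights
    flattened = input_length * embedding_dim    # width of the Flatten output
    dense = flattened + 1                       # one weight per input plus a bias
    return embedding, flattened, dense


params_300d = flatten_model_params(vocab_size, 300, 300)
params_50d = flatten_model_params(vocab_size, 50, 50)
print(params_300d)
print(params_50d)
```

Only the Dense layer is trainable (90,001 weights in the 300D model), which is why training stays fast despite the 63 million total parameters.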

<center><a id='model2'>IV Actual Model</a></center>

In [None]:
# model
model = Sequential()
## set up the embedding layer
e = Embedding(vocab_size, 300, weights=[embedding_matrix], input_length=300, trainable=False)
model.add(e)
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
## compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])

In [33]:
# for 6 classes
## get class
class_list = list(train.columns[2:])
print(class_list)

['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']


In [28]:
# toxic

## split train and val
padded_train_new, padded_train_val , target_train, target_val = train_test_split(padded_train, train.toxic, test_size=0.2, shuffle=True, random_state=42)
## fit
model.fit(padded_train_new, target_train, epochs=1, validation_split = 0.2, verbose=2)
## predict
toxic_pred = model.predict(padded_test).T
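## Observation: with more epochs, val_acc keeps dropping slightly
## Is it because the pretrained embedding is already well trained, so extra epochs make things worse?
## (The embedding is frozen with trainable=False, so overfitting in the Dense layer is the more likely cause.)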

Train on 102124 samples, validate on 25532 samples
Epoch 1/1
 - 36s - loss: 0.0973 - acc: 0.9696 - val_loss: 0.2416 - val_acc: 0.9398


In [27]:
# toxic: kept for comparison with 5 epochs (do not run)

## split train and val
padded_train_new, padded_train_val , target_train, target_val = train_test_split(padded_train, train.toxic, test_size=0.2, shuffle=True, random_state=42)
## fit
model.fit(padded_train_new, target_train, epochs=5, validation_split = 0.2, verbose=2)
## predict
toxic_pred = model.predict(padded_test).T
# class_pred = model.predict(padded_training)[:,0]

Train on 102124 samples, validate on 25532 samples
Epoch 1/5
 - 36s - loss: 0.1744 - acc: 0.9524 - val_loss: 0.1666 - val_acc: 0.9517
Epoch 2/5
 - 34s - loss: 0.1316 - acc: 0.9610 - val_loss: 0.2036 - val_acc: 0.9487
Epoch 3/5
 - 34s - loss: 0.1180 - acc: 0.9639 - val_loss: 0.2095 - val_acc: 0.9440
Epoch 4/5
 - 35s - loss: 0.1091 - acc: 0.9667 - val_loss: 0.2088 - val_acc: 0.9452
Epoch 5/5
 - 34s - loss: 0.1031 - acc: 0.9681 - val_loss: 0.2198 - val_acc: 0.9430
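
In this 5-epoch run the training loss keeps falling while `val_loss` is lowest after the first epoch, which is the pattern a patience-based early-stopping rule is meant to catch. The sketch below applies such a rule to the logged values. It is a simplified illustration of the idea behind the Keras `EarlyStopping` callback (imported at the top of the notebook but never used), not a reimplementation of it; the function name `early_stop` and the choice `patience=2` are assumptions.

```python
def early_stop(val_losses, patience=2):
    """Return (best_epoch, stop_epoch), both 1-based, for a patience rule on val_loss."""
    best_loss = float('inf')
    best_epoch = 0
    waited = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # New best value: remember it and reset the patience counter
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_losses)


# val_loss values copied from the 5-epoch toxic run
val_losses = [0.1666, 0.2036, 0.2095, 0.2088, 0.2198]
best_epoch, stop_epoch = early_stop(val_losses, patience=2)
print(best_epoch, stop_epoch)
```

On these numbers the rule would stop after epoch 3 and point back to epoch 1 as the best checkpoint, which agrees with the notebook's choice of `epochs=1`.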


In [29]:
# severe_toxic

## split train and val
padded_train_new, padded_train_val , target_train, target_val = train_test_split(padded_train, train.severe_toxic, test_size=0.2, shuffle=True, random_state=42)
## fit
model.fit(padded_train_new, target_train, epochs=1, validation_split = 0.2, verbose=2)
## predict
severe_toxic_pred = model.predict(padded_test).T

Train on 102124 samples, validate on 25532 samples
Epoch 1/1
 - 36s - loss: 0.0731 - acc: 0.9880 - val_loss: 0.0700 - val_acc: 0.9890


In [31]:
# obscene

## split train and val
padded_train_new, padded_train_val , target_train, target_val = train_test_split(padded_train, train.obscene, test_size=0.2, shuffle=True, random_state=42)
## fit
model.fit(padded_train_new, target_train, epochs=1, validation_split = 0.2, verbose=2)
## predict
obsence_pred = model.predict(padded_test).T

Train on 102124 samples, validate on 25532 samples
Epoch 1/1
 - 36s - loss: 0.1742 - acc: 0.9656 - val_loss: 0.1901 - val_acc: 0.9637


In [32]:
# threat

## split train and val
padded_train_new, padded_train_val , target_train, target_val = train_test_split(padded_train, train.threat, test_size=0.2, shuffle=True, random_state=42)
## fit
model.fit(padded_train_new, target_train, epochs=1, validation_split = 0.2, verbose=2)
## predict
threat_pred = model.predict(padded_test).T

Train on 102124 samples, validate on 25532 samples
Epoch 1/1
 - 38s - loss: 0.0553 - acc: 0.9938 - val_loss: 0.0410 - val_acc: 0.9951


In [34]:
# insult

## split train and val
padded_train_new, padded_train_val , target_train, target_val = train_test_split(padded_train, train.insult, test_size=0.2, shuffle=True, random_state=42)
## fit
model.fit(padded_train_new, target_train, epochs=1, validation_split = 0.2, verbose=2)
## predict
insult_pred = model.predict(padded_test).T

Train on 102124 samples, validate on 25532 samples
Epoch 1/1
 - 34s - loss: 0.1742 - acc: 0.9651 - val_loss: 0.1995 - val_acc: 0.9633


In [51]:
insult_pred

array([[9.9999988e-01, 4.8082884e-04, 4.1227546e-03, ..., 2.1985815e-09,
        2.9259364e-14, 1.5655816e-01]], dtype=float32)

In [38]:
# identity_hate

## split train and val
padded_train_new, padded_train_val , target_train, target_val = train_test_split(padded_train, train.identity_hate, test_size=0.2, shuffle=True, random_state=42)
## fit
model.fit(padded_train_new, target_train, epochs=1, validation_split = 0.2, verbose=2)
## predict
identity_hate_pred = model.predict(padded_test).T

Train on 102124 samples, validate on 25532 samples
Epoch 1/1
 - 35s - loss: 0.0690 - acc: 0.9889 - val_loss: 0.0778 - val_acc: 0.9897


In [39]:
# create submissions
## all predicts
predicts = []

predicts.append(toxic_pred)
predicts.append(severe_toxic_pred)
predicts.append(obsence_pred)
predicts.append(threat_pred)
predicts.append(insult_pred)
predicts.append(identity_hate_pred)

In [40]:
predicts

[array([[1.0000000e+00, 9.5114810e-04, 9.2521552e-03, ..., 3.4142070e-07,
         2.3914837e-11, 8.6604714e-01]], dtype=float32),
 array([[7.5107273e-08, 1.2346321e-06, 1.9661244e-03, ..., 2.3210789e-10,
         2.2403461e-20, 3.8483468e-06]], dtype=float32),
 array([[1.0000000e+00, 4.8686619e-04, 1.3243208e-02, ..., 6.6443512e-10,
         9.3519276e-11, 1.9664212e-01]], dtype=float32),
 array([[2.4706423e-11, 4.7854414e-06, 3.3522677e-03, ..., 2.1565295e-17,
         7.4278944e-28, 1.0512570e-06]], dtype=float32),
 array([[9.9999988e-01, 4.8082884e-04, 4.1227546e-03, ..., 2.1985815e-09,
         2.9259364e-14, 1.5655816e-01]], dtype=float32),
 array([[6.7305180e-08, 5.3950102e-06, 3.7144421e-04, ..., 7.2796637e-11,
         3.5488846e-20, 1.2342658e-05]], dtype=float32)]

In [45]:
pd.DataFrame(data=predicts)

ValueError: Must pass 2-d input

In [43]:
submissions = pd.DataFrame({'id':test.id,
                            'toxic':predicts[0],
                           'severe_toxic':predicts[1],
                           'obscene':predicts[2],
                           'threat':predicts[3],
                           'insult':predicts[4],
                           'identity_hate':predicts[5]}).set_index('id')

Exception: Data must be 1-dimensional

In [41]:
submissions = pd.DataFrame({'id':test.id,
                            'toxic':predicts[0],
                           'severe_toxic':predicts[1],
                           'obscene':predicts[2],
                           'threat':predicts[3],
                           'insult':predicts[4],
                           'identity_hate':predicts[5]}).set_index('id')

# Older pandas versions sort dict-built columns alphabetically by column name.
# Re-order the columns explicitly to match the sample submission file.
submissions = submissions[['toxic','severe_toxic','obscene','threat','insult','identity_hate']]

# create a submission csv file
submissions.to_csv('submission_s2.csv', 
                  columns=['toxic','severe_toxic','obscene','threat','insult','identity_hate']) 

Exception: Data must be 1-dimensional
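
The errors above come from the shape of the predictions: `model.predict(padded_test).T` returns an array of shape `(1, N)`, while each DataFrame column must be 1-dimensional. One possible fix, sketched here on toy data, is to flatten each array with `ravel()` before building the frame. The ids and prediction values below are made up for illustration; only the shapes mirror the notebook.

```python
import numpy as np
import pandas as pd

class_list = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

# Toy stand-ins (assumptions): three test ids and one (1, 3) array per label,
# the same shape that model.predict(padded_test).T returns
ids = ['id_a', 'id_b', 'id_c']
predicts = [np.full((1, 3), 0.1 * (k + 1), dtype='float32') for k in range(len(class_list))]

# ravel() turns each (1, N) array into the 1-D array pandas expects
columns = {name: pred.ravel() for name, pred in zip(class_list, predicts)}
submissions = pd.DataFrame(columns, index=pd.Index(ids, name='id'))[class_list]
print(submissions.shape)
```

Dropping the `.T` and taking `model.predict(padded_test)[:, 0]` would give the same 1-D result.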

In [None]:
# Loop over the 6 classes
for col in class_list:
    print(col)

    ## split train and val for this label
    padded_train_new, padded_train_val, target_train, target_val = train_test_split(
        padded_train, train[col], test_size=0.2, shuffle=True, random_state=42)

    ## build a fresh model so weights do not carry over between labels
    model = Sequential()
    ## set up the embedding layer
    e = Embedding(vocab_size, 300, weights=[embedding_matrix], input_length=300, trainable=False)
    model.add(e)
    model.add(Flatten())
    model.add(Dense(1, activation='sigmoid'))

    ## compile the model
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])

    ## summarize the model
    # print(model.summary())

    ## fit the model
    model.fit(padded_train_new, target_train, epochs=5, validation_split=0.2, verbose=2)

    ## evaluate the model on the held-out split for this label
    loss, accuracy = model.evaluate(padded_train_val, target_val)
    print('Accuracy: %f' % (accuracy*100))

for col in class_list:

    print('\n')

    # set the value of y to this label's column
    y = train[col]

    # create a stratified split
    X_train, X_eval, y_train, y_eval = train_test_split(padded_train, y, test_size=0.25, shuffle=True,
                                                        random_state=5, stratify=y)

<center><a id='conclusions'>V Conclusions</a></center>

In [146]:
class_pred = model.predict(padded_train).T
# class_pred = model.predict(padded_train)[:,0]

array([[0.0084911 , 0.00119909, 0.00929063, ..., 0.13800797, 0.06203122,
        0.2512099 ]], dtype=float32)

In [145]:
class_pred

array([0.0084911 , 0.00119909, 0.00929063, ..., 0.13800797, 0.06203122,
       0.2512099 ], dtype=float32)

In [155]:
pred.append(class_pred)
pred

[array([0.0084911 , 0.00119909, 0.00929063, ..., 0.13800797, 0.06203122,
        0.2512099 ], dtype=float32),
 array([0.0084911 , 0.00119909, 0.00929063, ..., 0.13800797, 0.06203122,
        0.2512099 ], dtype=float32),
 array([0.0084911 , 0.00119909, 0.00929063, ..., 0.13800797, 0.06203122,
        0.2512099 ], dtype=float32),
 array([0.0084911 , 0.00119909, 0.00929063, ..., 0.13800797, 0.06203122,
        0.2512099 ], dtype=float32),
 array([0.0084911 , 0.00119909, 0.00929063, ..., 0.13800797, 0.06203122,
        0.2512099 ], dtype=float32),
 array([0.0084911 , 0.00119909, 0.00929063, ..., 0.13800797, 0.06203122,
        0.2512099 ], dtype=float32)]

In [3]:
results = pd.DataFrame({'id':train.id,
                            'toxic':pred[0],
                           'severe_toxic':pred[1],
                           'obscene':pred[2],
                           'threat':pred[3],
                           'insult':pred[4],
                           'identity_hate':pred[5]}).set_index('id')

NameError: name 'train' is not defined

In [143]:
## get classes
class_list = list(train.columns[2:])
print(class_list)

## create preds
pred = []

['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']


In [128]:
for col in class_list:
    y = col
    print(y)

toxic
severe_toxic
obscene
threat
insult
identity_hate


In [73]:
# output submission
filename = 'submission_s2_1.csv'
submission.to_csv(filename, index=False)
print('Complete: output file saved as {}'.format(filename))

NameError: name 'submission' is not defined

> Main references:
1. [LR + bag-of-words approach from the project proposal](https://www.kaggle.com/tunguz/logistic-regression-with-words-and-char-n-grams)
2. [Cross-validation Performance](https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation)

> Summary:
1. Solution1 uses LR + CBOW for multi-class classification
2. The output is the probability of each class, in [0, 1]

> Kaggle Score:
1. 0.97576
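
The Kaggle competition scores submissions by mean column-wise ROC AUC, not by the accuracy printed during training. The difference matters for rare labels, where predicting 0 for every comment already gives high accuracy. The sketch below illustrates this on synthetic labels; the `roc_auc` helper is a small pairwise implementation written for this example (suitable only for small arrays), and the data is made up.

```python
import numpy as np


def roc_auc(y_true, scores):
    """ROC AUC as the probability that a positive outranks a negative (ties count half)."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # Compare every positive score with every negative score
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))


# Synthetic labels (assumption): 1 positive in 10, mimicking a rare class
y = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
always_zero = [0.0] * 10

accuracy = float(np.mean(np.asarray(always_zero) == np.asarray(y)))
auc_constant = roc_auc(y, always_zero)
auc_example = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(accuracy, auc_constant, auc_example)
```

The constant predictor reaches 90% accuracy here but only 0.5 AUC, so the per-label `val_acc` values in this notebook should be read with the class balance in mind.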