This notebook is following []()

In [1]:
import os
# data_folder is an absolute path, so there is no need to join it onto the home directory.
data_folder = 'E:/git/database/Toxic_Comment'
os.listdir(data_folder)

['test.csv', 'test.csv.zip', 'train.csv', 'train.csv.zip']

In [2]:
import numpy as np
import pandas as pd

train_df = pd.read_csv(os.path.join(data_folder, 'train.csv'))
test_df = pd.read_csv(os.path.join(data_folder, 'test.csv'), delimiter=',')

### Data exploration
[Summary statistics and visualization](http://blog.csdn.net/sinat_22594309/article/details/75215023)

In [3]:
train_df.head(2)

| | id | comment_text | toxic | severe_toxic | obscene | threat | insult | identity_hate |
|---|---|---|---|---|---|---|---|---|
| 0 | 0000997932d777bf | Explanation\nWhy the edits made under my usern... | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 000103f0d9cfb60f | D'aww! He matches this background colour I'm s... | 0 | 0 | 0 | 0 | 0 | 0 |


In [4]:
from sklearn.model_selection import train_test_split

np.random.seed(2)
var = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
train_mes, valid_mes, train_l, valid_l = train_test_split(train_df['comment_text'], train_df[var], test_size=0.2)

## Remove punctuation and stop words
* [Three text feature extraction methods (TF-IDF/Word2Vec/CountVectorizer) with Spark MLlib examples (Scala/Java/Python)](http://blog.csdn.net/liulingyuan6/article/details/53390949)
* [CountVectorizer](http://blog.csdn.net/mmc2015/article/details/46866537)
* [TF-IDF and its algorithm](http://blog.csdn.net/sangyongjia/article/details/52440063)


### Method 1: using the word lists in the string and nltk modules
* **string.punctuation** is a string containing the common ASCII punctuation characters
* **nltk.corpus** contains a corpus reader called stopwords, and **stopwords.words('english')** returns a list of common English stop words (words that carry little meaning on their own)
* when using nltk for the first time, one needs to download the stopwords resource using the command
```python
import nltk; nltk.download('stopwords')
```
[Stop words (words to remove) with NLTK](https://www.cnblogs.com/webRobot/p/6079919.html)

In [23]:
import string
from nltk.corpus import stopwords

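# Build the stop-word set once; calling stopwords.words('english') for every word is very slow.
STOP_WORDS = set(stopwords.words('english'))

def text_process(comment):
    # Drop punctuation characters, then drop stop words (case-insensitive).
    nopunc = ''.join(char for char in comment if char not in string.punctuation)
    return [word for word in nopunc.split() if word.lower() not in STOP_WORDS]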

In [39]:
print(string.punctuation)
print(stopwords.words('english')[:10])

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]
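
The two filtering steps can also be sketched with only the standard library, which makes them easy to follow on a toy sentence. This is a minimal illustration, not the notebook's method: `STOP_WORDS_DEMO` is a small hypothetical stand-in for `stopwords.words('english')`, and `clean_tokens` is an illustrative name.

```python
import string

# Hypothetical, tiny stand-in for stopwords.words('english').
STOP_WORDS_DEMO = {"i", "me", "my", "the", "is", "a", "this", "he"}

# Translation table that deletes every character in string.punctuation.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def clean_tokens(comment):
    """Strip punctuation, then drop stop words (case-insensitive)."""
    no_punct = comment.translate(PUNCT_TABLE)
    return [w for w in no_punct.split() if w.lower() not in STOP_WORDS_DEMO]

print(clean_tokens("D'aww! He matches this background colour."))
# ['Daww', 'matches', 'background', 'colour']
```

Note that the apostrophe in "D'aww" is removed along with the other punctuation, so the token becomes "Daww".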


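#### Text feature extraction
* Use the CountVectorizer() class
```python
from sklearn.feature_extraction.text import CountVectorizer

transform_com = CountVectorizer().fit(train_df['comment_text'])  # build the vocabulary
comments_train = transform_com.transform(train_df['comment_text'])  # convert the strings into a count matrix: columns are vocabulary words, each row holds the word counts of one document
```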
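
To make the fit/transform split concrete, here is a stdlib-only sketch of the same idea on two toy documents. It is not sklearn's implementation: it tokenizes on whitespace only (an assumption made for brevity), and `fit_vocabulary` and `transform` are illustrative names.

```python
from collections import Counter

def fit_vocabulary(docs):
    """Map each distinct lowercase token to a column index, in sorted order."""
    tokens = sorted({tok for doc in docs for tok in doc.lower().split()})
    return {tok: col for col, tok in enumerate(tokens)}

def transform(docs, vocab):
    """Return one row of counts per document; tokens outside vocab are ignored."""
    rows = []
    for doc in docs:
        counts = Counter(tok for tok in doc.lower().split() if tok in vocab)
        rows.append([counts.get(tok, 0) for tok in vocab])
    return rows

docs = ["good edit good", "bad edit"]
vocab = fit_vocabulary(docs)
print(vocab)                    # {'bad': 0, 'edit': 1, 'good': 2}
print(transform(docs, vocab))   # [[0, 1, 2], [1, 1, 0]]
```

Fitting on one corpus and transforming another means unseen words are silently dropped, which is presumably why the notebook fits the vocabulary on the training and test comments together.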

In [27]:
from sklearn.feature_extraction.text import CountVectorizer
import gc
transform_com = CountVectorizer(tokenizer=text_process).fit(pd.concat([train_df['comment_text'], test_df['comment_text']], axis=0))

1835

In [46]:
comments_train = transform_com.transform(train_mes)
comments_valid = transform_com.transform(valid_mes)
comments_test = transform_com.transform(test_df['comment_text'])
gc.collect()

882

#### **comments_train is a csr_matrix (Compressed Sparse Row format)**
* the scipy.sparse module can be used to construct sparse matrices
* csr_matrix has a method, todense, which converts a sparse matrix to a dense matrix

In [59]:
b = comments_train[0:1000].todense()
print(b.sum(axis=0))
print([comments_train.shape, train_mes.shape])

[[3 0 0 ... 0 0 0]]
[(127656, 475143), (127656,)]
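
With 475143 columns and only a handful of non-zero counts per comment, storing every zero would be wasteful. The sketch below shows, in plain Python, the three arrays that the Compressed Sparse Row layout keeps and how a dense matrix is rebuilt from them. It only illustrates the layout; `to_csr` and `to_dense` are hypothetical helpers, not scipy functions.

```python
def to_csr(dense):
    """Return (data, indices, indptr) for a list-of-lists matrix."""
    data, indices, indptr = [], [], [0]
    for row in dense:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)      # the non-zero value
                indices.append(col)     # its column index
        indptr.append(len(data))        # where the next row starts in data
    return data, indices, indptr

def to_dense(data, indices, indptr, n_cols):
    """Rebuild the dense rows from the three CSR arrays."""
    rows = []
    for r in range(len(indptr) - 1):
        row = [0] * n_cols
        for k in range(indptr[r], indptr[r + 1]):
            row[indices[k]] = data[k]
        rows.append(row)
    return rows

dense = [[3, 0, 0, 1], [0, 0, 0, 0], [0, 2, 0, 0]]
data, indices, indptr = to_csr(dense)
print(data, indices, indptr)   # [3, 1, 2] [0, 3, 1] [0, 2, 2, 3]
```

Only three values are stored for the twelve cells above, and an empty row costs a single repeated entry in `indptr`.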


### Using xgboost to combine various classifiers
[Getting started with xgboost in practice](http://blog.csdn.net/sb19931201/article/details/52557382)
<br>
[xgboost parameters](https://xgboost.readthedocs.io/en/latest/parameter.html#general-parameters)
<br>
[Demystifying xgboost, the Kaggle workhorse](http://geek.csdn.net/news/detail/201207)
<br>
[xgboost usage guide](http://blog.csdn.net/vitodi/article/details/60141301)
<br>
[Boosting algorithms: principles and practice](http://blog.csdn.net/sinat_22594309/article/details/60957594)

In [60]:
import xgboost as xgb

In [61]:
def runXGB(train_X, train_y, test_X, test_y=None, feature_names=None, seed_val=2017, num_rounds=400):
    # Parameters for a single binary classifier, evaluated by AUC.
    param = {
        'objective': 'binary:logistic',
        'eta': 0.1,               # learning rate
        'max_depth': 6,
        'silent': 1,              # older xgboost releases; newer ones use 'verbosity'
        'eval_metric': 'auc',
        'min_child_weight': 1,
        'subsample': 0.5,         # fraction of rows sampled per tree
        'colsample_bytree': 0.7,  # fraction of columns sampled per tree
        'seed': seed_val,
        'nthread': 4,
    }
    xgtrain = xgb.DMatrix(train_X, label=train_y)

    if test_y is not None:
        # Labels are available for test_X: watch both sets and stop early when test AUC stalls.
        xgtest = xgb.DMatrix(test_X, label=test_y)
        watchlist = [(xgtrain, 'train'), (xgtest, 'test')]
        model = xgb.train(param, xgtrain, num_rounds, watchlist, early_stopping_rounds=20)
    else:
        # No labels to evaluate against: train for the full number of rounds.
        model = xgb.train(param, xgtrain, num_rounds)

    return model
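
The `early_stopping_rounds=20` argument corresponds to the rule printed in the training log ("Will train until test-auc hasn't improved in 20 rounds"). A simplified model of that rule, applied to a plain list of validation scores, might look like the sketch below. It is not xgboost's code, and `best_round` is a hypothetical helper.

```python
def best_round(scores, patience=20):
    """Return (index, score) of the best round under patience-based early stopping.

    Higher scores are better (as with AUC). Training is considered stopped once
    `patience` rounds have passed without a new best score.
    """
    best_idx, best_score = 0, scores[0]
    for i, score in enumerate(scores):
        if score > best_score:
            best_idx, best_score = i, score
        elif i - best_idx >= patience:
            break  # no improvement for `patience` rounds
    return best_idx, best_score

scores = [0.70, 0.72, 0.71, 0.715, 0.719, 0.73]
print(best_round(scores, patience=3))  # (1, 0.72): stops before ever seeing 0.73
print(best_round(scores, patience=5))  # (5, 0.73)
```

The first call shows the trade-off: a small patience saves rounds but can stop before a later improvement.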

In [62]:
col = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
preds = np.zeros((test_df.shape[0], len(col)))

In [63]:
for i, j in enumerate(col):
    print('fit '+j)
    model = runXGB(comments_train, train_l[j], comments_valid,valid_l[j])
    preds[:,i] = model.predict(xgb.DMatrix(comments_test), ntree_limit = model.best_ntree_limit)
    gc.collect()

fit toxic
[0]	train-auc:0.666997	test-auc:0.675622
Multiple eval metrics have been passed: 'test-auc' will be used for early stopping.

Will train until test-auc hasn't improved in 20 rounds.
[1]	train-auc:0.720597	test-auc:0.723434
[2]	train-auc:0.726657	test-auc:0.729102
[3]	train-auc:0.727314	test-auc:0.729777
[4]	train-auc:0.727371	test-auc:0.729838
[5]	train-auc:0.727322	test-auc:0.729755
[6]	train-auc:0.727345	test-auc:0.729766
[7]	train-auc:0.727501	test-auc:0.729883
[8]	train-auc:0.731772	test-auc:0.735711
[9]	train-auc:0.752546	test-auc:0.756963
[10]	train-auc:0.761578	test-auc:0.762653
[11]	train-auc:0.766036	test-auc:0.767609
[12]	train-auc:0.80173	test-auc:0.8035
[13]	train-auc:0.812071	test-auc:0.81297
[14]	train-auc:0.812563	test-auc:0.812559
[15]	train-auc:0.815512	test-auc:0.815081
[16]	train-auc:0.819485	test-auc:0.818069
[17]	train-auc:0.821573	test-auc:0.821611
[18]	train-auc:0.824401	test-auc:0.825634
[19]	train-auc:0.824078	test-auc:0.825261
[20]	train-auc:0.837002

[191]	train-auc:0.956502	test-auc:0.945354
[192]	train-auc:0.956631	test-auc:0.945201
[193]	train-auc:0.956851	test-auc:0.945464
[194]	train-auc:0.956938	test-auc:0.945643
[195]	train-auc:0.957021	test-auc:0.945652
[196]	train-auc:0.957245	test-auc:0.945787
[197]	train-auc:0.957409	test-auc:0.945978
[198]	train-auc:0.957522	test-auc:0.946026
[199]	train-auc:0.957561	test-auc:0.946063
[200]	train-auc:0.957675	test-auc:0.946119
[201]	train-auc:0.957756	test-auc:0.946272
[202]	train-auc:0.957865	test-auc:0.946354
[203]	train-auc:0.957969	test-auc:0.946401
[204]	train-auc:0.958159	test-auc:0.946353
[205]	train-auc:0.95826	test-auc:0.946406
[206]	train-auc:0.958303	test-auc:0.946444
[207]	train-auc:0.958454	test-auc:0.946744
[208]	train-auc:0.958503	test-auc:0.946759
[209]	train-auc:0.95851	test-auc:0.94677
[210]	train-auc:0.958595	test-auc:0.946882
[211]	train-auc:0.958716	test-auc:0.946962
[212]	train-auc:0.95882	test-auc:0.946903
[213]	train-auc:0.958941	test-auc:0.946965
...

[383]	train-auc:0.970152	test-auc:0.954329
[384]	train-auc:0.970187	test-auc:0.954328
[385]	train-auc:0.970237	test-auc:0.954412
[386]	train-auc:0.970297	test-auc:0.954436
[387]	train-auc:0.970345	test-auc:0.954459
[388]	train-auc:0.970403	test-auc:0.954353
[389]	train-auc:0.970441	test-auc:0.954416
[390]	train-auc:0.970464	test-auc:0.95453
[391]	train-auc:0.970545	test-auc:0.954578
[392]	train-auc:0.970608	test-auc:0.954647
[393]	train-auc:0.970665	test-auc:0.954663
[394]	train-auc:0.970757	test-auc:0.954634
[395]	train-auc:0.970793	test-auc:0.95473
[396]	train-auc:0.970813	test-auc:0.954737
[397]	train-auc:0.970866	test-auc:0.954779
[398]	train-auc:0.970913	test-auc:0.954793
[399]	train-auc:0.970965	test-auc:0.954836
fit severe_toxic
[0]	train-auc:0.807732	test-auc:0.834535
Multiple eval metrics have been passed: 'test-auc' will be used for early stopping.

Will train until test-auc hasn't improved in 20 rounds.
[1]	train-auc:0.891436	test-auc:0.912354
...

[58]	train-auc:0.960981	test-auc:0.958682
[59]	train-auc:0.961764	test-auc:0.958827
[60]	train-auc:0.961914	test-auc:0.958884
[61]	train-auc:0.962822	test-auc:0.959676
[62]	train-auc:0.962919	test-auc:0.959699
[63]	train-auc:0.962978	test-auc:0.959692
[64]	train-auc:0.963386	test-auc:0.960637
[65]	train-auc:0.964807	test-auc:0.96174
[66]	train-auc:0.964801	test-auc:0.961714
[67]	train-auc:0.965678	test-auc:0.963632
[68]	train-auc:0.966007	test-auc:0.963903
[69]	train-auc:0.966869	test-auc:0.964641
[70]	train-auc:0.967176	test-auc:0.964968
[71]	train-auc:0.967794	test-auc:0.965364
[72]	train-auc:0.967871	test-auc:0.965148
[73]	train-auc:0.968469	test-auc:0.965933
[74]	train-auc:0.9687	test-auc:0.966344
[75]	train-auc:0.969934	test-auc:0.96793
[76]	train-auc:0.970221	test-auc:0.968126
[77]	train-auc:0.970369	test-auc:0.968464
[78]	train-auc:0.97108	test-auc:0.9691
[79]	train-auc:0.971483	test-auc:0.969502
[80]	train-auc:0.971475	test-auc:0.969437
[81]	train-auc:0.97162	test-auc:0.969486


[251]	train-auc:0.986542	test-auc:0.980514
[252]	train-auc:0.986572	test-auc:0.980495
[253]	train-auc:0.986574	test-auc:0.980483
[254]	train-auc:0.986617	test-auc:0.98052
[255]	train-auc:0.986649	test-auc:0.980436
[256]	train-auc:0.986659	test-auc:0.980453
[257]	train-auc:0.986664	test-auc:0.980481
[258]	train-auc:0.986682	test-auc:0.980477
[259]	train-auc:0.986691	test-auc:0.98053
[260]	train-auc:0.986704	test-auc:0.980522
[261]	train-auc:0.986727	test-auc:0.98051
[262]	train-auc:0.986734	test-auc:0.980439
[263]	train-auc:0.986789	test-auc:0.980476
[264]	train-auc:0.986858	test-auc:0.980485
[265]	train-auc:0.986847	test-auc:0.980554
[266]	train-auc:0.986903	test-auc:0.980504
[267]	train-auc:0.986917	test-auc:0.980513
[268]	train-auc:0.986935	test-auc:0.98053
[269]	train-auc:0.986952	test-auc:0.980563
[270]	train-auc:0.987027	test-auc:0.980634
[271]	train-auc:0.987037	test-auc:0.98063
[272]	train-auc:0.987046	test-auc:0.980641
[273]	train-auc:0.987074	test-auc:0.9807
...

[102]	train-auc:0.98197	test-auc:0.961616
[103]	train-auc:0.982084	test-auc:0.962018
[104]	train-auc:0.982165	test-auc:0.961883
[105]	train-auc:0.982134	test-auc:0.961646
[106]	train-auc:0.982265	test-auc:0.961886
[107]	train-auc:0.982532	test-auc:0.961673
[108]	train-auc:0.982252	test-auc:0.962289
[109]	train-auc:0.982095	test-auc:0.96243
[110]	train-auc:0.982057	test-auc:0.962446
[111]	train-auc:0.982036	test-auc:0.963226
[112]	train-auc:0.982343	test-auc:0.963186
[113]	train-auc:0.982487	test-auc:0.963311
[114]	train-auc:0.98275	test-auc:0.963965
[115]	train-auc:0.982847	test-auc:0.964157
[116]	train-auc:0.982968	test-auc:0.964341
[117]	train-auc:0.982897	test-auc:0.963955
[118]	train-auc:0.98313	test-auc:0.964447
[119]	train-auc:0.983169	test-auc:0.964657
[120]	train-auc:0.983192	test-auc:0.964861
[121]	train-auc:0.983191	test-auc:0.964984
[122]	train-auc:0.983308	test-auc:0.964958
[123]	train-auc:0.983493	test-auc:0.964958
[124]	train-auc:0.983654	test-auc:0.965307
...

[35]	train-auc:0.910305	test-auc:0.914171
[36]	train-auc:0.911295	test-auc:0.915068
[37]	train-auc:0.916009	test-auc:0.917627
[38]	train-auc:0.922122	test-auc:0.925115
[39]	train-auc:0.92451	test-auc:0.926853
[40]	train-auc:0.927142	test-auc:0.928502
[41]	train-auc:0.928103	test-auc:0.928988
[42]	train-auc:0.931418	test-auc:0.931636
[43]	train-auc:0.931879	test-auc:0.931518
[44]	train-auc:0.932164	test-auc:0.932403
[45]	train-auc:0.93324	test-auc:0.933861
[46]	train-auc:0.933952	test-auc:0.934345
[47]	train-auc:0.935613	test-auc:0.936052
[48]	train-auc:0.936375	test-auc:0.936195
[49]	train-auc:0.936747	test-auc:0.936711
[50]	train-auc:0.938031	test-auc:0.937147
[51]	train-auc:0.93878	test-auc:0.937758
[52]	train-auc:0.939226	test-auc:0.938219
[53]	train-auc:0.939105	test-auc:0.938319
[54]	train-auc:0.939764	test-auc:0.938727
[55]	train-auc:0.940796	test-auc:0.94021
[56]	train-auc:0.940709	test-auc:0.940117
[57]	train-auc:0.94111	test-auc:0.940521
[58]	train-auc:0.94319	test-auc:0.94276

[228]	train-auc:0.975791	test-auc:0.967494
[229]	train-auc:0.975837	test-auc:0.967531
[230]	train-auc:0.975855	test-auc:0.967646
[231]	train-auc:0.975921	test-auc:0.967809
[232]	train-auc:0.975988	test-auc:0.967844
[233]	train-auc:0.976049	test-auc:0.967999
[234]	train-auc:0.976107	test-auc:0.968086
[235]	train-auc:0.976169	test-auc:0.968011
[236]	train-auc:0.976273	test-auc:0.968067
[237]	train-auc:0.976344	test-auc:0.968149
[238]	train-auc:0.976447	test-auc:0.968171
[239]	train-auc:0.976477	test-auc:0.968204
[240]	train-auc:0.976523	test-auc:0.968271
[241]	train-auc:0.976623	test-auc:0.968345
[242]	train-auc:0.976658	test-auc:0.968319
[243]	train-auc:0.976667	test-auc:0.968324
[244]	train-auc:0.976687	test-auc:0.968335
[245]	train-auc:0.976726	test-auc:0.968349
[246]	train-auc:0.976851	test-auc:0.968341
[247]	train-auc:0.976888	test-auc:0.968369
[248]	train-auc:0.97693	test-auc:0.968387
[249]	train-auc:0.977025	test-auc:0.968441
[250]	train-auc:0.977043	test-auc:0.968423
...

[135]	train-auc:0.976822	test-auc:0.954584
[136]	train-auc:0.976888	test-auc:0.954849
[137]	train-auc:0.976898	test-auc:0.954844
[138]	train-auc:0.976922	test-auc:0.954973
[139]	train-auc:0.977162	test-auc:0.954952
[140]	train-auc:0.977186	test-auc:0.954951
[141]	train-auc:0.977205	test-auc:0.954901
[142]	train-auc:0.977329	test-auc:0.954647
[143]	train-auc:0.977378	test-auc:0.954265
[144]	train-auc:0.977669	test-auc:0.954238
[145]	train-auc:0.97782	test-auc:0.954081
[146]	train-auc:0.97781	test-auc:0.95408
[147]	train-auc:0.978001	test-auc:0.954285
[148]	train-auc:0.978103	test-auc:0.954314
[149]	train-auc:0.978249	test-auc:0.954691
[150]	train-auc:0.978323	test-auc:0.955347
[151]	train-auc:0.97838	test-auc:0.955552
[152]	train-auc:0.978437	test-auc:0.955765
[153]	train-auc:0.978491	test-auc:0.955972
[154]	train-auc:0.978539	test-auc:0.955946
[155]	train-auc:0.9786	test-auc:0.955832
[156]	train-auc:0.978854	test-auc:0.955989
[157]	train-auc:0.978945	test-auc:0.955733
...
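
The `train-auc` and `test-auc` values in the log are areas under the ROC curve. One way to read the number: it is the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one, with ties counting as half. The sketch below computes it by brute force over all pairs; it is meant only to illustrate the metric (it is quadratic in the data size), and `pairwise_auc` is a hypothetical name.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))

print(pairwise_auc([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.1]))  # 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, so AUC depends only on the ordering of the predictions, not on their calibration.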

In [78]:
subm = pd.read_csv(os.path.join(data_folder, 'sample_submission.csv'))
submid = pd.DataFrame({'id': subm["id"]})
submission = pd.concat([submid, pd.DataFrame(preds, columns = col)], axis=1)

In [79]:
# check whether id of subm is identical to id of test_df
a = subm.id
b = test_df.id
np.average(a==b)

1.0

In [80]:
submission.head(3)

| | id | toxic | severe_toxic | obscene | threat | insult | identity_hate |
|---|---|---|---|---|---|---|---|
| 0 | 00001cee341fdb12 | 0.99858 | 0.148201 | 0.974944 | 0.036403 | 0.954417 | 0.102639 |
| 1 | 0000247867823ef7 | 0.036035 | 0.004412 | 0.02004 | 0.001155 | 0.028396 | 0.004375 |
| 2 | 00013b17ad220c46 | 0.032298 | 0.004019 | 0.01587 | 0.000848 | 0.021036 | 0.004005 |


In [81]:
submission.to_csv('myNewbies_result.csv', index=False)

In [83]:
submission.shape

(153164, 7)