# Capstone Project Solution 1
## / Toxic Comment Classification /

- - -
<ul>
<li><a href="#prepare">I Environment Setup</a></li>
<li><a href="#wrangling">II Vectorization</a></li>
<li><a href="#eda">III Scoring</a></li>
<li><a href="#conclusions">IV Conclusions</a></li>
</ul>

<a id='intro'></a>

<center><a id='prepare'>I Environment Setup</a></center>

In [2]:
# prepare env

# Use this cell to set up all the packages you plan to use
# import statements
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# show long text in full when displaying DataFrames
pd.options.display.max_colwidth = 500

# render plots inline
%matplotlib inline

# machine learning libraries
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from scipy.sparse import hstack

In [4]:
# import files

test = pd.read_csv('test.csv')
## found utf8 content
## -1 is a possible label value; 0 means non-offensive language

test_labels = pd.read_csv('test_labels.csv')

train = pd.read_csv('train.csv')
## 1 marks a comment flagged with that toxic class

In [5]:
# check files

test.head(1)
## note that row 6 is a benign comment

Unnamed: 0,id,comment_text
0,00001cee341fdb12,"Yo bitch Ja Rule is more succesful then you'll ever be whats up with you and hating you sad mofuckas...i should bitch slap ur pethedic white faces and get you to kiss my ass you guys sicken me. Ja rule is about pride in da music man. dont diss that shit on him. and nothin is wrong bein like tupac he was a brother too...fuckin white boys get things right next time.,"


In [6]:
test_labels.head(1)
## note that test_labels indicates the labels for benign comments (all 0)

Unnamed: 0,id,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,00001cee341fdb12,-1,-1,-1,-1,-1,-1
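
The row above is labelled -1 in every column. In the Kaggle release of this dataset, -1 appears to mark test rows that were not used for scoring, so such rows would need to be dropped before any local evaluation against `test_labels`. The sketch below shows one way to do that on a made-up toy frame; `toy_labels`, `label_cols`, and `scored` are hypothetical names and the values are invented for illustration.

```python
import pandas as pd

# Toy stand-in for test_labels.csv (values are made up for illustration)
toy_labels = pd.DataFrame({
    'id': ['a', 'b', 'c'],
    'toxic': [-1, 0, 1],
    'insult': [-1, 0, 0],
})
label_cols = ['toxic', 'insult']

# Keep only the rows in which no label column equals -1
scored = toy_labels[(toy_labels[label_cols] != -1).all(axis=1)]
print(scored['id'].tolist())  # ['b', 'c']
```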


In [7]:
train.head(1)
## note that row 7 shows the processed target data in train
## each negative category that applies is marked with 1

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0000997932d777bf,"Explanation\nWhy the edits made under my username Hardcore Metallica Fan were reverted? They weren't vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don't remove the template from the talk page since I'm retired now.89.205.38.27",0,0,0,0,0,0


<center><a id='wrangling'>II Vectorization</a></center>

In [91]:
# vectorize

## set classes
class_list = list(train.columns[2:])
class_list

['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

In [97]:
# get comment
train_comment = train.comment_text
test_comment = test.comment_text

In [98]:
# build vectorizer

## set comment vectorizer
comment_vectorizer = TfidfVectorizer(
    sublinear_tf=True,
    strip_accents='unicode',
    analyzer='word',
    token_pattern=r'\w{1,}',
    stop_words='english',
    ngram_range=(1, 1),
    ## (1, 1) extracts unigrams only; (1, 2) would extract unigrams and bigrams
    ## https://stackoverflow.com/questions/24005762/understanding-the-ngram-range-argument-in-a-countvectorizer-in-sklearn
    max_features=10000)

## fit comment vectorizer
comment_vectorizer.fit(train_comment)

## get train and test comment features
train_comment_features = comment_vectorizer.transform(train_comment)
test_comment_features = comment_vectorizer.transform(test_comment)

In [99]:
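## check comment output
train_comment_features.shape
## The output is a sparse matrix of TF-IDF weights over the 10,000 word features; each stored entry looks like:
##  (0, 9974)	0.21031181661230514
## meaning that comment 0 has a TF-IDF weight of about 0.21 for feature 9974 (a weight, not a probability)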

(159571, 10000)
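
To make the `(row, column) weight` reading of the sparse matrix concrete, here is a minimal sketch on a two-document toy corpus. The names `toy_docs`, `toy_vec`, and `toy_matrix` are hypothetical, and the vectorizer uses `ngram_range=(1, 2)` with otherwise default settings rather than the parameters of the cell above, so that bigram features are also visible.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, not the competition data
toy_docs = ['good comment', 'bad bad comment']

# (1, 2) extracts unigrams and bigrams
toy_vec = TfidfVectorizer(ngram_range=(1, 2))
toy_matrix = toy_vec.fit_transform(toy_docs)

# One row per document, one column per vocabulary term
print(toy_matrix.shape)  # (2, 6)
print(sorted(toy_vec.vocabulary_))

# Each stored entry is (document row, feature column) -> TF-IDF weight
coo = toy_matrix.tocoo()
for row, col, weight in zip(coo.row, coo.col, coo.data):
    print(row, col, round(float(weight), 3))
```

The first index is always the document (comment) number and the second is the feature column, which `toy_vec.vocabulary_` maps back to a term.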

In [100]:
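## set character n-gram vectorizer
character_vectorizer = TfidfVectorizer(
    sublinear_tf=True,
    strip_accents='unicode',
    analyzer='char',
    stop_words='english',
    ## note: stop_words only applies when analyzer='word', so it has no effect here
    ngram_range=(2, 4),
    ## character n-grams of 2 to 4 characters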
    max_features=50000)

## fit character vectorizer
character_vectorizer.fit(train_comment)

## get train and test character n-gram features
train_character_features = character_vectorizer.transform(train_comment)
test_character_features = character_vectorizer.transform(test_comment)
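
As a rough illustration of what `analyzer='char'` produces, the sketch below lists the character n-grams of two toy strings and the ones they share. It uses `ngram_range=(2, 3)` to keep the output short, and the names `toy_char_vec`, `char_analyzer`, and `shared` are hypothetical. Shared n-grams like these are one plausible reason character features help with misspelled or obfuscated words, though the notebook does not measure that effect.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Character n-grams of length 2 to 3 (the notebook cell uses 2 to 4)
toy_char_vec = TfidfVectorizer(analyzer='char', ngram_range=(2, 3))

# build_analyzer returns the function that splits a string into n-grams
char_analyzer = toy_char_vec.build_analyzer()
print(sorted(set(char_analyzer('abc'))))  # ['ab', 'abc', 'bc']

# Two spellings of the same word still share several n-grams
shared = set(char_analyzer('hello')) & set(char_analyzer('hell0'))
print(sorted(shared))  # ['el', 'ell', 'he', 'hel', 'll']
```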

In [101]:
## check features with both word and character (train)
train_features = hstack([train_comment_features, train_character_features])
train_features.shape
## for each comment, the features are 10,000 word + 50,000 character n-gram = 60,000

(159571, 60000)

In [104]:
## check features with both word and character (test)
test_features = hstack([test_comment_features, test_character_features])
test_features.shape

(153164, 60000)
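
The column-wise stacking done by `hstack` can be checked on two tiny matrices. This is a minimal sketch with made-up values; `word_block`, `char_block`, and `combined` are hypothetical names standing in for the word and character feature matrices.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Two sparse blocks with the same number of rows (documents)
word_block = csr_matrix(np.array([[1.0, 0.0],
                                  [0.0, 2.0]]))
char_block = csr_matrix(np.array([[0.0, 3.0, 0.0],
                                  [4.0, 0.0, 5.0]]))

# hstack appends the columns side by side; rows stay aligned per document
combined = hstack([word_block, char_block]).tocsr()
print(combined.shape)  # (2, 5)
print(combined.toarray())
```

The row count must match across blocks, which is why the train and test matrices are stacked separately: each keeps its own number of comments while both end up with the same 60,000 columns.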

<center><a id='eda'>III Scoring</a></center>

In [114]:
# scoring

## build the score list
score_list = []

## define the submission format based on the sample submission
submission = pd.DataFrame.from_dict({'id': test['id']})

## define the classifier
classifier = LogisticRegression(C=0.1, solver='sag')

## loop over each class
for name in class_list:
    # get the target column for the current class
    train_target = train[name]

    # fit classifier
    classifier.fit(train_features, train_target)

    # compute the cross-validated score on the training data
    score = np.mean(cross_val_score(classifier, train_features, train_target, scoring='roc_auc', cv=5))
    ## newer scikit-learn versions changed the default cv from 3 to 5; set it to 5 explicitly for consistency

    # append score to score_list
    score_list.append(score)
    # print score
    print('(training) Class {} Score : {}'.format(name, score))

    # use the classifier to predict probabilities for test
    submission[name] = classifier.predict_proba(test_features)[:, 1]

(training) Class toxic Score : 0.971520588226553
(training) Class severe_toxic Score : 0.9876289384719945
(training) Class obscene Score : 0.9860771390906266
(training) Class threat Score : 0.9835028550329452
(training) Class insult Score : 0.97875904830758
(training) Class identity_hate Score : 0.9752953024910692
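
The scores above are means of 5-fold cross-validated ROC AUC, not scores on the data the classifier was fitted on. The sketch below shows the same `cross_val_score` call on a small synthetic dataset; `X_toy`, `y_toy`, `toy_clf`, and `fold_scores` are hypothetical names, the data comes from `make_classification` rather than the comment features, and the default solver is used instead of `sag`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Small synthetic binary problem standing in for one label column
X_toy, y_toy = make_classification(n_samples=200, n_features=10, random_state=0)

toy_clf = LogisticRegression(C=0.1)

# One ROC AUC value per fold; each fold is scored on data held out from fitting
fold_scores = cross_val_score(toy_clf, X_toy, y_toy, scoring='roc_auc', cv=5)
print(fold_scores.shape)  # (5,)
print(np.mean(fold_scores))

# cross_val_score fits clones, so toy_clf itself is still unfitted here
print(hasattr(toy_clf, 'coef_'))  # False
```

Because `cross_val_score` works on clones, the separate `classifier.fit(...)` call in the loop is what makes `predict_proba` usable on the test features afterwards.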


In [110]:
# get final score
print('(training) Average Class Score: {}'.format(np.mean(score_list)))

(training) Average Class Score: 0.980464041358258


<center><a id='conclusions'>IV Conclusions</a></center>

In [113]:
# output submission
filename = 'submission_s1.csv'
submission.to_csv(filename, index=False)
print('Complete: output file saved as {}'.format(filename))

Complete: output file saved as submission_s1.csv


> Main references:
1. [LR + bag-of-words approach from the project suggestions](https://www.kaggle.com/tunguz/logistic-regression-with-words-and-char-n-grams)
2. [Cross-validation Performance](https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation)

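> Summary:
1. Solution 1 uses logistic regression (LR) on TF-IDF bag-of-words features (word and character n-grams), fitting one binary classifier per class
2. The output is a probability in [0, 1] for each class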

> Kaggle Score:
1. 0.97576