# Toxic Comment Classification Challenge
- https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/

## The Goal
- **Create a model which predicts a probability of each type of toxicity for each comment.**

## Data Dictionary
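- id : ID of the comment
- comment_text : text of the comment written by the user
- toxic : abusive language
- severe_toxic : severely abusive language
- obscene : obscene content
- threat : threatening content
- insult : insulting content
- identity_hate : identity-based hatred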

In [1]:
import pandas as pd
import numpy as np
import nltk

In [2]:
train = pd.read_csv('train.csv')
train.tail()

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
159566,ffe987279560d7ff,""":::::And for the second time of asking, when ...",0,0,0,0,0,0
159567,ffea4adeee384e90,You should be ashamed of yourself \n\nThat is ...,0,0,0,0,0,0
159568,ffee36eab5c267c9,"Spitzer \n\nUmm, theres no actual article for ...",0,0,0,0,0,0
159569,fff125370e4aaaf3,And it looks like it was actually you who put ...,0,0,0,0,0,0
159570,fff46fc426af1f9a,"""\nAnd ... I really don't think you understand...",0,0,0,0,0,0


In [3]:
test = pd.read_csv('test.csv')
test.head()

Unnamed: 0,id,comment_text
0,00001cee341fdb12,Yo bitch Ja Rule is more succesful then you'll...
1,0000247867823ef7,== From RfC == \n\n The title is fine as it is...
2,00013b17ad220c46,""" \n\n == Sources == \n\n * Zawe Ashton on Lap..."
3,00017563c3f7919a,":If you have a look back at the source, the in..."
4,00017695ad8997eb,I don't anonymously edit articles at all.


# 1. Pre-Processing
- Add a 'tokenize' column
- Add a 'clean_comment' column

## 1.1 Tokenize column

In [4]:
# tokenize
from nltk.tokenize import word_tokenize
train['tokenize'] = train['comment_text'].apply(word_tokenize)
train['tokenize'].iloc[0]

['Explanation',
 'Why',
 'the',
 'edits',
 'made',
 'under',
 'my',
 'username',
 'Hardcore',
 'Metallica',
 'Fan',
 'were',
 'reverted',
 '?',
 'They',
 'were',
 "n't",
 'vandalisms',
 ',',
 'just',
 'closure',
 'on',
 'some',
 'GAs',
 'after',
 'I',
 'voted',
 'at',
 'New',
 'York',
 'Dolls',
 'FAC',
 '.',
 'And',
 'please',
 'do',
 "n't",
 'remove',
 'the',
 'template',
 'from',
 'the',
 'talk',
 'page',
 'since',
 'I',
 "'m",
 'retired',
 'now.89.205.38.27']

In [5]:
# clean tokens
tokens = [[word.lower() for word in sent if word.isalpha()] for sent in train['tokenize']]
tokens[0]

['explanation',
 'why',
 'the',
 'edits',
 'made',
 'under',
 'my',
 'username',
 'hardcore',
 'metallica',
 'fan',
 'were',
 'reverted',
 'they',
 'were',
 'vandalisms',
 'just',
 'closure',
 'on',
 'some',
 'gas',
 'after',
 'i',
 'voted',
 'at',
 'new',
 'york',
 'dolls',
 'fac',
 'and',
 'please',
 'do',
 'remove',
 'the',
 'template',
 'from',
 'the',
 'talk',
 'page',
 'since',
 'i',
 'retired']
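The `isalpha()` filter above removes more than punctuation. A minimal sketch of its effect, using a hand-picked sample of the tokens shown in the earlier output (the names `sample_tokens`, `cleaned`, and `dropped` are only for illustration):

```python
# A few tokens copied from the word_tokenize output above
sample_tokens = ["They", "were", "n't", "vandalisms", ",",
                 "I", "'m", "retired", "now.89.205.38.27"]

# Same filter as the notebook: keep purely alphabetic tokens, lowercased
cleaned = [word.lower() for word in sample_tokens if word.isalpha()]
print(cleaned)

# Tokens the filter throws away
dropped = [word for word in sample_tokens if not word.isalpha()]
print(dropped)
```

Because "n't" is dropped, "were n't vandalisms" becomes "were vandalisms", so negation is lost. Whether that matters for toxicity classification is an open question here, but it is worth keeping in mind when reading the cleaned text.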

In [6]:
train['tokenize'] = pd.Series(tokens)
train.head()

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate,tokenize
0,0000997932d777bf,Explanation\nWhy the edits made under my usern...,0,0,0,0,0,0,"[explanation, why, the, edits, made, under, my..."
1,000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0,0,0,0,0,0,"[he, matches, this, background, colour, i, see..."
2,000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...",0,0,0,0,0,0,"[hey, man, i, really, not, trying, to, edit, w..."
3,0001b41b1c6bb37e,"""\nMore\nI can't make any real suggestions on ...",0,0,0,0,0,0,"[more, i, ca, make, any, real, suggestions, on..."
4,0001d958c54c6e35,"You, sir, are my hero. Any chance you remember...",0,0,0,0,0,0,"[you, sir, are, my, hero, any, chance, you, re..."


## 1.2 Clean comment column

In [7]:
sents = [' '.join(sent) for sent in tokens]
train['clean_comment'] = pd.Series(sents)
train.head()

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate,tokenize,clean_comment
0,0000997932d777bf,Explanation\nWhy the edits made under my usern...,0,0,0,0,0,0,"[explanation, why, the, edits, made, under, my...",explanation why the edits made under my userna...
1,000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0,0,0,0,0,0,"[he, matches, this, background, colour, i, see...",he matches this background colour i seemingly ...
2,000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...",0,0,0,0,0,0,"[hey, man, i, really, not, trying, to, edit, w...",hey man i really not trying to edit war it jus...
3,0001b41b1c6bb37e,"""\nMore\nI can't make any real suggestions on ...",0,0,0,0,0,0,"[more, i, ca, make, any, real, suggestions, on...",more i ca make any real suggestions on improve...
4,0001d958c54c6e35,"You, sir, are my hero. Any chance you remember...",0,0,0,0,0,0,"[you, sir, are, my, hero, any, chance, you, re...",you sir are my hero any chance you remember wh...


# 2. EDA

## 2.1 Check NaN values
- None: every column in train and test is fully non-null.

In [8]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159571 entries, 0 to 159570
Data columns (total 10 columns):
id               159571 non-null object
comment_text     159571 non-null object
toxic            159571 non-null int64
severe_toxic     159571 non-null int64
obscene          159571 non-null int64
threat           159571 non-null int64
insult           159571 non-null int64
identity_hate    159571 non-null int64
tokenize         159571 non-null object
clean_comment    159571 non-null object
dtypes: int64(6), object(4)
memory usage: 12.2+ MB


In [9]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 153164 entries, 0 to 153163
Data columns (total 2 columns):
id              153164 non-null object
comment_text    153164 non-null object
dtypes: object(2)
memory usage: 2.3+ MB


## 2.2 Word Dictionary
- Number of words
- Number of unique words
- Most common words

In [10]:
# flatten list
words = []
for sent in train['tokenize']:
    words += sent

print('Number of words : ', len(words)) # total number of words
print('Number of unique words : ', len(set(words)))

Number of words :  10338440
Number of unique words :  157455


In [11]:
# count words
from collections import Counter
count = Counter(words)
count.most_common(5)

[('the', 495401),
 ('to', 296851),
 ('i', 236559),
 ('of', 224008),
 ('and', 222709)]

In [15]:
# another method
fdist = nltk.FreqDist(words)
fdist.most_common(5)

[('the', 495401),
 ('to', 296851),
 ('i', 236559),
 ('of', 224008),
 ('and', 222709)]
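The counts can also be built without first flattening everything into one large list. A minimal sketch, assuming a small hypothetical `tokenized` list in place of train['tokenize']:

```python
from collections import Counter

# Hypothetical tokenized comments standing in for train['tokenize']
tokenized = [["the", "edits", "the"], ["please", "the", "page"]]

count = Counter()
for sent in tokenized:
    count.update(sent)  # add this comment's token counts to the running total

total_words = sum(count.values())   # number of words
unique_words = len(count)           # number of unique words
print(total_words, unique_words, count.most_common(1))
```

This gives the same three numbers as the cells above while holding only the counter in memory, not a flat list of every token.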

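In [None]:
# remove stopwords (requires nltk.download('stopwords') once)
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))  # build the set once; corpus name is lowercase
token_stop = [[word for word in sent if word not in stop_words] for sent in tokens]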

# What I Learned
## 1. To apply a tokenize function to each row in Pandas, use .apply()
- Tried to use the .str accessor, but it failed
- .apply(word_tokenize) applies the function to each row
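A minimal sketch of the point above. The `.str` accessor only exposes pandas' built-in string methods, so an external function has to go through `.apply()`. Here `simple_tokenize` is a hypothetical stand-in for `word_tokenize` so the example runs without NLTK data:

```python
import pandas as pd

def simple_tokenize(text):
    # Hypothetical stand-in for nltk.word_tokenize: split on whitespace only
    return text.split()

comments = pd.Series(["Hello there", "You are my hero"])

# .apply() calls the function once per row and collects the results
tokens_col = comments.apply(simple_tokenize)
print(tokens_col.tolist())

# The .str accessor has no attribute for arbitrary external functions
print(hasattr(comments.str, "simple_tokenize"))
```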

## 2. How to use stopwords
- https://pythonspot.com/nltk-stop-words/

## 3. How to clean
- https://machinelearningmastery.com/clean-text-machine-learning-python/

## 4. Speeding up stopword removal
- Convert the stopwords to a set instead of a list before running the for loop
- https://stackoverflow.com/questions/15286401/print-multiple-arguments-in-python
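A rough sketch of the list-versus-set difference, assuming a tiny hypothetical stopword list in place of stopwords.words('english') so that it runs without NLTK data:

```python
import timeit

# Hypothetical short stopword list; the notebook would use stopwords.words('english')
stop_list = ["the", "to", "i", "of", "and", "a", "in", "is", "it", "you"]
stop_set = set(stop_list)  # built once, outside the loop

# Hypothetical tokenized comments
tokens = [["the", "edits", "made", "under", "my", "username"]] * 1000

def remove_with_list():
    # 'in' on a list scans the elements one by one
    return [[w for w in sent if w not in stop_list] for sent in tokens]

def remove_with_set():
    # 'in' on a set is a hash lookup
    return [[w for w in sent if w not in stop_set] for sent in tokens]

print(remove_with_set()[0])
print('list:', timeit.timeit(remove_with_list, number=10))
print('set :', timeit.timeit(remove_with_set, number=10))
```

Both functions return the same result, and only the lookup cost differs. The gap should widen as the stopword list grows. Calling stopwords.words() inside the comprehension is likely to cost even more, since the word list is then rebuilt for every token.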