### Packages

In [1]:
import pandas as pd
import numpy as np
import nltk
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
import string

### Initial data import

In [2]:
comments_df = pd.read_csv('train.csv')
comments_df['comment_text'] = comments_df['comment_text'].astype(str)
print('There are ' + format(len(comments_df), ',d') + ' rows in this data set.')
print('The first ten are...')
comments_df.iloc[0:10, :]

There are 159,571 rows in this data set.
The first ten are...


Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0000997932d777bf,Explanation\nWhy the edits made under my usern...,0,0,0,0,0,0
1,000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0,0,0,0,0,0
2,000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...",0,0,0,0,0,0
3,0001b41b1c6bb37e,"""\nMore\nI can't make any real suggestions on ...",0,0,0,0,0,0
4,0001d958c54c6e35,"You, sir, are my hero. Any chance you remember...",0,0,0,0,0,0
5,00025465d4725e87,"""\n\nCongratulations from me as well, use the ...",0,0,0,0,0,0
6,0002bcb3da6cb337,COCKSUCKER BEFORE YOU PISS AROUND ON MY WORK,1,1,1,0,1,0
7,00031b1e95af7921,Your vandalism to the Matt Shirvington article...,0,0,0,0,0,0
8,00037261f536c51d,Sorry if the word 'nonsense' was offensive to ...,0,0,0,0,0,0
9,00040093b2687caa,alignment on this subject and which are contra...,0,0,0,0,0,0


### Label count

In [3]:
# Blank lines before the summary, as in the original output
print('\n')
# Columns 2 through 7 hold the six binary label columns
for label_name in comments_df.columns[2:8]:
    label_count = str(comments_df[label_name].sum())
    print('There are ' + label_count + ' ' + label_name + ' comments')



There are 15294 toxic comments
There are 1595 severe_toxic comments
There are 8449 obscene comments
There are 478 threat comments
There are 7877 insult comments
There are 1405 identity_hate comments
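
To put these counts in proportion, here is a minimal sketch that converts them to a share of all rows. The counts and the row total are copied from the outputs above; the variable names are only illustrative. Because a comment can carry several labels at once (row 6 in the preview has four), the shares are not expected to add up to anything meaningful.

```python
# Row total reported by the data-import cell
total_rows = 159_571

# Label counts copied from the output above
label_counts = {
    "toxic": 15_294,
    "severe_toxic": 1_595,
    "obscene": 8_449,
    "threat": 478,
    "insult": 7_877,
    "identity_hate": 1_405,
}

# Percentage of all comments that carry each label
shares = {name: 100 * count / total_rows for name, count in label_counts.items()}

for name, pct in shares.items():
    print(f"{name}: {pct:.2f}% of comments")
```

Even the most common label, toxic, covers under a tenth of the comments, and threat covers well under one percent, so the labels are heavily imbalanced.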


### Text processing

Setting up the text-cleaning utilities

In [4]:
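# Named stop_words so the imported `stopwords` corpus module is not shadowed
stop_words = set(stopwords.words('english'))
punctuation = set(string.punctuation)
lemmatize = WordNetLemmatizer()

def cleaning(article):
    # Lowercase, split on whitespace, and drop English stop words
    no_stopwords = " ".join(i for i in article.lower().split() if i not in stop_words)
    # Strip ASCII punctuation character by character
    and_no_punctuation = "".join(i for i in no_stopwords if i not in punctuation)
    # Lemmatize each remaining word (WordNet's default part of speech is noun)
    lemmatized = " ".join(lemmatize.lemmatize(i) for i in and_no_punctuation.split())
    # Tokenize the cleaned string
    return nltk.word_tokenize(lemmatized)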

text_only = pd.DataFrame(comments_df['comment_text'])
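
The order of the steps in `cleaning` matters, and a small dependency-free sketch can show why. The code below is not the notebook's function: it mirrors only the stop-word and punctuation stages, skips lemmatization and NLTK tokenization, and uses a tiny hypothetical stop-word set (`DEMO_STOP_WORDS`) in place of NLTK's English list.

```python
import string

# Hypothetical stand-in for NLTK's English stop-word list (tiny subset, assumption)
DEMO_STOP_WORDS = {"you", "to", "a", "is", "it"}
# string.punctuation contains ASCII punctuation only
PUNCTUATION = set(string.punctuation)

def demo_clean(text):
    """Mirror the first two stages of cleaning(): stop words, then punctuation."""
    # Stage 1: stop words are matched against whitespace-separated chunks
    kept = " ".join(w for w in text.lower().split() if w not in DEMO_STOP_WORDS)
    # Stage 2: punctuation is removed character by character
    stripped = "".join(ch for ch in kept if ch not in PUNCTUATION)
    return stripped.split()

# \u2019 is the curly apostrophe that appears in many of the comments
tokens = demo_clean("You, sir, need to relax a bit. Life isn\u2019t it!")
print(tokens)
```

Two things stand out in the result. Stop words that are attached to punctuation ("You," and "it!") survive, because they are compared against the stop-word set before the punctuation is stripped. The curly apostrophe in "isn’t" also survives, because it is not part of `string.punctuation`. Stripping punctuation first would be one way to avoid the first effect, if that is what is wanted.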

Cleaning and tokenizing every comment, then showing one randomly chosen comment before and after processing

In [5]:
random_index = np.random.choice(len(comments_df))

print('The unprocessed comments are along the lines of... \n')
print('\t' + text_only['comment_text'][random_index])

# Series.apply runs cleaning on each comment; DataFrame.applymap is deprecated in recent pandas
processed_text = text_only['comment_text'].apply(cleaning)

print('\n\nThe cleaned and tokenized comments are along the lines of... \n')
print(processed_text[random_index])

The unprocessed comments are along the lines of... 

	Dude, seriously - you need to relax a bit. Life isn’t all about research and fact hunting. You went crazy when you saw a hoax page, and did hours of research into it. It was a blatant hoax, and it WAS just a childish game, which you got involved in. I hope you’re happy that you spotted the hoax, but anyone could have done it. You even got your friend on the case to do research into it!

 I agree that Wikipedia has no place for rubbish, but it’s the administrator’s job to delete the rubbish. I think at the end of the day, you've got to have fun and researching about some fake person isn't the way to do it. I think to a degree everyone’s idea of fun is different, but almost everyone in this world will agree that what you did wasn't it. I’d prefer to play scrabble with Arabic letters than research into a prank, if you want a debate then I’ll be on my user page, I always enjoy having chats with you - that IS fun!


The cleaned and token