# <center>Purpose of this Notebook</center>





In this notebook we perform an initial eyeball exploration of the datasets to find cleaning steps that are particular to individual datasets (this excludes general cleaning steps such as removing mentions, hashtags, and punctuation).

Applying dataset-specific steps only where they are needed reduces computing cost compared with brute-forcing the same cleaning steps over all of the combined data sources.

In [2]:
import pandas as pd
import numpy as np

import re

In [3]:
pd.set_option('display.max_colwidth', 700)
#('MAX_COL_WIDTH', 500)

# Facebook Hate Speech

In [3]:
fb = pd.read_csv("transformed_data/facebook_hate_speech_translated.csv", encoding='utf-8')

In [4]:
# drop the duplicates
fb = fb.drop_duplicates(subset=['translated_message'])

In [5]:
fb.label.value_counts()

abuse       663
no_abuse      1
Name: label, dtype: int64

# Observe
To find the nuances of this dataset, we observe it by sampling it multiple times.

- repeated punctuation such as `!!!`
- `&quot;`
- `&#39;` (decimal character reference in the dataset): replace with `'`. In fact, replace all decimal-encoded punctuation marks with the actual punctuation marks.
- `@ 137c9c6970afb7fc` (a mention followed by a hexadecimal ID)
- remove accented characters
- URLs

`[^a-zA-Z_\.\s,]` --> unnecessary characters could be removed: all characters except letters, full stop, underscore, comma and white space.
Replace multiple white spaces with one white space.
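The observations above can be combined into one cleaning pass. The following is a minimal sketch, not the pipeline actually used later: the function name `clean_fb_message` and the URL pattern are assumptions made for illustration, while the ID pattern and the character whitelist are the ones tried in this notebook. Accents are stripped by Unicode decomposition, which keeps the base letter instead of dropping the whole character.

```python
import re
import unicodedata

# Hypothetical URL pattern (an assumption; adjust to the URLs seen in the data).
URL_RE = re.compile(r"https?://\S+|www\.\S+")
# ID pattern from this notebook: tokens mixing lowercase letters and digits.
ID_RE = re.compile(r"\b(([a-z]+\d+)|(\d+[a-z]+))(\w)+\b")
# Whitelist from this notebook: letters, underscore, full stop, whitespace, comma.
UNWANTED_RE = re.compile(r"[^a-zA-Z_\.\s,]")


def clean_fb_message(text: str) -> str:
    """Apply the dataset-specific cleaning steps observed for the Facebook data."""
    text = URL_RE.sub(" ", text)
    text = ID_RE.sub(" ", text)
    # Decompose accented characters and drop the combining marks (é -> e).
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Drop everything outside the whitelist (this also removes runs like "!!!").
    text = UNWANTED_RE.sub("", text)
    # Collapse multiple white spaces into one.
    return re.sub(r"\s+", " ", text).strip()
```

The order matters: URLs and IDs are removed before the whitelist step, because the whitelist would otherwise strip their digits and leave letter fragments behind.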


Each sentence of a document should be classified separately; if any sentence is abusive, the whole document should be classified as abusive.
The abusive part should also be highlighted.
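The sentence-level idea can be sketched as follows. This is only an illustration under stated assumptions: `split_sentences` is a naive regex splitter, and `keyword_stub` is a hypothetical stand-in for whatever trained classifier is eventually used. The function returns the flagged sentences so that the abusive part can be highlighted.

```python
import re
from typing import Callable, List, Tuple


def split_sentences(document: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", document.strip())
    return [p for p in parts if p]


def classify_document(
    document: str, is_abusive: Callable[[str], bool]
) -> Tuple[str, List[str]]:
    """Label the document 'abuse' if any sentence is abusive; return those sentences."""
    flagged = [s for s in split_sentences(document) if is_abusive(s)]
    label = "abuse" if flagged else "no_abuse"
    return label, flagged


def keyword_stub(sentence: str) -> bool:
    """Hypothetical placeholder for a real sentence classifier."""
    return "idiot" in sentence.lower()
```

For example, `classify_document("Nice day. You are an idiot! Bye.", keyword_stub)` labels the document as abusive and returns the middle sentence as the part to highlight.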

## Steps for cleaning

- punctuation mark correction
- Mentions, Hashtags

In [30]:
x = fb.loc[643, 'translated_message']
# remove all special characters and numbers (result not displayed; only the last expression is shown)
re.sub(r"[^a-zA-Z_\.\s,]", '', x)
# remove alphanumeric IDs such as 02ab63aad79877f5, ab63aad79877f5, 3f3g6hj7j5v and fg54jkk098ui
re.sub(r"\b(([a-z]+\d+)|(\d+[a-z]+))(\w)+\b", '', x)

# re.sub("&#39;", "'", x)

'CLARIFICATION TO COVER UP THE REASONS FOR THE GOODS, BY THE LIES POLICY and YOUR LIES MEDIA, FOR PEGIDA '

In [23]:

# extract the decimal code from each character reference of the form &#NN;
re.findall(r"\d+", ''.join(re.findall(r"&#\d+;", "Hello &#33;")))

# re.findall("\d+", "&#33;") 

['33']
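Once the decimal code is extracted, it still has to be turned back into the punctuation mark. A minimal sketch of two ways to do that: the standard library's `html.unescape`, which also covers named references such as `&quot;`, and a hand-rolled variant for decimal references only. Both function names are hypothetical and chosen for illustration.

```python
import html
import re


def decode_entities(text: str) -> str:
    """Decode named and numeric HTML character references (&quot;, &#39;, ...)."""
    return html.unescape(text)


def decode_decimal_refs(text: str) -> str:
    """Decode only decimal references such as &#33; by mapping the code to a character."""
    return re.sub(r"&#(\d+);", lambda m: chr(int(m.group(1))), text)


print(decode_entities("what&#39;s so bad"))  # what's so bad
print(decode_decimal_refs("Hello &#33;"))    # Hello !
```

Decoding should run before any whitelist-based character removal, otherwise the `&`, `#` and `;` are stripped and the digits of the code are left behind or lost.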

In [37]:
x = fb.translated_message.loc[56]
re.sub(r"\b(([a-z]+\d+)|(\d+[a-z]+))(\w)+\b", '', x)

'@  what&#39;s so bad about being put to the right so I always openly say my opinion about these shit foreigners I don&#39;t care what the others think or say and I think that should do a lot more otherwise the state never ends what there is much too few neo-nazis say your opinion openly, these shit shit should get out of our country !!!!!!!'

Also remove sentences that are abusive only in a specific context, because we want a generalised system.

Rows to remove:
0, 9, 10, 13, 18, 31, 34, 35, 39, 41, 52, 55, 59, 69, 70, 72, 73, 78, 101, 103, 107, 116, 117

In [24]:
delete_rows = [0, 9, 10, 13, 18, 31, 34, 35, 39, 41, 52, 55, 59, 69, 70, 72, 73, 78, 101, 103, 107, 116, 117]

In [27]:
fb.drop(index=delete_rows).shape

(641, 2)

# Wikipedia Personal Attacks

In [12]:
wiki = pd.read_csv("transformed_data/wikipedia_personal_attacks.csv", encoding='utf-8')

In [13]:
wiki.shape

(115864, 2)

In [16]:
wiki = wiki.drop_duplicates(subset=['comment'])

In [18]:
wiki.label.value_counts()

no_abuse    100142
abuse        15563
Name: label, dtype: int64

In [29]:
msg = "NEWLINE_TOKENNEWLINE_TOKEN== Statement ==NEWLINE_TOKENI would like to be unblocked please, my actions four years ago were unwarranted and I apologise, I would like to contribute constructively."
# strip NEWLINE_TOKEN first, then everything outside the whitelist
re.sub(r"[^a-zA-Z_\.\s,]", "", re.sub(r"NEWLINE_TOKEN", "", msg))

' Statement I would like to be unblocked please, my actions four years ago were unwarranted and I apologise, I would like to contribute constructively.'

## Observations
Remove:
- `NEWLINE_TOKEN`
- accented characters
- `(UTC)`

Notes:
- ``.*`` indicates quotes
- replace "u" with "you"
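The Wikipedia-specific steps can be sketched as below. This is a hedged illustration: the function name `clean_wiki_comment` is hypothetical, double backticks are assumed to be the dataset's way of writing a quotation mark, and the `u` replacement is deliberately naive (it only touches a standalone lowercase `u`). Accent removal is left out here because it is not specific to this dataset.

```python
import re


def clean_wiki_comment(text: str) -> str:
    """Apply the cleaning steps observed for the Wikipedia personal attacks data."""
    # Replace with a space (not an empty string) so neighbouring words stay separated.
    text = text.replace("NEWLINE_TOKEN", " ")
    text = text.replace("(UTC)", " ")
    # Assumption: a pair of backticks stands for a double quotation mark.
    text = text.replace("``", '"')
    # Naive expansion of the standalone token "u" to "you".
    text = re.sub(r"\bu\b", "you", text)
    return re.sub(r"\s+", " ", text).strip()
```

Replacing `NEWLINE_TOKEN` with a space avoids gluing words together, which can happen when the token is replaced with an empty string.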

# White Supremacist

In [3]:
w_s = pd.read_csv("transformed_data/white_supremist_data.csv", encoding="utf-8")

In [6]:
print('Original Shape: {0}'.format(w_s.shape))
print('After Duplicate removal:', w_s.drop_duplicates(subset=['text']).shape)

Original Shape: (10703, 2)
After Duplicate removal: (10534, 2)


In [7]:
w_s.label.value_counts()

no_abuse    9507
abuse       1196
Name: label, dtype: int64

## Observations

- `"[....]"` | `'[....]'` : every document is wrapped in list brackets and quotes, i.e. stored as a list element.


Notes:
- n't --> not
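Both observations can be handled with a couple of small helpers. This is a sketch under assumptions: the names `unwrap_list_literal` and `expand_negation` are hypothetical, and the documents are assumed to be valid Python list literals, which `ast.literal_eval` parses regardless of whether single or double quotes are used. Anything that does not parse is returned unchanged.

```python
import ast
import re


def unwrap_list_literal(raw: str) -> str:
    """Turn a string such as '["some text"]' into 'some text'."""
    try:
        value = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return raw  # not a Python literal, keep the text as it is
    if isinstance(value, list):
        return " ".join(str(item) for item in value)
    return raw


def expand_negation(text: str) -> str:
    """Naive expansion of the token n't to 'not' (did n't -> did not)."""
    # Caveat: irregular forms are not handled (ca n't -> ca not, wo n't -> wo not).
    return re.sub(r"\s*n't\b", " not", text)
```

Compared with chained `str.replace` calls, parsing the literal also handles the single-quoted variant and any escaped quotes inside the text.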

In [50]:
# x = ''.join(w_s.text.loc[[6310, 2334]].values)
x = ''.join(w_s.text.loc[6310])
# re.sub('([(\"|\')).*((\"|\')])', "", x)
# ''.join(re.sub('''("|')\]''', "", re.sub('''\[("|')''', "", x)))
# x
x.replace('["', '').replace('"]', '')

"I 'm surprised Greek or some Asiatic languages did n't pop up ."

# Twitter

In [4]:
tweetr = pd.read_csv("transformed_data/tweeter_data.csv")

In [67]:
tweetr.columns

Index(['tweet', 'label'], dtype='object')

In [5]:
print('Original Shape: {0}'.format(tweetr.shape))
print('After Duplicate removal:', tweetr.drop_duplicates(subset=['tweet']).shape)

Original Shape: (67079, 2)
After Duplicate removal: (20484, 2)


In [6]:
tweetr = tweetr.drop_duplicates(subset=['tweet'])

In [7]:
tweetr.label.unique()

array(['label', 'abuse'], dtype=object)

In [10]:
# row 0 is a stray header row (its label is the string 'label'), so drop it
tweetr = tweetr.drop(index=0)
tweetr.shape

(20483, 2)

## Observation

Remove:

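- `RT` (retweet marker)
- emoji encoded as decimal character references, e.g. `&#9825;` `&#128166;` `&#128540;` --> `&#\d+;`
- encoded quotation marks and emoji, e.g. `&#8220;` `&#128526;` `&#8221;`

Note:

The column names change in the combined data. The data in the DB already has the same column names.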
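The removal steps listed above can be sketched as a single function. The name `clean_tweet` is hypothetical, and the sketch assumes that `RT` always appears as a standalone uppercase token. Mentions are left in place because they belong to the general cleaning steps, not to the dataset-specific ones.

```python
import re

# Decimal character references: emoji and typographic quotes in this dataset.
ENTITY_RE = re.compile(r"&#\d+;")
# Standalone uppercase retweet marker (assumption: never part of a longer word).
RT_RE = re.compile(r"\bRT\b")


def clean_tweet(text: str) -> str:
    """Apply the cleaning steps observed for the Twitter data."""
    text = ENTITY_RE.sub("", text)
    text = RT_RE.sub("", text)
    return re.sub(r"\s+", " ", text).strip()
```

Note that this drops every decimal reference, including encoded punctuation; if the punctuation should be kept, decode the references first and remove only the emoji afterwards.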

In [77]:
x = tweetr.tweet.loc[14721]

re.sub(r"&#\d+;", "", x)  # strip decimal character references (emoji, quotes)

'RT @TrillAssKass: wet pussy makes the dick slip out '

In [21]:
del tweetr

# Toxic Comments

In [22]:
toxic = pd.read_csv("transformed_data/toxic_comments.csv", encoding='utf-8')

In [23]:
toxic.columns

Index(['comment_text', 'label'], dtype='object')

In [24]:
print("Shape of original data:", toxic.shape)

Shape of original data: (2223063, 2)


In [25]:
toxic = toxic.drop_duplicates(subset=['comment_text'])

In [27]:
print("Shape after duplicate removal:", toxic.shape)

Shape after duplicate removal: (2195400, 2)


In [28]:
toxic.label.unique()

array(['no_abuse', 'abuse'], dtype=object)

## Observation

Remove:

- \n \r
- dawggg

Note:

- 50% -> fifty percent, e.g. "Rihanna is 50% black. Her mother is also mixed race, not black."
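A minimal sketch of these steps follows. It rests on several assumptions: the function name `clean_toxic_comment` is hypothetical, `dawggg` is read as an example of an elongated word whose repeated letters should be collapsed, and the percent sign is replaced by the word "percent" while the number itself stays in digits (spelling it out as "fifty" would need a number-to-words step that is not shown here).

```python
import re


def clean_toxic_comment(text: str) -> str:
    """Apply the cleaning steps observed for the toxic comments data."""
    # Replace carriage returns and newlines with a space.
    text = re.sub(r"[\r\n]+", " ", text)
    # 50% -> 50 percent (digits are kept, see the assumption above).
    text = re.sub(r"(\d+(?:\.\d+)?)\s*%", r"\1 percent", text)
    # Collapse a letter repeated three or more times (dawggg -> dawg).
    # Restricted to letters so that numbers such as 1000 are left intact.
    text = re.sub(r"([a-zA-Z])\1{2,}", r"\1", text)
    return re.sub(r"\s+", " ", text).strip()
```

The percent replacement runs before any digit-stripping step, otherwise "50%" would lose the number and the information it carries.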

In [43]:
x = toxic.comment_text.loc[377713]  # 1839087

# x.replace("\n", '')
x

'You are the one who recently claimed an assistant professor at the UO makes $200k and then "looked it up" and found assistant profs at the UO make $113k. Both were very wrong. You are in no position to claim anyone else is writing fantasy or not when you make such wild statements without support.'

In [44]:
toxic.sample(10)

Unnamed: 0,comment_text,label
267106,"If a president needs to be crooked and dishonest, then yes Hillary fits the bill.",abuse
304372,Pleasure to be somewhat in agreement with ya here this morning Mr. 313 but restructuring the Permanent Fund ain't gonna happen unless more is taken back from the gift package given to the oil industry. Demi are not going to allow the Permanent Fund to be restructured only to continue the $600 Million per year giveaway to industry. And the governor will not allow it. The big majority of people who have actually worked the computer calculation app see that indeed restructuring the Perm Fund is a huge part of fiscal stability but it can't happen if the massive giveaway stays in place. We all have to pay the fiddler including and especially the oil industry that has used this states resource...,no_abuse
578639,"Such a Screechy/Whiney voice! How can people listen to that woman? Waterboarding would be less obnoxious.\nI sincerely feel for those on the receiving end of her vitriol. Actually, worse than Sarah P.",no_abuse
1367477,"Spoken like a vulgar, disgusting pig. These talking heads Do Not Get It.",abuse
2104312,"Thank you, Bishop McElroy, for reminding us to care for the ""powerless, the worker, the homeless, the hungry, those without decent medical care, the unemployed”. In this time in the world (not just the USA) this is a prophetic message.",no_abuse
899035,"""Also on Thursday, Malaysia announced it was ending visa-free travel for North Koreans, citing security reasons.\nNorth Koreans can currently visit Malaysia for up to 30 days without securing a visa, under a reciprocal deal.\nBut Deputy Prime Minister Ahmad Zahid Hamidi has said that is changing as of Monday 6 March, the Bernama state news agency reports."" \n\nI'm curious. Are you for or against criminals coming into our country?",no_abuse
1254765,"Okay, do we shoot at US (or other countries) ships if they violate Arctic waters? Diplomacy is the solution.",no_abuse
203586,""" \n :It's good to focus on school history. You can sketch some things out below while the block is still up in the air. '''''' """,no_abuse
161357,""" \n ::Oh, and one more observationBernadette is wonderfully silly and goofy, and more than a little """"over-the-top"""", but in recent performances (most especially Follies) she is quite subtle and nuancedperhaps reflecting the role, perhaps the woman. But, always emotionally raw and vulnerable. """,no_abuse
481900,A huge heap of burning tires. How's that suppose to tie in with the protestors desire to protect the environment?,no_abuse
