# <center>Purpose of this Notebook</center>





In this notebook we perform an initial eyeball exploration of the datasets to find cleaning steps that are specific to individual datasets. This excludes general cleaning steps such as removing mentions, hashtags, and punctuation.

Identifying the dataset-specific steps up front reduces computing cost, because we avoid brute-forcing the same cleaning steps over all of the combined data sources.

In [2]:
import pandas as pd
import numpy as np

import re

In [3]:
pd.set_option('display.max_colwidth', 700)
#('MAX_COL_WIDTH', 500)

# Facebook Hate Speech

In [3]:
fb = pd.read_csv("transformed_data/facebook_hate_speech_translated.csv", encoding='utf-8')

In [4]:
# drop the duplicates
fb = fb.drop_duplicates(subset=['translated_message'])

In [5]:
fb.label.value_counts()

abuse       663
no_abuse      1
Name: label, dtype: int64

# Observe
To find the quirks of this dataset, we sample it several times and inspect the rows.

Remove:
- &quot;
- &#39; (decimal character reference in the dataset): replace with '. In fact, replace all decimal-encoded punctuation marks with the actual punctuation marks.
- hashed user IDs left after mentions, e.g. @ 137c9c6970afb7fc
- repeated exclamation marks (!!!)


Note:
- remove accented characters
- URL

[^a-zA-Z_\.\s,] --> Unnecessary characters can be removed with this pattern, i.e. every character except letters, full stop, underscore, comma and white space.
Replace multiple white spaces with a single white space.
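
The removal steps above can be sketched as one small function. This is only an illustration, not the cleaning code used later: `clean_fb_message` and its `strip_special` flag are hypothetical names, and the assumption that hashed user IDs are always 16 hexadecimal characters is taken from the single example shown above.

```python
import re

def clean_fb_message(text, strip_special=True):
    """Hypothetical helper sketching the Facebook-specific clean-up steps."""
    # Drop hashed user IDs left after a mention, e.g. "@ 137c9c6970afb7fc"
    # (assumption: the ID is always 16 hexadecimal characters).
    text = re.sub(r"@\s*[0-9a-f]{16}\b", "", text)
    # Collapse runs of exclamation marks into a single one.
    text = re.sub(r"!{2,}", "!", text)
    if strip_special:
        # Keep only letters, underscore, full stop, comma and white space.
        text = re.sub(r"[^a-zA-Z_.\s,]", "", text)
    # Replace multiple white spaces with a single space.
    return re.sub(r"\s+", " ", text).strip()

print(clean_fb_message("@ 137c9c6970afb7fc  go home !!!!", strip_special=False))
print(clean_fb_message("Hello,   world. 123"))
```

Note that the character filter also deletes `!` and `'`, so collapsing repeated exclamation marks only matters when `strip_special` is switched off.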


Each sentence of a document should be classified separately; if any sentence is abusive, the whole document should be classified as abusive.
The abusive part should also be highlighted.
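
A minimal sketch of this sentence-level rule, under stated assumptions: `classify_document` is a hypothetical name, `predict_sentence` stands in for whatever model is trained later, and the sentence splitter is a naive regex rather than a proper tokenizer.

```python
import re

def classify_document(document, predict_sentence):
    """Label a document 'abuse' if any of its sentences is predicted abusive.

    `predict_sentence` is a hypothetical callable that returns True for an
    abusive sentence. The abusive sentences are returned for highlighting.
    """
    # Naive split: break after '.', '!' or '?' when followed by white space.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", document.strip()) if s]
    abusive = [s for s in sentences if predict_sentence(s)]
    label = "abuse" if abusive else "no_abuse"
    return label, abusive

# Toy predictor used only to exercise the function.
toy_predict = lambda s: "idiot" in s.lower()
print(classify_document("Nice weather today. You are an idiot! Bye.", toy_predict))
```

Returning the abusive sentences alongside the label keeps the highlighting step separate from the prediction step.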

## Steps for cleaning

- Punctuation mark correction
- Mentions, Hashtags

In [30]:
x = fb.loc[643,'translated_message']
re.sub(r"[^a-zA-Z_\.\s,]", '', x) # remove all special characters and numbers
re.sub(r"\b(([a-z]+\d+)|(\d+[a-z]+))(\w)+\b", '', x) # 02ab63aad79877f5, ab63aad79877f5, 3f3g6hj7j5v and fg54jkk098ui

# re.sub("&#39;", "'", x)

'CLARIFICATION TO COVER UP THE REASONS FOR THE GOODS, BY THE LIES POLICY and YOUR LIES MEDIA, FOR PEGIDA '

In [23]:

re.findall(r"\d+", ''.join(re.findall(r"&#\d+;", "Hello &#33;")))

# re.findall("\d+", "&#33;") 

['33']
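
The extracted code point can be turned back into its character. The sketch below shows one way to do that; `decode_decimal_entities` is a hypothetical helper name, and it assumes every number is a valid Unicode code point (otherwise `chr` raises `ValueError`). The standard library's `html.unescape` covers the same decimal references as well as named entities such as `&quot;`.

```python
import html
import re

def decode_decimal_entities(text):
    """Replace decimal references such as &#39; with the character they encode."""
    return re.sub(r"&#(\d+);", lambda m: chr(int(m.group(1))), text)

sample = "what&#39;s so bad &#33;"
print(decode_decimal_entities(sample))  # what's so bad !

# html.unescape also handles named entities such as &quot;.
print(html.unescape("&quot;Hello&quot; &#33;"))  # "Hello" !
```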

In [37]:
x = fb.translated_message.loc[56]
re.sub(r"\b(([a-z]+\d+)|(\d+[a-z]+))(\w)+\b", '', x)

'@  what&#39;s so bad about being put to the right so I always openly say my opinion about these shit foreigners I don&#39;t care what the others think or say and I think that should do a lot more otherwise the state never ends what there is much too few neo-nazis say your opinion openly, these shit shit should get out of our country !!!!!!!'

Also remove sentences that are abusive only in a specific context, since we want a generalised system.

Rows to remove:
0, 9, 10, 13, 18, 31, 34, 35, 39, 41, 52, 55, 59, 69, 70, 72, 73, 78, 101, 103, 107, 116, 117

In [24]:
delete_rows = [0, 9, 10, 13, 18, 31, 34, 35, 39, 41, 52, 55, 59, 69, 70, 72, 73, 78, 101, 103, 107, 116, 117]

In [27]:
fb.drop(index=delete_rows).shape

(641, 2)

# Wikipedia Personal Attacks

In [12]:
wiki = pd.read_csv("transformed_data/wikipedia_personal_attacks.csv", encoding='utf-8')

In [13]:
wiki.shape

(115864, 2)

In [16]:
wiki = wiki.drop_duplicates(subset=['comment'])

In [18]:
wiki.label.value_counts()

no_abuse    100142
abuse        15563
Name: label, dtype: int64

In [29]:
msg = "NEWLINE_TOKENNEWLINE_TOKEN== Statement ==NEWLINE_TOKENI would like to be unblocked please, my actions four years ago were unwarranted and I apologise, I would like to contribute constructively."
re.sub(r"[^a-zA-Z_\.\s,]", "", re.sub(r"NEWLINE_TOKEN", "", msg))

' Statement I would like to be unblocked please, my actions four years ago were unwarranted and I apologise, I would like to contribute constructively.'

## Observations
Remove:
- NEWLINE_TOKEN
- ``.*`` (text wrapped in double backticks, which indicates quotes)
- (UTC)

Notes:
- remove accented characters
- replace "u" with "you"
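
A rough sketch of these Wikipedia-specific steps using only the standard library. `clean_wiki_comment` is a hypothetical name. Two choices here are assumptions rather than something the exploration settled: `NEWLINE_TOKEN` is replaced by a space so that neighbouring words are not glued together, and accents are stripped by Unicode decomposition.

```python
import re
import unicodedata

def clean_wiki_comment(text):
    """Hypothetical helper sketching the Wikipedia-specific clean-up steps."""
    text = text.replace("NEWLINE_TOKEN", " ")  # newline placeholder
    text = text.replace("``", "")              # double backticks mark quotes
    text = text.replace("(UTC)", "")           # timestamp suffix
    # Strip accents: decompose characters, then drop the combining marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Replace a standalone "u" (any case) with "you".
    text = re.sub(r"\bu\b", "you", text, flags=re.IGNORECASE)
    # Collapse the white space left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()

print(clean_wiki_comment("NEWLINE_TOKENcafé, u said ``hi`` (UTC)"))
```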

# White Supremacist

In [3]:
w_s = pd.read_csv("transformed_data/white_supremist_data.csv", encoding="utf-8")

In [6]:
print('Original Shape: {0}'.format(w_s.shape))
print('After Duplicate removal:', w_s.drop_duplicates(subset=['text']).shape)

Original Shape: (10703, 2)
After Duplicate removal: (10534, 2)


In [7]:
w_s.label.value_counts()

no_abuse    9507
abuse       1196
Name: label, dtype: int64

## Observations

- "[....]" | '[....]' : every document is stored as a list element, so the text is wrapped in list brackets and quotes.


Notes:
- n't --> not

In [50]:
# x = ''.join(w_s.text.loc[[6310, 2334]].values)
x = ''.join(w_s.text.loc[6310])
# re.sub('([(\"|\')).*((\"|\')])', "", x)
# ''.join(re.sub('''("|')\]''', "", re.sub('''\[("|')''', "", x)))
# x
x.replace('["', '').replace('"]', '')

"I 'm surprised Greek or some Asiatic languages did n't pop up ."
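
The `replace` chain above handles only the double-quoted wrapper. A sketch that covers both quote styles and the `n't --> not` note could look like the following. Both function names are hypothetical, and the negation rule is deliberately naive: it assumes the tokenised form `did n't` and would turn `ca n't` into `ca not`.

```python
import re

def strip_list_wrapper(text):
    """Remove a leading [" or [' and a trailing "] or '] from a document."""
    return re.sub(r"""^\[["']|["']\]$""", "", text)

def expand_negation(text):
    """Naively expand the tokenised contraction " n't" to " not"."""
    return text.replace(" n't", " not")

raw = '''["I 'm surprised Greek or some Asiatic languages did n't pop up ."]'''
print(expand_negation(strip_list_wrapper(raw)))
```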

# Twitter

In [4]:
tweetr = pd.read_csv("transformed_data/tweeter_data.csv")

In [67]:
tweetr.columns

Index(['tweet', 'label'], dtype='object')

In [5]:
print('Original Shape: {0}'.format(tweetr.shape))
print('After Duplicate removal:', tweetr.drop_duplicates(subset=['tweet']).shape)

Original Shape: (67079, 2)
After Duplicate removal: (20484, 2)


In [6]:
tweetr = tweetr.drop_duplicates(subset=['tweet'])

In [7]:
tweetr.label.unique()

array(['label', 'abuse'], dtype=object)

In [10]:
# Row 0 is a stray header row (its label is the string 'label'), so drop it
tweetr = tweetr.drop(index=0)
tweetr.shape

(20483, 2)

## Observation

Remove:

- RT

Note:
- Emojis appear as decimal character references, e.g. &#9825; &#128166; &#128540; --> &#\d+;
- Curly quotes are encoded the same way: &#8220; &#128526;&#8221;
- Emojis match the same regex as the encoded apostrophe. Hence they must be removed only after the apostrophe substitution.
- Column names need to change in the combined data. Data in the DB already has the same column names.
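
The ordering constraint can be sketched as follows. `clean_tweet` is a hypothetical name, and the sketch assumes the apostrophe is encoded as `&#39;`, as it is in the Facebook data.

```python
import re

def clean_tweet(text):
    """Hypothetical helper sketching the Twitter-specific clean-up steps."""
    # 1. Substitute the apostrophe first; step 2 would otherwise delete it.
    text = text.replace("&#39;", "'")
    # 2. Remove the remaining decimal references (emojis, curly quotes).
    text = re.sub(r"&#\d+;", "", text)
    # 3. Remove the retweet marker.
    text = re.sub(r"\bRT\b\s*", "", text)
    # Collapse the white space left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()

print(clean_tweet("RT @user: don&#39;t &#128166; stop &#8220;now&#8221;"))
```

Mentions such as `@user` are left in place here because they belong to the general cleaning steps.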

In [77]:
x = tweetr.tweet.loc[14721]

re.sub(r"&#\d+;", "", x)

'RT @TrillAssKass: wet pussy makes the dick slip out '

In [21]:
del tweetr

# Toxic Comments

In [22]:
toxic = pd.read_csv("transformed_data/toxic_comments.csv", encoding='utf-8')

In [23]:
toxic.columns

Index(['comment_text', 'label'], dtype='object')

In [24]:
print("Shape of original data:", toxic.shape)

Shape of original data: (2223063, 2)


In [25]:
toxic = toxic.drop_duplicates(subset=['comment_text'])

In [27]:
print("Shape after duplicate removal:", toxic.shape)

Shape after duplicate removal: (2195400, 2)


In [28]:
toxic.label.unique()

array(['no_abuse', 'abuse'], dtype=object)

## Observation

Remove:

- \n \r
- dawggg

Note:

- 50% -> fifty percent, e.g. "Rihanna is 50% black. Her mother is also mixed race, not black."
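
A small sketch of the line-break and percent handling. `clean_toxic_comment` is a hypothetical name. The sketch only rewrites the `%` sign as the word "percent"; spelling out the number itself (50 -> fifty) is left out and would need a number-to-words step.

```python
import re

def clean_toxic_comment(text):
    """Hypothetical helper sketching the Toxic Comments clean-up steps."""
    # Replace carriage returns and newlines with a space.
    text = re.sub(r"[\r\n]+", " ", text)
    # Rewrite a percent sign that follows a digit as the word "percent".
    text = re.sub(r"(\d)\s*%", r"\1 percent", text)
    # Collapse the white space left behind by the replacements.
    return re.sub(r"\s+", " ", text).strip()

print(clean_toxic_comment("Rihanna is 50% black.\r\nHer mother is also mixed race."))
```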

In [43]:
x = toxic.comment_text.loc[377713]  # 1839087

# x.replace("\n", '')
x

'You are the one who recently claimed an assistant professor at the UO makes $200k and then "looked it up" and found assistant profs at the UO make $113k. Both were very wrong. You are in no position to claim anyone else is writing fantasy or not when you make such wild statements without support.'