## Problem: Detection of aggressive tweets

Dataset: https://www.kaggle.com/dataturks/dataset-for-detection-of-cybertrolls/home

The dataset contains 20001 English-language tweets, each labeled by a human annotator as:
* 1 (Cyber-Aggressive)
* 0 (Non Cyber-Aggressive)

# Preparing Data

### Reading data

In [1]:
import json

In [20]:
data_file_path = './Data'
input_file_name = 'Dataset_for_Detection_of_Cyber-Trolls.json'

In [21]:
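with open('/'.join([data_file_path, input_file_name]), 'r') as file:
    # the file is a sequence of JSON objects, not a single JSON document:
    # wrap it in [] and turn every '}\n' except the last one into '},'
    json_str = '[' + file.read() + ']'
    json_str = json_str.replace('}\n', '},', json_str.count('}\n') - 1)
    data_raw = json.loads(json_str)

    # sanity check: the dataset description promises 20001 records
    assert len(data_raw) == 20001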
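
If the file stores exactly one JSON object per line (an assumption: the string patching above suggests it but does not guarantee it), each line can be parsed on its own instead of rebuilding a single array. A minimal sketch, where `read_json_lines` is a hypothetical helper name that does not appear in the notebook:

```python
import json


def read_json_lines(path):
    """Parse a file holding one JSON object per line; blank lines are skipped."""
    records = []
    with open(path, 'r', encoding='utf-8') as file:
        for line in file:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records
```

Under that assumption, `read_json_lines('/'.join([data_file_path, input_file_name]))` should return the same list as `data_raw`.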

### Reviewing data

In [22]:
import numpy as np
# json_normalize lives in the top-level pandas namespace since pandas 1.0
from pandas import json_normalize

In [23]:
data_raw[0]

{'annotation': {'label': ['1'], 'notes': ''},
 'content': ' Get fucking real dude.',
 'extras': None}

In [24]:
# check if the field 'extras' is relevant: count the records where it is set
sum(record['extras'] is not None for record in data_raw)

0

In [25]:
# check if the field 'notes' is relevant
(np.array([record['annotation']['notes'] for record in data_raw]) != '').sum()

0

In [26]:
# keep only relevant information
data = json_normalize(data_raw, record_path=[['annotation', 'label']], meta=['content'])

In [27]:
data.head()

| | 0 | content |
|---|---|---|
| 0 | 1 | Get fucking real dude. |
| 1 | 1 | She is as dirty as they come and that crook ... |
| 2 | 1 | why did you fuck it up. I could do it all day... |
| 3 | 1 | Dude they dont finish enclosing the fucking s... |
| 4 | 1 | WTF are you talking about Men? No men thats n... |
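
To see what `record_path` and `meta` do in the `json_normalize` call, the same call can be run on two made-up records that copy the layout of `data_raw[0]`. This is only an illustrative sketch: `toy_records` and `toy` are hypothetical names, and `pd.json_normalize` is assumed to be available (pandas 1.0 or later).

```python
import pandas as pd

# Two made-up records with the same layout as the dataset entries.
toy_records = [
    {'annotation': {'label': ['1'], 'notes': ''}, 'content': 'first', 'extras': None},
    {'annotation': {'label': ['0'], 'notes': ''}, 'content': 'second', 'extras': None},
]

# record_path points at the nested list to unpack (one row per list element);
# meta names the top-level fields that are copied onto every unpacked row.
toy = pd.json_normalize(toy_records,
                        record_path=[['annotation', 'label']],
                        meta=['content'])

# The unpacked list elements land in a column named 0, hence the rename.
toy = toy.rename(columns={0: 'label'})
print(toy)
```

The labels are still strings at this point, which is why the notebook converts the column to integers afterwards.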


In [28]:
data.rename(columns={0: 'label'}, inplace=True)

In [29]:
data.shape

(20001, 2)

In [30]:
data.dtypes

label      object
content    object
dtype: object

In [31]:
# np.int was removed in NumPy 1.24; the built-in int is equivalent here
data['label'] = data.label.astype(int)

### Dividing data into the training and test part

In [32]:
from sklearn.model_selection import train_test_split

In [33]:
np.unique(data.label, return_counts=True)

(array([0, 1]), array([12179,  7822], dtype=int64))

In [34]:
print('Data: ', 'class 1 contribution = %.2f' % data.label.mean(), 
      'shape = %s' % (data.shape,), sep='\n')

Data: 
class 1 contribution = 0.39
shape = (20001, 2)


In [35]:
data_train, data_test = train_test_split(data, test_size=0.2, shuffle=True, random_state=123)

In [36]:
print('Training data: ', 'class 1 contribution = %.2f' % data_train.label.mean(), 
      'shape = %s' % (data_train.shape,), sep='\n', end='\n\n')
print('Test data: ', 'class 1 contribution = %.2f' % data_test.label.mean(), 
      'shape = %s' % (data_test.shape,), sep='\n')

Training data: 
class 1 contribution = 0.39
shape = (16000, 2)

Test data: 
class 1 contribution = 0.38
shape = (4001, 2)
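
The class 1 share differs slightly between the two parts (0.39 vs 0.38) because the split is purely random. If identical class proportions are wanted, `train_test_split` accepts a `stratify` argument. The sketch below uses a synthetic frame (`demo` is a made-up stand-in, not the real data) and is an optional variation, not what the notebook does:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real frame: 100 rows, 40% of them in class 1.
demo = pd.DataFrame({'label': [1] * 40 + [0] * 60,
                     'content': ['text %d' % i for i in range(100)]})

# stratify=demo.label keeps the class ratio the same in both parts.
demo_train, demo_test = train_test_split(
    demo, test_size=0.2, shuffle=True, random_state=123, stratify=demo.label)

print('train share = %.2f' % demo_train.label.mean())
print('test share = %.2f' % demo_test.label.mean())
```

On the real data this would correspond to passing `stratify=data.label` in the call above.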


In [19]:
# save the training and test datasets as JSON arrays of records
train_file_name = 'train.json'
test_file_name = 'test.json'

data_train.to_json('/'.join([data_file_path, train_file_name]), orient='records')
data_test.to_json('/'.join([data_file_path, test_file_name]), orient='records')
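
Files written with `orient='records'` can be loaded again with `pd.read_json`. A small round-trip sketch on a made-up frame (`demo` and `demo.json` are hypothetical names, and a temporary directory stands in for `./Data`):

```python
import os
import tempfile

import pandas as pd

# Small made-up frame standing in for data_train.
demo = pd.DataFrame({'label': [1, 0], 'content': ['first', 'second']})

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'demo.json')
    demo.to_json(demo_path, orient='records')
    # Read the file back with the same orient that was used for writing.
    restored = pd.read_json(demo_path, orient='records')

print(restored)
```

The same `pd.read_json(..., orient='records')` call should work for `train.json` and `test.json` in later notebooks.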