## Problem: Detection of aggressive tweets

The training dataset has 12800 tweets (in English) and the validation dataset has 3200 tweets.<br/>
Tweets are labeled by humans as:
* 1 (Cyber-Aggressive; 9714 items)
* 0 (Non Cyber-Aggressive; 6286 items)

# Tweet Analysis

### Data

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [2]:
data_train = pd.read_json('./Data/train.json')

In [3]:
X_train = data_train.content
y_train = data_train.label

Only the training set is analysed.

### Punctuation

In [4]:
import string
import re
from scipy import stats

In [5]:
PUNCTUATION = string.punctuation
print(PUNCTUATION)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


In [6]:
def compute_binom_pvalue(data, column_name):
    '''Compute p-value for two-sided test of the null hypothesis 
    that two classes are equally likely to occur.
    Parameters:
    - data: dataframe; data with the binary column 'label',
    - column_name: string; the name of the column to be checked.
    Returns:
    - pvalue: float; the p-value.
    '''
    
    matching_labels = data.label[data[column_name].notnull()]
    no_successes = matching_labels[matching_labels == 1].shape[0]
    no_trials = matching_labels.shape[0]
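    # stats.binom_test was removed in SciPy 1.12; binomtest (SciPy >= 1.7)
    # replaces it but requires n >= 1, so return 1.0 when nothing matched.
    if no_trials == 0:
        return 1.0
    pvalue = stats.binomtest(k=no_successes, n=no_trials, p=0.5).pvalue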
    return pvalue

def print_matching_statistics(data, column_name):
    '''Print matching statistic.
    Parameters:
    - data: dataframe; data with the binary column 'label',
    - column_name: string; the name of the column to be checked.
    '''
    
    matching_labels = data.label[data[column_name].notnull()]
    
    score = matching_labels.mean()
    print('The mean of labels of matching tweets = %.2f' % score)
    
    rate = matching_labels.shape[0] / data.shape[0]
    print('The rate of matching tweets = %.3f' % rate)
    
    pvalue = compute_binom_pvalue(data, column_name)
    print('p-value = %.3f' % pvalue)
    

def find_all_matches(X, y, regex):
    '''Find all matches of regex in a tweet and print matching statistics.
    Parameters:
    - X: array-like; list of tweets,
    - y: array-like; list of labels,
    - regex: string; regular expression.
    Returns:
    - data: dataframe; it has three columns: content, label, list of matches.
    '''
    
    data = pd.DataFrame(np.c_[X, y], columns=['content', 'label'])
    # re.findall returns [] when nothing matches; store None instead so notnull() flags matches
    data[regex] = data.content.apply(lambda doc: re.findall(regex, doc) or None)
    
    return data
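
The p-value reported by `compute_binom_pvalue` can be reproduced by hand, which makes it clearer what the test measures. The sketch below is a standard-library approximation of an exact two-sided binomial test, not the SciPy implementation itself: it sums the probabilities of every outcome that is no more likely than the observed one. The function names `binom_pmf` and `two_sided_binom_pvalue` are illustrative and do not appear in the notebook.

```python
from math import exp, lgamma, log


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n trials, for 0 < p < 1."""
    # Work in log space so large n does not overflow the binomial coefficient.
    log_coeff = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_coeff + k * log(p) + (n - k) * log(1 - p))


def two_sided_binom_pvalue(successes, trials, p=0.5):
    """Sum the probabilities of all outcomes no more likely than the observed one."""
    if trials == 0:
        return 1.0  # nothing matched, so there is no evidence against the null
    observed = binom_pmf(successes, trials, p)
    # The small relative tolerance keeps floating-point ties on the right side.
    total = sum(
        q
        for q in (binom_pmf(k, trials, p) for k in range(trials + 1))
        if q <= observed * (1 + 1e-7)
    )
    return min(1.0, total)


# 0 aggressive labels out of 10 matching tweets: approximately 2 / 1024
print(two_sided_binom_pvalue(0, 10))
```

The null hypothesis here is the notebook's own choice of `p=0.5`; a different reference rate could be passed through the `p` argument.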

Runs of three or more exclamation marks carry no signal

In [7]:
regex = '!{3,}'
data_extended = find_all_matches(X_train, y_train, regex)
print_matching_statistics(data_extended, regex)
data_extended.loc[data_extended[regex].notnull()].head()

The mean of labels of matching tweets = 0.50
The rate of matching tweets = 0.034
p-value = 0.962


Unnamed: 0,content,label,"!{3,}"
8,i hate hate hate it with a passion!! lol...i k...,1,[!!!]
21,P33T WIFF HIS BIG GAY TEEFS!!! BRYCE FROM TRS ...,1,[!!!]
34,she got yo ass spooked!!!,0,[!!!]
38,Im doing good thankyuh!!!Heyyy hbu?!,0,[!!!]
49,@mekdot yo...heroes sucks!!! I loved season 1 ...,1,[!!!]


Question marks are associated with non-aggressive tweets

In [8]:
regex = '\\?{1,}'
data_extended = find_all_matches(X_train, y_train, regex)
print_matching_statistics(data_extended, regex)
data_extended.loc[data_extended[regex].notnull()].head()

The mean of labels of matching tweets = 0.26
The rate of matching tweets = 0.245
p-value = 0.000


Unnamed: 0,content,label,"\?{1,}"
6,You coming to work today E? I know it's Gay D...,1,[?]
13,I'm late to the emo-ball.. I'm not even dresse...,1,[?]
19,what do you usually wear? like to school?,0,"[?, ?]"
24,Is someone on your mind right now?,0,[?]
38,Im doing good thankyuh!!!Heyyy hbu?!,0,[?]


Runs of three or more dots carry no signal

In [9]:
regex = '\\.{3,}'
data_extended = find_all_matches(X_train, y_train, regex)
print_matching_statistics(data_extended, regex)
data_extended.loc[data_extended[regex].notnull()].head()

The mean of labels of matching tweets = 0.49
The rate of matching tweets = 0.128
p-value = 0.287


Unnamed: 0,content,label,"\.{3,}"
8,i hate hate hate it with a passion!! lol...i k...,1,[...]
13,I'm late to the emo-ball.. I'm not even dresse...,1,[...]
27,Girl...you good. Lol at hearing your own name....,0,"[..., ...]"
46,I'm more into shows like Dave Chapelle...I fuc...,1,[...]
49,@mekdot yo...heroes sucks!!! I loved season 1 ...,1,[...]


Emoticons (especially happy faces) are associated with non-aggressive tweets

In [10]:
regex = r"[:;=8x]'?-?[)D\]*3/(x#|\[Pp{]"  # raw string avoids invalid '\]' and '\[' escapes
data_extended = find_all_matches(X_train, y_train, regex)
print_matching_statistics(data_extended, regex)
data_extended.loc[data_extended[regex].notnull()].head()

The mean of labels of matching tweets = 0.35
The rate of matching tweets = 0.149
p-value = 0.000


Unnamed: 0,content,label,[:;=8x]'?-?[)D\]*3/(x#|\[Pp{]
2,Hey don't knock the Port+OJ until you've trie...,1,[;)]
3,Famous! Duhh Haha that means I would be rich...,0,[;D]
11,haha yeah okay. Sounds good :] Sleep well.,0,[:]]
22,Sorry about the bday thing -- that sucks. :(,0,[:(]
31,LOL damn. great article though. too bad it see...,0,[:-)]


In [11]:
# happy face
regex = r'[:;=8x]-?[)D\]*]'  # raw string avoids the invalid '\]' escape
data_extended = find_all_matches(X_train, y_train, regex)
print_matching_statistics(data_extended, regex)
data_extended.loc[data_extended[regex].notnull()].head()

The mean of labels of matching tweets = 0.27
The rate of matching tweets = 0.090
p-value = 0.000


Unnamed: 0,content,label,[:;=8x]-?[)D\]*]
2,Hey don't knock the Port+OJ until you've trie...,1,[;)]
3,Famous! Duhh Haha that means I would be rich...,0,[;D]
11,haha yeah okay. Sounds good :] Sleep well.,0,[:]]
31,LOL damn. great article though. too bad it see...,0,[:-)]
61,yo yo. so i emaled you about the 30th you stu...,0,[:-)]


In [12]:
# sad face
regex = r"[:;=8x]'?-?[/(x#|\[{]"  # raw string avoids the invalid '\[' escape
data_extended = find_all_matches(X_train, y_train, regex)
print_matching_statistics(data_extended, regex)
data_extended.loc[data_extended[regex].notnull()].head()

The mean of labels of matching tweets = 0.44
The rate of matching tweets = 0.039
p-value = 0.007


Unnamed: 0,content,label,[:;=8x]'?-?[/(x#|\[{]
22,Sorry about the bday thing -- that sucks. :(,0,[:(]
36,awww man sorry to hear you got laid off damn e...,0,[:(]
72,that sucks *hugs* :(,1,[:(]
145,http://twitpic.com/xinl - I hate him so much. ...,0,[:/]
166,well i have other people following me outside ...,0,[:-(]
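
Row 145 above is matched because of the `:/` inside `http://`, not because of an emoticon. One possible way to avoid such false positives is to blank out URLs before searching. The sketch below assumes a simple `https?://\S+` URL shape; the `URL` pattern and the helper `find_sad_faces` are illustrative and not part of the notebook, and the sample tweet is only the visible prefix of row 145.

```python
import re

SAD_FACE = r"[:;=8x]'?-?[/(x#|\[{]"
URL = r"https?://\S+"  # assumed URL shape; real tweets may need a broader pattern


def find_sad_faces(tweet):
    """Return sad-face matches after replacing URLs with a space."""
    return re.findall(SAD_FACE, re.sub(URL, " ", tweet))


tweet = "http://twitpic.com/xinl - I hate him so much."
print(re.findall(SAD_FACE, tweet))  # the ':/' of 'http://' is matched
print(find_sad_faces(tweet))        # no match once the URL is removed
```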


In [13]:
# heart
regex = '<3'
data_extended = find_all_matches(X_train, y_train, regex)
print_matching_statistics(data_extended, regex)
data_extended.loc[data_extended[regex].notnull()].head()

The mean of labels of matching tweets = 0.04
The rate of matching tweets = 0.009
p-value = 0.000


Unnamed: 0,content,label,<3
45,Twitter Family <333,0,[<3]
89,i lovee joiee<3,0,[<3]
140,uhm i think i love me some freaking joie thas...,0,[<3]
464,Whosz that niggaaa ? Lol yes i do <3,0,[<3]
497,I am done spamming for right now. I am still ...,0,[<3]


### Words

In [14]:
import nltk

Words censored with asterisks carry no signal

In [15]:
regex = '\\b\\w+\\*{1,}\\w+\\b'
data_extended = find_all_matches(X_train, y_train, regex)
print_matching_statistics(data_extended, regex)
data_extended.loc[data_extended[regex].notnull()]

The mean of labels of matching tweets = 0.60
The rate of matching tweets = 0.001
p-value = 0.607


Unnamed: 0,content,label,"\b\w+\*{1,}\w+\b"
209,whwhwhwhoa just slow down. huh!? what do you ...,0,"[f*****ck, F*ck, F*ck, F*ck, F*ck, F*ck]"
471,thats what i am saying...they need to pay me.t...,1,[m*ttaf]
761,Aw f**k. That sucks.,0,[f**k]
960,lenee that bo*o ass incense you gave me smell...,1,[bo*o]
1210,A BAMF is a bad-ass mother f*cker.,1,[f*cker]
2250,oooooooohhhhhhh sh*t! Damn where all the white...,1,[sh*t]
2504,my bleeping employment agency f***ed up my pay...,1,[f***ed]
2526,lenee that bo*o ass incense you gave me smell...,1,[bo*o]
3775,LMAO......... U starting SH*T already... Ha! I...,0,[SH*T]
4252,A BAMF is a bad-ass mother f*cker.,1,[f*cker]


Uppercase words carry no signal

In [16]:
regex = '\\b[A-Z]{2,}\\b'
data_extended = find_all_matches(X_train, y_train, regex)
print_matching_statistics(data_extended, regex)
data_extended.loc[data_extended[regex].notnull()].head()

The mean of labels of matching tweets = 0.52
The rate of matching tweets = 0.168
p-value = 0.167


Unnamed: 0,content,label,"\b[A-Z]{2,}\b"
2,Hey don't knock the Port+OJ until you've trie...,1,[OJ]
8,i hate hate hate it with a passion!! lol...i k...,1,[LOL]
12,your LAME ass . oviously iloveyouhh moree. (;,1,[LAME]
13,I'm late to the emo-ball.. I'm not even dresse...,1,[WHY]
21,P33T WIFF HIS BIG GAY TEEFS!!! BRYCE FROM TRS ...,1,"[WIFF, HIS, BIG, GAY, TEEFS, BRYCE, FROM, TRS,..."


In [17]:
# tweets without any lowercase letter
regex = '^[^a-z]+$'
data_extended = find_all_matches(X_train, y_train, regex)
print_matching_statistics(data_extended, regex)
data_extended.loc[data_extended[regex].notnull()].head()

The mean of labels of matching tweets = 0.52
The rate of matching tweets = 0.020
p-value = 0.616


Unnamed: 0,content,label,^[^a-z]+$
21,P33T WIFF HIS BIG GAY TEEFS!!! BRYCE FROM TRS ...,1,[P33T WIFF HIS BIG GAY TEEFS!!! BRYCE FROM TRS...
50,THANKYOU! HAHA,0,[ THANKYOU! HAHA]
91,I STILL HATE YOU.,1,[I STILL HATE YOU.]
134,:),0,[ :)]
153,SARCASTIC. :],0,[ SARCASTIC. :]]


Repeated letters (four or more in a row) are associated with non-aggressive tweets

In [18]:
regex = '(([a-zA-Z])\\2{3,})'
data_extended = find_all_matches(X_train, y_train, regex)
print_matching_statistics(data_extended, regex)
data_extended.loc[data_extended[regex].notnull()].head()

The mean of labels of matching tweets = 0.40
The rate of matching tweets = 0.027
p-value = 0.000


Unnamed: 0,content,label,"(([a-zA-Z])\2{3,})"
32,nigga u geigh lmao! fuck yo finals beeeeeitch,1,"[(eeeee, e)]"
64,mr big dick daddy 4rm cincinnati....ooooowwww...,0,"[(ooooo, o), (wwwww, w)]"
205,Hmmmm,0,"[(mmmm, m)]"
287,http://twitpic.com/qtt0 - O.O bingo is waaaaay...,1,"[(aaaaa, a)]"
307,That Sweeeet fukn ASS!,0,"[(eeee, e)]"
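
The matches above are tuples because the pattern has two capture groups and `re.findall` returns one tuple per match in that case. If only the full run of letters is wanted, a sketch along these lines could be used; the helper name `repeated_runs` is illustrative and not part of the notebook.

```python
import re

REPEATED = r"(([a-zA-Z])\2{3,})"


def repeated_runs(tweet):
    """Return only the full runs (group 1), dropping the single-letter group 2."""
    return [match.group(1) for match in re.finditer(REPEATED, tweet)]


print(re.findall(REPEATED, "Hmmmm"))            # [('mmmm', 'm')]
print(repeated_runs("That Sweeeet fukn ASS!"))  # ['eeee']
```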


Stopwords

In [19]:
STOPWORDS = nltk.corpus.stopwords.words('english')
print(STOPWORDS)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [20]:
X_train_lower = X_train.apply(str.lower)
IRRELEVANT_STOPWORDS = []
for stopword in np.sort(STOPWORDS):
    print(stopword)
    # the stopword is used as a plain regex, so it also matches inside longer words
    data_extended = find_all_matches(X_train_lower, y_train, stopword)
    pvalue = compute_binom_pvalue(data_extended, stopword)
    print_matching_statistics(data_extended, stopword)
    # keep stopwords whose label split is not significantly different from 50/50
    if pvalue >= 0.01:
        IRRELEVANT_STOPWORDS.append(stopword)

a
The mean of labels of matching tweets = 0.41
The rate of matching tweets = 0.921
p-value = 0.000
about
The mean of labels of matching tweets = 0.36
The rate of matching tweets = 0.036
p-value = 0.000
above
The mean of labels of matching tweets = 0.78
The rate of matching tweets = 0.001
p-value = 0.180
after
The mean of labels of matching tweets = 0.47
The rate of matching tweets = 0.007
p-value = 0.602
again
The mean of labels of matching tweets = 0.39
The rate of matching tweets = 0.009
p-value = 0.018
against
The mean of labels of matching tweets = 0.44
The rate of matching tweets = 0.002
p-value = 0.690
ain
The mean of labels of matching tweets = 0.43
The rate of matching tweets = 0.034
p-value = 0.006
all
The mean of labels of matching tweets = 0.41
The rate of matching tweets = 0.127
p-value = 0.000
am
The mean of labels of matching tweets = 0.44
The rate of matching tweets = 0.221
p-value = 0.000
an
The mean of labels of matching tweets = 0.41
The rate of matching tweets = 0.39

The mean of labels of matching tweets = 0.37
The rate of matching tweets = 0.026
p-value = 0.000
most
The mean of labels of matching tweets = 0.31
The rate of matching tweets = 0.015
p-value = 0.000
mustn
The mean of labels of matching tweets = nan
The rate of matching tweets = 0.000
p-value = 1.000
mustn't
The mean of labels of matching tweets = nan
The rate of matching tweets = 0.000
p-value = 1.000
my
The mean of labels of matching tweets = 0.43
The rate of matching tweets = 0.120
p-value = 0.000
myself
The mean of labels of matching tweets = 0.49
The rate of matching tweets = 0.006
p-value = 0.908
needn
The mean of labels of matching tweets = nan
The rate of matching tweets = 0.000
p-value = 1.000
needn't
The mean of labels of matching tweets = nan
The rate of matching tweets = 0.000
p-value = 1.000
no
The mean of labels of matching tweets = 0.38
The rate of matching tweets = 0.220
p-value = 0.000
nor
The mean of labels of matching tweets = 0.45
The rate of matching tweets = 0.006


The mean of labels of matching tweets = 0.60
The rate of matching tweets = 0.015
p-value = 0.005
you've
The mean of labels of matching tweets = 0.42
The rate of matching tweets = 0.002
p-value = 0.541
your
The mean of labels of matching tweets = 0.40
The rate of matching tweets = 0.087
p-value = 0.000
yours
The mean of labels of matching tweets = 0.46
The rate of matching tweets = 0.006
p-value = 0.556
yourself
The mean of labels of matching tweets = 0.44
The rate of matching tweets = 0.004
p-value = 0.488
yourselves
The mean of labels of matching tweets = 0.00
The rate of matching tweets = 0.000
p-value = 1.000
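
The matching rate of 0.921 for `a` shows that a bare stopword used as a regex matches the letters anywhere in a tweet, not only the word itself. If whole-word matching is intended, one option is to wrap each stopword in word boundaries. This is a sketch under that assumption: the helper `as_whole_word` is illustrative and not part of the notebook, and the statistics above were not computed with it.

```python
import re


def as_whole_word(stopword):
    """Build a regex that matches the stopword only as a separate word."""
    # re.escape keeps any regex metacharacters in the stopword literal.
    return r"\b" + re.escape(stopword) + r"\b"


print(re.findall("a", "that sucks"))                 # ['a'], matched inside 'that'
print(re.findall(as_whole_word("a"), "that sucks"))  # []
```

The resulting pattern could then be passed to `find_all_matches` in place of the bare stopword.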


In [21]:
print(IRRELEVANT_STOPWORDS)

['above', 'after', 'again', 'against', 'aren', "aren't", 'because', 'being', 'below', 'between', 'both', 'couldn', "couldn't", 'didn', "didn't", "doesn't", 'doing', "don't", 'down', 'each', 'few', 'further', 'hadn', "hadn't", 'hasn', "hasn't", "haven't", 'having', 'herself', 'him', 'himself', 'his', 'into', 'isn', "isn't", 'itself', 'mightn', "mightn't", 'mustn', "mustn't", 'myself', 'needn', "needn't", 'nor', 'off', 'ourselves', 'over', 'same', 'shan', "shan't", 'she', "she's", 'should', "should've", 'shouldn', "shouldn't", 'such', "that'll", 'theirs', 'them', 'themselves', 'then', 'these', 'they', 'those', 'through', 'until', 'up', 'wasn', "wasn't", 'weren', "weren't", 'while', 'whom', 'will', 'won', "won't", 'wouldn', "wouldn't", "you'd", "you'll", "you've", 'yours', 'yourself', 'yourselves']


In [22]:
len(STOPWORDS), len(IRRELEVANT_STOPWORDS)

(179, 85)

In [23]:
df = pd.DataFrame(IRRELEVANT_STOPWORDS)
df.to_csv('./Data/irrelevant_stopwords.csv', index=False, header=False)