## Problem: Detection of aggressive tweets

The training dataset has 12800 tweets (in English) and the validation dataset has 3200 tweets.<br/>
Tweets are labeled by humans as:
* 1 (Cyber-Aggressive; 9714 items)
* 0 (Non Cyber-Aggressive; 6286 items)

# Tweet Analysis

### Data

In [1]:
import pandas as pd
import numpy as np

In [2]:
data_train = pd.read_json('./Data/train.json')

In [3]:
X_train = data_train.content
y_train = data_train.label

In [4]:
print('Training data: ', 'class 1 contribution = %.2f' % y_train.mean(), 
      '# = %s' % X_train.shape[0], sep='\n')

Training data: 
class 1 contribution = 0.39
# = 12800


Only the training set is analysed in this notebook.

### Punctuation

In [5]:
import string
import re
from scipy import stats
import nltk

In [6]:
def compute_binom_pvalue(data, column_name):
    '''Compute p-value for two-sided test of the null hypothesis that 
    the probability of occurrence of an aggressive tweet is 0.39.
    Parameters:
    - data: dataframe 
        data with the binary column 'label'
    - column_name: string
        the name of the column to be checked
    Returns:
    - pvalue: float
        the p-value
    '''
    
    matching_labels = data.label[data[column_name].notnull()]
    no_successes = matching_labels[matching_labels == 1].shape[0]
    no_trials = matching_labels.shape[0]
    # binom_test was removed in SciPy 1.12; binomtest is available since SciPy 1.7
    pvalue = stats.binomtest(k=no_successes, n=no_trials, p=0.39).pvalue
    return pvalue

def print_matching_statistics(data, column_name):
    '''Print matching statistics.
    Parameters:
    - data: dataframe 
        data with the binary column 'label'
    - column_name: string 
        the name of the column to be checked
    '''
    
    matching_labels = data.label[data[column_name].notnull()]
    
    score = matching_labels.mean()
    print('The mean of labels of matching tweets = %.2f' % score)
    
    rate = matching_labels.shape[0] / data.shape[0]
    print('The rate of matching tweets = %.3f' % rate)
    
    pvalue = compute_binom_pvalue(data, column_name)
    print('p-value = %.3f' % pvalue)
    

def find_all_matches(X, y, regex):
    '''Find all matches of regex in a tweet and print matching statistics.
    Parameters:
    - X: array-like
        list of tweets
    - y: array-like 
        list of labels
    - regex: string 
        regular expression
    Returns:
    - data: dataframe 
        it has three columns: content, label, list of matches
    '''
    
    data = pd.DataFrame(np.c_[X, y], columns=['content', 'label'])
    # An empty match list is falsy, so tweets without matches get None
    data[regex] = data.content.apply(lambda doc: re.findall(regex, doc) or None)
    
    return data
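
To make the hypothesis test behind `compute_binom_pvalue` concrete, here is a minimal standard-library sketch of a two-sided exact binomial test. It assumes the common convention of summing the probabilities of all outcomes that are no more likely than the observed one. The helper names `binom_logpmf` and `two_sided_binom_pvalue` are illustrative and not part of the notebook, the example counts are hypothetical, and the result is not guaranteed to agree with SciPy to the last digit.

```python
from math import exp, lgamma, log


def binom_logpmf(k, n, p):
    """Log-probability of k successes in n trials (requires 0 < p < 1)."""
    log_coefficient = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return log_coefficient + k * log(p) + (n - k) * log(1 - p)


def two_sided_binom_pvalue(k, n, p):
    """Sum the probabilities of all outcomes at most as likely as k."""
    observed = binom_logpmf(k, n, p)
    tolerance = 1e-7  # guards against floating-point ties
    total = sum(exp(binom_logpmf(i, n, p))
                for i in range(n + 1)
                if binom_logpmf(i, n, p) <= observed + tolerance)
    return min(1.0, total)


# Hypothetical example: 48 aggressive tweets among 100 matching tweets,
# tested against the 0.39 baseline rate of the training set.
print('p-value = %.3f' % two_sided_binom_pvalue(48, 100, 0.39))
```

Working in log space keeps the computation stable when the number of matching tweets is in the thousands.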

In [7]:
def compute_matching_words_rate(regex, X):
    '''Compute the ratio of matching words to all words in a tweet.
    Parameters:
    - regex: string
        regular expression
    - X: array-like
        list of tweets
    Returns:
    - matching_words_rate: array
        the ratio of matching words to all words in a tweet
    '''
    
    matching_words_count = np.array([len(re.findall(regex, doc)) for doc in X])

    # Count words in the tweets passed as X (not in the global X_train)
    words_count = np.array([len(list(filter(None, re.split('[^\\w\'*]', doc)))) for doc in X])
    words_count[words_count == 0] = 1

    matching_words_rate = matching_words_count / words_count
    return matching_words_rate
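
The word-counting rule used above can be sketched for a single tweet with the standard library alone. The snippet assumes the same splitting pattern (anything that is not a word character, apostrophe or asterisk separates words); `count_words` and `matching_rate` are hypothetical helper names, and the second example string is taken from the training data shown later.

```python
import re


def count_words(doc):
    """Count tokens made of word characters, apostrophes and asterisks."""
    tokens = [token for token in re.split(r"[^\w'*]", doc) if token]
    return len(tokens)


def matching_rate(regex, doc):
    """Ratio of regex matches to words; an empty tweet counts as one word."""
    return len(re.findall(regex, doc)) / max(count_words(doc), 1)


print(count_words("Damn!! I don't f*ck"))                            # 4
print(matching_rate(r'\b[A-Z]{2,}\b', 'I STILL HATE YOU.'))          # 0.75
```

Treating an empty tweet as one word mirrors the `words_count[words_count == 0] = 1` guard and avoids division by zero.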

In [8]:
print(np.random.choice(X_train.values, 15))

[ 'Damn you  I ended up ordering the damn Galactus MM  I hope you are happy. When my wife bitches at me  I am blaming you.'
 'could you imagine John F Kennedy or Churchill or Abraham Lincoln shouting "yaa boo sucks you \'kin wankaz" at the opposition ?'
 'so there were like 11 Five Guys in Orlando...we never went there once...GAY!!'
 "Damn straight. I just don't understand it. I was born thinking like this. I'd like to believe everyone else was  too... we'll see."
 " lol  but what was the rumor? Gah I feel like I'm back in High School  I wanna fuck up whoever says shit about Dave  he's kickass"
 'Coke Zero should absolutely not exist. Cherry Coke Zero is alright  but I still say fuck it.'
 ' Who are your Twitter besties?' "  Yes ma'am :] thank you lady."
 ' If you could bring one character to life from your favorite book  who would it be?'
 '  First.'
 "YOU STUPID WHORE I HOPE YOU ARE IN YOUR HOUSE WHEN I BURN DOWN YOUR HOUSE YOU STUPID WHORE WHY WOULDN'T YOU GO TO PROM WITH ME"
 'Exac

In [9]:
PUNCTUATION = string.punctuation
print(PUNCTUATION)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


Tweets containing exclamation marks are more often aggressive than the 0.39 baseline.

In [10]:
regex = '(?<![?!])!(?!!)'
data_extended = find_all_matches(X_train, y_train, regex)
print_matching_statistics(data_extended, regex)
data_extended.loc[data_extended[regex].notnull()].head()

The mean of labels of matching tweets = 0.48
The rate of matching tweets = 0.216
p-value = 0.000


Unnamed: 0,content,label,(?<![?!])!(?!!)
0,Heh! Parcells is a bad-ass who knows how to he...,0,[!]
3,Famous! Duhh Haha that means I would be rich...,0,"[!, !]"
10,Tears for fears! Oh well time for my third He...,0,[!]
20,Damn!! I was hoping you were giving me good ne...,1,[!]
27,Girl...you good. Lol at hearing your own name....,0,[!]


In [11]:
regex = '(?<![?!])!{2}(?!!)'
data_extended = find_all_matches(X_train, y_train, regex)
print_matching_statistics(data_extended, regex)
data_extended.loc[data_extended[regex].notnull()].head()

The mean of labels of matching tweets = 0.39
The rate of matching tweets = 0.026
p-value = 0.910


Unnamed: 0,content,label,(?<![?!])!{2}(?!!)
8,i hate hate hate it with a passion!! lol...i k...,1,[!!]
20,Damn!! I was hoping you were giving me good ne...,1,[!!]
49,@mekdot yo...heroes sucks!!! I loved season 1 ...,1,[!!]
232,and holy crap that's a long time!! D: that su...,1,[!!]
237,Yumm... Justin Bieber P-diddys son JDoir YUM...,0,"[!!, !!]"


In [12]:
regex = '(?<!\\?)!{2,}'
data_extended = find_all_matches(X_train, y_train, regex)
print_matching_statistics(data_extended, regex)
data_extended.loc[data_extended[regex].notnull()].head()

The mean of labels of matching tweets = 0.45
The rate of matching tweets = 0.057
p-value = 0.001


Unnamed: 0,content,label,"(?<!\?)!{2,}"
8,i hate hate hate it with a passion!! lol...i k...,1,"[!!, !!!]"
20,Damn!! I was hoping you were giving me good ne...,1,[!!]
21,P33T WIFF HIS BIG GAY TEEFS!!! BRYCE FROM TRS ...,1,[!!!]
34,she got yo ass spooked!!!,0,[!!!]
38,Im doing good thankyuh!!!Heyyy hbu?!,0,[!!!]
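
The three exclamation-mark patterns above differ only in their lookbehind and lookahead assertions. The following sketch, run on short made-up strings, shows which runs each pattern accepts; the constant names are illustrative.

```python
import re

SINGLE = r'(?<![?!])!(?!!)'          # a lone "!" with no "?" or "!" before it and no "!" after it
EXACTLY_TWO = r'(?<![?!])!{2}(?!!)'  # a run of exactly two "!"
TWO_OR_MORE = r'(?<!\?)!{2,}'        # a run of two or more "!" not preceded by "?"

for text in ['Heh! ok', 'Damn!!', 'sucks!!!', 'What?!']:
    print(repr(text),
          re.findall(SINGLE, text),
          re.findall(EXACTLY_TWO, text),
          re.findall(TWO_OR_MORE, text))
```

A tweet can still match several patterns at once when it contains runs of different lengths, which is why row 20 appears in more than one of the tables above.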


Tweets containing a single question mark are more often non-aggressive.

In [13]:
regex = '(?<!\\?)\\?(?![?!])'
data_extended = find_all_matches(X_train, y_train, regex)
print_matching_statistics(data_extended, regex)
data_extended.loc[data_extended[regex].notnull()].head()

The mean of labels of matching tweets = 0.25
The rate of matching tweets = 0.227
p-value = 0.000


Unnamed: 0,content,label,(?<!\?)\?(?![?!])
6,You coming to work today E? I know it's Gay D...,1,[?]
13,I'm late to the emo-ball.. I'm not even dresse...,1,[?]
19,what do you usually wear? like to school?,0,"[?, ?]"
24,Is someone on your mind right now?,0,[?]
41,What do you think happens to the missing sock...,0,[?]


In [14]:
regex = '(?<!\\?)\\?{2}(?![?!])'
data_extended = find_all_matches(X_train, y_train, regex)
print_matching_statistics(data_extended, regex)
data_extended.loc[data_extended[regex].notnull()].head()

The mean of labels of matching tweets = 0.38
The rate of matching tweets = 0.008
p-value = 0.837


Unnamed: 0,content,label,(?<!\?)\?{2}(?![?!])
215,Has it snowed where you are?? I miss the damn ...,1,[??]
269,hmmm...I had a feeling bout u! but Damn No Lab...,1,[??]
362,cum cloape??,0,[??]
430,How is the Loser episode? Lotsa crying??,1,[??]
635,yeah I did that a few months ago. How's your ...,1,[??]


In [15]:
regex = '\\?{2,}(?!!)'
data_extended = find_all_matches(X_train, y_train, regex)
print_matching_statistics(data_extended, regex)
data_extended.loc[data_extended[regex].notnull()].head()

The mean of labels of matching tweets = 0.38
The rate of matching tweets = 0.014
p-value = 0.700


Unnamed: 0,content,label,"\?{2,}(?!!)"
172,what is the thing that annoyes you most???,0,[???]
215,Has it snowed where you are?? I miss the damn ...,1,[??]
269,hmmm...I had a feeling bout u! but Damn No Lab...,1,[??]
362,cum cloape??,0,[??]
430,How is the Loser episode? Lotsa crying??,1,[??]


The combination ?! shows no significant association with either class.

In [16]:
regex = '\\?+!+'
data_extended = find_all_matches(X_train, y_train, regex)
print_matching_statistics(data_extended, regex)
data_extended.loc[data_extended[regex].notnull()].head()

The mean of labels of matching tweets = 0.43
The rate of matching tweets = 0.009
p-value = 0.347


Unnamed: 0,content,label,\?+!+
38,Im doing good thankyuh!!!Heyyy hbu?!,0,[?!]
194,SHUT THE FUCK UP! What?! Why does Micky Rouke ...,1,[?!]
526,OMG! really?! cool thats so awesome! yeah i ...,0,[?!]
589,WHAT THE FUCK?! WHY?! Do they have fucking bra...,1,"[?!, ?!, ?!]"
640,burritoville is dead? What the fuck?!,1,[?!]


Tweets containing dots are more often aggressive.

In [17]:
regex = '(?<!\\.)\\.(?!\\.)'
data_extended = find_all_matches(X_train, y_train, regex)
print_matching_statistics(data_extended, regex)
data_extended.loc[data_extended[regex].notnull()].head()

The mean of labels of matching tweets = 0.44
The rate of matching tweets = 0.472
p-value = 0.000


Unnamed: 0,content,label,(?<!\.)\.(?!\.)
0,Heh! Parcells is a bad-ass who knows how to he...,0,"[., .]"
1,Then I would be gay and I would kill myself. D...,1,"[., .]"
2,Hey don't knock the Port+OJ until you've trie...,1,"[., .]"
4,Not yet the woman was wanting to escape but ...,0,[.]
5,damn dude. that is spot on (wrt fb v twitter),0,[.]


In [18]:
regex = '(?<!\\.)\\.{2}(?!\\.)'
data_extended = find_all_matches(X_train, y_train, regex)
print_matching_statistics(data_extended, regex)
data_extended.loc[data_extended[regex].notnull()].head()

The mean of labels of matching tweets = 0.37
The rate of matching tweets = 0.039
p-value = 0.438


Unnamed: 0,content,label,(?<!\.)\.{2}(?!\.)
13,I'm late to the emo-ball.. I'm not even dresse...,1,[..]
18,Lol I say gay all the time in the gay way and ...,0,"[.., ..]"
46,I'm more into shows like Dave Chapelle...I fuc...,1,[..]
74,you got that right! People stealing others' ge...,1,[..]
78,Do you have a specific work-out routine?..if ...,0,[..]


In [19]:
regex = '\\.{2,}'
data_extended = find_all_matches(X_train, y_train, regex)
print_matching_statistics(data_extended, regex)
data_extended.loc[data_extended[regex].notnull()].head()

The mean of labels of matching tweets = 0.46
The rate of matching tweets = 0.158
p-value = 0.000


Unnamed: 0,content,label,"\.{2,}"
8,i hate hate hate it with a passion!! lol...i k...,1,[...]
13,I'm late to the emo-ball.. I'm not even dresse...,1,"[.., ...]"
18,Lol I say gay all the time in the gay way and ...,0,"[.., ..]"
27,Girl...you good. Lol at hearing your own name....,0,"[..., ...]"
46,I'm more into shows like Dave Chapelle...I fuc...,1,"[..., ..]"


Emoticons (happy faces and hearts) are associated with non-aggressive tweets.

In [20]:
regex = '[:;=8x]\'?-?[)D\\]*3/(x#|\\[Pp{]'
data_extended = find_all_matches(X_train, y_train, regex)
print_matching_statistics(data_extended, regex)
data_extended.loc[data_extended[regex].notnull()].head()

The mean of labels of matching tweets = 0.35
The rate of matching tweets = 0.149
p-value = 0.000


Unnamed: 0,content,label,[:;=8x]'?-?[)D\]*3/(x#|\[Pp{]
2,Hey don't knock the Port+OJ until you've trie...,1,[;)]
3,Famous! Duhh Haha that means I would be rich...,0,[;D]
11,haha yeah okay. Sounds good :] Sleep well.,0,[:]]
22,Sorry about the bday thing -- that sucks. :(,0,[:(]
31,LOL damn. great article though. too bad it see...,0,[:-)]


In [21]:
# happy face
regex = '[:;=8x]-?[)D\\]*]'
data_extended = find_all_matches(X_train, y_train, regex)
print_matching_statistics(data_extended, regex)
data_extended.loc[data_extended[regex].notnull()].head()

The mean of labels of matching tweets = 0.27
The rate of matching tweets = 0.090
p-value = 0.000


Unnamed: 0,content,label,[:;=8x]-?[)D\]*]
2,Hey don't knock the Port+OJ until you've trie...,1,[;)]
3,Famous! Duhh Haha that means I would be rich...,0,[;D]
11,haha yeah okay. Sounds good :] Sleep well.,0,[:]]
31,LOL damn. great article though. too bad it see...,0,[:-)]
61,yo yo. so i emaled you about the 30th you stu...,0,[:-)]


In [22]:
# rate of happy face
regex = '[:;=8x]-?[)D\\]*]'
matching_words_rate = compute_matching_words_rate(regex, X_train)
data_extended['rate'] = matching_words_rate

matching_data_extended = data_extended.loc[data_extended.iloc[:, 2].notnull()]
matching_data_extended.head()

Unnamed: 0,content,label,[:;=8x]-?[)D\]*],rate
2,Hey don't knock the Port+OJ until you've trie...,1,[;)],0.043478
3,Famous! Duhh Haha that means I would be rich...,0,[;D],0.058824
11,haha yeah okay. Sounds good :] Sleep well.,0,[:]],0.142857
31,LOL damn. great article though. too bad it see...,0,[:-)],0.071429
61,yo yo. so i emaled you about the 30th you stu...,0,[:-)],0.052632


In [23]:
print('%.3f' % matching_data_extended[matching_data_extended.rate > 0.1].label.mean())
print('%.3f' % matching_data_extended[matching_data_extended.rate < 0.1].label.mean())

0.181
0.339
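
The comparison above splits matching tweets at a rate of 0.1 using two strict inequalities, so a tweet whose rate equals 0.1 exactly lands in neither group. A minimal sketch of the same split on toy data follows; it assigns boundary cases to the lower group, which is an assumption rather than the notebook's behaviour, and `label_means_by_rate` is a hypothetical helper.

```python
def label_means_by_rate(rates, labels, threshold=0.1):
    """Mean label above the threshold and at or below it (None if a group is empty)."""
    high = [label for rate, label in zip(rates, labels) if rate > threshold]
    low = [label for rate, label in zip(rates, labels) if rate <= threshold]

    def mean(values):
        return sum(values) / len(values) if values else None

    return mean(high), mean(low)


# Hypothetical toy data: emoticon rate and label per tweet.
rates = [0.2, 0.15, 0.3, 0.05, 0.1]
labels = [0, 0, 1, 1, 0]
print(label_means_by_rate(rates, labels))
```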


In [24]:
regex = '[:;=8x]-?[pP]'
data_extended = find_all_matches(X_train, y_train, regex)
print_matching_statistics(data_extended, regex)
data_extended.loc[data_extended[regex].notnull()].head()

The mean of labels of matching tweets = 0.41
The rate of matching tweets = 0.020
p-value = 0.478


Unnamed: 0,content,label,[:;=8x]-?[pP]
185,crutches suck big time. i hate them lol. hope ...,1,[:P]
216,Once I reach law school I may call on you for ...,0,[xp]
217,maybe :P,0,[:P]
259,hour meetings foe a job you're failing at bec...,1,[xp]
306,Lulz. OK. Internet Explorer sucks.,1,[xp]
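
The last row above suggests a false positive: the only `xp` in "Internet Explorer sucks." sits inside the word "Explorer". One possible refinement, offered as an assumption and not as the notebook's method, is to require that the emoticon does not touch a word character on either side.

```python
import re

LOOSE = r'[:;=8x]-?[pP]'
# Assumed refinement: the emoticon must not be adjacent to a word character.
STRICT = r'(?<!\w)[:;=8x]-?[pP](?!\w)'

for text in ['Internet Explorer sucks.', 'maybe :P']:
    print(repr(text), re.findall(LOOSE, text), re.findall(STRICT, text))
```

The stricter pattern would also miss emoticons glued to a word (for example `lol:P`), so the trade-off would need to be checked against the data before adopting it.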


In [25]:
# sad face
regex = '[:;=8x]\'?-?[/(x#|\\[{]'
data_extended = find_all_matches(X_train, y_train, regex)
print_matching_statistics(data_extended, regex)
data_extended.loc[data_extended[regex].notnull()].head()

The mean of labels of matching tweets = 0.44
The rate of matching tweets = 0.039
p-value = 0.025


Unnamed: 0,content,label,[:;=8x]'?-?[/(x#|\[{]
22,Sorry about the bday thing -- that sucks. :(,0,[:(]
36,awww man sorry to hear you got laid off damn e...,0,[:(]
72,that sucks *hugs* :(,1,[:(]
145,http://twitpic.com/xinl - I hate him so much. ...,0,[:/]
166,well i have other people following me outside ...,0,[:-(]


In [26]:
# heart
regex = '<3+|&lt;3'
data_extended = find_all_matches(X_train, y_train, regex)
print_matching_statistics(data_extended, regex)
data_extended.loc[data_extended[regex].notnull()].head()

The mean of labels of matching tweets = 0.24
The rate of matching tweets = 0.013
p-value = 0.000


Unnamed: 0,content,label,<3+|&lt;3
45,Twitter Family <333,0,[<333]
89,i lovee joiee<3,0,[<3]
140,uhm i think i love me some freaking joie thas...,0,[<3]
394,&lt;3 ::huggles:: nyaaaa~ its snowing T_T i ha...,0,[&lt;3]
464,Whosz that niggaaa ? Lol yes i do <3,0,[<3]


Quotation marks are associated with aggressive tweets.

In [27]:
regex = '("|&quot;).+("|&quot;)'
data_extended = find_all_matches(X_train, y_train, regex)
print_matching_statistics(data_extended, regex)
data_extended.loc[data_extended[regex].notnull()].head()

The mean of labels of matching tweets = 0.49
The rate of matching tweets = 0.039
p-value = 0.000


Unnamed: 0,content,label,"(""|&quot;).+(""|&quot;)"
66,have you ever had a &quot;Summer Love&quot;?,0,"[(&quot;, &quot;)]"
70,"i call it ""hey-dumb-fuck-guys-stop-trying-so-h...",1,"[("", "")]"
108,"Trying. Fuck. It was due ""yesterday"" and that ...",0,"[("", "")]"
132,What's the big deal? I do that all the time. O...,0,"[("", "")]"
138,"""hey've just proved how fucking irrelevant the...",1,"[("", "")]"


In [28]:
regex = '("|&quot;)'
data_extended = find_all_matches(X_train, y_train, regex)
print_matching_statistics(data_extended, regex)
data_extended.loc[data_extended[regex].notnull()].head()

The mean of labels of matching tweets = 0.49
The rate of matching tweets = 0.042
p-value = 0.000


Unnamed: 0,content,label,"(""|&quot;)"
30,i don't know how to pronounce it but it sound...,1,"[""]"
66,have you ever had a &quot;Summer Love&quot;?,0,"[&quot;, &quot;]"
70,"i call it ""hey-dumb-fuck-guys-stop-trying-so-h...",1,"["", ""]"
108,"Trying. Fuck. It was due ""yesterday"" and that ...",0,"["", ""]"
132,What's the big deal? I do that all the time. O...,0,"["", ""]"


In [29]:
nltk.word_tokenize('"test"')

['``', 'test', "''"]

Hash marks (#) are associated with aggressive tweets.

In [30]:
regex = '(?<!&)#'
data_extended = find_all_matches(X_train, y_train, regex)
print_matching_statistics(data_extended, regex)
data_extended.loc[data_extended[regex].notnull()].head()

The mean of labels of matching tweets = 0.57
The rate of matching tweets = 0.008
p-value = 0.000


Unnamed: 0,content,label,(?<!&)#
173,I nominate @PhillyD for a Shorty Award in #ent...,1,[#]
194,SHUT THE FUCK UP! What?! Why does Micky Rouke ...,1,[#]
537,I nominate @yiyinglu for a Shorty Award in #de...,1,[#]
563,"if gov is going to make it ""uncomfortable"" to ...",1,[#]
706,#NAME?,0,[#]


At signs (@) are associated with aggressive tweets.

In [31]:
regex = '@'
data_extended = find_all_matches(X_train, y_train, regex)
print_matching_statistics(data_extended, regex)
data_extended.loc[data_extended[regex].notnull()].head()

The mean of labels of matching tweets = 0.52
The rate of matching tweets = 0.031
p-value = 0.000


Unnamed: 0,content,label,@
49,@mekdot yo...heroes sucks!!! I loved season 1 ...,1,[@]
77,Y'know @QueenofSpain doesn't hate me so I do...,0,[@]
115,Amen. Thrilled to see Brad Sucks on Rock Band...,1,[@]
116,give @ffckatg a break. Her body is awash with ...,0,[@]
173,I nominate @PhillyD for a Shorty Award in #ent...,1,[@]


### HTML symbols

In [32]:
regex = '&#?\\w*?;'
data_extended = find_all_matches(X_train, y_train, regex)
print_matching_statistics(data_extended, regex)
data_extended.loc[data_extended[regex].notnull()].head()

The mean of labels of matching tweets = 0.23
The rate of matching tweets = 0.040
p-value = 0.000


Unnamed: 0,content,label,&#?\w*?;
66,have you ever had a &quot;Summer Love&quot;?,0,"[&quot;, &quot;]"
139,Can someone honestly be thinking about nothin...,0,"[&;, &;]"
161,What&;s the quickest you&;ve fell for someone?,0,"[&;, &;]"
167,Can a short person &quot;talk down&quot; to a...,0,"[&quot;, &quot;]"
211,heyyy your an ugly fucking bitch you think ...,1,"[&;, &amp;, &amp;]"


In [33]:
html_symbol_dict = {}
for html_symbols in data_extended.loc[data_extended[regex].notnull()].iloc[:, 2]:
    for html_symbol in html_symbols:
        html_symbol_dict[html_symbol] = html_symbol_dict.get(html_symbol, 0) + 1
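
The counting loop above can also be sketched with `collections.Counter`, and the standard-library function `html.unescape` shows what the entities decode to. The sample strings are adapted from the tweets shown in the outputs; decoding entities before feature extraction is a possible preprocessing step, not something the notebook does.

```python
import html
import re
from collections import Counter

tweets = [
    'have you ever had a &quot;Summer Love&quot;?',
    'What&;s the quickest you&;ve fell for someone?',
    'i lovee joiee&lt;3',
]

# Count every entity-like token, as the dictionary loop does.
entity_counts = Counter(
    entity for tweet in tweets for entity in re.findall(r'&#?\w*?;', tweet))
print(entity_counts)

# html.unescape decodes valid entities and leaves malformed ones such as "&;" untouched.
for tweet in tweets:
    print(html.unescape(tweet))
```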

In [34]:
html_symbol_dict

{'&#163;': 1,
 '&#169;': 3,
 '&#172;': 3,
 '&#174;': 3,
 '&#191;': 1,
 '&#224;': 3,
 '&#232;': 1,
 '&#233;': 5,
 '&#234;': 2,
 '&#252;': 1,
 '&#58126;': 1,
 '&#58371;': 1,
 '&#58372;': 1,
 '&#58382;': 2,
 '&#58390;': 3,
 '&#8212;': 1,
 '&#8217;': 13,
 '&#9773;': 3,
 '&#9786;': 2,
 '&#9824;': 1,
 '&#9829;': 3,
 '&#9834;': 3,
 '&#9835;': 7,
 '&;': 364,
 '&amp;': 68,
 '&apos;': 29,
 '&gt;': 38,
 '&lt;': 76,
 '&quot;': 116}

In [35]:
regex = '&#8217;'
data_extended = find_all_matches(X_train, y_train, regex)
print_matching_statistics(data_extended, regex)
data_extended.loc[data_extended[regex].notnull()].head()

The mean of labels of matching tweets = 0.50
The rate of matching tweets = 0.001
p-value = 0.526


Unnamed: 0,content,label,&#8217;
1131,nothin&#8217; i did my hair THAT WAS EXCITIN...,0,[&#8217;]
1406,Compared to Eeyore and that stupid bear and th...,1,[&#8217;]
1742,Compared to Eeyore and that stupid bear and th...,1,[&#8217;]
4243,Compared to Eeyore and that stupid bear and th...,1,[&#8217;]
5180,Football. I hate it. It&#8217;s official I&#8...,1,"[&#8217;, &#8217;]"


In [36]:
regex = '<\\w+?>'
data_extended = find_all_matches(X_train, y_train, regex)
print_matching_statistics(data_extended, regex)
data_extended.loc[data_extended[regex].notnull()].head()

The mean of labels of matching tweets = 0.00
The rate of matching tweets = 0.000
p-value = 0.286


Unnamed: 0,content,label,<\w+?>
2319,Favorite candy ? :D<br>,0,[<br>]
3525,....oats<br>,0,[<br>]
12505,Are you wearing the p<br>,0,[<br>]


### Words

Contracted negations (n't) are associated with aggressive tweets.

In [37]:
regex = '\\b\\w+n\'t\\b'
data_extended = find_all_matches(X_train, y_train, regex)
print_matching_statistics(data_extended, regex)
data_extended.loc[data_extended[regex].notnull()].head()

The mean of labels of matching tweets = 0.49
The rate of matching tweets = 0.080
p-value = 0.000


Unnamed: 0,content,label,\b\w+n't\b
1,Then I would be gay and I would kill myself. D...,1,[Don't]
2,Hey don't knock the Port+OJ until you've trie...,1,[don't]
6,You coming to work today E? I know it's Gay D...,1,[wasn't]
7,Love it but I don't it too much currently,0,[don't]
13,I'm late to the emo-ball.. I'm not even dresse...,1,[can't]


In [38]:
regex = '\\bnot\\b'
data_extended = find_all_matches(X_train, y_train, regex)
print_matching_statistics(data_extended, regex)
data_extended.loc[data_extended[regex].notnull()].head()

The mean of labels of matching tweets = 0.36
The rate of matching tweets = 0.051
p-value = 0.137


Unnamed: 0,content,label,\bnot\b
9,its not hate on government its hate on our ex...,0,[not]
13,I'm late to the emo-ball.. I'm not even dresse...,1,[not]
33,not too often once a year perhaps but it hi...,0,[not]
48,Then Clive is a total pussy and after reading ...,1,[not]
56,probably not but who the fuck knows what's goi...,1,[not]


In [39]:
nltk.word_tokenize('test don\'t didn\'t won\'t you\'ll')

['test', 'do', "n't", 'did', "n't", 'wo', "n't", 'you', "'ll"]

Contractions (words with an apostrophe) are associated with aggressive tweets.

In [40]:
regex = '\\b\\w+\'\\w+\\b'
data_extended = find_all_matches(X_train, y_train, regex)
print_matching_statistics(data_extended, regex)
data_extended.loc[data_extended[regex].notnull()].head()

The mean of labels of matching tweets = 0.48
The rate of matching tweets = 0.248
p-value = 0.000


Unnamed: 0,content,label,\b\w+'\w+\b
1,Then I would be gay and I would kill myself. D...,1,[Don't]
2,Hey don't knock the Port+OJ until you've trie...,1,"[don't, you've, you're]"
4,Not yet the woman was wanting to escape but ...,0,[I'm]
6,You coming to work today E? I know it's Gay D...,1,"[it's, wasn't]"
7,Love it but I don't it too much currently,0,[don't]


In [41]:
regex = '\\b\\w+\'[^t]\\w+\\b'
data_extended = find_all_matches(X_train, y_train, regex)
print_matching_statistics(data_extended, regex)
data_extended.loc[data_extended[regex].notnull()].head()

The mean of labels of matching tweets = 0.51
The rate of matching tweets = 0.068
p-value = 0.000


Unnamed: 0,content,label,\b\w+'[^t]\w+\b
2,Hey don't knock the Port+OJ until you've trie...,1,"[you've, you're]"
35,Damn you're cool,0,[you're]
65,I've noticed he can't spell much beyond NOM an...,1,[I've]
74,you got that right! People stealing others' ge...,1,[others' generators]
77,Y'know @QueenofSpain doesn't hate me so I do...,0,[Y'know]
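
Row 74 above shows that `[^t]` also accepts a space, so the pattern captures "others' generators" as if it were one contraction. A hedged alternative, assuming the intent is "a word character other than t after the apostrophe", replaces `[^t]` with `[^\Wt]`; using it would change the statistics reported above.

```python
import re

ORIGINAL = r"\b\w+'[^t]\w+\b"
# Assumed refinement: [^\Wt] means "a word character other than t".
WORD_ONLY = r"\b\w+'[^\Wt]\w+\b"

for text in ["others' generators", "you've tried, you're", "don't"]:
    print(repr(text), re.findall(ORIGINAL, text), re.findall(WORD_ONLY, text))
```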


Words censored with asterisks show no significant association with either class.

In [42]:
regex = '\\b\\w+\\*{1,}\\w+\\b'
data_extended = find_all_matches(X_train, y_train, regex)
print_matching_statistics(data_extended, regex)
data_extended.loc[data_extended[regex].notnull()]

The mean of labels of matching tweets = 0.60
The rate of matching tweets = 0.001
p-value = 0.114


Unnamed: 0,content,label,"\b\w+\*{1,}\w+\b"
209,whwhwhwhoa just slow down. huh!? what do you ...,0,"[f*****ck, F*ck, F*ck, F*ck, F*ck, F*ck]"
471,thats what i am saying...they need to pay me.t...,1,[m*ttaf]
761,Aw f**k. That sucks.,0,[f**k]
960,lenee that bo*o ass incense you gave me smell...,1,[bo*o]
1210,A BAMF is a bad-ass mother f*cker.,1,[f*cker]
2250,oooooooohhhhhhh sh*t! Damn where all the white...,1,[sh*t]
2504,my bleeping employment agency f***ed up my pay...,1,[f***ed]
2526,lenee that bo*o ass incense you gave me smell...,1,[bo*o]
3775,LMAO......... U starting SH*T already... Ha! I...,0,[SH*T]
4252,A BAMF is a bad-ass mother f*cker.,1,[f*cker]


Uppercase words are associated with aggressive tweets.

In [43]:
regex = '\\b[A-Z]{2,}\\b'
data_extended = find_all_matches(X_train, y_train, regex)
print_matching_statistics(data_extended, regex)
data_extended.loc[data_extended[regex].notnull()].head()

The mean of labels of matching tweets = 0.52
The rate of matching tweets = 0.168
p-value = 0.000


Unnamed: 0,content,label,"\b[A-Z]{2,}\b"
2,Hey don't knock the Port+OJ until you've trie...,1,[OJ]
8,i hate hate hate it with a passion!! lol...i k...,1,[LOL]
12,your LAME ass . oviously iloveyouhh moree. (;,1,[LAME]
13,I'm late to the emo-ball.. I'm not even dresse...,1,[WHY]
21,P33T WIFF HIS BIG GAY TEEFS!!! BRYCE FROM TRS ...,1,"[WIFF, HIS, BIG, GAY, TEEFS, BRYCE, FROM, TRS,..."


In [44]:
# rate of uppercase
regex = '\\b[A-Z]{2,}\\b'
matching_words_rate = compute_matching_words_rate(regex, X_train)
data_extended['rate'] = matching_words_rate

matching_data_extended = data_extended.loc[data_extended.iloc[:, 2].notnull()]
matching_data_extended.head()

Unnamed: 0,content,label,"\b[A-Z]{2,}\b",rate
2,Hey don't knock the Port+OJ until you've trie...,1,[OJ],0.043478
8,i hate hate hate it with a passion!! lol...i k...,1,[LOL],0.043478
12,your LAME ass . oviously iloveyouhh moree. (;,1,[LAME],0.166667
13,I'm late to the emo-ball.. I'm not even dresse...,1,[WHY],0.041667
21,P33T WIFF HIS BIG GAY TEEFS!!! BRYCE FROM TRS ...,1,"[WIFF, HIS, BIG, GAY, TEEFS, BRYCE, FROM, TRS,...",0.947368


In [45]:
print('%.3f' % matching_data_extended[matching_data_extended.rate > 0.1].label.mean())
print('%.3f' % matching_data_extended[matching_data_extended.rate < 0.1].label.mean())

0.570
0.465


In [46]:
regex = '^[^a-z]+$'
data_extended = find_all_matches(X_train, y_train, regex)
print_matching_statistics(data_extended, regex)
data_extended.loc[data_extended[regex].notnull()].head()

The mean of labels of matching tweets = 0.52
The rate of matching tweets = 0.020
p-value = 0.000


Unnamed: 0,content,label,^[^a-z]+$
21,P33T WIFF HIS BIG GAY TEEFS!!! BRYCE FROM TRS ...,1,[P33T WIFF HIS BIG GAY TEEFS!!! BRYCE FROM TRS...
50,THANKYOU! HAHA,0,[ THANKYOU! HAHA]
91,I STILL HATE YOU.,1,[I STILL HATE YOU.]
134,:),0,[ :)]
153,SARCASTIC. :],0,[ SARCASTIC. :]]


Words that start with a capital letter are associated with non-aggressive tweets.

In [47]:
regex = '\\b[A-Z][a-z]+\\b'
data_extended = find_all_matches(X_train, y_train, regex)
print_matching_statistics(data_extended, regex)
data_extended.loc[data_extended[regex].notnull()].head()

The mean of labels of matching tweets = 0.37
The rate of matching tweets = 0.532
p-value = 0.001


Unnamed: 0,content,label,\b[A-Z][a-z]+\b
0,Heh! Parcells is a bad-ass who knows how to he...,0,"[Heh, Parcells, Hey]"
1,Then I would be gay and I would kill myself. D...,1,"[Then, Don]"
2,Hey don't knock the Port+OJ until you've trie...,1,"[Hey, Port, As]"
3,Famous! Duhh Haha that means I would be rich...,0,"[Famous, Duhh, Haha]"
4,Not yet the woman was wanting to escape but ...,0,[Not]


In [48]:
# rate of words with the first capital letter
regex = '\\b[A-Z][a-z]+\\b'
matching_words_rate = compute_matching_words_rate(regex, X_train)
data_extended['rate'] = matching_words_rate

matching_data_extended = data_extended.loc[data_extended.iloc[:, 2].notnull()]
matching_data_extended.head()

Unnamed: 0,content,label,\b[A-Z][a-z]+\b,rate
0,Heh! Parcells is a bad-ass who knows how to he...,0,"[Heh, Parcells, Hey]",0.125
1,Then I would be gay and I would kill myself. D...,1,"[Then, Don]",0.133333
2,Hey don't knock the Port+OJ until you've trie...,1,"[Hey, Port, As]",0.130435
3,Famous! Duhh Haha that means I would be rich...,0,"[Famous, Duhh, Haha]",0.176471
4,Not yet the woman was wanting to escape but ...,0,[Not],0.043478


In [49]:
print('%.3f' % matching_data_extended[matching_data_extended.rate > 0.1].label.mean())
print('%.3f' % matching_data_extended[matching_data_extended.rate < 0.1].label.mean())

0.360
0.391


All-lowercase tweets show no significant association with either class.

In [50]:
regex = '^[^A-Z]+$'
data_extended = find_all_matches(X_train, y_train, regex)
print_matching_statistics(data_extended, regex)
data_extended.loc[data_extended[regex].notnull()].head()

The mean of labels of matching tweets = 0.39
The rate of matching tweets = 0.302
p-value = 0.587


Unnamed: 0,content,label,^[^A-Z]+$
5,damn dude. that is spot on (wrt fb v twitter),0,[damn dude. that is spot on (wrt fb v twitter)]
9,its not hate on government its hate on our ex...,0,[its not hate on government its hate on our e...
14,you were right the uv filter that came with t...,1,[you were right the uv filter that came with ...
19,what do you usually wear? like to school?,0,[ what do you usually wear? like to school?]
23,thats a awesome ass hat lol,0,[thats a awesome ass hat lol]


Repeated letters are associated with non-aggressive tweets.

In [51]:
regex = '(([a-zA-Z])\\2{2,})'
data_extended = find_all_matches(X_train, y_train, regex)
print_matching_statistics(data_extended, regex)
data_extended.loc[data_extended[regex].notnull()].head()

The mean of labels of matching tweets = 0.34
The rate of matching tweets = 0.058
p-value = 0.003


Unnamed: 0,content,label,"(([a-zA-Z])\2{2,})"
4,Not yet the woman was wanting to escape but ...,0,"[(ppp, p), (fff, f)]"
32,nigga u geigh lmao! fuck yo finals beeeeeitch,1,"[(eeeee, e)]"
36,awww man sorry to hear you got laid off damn e...,0,"[(www, w)]"
38,Im doing good thankyuh!!!Heyyy hbu?!,0,"[(yyy, y)]"
64,mr big dick daddy 4rm cincinnati....ooooowwww...,0,"[(ooooo, o), (wwwww, w)]"


In [52]:
# rate of words with repeating letters
regex = '\\b\\w*(([a-zA-Z])\\2{2,})\\w*\\b'
matching_words_rate = compute_matching_words_rate(regex, X_train)
data_extended['rate'] = matching_words_rate

matching_data_extended = data_extended.loc[data_extended.iloc[:, 2].notnull()]
matching_data_extended.head()

Unnamed: 0,content,label,"(([a-zA-Z])\2{2,})",rate
4,Not yet the woman was wanting to escape but ...,0,"[(ppp, p), (fff, f)]",0.043478
32,nigga u geigh lmao! fuck yo finals beeeeeitch,1,"[(eeeee, e)]",0.125
36,awww man sorry to hear you got laid off damn e...,0,"[(www, w)]",0.083333
38,Im doing good thankyuh!!!Heyyy hbu?!,0,"[(yyy, y)]",0.166667
64,mr big dick daddy 4rm cincinnati....ooooowwww...,0,"[(ooooo, o), (wwwww, w)]",0.142857


In [53]:
print('%.3f' % matching_data_extended[matching_data_extended.rate > 0.1].label.mean())
print('%.3f' % matching_data_extended[matching_data_extended.rate < 0.1].label.mean())

0.251
0.419


Laughter (haha, hehe) is associated with non-aggressive tweets.

In [54]:
regex = '(b?w?a?(ha|he)\\2{1,}h?)'
data_extended = find_all_matches(X_train, y_train, regex)
print_matching_statistics(data_extended, regex)
data_extended.loc[data_extended[regex].notnull()].head()

The mean of labels of matching tweets = 0.26
The rate of matching tweets = 0.050
p-value = 0.000


Unnamed: 0,content,label,"(b?w?a?(ha|he)\2{1,}h?)"
11,haha yeah okay. Sounds good :] Sleep well.,0,"[(haha, ha)]"
86,what song? wtf haha,0,"[(haha, ha)]"
98,I don't . but when i was little sweet and s...,0,"[(hahah, ha)]"
114,I hate my life haha,1,"[(haha, ha)]"
120,yes. i dont eat them. haha,0,"[(haha, ha)]"


Stopwords

In [55]:
STOPWORDS = nltk.corpus.stopwords.words('english')
print(STOPWORDS)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [56]:
'n\'t' in STOPWORDS, 'not' in STOPWORDS

(False, True)

In [57]:
X_train_lower = X_train.apply(str.lower)
IRRELEVANT_STOPWORDS = []
for stopword in np.sort(STOPWORDS):
    print(stopword)
    data_extended = find_all_matches(X_train_lower, y_train, stopword)
    pvalue = compute_binom_pvalue(data_extended, stopword)
    print_matching_statistics(data_extended, stopword)
    if pvalue >= 0.01:
        IRRELEVANT_STOPWORDS.append(stopword)

a
The mean of labels of matching tweets = 0.41
The rate of matching tweets = 0.921
p-value = 0.000
about
The mean of labels of matching tweets = 0.36
The rate of matching tweets = 0.036
p-value = 0.167
above
The mean of labels of matching tweets = 0.78
The rate of matching tweets = 0.001
p-value = 0.033
after
The mean of labels of matching tweets = 0.47
The rate of matching tweets = 0.007
p-value = 0.135
again
The mean of labels of matching tweets = 0.39
The rate of matching tweets = 0.009
p-value = 1.000
against
The mean of labels of matching tweets = 0.44
The rate of matching tweets = 0.002
p-value = 0.683
ain
The mean of labels of matching tweets = 0.43
The rate of matching tweets = 0.034
p-value = 0.069
all
The mean of labels of matching tweets = 0.41
The rate of matching tweets = 0.127
p-value = 0.128
am
The mean of labels of matching tweets = 0.44
The rate of matching tweets = 0.221
p-value = 0.000
an
The mean of labels of matching tweets = 0.41
The rate of matching tweets = 0.39
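
In the loop above each stopword is used as a regular expression without word boundaries, so a pattern such as `a` matches any tweet that contains the letter "a". This may explain the 0.921 matching rate reported for it. A minimal sketch of whole-word matching is given below; `word_pattern` is a hypothetical helper, and applying it would change the p-values and the resulting list of irrelevant stopwords.

```python
import re


def word_pattern(word):
    """Build a regex that matches `word` only as a whole word (hypothetical helper)."""
    return r'\b' + re.escape(word) + r'\b'


for text in ['what song? wtf haha', 'thats a awesome ass hat lol']:
    print(repr(text), re.findall('a', text), re.findall(word_pattern('a'), text))
```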

In [58]:
print(IRRELEVANT_STOPWORDS)

['about', 'above', 'after', 'again', 'against', 'ain', 'all', 'aren', "aren't", 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'couldn', "couldn't", 'd', 'did', 'does', 'doesn', "doesn't", 'doing', 'don', 'down', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'haven', "haven't", 'having', 'her', 'here', 'hers', 'herself', 'himself', 'how', 'i', 'if', 'into', 'isn', "isn't", "it's", 'itself', 'just', 'll', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'own', 'same', 'shan', "shan't", "she's", 'should', "should've", 'shouldn', "shouldn't", 'so', 'some', 'than', "that'll", 'the', 'their', 'theirs', 'them', 'themselves', 'there', 'these', 'this', 'those', 'through', 'to', 'too', 'under', 'until', 'very', 'was', "wasn't", 'we', 'were', 'weren', "weren't", 'when', 'wh

In [59]:
len(STOPWORDS), len(IRRELEVANT_STOPWORDS)

(179, 136)

In [60]:
print(set(STOPWORDS) - set(IRRELEVANT_STOPWORDS))

{'wasn', 's', 'have', 'do', 'didn', 'he', 'a', 'his', "you're", 'over', 'such', 'm', 'it', 'an', 'my', "don't", 'are', 't', 'then', 'in', 'which', 'ma', 're', 'as', 'she', 'and', 'of', 'its', 'during', 'up', 'at', 'that', 'me', 'is', "didn't", 'what', 'can', 'am', 'off', 'any', 'him', 'they', 've'}


In [61]:
df = pd.DataFrame(IRRELEVANT_STOPWORDS)
df.to_csv('./Data/irrelevant_stopwords.csv', index=False, header=False)