## Problem: Detection of aggressive tweets

The training dataset has 12776 tweets (in English) and the validation dataset has 3194 tweets.<br/>
Tweets are labeled (by humans) as:
* 1 (Cyber-Aggressive)
* 0 (Non Cyber-Aggressive)

# Tweet Analysis

## Data

In [1]:
import pandas as pd
import numpy as np

In [2]:
data_train = pd.read_json('./Data/train.json')

In [3]:
X_train = data_train.content
y_train = data_train.label

In [4]:
print('Training data: ', 'class 1 contribution = %.2f' % y_train.mean(), 
      '# = %s' % X_train.shape[0], sep='\n')

Training data: 
class 1 contribution = 0.40
# = 12776


In [5]:
p = 0.40

Only the training part is analysed.

## Punctuation

In [15]:
import string
import re
import nltk
from myutils import compute_binom_pvalue
from myutils import print_matching_statistics
from myutils import find_all_matches
from myutils import compute_matching_words_rate
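
The helpers imported from `myutils` are not shown in this notebook. The sketch below is an assumption about what `find_all_matches` and `print_matching_statistics` roughly do, inferred from the outputs that follow. It uses only the standard library (no pandas), and the names `find_all_matches_sketch` and `matching_statistics_sketch` are hypothetical.

```python
import re


def find_all_matches_sketch(texts, pattern):
    """For each text, return the list of re.findall hits, or None if nothing matches."""
    compiled = re.compile(pattern)
    results = []
    for text in texts:
        hits = compiled.findall(text)
        results.append(hits if hits else None)
    return results


def matching_statistics_sketch(matches, labels):
    """Return (mean label of matching texts, share of texts that match)."""
    matched_labels = [label for hits, label in zip(matches, labels) if hits is not None]
    rate = len(matched_labels) / len(labels)
    # With no matching text the mean is undefined, reported here as NaN.
    mean = sum(matched_labels) / len(matched_labels) if matched_labels else float('nan')
    return mean, rate


# Toy data, not taken from the dataset.
toy_texts = ['great!', 'what?', 'no way!', 'ok']
toy_labels = [1, 0, 0, 1]
toy_matches = find_all_matches_sketch(toy_texts, '!')
print(toy_matches)
print(matching_statistics_sketch(toy_matches, toy_labels))
```

The two returned numbers correspond to "the mean of labels of matching tweets" and "the rate of matching tweets" printed in the cells below.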

In [7]:
print(np.random.choice(X_train.values, 15))

['We\'ve maybe had 1" total here across 3-4 different "snows" My first time with snow tires ever and I can\'t have any damn fun!'
 'Lmao not as bad as other people :) *cough* @allieoop95 @brookoverroxx *cough* my Mommys :) Hehe. Are you a twitter whore? :)'
 '  yeah. money i love finding it too'
 ' how did you spend your 2 weeks time with out formspring?'
 ' What is the most expensive thing you have on right now?'
 'me too. So entertaining. Dude you watch this mvie called trailor parl of terror. Fucking brilliant.'
 'whelchers fuckin suck ass!'
 "Hey  don't knock the Port+OJ until you've tried it.As cold remedies go  you're still sick but no longer give a damn. ;)"
 'I STILL HATE YOU.'
 "I'm sorry. The situation sucks (I'm sure far more than I know)."
 "could be that I'm not typing hard enough? happened a lot more when I first got this laptop about 3 weeks ago. Sucks ass"
 'Ack  still no power... that sucks dude.  Hopefully it comes back soon for you!'
 ' a c t u a l l y;;i DO want him

In [8]:
PUNCTUATION = string.punctuation
print(PUNCTUATION)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


Exclamation marks are associated with aggressive tweets

In [9]:
regex = '(?<![?!])!(?!!)'
data_extended = find_all_matches(X_train, y_train, regex)
print_matching_statistics(data_extended, regex, p)
data_extended.loc[data_extended[regex].notnull()].head()

The mean of labels of matching tweets = 0.49
The rate of matching tweets = 0.216
p-value = 0.000


Unnamed: 0,content,label,(?<![?!])!(?!!)
1,you asian!!! I hate asians!!! Bahahahahhaha!!!...,1,[!]
4,thx for the well wishes lisa! i hate taking me...,1,[!]
10,What would you do if a leprecon jumped out of...,0,[!]
12,i twissed you! (twitter-missed yo ass).,0,[!]
13,I feel like such a nerd watching this! I LOVE ...,1,[!]
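
The p-value comes from `compute_binom_pvalue`, whose source is not shown. The figures reported in this notebook are consistent with a two-sided exact binomial test of the null hypothesis that matching tweets are aggressive at the base rate `p = 0.40`. The following is a minimal standard-library sketch under that assumption; the function names are hypothetical.

```python
import math


def binom_log_pmf(k, n, p):
    """Log of the binomial probability of k successes in n trials."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1 - p))


def binom_two_sided_pvalue(k, n, p):
    """Sum the probabilities of all outcomes no more likely than the observed one."""
    if n == 0:
        return 1.0  # nothing matched, so there is no evidence against the null
    log_observed = binom_log_pmf(k, n, p)
    total = 0.0
    for i in range(n + 1):
        log_prob = binom_log_pmf(i, n, p)
        if log_prob <= log_observed + 1e-7:  # small tolerance for rounding
            total += math.exp(log_prob)
    return min(1.0, total)


# 0 aggressive tweets among 4 matching tweets, base rate 0.40
print('%.3f' % binom_two_sided_pvalue(0, 4, 0.40))
```

For 0 aggressive tweets out of 4 matches this gives about 0.155, the value reported further down for the `<\w+?>` pattern, whose four matching tweets are all labeled 0.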


In [10]:
regex = '(?<![?!])!{2}(?!!)'
data_extended = find_all_matches(X_train, y_train, regex)
print_matching_statistics(data_extended, regex, p)
data_extended.loc[data_extended[regex].notnull()].head()

The mean of labels of matching tweets = 0.39
The rate of matching tweets = 0.024
p-value = 0.644


Unnamed: 0,content,label,(?<![?!])!{2}(?!!)
1,you asian!!! I hate asians!!! Bahahahahhaha!!!...,1,[!!]
25,so do our cops! I got to ride one over break! ...,0,[!!]
98,Save the dog and plea with my boss i was on ...,0,[!!]
108,yeah u are!! Asian ass embrace it,1,[!!]
123,fuck yeah!!,1,[!!]


In [11]:
regex = '(?<!\\?)!{2,}'
data_extended = find_all_matches(X_train, y_train, regex)
print_matching_statistics(data_extended, regex, p)
data_extended.loc[data_extended[regex].notnull()].head()

The mean of labels of matching tweets = 0.45
The rate of matching tweets = 0.055
p-value = 0.009


Unnamed: 0,content,label,"(?<!\?)!{2,}"
1,you asian!!! I hate asians!!! Bahahahahhaha!!!...,1,"[!!!, !!!, !!!, !!!, !!!, !!]"
13,I feel like such a nerd watching this! I LOVE ...,1,[!!!!!!!!!]
25,so do our cops! I got to ride one over break! ...,0,[!!]
51,I fucking hope you move you bitch!!!!,1,[!!!!]
98,Save the dog and plea with my boss i was on ...,0,[!!]
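
The three patterns above differ only in their lookarounds: `(?<![?!])` rejects a match that is preceded by `?` or `!`, and `(?!!)` rejects one that is followed by `!`. The short check below runs them on a made-up sentence (not from the dataset), written as raw strings that are equivalent to the escaped strings used in the cells.

```python
import re

sample = 'wow! really!! no way!!! what?!'

single = r'(?<![?!])!(?!!)'         # exactly one "!"
exactly_two = r'(?<![?!])!{2}(?!!)'  # exactly two "!"
two_or_more = r'(?<!\?)!{2,}'        # runs of two or more "!"

print(re.findall(single, sample))       # ['!']
print(re.findall(exactly_two, sample))  # ['!!']
print(re.findall(two_or_more, sample))  # ['!!', '!!!']
```

The `!` in `?!` is skipped by all three patterns, which is why that combination is analysed separately.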


Single question marks are associated with nonaggressive tweets

In [12]:
regex = '(?<!\\?)\\?(?![?!])'
data_extended = find_all_matches(X_train, y_train, regex)
print_matching_statistics(data_extended, regex, p)
data_extended.loc[data_extended[regex].notnull()].head()

The mean of labels of matching tweets = 0.26
The rate of matching tweets = 0.230
p-value = 0.000


Unnamed: 0,content,label,(?<!\?)\?(?![?!])
9,u30c4 You come across a zombie eating a waffl...,0,[?]
10,What would you do if a leprecon jumped out of...,0,[?]
21,is any of them super bad ass?,0,[?]
22,You're still alive? Well damn I just lost a ...,1,[?]
24,you do have a problem. What did u get bitch?,0,[?]


In [13]:
regex = '(?<!\\?)\\?{2}(?![?!])'
data_extended = find_all_matches(X_train, y_train, regex)
print_matching_statistics(data_extended, regex, p)
data_extended.loc[data_extended[regex].notnull()].head()

The mean of labels of matching tweets = 0.39
The rate of matching tweets = 0.008
p-value = 0.843


Unnamed: 0,content,label,(?<!\?)\?{2}(?![?!])
46,hmmm...I had a feeling bout u! but Damn No Lab...,1,[??]
84,Fist fight?? No. Im a lover not a fighter. ha,0,[??]
408,That sucks! I can't imagine it would be easy ...,0,[??]
532,cunt? Whore?? Lololol,1,[??]
609,M3 @nd m@ d@d vv@$ 3@+!n d!nn3r @nd ! Qu!(kly...,0,[??]


In [14]:
regex = '\\?{2,}(?!!)'
data_extended = find_all_matches(X_train, y_train, regex)
print_matching_statistics(data_extended, regex, p)
data_extended.loc[data_extended[regex].notnull()].head()

The mean of labels of matching tweets = 0.37
The rate of matching tweets = 0.014
p-value = 0.359


Unnamed: 0,content,label,"\?{2,}(?!!)"
46,hmmm...I had a feeling bout u! but Damn No Lab...,1,[??]
84,Fist fight?? No. Im a lover not a fighter. ha,0,[??]
113,lol how ya kno????,0,[????]
136,did you like the patti labelle sing along too....,0,[???]
349,formspring says oops! cant find you...sorry....,0,[???]


The combination ?! shows no significant association with either class

In [15]:
regex = '\\?+!+'
data_extended = find_all_matches(X_train, y_train, regex)
print_matching_statistics(data_extended, regex, p)
data_extended.loc[data_extended[regex].notnull()].head()

The mean of labels of matching tweets = 0.42
The rate of matching tweets = 0.009
p-value = 0.636


Unnamed: 0,content,label,\?+!+
217,omg u missed weekend update?! I laughed so har...,0,[?!]
264,fuck you do you know how cold it was in nyc t...,0,[?!]
395,Im doing good thankyuh!!!Heyyy hbu?!,0,[?!]
473,Ahaha!!! Yess of course aint dat rite @TruAc...,0,"[?!, ?!]"
605,Are yuh single and ready 2 mingle?!,0,[?!]


Dots are associated with aggressive tweets

In [16]:
regex = '(?<!\\.)\\.(?!\\.)'
data_extended = find_all_matches(X_train, y_train, regex)
print_matching_statistics(data_extended, regex, p)
data_extended.loc[data_extended[regex].notnull()].head()

The mean of labels of matching tweets = 0.44
The rate of matching tweets = 0.471
p-value = 0.000


Unnamed: 0,content,label,(?<!\.)\.(?!\.)
0,i hate u like miniature dogs hate people dress...,1,[.]
2,austins new girllfriend is not uglly &amp;&a...,1,"[., .]"
3,Not currently.,0,[.]
9,u30c4 You come across a zombie eating a waffl...,0,[.]
11,I find the exact opposite to be true. I can r...,1,"[., ., .]"


In [17]:
regex = '(?<!\\.)\\.{2}(?!\\.)'
data_extended = find_all_matches(X_train, y_train, regex)
print_matching_statistics(data_extended, regex, p)
data_extended.loc[data_extended[regex].notnull()].head()

The mean of labels of matching tweets = 0.39
The rate of matching tweets = 0.037
p-value = 0.637


Unnamed: 0,content,label,(?<!\.)\.{2}(?!\.)
30,...go absolutely insane.hate to be the bearer ...,1,"[.., ..]"
37,tak to si snad ani nemel psat... ted to budu m...,0,"[.., ..]"
44,Yo it's a wack ass night here at boston colleg...,0,[..]
74,your a pale whore..,1,[..]
128,I love big bang it is hilarious..and makes me ...,0,[..]


In [18]:
regex = '\\.{2,}'
data_extended = find_all_matches(X_train, y_train, regex)
print_matching_statistics(data_extended, regex, p)
data_extended.loc[data_extended[regex].notnull()].head()

The mean of labels of matching tweets = 0.46
The rate of matching tweets = 0.155
p-value = 0.000


Unnamed: 0,content,label,"\.{2,}"
8,fuck @TheWhiteHouse you need to come sit on B...,0,[...]
28,which year was your favoroute school year... ...,0,"[..., ...]"
29,I've noticed he can't spell much beyond NOM an...,1,[...]
30,...go absolutely insane.hate to be the bearer ...,1,"[..., .., ..]"
37,tak to si snad ani nemel psat... ted to budu m...,0,"[..., .., ..]"


Emoticons (happy faces and hearts) are associated with nonaggressive tweets

In [19]:
regex = '[:;=8x]\'?-?[)D\\]*3/(x#|\\[Pp{]'
data_extended = find_all_matches(X_train, y_train, regex)
print_matching_statistics(data_extended, regex, p)
data_extended.loc[data_extended[regex].notnull()].head()

The mean of labels of matching tweets = 0.35
The rate of matching tweets = 0.148
p-value = 0.000


Unnamed: 0,content,label,[:;=8x]'?-?[)D\]*3/(x#|\[Pp{]
4,thx for the well wishes lisa! i hate taking me...,1,[:)]
7,re: why friendfeed sucks - no doubt a powerful...,1,[xp]
25,so do our cops! I got to ride one over break! ...,0,[:D]
32,I'd still be walking around with it stuck to m...,0,[;-)]
36,for real? That sucks. I have hosting there. Y...,0,[;)]


In [20]:
# happy faces
regex = '[:;=8x]-?[)D\\]\\}>*]'
data_extended = find_all_matches(X_train, y_train, regex)
print_matching_statistics(data_extended, regex, p)
data_extended.loc[data_extended[regex].notnull()].head()

The mean of labels of matching tweets = 0.28
The rate of matching tweets = 0.090
p-value = 0.000


Unnamed: 0,content,label,[:;=8x]-?[)D\]\}>*]
4,thx for the well wishes lisa! i hate taking me...,1,[:)]
25,so do our cops! I got to ride one over break! ...,0,[:D]
32,I'd still be walking around with it stuck to m...,0,[;-)]
36,for real? That sucks. I have hosting there. Y...,0,[;)]
48,My pussy remembers what u do! Lol :),1,[:)]


In [21]:
# the ratio of matching words to all words in a tweet
# rate of happy faces
regex = '[:;=8x]-?[)D\\]*]'
matching_words_rate = compute_matching_words_rate(regex, X_train)
data_extended['rate'] = matching_words_rate

matching_data_extended = data_extended.loc[data_extended.iloc[:, 2].notnull()]
matching_data_extended.head()

Unnamed: 0,content,label,[:;=8x]-?[)D\]\}>*],rate
4,thx for the well wishes lisa! i hate taking me...,1,[:)],0.0625
25,so do our cops! I got to ride one over break! ...,0,[:D],0.055556
32,I'd still be walking around with it stuck to m...,0,[;-)],0.037037
36,for real? That sucks. I have hosting there. Y...,0,[;)],0.071429
48,My pussy remembers what u do! Lol :),1,[:)],0.142857


In [22]:
print('%.3f' % matching_data_extended[matching_data_extended.rate > 0.1].label.mean())
print('%.3f' % matching_data_extended[matching_data_extended.rate < 0.1].label.mean())

0.184
0.341
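
The `rate` column is produced by `compute_matching_words_rate`, which is not shown. The sketch below assumes the rate is the number of pattern matches divided by the number of words, where a word is a whitespace-separated token containing at least one alphanumeric character. That definition is a guess; it does reproduce the 0.5 reported further down for `Not currently.` with the capitalised-word pattern. The function name is hypothetical.

```python
import re


def matching_words_rate_sketch(pattern, text):
    """Number of pattern matches divided by the number of words in the text."""
    # Assumption: tokens made only of punctuation (such as ":)") are not words.
    n_words = sum(1 for token in text.split() if re.search(r'\w', token))
    if n_words == 0:
        return 0.0
    return len(re.findall(pattern, text)) / n_words


happy_face = r'[:;=8x]-?[)D\]*]'
print(matching_words_rate_sketch(happy_face, 'see you soon :)'))            # one emoticon, three words
print(matching_words_rate_sketch(r'\b[A-Z][a-z]+\b', 'Not currently.'))     # 0.5
```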


In [23]:
regex = '[:;=8x]-?[pP]'
data_extended = find_all_matches(X_train, y_train, regex)
print_matching_statistics(data_extended, regex, p)
data_extended.loc[data_extended[regex].notnull()].head()

The mean of labels of matching tweets = 0.44
The rate of matching tweets = 0.019
p-value = 0.196


Unnamed: 0,content,label,[:;=8x]-?[pP]
7,re: why friendfeed sucks - no doubt a powerful...,1,[xp]
79,Aw that sucks. I didn't get incomprehensible p...,0,[:P]
91,how do you express your anger?,0,[xp]
104,Know how you don't smell perfume you use a lot...,1,[xp]
137,i love you tabi your beautiful and the &;&;p...,0,[;p]
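
Some of the hits above are letters inside ordinary words rather than emoticons: the `xp` in row 91 comes from the word "express". One possible refinement, which is not part of the original analysis, is to require that the emoticon is not attached to a word character on either side. The `bounded` pattern below is a hypothetical variant.

```python
import re

original = r'[:;=8x]-?[pP]'
bounded = r'(?<!\w)[:;=8x]-?[pP](?!\w)'  # hypothetical stricter variant

text = 'how do you express your anger? :P'

print(re.findall(original, text))  # ['xp', ':P']
print(re.findall(bounded, text))   # [':P']
```

The trade-off is that emoticons typed directly after a word, such as `lol:P`, are no longer matched by the stricter pattern.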


In [24]:
# sad face
regex = '[:;=8x]\'?-?[/(x#|\\[{<]'
data_extended = find_all_matches(X_train, y_train, regex)
print_matching_statistics(data_extended, regex, p)
data_extended.loc[data_extended[regex].notnull()].head()

The mean of labels of matching tweets = 0.42
The rate of matching tweets = 0.038
p-value = 0.290


Unnamed: 0,content,label,[:;=8x]'?-?[/(x#|\[{<]
66,damn your yankees! http://tinyurl.com/8vgfyp,0,[:/]
112,Then my dad would either bitch or just leave ...,0,[:(]
153,that sucks *hugs* :(,1,[:(]
154,Lmao not as bad as other people :) *cough* @al...,0,[xx]
170,Jonas Brothers Concert 8/13/09 ;D I still lo...,0,[8/]


In [25]:
# heart
regex = '<3+|&lt;3'
data_extended = find_all_matches(X_train, y_train, regex)
print_matching_statistics(data_extended, regex, p)
data_extended.loc[data_extended[regex].notnull()].head()

The mean of labels of matching tweets = 0.25
The rate of matching tweets = 0.012
p-value = 0.000


Unnamed: 0,content,label,<3+|&lt;3
61,"YOU""RE A GAY!? .... =) I &lt;3 you",1,[&lt;3]
115,WHO CARES I HATE THEM! FTW! Knitting and croch...,1,[&lt;3]
152,Wild flowers. <3,0,[<3]
183,Your beautiful<3 don&;t listen to others(: t...,0,"[<3, <3, <3]"
267,Shaun Diviney's house and everyday haha. we ...,0,[<3]


Quotation marks are associated with aggressive tweets

In [26]:
regex = '("|&quot;).+("|&quot;)'
data_extended = find_all_matches(X_train, y_train, regex)
print_matching_statistics(data_extended, regex, p)
data_extended.loc[data_extended[regex].notnull()].head()

The mean of labels of matching tweets = 0.50
The rate of matching tweets = 0.040
p-value = 0.000


Unnamed: 0,content,label,"(""|&quot;).+(""|&quot;)"
2,austins new girllfriend is not uglly &amp;&a...,1,"[(&quot;, &quot;)]"
56,"This is a better Hodgman excerpt: ""it is hard ...",0,"[("", "")]"
95,evilbeet do u not feel that when u watch jon ...,1,"[("", "")]"
155,ahah! and i'll say exactly what i'd say to him...,1,"[("", "")]"
168,"*sigh* oh Karrine. You said the phrase ""Sluts...",1,"[("", "")]"
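
The match column holds pairs of quote characters rather than the quoted text. This is the behaviour of `re.findall` when a pattern contains capturing groups: it returns the group contents instead of the whole match (assuming `find_all_matches` relies on `re.findall` or an equivalent, which its source would confirm). The example below is a small illustration on a made-up sentence; the non-capturing variant is shown only for comparison.

```python
import re

text = 'he said "hello there" and left'

with_groups = '("|&quot;).+("|&quot;)'
without_groups = '(?:"|&quot;).+(?:"|&quot;)'

print(re.findall(with_groups, text))     # [('"', '"')]
print(re.findall(without_groups, text))  # ['"hello there"']
```

Whether a tweet matches is the same for both variants, so the reported statistics are unaffected.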


In [27]:
regex = '("|&quot;)'
data_extended = find_all_matches(X_train, y_train, regex)
print_matching_statistics(data_extended, regex, p)
data_extended.loc[data_extended[regex].notnull()].head()

The mean of labels of matching tweets = 0.50
The rate of matching tweets = 0.044
p-value = 0.000


Unnamed: 0,content,label,"(""|&quot;)"
2,austins new girllfriend is not uglly &amp;&a...,1,"[&quot;, &quot;]"
56,"This is a better Hodgman excerpt: ""it is hard ...",0,"["", ""]"
61,"YOU""RE A GAY!? .... =) I &lt;3 you",1,"[""]"
95,evilbeet do u not feel that when u watch jon ...,1,"["", ""]"
155,ahah! and i'll say exactly what i'd say to him...,1,"["", ""]"


In [28]:
nltk.word_tokenize('"test"')

['``', 'test', "''"]

Hash marks (\#) are associated with aggressive tweets

In [29]:
regex = '(?<!&)#'
data_extended = find_all_matches(X_train, y_train, regex)
print_matching_statistics(data_extended, regex, p)
data_extended.loc[data_extended[regex].notnull()].head()

The mean of labels of matching tweets = 0.58
The rate of matching tweets = 0.007
p-value = 0.001


Unnamed: 0,content,label,(?<!&)#
351,@hermanos @jamessime wow I totally know that f...,1,[#]
607,I nominate @benmack for a Shorty Award in #bus...,1,[#]
663,#NAME?,1,[#]
1007,#NAME?,1,[#]
1039,Crazy lawsuits happen every day. God and the f...,0,[#]


At signs (@) are associated with aggressive tweets

In [30]:
regex = '@'
data_extended = find_all_matches(X_train, y_train, regex)
print_matching_statistics(data_extended, regex, p)
data_extended.loc[data_extended[regex].notnull()].head()

The mean of labels of matching tweets = 0.53
The rate of matching tweets = 0.033
p-value = 0.000


Unnamed: 0,content,label,@
8,fuck @TheWhiteHouse you need to come sit on B...,0,[@]
49,@TheSoXRoXmAsTer aw shucks you guys...are gay.,1,[@]
82,Now I have to try my best to get my ass blocke...,1,[@]
154,Lmao not as bad as other people :) *cough* @al...,0,"[@, @]"
176,"where'd ya put it? o btw: ""kick ass quote Copy...",1,[@]


## HTML symbols

In [31]:
regex = '&#?\\w*?;'
data_extended = find_all_matches(X_train, y_train, regex)
print_matching_statistics(data_extended, regex, p)
data_extended.loc[data_extended[regex].notnull()].head()

The mean of labels of matching tweets = 0.24
The rate of matching tweets = 0.040
p-value = 0.000


Unnamed: 0,content,label,&#?\w*?;
2,austins new girllfriend is not uglly &amp;&a...,1,"[&amp;, &amp;, &quot;, &quot;]"
61,"YOU""RE A GAY!? .... =) I &lt;3 you",1,[&lt;]
75,Here&;s an idea how about stop trying o get ...,1,[&;]
115,WHO CARES I HATE THEM! FTW! Knitting and croch...,1,[&lt;]
126,damn for real? That's news to me. I'm 100% sur...,0,[&gt;]


In [32]:
html_symbol_dict = {}
for html_symbols in data_extended.loc[data_extended[regex].notnull()].iloc[:, 2]:
    for html_symbol in html_symbols:
        html_symbol_dict[html_symbol] = html_symbol_dict.get(html_symbol, 0) + 1

In [33]:
html_symbol_dict

{'&amp;': 78,
 '&quot;': 122,
 '&lt;': 66,
 '&;': 364,
 '&gt;': 42,
 '&#8217;': 15,
 '&#169;': 3,
 '&apos;': 38,
 '&#191;': 1,
 '&#58390;': 3,
 '&#233;': 8,
 '&#232;': 3,
 '&#224;': 4,
 '&#234;': 1,
 '&#9786;': 1,
 '&#9835;': 7,
 '&#9824;': 2,
 '&#58371;': 1,
 '&#9834;': 3,
 '&#163;': 1,
 '&#174;': 3,
 '&#228;': 1,
 '&#58126;': 1,
 '&#172;': 3,
 '&#8230;': 4,
 '&#9829;': 2,
 '&#58382;': 1,
 '&#58389;': 1,
 '&#9773;': 1}
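
The counting loop above can also be written with `collections.Counter`, and the standard library's `html.unescape` shows which character each entity stands for. The sketch below uses a hand-made list of match lists; the sample data is illustrative, not taken from the dataset.

```python
from collections import Counter
import html

# Illustrative match lists, one per tweet.
matches_per_tweet = [['&amp;', '&amp;', '&quot;'], ['&lt;'], ['&;'], ['&#8217;']]

entity_counts = Counter(entity for matches in matches_per_tweet for entity in matches)
print(entity_counts)

# Decode each entity to see which character it represents.
decoded = {entity: html.unescape(entity) for entity in entity_counts}
for entity, character in decoded.items():
    print(entity, '->', character)
```

`&;` is not a valid entity, so `html.unescape` leaves it unchanged. In the rows shown in this notebook (`Here&;s`, `what&;s`) it appears where an apostrophe would be expected.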

In [34]:
regex = '&#8217;'
data_extended = find_all_matches(X_train, y_train, regex)
print_matching_statistics(data_extended, regex, p)
data_extended.loc[data_extended[regex].notnull()].head()

The mean of labels of matching tweets = 0.45
The rate of matching tweets = 0.001
p-value = 0.764


Unnamed: 0,content,label,&#8217;
148,...dude I&#8217;m a producer and exporter of ...,0,"[&#8217;, &#8217;]"
699,I just entered Ms. Single Mama&#8217;s Kick As...,0,[&#8217;]
2281,I just entered Ms. Single Mama&#8217;s Kick As...,0,[&#8217;]
4145,Football. I hate it. It&#8217;s official I&#8...,1,"[&#8217;, &#8217;]"
4628,...fucking coded shit. I don&#8217;t watch old...,0,[&#8217;]


In [35]:
regex = '<\\w+?>'
data_extended = find_all_matches(X_train, y_train, regex)
print_matching_statistics(data_extended, regex, p)
data_extended.loc[data_extended[regex].notnull()].head()

The mean of labels of matching tweets = 0.00
The rate of matching tweets = 0.000
p-value = 0.155


Unnamed: 0,content,label,<\w+?>
855,Favorite candy ? :D<br>,0,[<br>]
7879,....oats<br>,0,[<br>]
8583,Do you have a boyfriend?<br>,0,[<br>]
12057,what&;s your favorite song? :D<br>,0,[<br>]


## Words

Negations (n't) are associated with aggressive tweets

In [36]:
regex = '\\b\\w+n\'t\\b'
data_extended = find_all_matches(X_train, y_train, regex)
print_matching_statistics(data_extended, regex, p)
data_extended.loc[data_extended[regex].notnull()].head()

The mean of labels of matching tweets = 0.49
The rate of matching tweets = 0.080
p-value = 0.000


Unnamed: 0,content,label,\b\w+n't\b
5,you won't die in Queens we're civilized here i...,1,[won't]
29,I've noticed he can't spell much beyond NOM an...,1,[can't]
32,I'd still be walking around with it stuck to m...,0,[hadn't]
33,I can't abide Family Guy I hate it more than ...,0,[can't]
70,LOWKEYY;; hehe Don't even trip<$,0,[Don't]


In [37]:
regex = '\\bnot\\b'
data_extended = find_all_matches(X_train, y_train, regex)
print_matching_statistics(data_extended, regex, p)
data_extended.loc[data_extended[regex].notnull()].head()

The mean of labels of matching tweets = 0.37
The rate of matching tweets = 0.050
p-value = 0.159


Unnamed: 0,content,label,\bnot\b
2,austins new girllfriend is not uglly &amp;&a...,1,[not]
28,which year was your favoroute school year... ...,0,[not]
40,Totally. The ass end of my Jeep is not but n...,1,[not]
52,They all live in the town in which I grew up ...,0,[not]
56,"This is a better Hodgman excerpt: ""it is hard ...",0,[not]


In [38]:
nltk.word_tokenize('test don\'t didn\'t won\'t you\'ll')

['test', 'do', "n't", 'did', "n't", 'wo', "n't", 'you', "'ll"]

Contractions (with an apostrophe) are associated with aggressive tweets

In [39]:
regex = '\\b\\w+\'\\w+\\b'
data_extended = find_all_matches(X_train, y_train, regex)
print_matching_statistics(data_extended, regex, p)
data_extended.loc[data_extended[regex].notnull()].head()

The mean of labels of matching tweets = 0.49
The rate of matching tweets = 0.247
p-value = 0.000


Unnamed: 0,content,label,\b\w+'\w+\b
4,thx for the well wishes lisa! i hate taking me...,1,[i'm]
5,you won't die in Queens we're civilized here i...,1,"[won't, we're, there's, McDonald's]"
6,I'm being gay by using SQL Server,0,[I'm]
8,fuck @TheWhiteHouse you need to come sit on B...,0,[Bernanke's]
16,I'll have to work on that one. Ass clown is a...,0,[I'll]


In [40]:
regex = '\\b\\w+\'[^t]\\w+\\b'
data_extended = find_all_matches(X_train, y_train, regex)
print_matching_statistics(data_extended, regex, p)
data_extended.loc[data_extended[regex].notnull()].head()

The mean of labels of matching tweets = 0.51
The rate of matching tweets = 0.068
p-value = 0.000


Unnamed: 0,content,label,\b\w+'[^t]\w+\b
5,you won't die in Queens we're civilized here i...,1,[we're]
16,I'll have to work on that one. Ass clown is a...,0,[I'll]
22,You're still alive? Well damn I just lost a ...,1,[You're]
26,You're quite right. I withdraw my aggressive c...,1,[You're]
29,I've noticed he can't spell much beyond NOM an...,1,[I've]


Words masked with asterisks show no significant association with either class

In [41]:
regex = '\\b\\w+\\*{1,}\\w+\\b'
data_extended = find_all_matches(X_train, y_train, regex)
print_matching_statistics(data_extended, regex, p)
data_extended.loc[data_extended[regex].notnull()]

The mean of labels of matching tweets = 0.54
The rate of matching tweets = 0.001
p-value = 0.397


Unnamed: 0,content,label,"\b\w+\*{1,}\w+\b"
1120,A BAMF is a bad-ass mother f*cker.,1,[f*cker]
1351,my bleeping employment agency f***ed up my pay...,1,[f***ed]
2216,whwhwhwhoa just slow down. huh!? what do you ...,0,"[f*****ck, F*ck, F*ck, F*ck, F*ck, F*ck]"
2559,A BAMF is a bad-ass mother f*cker.,1,[f*cker]
5086,"the lox ""f**k you "" ""keep it thoro"" prodigy ""...",1,[f**k]
5443,Aw f**k. That sucks.,0,[f**k]
6422,lenee that bo*o ass incense you gave me smell...,1,[bo*o]
7096,the drive to Bear sucks tho. Very windy (curve...,0,[b*tch]
7460,oooooooohhhhhhh sh*t! Damn where all the white...,1,[sh*t]
7658,thats what i am saying...they need to pay me.t...,1,[m*ttaf]


Links show no significant association with either class

In [42]:
regex = '\\w+://\\S+'
data_extended = find_all_matches(X_train, y_train, regex)
print_matching_statistics(data_extended, regex, p)
data_extended.loc[data_extended[regex].notnull()].head()

The mean of labels of matching tweets = 0.47
The rate of matching tweets = 0.014
p-value = 0.052


Unnamed: 0,content,label,\w+://\S+
66,damn your yankees! http://tinyurl.com/8vgfyp,0,[http://tinyurl.com/8vgfyp]
207,What do you think of my new graphic? http://...,0,[http://poetic-beauty81.deviantart.com/art/A-R...
229,http://twitpic.com/xm6u - You have an inbox fo...,1,[http://twitpic.com/xm6u]
311,goddamn it kj...fuck you 49 times http://tinyu...,1,[http://tinyurl.com/5mq4sq]
421,http://twitpic.com/py9p - HER ass is on top,1,[http://twitpic.com/py9p]


Email addresses do not occur in the training tweets

In [78]:
regex = '\\b(\\w[\\w.-]+@[\\w-][\\w.-]*\\w(?:\\.[a-zA-Z]{1,4}))\\b'
data_extended = find_all_matches(X_train, y_train, regex)
print_matching_statistics(data_extended, regex, p)
data_extended.loc[data_extended[regex].notnull()]

The mean of labels of matching tweets = nan
The rate of matching tweets = 0.000
p-value = 1.000


Unnamed: 0,content,label,"(\w[\w.-]+@[\w-][\w.-]*\w(?:\.[a-zA-Z]{1,4}))"


Single letters and abbreviations

In [44]:
regex = '\\sm\\s'
data_extended = find_all_matches(X_train, y_train, regex)
print_matching_statistics(data_extended, regex, p)
data_extended.loc[data_extended[regex].notnull()]

The mean of labels of matching tweets = 0.50
The rate of matching tweets = 0.000
p-value = 0.689


Unnamed: 0,content,label,\sm\s
1389,haris i m trying to control my fat tummy did'n...,1,[ m ]
5480,:O andrew u loser u have no life :P lol but i...,0,[ m ]
5948,haris i m trying to control my fat tummy did'n...,1,[ m ]
10909,and also jus my lappy is gone so m usin last f...,0,[ m ]
11538,haris i m trying to control my fat tummy did'n...,1,[ m ]
11652,Hahaha... I guess I m loser now. I dont go out...,0,[ m ]


In [45]:
regex = '\\br\\b'
data_extended = find_all_matches(X_train, y_train, regex)
print_matching_statistics(data_extended, regex, p)
data_extended.loc[data_extended[regex].notnull()].head()

The mean of labels of matching tweets = 0.25
The rate of matching tweets = 0.009
p-value = 0.001


Unnamed: 0,content,label,\br\b
10,What would you do if a leprecon jumped out of...,0,[r]
20,hmm.r my husband.,0,[r]
376,hahaha yeah. that beard guy haha. we never s...,0,[r]
437,r Has anybody ever told you that &quot;you co...,0,[r]
555,U r soooooooooooooooooooo pretty Alexis<3 y r...,0,"[r, r]"


In [46]:
regex = '\\bu\\b'
data_extended = find_all_matches(X_train, y_train, regex)
print_matching_statistics(data_extended, regex, p)
data_extended.loc[data_extended[regex].notnull()].head()

The mean of labels of matching tweets = 0.52
The rate of matching tweets = 0.037
p-value = 0.000


Unnamed: 0,content,label,\bu\b
0,i hate u like miniature dogs hate people dress...,1,[u]
24,you do have a problem. What did u get bitch?,0,[u]
31,i fuck wit u. u got skillz homie.,1,"[u, u]"
46,hmmm...I had a feeling bout u! but Damn No Lab...,1,[u]
48,My pussy remembers what u do! Lol :),1,[u]


In [47]:
regex = '[^\\w:;/]c\\b'
data_extended = find_all_matches(X_train, y_train, regex)
print_matching_statistics(data_extended, regex, p)
data_extended.loc[data_extended[regex].notnull()].head()

The mean of labels of matching tweets = 0.54
The rate of matching tweets = 0.001
p-value = 0.397


Unnamed: 0,content,label,[^\w:;/]c\b
1855,jealous! I will still hug u lick ur face off ...,1,[ c]
3001,"""background too damn busy"" I'd much rather c y...",0,[ c]
3063,a c t u a l l y;;i DO want himm<33 [: so you ...,0,[ c]
3978,I hate that the x and c keys are right next to...,0,[ c]
5801,YOU? FAT? R U SERIOUS?!? Remind me to slap U w...,1,[ c]


In [48]:
regex = '\\bb/c\\b'
data_extended = find_all_matches(X_train, y_train, regex)
print_matching_statistics(data_extended, regex, p)
data_extended.loc[data_extended[regex].notnull()].head()

The mean of labels of matching tweets = 0.00
The rate of matching tweets = 0.001
p-value = 0.014


Unnamed: 0,content,label,\bb/c\b
2691,well I need The.Perfect.Dog. b/c the kids want...,0,[b/c]
2799,we lose perspective living around so many fit ...,0,[b/c]
4845,My poor mom works w/ people like that & she's...,0,[b/c]
5787,lol oo no!..guurrrl u must be crazy!!..b/c i ...,0,[b/c]
7489,Hope b/c Im a cynic & I dont xpect much from s...,0,"[b/c, b/c]"


In [49]:
regex = '([^\\w](?:[a-zA-Z] ){4,}(?:[a-zA-Z]\\b)?)'
data_extended = find_all_matches(X_train, y_train, regex)
print_matching_statistics(data_extended, regex, p)
data_extended.loc[data_extended[regex].notnull()]

The mean of labels of matching tweets = 0.17
The rate of matching tweets = 0.000
p-value = 0.412


Unnamed: 0,content,label,"([^\w](?:[a-zA-Z] ){4,}(?:[a-zA-Z]\b)?)"
3063,a c t u a l l y;;i DO want himm<33 [: so you ...,0,[ a c t u a l l y]
6231,L O V E broccoli. cheesey broccoli hahaha,0,[ L O V E ]
9213,O H I O :],0,[ O H I O ]
9908,N E V E R.,0,[ N E V E R]
10622,Do you stroke your pussyr r r r r r r r r r r...,1,[ r r r r r r r r r r r r r r r r r r r r r r ...
11897,A L W A Y S : ],0,[ A L W A Y S ]


Uppercase words are associated with aggressive tweets

In [50]:
regex = '\\b[A-Z]{2,}\\b'
data_extended = find_all_matches(X_train, y_train, regex)
print_matching_statistics(data_extended, regex, p)
data_extended.loc[data_extended[regex].notnull()].head()

The mean of labels of matching tweets = 0.51
The rate of matching tweets = 0.167
p-value = 0.000


Unnamed: 0,content,label,"\b[A-Z]{2,}\b"
6,I'm being gay by using SQL Server,0,[SQL]
13,I feel like such a nerd watching this! I LOVE ...,1,"[LOVE, IT]"
17,that's even better! I'm insured and I hate all...,0,[XD]
25,so do our cops! I got to ride one over break! ...,0,[SO]
29,I've noticed he can't spell much beyond NOM an...,1,"[NOM, BARK]"


In [51]:
# rate of uppercase
regex = '\\b[A-Z]{2,}\\b'
matching_words_rate = compute_matching_words_rate(regex, X_train)
data_extended['rate'] = matching_words_rate

matching_data_extended = data_extended.loc[data_extended.iloc[:, 2].notnull()]
matching_data_extended.head()

Unnamed: 0,content,label,"\b[A-Z]{2,}\b",rate
6,I'm being gay by using SQL Server,0,[SQL],0.142857
13,I feel like such a nerd watching this! I LOVE ...,1,"[LOVE, IT]",0.181818
17,that's even better! I'm insured and I hate all...,0,[XD],0.083333
25,so do our cops! I got to ride one over break! ...,0,[SO],0.055556
29,I've noticed he can't spell much beyond NOM an...,1,"[NOM, BARK]",0.090909


In [52]:
print('%.3f' % matching_data_extended[matching_data_extended.rate > 0.1].label.mean())
print('%.3f' % matching_data_extended[matching_data_extended.rate < 0.1].label.mean())

0.563
0.456


In [84]:
regex = '^[^a-z]{2,}$'
data_extended = find_all_matches(X_train, y_train, regex)
print_matching_statistics(data_extended, regex, p)
data_extended.loc[data_extended[regex].notnull()].head()

The mean of labels of matching tweets = 0.48
The rate of matching tweets = 0.018
p-value = 0.019


Unnamed: 0,content,label,"^[^a-z]{2,}$"
67,NOT AT ALL,0,[ NOT AT ALL]
346,BITCH GET ONLINE I FOUND CUTE SHOES,0,[BITCH GET ONLINE I FOUND CUTE SHOES]
411,THAT WAS THE MOST MOST BEAUTIFUL CHRISTMAS CA...,1,[THAT WAS THE MOST MOST BEAUTIFUL CHRISTMAS C...
425,VAMPIRE WEEKEND!,0,[ VAMPIRE WEEKEND!]
457,YOU STUPID WHORE I HOPE YOU ARE IN YOUR HOUSE ...,1,[YOU STUPID WHORE I HOPE YOU ARE IN YOUR HOUSE...


Words that start with a capital letter are associated with nonaggressive tweets

In [54]:
regex = '\\b[A-Z][a-z]+\\b'
data_extended = find_all_matches(X_train, y_train, regex)
print_matching_statistics(data_extended, regex, p)
data_extended.loc[data_extended[regex].notnull()].head()

The mean of labels of matching tweets = 0.37
The rate of matching tweets = 0.533
p-value = 0.000


Unnamed: 0,content,label,\b[A-Z][a-z]+\b
1,you asian!!! I hate asians!!! Bahahahahhaha!!!...,1,"[Bahahahahhaha, Jk, See, Hahahahaha]"
2,austins new girllfriend is not uglly &amp;&a...,1,[Ha]
3,Not currently.,0,[Not]
5,you won't die in Queens we're civilized here i...,1,"[Queens, Queens, Starbucks]"
6,I'm being gay by using SQL Server,0,[Server]


In [55]:
# rate of words with the first capital
regex = '\\b[A-Z][a-z]+\\b'
matching_words_rate = compute_matching_words_rate(regex, X_train)
data_extended['rate'] = matching_words_rate

matching_data_extended = data_extended.loc[data_extended.iloc[:, 2].notnull()]
matching_data_extended.head()

Unnamed: 0,content,label,\b[A-Z][a-z]+\b,rate
1,you asian!!! I hate asians!!! Bahahahahhaha!!!...,1,"[Bahahahahhaha, Jk, See, Hahahahaha]",0.181818
2,austins new girllfriend is not uglly &amp;&a...,1,[Ha],0.029412
3,Not currently.,0,[Not],0.5
5,you won't die in Queens we're civilized here i...,1,"[Queens, Queens, Starbucks]",0.136364
6,I'm being gay by using SQL Server,0,[Server],0.142857


In [56]:
print('%.3f' % matching_data_extended[matching_data_extended.rate > 0.1].label.mean())
print('%.3f' % matching_data_extended[matching_data_extended.rate < 0.1].label.mean())

0.360
0.398


All-lowercase tweets show no significant association with either class

In [57]:
regex = '^[^A-Z]+$'
data_extended = find_all_matches(X_train, y_train, regex)
print_matching_statistics(data_extended, regex, p)
data_extended.loc[data_extended[regex].notnull()].head()

The mean of labels of matching tweets = 0.40
The rate of matching tweets = 0.303
p-value = 0.555


Unnamed: 0,content,label,^[^A-Z]+$
0,i hate u like miniature dogs hate people dress...,1,[i hate u like miniature dogs hate people dres...
4,thx for the well wishes lisa! i hate taking me...,1,[thx for the well wishes lisa! i hate taking m...
7,re: why friendfeed sucks - no doubt a powerful...,1,[re: why friendfeed sucks - no doubt a powerfu...
12,i twissed you! (twitter-missed yo ass).,0,[i twissed you! (twitter-missed yo ass).]
14,shits gay!,1,[shits gay!]


Repeated letters are associated with nonaggressive tweets

In [58]:
regex = '(([a-zA-Z])\\2{2,})'
data_extended = find_all_matches(X_train, y_train, regex)
print_matching_statistics(data_extended, regex, p)
data_extended.loc[data_extended[regex].notnull()].head()

The mean of labels of matching tweets = 0.35
The rate of matching tweets = 0.059
p-value = 0.005


Unnamed: 0,content,label,"(([a-zA-Z])\2{2,})"
2,austins new girllfriend is not uglly &amp;&a...,1,"[(sss, s), (llll, l), (lll, l)]"
46,hmmm...I had a feeling bout u! but Damn No Lab...,1,"[(mmm, m)]"
55,Yes one of my fave movies... of... alll.... f...,0,"[(lll, l)]"
72,WHATTTTT O________O omg.....nahhhh i hate roll...,1,"[(TTTTT, T), (hhhh, h)]"
87,mmmmmkayyyyy,0,"[(mmmmm, m), (yyyyy, y)]"
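
Each match in the table is shown as a pair because the pattern has two capturing groups, so `re.findall` returns one tuple per match: the whole run and the repeated letter. A minimal standalone check with the standard `re` module (the helper name `find_repeats` is introduced here for illustration and is not part of `myutils`):

```python
import re

# Same pattern as in the cell above: a letter followed by two or more copies of itself.
REPEATED_LETTER = r'(([a-zA-Z])\2{2,})'

def find_repeats(text):
    """Return (run, letter) pairs for letters repeated three or more times in a row."""
    return re.findall(REPEATED_LETTER, text)

print(find_repeats('mmmmmkayyyyy'))
print(find_repeats('WHATTTTT O________O omg.....nahhhh'))
```

Doubled letters such as the `ll` in `girllfriend` are not matched, since the pattern requires at least three identical letters in a row.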


In [59]:
# rate of words with repeating letters
regex = '\\b\\w*(([a-zA-Z])\\2{2,})\\w*\\b'
matching_words_rate = compute_matching_words_rate(regex, X_train)
data_extended['rate'] = matching_words_rate

matching_data_extended = data_extended.loc[data_extended.iloc[:, 2].notnull()]
matching_data_extended.head()

Unnamed: 0,content,label,"(([a-zA-Z])\2{2,})",rate
2,austins new girllfriend is not uglly &amp;&a...,1,"[(sss, s), (llll, l), (lll, l)]",0.088235
46,hmmm...I had a feeling bout u! but Damn No Lab...,1,"[(mmm, m)]",0.083333
55,Yes one of my fave movies... of... alll.... f...,0,"[(lll, l)]",0.1
72,WHATTTTT O________O omg.....nahhhh i hate roll...,1,"[(TTTTT, T), (hhhh, h)]",0.133333
87,mmmmmkayyyyy,0,"[(mmmmm, m), (yyyyy, y)]",1.0


In [60]:
print('%.3f' % matching_data_extended[matching_data_extended.rate > 0.1].label.mean())
print('%.3f' % matching_data_extended[matching_data_extended.rate < 0.1].label.mean())

0.262
0.435
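
The `rate` column comes from `compute_matching_words_rate` in `myutils`, whose source is not shown in this notebook. A minimal sketch of one plausible definition, assuming the rate is the number of regex matches divided by the number of whitespace-separated words (this assumption agrees with displayed values such as 0.5 for `Not currently.` and 1.0 for `mmmmmkayyyyy`, but it is not confirmed by the source; the function name is hypothetical):

```python
import re

def matching_words_rate_sketch(regex, text):
    """Assumed definition: matches of `regex` per whitespace-separated word in `text`."""
    words = text.split()
    if not words:
        return 0.0  # avoid division by zero for empty tweets
    return len(re.findall(regex, text)) / len(words)

print(matching_words_rate_sketch(r'\b[A-Z][a-z]+\b', 'Not currently.'))
print(matching_words_rate_sketch(r'\b\w*(([a-zA-Z])\2{2,})\w*\b', 'mmmmmkayyyyy'))
```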


Laughter is more related to nonaggressive tweets

In [61]:
regex = '(b?w?a?(ha|he)\\2{1,}h?)'
data_extended = find_all_matches(X_train, y_train, regex)
print_matching_statistics(data_extended, regex, p)
data_extended.loc[data_extended[regex].notnull()].head()

The mean of labels of matching tweets = 0.26
The rate of matching tweets = 0.050
p-value = 0.000


Unnamed: 0,content,label,"(b?w?a?(ha|he)\2{1,}h?)"
1,you asian!!! I hate asians!!! Bahahahahhaha!!!...,1,"[(ahahahah, ha), (haha, ha), (ahahahaha, ha)]"
39,Its better for me to get too little. I can g...,0,"[(haha, ha)]"
42,haha damn the ninja thing gave it away,0,"[(haha, ha)]"
44,Yo it's a wack ass night here at boston colleg...,0,"[(haha, ha)]"
69,Watch me fuck up typing on my iPhone... Hahaha,0,"[(ahaha, ha)]"
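
The pattern is applied to the original (not lowercased) text, which is why `Hahaha` is reported as `ahaha`: the capital `H` is skipped. A small sketch of the difference, using `re.IGNORECASE` as one possible alternative (the helper `find_laughs` is a name made up for this illustration):

```python
import re

# Same pattern as in the cell above.
LAUGH = r'(b?w?a?(ha|he)\2{1,}h?)'

def find_laughs(text, ignore_case=False):
    """Return the full laugh strings matched in `text`."""
    flags = re.IGNORECASE if ignore_case else 0
    # findall yields (whole match, repeated syllable) tuples; keep the whole match only
    return [whole for whole, _ in re.findall(LAUGH, text, flags=flags)]

print(find_laughs('Hahaha'))                    # case-sensitive, as in the notebook
print(find_laughs('Hahaha', ignore_case=True))  # capital letter included
```

Whether case-insensitive matching changes the statistics above has not been checked here.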


Stopwords

In [62]:
STOPWORDS = nltk.corpus.stopwords.words('english')
print(STOPWORDS)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [63]:
'n\'t' in STOPWORDS, 'not' in STOPWORDS

(False, True)

In [64]:
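X_train_lower = X_train.apply(str.lower)
IRRELEVANT_STOPWORDS = []
# Note: each stopword is passed as a bare pattern, so short ones seem to match
# inside longer words as well (e.g. 'a' matches 92% of tweets in the output below).
for stopword in np.sort(STOPWORDS):
    print(stopword)
    data_extended = find_all_matches(X_train_lower, y_train, stopword)
    pvalue = compute_binom_pvalue(data_extended, stopword, p)
    print_matching_statistics(data_extended, stopword, p)
    # a stopword is irrelevant if its label mean does not differ from p at the 1% level
    if pvalue >= 0.01:
        IRRELEVANT_STOPWORDS.append(stopword)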

a
The mean of labels of matching tweets = 0.41
The rate of matching tweets = 0.921
p-value = 0.019
about
The mean of labels of matching tweets = 0.37
The rate of matching tweets = 0.037
p-value = 0.224
above
The mean of labels of matching tweets = 0.75
The rate of matching tweets = 0.001
p-value = 0.067
after
The mean of labels of matching tweets = 0.40
The rate of matching tweets = 0.008
p-value = 1.000
again
The mean of labels of matching tweets = 0.39
The rate of matching tweets = 0.010
p-value = 0.782
against
The mean of labels of matching tweets = 0.65
The rate of matching tweets = 0.002
p-value = 0.037
ain
The mean of labels of matching tweets = 0.44
The rate of matching tweets = 0.033
p-value = 0.082
all
The mean of labels of matching tweets = 0.42
The rate of matching tweets = 0.126
p-value = 0.195
am
The mean of labels of matching tweets = 0.43
The rate of matching tweets = 0.224
p-value = 0.001
an
The mean of labels of matching tweets = 0.42
The rate of matching tweets = 0.39

The mean of labels of matching tweets = 0.38
The rate of matching tweets = 0.025
p-value = 0.569
most
The mean of labels of matching tweets = 0.30
The rate of matching tweets = 0.014
p-value = 0.008
mustn
The mean of labels of matching tweets = nan
The rate of matching tweets = 0.000
p-value = 1.000
mustn't
The mean of labels of matching tweets = nan
The rate of matching tweets = 0.000
p-value = 1.000
my
The mean of labels of matching tweets = 0.43
The rate of matching tweets = 0.121
p-value = 0.036
myself
The mean of labels of matching tweets = 0.45
The rate of matching tweets = 0.006
p-value = 0.353
needn
The mean of labels of matching tweets = nan
The rate of matching tweets = 0.000
p-value = 1.000
needn't
The mean of labels of matching tweets = nan
The rate of matching tweets = 0.000
p-value = 1.000
no
The mean of labels of matching tweets = 0.39
The rate of matching tweets = 0.217
p-value = 0.253
nor
The mean of labels of matching tweets = 0.46
The rate of matching tweets = 0.006


The mean of labels of matching tweets = 0.40
The rate of matching tweets = 0.745
p-value = 0.676
you
The mean of labels of matching tweets = 0.39
The rate of matching tweets = 0.337
p-value = 0.171
you'd
The mean of labels of matching tweets = 0.62
The rate of matching tweets = 0.002
p-value = 0.028
you'll
The mean of labels of matching tweets = 0.59
The rate of matching tweets = 0.004
p-value = 0.015
you're
The mean of labels of matching tweets = 0.59
The rate of matching tweets = 0.016
p-value = 0.000
you've
The mean of labels of matching tweets = 0.45
The rate of matching tweets = 0.002
p-value = 0.665
your
The mean of labels of matching tweets = 0.40
The rate of matching tweets = 0.088
p-value = 1.000
yours
The mean of labels of matching tweets = 0.42
The rate of matching tweets = 0.006
p-value = 0.729
yourself
The mean of labels of matching tweets = 0.40
The rate of matching tweets = 0.005
p-value = 1.000
yourselves
The mean of labels of matching tweets = nan
The rate of matching 

In [65]:
print(IRRELEVANT_STOPWORDS)

['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'any', 'aren', "aren't", 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'does', 'doesn', "doesn't", 'doing', 'don', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'haven', "haven't", 'having', 'her', 'here', 'hers', 'herself', 'himself', 'how', 'i', 'if', 'into', 'isn', "isn't", "it's", 'its', 'itself', 'ma', 'me', 'mightn', "mightn't", 'more', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'own', 're', 's', 'same', 'shan', "shan't", "she's", 'should', "should've", 'shouldn', "shouldn't", 'so', 'some', 't', 'than', "that'll", 'the', 'their', 'theirs', 'them', 'themselves', 'there', 'these', 'they', 'those', 'through', 'to', 'too', 'under', 'until', 'very', 'was', '

In [66]:
len(STOPWORDS), len(IRRELEVANT_STOPWORDS)

(179, 145)

In [67]:
print(set(STOPWORDS) - set(IRRELEVANT_STOPWORDS))

{"don't", 'is', 'then', 'm', 'down', 'an', 'are', 'off', 'it', 'have', 've', 'he', 'she', 'do', 'what', "you're", 'at', 'and', 'during', 'as', 'this', 'just', 'that', 'over', 'am', 'up', 'such', "won't", 'll', 'him', 'which', 'most', 'his', 'in'}


In [68]:
df = pd.DataFrame(IRRELEVANT_STOPWORDS)
df.to_csv('./Data/irrelevant_stopwords.csv', index=False, header=False)
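
The stopword loop passes each word to `find_all_matches` as a bare pattern, and the reported match rate of 0.921 for `a` suggests that short stopwords also match inside longer words. A sketch of whole-word matching with the standard `re` module, as one way to check this (the helper `contains_whole_word` is hypothetical and not part of `myutils`; the statistics above were not recomputed with it):

```python
import re

def contains_whole_word(word, text):
    """True if `word` occurs in `text` as a standalone token, not inside another word."""
    # Lookarounds require that no word character touches the match on either side.
    pattern = r'(?<!\w)' + re.escape(word) + r'(?!\w)'
    return re.search(pattern, text) is not None

print(re.search('a', 'thanks pal') is not None)   # bare pattern: matches inside words
print(contains_whole_word('a', 'thanks pal'))      # whole-word pattern: no match
print(contains_whole_word('a', 'what a day'))
print(contains_whole_word("you're", "you're late"))
```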

## Stemming and Lemmatization

In [69]:
import nltk
from nltk.corpus.reader.wordnet import NOUN, VERB, ADJ, ADV

In [70]:
stemmer = nltk.stem.PorterStemmer()
lemmatizer = nltk.stem.WordNetLemmatizer()

In [71]:
train_tagged_corpus = nltk.corpus.brown.tagged_sents()

In [72]:
tagger = nltk.DefaultTagger('X')
for n in range(1, 4):
    tagger = nltk.NgramTagger(n, train_tagged_corpus, backoff=tagger)

In [73]:
pos_to_wordnet_dict = {
    'J': ADJ,
    'R': ADV,
    'N': NOUN,
    'V': VERB
}
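
The tagger is trained on the Brown corpus, where auxiliary verbs carry tags such as `BEDZ` that do not start with `V`, so a first-letter lookup leaves them without a WordNet part of speech. A sketch of a slightly extended mapping, assuming the Brown prefixes `BE`, `DO` and `HV` should be treated as verbs (the function name is hypothetical, and the single-letter constants are the values of the NLTK WordNet constants `ADJ`, `ADV`, `NOUN`, `VERB`):

```python
# Values of nltk.corpus.reader.wordnet ADJ, ADV, NOUN, VERB, spelled out
# so that the sketch runs without NLTK installed.
ADJ, ADV, NOUN, VERB = 'a', 'r', 'n', 'v'

FIRST_LETTER_TO_POS = {'J': ADJ, 'R': ADV, 'N': NOUN, 'V': VERB}
BROWN_AUX_VERB_PREFIXES = ('BE', 'DO', 'HV')  # be / do / have forms in the Brown tagset

def brown_tag_to_wordnet_pos(tag):
    """Map a Brown tag to a WordNet POS, or None if the word should be left unlemmatized."""
    if tag.startswith(BROWN_AUX_VERB_PREFIXES):
        return VERB
    return FIRST_LETTER_TO_POS.get(tag[:1])

for tag in ['NN', 'JJ', 'RB', 'BEDZ', 'PP$', 'X']:
    print(tag, brown_tag_to_wordnet_pos(tag))
```

With such a mapping, a form like `was` would be passed to the lemmatizer as a verb instead of being returned unchanged.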

In [74]:
nltk.help.upenn_tagset('RB')

RB: adverb
    occasionally unabatingly maddeningly adventurously professedly
    stirringly prominently technologically magisterially predominately
    swiftly fiscally pitilessly ...


In [75]:
n = 1359
print(X_train[n], end='\n\n')

print('======== nltk ===============================')
X_tokenized = nltk.word_tokenize(X_train[n])
print(X_tokenized, end='\n\n')

print('======== split ==============================')
X_tokenized = list(filter(None, re.split(r"[^\w'*]", X_train[n])))
X_tokenized = list(map(str.lower, X_tokenized))
print(X_tokenized, end='\n\n')

tagged_words = tagger.tag(X_tokenized)
print(tagged_words, end='\n\n')

lem_w = [lemmatizer.lemmatize(word, pos=pos_to_wordnet_dict[pos[0]])
          if pos[0] in pos_to_wordnet_dict else word
         for word, pos in tagged_words]
stem_w = [stemmer.stem(word) for word in X_tokenized]
print('{0:<10s} {1:s}'.format('== LEM ==', '== STEM =='))
for words in zip(lem_w, stem_w):
    print('{0:<10s} {1:s}'.format(*words))

after my surgery mine was anywhere from 45-60% of a normal sized bladder. My life sucks.

['after', 'my', 'surgery', 'mine', 'was', 'anywhere', 'from', '45-60', '%', 'of', 'a', 'normal', 'sized', 'bladder', '.', 'My', 'life', 'sucks', '.']

['after', 'my', 'surgery', 'mine', 'was', 'anywhere', 'from', '45', '60', 'of', 'a', 'normal', 'sized', 'bladder', 'my', 'life', 'sucks']

[('after', 'IN'), ('my', 'PP$'), ('surgery', 'NN'), ('mine', 'NN'), ('was', 'BEDZ'), ('anywhere', 'RB'), ('from', 'IN'), ('45', 'CD'), ('60', 'CD'), ('of', 'IN'), ('a', 'AT'), ('normal', 'JJ'), ('sized', 'JJ'), ('bladder', 'X'), ('my', 'PP$'), ('life', 'NN'), ('sucks', 'X')]

== LEM ==  == STEM ==
after      after
my         my
surgery    surgeri
mine       mine
was        wa
anywhere   anywher
from       from
45         45
60         60
of         of
a          a
normal     normal
sized      size
bladder    bladder
my         my
life       life
sucks      suck


## Sentences

In [16]:
from scipy import stats

Single-sentence tweets are more often nonaggressive (about 35% aggressive, versus roughly 44-46% for tweets with 2-4 sentences)

In [17]:
sents_count = np.array([len(nltk.sent_tokenize(doc)) for doc in X_train])

In [87]:
for i in range(1, 11):
    aggressive_tweets_no = (sents_count[y_train == 1] == i).sum()
    nonaggressive_tweets_no = (sents_count[y_train == 0] == i).sum()
    
    successes_no = aggressive_tweets_no
    trials_no = aggressive_tweets_no + nonaggressive_tweets_no
    # stats.binomtest replaces stats.binom_test, which newer SciPy releases no longer provide.
    # The baseline rate is hard-coded as 0.39 here; p = 0.40 is used elsewhere in the notebook.
    pvalue = stats.binomtest(k=successes_no, n=trials_no, p=0.39).pvalue
    
    print('No. of sentences: {0:<5} No. of aggressive tweets: {1:<5} No. of nonaggressive tweets: {2:<5} p-value {3:.3f}'
          .format(i, aggressive_tweets_no, nonaggressive_tweets_no, pvalue))

No. of sentences: 1     No. of aggressive tweets: 2139  No. of nonaggressive tweets: 3975  p-value 0.000
No. of sentences: 2     No. of aggressive tweets: 1691  No. of nonaggressive tweets: 2181  p-value 0.000
No. of sentences: 3     No. of aggressive tweets: 835   No. of nonaggressive tweets: 1026  p-value 0.000
No. of sentences: 4     No. of aggressive tweets: 296   No. of nonaggressive tweets: 353   p-value 0.001
No. of sentences: 5     No. of aggressive tweets: 87    No. of nonaggressive tweets: 111   p-value 0.166
No. of sentences: 6     No. of aggressive tweets: 15    No. of nonaggressive tweets: 45    p-value 0.033
No. of sentences: 7     No. of aggressive tweets: 3     No. of nonaggressive tweets: 6     p-value 1.000
No. of sentences: 8     No. of aggressive tweets: 2     No. of nonaggressive tweets: 4     p-value 1.000
No. of sentences: 9     No. of aggressive tweets: 0     No. of nonaggressive tweets: 2     p-value 0.524
No. of sentences: 10    No. of aggressive tweets: 2     No. of no
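
The p-values in the table come from the two-sided exact binomial test in `scipy.stats`. A dependency-free sketch of that test, assuming the usual definition (sum the probabilities of all outcomes that are no more likely than the observed one); the function names are introduced here for illustration and the small relative tolerance is an implementation choice:

```python
from math import exp, lgamma, log

def binom_pmf(k, n, p):
    """Binomial probability of k successes in n trials, computed in log space (0 < p < 1)."""
    log_coeff = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_coeff + k * log(p) + (n - k) * log(1 - p))

def two_sided_binom_pvalue(successes, trials, p):
    """Sum the probabilities of all outcomes at most as likely as the observed one."""
    threshold = binom_pmf(successes, trials, p) * (1 + 1e-7)  # tolerance for float ties
    total = sum(pmf for pmf in (binom_pmf(k, trials, p) for k in range(trials + 1))
                if pmf <= threshold)
    return min(1.0, total)

# Rows with 9 and 7 sentences from the table above, with the same baseline of 0.39
print('%.3f' % two_sided_binom_pvalue(0, 2, 0.39))
print('%.3f' % two_sided_binom_pvalue(3, 9, 0.39))
```

With only a handful of tweets per row, as for 7 or more sentences, the test has little power, so the large p-values there say little about the true rate.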

Similarity to aggressive tweets

In [6]:
from mytextpreprocessing import SimilarityToAggressiveTweets

In [7]:
texts = [
    'Thanks!',
    'she is way way better than your nasty asss. Bahahahahhaha!!! aggress that kristens pretty tighht',
    'i hate u',
]

In [8]:
print(X_train[0], y_train[0])
print(X_train[1], y_train[1])
print(X_train[2], y_train[2])
print(X_train[3], y_train[3])

i hate u like miniature dogs hate people dressing them in t-shirts and little booties. 1
you asian!!! I hate asians!!! Bahahahahhaha!!! Jk! I am well on my way to getting fucked up!!! See you next year!!! Hahahahaha!! 1
 austins new girllfriend is not uglly  &amp;&amp; she is way way better than your nasty asss willll ever be. Ha evven youur so callled bestfriend &quot;katellyn&quot; aggress that kristens pretty tighht. 1
  Not currently. 0


In [9]:
sim = SimilarityToAggressiveTweets()
sim.fit(X_train, y_train);

In [26]:
import os
import pickle
models_path = './Models'
os.makedirs(models_path, exist_ok=True)
with open(os.path.join(models_path, 'similarity.p'), 'wb') as file:
    pickle.dump(sim, file, protocol=4)

In [10]:
sim.transform(texts)

array([0.37796447, 0.52704628, 1.        ])

In [11]:
sim.transform(X_train[:4])

array([1.        , 0.81649658, 0.48507125, 0.40824829])

In [12]:
aggressive_sim = sim.transform(X_train[y_train == 1][:100])

In [18]:
# number of aggressive tweets (among the first 100) that have exactly one sentence
(sents_count[y_train == 1][:100] == 1).sum()

39

In [19]:
(aggressive_sim == 1).sum()

42

In [20]:
(aggressive_sim >= 0.5).sum()

70

In [21]:
nonaggressive_sim = sim.transform(X_train[y_train == 0][:100])

In [22]:
(nonaggressive_sim < 0.5).sum()

65