# Final project

Written for Python 2

In [613]:
import pandas as pd
import numpy as np

from sklearn.base import BaseEstimator, TransformerMixin

# Counting/transforming
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

# Classifiers
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Metrics
from sklearn.metrics import roc_auc_score
from sklearn.metrics import classification_report
from sklearn.metrics import precision_recall_fscore_support
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

# Loading and labeling the data

In [614]:
# load data
path = '~/AI/final_project/data/'
comments = pd.read_csv(path + 'attack_annotated_comments.tsv', sep = '\t', index_col = 0)
annotations = pd.read_csv(path + 'attack_annotations.tsv',  sep = '\t')

### Find attack rating threshold for split

**Final threshold: > 0.5 mean rating**

In [615]:
# average the annotators' attack ratings per comment (rev_id), then read comments to determine an appropriate threshold for what is an "attack"
meanattack = annotations.groupby('rev_id')['attack'].mean()
comments['attack_score'] = meanattack
pd.set_option('display.max_colwidth', -1)
comments.query('attack_score > 0.3')[['comment', 'attack_score']].head(30)

rev_id,comment,attack_score
89320,"Next, maybe you could work on being less condescending with your suggestions about reading the naming conventions and FDL, both of which I read quite a while ago, thanks. I really liked the bit where you were explaining why you had no interest in fixing things I complained about because you felt insulted, yet you were being extremely insulting at the time. With any luck, you can learn to be less of a jerk. GregLindahlNEWLINE_TOKENNEWLINE_TOKENNEWLINE_TOKENNEWLINE_TOKENNEWLINE_TOKENNEWLINE_TOKEN",0.444444
694840,"`NEWLINE_TOKENNEWLINE_TOKEN:Click on my ``Annoying Users`` link! I guess Massachusetts is a republic, then? Defending facts against the lies of determined zealots seems to be a particular sticking-place for Wikipedia ( 02:58 Feb 21, 2003 (UTC)`",0.368421
801279,Iraq is not good ===NEWLINE_TOKENNEWLINE_TOKEN=== USA is bad NEWLINE_TOKENNEWLINE_TOKEN,0.555556
1240138,"AnonNEWLINE_TOKENNEWLINE_TOKEN:What the heck are you talking about? This is an encyclopedia, not a book store.",0.333333
1319467,"`GOD BLESS OUR FREEDOM FIGHTING TROOPS DEFEATING ARAB TERRORISM IN IRAQ AND AFGHANISTAN. NEWLINE_TOKENNEWLINE_TOKENNEWLINE_TOKENNEWLINE_TOKENNEWLINE_TOKENDEFEAT totalitarian IslamofascismNEWLINE_TOKENNEWLINE_TOKEN==Our commander in chief liberated Iraq and Afganistan. But there's much left to TAKE OUT and LIBERATE. These are the regimes that we need to effect a regime change sooner or later:==NEWLINE_TOKENNEWLINE_TOKENMost urgent:NEWLINE_TOKENNEWLINE_TOKENIran,NEWLINE_TOKENSyria,NEWLINE_TOKENCommunist China,NEWLINE_TOKENNorth Korea,NEWLINE_TOKENSaudi Arabia,NEWLINE_TOKENCuba,NEWLINE_TOKENLiberia,NEWLINE_TOKENLaos,NEWLINE_TOKENLebanon,NEWLINE_TOKENVietnam,NEWLINE_TOKENLibya,NEWLINE_TOKENZimbabwe,NEWLINE_TOKENBurma,NEWLINE_TOKENMalaysia,NEWLINE_TOKENBelarus,NEWLINE_TOKENVenezuela,NEWLINE_TOKENMoldova (elected an actual Communist Party member)NEWLINE_TOKENPalestinian AuthorityNEWLINE_TOKENNEWLINE_TOKENLess urgent, but necessary one day:NEWLINE_TOKENNEWLINE_TOKENBrazil (I laid out a case for regime change in Brazil on another user's page. 
If you're interested, I can direct you to that page.),NEWLINE_TOKENEcuador (elected a Communist),NEWLINE_TOKENArgentina (elected a Communist sympathizer),NEWLINE_TOKENAlgeria,NEWLINE_TOKENMorocco,NEWLINE_TOKENTunisia,NEWLINE_TOKENEgypt,NEWLINE_TOKENPakistan (the military dictator they have is ok to keep a lid on the Islamists, but he'll fall one day so we might as well occupy them),NEWLINE_TOKENQatar (Al Jazeria is based there spewing its anti-American propaganda),NEWLINE_TOKENTunisia,NEWLINE_TOKENand every other country that is not a democracy or is a democracy that elects anti-democratic leftists like Hugo ChavezNEWLINE_TOKENNEWLINE_TOKENCountries we should destabilize, but not necessarily prusue regime change through military means:NEWLINE_TOKENNEWLINE_TOKENFrance,NEWLINE_TOKENGermany,NEWLINE_TOKENBelgium,NEWLINE_TOKENCanada,NEWLINE_TOKENMexico,NEWLINE_TOKENSouth Africa,NEWLINE_TOKENTurkey,NEWLINE_TOKENRussia,NEWLINE_TOKENSweden,NEWLINE_TOKENand more...NEWLINE_TOKENNEWLINE_TOKEN==Countries WHERE FREEDOM RINGS==NEWLINE_TOKENNEWLINE_TOKENAMERICA (NUMBER ONE ON THIS ACCOUNT),NEWLINE_TOKENUK,NEWLINE_TOKENAustralia,NEWLINE_TOKENIsrael,NEWLINE_TOKENItaly ,NEWLINE_TOKENSpain,NEWLINE_TOKENPoland,NEWLINE_TOKENLatvia,NEWLINE_TOKENLithuania,NEWLINE_TOKENEstonia,NEWLINE_TOKENCzech Republic,NEWLINE_TOKENand more...NEWLINE_TOKENNEWLINE_TOKENNEWLINE_TOKENNEWLINE_TOKENAbove: me at a pro-Iraq War rally.NEWLINE_TOKENNEWLINE_TOKEN==COLLEGE REPUBLICANS MAKE A DIFFERENCE!!!!!!!!!!==NEWLINE_TOKENhttp://www.crnc.org/NEWLINE_TOKENNEWLINE_TOKENNEWLINE_TOKENNEWLINE_TOKEN==DO YOUR PART AMERICA: BOYCOTT FRANCE AND FRENCH GOODS.==NEWLINE_TOKENNEWLINE_TOKENNEWLINE_TOKENAbove: THE OLD YELLER FRENCH ARMY KNIFE.NEWLINE_TOKENNEWLINE_TOKENNEWLINE_TOKENEvil exists. And militant Islamism (the militant Islamism of bin Laden, the Saudis, Saddam Hussein, the Baathists, and the Palestinian suicide bombers) or Islamofascism is the enemy of freedom and the distilled essence of evil. 
Totalitarian ideologies and fanaticisms have come in gone and have been defeated by America. All these ideologies are one in the same. They hate modernity, hate America, hate freedom, hate capitalism, hate liberal democracy, and love terrorism, oppression, genoicide, and fanatic hatred. In Germany the tyranical enemies of freedom and capitalism rallied behind Nazism, in Italy they rallied behind fascism, in Russia they rallied behind totalitarian socialism and communism, and now in the Middle East, where a lot of dictators and tyrants are threatened by freedom and American values, they rally behind militant Islamics. It is fact that there isnt a single Arab democracy. Muslim leaders (Saddam was just the worst of the lot. There will be more dictators/terrorists to fight like the Syrians) are all tyrants and terrorists who stifle the free press, kill their own people, crush their citizens hopes and dreams, and want to kill Americans like they did on 9/11. Their desire to kill Americans and supprot terror rests on one deep, abiding hatred: their irrational fear of America, which sticks up for freedom and opposes their tyranny with great scarafices, like America is doing right now defeating evil in the Arab countries of Iraq and Afghnaistan. NEWLINE_TOKENNEWLINE_TOKENThe ideology of militant Islamist terrorism is the totalitarian enemy that America confronts today. And patriotic Americans say it will be defeated like America defeated totalitarianisms in the past through heroic struggle: Communism, fascism, Nazism. NEWLINE_TOKENNEWLINE_TOKENA lot of conservative commentators who speak with moral clarity call America's struggle against the evil of Islamofascist totalitarianism right now World War IV. That this is freedom's fourth struggle against a totalitarian evil. 
In WWI it was the despotic rule of the Kaiser, in WWII it was the Nazis, in freedom's third struggle it was the communists in the Cold War (although it wasnt a ``hot war`` it was another global stuggle like a world war). Now America's forth stuggle is a worldwide campaign against states like Iraq that hate the free world, kill their own people, desire weapons of mass destruction, and support terrorism. NEWLINE_TOKENNEWLINE_TOKENThe antiwar liberal left appeases totalitarian evil, which they love to do. They rallied to defend the Communists in Vietnam. Now the amoral liberal left is opposing America's commander in chief George W Bush in his struggle in Iraq. Hundreds of millions were slaughtered (Communism murdered 100 million people while the liberals opposed the Cold War at every step) and Communism threatened the freedom of America and her allies. Conservatives say that America must stop this new totalitarian enemy before its murderous hate claims as many victims as Communism. NEWLINE_TOKENNEWLINE_TOKENNEWLINE_TOKEN`",0.4
1450441,`NEWLINE_TOKENNEWLINE_TOKENBuddha - ``Some suggest that victims should be referred to additional resources and made aware of their rights under policy and law.``NEWLINE_TOKENNEWLINE_TOKENThat's an incredibly silly way to phrase that. `,0.4
1502668,"BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, 
BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS, BOOBS",0.333333
2702703,"NEWLINE_TOKENNEWLINE_TOKEN____NEWLINE_TOKENfuck off you little asshole. If you want to talk to me as a human start showing some fear the way humans act around other humans, because if you continue your beligerant campaign, i will cross another boundary and begin off-site recruitmehnt. NEWLINE_TOKENNEWLINE_TOKENI can escalate till I am rhetorically nuclear with the whole goddamed mob of you if that is where you think you will find what you want. NEWLINE_TOKENNEWLINE_TOKENYou had better start expressing some interest in the concerns presented to you or your credibility as either a document or a community will be about that of a pile of shit.",1.0
3897107,"`===Plautus satire's irresponsible comments are those of a crackpot.===NEWLINE_TOKENNEWLINE_TOKENPlautus satire's comments are idiotic nonsense. I have refereed papers for five mathematics journals, and my published papers have been refereed (and some unpublished ones that I submitted). Obviously history is consulted; obviously proof-checking is not the main point of refereeing; obviously novel ideas are examined in the context of accepted notions. To say that mathematics is not a science but a ``language`` ignores the fact that hundreds of journals are devoted to publishing new discoveries in mathematics. Obviously in judging publication-worthiness one considers how new discoveries may be relevant to potential future research; one considers esthetics (which for most mathematicians is the main motive for doing mathematics or for learning mathematics). Whence this loony idea that there is no need to consult history? Do you not see sections on how a new discovery fits into the historical development of the subject in many research papers in mathematics?NEWLINE_TOKENNEWLINE_TOKENPlautus satire's comments are those of a crackpot. NEWLINE_TOKENNEWLINE_TOKEN`",0.388889
4619883,"`NEWLINE_TOKENNEWLINE_TOKEN*Paragraph one: if you are uncertain as to whether Lincoln was an important spokesman for the American System, why on earth would you be editing this article?NEWLINE_TOKENNEWLINE_TOKEN*Paragraph two: I used the qualifier ``arguably,`` and I think it is justified to discuss the track record of the American System approach, which is now virtually unknown in its country of origin, to the track record of the other two options, which are normally the only ones discussed. Feel free to cite a success story for Marxism or Laissez-Faire. However, I agree that this paragraph might be better couched in a ``proponents of the American system assert a, and opponents respond with b`` format. I won't attempt to edit it, however, until we get arbitration, because you, Andy, and your cohorts, are in the ``revert, don't debate`` mode.NEWLINE_TOKENNEWLINE_TOKEN*Paragraph three: simple statement of fact. Can you name another outspoken proponent of the American System, other than LaRouche? Remember, your dislike or LaRouche, or of the American System, is not at issue. And, just out of curiosity, what is your gripe against Sun Yat-Sen? `",0.333333


In [616]:
# labels a comment as an attack if a strict majority of annotators did so
labels = annotations.groupby('rev_id')['attack'].mean() > 0.5

# join labels and comments
comments['attack'] = labels
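
The labeling rule can be made concrete with a toy example. The sketch below uses plain Python and made-up annotations (the `rev_id` values and votes are illustrative, not taken from the dataset). It shows that a comment is labeled an attack only when strictly more than half of its annotators flagged it, so a 50/50 tie counts as not an attack.

```python
from collections import defaultdict

# Hypothetical (rev_id, attack) annotation pairs; not real data.
toy_annotations = [
    (1, 1), (1, 1), (1, 0),   # 2 of 3 annotators flagged it
    (2, 1), (2, 0),           # tie
    (3, 0), (3, 0), (3, 0),   # nobody flagged it
]

votes = defaultdict(list)
for rev_id, attack in toy_annotations:
    votes[rev_id].append(attack)

# Mean attack score per comment, then the > 0.5 threshold.
scores = {rev_id: sum(v) / float(len(v)) for rev_id, v in votes.items()}
toy_labels = {rev_id: score > 0.5 for rev_id, score in scores.items()}

print(toy_labels)
```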

### Check if other comment features are useful

(Here, "useful" means predictive of whether a comment is an attack.)

* **Useful features:** 'logged_in', 'ns', 'sample'
* **Not useful:** 'year'

In [617]:
# Comparing 'logged_in'
print "Attacks:\n",comments.query('attack')['logged_in'].value_counts(),"\n"
print "Not attacks:\n",comments.query('not attack')['logged_in'].value_counts()

Attacks:
False    7635
True     5955
Name: logged_in, dtype: int64 

Not attacks:
True     78963
False    23311
Name: logged_in, dtype: int64


In [618]:
# Comparing 'year'
print "Attacks:\n",comments.query('attack')['year'].value_counts(),"\n"
print "Not attacks:\n",comments.query('not attack')['year'].value_counts()

Attacks:
2008    2239
2006    1997
2007    1894
2009    1683
2010    1301
2011    1035
2012    824 
2015    775 
2014    688 
2013    661 
2005    445 
2016    36  
2004    11  
2003    1   
Name: year, dtype: int64 

Not attacks:
2006    15379
2007    14543
2008    14402
2009    10891
2010    9440 
2011    7609 
2012    6727 
2014    5967 
2015    5922 
2013    5788 
2005    4520 
2004    629  
2016    255  
2003    147  
2002    48   
2001    7    
Name: year, dtype: int64


In [619]:
# Comparing 'ns'
print "Attacks:\n",comments.query('attack')['ns'].value_counts(),"\n"
print "Not attacks:\n",comments.query('not attack')['ns'].value_counts()

Attacks:
user       11341
article    2249 
Name: ns, dtype: int64 

Not attacks:
user       53206
article    49068
Name: ns, dtype: int64


In [620]:
# Comparing 'sample'
print "Attacks:\n",comments.query('attack')['sample'].value_counts(),"\n"
print "Not attacks:\n",comments.query('not attack')['sample'].value_counts()

Attacks:
blocked    13246
random     344  
Name: sample, dtype: int64 

Not attacks:
blocked    65126
random     37148
Name: sample, dtype: int64
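
Raw counts are hard to compare because there are far more non-attacks than attacks. One way to read the outputs above is to convert them into the share of comments in each group that are attacks. The sketch below does that arithmetic with the counts copied from the `value_counts()` outputs; the dictionary layout and variable names are illustrative.

```python
# (attacks, non-attacks) copied from the value_counts() outputs above.
counts = {
    ('logged_in', 'False'): (7635, 23311),
    ('logged_in', 'True'): (5955, 78963),
    ('ns', 'user'): (11341, 53206),
    ('ns', 'article'): (2249, 49068),
    ('sample', 'blocked'): (13246, 65126),
    ('sample', 'random'): (344, 37148),
}

attack_rate = {}
for key, (attacks, non_attacks) in sorted(counts.items()):
    attack_rate[key] = attacks / float(attacks + non_attacks)
    print("%-10s %-8s %.1f%% attacks" % (key[0], key[1], 100 * attack_rate[key]))
```

Read this way, roughly 25% of logged-out comments are attacks versus about 7% of logged-in ones, about 18% in the user namespace versus about 4% in the article namespace, and about 17% in the blocked sample versus under 1% in the random sample.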


In [621]:
# looking at size of splits
comments['split'].value_counts()

train    69526
test     23178
dev      23160
Name: split, dtype: int64

# Cleaning the dataset

### Methods included:
* removing newlines and tabs
* removing punctuation
* differentiating between punctuation that mostly occurs in the middle of a word vs. at the end
* converting all text to lowercase

### Methods tried, but not included:
* excluding stop words
* `analyzer` set to 'char_wb' instead of 'word' (character n-grams within word boundaries)
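
For reference, `'char_wb'` is a value of `CountVectorizer`'s `analyzer` parameter: it builds character n-grams that never cross a word boundary, padding each word with a space on both sides. The function below is a simplified plain-Python sketch of that idea for a single n-gram length. `char_wb_ngrams` is a hypothetical helper, not scikit-learn's implementation.

```python
def char_wb_ngrams(text, n):
    """Character n-grams that never span a word boundary (simplified)."""
    ngrams = []
    for word in text.split():
        padded = " " + word + " "          # mark the word edges
        if len(padded) <= n:
            ngrams.append(padded)          # short word: keep it whole
            continue
        for i in range(len(padded) - n + 1):
            ngrams.append(padded[i:i + n])
    return ngrams

print(char_wb_ngrams("you are", 3))
```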

In [622]:
# remove newline and tab tokens
comments['comment'] = comments['comment'].apply(lambda x: x.replace("NEWLINE_TOKEN", " "))
comments['comment'] = comments['comment'].apply(lambda x: x.replace("TAB_TOKEN", " "))

# replace with a space (these mostly occur at the end of a word or between words)
punctuation = ":`.=,-;?!)(&}{[]|<>"
for char in punctuation:
    comments['comment'] = comments['comment'].apply(lambda x: x.replace(char, " "))

# replace with nothing (because they generally appear in the middle of a word, e.g., you're -> youre)
punctuation = "*'@#$%"
for char in punctuation:
    comments['comment'] = comments['comment'].apply(lambda x: x.replace(char, ""))

comments['comment'] = comments['comment'].apply(lambda x: x.lower())
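
The same cleaning steps can be tried on a single string. The sketch below wraps them in a hypothetical `clean_comment` helper (the name and the two constants are illustrative, not part of the notebook) so the effect on one comment is easy to inspect.

```python
SPACE_CHARS = ":`.=,-;?!)(&}{[]|<>"   # replaced with a space
DROP_CHARS = "*'@#$%"                  # removed outright

def clean_comment(text):
    """Apply the notebook's cleaning steps to one comment string."""
    text = text.replace("NEWLINE_TOKEN", " ").replace("TAB_TOKEN", " ")
    for char in SPACE_CHARS:
        text = text.replace(char, " ")
    for char in DROP_CHARS:
        text = text.replace(char, "")
    return text.lower()

print(clean_comment("NEWLINE_TOKENYou're wrong, Anon!").split())
```

Runs of spaces are left in place because the default word tokenizer in `CountVectorizer` ignores whitespace anyway.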

In [623]:
# check that data looks cleaned
comments

rev_id,comment,year,logged_in,ns,sample,split,attack_score,attack
37675,this is not creative those are the dictionary definitions of the terms insurance and ensurance as properly applied to destruction if you dont understand that fine legitimate criticism ill write up three man cell and bounty hunter and then it will be easy to understand why ensured and insured are different and why both differ from assured the sentence you quote is absolutely neutral you just arent familiar with the underlying theory of strike back e g submarines as employed in nuclear warfare guiding the insurance nor likely the three man cell structure that kept the ira from being broken by the british if thats my fault fine i can fix that to explain but theres nothing personal or creative about it im tired of arguing with you re the other article multi party turns up plenty and there is more use of mutually than mutual if i were to apply your standard id be moving mutual assured destruction to talk for not appealing to a reagan voters biases about its effectiveness and for dropping the ly there is a double standard in your edits if it comes from some us history book like peace movement or m a d as defined in 1950 you like it even if the definition is totally useless in 2002 and only of historical interest if it makes any even obvious connection or implication from the language chosen in multiple profession specific terms you consider it somehow non neutral gandhi thinks eye for an eye describes riots death penalty and war all at once but you dont what do you know that gandhi doesnt guess what reality is not neutral current use of terms is slightly more controversial neutrality requires negotiation and some willingness to learn this is your problem not mine you may dislike the writing fine that can be fixed but disregarding fundamental axioms of philosphy with names that recur in multiple phrases or failing to make critical distinctions like insurance versus assurance versus ensurance which are made in one quote by an air force general in an in context quote 
is just a disservice to the reader if someone comes here to research a topic like mad they want some context beyond history if this is a history book fine its a history book but that wasnt what it was claimed to be,2002,False,article,random,train,0.000000,False
44816,the term standard model is itself less npov than i think wed prefer if its new age speak then a lot of old age people speak it karl popper the pope etc heres karl poppers view of this the clearest title for this article would be particle physics cosmology but as i say that would require broader treatment of issues like the anthropic principle cognitive bias beyond the particle physics zoo etc as to accelerators its clear that while they are in use someone is still looking for particles so this is not yet a settled cosmology so certain that we abandon the search nor is it an arbitrary foundation ontology as you suggest not subject to question,2002,False,article,random,train,0.000000,False
49851,true or false the situation as of march 2002 was such a saudi proposal of land for peace and recognition by all arab countries was made the day the proposal was to be made formal by the arab league was the day the israelis under the command of ariel sharon began the invasion of the palestinian self rule areas user arab,2002,False,article,random,train,0.000000,False
89320,next maybe you could work on being less condescending with your suggestions about reading the naming conventions and fdl both of which i read quite a while ago thanks i really liked the bit where you were explaining why you had no interest in fixing things i complained about because you felt insulted yet you were being extremely insulting at the time with any luck you can learn to be less of a jerk greglindahl,2002,True,article,random,dev,0.444444,False
93890,this page will need disambiguation,2002,True,article,random,train,0.000000,False
102817,important note for all sysops there is a bug in the administrative move feature that truncates the moved history and changes the edit times please do not use this feature until this bug is fixed more information can be found in the talk of and thank you,2002,True,user,random,train,0.000000,False
103624,i removed the following all names of early polish rulers are ficticious and therefore this index naming oda von haldensleben and her husband dagome records for the first time rulers of the polanen tribe therefore it is indicated as being the first document of the later developing land named poland this is quite a comment all names are fictitious it deserves at least some backing,2002,True,article,random,train,0.000000,False
111032,if you ever claimed in a judaic studies program that ultra orthodox jews dont have rabbis or dont have synagogues you would be laughed out of the room i am beginning to see the problem you have you see you do not know how to read what other people say without attaching your personal bias to it never once did i say that ultra orthodox jews have no rabbis or synagogues i did say that the role of the rabbi and synagogue in ultra orthodox judaism is minimal when compared with say conservative judaism they are not clergy in the traditional western sense of the word that is a fact as for synagogues they exist but they are not essential minyan a quorum of ten adult males in ultra orthodox law is essential you can have a minyan without a synagogue but a synagogue without a minyan is an empty building you can laugh all you want but it doesnt change the facts it may seem strange to you but take the statement to anyone who actually knows something i want to know who laughs then if you ever claimed in a recognized judaica studies program that a significant number of ultra orthodox rabbis accept and follow modern orthodox responsa instead of their own people would look at you as if you had two heads also a silly point if you were to draw such sharp distinctions between ultra orthodox and modern orthodox judaism in a recognized judaic studies program they would not know what you are talking about a that statement is false because responsa in orthodoxy does not work on the basis of someones synagogue affiliation b what is a modern orthodox responsa versus an ultra orthodox responsa c in cases where things like hashgacha kosher certification etc are debated you will find in most cases that it is not a question of responsa if you know something about responsa literature in general which apparently you do not there are degrees of acceptance for example halav yisrael milk under rabbinical supervision during milking to ensure that it comes from a cow most modern orthodox jews 
do not insist on it t on the basis of a responsa by rabbi moshe feinstein who determined that american government supervision is sufficient most ultra orthodox jews do insist on it they do not reject the responsa in fact moshe feinstein reb moshe as he was called was considered the leading halachic authority for the american ultra orthodox community in the past fifty years they will say that the ruling is right but they want to be machmir more strict on themselves meaning more pious it is not a rejection it is simply being stricter ovadya yoseph said women could wear pants instead of skirts would any of his daughters or daughters in law be caught dead in pants no way they are machmir you just cant make this stuff up danny i dunno i seem to be giving answers to everything you say and my answers are based on sources i am not making things up i am simply stating things that no matter how inconceivable it may be you do not know stop trying to re make the ultra orthodox in your own image its not good history frankly i dont care what is on your resume it doesnt justify writing such nonsense rk personal attacks i will say that your intellectual arrogance coupled with your apparent ignorance do not put you in a very good light in my professional life i have researched and written extensively on this field developed relationships both working and personal with many of the people involved and worked on several important documentary films on the topic i think i have a sense of what npov is and a statement that the ultra orthodox hate the non ultra orthodox just doesnt make the mark is beating the shit out of non orthodox jews an act of love is calling them worse than hitler an act of love is accusing them of creating the holocaust an act of tolerance danny you can protest all you like but you obviously so immersed in an ultra orthodox worldview that you cant see the forest for the trees this is violent hatespeech and it frightens me to see an educuated person making 
apologetics for it rk then it must frighten you even more knowing that i was actually there and on one occasion actually hit why because i do not feel the same antagonism that you do because i see some distasteful remarks in a certain context you say in your diatribe that you want everyone to sit together and sing a shlomo carlebach niggun just answer me this though would reb shloime have davened in a shul without a mehitza oh and haredi does not mean trembling as you originally wrote at the top of the article in fact that was kind of funny you were apparently confused with the film trembling before god which,2002,True,article,random,dev,0.000000,False
120283,my apologies im english i watch cricket i know nothing maybe i was thinking of the time he spent in the army or maybe i was thinking of elvis or something im glad the page got improved,2002,True,article,random,dev,0.000000,False
128532,someone wrote more recognizable perhaps is a type of what is generally called rock and roll called folk rock or simply folk which included performers such as joan baez bob dylan simon and garfunkel the mamas and the papas and many others ive tried to clarify this folk rock is used very specifically and is typically far more recognised by instrumentation than form many folk musicians of the 60s tom paxton phil ochs etc sang new topical material which distinguished them from traditional folk musicians but in the folk idiom acoustic instruments traditional arrangements and often traditional melodies re the comment about marketers in the first paragraph if language reflects common usage what is now called folk music has as much right to the name as any other form gareth owen to the latter fair enough but does the first paragraph actually imply otherwise lms i like the page in general but wonder if the following is unnecessarily cynical implying as it does a financial rather than artistic incentive to change musical styles some of these performers of which joan baez is an excellent example began their commercial music careers performing traditional music in a traditional idiom but soon transformed their style and accompaniment to suit popular tastes ya know i agree but i dont know how to change it right off anyone else want to give it a stab lms the deletions are merely of things that seemed redundant additions may solve the problem of tone mentioned above one bit of the original puzzles me so i corrected the grammar but left it inbut what does unrecognizable to its source actually mean i like the new additionslots of good new information here i added some more the problem now is that the article is rambling and disorganized and i am probably not the best person to organize and clarify it btw using the word purist without the quotes makes it sound as if the authors of the article are not purists which we dont want to imply see neutral point of view lms perhaps 
someone who knows the facts could add in skiffle music from whence the beatles sprang which was evidently a british folk form in the 1950s certainly the beatles stole er utilised many folk forms in their music,2002,True,article,random,train,0.000000,False


In [624]:
# combine all features for classifier later
comments['stats'] = comments.apply(lambda row: {'logged_in': str(row['logged_in']), 
                                                 'ns': row['ns'],
                                                 'sample': row['sample']}, axis=1)
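
Each `stats` value is a dict of strings (note the `str()` around `logged_in`), which makes the three features categorical. One common way to hand such dicts to a classifier is one-hot encoding with `key=value` feature names, the naming scheme scikit-learn's `DictVectorizer` uses for string values. The plain-Python sketch below only illustrates the encoding. `one_hot` is a hypothetical helper, and it is an assumption that the notebook combines the features this way.

```python
def one_hot(stats):
    """Turn {'ns': 'user'} into {'ns=user': 1} (illustrative helper)."""
    return {"%s=%s" % (key, value): 1 for key, value in stats.items()}

# Example row in the same shape as comments['stats'].
row = {'logged_in': 'True', 'ns': 'user', 'sample': 'blocked'}
encoded = one_hot(row)

print(sorted(encoded))
```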

# Classifying

### Models:
* **Logistic regression**
  * Best results (AUC): 0.960
  * Improvement from tuning: 0.733 percentage points
  * Best model before tuning

* **Naive Bayes**
  * Best results (AUC): 0.938
  * Improvement from tuning: 3.62 percentage points

* **Linear SVM**
  * Best results (AUC): 0.959
  * Improvement from tuning: 0.306 percentage points
  * Best results after tuning

* **Stochastic gradient descent (SGD)**
  * Best results (AUC): 0.955
  * Improvement from tuning: 2.96 percentage points

* **Random forest**
  * Best results (AUC): 0.941
  * Improvement from tuning: -0.043 percentage points
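
The improvement figures in the list above appear to be differences in test accuracy between the default and tuned pipelines rather than differences in AUC. That reading is an inference, but it can be checked for logistic regression using the two confusion matrices reported in the logistic regression section. `accuracy_pct` is a hypothetical helper.

```python
def accuracy_pct(matrix):
    """Accuracy in percent from a 2x2 confusion matrix [[tn, fp], [fn, tp]]."""
    (tn, fp), (fn, tp) = matrix
    return 100.0 * (tn + tp) / (tn + fp + fn + tp)

default_lr = [[20276, 146], [1270, 1486]]   # default hyperparameters
tuned_lr = [[20217, 205], [1041, 1715]]     # "best results" run

improvement = accuracy_pct(tuned_lr) - accuracy_pct(default_lr)
print("%.3f percentage points" % improvement)
```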

### Helper functions

In [625]:
def print_top10_features(vectorizer, clf, class_labels, n=20):
    """Prints the n features with the highest and the n with the lowest coefficients.

    Despite the name, n defaults to 20. clf must be a fitted binary linear model;
    class_labels is accepted for compatibility but is not used.
    """
    feature_names = vectorizer.get_feature_names()
    order = np.argsort(clf.coef_[0])  # feature indices, ascending by coefficient
    top = order[-n:]
    print("%s:\n%s" % ("most attack",
            " \n".join(feature_names[j] for j in top)))
    bottom = order[:n]
    print("\n\n%s:\n%s" % ("least attack",
            " \n".join(feature_names[j] for j in bottom)))

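The helper above relies on sorting feature indices by coefficient and slicing both ends of the result. The sketch below reproduces that selection in plain Python on made-up feature names and weights (none of these values come from a fitted model).

```python
# Hypothetical feature names and coefficients; not fitted values.
feature_names = ["thanks", "idiot", "article", "stupid", "please"]
coefs = [-1.2, 3.1, -0.1, 2.4, -0.8]
n = 2

# Indices sorted by coefficient, ascending (the ordering np.argsort returns).
order = sorted(range(len(coefs)), key=lambda j: coefs[j])

most_attack = [feature_names[j] for j in order[-n:]]
least_attack = [feature_names[j] for j in order[:n]]

print(most_attack)    # ['stupid', 'idiot']
print(least_attack)   # ['thanks', 'please']
```

With a fitted `Pipeline` like the ones used in this notebook, the vectorizer and classifier steps should be reachable through `clf.named_steps['vect']` and `clf.named_steps['clf']`.
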
In [626]:
def print_classification_metrics(probabilities):
    """Prints AUC, accuracy, confusion matrix, precision, recall, f-score, and support.

    Relies on the globals clf (the fitted pipeline) and test_comments (defined below).
    """
    auc = roc_auc_score(test_comments['attack'], probabilities)
    predicted = clf.predict(test_comments['comment'])

    print "AUC: %.3f" % auc
    print "Accuracy: %.3f%%" % (100 * accuracy_score(test_comments['attack'], predicted)), "\n"
    print "Confusion matrix:\n", confusion_matrix(test_comments['attack'], predicted), "\n"
    print "Other stats:\n", classification_report(test_comments['attack'], predicted, target_names=['Polite', 'Attack'])
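
AUC is computed from the predicted probabilities, while accuracy and the other metrics use hard predictions. AUC can be read as the probability that a randomly chosen attack receives a higher score than a randomly chosen non-attack. The sketch below computes it that way on a four-comment toy example. `pairwise_auc` is a hypothetical helper that is only practical for small inputs, not a replacement for `roc_auc_score`.

```python
def pairwise_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for l, s in zip(labels, scores) if l]
    negatives = [s for l, s in zip(labels, scores) if not l]
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Toy labels and scores; three of the four pairs are ordered correctly.
print(pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

Because only the ordering of the scores matters, AUC does not depend on the 0.5 decision threshold that the accuracy figure uses.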

In [627]:
# split dataset into train, dev, and test
train_comments = comments.query("split=='train'")
dev_comments = comments.query("split=='dev'")
test_comments = comments.query("split=='test'")

## Logistic regression

In [628]:
# With default values

clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', LogisticRegression()),
])
clf = clf.fit(train_comments['comment'], train_comments['attack'])
print_classification_metrics(clf.predict_proba(test_comments['comment'])[:, 1])

AUC: 0.958
Accuracy: 93.891% 

Confusion matrix:
[[20276   146]
 [ 1270  1486]] 

Other stats:
             precision    recall  f1-score   support

     Polite       0.94      0.99      0.97     20422
     Attack       0.91      0.54      0.68      2756

avg / total       0.94      0.94      0.93     23178
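
The precision, recall, and f1-score columns in the report above can be derived from the confusion matrix. A minimal sketch using the matrix from the default logistic regression run, where rows are true classes and columns are predicted classes:

```python
# Confusion matrix from the default logistic regression run.
# Rows are true classes, columns are predicted classes: [Polite, Attack].
matrix = [[20276, 146],
          [1270, 1486]]

tn, fp = matrix[0]
fn, tp = matrix[1]

precision = tp / float(tp + fp)   # of comments flagged as attacks, share that are attacks
recall = tp / float(tp + fn)      # of true attacks, share that were flagged
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / float(tp + tn + fp + fn)

print("Attack precision: %.2f" % precision)
print("Attack recall:    %.2f" % recall)
print("Attack f1-score:  %.2f" % f1)
print("Accuracy:         %.3f%%" % (100 * accuracy))
```

The high accuracy hides the low recall on the Attack class: roughly half of the true attacks are predicted as polite.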



In [629]:
# Best results

clf = Pipeline([
    ('vect', CountVectorizer(max_features = 10000, min_df=10)),
    ('tfidf', TfidfTransformer(norm='l2', sublinear_tf=True)),
    ('clf', LogisticRegression(penalty='l1')),
])
clf = clf.fit(train_comments['comment'], train_comments['attack'])
print_classification_metrics(clf.predict_proba(test_comments['comment'])[:, 1])

AUC: 0.960
Accuracy: 94.624% 

Confusion matrix:
[[20217   205]
 [ 1041  1715]] 

Other stats:
             precision    recall  f1-score   support

     Polite       0.95      0.99      0.97     20422
     Attack       0.89      0.62      0.73      2756

avg / total       0.94      0.95      0.94     23178



## Naive Bayes

In [630]:
# With default values

clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])
clf = clf.fit(train_comments['comment'], train_comments['attack'])
print_classification_metrics(clf.predict_proba(test_comments['comment'])[:, 1])

AUC: 0.834
Accuracy: 89.805% 

Confusion matrix:
[[20415     7]
 [ 2356   400]] 

Other stats:
             precision    recall  f1-score   support

     Polite       0.90      1.00      0.95     20422
     Attack       0.98      0.15      0.25      2756

avg / total       0.91      0.90      0.86     23178



In [631]:
# Best results

clf = Pipeline([
    ('vect', CountVectorizer(max_features = 10000, ngram_range = (1,2),min_df=15)),
    ('tfidf', TfidfTransformer(norm='l2', sublinear_tf=True)),
    ('clf', MultinomialNB()),
])
clf = clf.fit(train_comments['comment'], train_comments['attack'])
print_classification_metrics(clf.predict_proba(test_comments['comment'])[:, 1])

AUC: 0.938
Accuracy: 93.425% 

Confusion matrix:
[[20221   201]
 [ 1323  1433]] 

Other stats:
             precision    recall  f1-score   support

     Polite       0.94      0.99      0.96     20422
     Attack       0.88      0.52      0.65      2756

avg / total       0.93      0.93      0.93     23178



## Linear SVM

In [632]:
# With default values

clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', LinearSVC()),
])
clf = clf.fit(train_comments['comment'], train_comments['attack'])
print_classification_metrics(clf.decision_function(test_comments['comment']))

AUC: 0.953
Accuracy: 94.728% 

Confusion matrix:
[[20170   252]
 [  970  1786]] 

Other stats:
             precision    recall  f1-score   support

     Polite       0.95      0.99      0.97     20422
     Attack       0.88      0.65      0.75      2756

avg / total       0.94      0.95      0.94     23178



In [633]:
# Best results

clf = Pipeline([
    ('vect', CountVectorizer(ngram_range = (1,2))),
    ('tfidf', TfidfTransformer(norm='l2', sublinear_tf=True)),
    ('clf', LinearSVC(penalty='l1',dual=False, max_iter=500)),
])
clf = clf.fit(train_comments['comment'], train_comments['attack'])
print_classification_metrics(clf.decision_function(test_comments['comment']))

AUC: 0.959
Accuracy: 95.034% 

Confusion matrix:
[[20124   298]
 [  853  1903]] 

Other stats:
             precision    recall  f1-score   support

     Polite       0.96      0.99      0.97     20422
     Attack       0.86      0.69      0.77      2756

avg / total       0.95      0.95      0.95     23178
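
The Linear SVM cells pass `decision_function` scores to the metrics helper instead of probabilities. That is fine for AUC, because AUC depends only on how the scores rank the comments, not on their scale. A minimal pure-Python sketch of this, using toy labels and scores that are hypothetical and not taken from the dataset (the helper `pairwise_auc` is also hypothetical and far too slow for real data):

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (attack, polite) pairs ranked correctly; ties count half."""
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy labels and scores (hypothetical, not taken from the dataset).
labels = [0, 0, 1, 1]
margins = [-1.8, -0.2, -0.5, 1.4]                  # signed scores, like decision_function
squashed = [m / (1.0 + abs(m)) for m in margins]   # an increasing rescaling into (-1, 1)

auc_margins = pairwise_auc(labels, margins)
auc_squashed = pairwise_auc(labels, squashed)
print("AUC from margins:  %.2f" % auc_margins)
print("AUC from rescaled: %.2f" % auc_squashed)
```

Both calls give the same value because the rescaling preserves the order of the scores.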



## SGD Classifier
* Ran a grid search over the `alpha` and `n_iter` parameters, because the documentation recommends it

In [558]:
# With default values

clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier()),
])
clf = clf.fit(train_comments['comment'], train_comments['attack'])
print_classification_metrics(clf.decision_function(test_comments['comment']))

AUC: 0.945
Accuracy: 90.672% 

Confusion matrix:
[[19383    58]
 [ 2104  1633]] 

Other stats:
             precision    recall  f1-score   support

     Polite       0.90      1.00      0.95     19441
     Attack       0.97      0.44      0.60      3737

avg / total       0.91      0.91      0.89     23178



In [559]:
# Best with manual tuning

clf = Pipeline([
    ('vect', CountVectorizer(max_features = 10000,ngram_range = (1,2))),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier(loss = 'modified_huber', alpha =0.00006,n_iter=65)),
])
clf = clf.fit(train_comments['comment'], train_comments['attack'])
print_classification_metrics(clf.decision_function(test_comments['comment']))

AUC: 0.942
Accuracy: 92.316% 

Confusion matrix:
[[19150   291]
 [ 1490  2247]] 

Other stats:
             precision    recall  f1-score   support

     Polite       0.93      0.99      0.96     19441
     Attack       0.89      0.60      0.72      3737

avg / total       0.92      0.92      0.92     23178



In [377]:
print("Best score: %0.3f" % gscv.best_score_)
print("Best parameters set:")
best_parameters = gscv.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))
    
clf = Pipeline([
    ('vect', CountVectorizer(max_features = 10000,ngram_range = (1,2))),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier(loss = 'modified_huber', alpha = best_parameters['clf__alpha'],
                          n_iter = best_parameters['clf__n_iter'])),
])
clf = clf.fit(train_comments['comment'], train_comments['attack'])
print_classification_metrics(clf.decision_function(test_comments['comment']))

Best score: 0.943
Best parameters set:
	clf__alpha: 0.0001
	clf__n_iter: 60
AUC: 0.957
Accuracy: 94.301% 

Confusion matrix:
[[20293   129]
 [ 1192  1564]] 

Other stats:
             precision    recall  f1-score   support

     Polite       0.94      0.99      0.97     20422
     Attack       0.92      0.57      0.70      2756

avg / total       0.94      0.94      0.94     23178



In [536]:
# Grid search

clf = Pipeline([
    ('vect', CountVectorizer(max_features = 10000,ngram_range = (1,2))),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier(loss = 'modified_huber')),
])

parameters = {
    #'vect__max_df': (0.5, 0.75, 1.0),
    #'vect__max_features': (None, 5000, 10000, 50000),
    #'vect__ngram_range': ((1, 1), (1, 2)),  # unigrams or bigrams
    #'tfidf__use_idf': (True, False),
    #'tfidf__norm': ('l1', 'l2'),
    'clf__alpha': 10.0**-np.arange(1,7),
    #'clf__penalty': ('l2', 'elasticnet'),
    'clf__n_iter': (40, 60, 80),
}

gscv = GridSearchCV(clf,parameters)
gscv.fit(train_comments['comment'], train_comments['attack'])

Best score: 0.916
Best parameters set:
	clf__alpha: 0.0001
	clf__n_iter: 40
AUC: 0.945


ValueError: Found input variables with inconsistent numbers of samples: [23178, 9]
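
To get a feel for the cost of the grid search above, the parameter grid can be expanded by hand. A minimal sketch, assuming 3-fold cross-validation (the default for `cv` depends on the scikit-learn version, so `assumed_folds` is an assumption, not something stated in this notebook):

```python
import itertools

# Same grid as the cell above: alpha = 10^-1 ... 10^-6 and three n_iter values.
alphas = [10.0 ** -k for k in range(1, 7)]
n_iters = [40, 60, 80]

combinations = list(itertools.product(alphas, n_iters))
print("Parameter combinations: %d" % len(combinations))

# Assumption: 3-fold cross-validation; the real default depends on the library version.
assumed_folds = 3
total_fits = len(combinations) * assumed_folds
print("Pipeline fits with %d folds: %d" % (assumed_folds, total_fits))
```

Each fit re-runs the vectorizer and the classifier on the training folds, and the grid search typically refits the best combination once more on the full training set. Uncommenting more entries in `parameters` multiplies this count.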

## Random Forest

In [266]:
# With default values

clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', RandomForestClassifier()),
])
clf = clf.fit(train_comments['comment'], train_comments['attack'])
print_classification_metrics(clf.predict_proba(test_comments['comment'])[:, 1])

AUC: 0.940
Accuracy: 93.705% 

Confusion matrix:
[[20284   138]
 [ 1321  1435]] 

Other stats:
             precision    recall  f1-score   support

     Polite       0.94      0.99      0.97     20422
     Attack       0.91      0.52      0.66      2756

avg / total       0.94      0.94      0.93     23178



In [278]:
# After tuning

rf = RandomForestClassifier(n_estimators=50)
vect = CountVectorizer(max_features = 10000, ngram_range = (1,2), min_df=5)

clf = Pipeline([
    ('vect', vect),
    ('tfidf', TfidfTransformer(norm = 'l2')),
    ('clf', rf),
])
clf = clf.fit(train_comments['comment'], train_comments['attack'])
print_classification_metrics(clf.predict_proba(test_comments['comment'])[:, 1])

AUC: 0.941
Accuracy: 93.662% 

Confusion matrix:
[[20280   142]
 [ 1327  1429]] 

Other stats:
             precision    recall  f1-score   support

     Polite       0.94      0.99      0.97     20422
     Attack       0.91      0.52      0.66      2756

avg / total       0.94      0.94      0.93     23178



## Combining features

Adding the extra features gave worse accuracy than using the comment text alone: with the extra features, every comment is classified as 'Polite'. I'm not really sure why.

In [None]:
from sklearn.pipeline import FeatureUnion
from sklearn.feature_extraction import DictVectorizer

class ItemSelector(BaseEstimator, TransformerMixin):
    """For data grouped by feature, select subset of data at a provided key.

    The data is expected to be stored in a 2D data structure, where the first
    index is over features and the second is over samples.  i.e.

    >> len(data[key]) == n_samples

    Please note that this is the opposite convention to scikit-learn feature
    matrixes (where the first index corresponds to sample).

    ItemSelector only requires that the collection implement getitem
    (data[key]).  Examples include: a dict of lists, 2D numpy array, Pandas
    DataFrame, numpy record array, etc.

    >> data = {'a': [1, 5, 2, 5, 2, 8],
               'b': [9, 4, 1, 4, 1, 3]}
    >> ds = ItemSelector(key='a')
    >> data['a'] == ds.transform(data)

    ItemSelector is not designed to handle data grouped by sample.  (e.g. a
    list of dicts).  If your data is structured this way, consider a
    transformer along the lines of `sklearn.feature_extraction.DictVectorizer`.
    
    Copied from: http://scikit-learn.org/stable/auto_examples/hetero_feature_union.html

    Parameters
    ----------
    key : hashable, required
        The key corresponding to the desired value in a mappable.
    """
    def __init__(self, key):
        self.key = key

    def fit(self, x, y=None):
        return self

    def transform(self, data_dict):
        return data_dict[self.key]


def print_classification_metrics(probabilities):
    """Prints AUC, accuracy, confusion matrix, precision, recall, fscore, and support.
    Tweaked to work with pipeline containing FeatureUnion.
    """
    auc = roc_auc_score(test_comments['attack'], probabilities)
    predicted = clf.predict(test_comments)
    
    print "AUC: %.3f" %auc
    print "Accuracy: %(acc).3f" %{"acc": (100*accuracy_score(test_comments['attack'],predicted))} + "%","\n"
    print "Confusion matrix:\n",confusion_matrix(test_comments['attack'],predicted), "\n"
    print "Other stats:\n", classification_report(test_comments['attack'],predicted,target_names = ['Polite', 'Attack'])

In [535]:
clf = Pipeline([
    # Use FeatureUnion to combine the features from comments with sample, ns, etc.
    ('union', FeatureUnion(
        transformer_list=[
            # Pipeline for standard bag-of-words model for comments
            ('comment_bow', Pipeline([
                ('selector', ItemSelector(key='comment')),
                ('vect', CountVectorizer()),
                ('tfidf', TfidfTransformer()),
            ])),
            # Pipeline for pulling ad hoc features from post's body
            ('comment_stats', Pipeline([
                ('selector', ItemSelector(key='stats')),
                ('vect', DictVectorizer()),  # list of dicts -> feature matrix
                ('tfidf', TfidfTransformer()),
            ])),

        ],

        # weight components in FeatureUnion
        transformer_weights={
            'comment_bow': 0.0,
            'comment_stats': 1.0,
        },
    )),
    ('clf', SGDClassifier()),
])

clf = clf.fit(train_comments, train_comments['attack'])
y = clf.predict(test_comments)
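# NOTE: classification_report expects (y_true, y_pred). The predictions are passed first here,
# so the "support" column of the first report below counts predicted labels, not true ones.
print(classification_report(y, test_comments['attack']))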
print_classification_metrics(clf.decision_function(test_comments))

             precision    recall  f1-score   support

      False       1.00      0.84      0.91     23178
       True       0.00      0.00      0.00         0

avg / total       1.00      0.84      0.91     23178

AUC: 0.803
Accuracy: 83.877% 

Confusion matrix:
[[19441     0]
 [ 3737     0]] 

Other stats:
             precision    recall  f1-score   support

     Polite       0.84      1.00      0.91     19441
     Attack       0.00      0.00      0.00      3737

avg / total       0.70      0.84      0.77     23178
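
One thing that may be worth checking in the cell above: `transformer_weights` multiplies each transformer's output before the blocks are concatenated, so a weight of `0.0` on `comment_bow` zeroes every bag-of-words column and leaves the classifier with only the `comment_stats` features. A pure-Python sketch of that scaling, with made-up feature values and a hypothetical helper `weighted_union` (it imitates the behaviour and is not scikit-learn's implementation):

```python
def weighted_union(blocks, weights):
    """Scale each named feature block by its weight, then concatenate per sample."""
    names = sorted(blocks)
    n_samples = len(blocks[names[0]])
    rows = []
    for i in range(n_samples):
        row = []
        for name in names:
            row.extend(weights.get(name, 1.0) * value for value in blocks[name][i])
        rows.append(row)
    return rows


# Hypothetical features for two comments.
blocks = {
    "comment_bow": [[0.7, 0.1, 0.0], [0.0, 0.4, 0.9]],   # tf-idf values
    "comment_stats": [[1.0], [0.0]],                     # e.g. a logged_in flag
}
weights = {"comment_bow": 0.0, "comment_stats": 1.0}

combined = weighted_union(blocks, weights)
for row in combined:
    print(row)
```

With these weights the first three columns are zero for every comment, so the text carries no signal. Setting both weights to `1.0` would be the natural starting point for checking whether the extra features help at all.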



# Conclusion

#### What are the features you considered using? What features did you use in the final code?
The only column I ended up using as a feature was 'comment' (with 'attack' as the label), although I considered using 'logged_in', 'ns', and 'sample' as well. Ultimately, I couldn't figure out how to use those columns to improve the model, so they were left out of the final code.

#### What optimizations did you add in your code, if any?
I ran a grid search (`GridSearchCV`) on the SGDClassifier model.

#### What hyper-parameter tuning did you do?
For all models, I tried tuning a few parameters each for CountVectorizer (max_features, ngram_range, min_df, stop_words) and TfidfTransformer (norm, sublinear_tf), and kept whichever setting improved the model's performance the most. For the individual classifiers, I selected the parameters that seemed most likely to improve performance for that specific model; for example, n_estimators for RandomForest and loss for SGDClassifier. I also tried a few things that didn't work, like excluding stop words (both from a premade list and using max_df), and setting class_weight='balanced' on some models.
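
As an illustration of one of these settings, `sublinear_tf=True` replaces each raw term count with 1 + ln(count) before the idf weighting and normalization, so a word repeated many times in one comment counts for less than its raw frequency suggests. A minimal sketch of just that scaling step (the helper name `sublinear_tf` is hypothetical):

```python
import math


def sublinear_tf(count):
    """Scale a raw term count as 1 + ln(count); zero counts stay zero."""
    return 1.0 + math.log(count) if count > 0 else 0.0


for count in (1, 2, 10, 100):
    print("raw count %3d -> scaled %.2f" % (count, sublinear_tf(count)))
```

A term that appears ten times ends up weighted about 3.3 times as much as a term that appears once, rather than ten times as much.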

#### What did you learn from the different metrics?
The model accuracy and AUC were both very helpful for getting an idea of overall performance. The confusion matrix and other stats (precision, recall, etc.) were more useful for figuring out how the model was actually classifying things; for example, most models were more likely to mistake rude comments for polite ones than vice versa, probably because the training set had more polite comments in it. 
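
The asymmetry between the two kinds of mistake can be read directly off a confusion matrix. A minimal sketch using the matrix from the tuned Linear SVM run, which also computes the accuracy of a baseline that always predicts 'Polite' on the test split:

```python
# Confusion matrix from the tuned Linear SVM run.
# Rows are true classes, columns are predicted classes: [Polite, Attack].
matrix = [[20124, 298],
          [853, 1903]]

polite_total = sum(matrix[0])
attack_total = sum(matrix[1])

missed_attack_rate = matrix[1][0] / float(attack_total)   # attacks predicted as polite
false_alarm_rate = matrix[0][1] / float(polite_total)     # polite predicted as attacks
majority_baseline = polite_total / float(polite_total + attack_total)

print("Attacks predicted as polite: %.1f%%" % (100 * missed_attack_rate))
print("Polite predicted as attacks: %.1f%%" % (100 * false_alarm_rate))
print("Accuracy of always predicting 'Polite': %.1f%%" % (100 * majority_baseline))
```

Because polite comments dominate the test split, the always-'Polite' baseline already scores high accuracy, which is why accuracy alone says little about how well attacks are detected.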

#### Did you try cross-validation?
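Not explicitly :( (although `GridSearchCV` cross-validates internally when it scores each parameter combination).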
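
For reference, the idea behind k-fold cross-validation is to split the training rows into k folds and use each fold once for validation while training on the rest. A pure-Python sketch of the index bookkeeping only, with a hypothetical helper `k_fold_indices`; in practice scikit-learn's `cross_val_score` or `StratifiedKFold` from `sklearn.model_selection` would do this, and stratification would matter here because attacks are the minority class:

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, validation_indices) for contiguous, near-equal folds."""
    fold_sizes = [n_samples // n_folds] * n_folds
    for i in range(n_samples % n_folds):
        fold_sizes[i] += 1          # spread the remainder over the first folds
    start = 0
    for size in fold_sizes:
        stop = start + size
        validation = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, validation
        start = stop


folds = list(k_fold_indices(n_samples=10, n_folds=3))
for train, validation in folds:
    print("train=%s validation=%s" % (train, validation))
```

Averaging a metric over the validation folds gives a more stable estimate than a single split, and it keeps the test split untouched while tuning.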

#### What are your best final Result Metrics? By how much is it better than the strawman figure?

|                   | strawman | final  | improvement |
| ----------------- |:--------:|:------:| -----------:|
| AUC               | 0.957    | 0.959  | 0.002       |
| Accuracy          | 94.05%   | 95.04% | 0.99%       |
| Average precision | 0.94     | 0.95   | 0.01        |
| Average recall    | 0.94     | 0.95   | 0.01        |
| Average f1-score  | 0.93     | 0.95   | 0.02        |


#### Which model gave you this performance?
My best final metrics came from the Linear SVM model after tuning.

#### What is the most interesting thing you learned from doing the report?
I looked up the most predictive features for each model to verify that it was working correctly (using the `print_top10_features` function). The most predictive words for attack comments were all curse words, except for the word 'you', which showed up a few times. This makes sense, since a lot of attacks directly address the person being attacked. Similarly, the most predictive words for polite comments were 'thanks' and 'thank you'. The word 'you' was therefore highly predictive of both attack and polite comments (depending on the words around it), which is probably part of why every model did better when stop words were included.
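
The reason 'you' can point both ways is that, with `ngram_range = (1,2)`, 'thank you' becomes its own feature next to the single word 'you'. A rough pure-Python approximation of that tokenization, assuming CountVectorizer's default lowercasing and token pattern (the helper `unigrams_and_bigrams` and the example sentences are hypothetical):

```python
import re


def unigrams_and_bigrams(text):
    """Approximate CountVectorizer(ngram_range=(1, 2)) features for one comment."""
    # Default token pattern: lowercased words of two or more characters.
    tokens = re.findall(r"(?u)\b\w\w+\b", text.lower())
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams


polite = unigrams_and_bigrams("Thank you for the edit")
attack = unigrams_and_bigrams("You are an idiot")   # toy example, not from the dataset
print(polite)
print(attack)
```

Both comments contain the feature 'you', but only the polite one contains 'thank you', so a linear model can give the bigram a negative weight and the unigram a positive one.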

#### What was the hardest thing to do?
As a model improved, it became very hard to improve it further with tuning. The Naive Bayes model had the greatest improvement after tuning, but it started out with the worst performance. The Logistic Regression and LinearSVM models both scored fairly high without tuning, and it was difficult to find a combination of parameters that improved either of them significantly.

## Other stuff (will probably delete)

In [140]:
comment_text = test_comments['comment'].values
isattack = test_comments['attack'].values
predicted

array([ 0.06007826,  0.12342553,  0.05361463, ...,  0.05755827,
        0.06621351,  0.01309482])

In [183]:
print_top10(vect, lr, train_comments['attack'])

most attack:
cock 
bastard 
gay 
penis 
cunt 
bullshit 
sucks 
idiots 
moron 
dick 
suck 
bitch 
you 
asshole 
ass 
shit 
stupid 
idiot 
fucking 
fuck


least attack:
thank you 
thank 
article 
thanks 
please 
could 
section 
agree 
but 
there 
welcome 
source 
redirect talk 
redirect 
after 
is not 
which 
discussion 
different 
if


In [45]:
isattack

array([False, False, False, ..., False, False, False], dtype=bool)

In [85]:
def correctprediction(predicted, actual):
    """True if thresholding the predicted probability at 0.5 matches the actual label"""
    if actual:
        return predicted >= .50
    else:
        return predicted < .50

In [86]:
correctprediction(predicted[0], isattack[0])

True

In [87]:
for i in range(len(predicted)):
    if not correctprediction(predicted[i], isattack[i]):
        print "Is attack: ", isattack[i], 
        print ", Predicted value: ", predicted[i]
        print "Comment text: ", comment_text[i]
        print ""

Is attack:  True , Predicted value:  0.279472441181
Comment text:  yes  complain to your rabbi and then go shoot some more defenseless arabs to take out your frustration     

Is attack:  False , Predicted value:  0.769515018621
Comment text:  god damn god damn god damn god damn god damn god damn god damn god damn god damn god damn

Is attack:  True , Predicted value:  0.126633873546
Comment text:        hey  polish slavic monkeys     you slavic monkeys can wright here all the shit what you want  but all world knews what realy you are   the trashy mutantes from lower dniepr  your   poland   have no rights to exist   it is only band of slavic trampes    vilnietis 

Is attack:  True , Predicted value:  0.111704527066
Comment text:    obviously lan3y knows more about the band than mark and tom  an  example of stupidity 

Is attack:  False , Predicted value:  0.59561657208
Comment text:        skateboard     i do not own a skateboard knob end 

Is attack:  True , Predicted value:  0.386570

In [90]:
predicted[0] < .50

True