### Data source
This [dataset](https://www.kaggle.com/rtatman/ironic-corpus) contains 1950 comments, which have been labeled as ironic (1) or not ironic (-1) by human annotators. The text was taken from Reddit comments.
    

In [1]:
import csv
data = []
with open('irony-labeled.csv') as datafile:
    csvReader = csv.reader(datafile)
    for row in csvReader:
        data.append(row)


In [2]:
# delete the first element (header) in the data list
del data[0]
data[:3]


[["I suspect atheists are projecting their desires when they imagine Obama is one of their number.  Does anyone remember the crazy preacher with whom he was associated? \nhttp://www. examiner. com/article/obama-and-wright-throw-each-other-under-the-bus\n\nI can understand a career politician in the USA needing to feign belief to get elected, but for that purpose I'd imagine a more vanilla choice of church.  \n\n\nHe's not an atheist.  He's not a liberal either.",
  '-1'],
 ['It\'s funny how the arguments the shills are making here are still so close to the racist remarks the GOP has already admitted to.  Always attacking "lazy minorities and young people. " \n\n&gt;\xe2\x80\x9c[i]f it hurts a bunch of college kids that\xe2\x80\x99s too lazy to get up off their bohunkus [sic] and get a photo ID, so be it,\xe2\x80\x9d and \xe2\x80\x9cif it hurts a bunch of lazy blacks that wants the government to give them everything, so be it. \xe2\x80\x9d \xe2\x80\x9cthe law is going to kick the Democr

### Preprocessing

In [3]:
# remove URLs that start a line (the pattern is anchored with ^ under re.MULTILINE)
import re
for row in data:
    row[0] = re.sub(r'^https?:\/\/.*[\r\n]*', '', row[0], flags=re.MULTILINE)
print(data[:3])

[["I suspect atheists are projecting their desires when they imagine Obama is one of their number.  Does anyone remember the crazy preacher with whom he was associated? \nI can understand a career politician in the USA needing to feign belief to get elected, but for that purpose I'd imagine a more vanilla choice of church.  \n\n\nHe's not an atheist.  He's not a liberal either.", '-1'], ['It\'s funny how the arguments the shills are making here are still so close to the racist remarks the GOP has already admitted to.  Always attacking "lazy minorities and young people. " \n\n&gt;\xe2\x80\x9c[i]f it hurts a bunch of college kids that\xe2\x80\x99s too lazy to get up off their bohunkus [sic] and get a photo ID, so be it,\xe2\x80\x9d and \xe2\x80\x9cif it hurts a bunch of lazy blacks that wants the government to give them everything, so be it. \xe2\x80\x9d \xe2\x80\x9cthe law is going to kick the Democrats in the butt. \xe2\x80\x9d', '-1'], ["We are truly following the patterns of how the 
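Because the pattern is anchored with `^`, it only removes URLs that begin a line. Here is a minimal sketch for the case where URLs in the middle of a sentence should be dropped as well. `URL_PATTERN` and `strip_urls` are hypothetical names, not part of the notebook. Some URLs in this corpus contain spaces after the dots, so this pattern would leave fragments of those behind, whereas the line-anchored pattern removes the rest of the line.

```python
import re

# Hypothetical helper: matches an http(s) URL anywhere, up to the next whitespace.
URL_PATTERN = re.compile(r'https?://\S+')

def strip_urls(text):
    """Remove every http(s) URL from text, wherever it occurs."""
    return URL_PATTERN.sub('', text)

sample = "See http://example.com/page for details.\nhttp://example.org\nDone."
print(repr(strip_urls(sample)))
```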

In [4]:
data_texts = [] # build a list to store texts
data_labels = [] # build a list to store labels
for row in data:
    data_texts.append(row[0])
    data_labels.append(row[1])
print(data_texts[:3])
print(data_labels[:3])

["I suspect atheists are projecting their desires when they imagine Obama is one of their number.  Does anyone remember the crazy preacher with whom he was associated? \nI can understand a career politician in the USA needing to feign belief to get elected, but for that purpose I'd imagine a more vanilla choice of church.  \n\n\nHe's not an atheist.  He's not a liberal either.", 'It\'s funny how the arguments the shills are making here are still so close to the racist remarks the GOP has already admitted to.  Always attacking "lazy minorities and young people. " \n\n&gt;\xe2\x80\x9c[i]f it hurts a bunch of college kids that\xe2\x80\x99s too lazy to get up off their bohunkus [sic] and get a photo ID, so be it,\xe2\x80\x9d and \xe2\x80\x9cif it hurts a bunch of lazy blacks that wants the government to give them everything, so be it. \xe2\x80\x9d \xe2\x80\x9cthe law is going to kick the Democrats in the butt. \xe2\x80\x9d', "We are truly following the patterns of how the mandarins took ov

In [5]:
# check the counts of ironic (labeled as 1) and non-ironic texts (labeled as -1)
print(data_labels.count('1'))
print(data_labels.count('-1'))

537
1412


So this dataset contains 1412 non-ironic texts and 537 ironic texts, which means the classes are imbalanced (about 72% of the texts are non-ironic).
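Because most texts are non-ironic, a classifier that always predicts -1 already gets a fairly high accuracy. The sketch below computes that majority-class baseline from the two counts printed above. The variable names are only for illustration.

```python
from collections import Counter

# Label counts reported above: 537 ironic ('1') and 1412 non-ironic ('-1').
label_counts = Counter({'1': 537, '-1': 1412})

total = sum(label_counts.values())
majority_label, majority_count = label_counts.most_common(1)[0]

# Accuracy of a model that always predicts the majority label.
baseline_accuracy = majority_count / float(total)

print(majority_label, round(baseline_accuracy, 4))
```

Any model evaluated on this data should be compared against this baseline rather than against 50%.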

### Vectorization

In [72]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(ngram_range = (2,3),min_df=5,max_df=0.8)

features = vectorizer.fit_transform(data_texts)

a = features.toarray()
vectorizer.get_feature_names()


[u'2013 11',
 u'ability to',
 u'able to',
 u'about how',
 u'about it',
 u'about that',
 u'about the',
 u'about this',
 u'access to',
 u'according to',
 u'across the',
 u'act like',
 u'after all',
 u'after the',
 u'against the',
 u'agree with',
 u'all about',
 u'all of',
 u'all of the',
 u'all over',
 u'all over the',
 u'all that',
 u'all the',
 u'all those',
 u'allowed to',
 u'along with',
 u'always been',
 u'am not',
 u'america and',
 u'among the',
 u'amount of',
 u'an actual',
 u'an argument',
 u'an atheist',
 u'and all',
 u'and also',
 u'and am',
 u'and are',
 u'and can',
 u'and don',
 u'and even',
 u'and for',
 u'and get',
 u'and give',
 u'and got',
 u'and have',
 u'and he',
 u'and his',
 u'and how',
 u'and if',
 u'and in',
 u'and is',
 u'and it',
 u'and it is',
 u'and just',
 u'and live',
 u'and make',
 u'and no',
 u'and not',
 u'and now',
 u'and or',
 u'and other',
 u'and people',
 u'and said',
 u'and some',
 u'and that',
 u'and the',
 u'and their',
 u'and then',
 u'and there',
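
With `min_df=5` an n-gram is kept only if it occurs in at least 5 documents, and with `max_df=0.8` it is dropped if it occurs in more than 80% of the documents. The following pure-Python sketch shows that document-frequency filter on toy data. It uses whitespace-split unigrams and the hypothetical helper `filter_by_document_frequency`. It is an illustration of the idea, not CountVectorizer's implementation.

```python
from collections import Counter

def filter_by_document_frequency(docs, min_df, max_df):
    """Keep terms found in at least min_df documents (absolute count)
    and in at most max_df of all documents (proportion)."""
    doc_freq = Counter()
    for doc in docs:
        # set() so a term counts once per document, however often it repeats
        doc_freq.update(set(doc.lower().split()))
    max_count = max_df * len(docs)
    return sorted(t for t, df in doc_freq.items() if min_df <= df <= max_count)

# Toy documents for illustration only.
toy_docs = [
    "the cat sat",
    "the dog sat",
    "the cat ran",
    "the bird flew",
    "the dog ran",
]
print(filter_by_document_frequency(toy_docs, min_df=2, max_df=0.8))
```

Here "the" is removed because it appears in every document, and "bird" and "flew" are removed because they appear in only one.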
 

### Split training and testing dataset

In [44]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(features, data_labels, test_size=0.2, random_state=42)

### Import Classifier - SVM

In [45]:
from sklearn import svm
clf = svm.SVC()
clf.fit(X_train,y_train)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

### Evaluation - SVM

In [47]:
from sklearn.metrics import accuracy_score

predicted = clf.predict(X_test)
# print the accuracy score
print("Accuracy score of SVM model:\n"+ str(accuracy_score(y_test,predicted)))

# print evaluation report showing precision, recall, f1, support
from sklearn.metrics import classification_report
print(classification_report(y_test, predicted))


Accuracy score of SVM model:
0.725641025641
             precision    recall  f1-score   support

         -1       0.73      1.00      0.84       283
          1       0.00      0.00      0.00       107

avg / total       0.53      0.73      0.61       390
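
The 0.53 in the `avg / total` row is a support-weighted average of the per-class precision values. The sketch below reproduces it from the numbers in the report. It assumes, as the recall of 1.00 for -1 and 0.00 for 1 implies, that the SVM predicted -1 for all 390 test texts. The dictionary layout is only for illustration.

```python
# Per-class precision and support read from the SVM report above.
# Precision for -1 is 283 correct out of 390 predictions of -1.
per_class = {
    '-1': {'precision': 283 / 390.0, 'support': 283},
    '1': {'precision': 0.0, 'support': 107},
}

total_support = sum(c['support'] for c in per_class.values())

# Each class contributes its precision in proportion to its support.
weighted_precision = sum(
    c['precision'] * c['support'] for c in per_class.values()
) / total_support

print(round(weighted_precision, 2))
```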



### Import Classifier - Naive Bayes
The multinomial Naive Bayes classifier is suitable for classification with discrete features (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work.

In [48]:
from sklearn.naive_bayes import MultinomialNB

mnb = MultinomialNB()
mnb.fit(X_train, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

### Evaluation - Naive Bayes

In [49]:
from sklearn.metrics import classification_report
mnb_predict = mnb.predict(X_test)
print("Accuracy score of Naive Bayes model:\n"+ str(accuracy_score(y_test,mnb_predict)))

print(classification_report(y_test, mnb_predict))

Accuracy score of Naive Bayes model:
0.679487179487
             precision    recall  f1-score   support

         -1       0.75      0.85      0.79       283
          1       0.37      0.23      0.29       107

avg / total       0.64      0.68      0.65       390



### Accuracy VS Precision
- Accuracy - Accuracy is the most intuitive performance measure: the ratio of correctly predicted observations to the total number of observations. It is tempting to conclude that a model with high accuracy is a good model, but accuracy is only a reliable measure on symmetric datasets, where the classes are balanced and false positives and false negatives occur at similar rates. Otherwise, other metrics are needed to evaluate the performance of the model.


- Precision - Precision is the ratio of correctly predicted positive observations to the total predicted positive observations. The question this metric answers here is: of all the texts the model labeled as ironic, how many are actually ironic? High precision means few false positives.

Since my dataset is not symmetric, I also look at the precision score.
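
To make the definitions concrete, here is a small sketch that computes accuracy, precision and recall from confusion counts. The counts are the Naive Bayes test results from this notebook with "ironic" as the positive class: 43 false positives, 82 false negatives, and 25 true positives (derived as 68 predicted ironic minus 43 false positives), which leaves 240 true negatives out of 390. The function names are hypothetical.

```python
def accuracy(tp, fp, fn, tn):
    """Share of all predictions that are correct."""
    return (tp + tn) / float(tp + fp + fn + tn)

def precision(tp, fp):
    """Share of predicted positives that are truly positive."""
    return tp / float(tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    """Share of true positives that the model found."""
    return tp / float(tp + fn) if (tp + fn) else 0.0

# Naive Bayes on the test set, positive class = ironic (label 1).
tp, fp, fn, tn = 25, 43, 82, 240

print(round(accuracy(tp, fp, fn, tn), 4))
print(round(precision(tp, fp), 2))
print(round(recall(tp, fn), 2))
```

The accuracy looks acceptable, but precision and recall for the ironic class show that the model finds only a small part of the ironic texts.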




In my experiments, the 'Precision' score changes when I adjust the parameter 'ngram_range' in the Vectorization procedure.

##### What is 'ngram_range'?
The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used.
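
The definition can be sketched in pure Python. The helper `extract_ngrams` is hypothetical and only illustrates the idea. It lowercases the text and keeps tokens of two or more word characters, which mirrors CountVectorizer's default tokenization, so a single-character word such as "I" does not appear in the output.

```python
import re

def extract_ngrams(text, ngram_range):
    """Return word n-grams for every n with min_n <= n <= max_n."""
    min_n, max_n = ngram_range
    # Lowercase and keep tokens of two or more word characters.
    tokens = re.findall(r'\b\w\w+\b', text.lower())
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens over the token list.
        for i in range(len(tokens) - n + 1):
            ngrams.append(' '.join(tokens[i:i + n]))
    return ngrams

text = "I do not know what you mean"
print(extract_ngrams(text, (1, 1)))
print(extract_ngrams(text, (2, 2)))
print(extract_ngrams(text, (1, 2)))
```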

e.g.

text = "I do not know what you mean"

If we set ngram_range = (1,1), the vectorizer builds its vocabulary from 1-grams (single words) only.

So the vocabulary includes "do", "not", "know", "what", "you", "mean" (with its default settings, CountVectorizer lowercases the text and ignores single-character tokens such as "I").

If we set ngram_range = (2,2), the vectorizer builds its vocabulary from 2-grams only.

So the vocabulary includes "do not", "not know", "know what", "what you", "you mean"

If we set ngram_range = (1,2), 

we will get:
"do", "not", "know", "what", "you", "mean", "do not", "not know", "know what", "what you", "you mean"

##### Here I record the precision scores (the `avg / total` row of the classification report):

- ngram_range = (1,1): SVM: 0.53, NB: 0.64
- ngram_range = (2,2): SVM: 0.53, NB: 0.63
- ngram_range = (3,3): SVM: 0.53, NB: 0.58
- ngram_range = (4,4): SVM: 0.53, NB: 0.53
- ngram_range = (1,2): SVM: 0.53, NB: 0.63
- ngram_range = (1,3): SVM: 0.53, NB: 0.63
- ngram_range = (2,3): SVM: 0.53, NB: 0.64

The precision score of the Naive Bayes model is highest (0.64) when ngram_range is set to (1,1) or (2,3), although the differences between most settings are small. This suggests that unigrams alone, or the combination of bigrams and trigrams, help the model slightly more when it identifies ironic texts. The SVM score stays at 0.53 for every setting.


In [53]:
# print the counts of ironic and non-ironic sentences in the predicted results (SVM & NB models)
print('Counts of ironic and non-ironic sentences in predict result (SVM model):')
print('ironic: ' + str(list(predicted).count('1')))
print('non-ironic: '+str(list(predicted).count('-1')))

print('Counts of ironic and non-ironic sentences in predict result (NB model):')
print('ironic: ' + str(list(mnb_predict).count('1')))
print('non-ironic: '+str(list(mnb_predict).count('-1')))
# print the counts of ironic and non-ironic sentences in the test dataset (true values)
print('Counts of ironic and non-ironic sentences in testing dataset:')
print('ironic: ' + str(list(y_test).count('1')))
print('non-ironic: '+str(list(y_test).count('-1')))

Counts of ironic and non-ironic sentences in predict result (SVM model):
ironic: 0
non-ironic: 390
Counts of ironic and non-ironic sentences in predict result (NB model):
ironic: 68
non-ironic: 322
Counts of ironic and non-ironic sentences in testing dataset:
ironic: 107
non-ironic: 283


In [66]:
NB_result = list(mnb_predict)
test_result = list(y_test)

# collect the indices where the NB prediction differs from the true label
falseiron = []  # predicted ironic ('1') but actually non-ironic
falsenoni = []  # predicted non-ironic ('-1') but actually ironic
for n, (i, j) in enumerate(zip(NB_result, test_result)):
    if i != j and i == '1':
        falseiron.append(n)
    if i != j and i == '-1':
        falsenoni.append(n)
# print the number of non-ironic sentences wrongly predicted as ironic
print(len(falseiron))
# print the number of ironic sentences wrongly predicted as non-ironic
print(len(falsenoni))

43
82
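
The same two error counts can also be read from a table of (predicted, actual) pairs. The sketch below uses toy labels rather than the notebook's predictions, and `confusion_counts` is a hypothetical helper.

```python
from collections import Counter

def confusion_counts(predicted, actual):
    """Count how often each (predicted, actual) label pair occurs."""
    return Counter(zip(predicted, actual))

# Toy labels for illustration only (not the notebook's predictions).
toy_predicted = ['1', '-1', '-1', '1', '-1']
toy_actual = ['1', '1', '-1', '-1', '-1']

counts = confusion_counts(toy_predicted, toy_actual)
false_ironic = counts[('1', '-1')]      # predicted ironic, actually non-ironic
false_non_ironic = counts[('-1', '1')]  # predicted non-ironic, actually ironic
print(false_ironic, false_non_ironic)
```

With the real data, `confusion_counts(NB_result, test_result)` would give all four cells of the confusion matrix at once.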


### Next
- Term usage in the ironic/non-ironic sentences
- Apart from the ngrams, what features can help to detect irony?
   - Features used in previous study:
       - ngram
       - sentiments (ironic text may be more negative than non-ironic?)
       - topics
       - written-spoken style (We designed this set of features to explore the unexpectedness created by using spoken style words in a mainly written style tweet or vice versa (formal words usually adopted in written text employed in a spoken style context). )
       - Hyperbole (indicates the occurrence of a sequence of three positive or negative words in a row)
       - Punctuation (presence of an ellipsis as well as multiple question or exclamation marks or a combination of the latter two)
       
- How to construct multiple features?
- How to use multiple features in a model?
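
As a first step towards the punctuation feature listed above, here is a minimal sketch of hand-crafted binary features. The function name and the feature names are hypothetical, and the patterns are one possible reading of the description (an ellipsis, repeated question or exclamation marks, or a mix of the two).

```python
import re

def punctuation_features(text):
    """Return binary punctuation cues for one text."""
    return {
        # three or more consecutive dots
        'has_ellipsis': int(bool(re.search(r'\.{3,}', text))),
        # two or more consecutive question marks
        'has_multi_question': int(bool(re.search(r'\?{2,}', text))),
        # two or more consecutive exclamation marks
        'has_multi_exclamation': int(bool(re.search(r'!{2,}', text))),
        # a question mark directly next to an exclamation mark
        'has_mixed_marks': int(bool(re.search(r'\?!|!\?', text))),
    }

print(punctuation_features("Oh great... another Monday!!"))
print(punctuation_features("Really?!"))
```

One common option for using such features together with the n-gram counts is to stack them as extra columns next to the n-gram matrix (for example with `scipy.sparse.hstack`). Whether this helps on this dataset is still to be tested.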