In this notebook we develop a system to detect irony in text, using the data from the SemEval-2018 task on irony detection. The input file is tab-separated, with one tweet per row:

```csv
Tweet index     Label   Tweet text
1       1       Sweet United Nations video. Just in time for Christmas. #imagine #NoReligion  http://t.co/fej2v3OUBR
2       1       @mrdahl87 We are rumored to have talked to Erv's agent... and the Angels asked about Ed Escobar... that's hardly nothing    ;)
3       1       Hey there! Nice to see you Minnesota/ND Winter Weather 
4       0       3 episodes left I'm dying over here
```



Read all the data and find the size of vocabulary of the dataset (ignoring case) and the number of positive and negative examples.

In [89]:
from google.colab import files
files.upload()

{}

In [90]:
%tensorflow_version 2.x 
import numpy as np
import pandas as pd
import nltk
import tensorflow as tf
from nltk.corpus import stopwords
nltk.download('stopwords')
from nltk.stem import PorterStemmer
from tensorflow import keras

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
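df = pd.read_csv('1600599.txt', sep="\t")

# Preprocessing
stop = set(stopwords.words('english'))
stem = PorterStemmer()
del df['Tweet index']  # the running index carries no signal
# Drop stop words (case-sensitive check), then lowercase and stem what remains.
df['Tweet text'] = df['Tweet text'].apply(lambda x: " ".join(stem.stem(item.lower()) for item in x.split() if item not in stop))
# Second pass: capitalised stop words such as "Just" or "We" slip through the check above.
df['Tweet text'] = df['Tweet text'].apply(lambda x: " ".join(item for item in x.split() if item not in stop))
# Strip URLs and @mentions wherever they occur (an anchored '^https?...' pattern only matches at the start of a tweet).
df['Tweet text'] = df['Tweet text'].replace(to_replace=r'https?://\S+',value='',regex=True)
df['Tweet text'] = df['Tweet text'].replace(to_replace=r'@\S+',value='',regex=True)

# Examples per class, ordered by label value: [count for label 0, count for label 1]
pos_neg = df.groupby(['Label']).size().tolist()

# Vocabulary size: distinct whitespace-separated tokens after preprocessing
count = []
for i in df['Tweet text']:
  w = i.split()
  count.extend(w)
vocab_size = len(set(count))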
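
The cell above counts word types after stemming and stop-word removal. For comparison, the vocabulary size "ignoring case" and the class counts can also be taken from the raw text. The following is a minimal sketch, assuming plain whitespace tokenization; the helper name `vocab_and_label_counts` and the toy rows are illustrative and not part of the notebook.

```python
from collections import Counter

def vocab_and_label_counts(texts, labels):
    """Return (case-insensitive vocabulary size, Counter of labels)."""
    vocab = set()
    for text in texts:
        # Lowercase first so "Nice" and "nice" count as one word type.
        vocab.update(text.lower().split())
    return len(vocab), Counter(labels)

# Toy rows for illustration only (not taken from the dataset).
texts = ["Nice weather", "nice WEATHER today", "So tired"]
labels = [1, 1, 0]
size, counts = vocab_and_label_counts(texts, labels)
print(size, dict(counts))
```

With a DataFrame, the same helper could be called as `vocab_and_label_counts(df['Tweet text'], df['Label'])` before any preprocessing is applied.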

In [92]:
from nltk.corpus import stopwords
nltk.download('stopwords')

stop = stopwords.words('english')
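# Re-apply lowercasing and stop-word removal; after the previous cell this is
# largely a no-op apart from normalising whitespace.
df['Tweet text'] = df['Tweet text'].apply(lambda x: " ".join(item.lower() for item in x.split() if item not in stop))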
df['Tweet text'].head()

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


0    sweet unit nation video. time christmas. #imag...
1    rumor talk erv' agent... angel ask ed escobar....
2      hey there! nice see minnesota/nd winter weather
3                                3 episod left i'm die
4    can't breathe! chosen notabl quot year annual ...
Name: Tweet text, dtype: object

Divide the data into a training and test set and justify your split.
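
One way to make the split easier to justify is to shuffle and stratify it, so that the training and test sets keep roughly the same ratio of ironic to non-ironic tweets regardless of how the file is ordered. The following is a hedged sketch in plain Python; `stratified_split`, its parameters, and the toy labels are assumptions made for illustration.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) with a similar label ratio in both parts."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train_idx, test_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels for illustration: 6 ironic (1) and 4 non-ironic (0).
labels = [1, 0, 1, 1, 0, 1, 0, 1, 0, 1]
train_idx, test_idx = stratified_split(labels, test_fraction=0.2)
print(len(train_idx), len(test_idx))
```

The returned positions can be used with `df.iloc[train_idx]` and `df.iloc[test_idx]`. scikit-learn's `train_test_split` with its `stratify` argument offers the same behaviour.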

Implement a function that calculates the precision, recall and F-Measure for this task.

In [93]:
# 80/20 split in file order (no shuffling): the first 80% of rows train, the last 20% test.
train_test_cutoff = int(.80 * len(df))
training_sentences = df[:train_test_cutoff]
testing_sentences = df[train_test_cutoff:]
X_train = training_sentences['Tweet text']
X_test = testing_sentences['Tweet text']
y_train = training_sentences['Label']
y_test = testing_sentences['Label']
y_test_org = y_test  # keep the integer labels; y_test is one-hot encoded later

from sklearn.metrics import precision_recall_fscore_support
def evaluation(y_pred, y_true):
  """Return macro-averaged precision, recall and F-measure."""
  precision, recall, fscore, support = precision_recall_fscore_support(np.array(y_true),np.array(y_pred), average='macro')
  return precision, recall, fscore
X_train.shape, y_train.shape

((3053,), (3053,))
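
The `evaluation` function delegates to scikit-learn. To make the definitions explicit, the sketch below computes macro-averaged precision, recall and F-measure directly from true-positive, false-positive and false-negative counts. The name `macro_prf` is hypothetical, and scoring an undefined ratio as 0 is an assumption made here for simplicity.

```python
def macro_prf(y_true, y_pred, classes=(0, 1)):
    """Macro-averaged precision, recall and F1 over the given classes."""
    precisions, recalls, f1s = [], [], []
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        # Undefined ratios (zero denominator) are scored as 0.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(classes)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n

# Toy labels for illustration only.
y_true_toy = [1, 1, 0, 0]
y_pred_toy = [1, 0, 0, 0]
p, r, f = macro_prf(y_true_toy, y_pred_toy)
print(p, r, f)
```

Macro averaging weights both classes equally, so a model that ignores one class is penalised even when that class is the minority.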

Suggest some features to extract from each sentence. Implement a simple log-linear model to classify tweets as ironic or not ironic.
Train this model and evaluate the results using precision, recall and F-Measure.

In [94]:
from sklearn.feature_extraction.text import CountVectorizer
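# Fit the vocabulary on the training tweets only (the second argument of fit is an ignored y, not extra data).
cvec = CountVectorizer().fit(X_train)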
X_train = cvec.transform(X_train).toarray()
X_test = cvec.transform(X_test).toarray()
X_train.shape, y_train.shape

((3053, 8487), (3053,))

In [95]:
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression().fit(X_train, y_train)
y_pred = clf.predict(X_test)

precision,recall,fscore = evaluation(y_pred,y_test)
print(precision, recall, fscore)

0.6131443920245188 0.6127305665349143 0.6127591213284881
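
Bag-of-words counts are only one possible feature set. As a hedged illustration of other features that could be extracted from each tweet, the sketch below counts a few surface cues (punctuation, hashtags, mentions, emoticons). The feature names, the emoticon pattern, and the function `surface_features` are all assumptions. Whether any of these cues helps on this dataset would have to be tested, and they would need to be computed on the raw text, since preprocessing strips mentions.

```python
import re

# Deliberately small emoticon pattern, e.g. ":)", ";-)", "=D"; an assumption, not exhaustive.
EMOTICON = re.compile(r"[:;=][-']?[)(DPp]")

def surface_features(tweet):
    """Map a raw tweet to a dict of simple surface-level counts."""
    tokens = tweet.split()
    return {
        "n_exclaim": tweet.count("!"),
        "n_question": tweet.count("?"),
        "n_hashtags": sum(tok.startswith("#") for tok in tokens),
        "n_mentions": sum(tok.startswith("@") for tok in tokens),
        "has_ellipsis": int("..." in tweet),
        "has_emoticon": int(bool(EMOTICON.search(tweet))),
        "n_all_caps": sum(tok.isalpha() and tok.isupper() and len(tok) > 1
                          for tok in tokens),
    }

# Second tweet from the sample shown at the top of the notebook.
feats = surface_features(
    "@mrdahl87 We are rumored to have talked to Erv's agent... "
    "and the Angels asked about Ed Escobar... that's hardly nothing    ;)"
)
print(feats)
```

Dictionaries like this can be turned into a matrix with scikit-learn's `DictVectorizer` and stacked next to the `CountVectorizer` output before fitting `LogisticRegression`.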


Develop an acceptor or a transducer recurrent neural network that classifies the sentence as ironic or not ironic.

Evaluate this according to precision, recall and F-Measure.

In [96]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

tokenizer = Tokenizer(num_words=vocab_size)

out_dim,max_len = 64,64
X_train = training_sentences['Tweet text'].tolist()
tokenizer.fit_on_texts(X_train)
X_train = tokenizer.texts_to_sequences(X_train)
X_train = pad_sequences(X_train,maxlen=max_len)

X_test = testing_sentences['Tweet text'].tolist()
X_test = tokenizer.texts_to_sequences(X_test)
X_test = pad_sequences(X_test,maxlen=max_len)
y_train= to_categorical(y_train)
y_test = to_categorical(y_test)

X_train.shape, X_test.shape, y_train.shape, y_test.shape

((3053, 64), (764, 64), (3053, 2), (764, 2))
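
With its default arguments, `pad_sequences` pads and truncates at the front of each sequence (`padding='pre'`, `truncating='pre'`, `value=0`). The pure-Python sketch below mirrors that default behaviour to show what the `(n, 64)` arrays contain. The helper `pad_pre` is hypothetical and assumes `max_len >= 1`.

```python
def pad_pre(sequences, max_len, value=0):
    """Left-pad with `value` and keep only the last `max_len` ids of each sequence."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-max_len:]  # truncation drops ids from the front
        padded.append([value] * (max_len - len(seq)) + seq)
    return padded

result = pad_pre([[5, 3], [1, 2, 3, 4, 5, 6]], max_len=4)
print(result)
# → [[0, 0, 5, 3], [3, 4, 5, 6]]
```

The Keras `Tokenizer` assigns word indices starting at 1, so 0 is free to serve as the padding value.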

In [97]:
model = tf.keras.Sequential()
model.add(tf.keras.layers.Embedding(vocab_size, out_dim))
# Two stacked LSTMs; only the final state of the second one feeds the classifier (acceptor).
model.add(tf.keras.layers.LSTM(out_dim, return_sequences=True))
model.add(tf.keras.layers.LSTM(out_dim, return_sequences=False))
# One sigmoid unit per one-hot label column, trained with binary cross-entropy.
model.add(tf.keras.layers.Dense(2, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# Note: the test set doubles as validation data here.
history = model.fit(x = X_train,y= y_train ,epochs=5,batch_size = 10,
                    validation_data=(X_test, y_test)
                    )

Train on 3053 samples, validate on 764 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [98]:
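model.evaluate(x= X_test , y = y_test )
# Sequential.predict_classes was deprecated and later removed from tf.keras;
# argmax over the two output units gives the same class ids.
y_pred = np.argmax(model.predict(X_test), axis=-1)
precision,recall,fscore = evaluation(y_pred,y_test_org)
print(precision, recall, fscore)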

0.2591623036649215 0.5 0.3413793103448276


  _warn_prf(average, modifier, msg_start, len(result))
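
These three scores match what macro averaging yields when all 764 test tweets are assigned to a single class that contains 396 of them, which suggests that this first network collapsed to a constant prediction. The scikit-learn warning fragment above is consistent with a class that is never predicted. The count 396 is inferred from the reported precision and is not read from the data. The arithmetic can be checked with a short sketch:

```python
n_test = 764      # test-set size reported by the notebook
n_hit = 396       # inferred size of the class that receives every prediction

# Scores for the class that is always predicted.
p_hit = n_hit / n_test
r_hit = 1.0
f_hit = 2 * p_hit * r_hit / (p_hit + r_hit)

# The never-predicted class scores 0 on all three measures,
# so the macro average is half of the other class's score.
macro_p = (p_hit + 0.0) / 2
macro_r = (r_hit + 0.0) / 2
macro_f = (f_hit + 0.0) / 2
print(macro_p, macro_r, macro_f)
```

A recall of exactly 0.5 under macro averaging over two classes is a useful warning sign, since it arises whenever one class is predicted for every input.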


In [99]:
model = tf.keras.Sequential()
model.add(tf.keras.layers.Embedding(input_dim=vocab_size,output_dim = out_dim))
model.add(tf.keras.layers.LSTM(64,dropout=0.2, recurrent_dropout=0.2,return_sequences=True))
model.add(tf.keras.layers.LSTM(32,dropout=0.2, recurrent_dropout=0.2,return_sequences=True))
model.add(tf.keras.layers.LSTM(8,dropout=0.2, recurrent_dropout=0.2,return_sequences=False))
model.add(tf.keras.layers.Dense(2, activation='sigmoid'))
adam = tf.keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-07)
model.compile(optimizer=adam,
              loss='binary_crossentropy',
              metrics=['accuracy'])
history = model.fit(x = X_train,y= y_train ,epochs=5,batch_size = 10,
                    validation_data=(X_test, y_test)
                    )

Train on 3053 samples, validate on 764 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [100]:
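model.evaluate(x= X_test , y = y_test )
# Sequential.predict_classes was deprecated and later removed from tf.keras;
# argmax over the two output units gives the same class ids.
y_pred = np.argmax(model.predict(X_test), axis=-1)
precision,recall,fscore = evaluation(y_pred,y_test_org)
print(precision, recall, fscore)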

0.5880041011619959 0.5848155467720686 0.5785525154457194
