Hello everyone and welcome to my first submission on Kaggle.

While a lot of people might think that complex models with long training times are required (especially in NLP), I want to show that this is not always the case.

In this notebook, I will train a text classifier with fastText, a lightweight library for word vectors and text classification.

Here is a link to the docs, in case any of you are interested in further reading:

https://fasttext.cc/

The dataset is already fairly clean, so almost no preprocessing is needed, and a simple model is a good fit for the following reasons:

- News headlines are already short in nature, so there will not be a problem with text length, loss of meaning etc.

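- fastText's supervised classifier is a shallow model built on word vectors, so training takes seconds. Note that the training call in this notebook does not load pre-trained vectors; the word vectors are learned from the headlines themselves.

- The task is fairly simple, and more complex models might even perform worse.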

# Imports

In [1]:
#!pip install fasttext
#!pip install nltk
import numpy as np
import pandas as pd
import fasttext
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk import WordNetLemmatizer
from nltk.tokenize import sent_tokenize
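# Note: this rebinds the name `stopwords` from the NLTK corpus module to a set of English stopwords
stopwords = set(stopwords.words('english'))
# Despite the variable name, this is a lemmatizer, not a stemmer
stemmer = WordNetLemmatizer()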

For those new to NLP, NLTK is a great library with many essential features for sentiment analysis and other NLP tasks.

Here the task is sarcasm detection, which I treat like sentiment analysis. Since the dataset is already very clean, we will only remove stopwords and lemmatize.

Stopwords: very common words (such as "the" or "of") that do not help in defining a sentiment in our case.

Lemma: the dictionary (base) form of a word, e.g. "headlines" becomes "headline".

Most news headlines already convey the core idea, so lemmatizing mainly helps with the long and fluffy satirical headlines. It also helps processing time, since fewer distinct tokens are needed.


In [2]:
df = pd.read_csv("../input/news-headlines-for-sarcasm-detection/Data.csv")
df.head()

Unnamed: 0,headlines,target
0,CNN Triumphs (At Least in Most Demographic Cat...,Non Sarcastic
1,"‘You Did The Best You Could,’ Says Iron Man Ac...",Sarcastic
2,New Emails Reveal Warm Relationship Between Ka...,Non Sarcastic
3,Donald Trump Jr. Gets Slammed Over Racist Birt...,Non Sarcastic
4,God Urges Rick Perry Not To Run For President,Sarcastic


# Preprocessing

Given that the dataset is small, I will not optimize for speed or memory.

For that reason, I will process the data with plain Python lists, which should also be easier for beginners to follow.

In [3]:
# Separate the data to treat them
headlines = df['headlines'].values.tolist()
target = df["target"].values.tolist()

### apply_re_and_return_lower
For fastText, punctuation does not help convey the meaning of a word, so it is removed with a regular expression. The .lower() call avoids treating the same word as two different tokens ("Fast" and "fast" are the same word for humans, but different strings for Python).
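
To make the effect of that regular expression concrete, here is a minimal sketch on a made-up headline. The helper name `strip_punctuation` and the sample text are only for illustration; the pattern itself is the one used in the cell below.

```python
import re

def strip_punctuation(text):
    # Keep only digits, ASCII letters and spaces; every other character is dropped.
    return re.sub("[^0-9A-Za-z ]", "", text)

sample = "God Urges Rick Perry: 'Don't Run!'"
cleaned = strip_punctuation(sample).lower()
print(cleaned)  # god urges rick perry dont run
```

Two side effects are worth knowing: apostrophes are removed, so "Don't" becomes "dont", and accented letters are dropped as well because the pattern only keeps ASCII letters.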

### change_labels_for_fasttext
fastText requires each label to carry a specific prefix (__label__ by default) and to be "joined" with its text on the same line, which will be done later. For now, we only convert the labels to that format.
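
As a quick illustration of the target format, the sketch below builds one training line by hand. `to_fasttext_line` is a hypothetical helper and the headline text is made up; the only assumption is the default `__label__` prefix.

```python
def to_fasttext_line(label, text):
    # fastText's supervised mode reads one example per line:
    # the prefixed label first, then a space, then the text.
    return "__label__" + label + " " + text

line = to_fasttext_line("positive", "god urge rick perry run president")
print(line)  # __label__positive god urge rick perry run president
```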

In [4]:
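def apply_re_and_return_lower(headlines):
  # Step 1: drop everything except digits, ASCII letters and spaces
  re_list = []
  for text in headlines:
    text = re.sub("[^0-9A-Za-z ]", "", text)
    re_list.append(text)
  # Step 2: tokenize, remove stopwords, lowercase and lemmatize
  return_list = []
  for title in re_list:
    tokens = word_tokenize(title) #imported from nltk
    working_list = []
    for word in tokens:
      # The check is case-sensitive and runs before .lower(),
      # so capitalized stopwords such as "The" are kept
      if word not in stopwords: #check stopwords variable in imports
        working_list.append(stemmer.lemmatize(word.lower()))
    return_list.append(' '.join(working_list))
  return return_list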

def change_labels_for_fasttext(target):
  target_cleaned = []
  for label in target:
    if label == 'Sarcastic':
      target_cleaned.append("__label__positive")
    else:
      target_cleaned.append("__label__negative")
  return target_cleaned

In [5]:
# Apply functions
headlines = apply_re_and_return_lower(headlines)
target = change_labels_for_fasttext(target)

In [6]:
# Check if there was any error along the way and create the dataset to join and split
assert len(headlines) == len(target)
dataset = []
for index in range(0, len(headlines)):
  new_text = target[index] + " " + headlines[index]
  dataset.append(new_text)

In [7]:
# Split the data
from sklearn.model_selection import train_test_split
dataset_train, dataset_test = train_test_split(dataset, test_size=0.2, random_state=45)

In [8]:
# fastText reads its training and test data from text files, one labeled example per line
np.savetxt("x_train_ft.txt", dataset_train, delimiter="\n", fmt="%s")
np.savetxt("x_test_ft.txt", dataset_test, delimiter="\n", fmt="%s")

# Training and Evaluation

In [9]:
# Train the supervised model (no pretrainedVectors argument is passed, so the word vectors are learned from the training file)
model = fasttext.train_supervised(input="x_train_ft.txt", lr=0.075, epoch=10)

Read 0M words
Number of words:  14916
Number of labels: 2
Progress: 100.0% words/sec/thread: 1953703 lr:  0.000000 avg.loss:  0.139741 ETA:   0h 0m 0s
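
If you do want to start from pre-trained vectors, `train_supervised` accepts a `pretrainedVectors` argument pointing at a text-format `.vec` file, and `dim` has to match the dimension of those vectors. The following is only a sketch: the helper `supervised_kwargs` and the file name `vectors.vec` are hypothetical, and I have not measured whether this improves the score on this dataset.

```python
def supervised_kwargs(train_path, vec_path=None, dim=100):
    # lr and epoch are copied from the training cell above; dim=100 is fastText's default.
    kwargs = {"input": train_path, "lr": 0.075, "epoch": 10, "dim": dim}
    if vec_path is not None:
        # Assumption: vec_path is a .vec file whose vectors have exactly `dim` dimensions.
        kwargs["pretrainedVectors"] = vec_path
    return kwargs

kwargs = supervised_kwargs("x_train_ft.txt", vec_path="vectors.vec", dim=300)
print(kwargs)
# model = fasttext.train_supervised(**kwargs)  # uncomment once the .vec file exists
```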


In [10]:
# Test the model
# Outputs (number of test examples, precision, recall)
num_samples, precision, recall = model.test("x_test_ft.txt")

In [11]:
F1_score = 2*((precision*recall)/(precision+recall))
F1_score

0.9012789768185452
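
The F1 score above is the harmonic mean of precision and recall. A small sketch of the same formula as a reusable function, with a guard for the case where both values are zero (the function name `f1_score` and the sample numbers are mine, not from the notebook):

```python
def f1_score(precision, recall):
    # Harmonic mean of precision and recall; returns 0.0 when both are zero
    # to avoid a division by zero.
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(f1_score(0.8, 0.6))
print(f1_score(0.0, 0.0))
```

Since every headline here has exactly one label and the model is tested with its top prediction only, precision and recall should come out equal, in which case the F1 score is the same number as well.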

# Trying with Real World Examples

In [12]:
# Link: https://www.theonion.com/lawyer-explains-that-just-because-you-accidentally-kill-1848251296
model.predict(("Lawyer Explains That Just Because You Accidentally Kill Santa Doesn't Mean You’re Legally Obligated To Take His Place").lower())

(('__label__positive',), array([0.99080282]))

In [13]:
# Link: https://edition.cnn.com/travel/article/qatar-hilton-salwa-resort/index.html
model.predict(("Hilton Salwa: The gigantic luxury hotel in the middle of nowhere").lower())

(('__label__negative',), array([0.99403822]))

In [14]:
# Link: https://www.theonion.com/want-to-make-these-amazing-glitter-bagels-too-bad-the-1848228492
model.predict(("Want To Make These Amazing Glitter Bagels? Too Bad, They’re My Secret Recipe And I’ll Never Tell, Even If You Shoot My Kids In Front Of Me").lower())

(('__label__positive',), array([0.91794515]))
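
The predictions above only lowercase the raw headline, while the training data was also stripped of punctuation, filtered for stopwords and lemmatized. Applying the same cleaning at prediction time keeps the inputs consistent. A minimal sketch, with some simplifications that are my own assumptions: it splits on whitespace instead of using `word_tokenize`, takes the stopword set as a parameter, and leaves lemmatization as an optional callable so that it runs without NLTK data.

```python
import re

def clean_headline(text, stop_words, lemmatize=lambda w: w):
    # Same steps as apply_re_and_return_lower, for a single headline.
    text = re.sub("[^0-9A-Za-z ]", "", text)
    words = []
    for word in text.split():
        # The stopword check runs on the original casing, as in the training cell.
        if word not in stop_words:
            words.append(lemmatize(word.lower()))
    return " ".join(words)

# Hypothetical tiny stopword set for the demo.
demo_stop_words = {"the", "in", "of"}
cleaned = clean_headline(
    "Hilton Salwa: The gigantic luxury hotel in the middle of nowhere",
    demo_stop_words,
)
print(cleaned)  # hilton salwa the gigantic luxury hotel middle nowhere
```

In the notebook you would pass the `stopwords` set and `stemmer.lemmatize` from the imports cell and call `model.predict(clean_headline(...))`. The output also shows that the capitalized "The" survives the case-sensitive stopword check.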

# Conclusion

In just a few seconds, and without a lot of computational resources, we were able to train a model that detects sarcasm in news headlines.

I hope you found this notebook informative, and I am happy to answer any questions or comments.

Thank you for reading!

Luis Sejas