# Sarcasm Prediction

The dataset contains news headlines, some of which were written by their authors to be sarcastic. The task is to build an NLP model that predicts whether a headline is sarcastic or not.

**About the Data**

Each record in the dataset consists of two attributes:

- is_sarcastic: 1 if the record is sarcastic, otherwise 0. This is the target variable.

- headline: the headline of the news article.

 

In [None]:
# Setup drive for colab
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Install dependencies

In [None]:
!pip install contractions
!pip install textsearch
!pip install tqdm

import nltk

nltk.download('punkt')
nltk.download('stopwords')

Collecting contractions
  Downloading contractions-0.0.55-py2.py3-none-any.whl (7.9 kB)
Collecting textsearch>=0.0.21
  Downloading textsearch-0.0.21-py2.py3-none-any.whl (7.5 kB)
Collecting pyahocorasick
  Downloading pyahocorasick-1.4.2.tar.gz (321 kB)
     |████████████████████████████████| 321 kB 15.4 MB/s 
Collecting anyascii
  Downloading anyascii-0.3.0-py3-none-any.whl (284 kB)
     |████████████████████████████████| 284 kB 65.8 MB/s 
Building wheels for collected packages: pyahocorasick
  Building wheel for pyahocorasick (setup.py) ... done
  Created wheel for pyahocorasick: filename=pyahocorasick-1.4.2-cp37-cp37m-linux_x86_64.whl size=85435 sha256=4dafead148e17709070aed89d010a71d5a61d24c5ed5f8a213d3da025cad17d4
  Stored in directory: /root/.cache/pip/wheels/25/19/a6/8f363d9939162782bb8439d886469756271abc01f76fbd790f
Successfully built pyahocorasick
Installing collected packages: pyahocorasick, anyascii, textsearch, contractions
Successfully instal

True

In [None]:
import contractions
import re
import pandas as pd
import numpy as np

from collections import Counter

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.feature_extraction.text import CountVectorizer

## Load and view dataset

In [None]:
train_data = pd.read_csv('/content/drive/MyDrive/dphi_dataset/dphi_nlp_sarcastic_headline_train.csv')
train_data.head()

Unnamed: 0,headline,is_sarcastic
0,supreme court votes 7-2 to legalize all worldl...,1
1,hungover man horrified to learn he made dozens...,1
2,emily's list founder: women are the 'problem s...,0
3,send your kids back to school with confidence,0
4,watch: experts talk pesticides and health,0


## Exploratory Data Analysis

In [None]:
train_data.shape

(44262, 2)

In [None]:
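# Note: isna().count() counts the rows of the boolean mask, so it reports the row count
# for every column whether or not values are missing; isna().sum() would count the nulls.
train_data.isna().count()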

headline        44262
is_sarcastic    44262
dtype: int64
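As a minimal sketch of the difference between the two calls, the toy frame below (invented values, not the competition data) has one missing headline. `isna().count()` still reports the full row count, while `isna().sum()` reports the number of missing cells.

```python
import pandas as pd

# Toy frame with one missing headline; the values are invented for illustration.
toy = pd.DataFrame({
    "headline": ["man bites dog", None, "dog sleeps"],
    "is_sarcastic": [1, 0, 0],
})

row_counts = toy.isna().count()   # rows in the boolean mask, per column
null_counts = toy.isna().sum()    # True values in the mask, i.e. missing cells

print(row_counts)
print(null_counts)
```

Applied to `train_data`, a result of all zeros from `isna().sum()` would be the evidence that no values are missing.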

In [None]:
train_data['is_sarcastic'].value_counts()

0    23958
1    20304
Name: is_sarcastic, dtype: int64

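From the EDA, the dataset has 44,262 samples, of which 20,304 are sarcastic and the remaining 23,958 are non-sarcastic. The null check above relies on `isna().count()`, which only returns the row count per column, so the absence of null values still needs to be confirmed with `isna().sum()`.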

## Text Wrangling

In [None]:
# Text pre-processing and wrangling

# NLTK's English stopword list, stored as a set for fast membership tests
# (negation words such as "no" and "not" are part of this list and are removed too)
stop_words = set(nltk.corpus.stopwords.words('english'))

# load up a simple porter stemmer - nothing fancy
ps = nltk.porter.PorterStemmer()

def text_preprocessor(document): 
    # lower case
    document = str(document).lower()
    
    # expand contractions
    document = contractions.fix(document)
    
    # remove unnecessary characters
    document = re.sub(r'[^a-zA-Z]',r' ', document)
    document = re.sub(r'nbsp', r'', document)
    document = re.sub(' +', ' ', document)
    
    # simple porter stemming
    document = ' '.join([ps.stem(word) for word in document.split()])
    
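    # stopwords removal (runs on stemmed tokens, so a stopword whose stem differs
    # from the original word, e.g. 'this' -> 'thi', is not removed)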
    document = ' '.join([word for word in document.split() if word not in stop_words])
    
    return document

clean_text = np.vectorize(text_preprocessor)
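To make the character-cleaning step concrete, here is a small sketch that mirrors only the regex part of `text_preprocessor` (lower-casing, replacing non-letters with spaces, collapsing repeated spaces). The helper name `strip_non_letters` and the sample strings are hypothetical; contraction expansion, stemming and stopword removal are deliberately left out so the snippet needs nothing beyond the standard library.

```python
import re

def strip_non_letters(text):
    """Lower-case the text and keep only letters separated by single spaces."""
    text = text.lower()
    text = re.sub(r'[^a-z]', ' ', text)   # digits and punctuation become spaces
    text = re.sub(r' +', ' ', text)       # collapse runs of spaces
    return text.strip()

sample = strip_non_letters("Supreme Court votes 7-2!")
possessive = strip_non_letters("emily's list founder: women")
print(sample)
print(possessive)
```

Note that an apostrophe also becomes a space, so a possessive such as "emily's" leaves a stray single-letter token "s" behind.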

In [None]:
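def get_freq_words(data, n=2000):
  # count how often each token occurs across all cleaned headlines
  word_freq = Counter()
  for heading in data['clean_headline'].values:
    word_freq.update(heading.split())

  # debug print: this looks up the integer key n, which is never a token, so it prints 0
  print(word_freq[n])

  # keep the n most frequent tokens as the vocabulary
  most_freq_words = {word for word, freq in word_freq.most_common(n)}
  return most_freq_words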
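The following sketch shows the same counting idea on three invented headlines: a `Counter` accumulates token frequencies, `most_common(n)` yields the top-`n` vocabulary, and a headline is then filtered down to that vocabulary. The sample headlines and the variable names are assumptions made for illustration only.

```python
from collections import Counter

# Invented, already-cleaned headlines.
headlines = ["man bites dog", "man bites man", "man sleeps"]

word_freq = Counter()
for heading in headlines:
    word_freq.update(heading.split())

# Vocabulary made of the 2 most frequent tokens.
top_words = {word for word, freq in word_freq.most_common(2)}

# Keep only in-vocabulary tokens, as text_postprocessor does.
filtered = ' '.join(word for word in "man bites dog".split() if word in top_words)

# Indexing a Counter with a key it has never seen returns 0.
missing_key_count = word_freq[2]
```

Here "man" occurs four times and "bites" twice, so the vocabulary is `{"man", "bites"}` and "dog" is dropped from the filtered headline.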

In [None]:
def text_postprocessor(document, most_freq_words):
  document = ' '.join([word for word in document.split() if word in most_freq_words])
  return document
parse_freq_words = np.vectorize(text_postprocessor)

In [None]:
def text_wrangling(data, freq_words, n=2000):
  data['clean_headline'] = clean_text(data['headline'].values)
  if freq_words is None:
    freq_words = get_freq_words(data, n)
  print(len(freq_words))
  data['main_headline'] = parse_freq_words(data['clean_headline'].values, freq_words)
  return freq_words

In [None]:
freq_words_set = text_wrangling(train_data, None, 5000)
train_data.head()

0
5000


Unnamed: 0,headline,is_sarcastic,clean_headline,main_headline
0,supreme court votes 7-2 to legalize all worldl...,1,suprem court vote legal worldli vice,suprem court vote legal vice
1,hungover man horrified to learn he made dozens...,1,hungov man horrifi learn made dozen plan last ...,hungov man horrifi learn made dozen plan last ...
2,emily's list founder: women are the 'problem s...,0,emili list founder women problem solver congress,list founder women problem congress
3,send your kids back to school with confidence,0,send kid back school confid,send kid back school confid
4,watch: experts talk pesticides and health,0,watch expert talk pesticid health,watch expert talk health


## Bag of words text representation

In [None]:
# create text representation model
from sklearn.feature_extraction.text import CountVectorizer

def get_bag_of_words(data):
  cv = CountVectorizer(min_df=0.0, max_df=1.0, ngram_range=(1, 1))
  bag_of_words = cv.fit_transform(data['main_headline']).toarray()
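  # get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it (1.0+)
  return pd.DataFrame(bag_of_words, columns=cv.get_feature_names_out())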

In [None]:
train_data_cv = get_bag_of_words(train_data)
train_data_cv.shape

(44262, 4982)
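To illustrate what the bag-of-words matrix contains, here is a pure-Python sketch of the counting that `CountVectorizer` performs with its default settings: tokens are runs of two or more word characters, the vocabulary is sorted, and each row holds per-document counts. The function `bag_of_words` and the sample documents are hypothetical. Because single-letter tokens are discarded by the default token pattern, this is one plausible reason (an assumption, not checked against the data) why the matrix has 4982 columns rather than 5000.

```python
import re
from collections import Counter

# Default CountVectorizer token pattern: tokens of 2 or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def bag_of_words(documents):
    """Return (sorted vocabulary, list of count rows) for the given documents."""
    tokenized = [TOKEN_PATTERN.findall(doc.lower()) for doc in documents]
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows

docs = ["send kid back school", "u s send kid"]
vocabulary, rows = bag_of_words(docs)
print(vocabulary)
print(rows)
```

The single-letter tokens "u" and "s" never reach the vocabulary, so they contribute no columns.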

## Applying Logistic Regression

In [None]:
# Train-Test Split
X_train, X_test = train_data_cv[:32000], train_data_cv[32000:]
Y_train, Y_test = train_data['is_sarcastic'][:32000], train_data['is_sarcastic'][32000:]
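The split above is positional, which implicitly assumes the rows are not ordered in a way that correlates with the label. As an alternative sketch, `train_test_split` (imported earlier but not used) shuffles the rows and can stratify on the target so that both parts keep the same class balance. The toy arrays below are invented; on the real data the inputs would be `train_data_cv` and `train_data['is_sarcastic']`.

```python
from sklearn.model_selection import train_test_split

# Toy features and balanced labels, invented for illustration.
X_toy = [[i] for i in range(10)]
y_toy = [0, 1] * 5

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy,
    test_size=0.2,       # hold out 20% of the rows
    stratify=y_toy,      # keep the class ratio in both parts
    random_state=42,     # reproducible shuffle
)
print(len(X_tr), len(X_te), sorted(y_te))
```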

In [None]:
# model training and evaluation
lr = LogisticRegression(C=1, random_state=42, solver='liblinear')

lr.fit(X_train, Y_train)
predictions = lr.predict(X_test)

print(classification_report(Y_test, predictions))
pd.DataFrame(confusion_matrix(Y_test, predictions))

              precision    recall  f1-score   support

           0       0.83      0.86      0.84      6596
           1       0.83      0.79      0.81      5666

    accuracy                           0.83     12262
   macro avg       0.83      0.83      0.83     12262
weighted avg       0.83      0.83      0.83     12262



Unnamed: 0,0,1
0,5682,914
1,1173,4493
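The figures in the classification report can be recomputed from the confusion matrix, where rows are true labels and columns are predicted labels. The short sketch below does the arithmetic for the sarcastic class (label 1) using the four counts shown above; the variable names are chosen here for readability.

```python
# Counts copied from the confusion matrix above (rows: true label, columns: predicted label).
tn, fp = 5682, 914     # true label 0
fn, tp = 1173, 4493    # true label 1

precision_1 = tp / (tp + fp)                  # of headlines predicted sarcastic, share that are
recall_1 = tp / (tp + fn)                     # of sarcastic headlines, share that were found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(round(precision_1, 2), round(recall_1, 2), round(f1_1, 2), round(accuracy, 2))
```

These round to the 0.83 precision, 0.79 recall, 0.81 f1-score and 0.83 accuracy listed in the report.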


With about 83% accuracy on the held-out rows, we use logistic regression on the bag-of-words representation for the final predictions.

## Predicting on actual test data

### Loading data

In [None]:
# loading training and test data
import pandas as pd

train_data = pd.read_csv('/content/drive/MyDrive/dphi_dataset/dphi_nlp_sarcastic_headline_train.csv')
test_data = pd.read_csv('/content/drive/MyDrive/dphi_dataset/dphi_nlp_sarcastic_headline_test.csv')

### Wrangling data

In [None]:
freq_words_set = text_wrangling(train_data, None, 2000)
text_wrangling(test_data, freq_words_set, 2000)

In [None]:
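# Fit the vocabulary on the training headlines only and reuse it for the test headlines,
# so both matrices are guaranteed to share the same columns in the same order
cv = CountVectorizer(min_df=0.0, max_df=1.0, ngram_range=(1, 1))
feature_names = cv.fit(train_data['main_headline']).get_feature_names_out()
train_data_cv = pd.DataFrame(cv.transform(train_data['main_headline']).toarray(), columns=feature_names)
test_data_cv = pd.DataFrame(cv.transform(test_data['main_headline']).toarray(), columns=feature_names)

print(train_data_cv.shape, test_data_cv.shape)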

(44262, 1983) (11066, 1983)


In [None]:
# generate prediction
lr = LogisticRegression(C=1, random_state=42, solver='liblinear')

lr.fit(train_data_cv, train_data['is_sarcastic'])
predictions = lr.predict(test_data_cv)

### Write output to file

In [None]:
test_data['prediction'] = predictions
test_data.head()

Unnamed: 0,headline,clean_headline,main_headline,prediction
0,area stand-up comedian questions the deal with...,area stand comedian question deal drive thru w...,area stand comedian question deal drive window,1
1,dozens of glowing exit signs mercilessly taunt...,dozen glow exit sign mercilessli taunt multipl...,dozen exit sign employe,1
2,perfect response to heckler somewhere in prop ...,perfect respons heckler somewher prop comedian...,perfect respons comedian,0
3,gop prays for ossoff lossoff,gop pray ossoff lossoff,gop,0
4,trevor noah says the scary truth about trump's...,trevor noah say scari truth trump rumor love c...,trevor noah say truth trump love child,0


In [None]:
output_df = test_data[['prediction']]
output_df.to_csv('/content/drive/MyDrive/dphi_dataset/dphi_nlp_sarcastic_headline_output.csv', index=False)

In [None]:
test_data['prediction'].value_counts()

0    6375
1    4691
Name: prediction, dtype: int64