# **Sarcasm Detection in News Headlines**

**Objective:**

To use Logistic Regression to predict whether a news headline is sarcastic or not, based on linguistic features and sentiment analysis.

## Introduction
In this project, we aim to detect sarcasm in news headlines using logistic regression. Sarcasm is a nuanced form of communication that can often be difficult for machines to understand. By leveraging linguistic features such as sentiment analysis and other text-based characteristics, we will train a machine learning model to differentiate between sarcastic and non-sarcastic headlines.


## Data Loading

The training data consists of news headlines and a label indicating whether each headline is sarcastic (`1`) or not (`0`). The test data contains headlines only. We first import the required libraries and then load both datasets.

In [None]:
#Import Libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd

#Text preprocessing
import nltk
import re
import numpy as np
import contractions
import tqdm
import textsearch
import string
import textblob

#Model training
from sklearn.linear_model import LogisticRegression

In [2]:
#Import and Loading Dataset
df_train = pd.read_csv('assign_data/Train_Data.csv')
df_test = pd.read_csv('assign_data/Test_Data.csv')

df_train.head()

## Exploratory Data Analysis

We first explore the dataset to check for missing values and data types. The output of `df_train.info()` shows 44,262 non-null entries in both columns, so the training data has no missing values.


In [4]:
#Checking missing data
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 44262 entries, 0 to 44261
Data columns (total 2 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   headline      44262 non-null  object
 1   is_sarcastic  44262 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 691.7+ KB


In [5]:
df_test.head()

Unnamed: 0,headline
0,area stand-up comedian questions the deal with...
1,dozens of glowing exit signs mercilessly taunt...
2,perfect response to heckler somewhere in prop ...
3,gop prays for ossoff lossoff
4,trevor noah says the scary truth about trump's...


## Feature Engineering
We begin by selecting the relevant features from the dataset. The headline text is used as the primary feature for both training and testing, and the target variable is the 'is_sarcastic' column from the training dataset. 

In the code below:

- X_train contains the headlines from the training dataset.
- Y_train represents the labels indicating whether each headline is sarcastic (1) or not (0).
- X_test consists of the headlines from the test dataset, which will be used for making predictions.

This step isolates the textual data from the rest of the dataset so it can later be processed for further feature extraction and analysis.

In [6]:
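# Copy the selected columns so feature columns can be added later without a SettingWithCopyWarning
X_train, Y_train = df_train[['headline']].copy(), df_train[['is_sarcastic']]
X_test = df_test[['headline']].copy()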

In [8]:
X_train.head()

Unnamed: 0,headline
0,supreme court votes 7-2 to legalize all worldl...
1,hungover man horrified to learn he made dozens...
2,emily's list founder: women are the 'problem s...
3,send your kids back to school with confidence
4,watch: experts talk pesticides and health


## Text Preprocessing

Before feeding the text into the logistic regression model, we transform the raw headlines into numerical features that may be indicative of sarcasm. The same steps are applied to both the training and test datasets.

First, we derive surface-level features from the raw headlines: character count, word count, word density, punctuation count, and counts of title-case and uppercase words. Later sections add sentiment scores (polarity and subjectivity) and a cleaned version of each headline, produced by expanding contractions, removing non-alphabetic characters, stemming, and removing stopwords.

In [11]:
# Pre-processing train and test data
X_train['char_count'] = X_train['headline'].apply(len)
X_train['word_count'] = X_train['headline'].apply(lambda x: len(x.split()))
X_train['word_density'] = X_train['char_count'] / (X_train['word_count']+1)
X_train['punctuation_count'] = X_train['headline'].apply(lambda x: len("".join(_ for _ in x if _ in string.punctuation))) 
X_train['title_word_count'] = X_train['headline'].apply(lambda x: len([wrd for wrd in x.split() if wrd.istitle()]))
X_train['upper_case_word_count'] = X_train['headline'].apply(lambda x: len([wrd for wrd in x.split() if wrd.isupper()]))

X_test['char_count'] = X_test['headline'].apply(len)
X_test['word_count'] = X_test['headline'].apply(lambda x: len(x.split()))
X_test['word_density'] = X_test['char_count'] / (X_test['word_count']+1)
X_test['punctuation_count'] = X_test['headline'].apply(lambda x: len("".join(_ for _ in x if _ in string.punctuation))) 
X_test['title_word_count'] = X_test['headline'].apply(lambda x: len([wrd for wrd in x.split() if wrd.istitle()]))
X_test['upper_case_word_count'] = X_test['headline'].apply(lambda x: len([wrd for wrd in x.split() if wrd.isupper()]))

In [12]:
X_train.head()

Unnamed: 0,headline,char_count,word_count,word_density,punctuation_count,title_word_count,upper_case_word_count
0,supreme court votes 7-2 to legalize all worldl...,53,9,5.3,1,0,0
1,hungover man horrified to learn he made dozens...,66,12,5.076923,0,0,0
2,emily's list founder: women are the 'problem s...,65,10,5.909091,4,0,0
3,send your kids back to school with confidence,45,8,5.0,0,0,0
4,watch: experts talk pesticides and health,41,6,5.857143,1,0,0


In [13]:
X_test.head()

Unnamed: 0,headline,char_count,word_count,word_density,punctuation_count,title_word_count,upper_case_word_count
0,area stand-up comedian questions the deal with...,65,9,6.5,2,0,0
1,dozens of glowing exit signs mercilessly taunt...,65,9,6.5,0,0,0
2,perfect response to heckler somewhere in prop ...,62,9,6.2,1,0,0
3,gop prays for ossoff lossoff,28,5,4.666667,0,0,0
4,trevor noah says the scary truth about trump's...,65,11,5.416667,1,0,0


We have extracted the following features:

- Character Count: Measures the total number of characters in a headline. Longer headlines may contain more clues about sarcasm.
- Word Count: Represents the total number of words in a headline, which helps in identifying complex or short phrases.
- Word Density: Calculated as the character count divided by (word count + 1). This approximates the average word length in a headline; the +1 in the denominator avoids division by zero for empty headlines.
- Punctuation Count: Measures the number of punctuation marks. Sarcastic headlines might include more punctuation to emphasize tone.
- Title Word Count: Tracks how many words in the headline are title-cased (i.e., capitalized). Headlines with more title-case words may indicate formality or emphasis.
- Uppercase Word Count: Counts the number of fully uppercase words, which might suggest a sarcastic or exaggerated tone.
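
These features help the model pick up linguistic patterns that could be correlated with sarcasm. In the sample rows shown above, the title-case and uppercase counts are all 0 because those headlines are entirely lowercase.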
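
As a quick check on the definitions above, the following sketch computes the same surface features for a single headline using only the standard library. The function name `surface_features` is illustrative and not part of the notebook; the `+1` in the word-density denominator mirrors the preprocessing code.

```python
import string


def surface_features(headline: str) -> dict:
    """Compute the surface features described above for one headline."""
    words = headline.split()
    char_count = len(headline)
    word_count = len(words)
    return {
        "char_count": char_count,
        "word_count": word_count,
        # The +1 follows the notebook's formula and avoids division by zero.
        "word_density": char_count / (word_count + 1),
        "punctuation_count": sum(ch in string.punctuation for ch in headline),
        "title_word_count": sum(w.istitle() for w in words),
        "upper_case_word_count": sum(w.isupper() for w in words),
    }


features = surface_features("watch: experts talk pesticides and health")
print(features)
```

For this headline the function gives 41 characters, 6 words, and one punctuation mark, which agrees with the corresponding row of the `X_train.head()` output.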

## Model Building

In this section, we will focus on building features from the sentiment analysis of headlines and preprocessing the text data for use in machine learning.

1. **Features from Sentiment Analysis:**

We use TextBlob to extract two sentiment-based features, polarity and subjectivity, for each headline:

- Polarity: A score between -1 (negative sentiment) and 1 (positive sentiment).
- Subjectivity: A score between 0 (objective) and 1 (subjective).


In [14]:
#Features from Sentiment analysis
x_train_snt_obj = X_train['headline'].apply(lambda row: textblob.TextBlob(row).sentiment)
X_train['Polarity'] = [obj.polarity for obj in x_train_snt_obj.values]
X_train['Subjectivity'] = [obj.subjectivity for obj in x_train_snt_obj.values]

x_test_snt_obj = X_test['headline'].apply(lambda row: textblob.TextBlob(row).sentiment)
X_test['Polarity'] = [obj.polarity for obj in x_test_snt_obj.values]
X_test['Subjectivity'] = [obj.subjectivity for obj in x_test_snt_obj.values]

**Output:** This creates two new columns in the dataset:

- Polarity: The sentiment polarity score.
- Subjectivity: The sentiment subjectivity score.

In [15]:
X_train.head()

Unnamed: 0,headline,char_count,word_count,word_density,punctuation_count,title_word_count,upper_case_word_count,Polarity,Subjectivity
0,supreme court votes 7-2 to legalize all worldl...,53,9,5.3,1,0,0,0.0,0.0
1,hungover man horrified to learn he made dozens...,66,12,5.076923,0,0,0,0.0,0.066667
2,emily's list founder: women are the 'problem s...,65,10,5.909091,4,0,0,0.0,0.0
3,send your kids back to school with confidence,45,8,5.0,0,0,0,0.0,0.0
4,watch: experts talk pesticides and health,41,6,5.857143,1,0,0,0.0,0.0


2. **Text Preprocessing and Feature Engineering:**

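We use the Natural Language Toolkit (nltk) for its English stopword list and Porter stemmer. Each headline is lowercased, its contractions are expanded, non-alphabetic characters are replaced with spaces, the remaining words are stemmed, and stopwords are then removed. The words 'no', 'not', and 'but' are taken out of the stopword list first, so they survive cleaning and negation remains visible to the sentiment analysis.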

In [16]:
nltk.download('punkt')
nltk.download('stopwords')
# Text pre-processing and wrangling
# Drop 'no', 'not', and 'but' from the stopword list so negation is kept in the cleaned text
stop_words = nltk.corpus.stopwords.words('english')
stop_words.remove('no')
stop_words.remove('not')
stop_words.remove('but')

# load up a simple porter stemmer - nothing fancy
ps = nltk.porter.PorterStemmer()

def simple_text_preprocessor(document): 
    # lower case
    document = str(document).lower()
    
    # expand contractions
    document = contractions.fix(document)
    
    # remove unnecessary characters
    document = re.sub(r'[^a-zA-Z]',r' ', document)
    document = re.sub(r'nbsp', r'', document)
    document = re.sub(' +', ' ', document)
    
    # simple porter stemming
    document = ' '.join([ps.stem(word) for word in document.split()])
    
    # stopwords removal
    document = ' '.join([word for word in document.split() if word not in stop_words])
    
    return document

stp = np.vectorize(simple_text_preprocessor)

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Think\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Think\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [17]:
X_train['Clean Headline'] = stp(X_train['headline'].values)
X_test['Clean Headline'] = stp(X_test['headline'].values)

X_train.head()

Unnamed: 0,headline,char_count,word_count,word_density,punctuation_count,title_word_count,upper_case_word_count,Polarity,Subjectivity,Clean Headline
0,supreme court votes 7-2 to legalize all worldl...,53,9,5.3,1,0,0,0.0,0.0,suprem court vote legal worldli vice
1,hungover man horrified to learn he made dozens...,66,12,5.076923,0,0,0,0.0,0.066667,hungov man horrifi learn made dozen plan last ...
2,emily's list founder: women are the 'problem s...,65,10,5.909091,4,0,0,0.0,0.0,emili list founder women problem solver congress
3,send your kids back to school with confidence,45,8,5.0,0,0,0,0.0,0.0,send kid back school confid
4,watch: experts talk pesticides and health,41,6,5.857143,1,0,0,0.0,0.0,watch expert talk pesticid health


**Output:** The Clean Headline column is added, which contains the preprocessed version of the headlines.
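
To make the effect of the cleaning steps concrete, here is a minimal standard-library sketch of the lowercasing, character filtering, and stopword removal. It deliberately omits contraction expansion and stemming, which need the `contractions` and `nltk` packages. The function name `basic_clean` and the tiny stopword set are hypothetical stand-ins for the notebook's `simple_text_preprocessor` and NLTK's list.

```python
import re

# Hypothetical, tiny stopword set for illustration only; the notebook uses
# NLTK's English list with "no", "not", and "but" removed from it.
STOP_WORDS = {"the", "is", "a", "to", "and", "of"}


def basic_clean(document: str) -> str:
    """Lowercase, keep letters only, collapse spaces, and drop stopwords."""
    document = str(document).lower()
    document = re.sub(r"[^a-z]", " ", document)      # non-letters become spaces
    document = re.sub(r" +", " ", document).strip()  # collapse repeated spaces
    return " ".join(w for w in document.split() if w not in STOP_WORDS)


cleaned = basic_clean("The plan is NOT a success, but 7-2 to the court!")
print(cleaned)  # plan not success but court
```

Because "not" and "but" are absent from the stopword set, they remain in the cleaned text, which is the behavior the notebook aims for when it edits NLTK's list.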

3. **Structured Features:**

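Finally, we recompute polarity and subjectivity on the cleaned headlines, overwriting the values computed earlier from the raw text, and then drop the two text columns. What remains are the structured numeric features (character count, word count, word density, punctuation and casing counts, polarity, and subjectivity) that the model trains on.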

In [18]:
# Recompute sentiment features on the cleaned headlines (overwrites the earlier Polarity and Subjectivity values)
x_train_snt_obj = X_train['Clean Headline'].apply(lambda row: textblob.TextBlob(row).sentiment)
X_train['Polarity'] = [obj.polarity for obj in x_train_snt_obj.values]
X_train['Subjectivity'] = [obj.subjectivity for obj in x_train_snt_obj.values]

x_test_snt_obj = X_test['Clean Headline'].apply(lambda row: textblob.TextBlob(row).sentiment)
X_test['Polarity'] = [obj.polarity for obj in x_test_snt_obj.values]
X_test['Subjectivity'] = [obj.subjectivity for obj in x_test_snt_obj.values]


In [19]:
X_train.head()

Unnamed: 0,headline,char_count,word_count,word_density,punctuation_count,title_word_count,upper_case_word_count,Polarity,Subjectivity,Clean Headline
0,supreme court votes 7-2 to legalize all worldl...,53,9,5.3,1,0,0,0.2,0.2,suprem court vote legal worldli vice
1,hungover man horrified to learn he made dozens...,66,12,5.076923,0,0,0,0.0,0.066667,hungov man horrifi learn made dozen plan last ...
2,emily's list founder: women are the 'problem s...,65,10,5.909091,4,0,0,0.0,0.0,emili list founder women problem solver congress
3,send your kids back to school with confidence,45,8,5.0,0,0,0,0.0,0.0,send kid back school confid
4,watch: experts talk pesticides and health,41,6,5.857143,1,0,0,0.0,0.0,watch expert talk pesticid health


In [20]:
#Extracting out the structured features from previous experiments
X_train_metadata = X_train.drop(['headline', 'Clean Headline'], axis=1).reset_index(drop=True)
X_test_metadata = X_test.drop(['headline', 'Clean Headline'], axis=1).reset_index(drop=True)

X_train_metadata.head()

Unnamed: 0,char_count,word_count,word_density,punctuation_count,title_word_count,upper_case_word_count,Polarity,Subjectivity
0,53,9,5.3,1,0,0,0.2,0.2
1,66,12,5.076923,0,0,0,0.0,0.066667
2,65,10,5.909091,4,0,0,0.0,0.0
3,45,8,5.0,0,0,0,0.0,0.0
4,41,6,5.857143,1,0,0,0.0,0.0


## Model Training: Logistic Regression

We now proceed to train a Logistic Regression model on the features extracted from the headlines. Logistic Regression is a binary classification algorithm, and here it will predict whether a given headline is sarcastic or not based on the features created in the previous steps.

- C=1: The inverse of the regularization strength. It controls the trade-off between fitting the training data closely and generalizing well; a smaller C means stronger regularization.
- random_state=42: Ensures reproducibility by setting a seed for random number generation.
- solver='liblinear': This solver is particularly efficient for small datasets and binary classification.
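
For reference, the role of C can be read off the L2-penalized objective that scikit-learn documents for binary logistic regression. The notation below is illustrative: labels are recoded as -1 and +1, w is the weight vector, b is the intercept, and x_i is the feature vector of headline i. Solvers differ in how they treat the intercept, so this should be read as a sketch of the general form rather than the exact liblinear objective.

```latex
\min_{w,\,b}\; \frac{1}{2}\, w^{\top} w \;+\; C \sum_{i=1}^{n} \log\!\left(1 + \exp\!\left(-y_i \left(w^{\top} x_i + b\right)\right)\right),
\qquad y_i \in \{-1, +1\}
```

Since C multiplies the data-fit term, lowering C gives the penalty on w relatively more weight, which is why a smaller C corresponds to stronger regularization.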

Fitting the model estimates one coefficient per feature (word count, polarity, subjectivity, etc.), relating each feature to whether the headline is sarcastic (is_sarcastic).
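
The sketch below illustrates how a fitted logistic regression turns one feature vector into a probability and then a label. The weights, intercept, and feature values are made up for illustration; a fitted scikit-learn model exposes its learned values as `coef_` and `intercept_`.

```python
import math


def predict_proba(features, weights, intercept):
    """Return P(sarcastic) = sigmoid(w . x + b) for one feature vector."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical weights for (word_count, punctuation_count, Polarity).
weights = [0.1, -0.5, 0.8]
intercept = -1.0
x = [9, 1, 0.2]

p = predict_proba(x, weights, intercept)
label = int(p >= 0.5)  # predict class 1 when the probability reaches 0.5
print(round(p, 3), label)
```

With these made-up numbers the linear score is negative, so the probability falls below 0.5 and the headline is labeled non-sarcastic (`0`).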

In [21]:
# Instantiate the Logistic Regression model
lr = LogisticRegression(C=1, random_state=42, solver='liblinear')

# Train the model on the training dataset; ravel() passes the labels as the 1-D array scikit-learn expects
lr.fit(X_train_metadata, Y_train.values.ravel())

## Prediction and Results

After training the Logistic Regression model, the next step is to make predictions on the test dataset and evaluate the model's performance.

In [22]:
# Make predictions on the test dataset
predictions = lr.predict(X_test_metadata)
df_predict=pd.DataFrame(predictions, columns = ['prediction'])



In [26]:
df_predict.head()

Unnamed: 0,prediction
0,0
1,1
2,1
3,0
4,1


In [27]:
df_predict.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11066 entries, 0 to 11065
Data columns (total 1 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   prediction  11066 non-null  int64
dtypes: int64(1)
memory usage: 86.6 KB


**Performance Metrics**

To evaluate how well the model performs, we can use several metrics. Because the test dataset contains no labels, they have to be computed on labeled data, such as a held-out portion of the training set.

- Accuracy: The proportion of correctly classified headlines. It gives a straightforward measure of how often the classifier is correct.
- Confusion Matrix: A table of the counts of true positives, true negatives, false positives, and false negatives. It shows which types of errors the model makes.
- Precision, Recall, and F1-Score: Metrics that give deeper insight into performance, especially when the classes are imbalanced.
- Classification Report: A per-class summary of precision, recall, and F1-score.
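
To show how these metrics relate to one another, the following sketch derives them from the four confusion-matrix counts in plain Python. The function name `binary_metrics` and the label lists are made up for illustration and are not results from this model.

```python
def binary_metrics(y_true, y_pred):
    """Confusion-matrix counts and derived metrics for labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "tp": tp, "tn": tn, "fp": fp, "fn": fn,
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up labels for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
m = binary_metrics(y_true, y_pred)
print(m)
```

Here one sarcastic headline is missed (a false negative) and one non-sarcastic headline is flagged (a false positive), so accuracy, precision, recall, and F1 all come out to 0.75.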

In [24]:
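from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.model_selection import train_test_split

# The test set has no labels (Y_test does not exist), so evaluate on a held-out
# part of the training data. The 20% split size is an arbitrary choice.
y_all = Y_train.values.ravel()
X_fit, X_val, y_fit, y_val = train_test_split(
    X_train_metadata, y_all, test_size=0.2, random_state=42, stratify=y_all)
lr_val = LogisticRegression(C=1, random_state=42, solver='liblinear').fit(X_fit, y_fit)
val_predictions = lr_val.predict(X_val)

# Compute accuracy
accuracy = accuracy_score(y_val, val_predictions)
print(f'Accuracy: {accuracy:.2f}')

# Compute confusion matrix
conf_matrix = confusion_matrix(y_val, val_predictions)
print('Confusion Matrix:')
print(conf_matrix)

# Compute precision, recall, and F1-score
class_report = classification_report(y_val, val_predictions)
print('Classification Report:')
print(class_report)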

In [28]:
df_predict.to_csv('prediction.csv')

## Summary
In this project, we built a logistic regression model to detect sarcasm in news headlines from surface-level linguistic features and sentiment scores. The model achieved an accuracy of X% on the test data. The approach relies on sentiment polarity and simple text statistics, so further improvements could come from incorporating deeper semantic analysis or more advanced algorithms.