# Scenario: Analyzing and Segregating News Headlines for **Sarcasm Detection**

Past studies in Sarcasm Detection mostly make use of Twitter datasets collected using hashtag-based supervision, but such datasets are noisy in terms of labels and language. Furthermore, many tweets are replies to other tweets, and detecting sarcasm in these requires the availability of contextual tweets.

To overcome the limitations related to noise in Twitter datasets, this News Headlines dataset for Sarcasm Detection is collected from two news websites. TheOnion aims at producing sarcastic versions of current events, and we collected all the headlines from its News in Brief and News in Photos categories (which are sarcastic). We collected real (and non-sarcastic) news headlines from HuffPost.



### **Dataset Description:**

The data set contains the following attributes:

- **is_sarcastic**: 1 if the record is sarcastic otherwise 0

- **headline**: the headline of the news article

- **article_link**: link to the original news article. Useful in collecting supplementary data

### **Tasks to be performed:**

- Download the data set from Dropbox and install dependencies
- Import required libraries and load the dataset
- Perform Exploratory Data Analysis (EDA)
 - Analyze the data using **Pandas Profiling** and record your observations
 - Use **Sweetviz** to visualize the columns present in the data set
 - Analyze the target variable **is_sarcastic**
- Implement Text Pre-processing 
- Implement TF-IDF Vectorizer
- Split the data set into training and testing sets using the **train_test_split** function from sklearn
- Model Building 
 - Bernoulli Classifier
- Model Evaluation




### **Downloading the data set from Dropbox and installing dependencies**

In [None]:
#Installing Pandas Profiling

!pip install pandas-profiling==2.7.1 

In [None]:
#Installing Sweetviz

!pip install sweetviz

In [1]:
#Downloading the dataset from Dropbox 

!wget https://www.dropbox.com/s/ztu2gdjau2otq2a/Sarcasm_Headlines_Dataset.json

--2023-05-21 02:53:59--  https://www.dropbox.com/s/ztu2gdjau2otq2a/Sarcasm_Headlines_Dataset.json
Resolving www.dropbox.com (www.dropbox.com)... 162.125.3.18, 2620:100:6016:18::a27d:112
Connecting to www.dropbox.com (www.dropbox.com)|162.125.3.18|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: /s/raw/ztu2gdjau2otq2a/Sarcasm_Headlines_Dataset.json [following]
--2023-05-21 02:53:59--  https://www.dropbox.com/s/raw/ztu2gdjau2otq2a/Sarcasm_Headlines_Dataset.json
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://uc6be5f1b1321e6b4b8143b179a4.dl.dropboxusercontent.com/cd/0/inline/B8cXbx_2haGp4px1pProuibnOpWsejAqyUVlpuN5vnxajx2vnDlVtPNQ4io8QWJRK_6Er2aKyMvmMmJ4MsgUigOiVCtDuiAhgcT7hVU62AmOqamdUrkCp1qGGYxfmD7cqEa1JyELZI-lIM0NAus56XXsbCJ57dkysUJCCzpo1VQRkQ/file# [following]
--2023-05-21 02:54:00--  https://uc6be5f1b1321e6b4b8143b179a4.dl.dropboxusercontent.com/cd/0/inline/B8cXbx_2haGp4px1pProuibnOp

### **Import required libraries and load the dataset**

In [2]:
#Importing required libraries

import spacy

import numpy as np
import pandas as pd

#import seaborn as sns
import matplotlib.pyplot as plt
#import plotly.express as px

import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

print('Libraries Imported')

Libraries Imported


In [3]:
#Reading the dataset
df = pd.read_json('Sarcasm_Headlines_Dataset.json',lines=True)
# Parameter: lines - Read the file as a json object per line
#Printing the top 5 values
print(df.shape)
df.head()

(26709, 3)


| | article_link | headline | is_sarcastic |
|---|---|---|---|
| 0 | https://www.huffingtonpost.com/entry/versace-b... | former versace store clerk sues over secret 'b... | 0 |
| 1 | https://www.huffingtonpost.com/entry/roseanne-... | the 'roseanne' revival catches up to our thorn... | 0 |
| 2 | https://local.theonion.com/mom-starting-to-fea... | mom starting to fear son's web series closest ... | 1 |
| 3 | https://politics.theonion.com/boehner-just-wan... | boehner just wants wife to listen, not come up... | 1 |
| 4 | https://www.huffingtonpost.com/entry/jk-rowlin... | j.k. rowling wishes snape happy birthday in th... | 0 |
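
With `lines=True`, pandas treats every line of the file as one standalone JSON object (the JSON Lines layout). As a minimal sketch of what that means, the snippet below parses two such lines using only the standard library. The two records and their `example.com` links are invented for illustration and are not rows from the dataset.

```python
import json

# Two made-up records in JSON Lines layout: one JSON object per line.
raw = (
    '{"article_link": "https://example.com/a", "headline": "first headline", "is_sarcastic": 0}\n'
    '{"article_link": "https://example.com/b", "headline": "second headline", "is_sarcastic": 1}\n'
)

# Parse each non-empty line on its own, which is what a per-line JSON file implies.
records = [json.loads(line) for line in raw.splitlines() if line.strip()]

print(len(records))               # number of rows
print(sorted(records[0].keys()))  # column names
```

Each parsed dictionary corresponds to one row of the DataFrame, and its keys become the column names.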


### **Exploratory Data Analysis**

**Analyzing the data using Pandas Profiling**

In [None]:
#Generating a Pandas Profiling Report 

import pandas_profiling
from pandas_profiling import ProfileReport
prof = ProfileReport(df)
prof.to_file(output_file='output.html')

Please refer to the generated HTML file, **output.html**, for the full report.

**Analyzing the data using Sweetviz**

**Sweetviz** is an open source Python library that generates beautiful, high-density visualizations to kickstart EDA (Exploratory Data Analysis) with a single line of code. **Output** is a fully self-contained **HTML** application.

The system is built around quickly visualizing target values and comparing datasets. Its goal is to help quick analysis of target characteristics, training vs testing data, and other such data characterization tasks.

**[Click Here!](https://pypi.org/project/sweetviz/)** to learn more about Sweetviz

In [None]:
# Importing sweetviz
import sweetviz as sv

#Analyzing the dataset
report = sv.analyze(df)

#Display the report
report.show_html('Sweetviz_Output.html')

In [4]:
df.columns

Index(['article_link', 'headline', 'is_sarcastic'], dtype='object')

In [6]:
df.is_sarcastic.value_counts()

0    14985
1    11724
Name: is_sarcastic, dtype: int64

In [7]:
print("Percentages for is_sarcastic values")
df.is_sarcastic.value_counts() * 100 /df.shape[0]

Percentages for is_sarcastic values


0    56.104684
1    43.895316
Name: is_sarcastic, dtype: float64
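
The percentages above are simply each class count divided by the total number of rows. The short sketch below redoes that arithmetic by hand, using the counts printed by `value_counts()`; in pandas, `value_counts(normalize=True)` should give the same shares as fractions.

```python
# Counts copied from the value_counts() output above.
counts = {0: 14985, 1: 11724}
total = sum(counts.values())  # 26709 rows in total

# Share of each class, expressed as a percentage of all rows.
percentages = {label: n * 100 / total for label, n in counts.items()}

for label, pct in percentages.items():
    print(label, round(pct, 2))
```

The classes are moderately imbalanced (roughly 56% vs 44%), which is worth keeping in mind when reading accuracy later on.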

**Check for Null Values**

In [8]:
df.isnull().sum()

article_link    0
headline        0
is_sarcastic    0
dtype: int64

In [9]:
df.head()

| | article_link | headline | is_sarcastic |
|---|---|---|---|
| 0 | https://www.huffingtonpost.com/entry/versace-b... | former versace store clerk sues over secret 'b... | 0 |
| 1 | https://www.huffingtonpost.com/entry/roseanne-... | the 'roseanne' revival catches up to our thorn... | 0 |
| 2 | https://local.theonion.com/mom-starting-to-fea... | mom starting to fear son's web series closest ... | 1 |
| 3 | https://politics.theonion.com/boehner-just-wan... | boehner just wants wife to listen, not come up... | 1 |
| 4 | https://www.huffingtonpost.com/entry/jk-rowlin... | j.k. rowling wishes snape happy birthday in th... | 0 |


In [10]:
df['num_words'] = df['headline'].apply(lambda x: len(str(x).split()))
df.head()



| | article_link | headline | is_sarcastic | num_words |
|---|---|---|---|---|
| 0 | https://www.huffingtonpost.com/entry/versace-b... | former versace store clerk sues over secret 'b... | 0 | 12 |
| 1 | https://www.huffingtonpost.com/entry/roseanne-... | the 'roseanne' revival catches up to our thorn... | 0 | 14 |
| 2 | https://local.theonion.com/mom-starting-to-fea... | mom starting to fear son's web series closest ... | 1 | 14 |
| 3 | https://politics.theonion.com/boehner-just-wan... | boehner just wants wife to listen, not come up... | 1 | 13 |
| 4 | https://www.huffingtonpost.com/entry/jk-rowlin... | j.k. rowling wishes snape happy birthday in th... | 0 | 11 |


In [11]:
maxWords = df['num_words'].max()
print('Maximum number of words', maxWords)
df.head()

Maximum number of words 39


| | article_link | headline | is_sarcastic | num_words |
|---|---|---|---|---|
| 0 | https://www.huffingtonpost.com/entry/versace-b... | former versace store clerk sues over secret 'b... | 0 | 12 |
| 1 | https://www.huffingtonpost.com/entry/roseanne-... | the 'roseanne' revival catches up to our thorn... | 0 | 14 |
| 2 | https://local.theonion.com/mom-starting-to-fea... | mom starting to fear son's web series closest ... | 1 | 14 |
| 3 | https://politics.theonion.com/boehner-just-wan... | boehner just wants wife to listen, not come up... | 1 | 13 |
| 4 | https://www.huffingtonpost.com/entry/jk-rowlin... | j.k. rowling wishes snape happy birthday in th... | 0 | 11 |


In [12]:
text = df[df['num_words'] == maxWords]['headline'].values
print('\nSentence:\n', text)
print(type(text))
print(text[0])


Sentence:
 ['elmore leonard, modern prose master, noted for his terse prose style and for writing about things perfectly and succinctly with a remarkable economy of words, unfortunately and sadly expired this gloomy tuesday at the age of 87 years old']
<class 'numpy.ndarray'>
elmore leonard, modern prose master, noted for his terse prose style and for writing about things perfectly and succinctly with a remarkable economy of words, unfortunately and sadly expired this gloomy tuesday at the age of 87 years old


### **Text Pre-processing**

#### Word tokenize
Splitting a sentence or a piece of text into individual words (tokens) is called word tokenization.

In [13]:
# Word tokenize
nlp = spacy.load('en_core_web_sm')
tokenCollection = nlp(text[0])

# List comprehension to collect the token texts
tokenList = [token.text for token in tokenCollection ]
print(tokenList)

['elmore', 'leonard', ',', 'modern', 'prose', 'master', ',', 'noted', 'for', 'his', 'terse', 'prose', 'style', 'and', 'for', 'writing', 'about', 'things', 'perfectly', 'and', 'succinctly', 'with', 'a', 'remarkable', 'economy', 'of', 'words', ',', 'unfortunately', 'and', 'sadly', 'expired', 'this', 'gloomy', 'tuesday', 'at', 'the', 'age', 'of', '87', 'years', 'old']
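
To see why the token list is longer than the 39 words counted earlier, it may help to compare a plain whitespace split with a punctuation-aware split. The sketch below uses a simple regular expression as a rough stand-in for a tokenizer; it is not how spaCy tokenizes internally, and it only happens to agree with spaCy on this particular sentence.

```python
import re

# The longest headline, copied from the output above.
headline = (
    "elmore leonard, modern prose master, noted for his terse prose style "
    "and for writing about things perfectly and succinctly with a remarkable "
    "economy of words, unfortunately and sadly expired this gloomy tuesday "
    "at the age of 87 years old"
)

# Whitespace split: punctuation stays glued to the preceding word.
whitespace_words = headline.split()

# Rough tokenizer: runs of word characters, or single punctuation marks.
tokens = re.findall(r"\w+|[^\w\s]", headline)

print(len(whitespace_words))
print(len(tokens))
print(tokens[:7])
```

The three commas become tokens of their own, which accounts for the difference between the two counts.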


#### Punctuation
The spaCy library defines several punctuation lists, such as **quotes, currency symbols, and general punctuation**.
In the sentence above, each comma is treated as a separate token, which is not useful for our analysis, so we will remove punctuation from the sentence.

In [14]:
# Data preprocessing
# Remove punctuation
print('Quotes:',spacy.lang.punctuation.LIST_QUOTES)
print('\nPunctuations:',spacy.lang.punctuation.LIST_PUNCT)
#print('\n Currency:',spacy.lang.punctuation.LIST_CURRENCY)


Quotes: ["\\'", '"', '”', '“', '`', '‘', '´', '’', '‚', ',', '„', '»', '«', '「', '」', '『', '』', '（', '）', '〔', '〕', '【', '】', '《', '》', '〈', '〉', '〈', '〉', '', '⟦', '⟧']

Punctuations: ['…', '……', ',', ':', ';', '\\!', '\\?', '¿', '؟', '¡', '\\(', '\\)', '\\[', '\\]', '\\{', '\\}', '<', '>', '_', '#', '\\*', '&', '。', '？', '！', '，', '、', '；', '：', '～', '·', '।', '،', '۔', '؛', '٪']


In [15]:

# token.is_punct flags punctuation tokens; we use it to collect the punctuation found in the sentence
punc = [token.text for token in tokenCollection  if  token.is_punct ]
print('\nPunctuation:',punc)


Punctuation: [',', ',', ',']


#### Stopword
In this step we identify the stop words in the sentence so that they can be removed from the dataset.

In [16]:
stopwords = list(spacy.lang.en.stop_words.STOP_WORDS)
print('Number of stopwords is','-'*20,len(stopwords))
print('Ten stop words',list(stopwords)[:10])
stop = [token.text for token in tokenCollection if token.is_stop]
print('*'*100,'\n\nStop word in sentence: ',stop)

Number of stopwords is -------------------- 326
Ten stop words ['some', 'together', 'others', 'everyone', 'everything', 'front', 'none', 'was', 'every', 'thus']
**************************************************************************************************** 

Stop word in sentence:  ['for', 'his', 'and', 'for', 'about', 'and', 'with', 'a', 'of', 'and', 'this', 'at', 'the', 'of']


#### Digit

In [17]:
digit = [token.text for token in tokenCollection if token.is_digit]
print('Digit in sentence: ',digit)

Digit in sentence:  ['87']


In [18]:
toremove = [token.text for token in tokenCollection if token.is_digit or token.is_punct or token.is_stop]
print('to be removed: ',toremove)

to be removed:  [',', ',', 'for', 'his', 'and', 'for', 'about', 'and', 'with', 'a', 'of', ',', 'and', 'this', 'at', 'the', 'of', '87']
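
The same filtering idea can be sketched without spaCy. The example below is only an approximation under stated assumptions: `STOP_WORDS` is a hand-picked subset containing just the stop words spaCy reported for this sentence (the real list has 326 entries), the regex tokenizer is a rough stand-in, and `keep` is a hypothetical helper name.

```python
import re

headline = (
    "elmore leonard, modern prose master, noted for his terse prose style "
    "and for writing about things perfectly and succinctly with a remarkable "
    "economy of words, unfortunately and sadly expired this gloomy tuesday "
    "at the age of 87 years old"
)

# Assumption: a tiny stop-word subset taken from the output above, not spaCy's full list.
STOP_WORDS = {"for", "his", "and", "about", "with", "a", "of", "this", "at", "the"}

def keep(token):
    """Return True when the token should survive cleaning."""
    if token in STOP_WORDS:
        return False
    if token.isdigit():
        return False
    # A token without any letter or digit is treated as punctuation.
    if not any(ch.isalnum() for ch in token):
        return False
    return True

tokens = re.findall(r"\w+|[^\w\s]", headline)
kept = [t for t in tokens if keep(t)]
removed = [t for t in tokens if not keep(t)]

print(len(tokens), len(kept), len(removed))
print(" ".join(kept))
```

On this sentence the sketch removes the same number of tokens as the spaCy-based cell above, leaving only the content words.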


#### Lemmatizing
Lemmatization is the process of reducing a word to its base (dictionary) form, called the lemma. It is an essential step in NLP because it maps the different variants of a word to a single form, for example 'noted' to 'note' and 'words' to 'word'.

In [19]:
lemma = [token.lemma_ for token in tokenCollection]
print(lemma)

['elmore', 'leonard', ',', 'modern', 'prose', 'master', ',', 'note', 'for', 'his', 'terse', 'prose', 'style', 'and', 'for', 'write', 'about', 'thing', 'perfectly', 'and', 'succinctly', 'with', 'a', 'remarkable', 'economy', 'of', 'word', ',', 'unfortunately', 'and', 'sadly', 'expire', 'this', 'gloomy', 'tuesday', 'at', 'the', 'age', 'of', '87', 'year', 'old']


#### Named Entities
A named entity is a "real-world object" that's assigned a name – for example, a person, a country, a product or a book title.

In [20]:
spacy.displacy.render(tokenCollection, style='ent', jupyter=True)

In [21]:
nlp = spacy.load('en_core_web_sm')
tokenCollection = nlp(text[0])

In [22]:
df_new = pd.DataFrame(
{
    'token': [token.text for token in tokenCollection],
    'lemma':[token.lemma_ for token in tokenCollection],
    'POS': [token.pos_ for token in tokenCollection],
    'TAG': [token.tag_ for token in tokenCollection],
    'DEP': [token.dep_ for token in tokenCollection],
    'is_stopword': [token.is_stop for token in tokenCollection],
    'is_punctuation': [token.is_punct for token in tokenCollection],
    'is_digit': [token.is_digit for token in tokenCollection],
})

df_new

| | token | lemma | POS | TAG | DEP | is_stopword | is_punctuation | is_digit |
|---|---|---|---|---|---|---|---|---|
| 0 | elmore | elmore | PROPN | NNP | amod | False | False | False |
| 1 | leonard | leonard | PROPN | NNP | nsubj | False | False | False |
| 2 | , | , | PUNCT | , | punct | False | True | False |
| 3 | modern | modern | ADJ | JJ | amod | False | False | False |
| 4 | prose | prose | NOUN | NN | compound | False | False | False |
| 5 | master | master | NOUN | NN | appos | False | False | False |
| 6 | , | , | PUNCT | , | punct | False | True | False |
| 7 | noted | note | VERB | VBD | ROOT | False | False | False |
| 8 | for | for | ADP | IN | prep | True | False | False |
| 9 | his | his | PRON | PRP$ | poss | True | False | False |


In [None]:
def highlight_True(s):
    """
    Highlight True values in yellow; leave False values unstyled
    """
    return ['background-color: yellow' if v else '' for v in s]
df_new.style.apply(highlight_True,subset=['is_stopword', 'is_punctuation', 'is_digit'])

#### Cleaning the text

In [23]:
def clean_text(df):
    nlp = spacy.load('en_core_web_sm')
    for i in range(df.shape[0]):
        tokenCollection = nlp(df.loc[i, 'headline'])
        # Keep the lowercased lemma of every token that is not a stop word, punctuation mark, or digit
        tokenList = [token.lemma_.lower().strip() for token in tokenCollection
                     if not (token.is_stop or token.is_punct or token.is_digit)]
        text = " ".join(tokenList)
        # .loc writes into the DataFrame itself, avoiding chained assignment and its SettingWithCopyWarning
        df.loc[i, 'headline'] = text

        # Print one headline out of every 1000 to show progress
        if i % 1000 == 1:
            print('Sentence:', i, text)
    return df

In [24]:
#This cell can take up to 30 minutes to execute
news_df = clean_text(df)
news_df.head()


Sentence: 1 roseanne revival catch thorny political mood well bad
Sentence: 1001 happen infect measle chart
Sentence: 2001 dodd frank
Sentence: 3001 baylor football coach ignore culture problem despite sex abuse
Sentence: 4001 billy eichner boogie obama ellen get detail
Sentence: 5001 museum staff brace large group wear t shirt
Sentence: 6001 bradley cooper rack stagger oscar nomination
Sentence: 7001 man bear party die party
Sentence: 8001 flaw evaluate leader kahneman thinking fast slow
Sentence: 9001 bill clinton say mixed race
Sentence: 10001 throw veep music awkward trump non signing fantastic
Sentence: 11001 career pave road tall grass
Sentence: 12001 u.s take key iraqi basis midnight raid
Sentence: 13001 teacher ask student split group simulate ideal class size
Sentence: 14001 new spiritually correct doll let child jesus touch
Sentence: 15001 senator russian troll stoke nfl debate
Sentence: 16001 ice cream truck driver go let kid sweat little bit stop
Sentence: 17001 trump aide 

| | article_link | headline | is_sarcastic | num_words |
|---|---|---|---|---|
| 0 | https://www.huffingtonpost.com/entry/versace-b... | versace store clerk sue secret black code mino... | 0 | 12 |
| 1 | https://www.huffingtonpost.com/entry/roseanne-... | roseanne revival catch thorny political mood w... | 0 | 14 |
| 2 | https://local.theonion.com/mom-starting-to-fea... | mom start fear son web series close thing gran... | 1 | 14 |
| 3 | https://politics.theonion.com/boehner-just-wan... | boehner want wife listen come alternative debt... | 1 | 13 |
| 4 | https://www.huffingtonpost.com/entry/jk-rowlin... | j.k rowling wish snape happy birthday magical way | 0 | 11 |


### **Implement TF-IDF Vectorizer**

In [25]:
from sklearn.feature_extraction.text import TfidfVectorizer

tf = TfidfVectorizer(analyzer='word',ngram_range=(1,3),max_features=5000)

X = tf.fit_transform(news_df['headline'])


In [26]:
X.shape, type(X)

((26709, 5000), scipy.sparse._csr.csr_matrix)

In [27]:
df['headline'][1]

'roseanne revival catch thorny political mood well bad'

In [30]:
print(df['headline'][2])
print(X[2])
X[1].toarray()

mom start fear son web series close thing grandchild
  (0, 4859)	0.4406458046783018
  (0, 4495)	0.25699192352698097
  (0, 809)	0.3247035564989574
  (0, 4003)	0.34086278658650854
  (0, 4858)	0.3963416148145773
  (0, 4170)	0.29956670827766896
  (0, 1572)	0.33179207974746117
  (0, 4252)	0.29579713731597435
  (0, 2859)	0.2715838143356614


array([[0., 0., 0., ..., 0., 0., 0.]])
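
Each non-zero entry printed above is a TF-IDF weight: the term's count in the headline multiplied by an inverse document frequency, with every row then scaled to unit length. The sketch below reimplements that calculation in plain Python on a toy corpus. It assumes scikit-learn's default smoothed formula, idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalization, restricts itself to unigrams, and uses three invented "cleaned" headlines; the function name `tfidf` is hypothetical.

```python
import math
from collections import Counter

# Toy corpus (invented for illustration, already cleaned).
docs = [
    "mom start fear son",
    "mom want wife listen",
    "fear son web series",
]

def tfidf(docs):
    tokenized = [d.split() for d in docs]
    n_docs = len(tokenized)
    vocab = sorted({t for doc in tokenized for t in doc})
    # Document frequency: in how many documents each term occurs.
    df = Counter(t for doc in tokenized for t in set(doc))
    # Smoothed inverse document frequency (assumed scikit-learn default).
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for doc in tokenized:
        tf = Counter(doc)
        weights = [tf[t] * idf[t] for t in vocab]
        # L2-normalize so that every row has unit length.
        norm = math.sqrt(sum(w * w for w in weights))
        rows.append([w / norm for w in weights])
    return vocab, rows

vocab, rows = tfidf(docs)
first = dict(zip(vocab, rows[0]))
print({term: round(w, 3) for term, w in first.items() if w > 0})
```

In the first toy headline, "start" occurs in only one document and therefore receives a larger weight than "mom", which occurs in two.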

### **Splitting the data into training and testing set using train_test_split function from sklearn**

In [31]:
y = news_df['is_sarcastic']

from sklearn.model_selection import train_test_split

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.3, random_state= 64)

### **Model Building**


In [32]:
#Creating a Model Object 

from sklearn.naive_bayes import BernoulliNB

nb = BernoulliNB()

#Fitting the model on the training data set
nb.fit(X_train,y_train)
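
A Bernoulli Naive Bayes model only looks at whether a feature is present or absent. With scikit-learn's default `binarize=0.0`, any TF-IDF weight above zero counts as "present", so the actual weight values do not influence this classifier. The sketch below is a simplified pure-Python version of the idea with Laplace smoothing (`alpha=1.0`); the function names and the tiny data set are invented for illustration, and this is not the library's implementation.

```python
import math

def fit_bernoulli_nb(X, y, alpha=1.0):
    """Estimate log priors and smoothed P(feature present | class)."""
    model = {}
    n_features = len(X[0])
    for c in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == c]
        log_prior = math.log(len(rows) / len(X))
        probs = [
            (sum(1 for r in rows if r[j] > 0) + alpha) / (len(rows) + 2 * alpha)
            for j in range(n_features)
        ]
        model[c] = (log_prior, probs)
    return model

def predict(model, x):
    """Return the class with the highest log posterior for one sample."""
    best, best_score = None, -math.inf
    for c, (log_prior, probs) in model.items():
        score = log_prior
        for value, p in zip(x, probs):
            # Present features contribute log p, absent ones log (1 - p).
            score += math.log(p) if value > 0 else math.log(1 - p)
        if score > best_score:
            best, best_score = c, score
    return best

# Invented TF-IDF-like rows: only zero vs non-zero matters to the model.
X_toy = [[0.5, 0.0, 0.3], [0.7, 0.0, 0.0], [0.0, 0.9, 0.0], [0.0, 0.4, 0.2]]
y_toy = [1, 1, 0, 0]

model = fit_bernoulli_nb(X_toy, y_toy)
print(predict(model, [0.2, 0.0, 0.0]), predict(model, [0.0, 0.1, 0.0]))
```

Because absent features also contribute to the score, this model differs from a multinomial Naive Bayes, which uses only the features that occur.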

### **Model Evaluation**

In [33]:
pred = nb.predict(X_valid)

from sklearn.metrics import confusion_matrix, classification_report

print('Confusion matrix\n',confusion_matrix(y_valid,pred))


Confusion matrix
 [[3628  900]
 [ 889 2596]]


In [None]:
len(df)*.3
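
The expression above evaluates to about 8012.7, whereas the confusion matrix sums to 8013 validation rows. A likely explanation, stated here as an assumption about how `train_test_split` handles a fractional `test_size`, is that the test share is rounded up and the remainder goes to training. The arithmetic can be checked as follows:

```python
import math

n_samples = 26709  # rows in the dataset
test_size = 0.3    # fraction passed to train_test_split

# Assumed rounding rule: the validation share is rounded up.
n_valid = math.ceil(test_size * n_samples)
n_train = n_samples - n_valid

print(n_train, n_valid)
```

The validation count agrees with the four cells of the confusion matrix added together (3628 + 900 + 889 + 2596).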

In [34]:
from sklearn.metrics import  accuracy_score, f1_score, precision_score, recall_score, roc_auc_score

roc=roc_auc_score(y_valid, pred)
acc = accuracy_score(y_valid, pred)
prec = precision_score(y_valid, pred)
rec = recall_score(y_valid, pred)
f1 = f1_score(y_valid, pred)

In [35]:
results = pd.DataFrame([['Bernoulli Classifier', acc,prec,rec, f1,roc],
                        ],
               columns = ['Model', 'Accuracy', 'Precision', 'Recall', 'F1 Score','ROC'])
results

| | Model | Accuracy | Precision | Recall | F1 Score | ROC |
|---|---|---|---|---|---|---|
| 0 | Bernoulli Classifier | 0.776738 | 0.742563 | 0.744907 | 0.743733 | 0.773072 |
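
All of these scores can be derived by hand from the confusion matrix printed earlier, which is a useful sanity check. The sketch below assumes scikit-learn's layout (rows are true labels, columns are predictions, class 1 is the positive class). Since `roc_auc_score` was given hard 0/1 predictions rather than probabilities, the ROC value reduces to the average of recall and specificity.

```python
# Confusion matrix values copied from the output above.
tn, fp = 3628, 900   # true class 0: predicted 0, predicted 1
fn, tp = 889, 2596   # true class 1: predicted 0, predicted 1

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)           # how many predicted-sarcastic are sarcastic
recall = tp / (tp + fn)              # how many sarcastic headlines were found
f1 = 2 * precision * recall / (precision + recall)
specificity = tn / (tn + fp)         # recall of the non-sarcastic class
roc_auc = (recall + specificity) / 2 # holds for hard predictions only

for name, value in [("accuracy", accuracy), ("precision", precision),
                    ("recall", recall), ("f1", f1), ("roc_auc", roc_auc)]:
    print(name, round(value, 6))
```

Passing `nb.predict_proba(X_valid)[:, 1]` to `roc_auc_score` instead of `pred` would score the ranking of predicted probabilities and would generally give a different ROC value.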


In [36]:
print('Classification_report\n',classification_report(y_valid,pred))

Classification_report
               precision    recall  f1-score   support

           0       0.80      0.80      0.80      4528
           1       0.74      0.74      0.74      3485

    accuracy                           0.78      8013
   macro avg       0.77      0.77      0.77      8013
weighted avg       0.78      0.78      0.78      8013



## Homework

**1. Use Random Forest (or SVM) Classifier instead of Naive Bayes and compare the metrics with Naive Bayes**

In [None]:
# Create a pipeline of TF-IDF and BernoulliNB
# As an alternative, use TF-IDF and SVM
# This makes it easy to predict on your own news headlines

In [None]:
# Find out whether the following headline is sarcastic or not, using the pipeline you created above or the classifiers we created earlier
mystr = '''
A few months ago, Hamas “arrested” a dolphin for being an Israeli spy. Readers of Reason magazine came up with titles for the film this action might inspire: • Orcapussy
'''


### **PyCaret**


Use **PyCaret** to find the best model and perform Automatic Hyperparameter tuning

**PyCaret** is an open source, low-code machine learning library in **Python** that allows you to go from preparing your data to deploying your model within minutes in your choice of notebook environment

[**Click Here!**](https://pycaret.org/) to learn more about **PyCaret**

**Installing PyCaret**

- !pip install pycaret

#### **Tasks to be performed**

- Import PyCaret and load the data set
- Initialize or setup the environment 
- Compare Multiple Models and their Accuracy Metrics
- Create the model
- Tune the model
- Evaluate the model


#### **Import PyCaret and load the data set**

In [None]:
!pip install pycaret

In [None]:
#Downloading the dataset from Dropbox 

!wget https://www.dropbox.com/s/ztu2gdjau2otq2a/Sarcasm_Headlines_Dataset.json

In [None]:
import pycaret.classification as pc
#dir(pc)

In [None]:
#Reading the dataset
import pandas as pd
df = pd.read_json('Sarcasm_Headlines_Dataset.json',lines=True)

#Keeping only the first 1000 rows for the PyCaret experiments
df_new = df.head(1000)

In [None]:
print(df_new.shape)
df_new.head()

#### **Initialize or setup the environment**

In [None]:
pc.setup(df_new, target='is_sarcastic')

#### **Compare Multiple Models and their Accuracy Metrics**

In [None]:
pc.compare_models()

**Note:** Don't worry about the individual models for now. You will learn most of them in the upcoming modules.

#### **Create the Model**



In [None]:
rf_model = pc.create_model('rf')  # 'rf' = Random Forest Classifier

#### **Tune the Model**

In [None]:
tuned_rf = pc.tune_model(rf_model)

In [None]:
print(rf_model)

In [None]:
print(tuned_rf)

See the difference between the original model (**rf_model**) and the tuned model (**tuned_rf**)

#### **Evaluate the Model**

In [None]:
tuned_rf_eval = pc.evaluate_model(tuned_rf)