# Task 2 - sarcastic JSON 

The enclosed dataset contains press headlines: their content and their type. The aim of this task is to create a binary classification model that predicts the headline type (sarcastic / not sarcastic) from the headline content.

**Goal:** Based on the attached data, build a model that classifies headline types.

## Before I start ...

The task is to classify whether the headline is sarcastic based solely on its content. 

The definition of **sarcasm** according to the Cambridge Dictionary is: 
> "the use of remarks that clearly mean the opposite of what they say, made in order to hurt someone's feelings or to criticize something in a humorous way"

Other definitions point out that sarcasm is usually used in **speech**. Detecting sarcasm in writing can be a challenge even for a human. Headlines are often accompanied by photos or graphics that help the reader detect the sarcastic tone. Sarcasm is highly contextual, so a pre-trained sentiment or emotion classifier could be helpful here.

Some labels may have been assigned incorrectly. False negatives are also likely: sarcastic comments work best when they are tailored to a specific person and situation, so a machine learning model would need a high level of world knowledge to classify everything correctly. Some sarcastic headlines may therefore not be picked up by the model.

## Set up all the packages and paths

In [1]:
import pandas as pd
from pandas import json_normalize
import json
import os

In [2]:
# os.path.join builds the path portably instead of concatenating strings
file = os.path.join(os.getcwd(), "Datasets", "Graduate - HEADLINES dataset (2019-06).json")
print(file)

/home/mab/Code4Life_MAB/Datasets/Graduate - HEADLINES dataset (2019-06).json


## Combine all the JSON documents into a pandas dataframe

The original file looks like this:
```
{"headline": "former versace store clerk sues over secret 'black code' for minority shoppers", 
"is_sarcastic": 0}
{"headline": "the 'roseanne' revival catches up to our thorny political mood, for better and worse", 
"is_sarcastic": 0}
...
```
Each record is a separate JSON document, so I am going to load them all into a single pandas dataframe, which is a more convenient format.

In [13]:
df = pd.read_json(file, lines = True)
df.head()

                                            headline  is_sarcastic
0  former versace store clerk sues over secret 'b...             0
1  the 'roseanne' revival catches up to our thorn...             0
2  mom starting to fear son's web series closest ...             1
3  boehner just wants wife to listen, not come up...             1
4  j.k. rowling wishes snape happy birthday in th...             0
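
To make the role of `lines=True` concrete, here is a minimal sketch that reads two records from an in-memory string instead of the real file. The variable names `sample` and `sample_df` are hypothetical; the two headlines are the ones quoted in the file excerpt earlier in this section.

```python
import io
import pandas as pd

# Two records in JSON Lines form: one complete JSON object per line.
sample = io.StringIO(
    '{"headline": "former versace store clerk sues over secret \'black code\' for minority shoppers", "is_sarcastic": 0}\n'
    '{"headline": "the \'roseanne\' revival catches up to our thorny political mood, for better and worse", "is_sarcastic": 0}\n'
)

# lines=True tells pandas to parse every line as its own record (one row per line).
sample_df = pd.read_json(sample, lines=True)

print(sample_df.shape)
print(list(sample_df.columns))
```

Without `lines=True`, pandas would try to parse the whole input as one JSON document and fail on the second record.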


In [11]:
df.to_csv('dataframe2.csv', index = False) #temporary



In [5]:
df = pd.read_csv('dataframe2.csv')

## Exploratory Data Analysis

In [9]:
print(df.describe())
print(df.count())

       is_sarcastic
count  26709.000000
mean       0.438953
std        0.496269
min        0.000000
25%        0.000000
50%        0.000000
75%        1.000000
max        1.000000
headline        26709
is_sarcastic    26709
dtype: int64


The dataset consists of 26,709 records. The mean of `is_sarcastic` is 0.44, so about 44% of the headlines are labelled sarcastic. The classes are close enough to balanced that no resampling or class weighting is needed at this stage. Both columns have 26,709 non-null entries, so there are no missing values.
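
The same two checks (class shares and missing values) can also be made explicit rather than read off `describe()`. The sketch below uses a small toy frame with made-up values in place of `df`; the helper `class_balance` is a hypothetical name, not part of the notebook.

```python
import pandas as pd

def class_balance(labels: pd.Series) -> pd.Series:
    """Return the share of each class, indexed by class label."""
    return labels.value_counts(normalize=True).sort_index()

# Toy data standing in for df (hypothetical values, not taken from the dataset).
toy = pd.DataFrame({
    "headline": ["a", "b", "c", "d", "e"],
    "is_sarcastic": [0, 1, 0, 1, 0],
})

shares = class_balance(toy["is_sarcastic"])   # share of label 0 and label 1
missing = toy.isna().sum()                    # missing values per column

print(shares)
print(missing)
```

On the real data the calls would be `class_balance(df["is_sarcastic"])` and `df.isna().sum()`.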

## Natural Language Processing

1. Data cleaning
    - removing special characters if they don't add any value
2. Removing stopwords, tokenizing, POS tagging
3. 
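
The first two steps of this plan could be sketched as follows, using only the standard library. This is an illustration under assumptions, not the notebook's final pipeline: `STOP_WORDS` is a deliberately tiny, hypothetical list, the functions `clean` and `tokenize` are made-up names, and POS tagging is left out because it needs a trained tagger (for example from NLTK).

```python
import re

# Hypothetical, deliberately tiny stopword list; a real run would use a full
# list such as the ones shipped with NLTK or scikit-learn.
STOP_WORDS = {"a", "an", "the", "to", "of", "in", "for", "and", "our", "up", "over"}

def clean(text):
    """Lowercase the text and drop everything except letters, digits, apostrophes and spaces."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9' ]+", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def tokenize(text, drop_stopwords=True):
    """Split cleaned text on whitespace, strip surrounding quotes, optionally drop stopwords."""
    tokens = [t.strip("'") for t in clean(text).split()]
    tokens = [t for t in tokens if t]
    if drop_stopwords:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens

tokens = tokenize(
    "the 'roseanne' revival catches up to our thorny political mood, for better and worse"
)
print(tokens)
```

Whether punctuation and stopwords should really be removed is an open question for sarcasm detection, since quotation marks and function words may themselves carry signal, so it is worth comparing both variants.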

https://www.kaggle.com/kashnitsky/a4-demo-sarcasm-detection-with-logit-solution
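
The title of the linked kernel mentions a logit solution; its contents are not reproduced here. As one possible baseline in that spirit, the sketch below fits TF-IDF features plus logistic regression with scikit-learn. The headlines and labels are hypothetical toy data standing in for `df["headline"]` and `df["is_sarcastic"]`, and the hyperparameters are untuned assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data, NOT taken from the dataset.
texts = [
    "city council approves new budget",
    "area man heroically finishes entire pizza",
    "senate votes on healthcare bill",
    "nation demands more naps",
]
labels = [0, 1, 0, 1]

# Word unigrams and bigrams weighted by TF-IDF, fed into a logistic regression.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
model.fit(texts, labels)

preds = model.predict(texts)                                  # hard 0/1 labels
proba = model.predict_proba(["area man approves new budget"])  # class probabilities

print(preds)
print(proba)
```

On the full dataset the model should be evaluated on held-out data, for example with a stratified `train_test_split`, rather than on the headlines it was fitted on.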

In [None]:
# TODO: word clouds for sarcastic (s) and non-sarcastic (ns) headlines
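
A word cloud is a visualisation of per-class word frequencies, so the underlying counts can be inspected before any plotting. The sketch below is a text-only stand-in, not the planned word cloud itself; `rows` holds hypothetical (headline, label) pairs and `top_words` is a made-up helper name.

```python
from collections import Counter

# Hypothetical (headline, is_sarcastic) pairs standing in for rows of df.
rows = [
    ("area man wins award", 1),
    ("area woman wins lottery", 1),
    ("senate passes budget bill", 0),
    ("senate delays budget vote", 0),
]

def top_words(rows, label, n=3):
    """Return the n most frequent words among headlines with the given label."""
    counts = Counter(w for text, y in rows if y == label for w in text.split())
    return counts.most_common(n)

sarcastic_top = top_words(rows, 1, n=2)
regular_top = top_words(rows, 0, n=2)

print(sarcastic_top)
print(regular_top)
```

The resulting frequency tables could then be passed to a plotting library, for example the `wordcloud` package, to draw one cloud per class.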

In [None]:
import numpy as np
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# The features are computed on df["headline"]; no train/test split exists yet.
# Assumption: stop_words was never defined in this notebook, so scikit-learn's
# built-in English stopword list is used here.
stop_words = set(ENGLISH_STOP_WORDS)
words = df["headline"].astype(str).str.split()

## Number of words in the text
df["num_words"] = words.apply(len)

## Number of unique words in the text
df["num_unique_words"] = words.apply(lambda ws: len(set(ws)))

## Number of characters in the text
df["num_chars"] = df["headline"].astype(str).str.len()

## Number of stopwords in the text
df["num_stopwords"] = words.apply(lambda ws: sum(w.lower() in stop_words for w in ws))

## Average length of the words in the text (0.0 for an empty headline)
df["mean_word_len"] = words.apply(lambda ws: np.mean([len(w) for w in ws]) if ws else 0.0)

In [ ]:
import matplotlib.pyplot as plt
import seaborn as sns

## Truncate some extreme values for better visuals ##
# clip() replaces the chained .loc assignment, which pandas may apply to a copy
df["num_words"] = df["num_words"].clip(upper=50)
df["num_unique_words"] = df["num_unique_words"].clip(upper=50)
df["num_chars"] = df["num_chars"].clip(upper=300)
df["mean_word_len"] = df["mean_word_len"].clip(upper=10)

f, axes = plt.subplots(5, 1, figsize=(15, 40))

# One boxplot per feature, split by the is_sarcastic label
plots = [
    ("num_words", "Number of words in each class"),
    ("num_unique_words", "Number of unique words in each class"),
    ("num_chars", "Number of characters in each class"),
    ("num_stopwords", "Number of stopwords in each class"),
    ("mean_word_len", "Mean word length in each class"),
]
for ax, (column, title) in zip(axes, plots):
    sns.boxplot(x="is_sarcastic", y=column, data=df, ax=ax)
    ax.set_xlabel("is_sarcastic", fontsize=12)
    ax.set_title(title, fontsize=15)

plt.show()