# Task 2 - sarcastic JSON 

The enclosed dataset contains data about press headlines: their content and type. The aim of this task is to create a binary classification model for headline type (sarcastic / not sarcastic) based on headline content.

Goal: Based on attached data build a model that will classify headline types. Prepare a report where you describe your way of approaching the problem and the steps you took to solve it. Don’t forget to assess the quality of the model you have prepared.

Don’t focus on getting the highest score possible - the more important thing is to show your line of thought and your approach to the problem - especially NLP techniques you used.

If you have any doubts or something seems inconclusive based on the task description, write it down in the report in the way you would ask the customer to clarify.


## Before I start ...

The task is to classify whether the headline is sarcastic based solely on its content. 

The definition of **sarcasm** according to the Cambridge Dictionary is: 
> "the use of remarks that clearly mean the opposite of what they say, made in order to hurt someone's feelings or to criticize something in a humorous way"

Other definitions point out that sarcasm is mostly used in **speech**. Detecting it in writing can be a challenge even for a human. Headlines are often accompanied by photos or graphics that help the reader pick up the sarcastic tone. Since we only have text, that will have to do for now. We have to trust that the data is labelled correctly and contains as few misclassified examples as possible.

## Set up all the packages and paths

In [46]:
import pandas as pd
from pandas import json_normalize
import json
import os

In [49]:
file = os.path.join(os.getcwd(), "Datasets", "Graduate - HEADLINES dataset (2019-06).json")
print(file)

/home/mab/Roche_MAB/Datasets/Graduate - HEADLINES dataset (2019-06).json


## Combine all the JSON documents into a pandas dataframe

The original file looks like this:
```
{"headline": "former versace store clerk sues over secret 'black code' for minority shoppers", 
"is_sarcastic": 0}
{"headline": "the 'roseanne' revival catches up to our thorny political mood, for better and worse", 
"is_sarcastic": 0}
...
```
I am going to transform it into a more convenient format: a single pandas dataframe with one row per headline.

In [92]:
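records = []

with open(file) as f:
    for line in f:
        line = line.strip()
        if line:  # skip empty lines
            records.append(json.loads(line))  # one JSON document per line

# Build the dataframe once: DataFrame.append was removed in pandas 2.0
# and copies the whole frame on every call when used in a loop.
df = pd.DataFrame(records, columns=["headline", "is_sarcastic"])
df['is_sarcastic'] = df['is_sarcastic'].astype(int)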

print(df.head())

                                            headline  is_sarcastic
0  former versace store clerk sues over secret 'b...             0
1  the 'roseanne' revival catches up to our thorn...             0
2  mom starting to fear son's web series closest ...             1
3  boehner just wants wife to listen, not come up...             1
4  j.k. rowling wishes snape happy birthday in th...             0
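
As an aside, a file with one JSON document per line can usually be read in a single call with `pd.read_json(..., lines=True)`. The sketch below is only an illustration: it reads the two sample records shown earlier from an in-memory buffer instead of the real file, and the names `sample_lines` and `df_sample` are made up for this example.

```python
import io

import pandas as pd

# Two records in the same one-JSON-object-per-line layout as the dataset.
# (In-memory stand-in for the real file; the names here are illustrative.)
sample_lines = """{"headline": "former versace store clerk sues over secret 'black code' for minority shoppers", "is_sarcastic": 0}
{"headline": "the 'roseanne' revival catches up to our thorny political mood, for better and worse", "is_sarcastic": 0}
"""

# lines=True tells pandas to parse each line as a separate JSON document.
df_sample = pd.read_json(io.StringIO(sample_lines), lines=True)

print(df_sample.shape)
print(df_sample.dtypes)
```

For the real data, the same call should work with the file path in place of the buffer, assuming every line of the file is a complete JSON document.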


In [82]:
# df.to_csv('dataframe2.csv', index = False) #temporary

In [93]:
data = pd.read_csv('dataframe2.csv')
print(data.head())

                                            headline  is_sarcastic
0  former versace store clerk sues over secret 'b...             0
1  the 'roseanne' revival catches up to our thorn...             0
2  mom starting to fear son's web series closest ...             1
3  boehner just wants wife to listen, not come up...             1
4  j.k. rowling wishes snape happy birthday in th...             0


## Exploratory Data Analysis

In [95]:
df.describe()

       is_sarcastic
count       26709.0
mean       0.438953
std        0.496269
min             0.0
25%             0.0
50%             0.0
75%             1.0
max             1.0


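The dataset consists of 26 709 records. Because `is_sarcastic` is a 0/1 label, its mean of 0.44 is the share of sarcastic headlines: about 44% sarcastic and 56% not sarcastic. The classes are close to evenly represented, so we don't have to counter the effect of unbalanced data.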
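
The class balance can also be read off directly with `value_counts`. The sketch below uses a small toy label series (`labels`, a made-up stand-in for `df['is_sarcastic']`), so the numbers it prints are not the dataset's. It also computes the majority-class share, which is the accuracy of a trivial model that always predicts the more frequent class and is a useful floor when assessing the classifier later. With the mean of 0.439 reported above, that floor would be roughly 0.56.

```python
import pandas as pd

# Toy labels standing in for df['is_sarcastic'] (illustrative values only).
labels = pd.Series([0, 1, 0, 0, 1], name="is_sarcastic")

# Absolute counts and relative shares per class.
counts = labels.value_counts().sort_index()
shares = labels.value_counts(normalize=True).sort_index()

# Accuracy of always predicting the most frequent class.
majority_baseline = shares.max()

print(counts)
print(shares)
print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
```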

## Natural Language Processing

### Data cleaning

https://www.kaggle.com/kashnitsky/a4-demo-sarcasm-detection-with-logit-solution