# Hey there, welcome to your final challenge

You're finally going to have a chance to apply the skills you've been learning on a real-life dataset.

Your task is to predict if a given news headline is sarcastic or not.

<img src="images/you_dont_say.jpg" width="400">

--- 

A good way to think about how a machine might detect sarcasm is to try it yourself.
Which of these headlines is sarcastic?

* "Thinking about the way you look all the time burns 5,000 calories an hour"
* "Safeguarding the well-being of children"

How did you make your decision? Which features of the text gave it away?
This is the thought process you will need to make a good model.

--- 

Feel free to be creative and write your own code wherever you want!

The provided functions are only there to help you if you get stuck :)

### Imports

In [2]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from gensim.models.keyedvectors import Word2VecKeyedVectors
from nltk.corpus import stopwords
import string

ModuleNotFoundError: No module named 'gensim'

### Loading the data

In [3]:
# Load the data from CSV
fp = '../data/train_df.csv'
df = pd.read_csv(fp)

# Print the number of headlines that we're dealing with
print(f'Size of data: {len(df)}')

# Print the percentage of headlines that are sarcastic
sarcasm_percentage = int(100 * sum(df['is_sarcastic']) / len(df))
print(f'Sarcasm percentage: {sarcasm_percentage}%')

# Show a sample of the data
df.head()

Size of data: 15862
Sarcasm percentage: 43%


| Unnamed: 0 | article_link | headline | is_sarcastic |
|---|---|---|---|
| 0 | https://www.huffingtonpost.com/entry/versace-b... | former versace store clerk sues over secret 'b... | 0 |
| 1 | https://www.huffingtonpost.com/entry/roseanne-... | the 'roseanne' revival catches up to our thorn... | 0 |
| 2 | https://local.theonion.com/mom-starting-to-fea... | mom starting to fear son's web series closest ... | 1 |
| 3 | https://politics.theonion.com/boehner-just-wan... | boehner just wants wife to listen, not come up... | 1 |
| 4 | https://www.huffingtonpost.com/entry/jk-rowlin... | j.k. rowling wishes snape happy birthday in th... | 0 |


### Cleaning the data

In [33]:
stop_words = stopwords.words('english')

def basic_tokenizer(sentence):
    out = sentence
    for x in string.punctuation:
        out = out.replace(x, '')
    out = out.lower().split()
    return out

def remove_stopwords(sentence, stop_words):
    return [word for word in sentence if word not in stop_words]

def preprocess(sentence, stop_words):
    clean_sentence = remove_stopwords(basic_tokenizer(sentence), stop_words)
    return clean_sentence

df['clean_head'] = df['headline'].apply(lambda x: preprocess(x, stop_words))

df.head()

| Unnamed: 0 | article_link | headline | is_sarcastic | clean_head |
|---|---|---|---|---|
| 0 | https://www.huffingtonpost.com/entry/versace-b... | former versace store clerk sues over secret 'b... | 0 | [former, versace, store, clerk, sues, secret, ... |
| 1 | https://www.huffingtonpost.com/entry/roseanne-... | the 'roseanne' revival catches up to our thorn... | 0 | [roseanne, revival, catches, thorny, political... |
| 2 | https://local.theonion.com/mom-starting-to-fea... | mom starting to fear son's web series closest ... | 1 | [mom, starting, fear, sons, web, series, close... |
| 3 | https://politics.theonion.com/boehner-just-wan... | boehner just wants wife to listen, not come up... | 1 | [boehner, wants, wife, listen, come, alternati... |
| 4 | https://www.huffingtonpost.com/entry/jk-rowlin... | j.k. rowling wishes snape happy birthday in th... | 0 | [jk, rowling, wishes, snape, happy, birthday, ... |


### Features

The first step we have to take in a machine learning task is to create features.

Below, our features are the counts of each vocabulary word in each headline.

In [34]:
def sentence_to_bow(sentence, vocab):
    bow = [sentence.count(w) for w in vocab]
    return bow

vocab = ['trump', 'nation', 'area', 'onion']
df['bow'] = df['clean_head'].apply(lambda x: sentence_to_bow(x, vocab))

df.head()

| Unnamed: 0 | article_link | headline | is_sarcastic | clean_head | bow |
|---|---|---|---|---|---|
| 0 | https://www.huffingtonpost.com/entry/versace-b... | former versace store clerk sues over secret 'b... | 0 | [former, versace, store, clerk, sues, secret, ... | [0, 0, 0, 0] |
| 1 | https://www.huffingtonpost.com/entry/roseanne-... | the 'roseanne' revival catches up to our thorn... | 0 | [roseanne, revival, catches, thorny, political... | [0, 0, 0, 0] |
| 2 | https://local.theonion.com/mom-starting-to-fea... | mom starting to fear son's web series closest ... | 1 | [mom, starting, fear, sons, web, series, close... | [0, 0, 0, 0] |
| 3 | https://politics.theonion.com/boehner-just-wan... | boehner just wants wife to listen, not come up... | 1 | [boehner, wants, wife, listen, come, alternati... | [0, 0, 0, 0] |
| 4 | https://www.huffingtonpost.com/entry/jk-rowlin... | j.k. rowling wishes snape happy birthday in th... | 0 | [jk, rowling, wishes, snape, happy, birthday, ... | [0, 0, 0, 0] |
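The vocabulary above is hand-picked, which is why the first few rows are all zeros. As a minimal sketch, one possible way to get a larger vocabulary is to take the most frequent tokens across the cleaned headlines. The helper `build_vocab` and the toy headlines are hypothetical and only there for illustration.

```python
from collections import Counter

def build_vocab(token_lists, n_words=3):
    """Return the n_words most frequent tokens (hypothetical helper)."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    return [word for word, _ in counts.most_common(n_words)]

def sentence_to_bow(sentence, vocab):
    # Same counting scheme as the notebook's sentence_to_bow
    return [sentence.count(w) for w in vocab]

# Toy tokenized headlines standing in for df['clean_head']
toy_heads = [
    ['area', 'man', 'wins', 'award'],
    ['area', 'woman', 'wins', 'lottery'],
    ['nation', 'shocked', 'area', 'man'],
]

toy_vocab = build_vocab(toy_heads, n_words=3)
toy_bows = [sentence_to_bow(h, toy_vocab) for h in toy_heads]

print(toy_vocab)
print(toy_bows)
```

On the real data, the same idea would be applied to `df['clean_head']` with a much larger `n_words`; the right size is something to experiment with.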


### Train and test

Once we've created our features, we need to split our data into train and test datasets.



In [78]:
X = np.array(list(df['bow']), dtype=np.float32)
y = df['is_sarcastic'].to_numpy()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)

X_train.shape, X_test.shape, y_train.shape, y_test.shape

((23275, 4), (2587, 4), (23275,), (2587,))
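By default `train_test_split` shuffles the rows with a different random seed on every run, so the accuracies reported later can vary slightly between runs. A minimal sketch of a reproducible, stratified split on toy data follows; the seed value `0` and the toy arrays are arbitrary choices, not part of the original notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy features and perfectly balanced toy labels
X_toy = np.arange(40, dtype=np.float32).reshape(20, 2)
y_toy = np.array([0, 1] * 10)

# random_state fixes the shuffle; stratify keeps the class balance in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)

print(X_tr.shape, X_te.shape, y_te.sum())
```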

### Logistic Regression

Suppose we have a feature $x$ with a target output $y$.

For our use case, $x$ could be the count of a particular word, and $y$ is whether the headline is sarcastic or not.

---

In school you will (hopefully) have learned about *linear regression*,
where the goal is to learn coefficients $a$ and $b$
that minimize the error

$err = \sqrt {(y_* - y)^2}$

where $y_* = ax + b$
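As a small worked example of linear regression, the sketch below fits $a$ and $b$ to toy data with NumPy's `np.polyfit`, which minimizes the sum of squared errors over all points. The data values are made up for illustration.

```python
import numpy as np

# Toy data generated from y = 2x + 1 (no noise)
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0

# A degree-1 polynomial fit returns the slope a and the intercept b
a, b = np.polyfit(x, y, deg=1)

y_star = a * x + b
err = np.sqrt((y_star - y) ** 2)  # per-point error, as in the formula above

print(a, b, err.max())
```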

Linear regression is a good method when the outputs $y_*$ are unbounded,
but we need our outputs to be probabilities,
so they have to lie in the closed interval $[0, 1]$.

---

We can achieve this with *logistic regression*.

This is an adaptation of linear regression where we push $y_*$ through a *logistic function*

$\sigma(y_*) = \frac{1}{1 + e^{-y_*}}$

Exercise: Check that for any real-valued input $y_*$, the output $\sigma(y_*)$ lies in $[0, 1]$.
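A quick numerical sanity check of the logistic function (an illustration, not a proof): the sketch below evaluates $\sigma$ on a few arbitrary inputs and shows that the outputs stay between 0 and 1 and grow with the input.

```python
import numpy as np

def sigmoid(y_star):
    """Logistic function 1 / (1 + exp(-y_star))."""
    return 1.0 / (1.0 + np.exp(-y_star))

# Arbitrary test inputs, from strongly negative to strongly positive
inputs = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
outputs = sigmoid(inputs)

print(outputs)
```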

---

Above we only have one feature, but regression can be generalized to work for multiple features. 

We will be borrowing an implementation of logistic regression from scikit-learn.

It also has other classification models which you can explore [here](https://scikit-learn.org/stable/supervised_learning.html).

This model gives us a baseline of about 60% accuracy on the test set.

By changing the features you should be able to achieve over 90% accuracy.

We believe in you!

In [79]:
def fit_lr(X_train, y_train):
    # Train a logistic regression model
    model = LogisticRegression()
    model.fit(X_train, y_train)

    # Get the accuracy of the model on the train and the test data
    # (X_test and y_test are read from the notebook's global scope)
    train_score = model.score(X_train, y_train)
    test_score = model.score(X_test, y_test)

    print(f'Train accuracy: {int(100*train_score)}%')
    print(f'Test accuracy: {int(100*test_score)}%')
    
    return model

model = fit_lr(X_train, y_train)

Train accuracy: 59%
Test accuracy: 58%
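To connect the fitted model back to the formula for $\sigma$, the sketch below trains a logistic regression on toy data with two features and recomputes its probabilities by hand from `coef_` and `intercept_`. The toy counts and labels are made up; the point is only that, for a binary problem, the predicted probability is the logistic function applied to a weighted sum of the features.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy bag-of-words counts for two hypothetical words, with made-up labels
X_toy = np.array(
    [[1, 0], [2, 0], [1, 1], [0, 1], [0, 2], [0, 0]], dtype=np.float32
)
y_toy = np.array([1, 1, 1, 0, 0, 0])

toy_model = LogisticRegression().fit(X_toy, y_toy)

# Linear part: one coefficient per feature plus an intercept
y_star = X_toy @ toy_model.coef_[0] + toy_model.intercept_[0]
manual_proba = 1.0 / (1.0 + np.exp(-y_star))

# Probability of class 1 as reported by scikit-learn
sk_proba = toy_model.predict_proba(X_toy)[:, 1]

print(np.allclose(manual_proba, sk_proba))
```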


### Alternative features

Instead of using word counts, we can use the mean word embeddings of headlines as features.

This gives us about 70% accuracy on the test set.

In [58]:
w2vmodel = Word2VecKeyedVectors.load("./models/word2vec.model")
w2vmodel.init_sims(replace=True)

2020-12-10 14:35:33,932 [6611] INFO     gensim.utils: loading Word2VecKeyedVectors object from ./models/word2vec.model
2020-12-10 14:35:40,065 [6611] INFO     gensim.utils: loading vectors from ./models/word2vec.model.vectors.npy with mmap=None
2020-12-10 14:35:45,515 [6611] INFO     gensim.utils: setting ignored attribute vectors_norm to None
2020-12-10 14:35:45,516 [6611] INFO     gensim.utils: loaded ./models/word2vec.model
2020-12-10 14:35:49,077 [6611] INFO     gensim.models.keyedvectors: precomputing L2-norms of word weight vectors


In [72]:
def get_mean_vec(head):
    vecs = [w2vmodel[w] for w in head if w in w2vmodel]
    
    if vecs:
        mean_vec = np.array(np.mean(vecs, axis=0), dtype=np.float32)
    else:
        mean_vec = np.zeros(300, dtype=np.float32)
    
    return mean_vec

In [73]:
df['mean_vec'] = df['clean_head'].apply(lambda x: get_mean_vec(x))

In [74]:
X = np.array(list(df['mean_vec']), dtype=np.float32)
y = df['is_sarcastic'].to_numpy()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)

In [75]:
model = fit_lr(X_train, y_train)

Train accuracy: 74%
Test accuracy: 74%
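To make the averaging step concrete, here is a minimal sketch using a hypothetical dictionary of 3-dimensional vectors in place of the 300-dimensional word2vec model. The names `toy_vectors` and `toy_mean_vec` are illustrative only.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings; the real model uses 300 dimensions
toy_vectors = {
    'area': np.array([1.0, 0.0, 0.0], dtype=np.float32),
    'man': np.array([0.0, 1.0, 0.0], dtype=np.float32),
}

def toy_mean_vec(tokens, vectors, dim=3):
    # Skip out-of-vocabulary tokens, as get_mean_vec does
    vecs = [vectors[t] for t in tokens if t in vectors]
    if not vecs:
        # No known words: fall back to a zero vector
        return np.zeros(dim, dtype=np.float32)
    return np.mean(vecs, axis=0).astype(np.float32)

known = toy_mean_vec(['area', 'man', 'unknownword'], toy_vectors)
unknown = toy_mean_vec(['unknownword'], toy_vectors)

print(known)
print(unknown)
```

Note that averaging discards word order, so two headlines with the same words in a different order get identical features.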


### Debugging

To improve our model we'll need to examine where it's going wrong.

Let's have a look at some of the headlines where it's failing.

In [76]:
def debug_output(y_pred, y_test, df, n_samples=5):
    """
    Debug the output of your model
    
    :param y_pred: np.ndarray[bool]
                 : Predictions of the sarcasm of X_test
    :param y_test: np.ndarray[bool]
                 : The true sarcasm of X_test
    :param df: pd.core.frame.DataFrame
             : The sarcasm data
    :param n_samples: int (default = 5)
                    : The number of samples to debug
    """
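    # NOTE: this assumes the test rows are the last len(y_pred) rows of df.
    # train_test_split shuffles by default, so the headlines printed below
    # only line up with the predictions if the split was made with shuffle=False.
    n_train = len(df) - len(y_pred)
    test_headlines = df['headline'].to_numpy()[n_train:]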
    
    # Find the predictions which were wrong
    failed_idx = np.where(y_pred != y_test)[0]
    
    # Display a sample of the headlines where our model failed
    for idx in failed_idx[:n_samples]:
        print(f'Idx: {idx}')
        print(f'Prediction: {y_pred[idx]}')
        print(f'Actual: {y_test[idx]}')
        print(f'Headline: {test_headlines[idx]}')
        print('')
    
y_pred = model.predict(X_test)
debug_output(y_pred, y_test, df, n_samples=10)

Idx: 3
Prediction: 1
Actual: 0
Headline: how to actually get a bartender's attention

Idx: 12
Prediction: 0
Actual: 1
Headline: albert pujols and the end to down syndrome bullying

Idx: 13
Prediction: 1
Actual: 0
Headline: rex tillerson supposedly shifted exxon mobil's climate position. except he really didn't.

Idx: 17
Prediction: 0
Actual: 1
Headline: area family awakes to find michelle obama tending backyard garden

Idx: 22
Prediction: 0
Actual: 1
Headline: who urges end to routine antibiotic use in farm animals to stem rise of superbugs

Idx: 29
Prediction: 0
Actual: 1
Headline: house bipartisanship throws up pitifully weak toxic chemicals control bill

Idx: 30
Prediction: 1
Actual: 0
Headline: report: bots now make up 22% of twitter executives

Idx: 34
Prediction: 1
Actual: 0
Headline: frustration with husband taken out on soap scum

Idx: 35
Prediction: 0
Actual: 1
Headline: white house probes kushner business loans after ethics questions

Idx: 39
Prediction: 0
Actual: 1
Headline:
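Because `train_test_split` shuffles rows by default, slicing the last rows of `df` does not necessarily give the headlines that belong to `X_test`. One possible way to keep them aligned, sketched here on toy data, is to split an array of row positions together with the features and labels. The toy arrays and the seed are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for the notebook's X, y and df['headline']
X_toy = np.arange(20, dtype=np.float32).reshape(10, 2)
y_toy = np.array([0, 1] * 5)
headlines = np.array([f'headline {i}' for i in range(10)])

# Split the row positions together with the features and labels
idx = np.arange(len(X_toy))
X_tr, X_te, y_tr, y_te, idx_tr, idx_te = train_test_split(
    X_toy, y_toy, idx, test_size=0.2, random_state=0
)

# Row i of X_te now corresponds to headlines[idx_te[i]]
test_headlines = headlines[idx_te]

print(idx_te, test_headlines)
```

With this approach, `debug_output` could be given `test_headlines` directly instead of slicing `df`.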

### Further debugging

To improve our model it is important to examine how it's making predictions.
We can do this by looking at the importance that it places on each word.

If the coefficient is positive, the word is used mostly in sarcastic headlines.

If the coefficient is negative, the word is used mostly in non-sarcastic headlines.

---

Could this be useful for detecting which features are important?

Could this be useful for detecting overfitting?

--- 

Note: this only works when the model was trained on the bag-of-words features, where each coefficient corresponds to one word in `vocab`.

In [80]:
def print_important_words(coef, vocab, n_words=10):
    """
    Print the most important words according to the model coefficients
    
    :param coef: list[float]
               : The importance of each coefficient
    :param vocab: list[str]
                : The vocabulary corresponding to the coefficients
    :param n_words: int (default = 10)
                  : The number of top words to show
    """
    # Put the coefficients in order from biggest magnitude to smallest magnitude
    top_idx = np.argsort(-abs(coef))
    
    # Take the top `n_words` coefficients
    top_idx = top_idx[:n_words]
    
    for i in top_idx:
        print(f'{vocab[i]}: {coef[i]:.2f}')

coef = model.coef_[0]
print_important_words(coef, vocab)

area: 3.89
nation: 2.75
onion: 2.35
trump: -1.31
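One hedged way to sanity-check a coefficient is to look at how much data stands behind it: a large coefficient for a word that occurs in only a handful of headlines may be a sign of overfitting. The sketch below computes, for a toy bag-of-words matrix, how many headlines contain each word and what share of them is sarcastic. All values are made up for illustration.

```python
import numpy as np

# Toy bag-of-words matrix: rows are headlines, columns follow toy_vocab
toy_vocab = ['area', 'trump', 'rareword']
X_toy = np.array([
    [1, 0, 0],
    [1, 0, 0],
    [0, 1, 0],
    [0, 1, 0],
    [1, 0, 1],
])
y_toy = np.array([1, 1, 0, 0, 1])

# Number of headlines containing each word
contains = X_toy > 0
doc_freq = contains.sum(axis=0)

# Share of those headlines that are sarcastic
sarcasm_rate = (contains & (y_toy[:, None] == 1)).sum(axis=0) / doc_freq

for word, n, rate in zip(toy_vocab, doc_freq, sarcasm_rate):
    print(f'{word}: in {n} headlines, {rate:.0%} sarcastic')
```

Here `rareword` looks perfectly predictive, but only one headline supports that, so a model leaning on it heavily would deserve a closer look.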


### Headline examination

If we're interested in a particular word, we can print the headlines that contain it.

This will help us to select new features and gain a better understanding of the current features.

In [81]:
def print_headlines_with_word(df, word):
    """
    Find the headlines which contain the given word
    
    :param df: pd.core.frame.DataFrame
             : The sarcasm data
    :param word: str
               : The word of interest
    """
    
    sarcasm = 0
    not_sarcasm = 0
    
    for idx, row in df.iterrows():
        
        # Check if the word is in the headline
        if word in row['headline'].split(' '):
            
            print(f'Label: {row["is_sarcastic"]}')
            print(f'Headline: {row["headline"]}\n')
            
            if row['is_sarcastic']:
                sarcasm += 1
            else:
                not_sarcasm += 1
                
    print(f'Sarcasm: {sarcasm}')
    print(f'Not sarcasm: {not_sarcasm}')
            
print_headlines_with_word(df, 'onion')

Label: 1
Headline: son of edward r. murrow says father 'real dirtbag' compared to onion reporters

Label: 1
Headline: heroic police officer talks man down from edge of purchasing subway footlong sweet onion chicken teriyaki

Label: 1
Headline: the onion apologizes

Label: 1
Headline: obama, romney urge americans to purchase 'the onion book of known knowledge'

Label: 0
Headline: the onion is getting into the movie business

Label: 1
Headline: onion twitter password changed to onionman77

Label: 1
Headline: fabled burger king employee places single onion ring in everyone's fries

Label: 1
Headline: mother of slaying victim glad it was onion reporter who knocked on her door half an hour after funeral

Label: 1
Headline: whales beach selves in attempt to purchase 'the onion book of known knowledge'

Label: 1
Headline: 'arby's has been putting more onion bits on their buns,' reports man sinking into heavy depression

Label: 1
Headline: man regrets straying from sour cream and onion potato 